Tutorial
In this Tutorial, we are interested in using \(k\)-mer counts to predict where an E. Coli O157 strain geographically originated from. We imagine ourselves having collected data from years 2014-2018 and find ourselves in 2019. We want to preprocess the 2014-2018 dataset, perform feature selection, and train machine learning models on this data. We then collect data in 2019 and wish to see how accurate our machine learning models are for the new data. For demo purposes, the data has been truncated to contain only the first 100,000 \(k\)-mers which means reduces execution times at the cost of model performance.
We start by cloning an E. Coli O157 dataset which consists kmer count files (split into parts) and location region as the meta data. For the complete dataset clone my https://github.com/jordan-wei-taylor/e-coli-o157-data.git and execute ./merge.sh after executing cd genolearn-e-coli.
git clone https://github.com/jordan-wei-taylor/genolearn-tutorial.git
Upon cloning the repository, cd into the repository.
cd genolearn-tutorial
Once the above has been executed, your directory tree should look like the following:
genolearn-tutorial
├── data
│ ├── 14-18.txt.gz
│ ├── 19.txt.gz
│ └── metadata.csv
└── README.md
Lets install GenoLearn on a virtual python environment.
python3 -m venv env
source env/bin/activate
pip install git+https://github.com/jordan-wei-taylor/genolearn.git
Now that we have the most recent installation of GenoLearn, lets setup GenoLearn for our current directory.
genolearn-setup
Prompted Information
data directory : data
metadata csv : metadata.csv
With setup complete, we can just run the command genolearn
GenoLearn ({VERSION}) Command Line Interface
GenoLearn is designed to enable researchers to perform Machine Learning on their genome
sequence data such as fsm-lite or unitig files.
See https://genolearn.readthedocs.io for documentation.
Working directory: ~/genolearn-demo
1. exit exits GenoLearn
2. print prints various GenoLearn generated files
3. preprocess preprocess data into an easier format for file reading
As the user executes the commands available, new commands are made available. Lets start by preprocessing the 2014-2018 dataset by selecting preprocess followed by sequence upon executing genolearn.
Note
For the rest of the tutorial, if it is stated to run a command, we mean you should execute genolearn then select the command number from the numbered menu as shown above.
Prompted Information
sequence data : 14-18.txt.gz
batch-size [None]:
n-processes [None]:
sparse [True]:
dense [True]:
verbose [250000]:
max features [None]:
Following the execution of the above, which should take up to a few minutes, your directory tree should look like
genolearn-tutorial
├── data
│ ├── 14-18.txt.gz
│ ├── 19.txt.gz
│ └── metadata.csv
├── preprocess
│ ├── array [2460 .npz files]
│ ├── features.txt.gz
│ ├── info.json
│ └── preprocess.log
└── README.md
Imagine, at a later point in time, we had access to the 2019 dataset. We can combine the preprocessing if this data with our previous data with the preprocess combine command.
Prompted Information
sequence data : 19.txt.gz
batch-size [None]:
n-processes [None]:
The directory tree following the above execution should be
genolearn-tutorial
├── data
│ ├── 14-18.txt.gz
│ ├── 19.txt.gz
│ └── metadata.csv
├── preprocess
│ ├── combine.log
│ ├── array [2875 .npz files]
│ ├── features.txt.gz
│ ├── info.json
│ ├── meta.json
│ └── preprocess.log
└── README.md
Now we need a means of splitting our data into a train and validation datasets. The train set is used to fit our models whilst the validation set is used to check model performance which then helps us decide which of the potentially many models to choose from. For this tutorial, since we want to train on the 2014 - 2018 dataset and then evaluate against the 2019 dataset we execute preprocess meta with the following parameters
Prompted Information
output [default]: yearly
identifier : Accession
target : Region
group [None]: Year
train group values* : 2014, 2015, 2016, 2017, 2018
val group values* : 2019
If you do not have something similar to the Year column in your metadata, it is recommended to use the default values provided.
Prompted Information
output [default]:
identifier : Accession
target : Region
group [None]:
proportion train [0.75]:
This randomly assigns 75% of the data as the train dataset and the rest as your test dataset but the user is free to change the proportion that should be in the train dataset by entering a sensible proportion value. The directory tree should be
genolearn-tutorial
├── data
│ ├── 14-18.txt.gz
│ ├── 19.txt.gz
│ └── metadata.csv
├── meta
│ ├── default
│ └── yearly
├── preprocess
│ ├── combine.log
│ ├── array [2875 .npz files]
│ ├── features.txt.gz
│ ├── info.json
│ ├── meta.json
│ └── preprocess.log
└── README.md
At this point we can analyse the metadata by executing print meta analyse. If we choose our yearly metadata, it prints
| 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | Train | Val | Global
------------------+--------+--------+--------+--------+--------+--------+--------+--------+--------
Asia | 0 | 4 | 16 | 9 | 18 | 10 | 47 | 10 | 57
Australasia | 0 | 0 | 1 | 0 | 3 | 2 | 4 | 2 | 6
C. America | 0 | 2 | 6 | 6 | 10 | 5 | 24 | 5 | 29
C. Europe | 0 | 0 | 26 | 15 | 8 | 13 | 49 | 13 | 62
M. East | 3 | 11 | 33 | 23 | 42 | 45 | 112 | 45 | 157
N. Africa | 0 | 15 | 7 | 26 | 24 | 19 | 72 | 19 | 91
N. America | 1 | 1 | 3 | 3 | 3 | 1 | 11 | 1 | 12
N. Europe | 0 | 1 | 2 | 7 | 2 | 6 | 12 | 6 | 18
S. America | 0 | 0 | 1 | 2 | 0 | 0 | 3 | 0 | 3
S. Europe | 5 | 13 | 62 | 59 | 54 | 44 | 193 | 44 | 237
Subsaharan Africa | 3 | 2 | 7 | 7 | 6 | 3 | 25 | 3 | 28
UK | 133 | 135 | 765 | 406 | 468 | 267 | 1,907 | 267 | 2,174
Total | 145 | 184 | 929 | 563 | 638 | 415 | 2,459 | 415 | 2,874
suggested target subset: Asia, C. America, C. Europe, M. East, N. Africa, N. America, N. Europe, S. Europe, Subsaharan Africa, UK
From the above, we can see the target region distribution over each year as well as the train and validation datasets. Additionally, it prints the target regions that have a count of at least 10 (by default) to ensure our later models have enough examples to learn from (the number you choose is somewhat subjective).
If we instead, choose to print out the default metadata, it prints
| Train | Val | Global
------------------+--------+--------+--------
Asia | 43 | 14 | 57
Australasia | 5 | 1 | 6
C. America | 22 | 7 | 29
C. Europe | 38 | 24 | 62
M. East | 111 | 46 | 157
N. Africa | 73 | 18 | 91
N. America | 9 | 3 | 12
N. Europe | 14 | 4 | 18
S. America | 2 | 1 | 3
S. Europe | 187 | 50 | 237
Subsaharan Africa | 24 | 4 | 28
UK | 1,628 | 546 | 2,174
Total | 2,156 | 718 | 2,874
suggested target subset: Asia, C. America, C. Europe, M. East, N. Africa, N. Europe, S. Europe, Subsaharan Africa, UK
Note
Numbers suggested target subset may vary due to the randomness.
At this point, we can now use Fisher Scores to compute which genome sequences to take forward for modelling purposes (see Feature Selection for more details). Lets compute the Fisher Scores for each genome sequence with the feature-selection command using the yearly metadata.
Prompted Information
metadata : yearly
name [yearly-fisher]:
Upon execution of the above command, the directory tree should be now
genolearn-tutorial
├── data
│ ├── 14-18.txt.gz
│ ├── 19.txt.gz
│ └── metadata.csv
├── feature-selection
│ ├── yearly-fisher
│ └── yearly-fisher.log
├── meta
│ ├── default
│ └── yearly
├── preprocess
│ ├── combine.log
│ ├── array [2875 .npz files]
│ ├── features.txt.gz
│ ├── info.json
│ ├── meta.json
│ └── preprocess.log
└── README.md
The above command computes the fisher scores for all genome sequence counts that have been assigned as train.
Now that we have preprocessed all of our data and computed the Fisher Scores, we can look to see how our Machine Learning models perform on our dataset. Before training a machine learning model, we will need define a model config. Below is an example Random Forest config with varying max_depth and random_states after executing the model-config command
Prompted Information
config_name [random-forest] :
n_estimators [100] :
criterion {gini, entropy, log_loss} [gini] :
max_depth [None] : range(10, 51, 10)
min_samples_split [2] :
min_samples_leaf [1] :
min_weight_fraction_leaf [0.0] :
max_features {sqrt, log2, None} [sqrt] :
max_leaf_nodes [None] :
min_impurity_decrease [0.0] :
bootstrap [True] :
oob_score [False] :
n_jobs [-1] :
random_state [None] : 0, 1, 2
class_weight {balanced, balanced_subsample, None} [None]: balanced
Note
Note that you can enter a range of integers either manually like the random_state entry or use Python’s range function.
The above results in the following directory tree.
genolearn-tutorial
├── data
│ ├── 14-18.txt.gz
│ ├── 19.txt.gz
│ └── metadata.csv
├── feature-selection
│ ├── yearly-fisher
│ └── yearly-fisher.log
├── meta
│ ├── default
│ └── yearly
├── model
│ └── random-forest
├── preprocess
│ ├── combine.log
│ ├── array [2875 .npz files]
│ ├── features.txt.gz
│ ├── info.json
│ ├── meta.json
│ └── preprocess.log
└── README.md
Using the print command, the random-forest file contents are
random-forest
{
model: RandomForestClassifier,
n_estimators: 100,
criterion: gini,
max_depth: [
10,
20,
30,
40,
50
],
min_samples_split: 2,
min_samples_leaf: 1,
min_weight_fraction_leaf: 0.0,
max_features: sqrt,
max_leaf_nodes: null,
min_impurity_decrease: 0.0,
bootstrap: true,
oob_score: false,
n_jobs: -1,
random_state: [
0,
1,
2
],
class_weight: balanced
}
With these files now created, we can proceed to the train command. Lets assume we want to search for models that use the top 1,000, 10,000 and then the full 100,000 features as computed by the earlier Fisher Scores and further assume we want to only train / evaluate on regions where we have at least 10 examples to learn from.
Prompted Information
output_dir [yearly-fisher-random-forest] :
num_features* [1000] : 1000, 10000, 100000
binary [False] :
min_count [0] : 10
target_subset [None] :
metric [f1_score] :
aggregate_func {mean, weighted_mean} [weighted_mean]:
The above results in the following directory tree
genolearn-tutorial
├── data
│ ├── 14-18.txt.gz
│ ├── 19.txt.gz
│ └── metadata.csv
├── feature-selection
│ ├── yearly-fisher
│ └── yearly-fisher.log
├── meta
│ ├── default
│ └── yearly
├── model
│ └── random-forest
├── preprocess
│ ├── combine.log
│ ├── array [2875 .npz files]
│ ├── features.txt.gz
│ ├── info.json
│ ├── meta.json
│ └── preprocess.log
├── README.md
└── train
└── yearly-fisher
├── full-model.pickle
├── encoding.json
├── model.pickle
├── params.json
├── predictions.csv
├── results.npz
└── train.log
For a deeper insight on the results, you will need to run in Python
import numpy as np
npz = np.load('path/to/results.npz')
identifiers = npz['identifiers'] # shape (n,)
predictions = npz['predict'] # shape (3, 5, 3, n) (3 num_features, 5 max_depth, and 3 random_state parameters for n predictions)
times = npz['times'] # shape (3, 5, 3, 2) (last dimension measures training and predicting times in seconds)
where \(n\) is the number of validation strains present in 2019 i.e. only the strains in 2019 where the associated target region was present in 2018. This ignores strains in 2019 where the associated target region does not appear in 2018 as there are no training examples to learn from.
Feature importances can be obtained by the feature-importance command which results in the following directory tree
genolearn-tutorial
├── data
│ ├── 14-18.txt.gz
│ ├── 19.txt.gz
│ └── metadata.csv
├── feature-selection
│ ├── yearly-fisher
│ └── yearly-fisher.log
├── meta
│ ├── default
│ └── yearly
├── model
│ └── random-forest
├── preprocess
│ ├── combine.log
│ ├── array [2875 .npz files]
│ ├── features.txt.gz
│ ├── info.json
│ ├── meta.json
│ └── preprocess.log
├── README.md
└── train
└── yearly-fisher
├── full-model.pickle
├── encoding.json
├── importance
│ ├── full-importance.npz
│ ├── full-importance-rank.csv
│ ├── importance.npz
│ └── importance-rank.csv
├── model.pickle
├── params.json
├── predictions.csv
├── results.npz
└── train.log
The importance-rank.csv contains a single column of genome sequences which have been ranked from most important to least by the trained model’s logic based on the 2014 - 2018 dataset. See Feature Importance for further details on interpretation of the importance.npz file.
Finally, if new data is obtained but with no corresponding target labels (i.e. only the genome sequence counts are obtained), then after first executing preprocess combine on the new data, we can evaluate on the new unlabelled data points.
Prompted Information
output filename : unlabelled
group values* {2014, 2015, 2016, 2017, 2018, 2019, unlabelled}: unlabelled
which results in the following directory tree
genolearn-tutorial
├── data
│ ├── 14-18.txt.gz
│ ├── 19.txt.gz
│ └── metadata.csv
├── feature-selection
│ ├── yearly-fisher
│ └── yearly-fisher.log
├── meta
│ ├── default
│ └── yearly
├── model
│ └── random-forest
├── preprocess
│ ├── combine.log
│ ├── array [2875 .npz files]
│ ├── features.txt.gz
│ ├── info.json
│ ├── meta.json
│ └── preprocess.log
├── README.md
└── train
└── yearly-fisher
├── full-model.pickle
├── encoding.json
├── evaluate
│ ├── full-unlabelled.npz
│ └── unlabelled.csv
├── importance
│ ├── full-importance.npz
│ ├── full-importance-rank.csv
│ ├── importance.npz
│ └── importance-rank.csv
├── model.pickle
├── params.json
├── predictions.csv
├── results.npz
└── train.log
Since for our dataset, we had 1 example not labelled (i.e. it does not appread in the metadata but appears in one of preprocessed .gz files), unlabelled.csv looks like
identifier hat P(Asia) P(C. America) P(C. Europe) P(M. East) P(N. Africa) P(N. America) P(N. Europe) P(S. Europe) P(Subsaharan Africa) P(UK)
SRR10001272 UK 0.092434 0.052168 0.125289 0.067996 0.106553 0.134874 0.098422 0.136982 0.026605 0.158678