GenoLearn Preprocess
Users first need to preprocess both their sequence and meta data to use later commands in GenoLearn. Upon executing
genolearn
and selecting the preprocess option number, the user will be prompted to select a preprocess command.
Genolearn ({VERSION}) Command Line Interface
GenoLearn is designed to enable researchers to perform Machine Learning on their genome
sequence data such as fsm-lite or unitig files.
See https://genolearn.readthedocs.io for documentation.
Working directory: {WOKRING_DIRECTORY}
Command: preprocess
Select a preprocess subcommand
0. back goes to the previous command
1. sequence preprocesses sequence data
2. combine preprocesses sequence data and combines to previous preprocessing
3. meta preprocesses meta data
with combine and meta commands available upon having executed sequence.
Sequence
The user is asked to enter the option number for which .gz file to preprocess within their data directory.
The user is prompted for parameters
batch_size [None] :
n_processes [None] :
verbose [250000] :
max_features [None]:
where
batch_sizedetermines how many concurrent identifiers to preprocess per run of the.gzfile. By default, GenoLearn arbitrarily sets this to the minimum of your OS limit (RLIMIT_NOFILE) and \(2^14\) or \(512\) in the case of running on iOS.n_processesdetermines how many processes to run when converting temptxtfiles tonumpyarrays. By default, GenoLearn sets this to the number of physical CPU cores.verbosedetermines the number of sequences to cycle through before printing a new line.max_featuresdetermines the number of first number of sequences to preprocess. Mainly used for debugging purposes. Users should always leave this as the defaultNone.
Upon a successful execution, within the working directory is a preprocess subdirectory with the following tree
preprocess
├── array [{identifier}.npz files]
├── features.txt.gz
├── info.json
├── meta.json
└── preprocess.log
Note
Whilst the preprocess sequence command is being executed, it will print two numbers. The first is the number of unique identifiers found so far, and the second is the number of sequences the program has cycled over for preprocessing.
Combine
This command is only available once the user has executed the preprocess sequence command. The user is asked to enter the option number for which .gz file to preprocess within their data directory. The user cannot select the same file used during the original preprocess sequence. If this is not the first time preprocess combine is being executed, the user also cannot select any file used in previous runs of preprocess combine.
The user is prompted for parameters
batch_size [None] :
n_processes [None] :
verbose [250000] :
where the other parameters mentioned in preprocess sequence are automatically set according to the previous run of preprocess sequence. Upon a successful execution, within the working directory the preprocess subdirectory now has the following tree
preprocess
├── array [more {identifier}.npz files]
├── combine.log
├── features.txt.gz
├── info.json
├── meta.json
└── preprocess.log
Meta
This command is only available once the user has executed the preprocess sequence command. This command defines the train and validation datasets for later modelling purposes.
The user is prompted for parameters
output [default]:
identifier :
target :
group [None]:
proportion train [0.75]:
if the user selects None as the group value or
output [default]: identifier : target : group [None]: train group values* : val group values* :
if the user enters a column present in their metadata csv where
outputis the output filename storing the collected information from the user.identifieris a column within the metadata csv containing the unique identifiers.targetis a column within the metadata csv containing the target metadata labels.groupis either a column within the metadata csv that helps the user split the data into train and validation datasets or left asNoneproportion trainis only available ifgroupisNoneand is a sensible proportion value to randomly assign as train with the rest as validation.train group values*is only available ifgroupis notNoneand contains a comma seperated string indicating which group values belong to train.val group values*is only available ifgroupis notNoneand contains a comma seperated string indicating which group values belong to validation. Note that these values cannot overlap withtrain group values*.
Upon a successful execution of this command, within the working directory is a meta subdirectory with an additional entry of output.