GenoLearn Feature Selection

Genome count datasets often contain a very large number of sequences, which causes memory problems when reading the data. If the dataset has \(n\) observations, each with \(m\) genome sequence counts, the total memory cost is of the order \(n\times m\). Feature selection aims to pre-select \(k\) of the \(m\) features, where \(k\) is much smaller than \(m\), before any further analysis or experimentation.
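To make the \(n\times m\) cost concrete, here is a rough back-of-the-envelope sketch. The sizes and the 8 bytes per value (dense 64-bit storage) are illustrative assumptions, not figures from GenoLearn, and `dense_bytes` is a hypothetical helper.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory needed to hold a dense n_rows x n_cols array (assumed 8 bytes per value)."""
    return n_rows * n_cols * bytes_per_value


# illustrative sizes only (assumptions, not taken from any real dataset)
n, m, k = 10_000, 5_000_000, 10_000

full = dense_bytes(n, m)      # all m features
reduced = dense_bytes(n, k)   # only the k selected features

print(f"all features:      {full / 1e9:.1f} GB")
print(f"selected features: {reduced / 1e9:.1f} GB")
```

Under these assumptions, keeping only \(k\) features shrinks the array by a factor of \(m / k\).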

By default, GenoLearn provides a Fisher Score implementation of feature selection, as described in Fisher Score for Feature Selection.
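For orientation, the quantity computed by the example implementation further down this page can be written as follows. This is read off from that code rather than quoted from the linked description, so treat the notation as an assumption: \(n_c\) is the number of observations in class \(c\), \(\mu_j\) is the global mean of feature \(j\), and \(\mu_{jc}\) and \(\sigma_{jc}^2\) are the mean and variance of feature \(j\) within class \(c\).

\[
F_j = \frac{\sum_{c} n_c \left(\mu_{jc} - \mu_j\right)^2}{\sum_{c} n_c\, \sigma_{jc}^2},
\qquad
\sigma_{jc}^2 = \frac{1}{n_c}\sum_{i \in c} x_{ij}^2 - \mu_{jc}^2 .
\]

A feature scores highly when its class means are spread far from the global mean relative to its within-class variance.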

Custom Feature Selection

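GenoLearn lets users define their own custom feature selection process. A custom module needs to define three functions: init, loop, and post. In the example below, init creates the statistics and returns them as args and kwargs, loop updates those statistics in place for each observation, and post converts them into one score per feature. As an example, below is the Fisher Score implementation.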

def init(dataloader):
    """
    Initialises statistics for the Fisher Score computation.
    """
    import numpy as np

    # encoding from class label to integer
    encode = {c: i for i, c in enumerate(set(dataloader.meta['targets']))}

    # class label count
    n  = np.zeros(dataloader.c)

    # global sum
    sg = np.zeros(dataloader.m)

    # class label sum
    s1 = np.zeros((dataloader.m, dataloader.c))

    # class label sum of squares
    s2 = np.zeros((dataloader.m, dataloader.c))

    args   = (encode, n, sg, s1, s2)  # encoder, counts, sum, by class sum, by class sum of squares
    kwargs = {}                       # no kwargs

    return args, kwargs

def loop(i, x, label, value, *args, **kwargs):
    """
    Incrementally updates count, global, and by class label statistics.
    """

    encode, n, sg, s1, s2 = args

    y        = encode[label]

    # increase count of label
    n[y]    += 1

    # increase global sum
    sg      += x

    # increase class label sum
    s1[:,y] += x

    # increase class label sum of squares
    s2[:,y] += x ** 2

def post(i, value, *args, **kwargs):
    """
    Computes the Fisher Score using statistics stored in ``*args``.
    """
    import numpy as np

    encode, n, sg, s1, s2 = args

    # convert global sum to global mean
    mu  = sg / n.sum()

    # convert to first and second moments
    # (out is pre-filled with zeros so entries skipped by ``where`` are defined)
    m1  = np.divide(s1, n, out = np.zeros_like(s1), where = n > 0)
    m2  = np.divide(s2, n, out = np.zeros_like(s2), where = n > 0)

    # compute D and S as per www.genolearn.readthedocs.io/usage/feature-selection.html
    D   = np.square(m1 - mu.reshape(-1, 1)) # broadcast second dimension ((m, c) - (m, 1))
    S   = (m2 - np.square(m1))

    # numerator and denominator expressions for Fisher Score
    num = D @ n
    den = S @ n

    # features with a non-positive denominator keep a score of zero
    score = np.divide(num, den, out = np.zeros_like(num), where = den > 0)

    return -score # return negative scores such that argsort returns largest to smallest
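
The same three-function pattern can be reused for other scores. Below is a minimal sketch of a variance-based selector, which ignores the class labels. It follows the init, loop, and post signatures shown above. The `run_selection` driver and the `SimpleNamespace` stand-in for the dataloader are hypothetical and exist only to make the example runnable; they are not GenoLearn's API, and the values passed for `i` and `value` are placeholders.

```python
from types import SimpleNamespace

import numpy as np


def init(dataloader):
    """Allocate running statistics for a per-feature variance score."""
    count = np.zeros(1)          # observations seen (an array so it can be updated in place)
    s1 = np.zeros(dataloader.m)  # running sum per feature
    s2 = np.zeros(dataloader.m)  # running sum of squares per feature
    return (count, s1, s2), {}


def loop(i, x, label, value, *args, **kwargs):
    """Update the running statistics with one observation (label is unused here)."""
    count, s1, s2 = args
    count += 1
    s1 += x
    s2 += np.square(x)


def post(i, value, *args, **kwargs):
    """Return negated variances so that argsort ranks high-variance features first."""
    count, s1, s2 = args
    mean = s1 / count
    variance = s2 / count - np.square(mean)
    return -variance


def run_selection(X, labels, k):
    """Hypothetical driver (NOT GenoLearn's API): streams rows through init/loop/post."""
    # stand-in object exposing only the attributes the example functions read
    dataloader = SimpleNamespace(m=X.shape[1], c=len(set(labels)), meta={"targets": labels})
    args, kwargs = init(dataloader)
    for i, (x, label) in enumerate(zip(X, labels)):
        loop(i, x, label, None, *args, **kwargs)  # value=None is a placeholder
    scores = post(0, None, *args, **kwargs)
    return np.argsort(scores)[:k]


# tiny synthetic dataset: the column variances are 0, 25, 0, and 0.25
X = np.array([
    [0.0,  0.0, 5.0, 1.0],
    [0.0, 10.0, 5.0, 2.0],
    [0.0,  0.0, 5.0, 1.0],
    [0.0, 10.0, 5.0, 2.0],
])
labels = ["a", "b", "a", "b"]

selected = run_selection(X, labels, k=2)
print(selected)
```

Because the statistics are accumulated one observation at a time, only arrays of length \(m\) are held in memory rather than the full \(n\times m\) dataset.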