GenoLearn Feature Selection
Genome count datasets often contain a very large number of distinct sequences, which makes it difficult to read the full dataset into memory. If the dataset has \(n\) observations, each with \(m\) genome sequence counts, storing it densely costs memory of the order \(n\times m\). Feature selection aims to pre-select \(k\) of the \(m\) features, with \(k\) much smaller than \(m\), before any further analysis or experimentation is carried out.
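To make the scale concrete, here is a rough back-of-the-envelope estimate. The values of n, m, and k are illustrative assumptions rather than figures from a real dataset, and the dense float64 layout is likewise assumed; the helper name dense_size_gib is hypothetical.

```python
import numpy as np

def dense_size_gib(rows: int, cols: int, dtype=np.float64) -> float:
    """Memory needed to hold a dense rows x cols array, in GiB."""
    return rows * cols * np.dtype(dtype).itemsize / 2**30

n = 2_000        # hypothetical number of observations
m = 10_000_000   # hypothetical number of distinct sequences
k = 10_000       # hypothetical number of features kept after selection

full = dense_size_gib(n, m)      # cost of the original n x m matrix
reduced = dense_size_gib(n, k)   # cost after keeping only k features

print(f"full: {full:.1f} GiB, reduced: {reduced:.3f} GiB")
```

Under these assumptions the reduced matrix is smaller by a factor of m / k, which is the saving that feature selection is meant to deliver.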
By default, GenoLearn provides a Fisher Score implementation for feature selection, as described in Fisher Score for Feature Selection.
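For reference, the quantity computed by the Fisher Score implementation shown later on this page can be written as follows. The notation is introduced here for illustration only: \(n_c\) is the number of observations in class \(c\), \(\mu_{jc}\) and \(\sigma_{jc}^2\) are the mean and (population) variance of feature \(j\) within class \(c\), and \(\mu_j\) is the mean of feature \(j\) over all observations.

\[
F_j = \frac{\sum_{c} n_c \,(\mu_{jc} - \mu_j)^2}{\sum_{c} n_c \,\sigma_{jc}^2},
\qquad
\sigma_{jc}^2 = \frac{1}{n_c}\sum_{i \in c} x_{ij}^2 - \mu_{jc}^2 .
\]

A feature scores highly when its class means are far apart relative to its spread within each class.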
Custom Feature Selection
GenoLearn lets users define their own feature selection process. A custom module needs to define init, loop, and post functions. As an example, below is the Fisher Score implementation.
def init(dataloader):
    """
    Initialises statistics for the Fisher Score computation.
    """
    import numpy as np

    # encoding from class label to integer
    encode = {c : i for i, c in enumerate(set(dataloader.meta['targets']))}

    # class label count
    n = np.zeros(dataloader.c)

    # global sum
    sg = np.zeros(dataloader.m)

    # class label sum
    s1 = np.zeros((dataloader.m, dataloader.c))

    # class label sum of squares
    s2 = np.zeros((dataloader.m, dataloader.c))

    args = (encode, n, sg, s1, s2) # encoder, counts, sum, by class sum, by class sum of squares
    kwargs = {} # no kwargs

    return args, kwargs

def loop(i, x, label, value, *args, **kwargs):
    """
    Incrementally updates count, global, and by class label statistics
    """

    encode, n, sg, s1, s2 = args

    y = encode[label]

    # increase count of label
    n[y] += 1

    # increase global sum
    sg += x

    # increase class label sum
    s1[:,y] += x

    # increase class label sum of squares
    s2[:,y] += x ** 2

def post(i, value, *args, **kwargs):
    """
    Computes the Fisher Score using statistics stored in ``*args``.
    """
    import numpy as np

    encode, n, sg, s1, s2 = args

    # convert global sum to global mean
    mu = sg / n.sum()

    # convert to first and second moments
    # (``out`` zero-fills entries of empty classes, which ``where`` alone leaves uninitialised)
    m1 = np.divide(s1, n, out = np.zeros_like(s1), where = n > 0)
    m2 = np.divide(s2, n, out = np.zeros_like(s2), where = n > 0)

    # compute D and S as per www.genolearn.readthedocs.io/usage/feature-selection.html
    D = np.square(m1 - mu.reshape(-1, 1)) # broadcast second dimension ((m, c) - (m, 1))
    S = (m2 - np.square(m1))

    # numerator and denominator expressions for Fisher Score
    num = D @ n
    den = S @ n

    # features with a zero denominator receive a score of 0 rather than an uninitialised value
    S = np.divide(num, den, out = np.zeros_like(num), where = den > 0)

    return -S # return negative scores such that argsort returns largest to smallest
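The same three-function pattern can be reused for other criteria. The following is a minimal sketch of a custom module that ranks features by their overall variance, ignoring class labels. It is an illustration, not part of GenoLearn: the ToyLoader class and the run driver are hypothetical stand-ins for whatever GenoLearn does internally, and because the meaning of the i and value arguments is not documented in this section, the driver passes placeholder values for them.

```python
import numpy as np

# --- hypothetical custom module: rank features by overall variance ---

def init(dataloader):
    """Allocate running statistics; ``dataloader.m`` is the feature count."""
    count = np.zeros(1)           # number of observations seen so far
    s1 = np.zeros(dataloader.m)   # running sum per feature
    s2 = np.zeros(dataloader.m)   # running sum of squares per feature
    return (count, s1, s2), {}

def loop(i, x, label, value, *args, **kwargs):
    """Update the running statistics in place with one observation ``x``."""
    count, s1, s2 = args
    count += 1
    s1 += x
    s2 += x ** 2

def post(i, value, *args, **kwargs):
    """Return negated variances so that argsort lists the largest first."""
    count, s1, s2 = args
    mean = s1 / count
    return -(s2 / count - mean ** 2)

# --- assumed driver and toy data, for illustration only ---

class ToyLoader:
    """Stand-in for a dataloader exposing only the attribute used above."""
    def __init__(self, X, labels):
        self.X, self.labels = X, labels
        self.m = X.shape[1]

def run(loader, k):
    """One streaming pass over the data, then keep the k best features."""
    args, kwargs = init(loader)
    for i, (x, label) in enumerate(zip(loader.X, loader.labels)):
        loop(i, x, label, None, *args, **kwargs)   # ``value`` is a placeholder
    scores = post(0, None, *args, **kwargs)        # ``i`` is a placeholder
    return np.argsort(scores)[:k]

X = np.array([[0., 1., 5.],
              [0., 3., 5.],
              [0., 1., 9.],
              [0., 3., 9.]])
selected = run(ToyLoader(X, ["a", "a", "b", "b"]), k=2)
print(selected)
```

Because loop only ever sees one observation at a time, the full \(n\times m\) matrix never has to be held in memory; only a few arrays of length \(m\) are kept, and the final argsort picks the \(k\) column indices to retain.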