GenoLearn Feature Selection

For genome count datasets, the number of sequences is often large, which causes memory problems when reading the data. If the dataset has \(n\) observations, each with \(m\) genome sequence counts, the total memory cost is of the order \(n\times m\). Feature selection aims to pre-select \(k\) of the \(m\) features, where \(k\) is much smaller than \(m\), before any further analysis or experimentation.
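To make the \(n\times m\) cost concrete, here is a rough back-of-the-envelope sketch. The sizes and the 8 bytes per value (a dense float64 array) are illustrative assumptions, not figures from GenoLearn, and `dense_bytes` is a hypothetical helper.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Bytes needed to hold an n_rows x n_cols dense array."""
    return n_rows * n_cols * bytes_per_value

# hypothetical sizes: n observations, m sequence counts, k selected features
n, m, k = 1_000, 10_000_000, 10_000

full    = dense_bytes(n, m)   # every feature kept
reduced = dense_bytes(n, k)   # only the k selected features kept

print(f"all m features: {full / 1e9:.1f} GB")     # 80.0 GB
print(f"top k features: {reduced / 1e9:.2f} GB")  # 0.08 GB
```

Under these assumptions, selecting \(k\) features shrinks the array by a factor of \(m / k\).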

By default, GenoLearn provides a Fisher Score implementation of feature selection, as described in Fisher Score for Feature Selection.
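For orientation, one common form of the Fisher Score for feature \(j\), which appears to be what the implementation further down this page accumulates, is \(F_j = \sum_c n_c (\mu_{jc} - \mu_j)^2 / \sum_c n_c \sigma_{jc}^2\). Here \(n_c\) is the number of observations in class \(c\), \(\mu_{jc}\) and \(\sigma_{jc}^2\) are the mean and variance of feature \(j\) within class \(c\), and \(\mu_j\) is the global mean. The sketch below computes this directly on a small in-memory array. It is not GenoLearn code: `fisher_scores` and the toy data are assumptions made for illustration.

```python
import numpy as np

def fisher_scores(X, y):
    """Direct (non-streaming) Fisher Score for each column of X."""
    mu  = X.mean(axis=0)                  # global mean per feature
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc  = X[y == c]                   # observations belonging to class c
        n_c = len(Xc)
        num += n_c * (Xc.mean(axis=0) - mu) ** 2
        den += n_c * Xc.var(axis=0)       # population variance (ddof=0)
    # features whose denominator is zero are given a score of zero
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# toy data: 4 observations, 3 features, 2 classes
X = np.array([[ 0., 4., 1.],
              [ 1., 6., 1.],
              [10., 5., 1.],
              [11., 7., 1.]])
y = np.array(['a', 'a', 'b', 'b'])

scores = fisher_scores(X, y)
top_k  = np.argsort(-scores)[:2]          # indices of the 2 highest-scoring features
```

The first feature separates the two classes well and scores highest, while the constant third feature scores zero.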

Custom Feature Selection

GenoLearn lets users define their own feature selection process. A custom module must define three functions: init, loop, and post. As an example, below is the Fisher Score implementation: init sets up the running statistics, loop updates them one observation at a time, and post converts them into one score per feature.

def init(dataloader):
    """
    Initialises statistics for the Fisher Score computation.
    """
    import numpy as np

    # encoding from class label to integer
    encode = {c : i for i, c in enumerate(set(dataloader.meta['targets']))}

    # class label count
    n  = np.zeros(dataloader.c)

    # global sum
    sg = np.zeros(dataloader.m)

    # class label sum
    s1 = np.zeros((dataloader.m, dataloader.c))

    # class label sum of squares
    s2 = np.zeros((dataloader.m, dataloader.c))

    args   = (encode, n, sg, s1, s2)  # encoder, counts, sum, by class sum, by class sum of squares
    kwargs = {}                       # no kwargs

    return args, kwargs

def loop(i, x, label, value, *args, **kwargs):
    """
    Incrementally updates count, global, and by class label statistics.
    """

    encode, n, sg, s1, s2 = args

    y        = encode[label]

    # increase count of label
    n[y]    += 1

    # increase global sum
    sg      += x

    # increase class label sum
    s1[:,y] += x

    # increase class label sum of squares
    s2[:,y] += x ** 2

def post(i, value, *args, **kwargs):
    """
    Computes the Fisher Score using statistics stored in ``*args``.
    """
    import numpy as np
    
    encode, n, sg, s1, s2 = args

    # convert global sum to global mean
    mu  = sg / n.sum()

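    # convert to first and second moments
    # (``out`` is required with ``where``: entries for classes with zero count would otherwise be uninitialised)
    m1  = np.divide(s1, n, out = np.zeros_like(s1), where = n > 0)
    m2  = np.divide(s2, n, out = np.zeros_like(s2), where = n > 0)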

    # compute D and S as per www.genolearn.readthedocs.io/usage/feature-selection.html
    D   = np.square(m1 - mu.reshape(-1, 1)) # broadcast second dimension ((m, c) - (m, 1))
    S   = (m2 - np.square(m1))

    # numerator and denominator expressions for Fisher Score
    num = D @ n
    den = S @ n

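    # features with a zero denominator are assigned a score of zero rather than an uninitialised value
    score = np.divide(num, den, out = np.zeros_like(num), where = den > 0)

    return -score # return negative scores such that argsort returns largest to smallest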
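The same three hooks can carry a different statistic. Below is a minimal sketch of a custom module that ranks features by variance instead of Fisher Score, together with a small driver so that it can be run in isolation. `StubLoader` and `run` are hypothetical stand-ins, not part of GenoLearn, and because the meaning of the `i` and `value` arguments is not described here, the driver simply passes `None` for both.

```python
import numpy as np

class StubLoader:
    """Hypothetical stand-in exposing only what this sketch reads."""
    def __init__(self, X, targets):
        self.X    = X
        self.meta = {'targets': targets}
        self.m    = X.shape[1]            # number of features

def init(dataloader):
    n  = np.zeros(1)                      # observation count (an array, so loop can update it in place)
    s1 = np.zeros(dataloader.m)           # running sum per feature
    s2 = np.zeros(dataloader.m)           # running sum of squares per feature
    return (n, s1, s2), {}

def loop(i, x, label, value, *args, **kwargs):
    n, s1, s2 = args
    n  += 1
    s1 += x
    s2 += x ** 2

def post(i, value, *args, **kwargs):
    n, s1, s2 = args
    mean = s1 / n
    var  = s2 / n - mean ** 2             # population variance per feature
    return -var                           # negated so that argsort puts the largest variance first

def run(dataloader):
    """Hypothetical driver (assumption): calls init once, loop per observation, then post."""
    args, kwargs = init(dataloader)
    for x, label in zip(dataloader.X, dataloader.meta['targets']):
        loop(None, x, label, None, *args, **kwargs)
    return post(None, None, *args, **kwargs)

X = np.array([[ 0., 4., 1.],
              [ 2., 4., 1.],
              [10., 4., 3.],
              [12., 4., 3.]])
order = np.argsort(run(StubLoader(X, ['a', 'a', 'b', 'b'])))
```

Because only sums and sums of squares are kept, memory use stays proportional to \(m\) however many observations are streamed through loop.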