Feature Engineering and Baseline

ML models on MoA

Logistic regression

Input Data: The following features were stacked and passed as input to the models.
- 875 features already provided in the dataset.
- 50 PCA components derived from the gene expression features.
- 10 PCA components derived from the cell viability features.
- 15 row-wise statistics of each sample.
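
As an illustration of how the PCA components in the list above could be derived and stacked onto the original features, here is a minimal sketch. The helper name `add_pca_features`, the toy column names, and the choice to fit PCA on train and test rows together are assumptions made for this example, not details taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def add_pca_features(train, test, cols, n_components, prefix, seed=42):
    """Append PCA components of `cols` to copies of train and test."""
    # Assumption: PCA is fitted on train and test rows together; the post
    # does not say which rows were used for fitting.
    stacked = pd.concat([train[cols], test[cols]], axis=0)
    pca = PCA(n_components=n_components, random_state=seed).fit(stacked)

    names = [f"{prefix}_pca_{i}" for i in range(n_components)]
    out = []
    for df in (train, test):
        comps = pd.DataFrame(pca.transform(df[cols]), columns=names, index=df.index)
        out.append(pd.concat([df, comps], axis=1))
    return out[0], out[1]


# Tiny synthetic stand-in for the real data (hypothetical column names).
rng = np.random.default_rng(0)
gene_cols = [f"g-{i}" for i in range(20)]
cell_cols = [f"c-{i}" for i in range(8)]
train_df = pd.DataFrame(rng.normal(size=(30, 28)), columns=gene_cols + cell_cols)
test_df = pd.DataFrame(rng.normal(size=(10, 28)), columns=gene_cols + cell_cols)

# The post uses 50 gene and 10 cell components; smaller numbers fit the toy data.
train_df, test_df = add_pca_features(train_df, test_df, gene_cols, 5, "g")
train_df, test_df = add_pca_features(train_df, test_df, cell_cols, 2, "c")
print(train_df.shape, test_df.shape)
```

The original columns are kept and the components are appended, which matches the "stacked" description above.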

Code for computing the row-wise statistics:

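def fe_stats(train, test):
    """Add 15 row-wise summary statistics to train and test in place."""
    # GENES and CELLS are assumed to be lists of the gene expression and
    # cell viability column names, defined elsewhere in the notebook.
    features_g = GENES
    features_c = CELLS

    for df in (train, test):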
        df['g_sum'] = df[features_g].sum(axis = 1) # sum of gene expressions
        df['g_mean'] = df[features_g].mean(axis = 1) # mean of gene expressions
        df['g_std'] = df[features_g].std(axis = 1) # std. deviation of gene expressions
        df['g_kurt'] = df[features_g].kurtosis(axis = 1) # kurtosis of gene expressions
        df['g_skew'] = df[features_g].skew(axis = 1) # skewness of gene expressions
        df['c_sum'] = df[features_c].sum(axis = 1) # sum of cell viability
        df['c_mean'] = df[features_c].mean(axis = 1) # mean of cell viability
        df['c_std'] = df[features_c].std(axis = 1) # std. deviation of cell viability
        df['c_kurt'] = df[features_c].kurtosis(axis = 1) # kurtosis of cell viability
        df['c_skew'] = df[features_c].skew(axis = 1) # skewness of cell viability
        df['gc_sum'] = df[features_g + features_c].sum(axis = 1) # sum of gene expressions and cell viability
        df['gc_mean'] = df[features_g + features_c].mean(axis = 1) # mean of gene expressions and cell viability
        df['gc_std'] = df[features_g + features_c].std(axis = 1) # std. deviation of gene expressions and cell viability
        df['gc_kurt'] = df[features_g + features_c].kurtosis(axis = 1) # kurtosis of gene expressions and cell viability
        df['gc_skew'] = df[features_g + features_c].skew(axis = 1) # skewness of gene expressions and cell viability
        
    return train, test

Training:
- A separate LR model was trained for each target, giving 206 models.
- 5-fold cross-validation (CV) was used, and the model trained on each split was saved.
- In total this gives 206 * 5 = 1030 models.
Prediction:
- Each target was predicted with its own 5 CV models.
- The final prediction for each target is the average of the predictions from its 5 CV models.
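
The training and prediction steps above can be sketched roughly as follows. This is a simplified illustration on synthetic data, not the notebook's code: the function name `fit_predict_per_target`, the use of plain `KFold`, and the `max_iter` setting are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def fit_predict_per_target(X, Y, X_test, n_splits=5, seed=0):
    """Train one LR per target per fold; average fold predictions on X_test."""
    n_targets = Y.shape[1]
    test_pred = np.zeros((X_test.shape[0], n_targets))
    models = []  # ends up holding n_targets * n_splits fitted models
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for t in range(n_targets):
        for train_idx, _ in kf.split(X):
            # Note: fitting fails if a fold contains only one class for a
            # target, which can happen with very rare targets.
            model = LogisticRegression(max_iter=1000)
            model.fit(X[train_idx], Y[train_idx, t])
            models.append(model)
            # Probability of the positive class, averaged over the folds.
            test_pred[:, t] += model.predict_proba(X_test)[:, 1] / n_splits
    return models, test_pred


# Synthetic stand-in: 3 targets instead of 206 (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
Y = (X[:, :3] + 0.5 * rng.normal(size=(100, 3)) > 0).astype(int)
X_test = rng.normal(size=(20, 6))

models, test_pred = fit_predict_per_target(X, Y, X_test)
print(len(models), test_pred.shape)
```

With 206 targets and 5 splits, the same loop would produce the 1030 models mentioned above.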
Conclusion: The submission and score of the Logistic Regression model are shown here:
Click here to view the submission notebook

The LR model performs much worse than our baseline model, and its log loss is not acceptable. Let's try an SVM model.
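
For reference, here is a small sketch of how such a log loss can be computed for multi-label predictions, assuming the score is the mean column-wise log loss over all targets. The function name and the clipping value `eps` are choices made for this example.

```python
import numpy as np


def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss per target column, averaged over all columns."""
    # Clip so that log(0) never occurs for overconfident predictions.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_entry = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_entry.mean(axis=0).mean()


y_true = np.array([[1, 0], [0, 0]])
uninformed = np.full(y_true.shape, 0.5)  # predicting 0.5 everywhere
score = mean_columnwise_log_loss(y_true, uninformed)
print(score)  # equals ln(2), roughly 0.693
```

Lower is better, so a model is only useful if it scores clearly below what a constant prediction achieves.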

SVM

Input Data: Similar to the input of the LR model trained above.
Training and Prediction: A training strategy similar to that of the LR model was followed.
Conclusion:

The SVM model also performs far worse than the accepted baseline. Let's try an XGBoost model.

XGBOOST

Input Data: The following features were stacked and passed as input to the models.
- 875 features already provided in the dataset.
- 50 PCA components derived from the gene expression features.
- 10 PCA components derived from the cell viability features.
Training:
- To train faster, we used the MultiOutputClassifier wrapper in sklearn.
- 12-fold CV was carried out for each target.
Prediction:
- Each model trained on a CV split was used for prediction.
- The final prediction for each target is the average of the predictions from the 12 CV models.
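
A rough sketch of this setup on synthetic data is shown below. The hyperparameters, the function name `cv_multioutput_predict`, and the use of plain `KFold` are assumptions for illustration. If `xgboost` is not installed, the sketch falls back to sklearn's `GradientBoostingClassifier`, which is a stand-in and not the model used in the post.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.multioutput import MultiOutputClassifier

try:
    from xgboost import XGBClassifier as Booster
except ImportError:
    # Fallback so the sketch still runs without xgboost installed.
    from sklearn.ensemble import GradientBoostingClassifier as Booster


def cv_multioutput_predict(X, Y, X_test, n_splits=12, seed=0):
    """Average positive-class probabilities of one wrapped model per fold."""
    test_pred = np.zeros((X_test.shape[0], Y.shape[1]))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, _ in kf.split(X):
        # Hypothetical hyperparameters, chosen only to keep the demo fast.
        base = Booster(n_estimators=20, max_depth=3, learning_rate=0.1)
        clf = MultiOutputClassifier(base).fit(X[train_idx], Y[train_idx])
        # predict_proba returns one (n_samples, 2) array per target.
        fold_pred = np.column_stack([p[:, 1] for p in clf.predict_proba(X_test)])
        test_pred += fold_pred / n_splits
    return test_pred


# Synthetic stand-in: 3 targets instead of 206 (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(240, 6))
Y = (X[:, :3] + 0.5 * rng.normal(size=(240, 3)) > 0).astype(int)
X_test = rng.normal(size=(30, 6))

xgb_pred = cv_multioutput_predict(X, Y, X_test)
print(xgb_pred.shape)
```

The wrapper still fits one booster per target internally; it mainly removes the explicit loop over targets. Taking column 1 of each probability array assumes every target has both classes present in every fold.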
Conclusion:
- The submission and score of the XGBoost model are shown here:
  Click here to view the submission notebook
XGBoost performs best of all the ML models. Let's try an NN approach.

Further Ideas from ML paradigm:

Chain Classifier: Based on the correlation among the target classes, we can train a chain classifier so that the predicted values of one target can be used when predicting other, correlated targets.
Linear discriminant analysis: One can try dimensionality reduction on the gene expression or cell viability features using LDA, and then train LR or SVM on the reduced dimensions.
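
To make the chain classifier idea more concrete, here is a minimal sketch using sklearn's `ClassifierChain` on toy data. The two synthetic targets, their correlation, and the chain order are all made up for illustration. This idea was not tried in the post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Toy targets: target 1 mostly copies target 0 (hypothetical data).
y0 = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
flip = rng.random(200) < 0.1
y1 = np.where(flip, 1 - y0, y0)
Y = np.column_stack([y0, y1])

# order=[0, 1]: the model for target 1 receives target 0 as an extra feature
# (the true label during fit, the chain's own prediction at predict time).
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1], random_state=0)
chain.fit(X[:150], Y[:150])
proba = chain.predict_proba(X[150:])  # one probability column per target
print(proba.shape)
```

With 206 targets, the order of the chain becomes a design choice of its own, for example placing frequent or strongly correlated targets first.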

Blog part
1. MoA problem definition link
2. EDA on LISH MoA dataset
3. Feature Engineering and Baseline model for MoA
5. DL techniques on MoA dataset