ML models on MoA
Logistic regression

- Input Data:
The following features were stacked and passed as input to the models:
- 875 features already provided in the dataset.
- 50 PCA components derived from the gene expression features.
- 10 PCA components derived from the cell viability features.
- 15 row-wise statistics computed for each sample.
Code for computing the row-wise statistics:
# GENES and CELLS are assumed to be lists of the gene expression and
# cell viability column names, defined earlier in the notebook.
def fe_stats(train, test):
    features_g = GENES
    features_c = CELLS
    # Add the 15 row-wise statistics to both data frames in place.
    for df in train, test:
        df['g_sum'] = df[features_g].sum(axis=1)        # sum of gene expressions
        df['g_mean'] = df[features_g].mean(axis=1)      # mean of gene expressions
        df['g_std'] = df[features_g].std(axis=1)        # std. deviation of gene expressions
        df['g_kurt'] = df[features_g].kurtosis(axis=1)  # kurtosis of gene expressions
        df['g_skew'] = df[features_g].skew(axis=1)      # skewness of gene expressions
        df['c_sum'] = df[features_c].sum(axis=1)        # sum of cell viability
        df['c_mean'] = df[features_c].mean(axis=1)      # mean of cell viability
        df['c_std'] = df[features_c].std(axis=1)        # std. deviation of cell viability
        df['c_kurt'] = df[features_c].kurtosis(axis=1)  # kurtosis of cell viability
        df['c_skew'] = df[features_c].skew(axis=1)      # skewness of cell viability
        df['gc_sum'] = df[features_g + features_c].sum(axis=1)        # sum of gene expressions and cell viability
        df['gc_mean'] = df[features_g + features_c].mean(axis=1)      # mean of gene expressions and cell viability
        df['gc_std'] = df[features_g + features_c].std(axis=1)        # std. deviation of gene expressions and cell viability
        df['gc_kurt'] = df[features_g + features_c].kurtosis(axis=1)  # kurtosis of gene expressions and cell viability
        df['gc_skew'] = df[features_g + features_c].skew(axis=1)      # skewness of gene expressions and cell viability
    return train, test
- Training:
- A separate LR model was trained for each target, giving 206 models in total.
- 5-fold cross-validation was used, and the model trained on each fold was saved.
- Overall this gives 206 × 5 = 1030 models.
- Prediction:
- Each target was predicted with its own 5 CV models.
- The final prediction for each target is the average of the predictions from its 5 CV models.
- Conclusion: The submission and score of the Logistic Regression model are shown here:
Click here to view the submission notebook
The LR model performs much worse than our baseline model, and its log loss is not acceptable. Let's try an SVM model.
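The input stacking and the per-target training scheme described above can be sketched as follows. This is a minimal illustration, not the notebook's code: the data is synthetic, the shapes are tiny, the row-wise statistics are left out, and fitting PCA on the training rows only is an assumption. The helper names `pca_block` and `fit_predict_per_target` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): small shapes instead of the real
# gene/cell columns and the 206 targets.
n_train, n_test, n_genes, n_cells, n_targets = 200, 50, 30, 10, 3
X_train = rng.normal(size=(n_train, n_genes + n_cells))
X_test = rng.normal(size=(n_test, n_genes + n_cells))
W = rng.normal(size=(n_genes + n_cells, n_targets))
Y_train = (X_train @ W + rng.normal(size=(n_train, n_targets)) > 0).astype(int)


def pca_block(train_block, test_block, n_components):
    """Fit PCA on the training block only (assumption) and project both sets."""
    pca = PCA(n_components=n_components, random_state=0)
    return pca.fit_transform(train_block), pca.transform(test_block)


# Stack the raw features with PCA components of the gene and cell blocks.
g_tr, g_te = pca_block(X_train[:, :n_genes], X_test[:, :n_genes], 5)
c_tr, c_te = pca_block(X_train[:, n_genes:], X_test[:, n_genes:], 3)
X_tr = np.hstack([X_train, g_tr, c_tr])
X_te = np.hstack([X_test, g_te, c_te])


def fit_predict_per_target(X, Y, X_new, n_splits=5):
    """Train one LR model per (target, fold) and average the fold predictions."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    preds = np.zeros((X_new.shape[0], Y.shape[1]))
    models = []
    for t in range(Y.shape[1]):
        for train_idx, _ in kf.split(X):
            # Note: a fold containing a single class would need special
            # handling; the synthetic targets here always contain both.
            model = LogisticRegression(max_iter=1000)
            model.fit(X[train_idx], Y[train_idx, t])
            models.append(model)
            preds[:, t] += model.predict_proba(X_new)[:, 1] / n_splits
    return preds, models


preds, models = fit_predict_per_target(X_tr, Y_train, X_te)
print(preds.shape, len(models))  # (50, 3) 15
```

With 3 targets and 5 folds the sketch keeps 15 models, which mirrors the 206 × 5 = 1030 count at full scale.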
SVM

- Input Data: The same as for the LR model trained above.
- Training and Prediction: The same training strategy as for the LR model was followed.
- Conclusion:

The SVM model also performs far worse than the accepted baseline. Let's try an XGBoost model.
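The comparisons against the baseline in this post are made on log loss. As a rough sketch, assuming the score is the mean column-wise log loss used by the MoA competition, it could be computed as below. The function name `mean_columnwise_log_loss` and the clipping value `eps` are assumptions for illustration.

```python
import numpy as np


def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss averaged over samples, then over target columns."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so the logarithms stay finite.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    per_entry = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_entry.mean(axis=0).mean()


y_true = np.array([[1, 0], [0, 0], [1, 1]])
uninformed = np.full(y_true.shape, 0.5)  # always predict 0.5
perfect = y_true.astype(float)           # predict the labels exactly

print(mean_columnwise_log_loss(y_true, uninformed))  # log(2), about 0.693
print(mean_columnwise_log_loss(y_true, perfect))     # close to 0
```

Lower is better, so a model is only useful here if its score falls clearly below that of a simple reference predictor.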
XGBoost

- Input Data:
The following features were stacked and passed as input to the models:
- 875 features already provided in the dataset.
- 50 PCA components derived from the gene expression features.
- 10 PCA components derived from the cell viability features.
- Training:
- To train faster, we used the MultiOutputClassifier wrapper from sklearn.
- 12-fold cross-validation was carried out for each target.
- Prediction:
- Each model trained on a CV fold was used for prediction.
- The final prediction for each target is the average of the predictions from the 12 CV models.
- Conclusion:
- The submission and score of the XGBoost model are shown here:
Click here to view the submission notebook

XGBoost performs best of all the ML models. Let's try an NN approach.
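The wrapper-plus-CV-averaging scheme from this section can be sketched as follows. To keep the example runnable with scikit-learn alone, it uses `GradientBoostingClassifier` as a stand-in base estimator; this is a substitution, not the model used in the notebook, and `xgboost.XGBClassifier` could be passed in its place. The data is synthetic and the helper name `cv_average_predict` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 3 targets instead of 206.
X = rng.normal(size=(240, 12))
W = rng.normal(size=(12, 3))
Y = (X @ W + rng.normal(size=(240, 3)) > 0).astype(int)
X_new = rng.normal(size=(40, 12))


def make_estimator():
    # Stand-in for an XGBoost classifier (assumption, see the lead-in).
    return GradientBoostingClassifier(n_estimators=20, max_depth=3, random_state=0)


def cv_average_predict(make_estimator, X, Y, X_new, n_splits=12):
    """Fit one multi-output model per fold and average their probabilities."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    preds = np.zeros((X_new.shape[0], Y.shape[1]))
    for train_idx, _ in kf.split(X):
        # MultiOutputClassifier fits one copy of the estimator per target column.
        clf = MultiOutputClassifier(make_estimator())
        clf.fit(X[train_idx], Y[train_idx])
        # predict_proba returns one (n_samples, n_classes) array per target;
        # keep the probability of the positive class.
        fold = np.column_stack([p[:, 1] for p in clf.predict_proba(X_new)])
        preds += fold / n_splits
    return preds


preds = cv_average_predict(make_estimator, X, Y, X_new)
print(preds.shape)  # (40, 3)
```

The wrapper mainly saves bookkeeping: a single `fit` call per fold replaces the explicit loop over targets used for LR and SVM.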
Further ideas from the ML paradigm:
- Chain Classifier: Based on the correlation among the target classes, we can train a chain classifier so that the predicted values of one target are used when predicting other correlated targets.
- Linear discriminant analysis: One can perform dimensionality reduction on the gene expression or cell viability features using LDA and then train LR or SVM on the reduced dimensions.
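The chain classifier idea could be tried with scikit-learn's `ClassifierChain`, as in the minimal sketch below. The synthetic targets, the chain order and the base estimator are assumptions chosen for illustration; on the real data the order would have to be derived from the observed target correlations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): target 1 depends on target 0.
X = rng.normal(size=(300, 8))
y0 = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
y1 = ((y0 + X[:, 1]) > 0.5).astype(int)  # correlated with y0
y2 = (X[:, 2] > 0).astype(int)           # independent of the others
Y = np.column_stack([y0, y1, y2])

# Each model in the chain sees the original features plus the
# predictions for the targets that come earlier in `order`.
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order=[0, 1, 2], random_state=0)
chain.fit(X[:250], Y[:250])

proba = chain.predict_proba(X[250:])  # one column per target
print(proba.shape)  # (50, 3)
```

Placing the target that others depend on early in the chain lets the later models use its prediction as an extra feature.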
| Blog part |
|---|
| 1. MoA problem definition link |
| 2. EDA on LISH MoA dataset |
| 3. Feature Engineering and Baseline model for MoA |
| 5. DL techniques on MoA dataset |