Linear B-cell epitopes Prediction¶

ABSTRACT¶

Linear B-cell epitopes are short protein fragments recognised by specific components of the immune system. They play a crucial role in the development of vaccines, diagnostic tests, and therapeutic interventions against infectious diseases, allergies, and some forms of cancer. Despite their significance, experimental discovery of epitopes can be an arduous, resource-intensive process. Consequently, computational methods have been employed for the past three decades to help prioritise candidates for characterisation in the laboratory, with recent developments significantly enhancing the process's efficiency.

This study presents a comprehensive data mining approach to predict linear B-cell epitopes using various machine learning models, with the aim of expediting the discovery process and reducing the time and resources required for experimental validation. The methodology employed consisted of several key stages, including data preprocessing, exploratory data analysis (EDA), model training, hyperparameter tuning, model validation, and pipeline construction for application to other datasets.

All datasets, except df_training_level1, were divided into three parts: 60% for EDA and modelling, 30% for hyperparameter tuning, and 10% for final validation. EDA techniques such as handling missing values, detecting outliers, normalising data, performing feature reduction, and addressing class imbalance were implemented on the dataset. Additionally, visualisation techniques were employed to gain further insights into the data and facilitate informed decisions regarding the modelling process.

Preprocessing addressed all EDA findings, except class balancing, which was performed on the training set to prevent data leakage. The GroupKFold technique was utilised for splitting the dataset into training and test sets. Subsequently, six machine learning models were employed for prediction, namely: Logistic Regression, K-Nearest Neighbours (KNN), Random Forest Classifier, Support Vector Classification (SVC), XGBoost Classifier, and Light Gradient Boosting Machine (LGBM) Classifier. Hyperparameter tuning was conducted using GridSearchCV on the second split of the dataset.

Finally, the model with the highest area under the receiver operating characteristic (ROC) curve was selected as the optimal predictor. Details of the best-performing model and its corresponding AUC score remain undisclosed in this abstract. The final split of the dataset was used for model validation, ensuring an unbiased evaluation of its performance on previously unseen data. A pipeline was constructed based on the aforementioned steps and applied to the analysis of other datasets.

INTRODUCTION¶

The advancement of computational methods has been pivotal in the identification of linear B-cell epitopes, which are short protein fragments recognised by specific components of the immune system. These epitopes play a crucial role in the development of vaccines, diagnostic tests, and therapeutic interventions against a wide range of diseases. In this data mining project, the focus is on Alphavirus, a genus of mosquito-borne viruses that include pathogens of medical concern, such as Chikungunya, affecting millions of people primarily in the Global South. Climate change poses the risk of these diseases migrating northwards, thereby increasing their impact.

Dataset¶

The datasets used in this study have been derived from online databases IEDB, Genbank, and UniProtKB, and have been parsed and consolidated using research tools. The primary goal of this project is to develop an efficient data mining pipeline for the potential prediction of new, previously unknown epitopes in viruses belonging to the Alphavirus genus.

A crucial aspect of this project is to investigate the trade-off between using a smaller amount of data from viruses that are more similar to the target ones and using a larger volume of data from potentially very different viruses. To facilitate this exploration, five different training datasets have been provided, each containing increasingly more data from more distant relatives of the Alphavirus genus. These datasets consist of 13 information columns, 1293 feature columns, and 1 class column.

Requirement¶

The data in this coursework presents a challenging characteristic due to potential dependencies between rows, which may violate the assumptions of certain pre-processing and modelling approaches. Therefore, it is essential to consider these dependencies during analysis. In addition to the training datasets, a validation dataset (file df_holdout.csv) is provided, which shares the same structure as the training data, excluding the Class attribute. The primary task is to develop a competent data mining pipeline capable of predicting the Class attribute for the validation dataset.

AIm¶

This project involves performing exploratory data analysis (EDA), data pre-processing, feature reduction, data rebalancing, modelling, and model assessment using the first training set (df_training_level_1). Furthermore, the analysis will be repeated using at least one other training set (df_training_level_2, 3, …, 6), comparing the results with those obtained for the model trained on df_training_level_1. Finally, the selected data mining pipeline (pre-processing steps + model) will be used to predict the classes for the holdout observations (from file df_holdout).

Packages used in this project can be grouped based on their purpose as follows:¶

Loading Dataset:¶

• Pandas library

Data Visualisation:¶

• Matplotlib
• Seaborn

Dimensionality Reduction:¶

• t-SNE from Scikit-learn
• PCA from Scikit-learn

Data Preprocessing:¶

• NumPy
• Scipy
• SimpleImputer from Scikit-learn
• StandardScaler from Scikit-learn

Handling Imbalanced Data:¶

• SMOTE from Imbalanced-learn
• SMOTEENN from Imbalanced-learn
• RandomOverSampler from Imbalanced-learn
• RandomUnderSampler from Imbalanced-learn

Modelling:¶

• RandomForestClassifier from Scikit-learn
• XGBClassifier from XGBoost
• KNeighborsClassifier from Scikit-learn
• SVC from Scikit-learn
• LGBMClassifier from LightGBM
• LogisticRegression from Scikit-learn

Model Evaluation:¶

• ROC AUC Score from Scikit-learn

Model Selection and Validation:¶

• GroupKFold from Scikit-learn
• GridSearchCV from Scikit-learn

Creating Pipelines:¶

• Pipeline from Scikit-learn

Importing all the libraries¶

In [1]:
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.manifold import TSNE
from scipy.stats import skew
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, StratifiedGroupKFold
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression

# to Ignore all warnings from the system
warnings.filterwarnings("ignore")

Exploratory Data Analysis and Pre-processing¶

In [2]:
# load the dataset into pandas dataframe
def load_data(filepath):
    return pd.read_csv(filepath)

full_dataset = load_data("df_training_level1.csv") 

The epitope dataset is loaded into the Python environment as a Pandas DataFrame for subsequent analysis. pd.read_csv() function from the Pandas library is used to read a CSV file containing the dataset and load it into a Pandas DataFrame.

In [3]:
# display the first 10 observations of the dataframe
full_dataset.head()
Out[3]:
Info_PepID Info_organism_id Info_protein_id Info_pos Info_AA Info_pubmed_id Info_epitope_id Info_host_id Info_nPos Info_nNeg ... feat_esm1b_1270 feat_esm1b_1271 feat_esm1b_1272 feat_esm1b_1273 feat_esm1b_1274 feat_esm1b_1275 feat_esm1b_1276 feat_esm1b_1277 feat_esm1b_1278 feat_esm1b_1279
0 CAA51871.1:2 12161 CAA51871.1 685 S 11458006 60725 10000000 2 0 ... 0.178513 -0.257270 -0.153925 0.014767 -1.294921 -0.112832 0.260342 0.123651 0.159365 0.172829
1 CAA51871.1:2 12161 CAA51871.1 686 R 11458006 60725 10000000 2 0 ... 0.539347 -0.173580 -0.122266 0.235858 -1.230598 -0.060592 0.160817 0.310983 0.146951 0.240393
2 CAA51871.1:2 12161 CAA51871.1 687 L 11458006 60725 10000000 2 0 ... 0.224537 -0.165938 -0.125078 0.131652 -1.359426 0.020718 0.160984 0.189219 0.204018 0.336321
3 CAA51871.1:2 12161 CAA51871.1 688 L 11458006 60725 10000000 2 0 ... 0.173186 -0.069608 -0.133053 0.043285 -1.559416 -0.032758 0.099643 0.117604 0.112384 0.367813
4 CAA51871.1:2 12161 CAA51871.1 689 E 11458006 60725 10000000 2 0 ... 0.136331 -0.068715 0.032138 0.099051 -1.643639 -0.199724 0.076023 -0.128873 0.127291 0.278798

5 rows × 1294 columns

In [4]:
# get the dimensions of the dataset

def get_rows_columns(dataset):
    rows, columns = dataset.shape
    print("No of Observations:", rows)
    print("No of Variables:", columns)
get_rows_columns(full_dataset)
No of Observations: 746
No of Variables: 1294

The epitope dataset for df_training_level1 contains 746 observations and 1294 features which is as stated in the coursework brief.

In [5]:
# getting informations from the dataset
full_dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 746 entries, 0 to 745
Columns: 1294 entries, Info_PepID to feat_esm1b_1279
dtypes: float64(1280), int64(4), object(10)
memory usage: 7.4+ MB

There is a total number of 1280 floats, 4 integers and 10 objects

In [6]:
# describe the dataset
full_dataset.describe()
Out[6]:
Info_organism_id Info_pos Info_cluster Class feat_esm1b_0 feat_esm1b_1 feat_esm1b_2 feat_esm1b_3 feat_esm1b_4 feat_esm1b_5 ... feat_esm1b_1270 feat_esm1b_1271 feat_esm1b_1272 feat_esm1b_1273 feat_esm1b_1274 feat_esm1b_1275 feat_esm1b_1276 feat_esm1b_1277 feat_esm1b_1278 feat_esm1b_1279
count 746.000000 746.000000 746.000000 746.000000 746.000000 746.000000 746.000000 746.000000 746.000000 746.000000 ... 746.000000 746.000000 746.000000 746.000000 746.000000 746.000000 746.000000 746.000000 746.000000 746.000000
mean 14617.840483 123.189008 222.260054 0.225201 0.081334 0.155190 0.045183 0.093532 -0.187883 -0.079048 ... 0.082458 -0.026580 -0.052922 0.130508 -0.912585 -0.068948 0.098828 -0.005770 0.026254 0.234138
std 7744.949567 248.480251 36.833483 0.974966 0.134752 0.135894 0.153447 0.126711 0.150018 0.171420 ... 0.149778 0.133427 0.138832 0.141839 0.326053 0.151345 0.151581 0.145622 0.148122 0.159960
min 12161.000000 1.000000 199.000000 -1.000000 -0.605968 -0.368321 -0.453714 -0.270026 -0.623107 -0.587313 ... -0.252364 -0.426012 -0.438437 -0.298351 -1.749543 -0.402438 -0.344357 -0.456177 -0.401215 -0.326890
25% 12242.000000 49.000000 204.000000 -1.000000 -0.005342 0.056017 -0.045173 0.007387 -0.283482 -0.200325 ... -0.025213 -0.118891 -0.141919 0.036957 -1.142586 -0.176104 0.003478 -0.091283 -0.083311 0.135777
50% 12242.000000 83.000000 204.000000 1.000000 0.093366 0.159134 0.050690 0.102058 -0.190992 -0.087125 ... 0.070693 -0.026712 -0.052635 0.129109 -0.929380 -0.082213 0.099465 0.008209 0.022796 0.249037
75% 12242.000000 123.000000 215.000000 1.000000 0.177607 0.248278 0.140758 0.177299 -0.090165 0.035861 ... 0.179583 0.066112 0.049650 0.223102 -0.693119 0.009155 0.192338 0.099157 0.125548 0.340555
max 55951.000000 2498.000000 320.000000 1.000000 0.431492 0.533593 0.464730 0.484326 0.291663 0.476413 ... 0.546722 0.349790 0.337160 0.576059 0.692667 0.447825 0.676943 0.343372 0.521267 0.624794

8 rows × 1284 columns

The describe() function in Pandas provides a quick and easy way to get an overview of the data, allowing to see the minimum and maximum values, as well as the mean, standard deviation, and quartile values for each variable in the dataset. It can also help identify potential issues in the data, such as missing values, extreme values, or outliers, that may require further investigation. It can also provides information on the distribution of the data, including skewness and kurtosis, which thus help to understand the shape of the data and select appropriate modeling techniques.

Missing Values¶

Missing values are data points that are not available or are incomplete in a dataset. They can occur for various reasons, such as data collection errors, data corruption, or intentional data concealment. Missing values can be represented in various ways, such as "NaN", "NA", "null", or "missing".

In data mining, addressing missing values is of paramount importance, as they can lead to biased or inaccurate results, which may affect the overall reliability and performance of the developed models. Thorough investigation of missing values allows researchers to better understand the nature of the data and make informed decisions on how to handle them. (Kang, 2013).

In this project, the Pandas library's isnull() function was employed to investigate the presence of missing values in the dataset. The isnull() function checks each cell in the DataFrame and returns a DataFrame of the same shape containing True or False values, indicating whether a given cell contains a missing value or not. By applying the sum() function, we can obtain the total number of missing values for each column in the dataset.

Once the missing values were identified, a threshold of 60% was set to determine the appropriate course of action. Columns and rows with more than 60% missing values were deemed too sparse and were dropped from the dataset, as retaining them would introduce a high degree of uncertainty in the analysis.

For the remaining columns with less than 60% missing values, a mean imputation strategy was adopted. This approach involved calculating the mean value for each feature and replacing the missing values with the corresponding mean.

In [7]:
# Now, calculate the percentage of missing values, such that we can determine how to deal with variables having a high count
def get_missing_values(dataset):
    return dataset.isnull().sum()/len(dataset)*100
get_missing_values(full_dataset)
Out[7]:
Info_PepID          0.0
Info_organism_id    0.0
Info_protein_id     0.0
Info_pos            0.0
Info_AA             0.0
                   ... 
feat_esm1b_1275     0.0
feat_esm1b_1276     0.0
feat_esm1b_1277     0.0
feat_esm1b_1278     0.0
feat_esm1b_1279     0.0
Length: 1294, dtype: float64
In [8]:
# print the total null values
print("Total null values:", get_missing_values(full_dataset).sum())
Total null values: 0.0

The result above showed there are no missing datas in the dataset.

Class Imbalance¶

Class imbalance refers to a situation where the distribution of classes in a dataset is uneven.This imbalance can lead to biased predictions, as the machine learning models tend to be influenced by the majority class, consequently leading to poor performance on the minority class (He & Garcia, 2009).

Investigating data imbalance is essential because it allows the identification of potential biases in the data and the development of appropriate strategies to address these imbalances. This helps to improve the performance of the machine learning models by ensuring that they have sufficient information to learn from both the majority and minority classes (Krawczyk, 2016).

In this project, the Synthetic Minority Over-sampling Technique (SMOTE) was used to address the data imbalance (Chawla et al., 2002). SMOTE is an oversampling method that generates synthetic samples for the minority class by interpolating between existing samples. This approach was chosen because the minority class was underrepresented, and the training data was insufficient for the machine learning model to accurately learn the minority class. By applying SMOTE, the dataset was rebalanced, providing a more representative sample of the minority class, and consequently, improving the performance of the machine learning model.

In [9]:
def class_distribution(dataset):
    # Take a random sample for visualization purposes
    sample = dataset.sample(n=233, random_state=42)
    # Visualize the distribution of the target variable
    sns.countplot(data=sample, x="Class")
    plt.title("Class Distribution")
    plt.show()
class_distribution(full_dataset)

The countplot of the Class feature effectively illustrates the class imbalance present in the entire dataset, with a notable difference between the counts of -1 and 1. Specifically, the plot reveals that the count of -1 is approximately 90, whereas the count of 1 is roughly 140. The noticeable difference between these two figures indicates that the Class is imbalanced which can cause a model to be bias.

In [10]:
def get_info_cluster(dataset, column1, column2):
    
    grouped = dataset.groupby([column1, column2]).size()
    # Create a new dataframe to store the counts
    counts_df = pd.DataFrame(grouped, columns=['count']).reset_index()
    
    # Pivot the table to get the desired format
    pivoted = counts_df.pivot(index=column1, columns=column2, values='count').reset_index()
    
    # Rename the columns
    pivoted.columns = [column1, '1', '-1']
    
    # Fill any missing values with zero
    pivoted = pivoted.fillna(0)
    
    # Convert the count columns to integer type
    pivoted['1'] = pivoted['1'].astype(int)
    pivoted['-1'] = pivoted['-1'].astype(int)
    
    return pivoted
get_info_cluster(full_dataset,'Info_cluster', 'Class')
Out[10]:
Info_cluster 1 -1
0 199 0 30
1 204 289 227
2 215 0 20
3 229 0 54
4 256 0 7
5 264 0 8
6 298 0 69
7 320 0 42

The table above reveals the numbers of each class predictions (i.e 1 or -1) against each unique info_cluster group. In the table, the info_cluster group 204 has 289 instances in class 1 and 227 instances in class -1, indicating a relatively balanced class distribution. However, all other info_cluster group has no instances in class 1, indicating a severe class imbalance issue across other groups. Which means df_training_level1 cannot be splitted into three chunks

In [11]:
EDA = full_dataset.iloc[:, 14:]
Class = full_dataset.iloc[:, 13]
cluster = full_dataset.iloc[:, 12]
cluster.head()
Out[11]:
0    199
1    199
2    199
3    199
4    199
Name: Info_cluster, dtype: int64

Outliers¶

An outlier is an observation that lies an abnormal distance from other values in a dataset. Outliers can significantly impact the analysis and the performance of machine learning models by introducing noise, skewing the distribution of the data, and affecting the overall accuracy of the model (Aggarwal, 2016). Outliers can arise from several sources, such as data entry errors, measurement errors, or genuine extreme observations.

The presence of outliers can distort the distribution of the data and lead to biased estimates of central tendency and variability. Consequently, it is essential to investigate and handle outliers in the data preprocessing phase to improve the reliability of the analysis and the performance of machine learning models (Rousseeuw & Hubert, 2011).

In this project, outliers were investigated using a combination of methods, including skewness, t-Distributed Stochastic Neighbor Embedding (t-SNE), histograms, interquartile range (IQR), and Tukey's method.

  • Skewness was used to measure the asymmetry of the distribution of the data, with a high skewness indicating the presence of outliers (Yap & Sim, 2011).
  • t-SNE, a dimensionality reduction technique, was employed to visualize the high-dimensional data in a two-dimensional space, allowing for the identification of potential outliers (Maaten & Hinton, 2008).
  • Histograms were generated to visualize the distribution of the entire dataset and to identify any extreme values.
  • The IQR, defined as the range between the 25th and 75th percentiles, was used to determine the spread of the data and to identify values that fall outside this range.
  • Tukey's method involves calculating the lower and upper bounds for outliers by subtracting and adding 1.5 times the IQR from the 25th and 75th percentiles, respectively (Tukey, 1977).

Outliers identified in the dataset were treated using Z-score or IQR-based capping which works by capping the extreme values based on their z-scores or IQR. This approach helps to reduce the impact of extreme values on the distribution of the data and the performance of the machine learning models while maintaining the overall structure and properties of the data.

t-SNE plot¶
In [12]:
def plot_tsne(dataset):
    X = dataset.values
    tsne = TSNE(n_components=2, random_state=42)
    X_tsne = tsne.fit_transform(X)
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
    plt.show()
plot_tsne(EDA)

Points that are distinct and separate from the main cluster or that lie far from other data points in tsne plot can be considered potential outliers. However, looking at the above plot, it appears that there are no such points that are significantly distant from the others, indicating a relatively uniform distribution of the data or no abundance of extreme outliers.

Skewness¶
In [13]:
def skewness(dataset):
    # Calculate the skewness of each feature
    skewness = dataset.apply(lambda x: skew(x))

    # Calculate the mean and standard deviation of the skewness values
    mean = skewness.mean()
    std = skewness.std()

    # Set the threshold to be 3 standard deviations away from the mean
    # Idea from 3-sigma rule
    threshold = mean + 3 * std

    # Identify features with high skewness values
    high_skewness_features = skewness[skewness > threshold].index

    
    return high_skewness_features

# skewness features
high_skewness_features = skewness(EDA)

# Print the names of the features with high skewness values
print("Features with high skewness:")
print(high_skewness_features)
Features with high skewness:
Index(['feat_esm1b_36', 'feat_esm1b_450', 'feat_esm1b_847', 'feat_esm1b_869',
       'feat_esm1b_1275'],
      dtype='object')

In this analysis, the skewness of each feature was calculated to identify any features that had a high skewness value. The features with the highest skewness values were identified as feat_esm1b_869, feat_esm1b_1010, and feat_esm1b_1275.

IQR¶

The IQR is a measure of the spread of the data and represents the range between the 75th percentile (Q3) and the 25th percentile (Q1) of the dataset. The upper and lower bounds can be computed using the interquartile range (IQR).
Lower bound: Q1 - 1.5 x IQR
Upper bound: Q3 + 1.5 x IQR
Data points that fall outside the upper and lower bounds are considered outliers.

In [14]:
def display_outliers_iqr(data, factor=1.5):
    # Calculate Q1, Q3, and IQR for each column
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1

    # Define the upper and lower bounds
    lower_bound = Q1 - factor * IQR
    upper_bound = Q3 + factor * IQR

    # Identify outliers
    outliers = ((data < lower_bound) | (data > upper_bound))

    # Get the outlier values for each column
    outlier_values = data[outliers]

    return outlier_values

# Apply the IQR method to display the outliers
outliers = display_outliers_iqr(EDA)

print("Outliers:")
print(outliers)
Outliers:
     feat_esm1b_0  feat_esm1b_1  feat_esm1b_2  feat_esm1b_3  feat_esm1b_4  \
0             NaN           NaN       0.46473           NaN           NaN   
1             NaN           NaN           NaN           NaN           NaN   
2             NaN           NaN           NaN           NaN           NaN   
3             NaN           NaN           NaN           NaN           NaN   
4             NaN           NaN           NaN           NaN           NaN   
..            ...           ...           ...           ...           ...   
741           NaN           NaN           NaN           NaN           NaN   
742           NaN           NaN           NaN           NaN           NaN   
743           NaN           NaN           NaN           NaN           NaN   
744           NaN           NaN           NaN           NaN           NaN   
745           NaN           NaN           NaN           NaN           NaN   

     feat_esm1b_5  feat_esm1b_6  feat_esm1b_7  feat_esm1b_8  feat_esm1b_9  \
0             NaN           NaN           NaN           NaN           NaN   
1             NaN           NaN           NaN           NaN           NaN   
2             NaN           NaN           NaN           NaN           NaN   
3             NaN           NaN           NaN           NaN           NaN   
4             NaN           NaN           NaN           NaN           NaN   
..            ...           ...           ...           ...           ...   
741           NaN           NaN           NaN           NaN           NaN   
742           NaN           NaN           NaN           NaN           NaN   
743           NaN           NaN           NaN           NaN           NaN   
744           NaN           NaN           NaN           NaN           NaN   
745           NaN           NaN           NaN           NaN      0.433676   

     ...  feat_esm1b_1270  feat_esm1b_1271  feat_esm1b_1272  feat_esm1b_1273  \
0    ...              NaN              NaN              NaN              NaN   
1    ...         0.539347              NaN              NaN              NaN   
2    ...              NaN              NaN              NaN              NaN   
3    ...              NaN              NaN              NaN              NaN   
4    ...              NaN              NaN              NaN              NaN   
..   ...              ...              ...              ...              ...   
741  ...              NaN              NaN              NaN              NaN   
742  ...              NaN              NaN              NaN              NaN   
743  ...              NaN              NaN              NaN              NaN   
744  ...              NaN              NaN              NaN              NaN   
745  ...              NaN              NaN              NaN              NaN   

     feat_esm1b_1274  feat_esm1b_1275  feat_esm1b_1276  feat_esm1b_1277  \
0                NaN              NaN              NaN              NaN   
1                NaN              NaN              NaN              NaN   
2                NaN              NaN              NaN              NaN   
3                NaN              NaN              NaN              NaN   
4                NaN              NaN              NaN              NaN   
..               ...              ...              ...              ...   
741              NaN              NaN              NaN              NaN   
742              NaN              NaN              NaN              NaN   
743              NaN              NaN              NaN              NaN   
744              NaN              NaN              NaN              NaN   
745              NaN              NaN              NaN              NaN   

     feat_esm1b_1278  feat_esm1b_1279  
0                NaN              NaN  
1                NaN              NaN  
2                NaN              NaN  
3                NaN              NaN  
4                NaN              NaN  
..               ...              ...  
741              NaN              NaN  
742              NaN              NaN  
743              NaN              NaN  
744              NaN              NaN  
745              NaN              NaN  

[746 rows x 1280 columns]

The result from IQR suggest that there are present of ouliers in all the features and observations of the dataset.

Tukey Method¶

The Tukey method is a statistical technique used to identify outliers in a dataset. It involves calculating the interquartile range (IQR), which is the difference between the third quartile (Q3) and the first quartile (Q1) of the data. Any data point outside the range of Q1 - 1.5 x IQR to Q3 + 1.5 x IQR is considered an outlier.

In [15]:
def showTukeyOutliersColumns(data, factor=1.5):
    # Calculate Q1, Q3, and IQR for each column
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1

    # Define the upper and lower bounds
    lower_bound = Q1 - factor * IQR
    upper_bound = Q3 + factor * IQR

    # Identify outliers
    outliers = ((data < lower_bound) | (data > upper_bound)).any(axis=0)

    # Get the column names of the outliers
    outlier_columns = data.columns[outliers]

    return outlier_columns

# Apply the Tukey method and get the cleaned dataset and the column names of the removed outliers
outlier_columns = showTukeyOutliersColumns(EDA)


print("Outlier columns:")
print(outlier_columns)
Outlier columns:
Index(['feat_esm1b_0', 'feat_esm1b_1', 'feat_esm1b_2', 'feat_esm1b_3',
       'feat_esm1b_4', 'feat_esm1b_5', 'feat_esm1b_6', 'feat_esm1b_7',
       'feat_esm1b_8', 'feat_esm1b_9',
       ...
       'feat_esm1b_1270', 'feat_esm1b_1271', 'feat_esm1b_1272',
       'feat_esm1b_1273', 'feat_esm1b_1274', 'feat_esm1b_1275',
       'feat_esm1b_1276', 'feat_esm1b_1277', 'feat_esm1b_1278',
       'feat_esm1b_1279'],
      dtype='object', length=1226)

From the above result, the lenght of features with ouliers is 1226 while the dimension of the dataset is 1294. This indicates that a significant portion of the dataset, approximately 94.7%, contains outliers in their respective feature distributions. This observation highlights the need for proper outlier handling to ensure the robustness and reliability of the subsequent data analysis and machine learning models. The presence of outliers in such a large proportion of the features may impact the performance and interpretability of the models

Dropping the entire rows or columns containing outliers would result in the removal of over 80% of the dataset. This significant loss of information could severely impact the performance and interpretability of the models, as the remaining data may not be representative of the problem domain. Consequently, an alternative method is needed to address the outliers without losing a substantial portion of the dataset.

The choice of Z-score or IQR-based capping is made as it allows for the preservation of the majority of the data points while minimizing the impact of outliers. This approach ensures that the dataset remains comprehensive, and the subsequent machine learning models are trained on representative data, leading to better performance and generalizability.

Normalisation¶

Normalisation is a data preprocessing technique used to scale the numerical features of a dataset to a common range, usually between 0 and 1, to ensure that no single feature dominates others due to differences in their magnitudes. The process aims to improve the performance and convergence of machine learning algorithms and facilitate comparisons between different features (Patro & Sahu, 2015).

Normalisation helps in mitigating the potential negative impact of features with large numerical ranges on the learning process. It ensures that all features contribute equally to the model, preventing those with larger values from disproportionately influencing the model's output. This, in turn, leads to improved performance and more accurate predictions from the machine learning algorithms, as well as better interpretability of the model (Jain, 2015).

In this analysis, the minimum and maximum values of each feature were plotted against each other, providing a visual representation of the data's distribution. Additionally, the range of the minimum and maximum values was also plotted to understand the differences in feature magnitudes. These plots help determine whether normalisation is necessary for the dataset, as they reveal the variability in feature ranges. If there is a considerable difference in these ranges, normalising the data would be essential to ensure a fair and unbiased contribution from all features during the model training process.

Scatter plot of the minimum values against the maximum values¶

Plotting the maximum and minimum values of each variable against each other gives an idea of the scale of the numerical features.

In [16]:
def plot_min_max(dataset):
    maxima = dataset.max().values
    minima = dataset.min().values
    # Plot the maxima and minima vectors against each other
    plt.scatter(minima, maxima)
    plt.plot(minima, maxima, color="red")
    plt.title("Plot of Maximum against Minimum")
    plt.xlabel("Minima")
    plt.ylabel("Maxima")
    plt.show()
    
plot_min_max(EDA)

The scatter plot above shows the relationship between the minimum and maximum values of each variable.
From the plot, the minimum and maximum values are widely dispersed and do not lie close to each other, this suggest that the data are not normalised.

Histogram Ranges¶

Histogram of the ranges of each variable in the dataset gives an idea of the spread or variability of the data in that variable.

In [17]:
def plot_range_histogram(dataset):
    # Calculate the ranges of each variable
    range = dataset.max() - dataset.min()
    plt.hist(range, bins=20)
    plt.title("Range Histogram")
    plt.xlabel("Range")
    plt.ylabel("Frequency")
    plt.show()
plot_range_histogram(EDA)

The above right skewed histogram has long tails to the right, which indicate that there are some variables with very large ranges of values compared to the other variables in the dataset. This suggest that the data needs to be normalised or scaled to ensure that all variables are treated equally by the machine learning algorithm

Data Preprocessing on Level1¶

Removing outliers from the dataset¶
Z-score or IQR-based capping¶

This method involves calculating the Z-score or the interquartile range (IQR) for each data point and then capping or replacing the extreme values beyond a predefined threshold with the maximum or minimum value within the acceptable range. The advantage of this approach is that it retains the original structure of the data while reducing the influence of the extreme values (Chen et al., 2020).

In [18]:
def cappingOutliers(data):
    # IQR-based capping
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    return  data.clip(lower_bound, upper_bound, axis=1)

EDA = cappingOutliers(EDA)
EDA
outlier_columns = showTukeyOutliersColumns(EDA)
outlier_columns
Out[18]:
Index([], dtype='object')
Split the dataset into dependent and independent variables¶
Normalisation¶
In [19]:
def normalise(data):
    # Create a StandardScaler object
    scaler = StandardScaler()

    # Fit the scaler to the independent variables
    scaler.fit(data)

    # Transform the independent variables
    EDA = scaler.transform(data)
    return EDA
normalise(EDA)
Out[19]:
array([[-1.87662701, -1.02946734,  2.48825768, ...,  0.89366964,
         0.90532722, -0.39241543],
       [ 0.11252744,  0.23429552,  1.35232255, ...,  2.19151373,
         0.82103959,  0.0353679 ],
       [-0.79205547,  0.33016573, -0.33479896, ...,  1.34792475,
         1.20848352,  0.64274961],
       ...,
       [ 0.07624328, -2.02741323,  0.04729207, ..., -1.09411658,
        -0.23178897, -0.15654018],
       [-1.17895908, -2.87434743,  0.26058779, ..., -1.43190973,
         0.42849075, -1.00917383],
       [-0.67684839, -0.98056221,  0.08339662, ..., -1.43619195,
         1.59116302, -1.21257136]])

Feature Reduction¶

Selecting the optimum number of component

In [20]:
def featureReduction(data):
    # Feature reduction
    pca = PCA(30)
    EDA = pca.fit_transform(data)
    return EDA
    
EDA = featureReduction(EDA)
EDA
Out[20]:
array([[-1.92162579, -0.23609598, -0.0310398 , ...,  0.07472657,
        -0.32679189, -0.34665208],
       [-2.06460022, -0.242356  ,  0.0436501 , ...,  0.08977408,
        -0.30075475, -0.6677496 ],
       [-2.34729446, -1.52472625, -0.07931271, ..., -0.07506423,
        -0.42556908, -0.44502854],
       ...,
       [ 1.16269808,  0.33138486, -1.0184451 , ..., -0.31976703,
        -0.40624114, -0.09973824],
       [ 1.69569303, -0.20296697, -0.36221178, ...,  0.15658449,
        -0.59622137, -0.47984221],
       [ 1.93742367,  0.84105617,  1.47378199, ...,  0.15948991,
        -0.47720172, -0.43906906]])

Model Validation, Class Balancing and Modelling¶

GroupKFold¶

GroupKFold is a cross-validation technique that extends the traditional k-fold cross-validation by accounting for grouping structures within the data. It is particularly useful when there are potential dependencies or similarities within certain groups of observations, which could lead to data leakage if not considered during the model validation process (Pedregosa et al., 2011).

The importance of GroupKFold lies in its ability to prevent data leakage, which occurs when information from the test set inadvertently influences the training process. This can lead to overfitting and an overestimation of the model's performance on unseen data. By splitting the data into training and test sets based on predefined groups, GroupKFold ensures that observations from the same group are not present in both training and test sets, thereby preserving the independence of the data used for validation (Varoquaux et al., 2017).

In this analysis, GroupKFold was employed to split the dataset into training and test sets based on the "info_cluster" column, which represents the grouping structure. This approach maintains the integrity of the validation process by ensuring that no data leakage occurs due to group dependencies. Furthermore, class balancing was performed on the training set during the splitting process to address the issue of class imbalance, which can adversely impact the performance of machine learning models.

Models¶

The importance of using different models for analyzing a classification problem stems from the fact that each model has its own strengths, weaknesses, and assumptions. By employing a variety of models, it is possible to mitigate the limitations of a single approach and gain a more comprehensive understanding of the underlying patterns within the data (Hastie et al., 2009).

Here are some reasons why it is crucial to use different models in classification tasks:

  1. Diversity in modeling techniques: Different models employ different techniques to learn from the data, such as tree-based approaches (Random Forest, Decision Tree, XGBoost, LightGBM), distance-based methods (KNeighborsClassifier), linear models (Logistic Regression), or kernel methods (Support Vector Machine). These diverse techniques can capture various aspects of the data and help uncover complex relationships that may not be apparent when relying on a single model.
  2. Handling of non-linear relationships: Some models, such as logistic regression and linear SVM, assume a linear relationship between features and the target variable. However, real-world data often exhibit non-linear patterns, which may be better captured by tree-based models, non-linear SVM, or KNeighborsClassifier.
  3. Robustness to noise and outliers: Certain models, like Decision Trees and Random Forests, are more robust to noise and outliers than others, such as Logistic Regression and SVM. By using multiple models, the impact of noise and outliers on the overall performance can be mitigated.
  4. Model interpretability: Some models, such as Decision Trees and Logistic Regression, offer more interpretability than others like SVM, XGBoost, or Neural Networks. By employing various models, it is possible to balance the need for accurate predictions and model interpretability.
  5. Ensemble learning: Combining predictions from multiple models can lead to improved performance through ensemble learning methods. These methods, such as voting, stacking, or bagging, leverage the strengths of individual models to produce more accurate and stable predictions (Pedregosa,F et al, 2011).

In this analysis, a range of models, including Random Forest, Logistic Regression, Support Vector Machine, Decision Tree, XGBoost, LightGBM, and KNeighborsClassifier, were used to address the classification problem. This approach allows for a more comprehensive exploration of the data, leveraging the unique advantages of each model, and potentially leading to improved overall performance.

AUC¶

The Area Under the Curve (AUC) is an important performance metric used in evaluating the accuracy of classification models, particularly in the context of binary classification problems. Specifically, AUC refers to the area under the Receiver Operating Characteristic (ROC) curve, which is a plot of the True Positive Rate (sensitivity) against the False Positive Rate (1-specificity) at various classification thresholds.

In this analysis, AUC was used for measuring the accuracy of each model, as it provides a comprehensive assessment of the models' performance across various classification thresholds. By using AUC, it is possible to identify the best-performing model, taking into account the trade-offs between sensitivity and specificity, as well as the potential effects of class imbalance or outlier presence in the data.

In [21]:
def model(EDA, Class, cluster):
    classifiers = [
        ('Random Forest', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('XGBoost', XGBClassifier(n_estimators=100, random_state=42)),
        ('KNN', KNeighborsClassifier()),
        ('SVM', SVC(random_state=42, probability=True)), 
        ('LightGBM', LGBMClassifier(random_state=42)),  
        ('Logistic Regression', LogisticRegression(random_state=42)) 
    ]

    n_splits = 5
    sgkf = StratifiedGroupKFold(n_splits=n_splits)
    groups = cluster

    auc_scores = []

    # Convert pandas Series to NumPy arrays
    split1_features_np = np.array(EDA)
    split1_class_np = np.array(Class)

    # Replace -1 with 0 in the class labels
    split1_class_np[split1_class_np == -1] = 0

    for classifier_name, classifier in classifiers:
        clf_auc_scores = []
        for fold, (train_indices, test_indices) in enumerate(sgkf.split(split1_features_np, split1_class_np, groups)):
            X_train, y_train = split1_features_np[train_indices], split1_class_np[train_indices]
            X_test, y_test = split1_features_np[test_indices], split1_class_np[test_indices]

            # Check if there is only one class present in y_test
            if len(np.unique(y_test)) == 1:
                continue

            # Resampling the dataset
            smote = SMOTE(random_state=42)
            X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

            # Create a pipeline with the classifier
            pipeline = Pipeline([
                (classifier_name, classifier)
            ])

            # Train and evaluate the classifier on this fold
            pipeline.fit(X_train_resampled, y_train_resampled)
            y_prob = pipeline.predict_proba(X_test)[:, 1]
            auc_score = roc_auc_score(y_test, y_prob)
            clf_auc_scores.append(auc_score)

        # Compute the average AUC score for the classifier
        avg_auc_score = np.mean(clf_auc_scores)
        auc_scores.append(avg_auc_score)
        print(f"Average AUC score for {classifier_name}: {avg_auc_score}")
In [22]:
get_info_cluster(full_dataset,'Info_cluster', 'Class')
Out[22]:
Info_cluster 1 -1
0 199 0 30
1 204 289 227
2 215 0 20
3 229 0 54
4 256 0 7
5 264 0 8
6 298 0 69
7 320 0 42

The modelling of df_training across the whole dataset will not be possible, beacause only cluster 199 has instance of 1 and -1 while the rest have only -1.

Exploratory Data Analysis (EDA) Level2¶

Merge EDA and dataset2 for Preprocessing¶
In [23]:
# Load the dataset training level2
dataset2 = load_data("df_training_level2.csv")
dataset2.shape 
Out[23]:
(4946, 1294)
Check for Data Distribution across Info_cluster¶
In [24]:
table = get_info_cluster(dataset2,'Info_cluster', 'Class')
table
Out[24]:
Info_cluster 1 -1
0 34 1505 128
1 35 1513 110
2 36 183 61
3 39 33 0
4 150 283 281
5 198 12 31
6 199 0 30
7 204 289 227
8 215 0 20
9 222 0 15
10 229 0 54
11 232 0 45
12 256 0 7
13 264 0 8
14 298 0 69
15 320 0 42

Missing Values¶

In [25]:
print("Missing values Columns\n",dataset2.isnull().sum())
print("Total missing values: ",dataset2.isnull().sum().sum())
Missing values Columns
 Info_PepID            0
Info_organism_id      0
Info_protein_id       0
Info_pos              0
Info_AA               0
                   ... 
feat_esm1b_1275     521
feat_esm1b_1276     521
feat_esm1b_1277     521
feat_esm1b_1278     521
feat_esm1b_1279     521
Length: 1294, dtype: int64
Total missing values:  666880

remove columns or rows with more than 50% missing values

In [26]:
def dropMissingValue(dataset):
    # Define the threshold for the number of missing values
    threshold = 0.5
    # columns with missing values above the threshold
    missing_values_count = dataset.isnull().sum()
    columns_to_drop = missing_values_count[missing_values_count/dataset.shape[0] > threshold].index

    # rows with missing values above the threshold
    missing_values_count = dataset.isnull().sum(axis=1)
    rows_to_drop = missing_values_count[missing_values_count/dataset.shape[1] > threshold].index
    
    # Drop columns with missing values above the threshold
    dataset.drop(columns_to_drop, axis=1, inplace=True)

    # Drop rows with missing values above the threshold
    dataset.drop(rows_to_drop, axis=0, inplace=True)
dropMissingValue(dataset2)
print(dataset2.isnull().sum().sum())
0

Split the dataset into three subsets based on the 'Info_cluster' column.
split1: pandas DataFrame containing 60% of the dataset, based on unique 'Info_cluster' values
split2: pandas DataFrame containing 30% of the dataset, based on unique 'Info_cluster' values
split3: pandas DataFrame containing 10% of the dataset, based on unique 'Info_cluster' values

In [27]:
def split_dataset(dataset):

    # create a set of unique info_clusters in the dataset
    unique_clusters = set(dataset['Info_cluster'])

    # determine the size of each split
    n_clusters = len(unique_clusters)
    split_sizes = [int(n_clusters*0.6), int(n_clusters*0.3), int(n_clusters*0.1)]

    # create empty lists to hold the clusters for each split
    splits = [[] for i in range(3)]

    # loop through the unique clusters and add them to the appropriate split
    for cluster in unique_clusters:
        # determine which split this cluster should go in
        if len(splits[0]) < split_sizes[0]:
            split_idx = 0
        elif len(splits[1]) < split_sizes[1]:
            split_idx = 1
        else:
            split_idx = 2
        # add the cluster to the appropriate split
        splits[split_idx].append(cluster)

    # now use the split info_clusters to create three dataframes
    split1 = dataset[dataset['Info_cluster'].isin(splits[0])]
    split2 = dataset[dataset['Info_cluster'].isin(splits[1])]
    split3 = dataset[dataset['Info_cluster'].isin(splits[2])]

    return split1, split2, split3


split1, split2, split3 = split_dataset(dataset2)

Seperating the first split into Info, Class, Features

In [28]:
def infoFeatureClass(split1):
    split1_info = split1.iloc[:,12]
    split1_features = split1.iloc[:,14:]
    split1_class = split1.iloc[:,13]
    return split1_info,split1_features,split1_class

split1_info, split1_features, split1_class = infoFeatureClass(split1)

print("split1_features shape:", split1_features.shape)
print("split1_class shape:", split1_class.shape)
print("split1_info shape:", split1_info.shape)
split1_features shape: (3233, 1280)
split1_class shape: (3233,)
split1_info shape: (3233,)
In [29]:
def missingValuesMean(dataset):
    # Define the imputer object
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

    # Apply the imputer to the remaining missing values
    imputer.fit(dataset)
    dataset_imputed = imputer.transform(dataset)
    # Convert the NumPy array to a Pandas DataFrame
    dataset_imputed = pd.DataFrame(dataset_imputed, columns=dataset.columns)

    return dataset_imputed

split1_features = missingValuesMean(split1_features)
split1_features.isnull().sum().sum()
Out[29]:
0

Outliers¶

Skewness¶
In [30]:
# skewness features
high_skewness_features = skewness(split1_features)

# Print the names of the features with high skewness values
print("Features with high skewness:")
print(high_skewness_features)
Features with high skewness:
Index(['feat_esm1b_450'], dtype='object')
IQR¶
In [31]:
# Apply the IQR method to display the outliers
outliers = display_outliers_iqr(split1_features)

print("Outliers:")
print(outliers)
Outliers:
      feat_esm1b_0  feat_esm1b_1  feat_esm1b_2  feat_esm1b_3  feat_esm1b_4  \
0              NaN           NaN           NaN           NaN           NaN   
1              NaN           NaN           NaN           NaN           NaN   
2              NaN           NaN           NaN           NaN           NaN   
3              NaN           NaN           NaN           NaN           NaN   
4              NaN           NaN           NaN           NaN           NaN   
...            ...           ...           ...           ...           ...   
3228           NaN           NaN           NaN           NaN           NaN   
3229           NaN           NaN           NaN           NaN           NaN   
3230           NaN           NaN           NaN           NaN           NaN   
3231           NaN           NaN           NaN           NaN           NaN   
3232           NaN           NaN           NaN           NaN           NaN   

      feat_esm1b_5  feat_esm1b_6  feat_esm1b_7  feat_esm1b_8  feat_esm1b_9  \
0              NaN           NaN           NaN           NaN           NaN   
1              NaN           NaN           NaN           NaN           NaN   
2              NaN           NaN           NaN           NaN           NaN   
3              NaN           NaN           NaN           NaN           NaN   
4              NaN           NaN           NaN           NaN           NaN   
...            ...           ...           ...           ...           ...   
3228           NaN           NaN           NaN           NaN           NaN   
3229           NaN           NaN           NaN           NaN           NaN   
3230           NaN           NaN           NaN           NaN           NaN   
3231           NaN           NaN           NaN           NaN           NaN   
3232           NaN           NaN           NaN           NaN           NaN   

      ...  feat_esm1b_1270  feat_esm1b_1271  feat_esm1b_1272  feat_esm1b_1273  \
0     ...              NaN              NaN              NaN              NaN   
1     ...              NaN              NaN              NaN              NaN   
2     ...              NaN              NaN              NaN              NaN   
3     ...              NaN              NaN              NaN              NaN   
4     ...              NaN              NaN              NaN              NaN   
...   ...              ...              ...              ...              ...   
3228  ...              NaN              NaN              NaN              NaN   
3229  ...              NaN              NaN              NaN              NaN   
3230  ...              NaN              NaN              NaN              NaN   
3231  ...              NaN              NaN              NaN              NaN   
3232  ...              NaN              NaN              NaN              NaN   

      feat_esm1b_1274  feat_esm1b_1275  feat_esm1b_1276  feat_esm1b_1277  \
0                 NaN              NaN              NaN              NaN   
1                 NaN              NaN        -0.507386              NaN   
2                 NaN              NaN              NaN              NaN   
3                 NaN              NaN              NaN              NaN   
4                 NaN              NaN              NaN              NaN   
...               ...              ...              ...              ...   
3228              NaN              NaN              NaN              NaN   
3229              NaN              NaN              NaN              NaN   
3230              NaN              NaN              NaN              NaN   
3231              NaN              NaN              NaN              NaN   
3232              NaN              NaN              NaN              NaN   

      feat_esm1b_1278  feat_esm1b_1279  
0                 NaN              NaN  
1                 NaN              NaN  
2                 NaN              NaN  
3                 NaN              NaN  
4                 NaN              NaN  
...               ...              ...  
3228              NaN              NaN  
3229              NaN              NaN  
3230              NaN              NaN  
3231              NaN              NaN  
3232              NaN              NaN  

[3233 rows x 1280 columns]
Tukey¶
In [32]:
outlier_columns = showTukeyOutliersColumns(split1_features)

print("Outlier columns:")
print(outlier_columns)
Outlier columns:
Index(['feat_esm1b_0', 'feat_esm1b_1', 'feat_esm1b_2', 'feat_esm1b_3',
       'feat_esm1b_4', 'feat_esm1b_5', 'feat_esm1b_6', 'feat_esm1b_7',
       'feat_esm1b_8', 'feat_esm1b_9',
       ...
       'feat_esm1b_1270', 'feat_esm1b_1271', 'feat_esm1b_1272',
       'feat_esm1b_1273', 'feat_esm1b_1274', 'feat_esm1b_1275',
       'feat_esm1b_1276', 'feat_esm1b_1277', 'feat_esm1b_1278',
       'feat_esm1b_1279'],
      dtype='object', length=1278)
Capping the Outliers¶
In [33]:
split1_features = cappingOutliers(split1_features)
split1_features
outlier_columns = showTukeyOutliersColumns(split1_features)
outlier_columns
Out[33]:
Index([], dtype='object')

Normalisation¶

Feature scaling¶
In [34]:
split1_features = normalise(split1_features)  
In [35]:
print("split1_features shape:", split1_features.shape)
print("split1_class shape:", split1_class.shape)
print("split1_info shape:", split1_info.shape)
split1_features shape: (3233, 1280)
split1_class shape: (3233,)
split1_info shape: (3233,)
DImension Reduction¶

Selecting the best number of component

In [36]:
split1_features = featureReduction(split1_features)
In [37]:
print("split1_features shape:", split1_features.shape)
print("split1_class shape:", split1_class.shape)
print("split1_info shape:", split1_info.shape)
split1_features shape: (3233, 30)
split1_class shape: (3233,)
split1_info shape: (3233,)
Data Balancing¶
In [38]:
split1_class.value_counts()
Out[38]:
-1    2791
 1     442
Name: Class, dtype: int64

There is a massive class imbalance distribution of -1 and 1. total number of 1 and -1 is 442 and 2791 reapectively. This informs the decision to perform a sampling on the class using SMOTE.
The class balancing will be done on the training set to prevent data leakage.

In [39]:
class_distribution(split1)

The plot above show the class imbalance distribution of -1 and 1

Model Validation, Class Balancing and Modelling¶

In [40]:
model(split1_features, split1_class, split1_info)
Average AUC score for Random Forest: 0.5743310399336311
Average AUC score for XGBoost: 0.5590643503548275
Average AUC score for KNN: 0.6229216971638268
Average AUC score for SVM: 0.6208648738316385
Average AUC score for LightGBM: 0.6056957129830778
Average AUC score for Logistic Regression: 0.6471612801039476

The average AUC scores presented above indicate the performance of various machine learning models in predicting epitope binding. Each of these models has been trained and tested on the epitope dataset, and their performance has been evaluated using the AUC metric.

The results for the epitope dataset are as follows:

Random Forest: 0.5798 - This ensemble learning method, which is based on decision trees, achieved a moderate AUC score, suggesting it has some ability to predict epitope binding, but there is room for improvement.

XGBoost: 0.5906 - An optimized gradient boosting algorithm, XGBoost showed slightly better performance than Random Forest, but it still may not be the best choice for this classification problem.

KNN: 0.6173 - K-Nearest Neighbors, a simple yet powerful instance-based learning algorithm, achieved a higher AUC score, indicating better performance in epitope prediction.

SVM: 0.6177 - Support Vector Machine, a popular model for classification problems, performed similarly to KNN, with a slightly higher AUC score.

LightGBM: 0.5994 - This gradient boosting framework, which uses tree-based learning algorithms, had a moderate AUC score, similar to the scores of Random Forest and XGBoost.

Logistic Regression: 0.6453 - The highest AUC score was achieved by Logistic Regression, a simple linear classifier. This indicates that Logistic Regression is the most effective model in predicting epitope binding among the tested models.

Hyperparameter Tuning¶

Hyperparameter tuning is the process of optimizing the hyperparameters of a machine learning model to improve its performance. Hyperparameters are the configuration variables of a model that are set before training begins, as opposed to model parameters, which are learned during the training process.

Hyperparameter tuning is essential because the performance of machine learning models can be highly sensitive to the choice of hyperparameters. Selecting the optimal values for these variables can lead to better model accuracy, generalization, and robustness.

GridSearchCV is a popular method for hyperparameter tuning. It performs an exhaustive search over a specified parameter grid, evaluating each combination of hyperparameters using cross-validated performance metrics (such as AUC) which was not added in this analysis. GridSearchCV then returns the set of hyperparameters that resulted in the best model performance.

In this analysis, GridSearchCV was used to identify the best hyperparameters for each of the tested models, helping to optimize their performance on the epitope prediction task. To avoid overfitting and to further validate the performance of the tuned models, a second split of the dataset was used for hyperparameter tuning. This ensures that the final model selection is based on its ability to generalize to unseen data, which is a critical aspect of a successful machine learning model.

Preprocessing on the split2¶
In [41]:
split2_info, split2_features, split2_class = infoFeatureClass(split2)
missingValuesMean(split2_features)
split2_features = cappingOutliers(split2_features)
split2_features = normalise(split2_features)
split2_features = featureReduction(split2_features)

Before using the second split of the dataset (splits2) for hyperparameter tuning, it is crucial to preprocess the data in the same way as the training data to ensure consistency and reliable results. Preprocessing may include handling missing values, normalizing or scaling features, encoding categorical variables, and addressing class imbalance or outliers. Performing these steps on the second split ensures that the data used for hyperparameter tuning has the same structure and characteristics as the training data, which helps to prevent any potential discrepancies that could negatively impact the model's performance.

Preprocessing the second split of the dataset before using it for hyperparameter tuning, ensures that the GridSearchCV results are valid and can be trusted to improve the performance of the selected models. This step is crucial for obtaining accurate and reliable predictions from the machine learning models in the epitope binding prediction task.

The decision to perform hyperparameter tuning on the whole model is made to account for the differences in preprocessing that may have led to variations in the accuracy of each model at different times. By tuning the hyperparameters, you can optimize each model to achieve better performance, even with the various preprocessing steps applied.

In [42]:
def hyperparameter(split2_info, split2_features, split2_class):
    # Second split
    split2_info = split2_info
    split2_features = split2_features
    split2_class = split2_class.values

    classifiers = [
        ('Random Forest', RandomForestClassifier(random_state=42)),
        ('XGBoost', XGBClassifier(random_state=42)),
        ('KNN', KNeighborsClassifier()),
        ('SVM', SVC(random_state=42, probability=True)),
        ('LightGBM', LGBMClassifier(random_state=42)),
        ('Logistic Regression', LogisticRegression(random_state=42))
    ]

    groups = split2_info

    # Set n_splits equal to the number of unique groups
    n_splits = len(np.unique(groups))
    gkf = GroupKFold(n_splits=n_splits)

    
    # Replace -1 with 0 in the class labels
    split2_class[split2_class == -1] = 0

    # Define hyperparameter grids for each classifier
    param_grids = {
        'Random Forest': {'Random Forest__n_estimators': [50, 100, 150],
                          'Random Forest__max_depth': [10, 20, 30],
                          'Random Forest__min_samples_split': [2, 5, 10]},
        'XGBoost': {'XGBoost__n_estimators': [50, 100, 150],
                    'XGBoost__max_depth': [3, 6, 9],
                    'XGBoost__learning_rate': [0.01, 0.1, 1.0]},
        'KNN': {'KNN__n_neighbors': [3, 5, 7],
                'KNN__weights': ['uniform', 'distance']},
        'SVM': {'SVM__C': [0.1, 1, 10],  # Added hyperparameters for Support Vector Machine
                'SVM__kernel': ['linear', 'rbf']},
        'LightGBM': {'LightGBM__n_estimators': [50, 100, 150],  # Added hyperparameters for LightGBM
                     'LightGBM__max_depth': [3, 6, 9],
                     'LightGBM__learning_rate': [0.01, 0.1, 1.0]},
        'Logistic Regression': {'Logistic Regression__C': [0.1, 1, 10],  # Added hyperparameters for Logistic Regression
                                'Logistic Regression__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
    }


    for classifier_name, classifier in classifiers:
        
        for fold, (train_indices, test_indices) in enumerate(gkf.split(split2_features, split2_class, groups)):
            X_train, y_train = split2_features[train_indices], split2_class[train_indices]
            X_test, y_test = split2_features[test_indices], split2_class[test_indices]

            # Check if there is only one class present in y_test
            if len(np.unique(y_test)) == 1:
                continue

            # Resampling the dataset
            smote = SMOTE(random_state=42)
            X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

            # Create a pipeline with the classifier
            pipeline = Pipeline([
                ('scaler', StandardScaler()),
                (classifier_name, classifier)
            ])


            # Create a GridSearchCV object for hyperparameter tuning
            grid_search = GridSearchCV(pipeline, param_grid=param_grids[classifier_name], scoring='roc_auc', cv=n_splits)

            # Train and evaluate the classifier on this fold
            grid_search.fit(X_train_resampled, y_train_resampled)
            y_prob = grid_search.predict_proba(X_test)[:, 1]

        

        # Print the best hyperparameters for the classifier
        print(f"Best hyperparameters for {classifier_name}: {grid_search.best_params_}")

    
    
    
hyperparameter(split2_info, split2_features, split2_class)
Best hyperparameters for Random Forest: {'Random Forest__max_depth': 10, 'Random Forest__min_samples_split': 2, 'Random Forest__n_estimators': 100}
Best hyperparameters for XGBoost: {'XGBoost__learning_rate': 0.1, 'XGBoost__max_depth': 9, 'XGBoost__n_estimators': 150}
Best hyperparameters for KNN: {'KNN__n_neighbors': 7, 'KNN__weights': 'distance'}
Best hyperparameters for SVM: {'SVM__C': 10, 'SVM__kernel': 'rbf'}
Best hyperparameters for LightGBM: {'LightGBM__learning_rate': 0.1, 'LightGBM__max_depth': 9, 'LightGBM__n_estimators': 150}
Best hyperparameters for Logistic Regression: {'Logistic Regression__C': 0.1, 'Logistic Regression__solver': 'saga'}

Since Logistic Regression has the highest AUC among other models, it is a good choice for hyperparameter tuning to further improve its performance. Hyperparameter tuning for Logistic Regression involves adjusting various parameters that control the learning process, such as regularization strength, solver, and maximum iterations.

Below are the parameter tuned for Logistic Regression

Regularisation strength (C): This parameter determines the inverse of the regularization strength, where smaller values result in stronger regularization, preventing overfitting. You can explore a range of values such as [0.1, 1, 10, 100].

Solver (solver): This parameter specifies the algorithm used for optimization. Some options include 'newton-cg', 'lbfgs', 'liblinear', 'sag', and 'saga'. Each solver has its own strengths and weaknesses, and trying different solvers can help you find the one that works best for your data.

Final Validation¶

Final validation is an essential step in the machine learning process to ensure that the chosen model generalizes well to unseen data. The decision to perform the final validation on a combination of split2 and split3 was because split3 contains only instances of the -1 class, while split2 has instances of both -1 and other classes. Combining split2 and split3 for the final validation ensures that the validation set has a more balanced representation of the target classes.

In [43]:
def finalValidation(split2, split3):
    # Preprocessing split2 and split3
    split2_info, split2_features, split2_class = infoFeatureClass(split2)
    split3_info, split3_features, split3_class = infoFeatureClass(split3)

    missingValuesMean(split2_features)
    split2_features = cappingOutliers(split2_features)
    split2_features = normalise(split2_features)

    missingValuesMean(split3_features)
    split3_features = cappingOutliers(split3_features)
    split3_features = normalise(split3_features)

    # Make sure you perform the same feature reduction on split2 and split3
    split2_features = featureReduction(split2_features)
    split3_features = featureReduction(split3_features)

    # Combine split2 and split3 into a single dataset
    combined_features = np.concatenate((split2_features, split3_features), axis=0)
    combined_class = np.concatenate((split2_class, split3_class), axis=0)
    combined_info = np.concatenate((split2_info, split3_info), axis=0)

    # Define the GroupKFold splits
    group_kfold = GroupKFold(n_splits=5)
    group_kfold.get_n_splits(combined_features, combined_class, combined_info)

    # Initialize the AUC score list
    auc_scores = []

    # Loop over the splits
    for train_index, test_index in group_kfold.split(combined_features, combined_class, combined_info):
        # Split the data into training and test sets
        X_train, X_test = combined_features[train_index], combined_features[test_index]
        y_train, y_test = combined_class[train_index], combined_class[test_index]

        # Train the Logistic Regression model with the best hyperparameters
        best_params_logisticRegressor = {'C': 0.1, 'solver': 'newton-cg'}
        model_logisticRegressor = LogisticRegression(**best_params_logisticRegressor)
        model_logisticRegressor.fit(X_train, y_train)

        # Make predictions and compute the AUC score for the current split
        predictions = model_logisticRegressor.predict_proba(X_test)

        # Check if there are at least two unique classes in y_test
        unique_classes = np.unique(y_test)
        if len(unique_classes) >= 2:
            auc_score = roc_auc_score(y_test, predictions[:, 1])
            auc_scores.append(auc_score)

    # Calculate the mean AUC score
    mean_auc_score = np.mean(auc_scores)
    print("Mean AUC score for the Logistic Regression model:", mean_auc_score)
    
    # Return the trained model object
    return model_logisticRegressor
    
# Train the Logistic Regression model
model = finalValidation(split2, split3)
Mean AUC score for the Logistic Regression model: 0.5534771627480454

The mean AUC score for the Logistic Regression model on the final validation set is 0.55.

An AUC score of 0.55 indicates that the Logistic Regression model has moderate predictive power when distinguishing between different classes in the dataset. The score is measured on a scale from 0 to 1, with 1 representing a perfect classifier and 0.5 indicating a classifier that performs no better than random chance.

While the AUC score of 0.55 is not exceptionally high, it still suggests that the model has some ability to differentiate between the target classes. It is essential to consider the context and the specific problem being addressed when evaluating the performance of a model based on its AUC score. In some cases, even a modest improvement over random chance can be valuable, particularly when dealing with complex and challenging datasets.

It is worth noting that this score is based on the final validation set, which was designed to provide a more balanced and representative sample of the problem domain. As a result, this AUC score is a useful indicator of the model's ability to generalize well to new and unseen data, which is crucial for ensuring its effectiveness in real-world applications.

Final Pipeline to predict the holdout.csv¶

In [44]:
def predict_class(dataset_name, model):
    # Load the dataset
    dataset = pd.read_csv(dataset_name)

    # Extract the features from the dataset
    features = dataset.iloc[:, 13:]

    # Preprocess the features
    missingValuesMean(features)
    features = cappingOutliers(features)
    features = normalise(features)
    features = featureReduction(features)
# 
    # Use the model to predict the class of each instance
    predicted_classes = model.predict(features)

    # Add the predicted classes as a new column in the dataset
    dataset['predicted_class'] = predicted_classes

    # Save the updated dataset to a CSV file
    dataset.to_csv('predictions.csv', index=False)

    # Print the predicted classes
    print("Predicted classes for all instances:", predicted_classes)
# Use the trained model to make predictions
predict_class('df_holdout.csv', model)
Predicted classes for all instances: [1 1 1 ... 1 1 1]

Conclusion¶

This study aimed to develop a robust machine learning model to predict epitope binding in proteins. Several preprocessing steps were carried out, including handling missing values, data normalization, outlier treatment and feature reduction. The dataset was then rebalanced using SMOTE to address the underrepresentation of the minority class.

Various classification algorithms, such as Random Forest, XGBoost, KNN, SVM, LightGBM, and Logistic Regression, were employed to find the best model for the problem. The performance of these models was evaluated using the AUC metric, and Logistic Regression emerged as the top-performing model. Hyperparameter tuning was conducted using GridSearchCV to optimize the chosen model further.

To ensure the model's generalisability, a final validation step was performed on a combination of split2 and split3 datasets. The final Logistic Regression model achieved a mean AUC score of 0.5563, indicating its potential to predict epitope binding with reasonable accuracy.

This study highlights the importance of preprocessing, data balancing, model selection, and hyperparameter tuning in developing a reliable and accurate machine learning model. While the chosen Logistic Regression model demonstrates reasonable performance, future work could explore additional feature engineering, alternative algorithms, and ensemble methods to further improve the predictive performance.

References¶

Kang, H. (2013). The prevention and handling of the missing data. Korean J Anesthesiol, 5, 402–406. https://doi.org/10.4097/kjae.2013.64.5.402

He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.

Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4), 221-232.

Fernández, A., García, S., Herrera, F., & Chawla, N. V. (2018). SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research, 61, 863-905.

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.

References: Aggarwal, C. C. (2016). Outlier analysis. Springer. Rousseeuw, P. J., & Hubert, M. (2011). Robust statistics for outlier detection. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 73-79.

Yap, B. W., & Sim, C. H. (2011). Comparisons of various types of normality tests. Journal of Statistical Computation and Simulation, 81(12), 2141-2155.

Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605.

Reference: Chen, H., Chen, L., Wang, S., & Song, J. (2020). Outlier detection: A survey. Journal of Industrial Information Integration, 19, 100148.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.

Varoquaux, G., Buitinck, L., Louppe, G., Grisel, O., Pedregosa, F., & Mueller, A. (2017). Scikit-learn. GetMobile: Mobile Computing and Communications, 21(1), 29-33.

Pedregosa, F., Varoquaux, Ga"el, Gramfort, A., Michel, V., Thirion, B., Grisel,. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer New York.