We will be classifying patient readmissions with a variety of algorithms and assessing their performance.
The algorithms that we are going to be testing are:
The dataset is obtained from Tahoe Open Data (https://data-trpa.opendata.arcgis.com/) and is related to clinical data, with a training dataset consisting of 2629 records. Below is a description of each of the variables present:
age: age of the patient, expressed as positive integers, ranging from 65 to 105 years old.
female: a binary variable where 1= 'female', and '0' = male.
flu_season: a binary variable where 1= 'flu season', and 0= 'not flu season'
ed_admit: a binary variable where 1 = an emergency admission into the hospital and 0 = no emergency admission into the hospital.
severity_score: the severity of the patient, expressed as a positive integer ranging from 0 to 112.
comorbidity_score: comorbidity is when 1 or more additional condition co-ocurrs at the same time as a primary condition, in this case the variable is the comorbidity–polypharmacy score which reflects the sum of all comorbid conditions and their associated medications, expressed as positive integers in the range of 1 to 322.
readmit30: a binary variable where 1 = patient readmitted within 30 days, and 0 = patient not readmitted within 30 days. This will be our dependent variable
# libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
%matplotlib inline
#Storing the datasets into pandas dataframe
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
#Taking a glance at the dataset
train.head()
#Describing the dataset
train.describe()
#Information of the dataset
train.info()
#Plotting a histogram of our variables
train.hist(figsize=(16, 20), xlabelsize=8, ylabelsize=8)
train['readmit30'].value_counts()
From the above histograms we can see that our training dataset has 626 readmitted patients, which is around 24% of our total datapoints. The dataset is quite balanced between females and males, and the age is positively skewed with less patients in the late 80's-100 range. The severity score follows a similar patttern as it is positively skewed, with most patients experiencing a lower severity score. The comborbidity score on the otherhand follows more of a bell shaped curve, showing how the majority of patients have 1 or more additional conditions to their primary condition. And finally, a little under half of the dataset was carried out during flu season, and a little more than half was carried out outside of flu season.
# split dataset into dependent and independent variables
X_train = train.drop(columns=['readmit30'])
y_train = train['readmit30']
X_test = test.drop(columns=['readmit30'])
y_test = test['readmit30']
We will begin by implementing a logistic regression approach, which is best known as a classification method to the traditional linear regression. The output of this model is the probability of output variable Y, irrelevant of if this is a binary classification problem or a multi-class problem.
We can regularize the logistic regression model in order to improve the generalisability of the model. This can be done via penalization (L1/L2) of the log-likelihood function. In addition, hyperparameter tuning stil needs to take place in order to select the optimal parameter for the training algorithm.
#First we apply z-score normalization
scaler = StandardScaler()
scaler.fit(X_train)
scaler.fit(X_test)
x_train = scaler.transform(X_train)
x_test = scaler.transform(X_test)
print('The mean of the training dataset is:', round(np.mean(x_train)))
print('The SD of the training dataset is:', round(np.std(x_train)))
print('The mean of the test dataset is:', round(np.mean(x_test)))
print('The SD of the test dataset is:', round(np.std(x_test)))
# we check the values we will be using for the regularisation parameter
np.logspace(-3,1, 21)
# create a function that iterates over the different regularisation parameters and save accuracy scores of each model
train_accuracy_lr = []
test_accuracy_lr = []
for i in np.logspace(-3,1, 21):
clf = logreg = LogisticRegression(C = i).fit(x_train, y_train)
train_score = clf.score(x_train, y_train)
train_accuracy_lr.append(train_score)
test_score = clf.score(x_test, y_test)
test_accuracy_lr.append(test_score)
# plot accuracy in train and test set against the different regularisation parameters
x = np.logspace(-3,1, 21)
plt.plot(x, train_accuracy_lr, label = "Train Accuracy", linestyle='--')
plt.plot(x, test_accuracy_lr, label = "Test Accuracy")
plt.legend()
plt.rcParams["figure.figsize"] = (8,5)
plt.xlabel("Reularisation parameters")
plt.ylabel("Model accuracy")
plt.show()
# we can also explore the values
df_lr = pd.DataFrame(list(zip(x, train_accuracy_lr,test_accuracy_lr)),
columns =['Regularisation parameters', 'Train set accuracy', 'Test set accuracy'])
df_lr
In this particular case, we see that the test set provides a higher accuracy than the train set. This could mean that we are successful in preventing our model from overfitting on our data.
Generally, we can see a small peak in the accuracy values when the regularisation parameters are low values, and after that it maintains into similar values with small variations.
In particular, the highest accuracy of 0.807298 at the test set is reached with a regularisation parameter of 0.006310
The kernel function measures the similarity/distance between samples. The kernel function is a non-linear version of SVM which proves to be quite useful when the relationship between input and output variables is non-linear. SVM stands for support-vector machine which works as feature transformation.
The kernelised SVM is then trained based on a new feature vector which is constructed based on the distances to all the training samples. As a result the decision boundary is linear for kernelised SVM but non-linear a per the original feature vector.
# we check the values we will be using for the regularisation parameter
np.logspace(-2,1, 21)
# create a function that iterates over the different regularisation parameters and save accuracy scores of each model
train_accuracy_svm = []
test_accuracy_svm = []
for i in np.logspace(-2,1, 21):
clf = SVC(C = i).fit(x_train, y_train)
train_score = clf.score(x_train, y_train)
train_accuracy_svm.append(train_score)
test_score = clf.score(x_test, y_test)
test_accuracy_svm.append(test_score)
# plot accuracy in train and test set against the different regularisation parameters
x2 = np.logspace(-2,1, 21)
plt.plot(x2, train_accuracy_svm, label = "Train Accuracy", linestyle='--')
plt.plot(x2, test_accuracy_svm, label = "Test Accuracy")
plt.legend()
plt.rcParams["figure.figsize"] = (8,5)
plt.xlabel("Reularisation parameters")
plt.ylabel("Model accuracy")
plt.show()
# we can also explore the values
df_svm = pd.DataFrame(list(zip(x2, train_accuracy_svm,test_accuracy_svm)),
columns =['Regularisation parameters', 'Train set accuracy', 'Test set accuracy'])
df_svm
In the case of the kernelised SVM, after we keep increasing the regularisation parameters, the test accuracy steadily stays below the train accuracy.
We see that the highest accuracy of 0.799316 at the test set is first reached with a regularisation parameter of 0.223872
When the Random Forest Classifier trains decision trees it only includes a subset of the features, that is, the features with the greatest information gain. Also, the performance of random forest models is quite stable with respect to the choice of hyper-parameters.
# we check the values we will be using for the number of decision trees
np.linspace(1, 200, 21, dtype = int)
# create a function that iterates over the different number of decision trees and save accuracy scores of each model
train_accuracy_rfc = []
test_accuracy_rfc = []
for i in np.linspace(1, 200, 21, dtype = int):
clf=RandomForestClassifier(n_estimators=i).fit(x_train, y_train)
train_score = clf.score(x_train, y_train)
train_accuracy_rfc.append(train_score)
test_score = clf.score(x_test, y_test)
test_accuracy_rfc.append(test_score)
# plot accuracy in train and test set against the different number of decision trees
x3 = np.linspace(1, 200, 21, dtype = int)
plt.plot(x3, train_accuracy_rfc, label = "Train Accuracy", linestyle='--')
plt.plot(x3, test_accuracy_rfc, label = "Test Accuracy")
plt.legend()
plt.rcParams["figure.figsize"] = (8,5)
plt.xlabel("Reularisation parameters")
plt.ylabel("Model accuracy")
plt.show()
# we can also explore the values
df_rfc = pd.DataFrame(list(zip(x3, train_accuracy_rfc,test_accuracy_rfc)),
columns =['Number of decision trees', 'Train set accuracy', 'Test set accuracy'])
df_rfc
In the case of the Random Forest Classifier, it is expected a high accuracy in the train set, but this could lead to overfitting, and our model could not as accurate as we want in new data. We can clearly see this case in our data, as there is a big gap with the accuracy of our model in the test set.
We see that the highest accuracy of 0.782212 at the test set is first reached with a number of decission trees of 60
# We choose the best hyperparameter detected in the last part which is 60 for the number of decision trees
clf=RandomForestClassifier(n_estimators=60).fit(x_train, y_train)
# The following code prints the list of input variables in decreasing order of feature importance.
col_names = ['age', 'female', 'flu_season', 'ed_admit', 'severity_score', 'comorbidity score']
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
# Print the feature ranking
print("Feature ranking:")
for f in range(len(indices)):
print("%d. %s (%f)" % (f + 1, col_names[indices[f]], importances[indices[f]]))
We osberve from the features ranking in the model that the main feature with the highest weight is 'comorbidity score', followed by 'severity score'. After that, random forest position 'age' in the third place with a significant weight.
Using the random forest model trained with the best hyper-parameter, we plot the output variable against each input variable, including: ed_admit, severity score and comorbidity score. This was done using the code below
# we recover the column names
x_train2 = pd.DataFrame(x_train, columns=['age', 'female','flu_season','ed_admit','severity score','comorbidity score'])
from sklearn.inspection import plot_partial_dependence
plot_partial_dependence(clf, x_train2, [0,1,2,3,4,5])
From the plots above, we can see that the severity score and the comorbidity score have a greater influence on the output variable as the partial dependence increases significantly when the values for these scores rise. The variable ed_admit, on the contrary, has a lower impact on the output in comparisson with the other two scores.
Therefore, we plot the output variable againsts the severity score and the comorbidity score in a two-dimensional graph as shown below
plot_partial_dependence(clf, x_train2, [[4,5]])
From the plot, we can see that the output variable tend to be higher when the severity score is between 2 and 5, and the comorbidity score is between 1 and 2.0. As the value of these parameters decreases, the output variable also reduces, reaching its lowest values when both the severity score and the comorbidity score are negative.