Classification of Patient Readmissions with Different Algorithms

We will be classifying patient readmissions with a variety of algorithms and assessing their performance.

The algorithms that we are going to be testing are:

  • Logistic Regression
  • Kernelised Support Vector Machine (SVM)
  • Random Forest Classifier

The dataset is obtained from Tahoe Open Data (https://data-trpa.opendata.arcgis.com/) and is related to clinical data, with a training dataset consisting of 2629 records. Below is a description of each of the variables present:

  • age: age of the patient, expressed as positive integers, ranging from 65 to 105 years old.

  • female: a binary variable where 1= 'female', and '0' = male.

  • flu_season: a binary variable where 1= 'flu season', and 0= 'not flu season'

  • ed_admit: a binary variable where 1 = an emergency admission into the hospital and 0 = no emergency admission into the hospital.

  • severity_score: the severity of the patient, expressed as a positive integer ranging from 0 to 112.

  • comorbidity_score: comorbidity is when 1 or more additional condition co-ocurrs at the same time as a primary condition, in this case the variable is the comorbidity–polypharmacy score which reflects the sum of all comorbid conditions and their associated medications, expressed as positive integers in the range of 1 to 322.

  • readmit30: a binary variable where 1 = patient readmitted within 30 days, and 0 = patient not readmitted within 30 days. This will be our dependent variable

In [1]:
# libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

%matplotlib inline
In [5]:
#Storing the datasets into pandas dataframe 

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Data description

In [6]:
#Taking a glance at the dataset
train.head()
Out[6]:
age female flu_season ed_admit severity score comorbidity score readmit30
0 71 1 0 1 35 33 0
1 76 0 0 1 12 99 0
2 66 0 1 1 2 56 0
3 86 1 1 1 69 126 1
4 67 0 0 1 1 1 0
In [7]:
#Describing the dataset
train.describe()
Out[7]:
age female flu_season ed_admit severity score comorbidity score readmit30
count 2629.000000 2629.000000 2629.000000 2629.000000 2629.000000 2629.000000 2629.000000
mean 76.942944 0.470521 0.410042 0.812476 22.512742 95.064283 0.238113
std 7.905151 0.499225 0.491935 0.390406 18.278965 57.151952 0.426010
min 65.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000
25% 70.000000 0.000000 0.000000 1.000000 8.000000 52.000000 0.000000
50% 76.000000 0.000000 0.000000 1.000000 19.000000 86.000000 0.000000
75% 83.000000 1.000000 1.000000 1.000000 32.000000 132.000000 0.000000
max 105.000000 1.000000 1.000000 1.000000 112.000000 322.000000 1.000000
In [8]:
#Information of the dataset
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2629 entries, 0 to 2628
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   age                2629 non-null   int64
 1   female             2629 non-null   int64
 2   flu_season         2629 non-null   int64
 3   ed_admit           2629 non-null   int64
 4   severity score     2629 non-null   int64
 5   comorbidity score  2629 non-null   int64
 6   readmit30          2629 non-null   int64
dtypes: int64(7)
memory usage: 143.9 KB
In [9]:
#Plotting a histogram of our variables
train.hist(figsize=(16, 20), xlabelsize=8, ylabelsize=8)
Out[9]:
array([[<AxesSubplot:title={'center':'age'}>,
        <AxesSubplot:title={'center':'female'}>,
        <AxesSubplot:title={'center':'flu_season'}>],
       [<AxesSubplot:title={'center':'ed_admit'}>,
        <AxesSubplot:title={'center':'severity score'}>,
        <AxesSubplot:title={'center':'comorbidity score'}>],
       [<AxesSubplot:title={'center':'readmit30'}>, <AxesSubplot:>,
        <AxesSubplot:>]], dtype=object)
In [10]:
train['readmit30'].value_counts()
Out[10]:
0    2003
1     626
Name: readmit30, dtype: int64

From the above histograms we can see that our training dataset has 626 readmitted patients, which is around 24% of our total datapoints. The dataset is quite balanced between females and males, and the age is positively skewed with less patients in the late 80's-100 range. The severity score follows a similar patttern as it is positively skewed, with most patients experiencing a lower severity score. The comborbidity score on the otherhand follows more of a bell shaped curve, showing how the majority of patients have 1 or more additional conditions to their primary condition. And finally, a little under half of the dataset was carried out during flu season, and a little more than half was carried out outside of flu season.

In [11]:
# split dataset into dependent and independent variables

X_train = train.drop(columns=['readmit30'])
y_train = train['readmit30']

X_test = test.drop(columns=['readmit30'])
y_test = test['readmit30']

Logistic Regression

We will begin by implementing a logistic regression approach, which is best known as a classification method to the traditional linear regression. The output of this model is the probability of output variable Y, irrelevant of if this is a binary classification problem or a multi-class problem.

We can regularize the logistic regression model in order to improve the generalisability of the model. This can be done via penalization (L1/L2) of the log-likelihood function. In addition, hyperparameter tuning stil needs to take place in order to select the optimal parameter for the training algorithm.

In [12]:
#First we apply z-score normalization 

scaler = StandardScaler()
scaler.fit(X_train)
scaler.fit(X_test)
x_train = scaler.transform(X_train)
x_test = scaler.transform(X_test)

print('The mean of the training dataset is:', round(np.mean(x_train)))
print('The SD of the training dataset is:', round(np.std(x_train)))
print('The mean of the test dataset is:', round(np.mean(x_test)))
print('The SD of the test dataset is:', round(np.std(x_test)))
The mean of the training dataset is: 0
The SD of the training dataset is: 1
The mean of the test dataset is: 0
The SD of the test dataset is: 1
In [13]:
# we check the values we will be using for the regularisation parameter
np.logspace(-3,1, 21)
Out[13]:
array([1.00000000e-03, 1.58489319e-03, 2.51188643e-03, 3.98107171e-03,
       6.30957344e-03, 1.00000000e-02, 1.58489319e-02, 2.51188643e-02,
       3.98107171e-02, 6.30957344e-02, 1.00000000e-01, 1.58489319e-01,
       2.51188643e-01, 3.98107171e-01, 6.30957344e-01, 1.00000000e+00,
       1.58489319e+00, 2.51188643e+00, 3.98107171e+00, 6.30957344e+00,
       1.00000000e+01])
In [14]:
# create a function that iterates over the different regularisation parameters and save accuracy scores of each model

train_accuracy_lr = []
test_accuracy_lr = []


for i in np.logspace(-3,1, 21):
    clf = logreg = LogisticRegression(C = i).fit(x_train, y_train)
    train_score = clf.score(x_train, y_train)
    train_accuracy_lr.append(train_score)
    test_score = clf.score(x_test, y_test)
    test_accuracy_lr.append(test_score)
In [15]:
# plot accuracy in train and test set against the different regularisation parameters

x =  np.logspace(-3,1, 21)

plt.plot(x, train_accuracy_lr, label = "Train Accuracy", linestyle='--')
plt.plot(x, test_accuracy_lr, label = "Test Accuracy")
plt.legend()
plt.rcParams["figure.figsize"] = (8,5)
plt.xlabel("Reularisation parameters")
plt.ylabel("Model accuracy")
plt.show()
In [16]:
# we can also explore the values

df_lr = pd.DataFrame(list(zip(x, train_accuracy_lr,test_accuracy_lr)),
               columns =['Regularisation parameters', 'Train set accuracy', 'Test set accuracy'])
df_lr
Out[16]:
Regularisation parameters Train set accuracy Test set accuracy
0 0.001000 0.762267 0.775371
1 0.001585 0.769114 0.783352
2 0.002512 0.775960 0.786773
3 0.003981 0.785089 0.792474
4 0.006310 0.790034 0.807298
5 0.010000 0.791556 0.802737
6 0.015849 0.791936 0.803877
7 0.025119 0.792697 0.803877
8 0.039811 0.793458 0.802737
9 0.063096 0.792697 0.803877
10 0.100000 0.793838 0.802737
11 0.158489 0.793838 0.802737
12 0.251189 0.794218 0.803877
13 0.398107 0.794599 0.803877
14 0.630957 0.794218 0.803877
15 1.000000 0.794218 0.803877
16 1.584893 0.794218 0.803877
17 2.511886 0.794599 0.803877
18 3.981072 0.794979 0.803877
19 6.309573 0.794979 0.803877
20 10.000000 0.794979 0.805017

In this particular case, we see that the test set provides a higher accuracy than the train set. This could mean that we are successful in preventing our model from overfitting on our data.

Generally, we can see a small peak in the accuracy values when the regularisation parameters are low values, and after that it maintains into similar values with small variations.

In particular, the highest accuracy of 0.807298 at the test set is reached with a regularisation parameter of 0.006310

Kernelised Support Vector Machine (SVM)

The kernel function measures the similarity/distance between samples. The kernel function is a non-linear version of SVM which proves to be quite useful when the relationship between input and output variables is non-linear. SVM stands for support-vector machine which works as feature transformation.

The kernelised SVM is then trained based on a new feature vector which is constructed based on the distances to all the training samples. As a result the decision boundary is linear for kernelised SVM but non-linear a per the original feature vector.

In [17]:
# we check the values we will be using for the regularisation parameter
np.logspace(-2,1, 21)
Out[17]:
array([ 0.01      ,  0.01412538,  0.01995262,  0.02818383,  0.03981072,
        0.05623413,  0.07943282,  0.11220185,  0.15848932,  0.22387211,
        0.31622777,  0.44668359,  0.63095734,  0.89125094,  1.25892541,
        1.77827941,  2.51188643,  3.54813389,  5.01187234,  7.07945784,
       10.        ])
In [18]:
# create a function that iterates over the different regularisation parameters and save accuracy scores of each model

train_accuracy_svm = []
test_accuracy_svm = []


for i in np.logspace(-2,1, 21):
    clf = SVC(C = i).fit(x_train, y_train)
    train_score = clf.score(x_train, y_train)
    train_accuracy_svm.append(train_score)
    test_score = clf.score(x_test, y_test)
    test_accuracy_svm.append(test_score)
In [19]:
# plot accuracy in train and test set against the different regularisation parameters

x2 =  np.logspace(-2,1, 21)

plt.plot(x2, train_accuracy_svm, label = "Train Accuracy", linestyle='--')
plt.plot(x2, test_accuracy_svm, label = "Test Accuracy")
plt.legend()
plt.rcParams["figure.figsize"] = (8,5)
plt.xlabel("Reularisation parameters")
plt.ylabel("Model accuracy")
plt.show()
In [20]:
# we can also explore the values

df_svm = pd.DataFrame(list(zip(x2, train_accuracy_svm,test_accuracy_svm)),
               columns =['Regularisation parameters', 'Train set accuracy', 'Test set accuracy'])
df_svm
Out[20]:
Regularisation parameters Train set accuracy Test set accuracy
0 0.010000 0.761887 0.775371
1 0.014125 0.761887 0.775371
2 0.019953 0.761887 0.775371
3 0.028184 0.761887 0.775371
4 0.039811 0.761887 0.775371
5 0.056234 0.764549 0.776511
6 0.079433 0.772537 0.781072
7 0.112202 0.780525 0.792474
8 0.158489 0.788513 0.795895
9 0.223872 0.793838 0.799316
10 0.316228 0.798402 0.795895
11 0.446684 0.800685 0.799316
12 0.630957 0.802587 0.795895
13 0.891251 0.801445 0.798176
14 1.258925 0.804488 0.798176
15 1.778279 0.805630 0.797035
16 2.511886 0.806010 0.797035
17 3.548134 0.808292 0.794755
18 5.011872 0.809433 0.793615
19 7.079458 0.808672 0.793615
20 10.000000 0.807531 0.793615

In the case of the kernelised SVM, after we keep increasing the regularisation parameters, the test accuracy steadily stays below the train accuracy.

We see that the highest accuracy of 0.799316 at the test set is first reached with a regularisation parameter of 0.223872

Random Forest Classifier

When the Random Forest Classifier trains decision trees it only includes a subset of the features, that is, the features with the greatest information gain. Also, the performance of random forest models is quite stable with respect to the choice of hyper-parameters.

In [21]:
# we check the values we will be using for the number of decision trees
np.linspace(1, 200, 21, dtype = int)
Out[21]:
array([  1,  10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120,
       130, 140, 150, 160, 170, 180, 190, 200])
In [22]:
# create a function that iterates over the different number of decision trees and save accuracy scores of each model

train_accuracy_rfc = []
test_accuracy_rfc = []


for i in np.linspace(1, 200, 21, dtype = int):
    clf=RandomForestClassifier(n_estimators=i).fit(x_train, y_train)
    train_score = clf.score(x_train, y_train)
    train_accuracy_rfc.append(train_score)
    test_score = clf.score(x_test, y_test)
    test_accuracy_rfc.append(test_score)
In [23]:
# plot accuracy in train and test set against the different number of decision trees

x3 =  np.linspace(1, 200, 21, dtype = int)

plt.plot(x3, train_accuracy_rfc, label = "Train Accuracy", linestyle='--')
plt.plot(x3, test_accuracy_rfc, label = "Test Accuracy")
plt.legend()
plt.rcParams["figure.figsize"] = (8,5)
plt.xlabel("Reularisation parameters")
plt.ylabel("Model accuracy")
plt.show()
In [24]:
# we can also explore the values

df_rfc = pd.DataFrame(list(zip(x3, train_accuracy_rfc,test_accuracy_rfc)),
               columns =['Number of decision trees', 'Train set accuracy', 'Test set accuracy'])
df_rfc
Out[24]:
Number of decision trees Train set accuracy Test set accuracy
0 1 0.889312 0.701254
1 10 0.980981 0.776511
2 20 0.988969 0.777651
3 30 0.996577 0.776511
4 40 0.997337 0.776511
5 50 0.997718 0.779932
6 60 0.998479 0.782212
7 70 0.999239 0.777651
8 80 0.998859 0.774230
9 90 0.998859 0.776511
10 100 0.999239 0.781072
11 110 0.999239 0.781072
12 120 0.999239 0.782212
13 130 0.999239 0.782212
14 140 0.999239 0.782212
15 150 0.999239 0.781072
16 160 0.999239 0.781072
17 170 0.999239 0.781072
18 180 0.999239 0.778791
19 190 0.999239 0.777651
20 200 0.999239 0.781072

In the case of the Random Forest Classifier, it is expected a high accuracy in the train set, but this could lead to overfitting, and our model could not as accurate as we want in new data. We can clearly see this case in our data, as there is a big gap with the accuracy of our model in the test set.

We see that the highest accuracy of 0.782212 at the test set is first reached with a number of decission trees of 60

Variable/feature importance with random forests

In [25]:
# We choose the best hyperparameter detected in the last part which is 60 for the number of decision trees  

clf=RandomForestClassifier(n_estimators=60).fit(x_train, y_train)

# The following code prints the list of input variables in decreasing order of feature importance.
col_names = ['age', 'female', 'flu_season', 'ed_admit', 'severity_score', 'comorbidity score']

importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
# Print the feature ranking
print("Feature ranking:")
for f in range(len(indices)):
    print("%d. %s (%f)" % (f + 1, col_names[indices[f]], importances[indices[f]]))
Feature ranking:
1. comorbidity score (0.425567)
2. severity_score (0.277426)
3. age (0.219150)
4. flu_season (0.033599)
5. female (0.024685)
6. ed_admit (0.019573)

We osberve from the features ranking in the model that the main feature with the highest weight is 'comorbidity score', followed by 'severity score'. After that, random forest position 'age' in the third place with a significant weight.

Partial plot

Using the random forest model trained with the best hyper-parameter, we plot the output variable against each input variable, including: ed_admit, severity score and comorbidity score. This was done using the code below

In [26]:
# we recover the column names
x_train2 =  pd.DataFrame(x_train, columns=['age', 'female','flu_season','ed_admit','severity score','comorbidity score'])
In [30]:
from sklearn.inspection import plot_partial_dependence
plot_partial_dependence(clf, x_train2, [0,1,2,3,4,5]) 
Out[30]:
<sklearn.inspection._plot.partial_dependence.PartialDependenceDisplay at 0x27bc031c640>

From the plots above, we can see that the severity score and the comorbidity score have a greater influence on the output variable as the partial dependence increases significantly when the values for these scores rise. The variable ed_admit, on the contrary, has a lower impact on the output in comparisson with the other two scores.

Therefore, we plot the output variable againsts the severity score and the comorbidity score in a two-dimensional graph as shown below

In [34]:
plot_partial_dependence(clf, x_train2, [[4,5]]) 
Out[34]:
<sklearn.inspection._plot.partial_dependence.PartialDependenceDisplay at 0x27bc024de50>

From the plot, we can see that the output variable tend to be higher when the severity score is between 2 and 5, and the comorbidity score is between 1 and 2.0. As the value of these parameters decreases, the output variable also reduces, reaching its lowest values when both the severity score and the comorbidity score are negative.

In [ ]: