We use the white wine data set from https://archive.ics.uci.edu/ml/datasets/wine+quality
We will create a binary dependent variable based on the quality of the wine, and we will try to predict it using the wine's physicochemical characteristics as explanatory variables.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv("winequality-white.csv", sep=';')
df.head()
quality = df["quality"].values
good_wine = []
for num in quality:
    if num > 6:
        good_wine.append(1)
    else:
        good_wine.append(0)
good_wine = pd.DataFrame(data=good_wine, columns=["good wine"])
df['good wine'] = good_wine
df
df['good wine'].value_counts()
print('The proportion of good wines in the data set is:', 1060/4898)
As we have a ratio of around 4:1 in our dependent variable, this imbalance could cause problems for our classification algorithm: a classifier that labels every wine as "bad quality" would still achieve an accuracy of approximately 80%.
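To make this ~80% baseline concrete, here is a minimal sketch (an optional addition, not in the original notebook) using scikit-learn's DummyClassifier, which always predicts the majority class:
# optional sketch: a majority-class baseline with scikit-learn's DummyClassifier
from sklearn.dummy import DummyClassifier
features = df.drop(['good wine', 'quality'], axis=1)
labels = df['good wine']
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(features, labels)
baseline.score(features, labels)  # ≈ 0.78, the majority-class share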
We will balance our data with an oversampling method called SMOTE (Synthetic Minority Over-sampling Technique). SMOTE generates synthetic observations for the minority class by interpolating between existing minority-class samples and their nearest neighbours.
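For intuition, the following is a simplified NumPy illustration of SMOTE's core interpolation step (toy values, not the imblearn implementation): a synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority-class neighbours.
# simplified illustration of SMOTE's interpolation (toy values, not imblearn's code)
rng = np.random.default_rng(7)
x_i = np.array([7.0, 0.27, 0.36])       # a minority-class sample
x_nn = np.array([6.3, 0.30, 0.34])      # one of its nearest minority neighbours
lam = rng.random()                      # random interpolation factor in [0, 1)
x_synthetic = x_i + lam * (x_nn - x_i)  # new synthetic observation
print(x_synthetic)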
# divide dataset into dependent and independent variables
df_x = df.drop(['good wine', 'quality'], axis=1)
df_y = df['good wine']
#import imblearn
from imblearn.over_sampling import SMOTE
# Resample the minority class. You can change the strategy to 'auto' if you are not sure.
sm = SMOTE(sampling_strategy='minority', random_state=7)
# Fit the model to generate the data.
oversampled_x, oversampled_y = sm.fit_resample(df_x, df_y)
oversampled_df = pd.concat([pd.DataFrame(oversampled_x), pd.DataFrame(oversampled_y)], axis=1)
oversampled_df
# we can check that the data is now balanced at a 1:1 ratio
oversampled_df['good wine'].value_counts()
# create a function that standardises each column using its mean and standard deviation (z-score)
def z_score(df):
    # copy the dataframe so the original is left untouched
    df_std = df.copy()
    # apply the z-score transformation column by column
    for column in df_std.columns:
        df_std[column] = (df_std[column] - df_std[column].mean()) / df_std[column].std()
    return df_std
# call the z_score function on the first 11 columns, i.e. the features only
# we exclude the "quality" column, as it is part of the output, along with "good wine"
df_standardised = z_score(oversampled_df[oversampled_df.columns[0:11]])
# add "good wine" column as before
df_standardised['good wine'] = oversampled_df[['good wine']]
df_standardised_x = df_standardised.drop(['good wine'], axis=1)
df_standardised_y = df_standardised['good wine']
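As a quick sanity check (an optional addition), each standardised feature should now have a mean of approximately 0 and a standard deviation of approximately 1; scikit-learn's StandardScaler would give an equivalent result (it uses the population standard deviation, ddof=0, so values differ negligibly from pandas' default ddof=1):
# optional sanity check: standardised features should have mean ≈ 0 and std ≈ 1
print(df_standardised_x.mean().round(6))
print(df_standardised_x.std().round(6))
# equivalent transformation with scikit-learn (population std, ddof=0)
from sklearn.preprocessing import StandardScaler
scaled = StandardScaler().fit_transform(oversampled_df[oversampled_df.columns[0:11]])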
# split data into train, validation and test sets (60 - 20 - 20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_standardised_x, df_standardised_y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.8 = 0.2
print(X_train.shape)
print(X_test.shape)
print(X_val.shape)
# try with k = 1
from sklearn.neighbors import KNeighborsClassifier
# train the algorithm with training set and k = 1
model = KNeighborsClassifier(n_neighbors = 1)
model.fit(X_train, y_train)
# evaluate on the validation set
prediction = model.predict(X_val)
print(prediction)
# evaluate score of prediction against actual output
model.score(X_val, y_val)
# we automate for multiple k values
k_values = [1,2,3,4,5,6,7,8,9,10,20,30,40,50,60,70,80,90,100,200,300,400,500]
models = []
scores = []
for i in k_values:
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train, y_train)
    prediction = model.predict(X_val)
    score = model.score(X_val, y_val)
    models.append('k' + str(i))
    scores.append(score)
print(models)
print(scores)
knn_results = pd.DataFrame(list(zip(models, scores)), columns =['models', 'scores'])
knn_results.head()
error_rate = []
for i in k_values:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_val)
    error_rate.append(np.mean(pred_i != y_val))
plt.figure(figsize=(10,6))
plt.plot(k_values,error_rate,color='blue', linestyle='dashed', marker='o',markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
print("Minimum error:-",min(error_rate),"at K =",k_values[error_rate.index(min(error_rate))])
In our case, the minimum error (0.115309) on the validation set is achieved with k = 1.
We should take into account that with k = 1 we estimate the class from a single sample, i.e., the closest neighbour. This makes the model very sensitive to all sorts of distortions, such as noise, outliers, and mislabelled data. Using a higher value of k gives a model that is more robust against these kinds of distortions.
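A more robust way to pick k (an optional sketch, not part of the original workflow) is k-fold cross-validation on the training data, which averages the accuracy over several splits instead of relying on a single validation set:
# optional sketch: choose k via 5-fold cross-validation on the training data
from sklearn.model_selection import cross_val_score
cv_scores = []
for i in k_values:
    knn = KNeighborsClassifier(n_neighbors=i)
    cv_scores.append(cross_val_score(knn, X_train, y_train, cv=5).mean())
print("Best k by cross-validation:", k_values[int(np.argmax(cv_scores))])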
# saving the k = 1 model
model_k1 = KNeighborsClassifier(n_neighbors = 1)
model_k1.fit(X_train,y_train)
# evaluate on the test set
prediction_k1 = model_k1.predict(X_test)
# evaluate score of prediction against actual output
score_k1 = model_k1.score(X_test, y_test)
# calculate generalisation error
print('The generalisation error on our test set with a k=1 model is: ', 1-score_k1)
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is the current equivalent
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(model_k1, X_test, y_test)
plt.show()
We see that our algorithm predicts around 90% of the cases correctly. From the confusion matrix we can see that most of the errors are false positives, i.e., wines classified as good quality that are not, with 117 such cases.
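Since false positives dominate the errors, it may be worth looking beyond plain accuracy (an optional addition to the original analysis); scikit-learn's classification_report summarises precision and recall per class:
# optional addition: per-class precision and recall expose the false-positive cost
from sklearn.metrics import classification_report
print(classification_report(y_test, prediction_k1))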