We are going to scrape data on Argentinian soccer players from the EA Sports FIFA games website (http://sofifa.com) and then cluster them into groups according to their scraped attributes (e.g. crossing, finishing, dribbling).
For the clustering we will try different methods: K-means, agglomerative clustering, and hierarchical clustering with a dendrogram.
In essence, all of these are unsupervised clustering methods that divide the population into several clusters such that data points in the same cluster are similar and data points in different clusters are dissimilar.
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import re
import unicodedata
# list of the attributes to scrape
attributes=['Attacking','Crossing','Finishing','Heading Accuracy',
'Short Passing','Volleys','Skill','Dribbling','Curve',
'FK Accuracy','Long Passing','Ball Control','Movement','Acceleration',
'Sprint Speed','Agility','Reactions','Balance','Power',
'Shot Power','Jumping','Stamina','Strength',
'Long Shots','Mentality','Aggression','Interceptions','Positioning',
'Vision','Penalties','Composure','Defending','Defensive Awareness',
'Standing Tackle','Sliding Tackle','Goalkeeping','GK Diving',
'GK Handling','GK Kicking','GK Positioning','GK Reflexes']
# page and iterations to get all the Argentinian players
links=[]
for offset in ['0','60','120','180','240','300','360','420','480','540','600']:
    page=requests.get('http://sofifa.com/players?na=52&offset='+offset)
    soup=BeautifulSoup(page.content,'html.parser')
    for link in soup.find_all('a'):
        links.append(link.get('href'))
# Remove None values from links and keep only player profile links
links=[l for l in links if l is not None]
links=['http://sofifa.com'+l for l in links if 'player/' in l]
# Now, links should contain all the links to individual Argentinian players
# Checking the first 10 players
links[0:10]
# Checking the number of player profiles
len(links)
# regular expression pattern
pattern=r"""\s*([\w\s]*?)\s*FIFA""" # the page text starts with whitespace ... player's name ... FIFA ... other stuff
for attr in attributes:
    pattern+=r""".*?(\d*\s*"""+attr+r""")""" # for each attribute: other stuff ... number ... attribute name
pat=re.compile(pattern, re.DOTALL) # DOTALL so '.' also matches newlines in the page text
# Checking the pattern
pat
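As a sanity check, here is a toy illustration of how this pattern captures the player's name and each numeric attribute (the sample text and the two-attribute list are made up for demonstration; the real pattern uses the full attribute list above):

```python
import re

# Toy version of the pattern with only two attributes and a hypothetical page text
attrs = ['Crossing', 'Finishing']
toy_pattern = r"\s*([\w\s]*?)\s*FIFA"
for attr in attrs:
    toy_pattern += r".*?(\d*\s*" + attr + r")"
toy_pat = re.compile(toy_pattern, re.DOTALL)

sample = "  Lionel Messi FIFA other stuff 85 Crossing more stuff 90 Finishing"
m = toy_pat.match(sample)
print(m.group(1))                  # -> Lionel Messi
print(int(m.group(2).split()[0]))  # -> 85
print(int(m.group(3).split()[0]))  # -> 90
```

Group 1 is the lazily-matched name before "FIFA"; each later group captures "number attribute-name", from which the leading number is split off.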
# Testing with one player
row=[links[0]]
playerpage=requests.get(links[0])
playersoup=BeautifulSoup(playerpage.content,'html.parser')
text=playersoup.get_text()
text=unicodedata.normalize('NFKD', text).encode('ascii','ignore')
# print (text)
text = text.decode('ascii')
text
a=pat.match(text)
print (a.group(1))
row.append(a.group(1))
for i in range(2,len(attributes)+2):
    if a.group(i).split()[0].isdigit():
        row.append(int(a.group(i).split()[0]))
print(row)
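The NFKD normalize/encode step above is what strips accents, so names with non-ASCII characters can still be captured by the `\w`-based pattern; a quick illustration (the name is a hypothetical example):

```python
import unicodedata

name = 'Agüero'  # hypothetical accented name
# NFKD decomposes 'ü' into 'u' + a combining mark; encoding to ASCII with
# errors='ignore' then drops the combining mark
ascii_name = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
print(ascii_name)  # -> Aguero
```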
### we run the code to get all the Argentinian players' data; this takes a couple of minutes
rows=[]
for j,link in enumerate(links):
    row=[link]
    playerpage=requests.get(link)
    playersoup=BeautifulSoup(playerpage.content,'html.parser')
    text=playersoup.get_text()
    text=unicodedata.normalize('NFKD', text).encode('ascii','ignore')
    text=text.decode('ascii')
    a=pat.match(text)
    if a is not None:
        row.append(a.group(1))
        for i in range(2,len(attributes)+2):
            if a.group(i).split()[0].isdigit():
                row.append(int(a.group(i).split()[0]))
        rows.append(row)
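The loop above fires several hundred requests with no delay or error handling. A sketch of a more polite fetch helper is below; the pause, timeout, and retry values are arbitrary assumptions, and sofifa.com may throttle scrapers regardless:

```python
import time
import requests

def fetch(url, pause=1.0, timeout=10, retries=3):
    """Fetch a page politely: pause between requests, time out, retry on failure."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            time.sleep(pause)  # be gentle with the server
            return resp
        except requests.RequestException:
            time.sleep(pause * (attempt + 1))  # simple linear backoff
    return None  # give up after all retries fail
```

Swapping `requests.get(link)` for `fetch(link)` in the loop (and skipping `None` results) makes long scrapes more robust.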
# Redefine the attributes again to match columns with values, removing category headers such as Attacking, Skill, ...
attributes=['Crossing','Finishing','Heading Accuracy',
'Short Passing','Volleys','Dribbling','Curve',
'FK Accuracy','Long Passing','Ball Control','Acceleration',
'Sprint Speed','Agility','Reactions','Balance',
'Shot Power','Jumping','Stamina','Strength',
'Long Shots','Aggression','Interceptions','Positioning',
'Vision','Penalties','Composure','Marking',
'Standing Tackle','Sliding Tackle','GK Diving',
'GK Handling','GK Kicking','GK Positioning','GK Reflexes']
# we generate a dataframe with our data
df = pd.DataFrame(rows,columns=['link','name']+attributes)
df.head()
df.describe()
We see that we have successfully scraped the data for the 660 players and their attributes. We will continue by trying different clustering techniques.
K-means clustering is an unsupervised algorithm for unlabeled data. It finds groups in the data, with the number of groups given by the parameter K.
# create the dataframe we are going to be using
df2 = df.iloc[: , 2:]
df2.head()
#Fit the kmeans clustering algorithm.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, random_state=0).fit(df2)
We choose k = 4, knowing a priori that players can usually be grouped into four roles: strikers, defenders, midfielders and goalkeepers.
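One way to sanity-check this a-priori choice is the "elbow" method: fit K-means for a range of k and look at the within-cluster inertia. A minimal sketch on synthetic stand-in data (df2 itself could be substituted for X):

```python
from sklearn.cluster import KMeans
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # synthetic stand-in for the attribute matrix df2

# Inertia (sum of squared distances to the nearest cluster center) for k = 1..8
inertias = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
            for k in range(1, 9)]
```

Inertia always decreases as k grows, so what matters when plotting inertias against k is the bend (the "elbow"), not the minimum.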
# predict clusters of our data
predict = kmeans.predict(df2)
# Add the cluster number as a new column to our dataframe
df['Cluster'] = pd.Series(predict, index=df.index)
df.head()
Now that the clusters are generated, we will analyze each cluster by inspection.
#create 4 sub dataframes for the 4 clusters
df_c0 = df[df['Cluster']==0]
df_c1 = df[df['Cluster']==1]
df_c2 = df[df['Cluster']==2]
df_c3 = df[df['Cluster']==3]
df_c0.describe()
df_c1.describe()
df_c2.describe()
df_c3.describe()
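The four describe() calls above can also be condensed into a single per-cluster mean table with groupby. A self-contained sketch on a tiny hypothetical dataframe (with the real data, `df.groupby('Cluster')[attributes].mean()` gives the same side-by-side view):

```python
import pandas as pd

# Tiny stand-in for the scraped dataframe (values are made up)
df_demo = pd.DataFrame({
    'Finishing':       [90, 30, 85, 25],
    'Standing Tackle': [20, 80, 25, 85],
    'Cluster':         [0, 1, 0, 1],
})

# One row per cluster, mean of every attribute column
cluster_means = df_demo.groupby('Cluster').mean()
print(cluster_means)
```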
By inspecting the four groups, we can see tendencies in the mean values that help us classify the players into the four profiles (strikers, defenders, midfielders and goalkeepers).
Agglomerative clustering is the most common type of hierarchical clustering; it groups objects into clusters based on their similarity, again in an unsupervised way.
# fit the model, again choosing 4 clusters
from sklearn.cluster import AgglomerativeClustering
model2 = AgglomerativeClustering(distance_threshold=None, n_clusters=4)
model2 = model2.fit(df2)
# Add a new column called AggCluster to the dataframe (the model is already fitted, so reuse its labels)
df['AggCluster'] = pd.Series(model2.labels_, index=df.index)
df.head()
# calculate the mean of each attribute per cluster
df_ac0 = df[df['AggCluster']==0]
df_ac1 = df[df['AggCluster']==1]
df_ac2 = df[df['AggCluster']==2]
df_ac3 = df[df['AggCluster']==3]
df_mean_ac = pd.DataFrame(df_ac0.mean(numeric_only=True)) # numeric_only skips the link/name columns
df_mean_ac[1] = df_ac1.mean(numeric_only=True)
df_mean_ac[2] = df_ac2.mean(numeric_only=True)
df_mean_ac[3] = df_ac3.mean(numeric_only=True)
df_mean_ac
By inspection of the mean values of each cluster we can see that:
Hierarchical clustering is another unsupervised method for clustering data points; here we will generate a visualization called a dendrogram.
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(15, 8))
plt.title('Dendrogram')
dendrogram = shc.dendrogram(shc.linkage(df2, method='complete'))
plt.show()
This method divides the data into three groups: goalkeepers, defenders and strikers. Midfielders are in this case split between defenders and strikers because they share characteristics of both.
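The three groups visible in the dendrogram can be turned into explicit labels by cutting the tree with `fcluster`. A sketch on synthetic stand-in data (df2 could be passed instead, with `t=3` for the three player groups):

```python
import numpy as np
import scipy.cluster.hierarchy as shc

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs as a stand-in for the player attributes
X = np.vstack([rng.normal(0, 1, size=(20, 5)),
               rng.normal(10, 1, size=(20, 5))])

Z = shc.linkage(X, method='complete')
# Cut the tree into exactly 2 flat clusters (labels start at 1)
labels = shc.fcluster(Z, t=2, criterion='maxclust')
print(np.unique(labels))  # -> [1 2]
```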