Web scraping and Clustering of Football Players

We are going to scrape data on Argentinian football players from SoFIFA (http://sofifa.com), a website that catalogues player ratings from the EA Sports FIFA games, and then cluster the players into groups according to their scraped attributes (e.g. crossing, finishing, dribbling).

For the clustering we will try different methods, which are:

  • K-means clustering
  • Agglomerative clustering
  • Hierarchical clustering

In essence, all of these are unsupervised methods that divide the population into several clusters such that data points in the same cluster are similar to each other and data points in different clusters are dissimilar.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import re
import unicodedata

Scraping

In [8]:
# list of the attributes to scrape

attributes=['Attacking','Crossing','Finishing','Heading Accuracy',
 'Short Passing','Volleys','Skill','Dribbling','Curve',
 'FK Accuracy','Long Passing','Ball Control','Movement','Acceleration',
 'Sprint Speed','Agility','Reactions','Balance','Power',
 'Shot Power','Jumping','Stamina','Strength',
 'Long Shots','Mentality','Aggression','Interceptions','Positioning',
 'Vision','Penalties','Composure','Defending','Defensive Awareness',
 'Standing Tackle','Sliding Tackle','Goalkeeping','GK Diving',
 'GK Handling','GK Kicking','GK Positioning','GK Reflexes']
In [9]:
# page and iterations to get all the Argentinian players

links=[] 

for offset in ['0','60','120','180','240','300','360','420','480','540','600']:
    page=requests.get('http://sofifa.com/players?na=52&offset='+ offset)
    soup=BeautifulSoup(page.content,'html.parser')
    for link in soup.find_all('a'):
        links.append(link.get('href'))
In [10]:
# Remove None values from links (anchors without an href)
links = [l for l in links if l is not None]

# Keep only the links to player pages and make them absolute
links=['http://sofifa.com'+l for l in links if 'player/' in l]
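As a quick check of the link-collection logic, the sketch below rebuilds the listing URLs with `range` and applies the same `'player/'` filter to a hand-made list of `href` values. The helper names `listing_urls` and `player_links` and the sample `hrefs` list are hypothetical; only the URL pattern and the filter condition come from the cells above.

```python
BASE = 'http://sofifa.com'

def listing_urls(last_offset=600, step=60):
    """Build the listing-page URLs for offsets 0, 60, ..., last_offset."""
    return [BASE + '/players?na=52&offset=' + str(offset)
            for offset in range(0, last_offset + 1, step)]

def player_links(hrefs):
    """Keep only hrefs that point to a player page and make them absolute."""
    return [BASE + h for h in hrefs if h is not None and 'player/' in h]

# Hand-made sample: anchors without an href come back as None from link.get('href').
hrefs = [None, '/teams', '/player/158023/lionel-messi/220069/', None]

urls = listing_urls()
links_sample = player_links(hrefs)
print(len(urls), urls[-1])
print(links_sample)
```

Because no network request is involved, this is a cheap way to test the filter before running the full scrape.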
In [11]:
# Now, links should contain all the links to individual Argentinian players
# Checking the first 10 players
links[0:10]
Out[11]:
['http://sofifa.com/player/158023/lionel-messi/220069/',
 'http://sofifa.com/player/211110/paulo-dybala/220069/',
 'http://sofifa.com/player/231478/lautaro-martinez/220069/',
 'http://sofifa.com/player/183898/angel-di-maria/220069/',
 'http://sofifa.com/player/202811/emiliano-martinez/220069/',
 'http://sofifa.com/player/224334/marcos-acuna/220069/',
 'http://sofifa.com/player/143076/alejandro-gomez/220069/',
 'http://sofifa.com/player/212616/rodrigo-javier-de-paul/220069/',
 'http://sofifa.com/player/214997/angel-correa/220069/',
 'http://sofifa.com/player/232488/cristian-romero/220069/']
In [12]:
# Checking the number of player profiles
len(links)
Out[12]:
660
In [13]:
#pattern regular expression
pattern=r"""\s*([\w\s]*?)\s*FIFA"""   #file starts with empty spaces... players name...FIFA...other stuff
for attr in attributes:
    pattern+=r""".*?(\d*\s*"""+attr+r""")"""  #for each attribute we have other stuff..number..attribute..other stuff
pat=re.compile(pattern, re.DOTALL)    #parsing multiline text
In [14]:
# Checking the pattern
pat
Out[14]:
re.compile(r'\s*([\w\s]*?)\s*FIFA.*?(\d*\s*Attacking).*?(\d*\s*Crossing).*?(\d*\s*Finishing).*?(\d*\s*Heading Accuracy).*?(\d*\s*Short Passing).*?(\d*\s*Volleys).*?(\d*\s*Skill).*?(\d*\s*Dribbling).*?(\d*\s*Curve).*?(\d*\s*FK Accuracy).*?(\d*\s*Long Passing).*?(\d*\s*Ball Control).*?(\d*\s*Movement).*?(\d*\s*Acceleration).*?(\d*\s*Sprint Speed).*?(\d*\s*Agility).*?(\d*\s*Reactions).*?(\d*\s*Balance).*?(\d*\s*Power).*?(\d*\s*Shot Power).*?(\d*\s*Jumping).*?(\d*\s*Stamina).*?(\d*\s*Strength).*?(\d*\s*Long Shots).*?(\d*\s*Mentality).*?(\d*\s*Aggression).*?(\d*\s*Interceptions).*?(\d*\s*Positioning).*?(\d*\s*Vision).*?(\d*\s*Penalties).*?(\d*\s*Composure).*?(\d*\s*Defending).*?(\d*\s*Defensive Awareness).*?(\d*\s*Standing Tackle).*?(\d*\s*Sliding Tackle).*?(\d*\s*Goalkeeping).*?(\d*\s*GK Diving).*?(\d*\s*GK Handling).*?(\d*\s*GK Kicking).*?(\d*\s*GK Positioning).*?(\d*\s*GK Reflexes)',
re.DOTALL|re.UNICODE)
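To see how the pattern behaves, here is a minimal sketch that builds the same kind of expression for only three attributes and matches it against a made-up page text. The player name, the values and the layout of `sample` are hypothetical; the pattern construction and the `isdigit` check mirror the notebook cells.

```python
import re

# Made-up page text: a name, the word FIFA, a category header, two rated attributes.
sample = """
  Juan Perez FIFA 22 profile
Attacking
70 Crossing
65 Finishing
"""

attrs = ['Attacking', 'Crossing', 'Finishing']

# Same construction as in the notebook, restricted to three attributes.
pattern = r"\s*([\w\s]*?)\s*FIFA"
for attr in attrs:
    pattern += r".*?(\d*\s*" + attr + r")"
pat_small = re.compile(pattern, re.DOTALL)

m = pat_small.match(sample)
name = m.group(1)

values = []
for i in range(2, len(attrs) + 2):
    first_token = m.group(i).split()[0]
    # Category headers such as 'Attacking' have no leading number and are skipped.
    if first_token.isdigit():
        values.append(int(first_token))

print(name, values)
```

The category header is captured as a group but contributes no value, which is why the column list has to be redefined without the headers before building the dataframe.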
In [15]:
# Testing with one player
row=[links[0]]
playerpage=requests.get(links[0])
playersoup=BeautifulSoup(playerpage.content,'html.parser')
text=playersoup.get_text()
# strip accents and drop any remaining non-ASCII characters so the pattern can match names
text=unicodedata.normalize('NFKD', text).encode('ascii','ignore').decode('ascii')
a=pat.match(text)

print (a.group(1))
row.append(a.group(1))
for i in range(2,len(attributes)+2):
    if((a.group(i).split()[0]).isdigit()):
        row.append(int(a.group(i).split()[0]))
print (row)
Lionel Messi
['http://sofifa.com/player/158023/lionel-messi/220069/', 'Lionel Messi', 85, 91, 70, 91, 88, 96, 93, 93, 91, 94, 90, 78, 91, 93, 95, 86, 68, 71, 69, 93, 44, 40, 93, 95, 75, 96, 20, 35, 24, 6, 11, 15, 14, 8]
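The normalization step is what turns accented names into the plain ASCII seen in the output above. A minimal sketch of that step in isolation follows; the helper name `to_ascii` is hypothetical.

```python
import unicodedata

def to_ascii(text):
    """Decompose accented characters (NFKD), then drop the non-ASCII combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('Ángel Di María'))   # Angel Di Maria
print(to_ascii('Marcos Acuña'))     # Marcos Acuna
```

One caveat: characters that have no decomposition into an ASCII base letter, such as 'Ø', are removed entirely rather than replaced.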
In [16]:
# Run the same extraction for every Argentinian player; this takes a couple of minutes

rows=[]
for link in links:

    row=[link]
    playerpage=requests.get(link)
    playersoup=BeautifulSoup(playerpage.content,'html.parser')
    text=playersoup.get_text()
    # strip accents and drop any remaining non-ASCII characters
    text=unicodedata.normalize('NFKD', text).encode('ascii','ignore').decode('ascii')
    a=pat.match(text)

    # skip pages whose text does not match the pattern
    if a is not None:
        row.append(a.group(1))
        for i in range(2,len(attributes)+2):
            # category headers (Attacking, Skill, ...) carry no number and are skipped
            if((a.group(i).split()[0]).isdigit()):
                row.append(int(a.group(i).split()[0]))
        rows.append(row)
In [17]:
# Redefine the attributes to match columns with values, removing category headers such as Attacking, Skill, ...
# Note: the 'Defensive Awareness' value is stored under the column name 'Marking'

attributes=['Crossing','Finishing','Heading Accuracy',
 'Short Passing','Volleys','Dribbling','Curve',
 'FK Accuracy','Long Passing','Ball Control','Acceleration',
 'Sprint Speed','Agility','Reactions','Balance',
 'Shot Power','Jumping','Stamina','Strength',
 'Long Shots','Aggression','Interceptions','Positioning',
 'Vision','Penalties','Composure','Marking',
 'Standing Tackle','Sliding Tackle','GK Diving',
 'GK Handling','GK Kicking','GK Positioning','GK Reflexes']
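Since the `isdigit` check silently skips any group without a leading number, a row could in principle end up shorter than the column list. The sketch below is one possible sanity check to run before building the dataframe; the helper `split_rows` and the sample rows are hypothetical, and the toy example uses 3 attributes instead of the 34 in the list above.

```python
def split_rows(rows, n_attributes):
    """Separate rows holding link + name + n_attributes values from all other rows."""
    expected = 2 + n_attributes
    complete = [r for r in rows if len(r) == expected]
    incomplete = [r for r in rows if len(r) != expected]
    return complete, incomplete

# Hypothetical mini example with 3 attributes.
sample_rows = [
    ['link-a', 'Player A', 70, 65, 80],
    ['link-b', 'Player B', 55, 60],      # one value missing
]

complete, incomplete = split_rows(sample_rows, n_attributes=3)
print(len(complete), len(incomplete))
```

With the real data the call would be `split_rows(rows, len(attributes))`, and any incomplete rows could be inspected by hand.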
In [18]:
# we generate a dataframe with our data

df = pd.DataFrame(rows,columns=['link','name']+attributes)
df.head()
Out[18]:
link name Crossing Finishing Heading Accuracy Short Passing Volleys Dribbling Curve FK Accuracy ... Penalties Composure Marking Standing Tackle Sliding Tackle GK Diving GK Handling GK Kicking GK Positioning GK Reflexes
0 http://sofifa.com/player/158023/lionel-messi/2... Lionel Messi 85 91 70 91 88 96 93 93 ... 75 96 20 35 24 6 11 15 14 8
1 http://sofifa.com/player/211110/paulo-dybala/2... Paulo Dybala 82 84 59 87 88 90 89 88 ... 84 79 32 41 39 5 4 4 5 8
2 http://sofifa.com/player/231478/lautaro-martin... Lautaro Martinez 56 87 84 76 90 83 78 43 ... 74 86 55 31 44 11 8 8 8 13
3 http://sofifa.com/player/183898/angel-di-maria... Angel Di Maria 86 77 51 83 78 88 91 88 ... 73 86 41 55 52 10 7 11 12 11
4 http://sofifa.com/player/202811/emiliano-marti... Emiliano Martinez 22 16 26 50 13 13 22 20 ... 42 60 11 20 20 85 83 82 83 85

5 rows × 36 columns

In [19]:
df.describe()
Out[19]:
Crossing Finishing Heading Accuracy Short Passing Volleys Dribbling Curve FK Accuracy Long Passing Ball Control ... Penalties Composure Marking Standing Tackle Sliding Tackle GK Diving GK Handling GK Kicking GK Positioning GK Reflexes
count 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000 ... 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000
mean 55.163636 52.596970 57.737879 65.396970 48.698485 62.575758 53.765152 48.840909 59.677273 65.375758 ... 54.763636 65.660606 50.916667 51.927273 49.943939 15.456061 15.357576 15.251515 15.718182 15.501515
std 17.038398 19.135916 16.848862 12.546726 17.527996 17.380819 18.390000 17.035636 13.397521 14.853192 ... 15.215554 10.199674 18.876984 21.102701 20.741524 17.379882 16.771046 16.234728 17.289009 17.820254
min 8.000000 5.000000 10.000000 11.000000 4.000000 5.000000 8.000000 8.000000 12.000000 10.000000 ... 11.000000 22.000000 7.000000 7.000000 8.000000 4.000000 4.000000 3.000000 4.000000 5.000000
25% 47.000000 39.000000 50.000000 63.000000 36.000000 60.000000 41.000000 37.000000 55.000000 64.000000 ... 45.000000 59.000000 35.000000 31.750000 31.000000 8.000000 8.000000 8.000000 8.000000 8.000000
50% 60.000000 57.000000 60.000000 68.000000 51.000000 68.000000 55.000000 49.000000 62.000000 69.000000 ... 57.000000 67.000000 58.000000 61.000000 58.500000 11.000000 11.000000 11.000000 11.000000 11.000000
75% 67.000000 68.000000 70.000000 73.000000 63.000000 73.000000 68.250000 62.000000 69.000000 73.000000 ... 66.250000 73.000000 67.000000 70.000000 68.000000 14.000000 14.000000 14.000000 14.000000 13.250000
max 87.000000 91.000000 86.000000 91.000000 90.000000 96.000000 93.000000 93.000000 91.000000 94.000000 ... 88.000000 96.000000 84.000000 85.000000 84.000000 85.000000 83.000000 82.000000 83.000000 86.000000

8 rows × 34 columns

We see that we have successfully scraped the data for all 660 players and their attributes. We will continue by trying different clustering techniques.

K-means clustering

K-means is an unsupervised algorithm for unlabeled data. It partitions the data into K groups, where K is chosen in advance, by assigning each point to the cluster with the nearest centroid.
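A minimal sketch of the idea on toy data, assuming two obvious groups in a two-dimensional attribute space (the array `X` is made up and is not the scraped data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (hypothetical): three low-valued points and three high-valued points.
X = np.array([[1, 2], [2, 1], [1, 1],
              [9, 8], [8, 9], [9, 9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

print(labels)               # one cluster index per row of X
print(km.cluster_centers_)  # the two centroids
```

The cluster indices themselves are arbitrary: only the grouping matters, which is why each cluster has to be interpreted by inspecting its members.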

In [20]:
# create the dataframe we are going to be using

df2 = df.iloc[: , 2:]
df2.head()
Out[20]:
Crossing Finishing Heading Accuracy Short Passing Volleys Dribbling Curve FK Accuracy Long Passing Ball Control ... Penalties Composure Marking Standing Tackle Sliding Tackle GK Diving GK Handling GK Kicking GK Positioning GK Reflexes
0 85 91 70 91 88 96 93 93 91 94 ... 75 96 20 35 24 6 11 15 14 8
1 82 84 59 87 88 90 89 88 79 93 ... 84 79 32 41 39 5 4 4 5 8
2 56 87 84 76 90 83 78 43 73 85 ... 74 86 55 31 44 11 8 8 8 13
3 86 77 51 83 78 88 91 88 81 87 ... 73 86 41 55 52 10 7 11 12 11
4 22 16 26 50 13 13 22 20 51 35 ... 42 60 11 20 20 85 83 82 83 85

5 rows × 34 columns

In [21]:
#Fit the kmeans clustering algorithm. 
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, random_state=0).fit(df2)

We choose k = 4 because we know a priori that players can usually be grouped into four roles: strikers, defenders, midfielders and goalkeepers.
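The choice of k can also be checked against the data. One common option is the silhouette score, sketched below on synthetic data. The four blobs generated with `make_blobs` are an assumption standing in for `df2`; the result says nothing about the scraped players.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for df2: four well-separated blobs (not the scraped data).
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

# Fit K-means for several values of k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

Running the same loop on `df2` would show whether the data supports four groups or whether another value of k separates the players more cleanly.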

In [22]:
# predict clusters of our data

predict = kmeans.predict(df2)
In [24]:
# Add the cluster number as a new column to our dataframe

df['Cluster'] = pd.Series(predict, index=df.index)
df.head()
Out[24]:
link name Crossing Finishing Heading Accuracy Short Passing Volleys Dribbling Curve FK Accuracy ... Composure Marking Standing Tackle Sliding Tackle GK Diving GK Handling GK Kicking GK Positioning GK Reflexes Cluster
0 http://sofifa.com/player/158023/lionel-messi/2... Lionel Messi 85 91 70 91 88 96 93 93 ... 96 20 35 24 6 11 15 14 8 3
1 http://sofifa.com/player/211110/paulo-dybala/2... Paulo Dybala 82 84 59 87 88 90 89 88 ... 79 32 41 39 5 4 4 5 8 3
2 http://sofifa.com/player/231478/lautaro-martin... Lautaro Martinez 56 87 84 76 90 83 78 43 ... 86 55 31 44 11 8 8 8 13 3
3 http://sofifa.com/player/183898/angel-di-maria... Angel Di Maria 86 77 51 83 78 88 91 88 ... 86 41 55 52 10 7 11 12 11 3
4 http://sofifa.com/player/202811/emiliano-marti... Emiliano Martinez 22 16 26 50 13 13 22 20 ... 60 11 20 20 85 83 82 83 85 2

5 rows × 37 columns

Now that the clusters are generated, we are going to analyze each cluster by inspection.

In [25]:
#create 4 sub dataframes for the 4 clusters

df_c0 = df[df['Cluster']==0]
df_c1 = df[df['Cluster']==1]
df_c2 = df[df['Cluster']==2]
df_c3 = df[df['Cluster']==3]
In [26]:
df_c0.describe()
Out[26]:
Crossing Finishing Heading Accuracy Short Passing Volleys Dribbling Curve FK Accuracy Long Passing Ball Control ... Composure Marking Standing Tackle Sliding Tackle GK Diving GK Handling GK Kicking GK Positioning GK Reflexes Cluster
count 239.000000 239.000000 239.000000 239.000000 239.000000 239.000000 239.000000 239.000000 239.000000 239.000000 ... 239.000000 239.000000 239.000000 239.000000 239.000000 239.000000 239.000000 239.000000 239.000000 239.0
mean 64.602510 55.083682 57.778243 71.707113 51.866109 69.953975 62.841004 55.857741 67.615063 71.221757 ... 68.874477 63.104603 67.121339 64.903766 10.251046 10.644351 10.623431 10.723849 10.255230 0.0
std 7.662316 10.448418 9.443162 5.273008 10.643125 6.062869 12.124695 12.630356 6.531303 4.966800 ... 7.779048 8.009549 6.892171 7.801849 3.101191 3.215489 2.962021 3.151480 3.013576 0.0
min 37.000000 28.000000 32.000000 58.000000 27.000000 49.000000 33.000000 32.000000 46.000000 60.000000 ... 47.000000 37.000000 49.000000 36.000000 4.000000 5.000000 3.000000 4.000000 5.000000 0.0
25% 59.000000 47.000000 51.500000 68.000000 45.000000 66.000000 53.000000 45.000000 63.000000 68.000000 ... 65.000000 59.000000 63.000000 60.000000 8.000000 8.000000 8.000000 8.000000 8.000000 0.0
50% 65.000000 56.000000 57.000000 72.000000 50.000000 70.000000 63.000000 55.000000 68.000000 71.000000 ... 69.000000 64.000000 68.000000 66.000000 10.000000 11.000000 11.000000 11.000000 10.000000 0.0
75% 69.000000 62.000000 64.500000 75.000000 59.000000 73.000000 72.500000 66.000000 72.000000 74.000000 ... 75.000000 68.000000 72.000000 70.000000 13.000000 13.500000 13.000000 14.000000 13.000000 0.0
max 87.000000 82.000000 82.000000 85.000000 81.000000 87.000000 88.000000 82.000000 86.000000 88.000000 ... 87.000000 83.000000 84.000000 82.000000 16.000000 16.000000 16.000000 16.000000 16.000000 0.0

8 rows × 35 columns

In [27]:
df_c1.describe()
Out[27]:
Crossing Finishing Heading Accuracy Short Passing Volleys Dribbling Curve FK Accuracy Long Passing Ball Control ... Composure Marking Standing Tackle Sliding Tackle GK Diving GK Handling GK Kicking GK Positioning GK Reflexes Cluster
count 135.000000 135.000000 135.000000 135.000000 135.000000 135.000000 135.000000 135.000000 135.000000 135.000000 ... 135.000000 135.000000 135.000000 135.000000 135.000000 135.000000 135.000000 135.000000 135.000000 135.0
mean 46.370370 34.962963 68.251852 64.266667 32.829630 52.259259 38.288889 36.140741 57.629630 61.251852 ... 62.659259 68.770370 70.629630 68.570370 10.659259 10.392593 10.540741 10.725926 10.222222 1.0
std 12.471305 9.077944 8.175061 5.865660 7.617283 10.539169 9.187814 8.862265 8.904754 6.927900 ... 7.316066 5.394775 4.104495 4.032592 2.990440 3.005210 2.825161 2.996091 2.730881 0.0
min 21.000000 15.000000 44.000000 45.000000 14.000000 25.000000 21.000000 17.000000 24.000000 31.000000 ... 49.000000 55.000000 58.000000 62.000000 5.000000 5.000000 5.000000 5.000000 5.000000 1.0
25% 35.500000 28.000000 65.000000 60.500000 28.000000 44.000000 31.000000 31.000000 54.000000 58.000000 ... 56.500000 65.000000 68.000000 66.000000 8.000000 8.000000 8.000000 8.000000 8.000000 1.0
50% 45.000000 35.000000 70.000000 65.000000 32.000000 54.000000 36.000000 34.000000 58.000000 62.000000 ... 63.000000 69.000000 70.000000 68.000000 11.000000 10.000000 11.000000 11.000000 10.000000 1.0
75% 58.000000 40.000000 73.000000 68.000000 37.000000 60.500000 45.000000 40.000000 64.000000 65.000000 ... 68.000000 72.000000 73.000000 71.000000 13.000000 13.000000 13.000000 13.000000 12.000000 1.0
max 70.000000 62.000000 86.000000 77.000000 58.000000 76.000000 64.000000 73.000000 79.000000 76.000000 ... 80.000000 84.000000 85.000000 84.000000 16.000000 16.000000 16.000000 16.000000 16.000000 1.0

8 rows × 35 columns

In [28]:
df_c2.describe()
Out[28]:
Crossing Finishing Heading Accuracy Short Passing Volleys Dribbling Curve FK Accuracy Long Passing Ball Control ... Composure Marking Standing Tackle Sliding Tackle GK Diving GK Handling GK Kicking GK Positioning GK Reflexes Cluster
count 54.000000 54.000000 54.000000 54.000000 54.000000 54.000000 54.000000 54.000000 54.000000 54.000000 ... 54.000000 54.000000 54.000000 54.000000 54.000000 54.000000 54.000000 54.000000 54.000000 54.0
mean 15.203704 13.407407 15.148148 29.574074 14.000000 15.962963 16.870370 15.296296 27.611111 21.703704 ... 47.425926 18.462963 16.018519 15.833333 72.629630 70.462963 68.462963 72.537037 74.185185 2.0
std 3.636189 4.191269 3.838275 8.415175 3.880284 4.625454 5.337819 3.754894 9.517445 7.264880 ... 12.208259 5.622438 3.892376 3.830119 5.006774 4.562823 5.665896 5.254694 5.095592 0.0
min 8.000000 5.000000 10.000000 11.000000 4.000000 5.000000 8.000000 8.000000 12.000000 10.000000 ... 22.000000 7.000000 7.000000 8.000000 63.000000 63.000000 58.000000 63.000000 65.000000 2.0
25% 12.000000 11.000000 12.000000 24.250000 12.000000 13.000000 13.000000 12.250000 22.000000 16.000000 ... 39.250000 14.250000 13.000000 13.250000 68.000000 67.000000 65.000000 69.000000 70.000000 2.0
50% 15.000000 14.000000 15.000000 28.000000 15.000000 16.000000 16.000000 15.000000 25.000000 20.500000 ... 48.500000 18.000000 15.000000 15.000000 72.000000 69.500000 68.000000 72.000000 74.000000 2.0
75% 18.000000 17.000000 17.000000 35.000000 17.000000 19.750000 20.000000 18.000000 31.000000 26.000000 ... 58.000000 22.750000 19.750000 17.750000 75.750000 72.750000 72.000000 76.000000 77.000000 2.0
max 23.000000 20.000000 26.000000 50.000000 20.000000 25.000000 31.000000 22.000000 54.000000 40.000000 ... 67.000000 30.000000 23.000000 24.000000 85.000000 83.000000 82.000000 83.000000 86.000000 2.0

8 rows × 35 columns

In [29]:
df_c3.describe()
Out[29]:
Crossing Finishing Heading Accuracy Short Passing Volleys Dribbling Curve FK Accuracy Long Passing Ball Control ... Composure Marking Standing Tackle Sliding Tackle GK Diving GK Handling GK Kicking GK Positioning GK Reflexes Cluster
count 232.000000 232.000000 232.000000 232.000000 232.000000 232.000000 232.000000 232.000000 232.000000 232.000000 ... 232.000000 232.000000 232.000000 232.000000 232.000000 232.000000 232.000000 232.000000 232.000000 232.0
mean 59.857759 69.418103 61.491379 67.892241 62.745690 71.827586 62.008621 56.810345 60.155172 71.918103 ... 68.340517 35.525862 33.750000 31.633621 10.301724 10.275862 10.375000 10.543103 10.318966 3.0
std 11.902618 7.001065 13.155412 6.004438 9.245098 6.495787 12.090206 11.947152 9.648456 5.465929 ... 8.264591 10.111895 11.475836 10.429698 2.928350 2.924997 3.155683 3.000410 3.086345 0.0
min 23.000000 50.000000 25.000000 52.000000 37.000000 52.000000 33.000000 32.000000 32.000000 58.000000 ... 46.000000 16.000000 11.000000 11.000000 5.000000 4.000000 4.000000 5.000000 5.000000 3.0
25% 53.000000 65.000000 52.000000 64.000000 56.000000 68.000000 53.750000 47.000000 55.000000 68.000000 ... 63.000000 28.000000 25.000000 23.000000 8.000000 8.000000 8.000000 8.000000 8.000000 3.0
50% 63.000000 70.000000 63.500000 68.000000 63.000000 72.000000 62.000000 57.500000 61.000000 71.000000 ... 69.000000 35.500000 32.000000 32.000000 10.000000 10.000000 10.000000 11.000000 10.000000 3.0
75% 69.000000 74.000000 72.000000 72.000000 69.000000 75.250000 70.000000 66.000000 67.000000 75.000000 ... 73.250000 43.000000 43.000000 39.000000 13.000000 13.000000 13.000000 13.000000 13.000000 3.0
max 86.000000 91.000000 86.000000 91.000000 90.000000 96.000000 93.000000 93.000000 91.000000 94.000000 ... 96.000000 65.000000 59.000000 54.000000 16.000000 16.000000 16.000000 16.000000 16.000000 3.0

8 rows × 35 columns

By inspecting the four groups, we can see tendencies in the mean values that help us assign each cluster to one of the four player profiles:

  • Cluster 0: high stats for crossing, short passing, dribbling, ball control, composure --> midfielders
  • Cluster 1: high stats for heading accuracy, short passing, marking, standing and sliding tackle --> defenders
  • Cluster 2: high stats for goalkeeping --> goalkeepers
  • Cluster 3: high stats for finishing, short passing, dribbling, ball control --> strikers
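The per-cluster means used for this inspection can also be collected in a single table with `groupby`, instead of calling `describe()` on four separate dataframes. A minimal sketch on a made-up miniature of `df` (the names and values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of df: two attributes plus the K-means label.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'Finishing': [80, 70, 20, 30],
    'GK Diving': [10, 12, 80, 70],
    'Cluster': [3, 3, 2, 2],
})

# One row per cluster, one column per attribute.
cluster_means = toy.groupby('Cluster')[['Finishing', 'GK Diving']].mean()
print(cluster_means)
```

On the real data the column selection would be the `attributes` list, giving a 4 × 34 table that can be read side by side.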

Agglomerative clustering

Agglomerative clustering is the most common type of hierarchical clustering. It works bottom-up: every data point starts as its own cluster, and the two most similar clusters are merged repeatedly until the requested number of clusters remains. Like K-means, it is an unsupervised technique that groups similar data points together and keeps dissimilar ones apart.
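To make the bottom-up merging concrete, the sketch below uses SciPy's `linkage` on four made-up one-dimensional points and prints the merge history. This is an illustration with SciPy and complete linkage, not the scikit-learn model fitted in the next cell (which uses Ward linkage by default).

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Four made-up points on a line: two close pairs that are far from each other.
X = np.array([[0.0], [1.0], [5.0], [6.5]])

# Each row of Z is one merge: [cluster_a, cluster_b, distance, size_of_new_cluster].
# Original points are numbered 0..3; merged clusters get the ids 4, 5, ...
Z = shc.linkage(X, method='complete')
print(Z)
```

The first two rows merge the close pairs (points 0 and 1, then points 2 and 3), and the last row joins the two resulting clusters at the largest pairwise distance, 6.5.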

In [30]:
# fit the model, again choosing 4 clusters

from sklearn.cluster import AgglomerativeClustering

model2 = AgglomerativeClustering(distance_threshold=None, n_clusters=4)
model2 = model2.fit(df2)
In [32]:
# Add a new column called AggCluster to the dataframe, reusing the labels of the fitted model

df['AggCluster'] = pd.Series(model2.labels_, index=df.index)
df.head()
Out[32]:
link name Crossing Finishing Heading Accuracy Short Passing Volleys Dribbling Curve FK Accuracy ... Marking Standing Tackle Sliding Tackle GK Diving GK Handling GK Kicking GK Positioning GK Reflexes Cluster AggCluster
0 http://sofifa.com/player/158023/lionel-messi/2... Lionel Messi 85 91 70 91 88 96 93 93 ... 20 35 24 6 11 15 14 8 3 0
1 http://sofifa.com/player/211110/paulo-dybala/2... Paulo Dybala 82 84 59 87 88 90 89 88 ... 32 41 39 5 4 4 5 8 3 0
2 http://sofifa.com/player/231478/lautaro-martin... Lautaro Martinez 56 87 84 76 90 83 78 43 ... 55 31 44 11 8 8 8 13 3 0
3 http://sofifa.com/player/183898/angel-di-maria... Angel Di Maria 86 77 51 83 78 88 91 88 ... 41 55 52 10 7 11 12 11 3 0
4 http://sofifa.com/player/202811/emiliano-marti... Emiliano Martinez 22 16 26 50 13 13 22 20 ... 11 20 20 85 83 82 83 85 2 3

5 rows × 38 columns

In [35]:
# calculate means of each cluster
df_ac0 = df[df['AggCluster']==0]
df_ac1 = df[df['AggCluster']==1]
df_ac2 = df[df['AggCluster']==2]
df_ac3 = df[df['AggCluster']==3]

# numeric_only=True skips the text columns (link, name) explicitly
df_mean_ac = pd.DataFrame(df_ac0.mean(numeric_only=True))
df_mean_ac[1] = df_ac1.mean(numeric_only=True)
df_mean_ac[2] = df_ac2.mean(numeric_only=True)
df_mean_ac[3] = df_ac3.mean(numeric_only=True)
In [36]:
df_mean_ac
Out[36]:
0 1 2 3
Crossing 60.503704 39.161765 61.895522 15.203704
Finishing 68.270370 31.485294 50.059701 13.407407
Heading Accuracy 60.070370 71.411765 60.500000 15.148148
Short Passing 68.400000 62.941176 70.212687 29.574074
Volleys 61.985185 29.941176 47.063433 14.000000
Dribbling 72.007407 44.661765 67.011194 15.962963
Curve 62.618519 35.470588 56.921642 16.870370
FK Accuracy 57.262963 33.558824 50.992537 15.296296
Long Passing 61.074074 55.720588 65.735075 27.611111
Ball Control 71.903704 57.544118 69.585821 21.703704
Acceleration 74.577778 51.073529 70.089552 45.555556
Sprint Speed 73.325926 53.470588 69.320896 44.407407
Agility 74.885185 48.544118 70.589552 42.851852
Reactions 67.285185 65.808824 68.358209 67.944444
Balance 72.525926 49.250000 69.492537 48.240741
Shot Power 71.537037 49.220588 64.477612 51.425926
Jumping 65.844444 71.764706 69.447761 65.092593
Stamina 68.322222 65.882353 75.201493 34.148148
Strength 64.392593 81.161765 69.514925 65.259259
Long Shots 66.400000 31.205882 56.250000 13.703704
Aggression 57.633333 73.397059 73.093284 29.666667
Interceptions 35.974074 69.441176 67.522388 20.518519
Positioning 70.470370 35.735294 60.608209 13.518519
Vision 66.244444 39.808824 62.902985 45.222222
Penalties 64.937037 43.264706 53.962687 22.351852
Composure 68.400000 61.735294 67.570896 47.425926
Marking 37.892593 69.882353 65.764925 18.462963
Standing Tackle 37.059259 70.970588 69.309701 16.018519
Sliding Tackle 34.814815 68.485294 67.354478 15.833333
GK Diving 10.344444 10.808824 10.264925 72.629630
GK Handling 10.240741 10.705882 10.589552 70.462963
GK Kicking 10.422222 10.529412 10.593284 68.462963
GK Positioning 10.448148 11.132353 10.742537 72.537037
GK Reflexes 10.218519 10.205882 10.343284 74.185185
Cluster 2.577778 1.000000 0.250000 2.000000
AggCluster 0.000000 1.000000 2.000000 3.000000

By inspection of the mean values of each cluster we can see that:

  • Cluster 0 corresponds to strikers
  • Cluster 1 corresponds to defenders
  • Cluster 2 corresponds to midfielders
  • Cluster 3 corresponds to goalkeepers
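Note that the cluster numbers are not comparable between the two methods: strikers are cluster 3 in K-means but cluster 0 here. A cross-tabulation of the two label columns shows how the groups line up. The sketch below uses six hypothetical players rather than the real dataframe.

```python
import pandas as pd

# Hypothetical labels for six players; the numbering differs between the two methods.
toy = pd.DataFrame({
    'Cluster':    [3, 3, 2, 1, 1, 0],
    'AggCluster': [0, 0, 3, 1, 1, 2],
})

# Rows: K-means label, columns: agglomerative label, cells: number of players.
table = pd.crosstab(toy['Cluster'], toy['AggCluster'])
print(table)
```

Applied to the real data as `pd.crosstab(df['Cluster'], df['AggCluster'])`, large counts concentrated in one cell per row would indicate that the two methods largely agree.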

Hierarchical Clustering

Hierarchical clustering is another unsupervised learning method for clustering data points. In this case we will generate a visualization of the merge tree, called a dendrogram.

In [37]:
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc 
In [41]:
plt.figure(figsize=(15, 8))
plt.title('Dendrogram')
dendrogram = shc.dendrogram(shc.linkage(df2, method='complete'))

Read from the dendrogram, this method divides the data into three groups: goalkeepers, defenders and strikers. Midfielders are split between the defender and striker groups because they share characteristics of both.
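The groups read off the dendrogram can also be extracted as explicit labels by cutting the tree with SciPy's `fcluster`. The sketch below is a minimal example on made-up points with three well-separated groups; choosing three clusters mirrors the reading above, and the data is hypothetical.

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Toy data (hypothetical): three well-separated groups of two points each.
X = np.array([[0, 0], [0, 1],
              [10, 0], [10, 1],
              [0, 10], [1, 10]])

Z = shc.linkage(X, method='complete')

# Cut the tree so that at most three flat clusters remain.
flat = shc.fcluster(Z, t=3, criterion='maxclust')
print(flat)
```

On the real data, `shc.fcluster(shc.linkage(df2, method='complete'), t=3, criterion='maxclust')` would give one label per player, which could be stored in a new column and inspected like the other cluster columns.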
