Leicester City's championship win in the 2015/2016 English Premier League season was called a fairy tale by many across the world. Having finished 14th the season prior, they overcame 5000:1 odds to take home the title.
But this raises a question: given that Leicester were in the Football League Championship only a few seasons ago and were one of the least valuable clubs in the top division of English football, does money dictate finishing position?
This tutorial will analyze transfer expenses and overall club values when trying to answer this question. It will be split into four parts: Data Collection, Data Plotting, Linear Regression Analysis, and Conclusion.
For this part of the analysis, we will be collecting game data from ESPN and team value and transfer spend data from Transfermarkt.us.
We will be using the following libraries to scrape data from Transfermarkt and ESPN, create regression models, and construct various scatter and violin plots.
import re
import requests as r
import pandas as pd
import numpy
import numpy as np
from bs4 import BeautifulSoup
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from ggplot import *
import statistics as st
We will be getting the value data from Transfermarkt. Note that the data from the website is a string in the format XXX,XX Mill. €. Therefore, we will have to make some changes to the string so that we can convert it to a float and use it for graphs and calculations.
We have to use loops since the URL for each season is different. During this process, we will have to make some slight changes to the data that is scraped in order to be able to manipulate it for our analysis.
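As a sketch of the string cleanup we will apply (the sample figure below is made up), a Transfermarkt value string like "585,25 Mill. €" can be turned into a float like this:

```python
def parse_value(raw):
    # Strip the ' Mill. €' suffix and swap the European decimal comma for a point
    cleaned = raw.replace(' Mill. €', '').replace(',', '.')
    return float(cleaned)

print(parse_value('585,25 Mill. €'))  # 585.25
```

The scraping loop below performs the same two replacements inline rather than through a helper.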
headers = {'User-Agent': 'Chrome/47.0.2526.106'}
teamNames = []
club_value = []
years = []
for year in range(2005, 2018):
    page = "https://www.transfermarkt.us/premier-league/startseite/wettbewerb/GB1/plus/?saison_id=" + str(year)
    pageTree = r.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    names = pageSoup.find_all("a", {"class": "vereinprofil_tooltip"})
    values = pageSoup.find_all("td", {"class": "rechts hide-for-small hide-for-pad"})
    tempNames = []
    # Keep only anchors that carry a club name, stripping common suffixes/prefixes
    for index in range(len(names)):
        if len(names[index].text) > 1:
            name = names[index].text.replace('AFC ', '')
            name = name.replace(' FC', '')
            name = name.replace(' AFC', '')
            tempNames.append(name)
    # Each of the 20 clubs appears twice, so take every other entry
    for index in range(40):
        if index % 2 == 0:
            teamNames.append(tempNames[index])
            years.append(year)
    # Skip the first two cells, then every other cell holds a club value
    for index in range(2, len(values)):
        if index % 2 == 0:
            real_value = values[index].text.replace(',', '.')
            real_value = real_value.replace(' Mill. €', '')
            club_value.append(float(real_value))
Now we will put this data neatly into a DataFrame so it is easier to merge all the data later on.
values_table = pd.DataFrame({'Club':teamNames, 'Value':club_value, 'Season':years})
values_table.head()
This is a glimpse of the table, which starts in the 2005/2006 season and runs right up through the last season (2017/2018). As a reminder, the Value column is in millions of euros.
Now we will scrape each club's game data from ESPN. This will get us stats such as games played, games won, and crucially, points scored.
headers = {'User-Agent': 'Chrome/47.0.2526.106'}
teamNames = []
teamID = []
teamPoints = []
gamesPlayed = []
gamesWon = []
gamesDrawn = []
gamesLost = []
goalDifference = []
for year in range(2005, 2018):
    page = "http://www.espn.com/soccer/standings/_/league/eng.1/season/" + str(year)
    pageTree = r.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    teams = pageSoup.find_all("span", {"class": "team-names"})
    gameinfo = pageSoup.find_all("td", {"class": "", "style": "white-space:nowrap;"})
    ids = pageSoup.find_all("abbr")
    for team in teams:
        teamNames.append(team.text.replace('AFC ', ''))
    for tID in ids:
        teamID.append(tID.text)
    # Each of the 20 teams has 8 stat cells (160 total), so stride through them
    for games in range(0, 160, 8):
        gamesPlayed.append(int(gameinfo[games].text))
    for wins in range(1, 160, 8):
        gamesWon.append(int(gameinfo[wins].text))
    for draws in range(2, 160, 8):
        gamesDrawn.append(int(gameinfo[draws].text))
    for loss in range(3, 160, 8):
        gamesLost.append(int(gameinfo[loss].text))
    for gd in range(6, 160, 8):
        goalDifference.append(gameinfo[gd].text)
    for points in range(7, 160, 8):
        teamPoints.append(int(gameinfo[points].text))
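The stride-8 indexing above can be illustrated with a toy list of cells (the numbers below are invented: two teams, eight stat cells each):

```python
# Hypothetical flat list of table cells: 8 stats per team, two teams
cells = ['38', '28', '5', '5', '65', '22', '43', '89',
         '38', '25', '8', '5', '68', '30', '38', '83']
# Games played sit at offset 0 of each team's 8 cells, points at offset 7
games_played = [int(cells[i]) for i in range(0, len(cells), 8)]
points = [int(cells[i]) for i in range(7, len(cells), 8)]
print(games_played)  # [38, 38]
print(points)        # [89, 83]
```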
Now we will create a DataFrame out of this game data.
game_data = pd.DataFrame({'Club': teamNames, 'Club ID': teamID, 'Season': years, 'Games Played': gamesPlayed,
                          'Games Won': gamesWon, 'Games Drawn': gamesDrawn, 'Games Lost': gamesLost,
                          'Goal Difference': goalDifference, 'Points': teamPoints})
game_data.head()
The last part of the data collection phase is to collect data on how much teams spent each season from Transfermarkt. Again we will have to modify the expenditure strings and change them into floats. It is important to note at this point that the expenditures do NOT take into account any money made from the club selling players, only money from buying players.
headers = {'User-Agent': 'Chrome/47.0.2526.106'}
teamNames = []
teamExp = []
years = []
for year in range(2005, 2018):
    page = ("https://www.transfermarkt.us/premier-league/einnahmenausgaben/wettbewerb/GB1/plus/0?ids=a&sa=&saison_id="
            + str(year) + "&saison_id_bis=" + str(year) + "&nat=&pos=&altersklasse=&w_s=&leihe=&intern=0")
    pageTree = r.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    expenditures = pageSoup.find_all("td", {"class": "rechts hauptlink redtext"})
    teams = pageSoup.find_all("a", {"class": "vereinprofil_tooltip"})
    for tI in range(40):
        if len(teams[tI].text) > 1:
            name = teams[tI].text.replace('AFC ', '')
            name = name.replace(' FC', '')
            name = name.replace(' AFC', '')
            teamNames.append(name)
    # One expenditure cell per club; strip the currency suffix and fix the decimal separator
    for eI in range(20):
        real_exp = expenditures[eI].text.replace(' Mill. €', '')
        real_exp = real_exp.replace(',', '.')
        teamExp.append(float(real_exp))
        years.append(year)
expTable = pd.DataFrame({'Club':teamNames, 'Season':years, 'Spend':teamExp})
expTable.head()
We now have three DataFrames that contain all the information we need to conduct our analysis. To make everything simpler, we will combine all three into one large DataFrame.
full_table = values_table.merge(expTable, how='inner', on=['Club','Season'])
full_table = full_table.merge(game_data, how='inner', on=['Club','Season'])
full_table.columns = ['Club', 'Value in mill €', 'Season', 'Spend in mill €', 'Club ID', 'Games Drawn',
                      'Games Lost', 'Games Played', 'Games Won', 'Goal Difference', 'Points']
full_table.head()
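To make the inner-merge behavior concrete, here is a minimal sketch with toy data (the club names and numbers are invented): any (Club, Season) pair missing from either table is dropped from the result.

```python
import pandas as pd

values = pd.DataFrame({'Club': ['Arsenal', 'Chelsea', 'Leeds'],
                       'Season': [2005, 2005, 2005],
                       'Value': [250.0, 300.0, 80.0]})
spend = pd.DataFrame({'Club': ['Arsenal', 'Chelsea'],
                      'Season': [2005, 2005],
                      'Spend': [20.0, 90.0]})
# Inner merge keeps only rows whose (Club, Season) appears in both frames
merged = values.merge(spend, how='inner', on=['Club', 'Season'])
print(merged)  # Leeds is dropped: it has no matching row in `spend`
```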
We can start our analysis by plotting team points against club spend and club value over the years. We will only display a few plots to keep things concise. To make the plots easier to read and to show which point belongs to which team, each club's abbreviation will be displayed above its point.
spends_dict = {}
values_dict = {}
points_dict = {}
for year in range(2005, 2018):
    spend = []
    points = []
    team_ids = []
    for index in range(len(full_table['Season'])):
        if full_table['Season'][index] == year:
            spend.append(full_table['Spend in mill €'][index])
            points.append(full_table['Points'][index])
            team_ids.append(full_table['Club ID'][index])
    spends_dict[year] = spend
    points_dict[year] = points
    # Only plot every fifth season to keep the output concise
    if year % 5 == 0:
        plt.title("Club Spend vs. Points Won in " + str(year))
        plt.scatter(spend, points)
        plt.xticks(numpy.arange(min(spend), max(spend), 10))
        for i, teamID in enumerate(team_ids):
            plt.annotate(teamID, (spend[i], points[i]))
        plt.xlabel('Club Spend')
        plt.ylabel('Points Won')
        plt.show()
Like the Moneyball project from earlier in the semester, these plots show Club Spend vs. Points Won for each season in our dataset. However, unlike in the Moneyball project, the idea of spending little and winning big was not introduced by Leicester City.
Rather, we see that even in past seasons, teams like Chelsea, Manchester United, and Liverpool all spent relatively little and won big, either winning the league or coming very close to it. However, these teams are known as the giants of English football.
Perhaps then, we should add the club's existing value into the equation.
A club's value is based on its players. If individual players do well and/or the team as a whole does well, the club's value increases. This means value is not entirely dependent on how much cash a club has.
for year in range(2005, 2018):
    values = []
    points = []
    team_ids = []
    for index in range(len(full_table['Season'])):
        if full_table['Season'][index] == year:
            values.append(full_table['Value in mill €'][index])
            points.append(full_table['Points'][index])
            team_ids.append(full_table['Club ID'][index])
    values_dict[year] = values
    # As before, only plot every fifth season
    if year % 5 == 0:
        plt.title("Club Value vs. Points Won in " + str(year))
        plt.scatter(values, points)
        plt.xticks(numpy.arange(min(values), max(values), 50))
        for i, teamID in enumerate(team_ids):
            plt.annotate(teamID, (values[i], points[i]))
        plt.xlabel('Club Value')
        plt.ylabel('Points Won')
        plt.show()
Based on these plots, it seems that using actual club value removes that "big team" bias. The points are no longer split as cleanly into two clusters, one in the bottom left and one in the top right.
However, an even better way to check which measure or combination of measures is the best predictor of team success is through linear regression.
In this part, we will calculate different linear regression models that can be used to predict points won at the end of a season. We visualize how good our models are using violin plots that take residuals as data points. A residual in this analysis is calculated as $residual = actual - predicted$. The closer a model's residuals are to zero, the better that model is at predicting points won.
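For instance, with made-up actual and predicted point totals, residuals are computed like this:

```python
actual = [81, 64, 90]           # hypothetical points actually won
predicted = [75.0, 70.0, 85.0]  # hypothetical model predictions
residuals = [a - p for a, p in zip(actual, predicted)]
print(residuals)  # [6.0, -6.0, 5.0]
```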
The Club Values and Spends will have to be standardized before we can make any regression models, due to inflation in the market.
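Standardizing here means converting each figure to a z-score: subtract the mean and divide by the population standard deviation. A sketch with invented values:

```python
import statistics as st

values = [100.0, 200.0, 300.0, 400.0]  # hypothetical club values in mill. €
mean = st.mean(values)
stdev = st.pstdev(values)
z_scores = [(v - mean) / stdev for v in values]
# Standardized data has mean 0 and population standard deviation 1
print(round(st.mean(z_scores), 10))
print(round(st.pstdev(z_scores), 10))
```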
stspends = []
stvals = []
v_mean = st.mean(full_table['Value in mill €'])
v_stdev = st.pstdev(full_table['Value in mill €'])
s_mean = st.mean(full_table['Spend in mill €'])
s_stdev = st.pstdev(full_table['Spend in mill €'])
for x in range(len(full_table['Season'])):
    stspends.append((full_table['Spend in mill €'][x] - s_mean) / s_stdev)
    stvals.append((full_table['Value in mill €'][x] - v_mean) / v_stdev)
Let's add these lists to our table for future reference.
full_table['Standardized Spends'] = stspends
full_table['Standardized Values'] = stvals
xs = full_table['Standardized Spends'].values.reshape(-1,1)
ys = full_table['Points'].values.reshape(-1,1)
spend_reg = LinearRegression().fit(xs, ys)
print("Expected_points = " + str(spend_reg.coef_[0][0])+"*spend + " + str(spend_reg.intercept_[0]))
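Under the hood, simple least-squares regression has a closed form: slope = cov(x, y) / var(x) and intercept = ȳ − slope · x̄. A self-contained sketch on toy numbers (not our actual data):

```python
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # perfectly linear toy data: y = 2x + 1
n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
# Closed-form least-squares estimates
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean
print(slope, intercept)  # 2.0 1.0
```

sklearn's LinearRegression computes the same quantities, just vectorized over the full dataset.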
Now, we are going to put this model to the test and take a look at a plot of its residuals.
spend_res = []
for spend_point, point_point in zip(full_table['Standardized Spends'], full_table['Points']):
    spend_res.append(point_point - spend_reg.predict(np.array([[spend_point]]))[0][0])
full_table['Spend Residuals'] = spend_res
ggplot(aes(x='Season', y='Spend Residuals'), data=full_table) +\
geom_violin() +\
labs(title="Plot of Residuals using Spend Model", x = "Season", y = "Residual")
Using this violin plot, we can see that the residuals are spread widely in every season, which means our model is not very good at predicting points won. Ideally, we want the bulk of our residuals to be as close to zero as possible.
Also, there are a number of points that deviate from the mean by upwards of 40 points. This is significant because the highest number of points a team has finished with is in the 90s.
Now we will create another linear regression model but this time, using club value as our parameter.
xv = full_table['Standardized Values'].values.reshape(-1,1)
yv = full_table['Points'].values.reshape(-1,1)
value_reg = LinearRegression().fit(xv, yv)
print("Expected_points = " + str(value_reg.coef_[0][0])+"*value + " + str(value_reg.intercept_[0]))
val_res = []
for val_point, point_point in zip(full_table['Standardized Values'], full_table['Points']):
    val_res.append(point_point - value_reg.predict(np.array([[val_point]]))[0][0])
full_table['Value Residuals'] = val_res
ggplot(aes(x='Season', y='Value Residuals'), data=full_table) +\
geom_violin() +\
labs(title="Plot of Residuals Using Value Model", x = "Season", y = "Residuals")
Using value as the term our regression model is based on, we see an improvement in the violin plot of residuals. When we used spend only, point predictions were frequently over- or underestimated by at least 30 points. The value model does not overestimate as much, but it still underestimates quite a bit.
For this two-term model, we will be using both club value and club transfer spend to hopefully better predict how many points a club will win.
residuals = []
x_contents = []
for v, s in zip(full_table['Standardized Values'], full_table['Standardized Spends']):
    x_contents.append([v, s])
y = full_table['Points']
vs_reg = LinearRegression().fit(x_contents, np.asarray(y).reshape(-1, 1))
for real_point, data_point in zip(y, x_contents):
    residuals.append(real_point - vs_reg.predict(np.array([data_point]))[0][0])
full_table['Two Term Residuals'] = residuals
ggplot(aes(x='Season', y='Two Term Residuals'), data=full_table) +\
geom_violin() +\
labs(title="Plot of Residuals Using Value-Spend Model", x = "Season", y = "Residuals")
By using club value and transfer spend together as parameters for a linear regression model, we have improved on the club-value model, which was our best model so far.
The two-term model predicts points won better than the other models: even more residuals sit close to zero than in the previous two plots.
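As an illustration of how a two-term model turns standardized inputs into a points prediction (the coefficients below are invented for the sketch, not our fitted values):

```python
# Hypothetical fitted coefficients: intercept, value weight, spend weight
b0, b_value, b_spend = 52.0, 12.0, 3.0

def predict_points(std_value, std_spend):
    # Linear combination of the two standardized predictors
    return b0 + b_value * std_value + b_spend * std_spend

print(predict_points(1.5, 0.5))    # 71.5
print(predict_points(-1.0, -0.5))  # 38.5
```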
To conclude, the best of the three methods described here for predicting finishing points in the English Premier League is to use a combination of a club's existing value and how much the club spends during transfer periods. Our final violin plot shows that this is the case, since we can see a large number of residuals between +10 and -10. However, given how different the 2015 part of our plot looks compared to the rest of the seasons, it is worth taking a deeper look into why our two-term model was so wrong there.
In 2015, the model vastly over- and under-predicted finishing points, not only because of Leicester City's big win but also because of the poor and unexpected performances of other teams. Below is a table showing the standings from that year.
table_2015 = full_table.loc[full_table['Season']==2015].sort_values(by=['Points'], axis=0,ascending=False)
table_2015
Chelsea finished far lower than normal, a huge surprise given that they were the most valuable club that season and had won the league the previous season. Everton also finished quite low given its worth. Newcastle had probably the most disappointing season given both their worth and how much they spent in the transfer market.
Southampton, West Ham, and Stoke all finished much higher than they normally do, despite much lower club values and much lower transfer spending compared to other clubs that season.
On the whole, however, the value-spend model was much better than the single-term models we looked at. It gave lower residuals, and most of those residuals were concentrated between +10 and -10, showing that it was quite good at predicting points won.