Leicester City's championship win in the 2015/2016 English Premier League season was called a fairy tale by many across the world. Having finished 14th the season prior, they overcame 5000:1 odds to take home the title.
But this raises a question: given that Leicester were in the Football League Championship only a few seasons ago and were one of the least valuable clubs in the top division of English football, does money dictate finishing position?
This tutorial will analyze transfer expenses and overall club values when trying to answer this question. It will be split into four parts: Data Collection, Data Plotting, Linear Regression Analysis, and Conclusion.
For this part of the analysis, we will be collecting game data from ESPN and team value and transfer spend data from Transfermarkt.us.
We will be using the following libraries to scrape data from Transfermarkt and ESPN, create regression models, and construct various scatter and violin plots.
import re
import requests as r
import pandas as pd
import numpy
import numpy as np
from bs4 import BeautifulSoup
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from ggplot import *
import statistics as st
We will be getting the value data from Transfermarkt. Note that the data from the website is a string in the format XXX,XX Mill. €. Therefore, we will have to make some changes to the string so that we can convert it to a float and use it for graphs and calculations.
We have to use loops since the URL for each season is different. During this process, we will have to make some slight changes to the data that is scraped in order to be able to manipulate it for our analysis.
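As a sketch of the string cleanup we will apply (the sample figure below is made up), a Transfermarkt value string like "585,25 Mill. €" can be turned into a float like this:

```python
def parse_value(raw):
    # Strip the ' Mill. €' suffix and swap the European decimal comma for a point
    cleaned = raw.replace(' Mill. €', '').replace(',', '.')
    return float(cleaned)

print(parse_value('585,25 Mill. €'))  # 585.25
```

The scraping loop below performs the same two replacements inline rather than through a helper.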
headers = {'User-Agent': 'Chrome/47.0.2526.106'}
teamNames = []
club_value = []
years = []
for year in range(2005, 2018):
    page = "https://www.transfermarkt.us/premier-league/startseite/wettbewerb/GB1/plus/?saison_id=" + str(year)
    pageTree = r.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    names = pageSoup.find_all("a", {"class": "vereinprofil_tooltip"})
    values = pageSoup.find_all("td", {"class": "rechts hide-for-small hide-for-pad"})
    tempNames = []
    # Keep only anchors that carry a club name, stripping common suffixes/prefixes
    for index in range(len(names)):
        if len(names[index].text) > 1:
            name = names[index].text.replace('AFC ', '')
            name = name.replace(' FC', '')
            name = name.replace(' AFC', '')
            tempNames.append(name)
    # Each of the 20 clubs appears twice, so take every other entry
    for index in range(40):
        if index % 2 == 0:
            teamNames.append(tempNames[index])
            years.append(year)
    # Skip the first two cells, then every other cell holds a club value
    for index in range(2, len(values)):
        if index % 2 == 0:
            real_value = values[index].text.replace(',', '.')
            real_value = real_value.replace(' Mill. €', '')
            club_value.append(float(real_value))
Now we will put this data neatly into a DataFrame so it is easier to merge all the data later on.
values_table = pd.DataFrame({'Club':teamNames, 'Value':club_value, 'Season':years})
values_table.head()
This is a glimpse of the table, which starts in the 2005/2006 season and runs right up through the last season (2017/2018). As a reminder, the Value column is in millions of euros.
Now we will scrape each club's game data from ESPN. This will get us stats such as games played, games won, and crucially, points scored.
headers = {'User-Agent': 'Chrome/47.0.2526.106'}
teamNames = []
teamID = []
teamPoints = []
gamesPlayed = []
gamesWon = []
gamesDrawn = []
gamesLost = []
goalDifference = []
for year in range(2005, 2018):
    page = "http://www.espn.com/soccer/standings/_/league/eng.1/season/" + str(year)
    pageTree = r.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    teams = pageSoup.find_all("span", {"class": "team-names"})
    gameinfo = pageSoup.find_all("td", {"class": "", "style": "white-space:nowrap;"})
    ids = pageSoup.find_all("abbr")
    for team in teams:
        teamNames.append(team.text.replace('AFC ', ''))
    for tID in ids:
        teamID.append(tID.text)
    # Each of the 20 teams has 8 stat cells (160 total), so stride through them
    for games in range(0, 160, 8):
        gamesPlayed.append(int(gameinfo[games].text))
    for wins in range(1, 160, 8):
        gamesWon.append(int(gameinfo[wins].text))
    for draws in range(2, 160, 8):
        gamesDrawn.append(int(gameinfo[draws].text))
    for loss in range(3, 160, 8):
        gamesLost.append(int(gameinfo[loss].text))
    for gd in range(6, 160, 8):
        goalDifference.append(gameinfo[gd].text)
    for points in range(7, 160, 8):
        teamPoints.append(int(gameinfo[points].text))
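The stride-8 indexing above can be illustrated with a toy list of cells (the numbers below are invented: two teams, eight stat cells each):

```python
# Hypothetical flat list of table cells: 8 stats per team, two teams
cells = ['38', '28', '5', '5', '65', '22', '43', '89',
         '38', '25', '8', '5', '68', '30', '38', '83']
# Games played sit at offset 0 of each team's 8 cells, points at offset 7
games_played = [int(cells[i]) for i in range(0, len(cells), 8)]
points = [int(cells[i]) for i in range(7, len(cells), 8)]
print(games_played)  # [38, 38]
print(points)        # [89, 83]
```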
Now we will create a DataFrame out of this game data.
game_data = pd.DataFrame({'Club': teamNames, 'Club ID': teamID, 'Season': years, 'Games Played': gamesPlayed,
                          'Games Won': gamesWon, 'Games Drawn': gamesDrawn, 'Games Lost': gamesLost,
                          'Goal Difference': goalDifference, 'Points': teamPoints})
game_data.head()
The last part of the data collection phase is to collect data on how much teams spent each season from Transfermarkt. Again we will have to modify the expenditure strings and change them into floats. It is important to note at this point that the expenditures do NOT take into account any money made from the club selling players, only money from buying players.
headers = {'User-Agent': 'Chrome/47.0.2526.106'}
teamNames = []
teamExp = []
years = []
for year in range(2005, 2018):
    page = ("https://www.transfermarkt.us/premier-league/einnahmenausgaben/wettbewerb/GB1/plus/0?ids=a&sa=&saison_id="
            + str(year) + "&saison_id_bis=" + str(year) + "&nat=&pos=&altersklasse=&w_s=&leihe=&intern=0")
    pageTree = r.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    expenditures = pageSoup.find_all("td", {"class": "rechts hauptlink redtext"})
    teams = pageSoup.find_all("a", {"class": "vereinprofil_tooltip"})
    for tI in range(40):
        if len(teams[tI].text) > 1:
            name = teams[tI].text.replace('AFC ', '')
            name = name.replace(' FC', '')
            name = name.replace(' AFC', '')
            teamNames.append(name)
    # One expenditure cell per club; strip the currency suffix and fix the decimal separator
    for eI in range(20):
        real_exp = expenditures[eI].text.replace(' Mill. €', '')
        real_exp = real_exp.replace(',', '.')
        teamExp.append(float(real_exp))
        years.append(year)
expTable = pd.DataFrame({'Club':teamNames, 'Season':years, 'Spend':teamExp})
expTable.head()
We now have three DataFrames that contain all the information we need to conduct our analysis. To make everything simpler, we will combine all three into one large DataFrame.
full_table = values_table.merge(expTable, how='inner', on=['Club','Season'])
full_table = full_table.merge(game_data, how='inner', on=['Club','Season'])
full_table.columns = ['Club', 'Value in mill €', 'Season', 'Spend in mill €', 'Club ID', 'Games Drawn',
                      'Games Lost', 'Games Played', 'Games Won', 'Goal Difference', 'Points']
full_table.head()
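To make the inner-merge behavior concrete, here is a minimal sketch with toy data (the club names and numbers are invented): any (Club, Season) pair missing from either table is dropped from the result.

```python
import pandas as pd

values = pd.DataFrame({'Club': ['Arsenal', 'Chelsea', 'Leeds'],
                       'Season': [2005, 2005, 2005],
                       'Value': [250.0, 300.0, 80.0]})
spend = pd.DataFrame({'Club': ['Arsenal', 'Chelsea'],
                      'Season': [2005, 2005],
                      'Spend': [20.0, 90.0]})
# Inner merge keeps only rows whose (Club, Season) appears in both frames
merged = values.merge(spend, how='inner', on=['Club', 'Season'])
print(merged)  # Leeds is dropped: it has no matching row in `spend`
```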
We can start our analysis by plotting team points against club spend and club value over the years. We will only display a few plots to keep things concise. To make the plots easier to read and to show which point belongs to which team, each club's abbreviation will be displayed above its point.
spends_dict = {}
values_dict = {}
points_dict = {}
for year in range(2005, 2018):
    spend = []
    points = []
    team_ids = []
    for index in range(len(full_table['Season'])):
        if full_table['Season'][index] == year:
            spend.append(full_table['Spend in mill €'][index])
            points.append(full_table['Points'][index])
            team_ids.append(full_table['Club ID'][index])
    spends_dict[year] = spend
    points_dict[year] = points
    # Only plot every fifth season to keep the output concise
    if year % 5 == 0:
        plt.title("Club Spend vs. Points Won in " + str(year))
        plt.scatter(spend, points)
        plt.xticks(numpy.arange(min(spend), max(spend), 10))
        for i, teamID in enumerate(team_ids):
            plt.annotate(teamID, (spend[i], points[i]))
        plt.xlabel('Club Spend')
        plt.ylabel('Points Won')
        plt.show()
Like the Moneyball project from earlier in the semester, these plots show Club Spend vs. Points Won for each season in our dataset. However, unlike in the Moneyball project, the idea of spending little and winning big was not introduced by Leicester City.
Rather, we see that even in past seasons, teams like Chelsea, Manchester United, and Liverpool all spent relatively little and won big, either winning the league or coming very close to it. However, these teams are known as the giants of English football.
Perhaps then, we should add the club's existing value into the equation.
A club's value is based on its players. If individual players do well and/or the team as a whole does well, the club's value increases. This means value is not entirely dependent on how much cash a club has.
for year in range(2005, 2018):
    values = []
    points = []
    team_ids = []
    for index in range(len(full_table['Season'])):
        if full_table['Season'][index] == year:
            values.append(full_table['Value in mill €'][index])
            points.append(full_table['Points'][index])
            team_ids.append(full_table['Club ID'][index])
    values_dict[year] = values
    # As before, only plot every fifth season
    if year % 5 == 0:
        plt.title("Club Value vs. Points Won in " + str(year))
        plt.scatter(values, points)
        plt.xticks(numpy.arange(min(values), max(values), 50))
        for i, teamID in enumerate(team_ids):
            plt.annotate(teamID, (values[i], points[i]))
        plt.xlabel('Club Value')
        plt.ylabel('Points Won')
        plt.show()
Based on these plots, it seems that using actual club value removes that "big team" bias. The points are no longer split as cleanly into two clusters, one in the bottom left and one in the top right.
However, an even better way to check which measure or combination of measures is the best predictor of team success is through linear regression.
In this part, we will calculate different linear regression models that can be used to predict points won at the end of a season. We visualize how good our models are using violin plots that take residuals as data points. A residual in this analysis is calculated as $residual = actual - predicted$. The closer a model's residuals are to zero, the better that model is at predicting points won.
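For instance, with made-up actual and predicted point totals, residuals are computed like this:

```python
actual = [81, 64, 90]           # hypothetical points actually won
predicted = [75.0, 70.0, 85.0]  # hypothetical model predictions
residuals = [a - p for a, p in zip(actual, predicted)]
print(residuals)  # [6.0, -6.0, 5.0]
```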
The Club Values and Spends will have to be standardized before we can make any regression models, due to inflation in the market.
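Standardizing here means converting each figure to a z-score: subtract the mean and divide by the population standard deviation. A sketch with invented values:

```python
import statistics as st

values = [100.0, 200.0, 300.0, 400.0]  # hypothetical club values in mill. €
mean = st.mean(values)
stdev = st.pstdev(values)
z_scores = [(v - mean) / stdev for v in values]
# Standardized data has mean 0 and population standard deviation 1
print(round(st.mean(z_scores), 10))
print(round(st.pstdev(z_scores), 10))
```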
stspends = []
stvals = []
v_mean = st.mean(full_table['Value in mill €'])
v_stdev = st.pstdev(full_table['Value in mill €'])
s_mean = st.mean(full_table['Spend in mill €'])
s_stdev = st.pstdev(full_table['Spend in mill €'])
for x in range(len(full_table['Season'])):
    stspends.append((full_table['Spend in mill €'][x] - s_mean) / s_stdev)
    stvals.append((full_table['Value in mill €'][x] - v_mean) / v_stdev)
Let's add these lists to our table for future reference.
full_table['Standardized Spends'] = stspends
full_table['Standardized Values'] = stvals
xs = full_table['Standardized Spends'].values.reshape(-1,1)
ys = full_table['Points'].values.reshape(-1,1)
spend_reg = LinearRegression().fit(xs, ys)
print("Expected_points = " + str(spend_reg.coef_[0][0])+"*spend + " + str(spend_reg.intercept_[0]))
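Under the hood, simple least-squares regression has a closed form: slope = cov(x, y) / var(x) and intercept = ȳ − slope · x̄. A self-contained sketch on toy numbers (not our actual data):

```python
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # perfectly linear toy data: y = 2x + 1
n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
# Closed-form least-squares estimates
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean
print(slope, intercept)  # 2.0 1.0
```

sklearn's LinearRegression computes the same quantities, just vectorized over the full dataset.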
Now, we are going to put this model to the test and take a look at a plot of its residuals.
spend_res = []
for spend_point, point_point in zip(full_table['Standardized Spends'], full_table['Points']):
    spend_res.append(point_point - spend_reg.predict(np.array([[spend_point]]))[0][0])
full_table['Spend Residuals'] = spend_res
ggplot(aes(x='Season', y='Spend Residuals'), data=full_table) +\
geom_violin() +\
labs(title="Plot of Residuals using Spend Model", x = "Season", y = "Residual")
Using this violin plot, we can see that the residuals are spread widely in every season, which means our model is not very good at predicting points won. Ideally, we want the bulk of our residuals to be as close to zero as possible.
Also, there are a number of points that deviate from the mean by upwards of 40 points. This is significant because the highest number of points a team has finished with is in the 90s.
Now we will create another linear regression model but this time, using club value as our parameter.
xv = full_table['Standardized Values'].values.reshape(-1,1)
yv = full_table['Points'].values.reshape(-1,1)
value_reg = LinearRegression().fit(xv, yv)
print("Expected_points = " + str(value_reg.coef_[0][0])+"*value + " + str(value_reg.intercept_[0]))
val_res = []
for val_point, point_point in zip(full_table['Standardized Values'], full_table['Points']):
    val_res.append(point_point - value_reg.predict(np.array([[val_point]]))[0][0])
full_table['Value Residuals'] = val_res
ggplot(aes(x='Season', y='Value Residuals'), data=full_table) +\
geom_violin() +\
labs(title="Plot of Residuals Using Value Model", x = "Season", y = "Residuals")
Using value as the term our regression model is based on, we see an improvement in the violin plot of residuals. When we used spend only, point predictions were frequently over- or underestimated by at least 30 points. The value model does not overestimate as much, but it still underestimates quite a bit.
For this two-term model, we will be using both club value and club transfer spend to hopefully better predict how many points a club will win.
residuals = []
x_contents = []
for v, s in zip(full_table['Standardized Values'], full_table['Standardized Spends']):
    x_contents.append([v, s])
y = full_table['Points']
vs_reg = LinearRegression().fit(x_contents, np.asarray(y).reshape(-1, 1))
for real_point, data_point in zip(y, x_contents):
    residuals.append(real_point - vs_reg.predict(np.array([data_point]))[0][0])
full_table['Two Term Residuals'] = residuals
ggplot(aes(x='Season', y='Two Term Residuals'), data=full_table) +\
geom_violin() +\
labs(title="Plot of Residuals Using Value-Spend Model", x = "Season", y = "Residuals")
By using club value and transfer spend together as parameters for a linear regression model, we have improved on the club-value model, which was our best model so far.
The two-term model predicts points won better than the other models: even more residuals sit close to zero than in the previous two plots.
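As an illustration of how a two-term model turns standardized inputs into a points prediction (the coefficients below are invented for the sketch, not our fitted values):

```python
# Hypothetical fitted coefficients: intercept, value weight, spend weight
b0, b_value, b_spend = 52.0, 12.0, 3.0

def predict_points(std_value, std_spend):
    # Linear combination of the two standardized predictors
    return b0 + b_value * std_value + b_spend * std_spend

print(predict_points(1.5, 0.5))    # 71.5
print(predict_points(-1.0, -0.5))  # 38.5
```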
To conclude, the best of the three methods described here for predicting finishing points in the English Premier League is to use a combination of a club's existing value and how much the club spends during transfer periods. Our final violin plot shows that this is the case, since we can see a large number of residuals between +10 and -10. However, given how different the 2015 part of our plot looks compared to the rest of the seasons, it is worth taking a deeper look into why our two-term model was so wrong there.
In 2015, the model vastly over- and under-predicted finishing points, not only because of Leicester City's big win but also because of the poor and unexpected performances of other teams. Below is a table showing the standings from that year.
table_2015 = full_table.loc[full_table['Season']==2015].sort_values(by=['Points'], axis=0,ascending=False)
table_2015
Chelsea finished far lower than normal, a huge surprise given that they were the most valuable club that season and had won the league the previous season. Everton also finished quite low given its worth. Newcastle had probably the most disappointing season given both their worth and how much they spent in the transfer market.
Southampton, West Ham, and Stoke all finished much higher than they normally do, despite much lower club values and much lower transfer spending compared to other clubs that season.
On the whole, however, the value-spend model was much better than the single-term models we looked at. It gave lower residuals, and most of those residuals were concentrated between +10 and -10, showing that it was quite good at predicting points won.