<img src ="epl-logo-full-size.png">

Analysing How Money Spent and Club Value Affect Finishing Place in the English Premier League

Author: Asad Zaidi

Leicester City's championship win in the 2015/2016 English Premier League season was called a fairy tale by many across the world. Having finished 14th the season before, they overcame 5000:1 odds to take home the title.

This raises a question: given that Leicester were in the Football League Championship only a few seasons earlier and were one of the least valuable clubs in the top division of English football, does money dictate finishing position?

This tutorial will analyze transfer expenses and overall club values when trying to answer this question. It will be split into four parts: Data Collection, Data Plotting, Linear Regression Analysis, and Conclusion.

Part 1: Data Collection

For this part of the analysis, we will be collecting game data from ESPN, and team value and transfer spend data from Transfermarkt.us.

1.1 Importing Required Libraries

We will be using the following libraries to scrape data from Transfermarkt and ESPN, create regression models, and construct various scatter and violin plots.


In [1]:
import requests as r
import pandas as pd
import numpy
import numpy as np  # Part 3 refers to NumPy as np
import statistics as st
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from sklearn.linear_model import LinearRegression
from ggplot import *

1.2 Scraping Club Value Data

We will be getting the value data from Transfermarkt. Note that the data from the website is a string in the format XXX,XX Mill. €. Therefore, we will have to make some changes to the string so that we can convert it to a float and use it for graphs and calculations.

We have to use loops since the URL for each season is different. During this process, we will have to make some slight changes to the data that is scraped in order to be able to manipulate it for our analysis.

In [2]:
headers = {'User-Agent': 'Chrome/47.0.2526.106'}
teamNames = []
club_value = []
years = []
for year in range(2005, 2018):
    page = "https://www.transfermarkt.us/premier-league/startseite/wettbewerb/GB1/plus/?saison_id="+str(year)
    pageTree = r.get(page, headers = headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    names = pageSoup.find_all("a", {"class": "vereinprofil_tooltip"})
    values = pageSoup.find_all("td", {"class": "rechts hide-for-small hide-for-pad"})
    tempNames=[]
    for index in range(len(names)):
        if len(names[index].text)>1:
            name = names[index].text.replace('AFC ', '')
            name = name.replace(' FC', '')
            name = name.replace(' AFC', '')
            tempNames.append(name)
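    # Keep every second entry of the first 40 names (this assumes each of the 20 clubs is listed twice).
    for index in range(40):
        if (index%2==0):
            teamNames.append(tempNames[index])
            years.append(year)
    # Skip the first two cells, then keep every second one.
    for index in range(2, len(values)):
        if (index%2==0):
            # '352,18 Mill. €' -> 352.18
            real_value = values[index].text.replace(',','.')
            real_value = real_value.replace(' Mill. €', '')
            club_value.append(float(real_value))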
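
The string-to-float conversion inside the loop above can also be pulled out into a small helper. The following is a minimal sketch, not part of the original notebook: the name `parse_mill_euro` is hypothetical, and it assumes every value uses the "XXX,XX Mill. €" format described earlier. Anything else is rejected rather than guessed.

```python
def parse_mill_euro(text):
    """Convert a string such as '352,18 Mill. €' to a float in millions of euros.

    Hypothetical helper: assumes the ' Mill. €' suffix and a decimal comma.
    """
    suffix = ' Mill. €'
    cleaned = text.strip()
    if not cleaned.endswith(suffix):
        raise ValueError("unexpected value format: " + repr(text))
    # Drop the suffix, then swap the decimal comma for a point.
    number = cleaned[:-len(suffix)].replace(',', '.')
    return float(number)


chelsea_value = parse_mill_euro('352,18 Mill. €')
```

Raising an error on an unknown suffix makes a change in the page layout visible immediately instead of producing a silently wrong number.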

Now we will put this data neatly into a DataFrame so it is easier to merge all the data later on.

In [3]:
values_table = pd.DataFrame({'Club':teamNames, 'Value':club_value, 'Season':years})
values_table.head()
Out[3]:
Club Value Season
0 Chelsea 352.18 2005
1 Manchester United 286.93 2005
2 Arsenal 218.95 2005
3 Liverpool 187.40 2005
4 Tottenham Hotspur 145.50 2005

This is a glimpse of the table, which starts in the 2005/2006 season and runs through the 2017/2018 season. As a reminder, the Value column is in millions of euros.

1.3 Scraping Game Data

Now we will scrape each club's game data from ESPN. This will get us stats such as games played, games won, and crucially, points scored.

In [6]:
headers = {'User-Agent': 'Chrome/47.0.2526.106'}
teamNames=[]
teamID = []
teamPoints = []
gamesPlayed = []
gamesWon = []
gamesDrawn = []
gamesLost = []
goalDifference = []

for year in range(2005,2018):
    page = "http://www.espn.com/soccer/standings/_/league/eng.1/season/"+str(year)
    pageTree = r.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    teams = pageSoup.find_all("span", {"class":"team-names"})
    gameinfo = pageSoup.find_all("td", {"class":"","style":"white-space:nowrap;"})
    ids = pageSoup.find_all("abbr")
    for team in teams:
        teamNames.append(team.text.replace('AFC ',''))
    for tID in ids:
        teamID.append(tID.text)
    #print(len(gameinfo))
    for games in range(0, 160,8):
        gamesPlayed.append(int(gameinfo[games].text))
    for wins in range(1,160,8):
        gamesWon.append(int(gameinfo[wins].text))
    for draws in range(2,160,8):
        gamesDrawn.append(int(gameinfo[draws].text))
    for loss in range(3,160,8):
        gamesLost.append(int(gameinfo[loss].text))
    for gd in range(6,160,8):
        goalDifference.append(gameinfo[gd].text)
    for points in range(7,160,8):
        teamPoints.append(int(gameinfo[points].text))
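
The stride-8 loops above rely on the page yielding eight cells per club in a fixed order. As a rough sketch of the same idea, the flat cell list can be cut into one chunk per club. The function name `rows_from_cells` is hypothetical, the sample list reuses the Chelsea and Manchester United figures from the output with placeholders in positions 4 and 5 (which the original loops skip), and the eight-cell layout is assumed from the loops rather than verified against the page.

```python
def rows_from_cells(cells, width=8):
    """Group a flat list of cell strings into one dict per club.

    Assumes `width` cells per club, laid out as in the loops above:
    0 = played, 1 = won, 2 = drawn, 3 = lost, 6 = goal difference, 7 = points.
    """
    if len(cells) % width != 0:
        raise ValueError("cell count is not a multiple of " + str(width))
    rows = []
    for start in range(0, len(cells), width):
        chunk = cells[start:start + width]
        rows.append({
            'Games Played': int(chunk[0]),
            'Games Won': int(chunk[1]),
            'Games Drawn': int(chunk[2]),
            'Games Lost': int(chunk[3]),
            'Goal Difference': chunk[6],  # kept as text, e.g. '+50'
            'Points': int(chunk[7]),
        })
    return rows


# Flat cell list for two clubs; 'x' marks the two positions the loops ignore.
sample = ['38', '29', '4', '5', 'x', 'x', '+50', '91',
          '38', '25', '8', '5', 'x', 'x', '+38', '83']
table = rows_from_cells(sample)
```

Checking that the cell count is a multiple of eight catches a layout change that the hard-coded `range(..., 160, 8)` would otherwise turn into an `IndexError` or misaligned columns.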

Now we will create a DataFrame out of this game data.

In [7]:
game_data = pd.DataFrame({'Club':teamNames, 'Club ID':teamID, 'Season':years, 'Games Played':gamesPlayed, 'Games Won':gamesWon, 
                          'Games Drawn':gamesDrawn,'Games Lost':gamesLost, 'Goal Difference':goalDifference, 'Points':teamPoints})
game_data.head()
Out[7]:
Club Club ID Season Games Played Games Won Games Drawn Games Lost Goal Difference Points
0 Chelsea CHE 2005 38 29 4 5 +50 91
1 Manchester United MAN 2005 38 25 8 5 +38 83
2 Liverpool LIV 2005 38 25 7 6 +32 82
3 Arsenal ARS 2005 38 20 7 11 +37 67
4 Tottenham Hotspur TOT 2005 38 18 11 9 +15 65

1.4 Scraping Club Spend Data

The last part of the data collection phase is to collect data from Transfermarkt on how much each team spent each season. Again we will have to modify the expenditure strings and convert them to floats. Note that the expenditures do NOT take into account any money made from selling players, only money spent on buying players.

In [8]:
headers = {'User-Agent': 'Chrome/47.0.2526.106'}

teamNames = []
teamExp = []
years = []
for year in range(2005,2018):
    page = "https://www.transfermarkt.us/premier-league/einnahmenausgaben/wettbewerb/GB1/plus/0?ids=a&sa=&saison_id="+str(year)+"&saison_id_bis="+str(year)+"&nat=&pos=&altersklasse=&w_s=&leihe=&intern=0"
    pageTree = r.get(page,headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    expenditures=pageSoup.find_all("td",{"class":"rechts hauptlink redtext"})
    teams = pageSoup.find_all("a",{"class":"vereinprofil_tooltip"})
    
    for tI in range(40):
        if len(teams[tI].text)>1:
            name = teams[tI].text.replace('AFC ', '')
            name = name.replace(' FC', '')
            name = name.replace(' AFC', '')
            teamNames.append(name)
    for eI in range(20):
        real_exp = expenditures[eI].text.replace(' Mill. €', '')
        real_exp = real_exp.replace(',','.')
        teamExp.append(float(real_exp))
        years.append(year)
In [9]:
expTable = pd.DataFrame({'Club':teamNames, 'Season':years, 'Spend':teamExp})
expTable.head()
Out[9]:
Club Season Spend
0 Chelsea 2005 91.50
1 Newcastle United 2005 61.25
2 Arsenal 2005 46.00
3 Liverpool 2005 44.06
4 Tottenham Hotspur 2005 36.51

1.5 Data Tidying

We now have 3 DataFrames that contain all the information we need to conduct our analysis. To make everything simpler, we will combine all three into one large DataFrame.

In [10]:
full_table = values_table.merge(expTable, how='inner', on=['Club','Season'])
full_table = full_table.merge(game_data, how='inner', on=['Club','Season'])
# Rename only the value and spend columns so the game columns keep their original labels.
full_table = full_table.rename(columns={'Value': 'Value in mill €', 'Spend': 'Spend in mill €'})
full_table.head()
Out[10]:
Club Value in mill € Season Spend in mill € Club ID Games Played Games Won Games Drawn Games Lost Goal Difference Points
0 Chelsea 352.18 2005 91.50 CHE 38 29 4 5 +50 91
1 Manchester United 286.93 2005 31.80 MAN 38 25 8 5 +38 83
2 Arsenal 218.95 2005 46.00 ARS 38 20 7 11 +37 67
3 Liverpool 187.40 2005 44.06 LIV 38 25 7 6 +32 82
4 Tottenham Hotspur 145.50 2005 36.51 TOT 38 18 11 9 +15 65
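
An inner merge silently drops any club-season row whose club name is not spelled identically in both sources. One way to check for this, sketched here with toy frames rather than the scraped data, is an outer merge with pandas' `indicator=True` option. The club names 'Club A' and 'Club A FC' are made up to illustrate a mismatch.

```python
import pandas as pd

# Toy stand-ins for two scraped tables; the last row is deliberately mismatched.
left = pd.DataFrame({'Club': ['Chelsea', 'Arsenal', 'Club A'],
                     'Season': [2005, 2005, 2005],
                     'Value': [352.18, 218.95, 10.0]})
right = pd.DataFrame({'Club': ['Chelsea', 'Arsenal', 'Club A FC'],
                      'Season': [2005, 2005, 2005],
                      'Spend': [91.50, 46.00, 1.0]})

# An outer merge with indicator=True labels each row by the table it came from.
check = left.merge(right, how='outer', on=['Club', 'Season'], indicator=True)
unmatched = check[check['_merge'] != 'both']

# The inner merge keeps only rows present in both tables.
inner = left.merge(right, how='inner', on=['Club', 'Season'])
print(unmatched[['Club', 'Season', '_merge']])
```

If `unmatched` is empty for the real tables, the inner merges above lose nothing; otherwise the listed names show which replacements are still needed.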

Part 2: Data Plotting

2.1 Plotting Club Transfer Spend vs. Points Won

We can start our analysis by plotting points won against club transfer spend for each season. We will only display a few plots to keep it concise. To make the plots easier to read and to show which club each point represents, each club's abbreviation is displayed next to its point.

In [11]:
spends_dict = {}
values_dict = {}
points_dict = {}
In [12]:
for year in range(2005,2018):
    spend = []
    points = []
    team_ids = []
    for index in range(len(full_table['Season'])):
        if (full_table['Season'][index]==year):
            spend.append(full_table['Spend in mill €'][index])
            points.append(full_table['Points'][index])
            team_ids.append(full_table['Club ID'][index])
    spends_dict[year] = spend
    points_dict[year] = points
    # Only draw every fifth season (2005, 2010, 2015) to keep the output short.
    if year%5==0:
        plt.title("Club Spend vs. Points Won in "+str(year))
        plt.scatter(spend, points)
        # Label each point with the club's abbreviation.
        for i, teamID in enumerate(team_ids):
            plt.annotate(teamID, (spend[i], points[i]))
        plt.xlabel('Club Spend (mill €)')
        plt.ylabel('Points Won')

        plt.show()

Like the Moneyball project earlier in the semester, these plots show Club Spend vs. Points Won for selected seasons in our dataset. However, unlike in the Moneyball project, the idea of spending little and winning big was not introduced by Leicester City.

Rather, we see that even in past seasons, teams like Chelsea, Manchester United, and Liverpool spent relatively little and won big, either winning the league or coming very close to it. However, all of these teams are known as giants of English football.

Perhaps then, we should add the club's existing value into the equation.

2.2 Plotting Club Value vs. Points Won

A club's value is based on its players. If players do well and/or the team as a whole does well, the club's value increases. This means that value is not entirely dependent on how much cash a club has.

In [13]:
for year in range(2005,2018):
    values = []
    points = []
    team_ids = []
    for index in range(len(full_table['Season'])):
        if (full_table['Season'][index]==year):
            values.append(full_table['Value in mill €'][index])
            points.append(full_table['Points'][index])
            team_ids.append(full_table['Club ID'][index])
    values_dict[year]=values
    # Only draw every fifth season (2005, 2010, 2015) to keep the output short.
    if year%5==0:
        plt.title("Club Value vs. Points Won in "+str(year))
        plt.scatter(values, points)
        # Label each point with the club's abbreviation.
        for i, teamID in enumerate(team_ids):
            plt.annotate(teamID, (values[i], points[i]))
        plt.xlabel('Club Value (mill €)')
        plt.ylabel('Points Won')

        plt.show()
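
The plotting cells above walk the table row by row to pick out one season. A boolean mask can make the same selection in one step. The sketch below uses a toy table with the same column names as `full_table`; the helper name `season_columns` is hypothetical and the 2010 row is made up.

```python
import pandas as pd

def season_columns(table, year, x_col, y_col='Points', id_col='Club ID'):
    """Return the x values, y values and labels for one season as lists."""
    rows = table[table['Season'] == year]
    return rows[x_col].tolist(), rows[y_col].tolist(), rows[id_col].tolist()


# Toy table; only the two 2005 rows reuse figures shown earlier.
toy = pd.DataFrame({'Season': [2005, 2005, 2010],
                    'Club ID': ['CHE', 'ARS', 'CHE'],
                    'Value in mill €': [352.18, 218.95, 400.0],
                    'Points': [91, 67, 71]})
values, points, team_ids = season_columns(toy, 2005, 'Value in mill €')
```

The three returned lists can be passed to `plt.scatter` and `plt.annotate` exactly as in the loops above.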

Based on these plots, it seems that using the actual club value removes that "big team" bias. The points are less clearly split into two groups: there are not always two clusters, one at the bottom left and one at the top right.

However, an even better way to check which measure or combination of measures is the best predictor of team success is through linear regression.

Part 3: Linear Regression Analysis

In this part, we will be calculating different linear regression models that can be used to predict points won at the end of a season. We visualize how good our models are by using violin plots which will take residuals as data points. A residual in the context of this analysis will be calculated as $residual = actual - predicted$. The closer residuals from a given model are to zero, the better that model is at predicting points won in this case.

3.1 Standardizing Data

The Club Values and Spends will have to be standardized before we can make any regression models, due to inflation in the market.

In [15]:
stspends = []
stvals = []
v_mean = st.mean((full_table['Value in mill €'])[0:])
v_stdev = st.pstdev((full_table['Value in mill €'])[0:])
s_mean = st.mean((full_table['Spend in mill €'])[0:])
s_stdev = st.pstdev((full_table['Spend in mill €'])[0:])
for x in range(len(full_table['Season'])):
    stspends.append((full_table['Spend in mill €'][x] - s_mean)/s_stdev)
    stvals.append((full_table['Value in mill €'][x] - v_mean)/v_stdev)

Let's add these lists to our table for future reference.

In [16]:
full_table['Standardized Spends'] = stspends
full_table['Standardized Values'] = stvals
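
The standardization above is the usual z-score, $z = (x - \mu)/\sigma$, with the population standard deviation. A vectorized pandas version is sketched below on toy numbers. The per-season variant at the end is an assumption on my part, not what the cell above does; it is one possible way to offset season-to-season inflation more directly.

```python
import pandas as pd

def zscore(series):
    """Standardize with the population standard deviation (ddof=0), like st.pstdev."""
    return (series - series.mean()) / series.std(ddof=0)


# Toy data; the numbers are illustrative, not scraped.
toy = pd.DataFrame({'Season': [2005, 2005, 2017, 2017],
                    'Spend': [10.0, 30.0, 100.0, 300.0]})

# Same idea as the cell above: one mean and one deviation for the whole table.
toy['Global z'] = zscore(toy['Spend'])

# Alternative (assumption, not the original method): standardize within each season.
toy['Season z'] = toy.groupby('Season')['Spend'].transform(zscore)
```

With the global version, later seasons still sit higher simply because all amounts grew; with the per-season version, each club is measured against its rivals in the same season.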

3.2 Creating Spend Model

In [17]:
xs = full_table['Standardized Spends'].values.reshape(-1,1)
ys = full_table['Points'].values.reshape(-1,1)
spend_reg = LinearRegression().fit(xs, ys)
print("Expected_points = " + str(spend_reg.coef_[0][0])+"*spend + " + str(spend_reg.intercept_[0]))
Expected_points = 8.056491817399316*spend + 52.19230769230769

3.3 Plotting Spend Model Residuals

Now we are going to put this model to the test and take a look at a plot of its residuals.

In [18]:
spend_res = []
for spend_point, point_point in zip(full_table['Standardized Spends'], full_table['Points']):
    spend_res.append(point_point - spend_reg.predict(np.array([[spend_point]]))[0][0])
full_table['Spend Residuals'] = spend_res
ggplot(aes(x='Season', y='Spend Residuals'), data=full_table) +\
    geom_violin() +\
    labs(title="Plot of Residuals using Spend Model", x = "Season", y = "Residual")
Out[18]:
<ggplot: (118679829722)>

Using this violin plot, we can see that the residuals are widely spread in every season. This means that our model is not very good at predicting points won. Ideally, we want the bulk of our residuals to be as close to zero as possible.

Also, a number of predictions are off by upwards of 40 points. This is significant because even title-winning teams usually finish with around 90 points.
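
The residuals in 3.3 are computed one row at a time. Since `predict` accepts the whole input array, the same $residual = actual - predicted$ calculation can be done in a single step. The sketch below uses a small made-up data set in place of the standardized spend and points columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data standing in for standardized spend and points; values are illustrative.
x = np.array([-1.0, 0.0, 1.0, 2.0]).reshape(-1, 1)
y = np.array([40.0, 50.0, 65.0, 70.0])

reg = LinearRegression().fit(x, y)

# residual = actual - predicted, computed for every row at once.
residuals = y - reg.predict(x)
```

In the notebook this would correspond to something like `full_table['Points'] - spend_reg.predict(xs).ravel()`, where `ravel()` is needed because the model was fitted on a two-dimensional target.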

3.4 Creating Value Model

Now we will create another linear regression model, this time using club value as our predictor.

In [19]:
xv = full_table['Standardized Values'].values.reshape(-1,1)
yv = full_table['Points'].values.reshape(-1,1)
value_reg = LinearRegression().fit(xv, yv)
print("Expected_points = " + str(value_reg.coef_[0][0])+"*value + " + str(value_reg.intercept_[0]))
Expected_points = 12.735533479545966*value + 52.19230769230769

3.5 Plotting Value Model Residuals

In [20]:
val_res = []
for val_point, point_point in zip(full_table['Standardized Values'], full_table['Points']):
    val_res.append(point_point - value_reg.predict(np.array([[val_point]]))[0][0])
full_table['Value Residuals'] = val_res
ggplot(aes(x='Season', y='Value Residuals'), data=full_table) +\
    geom_violin() +\
    labs(title="Plot of Residuals Using Value Model", x = "Season", y = "Residuals")
Out[20]:
<ggplot: (118679831500)>

Using value as the single term in our regression model, we see an improvement in the violin plot of residuals. With spend alone, predictions were frequently off by 30 points or more in either direction. The value model does not overestimate as much, but it still underestimates quite a bit.

3.6 Creating Two-Term Regression Model

For this two-term model, we will be using both club value and club transfer spend to hopefully better predict how many points a club will win.

In [21]:
residuals = []
x_contents = []
for v,s in zip(full_table['Standardized Values'], full_table['Standardized Spends']):
    x_contents.append([v,s])
y = full_table['Points']
vs_reg = LinearRegression().fit(x_contents, np.asarray(y).reshape(-1,1))

3.7 Plotting Residuals Using Two-Term Model

In [22]:
for real_point, data_point in zip(y, x_contents):
    residuals.append(real_point-(vs_reg.predict(np.array([data_point]))[0][0]))
full_table['Two Term Residuals'] = residuals
ggplot(aes(x='Season', y='Two Term Residuals'), data=full_table) +\
    geom_violin() +\
    labs(title="Plot of Residuals Using Value-Spend Model", x = "Season", y = "Residuals")
Out[22]:
<ggplot: (118679832377)>

By using both club value and transfer spend as predictors in a linear regression model, we have made some improvement over the club value model, which was the best model so far.

The two-term model predicts points won better than the other two models: even more of its residuals lie close to zero.
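
The comparison above is made by eye from the violin plots. A single number per model, such as the root-mean-square error (RMSE) of the residuals, can back it up. The sketch below uses synthetic data rather than the scraped table, so the printed figures say nothing about the real models; the helper name `rmse` and all the generated numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rmse(model, X, y):
    """Root-mean-square error of a fitted model on (X, y)."""
    residuals = y - model.predict(X)
    return float(np.sqrt(np.mean(residuals ** 2)))


# Synthetic data: points depend on both predictors plus noise (illustrative only).
rng = np.random.default_rng(0)
value = rng.normal(size=200)
spend = 0.5 * value + rng.normal(size=200)
points = 52 + 12 * value + 3 * spend + rng.normal(scale=8, size=200)

X_value = value.reshape(-1, 1)
X_spend = spend.reshape(-1, 1)
X_both = np.column_stack([value, spend])

scores = {
    'spend only': rmse(LinearRegression().fit(X_spend, points), X_spend, points),
    'value only': rmse(LinearRegression().fit(X_value, points), X_value, points),
    'value + spend': rmse(LinearRegression().fit(X_both, points), X_both, points),
}
for name, score in scores.items():
    print(name, round(score, 2))
```

One caveat: when the error is measured on the same rows used for fitting, a least-squares model with an extra term can never score worse than the model without it. A fairer comparison would fit on some seasons and measure the error on seasons held out from fitting.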

Part 4: Conclusion

To conclude, of the three methods described here, the best way to predict a club's final points total in the English Premier League is to use a combination of the club's existing value and how much it spends during transfer periods. Our final violin plot shows that this is the case, since a large number of residuals lie between -10 and +10. However, given how different the 2015 part of our plot looks from the rest of the seasons in this analysis, it is worth taking a deeper look at why our two-term model was so wrong for that season.

In 2015, the model vastly over- and underpredicted points totals, not only because of Leicester City's big win but also because of the poor and unexpected performances of other teams. Below is a table showing the standings from that season.

In [56]:
table_2015 = full_table.loc[full_table['Season']==2015].sort_values(by=['Points'], axis=0,ascending=False)
table_2015
Out[56]:
Club Value in mill € Season Spend in mill € Club ID Games Played Games Won Games Drawn Games Lost Goal Difference Points Standardized Spends Standardized Values Spend Residuals Value Residuals Two Term Residuals
218 Leicester City 91.25 2015 49.90 LEI 38 23 12 3 +32 81 0.000053 -0.661359 28.807269 37.230457 39.005143
202 Arsenal 408.60 2015 26.50 ARS 38 20 11 7 +29 71 -0.491378 1.523812 22.766477 -0.598862 -6.437211
205 Tottenham Hotspur 253.75 2015 71.00 TOT 38 19 13 6 +34 70 0.443180 0.457564 14.237215 11.980372 12.330844
201 Manchester City 452.75 2015 208.30 MNC 38 19 9 10 +30 66 3.326660 1.827814 -12.993521 -9.470500 -2.528487
203 Manchester United 374.15 2015 156.00 MAN 38 19 9 10 +14 66 2.228292 1.286600 -4.144523 -2.577844 1.905016
208 Southampton 187.50 2015 60.10 SOU 38 18 9 11 +18 63 0.214266 0.001387 9.081461 10.790022 11.549300
209 West Ham United 185.20 2015 52.70 WHU 38 16 14 8 +14 62 0.058856 -0.014450 9.333518 9.991716 10.240072
204 Liverpool 325.00 2015 125.40 LIV 38 16 12 10 +13 60 1.585652 0.948169 -4.967097 -4.267742 -1.165276
212 Stoke City 127.25 2015 53.65 STK 38 14 9 15 -14 51 0.078807 -0.413475 -1.827219 4.073515 5.463546
200 Chelsea 579.80 2015 90.50 CHE 38 12 14 12 +6 50 0.852706 2.702640 -9.062125 -36.611874 -40.826865
210 Swansea City 138.20 2015 21.60 SWA 38 12 11 15 -10 47 -0.594285 -0.338077 -0.404458 -0.886721 -2.095873
206 Everton 204.13 2015 48.90 EVE 38 11 14 13 +4 47 -0.020949 0.115896 -5.023534 -6.668310 -7.053871
216 Watford 110.13 2015 83.53 WAT 38 12 9 17 -10 45 0.706327 -0.531358 -12.882823 -0.425184 3.515736
214 West Bromwich Albion 117.00 2015 42.90 WBA 38 10 13 15 -14 43 -0.146957 -0.484053 -8.008352 -3.027633 -2.252179
213 Crystal Palace 121.00 2015 28.80 CRY 38 11 9 18 -12 42 -0.443075 -0.456510 -6.622676 -4.378405 -4.731329
219 Bournemouth 68.03 2015 55.11 BOU 38 11 9 18 -22 42 0.109469 -0.821245 -11.074247 0.266684 2.859994
211 Sunderland 128.25 2015 66.05 SUN 38 9 12 17 -14 39 0.339224 -0.406589 -15.925261 -8.014178 -5.715279
207 Newcastle United 200.65 2015 107.91 NEW 38 9 10 19 -21 37 1.218339 0.091934 -25.007843 -16.363138 -12.271304
217 Norwich City 103.75 2015 47.55 NOR 38 9 7 22 -28 34 -0.049301 -0.575288 -17.795118 -10.865703 -9.497702
215 Aston Villa 112.30 2015 66.45 AVL 38 3 8 27 -49 17 0.347624 -0.516416 -37.992939 -28.615477 -25.991988

Chelsea finished far lower than normal, a huge surprise given that they were the most valuable club that season and had won the league the previous season. Everton also finished quite low given its worth. Newcastle probably had the most disappointing season given both their worth and how much they spent in the transfer market.

Southampton, West Ham, and Stoke all finished much higher than they normally do, with much lower club values and much lower transfer spending than other clubs that season.

On the whole, however, the value-spend model was better than the single-term models we looked at. It gave lower residuals, and most of those residuals were concentrated between -10 and +10, showing that it predicted points won reasonably well for most clubs.