Setting up an Expected Goals model can be done in a number of ways, for example with the help of exponential decay, machine learning or some kind of regression model. I’ve chosen to use a logistic regression model because I think it has several advantages. Logistic regression is mostly used when the dependent variable has only two possible values, which translates well to football since a shot ends up either as a goal (1) or no goal (0). It also returns a calculated probability, i.e. our xG value. Finally, it’s very easy to set up a logistic regression model and tinker with different variables using Python’s statsmodels library.
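To make the "calculated probability" part concrete, here is a minimal sketch of how a logistic model turns shot features into an xG value. The coefficients, the intercept and the function name `xg_from_features` are invented for illustration and are not the values from my fitted model.

```python
import math


def xg_from_features(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(coef * value))))."""
    linear = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients, made up for this example only.
coefficients = {"distance": -0.10, "angle": 1.5}
intercept = -1.0

# A close, central shot versus a long-range shot from a narrow angle.
close_shot = xg_from_features({"distance": 8.0, "angle": 0.8}, coefficients, intercept)
far_shot = xg_from_features({"distance": 25.0, "angle": 0.3}, coefficients, intercept)
print(round(close_shot, 3), round(far_shot, 3))
```

Whatever the coefficients are, the output always lands between 0 and 1, which is what lets us read it as a goal probability.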
First off, the dataset needs to be divided into two parts: one for training (constructing) the model and one for testing it. This is done so that overfitting doesn’t go unnoticed: if the same data is used for constructing the model as for evaluating it, the model may look better than it really is.
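As a rough sketch, a split by season (I train on 2011-2014 and test on 2015 further down) could look like this. The record layout and the field names `season` and `goal` are assumptions for the example, not my actual data format.

```python
# Made-up shot records; only the season matters for the split.
shots = [
    {"season": 2011, "goal": 0},
    {"season": 2012, "goal": 1},
    {"season": 2013, "goal": 0},
    {"season": 2014, "goal": 0},
    {"season": 2015, "goal": 1},
    {"season": 2015, "goal": 0},
]

TEST_SEASON = 2015

# Earlier seasons build the model, the held-out season evaluates it.
train = [shot for shot in shots if shot["season"] < TEST_SEASON]
test = [shot for shot in shots if shot["season"] == TEST_SEASON]

print(len(train), "training shots,", len(test), "test shots")
```

Splitting on season rather than at random keeps the test closer to how the model would really be used: on matches it hasn’t seen yet.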
I’ve chosen a number of variables which all in some way make sense to test in the model. They include:
- League: Could goal expectancy differ between Allsvenskan and Superettan? Since this variable isn’t numerical, it’s been recoded to either 0 (Allsvenskan) or 1 (Superettan).
- Attempt type: It is obvious to anyone interested in football that the goal expectancy for regular shots and penalties is completely different. This variable has also been recoded to either 0 (shot) or 1 (penalty).
- Distance to the center of the goal: It is probably easier to score the closer to goal the shot is taken.
- Angle: The distance to goal doesn’t tell the whole story of shot location, as shots taken from the same distance but at different angles should have different expectancies. A higher angle means a more central position, which would probably be easier to score from.
- Game State: There’s been some work done on the importance of game state in football, and its use can be debated, but I’m at least going to try it out. It works by crediting teams who spend time having the lead. Teams start every game level at Game State 0. Going 1-0 up means a Game State of +1 while the trailing team’s Game State drops to -1, and so on.
- Number of players on the pitch: I think this is a first for using number of players in Expected Goals models, at least I haven’t seen anybody use it before. I’ve decided to call it Man Strength for lack of a better term and it works much like Game State. If an opponent is sent off your Man Strength goes to +1, while it drops to -1 for the opposing side. The reasoning behind using a variable like this is that as you face fewer opponents the defensive pressure could be less than usual, resulting in a higher goal expectancy.
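Here is a minimal sketch of how variables like these could be computed. It rests on several assumptions that are not spelled out above: shot coordinates in metres, with `x` the distance from the goal line and `y` the sideways offset from the center of the goal; the angle taken as the opening angle between the two posts as seen from the shot location; and Game State and Man Strength as plain, uncapped differences. The function names are made up for the example.

```python
import math

GOAL_WIDTH = 7.32  # metres between the posts


def shot_distance(x, y):
    """Straight-line distance from the shot to the center of the goal."""
    return math.hypot(x, y)


def shot_angle(x, y):
    """Opening angle (radians) between the posts, seen from (x, y)."""
    half = GOAL_WIDTH / 2
    return math.atan2(half - y, x) - math.atan2(-half - y, x)


def game_state(goals_for, goals_against):
    """Goal difference at the time of the shot, from the shooter's side."""
    return goals_for - goals_against


def man_strength(own_players, opponent_players):
    """Player difference at the time of the shot, from the shooter's side."""
    return own_players - opponent_players


# Two shots from the same distance (11 m): one central, one from out wide.
central = (11.0, 0.0)
wide = (math.sqrt(11.0**2 - 8.0**2), 8.0)

print(shot_distance(*central), shot_angle(*central))
print(shot_distance(*wide), shot_angle(*wide))
print(game_state(1, 0), man_strength(11, 10))
```

With this definition the central shot gets the larger angle even though both are taken from the same distance, which matches the idea that a higher angle means a more central position.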
Let’s take a look at the individual goal expectancy for the variables. Goal expectancy for the two leagues is very similar, but the league variable could possibly be of use if the leagues interact with the other variables differently. Attempt type is pretty obvious, with penalties having a higher value than regular shots. In the graphs showing distance and angle the values have been rounded off for presentation, while higher precision is used in the model. There are some outliers here due to small sample sizes at the higher values, but the patterns seem clear. It’s hard to tell from the graph if Game State is of any use since there isn’t much difference between the levels. Man Strength, however, shows a clear pattern: it certainly looks like goal expectancy rises when having more players on the pitch.
So let’s throw the training dataset (seasons 2011-2014) into a logistic model and have a look at a summary of the results:
There are a lot of numbers here, but let’s just focus on the p-values for each variable. Every variable is significant at the 5% level (p < 0.05) except league. As expected from the plot above, there’s apparently no use in separating Allsvenskan and Superettan shots. Here’s how the model summary looks without the league variable:
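For anyone who wants to try this step themselves, here is a rough sketch of fitting a logistic model with statsmodels, checking the p-values and refitting without the non-significant variables. The data is synthetic and generated inside the example (the outcome is built to depend on distance and angle but not on league), so the column names and every number it prints are assumptions for illustration, not my real dataset or results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 5000

# Synthetic shots, invented for this example.
shots = pd.DataFrame({
    "distance": rng.uniform(5, 35, n),
    "angle": rng.uniform(0.1, 1.2, n),
    "league": rng.integers(0, 2, n),
})

# Synthetic outcome: depends on distance and angle only, not on league.
linear = -1.0 - 0.10 * shots["distance"] + 1.5 * shots["angle"]
shots["goal"] = rng.binomial(1, 1 / (1 + np.exp(-linear)))

# Full model with every candidate variable.
full = smf.logit("goal ~ distance + angle + league", data=shots).fit(disp=0)
print(full.pvalues)

# Keep the variables with p < 0.05 (the intercept always stays), then refit.
keep = [
    name for name, p in full.pvalues.items()
    if name != "Intercept" and p < 0.05
]
formula = "goal ~ " + (" + ".join(keep) or "1")
reduced = smf.logit(formula, data=shots).fit(disp=0)
print(reduced.summary())

# The predicted probability for each shot is its xG value.
shots["xg"] = reduced.predict(shots)
```

Dropping variables on p-values alone is a simple rule of thumb; it is the approach used here, not the only way to select variables.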
So, with only significant variables left in the model, how does it perform when compared to actual goals? I’ve had the model calculate total xG for each player in Allsvenskan and Superettan for our test season (2015), and plotted this against their actual goals scored the same season.
With an r-squared of 0.77 I’d say the model is performing pretty well. What’s more encouraging is that the slope of the fitted line seems to be very close to 1, meaning that 1 expected goal is pretty much equal to 1 actual goal scored.
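A check like this can be sketched in a few lines: sum xG and goals per player, fit a straight line of goals against xG with ordinary least squares, and read off the slope and r-squared. The shot records below are made up purely to make the example runnable, and the helper `fit_line` is my own naming; none of it reproduces the 0.77 above.

```python
from collections import defaultdict

# Made-up test-season shots: (player, model xG for the shot, goal scored 0/1).
shots = [
    ("A", 0.50, 1), ("A", 0.30, 0), ("A", 0.70, 1),
    ("B", 0.10, 0), ("B", 0.20, 0), ("B", 0.10, 0),
    ("C", 0.76, 1), ("C", 0.20, 0),
    ("D", 0.05, 0), ("D", 0.15, 1),
]

# Total xG and total goals per player.
totals = defaultdict(lambda: [0.0, 0])
for player, xg, goal in shots:
    totals[player][0] += xg
    totals[player][1] += goal

xs = [total[0] for total in totals.values()]  # xG per player
ys = [total[1] for total in totals.values()]  # goals per player


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x, plus r-squared."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


slope, intercept, r_squared = fit_line(xs, ys)
print(f"slope={slope:.2f} intercept={intercept:.2f} r2={r_squared:.2f}")
```

A slope near 1 together with an intercept near 0 is what "1 expected goal equals 1 actual goal" looks like in the fitted line; r-squared only says how tightly the players cluster around it.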
In the graph I’ve also plotted the players in the top 10 in either goals scored, xG, goals per 90 or xG per 90 for the season. Some of them have good numbers in several of the stats. Emir Kujovic and Henok Goitom, for example, are performing outstandingly this season, both being crucial to their respective teams’ run at the title. Markus Rosenberg, on the other hand, is underperforming with only 9 goals scored compared to his 16 expected goals, which is one of the reasons why Malmö are not living up to expectations this season. Örebro’s Broberg and Häcken’s Paulinho de Oliveira also make the list due to their great form in recent months, while Djurgården’s Mushekwi enjoyed a good goalscoring run in the first half of the season.
Let’s take a look at how the model performs on a team level:
On a team level, it looks like the model is performing better when it comes to xG against than for, but overall it is a reasonably good fit, although not as good as at player level.
As we can see, the top teams are all performing well offensively. Göteborg stand out defensively with only 17 goals against in 27 games, even outperforming their excellent xG against of about 24. At the other end of the scale, Halmstad’s attack is underperforming with only 18 goals compared to over 35 xG.
That’s it for now when it comes to building my Expected Goals model for Swedish football, but I will probably bring it up again if I make any improvements and just maybe I’ll show how it’s been performing on the betting market. In my next post I’ll see how my model predicts the final table. Who will it pick as champion?