Preview: AIK vs. IFK Göteborg

On Monday night AIK host IFK Göteborg in an extremely important game in the race for the Allsvenskan title. Both teams are close behind leaders Norrköping and will surely go for the win to stay in the title race, so I thought it would be a good idea to look at some team stats as a preview of this crucial game.

The plot below contains goals, shots, Expected Goals, xG per attempt, goal conversion % and shot on target % – both for and against, normalized per game where necessary. Home and away stats for each team in the league are separated with home in blue and away in red. For each subplot the lower right corner is preferable, with high offensive and low defensive numbers.

aik_gbg_03

Besides SoT%, both AIK and Göteborg appear to be among the best in the league in each stat, which partly explains why they are fighting for the title. What is really striking though, and could be seen as an indicator of team style, is that while AIK’s offensive numbers at home are really good, Göteborg’s strength when playing away is their defence.

aik_gbg_01 aik_gbg_02

This is also evident from each team’s xG maps, where it is clear that AIK’s main strength is their attacking power and ability to produce high volumes of shots with high xG values each game. Göteborg on the other hand rely heavily on their defensive skills to protect their box and limit the opposition’s scoring chances. This clash of styles adds yet another interesting flavor to an already interesting game.

aik_gbg_04

Looking at each team’s top 5 goalscorers it is clear that AIK’s impressive attack relies heavily on Henok Goitom. His 16 goals this season are pretty much in line with his xG of about 15, while Göteborg’s Søren Rieks seems to be overperforming, with his 10 goals equalling almost two times his xG. Both teams have sold one of their best offensive players, with Bahoui and Vibe both making a move abroad this summer.

What about a prediction then? While I won’t reveal any percentages for this (or any) game, what I can say is that my model is pretty much in tune with the betting market. AIK is a slight favourite due to their home advantage, but this is really anybody’s game and it will hopefully be highly entertaining.


Predicting the final Allsvenskan table

With the Swedish season soon coming to an end it’s a good time to try out how the Expected Goals model will predict the final table. With only three games left, a top trio of this season’s big surprise Norrköping, just in front of Göteborg and AIK, is competing for the title of Swedish champion. At the opposite end of the table Åtvidaberg, Halmstad and Falkenberg look pretty stuck, with the two latter teams battling it out for the possible salvation of the 14th place relegation play-off spot.

predict_table_01

Let’s take a look at the remaining schedule for the top three teams:

Norrköping have two tough away games left against Elfsborg and Malmö, who are locked in a duel for 4th place, which could potentially mean a place in the Europa League qualification. Elfsborg are probably the tougher opponent here, with reigning champions Malmö busy in the Champions League group stage. Between these two away games Norrköping will play at home against Halmstad, who are fighting for survival at the bottom of the table.

Göteborg have two tough away games themselves, first at Djurgården and later a very important game against fellow title contenders AIK. This game will probably decide which of the two will challenge Norrköping for the title in the last round. Göteborg finish the season at home to Kalmar, who could possibly be playing for their survival in this last game.

AIK have the best remaining schedule of the three top teams, with away games at Halmstad and Örebro on either side of the crucial home game against Göteborg. As mentioned, Halmstad are fighting for their existence in Allsvenskan, while Örebro’s recent great form has seen them through to a safe spot in the table.

At this late stage of the season there are a lot of psychological factors in play, with the motivation and spirit of teams and players often being connected to their position in the table. These aspects are very hard to quantify and have not been incorporated in my model, so my prediction of the table relies solely on my Expected Goals model used in a Monte Carlo simulation. I won’t reveal exactly how I simulate games, but the subject will probably be touched upon in a later post, so I’ll spare you any boring technical details for now.

Each of the remaining 24 individual games has been simulated 10,000 times. For each of these fictional seasons I’ve counted up the points, goals scored and goal differences for every team to come up with a final table for that season. Lastly I’ve combined all these seasons into a table with expected points and the probabilities of each team’s possible league positions.
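The counting procedure described above can be sketched roughly as follows. Since the post does not reveal how individual games are simulated, everything game-specific here is an assumption: goals are drawn from a Poisson distribution with a per-team expected goals rate, ties on points are broken only by goal difference in the simulated games, and the team names, points and rates are made up.

```python
import math
import random
from collections import defaultdict


def poisson(rng, lam):
    """Draw one Poisson-distributed integer (Knuth's method, fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_seasons(points, fixtures, n_sims=10_000, seed=1):
    """Return {team: {position: probability}} over n_sims simulated run-ins.

    points:   current points per team (hypothetical input)
    fixtures: list of (home, away, home_xg, away_xg) for the remaining games
    """
    rng = random.Random(seed)
    counts = {team: defaultdict(int) for team in points}
    for _ in range(n_sims):
        pts = dict(points)
        gd = defaultdict(int)  # goal difference from the simulated games only
        for home, away, home_xg, away_xg in fixtures:
            hg, ag = poisson(rng, home_xg), poisson(rng, away_xg)
            gd[home] += hg - ag
            gd[away] += ag - hg
            if hg > ag:
                pts[home] += 3
            elif hg < ag:
                pts[away] += 3
            else:
                pts[home] += 1
                pts[away] += 1
        # Rank by points, then by (simulated) goal difference.
        table = sorted(pts, key=lambda t: (pts[t], gd[t]), reverse=True)
        for pos, team in enumerate(table, start=1):
            counts[team][pos] += 1
    return {t: {pos: c / n_sims for pos, c in sorted(d.items())}
            for t, d in counts.items()}


# Made-up numbers for illustration only, not the real table or model output.
example_points = {"A": 60, "B": 59, "C": 58}
example_fixtures = [("A", "B", 1.4, 1.1), ("C", "A", 1.2, 1.3), ("B", "C", 1.5, 1.0)]
position_probs = simulate_seasons(example_points, example_fixtures, n_sims=2_000)
```

Each team’s position probabilities sum to 1, and expected points could be collected in the same loop by averaging `pts` over the simulated seasons.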

predict_table_03

The model clearly ranks Norrköping as the most likely winner with Göteborg as the main contender, while AIK’s chances of winning the title are only at about 18%. The bottom three look rather fixed in their current positions, with Falkenberg having only a 2% chance of overtaking Kalmar for the last safe spot in the table. At mid-table things are still quite open, even though Djurgården’s season is pretty much over with an 89% chance of placing 6th. Malmö seem to have an advantage over Elfsborg in the race for 4th place, but given their Champions League schedule their chances should probably be lower than the model predicts.

I’ll probably be posting updated predictions on my Twitter feed after each of the top teams’ remaining games to see how the results change the predictions.


The Model part 3 – Expected Goals for Swedish Allsvenskan

Now that we’ve explored the Expected Goals concept and the data available for Swedish football, it’s finally time to build the model and put it to the test.

Setting up an Expected Goals model can be done in a number of ways, for example with the help of exponential decay, machine learning or some kind of regression model. I’ve chosen to use a logistic regression model because I think it has several advantages. Logistic regression is mostly used when the dependent variable only has two possible values, which translates well to football since a shot can end up either as a goal (1) or no goal (0). Logistic regression also returns a calculated probability, i.e. our xG value. Finally, it’s very easy to set up a logistic regression model and tinker with different variables using Python’s statsmodels library.
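To show how a logistic regression turns shot information into a probability, here is a minimal sketch of the final step only. The intercept and coefficient are hypothetical values picked so the example runs; they are not fitted values from my model, and a real model would include more variables.

```python
import math


def xg_from_linear_predictor(z):
    """Logistic (sigmoid) function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for illustration only; NOT fitted values.
INTERCEPT = -0.5
COEF_DISTANCE = -0.12  # change in log-odds per metre from the centre of the goal


def shot_xg(distance_m):
    """xG for a shot taken distance_m metres out, under the made-up coefficients."""
    return xg_from_linear_predictor(INTERCEPT + COEF_DISTANCE * distance_m)


close, far = shot_xg(5.0), shot_xg(25.0)
```

With a negative distance coefficient the closer shot gets the higher xG, and both values stay strictly between 0 and 1, which is what makes the output usable as a probability.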

First off, the dataset needs to be divided into two parts: one for training or constructing the model and one for testing it. This is done in order to avoid overfitting where the same data is used for constructing the model as for evaluating it, which would make the model possibly look better than it is.
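A minimal sketch of such a split, using the seasons mentioned later in this post (2011-2014 for training, 2015 for testing). The shot records and their field names are made up for illustration.

```python
# Hypothetical shot records; the field names are assumptions for this example.
shots = [
    {"season": 2011, "distance": 18.0, "goal": 0},
    {"season": 2013, "distance": 11.0, "goal": 1},
    {"season": 2014, "distance": 24.0, "goal": 0},
    {"season": 2015, "distance": 9.0, "goal": 1},
    {"season": 2015, "distance": 21.0, "goal": 0},
]

TRAIN_SEASONS = range(2011, 2015)  # 2011, 2012, 2013, 2014
TEST_SEASON = 2015

# The model is built on the training shots and evaluated on the held-out season.
train = [s for s in shots if s["season"] in TRAIN_SEASONS]
test = [s for s in shots if s["season"] == TEST_SEASON]
```

Splitting by season rather than at random keeps the test close to how the model is actually used: built on past seasons and applied to a new one.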

I’ve chosen a number of variables which all in some way make sense to test in the model. They include:

  • League: Could goal expectancy differ between Allsvenskan and Superettan? Since this variable isn’t numerical, it’s been recoded to either 0 (Allsvenskan) or 1 (Superettan).
  • Attempt type: That the goal expectancy for regular shots and penalties is completely different is obvious to anyone interested in football. This variable has also been recoded to either 0 (shot) or 1 (penalty).
  • Distance to the center of the goal: It is probably easier to score the closer to goal the shot is taken.
  • Angle: The distance to goal doesn’t tell the whole story of the importance of shot location, as shots taken from the same distance but at different angles should have different expectancies. A higher angle means a more central position, which would probably be easier to score from.
  • Game State: There’s been some work done on the importance of game state in football, and its use can be debated, but I’m at least going to try it out. It works by crediting teams who spend time having the lead. Teams start every game level at Game State 0. Going 1-0 up means a Game State of +1 while the trailing team’s Game State drops to -1, and so on.
  • Number of players on the pitch: I think this is a first for using number of players in Expected Goals models, at least I haven’t seen anybody use it before. I’ve decided to call it Man Strength for lack of a better term and it works much like Game State. If an opponent is sent off your Man Strength goes to +1, while it drops to -1 for the opposing side. The reasoning behind using a variable like this is that as you face fewer opponents the defensive pressure could be less than usual, resulting in a higher goal expectancy.
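The location and match-situation variables in the list above could be derived along these lines. This is a sketch under stated assumptions: coordinates are taken relative to the center of the goal (x along the goal line, y out into the pitch), the angle is measured against the goal line so that 90 degrees is straight in front of goal, and the function names are made up. The exact definitions in my data may differ.

```python
import math


def shot_distance(x, y):
    """Straight-line distance from the shot location to the center of the goal."""
    return math.hypot(x, y)


def shot_angle(x, y):
    """Angle in degrees between the goal line and the line to the goal center.

    90 means straight in front of goal; values near 0 mean close to the goal line.
    """
    return math.degrees(math.atan2(y, abs(x)))


def game_state(goals_for, goals_against):
    """+1 when leading by one, 0 when level, -1 when trailing by one, and so on."""
    return goals_for - goals_against


def man_strength(own_red_cards, opponent_red_cards):
    """+1 when the opponent has one player fewer, -1 when we do."""
    return opponent_red_cards - own_red_cards


# A shot from 11 units straight out, taken by a team leading 1-0 against ten men.
features = {
    "distance": shot_distance(0, 11),
    "angle": shot_angle(0, 11),
    "game_state": game_state(1, 0),
    "man_strength": man_strength(0, 1),
}
```

Defining the angle independently of distance keeps the two variables from measuring the same thing, which matches the idea of comparing shots from the same distance at different angles.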

model_01

Let’s take a look at the individual goal expectancy for the variables. Goal expectancy for the two leagues is very similar, but the variable could possibly be of use if the leagues interact with the other variables differently. Attempt type is pretty obvious, with penalties having a higher value than regular shots. In the graphs showing distance and angle the values have been rounded off for presentation, while higher precision is used in the model. There are some outliers here due to small sample sizes at the higher values, but the patterns seem clear. It’s hard to tell from the graph if Game State is of any use since there isn’t much difference between the levels. Man Strength, however, shows a clear pattern: it certainly looks like goal expectancy rises when having more players on the pitch.

So let’s throw the training dataset (seasons 2011-2014) into a logistic model and have a look at a summary of the results:

model_02

There are a lot of numbers here, but let’s just focus on the p-values for each variable. Every variable is significant at the 5% significance level (p < 0.05) except league. As expected from the plot above, there’s apparently no use in separating Allsvenskan and Superettan shots. Here’s how the model summary looks without the league variable:

model_03

So, with only significant variables left in the model, how does it perform when compared to actual goals? I’ve had the model calculate total xG for each player in Allsvenskan and Superettan for our test season (2015), and plotted this against their actual goals scored the same season.

model_04

With an r-squared of 0.77 I’d say the model is performing pretty well. What’s more encouraging is that the slope of the fitted line seems to be very close to 1, meaning that 1 expected goal is pretty much equal to 1 actual goal scored.

model_05

In the graph I’ve also plotted the players in the top 10 in either goals scored, xG, goals per 90 or xG per 90 for the season. Some of them have good numbers in several of the stats. Emir Kujovic and Henok Goitom for example are performing outstandingly this season, both being crucial to their respective teams’ run at the title. Markus Rosenberg on the other hand is underperforming with only 9 goals scored compared to his 16 expected goals, which is one of the reasons why Malmö are not living up to expectations this season. Örebro’s Broberg and Häcken’s Paulinho de Oliveira also make the list due to their great form in recent months, while Djurgården’s Mushekwi enjoyed a good goalscoring run in the first half of the season.
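For readers who want to reproduce this kind of check, the slope and r-squared of a fitted line between player xG and actual goals can be computed as in the sketch below. It uses ordinary least squares with a single predictor; the player numbers are made up and are not taken from my data.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared correlation for a single predictor
    return slope, intercept, r_squared


# Made-up season totals for a handful of hypothetical players.
player_xg = [2.0, 5.0, 9.0, 15.0]
player_goals = [1, 6, 9, 16]
slope, intercept, r_squared = fit_line(player_xg, player_goals)
```

A slope near 1 together with an intercept near 0 means the model neither inflates nor deflates goal totals overall, while r-squared tells how tightly individual players follow the line.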

Let’s take a look at how the model performs on a team level:

model_06 model_07

On a team level, it looks like the model is performing better when it comes to xG against than for, but overall it is a reasonably good fit, although not as good as at player level.

model_08

As we can see, the top teams are all performing well offensively. Göteborg stand out defensively with only 17 goals against in 27 games, even outperforming their excellent xG against of about 24. At the other end of the scale, Halmstad’s attack is underperforming with only 18 goals compared to over 35 xG.

model_09

That’s it for now when it comes to building my Expected Goals model for Swedish football, but I will probably bring it up again if I make any improvements and just maybe I’ll show how it’s been performing on the betting market. In my next post I’ll see how my model predicts the final table. Who will it pick as champion?


The Model part 2 – The Data

In my last post I discussed the concept of Expected Goals and how its probabilistic nature opens up for simulations. Today I’m going to talk about another cornerstone when building my model – the data. I do this because I think it’s important to fully explore the data when building a model, to understand its strengths and weaknesses, its advantages and limitations and how these affect the model and its output and performance. No model is perfect, but if we’re aware of its biases and limitations we can still make good use of it.

While Opta produces very advanced data covering every on-ball event in the bigger leagues, the data available for Swedish football is poorer in terms of detail, quality and reliability. What’s available for use is pretty much just shots, and there is no distinction between different types of shots besides penalties. Only shots that ended up as goals have detailed information on whether they were headed, came from a set piece and so on. Using this information would result in a skewed model, rating for example headers too high since every recorded header is also a goal. I’ve therefore treated all these types of situations as regular shots. Furthermore, the location of the shots is recorded with less accuracy than Opta’s. The x and y coordinates are recorded only as integers, making them less precise, and the location of the shots is sometimes plain wrong. I regularly examine the shot maps of games I’ve watched live and there always seem to be some errors, but I’m hoping these will be insignificant. There’s no information on passes, defensive actions or anything like that; the only events recorded besides shots are fouls, corners, offsides, substitutions and cards.

Data exists for the top league Allsvenskan, but also second tier Superettan and the two Division 1 leagues below it, from season 2011 and onwards. However, the data from Division 1 seems to be of too poor quality for modelling and substitutions were not recorded properly until season 2013, so per90 stats from seasons 2011 and 2012 are pretty much useless. Anyway, here’s a shot map of every shot recorded for Allsvenskan and Superettan from season 2011 up till now.

data_01

With so many shots taken from the exact same locations, it’s probably easier to get a sense of the distribution of the shots through a hexbin plot, showing what could be described as the shot density of every location on the pitch:

data_02

As we can see, the penalty box and the area just in front of it seem to be the most frequent shooting locations, which makes sense. Also, the penalty spot stands out with so many shots taken from the exact same location.

data_03

Looking at only goals, the penalty spot again stands out but we can also see that most goals are scored inside the box, especially from more central locations. This again makes sense.

It’s also a good idea to take a look at the general characteristics of the games you want to model, so I’ve created some histograms of goal and shot distributions from Allsvenskan.

data_04

Examining these, we can see that an average game ends up with a total of 2.74 goals, with the home side having a 0.433 goal advantage. What about shots?

data_05

As expected, the home side also enjoys an advantage when it comes to shots, about 2.481 more per game on average, while the average total number of shots in an Allsvenskan game is 21.931.
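These totals and advantages can be turned into separate home and away averages with a little arithmetic. The sketch below assumes that "advantage" means the average home figure minus the average away figure; the function name is made up.

```python
def split_total(total, home_advantage):
    """Split an average match total into home and away averages.

    Assumes home + away = total and home - away = home_advantage.
    """
    home = (total + home_advantage) / 2
    away = (total - home_advantage) / 2
    return home, away


# Figures quoted in the text for Allsvenskan.
home_goals, away_goals = split_total(2.74, 0.433)    # about 1.59 and 1.15 goals
home_shots, away_shots = split_total(21.931, 2.481)  # about 12.2 and 9.7 shots
```

Under that reading, the average home side scores roughly 1.59 goals from about 12.2 shots, while the away side scores roughly 1.15 goals from about 9.7 shots.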

I think we have a good sense of the league and games we want to model now, so I’ll end this post here. Next up I’ll get down to business, building the model and putting it to the test.


The Model part 1 – Exploring Expected Goals

In a series of posts I will be covering the work done on and with my model on Swedish football. In this first part of the series I’ll talk about the underlying concept upon which the model is built – Expected Goals.

We’ve all seen those games where the result ended up being extremely unfair given how the game played out. Maybe the dominant team had a spell of bad luck and conceded an own goal while missing their clear chances, or the opposing goalkeeper played the game of his career making some huge saves, or maybe the lesser side luckily managed to score through their only real chance. All these scenarios point to the same thing: there’s a lot of randomness associated with goals. We often see teams play great and still lose while a poorly playing side takes home all three points.

Because of this random nature of football, only looking at results and goals scored and conceded is not a good way to assess true team and player strength. Sure, good teams usually win but they also sometimes run into spells of bad form and perform worse, while bad teams sometimes go on a good run, securing that last safe spot in the table just in time before the season ends.

To combat this problem, the football analytics community has turned its eyes to a more stable part of the game, shots, in the hope that these will exhibit less randomness and hold more explanatory power. While it is certainly true that examining how many shots a team produces and concedes can tell you more than just goals, the same problem with randomness exists here too. Good teams usually take more shots than they concede but as we all know, this is not always the case.

Expected Goals aims at getting down to the core of why good teams perform well and bad teams perform worse, and in the process avoiding some of the problems associated with just summing up goals and shots. It is based on the notion that good teams take more shots in good situations while bad teams do the opposite. The same is true in defence, as good teams concede fewer shots in good situations than bad teams do. The hope is that these characteristics will be less random and more useful in explaining and predicting football.

In essence, Expected Goals gives you a value for how often a typical shot ends up in the net, and this is done by examining huge datasets in a number of different ways. Usually an Expected Goals model is based on where on the pitch the shot was taken, and the reason for this is quite clear once you come to think about it: it all comes down to shot quality. Imagine two different scoring opportunities, the first being 25 meters out from the goal and the other being 5 meters from the goal. In traditional football reporting these two shots will be treated just the same, but we all know that the latter is preferable since it is closer to goal and probably an easier shot to make.

Given the different methods, ideas and datasets football analysts work with, there’s no right way to calculate an Expected Goals or xG value. For example, an ambitious analyst might account for not only where the shot was taken, but also what type of shot it was, what kind of pass preceded the shot, if the player dribbled before taking the shot etc. The possibilities are only limited by the data, and with the likes of Opta covering the top European leagues, these are vast.

Let’s take a look at a real example. In my database (more on that in a later post) I have 243 penalties recorded, of which 192 ended up in goal. To get the xG value for a penalty we just need to calculate the fraction of penalties which turned into goals, in this case 192/243, or about 0.79. In comparison, the xG value for a shot taken from the same penalty spot during regular play is estimated by my model to be about 0.25, which makes sense since it’s a harder shot than the penalty.
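The penalty calculation is simple enough to write out directly. The function name is made up for this example; the counts are the ones quoted above.

```python
def empirical_xg(goals, attempts):
    """Share of attempts of a given type that ended up as goals."""
    return goals / attempts


# 192 goals from 243 recorded penalties.
penalty_xg = empirical_xg(192, 243)  # about 0.79
```

The same ratio works for any group of similar shots, as long as the group is large enough for the fraction to be stable.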

As shown by several football analysts (for example on the blog 11tegen11), Expected Goals holds some real power when it comes to explaining football results. But it also has its weaknesses. There’s currently no way of accounting for the position of the defenders when the shot was taken, which surely would affect scoring expectation. Furthermore, Expected Goals only deals with actual shots taken but as we all know, not all scoring chances produce a shot. It’s also true that xG values are averages, meaning that there’s actually a whole range of different expectations for different players. Surely Leo Messi will have a higher chance to score than Carlton Cole in nearly every situation.

To me, the real strength of Expected Goals lies in the fact that we can treat it as a probability and use it in simulations in order to examine more complex situations. Take the penalty for example. With an xG value of 0.79, we can expect an average player to score most of the time, but he’ll also miss some shots. In fact, it’s not uncommon for him to miss several shots in a row. With the help of Monte Carlo simulation (again, more on that in later posts), we can examine the nature of the penalty shot more closely. Let’s say we get our player to take 10,000 penalty shots in a row. How many will he make?

Pen_sim_01

As we can see our player started out by making his first shot only to quickly drop below the expected 79% scoring rate, but as he took more and more shots he slowly moved towards his expected scoring rate. He actually scored 7928 penalties which is very close to the expected 7900.
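A simulation like this one can be sketched in a few lines. Each penalty is treated as an independent random draw that succeeds with probability 0.79; the seed and function name are arbitrary choices for the example, so the exact number of goals will differ from the 7928 in my run.

```python
import random


def simulate_penalties(n_shots, xg, seed=None):
    """Simulate n_shots penalties and track the scoring rate after each shot."""
    rng = random.Random(seed)
    goals = 0
    running_rate = []
    for shot in range(1, n_shots + 1):
        if rng.random() < xg:  # the shot is a goal with probability xg
            goals += 1
        running_rate.append(goals / shot)
    return goals, running_rate


goals, running_rate = simulate_penalties(10_000, 0.79, seed=2015)
```

Plotting `running_rate` against the shot number gives the same kind of curve as above: noisy at first, then settling close to 0.79 as the number of shots grows.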

Let’s try a more complex simulation just for fun. Imagine a penalty shoot-out. How likely is it to make all five shots? Four out of five? My database doesn’t contain any penalty shoot-outs, but my guess is that these are converted at a slightly lower rate than regular penalties, either due to the stress involved or maybe fatigue. But let’s use our standard xG value of 0.79 for simplicity. Let’s simulate 10,000 shoot-outs with five penalties each.

Pen_sim_02

Given the conditions we’ve set up, it seems there’s about a 30% chance of scoring all five penalties, while making four is the most likely outcome. Going goalless in this shoot-out looks rather unlikely, but as I’ve said, the true chance of scoring after playing 120 minutes with the hopes of thousands (or millions) of people on your shoulders is probably lower, so a goalless shoot-out is probably more likely than our simulation shows.
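The simulated shares can be cross-checked against the exact probabilities, since five independent penalties with the same scoring chance follow a binomial distribution. This is a check on the simulation rather than the simulation itself, and the function name is made up.

```python
from math import comb


def shootout_distribution(n_kicks=5, xg=0.79):
    """Exact probability of scoring k penalties out of n_kicks, for k = 0..n_kicks."""
    return [comb(n_kicks, k) * xg ** k * (1 - xg) ** (n_kicks - k)
            for k in range(n_kicks + 1)]


dist = shootout_distribution()
# dist[5] is about 0.31 (all five scored) and dist[4] about 0.41 (four scored),
# while dist[0], a goalless shoot-out, is about 0.0004.
```

The exact values agree with the simulated picture: four goals is the single most likely outcome and scoring all five happens in roughly three shoot-outs out of ten.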

That’s it about Expected Goals for now, and as I’ve said we will explore the possibilities of Monte Carlo simulation more thoroughly later. In my next post about my model I’ll talk about the data used for building it.
