The Model part 2 – The Data

In my last post I discussed the concept of Expected Goals and how its probabilistic nature opens the door to simulations. Today I’m going to talk about another cornerstone of building my model: the data. I think it’s important to fully explore the data when building a model, to understand its strengths, weaknesses and limitations, and how these affect the model’s output and performance. No model is perfect, but if we’re aware of its biases and limitations we can still make good use of it.

While Opta produces very advanced data covering every on-ball event in the bigger leagues, the data available for Swedish football falls short in detail, quality and reliability. What’s available is pretty much just shots, and there is no distinction between different types of shots besides penalties. Only shots that ended up as goals have detailed information on whether they were headed, came from a set piece and so on. Using this information would result in a skewed model, rating headers, for example, too highly, since every header in the data is also a goal. I’ve therefore treated all these types of situations as regular shots. Furthermore, the location of the shots is recorded with less accuracy than Opta’s. The x and y coordinates are recorded as integers only, making them less precise, and the location of a shot is sometimes plain wrong. I regularly examine the shot maps of games I’ve watched live and there always seem to be some errors, but I’m hoping these will be insignificant. There’s no information on passes, defensive actions or anything like that; the only events recorded besides shots are fouls, corners, offsides, substitutions and cards.
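To make the skew concrete, here’s a minimal sketch with made-up shot records (not real data). The field names `goal` and `header` are assumptions for illustration; `header` is `None` whenever the shot type wasn’t recorded, which here is every shot that missed.

```python
# Made-up shot records for illustration only; not real league data.
# "header" is None when the shot type was not recorded, which in this
# sketch (as in the data described above) is every shot that missed.
shots = [
    {"goal": True, "header": True},
    {"goal": True, "header": False},
    {"goal": True, "header": False},
    {"goal": False, "header": None},
    {"goal": False, "header": None},
    {"goal": False, "header": None},
    {"goal": False, "header": None},
    {"goal": False, "header": None},
    {"goal": False, "header": None},
    {"goal": False, "header": None},
]


def conversion_rate(records):
    """Share of the given shots that ended up as goals."""
    if not records:
        return 0.0
    return sum(r["goal"] for r in records) / len(records)


# Only goals carry the header flag, so every known header is a goal.
known_headers = [s for s in shots if s["header"] is True]
naive_header_rate = conversion_rate(known_headers)

# Treating every shot as a regular shot avoids the problem.
overall_rate = conversion_rate(shots)

print(f"Naive header conversion: {naive_header_rate:.0%}")
print(f"Overall conversion:      {overall_rate:.0%}")
```

Any model fed the header flag would learn that headers always score, which is why lumping everything together as regular shots is the safer choice with this data.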

Data exists for the top league Allsvenskan, but also for second-tier Superettan and the two Division 1 leagues below it, from the 2011 season onwards. However, the data from Division 1 seems to be of too poor quality for modelling, and substitutions were not recorded properly until the 2013 season, so per90 stats from 2011 and 2012 are pretty much useless. Anyway, here’s a shot map of every shot recorded for Allsvenskan and Superettan from the 2011 season up till now.
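Why do missing substitutions ruin per90 stats? A rough sketch, assuming minutes are derived from substitution times and ignoring stoppage time and red cards. The function names and the 90-minute match length are assumptions for illustration, not part of the actual data feed.

```python
def minutes_played(started, sub_on=None, sub_off=None, match_length=90):
    """Minutes on the pitch, derived from substitution times.

    Simplified: ignores stoppage time and red cards.
    """
    if started:
        start = 0
    elif sub_on is not None:
        start = sub_on
    else:
        return 0  # unused substitute
    end = sub_off if sub_off is not None else match_length
    return max(end - start, 0)


def per90(total, minutes):
    """Scale a raw count (shots, goals, ...) to a per-90-minutes rate."""
    if minutes <= 0:
        return 0.0
    return total * 90 / minutes


# A starter taken off after 60 minutes with 2 shots.
with_sub = per90(2, minutes_played(started=True, sub_off=60))

# Same player, but the substitution was never recorded.
without_sub = per90(2, minutes_played(started=True))

print(with_sub, without_sub)
```

With the substitution missing, the player is credited with a full 90 minutes and the rate comes out too low; a substitute whose entry was never recorded would get no minutes at all.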


With so many shots taken from the exact same locations, it’s probably easier to get a sense of the distribution of the shots through a hexbin plot, showing what could be described as the shot density of every location on the pitch:
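The idea behind such a density plot can be sketched in a few lines. Note that this sketch uses square grid bins rather than hexagons, purely to keep it short, and the coordinates are made up; the 0 to 100 coordinate scale and the bin size are assumptions as well.

```python
from collections import Counter


def shot_density(shots, bin_size=5):
    """Count shots per square bin of side `bin_size` (coordinate units)."""
    counts = Counter()
    for x, y in shots:
        counts[(x // bin_size, y // bin_size)] += 1
    return counts


# Made-up integer coordinates for illustration only.
shots = [(88, 50), (88, 50), (88, 50), (86, 52), (70, 30), (92, 48)]

# Integer coordinates mean many shots stack on the exact same point...
exact = Counter(shots)

# ...so counting per bin gives a better picture than plotting raw points.
density = shot_density(shots, bin_size=5)

for cell, n in density.most_common():
    print(cell, n)
```

A hexbin plot does the same counting with hexagonal cells and then colours each cell by its count.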


As we can see, the penalty box and the area just in front of it seem to be the most frequent shooting locations, which makes sense. The penalty spot also stands out, with so many shots taken from the exact same location.


Looking at only goals, the penalty spot again stands out but we can also see that most goals are scored inside the box, especially from more central locations. This again makes sense.

It’s also a good idea to take a look at the general characteristics of the games you want to model, so I’ve created some histograms of goal and shot distributions from Allsvenskan.


Examining these, we can see that an average game ends up with a total of 2.74 goals, with the home side having a 0.433 goal advantage. What about shots?


As expected, the home side also enjoys an advantage when it comes to shots, about 2.481 more per game on average, while the average total number of shots in an Allsvenskan game is 21.931.
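Both the goal and the shot figures are simple per-game averages: the mean of home plus away, and the mean of home minus away. Here’s a small sketch using made-up results rather than the real Allsvenskan data; the function name is just for illustration.

```python
from statistics import mean


def match_summary(matches):
    """Return (average total, average home-minus-away difference).

    `matches` is a list of (home, away) counts, e.g. goals or shots.
    """
    totals = [home + away for home, away in matches]
    diffs = [home - away for home, away in matches]
    return mean(totals), mean(diffs)


# Made-up (home goals, away goals) results for illustration only.
results = [(2, 1), (0, 0), (3, 1), (1, 2)]

avg_total, home_advantage = match_summary(results)
print(f"Average total: {avg_total}, home advantage: {home_advantage}")
```

The same function works for shots by passing in (home shots, away shots) pairs instead.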

I think we have a good sense of the league and games we want to model now, so I’ll end this post here. Next up I’ll get down to business, building the model and putting it to the test.


Norrköping vs. Djurgården

I had planned not to share any shot maps until I had properly discussed the concept of Expected Goals and my model, but after watching last night’s game between Norrköping and Djurgården I just couldn’t help myself. Some of you have probably seen this kind of plot before, so I will leave the explanation for later posts.

[Shot map: Peking-DIF_2015]

Here’s last night’s high-scoring game. As a lifelong Djurgården supporter I did not find this a pleasant game to watch, and to add to the pain I actually had a bet on the under here. Sigh.

I don’t believe much in year-to-year trends such as “Team A vs. Team B always produces a lot of goals”, not in today’s football where players and managers change teams frequently. Still, watching the game I got a vague feeling of déjà vu and just had to look up this fixture from the last few years.

[Shot maps: Peking-DIF_2014, Peking-DIF_2013]

Looking at the games from seasons 2014 and 2013, it seems there is some truth to the myth. But while it may look like this particular fixture usually ends up a high-scoring affair, in the other seasons in my database (2011 and 2012) the games ended 2-1 and 1-1 respectively. Furthermore, Djurgården actually had only two players starting in all three games: Kenneth Høie and Emil Bergström. The same goes for Norrköping, with only David Mitov Nilsson and Andreas Johansson starting all three games.

With so few players playing all three games, and the games therefore being played under completely different conditions, I think we can safely put this high-scoring trend down to pure coincidence. I still feel like a fool for betting the under though.
