In my last post I discussed the concept of Expected Goals and how its probabilistic nature lends itself to simulations. Today I’m going to talk about another cornerstone of building my model: the data. I think it’s important to fully explore the data when building a model, to understand its strengths and limitations and how these affect the model’s output and performance. No model is perfect, but if we’re aware of its biases and limitations we can still make good use of it.
While Opta produces very advanced data covering every on-ball event in the bigger leagues, the data available for Swedish football is poorer in detail, quality and reliability. What’s available is pretty much just shots, with no distinction between different types of shots besides penalties. Only shots that ended up as goals carry detailed information on whether they were headed, came from a set piece and so on. Using this information would skew the model, rating headers, for example, too highly, since every header in the data is also a goal. I’ve therefore treated all of these situations as regular shots. Furthermore, shot locations are recorded less accurately than Opta’s. The x and y coordinates are stored as integers only, making them less precise, and the location is sometimes plain wrong. I regularly examine the shot maps of games I’ve watched live and there always seem to be some errors, but I’m hoping they will be insignificant. There’s no information on passes, defensive actions or anything like that; the only events recorded besides shots are fouls, corners, offsides, substitutions and cards.
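To make the bias concrete, here is a minimal sketch of what happens when shot type is only recorded for goals. The records are made up for illustration and the tuple layout is an assumption, not the actual format of the data.

```python
from collections import defaultdict

# Hypothetical records: (is_goal, shot_type). As in the data described above,
# the type is only filled in for goals; for all other shots it is None.
shots = [
    (True, "header"),
    (True, "header"),
    (True, "foot"),
    (False, None),
    (False, None),
    (False, None),
    (False, None),
    (False, None),
]

def conversion_by_type(records):
    """Return goals divided by shots for each recorded shot type."""
    goals = defaultdict(int)
    totals = defaultdict(int)
    for is_goal, shot_type in records:
        totals[shot_type] += 1
        goals[shot_type] += int(is_goal)
    return {shot_type: goals[shot_type] / totals[shot_type] for shot_type in totals}

rates = conversion_by_type(shots)
overall = sum(is_goal for is_goal, _ in shots) / len(shots)

print(rates)    # every labelled type converts at 100%, unlabelled shots at 0%
print(overall)  # the conversion rate when all shots are treated alike
```

Every labelled header is a goal by construction, so a model fed the label would learn that headers always score. Dropping the label and treating everything as a regular shot avoids that, at the cost of losing the distinction.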
Data exists for the top league, Allsvenskan, as well as the second-tier Superettan and the two Division 1 leagues below it, from the 2011 season onwards. However, the Division 1 data seems to be of too poor quality for modelling, and substitutions were not recorded properly until the 2013 season, so per-90 stats from 2011 and 2012 are pretty much useless. Anyway, here’s a shot map of every shot recorded for Allsvenskan and Superettan from the 2011 season up till now.
With so many shots taken from the exact same locations, it’s probably easier to get a sense of the distribution of the shots through a hexbin plot, showing what could be described as the shot density of every location on the pitch:
As we can see, the penalty box and the area just in front of it seem to be the most frequent shooting locations, which makes sense. The penalty spot also stands out, with so many shots taken from the exact same location.
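Since the coordinates are integers, a rough version of this density can be had without any plotting library by simply counting shots per location. The sketch below uses made-up coordinates, a hypothetical penalty spot position and square cells instead of hexagons, so it is a simplified stand-in for the hexbin plot rather than the code behind it.

```python
from collections import Counter

# Made-up integer (x, y) shot locations. The real coordinate system is not
# described here, so these values (including the penalty spot) are illustrative.
PENALTY_SPOT = (89, 50)  # hypothetical location
shots = [PENALTY_SPOT] * 4 + [(85, 48), (85, 48), (80, 55), (92, 45), (70, 30)]

def shot_density(locations, cell=1):
    """Count shots per square cell with side length `cell`."""
    return Counter((x // cell, y // cell) for x, y in locations)

density = shot_density(shots)
for location, count in density.most_common(3):
    print(location, count)

# Coarser cells smooth out the integer grid, much like larger hexagons would.
coarse = shot_density(shots, cell=10)
print(coarse.most_common(1))
```

With a cell size of 1 the penalty spot dominates, just as in the plot. Increasing `cell` trades that detail for a smoother picture of where shots cluster.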
Looking at only goals, the penalty spot again stands out but we can also see that most goals are scored inside the box, especially from more central locations. This again makes sense.
It’s also a good idea to take a look at the general characteristics of the games you want to model, so I’ve created some histograms of goal and shot distributions from Allsvenskan.
Examining these, we can see that an average game ends up with a total of 2.74 goals, with the home side having a 0.433 goal advantage. What about shots?
As expected, the home side also enjoys an advantage when it comes to shots, about 2.481 more per game on average, while the average total number of shots in an Allsvenskan game is 21.931.
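For reference, averages like these can be derived from a list of match results along the following lines. The results are invented, and the sketch assumes that the home advantage is the mean of the home figure minus the away figure; the same function works for shots as well as goals.

```python
from collections import Counter
from statistics import mean

# Hypothetical results as (home_goals, away_goals); not real Allsvenskan data.
matches = [(2, 1), (0, 0), (3, 1), (1, 2), (1, 0)]

def summarise(results):
    """Return (average total per game, average home-minus-away difference)."""
    totals = [home + away for home, away in results]
    diffs = [home - away for home, away in results]
    return mean(totals), mean(diffs)

avg_total, home_advantage = summarise(matches)

# The counts behind a histogram of total goals per game.
total_goals_distribution = Counter(home + away for home, away in matches)

print(avg_total, home_advantage)
print(sorted(total_goals_distribution.items()))
```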
I think we have a good sense of the league and games we want to model now, so I’ll end this post here. Next up I’ll get down to business, building the model and putting it to the test.
2 thoughts on “The Model part 2 – The Data”
Hey, would you mind sharing where you get the shot location data for the Allsvenskan?
I’ve been scouring the internet looking for more detailed data on this league but have come up empty. Cheers