Using the same xG models for women’s football is wrong, Opta – and needs to change

  • George Ferridge
  • Oct 19, 2023
  • 11 min read

Updated: Oct 26, 2023

The growth of women’s football in recent years has been exponential. Even before the Lionesses finally brought football home with their European Championship success in 2022, the sport was growing at a fantastic clip, but the victory of Sarina Wiegman’s team seems to have accelerated its growth in the United Kingdom even more. Abroad, the sport continues to grow, especially in the United States, where I would wager the average person could name more players from their women’s national team than their men’s.


Now, the two genders within the sport seem more integrated than ever. Female commentators and pundits are ubiquitous in football broadcasts, female referees continue to get chosen more and more for Premier League matches, and female coaches are gaining plaudits not only in the women’s side of the game (Wiegman, Emma Hayes) but increasingly on the men’s side too, with Hannah Dingley being named the first manager of a men’s professional team with Forest Green Rovers this past summer.


A somewhat unfortunate circumstance of the meteoric rise of women’s football in the media, however, is that for the most part data analysis has not been adapted for the women’s game. The statistics and models that are being used to analyse competitions like the Women’s World Cup or the Women’s Super League have simply been taken over from those used in the men’s game. Upon further inspection, it appears that this is a glaring error.


While there are sure to be several instances of inaccurate analysis provided for the women’s game, in this article I will be focusing on the most widely used statistic in football, Expected Goals, and showing you how important it is that we redevelop our analysis for women’s football instead of just assuming that it’s the same game as the men’s side.


The idea behind the Expected Goals model is simple. Each shot that’s taken by a player has a probability of finding the back of the net. That’s affected by the position of the player and the ball, the body part striking it (left foot, right, head etc.) and the position of defenders between the ball and the goal. Expected Goals models collate thousands of instances of similar shots being taken from similar positions with similar body parts with similar defenders in front of them to figure out the probability of that shot going in.
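To make that idea concrete, here is a minimal sketch in Python. It is a deliberate simplification and not how Opta builds its model: the shot records, the zone names and the two functions are all invented for illustration, and real models weigh far more factors than a zone and a body part. It groups past shots into buckets, takes the share scored in each bucket as the probability, then adds up those probabilities for the shots a team takes in a match.

```python
from collections import defaultdict

# Hypothetical historical shots: (zone, body_part, scored).
# These records are invented purely for illustration.
HISTORICAL_SHOTS = [
    ("six_yard_box", "foot", True),
    ("six_yard_box", "foot", False),
    ("six_yard_box", "head", False),
    ("six_yard_box", "head", True),
    ("penalty_area", "foot", False),
    ("penalty_area", "foot", False),
    ("penalty_area", "foot", True),
    ("penalty_area", "foot", False),
    ("outside_box", "foot", False),
    ("outside_box", "foot", False),
]


def conversion_rates(shots):
    """Share of past shots scored in each (zone, body_part) bucket."""
    goals = defaultdict(int)
    attempts = defaultdict(int)
    for zone, body_part, scored in shots:
        attempts[(zone, body_part)] += 1
        goals[(zone, body_part)] += int(scored)
    return {key: goals[key] / attempts[key] for key in attempts}


def match_xg(match_shots, rates):
    """Add up the bucket probability of every shot taken in one match."""
    return sum(rates[(zone, body_part)] for zone, body_part in match_shots)


rates = conversion_rates(HISTORICAL_SHOTS)
xg = match_xg(
    [("six_yard_box", "foot"), ("penalty_area", "foot"), ("outside_box", "foot")],
    rates,
)
print(rates)
print(round(xg, 2))
```

The point to notice is that the probabilities are only as representative as the historical shots that go in. Whatever sample fills `HISTORICAL_SHOTS` decides what every future shot is worth.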


The issue is that almost all the shots that developed the original xG models were taken by men.


Expected Goals, in its current iteration, then inherently assumes that the shot taker is playing men’s football. Do we know for sure that the probability of a man scoring a shot in men’s football is the same as a woman in women’s football?


The way to answer this question is to dig into the data. Specifically, the data on expected goals provided by one of the leading football statistics platforms, Opta. If their assumption is correct, that women’s xG is the same as men’s xG, then women’s xG should behave in the same way. Functionally, that means that over a large sample, xG is very close to actual goals scored. If it isn’t, then current xG statistics are biased and inappropriate for use in women’s football.


For this analysis, I have collected the data from fbref.com (whose statistics are provided by Opta) for the last three seasons of both the Premier League and the Women’s Super League. I’ve chosen three years mainly because that’s as far back as I can go in the Women’s Super League while still having a full set of xG data for all matches played. Before then only some matches have full data. Since the two league seasons are of different lengths (38 matches for the Premier League, 22 for the WSL), I used the per 90 statistics.


The first thing to look at here is an overall breakdown of how accurately xG predicts goals scored. If we look at the difference between actual goals scored and the xG generated by teams, xG models would predict that with enough teams and enough matches played, the difference between them should be very close to 0. Below, we can see the breakdown for the two leagues, with the x-axis showing the difference between goals scored and xG, and the y-axis measuring how often that difference is found.
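For anyone who wants to reproduce this step, the calculation behind the breakdown can be sketched as follows. The team names and numbers are placeholders I have made up, not the fbref figures, and the layout of one row per team per season is my assumption about how the data would be arranged.

```python
from statistics import mean

# Hypothetical team-season rows: (team, goals_per_90, xg_per_90).
# Placeholder values only; the real figures come from fbref.com.
TEAM_SEASONS = [
    ("Team A", 2.10, 1.80),
    ("Team B", 1.40, 1.25),
    ("Team C", 0.90, 1.00),
    ("Team D", 1.20, 0.95),
]


def goal_xg_differences(rows):
    """Goals minus xG (both per 90) for every team-season."""
    return [goals - xg for _, goals, xg in rows]


diffs = goal_xg_differences(TEAM_SEASONS)
average_diff = mean(diffs)
overperformers = sum(1 for d in diffs if d > 0)

print([round(d, 2) for d in diffs])
print(round(average_diff, 2), overperformers)
```

A positive difference means the team scored more than its xG, a negative one means it scored less. If xG is unbiased for a league, `average_diff` should sit close to 0 once there are enough rows.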



Clearly, xG does a fantastic job in men’s football. Not only is the average difference (represented by the red dotted line) indistinguishable from 0, but the data also falls in a nice, neat distribution around that average. A team is just as likely to score 0.1 more goals per match than their xG as they are to score 0.1 fewer goals per match than their xG. It’s a good balance centred around 0, showing that if you were to pluck a random Premier League team out of the data and look only at their Expected Goals, you would have a pretty good indication of how many goals they’ve actually scored.


The distribution on the women’s side tells a different story. Instantly, it’s clear to see that the average difference is not close to 0, with an average value of approximately 0.27 goals per match. While this doesn’t sound like a lot, it means that, on average, xG fails to predict roughly one goal in every 4 WSL matches a team plays. Over the course of a 22-match season, that amounts to xG failing to predict around 6 goals for a team. Considering the average number of goals scored by a WSL team last year was around 36, this means that xG is missing around 16% of all the goals scored by a team.
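The arithmetic behind those figures is short enough to check by hand, but here it is written out in Python. The three inputs are the rounded numbers quoted in this article, so the outputs are approximate in the same way.

```python
# Inputs taken from the figures quoted in the article (all rounded).
avg_gap_per_match = 0.27   # average goals minus xG per match in the WSL
wsl_matches = 22           # matches in a WSL season
avg_goals_per_team = 36    # approximate average goals per WSL team last season

# How many matches it takes for the gap to add up to one whole goal.
matches_per_missed_goal = 1 / avg_gap_per_match

# Goals per team per season that xG does not account for.
missed_goals_per_season = avg_gap_per_match * wsl_matches

# The same quantity as a share of a team's goals.
share_missed = missed_goals_per_season / avg_goals_per_team

print(round(matches_per_missed_goal, 1))
print(round(missed_goals_per_season, 2))
print(round(share_missed * 100, 1))
```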


Additionally, the distribution of frequencies looks quite different from that of the men’s side. The right-hand tail of the distribution is longer, meaning that large positive differences between goals and xG occur more often than large negative differences. This points to there being not just an inaccuracy of xG, but a one-sided bias. xG in its current form is not just inaccurate, it systematically underpredicts goals scored by women’s teams.



There are two major counterarguments to this conclusion. Firstly, the argument that this difference is driven only by the big teams. Chelsea’s dominance in the WSL just highlights the disparity in the league, and their xG performance (as well as those of other big teams) is driving this result. Secondly, the xG difference we find on the women’s side sounds big, but is it actually significant? With the spread of the women’s data, is 0.27 really that different from 0? I’ll address both of these arguments now.


Firstly, the big team argument. My initial rebuttal would be to direct people to look at the disparity in quality in the Premier League. Is the difference between Chelsea’s women’s team and Brighton’s women’s team really that much greater than the difference between a treble-winning, Premier League threepeat behemoth in Manchester City and Luton Town? I would argue not. But either way, we can separate out the individual teams by expressing the data as a scatter plot, so we can see how different those big teams truly are compared to the minnows of their leagues.




Here we have the Premier League represented by the red dots on the plot along with their line of best fit, and the WSL represented by the blue triangles. The black dotted line represents “perfection” in the xG statistic. Being on this line means that your goals scored per game is exactly the same as the amount of xG you generate. Being above it means that xG underpredicts your goals, while being below it means you’re scoring less than xG would predict.


The first thing to notice here is that the quality of players matters. The better attacking teams (those who generate a lot of xG) tend to score even higher. This means that if you’re generating a lot of xG, you have better players than average, and better players than average tend to score more than expected. Interestingly, this is true for both the WSL and the PL. Better teams are better. Groundbreaking.


There are two interesting points to pay attention to in this chart. The first is the position of the WSL line of best fit. This line serves as a prediction of Goals Scored based on xG for WSL teams. Interestingly, for the dataset provided, this line never crosses the G=xG line. What this means is that, statistically, the best prediction for the goals scored by a WSL team is always above their generated xG. This means that for all teams, not just the elite, xG is underpredicting goals scored.
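The check described here, whether a line of best fit ever meets the G=xG line within the data, can be sketched like this. I am assuming an ordinary least squares fit, which is the usual meaning of a line of best fit, and the two small datasets are invented to show one case that crosses and one that does not. They are not the league data.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys on xs; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossing_with_parity(slope, intercept):
    """xG value where the fitted line meets G = xG, or None if parallel."""
    if slope == 1:
        return None
    # Solve intercept + slope * x = x for x.
    return intercept / (1 - slope)


def crosses_within_data(xs, ys):
    """True if the fitted line meets G = xG inside the observed xG range."""
    crossing = crossing_with_parity(*fit_line(xs, ys))
    return crossing is not None and min(xs) <= crossing <= max(xs)


# Hypothetical xG and goals per 90, invented for illustration.
xg_values = [1.0, 1.5, 2.0]
goals_always_above = [1.3, 1.9, 2.5]   # every point sits above G = xG
goals_around_parity = [0.9, 1.5, 2.1]  # points straddle G = xG

print(crosses_within_data(xg_values, goals_always_above))
print(crosses_within_data(xg_values, goals_around_parity))
```

When the function returns False and the points sit above the line, the fitted prediction is higher than xG for every team in the observed range, which is the pattern described for the WSL.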


On the men’s side, the line of best fit is much, much closer to the G=xG line. While with higher xG it’s probably best to predict more goals, and with lower xG it’s probably best to predict fewer, both ends are quite close to that line. Predicting goals scored by just copying the xG is never going to be far off the mark for the men’s game. In fact, for an average team that generates somewhere between 1.25 and 1.6 xG a match, that xG prediction will be almost perfect. Again, this supports our conclusion that xG is a good predictive statistic on the men’s side but biased at almost all levels on the women’s side.


The second key feature of this graph is the number of data points that lie below that G=xG line. Overall, there are only a handful of women’s teams that have underperformed their xG in the last three years. On the men’s side, however, it’s a fairly common occurrence. If the reason for the differences we see between xG and goals scored was just down to the variation in the quality of the league, we would expect to see a lot more women’s teams falling below that line because their quality is not high enough to compete. They would be coming up against better goalkeepers and better defenders, as well as being less accurate in their shots, which would mean that they should be underperforming their xG a lot more frequently than we actually see. This is clearly more evidence that the current xG formula is biased.


The second argument against what we’re seeing is that we’re dealing with small numbers, and the difference actually isn’t all that significant. The argument would follow that if we were to keep collecting data and look at this ten years down the line, the women’s number would probably come down closer to 0. Luckily, we can run a test in statistics called a t-test to try and see whether this is true or not. I’ll give a little bit more explanation here for those who might be encountering a t-test for the first time.

A t-test is designed to tell a real effect apart from random variation in a data sample. It takes into account the spread of our data, the number of data points that we have, and the average that we have from the small sample that we’re looking at, and gives us the probability of finding an average at least that far from the “true” value, for whatever “true” value we choose to assume.


For example, let’s say that we know for sure that xG is a good statistic, and that if we played infinite Premier League matches the average difference between xG and goals scored is 0. Then say we were to pick out 3 seasons at random (1992/1993, 2008/2009, and 2022/2023, for example) and look at the average difference between goals and xG in those 3 seasons alone. The average in those 3 might not be 0, because 1992/1993 might have been a crazy season where everyone was exceptional and finished their chances all the time. Then we might see an average difference different from 0. A t-test will then tell us how likely it is to find a difference like the one we see in those 3 seasons, given that in all games played ever the difference is 0. If the difference we see is big, it would be pretty unlikely to find it. If the difference is small, more likely.


We can then use that probability as a check on the assumption itself. Strictly speaking, it is not the probability that our “true” value is 0, but the smaller it gets, the harder it is to keep believing that the real difference, if we had infinite matches, is 0. If we find a really big number in our smaller sample, that assumption becomes less and less plausible. So that is what a t-test provides us in statistics.
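For readers who would like to see the mechanics, here is a minimal sketch of a one-sample t-test using only Python’s standard library. The function names are my own, the p-value comes from a simple numerical integration of the t-distribution rather than a statistics package, and the example differences are invented. In practice a library routine such as SciPy’s `scipy.stats.ttest_1samp` does the same job.

```python
import math
from statistics import mean, stdev


def one_sample_t(sample, true_value=0.0):
    """Return (t-statistic, degrees of freedom) against an assumed true value."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    return (mean(sample) - true_value) / standard_error, n - 1


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t_stat, df, steps=2000):
    """Probability of a t-statistic at least this far from 0 in either direction.

    Uses Simpson's rule on [0, |t|]; steps must be even.
    """
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # area between -|t| and +|t|
    return max(0.0, 1.0 - central_area)


# Hypothetical goals-minus-xG values per 90, invented for illustration.
example_diffs = [0.31, 0.12, -0.05, 0.40, 0.22, 0.18]
t_stat, df = one_sample_t(example_diffs)
print(t_stat, df, two_sided_p_value(t_stat, df))
```

The t-statistic is just the sample average divided by its standard error, so it grows when the average is further from the assumed value, when the spread is smaller, or when there are more data points.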


With that said, we can run t-tests for both the PL and WSL under the assumption that our “true” value is 0. The big takeaway here is the p-value, which is the probability we were just talking about. Based on the sample that we have, that’s the probability of finding an average at least as far from 0 as ours if the infinite value is 0. The smaller it is, the harder it is to believe that 0 is the true value.


| League | Average difference (goals minus xG, per 90) | t-statistic | p-value |
|---|---|---|---|
| Premier League | -0.0075 | -0.2939 | 0.77 |
| Women's Super League | 0.2766 | 4.8555 | 0.00002 |



Here we can see that the results we’ve found are not just an example of small sample size. For the men’s game, the probability of finding a difference at least as large as the one we have in the last 3 seasons, if the true value is 0, is 0.77, or 77%. The data give us no reason to doubt that 0 is close to the true difference between goals scored and Expected Goals for the league. On the women’s side, however, the probability is 0.002% that we would find an average this far from 0 if the true value is 0. That is roughly a 1 in 50,000 chance, so we can say with a great deal of confidence that the difference between goals scored and Expected Goals in the WSL is not zero. There is a bias. It’s there.


So what’s explaining this bias? Why does it exist and how do we fix it? It’s one thing to point it out and say “hey, we’re doing a bad job with WSL statistics” but we need to plan out an alternative.


Going back to what I mentioned at the beginning of the article, what this means is that the probability of a woman scoring a chance in women’s football is different from the probability of a man scoring the same chance in men’s football. Rather boringly, the answer we’re looking for might come down to physics.


Chloe Kelly’s penalty against Nigeria in the 2023 Women’s World Cup clocked in at 111 km/h. That was 4 km/h faster than the hardest hit shot in the Premier League last season. I am not about to make the argument that women can’t shoot just as well as men can. They can find those corners just as well, hit those shots just as hard, and add just as much spin, as evidenced in this fantastic advert.


The issue, in my opinion, comes from the probability of a woman being able to block the shot after it’s left a boot. WSL goalkeepers are, on average, 7 inches shorter than their male counterparts in the Premier League. The goals are the same size. Fundamentally, it is more difficult for a goalkeeper in the women’s game to reach the same shots that a male goalkeeper can.


This is then compounded by other physical differences between women and men. If you look at the qualifying standards for the upcoming 2024 Paris Olympic Games, the qualification standard for women is lower than for men in all of the sprinting and jumping categories. Between elite athletes of both genders, the men tend to be faster and able to jump further and higher. If the shots being hit are of the same standard, it is more difficult for a female defender to get their body in the way of the ball than it is for their male counterparts. That affects the probability of the ball going in. This also bears out in the data, where on average women’s goalkeepers save 0.28 xG per 90 less than they should based on the xG after the shot, compared to just 0.07 fewer than expected for men.


Often, in the past and unfortunately the present, this argument of physical differences has been used to belittle women and argue in favour of the patriarchy. I would like to emphasize that that is in no way my argument here.


The current, and original, models for xG were trained with men in mind. The sample was men, the shots were by men, the intention was to use the statistics for the men’s game. Through a combination of factors, this means that xG is not an appropriate statistic for analysis of the women’s game. As we’ve seen, it’s wrong. The very use of the same statistic with the same standards being translated over to the women’s side is an example of the lack of consideration for the women’s game.


The statistics used in this article are provided by Opta. Opta generate statistics for Sky Sports and the BBC, the two major broadcasters of the WSL and International Women’s Football. The use of an incorrect statistic here means that the broadcast is worse, the punditry is worse, the package is worse. It may seem small, but it can have significant impacts on how people enjoy the game, whether people decide to watch women’s football, and the depth and quality of coverage.


So how do we fix it? We retrain the model! There’s a wealth of women’s football data out there now, with coverage improving every season. We have evidence of plenty of shots, and plenty of goals. We need to regenerate the probabilities afforded to us by xG using exclusively a female sample. Sure, the accuracy now may not be perfect but it will improve over time. We shouldn’t use regular old xG. It should be MxG and WxG.


That way we can start to properly understand the underlying statistics in women’s football. We can generate better news, better articles, better player analysis. We can begin to truly understand how good Sam Kerr is, or how unique Olga Carmona’s World Cup winning goal was. Once we set the baseline, we can start to properly understand how exceptional these players are.


The development of tailored women’s statistics, and a move away from the same ones used by the men’s game, represents an appreciation of the women’s game for its own uniqueness and idiosyncrasies. Its own beauty and majesty. Not just because women are playing the same game as men.

 
 
 
