Evaluation – Quick Update – “The Model vs Lawro vs Merse” – Model takes the Lead

The featured image (and below) plots the cumulative forecasting performance of “The Model vs Lawro vs Merse” in the Premier League since round 10, up to and including the midweek fixtures just gone, according to the BBC Sport scoring metric (40 points for a perfect scoreline forecast, 10 points for a correct result only).

The Model got off to a good start, but by round 5 tipster Merson had overtaken the Model, and by round 9 tipster Lawro was getting close (see here).

But now the Model has taken the lead again, overhauling Merson and pulling away from Lawro – and we wouldn’t bet against that trend continuing until the end of the season!




Evaluation – A Quick Update – “The Model vs Lawro vs Merson”

The featured image plots the cumulative forecasting performance of “The Model vs Lawro vs Merson”, according to the BBC Sport scoring metric (40 points for a perfect scoreline forecast, 10 points for a correct result only).

The Model got off to a good start, but hasn’t really picked up the pace as more information about teams’ relative abilities this season has become available, which in theory should have improved its performance.

In the meantime, Merson has overtaken the Model, and Lawro is getting close.

There is a discussion to be had in the pub this weekend, however, about whether matches at the start of this season were more predictable than normal, and whether “predictability” in the EPL has since fallen, perhaps offsetting any gains the Model has made in its abilities (e.g. the Model is gradually improving its opinion on Wolves, but struggling to distinguish Man U’s drop in form from their long-run average abilities).

The Model is beating the experts

After 40 Premier League matches, the Model is beating the experts. By experts we mean former professional footballers Mark Lawrenson (aka “Lawro”) and Paul Merson (aka “Merse”), who make well-publicised Premier League score forecasts for BBC Sport and Sky Sports, respectively.

Exact Scores:

The harshest performance metric for a football forecaster is the percentage of exact scorelines they get correct. The Model is currently performing at 15%, just edging Merse, who has predicted 13% of scores bang on. Lawro is trailing way behind, on only 5% so far.

Number of Exact Scores Predictions Correct in Premier League 2018/19 Rounds 1-4:

The Model: 6/40

Lawro: 2/40

Merse: 5/40


A more forgiving performance metric is the percentage of results forecast correctly. Again the Model leads the pack, with 65% correct. Lawro is getting fewer than one in every two results right (48%), while Merse is doing better (58%).

Number of Result Predictions Correct in Premier League 2018/19 Rounds 1-4:

The Model: 26/40

Lawro: 19/40

Merse: 23/40

“Lawro Points”:

Finally, The Model is clearly outperforming Lawro at his own game, and Merse too for that matter. Using the points scoring system from the BBC Sport predictions game (40 for an exact score, and 10 for just a result), the Model has taken 28% of the points so far on offer. This compares with 16% for Lawro and 24% for Merse.

Accumulated “Lawro points” in Premier League 2018/19 Rounds 1-4:

The Model: 440

Lawro: 250

Merse: 380
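The accumulated totals above follow directly from the exact-score and result counts; here is a minimal sketch in Python (the helper name is ours):

```python
# BBC Sport scoring metric: 40 points for an exact scoreline, 10 for a
# correct result only (result-only hits = correct results - exact scores).
def lawro_points(exact_scores, result_only):
    return 40 * exact_scores + 10 * result_only

totals = {
    "The Model": lawro_points(6, 26 - 6),
    "Lawro": lawro_points(2, 19 - 2),
    "Merse": lawro_points(5, 23 - 5),
}
print(totals)  # {'The Model': 440, 'Lawro': 250, 'Merse': 380}
```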

Should the Model be more humble?

Probably. 40 games is still a relatively small sample, and there is plenty of time for the experts to turn things around. Lawro remains the biggest threat, given his historical forecast performance outstrips Merse by some distance.


As discussed on this blog previously, we would expect the Model only to get better and better as the season progresses.



So close, but yet so far

With fifteen minutes of play left in most matches on Tuesday, our scoreline predictions were spot on in seven matches, and just one goal in a number of other matches could have yielded more exact scores. It looked like it might just be a bonus week.

Then Leeds equalised at Swansea, and one by one the others slipped away, until Crawley equalised with the last kick of the evening against nine-man Swindon; all seven disappeared, and none of the possibilities materialised.

At the same time, while we bemoan how close we have been, we’ve also been spectacularly out. We had Stoke to start strongly and win at Leeds. We picked QPR for a surprise 1-0 win at West Brom; the actual score was 7-1 to West Brom. Last night we thought Scunthorpe would beat Fleetwood 1-0 at home, but by the 29th minute they were 4-0 down, and they eventually succumbed 5-0.

Humblings all round. We got no correct scores, below expectation, and 14 correct results out of 34 matches, about 42%. That’s about the same as the frequency of home wins, meaning had we just predicted a home win in every match, we’d have done about as well.

Which brings to light the question of how we evaluate our forecasts. Do we just record a zero for a 1-7 when we picked 1-0, and a one when the 2-1 we predicted does actually happen? Or do we sum up how many goals out we were, so that we were 7 out for West Brom, and 6 for Scunthorpe ((1 − 0) − (0 − 5) = 6)?
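One way to read the “goals out” idea above is as the gap between predicted and actual goal difference; a minimal sketch, with scorelines as (home, away) tuples (the function name is ours):

```python
def goals_out(pred, actual):
    """Absolute gap between predicted and actual goal difference."""
    return abs((pred[0] - pred[1]) - (actual[0] - actual[1]))

# We picked QPR 1-0 at West Brom (0-1 home-away); it finished 7-1: 7 goals out.
print(goals_out((0, 1), (7, 1)))  # 7
# We picked Scunthorpe 1-0 at home; they lost 0-5: 6 goals out.
print(goals_out((1, 0), (0, 5)))  # 6
```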

We plan to develop a little the ways we evaluate forecasts, not least to reflect the way we are evaluating our own forecasts to try and make them better.

Reality Check

Last weekend the Premier League forecast performance was exceptional. This weekend, it looks no better than what people who thought Gay Meadow was just the name of a funfair could achieve.

But with 40 matches played so far in the Football League, the model is still at a par with last weekend in terms of correct scorelines predicted (4/40; 10%), and could improve with 6 games left to play. Overall, this hit rate is a little lower than what we would expect the model to achieve, but we are still tinkering with our exact method.

Evaluation and Scoring Rules

A bunch of “other” matches took place last night, which we forecast. We also presented conditional forecasts, which are perhaps a bit more intuitive because, while 1-1 is often the most likely scoreline, a draw is hardly ever the most likely result.

After the event, the natural question is: how did we do? The table below shows that our standard forecasts (loads of 1-1s) got 15 right results, and 8 scores (from 46 matches), while our conditional forecasts got 22 right results, and 5 scores.

|             | Results | Scores | Lawro score | Sky score | Made-up score |
| ----------- | ------- | ------ | ----------- | --------- | ------------- |
| Forecast    | 15      | 8      | 47          | 35        | 31            |
| Conditional | 22      | 5      | 42          | 34.5      | 32            |

So which is better: getting more scores, or more results? This all depends on preferences. The scoring rule for Mark Lawrenson’s forecasts on BBC Sport is 40 points for a score, 10 for a result. The Sky Super Six scoring rule, which we might attribute to Paul Merson’s forecasts, is 5 for a score, 2 for a result, thus valuing a score less than the BBC does. By the Lawro score (scaled down to make it comparable to Sky), our unconditional forecasts got 47, our conditional ones 42. By the Sky score, the difference was half a point: 35 to 34.5. If we value a score at only twice a result (rather than 2.5 times, as Sky does, or 4 times, as Lawro does), then our conditional forecasts come out on top, 32 to 31.
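The numbers above can be reproduced by weighting scores and results under each rule and applying the scaling divisors mentioned; a minimal sketch (the rule names and divisor bookkeeping are ours):

```python
# Each rule: (points per exact score, points per correct result, divisor
# used above to put the rules on a roughly comparable scale).
rules = {
    "Lawro (BBC)": (40, 10, 10),
    "Sky Super Six": (5, 2, 2),
    "Score = 2 x result": (2, 1, 1),
}

def scaled(scores, results, w_score, w_result, divisor):
    """Scaled points for given counts of exact scores and correct results."""
    return (w_score * scores + w_result * results) / divisor

for name, (ws, wr, d) in rules.items():
    std = scaled(8, 15, ws, wr, d)    # standard forecasts
    cond = scaled(5, 22, ws, wr, d)   # conditional forecasts
    print(f"{name}: standard {std}, conditional {cond}")
```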

Scoring rules matter, and may well matter for how players play games. Scoring rules probably also reflect, to some extent, our preferences and beliefs. It’s clearly much harder to get an exact score right, so why not reward that, like Lawro’s score does, by a lot more than just a result?

What a weekend!

The Premier League has started, as have our forecasts for it. Testing the model on 2016/17 and 2017/18 seasons suggested that on average we might expect to get about one correct score per week, and about 5 results right (i.e. we predict 2-0 and it finishes 2-1).

As you can see from our evaluation page for the Premier League, we got three correct scores and six correct results. While we’d like to think this will be the norm for the season, history tells us that this kind of remarkable performance only happens about 1% of the time! We scored 100 points more than Mark Lawrenson by his scoring metric (10 points for a correct result, 40 for a correct score). In our testing, we only got a better score than Lawro about 35% of the time.

Looking down the divisions, we see a more subdued performance: in the Championship just one correct score and four correct results; in League One no correct scores and just two correct results; and in League Two no correct scores and six correct results.

Why were our picks so much better in the Premier League? A few hypotheses. One is that there is a lower turnover of teams in the Premier League, with only three teams entering each season, compared with six in the Championship, seven in League One, and six in League Two. Our performance this week supports this, since our worst forecast performance was in League One, the division with the most new teams.

Another hypothesis is that lower league clubs rely more heavily on the loan system and short-term contracts, and as such player turnover is higher at these levels. Again, there’s some support for this. The plot below shows the total turnover in players at football clubs over the summer months (data from Soccerbase). The blue and green lines are Leagues One and Two, which in recent years have been higher than the Premier League and Championship.