By Tom Macaulay
July 13, 2017
The shock UK election result capped a miserable year for pollsters, following botched predictions for the EU referendum and US Presidential election, but one of them correctly forecast a hung parliament where almost all others failed.
As LBC and Newsnight presenter James O'Brien put it: "The only real winner so far is YouGov."
The YouGov prediction was the result of a new statistical model called Multilevel Regression and Post-stratification (MRP) that was developed to produce estimates for small geographies, such as constituencies. It led the firm to correctly forecast the winner in 93 percent of seats, despite relying on an average sample size in each constituency of just 75 people.
The model was primarily developed by Professor Ben Lauderdale of the London School of Economics and YouGov's data science team, headed by Doug Rivers of Stanford University, who told Computerworld UK how it works.
"A poll of 75 people can easily be off by ten plus points," says Rivers. "The trick was we knew how many people voted Conservative and Labour and SNP in 2015, and we know how many voted to leave or remain in 2016. Those two things when you add them to the demographics are much more powerful predictors."
Previous voting behaviour is added to demographic information to reinforce the small sample sizes that typically leave a lot of room for error when predicting each constituency.
The model thereby compensates for thin data and low response rates, allowing it to predict accurately which seats would swing.
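Rivers's remark that a poll of 75 people can be off by ten plus points can be checked with the textbook margin-of-error formula. The sketch below is illustrative only: it assumes simple random sampling and a 95 percent confidence level, neither of which is stated in the article, and the function name is hypothetical.

```python
import math

def margin_of_error(share, sample_size, z=1.96):
    """Approximate 95% margin of error for a vote share.

    Assumes a simple random sample (an assumption made for this
    illustration; online panels are not simple random samples).
    """
    return z * math.sqrt(share * (1.0 - share) / sample_size)

# Worst case: a party polling at 50% among 75 respondents.
moe = margin_of_error(0.5, 75)
print(f"{moe * 100:.1f} points")  # about 11.3 points
```

On these assumptions, a single-constituency sample of 75 carries an uncertainty of roughly 11 points either way, which is why the raw local numbers need reinforcing.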
The YouGov MRP model
YouGov used poll data from the preceding seven days to relate the variables on respondent profiles to their current voting intentions. These variables include their constituency, demographics, past voter behaviour and interview date. The model then estimates the probability of each type of voter voting for a specific political party.
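One common way to turn a voter profile into party probabilities is a multinomial logit, sketched below. This is not YouGov's model: the feature names and every coefficient are made up for illustration, and a real multilevel regression would estimate them from the poll data.

```python
import math

# Hypothetical coefficients, invented for this sketch only.
COEFS = {
    "Con":   {"intercept": 0.0,  "voted_con_2015": 2.0,  "voted_lab_2015": -1.0, "voted_leave_2016": 0.8},
    "Lab":   {"intercept": 0.0,  "voted_con_2015": -1.0, "voted_lab_2015": 2.0,  "voted_leave_2016": -0.4},
    "Other": {"intercept": -0.5, "voted_con_2015": 0.0,  "voted_lab_2015": 0.0,  "voted_leave_2016": 0.0},
}

def vote_probabilities(voter):
    """Map a voter profile (dict of 0/1 features) to party probabilities."""
    scores = {}
    for party, weights in COEFS.items():
        score = weights["intercept"]
        for feature, weight in weights.items():
            if feature != "intercept":
                score += weight * voter.get(feature, 0)
        scores[party] = score
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores.values())
    exps = {party: math.exp(s - top) for party, s in scores.items()}
    total = sum(exps.values())
    return {party: e / total for party, e in exps.items()}

# A hypothetical voter type: voted Labour in 2015 and Remain in 2016.
probs = vote_probabilities({"voted_lab_2015": 1, "voted_leave_2016": 0})
print(probs)
```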
The Office for National Statistics (ONS) Annual Population Survey, the British Election Study, and the 2015 general election and EU referendum votes are then used to estimate how many of each voter type there are in every constituency. YouGov can then predict how many of each type intend to vote in their constituency.
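The post-stratification step amounts to a weighted average: each voter type's party probabilities are weighted by how many people of that type live in the constituency. A minimal sketch follows, with two invented voter types and invented counts and probabilities; none of these figures come from YouGov.

```python
def poststratify(type_counts, type_probs):
    """Combine per-type party probabilities into constituency vote shares.

    type_counts: {voter_type: estimated number of such voters in the seat}
    type_probs:  {voter_type: {party: probability of voting for that party}}
    """
    total = sum(type_counts.values())
    parties = next(iter(type_probs.values())).keys()
    return {
        party: sum(type_counts[t] * type_probs[t][party] for t in type_counts) / total
        for party in parties
    }

# Hypothetical constituency with two voter types.
counts = {"young_remain": 20000, "older_leave": 30000}
probs = {
    "young_remain": {"Con": 0.20, "Lab": 0.60, "Other": 0.20},
    "older_leave":  {"Con": 0.60, "Lab": 0.25, "Other": 0.15},
}
shares = poststratify(counts, probs)
print(shares)  # roughly Con 0.44, Lab 0.39, Other 0.17
```

Because the counts come from census-style sources rather than from the poll itself, the estimate depends on the 75 local interviews only through the per-type probabilities.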
The model further compensates for the small number of interviews conducted in each electoral area by pooling data from respondents in other constituencies to augment the sample size and increase its accuracy. This works because a voter's profile remains a fairly accurate predictor of how they will vote, regardless of where they live.
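The effect of pooling can be illustrated with a simple shrinkage estimate, in which a noisy local figure is pulled towards a figure based on everyone else. This is a simplified stand-in for what a multilevel model does, not YouGov's method, and the `prior_strength` parameter and all numbers are hypothetical.

```python
def pooled_share(local_votes, local_n, wider_share, prior_strength=150):
    """Shrink a small-sample local vote share towards a wider estimate.

    prior_strength acts like a number of 'borrowed' respondents: the larger
    it is, the more the local estimate leans on the wider data.
    """
    return (local_votes + prior_strength * wider_share) / (local_n + prior_strength)

# Hypothetical seat: 45 of 75 local respondents back a party (60%),
# while similar voters elsewhere back it at 42%.
estimate = pooled_share(local_votes=45, local_n=75, wider_share=0.42)
print(round(estimate, 2))  # 0.48
```

With only 75 local interviews the estimate sits closer to the wider figure; as the local sample grows, the local data dominates.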
The data is sent from YouGov's survey system to its in-house Crunch analytics database. The sample is then processed through a piece of open source probabilistic programming software called Stan, whose development was led by Columbia University statistician Andrew Gelman. It uses an algorithm known as Hamiltonian Monte Carlo to draw samples from the model's probability distribution, from which the estimates are calculated.
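For readers curious about what Hamiltonian Monte Carlo does, the toy sampler below draws from a one-dimensional normal distribution using the standard leapfrog-and-accept scheme. It is a bare-bones sketch, not Stan's implementation (Stan uses an adaptive variant of the algorithm), and the target distribution, step size and number of steps are arbitrary choices made for the example.

```python
import math
import random

def hmc_sample(log_prob, grad_log_prob, x0, n_samples,
               step_size=0.2, n_leapfrog=10, seed=0):
    """Basic Hamiltonian Monte Carlo for a one-dimensional target."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)            # draw a fresh momentum
        x_new, p_new = x, p
        # Leapfrog integration: half momentum step, alternating full
        # position and momentum steps, then a final half momentum step.
        p_new += 0.5 * step_size * grad_log_prob(x_new)
        for i in range(n_leapfrog):
            x_new += step_size * p_new
            if i < n_leapfrog - 1:
                p_new += step_size * grad_log_prob(x_new)
        p_new += 0.5 * step_size * grad_log_prob(x_new)
        # Accept or reject based on the change in total energy.
        h_old = -log_prob(x) + 0.5 * p * p
        h_new = -log_prob(x_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
        samples.append(x)
    return samples

# Toy target: a normal distribution with mean 1.0 and standard deviation 0.5.
MU, SIGMA = 1.0, 0.5
log_prob = lambda x: -0.5 * ((x - MU) / SIGMA) ** 2
grad_log_prob = lambda x: -(x - MU) / SIGMA ** 2

draws = hmc_sample(log_prob, grad_log_prob, x0=0.0, n_samples=3000)[500:]
mean = sum(draws) / len(draws)
print(f"sample mean: {mean:.2f}")
```

The sample mean should land close to the target's mean of 1.0, showing how the sampler recovers a distribution it is only told how to evaluate, not how to draw from directly.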