Tuesday, August 2, 2016

Frequently asked questions about methodology


Where do you get your data? I use two websites, Real Clear Politics and Pollster, the website now owned by The Huffington Post. The articles on RCP lean to the right and the articles on Pollster lean to the left, but both are reliable reporters of polling results.

Are there any polls you don't use? In the past, I used any poll either of those two websites decided to print. I have started a new policy of not using data where the polling sample is only of "adults". I will use polls that screen respondents for being registered or likely voters.

Do you "skew" the results? Not exactly. Back in 2012 when the state polling data showed how much of a disadvantage Romney faced, a fellow named Dean Chambers decided to look at the mix of independents, Republicans and Democrats each poll used and "fix" the ones he thought undersampled Republicans. Doing this, he showed Romney with a comfortable lead. Conservatives loved this and Chambers became a hero. After the election when his methodology was shown to be defective, he admitted his error, but within a few months he convinced himself he wasn't wrong and there had been irregularities at the polls. Most people ignored him.

I do change the results I get, but I don't change who has the lead. For example, a poll in June from Nevada with a sample size of 300 showed Trump with a 47% to 45% lead. Multiplying the sample size by the percentages, this would say 141 people favored Trump and 135 people favored Clinton. I take these numbers and effectively ignore people who are undecided or voting for someone else. Of the 276 people choosing one of the two most popular candidates, Trump has 51.1% and Clinton 48.9%. These are the values I use to get a probability from an assumed normally distributed set. If you are familiar with Excel, the formula I use here is

=NORMDIST(0.511,0.5,SQRT((0.511*0.489)/276),1)

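This gives Trump a Confidence of Victory number of 64.10% in this case. That figure comes from the unrounded shares (141/276 and 135/276); typing the rounded values exactly as shown in the formula gives about 64.3%.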
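For readers who don't use Excel, the same calculation can be sketched in Python. This is only an illustration of the arithmetic above, not the spreadsheet itself: the function name `confidence_of_victory` and its parameters are my own labels, and the normal distribution is evaluated with the standard library's `erf` in place of NORMDIST.

```python
from math import erf, sqrt

def confidence_of_victory(leader_pct, trailer_pct, sample_size):
    """Two-candidate Confidence of Victory for the poll leader (illustrative sketch)."""
    # Turn the reported percentages back into head counts.
    leader_n = sample_size * leader_pct / 100
    trailer_n = sample_size * trailer_pct / 100
    # Ignore undecided and other-candidate respondents.
    two_way_n = leader_n + trailer_n
    share = leader_n / two_way_n
    # Standard error of the leader's two-way share.
    std_err = sqrt(share * (1 - share) / two_way_n)
    # Cumulative normal at `share` with mean 0.5 and sd `std_err`,
    # the same quantity as NORMDIST(share, 0.5, std_err, 1).
    z = (share - 0.5) / std_err
    return 0.5 * (1 + erf(z / sqrt(2)))

# Nevada example from the text: 47% to 45% with a sample of 300.
cov = confidence_of_victory(47, 45, 300)
print(f"{cov:.2%}")
```

With the Nevada numbers this prints 64.10%. The sketch would fail with a division by zero if the leader had 100% of the two-way vote, a case that does not come up with real polls.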

What about third party candidates? Confidence of Victory does have methods to deal with third party candidates or even more candidates. Unless the third party candidate is within a few percentage points of the two favorites, the CoV number for third place (or fourth place) becomes microscopic.

What if no polls exist? At the time this FAQ is being written, there are still many states that haven't been polled. In that case, I use the 2012 result for that contest and assume a sample size of 200.

What if more than one poll exists? I sort the polls by date and take the median of what I consider to be the freshest polls.

Why median instead of average? Median is not affected by polls that disagree with the consensus by a large amount, what we call outliers. Average does get skewed by these polls.
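A small made-up example shows the difference. The five shares below are hypothetical, not real polls; four of them cluster near 52% and one sits far away at 61%.

```python
from statistics import mean, median

# Hypothetical two-way shares for the leader from five polls.
# The last value is the outlier.
leader_shares = [0.51, 0.52, 0.52, 0.53, 0.61]

poll_mean = mean(leader_shares)
poll_median = median(leader_shares)

print(f"mean:   {poll_mean:.3f}")
print(f"median: {poll_median:.3f}")
```

The median stays at 0.520, where the consensus is, while the single outlier drags the average up to 0.538.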

Define "freshest". In previous years, I usually started poll aggregation after Labor Day, but the conventions were early this year and so I start my collection of data in August instead of September. The polling companies have not ramped up to full frenzy, so many states that are considered battleground still have just a few polls. In ideal situations, I only look at polls no more than a week older than the most recent. In this early going, I stretched that out. In June, I'd use any polls from June. Now, I usually stretch the freshness mark to two weeks if there has been a poll in July.
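The freshness rule and the median can be sketched together. This is a rough illustration under my own assumptions: the polls are made up, the function name `freshest_median` is hypothetical, and I assume the median is taken over one candidate's two-way share. The window is a parameter because, as described above, it gets stretched when polls are scarce.

```python
from datetime import date, timedelta
from statistics import median

def freshest_median(polls, window_days=7):
    """Median share among polls within window_days of the newest poll.

    polls: list of (poll date, candidate's two-way share) pairs.
    """
    newest = max(poll_date for poll_date, _ in polls)
    cutoff = newest - timedelta(days=window_days)
    fresh = [share for poll_date, share in polls if poll_date >= cutoff]
    return median(fresh)

# Hypothetical polls for one state.
polls = [
    (date(2016, 7, 5), 0.48),
    (date(2016, 7, 20), 0.52),
    (date(2016, 7, 24), 0.51),
    (date(2016, 7, 26), 0.53),
]

one_week = freshest_median(polls)                    # drops the July 5 poll
three_weeks = freshest_median(polls, window_days=21) # keeps all four
```

With a one-week window the July 5 poll is ignored; with a three-week window it counts and pulls the median down slightly.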

How do you create your probability of victory number? I choose 15 battleground states using this algorithm.

1) If the CoV number in a state is between 50% and 90% for the leader, that state is a battleground.
2) If this method gives us fewer than 15 states, states where the CoV is between 90% and 95% are added to the mix.
3) If we still fall short of 15, states with CoV over 95% are added on the basis of CoV times the number of electoral votes, which is called the expected value of the contest.
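The three steps above can be sketched as a short function. Several details here are assumptions of mine, since the steps don't spell them out: the boundary handling at exactly 90% and 95%, ranking step 3 by the largest CoV times electoral votes first, and leaving the list alone if steps 1 and 2 produce more than 15 states. The name `pick_battlegrounds` and the lettered states are hypothetical.

```python
def pick_battlegrounds(states, target=15):
    """Choose battleground states (illustrative sketch).

    states: dict mapping state name -> (leader's CoV, electoral votes).
    """
    def cov(name):
        return states[name][0]

    def expected_value(name):
        # CoV times electoral votes, the "expected value of the contest".
        return states[name][0] * states[name][1]

    # Step 1: leader's CoV from 50% up to (not including) 90%.
    picked = sorted((s for s in states if 0.50 <= cov(s) < 0.90), key=cov)
    # Step 2: if short, add states from 90% through 95%.
    if len(picked) < target:
        picked += sorted((s for s in states if 0.90 <= cov(s) <= 0.95), key=cov)
    # Step 3: if still short, fill from states over 95%.
    # ASSUMPTION: largest expected value is added first.
    if len(picked) < target:
        safe = sorted((s for s in states if cov(s) > 0.95),
                      key=expected_value, reverse=True)
        picked += safe[: target - len(picked)]
    return picked

# Hypothetical states, with a small target to keep the example short.
example = {"A": (0.64, 6), "B": (0.92, 10), "C": (0.99, 38),
           "D": (0.97, 3), "E": (0.999, 55)}
chosen = pick_battlegrounds(example, target=4)
```

Here state A qualifies in step 1, B is added in step 2, and E and C fill the last two slots in step 3.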

With 15 states, we have 2^15 or 32,768 possible outcomes. The probability of each outcome is added to the column of whoever wins that scenario. If a scenario is a 269-269 tie, I assume the House of Representatives will elect Trump, since the House currently has a large Republican majority. If Trump does well enough to tie, it is very unlikely the Republicans will lose the House.
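The tally over every outcome can be sketched like this. It is an illustration, not my actual spreadsheet: the function and parameter names are hypothetical, the example numbers are made up, and the sketch assumes each state's result is independent of the others, so a scenario's probability is the product of the state probabilities.

```python
from itertools import product

def probability_of_victory(battlegrounds, trump_safe_ev, clinton_safe_ev):
    """Enumerate every battleground outcome (illustrative sketch).

    battlegrounds: list of (probability Trump wins the state, electoral votes).
    Returns (Trump's total probability, Clinton's total probability).
    """
    trump_total = 0.0
    clinton_total = 0.0
    # One True/False per state: 2**n scenarios in all.
    for outcome in product((True, False), repeat=len(battlegrounds)):
        prob = 1.0
        trump_ev = trump_safe_ev
        clinton_ev = clinton_safe_ev
        for trump_wins, (p_trump, ev) in zip(outcome, battlegrounds):
            if trump_wins:
                prob *= p_trump        # ASSUMPTION: states are independent
                trump_ev += ev
            else:
                prob *= 1 - p_trump
                clinton_ev += ev
        # A tie goes in Trump's column, per the House assumption above.
        if trump_ev >= clinton_ev:
            trump_total += prob
        else:
            clinton_total += prob
    return trump_total, clinton_total

# Hypothetical three-state example; electoral votes sum to 538.
trump_p, clinton_p = probability_of_victory(
    [(0.641, 6), (0.30, 29), (0.45, 18)],
    trump_safe_ev=240,
    clinton_safe_ev=245,
)
```

In this made-up example Trump needs the 29-vote state to reach 269, so his probability of victory works out to the 30% chance he has in that state. With 15 real battlegrounds the loop simply runs 32,768 times.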

What do you consider a toss-up state? Exactly 50%-50%. Any slight edge puts a contest in the column of the leader.

Does your system ever get predictions wrong? Yes, but in the general election it is rare. In the primaries, it's much more common for polls to get the result wrong. Bernie Sanders' big win in the Michigan primary is an obvious example from this year, and primaries with a lot of candidates can produce surprises as well. Polls do a bad job of predicting caucus winners, since the number of caucus voters is much smaller than the number who would vote in a primary. American elections are much longer than elections in most of the world and by November, the vast majority of people have chosen their candidate and the get-out-the-vote campaigns are more important than ads trying to change people's minds late in the game.

My other assumption about the bad performance of polls in primaries is the amount of time voters have taken to think about the candidates, especially in a multi-candidate race. The differences between the several candidates from the same party are usually not as stark as the differences between two candidates from different parties.

If you have any other questions, please add them in the comments. I'll try to add as many as possible to this list.
