So I'm sitting here wondering how things will turn out tonight in Pennsylvania and it occurs to me that the wealth of data at Pollster.com might provide some insights. To that end, I've developed a simple little forecasting model for the Pennsylvania primary based on the relationship between Charles Franklin's final, pre-election estimates of poll standing (in states where there were enough polls to generate estimates, and excluding Florida) and actual votes cast. As the figure below illustrates, there have been a few hiccups along the way but the actual results have tracked fairly well with the final poll estimates:

Adding a control variable for whether Edwards was included in the final poll estimate (the negative coefficient suggests that Clinton did about four points worse than the polls predicted when Edwards was in the mix), we get the following model for forecasting Clinton's share of the two-candidate vote in Pennsylvania:

Plugging in the final estimate of her share of the two-candidate vote (53.58%), we get a predicted Clinton vote of 54.8%, with a 95% confidence interval ranging from 50.97% to 58.63%. Anything outside this range would be very surprising.
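As a quick sanity check on those figures, here is a minimal Python sketch that recovers the midpoint and half-width of the quoted interval. The `interval_summary` helper is a hypothetical name, and the implied standard error assumes a normal-approximation multiplier of 1.96, which may not be what the model actually used (a t-based interval would give a slightly different value).

```python
# Figures quoted in the post.
ci_low, ci_high = 50.97, 58.63


def interval_summary(low, high, z=1.96):
    """Return (midpoint, half_width, implied_se) for a symmetric interval.

    z=1.96 is an ASSUMPTION (normal approximation); the original model's
    interval may have been built with a t multiplier instead.
    """
    midpoint = (low + high) / 2
    half_width = (high - low) / 2
    implied_se = half_width / z
    return midpoint, half_width, implied_se


midpoint, half_width, implied_se = interval_summary(ci_low, ci_high)
print(f"midpoint = {midpoint:.2f}")        # should match the 54.8% prediction
print(f"half-width = {half_width:.2f}")
print(f"implied SE = {implied_se:.2f}  (assuming z = 1.96)")
```

The interval is centered on the 54.8% point prediction, and its lower bound sits just above 50%, so the model puts a Clinton loss outside the 95% range.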

Reminder: this model does not include states for which pre-election polls were too sparse to create estimates. Also, including Florida in the analysis changes the prediction only slightly (54.6%).

### September Polls

As I mentioned last week, Jay DeSart and I have done some work on using September statewide trial-heat polls to predict presidential election outcomes in the states. We use the Democratic candidate's share of the two-party vote in state-level September polls (averaged across publicly available polls), as well as a lagged vote variable, to predict state-level outcomes. While the lagged vote variable is important to our model, most of the predictive power comes from the September poll average. Some of the data on September polls are presented below.

First, let's look at some scatterplots of the relationship between September polls and November votes from 1992 to 2004:

Just so-so. No doubt one of the problems with September poll accuracy in 1992 was Perot's entry into the race in October. Even with this, however, point estimates from September polls called the correct winner in 39 of 50 states.

Better--just 6 errant calls (one of the two tied poll results was allocated as a "correct" prediction and the other as "incorrect").

Still good--stronger correlation and six errant calls.

Even better--just two errant calls (WI, NH).

The overall accuracy of September polls from 1992 to 2004 (below) is pretty impressive. The September poll average called the wrong winner in only 25 of the 200 election outcomes. And if you toss out 1992 on the basis of Perot's October candidacy, the polls were "wrong" in only 14 of the remaining 150 cases (9.3%).
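The tallies above can be reproduced with a few lines of Python. This is a minimal sketch using only the counts quoted in this post; assigning the two six-miss results to 1996 and 2000 is an assumption based on the order of the plots, and `miss_rate` is a hypothetical helper name.

```python
# Errant calls per election, as described above (39 of 50 correct in 1992
# means 11 misses). The 1996/2000 labels are ASSUMED from the plot order.
errant_calls = {1992: 11, 1996: 6, 2000: 6, 2004: 2}
STATES_PER_YEAR = 50


def miss_rate(calls, exclude=()):
    """Return (wrong, total, percent_wrong), skipping any excluded years."""
    years = [year for year in calls if year not in exclude]
    wrong = sum(calls[year] for year in years)
    total = STATES_PER_YEAR * len(years)
    return wrong, total, 100 * wrong / total


all_wrong, all_total, all_pct = miss_rate(errant_calls)
no92_wrong, no92_total, no92_pct = miss_rate(errant_calls, exclude=(1992,))

print(f"All years:      {all_wrong} of {all_total} ({all_pct:.1f}%)")
print(f"Excluding 1992: {no92_wrong} of {no92_total} ({no92_pct:.1f}%)")
```

Dropping 1992 removes 11 of the 25 misses, which is why the error rate falls from 12.5% to 9.3%.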

One interesting thing to note from last week's post, though, is that in 2004 the correlation between earlier (May and June) state polls and the eventual outcomes was almost as strong as the correlation between the September polls and the eventual outcomes. The big caveat, however, is that the May and June polls only included results for 23 and 21 states, respectively.

### How Well Do Early Polls Predict Election Outcomes?

SurveyUSA created quite a stir last month when they released the results of their fifty-state general election trial-heat survey. Over at Pollster.com, Charles Franklin and Mark Blumenthal provided a nice breakdown of the results, categorizing states as "strong" or "leaning" for one or the other candidate, or as a "toss-up," and concluded that these early results gave a slight advantage to both Obama and Clinton in their match-ups with McCain. Bob Erikson and Karl Sigman added to the mix with their teched-up, simulation-based analysis of the same data, reaching a very similar conclusion.

This is all great fun, and poll junkies (I include myself) love to see this type of analysis. But all of this assumes that early polls are good indicators of what will eventually happen on election day. Are they?

Jay DeSart and I have done some work on using state polls to predict presidential election outcomes, and it's quite clear that polls taken in September are good predictors of the state outcomes in November. But how well do statewide polls from spring of the election year predict the eventual outcomes? I thought it might be useful to look at data from 2004 to provide some sense of how much stock we should put in early polls.

In the figures below, I use statewide presidential trial-heat polls from March through June (polls averaged by state and month) to calculate John Kerry's expected percent of the two-party vote and then plot it against the actual two-party vote for Kerry in the November election. Note that there were no spring polls in many states in 2004 (no SurveyUSA fifty-state poll, for instance), so none of the plots include all 50 states.
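For readers who want to see the calculation, here is a minimal Python sketch of the two-party share and the state-by-month averaging. The poll records are made-up placeholders, not real 2004 results, and the function names are hypothetical. The sketch also assumes the share is computed for each poll before averaging; averaging the raw percentages first is another plausible reading and gives slightly different numbers.

```python
from collections import defaultdict
from statistics import mean


def two_party_share(dem, rep):
    """Democratic percent of the two-party (Dem + Rep) poll total."""
    return 100 * dem / (dem + rep)


# HYPOTHETICAL records for illustration only; not actual 2004 polls.
polls = [
    {"state": "State A", "month": "March", "kerry": 45, "bush": 45},
    {"state": "State A", "month": "March", "kerry": 46, "bush": 44},
    {"state": "State B", "month": "March", "kerry": 44, "bush": 46},
]


def monthly_state_averages(records):
    """Average the two-party share over all polls in each (state, month)."""
    grouped = defaultdict(list)
    for poll in records:
        key = (poll["state"], poll["month"])
        grouped[key].append(two_party_share(poll["kerry"], poll["bush"]))
    return {key: mean(shares) for key, shares in grouped.items()}


averages = monthly_state_averages(polls)
for (state, month), share in sorted(averages.items()):
    print(f"{state}, {month}: Kerry two-party share = {share:.1f}%")
```

Working with two-party shares drops undecided and third-party respondents, so a 45-45 poll and a 50-50 poll both count as an even 50%.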

Let's start with March, since this is closest to the timing of the 2008 SurveyUSA poll (late February).

As expected, there is a strong, positive relationship between March polls and November votes in the states that had polling results. But I wouldn't exactly describe the data points as tightly clustered, and the point estimates called the wrong winner in 4 (Michigan, Wisconsin, New Hampshire, and Pennsylvania) of the 11 states in which one candidate held a polling advantage (two other states, Ohio and West Virginia, were tied). It is worth noting that in each of these misfires both the polling margin and the eventual margin of victory were fairly narrow.

Results for April (N=21), May (N=23), and June (N=21) are posted below.

Two take-away points here. First, the correlation between statewide polls and the eventual election outcome grew stronger as the 2004 campaign progressed. Obvious enough, I suppose.

Second, when the polling margin was fairly narrow the outcome was truly up in the air. In fact, across all four months the poll result called the wrong winner in 17 of the 36 cases in which Kerry's share of the two-party vote in trial-heat polls was between 47% and 53% (this excludes two cases in which the poll result was tied). These results suggest that we should take the term "toss-up" very seriously. At the same time, the poll result was wrong in only 3 of the 44 cases in which Kerry's poll share was outside this range.
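To put those two counts on the same scale, here is a short Python sketch that converts them to error rates. It uses only the figures quoted above; the `error_rate` helper is a hypothetical name.

```python
# Counts quoted above: (wrong calls, total cases), pooled across March-June.
close_races = (17, 36)   # Kerry poll share between 47% and 53%, ties excluded
clear_races = (3, 44)    # Kerry poll share outside that band


def error_rate(wrong, total):
    """Percent of cases in which the poll leader lost the state."""
    return 100 * wrong / total


close_pct = error_rate(*close_races)
clear_pct = error_rate(*clear_races)

print(f"Close races: {close_pct:.1f}% wrong")
print(f"Clear races: {clear_pct:.1f}% wrong")
```

An error rate near 47% in the close band is barely distinguishable from a coin flip, while the rate outside the band is under 7%.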

So what are the implications of this for how we should view the early SurveyUSA results for 2008? Assuming the data from 2004 provide a reasonable basis for speculating, I expect most of the "strong" (as categorized by Franklin and Blumenthal) states to stay in their candidate's camp. But I would not assign "toss-up" or "leaning" states to either candidate with much confidence. One exception to this would be states such as South Carolina (lean McCain) and Massachusetts (lean Obama), whose partisan histories argue in favor of greater confidence.

Update: Per Mark Blumenthal's suggestion, here are the graphs again, except with the same y-axis. Visually, this seems to make the most difference in the impression given by the March data.

