keenan's log Recording things so that I don't forget them.

Principal Component Analysis of Presidential Elections

Introduction

Using national exit polls, we undertake a principal component analysis of voter demographics for United States presidential elections from 1976 to 2012. We aim to quantify the dynamics of voters and identify the demographics that have best predicted presidential election outcomes from the Carter presidency to the current Obama presidency.

Data

National exit polls from Gallup and various news sources were aggregated into a raw data set. These polls captured 45 demographic groups which were further grouped into 12 categories.

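| Ideology | Party | Gender | Married | Race | Religion |
|---|---|---|---|---|---|
| Liberal | Democrat | Men | Married | White | Protestant |
| Moderate | Republican | Women | Unmarried | Black | Catholic |
| Conservative | Independent | | | Asian | Jewish |
| | | | | Hispanic | Evangelical |
| | | | | Other | |

| Age | Education | Income | Region | Community | Voting Status |
|---|---|---|---|---|---|
| 18-24 years | Less than HS | Under $30,000 | East | Urban | First-time voter |
| 25-29 years | High School | $30,000-49,999 | Midwest | Suburban | |
| 30-39 years | Some College | $50,000-99,999 | South | Rural | |
| 40-49 years | College | $100,000-199,999 | West | | |
| 50-64 years | Postgraduate | Over $200,000 | | | |
| 65 and older | | | | | |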

In each election, the voting behavior of the demographic groups is represented as a column vector capturing, for each group, whether or not the greater proportion voted for the winning presidential candidate.
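A minimal sketch of this encoding in Python (the article itself used R). The exact convention is an assumption: the article only says the values lie in {1, 0, -1}, so the meaning given to 0 here (a tie or missing data) and the function name `encode_group` are hypothetical, and the poll shares are invented for illustration.

```python
def encode_group(winner_share, loser_share):
    """Encode one demographic group's vote in one election.

    Assumed convention (not spelled out in the article):
      1  -> the larger share of the group voted for the eventual winner
      -1 -> the larger share voted for the losing candidate
      0  -> tie, or no data for that group in that year
    """
    if winner_share is None or loser_share is None:
        return 0
    if winner_share > loser_share:
        return 1
    if winner_share < loser_share:
        return -1
    return 0


# Hypothetical exit-poll shares in percent, for illustration only.
example_poll = {
    "Group A": (55, 43),
    "Group B": (41, 57),
    "Group C": (None, None),  # question not asked that year
}

encoded = {group: encode_group(w, l) for group, (w, l) in example_poll.items()}
print(encoded)
```

One such dictionary per election, laid out in a fixed group order, would give one vector of the data set.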

Note that R performs principal component analysis using singular value decomposition, and its convention is to treat rows as observations and columns as variables, so our matrix needs to be in election_year x demographic_group format. The data for all 45 demographics then form a 10 x 45 matrix. Since each demographic is encoded on the same scale of {1, 0, -1}, no normalization is needed to make the columns comparable. Mean-centering is a separate matter: R’s prcomp, for instance, centers each column by default.
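The SVD route to PCA mentioned above can be written compactly. This is the standard textbook formulation rather than anything taken from the author’s R code. The symbols are notation introduced here: X_c is the column-centered data matrix, n = 10 is the number of elections, and p = 45 is the number of demographic groups.

```latex
\begin{align}
X_c &= U \Sigma V^{\top}, & X_c &\in \mathbb{R}^{n \times p} \\
T &= X_c V = U \Sigma \\
\lambda_i &= \frac{\sigma_i^2}{n - 1}, &
\frac{\lambda_i}{\sum_j \lambda_j} &= \frac{\sigma_i^2}{\sum_j \sigma_j^2}
\end{align}
```

The columns of V are the principal directions, T holds the coordinates of each election along them, and the last ratio is the proportion of variance attributed to component i.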

Principal Component Analysis

Results

Each principal component represents a “dimension” of the data, describing some underlying dynamic. The table below shows the proportion of variance in the data that each principal component accounts for.

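| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 |
|---|---|---|---|---|---|---|---|---|---|
| Proportion of Variance | 0.6023 | 0.1401 | 0.08354 | 0.06334 | 0.03156 | 0.02662 | 0.02079 | 0.01767 | 0.01415 |
| Cumulative Proportion | 0.6023 | 0.7424 | 0.82588 | 0.88922 | 0.92078 | 0.94739 | 0.96818 | 0.98585 | 1.00000 |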
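The cumulative row is the running sum of the proportion row. A quick check in Python, using only the proportions printed in the table, is sketched below. The running total through PC2 comes to about 0.74, and the nine proportions sum to 1 up to rounding.

```python
from itertools import accumulate

# Proportion of variance for PC1..PC9, copied from the table above.
proportion = [0.6023, 0.1401, 0.08354, 0.06334, 0.03156,
              0.02662, 0.02079, 0.01767, 0.01415]

# Running total: how much variance the first k components explain together.
cumulative = list(accumulate(proportion))

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i}: proportion={p:.5f} cumulative={c:.5f}")
```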

Mathematically, the principal components of a dataset form an orthogonal basis. Geometrically, basis vectors can be seen as the axes of a Cartesian coordinate system. Humans are comfortable seeing data represented in two dimensions, x and y, so we can project the data onto the first two principal components, PC1 and PC2, and use them as the x and y axes. Together, they capture roughly three quarters of the variance in the data.
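To make the projection concrete, here is a small pure-Python sketch. It is not the author’s R workflow: it finds the first two components by power iteration on the covariance matrix instead of by SVD, and the 5 x 4 matrix `toy` is invented for illustration, not taken from the exit-poll data. All function names are hypothetical.

```python
import math

# Made-up toy matrix: rows are elections, columns are demographic groups,
# entries use the article's {1, 0, -1} encoding. Not the real dataset.
toy = [
    [1, -1, 1, 1],
    [-1, 1, 1, -1],
    [1, -1, -1, 1],
    [-1, 1, 1, 0],
    [1, -1, 1, 1],
]


def center_columns(rows):
    """Subtract each column's mean so every column averages to zero."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of the columns of an already-centered matrix."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(matrix, v):
    return [dot(row, v) for row in matrix]


def top_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and a unit-length eigenvector."""
    v = [float(i + 1) for i in range(len(matrix))]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return dot(v, mat_vec(matrix, v)), v


def deflate(matrix, eigenvalue, v):
    """Remove a found component so the next one becomes dominant."""
    return [[m - eigenvalue * vi * vj for m, vj in zip(row, v)]
            for row, vi in zip(matrix, v)]


centered = center_columns(toy)
cov = covariance(centered)
lambda1, pc1 = top_eigenpair(cov)
lambda2, pc2 = top_eigenpair(deflate(cov, lambda1, pc1))

# Coordinates of each election in the PC1/PC2 plane (the points of a biplot).
scores = [(dot(row, pc1), dot(row, pc2)) for row in centered]

total_variance = sum(cov[i][i] for i in range(len(cov)))
print("share explained by PC1:", lambda1 / total_variance)
print("share explained by PC2:", lambda2 / total_variance)
for point in scores:
    print(point)
```

The two directions come out orthogonal and of unit length, which is what lets them serve as perpendicular plot axes. The sign of each component is arbitrary, so a plot made this way may be mirrored relative to one produced by another tool.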

Figure: Presidential Election biplot

Here, the first two principal components form the axes of the biplot above. There are two things to interpret: points and vectors. Points (presidential election years) that are close together had similar voting patterns across demographics, while vectors (demographic groups) that are close together have similar voting behavior. We see that Democratic presidencies are grouped on the left while Republican presidencies are grouped on the right. Several demographics are clustered together:

  • Green: Democrat, Liberal, Hispanic, Urban, Jewish, Black
  • Orange: 18-24 years old, Women, Unmarried, East
  • Blue: Post-graduate education, First-time voter
  • Yellow: Republican, Protestant, $100,000-199,999 income

Analysis

Let’s focus on the first principal component, that is, on how demographics are distributed along the x-axis. Demographics on the farthest left are Democrat, Liberal, Hispanic, Urban, Jewish, and Black. On the farthest right, we have Republican, Conservative, Over $200,000 annual income, and Protestant. Recall that a principal component is supposed to capture some dynamic of the data in a certain dimension, so we can infer that the first component represents a political spectrum from Democratic to Republican. The farther a demographic group’s vector extends from the origin along the x-axis, the more politically united the group is.

Interestingly, we see that the more recent the presidency, the farther out it is on the x-axis. This reflects the historical trend that presidencies and voters have become more partisan over the years. Jimmy Carter’s presidency in 1976 and Ronald Reagan’s presidency in 1984 are much closer to the origin than Obama’s presidency in 2012 and George W. Bush’s presidency in 2004.

The second principal component appears to represent the predictive power of a demographic. Toward the top of the y-axis, knowing which candidate Liberals, Conservatives, or Married people support tells us little about who will actually win. Toward the bottom, support from Independents, the West, and First-time voters has major predictive power. This reflects the enormous influence that swing states and California bring to the election.

Conclusion

To capture the White House, this year’s presidential candidates most likely need to win Independents, the West, the blue cluster (post-graduates and first-time voters), and the orange cluster (18-24-year-olds, women, and the East). The orange cluster also includes unmarried people, but presumably this group is largely captured by the 18-24-year-old demographic.

Weaknesses

Sample Data

Arguably the biggest weakness of the analysis lies in the dataset itself: its small size and its imprecise demographic groups. Forty years of history yields only ten samples, making it difficult to create meaningful demographic groups. Groups such as Married are too broad and could be broken down further, e.g. into Married Independent voters or Married with College degree, but doing so makes it even more difficult to normalize exit poll questions across years.

Recent polls offer more specific classifications, but earlier polls do not go into such depth, so there is a trade-off between specificity and sample size. For example, national exit polls did not distinguish sexual orientation until the 1996 presidential election between Bill Clinton and Bob Dole.

PCA Assumptions

PCA seeks to draw out the overall dynamics of a system and weights every election equally regardless of when it occurred, but this strength is a weakness when applied to presidential elections. The very nature of winning elections requires identifying and capitalizing on recent trends, and this analysis misses the new influence of minority groups.

It is widely agreed that Obama won in 2008 and again in 2012 by capturing the women, youth, and minority vote. The analysis does highlight the significance of women and young voters, but places Hispanics and Blacks (part of the Green cluster) as poorly predictive. Moreover, the analysis places Independent voters as strongly significant, but the dataset shows their influence declining over the past 12 years.

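| Demographic | 2012 | 2008 | 2004 | 2000 | 1996 | 1992 | 1988 | 1984 | 1980 |
|---|---|---|---|---|---|---|---|---|---|
| Independent | -1 | 1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 |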
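The decline can be read straight off this row. The sketch below tallies how often Independents sided with the winner in the three most recent elections versus the six before them. It assumes that 1 means the group backed the winner and -1 means it backed the loser, and the 2004 split point is chosen here to match the "past 12 years" in the text.

```python
# Independent row from the table above.
# Assumed meaning: 1 = sided with the winner, -1 = sided with the loser.
independent = {
    2012: -1, 2008: 1, 2004: -1, 2000: 1, 1996: 1,
    1992: 1, 1988: 1, 1984: 1, 1980: 1,
}


def winner_rate(years):
    """Fraction of the given elections in which the group backed the winner."""
    return sum(independent[y] == 1 for y in years) / len(years)


recent = [y for y in independent if y >= 2004]
earlier = [y for y in independent if y < 2004]

print(f"2004-2012: {winner_rate(recent):.2f}")
print(f"1980-2000: {winner_rate(earlier):.2f}")
```

Under that reading, Independents backed the winner in one of the last three elections, compared with all six of the earlier ones.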

In addition, PCA presupposes a linear dynamic in how demographics influence the outcome of a presidential election. This may seem reasonable, as we expect elections to be zero-sum, i.e. a person can only vote for one candidate. We also expect a demographic’s influence to vary linearly with its size and unity, and each person’s vote to be equally weighted, i.e. presidential candidates win simply by being more popular. However, elections are won in the Electoral College rather than by popular vote, so the influence of a demographic group on the probability of the outcome may not be linear.
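A toy example shows how winner-take-all allocation breaks the link between vote totals and the outcome. The three states, their electoral votes, and the vote counts below are entirely hypothetical, and the sketch ignores real-world details such as states that split their electors.

```python
# Hypothetical three-state election:
# state -> (electoral votes, votes for candidate A, votes for candidate B)
states = {
    "State X": (10, 510_000, 490_000),
    "State Y": (10, 505_000, 495_000),
    "State Z": (9, 200_000, 800_000),
}

popular = {"A": 0, "B": 0}
electoral = {"A": 0, "B": 0}

for electors, votes_a, votes_b in states.values():
    popular["A"] += votes_a
    popular["B"] += votes_b
    # Winner-take-all: all of a state's electors go to its plurality winner.
    electoral["A" if votes_a > votes_b else "B"] += electors

print("popular:", popular)
print("electoral:", electoral)
```

Candidate B receives far more votes overall, yet candidate A wins on electors, so a group concentrated in State Z contributes many votes without moving the result.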

Extensions

Instead of looking at national exit polls, the analysis could focus on exit polls in the states with the most electoral votes (California, Texas, New York), or only on exit polls in swing states (Ohio, Pennsylvania, Florida, Colorado).

A survey of Hillary Clinton’s and Donald Trump’s electoral bases would also supplement this study.

References

  1. Shlens, J., 2014. A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100.

  2. “2012 President Exit Polls.” The New York Times.

  3. “How Groups Voted.” Roper Center for Public Opinion Research, Cornell University.

  4. “2004 U.S. President National Exit Poll.” CNN.

  5. “1992-1996 Voting Demographics.” Gallup.

  6. “CPI Inflation Calculator.” Bureau of Labor Statistics. Used to normalize income across polling years.