Analyzing Opinions in Congressional Districts from Large National Polls

I recently came across an article in Vice by Sean McElwee which made a very compelling case for Democrats moving to the left on issues like immigration and racial justice to mobilize millennial voters, instead of moving to the center on these issues in an attempt to win over white working class voters. This move aligns with my personal politics, but it seems to run completely contrary to a good deal of conventional punditry on how Democrats should go about winning elections. The author made his case by analyzing several questions from the 2016 Cooperative Congressional Election Study (CCES), a very large national survey that covers a number of topics related to policy and political attitudes.

In this post, I'll attempt a deeper dive into some of the CCES data using Bayesian inference with PyMC3. I'm mainly interested in how much the CCES data can tell us about how opinions vary by congressional district, with a particular focus on college educated millennials. The results back up the main thesis of the Vice article pretty strongly, so I won't try to rehash much of the commentary. Instead, this post will focus on the computational and modeling issues surrounding this type of analysis.

We can estimate opinions at the state or congressional district level from national surveys using a technique called Multilevel Regression and Poststratification, also known as MRP or Mister P. The analysis consists of two steps. The first is a regression step, where a model of someone's propensity to hold a particular opinion based on their individual demographic characteristics as well as the state or congressional district they live in is inferred from the survey data. In the second step, the model is used to predict district, state, or nationwide opinions using the demographic counts from the state population. This is called poststratification, and it accounts for the fact that the demographic composition of the surveyed group is often very different from the demographic composition of a given state or the country as a whole.
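To make the poststratification step concrete, here is a minimal sketch for a single district. The cell names, predicted probabilities, and population counts are all made up for illustration; in a real analysis the probabilities would come from the fitted regression model and the counts from census data.

```python
import numpy as np

# Hypothetical demographic cells for one district (illustrative only).
cells = ["white/college", "white/no college",
         "nonwhite/college", "nonwhite/no college"]

# Model-predicted probability of agreeing in each cell (made-up numbers).
cell_prob = np.array([0.60, 0.35, 0.80, 0.75])

# Population counts for the same cells (made-up numbers).
cell_count = np.array([20_000, 50_000, 10_000, 20_000])


def poststratify(prob, count):
    """Population-weighted average of cell-level predictions."""
    prob = np.asarray(prob, dtype=float)
    count = np.asarray(count, dtype=float)
    return float((prob * count).sum() / count.sum())


district_estimate = poststratify(cell_prob, cell_count)
print(f"Poststratified estimate: {district_estimate:.3f}")
```

The weighting is what corrects for a survey sample whose demographic mix differs from the district's actual population.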

Like any statistical analysis, Mister P's estimates come with unavoidable uncertainty. However, by incorporating relevant state level information (such as vote percentages in previous elections), good quality estimates can be obtained for undersampled states or demographic groups. This is largely the result of the power of hierarchical regression modeling, which partially pools estimates of similar groups. This allows undersampled groups to 'borrow' statistical power from similar groups with more respondents.
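Partial pooling is easiest to see in the simplest normal-normal hierarchical model with known variances, where a group's estimate is a precision-weighted average of its own mean and the overall mean. The sketch below uses that textbook case, not the logistic model in this post, and all of the numbers are hypothetical.

```python
def partially_pooled_mean(group_mean, n, sigma, mu, tau):
    """Posterior mean of a group effect in a normal-normal model.

    group_mean: observed mean of the group's n responses
    sigma:      within-group standard deviation (assumed known)
    mu, tau:    mean and standard deviation across groups (assumed known)
    """
    data_precision = n / sigma**2
    prior_precision = 1.0 / tau**2
    weighted = data_precision * group_mean + prior_precision * mu
    return weighted / (data_precision + prior_precision)


# Same observed mean, very different sample sizes (made-up numbers).
small = partially_pooled_mean(group_mean=0.8, n=5, sigma=1.0, mu=0.5, tau=0.2)
large = partially_pooled_mean(group_mean=0.8, n=500, sigma=1.0, mu=0.5, tau=0.2)
print(f"n=5:   {small:.3f}")
print(f"n=500: {large:.3f}")
```

The group with only 5 respondents is pulled most of the way toward the overall mean, while the group with 500 respondents stays close to its own data.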

Mister P has been attempted for estimating opinions from the CCES survey at the congressional district level, but there are some limitations. In order to do poststratification, we need counts of the number of people in each of the demographic categories used in the regression step. If we considered gender, a few age categories, race, and education, we would need to know how many people there are in each congressional district in every combination of gender, age, race, and education category. This information is available at the state level and can be put together easily using the census microdata. However, age breakdowns at the congressional district level are not available, and further breakdowns by gender, education, and race are only available through the census factfinder for people older than 25. The authors of the congressional district Mister P paper claim that this is not much of a problem, since people under 25 make up a relatively small part of the electorate and other studies have found that, for many issues, age is not too important after controlling for other factors.

Unfortunately, people under 25 make up a significant portion of millennials, who are generally defined as those under 35. I'm also interested specifically in finding out about topics which are polarizing by age. So for the moment, I'll try to see how much can be learned about millennials' opinions from the multilevel regression model alone, without poststratifying.

I'll start with the following CCES item on white privilege, where respondents were asked to rate the statement "White people in the U.S. have certain advantages because of the color of their skin" on a five-point scale: strongly agree, somewhat agree, neither agree nor disagree, somewhat disagree, or strongly disagree. For the purposes of the regression model, I'll combine the five responses into two categories, putting strongly agree and somewhat agree into one category and all other responses into the second.
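The recoding described above can be sketched as follows. The response strings here are hypothetical stand-ins; the raw CCES file may well code this item differently (for example, as integers), so treat the labels as an assumption.

```python
# Responses counted as "agree" (label text is an assumption, not the CCES coding).
AGREE = {"strongly agree", "somewhat agree"}


def binarize(response):
    """Return 1 for strongly/somewhat agree and 0 for every other response."""
    return 1 if response.strip().lower() in AGREE else 0


responses = [
    "Strongly agree",
    "Neither agree nor disagree",
    "Somewhat agree",
    "Strongly disagree",
]
y = [binarize(r) for r in responses]
print(y)  # [1, 0, 1, 0]
```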

This question is polarizing along lines of race, education, and age, so the following individual level indicators and interactions are included in the regression model:

  • female
  • white
  • millennial
  • college grad
  • income under 40k
  • female & millennial
  • female & white
  • white & millennial
  • white & college grad
  • white & income under 40k
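As a rough sketch, the indicators and interactions listed above could be assembled into a design matrix like this. The function name and the two example respondents are hypothetical; each interaction is just the product of the two binary indicators involved.

```python
import numpy as np


def design_matrix(female, white, millennial, college, low_income):
    """Stack the five 0/1 indicators and five pairwise interactions as columns."""
    female, white, millennial, college, low_income = (
        np.asarray(a, dtype=int)
        for a in (female, white, millennial, college, low_income)
    )
    columns = [
        female,
        white,
        millennial,
        college,
        low_income,
        female * millennial,
        female * white,
        white * millennial,
        white * college,
        white * low_income,
    ]
    return np.column_stack(columns)


# Two made-up respondents: a white millennial woman without a degree,
# and a white non-millennial man with a degree and income under 40k.
X = design_matrix(
    female=[1, 0],
    white=[1, 1],
    millennial=[1, 0],
    college=[0, 1],
    low_income=[0, 1],
)
print(X)
```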

These were chosen because, as pointed out in the Vice article, people of color generally agree with this statement a great deal more than white people, but the question is polarizing by age and education among whites. Other indicators may be relevant, and I should note that it is more typical to consider more than just binary demographic indicators for Mister P. This is because small differences in how a particular demographic group leans can lead to potentially large differences in a district or state's overall opinion during the poststratification step.

My initial approach was to use a much more complex set of individual indicators and interactions to estimate state level opinion, along the lines of the approach used in this paper estimating opinions in small demographic subgroups. However, this led to inefficient inferences and large standard errors, which seemed to be the result of correlations between many of the individual level predictors. What I've done is simplify the individual portion of the model while increasing the complexity of the geographic portion by going down to the congressional district level. The logic behind this is that we can incorporate external predictors for the congressional district coefficients which improve partial pooling, while (to my knowledge) the only external information we can incorporate into the model to assist estimation of individual indicators is through the priors. Estimates of opinion at the congressional district level are also relevant in and of themselves. However, I should note that after further iterations of this model, the set of individual level indicators chosen here could prove to be too simple.

The district level predictors are median household income from the Census American Community Survey, Trump vote share (collected by the Daily Kos), and percent Evangelical, which has been collected at the congressional district level by a group of social science researchers. Percent Evangelical may seem like an odd indicator to include for a question about white privilege, but according to previous studies, white Evangelicals are less likely to perceive injustice against minority groups than the rest of the country. An interaction term between millennial and congressional district is also included.

In the congressional district Mister P paper I mentioned above, several geographical levels were used. At the top they include regions, which contain states, which in turn contain congressional districts. Having a hierarchical model is crucial, since that's how we get the partial pooling between districts with similar characteristics that leads to improved estimates. However, I'm not sure that the additional levels above congressional districts are worth including, so I'll offer a different approach.

When we include an indicator at a lower geographic level, it could end up correlating with a higher level geographic indicator. For example, if we included % Evangelical as a district level indicator along with a regional indicator for the South, those regression parameters could end up being correlated, since districts with large Evangelical populations tend to be clustered in the South. Instead, the approach I've taken is to only use the smallest geographic level of interest, in this case the congressional district, and to additionally use a hierarchical Student's T distribution for the congressional district coefficients to make the inferences more robust against outliers.

The T distribution will make our estimates of the congressional district coefficients more robust against outliers, but it adds some computational difficulties as well. First, with a hierarchical model, we estimate a group variance parameter for the district coefficients from the data. This parameter controls the shrinkage, or how much the estimates for the district coefficients get pulled toward the group mean. If the data support a lot of shrinkage, this can lead to biased inferences, because the correlation between the group variance parameter and the individual parameters leads to posterior distributions with regions that are difficult for inference algorithms to explore.

There is a well known solution to this problem for hierarchical normal distributions, which is to use a so-called non-centered parameterization. However, I haven't found or figured out how to do something similar for a hierarchical T. It's possible that this will not be much of a concern for the CCES survey in particular, since every congressional district has a relatively large number of respondents, and therefore we might not expect to see levels of shrinkage that are large enough to cause this particular problem. The distribution of respondents and millennial respondents in each district is shown below; every district is sampled fairly well.

The next issue is deciding on the priors for the standard deviation and normality parameter (or degrees of freedom) for the T distribution. Doing Bayesian Data Analysis has several examples of hierarchical modeling with a T distribution using a uniform prior for the standard deviation and an exponential prior for the normality parameter - 1 (the prior is specified on the normality parameter - 1 to exclude values less than 1). I use the same prior for the normality parameter, but use a half Cauchy prior for the standard deviation, which has good properties as a prior for scale parameters. These two parameters of the T distribution will be correlated, but there were no obvious signs of this being a problem for the Hamiltonian Monte Carlo inference algorithm in PyMC3, and all diagnostics indicate convergence. The scatter plot below shows this correlation. The median posterior estimate for the normality parameter is about 9, indicating that there is some support in the data for outliers.
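For reference, the model described so far can be written out roughly as follows. The notation is my own shorthand rather than anything taken from the source code: $x_i$ holds the individual level indicators and interactions for respondent $i$, $m_i$ is the millennial indicator, $d[i]$ is the respondent's district, and $z_d$ holds the district level predictors. The rate $\lambda$ and scale $s$ are placeholders for prior settings that aren't specified here, and the prior on the district and millennial interaction coefficients $\gamma_d$ is left out of this sketch.

```latex
\begin{aligned}
y_i &\sim \mathrm{Bernoulli}(p_i) \\
\operatorname{logit}(p_i) &= x_i^\top \beta + \alpha_{d[i]} + \gamma_{d[i]}\, m_i \\
\alpha_d &\sim \mathrm{StudentT}\left(\nu,\; z_d^\top \delta,\; \sigma\right) \\
\nu - 1 &\sim \mathrm{Exponential}(\lambda) \\
\sigma &\sim \mathrm{HalfCauchy}(s)
\end{aligned}
```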


While there isn't enough shrinkage applied to the district coefficients to cause problems, the data support a lot more shrinkage for the district & millennial interaction, so I've used a hierarchical normal with a non-centered parameterization on those interaction coefficients. These interaction terms are small relative to the district and individual coefficients. All of this info can be summarized in a trace plot (I ran 3 Markov chains, which have been combined to reduce clutter in the plot).
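The idea behind a non-centered parameterization can be shown with plain NumPy. Instead of drawing each coefficient directly from Normal(mu, sigma), we draw a standard normal offset and then shift and scale it. The two versions describe the same distribution, but in the non-centered form the quantity the sampler explores (the offset) no longer depends on the group scale. The values of `mu`, `sigma`, and `n_districts` below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_districts = 435      # hypothetical number of district coefficients
mu, sigma = 0.0, 0.1   # hypothetical group mean and scale

# Centered: coefficients drawn directly from Normal(mu, sigma).
gamma_centered = rng.normal(mu, sigma, size=n_districts)

# Non-centered: draw standard normal offsets, then shift and scale.
z = rng.normal(0.0, 1.0, size=n_districts)
gamma_noncentered = mu + sigma * z

print(f"centered sd:     {gamma_centered.std():.3f}")
print(f"non-centered sd: {gamma_noncentered.std():.3f}")
```

In a PyMC3 model, the offsets would be the sampled standard normal variables and the coefficients a deterministic transformation of them.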


The regression coefficients for the district level predictors also yield some interesting results: there is a strong negative correlation between agreeing that whites have advantages and the district margin for President Trump in 2016. There is also a negative correlation with district Evangelical population, which is consistent with what we should expect given the findings in previous studies of attitudes on race among white Evangelicals. The correlation with district median income is positive, and roughly the same magnitude as that for percent Evangelical. The posterior mean estimate for each coefficient, along with 50% and 95% credible intervals, is plotted below.


Let's return to the individual level components of the model. Whites are much less likely to agree that white people have advantages. However, among whites, being female, a millennial, and/or college educated all increase the likelihood of agreeing.


Now we'll put everything together and take a look at some more detailed comparisons. With Mister P we can aggregate estimates to any demographic or geographic level that we want after poststratifying, which allows us to summarize the results in a number of interesting ways. However, since this analysis is all Mister and no P due to the lack of poststratification data for different age groups at the congressional district level, our options are a bit more limited. The best we can do is estimate the probability that a person in a given district with a particular set of demographic characteristics will agree. Without poststratification data, we can't combine any groups together since we don't know their relative proportions, and can only look at each group individually. Since there are many possible combinations of demographic characteristics, this can get a bit awkward.
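Here is a minimal sketch of how such a probability could be computed from posterior draws for one demographic profile in one district. The draws below are synthetic stand-ins generated with NumPy, not output from the fitted model, and the coefficient values are made up. The column order matches the list of individual indicators given earlier.

```python
import numpy as np


def inv_logit(x):
    """Map log-odds to probabilities."""
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(1)
n_draws = 4000

# Synthetic "posterior draws" (made-up values, for illustration only).
beta_mean = np.array([0.3, -1.0, 0.2, 0.1, 0.0, 0.1, 0.1, 0.4, 0.6, -0.1])
beta_draws = rng.normal(loc=beta_mean, scale=0.05, size=(n_draws, 10))
alpha_draws = rng.normal(0.1, 0.2, size=n_draws)    # district coefficient
gamma_draws = rng.normal(0.0, 0.05, size=n_draws)   # district x millennial

# White, college educated millennial male: female, white, millennial, college,
# low income, female&millennial, female&white, white&millennial,
# white&college, white&low income.
x = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 0], dtype=float)
millennial = x[2]

# Linear predictor and probability of agreeing, one value per draw.
eta = beta_draws @ x + alpha_draws + gamma_draws * millennial
p = inv_logit(eta)

median = np.median(p)
lo50, hi50 = np.percentile(p, [25, 75])
lo95, hi95 = np.percentile(p, [2.5, 97.5])
print(f"median: {median:.2f}")
print(f"50% interval: ({lo50:.2f}, {hi50:.2f})")
print(f"95% interval: ({lo95:.2f}, {hi95:.2f})")
```

Because every draw is pushed through the inverse logit before summarizing, the intervals reflect uncertainty in both the individual and the district coefficients.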

For the purposes of this analysis, we'll use a white, college educated millennial male as a sort of reference demographic. Most of the groups in the traditional Democratic base are more likely to agree that whites have advantages than white college educated millennial males, but the opinions of this group are very different from those of older and less educated white people. The posterior distributions for the probability of agreeing that white people have advantages are shown below for white college educated millennial males and white non-millennial males with no college degree in New Jersey's 11th district. The degree of polarization is pretty stark; our best estimate is that there's about a 2 in 3 chance that the millennial will agree that white people have advantages, compared to a 1 in 3 chance for the non-millennial. Despite the uncertainty in these estimates, there are clear differences of opinion between these two groups.


High levels of agreement on the existence of white privilege are not limited to New Jersey or even the coasts. Each of the districts shown below has been rated a toss up by the Cook Political Report at some point in the last year and/or is a district targeted by the Democratic Congressional Campaign Committee. The estimated probability of a white college educated millennial male agreeing, ± 1 standard deviation, is shown below for each target district. Note that estimates for white millennial women and people of color will tend to be higher.


Our best estimate is that this group is more likely to agree than not in each of the target districts, which cover a fairly broad geographic area and are not limited to just blue states. Despite the lack of poststratification data and the inherent uncertainties in this type of analysis, it looks like we can still glean useful and interesting information from the regression model alone. For now this is encouraging, and next I'll try to work out ways to improve estimates for district-demographic interactions, so stay tuned for updates!

All the source code for this post can be found on GitHub. The raw CCES survey data is not in the repo since the file is too large, but it can be downloaded here. If you're interested in reading more about Mister P with PyMC3, check out this tutorial from Austin Rochford.