This is the appendix to the safe wins post in our series about gerrymandering. It describes the statistical tests in more technical detail and proposes an additional test for gerrymandering based on a Gaussian mixture model. Any feedback on the efficacy of these tests would be greatly appreciated.
Statistics for skew and spread
The statistic for skew, called C, is defined as follows
The use of this statistic to test for skew was studied by Cabilio and Masaro. C is asymptotically normally distributed, but its variance depends on the underlying distribution F. Cabilio and Masaro showed that, for many practical cases, it is acceptable to use the variance that C has when F is a unit normal distribution. This assumption is used to define the rejection region of the skew test. These papers reached similar conclusions about using the unit-normal variance for other skew statistics.
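Here is a minimal sketch of how such a skew test could be computed. It assumes that C is the difference between the sample mean and the sample median, divided by the sample standard deviation, and that under a normal F the quantity sqrt(n) * C has asymptotic variance pi/2 - 1 (about 0.571). Both are assumptions on my part about the form of the statistic, and the function names and the two-sided p-value are illustrative choices rather than the exact procedure used in the post.

```python
import math
import statistics


def skew_statistic(shares):
    """Assumed form of C: (mean - median) / sample standard deviation."""
    mean = statistics.fmean(shares)
    median = statistics.median(shares)
    return (mean - median) / statistics.stdev(shares)


def skew_z_score(shares):
    """Standardize C using the asymptotic variance it has under a normal F.

    For normal data, sqrt(n) * C is approximately N(0, pi/2 - 1).
    """
    n = len(shares)
    return math.sqrt(n) * skew_statistic(shares) / math.sqrt(math.pi / 2 - 1)


def two_sided_p_value(z):
    """Two-sided tail probability of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))
```

Calling `skew_z_score` on a list of district vote shares gives a z-score that can be compared with a normal critical value. Whether a one-sided or two-sided rejection region is appropriate depends on which direction of skew is being tested.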
DasGupta and Haff showed that the difference between the IQR-to-standard-deviation ratio of a sample and that of its underlying distribution F is asymptotically normally distributed, with a variance that depends on F. Again under the assumption that F is normal, we have (see corollary 3a in DasGupta and Haff)
This relationship is used to define the rejection region for the spread test based on the spread statistic D
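The spread test can be sketched in the same way. The constants below are assumptions that I derived for a normal F and that should be checked against the corollary: the population ratio IQR / sigma is about 1.349, and sqrt(n) * (IQR / s - 1.349) is approximately N(0, 1.566). The standardized form used for D here, the function name, and the choice of quartile estimator are illustrative and may differ from the definition in the post.

```python
import math
import statistics

IQR_OVER_SD_NORMAL = 1.349   # IQR / sigma when F is normal
ASYMPTOTIC_VARIANCE = 1.566  # variance of sqrt(n) * (IQR/s - 1.349) when F is normal


def spread_z_score(shares):
    """Standardized spread statistic (assumed form of D)."""
    n = len(shares)
    # Quartiles by linear interpolation; other quartile conventions
    # give slightly different values on small samples.
    q1, _, q3 = statistics.quantiles(shares, n=4, method="inclusive")
    ratio = (q3 - q1) / statistics.stdev(shares)
    return math.sqrt(n) * (ratio - IQR_OVER_SD_NORMAL) / math.sqrt(ASYMPTOTIC_VARIANCE)
```

A large positive value means the middle half of the districts is more spread out, relative to the standard deviation, than a normal distribution would predict. A large negative value means it is more tightly packed.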
I have not studied whether the assumption of an underlying normal distribution is acceptable as thoroughly as the papers on skew statistics did, so this assumption may be open to criticism. I also wonder whether other measures of spread could be used in place of the IQR. For example, the length of the shorth, the smallest range containing at least half of the data, seems worth looking into. The difference between the length of the shorth of a sample and that of the underlying distribution is also asymptotically normally distributed. I've experimented with this a bit, and the results can be found on GitHub.
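For reference, the length of the shorth is straightforward to compute from a sorted sample. The sketch below takes "at least half of the data" to mean ceil(n / 2) points. Some definitions use floor(n / 2) + 1 points instead, so treat the window size as an assumption. The function name is illustrative and is not taken from the GitHub code.

```python
def shorth_length(values):
    """Length of the shortest interval containing at least half of the data."""
    xs = sorted(values)
    n = len(xs)
    h = (n + 1) // 2  # ceil(n / 2): smallest count that is at least half
    # Slide a window of h consecutive order statistics and keep the narrowest.
    return min(xs[i + h - 1] - xs[i] for i in range(n - h + 1))
```

For example, for the values 1, 2, 2.5, 3, 10 the narrowest window of three points is 2 to 3, so the length of the shorth is 1.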
The Combined Skew and Spread Test
The rejection region for the combined skew and spread test was determined from 50,000 Monte Carlo simulations on a unit normal distribution. Here is a plot of these results for North Carolina
And for California
It would be great to avoid using simulations to define the rejection region, but since the two statistics are not independent, I don't know how to derive their joint distribution.
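The simulation approach described above can be sketched roughly as follows. Several things here are assumptions rather than the procedure actually used: the forms of the two statistics (mean minus median over the standard deviation for skew, IQR over the standard deviation for spread, each standardized by its variance under normality), the use of the number of districts as the sample size, and above all the shape of the rejection region. This sketch ranks points by Mahalanobis distance from the centre of the simulated cloud, which is only one hypothetical way to turn the simulated pairs into a joint test.

```python
import math
import random
import statistics


def skew_and_spread(sample):
    """Return the (skew, spread) z-scores for one sample (assumed forms)."""
    n = len(sample)
    sd = statistics.stdev(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    skew = math.sqrt(n) * (statistics.fmean(sample) - statistics.median(sample)) / sd
    spread = math.sqrt(n) * ((q3 - q1) / sd - 1.349)
    return skew / math.sqrt(math.pi / 2 - 1), spread / math.sqrt(1.566)


def simulate_null(n_districts, n_sims=50_000, seed=0):
    """Simulate (skew, spread) pairs for unit normal samples of a given size."""
    rng = random.Random(seed)
    return [
        skew_and_spread([rng.gauss(0.0, 1.0) for _ in range(n_districts)])
        for _ in range(n_sims)
    ]


def fit_null(null_pairs):
    """Sample mean and 2x2 covariance matrix of the simulated pairs."""
    xs, ys = zip(*null_pairs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    m = len(null_pairs)
    a = sum((x - mx) ** 2 for x in xs) / (m - 1)
    c = sum((y - my) ** 2 for y in ys) / (m - 1)
    b = sum((x - mx) * (y - my) for x, y in null_pairs) / (m - 1)
    return (mx, my), ((a, b), (b, c))


def mahalanobis_sq(point, centre, cov):
    """Squared Mahalanobis distance of a 2-D point from the centre."""
    (a, b), (_, c) = cov
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    return (c * dx * dx - 2 * b * dx * dy + a * dy * dy) / (a * c - b * b)


def empirical_p_value(observed, null_pairs):
    """Fraction of simulated pairs at least as far from the centre as observed."""
    centre, cov = fit_null(null_pairs)
    d_obs = mahalanobis_sq(observed, centre, cov)
    exceed = sum(mahalanobis_sq(p, centre, cov) >= d_obs for p in null_pairs)
    return exceed / len(null_pairs)
```

To use it, call `simulate_null` once per state with that state's number of districts, compute `skew_and_spread` on the observed vote shares, and pass both to `empirical_p_value`. The pure-Python simulation is slow at 50,000 runs, so a smaller `n_sims` is useful while experimenting.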