Wednesday 29 July 2015

Party Vote Demographics: Appendix I - Extra Statistics Chat

The census dataset that I am using is http://www.stats.govt.nz/Census/2013-census/data-tables/electorate-tables.aspx (shout out to NZ Herald Data Editor Harkanwal Singh), which conveniently provides numbers at the electorate level. I can also recommend the data files that Jonathan Marshall produced from that dataset, which are available on GitHub and are much nicer to work with. The vote numbers come from electionresults.org.nz, pulled and processed with some code kindly provided by Chuan-Zheng Lee.

After rejecting some of the less interesting variables provided in the census data (mostly about employment), I was left with only 1815 variables to check. Yes, that’s still a lot of variables. When I initially set the analysis to return correlations that were statistically significant at the 5% level, it returned about 12,000 of them. After talking to Chuan-Zheng I realised that I was dumb and had forgotten that I was actually working with the entire population, where “statistically significant” no longer makes sense because we’re not working with samples. So I dropped statistical significance for individual correlations entirely.

A straight bivariate correlation analysis would return a lot of misleading correlations, because in general, if there are more people in an electorate, there are also more people voting, and more people in any demographic category. To counter this I followed some internet advice and used an equation from Steiger (1980) to determine if there was a statistically significant difference between the two correlations:
- r12: the number of people in the electorate vs the number of votes for a particular party
- r13: the number of votes for a particular party vs the number of people in a particular demographic group
To help ensure that the claims made were strong and unlikely to be explained by the variation in electorate populations, I set the analysis to only return correlations where the difference between r12 and r13 was statistically significant at the 0.1% level.
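For the curious, here is a rough sketch of what that comparison can look like in plain Python (standard library only, in keeping with my NumPy-free stubbornness). It is not my actual code, and it makes some assumptions: I'm using the Fisher-z based Z statistic for two dependent correlations that share a variable, which is one of the tests discussed in Steiger (1980), and I'm treating variable 1 as the party's votes, variable 2 as the electorate population and variable 3 as the demographic count. Note that the test also needs r23 (electorate population vs demographic count). The function names and the example numbers are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences, standard library only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def steiger_z(r12, r13, r23, n):
    """Z statistic and two-sided p-value for the difference between two
    dependent correlations r12 and r13 that share variable 1.

    Assumed mapping (not spelled out in the post):
      1 = votes for a party, 2 = electorate population, 3 = demographic count.
    """
    # Fisher z-transform of the two correlations being compared
    z12 = math.atanh(r12)
    z13 = math.atanh(r13)
    # Pooled estimate of the squared correlation under the null hypothesis
    r_bar_sq = ((r12 + r13) / 2) ** 2
    # Estimated covariance between z12 and z13
    psi = r23 * (1 - 2 * r_bar_sq) - 0.5 * r_bar_sq * (1 - 2 * r_bar_sq - r23 ** 2)
    s = psi / (1 - r_bar_sq) ** 2
    z = (z12 - z13) * math.sqrt(n - 3) / math.sqrt(2 - 2 * s)
    # Two-sided p-value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Made-up correlations for 71 electorates, purely to show the call
z, p = steiger_z(r12=0.5, r13=0.2, r23=0.3, n=71)
keep = p < 0.001  # the 0.1% cut-off described above
```

With SciPy installed you would not need to hand-roll any of this, but the arithmetic is short enough to write out.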

Additionally, any correlations with an r between -0.1 and 0.1 were removed and analysed separately: they are so close to 0 that there is likely no real relationship between the two variables (the result may be statistically significant, but it’s not all that interesting for most of what we’re looking at here).
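As a small illustration of that split (again a sketch, not my actual code), assuming the results are held as hypothetical (label, r) pairs and that "between" excludes the end points:

```python
def split_by_strength(results, threshold=0.1):
    """Split (label, r) pairs into (notable, near_zero) lists.

    A correlation counts as near zero when -threshold < r < threshold.
    """
    notable, near_zero = [], []
    for label, r in results:
        if abs(r) < threshold:
            near_zero.append((label, r))
        else:
            notable.append((label, r))
    return notable, near_zero

# Made-up labels and values, for illustration only
example = [("group A", 0.05), ("group B", -0.4), ("group C", 0.1)]
notable, near_zero = split_by_strength(example)
```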

I should probably note somewhere (and here is as good a place as any) that the sample size in most cases was 71 (all the general electorates + Maori electorates), except for the immigrant data which was not available for the Maori electorates (and thus the sample size was reduced to 64).

Where I’ve used r≈ instead of r=, it’s because I’ve actually combined a couple of correlations for ease of communication. For example, “people earning $70,001 or more” is actually “people earning $70,001-$100,000, people earning $100,001-$150,000, and people earning $150,001 or more”, but I didn’t want to manually group that data because hey, I got hungry and needed time to make dinner. It’s an approximation of the strength of relationship at least, and I guess is intended to be more directional than accurate magnitudinally (magnitude-wise? in terms of magnitude?).

Everything was done in Python (without the use of NumPy or SciPy because as it turns out I would rather spend a few hours torturing myself trying to figure out how to implement the algorithms from scratch than spend a few minutes installing some commonly used modules). In retrospect I should have just pulled out R. Fun (questionable) fact: the number of R User Group meetings per month worldwide is (on average) increasing at a rate of 0.6 meetings per month since November 2008.
