A simple set of functions to implement the Data Defect Index (d.d.i.), described in:
Xiao-Li Meng. 2018. “Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election.” Annals of Applied Statistics 12(2): 685–726. doi:10.1214/18-AOAS1161SF.
# install.packages("remotes")
remotes::install_github("kuriwaki/ddi")

Given a data frame with columns for a group’s estimates and the components
of the formula, ddc computes the data defect correlation
(ρ).
An example dataset from the 2016 US Presidential Election is included
(this also serves as the replication dataset for the AOAS article). The
dataset compares official election results with estimates from the
Cooperative Congressional Election Study (CCES), the largest political
survey in the US. The CCES micro-data is fully public and accessible at
its website. Here, we
provide state-level estimates, which are documented in
help(g2016).
library(ddi)
library(tidyverse)
data(g2016)
g2016

## # A tibble: 51 x 10
##    state st    pct_djt_voters cces_pct_djt_vv cces_pct_djtrun… votes_djt
##    <chr> <chr>          <dbl>           <dbl>            <dbl>     <dbl>
##  1 Alab… AL            0.621           0.408            0.428    1318255
##  2 Alas… AK            0.513           0.306            0.319     163387
##  3 Ariz… AZ            0.487           0.423            0.445    1252401
##  4 Arka… AR            0.606           0.416            0.434     684872
##  5 Cali… CA            0.316           0.285            0.305    4483810
##  6 Colo… CO            0.433           0.350            0.371    1202484
##  7 Conn… CT            0.409           0.294            0.318     673215
##  8 Dela… DE            0.419           0.329            0.349     185127
##  9 Dist… DC            0.0409          0.0575           0.0690     12723
## 10 Flor… FL            0.490           0.403            0.422    4617886
## # … with 41 more rows, and 4 more variables: tot_votes <dbl>, cces_n_vv <dbl>,
## #   vap <dbl>, vep <dbl>

We can compute the data defect correlation just by plugging in some numbers. For example:
ddc(mu = 62984824/136639786, muhat = 12284/35829, N = 136639786, n = 35829)

## [1] -0.003837163

The d.d.i. is the square of that, about 0.0000147.
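For intuition, the calculation can be sketched in base R. This assumes the binary-outcome form of Meng's identity, muhat - mu = ρ × sqrt((N - n) / n) × sqrt(mu × (1 - mu)), solved for ρ. The helper name ddc_by_hand is hypothetical and is not part of the package; the package's own ddc may be implemented differently.

```r
# Hypothetical helper (not part of the ddi package): data defect correlation
# for a binary outcome, solved from the identity
#   muhat - mu = rho * sqrt((N - n) / n) * sqrt(mu * (1 - mu))
ddc_by_hand <- function(mu, muhat, N, n) {
  sigma <- sqrt(mu * (1 - mu))             # population SD of a 0/1 outcome
  sqrt_dropout_odds <- sqrt((N - n) / n)   # sqrt((1 - f) / f), with f = n / N
  (muhat - mu) / (sigma * sqrt_dropout_odds)
}

# Same national inputs as the ddc() call above
rho <- ddc_by_hand(mu = 62984824 / 136639786, muhat = 12284 / 35829,
                   N = 136639786, n = 35829)
rho      # data defect correlation
rho^2    # d.d.i.
```

With the national inputs this gives roughly -0.0038, in line with the ddc output above. Because the arithmetic is elementwise, the sketch also accepts vectors of equal length.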
We got these national totals by summing over states:
select(g2016, cces_pct_djt_vv, cces_n_vv, tot_votes, votes_djt) %>%
  summarize_all(sum)

## # A tibble: 1 x 4
##   cces_pct_djt_vv cces_n_vv tot_votes votes_djt
##             <dbl>     <dbl>     <dbl>     <dbl>
## 1            17.5     35829 136639786  62984824

where
cces_totdjt_vv: The count of Trump voters (among validated voters)
cces_n_vv: The count of CCES validated voters (sample size)
votes_djt: Total votes for Trump
tot_votes: Total turnout
cces_pct_djt_vv: Estimated vote share, cces_totdjt_vv / cces_n_vv
pct_djt_voters: Actual vote share, votes_djt / tot_votes

The function also takes vectors as inputs:
with(g2016, ddc(mu = pct_djt_voters,
                muhat = cces_pct_djt_vv, 
                N = tot_votes, 
                n = cces_n_vv))

##  [1] -0.0059541279 -0.0062341071 -0.0023488019 -0.0061097707 -0.0009864919
##  [6] -0.0025746344 -0.0035362241 -0.0033951165  0.0014015382 -0.0029747918
## [11] -0.0038228152 -0.0001757426 -0.0073716139 -0.0036437192 -0.0069956521
## [16] -0.0058255411 -0.0059093759 -0.0057837854 -0.0040533230 -0.0047893714
## [21] -0.0024905368 -0.0028280876 -0.0050296619 -0.0043292576 -0.0056626724
## [26] -0.0069305025 -0.0046563153 -0.0075840944 -0.0047785897 -0.0037497506
## [31] -0.0028289070 -0.0025619899 -0.0031936586 -0.0051968951 -0.0078308914
## [36] -0.0057088185 -0.0065654840 -0.0030642004 -0.0039137353 -0.0039907269
## [41] -0.0040871158 -0.0069019981 -0.0050741833 -0.0044884762 -0.0059634270
## [46] -0.0034491625 -0.0040918085 -0.0024121681 -0.0075404659 -0.0051378753
## [51] -0.0086086072

so it can be used inside a tibble pipeline as well:
transmute(g2016, st,
          ddc = ddc(mu = pct_djt_voters, 
                    muhat = cces_pct_djt_vv, 
                    N = tot_votes,
                    n = cces_n_vv))

## # A tibble: 51 x 2
##    st          ddc
##    <chr>     <dbl>
##  1 AL    -0.00595 
##  2 AK    -0.00623 
##  3 AZ    -0.00235 
##  4 AR    -0.00611 
##  5 CA    -0.000986
##  6 CO    -0.00257 
##  7 CT    -0.00354 
##  8 DE    -0.00340 
##  9 DC     0.00140 
## 10 FL    -0.00297 
## # … with 41 more rows

A negative ρ means ρ = Cor(Respond, 1(Trump Supporter)) < 0, i.e. Trump supporters were less likely to respond.
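To make the correlation reading concrete, here is a small base R sketch on a toy population. All numbers are made up for illustration and are unrelated to the CCES data. It checks that the plain correlation between a response indicator and a supporter indicator matches the value obtained from the estimation error, assuming the binary-outcome form of Meng's identity.

```r
# Toy population (made-up numbers): 1000 people, half of them supporters.
y <- c(rep(1, 500), rep(0, 500))            # 1 = supporter, 0 = not
# Response indicator: 20 of 500 supporters respond, 40 of 500 others do.
r <- c(rep(1, 20), rep(0, 480), rep(1, 40), rep(0, 460))

mu    <- mean(y)          # population share of supporters
muhat <- mean(y[r == 1])  # share of supporters among respondents
f     <- mean(r)          # sampling fraction n / N

# Correlation computed directly, and via the estimation error
rho_cor     <- cor(r, y)
rho_formula <- (muhat - mu) / (sqrt(mu * (1 - mu)) * sqrt((1 - f) / f))

rho_cor                          # negative: supporters respond less often
all.equal(rho_cor, rho_formula)  # the two computations agree
```

Here supporters are underrepresented among respondents, so the sample share falls below the population share and the correlation is negative, mirroring the pattern seen in most states above.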