Title: Distributional Synthetic Controls Estimation
Version: 0.1.1
Description: The method of synthetic controls is a widely-adopted tool for evaluating causal effects of policy changes in settings with observational data. In many settings where it is applicable, researchers want to identify causal effects of policy changes on a treated unit at an aggregate level while having access to data at a finer granularity. This package implements a simple extension of the synthetic controls estimator, developed in Gunsilius (2023) <doi:10.3982/ECTA18260>, that takes advantage of this additional structure and provides nonparametric estimates of the heterogeneity within the aggregate unit. The idea is to replicate the quantile function associated with the treated unit by a weighted average of quantile functions of the control units. The package contains tools for aggregating and plotting the resulting distributional estimates, as well as for carrying out inference on them.
License: MIT + file LICENSE
BugReports: https://github.com/Davidvandijcke/DiSCos/issues
URL: http://www.davidvandijcke.com/DiSCos/, https://github.com/Davidvandijcke/DiSCos
LazyData: TRUE
Imports: CVXR, pracma, Rdpack, parallel, evmix, utils, extremeStat, MASS
Depends: data.table, R (≥ 2.10), ggplot2
RdMacros: Rdpack
Suggests: haven, latex2exp, knitr, rmarkdown, maps, testthat (≥ 3.0.0), quadprog
Encoding: UTF-8
RoxygenNote: 7.2.2
VignetteBuilder: knitr
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2024-07-23 03:12:09 UTC; davidvandijcke
Author: David Van Dijcke [aut, cre], Florian Gunsilius [aut], Siyun He [aut]
Maintainer: David Van Dijcke <dvdijcke@umich.edu>
Repository: CRAN
Date/Publication: 2024-07-23 03:30:03 UTC

Distributional Synthetic Controls

Description

This function implements the distributional synthetic controls (DiSCo) method from Gunsilius (2023), as well as the alternative mixture of distributions approach.

Usage

DiSCo(
  df,
  id_col.target,
  t0,
  M = 1000,
  G = 1000,
  num.cores = 1,
  permutation = FALSE,
  q_min = 0,
  q_max = 1,
  CI = FALSE,
  boots = 500,
  replace = TRUE,
  uniform = FALSE,
  cl = 0.95,
  graph = FALSE,
  qmethod = NULL,
  qtype = 7,
  seed = NULL,
  simplex = FALSE,
  mixture = FALSE,
  grid.cat = NULL
)

Arguments

df

Data frame or data table containing the distributional data for the target and control units. The data table should contain the following columns:

  • y_col A numeric vector containing the outcome variable for each unit. Units can be individuals, states, etc., but they should be nested within a larger unit (e.g. individuals or counties within a state)

  • id_col A numeric vector containing the aggregate IDs of the units. This could be, for example, the state if the units are counties or individuals

  • time_col A vector containing the time period of the observation for each unit. This should be a monotonically increasing integer.
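A minimal sketch of the long-format input described above, built from simulated data. The column names y_col, id_col and time_col follow the argument description; the number of units, the number of periods, and the outcome distribution are assumptions made purely for illustration.

```r
# Simulated long-format data: 3 aggregate units, 4 periods,
# 50 micro-level observations per unit and period (all hypothetical).
set.seed(1)
units   <- 1:3
periods <- 1:4
n_obs   <- 50

df <- expand.grid(obs = seq_len(n_obs), id_col = units, time_col = periods)
df$y_col <- rnorm(nrow(df), mean = df$id_col)  # outcome of each micro-level unit
df$obs   <- NULL                               # helper index is no longer needed

# One row per micro-level observation, nested within id_col and time_col
head(df)
table(df$id_col, df$time_col)
```

A data frame of this shape could then be passed as df, with id_col.target set to one of the values in id_col and t0 set to the treatment period.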

id_col.target

Variable indicating the name of the target unit, as specified in the id_col column of the data table. This variable can be any type, as long as it is the same type as the id_col column of the data table.

t0

Integer indicating period of treatment.

M

Integer indicating the number of control quantiles to use in the DiSCo method. Default is 1000.

G

Integer indicating the number of grid points for the grid on which the estimated functions are evaluated. Default is 1000.

num.cores

Integer, number of cores to use for parallel computation. Default is 1. If the permutation or CI arguments are set to TRUE, this can be slow and it is recommended to set this to 4 or more, if possible. If you get an error in "all cores" or similar, try setting num.cores=1 to see the precise error value.

permutation

Logical, indicating whether to use the permutation method for computing the optimal weights. Default is FALSE.

q_min

Numeric, minimum quantile to use. Set this together with q_max to restrict the range of quantiles used to construct the synthetic control. Default is 0 (all quantiles). Currently NOT implemented for the mixture approach.

q_max

Numeric, maximum quantile to use. Set this together with q_min to restrict the range of quantiles used to construct the synthetic control. Default is 1 (all quantiles). Currently NOT implemented for the mixture approach.

CI

Logical, indicating whether to compute confidence intervals for the counterfactual quantiles. Default is FALSE. The confidence intervals are computed using the bootstrap procedure described in Van Dijcke et al. (2024).

boots

Integer, number of bootstrap samples to use for computing confidence intervals. Default is 500.

replace

Logical, indicating whether to sample with replacement when computing the bootstrap samples. Default is TRUE.

uniform

Logical, indicating whether to construct uniform bootstrap confidence intervals. Default is FALSE. If FALSE, the confidence intervals are pointwise.
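The difference between pointwise and uniform bands can be sketched in base R as follows. This is a generic illustration on a single simulated sample, not the package's bootstrap procedure: the data, the grid, and the sup-deviation construction of the uniform band are all assumptions.

```r
set.seed(1)
x    <- rnorm(200)                        # hypothetical sample
grid <- seq(0.1, 0.9, by = 0.1)           # quantile levels
est  <- quantile(x, grid, names = FALSE)  # point estimate of the quantile function
B    <- 500                               # number of bootstrap samples
cl   <- 0.95                              # confidence level

# Bootstrap the quantile function by resampling with replacement (B x G matrix)
boot <- t(replicate(B, quantile(sample(x, replace = TRUE), grid, names = FALSE)))

# Pointwise band: a separate percentile interval at every grid point
alpha    <- 1 - cl
pw_lower <- apply(boot, 2, quantile, probs = alpha / 2)
pw_upper <- apply(boot, 2, quantile, probs = 1 - alpha / 2)

# Uniform band (one common construction): a single critical value taken
# from the largest absolute deviation over the whole grid
dev      <- abs(sweep(boot, 2, est))
crit     <- quantile(apply(dev, 1, max), probs = cl, names = FALSE)
un_lower <- est - crit
un_upper <- est + crit
```

The pointwise band controls coverage at each quantile level separately, while the uniform band is meant to cover the whole function at once.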

cl

Numeric, confidence level for the (two-sided) confidence intervals. Default is 0.95.

graph

Logical, indicating whether to plot the permutation graph as in Figure 3 of the paper. Default is FALSE.

qmethod

Character, indicating the method to use for computing the quantiles of the target distribution. The default is NULL, which uses the quantile function from the stats package. Other options are "qkden" (based on smoothed kernel density function) and "extreme" (based on parametric extreme value distributions). Both are substantially slower than the default method but may be useful for fat-tailed distributions with few data points at the upper quantiles. Alternatively, one could use the q_max option to restrict the range of quantiles used.

qtype

Integer, indicating the type of quantile to compute when using quantile in the qmethod argument. The default is 7. See the documentation for the quantile function for more information.
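As a small illustration of what the quantile type changes, the following compares two of the types offered by stats::quantile on a toy vector. The vector x is an assumption chosen so that the two types disagree.

```r
x <- c(1, 2, 3, 4)

# Type 7 (the default of stats::quantile) interpolates linearly
q7 <- quantile(x, probs = 0.5, type = 7, names = FALSE)

# Type 1 is the inverse of the empirical CDF, with no interpolation
q1 <- quantile(x, probs = 0.5, type = 1, names = FALSE)

c(type7 = q7, type1 = q1)
```

Here type 7 returns 2.5 and type 1 returns 2, so the choice matters most for small samples or discrete outcomes.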

seed

Integer, seed for the random number generator. This needs to be set explicitly in the function call, since it will invoke RNGkind which will set the seed for each core when using parallel processes. Default is NULL, which does not set a seed.

simplex

Logical, indicating whether to constrain the optimal weights to the unit simplex. Default is FALSE, which only constrains the weights to sum up to 1 but allows them to be negative.

mixture

Logical, indicating whether to use the mixture of distributions approach instead. See Section 4.3. in Gunsilius (2023). This approach minimizes the distance between the CDFs instead of the quantile functions, and is preferred for categorical variables. When working with such variables, one should also provide a list of support points in the grid.cat parameter. When that is provided, this parameter is automatically set to TRUE. Default is FALSE.

grid.cat

List, containing the discrete support points for a discrete grid to be used with the mixture of distributions approach. This is useful for constructing synthetic distributions for categorical variables. Default is NULL, which uses a continuous grid based on the other parameters.
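The mixture idea on a discrete grid can be sketched as a weighted average of empirical CDFs evaluated at the support points. The data, the weights, and the name grid_cat below are hypothetical; in the package the weights are estimated rather than fixed.

```r
set.seed(1)
# Hypothetical categorical outcomes (categories 1 to 5) for two control units
controls <- list(
  sample(1:5, 200, replace = TRUE, prob = c(0.4, 0.3, 0.1, 0.1, 0.1)),
  sample(1:5, 200, replace = TRUE, prob = c(0.1, 0.1, 0.1, 0.3, 0.4))
)
weights  <- c(0.5, 0.5)  # assumed weights, summing to 1
grid_cat <- 1:5          # discrete support points

# Rows: support points; columns: empirical CDF of each control unit
cdf_mat <- sapply(controls, function(x) ecdf(x)(grid_cat))

# Mixture CDF: weighted average of the control CDFs
mix_cdf <- as.vector(cdf_mat %*% weights)
```

Averaging CDFs in this way keeps the result on the same discrete support, which is why this approach suits categorical variables.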

Details

For every time period, the DiSCo function calls DiSCo_iter, which implements the DiSCo method for a single time period, as well as the mixture of distributions approach. The corresponding results for each time period can be accessed in the results.periods list of the output of the DiSCo function. The DiSCo function returns the average weight for each unit across all periods, calculated as a uniform mean, as well as the counterfactual target distribution produced as the weighted average of the control distributions for each period, using these averaged weights.

Value

A list containing the following elements:

References

Gunsilius FF (2023). “Distributional synthetic controls.” Econometrica, 91(3), 1105–1117.

Van Dijcke D, Gunsilius F, Wright AL (2024). “Return to Office and the Tenure Distribution.” Working Paper 2024-56, University of Chicago, Becker Friedman Institute for Economics.


Store aggregated treatment effects

Description

S3 object holding aggregated treatment effects

Usage

DiSCoT(
  agg,
  treats,
  ses,
  grid,
  ci_lower,
  ci_upper,
  t0,
  call,
  cl,
  N,
  J,
  agg_df,
  perm,
  plot
)

Arguments

agg

aggregation method

treats

list of treatment effects

ses

list of standard errors

grid

grid

ci_lower

list of lower confidence intervals

ci_upper

list of upper confidence intervals

t0

start time

call

call

cl

confidence level

N

number of observations

J

number of treated units

agg_df

dataframe of aggregated treatment effects and their confidence intervals

perm

list of permutation results

plot

a ggplot object containing the plot for the aggregated treatment effects using the agg parameter

Value

S3 object of class DiSCoT with associated summary and print methods


Aggregate treatment effects from DiSCo function.

Description

Function to aggregate treatment effects from the output of the DiSCo function, plot the distribution of the aggregation statistic over time, and report summary tables.

Usage

DiSCoTEA(
  disco,
  agg = "quantileDiff",
  graph = TRUE,
  t_plot = NULL,
  savePlots = FALSE,
  xlim = NULL,
  ylim = NULL,
  samples = c(0.25, 0.5, 0.75)
)

Arguments

disco

Output of the DiSCo function.

agg

String indicating the aggregation statistic to be used. Options include

  • quantileDiff Difference in quantiles between the target and the weighted average of the controls.

  • quantile Plots both the observed and the counterfactual quantile functions. No summary statistics will be produced.

  • cdfDiff Difference in CDFs between the target and the weighted average of the controls.

  • cdf Plots both the observed and the counterfactual CDFs. No summary statistics will be produced.
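The two difference statistics can be sketched in base R as follows. The target and counterfactual samples are simulated, the counterfactual is represented by raw draws for simplicity, and the sign convention (target minus counterfactual) is an assumption.

```r
set.seed(1)
target         <- rnorm(500, mean = 1)  # hypothetical observed target outcomes
counterfactual <- rnorm(500, mean = 0)  # hypothetical synthetic-control draws

evgrid <- seq(0.01, 0.99, by = 0.01)    # quantile levels
ygrid  <- seq(-3, 4, length.out = 50)   # outcome values for the CDFs

# quantileDiff: difference of the quantile functions on evgrid
quantile_diff <- quantile(target, evgrid, names = FALSE) -
  quantile(counterfactual, evgrid, names = FALSE)

# cdfDiff: difference of the empirical CDFs on ygrid
cdf_diff <- ecdf(target)(ygrid) - ecdf(counterfactual)(ygrid)
```

A rightward shift of the target shows up as positive quantile differences and negative CDF differences.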

graph

Boolean indicating whether to plot graphs (default is TRUE).

t_plot

Optional vector of time periods (t_col values in the original dataframe) to be plotted (default is NULL, which plots all time periods).

savePlots

Boolean indicating whether to save the plots to the current working directory (default is FALSE). The plot names will be [agg]_[start_year]_[end_year].pdf.

xlim

Optional vector of length 2 indicating the x-axis limits of the plot. Useful for zooming in on relevant parts of the distribution for fat-tailed distributions.

ylim

Optional vector of length 2 indicating the y-axis limits of the plot.

samples

Numeric vector indicating the range of quantiles of the aggregation statistic (agg) to be summarized in the summary property of the S3 class returned by the function (default is c(0.25, 0.5, 0.75)). For example, if samples = c(0.25, 0.5, 0.75), the summary table will include the average effect for the 0-25th, 25-50th, 50-75th and 75-100th quantiles of the distribution of the aggregation statistic over time.
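The binning implied by samples can be sketched as follows. The effect vector is a hypothetical stand-in for the aggregation statistic in one period, and the use of cut and tapply is an illustrative choice rather than the package's implementation.

```r
grid    <- seq(from = 0, to = 1, length.out = 101)  # quantile levels
effect  <- 2 * grid                                 # hypothetical quantile differences
samples <- c(0.25, 0.5, 0.75)

# Assign each grid point to one of the ranges delimited by samples
bins <- cut(grid, breaks = c(0, samples, 1), include.lowest = TRUE)

# Average effect within the 0-25th, 25-50th, 50-75th and 75-100th ranges
avg_effect <- tapply(effect, bins, mean)
avg_effect
```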

Details

This function takes the output of the DiSCo function and computes aggregate treatment effects using a user-specified aggregation statistic. The default is the difference between the counterfactual and the observed quantile functions (quantileDiff). If graph is set to TRUE, the function will plot the distribution of the aggregation statistic over time. The S3 class returned by the function has a summary property that will print a selection of aggregated effects (specified by the samples parameter) for the chosen agg method, by post-treatment year (see examples below). This summary call will only print effects if the agg parameter requested a distribution difference (quantileDiff or cdfDiff). The other aggregations are meant to be inspected visually. If the permutation parameter was set to TRUE in the original DiSCo call, the summary table will include the results of the permutation test. If the original DiSCo call was restricted to a range of quantiles smaller than [0,1] (i.e. q_min > 0 or q_max < 1), the samples parameter is ignored and only the aggregated differences for the quantile range specified in the original call are returned.

Value

A DiSCoT object, which is an S3 class that stores a list of treatment effects, their standard errors, the corresponding confidence intervals (if specified), and a dataframe with treatment effects aggregated according to the agg input. The S3 class also has a summary property that will print a selection of aggregated effects (specified by the samples parameter) for the chosen agg method, by post-treatment year, as well as the permutation test results, if specified.


DiSCo_CI

Description

Function for computing the confidence intervals in the DiSCo method using the bootstrap approach described in Van Dijcke et al. (2024).

Usage

DiSCo_CI(
  redraw,
  controls,
  target,
  T_max,
  T0,
  grid,
  mc.cores = 1,
  evgrid = seq(from = 0, to = 1, length.out = 1001),
  qmethod = NULL,
  qtype = 7,
  M = 1000,
  mixture = FALSE,
  simplex = FALSE,
  replace = TRUE
)

Arguments

redraw

Integer indicating the current bootstrap redraw

controls

A list containing the raw data for the control group

target

A list containing the raw data for the target group

T_max

Index of last time period

T0

Index of the last pre-treatment period

grid

Grid to recompute the CDF on if mixture option is chosen

mc.cores

Number of cores to use for parallelization

qmethod

Character, indicating the method to use for computing the quantiles of the target distribution. The default is NULL, which uses the quantile function from the stats package. Other options are "qkden" (based on smoothed kernel density function) and "extreme" (based on parametric extreme value distributions). Both are substantially slower than the default method but may be useful for fat-tailed distributions with few data points at the upper quantiles. Alternatively, one could use the q_max option to restrict the range of quantiles used.

qtype

Integer, indicating the type of quantile to compute when using quantile in the qmethod argument. The default is 7. See the documentation for the quantile function for more information.

M

Integer indicating the number of control quantiles to use in the DiSCo method. Default is 1000.

mixture

Logical, indicating whether to use the mixture of distributions approach instead. See Section 4.3. in Gunsilius (2023). This approach minimizes the distance between the CDFs instead of the quantile functions, and is preferred for categorical variables. When working with such variables, one should also provide a list of support points in the grid.cat parameter. When that is provided, this parameter is automatically set to TRUE. Default is FALSE.

simplex

Logical, indicating whether to constrain the optimal weights to the unit simplex. Default is FALSE, which only constrains the weights to sum up to 1 but allows them to be negative.

replace

Logical, indicating whether to sample with replacement when computing the bootstrap samples. Default is TRUE.

Value

A list with the following components


DiSCo_CI_iter

Description

Function for computing the confidence intervals in the DiSCo method in a single period

Usage

DiSCo_CI_iter(
  t,
  controls_t,
  target_t,
  grid,
  T0,
  M = 1000,
  evgrid = seq(from = 0, to = 1, length.out = 1001),
  qmethod = NULL,
  qtype = 7,
  mixture = FALSE,
  simplex = FALSE,
  replace = TRUE
)

Arguments

t

Time period

controls_t

List of control unit data for given period

target_t

List of target unit data for given period

grid

Grid to recompute the CDF on if mixture option is chosen

T0

Index of the last pre-treatment period

M

Integer indicating the number of control quantiles to use in the DiSCo method. Default is 1000.

qmethod

Character, indicating the method to use for computing the quantiles of the target distribution. The default is NULL, which uses the quantile function from the stats package. Other options are "qkden" (based on smoothed kernel density function) and "extreme" (based on parametric extreme value distributions). Both are substantially slower than the default method but may be useful for fat-tailed distributions with few data points at the upper quantiles. Alternatively, one could use the q_max option to restrict the range of quantiles used.

qtype

Integer, indicating the type of quantile to compute when using quantile in the qmethod argument. The default is 7. See the documentation for the quantile function for more information.

mixture

Logical, indicating whether to use the mixture of distributions approach instead. See Section 4.3. in Gunsilius (2023). This approach minimizes the distance between the CDFs instead of the quantile functions, and is preferred for categorical variables. When working with such variables, one should also provide a list of support points in the grid.cat parameter. When that is provided, this parameter is automatically set to TRUE. Default is FALSE.

simplex

Logical, indicating whether to constrain the optimal weights to the unit simplex. Default is FALSE, which only constrains the weights to sum up to 1 but allows them to be negative.

replace

Logical, indicating whether to sample with replacement when computing the bootstrap samples. Default is TRUE.

Value

The resampled counterfactual barycenter of the target unit


Function for computing barycenters in the DiSCo method at every time period

Description

Compute barycenters in the DiSCo method at every time period, as in Definition 1, Step 4 in Gunsilius (2023).

Usage

DiSCo_bc(controls.q, weights, evgrid = seq(from = 0, to = 1, length.out = 101))

Arguments

controls.q

List with matrices of control quantile functions

weights

Vector of optimal synthetic control weights, computed using the DiSCo_weights_reg function.

Value

The quantile function of the barycenter associated with the "weights" evaluated at the vector "evgrid"
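The barycenter computation can be sketched as a weighted average of control quantile functions on evgrid. The control samples and weights below are hypothetical, and the sketch works from raw outcomes rather than the list of matrices that DiSCo_bc expects.

```r
set.seed(1)
# Hypothetical raw outcomes for two control units
controls <- list(rnorm(300, mean = 0), rnorm(300, mean = 2))
weights  <- c(0.25, 0.75)  # assumed weights, summing to 1
evgrid   <- seq(from = 0, to = 1, length.out = 101)

# Columns: quantile function of each control unit evaluated on evgrid
controls_q <- sapply(controls, quantile, probs = evgrid, names = FALSE)

# Barycenter: weighted average of the control quantile functions
bc_q <- as.vector(controls_q %*% weights)
```

With non-negative weights the result is again a non-decreasing function of the quantile level, so it is a valid quantile function.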

References

Gunsilius FF (2023). “Distributional synthetic controls.” Econometrica, 91(3), 1105–1117.


Estimate DiSCo in a single period

Description

This function implements the DiSCo method for a single time period, as well as the mixture of distributions approach. Its return values contain valuable period-specific estimation outputs.

Usage

DiSCo_iter(
  yy,
  df,
  evgrid,
  id_col.target,
  M,
  G,
  T0,
  qmethod = NULL,
  qtype = 7,
  q_min = 0,
  q_max = 1,
  simplex = FALSE,
  controls.id,
  grid.cat,
  mixture
)

Arguments

yy

Integer indicating the current year being processed.

df

Data frame or data table containing the distributional data for the target and control units. The data table should contain the following columns:

  • y_col A numeric vector containing the outcome variable for each unit. Units can be individuals, states, etc., but they should be nested within a larger unit (e.g. individuals or counties within a state)

  • id_col A numeric vector containing the aggregate IDs of the units. This could be, for example, the state if the units are counties or individuals

  • time_col A vector containing the time period of the observation for each unit. This should be a monotonically increasing integer.

evgrid

A vector of grid points on which to evaluate the quantile functions.

id_col.target

Variable indicating the name of the target unit, as specified in the id_col column of the data table. This variable can be any type, as long as it is the same type as the id_col column of the data table.

M

Integer indicating the number of control quantiles to use in the DiSCo method. Default is 1000.

G

Integer indicating the number of grid points for the grid on which the estimated functions are evaluated. Default is 1000.

T0

Integer indicating the last pre-treatment period starting from 1.

qmethod

Character, indicating the method to use for computing the quantiles of the target distribution. The default is NULL, which uses the quantile function from the stats package. Other options are "qkden" (based on smoothed kernel density function) and "extreme" (based on parametric extreme value distributions). Both are substantially slower than the default method but may be useful for fat-tailed distributions with few data points at the upper quantiles. Alternatively, one could use the q_max option to restrict the range of quantiles used.

qtype

Integer, indicating the type of quantile to compute when using quantile in the qmethod argument. The default is 7. See the documentation for the quantile function for more information.

q_min

Numeric, minimum quantile to use. Set this together with q_max to restrict the range of quantiles used to construct the synthetic control. Default is 0 (all quantiles). Currently NOT implemented for the mixture approach.

q_max

Numeric, maximum quantile to use. Set this together with q_min to restrict the range of quantiles used to construct the synthetic control. Default is 1 (all quantiles). Currently NOT implemented for the mixture approach.

simplex

Logical, indicating whether to constrain the optimal weights to the unit simplex. Default is FALSE, which only constrains the weights to sum up to 1 but allows them to be negative.

controls.id

List of strings specifying the column names for the control units' identifiers.

grid.cat

List, containing the discrete support points for a discrete grid to be used with the mixture of distributions approach. This is useful for constructing synthetic distributions for categorical variables. Default is NULL, which uses a continuous grid based on the other parameters.

mixture

Logical, indicating whether to use the mixture of distributions approach instead. See Section 4.3. in Gunsilius (2023). This approach minimizes the distance between the CDFs instead of the quantile functions, and is preferred for categorical variables. When working with such variables, one should also provide a list of support points in the grid.cat parameter. When that is provided, this parameter is automatically set to TRUE. Default is FALSE.

Details

This function is part of the DiSCo method, called for each time period. It calculates the optimal weights for the DiSCo method and the mixture of distributions approach for a single time period. The function processes data for both the target and control units, computes the quantile functions, and evaluates these on a specified grid. The function is designed to be used within the broader context of the DiSCo function, which aggregates results across multiple time periods.

Value

A list with the following elements:


DiSCo_mixture

Description

The alternative mixture of distributions approach in the paper

Usage

DiSCo_mixture(controls1, target, grid.min, grid.max, grid.rand, M, simplex)

Arguments

controls1

A list of controls

target

The target unit

grid.min

Minimal value of the grid on which the CDFs are evaluated.

grid.max

Maximal value of the grid on which the CDFs are evaluated.

grid.rand

Random grid on which the CDFs are evaluated.

M

Integer indicating the number of control quantiles to use in the DiSCo method. Default is 1000.

simplex

Logical, indicating whether to constrain the optimal weights to the unit simplex. Default is FALSE, which only constrains the weights to sum up to 1 but allows them to be negative.

Value

A list containing the following elements:


DiSCo_mixture_solve

Description

The solver for the alternative mixture of distributions approach in the paper

Usage

DiSCo_mixture_solve(
  c_len,
  CDF.matrix,
  grid.min,
  grid.max,
  grid.rand,
  M,
  simplex
)

Arguments

c_len

The number of controls

CDF.matrix

The matrix of CDFs

grid.min

Minimal value of the grid on which the CDFs are evaluated.

grid.max

Maximal value of the grid on which the CDFs are evaluated.

grid.rand

Random grid on which the CDFs are evaluated.

M

Integer indicating the number of control quantiles to use in the DiSCo method. Default is 1000.

simplex

Logical, indicating whether to constrain the optimal weights to the unit simplex. Default is FALSE, which only constrains the weights to sum up to 1 but allows them to be negative.

Value

A list containing the following elements:


DiSCo_per

Description

Function to implement permutation test for Distributional Synthetic Controls

Usage

DiSCo_per(
  results.periods,
  T0,
  ww = 0,
  peridx = 0,
  evgrid = seq(from = 0, to = 1, length.out = 101),
  graph = TRUE,
  num.cores = 1,
  weights = NULL,
  qmethod = NULL,
  qtype = qtype,
  q_min = 0,
  q_max = 1,
  M = 1000,
  simplex = FALSE,
  mixture = FALSE
)

Arguments

results.periods

List of period-specific results from DiSCo

T0

Integer indicating the first year of treatment as counted from 1 (e.g., if treatment year 2002 was the 5th year in the sample, this parameter should be 5).

ww

Optional vector of weights indicating the relative importance of each time period. If not specified, each time period is weighted equally.

peridx

Optional integer indicating number of permutations. If not specified, by default equal to the number of units in the sample.

graph

Logical, indicating whether to plot the permutation graph as in Figure 3 of the paper. Default is TRUE.

num.cores

Integer, number of cores to use for parallel computation. Default is 1. If the permutation or CI arguments are set to TRUE, this can be slow and it is recommended to set this to 4 or more, if possible. If you get an error in "all cores" or similar, try setting num.cores=1 to see the precise error value.

weights

Optional vector of weights to use for the "true" treated unit. redo_weights has to be set to FALSE for these weights to be used.

qmethod

Character, indicating the method to use for computing the quantiles of the target distribution. The default is NULL, which uses the quantile function from the stats package. Other options are "qkden" (based on smoothed kernel density function) and "extreme" (based on parametric extreme value distributions). Both are substantially slower than the default method but may be useful for fat-tailed distributions with few data points at the upper quantiles. Alternatively, one could use the q_max option to restrict the range of quantiles used.

qtype

Integer, indicating the type of quantile to compute when using quantile in the qmethod argument. The default is 7. See the documentation for the quantile function for more information.

q_min

Numeric, minimum quantile to use. Set this together with q_max to restrict the range of quantiles used to construct the synthetic control. Default is 0 (all quantiles). Currently NOT implemented for the mixture approach.

q_max

Numeric, maximum quantile to use. Set this together with q_min to restrict the range of quantiles used to construct the synthetic control. Default is 1 (all quantiles). Currently NOT implemented for the mixture approach.

M

Integer indicating the number of control quantiles to use in the DiSCo method. Default is 1000.

simplex

Logical, indicating whether to constrain the optimal weights to the unit simplex. Default is FALSE, which only constrains the weights to sum up to 1 but allows them to be negative.

mixture

Logical, indicating whether to use the mixture of distributions approach instead. See Section 4.3. in Gunsilius (2023). This approach minimizes the distance between the CDFs instead of the quantile functions, and is preferred for categorical variables. When working with such variables, one should also provide a list of support points in the grid.cat parameter. When that is provided, this parameter is automatically set to TRUE. Default is FALSE.

Details

This program iterates through all units and computes the optimal weights on the other units for replicating the unit of iteration's outcome variable, assuming that it is the treated unit. See Algorithm 1 in Gunsilius (2023) for more details. The only modification is that we take the ratio of post- and pre-treatment root mean squared Wasserstein distances to calculate the p-value, rather than the level in each period, following Abadie, Diamond, and Hainmueller (2010).
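The ratio-based p-value can be sketched as follows. The distances are made-up numbers, T0 is taken to be the first treatment period counted from 1, and the ranking rule (share of units with a ratio at least as large as the treated unit's) is an assumption about the convention rather than a copy of the package's code.

```r
# Hypothetical squared Wasserstein distances per period:
# first row is the treated unit, remaining rows are placebo runs
T0 <- 3  # assumed first treatment period, counted from 1
d <- rbind(
  treated  = c(0.10, 0.12, 0.50, 0.60),
  placebo1 = c(0.20, 0.18, 0.22, 0.21),
  placebo2 = c(0.15, 0.20, 0.19, 0.25),
  placebo3 = c(0.30, 0.25, 0.28, 0.35)
)
pre  <- seq_len(T0 - 1)
post <- T0:ncol(d)

# Root mean squared distance before and after treatment, and their ratio
rms   <- function(x) sqrt(mean(x))
ratio <- apply(d, 1, function(x) rms(x[post]) / rms(x[pre]))

# p-value: share of units whose ratio is at least as large as the treated one
p_value <- mean(ratio >= ratio["treated"])
p_value
```

In this toy example the treated unit has the largest ratio among four units, so the p-value is 0.25.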

Value

List of matrices containing synthetic time path of the outcome variable for the target unit together with the time paths of the control units

References

Gunsilius FF (2023). “Distributional synthetic controls.” Econometrica, 91(3), 1105–1117.


DiSCo_per_iter

Description

This function performs one iteration of the permutation test

Usage

DiSCo_per_iter(
  c_df,
  c_df.q,
  t_df,
  T0,
  peridx,
  evgrid,
  idx,
  grid_df,
  M = 1000,
  ww = 0,
  qmethod = NULL,
  qtype = 7,
  q_min = 0,
  q_max = 1,
  simplex = FALSE,
  mixture = FALSE
)

Arguments

c_df

List of control units

c_df.q

List of quantiles of control units

t_df

List of target unit

idx

Index of permuted target unit

grid_df

Grids to evaluate CDFs on, only needed when mixture=TRUE

M

Integer indicating the number of control quantiles to use in the DiSCo method. Default is 1000.

qmethod

Character, indicating the method to use for computing the quantiles of the target distribution. The default is NULL, which uses the quantile function from the stats package. Other options are "qkden" (based on smoothed kernel density function) and "extreme" (based on parametric extreme value distributions). Both are substantially slower than the default method but may be useful for fat-tailed distributions with few data points at the upper quantiles. Alternatively, one could use the q_max option to restrict the range of quantiles used.

qtype

Integer, indicating the type of quantile to compute when using quantile in the qmethod argument. The default is 7. See the documentation for the quantile function for more information.

q_min

Numeric, minimum quantile to use. Set this together with q_max to restrict the range of quantiles used to construct the synthetic control. Default is 0 (all quantiles). Currently NOT implemented for the mixture approach.

q_max

Numeric, maximum quantile to use. Set this together with q_min to restrict the range of quantiles used to construct the synthetic control. Default is 1 (all quantiles). Currently NOT implemented for the mixture approach.

simplex

Logical, indicating whether to constrain the optimal weights to the unit simplex. Default is FALSE, which only constrains the weights to sum up to 1 but allows them to be negative.

mixture

Logical, indicating whether to use the mixture of distributions approach instead. See Section 4.3. in Gunsilius (2023). This approach minimizes the distance between the CDFs instead of the quantile functions, and is preferred for categorical variables. When working with such variables, one should also provide a list of support points in the grid.cat parameter. When that is provided, this parameter is automatically set to TRUE. Default is FALSE.

Value

List of squared Wasserstein distances between the target unit and the control units
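For one-dimensional distributions, the squared 2-Wasserstein distance equals the integrated squared difference of the quantile functions, which can be approximated on a grid. The helper w2_sq, the samples, and the grid below are hypothetical and only illustrate the quantity being returned.

```r
set.seed(1)
a <- rnorm(1000, mean = 0)  # hypothetical sample from one unit
b <- rnorm(1000, mean = 1)  # hypothetical sample from another unit

# Evenly spaced quantile levels (midpoints of 100 equal-width cells)
evgrid <- seq(from = 0.005, to = 0.995, length.out = 100)

# Approximate squared 2-Wasserstein distance between two samples
w2_sq <- function(x, y, grid) {
  mean((quantile(x, grid, names = FALSE) - quantile(y, grid, names = FALSE))^2)
}

d_ab <- w2_sq(a, b, evgrid)  # roughly 1 for a unit mean shift
d_aa <- w2_sq(a, a, evgrid)  # exactly 0 for identical samples
```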


DiSCo_per_rank

Description

This function ranks the squared Wasserstein distances and returns the p-values for each time period

Usage

DiSCo_per_rank(distt, distp, T0)

Arguments

distt

List of squared Wasserstein distances between the target unit and the control units

distp

List of squared Wasserstein distances between the control units

Value

List of p-values for each time period


DiSCo_weights_reg

Description

Function for obtaining the weights in the DiSCo method at every time period

Usage

DiSCo_weights_reg(
  controls,
  target,
  M = 500,
  qmethod = NULL,
  qtype = 7,
  simplex = FALSE,
  q_min = 0,
  q_max = 1
)

Arguments

controls

List with matrices of control distributions

target

Matrix containing the target distribution

M

Integer indicating the number of control quantiles to use in the DiSCo method. Default is 500.

qmethod

Character, indicating the method to use for computing the quantiles of the target distribution. The default is NULL, which uses the quantile function from the stats package. Other options are "qkden" (based on smoothed kernel density function) and "extreme" (based on parametric extreme value distributions). Both are substantially slower than the default method but may be useful for fat-tailed distributions with few data points at the upper quantiles. Alternatively, one could use the q_max option to restrict the range of quantiles used.

qtype

Integer, indicating the type of quantile to compute when qmethod is NULL, so that the quantile function is used. The default is 7. See the documentation for the quantile function for more information.

simplex

Logical, indicating whether to constrain the optimal weights to the unit simplex. Default is FALSE, which only constrains the weights to sum to 1 but allows them to be negative.

q_min

Numeric, minimum quantile to use. Set this together with q_max to restrict the range of quantiles used to construct the synthetic control. Default is 0 (all quantiles). Currently NOT implemented for the mixture approach.

q_max

Numeric, maximum quantile to use. Set this together with q_min to restrict the range of quantiles used to construct the synthetic control. Default is 1 (all quantiles). Currently NOT implemented for the mixture approach.

Details

Estimates the optimal weights for the distributional synthetic controls method by solving the convex minimization problem in Eq. (2) of Gunsilius (2023), using a regression of the simulated target quantiles on the simulated control quantiles, as in Eq. (3): \underset{\vec{\lambda} \in \Delta^J}{\operatorname{argmin}}\left\|\mathbb{Y}_t \vec{\lambda}_t-\vec{Y}_{1 t}\right\|_2^2. The constrained optimization relies on the package pracma. The control distributions can be given in list form, where each list element contains a vector of observations for the given control unit, or in matrix form, where each column corresponds to one unit and each row to one observation. The list form is useful because the number of draws can differ across control units. The target must be given as a vector.
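
A minimal sketch of the quantile regression described above, for the case simplex = FALSE (weights sum to 1 but may be negative). It uses only base R and solves the equality-constrained least squares problem through its KKT system, which is a swapped-in technique and not the pracma routine the package relies on. The control samples, lambda_true, and all object names are assumptions made for illustration.

```r
set.seed(1)
M <- 500                      # number of quantile draws
probs <- runif(M)             # random quantile levels in (0, 1)

# Hypothetical control samples with different shapes.
controls <- list(rnorm(400), rexp(400), runif(400, -2, 2))
J <- length(controls)

# M x J matrix of control quantiles evaluated at the same levels.
Y <- sapply(controls, function(x) {
  quantile(x, probs = probs, type = 7, names = FALSE)
})

# Build a target quantile vector from known weights so the answer can be checked.
lambda_true <- c(0.5, 0.3, 0.2)
y1 <- as.vector(Y %*% lambda_true)

# Least squares subject to sum(lambda) = 1, solved through the KKT system:
#   [2 Y'Y  1] [lambda]   [2 Y'y1]
#   [1'     0] [  mu  ] = [  1   ]
kkt <- rbind(cbind(2 * crossprod(Y), rep(1, J)),
             c(rep(1, J), 0))
rhs <- c(2 * crossprod(Y, y1), 1)
lambda_hat <- solve(kkt, rhs)[seq_len(J)]
```

Because the target is constructed as an exact weighted average of the control quantiles, lambda_hat recovers lambda_true up to numerical error. Restricting the weights to the unit simplex adds inequality constraints and needs a quadratic programming solver instead of a single linear solve.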

Value

Vector of optimal synthetic control weights

References

Gunsilius FF (2023). “Distributional synthetic controls.” Econometrica, 91(3), 1105–1117.


bootCounterfactuals

Description

Function for computing the bootstrapped counterfactuals in the DiSCo method

Usage

bootCounterfactuals(result_t, t, mixture, weights, evgrid, grid)

Arguments

result_t

A list containing the results of the DiSCo_CI_iter function

t

The current time period

mixture

Logical, indicating whether to use the mixture of distributions approach instead. See Section 4.3 in Gunsilius (2023). This approach minimizes the distance between the CDFs instead of the quantile functions, and is preferred for categorical variables. When working with such variables, one should also provide a list of support points in the grid.cat parameter. When grid.cat is provided, mixture is automatically set to TRUE. Default is FALSE.

grid

Grid to recompute the CDF on if mixture option is chosen

Value

A list containing the bootstrapped counterfactuals
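
A minimal sketch of how a counterfactual can be formed under each approach, given a set of weights: a weighted average of control quantile functions by default, or a weighted average of control CDFs on a grid under the mixture approach. The samples, the weights, and the contents of evgrid and grid are assumptions made for illustration, and the bootstrap resampling step itself is not shown.

```r
set.seed(2)
controls <- list(rnorm(300, mean = 0), rnorm(300, mean = 3))
weights  <- c(0.7, 0.3)                        # hypothetical weights
evgrid   <- seq(0.01, 0.99, length.out = 99)   # quantile levels
grid     <- seq(-4, 7, length.out = 200)       # support points for the CDFs

# Quantile approach: weighted average of control quantile functions.
Q <- sapply(controls, function(x) quantile(x, probs = evgrid, names = FALSE))
q_counterfactual <- as.vector(Q %*% weights)

# Mixture approach: weighted average of control CDFs evaluated on the grid.
Fmat <- sapply(controls, function(x) ecdf(x)(grid))
cdf_counterfactual <- as.vector(Fmat %*% weights)
```

With non-negative weights both results are monotone, so they remain a valid quantile function and a valid CDF respectively.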


checks

Description

Carry out checks on the inputs

Usage

checks(
  df,
  id_col.target,
  t0,
  M,
  G,
  num.cores,
  permutation,
  q_min,
  q_max,
  CI,
  boots,
  cl,
  graph,
  qmethod,
  seed
)

Arguments

df

Data frame or data table containing the distributional data for the target and control units. The data table should contain the following columns:

  • y_col A numeric vector containing the outcome variable for each unit. Units can be individuals, states, etc., but they should be nested within a larger unit (e.g. individuals or counties within a state)

  • id_col A numeric vector containing the aggregate IDs of the units. This could be, for example, the state if the units are counties or individuals

  • time_col A vector containing the time period of the observation for each unit. This should be a monotonically increasing integer.

id_col.target

Variable indicating the name of the target unit, as specified in the id_col column of the data table. This variable can be any type, as long as it is the same type as the id_col column of the data table.

t0

Integer indicating period of treatment.

M

Integer indicating the number of control quantiles to use in the DiSCo method. Default is 1000.

G

Integer indicating the number of grid points for the grid on which the estimated functions are evaluated. Default is 1000.

num.cores

Integer, number of cores to use for parallel computation. Default is 1. If the permutation or CI arguments are set to TRUE, this can be slow and it is recommended to set this to 4 or more, if possible. If you get an error in "all cores" or similar, try setting num.cores=1 to see the precise error value.

permutation

Logical, indicating whether to run the permutation test.

q_min

Numeric, minimum quantile to use. Set this together with q_max to restrict the range of quantiles used to construct the synthetic control. Default is 0 (all quantiles). Currently NOT implemented for the mixture approach.

q_max

Numeric, maximum quantile to use. Set this together with q_min to restrict the range of quantiles used to construct the synthetic control. Default is 1 (all quantiles). Currently NOT implemented for the mixture approach.

CI

Logical, indicating whether to compute confidence intervals for the counterfactual quantiles. Default is FALSE. The confidence intervals are computed using the bootstrap procedure described in Van Dijcke et al. (2024).

boots

Integer, number of bootstrap samples to use for computing confidence intervals. Default is 500.

cl

Numeric, confidence level for the (two-sided) confidence intervals.

graph

Logical, indicating whether to plot the permutation graph as in Figure 3 of the paper. Default is FALSE.

qmethod

Character, indicating the method to use for computing the quantiles of the target distribution. The default is NULL, which uses the quantile function from the stats package. Other options are "qkden" (based on smoothed kernel density function) and "extreme" (based on parametric extreme value distributions). Both are substantially slower than the default method but may be useful for fat-tailed distributions with few data points at the upper quantiles. Alternatively, one could use the q_max option to restrict the range of quantiles used.

seed

Integer, seed for the random number generator. This needs to be set explicitly in the function call, since it will invoke RNGkind which will set the seed for each core when using parallel processes. Default is NULL, which does not set a seed.
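
A minimal sketch of a data frame in the documented layout (columns y_col, id_col, time_col), followed by a few validations in the spirit of the argument descriptions above. The helper check_inputs and its specific rules, such as requiring t0 to fall inside the observed time range, are assumptions made for illustration and are not the checks the package performs.

```r
set.seed(3)
# Three aggregate units (ids 1 to 3), four periods, 50 micro-level draws per cell.
df <- expand.grid(draw = 1:50, id_col = 1:3, time_col = 2001:2004)
df$y_col <- rnorm(nrow(df), mean = df$id_col)
df$draw <- NULL

id_col.target <- 1
t0 <- 2003

# Hypothetical validation mirroring the documented requirements.
check_inputs <- function(df, id_col.target, t0, q_min = 0, q_max = 1) {
  stopifnot(
    all(c("y_col", "id_col", "time_col") %in% names(df)),
    is.numeric(df$y_col),
    id_col.target %in% df$id_col,
    t0 > min(df$time_col), t0 <= max(df$time_col),
    q_min >= 0, q_max <= 1, q_min < q_max
  )
  invisible(TRUE)
}

ok <- check_inputs(df, id_col.target, t0)
```

Calling check_inputs with a target id that does not appear in id_col stops with an error, which is the kind of early failure input checks are meant to produce.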


citation

Description

print the citation for the relevant paper

Usage

citation()

Data from Dube (2019)

Description

As used in the empirical application of Gunsilius (2023).

Usage

dube

Format

dube

A data frame with 652,870 rows and 3 columns:

id_col

State FIPS

time_col

Year

y_col

adj0contpov variable in Dube (2019). Captures the distribution of equalized family income from wages and salary, defined as multiples of the federal poverty threshold.

...


ex_gmm

Description

Example data for DiSCo command. Returns simulated target and control that are mixtures of Gaussian distributions.

Usage

ex_gmm(Ts = 2, num.con = 30, numdraws = 1000)

Arguments

Ts

an integer indicating the number of time periods

num.con

an integer indicating the number of control units

numdraws

an integer indicating the number of draws

Value

target

a vector.

control

a matrix.
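
A minimal sketch of simulating data of this shape for a single period: a target vector and a control matrix whose columns are draws from mixtures of Gaussians. The generator r_gmm, the number of mixture components, and the parameter ranges are assumptions made for illustration and do not reproduce the simulation design of ex_gmm.

```r
set.seed(4)
# Hypothetical generator: n draws from a mixture of Gaussians.
r_gmm <- function(n, means, sds, probs) {
  # Pick a mixture component for each draw, then sample from that component.
  comp <- sample(seq_along(means), size = n, replace = TRUE, prob = probs)
  rnorm(n, mean = means[comp], sd = sds[comp])
}

numdraws <- 1000
num.con  <- 30

# Control matrix: one column per control unit, one row per draw.
control <- sapply(seq_len(num.con), function(j) {
  r_gmm(numdraws, means = runif(3, -10, 10), sds = runif(3, 0.5, 3),
        probs = rep(1 / 3, 3))
})

# Target vector drawn from its own mixture.
target <- r_gmm(numdraws, means = c(-5, 0, 5), sds = c(1, 2, 1),
                probs = c(0.2, 0.5, 0.3))
```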


getGrid

Description

Set up a grid for the estimation of the quantile functions and CDFs

Usage

getGrid(target, controls, G)

Arguments

target

A vector containing the data for the target unit

controls

A list containing the data for the control units

G

The number of grid points

Value

A list containing the following elements:


Check if a vector is integer

Description

Check if a vector is integer

Usage

is.integer(x)

Arguments

x

a vector

Value

TRUE if x is integer, FALSE otherwise
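
A minimal sketch of a whole-number check, which may be what is wanted for inputs such as time periods that are stored as doubles. Base R's is.integer tests the storage type rather than the values, so the two can disagree. The helper name is_whole_number is hypothetical, and this is not necessarily how the package's function is implemented.

```r
# Hypothetical helper: TRUE if every element is a finite whole number.
is_whole_number <- function(x) {
  is.numeric(x) && all(is.finite(x)) && all(x == round(x))
}

years <- c(2001, 2002, 2003)           # whole-valued, but stored as double
whole_years   <- is_whole_number(years)
whole_mixed   <- is_whole_number(c(1.5, 2))
storage_check <- base::is.integer(years)
```

Here whole_years is TRUE while storage_check is FALSE, because the values are whole numbers held in a double vector.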


mclapply.hack

Description

This function mimics forking (done with mclapply in Mac or Linux) for the Windows environment. Designed to be used just like mclapply. Credit goes to Nathan VanHoudnos.

Usage

mclapply.hack(..., verbose = FALSE, mc.cores = 1)

Arguments

verbose

Should users be warned this is hack-y? Defaults to FALSE.

mc.cores

Number of cores to use. Defaults to 1.

See Also

mclapply
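
A minimal sketch of the general idea behind a fork-free replacement for mclapply: start worker processes with makeCluster and dispatch the work with parLapply, both from the parallel package. This illustrates the standard socket-cluster pattern only; it is not the code of mclapply.hack, and the choice of two workers is arbitrary.

```r
library(parallel)

# Start two worker processes; socket clusters work on Windows, macOS, and Linux.
cl <- makeCluster(2)

# Apply a function to each element in parallel, much as mclapply() would.
res <- parLapply(cl, 1:4, function(i) i^2)

# Always release the workers when done.
stopCluster(cl)
```

Unlike forked processes, socket workers start with an empty environment, so any objects or packages the function needs must be sent to them explicitly, for example with clusterExport.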


Compute the empirical quantile function

Description

Compute the empirical quantile function

Usage

myQuant(X, q, qtype = 7, qmethod = NULL, ...)

Arguments

X

A vector containing the data

q

A vector containing the quantiles

Value

A vector containing the empirical quantile function
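
A minimal sketch of what the qtype argument controls when the quantile function from the stats package is used. Type 7 interpolates linearly between order statistics, while type 1 inverts the empirical CDF without interpolation. The data values are made up for illustration.

```r
X <- c(1, 2, 3, 4, 5)
q <- c(0, 0.5, 1)

# Type 7 is the default interpolation rule of stats::quantile().
q7 <- quantile(X, probs = q, type = 7, names = FALSE)

# With an even number of points the two types differ at the median.
X_even <- c(1, 2, 3, 4)
med_type7 <- quantile(X_even, probs = 0.5, type = 7, names = FALSE)  # interpolates
med_type1 <- quantile(X_even, probs = 0.5, type = 1, names = FALSE)  # picks a data point
```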


parseBoots

Description

Function for parsing the bootstrapped counterfactuals in the DiSCo method

Usage

parseBoots(CI_temp, cl, q_disco, cdf_disco, q_obs, cdf_obs, uniform = TRUE)

Arguments

CI_temp

A list containing the bootstrapped counterfactuals

cl

The confidence level

q_disco

The estimated quantiles around which to center

cdf_disco

The estimated cdfs around which to center

q_obs

The observed quantiles

cdf_obs

The observed cdfs

uniform

Whether to use uniform or pointwise confidence intervals

Value

A list containing the confidence intervals for the quantiles and cdfs
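
A minimal sketch of the difference between pointwise and uniform bands built from bootstrap draws. The pointwise band takes percentiles separately at each grid point; the uniform band uses one critical value, the cl-quantile of the largest absolute deviation across the grid. This is a generic textbook construction under simulated inputs and may differ from what parseBoots does; all object names are assumptions.

```r
set.seed(42)
G  <- 50                                        # grid points
B  <- 200                                       # bootstrap draws
cl <- 0.95
alpha <- 1 - cl

est  <- qnorm(seq(0.05, 0.95, length.out = G))  # stand-in for estimated quantiles
# G x B matrix of simulated bootstrap estimates, one column per draw.
boot <- matrix(rep(est, B), nrow = G) + matrix(rnorm(G * B, sd = 0.1), nrow = G)

# Pointwise band: percentiles taken separately at each grid point.
pw_lower <- apply(boot, 1, quantile, probs = alpha / 2, names = FALSE)
pw_upper <- apply(boot, 1, quantile, probs = 1 - alpha / 2, names = FALSE)

# Uniform band: one critical value from the sup-norm of the deviations.
sup_dev <- apply(abs(boot - est), 2, max)
crit <- quantile(sup_dev, probs = cl, names = FALSE)
un_lower <- est - crit
un_upper <- est + crit
```

The uniform band is wider because it is meant to cover the whole curve at once rather than each grid point individually.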


permut

Description

Object to hold results of permutation test

Usage

permut(distp, distt, p_overall, J_1, q_min, q_max, plot)

Arguments

distp

List of squared Wasserstein distances between the control units

distt

List of squared Wasserstein distances between the target unit and the control units

p_overall

Overall p-value

J_1

Number of control units

q_min

Minimum quantile

q_max

Maximum quantile

plot

ggplot object containing plot of squared Wasserstein distances over time for all permutations.

Value

A list of class permut, with the same elements as the input arguments.
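
A minimal sketch of the S3 pattern behind a results object with its own print method: a constructor that returns a classed list, and a print method that is dispatched automatically. The class name permut_demo, the constructor name, and the printed layout are hypothetical and chosen to avoid clashing with the package's permut class.

```r
# Hypothetical constructor returning a classed list.
new_permut_demo <- function(distp, distt, p_overall, J_1, q_min, q_max,
                            plot = NULL) {
  structure(
    list(distp = distp, distt = distt, p_overall = p_overall,
         J_1 = J_1, q_min = q_min, q_max = q_max, plot = plot),
    class = "permut_demo"
  )
}

# S3 print method: dispatched automatically by print() on this class.
print.permut_demo <- function(x, ...) {
  cat("Permutation test with", x$J_1, "control units\n")
  cat("Quantile range: [", x$q_min, ",", x$q_max, "]\n")
  cat("Overall p-value:", format(x$p_overall, digits = 3), "\n")
  invisible(x)
}

obj <- new_permut_demo(distp = list(c(0.2, 0.3)), distt = c(0.1, 0.9),
                       p_overall = 0.5, J_1 = 1, q_min = 0, q_max = 1)
print(obj)
```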


Plot distribution of treatment effects over time

Description

Plot distribution of treatment effects over time

Usage

plotDistOverTime(
  cdf_centered,
  grid_cdf,
  t_start,
  t_max,
  CI,
  ci_lower,
  ci_upper,
  ylim = c(0, 1),
  xlim = NULL,
  cdf = TRUE,
  xlab = "Distribution Difference",
  ylab = "CDF",
  obsLine = NULL,
  savePlots = FALSE,
  plotName = NULL,
  lty = 1,
  lty_obs = 1,
  t_plot = NULL
)

Arguments

cdf_centered

list of centered distributional statistics

grid_cdf

grid

t_start

start time

t_max

maximum time

CI

logical indicating whether to plot confidence intervals

ci_lower

lower confidence interval

ci_upper

upper confidence interval

ylim

y limits

xlim

x limits

cdf

logical indicating whether to plot CDF or quantile difference

xlab

x label

ylab

y label

obsLine

optional additional line to plot. Default is NULL which means no line is plotted.

savePlots

logical indicating whether to save plots

plotName

name of plot to save

lty

line type for the main line passed as cdf_centered

lty_obs

line type for the optional additional line passed as obsLine

t_plot

optional vector of times to plot. Default is NULL which means all times are plotted.

Value

plot of distribution of treatment effects over time


print.permut

Description

Print permutation test results

Usage

## S3 method for class 'permut'
print(x, ...)

Arguments

x

Object of class permut

...

Additional arguments

Value

Prints permutation test results


summary.DiSCoT

Description

Summary of DiSCoT object

Usage

## S3 method for class 'DiSCoT'
summary(object, ...)

Arguments

object

DiSCoT object

...

Additional arguments

Value

summary of DiSCoT object


summary.permut

Description

Summarize permutation test results

Usage

## S3 method for class 'permut'
summary(object, ...)

Arguments

object

Object of class permut

...

Additional arguments

Value

Prints permutation test results