Type: | Package |
Title: | Heckman Selection Models Based on Bayesian Analysis |
Version: | 1.0.0 |
Maintainer: | Heeju Lim <heeju.lim@uconn.edu> |
Description: | Implements Heckman selection models using a Bayesian approach via 'Stan' and compares the performance of normal, Student’s t, and contaminated normal distributions in addressing complexities and selection bias (Heeju Lim, Victor E. Lachos, and Victor H. Lachos, Bayesian analysis of flexible Heckman selection models using Hamiltonian Monte Carlo, 2025, under submission). |
Imports: | rstan (≥ 2.26.23), mvtnorm (≥ 1.2-3), loo, stats |
License: | GPL-3 |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | no |
Packaged: | 2025-05-02 19:55:20 UTC; heeju |
Author: | Heeju Lim [aut, cre], Victor E. Lachos [aut], Victor H. Lachos [aut] |
Depends: | R (≥ 3.5.0) |
Repository: | CRAN |
Date/Publication: | 2025-05-06 08:50:05 UTC |
Fit the Heckman Selection Stan model using the Normal, Student-t or Contaminated Normal distributions.
Description
'HeckmanStan()' fits the Heckman selection model using a Bayesian approach to address sample selection bias.
Usage
HeckmanStan(
y,
x,
w,
cc,
family = "CN",
init = "random",
thin = 5,
chains = 1,
iter = 10,
warmup = 5
)
Arguments
y |
A response vector. |
x |
A covariate matrix for the response y. |
w |
A covariate matrix for the missing indicator cc. |
cc |
A missing indicator vector (1=observed, 0=missing) . |
family |
The distribution family to be used (Normal, T, or CN). |
init |
Parameters specifies the initial values for model parameters. |
thin |
An Interval at which samples are retained from the MCMC process to reduce autocorrelation. |
chains |
The number of chains to run during the MCMC sampling. Running multiple chains is useful for checking convergence. |
iter |
The total number of iterations for the MCMC sampling, determining how many samples will be drawn. |
warmup |
The number of initial iterations that will be discarded as the algorithm stabilizes before collecting samples. |
Value
An object of class HeckmanStan
, which is a list containing two elements:
-
list[[1]]
: Includes inference results from the Stan model, along with EAIC and EBIC. -
list[[2]]
: Includes the HPC confidence intervals, along with LOOIC, WAIC, and CPO.
Examples
################################################################################
# Simulation
################################################################################
library(mvtnorm)
n<- 100
w<- cbind(1,rnorm(n),rnorm(n))
x<- cbind(w[,1:2])
family="CN"
sigma2<- 1
rho<-0.7
beta<- c(1,0.5)
gamma<- c(1,0.3,-.5)
nu=c(0.1,0.1)
data<-geraHeckman(x,w,beta,gamma,sigma2,rho,nu,family=family)
y<-data$y
cc<-data$cc
# Fit Heckman Normal Stan model
fit.n_stan <- HeckmanStan(y, x, w, cc, family="Normal"
, thin = 5, chains = 1, iter = 10000, warmup = 1000)
qoi=c("beta","gamma","sigma_e","sigma2", "rho","EAIC","EBIC")
print(fit.n_stan[[1]],par=qoi)
print(fit.n_stan[[2]])
require(rstan)
plot(fit.n_stan[[1]], pars=qoi)
plot(fit.n_stan[[1]], plotfun="hist", pars=qoi)
plot(fit.n_stan[[1]], plotfun="trace", pars=qoi)
plot(fit.n_stan[[1]], plotfun = "rhat")
MEPS 2001: Ambulatory Expenditures Data
Description
This dataset is an extract from the 2001 Medical Expenditure Panel Survey (MEPS), providing information on ambulatory expenditures and various demographic and health-related variables. It has been used for illustrative examples by Cameron and Trivedi (2009, Chapter 16).
Usage
data(MEPS2001)
Format
A data frame with 3,328 observations on the following 22 variables.
- educ
Education status
- age
Age
- income
Income
- female
Gender
- vgood
Self-reported health status, very good
- good
Self-reported health status, good
- hospexp
Hospital expenditures
- totchr
Total number of chronic diseases
- ffs
Family support
- dhospexp
Dummy variable for hospital expenditures
- age2
Age squared
- agefem
Interaction between age and gender
- fairpoor
Self-reported health status, fair or poor
- year01
Year of survey
- instype
Type of insurance
- ambexp
Ambulatory expenditures
- lambexp
Log of ambulatory expenditures
- blhisp
Ethnicity
- instype_s1
Insurance type, version 1
- dambexp
Dummy variable for ambulatory expenditures
- lnambx
Log-transformed ambulatory expenditures
- ins
Insurance status
Source
2001 Medical Expenditure Panel Survey by the Agency for Healthcare Research and Quality.
References
Cameron, C.A. and Trivedi, P.K. (2009). *Microeconometrics Using Stata*. College Station, TX: Stata Press.
Examples
data(MEPS2001)
head(MEPS2001)
Panel Study of Income Dynamics 1976 Extract
Description
Cross-section data originating from the 1976 Panel Study of Income Dynamics (PSID). The dataset includes demographic and economic characteristics of married women and their husbands, and is commonly used for analyzing female labor force participation.
Usage
data(PSID1976)
Format
A data frame with 753 observations on the following 22 variables.
- age
age of the woman
- city
dummy for living in a city
- college
dummy for college education (woman)
- education
years of education (woman)
- experience
years of labor market experience
- feducation
father's years of education
- fincome
family income in 1,000s
- hage
husband's age
- hcollege
dummy for husband's college education
- heducation
husband's years of education
- hhours
husband's weekly working hours
- hours
woman's weekly working hours
- hwage
husband's log hourly wage
- meducation
mother's years of education
- oldkids
number of children older than 6
- participation
dummy for woman's labor force participation
- repwage
replacement wage (predicted wage if not employed)
- tax
marginal tax rate
- unemp
state unemployment rate
- wage
log hourly wage of the woman
- youngkids
number of children 6 or younger
References
Mroz, T. A. (1987). The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. *Econometrica*, 55(4), 765–799.
Examples
data(PSID1976)
head(PSID1976)
Generating Heckman data : Normal, Student-t, Slash and Laplace
Description
'geraHeckman()' generates a random sample from the Heckman selection model (Normal, Student-t or CN).
Usage
geraHeckman(x, w, beta, gamma, sigma2, rho, nu, family = "T")
Arguments
x |
A covariate matrix for the response y. |
w |
A covariate matrix for the missing indicator cc. |
beta |
Values for the beta vector. |
gamma |
Values for the gamma vector. |
sigma2 |
Value for the variance. |
rho |
Value for the dependence between the response and missing value. |
nu |
When using the t- distribution, the initial value for the degrees of freedom. |
family |
The distribution family to be used (Normal, T, or CN). |
Value
Return an object with the response (y) and missing values (cc).
References
Lachos, V. H., Prates, M. O., & Dey, D. K. (2021). Heckman selection-t model: Parameter estimation via the EM-algorithm. Journal of Multivariate Analysis, 184, 104737.
Examples
n <- 100
rho <- .6
cens <- 0.25
nu <- 4
set.seed(20200527)
w <- cbind(1,runif(n,-1,1),rnorm(n))
x <- cbind(w[,1:2])
family <- "T"
c <- qt(cens, df=nu)
sigma2 <- 1
beta <- c(1,0.5)
gamma<- c(1,0.3,-.5)
gamma[1] <- -c*sqrt(sigma2)
data <- geraHeckman(x,w,beta,gamma,sigma2,rho,nu,family=family)