Title: | Bayesian MI-LASSO for Variable Selection on Multiply-Imputed Datasets |
Version: | 1.0.1 |
Author: | Jungang Zou [aut, cre], Sijian Wang [aut], Qixuan Chen [aut] |
Maintainer: | Jungang Zou <jungang.zou@gmail.com> |
Description: | Provides a suite of Bayesian MI-LASSO for variable selection methods for multiply-imputed datasets. The package includes four Bayesian MI-LASSO models using shrinkage (Multi-Laplace, Horseshoe, ARD) and Spike-and-Slab (Spike-and-Laplace) priors, along with tools for model fitting via MCMC, three-step projection predictive variable selection, and hyperparameter calibration. Methods are suitable for both continuous and binary covariates under missing-at-random assumptions. See Zou, J., Wang, S. and Chen, Q. (2022), Variable Selection for Multiply-imputed Data: A Bayesian Framework. ArXiv, 2211.00114. <doi:10.48550/arXiv.2211.00114> for more details. We also provide the frequentist's MI-LASSO function. |
License: | Apache License (≥ 2) |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Depends: | R (≥ 3.5.0) |
Imports: | MCMCpack, mvnfast, GIGrvg, MASS, Rfast, foreach, doParallel, arm, mice, abind, stringr, stats, posterior |
Suggests: | testthat, knitr, rmarkdown |
NeedsCompilation: | no |
Packaged: | 2025-07-06 19:47:39 UTC; jz3183 |
Repository: | CRAN |
Date/Publication: | 2025-07-09 13:30:09 UTC |
ARD MCMC Sampler for Multiply-Imputed Regression
Description
Implements Bayesian variable selection using the Automatic Relevance Determination (ARD) prior across multiply-imputed datasets. The ARD prior imposes feature-specific shrinkage by placing a prior proportional to inverse of precision of each coefficient.
Usage
ARD_mcmc(
X,
Y,
intercept = TRUE,
nburn = 4000,
npost = 4000,
seed = NULL,
verbose = TRUE,
printevery = 1000,
chain_index = 1
)
Arguments
X |
A 3-D array of predictors with dimensions |
Y |
A matrix of outcomes with dimensions |
intercept |
Logical; include an intercept? Default |
nburn |
Integer; number of burn-in MCMC iterations. Default |
npost |
Integer; number of post-burn-in samples to retain. Default |
seed |
Integer or |
verbose |
Logical; print progress messages? Default |
printevery |
Integer; print progress every this many iterations. Default |
chain_index |
Integer; index of this MCMC chain (for labeling messages). Default |
Value
A named list
with components:
post_beta
Array
npost × D × p
of sampled regression coefficients.post_alpha
Matrix
npost × D
of sampled intercepts (if used).post_sigma2
Numeric vector length
npost
, sampled residual variances.post_psi2
Matrix
npost × p
of sampled precision parameters for each coefficient.post_fitted_Y
Array
npost × D × n
of posterior predictive draws (with noise).post_pool_beta
Matrix
(npost * D) × p
of pooled coefficient draws.post_pool_fitted_Y
Matrix
(npost * D) × n
of pooled predictive draws (with noise).hat_matrix_proj
Matrix
D × n × n
of averaged projection hat-matrices. To avoid recalculate for estimating degree of freedom.
Examples
sim <- sim_B(n = 100, p = 20, type = "MAR", SNP = 1.5, corr = 0.5,
low_missing = TRUE, n_imp = 5, seed = 123)
X <- sim$data_MI$X
Y <- sim$data_MI$Y
fit <- ARD_mcmc(X, Y, nburn = 100, npost = 100)
Bayesian MI-LASSO for Multiply-Imputed Regression
Description
Fit a Bayesian multiple-imputation LASSO (BMI-LASSO) model across multiply-imputed datasets, using one of four priors: Multi-Laplace, Horseshoe, ARD, or Spike-Laplace. Automatically standardizes data, runs MCMC in parallel, performs variable selection via three-step projection predictive variable selection, and selects a final submodel by BIC.
Usage
BMI_LASSO(
X,
Y,
model,
standardize = TRUE,
SNC = TRUE,
grid = seq(0, 1, 0.01),
orthogonal = FALSE,
nburn = 4000,
npost = 4000,
seed = NULL,
nchain = 1,
ncores = 1,
verbose = TRUE,
printevery = 1000,
...
)
Arguments
X |
A numeric matrix or array of predictors. If a matrix |
Y |
A numeric vector or matrix of outcomes. If a vector of length |
model |
Character; which prior to use. One of |
standardize |
Logical; whether to normalize each |
SNC |
Logical; if |
grid |
Numeric vector; grid of scaled neighborhood criterion (or thresholding) to explore.
Default |
orthogonal |
Logical; if |
nburn |
Integer; number of burn-in MCMC iterations per chain. Default |
npost |
Integer; number of post-burn-in samples to retain per chain. Default |
seed |
Optional integer; base random seed. Each chain adds its index. |
nchain |
Integer; number of MCMC chains to run in parallel. Default |
ncores |
Integer; number of parallel cores to use. Default |
verbose |
Logical; print progress messages. Default |
printevery |
Integer; print status every so many iterations. Default |
... |
Additional model-specific hyperparameters:
|
Value
A named list with elements:
posterior
List of length
nchain
of MCMC outputs (posterior draws).select
List of length
nchain
of logical matrices showing which variables are selected at each grid value.best_select
List of length
nchain
of the single best selection (by BIC) for each chain.posterior_best_models
List of length
nchain
of projected posterior draws for the best submodel.bic_models
List of length
nchain
of BIC values and degrees-of-freedom for each candidate submodel.summary_table_full
A data frame summarizing rank-normalized split-Rhat and other diagnostics for the full model.
summary_table_selected
A data frame summarizing diagnostics for the selected submodel after projection.
Examples
sim <- sim_A(n = 100, p = 20, type = "MAR", SNP = 1.5, low_missing = TRUE, n_imp = 5, seed = 123)
X <- sim$data_MI$X
Y <- sim$data_MI$Y
fit <- BMI_LASSO(X, Y, model = "Horseshoe",
nburn = 100, npost = 100,
nchain = 1, ncores = 1)
str(fit$best_select)
Multiple-Imputation LASSO (MI-LASSO)
Description
Fit a LASSO-like penalty across D
multiply-imputed datasets by
iteratively reweighted ridge regressions (Equation (4) of the manuscript).
For each tuning parameter in lamvec
, it returns the pooled
coefficient estimates, the BIC, and the selected variables.
Usage
MI_LASSO(
X,
Y,
lamvec = (2^(seq(-1, 4, by = 0.05)))^2/2,
maxiter = 200,
eps = 1e-20,
ncores = 1
)
Arguments
X |
A matrix |
Y |
A vector length |
lamvec |
Numeric vector of penalty parameters |
maxiter |
Integer; maximum number of ridge–update iterations per |
eps |
Numeric; convergence tolerance on coefficient change. Default |
ncores |
Integer; number of cores for parallelizing over |
Value
If length(lamvec) > 1
, a list with elements:
best
List for the
lambda
with minimal BIC containing:coefficients
((p+1)×D
intercept + slopes),bic
(BIC scalar),varsel
(logical length-p
vector of selected predictors),lambda
(the chosen penalty).lambda_path
length(lamvec)×2
matrix of eachlambda
and its corresponding BIC.
If length(lamvec) == 1
, returns a single list (as above) for that
penalty.
Examples
sim <- sim_A(n = 100, p = 20, type = "MAR", SNP = 1.5, low_missing = TRUE, n_imp = 5, seed = 123)
X <- sim$data_MI$X
Y <- sim$data_MI$Y
fit <- MI_LASSO(X, Y, lamvec = c(0.1))
Horseshoe MCMC Sampler for Multiply-Imputed Regression
Description
Implements Bayesian variable selection using the hierarchical Horseshoe prior
across multiply-imputed datasets. This model applies global–local shrinkage
to regression coefficients via a global scale (tau2
), local scales
(lambda2
), and auxiliary hyperpriors (kappa
, eta
).
Usage
horseshoe_mcmc(
X,
Y,
intercept = TRUE,
nburn = 4000,
npost = 4000,
seed = NULL,
verbose = TRUE,
printevery = 1000,
chain_index = 1
)
Arguments
X |
A 3-D array of predictors with dimensions |
Y |
A matrix of outcomes with dimensions |
intercept |
Logical; include an intercept term? Default |
nburn |
Integer; number of burn-in MCMC iterations. Default |
npost |
Integer; number of post-burn-in samples to retain. Default |
seed |
Integer or |
verbose |
Logical; print progress messages? Default |
printevery |
Integer; print progress every this many iterations. Default |
chain_index |
Integer; index of this MCMC chain (for labeling prints). Default |
Value
A named list with components:
post_beta
Array
npost × D × p
of sampled regression coefficients.post_alpha
Matrix
npost × D
of sampled intercepts (if used).post_sigma2
Numeric vector of length
npost
, sampled residual variances.post_lambda2
Matrix
npost × p
of local shrinkage parameters\lambda_j^2
.post_kappa
Matrix
npost × p
of auxiliary local hyperparameters\kappa_j
.post_tau2
Numeric vector of length
npost
, sampled global scale\tau^2
.post_eta
Numeric vector of length
npost
, sampled auxiliary global hyperparameter\eta
.post_fitted_Y
Array
npost × D × n
of posterior predictive draws (with noise).post_pool_beta
Matrix
(npost * D) × p
of pooled coefficient draws.post_pool_fitted_Y
Matrix
(npost * D) × n
of pooled predictive draws (with noise).hat_matrix_proj
Matrix
D × n × n
of averaged projection hat-matrices. To avoid recalculate for estimating degree of freedom.
Examples
sim <- sim_B(n = 100, p = 20, type = "MAR", SNP = 1.5, corr = 0.5,
low_missing = TRUE, n_imp = 5, seed = 123)
X <- sim$data_MI$X
Y <- sim$data_MI$Y
fit <- horseshoe_mcmc(X, Y, nburn = 100, npost = 100)
Multi-Laplace MCMC Sampler for Multiply-Imputed Regression
Description
Implements Bayesian variable selection under the Multi-Laplace prior on
regression coefficients across multiply-imputed datasets. The prior shares
local shrinkage parameters (lambda2
) across imputations and places
a Gamma(h
, v
) hyperprior on the global parameter rho
.
Usage
multi_laplace_mcmc(
X,
Y,
intercept = TRUE,
h = 2,
v = NULL,
nburn = 4000,
npost = 4000,
seed = NULL,
verbose = TRUE,
printevery = 1000,
chain_index = 1
)
Arguments
X |
A 3-D array of predictors with dimensions |
Y |
A matrix of outcomes with dimensions |
intercept |
Logical; include an intercept? Default |
h |
Numeric; shape parameter of the Gamma prior on |
v |
Numeric or |
nburn |
Integer; number of burn-in iterations. Default |
npost |
Integer; number of post-burn-in samples to store. Default |
seed |
Integer or |
verbose |
Logical; print progress messages? Default |
printevery |
Integer; print progress every this many iterations. Default |
chain_index |
Integer; index of this MCMC chain (for messages). Default |
Value
A named list
with elements:
post_beta
Array
npost × D × p
of sampled regression coefficients.post_alpha
Matrix
npost × D
of sampled intercepts (if used).post_sigma2
Numeric vector of length
npost
, sampled residual variances.post_lambda2
Matrix
npost × p
of sampled local shrinkage parameters.post_rho
Numeric vector of length
npost
, sampled global parameters.post_fitted_Y
Array
npost × D × n
of posterior predictive draws (with noise).post_pool_beta
Matrix
(npost * D) × p
of pooled coefficient draws.post_pool_fitted_Y
Matrix
(npost * D) × n
of pooled predictive draws (with noise).hat_matrix_proj
Matrix
D × n × n
of averaged projection hat-matrices. To avoid recalculate for estimating degree of freedom.h
,v
Numeric; the shape and scale hyperparameters used.
Examples
sim <- sim_B(n = 100, p = 20, type = "MAR", SNP = 1.5, corr = 0.5, low_missing = TRUE,
n_imp = 5, seed = 123)
X <- sim$data_MI$X
Y <- sim$data_MI$Y
fit <- multi_laplace_mcmc(X, Y, intercept = TRUE, nburn = 100, npost = 100)
Projecting Posterior Means of Full-Model Coefficients onto a Reduced Subset Model
Description
Given posterior means of beta1_mat
(and optional intercepts
alpha1_vec
) from a full model fitted on D
imputed
datasets, compute the predictive projection onto the submodel defined by
xs_vec
. Returns the projected coefficients (and intercepts, if requested).
Usage
projection_mean(X_arr, beta1_mat, xs_vec, sigma2, alpha1_vec = NULL)
Arguments
X_arr |
A 3-D array of predictors, of dimension |
beta1_mat |
A |
xs_vec |
Logical vector of length |
sigma2 |
Numeric scalar; the residual variance from the full model (pooled across imputations). |
alpha1_vec |
Optional numeric vector of length |
Value
A list with components:
beta2_mat
A
D * p
matrix of projected submodel coefficients.alpha2_vec
(If
alpha1_vec
provided) numeric vector lengthD
of projected intercepts.
Examples
# Simulate a single imputation with n=50, p=5:
D <- 3; n <- 50; p <- 5
X_arr <- array(rnorm(D * n * p), c(D, n, p))
beta1_mat <- matrix(rnorm(D * p), nrow = D)
# Suppose full-model sigma2 pooled is 1.2
sigma2 <- 1.2
# Project onto predictors 1 and 4 only:
xs_vec <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
proj <- projection_mean(X_arr, beta1_mat, xs_vec, sigma2)
str(proj)
# With intercept:
alpha1_vec <- rnorm(D)
proj2 <- projection_mean(X_arr, beta1_mat, xs_vec, sigma2, alpha1_vec)
str(proj2)
Projection of Full-Posterior Draws onto a Reduced-Subset Model
Description
Given posterior draws beta1_arr
(and optional intercepts alpha1_arr
)
from a full model fitted on D
imputed datasets, compute
the predictive projection of each draw onto the submodel defined by xs_vec
.
Returns the projected coefficients (and intercepts, if requested) plus the projected
residual variance for each posterior draw.
Usage
projection_posterior(X_arr, beta1_arr, sigma1_vec, xs_vec, alpha1_arr = NULL)
Arguments
X_arr |
A 3-D array of predictors, of dimension |
beta1_arr |
A |
sigma1_vec |
Numeric vector of length |
xs_vec |
Logical vector of length |
alpha1_arr |
Optional |
Value
A list with components:
beta2_arr
Array
npost * D * p
of projected submodel coefficients.alpha2_arr
(If
alpha1_arr
provided) matrixnpost * D
of projected intercepts.sigma2_opt
Numeric vector length
npost
of projected residual variances.
Examples
D <- 3; n <- 50; p <- 5; npost <- 100
X_arr <- array(rnorm(D*n*p), c(D, n, p))
beta1_arr <- array(rnorm(npost*D*p), c(npost, D, p))
sigma1_vec <- runif(npost, 0.5, 2)
xs_vec <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
# Without intercept
proj <- projection_posterior(X_arr, beta1_arr, sigma1_vec, xs_vec)
str(proj)
# With intercept draws
alpha1_arr <- matrix(rnorm(npost*D), nrow = npost, ncol = D)
proj2 <- projection_posterior(X_arr, beta1_arr, sigma1_vec, xs_vec, alpha1_arr)
str(proj2)
Simulate dataset A: Independent continuous covariates with MCAR/MAR missingness
Description
Generates a dataset for Scenario A used in Bayesian MI-LASSO benchmarking. Covariates are iid standard normal, with a fixed true coefficient vector, linear outcome, missingness imposed on specified columns under MCAR or MAR, and multiple imputations via predictive mean matching.
Usage
sim_A(
n = 100,
p = 20,
type = "MAR",
SNP = 1.5,
low_missing = TRUE,
n_imp = 5,
seed = NULL
)
Arguments
n |
Integer. Number of observations. |
p |
Integer. Number of covariates (columns). Takes values in {20, 40}. |
type |
Character. Missingness mechanism: "MCAR" or "MAR". |
SNP |
Numeric. Signal-to-noise ratio controlling error variance. |
low_missing |
Logical. If TRUE, use low missingness rates; if FALSE, higher missingness. |
n_imp |
Integer. Number of multiple imputations to generate. |
seed |
Integer or NULL. Random seed for reproducibility. |
Value
A list with components:
- data_O
A list of complete covariate matrix and outcomes before missingness.
- data_mis
A list of covariate matrix and outcomes with missing values.
- data_MI
A list of array of imputed covariates (n_imp × n × p) and a matrix of imputed outcomes (n_imp × n).
- data_CC
A list of complete-case covariate matrix and outcomes.
- important
Logical vector of true nonzero coefficient indices.
- covmat
True covariance matrix used for X.
- beta
True coefficient vector.
Examples
sim <- sim_A(n = 100, p = 20, type = "MAR", SNP = 1.5,
low_missing = TRUE, n_imp = 5, seed = 123)
str(sim)
Simulate dataset B: AR(1)-correlated continuous covariates with MCAR/MAR missingness
Description
Generates a dataset for Scenario B used in Bayesian MI-LASSO benchmarking. Covariates are multivariate normal with AR(1) covariance, with a fixed true coefficient vector, linear outcome, missingness imposed on specified columns under MCAR or MAR, and multiple imputations via predictive mean matching.
Usage
sim_B(
n = 100,
p = 20,
low_missing = TRUE,
type = "MAR",
SNP = 1.5,
corr = 0.5,
n_imp = 5,
seed = NULL
)
Arguments
n |
Integer. Number of observations. |
p |
Integer. Number of covariates (columns). Takes values in {20, 40}. |
low_missing |
Logical. If TRUE, use low missingness rates; if FALSE, higher missingness. |
type |
Character. Missingness mechanism: "MCAR" or "MAR". |
SNP |
Numeric. Signal-to-noise ratio controlling error variance. |
corr |
Numeric. AR(1) correlation parameter |
n_imp |
Integer. Number of multiple imputations to generate. |
seed |
Integer or NULL. Random seed for reproducibility. |
Value
A list with components:
- data_O
A list of complete covariate matrix and outcomes before missingness.
- data_mis
A list of covariate matrix and outcomes with missing values.
- data_MI
A list of array of imputed covariates (n_imp × n × p) and a matrix of imputed outcomes (n_imp × n).
- data_CC
A list of complete-case covariate matrix and outcomes.
- important
Logical vector of true nonzero coefficient indices.
- covmat
True covariance matrix used for X.
- beta
True coefficient vector.
Examples
sim <- sim_B(n = 100, p = 20, type = "MAR", SNP = 1.5, corr = 0.5,
low_missing = TRUE, n_imp = 5, seed = 123)
str(sim)
Simulate dataset C: AR(1)-latent Gaussian dichotomized to binary covariates with MCAR/MAR missingness
Description
Generates binary covariates by thresholding an AR(1) latent Gaussian, then proceeds as in sim_B.
Usage
sim_C(
n = 100,
p = 20,
low_missing = TRUE,
type = "MAR",
SNP = 1.5,
corr = 0.5,
n_imp = 5,
seed = NULL
)
Arguments
n |
Integer. Number of observations. |
p |
Integer. Number of covariates (columns). Takes values in {20, 40}. |
low_missing |
Logical. If TRUE, use low missingness rates; if FALSE, higher missingness. |
type |
Character. Missingness mechanism: "MCAR" or "MAR". |
SNP |
Numeric. Signal-to-noise ratio controlling error variance. |
corr |
Numeric. AR(1) correlation parameter |
n_imp |
Integer. Number of multiple imputations to generate. |
seed |
Integer or NULL. Random seed for reproducibility. |
Value
A list with components:
- data_O
A list of complete covariate matrix and outcomes before missingness.
- data_mis
A list of covariate matrix and outcomes with missing values.
- data_MI
A list of array of imputed covariates (n_imp × n × p) and a matrix of imputed outcomes (n_imp × n).
- data_CC
A list of complete-case covariate matrix and outcomes.
- important
Logical vector of true nonzero coefficient indices.
- covmat
True covariance matrix used for X.
- beta
True coefficient vector.
Examples
sim <- sim_C(n = 100, p = 20, type = "MAR", SNP = 1.5, corr = 0.5,
low_missing = TRUE, n_imp = 5, seed = 123)
str(sim)
Spike-and-Laplace MCMC Sampler for Multiply-Imputed Regression
Description
Implements Bayesian variable selection using a spike-and-slab prior with a Laplace (double-exponential) slab
on nonzero coefficients. Latent inclusion indicators gamma
follow Bernoulli(theta
), and their probabilities
follow independent Beta(a
, b
) priors.
Usage
spike_laplace_partially_mcmc(
X,
Y,
intercept = TRUE,
a = 2,
b = NULL,
nburn = 4000,
npost = 4000,
seed = NULL,
verbose = TRUE,
printevery = 1000,
chain_index = 1
)
Arguments
X |
A 3-D array of predictors with dimensions |
Y |
A matrix of outcomes with dimensions |
intercept |
Logical; include an intercept term? Default |
a |
Numeric; shape parameter of the Gamma prior. Default |
b |
Numeric or |
nburn |
Integer; number of burn-in MCMC iterations. Default |
npost |
Integer; number of post-burn-in samples to retain. Default |
seed |
Integer or |
verbose |
Logical; print progress messages? Default |
printevery |
Integer; print progress every this many iterations. Default |
chain_index |
Integer; index of this MCMC chain (for labeling messages). Default |
Value
A named list with components:
post_rho
Numeric vector length
npost
, sampled global scale\rho
.post_gamma
Matrix
npost * p
of sampled inclusion indicators.post_theta
Matrix
npost * p
of sampled Beta parameters\theta_j
.post_alpha
Matrix
npost * D
of sampled intercepts (if used).post_lambda2
Matrix
npost * p
of sampled local scale parameters\lambda_j^2
.post_sigma2
Numeric vector length
npost
, sampled residual variances.post_beta
Array
npost * D * p
of sampled regression coefficients.post_fitted_Y
Array
npost * D * n
of posterior predictive draws (including noise).post_pool_beta
Matrix
(npost * D) * p
of pooled coefficient draws.post_pool_fitted_Y
Matrix
(npost * D) * n
of pooled predictive draws (with noise).hat_matrix_proj
Matrix
D * n * n
of averaged projection hat-matrices. To avoid recalculate for estimating degree of freedom.a
,b
Numeric values of the rho hyperparameters used.
Examples
sim <- sim_B(n = 100, p = 20, type = "MAR", SNP = 1.5, corr = 0.5,
low_missing = TRUE, n_imp = 5, seed = 123)
X <- sim$data_MI$X
Y <- sim$data_MI$Y
fit <- spike_laplace_partially_mcmc(X, Y, nburn = 10, npost = 10)