Type: Package
Title: Generalized Berk-Jones Test for Set-Based Inference in Genetic Association Studies
Version: 0.5.4
Date: 2024-01-28
Description: Offers the Generalized Berk-Jones (GBJ) test for set-based inference in genetic association studies. The GBJ is designed as an alternative to tests such as Berk-Jones (BJ), Higher Criticism (HC), Generalized Higher Criticism (GHC), Minimum p-value (minP), and Sequence Kernel Association Test (SKAT). All of these other methods (except for SKAT) are also implemented in this package, and we additionally provide an omnibus test (OMNI) which integrates information from each of the tests. The GBJ has been shown to outperform other tests in genetic association studies when signals are correlated and moderately sparse. Please see the vignette for a quickstart guide or Sun and Lin (2017) <doi:10.48550/arXiv.1710.02469> for more details.
Depends: R (≥ 2.10)
Imports: Rcpp (≥ 0.12.7), mvtnorm, SKAT, stats, BH
LinkingTo: Rcpp, BH
License: GPL-3
RoxygenNote: 7.3.1
Suggests: knitr, rmarkdown, bindata, rje, testthat
VignetteBuilder: knitr
NeedsCompilation: yes
Packaged: 2024-01-29 20:35:15 UTC; rsun3
Author: Ryan Sun [aut, cre]
Maintainer: Ryan Sun <ryansun.work@gmail.com>
Repository: CRAN
Date/Publication: 2024-01-31 08:40:06 UTC

BJ.R

Description

Calculate the Berk-Jones test statistic and p-value.

Usage

BJ(test_stats, cor_mat = NULL, pairwise_cors = NULL)

Arguments

test_stats

Vector of test statistics for each factor in the set (i.e. marginal test statistic for each SNP in a gene).

cor_mat

d*d matrix of the correlations between all the test statistics in the set, where d is the total number of test statistics in the set. You only need to specify EITHER cor_mat OR pairwise_cors.

pairwise_cors

A vector of all d(d-1)/2 pairwise correlations between the test statistics. You only need to specify EITHER cor_mat OR pairwise_cors.

Value

A list with the elements:

BJ

The observed Berk-Jones test statistic.

BJ_pvalue

The p-value of this observed value, given the size of the set and correlation structure.

Examples

# Should return statistic = 1.243353 and p_value = 0.256618
set.seed(100)
Z_vec <- rnorm(5) + rep(1,5)
cor_Z <- matrix(data=0.2, nrow=5, ncol=5)
diag(cor_Z) <- 1
BJ(test_stats=Z_vec, cor_mat=cor_Z)

Genotypes at FGFR2 SNPs for subjects from 'GBR' population in the 1000 Genomes Project.

Description

A dataset containing the genotypes (number of minor alleles) for each of 91 subjects from the 'GBR' population in the 1000 Genomes Projects. There are 64 SNPs documented here, all residing in the FGFR2 gene.

Usage

data(FGFR2)

Format

A matrix with 91 rows (one for each subject) and 64 columns (one for each SNP)

Source

https://www.internationalgenome.org/data


GBJ.R

Description

Calculate the Generalized Berk-Jones test statistic and p-value.

Usage

GBJ(test_stats, cor_mat = NULL, pairwise_cors = NULL)

Arguments

test_stats

Vector of test statistics for each factor in the set (i.e. marginal test statistic for each SNP in a gene).

cor_mat

d*d matrix of the correlations between all the test statistics in the set, where d is the total number of test statistics in the set. You only need to specify EITHER cor_mat OR pairwise_cors.

pairwise_cors

A vector of all d(d-1)/2 pairwise correlations between the test statistics. You only need to specify EITHER cor_mat OR pairwise_cors.

Value

A list with the elements:

GBJ

The observed Generalized Higher Criticism test statistic.

GBJ_pvalue

The p-value of this observed value, given the size of the set and correlation structure.

err_code

Sometimes if your p-value is very small (<10^(-12) usually), R/C++ do not have enough precision in their standard routines to calculate the number accurately. In these cases (and very rarely others) we switch to standard Berk-Jones instead (more stable numerically) and let you know with a message here.

Examples

# Should return statistic = 0.9248399 and p_value = 0.2670707
set.seed(100)
Z_vec <- rnorm(5) + rep(1,5)
cor_Z <- matrix(data=0.2, nrow=5, ncol=5)
diag(cor_Z) <- 1
GBJ(test_stats=Z_vec, cor_mat=cor_Z)

GBJ_objective.R

Description

Calculates the GBJ objective function at given threshold points. Used both to calculate the observed GBJ statistic and also to find the boundary points for p-value calculation (through uniroot). Mostly for internal use.

Usage

GBJ_objective(t_vec, d, k_vec = NULL, pairwise_cors, offset = 0)

Arguments

t_vec

A scalar or vector of threshold points (magnitudes of test statistics) to calculate the objective at.

d

The total number of test statistics in the set.

k_vec

If t_vec is not a vector all of the test statistics, let us know which ordered objective functions we are calculating (can be a vector with the same length as t).

pairwise_cors

A vector of all d(d-1)/2 pairwise correlations between the test statistics, where d is total number of test statistics in the set.

offset

Used if we want to subtract the observed gbj value for uniroot.

Value

The GBJ objective at t (for given d, kkk, pairwise_cors) minus the offset

Examples

GBJ_objective(t_vec=1:5, d=5, k_vec=NULL, pairwise_cors=rep(0.2,10), offset=0)

GBJ_pvalue.R

Description

Calculate the p-value for the Generalized Berk-Jones (GBJ) statistic.

Usage

GBJ_pvalue(observed_gbj, d, pairwise_cors, times_to_try = 5)

Arguments

observed_gbj

The observed value of the GBJ statistic.

d

The number of test statistics in the set.

pairwise_cors

A vector of all d(d-1)/2 pairwise correlations between the test statistics, where d is total number of test statistics in the set.

times_to_try

Sometimes the numerical root-finder is finnicky, so we have to give it extra chances to try and calculate the p-value if first time is failure. Recommend setting this parameter to 5.

Value

The p-value of the GBJ test.

Examples

GBJ_pvalue(observed_gbj=2, d=5, pairwise_cors=rep(0.2,10))

GHC.R

Description

Calculate the Generalized Higher Criticism test statistic and p-value.

Usage

GHC(test_stats, cor_mat = NULL, pairwise_cors = NULL)

Arguments

test_stats

Vector of test statistics for each factor in the set (i.e. marginal test statistic for each SNP in a gene).

cor_mat

d*d matrix of the correlations between all the test statistics in the set, where d is the total number of test statistics in the set. You only need to specify EITHER cor_mat OR pairwise_cors.

pairwise_cors

A vector of all d(d-1)/2 pairwise correlations between the test statistics. You only need to specify EITHER cor_mat OR pairwise_cors.

Value

A list with the elements:

GHC

The observed Generalized Higher Criticism test statistic.

GHC_pvalue

The p-value of this observed value, given the size of the set and correlation structure.

err_code

Sometimes if your p-value is very small (<10^(-12) usually), R/C++ do not have enough precision in their standard routines to calculate the number accurately. In these cases (and very rarely others) we switch to standard Higher Criticism instead (more stable numerically) and let you know with a message here.

Examples

set.seed(100)
Z_vec <- rnorm(5)
cor_Z <- matrix(data=0.2, nrow=5, ncol=5)
diag(cor_Z) <- 1
GHC(test_stats=Z_vec, cor_mat=cor_Z)

GHC_objective.R

Description

Internal function to calculate the value of the kth GHC objective (possibly minus an offset) when the kth p-value order statistic is x, for a set of size d.

Usage

GHC_objective(x, k, d, offset, pairwise_cors)

Arguments

x

The p-value of the kth p-value order statistic.

k

Which objective to use.

d

The size of the set.

offset

Used to zero the correct value when we put this into uniroot

pairwise_cors

A vector of all d(d-1)/2 pairwise correlations between the test statistics, where d is total number of test statistics in the set.

Value

The value of the kth HC objective

Examples

GHC_objective(x=0.1, k=2, d=5, offset=0, pairwise_cors=rep(0.2,10))

HC.R

Description

Calculate the Higher Criticism test statistic and p-value.

Usage

HC(test_stats, cor_mat = NULL, pairwise_cors = NULL)

Arguments

test_stats

Vector of test statistics for each factor in the set (i.e. marginal test statistic for each SNP in a gene).

cor_mat

d*d matrix of the correlations between all the test statistics in the set, where d is the total number of test statistics in the set. You only need to specify EITHER cor_mat OR pairwise_cors.

pairwise_cors

A vector of all d(d-1)/2 pairwise correlations between the test statistics. You only need to specify EITHER cor_mat OR pairwise_cors.

Value

A list with the elements:

HC

The observed Higher Criticism test statistic.

HC_pvalue

The p-value of this observed value, given the size of the set and correlation structure.

Examples

# Should return statistic = 2.067475 and p_value = 0.2755146
set.seed(100)
Z_vec <- rnorm(5) + rep(1,5)
cor_Z <- matrix(data=0.2, nrow=5, ncol=5)
diag(cor_Z) <- 1
HC(test_stats=Z_vec, cor_mat=cor_Z)

omni_individual.R

Description

Computes the omnibus test statistic combining GBJ, GHC, minP, and SKAT. This version of the function assumes you have the individual factor data (i.e. genotypes) for each subject. If you only have summary statistics, use omni_ss(). You WILL NOT be able to use this function unless you have also loaded the SKAT package (install.packages("SKAT"); library(SKAT)).

Usage

OMNI_individual(null_model, factor_matrix, link_function, num_boots = 100)

Arguments

null_model

An R regression model fitted using glm(). Do not use lm(), even for linear regression!

factor_matrix

An n*d matrix with each factor (i.e. each SNP) as one column. There should be no missing data.

link_function

Either "linear" or "logit" or "log".

num_boots

Number of bootstrap repetitions to find correlation matrix of set-based statistics.

Value

A list with the elements:

OMNI

The observed omnibus test statistic.

OMNI_pvalue

The p-value of the OMNI test

err_code

Sometimes if your p-value is very small (< 1*10^(-10)), R may run into numerical issues. This message will alert you if such a situation occurs.

Examples

factor_matrix <- matrix(data=rbinom(n=1000, size=2, prob=0.3), ncol=5)
Y <- rnorm(n=200)
null_mod <- glm(Y ~ 1)
OMNI_individual(null_model=null_mod, factor_matrix=factor_matrix,
link_function='linear', num_boots=5)

omni_ss.R

Description

Computes the omnibus test statistic combining GBJ, GHC, minP, and SKAT. This version of the function assumes you are using GWAS summary statistics. If you individual-level genotype data, use omni_individual().

Usage

OMNI_ss(test_stats, cor_mat, num_boots = 100)

Arguments

test_stats

Vector of test statistics for each factor in the set (i.e. marginal test statistic for each SNP in a gene)

cor_mat

d*d matrix of the correlations between all the test statistics in the set, where d is the total number of test statistics in the set.

num_boots

Number of bootstrap repetitions to find correlation matrix of set-based statistics.

Value

A list with the elements:

OMNI

The observed omnibus test statistic.

OMNI_pvalue

The p-value of the OMNI test

err_code

Sometimes if your p-value is very small (< 1*10^(-10)), R may run into numerical issues. This message will alert you if such a situation occurs.

Examples

cor_mat <- matrix(data=0.3, nrow=5, ncol=5)
diag(cor_mat) <- 1
test_stats <- as.numeric(mvtnorm::rmvnorm(n=1, sigma=cor_mat))
OMNI_ss(test_stats=test_stats, cor_mat=cor_mat, num_boots=5)

calc_score_stats.R

Description

Starting with individual-level data on p factors, generate score test statistics for each factor for input into GBJ/GHC/HC/BJ/minP. Also get the correlations between these test statistics. Designed to be used with linear or logistic or log-linear regression null models.

Usage

calc_score_stats(null_model, factor_matrix, link_function, P_mat = NULL)

Arguments

null_model

An R regression model fitted using glm(). Do not use lm(), even for linear regression!

factor_matrix

An n*p matrix with each factor as one column. There should be no missing data.

link_function

Either "linear" or "logit" or "log"

P_mat

The projection matrix used in calculation may be passed in to speed up the calculation. See paper for details. Default is null.

Value

A list with the elements:

test_stats

The p score test statistics.

cor_mat

The p*p matrix giving the pairwise correlation of every two test statistics.

Examples

Y <- rbinom(n=100, size=1, prob=0.5)
null_mod <- glm(Y~1, family=binomial(link="logit"))
factor_mat <- matrix(data=rnorm(n=100*5), nrow=100)
calc_score_stats(null_mod, factor_mat, "logit")

calc_var_mu_nonzero.R

Description

Internal function to calculate variance of S(t) when the Z_1,...,Z_d have nonzero mean. See GBJ paper for more details. Vectorized in t.

Usage

calc_var_nonzero_mu(d, t, mu, pairwise_cors)

Arguments

d

The number of test statistics in the set.

t

The threshold which determines the properties of S(t).

mu

The common mean of the test statistics Z_1,...,Z_d

pairwise_cors

A vector of all d(d-1)/2 pairwise correlations between the test statistics, where d is total number of test statistics in the set.

Value

The variance of S(t) when the Z_1,...,Z_d have nonzero mean mu.

Examples

calc_var_nonzero_mu(d=5, t=1, mu=1, pairwise_cors=rep(0.3, 10))

ebb_loglik.R

Description

An internal function only (doesn't do error checking or take care of boundary cases). The log-likelihood (log-PMF) calculation for the Extended Beta-Binomial proposed by Prentice (1986). Takes in a vectorized argument because we apply() it in run_BB_GBJ().

Usage

ebb_loglik(x, d)

Arguments

x

A vector of length 3 with (1) value of outcome (2) mu parameter (3) gamma parameter

d

The number of test stsatistics in the set

Value

log( Pr(V=x[1]) ) where V~EBB(mu,gamma; d)

Examples

ebb_loglik(x=c(1, 0.5, 0.1), d=10)	

estimate_ss_cor.R

Description

Estimate the correlations between GWAS summary statistics using reference panel eigenvectors and reference panel genotypes.

Usage

estimate_ss_cor(ref_pcs, ref_genotypes, link_function)

Arguments

ref_pcs

An n*m matrix containing PCs calculated from the reference panel. Here n is the number of subjects in the reference panel and m is roughly the number of PCs used in the original analysis which produced the summary statistics.

ref_genotypes

An n*d matrix holding the genotypes from the reference panel, where the d columns correspond to the d SNPs for which we have summary statistics. No missing data allowed.

link_function

Either "linear" or "logit" or "log".

Value

A list with the elements:

cor_mat

The d*d matrix giving the pairwise correlation of every two test statistics.

Examples

ref_pcs <- matrix(data=runif(n=1000, min=-0.2, max=0.2), ncol=5)
ref_genotypes <- matrix(data=rbinom(n=2000, size=2, prob=0.3), ncol=10)
estimate_ss_cor(ref_pcs=ref_pcs, ref_genotypes=ref_genotypes, link_function="linear")

Simulated Principal Components for 'GBR' population in the 1000 Genomes Project.

Description

A dataset containing 5 simulated Principal Components (PCs) for each of 91 subjects from the 'GBR' population in the 1000 Genomes Projects. These would normally be used as covariates in a regression model to control for population stratification.

Usage

data(gbr_pcs)

Format

A matrix with 91 rows (one for each subject) and 5 columns (one for each PC)


herm_poly_diff_t.R

Description

Internal function to add up the infinite sum term in calculating the variance of S(t) when the test statistics Z_1,...,Z_d have a non-zero mean. See GBJ paper for more details.

Usage

herm_poly_diff_t(t1, t2, pairwise_cors)

Arguments

t1

t1 and t2 are interchangeable. One should be t-mu and one should be -t-mu.

t2

t1 and t2 are interchangeable. One should be t-mu and one should be -t-mu.

pairwise_cors

A vector of all d(d-1)/2 pairwise correlations between the test statistics, where d is total number of test statistics in the set.

Value

The result of the infinite sum.

Examples

herm_poly_diff_t(t1=0, t2=0, pairwise_cors=rep(0.3, 10))

minP.R

Description

Given a vector of individual test statistics and their pairwise correlations, calculate the MinimumP (see Conneely and Boehnke, 2007) second-level test statistic and it's p-value.

Usage

minP(test_stats, cor_mat = NULL, pairwise_cors = NULL)

Arguments

test_stats

Vector of test statistics for each factor in the set (i.e. marginal test statistic for each SNP in a gene)

cor_mat

d*d matrix of the correlations between all the test statistics in the set, where d is the total number of test statistics in the set. You only need to specify EITHER cor_mat OR pairwise_cors.

pairwise_cors

A vector of all d(d-1)/2 pairwise correlations between the test statistics. You only need to specify EITHER cor_mat OR pairwise_cors.

Value

A list with the elements:

minP

The observed MinimumP test statistic.

minP_pvalue

The p-value of this observed value, given the size of the set and correlation structure.

Examples

# Should return statistic = 0.05918928 and p_value = 0.2525972.
set.seed(100)
Z_vec <- rnorm(5) + rep(1,5)
cor_Z <- matrix(data=0.2, nrow=5, ncol=5)
diag(cor_Z) <- 1
minP(test_stats=Z_vec, cor_mat=cor_Z)

parse_input.R

Description

Internal function to accept the input as unsorted Z-statistics and either a matrix or vector of correlations, and return the t_vec and pairwise_cors. Also do limited error checking.

Usage

parse_input(test_stats, cor_mat = NULL, pairwise_cors = NULL)

Arguments

test_stats

All test statistics in the set

cor_mat

The correlation matrix of the test statistics

pairwise_cors

The vector of all pairwise correlations

Value

A list with the elements:

t_vec

The sorted magnitudes of test statistics

pairwise_cors

The vector of all pairwise correlations

Examples

parse_input(test_stats=rnorm(5), pairwise_cors=rep(0.3,10))

qnorm_mu.R

Description

Internal function to calculate Pr(-t < Z < t) - kkk/d for Z~N(mu,1). Take the root of this function to find the mu such that P(|Z|>=t_k) = k/d.

Usage

qnorm_mu(mu, t, kkk, d)

Arguments

mu

The mean of the normal random variable.

t

The threshold (boundaries) we are interested in.

kkk

We decided to make you input the fraction in two parts.

d

We decided to make you input the fraction in two parts.

Value

Pr(-t < Z < t) - kkk/d for Z~N(mu,1).

Examples

qnorm_mu(mu=0, t=1.96, kkk=1, d=5)		# Should return 0

score_stats_only.R

Description

Starting with individual-level data on p factors, generate score test statistics for each factor for input into GBJ/GHC/HC/BJ/minP. DOES NOT get the correlations (assumed known).

Usage

score_stats_only(null_model, factor_matrix, link_function, P_mat = NULL)

Arguments

null_model

An R regression model fitted using glm(). Do not use lm(), even for linear regression!

factor_matrix

An n*d matrix with each factor as one column. There should be no missing data.

link_function

Either "linear" or "logit" or "log".

P_mat

The projection matrix used in calculation may be passed in to speed up the calculation. See paper for details. Default is null.

Value

The d score test statistics.

Examples

Y <- rbinom(n=100, size=1, prob=0.5)
null_mod <- glm(Y~1, family=binomial(link="logit"))
factor_matrix <- matrix(data=rnorm(n=100*5), nrow=100)
score_stats_only(null_mod, factor_matrix, "logit")

surv.R

Description

Survival (1 minus the CDF) function of standard normal random variable.

Usage

surv(x)

Arguments

x

Vector of quantiles

Value

Probability that a standard normal random variable is greater than x.

Examples

surv(0)		# Should return 0.5