Type: Package
Title: Irreproducible Discovery Rate
Version: 1.3
Date: 2022-06-20
Author: Qunhua Li
Maintainer: Qunhua Li <qunhua.li@psu.edu>
Description: This is a package for estimating the copula mixture model and plotting correspondence curves. Details are in "Measuring reproducibility of high-throughput experiments" (2011), Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779, by Li, Brown, Huang, and Bickel.
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2.0)]
Imports: stats
LazyLoad: yes
Packaged: 2022-06-21 02:57:35 UTC; Qunhua Li
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2022-06-21 07:30:07 UTC

Irreproducible discovery rate

Description

This package estimates the reproducibility of observations on a pair of replicate rank lists. It consists of three components: (1) plotting the correspondence curve to visualize reproducibility, (2) quantifying reproducibility using a copula mixture model and estimating the posterior probability for each obsrvation to be irreproducible (local irreproducible discovery rate), and (3) ranking and selecting observations by their irreproducibility.

Details

Package: idr
Type: Package
Version: 1.2
Date: 2012-10-26
Updates: Improve the convergence of est.IDR (2014-08-15)
License: GPL-2
LazyLoad: yes

The main functions are est.IDR(), get.correspondence() and select.IDR(). est.IDR estimates the copula mixture model and the posterior probability for each observation to be irreproducible. get.correspondence generates the values for plotting the correspondence curve. select.IDR ranks obervations by their reproducibility and reports the number of observations passing the specified IDR thresholds.

Author(s)

Qunhua Li

Maintainer: Qunhua Li <qunhua.li@gmail.com>

References

Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.

Examples


data("simu.idr")
x <- cbind(-simu.idr$x, -simu.idr$y)

mu <- 2.6
sigma <- 1.3
rho <- 0.8
p <- 0.7

idr.out <- est.IDR(x, mu, sigma, rho, p, eps=0.001, max.ite=20)

names(idr.out)

Compute the value of Psi and Psi' for a given t

Description

An internal function to compute Psi for a given (t, v)

Usage

comp.uri(tv, x)

Arguments

tv

A vector of two numeric values, t and v. Both t and v are in [0,1]

x

A numeric vector of x, sorted by the order of y

Value

A numeric value of Psi(t, v)

Author(s)

Qunhua Li

References

Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.

See Also

get.uri.2d, get.correspondence

Examples

x <- seq(1, 10, by=1)
tv <- c(0.1, 0.5)

# opposite order
y1 <- seq(10, 1, by=-1)
o1 <- order(y1)
x.o1 <- x[o1]

comp.uri(tv, x.o1)

# same order
y2 <- seq(1, 10, by=1)
o2 <- order(y2)
x.o2 <- x[o2]

comp.uri(tv, x.o2)

Log density of bivariate Gaussian distribution with symmetric marginals

Description

Compute the log-density for parameterized bivariate Gaussian distribution N(mu, mu, sigma, sigma, rho).

Usage

d.binormal(z.1, z.2, mu, sigma, rho)

Arguments

z.1

a numerical data vector on coordinate 1.

z.2

a numerical data vector on coordinate 1.

mu

mean

sigma

standard deviation

rho

correlation coefficient

Value

Log density of bivariate Gaussian distribution N(mu, mu, sigma, sigma, rho).

Author(s)

Qunhua Li

References

Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.

Examples


z.1 <- rnorm(500, 3, 1)
rho <- 0.8

## The component with higher values is correlated with correlation coefficient=0.8 
z.2 <- rnorm(500, 3 + 0.8*(z.1-3), (1-rho^2))
mu <- 3
sigma <- 1
den.z <- d.binormal(z.1, z.2, mu, sigma, rho)

den.z

E-step for parameterized bivariate 2-component Gaussian mixture models

Description

Expectation step in the EM algorithm for parameterized bivariate 2-component Gaussian mixture models with (1-p)N(0, 0, 1, 1, 0) + pN(mu, mu, sigma, sigma, rho).

Usage

e.step.2normal(z.1, z.2, mu, sigma, rho, p)

Arguments

z.1

a numerical data vector on coordinate 1.

z.2

a numerical data vector on coordinate 2.

mu

mean for the reproducible component.

sigma

standard deviation of the reproducible component.

rho

correlation coefficient of the reproducible component.

p

mixing proportion of the reproducible component.

Value

e.z

a numeric vector, where each entry represents the estimated expected conditional probability that an observation is in the reproducible component.

Author(s)

Qunhua Li

References

Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.

See Also

m.step.2normal, loglik.2binormal, est.IDR

Examples


z.1 <- c(rnorm(500, 0, 1), rnorm(500, 3, 1))
rho <- 0.8

## The component with higher values is correlated with correlation coefficient=0.8 
z.2 <- c(rnorm(500, 0, 1), rnorm(500, 3 + 0.8*(z.1[501:1000]-3), (1-rho^2)))

## Starting values
mu0 <- 3
sigma0 <- 1
rho0 <- 0.85
p0 <- 0.55

e.z <- e.step.2normal(z.1, z.2, mu0, sigma0, rho0, p0)

Estimate the irreproducible discovery rate using the copula mixture model

Description

Fit a Gaussian copula mixture model.

Usage

est.IDR(x, mu, sigma, rho, p, eps=0.001, max.ite=30)

Arguments

x

an n by m numeric matrix, where m= num of replicates, n=num of observations. Numerical values representing the significance of the observations. Note that significant signals are expected to have large values of x. In case that smaller values represent higher significance (e.g. p-value), a monotonic transformation needs to be applied to reverse the order before using this function, for example, -log(p-value). Currently, m=2.

mu

a starting value for the mean of the reproducible component.

sigma

a starting value for the standard deviation of the reproducible component.

rho

a starting value for the correlation coefficient of the reproducible component.

p

a starting value for the proportion of reproducible component.

eps

Stopping criterion. Iterations stop when the increment of log-likelihood is < eps*log-likelihood, Default=0.001.

max.ite

Maximum number of iterations. Default=30.

Value

para

estimated parameters: p, rho, mu, sigma.

idr

a numeric vector of the local idr for each observation (i.e. estimated conditional probablility for each observation to belong to the irreproducible component.

IDR

a numerical vector of the expected irreproducible discovery rate for observations that are as irreproducible or more irreproducible than the given observations.

loglik

log-likelihood at the end of iterations.

loglik.trace

trajectory of log-likelihood.

Author(s)

Qunhua Li

References

Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.

Examples

data("simu.idr")

# simu.idr$x and simu.idr$y are p-values
# Transfer them such that large values represent significant ones 
x <- cbind(-simu.idr$x, -simu.idr$y)

mu <- 2.6
sigma <- 1.3
rho <- 0.8
p <- 0.7

idr.out <- est.IDR(x, mu, sigma, rho, p, eps=0.001, max.ite=20)

names(idr.out)

Compute correspondence profiles

Description

Compute the correspondence profiles (Psi and Psi') and the corresponding smoothed curve using spline

Usage

get.correspondence(x1, x2, t, spline.df = NULL)

Arguments

x1

Data values or ranks of the data values on list 1, a vector of numeric values. Large values need to be significant signals. If small values represent significant signals, rank the signals reversely (e.g. by ranking negative values) and use the rank as x1.

x2

Data values or ranks of the data values on list 2, a vector of numeric values. Large values need to be significant signals. If small values represent significant signals, rank the signals reversely (e.g. by ranking negative values) and use the rank as x1.

t

A numeric vector between 0 and 1 in ascending order. t is the right-tail percentage.

spline.df

Degree of freedom for spline, to control the smoothness of the smoothed curve.

Value

psi

the correspondence profile in terms of the scale of percentage, i.e. between (0, 1)

dpsi

the derivative of the correspondence profile in terms of the scale of percentage, i.e. between (0, 1)

psi.n

the correspondence profile in terms of the scale of the number of observations

dpsi.n

the derivative of the correspondence profile in terms of the scale of the number of observations

Each object above is a list consisting of the following items: t: upper percentage (for psi and dpsi) or number of top ranked observations (for psi.n and dpsi.n) value: psi or dpsi smoothed.line: smoothing spline ntotal: the number of observations jump.point: the index of the vector of t such that psi(t[jump.point]) jumps up due to ties at the low values. This only happends when data consists of a large number of discrete values, e.g. values imputed for observations appearing on only one replicate.

Author(s)

Qunhua Li

References

Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.

Examples


# salmon data 
data(salmon)

# get.correspondence() needs the observations with high ranks have 
# high correlation and the observations with low ranks have low correlation. 
# In this dataset, small values have high correlation and large values 
# have low correlation.
# Ranking negative values makes the data follow the structure required
# by get.correspondence().
# There are 28 observations in this data set.

rank.x <- rank(-salmon$spawners)
rank.y <- rank(-salmon$recruits)
uv <- get.correspondence(rank.x, rank.y, seq(0.01, 0.99, by=1/28)) 


# plot correspondence curve on the scale of percentage
plot(uv$psi$t, uv$psi$value, xlab="t", ylab="psi", xlim=c(0, max(uv$psi$t)), 
ylim=c(0, max(uv$psi$value)), cex.lab=2)
lines(uv$psi$smoothed.line, lwd=4)
abline(coef=c(0,1), lty=3)

# plot the derivative of correspondence curve on the scale of percentage
plot(uv$dpsi$t, uv$dpsi$value, xlab="t", ylab="psi'", xlim=c(0, max(uv$dpsi$t)),
ylim=c(0, max(uv$dpsi$value)), cex.lab=2)
lines(uv$dpsi$smoothed.line, lwd=4)
abline(h=1, lty=3)

# plot correspondence curve on the scale of the number of observations
plot(uv$psi.n$t, uv$psi.n$value, xlab="t", ylab="psi", xlim=c(0, max(uv$psi.n$t)), 
ylim=c(0, max(uv$psi.n$value)), cex.lab=2)
lines(uv$psi.n$smoothed.line, lwd=4)
abline(coef=c(0,1), lty=3)

# plot the derivative of correspondence curve on the scale of the number
# of observations
plot(uv$dpsi.n$t, uv$dpsi.n$value, xlab="t", ylab="psi'", xlim=c(0, max(uv$dpsi.n$t)), 
ylim=c(0, max(uv$dpsi.n$value)), cex.lab=2)
lines(uv$dpsi.n$smoothed.line, lwd=4)
abline(h=1, lty=3)

# If the rank lists consist of a large number of ties at the bottom
# (e.g. the same low value is imputed to the list for the observations
# that appear on only one list), it may be desirable to plot only
# observations before hitting the ties. Then it can be plotted using the
# following
plot(uv$psi$t[1:uv$psi$jump.point], uv$psi$value[1:uv$psi$jump.point], xlab="t",
ylab="psi", xlim=c(0, max(uv$psi$t[1:uv$psi$jump.point])), 
ylim=c(0, max(uv$psi$value[1:uv$psi$jump.point])), cex.lab=2)
lines(uv$psi$smoothed.line, lwd=4)
abline(coef=c(0,1), lty=3)
 

Compute the pseudo values of a mixture model from the empirical CDF

Description

Compute the pseudo values of a mixture model from the empirical CDF

Usage

get.pseudo.mix(x, mu, sigma, rho, p)

Arguments

x

A vector of values of empirical CDF

mu

Mean of the reproducible component in the mixture model on the latent space

sigma

Standard deviation of the reproducible component in the mixture model on the latent space

rho

Correlation coefficient of the reproducible component in the mixture model on the latent space

p

Mixing proportion of the reproducible component in the mixture model on the latent space

Value

The values of a mixture model corresponding to the empirical CDF

Author(s)

Qunhua Li

References

Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.

Examples


x <- seq(0.1, 0.9, by=0.1)

mu <- 2.6
sigma <- 1.3
rho <- 0.8
p <- 0.7

pseudo.x <- get.pseudo.mix(x, mu, sigma, rho, p)

pseudo.x

Compute the correspondence profile (Psi(t, v)) and the derivative of the correspondence profile (Psi'(t, v))

Description

An internal function to compute the correspondence profile (Psi(t, v)) and the derivative of the correspondence profile (Psi'(t, v)) for t and v between 0 and 1

Usage

get.uri.2d(x1, x2, tt, vv, spline.df = NULL)

Arguments

x1
x2
tt

A vector of right-tail percentages for x1. It is a numeric vector between 0 and 1, e.g. seq(0.01, 0.99, by=0.01)

vv

A vector of right-tail percentages for x2. It is a numeric vector between 0 and 1, e.g. seq(0.01, 0.99, by=0.01)

spline.df

The degree of freedom to control the smoothness of the spline to be superimposed on the points

Author(s)

Qunhua Li

References

Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.

See Also

comp.uri, get.correspondence

Examples


# salmon data from Kallenberg and Ledwina, 1999
data(salmon)

# get.correspondence() needs large ranks have high correlation and
# small ranks have low correlation. In this dataset, small values
# have high correlation and large values have low correlation.
# Ranking negative values makes the data follow the structure required
# by get.correspondence().
# There are 28 observations in this data set.

rank.x <- rank(-salmon$spawners)
rank.y <- rank(-salmon$recruits)
t <- seq(0.01, 0.99, by=1/28)
psi.dpsi <- get.uri.2d(rank.x, rank.y, t, t, spline.df=6.4)

Compute log-likelihood of parameterized bivariate 2-component Gaussian mixture models

Description

Compute the log-likelihood for parameterized bivariate 2-component Gaussian mixture models with (1-p)N(0, 0, 1, 1, 0) + pN(mu, mu, sigma, sigma, rho).

Usage

loglik.2binormal(z.1, z.2, mu, sigma, rho, p)

Arguments

z.1

a numerical data vector on coordinate 1.

z.2

a numerical data vector on coordinate 1.

mu

mean for the reproducible component.

sigma

standard deviation of the reproducible component.

rho

correlation coefficient of the reproducible component.

p

mixing proportion of the reproducible component.

Value

Log-likelihood of the bivariate 2-component Gaussian mixture models (1-p)N(0, 0, 1, 1, 0) + N(mu, mu, sigma, sigma, rho)$.

Author(s)

Qunhua Li

References

Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.

See Also

m.step.2normal, e.step.normal, est.IDR

Examples

z.1 <- c(rnorm(500, 0, 1), rnorm(500, 3, 1))
rho <- 0.8

## The component with higher values is correlated with correlation coefficient=0.8 
z.2 <- c(rnorm(500, 0, 1), rnorm(500, 3 + 0.8*(z.1[501:1000]-3), (1-rho^2)))

## Starting values
mu <- 3
sigma <- 1
rho <- 0.85
p <- 0.55

## The function is currently defined as
loglik <- loglik.2binormal(z.1, z.2, mu, sigma, rho, p) 

loglik

M-step for parameterized bivariate 2-component Gaussian mixture models

Description

Maximization step in the EM algorithm for parameterized bivariate 2-component Gaussian mixture models with $(1-p)N(0, 0, 1, 1, 0) + pN(mu, mu, sigma^2, sigma^2, rho)$.

Usage

m.step.2normal(z.1, z.2, e.z)

Arguments

z.1

a numerical data vector on coordinate 1.

z.2

a numerical data vector on coordinate 2.

e.z

a vector of expected conditional probability that the $i$th observation is reproducible.

Details

This function is used in the EM algorithm for estimating the parameters of the Gaussian mixture model at the latent copula space.

Value

Estimated parameters, basically a list including elements

p

the estimated mixing proportion of the reproducible component.

mu

the estimated mean for the reproducible component.

sigma

the estimated standard deviation of the reproducible component.

rho

the estimated correlation coefficient of the reproducible component.

Author(s)

Qunhua Li

References

Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.

See Also

e.step.2normal, loglik.2binormal, est.IDR

Examples

##---- Should be DIRECTLY executable !! ----
##-- ==>  Define data, use random,
##--	or do  help(data=index)  for the standard data sets.

 
z.1 <- c(rnorm(500, 0, 1), rnorm(500, 3, 1))
rho <- 0.8

##The component with higher values is correlated with correlation coefficient=0.8 
z.2 <- c(rnorm(500, 0, 1), rnorm(500, 3 + 0.8*(z.1[501:1000]-3), (1-rho^2)))
e.z <- c(rep(0, 500) + abs(rnorm(0, 0.05)), rep(1, 500)-abs(rnorm(0, 0.05)))

para <- m.step.2normal(z.1, z.2, e.z) 

para

Salmon data

Description

This data is from Simonoff (1990, p 161). It concerns the size of the annual spawning stock and its production of new catchable-sized fish for 1940 through 1967 for the Skeena river sockeye salmon stock (in thousands of fish). It has three columns, year, spawners and recruits. It was speculated to consist of two different populations.

Usage

data(salmon)

Format

A data frame with 28 observations on the following 3 variables.

year

a numeric vector of the year

spawners

a numeric vector of the annual spawning stock

recruits

a numeric vector of the production of new catchable-sized fish

Source

Data is salmon.dat in Simonoff (1990). It can be downloaded from the book's website.

References

Simonoff, J. S. (1990), Smoothing Methods in Statistics, New York: Spinger-Verlag.

Examples

data(salmon)
plot(rank(salmon$spawners), rank(salmon$recruits)) 

Select observations according to IDR

Description

Select observations that exceeding a given IDR level

Usage

select.IDR(x, IDR.x, IDR.level)

Arguments

x

a n by m numeric matrix, where m= num of replicates, n=num of observations. Numerical values representing the significance of the observations, where larger values represent higher significance, for example, -log(p-value). Currently, m=2.

IDR.x

Irreproducibile discovery rate for each entry of x. It is computed from est.IDR().

IDR.level

IDR cutoff, a numerical value between [0,1].

Value

x

Observations that are selected.

n

Number of observations that are selected.

IDR.level

IDR cutoff, a numerical value between [0,1].

Author(s)

Qunhua Li

References

Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.

See Also

est.IDR

Examples


data("simu.idr")
x <- cbind(-simu.idr$x, -simu.idr$y)

mu <- 2.6
sigma <- 1.3
rho <- 0.8
p <- 0.7

idr.out <- est.IDR(x, mu, sigma, rho, p, eps=0.001, max.ite=20)
# select observations exceeding IDR threshold=0.01 
IDR.level <- 0.01
x.selected <- select.IDR(x, idr.out$IDR, IDR.level)

Simulated data

Description

This is a simulated dataset for testing the program. Data is first simulated from the copula mixture model with latent structure of 0.65 N(mu, mu, sigma, sigma, rho) + 0.95 N(0, 0, 1, 1, 0), where mu=2.5, sigma=1, rho=0.84. The observations in the dataset are then generated by taking the p-values from a z-test H0: mu=0.

Usage

data(simu.idr)

Format

A data frame with 1000 observations on the following 3 variables.

x

a numeric vector, representing p-values on replicate 1

y

a numeric vector, representing p-values on replicate 2

labels

a binary vector, where 1 represents the reproducible component and 0 represents the irreproducible component.

References

Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.

Examples

data(simu.idr)
plot(rank(simu.idr$x), rank(simu.idr$y))