Type: | Package |
Title: | Irreproducible Discovery Rate |
Version: | 1.3 |
Date: | 2022-06-20 |
Author: | Qunhua Li |
Maintainer: | Qunhua Li <qunhua.li@psu.edu> |
Description: | This is a package for estimating the copula mixture model and plotting correspondence curves. Details are in "Measuring reproducibility of high-throughput experiments" (2011), Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779, by Li, Brown, Huang, and Bickel. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2.0)] |
Imports: | stats |
LazyLoad: | yes |
Packaged: | 2022-06-21 02:57:35 UTC; Qunhua Li |
NeedsCompilation: | no |
Repository: | CRAN |
Date/Publication: | 2022-06-21 07:30:07 UTC |
Irreproducible discovery rate
Description
This package estimates the reproducibility of observations on a pair of replicate rank lists. It consists of three components: (1) plotting the correspondence curve to visualize reproducibility, (2) quantifying reproducibility using a copula mixture model and estimating the posterior probability for each obsrvation to be irreproducible (local irreproducible discovery rate), and (3) ranking and selecting observations by their irreproducibility.
Details
Package: | idr |
Type: | Package |
Version: | 1.2 |
Date: | 2012-10-26 |
Updates: | Improve the convergence of est.IDR (2014-08-15) |
License: | GPL-2 |
LazyLoad: | yes |
The main functions are est.IDR(), get.correspondence() and select.IDR(). est.IDR estimates the copula mixture model and the posterior probability for each observation to be irreproducible. get.correspondence generates the values for plotting the correspondence curve. select.IDR ranks obervations by their reproducibility and reports the number of observations passing the specified IDR thresholds.
Author(s)
Qunhua Li
Maintainer: Qunhua Li <qunhua.li@gmail.com>
References
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
Examples
data("simu.idr")
x <- cbind(-simu.idr$x, -simu.idr$y)
mu <- 2.6
sigma <- 1.3
rho <- 0.8
p <- 0.7
idr.out <- est.IDR(x, mu, sigma, rho, p, eps=0.001, max.ite=20)
names(idr.out)
Compute the value of Psi and Psi' for a given t
Description
An internal function to compute Psi for a given (t, v)
Usage
comp.uri(tv, x)
Arguments
tv |
A vector of two numeric values, t and v. Both t and v are in [0,1] |
x |
A numeric vector of x, sorted by the order of y |
Value
A numeric value of Psi(t, v)
Author(s)
Qunhua Li
References
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
See Also
get.uri.2d, get.correspondence
Examples
x <- seq(1, 10, by=1)
tv <- c(0.1, 0.5)
# opposite order
y1 <- seq(10, 1, by=-1)
o1 <- order(y1)
x.o1 <- x[o1]
comp.uri(tv, x.o1)
# same order
y2 <- seq(1, 10, by=1)
o2 <- order(y2)
x.o2 <- x[o2]
comp.uri(tv, x.o2)
Log density of bivariate Gaussian distribution with symmetric marginals
Description
Compute the log-density for parameterized bivariate Gaussian distribution N(mu, mu, sigma, sigma, rho).
Usage
d.binormal(z.1, z.2, mu, sigma, rho)
Arguments
z.1 |
a numerical data vector on coordinate 1. |
z.2 |
a numerical data vector on coordinate 1. |
mu |
mean |
sigma |
standard deviation |
rho |
correlation coefficient |
Value
Log density of bivariate Gaussian distribution N(mu, mu, sigma, sigma, rho).
Author(s)
Qunhua Li
References
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
Examples
z.1 <- rnorm(500, 3, 1)
rho <- 0.8
## The component with higher values is correlated with correlation coefficient=0.8
z.2 <- rnorm(500, 3 + 0.8*(z.1-3), (1-rho^2))
mu <- 3
sigma <- 1
den.z <- d.binormal(z.1, z.2, mu, sigma, rho)
den.z
E-step for parameterized bivariate 2-component Gaussian mixture models
Description
Expectation step in the EM algorithm for parameterized bivariate 2-component Gaussian mixture models with (1-p)N(0, 0, 1, 1, 0) + pN(mu, mu, sigma, sigma, rho).
Usage
e.step.2normal(z.1, z.2, mu, sigma, rho, p)
Arguments
z.1 |
a numerical data vector on coordinate 1. |
z.2 |
a numerical data vector on coordinate 2. |
mu |
mean for the reproducible component. |
sigma |
standard deviation of the reproducible component. |
rho |
correlation coefficient of the reproducible component. |
p |
mixing proportion of the reproducible component. |
Value
e.z |
a numeric vector, where each entry represents the estimated expected conditional probability that an observation is in the reproducible component. |
Author(s)
Qunhua Li
References
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
See Also
m.step.2normal
, loglik.2binormal
, est.IDR
Examples
z.1 <- c(rnorm(500, 0, 1), rnorm(500, 3, 1))
rho <- 0.8
## The component with higher values is correlated with correlation coefficient=0.8
z.2 <- c(rnorm(500, 0, 1), rnorm(500, 3 + 0.8*(z.1[501:1000]-3), (1-rho^2)))
## Starting values
mu0 <- 3
sigma0 <- 1
rho0 <- 0.85
p0 <- 0.55
e.z <- e.step.2normal(z.1, z.2, mu0, sigma0, rho0, p0)
Estimate the irreproducible discovery rate using the copula mixture model
Description
Fit a Gaussian copula mixture model.
Usage
est.IDR(x, mu, sigma, rho, p, eps=0.001, max.ite=30)
Arguments
x |
an n by m numeric matrix, where m= num of replicates, n=num of observations. Numerical values representing the significance of the observations. Note that significant signals are expected to have large values of x. In case that smaller values represent higher significance (e.g. p-value), a monotonic transformation needs to be applied to reverse the order before using this function, for example, -log(p-value). Currently, m=2. |
mu |
a starting value for the mean of the reproducible component. |
sigma |
a starting value for the standard deviation of the reproducible component. |
rho |
a starting value for the correlation coefficient of the reproducible component. |
p |
a starting value for the proportion of reproducible component. |
eps |
Stopping criterion. Iterations stop when the increment of log-likelihood is < eps*log-likelihood, Default=0.001. |
max.ite |
Maximum number of iterations. Default=30. |
Value
para |
estimated parameters: p, rho, mu, sigma. |
idr |
a numeric vector of the local idr for each observation (i.e. estimated conditional probablility for each observation to belong to the irreproducible component. |
IDR |
a numerical vector of the expected irreproducible discovery rate for observations that are as irreproducible or more irreproducible than the given observations. |
loglik |
log-likelihood at the end of iterations. |
loglik.trace |
trajectory of log-likelihood. |
Author(s)
Qunhua Li
References
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
Examples
data("simu.idr")
# simu.idr$x and simu.idr$y are p-values
# Transfer them such that large values represent significant ones
x <- cbind(-simu.idr$x, -simu.idr$y)
mu <- 2.6
sigma <- 1.3
rho <- 0.8
p <- 0.7
idr.out <- est.IDR(x, mu, sigma, rho, p, eps=0.001, max.ite=20)
names(idr.out)
Compute correspondence profiles
Description
Compute the correspondence profiles (Psi and Psi') and the corresponding smoothed curve using spline
Usage
get.correspondence(x1, x2, t, spline.df = NULL)
Arguments
x1 |
Data values or ranks of the data values on list 1, a vector of numeric values. Large values need to be significant signals. If small values represent significant signals, rank the signals reversely (e.g. by ranking negative values) and use the rank as x1. |
x2 |
Data values or ranks of the data values on list 2, a vector of numeric values. Large values need to be significant signals. If small values represent significant signals, rank the signals reversely (e.g. by ranking negative values) and use the rank as x1. |
t |
A numeric vector between 0 and 1 in ascending order. t is the right-tail percentage. |
spline.df |
Degree of freedom for spline, to control the smoothness of the smoothed curve. |
Value
psi |
the correspondence profile in terms of the scale of percentage, i.e. between (0, 1) |
dpsi |
the derivative of the correspondence profile in terms of the scale of percentage, i.e. between (0, 1) |
psi.n |
the correspondence profile in terms of the scale of the number of observations |
dpsi.n |
the derivative of the correspondence profile in terms of the scale of the number of observations |
Each object above is a list consisting of the following items: t: upper percentage (for psi and dpsi) or number of top ranked observations (for psi.n and dpsi.n) value: psi or dpsi smoothed.line: smoothing spline ntotal: the number of observations jump.point: the index of the vector of t such that psi(t[jump.point]) jumps up due to ties at the low values. This only happends when data consists of a large number of discrete values, e.g. values imputed for observations appearing on only one replicate.
Author(s)
Qunhua Li
References
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
Examples
# salmon data
data(salmon)
# get.correspondence() needs the observations with high ranks have
# high correlation and the observations with low ranks have low correlation.
# In this dataset, small values have high correlation and large values
# have low correlation.
# Ranking negative values makes the data follow the structure required
# by get.correspondence().
# There are 28 observations in this data set.
rank.x <- rank(-salmon$spawners)
rank.y <- rank(-salmon$recruits)
uv <- get.correspondence(rank.x, rank.y, seq(0.01, 0.99, by=1/28))
# plot correspondence curve on the scale of percentage
plot(uv$psi$t, uv$psi$value, xlab="t", ylab="psi", xlim=c(0, max(uv$psi$t)),
ylim=c(0, max(uv$psi$value)), cex.lab=2)
lines(uv$psi$smoothed.line, lwd=4)
abline(coef=c(0,1), lty=3)
# plot the derivative of correspondence curve on the scale of percentage
plot(uv$dpsi$t, uv$dpsi$value, xlab="t", ylab="psi'", xlim=c(0, max(uv$dpsi$t)),
ylim=c(0, max(uv$dpsi$value)), cex.lab=2)
lines(uv$dpsi$smoothed.line, lwd=4)
abline(h=1, lty=3)
# plot correspondence curve on the scale of the number of observations
plot(uv$psi.n$t, uv$psi.n$value, xlab="t", ylab="psi", xlim=c(0, max(uv$psi.n$t)),
ylim=c(0, max(uv$psi.n$value)), cex.lab=2)
lines(uv$psi.n$smoothed.line, lwd=4)
abline(coef=c(0,1), lty=3)
# plot the derivative of correspondence curve on the scale of the number
# of observations
plot(uv$dpsi.n$t, uv$dpsi.n$value, xlab="t", ylab="psi'", xlim=c(0, max(uv$dpsi.n$t)),
ylim=c(0, max(uv$dpsi.n$value)), cex.lab=2)
lines(uv$dpsi.n$smoothed.line, lwd=4)
abline(h=1, lty=3)
# If the rank lists consist of a large number of ties at the bottom
# (e.g. the same low value is imputed to the list for the observations
# that appear on only one list), it may be desirable to plot only
# observations before hitting the ties. Then it can be plotted using the
# following
plot(uv$psi$t[1:uv$psi$jump.point], uv$psi$value[1:uv$psi$jump.point], xlab="t",
ylab="psi", xlim=c(0, max(uv$psi$t[1:uv$psi$jump.point])),
ylim=c(0, max(uv$psi$value[1:uv$psi$jump.point])), cex.lab=2)
lines(uv$psi$smoothed.line, lwd=4)
abline(coef=c(0,1), lty=3)
Compute the pseudo values of a mixture model from the empirical CDF
Description
Compute the pseudo values of a mixture model from the empirical CDF
Usage
get.pseudo.mix(x, mu, sigma, rho, p)
Arguments
x |
A vector of values of empirical CDF |
mu |
Mean of the reproducible component in the mixture model on the latent space |
sigma |
Standard deviation of the reproducible component in the mixture model on the latent space |
rho |
Correlation coefficient of the reproducible component in the mixture model on the latent space |
p |
Mixing proportion of the reproducible component in the mixture model on the latent space |
Value
The values of a mixture model corresponding to the empirical CDF
Author(s)
Qunhua Li
References
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
Examples
x <- seq(0.1, 0.9, by=0.1)
mu <- 2.6
sigma <- 1.3
rho <- 0.8
p <- 0.7
pseudo.x <- get.pseudo.mix(x, mu, sigma, rho, p)
pseudo.x
Compute the correspondence profile (Psi(t, v)) and the derivative of the correspondence profile (Psi'(t, v))
Description
An internal function to compute the correspondence profile (Psi(t, v)) and the derivative of the correspondence profile (Psi'(t, v)) for t and v between 0 and 1
Usage
get.uri.2d(x1, x2, tt, vv, spline.df = NULL)
Arguments
x1 |
|
x2 |
|
tt |
A vector of right-tail percentages for x1. It is a numeric vector between 0 and 1, e.g. seq(0.01, 0.99, by=0.01) |
vv |
A vector of right-tail percentages for x2. It is a numeric vector between 0 and 1, e.g. seq(0.01, 0.99, by=0.01) |
spline.df |
The degree of freedom to control the smoothness of the spline to be superimposed on the points |
Author(s)
Qunhua Li
References
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
See Also
comp.uri, get.correspondence
Examples
# salmon data from Kallenberg and Ledwina, 1999
data(salmon)
# get.correspondence() needs large ranks have high correlation and
# small ranks have low correlation. In this dataset, small values
# have high correlation and large values have low correlation.
# Ranking negative values makes the data follow the structure required
# by get.correspondence().
# There are 28 observations in this data set.
rank.x <- rank(-salmon$spawners)
rank.y <- rank(-salmon$recruits)
t <- seq(0.01, 0.99, by=1/28)
psi.dpsi <- get.uri.2d(rank.x, rank.y, t, t, spline.df=6.4)
Compute log-likelihood of parameterized bivariate 2-component Gaussian mixture models
Description
Compute the log-likelihood for parameterized bivariate 2-component Gaussian mixture models with (1-p)N(0, 0, 1, 1, 0) + pN(mu, mu, sigma, sigma, rho).
Usage
loglik.2binormal(z.1, z.2, mu, sigma, rho, p)
Arguments
z.1 |
a numerical data vector on coordinate 1. |
z.2 |
a numerical data vector on coordinate 1. |
mu |
mean for the reproducible component. |
sigma |
standard deviation of the reproducible component. |
rho |
correlation coefficient of the reproducible component. |
p |
mixing proportion of the reproducible component. |
Value
Log-likelihood of the bivariate 2-component Gaussian mixture models (1-p)N(0, 0, 1, 1, 0) + N(mu, mu, sigma, sigma, rho)$.
Author(s)
Qunhua Li
References
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
See Also
m.step.2normal
, e.step.normal
, est.IDR
Examples
z.1 <- c(rnorm(500, 0, 1), rnorm(500, 3, 1))
rho <- 0.8
## The component with higher values is correlated with correlation coefficient=0.8
z.2 <- c(rnorm(500, 0, 1), rnorm(500, 3 + 0.8*(z.1[501:1000]-3), (1-rho^2)))
## Starting values
mu <- 3
sigma <- 1
rho <- 0.85
p <- 0.55
## The function is currently defined as
loglik <- loglik.2binormal(z.1, z.2, mu, sigma, rho, p)
loglik
M-step for parameterized bivariate 2-component Gaussian mixture models
Description
Maximization step in the EM algorithm for parameterized bivariate 2-component Gaussian mixture models with $(1-p)N(0, 0, 1, 1, 0) + pN(mu, mu, sigma^2, sigma^2, rho)$.
Usage
m.step.2normal(z.1, z.2, e.z)
Arguments
z.1 |
a numerical data vector on coordinate 1. |
z.2 |
a numerical data vector on coordinate 2. |
e.z |
a vector of expected conditional probability that the $i$th observation is reproducible. |
Details
This function is used in the EM algorithm for estimating the parameters of the Gaussian mixture model at the latent copula space.
Value
Estimated parameters, basically a list including elements
p |
the estimated mixing proportion of the reproducible component. |
mu |
the estimated mean for the reproducible component. |
sigma |
the estimated standard deviation of the reproducible component. |
rho |
the estimated correlation coefficient of the reproducible component. |
Author(s)
Qunhua Li
References
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
See Also
e.step.2normal
, loglik.2binormal
, est.IDR
Examples
##---- Should be DIRECTLY executable !! ----
##-- ==> Define data, use random,
##-- or do help(data=index) for the standard data sets.
z.1 <- c(rnorm(500, 0, 1), rnorm(500, 3, 1))
rho <- 0.8
##The component with higher values is correlated with correlation coefficient=0.8
z.2 <- c(rnorm(500, 0, 1), rnorm(500, 3 + 0.8*(z.1[501:1000]-3), (1-rho^2)))
e.z <- c(rep(0, 500) + abs(rnorm(0, 0.05)), rep(1, 500)-abs(rnorm(0, 0.05)))
para <- m.step.2normal(z.1, z.2, e.z)
para
Salmon data
Description
This data is from Simonoff (1990, p 161). It concerns the size of the annual spawning stock and its production of new catchable-sized fish for 1940 through 1967 for the Skeena river sockeye salmon stock (in thousands of fish). It has three columns, year, spawners and recruits. It was speculated to consist of two different populations.
Usage
data(salmon)
Format
A data frame with 28 observations on the following 3 variables.
year
a numeric vector of the year
spawners
a numeric vector of the annual spawning stock
recruits
a numeric vector of the production of new catchable-sized fish
Source
Data is salmon.dat in Simonoff (1990). It can be downloaded from the book's website.
References
Simonoff, J. S. (1990), Smoothing Methods in Statistics, New York: Spinger-Verlag.
Examples
data(salmon)
plot(rank(salmon$spawners), rank(salmon$recruits))
Select observations according to IDR
Description
Select observations that exceeding a given IDR level
Usage
select.IDR(x, IDR.x, IDR.level)
Arguments
x |
a n by m numeric matrix, where m= num of replicates, n=num of observations. Numerical values representing the significance of the observations, where larger values represent higher significance, for example, -log(p-value). Currently, m=2. |
IDR.x |
Irreproducibile discovery rate for each entry of x. It is computed from est.IDR(). |
IDR.level |
IDR cutoff, a numerical value between [0,1]. |
Value
x |
Observations that are selected. |
n |
Number of observations that are selected. |
IDR.level |
IDR cutoff, a numerical value between [0,1]. |
Author(s)
Qunhua Li
References
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
See Also
Examples
data("simu.idr")
x <- cbind(-simu.idr$x, -simu.idr$y)
mu <- 2.6
sigma <- 1.3
rho <- 0.8
p <- 0.7
idr.out <- est.IDR(x, mu, sigma, rho, p, eps=0.001, max.ite=20)
# select observations exceeding IDR threshold=0.01
IDR.level <- 0.01
x.selected <- select.IDR(x, idr.out$IDR, IDR.level)
Simulated data
Description
This is a simulated dataset for testing the program. Data is first simulated from the copula mixture model with latent structure of 0.65 N(mu, mu, sigma, sigma, rho) + 0.95 N(0, 0, 1, 1, 0), where mu=2.5, sigma=1, rho=0.84. The observations in the dataset are then generated by taking the p-values from a z-test H0: mu=0.
Usage
data(simu.idr)
Format
A data frame with 1000 observations on the following 3 variables.
x
a numeric vector, representing p-values on replicate 1
y
a numeric vector, representing p-values on replicate 2
labels
a binary vector, where 1 represents the reproducible component and 0 represents the irreproducible component.
References
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
Examples
data(simu.idr)
plot(rank(simu.idr$x), rank(simu.idr$y))