Help for package CpmERCCutoff

Type:

Package

Title:

Calculation of Log2 Counts per Million Cutoff from ERCC Controls

Version:

1.0.0

Description:

Implementation of the empirical method to derive log2 counts per million (CPM) cutoff to filter out lowly expressed genes using ERCC spike-ins as described in Goll and Bosinger et.al (2022)<doi:10.1101/2022.06.23.497396>. This package utilizes the synthetic mRNA control pairs developed by the External RNA Controls Consortium (ERCC) (ERCC 1 / ERCC 2) that are spiked into sample pairs at known ratios at various absolute abundances. The relationship between the observed and expected fold changes is then used to empirically determine an optimal log2 CPM cutoff for filtering out lowly expressed genes.

Depends:

R (≥ 3.6)

License:

GPL (≥ 3)

Encoding:

UTF-8

LazyData:

true

Imports:

stats, graphics

Suggests:

spelling, testthat (≥ 3.0.0)

Config/testthat/edition:

RoxygenNote:

7.2.0

NeedsCompilation:

Packaged:

2022-09-12 18:36:42 UTC; ksteenbergen

Language:

en-US

Author:

Tyler Grimes

[aut], Travis L. Jensen

[aut]

Maintainer:

Kristen Steenbergen <ksteenbergen@emmes.com>

Repository:

CRAN

Date/Publication:

2022-09-13 11:20:07 UTC

A data frame of expected ERCC1 and ERCC2 ratios

Description

A data frame that contains the expected spike in ERCC data. This data can be obtained from 'ERCC Controls Analysis' manual located on Thermo Fisher's ERCC RNA Spike-In Mix product [page](https://assets.thermofisher.com/TFS-Assets/LSG/manuals/cms_095046.txt). The 'exp_input' data frame mirrors the fields shown in the ERCC manual. For the LCPM cutoff calculation, the last column of the log2 expected fold change ratios are used. Ensure that this column is titled "expected_lfc_ratio". See the example code below for formatting the data.

Usage

exp_input

Format

A data frame with 92 rows and 4 columns: Each row represents an ERCC transcript. Columns are described below:

ercc_id: ERCC spike-in mRNA Ids (ERCC-00002 – ERCC-00171)
subgroup: ERCC subgroups (A – D)
ercc1_conc: ERCC1 concentration (0.014 – 30,000)
ercc2_conc: ERCC2 concentration (0.007 – 30,000)
expected_fc_ratio: Expected fold change ratio (.5 – 4)
expected_lfc_ratio: Expected log2 fold change ratio (-1 – 2)

Examples

# Order rows by ERCC ID and assign to rownames.
exp_input = exp_input[order(exp_input$ercc_id), ]
rownames(exp_input) = exp_input$ercc_id

Function to empirically determine a log2 CPM cutoff based on ERCC RNA spike-in

Description

This function uses spike-in ERCC data, known control RNA probes, and paired samples to fit a 3rd order polynomial to determine an expression cutoff that meets the specified correlation between expected and observed fold changes. The obs data frame used as input for the observed expression of the 92 ERCC RNA spike-ins and stores the coverage-normalized read log2 counts per million (LCPM) that mapped to the respective ERCC sequences. Typically, prior to LCPM calculation, the read count data is normalized for any systematic differences in read coverage between samples, for example, by using the TMM normalization method as implemented in the edgeR package.

For each bootstrap replicate, the paired samples are subsampled with replacement. The mean LCPM of each ERCC transcript is determined by first calculating the average LCPM value for each paired sample, and then taking the mean of those averages. The ERCC transcripts are sorted based on these means, and are then grouped into n.bins ERCC bins. Next, the Spearman correlation metric is used to calculate the association between the empirical and expected log fold change (LFC) of the ERCCs in each bin for each sample. Additionally, the average LCPM for the ERCCs in each bin are calculated for each sample. This leads to a pair of values - the average LCPM and the association value - for each sample and each ERCC bin. Outliers within each ERCC bin are identified and removed based on >1.5 IQR. A 3rd order polynomial is fit with the explanatory variable being the average LCPM and the response variable being the Spearman correlation value between expected and observed log2 fold changes. The fitted curve is used to identify the average LCPM value with a Spearman correlation of cor.value. The results are output as an "empLCM" object as described below. The summary.empLCPM function can be used to extract a summary of the results, and the plot.empLCPM function to plot the results for visualization.

Usage

getLowLcpmCutoff(
  obs,
  exp,
  pairs,
  n.bins = 7,
  rep = 1000,
  ci = 0.95,
  cor.value = 0.9,
  remove.outliers = TRUE,
  seed = 20220719
)

Arguments

obs

A data frame of observed spike-in ERCC data. Each row is an ERCC transcript, and each column is a sample. Data are read coverage-normalized log2 counts per million (LCPM).

exp

A data frame of expected ERCC Mix 1 and Mix 2 ratios with a column titled 'expected_lfc_ratio' containing the expected log2 fold-change ratios. This data can be obtained from 'ERCC Controls Analysis' manual located on Thermo Fisher's ERCC RNA Spike-In Mix product [page](https://assets.thermofisher.com/TFS-Assets/LSG/manuals/cms_095046.txt). The 'exp_input' data frame mirrors the fields shown in the ERCC manual. For the LCPM cutoff calculation, the last column containing the log2 expected fold change ratios are used. Ensure that this column is titled "expected_lfc_ratio". See the example code below for formatting the data. #

pairs

A 2-column data frame where each row indicates a sample pair with the first column indicating the sample that received ERCC spike-ins from Mix 1 and the second column indicating the sample receiving Mix 2.

n.bins

Integer. The number of abundance bins to create. Default is 7.

rep

Integer. The number of bootstrap replicates. Default is 1000.

ci

Numeric. The confidence interval. Default is 0.95.

cor.value

Numeric. The desired Spearman correlation between the empirical log2 fold change across the ERCC transcripts. Default is 0.9.

remove.outliers

If TRUE (default) outliers are identified as exceeding 1.5 IQR, and are removed prior to fitting the polynomial. Set to FALSE to keep all points.

seed

Integer. The reproducibility seed. Default is 20220719.

Value

An "empLCPM" object is returned, which contains the following named elements:

`cutoff`	a vector containing 3 values: the threshold value, upper confidence interval,
	and the lower confidence interval value.
`args`	a key: value list of arguments that were provided.
`res`	a list containing the main results and other information from the input.
	The `summary.empLCPM` function should be used to extract a summary table.

Examples

library(CpmERCCutoff)
##############################
# Load and wrangle input data:
##############################
# Load observed read counts
data("obs_input")

# Set ERCC Ids to rownames
rownames(obs_input) = obs_input$X

# Load expected ERCC data:
data("exp_input")

# Order rows by ERCC ID.
exp_input = exp_input[order(exp_input$ercc_id), ]
rownames(exp_input) = exp_input$ercc_id

# Load metadata:
data("mta_dta")

# Pair samples that received ERCC Mix 1 with samples that received ERCC Mix 2.
# The resulting 2-column data frame is used for the 'pairs' argument.
# Note: the code here will depend on the details of the given experiment. In
#       this example, the post-vaccination samples (which received Mix 2)
#       for each subject are paired to their pre-vaccination samples (which
#       received Mix 1).
pairs_input = cbind(
  mta_dta[mta_dta$spike == 2, 'samid'],
  mta_dta[match(mta_dta[mta_dta$spike == 2, 'subid'],
                mta_dta[mta_dta$spike == 1,'subid']), 'samid'])
# Put Mix 1 in the first column and Mix 2 in the second.
pairs_input = pairs_input[, c(2, 1)]

###############################
# Run getLowLcpmCutoff Function:
###############################'
# Note: Here we use `rep = 10` for only 10 bootstrap replicates
#       to decrease the run time for this example; a lager number
#       should be used in practice (default = 1000).
res = getLowLcpmCutoff(obs = obs_input,
                       exp = exp_input,
                       pairs = pairs_input,
                       n.bins = 7,
                       rep = 10,
                       cor.value = 0.9,
		                  remove.outliers = TRUE,
                       seed = 20220719)

# Print a short summary of the results:
res

# Extract a summary table of the results:
summary(res)

# Create a plot of the results:
plot(x = res,
     main = "Determination of Empirical Minimum Expression Cutoffs using ERCCs",
     col.trend = "blue",
     col.outlier = c("black", "red"))

A data frame containing sample-level ERCC meta data

Description

A data frame containing sample-level ERCC meta data. In this experiment, subjects had repeated measures and the baseline samples were spiked with ERCC1 and four post-vaccination samples were each spiked with ERCC2. This meta data is used to create the 2-column data frame of sample pairs where the first column will contain sample IDs that received ERCC1, and the second column will contain sample IDs that received ERCC2. The pairs data frame is used as input to the 'pairs' argument in the 'getLowCpmCutOff' function. Use '?getLowCpmcutOff' to review example code regarding the formatting of the 'pairs' data frame.

Usage

mta_dta

Format

A data frame with 49 rows and 4 columns: Each row is a sample, and each column contains metadata such as subject IDs, spike in control type, and treatment groups. For this study, data was collected at various time points, however, under different experiment conditions, the 'day' column can be represented as treatment group.

samid: Sample IDs (SAM01 – SAM49)
subid: Subject/Participant ID (SUB01 – SUB10)
day: collection day (0 – 14)
spike: ERCC spike in control Mix (1 or 2)

A data frame of observed spike in ERCC normalized LCPM data

Description

A data frame of observed gene expression results for ERCC RNA that was spiked into samples. Data are read coverage-normalized log2 counts per million (LCPM).

Usage

obs_input

Format

A data frame with 92 rows and 50 Columns: Each row is an ERCC transcript, and each column is a sample. Data are read coverage-normalized LCPM.

X: This first column is ERCC spike in mRNA Ids (ERCC-00002 – ERCC-00171)
SAM15: Sample 15 containing normalized log2 counts per million
SAM36: Sample 36 containing normalized log2 counts per million
SAM19: Sample 19 containing normalized log2 counts per million
SAM53: Sample 53 containing normalized log2 counts per million
SAM42: Sample 42 containing normalized log2 counts per million
SAM32: Sample 32 containing normalized log2 counts per million
SAM18: Sample 18 containing normalized log2 counts per million
SAM48: Sample 48 containing normalized log2 counts per million
SAM26: Sample 26 containing normalized log2 counts per million
SAM37: Sample 37 containing normalized log2 counts per million
SAM38: Sample 38 containing normalized log2 counts per million
SAM29: Sample 29 containing normalized log2 counts per million
SAM17: Sample 17 containing normalized log2 counts per million
SAM41: Sample 41 containing normalized log2 counts per million
SAM09: Sample 09 containing normalized log2 counts per million
SAM07: Sample 07 containing normalized log2 counts per million
SAM14: Sample 14 containing normalized log2 counts per million
SAM02: Sample 02 containing normalized log2 counts per million
SAM05: Sample 05 containing normalized log2 counts per million
SAM25: Sample 25 containing normalized log2 counts per million
SAM08: Sample 08 containing normalized log2 counts per million
SAM28: Sample 28 containing normalized log2 counts per million
SAM44: Sample 44 containing normalized log2 counts per million
SAM04: Sample 04 containing normalized log2 counts per million
SAM10: Sample 10 containing normalized log2 counts per million
SAM31: Sample 31 containing normalized log2 counts per million
SAM21: Sample 21 containing normalized log2 counts per million
SAM20: Sample 20 containing normalized log2 counts per million
SAM52: Sample 52 containing normalized log2 counts per million
SAM46: Sample 46 containing normalized log2 counts per million
SAM01: Sample 01 containing normalized log2 counts per million
SAM13: Sample 13 containing normalized log2 counts per million
SAM39: Sample 39 containing normalized log2 counts per million
SAM49: Sample 49 containing normalized log2 counts per million
SAM30: Sample 30 containing normalized log2 counts per million
SAM50: Sample 50 containing normalized log2 counts per million
SAM11: Sample 11 containing normalized log2 counts per million
SAM35: Sample 35 containing normalized log2 counts per million
SAM06: Sample 06 containing normalized log2 counts per million
SAM27: Sample 27 containing normalized log2 counts per million
SAM33: Sample 33 containing normalized log2 counts per million
SAM22: Sample 22 containing normalized log2 counts per million
SAM24: Sample 24 containing normalized log2 counts per million
SAM16: Sample 16 containing normalized log2 counts per million
SAM34: Sample 34 containing normalized log2 counts per million
SAM03: Sample 03 containing normalized log2 counts per million
SAM47: Sample 47 containing normalized log2 counts per million
SAM40: Sample 40 containing normalized log2 counts per million
SAM23: Sample 23 containing normalized log2 counts per million

Plot the empirically derived LCPM cutoff

Description

Plot method for class "empLCPM" that plots the 3rd order polynomial fit for the Spearman correlation between the expected and observed ERCC fold changes across log2 CPM abundance bins and highlights the empirically derived LCPM cutoff.

Usage

## S3 method for class 'empLCPM'
plot(x, main = "", col.trend = "purple", col.outlier = c("black", "red"), ...)

Arguments

x

A 'empLCPM' object from getLowLcpmCutoff.

main

An (optional) title for the plot.

col.trend

Color used for the 3rd order polynomial fit.

col.outlier

A vector specifying the default color for the points in the scatterplot, and the color for the outlier points. The default is to color all non-outlier points black and the outliers red.

...

Additional arguments are ignored.

Value

The function plot.empLCPM creates a plot that summarizes 3rd order polynomial fit for the Spearman correlation between the expected and observed ERCC fold changes across log2 CPM abundance bins. Vertical lines are used to indicate the optimal cutoff value, along with the lower and upper 95% bootstrap confidence interval highlighted by dashed vertical lines.

Print the empirically derived LCPM cutoff results

Description

Print method for class "empLCPM".

Usage

## S3 method for class 'empLCPM'
print(x, ...)

Arguments

x

A 'empLCPM' object from getLowLcpmCutoff.

...

Additional arguments are ignored.

Value

The function print.empLCPM extracts the key results and returns a string of the summary.

Summarizing the empirically derived LCPM cutoff

Description

Summary method for class "empLCPM".

Usage

## S3 method for class 'empLCPM'
summary(object, ...)

Arguments

object

A 'empLCPM' object from getLowLcpmCutoff.

...

Additional arguments are ignored.

Value

The function summary.empLCPM extracts the key results and returns a string of the summary.

A data frame of expected ERCC1 and ERCC2 ratios

Description

Usage

Format

Examples

Function to empirically determine a log2 CPM cutoff based on ERCC RNA spike-in

Description

Usage

Arguments

Value

See Also

Examples

A data frame containing sample-level ERCC meta data

Description

Usage

Format

A data frame of observed spike in ERCC normalized LCPM data

Description

Usage

Format

Plot the empirically derived LCPM cutoff

Description

Usage

Arguments

Value

Print the empirically derived LCPM cutoff results

Description

Usage

Arguments

Value

Summarizing the empirically derived LCPM cutoff

Description

Usage

Arguments

Value