Help for package CB2

Type:

Package

Title:

CRISPR Pooled Screen Analysis using Beta-Binomial Test

Version:

1.3.4

Date:

2020-07-23

Description:

Provides functions for hit gene identification and quantification of sgRNA (single-guide RNA) abundances for CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) pooled screen data analysis. Details are in Jeong et al. (2019) <doi:10.1101/gr.245571.118> and Baggerly et al. (2003) <doi:10.1093/bioinformatics/btg173>.

Depends:

R (≥ 3.5.0)

License:

MIT + file LICENSE

LazyData:

true

Imports:

Rcpp (≥ 0.12.16), metap, magrittr, dplyr, tibble, stringr, ggplot2, tidyr, glue, pheatmap, tools, readr, parallel, R.utils

LinkingTo:

Rcpp, RcppArmadillo

Suggests:

testthat, knitr, rmarkdown

RoxygenNote:

7.1.1

Encoding:

UTF-8

VignetteBuilder:

knitr

NeedsCompilation:

yes

Packaged:

2020-07-23 15:58:35 UTC; hwan

Author:

Hyun-Hwan Jeong [aut, cre]

Maintainer:

Hyun-Hwan Jeong <jeong.hyunhwan@gmail.com>

Repository:

CRAN

Date/Publication:

2020-07-24 09:42:24 UTC

A benchmark CRISPRn pooled screen dataset from Evers et al.

Description

A benchmark CRISPRn pooled screen dataset from Evers et al.

Usage

data(Evers_CRISPRn_RT112)

Format

The data object is a list containing the following elements:

count: The count matrix from Evers et al.'s paper, containing the CRISPRn screening result for the RT112 cell line. It contains three replicates for T0 (before) and three replicates for T1 (after).
egenes: The list of 46 essential genes used in Evers et al.'s study.
ngenes: The list of 47 non-essential genes used in Evers et al.'s study.
design: The data.frame containing the study design.
sg_stat: The data.frame containing the sgRNA-level statistics.
gene_stat: The data.frame containing the gene-level statistics.

Source

https://www.ncbi.nlm.nih.gov/pubmed/27111720

A benchmark CRISPRn pooled screen dataset from Sanson et al.

Description

A benchmark CRISPRn pooled screen dataset from Sanson et al.

Usage

data(Sanson_CRISPRn_A375)

Format

The data object is a list containing the following elements:

count: The count matrix from Sanson et al.'s paper, containing the CRISPRn screening result for the A375 cell line. It contains one plasmid sample and three biological replicates collected after three weeks.
egenes: The list of 1,580 essential genes used in Sanson et al.'s study.
ngenes: The list of 927 non-essential genes used in Sanson et al.'s study.
design: The data.frame containing the study design.

Source

https://www.ncbi.nlm.nih.gov/pubmed/30575746

A function to calculate the mappabilities of each NGS sample.

Description

A function to calculate the mappabilities of each NGS sample.

Usage

calc_mappability(count_obj, df_design)

Arguments

count_obj

A list object created by 'run_sgrna_quant'.

df_design

The table contains a study design.

Examples

library(CB2)
library(magrittr)
library(tibble)
library(dplyr)
library(glue)
FASTA <- system.file("extdata", "toydata", "small_sample.fasta", package = "CB2")
ex_path <- system.file("extdata", "toydata", package = "CB2")

df_design <- tribble(
  ~group, ~sample_name,
  "Base", "Base1",  
  "Base", "Base2", 
  "High", "High1",
  "High", "High2") %>% 
    mutate(fastq_path = glue("{ex_path}/{sample_name}.fastq"))

cb2_count <- run_sgrna_quant(FASTA, df_design)
calc_mappability(cb2_count, df_design)
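
Mappability is commonly understood as the share of reads in a sample that could be assigned to an sgRNA in the library. The following base R sketch illustrates that idea on made-up numbers. The list layout ('count', 'total') follows the return value documented for 'run_sgrna_quant', but the arithmetic is an assumption for illustration and is not confirmed to be what 'calc_mappability' computes internally.

```r
# Hypothetical quantification result shaped like the documented return value
# of run_sgrna_quant(): a per-sample count table and per-sample read totals.
toy <- list(
  count = data.frame(
    Base1 = c(40, 35, 5),
    High1 = c(10, 60, 20),
    row.names = c("g1_sg1", "g1_sg2", "g2_sg1")
  ),
  total = c(Base1 = 100, High1 = 120)
)

# Assumed definition: mapped reads divided by all reads, as a percentage.
mapped <- colSums(toy$count)
mappability <- 100 * mapped / toy$total[names(mapped)]
print(round(mappability, 1))
```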

A C++ function to perform parameter estimation for the sgRNA-level test. It estimates two parameters, 'phat' and 'vhat', under the assumption that the input count data follow the beta-binomial distribution. Dr. Keith Baggerly initially implemented this code in Matlab, and it has been rewritten in C++ for speed.

Description

A C++ function to perform parameter estimation for the sgRNA-level test. It estimates two parameters, 'phat' and 'vhat', under the assumption that the input count data follow the beta-binomial distribution. Dr. Keith Baggerly initially implemented this code in Matlab, and it has been rewritten in C++ for speed.

Usage

fit_ab(xvec, nvec)

Arguments

xvec

A matrix containing sgRNA read counts.

nvec

A vector containing the library sizes.
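
To give a feel for what a proportion estimate and its variance look like for one sgRNA, the sketch below computes naive, unweighted moment estimates in base R. This is deliberately not the weighted beta-binomial fitting scheme of Baggerly et al. that 'fit_ab' implements; the variable names and numbers are made up for illustration.

```r
# Toy input: read counts of one sgRNA in three replicates (x) and the
# corresponding library sizes (n). All numbers are made up.
x <- c(120, 95, 150)
n <- c(1.0e6, 9.0e5, 1.2e6)

p_i  <- x / n                     # per-replicate proportions
phat <- mean(p_i)                 # naive (unweighted) proportion estimate
vhat <- var(p_i) / length(p_i)    # naive variance of that mean proportion

# Variance the same mean would have under a pure binomial model.
# A vhat clearly above this value hints at extra (beta-binomial) dispersion.
v_binom <- sum(phat * (1 - phat) / n) / length(n)^2

print(c(phat = phat, vhat = vhat, v_binom = v_binom))
```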

A function to normalize sgRNA read counts.

Description

A function to normalize sgRNA read counts.

Usage

get_CPM(sgcount)

Arguments

sgcount

The input table containing read counts of sgRNAs for each sample (required).

Value

a normalized CPM table will be returned

Examples

library(CB2)
data(Evers_CRISPRn_RT112)
get_CPM(Evers_CRISPRn_RT112$count)
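
For readers unfamiliar with CPM, the base R sketch below applies the conventional definition (each count divided by its sample's total, times one million) to a made-up count matrix. It is an illustration of the normalization idea only; 'get_CPM' may handle edge cases differently.

```r
# Made-up count matrix: rows are sgRNAs, columns are samples.
counts <- matrix(
  c(10, 20, 70,
     5, 15, 30),
  nrow = 3,
  dimnames = list(c("g1_sg1", "g1_sg2", "g2_sg1"), c("S1", "S2"))
)

# Conventional CPM: divide each column by its total, then scale to 1e6.
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6
print(cpm)
```

After this normalization every column sums to one million, which makes samples with different sequencing depths comparable.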

A function to join a count table and a design table.

Description

A function to join a count table and a design table.

Usage

join_count_and_design(sgcount, df_design)

Arguments

sgcount

The input matrix contains read counts of sgRNAs for each sample.

df_design

The table contains a study design.

Value

A tall-thin and combined table of the sgRNA read counts and study design will be returned.

Examples

library(CB2)
data(Evers_CRISPRn_RT112) 
head(join_count_and_design(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design))
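
The "tall-thin" shape mentioned above has one row per sgRNA and sample. The base R sketch below builds such a table from a made-up count matrix and design table. The column names 'sgRNA' and 'count' are chosen for illustration and are not confirmed to match the columns returned by 'join_count_and_design'.

```r
# Made-up inputs: a small count matrix and a matching design table.
counts <- matrix(
  1:6, nrow = 3,
  dimnames = list(c("A_1", "A_2", "B_1"), c("S1", "S2"))
)
design <- data.frame(
  sample_name = c("S1", "S2"),
  group = c("before", "after"),
  stringsAsFactors = FALSE
)

# Reshape the matrix to long form: one row per (sgRNA, sample) pair.
long <- as.data.frame(as.table(counts), stringsAsFactors = FALSE)
names(long) <- c("sgRNA", "sample_name", "count")

# Attach the design information to every row via the sample name.
joined <- merge(long, design, by = "sample_name")
print(joined)
```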

A function to perform a gene-level test using sgRNA-level statistics.

Description

A function to perform a gene-level test using sgRNA-level statistics.

Usage

measure_gene_stats(sgrna_stat, logFC_level = "sgRNA")

Arguments

sgrna_stat

A data frame created by ‘measure_sgrna_stats’

logFC_level

The level of ‘logFC’ value. It can be ‘gene’ or ‘sgRNA’.

Value

A table containing the gene-level test results, with these columns:

‘gene’: The name of the gene being tested.
‘n_sgrna’: The number of sgRNAs targeting the gene in the library.
‘cpm_a’: The mean of CPM of sgRNAs within the first group.
‘cpm_b’: The mean of CPM of sgRNAs within the second group.
‘logFC’: The log fold change of the gene between two groups. By default, it is the mean of the sgRNA ‘logFC’ values; if the ‘logFC_level’ parameter is set to ‘gene’, it is calculated as ‘log2(cpm_b+1) - log2(cpm_a+1)’.
‘p_ts’: The p-value indicates a difference between the two groups at the gene-level.
‘p_pa’: The p-value indicates enrichment of the first group at the gene-level.
‘p_pb’: The p-value indicates enrichment of the second group at the gene-level.
‘fdr_ts’: The adjusted P-value of ‘p_ts’.
‘fdr_pa’: The adjusted P-value of ‘p_pa’.
‘fdr_pb’: The adjusted P-value of ‘p_pb’.
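
The two ways of obtaining ‘logFC’ described above can be sketched in base R as follows. The gene names and numbers are made up, and the data frame layout is a simplification of the real sgRNA-level table.

```r
# Made-up sgRNA-level log fold changes for two hypothetical genes.
sg <- data.frame(
  gene  = c("GENE1", "GENE1", "GENE2", "GENE2"),
  logFC = c(-2.1, -1.9, 0.1, 0.3)
)

# Default ('sgRNA' level): average the sgRNA logFC values per gene.
mean_logFC <- tapply(sg$logFC, sg$gene, mean)

# 'gene' level: use the formula from the text on the gene-level mean CPMs.
cpm_a <- c(GENE1 = 150, GENE2 = 80)   # made-up mean CPM, first group
cpm_b <- c(GENE1 = 30,  GENE2 = 85)   # made-up mean CPM, second group
logFC_gene <- log2(cpm_b + 1) - log2(cpm_a + 1)

print(mean_logFC)
print(logFC_gene)
```

The '+1' pseudocount keeps the logarithm finite when a gene has zero CPM in one of the groups.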

Examples

library(CB2)
data(Evers_CRISPRn_RT112)
measure_gene_stats(Evers_CRISPRn_RT112$sg_stat)

A function to perform a statistical test at the sgRNA level.

Description

A function to perform a statistical test at the sgRNA level.

Usage

measure_sgrna_stats(
  sgcount,
  design,
  group_a,
  group_b,
  delim = "_",
  ge_id = NULL,
  sg_id = NULL
)

Arguments

sgcount

This data frame contains read counts of sgRNAs for the samples.

design

This table contains the study design. It must contain a 'group' column.

group_a

The first group to be tested.

group_b

The second group to be tested.

delim

The delimiter between a gene name and an sgRNA ID. It is used only if the row names contain the sgRNA IDs.

ge_id

The column name of the gene column.

sg_id

The column/columns of sgRNA identifiers.
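
When sgRNA identifiers are stored as row names, the 'delim' argument describes how a gene name and an sgRNA ID are glued together. The base R sketch below shows one way such identifiers could be split. The identifiers are made up, and the sketch assumes that gene names do not themselves contain the delimiter; how the package resolves that case is not stated here.

```r
# Made-up row names of the form <gene><delim><sgRNA ID>.
ids   <- c("RPS7_sg1", "RPS7_sg2", "POLR2A_sg1")
delim <- "_"

# Split each identifier at the delimiter (treated as a literal string).
parts <- strsplit(ids, delim, fixed = TRUE)

# First piece is taken as the gene name, second as the sgRNA ID.
gene  <- vapply(parts, `[`, character(1), 1)
sg_id <- vapply(parts, `[`, character(1), 2)

print(data.frame(id = ids, gene = gene, sg_id = sg_id))
```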

Value

A table containing the sgRNA-level test results, with these columns:

‘sgRNA’: The sgRNA identifier.
‘gene’: The gene targeted by the sgRNA.
‘n_a’: The number of replicates of the first group.
‘n_b’: The number of replicates of the second group.
‘phat_a’: The proportion value of the sgRNA for the first group.
‘phat_b’: The proportion value of the sgRNA for the second group.
‘vhat_a’: The variance of the sgRNA for the first group.
‘vhat_b’: The variance of the sgRNA for the second group.
‘cpm_a’: The mean CPM of the sgRNA within the first group.
‘cpm_b’: The mean CPM of the sgRNA within the second group.
‘logFC’: The log fold change of sgRNA between two groups.
‘t_value’: The value of the t-statistic.
‘df’: The degrees of freedom, used to calculate the p-value of the sgRNA.
‘p_ts’: The p-value indicates a difference between the two groups.
‘p_pa’: The p-value indicates enrichment of the first group.
‘p_pb’: The p-value indicates enrichment of the second group.
‘fdr_ts’: The adjusted P-value of ‘p_ts’.
‘fdr_pa’: The adjusted P-value of ‘p_pa’.
‘fdr_pb’: The adjusted P-value of ‘p_pb’.

Examples

library(CB2)
data(Evers_CRISPRn_RT112)
measure_sgrna_stats(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design, "before", "after")
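
The relationship between ‘t_value’, ‘df’, and the three p-value columns can be sketched with base R's t-distribution functions. Two points are assumptions rather than facts from this page: which one-sided tail corresponds to ‘p_pa’ versus ‘p_pb’ depends on the sign convention of the t-statistic, and the Benjamini-Hochberg method is used here only as a common choice of p-value adjustment.

```r
# Made-up t-statistics and degrees of freedom for three sgRNAs.
t_value <- c(-3.2, 0.4, 2.5)
df      <- c(4, 4, 4)

# Two-sided p-value: probability of a statistic at least this extreme.
p_two_sided <- 2 * pt(-abs(t_value), df)

# The two one-sided p-values (lower and upper tail of the t-distribution).
p_lower <- pt(t_value, df)
p_upper <- pt(t_value, df, lower.tail = FALSE)

# Multiple-testing adjustment; "BH" is an assumed, commonly used method.
fdr_two_sided <- p.adjust(p_two_sided, method = "BH")

print(data.frame(t_value, df, p_two_sided, p_lower, p_upper, fdr_two_sided))
```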

A function to plot the first two principal components of samples.

Description

This function will perform a principal component analysis, and it returns a ggplot object of the PCA plot.

Usage

plot_PCA(sgcount, df_design)

Arguments

sgcount

The input matrix contains read counts of sgRNAs for each sample.

df_design

The table contains a study design.

Value

A ggplot2 object contains a PCA plot for the input.

Examples

library(CB2)
data(Evers_CRISPRn_RT112)
plot_PCA(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design)
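
As a rough picture of what a sample-level PCA involves, the sketch below runs base R's 'prcomp' on simulated counts and extracts the first two principal components per sample. The log2(CPM + 1) transformation is an assumed preprocessing step chosen for illustration; the preprocessing used inside 'plot_PCA' is not documented on this page.

```r
set.seed(1)

# Simulated counts: 10 sgRNAs (rows) by 4 samples (columns).
counts <- matrix(
  rpois(40, lambda = 50), nrow = 10,
  dimnames = list(paste0("sg", 1:10),
                  c("before1", "before2", "after1", "after2"))
)

# Assumed preprocessing: CPM normalization followed by log2(x + 1).
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6
log_cpm <- log2(cpm + 1)

# PCA with samples as observations, hence the transpose.
pca <- prcomp(t(log_cpm))

# First two principal components, one row per sample.
pcs <- as.data.frame(pca$x[, 1:2])
pcs$sample_name <- rownames(pcs)
print(pcs)
```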

A function to show a heatmap of sgRNA-level correlations between the NGS samples.

Description

A function to show a heatmap of sgRNA-level correlations between the NGS samples.

Usage

plot_corr_heatmap(sgcount, df_design, cor_method = "pearson")

Arguments

sgcount

The input matrix contains read counts of sgRNAs for each sample.

df_design

The table contains a study design.

cor_method

A string naming the correlation measure. It must be one of "pearson", "kendall", or "spearman".

Value

A pheatmap object containing the correlation heatmap.

Examples

library(CB2)
data(Evers_CRISPRn_RT112)
plot_corr_heatmap(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design)
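
The matrix behind such a heatmap is a sample-by-sample correlation matrix. The base R sketch below computes one from simulated counts with 'cor', whose 'method' argument accepts the same three strings as 'cor_method'. Whether 'plot_corr_heatmap' transforms the counts before correlating them is not stated here, so raw counts are used as a simplifying assumption.

```r
set.seed(2)

# Simulated counts: 20 sgRNAs (rows) by 4 samples (columns).
counts <- matrix(
  rpois(80, lambda = 100), nrow = 20,
  dimnames = list(paste0("sg", 1:20),
                  c("before1", "before2", "after1", "after2"))
)

# Sample-by-sample correlation matrix across sgRNAs.
cor_mat <- cor(counts, method = "spearman")
print(round(cor_mat, 2))

# The matrix could then be drawn with a heatmap function, for example:
# pheatmap::pheatmap(cor_mat)
```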

A function to plot read count distribution.

Description

A function to plot read count distribution.

Usage

plot_count_distribution(sgcount, df_design, add_dots = FALSE)

Arguments

sgcount

The input matrix contains read counts of sgRNAs for each sample.

df_design

The table contains a study design.

add_dots

The function will display dots of sgRNA counts if it is set to 'TRUE'.

Value

A ggplot2 object contains a read count distribution plot for 'sgcount'.

Examples

library(CB2)
data(Evers_CRISPRn_RT112)
cpm <- get_CPM(Evers_CRISPRn_RT112$count)
plot_count_distribution(cpm, Evers_CRISPRn_RT112$design)

A function to visualize dot plots for a gene.

Description

A function to visualize dot plots for a gene.

Usage

plot_dotplot(sgcount, df_design, gene, ge_id = NULL, sg_id = NULL)

Arguments

sgcount

The input matrix contains read counts of sgRNAs for each sample.

df_design

The table contains a study design.

gene

The gene to be shown.

ge_id

The name of the column containing gene names.

sg_id

The name of the column containing sgRNA IDs.

Value

A ggplot2 object contains dot plots of sgRNA read counts for a gene.

Examples

library(CB2)
data(Evers_CRISPRn_RT112)
plot_dotplot(get_CPM(Evers_CRISPRn_RT112$count), Evers_CRISPRn_RT112$design, "RPS7")

A C++ function to quantify sgRNA abundance from NGS samples.

Description

A C++ function to quantify sgRNA abundance from NGS samples.

Usage

quant(ref_path, fastq_path, verbose = FALSE)

Arguments

ref_path

The path of the annotation file, which must be a FASTA-formatted file.

fastq_path

A list of the FASTQ files.

verbose

Display some logs during the quantification if it is set to 'TRUE'.

A deprecated function to perform a statistical test at the sgRNA level.

Description

A deprecated function to perform a statistical test at the sgRNA level.

Usage

run_estimation(
  sgcount,
  design,
  group_a,
  group_b,
  delim = "_",
  ge_id = NULL,
  sg_id = NULL
)

Arguments

sgcount

This data frame contains read counts of sgRNAs for the samples.

design

This table contains the study design. It must contain a 'group' column.

group_a

The first group to be tested.

group_b

The second group to be tested.

delim

The delimiter between a gene name and an sgRNA ID. It is used only if the row names contain the sgRNA IDs.

ge_id

The column name of the gene column.

sg_id

The column/columns of sgRNA identifiers.

Value

A table containing the sgRNA-level test results, with these columns:

‘sgRNA’: The sgRNA identifier.
‘gene’: The gene targeted by the sgRNA.
‘n_a’: The number of replicates of the first group.
‘n_b’: The number of replicates of the second group.
‘phat_a’: The proportion value of the sgRNA for the first group.
‘phat_b’: The proportion value of the sgRNA for the second group.
‘vhat_a’: The variance of the sgRNA for the first group.
‘vhat_b’: The variance of the sgRNA for the second group.
‘cpm_a’: The mean CPM of the sgRNA within the first group.
‘cpm_b’: The mean CPM of the sgRNA within the second group.
‘logFC’: The log fold change of sgRNA between two groups.
‘t_value’: The value of the t-statistic.
‘df’: The degrees of freedom, used to calculate the p-value of the sgRNA.
‘p_ts’: The p-value indicates a difference between the two groups.
‘p_pa’: The p-value indicates enrichment of the first group.
‘p_pb’: The p-value indicates enrichment of the second group.
‘fdr_ts’: The adjusted P-value of ‘p_ts’.
‘fdr_pa’: The adjusted P-value of ‘p_pa’.
‘fdr_pb’: The adjusted P-value of ‘p_pb’.

A function to run the sgRNA quantification algorithm on NGS samples.

Description

A function to run the sgRNA quantification algorithm on NGS samples.

Usage

run_sgrna_quant(lib_path, design, map_path = NULL, ncores = 1, verbose = FALSE)

Arguments

lib_path

The path of the FASTA file.

design

A table containing the study design. It must contain the 'fastq_path' and 'sample_name' columns.

map_path

The path of the file containing the gene-sgRNA mapping.

ncores

The number of processors to use for parallelization. Parallelization is enabled if the parameter is set to '-1' (meaning all physical cores will be used) or to a value greater than '1'.

verbose

Display some logs during the quantification if it is set to 'TRUE'.

Value

It returns a list with three elements. The first element (‘count’) is a data frame containing the quantification result for each sample. The second element (‘total’) is a numeric vector containing the total number of reads of each sample. The last element (‘sequence’) is a data frame containing the sequence of each sgRNA in the library.

Examples

library(CB2)
library(magrittr)
library(tibble)
library(dplyr)
library(glue)
FASTA <- system.file("extdata", "toydata", "small_sample.fasta", package = "CB2")
ex_path <- system.file("extdata", "toydata", package = "CB2")

df_design <- tribble(
  ~group, ~sample_name,
  "Base", "Base1",  
  "Base", "Base2", 
  "High", "High1",
  "High", "High2") %>% 
    mutate(fastq_path = glue("{ex_path}/{sample_name}.fastq"))

cb2_count <- run_sgrna_quant(FASTA, df_design)
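
The 'ncores' convention described in the arguments ('-1' for all physical cores, values above '1' for an explicit worker count) can be sketched with the 'parallel' package. The helper 'resolve_ncores' below is hypothetical and written only for illustration; it is not a function exported by CB2, and the package's own handling may differ.

```r
# Hypothetical helper: turn an 'ncores' setting into a concrete worker count.
resolve_ncores <- function(ncores) {
  if (ncores == -1) {
    # Ask for physical (not logical) cores; this can be NA on some platforms.
    n <- parallel::detectCores(logical = FALSE)
    if (is.na(n)) n <- 1L
    return(as.integer(n))
  }
  # Any other value is used as given, but never less than one worker.
  as.integer(max(1, ncores))
}

print(resolve_ncores(1))    # serial execution
print(resolve_ncores(4))    # four workers requested explicitly
print(resolve_ncores(-1))   # all physical cores detected on this machine
```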