Type: | Package |
Title: | CRISPR Pooled Screen Analysis using Beta-Binomial Test |
Version: | 1.3.4 |
Date: | 2020-07-23 |
Description: | Provides functions for hit gene identification and quantification of sgRNA (single-guide RNA) abundances for CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) pooled screen data analysis. Details are in Jeong et al. (2019) <doi:10.1101/gr.245571.118> and Baggerly et al. (2003) <doi:10.1093/bioinformatics/btg173>. |
Depends: | R (≥ 3.5.0) |
License: | MIT + file LICENSE |
LazyData: | true |
Imports: | Rcpp (≥ 0.12.16), metap, magrittr, dplyr, tibble, stringr, ggplot2, tidyr, glue, pheatmap, tools, readr, parallel, R.utils |
LinkingTo: | Rcpp, RcppArmadillo |
Suggests: | testthat, knitr, rmarkdown |
RoxygenNote: | 7.1.1 |
Encoding: | UTF-8 |
VignetteBuilder: | knitr |
NeedsCompilation: | yes |
Packaged: | 2020-07-23 15:58:35 UTC; hwan |
Author: | Hyun-Hwan Jeong [aut, cre] |
Maintainer: | Hyun-Hwan Jeong <jeong.hyunhwan@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2020-07-24 09:42:24 UTC |
A benchmark CRISPRn pooled screen dataset from Evers et al.
Description
A benchmark CRISPRn pooled screen dataset from Evers et al.
Usage
data(Evers_CRISPRn_RT112)
Format
The data object is a list containing the following elements:
- count
The count matrix from Evers et al.'s paper, containing the CRISPRn screening result for the RT112 cell line. It contains three replicates for T0 (before) and three replicates for T1 (after).
- egenes
The list of 46 essential genes used in Evers et al.'s study.
- ngenes
The list of 47 non-essential genes used in Evers et al.'s study.
- design
A data.frame containing the study design.
- sg_stat
A data.frame containing the sgRNA-level statistics.
- gene_stat
A data.frame containing the gene-level statistics.
Source
https://www.ncbi.nlm.nih.gov/pubmed/27111720
A benchmark CRISPRn pooled screen dataset from Sanson et al.
Description
A benchmark CRISPRn pooled screen dataset from Sanson et al.
Usage
data(Sanson_CRISPRn_A375)
Format
The data object is a list containing the following elements:
- count
The count matrix from Sanson et al.'s paper, containing the CRISPRn screening result for the A375 cell line. It contains one plasmid sample and three biological replicates collected after three weeks.
- egenes
The list of 1,580 essential genes used in Sanson et al.'s study.
- ngenes
The list of 927 non-essential genes used in Sanson et al.'s study.
- design
A data.frame containing the study design.
Source
https://www.ncbi.nlm.nih.gov/pubmed/30575746
A function to calculate the mappability of each NGS sample.
Description
A function to calculate the mappability of each NGS sample.
Usage
calc_mappability(count_obj, df_design)
Arguments
count_obj |
A list object created by 'run_sgrna_quant'. |
df_design |
The table containing the study design. |
Examples
library(CB2)
library(magrittr)
library(tibble)
library(dplyr)
library(glue)
FASTA <- system.file("extdata", "toydata", "small_sample.fasta", package = "CB2")
ex_path <- system.file("extdata", "toydata", package = "CB2")
df_design <- tribble(
~group, ~sample_name,
"Base", "Base1",
"Base", "Base2",
"High", "High1",
"High", "High2") %>%
mutate(fastq_path = glue("{ex_path}/{sample_name}.fastq"))
cb2_count <- run_sgrna_quant(FASTA, df_design)
calc_mappability(cb2_count, df_design)
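As a rough illustration of what a mappability summary could look like, the sketch below computes the percentage of reads assigned to any sgRNA for each sample. It assumes a list with the 'count' and 'total' elements that 'run_sgrna_quant' is documented to return; the toy values and the helper name 'toy_mappability' are hypothetical, and the actual output of 'calc_mappability' may be organized differently.

```r
# Toy stand-in for the list returned by run_sgrna_quant (values are made up).
toy_count_obj <- list(
  count = data.frame(
    Base1 = c(10, 20, 30),
    High1 = c(5, 15, 40),
    row.names = c("GENE1_sg1", "GENE1_sg2", "GENE2_sg1")
  ),
  total = c(Base1 = 100, High1 = 80)
)

# Hypothetical helper: percentage of reads mapped to any sgRNA, per sample.
toy_mappability <- function(count_obj) {
  mapped <- colSums(count_obj$count)
  total <- count_obj$total[names(mapped)]
  data.frame(
    sample_name = names(mapped),
    total_reads = as.numeric(total),
    mapped_reads = as.numeric(mapped),
    mappability = 100 * as.numeric(mapped) / as.numeric(total)
  )
}

toy_map <- toy_mappability(toy_count_obj)
print(toy_map)
```

A low value for one sample relative to the others would usually be a reason to inspect that FASTQ file before running the statistical tests.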
A C++ function to perform parameter estimation for the sgRNA-level test. It estimates two parameters, 'phat' and 'vhat', under the assumption that the input count data follow the beta-binomial distribution. Dr. Keith Baggerly initially implemented this code in Matlab, and it has been rewritten in C++ for speed.
Description
A C++ function to perform parameter estimation for the sgRNA-level test. It estimates two parameters, 'phat' and 'vhat', under the assumption that the input count data follow the beta-binomial distribution. Dr. Keith Baggerly initially implemented this code in Matlab, and it has been rewritten in C++ for speed.
Usage
fit_ab(xvec, nvec)
Arguments
xvec |
A matrix containing sgRNA read counts. |
nvec |
A vector containing the library sizes. |
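To give a feel for the roles of 'phat' and 'vhat', the sketch below computes naive moment-based analogues for a single sgRNA: a pooled proportion and a rough variance of that proportion. This is only an illustration under simplifying assumptions and is not the beta-binomial estimator that 'fit_ab' implements; the toy numbers and the names 'phat_naive' and 'vhat_naive' are hypothetical.

```r
# Toy counts for one sgRNA across three replicates (made-up numbers).
x <- c(12, 18, 15)          # reads assigned to the sgRNA
n <- c(1000, 1200, 1100)    # library size of each replicate

# Per-replicate proportions of the sgRNA.
p <- x / n

# Naive pooled proportion (plays the role of 'phat').
phat_naive <- sum(x) / sum(n)

# Naive variance of the mean proportion (plays the role of 'vhat'):
# sample variance of replicate proportions divided by the replicate count.
vhat_naive <- var(p) / length(p)

print(c(phat = phat_naive, vhat = vhat_naive))
```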
A function to normalize sgRNA read counts.
Description
A function to normalize sgRNA read counts.
Usage
get_CPM(sgcount)
Arguments
sgcount |
The input table containing read counts of sgRNAs for each sample (required). |
Value
A normalized CPM (Counts Per Million) table will be returned.
Examples
library(CB2)
data(Evers_CRISPRn_RT112)
get_CPM(Evers_CRISPRn_RT112$count)
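For readers unfamiliar with CPM, the following base R sketch shows the usual definition: each count is divided by its sample total and scaled to one million. The toy matrix is made up, and whether 'get_CPM' applies any additional processing is not stated here, so treat this as an illustration of the normalization rather than the package's code.

```r
# Toy count matrix: rows are sgRNAs, columns are samples (made-up numbers).
toy_counts <- matrix(
  c(10, 30, 60,
    5, 5, 90),
  nrow = 3,
  dimnames = list(c("sg1", "sg2", "sg3"), c("S1", "S2"))
)

# Divide each column by its total and scale to one million.
toy_cpm <- sweep(toy_counts, 2, colSums(toy_counts), "/") * 1e6

print(toy_cpm)
```

After this normalization every column sums to one million, which makes samples with different sequencing depths comparable.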
A function to join a count table and a design table.
Description
A function to join a count table and a design table.
Usage
join_count_and_design(sgcount, df_design)
Arguments
sgcount |
The input matrix containing read counts of sgRNAs for each sample. |
df_design |
The table containing the study design. |
Value
A tall-thin table combining the sgRNA read counts and the study design will be returned.
Examples
library(CB2)
data(Evers_CRISPRn_RT112)
head(join_count_and_design(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design))
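The "tall-thin" shape can be pictured with a small base R sketch: the count matrix is reshaped to one row per (sgRNA, sample) pair and then merged with the design on the sample name. The column names 'sgRNA' and 'count' and the toy data are assumptions made for illustration; the columns returned by 'join_count_and_design' may be named differently.

```r
# Toy count matrix and study design (made-up values).
toy_counts <- matrix(
  1:6, nrow = 3,
  dimnames = list(c("sg1", "sg2", "sg3"), c("S1", "S2"))
)
toy_design <- data.frame(
  sample_name = c("S1", "S2"),
  group = c("before", "after"),
  stringsAsFactors = FALSE
)

# Reshape to long format: one row per (sgRNA, sample) pair.
toy_long <- as.data.frame(as.table(toy_counts), stringsAsFactors = FALSE)
names(toy_long) <- c("sgRNA", "sample_name", "count")

# Attach the design information to every row.
toy_joined <- merge(toy_long, toy_design, by = "sample_name")

print(toy_joined)
```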
A function to perform a gene-level test using sgRNA-level statistics.
Description
A function to perform a gene-level test using sgRNA-level statistics.
Usage
measure_gene_stats(sgrna_stat, logFC_level = "sgRNA")
Arguments
sgrna_stat |
A data frame created by ‘measure_sgrna_stats’ |
logFC_level |
The level at which the ‘logFC’ value is calculated. It can be ‘gene’ or ‘sgRNA’. |
Value
A table containing the gene-level test result, with these columns:
‘gene’: The name of the gene being tested.
‘n_sgrna’: The number of sgRNAs targeting the gene in the library.
‘cpm_a’: The mean CPM of the sgRNAs within the first group.
‘cpm_b’: The mean CPM of the sgRNAs within the second group.
‘logFC’: The log fold change of the gene between the two groups. By default, it is the mean of the sgRNA ‘logFC’ values; if the ‘logFC_level’ parameter is set to ‘gene’, it is calculated as ‘log2(cpm_b+1) - log2(cpm_a+1)’.
‘p_ts’: The p-value indicating a difference between the two groups at the gene level.
‘p_pa’: The p-value indicating enrichment of the first group at the gene level.
‘p_pb’: The p-value indicating enrichment of the second group at the gene level.
‘fdr_ts’: The adjusted p-value of ‘p_ts’.
‘fdr_pa’: The adjusted p-value of ‘p_pa’.
‘fdr_pb’: The adjusted p-value of ‘p_pb’.
Examples
data(Evers_CRISPRn_RT112)
measure_gene_stats(Evers_CRISPRn_RT112$sg_stat)
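The two ‘logFC_level’ options can give different numbers, which the base R sketch below tries to make concrete. The gene-level formula is the one quoted in the ‘logFC’ column description; the sgRNA-level ‘logFC’ is assumed here to use the same formula per sgRNA, which is an assumption for illustration. The toy table and variable names are hypothetical.

```r
# Toy sgRNA-level table (made-up CPM values).
toy_sg <- data.frame(
  gene  = c("GENE1", "GENE1", "GENE2", "GENE2"),
  cpm_a = c(100, 300, 50, 150),
  cpm_b = c(25, 75, 200, 600),
  stringsAsFactors = FALSE
)

# Assumed per-sgRNA log fold change.
toy_sg$logFC <- log2(toy_sg$cpm_b + 1) - log2(toy_sg$cpm_a + 1)

# Option 'sgRNA': mean of the sgRNA-level logFC values per gene.
logFC_sgRNA_level <- tapply(toy_sg$logFC, toy_sg$gene, mean)

# Option 'gene': logFC computed from the gene-level mean CPMs.
mean_a <- tapply(toy_sg$cpm_a, toy_sg$gene, mean)
mean_b <- tapply(toy_sg$cpm_b, toy_sg$gene, mean)
logFC_gene_level <- log2(mean_b + 1) - log2(mean_a + 1)

print(rbind(sgRNA = logFC_sgRNA_level, gene = logFC_gene_level))
```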
A function to perform a statistical test at the sgRNA level.
Description
A function to perform a statistical test at the sgRNA level.
Usage
measure_sgrna_stats(
sgcount,
design,
group_a,
group_b,
delim = "_",
ge_id = NULL,
sg_id = NULL
)
Arguments
sgcount |
A data frame containing read counts of sgRNAs for the samples. |
design |
A table containing the study design. It has to contain a 'group' column. |
group_a |
The first group to be tested. |
group_b |
The second group to be tested. |
delim |
The delimiter between a gene name and an sgRNA ID. It is used only if the row names contain the sgRNA IDs. |
ge_id |
The name of the column containing gene names. |
sg_id |
The name(s) of the column(s) containing sgRNA identifiers. |
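The role of 'delim' can be sketched with base R: when row names such as "RPS7_sg1" carry both pieces of information, splitting on the delimiter recovers the gene name and the sgRNA ID. The identifiers below are made up, and the sketch assumes exactly one delimiter per row name; how the package treats gene names that themselves contain the delimiter is not covered here.

```r
# Made-up row names of the form <gene><delim><sgRNA ID>.
toy_ids <- c("RPS7_sg1", "RPS7_sg2", "CTRL_sg1")
delim <- "_"

# Split each identifier on the delimiter (treated as a fixed string).
parts <- strsplit(toy_ids, delim, fixed = TRUE)

# First piece is taken as the gene, second piece as the sgRNA ID.
toy_map <- data.frame(
  gene  = vapply(parts, `[`, character(1), 1),
  sgRNA = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

print(toy_map)
```

When the count table already has separate gene and sgRNA columns, passing their names through 'ge_id' and 'sg_id' avoids relying on this kind of parsing.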
Value
A table containing the sgRNA-level test result, with these columns:
‘sgRNA’: The sgRNA identifier.
‘gene’: The gene targeted by the sgRNA.
‘n_a’: The number of replicates of the first group.
‘n_b’: The number of replicates of the second group.
‘phat_a’: The proportion value of the sgRNA for the first group.
‘phat_b’: The proportion value of the sgRNA for the second group.
‘vhat_a’: The variance of the sgRNA for the first group.
‘vhat_b’: The variance of the sgRNA for the second group.
‘cpm_a’: The mean CPM of the sgRNA within the first group.
‘cpm_b’: The mean CPM of the sgRNA within the second group.
‘logFC’: The log fold change of the sgRNA between the two groups.
‘t_value’: The value of the t-statistic.
‘df’: The degrees of freedom, used to calculate the p-value of the sgRNA.
‘p_ts’: The p-value indicating a difference between the two groups.
‘p_pa’: The p-value indicating enrichment of the first group.
‘p_pb’: The p-value indicating enrichment of the second group.
‘fdr_ts’: The adjusted p-value of ‘p_ts’.
‘fdr_pa’: The adjusted p-value of ‘p_pa’.
‘fdr_pb’: The adjusted p-value of ‘p_pb’.
Examples
library(CB2)
data(Evers_CRISPRn_RT112)
measure_sgrna_stats(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design, "before", "after")
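The relationship between ‘t_value’, ‘df’, and the three p-value columns can be sketched with the t-distribution functions in base R. Two things below are assumptions made purely for illustration and are not confirmed by this manual: that a positive t-statistic corresponds to a higher proportion in the first group, and that the adjusted p-values use the Benjamini-Hochberg method. The toy statistics are made up.

```r
# Made-up t-statistics and degrees of freedom for three sgRNAs.
t_value <- c(-3.2, 0.4, 2.8)
df <- c(4, 4, 4)

# Two-sided p-value: any difference between the groups.
p_ts <- 2 * pt(-abs(t_value), df)

# One-sided p-values. ASSUMPTION: positive t means group A is higher.
p_pa <- pt(t_value, df, lower.tail = FALSE)  # enrichment of the first group
p_pb <- pt(t_value, df)                      # enrichment of the second group

# ASSUMPTION: Benjamini-Hochberg adjustment for the FDR columns.
fdr_ts <- p.adjust(p_ts, method = "BH")

print(data.frame(t_value, df, p_ts, p_pa, p_pb, fdr_ts))
```

Under these assumptions the two one-sided p-values sum to one for each sgRNA, and the two-sided p-value is twice the smaller of the two.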
A function to plot the first two principal components of samples.
Description
This function will perform a principal component analysis, and it returns a ggplot object of the PCA plot.
Usage
plot_PCA(sgcount, df_design)
Arguments
sgcount |
The input matrix containing read counts of sgRNAs for each sample. |
df_design |
The table containing the study design. |
Value
A ggplot2 object containing a PCA plot of the input.
Examples
library(CB2)
data(Evers_CRISPRn_RT112)
plot_PCA(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design)
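To show what goes into such a plot, the base R sketch below runs a principal component analysis on a toy count matrix and extracts the first two components per sample. The CPM scaling and log2 transformation before the PCA are assumptions for illustration; the preprocessing that 'plot_PCA' performs internally is not described in this manual. The simulated counts and variable names are hypothetical.

```r
set.seed(1)

# Simulated counts: 10 sgRNAs (rows) by 4 samples (columns).
toy_counts <- matrix(
  rpois(40, lambda = 50), nrow = 10,
  dimnames = list(paste0("sg", 1:10), c("B1", "B2", "A1", "A2"))
)
toy_design <- data.frame(
  sample_name = c("B1", "B2", "A1", "A2"),
  group = c("before", "before", "after", "after"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: CPM scaling followed by a log2(x + 1) transformation.
toy_cpm <- sweep(toy_counts, 2, colSums(toy_counts), "/") * 1e6

# PCA with samples as rows, so each sample gets a coordinate per component.
pca <- prcomp(t(log2(toy_cpm + 1)))

# First two principal components, annotated with the study design.
scores <- as.data.frame(pca$x[, 1:2])
scores$sample_name <- rownames(scores)
scores <- merge(scores, toy_design, by = "sample_name")

print(scores)
```

Plotting 'PC1' against 'PC2' and coloring the points by 'group' gives the kind of picture in which replicates of the same condition are expected to cluster together.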
A function to show a heatmap of sgRNA-level correlations between the NGS samples.
Description
A function to show a heatmap of sgRNA-level correlations between the NGS samples.
Usage
plot_corr_heatmap(sgcount, df_design, cor_method = "pearson")
Arguments
sgcount |
The input matrix containing read counts of sgRNAs for each sample. |
df_design |
The table containing the study design. |
cor_method |
A string specifying the correlation measure. It must be one of "pearson", "kendall", or "spearman". |
Value
A pheatmap object containing the correlation heatmap.
Examples
library(CB2)
data(Evers_CRISPRn_RT112)
plot_corr_heatmap(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design)
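The matrix behind such a heatmap can be sketched with the base R 'cor' function, which accepts the same three method names as 'cor_method'. The helper 'toy_cor' and the simulated counts are hypothetical, and any transformation that 'plot_corr_heatmap' applies to the counts before computing correlations is not reproduced here.

```r
set.seed(1)

# Simulated counts: 10 sgRNAs (rows) by 4 samples (columns).
toy_counts <- matrix(
  rpois(40, lambda = 50), nrow = 10,
  dimnames = list(paste0("sg", 1:10), c("B1", "B2", "A1", "A2"))
)

# Hypothetical helper: sample-by-sample correlation of sgRNA counts.
toy_cor <- function(sgcount, cor_method = c("pearson", "kendall", "spearman")) {
  cor_method <- match.arg(cor_method)  # rejects unsupported method names
  cor(sgcount, method = cor_method)
}

cor_mat <- toy_cor(toy_counts, "spearman")
print(round(cor_mat, 2))
```

A rank-based method such as "spearman" is less sensitive to a handful of very abundant sgRNAs than "pearson", which may matter for skewed count data.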
A function to plot read count distribution.
Description
A function to plot read count distribution.
Usage
plot_count_distribution(sgcount, df_design, add_dots = FALSE)
Arguments
sgcount |
The input matrix containing read counts of sgRNAs for each sample. |
df_design |
The table containing the study design. |
add_dots |
The function will display dots of sgRNA counts if it is set to 'TRUE'. |
Value
A ggplot2 object containing a read count distribution plot for 'sgcount'.
Examples
library(CB2)
data(Evers_CRISPRn_RT112)
cpm <- get_CPM(Evers_CRISPRn_RT112$count)
plot_count_distribution(cpm, Evers_CRISPRn_RT112$design)
A function to visualize dot plots for a gene.
Description
A function to visualize dot plots for a gene.
Usage
plot_dotplot(sgcount, df_design, gene, ge_id = NULL, sg_id = NULL)
Arguments
sgcount |
The input matrix containing read counts of sgRNAs for each sample. |
df_design |
The table containing the study design. |
gene |
The gene to be shown. |
ge_id |
The name of the column containing gene names. |
sg_id |
The name of the column containing sgRNA IDs. |
Value
A ggplot2 object containing dot plots of sgRNA read counts for a gene.
Examples
library(CB2)
data(Evers_CRISPRn_RT112)
plot_dotplot(get_CPM(Evers_CRISPRn_RT112$count), Evers_CRISPRn_RT112$design, "RPS7")
A C++ function to quantify sgRNA abundance from NGS samples.
Description
A C++ function to quantify sgRNA abundance from NGS samples.
Usage
quant(ref_path, fastq_path, verbose = FALSE)
Arguments
ref_path |
The path of the annotation file, which has to be in FASTA format. |
fastq_path |
A list of the FASTQ files. |
verbose |
Display some logs during the quantification if it is set to 'TRUE'. |
A function to perform a statistical test at the sgRNA level (deprecated).
Description
A function to perform a statistical test at the sgRNA level (deprecated).
Usage
run_estimation(
sgcount,
design,
group_a,
group_b,
delim = "_",
ge_id = NULL,
sg_id = NULL
)
Arguments
sgcount |
A data frame containing read counts of sgRNAs for the samples. |
design |
A table containing the study design. It has to contain a 'group' column. |
group_a |
The first group to be tested. |
group_b |
The second group to be tested. |
delim |
The delimiter between a gene name and an sgRNA ID. It is used only if the row names contain the sgRNA IDs. |
ge_id |
The name of the column containing gene names. |
sg_id |
The name(s) of the column(s) containing sgRNA identifiers. |
Value
A table containing the sgRNA-level test result, with these columns:
‘sgRNA’: The sgRNA identifier.
‘gene’: The gene targeted by the sgRNA.
‘n_a’: The number of replicates of the first group.
‘n_b’: The number of replicates of the second group.
‘phat_a’: The proportion value of the sgRNA for the first group.
‘phat_b’: The proportion value of the sgRNA for the second group.
‘vhat_a’: The variance of the sgRNA for the first group.
‘vhat_b’: The variance of the sgRNA for the second group.
‘cpm_a’: The mean CPM of the sgRNA within the first group.
‘cpm_b’: The mean CPM of the sgRNA within the second group.
‘logFC’: The log fold change of the sgRNA between the two groups.
‘t_value’: The value of the t-statistic.
‘df’: The degrees of freedom, used to calculate the p-value of the sgRNA.
‘p_ts’: The p-value indicating a difference between the two groups.
‘p_pa’: The p-value indicating enrichment of the first group.
‘p_pb’: The p-value indicating enrichment of the second group.
‘fdr_ts’: The adjusted p-value of ‘p_ts’.
‘fdr_pa’: The adjusted p-value of ‘p_pa’.
‘fdr_pb’: The adjusted p-value of ‘p_pb’.
A function to run an sgRNA quantification algorithm on NGS samples.
Description
A function to run an sgRNA quantification algorithm on NGS samples.
Usage
run_sgrna_quant(lib_path, design, map_path = NULL, ncores = 1, verbose = FALSE)
Arguments
lib_path |
The path of the FASTA file. |
design |
A table containing the study design. It must contain the 'fastq_path' and 'sample_name' columns. |
map_path |
The path of the file containing the gene-sgRNA mapping. |
ncores |
The number of processors to use for parallelization. Parallelization is enabled if the parameter is set to '-1' (all physical cores will be used) or to a value greater than '1'. |
verbose |
Display some logs during the quantification if it is set to 'TRUE'. |
Value
It will return a list containing three elements. The first element (‘count’) is a data frame containing the quantification result for each sample. The second element (‘total’) is a numeric vector containing the total number of reads of each sample. The last element (‘sequence’) is a data frame containing the sequence of each sgRNA in the library.
Examples
library(CB2)
library(magrittr)
library(tibble)
library(dplyr)
library(glue)
FASTA <- system.file("extdata", "toydata", "small_sample.fasta", package = "CB2")
ex_path <- system.file("extdata", "toydata", package = "CB2")
df_design <- tribble(
~group, ~sample_name,
"Base", "Base1",
"Base", "Base2",
"High", "High1",
"High", "High2") %>%
mutate(fastq_path = glue("{ex_path}/{sample_name}.fastq"))
cb2_count <- run_sgrna_quant(FASTA, df_design)
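The meaning of the special 'ncores' value '-1' can be pictured with the base 'parallel' package, which reports the number of physical cores. The helper 'resolve_ncores' below is hypothetical and only sketches one plausible way of turning the parameter into a worker count; it is not taken from the package source.

```r
# Hypothetical helper: turn an 'ncores' setting into a worker count.
resolve_ncores <- function(ncores) {
  if (ncores == -1) {
    # Ask for physical (not logical) cores; this can be NA on some platforms.
    n <- parallel::detectCores(logical = FALSE)
    if (is.na(n)) n <- 1L
    return(as.integer(n))
  }
  # Any other value is used as given, with a floor of one worker.
  as.integer(max(1, ncores))
}

print(resolve_ncores(1))   # serial execution
print(resolve_ncores(-1))  # all physical cores available on this machine
```

With the default 'ncores = 1' the samples would be processed one after another, which is the safest setting on shared machines.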