Version: | 1.0.0 |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
Description: | A mutation analysis tool that discovers cancer driver genes with frequent mutations in protein signalling sites such as post-translational modifications (phosphorylation, ubiquitination, etc). The Poisson generalised linear regression model identifies genes where cancer mutations in signalling sites are more frequent than expected from the sequence of the entire gene. Integration of mutations with signalling information helps find new driver genes and propose candidate mechanisms for known drivers. Reference: Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. Juri Reimand and Gary D Bader. Molecular Systems Biology (2013) 9:637 <doi:10.1038/msb.2012.68>. |
Title: | Finding Cancer Driver Proteins with Enriched Mutations in Post-Translational Modification Sites |
Depends: | R (≥ 3.0) |
Imports: | stats, parallel, MASS |
Collate: | 'ActiveDriver.R' |
RoxygenNote: | 6.0.1.9000 |
NeedsCompilation: | no |
Packaged: | 2017-08-23 18:26:21 UTC; dthompson |
Author: | Juri Reimand [aut, cre] |
Maintainer: | Juri Reimand <juri.reimand@utoronto.ca> |
Repository: | CRAN |
Date/Publication: | 2017-08-23 20:55:51 UTC |
Identification of active protein sites (post-translational modification sites, signalling domains, etc) with specific and significant mutations.
Description
Identification of active protein sites (post-translational modification sites, signalling domains, etc) with specific and significant mutations.
Usage
ActiveDriver(sequences, seq_disorder, mutations, active_sites, flank = 7,
mid_flank = 2, mc.cores = 1, simplified = FALSE,
return_records = FALSE, skip_mismatch = TRUE,
regression_type = "poisson", enriched_only = TRUE)
Arguments
sequences |
character vector of protein sequences, names are protein IDs. |
seq_disorder |
character vector of disorder in protein sequences, names are protein IDs and values are strings of 1s and 0s marking disordered and ordered protein residues, respectively. |
mutations |
data frame of mutations, with [gene, sample_id, position, wt_residue, mut_residue] as columns. |
active_sites |
data frame of active sites, with [gene, position, residue, kinase] as columns. Kinase field may be blank and is shown for informative purposes. |
flank |
numeric for selecting region size around active sites considered important for site activity. Default value is 7. Ignored in case of simplified analysis. |
mid_flank |
numeric for splitting flanking region size into proximal (<=X) and distal (>X). Default value is 2. Ignored in case of simplified analysis. |
mc.cores |
numeric for indicating number of computing cores dedicated to computation. Default value is 1. |
simplified |
true/false for selecting simplified analysis. Default value is FALSE. If TRUE, no flanking regions are considered and only indicated sites are tested for mutations. |
return_records |
true/false for returning a collection of gene records with more data regarding sites and mutations. Default value is FALSE. |
skip_mismatch |
true/false for skipping mutations whose reference protein residue does not match the expected residue from the FASTA sequence file. Default value is TRUE. |
regression_type |
'nb' for negative binomial GLM, 'poisson' for Poisson GLM. The latter is the default. |
enriched_only |
true/false to indicate whether only sites with enriched active site mutations will be included in the final p-value estimation (TRUE is default). If FALSE, sites with fewer mutations than expected will also be included. |
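The idea behind regression_type and enriched_only can be illustrated with a small self-contained sketch. This is not the package's internal code: the per-residue table, its column names (n_mut, disordered, in_site) and the nested-model comparison are assumptions made for illustration, and only base R's glm and anova are used.

```r
# Toy per-residue table for one hypothetical protein (all values invented).
# n_mut: mutations observed at each residue; disordered: 1/0 disorder flag;
# in_site: TRUE if the residue lies in an active-site region.
residues <- data.frame(
  n_mut      = c(0, 1, 0, 0, 1, 3, 4, 2, 0, 1, 0, 0),
  disordered = c(0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0),
  in_site    = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE,
                 FALSE, FALSE, FALSE, FALSE)
)

# Null model: the mutation rate depends on disorder only.
h0 <- glm(n_mut ~ disordered, family = poisson, data = residues)
# Alternative model: active-site membership is added as a predictor.
h1 <- glm(n_mut ~ disordered + in_site, family = poisson, data = residues)

# Likelihood-ratio (chi-square) test between the two nested models.
lrt <- anova(h0, h1, test = "Chisq")
p_value <- lrt[2, "Pr(>Chi)"]

# Assumed analogue of enriched_only = TRUE: keep the p-value only when the
# site residues carry more mutations than the null model expects.
expected_in_site <- sum(fitted(h0)[residues$in_site])
observed_in_site <- sum(residues$n_mut[residues$in_site])
enriched <- observed_in_site > expected_in_site
if (!enriched) p_value <- 1
```

A negative binomial variant of the same comparison could be fitted with MASS::glm.nb, which the package lists among its imports, although very small toy tables like this one may not converge.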
Value
list with the following components:
all_active_mutations - table with mutations that hit or flank an active site. Additional columns of interest include Status (DI - direct active mutation; N1 - proximal flanking mutation; N2 - distal flanking mutation) and Active_region (region ID of active sites in that protein).
all_active_sites -
all_region_based_pval - p-values for regions of sites, statistics on observed mutations (obs) and expected mutations (exp, low, high based on mean and s.d. from Poisson sampling). The field Region identifies region in all_active_sites.
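How flank and mid_flank relate to the Status codes can be sketched as follows. The helper classify_status is hypothetical and assumes that distance is the absolute difference between the mutation position and a single site position; the package may treat overlapping sites and sequence ends differently.

```r
# Hypothetical helper, not part of ActiveDriver: label mutations relative
# to one active site using the flank / mid_flank conventions.
classify_status <- function(mut_pos, site_pos, flank = 7, mid_flank = 2) {
  d <- abs(mut_pos - site_pos)
  ifelse(d == 0, "DI",              # direct hit on the active site
  ifelse(d <= mid_flank, "N1",      # proximal flank (distance <= mid_flank)
  ifelse(d <= flank, "N2",          # distal flank (mid_flank < distance <= flank)
         NA_character_)))           # outside the region around the site
}

# Mutations at several positions around a site at residue 100.
status <- classify_status(mut_pos = c(100, 101, 98, 105, 93, 110),
                          site_pos = 100)
```

With the default values, residues up to 2 positions from the site count as proximal and residues 3 to 7 positions away count as distal; anything further is not assigned to the site.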
Author(s)
Juri Reimand <juri.reimand@utoronto.ca>
References
Juri Reimand and Gary D Bader. Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. Molecular Systems Biology (2013) 9:637. doi:10.1038/msb.2012.68
Examples
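# Load the example data used below
data(ActiveDriver_data)
# Phosphosite analysis across all samples
phos_results = ActiveDriver(sequences, sequence_disorder, mutations, phosphosites)
# Phosphosite analysis restricted to ovarian cancer samples
ovarian_mutations = mutations[grep("ovarian", mutations$sample_id),]
phos_results_ovarian = ActiveDriver(sequences, sequence_disorder, ovarian_mutations, phosphosites)
# Simplified kinase domain analysis (no flanking regions) for glioblastoma samples
GBM_muts = mutations[grep("glioblastoma", mutations$sample_id),]
kin_rslt_GBM = ActiveDriver(sequences, sequence_disorder, GBM_muts, kinase_domains, simplified=TRUE)
# Simplified kinase domain analysis across all samples
kin_results = ActiveDriver(sequences, sequence_disorder, mutations, kinase_domains, simplified=TRUE)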
Example kinase domains for ActiveDriver
Description
A dataset describing kinase domains. The variables are as follows:
Usage
data(ActiveDriver_data)
Format
A data frame with 1 observation of 4 variables
Details
gene. the gene symbol of the gene where the kinase domain occurs
position. the position in the protein sequence where the kinase domain begins
phos. TRUE
residue. the kinase domain residues
Example mutations for ActiveDriver
Description
A dataset describing missense mutations (i.e., substitutions in proteins). The variables are as follows:
Usage
data(ActiveDriver_data)
Format
A data frame with 408 observations of 5 variables
Details
gene. the mutated gene
sample_id. the sample where the mutation originates
position. the position in the protein sequence where the mutation occurs
wt_residue. the wild-type residue
mut_residue. the mutant residue
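The kind of consistency check that skip_mismatch refers to can be sketched with base R. The toy sequence and mutations below are invented, and the code is an illustration of comparing wt_residue against the reference sequence, not the package's own implementation.

```r
# Toy inputs (invented): one protein and three mutations.
sequences <- c(GENE1 = "MSTPAKRLSQ")
mutations <- data.frame(
  gene        = c("GENE1", "GENE1", "GENE1"),
  sample_id   = c("s1", "s2", "s3"),
  position    = c(2, 5, 9),
  wt_residue  = c("S", "G", "S"),
  mut_residue = c("A", "V", "F"),
  stringsAsFactors = FALSE
)

# Residue found in the reference sequence at each mutated position.
ref_residue <- substr(sequences[mutations$gene],
                      mutations$position, mutations$position)

# TRUE where the reported wild-type residue agrees with the sequence.
matches <- unname(ref_residue == mutations$wt_residue)

# Keep only the mutations that are consistent with the reference.
mutations_ok <- mutations[matches, ]
```

Here the second mutation reports G at position 5, where the toy sequence has A, so it is the only row dropped.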
Example phosphosites for ActiveDriver
Description
A dataset describing protein phosphorylation sites. The variables are as follows:
Usage
data(ActiveDriver_data)
Format
A data frame with 131 observations of 4 variables
Details
gene. the gene symbol the phosphosite occurs in
position. the position in the protein sequence where the phosphosite occurs
residue. the phosphosite residue
kinase. the kinase that phosphorylates this site
Read FASTA file as character vector.
Description
Read FASTA file as character vector.
Usage
read_fasta(fname)
Arguments
fname |
name of file to be read. |
Value
character vector with names corresponding to annotations from FASTA.
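For readers who want to see what such a named character vector looks like, here is a minimal base R sketch of a FASTA reader. The function read_fasta_sketch is a hypothetical stand-in, not the package's read_fasta, and it assumes every header line is followed by at least one sequence line.

```r
# Hypothetical stand-in for read_fasta(), written in base R for illustration.
read_fasta_sketch <- function(fname) {
  lines <- readLines(fname)
  lines <- lines[nzchar(lines)]          # drop blank lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)            # record index for every line
  # Concatenate the sequence lines that belong to each record.
  seqs <- vapply(split(lines[!is_header], record[!is_header]),
                 paste, character(1), collapse = "")
  # Use the header text (without the leading ">") as names.
  names(seqs) <- sub("^>", "", lines[is_header])
  seqs
}

# Write a two-record FASTA file to a temporary location and read it back.
tmp <- tempfile(fileext = ".fa")
writeLines(c(">GENE1", "MSTPAK", "RLSQ", ">GENE2", "MKV"), tmp)
fasta <- read_fasta_sketch(tmp)
```

The result has one element per FASTA record, with multi-line sequences joined into a single string.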
Example protein disorder for ActiveDriver
Description
A dataset containing the disorder of four proteins.
Usage
data(ActiveDriver_data)
Format
A named character vector with 4 elements
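A few sanity checks on disorder strings can be sketched in base R. The two proteins below are invented, and the checks (matching IDs, matching lengths, only 0/1 characters) are assumptions about what well-formed input looks like rather than validation performed by the package.

```r
# Toy sequences and disorder strings (invented values).
sequences         <- c(GENE1 = "MSTPAKRLSQ", GENE2 = "MKV")
sequence_disorder <- c(GENE1 = "0001111000", GENE2 = "010")

# Same protein IDs, in the same order.
same_ids <- identical(names(sequences), names(sequence_disorder))
# One disorder character per residue.
same_length <- all(nchar(sequences) == nchar(sequence_disorder))
# Only the characters 0 and 1 are used.
only_binary <- all(grepl("^[01]+$", sequence_disorder))

# Per-residue view of one protein: split both strings into characters.
per_residue <- data.frame(
  residue    = strsplit(sequences[["GENE1"]], "")[[1]],
  disordered = strsplit(sequence_disorder[["GENE1"]], "")[[1]] == "1"
)
```

In this toy example residues 4 to 7 of GENE1 are flagged as disordered.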
Example protein sequences for ActiveDriver
Description
A dataset containing the sequences of four proteins.
Usage
data(ActiveDriver_data)
Format
A named character vector with 4 elements