| Type: | Package | 
| Title: | Amos Tanay's Group High Performance Statistical Utilities | 
| Version: | 2.3.28 | 
| Author: | Michael Hoichman [aut], Aviezer Lifshitz [aut, cre] | 
| Maintainer: | Aviezer Lifshitz <aviezer.lifshitz@weizmann.ac.il> | 
| Description: | A collection of high performance utilities to compute distance, correlation, auto correlation, clustering and other tasks. Contains graph clustering algorithm described in "MetaCell: analysis of single-cell RNA-seq data using K-nn graph partitions" (Yael Baran, Akhiad Bercovich, Arnau Sebe-Pedros, Yaniv Lubling, Amir Giladi, Elad Chomsky, Zohar Meir, Michael Hoichman, Aviezer Lifshitz & Amos Tanay, 2019 <doi:10.1186/s13059-019-1812-2>). | 
| License: | GPL-2 | 
| BugReports: | https://github.com/tanaylab/tgstat/issues | 
| URL: | https://tanaylab.github.io/tgstat/ | 
| Depends: | R (≥ 3.5.0) | 
| Imports: | utils | 
| Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) | 
| VignetteBuilder: | knitr | 
| Encoding: | UTF-8 | 
| LazyLoad: | yes | 
| NeedsCompilation: | yes | 
| OS_type: | unix | 
| RoxygenNote: | 7.3.2 | 
| Config/testthat/edition: | 3 | 
| Packaged: | 2024-09-23 19:46:25 UTC; aviezerl | 
| Repository: | CRAN | 
| Date/Publication: | 2024-09-23 22:30:02 UTC | 
tgstat: Amos Tanay's Group High Performance Statistical Utilities
Description
A collection of high performance utilities to compute distance, correlation, auto correlation, clustering and other tasks. Contains graph clustering algorithm described in "MetaCell: analysis of single-cell RNA-seq data using K-nn graph partitions" (Yael Baran, Akhiad Bercovich, Arnau Sebe-Pedros, Yaniv Lubling, Amir Giladi, Elad Chomsky, Zohar Meir, Michael Hoichman, Aviezer Lifshitz & Amos Tanay, 2019 doi:10.1186/s13059-019-1812-2).
'tgstat' package is intended to help users to efficiently analyze data in time-patient space.
Details
For a complete list of help resources, use library(help = "tgstat").
More information about the options can be found in 'User manual' of the package.
Author(s)
Maintainer: Aviezer Lifshitz aviezer.lifshitz@weizmann.ac.il
Authors:
- Michael Hoichman misha@hoichman.com 
See Also
Useful links:
Calculates correlation or auto-correlation
Description
Calculates correlation between two matrices columns or auto-correlation between a matrix columns.
Usage
tgs_cor(
  x,
  y = NULL,
  pairwise.complete.obs = FALSE,
  spearman = FALSE,
  tidy = FALSE,
  threshold = 0
)
tgs_cor_knn(
  x,
  y,
  knn,
  pairwise.complete.obs = FALSE,
  spearman = FALSE,
  threshold = 0
)
Arguments
| x | numeric matrix | 
| y | numeric matrix | 
| pairwise.complete.obs | see below | 
| spearman | if 'TRUE' Spearman correlation is computed, otherwise Pearson | 
| tidy | if 'TRUE' data is outputed in tidy format | 
| threshold | absolute threshold above which values are outputed in tidy format | 
| knn | the number of highest correlations returned per column | 
Details
'tgs_cor' is very similar to 'stats::cor'. Unlike the latter it uses all available CPU cores to compute the correlation in a much faster way. The basic implementation of 'pairwise.complete.obs' is also more efficient giving overall great run-time advantage.
Unlike 'stats::cor' 'tgs_cor' implements only two modes of treating data containing NA, which are equivalent to 'use="everything"' and 'use="pairwise.complete.obs". Please refer the documentation of this function for more details.
'tgs_cor(x, y, spearman = FALSE)' is equivalent to 'cor(x, y, method = "pearson")' 'tgs_cor(x, y, spearman = TRUE)' is equivalent to 'cor(x, y, method = "spearman")' 'tgs_cor(x, y, pairwise.complete.obs = TRUE, spearman = TRUE)' is equivalent to 'cor(x, y, use = "pairwise.complete.obs", method = "spearman")' 'tgs_cor(x, y, pairwise.complete.obs = TRUE, spearman = FALSE)' is equivalent to 'cor(x, y, use = "pairwise.complete.obs", method = "pearson")'
'tgs_cor' can output its result in "tidy" format: a data frame with three columns named 'col1', 'col2' and 'cor'. Only the correlation values which abs are equal or above the 'threshold' are reported. For auto-correlation (i.e. when 'y=NULL') a pair of columns numbered X and Y is reported only if X < Y.
'tgs_cor_knn' works similarly to 'tgs_cor'. Unlike the latter it returns only the highest 'knn' correlations for each column in 'x'. The result of 'tgs_cor_knn' is always outputed in "tidy" format.
One of the reasons to opt 'tgs_cor_knn' over a pair of calls to 'tgs_cor' and 'tgs_knn' is the reduced memory consumption of the former. For auto-correlation case (i.e. 'y=NULL') given that the number of columns NC exceeds the number of rows NR, then 'tgs_cor' memory consumption becomes a factor of NCxNC. In contrast 'tgs_cor_knn' would consume in the similar scenario a factor of max(NCxNR,NCxKNN). Similarly 'tgs_cor(x,y)' would consume memory as a factor of NCXxNCY, wherever 'tgs_cor_knn(x,y,knn)' would reduce that to max((NCX+NCY)xNR,NCXxKNN).
Value
'tgs_cor_knn' or 'tgs_cor' with 'tidy=TRUE' return a data frame,
where each row represents correlation between two pairs of columns from 'x'
and 'y' (or two columns of 'x' itself if 'y==NULL'). 'tgs_cor' with the
'tidy=FALSE' returns a matrix of correlation values, where val[X,Y]
represents the correlation between columns X and Y of the input matrices (if
'y' is not 'NULL') or the correlation between columns X and Y of 'x' (if 'y'
is 'NULL').
Examples
# Note: all the available CPU cores might be used
set.seed(seed = 0)
rows <- 100
cols <- 1000
vals <- sample(1:(rows * cols / 2), rows * cols, replace = TRUE)
m <- matrix(vals, nrow = rows, ncol = cols)
m[sample(1:(rows * cols), rows * cols / 1000)] <- NA
r1 <- tgs_cor(m, spearman = FALSE)
r2 <- tgs_cor(m, pairwise.complete.obs = TRUE, spearman = TRUE)
r3 <- tgs_cor_knn(m, NULL, 5, spearman = FALSE)
Calculates distances between the matrix rows
Description
Calculates distances between the matrix rows.
Usage
tgs_dist(x, diag = FALSE, upper = FALSE, tidy = FALSE, threshold = Inf)
Arguments
| x | numeric matrix | 
| diag | see 'dist' documentation | 
| upper | see 'dist' documentation | 
| tidy | if 'TRUE' data is outputed in tidy format | 
| threshold | threshold below which values are outputed in tidy format | 
Details
This function is very similar to 'package:stats::dist'. Unlike the latter it uses all available CPU cores to compute the distances in a much faster way.
Unlike 'package:stats::dist' 'tgs_dist' uses always "euclidean" metrics (see 'method' parameter of 'dist' function). Thus:
'tgs_dist(x)' is equivalent to 'dist(x, method = "euclidean")'
'tgs_dist' can output its result in "tidy" format: a data frame with three columns named 'row1', 'row2' and 'dist'. Only the distances that are less or equal than the 'threshold' are reported. Distance between row number X and Y is reported only if X < Y. 'diag' and 'upper' parameters are ignored when the result is returned in "tidy" format.
Value
If 'tidy' is 'FALSE' - the output is similar to that of 'dist' function. If 'tidy' is 'TRUE' - 'tgs_dist' returns a data frame, where each row represents distances between two pairs of original rows.
Examples
# Note: all the available CPU cores might be used
set.seed(seed = 0)
rows <- 100
cols <- 1000
vals <- sample(1:(rows * cols / 2), rows * cols, replace = TRUE)
m <- matrix(vals, nrow = rows, ncol = cols)
m[sample(1:(rows * cols), rows * cols / 1000)] <- NA
r <- tgs_dist(m)
Checks whether all the elements of the vector are finite
Description
Checks whether all the elements of the vector are finite.
Usage
tgs_finite(x)
Arguments
| x | numeric or integer vector or matrix | 
Details
'tgs_finite' returns 'TRUE' if all the elements of 'x' are finite numbers. (See: 'is.finite'.)
Value
'TRUE' if all the elements of 'x' are finite, otherwise 'FALSE'.
Examples
tgs_finite(1:10)
tgs_finite(c(1:10, NaN))
tgs_finite(c(1:10, Inf))
Builds directed graph of correlations
Description
Builds directed graph of correlations where the nodes are the matrix columns.
Usage
tgs_graph(x, knn, k_expand, k_beta = 3)
Arguments
| x | see below | 
| knn | maximal node degree | 
| k_expand | see below | 
| k_beta | see below | 
Details
This function builds a directed graph based on the edges in 'x' and their ranks.
'x' is a data frame containing 4 columns named: 'col1', 'col2', 'val', 'rank'. The third column ('val' can have a different name). The result in the compatible format is returned, for example, by 'tgs_knn' function.
'tgs_graph' prunes some of the edges in 'x' based on the following steps:
- A pair of columns i, j that appears in 'x' in 'col1', 'col2' implies the edge in the graph from i to j: edge(i,j). Let the rank of i and j be rank(i,j). 
- Calculate symmetrised rank of i and j: sym_rank(i,j) = rank(i,j) * rank(j,i). If one of the ranks is missing from the previous result sym_rank is set to NA. 
- Prune the edges: remove edge(i,j) if sym_rank(i,j) == NA OR sym_rank(i,j) < knn * knn * k_expand 
- Prune excessive incoming edges: remove edge(i,j) if more than knn * k_beta edges of type edge(node,j) exist and sym_rank(i,j) is higher than sym_rank(node,j) for node != j. 
- Prune excessive outgoing edges: remove edge(i,j) if more than knn edges of type edge(i,node) exist and sym_rank(i,j) is higher than sym_rank(i,node) for node != i. 
Value
The graph edges are returned in a data frame, with the weight of each edge. edge(i,j) receives weight 1 if its sym_rank is the lowest among all edges of type edge(i,node). Formally defined: weight(i,j) = 1 - (place(i,j) - 1) / knn, where place(i,j) is the location of edge(i,j) within the sorted set of edges outgoing from i, i.e. edge(i,node). The sort is done by sym_rank of the edges.
Examples
# Note: all the available CPU cores might be used
set.seed(seed = 1)
rows <- 100
cols <- 1000
vals <- sample(1:(rows * cols / 2), rows * cols, replace = TRUE)
m <- matrix(vals, nrow = rows, ncol = cols)
m[sample(1:(rows * cols), rows * cols / 1000)] <- NA
r1 <- tgs_cor(m, pairwise.complete.obs = FALSE, spearman = TRUE)
r2 <- tgs_knn(r1, knn = 30, diag = FALSE)
r3 <- tgs_graph(r2, knn = 3, k_expand = 10)
Clusters directed graph
Description
Clusters directed graph.
Usage
tgs_graph_cover(graph, min_cluster_size, cooling = 1.05, burn_in = 10)
Arguments
| graph | directed graph in the format returned by tgs_graph | 
| min_cluster_size | used to determine the candidates for seeding (= min weight) | 
| cooling | factor that is used to gradually increase the chance of a node to stay in the cluster | 
| burn_in | number of node reassignments after which cooling is applied | 
Details
The algorithm is explained in a "MetaCell: analysis of single-cell RNA-seq data using K-nn graph partitions" paper, published in "Genome Biology" #20: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1812-2
Value
Data frame that maps each node to its cluster.
See Also
Examples
# Note: all the available CPU cores might be used
set.seed(seed = 0)
rows <- 100
cols <- 1000
vals <- sample(1:(rows * cols / 2), rows * cols, replace = TRUE)
m <- matrix(vals, nrow = rows, ncol = cols)
m[sample(1:(rows * cols), rows * cols / 1000)] <- NA
r1 <- tgs_cor(m, pairwise.complete.obs = FALSE, spearman = TRUE)
r2 <- tgs_knn(r1, knn = 30, diag = FALSE)
r3 <- tgs_graph(r2, knn = 3, k_expand = 10)
r4 <- tgs_graph_cover(r3, 5)
Clusters directed graph multiple times with randomized sample subset
Description
Clusters directed graph multiple times with randomized sample subset.
Usage
tgs_graph_cover_resample(
  graph,
  knn,
  min_cluster_size,
  cooling = 1.05,
  burn_in = 10,
  p_resamp = 0.75,
  n_resamp = 500,
  method = "hash"
)
Arguments
| graph | directed graph in the format returned by tgs_graph | 
| knn | maximal number of edges used per node for each sample subset | 
| min_cluster_size | used to determine the candidates for seeding (= min weight) | 
| cooling | factor that is used to gradually increase the chance of a node to stay in the cluster | 
| burn_in | number of node reassignments after which cooling is applied | 
| p_resamp | fraction of total number of nodes used in each sample subset | 
| n_resamp | number iterations the clustering is run on different sample subsets | 
| method | method for calculating co_cluster and co_sample; valid values: "hash", "full", "edges" | 
Details
The algorithm is explained in a "MetaCell: analysis of single-cell RNA-seq data using K-nn graph partitions" paper, published in "Genome Biology" #20: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1812-2
Value
If method == "hash", a list with two members. The first member is a data frame with 3 columns: "node1", "node2" and "cnt". "cnt" indicates the number of times "node1" and "node2" appeared in the same cluster. The second member of the list is a vector of number of nodes length reflecting how many times each node was used in the subset.
If method == "full", a list containing two matrices: co_cluster and co_sample.
If method == "edges", a list containing two data frames: co_cluster and co_sample.
See Also
Examples
# Note: all the available CPU cores might be used
set.seed(seed = 0)
rows <- 100
cols <- 200
vals <- sample(1:(rows * cols / 2), rows * cols, replace = TRUE)
m <- matrix(vals, nrow = rows, ncol = cols)
r1 <- tgs_cor(m, pairwise.complete.obs = FALSE, spearman = TRUE)
r2 <- tgs_knn(r1, knn = 20, diag = FALSE)
r3 <- tgs_graph(r2, knn = 3, k_expand = 10)
r4 <- tgs_graph_cover_resample(r3, 10, 1)
Returns k highest values of each column
Description
Returns k highest values of each column.
Usage
tgs_knn(x, knn, diag = FALSE, threshold = 0)
Arguments
| x | numeric matrix or data frame (see below) | 
| knn | the number of highest values returned per column | 
| diag | if 'F' values of row 'i' and col 'j' are skipped for each i == j | 
| threshold | filter out values lower than threshold | 
Details
'tgs_knn' returns the highest 'knn' values of each column of 'x' (if 'x' is a matrix). 'x' can be also a sparse matrix given in a data frame of 'col', 'row', 'value' format.
'NA' and 'Inf' values are skipped as well as the values below 'threshold'. If 'diag' is 'F' values of the diagonal (row == col) are skipped too.
Value
A sparse matrix in a data frame format with 'col1', 'col2', 'val' and 'rank' columns. 'col1' and 'col2' represent the column and the row number of 'x'.
Examples
# Note: all the available CPU cores might be used
set.seed(seed = 1)
rows <- 100
cols <- 1000
vals <- sample(1:(rows * cols / 2), rows * cols, replace = TRUE)
m <- matrix(vals, nrow = rows, ncol = cols)
m[sample(1:(rows * cols), rows * cols / 1000)] <- NA
r <- tgs_knn(m, 3)
For each matrix row apply a function over a ragged array
Description
For each matrix row apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.
Usage
tgs_matrix_tapply(x, index, fun, ...)
Arguments
| x | a matrix or a sparse matrix of 'dgCMatrix' type | 
| index | a 'list' of one or more 'factor's, each of same length as the number of columns in 'x'. The elements are coerced to factors by 'as.factor'. | 
| fun | the function to be applied | 
| ... | optional arguments to 'fun' | 
Details
'tgs_matrix_tapply(x, index, fun)' is essentialy an efficient implementation of 'apply(mat, 1, function(x) tapply(x, index, fun))'.
Value
A matrix of length(index) X nrow(x) size. Each [i,j] element
represents the result of applying 'fun' to
x[i,which(index==levels(index)[j])]. 
Note that the return value is a dense matrix even when x is sparse.
Examples
# Note: all the available CPU cores might be used
set.seed(seed = 1)
nr <- 6
nc <- 10
mat <- matrix(sample(c(rep(0, 6), 1:3), nr * nc, replace = TRUE), nrow = nr, ncol = nc)
index <- factor(rep_len(1:3, ncol(mat)), levels = 0:5)
r1 <- apply(mat, 1, function(x) tapply(x, index, sum))
r2 <- tgs_matrix_tapply(mat, index, sum)