https://github.com/tdhock/FLOPART/workflows/R-CMD-check/badge.svg
Functional Labeled Optimal Partitioning (FLOPART) is an optimal peak detection algorithm with label constraints. The dynamic programming algorithm computes a segmentation and corresponding set of peaks which minimizes the penalized Poisson loss (for non-negative integer count data), subject to the following constraints:
if(require("remotes"))install.packages("remotes")
remotes::install_github("tdhock/FLOPART")The main driver function is FLOPART, which takes three arguments *
coverage is a data table with columns chromStart, chromEnd,
count. * label is a data table with columns chromStart,
chromEnd, annotation. * penalty is a non-negative numeric
value, larger for fewer changepoints/peaks.
The constraints for the different label types are as follows:
For a simple real data example,
> library(data.table) # for print method.
> data("Mono27ac.simple", package="FLOPART")
> Mono27ac.simple
$coverage
      chrom chromStart chromEnd count
   1: chr11     145000   146765     0
   2: chr11     146765   146807     1
   3: chr11     146807   175254     0
   4: chr11     175254   175296     1
   5: chr11     175296   175738     0
  ---                                
2975: chr11     326980   326981     5
2976: chr11     326981   326983     6
2977: chr11     326983   326985     5
2978: chr11     326985   326992     4
2979: chr11     326992   327000     3
$label
   chrom chromStart chromEnd annotation
1: chr11     180000   200000    noPeaks
2: chr11     208000   220000    peakEnd
3: chr11     300000   308250  peakStart
4: chr11     308260   320000    peakEndTo run FLOPART on these data,
fit <- with(Mono27ac.simple, FLOPART::FLOPART(coverage, label, penalty=1400))The fit is a list of results; we can use
head to look at the first few items in each of these
results.
> lapply(fit, head)
$coverage_dt
   chromStart chromEnd count weight
1:     145000   146765     0   1765
2:     146765   146807     1     42
3:     146807   175254     0  28447
4:     175254   175296     1     42
5:     175296   175738     0    442
6:     175738   175780     1     42
$label_dt
   chromStart chromEnd annotation type firstRow lastRow
1:     180000   200000    noPeaks    0       14     118
2:     208000   220000    peakEnd   -1      725    1322
3:     300000   308250  peakStart    1     2723    2770
4:     308260   320000    peakEnd   -1     2773    2871
$cost_mat
           [,1]       [,2]
[1,] 0.00000000 0.00000000
[2,] 0.11067717 0.11067717
[3,] 0.01052251 0.01052251
[4,] 0.01909784 0.01909784
[5,] 0.01886280 0.01886280
[6,] 0.02660139 0.02660139
$intervals_mat
     [,1] [,2]
[1,]    1    1
[2,]    2    1
[3,]    3    3
[4,]    3    5
[5,]    4    4
[6,]    4    5
$segments_dt
          mean firstRow lastRow state chromStart chromEnd     status
1:  0.06556501        1     197     0     145000   206725 background
2: 13.17743878      198    1132  2986     206725   209216       peak
3:  0.22431609     1133    1423     0     209216   236120 background
4:  7.38781362     1424    1792  2986     236120   237515       peak
5:  0.19459495     1793    2079     0     237515   267598 background
6:  3.68970814     2080    2552  2986     267598   270853       peakThe main output of the algorithm is segments_dt (shown
above) which is a data table of optimal segment means, given the penalty
and constraints. Each row of this data table describes a segment in the
optimal model:
Other outputs which could be of interest:
cost_mat is the optimal cost computed by dynamic
programming at each data point (row) in each state (column).intervals_mat is the number of intervals (function
pieces) stored in the piecewise cost function at each data point (row)
in each state (column). This is useful for empirical analysis of time
complexity, since the computation time is linear in the number of
intervals.coverage_dt is a modified version of the initial
coverage data set. For example there will be additional rows if there
are labels with start/end values that are not present in the initial
coverage.label_dt has additional columns firstRow/lastRow which
correspond to the start/end positions of the labels, in units of the
rows in coverage_dt.