Title: | Clustering of Micro Panel Data |
Version: | 0.8.1 |
Description: | Two-step feature-based clustering method designed for micro panel (longitudinal) data with the artificial panel data generator. See Sobisek, Stachova, Fojtik (2018) <doi:10.48550/arXiv.1807.05926>. |
URL: | https://arxiv.org/ftp/arxiv/papers/1807/1807.05926.pdf |
Depends: | R (≥ 3.4.0) |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.1.1 |
Imports: | MASS, ggplot2 (≥ 3.0.0), dplyr (≥ 0.7.6), NbClust (≥ 3.0), amap (≥ 0.8-16), tableone, stats, data.table, rlang |
Suggests: | knitr, rmarkdown |
NeedsCompilation: | no |
Packaged: | 2020-11-26 19:53:06 UTC; 9afoj |
Author: | Jan Fojtik [aut, cre], Anna Grishko [aut], Lukas Sobisek [aut, cph, rev] |
Maintainer: | Jan Fojtik <9afojtik@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2020-11-27 02:20:29 UTC |
Cluster Micro-Panel (longitudinal) Data employing the CluMP algorithm
Description
This function clusters micro-panel (longitudinal) data (or trajectories) into a pre-defined number of clusters by employing the feature-based clustering algorithm for micro-panel (longitudinal) data called CluMP (see Source). Currently, only univariate clustering analysis is available.
Usage
CluMP(formula, group, data, cl_numb = NA, base_val = FALSE, method = "ward.D")
Arguments
formula |
A two-sided |
group |
A grouping factor variable (vector), i.e. single identifier for each individual (trajectory). |
data |
A data frame containing the variables named in the |
cl_numb |
An integer, positive number (scalar) specifying the number of clusters. The |
base_val |
Indicates whether to include the value at the zero time point as an additional clustering variable. Default is FALSE, in which case the standard number (7) of clustering parameters is used. |
method |
The method used in hierarchical clustering, same as in |
Value
Clusters micro-panel data. The output is a list of 5 components containing the clustering results.
Source
Sobisek, L., Stachova, M., Fojtik, J. (2018) Novel Feature-Based Clustering of Micro-Panel Data (CluMP). Working paper version online: www.arxiv.org
Examples
data <- GeneratePanel(n = 100, Param = ParamLinear, NbVisit = 10)
CluMP(formula = Y ~ Time, group = "ID", data = data, cl_numb = 3,
base_val = FALSE, method = "ward.D")
CluMP(formula = Y ~ Time, group = "ID", data = data, cl_numb = 3,
base_val = TRUE, method = "ward.D")
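The two-step idea behind feature-based clustering can be sketched in base R: each trajectory is first reduced to a small feature vector, and the feature vectors are then clustered hierarchically. This is only an illustration under stated assumptions. The three features below (mean level, least-squares slope, total change) are chosen for simplicity and are not the seven CluMP clustering parameters, and the names `toy` and `traj_features` are hypothetical.

```r
# Toy long-format panel: 6 individuals, 5 visits each (hypothetical data).
set.seed(1)
toy <- data.frame(
  ID   = rep(1:6, each = 5),
  Time = rep(0:4, times = 6)
)
# Individuals 1-3 rise over time, individuals 4-6 fall.
true_slope <- ifelse(toy$ID <= 3, 2, -2)
toy$Y <- 10 + true_slope * toy$Time + rnorm(nrow(toy), sd = 0.1)

# Step 1: reduce each trajectory to a feature vector.
# These three features are illustrative, NOT the CluMP parameters.
traj_features <- function(d) {
  d <- d[order(d$Time), ]
  c(mean_y = mean(d$Y),
    slope  = unname(coef(lm(Y ~ Time, data = d))[2]),
    change = d$Y[nrow(d)] - d$Y[1])
}
feats <- t(sapply(split(toy, toy$ID), traj_features))

# Step 2: hierarchical clustering of the standardised features.
hc <- hclust(dist(scale(feats)), method = "ward.D")
clusters <- cutree(hc, k = 2)
print(clusters)
```

Here `method = "ward.D"` mirrors the default of the `method` argument, and `k` plays the role of `cl_numb`.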
Summary characteristics of identified clusters via CluMP
Description
The function CluMP_profiles provides a description (profile) for each cluster. The description is in the form of a summary list containing descriptive statistics of a cluster variable, time variable, cluster parameters and other variables (covariates), both continuous and categorical.
Usage
CluMP_profiles(CluMPoutput, cont_vars = NULL, cat_vars = NULL, show_NA = FALSE)
Arguments
CluMPoutput |
An object (output) from the |
cont_vars |
An optional single character or a character vector of continuous variables' names (from the original dataset). |
cat_vars |
An optional single character or a character vector of categorical variables' names (from the original dataset). |
show_NA |
Logical scalar. Should descriptive statistics be calculated and shown for the NA cluster, if it exists? Default is FALSE. The NA cluster gathers individuals that are unsuitable for longitudinal clustering (trajectories with fewer than 3 non-missing observations). |
Value
Returns a list with a summary of the cluster variable (Y), both baseline and changes; a summary of time and of the number of observations (visits); a summary of the clustering parameters; and, optionally, a summary of continuous variables (baseline and changes) and of categorical variables (baseline and end).
Examples
set.seed(123)
dataMale <- GeneratePanel(n = 50, Param = ParamLinear, NbVisit = 10)
dataMale$Gender <- "M"
dataFemale <- GeneratePanel(n = 50, Param = ParamLinear, NbVisit = 10)
dataFemale$ID <- dataFemale$ID + 50
dataFemale$Gender <- "F"
data <- rbind(dataMale, dataFemale)
CluMPoutput <- CluMP(formula = Y ~ Time, group = "ID", data = data, cl_numb = 3)
CluMP_profiles(CluMPoutput, cat_vars = "Gender")
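The rule behind the NA cluster (trajectories with fewer than 3 non-missing observations) can be checked on raw data before clustering. The sketch below is a base R illustration, not the package's internal code; `toy`, `n_obs` and `improper_ids` are hypothetical names.

```r
# Hypothetical long-format data: individual 3 has only 2 non-missing values.
toy <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  Y  = c(5.1, 5.4, 5.9, 7.0, NA, 6.8, 6.5, 4.2, NA, 4.0)
)

# Count non-missing observations per individual.
n_obs <- tapply(!is.na(toy$Y), toy$ID, sum)

# Individuals below the threshold of 3 would fall into the NA cluster.
improper_ids <- names(n_obs)[n_obs < 3]
print(n_obs)
print(improper_ids)
```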
Cluster profiles' (CluMP results) visualisation
Description
This graphical function visualises cluster profiles (mean representatives of each cluster). Three types of plots are available: non-parametric representatives (mean within-cluster trajectories) with or without all individual (original) trajectories, and non-parametric mean trajectories with error bars. The non-parametric trajectories are fitted by the LOESS method for small or medium data, or by the GAM method for complex data of large size; both methods are applied from ggplot2.
Usage
CluMP_view(
CluMPoutput,
type = "all",
nb_intervals = NULL,
return_table = FALSE,
title = NULL,
x_title = NULL,
y_title = NULL,
plot_NA = FALSE
)
Arguments
CluMPoutput |
An object (output) from the |
type |
String. Indicates which type of graph is required. Possible values for this argument are: "all" (plots all data with non-parametric mean trajectories), "cont" (only non-parametric mean trajectories) or "breaks" (mean trajectories with error bars). |
nb_intervals |
A positive integer (scalar) specifying the number of regular timepoints into which the follow-up period should be split. This argument works only with graph type = "breaks"; for other graph types it is ignored. The number of error bars is equal to the number of timepoints specified by this argument. |
return_table |
Logical scalar indicating if the summary table of plotted values in the graph of type = "breaks" should be returned. Default is FALSE. |
title |
String. An optional title for the plot. If undefined, no title is used. |
x_title |
String. An optional title for x axis. If undefined, the variable name after ~ in |
y_title |
String. An optional title for y axis. If undefined, the variable name before ~ in |
plot_NA |
Logical scalar. Should the NA cluster be plotted, if it exists? Default is FALSE. The NA cluster gathers individuals that are unsuitable for longitudinal clustering (fewer than 3 observations). |
Value
Returns a graph for type "all" and "cont", or a graph together with a table of mean trajectories (a list, if the table is requested) for type = "breaks".
Examples
set.seed(123)
dataMale <- GeneratePanel(n = 50, Param = ParamLinear, NbVisit = 10)
dataMale$Gender <- "M"
dataFemale <- GeneratePanel(n = 50, Param = ParamLinear, NbVisit = 10)
dataFemale$ID <- dataFemale$ID + 50
dataFemale$Gender <- "F"
data <- rbind(dataMale, dataFemale)
CluMPoutput <- CluMP(formula = Y ~ Time, group = "ID", data = data, cl_numb = 3)
title <- "Plotting clusters' representatives with error bars"
CluMP_view(CluMPoutput, type = "all", return_table = TRUE)
CluMP_view(CluMPoutput, type = "cont")
CluMP_view(CluMPoutput, type = "breaks", nb_intervals = 5, return_table = TRUE, title = title)
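The summary behind a "breaks" plot can be approximated in base R: the follow-up period is split into `nb_intervals` regular intervals, and a mean with an error measure is computed per cluster and interval. This is a sketch under assumptions. The error bar is taken to be one standard error, which this section does not confirm, and `toy` and `summ` are hypothetical names.

```r
set.seed(2)
# Hypothetical clustered long-format data with a known cluster label.
toy <- data.frame(
  Time    = runif(200, min = 0, max = 100),
  Cluster = rep(c("A", "B"), each = 100)
)
toy$Y <- ifelse(toy$Cluster == "A", 0.1, -0.1) * toy$Time + rnorm(200)

nb_intervals <- 5
# Split the follow-up period into regular intervals.
breaks <- seq(min(toy$Time), max(toy$Time), length.out = nb_intervals + 1)
toy$Interval <- cut(toy$Time, breaks = breaks, include.lowest = TRUE)

# Mean and standard error of Y per cluster and interval.
# The standard error is an assumed choice of error bar.
summ <- aggregate(Y ~ Cluster + Interval, data = toy,
                  FUN = function(y) c(mean = mean(y),
                                      se = sd(y) / sqrt(length(y))))
# Flatten the matrix column produced by aggregate() into Y.mean and Y.se.
summ <- do.call(data.frame, summ)
print(summ)
```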
Generate an artificial Micro-Panel (longitudinal) Data
Description
This function creates artificial linear or non-linear micro-panel (longitudinal) data coming from a generating process with a given function (linear, quadratic, cubic, exponential) and a set of parameters (fixed and random (intercept, slope) effects of time).
Usage
GeneratePanel(
n,
Param,
NbVisit,
VisitFreq = NULL,
TimeVar = NULL,
RegModel = NULL,
ClusterProb = NULL,
Rho = NULL,
units = NULL
)
Arguments
n |
An integer specifying the number of individuals (trajectories) being observed. |
Param |
Object of |
NbVisit |
A positive integer numeric input defining expected number of visits. Option is Fixed or Random.
Number of visits given by the argument |
VisitFreq |
String that defines the frequency of visits for each individual. Option is Random or Fixed. If set to Fixed or not defined, each individual has the same number of visits given by |
TimeVar |
A positive integer giving the variability, in days, of the timing of each repeated measurement (timepoint) around the regular, fixed visit schedule given by the argument units. For example, if this argument is set to 5, a random integer from the interval -5 to 5 is drawn and added to the time variable. TimeVar must be lower than the regular interval between repeated measurements given by the argument units. |
RegModel |
String specifying the mathematical function for generating trajectory for each of n individuals. Options are linear, quadratic, cubic or exponential. If set to linear or not defined, then each trajectory has a linear trend. If set to quadratic, then each trajectory has a quadratic development in time. If set to cubic then each trajectory has cubic development. If set to exponential, then each trajectory has exponential development. |
ClusterProb |
Numeric scalar (for 2 clusters) or a vector of numbers (for >2 clusters) defining the probability of each cluster. If not defined, then each cluster has the same occurrence probability. |
Rho |
A numeric scalar specifying the autocorrelation parameter, with values in the range 0 to 1. If set to 0 or not defined, there is no autocorrelation between the within-individual repeated observations. |
units |
String defining the units of time series. Options are day, week, month or year. |
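The effect of TimeVar and ClusterProb described above can be sketched in base R. A random integer between -TimeVar and TimeVar is added to each regular visit time, and cluster membership is drawn with the given probabilities. The weekly schedule expressed in days and all variable names are assumptions made for this illustration; the package's internal implementation may differ.

```r
set.seed(3)
n_visits <- 6
time_var <- 2                      # TimeVar: at most 2 days off schedule
step     <- 7                      # assumed weekly schedule, in days

# Regular visit days, then a random integer shift in [-time_var, time_var].
regular  <- seq(0, by = step, length.out = n_visits)
shift    <- sample(-time_var:time_var, size = n_visits, replace = TRUE)
observed <- regular + shift

# With a shift of at most 2 days on a 7-day schedule, visit order is kept.
print(observed)

# Cluster membership drawn with unequal probabilities (75% / 25%).
cluster_prob <- c(0.75, 0.25)
membership <- sample(seq_along(cluster_prob), size = 1000,
                     replace = TRUE, prob = cluster_prob)
print(prop.table(table(membership)))
```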
Value
Generates artificial panel data.
Examples
set.seed(123)
#Simple Linear model where each individual has 10 observations.
data <- GeneratePanel(n = 100, Param = ParamLinear, NbVisit = 10)
#Exponential model where each individual has 10 observations.
data <- GeneratePanel(100, ParamExpon, NbVisit = 10, VisitFreq = "Fixed", RegModel = "exponential")
PanelPlot(data)
#Cubic model where each individual has random number of observations on daily basis.
#Average number of observation is given by parameter NbVisit.
data <- GeneratePanel(n = 100, Param = ParamCubic, NbVisit = 100, RegModel = "cubic", units = "day")
PanelPlot(data)
#Quadratic model where each individual has random number of observations.
#Each object is observed weekly with a variability of 2 days.
data <- GeneratePanel(n = 5, Param = ParamQuadrat, NbVisit = 50, RegModel = "quadratic", units = "week", TimeVar = 2)
PanelPlot(data)
#Generate panel data with linear trend with 75% objects in first cluster and 25% in the second.
data <- GeneratePanel(n = 100, Param = ParamLinear, NbVisit = 10, ClusterProb = c(0.75, 0.25))
PanelPlot(data, colour = "Cluster")
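One common way to obtain autocorrelated within-individual residuals is a first-order autoregressive, AR(1), process. The sketch below shows how a parameter such as Rho could act in that setting. The AR(1) structure is an assumption, since this section does not state how Rho enters the generating process, and the variable names are hypothetical.

```r
set.seed(6)
rho   <- 0.8      # Rho: assumed AR(1) coefficient
sd_e  <- 1        # marginal standard deviation of the residuals
n_obs <- 2000     # long series so the sample autocorrelation is stable

# AR(1) recursion: e[t] = rho * e[t - 1] + innovation[t].
# The innovation variance is scaled so that var(e[t]) stays at sd_e^2.
innov <- rnorm(n_obs, sd = sd_e * sqrt(1 - rho^2))
e <- numeric(n_obs)
e[1] <- rnorm(1, sd = sd_e)
for (t in 2:n_obs) {
  e[t] <- rho * e[t - 1] + innov[t]
}

# The lag-1 sample autocorrelation should be close to rho.
lag1 <- cor(e[-1], e[-n_obs])
print(round(lag1, 2))
```

With `rho <- 0` the recursion reduces to independent residuals, matching the documented behaviour when Rho is 0 or not defined.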
Finding an optimal number of clusters
Description
This function finds the optimal number of clusters based on evaluation criteria (indices) available from the NbClust package.
Usage
OptiNum(
formula,
group,
data,
index = c("silhouette", "ch", "db"),
max_clust = 10,
base_val = FALSE
)
Arguments
formula |
A two-sided |
group |
A grouping factor variable (vector), i.e. single identifier for each individual (trajectory). |
data |
A data frame containing the variables named in |
index |
String vector of indices to be computed. Default is c("silhouette", "ch", "db"). See NbClust package for available indices and their description. |
max_clust |
An integer, positive number (scalar) defining the maximum number of clusters to check. Default value of this argument is 10 or maximum number of individuals. |
base_val |
Indicates whether to include the value at the zero time point as an additional clustering variable. Default is FALSE, in which case the standard number (7) of clustering parameters is used. |
Value
Determines the optimal number of clusters and returns a graphical output (the red dot in each plot indicates the number of clusters recommended by that index) together with a table of the indices.
Source
Malika Charrad, Nadia Ghazzali, Veronique Boiteau, Azam Niknafs (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), 1-36. URL http://www.jstatsoft.org/v61/i06/.
Examples
set.seed(123)
data <- GeneratePanel(n = 100, Param = ParamLinear, NbVisit = 10)
OptiNum(data = data, formula = Y ~ Time, group = "ID")
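How an index recommends a number of clusters can be illustrated with the Calinski-Harabasz ("ch") index computed by hand in base R: the index is evaluated for each candidate number of clusters and the maximum is taken. This is a minimal sketch, not the NbClust implementation. The helper `ch_index` and the feature matrix `x` are hypothetical, and the data are simple 2-D points rather than CluMP clustering parameters.

```r
set.seed(4)
# Hypothetical feature matrix: 3 well-separated groups of 20 points in 2-D.
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
x <- centers[rep(1:3, each = 20), ] + matrix(rnorm(120, sd = 0.5), ncol = 2)

# Calinski-Harabasz index: between-cluster over within-cluster dispersion,
# each divided by its degrees of freedom.
ch_index <- function(x, cl) {
  n <- nrow(x)
  k <- length(unique(cl))
  total_ss  <- sum(scale(x, scale = FALSE)^2)
  within_ss <- sum(sapply(unique(cl), function(g) {
    sum(scale(x[cl == g, , drop = FALSE], scale = FALSE)^2)
  }))
  ((total_ss - within_ss) / (k - 1)) / (within_ss / (n - k))
}

# Evaluate the index for 2..max_clust clusters cut from one dendrogram.
hc <- hclust(dist(x), method = "ward.D")
max_clust <- 10
ks <- 2:max_clust
ch <- sapply(ks, function(k) ch_index(x, cutree(hc, k = k)))
best_k <- ks[which.max(ch)]
print(data.frame(k = ks, ch = round(ch, 1)))
print(best_k)
```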
Plot Micro-Panel (longitudinal) Data
Description
This function plots micro-panel (longitudinal) data from a stored data.frame or randomly generated panel data from the GeneratePanel function.
Usage
PanelPlot(
data,
formula = Y ~ Time,
group = "ID",
colour = NA,
mean_traj_all = FALSE,
mean_traj_group = FALSE,
show_legend = TRUE,
title = NULL,
x_title = NULL,
y_title = NULL
)
Arguments
data |
A data frame containing the variables named in |
formula |
A two-sided |
group |
A grouping factor variable (vector), i.e. single identifier for each individual (trajectory). |
colour |
Character, which is a variable's name in data. The trajectories are distinguished by colour according to this variable. |
mean_traj_all |
Logical scalar. It indicates whether to show mean overall trajectory. Default is FALSE. |
mean_traj_group |
Logical scalar. It indicates whether to show mean trajectory by group. Default is FALSE. |
show_legend |
Logical scalar. It indicates whether to show cluster legend. Default is TRUE. |
title |
String. An optional title for the plot. Otherwise, no title is used. |
x_title |
String. An optional title for the x axis. Otherwise, the variable name after ~ in |
y_title |
String. An optional title for the y axis. Otherwise, the variable name before ~ in |
Value
Returns plot using package ggplot2.
Examples
set.seed(123)
dataMale <- GeneratePanel(n = 50, Param = ParamLinear, NbVisit = 10)
dataMale$Gender <- "M"
dataFemale <- GeneratePanel(n = 50, Param = ParamLinear, NbVisit = 10)
dataFemale$ID <- dataFemale$ID + 50
dataFemale$Gender <- "F"
data <- rbind(dataMale, dataFemale)
PanelPlot(data = data, formula = Y ~ Time, group = "ID", colour = "Gender")
PanelPlot(data = data, formula = Y ~ Time, group = "ID", colour = "Gender", mean_traj_all = TRUE)
PanelPlot(data = data, formula = Y ~ Time, group = "ID", colour = "Gender", mean_traj_group = TRUE)
Parameters of cubic model
Description
Default parameters to generate micro-panel (longitudinal) data with cubic trend. The parameters may differ for each cluster. The parameters of each cluster are in rows, so the number of rows denotes the number of clusters. Fixed effects are taken from Allen et al. (2005), and the source for random effects is Uher et al. (2017).
Usage
ParamCubic
Format
It is advised to keep the parameters in a data.frame. The parameter structure is as follows:
- b0
fixed parameter of intercept
- b1
fixed parameter of slope
- b2
fixed parameter defining the quadraticity
- b3
fixed parameter defining the cubicity
- varU0
variance of random factor U0 given to fixed parameter b0
- varU1
variance of random factor U1 given to fixed parameter b1
- corr
correlation between random factors U0 and U1
- varE
the variability of the residuals
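To make the role of the fixed parameters concrete, the sketch below evaluates a cubic mean curve from b0, b1, b2 and b3. The polynomial form is an assumption inferred from the parameter descriptions, and the numeric values are made up for illustration; they are not the values stored in ParamCubic.

```r
# Made-up fixed effects for one cluster (not the values in ParamCubic).
b <- c(b0 = 2, b1 = -1, b2 = 0.5, b3 = 0.1)
time <- 0:4

# Assumed cubic mean curve: b0 + b1*t + b2*t^2 + b3*t^3.
mean_y <- b[["b0"]] + b[["b1"]] * time +
  b[["b2"]] * time^2 + b[["b3"]] * time^3
print(mean_y)
```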
Source
Allen, JS, Bruss, J, Brown, CK, Damasio, H. Normal neuroanatomical variation due to age: the major lobes and a parcellation of the temporal region. Neurobiol Aging. 2005 Oct;26(9):1245-60; discussion 1279-82.
Uher T, Vaneckova M, Krasensky J, Sobisek L, Tyblova M, Volna J, Seidl Z, Bergsland N, Dwyer MG, Zivadinov R, De Stefano N, Sormani MP, Havrdova EK, Horakova D. Pathological cut-offs of global and regional brain volume loss in multiple sclerosis. Mult Scler. 2017 Nov 1:1352458517742739. doi: 10.1177/1352458517742739.
Parameters of exponential model
Description
Default parameters to generate micro-panel (longitudinal) data with exponential trend. The parameters may differ for each cluster. The parameters of each cluster are in rows, so the number of rows denotes the number of clusters. Fixed effects are taken from Jones et al. (2013).
Usage
ParamExpon
Format
It is advised to keep the parameters in a data.frame. The parameter structure is as follows:
- b0
fixed parameter of intercept
- b1
fixed parameter of slope
- b2
fixed parameter defining the decay
- varU0
variance of random factor U0 given to fixed parameter b0
- varU1
variance of random factor U1 given to fixed parameter b1
- corr
correlation between random factors U0 and U1
- varE
the variability of the residuals
Source
Jones BC, Nair G, Shea CD, Crainiceanu CM, Cortese IC, Reich DS. Quantification of multiple-sclerosis-related brain atrophy in two heterogeneous MRI datasets using mixed-effects modeling. Neuroimage Clin. 2013 Aug 13;3:171-9. doi: 10.1016/j.nicl.2013.08.001.
Parameters of linear model
Description
Default parameters to generate micro-panel (longitudinal) data with linear trend. The parameters may differ per each cluster. The parameters of each cluster are in rows. Number of rows denotes the number of clusters. Fixed and random effects are taken from Uher et al. (2017).
Usage
ParamLinear
Format
It is advised to keep the parameters in a data.frame. The parameter structure is as follows:
- b0
fixed parameter of intercept
- b1
fixed parameter of slope
- varU0
variance of random factor U0 given to fixed parameter b0
- varU1
variance of random factor U1 given to fixed parameter b1
- corr
correlation between random factors U0 and U1
- varE
the variability of the residuals
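The parameters listed above map naturally onto a random intercept and slope model, Y = (b0 + U0) + (b1 + U1) * Time + e, where (U0, U1) are bivariate normal with variances varU0 and varU1 and correlation corr, and e has variance varE. The base R sketch below generates data from that model. It is an illustration under this assumed model, not the GeneratePanel source, and the parameter values are made up rather than taken from ParamLinear.

```r
set.seed(5)
# Made-up parameters for one cluster (not the values stored in ParamLinear).
p <- data.frame(b0 = 100, b1 = -0.5, varU0 = 4, varU1 = 0.04,
                corr = 0.3, varE = 0.25)

n     <- 500      # individuals
times <- 0:9      # visit times

# Covariance matrix of the random effects (U0, U1).
cov01 <- p$corr * sqrt(p$varU0 * p$varU1)
Sigma <- matrix(c(p$varU0, cov01, cov01, p$varU1), nrow = 2)

# Draw correlated random effects via the Cholesky factor (base R only).
U <- matrix(rnorm(2 * n), ncol = 2) %*% chol(Sigma)

# Long-format panel: random intercept and slope plus residual noise.
panel <- data.frame(ID   = rep(seq_len(n), each = length(times)),
                    Time = rep(times, times = n))
panel$Y <- (p$b0 + U[panel$ID, 1]) +
  (p$b1 + U[panel$ID, 2]) * panel$Time +
  rnorm(nrow(panel), sd = sqrt(p$varE))

print(head(panel))
```

One row of `p` corresponds to one cluster; adding rows with different values would describe further clusters, in line with the row-per-cluster layout described above.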
Source
Uher T, Vaneckova M, Krasensky J, Sobisek L, Tyblova M, Volna J, Seidl Z, Bergsland N, Dwyer MG, Zivadinov R, De Stefano N, Sormani MP, Havrdova EK, Horakova D. Pathological cut-offs of global and regional brain volume loss in multiple sclerosis. Mult Scler. 2017 Nov 1:1352458517742739. doi: 10.1177/1352458517742739.
Parameters of quadratic model
Description
Parameters to generate panel data with quadratic trend. The parameters may differ per each cluster. The parameters of each cluster are in rows. Number of rows denotes the number of clusters. Fixed effects are taken from Allen et al. (2005), and the source for random effects is Uher et al. (2017).
Usage
ParamQuadrat
Format
It is advised to keep the parameters in a data.frame. The parameter structure is as follows:
- b0
fixed parameter of intercept
- b1
fixed parameter of slope
- b2
fixed parameter defining the quadraticity
- varU0
variance of random factor U0 given to fixed parameter b0
- varU1
variance of random factor U1 given to fixed parameter b1
- corr
correlation between random factors U0 and U1
- varE
the variability of the residuals
Source
Allen, JS, Bruss, J, Brown, CK, Damasio, H. Normal neuroanatomical variation due to age: the major lobes and a parcellation of the temporal region. Neurobiol Aging. 2005 Oct;26(9):1245-60; discussion 1279-82.
Uher T, Vaneckova M, Krasensky J, Sobisek L, Tyblova M, Volna J, Seidl Z, Bergsland N, Dwyer MG, Zivadinov R, De Stefano N, Sormani MP, Havrdova EK, Horakova D. Pathological cut-offs of global and regional brain volume loss in multiple sclerosis. Mult Scler. 2017 Nov 1:1352458517742739. doi: 10.1177/1352458517742739.