Type: | Package |
Title: | Evaluation Clustering Methods for Disease Subtypes Diagnosis |
Version: | 0.1.0 |
Date: | 2022-03-25 |
Author: | Fahimeh Nezhadmoghadam, Jose Gerardo Tamez-Pena |
Maintainer: | Fahimeh Nezhadmoghadam <f.nejad.moghadam@gmail.com> |
Description: | Contains a set of clustering methods and evaluation metrics to select the best number of clusters based on clustering stability. Two references describe the methodology: Fahimeh Nezhadmoghadam and Jose Tamez-Pena (2021) <doi:10.1016/j.compbiomed.2021.104753>, and Fahimeh Nezhadmoghadam, et al. (2021) <doi:10.2174/1567205018666210831145825>. |
License: | LGPL-2 | LGPL-2.1 | LGPL-3 [expanded from: LGPL (≥ 2)] |
Encoding: | UTF-8 |
Suggests: | mlbench, EMCluster, inaparc, ppclust, FRESA.CAD, MASS, stats, class, NMF, cluster, Rtsne, proxy, uwot, mclust, graphics |
Packaged: | 2022-03-31 18:27:26 UTC; fneja |
RoxygenNote: | 7.1.1 |
NeedsCompilation: | no |
Repository: | CRAN |
Date/Publication: | 2022-04-01 07:50:07 UTC |
Evaluation Clustering Methods for Disease Subtypes Diagnosis (Evacluster)
Description
Contains a set of clustering methods and evaluation metrics to select the best number of clusters based on clustering stability.
Details
Package: | Evacluster |
Type: | Package |
Version: | 0.1.0 |
Date: | 2022-03-25 |
License: | LGPL (>= 2) |
Purpose: Provides clustering models and evaluation metrics for finding the number of clusters by computing clustering stability. The best number of clusters is selected via consensus clustering and clustering stability.
Author(s)
Fahimeh Nezhadmoghadam, Jose Gerardo Tamez-Pena. Maintainer: Fahimeh Nezhadmoghadam <f.nejad.moghadam@gmail.com>
References
Nezhadmoghadam, Fahimeh, and Jose Tamez-Pena. "Risk profiles for negative and positive COVID-19 hospitalized patients." (2021) Computers in Biology and Medicine 136: 104753. <doi:10.1016/j.compbiomed.2021.104753>
Nezhadmoghadam, Fahimeh, et al. "Robust discovery of mild cognitive impairment subtypes and their risk of Alzheimer's disease conversion using unsupervised machine learning and Gaussian mixture modeling." (2021) Current Alzheimer Research 18(7): 595-606. <doi:10.2174/1567205018666210831145825>
Examples
## Not run:
### Evacluster Package Examples ####
library(datasets)
data(iris)
# Split data into training and testing sets
rndSamples <- sample(nrow(iris),100)
trainData <- iris[rndSamples,]
testData <- iris[-rndSamples,]
## Expectation Maximization Clustering
# Perform Expectation Maximization clustering on the training set with 3 clusters
clust <- EMCluster(trainData[,1:4],3)
# Predict the cluster labels of new data based on the cluster labels of the training set
pre <- predict(clust,testData[,1:4])
## Fuzzy C-means Clustering
# Perform Fuzzy C-means clustering on the training set with 3 clusters
clust <- FuzzyCluster(trainData[,1:4],3)
# Predict the labels of the new data
pre <- predict(clust,testData[,1:4])
## Hierarchical Clustering
# Perform hierarchical clustering on the training set with 3 clusters
clust <- hierarchicalCluster(trainData[,1:4],distmethod="euclidean",clusters=3)
# Predict the labels of the new data
pre <- predict(clust,testData[,1:4])
## K-means Clustering
# Perform K-means clustering on the training set with 3 clusters
clust <- kmeansCluster(trainData[,1:4],3)
# Predict the labels of the new data
pre <- predict(clust,testData[,1:4])
## Partitioning Around Medoids (PAM) Clustering
# Perform PAM clustering on the training set with 3 clusters
clust <- pamCluster(trainData[,1:4],3)
# Predict the labels of the new data
pre <- predict(clust,testData[,1:4])
## Non-negative Matrix Factorization (NMF)
# Perform NMF clustering on the training set with 3 clusters
clust <- nmfCluster(trainData[,1:4],rank=3)
# Predict the labels of the new data
pre <- predict(clust,testData[,1:4])
## t-Distributed Stochastic Neighbor Embedding (t-SNE)
library(mlbench)
data(Sonar)
rndSamples <- sample(nrow(Sonar),150)
trainData <- Sonar[rndSamples,]
testData <- Sonar[-rndSamples,]
# Perform tSNE dimensionality reduction method on training data
tsne_trainData <- tsneReductor(trainData[,1:60],dim = 3,perplexity = 10,max_iter = 1000)
# Perform an embedding of the new data using the existing embedding
tsne_testData <- predict(tsne_trainData,k=3,testData[,1:60])
## clustering stability function
# Compute the stability of clustering to select the best number of clusters.
library(mlbench)
data(Sonar)
Sonar$Class <- as.numeric(Sonar$Class)
Sonar$Class[Sonar$Class == 1] <- 0
Sonar$Class[Sonar$Class == 2] <- 1
# Compute the stability of clustering using kmeans clustering, UMAP as the
# dimensionality reduction method, and the feature selection technique
ClustStab <- clusterStability(data=Sonar, clustermethod=kmeansCluster, dimenreducmethod="UMAP",
n_components = 3,featureselection="yes", outcome="Class",
fs.pvalue = 0.05,randomTests = 100,trainFraction = 0.7,center=3)
# Get the labels of the subjects that share the same connectivity
clusterLabels <- getConsensusCluster(ClustStab,who="training",thr=seq(0.80,0.30,-0.1))
# Compute the stability of clustering using PAM clustering, tSNE as the
# dimensionality reduction method, and the feature selection technique
ClustStab <- clusterStability(data=Sonar, clustermethod=pamCluster, dimenreducmethod="tSNE",
n_components = 3, perplexity=10,max_iter=100,k_neighbor=2,
featureselection="yes", outcome="Class",fs.pvalue = 0.05,
randomTests = 100,trainFraction = 0.7,k=3)
# Get the labels of the subjects that share the same connectivity
clusterLabels <- getConsensusCluster(ClustStab,who="training",thr=seq(0.80,0.30,-0.1))
# Compute the stability of clustering using hierarchical clustering,
# PCA as the dimensionality reduction method, and without applying feature selection
ClustStab <- clusterStability(data=Sonar, clustermethod=hierarchicalCluster,
dimenreducmethod="PCA", n_components = 3,featureselection="no",
randomTests = 100,trainFraction = 0.7,distmethod="euclidean",
clusters=3)
# Get the labels of the subjects that share the same connectivity
clusterLabels <- getConsensusCluster(ClustStab,who="training",thr=seq(0.80,0.30,-0.1))
# Show the clustering stability results
mycolors <- c("red","green","blue","yellow")
ordermatrix <- ClustStab$dataConcensus
heatmapsubsample <- sample(nrow(ordermatrix),70)
orderindex <- 10*clusterLabels + ClustStab$trainJaccardpoint
orderindex <- orderindex[heatmapsubsample]
orderindex <- order(orderindex)
ordermatrix <- ordermatrix[heatmapsubsample,heatmapsubsample]
ordermatrix <- ordermatrix[orderindex,orderindex]
rowcolors <- mycolors[1+clusterLabels[heatmapsubsample]]
rowcolors <- rowcolors[orderindex]
hplot <- gplots::heatmap.2(as.matrix(ordermatrix),Rowv=FALSE,Colv=FALSE,
RowSideColors = rowcolors,ColSideColors = rowcolors,dendrogram = "none",
trace="none",main="Cluster Co-Association \n (k=3)")
# Compare the PAC values of clustering stability with different numbers of clusters
ClustStab2 <- clusterStability(data=Sonar, clustermethod=kmeansCluster, dimenreducmethod="UMAP",
n_components = 3,featureselection="yes", outcome="Class",
fs.pvalue = 0.05,randomTests = 100,trainFraction = 0.7,center=2)
ClustStab3 <- clusterStability(data=Sonar, clustermethod=kmeansCluster, dimenreducmethod="UMAP",
n_components = 3,featureselection="yes", outcome="Class",
fs.pvalue = 0.05,randomTests = 100,trainFraction = 0.7,center=3)
ClustStab4 <- clusterStability(data=Sonar, clustermethod=kmeansCluster, dimenreducmethod="UMAP",
n_components = 3,featureselection="yes", outcome="Class",
fs.pvalue = 0.05,randomTests = 100,trainFraction = 0.7,center=4)
color_range <- c("#FDFC74", "#76FF7A", "#B2EC5D")
PACvalues <- c(ClustStab2$PAC,ClustStab3$PAC,ClustStab4$PAC)
barplot(PACvalues,xlab = "Number of clusters",ylab = "PAC", names.arg = c("2","3","4"),
ylim=c(0,0.3),col=color_range)
## End(Not run)
Expectation Maximization Clustering
Description
This function performs the EM algorithm for model-based clustering of finite mixtures of multivariate Gaussian distributions. The general purpose of clustering is to detect clusters in the data and to assign the data points to those clusters.
Usage
EMCluster(data = NULL, ...)
Arguments
data |
A data set |
... |
k: The number of clusters |
Value
A list of cluster labels and the object returned by init.EM
Examples
library(datasets)
data(iris)
rndSamples <- sample(nrow(iris),100)
trainData <- iris[rndSamples,]
testData <- iris[-rndSamples,]
clust <- EMCluster(trainData[,1:4],3)
Fuzzy C-means Clustering Algorithm
Description
This function works by assigning membership to each data point with respect to each cluster center, based on the distance between the cluster center and the data point. A data object is a member of all clusters, with varying degrees of fuzzy membership between 0 and 1.
Usage
FuzzyCluster(data = NULL, ...)
Arguments
data |
A data set |
... |
k: The number of clusters |
Value
A list of cluster labels and an R object returned by the fcm function of the ppclust package
Examples
library(datasets)
data(iris)
rndSamples <- sample(nrow(iris),100)
trainData <- iris[rndSamples,]
testData <- iris[-rndSamples,]
cls <- FuzzyCluster(trainData[,1:4],3)
clustering stability function
Description
This function computes the clustering stability, which helps select the best number of clusters. Feature selection and dimensionality reduction methods can be applied to the data before clustering.
Usage
clusterStability(
data = NULL,
clustermethod = NULL,
dimenreducmethod = NULL,
n_components = 3,
perplexity = 25,
max_iter = 1000,
k_neighbor = 3,
featureselection = NULL,
outcome = NULL,
fs.pvalue = 0.05,
randomTests = 20,
trainFraction = 0.5,
pac.thr = 0.1,
...
)
Arguments
data |
A Data set |
clustermethod |
The clustering method. This can be one of "Mclust", "pamCluster", "kmeansCluster", "hierarchicalCluster", and "FuzzyCluster". |
dimenreducmethod |
The dimensionality reduction method. This must be one of "UMAP", "tSNE", and "PCA". |
n_components |
The dimension of the space into which the data are embedded. It can be set to any integer value in the range of 2 to 100. |
perplexity |
The perplexity parameter, which determines the optimal number of neighbors in the tSNE method (only used with the tSNE reduction method). |
max_iter |
The maximum number of iterations for the tSNE reduction method. |
k_neighbor |
Used when embedding new data into an existing tSNE embedding: each new point is assigned the mean of its nearest neighbors with minimum distance, where the number of neighbors is computed as sqrt(#Samples / k_neighbor). |
featureselection |
This parameter determines whether feature selection is applied before clustering the data. If used, it should be "yes"; otherwise "no". |
outcome |
The outcome feature used for the feature selection process. |
fs.pvalue |
The p-value threshold used for the feature selection process. The default value is 0.05. |
randomTests |
The number of iterations of the clustering process used to compute the cluster stability. |
trainFraction |
The fraction of the data used for training. The default value is 0.5. |
pac.thr |
The threshold used for computing the proportion of ambiguous clustering (PAC) score, defined as the fraction of sample pairs with consensus indices falling in the interval (pac.thr, 1 - pac.thr). The default value is 0.1. |
... |
Additional arguments passed to the clustering method. |
Value
A list with the following elements:
randIndex - A vector of Rand Index values, which measure the similarity between two clusterings.
jaccIndex - A vector of Jaccard Index values, which measure how frequently pairs of items are joined together in two clustering data sets.
randomSamples - A vector with the indexes of the samples selected for training in each iteration.
clusterLabels - A vector with the cluster labels from all iterations.
jaccardpoint - The corresponding Jaccard index for each data point of the testing set.
averageNumberofClusters - The mean number of clusters.
testConsesus - A vector of consensus clustering results for the testing set.
trainRandIndex - A vector of Rand Index values for the training set.
trainJaccIndex - A vector of Jaccard Index values for the training set.
trainJaccardpoint - The corresponding Jaccard index for each data point of the training set.
PAC - The proportion of ambiguous clustering (PAC) score (see the sketch below).
dataConcensus - A vector of consensus clustering results for the training set.
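For intuition, the PAC element can be recomputed from the returned training consensus results. The sketch below is illustrative only: it assumes a ClustStab object as produced in the Examples that follow, and applies the definition given for the pac.thr argument with its default value of 0.1.
# Illustrative sketch: PAC is the fraction of sample pairs whose consensus
# index falls strictly between pac.thr and 1 - pac.thr (ambiguous co-clustering)
consensus <- as.numeric(as.matrix(ClustStab$dataConcensus))
pac <- mean(consensus > 0.1 & consensus < 0.9)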
Examples
library("mlbench")
data(Sonar)
Sonar$Class <- as.numeric(Sonar$Class)
Sonar$Class[Sonar$Class == 1] <- 0
Sonar$Class[Sonar$Class == 2] <- 1
ClustStab <- clusterStability(data=Sonar, clustermethod=kmeansCluster, dimenreducmethod="UMAP",
n_components = 3,featureselection="yes", outcome="Class",
fs.pvalue = 0.05,randomTests = 100,trainFraction = 0.7,center=3)
ClustStab <- clusterStability(data=Sonar, clustermethod=pamCluster, dimenreducmethod="tSNE",
n_components = 3, perplexity=10,max_iter=100,k_neighbor=2,
featureselection="yes", outcome="Class",fs.pvalue = 0.05,
randomTests = 100,trainFraction = 0.7,k=3)
ClustStab <- clusterStability(data=Sonar, clustermethod=hierarchicalCluster,
dimenreducmethod="PCA", n_components = 3,featureselection="no",
randomTests = 100,trainFraction = 0.7,distmethod="euclidean",
clusters=3)
Consensus Clustering Results
Description
This function gets the labels of the subjects that share the same connectivity.
Usage
getConsensusCluster(object, who = "training", thr = seq(0.8, 0.3, -0.1))
Arguments
object |
An object returned by the clusterStability function |
who |
Indicates which consensus clustering result to return: who = "training" returns the result for the training set, otherwise the result for the other sets is returned. |
thr |
A sequence of consensus thresholds, given as a seq call with three arguments (initial value, final value, and increment, or decrement for a declining sequence), producing an ascending or descending sequence. |
Value
A list of the labels of the samples that share the same connectivity.
Examples
library("mlbench")
data(Sonar)
Sonar$Class <- as.numeric(Sonar$Class)
Sonar$Class[Sonar$Class == 1] <- 0
Sonar$Class[Sonar$Class == 2] <- 1
ClustStab <- clusterStability(data=Sonar, clustermethod=kmeansCluster, dimenreducmethod="UMAP",
n_components = 3,featureselection="yes", outcome="Class",
fs.pvalue = 0.05,randomTests = 100,trainFraction = 0.7,center=3)
clusterLabels <- getConsensusCluster(ClustStab,who="training",thr=seq(0.80,0.30,-0.1))
hierarchical clustering
Description
This function seeks to build a hierarchy of clusters.
Usage
hierarchicalCluster(data = NULL, distmethod = NULL, clusters = NULL, ...)
Arguments
data |
A data set |
distmethod |
The distance measure to be used. This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". |
clusters |
The number of clusters |
... |
Additional parameters passed to the hclust function |
Value
A list of cluster labels
Examples
library(datasets)
data(iris)
rndSamples <- sample(nrow(iris),100)
trainData <- iris[rndSamples,]
testData <- iris[-rndSamples,]
cls <- hierarchicalCluster(trainData[,1:4],distmethod="euclidean",clusters=3)
K-means Clustering
Description
This function classifies unlabeled data by grouping observations by their features rather than by pre-defined categories. It splits the data into K different clusters and describes the location of the center of each cluster. A new data point can then be assigned to a cluster (class) based on the closest center of mass.
Usage
kmeansCluster(data = NULL, ...)
Arguments
data |
A data set |
... |
center: The number of centers |
Value
A list of cluster labels and an R object of class "kmeans"
Examples
library(datasets)
data(iris)
rndSamples <- sample(nrow(iris),100)
trainData <- iris[rndSamples,]
testData <- iris[-rndSamples,]
cls <- kmeansCluster(trainData[,1:4],3)
Non-negative matrix factorization (NMF)
Description
This function factorizes the sample matrix into (usually) two matrices: W, the cluster centroids, and H, the cluster memberships.
Usage
nmfCluster(data = NULL, rank = NULL)
Arguments
data |
A data set |
rank |
Specification of the factorization rank |
Value
A list of cluster labels, an R object of class "nmf", and the centers of the clusters
Examples
library(datasets)
data(iris)
rndSamples <- sample(nrow(iris),100)
trainData <- iris[rndSamples,]
testData <- iris[-rndSamples,]
cls <- nmfCluster(trainData[,1:4],rank=3)
Partitioning Around Medoids (PAM) Clustering
Description
This function partitions (clusters) the data into k clusters "around medoids". In contrast to the k-means algorithm, this clustering method chooses actual data points as centers (medoids).
Usage
pamCluster(data = NULL, ...)
Arguments
data |
A data set |
... |
k: The number of clusters |
Value
A list of cluster labels and an R object returned by the pam function of the cluster package
Examples
library(datasets)
data(iris)
rndSamples <- sample(nrow(iris),100)
trainData <- iris[rndSamples,]
testData <- iris[-rndSamples,]
cls <- pamCluster(trainData[,1:4],3)
EMCluster prediction function
Description
This function predicts the labels of the cluster for new data based on cluster labels of the training set.
Usage
## S3 method for class 'EMCluster'
predict(object, ...)
Arguments
object |
An object returned by the EMCluster function |
... |
The new sample set |
Value
A list of cluster labels
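Examples
A minimal usage sketch, mirroring the package-level example above:
library(datasets)
data(iris)
# Split the data into training and testing sets
rndSamples <- sample(nrow(iris),100)
trainData <- iris[rndSamples,]
testData <- iris[-rndSamples,]
# Fit the clustering model on the training set
clust <- EMCluster(trainData[,1:4],3)
# Predict cluster labels for the held-out samples
pre <- predict(clust,testData[,1:4])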
FuzzyCluster prediction function
Description
This function predicts the labels of the cluster for new data based on cluster labels of the training set.
Usage
## S3 method for class 'FuzzyCluster'
predict(object, ...)
Arguments
object |
An object returned by the FuzzyCluster function |
... |
The new sample set |
Value
A list of cluster labels
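Examples
A minimal usage sketch, mirroring the package-level example above:
library(datasets)
data(iris)
# Split the data into training and testing sets
rndSamples <- sample(nrow(iris),100)
trainData <- iris[rndSamples,]
testData <- iris[-rndSamples,]
# Fit the clustering model on the training set
clust <- FuzzyCluster(trainData[,1:4],3)
# Predict cluster labels for the held-out samples
pre <- predict(clust,testData[,1:4])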
hierarchicalCluster prediction function
Description
This function predicts the labels of the cluster for new data based on cluster labels of the training set.
Usage
## S3 method for class 'hierarchicalCluster'
predict(object, ...)
Arguments
object |
An object returned by the hierarchicalCluster function |
... |
The new sample set |
Value
A list of cluster labels
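Examples
A minimal usage sketch, mirroring the package-level example above:
library(datasets)
data(iris)
# Split the data into training and testing sets
rndSamples <- sample(nrow(iris),100)
trainData <- iris[rndSamples,]
testData <- iris[-rndSamples,]
# Fit the clustering model on the training set
clust <- hierarchicalCluster(trainData[,1:4],distmethod="euclidean",clusters=3)
# Predict cluster labels for the held-out samples
pre <- predict(clust,testData[,1:4])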
kmeansCluster prediction function
Description
This function predicts the labels of the cluster for new data based on cluster labels of the training set.
Usage
## S3 method for class 'kmeansCluster'
predict(object, ...)
Arguments
object |
An object returned by the kmeansCluster function |
... |
The new sample set |
Value
A list of cluster labels
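Examples
A minimal usage sketch, mirroring the package-level example above:
library(datasets)
data(iris)
# Split the data into training and testing sets
rndSamples <- sample(nrow(iris),100)
trainData <- iris[rndSamples,]
testData <- iris[-rndSamples,]
# Fit the clustering model on the training set
clust <- kmeansCluster(trainData[,1:4],3)
# Predict cluster labels for the held-out samples
pre <- predict(clust,testData[,1:4])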
nmfCluster prediction function
Description
This function predicts the labels of the cluster for new data based on cluster labels of the training set.
Usage
## S3 method for class 'nmfCluster'
predict(object, ...)
Arguments
object |
An object returned by the nmfCluster function |
... |
The new sample set |
Value
A list of cluster labels
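Examples
A minimal usage sketch, mirroring the package-level example above:
library(datasets)
data(iris)
# Split the data into training and testing sets
rndSamples <- sample(nrow(iris),100)
trainData <- iris[rndSamples,]
testData <- iris[-rndSamples,]
# Fit the clustering model on the training set
clust <- nmfCluster(trainData[,1:4],rank=3)
# Predict cluster labels for the held-out samples
pre <- predict(clust,testData[,1:4])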
pamCluster prediction function
Description
This function predicts the labels of the cluster for new data based on cluster labels of the training set.
Usage
## S3 method for class 'pamCluster'
predict(object, ...)
Arguments
object |
An object returned by the pamCluster function |
... |
The new sample set |
Value
A list of cluster labels
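Examples
A minimal usage sketch, mirroring the package-level example above:
library(datasets)
data(iris)
# Split the data into training and testing sets
rndSamples <- sample(nrow(iris),100)
trainData <- iris[rndSamples,]
testData <- iris[-rndSamples,]
# Fit the clustering model on the training set
clust <- pamCluster(trainData[,1:4],3)
# Predict cluster labels for the held-out samples
pre <- predict(clust,testData[,1:4])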
tsneReductor prediction function
Description
This function performs an embedding of new data using an existing embedding.
Usage
## S3 method for class 'tsneReductor'
predict(object, k = NULL, ...)
Arguments
object |
An object returned by the tsneReductor function |
k |
The number used when embedding new data into an existing tSNE embedding: each new point is assigned the mean of its nearest neighbors with minimum distance, where the number of neighbors is computed as sqrt(#Samples / k). |
... |
The new sample set |
Value
tsneY: An embedding of the new data
Examples
library("mlbench")
data(Sonar)
rndSamples <- sample(nrow(Sonar),150)
trainData <- Sonar[rndSamples,]
testData <- Sonar[-rndSamples,]
tsne_trainData <- tsneReductor(trainData[,1:60],dim = 3,perplexity = 10,max_iter = 1000)
tsne_testData <- predict(tsne_trainData,k=3,testData[,1:60])
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Description
This method is an unsupervised, non-linear technique used for data exploration and for visualizing high-dimensional data. This function constructs a low-dimensional embedding of high-dimensional data, distances, or similarities.
Usage
tsneReductor(data = NULL, dim = 2, perplexity = 30, max_iter = 500)
Arguments
data |
Data matrix (each row is an observation, each column is a variable) |
dim |
Integer; the output dimensionality (default = 2) |
perplexity |
Numeric; the perplexity parameter (should satisfy 3 * perplexity < nrow(data) - 1; default = 30) |
max_iter |
Integer; the number of iterations (default = 500) |
Value
tsneY: A matrix containing the new representations of the observations, with the number of dimensions selected by the user
Examples
library("mlbench")
data(Sonar)
rndSamples <- sample(nrow(Sonar),150)
trainData <- Sonar[rndSamples,]
testData <- Sonar[-rndSamples,]
tsne_trainData <- tsneReductor(trainData[,1:60],dim = 3,perplexity = 10,max_iter = 1000)