Type: | Package |
Title: | Machine Learning |
Version: | 1.0.7 |
Date: | 2025-04-19 |
Imports: | graphics, grDevices, MASS, stats |
Depends: | R (≥ 3.3.2) |
Description: | Machine learning, containing several algorithms for supervised and unsupervised classification, in addition to a function that plots the Receiver Operating Characteristic (ROC) and Precision-Recall (PRC) curve graphs, and a function that returns several metrics used for model evaluation; the latter can also be used to rank results from other packages. |
License: | GPL-3 |
Encoding: | UTF-8 |
NeedsCompilation: | yes |
Author: | Paulo Cesar Ossani |
Maintainer: | Paulo Cesar Ossani <ossanipc@hotmail.com> |
Repository: | CRAN |
Packaged: | 2025-04-18 16:40:48 UTC; Ossan |
Date/Publication: | 2025-04-18 17:00:02 UTC |
Machine learning and data mining.
Description
Machine learning, containing several algorithms, in addition to functions that plot the Receiver Operating Characteristic (ROC) and Precision-Recall (PRC) curve graphs, and a function that returns several metrics used to evaluate models; the latter can also be applied to the classification results of other packages.
Details
Package: | Kira |
Type: | Package |
Version: | 1.0.7 |
Date: | 2025-04-19 |
License: | GPL(>= 3) |
LazyLoad: | yes |
This package contains:
Algorithms for supervised classification: knn, linear (lda) and quadratic (qda) discriminant analysis, linear regression, etc.
Algorithms for unsupervised classification: hierarchical, kmeans, etc.
A function that plots the ROC and PRC curves.
A function that returns a series of metrics from models.
Functions that determine the ideal number of clusters: elbow and silhouette.
Author(s)
Paulo Cesar Ossani <ossanipc@hotmail.com>
References
Aha, D. W.; Kibler, D. and Albert, M. K. Instance-based learning algorithms. Machine learning. v.6, n.1, p.37-66. 1991.
Anitha, S.; Metilda, M. A. R. Y. An extensive investigation of outlier detection by cluster validation indices. Ciencia e Tecnica Vitivinicola - A Science and Technology Journal, v. 34, n. 2, p. 22-32, 2019. doi: 10.13140/RG.2.2.26801.63848
Charnet, R. et al. Analise de modelos de regressao linear. 2a ed. Campinas: Editora da Unicamp, 2008. 357 p.
Chicco, D.; Warrens, M. J. and Jurman, G. The Matthews correlation coefficient (MCC) is more informative than Cohen's kappa and Brier score in binary classification assessment. IEEE Access, v. 9, p. 78368-78381, 2021.
Schubert, E. Stop using the Elbow criterion for k-means and how to choose the number of clusters instead. ACM SIGKDD Explorations Newsletter, 25(1):36-42, 2023. arXiv:2212.12189. doi: 10.1145/3606274.3606278
Ferreira, D. F. Estatistica Multivariada. 2a ed. revisada e ampliada. Lavras: Editora UFLA, 2011. 676 p.
Kaufman, L. and Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons. 1990.
Martinez, W. L.; Martinez, A. R.; Solka, J. Exploratory data analysis with MATLAB. 2nd ed. New York: Chapman & Hall/CRC, 2010. 499 p.
Mingoti, S. A. Analise de dados atraves de metodos de estatistica multivariada: uma abordagem aplicada. Belo Horizonte: UFMG, 2005. 297 p.
Nicoletti, M. do C. O modelo de aprendizado de maquina baseado em exemplares: principais caracteristicas e algoritmos. Sao Carlos: EdUFSCar, 2005. 61 p.
Onumanyi, A. J.; Molokomme, D. N.; Isaac, S. J. and Abu-Mahfouz, A. M. Autoelbow: An automatic elbow detection method for estimating the number of clusters in a dataset. Applied Sciences 12, 15. 2022. doi: 10.3390/app12157515
Rencher, A. C. Methods of multivariate analysis. 2nd ed. New York: J. Wiley, 2002. 708 p.
Rencher, A. C. and Schaalje, G. B. Linear models in statistics. 2nd ed. New Jersey: John Wiley & Sons, 2008. 672 p.
Rousseeuw P. J. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics, 20:53-65. 1987. doi: 10.1016/0377-0427(87)90125-7
Sugar, C. A. and James, G. M. Finding the number of clusters in a dataset: An information-theoretic approach. Journal of the American Statistical Association, 98, 463, 750-763. 2003. doi: 10.1198/016214503000000666
Venables, W. N. and Ripley, B. D. Modern Applied Statistics with S. 4th ed. Springer, 2002.
Zhang, Y.; Mandziuk, J.; Quek, H. C. and Goh, W. Curvature-based method for determining the number of clusters. Inf. Sci. 415, 414-428, 2017. doi: 10.1016/j.ins.2017.05.024
Brute force method for variable selection.
Description
Brute force method used to determine the smallest number of variables in a supervised classification model.
Usage
brute.force(func = NA, train, test, class.train,
class.test, args = NA, measure = "Rate Hits",
output = 10)
Arguments
func |
Supervised classification function to be analyzed. |
train |
Training data set, without classes. |
test |
Test data set. |
class.train |
Vector with training data class names. |
class.test |
Vector with test data class names. |
args |
Arguments used in the classifier given in 'func'. |
measure |
Measure to evaluate the model:
"Rate Hits" (default), "Kappa Index", "Sensitivity",
"Specificity", "Precision", "FP Rate", "FN Rate",
"Negative Predictive Rate", "F-Score", "MCC",
"ROC Area" or "PRC Area". |
output |
Number of elements with the best combinations of variables in the matrix 'best.model' (default = 10). |
Value
best.model |
Matrix with the names of the best combinations of variables, according to the evaluation measure used: accuracy, precision, recall etc. |
text.model |
Structure of the classification model used. |
Author(s)
Paulo Cesar Ossani
Examples
data(iris) # data set
data <- iris
names <- colnames(data)
colnames(data) <- c(names[1:4],"class")
#### Start - hold out validation method ####
dat.sample = sample(2, nrow(data), replace = TRUE, prob = c(0.7,0.3))
data.train = data[dat.sample == 1,] # training data set
data.test = data[dat.sample == 2,] # test data set
class.train = as.factor(data.train$class) # class names of the training data set
class.test = as.factor(data.test$class) # class names of the test data set
#### End - hold out validation method ####
r <- (ncol(data) - 1)
res <- brute.force(func = "knn", train = data.train[,1:r],
test = data.test[,1:r], class.train = class.train,
class.test = class.test, args = "k = 1, dist = 'EUC'",
measure = "Rate Hits", output = 20)
res$best.model
res$text.model
res <- brute.force(func = "regression", train = data.train[,1:r],
test = data.test[,1:r], class.train = class.train,
class.test = class.test, args = "intercept = TRUE",
measure = "Rate Hits", output = 20)
res$best.model
res$text.model
test_a <- as.integer(rownames(data.test)) # test data index
class <- data[,c(r+1)] # classes names
res <- brute.force(func = "lda", train = data[,1:r], test = test_a,
class.train = class, class.test = class.test,
args = "type = 'test', method = 'mle'",
measure = "Rate Hits", output = 20)
res$best.model
res$text.model
Elbow method to determine the optimal number of clusters.
Description
Generates the Elbow graph and returns the ideal number of clusters.
Usage
elbow(data, k.max = 10, method = "AutoElbow", plot = TRUE,
cut = TRUE, title = NA, xlabel = NA, ylabel = NA, size = 1.1,
grid = TRUE, color = TRUE, savptc = FALSE, width = 3236,
height = 2000, res = 300, casc = TRUE)
Arguments
data |
Data with x and y coordinates. |
k.max |
Maximum number of clusters for comparison (default = 10). |
method |
Method used to find the ideal number k of clusters: "jump", "curvature", "Exp", "AutoElbow" (default). |
plot |
Indicates whether to plot the elbow graph (default = TRUE). |
cut |
Indicates whether to plot the best cluster indicative line (default = TRUE). |
title |
Title of the graphic, if not set, assumes the default text. |
xlabel |
Names the X axis, if not set, assumes the default text. |
ylabel |
Names the Y axis, if not set, assumes the default text. |
size |
Size of points on the graph and line thickness (default = 1.1). |
grid |
Put grid on graph (default = TRUE). |
color |
Colored graphic (default = TRUE). |
savptc |
Saves the graph image to a file (default = FALSE). |
width |
Graphic image width when savptc = TRUE (default = 3236). |
height |
Graphic image height when savptc = TRUE (default = 2000). |
res |
Nominal resolution in ppi of the graphic image when savptc = TRUE (default = 300). |
casc |
Cascade effect in the presentation of the graphic (default = TRUE). |
Value
k.ideal |
Ideal number of clusters. |
Author(s)
Paulo Cesar Ossani
References
Schubert, E. Stop using the Elbow criterion for k-means and how to choose the number of clusters instead. ACM SIGKDD Explorations Newsletter, 25(1):36-42, 2023. arXiv:2212.12189. doi: 10.1145/3606274.3606278
Sugar, C. A. and James, G. M. Finding the number of clusters in a dataset: An information-theoretic approach. Journal of the American Statistical Association, 98, 463, 750-763. 2003. doi: 10.1198/016214503000000666
Zhang, Y.; Mandziuk, J.; Quek, H. C. and Goh, W. Curvature-based method for determining the number of clusters. Inf. Sci. 415, 414-428, 2017. doi: 10.1016/j.ins.2017.05.024
Onumanyi, A. J.; Molokomme, D. N.; Isaac, S. J. and Abu-Mahfouz, A. M. Autoelbow: An automatic elbow detection method for estimating the number of clusters in a dataset. Applied Sciences 12, 15. 2022. doi: 10.3390/app12157515
Examples
data(iris) # data set
res <- elbow(data = iris[,1:4], k.max = 20, method = "AutoElbow", cut = TRUE,
plot = TRUE, title = NA, xlabel = NA, ylabel = NA, size = 1.1,
grid = TRUE, savptc = FALSE, width = 3236, color = TRUE,
height = 2000, res = 300, casc = FALSE)
res$k.ideal # number of clusters
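The quantity behind the Elbow graph is the total within-cluster sum of squares as a function of k. As a minimal sketch of the curve that the methods above inspect, the snippet below computes it with stats::kmeans (not this package's kmeans()) on the iris data; the ideal k sits where the curve bends.

```r
data(iris)                         # data set
x <- iris[, 1:4]

set.seed(7)                        # reproducible k-means starts
k.max <- 10
# Total within-cluster sum of squares for k = 1..k.max; the elbow
# methods ("jump", "curvature", "Exp", "AutoElbow") look for the bend
# in this curve to pick the ideal number of clusters.
wss <- sapply(1:k.max, function(k)
  stats::kmeans(x, centers = k, nstart = 10)$tot.withinss)

wss                                # decreasing; the "elbow" marks the ideal k
```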
Hierarchical unsupervised classification.
Description
Performs hierarchical unsupervised classification analysis in a data set.
Usage
hierarchical(data, titles = NA, analysis = "Obs", cor.abs = FALSE,
normalize = FALSE, distance = "euclidean", method = "complete",
horizontal = FALSE, num.groups = 0, lambda = 2, savptc = FALSE,
width = 3236, height = 2000, res = 300, casc = TRUE)
Arguments
data |
Data to be analyzed. |
titles |
Titles of the graphics, if not set, assumes the default text. |
analysis |
"Obs" for analysis on observations (default), "Var" for analysis on variables. |
cor.abs |
Use the absolute correlation matrix when 'analysis' = "Var" (default = FALSE). |
normalize |
Normalize the data, only in the case 'analysis' = "Obs" (default = FALSE). |
distance |
Distance metric for hierarchical clustering: "euclidean" (default), "maximum", "manhattan", "canberra", "binary" or "minkowski". When 'analysis' = "Var", the metric will be the correlation matrix, according to 'cor.abs'. |
method |
Method for analyzing hierarchical groupings: "complete" (default), "ward.D", "ward.D2", "single", "average", "mcquitty", "median" or "centroid". |
horizontal |
Horizontal dendrogram (default = FALSE). |
num.groups |
Number of groups to be formed. |
lambda |
Value used in the minkowski distance (default = 2). |
savptc |
Saves graphics images to files (default = FALSE). |
width |
Graphics images width when savptc = TRUE (default = 3236). |
height |
Graphics images height when savptc = TRUE (default = 2000). |
res |
Nominal resolution in ppi of the graphics images when savptc = TRUE (default = 300). |
casc |
Cascade effect in the presentation of the graphics (default = TRUE). |
Value
Several graphics.
tab.res |
Table with similarities and distances of the groups formed. |
groups |
Original data with groups formed. |
res.groups |
Results of the groups formed. |
R.sqt |
Result of the R squared. |
sum.sqt |
Total sum of squares. |
mtx.dist |
Matrix of the distances. |
Author(s)
Paulo Cesar Ossani
References
Rencher, A. C. Methods of multivariate analysis. 2nd ed. New York: J. Wiley, 2002. 708 p.
Mingoti, S. A. Analise de dados atraves de metodos de estatistica multivariada: uma abordagem aplicada. Belo Horizonte: UFMG, 2005. 297 p.
Ferreira, D. F. Estatistica Multivariada. 2a ed. revisada e ampliada. Lavras: Editora UFLA, 2011. 676 p.
Examples
data(iris) # data set
data <- iris
res <- hierarchical(data[,1:4], titles = NA, analysis = "Obs", cor.abs = FALSE,
normalize = FALSE, distance = "euclidean", method = "ward.D",
horizontal = FALSE, num.groups = 3, savptc = FALSE, width = 3236,
height = 2000, res = 300, casc = FALSE)
message("R squared: ", res$R.sqt)
# message("Total sum of squares: ", res$sum.sqt)
message("Groups formed: "); res$groups
# message("Table with similarities and distances:"); res$tab.res
# message("Table with the results of the groups:"); res$res.groups
# message("Distance Matrix:"); res$mtx.dist
#write.table(file=file.path(tempdir(),"GroupData.csv"), res$groups, sep=";",
# dec=",",row.names = TRUE)
kmeans unsupervised classification.
Description
Performs kmeans unsupervised classification analysis in a data set.
Usage
kmeans(data, normalize = FALSE, num.groups = 2)
Arguments
data |
Data to be analyzed. |
normalize |
Normalize the data (default = FALSE). |
num.groups |
Number of groups to be formed (default = 2). |
Value
groups |
Original data with groups formed. |
res.groups |
Results of the groups formed. |
R.sqt |
Result of the R squared. |
sum.sqt |
Total sum of squares. |
Author(s)
Paulo Cesar Ossani
References
Rencher, A. C. Methods of multivariate analysis. 2nd ed. New York: J. Wiley, 2002. 708 p.
Mingoti, S. A. Analise de dados atraves de metodos de estatistica multivariada: uma abordagem aplicada. Belo Horizonte: UFMG, 2005. 297 p.
Ferreira, D. F. Estatistica Multivariada. 2a ed. revisada e ampliada. Lavras: Editora UFLA, 2011. 676 p.
Examples
data(iris) # data set
data <- iris
res <- kmeans(data[,1:4], normalize = FALSE, num.groups = 3)
message("R squared: ", res$R.sqt)
# message("Total sum of squares: ", res$sum.sqt)
message("Groups formed:"); res$groups
# message("Table with the results of the groups:"); res$res.groups
#write.table(file=file.path(tempdir(),"GroupData.csv"), res$groups, sep=";",
# dec=",",row.names = TRUE)
k-nearest neighbor (kNN) supervised classification method
Description
Performs the k-nearest neighbor (kNN) supervised classification method.
Usage
knn(train, test, class, k = 1, dist = "euclidean", lambda = 3)
Arguments
train |
Training data set, without classes. |
test |
Test data set. |
class |
Vector with data classes names. |
k |
Number of nearest neighbors (default = 1). |
dist |
Distances used in the method: "euclidean" (default), "manhattan", "minkowski", "canberra", "maximum" or "chebyshev". |
lambda |
Value used in the minkowski distance (default = 3). |
Value
predict |
The classified factors of the test set. |
Author(s)
Paulo Cesar Ossani
References
Aha, D. W.; Kibler, D. and Albert, M. K. Instance-based learning algorithms. Machine learning. v.6, n.1, p.37-66. 1991.
Nicoletti, M. do C. O modelo de aprendizado de maquina baseado em exemplares: principais caracteristicas e algoritmos. Sao Carlos: EdUFSCar, 2005. 61 p.
See Also
plot_curve
and results
Examples
data(iris) # data set
data <- iris
names <- colnames(data)
colnames(data) <- c(names[1:4],"class")
#### Start - hold out validation method ####
dat.sample = sample(2, nrow(data), replace = TRUE, prob = c(0.7,0.3))
data.train = data[dat.sample == 1,] # training data set
data.test = data[dat.sample == 2,] # test data set
class.train = as.factor(data.train$class) # class names of the training data set
class.test = as.factor(data.test$class) # class names of the test data set
#### End - hold out validation method ####
dist = "euclidean"
# dist = "manhattan"
# dist = "minkowski"
# dist = "canberra"
# dist = "maximum"
# dist = "chebyshev"
k = 1
lambda = 5
r <- (ncol(data) - 1)
res <- knn(train = data.train[,1:r], test = data.test[,1:r], class = class.train,
k = 1, dist = dist, lambda = lambda)
resp <- results(orig.class = class.test, predict = res$predict)
message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix:"); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
message("General results of the classes:"); resp$res.class
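The 'lambda' argument only takes effect when dist = "minkowski". As a point of reference (an illustration, not this package's internal code), the Minkowski distance generalizes the other metrics:

```r
# Minkowski distance between two numeric vectors; lambda = 1 reduces to
# the manhattan distance and lambda = 2 to the euclidean distance.
minkowski <- function(x, y, lambda = 3) {
  sum(abs(x - y)^lambda)^(1 / lambda)
}

x <- c(0, 0); y <- c(3, 4)
minkowski(x, y, lambda = 2)  # euclidean: 5
minkowski(x, y, lambda = 1)  # manhattan: 7
```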
Linear discriminant analysis (LDA).
Description
Perform linear discriminant analysis.
Usage
lda(data, test = NA, class = NA, type = "train",
method = "moment", prior = NA)
Arguments
data |
Data to be classified. |
test |
Vector with indices that will be used in 'data' as test. For type = "train", one has test = NA. |
class |
Vector with data classes names. |
type |
Type of analysis: "train" (default) or "test". |
method |
Classification method: "moment" (default) or "mle". |
prior |
Probabilities of occurrence of classes. If not specified, it will take the proportions of the classes. If specified, probabilities must follow the order of factor levels. |
Value
predict |
The classified factors of the set. |
Author(s)
Paulo Cesar Ossani
References
Rencher, A. C. Methods of multivariate analysis. 2nd ed. New York: J. Wiley, 2002. 708 p.
Venables, W. N. and Ripley, B. D. Modern Applied Statistics with S. 4th ed. Springer, 2002.
Mingoti, S. A. Analise de dados atraves de metodos de estatistica multivariada: uma abordagem aplicada. Belo Horizonte: UFMG, 2005. 297 p.
Ferreira, D. F. Estatistica Multivariada. 2a ed. revisada e ampliada. Lavras: Editora UFLA, 2011. 676 p.
See Also
plot_curve
and results
Examples
data(iris) # data set
data <- iris
names <- colnames(data)
colnames(data) <- c(names[1:4],"class")
#### Start - hold out validation method ####
dat.sample = sample(2, nrow(data), replace = TRUE, prob = c(0.7,0.3))
data.train = data[dat.sample == 1,] # training data set
data.test = data[dat.sample == 2,] # test data set
class.train = as.factor(data.train$class) # class names of the training data set
class.test = as.factor(data.test$class) # class names of the test data set
#### End - hold out validation method ####
r <- (ncol(data) - 1)
class <- data[,c(r+1)] # classes names
## Data training example
res <- lda(data = data[,1:r], test = NA, class = class,
type = "train", method = "moment", prior = NA)
resp <- results(orig.class = class, predict = res$predict)
message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix:"); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
message("General results of the classes:"); resp$res.class
## Data test example
class.table <- table(class) # table with the number of elements per class
prior <- as.double(class.table/sum(class.table))
test = as.integer(rownames(data.test)) # test data index
res <- lda(data = data[,1:r], test = test, class = class,
type = "test", method = "mle", prior = prior)
resp <- results(orig.class = class.test, predict = res$predict)
message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix: "); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
message("General results of the classes:"); resp$res.class
Graphics of the results of the classification process
Description
Return graphics of the results of the classification process.
Usage
plot_curve(data, type = "ROC", title = NA, xlabel = NA, ylabel = NA,
posleg = 3, boxleg = FALSE, axis = TRUE, size = 1.1, grid = TRUE,
color = TRUE, classcolor = NA, savptc = FALSE, width = 3236,
height = 2000, res = 300, casc = TRUE)
Arguments
data |
Data with x and y coordinates. |
type |
Type of graph: "ROC" (default) or "PRC". |
title |
Title of the graphic, if not set, assumes the default text. |
xlabel |
Names the X axis, if not set, assumes the default text. |
ylabel |
Names the Y axis, if not set, assumes the default text. |
posleg |
Position of the legend: 0 for no legend, 1 top left corner, 2 top right, 3 bottom right, 4 bottom left (default = 3). |
boxleg |
Draws a frame around the legend (default = FALSE). |
axis |
Put the diagonal axis on the graph (default = TRUE). |
size |
Size of the points in the graphs (default = 1.1). |
grid |
Put grid on graphs (default = TRUE). |
color |
Colored graphics (default = TRUE). |
classcolor |
Vector with the colors of the classes. |
savptc |
Saves graphics images to files (default = FALSE). |
width |
Graphics images width when savptc = TRUE (default = 3236). |
height |
Graphics images height when savptc = TRUE (default = 2000). |
res |
Nominal resolution in ppi of the graphics images when savptc = TRUE (default = 300). |
casc |
Cascade effect in the presentation of the graphic (default = TRUE). |
Value
ROC or PRC curve.
Author(s)
Paulo Cesar Ossani
Examples
data(iris) # data set
data <- iris
names <- colnames(data)
colnames(data) <- c(names[1:4],"class")
#### Start - hold out validation method ####
dat.sample = sample(2, nrow(data), replace = TRUE, prob = c(0.7,0.3))
data.train = data[dat.sample == 1,] # training data set
data.test = data[dat.sample == 2,] # test data set
class.train = as.factor(data.train$class) # class names of the training data set
class.test = as.factor(data.test$class) # class names of the test data set
#### End - hold out validation method ####
dist = "euclidean"
# dist = "manhattan"
# dist = "minkowski"
# dist = "canberra"
# dist = "maximum"
# dist = "chebyshev"
k = 1
lambda = 5
r <- (ncol(data) - 1)
res <- knn(train = data.train[,1:r], test = data.test[,1:r], class = class.train,
k = 1, dist = dist, lambda = lambda)
resp <- results(orig.class = class.test, predict = res$predict)
message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix:"); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
# message("Data for the ROC curve in classes:"); resp$roc.curve
# message("Data for the PRC curve in classes:"); resp$prc.curve
message("General results of the classes:"); resp$res.class
dat <- resp$roc.curve; tp = "roc"; ps = 3
# dat <- resp$prc.curve; tp = "prc"; ps = 4
plot_curve(data = dat, type = tp, title = NA, xlabel = NA, ylabel = NA,
posleg = ps, boxleg = FALSE, axis = TRUE, size = 1.1, grid = TRUE,
color = TRUE, classcolor = NA, savptc = FALSE,
width = 3236, height = 2000, res = 300, casc = FALSE)
Quadratic discriminant analysis (QDA).
Description
Perform quadratic discriminant analysis.
Usage
qda(data, test = NA, class = NA, type = "train",
method = "moment", prior = NA)
Arguments
data |
Data to be classified. |
test |
Vector with indices that will be used in 'data' as test. For type = "train", one has test = NA. |
class |
Vector with data classes names. |
type |
Type of analysis: "train" (default) or "test". |
method |
Classification method: "moment" (default) or "mle". |
prior |
Probabilities of occurrence of classes. If not specified, it will take the proportions of the classes. If specified, probabilities must follow the order of factor levels. |
Value
predict |
The classified factors of the set. |
Author(s)
Paulo Cesar Ossani
References
Rencher, A. C. Methods of multivariate analysis. 2nd ed. New York: J. Wiley, 2002. 708 p.
Venables, W. N. and Ripley, B. D. Modern Applied Statistics with S. 4th ed. Springer, 2002.
Mingoti, S. A. Analise de dados atraves de metodos de estatistica multivariada: uma abordagem aplicada. Belo Horizonte: UFMG, 2005. 297 p.
Ferreira, D. F. Estatistica Multivariada. 2a ed. revisada e ampliada. Lavras: Editora UFLA, 2011. 676 p.
See Also
plot_curve
and results
Examples
data(iris) # data set
data <- iris
names <- colnames(data)
colnames(data) <- c(names[1:4],"class")
#### Start - hold out validation method ####
dat.sample = sample(2, nrow(data), replace = TRUE, prob = c(0.7,0.3))
data.train = data[dat.sample == 1,] # training data set
data.test = data[dat.sample == 2,] # test data set
class.train = as.factor(data.train$class) # class names of the training data set
class.test = as.factor(data.test$class) # class names of the test data set
#### End - hold out validation method ####
r <- (ncol(data) - 1)
class <- data[,c(r+1)] # classes names
## Data training example
res <- qda(data = data[,1:r], test = NA, class = class,
type = "train", method = "moment", prior = NA)
resp <- results(orig.class = class, predict = res$predict)
message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix: "); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
message("General results of the classes:"); resp$res.class
## Data test example
class.table <- table(class) # table with the number of elements per class
prior <- as.double(class.table/sum(class.table))
test = as.integer(rownames(data.test)) # test data index
res <- qda(data = data[,1:r], test = test, class = class,
type = "test", method = "mle", prior = prior)
resp <- results(orig.class = class.test, predict = res$predict)
message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix: "); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
message("General results of the classes:"); resp$res.class
Linear regression supervised classification method
Description
Performs supervised classification using the linear regression method.
Usage
regression(train, test, class, intercept = TRUE)
Arguments
train |
Training data set, without classes. |
test |
Test data set. |
class |
Vector with data classes names. |
intercept |
Consider the intercept in the regression (default = TRUE). |
Value
predict |
The classified factors of the test set. |
Author(s)
Paulo Cesar Ossani
References
Charnet, R. et al. Analise de modelos de regressao linear. 2a ed. Campinas: Editora da Unicamp, 2008. 357 p.
Rencher, A. C. and Schaalje, G. B. Linear models in statistics. 2nd ed. New Jersey: John Wiley & Sons, 2008. 672 p.
Rencher, A. C. Methods of multivariate analysis. 2nd ed. New York: J. Wiley, 2002. 708 p.
See Also
plot_curve
and results
Examples
data(iris) # data set
data <- iris
names <- colnames(data)
colnames(data) <- c(names[1:4],"class")
#### Start - hold out validation method ####
dat.sample = sample(2, nrow(data), replace = TRUE, prob = c(0.7,0.3))
data.train = data[dat.sample == 1,] # training data set
data.test = data[dat.sample == 2,] # test data set
class.train = as.factor(data.train$class) # class names of the training data set
class.test = as.factor(data.test$class) # class names of the test data set
#### End - hold out validation method ####
r <- (ncol(data) - 1)
res <- regression(train = data.train[,1:r], test = data.test[,1:r],
class = class.train, intercept = TRUE)
resp <- results(orig.class = class.test, predict = res$predict)
message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix:"); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
message("General results of the classes:"); resp$res.class
Results of the classification process
Description
Returns the results of the classification process.
Usage
results(orig.class, predict)
Arguments
orig.class |
Data with the original classes. |
predict |
Data with classes of results of classifiers. |
Value
mse |
Mean squared error. |
mae |
Mean absolute error. |
rae |
Relative absolute error. |
conf.mtx |
Confusion matrix. |
rate.hits |
Hit rate. |
rate.error |
Error rate. |
num.hits |
Number of correct instances. |
num.error |
Number of wrong instances. |
kappa |
Kappa coefficient. |
roc.curve |
Data for the ROC curve in classes. |
prc.curve |
Data for the PRC curve in classes. |
res.class |
General results of the classes: Sensitivity, Specificity, Precision, TP Rate, FP Rate, NP Rate, F-Score, MCC, ROC Area, PRC Area. |
Author(s)
Paulo Cesar Ossani
References
Chicco, D.; Warrens, M. J. and Jurman, G. The Matthews correlation coefficient (MCC) is more informative than Cohen's kappa and Brier score in binary classification assessment. IEEE Access, v. 9, p. 78368-78381, 2021.
Examples
data(iris) # data set
data <- iris
names <- colnames(data)
colnames(data) <- c(names[1:4],"class")
#### Start - hold out validation method ####
dat.sample = sample(2, nrow(data), replace = TRUE, prob = c(0.7,0.3))
data.train = data[dat.sample == 1,] # training data set
data.test = data[dat.sample == 2,] # test data set
class.train = as.factor(data.train$class) # class names of the training data set
class.test = as.factor(data.test$class) # class names of the test data set
#### End - hold out validation method ####
dist = "euclidean"
# dist = "manhattan"
# dist = "minkowski"
# dist = "canberra"
# dist = "maximum"
# dist = "chebyshev"
k = 1
lambda = 5
r <- (ncol(data) - 1)
res <- knn(train = data.train[,1:r], test = data.test[,1:r], class = class.train,
k = 1, dist = dist, lambda = lambda)
resp <- results(orig.class = class.test, predict = res$predict)
message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix:"); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
# message("Data for the ROC curve in classes:"); resp$roc.curve
# message("Data for the PRC curve in classes:"); resp$prc.curve
message("General results of the classes:"); resp$res.class
dat <- resp$roc.curve; tp = "roc"; ps = 3
# dat <- resp$prc.curve; tp = "prc"; ps = 4
plot_curve(data = dat, type = tp, title = NA, xlabel = NA, ylabel = NA,
posleg = ps, boxleg = FALSE, axis = TRUE, size = 1.1, grid = TRUE,
color = TRUE, classcolor = NA, savptc = FALSE, width = 3236,
height = 2000, res = 300, casc = FALSE)
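Two of the fields above can be reproduced by hand from the confusion matrix: 'rate.hits' is the proportion of instances on its diagonal, and 'kappa' corrects that agreement for chance. A minimal base-R sketch with toy classes (independent of this package):

```r
# Toy original and predicted classes.
orig <- factor(c("a", "a", "b", "b", "b", "c"))
pred <- factor(c("a", "b", "b", "b", "c", "c"))

conf <- table(orig, pred)                 # confusion matrix

n         <- sum(conf)
rate.hits <- sum(diag(conf)) / n          # hit rate (observed agreement)

# Kappa coefficient: observed agreement corrected for the agreement
# expected by chance from the row and column margins.
p.exp <- sum(rowSums(conf) * colSums(conf)) / n^2
kappa <- (rate.hits - p.exp) / (1 - p.exp)

rate.hits   # 4/6
kappa       # 11/23
```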
Silhouette method to determine the optimal number of clusters.
Description
Generates the silhouette graph and returns the ideal number of clusters in the k-means method.
Usage
silhouette(data, k.cluster = 2:10, plot = TRUE, cut = TRUE,
title = NA, xlabel = NA, ylabel = NA, size = 1.1, grid = TRUE,
color = TRUE, savptc = FALSE, width = 3236, height = 2000,
res = 300, casc = TRUE)
Arguments
data |
Data with x and y coordinates. |
k.cluster |
Cluster numbers for comparison in the k-means method (default = 2:10). |
plot |
Indicates whether to plot the silhouette graph (default = TRUE). |
cut |
Indicates whether to plot the best cluster indicative line (default = TRUE). |
title |
Title of the graphic, if not set, assumes the default text. |
xlabel |
Names the X axis, if not set, assumes the default text. |
ylabel |
Names the Y axis, if not set, assumes the default text. |
size |
Size of points on the graph and line thickness (default = 1.1). |
grid |
Put grid on graph (default = TRUE). |
color |
Colored graphic (default = TRUE). |
savptc |
Saves the graph image to a file (default = FALSE). |
width |
Graphic image width when savptc = TRUE (default = 3236). |
height |
Graphic image height when savptc = TRUE (default = 2000). |
res |
Nominal resolution in ppi of the graphic image when savptc = TRUE (default = 300). |
casc |
Cascade effect in the presentation of the graphic (default = TRUE). |
Value
k.ideal |
Ideal number of clusters. |
eve.si |
Vector with the average silhouette index (si) for each number of clusters evaluated. |
Author(s)
Paulo Cesar Ossani
References
Anitha, S.; Metilda, M. A. R. Y. An extensive investigation of outlier detection by cluster validation indices. Ciencia e Tecnica Vitivinicola - A Science and Technology Journal, v. 34, n. 2, p. 22-32, 2019. doi: 10.13140/RG.2.2.26801.63848
Kaufman, L. and Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons. 1990.
Martinez, W. L.; Martinez, A. R.; Solka, J. Exploratory data analysis with MATLAB. 2nd ed. New York: Chapman & Hall/CRC, 2010. 499 p.
Rousseeuw P. J. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics, 20:53-65. 1987. doi: 10.1016/0377-0427(87)90125-7
Examples
data(iris) # data set
res <- silhouette(data = iris[,1:4], k.cluster = 2:10, cut = TRUE,
plot = TRUE, title = NA, xlabel = NA, ylabel = NA,
size = 1.1, grid = TRUE, savptc = FALSE, width = 3236,
color = TRUE, height = 2000, res = 300, casc = TRUE)
res$k.ideal # number of clusters
res$eve.si # vector with averages of si indices
res <- silhouette(data = iris[,1:4], k.cluster = 3, cut = TRUE,
plot = TRUE, title = NA, xlabel = NA, ylabel = NA,
size = 1.1, grid = TRUE, savptc = FALSE, width = 3236,
color = TRUE, height = 2000, res = 300, casc = TRUE)
res$k.ideal # number of clusters
res$eve.si # vector with averages of si indices
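The average silhouette index summarized in eve.si can be sketched by hand for a single partition. The snippet below is illustrative only, assuming plain Euclidean distances and stats::kmeans; it is not the package's internal implementation:

```r
set.seed(1)
x  <- as.matrix(iris[, 1:4])
cl <- kmeans(x, centers = 3)$cluster
d  <- as.matrix(dist(x))                 # pairwise Euclidean distances
n  <- nrow(x)
si <- sapply(seq_len(n), function(i) {
  same <- cl == cl[i]
  a <- mean(d[i, same & seq_len(n) != i])          # mean distance within own cluster
  b <- min(tapply(d[i, !same], cl[!same], mean))   # mean distance to nearest other cluster
  (b - a) / max(a, b)                              # s(i), in [-1, 1] (Rousseeuw, 1987)
})
mean(si)  # average silhouette width for k = 3
```

The k maximizing this average over k.cluster is the one reported as k.ideal.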
Voting method for supervised classification.
Description
Performs the supervised voting classification method, combining the results of several classifiers by maximum agreement (majority vote).
Usage
vote(mtx.algtms = NA)
Arguments
mtx.algtms |
Matrix whose columns hold the predictions of the supervised classification algorithms to be combined. |
Value
predict |
The predicted classes of the test set, as a factor. |
Author(s)
Paulo Cesar Ossani
References
Kittler, J.; Hatef, M.; Duin, R. P. W. and Matas, J. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence. 20(3):226-239. 1998. doi: 10.1109/34.667881
Examples
data(iris) # data set
data <- iris
names <- colnames(data)
colnames(data) <- c(names[1:4],"class")
#### Start - hold out validation method ####
dat.sample = sample(2, nrow(data), replace = TRUE, prob = c(0.7,0.3))
data.train = data[dat.sample == 1,] # training data set
data.test = data[dat.sample == 2,] # test data set
class.train = as.factor(data.train$class) # class names of the training data set
class.test = as.factor(data.test$class) # class names of the test data set
#### End - hold out validation method ####
test <- as.integer(rownames(data.test)) # test data index
r <- (ncol(data)-1)
class <- data[,c(r+1)] # classes names
mod1 <- knn(train = data.train[,1:r], test = data.test[,1:r],
class = class.train, k = 1, dist = 'EUC')
mod2 <- knn(train = data.train[,1:r], test = data.test[,1:r],
class = class.train, k = 2, dist = 'EUC')
mod3 <- lda(data = data[,1:r], test = test, class = class,
type = 'test', method = 'moment', prior = NA)
mod4 <- qda(data = data[,1:r], test = test, class = class,
type = 'test', method = 'mle', prior = NA)
mod5 <- qda(data = data[,1:r], test = test, class = class,
type = 'test', method = 'moment', prior = NA)
mod6 <- regression(train = data.train[,1:r], test = data.test[,1:r],
class = class.train, intercept = TRUE)
mod <- cbind(as.data.frame(mod1$predict), mod2$predict, mod3$predict,
mod4$predict, mod5$predict, mod6$predict)
res <- vote(mtx.algtms = mod)
resp <- results(orig.class = class.test, predict = res$predict)
print("Confusion matrix:"); resp$conf.mtx
cat("Hit rate:", resp$rate.hits,
"\nError rate:", resp$rate.error,
"\nNumber of correct instances:", resp$num.hits,
"\nNumber of wrong instances:", resp$num.error,
"\nKappa coefficient:", resp$kappa)
print("General results of the classes:"); resp$res.class
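The maximum-agreement rule amounts to a per-row majority vote over the prediction matrix. A minimal sketch, using a hypothetical majority_vote() helper rather than the package's implementation:

```r
# Hypothetical helper: for each observation (row), pick the label the most
# classifiers agree on; ties resolve to the first label in table() order.
majority_vote <- function(preds) {
  unname(apply(as.matrix(preds), 1, function(row) {
    tab <- table(row)
    names(tab)[which.max(tab)]
  }))
}

preds <- data.frame(m1 = c("setosa", "virginica", "virginica"),
                    m2 = c("setosa", "virginica", "versicolor"),
                    m3 = c("versicolor", "virginica", "virginica"))
majority_vote(preds)  # "setosa" "virginica" "virginica"
```

In the example above, each mod*$predict column passed to vote() plays the role of one column of preds.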