Type: | Package |
Title: | Fast Cross-Validation via Sequential Testing |
Version: | 0.2-3 |
Date: | 2022-02-19 |
Depends: | kernlab,Matrix |
Author: | Tammo Krueger, Mikio Braun |
Maintainer: | Tammo Krueger <tammokrueger@googlemail.com> |
Description: | The fast cross-validation via sequential testing (CVST) procedure is an improved cross-validation procedure which uses non-parametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. By eliminating under-performing candidates quickly and keeping promising candidates as long as possible, the method speeds up the computation while preserving the capability of a full cross-validation. In addition to CVST, the package contains an implementation of ordinary k-fold cross-validation with a flexible and powerful set of helper objects and methods to handle the overall model selection process. The implementations of Cochran's Q test with permutations and of Wald's sequential testing framework are generic and can therefore also be used in other contexts. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2.0)] |
NeedsCompilation: | no |
Packaged: | 2022-02-21 18:10:19 UTC; tammok |
Repository: | CRAN |
Date/Publication: | 2022-02-21 18:40:02 UTC |
Fast Cross-Validation via Sequential Testing
Description
The fast cross-validation via sequential testing (CVST) procedure is an improved cross-validation procedure which uses non-parametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. By eliminating under-performing candidates quickly and keeping promising candidates as long as possible, the method speeds up the computation while preserving the capability of a full cross-validation. In addition to CVST, the package contains an implementation of ordinary k-fold cross-validation with a flexible and powerful set of helper objects and methods to handle the overall model selection process. The implementations of Cochran's Q test with permutations and of Wald's sequential testing framework are generic and can therefore also be used in other contexts.
Details
Package: | CVST |
Type: | Package |
Title: | Fast Cross-Validation via Sequential Testing |
Version: | 0.2-3 |
Date: | 2022-02-19 |
Depends: | kernlab,Matrix |
Author: | Tammo Krueger, Mikio Braun |
Maintainer: | Tammo Krueger <tammokrueger@googlemail.com> |
License: | GPL (>=2.0) |
Index of help topics:
CV | Perform a k-fold Cross-validation |
CVST-package | Fast Cross-Validation via Sequential Testing |
cochranq.test | Cochran's Q Test with Permutation |
constructCVSTModel | Setup for a CVST Run. |
constructData | Construction and Handling of 'CVST.data' Objects |
constructLearner | Construction of Specific Learners for CVST |
constructParams | Construct a Grid of Parameters |
constructSequentialTest | Construct and Handle Sequential Tests. |
fastCV | The Fast Cross-Validation via Sequential Testing (CVST) Procedure |
noisyDonoho | Generate Donoho's Toy Data Sets |
noisySine | Regression and Classification Toy Data Set |
Author(s)
Tammo Krueger, Mikio Braun
Maintainer: Tammo Krueger <tammokrueger@googlemail.com>
References
Tammo Krueger, Danny Panknin, and Mikio Braun. Fast cross-validation via sequential testing. Journal of Machine Learning Research 16 (2015) 1103-1155. URL https://jmlr.org/papers/volume16/krueger15a/krueger15a.pdf.
Abraham Wald. Sequential Analysis. Wiley, 1947.
W. G. Cochran. The comparison of percentages in matched samples. Biometrika, 37 (3-4):256–266, 1950.
M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32 (200):675–701, 1937.
Examples
ns = noisySine(100)
svm = constructSVMLearner()
params = constructParams(kernel="rbfdot", sigma=10^(-3:3), nu=c(0.05, 0.1, 0.2, 0.3))
opt = fastCV(ns, svm, params, constructCVSTModel())
Perform a k-fold Cross-validation
Description
Performs the usual k-fold cross-validation procedure on a given data set, parameter grid and learner.
Usage
CV(data, learner, params, fold = 5, verbose = TRUE)
Arguments
data |
The data set as a CVST.data object. |
learner |
The learner as a CVST.learner object. |
params |
The parameter grid as a CVST.params object. |
fold |
The number of folds that should be generated for each set of parameters. |
verbose |
Should the procedure report the performance for each model? |
Value
Returns the optimal parameter settings as determined by k-fold cross-validation.
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
References
M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B, 36(2):111–147, 1974.
Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.
See Also
fastCV
constructData
constructLearner
constructParams
Examples
ns = noisySine(100)
svm = constructSVMLearner()
params = constructParams(kernel="rbfdot", sigma=10^(-3:3), nu=c(0.05, 0.1, 0.2, 0.3))
opt = CV(ns, svm, params)
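To make the fold mechanics concrete, the following base R sketch shows one common way to assign observations to k folds. It is an illustration only and not the package's implementation; `make_folds` is a hypothetical helper name.

```r
# Hypothetical helper: randomly assign n observations to `fold` folds
# of (nearly) equal size and return the index sets per fold.
make_folds <- function(n, fold = 5) {
  fold_id <- sample(rep(seq_len(fold), length.out = n))
  split(seq_len(n), fold_id)
}

set.seed(1)
folds <- make_folds(100, fold = 5)

# Each fold serves once as the hold-out set; the rest is used for training.
for (i in seq_along(folds)) {
  test_idx  <- folds[[i]]
  train_idx <- setdiff(seq_len(100), test_idx)
  # a model would be fitted on train_idx and evaluated on test_idx here
}
```

Every observation lands in exactly one hold-out set, so each parameter configuration is evaluated on all data points once.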
Cochran's Q Test with Permutation
Description
Performs Cochran's Q test on the data. If the data matrix contains too few elements, the chi-square distribution of the test statistic is replaced by a permutation variant.
Usage
cochranq.test(mat)
Arguments
mat |
The data matrix with the individuals in the rows and treatments in the columns. |
Value
Returns an htest object with the usual entries.
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
References
W. G. Cochran. The comparison of percentages in matched samples. Biometrika, 37 (3-4):256–266, 1950.
Kashinath D. Patil. Cochran's Q test: Exact distribution. Journal of the American Statistical Association, 70 (349):186–189, 1975.
Merle W. Tate and Sara M. Brown. Note on the Cochran Q test. Journal of the American Statistical Association, 65 (329):155–160, 1970.
Examples
mat = matrix(c(rep(0, 10), 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1,
0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0,
1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1), ncol=4)
cochranq.test(mat)
mat = matrix(c(rep(0, 7), 1, rep(0, 12), 1, 1, 0, 1,
rep(0, 5), 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1), nrow=8)
cochranq.test(mat)
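For readers who want to see what the test statistic measures, here is a minimal base R sketch of the textbook Cochran's Q statistic with its chi-square approximation. It assumes a 0/1 matrix with individuals in rows and treatments in columns, as described above. `cochran_q` is a hypothetical name, and the sketch deliberately omits the permutation variant that `cochranq.test` switches to for small samples.

```r
# Textbook Cochran's Q (chi-square approximation only; not the package code).
cochran_q <- function(mat) {
  k       <- ncol(mat)      # number of treatments
  col_tot <- colSums(mat)   # successes per treatment
  row_tot <- rowSums(mat)   # successes per individual
  n_tot   <- sum(mat)       # total number of successes
  q <- (k - 1) * (k * sum(col_tot^2) - n_tot^2) /
    (k * n_tot - sum(row_tot^2))
  list(statistic = q,
       df        = k - 1,
       p.value   = pchisq(q, df = k - 1, lower.tail = FALSE))
}

toy <- matrix(c(1, 0, 0,
                1, 1, 0,
                1, 0, 0,
                1, 1, 1), ncol = 3, byrow = TRUE)
res <- cochran_q(toy)
res$statistic
res$p.value
```

With only four individuals the chi-square approximation is unreliable, which is the situation in which the permutation variant is intended to take over.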
Setup for a CVST Run.
Description
This is a helper object of type CVST.setup containing all necessary parameters for a CVST run.
Usage
constructCVSTModel(steps = 10, beta = 0.1, alpha = 0.01,
similaritySignificance = 0.05, earlyStoppingSignificance = 0.05,
earlyStoppingWindow = 3, regressionSimilarityViaOutliers = FALSE)
Arguments
steps |
Number of steps CVST should run |
beta |
Significance level for H0. |
alpha |
Significance level for H1. |
similaritySignificance |
Significance level of the similarity test. |
earlyStoppingSignificance |
Significance level of the early stopping test. |
earlyStoppingWindow |
Size of the early stopping window. |
regressionSimilarityViaOutliers |
Should the less strict outlier-based similarity measure for regression tasks be used? |
Value
A CVST.setup object suitable for fastCV.
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
References
Tammo Krueger, Danny Panknin, and Mikio Braun. Fast cross-validation via sequential testing. Journal of Machine Learning Research 16 (2015) 1103-1155. URL https://jmlr.org/papers/volume16/krueger15a/krueger15a.pdf.
See Also
Construction and Handling of CVST.data Objects
Description
The CVST methods need a structured interface to both regression and classification data sets. These helper methods allow the construction and consistent handling of these types of data sets.
Usage
constructData(x, y)
getN(data)
getSubset(data, subset)
getX(data, subset = NULL)
shuffleData(data)
isClassification(data)
isRegression(data)
Arguments
x |
The feature data as vector or matrix. |
y |
The observed values (regressands/labels) as list, vector or factor. |
data |
A CVST.data object. |
subset |
An index set. |
Value
constructData returns a CVST.data object. getN returns the number of data points in the data set. getSubset returns a subset of the data as a CVST.data object, while getX just returns the feature data. shuffleData returns a randomly shuffled instance of the data.
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
Examples
nsine = noisySine(10)
isClassification(nsine)
isRegression(nsine)
getN(nsine)
getX(nsine)
nsineShuffled = shuffleData(nsine)
getX(nsineShuffled)
getSubset(nsineShuffled, 1:3)
Construction of Specific Learners for CVST
Description
These methods construct a CVST.learner object suitable for the CVST method. These objects provide the common interface needed for the CV and fastCV methods. We provide kernel logistic regression, kernel ridge regression, support vector machines and support vector regression as fully functional implementation templates.
Usage
constructLearner(learn, predict)
constructKlogRegLearner()
constructKRRLearner()
constructSVMLearner()
constructSVRLearner()
Arguments
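learn |
The learning method, which takes a CVST.data object and a list of parameters and returns a model. |
predict |
The prediction method, which takes a model and a CVST.data object and returns the corresponding predictions. |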
Details
The nu-SVM and nu-SVR are built on top of the corresponding implementations of the kernlab package (see reference). In the list of parameters these implementations expect an entry named kernel, which gives the name of the kernel that should be used, an entry named nu specifying the nu parameter, and an entry named C giving the C parameter for the nu-SVR.
The KRR and KLR also expect kernel and the other parameters necessary to construct the kernel. Both methods expect a lambda parameter, and KLR additionally expects a tol and a maxiter parameter in the parameter list.
Note that the lambda of KRR/KLR and the C parameter of SVR are scaled by the data set size to allow for comparable results in the fast CV loop.
Value
Returns a learner of type CVST.learner suitable for CV and fastCV.
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
References
Alexandros Karatzoglou, Alexandros Smola, Kurt Hornik, Achim Zeileis. kernlab - An S4 Package for Kernel Methods in R. Journal of Statistical Software Vol. 11, Issue 9, Nov 2004. DOI: 10.18637/jss.v011.i09.
Volker Roth. Probabilistic discriminative kernel classifiers for multi-class problems. In Proceedings of the 23rd DAGM-Symposium on Pattern Recognition, pages 246–253, 2001.
See Also
Examples
# SVM
ns = noisySine(100)
svm = constructSVMLearner()
p = list(kernel="rbfdot", sigma=100, nu=.1)
m = svm$learn(ns, p)
nsTest = noisySine(1000)
pred = svm$predict(m, nsTest)
sum(pred != nsTest$y) / getN(nsTest)
# Kernel logistic regression
klr = constructKlogRegLearner()
p = list(kernel="rbfdot", sigma=100, lambda=.1/getN(ns), tol=10e-6, maxiter=100)
m = klr$learn(ns, p)
pred = klr$predict(m, nsTest)
sum(pred != nsTest$y) / getN(nsTest)
# SVR
ns = noisySinc(100)
svr = constructSVRLearner()
p = list(kernel="rbfdot", sigma=100, nu=.1, C=1*getN(ns))
m = svr$learn(ns, p)
nsTest = noisySinc(1000)
pred = svr$predict(m, nsTest)
sum((pred - nsTest$y)^2) / getN(nsTest)
# Kernel ridge regression
krr = constructKRRLearner()
p = list(kernel="rbfdot", sigma=100, lambda=.1/getN(ns))
m = krr$learn(ns, p)
pred = krr$predict(m, nsTest)
sum((pred - nsTest$y)^2) / getN(nsTest)
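A custom learner can be plugged in through constructLearner(learn, predict). The following is a minimal sketch of a constant-mean regression baseline under two assumptions: that the data object exposes its features and targets as `$x` and `$y` (the examples above only show `$y`), and that learn and predict follow the calling pattern `learn(data, params)` and `predict(model, newData)` used by the templates above. The names `learn_mean` and `predict_mean` are hypothetical.

```r
# Hypothetical baseline learner: always predicts the training mean.
# Assumption: `data` behaves like a list with elements $x and $y.
learn_mean <- function(data, params) {
  list(mu = mean(data$y))
}
predict_mean <- function(model, newData) {
  rep(model$mu, NROW(newData$x))
}

# Stand-in for a data object so the sketch runs without the package.
toy  <- list(x = matrix(1:4, ncol = 1), y = c(1, 2, 3, 6))
m    <- learn_mean(toy, list())
pred <- predict_mean(m, toy)

# Wrap the two functions as a learner only if CVST is installed.
if (requireNamespace("CVST", quietly = TRUE)) {
  baseline <- CVST::constructLearner(learn_mean, predict_mean)
}
```

Such a baseline is mainly useful as a sanity check: a tuned kernel method should beat it on the mean squared error computed in the examples above.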
Construct a Grid of Parameters
Description
This is a helper function which, given a named list of parameter choices, expands the complete grid and returns a CVST.params object suitable for CV and fastCV.
Usage
constructParams(...)
Arguments
... |
The parameters that should be expanded. |
Value
Returns a CVST.params object, which is basically a named list of possible parameter values.
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
See Also
Examples
params = constructParams(kernel="rbfdot", sigma=10^(-1:5), nu=c(0.1, 0.2))
# the expanded grid contains 14 parameter lists:
length(params)
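The grid expansion itself can be illustrated in base R. This sketch uses expand.grid to build the same 7 x 2 = 14 combinations as a plain list of parameter lists. It is an assumption-level illustration of what "expanding the grid" means, not the internal representation of a CVST.params object.

```r
# Base R illustration of expanding parameter choices into a full grid.
grid <- expand.grid(kernel = "rbfdot",
                    sigma  = 10^(-1:5),
                    nu     = c(0.1, 0.2),
                    stringsAsFactors = FALSE)

# One named list per parameter combination.
param_lists <- lapply(seq_len(nrow(grid)), function(i) as.list(grid[i, ]))

length(param_lists)
str(param_lists[[1]])
```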
Construct and Handle Sequential Tests.
Description
These functions handle the construction of and calculation with sequential tests as introduced by Wald (1947). getCVSTTest constructs a special sequential test as introduced in Krueger et al. (2015). testSequence tests whether a sequence of 0/1 values is distributed according to H0 or H1.
Usage
constructSequentialTest(piH0 = 0.5, piH1 = 0.9, beta, alpha)
getCVSTTest(steps, beta = 0.1, alpha = 0.01)
testSequence(st, s)
plotSequence(st, s)
Arguments
piH0 |
Probability of the binomial distribution for H0. |
piH1 |
Probability of the binomial distribution for H1. |
beta |
Significance level for H0. |
alpha |
Significance level for H1. |
steps |
Number of steps the CVST procedure should be executed. |
st |
A sequential test of type CVST.sequentialTest. |
s |
A sequence of 0/1 values. |
Value
constructSequentialTest and getCVSTTest return a CVST.sequentialTest with the specified properties. testSequence returns 1 if H1 can be accepted, -1 if H0 can be accepted, and 0 if the test needs more data for a decision. plotSequence gives a graphical impression of this testing procedure.
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
References
Abraham Wald. Sequential Analysis. Wiley, 1947.
Tammo Krueger, Danny Panknin, and Mikio Braun. Fast cross-validation via sequential testing. Journal of Machine Learning Research 16 (2015) 1103-1155. URL https://jmlr.org/papers/volume16/krueger15a/krueger15a.pdf.
See Also
Examples
st = getCVSTTest(10)
s = rbinom(10,1, .5)
plotSequence(st, s)
testSequence(st, s)
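The idea behind the sequential test can be sketched with the textbook form of Wald's sequential probability ratio test for Bernoulli data. The sketch below is not the package's implementation. `sprt_decision` is a hypothetical name, and it assumes that alpha is the probability of wrongly accepting H1 and beta the probability of wrongly accepting H0. It mirrors the 1 / -1 / 0 return convention described above and reports the decision at the first boundary crossing.

```r
# Textbook Wald SPRT for a 0/1 sequence (illustration only).
sprt_decision <- function(s, p0 = 0.5, p1 = 0.9, alpha = 0.01, beta = 0.1) {
  stopifnot(all(s %in% c(0, 1)))
  upper <- log((1 - beta) / alpha)   # crossing it accepts H1
  lower <- log(beta / (1 - alpha))   # crossing it accepts H0
  # cumulative log-likelihood ratio of H1 against H0
  llr <- cumsum(s * log(p1 / p0) + (1 - s) * log((1 - p1) / (1 - p0)))
  crossed <- which(llr >= upper | llr <= lower)
  if (length(crossed) == 0) return(0)          # undecided, needs more data
  if (llr[crossed[1]] >= upper) 1 else -1
}

sprt_decision(rep(1, 10))   # a long run of successes favours H1
sprt_decision(rep(0, 10))   # a run of failures favours H0
sprt_decision(c(1, 0))      # too little evidence either way
```

With the default values, eight consecutive ones are needed before the upper boundary is crossed, while two consecutive zeros already cross the lower one.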
The Fast Cross-Validation via Sequential Testing (CVST) Procedure
Description
CVST is an improved cross-validation procedure which uses non-parametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. By eliminating underperforming candidates quickly and keeping promising candidates as long as possible, the method speeds up the computation while preserving the capability of a full cross-validation.
Usage
fastCV(train, learner, params, setup, test = NULL, verbose = TRUE)
Arguments
train |
The data set as a CVST.data object. |
learner |
The learner as a CVST.learner object. |
params |
The parameter grid as a CVST.params object. |
setup |
A CVST.setup object. |
test |
An independent test set that should be used at each step. If NULL, the remaining data after learning a model at each step is used instead. |
verbose |
Should the procedure report the performance after each step? |
Value
Returns the optimal parameter settings as determined by fast cross-validation via sequential testing.
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
References
Tammo Krueger, Danny Panknin, and Mikio Braun. Fast cross-validation via sequential testing. Journal of Machine Learning Research 16 (2015) 1103-1155. URL https://jmlr.org/papers/volume16/krueger15a/krueger15a.pdf.
See Also
CV
constructCVSTModel
constructData
constructLearner
constructParams
Examples
ns = noisySine(100)
svm = constructSVMLearner()
params = constructParams(kernel="rbfdot", sigma=10^(-3:3), nu=c(0.05, 0.1, 0.2, 0.3))
opt = fastCV(ns, svm, params, constructCVSTModel())
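The phrase "linearly increasing subsets" can be made concrete with a small base R calculation. Assuming the training size grows by a fixed increment of n / steps per step (an assumption drawn from the description above, not from the package's exact indexing), a run with 100 data points and the default 10 steps would use the following subset sizes.

```r
# Illustration only: training subset sizes under a fixed increment per step.
n     <- 100   # number of data points, as in noisySine(100)
steps <- 10    # default of constructCVSTModel(steps = 10)

increment    <- floor(n / steps)
subset_sizes <- increment * seq_len(steps)
subset_sizes
```

Early steps are therefore cheap, which is where eliminating weak candidates saves most of the computation.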
Generate Donoho's Toy Data Sets
Description
This function generates noisy variants of the toy signals introduced by Donoho (see the References section). The scaling is chosen to reflect the setting discussed in the original paper.
Usage
noisyDonoho(n, fun = doppler, sigma = 1)
blocks(x, scale = 3.656993)
bumps(x, scale = 10.52884)
doppler(x, scale = 24.22172)
heavisine(x, scale = 2.356934)
Arguments
n |
Number of data points that should be generated. |
fun |
Function to use to generate the data. |
sigma |
Standard deviation of the noise component. |
x |
Number of data points that should be generated. |
scale |
Scaling parameter. |
Value
Returns a data set of type CVST.data.
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
References
David L. Donoho and Iain M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81 (3) 425–455, 1994.
See Also
Examples
bumpsSet = noisyDonoho(1000, fun=bumps)
plot(bumpsSet)
dopplerSet = noisyDonoho(1000, fun=doppler)
plot(dopplerSet)
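For orientation, the unscaled Doppler signal from Donoho and Johnstone (1994) can be written in a few lines of base R. This sketch does not reproduce the package's scale constants, and `doppler_raw` is a hypothetical name. The noise level is an arbitrary choice for illustration.

```r
# Unscaled Doppler test signal on [0, 1] (illustration, not the package code).
doppler_raw <- function(x, eps = 0.05) {
  sqrt(x * (1 - x)) * sin(2 * pi * (1 + eps) / (x + eps))
}

set.seed(42)
x <- runif(200)                               # random design points
y <- doppler_raw(x) + rnorm(200, sd = 0.1)    # noisy observations

plot(x, y, pch = 20, col = "grey")
curve(doppler_raw, from = 0, to = 1, n = 2000, add = TRUE)
```

The signal oscillates rapidly near 0 and slowly near 1, which makes it a demanding test case for selecting a single kernel width.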
Regression and Classification Toy Data Set
Description
Regression and Classification Toy Data Set based on the sine and sinc function.
Usage
noisySine(n, dim = 5, sigma = 0.25)
noisySinc(n, dim = 2, sigma = 0.1)
Arguments
n |
Number of data points that should be generated. |
dim |
Intrinsic dimensionality of the data set (see references for details). |
sigma |
Standard deviation of the noise component. |
Value
Returns a data set of type CVST.data.
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
References
Tammo Krueger, Danny Panknin, and Mikio Braun. Fast cross-validation via sequential testing. Journal of Machine Learning Research 16 (2015) 1103-1155. URL https://jmlr.org/papers/volume16/krueger15a/krueger15a.pdf.
See Also
Examples
nsine = noisySine(1000)
plot(nsine, col=nsine$y)
nsinc = noisySinc(1000)
plot(nsinc)