Type: Package
Title: Models Multivariate Cases Using Random Forests
Version: 1.1.5
Date: 2017-04-05
Author: Raziur Rahman
Maintainer: Raziur Rahman <razeeebuet@gmail.com>
Description: Models and predicts multiple output features with a single random forest, accounting for the linear relations among the output features; see details in Rahman et al. (2017) <doi:10.1093/bioinformatics/btw765>.
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
RoxygenNote: 6.0.1
Depends: R (≥ 2.10)
Imports: Rcpp, bootstrap, stats
LinkingTo: Rcpp
NeedsCompilation: yes
Packaged: 2017-05-01 00:21:14 UTC; Raziur_Rahman
Repository: CRAN
Date/Publication: 2017-05-01 10:20:31 UTC
Generate training and testing samples for cross validation
Description
Generates cross-validation input matrices and output vectors for training and testing, where the number of folds is user defined.
Usage
CrossValidation(X, Y, F)
Arguments
X: M x N input matrix, where M is the number of samples and N is the number of features
Y: Output responses as a column vector
F: Number of folds
Value
List with the following components:
TrainingData: List of F matrices, each containing one fold of cross-validation training data
TestingData: List of F matrices, each containing one fold of cross-validation testing data
OutputTrain: List of F matrices, each containing one fold of cross-validation training output feature data
OutputTest: List of F matrices, each containing one fold of cross-validation testing output feature data
FoldedIndex: Indices of the different folds (e.g., for sample indices 1:6 and 3 folds, FoldedIndex is [1 2 3 4], [1 2 5 6], [3 4 5 6])
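A minimal usage sketch with random data; the matrix sizes and fold count below are illustrative, not prescribed:
library(MultivariateRandomForest)
X = matrix(runif(20 * 10), 20, 10)   # 20 samples, 10 input features
Y = matrix(runif(20), ncol = 1)      # output responses as a column vector
F = 5                                # number of folds
Result = CrossValidation(X, Y, F)
# Result holds TrainingData, TestingData, OutputTrain, OutputTest
# and FoldedIndex, each with F entries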
Imputation of a numerical vector
Description
Imputes the NaN values of a numerical vector.
Usage
Imputation(XX)
Arguments
XX: A vector of size N x 1
Details
If a value is missing, it is replaced by the average of the previous and next values. If the previous or next value is also missing, the closest available value is used for the imputation.
Value
Imputed vector of size N x 1
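A short sketch of a typical call; the vector values are illustrative:
library(MultivariateRandomForest)
XX = c(0.7, NaN, 0.3, 0.6, NaN, 0.2)  # NaN marks the missing entries
Imputed = Imputation(XX)              # NaN values replaced per the rule above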
Information Gain
Description
Computes the cost function of a tree node.
Usage
Node_cost(y, Inv_Cov_Y, Command)
Arguments
y: Output features for the samples of the node
Inv_Cov_Y: Inverse of the covariance matrix of the output response matrix for MRF (input [0 0; 0 0] for RF)
Command: 1 for a univariate regression tree (corresponding to RF) and 2 for a multivariate regression tree (corresponding to MRF)
Details
In multivariate trees (MRF), node cost is measured as the sum of squared Mahalanobis distances, which captures the correlations in the data, whereas in univariate trees node cost is measured as the sum of squared Euclidean distances. The Mahalanobis distance measures the distance of a sample point from the mean of the node along the principal component axes.
Value
Cost or entropy of the samples in a node of a tree
References
Segal, Mark, and Yuanyuan Xiao. "Multivariate random forests." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1.1 (2011): 80-87.
Examples
library(MultivariateRandomForest)
y = matrix(runif(10 * 2), 10, 2)
Inv_Cov_Y = solve(cov(y))
Command = 2
# Command = 2 for MRF and 1 for RF
# Node_cost computes the cost (information gain measure) of a node
Cost = Node_cost(y, Inv_Cov_Y, Command)
Prediction using Random Forest or Multivariate Random Forest
Description
Builds a Random Forest model, or a Multivariate Random Forest model when the number of output features is greater than 1, using the training samples, and generates predictions for the testing samples using the inferred model.
Usage
build_forest_predict(trainX, trainY, n_tree, m_feature, min_leaf, testX)
Arguments
trainX: Input feature matrix of size M x N, where M is the number of training samples and N is the number of input features
trainY: Output response matrix of size M x T, where M is the number of training samples and T is the number of output features
n_tree: Number of trees in the forest; must be a positive integer
m_feature: Number of randomly selected features considered for a split in each regression tree node; must be a positive integer less than N (the number of input features)
min_leaf: Minimum number of samples in a leaf node. If a node has min_leaf or fewer samples, it is not split further and becomes a leaf node. Valid input is a positive integer less than or equal to M (the number of training samples)
testX: Testing samples of size Q x N, where Q is the number of testing samples and N is the number of features (the same features as the training samples)
Details
Random Forest regression refers to an ensemble of regression trees: a set of n_tree un-pruned regression trees is generated based on bootstrap sampling from the original training data. For each node, the optimal feature for the split is selected from a random set of m_feature out of the total N features. Selecting the splitting feature from a random subset decreases the correlation between trees, so the average prediction of multiple regression trees is expected to have lower variance than an individual regression tree. A larger m_feature can improve the predictive capability of individual trees but can also increase the correlation between trees and negate any gain from averaging multiple predictions. The bootstrap resampling of the data for training each tree also increases the variation between the trees.
In a node with training predictor features (X) and output feature vectors (Y), node splitting aims to select a feature from a random set of m_feature and a threshold z that partition the node into two child nodes: a left node (samples with feature value < z) and a right node (samples with feature value >= z). In multivariate trees (MRF) node cost is measured as the sum of squared Mahalanobis distances, whereas in univariate trees (RF) node cost is measured as the sum of squared Euclidean distances.
After the forest model is built from the training input features (trainX) and the output feature matrix (trainY), the model is used to generate predictions of the output features (testY) for the testing samples (testX).
Value
Prediction results for the testing samples
References
[Random Forest] Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.
[Multivariate Random Forest] Segal, Mark, and Yuanyuan Xiao. "Multivariate random forests." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1.1 (2011): 80-87.
Examples
library(MultivariateRandomForest)
# Input and output feature matrices of random data (created using runif)
trainX = matrix(runif(50 * 100), 50, 100)
trainY = matrix(runif(50 * 5), 50, 5)
n_tree = 2
m_feature = 5
min_leaf = 5
testX = matrix(runif(10 * 100), 10, 100)
# The prediction has size 10 x 5, where 10 is the number of testing samples
# and 5 is the number of output features
Prediction = build_forest_predict(trainX, trainY, n_tree, m_feature, min_leaf, testX)
Model of a single tree of Random Forest or Multivariate Random Forest
Description
Builds a univariate regression tree (for generating a Random Forest (RF)) or a multivariate regression tree (for generating a Multivariate Random Forest (MRF)) from the training samples; the tree is then used for the prediction of testing samples.
Usage
build_single_tree(X, Y, m_feature, min_leaf, Inv_Cov_Y, Command)
Arguments
X: Input feature matrix of size M x N, where M is the number of training samples and N is the number of input features
Y: Output feature matrix of size M x T, where M is the number of training samples and T is the number of output features
m_feature: Number of randomly selected features considered for a split in each regression tree node; must be a positive integer less than N (the number of input features)
min_leaf: Minimum number of samples in a leaf node; must be a positive integer less than or equal to M (the number of training samples)
Inv_Cov_Y: Inverse of the covariance matrix of the output response matrix for MRF (input [0 0; 0 0] for RF)
Command: 1 for a univariate regression tree (corresponding to RF) and 2 for a multivariate regression tree (corresponding to MRF)
Details
The regression tree structure is represented as a list of lists. A non-leaf node contains the splitting criteria (the feature for the split and the threshold), while a leaf node contains the output responses of the samples it holds.
Value
Model of a single regression tree (univariate or multivariate). An example of the list stored for a non-leaf node:
Flag determining whether the node is a leaf node or a branch node (0 means branch node, 1 means leaf node): 1
Index of samples for the left node: int [1:34] 1 2 4 5 ...
Index of samples for the right node: int [1:16] 3 6 9 ...
Feature for the split: int 34
Threshold values for the split (these are averaged): num [1:3] 0.655 0.526 0.785
List numbers of the left and right nodes: num [1:2] 2 3
An example of the list stored for a leaf node:
Output responses: num [1:4, 1:5] 0.0724 0.1809 0.0699 ...
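A minimal calling sketch with random data, mirroring the variable importance example later in this manual; the sizes are illustrative:
library(MultivariateRandomForest)
X = matrix(runif(20 * 10), 20, 10)  # 20 training samples, 10 input features
Y = matrix(runif(20 * 3), 20, 3)    # 3 output features, hence multivariate
m_feature = 5
min_leaf = 4
Inv_Cov_Y = solve(cov(Y))           # inverse output covariance ([0 0; 0 0] for RF)
Command = 2                         # 2 = multivariate regression tree (MRF)
Single_Model = build_single_tree(X, Y, m_feature, min_leaf, Inv_Cov_Y, Command)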
Prediction of a testing sample in a node
Description
Determines, for a testing sample at a node, which child node it will go to using the splitting criteria of the tree node, or returns the prediction result if the node is a leaf.
Usage
predicting(Single_Model, i, X_test, Variable_number)
Arguments
Single_Model: Model of a particular tree
i: Number of the split; used as an index indicating where in the list the splitting criteria of this split are stored
X_test: Testing samples of size Q x N, where Q is the number of testing samples and N is the number of features (same order and size as in training)
Variable_number: Number of output features
Details
The function considers the output at a particular node. If the node is a leaf, the average of the output responses is returned as the prediction result. For a non-leaf node, the direction (left or right child) is decided based on the node's splitting feature value and threshold.
Value
Prediction result of a testing sample in a node
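A hedged calling sketch: the tree is built first with build_single_tree, and the traversal is assumed to start at the root, which is list number 1 (its children are lists 2 and 3, as shown in the build_single_tree entry). In normal use this function is called internally by single_tree_prediction:
library(MultivariateRandomForest)
X = matrix(runif(20 * 10), 20, 10)
Y = matrix(runif(20 * 3), 20, 3)
Single_Model = build_single_tree(X, Y, m_feature = 5, min_leaf = 4,
                                 Inv_Cov_Y = solve(cov(Y)), Command = 2)
X_test = matrix(runif(5 * 10), 5, 10)  # same 10 features as in training
# i = 1 is assumed to index the root split of the stored model
Result = predicting(Single_Model, 1, X_test, ncol(Y))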
Prediction of Testing Samples for a single tree
Description
Predicts the output responses of testing samples based on the input regression tree
Usage
single_tree_prediction(Single_Model, X_test, Variable_number)
Arguments
Single_Model: Random Forest or Multivariate Random Forest model of a particular tree
X_test: Testing samples of size Q x N, where Q is the number of testing samples and N is the number of features (same order and size as in training)
Variable_number: Number of output features
Details
A regression tree model contains the splitting criteria of all the splits in the tree and the output responses of the training samples in its leaf nodes. A testing sample traverses the tree using these criteria until it reaches a leaf node, and the average of the output response vectors in that leaf node is taken as the prediction for the testing sample.
Value
Prediction results of the testing samples for a particular tree
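A minimal sketch with random data; the tree is built first with build_single_tree and then used to predict a testing matrix:
library(MultivariateRandomForest)
X = matrix(runif(20 * 10), 20, 10)
Y = matrix(runif(20 * 3), 20, 3)
Single_Model = build_single_tree(X, Y, m_feature = 5, min_leaf = 4,
                                 Inv_Cov_Y = solve(cov(Y)), Command = 2)
X_test = matrix(runif(5 * 10), 5, 10)        # 5 testing samples, same features
Y_hat = single_tree_prediction(Single_Model, X_test, ncol(Y))  # 5 x 3 result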
Splitting Criteria of all the nodes of the tree
Description
Stores the splitting criteria of all the nodes of a tree in a list
Usage
split_node(X, Y, m_feature, Index, i, model, min_leaf, Inv_Cov_Y, Command)
Arguments
X: Input training matrix of size M x N, where M is the number of training samples and N is the number of features
Y: Output training response of size M x T, where M is the number of samples and T is the number of output responses
m_feature: Number of randomly selected features considered for a split in each regression tree node
Index: Index of training samples
i: Number of the split; used as an index indicating where in the list the splitting criteria of this split will be stored
model: A list of lists with the splitting criteria of all the node splits; in each iteration, a new list is added with the splitting criteria of the newest split
min_leaf: Minimum number of samples in a leaf node. If a node has min_leaf or fewer samples, it is not split further and becomes a leaf node. Valid input is a positive integer less than or equal to M (the number of training samples)
Inv_Cov_Y: Inverse of the covariance matrix of the output response matrix for MRF (give zero for RF)
Command: 1 for a univariate regression tree (corresponding to RF) and 2 for a multivariate regression tree (corresponding to MRF)
Details
This function calculates the splitting criteria of a node and stores the information in list format. If the node is a parent node, the indices of the left and right child nodes, the feature number and the threshold value used for the split are stored. If the node is a leaf, the output feature matrix of the node's samples is stored as a list.
Value
Model: A list of lists with the splitting criteria of all the node splits. In each iteration, the model is updated with a new list containing the splitting criteria of the newest split.
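A hedged calling sketch; split_node is normally invoked internally by build_single_tree, and the initial values below (recursion starting at i = 1 with an empty model list) are assumptions rather than documented defaults:
library(MultivariateRandomForest)
X = matrix(runif(20 * 10), 20, 10)
Y = matrix(runif(20 * 3), 20, 3)
Inv_Cov_Y = solve(cov(Y))  # [0 0; 0 0] for RF
# model = list() is an assumed empty accumulator, filled as splits are generated
model = split_node(X, Y, m_feature = 5, Index = 1:nrow(X), i = 1,
                   model = list(), min_leaf = 4, Inv_Cov_Y = Inv_Cov_Y,
                   Command = 2)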
Split of the Parent node
Description
Splits the training samples of the parent node into child nodes based on the feature and threshold that produce the minimum cost
Usage
splitt2(X, Y, m_feature, Index, Inv_Cov_Y, Command, ff)
Arguments
X: Input training matrix of size M x N, where M is the number of training samples and N is the number of features
Y: Output training response of size M x T, where M is the number of samples and T is the number of output responses
m_feature: Number of randomly selected features considered for a split in each regression tree node
Index: Index of training samples
Inv_Cov_Y: Inverse of the covariance matrix of the output response matrix for MRF (input [0 0; 0 0] for RF)
Command: 1 for a univariate regression tree (corresponding to RF) and 2 for a multivariate regression tree (corresponding to MRF)
ff: Vector of m_feature features selected from all the features of X; this varies with each split
Details
At each node of a regression tree, a fixed number of features (m_feature) is selected randomly as candidates for generating the split. The node costs of all selected features, together with the n-1 possible thresholds for n samples, are evaluated to select the feature and threshold with the minimum cost.
Value
List with the following components:
index_left: Indices of the samples that are in the left node after splitting
index_right: Indices of the samples that are in the right node after splitting
which_feature: The number of the feature that produces the minimum splitting cost
threshold_feature: The threshold value for the node split; samples with a feature value less than or equal to the threshold go to the left node, otherwise to the right node
Examples
library(MultivariateRandomForest)
X = matrix(runif(20 * 100), 20, 100)
Y = matrix(runif(20 * 3), 20, 3)
m_feature = 5
Index = 1:20
Inv_Cov_Y = solve(cov(Y))
ff2 = ncol(X)  # total number of features
ff = sort(sample(ff2, m_feature))  # randomly selected feature subset
Command = 2  # MRF, as the number of output features is greater than 1
Split_criteria = splitt2(X, Y, m_feature, Index, Inv_Cov_Y, Command, ff)
Calculates Variable Importance of a Regression Tree Model
Description
Number of times a variable has been picked in the branch nodes of a (single) regression tree.
Usage
variable_importance_measure(Model_VIM,NumVariable)
Arguments
Model_VIM: Regression tree model in which the variable importance is measured
NumVariable: Number of variables in the training or testing matrix
Details
When calculating the node cost in a tree of a random forest, a user-defined number of variables is randomly picked for each node, and the best of these variables is chosen for the split using the node cost. An important variable will therefore tend to be selected as the best. This function counts how many times each variable has been picked in the branch nodes of the regression tree model.
Value
Vector of size 1 x NumVariable giving, for each variable in order, the number of times it appears in the branch nodes of the model.
Examples
library(MultivariateRandomForest)
trainX = matrix(runif(50 * 100), 50, 100)
trainY = matrix(runif(50 * 5), 50, 5)
n_tree = 2
m_feature = 5
min_leaf = 5
testX = matrix(runif(10 * 100), 10, 100)
# Bootstrap sample indices for each tree
theta <- function(trainX){trainX}
results <- bootstrap::bootstrap(1:nrow(trainX), n_tree, theta)
b = results$thetastar
Variable_number = ncol(trainY)
if (Variable_number > 1){
  Command = 2  # multivariate regression tree (MRF)
} else if (Variable_number == 1){
  Command = 1  # univariate regression tree (RF)
}
NumVariable = ncol(trainX)
NumRepeatation = matrix(rep(0, n_tree * NumVariable), nrow = n_tree)
for (i in 1:n_tree){
  Single_Model = NULL
  X = trainX[b[, i], ]
  Y = matrix(trainY[b[, i], ], ncol = Variable_number)
  Inv_Cov_Y = solve(cov(Y))  # calculate the V inverse
  if (Command == 1){
    Inv_Cov_Y = matrix(rep(0, 4), ncol = 2)  # [0 0; 0 0] for RF
  }
  Single_Model = build_single_tree(X, Y, m_feature, min_leaf, Inv_Cov_Y, Command)
  NumRepeatation[i, ] = variable_importance_measure(Single_Model, NumVariable)
}