Type: | Package |
Title: | For Implementation of Feed Reduction, Learning Examples, NLP and Code Management |
Version: | 1.2.2 |
Author: | Travis Barton (2018) |
Maintainer: | Travis Barton <travisdatabarton@gmail.com> |
Description: | This package provides code management functions, NLP tools, a Monty Hall simulator, and an implementation of my own variable reduction technique called Feed Reduction. The Feed Reduction technique is not yet published; it is a tool that implements a series of binary neural networks to reduce data into N dimensions, where N is the number of possible values of the response variable. |
License: | GPL-2 |
Encoding: | UTF-8 |
Suggests: | textclean |
Imports: | FNN, stringi, beepr, ggplot2, keras, dplyr, readr, parallel, tm, e1071, SnowballC, data.table, fastmatch, neuralnet |
NeedsCompilation: | no |
Packaged: | 2022-04-27 20:27:08 UTC; travisbarton |
Repository: | CRAN |
Date/Publication: | 2022-04-27 22:10:14 UTC |
Binary Decision Neural Network Wrapper
Description
Used as a helper function of Feed_Reduction, Binary_Network uses a 3-layer neural network with an Adam optimizer, leaky ReLU activations for the first two layers, followed by a softmax on the last layer. The loss function is binary_crossentropy. This is a keras wrapper, and it uses TensorFlow as the backend.
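The architecture described above corresponds roughly to the keras model sketched below. This is a minimal illustration, not the package's exact internals; the helper name build_binary_net and the metric choice are assumptions.
library(keras)
# Sketch of a 3-layer binary network: two leaky-ReLU hidden layers,
# a softmax output, an Adam optimizer, and binary cross-entropy loss.
build_binary_net <- function(input_dim, nodes) {
  model <- keras_model_sequential() %>%
    layer_dense(units = nodes, input_shape = input_dim) %>%
    layer_activation_leaky_relu() %>%
    layer_dense(units = nodes) %>%
    layer_activation_leaky_relu() %>%
    layer_dense(units = 2, activation = "softmax")
  model %>% compile(optimizer = optimizer_adam(),
                    loss = "binary_crossentropy",
                    metrics = "accuracy")
  model
}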
Usage
Binary_Network(X, Y, X_test, val_split, nodes, epochs, batch_size, verbose = 0)
Arguments
X |
Training data. |
Y |
Training Labels. These must be binary. |
X_test |
The test data. |
val_split |
The validation split for keras. |
nodes |
The number of nodes in the hidden layers. |
epochs |
The number of epochs for the network. |
batch_size |
The batch size for the network. |
verbose |
Whether or not you want details about the run as it happens. 0 = silent, 1 = progress bar, 2 = one line per epoch. |
Details
This function is a helper used inside the larger function Feed_Reduction. The output is a list containing the training and testing data converted into an approximation of probability space for that binary decision.
Value
Train |
The training data in approximate probability space |
Test |
The testing data in approximate probability space |
Author(s)
Travis Barton
References
Check out http://wbbpredictions.com/wp-content/uploads/2018/12/Redditbot_Paper.pdf and the keras documentation for details.
See Also
Feed_Reduction
Examples
## Not run:
if(8 * .Machine$sizeof.pointer == 64){
#Feed Network Testing
library(keras)
library(dplyr)
install_keras()
dat <- keras::dataset_mnist()
X_train = array_reshape(dat$train$x / 255, c(nrow(dat$train$x), 784))
y_train = to_categorical(dat$train$y, 10)
X_test = array_reshape(dat$test$x / 255, c(nrow(dat$test$x), 784))
y_test = to_categorical(dat$test$y, 10)
index_train = which(dat$train$y == 6 | dat$train$y == 5)
index_train = sample(index_train, length(index_train))
index_test = which(dat$test$y == 6 | dat$test$y == 5)
index_test = sample(index_test, length(index_test))
temp = Binary_Network(X_train[index_train, ],
                      y_train[index_train, c(7, 6)], X_test[index_test, ], .3, 350, 30, 50)
}
## End(Not run)
A function for bootstrapping textual data so that all levels have the same number of entries.
Description
This function takes a corpus and a set of labels and uses Bootstrap_Vocab to increase the size of each label until they are all the same length. Stop words are not bootstrapped.
Usage
Bootstrap_Data_Frame(text, tags, stopwords, min_length = 7, max_length = 15)
Arguments
text |
text is the collection of textual data to bootstrap up. |
tags |
tags are the collection of tags that will be used to bootstrap. There should be one for every entry in 'text'. They do not have to be unique. |
stopwords |
Stopwords to make sure are not a part of the bootstrapping process. It is advised to eliminate the most common words. See Stopword_Maker(). |
min_length |
The shortest length allowable for a bootstrapped document. |
max_length |
The longest length allowable for a bootstrapped document. |
Details
Most of the bootstrapped sentences will be nonsensical. The intention of this package is not to create meaningful new sentences, but to trick your model into thinking it has levels of equal size. This method is meant for bag-of-words style models.
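The gist of the procedure can be sketched as below. This is an illustration of the idea only; the helper name balance_by_tag is made up, and the real function delegates the word sampling to Bootstrap_Vocab.
balance_by_tag <- function(text, tags, stopwords, min_length = 7, max_length = 15) {
  target <- max(table(tags))                       # size of the largest level
  out <- data.frame(text = text, tags = tags, stringsAsFactors = FALSE)
  for (tg in unique(tags)) {
    words <- setdiff(unlist(strsplit(text[tags == tg], " ")), stopwords)
    need <- target - sum(tags == tg)               # how many pseudo-documents to add
    if (need > 0 && length(words) > 0) {
      fake <- replicate(need, paste(sample(words, sample(min_length:max_length, 1),
                                           replace = TRUE), collapse = " "))
      out <- rbind(out, data.frame(text = fake, tags = tg, stringsAsFactors = FALSE))
    }
  }
  out
}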
Value
A data frame of your original documents and the bootstrapped ones (column 1), along with their tags (column 2).
Author(s)
Travis Barton
Examples
test_set = c('I like cats', 'I like dogs', 'we love animals', 'I am a vet',
'US politics bore me', 'I dont like to vote',
'The rainbow looked nice today dont you think tommy')
test_tags = c('animals', 'animals', 'animals', 'animals',
'politics', 'politics',
'misc')
Bootstrap_Data_Frame(test_set, test_tags, c("I", "we"), min_length = 3, max_length = 8)
An internal function for Bootstrap_Data_Frame.
Description
This function takes a selection of documents and bootstraps words from those sentences until there are N total sentences (both pseudo and original).
Usage
Bootstrap_Vocab(vocab, N, stopwds, min_length = 7, max_length = 15)
Arguments
vocab |
The collection of documents to bootstrap. |
N |
The total number of sentences to end up with. |
stopwds |
A list of stopwords to exclude from the bootstrapping process. |
min_length |
The shortest allowable bootstrapped document. |
max_length |
The longest allowable bootstrapped document. |
Details
The min and max length arguments do not guarantee that a sentence will reach that length. These sentences will be nonsensical.
Value
A vector of bootstrapped sentences.
Author(s)
Travis Barton
Examples
testing_set = c(paste('this is test', as.character(seq(1, 10, 1))))
Bootstrap_Vocab(testing_set, 20, c('this'))
For announcing when code is done.
Description
For alerting you when your code is done.
Usage
Codes_done(title, msg, sound = FALSE, effect = 1)
Arguments
title |
The title of the notification |
msg |
The message to be sent |
sound |
Optional sound to blurt as well. |
effect |
If sound is blurted, what should it be? (Check the beepr package for sound options.) |
Details
Only for Linux (as far as I know).
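The referenced Stack Overflow approach boils down to something like the sketch below. This is an assumption on my part: it presumes a Linux desktop with notify-send on the PATH, and notify_done is an illustrative name, not the package's code.
notify_done <- function(title, msg, sound = FALSE, effect = 1) {
  system(sprintf("notify-send '%s' '%s'", title, msg))  # desktop notification (Linux)
  if (sound) beepr::beep(effect)                        # optional audible alert
}
notify_done("done", "check it", sound = TRUE, effect = 1)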
Author(s)
smacdonald (Stack Overflow), with modification by Travis Barton
References
https://stackoverflow.com/questions/3365657/is-there-a-way-to-make-r-beep-play-a-sound-at-the-end-of-a-script
Examples
Codes_done("done", "check it", sound = TRUE, effect = 1)
For Creating a test and train set from a whole set
Description
For making one dataset into two (test and train).
Usage
Cross_val_maker(data, alpha)
Arguments
data |
The matrix of data you want to split. |
alpha |
The proportion of the data to split off (e.g. 0.1 for a 10% split). |
Value
Returns a list accessible with the '$' operator. Test and Train are labeled as such.
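Conceptually, the split amounts to the sketch below (illustrative only; whether alpha ends up as the test or the train fraction should be checked against the function itself, and split_data is not the package's code).
split_data <- function(data, alpha) {
  test_idx <- sample(nrow(data), floor(alpha * nrow(data)))  # random alpha-fraction of rows
  list(Train = data[-test_idx, ], Test = data[test_idx, ])
}
str(split_data(iris, .1))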
Author(s)
Travis Barton
Examples
dat <- Cross_val_maker(iris, .1)
train <- dat$Train
test <- dat$Test
A Function for converting data into approximations of probability space.
Description
It takes the number of unique labels in the training data and fits a one-vs-all binary neural network for each unique label. The output is an approximation of the probability that each individual input matches the label. Travis Barton (2018) http://wbbpredictions.com/wp-content/uploads/2018/12/Redditbot_Paper.pdf
Usage
Feed_Reduction(X, Y, X_test, val_split = .1,
nodes = NULL, epochs = 15,
batch_size = 30, verbose = 0)
Arguments
X |
Training data |
Y |
Training labels |
X_test |
Testing data |
val_split |
The validation split for the keras binary neural networks. |
nodes |
The number of nodes for the hidden layers; the default is 1/4 of the length of the training data. |
epochs |
The number of epochs for the fitting of the networks. |
batch_size |
The batch size for the networks. |
verbose |
Whether or not you want details about the run as it happens. 0 = silent, 1 = progress bar, 2 = one line per epoch. |
Details
This is a new technique for dimensionality reduction of my own creation. Data is converted to the same number of dimensions as there are unique labels. Each dimension is an approximation of the probability that the data point belongs to that unique label. The return value is a list of the training and test data with their dimensionality reduced.
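The one-vs-all idea can be sketched with ordinary logistic regression standing in for the binary networks. Feed_Reduction itself fits keras models through Binary_Network; one_vs_all_features below is only an illustration of the reduction, not the package's code.
one_vs_all_features <- function(X, Y, X_test) {
  X <- as.data.frame(X); X_test <- as.data.frame(X_test)
  names(X_test) <- names(X)
  labs <- sort(unique(Y))
  fits <- lapply(labs, function(l)                      # one binary model per label
    glm(as.numeric(Y == l) ~ ., data = X, family = binomial))
  list(Train = sapply(fits, predict, newdata = X, type = "response"),
       Test  = sapply(fits, predict, newdata = X_test, type = "response"))
}
dat <- Cross_val_maker(iris, .2)
red <- one_vs_all_features(dat$Train[, -5], dat$Train$Species, dat$Test[, -5])
dim(red$Train)   # one column (approximate probability) per unique label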
Value
Train |
The training data in the new probability space |
Test |
The testing data in the new probability space |
Author(s)
Travis Barton.
References
Check out http://wbbpredictions.com/wp-content/uploads/2018/12/Redditbot_Paper.pdf for details on the process.
See Also
Binary_Network
Examples
## Not run:
if(8 * .Machine$sizeof.pointer == 64){
#Feed Network Testing
library(keras)
install_keras()
dat <- keras::dataset_mnist()
X_train = array_reshape(dat$train$x / 255, c(nrow(dat$train$x), 784))
y_train = dat$train$y
X_test = array_reshape(dat$test$x / 255, c(nrow(dat$test$x), 784))
y_test = dat$test$y
Reduced_Data2 = Feed_Reduction(X_train, y_train, X_test,
                               val_split = .3, nodes = 350,
                               30, 50, verbose = 1)
library(e1071)
names(Reduced_Data2$test) = names(Reduced_Data2$train)
newdat = as.data.frame(cbind(rbind(Reduced_Data2$train, Reduced_Data2$test), c(y_train, y_test)))
colnames(newdat) = c(paste("V", c(1:11), sep = ""))
mod = svm(V11 ~ ., data = newdat, subset = c(1:60000),
          kernel = 'linear', cost = 1, type = 'C-classification')
preds = predict(mod, newdat[60001:70000,-11])
sum(preds == y_test)/10000
}
## End(Not run)
Function for loading in pre-trained or personal word embedding files.
Description
Loads in GloVe's pretrained 42-billion-token embeddings, trained on the Common Crawl.
Usage
Load_Glove_Embeddings(path = 'glove.42B.300d.txt', d = 300)
Arguments
path |
The path to the embeddings file. |
d |
The dimension of the embeddings file. |
Details
Each line of the embeddings file should be a word followed by its numeric values, ending with a carriage return.
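A possible way to read such a file is sketched below. This is an assumption, not the package's exact loader; the tokens are set as row names so the result matches the data-frame format used in the Vector_Puller and Sentence_Vector examples.
read_embeddings <- function(path, d = 300) {
  emb <- data.table::fread(path, header = FALSE, quote = "", data.table = FALSE)
  rownames(emb) <- make.unique(emb[[1]])  # tokens as row names (made unique to be safe)
  emb[, 2:(d + 1)]                        # keep only the d numeric columns
}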
Value
The embeddings matrix.
Author(s)
Travis Barton
Examples
# This code only works if you have the ~5 GB file found here: <https://nlp.stanford.edu/projects/glove/>
## Not run: emb = Load_Glove_Embeddings()
Monty Hall Simulator
Description
A simulator for the famous Monty Hall Problem
Usage
Monty_Hall(Games = 10, Choice = "Stay")
Arguments
Games |
The number of games to run on the simulation |
Choice |
Whether you would like the simulation to 'Stay' with the first chosen door, 'Switch' to the other door, or 'Random', where you randomly decide to either stay or switch. |
Details
This is just a toy example of the famous Monty Hall problem. It returns a ggplot bar chart showing the counts of wins and losses in the simulation.
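One round of the game the simulator repeats looks roughly like this (a sketch; play_once is an illustrative name, and the real function tallies wins and losses into a ggplot bar chart instead of returning logicals).
play_once <- function(choice = "Stay") {
  doors <- 1:3
  prize <- sample(doors, 1)                     # car behind a random door
  pick  <- sample(doors, 1)                     # contestant's first pick
  goats <- setdiff(doors, c(prize, pick))       # doors the host may open
  shown <- goats[sample(length(goats), 1)]      # host opens one goat door
  if (choice == "Switch") pick <- setdiff(doors, c(pick, shown))
  pick == prize
}
mean(replicate(10000, play_once("Switch")))     # roughly 2/3
mean(replicate(10000, play_once("Stay")))       # roughly 1/3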
Value
A ggplot graph is produced. There is no return value.
Author(s)
Travis Barton
Examples
Monty_Hall(100, 'Stay')
For performing the nearest centroid problem (with modifications) on MNIST data specifically (general version to come)
Description
For Chen's homework, I'll change this when I generalize it.
Usage
Nearest_Centroid(X_train, X_test, Y_train)
Arguments
X_train |
Training data. |
X_test |
Data to be tested. |
Y_train |
Training labels. |
Note
Based on homework from Guangling Chen's M251 class at SJSU
Author(s)
Travis Barton
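As a point of reference, plain nearest-centroid classification looks like the sketch below. It ignores whatever "modifications" the function applies, and nearest_centroid_sketch is not the package's code.
nearest_centroid_sketch <- function(X_train, X_test, Y_train) {
  labs <- sort(unique(Y_train))
  centroids <- sapply(labs, function(l)            # one mean vector per class
    colMeans(X_train[Y_train == l, , drop = FALSE]))
  dists <- apply(X_test, 1, function(x)            # squared distance to each centroid
    colSums((centroids - x)^2))
  labs[apply(dists, 2, which.min)]                 # label of the closest centroid
}
dat <- Cross_val_maker(iris, .2)
preds <- nearest_centroid_sketch(as.matrix(dat$Train[, -5]),
                                 as.matrix(dat$Test[, -5]), dat$Train$Species)
table(preds, dat$Test$Species)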
Number/alphanumeric separator for strings.
Description
A function for separating numbers from letters. 'b4', for example, would be converted to 'b 4'.
Usage
Num_Al_Sep(vec)
Arguments
vec |
The string vector in which you wish to separate the numbers from the letters. |
Value
output |
The separated vector. |
Note
This is a really simple function, mostly used inside other functions.
Author(s)
Travis Barton
Examples
test_vec = 'The most iconic American weapon has to be the AR15'
res = Num_Al_Sep(test_vec)
print(res)
Percent of confusion matrix
Description
For finding the accuracy of confusion matrices built from true/predicted values
Usage
Percent(true, test)
Arguments
true |
The true values |
test |
The predicted values. |
Details
Make sure your true and predicted values cover the same labels so that they form a square confusion matrix.
Value
The percent accuracy.
Author(s)
Travis Barton
Examples
true <- rep(1:10, 10)
test <- rep(1:10, 10)
test[c(2, 22, 33, 89)] = 1
Percent(true, test)
# or equivalently, for an existing confusion matrix:
# Table_percent(table(true, test))
Pretreatment of textual documents for NLP.
Description
This function goes through a number of pretreatment steps in preparation for vectorization. These steps are designed to help the data become more standard so that there are fewer outliers when training during NLP. The following effects are applied: 1. Non-alpha/numerics are removed. 2. Numbers are separated from letters. 3. Numbers are replaced with their word equivalents. 4. Words are stemmed (optional). 5. Words are lowercased (optional).
Usage
Pretreatment(title_vec, stem = TRUE, lower = TRUE, parallel = FALSE)
Arguments
title_vec |
Vector of documents to be pre-treated. |
stem |
Boolean variable to decide whether to stem or not. |
lower |
Boolean variable to decide whether to lowercase words or not. |
parallel |
Boolean variable to decide whether to run this function in parallel or not. |
Details
This function returns a list. It should be able to accept any format that the function lapply would accept. The parallelization is done with the function mclapply from the package 'parallel' and will only work on systems that allow forking (sorry, Windows users). Future updates will allow for socketing.
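The numbered steps above can be mimicked on a single string as in the sketch below. This is illustrative only; the exact order and regular expressions inside Pretreatment may differ, and textclean is only listed in Suggests.
x <- "my battle-ship is on... b6!"
x <- gsub("[^[:alnum:] ]", " ", x)                     # 1. remove non-alpha/numerics
x <- gsub("([[:alpha:]])([[:digit:]])", "\\1 \\2", x)  # 2. separate letters from numbers
x <- gsub("([[:digit:]])([[:alpha:]])", "\\1 \\2", x)
x <- textclean::replace_number(x)                      # 3. numbers to word equivalents
x <- paste(SnowballC::wordStem(strsplit(x, "\\s+")[[1]]), collapse = " ")  # 4. stem
tolower(x)                                             # 5. lowercase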
Value
output |
The list of character strings post-pretreatment |
Author(s)
Travis Barton
Examples
## Not run: # for some reason it takes longer than 5 seconds on CRAN's computers
test_vec = c('This is a test', 'Ahoy!', 'my battle-ship is on... b6!')
res = Pretreatment(test_vec)
print(res)
## End(Not run)
Random Brains: Neural Network Implementation of Random Forest
Description
Creates a random forest style collection of neural networks for classification
Usage
Random_Brains(data, y, x_test,
variables = ceiling(ncol(data)/10),
brains = floor(sqrt(ncol(data))),
hiddens = c(3, 4))
Arguments
data |
The data that holds the predictors ONLY. |
y |
The response variable. |
x_test |
The testing predictors |
variables |
The number of predictors to select for each brain in 'data'. The default is one tenth of the number of columns in 'data'. |
brains |
The number of neural networks to create. The default is the square root of the number of columns in 'data'. |
hiddens |
This is a vector with length equal to the desired number of hidden layers. Each entry in the vector corresponds to the number of nodes in that layer. The default is c(3, 4), which is a two-layer network with 3 and 4 nodes in the respective layers. |
Details
This function is meant to mirror the classic random forest function exactly, the only difference being that it uses shallow neural networks to build the forest instead of decision trees.
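The forest-of-networks idea boils down to the sketch below: draw a random subset of predictors for each "brain", fit a small neuralnet on it, and let the brains vote. This is illustrative only; the real function's fitting details and output structure differ, and the scaling step is my own addition to help the networks converge.
library(neuralnet)
dat <- Cross_val_maker(iris, .2)
ctr <- colMeans(dat$Train[, -5]); scl <- apply(dat$Train[, -5], 2, sd)
X  <- scale(dat$Train[, -5], ctr, scl)          # scale predictors for stabler training
Xt <- scale(dat$Test[, -5], ctr, scl)
y  <- as.numeric(dat$Train$Species)
votes <- sapply(1:2, function(b) {              # brains = 2
  cols <- sample(ncol(X), 3)                    # variables = 3, picked at random per brain
  df   <- data.frame(X[, cols, drop = FALSE], y = y)
  f    <- as.formula(paste("y ~", paste(names(df)[-ncol(df)], collapse = " + ")))
  fit  <- neuralnet(f, data = df, hidden = c(3, 4))
  round(compute(fit, Xt[, cols, drop = FALSE])$net.result)
})
preds <- apply(votes, 1, function(v)            # majority vote across the brains
  as.numeric(names(which.max(table(v)))))
table(preds, as.numeric(dat$Test$Species))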
Value
predictions |
The predictions for x_test. |
num_brains |
The number of neural networks used to decide the predictions. |
predictors_per_brain |
The number of variables used by each of the neural networks that decide the predictions. |
hidden_layers |
The vector describing the hidden layers: how many there were and how many nodes each contained. |
preds_per_brain |
This matrix describes which columns were selected by each brain. Each row is a new brain; each column gives the index of a column used. |
raw_results |
The matrix of raw predictions from the brains. Each row is the cumulative predictions of all the brains. Which prediction won by majority vote can be seen in 'predictions'. |
Note
The neural networks are created using the neuralnet package!
Author(s)
Travis Barton
Examples
dat = Cross_val_maker(iris, .2)
train = dat$Train
test = dat$Test
Final_Test = Random_Brains(train[, -5],
                           train$Species, as.matrix(test[, -5]),
                           variables = 3, brains = 2)
table(Final_Test$predictions, as.numeric(test$Species))
Function for extracting the sentence vector from an embeddings matrix.
Description
Function for extracting the sentence vector from an embeddings matrix in a fast and convenient manner.
Usage
Sentence_Vector(Sentence, emb_matrix, dimension, stopwords)
Arguments
Sentence |
The sentence to find the vector of. |
emb_matrix |
The embeddings matrix to search. |
dimension |
The dimension of the vector to return. |
stopwords |
Words that should not be included in the averaging process. |
Details
The function splits the sentence into words, eliminates all stopwords, finds the vectors of each word, then averages the word vectors into a sentence vector.
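The averaging described above amounts to something like this sketch (sentence_vector_sketch is illustrative, not the package's internals; the lookup itself appears to be handled by Vector_Puller).
sentence_vector_sketch <- function(sentence, emb_matrix, dimension, stopwords) {
  words <- setdiff(unlist(strsplit(sentence, " ")), stopwords)  # drop stopwords
  words <- words[words %in% rownames(emb_matrix)]               # keep known tokens
  colMeans(emb_matrix[words, 1:dimension, drop = FALSE])        # average the word vectors
}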
Value
The sentence vector from an embeddings matrix.
Author(s)
Travis Barton
Examples
emb = data.frame(matrix(c(1, 2, 3, 4, 5, 5,
4, 3, 2, 1, 1, 5, 3, 2, 4), nrow = 3),
row.names = c('sentence', 'in', 'question'))
Sentence_Vector(c('this is the sentence in question'), emb, 5, c('this', 'is', 'the'))
For finding the N most populous words in a corpus.
Description
This function finds the N most used words in a corpus. This is done to identify stop words to better prune data sets before training.
Usage
Stopword_Maker(titles, cutoff = 20)
Arguments
titles |
The documents in which the most populous words are sought. |
cutoff |
The number, N, of top most used words to keep as stop words. |
Value
output |
A vector of the N most populous words. |
Author(s)
Travis Barton
Examples
test_set = c('this is a testset', 'I am searching for a list of words',
'I like turtles',
'A rocket would be a fast way of getting to work, but I do not think it is very practical')
res = Stopword_Maker(test_set, 4)
print(res)
Table Percent
Description
Finds the accuracy of square confusion tables.
Usage
Table_percent(in_table)
Arguments
in_table |
A confusion matrix. |
Details
The table must be square
Note
Make sure it is square.
Author(s)
Travis Barton
Examples
true <- rep(1:10, 10)
test <- rep(1:10, 10)
test[c(2, 22, 33, 89)] = 1
Table_percent(table(true, test))
Function for extracting word vectors from embeddings.
Description
Function for extracting word vectors from embeddings. This function is an internal function for 'Sentence_Vector'. It averages the word vectors and returns the result.
Usage
Vector_Puller(words, emb_matrix, dimension)
Arguments
words |
The words to be extracted. |
emb_matrix |
The embeddings matrix. It must be a data frame. |
dimension |
The dimension of the embeddings to extract. It does not have to match that of the matrix, but it cannot exceed the matrix's column count. |
Details
This is a simple and fast internal function.
Value
The vector that corresponds to the average of the word vectors.
Author(s)
Travis Barton
Examples
# This is an example emb_matrix
emb = data.frame(matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1), nrow = 2), row.names = c('cow', 'moo'))
Vector_Puller(c('cow', 'moo'), emb, 5)