Type: | Package |
Title: | Missing Data Segments Imputation in Multivariate Streams |
Version: | 0.1.0 |
Author: | Siyavash Shabani, Reza Rawassizadeh |
Maintainer: | Siyavash Shabani <s.shabani.aut@gmail.com> |
Description: | Helper functions provide an accurate imputation algorithm for reconstructing the missing segment in a multi-variate data streams. Inspired by single-shot learning, it reconstructs the missing segment by identifying the first similar segment in the stream. Nevertheless, there should be one column of data available, i.e. a constraint column. The values of columns can be characters (A, B, C, etc.). The result of the imputed dataset will be returned a .csv file. For more details see Reza Rawassizadeh (2019) <doi:10.1109/TKDE.2019.2914653>. |
URL: | https://www.researchgate.net/publication/332779980_Ghost_Imputation_Accurately_Reconstructing_Missing_Data_of_the_Off_Period |
Depends: | R (≥ 2.10) |
License: | GPL-3 |
Encoding: | UTF-8 |
LazyData: | true |
NeedsCompilation: | no |
Packaged: | 2020-03-23 22:25:06 UTC; lorman |
Imports: | R6 |
RoxygenNote: | 7.0.1 |
Repository: | CRAN |
Date/Publication: | 2020-03-25 16:50:05 UTC |
HTCls
Description
An internal function.
MissCls
Description
An internal function.
addtoList
Description
This function saves a section of the dataset in a list of objects.
Usage
addtoList(inHTCls, inHT)
appendOffset
Description
This function is used in the addtolist function for analysis of the offsets for each one of rows.
Usage
appendOffset(curoffsets, newoffsets)
Checking the variables and functions.
Description
This function is used to determine a variable or function is defined or not.
Usage
check(x)
chr:Number to char
Description
A function that covert the numbers to the charecters.
Usage
chr(n)
Converting the column name to the column number of the dataset.
Description
This Function checks the validation of the inserted name of the constraint column and after that, the column name will be converted to column number of the dataset.
Usage
constraint_check(constraintCol, data_frame)
countNulls
Description
This function determines the number of missing parts in a special part of the dataset.
Checking the equality of two parts of the dataset.
Description
Exactidentical.norowname function is one of the best ways to test whether two parts of the dataset are exactly equal. Exactidentical.norowname function returns TRUE when two parts of the dataset are equal and it returns FALSE when two parts of the dataset are not equal.
Usage
exactidentical.norowname(df1,df2)
hasNullRow
Description
This function checks to have the missing rows for the special part of the dataset.
Usage
hasNullRow(indf)
ht is a list of Hash objects.
Description
This is the collection of all Hash objects with in their offest.
Usage
ht
Checking the equality of two rows of the dataset.
Description
Identical.norowname function is one of the best ways to test whether two rows of the dataset are equal. Identical.norowname function returns TRUE when two parts of the dataset are equal and it returns FALSE when two parts of the dataset are not equal.Epsilon is a threshold for determining the equality for two rows.
Usage
identical.norowname(df1, df2, epsilon)
Exporting a .csv file to the special path in pc.
Description
This function gets a dataset and convert that to .csv file and finally save that to a special path in pc.
Usage
out_csv(data_frame, direction_save)
reconstruct: Missing Data Segments Imputation in Multivariate Streams
Description
Ghost is an accurate imputation algorithm for reconstructing the missing segment in multi-variate data streams. Inspired by single-shot learning, it reconstructs the missing segment by identifying the first similar segment in the stream. Nevertheless, there should be one column of data available, i.e. a constraint column. The values of columns can be characters ( A, B, C,etc.). The result of the imputed dataset will be returned a .csv file.
Usage
reconstruct(data_frame, constraintCol, wSize, direction_save,epsilon)
Arguments
data_frame |
A data frame with missing values. |
constraintCol |
The column number that all of its fields have data (without missing values). This column is considered as a constraint and it can always produce data, even if the system is shut down. |
wSize |
Length of a window that is used for data reconstruction, before and after the missing row(s). |
direction_save |
A direction for saving the output .CSV file(see details). |
epsilon |
A similarity coefficient that is used for searching in the algorithm (see details). |
Details
More information about operation of algorithm is prepared in algorithm's article: https://www.researchgate.net/publication/332779980_Ghost_Imputation_Accurately_Reconstructing_Missing_Data_of_the_Off_Period
Epsilon: The algorithm searches the data for the closest similar segment. As the first step, the algorithm determines prior and posterior segments missing part (the size of the segment will be given by wSize). As the second step, the algorithm starts to find the similar segment that passes the segment size and constraint similiarty condition. Sometimes, finding windows with exact similarity is impossible in a dataset. To mitigate this issue, and finding windows with approximate similarities the user can define the minimum percentage of similarity for searching the dataset with Epsilon coefficient.
direction_save: If the user inserts the Direction_save, the output file will be saved in the specified folder. Contrarily, if the user does not insert the Direction_save, the output file will be saved in the Environment R.
Author(s)
Siyavash Shabani,s.shabani.aut@gmail.com Reza Rawassizadeh,rrawassizadeh@acm.org
References
Rawassizadeh, Reza, Hamidreza Keshavarz, and Michael Pazzani. "Ghost Imputation: Accurately Reconstructing Missing Data of the Off Period." IEEE Transactions on Knowledge and Data Engineering (to appear).
Examples
#An example of the operation of the Algorithm.
data(test_ghost_csv)
## sample dataset----------------------------------
# S0 S1 S2 S3
#1 5 F G H
#2 5 B N T
#3 4 P O
#4 1 X C B
#5 1 N X
#6 1 R R R
#7 1 W W
#8 1 W W W
#9 2
#10 2
#11 1 O K O
#12 1 B O
#13 2 S D
#14 1 W W
#15 1 W S W
#16 2 P I M
#17 2 R U
#18 1 O K O
#19 1 B O
#20 1 R R R
#21 5 F G H
#22 5 B N T
#23 4
#24 1 X C B
#25 1 N X
reconstruct(test_ghost_csv,1,2,epsilon=0.4)
### output---------------------------------------------------
# S0 S1 S2 S3
#1 5 F G H
#2 5 B N T
#3 4 P O
#4 1 X C B
#5 1 N X
#6 1 R R R
#7 1 W W
#8 1 W W W
#9 2 P I M
#10 2 R U
#11 1 O K O
#12 1 B O
#13 2 S D
#14 1 W W
#15 1 W S W
#16 2 P I M
#17 2 R U
#18 1 O K O
#19 1 B O
#20 1 R R R
#21 5 F G H
#22 5 B N T
#23 4 P O
#24 1 X C B
#25 1 N X
Reconstructing the missing section.
Description
This function finds a possible section of the dataset for repacing with the missing section.
saxTransform
Description
This function is added to the package to enable users converting numeric data to discrete data. This is due to the fact that Ghost designed for discrete data and this function discretize numeric data and prepare them for the ghost algorithm.
Usage
saxTransform(data_frame, buckets,skipColumnVec,constraint_row)
Arguments
data_frame |
A data frame with numeric values. |
buckets |
The Input data range is divided to this number. |
skipColumnVec |
Column number that is not used in the algorithm. |
constraint_row |
Column number that is considered for constant column. |
Author(s)
Siyavash Shabani, s.shabani.aut@gmail.com, Reza Rawassizadeh, rrawassizadeh@acm.org
References
1- Rawassizadeh, Reza, Hamidreza Keshavarz, and Michael Pazzani. "Ghost Imputation: Accurately Reconstructing Missing Data of the Off Period." IEEE Transactions on Knowledge and Data Engineering (to appear).
2- Lin, J., Keogh, E., Lonardi, S., & Chiu, B. (2003). A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery (pp. 2-11). ACM.
Examples
data(sax_test)
#### Input dataframe-----------------------
# S0 S1 S2 S3
#1 1 2 54 65
#2 1 NA 21 54
#3 2 34 32 87
#4 1 23 58 52
#5 1 43 75 56
#6 2 12 20 95
#7 1 54 14 87
#8 3 -6 NA 30
#9 2 5 -60 32
#10 1 -85 58 25
#11 2 78 95 45
#12 3 52 52 62
#13 2 20 NA 58
#14 3 NA -62 78
#15 1 20 -10 96
#16 1 30 -6 NA
#17 1 12 -85 45
#18 1 NA 78 20
#19 1 23 95 NA
saxTransform(sax_test,buckets =10,skipColumnVec=1, constraint_row=1)
### Output data----------------------------------------------
# S0 S1 S2 S3
# [1,] "1" "2" "54" "65"
# [2,] "1" "" "f" "h"
# [3,] "2" "g" "g" "j"
# [4,] "1" "g" "h" "h"
# [5,] "1" "h" "i" "h"
# [6,] "2" "f" "f" "k"
# [7,] "1" "h" "f" "j"
# [8,] "3" "e" "" "g"
# [9,] "2" "f" "b" "g"
#[10,] "1" "a" "h" "g"
#[11,] "2" "j" "k" "h"
#[12,] "3" "h" "h" "i"
#[13,] "2" "f" "" "h"
#[14,] "3" "" "b" "j"
#[15,] "1" "f" "e" "k"
#[16,] "1" "g" "e" ""
#[17,] "1" "f" "a" "h"
#[18,] "1" "" "j" "f"
#[19,] "1" "g" "k" ""
A simple dataset.
Description
A simple file to use in the saxTransform function.
Usage
data("sax_test")
slidewindow
Description
This function divides the dataset into different parts and saves them in an object.
Usage
slidewindow(indf, constraintCol, windowS)
A simple .csv file to use in the reconstruct function.
Description
A simple .csv file to use in the reconstruct function.
Usage
data("test_ghost_csv")
write2file_revised
Description
This function replaces the finded section with the missing section.
Usage
write2file_revised(inObj, missidx, wSize, cnt, df1)