Version: | 4.3.1 |
Title: | Quantitative Analysis of Textual Data |
Description: | A fast, flexible, and comprehensive framework for quantitative text analysis in R. Provides functionality for corpus management, creating and manipulating tokens and n-grams, exploring keywords in context, forming and manipulating sparse matrices of documents by features and feature co-occurrences, analyzing keywords, computing feature similarities and distances, applying content dictionaries, applying supervised and unsupervised machine learning, visually representing text and text analyses, and more. |
License: | GPL-3 |
Depends: | R (≥ 4.1.0), methods |
Imports: | fastmatch, jsonlite, lifecycle, magrittr, Matrix (≥ 1.5-0), Rcpp (≥ 0.12.12), SnowballC, stopwords, stringi, xml2, yaml |
LinkingTo: | Rcpp |
NeedsCompilation: | yes |
Suggests: | rmarkdown, spelling, testthat, formatR, tm (≥ 0.6), knitr, lsa, rlang, slam |
Enhances: | dplyr, lda, purrr, spacyr, stm, text2vec, tibble, tidytext, tokenizers, topicmodels |
URL: | https://quanteda.io |
Encoding: | UTF-8 |
BugReports: | https://github.com/quanteda/quanteda/issues |
LazyData: | TRUE |
VignetteBuilder: | knitr |
Language: | en-GB |
RoxygenNote: | 7.3.2 |
Collate: | 'RcppExports.R' 'tokenizers.R' 'meta.R' 'quanteda-documentation.R' 'aaa.R' 'bootstrap_dfm.R' 'casechange-functions.R' 'char_select.R' 'convert.R' 'corpus-addsummary-metadata.R' 'corpus-methods.R' 'corpus.R' 'corpus_chunk.R' 'corpus_group.R' 'corpus_reshape.R' 'corpus_sample.R' 'corpus_segment.R' 'corpus_subset.R' 'corpus_trim.R' 'data-documentation.R' 'dfm-classes.R' 'dfm-methods.R' 'dfm-print.R' 'dfm-subsetting.R' 'dfm.R' 'dfm_compress.R' 'dfm_group.R' 'dfm_lookup.R' 'dfm_match.R' 'dfm_replace.R' 'dfm_sample.R' 'dfm_select.R' 'dfm_sort.R' 'dfm_subset.R' 'dfm_trim.R' 'dfm_weight.R' 'dictionaries.R' 'dimnames.R' 'fcm-classes.R' 'docnames.R' 'docvars.R' 'fcm-methods.R' 'fcm-print.R' 'fcm-subsetting.R' 'fcm.R' 'fcm_select.R' 'index.R' 'kwic.R' 'message.R' 'nfunctions.R' 'object-builder.R' 'object2fixed.R' 'pattern2fixed.R' 'phrases.R' 'quanteda-package.R' 'quanteda_options.R' 'spacyr-methods.R' 'stopwords.R' 'summary.R' 'textmodel.R' 'textplot.R' 'texts.R' 'textstat.R' 'tokens-methods.R' 'tokens.R' 'tokens_chunk.R' 'tokens_compound.R' 'tokens_group.R' 'tokens_lookup.R' 'tokens_ngrams.R' 'tokens_replace.R' 'tokens_restore.R' 'tokens_sample.R' 'tokens_segment.R' 'tokens_select.R' 'tokens_split.R' 'tokens_subset.R' 'tokens_trim.R' 'tokens_xptr.R' 'utils.R' 'validator.R' 'wordstem.R' 'zzz.R' |
Packaged: | 2025-07-02 16:53:31 UTC; kbenoit |
Author: | Kenneth Benoit |
Maintainer: | Kenneth Benoit <kbenoit@lse.ac.uk> |
Repository: | CRAN |
Date/Publication: | 2025-07-10 12:50:05 UTC |
An R package for the quantitative analysis of textual data
Description
Functions for creating and managing textual corpora, extracting features from textual data, and analyzing those features using quantitative methods.
A fast, flexible, and comprehensive framework for quantitative text analysis in R. Provides functionality for corpus management, creating and manipulating tokens and n-grams, exploring keywords in context, forming and manipulating sparse matrices of documents by features and feature co-occurrences, analyzing keywords, computing feature similarities and distances, applying content dictionaries, applying supervised and unsupervised machine learning, visually representing text and text analyses, and more.
Details
quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data. quanteda includes tools to make it easy and fast to manipulate the texts in a corpus, by performing the most common natural language processing tasks simply and quickly, such as tokenizing, stemming, or forming ngrams. quanteda's functions for tokenizing texts and forming multiple tokenized documents into a document-feature matrix are both extremely fast and very simple to use. quanteda can segment texts easily by words, paragraphs, sentences, or even user-supplied delimiters and tags.
Built on the text processing functions in the stringi package, which is in turn built on the C++ implementation of the ICU libraries for Unicode text handling, quanteda pays special attention to fast and correct implementation of Unicode and the handling of text in any character set.
quanteda is built for efficiency and speed, through its design around three infrastructures: the stringi package for text processing, the Matrix package for sparse matrix objects, and computationally intensive processing (e.g. for tokens) handled in parallelized C++. If you can fit it into memory, quanteda will handle it quickly. (And eventually, we will make it possible to process objects even larger than available memory.)
quanteda is principally designed to allow users a fast and convenient method to go from a corpus of texts to a selected matrix of documents by features, after defining what constitutes the documents and features. The package makes it easy to redefine documents, for instance by splitting them into sentences or paragraphs, or by tags, as well as to group them into larger documents by document variables, or to subset them based on logical conditions or combinations of document variables. The package also implements common NLP feature selection functions, such as removing stopwords and stemming in numerous languages, selecting words found in dictionaries, treating words as equivalent based on a user-defined "thesaurus", and trimming and weighting features based on document frequency, feature frequency, and related measures such as tf-idf.
Tools for working with dictionaries are one of quanteda's principal strengths, and the package includes several core functions for preparing and applying dictionaries to texts, for example for lexicon-based sentiment analysis.
Once constructed, a quanteda document-feature matrix ("dfm") can be analyzed using quanteda's built-in tools for scaling document positions, or used with a number of other text analytic tools, such as: topic models (including converters for direct use with the topicmodels, lda, and stm packages); document scaling (using the quanteda.textmodels package's functions for the "wordfish" and "Wordscores" models, or direct use with the ca package for correspondence analysis); or machine learning through a variety of other packages that take matrix or matrix-like inputs. quanteda includes functions for converting its core objects, but especially a dfm, into other formats so that these are easy to use with other analytic packages.
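As a minimal sketch of this typical workflow, using the bundled data_corpus_inaugural corpus (the specific feature-selection steps shown are illustrative choices, not requirements):
library(quanteda)
# corpus -> tokens -> document-feature matrix
toks <- data_corpus_inaugural |>
  corpus_subset(Year > 1980) |>
  tokens(remove_punct = TRUE) |>
  tokens_remove(stopwords("en")) |>
  tokens_wordstem()
dfmat <- dfm(toks)
topfeatures(dfmat, 10)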
Additional features of quanteda include:
powerful, flexible tools for working with dictionaries;
the ability to identify keywords associated with documents or groups of documents;
the ability to explore texts using keywords-in-context;
quick computation of word or document statistics, using the quanteda.textstats package, for clustering or to compute distances for other purposes;
a comprehensive suite of descriptive statistics on text such as the number of sentences, words, characters, or syllables per document; and
flexible, easy to use graphical tools to portray many of the analyses available in the package.
Source code and additional information
https://github.com/quanteda/quanteda
Author(s)
Maintainer: Kenneth Benoit kbenoit@lse.ac.uk (ORCID) [copyright holder]
Authors:
Kohei Watanabe watanabe.kohei@gmail.com (ORCID)
Haiyan Wang whyinsa@yahoo.com (ORCID)
Paul Nulty paul.nulty@gmail.com (ORCID)
Adam Obeng quanteda@binaryeagle.com (ORCID)
Stefan Müller stefan.mueller@ucd.ie (ORCID)
Akitaka Matsuo a.matsuo@essex.ac.uk (ORCID)
William Lowe lowe@hertie-school.org (ORCID)
Other contributors:
Christian Müller C.Mueller@lse.ac.uk [contributor]
Olivier Delmarcelle olivier.delmarcelle@ugent.be (ORCID) [contributor]
European Research Council (ERC-2011-StG 283794-QUANTESS) [funder]
See Also
Useful links:
Report bugs at https://github.com/quanteda/quanteda/issues
Pipe operator
Description
See magrittr::%>%
for details.
Usage
lhs %>% rhs
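A minimal sketch of the pipe in use (quanteda re-exports the magrittr pipe, so %>% can be used alongside the native |>):
library(quanteda)
tokens(c(d1 = "a b c", d2 = "b c d")) %>%
  dfm() %>%
  topfeatures()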
Modify only documents matching a logical condition
Description
Applies the modification only to documents matching a condition.
Arguments
apply_if |
logical vector of length |
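A hedged sketch of how apply_if might be used, assuming it is accepted by tokens_remove() (as in recent quanteda versions); the documents and pattern here are purely illustrative:
toks <- tokens(c(d1 = "a b c", d2 = "a x y"))
# remove the pattern only in documents where the condition is TRUE
tokens_remove(toks, pattern = "a", apply_if = docnames(toks) == "d1")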
Coercion and checking methods for corpus objects
Description
Coercion functions to and from corpus objects, including conversion to a plain character object; and checks for whether an object is a corpus.
Usage
## S3 method for class 'corpus'
as.character(x, use.names = TRUE, ...)
is.corpus(x)
as.corpus(x)
Arguments
x |
object to be coerced or checked |
use.names |
logical; preserve (document) names if |
... |
additional arguments used by specific methods |
Value
as.character()
returns the corpus as a plain character vector, with
or without named elements.
is.corpus()
returns TRUE if the object is a corpus.
as.corpus()
upgrades a corpus object to the newest format.
Note
as.character(x)
where x
is a corpus is equivalent to
calling the deprecated texts(x)
.
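A minimal sketch of these coercion and checking functions:
corp <- corpus(c(d1 = "First text.", d2 = "Second text."))
as.character(corp, use.names = TRUE)
is.corpus(corp)                # TRUE
is.corpus(as.character(corp))  # FALSE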
Convert a dfm to a data.frame
Description
Defunct function to convert a dfm into a data.frame.
Use convert(x, to = "data.frame")
instead.
Usage
## S3 method for class 'dfm'
as.data.frame(
x,
row.names = NULL,
...,
document = docnames(x),
docid_field = "doc_id",
check.names = FALSE
)
Arguments
x |
any R object. |
row.names |
|
... |
unused |
document |
optional first column of mode |
docid_field |
character; the name of the column containing document
names used when |
check.names |
logical; passed to the |
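Since this S3 method is defunct, a minimal sketch of the recommended replacement via convert():
dfmat <- dfm(tokens(c(d1 = "a b b c", d2 = "b c d")))
convert(dfmat, to = "data.frame")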
See Also
Coercion and checking functions for dfm objects
Description
Convert an eligible input object into a dfm, or check whether an object is a dfm. Current eligible inputs for coercion to a dfm are: matrix, (sparse) Matrix, TermDocumentMatrix and DocumentTermMatrix (from the tm package), data.frame, and other dfm objects.
Usage
as.dfm(x)
is.dfm(x)
Arguments
x |
a candidate object for checking or coercion to dfm |
Value
as.dfm
converts an input object into a dfm. Row names
are used for docnames, and column names for featnames, of the resulting
dfm.
is.dfm
returns TRUE
if and only if its argument is a dfm.
See Also
as.data.frame.dfm()
, as.matrix.dfm()
,
convert()
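A minimal sketch of coercing a base matrix to a dfm and checking the result:
mat <- matrix(c(1, 0, 2, 3, 0, 1), nrow = 2,
              dimnames = list(c("doc1", "doc2"), c("alpha", "beta", "gamma")))
as.dfm(mat)
is.dfm(as.dfm(mat))  # TRUE
is.dfm(mat)          # FALSE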
Coercion and checking functions for dictionary objects
Description
Convert a dictionary from a different format into a quanteda dictionary, or check to see if an object is a dictionary.
Usage
as.dictionary(x, ...)
## S3 method for class 'data.frame'
as.dictionary(x, format = c("tidytext"), separator = " ", tolower = FALSE, ...)
is.dictionary(x)
Arguments
x |
a object to be coerced to a dictionary object. |
... |
additional arguments passed to underlying functions. |
format |
input format for the object to be coerced to a
dictionary; current legal values are a data.frame with the fields
|
separator |
the character in between multi-word dictionary values. This
defaults to |
tolower |
if |
Value
as.dictionary
returns a quanteda dictionary
object. This conversion function differs from the dictionary()
constructor function in that it converts an existing object rather than
creates one from components or from a file.
is.dictionary
returns TRUE
if an object is a
quanteda dictionary.
Examples
## Not run:
data(sentiments, package = "tidytext")
as.dictionary(subset(sentiments, lexicon == "nrc"))
as.dictionary(subset(sentiments, lexicon == "bing"))
# to convert AFINN into polarities - adjust thresholds if desired
datafinn <- subset(sentiments, lexicon == "AFINN")
datafinn[["sentiment"]] <-
with(datafinn,
sentiment <- ifelse(score < 0, "negative",
ifelse(score > 0, "positive", "neutral"))
)
with(datafinn, table(score, sentiment))
as.dictionary(datafinn)
dat <- data.frame(
word = c("Great", "Horrible"),
sentiment = c("positive", "negative")
)
as.dictionary(dat)
as.dictionary(dat, tolower = FALSE)
## End(Not run)
is.dictionary(dictionary(list(key1 = c("val1", "val2"), key2 = "val3")))
# [1] TRUE
is.dictionary(list(key1 = c("val1", "val2"), key2 = "val3"))
# [1] FALSE
Coercion and checking functions for fcm objects
Description
Convert an eligible input object into a fcm, or check whether an object is a fcm. Current eligible inputs for coercion to an fcm are: matrix, (sparse) Matrix, and other fcm objects.
Usage
as.fcm(x)
Arguments
x |
a candidate object for checking or coercion to an fcm |
Value
as.fcm
converts an input object into a fcm.
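A minimal sketch of coercing a plain symmetric co-occurrence matrix to an fcm:
mat <- matrix(c(0, 2, 2, 0), nrow = 2,
              dimnames = list(c("a", "b"), c("a", "b")))
as.fcm(mat)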
Coercion, checking, and combining functions for tokens objects
Description
Coercion functions to and from tokens objects, checks for whether an object is a tokens object, and functions to combine tokens objects.
Usage
## S3 method for class 'tokens'
as.list(x, ...)
## S3 method for class 'tokens'
as.character(x, use.names = FALSE, ...)
is.tokens(x)
as.tokens(x, concatenator = "_", ...)
## S3 method for class 'spacyr_parsed'
as.tokens(
x,
concatenator = "/",
include_pos = c("none", "pos", "tag"),
use_lemma = FALSE,
...
)
is.tokens(x)
Arguments
x |
object to be coerced or checked |
... |
additional arguments used by specific methods. For c.tokens, these are the tokens objects to be concatenated. |
use.names |
logical; preserve names if |
concatenator |
character; the concatenation character that will connect the tokens making up a multi-token sequence. |
include_pos |
character; whether and which part-of-speech tag to use:
|
use_lemma |
logical; if |
Details
The concatenator
is used to automatically generate dictionary
values for multi-word expressions in tokens_lookup()
and
dfm_lookup()
. The underscore character is commonly used to join
elements of multi-word expressions (e.g. "piece_of_cake", "New_York"), but
other characters (e.g. whitespace " " or a hyphen "-") can also be used.
In those cases, users have to tell the system what the concatenator is in their tokens so that the conversion knows to treat this character as the
inter-word delimiter when reading in the elements that will become the
tokens.
Value
as.list
returns a simple list of characters from a
tokens object.
as.character
returns a character vector from a
tokens object.
is.tokens
returns TRUE
if the object is of class
tokens, FALSE
otherwise.
as.tokens
returns a quanteda tokens object.
Examples
# create tokens object from list of characters with custom concatenator
dict <- dictionary(list(country = "United States",
sea = c("Atlantic Ocean", "Pacific Ocean")))
lis <- list(c("The", "United-States", "has", "the", "Atlantic-Ocean",
"and", "the", "Pacific-Ocean", "."))
toks <- as.tokens(lis, concatenator = "-")
tokens_lookup(toks, dict)
Coerce a dfm to a matrix or data.frame
Description
Methods for coercing a dfm object to a matrix or data.frame object.
Usage
## S3 method for class 'dfm'
as.matrix(x, ...)
Arguments
x |
dfm to be coerced |
... |
unused |
Examples
# coercion to matrix
as.matrix(data_dfm_lbgexample[, 1:10])
Convert quanteda dictionary objects to the YAML format
Description
Converts a quanteda dictionary object constructed by the dictionary function into the YAML format. The YAML files can be edited in text editors and imported into quanteda again.
Usage
as.yaml(x)
Arguments
x |
a dictionary object |
Value
as.yaml
a dictionary in the YAML format, as a character object
Examples
## Not run:
dict <- dictionary(list(one = c("a b", "c*"), two = c("x", "y", "z??")))
cat(yaml <- as.yaml(dict))
cat(yaml, file = (yamlfile <- paste0(tempfile(), ".yml")))
dictionary(file = yamlfile)
## End(Not run)
Function extending base::attributes()
Description
Function extending base::attributes()
Usage
attributes(x, overwrite = TRUE) <- value
Arguments
x |
an object |
overwrite |
if |
value |
new attributes |
Bootstrap a dfm
Description
Create an array of resampled dfms.
Usage
bootstrap_dfm(x, n = 10, ..., verbose = quanteda_options("verbose"))
Arguments
x |
a dfm object |
n |
number of resamples |
... |
additional arguments passed to |
verbose |
if |
Details
Function produces multiple, resampled dfm objects, based on resampling sentences (with replacement) from each document, recombining these into new "documents" and computing a dfm for each. Resampling of sentences is done strictly within document, so that every resampled document will contain at least some of its original tokens.
Value
A named list of dfm objects, where the first, dfm_0
, is
the dfm from the original texts, and subsequent elements are the
sentence-resampled dfms.
Author(s)
Kenneth Benoit
Examples
set.seed(10)
txt <- c(textone = "This is a sentence. Another sentence. Yet another.",
texttwo = "Premiere phrase. Deuxieme phrase.")
dfmat <- corpus_reshape(corpus(txt), to = "sentences") |>
tokens() |>
dfm()
bootstrap_dfm(dfmat, n = 3)
Combine dfm objects by Rows or Columns
Description
Combine a dfm with another dfm, or numeric, or matrix object, returning a dfm with the combined documents or features, respectively.
Usage
## S3 method for class 'dfm'
cbind(...)
## S3 method for class 'dfm'
rbind(...)
Arguments
... |
dfm, numeric, or matrix objects to be joined column-wise
( |
Details
cbind(x, y, ...)
combines dfm objects by columns, returning a
dfm object with combined features from input dfm objects. Note that this
should be used with extreme caution, as joining dfms with different
documents will result in a new row with the docname(s) of the first dfm,
merging in those from the second. Furthermore, if features are shared
between the dfms being cbinded, then duplicate feature labels will result.
In both instances, warning messages will result.
rbind(x, y, ...)
combines dfm objects by rows, returning a
dfm object with combined features from input dfm objects. Features are
matched between the two dfm objects, so that the order and names of the
features do not need to match. The order of the features in the resulting
dfm is not guaranteed. The attributes and settings of this new dfm are not
currently preserved.
Examples
# cbind() for dfm objects
(dfmat1 <- dfm(tokens(c("a b c d", "c d e f"))))
(dfmat2 <- dfm(tokens(c("a b", "x y z"))))
cbind(dfmat1, dfmat2)
cbind(dfmat1, 100)
cbind(100, dfmat1)
cbind(dfmat1, matrix(c(101, 102), ncol = 1))
cbind(matrix(c(101, 102), ncol = 1), dfmat1)
# rbind() for dfm objects
(dfmat1 <- dfm(tokens(c(doc1 = "This is one sample text sample."))))
(dfmat2 <- dfm(tokens(c(doc2 = "One two three text text."))))
(dfmat3 <- dfm(tokens(c(doc3 = "This is the fourth sample text."))))
rbind(dfmat1, dfmat2)
rbind(dfmat1, dfmat2, dfmat3)
Select or remove elements from a character vector
Description
These functions select or discard elements from a character object. For
convenience, the functions char_remove
and char_keep
are defined as
shortcuts for char_select(x, pattern, selection = "remove")
and
char_select(x, pattern, selection = "keep")
, respectively.
These functions make it easy to change, for instance, stopwords based on pattern matching.
Usage
char_select(
x,
pattern,
selection = c("keep", "remove"),
valuetype = c("glob", "fixed", "regex"),
case_insensitive = TRUE
)
char_remove(x, ...)
char_keep(x, ...)
Arguments
x |
an input character vector |
pattern |
a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
selection |
whether to |
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
... |
additional arguments passed by |
Value
a modified character vector
Examples
# character selection
mykeywords <- c("natural", "national", "denatured", "other")
char_select(mykeywords, "nat*", valuetype = "glob")
char_select(mykeywords, "nat", valuetype = "regex")
char_select(mykeywords, c("natur*", "other"))
char_select(mykeywords, c("natur*", "other"), selection = "remove")
# character removal
char_remove(letters[1:5], c("a", "c", "x"))
words <- c("any", "and", "Anna", "as", "announce", "but")
char_remove(words, "an*")
char_remove(words, "an*", case_insensitive = FALSE)
char_remove(words, "^.n.+$", valuetype = "regex")
# remove some of the system stopwords
stopwords("en", source = "snowball")[1:6]
stopwords("en", source = "snowball")[1:6] |>
char_remove(c("me", "my*"))
# character keep
char_keep(letters[1:5], c("a", "c", "x"))
Convert the case of character objects
Description
char_tolower
and char_toupper
are replacements for
base::tolower() and base::toupper()
based on the stringi package. The stringi functions for case
conversion are superior to the base functions because they correctly
handle case conversion for Unicode. In addition, the *_tolower()
functions
provide an option for preserving acronyms.
Usage
char_tolower(x, keep_acronyms = FALSE)
char_toupper(x)
Arguments
x |
the input object whose character/tokens/feature elements will be case-converted |
keep_acronyms |
logical; if |
Examples
txt1 <- c(txt1 = "b A A", txt2 = "C C a b B")
char_tolower(txt1)
char_toupper(txt1)
# with acronym preservation
txt2 <- c(text1 = "England and France are members of NATO and UNESCO",
text2 = "NASA sent a rocket into space.")
char_tolower(txt2)
char_tolower(txt2, keep_acronyms = TRUE)
char_toupper(txt2)
Check object class for functions
Description
Checks if the method is defined for the class.
Usage
check_class(class, method, defunct_methods = NULL)
Arguments
class |
the object class to check |
method |
the name of functions to be called |
Examples
## Not run:
quanteda:::check_class("tokens", "dfm_select")
## End(Not run)
Check arguments passed to other functions via ...
Description
Check arguments passed to other functions via ...
Usage
check_dots(..., method = NULL)
Arguments
... |
dots to check |
method |
the names of functions |
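A hedged sketch of invoking this internal checker; the argument name below is hypothetical, and the exact warning behaviour is not documented here:
## Not run:
# flag a ... argument not used by the named function(s)
quanteda:::check_dots(some_unused_argument = TRUE, method = "tokens")
## End(Not run)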
Validate input vectors
Description
Check the range of values and the length of input vectors before used in control flow or passed to C++ functions.
Usage
check_integer(
x,
min_len = 1,
max_len = 1,
min = -Inf,
max = Inf,
strict = FALSE,
allow_null = FALSE
)
check_double(
x,
min_len = 1,
max_len = 1,
min = -Inf,
max = Inf,
strict = FALSE,
allow_null = FALSE
)
check_logical(
x,
min_len = 1,
max_len = 1,
strict = FALSE,
allow_null = FALSE,
allow_na = FALSE
)
check_character(
x,
min_len = 1,
max_len = 1,
min_nchar = 0,
max_nchar = Inf,
strict = FALSE,
allow_null = FALSE
)
Arguments
min_len |
minimum length of the vector |
max_len |
maximum length of the vector |
min |
minimum value in the vector |
max |
maximum value in the vector |
strict |
raise error when |
allow_null |
if |
allow_na |
if |
min_nchar |
minimum character length of values in the vector |
max_nchar |
maximum character length of values in the vector |
Details
Note that value checks are performed after coercion to expected input types.
Examples
## Not run:
check_integer(0, min = 1) # error
check_integer(-0.1, min = 0) # return 0
check_double(-0.1, min = 0) # error
check_double(numeric(), min_len = 0) # return numeric()
check_double("1.1", min = 1) # returns 1.1
check_double("1.1", min = 1, strict = TRUE) # error
check_double("xyz", min = 1) # error
check_logical(c(TRUE, FALSE), min_len = 3) # error
check_character("_", min_nchar = 1) # return "_"
check_character("", min_nchar = 1) # error
## End(Not run)
Return the concatenator character from an object
Description
Get the concatenator character from a tokens object.
Usage
concat(x)
concatenator(x)
Arguments
x |
a tokens object |
Details
The concatenator character is a special delimiter used to link
separate tokens in multi-token phrases. It is embedded in the meta-data of
tokens objects and used in downstream operations, such as tokens_compound()
or tokens_lookup()
. It can be extracted using concat()
and set using
tokens(x, concatenator = ...)
when x
is a tokens object.
The default _
is recommended since it will not be removed during normal
cleaning and tokenization (while nearly all other punctuation characters, at
least those in the Unicode punctuation class [P]
will be removed).
Value
a character of length 1
Examples
toks <- tokens(data_corpus_inaugural[1:5])
concat(toks)
Convert quanteda objects to non-quanteda formats
Description
Convert a quanteda dfm or corpus object to a format useable by other
packages. The general function convert
provides easy conversion from a dfm
to the document-term representations used in all other text analysis packages
for which conversions are defined. For corpus objects, convert
provides
an easy way to make a corpus and its document variables into a data.frame.
Usage
convert(x, to, ...)
## S3 method for class 'dfm'
convert(
x,
to = c("lda", "tm", "stm", "austin", "topicmodels", "lsa", "matrix", "data.frame",
"tripletlist"),
docvars = NULL,
omit_empty = TRUE,
docid_field = "doc_id",
...
)
## S3 method for class 'corpus'
convert(x, to = c("data.frame", "json"), pretty = FALSE, ...)
Arguments
x |
|
to |
target conversion format, one of:
|
... |
unused directly |
docvars |
optional data.frame of document variables used as the
|
omit_empty |
logical; if |
docid_field |
character; the name of the column containing document
names used when |
pretty |
adds indentation whitespace to JSON output. Can be TRUE/FALSE or a number specifying the number of spaces to indent (default is 2). Use a negative number for tabs instead of spaces. |
Value
A converted object determined by the value of to
(see above).
See conversion target package documentation for more detailed descriptions
of the return formats.
Examples
## convert a dfm
toks <- corpus_subset(data_corpus_inaugural, Year > 1970) |>
tokens()
dfmat1 <- dfm(toks)
# austin's wfm format
identical(dim(dfmat1), dim(convert(dfmat1, to = "austin")))
# stm package format
stmmat <- convert(dfmat1, to = "stm")
str(stmmat)
# triplet
tripletmat <- convert(dfmat1, to = "tripletlist")
str(tripletmat)
## Not run:
# tm's DocumentTermMatrix format
tmdfm <- convert(dfmat1, to = "tm")
str(tmdfm)
# topicmodels package format
str(convert(dfmat1, to = "topicmodels"))
# lda package format
str(convert(dfmat1, to = "lda"))
## End(Not run)
## convert a corpus into a data.frame
corp <- corpus(c(d1 = "Text one.", d2 = "Text two."),
docvars = data.frame(dvar1 = 1:2, dvar2 = c("one", "two"),
stringsAsFactors = FALSE))
convert(corp, to = "data.frame")
convert(corp, to = "json")
Convenience wrappers for dfm convert
Description
To make the usage as consistent as possible with other packages, quanteda
also provides shortcut wrappers to convert()
, designed to be
similar in syntax to analogous commands in the packages to whose format they
are converting.
Usage
dfm2austin(x)
dfm2tm(x, weighting = tm::weightTf)
dfm2lda(x, omit_empty = TRUE)
dtm2lda(x, omit_empty = TRUE)
dfm2dtm(x, omit_empty = TRUE)
dfm2stm(x, docvars = NULL, omit_empty = TRUE)
Arguments
x |
the dfm to be converted |
weighting |
a tm weight, see |
omit_empty |
logical; if |
docvars |
optional data.frame of document variables used as the
|
Details
dfm2lda
converts a dfm into the list representation of terms in documents used by the lda package (a list with components "documents" and "vocab" as needed by lda::lda.collapsed.gibbs.sampler()).
dfm2ldaformat
converts a dfm into the same list representation of terms in documents used by the lda package (a list with components "documents" and "vocab" as needed by lda::lda.collapsed.gibbs.sampler()).
Value
A converted object determined by the value of to
(see above).
See conversion target package documentation for more detailed descriptions
of the return formats.
Note
Additional coercion methods to base R objects are also available:
as.data.frame(x) converts a dfm into a data.frame
as.matrix(x) converts a dfm into a matrix
Examples
dfmat <- corpus_subset(data_corpus_inaugural, Year > 1970) |>
tokens() |>
dfm()
## Not run:
# shortcut conversion to lda package list format
identical(quanteda:::dfm2lda(dfmat), convert(dfmat, to = "lda"))
## End(Not run)
## Not run:
# shortcut conversion to lda package list format
identical(dfm2ldaformat(dfmat), convert(dfmat, to = "lda"))
## End(Not run)
Construct a corpus object
Description
Creates a corpus object from available sources. The currently available sources are:
a character vector, consisting of one document per element; if the elements are named, these names will be used as document names.
a data.frame (or a tibble tbl_df), whose default document id is a variable identified by docid_field; the text of the document is a variable identified by text_field; and other variables are imported as document-level meta-data. This matches the format of data.frames constructed by the readtext package.
a tm VCorpus or SimpleCorpus class object, with the fixed metadata fields imported as docvars and corpus-level metadata imported as meta information.
a corpus object.
Usage
corpus(x, ...)
## S3 method for class 'corpus'
corpus(
x,
docnames = quanteda::docnames(x),
docvars = quanteda::docvars(x),
meta = quanteda::meta(x),
...
)
## S3 method for class 'character'
corpus(
x,
docnames = NULL,
docvars = NULL,
meta = list(),
unique_docnames = TRUE,
...
)
## S3 method for class 'data.frame'
corpus(
x,
docid_field = "doc_id",
text_field = "text",
meta = list(),
unique_docnames = TRUE,
...
)
## S3 method for class 'kwic'
corpus(
x,
split_context = TRUE,
extract_keyword = TRUE,
meta = list(),
concatenator = " ",
...
)
## S3 method for class 'Corpus'
corpus(x, ...)
Arguments
x |
a valid corpus source object |
... |
not used directly |
docnames |
Names to be assigned to the texts. Defaults to the names of
the character vector (if any); |
docvars |
a data.frame of document-level variables associated with each text |
meta |
a named list that will be added to the corpus as corpus-level,
user meta-data. This can later be accessed or updated using
|
unique_docnames |
logical; if |
docid_field |
optional column index of a document identifier; defaults
to "doc_id", but if this is not found, then will use the rownames of the
data.frame; if the rownames are not set, it will use the default sequence
based on |
text_field |
the character name or numeric index of the source
|
split_context |
logical; if |
extract_keyword |
logical; if |
concatenator |
character between tokens, default is the whitespace. |
Details
The texts and document variables of corpus objects can also be
accessed using index notation and the $
operator for accessing or assigning
docvars. For details, see [.corpus()
.
Value
A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus.
For quanteda >= 2.0, this is a specially classed character vector. It has many additional attributes but you should not access these attributes directly, especially if you are another package author. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change. Using the accessor and replacement functions ensures that future code to manipulate corpus objects will continue to work.
See Also
corpus, docvars()
,
meta()
, as.character.corpus()
, ndoc()
,
docnames()
Examples
# create a corpus from texts
corpus(data_char_ukimmig2010)
# create a corpus from texts and assign meta-data and document variables
summary(corpus(data_char_ukimmig2010,
docvars = data.frame(party = names(data_char_ukimmig2010))), 5)
# import a tm VCorpus
if (requireNamespace("tm", quietly = TRUE)) {
data(crude, package = "tm") # load in a tm example VCorpus
vcorp <- corpus(crude)
summary(vcorp)
data(acq, package = "tm")
summary(corpus(acq), 5)
vcorp2 <- tm::VCorpus(tm::VectorSource(data_char_ukimmig2010))
corp <- corpus(vcorp2)
summary(corp)
}
# construct a corpus from a data.frame
dat <- data.frame(letter_factor = factor(rep(letters[1:3], each = 2)),
some_ints = 1L:6L,
some_text = paste0("This is text number ", 1:6, "."),
stringsAsFactors = FALSE,
row.names = paste0("fromDf_", 1:6))
dat
summary(corpus(dat, text_field = "some_text",
meta = list(source = "From a data.frame called mydf.")))
Base method extensions for corpus objects
Description
Extensions of base R functions for corpus objects.
Usage
## S3 method for class 'corpus'
c1 + c2
## S3 method for class 'corpus'
c(..., recursive = FALSE)
## S3 method for class 'corpus'
x[i, drop_docid = TRUE]
## S3 method for class 'summary.corpus'
print(x, ...)
Arguments
c1 |
corpus one to be added |
c2 |
corpus two to be added |
recursive |
logical used by |
x |
a corpus object |
i |
document names or indices for documents to extract. |
drop_docid |
if |
Details
The +
operator for a corpus object will combine two corpus
objects, resolving any non-matching docvars()
by making them
into NA
values for the corpus lacking that field. Corpus-level meta
data is concatenated, except for source
and notes
, which are
stamped with information pertaining to the creation of the new joined
corpus.
The c()
operator is also defined for corpus class objects, and provides
an easy way to combine multiple corpus objects.
There are some issues that need to be addressed in future revisions of
quanteda concerning the use of factors to store document variables and
meta-data. Currently most or all of these are not recorded as factors,
because we use stringsAsFactors=FALSE
in the
data.frame()
calls that are used to create and store the
document-level information, because the texts should always be stored as
character vectors and never as factors.
Value
The +
and c()
operators return a corpus()
object.
Indexing a corpus works in three ways, as of v2.x.x:
- [ returns a subsetted corpus
- [[ returns the textual contents of a subsetted corpus (similar to as.character())
- $ returns a vector containing the single named docvars
See Also
Examples
# concatenate corpus objects
corp1 <- corpus(data_char_ukimmig2010[1:2])
corp2 <- corpus(data_char_ukimmig2010[3:4])
corp3 <- corpus(data_char_ukimmig2010[5:6])
summary(c(corp1, corp2, corp3))
# two ways to index corpus elements
data_corpus_inaugural["1793-Washington"]
data_corpus_inaugural[2]
# return the text itself
data_corpus_inaugural[["1793-Washington"]]
Segment a corpus into chunks of a given size
Description
Segment a corpus into new documents of roughly equal sized text chunks, with the possibility of overlapping the chunks.
Usage
corpus_chunk(
x,
size,
truncate = FALSE,
use_docvars = TRUE,
verbose = quanteda_options("verbose")
)
Arguments
x |
corpus object whose documents will be segmented into chunks |
size |
integer; the (approximate) token length of the chunks. See Details. |
truncate |
logical; if |
use_docvars |
if |
verbose |
if |
Details
The token length is estimated using stringi::stri_length(txt) / stringi::stri_count_boundaries(txt)
to avoid needing to tokenize and rejoin
the corpus from the tokens.
Note that when used for chunking texts prior to sending to large language
models (LLMs) with limited input token lengths, size should typically be set
to approximately 0.75-0.80 of the LLM's token limit. This is because
tokenizers (such as LLaMA's SentencePiece Byte-Pair Encoding tokenizer)
require more tokens than the linguistically defined grammatically-based
tokenizer that is the quanteda default. Note also that because of the
use of stringi::stri_count_boundaries(txt)
to approximate token length
(efficiently), the exact token length for chunking will be approximate.
See Also
Examples
data_corpus_inaugural[1] |>
corpus_chunk(size = 10)
Combine documents in corpus by a grouping variable
Description
Combine documents in a corpus object by a grouping variable, by concatenating their texts in the order of the documents within each grouping variable.
Usage
corpus_group(x, groups = docid(x), fill = FALSE, concatenator = " ")
Arguments
x |
corpus object |
groups |
grouping variable for sampling, equal in length to the number
of documents. This will be evaluated in the docvars data.frame, so that
docvars may be referred to by name without quoting. This also changes
previous behaviours for |
fill |
logical; if |
concatenator |
the concatenation character that will connect the grouped documents. |
Value
a corpus object whose documents are equal to the unique group combinations, and whose texts are the concatenations of the texts by group. Document-level variables that have no variation within groups are saved in docvars. Document-level variables that are lists are dropped from grouping, even when these exhibit no variation within groups.
Examples
corp <- corpus(c("a a b", "a b c c", "a c d d", "a c c d"),
docvars = data.frame(grp = c("grp1", "grp1", "grp2", "grp2")))
corpus_group(corp, groups = grp)
corpus_group(corp, groups = c(1, 1, 2, 2))
corpus_group(corp, groups = factor(c(1, 1, 2, 2), levels = 1:3))
# with fill
corpus_group(corp, groups = factor(c(1, 1, 2, 2), levels = 1:3), fill = TRUE)
Recast the document units of a corpus
Description
For a corpus, reshape (or recast) the documents to a different level of aggregation. Units of aggregation can be defined as documents, paragraphs, or sentences. Because the corpus object records its current "units" status, it is possible to move from recast units back to original units, for example from documents, to sentences, and then back to documents (possibly after modifying the sentences).
Usage
corpus_reshape(
x,
to = c("sentences", "paragraphs", "documents"),
use_docvars = TRUE,
...
)
Arguments
x |
corpus whose document units will be reshaped |
to |
new document units in which the corpus will be recast |
use_docvars |
if |
... |
additional arguments passed to |
Value
A corpus object with the documents defined as the new units, including document-level meta-data identifying the original documents.
Examples
# simple example
corp1 <- corpus(c(textone = "This is a sentence. Another sentence. Yet another.",
textwo = "Premiere phrase. Deuxieme phrase."),
docvars = data.frame(country=c("UK", "USA"), year=c(1990, 2000)))
summary(corp1)
summary(corpus_reshape(corp1, to = "sentences"))
# example with inaugural corpus speeches
(corp2 <- corpus_subset(data_corpus_inaugural, Year>2004))
corp2para <- corpus_reshape(corp2, to = "paragraphs")
corp2para
summary(corp2para, 50, showmeta = TRUE)
## Note that Bush 2005 is recorded as a single paragraph because that text
## used a single \n to mark the end of a paragraph.
Randomly sample documents from a corpus
Description
Take a random sample of documents of the specified size from a corpus, with or without replacement, optionally by grouping variables or with probability weights.
Usage
corpus_sample(x, size = ndoc(x), replace = FALSE, prob = NULL, by = NULL)
Arguments
x |
a corpus object whose documents will be sampled |
size |
a positive number, the number of documents to select; when used
with |
replace |
if |
prob |
a vector of probability weights for obtaining the elements of the
vector being sampled. May not be applied when |
by |
optional grouping variable for sampling. This will be evaluated in
the docvars data.frame, so that docvars may be referred to by name without
quoting. This also changes previous behaviours for |
Value
a corpus object (re)sampled on the documents, containing the document variables for the documents sampled.
Examples
set.seed(123)
# sampling from a corpus
summary(corpus_sample(data_corpus_inaugural, size = 5))
summary(corpus_sample(data_corpus_inaugural, size = 10, replace = TRUE))
# sampling with by
corp <- data_corpus_inaugural
corp$century <- paste(floor(corp$Year / 100) + 1)
corp$century <- paste0(corp$century, ifelse(corp$century < 21, "th", "st"))
corpus_sample(corp, size = 2, by = century) |>
summary()
# needs drop = TRUE to avoid empty interactions
corpus_sample(corp, size = 1, by = interaction(Party, century, drop = TRUE), replace = TRUE) |>
summary()
# sampling sentences by document
corp <- corpus(c(one = "Sentence one. Sentence two. Third sentence.",
two = "First sentence, doc2. Second sentence, doc2."),
docvars = data.frame(var1 = c("a", "a"), var2 = c(1, 2)))
corpus_reshape(corp, to = "sentences") %>%
corpus_sample(replace = TRUE, by = docid(.))
# oversampling
corpus_sample(corp, size = 5, replace = TRUE)
Segment texts on a pattern match
Description
Segment corpus text(s) or a character vector, splitting on a pattern match. This is useful for breaking the texts into smaller documents based on a regular pattern (such as a speaker identifier in a transcript) or a user-supplied annotation.
Usage
corpus_segment(
x,
pattern = "##*",
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
extract_pattern = TRUE,
pattern_position = c("before", "after"),
use_docvars = TRUE
)
char_segment(
x,
pattern = "##*",
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
remove_pattern = TRUE,
pattern_position = c("before", "after")
)
Arguments
x |
character or corpus object whose texts will be segmented |
pattern |
a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
extract_pattern |
extracts matched patterns from the texts and save in docvars if
|
pattern_position |
either |
use_docvars |
if |
remove_pattern |
removes matched patterns from the texts if |
Details
For segmentation into syntactic units defined by the locale (such as
sentences), use corpus_reshape()
instead. In cases where more
fine-grained segmentation is needed, such as that based on commas or
semi-colons (phrase delimiters within a sentence),
corpus_segment()
offers greater user control than
corpus_reshape()
.
Value
corpus_segment
returns a corpus of segmented texts
char_segment
returns a character vector of segmented texts
Boundaries and segmentation explained
The pattern
acts as a
boundary delimiter that defines the segmentation points for splitting a
text into new "document" units. Boundaries are always defined as the
pattern matches, plus the end and beginnings of each document. The new
"documents" that are created following the segmentation will then be the
texts found between boundaries.
The pattern itself will be saved as a new document variable named
pattern
. This is most useful when segmenting a text according to
tags such as names in a transcript, section titles, or user-supplied
annotations. If the beginning of the file precedes a pattern match, then
the extracted text will have a NA
for the extracted pattern
document variable (or when pattern_position = "after"
, this will be
true for the text split between the last pattern match and the end of the
document).
To extract syntactically defined sub-document units such as sentences and
paragraphs, use corpus_reshape()
instead.
Using patterns
One of the most common uses for
corpus_segment
is to partition a corpus into sub-documents using
tags. The default pattern value is designed for a user-annotated tag that
is a term beginning with double "hash" signs, followed by a whitespace, for
instance as ##INTRODUCTION The text
.
Glob and fixed pattern types use a whitespace character to signal the end of the pattern.
For more advanced pattern matches that could include whitespace or newlines, a regex pattern type can be used, for instance a text such as
Mr. Smith: Text
Mrs. Jones: More text
could have as pattern = "\\b[A-Z].+\\.\\s[A-Z][a-z]+:"
, which
would catch the title, the name, and the colon.
For custom boundary delimitation using punctuation characters that come at the end of a clause or sentence (such as , and .), these can be specified manually and pattern_position set to "after". To keep the punctuation characters in the text (as with sentence segmentation), set extract_pattern = FALSE. (With most tag applications, users will want to remove the patterns from the text, as they are annotations rather than parts of the text itself.)
See Also
corpus_reshape()
, for segmenting texts into pre-defined
syntactic units such as sentences, paragraphs, or fixed-length chunks
Examples
## segmenting a corpus
# segmenting a corpus using tags
corp1 <- corpus(c("##INTRO This is the introduction.
##DOC1 This is the first document. Second sentence in Doc 1.
##DOC3 Third document starts here. End of third document.",
"##INTRO Document ##NUMBER Two starts before ##NUMBER Three."))
corpseg1 <- corpus_segment(corp1, pattern = "##*")
cbind(corpseg1, docvars(corpseg1))
# segmenting a transcript based on speaker identifiers
corp2 <- corpus("Mr. Smith: Text.\nMrs. Jones: More text.\nMr. Smith: I'm speaking, again.")
corpseg2 <- corpus_segment(corp2, pattern = "\\b[A-Z].+\\s[A-Z][a-z]+:",
valuetype = "regex")
cbind(corpseg2, docvars(corpseg2))
# segmenting a corpus using crude end-of-sentence segmentation
corpseg3 <- corpus_segment(corp1, pattern = ".", valuetype = "fixed",
pattern_position = "after", extract_pattern = FALSE)
cbind(corpseg3, docvars(corpseg3))
## segmenting a character vector
# segment into paragraphs and removing the "- " bullet points
cat(data_char_ukimmig2010[4])
char_segment(data_char_ukimmig2010[4],
pattern = "\\n\\n(-\\s){0,1}", valuetype = "regex",
remove_pattern = TRUE)
# segment a text into clauses
txt <- c(d1 = "This, is a sentence? You: come here.", d2 = "Yes, yes okay.")
char_segment(txt, pattern = "\\p{P}", valuetype = "regex",
pattern_position = "after", remove_pattern = FALSE)
Extract a subset of a corpus
Description
Returns subsets of a corpus that meet certain conditions, including direct
logical operations on docvars (document-level variables). corpus_subset
functions identically to subset.data.frame()
, using non-standard
evaluation to evaluate conditions based on the docvars in the corpus.
Usage
corpus_subset(x, subset, drop_docid = TRUE, ...)
Arguments
x |
corpus object to be subsetted. |
subset |
logical expression indicating the documents to keep: missing values are taken as false. |
drop_docid |
if |
... |
not used |
Value
corpus object, with a subset of documents (and docvars) selected according to arguments
See Also
Examples
summary(corpus_subset(data_corpus_inaugural, Year > 1980))
summary(corpus_subset(data_corpus_inaugural, Year > 1930 & President == "Roosevelt"))
Remove sentences based on their token lengths or a pattern match
Description
Removes sentences from a corpus or a character vector shorter than a specified length.
Usage
corpus_trim(
x,
what = c("sentences", "paragraphs", "documents"),
min_ntoken = 1,
max_ntoken = NULL,
exclude_pattern = NULL
)
char_trim(
x,
what = c("sentences", "paragraphs", "documents"),
min_ntoken = 1,
max_ntoken = NULL,
exclude_pattern = NULL
)
Arguments
x |
corpus or character object whose sentences will be selected. |
what |
units of trimming, |
min_ntoken , max_ntoken |
minimum and maximum lengths in word tokens (excluding punctuation). Note that these are approximate numbers of tokens based on checking for word boundaries, rather than on-the-fly full tokenisation. |
exclude_pattern |
a stringi regular expression whose match (at the sentence level) will be used to exclude sentences |
Value
a corpus or character vector equal in length to the input. If
the input was a corpus, then the all docvars and metadata are preserved.
For documents whose sentences have been removed entirely, a null string
(""
) will be returned.
Examples
txt <- c("PAGE 1. This is a single sentence. Short sentence. Three word sentence.",
"PAGE 2. Very short! Shorter.",
"Very long sentence, with multiple parts, separated by commas. PAGE 3.")
corp <- corpus(txt, docvars = data.frame(serial = 1:3))
corp
# exclude sentences shorter than 3 tokens
corpus_trim(corp, min_ntoken = 3)
# exclude sentences that start with "PAGE <digit(s)>"
corpus_trim(corp, exclude_pattern = "^PAGE \\d+")
# trimming character objects
char_trim(txt, "sentences", min_ntoken = 3)
char_trim(txt, "sentences", exclude_pattern = "sentence\\.")
Internal data sets
Description
Data sets used for mainly internal purposes by the quanteda package.
Formerly included data objects
Description
The following corpus objects have been relocated to the quanteda.textmodels package:
- data_corpus_dailnoconf1991
- data_corpus_irishbudget2010
See Also
quanteda.textmodels::quanteda.textmodels-package
A paragraph of text for testing various text-based functions
Description
This is a long paragraph (2,914 characters) of text taken from a debate on Joe Higgins, delivered December 8, 2011.
Usage
data_char_sampletext
Format
character vector with one element
Source
Dáil Éireann Debate, Financial Resolution No. 13: General (Resumed). 7 December 2011. vol. 749, no. 1.
Examples
tokens(data_char_sampletext, remove_punct = TRUE)
Immigration-related sections of 2010 UK party manifestos
Description
Extracts from the election manifestos of 9 UK political parties from 2010, related to immigration or asylum-seekers.
Usage
data_char_ukimmig2010
Format
A named character vector of plain ASCII texts
Examples
data_corpus_ukimmig2010 <-
corpus(data_char_ukimmig2010,
docvars = data.frame(party = names(data_char_ukimmig2010)))
summary(data_corpus_ukimmig2010, showmeta = TRUE)
US presidential inaugural address texts
Description
US presidential inaugural address texts, and metadata (for the corpus), from 1789 to present.
Usage
data_corpus_inaugural
Format
a corpus object with the following docvars:
- Year: a four-digit integer year
- President: character; President's last name
- FirstName: character; President's first name (and possibly middle initial)
- Party: factor; name of the President's political party
Details
data_corpus_inaugural
is the quanteda-package corpus
object of US presidents' inaugural addresses since 1789. Document variables
contain the year of the address and the last name of the president.
Source
https://archive.org/details/Inaugural-Address-Corpus-1789-2009 and https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/inaugural-addresses.
Examples
# some operations on the inaugural corpus
summary(data_corpus_inaugural)
head(docvars(data_corpus_inaugural), 10)
dfm from data in Table 1 of Laver, Benoit, and Garry (2003)
Description
Constructed example data to demonstrate the Wordscores algorithm, from Laver Benoit and Garry (2003), Table 1.
Usage
data_dfm_lbgexample
Format
A dfm object with 6 documents and 37 features.
Details
This is the example word count data from Laver, Benoit and Garry's (2003) Table 1. Documents R1 to R5 are assumed to have known positions: -1.5, -0.75, 0, 0.75, 1.5. Document V1 is assumed unknown, and will have a raw text score of approximately -0.45 when computed as per LBG (2003).
References
Laver, M., Benoit, K.R., & Garry, J. (2003). Estimating Policy Positions from Political Text using Words as Data. American Political Science Review, 97(2), 311–331.
Lexicoder Sentiment Dictionary (2015)
Description
The 2015 Lexicoder Sentiment Dictionary in quanteda dictionary format.
Usage
data_dictionary_LSD2015
Format
A dictionary of four keys containing glob-style pattern matches.
negative
2,858 word patterns indicating negative sentiment
positive
1,709 word patterns indicating positive sentiment
neg_positive
1,721 word patterns indicating a positive word preceded by a negation (used to convey negative sentiment)
neg_negative
2,860 word patterns indicating a negative word preceded by a negation (used to convey positive sentiment)
Details
The dictionary consists of 2,858 "negative" sentiment words and 1,709 "positive" sentiment words. A further set of 2,860 and 1,721 negations of negative and positive words, respectively, is also included. While many users will find the non-negation sentiment forms of the LSD adequate for sentiment analysis, Young and Soroka (2012) did find a small, but non-negligible increase in performance when accounting for negations. Users wishing to test this or include the negations are encouraged to subtract negated positive words from the count of positive words, and subtract the negated negative words from the negative count.
Young and Soroka (2012) also suggest the use of a pre-processing script to remove specific cases of some words (i.e., "good bye", or "nobody better", which should not be counted as positive). Pre-processing scripts are available at https://www.snsoroka.com/data-lexicoder/.
License and Conditions
The LSD is available for non-commercial academic purposes only. By using
data_dictionary_LSD2015
, you accept these terms.
Please cite the references below when using the dictionary.
References
The objectives, development and reliability of the dictionary are discussed in detail in Young and Soroka (2012). Please cite this article when using the Lexicoder Sentiment Dictionary and related resources. Young, L. & Soroka, S. (2012). Lexicoder Sentiment Dictionary. Available at https://www.snsoroka.com/data-lexicoder/.
Young, L. & Soroka, S. (2012). Affective News: The Automated Coding of Sentiment in Political Texts. doi:10.1080/10584609.2012.671234. Political Communication, 29(2), 205–231.
Examples
# simple example
txt <- "This aggressive policy will not win friends."
tokens_lookup(tokens(txt), dictionary = data_dictionary_LSD2015, exclusive = FALSE)
## tokens from 1 document.
## text1 :
## [1] "This" "NEGATIVE" "policy" "will" "NEG_POSITIVE" "POSITIVE" "POSITIVE" "."
# notice that double-counting of negated and non-negated terms is avoided
# when using nested_scope = "dictionary"
tokens_lookup(tokens(txt), dictionary = data_dictionary_LSD2015,
exclusive = FALSE, nested_scope = "dictionary")
## tokens from 1 document.
## text1 :
## [1] "This" "NEGATIVE" "policy" "will" "NEG_POSITIVE" "POSITIVE."
# compound neg_negative and neg_positive tokens before creating a dfm object
toks <- tokens_compound(tokens(txt), data_dictionary_LSD2015)
dfm_lookup(dfm(toks), data_dictionary_LSD2015)
Create a document-feature matrix
Description
Construct a sparse document-feature matrix from a tokens or dfm object.
Usage
dfm(
x,
tolower = TRUE,
remove_padding = FALSE,
verbose = quanteda_options("verbose"),
...
)
Arguments
x |
|
tolower |
convert all features to lowercase. |
remove_padding |
logical; if |
verbose |
display messages if |
... |
not used. |
Value
a dfm object
Changes in version 3
In quanteda v4, many convenience functions formerly available in
dfm()
were removed.
See Also
Examples
## for a corpus
toks <- data_corpus_inaugural |>
corpus_subset(Year > 1980) |>
tokens()
dfm(toks)
# removal options
toks <- tokens(c("a b c", "A B C D")) |>
tokens_remove("b", padding = TRUE)
toks
dfm(toks)
dfm(toks) |>
dfm_remove(pattern = "") # remove "pads"
# preserving case
dfm(toks, tolower = FALSE)
Virtual class "dfm" for a document-feature matrix
Description
The dfm class of object is a type of Matrix-class object with
additional slots, described below. quanteda uses two subclasses of the
dfm
class, depending on whether the object can be represented by a
sparse matrix, in which case it is a dfm
class object, or if dense,
then a dfmDense
object. See Details.
Usage
## S4 method for signature 'dfm'
t(x)
## S4 method for signature 'dfm'
colSums(x, na.rm = FALSE, dims = 1, ...)
## S4 method for signature 'dfm'
rowSums(x, na.rm = FALSE, dims = 1, ...)
## S4 method for signature 'dfm'
colMeans(x, na.rm = FALSE, dims = 1, ...)
## S4 method for signature 'dfm'
rowMeans(x, na.rm = FALSE, dims = 1, ...)
## S4 method for signature 'dfm,numeric'
Arith(e1, e2)
## S4 method for signature 'numeric,dfm'
Arith(e1, e2)
## S4 method for signature 'dfm,index,index,missing'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'dfm,index,index,logical'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'dfm,missing,missing,missing'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'dfm,missing,missing,logical'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'dfm,index,missing,missing'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'dfm,index,missing,logical'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'dfm,missing,index,missing'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'dfm,missing,index,logical'
x[i, j, ..., drop = TRUE]
Arguments
x |
the dfm object |
na.rm |
if |
dims |
ignored |
... |
additional arguments not used here |
e1 |
first quantity in an Arith operation for dfm |
e2 |
second quantity in an Arith operation for dfm |
i |
document names or indices for documents to extract. |
j |
feature names or indices for documents to extract. |
Details
The dfm
class is a virtual class that will contain
dgCMatrix-class.
Slots
weightTf
the type of term frequency weighting applied to the dfm. Default is
"frequency"
, indicating that the values in the cells of the dfm are simple feature counts. To change this, use thedfm_weight()
method.weightFf
the type of document frequency weighting applied to the dfm. See
docfreq()
.smooth
a smoothing parameter, defaults to zero. Can be changed using the
dfm_smooth()
method.Dimnames
These are inherited from Matrix-class but are named
docs
andfeatures
respectively.
See Also
Examples
# dfm subsetting
dfmat <- dfm(tokens(c("this contains lots of stopwords",
"no if, and, or but about it: lots",
"and a third document is it"),
remove_punct = TRUE))
dfmat[1:2, ]
dfmat[1:2, 1:5]
Internal functions for dfm objects
Description
Internal function documentation for dfm objects.
Usage
## S4 method for signature 'dfm,numeric'
Compare(e1, e2)
Arguments
e1 |
a dfm |
e2 |
a numeric value to compare with values in a dfm |
See Also
Comparison operators
Convert a dfm to an lsa "textmatrix"
Description
Converts a dfm to a textmatrix for use with the lsa package.
Usage
dfm2lsa(x)
Arguments
x |
dfm to be converted |
Examples
## Not run:
(dfmat <- dfm(tokens(c(d1 = "this is a first matrix",
d2 = "this is second matrix as example"))))
lsa::lsa(convert(dfmat, to = "lsa"))
## End(Not run)
Recombine a dfm or fcm by combining identical dimension elements
Description
"Compresses" or groups a dfm or fcm whose dimension names are
the same, for either documents or features. This may happen, for instance,
if features are made equivalent through application of a thesaurus. It could also be needed after a
cbind.dfm()
or rbind.dfm()
operation. In most cases, you will not
need to call dfm_compress
, since it is called automatically by functions that change the
dimensions of the dfm, e.g. dfm_tolower()
.
Usage
dfm_compress(
x,
margin = c("both", "documents", "features"),
verbose = quanteda_options("verbose")
)
fcm_compress(x)
Arguments
x |
|
margin |
character indicating on which margin to compress a dfm, either
|
verbose |
if |
Value
dfm_compress
returns a dfm whose dimensions have been
recombined by summing the cells across identical dimension names
(docnames or featnames). The docvars will be
preserved for combining by features but not when documents are combined.
fcm_compress
returns an fcm whose features have been
recombined by combining counts of identical features, summing their counts.
Note
fcm_compress
works only when the fcm was created with a
document context.
Examples
# dfm_compress examples
dfmat <- rbind(dfm(tokens(c("b A A", "C C a b B")), tolower = FALSE),
dfm(tokens("A C C C C C"), tolower = FALSE))
colnames(dfmat) <- char_tolower(featnames(dfmat))
dfmat
dfm_compress(dfmat, margin = "documents")
dfm_compress(dfmat, margin = "features")
dfm_compress(dfmat)
# no effect if no compression needed
dfmatsubset <- dfm(tokens(data_corpus_inaugural[1:5]))
dim(dfmatsubset)
dim(dfm_compress(dfmatsubset))
# compress an fcm
fcmat1 <- fcm(tokens("A D a C E a d F e B A C E D"),
context = "window", window = 3)
## this will produce an error:
# fcm_compress(fcmat1)
txt <- c("The fox JUMPED over the dog.",
"The dog jumped over the fox.")
toks <- tokens(txt, remove_punct = TRUE)
fcmat2 <- fcm(toks, context = "document")
colnames(fcmat2) <- rownames(fcmat2) <- tolower(colnames(fcmat2))
colnames(fcmat2)[5] <- rownames(fcmat2)[5] <- "fox"
fcmat2
fcm_compress(fcmat2)
Combine documents in a dfm by a grouping variable
Description
Combine documents in a dfm by a grouping variable, by summing the cell frequencies within group and creating new "documents" with the group labels.
Usage
dfm_group(
x,
groups = docid(x),
fill = FALSE,
force = FALSE,
verbose = quanteda_options("verbose")
)
Arguments
x |
a dfm |
groups |
grouping variable for sampling, equal in length to the number
of documents. This will be evaluated in the docvars data.frame, so that
docvars may be referred to by name without quoting. This also changes
previous behaviours for |
fill |
logical; if |
force |
logical; if |
verbose |
if |
Value
dfm_group
returns a dfm whose documents are equal to
the unique group combinations, and whose cell values are the sums of the
previous values summed by group. Document-level variables that have no
variation within groups are saved in docvars. Document-level
variables that are lists are dropped from grouping, even when these exhibit
no variation within groups.
Examples
corp <- corpus(c("a a b", "a b c c", "a c d d", "a c c d"),
docvars = data.frame(grp = c("grp1", "grp1", "grp2", "grp2")))
dfmat <- dfm(tokens(corp))
dfm_group(dfmat, groups = grp)
dfm_group(dfmat, groups = c(1, 1, 2, 2))
# with fill = TRUE
dfm_group(dfmat, fill = TRUE,
groups = factor(c("A", "A", "B", "C"), levels = LETTERS[1:4]))
Apply a dictionary to a dfm
Description
Apply a dictionary to a dfm by looking up all dfm features for matches in a
set of dictionary values, and replacing those features with counts of
the dictionary's keys. If exclusive = FALSE
then the behaviour is to
apply a "thesaurus", where each value match is replaced by the dictionary
key, converted to capitals if capkeys = TRUE
(so that the replacements
are easily distinguished from features that were terms found originally in
the document).
Usage
dfm_lookup(
x,
dictionary,
levels = 1:5,
exclusive = TRUE,
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
capkeys = !exclusive,
nomatch = NULL,
verbose = quanteda_options("verbose")
)
Arguments
x |
the dfm to which the dictionary will be applied |
dictionary |
a dictionary-class object |
levels |
levels of entries in a hierarchical dictionary that will be applied |
exclusive |
if |
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
capkeys |
if |
nomatch |
an optional character naming a new feature that will contain
the counts of features of |
verbose |
print status messages if |
Note
If using dfm_lookup
with dictionaries containing multi-word
values, matches will only occur if the features themselves are multi-word
or formed from n-grams. A better way to match dictionary values that include
multi-word patterns is to apply tokens_lookup()
to the tokens,
and then construct the dfm.
See Also
dfm_replace
Examples
dict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
opposition = c("Opposition", "reject", "notincorpus"),
taxglob = "tax*",
taxregex = "tax.+$",
country = c("United_States", "Sweden")))
dfmat <- dfm(tokens(c("My Christmas was ruined by your opposition tax plan.",
"Does the United_States or Sweden have more progressive taxation?")))
dfmat
# glob format
dfm_lookup(dfmat, dict, valuetype = "glob")
dfm_lookup(dfmat, dict, valuetype = "glob", case_insensitive = FALSE)
# regex v. glob format: note that "united_states" is a regex match for "tax*"
dfm_lookup(dfmat, dict, valuetype = "glob")
dfm_lookup(dfmat, dict, valuetype = "regex", case_insensitive = TRUE)
# fixed format: no pattern matching
dfm_lookup(dfmat, dict, valuetype = "fixed")
dfm_lookup(dfmat, dict, valuetype = "fixed", case_insensitive = FALSE)
# show unmatched tokens
dfm_lookup(dfmat, dict, nomatch = "_UNMATCHED")
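As the Note above indicates, dictionary values containing multi-word patterns are better matched by applying tokens_lookup() before forming the dfm; a minimal sketch (the dictionary key and phrases below are illustrative only):
dict_mw <- dictionary(list(taxation = c("tax plan", "progressive taxation")))
toks <- tokens(c("My Christmas was ruined by your opposition tax plan.",
                 "Does the United_States or Sweden have more progressive taxation?"))
# multi-word values are matched at the tokens stage, then counted in the dfm
dfm(tokens_lookup(toks, dict_mw))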
Match the feature set of a dfm to given feature names
Description
Match the feature set of a dfm to a specified vector of feature names.
For existing features in x
for which there is an exact match for an
element of features
, these will be included. Any features in x
not in features
will be discarded, and any feature names specified in
features
but not found in x
will be added with all zero counts.
Usage
dfm_match(x, features, verbose = quanteda_options("verbose"))
Arguments
x |
a dfm |
features |
character; the feature names to be matched in the output dfm |
verbose |
if |
Details
Selecting on another dfm's featnames()
is useful when you
have trained a model on one dfm, and need to project this onto a test set
whose features must be identical. It is also used in
bootstrap_dfm()
.
Value
A dfm whose features are identical to those specified in
features
.
Note
Unlike dfm_select()
, this function will add feature names
not already present in x
. It also provides only fixed,
case-sensitive matches. For more flexible feature selection, see
dfm_select()
.
See Also
Examples
# matching a dfm to a feature vector
dfm_match(dfm(tokens("")), letters[1:5])
dfm_match(data_dfm_lbgexample, c("A", "B", "Z"))
dfm_match(data_dfm_lbgexample, c("B", "newfeat1", "A", "newfeat2"))
# matching one dfm to another
txt <- c("This is text one", "The second text", "This is text three")
(dfmat1 <- dfm(tokens(txt[1:2])))
(dfmat2 <- dfm(tokens(txt[2:3])))
(dfmat3 <- dfm_match(dfmat1, featnames(dfmat2)))
setequal(featnames(dfmat2), featnames(dfmat3))
Replace features in dfm
Description
Substitute features based on vectorized one-to-one matching for lemmatization or user-defined stemming.
Usage
dfm_replace(
x,
pattern,
replacement,
case_insensitive = TRUE,
verbose = quanteda_options("verbose")
)
Arguments
x |
dfm whose features will be replaced |
pattern |
a character vector. See pattern for more details. |
replacement |
if |
case_insensitive |
logical; if |
verbose |
if |
Examples
dfmat1 <- dfm(tokens(data_corpus_inaugural))
# lemmatization
taxwords <- c("tax", "taxing", "taxed", "taxes", "taxation")
lemma <- rep("TAX", length(taxwords))
featnames(dfm_select(dfmat1, pattern = taxwords))
dfmat2 <- dfm_replace(dfmat1, pattern = taxwords, replacement = lemma)
featnames(dfm_select(dfmat2, pattern = taxwords))
# stemming
feat <- featnames(dfmat1)
featstem <- char_wordstem(feat, "porter")
dfmat3 <- dfm_replace(dfmat1, pattern = feat, replacement = featstem, case_insensitive = FALSE)
identical(dfmat3, dfm_wordstem(dfmat1, "porter"))
Randomly sample documents from a dfm
Description
Take a random sample of documents of the specified size from a dfm, with or without replacement, optionally by grouping variables or with probability weights.
Usage
dfm_sample(
x,
size = NULL,
replace = FALSE,
prob = NULL,
by = NULL,
verbose = quanteda_options("verbose")
)
Arguments
x |
the dfm object whose documents will be sampled |
size |
a positive number, the number of documents to select; when used
with |
replace |
if |
prob |
a vector of probability weights for obtaining the elements of the
vector being sampled. May not be applied when |
by |
optional grouping variable for sampling. This will be evaluated in
the docvars data.frame, so that docvars may be referred to by name without
quoting. This also changes previous behaviours for |
verbose |
if |
Value
a dfm object (re)sampled on the documents, containing the document variables for the documents sampled.
See Also
Examples
set.seed(10)
dfmat <- dfm(tokens(c("a b c c d", "a a c c d d d", "a b b c")))
dfmat
dfm_sample(dfmat)
dfm_sample(dfmat, replace = TRUE)
# by groups
dfmat <- dfm(tokens(data_corpus_inaugural[50:58]))
dfm_sample(dfmat, by = Party, size = 2)
Select features from a dfm or fcm
Description
This function selects or removes features from a dfm or fcm,
based on feature name matches with pattern
. The most common usages
are to eliminate features from a dfm already constructed, such as stopwords,
or to select only terms of interest from a dictionary.
Usage
dfm_select(
x,
pattern = NULL,
selection = c("keep", "remove"),
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
min_nchar = NULL,
max_nchar = NULL,
padding = FALSE,
verbose = quanteda_options("verbose")
)
dfm_remove(x, ...)
dfm_keep(x, ...)
fcm_select(
x,
pattern = NULL,
selection = c("keep", "remove"),
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
verbose = quanteda_options("verbose"),
...
)
fcm_remove(x, ...)
fcm_keep(x, ...)
Arguments
x |
|
pattern |
a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
selection |
whether to |
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
min_nchar , max_nchar |
optional numerics specifying the minimum and
maximum length in characters for tokens to be removed or kept; defaults are
|
padding |
if |
verbose |
if |
... |
used only for passing arguments from |
Details
dfm_remove and fcm_remove are simply convenience wrappers for calling
dfm_select and fcm_select with selection = "remove".
dfm_keep and fcm_keep are simply convenience wrappers for calling
dfm_select and fcm_select with selection = "keep".
Value
A dfm or fcm object, after the feature selection has been applied.
For compatibility with earlier versions, when pattern
is a
dfm object and selection = "keep"
, then this will be
equivalent to calling dfm_match()
. In this case, the following
settings are always used: case_insensitive = FALSE
, and
valuetype = "fixed"
. This functionality is deprecated, however, and
you should use dfm_match()
instead.
Note
This function selects features based on their labels. To select
features based on the values of the document-feature matrix, use
dfm_trim()
.
See Also
Examples
dfmat <- tokens(c("My Christmas was ruined by your opposition tax plan.",
"Does the United_States or Sweden have more progressive taxation?")) |>
dfm(tolower = FALSE)
dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
wordsEndingInY = c("by", "my"),
notintext = "blahblah"))
dfm_select(dfmat, pattern = dict)
dfm_select(dfmat, pattern = dict, case_insensitive = FALSE)
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "keep", valuetype = "regex")
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "remove", valuetype = "regex")
dfm_select(dfmat, pattern = stopwords("english"), selection = "keep", valuetype = "fixed")
dfm_select(dfmat, pattern = stopwords("english"), selection = "remove", valuetype = "fixed")
# select based on character length
dfm_select(dfmat, min_nchar = 5)
dfmat <- dfm(tokens(c("This is a document with lots of stopwords.",
"No if, and, or but about it: lots of stopwords.")))
dfmat
dfm_remove(dfmat, stopwords("english"))
toks <- tokens(c("this contains lots of stopwords",
"no if, and, or but about it: lots"),
remove_punct = TRUE)
fcmat <- fcm(toks)
fcmat
fcm_remove(fcmat, stopwords("english"))
Sort a dfm by frequency of one or more margins
Description
Sorts a dfm by descending frequency of total features, total features in documents, or both.
Usage
dfm_sort(x, decreasing = TRUE, margin = c("features", "documents", "both"))
Arguments
x |
Document-feature matrix created by |
decreasing |
logical; if |
margin |
which margin to sort on |
Value
A sorted dfm matrix object
Author(s)
Ken Benoit
Examples
dfmat <- dfm(tokens(data_corpus_inaugural))
head(dfmat)
head(dfm_sort(dfmat))
head(dfm_sort(dfmat, decreasing = FALSE, "both"))
Extract a subset of a dfm
Description
Returns document subsets of a dfm that meet certain conditions,
including direct logical operations on docvars (document-level variables).
dfm_subset
functions identically to subset.data.frame()
,
using non-standard evaluation to evaluate conditions based on the
docvars in the dfm.
Usage
dfm_subset(
x,
subset,
min_ntoken = NULL,
max_ntoken = NULL,
drop_docid = TRUE,
verbose = quanteda_options("verbose"),
...
)
Arguments
x |
dfm object to be subsetted. |
subset |
logical expression indicating the documents to keep: missing values are taken as false. |
min_ntoken , max_ntoken |
minimum and maximum lengths of the documents to extract. |
drop_docid |
if |
verbose |
if |
... |
not used |
Details
To select or subset features, see dfm_select()
instead.
Value
dfm object, with a subset of documents (and docvars) selected according to arguments
Examples
corp <- corpus(c(d1 = "a b c d", d2 = "a a b e",
d3 = "b b c e", d4 = "e e f a b"),
docvars = data.frame(grp = c(1, 1, 2, 3)))
dfmat <- dfm(tokens(corp))
# selecting on a docvars condition
dfm_subset(dfmat, grp > 1)
# selecting on a supplied vector
dfm_subset(dfmat, c(TRUE, FALSE, TRUE, FALSE))
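A further sketch using the length-based arguments (with dfmat as created above):
# keep only documents containing at least five tokens
dfm_subset(dfmat, min_ntoken = 5)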
Weight a dfm by tf-idf
Description
Weight a dfm by term frequency-inverse document frequency (tf-idf), with full control over options. Uses fully sparse methods for efficiency.
Usage
dfm_tfidf(
x,
scheme_tf = "count",
scheme_df = "inverse",
base = 10,
force = FALSE,
...
)
Arguments
x |
object for which idf or tf-idf will be computed (a document-feature matrix) |
scheme_tf |
scheme for |
scheme_df |
scheme for |
base |
the base for the logarithms in the |
force |
logical; if |
... |
additional arguments passed to |
Details
dfm_tfidf
computes term frequency-inverse document frequency
weighting. The default is to use counts instead of normalized term
frequency (the relative term frequency within document), but this
can be overridden using scheme_tf = "prop"
.
References
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
Examples
dfmat1 <- as.dfm(data_dfm_lbgexample)
head(dfmat1[, 5:10])
head(dfm_tfidf(dfmat1)[, 5:10])
docfreq(dfmat1)[5:15]
head(dfm_weight(dfmat1)[, 5:10])
# replication of worked example from
# https://en.wikipedia.org/wiki/Tf-idf#Example_of_tf.E2.80.93idf
dfmat2 <-
matrix(c(1,1,2,1,0,0, 1,1,0,0,2,3),
byrow = TRUE, nrow = 2,
dimnames = list(docs = c("document1", "document2"),
features = c("this", "is", "a", "sample",
"another", "example"))) |>
as.dfm()
dfmat2
docfreq(dfmat2)
dfm_tfidf(dfmat2, scheme_tf = "prop") |> round(digits = 2)
## Not run:
# comparison with tm
if (requireNamespace("tm")) {
convert(dfmat2, to = "tm") |> tm::weightTfIdf() |> as.matrix()
# same as:
dfm_tfidf(dfmat2, base = 2, scheme_tf = "prop")
}
## End(Not run)
Convert the case of the features of a dfm and combine
Description
dfm_tolower()
and dfm_toupper()
convert the features of the dfm or
fcm to lower and upper case, respectively, and then recombine the counts.
Usage
dfm_tolower(x, keep_acronyms = FALSE, verbose = quanteda_options("verbose"))
dfm_toupper(x, verbose = quanteda_options("verbose"))
fcm_tolower(x, keep_acronyms = FALSE, verbose = quanteda_options("verbose"))
fcm_toupper(x, verbose = quanteda_options("verbose"))
Arguments
x |
the input object whose character/tokens/feature elements will be case-converted |
keep_acronyms |
logical; if |
verbose |
if |
Details
fcm_tolower()
and fcm_toupper()
convert both dimensions of
the fcm to lower and upper case, respectively, and then recombine
the counts. This works only on fcm objects created with context = "document"
.
Examples
# for a document-feature matrix
dfmat <- dfm(tokens(c("b A A", "C C a b B")), tolower = FALSE)
dfmat
dfm_tolower(dfmat)
dfm_toupper(dfmat)
# for a feature co-occurrence matrix
fcmat <- fcm(tokens(c("b A A d", "C C a b B e")),
context = "document")
fcmat
fcm_tolower(fcmat)
fcm_toupper(fcmat)
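A brief sketch of keep_acronyms, which preserves all-uppercase terms when lowercasing (the text used here is illustrative only):
dfmat2 <- dfm(tokens("NASA sent a rover to Mars"), tolower = FALSE)
dfm_tolower(dfmat2, keep_acronyms = TRUE)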
Trim a dfm using frequency threshold-based feature selection
Description
Returns a document by feature matrix reduced in size based on document and term frequency, usually in terms of a minimum frequency, but may also be in terms of maximum frequencies. Setting a combination of minimum and maximum frequencies will select features based on a range.
Feature selection is implemented by considering features across
all documents, by summing them for term frequency, or counting the
documents in which they occur for document frequency. Rank and quantile
versions of these are also implemented, for taking the first n
features in terms of descending order of overall global counts or document
frequencies, or as a quantile of all frequencies.
Usage
dfm_trim(
x,
min_termfreq = NULL,
max_termfreq = NULL,
termfreq_type = c("count", "prop", "rank", "quantile"),
min_docfreq = NULL,
max_docfreq = NULL,
docfreq_type = c("count", "prop", "rank", "quantile"),
sparsity = NULL,
verbose = quanteda_options("verbose")
)
Arguments
x |
a dfm object |
min_termfreq , max_termfreq |
minimum/maximum values of feature frequencies across all documents, below/above which features will be removed |
termfreq_type |
how |
min_docfreq , max_docfreq |
minimum/maximum values of a feature's document frequency, below/above which features will be removed |
docfreq_type |
specify how |
sparsity |
equivalent to |
verbose |
if |
Value
A dfm reduced in features (with the same number of documents)
Note
Trimming a dfm object is an operation based on the values
in the document-feature matrix. To select subsets of a dfm based on the
features themselves (meaning the feature labels from
featnames()
) – such as those matching a regular expression, or
removing features matching a stopword list, use dfm_select()
.
See Also
Examples
dfmat <- dfm(tokens(data_corpus_inaugural))
# keep only words occurring >= 10 times and in >= 2 documents
dfm_trim(dfmat, min_termfreq = 10, min_docfreq = 2)
# keep only words occurring >= 10 times and in at least 0.4 of the documents
dfm_trim(dfmat, min_termfreq = 10, min_docfreq = 0.4, docfreq_type = "prop")
# keep only words occurring <= 10 times and in <=2 documents
dfm_trim(dfmat, max_termfreq = 10, max_docfreq = 2)
# keep only words occurring <= 10 times and in at most 3/4 of the documents
dfm_trim(dfmat, max_termfreq = 10, max_docfreq = 0.75, docfreq_type = "prop")
# keep only words occurring 5 times in 1000, and in 2 of 5 of documents
dfm_trim(dfmat, min_docfreq = 0.4, min_termfreq = 0.005, termfreq_type = "prop")
## Not run:
# compare to removeSparseTerms from the tm package
(dfmattm <- convert(dfmat, "tm"))
tm::removeSparseTerms(dfmattm, 0.7)
dfm_trim(dfmat, min_docfreq = 0.3)
dfm_trim(dfmat, sparsity = 0.7)
## End(Not run)
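The quantile threshold types described in the Description can be combined in the same way; a brief sketch (using the dfmat created above):
# keep only features in the top 20% of term frequencies, occurring in at most 2 documents
dfm_trim(dfmat, min_termfreq = 0.8, termfreq_type = "quantile", max_docfreq = 2)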
Weight the feature frequencies in a dfm
Description
Weight the feature frequencies in a dfm
Usage
dfm_weight(
x,
scheme = c("count", "prop", "propmax", "logcount", "boolean", "augmented", "logave"),
weights = NULL,
base = 10,
k = 0.5,
smoothing = 0.5,
force = FALSE
)
dfm_smooth(x, smoothing = 1)
Arguments
x |
document-feature matrix created by dfm |
scheme |
a label of the weight type:
|
weights |
if |
base |
base for the logarithm when |
k |
the k for the augmentation when |
smoothing |
constant added to the dfm cells for smoothing, default is 1
for |
force |
logical; if |
Value
dfm_weight
returns the dfm with weighted values. Note that
because the default weighting scheme is "count"
, simply calling this
function on an unweighted dfm will return the same object. Many users will
want the normalized dfm consisting of the proportions of the feature counts
within each document, which requires setting scheme = "prop"
.
dfm_smooth
returns a dfm whose values have been smoothed by
adding the smoothing
amount. Note that this effectively converts a
matrix from sparse to dense format, so may exceed memory requirements
depending on the size of your input matrix.
References
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
See Also
Examples
dfmat1 <- dfm(tokens(data_corpus_inaugural))
dfmat2 <- dfm_weight(dfmat1, scheme = "prop")
topfeatures(dfmat2)
dfmat3 <- dfm_weight(dfmat1)
topfeatures(dfmat3)
dfmat4 <- dfm_weight(dfmat1, scheme = "logcount")
topfeatures(dfmat4)
dfmat5 <- dfm_weight(dfmat1, scheme = "logave")
topfeatures(dfmat5)
# combine these methods for more complex dfm_weightings, e.g. as in Section 6.4
# of Introduction to Information Retrieval
head(dfm_tfidf(dfmat1, scheme_tf = "logcount"))
# smooth the dfm
dfmat <- dfm(tokens(data_corpus_inaugural))
dfm_smooth(dfmat, 0.5)
Create a dictionary
Description
Create a quanteda dictionary class object, either from a list or by importing from a foreign format. Currently supported input file formats are the WordStat, LIWC, Lexicoder v2 and v3, and Yoshikoder formats. The import using the LIWC format works with all currently available dictionary files supplied as part of the LIWC 2001, 2007, and 2015 software (see References).
Usage
dictionary(
x,
file = NULL,
format = NULL,
separator = " ",
tolower = TRUE,
encoding = "utf-8"
)
Arguments
x |
a named list of character vector dictionary entries, including
valuetype pattern matches, and including multi-word expressions
separated by |
file |
file identifier for a foreign dictionary |
format |
character identifier for the format of the foreign dictionary. If not supplied, the format is guessed from the dictionary file's extension. Available options are:
|
separator |
the character in between multi-word dictionary values. This
defaults to |
tolower |
if |
encoding |
additional optional encoding value for reading in imported dictionaries. This uses the iconv labels for encoding. See the "Encoding" section of the help for file. |
Details
Dictionaries can be subsetted using
[
and
[[
, operating the same as the equivalent
list operators.
Dictionaries can be coerced from lists using as.dictionary()
,
coerced to named lists of characters using
as.list()
, and checked using
is.dictionary()
.
Value
A dictionary class object, essentially a specially classed named list of characters.
References
WordStat dictionaries page, from Provalis Research https://provalisresearch.com/products/content-analysis-software/wordstat-dictionary/.
Pennebaker, J.W., Chung, C.K., Ireland, M., Gonzales, A., & Booth, R.J. (2007). The development and psychometric properties of LIWC2007. [Software manual]. Austin, TX (https://www.liwc.app/).
Yoshikoder page, from Will Lowe https://conjugateprior.org/software/yoshikoder/.
Lexicoder format, https://www.snsoroka.com/data-lexicoder/
See Also
as.dictionary()
,
as.list()
, is.dictionary()
Examples
corp <- corpus_subset(data_corpus_inaugural, Year>1900)
dict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
opposition = c("Opposition", "reject", "notincorpus"),
taxing = "taxing",
taxation = "taxation",
taxregex = "tax*",
country = "america"))
tokens(corp) |>
tokens_lookup(dictionary = dict) |>
dfm()
# subset a dictionary
dict[1:2]
dict[c("christmas", "opposition")]
dict[["opposition"]]
# combine dictionaries
c(dict["christmas"], dict["country"])
## Not run:
dfmat <- dfm(tokens(data_corpus_inaugural))
# import the Laver-Garry dictionary from Provalis Research
dictfile <- tempfile()
download.file("https://provalisresearch.com/Download/LaverGarry.zip",
dictfile, mode = "wb")
unzip(dictfile, exdir = (td <- tempdir()))
dictlg <- dictionary(file = paste(td, "LaverGarry.cat", sep = "/"))
dfm_lookup(dfmat, dictlg)
# import a LIWC formatted dictionary from http://www.moralfoundations.org
download.file("http://bit.ly/37cV95h", tf <- tempfile())
dictliwc <- dictionary(file = tf, format = "LIWC")
dfm_lookup(dfmat, dictliwc)
## End(Not run)
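A brief sketch of the coercion and checking functions mentioned in Details (using the dict object created above):
# coerce a dictionary to a named list of character vectors
as.list(dict)
# check whether objects are dictionaries
is.dictionary(dict)
is.dictionary(list(christmas = c("Christmas", "Santa")))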
dictionary class objects and functions
Description
The dictionary2
class constructed by dictionary()
, and associated core
class functions.
Usage
## S4 method for signature 'dictionary2'
as.list(x, flatten = FALSE, levels = 1:100)
## S4 method for signature 'dictionary2,index,ANY,ANY'
x[i]
## S4 method for signature 'dictionary2,index'
x[[i]]
## S3 method for class 'dictionary2'
x$name
## S4 method for signature 'dictionary2'
c(x, ...)
Arguments
flatten |
flatten the nested structure if |
levels |
integer vector indicating levels in the dictionary. Used only
when |
i |
index for entries |
name |
the dictionary key |
... |
dictionary objects to be concatenated |
Slots
.Data
named list of mode character, where each element name is a dictionary "key" and each element is one or more dictionary entry "values" consisting of a pattern match
meta
list of object metadata
Compute the (weighted) document frequency of a feature
Description
For a dfm object, returns a (weighted) document frequency for each term. The default is a simple count of the number of documents in which a feature occurs more than a given frequency threshold. (The default threshold is zero, meaning that any feature occurring at least once in a document will be counted.)
Usage
docfreq(
x,
scheme = c("count", "inverse", "inversemax", "inverseprob", "unary"),
base = 10,
smoothing = 0,
k = 0,
threshold = 0
)
Arguments
x |
a dfm |
scheme |
type of document frequency weighting, computed as
follows, where
|
base |
the base with respect to which logarithms in the inverse document frequency weightings are computed; default is 10 (see Manning, Raghavan, and Schütze 2008, p123). |
smoothing |
added to the quotient before taking the logarithm |
k |
added to the denominator in the "inverse" weighting types, to prevent a zero document count for a term |
threshold |
numeric value of the threshold above which a feature will be considered in the computation of document frequency. The default is 0, meaning that a feature's document frequency will be the number of documents in which it occurs greater than zero times. |
Value
a numeric vector of document frequencies for each feature
References
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
Examples
dfmat1 <- dfm(tokens(data_corpus_inaugural))
docfreq(dfmat1[, 1:20])
# replication of worked example from
# https://en.wikipedia.org/wiki/Tf-idf#Example_of_tf.E2.80.93idf
dfmat2 <-
matrix(c(1,1,2,1,0,0, 1,1,0,0,2,3),
byrow = TRUE, nrow = 2,
dimnames = list(docs = c("document1", "document2"),
features = c("this", "is", "a", "sample",
"another", "example"))) |>
as.dfm()
dfmat2
docfreq(dfmat2)
docfreq(dfmat2, scheme = "inverse")
docfreq(dfmat2, scheme = "inverse", k = 1, smoothing = 1)
docfreq(dfmat2, scheme = "unary")
docfreq(dfmat2, scheme = "inversemax")
docfreq(dfmat2, scheme = "inverseprob")
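A sketch of the threshold argument, which counts a document only when a feature occurs more often than the threshold (using dfmat2 from above):
# count only documents in which a feature occurs more than once
docfreq(dfmat2, threshold = 1)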
Get or set document names
Description
Get or set the document names of a corpus, tokens, or dfm object.
Usage
docnames(x)
docnames(x) <- value
docid(x)
segid(x)
Arguments
x |
the object with docnames |
value |
a character vector of the same length as |
Value
docnames
returns a character vector of the document names
docnames <-
assigns new values to the document names of an object.
docnames can only be character, so any non-character value assigned to be a
docname will be coerced to mode character
.
docid
returns an internal variable denoting the original "docname"
from which a document came. If an object has been reshaped (e.g.
corpus_reshape()) or segmented (e.g. corpus_segment()), docid(x) returns
the original docnames, while segid(x) returns the serial number of those
segments within the original document.
Note
docid
and segid
are designed primarily for developers, not for end users. In
most cases, you will want docnames
instead. It is, however, the
default for groups, so that documents that have been previously reshaped
(e.g. corpus_reshape()) or segmented (e.g. corpus_segment()) will be
regrouped into their original docnames when groups = docid(x).
See Also
Examples
# get and set document names for a corpus
corp <- data_corpus_inaugural
docnames(corp) <- char_tolower(docnames(corp))
# get and set document names for a tokens object
toks <- tokens(corp)
docnames(toks) <- char_tolower(docnames(toks))
# get and set document names for a dfm
dfmat <- dfm(tokens(corp))
docnames(dfmat) <- char_tolower(docnames(dfmat))
# reassign the document names of the inaugural speech corpus
corp <- data_corpus_inaugural
docnames(corp) <- paste0("Speech", seq_len(ndoc(corp)))
corp <- corpus(c(textone = "This is a sentence. Another sentence. Yet another.",
textwo = "Sentence 1. Sentence 2."))
corp_sent <- corp |>
corpus_reshape(to = "sentences")
docnames(corp_sent)
# docid
docid(corp_sent)
docid(tokens(corp_sent))
docid(dfm(tokens(corp_sent)))
# segid
segid(corp_sent)
segid(tokens(corp_sent))
segid(dfm(tokens(corp_sent)))
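As described in the Note, reshaped documents can be regrouped into their original documents by using docid() as the grouping variable; a minimal sketch using corp_sent from above:
# regroup sentence-level documents back into the original documents
dfmat_sent <- dfm(tokens(corp_sent))
dfm_group(dfmat_sent, groups = docid(dfmat_sent))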
Get or set document-level variables
Description
Get or set variables associated with a document in a corpus, tokens or dfm object.
Usage
docvars(x, field = NULL)
docvars(x, field = NULL) <- value
## S3 method for class 'corpus'
x$name
## S3 replacement method for class 'corpus'
x$name <- value
## S3 method for class 'tokens'
x$name
## S3 replacement method for class 'tokens'
x$name <- value
## S3 method for class 'dfm'
x$name
## S3 replacement method for class 'dfm'
x$name <- value
Arguments
x |
corpus, tokens, or dfm object whose document-level variables will be read or set |
field |
string containing the document-level variable name |
value |
a vector of document variable values to be assigned to |
name |
a literal character string specifying a single docvars name |
Value
docvars
returns a data.frame of the document-level variables,
dropping the second dimension to form a vector if a single docvar is
returned.
docvars<-
assigns value
to the named field
Accessing or assigning docvars using the $
operator
As of quanteda v2, it is possible to access and assign a docvar using
the $
operator. See Examples.
Note
Reassigning document variables for a tokens or dfm object is allowed, but discouraged. A better, more reproducible workflow is to create your docvars as desired in the corpus, and let these continue to be attached "downstream" after tokenization and forming a document-feature matrix. Recognizing that in some cases, you may need to modify or add document variables to downstream objects, the assignment operator is defined for tokens or dfm objects as well. Use with caution.
Examples
# retrieving docvars from a corpus
head(docvars(data_corpus_inaugural))
tail(docvars(data_corpus_inaugural, "President"), 10)
head(data_corpus_inaugural$President)
# assigning document variables to a corpus
corp <- data_corpus_inaugural
docvars(corp, "President") <- paste("prez", 1:ndoc(corp), sep = "")
head(docvars(corp))
corp$fullname <- paste(data_corpus_inaugural$FirstName,
data_corpus_inaugural$President)
tail(corp$fullname)
# accessing or assigning docvars for a corpus using "$"
data_corpus_inaugural$Year
data_corpus_inaugural$century <- floor(data_corpus_inaugural$Year / 100)
data_corpus_inaugural$century
# accessing or assigning docvars for tokens using "$"
toks <- tokens(corpus_subset(data_corpus_inaugural, Year <= 1805))
toks$Year
toks$Year <- 1991:1995
toks$Year
toks$nonexistent <- TRUE
docvars(toks)
# accessing or assigning docvars for a dfm using "$"
dfmat <- dfm(toks)
dfmat$Year
dfmat$Year <- 1991:1995
dfmat$Year
dfmat$nonexistent <- TRUE
docvars(dfmat)
Internal function for select_types()
to escape regular expressions
Description
This function escapes glob patterns before utils::glob2rx() is applied, so that *
and ? are left unescaped.
Usage
escape_regex(x)
Arguments
x |
character vector to be escaped |
Simpler and faster version of expand.grid() in base package
Description
Simpler and faster version of expand.grid() in base package
Usage
expand(elem)
Arguments
elem |
list of elements to be combined |
Examples
quanteda:::expand(list(c("a", "b", "c"), c("x", "y")))
Create a feature co-occurrence matrix
Description
Create a sparse feature co-occurrence matrix, measuring co-occurrences of features within a user-defined context. The context can be defined as a document or a window within a collection of documents, with an optional vector of weights applied to the co-occurrence counts.
Usage
fcm(
x,
context = c("document", "window"),
count = c("frequency", "boolean", "weighted"),
window = 5L,
weights = NULL,
ordered = FALSE,
tri = TRUE,
...
)
Arguments
x |
a tokens, or dfm object from which to generate the feature co-occurrence matrix |
context |
the context in which to consider term co-occurrence:
|
count |
how to count co-occurrences:
|
window |
positive integer value for the size of a window on either side of the target feature, default is 5, meaning 5 words before and after the target feature |
weights |
a vector of weights applied to each distance from
|
ordered |
if |
tri |
if |
... |
not used here |
Details
The function fcm()
provides a very general
implementation of a "context-feature" matrix, consisting of a count of
feature co-occurrence within a defined context. This context, following
Momtazi et al. (2010), can be defined as the document,
sentences within documents, syntactic relationships between
features (nouns within a sentence, for instance), or according to a
window. When the context is a window, a weighting function is
typically applied that is a function of distance from the target word (see
Jurafsky and Martin 2015, Ch. 16) and ordered co-occurrence of the two
features is considered (see Church & Hanks 1990).
fcm provides all of this functionality, returning a V * V
matrix (where V
is the vocabulary size, returned by
nfeat()
). The tri = TRUE
option will only return the
upper part of the matrix.
Unlike some implementations of co-occurrences, fcm counts feature co-occurrences with themselves, meaning that the diagonal will not be zero.
fcm also provides "boolean" counting within the context of "window", which differs from the counting within "document".
is.fcm(x)
returns TRUE
if and only if its x is an object of
type fcm.
Author(s)
Kenneth Benoit (R), Haiyan Wang (R, C++), Kohei Watanabe (C++)
References
Momtazi, S., Khudanpur, S., & Klakow, D. (2010). "A comparative study of word co-occurrence for term clustering in language model-based sentence retrieval." Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, Los Angeles, California, June 2010, 325-328. https://aclanthology.org/N10-1046/
Jurafsky, D. & Martin, J.H. (2018). From Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of September 23, 2018 (Chapter 6, Vector Semantics). Available at https://web.stanford.edu/~jurafsky/slp3/.
Church, K. W. & P. Hanks (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22-29. https://aclanthology.org/J90-1003/
Examples
# see http://bit.ly/29b2zOA
toks1 <- tokens(c("A D A C E A D F E B A C E D"))
fcm(toks1, context = "window", window = 2)
fcm(toks1, context = "window", count = "weighted", window = 3)
fcm(toks1, context = "window", count = "weighted", window = 3,
weights = c(3, 2, 1), ordered = TRUE, tri = FALSE)
# with multiple documents
toks2 <- tokens(c("a a a b b c", "a a c e", "a c e f g"))
fcm(toks2, context = "document", count = "frequency")
fcm(toks2, context = "document", count = "boolean")
fcm(toks2, context = "window", window = 2)
txt3 <- c("The quick brown fox jumped over the lazy dog.",
"The dog jumped and ate the fox.")
toks3 <- tokens(char_tolower(txt3), remove_punct = TRUE)
fcm(toks3, context = "document")
fcm(toks3, context = "window", window = 3)
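A brief sketch contrasting boolean counting within a window with boolean counting within documents, as mentioned in Details:
toks4 <- tokens(c("a a a b b c", "a a c e", "a c e f g"))
fcm(toks4, context = "document", count = "boolean")
fcm(toks4, context = "window", count = "boolean", window = 2)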
Virtual class "fcm" for a feature co-occurrence matrix
Description
The fcm class of object is a special type of Matrix-class object with additional slots, described below.
Usage
## S4 method for signature 'fcm'
t(x)
## S4 method for signature 'fcm,numeric'
Arith(e1, e2)
## S4 method for signature 'numeric,fcm'
Arith(e1, e2)
## S4 method for signature 'fcm,index,index,missing'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'fcm,index,index,logical'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'fcm,missing,missing,missing'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'fcm,missing,missing,logical'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'fcm,index,missing,missing'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'fcm,index,missing,logical'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'fcm,missing,index,missing'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'fcm,missing,index,logical'
x[i, j, ..., drop = TRUE]
Arguments
x |
the fcm object |
e1 |
first quantity in "+" operation for fcm |
e2 |
second quantity in "+" operation for fcm |
i |
index for features |
j |
index for features |
... |
additional arguments not used here |
drop |
always set to |
Slots
context
the context definition
window
the size of the window, if context = "window"
count
how co-occurrences are counted
weights
context weighting for distance from the target feature, equal in length to window
margin
tri
whether the lower triangle of the symmetric V x V matrix is recorded
ordered
whether appearances of a term before or after the target feature are counted separately
See Also
Examples
# fcm subsetting
fcmat <- fcm(tokens(c("this contains lots of stopwords",
"no if, and, or but about it: lots"),
remove_punct = TRUE))
fcmat[1:3, ]
fcmat[4:5, 1:5]
Sort an fcm in alphabetical order of the features
Description
Sorts an fcm in alphabetical order of the features.
Usage
fcm_sort(x)
Arguments
x |
fcm object |
Value
An fcm object whose features have been alphabetically sorted.
Differs from dfm_sort() in that this function sorts the fcm by
the feature labels, not the counts of the features.
Author(s)
Kenneth Benoit
Examples
# with tri = FALSE
fcmat1 <- fcm(tokens(c("A X Y C B A", "X Y C A B B")), tri = FALSE)
rownames(fcmat1)[3] <- colnames(fcmat1)[3] <- "Z"
fcmat1
fcm_sort(fcmat1)
# with tri = TRUE
fcmat2 <- fcm(tokens(c("A X Y C B A", "X Y C A B B")), tri = TRUE)
rownames(fcmat2)[3] <- colnames(fcmat2)[3] <- "Z"
fcmat2
fcm_sort(fcmat2)
Compute the frequencies of features
Description
For a dfm object, returns a frequency for each feature, computed
across all documents in the dfm. This is equivalent to colSums(x)
.
Usage
featfreq(x)
Arguments
x |
a dfm |
Value
a (named) numeric vector of feature frequencies
See Also
Examples
dfmat <- dfm(tokens(data_char_sampletext))
featfreq(dfmat)
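The equivalence to colSums() noted in the Description can be checked directly:
all.equal(featfreq(dfmat), colSums(dfmat))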
Get the feature labels from a dfm
Description
Get the features from a document-feature matrix, which are stored as the column names of the dfm object.
Usage
featnames(x)
Arguments
x |
the dfm whose features will be extracted |
Value
character vector of the feature labels
Examples
dfmat <- dfm(tokens(data_corpus_inaugural))
# first 50 features (in original text order)
head(featnames(dfmat), 50)
# first 50 features alphabetically
head(sort(featnames(dfmat)), 50)
# contrast with descending total frequency order from topfeatures()
names(topfeatures(dfmat, 50))
Shortcut functions to access or assign metadata
Description
Internal functions to access or replace an object metadata field without
going through attribute trees. field_system()
, field_object()
and
field_user()
correspond to the system, object and user meta fields,
respectively.
Usage
field_system(x, field = NULL)
field_system(x, field = NULL) <- value
field_object(x, field = NULL)
field_object(x, field = NULL) <- value
field_user(x, field = NULL)
field_user(x, field = NULL) <- value
Arguments
x |
a list of attributes extracted from a |
field |
name of the sub-field to access or assign values |
Flatten a hierarchical dictionary into a list of character vectors
Description
Converts a hierarchical dictionary (a named list of named lists, ending in character vectors at the lowest level) into a flat list of character vectors.
Usage
flatten_dictionary(dictionary, levels = 1:100)
Arguments
dictionary |
a dictionary-class object to be flattened |
levels |
an integer vector indicating levels in the dictionary |
Value
A named list of character vectors
Examples
dict1 <- dictionary(
list(populism=c("elit*", "consensus*", "undemocratic*", "referend*",
"corrupt*", "propagand", "politici*", "*deceit*",
"*deceiv*", "*betray*", "shame*", "scandal*", "truth*",
"dishonest*", "establishm*", "ruling*"))
)
flatten_dictionary(dict1)
dict2 <- dictionary(
list(level1a = list(level1a1 = c("l1a11", "l1a12"),
level1a2 = c("l1a21", "l1a22")),
level1b = list(level1b1 = c("l1b11", "l1b12"),
level1b2 = c("l1b21", "l1b22", "l1b23")),
level1c = list(level1c1a = list(level1c1a1 = c("lowest1", "lowest2")),
level1c1b = list(level1c1b1 = c("lowestalone"))))
)
flatten_dictionary(dict2)
flatten_dictionary(dict2, 2)
flatten_dictionary(dict2, 1:2)
Internal function to flatten a nested list
Description
Internal function to flatten a nested list
Usage
flatten_list(
lis,
levels = 1:100,
level = 1,
key_parent = "",
lis_flat = list()
)
Arguments
lis |
a nested list |
levels |
an integer vector indicating levels in the list |
level |
an internal argument to pass current levels |
key_parent |
an internal argument to pass for parent keys |
lis_flat |
an internal argument to pass the flattened list |
Examples
lis <- list("A" = list("B" = c("b", "B"), c("a", "A", "aa")))
quanteda:::flatten_list(lis, 1:2)
quanteda:::flatten_list(lis, 1)
Format a sparsity value for printing
Description
Inputs a dfm sparsity value from sparsity()
and formats it for
printing in print.dfm()
.
Usage
format_sparsity(x)
Arguments
x |
input sparsity value, ranging from 0 to 1.0 |
Examples
ss <- c(1, .99999, .9999, .999, .99, .9,
.1, .01, .001, .0001, .000001, .0000001, .00000001, .000000000001, 0)
for (s in ss)
cat(format(s, width = 10), ":", quanteda:::format_sparsity(s), "\n")
Internal function to extract docvars
Description
Internal function to extract docvars
Usage
get_docvars(x, field = NULL, user = TRUE, system = FALSE, drop = FALSE)
Arguments
x |
an object from which docvars are extracted |
field |
name of docvar fields |
user |
if |
system |
if |
drop |
if |
Get the package version that created an object
Description
Return the quanteda package version in which a dfm, tokens, or corpus object was created.
Usage
get_object_version(x)
is_pre2(x)
Value
A three-element integer vector of class "package_version". For
versions of the package < 1.5 for which no version was recorded in the
object, c(1, 4, 0)
is returned.
is_pre2()
returns TRUE
if the object was created before
quanteda version 2, or FALSE
otherwise
Grouping variable(s) for various functions
Description
Groups for aggregation by various functions that take grouping options.
Arguments
groups |
grouping variable for sampling, equal in length to the number
of documents. This will be evaluated in the docvars data.frame, so that
docvars may be referred to by name without quoting. This also changes
previous behaviours for |
fill |
logical; if |
See Also
corpus_group()
, tokens_group()
, dfm_group()
Return the first or last part of a dfm
Description
For a dfm object, return the dfm with only the first or last n
documents.
Usage
## S3 method for class 'dfm'
head(x, n = 6L, ...)
## S3 method for class 'dfm'
tail(x, n = 6L, ...)
Arguments
x |
a dfm object |
n |
an integer vector of length up to |
... |
arguments to be passed to or from other methods. |
Value
A dfm class object corresponding to the subset of documents
determined by n
.
Examples
head(data_dfm_lbgexample, 3)
head(data_dfm_lbgexample, -4)
tail(data_dfm_lbgexample)
tail(data_dfm_lbgexample, n = 3)
Locate a pattern in a tokens object
Description
Locates a pattern within a tokens object, returning the index positions of the beginning and ending tokens in the pattern.
Usage
index(
x,
pattern,
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE
)
is.index(x)
Arguments
x |
an input tokens object |
pattern |
a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
Value
a data.frame consisting of one row per pattern match, with columns
for the document name, index positions from
and to
, and the pattern
matched.
is.index
returns TRUE
if the object was created by
index()
; FALSE
otherwise.
Examples
toks <- tokens(data_corpus_inaugural[1:8])
index(toks, pattern = "secure*")
index(toks, pattern = c("secure*", phrase("united states"))) |> head()
Get information on TBB library
Description
Get information on TBB library
Usage
info_tbb()
Check if an object is collocations
Description
Function to check if an object is a collocations object, created by
quanteda.textstats::textstat_collocations()
.
Usage
is.collocations(x)
Arguments
x |
object to be checked |
Value
TRUE
if the object is of class collocations
, FALSE
otherwise
Check if patterns contains glob wildcard
Description
Check if patterns contains glob wildcard
Usage
is_glob(pattern)
Arguments
pattern |
a glob pattern to be tested |
Check if a glob pattern is indexed by index_types
Description
Internal function for select_types
to check if a glob pattern is indexed by
index_types
.
Usage
is_indexed(pattern)
Arguments
pattern |
a glob pattern to be tested |
Check if a string is a regular expression
Description
Internal function for select_types()
to check if a string is a regular expression
Usage
is_regex(x)
Arguments
x |
a character string to be tested |
Locate keywords-in-context
Description
For a text or a collection of texts (in a quanteda corpus object), return a list of a keyword supplied by the user in its immediate context, identifying the source text and the word index number within the source text. (Not the line number, since the text may or may not be segmented using end-of-line delimiters.)
Usage
kwic(
x,
pattern,
window = 5,
valuetype = c("glob", "regex", "fixed"),
separator = " ",
case_insensitive = TRUE,
index = NULL,
...
)
is.kwic(x)
## S3 method for class 'kwic'
as.data.frame(x, ...)
Arguments
x |
|
pattern |
a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
window |
the number of context words to be displayed around the keyword |
valuetype |
the type of pattern matching: |
separator |
a character to separate words in the output |
case_insensitive |
logical; if |
index |
an index object to specify keywords |
... |
unused |
Value
A kwic
classed data.frame, with the document name
(docname
) and the token index positions (from
and to
,
which will be the same for single-word patterns, or a sequence equal in
length to the number of elements for multi-word phrases).
Note
pattern
will be a keyword pattern or phrase, possibly multiple
patterns, that may include punctuation. If a pattern contains whitespace,
it is best to wrap it in phrase()
to make this explicit. However if
pattern
is a collocations
(see quanteda.textstats) or
dictionary object, then the collocations or multi-word dictionary keys
will automatically be considered phrases where each whitespace-separated
element matches a token in sequence.
See Also
Examples
# single token matching
toks <- tokens(data_corpus_inaugural[1:8])
kwic(toks, pattern = "secure*", valuetype = "glob", window = 3)
kwic(toks, pattern = "secur", valuetype = "regex", window = 3)
kwic(toks, pattern = "security", valuetype = "fixed", window = 3)
# phrase matching
kwic(toks, pattern = phrase("secur* against"), window = 2)
kwic(toks, pattern = phrase("war against"), valuetype = "regex", window = 2)
# use index
idx <- index(toks, phrase("secur* against"))
kwic(toks, index = idx, window = 2)
kw <- kwic(tokens(data_corpus_inaugural[1:20]), "provident*")
is.kwic(kw)
is.kwic("Not a kwic")
is.kwic(kw[, c("pre", "post")])
toks <- tokens(data_corpus_inaugural[1:8])
kw <- kwic(toks, pattern = "secure*", valuetype = "glob", window = 3)
as.data.frame(kw)
Internal function to convert a list to a dictionary
Description
A dictionary is internally a list of lists, allowing keys and values to coexist at the same level.
Usage
list2dictionary(dict)
Arguments
dict |
list of object |
Internal function to lowercase dictionary values
Description
Internal function to lowercase dictionary values
Usage
lowercase_dictionary_values(dict)
Arguments
dict |
the dictionary whose values will be lowercased |
Examples
dict <- list(KEY1 = list(SUBKEY1 = c("A", "B"),
SUBKEY2 = c("C", "D")),
KEY2 = list(SUBKEY3 = c("E", "F"),
SUBKEY4 = c("G", "F", "I")),
KEY3 = list(SUBKEY5 = list(SUBKEY7 = c("J", "K")),
SUBKEY6 = list(SUBKEY8 = c("L"))))
quanteda:::lowercase_dictionary_values(dict)
Internal function to make new system-level docvars
Description
Internal function to make new system-level docvars
Usage
make_docvars(n, docname = NULL, unique = TRUE, drop_docid = TRUE)
Arguments
n |
the number of documents |
docname |
a character vector for the names of documents. Must be the
same length as |
unique |
if |
drop_docid |
if |
Internal functions to create a list of the meta fields
Description
Internal functions to create a list of the meta fields
Usage
make_meta(class, inherit = NULL, ...)
make_meta_system(inherit = NULL)
make_meta_corpus(inherit = NULL, ...)
make_meta_tokens(inherit = NULL, ...)
make_meta_dfm(inherit = NULL, ...)
make_meta_fcm(inherit = NULL, ...)
make_meta_dictionary2(inherit = NULL, ...)
update_meta(default, inherit, ..., warn = TRUE)
Arguments
class |
object class either |
inherit |
list from the meta attribute |
... |
values assigned to the object meta fields |
default |
default values for the meta attribute |
Converts a Matrix to a dfm
Description
Converts a Matrix to a dfm
Usage
matrix2dfm(x, docvars = NULL, meta = NULL)
Arguments
x |
a Matrix |
meta |
a list of values to be assigned to slots |
Converts a Matrix to a fcm
Description
Converts a Matrix to a fcm
Usage
matrix2fcm(x, meta = NULL)
Arguments
x |
a Matrix |
Internal function to merge values of duplicated keys
Description
Internal function to merge values of duplicated keys
Usage
merge_dictionary_values(dict)
Arguments
dict |
a dictionary object |
Examples
dict <- list("A" = list(AA = list("aaaaa"), "a"),
"B" = list("b"),
"C" = list("c"),
"A" = list("aa"))
quanteda:::merge_dictionary_values(dict)
Print messages in corpus methods
Description
Print messages in corpus methods
Usage
message_corpus(operation, before, after)
Arguments
before , after |
object statistics before and after the operation. |
Print messages in dfm methods
Description
Print messages in dfm methods
Usage
message_dfm(operation, before, after)
Arguments
before , after |
object statistics before and after the operation. |
Return an error message
Description
Return an error message
Usage
message_error(key = NULL)
Arguments
key |
type of error message |
Print messages in tokens methods
Description
Print messages in tokens methods
Usage
message_tokens(operation, before, after)
Arguments
before , after |
object statistics before and after the operation. |
Message parameter documentation
Description
Used in printing verbose messages for message_tokens() and message_dfm()
Arguments
verbose |
if |
before , after |
object statistics before and after the operation. |
See Also
message_tokens() message_dfm()
Get or set object metadata
Description
Get or set the object metadata in a corpus, tokens, dfm, or dictionary object. With the exception of dictionaries, this will be corpus-level metadata.
Usage
meta(x, field = NULL, type = c("user", "object", "system", "all"))
meta(x, field = NULL) <- value
Arguments
x |
an object for which the metadata will be read or set |
field |
metadata field name(s); if |
type |
|
value |
new value of the metadata field |
Value
For meta
, a named list of the metadata fields in the corpus.
For meta <-
, the corpus with the updated user-level metadata. Only
user-level metadata may be assigned.
Examples
meta(data_corpus_inaugural)
meta(data_corpus_inaugural, "source")
meta(data_corpus_inaugural, "citation") <- "Presidential Speeches Online Project (2014)."
meta(data_corpus_inaugural, "citation")
Internal function to get, set or initialize system metadata
Description
Sets or initializes system metadata for new objects.
Usage
meta_system(x, field = NULL)
meta_system(x, field = NULL) <- value
## S3 replacement method for class 'corpus'
meta_system(x, field = NULL) <- value
## S3 replacement method for class 'tokens'
meta_system(x, field = NULL) <- value
## S3 replacement method for class 'dfm'
meta_system(x, field = NULL) <- value
## S3 replacement method for class 'dictionary'
meta_system(x, field = NULL) <- value
meta_system_defaults()
Arguments
x |
an object for which the metadata will be read or set |
field |
metadata field name(s); if |
value |
new value of the metadata field |
Value
meta_system
returns a list with the object's system metadata.
It is literally a wrapper around meta(x, field, type = "system")
.
meta_system<-
returns the object with the system metadata
modified. This is an internal function and not designed for users!
meta_system_defaults
returns a list of default system
values, with the user setting the "source" value. This should be used
to set initial system meta information.
Examples
corp <- corpus(c(d1 = "one two three", d2 = "two three four"))
# quanteda:::`meta_system<-`(corp, value = quanteda:::meta_system_defaults("example"))
quanteda:::meta_system(corp)
Conditionally format messages
Description
Conditionally format messages
Usage
msg(format, ..., pretty = TRUE)
Arguments
format |
character vector of format strings |
... |
vectors (coercible to integer, real, or character) |
pretty |
if |
See Also
Examples
quanteda:::msg("you cannot delete %s %s", 2000, "documents")
Special handling for names of quanteda objects
Description
Keeps the element names and rownames in sync with the system docvar
docname_
.
Usage
## S3 replacement method for class 'corpus'
names(x) <- value
## S3 replacement method for class 'tokens'
names(x) <- value
## S4 replacement method for signature 'dfm'
rownames(x) <- value
## S4 replacement method for signature 'fcm'
rownames(x) <- value
Arguments
x |
an R object. |
value |
a character vector of up to the same length as |
Count the number of documents or features
Description
Get the number of documents or features in an object.
Usage
ndoc(x)
nfeat(x)
Arguments
x |
a quanteda object: a corpus, dfm, tokens, or tokens_xptr object, or a readtext object from the readtext package |
Value
ndoc()
returns an integer count of the number of documents in an
object whose texts are organized as "documents" (a corpus, dfm, or
tokens/tokens_xptr object).
nfeat()
returns an integer count of the number of features. It is
an alias for ntype()
for a dfm. This function is only defined for dfm
objects because only these have "features".
See Also
Examples
# number of documents
ndoc(data_corpus_inaugural)
ndoc(corpus_subset(data_corpus_inaugural, Year > 1980))
ndoc(tokens(data_corpus_inaugural))
ndoc(dfm(tokens(corpus_subset(data_corpus_inaugural, Year > 1980))))
# number of features
toks1 <- tokens(corpus_subset(data_corpus_inaugural, Year > 1980), remove_punct = FALSE)
toks2 <- tokens(corpus_subset(data_corpus_inaugural, Year > 1980), remove_punct = TRUE)
nfeat(dfm(toks1))
nfeat(dfm(toks2))
Utility function to generate a nested list
Description
Utility function to generate a nested list
Usage
nest_dictionary(dict, depth)
Arguments
dict |
a flat dictionary |
depth |
depths of nested element |
Examples
lis <- list("A" = c("a", "aa", "aaa"), "B" = c("b", "bb"), "C" = c("c", "cc"), "D" = c("ddd"))
dict <- quanteda:::list2dictionary(lis)
quanteda:::nest_dictionary(dict, c(1, 1, 2, 2))
quanteda:::nest_dictionary(dict, c(1, 2, 1, 2))
Count the number of sentences
Description
Return the count of sentences in a corpus or character object.
Usage
nsentence(x)
Arguments
x |
a character or corpus whose sentences will be counted |
Value
count(s) of the total sentences per text
Note
nsentence()
is now deprecated for all usages except tokens objects that
have already been tokenised with tokens(x, what = "sentence")
. Using it
on character or corpus objects will now generate a warning.
nsentence()
relies on the boundary definitions in the stringi
package (see stri_opts_brkiter). It does not
count sentences correctly if the text has been transformed to lower case,
and for this reason nsentence()
will issue a warning if it detects all
lower-cased text.
Examples
# simple example
txt <- c(text1 = "This is a sentence: second part of first sentence.",
text2 = "A word. Repeated repeated.",
text3 = "Mr. Jones has a PhD from the LSE. Second sentence.")
tokens(txt, what = "sentence") |>
nsentence()
Count the number of tokens or types
Description
Get the count of tokens (total features) or types (unique tokens).
Usage
ntoken(x, ...)
ntype(x, ...)
Arguments
x |
|
... |
additional arguments passed to |
Value
ntoken()
returns a named integer vector of the counts of the total
tokens.
ntype()
returns a named integer vector of the counts of the types (unique
tokens) per document. For dfm objects, ntype()
will only return the
count of features that occur more than zero times in the dfm.
Examples
# simple example
txt <- c(text1 = "This is a sentence, this.", text2 = "A word. Repeated repeated.")
toks <- tokens(txt)
ntoken(toks)
ntype(toks)
ntoken(tokens_tolower(toks)) # same
ntype(tokens_tolower(toks)) # fewer types
# with some real texts
toks <- tokens(corpus_subset(data_corpus_inaugural, Year < 1806))
ntoken(tokens(toks, remove_punct = TRUE))
ntype(tokens(toks, remove_punct = TRUE))
ntoken(dfm(toks))
ntype(dfm(toks))
Object builders
Description
Functions to build or re-build core objects, or to upgrade earlier versions of these objects to the current format.
Usage
build_dfm(
x,
features,
docvars = data.frame(),
meta = list(),
class = NULL,
...
)
rebuild_dfm(x, attrs)
upgrade_dfm(x)
build_tokens(
x,
types,
padding = TRUE,
docvars = data.frame(),
meta = list(),
class = NULL,
...
)
rebuild_tokens(x, attrs)
upgrade_tokens(x)
build_corpus(x, docvars = data.frame(), meta = list(), class = NULL, ...)
rebuild_corpus(x, attrs)
upgrade_corpus(x)
build_dictionary2(x, meta = list(), class = "dictionary2", ...)
rebuild_dictionary2(x, attrs)
upgrade_dictionary2(x)
build_fcm(
x,
features1,
features2 = features1,
meta = list(),
class = "fcm",
...
)
rebuild_fcm(x, attrs)
upgrade_fcm(x)
Arguments
x |
an input corpus, tokens, dfm, fcm or dictionary object. |
features |
character for feature of resulting |
docvars |
data.frame for document level variables created by
|
meta |
list for meta fields |
class |
class labels to be attached to the object. |
... |
values saved in the object meta fields. They overwrite values
passed via |
attrs |
a list of attributes to be reassigned |
types |
character for types of the resulting |
padding |
logical indicating if the |
features1 |
character for row feature of resulting |
features2 |
character for column feature of resulting |
Examples
quanteda:::build_tokens(
list(c(1, 2, 3), c(4, 5, 6)),
docvars = quanteda:::make_docvars(n = 2L),
types = c("a", "b", "c", "d", "e", "f"),
padding = FALSE
)
quanteda:::build_corpus(
c("a b c", "d e f"),
docvars = quanteda:::make_docvars(n = 2L),
unit = "sentence"
)
Match quanteda objects against token types
Description
Developer function to match patterns in quanteda objects against token types.
Usage
object2id(
x,
types,
valuetype = c("glob", "fixed", "regex"),
case_insensitive = TRUE,
concatenator = "_",
levels = 1,
match_pattern = c("any", "single", "multi"),
keep_nomatch = FALSE
)
object2fixed(
x,
types,
valuetype = c("glob", "fixed", "regex"),
case_insensitive = TRUE,
concatenator = "_",
levels = 1,
match_pattern = c("any", "single", "multi"),
keep_nomatch = FALSE
)
Arguments
x |
a list of character vectors, dictionary or collocations object |
types |
token types against which patterns are matched |
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
concatenator |
the concatenation character that joins multi-word
expressions in |
levels |
integers specifying the levels of entries in a hierarchical
dictionary that will be applied. The top level is 1, and subsequent levels
describe lower nesting levels. Values may be combined, even if these
levels are not contiguous, e.g. |
match_pattern |
whether only single-word patterns or only multi-word patterns should be matched; if "any", both single-word and multi-word patterns are matched. |
keep_nomatch |
keep patterns that did not match |
Value
object2fixed()
returns a list of character vectors of matched
types. object2id()
returns a list of indices of matched types with
attributes. The "pattern" attribute records the indices of the matched patterns
in x
; the "key" attribute records the keys of the matched patterns when x
is
a dictionary.
See Also
Examples
types <- c("A", "AA", "B", "BB", "B_B", "C", "C-C")
# dictionary
dict <- dictionary(list(A = c("a", "aa"),
B = c("BB", "B B"),
C = c("C", "C-C")))
object2fixed(dict, types)
object2fixed(dict, types, match_pattern = "single")
object2fixed(dict, types, match_pattern = "multi")
# phrase
pats <- phrase(c("a", "aa", "zz", "bb", "b b"))
object2fixed(pats, types)
object2fixed(pats, types, keep_nomatch = TRUE)
Pattern for feature, token and keyword matching
Description
Pattern(s) for use in matching features, tokens, and keywords through a valuetype pattern.
Arguments
pattern |
a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
Details
The pattern
argument is a vector of patterns, including
sequences, to match in a target object, whose match type is specified by
valuetype. Note that an empty pattern (""
) will match
"padding" in a tokens object.
character
A character vector of token patterns to be selected or removed. Whitespace is not privileged, so that in a character vector, white space is interpreted literally. If you wish to consider whitespace-separated elements as sequences of tokens, wrap the argument in
phrase()
.list of character objects
If the list elements are character vectors of length 1, then this is equivalent to a vector of characters. If a list element contains a vector of characters longer than length 1, then matching will consider these as sequences of matches, equivalent to wrapping the argument in
phrase()
, except for matching to dfm features where this does not apply.dictionary
Values in dictionary are used as patterns, for literal matches. Multi-word values are automatically converted into phrases, so performing selection or compounding using a dictionary is the same as wrapping the dictionary in
phrase()
.collocations
Collocations objects created from
quanteda.textstats::textstat_collocations()
, which are treated as phrases automatically.
See Also
Examples
# these are interpreted literally
(patt1 <- c("president", "white house", "house of representatives"))
# as multi-word sequences
phrase(patt1)
# three single-word patterns
(patt2 <- c("president", "white_house", "house_of_representatives"))
phrase(patt2)
# this is equivalent to phrase(patt1)
(patt3 <- list(c("president"), c("white", "house"),
c("house", "of", "representatives")))
# glob expression can be used
phrase(patt4 <- c("president?", "white house", "house * representatives"))
# this is equivalent to phrase(patt4)
(patt5 <- list(c("president?"), c("white", "house"), c("house", "*", "representatives")))
# dictionary with multi-word matches
(dict1 <- dictionary(list(us = c("president", "white house", "house of representatives"))))
phrase(dict1)
Match patterns against token types
Description
Developer function to match regex, fixed or glob patterns against token types. This allows C++ functions to perform fast searches in tokens objects. C++ functions use a list of type IDs to construct a hash table, against which sub-vectors of tokens objects are matched. This function constructs an index of glob patterns for faster matching.
pattern2fixed
converts regex and glob patterns to fixed patterns.
Usage
pattern2id(
pattern,
types,
valuetype = c("glob", "fixed", "regex"),
case_insensitive = TRUE,
keep_nomatch = FALSE,
use_index = TRUE
)
pattern2fixed(
pattern,
types,
valuetype = c("glob", "fixed", "regex"),
case_insensitive = TRUE,
keep_nomatch = FALSE,
use_index = TRUE
)
Arguments
pattern |
a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
types |
token types against which patterns are matched |
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
keep_nomatch |
keep patterns that did not match |
use_index |
construct index of types for quick search |
Value
a list of integer vectors containing indices of matched types
pattern2fixed
returns a list of character vectors containing
types
Examples
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
pats_regex <- list(c("^a$", "^b"), c("c"), c("d"))
pattern2id(pats_regex, types, "regex", case_insensitive = TRUE)
pats_glob <- list(c("a*", "b*"), c("c"), c("d"))
pattern2id(pats_glob, types, "glob", case_insensitive = TRUE)
pattern <- list(c("^a$", "^b"), c("c"), c("d"))
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
pattern2fixed(pattern, types, "regex", case_insensitive = TRUE)
Declare a pattern to be a sequence of separate patterns
Description
Declares that a character expression consists of multiple patterns, separated
by an element such as whitespace. This is typically used as a wrapper around
pattern()
to make it explicit that the pattern elements are to be used for
matches to multi-word sequences, rather than individual, unordered matches to
single words.
Usage
phrase(x, separator = " ")
as.phrase(x)
is.phrase(x)
Arguments
x |
character, dictionary, list, collocations, or tokens object; the
compound patterns to be treated as a sequence separated by |
separator |
character; the character in between the patterns. This
defaults to " ". For |
Value
phrase()
and as.phrase()
return a specially classed list whose
elements have been split into separate character
(pattern) elements.
is.phrase
returns TRUE
if the object was created by
phrase()
; FALSE
otherwise.
See Also
Examples
# make phrases from characters
phrase(c("natural language processing"))
phrase(c("natural_language_processing", "text_analysis"), separator = "_")
# from a dictionary
phrase(dictionary(list(catone = c("a b"), cattwo = "c d e", catthree = "f")))
# from a list
as.phrase(list(c("natural", "language", "processing")))
# from tokens
as.phrase(tokens("natural language processing"))
Print methods for quanteda core objects
Description
Print method for quanteda objects. In each max_n*
option, 0 shows none, and
-1 shows all.
Usage
## S3 method for class 'corpus'
print(
x,
max_ndoc = quanteda_options("print_corpus_max_ndoc"),
max_nchar = quanteda_options("print_corpus_max_nchar"),
show_summary = quanteda_options("print_corpus_summary"),
...
)
## S4 method for signature 'dfm'
print(
x,
max_ndoc = quanteda_options("print_dfm_max_ndoc"),
max_nfeat = quanteda_options("print_dfm_max_nfeat"),
show_summary = quanteda_options("print_dfm_summary"),
...
)
## S4 method for signature 'dictionary2'
print(
x,
max_nkey = quanteda_options("print_dictionary_max_nkey"),
max_nval = quanteda_options("print_dictionary_max_nval"),
show_summary = quanteda_options("print_dictionary_summary"),
...
)
## S4 method for signature 'fcm'
print(
x,
max_nfeat = quanteda_options("print_dfm_max_nfeat"),
show_summary = TRUE,
...
)
## S3 method for class 'kwic'
print(
x,
max_nrow = quanteda_options("print_kwic_max_nrow"),
show_summary = quanteda_options("print_kwic_summary"),
...
)
## S3 method for class 'tokens'
print(
x,
max_ndoc = quanteda_options("print_tokens_max_ndoc"),
max_ntoken = quanteda_options("print_tokens_max_ntoken"),
show_summary = quanteda_options("print_tokens_summary"),
...
)
Arguments
x |
the object to be printed |
max_ndoc |
max number of documents to print; default is from the
|
max_nchar |
max number of characters to print; default is from the
|
show_summary |
print a brief summary indicating the number of documents and other characteristics of the object, such as docvars or sparsity. |
... |
passed to |
max_nfeat |
max number of features to print; default is from the
|
max_nkey |
max number of keys to print; default is from the
|
max_nval |
max number of values to print; default is from the
|
max_nrow |
max number of rows to print; default is from the
|
max_ntoken |
max number of tokens to print; default is from the
|
See Also
Examples
corp <- corpus(data_char_ukimmig2010)
print(corp, max_ndoc = 3, max_nchar = 40)
toks <- tokens(corp)
print(toks, max_ndoc = 3, max_ntoken = 6)
dfmat <- dfm(toks)
print(dfmat, max_ndoc = 3, max_nfeat = 10)
Print a phrase object
Description
Prints a phrase object in a way that looks like a standard list.
Usage
## S3 method for class 'phrases'
print(x, ...)
Arguments
x |
a phrases (constructed by |
... |
further arguments passed to or from other methods |
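For example (a minimal illustration):
print(phrase(c("natural language processing", "text analysis")))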
Get or set package options for quanteda
Description
Get or set global options affecting functions across quanteda.
Usage
quanteda_options(..., reset = FALSE, initialize = FALSE)
Arguments
... |
options to be set, as key-value pair, same as
|
reset |
logical; if |
initialize |
logical; if |
Details
Currently available options are:
verbose
logical; if TRUE then use this as the default for all functions with a verbose argument
threads
integer; specifies the number of threads to use in parallelized functions; defaults to the maximum number of threads
print_dfm_max_ndoc, print_corpus_max_ndoc, print_tokens_max_ndoc
integer; specify the number of documents to display when using the defaults for printing a dfm, corpus, or tokens object
print_dfm_max_nfeat, print_corpus_max_nchar, print_tokens_max_ntoken
integer; specifies the number of features to display when printing a dfm, the number of characters to display when printing corpus documents, or the number of tokens to display when printing tokens objects
print_dfm_summary
integer; specifies the number of documents to display when using the defaults for printing a dfm
print_dictionary_max_nkey, print_dictionary_max_nval
the number of keys or values (respectively) to display when printing a dictionary
print_kwic_max_nrow
the number of rows to display when printing a kwic object
base_docname
character; stem name for documents that are unnamed when a corpus, tokens, or dfm are created, or when a dfm is converted from another object
base_featname
character; stem name for features that are unnamed when they are added, for whatever reason, to a dfm through an operation that adds features
base_compname
character; stem name for components that are created by matrix factorization
language_stemmer
character; language option for char_wordstem(), tokens_wordstem(), and dfm_wordstem()
pattern_hashtag, pattern_username
character; regex patterns for (social media) hashtags and usernames respectively, used to avoid segmenting these in the default internal "word" tokenizer
tokens_block_size
integer; specifies the number of documents to be tokenized at a time in blocked tokenization. When the number is large, tokenization becomes faster but also memory-intensive.
tokens_locale
character; specify locale in stringi boundary detection in tokenization and corpus reshaping. See stringi::stri_opts_brkiter().
tokens_tokenizer_word
character; the current word tokenizer version used as a default for what = "word" in tokens(), one of "word1", "word2", "word3" (same as "word2"), or "word4".
Value
When called using a key = value pair (where key can be a label or quoted character name), the option is set and TRUE is returned invisibly.
When called with no arguments, a named list of the package options is returned.
When called with reset = TRUE as an argument, all options are reset to their default values, and TRUE is returned invisibly.
Examples
(opt <- quanteda_options())
quanteda_options(verbose = TRUE)
quanteda_options("verbose" = FALSE)
quanteda_options("threads")
quanteda_options(print_dfm_max_ndoc = 50L)
# reset to defaults
quanteda_options(reset = TRUE)
# reset to saved options
quanteda_options(opt)
Internal functions to import dictionary files
Description
Internal functions to import dictionary files in a variety of formats
read_dict_lexicoder
imports Lexicoder files in the .lc3
format.
read_dict_wordstat
imports WordStat files in the
.cat
format.
read_dict_liwc
imports LIWC dictionary files in the
.dic
format.
read_dict_yoshikoder
imports Yoshikoder files in the
.ykd
format.
Usage
read_dict_lexicoder(path)
read_dict_wordstat(path, encoding = "utf-8")
read_dict_liwc(path, encoding = "utf-8")
read_dict_yoshikoder(path)
Arguments
path |
the full path and filename of the dictionary file to be read |
encoding |
the encoding of the file to be imported |
Value
a quanteda dictionary object
Examples
dict <- quanteda:::read_dict_lexicoder(
system.file("extdata", "LSD2015.lc3", package = "quanteda")
)
## Not run:
dict <- quanteda:::read_dict_wordstat(system.file("extdata", "RID.cat", package = "quanteda"))
# dict <- read_dict_wordstat("/home/kohei/Documents/Dictionary/LaverGarry.txt", "utf-8")
# dict <- read_dict_wordstat("/home/kohei/Documents/Dictionary/Wordstat/ROGET.cat", "utf-8")
# dict <- read_dict_wordstat("/home/kohei/Documents/Dictionary/Wordstat/WordStat Sentiments.cat",
# encoding = "iso-8859-1")
## End(Not run)
dict <- quanteda:::read_dict_liwc(
system.file("extdata", "moral_foundations_dictionary.dic", package = "quanteda")
)
dict <- quanteda:::read_dict_yoshikoder(system.file("extdata", "laver_garry.ykd",
package = "quanteda"))
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- stopwords
Stopwords
Stopword lists were formerly built into quanteda, but have been moved to the
stopwords package. See stopwords::stopwords()
.
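For example (assuming an English stopword list from the stopwords package, which quanteda imports):
head(stopwords("en"))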
Utility function to remove empty keys
Description
Utility function to remove empty keys
Usage
remove_empty_keys(dict)
Arguments
dict |
a flat or hierarchical dictionary |
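A minimal sketch (hypothetical input, using the plain list representation that the internal dictionary helpers on this page operate on):
dict <- list(KEY1 = list("a", "b"), KEY2 = list())
quanteda:::remove_empty_keys(dict)  # intended to drop KEY2, which has no values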
Internal function to replace dictionary values
Description
Internal function to replace dictionary values
Usage
replace_dictionary_values(dict, from, to)
Arguments
dict |
a dictionary object |
Examples
dict <- list(KEY1 = list(SUBKEY1 = list("A_B"),
SUBKEY2 = list("C_D")),
KEY2 = list(SUBKEY3 = list("E_F"),
SUBKEY4 = list("G_F_I")),
KEY3 = list(SUBKEY5 = list(SUBKEY7 = list("J_K")),
SUBKEY6 = list(SUBKEY8 = list("L"))))
quanteda:::replace_dictionary_values(dict, "_", " ")
Sample a vector
Description
Return a sample from a vector within a grouping variable if specified.
Usage
resample(x, size = NULL, replace = FALSE, prob = NULL, by = NULL)
Arguments
x |
numeric vector |
size |
the number of items to sample within each group, as a positive
number or a vector of numbers equal in length to the number of groups. If
|
replace |
if |
prob |
a vector of probability weights for values in |
by |
a grouping vector equal in length to |
Value
x
resampled within groups
Examples
set.seed(100)
grvec <- c(rep("a", 3), rep("b", 4), rep("c", 3))
quanteda:::resample(1:10, replace = FALSE, by = grvec)
quanteda:::resample(1:10, replace = TRUE, by = grvec)
quanteda:::resample(1:10, size = 2, replace = TRUE, by = grvec)
quanteda:::resample(1:10, size = c(1, 1, 3), replace = TRUE, by = grvec)
Internal function to subset or duplicate docvar rows
Description
Internal function to subset or duplicate docvar rows
Usage
reshape_docvars(x, i = NULL, unique = FALSE, drop_docid = TRUE)
Arguments
x |
docvar data.frame |
i |
numeric or logical vector for subsetting/duplicating rows |
unique |
if |
drop_docid |
if |
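A minimal sketch (assuming the internal docvar data.frame is read from the corpus attribute where quanteda stores it):
corp <- corpus(c(d1 = "one", d2 = "two", d3 = "three"))
dv <- attr(corp, "docvars")                 # internal docvar data.frame
quanteda:::reshape_docvars(dv, c(1, 1, 3))  # duplicate row 1, keep row 3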
Select types without performing slow regex search
Description
This is an internal function for pattern2id
that selects types using
keys in an index when available.
index_types
is an internal function for pattern2id
that
constructs an index of "glob" or "fixed" patterns to avoid expensive
sequential search.
Usage
search_glob(pattern, types_search, case_insensitive, index = NULL)
search_glob_multi(patterns, types_search, case_insensitive, index)
search_regex(pattern, types_search, case_insensitive)
search_regex_multi(patterns, types_search, case_insensitive)
search_fixed(pattern, types_search, index = NULL)
search_fixed_multi(patterns, types_search, index)
index_types(
pattern,
types,
valuetype = c("glob", "fixed", "regex"),
case_insensitive = TRUE
)
Arguments
pattern |
a "glob", "fixed" or "regex" pattern |
types_search |
lowercased types when |
case_insensitive |
logical; if |
index |
index object created by |
patterns |
a list of "glob", "fixed" or "regex" patterns |
valuetype |
the type of pattern matching: |
Value
index_types
returns a list of integer vectors containing type
IDs with index keys as an attribute
Examples
index <- quanteda:::index_types("yy*", c("xxx", "yyyy", "ZZZ"), "glob", FALSE)
quanteda:::search_glob("yy*", attr(index, "types_search"), case_insensitive = FALSE, index = index)
Internal function for select_types
to search the index using
fastmatch.
Description
Internal function for select_types
to search the index using
fastmatch.
Usage
search_index(pattern, index)
Arguments
index |
an index object created by |
See Also
Function to serialize list-of-character tokens
Description
Creates a serialized object of tokens, called by tokens()
.
Usage
serialize_tokens(x, types_reserved = NULL, ...)
Arguments
x |
a list of character vectors |
types_reserved |
optional pre-existing types for mapping of tokens |
... |
additional arguments |
Value
a list of the serialized tokens found in each text
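A minimal sketch (hypothetical input from an external tokenizer):
lis <- list(d1 = c("a", "b", "a"), d2 = c("b", "c"))
toks <- quanteda:::serialize_tokens(lis)
unclass(toks)        # integer codes per document
attr(toks, "types")  # the mapped types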
Internal functions to set dimnames
Description
Default dimnames()
converts a zero-length character vector to NULL,
leading to the improper functioning of subsetting functions. These are safer
methods to set the dimnames of a dfm or fcm object.
Usage
set_dfm_dimnames(x) <- value
set_dfm_docnames(x) <- value
set_dfm_featnames(x) <- value
set_fcm_dimnames(x) <- value
set_fcm_featnames(x) <- value
Arguments
x |
|
value |
a character vector for docnames or featnames, or a list of them for dimnames |
Examples
dfmat <- dfm(tokens(c("a a b b c", "b b b c")))
quanteda:::set_dfm_featnames(dfmat) <- paste0("feature", 1:3)
quanteda:::set_dfm_docnames(dfmat) <- paste0("DOC", 1:2)
quanteda:::set_dfm_dimnames(dfmat) <- list(c("docA", "docB"), LETTERS[1:3])
Extensions for and from spacy_parse objects
Description
These functions provide quanteda methods for spacyr objects, and also extend spacy_parse and spacy_tokenize to work directly with corpus objects.
Arguments
x |
an object returned by |
... |
not used for these functions |
Details
spacy_parse(x, ...)
and spacy_tokenize(x, ...)
work directly on
quanteda corpus objects.
docnames(x)
returns the document names
ndoc(x)
returns the number of documents
ntoken(x, ...)
returns the number of tokens by document
ntype(x, ...)
returns the number of types (unique tokens) by document
nsentence(x)
returns the number of sentences by document
Examples
## Not run:
library("spacyr")
spacy_initialize()
corp <- corpus(c(doc1 = "And now, now, now for something completely different.",
doc2 = "Jack and Jill are children."))
spacy_tokenize(corp)
(parsed <- spacy_parse(corp))
ntype(parsed)
ntoken(parsed)
ndoc(parsed)
docnames(parsed)
## End(Not run)
Compute the sparsity of a document-feature matrix
Description
Return the proportion of sparseness of a document-feature matrix, equal to the proportion of cells that have zero counts.
Usage
sparsity(x)
Arguments
x |
the document-feature matrix |
Examples
dfmat <- dfm(tokens(data_corpus_inaugural))
sparsity(dfmat)
sparsity(dfm_trim(dfmat, min_termfreq = 5))
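The value can also be checked by hand, since sparsity is simply the share of zero cells (an illustrative check, not part of the original examples):
dfmat2 <- dfm(tokens(c(d1 = "a b", d2 = "b c")))
sparsity(dfmat2)
sum(dfmat2 == 0) / (ndoc(dfmat2) * nfeat(dfmat2))  # same value, computed manually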
Internal function for special handling of multi-word dictionary values
Description
Internal function for special handling of multi-word dictionary values
Usage
split_values(dict, concatenator_dictionary, concatenator_tokens)
Arguments
dict |
a flattened dictionary |
concatenator_dictionary |
concatenator from a dictionary object |
concatenator_tokens |
concatenator from a tokens object |
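A minimal sketch (hypothetical input, following the list representation used by the other internal dictionary helpers on this page):
dict <- list(KEY1 = list("a_b", "c"))
quanteda:::split_values(dict, "_", " ")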
Summarize a corpus
Description
Displays information about a corpus, including attributes and metadata such as the number of texts, date of creation, and source.
Usage
## S3 method for class 'corpus'
summary(object, n = 100, tolower = FALSE, showmeta = TRUE, ...)
Arguments
object |
corpus to be summarized |
n |
maximum number of texts to describe, default=100 |
tolower |
convert texts to lower case before counting types |
showmeta |
set to |
... |
additional arguments passed through to |
Examples
summary(data_corpus_inaugural)
summary(data_corpus_inaugural, n = 10)
corp <- corpus(data_char_ukimmig2010,
docvars = data.frame(party=names(data_char_ukimmig2010)))
summary(corp, showmeta = TRUE) # show the meta-data
sumcorp <- summary(corp) # (quietly) assign the results
sumcorp$Types / sumcorp$Tokens # crude type-token ratio
Functions to add or retrieve corpus summary metadata
Description
Functions to add or retrieve corpus summary metadata
Usage
add_summary_metadata(x, extended = FALSE, ...)
get_summary_metadata(x, ...)
summarize_texts_extended(x, stop_words = stopwords("en"), n = 100)
Arguments
x |
corpus object |
... |
additional arguments passed to |
Details
This is provided so that a corpus object can be stored with
summary information to avoid having to compute this every time
summary.corpus()
is called.
So in future calls, if !is.null(meta(x, "summary", type = "system")) && !length(list(...))
,
then summary.corpus()
will simply return get_system_meta()
rather than
compute the summary statistics on the fly, which requires tokenizing the
text.
Value
add_summary_metadata()
returns a corpus with summary metadata added
as a data.frame, with the top-level list element named "summary".
get_summary_metadata()
returns the summary metadata as a data.frame.
summarize_texts_extended()
returns extended summary information.
Examples
corp <- corpus(data_char_ukimmig2010)
corp <- quanteda:::add_summary_metadata(corp)
quanteda:::get_summary_metadata(corp)
## using extended summary
## Not run:
extended_data <- quanteda:::summarize_texts_extended(data_corpus_inaugural)
topfeatures(extended_data$top_dfm)
## End(Not run)
Models for scaling and classification of textual data
Description
The textmodel_*()
functions formerly in quanteda have now been moved
to the quanteda.textmodels package.
See Also
quanteda.textmodels::quanteda.textmodels-package
Plots for textual data
Description
The textplot_*()
functions formerly in quanteda have now been moved
to the quanteda.textplots package.
See Also
quanteda.textplots::quanteda.textplots-package
Get or assign corpus texts [deprecated]
Description
This function has been made defunct and replaced.
Use
as.character.corpus()
to turn a corpus into a simple named character vector.
Use
corpus_group()
instead of texts(x, groups = ...)
to aggregate texts by a grouping variable.
Use
[<-
instead of texts()<-
for replacing texts in a corpus object.
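For illustration, a minimal sketch of these replacements (not part of the original documentation):
corp <- corpus(c(d1 = "a b c", d2 = "d e f"))
as.character(corp)                    # instead of texts(corp)
corpus_group(corp, groups = c(1, 1))  # instead of texts(corp, groups = ...)
corp[1] <- "x y z"                    # instead of texts(corp)[1] <- "x y z"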
Usage
texts(x, groups = NULL, spacer = " ")
texts(x) <- value
Arguments
x |
a corpus |
groups |
grouping variable for sampling, equal in length to the number
of documents. This will be evaluated in the docvars data.frame, so that
docvars may be referred to by name without quoting. This also changes
previous behaviours for |
spacer |
when concatenating texts by using |
value |
character vector of the new texts |
Details
Get or replace the texts in a corpus, with grouping options.
Works for plain character vectors too, if groups
is a factor.
Value
For texts
, a character vector of the texts in the corpus.
For texts <-
, the corpus with the texts replaced by value
.
Note
The groups
will be used for concatenating the texts based on shared
values of groups
, without any specified order of aggregation.
You are strongly encouraged as a good practice of text analysis
workflow not to modify the substance of the texts in a corpus.
Rather, this sort of processing is better performed through downstream
operations. For instance, do not lowercase the texts in a corpus, or you
will never be able to recover the original case. Rather, apply
tokens_tolower()
after applying tokens()
to a
corpus, or use the option tolower = TRUE
in dfm()
.
Statistics for textual data
Description
The textstat_*()
functions formerly in quanteda have now been moved
to the quanteda.textstats package.
See Also
quanteda.textstats::quanteda.textstats-package
Customizable tokenizer
Description
Allows users to tokenize texts using customized boundary rules. See the ICU website for how to define boundary rules.
Tools for custom word and sentence breakrules, to retrieve, set, or reset them to package defaults.
Usage
tokenize_custom(x, rules)
breakrules_get(what = c("word", "sentence"))
breakrules_set(x, what = c("word", "sentence"))
breakrules_reset(what = c("word", "sentence"))
Arguments
x |
character vector for texts to tokenize |
rules |
a list of rules for rule-based boundary detection |
what |
character; which set of rules to return, one of |
Details
The package contains internal sets of rules for word and sentence
breaks, which are lists
of rules for word and sentence boundary detection. base
is copied from
the ICU library. Other rules are created by the package maintainers in
system.file("breakrules/breakrules_custom.yml")
.
This function allows modification of those rules, and applies them as a new tokenizer.
Custom word rules:
base
ICU's rules for detecting word/sentence boundaries
keep_hyphens
quanteda's rule for preserving hyphens
keep_url
quanteda's rule for preserving URLs
keep_email
quanteda's rule for preserving emails
keep_tags
quanteda's rule for preserving tags
split_elisions
quanteda's rule for splitting elisions
split_tags
quanteda's rule for splitting tags
Value
tokenize_custom()
returns a list of characters containing tokens.
breakrules_get()
returns the existing break rules as a list.
breakrules_set()
returns nothing but reassigns the global
breakrules to x
.
breakrules_reset()
returns nothing but reassigns the global
breakrules to the system defaults. These rules are defined in
system.file("breakrules/")
.
Source
https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/brkitr/rules/word.txt
https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/brkitr/rules/sent.txt
Examples
lis <- tokenize_custom("a well-known http://example.com", rules = breakrules_get("word"))
tokens(lis, remove_separators = TRUE)
breakrules_get("word")
breakrules_get("sentence")
brw <- breakrules_get("word")
brw$keep_email <- "@[a-zA-Z0-9_]+"
breakrules_set(brw, what = "word")
breakrules_reset("sentence")
breakrules_reset("word")
quanteda tokenizers
Description
Internal methods for tokenization providing default and legacy methods for text segmentation.
Usage
tokenize_word2(
x,
split_hyphens = FALSE,
verbose = quanteda_options("verbose"),
...
)
tokenize_word3(
x,
split_hyphens = FALSE,
verbose = quanteda_options("verbose"),
...
)
tokenize_word4(
x,
split_hyphens = FALSE,
split_tags = FALSE,
split_elisions = FALSE,
verbose = quanteda_options("verbose"),
...
)
tokenize_word1(
x,
split_hyphens = FALSE,
verbose = quanteda_options("verbose"),
...
)
tokenize_character(x, ...)
tokenize_sentence(x, verbose = FALSE, ...)
tokenize_fasterword(x, ...)
tokenize_fastestword(x, ...)
Arguments
x |
(named) character; input texts |
split_hyphens |
logical; if |
verbose |
if |
... |
used to pass arguments among the functions |
split_tags |
logical; if |
Details
Each of the word tokenizers corresponds to a major version of quanteda,
kept here for backward compatibility and comparison. tokenize_word3()
is
identical to tokenize_word2()
.
Value
a list of characters corresponding to the (most conservative)
tokenization, including whitespace where applicable; except for
tokenize_word1()
, which is a special tokenizer for Internet language that
includes URLs, #hashtags, @usernames, and email addresses.
Examples
## Not run:
txt <- c(doc1 = "Tweet https://quanteda.io using @quantedainit and #rstats.",
doc2 = "The £1,000,000 question.",
doc4 = "Line 1.\nLine2\n\nLine3.",
doc5 = "?",
doc6 = "Self-aware machines! \U0001f600",
doc7 = "Qu'est-ce que c'est?")
tokenize_word2(txt)
tokenize_word2(txt, split_hyphens = FALSE)
tokenize_word1(txt, split_hyphens = FALSE)
tokenize_word4(txt, split_hyphens = FALSE, split_elisions = TRUE)
tokenize_fasterword(txt)
tokenize_fastestword(txt)
tokenize_sentence(txt)
tokenize_character(txt[2])
## End(Not run)
Construct a tokens object
Description
Construct a tokens object, either by importing a named list of characters from an external tokenizer, or by calling the internal quanteda tokenizer.
tokens()
can also be applied to tokens class objects, which
means that the removal rules can be applied post-tokenization, although it
should be noted that it will not be possible to remove things that are not
present. For instance, if the tokens
object has already had punctuation
removed, then tokens(x, remove_punct = TRUE)
will have no additional
effect.
Usage
tokens(
x,
what = "word",
remove_punct = FALSE,
remove_symbols = FALSE,
remove_numbers = FALSE,
remove_url = FALSE,
remove_separators = TRUE,
split_hyphens = FALSE,
split_tags = FALSE,
include_docvars = TRUE,
padding = FALSE,
concatenator = "_",
verbose = quanteda_options("verbose"),
...,
xptr = FALSE
)
Arguments
x |
the input object to the tokens constructor; a tokens, corpus or character object to tokenize. |
what |
character; which tokenizer to use. The default |
remove_punct |
logical; if |
remove_symbols |
logical; if |
remove_numbers |
logical; if |
remove_url |
logical; if |
remove_separators |
logical; if |
split_hyphens |
logical; if |
split_tags |
logical; if |
include_docvars |
if |
padding |
if |
concatenator |
character; the concatenation character that will connect the tokens making up a multi-token sequence. |
verbose |
if |
... |
used to pass arguments among the functions |
xptr |
if |
Value
quanteda tokens
class object, by default a serialized list of
integers corresponding to a vector of types.
Details
As of version 2, the choice of tokenizer is left more to
the user, and tokens()
is treated more as a constructor (from a named
list) than a tokenizer. This allows users to use any other tokenizer that
returns a named list, and to use this as an input to tokens()
, with
removal and splitting rules applied after this has been constructed (passed
as arguments). These removal and splitting rules are conservative and will
not remove or split anything, however, unless the user requests it.
You usually do not want to split hyphenated words or social media tags, but
extra steps are required to preserve such special tokens. If there are many
random characters in your texts, you should set split_hyphens = TRUE
and
split_tags = TRUE
to avoid a slowdown in tokenization.
Using external tokenizers is best done by piping the output from these
other tokenizers into the tokens()
constructor, with additional removal
and splitting options applied at the construction stage. These will only
have an effect, however, if the tokens exist for which removal is specified
in the tokens()
call. For instance, it is impossible to remove
punctuation if the input list to tokens()
already had its punctuation
tokens removed at the external tokenization stage.
To construct a tokens object from a list with no additional processing,
call as.tokens()
instead of tokens()
.
Recommended tokenizers are those from the tokenizers package, which are generally faster than the default (built-in) tokenizer but always split infix hyphens, or spacyr. The default tokenizer in quanteda is very smart, however, and if you do not have special requirements, it works extremely well for most languages as well as text from social media (including hashtags and usernames).
quanteda Tokenizers
The default word tokenizer what = "word"
is
updated in major version 4. It is even smarter than the v2 and v3
versions, with additional options for customization. See
tokenize_word4()
for full details.
The default tokenizer splits tokens using stri_split_boundaries(x, type = "word") but by default preserves infix hyphens (e.g. "self-funding"), URLs, and social media "tag" characters (#hashtags and @usernames), and email addresses. The rules defining a valid "tag" can be found at https://www.hashtags.org/featured/what-characters-can-a-hashtag-include/ for hashtags and at https://help.twitter.com/en/managing-your-account/twitter-username-rules for usernames.
For backward compatibility, the following older tokenizers are also
supported through what
:
"word1"
(legacy) implements similar behaviour to the version of
what = "word"
found in pre-version 2. (It preserves social media tags and infix hyphens, but splits URLs.) "word1" is also slower than "word2" and "word4". In "word1", the argument remove_twitter
controlled whether social media tags were preserved or removed, even when remove_punct = TRUE
. This argument is no longer functional in versions >= 2, but equivalent control can be had using the split_tags
argument and selective token removals.
"word2", "word3"
(legacy) implements similar behaviour to the versions of "word" found in quanteda versions 2 and 3.
"fasterword"
(legacy) splits on whitespace and control characters, using
stringi::stri_split_charclass(x, "[\\p{Z}\\p{C}]+")
"fastestword"
(legacy) splits on the space character, using
stringi::stri_split_fixed(x, " ")
"character"
tokenization into individual characters
"sentence"
sentence segmenter based on stri_split_boundaries, but with additional rules to avoid splits on words like "Mr." that would otherwise incorrectly be detected as sentence boundaries. For better sentence tokenization, consider using spacyr.
See Also
tokens_ngrams()
, tokens_skipgrams()
, tokens_compound()
,
tokens_lookup()
, concat()
, as.list.tokens()
, as.tokens()
Examples
txt <- c(doc1 = "A sentence, showing how tokens() works.",
doc2 = "@quantedainit and #textanalysis https://example.com?p=123.",
doc3 = "Self-documenting code??",
doc4 = "£1,000,000 for 50¢ is gr8 4ever \U0001f600")
tokens(txt)
tokens(txt, what = "word1")
# removing punctuation marks but keeping tags and URLs
tokens(txt[1:2], remove_punct = TRUE)
# splitting hyphenated words
tokens(txt[3])
tokens(txt[3], split_hyphens = TRUE)
# symbols and numbers
tokens(txt[4])
tokens(txt[4], remove_numbers = TRUE)
tokens(txt[4], remove_numbers = TRUE, remove_symbols = TRUE)
## Not run: # using other tokenizers
tokens(tokenizers::tokenize_words(txt[4]), remove_symbols = TRUE)
tokenizers::tokenize_words(txt, lowercase = FALSE, strip_punct = FALSE) |>
tokens(remove_symbols = TRUE)
tokenizers::tokenize_characters(txt[3], strip_non_alphanum = FALSE) |>
tokens(remove_punct = TRUE)
tokenizers::tokenize_sentences(
"The quick brown fox. It jumped over the lazy dog.") |>
tokens()
## End(Not run)
Base method extensions for tokens objects
Description
Extensions of base R functions for tokens objects.
Usage
## S3 method for class 'tokens'
unlist(x, recursive = FALSE, use.names = TRUE)
## S3 method for class 'tokens'
x[i, drop_docid = TRUE]
## S3 method for class 'tokens'
t1 + t2
## S3 method for class 'tokens_xptr'
c(...)
## S3 method for class 'tokens'
c(...)
Arguments
x |
a tokens object |
recursive |
a required argument for unlist but inapplicable to tokens objects. |
i |
document names or indices for documents to extract. |
drop_docid |
if |
t1 |
tokens one to be added |
t2 |
tokens two to be added |
Value
unlist
returns a simple vector of characters from a
tokens object.
c(...)
and +
return a tokens object whose documents
have been added as a single sequence of documents.
Examples
toks <- tokens(c(d1 = "one two three", d2 = "four five six", d3 = "seven eight"))
str(toks)
toks[c(1,3)]
# combining tokens
toks1 <- tokens(c(doc1 = "a b c d e", doc2 = "f g h"))
toks2 <- tokens(c(doc3 = "1 2 3"))
toks1 + toks2
c(toks1, toks2)
Segment tokens object by chunks of a given size
Description
Segment tokens into new documents of equally sized token lengths, with the possibility of overlapping the chunks.
Usage
tokens_chunk(
x,
size,
overlap = 0,
use_docvars = TRUE,
verbose = quanteda_options("verbose")
)
Arguments
x |
tokens object whose token elements will be segmented into chunks |
size |
integer; the token length of the chunks |
overlap |
integer; the number of tokens in a chunk to be taken from the
last |
use_docvars |
if |
verbose |
if |
Value
A tokens object whose documents have been split into chunks of
length size
.
See Also
Examples
txts <- c(doc1 = "Fellow citizens, I am again called upon by the voice of
my country to execute the functions of its Chief Magistrate.",
doc2 = "When the occasion proper for it shall arrive, I shall
endeavor to express the high sense I entertain of this
distinguished honor.")
toks <- tokens(txts)
tokens_chunk(toks, size = 5)
tokens_chunk(toks, size = 5, overlap = 4)
Convert token sequences into compound tokens
Description
Replace multi-token sequences with a multi-word, or "compound" token. The
resulting compound tokens will represent a phrase or multi-word expression,
concatenated with concatenator
(by default, the "_
" character) to form a
single "token". This ensures that the sequences will be processed
subsequently as single tokens, for instance in constructing a dfm.
Usage
tokens_compound(
x,
pattern,
valuetype = c("glob", "regex", "fixed"),
concatenator = concat(x),
window = 0L,
case_insensitive = TRUE,
join = TRUE,
keep_unigrams = FALSE,
apply_if = NULL,
verbose = quanteda_options("verbose")
)
Arguments
x |
an input tokens object |
pattern |
a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
valuetype |
the type of pattern matching: |
concatenator |
character; the concatenation character that will connect the tokens making up a multi-token sequence. |
window |
integer; a vector of length 1 or 2 that specifies size of the
window of tokens adjacent to |
case_insensitive |
logical; if |
join |
logical; if |
keep_unigrams |
if |
apply_if |
logical vector of length |
verbose |
if |
Value
A tokens object in which the token sequences matching pattern
have been replaced by new compounded "tokens" joined by the concatenator.
Note
Patterns to be compounded (naturally) consist of multi-word sequences,
and how these are expected in pattern
is very specific. If the elements
to be compounded are supplied as space-delimited elements of a character
vector, wrap the vector in phrase()
. If the elements to be compounded
are separate elements of a character vector, supply it as a list where each
list element is the sequence of character elements.
See the examples below.
Examples
txt <- "The United Kingdom is leaving the European Union."
toks <- tokens(txt, remove_punct = TRUE)
# character vector - not compounded
tokens_compound(toks, c("United", "Kingdom", "European", "Union"))
# elements separated by spaces - not compounded
tokens_compound(toks, c("United Kingdom", "European Union"))
# list of characters - is compounded
tokens_compound(toks, list(c("United", "Kingdom"), c("European", "Union")))
# elements separated by spaces, wrapped in phrase() - is compounded
tokens_compound(toks, phrase(c("United Kingdom", "European Union")))
# supplied as values in a dictionary (same as list) - is compounded
# (keys do not matter)
tokens_compound(toks, dictionary(list(key1 = "United Kingdom",
key2 = "European Union")))
# pattern as dictionaries with glob matches
tokens_compound(toks, dictionary(list(key1 = c("U* K*"))), valuetype = "glob")
# note the differences caused by join = FALSE
compounds <- list(c("the", "European"), c("European", "Union"))
tokens_compound(toks, pattern = compounds, join = TRUE)
tokens_compound(toks, pattern = compounds, join = FALSE)
# use window to form ngrams
tokens_remove(toks, pattern = stopwords("en")) |>
tokens_compound(pattern = "leav*", join = FALSE, window = c(0, 3))
Combine documents in a tokens object by a grouping variable
Description
Combine documents in a tokens object by a grouping variable, by concatenating the tokens in the order of the documents within each grouping variable.
Usage
tokens_group(
x,
groups = docid(x),
fill = FALSE,
env = NULL,
verbose = quanteda_options("verbose")
)
Arguments
x |
tokens object |
groups |
grouping variable for sampling, equal in length to the number
of documents. This will be evaluated in the docvars data.frame, so that
docvars may be referred to by name without quoting. This also changes
previous behaviours for |
fill |
logical; if |
env |
an environment or a list object in which |
verbose |
if |
Value
a tokens object whose documents are equal to the unique group combinations, and whose tokens are the concatenations of the tokens by group. Document-level variables that have no variation within groups are saved in docvars. Document-level variables that are lists are dropped from grouping, even when these exhibit no variation within groups.
Examples
corp <- corpus(c("a a b", "a b c c", "a c d d", "a c c d"),
docvars = data.frame(grp = c("grp1", "grp1", "grp2", "grp2")))
toks <- tokens(corp)
tokens_group(toks, groups = grp)
tokens_group(toks, groups = c(1, 1, 2, 2))
# with fill
tokens_group(toks, groups = factor(c(1, 1, 2, 2), levels = 1:3))
tokens_group(toks, groups = factor(c(1, 1, 2, 2), levels = 1:3), fill = TRUE)
Apply a dictionary to a tokens object
Description
Convert tokens into equivalence classes defined by values of a dictionary object.
Usage
tokens_lookup(
x,
dictionary,
levels = 1:5,
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
capkeys = !exclusive,
exclusive = TRUE,
nomatch = NULL,
append_key = FALSE,
separator = "/",
concatenator = concat(x),
nested_scope = c("key", "dictionary"),
apply_if = NULL,
verbose = quanteda_options("verbose")
)
Arguments
x |
the tokens object to which the dictionary will be applied |
dictionary |
the dictionary-class object that will be applied to
|
levels |
integers specifying the levels of entries in a hierarchical
dictionary that will be applied. The top level is 1, and subsequent levels
describe lower nesting levels. Values may be combined, even if these
levels are not contiguous, e.g. |
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
capkeys |
if |
exclusive |
if |
nomatch |
an optional character naming a new key for tokens that are not
matched to any dictionary values. If |
append_key |
if |
separator |
a character to separate tokens and keys when |
concatenator |
the concatenation character that will connect the words making up the multi-word sequences. |
nested_scope |
how to treat matches from different dictionary keys that
are nested. When one value is nested within another, such as "a b" being
nested within "a b c", then |
apply_if |
logical vector of length |
verbose |
if |
Details
Dictionary values may consist of sequences, and there are different methods of counting key matches based on values that are nested or that overlap.
When two different keys in a dictionary are nested matches of one another,
the nested_scope
options provide the choice of matching each key's
values independently (the "key"
) option, or just counting the
longest match (the "dictionary"
option). Values that are nested
within the same key are always counted as a single match. See the
last example below comparing "New York" with "New York Times"
for these two different behaviours.
Overlapping values, such as "a b"
and "b a"
are
currently always considered as separate matches if they are in different
keys, or as one match if the overlap is within the same key.
Note: apply_if
This applies the dictionary lookup only to documents that
match the logical condition. When exclusive = TRUE
(the default),
however, this means that empty documents will be returned for those not
meeting the condition, since no lookup will be applied and hence no tokens
replaced by matching keys.
See Also
tokens_replace
Examples
toks1 <- tokens(data_corpus_inaugural)
dict1 <- dictionary(list(country = "united states",
law=c("law*", "constitution"),
freedom=c("free*", "libert*")))
dfm(tokens_lookup(toks1, dict1, valuetype = "glob", verbose = TRUE))
dfm(tokens_lookup(toks1, dict1, valuetype = "glob", verbose = TRUE, nomatch = "NONE"))
dict2 <- dictionary(list(country = "united states",
law = c("law", "constitution"),
freedom = c("freedom", "liberty")))
# dfm(applyDictionary(toks1, dict2, valuetype = "fixed"))
dfm(tokens_lookup(toks1, dict2, valuetype = "fixed"))
# hierarchical dictionary example
txt <- c(d1 = "The United States has the Atlantic Ocean and the Pacific Ocean.",
d2 = "Britain and Ireland have the Irish Sea and the English Channel.")
toks2 <- tokens(txt)
dict3 <- dictionary(list(US = list(Countries = c("States"),
oceans = c("Atlantic", "Pacific")),
Europe = list(Countries = c("Britain", "Ireland"),
oceans = list(west = "Irish Sea",
east = "English Channel"))))
tokens_lookup(toks2, dict3, levels = 1)
tokens_lookup(toks2, dict3, levels = 2)
tokens_lookup(toks2, dict3, levels = 1:2)
tokens_lookup(toks2, dict3, levels = 3)
tokens_lookup(toks2, dict3, levels = c(1,3))
tokens_lookup(toks2, dict3, levels = c(2,3))
# show unmatched tokens
tokens_lookup(toks2, dict3, nomatch = "_UNMATCHED")
# nested matching differences
dict4 <- dictionary(list(paper = "New York Times", city = "New York"))
toks4 <- tokens("The New York Times is a New York paper.")
tokens_lookup(toks4, dict4, nested_scope = "key", exclusive = FALSE)
tokens_lookup(toks4, dict4, nested_scope = "dictionary", exclusive = FALSE)
Create n-grams and skip-grams from tokens
Description
Create a set of n-grams (tokens in sequence) from already tokenized text objects, with an optional skip argument to form skip-grams. Both the n-gram length and the skip lengths take vectors of arguments to form multiple lengths or skips in one pass. Implemented in C++ for efficiency.
Usage
tokens_ngrams(
x,
n = 2L,
skip = 0L,
concatenator = concat(x),
apply_if = NULL,
verbose = quanteda_options("verbose")
)
char_ngrams(x, n = 2L, skip = 0L, concatenator = "_")
tokens_skipgrams(
x,
n,
skip,
concatenator = concat(x),
apply_if = NULL,
verbose = quanteda_options("verbose")
)
Arguments
x |
a tokens object, or a character vector, or a list of characters |
n |
integer vector specifying the number of elements to be concatenated
in each n-gram. Each element of this vector will define a |
skip |
integer vector specifying the adjacency skip size for tokens
forming the n-grams, default is 0 for only immediately neighbouring words.
For |
concatenator |
character for combining words, default is |
apply_if |
logical vector of length |
verbose |
if |
Details
Normally, these functions will be called through
tokens(x, ngrams = , ...)
, but these functions are provided
in case a user wants to perform lower-level n-gram construction on tokenized
texts.
tokens_skipgrams()
is a wrapper to tokens_ngrams()
that requires
arguments to be supplied for both n and skip. For k-skip skip-grams, set skip to 0:k, in order to conform to the definition of skip-grams found in Guthrie et al (2006): a k-skip-gram is an n-gram which is a superset of all n-grams and each (k-i) skip-gram until (k-i) == 0 (which includes 0 skip-grams).
Value
a tokens object consisting of a list of character vectors of n-grams, one list element per text, or a character vector if called on a simple character vector
Note
char_ngrams
is a convenience wrapper for a (non-list)
vector of characters, so named to be consistent with quanteda's naming
scheme.
References
Guthrie, David, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2006.
"A Closer Look at Skip-Gram Modelling." https://aclanthology.org/L06-1210/
Examples
# ngrams
tokens_ngrams(tokens(c("a b c d e", "c d e f g")), n = 2:3)
toks <- tokens(c(text1 = "the quick brown fox jumped over the lazy dog"))
tokens_ngrams(toks, n = 1:3)
tokens_ngrams(toks, n = c(2,4), concatenator = " ")
tokens_ngrams(toks, n = c(2,4), skip = 1, concatenator = " ")
# skipgrams
toks <- tokens("insurgents killed in ongoing fighting")
tokens_skipgrams(toks, n = 2, skip = 0:1, concatenator = " ")
tokens_skipgrams(toks, n = 2, skip = 0:2, concatenator = " ")
tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " ")
recompile a serialized tokens object
Description
This function recompiles a serialized tokens object when the vocabulary has been changed in a way that makes some of its types identical, such as lowercasing when a lowercased version of the type already exists in the type table, or introduces gaps in the integer map of the types. It also re-indexes the types attribute to account for types that may have become duplicates, through a procedure such as stemming or lowercasing; or the addition of new tokens through compounding.
Usage
tokens_recompile(x, method = c("C++", "R"))
Arguments
x |
the tokens object to be recompiled |
method |
|
Examples
# lowercasing
toks1 <- tokens(c(one = "a b c d A B C D",
two = "A B C d"))
attr(toks1, "types") <- char_tolower(attr(toks1, "types"))
unclass(toks1)
unclass(quanteda:::tokens_recompile(toks1))
# stemming
toks2 <- tokens("Stemming stemmed many word stems.")
unclass(toks2)
unclass(quanteda:::tokens_recompile(tokens_wordstem(toks2)))
# compounding
toks3 <- tokens("One two three four.")
unclass(toks3)
unclass(tokens_compound(toks3, "two three"))
# lookup
dict <- dictionary(list(test = c("one", "three")))
unclass(tokens_lookup(toks3, dict))
# empty pads
unclass(tokens_select(toks3, dict))
unclass(tokens_select(toks3, dict, padding = TRUE))
# ngrams
unclass(tokens_ngrams(toks3, n = 2:3))
Replace tokens in a tokens object
Description
Substitute token types based on vectorized one-to-one matching. Since this
function is created for lemmatization or user-defined stemming, it supports
substitution of multi-word features by multi-word features, but substitution
is fastest when pattern
and replacement
are character vectors
and valuetype = "fixed"
, as the function then substitutes only the types of
tokens. Please use tokens_lookup()
with exclusive = FALSE
to replace dictionary values.
Usage
tokens_replace(
x,
pattern,
replacement,
valuetype = "glob",
case_insensitive = TRUE,
apply_if = NULL,
verbose = quanteda_options("verbose")
)
Arguments
x |
tokens object whose token elements will be replaced |
pattern |
a character vector or list of character vectors. See pattern for more details. |
replacement |
a character vector or (if |
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
apply_if |
logical vector of length |
verbose |
if |
See Also
tokens_lookup
Examples
toks1 <- tokens(data_corpus_inaugural, remove_punct = TRUE)
# lemmatization
taxwords <- c("tax", "taxing", "taxed", "taxed", "taxation")
lemma <- rep("TAX", length(taxwords))
toks2 <- tokens_replace(toks1, taxwords, lemma, valuetype = "fixed")
kwic(toks2, "TAX") |>
tail(10)
# stemming
type <- types(toks1)
stem <- char_wordstem(type, "porter")
toks3 <- tokens_replace(toks1, type, stem, valuetype = "fixed", case_insensitive = FALSE)
identical(toks3, tokens_wordstem(toks1, "porter"))
# multi-multi substitution
toks4 <- tokens_replace(toks1, phrase(c("Supreme Court")),
phrase(c("Supreme Court of the United States")))
kwic(toks4, phrase(c("Supreme Court of the United States")))
Restore special tokens
Description
Compounds a sequence of tokens marked by special markers. The beginning and the end of the sequence should be marked by U+E001 and U+E002 respectively.
Usage
tokens_restore(x)
Arguments
x |
tokens object |
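A minimal sketch (assuming the markers are supplied as literal U+E001/U+E002 tokens in an externally constructed tokens object):
toks <- as.tokens(list(d1 = c("\uE001", "New", "York", "\uE002", "is", "big")))
tokens_restore(toks)  # intended to compound the marked sequence into a single token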
Randomly sample documents from a tokens object
Description
Take a random sample of documents of the specified size from a tokens object, with or without replacement, optionally by grouping variables or with probability weights.
Usage
tokens_sample(
x,
size = NULL,
replace = FALSE,
prob = NULL,
by = NULL,
env = NULL,
verbose = quanteda_options("verbose")
)
Arguments
x |
a tokens object whose documents will be sampled |
size |
a positive number, the number of documents to select; when used
with |
replace |
if |
prob |
a vector of probability weights for obtaining the elements of the
vector being sampled. May not be applied when |
by |
optional grouping variable for sampling. This will be evaluated in
the docvars data.frame, so that docvars may be referred to by name without
quoting. This also changes previous behaviours for |
env |
an environment or a list object in which |
verbose |
if |
Value
a tokens object (re)sampled on the documents, containing the document variables for the documents sampled.
See Also
Examples
set.seed(123)
toks <- tokens(data_corpus_inaugural[1:6])
toks
tokens_sample(toks)
tokens_sample(toks, replace = TRUE) |> docnames()
tokens_sample(toks, size = 3, replace = TRUE) |> docnames()
# sampling using by
docvars(toks)
tokens_sample(toks, size = 2, replace = TRUE, by = Party) |> docnames()
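# sampling with probability weights (a sketch; the weights are arbitrary,
# and prob cannot be combined with by)
tokens_sample(toks, size = 3, prob = c(0.5, 0.1, 0.1, 0.1, 0.1, 0.1)) |> docnames()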
Segment tokens object by patterns
Description
Segment tokens by splitting on a pattern match. This is useful for breaking
the tokenized texts into smaller document units, based on a regular pattern
or a user-supplied annotation. While it normally makes more sense to do this
at the corpus level (see corpus_segment()
), tokens_segment
provides the option to perform this operation on tokens.
Usage
tokens_segment(
x,
pattern,
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
extract_pattern = FALSE,
pattern_position = c("before", "after"),
use_docvars = TRUE,
verbose = quanteda_options("verbose")
)
Arguments
x |
tokens object whose token elements will be segmented |
pattern |
a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
extract_pattern |
remove matched patterns from the texts and save in
docvars, if |
pattern_position |
either |
use_docvars |
if |
verbose |
if |
Value
tokens_segment
returns a tokens object whose documents
have been split by patterns
Examples
txts <- "Fellow citizens, I am again called upon by the voice of my country to
execute the functions of its Chief Magistrate. When the occasion proper for
it shall arrive, I shall endeavor to express the high sense I entertain of
this distinguished honor."
toks <- tokens(txts)
# split by any punctuation
tokens_segment(toks, "^\\p{Sterm}$", valuetype = "regex",
extract_pattern = TRUE,
pattern_position = "after")
tokens_segment(toks, c(".", "?", "!"), valuetype = "fixed",
extract_pattern = TRUE,
pattern_position = "after")
Select or remove tokens from a tokens object
Description
These functions select or discard tokens from a tokens object. For
convenience, the functions tokens_remove
and tokens_keep
are defined as
shortcuts for tokens_select(x, pattern, selection = "remove")
and
tokens_select(x, pattern, selection = "keep")
, respectively. The most
common usage of tokens_remove
will be to eliminate stop words from a text
or text-based object, while the most common use of tokens_select
will be to
select tokens with only positive pattern matches from a list of regular
expressions, including a dictionary. startpos
and endpos
determine the
positions of tokens searched for pattern
, and the areas affected are expanded by
window
.
Usage
tokens_select(
x,
pattern,
selection = c("keep", "remove"),
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
padding = FALSE,
window = 0,
min_nchar = NULL,
max_nchar = NULL,
startpos = 1L,
endpos = -1L,
apply_if = NULL,
verbose = quanteda_options("verbose")
)
tokens_remove(x, ...)
tokens_keep(x, ...)
Arguments
x |
tokens object whose token elements will be removed or kept |
pattern |
a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
selection |
whether to |
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
padding |
if |
window |
integer of length 1 or 2; the size of the window of tokens
adjacent to Terms from overlapping windows are never double-counted, but simply
returned in the pattern match. This is because |
min_nchar , max_nchar |
optional numerics specifying the minimum and
maximum length in characters for tokens to be removed or kept; defaults are
|
startpos , endpos |
integer; position of tokens in documents where pattern
matching starts and ends, where 1 is the first token in a document. For
negative indexes, counting starts at the ending token of the document, so
that -1 denotes the last token in the document, -2 the second to last, etc.
When the length of the vector is equal to |
apply_if |
logical vector of length |
verbose |
if |
... |
additional arguments passed by |
Value
a tokens object with tokens selected or removed based on their
match to pattern
Examples
## tokens_select with simple examples
toks <- as.tokens(list(letters, LETTERS))
tokens_select(toks, c("b", "e", "f"), selection = "keep", padding = FALSE)
tokens_select(toks, c("b", "e", "f"), selection = "keep", padding = TRUE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", padding = FALSE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", padding = TRUE)
# how case_insensitive works
tokens_select(toks, c("b", "e", "f"), selection = "remove", case_insensitive = TRUE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", case_insensitive = FALSE)
# use window
tokens_select(toks, c("b", "f"), selection = "keep", window = 1)
tokens_select(toks, c("b", "f"), selection = "remove", window = 1)
tokens_remove(toks, c("b", "f"), window = c(0, 1))
tokens_select(toks, pattern = c("e", "g"), window = c(1, 2))
# tokens_remove example: remove stopwords
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my
country to execute the functions of its Chief Magistrate.",
wash2 <- "When the occasion proper for it shall arrive, I shall
endeavor to express the high sense I entertain of this
distinguished honor.")
tokens_remove(tokens(txt, remove_punct = TRUE), stopwords("english"))
# token_keep example: keep two-letter words
tokens_keep(tokens(txt, remove_punct = TRUE), "??")
Split tokens by a separator pattern
Description
Replaces tokens by multiple replacements consisting of elements split by a
separator pattern, with the option of retaining the separator. This function
effectively reverses the operation of tokens_compound()
.
Usage
tokens_split(
x,
separator = " ",
valuetype = c("fixed", "regex"),
remove_separator = TRUE,
apply_if = NULL,
verbose = quanteda_options("verbose")
)
Arguments
x |
a tokens object |
separator |
a single-character pattern match by which tokens are separated |
valuetype |
the type of pattern matching: |
remove_separator |
if |
apply_if |
logical vector of length |
verbose |
if |
Examples
# undo tokens_compound()
toks1 <- tokens("pork barrel is an idiomatic multi-word expression")
tokens_compound(toks1, phrase("pork barrel"))
tokens_compound(toks1, phrase("pork barrel")) |>
tokens_split(separator = "_")
# similar to tokens(x, remove_hyphen = TRUE) but post-tokenization
toks2 <- tokens("UK-EU negotiation is not going anywhere as of 2018-12-24.")
tokens_split(toks2, separator = "-", remove_separator = FALSE)
Extract a subset of a tokens object
Description
Returns document subsets of a tokens object that meet certain conditions,
including direct logical operations on docvars (document-level variables).
tokens_subset()
functions identically to subset.data.frame()
, using
non-standard evaluation to evaluate conditions based on the docvars in the
tokens object.
Usage
tokens_subset(
x,
subset,
min_ntoken = NULL,
max_ntoken = NULL,
drop_docid = TRUE,
verbose = quanteda_options("verbose"),
...
)
Arguments
x |
tokens object to be subsetted. |
subset |
logical expression indicating the documents to keep: missing values are taken as false. |
min_ntoken , max_ntoken |
minimum and maximum lengths of the documents to extract. |
drop_docid |
if |
verbose |
if |
... |
not used |
Value
tokens object, with a subset of documents (and docvars) selected according to arguments
See Also
Examples
corp <- corpus(c(d1 = "a b c d", d2 = "a a b e",
d3 = "b b c e", d4 = "e e f a b"),
docvars = data.frame(grp = c(1, 1, 2, 3)))
toks <- tokens(corp)
# selecting on a docvars condition
tokens_subset(toks, grp > 1)
# selecting on a supplied vector
tokens_subset(toks, c(TRUE, FALSE, TRUE, FALSE))
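# selecting on document length (a sketch using min_ntoken)
tokens_subset(toks, min_ntoken = 5)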
Convert the case of tokens
Description
tokens_tolower()
and tokens_toupper()
convert the features of a
tokens object and re-index the types.
Usage
tokens_tolower(x, keep_acronyms = FALSE)
tokens_toupper(x)
Arguments
x |
the input object whose character/tokens/feature elements will be case-converted |
keep_acronyms |
logical; if |
Examples
# for a tokens object
toks <- tokens(c(txt1 = "b A A", txt2 = "C C a b B"))
tokens_tolower(toks)
tokens_toupper(toks)
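# keep_acronyms retains all-uppercase tokens (a sketch)
toks2 <- tokens("NATO Member States")
tokens_tolower(toks2, keep_acronyms = TRUE)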
Trim tokens using frequency threshold-based feature selection
Description
Returns a tokens object reduced in size based on document and term frequency, usually in terms of a minimum frequency, but may also be in terms of maximum frequencies. Setting a combination of minimum and maximum frequencies will select features based on a range.
Usage
tokens_trim(
x,
min_termfreq = NULL,
max_termfreq = NULL,
termfreq_type = c("count", "prop", "rank", "quantile"),
min_docfreq = NULL,
max_docfreq = NULL,
docfreq_type = c("count", "prop", "rank", "quantile"),
padding = FALSE,
verbose = quanteda_options("verbose")
)
Arguments
x |
a tokens object |
min_termfreq , max_termfreq |
minimum/maximum values of feature frequencies across all documents, below/above which features will be removed |
termfreq_type |
how |
min_docfreq , max_docfreq |
minimum/maximum values of a feature's document frequency, below/above which features will be removed |
docfreq_type |
specify how |
padding |
if |
verbose |
if |
Value
A tokens object with reduced size.
See Also
Examples
toks <- tokens(data_corpus_inaugural)
# keep only words occurring >= 10 times and in >= 2 documents
tokens_trim(toks, min_termfreq = 10, min_docfreq = 2, padding = TRUE)
# keep only words occurring >= 10 times and no more than 90% of the documents
tokens_trim(toks, min_termfreq = 10, max_docfreq = 0.9, docfreq_type = "prop",
padding = TRUE)
Stem the terms in an object
Description
Apply a stemmer to words. This is a wrapper to wordStem designed to allow this function to be called without loading the entire SnowballC package. wordStem uses Martin Porter's stemming algorithm and the C libstemmer library generated by Snowball.
Usage
tokens_wordstem(
x,
language = quanteda_options("language_stemmer"),
verbose = quanteda_options("verbose")
)
char_wordstem(
x,
language = quanteda_options("language_stemmer"),
check_whitespace = TRUE
)
dfm_wordstem(
x,
language = quanteda_options("language_stemmer"),
verbose = quanteda_options("verbose")
)
Arguments
x |
a character, tokens, or dfm object whose words are to be stemmed. If tokenized texts, the tokenization must be word-based. |
language |
the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes) |
verbose |
if |
check_whitespace |
logical; if |
Value
tokens_wordstem()
returns a tokens object whose word
types have been stemmed.
char_wordstem()
returns a character object whose word
types have been stemmed.
dfm_wordstem()
returns a dfm object whose word
types (features) have been stemmed, and recombined to consolidate features made
equivalent because of stemming.
References
https://www.iso.org/iso-639-language-code for the ISO-639 language codes
See Also
Examples
# example applied to tokens
txt <- c(one = "eating eater eaters eats ate",
two = "taxing taxes taxed my tax return")
th <- tokens(txt)
tokens_wordstem(th)
# simple example
char_wordstem(c("win", "winning", "wins", "won", "winner"))
# example applied to a dfm
(origdfm <- dfm(tokens(txt)))
dfm_wordstem(origdfm)
Methods for tokens_xptr objects
Description
Methods for creating and testing for tokens_xptr
objects, which are
tokens objects containing pointers to memory locations that can be passed
by reference for efficient processing in tokens_*()
functions that modify
them, or for constructing a document-feature matrix without requiring a deep
copy to be passed to dfm()
.
is.tokens_xptr()
tests whether an object is of class
tokens_xptr
.
as.tokens_xptr()
coerces a tokens
object to an external
pointer-based tokens object, or returns a deep copy of a tokens_xptr
when
x
is already a tokens_xptr
object.
Usage
is.tokens_xptr(x)
as.tokens_xptr(x)
## S3 method for class 'tokens'
as.tokens_xptr(x)
## S3 method for class 'tokens_xptr'
as.tokens_xptr(x)
Arguments
x |
a tokens object to convert or a |
Value
is.tokens_xptr()
returns TRUE
if the object is an external
pointer-based tokens object, FALSE
otherwise.
as.tokens_xptr()
returns a (deep copy of a) tokens_xptr
class
object.
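Examples
# a brief sketch of the intended workflow, not from the package documentation
toks <- tokens(c(d1 = "a b c d", d2 = "b c d e"))
xtoks <- as.tokens_xptr(toks)
is.tokens_xptr(xtoks)
# tokens_*() functions modify the external-pointer object by reference,
# and dfm() can then be formed without a further deep copy
dfm(tokens_remove(xtoks, "b"))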
Identify the most frequent features in a dfm
Description
List the most (or least) frequently occurring features in a dfm, either as a whole or separated by document.
Usage
topfeatures(
x,
n = 10,
decreasing = TRUE,
scheme = c("count", "docfreq"),
groups = NULL
)
Arguments
x |
the object whose features will be returned |
n |
how many top features should be returned |
decreasing |
If |
scheme |
one of |
groups |
grouping variable, equal in length to the number
of documents. This will be evaluated in the docvars data.frame, so that
docvars may be referred to by name without quoting. This also changes
previous behaviours for |
Value
A named numeric vector of feature counts, where the names are the
feature labels, or a list of these if groups
is given.
Examples
dfmat1 <- corpus_subset(data_corpus_inaugural, Year > 1980) |>
tokens(remove_punct = TRUE) |>
dfm()
dfmat2 <- dfm_remove(dfmat1, stopwords("en"))
# most frequent features
topfeatures(dfmat1)
topfeatures(dfmat2)
# least frequent features
topfeatures(dfmat2, decreasing = FALSE)
# top features of individual documents
topfeatures(dfmat2, n = 5, groups = docnames(dfmat2))
# grouping by president last name
topfeatures(dfmat2, n = 5, groups = President)
# features by document frequencies
tail(topfeatures(dfmat1, scheme = "docfreq", n = 200))
Get word types from a tokens object
Description
Get unique types of tokens from a tokens object.
Usage
types(x)
Arguments
x |
a tokens object |
See Also
Examples
toks <- tokens(data_corpus_inaugural)
head(types(toks), 20)
Unlist a list of character vectors safely
Description
Unlist a list of character vectors safely
Usage
unlist_character(x, unique = FALSE, ...)
Arguments
x |
a list of character vectors |
unique |
if |
... |
passed to |
Value
character vector
Unlist a list of integer vectors safely
Description
Unlist a list of integer vectors safely
Usage
unlist_integer(x, unique = FALSE, ...)
Arguments
x |
a list of integers |
unique |
if |
... |
passed to |
Value
integer vector
Pattern matching using valuetype
Description
Pattern matching in quanteda using the valuetype
argument.
Arguments
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
Details
Pattern matching in quanteda uses "glob"-style pattern
matching as the default, because this is simpler than regular expression
matching while addressing most users' needs. It also has the advantage
of being identical to fixed pattern matching when the wildcard characters
(*
and ?
) are not used. Finally, most dictionary formats use glob
matching.
"glob"
"glob"-style wildcard expressions, the quanteda default. The implementation used in quanteda uses
*
to match any number of any characters including none, and?
to match any single character. See alsoutils::glob2rx()
and References below."regex"
Regular expression matching.
"fixed"
Fixed (literal) pattern matching.
Note
If "fixed" is used with case_insensitive = TRUE
, features will
typically be lowercased internally prior to matching. Also, glob matches
are converted to regular expressions (using utils::glob2rx()
) when
they contain wild card characters, and to fixed pattern matches when they
do not.
See Also
utils::glob2rx()
, glob pattern matching (Wikipedia),
stringi::stringi-search-regex()
, stringi::stringi-search-fixed()
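Examples
# a short sketch contrasting the three valuetype options (invented sentence)
toks <- tokens("The tax was taxing to the taxpayer")
tokens_select(toks, "tax*", valuetype = "glob")
tokens_select(toks, "^tax(ing|payer)?$", valuetype = "regex")
tokens_select(toks, "tax", valuetype = "fixed")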