* Added alt text to figures in vignettes and README (#233)
* Update vignette for `quanteda::dfm()` v4 (#242)
* Added `stm()` tidiers for high FREX and lift words (#223)
* Updated `dfm` tidiers because of the upcoming release of Matrix (#218)
* `scale_x/y_reordered()` now uses a function for `labels` as its main input (#200)
* `to_lower` is now passed to the underlying tokenization function for character shingles (#208)
* Fix for `content`, thanks to @jonathanvoelkle (#209)
* The `collapse` argument to `unnest_functions()` now takes either `NULL` (do not collapse text across rows for tokenizing) or a character vector of variables (use said variables to collapse text across rows for tokenizing). This fixes a long-standing bug and provides more consistent behavior, but does change results for many situations (such as n-gram tokenization).
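The two `collapse` modes can be sketched with n-grams; the `doc`/`txt` columns below are invented for illustration, and `unnest_tokens()` stands in for the affected functions:

```r
library(dplyr)
library(tidytext)

# Two rows that belong to the same document
d <- tibble(doc = c("a", "a"),
            txt = c("tidy text mining", "with r"))

# collapse = NULL: each row is tokenized on its own,
# so no bigram spans the row boundary (3 bigrams)
unnest_tokens(d, bigram, txt, token = "ngrams", n = 2, collapse = NULL)

# collapse = "doc": rows are combined within each doc before tokenizing,
# so bigrams can span the original row boundary (4 bigrams)
unnest_tokens(d, bigram, txt, token = "ngrams", n = 2, collapse = "doc")
```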
* `reorder_within()` now handles multiple variables, thanks to @tmastny (#170)
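A sketch of `reorder_within()` with `scale_y_reordered()` in a faceted bar chart; the toy data (`word`, `n`, `book`) is invented:

```r
library(dplyr)
library(ggplot2)
library(tidytext)

d <- tibble(word = c("cat", "dog", "cat", "dog"),
            n    = c(3, 1, 2, 5),
            book = c("A", "A", "B", "B"))

d %>%
  mutate(word = reorder_within(word, n, book)) %>%
  # for several grouping variables, they can apparently be passed as a
  # list, e.g. reorder_within(word, n, list(book, year))
  ggplot(aes(n, word)) +
  geom_col() +
  scale_y_reordered() +                # strips the internal ordering suffix
  facet_wrap(vars(book), scales = "free_y")
```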
* Added a `to_lower` argument to other tokenizing functions, for more consistent behavior (#175)
* Added a `glance()` method for stm’s estimated regressions, thanks to @vincentarelbundock (#176)
* `augment()` function for stm topic models
* `tibble()` is now used where appropriate, thanks to @luisdza (#136)
* `unnest_tokens` can now unnest a data frame with a list column (which formerly threw the error `unnest_tokens expects all columns of input to be atomic vectors (not lists)`). The unnested result repeats the objects within each list. (It’s still not possible when `collapse = TRUE`, in which case tokens can span multiple lines.)
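A minimal sketch of the list-column case; the `tags` column is invented for illustration:

```r
library(dplyr)
library(tidytext)

d <- tibble(txt  = c("hello world", "tidy text"),
            tags = list(c("x", "y"), "z"))

# Formerly an error; now each token row repeats its row's list element
out <- unnest_tokens(d, word, txt)
out$word   # "hello" "world" "tidy" "text"
```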
* Added `get_tidy_stopwords()` to obtain stopword lexicons in multiple languages in a tidy format
* Added a dataset `nma_words` of negators, modals, and adverbs that affect sentiment analysis (#55)
* `NA` values are handled in `unnest_tokens` so they no longer cause other columns to become `NA` (#82)
* Input of different classes (such as `data.table`) is handled consistently (#88)
* Switched to tidy evaluation for column inputs (`unnest_tokens`, `bind_tf_idf`, all sparse casters) (#67, #74)
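With tidy evaluation, these functions take bare (unquoted) column names; a sketch using `bind_tf_idf()` on invented counts:

```r
library(dplyr)
library(tidytext)

word_counts <- tibble(doc  = c("a", "a", "b"),
                      word = c("cat", "dog", "cat"),
                      n    = c(2, 1, 3))

# Bare column names: term, document, count
tfidf <- word_counts %>% bind_tf_idf(word, doc, n)
# adds tf, idf, and tf_idf columns;
# "cat" occurs in every document, so its idf (and tf-idf) is 0
```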
* Tidiers for the stm package (#51)
* `get_sentiments` now works regardless of whether tidytext has been loaded or not (#50)
* `unnest_tokens` now supports data.table objects (#37)
* Fixed the `to_lower` parameter in `unnest_tokens` to work properly for all tokenizing options
* Updated `tidy.corpus`, `glance.corpus`, tests, and vignette for changes to the quanteda API
* Removed the `pair_count` function, which is now in the in-development widyr package
* Tidiers for the mallet package
* `unnest_tokens` preserves custom attributes of data frames and data.tables
* Changed `cast_sparse`, `cast_dtm`, and other sparse casters to ignore groups in the input (#19)
* Changed `unnest_tokens` so that it no longer uses tidyr’s `unnest`, but rather a custom version that removes some overhead. In some experiments, this sped up `unnest_tokens` on large inputs by about 40%. This also moves tidyr from Imports to Suggests for now.
* `unnest_tokens` now checks that there are no list columns in the input, and raises an error if present (since those cannot be unnested)
* Added a `format` argument to `unnest_tokens` so that it can process HTML, XML, LaTeX, or man pages using the hunspell package, though only when `token = "words"`.
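A sketch of the `format` argument; the HTML snippet is invented, and only `token = "words"` works in this mode:

```r
library(dplyr)
library(tidytext)

d <- tibble(txt = "<p>Some <b>bold</b> text</p>")

# hunspell's parser drops the markup before tokenizing
out <- unnest_tokens(d, word, txt, format = "html")
```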
* Added a `get_sentiments` function that takes the name of a lexicon (“nrc”, “bing”, or “sentiment”) and returns just that sentiment data frame (#25)
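Typical use joins tokens against the returned lexicon; the example words below are invented:

```r
library(dplyr)
library(tidytext)

bing <- get_sentiments("bing")   # a data frame with word and sentiment columns

tibble(word = c("happy", "sad", "chair")) %>%
  inner_join(bing, by = "word")  # words absent from the lexicon are dropped
```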
* Updated `cast_sparse` to work with dplyr 0.5.0
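`cast_sparse()` takes the data frame plus row, column, and value columns; the toy counts here are invented:

```r
library(dplyr)
library(tidytext)

word_counts <- tibble(doc  = c("a", "a", "b"),
                      word = c("cat", "dog", "cat"),
                      n    = c(2, 1, 3))

m <- cast_sparse(word_counts, doc, word, n)  # sparse Matrix: docs x words
dim(m)                                       # 2 x 2
```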
* The `pair_count` function has been moved to `pairwise_count` in the widyr package. This will be removed entirely in a future version.