Version: | 0.5.4 |
Title: | Keyword Assisted Topic Models |
Description: | Fits keyword assisted topic models (keyATM) using collapsed Gibbs samplers. The keyATM combines latent Dirichlet allocation (LDA) models with a small number of keywords selected by researchers in order to improve the interpretability and topic classification of the LDA. The keyATM can also incorporate covariates and directly model time trends. The keyATM is proposed in Eshima, Imai, and Sasaki (2024) <doi:10.1111/ajps.12779>. |
License: | GPL-3 |
Depends: | R (≥ 4.0) |
Imports: | Rcpp (≥ 1.0.7), cli (≥ 3.6.1), dplyr (≥ 1.1.0), fastmap, future.apply, fs (≥ 1.6.0), ggplot2 (≥ 3.4.0), ggrepel, magrittr, Matrix, matrixNormal (≥ 0.1.0), MASS, pgdraw, purrr (≥ 1.0.0), quanteda (≥ 3.3.0), rlang (≥ 1.1.0), stringr, tibble, tidyr (≥ 1.0.0), tidyselect (≥ 1.2.0) |
LinkingTo: | Rcpp, RcppEigen, cli |
Suggests: | readtext, stats, testthat (≥ 3.1.5) |
URL: | https://keyatm.github.io/keyATM/ |
Encoding: | UTF-8 |
BugReports: | https://github.com/keyATM/keyATM/issues |
LazyData: | TRUE |
RoxygenNote: | 7.3.2 |
SystemRequirements: | C++17 |
NeedsCompilation: | yes |
Packaged: | 2025-07-21 12:39:53 UTC; shusei |
Author: | Shusei Eshima |
Maintainer: | Shusei Eshima <shuseieshima@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-07-21 13:01:36 UTC |
Keyword Assisted Topic Models
Description
The implementation of keyATM models.
Author(s)
Maintainer: Shusei Eshima shuseieshima@gmail.com (ORCID)
Authors:
Tomoya Sasaki tomoyas@mit.edu
Kosuke Imai imai@harvard.edu
Other contributors:
Chung-hong Chan chainsawtiney@gmail.com (ORCID) [contributor]
Romain François (ORCID) [contributor]
Martin Feldkircher (ORCID) [contributor]
William Lowe wlowe@princeton.edu [contributor]
Seo-young Silvia Kim sy.silvia.kim@gmail.com (ORCID) [contributor]
See Also
Useful links:
https://keyatm.github.io/keyATM/
Report bugs at https://github.com/keyATM/keyATM/issues
Estimate document-topic distribution by strata (for covariate models)
Description
Estimate document-topic distribution by strata (for covariate models)
Usage
by_strata_DocTopic(x, by_var, labels, by_values = NULL, ...)
Arguments
x |
the output from the covariate keyATM model (see |
by_var |
character. The name of the variable to use. |
labels |
character. The labels for the values specified in |
by_values |
numeric. Specific values for |
... |
other arguments passed on to the |
Value
strata_doctopic object (a list).
Estimate subsetted topic-word distribution
Description
Estimate subsetted topic-word distribution
Usage
by_strata_TopicWord(x, keyATM_docs, by)
Arguments
x |
the output from a keyATM model (see |
keyATM_docs |
an object generated by |
by |
a vector whose length is the number of documents. |
Value
strata_topicword object (a list).
Calculate the probability for Polya-Gamma Covariate Model
Description
Same as utils::calc_PGtheta, but this is for calling from R
Usage
calc_PGtheta_R(theta_tilda, theta, num_doc, num_topics)
Arguments
theta_tilda |
Parameters |
theta |
Parameters |
num_doc |
Number of documents |
num_topics |
Number of topics |
Return covariates used in the iteration
Description
Return covariates used in the iteration
Usage
covariates_get(x)
Arguments
x |
the output from the covariate keyATM model (see |
Show covariates information
Description
Show covariates information
Usage
covariates_info(x)
Arguments
x |
the output from the covariate keyATM model (see |
keyATM main function
Description
Fit keyATM models.
Usage
keyATM(
docs,
model,
no_keyword_topics,
keywords = list(),
model_settings = list(),
priors = list(),
options = list(),
keep = c()
)
Arguments
docs |
texts read via |
model |
keyATM model: |
no_keyword_topics |
the number of regular topics. |
keywords |
a list of keywords. |
model_settings |
a list of model specific settings (details are in the online documentation). |
priors |
a list of priors of parameters. |
options |
a list of options. |
keep |
a vector of the names of elements you want to keep in output. |
Value
A keyATM_output object containing:
- keyword_k
number of keyword topics
- no_keyword_topics
number of no-keyword topics
- V
number of terms (number of unique words)
- N
number of documents
- model
the name of the model
- theta
topic proportions for each document (document-topic distribution)
- phi
topic specific word generation probabilities (topic-word distribution)
- topic_counts
number of tokens assigned to each topic
- word_counts
number of times each word type appears
- doc_lens
length of each document in tokens
- vocab
words in the vocabulary (a vector of unique words)
- priors
priors
- options
options
- keywords_raw
specified keywords
- model_fit
perplexity and log-likelihood
- pi
estimated pi (the probability of using the keyword topic-word distribution) for the last iteration
- values_iter
values stored during iterations
- kept_values
outputs you specified to store in the keep option
- information
information about the fitting
See Also
https://keyatm.github.io/keyATM/articles/pkgdown_files/Options.html
Examples
## Not run:
library(keyATM)
library(quanteda)
data(keyATM_data_bills)
bills_keywords <- keyATM_data_bills$keywords
bills_dfm <- keyATM_data_bills$doc_dfm # quanteda dfm object
keyATM_docs <- keyATM_read(bills_dfm)
# keyATM Base
out <- keyATM(docs = keyATM_docs, model = "base",
no_keyword_topics = 5, keywords = bills_keywords)
# Visit our website for full examples: https://keyatm.github.io/keyATM/
## End(Not run)
Bills data
Description
Bills data
Usage
keyATM_data_bills
Format
A list with the following objects:
- doc_dfm
A quanteda dfm object of 140 documents. The text data is a part of the Congressional Bills scraped from CONGRESS.GOV.
- cov
An integer vector which takes the value one if a Republican proposed the bill.
- keywords
A list of length 4 which contains keywords for four selected topics.
- time_index
An integer vector indicating the session number of each bill.
- labels
An integer vector indicating 40 labels.
- labels_all
An integer vector indicating all labels.
Source
CONGRESS.GOV
Run the Collapsed Gibbs sampler for the keyATM Dynamic
Description
Run the Collapsed Gibbs sampler for the keyATM Dynamic
Usage
keyATM_fit_HMM(model, resume = FALSE)
Arguments
model |
An initialized model |
resume |
resume or not |
Run the Collapsed Gibbs sampler for weighted LDA
Description
Run the Collapsed Gibbs sampler for weighted LDA
Usage
keyATM_fit_LDA(model, resume = FALSE)
Arguments
model |
An initialized model |
resume |
resume or not |
Run the Collapsed Gibbs sampler for the weighted LDA with HMM model
Description
Run the Collapsed Gibbs sampler for the weighted LDA with HMM model
Usage
keyATM_fit_LDAHMM(model, resume = FALSE)
Arguments
model |
An initialized model |
resume |
resume or not |
Run the Collapsed Gibbs sampler for weighted LDA with covariates
Description
Run the Collapsed Gibbs sampler for weighted LDA with covariates
Usage
keyATM_fit_LDAcov(model, resume = FALSE)
Arguments
model |
An initialized model |
resume |
resume or not |
Run the Collapsed Gibbs sampler for the keyATM Base
Description
Run the Collapsed Gibbs sampler for the keyATM Base
Usage
keyATM_fit_base(model, resume = FALSE)
Arguments
model |
An initialized model |
resume |
resume or not |
Run the Collapsed Gibbs sampler for the keyATM covariates (Dir-Multi)
Description
Run the Collapsed Gibbs sampler for the keyATM covariates (Dir-Multi)
Usage
keyATM_fit_cov(model, resume = FALSE)
Arguments
model |
An initialized model |
resume |
resume or not |
Run the Collapsed Gibbs sampler for the keyATM covariates (Polya-Gamma)
Description
Run the Collapsed Gibbs sampler for the keyATM covariates (Polya-Gamma)
Usage
keyATM_fit_covPG(model, resume = FALSE)
Arguments
model |
An initialized model |
resume |
resume or not |
Initialize a keyATM model
Description
keyATM_initialize is wrapped by keyATM() and weightedLDA()
Usage
keyATM_initialize(
docs,
model,
no_keyword_topics,
keywords = list(),
model_settings = list(),
priors = list(),
options = list()
)
Create an output object
Description
Create an output object
Usage
keyATM_output(model, keep, used_iter)
Read texts
Description
Read texts and create a keyATM_docs object, which is a list of texts.
Usage
keyATM_read(
texts,
encoding = "UTF-8",
check = TRUE,
keep_docnames = FALSE,
split = 0
)
Arguments
texts |
input. keyATM takes a quanteda dfm (dgCMatrix), data.frame, tibble tbl_df, or a vector of file paths. |
encoding |
character. Only used when |
check |
logical. If |
keep_docnames |
logical. If |
split |
numeric. This option works only with a quanteda dfm. It creates two subsets of the dfm by randomly splitting each document (i.e., the total number of documents is the same between the two subsets). This option specifies the split proportion. Default is 0. |
Value
a keyATM_docs object. The first element is a list whose elements are split texts. The length of the list equals the number of documents.
Examples
## Not run:
# Use quanteda dfm
keyATM_docs <- keyATM_read(texts = quanteda_dfm)
# Use data.frame or tibble (texts should be stored in a column named `text`)
keyATM_docs <- keyATM_read(texts = data_frame_object)
keyATM_docs <- keyATM_read(texts = tibble_object)
# Use a vector that stores full paths to the text files
files <- list.files(doc_folder, pattern = "\\.txt$", full.names = TRUE) # pattern is a regular expression, not a glob
keyATM_docs <- keyATM_read(texts = files)
## End(Not run)
keyATM with Collapsed Variational Bayes
Description
Experimental feature: Fit keyATM base with Collapsed Variational Bayes
Usage
keyATMvb(
docs,
model,
no_keyword_topics,
keywords = list(),
model_settings = list(),
vb_options = list(),
priors = list(),
options = list(),
keep = list()
)
Arguments
docs |
texts read via |
model |
keyATM model: |
no_keyword_topics |
the number of regular topics |
keywords |
a list of keywords |
model_settings |
a list of model specific settings (details are in the online documentation) |
vb_options |
a list of settings for Variational Bayes |
priors |
a list of priors of parameters |
options |
a list of options same as |
keep |
a vector of the names of elements you want to keep in output |
Value
A keyATM_output object
See Also
https://keyatm.github.io/keyATM/articles/pkgdown_files/keyATMvb.html
Run the Variational Bayes for the keyATM models
Description
Run the Variational Bayes for the keyATM models
Usage
keyATMvb_call(model)
Arguments
model |
A model |
Fit a keyATM model with Collapsed Variational Bayes
Description
Fit a keyATM model with Collapsed Variational Bayes
Usage
keyATMvb_fit(
docs,
model,
no_keyword_topics,
keywords = list(),
model_settings = list(),
vb_options = list(),
priors = list(),
options = list()
)
Initialize assignments
Description
Initialize assignments
Usage
make_wsz_cpp(docs_, info_, initialized_)
Arguments
docs_ |
Documents |
info_ |
Various information |
initialized_ |
Store initialized objects (W, S and Z) |
Run multinomial regression with Polya-Gamma augmentation
Description
Run multinomial regression with Polya-Gamma augmentation. There is no need to call this function directly. The keyATM Covariate internally uses this.
Usage
multiPGreg(Y, X, num_topics, PG_params, iter = 1, store_lambda = 0)
Arguments
Y |
Outcomes. |
X |
Covariates. |
num_topics |
Number of topics. |
PG_params |
Parameters used in this function. |
iter |
The default is 1. |
store_lambda |
The default is 0. |
Plot document-topic distribution by strata (for covariate models)
Description
Plot document-topic distribution by strata (for covariate models)
Usage
## S3 method for class 'strata_doctopic'
plot(
x,
show_topic = NULL,
var_name = NULL,
by = c("topic", "covariate"),
ci = 0.9,
method = c("hdi", "eti"),
point = c("mean", "median"),
width = 0.1,
show_point = TRUE,
...
)
Arguments
x |
a strata_doctopic object (see |
show_topic |
a vector or an integer. Indicate topics to visualize. |
var_name |
the name of the variable in the plot. |
by |
"topic" or "covariate". Default is "topic". |
ci |
value of the credible interval (between 0 and 1) to be estimated. Default is 0.9. |
method |
method for computing the credible interval. The Highest Density Interval ("hdi", default) or the Equal-tailed Interval ("eti"). |
point |
method for computing the point estimate. |
width |
numeric. Width of the error bars. |
show_point |
logical. Show point estimates. The default is TRUE. |
... |
additional arguments not used. |
Value
keyATM_fig object.
See Also
save_fig(), by_strata_DocTopic()
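The ci and method arguments recur across the plotting and prediction functions. As a rough illustration of how the two interval types differ, the base R sketch below computes both from a vector of simulated draws. The helper names eti() and hdi() and the gamma-distributed draws are assumptions made for this example; they are not keyATM functions, and keyATM's internal computation may differ.

```r
# Equal-tailed interval: cut the same probability mass from each tail.
eti <- function(draws, ci = 0.9) {
  tail_mass <- (1 - ci) / 2
  unname(quantile(draws, probs = c(tail_mass, 1 - tail_mass)))
}

# Highest density interval: the narrowest window of sorted draws
# that still contains a `ci` share of them.
hdi <- function(draws, ci = 0.9) {
  sorted <- sort(draws)
  n <- length(sorted)
  window <- ceiling(ci * n)                 # number of draws inside the interval
  starts <- seq_len(n - window + 1)         # every possible left endpoint
  widths <- sorted[starts + window - 1] - sorted[starts]
  best <- which.min(widths)
  c(sorted[best], sorted[best + window - 1])
}

set.seed(1)
draws <- rgamma(5000, shape = 2, rate = 1)  # right-skewed stand-in for posterior draws
eti(draws)
hdi(draws)
```

For a skewed distribution the highest density interval is shifted toward the mode and is no wider than the equal-tailed one; for a symmetric distribution the two nearly coincide.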
Show a diagnosis plot of alpha
Description
Show a diagnosis plot of alpha
Usage
plot_alpha(x, start = 0, show_topic = NULL, scales = "fixed")
Arguments
x |
the output from a keyATM model (see |
start |
integer. The start of slice iteration. Default is 0. |
show_topic |
a vector to specify topic indexes to show. Default is NULL. |
scales |
character. Control the scale of y-axis (the parameter in ggplot2::facet_wrap()): |
Value
keyATM_fig object
Show a diagnosis plot of log-likelihood and perplexity
Description
Show a diagnosis plot of log-likelihood and perplexity
Usage
plot_modelfit(x, start = 1)
Arguments
x |
the output from a keyATM model (see |
start |
integer. The starting value of iteration to use in plot. Default is 1. |
Value
keyATM_fig object.
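For orientation, the two quantities shown in this plot can be related with a small base R sketch. It assumes the common definition of perplexity as the exponentiated negative log-likelihood per token. The toy theta, phi, and counts matrices are made up for the example, and keyATM's own computation, which runs inside the sampler, may differ in detail.

```r
# Toy inputs (all hypothetical): 2 documents, 2 topics, 3 word types.
theta <- matrix(c(0.8, 0.2,
                  0.3, 0.7), nrow = 2, byrow = TRUE)      # document-topic proportions
phi <- matrix(c(0.5, 0.4, 0.1,
                0.1, 0.2, 0.7), nrow = 2, byrow = TRUE)   # topic-word probabilities
counts <- matrix(c(3, 2, 0,
                   0, 1, 4), nrow = 2, byrow = TRUE)      # document-term counts

word_prob <- theta %*% phi                 # p(word | document); each row sums to 1
log_lik <- sum(counts * log(word_prob))    # log-likelihood of the observed tokens
perplexity <- exp(-log_lik / sum(counts))  # lower values indicate a better fit
perplexity
```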
Show a diagnosis plot of pi
Description
Show a diagnosis plot of pi
Usage
plot_pi(
x,
show_topic = NULL,
start = 0,
ci = 0.9,
method = c("hdi", "eti"),
point = c("mean", "median")
)
Arguments
x |
the output from a keyATM model (see |
show_topic |
an integer or a vector. Indicate topics to visualize. Default is NULL. |
start |
integer. The starting value of iteration to use in the plot. Default is 0. |
ci |
value of the credible interval (between 0 and 1) to be estimated. Default is 0.9. |
method |
method for computing the credible interval. The Highest Density Interval ("hdi", default) or the Equal-tailed Interval ("eti"). |
point |
method for computing the point estimate. |
Value
keyATM_fig object.
Plot time trend
Description
Plot time trend
Usage
plot_timetrend(
x,
show_topic = NULL,
time_index_label = NULL,
ci = 0.9,
method = c("hdi", "eti"),
point = c("mean", "median"),
xlab = "Time",
scales = "fixed",
show_point = TRUE,
...
)
Arguments
x |
the output from the dynamic keyATM model (see |
show_topic |
an integer or a vector. Indicate topics to visualize. Default is NULL. |
time_index_label |
a vector. The label for time index. The length should be equal to the number of documents (time index provided to |
ci |
value of the credible interval (between 0 and 1) to be estimated. Default is 0.9. |
method |
method for computing the credible interval. The Highest Density Interval ("hdi", default) or the Equal-tailed Interval ("eti"). |
point |
method for computing the point estimate. |
xlab |
a character. |
scales |
character. Control the scale of y-axis (the parameter in ggplot2::facet_wrap()): |
show_point |
logical. The default is TRUE. |
... |
additional arguments not used. |
Value
keyATM_fig object.
Show the expected proportion of the corpus belonging to each topic
Description
Show the expected proportion of the corpus belonging to each topic
Usage
plot_topicprop(
x,
n = 3,
show_topic = NULL,
show_topwords = TRUE,
label_topic = NULL,
order = c("proportion", "topicid"),
xmax = NULL
)
Arguments
x |
the output from a keyATM model (see |
n |
The number of top words to show. Default is 3. |
show_topic |
an integer or a vector. Indicate topics to visualize. Default is NULL. |
show_topwords |
logical. Show topwords. The default is TRUE. |
label_topic |
a character vector. The name of the topics in the plot. |
order |
The order of topics. |
xmax |
a numeric. Indicate the max value on the x axis |
Value
keyATM_fig object
Predict topic proportions for the covariate keyATM
Description
Predict topic proportions for the covariate keyATM
Usage
## S3 method for class 'keyATM_output'
predict(
object,
newdata,
transform = FALSE,
burn_in = NULL,
parallel = TRUE,
posterior_mean = TRUE,
ci = 0.9,
method = c("hdi", "eti"),
point = c("mean", "median"),
label = NULL,
raw_values = FALSE,
...
)
Arguments
object |
the keyATM_output object for the covariate model. |
newdata |
New observations which should be predicted. |
transform |
Transform and standardize the |
burn_in |
integer. Burn-in period. If not specified, it is half of the samples. Default is NULL. |
parallel |
logical. If |
posterior_mean |
logical. If |
ci |
value of the credible interval (between 0 and 1) to be estimated. Default is 0.9. |
method |
method for computing the credible interval. The Highest Density Interval ("hdi", default) or the Equal-tailed Interval ("eti"). |
point |
method for computing the point estimate. |
label |
a character. Add a |
raw_values |
a logical. Returns raw values. The default is FALSE. |
... |
additional arguments not used. |
Read files from the quanteda dfm (this is the same as dgCMatrix)
Description
Read files from the quanteda dfm (this is the same as dgCMatrix)
Usage
read_dfm_cpp(dfm, W_read, vocab, split)
Arguments
dfm |
a dfm input (sparse Matrix) |
W_read |
an object to return |
vocab |
a vector of vocabulary |
split |
split proportion |
Convert a quanteda dictionary to keywords
Description
This function converts or reads a dictionary object from quanteda to a named list. "Glob"-style wildcard expressions (e.g. politic*) are resolved based on the available terms in your texts.
Usage
read_keywords(file = NULL, docs = NULL, dictionary = NULL, split = TRUE, ...)
Arguments
file |
file identifier for a foreign dictionary, e.g. path to a dictionary in YAML or LIWC format |
docs |
texts read via |
dictionary |
a quanteda dictionary object, ignored if file is not NULL |
split |
boolean, whether multi-word terms should be separated, e.g. "air force" splits into "air" and "force". |
... |
additional parameters for |
Value
a named list which can be used as keywords for e.g. keyATM()
Examples
## Not run:
library(keyATM)
library(quanteda)
## using the moral foundation dictionary example from quanteda
dictfile <- tempfile()
download.file("http://bit.ly/37cV95h", dictfile)
data(keyATM_data_bills)
bills_dfm <- keyATM_data_bills$doc_dfm
keyATM_docs <- keyATM_read(bills_dfm)
read_keywords(file = dictfile, docs = keyATM_docs, format = "LIWC")
## End(Not run)
Refine keywords
Description
Refine keywords by dropping topics that do not have any occurrence in the documents.
Usage
refine_keywords(keywords, docs)
Arguments
keywords |
a list of keywords |
docs |
a keyATM_docs object, generated by |
Value
a list of refined keywords
Examples
## Not run:
library(quanteda)
data(keyATM_data_bills)
bills_keywords <- keyATM_data_bills$keywords
bills_dfm <- keyATM_data_bills$doc_dfm # quanteda dfm object
keyATM_docs <- keyATM_read(bills_dfm)
bills_keywords$Videogame <- c("metroid", "castlevania", "balatro")
refine_keywords(bills_keywords, keyATM_docs)
## End(Not run)
Save a figure
Description
Save a figure
Usage
save_fig(x, filename, ...)
Arguments
x |
the keyATM_fig object. |
filename |
file name to create on disk. |
... |
other arguments passed on to the ggplot2::ggsave() function. |
See Also
visualize_keywords(), plot_alpha(), plot_modelfit(), plot_pi(), plot_timetrend(), plot_topicprop(), by_strata_DocTopic(), values_fig()
Semantic Coherence: Mimno et al. (2011)
Description
Mimno, David et al. 2011. “Optimizing Semantic Coherence in Topic Models.” In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK.: Association for Computational Linguistics, 262–72. https://aclanthology.org/D11-1024.
Usage
semantic_coherence(x, docs, n = 10)
Arguments
x |
the output from a keyATM model (see |
docs |
texts read via |
n |
integer. The number of terms to visualize. Default is 10. |
Details
Equation 1 of Mimno et al. 2011 adapted to keyATM.
Value
A vector of topic coherence metrics, one per topic.
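To make the metric concrete, here is a minimal base R sketch of the generic coherence score from Equation 1 of Mimno et al. (2011): for a topic's top words, ordered from most to least probable, sum log((D(v_m, v_l) + 1) / D(v_l)) over all pairs with l < m, where D counts the documents containing a word or a word pair. The helper mimno_coherence() and the toy matrix are hypothetical; this is not keyATM's implementation, which adapts the equation in ways not described here.

```r
# Hypothetical helper (not part of keyATM): coherence of one topic's top words.
# `dtm` is a document-term count matrix with words as column names;
# `top_words` is ordered from most to least probable.
mimno_coherence <- function(dtm, top_words) {
  present <- (dtm[, top_words, drop = FALSE] > 0) * 1  # 1 if the word occurs in the document
  co_doc <- crossprod(present)  # co-document counts D(v, v'); the diagonal holds D(v)
  score <- 0
  for (m in seq_along(top_words)[-1]) {
    for (l in seq_len(m - 1)) {
      score <- score + log((co_doc[m, l] + 1) / co_doc[l, l])
    }
  }
  score
}

# Toy corpus: 4 documents, 3 word types.
dtm <- matrix(c(1, 1, 0,
                2, 0, 0,
                0, 1, 1,
                1, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("a", "b", "c")))
mimno_coherence(dtm, c("a", "b", "c"))
```

Scores are at most slightly above zero and usually negative; values closer to zero mean the top words tend to appear in the same documents.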
Show the top documents for each topic
Description
Show the top documents for each topic
Usage
top_docs(x, n = 10)
Arguments
x |
the output from a keyATM model (see |
n |
How many documents to show. Default is 10. |
Value
An n x k table of the top n documents for each topic; each number is a document index.
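The shape of this table can be pictured with a short base R sketch that ranks documents within each column of a document-topic matrix. The simulated theta and the topic names below are assumptions for illustration only; the sketch mirrors the idea of the output, not keyATM's own code.

```r
set.seed(42)
theta <- matrix(rgamma(6 * 3, shape = 1), nrow = 6)  # toy data: 6 documents, 3 topics
theta <- theta / rowSums(theta)                      # normalize rows into proportions
colnames(theta) <- paste0("Topic_", 1:3)

n <- 2
# For each topic (column), the indexes of the n documents with the largest proportion.
top_doc_index <- apply(theta, 2, function(p) order(p, decreasing = TRUE)[seq_len(n)])
top_doc_index
```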
Show the top topics for each document
Description
Show the top topics for each document
Usage
top_topics(x, n = 2)
Arguments
x |
the output from a keyATM model (see |
n |
integer. The number of topics to show. Default is 2. |
Value
An n x k table of the top n topics in each document.
Show the top words for each topic
Description
If show_keyword is TRUE then words in their keyword topics are suffixed with a check mark. Words from another keyword topic are labeled with the name of that category.
Usage
top_words(x, n = 10, measure = c("probability", "lift"), show_keyword = TRUE)
Arguments
x |
the output (see |
n |
integer. The number of terms to visualize. Default is 10. |
measure |
character. The way to sort the terms: |
show_keyword |
logical. If |
Value
An n x k table of the top n words in each topic
Get values used to create a figure
Description
Get values used to create a figure
Usage
values_fig(x)
Arguments
x |
the keyATM_fig object. |
See Also
save_fig(), visualize_keywords(), plot_alpha(), plot_modelfit(), plot_pi(), plot_timetrend(), plot_topicprop(), by_strata_DocTopic()
Visualize keywords
Description
Visualize the proportion of keywords in the documents.
Usage
visualize_keywords(docs, keywords, prune = TRUE, label_size = 3.2)
Arguments
docs |
a keyATM_docs object, generated by |
keywords |
a list of keywords |
prune |
logical. If |
label_size |
the size of keyword labels in the output plot. Default is 3.2. |
Value
keyATM_fig object
Examples
## Not run:
# Prepare a keyATM_docs object
keyATM_docs <- keyATM_read(input)
# Keywords are in a list
keywords <- list(Education = c("education", "child", "student"),
Health = c("public", "health", "program"))
# Visualize keywords
keyATM_viz <- visualize_keywords(keyATM_docs, keywords)
# View a figure
keyATM_viz
# Save a figure
save_fig(keyATM_viz, filename)
## End(Not run)
Weighted LDA main function
Description
Fit weighted LDA models.
Usage
weightedLDA(
docs,
model,
number_of_topics,
model_settings = list(),
priors = list(),
options = list(),
keep = c()
)
Arguments
docs |
texts read via |
model |
Weighted LDA model: |
number_of_topics |
the number of regular topics. |
model_settings |
a list of model specific settings (details are in the online documentation). |
priors |
a list of priors of parameters. |
options |
a list of options (details are in the documentation of |
keep |
a vector of the names of elements you want to keep in output. |
Value
A keyATM_output object containing:
- V
number of terms (number of unique words)
- N
number of documents
- model
the name of the model
- theta
topic proportions for each document (document-topic distribution)
- phi
topic specific word generation probabilities (topic-word distribution)
- topic_counts
number of tokens assigned to each topic
- word_counts
number of times each word type appears
- doc_lens
length of each document in tokens
- vocab
words in the vocabulary (a vector of unique words)
- priors
priors
- options
options
- keywords_raw
NULL for LDA models
- model_fit
perplexity and log-likelihood
- pi
estimated pi for the last iteration (NULL for LDA models)
- values_iter
values stored during iterations
- number_of_topics
number of topics
- kept_values
outputs you specified to store in the keep option
- information
information about the fitting
See Also
https://keyatm.github.io/keyATM/articles/pkgdown_files/Options.html
Examples
## Not run:
library(keyATM)
library(quanteda)
data(keyATM_data_bills)
bills_dfm <- keyATM_data_bills$doc_dfm # quanteda dfm object
keyATM_docs <- keyATM_read(bills_dfm)
# Weighted LDA
out <- weightedLDA(docs = keyATM_docs, model = "base",
number_of_topics = 5)
# Visit our website for full examples: https://keyatm.github.io/keyATM/
## End(Not run)
Checking if a word is in a document
Description
Checking if a word is in a document
Usage
word_in_doc(doc, word)
Arguments
doc |
a vector |
word |
a word to check |
Value
bool