Load the package, as well as others that we’ll use in this vignette.
Multi-Dimensional Analysis (MDA) is a complex statistical procedure developed by Douglas Biber. It is largely used to describe language as it varies by genre, register, and use.
MDA is based on the fundamental linguistic principle that some linguistic variables co-occur (nouns and adjectives, for example) while others inversely co-occur (think nouns and pronouns).
MDA is conducted in the following stages:
The initial steps are carried out using well-established procedures for factor analysis.
First, we inspect the data that comes with the package – counts of features tagged using the pseudobibeR package.
doc_id | f_01_past_tense | f_02_perfect_aspect | f_03_present_tense | f_04_place_adverbials | f_05_time_adverbials |
---|---|---|---|---|---|
BIO.G0.01.1 | 17.678709 | 8.070715 | 52.45965 | 3.074558 | 4.035357 |
BIO.G0.02.1 | 11.477762 | 10.043042 | 61.33429 | 2.152080 | 6.097561 |
BIO.G0.02.2 | 3.875969 | 0.000000 | 62.01550 | 0.000000 | 1.291990 |
BIO.G0.02.3 | 1.700680 | 3.401361 | 64.62585 | 0.000000 | 0.000000 |
BIO.G0.02.4 | 1.531394 | 4.594181 | 70.44410 | 1.531394 | 0.000000 |
BIO.G0.02.5 | 14.844804 | 7.759784 | 47.57085 | 3.373819 | 2.024291 |
[…]
f_62_split_infinitve | f_63_split_auxiliary | f_64_phrasal_coordination | f_65_clausal_coordination | f_66_neg_synthetic | f_67_neg_analytic |
---|---|---|---|---|---|
0.1921599 | 4.611837 | 8.262875 | 5.956956 | 0.7686395 | 4.227517 |
0.7173601 | 5.021521 | 6.097561 | 3.945481 | 1.4347202 | 2.869441 |
0.0000000 | 3.875969 | 6.459948 | 1.291990 | 0.0000000 | 3.875969 |
0.0000000 | 1.700680 | 18.707483 | 0.000000 | 0.0000000 | 0.000000 |
3.0627871 | 6.125574 | 7.656968 | 0.000000 | 0.0000000 | 7.656968 |
0.6747638 | 4.723347 | 3.036437 | 1.686910 | 0.6747638 | 5.398111 |
Note that the first column is the name of a document from the Michigan Corpus of Upper-Level Student Papers (MICUSP), which encodes various kinds of metadata. The beginning of the file name, for example, is a series of upper-case letters that identifies the discipline the paper was written for (BIO = Biology).
Because MDA involves a specific application of factor analysis, the first thing we need to do is to extract that string and convert the column to a factor (or categorical variable).
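As a minimal sketch of that step, assuming only base R and a small hypothetical data frame (the three rows and the `f_99_always_zero` column are invented for illustration, not taken from the package data), the discipline code can be pulled from the start of `doc_id` and the column converted to a factor. The same pass drops numeric columns that contain only zeros.

```r
# Hypothetical toy data shaped like the table above
d <- data.frame(
  doc_id = c("BIO.G0.01.1", "BIO.G0.02.1", "ENG.G0.01.1"),
  f_01_past_tense = c(17.7, 11.5, 3.9),
  f_99_always_zero = c(0, 0, 0),  # hypothetical all-zero feature
  stringsAsFactors = FALSE
)

# Keep only the leading run of upper-case letters (the discipline code)
d$doc_id <- regmatches(d$doc_id, regexpr("^[A-Z]+", d$doc_id))

# Convert the column to a factor (categorical variable)
d$doc_id <- factor(d$doc_id)

# Drop numeric columns that contain only zeros
all_zero <- vapply(d, function(col) is.numeric(col) && all(col == 0), logical(1))
d <- d[, !all_zero, drop = FALSE]

levels(d$doc_id)
names(d)
```

The vignette's own code may use a different approach (for example, a tidyverse pipeline); the result should be equivalent.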
The package comes with a wrapper for the nScree() function in nFactors. However, you can easily use any scree plotting function (or alternative method) for determining the number of factors to be calculated.
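One such alternative, sketched here with base R only, is to plot the eigenvalues of the correlation matrix and count how many exceed 1 (the Kaiser criterion). The data are simulated with two underlying factors purely for illustration; none of the variable names come from the package.

```r
set.seed(42)
n <- 300

# Two hypothetical latent factors, each driving three observed variables
f1 <- rnorm(n)
f2 <- rnorm(n)
x <- data.frame(
  v1 = f1 + rnorm(n, sd = 0.6),
  v2 = f1 + rnorm(n, sd = 0.6),
  v3 = f1 + rnorm(n, sd = 0.6),
  v4 = f2 + rnorm(n, sd = 0.6),
  v5 = f2 + rnorm(n, sd = 0.6),
  v6 = f2 + rnorm(n, sd = 0.6)
)

# Eigenvalues of the correlation matrix, largest first
ev <- eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values

# Scree plot with a reference line at 1
plot(ev, type = "b", xlab = "Factor", ylab = "Eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)

# Number of eigenvalues above 1 (Kaiser criterion)
n_kaiser <- sum(ev > 1)
n_kaiser
```

The Kaiser criterion is only one heuristic; the elbow of the scree plot or parallel analysis may suggest a different number.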
Next, we can create a correlation matrix and plot it using corrplot(). Note that earlier we omitted any columns that contain only zeros; if they existed, the correlation would be undefined.
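# create a correlation matrix from the numeric columns (the first column is the grouping factor)
cor_m <- cor(d[, -1], method = "pearson")
# plot the upper triangle, with variables ordered by hierarchical clustering
corrplot(cor_m, type = "upper", order = "hclust", tl.col = "black",
         tl.srt = 45, diag = FALSE, tl.cex = 0.5)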
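To see why all-zero columns have to be removed before this step, here is a small base R sketch with invented values. A constant column has a standard deviation of zero, so its correlation with anything else is undefined and `cor()` returns `NA`.

```r
# Hypothetical toy data: column z is constant (all zeros)
x <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(2, 1, 4, 3),
  z = c(0, 0, 0, 0)
)

# cor() warns that the standard deviation is zero and returns NA for z
cor_with_zero <- suppressWarnings(cor(x))
is.na(cor_with_zero["a", "z"])

# Keep only columns that actually vary, then correlate again
varies <- vapply(x, function(col) sd(col) > 0, logical(1))
cor_clean <- cor(x[, varies])
cor_clean
```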
mda_loadings() function

As the plot makes clear, there are groupings of features that positively and negatively correlate, making our data a good candidate for MDA. To carry out the MDA procedure, use the mda_loadings() function.
The function requires a data.frame with 1 factor (or categorical variable) and more than 2 numeric variables.
Ideally, you need at least 5 times as many texts as variables (i.e. 5 times as many rows as columns).
Variables that don’t correlate with any other variable are dropped. The default threshold for dropping variables is 0.20, but that can be changed in the function arguments.
Also, MDA uses a promax rotation.
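The two points above, dropping weakly correlated variables and fitting a factor analysis with promax rotation, can be sketched with base R's `factanal()`. This is an illustration on simulated data, not the package's implementation; the variable names and the data are assumptions made for the example.

```r
set.seed(1)
n <- 200

# Two hypothetical latent dimensions plus one unrelated variable
l1 <- rnorm(n)
l2 <- rnorm(n)
x <- data.frame(
  a1 = l1 + rnorm(n, sd = 0.5),
  a2 = l1 + rnorm(n, sd = 0.5),
  a3 = -l1 + rnorm(n, sd = 0.5),
  b1 = l2 + rnorm(n, sd = 0.5),
  b2 = l2 + rnorm(n, sd = 0.5),
  b3 = -l2 + rnorm(n, sd = 0.5),
  noise = rnorm(n)
)

# Drop variables whose strongest correlation with any other variable
# does not exceed the threshold (0.20, as in the text above)
cor_threshold <- 0.20
cor_m <- cor(x)
diag(cor_m) <- 0
keep <- apply(abs(cor_m), 2, max) > cor_threshold
x_kept <- x[, keep, drop = FALSE]

# Factor analysis with promax rotation, extracting 2 factors
fa <- factanal(x_kept, factors = 2, rotation = "promax")
loadings_m <- unclass(loadings(fa))
round(loadings_m, 2)
```

A variable like `noise`, which is unrelated to everything else, will typically fall below the threshold and be removed before the factor analysis is run.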
For this demonstration, we’ll return 2 factors (n_factors = 2):
The mda_loadings() function returns a data frame containing the dimension score for each document/text. The score is calculated by:
The high-positive and high-negative loadings are identified by a threshold value, which is conventionally 0.35 but can be set in the function arguments.
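A minimal sketch of how such a score can be computed follows. It assumes the conventional procedure: standardize each variable, sum the z-scores of variables loading above the threshold, and subtract the sum of z-scores of variables loading below the negative threshold. The toy data and loadings are invented, and the package's exact steps may differ.

```r
# Hypothetical toy data: 4 texts, 3 features
x <- data.frame(
  f_a = c(1, 2, 3, 4),
  f_b = c(4, 3, 2, 1),
  f_c = c(2, 2, 3, 3)
)

# Hypothetical loadings of each feature on one factor
loadings_f1 <- c(f_a = 0.60, f_b = -0.50, f_c = 0.10)
threshold <- 0.35

# Standardize every variable (z-scores)
z <- scale(x)

# Split features into high-positive and high-negative sets
pos <- names(loadings_f1)[loadings_f1 > threshold]
neg <- names(loadings_f1)[loadings_f1 < -threshold]

# Dimension score: sum of positive z-scores minus sum of negative z-scores
score <- rowSums(z[, pos, drop = FALSE]) - rowSums(z[, neg, drop = FALSE])
round(score, 3)
```

Here `f_c` loads below the threshold in absolute value, so it contributes to neither end of the dimension.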
Also included in the data structure are the means by group, which are used for plotting, and the factor loadings. These values are stored as attributes.
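For readers unfamiliar with attributes in R, the sketch below shows the general mechanism on an invented data frame. The attribute name `group_means` is hypothetical and chosen only for this example; `names(attributes(...))` on the actual returned object shows what the package stores.

```r
# Hypothetical dimension scores for four texts in two groups
scores <- data.frame(
  group = factor(c("BIO", "BIO", "ENG", "ENG")),
  Factor1 = c(-6, -8, 7, 9)
)

# Compute the mean score per group
group_means <- aggregate(Factor1 ~ group, data = scores, FUN = mean)

# Attach the means to the data frame as an attribute (hypothetical name)
attr(scores, "group_means") <- group_means

# List the attribute names, then retrieve one by name
names(attributes(scores))
attr(scores, "group_means")
```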
group | Factor1 | Factor2 |
---|---|---|
BIO | -6.642144 | -0.5964011 |
CEE | -8.570111 | -0.1625027 |
CLS | 1.499315 | -0.5898260 |
ECO | -5.869465 | 0.3063591 |
EDU | 6.101605 | -0.7646039 |
ENG | 7.486346 | 1.0436717 |
variable | Factor1 | Factor2 |
---|---|---|
f_06_first_person_pronouns | 0.6903326 | -0.3038467 |
f_65_clausal_coordination | 0.6613812 | -0.0459734 |
f_42_adverbs | 0.6336405 | 0.1817718 |
f_11_indefinite_pronouns | 0.6206053 | 0.0430852 |
f_67_neg_analytic | 0.5839826 | 0.1694203 |
f_19_be_main_verb | 0.5762678 | 0.1850102 |
[…]
variable | Factor1 | Factor2 |
---|---|---|
f_27_past_participle_whiz | -0.3520527 | -0.1698809 |
f_39_prepositions | -0.4536996 | 0.0218836 |
f_14_nominalizations | -0.4607526 | 0.2068994 |
f_40_adj_attr | -0.5914670 | 0.1180111 |
f_16_other_nouns | -0.6835550 | -0.1635661 |
f_44_mean_word_length | -0.7242918 | 0.0971474 |
One conventional way of plotting the results is to place the means along a cline in a kind of stick plot.
The package contains a convenience function for making these kinds of plots. As with all plotting functions, it is easy enough to tweak the code and customize your own plots.
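For a sense of what a custom version might look like, here is a bare-bones stick plot in base R graphics. It uses only the six Factor1 group means shown in the table above; the layout choices are assumptions for illustration and not the package's plotting code.

```r
# Factor1 means for the six groups listed in the table above
means <- c(BIO = -6.642144, CEE = -8.570111, CLS = 1.499315,
           ECO = -5.869465, EDU = 6.101605, ENG = 7.486346)

# Order the groups along the cline, most negative first
ord <- sort(means)

# Draw one point per group on a single vertical axis
plot(x = rep(0, length(ord)), y = ord, xlim = c(-1, 1), pch = 19,
     xaxt = "n", xlab = "", ylab = "Dimension 1 score", bty = "n")
segments(0, min(ord), 0, max(ord))  # the "stick"
abline(h = 0, lty = 3)              # reference line at zero
text(x = 0.08, y = ord, labels = names(ord), adj = 0)
```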
Along this particular dimension, Philosophy is positioned at the extreme positive end, while History and various Engineering specialties are positioned at the negative end.
You can also generate a plot that combines the stick plot with a heatmap of the relevant variables and their factor loadings.
The plot highlights the variables that contribute to the positive end of the cline, such as adverbs, be as a main verb, and first-person pronouns. At the other end are nouns, longer words, attributive adjectives, and prepositions.
Alternatively, you can generate a plot of the kind that is common in reporting PCA, which combines scaled vectors of the relevant variable loadings and boxplots of the dimension scores organized by group.
Finally, the variation explained by each dimension can be measured using linear regression.
We will hack together a table that uses ANOVA to evaluate the dimensions and includes the R2 value.
# Carry out regression of each dimension score on the grouping factor
f1_lm <- lm(Factor1 ~ group, data = m)
f2_lm <- lm(Factor2 ~ group, data = m)
# Convert the ANOVA results into data.frames, which allows for easier name manipulation;
# R-squared is stored as a percentage (NA on the Residuals row)
f1_aov <- data.frame(anova(f1_lm), r.squared = c(summary(f1_lm)$r.squared * 100, NA))
f2_aov <- data.frame(anova(f2_lm), r.squared = c(summary(f2_lm)$r.squared * 100, NA))
# Put everything into one data.frame/table, with the column labels as the first row
anova_results <- data.frame(rbind(
  c("DF", "Sum Sq", "Mean Sq", "F value", "Pr(>F)", "*R*^2^",
    "DF", "Sum Sq", "Mean Sq", "F value", "Pr(>F)", "*R*^2^"),
  cbind(round(f1_aov, 2), round(f2_aov, 2))
))
# Blank out the automatic column and row names, and mark missing cells
colnames(anova_results) <- c("", "", "", "", "", "", "", "", "", "", "", "")
row.names(anova_results)[1] <- ""
anova_results[is.na(anova_results)] <- "--"
And output the results:
anova_results %>%
  knitr::kable("html") %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
  kableExtra::add_header_above(c("", "Dimension 1" = 6, "Dimension 2" = 6))
Term | Dim 1 DF | Dim 1 Sum Sq | Dim 1 Mean Sq | Dim 1 F value | Dim 1 Pr(>F) | Dim 1 R2 (%) | Dim 2 DF | Dim 2 Sum Sq | Dim 2 Mean Sq | Dim 2 F value | Dim 2 Pr(>F) | Dim 2 R2 (%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
group | 16 | 40430.14 | 2526.88 | 23.76 | 0 | 31.92 | 16 | 615.63 | 38.48 | 14.22 | 0 | 21.91 |
Residuals | 811 | 86249.77 | 106.35 | – | – | – | 811 | 2194.8 | 2.71 | – | – | – |
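The R2 values in the table can be checked by hand: for a one-way model, R2 is the group sum of squares divided by the total sum of squares (group plus residuals). The short sketch below redoes that arithmetic using the sums of squares reported in the table.

```r
# Sums of squares copied from the table above
ss_group <- c(dim1 = 40430.14, dim2 = 615.63)
ss_resid <- c(dim1 = 86249.77, dim2 = 2194.80)

# R-squared as a percentage: explained / total
r2_pct <- round(100 * ss_group / (ss_group + ss_resid), 2)
r2_pct
# dim1: 31.92, dim2: 21.91
```

In other words, discipline accounts for roughly a third of the variation along Dimension 1 and about a fifth along Dimension 2.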