Introduction to the mda.biber R package

David Brown

Load the mda.biber package

Load the package, as well as others that we’ll use in this vignette.

library(mda.biber)
library(corrplot)
library(tidyverse)
library(kableExtra)

Multi-Dimensional Analysis (MDA)

Multi-Dimensional Analysis (MDA) is a complex statistical procedure developed by Douglas Biber. It is largely used to describe language as it varies by genre, register, and use.

MDA is based on the fundamental linguistic principle that some linguistic variables co-occur (nouns and adjectives, for example) while others inversely co-occur (think nouns and pronouns).
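As a toy illustration of this principle, the sketch below simulates three feature rates that depend on a shared latent property of each text. The variable names, rates, and effect sizes are all hypothetical and are not drawn from any corpus.

```r
# Simulated per-text feature rates; every name and number here is hypothetical.
set.seed(1)
n_texts <- 200
informational <- rnorm(n_texts)  # latent "informational density" of each text

nouns      <- 250 + 20 * informational + rnorm(n_texts, sd = 5)
adjectives <-  60 + 10 * informational + rnorm(n_texts, sd = 5)
pronouns   <-  80 - 15 * informational + rnorm(n_texts, sd = 5)

cor_nouns_adj  <- cor(nouns, adjectives)  # positive: the features co-occur
cor_nouns_pron <- cor(nouns, pronouns)    # negative: the features trade off

print(round(c(nouns_adjectives = cor_nouns_adj,
              nouns_pronouns   = cor_nouns_pron), 2))
```

Factor analysis generalizes this idea from pairs of features to whole groups of features that rise and fall together.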

MDA is conducted in the following stages:

  1. Identification of relevant variables
  2. Extraction of factors from variables
  3. Functional interpretation of factors as dimensions
  4. Placement of categories on the dimensions

The initial steps are carried out using well-established procedures for factor analysis.

Inspect the data

First, we inspect the data that comes with the package – counts of features tagged using the pseudobibeR package.

doc_id f_01_past_tense f_02_perfect_aspect f_03_present_tense f_04_place_adverbials f_05_time_adverbials
BIO.G0.01.1 17.678709 8.070715 52.45965 3.074558 4.035357
BIO.G0.02.1 11.477762 10.043042 61.33429 2.152080 6.097561
BIO.G0.02.2 3.875969 0.000000 62.01550 0.000000 1.291990
BIO.G0.02.3 1.700680 3.401361 64.62585 0.000000 0.000000
BIO.G0.02.4 1.531394 4.594181 70.44410 1.531394 0.000000
BIO.G0.02.5 14.844804 7.759784 47.57085 3.373819 2.024291

[…]

f_62_split_infinitve f_63_split_auxiliary f_64_phrasal_coordination f_65_clausal_coordination f_66_neg_synthetic f_67_neg_analytic
0.1921599 4.611837 8.262875 5.956956 0.7686395 4.227517
0.7173601 5.021521 6.097561 3.945481 1.4347202 2.869441
0.0000000 3.875969 6.459948 1.291990 0.0000000 3.875969
0.0000000 1.700680 18.707483 0.000000 0.0000000 0.000000
3.0627871 6.125574 7.656968 0.000000 0.0000000 7.656968
0.6747638 4.723347 3.036437 1.686910 0.6747638 5.398111

Note that the first column is the name of a document from the Michigan Corpus of Upper-Level Student Papers (MICUSP), which encodes various kinds of metadata. The beginning of each name, for example, is a series of upper-case letters that identifies the discipline the paper was written for (BIO = Biology).

Because MDA involves a specific application of factor analysis, the first thing we need to do is to extract that string and convert the column to a factor (or categorical variable).

d <- micusp_biber |>
  mutate(doc_id = str_extract(doc_id, "^[A-Z]+")) |>
  mutate(doc_id = as.factor(doc_id)) |>
  select(where(~ any(. != 0))) # removes any columns containing all zeros as required for corrplot
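To see what the regular expression does in isolation, here is a small base R sketch on a few IDs written in the MICUSP style. The third ID is made up for illustration, and regmatches() is used as a stand-in for str_extract() so that the example needs no extra packages.

```r
# Sample IDs in the MICUSP style; the ENG one is made up for illustration.
ids <- c("BIO.G0.01.1", "BIO.G0.02.1", "ENG.G0.01.1")

# "^[A-Z]+" matches the run of capital letters at the start of each ID.
# Unlike str_extract(), regmatches() silently drops non-matching elements
# instead of returning NA, so this stand-in assumes every ID matches.
discipline <- regmatches(ids, regexpr("^[A-Z]+", ids))

# Convert to a factor so that each discipline becomes a level.
discipline <- as.factor(discipline)
print(levels(discipline))
```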

Determining the number of factors

The package comes with a wrapper for the nScree() function in nFactors. However, you can easily use any scree plotting function (or alternative method) for determining the number of factors to be calculated.

screeplot_mda(d)
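For intuition about what a scree-based choice rests on, the following sketch computes the eigenvalues of a correlation matrix for simulated data with two latent factors. It applies the simple eigenvalue-greater-than-one (Kaiser) heuristic, which is a simpler substitute for the methods that nScree() provides, and all variable names are hypothetical.

```r
# Simulated data: six variables driven by two independent latent factors.
set.seed(42)
n <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
x <- cbind(
  v1 =  f1 + rnorm(n, sd = 0.5),
  v2 =  f1 + rnorm(n, sd = 0.5),
  v3 = -f1 + rnorm(n, sd = 0.5),
  v4 =  f2 + rnorm(n, sd = 0.5),
  v5 =  f2 + rnorm(n, sd = 0.5),
  v6 = -f2 + rnorm(n, sd = 0.5)
)

# A scree plot displays these eigenvalues in decreasing order.
eig <- eigen(cor(x))$values
print(round(eig, 2))

# Kaiser heuristic: retain factors whose eigenvalue exceeds 1.
n_kaiser <- sum(eig > 1)
print(n_kaiser)
```

The eigenvalues of a correlation matrix sum to the number of variables, so a value above 1 marks a factor that accounts for more variance than any single standardized variable.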

Inspecting a correlation matrix

Next, we can create a correlation matrix and plot it using corrplot(). Note that earlier we omitted any columns that contain only zeros; if they existed, the correlation would be undefined.

# create a correlation matrix
cor_m <- cor(d[,-1], method = "pearson")

# plot the matrix
corrplot(cor_m, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45, diag = FALSE, tl.cex = 0.5)
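To see why all-zero columns have to be removed first, consider this small sketch with made-up numbers. A column with zero variance has no defined Pearson correlation, so cor() returns NA for it.

```r
# Made-up feature rates; the third column is all zeros.
x <- data.frame(
  feature_a = c(1.2, 3.4, 2.2, 5.1),
  feature_b = c(0.5, 1.9, 1.1, 2.8),
  all_zero  = c(0, 0, 0, 0)
)

# cor() warns that the standard deviation is zero and yields NA entries.
cor_with_zero <- suppressWarnings(cor(x))
print(cor_with_zero)

# Same idea as select(where(~ any(. != 0))): keep columns with a non-zero value.
keep <- vapply(x, function(col) any(col != 0), logical(1))
cor_clean <- cor(x[, keep])
print(cor_clean)
```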

The mda_loadings() function

As the plot makes clear, there are groupings of features that positively and negatively correlate, making our data a good candidate for MDA. To carry out the MDA procedure, use the mda_loadings() function.

The function requires a data.frame with 1 factor (or categorical variable) and more than 2 numeric variables.

Ideally, you need at least 5 times as many texts as variables (i.e. at least 5 times as many rows as columns).

Variables that don’t correlate with any other variable are dropped. The default threshold for dropping variables is 0.20, but that can be changed in the function arguments.

Also, MDA uses a promax rotation.
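The two choices just described can be sketched on simulated data. This is not the package's implementation: it assumes the threshold applies to each variable's largest absolute correlation with any other variable, and it uses stats::factanal() as one standard way to fit a promax-rotated factor analysis. All variable names are hypothetical.

```r
# Simulated data: six variables driven by two latent factors, plus pure noise.
set.seed(123)
n <- 2000
f1 <- rnorm(n)
f2 <- rnorm(n)
x <- data.frame(
  v1 =  f1 + rnorm(n, sd = 0.5),
  v2 =  f1 + rnorm(n, sd = 0.5),
  v3 = -f1 + rnorm(n, sd = 0.5),
  v4 =  f2 + rnorm(n, sd = 0.5),
  v5 =  f2 + rnorm(n, sd = 0.5),
  v6 = -f2 + rnorm(n, sd = 0.5),
  noise = rnorm(n)
)

# Step 1 (assumed rule): drop variables whose strongest absolute correlation
# with any other variable falls below the threshold.
threshold <- 0.20
cor_m <- cor(x)
diag(cor_m) <- 0  # ignore each variable's correlation with itself
max_abs_cor <- apply(abs(cor_m), 1, max)
kept <- names(max_abs_cor)[max_abs_cor >= threshold]
x_trim <- x[, kept]
print(kept)

# Step 2: maximum-likelihood factor analysis with an oblique promax rotation.
fa <- factanal(x_trim, factors = 2, rotation = "promax")
fa_loadings <- unclass(fa$loadings)
print(round(fa_loadings, 2))
```

Promax is an oblique rotation, so the resulting factors are allowed to correlate with one another.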

For this demonstration, we’ll return 2 factors (n_factors = 2):

m <- mda_loadings(d, n_factors = 2)

MDA data structure

The mda_loadings() function returns a data frame containing the dimension score for each document/text. The score is calculated by:

  1. Standardizing data by converting to z-scores
  2. For each text, summing all of the high-positive variables and subtracting all of the high-negative variables.

The high-positive and high-negative variables are specified by a threshold value applied to the factor loadings. The threshold is conventionally 0.35, but it can be set in the function arguments.
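The two scoring steps can be sketched for a single dimension as follows. The feature rates and loadings are hypothetical, and the code illustrates the procedure described above rather than reproducing the package's internals.

```r
# Hypothetical feature rates for five texts (rows) and three features (columns).
counts <- data.frame(
  pronouns = c(90, 70, 40, 20, 30),
  adverbs  = c(60, 55, 35, 25, 25),
  nouns    = c(200, 230, 300, 340, 330)
)

# Hypothetical loadings of each feature on one factor.
loadings <- c(pronouns = 0.69, adverbs = 0.63, nouns = -0.68)
threshold <- 0.35

# Step 1: standardize each feature to z-scores.
z <- scale(counts)

# Step 2: sum the high-positive features and subtract the high-negative ones.
pos <- names(loadings)[loadings >  threshold]
neg <- names(loadings)[loadings < -threshold]
dim_score <- rowSums(z[, pos, drop = FALSE]) - rowSums(z[, neg, drop = FALSE])

print(round(dim_score, 2))
```

Features whose loadings fall between the negative and positive thresholds do not contribute to the score at all.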

Also included in the data structure are the means-by-group, which are used for plotting, and the factor loadings. Both are stored as attributes.
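As a reminder of how attributes travel with an R object, here is a minimal sketch that attaches group means to a small data frame. The data are hypothetical; only the attribute name group_means mirrors the one used by the package.

```r
# Hypothetical dimension scores for three texts in two groups.
scores <- data.frame(
  group   = c("A", "A", "B"),
  Factor1 = c(1.5, 0.5, -2.0)
)

# Attach the per-group means as an attribute of the data frame.
attr(scores, "group_means") <- aggregate(Factor1 ~ group, data = scores, FUN = mean)

# Either accessor retrieves the stored table.
group_means <- attributes(scores)$group_means
print(group_means)
print(identical(group_means, attr(scores, "group_means")))
```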

Means

attributes(m)$group_means
group Factor1 Factor2
BIO -6.642144 -0.5964011
CEE -8.570111 -0.1625027
CLS 1.499315 -0.5898260
ECO -5.869465 0.3063591
EDU 6.101605 -0.7646039
ENG 7.486346 1.0436717

Factor loadings

attributes(m)$loadings %>% arrange(-Factor1)
Factor1 Factor2
f_06_first_person_pronouns 0.6903326 -0.3038467
f_65_clausal_coordination 0.6613812 -0.0459734
f_42_adverbs 0.6336405 0.1817718
f_11_indefinite_pronouns 0.6206053 0.0430852
f_67_neg_analytic 0.5839826 0.1694203
f_19_be_main_verb 0.5762678 0.1850102

[…]

Factor1 Factor2
f_27_past_participle_whiz -0.3520527 -0.1698809
f_39_prepositions -0.4536996 0.0218836
f_14_nominalizations -0.4607526 0.2068994
f_40_adj_attr -0.5914670 0.1180111
f_16_other_nouns -0.6835550 -0.1635661
f_44_mean_word_length -0.7242918 0.0971474

Plotting the results

One conventional way of plotting the results is to place the means along a cline in a kind of stick plot.

The package contains a convenience function for making these kinds of plots. As with all plotting functions, it is easy enough to tweak the code and customize your own plots.

stickplot_mda(m, n_factor = 1)

Along this particular dimension, Philosophy is positioned at the extreme positive end, while History and various Engineering specialties are positioned at the negative end.

You can also generate a plot that combines the stick plot with a heatmap of the relevant variables and their factor loadings.

heatmap_mda(m, n_factor = 1)

The plot highlights the variables that contribute to the positive end of the cline, such as adverbs, be as a main verb, and first-person pronouns. At the other end are nouns, longer words, attributive adjectives, and prepositions.

Alternatively, you can generate a plot of the kind that is common in reporting PCA, which combines scaled vectors of the relevant variable loadings and boxplots of the dimension scores organized by group.

boxplot_mda(m, n_factor = 1)

Evaluating the dimensions

Finally, the variation explained by each dimension can be measured using linear regression.

We will hack together a table that uses ANOVA to evaluate the dimensions and includes the R² value.


# Carry out regression
f1_lm <- lm(Factor1 ~ group, data = m)
f2_lm <- lm(Factor2 ~ group, data = m)

# Convert the ANOVA results into data.frames, which allows for easier name manipulation
f1_aov <- data.frame(anova(f1_lm), r.squared = c(summary(f1_lm)$r.squared*100, NA))
f2_aov <- data.frame(anova(f2_lm), r.squared = c(summary(f2_lm)$r.squared*100, NA))

# Putting all into one data.frame/table
anova_results <- data.frame(rbind(c("DF", "Sum Sq", "Mean Sq", "F value", "Pr(>F)", "*R*^2^",
                                    "DF", "Sum Sq", "Mean Sq", "F value", "Pr(>F)", "*R*^2^"),
                                   cbind(round(f1_aov, 2), round(f2_aov, 2))))
colnames(anova_results) <- c("", "", "", "", "", "", "", "", "", "", "", "")
row.names(anova_results)[1] <- ""
anova_results[is.na(anova_results)] <- "--"

And output the results:

anova_results %>% knitr::kable("html") %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
  kableExtra::add_header_above(c("", "Dimension 1" = 6, "Dimension 2" = 6))
           Dimension 1                                     Dimension 2
           DF   Sum Sq    Mean Sq  F value  Pr(>F)  R2     DF   Sum Sq   Mean Sq  F value  Pr(>F)  R2
group      16   40430.14  2526.88  23.76    0       31.92  16   615.63   38.48    14.22    0       21.91
Residuals  811  86249.77  106.35                           811  2194.8   2.71
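The R2 values in the table follow directly from the sums of squares: for a one-way model, R2 is the share of the total sum of squares attributed to the grouping variable. The short check below recomputes both values from the figures reported above.

```r
# Sums of squares copied from the ANOVA table above.
ss <- data.frame(
  dimension = c(1, 2),
  group     = c(40430.14, 615.63),
  residuals = c(86249.77, 2194.8)
)

# R2 (as a percentage) = between-group SS / total SS * 100
r_squared <- 100 * ss$group / (ss$group + ss$residuals)
print(round(r_squared, 2))
# prints: 31.92 21.91
```

In other words, discipline accounts for roughly 32% of the variation in Dimension 1 scores and roughly 22% of the variation in Dimension 2 scores.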

Bibliography

Biber, Douglas. 1991. Variation Across Speech and Writing. Cambridge University Press. https://books.google.com/books?id=CVTPaSSYEroC&printsec=frontcover#v=onepage&q&f=false.
Biber, Douglas, Susan Conrad, Randi Reppen, Pat Byrd, and Marie Helt. 2002. “Speaking and Writing in the University: A Multidimensional Comparison.” TESOL Quarterly 36 (1): 9–48. https://onlinelibrary.wiley.com/doi/abs/10.2307/3588359.
Biber, Douglas. 1992. “The Multi-Dimensional Approach to Linguistic Analyses of Genre Variation: An Overview of Methodology and Findings.” Computers and the Humanities 26 (5): 331–45. https://link.springer.com/article/10.1007/BF00136979.