
Decoding the past: How machine learning enhances historical analysis

Machine learning techniques have the potential to advance knowledge, but researchers must exercise caution in decision-making and interpretation.

We are surrounded by text, from the cereal box at breakfast to WhatsApp messages, emails, news, ads, social media posts, and more. Assuming a significant fraction of this material is preserved, interpreting its volume and variety will challenge future scholars. Recent advances in artificial intelligence and machine learning have given us new tools to analyze and process large collections of texts. Among the Natural Language Processing (NLP) tools developed in recent years, we introduce topic modeling and show how it can be especially helpful for academic historians. With topic modeling, historians can extract and interpret relationships between text, images, and video, and conduct research across larger text collections and a wider range of topics.

Topic modeling

Topic modeling is a powerful text analysis technique that enables researchers to identify patterns or clusters of word co-occurrences within a collection of documents. Topic modeling follows a “bag of words” approach, which identifies topics based on word co-occurrences and frequencies without reference to word order or context. Several algorithms can be used to perform topic modeling, including Latent Dirichlet Allocation (LDA), correlated topic models, and hierarchical topic models. In general, topic modeling is particularly useful for researchers seeking to interpret large-scale text corpora.
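To make the “bag of words” idea concrete, here is a minimal sketch in Python. It is not a topic model: it only builds the word counts and document-level co-occurrence counts that algorithms such as LDA take as their starting point. The toy corpus, the stopword list, and the function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy corpus, invented for illustration (not taken from any real archive).
documents = [
    "the railway company issued new shares",
    "the company built a new railway line",
    "farmers protested the grain tariff",
    "the grain tariff hurt farmers and merchants",
]

# A tiny, hypothetical stopword list; real projects use much longer ones.
STOPWORDS = {"the", "a", "and"}


def bag_of_words(text):
    """Return word counts for one document, ignoring word order."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return Counter(tokens)


def cooccurrence_counts(docs):
    """Count, for each word pair, how many documents contain both words."""
    pairs = Counter()
    for doc in docs:
        vocab = sorted(bag_of_words(doc))  # unique words in this document
        pairs.update(combinations(vocab, 2))
    return pairs


bags = [bag_of_words(d) for d in documents]
pairs = cooccurrence_counts(documents)

print(bags[0])
print(pairs[("company", "railway")], pairs[("grain", "tariff")])
```

Word pairs that keep appearing in the same documents (here, “company” with “railway” and “grain” with “tariff”) are the kind of regularity a topic model groups into topics.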

Smaller collections can continue to be analyzed using traditional research methods, but once an archival corpus exceeds the human-readable scale (probably several thousand pages of text), researchers will need some form of computational support to process texts efficiently while preserving the statistical relationships between words. An important feature of topic modeling is that it allows researchers to “discover” topics from texts rather than pre-specify or assume them, in principle addressing one source of researcher bias. Although topic modeling cannot replace the close reading of specific sources, it is useful for classification, novelty detection, summarization, and similarity analysis tasks. Moretti (2013) described this process as “distant reading,” a type of high-level interpretation that can complement other forms of scholarly interpretation.

This “complementarity” operates in several important ways for the academic researcher. The availability of large-scale text collections has increased noise and information overload, multiplying the cognitive burden on researchers. Algorithms are efficient at finding patterns among texts while ignoring the context, whereas researchers excel at understanding the context and the phenomena they study. By allowing the topic model to lighten the cognitive burden, historians can increase the size, number, and variety of textual inputs they use. Allowing an algorithm to identify latent topics might lead to surprising findings and refine research questions while simultaneously increasing the reliability of studies and conclusions. Moreover, topic modeling allows for greater transparency and replicability, making it easier for multiple, iterative interpretations to emerge.

Structural topic modeling

Structural topic modeling (STM) is a variant of topic modeling that incorporates document-level metadata into the analysis. Historians and other scholars who are interested in understanding how topics or concepts evolve over time may find structural topic modeling more useful than traditional topic modeling. STM has two important features: first, the model allows documents to be assigned to multiple topics, and second, the STM algorithm uses structural metadata (author, year, ideological affiliation, etc.) when identifying topics, enabling the researcher to understand how topic relevance changes with those attributes.
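The two features can be illustrated with a short Python sketch. It does not implement STM, which uses the metadata while estimating the topics. It only shows the kind of question the output answers: given per-document topic proportions, how does each topic's average prevalence vary by a metadata field? The documents, topic labels, proportions, and the function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical per-document topic proportions, as a fitted model might report them.
# Each document mixes several topics (shares sum to 1) and carries metadata.
docs = [
    {"year": 1900, "author": "A", "topics": {"railways": 0.7, "tariffs": 0.3}},
    {"year": 1900, "author": "B", "topics": {"railways": 0.5, "tariffs": 0.5}},
    {"year": 1910, "author": "A", "topics": {"railways": 0.2, "tariffs": 0.8}},
    {"year": 1910, "author": "B", "topics": {"railways": 0.4, "tariffs": 0.6}},
]


def mean_prevalence(documents, covariate):
    """Average each topic's share within every value of a metadata field."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for doc in documents:
        key = doc[covariate]
        counts[key] += 1
        for topic, share in doc["topics"].items():
            sums[key][topic] += share
    return {
        key: {topic: total / counts[key] for topic, total in topic_sums.items()}
        for key, topic_sums in sums.items()
    }


by_year = mean_prevalence(docs, "year")
by_author = mean_prevalence(docs, "author")

for year, shares in sorted(by_year.items()):
    print(year, {topic: round(share, 2) for topic, share in shares.items()})
```

In this invented example the “railways” topic becomes less prevalent between 1900 and 1910 while “tariffs” becomes more prevalent, which is the kind of shift in topic relevance that STM is designed to surface.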

Topic modeling (LDA) and STM share a common assumption: each topic is defined by a fixed set of words. The topics (and the words in each topic) are inferred from the whole set of documents and are the same for all the documents in the corpus. Hence, these two algorithms cannot show how meaning changes over time.

Figure 1. Sample structural topic modeling output, illustrating how topics’ prevalence varied based on a set of multiple covariates. Credit: Taylor and Francis.

Dynamic topic modeling

Dynamic topic models (DTM) are another type of topic model that helps researchers explore how the content of topics changes over time. Unlike STM, which only shows changes in topic relevance over time, DTM estimates a set of topics for a chosen period and then re-estimates those same topics in each subsequent period, allowing the words within each topic, and their prevalence, to change (the number of topics, however, is fixed over time). As the developers of DTM explain: “Under this model, articles are grouped by year, and each year’s articles arise from a set of topics that have evolved from the last year’s topics.” Like STM, DTM can capture change over time, but DTM tracks change within each topic, allowing researchers to see which words and concepts dominate a given topic in different periods.
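The idea that each period's topics “have evolved from” the previous period's can be sketched as a simulation. The Python sketch below assumes one common formulation, in which a topic's word scores drift through a Gaussian random walk and are converted into word probabilities at each period. It simulates this assumption for a single topic and does not fit a DTM to data. The vocabulary, starting scores, drift size, and function names are invented for illustration.

```python
import math
import random


def softmax(values):
    """Map unconstrained scores to a probability distribution over words."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def evolve_topic(initial_scores, n_periods, drift_sd, seed=0):
    """Simulate one topic's word distribution over periods (Gaussian random walk)."""
    rng = random.Random(seed)
    scores = list(initial_scores)
    history = [softmax(scores)]
    for _ in range(n_periods - 1):
        # Each period's scores are last period's scores plus small random drift.
        scores = [s + rng.gauss(0.0, drift_sd) for s in scores]
        history.append(softmax(scores))
    return history


# Hypothetical vocabulary and starting scores for a single "energy" topic.
vocabulary = ["steam", "coal", "electric", "battery"]
history = evolve_topic([2.0, 1.5, 0.5, 0.0], n_periods=5, drift_sd=0.5)

for period, dist in enumerate(history):
    top_word = vocabulary[dist.index(max(dist))]
    print(period, top_word, [round(p, 2) for p in dist])
```

The topic itself persists across all five periods, but the weight of each word within it is free to shift, which is what lets a DTM show different words dominating the same topic at different times.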

The researcher’s role

So far, we have shown how topic models can be useful for analyzing large text collections. However, text collections do not arrive on the researcher’s desk ex machina; researchers make critical decisions affecting inputs and outputs. All topic model outputs are sensitive to the inputs provided (type and variety of texts, photos, or videos), the type of algorithm selected (LDA, STM, DTM, etc.), and other model parameters specified by the researcher (e.g., the number of topics, metadata, base year). Throughout the research process, the role and judgment of the researcher are critical.

In other words, researchers must make multiple decisions that influence the outputs and conclusions drawn from topic models. These decisions include selecting the documents in the corpus, cleaning the data, choosing the appropriate algorithm, selecting model parameters, determining the correct number of topics, labeling the topics, validating the results, and interpreting the results. Sound interpretation depends upon the combination of a deep understanding of the collections, the phenomenon at hand, the context being studied, and the choice of modeling methodology used to infer meaning from the output. In this respect, the outputs of topic models should be considered exploratory, and researchers should be cautious about overinterpreting the results. To ensure the validity of the results, researchers should triangulate with multiple topic models, validate results using external data or expert opinion, and conduct robustness checks.
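One simple robustness check is to compare the topics produced by two model runs, for example with different random seeds or different numbers of topics. The Python sketch below matches topics across runs by the overlap of their top words, measured with Jaccard similarity. It is one possible check among many, and the top-word lists and function names are invented for illustration.

```python
def jaccard(words_a, words_b):
    """Share of words two topics have in common (0 = none, 1 = identical sets)."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_topics(run_a, run_b):
    """For each topic in run_a, find the most similar topic in run_b."""
    matches = {}
    for name_a, words_a in run_a.items():
        best = max(run_b, key=lambda name_b: jaccard(words_a, run_b[name_b]))
        matches[name_a] = (best, jaccard(words_a, run_b[best]))
    return matches


# Hypothetical top words from two runs; topic numbering is arbitrary across runs.
run_a = {
    "topic_0": ["railway", "company", "shares", "line"],
    "topic_1": ["grain", "tariff", "farmers", "merchants"],
}
run_b = {
    "topic_0": ["tariff", "grain", "farmers", "protest"],
    "topic_1": ["railway", "line", "company", "engine"],
}

matches = match_topics(run_a, run_b)
for name_a, (name_b, score) in matches.items():
    print(name_a, "->", name_b, round(score, 2))
```

Topics that find a close counterpart in the other run are more likely to reflect structure in the corpus than an accident of one particular model specification. Topics with no close counterpart deserve closer reading before any interpretation is built on them.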

In conclusion, machine learning techniques have the potential to help us advance knowledge. Because these tools are now easily available and researchers can implement them in their own fields of expertise, researchers must exercise caution in making critical decisions and interpreting the output of topic models. The researcher’s role in the process remains essential for sensemaking and theory-building.


Journal reference

Villamor Martin, M., Kirsch, D. A., & Prieto-Nañez, F. (2023). The promise of machine-learning-driven text analysis techniques for historical research: topic modeling and word embedding. Management & Organizational History, 1-16. https://doi.org/10.1080/17449359.2023.2181184

Marta Villamor Martin is a Ph.D. student in Strategic Management & Entrepreneurship at the Management and Organizations Department of the University of Maryland's Robert H. Smith School of Business. Her research interests encompass industry evolution, entrepreneurship, and innovation.

David A. Kirsch is an Associate Professor at the Robert H. Smith School of Business and the College of Information Studies (i-School) at the University of Maryland, College Park. His research centres around the intersection of innovation and entrepreneurship challenges, technological and business failures, and the emergence and evolution of industries.

Fabian Prieto-Ñañez is an assistant professor in the Department of Science, Technology, and Society at Virginia Tech University. His research centres on the histories of technologies in the Global South, examining media devices and infrastructures. Additionally, his work explores topics such as piracy, informality, and the illicit use of early satellite dishes in the Caribbean.