What are word vector models and how can they be used?
In 2015, Benjamin Schmidt explained that word embedding models “merit attention because they allow a much richer exploration of the vocabularies or discursive spaces implied by massive collections of texts than most other reductions out there” (Schmidt 2015). Schmidt’s model, using the Chronicling America corpus, shows binaries within the corpus (for example, “sweet/salty” and “vegetable/meaty”) that are regions in the model, and not defined by single words. These classifications interestingly “exist as a spectrum rather than a class” and not only exist in the model but also “in the real world” (Schmidt 2015). It is with this concept that this project aims to investigate the understanding of gender within the Foreign Relations of the United States (FRUS) corpus.
Once the code has trained a word embedding model, the result is a vector space in which each term in the corpus is plotted as a point. The distance between two points represents the relationship between the corresponding words: terms used in similar contexts sit closer together. Each term can therefore be scored by how similar its usage is to that of any other term in the corpus. Reduced to a ranked list of words, a word embedding model depicts the relations between terms in a corpus. In an introduction to vector space models, Schmidt notes that word embedding models are centered around two objectives: they attempt to first, “reflect similarities in usage between words in distances in space” and second, “reflect similar relationships between words with similar paths in space” (Schmidt 2015).
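To make the idea of a similarity score more concrete, the following is a minimal sketch in Python, not the code used for this project. It assumes cosine similarity as the measure of closeness, which is a common choice for word embedding models. The three-dimensional vectors, the vocabulary, and the function names `cosine_similarity` and `closest_to` are all invented for illustration and do not come from the FRUS corpus or from any trained model.

```python
import math

# Toy vectors, invented purely for illustration. A trained model would
# learn vectors with many more dimensions from the corpus itself.
toy_vectors = {
    "ambassador": [0.9, 0.1, 0.2],
    "diplomat":   [0.8, 0.2, 0.3],
    "treaty":     [0.1, 0.9, 0.1],
    "agreement":  [0.2, 0.8, 0.2],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors.

    Values near 1 mean the vectors point in almost the same direction,
    which is read here as similar usage; values near 0 mean little
    similarity.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_to(term, vectors, n=3):
    """Rank every other term by its similarity to `term`, highest first."""
    target = vectors[term]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in vectors.items()
        if other != term
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


# Print the ranked list of neighbors for one query term.
for word, score in closest_to("ambassador", toy_vectors):
    print(f"{word}: {score:.3f}")
```

With these made-up vectors, “diplomat” ranks as the closest term to “ambassador”, ahead of “agreement” and “treaty”. The ranked list is the kind of “linear order of words” described above; with a real model, the ranking depends entirely on how the corpus writers used each term.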
This project relies on word embedding models as a method to explore how policymakers use terms of gender in policy documents. As we will see in the analysis, the words most similar to gendered terms in the corpus are often based on relationships and similar groupings. These terms follow similar paths in the created space and appear to be used in similar contexts when conducting a close reading of the corpus. Therefore, word embedding models have the capability to tell scholars how the corpus writers understand social structures and cultural constructions within the text.
Data Feminism and the Creation of FRUS
While recognizing that word embedding models reflect the writers’ social and cultural conceptions, I was reminded of Catherine D’Ignazio and Lauren Klein’s lessons on “data feminism.” Data feminism is a framework for digital methods based on intersectional feminism. In their work, D’Ignazio and Klein call for a feminist approach that recognizes the “context” in which the data was produced. This recognition of context allows researchers “to better understand any functional limitations of the data and any associated ethical obligations, as well as how the power and privilege that contributed to their making may be obscuring the truth” (D’Ignazio and Klein 2020, 152-153). Through this lens, it is possible to see that the raw “data” used in this analysis, whether textual or otherwise, is not bias-free. There are human selections at every level of this project. The State Department officials in the mid-twentieth century created documents with their own biases; the current producers of the FRUS series make selections based on what is deemed “important”; and the researcher (myself) has crafted an analysis with certain parameters and methodologies that have produced the given results.
In her discussion of the constructedness of digital projects, Johanna Drucker echoes these principles, and also calls for the recognition of interpretations placed at every level of research. Drucker claims that “[h]umanistic inquiry acknowledges the situated, partial, and constitutive character of knowledge production,” and also recognizes “that knowledge is constructed, taken, not simply given as a natural representation of pre-existing fact.” Drucker stresses the importance of two principles that humanists should be mindful of: “first, that the humanities are committed to the concept of knowledge as interpretation, and, second, that the apprehension of the phenomena of the physical, social, cultural world is through constructed and constitutive acts, not mechanistic or naturalistic realist representations of pre-existing or self-evident information” (Drucker 2011).
The differing results that each corpus produced are striking, and they suggest that State Department officials held markedly different conceptions of gender, specifically the female gender, in each region. Even so, it is important to reconsider the production of the corpus and of the project itself. The corpus used in this project is based on documents from the FRUS series produced by the Office of the Historian. In May 1990, Congress passed the Pell Amendment with the intention to produce a “thorough, accurate, and reliable documentary record of major United States foreign policy decisions and significant United States diplomatic activity”.[1] The language for the amendment was adapted from an earlier State Department mandate that required “the editing of the record . . . to be guided by the principles of historical objectivity and accuracy.”[2] These mandates shape the production of the FRUS series.
On 14 September 2020, the Historical Advisory Committee (HAC) to the Department of State held a public forum on the production of the FRUS series. Dr. Elizabeth Charles of the Office of the Historian, who spoke on the process of producing a volume, explained that after completing about a year of research, the officials involved select about 3,000-4,000 documents to review. For the final edition, about 300-450 documents are compiled to total about 1,400 pages based on the “important” events of the period. Dr. Charles estimated that for every document that makes it into the volume to show “what best conveys how these decision makers made these decisions”, there are twenty that are left out. The Office of the Historian aims to create an accurate representation of the foreign policy for that period and region, but this requires curating a sample of the correspondence and official policy rather than providing its entirety. The selection is further constrained by the declassification procedure, with its own limitations and stylization. In addition, the “Editing Division” copy edits the volume. In order to stay within the publication timeline and typical page limitations (around 1,400), the editors may also choose to reduce the number of documents included. Overall, each volume goes through a precise editing process so that it is concise and represents the topics chosen by the department.
Let us consider the corpus further. The diplomats and planners crafting these written diplomatic documents were individuals carrying with them a set of biases and socially produced knowledge about the topics, categories, and power dynamics discussed within the corpus. Diplomats held assumptions about gender, about the relationship between themselves and the actors under discussion, and about the culture they were discussing. Therefore, the text in question is not a perfect representation of what was happening on the ground, but is instead an interpretation by the State Department officials who authored the documents.
The textual corpora and research questions used in this project are the product of a series of choices and social constructions. The selected query terms limit the results to two categories of gender; the corpus itself was compiled and curated by the Office of the Historian staff; the topic was selected based on the research interests of this author; and importantly, the corpus and the documents included assume a definition (and history) that separates the United States from its southern neighbors.[3] Even at the most basic level, the State Department recognized geographical categories differently from year to year. During the period that this project analyzes, official documents were grouped under the “American Republics” as a single entity in some years, while in others the Caribbean and Mexico were treated as separate categories.[4] Before conducting an analysis, it is important to understand that all elements of this project contain socially and culturally constructed parameters, knowledge, and analysis.
Word embedding models are incredibly powerful tools that are capable of giving the user insights into their corpora and directing them to patterns that would be missed through close reading of a limited corpus. But, as this project has argued, it is important to recognize the limitations inherent to such an analysis. Using terms of gender to query the corpus does not result in representations of gender in Latin America in the postwar period. Instead, these queries result in language related to how State Department officials used gendered terms during the period.
---------------------------------------------------------
[1] The Pell Amendment quoted in Botts 2015.
[2] The 1925 Kellogg Order quoted in Botts 2015.
[3] For a discussion of how to define “Latin America” and to understand the historical division of the hemisphere, see Moya, Jose C. The Oxford Handbook of Latin American History. New York: Oxford University Press, 2011.
[4] For more information on the breakdown of the FRUS volumes during this period and the prevalence of each nation in the eyes of the State Department, see this basic analysis.