<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="../../../_utils/schema/yaps.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="../../../_utils/schema/yaps.isosch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-stylesheet type="text/css" href="../../../_utils/stylesheets/yaps-tei.css"?>
<!-- $Id: word_vectors_showcase.xml 51363 2026-03-27 17:51:50Z sconnell $ -->
<TEI xmlns="http://www.wwp.northeastern.edu/ns/yaps" version="5.0">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Project Showcase and Discussion</title>
            <author>Julia Flanders</author>
            <author>Sarah Connell</author>
         </titleStmt>
         <editionStmt>
            <edition>Word Vectors for the Thoughtful Humanist</edition>
         </editionStmt>
         <publicationStmt>
            <distributor>Women Writers Project (via website)</distributor>
            <address>
               <addrLine>url:mailto:wwp@neu.edu</addrLine>
            </address>
            <date when="2019-04-01"/>
            <availability status="restricted">
               <p>Copyright 2019 Syd Bauman, Julia Flanders, Sarah Connell, and the Women Writers Project</p>
               <p>This TEI-encoded XML file is available under the terms of the <ref
                  target="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons
                  Attribution-ShareAlike 3.0 (Unported)</ref> license.</p>
            </availability>
            <pubPlace>Boston, MA  USA</pubPlace>
         </publicationStmt>
         <sourceDesc>
            <p>Covers several example projects with word embedding models and includes discussion points on strategies for designing effective research projects with word vectors.</p>
         </sourceDesc>
      </fileDesc>
      <revisionDesc>
         <change who="personography.xml#sconnell.yuw" when="2019-07-12">First-round proofing complete.</change>   
         <change when="2019-06-20" who="sconnell.yuw">Initial draft</change>
      </revisionDesc>
   </teiHeader>
   <text>
      <presentation>
         <abstract>
            <p>This tutorial presents several example projects with word embedding models and includes discussion points on strategies for designing effective research projects and classroom assignments with word vectors.</p>
         </abstract>
        <sectionGrp>
           <head>Showcase of Student Projects with Word Embedding Models</head>
          <section>
             <head>Textual Corpora and Computational Text Analysis Project</head>
             <slide>
             	<p>Taught by Sarah Connell and Elizabeth Maddock Dillon in the class <ref target="https://litdigitaldiversity.northeastern.edu/">Literature and Digital Diversity</ref> at Northeastern. See the <ref target="https://litdigitaldiversity.northeastern.edu/text-analysis/">assignment details</ref> on the class site.</p>
            <p>
               <label><emph>Assignment</emph></label>
               <list>
                  <item>Develop a research question</item>
                  <item>Build or find a corpus</item>
                  <item>Prepare the corpus (select specific texts or portions of texts, remove metadata, etc.)</item>
                  <item>Train and query at least one model</item>
                  <item>Write up results in a research blog post</item>
                  <item>Discuss at least one scholarly source in the post</item>
               </list>
            </p>
                <p>
                   <label><emph>Scaffolding</emph></label>
                   <list>
                      <item>Previous assignment: text analysis blog post project using web-based tools to compare versions of a historical narrative</item>
                      <item>In-class workshops on: R and RStudio, running word2vec code, building and preparing corpora, developing research questions, developing queries, writing research blog posts</item>
                   </list>
                </p>
             
             </slide>
             <lectureNote>
                <p>This assignment is designed for an undergraduate class focused on the ways that digital tools and methods can be used to support diversity, equity, and inclusion. The class is typically a mixture of English and Computer Science majors; this assignment is part of the second major unit in the course, following a text encoding project. The word2vec assignment follows an introductory activity in the text analysis unit, in which students experiment with web-based analysis tools to compare two related documents and write blog posts about their findings on the class WordPress site. In the word2vec assignment, students are asked to develop research questions, which can be on any topic, and then assemble corpora related to their questions; they learn how to train and query word embedding models using R and RStudio Server, then write up their results in blog posts for the class site. This is a complex assignment and one that requires substantial in-class workshopping, not just on technical skills but also on developing research questions, building corpora, and identifying queries that can help answer the research questions.</p>
             </lectureNote>
          </section>
           
           <section>
             <head>Student Project: Gender Representation in Popular Culture</head>
             <slide>
                <p><q><ref target="https://litdigitaldiversity.northeastern.edu/portrayal-of-women-in-popular-culture-magazines/">Portrayal of women in popular culture magazines</ref></q> by Vanessa Gregorchik</p>
                <p><emph>Framing questions</emph>: <q>When feminism meets capitalism, does its language about women change? In a time when print publications are on the decline, do they still cling to the patriarchal portrayal of women as in the past? And do any of these voices change when the audience is shifted to men?</q></p>
                <p><label><emph>Corpora:</emph></label>
                <list>
                   <item>27 issues of <title>Vogue</title> and 29 issues of <title>GQ</title>, downloaded from Archive.org and saved in two collections of .txt files</item>
                   <item>Preparation: Removed HTML metadata, lowercased and removed punctuation, preserved advertisements because these also contribute to consumer culture</item>
                </list>
                </p>  
             </slide>
              <lectureNote><p>This first example student project examines the ways that gender is represented in popular periodicals. The student constructed two corpora—one from <title>Vogue</title> and one from <title>GQ</title>—to study how periodicals aimed at women and men differ in their language usage and representations of gender. The comparative approach can be particularly productive; it takes more work, because the students need to build two corpora, but it also provides a clearer way to see what is distinctive or notable about each corpus. </p>
              <p>Lowercasing and removing punctuation is standard practice, and most projects will also remove metadata as this one did. Other data preparation decisions can require more thought—for example, this student determined that advertisements, while not part of the primary contents of each magazine, also contribute to the social norms that periodicals produce and so elected to keep them.</p>
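The standard preparation steps described here (lowercasing and removing punctuation) can be sketched with Python's standard library; the sample string is invented for illustration, and a real project would also strip file metadata first:

```python
import re

def prepare(text):
    """Lowercase and remove punctuation, a common word2vec preparation step."""
    text = text.lower()
    # Replace everything that is not a letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()

sample = "VOGUE, May Issue: Fashion & Film!"
print(prepare(sample))  # → "vogue may issue fashion film"
```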
              
              </lectureNote>
          </section>
           <section>
              <head>Gender Representation in Popular Culture: Queries and findings</head>
              <slide>
                 <p>Even basic queries can be very generative with carefully chosen keywords, especially when comparison between models/corpora is possible.</p>
                  <p>
                     <label><emph>Terms closest to <q>woman</q>:</emph></label>
                     <list>
                        <label><title>Vogue</title></label>
                        <item>someone, man, herself, confident, young, compelling, awkward, sincere, fierce</item>
                        <label><title>GQ</title></label>
                        <item>herself, man, she, girl, her, wife, mother, child, pregnant, husband</item>
                     </list>
                  </p>
                 <p><emph>Conclusions:</emph> <quote><title>Vogue’s</title> coverage heavily emphasized one’s personal emotions or feelings while <title>GQ’s</title> content seemed to skew toward women’s role in families and relationships. Reflecting on the mission of the respective publications, this representation makes sense. <title>Vogue</title> is meant to represent the interests of women, while <title>GQ</title> is the authority on men. This leads to women being the main characters in most stories published in <title>Vogue</title>, while in <title>GQ</title> they may play only a secondary role.</quote></p>
              </slide>
              <lectureNote><p>The student discusses several queries in her blog post; this example highlights some particularly stark differences between the corpora. From the closest terms for <q>woman</q>, it is clear that <title>Vogue</title> is using more terms focusing on women’s individual identities, while <title>GQ</title> is using terms that connect with the domestic and familial roles that women play. As the student observes, this is not particularly surprising given the audiences of these two periodicals, but it does show how powerful comparative analyses can be with word embedding models. This also demonstrates how generative even very basic queries, such as looking at the closest terms to significant keywords in two different models, can be.   </p></lectureNote>
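Under the hood, <q>closest terms</q> means highest cosine similarity in vector space. A minimal illustration with invented three-dimensional vectors (real models use hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Tiny, made-up vectors standing in for a trained model's word vectors.
vectors = {
    "woman":  [0.9, 0.8, 0.1],
    "wife":   [0.85, 0.75, 0.2],
    "fierce": [0.4, 0.9, 0.0],
    "box":    [0.0, 0.1, 0.95],
}

def closest(word, n=2):
    """Rank every other word by cosine similarity to `word`."""
    ranked = sorted((w for w in vectors if w != word),
                    key=lambda w: cosine(vectors[word], vectors[w]),
                    reverse=True)
    return ranked[:n]

print(closest("woman"))  # "wife" ranks above the unrelated "box"
```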
           </section>
           <section>
              <head>Student Project: Public Perception of COVID-19 Vaccines</head>
              <slide>
                 <p><q><ref target="https://litdigitaldiversity.northeastern.edu/covid-19vaccines/">Public Perception of COVID-19 Vaccines &amp; Associated Political Response on Twitter</ref></q> by Julia Corfman</p>
                 <p><emph>Framing question</emph>: <q>How does the public perception of the three major U.S. Vaccines (Pfizer, Moderna &amp; Johnson+Johnson) compare, regarding side effects?</q> The project also investigated how political affiliations manifest in different attitudes surrounding vaccines. </p>
            <p>  
               <label><emph>Corpus:</emph></label>
               <list>
                 <item>Tweets with hashtags related to COVID-19 vaccines, approximately 3 million total words</item>
                 <item>Preparation: Removed Twitter artifacts such as mentions and links, preserved the textual contents of hashtags since these are an important part of Twitter discourse, combined key phrases (e.g., <q>side effects</q>, <q>johnson &amp; johnson</q>), lowercased and removed punctuation</item>
              </list></p>
                
              </slide>
              <lectureNote><p>In this second project, the student used a single corpus of Tweets to investigate public perceptions of COVID-19 vaccines. Twitter data requires some additional preprocessing to reduce noise, and so the student removed Twitter artifacts such as mentions and links. In this project, we also see some complicated decisions about which language from the corpus is related to the semantics that the project is studying; in this case, the student decided that hashtags are central enough to Twitter discourse that they merited inclusion. She also did some additional preprocessing work by combining key phrases so that they could be treated as single tokens—this step is necessary because it is not possible to query multi-word strings with word2vec.</p></lectureNote>
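The phrase-combining step described here can be sketched as a simple pre-tokenization pass that replaces multi-word phrases with single tokens (the phrase list and replacement tokens are illustrative assumptions):

```python
# Multi-word phrases to merge into single tokens before training,
# since word2vec queries operate on single tokens only.
PHRASES = {
    "side effects": "side-effects",
    "johnson & johnson": "johnson+johnson",
}

def merge_phrases(text):
    """Lowercase the text and fuse each listed phrase into one token."""
    text = text.lower()
    for phrase, token in PHRASES.items():
        text = text.replace(phrase, token)
    return text

tweet = "Mild side effects after my Johnson & Johnson shot"
print(merge_phrases(tweet))
# → "mild side-effects after my johnson+johnson shot"
```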
           </section>
           <section>
              <head>Public Perception of COVID-19 Vaccines: Queries and findings</head>
              <slide>
                <!-- <p><quote>From a preliminary search of “side-effects”, the common side effects reference mild types of pain: soreness, fatigue, lethargy.</quote></p>
                 <p>
                    <label><emph>Terms closest to <q>side-effects</q>:</emph></label> symptoms, fromfirst, effects, noticeable, soreness, mild, pain, reactions, rashes, tiredness</p>        -->       
              
                  <p>Vector math can then be used to investigate very precise associations. Querying <q>Vaccine1 – Vaccine2 – Vaccine3 + side-effects</q> shows the associations for the first vaccine, as distinct from the other two vaccines, combined with the term <q>side-effects</q>. For example:</p>
                 <p>
                    <label><emph>Terms closest to <q>moderna – pfizer – johnson + side-effects</q>:</emph></label> symptoms, soreness, stiffness, exhaustion, pain, headaches, atsite, noticeable, injection                      
                                        
                 </p>                           
                 
                 <p><emph>Conclusions:</emph> The results for the Moderna query are a set of concrete terms that describe physical side effects, while results for Pfizer and Johnson &amp; Johnson are more diffuse. <quote>These preliminary insights suggest that the Moderna vaccine is associated with more side effects and that the Johnson and Johnson vaccine may result in comparably fewer side effects.</quote> </p>
                  <p>This is supported by other research: <quote>According to a report by the FDA and summarized by <title>Business Insider</title>, less than 50% of Johnson &amp; Johnson Vaccine recipients reported pain at the injection site. This is considerably less than Pfizer’s 84% of recipients reporting pain and 92% of Moderna recipients.</quote></p>
              </slide>
               <lectureNote><p>The student used vector math to get at the three vaccines’ distinctive associations with side effects. Her query isolates each vaccine’s associations from those of the other two vaccines <emph>and</emph> combines them with the associations for <q>side-effects</q>. For instance, the student looked at the terms that are particular to Moderna, and <emph>not</emph> the other two vaccines, when combined with the contexts for side effects. She was able to validate her conclusion that more specific physical side effects were associated with Moderna by consulting recent medical studies. This example shows an application of vector math, and also demonstrates the importance of asking students to connect their projects with existing scholarship. Reflecting this need, the assignment asks that each project reference at least one scholarly source.</p></lectureNote>
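The query form <q>moderna – pfizer – johnson + side-effects</q> is plain vector arithmetic: add the vectors for the positive terms, subtract the negatives, and rank the remaining words by cosine similarity to the result. A sketch with invented two-dimensional vectors:

```python
import math

# Made-up 2-D vectors standing in for trained word vectors.
vectors = {
    "moderna":      [0.9, 0.2],
    "pfizer":       [0.7, 0.1],
    "johnson":      [0.6, 0.1],
    "side-effects": [0.3, 0.9],
    "soreness":     [0.5, 0.8],
    "politics":     [0.1, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def query(positive, negative, n=2):
    """Add the positive vectors, subtract the negatives, rank the rest."""
    target = [0.0, 0.0]
    for w in positive:
        target = [t + c for t, c in zip(target, vectors[w])]
    for w in negative:
        target = [t - c for t, c in zip(target, vectors[w])]
    candidates = [w for w in vectors if w not in positive + negative]
    return sorted(candidates, key=lambda w: cosine(target, vectors[w]),
                  reverse=True)[:n]

print(query(["moderna", "side-effects"], ["pfizer", "johnson"]))
```

In gensim this is what the `positive` and `negative` arguments of `most_similar` compute over the full vocabulary.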
           </section>
           <section>
              <head>Public Perception of COVID-19 Vaccines: Ideological differences and terminology</head>
              <slide>
                 <p>In cases where specific terms carry strong ideological freight, word embedding models can be especially powerful. For example:</p>
                  <p><label><emph>Terms closest to <q>vax</q>:</emph></label> irreversible, modification, operatingsystem, warns, genetic, wakefield</p>
                  <p><label><emph>Terms closest to <q>vaccine</q>:</emph></label> covid, vaccinecovidvaccine, trumpout, covid19update, casa, pfizers</p>
                 <p><emph>Conclusions:</emph> <quote>Looking purely at the semantic reference of vaccine and vax, vax is largely used in a conspiracy nature. Vax is associated with terms relating to genetic modification, and operating systems. This is likely because of the popularized moniker of AntiVax for AntiVaccine. There is a large conspiracy by anti-vaxxers that the vaccine will insert chips and alter your DNA, somehow spearheaded by Bill Gates. It’s interesting if this connection is from anti-vaxxers themselves referencing themselves as anti-vax and the different theories, or non-anti-vax referencing anti-vax and their theories. In actual tweets, vax is used both in a regular and conspiracy sense.</quote>   </p>
                 <p>It is important to encourage students to verify their preliminary conclusions by looking at actual usage in their corpora.</p>                 <p>  
                    
                    <list>
                       <item>36 hours post moderna vax dose 2 my arm has a nonitch or pain rash its sore i hope day 2 is as good</item>
                       <item>covid vaccine niagindependent dr wakefield warns this is not a vax it is irreversible genetic modification</item>
                       <item>covid vaccine this is especially dangerous this mrnavaccine shot is not just a vax its geneticmodification</item>
                    </list> 
                 </p>
              
              </slide>
              <lectureNote><p> In addition to looking at differences among the three vaccines, this project also examines political impacts on attitudes towards vaccination. One interesting result is in the strong associations between <q>vax</q> and conspiracy theories about the vaccine. Importantly, the student calls attention to the need to examine source materials, since it is impossible to tell from the results whether these associations are coming from those who oppose vaccines, or from others describing what they perceive to be anti-vaccination attitudes. Looking directly at the corpus shows varied results, with some neutral uses of <q>vax</q> to simply describe the experience of being vaccinated, and some Tweets describing the vaccine as an irreversible genetic modification. This is a key strategy for teaching and working with word embedding models: it's essential to return to the input corpora in order to understand results from the models.</p></lectureNote>
           </section>
           
        </sectionGrp>
         
         
         <sectionGrp>
           <head>Showcase of Research Projects with Word Embedding Models</head>
           <section>
            <head>Showcase: Accrediting early modern histories</head>
            
            <slide>
               
                <p><emph>Framing questions:</emph> Why did so many early modern historians reference <q>credit</q> when evaluating historical works? What affordances does credit offer historians in describing their discipline?</p>
               <p><quote>So that with what <emph>credit</emph>, the <emph>account</emph> of above a thousand yeares from Brute to Casseavellaunus, in a line of absolute Kings, can bee <emph>cleared</emph>, I do not see, and therefore will <emph>leave it on the booke</emph>, to such as will be <emph>creditors</emph>, according to the substance of their understanding.</quote></p>
               <p>—Samuel Daniel, <title>The Historie of England</title>, 1612</p>
            </slide>
            <lectureNote>
               <p>This project uses a model trained on a corpus of seventeenth-century histories to examine the ways that language around <q>credit</q> was used in discussions about what should constitute history and what should be instead considered fiction or fable. The central questions are: what are the affordances of <q>credit</q> for historians describing their work? And, what can the discourse around credit show about how early modern historians defined their discipline during a period when the pressures of empire and the disruptions of the Civil Wars made national origin stories particularly important, while at the same time the historicity of Britain’s early traditions was being vigorously questioned? </p>
              
            </lectureNote>
            
         </section>
         <section>
            <head>Methods and preliminary results</head>
            
            <slide>
               
               <p><emph>Corpus:</emph> 52 histories (10.9 million words), collected from EEBO-TCP
             </p>
               <p>What are the closest words to <q>credit</q> in a model trained on this corpus?</p>
               
               <p><emph>V(<term>credit</term>):</emph> reputation, historian, testimony, certainty, antiquity, relation, truth, pains, authour, opinion, impartiality, authors, account, herein, critiques, fictions, shew, hector's, authours, romance, historians, deserves, histories, disparage, story</p>
               <p><emph>History-writing:</emph> historian, authors, authour, historians, pains, antiquity</p>
               <p><emph>Historical sources:</emph> testimony, relation, account, story, romance, fictions</p>
               <p><emph>Evaluation and argumentation:</emph> reputation, opinion, certainty, truth, shew, disparage, deserves, impartiality, critiques</p>
              
               <p><quote>We are forced in this Period, not only to <emph>make use of Authors</emph> who lived long after the Things they treat of were done, but also are otherwise of <emph>no great Credit</emph>; such as Nennius, and Geoffery of Monmouth, whom we sometimes make use of for want of those of better Authority.</quote></p>
               <p>—James Tyrrell, <title>The General History of England</title>, 1696
               </p>
             
             
             
            </slide>
            <lectureNote>
                <p>The corpus comprises 52 texts and 10.9 million words, collected from the EEBO-TCP files.</p>
               <p>The words closest to <q>credit</q> in vector space show a very close association with historiographic evaluation in this corpus. Words that tend to be used with <q>credit</q> include ones that connect with discussions of history as a discipline and historical research as a process (historian, authors, authour, historians, pains, antiquity); describe different kinds of historical sources (testimony, relation, account, story, romance, fictions); and evaluate validity or make arguments about the past (reputation, opinion, certainty, truth, shew, disparage, deserves, impartiality, critiques). </p>
           
            </lectureNote>
            
         </section>
         <section>
            <head>Corpus analysis</head>
            
            <slide>
            
               <p>Does this particular corpus actually contain language related to more financial meanings of credit?</p>
               
               <p><emph>V(<term>credit</term>) + V(<term>buy</term>):</emph> sell, buy, credit, carry, purchase, get, overplus, gain, profit</p>
               <p><quote>Looke how many houshoulders there are in Dublyne, so many Ale-brewers there be in the Towne, for every Houshoulders Wife is a Brewer. And (whatsoever she be otherwise) or let hir come from whence shee will, if her <emph>credit will serve to borrowe a Pan</emph>, and to buy but a measure of mault in the Market, she sets uppe Brewing.</quote></p>
               <p>—Barnabe Rich, <title>A New Description of Ireland</title>, 1610
               </p>
            </slide>
            <lectureNote>
           
               <p>Despite the strong associations between credit and historiographic validity, it is possible to find terms related to the commercial connotations of <q>credit</q> using vector math. In fact, the more financial connotations of <q>credit</q> are very much present in this collection of seventeenth-century histories, and even seem to be part of the appeal of credit as an evaluative framework.</p>
               
            </lectureNote>
            
         </section>
         <section>
            <head>Further exploration</head>
            
            <slide>
               <p>Can this method help to show the credit available to particular historians and historical figures?</p>
               
               <p><emph>V(<term>credit</term>) + V(<term>arthur</term>):</emph> arthur's, geffrey, geoffery, ieffrey, monmouth, story, geoffrey, historian, writer</p>
   <p><quote>Arthur was a Prince more worthy to be advanced by the truth of Records in <emph>warrantable credit</emph>, then by fables <emph>scandalized</emph> with poeticall fictions and hyperbolicall falshoods…Of Jeffrey Arthur, or Monke of Monmouth: lamentable it is, that the fame of this puissant Prince had not beene sounded by <emph>a more certaine Trumpet</emph>.</quote></p>
               <p>—John Speed, <title>The History of Great Britaine</title>, 1611
               </p> 
            </slide>
            <lectureNote>
               <p>These methods can also be used to investigate particular historians and historical figures. For example, the twelfth-century historian Geoffrey of Monmouth—who was widely recognized as the source for most of Britain’s Arthurian traditions and who was equally widely criticized for his inclusion of fictional materials—was particularly troubling for some British historians because his credit was so thoroughly bound up with King Arthur’s, whom most felt deserved a more reliable chronicler. This connection is quite evident in the model, which shows numerous variations of Geoffrey's name when Arthur and credit are queried together.</p>
            </lectureNote>
            
         </section>
         <section>
            <head>Scoping analysis</head>
            
            <slide>
              
               <p>Just how contentious was Geoffrey?</p>
               <p><emph>V(<term>geoffrey</term>):</emph> jeoffrey, monmouth's, monmouth, geoffery, ieffrey, geffrey, geoffry, history, weathamstead, gyraldus, story, arthur, fabulous, mistake</p>
               
               
               <p><quote>John <emph>Weathamstead</emph>, Abbot of Saint Albans, in his worke of English Affaires, accuseth Geoffrey of Monmouth, of <emph>meere Fabulousnesse</emph>.</quote></p>
               
               <p>—Richard Baker, <title>A Chronicle of the Kings of England</title>, 1643
               </p>
            </slide>
            <lectureNote>
                <p>In fact, the suspect credit available to Geoffrey of Monmouth is such a pervasive concern for the authors in this corpus that a simple query for one common spelling of his first name shows that the closest terms, in addition to variant spellings of <q>Geoffrey</q>, are those specific to this historian and to debates about his historicity, including two other historians, Giraldus Cambrensis and John Weathamstead, who had written skeptically about Geoffrey's work. <q>Arthur</q> is closely related to Geoffrey as well, again indicating the very close connection between historian and historical figure.</p>
            </lectureNote>
            
         </section>
         
         <section>
            <head>Showcase: Archival descriptions of LGBT Collections</head>
            <slide><p><emph>Framing questions</emph>
               </p>
               <list>
                  <item>What is the semantic relationship between formal data structure and archival description?</item>
                  <item>What does it mean to read archives as data in DH? What are some of the challenges and benefits?</item>
               </list>
               <p><quote>[T]he enormous scale of digital records has changed the way scholarly resources are read; increasingly, the emphasis is on being able to study large volumes of material rather than on being able to study individual texts. These radical changes mean that, in future, archives will no longer be conceived of as collections of texts, but as <emph>data to be made sense of</emph></quote> (120).</p>
                <p>—Moss, Michael, David Thomas, and Tim Gollins. <q>The Reconfiguration of the Archive as Data to Be Mined.</q> <title>Archivaria</title>, no. 86, 2019, pp. 118-151. </p>
            </slide>
            <lectureNote>
                <p>This project explores finding aids of LGBTQ archive collections, focusing on the relationship between archival description and identity. Using digital humanities methods (including word vector models), it explores the semantics of description and structured data as they relate to use, access, discoverability, and representation. Central questions for this project include: what is the semantic relationship between formal data structure and archival description? How do controlled vocabularies of structured data appear in computational analysis? How is identity described in structured data, and how do we access this information?
                  
               </p>
            </lectureNote>
         </section>
         <section>
            
            <head>Corpus building</head>
            <slide>
               <figure>
                  <graphic height="500px" url="../../../_utils/gfx/archive-grid.png"/> 
               </figure>
               <p>ArchiveGrid interface and data organization</p>
               
            </slide>
            <lectureNote>
                <p>This corpus contains 304 finding aids (1.6 million words) collected from ArchiveGrid (a digital repository of over five million archival descriptions) using the keyword search <q>lgbt OR queer</q>. The finding aids come from the four archives with the most records in this search. As documents, finding aids vary greatly in form and level of description. Across these differences, finding aids contain important metadata about archival collections: titles, creators, physical descriptions, abstracts, biographical or historical notes, scope and contents, and subject headings/indexing terms.
               </p>
     
            </lectureNote>
         </section>
        
         <section>
            
            <head>Corpus analysis</head>
            <slide>
               
               <figure>
                  <graphic height="300px" url="../../../_utils/gfx/corpus-cloud.png"/>
               </figure>
               <p>Word cloud of most frequent words (45) using Voyant Tools</p>
            </slide>
            <lectureNote>
           
               <p>All the files in this corpus are plain text, but their structures are an integral part of the data itself. Generating a word cloud with Voyant Tools, it is easy to see that the most frequent words in this corpus are <q>box</q> and <q>folder</q>. <q>Box</q> appears over 79,000 times and <q>folder</q> almost 65,000 times. Combined, these words make up 8.6% of the entire corpus word count (144,000 words). Given that structural or formal semantics are such a large part of this corpus, where do they manifest in computational analyses like word vector models?</p>
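Frequency observations like these (<q>box</q> and <q>folder</q> together at 8.6% of all tokens) can be reproduced with a few lines of standard-library Python; the sample text here is a toy stand-in for the 1.6-million-word corpus:

```python
from collections import Counter

# Toy stand-in for the finding-aid corpus.
text = "box 1 folder 2 letters box 3 folder 4 clippings box 5"
tokens = text.split()
counts = Counter(tokens)

# Combined share of the total token count for the two structural words.
share = (counts["box"] + counts["folder"]) / len(tokens)
print(counts["box"], counts["folder"], f"{share:.1%}")
```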
            </lectureNote>
         </section>
         
        
         <section>
            
            <head>Results: Clusters</head>
            <slide>
               <p><emph>Cluster [1]:</emph> includes, materials, publications, other, clippings, notes, material, articles, letters, miscellaneous, various, journal, manuscripts, reviews, newspaper   
          </p>
               <p><emph>Cluster [2]:</emph> access, permission, must, copyright, conditions, use, governing, researchers, publish, given, requests, restrictions, owner, submitted, behalf  
               </p>
               <p><emph>Cluster [3]:</emph> gay, lesbian, subject, women, issues, men, lgbt, bisexual, political, publisher, queer, liberation, transgender, politics, movement
               </p>
             
            </slide>
            <lectureNote>
               <p>Clustering is an interesting way to <q>read</q> a word vector model. These three clusters show words closely associated with one another in vector space, and each reflects a distinct part of this corpus. The first cluster concerns material description: the types of documents, items, and artifacts that these collections contain. The second concerns use and access: words that describe or define the processes of archiving, using, and accessing collections. The third deals with the subjects of the collections, including much of the terminology used in subject headings or keywords to define or describe them. Each cluster, to some extent, is indicative of the different structures and contents of this type of data.
                                 </p>
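The idea that clustering groups words by their proximity in vector space can be sketched as follows. The two-dimensional vectors, seed words, and single-step assignment here are all illustrative simplifications: a real clustering (e.g. k-means over the trained model) works on high-dimensional vectors and iterates.

```python
import math

# Toy 2-D "word vectors" standing in for the trained model's
# high-dimensional vectors (all words and values are illustrative).
vectors = {
    "clippings":   [0.9, 0.1],
    "manuscripts": [0.8, 0.2],
    "permission":  [0.1, 0.9],
    "copyright":   [0.2, 0.8],
}
seeds = {"materials": [1.0, 0.0], "access": [0.0, 1.0]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Assign each word to the seed vector it is most similar to: a
# one-step sketch of how clustering groups the model's vocabulary.
clusters = {name: [] for name in seeds}
for word, vec in vectors.items():
    best = max(seeds, key=lambda s: cosine(vec, seeds[s]))
    clusters[best].append(word)
```

Here the document-description words and the use-and-access words separate cleanly, mirroring the first two clusters on the slide.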
            </lectureNote>
         </section>
       <!--  <section>
            
            <head>Results: Queries</head>
            <slide>
               <p><emph>Before cleaning the corpus: </emph></p>
               <p><emph>V(<term>box</term>):</emph> box, folder, 29.b, 43, 49, 48, 29a, 44, 46, 38
               </p>
               <p><emph>V(<term>folder</term>):</emph> folder, box, 29b, 29a, 38, 49, 43, 48, 2, 1
               </p>
               <p><emph>After cleaning the corpus: </emph></p>
               <p><emph>V(<term>box</term>):</emph> box, folder, oversize, clamshell, preferred, artifact, citation, flat, cartons, boxes
               </p>
               <p><emph>V(<term>folder</term>):</emph> folder, citation, preferred, box name, oversize, artifact, identification, kanemoto, or
               </p>
             
            </slide>
            <lectureNote>
               <p>In digital humanities, there are many different explanations, definitions, and ways to understand data. One of the threads of discussion regarding data that relates to training and using word vector models is the “cleanliness” of data. Because Word2Vec treats a corpus as “a bag of words,” if you do not remove metadata or data structures, these will appear and may influence the model. All the results I am sharing today are from my original corpus without cleaning, meaning before I used regular expressions to take out significant instances of “box” and “folder.” However, before exploring subject-related queries, the results of the top-ten related words for “box” and “folder” <emph>before</emph> and <emph>after</emph> cleaning my corpus are striking. Before removing instances of these words, query results for both terms showed folder numbers, nothing more. Querying these words once the corpus has been cleaned shows a much more refined vector with descriptive words about the contents and items in these containers. As this example demonstrates, understanding your research question when using computational analysis is just as important as understanding your methods of data preparation: both can have a significant effect on how you can create or interpret your data. 
                  
               </p>
            </lectureNote>
         </section>-->
       
         <section>
            
            <head>Results: Queries</head>
            <slide>
            
               <p><emph>V(<term>lgbt</term>):</emph> lgbt, broader, groups, topics, highlighting, individuals, transgender, highlight, queer, jewish, intersexuality, related, lgbtq, intersex, media’s, promotes, educational, recreational, audiences, primarily
               </p>
               <p><emph>V(<term>queer</term>):</emph> queer, nation, asexual, intersex, questioning, ally, assimilation, lgbt, multicultural, ubiquitous, cybercenter, emerged, cultural, stages, cultures, jqcc, separatists, polyamory, lgbtqia, geek
               </p>
               
            </slide>
            <lectureNote>
               <p>This corpus was originally generated with the keywords <q>lgbt</q> and <q>queer</q> on ArchiveGrid, and their corresponding vectors are an interesting way to parse the relationship between these words throughout the finding aids. As the broader descriptive term, <q>lgbt</q> is related to a larger variety of words, including not just identity labels like <q>transgender</q> and <q>intersex</q> but also words relating to the use and access of archival materials: this vector includes <q>educational,</q> <q>promotes,</q> and <q>highlighting</q>, all of which relate to use rather than only to people. <q>Queer</q>, however, is a term with a much more varied semantic past and present, having been (and, for some, continuing to be) both a derogatory word and an identity label. Some of this history and context is reflected in the list of related words, especially <q>asexual,</q> <q>intersex,</q> <q>questioning,</q> and <q>assimilation</q>; the distinctive contexts for <q>queer</q> are also evident in its closer connection with <q>lgbtqia</q> than is the case for <q>lgbt</q>.
                                 </p>
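The kind of query behind these slides — ranking the vocabulary by similarity to a target word — can be sketched as below. The three-dimensional vectors are invented for illustration; a real query would run over the trained model's full vocabulary (e.g. with a word2vec library's most-similar function).

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Illustrative 3-D vectors only; real vectors are learned from the
# corpus and typically have 100+ dimensions.
vocab = {
    "lgbt":        [0.9, 0.1, 0.0],
    "transgender": [0.8, 0.2, 0.1],
    "queer":       [0.7, 0.1, 0.6],
    "box":         [0.0, 0.0, 1.0],
}

def most_similar(word, topn=2):
    """Rank the other vocabulary words by cosine similarity to `word`."""
    sims = {w: cosine(v, vocab[word]) for w, v in vocab.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:topn]
```

A query like `most_similar("lgbt")` returns the nearest neighbors in this toy space, just as the slide lists the top twenty neighbors in the trained model.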
            </lectureNote>
         </section>
         
         <section>
            
            <head>Future questions and exploration</head>
            <slide>
               <p><emph>Future questions: </emph></p>
               <list>
                  <item>How can word vector models (and other forms of computational analysis) be used to <q>discover</q> and locate problematic archival descriptions?</item>
                  <item>What are the identities that are being overlooked? What are the complexities of identities that language is erasing or replacing in archival collections?</item>
                  <item>Is there a difference in description semantics based on who is doing the describing (community or institution-based individuals)? </item>
                  <item>How can such an analysis help create recuperative and empathy-based description practices and standards?</item>
               </list>
            </slide>
            <lectureNote>
               <p>As this project progresses, the next step is to create a larger dataset of LGBT finding aids from across the country, paying attention to institutional differences. In the end, by using a variety of digital humanities methods (including other forms of textual and computational analysis), this project seeks to study how these tools may help scholars better understand and respond to the social implications of data structure and classification within archival materials.                   
               </p>
            </lectureNote>
         </section>
         <section>
            <head>Analogies in vector space</head>
            <slide>
               
               <quote><p>Word embedding models can help us reconstruct and explore this historically-situated network of conceptual relationships. A word embedding model trained on ECCO-TCP can answer not only the more trivial analogies of man is to woman as king is to…? (<q>queen</q>), and London is to England as Edinburgh is to…? (<q>Scotland</q>). We can also explore the very analogy Young proposed, and ask the model: Riches are to Virtue as Learning is to…?</p>
                  <figure>
                     <graphic height="300px" url="../../../_utils/gfx/heuser.genius.png"/>
                  </figure>
                  <p>Admittedly, I really don’t know anything yet about measures of statistical significance in analogy testing—but, to me, the fact that, out of thousands of possibilities, the 6th closest word vector to the composite, analogical vector V(<term>Virtue</term>-<term>Riches</term>+<term>Learning</term>) is, as Young argued, V(<term>Genius</term>), is remarkable.</p> 
               </quote>
               <p>—Ryan Heuser, <ref target="http://ryanheuser.org/word-vectors-1/">Word Vectors in the Eighteenth Century, Episode 1: Concepts</ref>, 2016.</p>
           
              
            </slide>
            <lectureNote>
               <p>Now let's walk through a slightly more complex use of word embedding models for research: how do analogies work? Can we outline how Heuser applies word2vec to his research questions, and work through the logic of the explanatory analogy? As we've seen already, projects like this bring together a corpus and a research question with a method and some particular queries; can we identify all of these for Heuser’s work? 
                   <!-- First, list out the king/queen analogy. Then fill in a chart with: human being, adult, female/male and human being, monarch, female/male. Write out woman - man + king and ask people to talk through the logic of the analogy. If you start with a vector of monarch-ness, and add to that a vector of gender difference, you get queen. Then fill in headings for "Originality" and "Imitation" and ask people to fill in which concepts appear where. This is based on an argument made by Edward Young in 1759—if you can't be good, then you'd better be rich.    -->
               </p>
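The vector arithmetic behind the king/queen analogy can be shown with hand-made toy vectors. These two-dimensional vectors and their values are invented for the demonstration; a trained model learns comparable offsets from co-occurrence patterns in the corpus.

```python
import math

# Hand-made toy vectors: dimension 0 is roughly "royalty" and
# dimension 1 is roughly "gender". All values are illustrative.
vocab = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
    "palace": [0.9, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# V(king) - V(man) + V(woman): remove the "male" offset from "king",
# add the "female" offset, then find the nearest remaining word.
target = [k - m + w for k, m, w in
          zip(vocab["king"], vocab["man"], vocab["woman"])]
candidates = [w for w in vocab if w not in ("king", "man", "woman")]
answer = max(candidates, key=lambda w: cosine(vocab[w], target))
```

This is the same arithmetic Heuser performs with V(Virtue) - V(Riches) + V(Learning), just in a space small enough to inspect by hand.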
            </lectureNote>
         </section>
         
         <section>
            <head>Analyzing results and testing queries</head>
            <slide>
               
              <list>
                 <item>What conclusion does Heuser reach about his results?</item>
                <item>How would you characterize the other terms on the list, apart from <quote>genius</quote>?</item>
               <item>Are there any other explanations that you can think of for the presence of <quote>genius</quote> on this list? </item>        
                <item>What other queries would you want to try, if you were pursuing this project?</item>
                <item>What do you think we would find if we tried the same query with a model trained on a different corpus?</item>
              </list>            
               
               
            </slide>
            <lectureNote>
           
               <p>
                Can we characterize the kind of research project that Heuser is pursuing here (thinking about the conclusions he actually draws)? Do you find these conclusions persuasive, or are there any other explanations that you can think of for the presence of <q>genius</q> in this list? What other evidence would you like to see to support this conclusion? What other queries would you want to try? Since we have a model trained on ECCO, using the same parameters that Heuser did, we can investigate these queries directly—what do we find? What do we see with other models? </p>
               <!-- Make sure to show nonsense analogies (shark-bird+wolf) and simple learning + virtue -->
            </lectureNote>
         </section>
         
         
        <!-- <section>
            <head>Expressing Relational Concepts</head>
            <slide>
               <quote><p> In other words, start with the semantic profile of <soCalled>Virtue</soCalled> expressed in V(<term>Virtue</term>); then, subtract from it the semantic profile of <soCalled>Riches</soCalled> expressed in V(<term>Riches</term>). The result is that the semantic aspects that they shared are removed, and only the aspects on which they differed left behind. This specific semantic contrast is captured in a new composite vector: V(<term>Virtue-Riches</term>). In natural language, V(<term>Virtue-Riches</term>) means Virtue <emph>as given by its distinctness from Riches</emph>. This new vector is not a word vector, but rather a particular <emph>relationship</emph> between word vectors—namely, a relationship of conceptual contrast, modeled as the relationship of mathematical subtraction.</p>
               
               <p>In this way, V(<term>Virtue-Riches</term>) is an independent concept from its components. Unlike its components, V(<term>Virtue</term>) and V(<term>Riches</term>), which, as words, have direct expression in language, V(<term>Virtue-Riches</term>) is instead a <emph>relationship</emph> between those expressions, emergent through their contrast. There is, then, a sense in which V(<term>Virtue-Riches</term>) has no direct expression in language; and therefore also a sense in which we might think of it as an artificial, even virtual, concept. And yet in spite of its ontological fragility—or, more interestingly, <emph>through</emph> it—V(<term>Virtue-Riches</term>) gives formal expression to a concept that couldn’t be more real and natural to the 18C: the concept of Virtue as contrasted with Riches.</p></quote>
               <p>—Ryan Heuser, <ref target="http://ryanheuser.org/word-vectors-1/">Word Vectors in the Eighteenth Century, Episode 1: Concepts</ref>, 2016.</p>
            </slide>
            <lectureNote>
               <p>Here, again, let's work through Heuser's explanation together. What questions do you have about the idea of subtracting concepts? Are there any similar contrasts that you are interested in exploring in your own corpora? What queries do you think you might try?</p>
            </lectureNote>
         </section>-->
         
         <section>
            <head>Try some queries of your own</head>
            <slide>
               <p>With either the event <ref target="http://sandbox.wwp.northeastern.edu/">sandbox</ref> or the full <ref target="http://lab.wwp.northeastern.edu/wwvt/">toolkit</ref>, break into pairs and choose a model to explore. Try a few queries and make some notes on what you discover.</p>
               <p>Think about:</p>
               <list><item>What did you find interesting/confusing/surprising?</item>
                  <item>What are some initial conclusions you can draw from these results?</item></list>
            </slide>
            <lectureNote>
               <p>When we try the same query in a model trained on the full corpus of WWO, we also see <q>genius</q> fairly high on the list, although the overall cosine similarities are lower—note that there are 84 million words in the ECCO corpus that Heuser used, and about 12 million words in Women Writers Online.
                  </p>
                <p>What do we make of this? What additional information would we need before we could draw any conclusions? What other queries could we try?</p>
            </lectureNote>
         </section>
         
         <section>
            <head>Discussion</head>
            <slide>
               <list>            
                  <item>What do we like about these methods, what do we find confusing?</item>
                  <item>What kinds of arguments can we make with these methods?</item>
                  <item>What kinds of arguments <emph>can’t</emph> we make?</item>
                  <item>What kinds of supporting evidence are necessary? What kinds of data would we need to support these arguments?</item>              
               </list>
            </slide>
            <lectureNote>
               <list>
                  <head>What kinds of arguments can we make?</head>
                  <item>Claims that examine not just words but also relationships between words, semantic concepts on a broader scale—as in Heuser, where the concept of virtue as opposed to riches is both expressible mathematically and something that was very real in the eighteenth century.</item>
                   <item>Arguments that focus on binaries, or that can be tested by studying relationships between words and concepts, such as <quote>virtue as distinguished from riches</quote>.</item>
                  <item>Arguments that can consider words in groups—note that in Heuser's results, <q>genius</q> is actually the sixth term on the list.</item>
               </list>
               <list>
                  <head>What kinds of arguments <emph>can’t</emph> we make?</head>
                  <item>Text-by-text analysis</item>
                  <item>Any kind of claim that relies on having identical results every time a model is trained</item>                  
               </list>
               <list>
                  <head>What kinds of data and supporting evidence are necessary?</head>
                  <item>Multiple models and varied queries</item>
                  <item>Technical validations</item>
                  <item>Evidence from other methodologies</item>
               </list>
            </lectureNote>
         </section></sectionGrp>
         
         
         
         
      </presentation>
   </text>
</TEI>
