<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="../../../_utils/schema/yaps.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="../../../_utils/schema/yaps.isosch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-stylesheet type="text/css" href="../../../_utils/stylesheets/yaps-tei.css"?>
<!-- $Id: word_vectors_showcase.xml 51363 2026-03-27 17:51:50Z sconnell $ -->
<TEI xmlns="http://www.wwp.northeastern.edu/ns/yaps" version="5.0">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Project Showcase and Discussion</title>
            <author>Julia Flanders</author>
            <author>Sarah Connell</author>
         </titleStmt>
         <editionStmt>
            <edition>Word Vectors for the Thoughtful Humanist</edition>
         </editionStmt>
         <publicationStmt>
            <distributor>Women Writers Project (via website)</distributor>
            <address>
               <addrLine>url:mailto:wwp@neu.edu</addrLine>
            </address>
            <date when="2019-04-01"/>
            <availability status="restricted">
               <p>Copyright 2019 Syd Bauman, Julia Flanders, Sarah Connell, and the Women Writers Project</p>
               <p>This TEI-encoded XML file is available under the terms of the <ref
                  target="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons
                  Attribution-ShareAlike 3.0 (Unported)</ref> license.</p>
            </availability>
            <pubPlace>Boston, MA  USA</pubPlace>
         </publicationStmt>
         <sourceDesc>
            <p>Covers several example projects with word embedding models and includes discussion points on strategies for designing effective research projects with word vectors.</p>
         </sourceDesc>
      </fileDesc>
      <revisionDesc>
         <change who="personography.xml#sconnell.yuw" when="2019-07-12">First-round proofing complete.</change>   
         <change when="2019-06-20" who="sconnell.yuw">Initial draft</change>
      </revisionDesc>
   </teiHeader>
   <text>
      <presentation>
         <abstract>
            <p>This tutorial presents several example projects with word embedding models and includes discussion points on strategies for designing effective research projects and classroom assignments with word vectors.</p>
         </abstract>
        <sectionGrp>
           <head>Showcase of Student Projects with Word Embedding Models</head>
          <section>
             <head>Textual Corpora and Computational Text Analysis Project</head>
             <slide>
             	<p>Taught by Sarah Connell and Elizabeth Maddock Dillon in the class <ref target="https://litdigitaldiversity.northeastern.edu/">Literature and Digital Diversity</ref> at Northeastern. See the <ref target="https://litdigitaldiversity.northeastern.edu/text-analysis/">assignment details</ref> on the class site.</p>
            <p>
               <label><emph>Assignment</emph></label>
               <list>
                  <item>Develop a research question</item>
                  <item>Build or find a corpus</item>
                  <item>Prepare the corpus (select specific texts or portions of texts, remove metadata, etc.)</item>
                  <item>Train and query at least one model</item>
                  <item>Write up results in a research blog post</item>
                  <item>Discuss at least one scholarly source in the post</item>
               </list>
            </p>
                <p>
                   <label><emph>Scaffolding</emph></label>
                   <list>
                      <item>Previous assignment: text analysis blog post project using web-based tools to compare versions of a historical narrative</item>
                      <item>In-class workshops on: R and RStudio, running word2vec code, building and preparing corpora, developing research questions, developing queries, writing research blog posts</item>
                   </list>
                </p>
             
             </slide>
             <lectureNote>
                <p>This assignment is designed for an undergraduate class focused on the ways that digital tools and methods can be used to support diversity, equity, and inclusion. The class is typically a mixture of English and Computer Science majors; this assignment is part of the second major unit in the course, following a text encoding project. The word2vec assignment follows an introductory activity in the text analysis unit, in which students experiment with web-based analysis tools to compare two related documents and write blog posts about their findings on the class WordPress site. In the word2vec assignment, students are asked to develop research questions, which can be on any topic, and then assemble corpora related to their questions; they learn how to train and query word embedding models using R and RStudio Server, then write up their results in blog posts for the class site. This is a complex assignment and one that requires substantial in-class workshopping, not just on technical skills but also on developing research questions, building corpora, and identifying queries that can help answer the research questions.</p>
             </lectureNote>
          </section>
           
           <section>
             <head>Student Project: Gender Representation in Popular Culture</head>
             <slide>
                <p><q><ref target="https://litdigitaldiversity.northeastern.edu/portrayal-of-women-in-popular-culture-magazines/">Portrayal of women in popular culture magazines</ref></q> by Vanessa Gregorchik</p>
                <p><emph>Framing questions</emph>: <q>When feminism meets capitalism, does its language about women change? In a time when print publications are on the decline, do they still cling to the patriarchal portrayal of women as in the past? And do any of these voices change when the audience is shifted to men?</q></p>
                <p><label><emph>Corpora:</emph></label>
                <list>
                   <item>27 issues of <title>Vogue</title> and 29 issues of <title>GQ</title>, downloaded from Archive.org and saved in two collections of .txt files</item>
                   <item>Preparation: Removed HTML metadata, lowercased and removed punctuation, preserved advertisements because these also contribute to consumer culture</item>
                </list>
                </p>  
             </slide>
              <lectureNote><p>This first example student project examines the ways that gender is represented in popular periodicals. The student constructed two corpora—one from <title>Vogue</title> and one from <title>GQ</title>—to study how periodicals aimed at women and men differ in their language usage and representations of gender. The comparative approach can be particularly productive; it takes more work, because the students need to build two corpora, but it also provides a clearer way to see what is distinctive or notable about each corpus. </p>
              <p>Lowercasing and removing punctuation is standard practice, and most projects will also remove metadata as this one did. Other data preparation decisions can require more thought—for example, this student determined that advertisements, while not part of the primary contents of each magazine, also contribute to the social norms that periodicals produce and so elected to keep them.</p>
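The standard preparation steps described here (lowercasing and removing punctuation) can be sketched with Python's standard library; the sample string is invented for illustration, and a real project would also strip file metadata first:

```python
import re

def prepare(text):
    """Lowercase and remove punctuation, a common word2vec preparation step."""
    text = text.lower()
    # Replace everything that is not a letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()

sample = "VOGUE, May Issue: Fashion & Film!"
print(prepare(sample))  # → "vogue may issue fashion film"
```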
              
              </lectureNote>
          </section>
           <section>
              <head>Gender Representation in Popular Culture: Queries and findings</head>
              <slide>
                 <p>Even basic queries can be very generative with carefully chosen keywords, especially when comparison between models/corpora is possible.</p>
                  <p>
                     <label><emph>Terms closest to <q>woman</q>:</emph></label>
                     <list>
                        <label><title>Vogue</title></label>
                        <item>someone, man, herself, confident, young, compelling, awkward, sincere, fierce</item>
                        <label><title>GQ</title></label>
                        <item>herself, man, she, girl, her, wife, mother, child, pregnant, husband</item>
                     </list>
                  </p>
                 <p><emph>Conclusions:</emph> <quote><title>Vogue’s</title> coverage heavily emphasized one’s personal emotions or feelings while <title>GQ’s</title> content seemed to skew toward women’s role in families and relationships. Reflecting on the mission of the respective publications, this representation makes sense. <title>Vogue</title> is meant to represent the interests of women, while <title>GQ</title> is the authority on men. This leads to women being the main characters in most stories published in <title>Vogue</title>, while in <title>GQ</title> they may play only a secondary role.</quote></p>
              </slide>
              <lectureNote><p>The student discusses several queries in her blog post; this example highlights some particularly stark differences between the corpora. From the closest terms for <q>woman</q>, it is clear that <title>Vogue</title> is using more terms focusing on women’s individual identities, while <title>GQ</title> is using terms that connect with the domestic and familial roles that women play. As the student observes, this is not particularly surprising given the audiences of these two periodicals, but it does show how powerful comparative analyses can be with word embedding models. This also demonstrates how generative even very basic queries, such as looking at the closest terms to significant keywords in two different models, can be.   </p></lectureNote>
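Under the hood, <q>closest terms</q> means highest cosine similarity in vector space. A minimal illustration with invented three-dimensional vectors (real models use hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Tiny, made-up vectors standing in for a trained model's word vectors.
vectors = {
    "woman":  [0.9, 0.8, 0.1],
    "wife":   [0.85, 0.75, 0.2],
    "fierce": [0.4, 0.9, 0.0],
    "box":    [0.0, 0.1, 0.95],
}

def closest(word, n=2):
    """Rank every other word by cosine similarity to `word`."""
    ranked = sorted((w for w in vectors if w != word),
                    key=lambda w: cosine(vectors[word], vectors[w]),
                    reverse=True)
    return ranked[:n]

print(closest("woman"))  # "wife" ranks above the unrelated "box"
```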
           </section>
           <section>
              <head>Student Project: Public Perception of COVID-19 Vaccines</head>
              <slide>
                 <p><q><ref target="https://litdigitaldiversity.northeastern.edu/covid-19vaccines/">Public Perception of COVID-19 Vaccines &amp; Associated Political Response on Twitter</ref></q> by Julia Corfman</p>
                 <p><emph>Framing question</emph>: <q>How does the public perception of the three major U.S. Vaccines (Pfizer, Moderna &amp; Johnson+Johnson) compare, regarding side effects?</q> The project also investigated how political affiliations manifest in different attitudes surrounding vaccines. </p>
            <p>  
               <label><emph>Corpus:</emph></label>
               <list>
                 <item>Tweets with hashtags related to COVID-19 vaccines, approximately 3 million total words</item>
                 <item>Preparation: Removed Twitter artifacts such as mentions and links, preserved the textual contents of hashtags since these are an important part of Twitter discourse, combined key phrases (e.g., <q>side effects</q>, <q>johnson &amp; johnson</q>), lowercased and removed punctuation</item>
              </list></p>
                
              </slide>
              <lectureNote><p>In this second project, the student used a single corpus of Tweets to investigate public perceptions of COVID-19 vaccines. Twitter data requires some additional preprocessing to reduce noise, and so the student removed Twitter artifacts such as mentions and links. In this project, we also see some complicated decisions about which language from the corpus is related to the semantics that the project is studying; in this case, the student decided that hashtags are central enough to Twitter discourse that they merited inclusion. She also did some additional preprocessing work by combining key phrases so that they could be treated as single tokens—this step is necessary because it is not possible to query multi-word strings with word2vec.</p></lectureNote>
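The phrase-combining step described here can be sketched as a simple pre-tokenization pass that replaces multi-word phrases with single tokens (the phrase list and replacement tokens are illustrative assumptions):

```python
# Multi-word phrases to merge into single tokens before training,
# since word2vec queries operate on single tokens only.
PHRASES = {
    "side effects": "side-effects",
    "johnson & johnson": "johnson+johnson",
}

def merge_phrases(text):
    """Lowercase the text and fuse each listed phrase into one token."""
    text = text.lower()
    for phrase, token in PHRASES.items():
        text = text.replace(phrase, token)
    return text

tweet = "Mild side effects after my Johnson & Johnson shot"
print(merge_phrases(tweet))
# → "mild side-effects after my johnson+johnson shot"
```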
           </section>
           <section>
              <head>Public Perception of COVID-19 Vaccines: Queries and findings</head>
              <slide>
                <!-- <p><quote>From a preliminary search of “side-effects”, the common side effects reference mild types of pain: soreness, fatigue, lethargy.</quote></p>
                 <p>
                    <label><emph>Terms closest to <q>side-effects</q>:</emph></label> symptoms, fromfirst, effects, noticeable, soreness, mild, pain, reactions, rashes, tiredness</p>        -->       
              
                  <p>Vector math can then be used to investigate very precise associations. Querying <q>Vaccine1 – Vaccine2 – Vaccine3 + side-effects</q> shows the associations for the first vaccine, as distinct from the other two vaccines, combined with the term <q>side-effects</q>. For example:</p>
                 <p>
                    <label><emph>Terms closest to <q>moderna – pfizer – johnson + side-effects</q>:</emph></label> symptoms, soreness, stiffness, exhaustion, pain, headaches, atsite, noticeable, injection                      
                                        
                 </p>                           
                 
                 <p><emph>Conclusions:</emph> The results for the Moderna query are a set of concrete terms that describe physical side effects, while results for Pfizer and Johnson &amp; Johnson are more diffuse. <quote>These preliminary insights suggest that the Moderna vaccine is associated with more side effects and that the Johnson and Johnson vaccine may result in comparably fewer side effects.</quote> </p>
                  <p>This is supported by other research: <quote>According to a report by the FDA and summarized by <title>Business Insider</title>, less than 50% of Johnson &amp; Johnson Vaccine recipients reported pain at the injection site. This is considerably less than Pfizer’s 84% of recipients reporting pain and 92% of Moderna recipients.</quote></p>
              </slide>
               <lectureNote><p>The student used vector math to get at the three vaccines’ distinctive associations with side effects. Her query isolates each vaccine’s associations from those of the other two vaccines <emph>and</emph> combines them with the associations for <q>side-effects</q>. For instance, the student looked at the terms that are particular to Moderna, and <emph>not</emph> the other two vaccines, when combined with the contexts for side effects. She was able to validate her conclusion that more specific physical side effects were associated with Moderna by consulting recent medical studies. This example shows an application of vector math, and also demonstrates the importance of asking students to connect their projects with existing scholarship. Reflecting this need, the assignment asks that each project reference at least one scholarly source.</p></lectureNote>
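The query form <q>moderna – pfizer – johnson + side-effects</q> is plain vector arithmetic: add the vectors for the positive terms, subtract the negatives, and rank the remaining words by cosine similarity to the result. A sketch with invented two-dimensional vectors:

```python
import math

# Made-up 2-D vectors standing in for trained word vectors.
vectors = {
    "moderna":      [0.9, 0.2],
    "pfizer":       [0.7, 0.1],
    "johnson":      [0.6, 0.1],
    "side-effects": [0.3, 0.9],
    "soreness":     [0.5, 0.8],
    "politics":     [0.1, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def query(positive, negative, n=2):
    """Add the positive vectors, subtract the negatives, rank the rest."""
    target = [0.0, 0.0]
    for w in positive:
        target = [t + c for t, c in zip(target, vectors[w])]
    for w in negative:
        target = [t - c for t, c in zip(target, vectors[w])]
    candidates = [w for w in vectors if w not in positive + negative]
    return sorted(candidates, key=lambda w: cosine(target, vectors[w]),
                  reverse=True)[:n]

print(query(["moderna", "side-effects"], ["pfizer", "johnson"]))
```

In gensim this is what the `positive` and `negative` arguments of `most_similar` compute over the full vocabulary.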
           </section>
           <section>
              <head>Public Perception of COVID-19 Vaccines: Ideological differences and terminology</head>
              <slide>
                 <p>In cases where specific terms carry strong ideological freight, word embedding models can be especially powerful. For example:</p>
                  <p><label><emph>Terms closest to <q>vax</q>:</emph></label> irreversible, modification, operatingsystem, warns, genetic, wakefield</p>
                  <p><label><emph>Terms closest to <q>vaccine</q>:</emph></label> covid, vaccinecovidvaccine, trumpout, covid19update, casa, pfizers</p>
                 <p><emph>Conclusions:</emph> <quote>Looking purely at the semantic reference of vaccine and vax, vax is largely used in a conspiracy nature. Vax is associated with terms relating to genetic modification, and operating systems. This is likely because of the popularized moniker of AntiVax for AntiVaccine. There is a large conspiracy by anti-vaxxers that the vaccine will insert chips and alter your DNA, somehow spearheaded by Bill Gates. It’s interesting if this connection is from anti-vaxxers themselves referencing themselves as anti-vax and the different theories, or non-anti-vax referencing anti-vax and their theories. In actual tweets, vax is used both in a regular and conspiracy sense.</quote>   </p>
                 <p>It is important to encourage students to verify their preliminary conclusions by looking at actual usage in their corpora.</p>                 <p>  
                    
                    <list>
                       <item>36 hours post moderna vax dose 2 my arm has a nonitch or pain rash its sore i hope day 2 is as good</item>
                       <item>covid vaccine niagindependent dr wakefield warns this is not a vax it is irreversible genetic modification</item>
                       <item>covid vaccine this is especially dangerous this mrnavaccine shot is not just a vax its geneticmodification</item>
                    </list> 
                 </p>
              
              </slide>
              <lectureNote><p> In addition to looking at differences among the three vaccines, this project also examines political impacts on attitudes towards vaccination. One interesting result is in the strong associations between <q>vax</q> and conspiracy theories about the vaccine. Importantly, the student calls attention to the need to examine source materials, since it is impossible to tell from the results whether these associations are coming from those who oppose vaccines, or from others describing what they perceive to be anti-vaccination attitudes. Looking directly at the corpus shows varied results, with some neutral uses of <q>vax</q> to simply describe the experience of being vaccinated, and some Tweets describing the vaccine as an irreversible genetic modification. This is a key strategy for teaching and working with word embedding models: it's essential to return to the input corpora in order to understand results from the models.</p></lectureNote>
           </section>
           
        </sectionGrp>
         
         
         <sectionGrp>
           <head>Showcase of Research Projects with Word Embedding Models</head>
           <section>
            <head>Showcase: Accrediting early modern histories</head>
            
            <slide>
               
                <p><emph>Framing questions:</emph> Why did so many early modern historians reference <q>credit</q> when evaluating historical works? What affordances does credit offer historians in describing their discipline?</p>
               <p><quote>So that with what <emph>credit</emph>, the <emph>account</emph> of above a thousand yeares from Brute to Casseavellaunus, in a line of absolute Kings, can bee <emph>cleared</emph>, I do not see, and therefore will <emph>leave it on the booke</emph>, to such as will be <emph>creditors</emph>, according to the substance of their understanding.</quote></p>
               <p>—Samuel Daniel, <title>The Historie of England</title>, 1612</p>
            </slide>
            <lectureNote>
               <p>This project uses a model trained on a corpus of seventeenth-century histories to examine the ways that language around <q>credit</q> was used in discussions about what should constitute history and what should be instead considered fiction or fable. The central questions are: what are the affordances of <q>credit</q> for historians describing their work? And, what can the discourse around credit show about how early modern historians defined their discipline during a period when the pressures of empire and the disruptions of the Civil Wars made national origin stories particularly important, while at the same time the historicity of Britain’s early traditions was being vigorously questioned? </p>
              
            </lectureNote>
            
         </section>
         <section>
            <head>Methods and preliminary results</head>
            
            <slide>
               
               <p><emph>Corpus:</emph> 52 histories (10.9 million words), collected from EEBO-TCP
             </p>
               <p>What are the closest words to <q>credit</q> in a model trained on this corpus?</p>
               
               <p><emph>V(<term>credit</term>):</emph> reputation, historian, testimony, certainty, antiquity, relation, truth, pains, authour, opinion, impartiality, authors, account, herein, critiques, fictions, shew, hector's, authours, romance, historians, deserves, histories, disparage, story</p>
               <p><emph>History-writing:</emph> historian, authors, authour, historians, pains, antiquity</p>
               <p><emph>Historical sources:</emph> testimony, relation, account, story, romance, fictions</p>
               <p><emph>Evaluation and argumentation:</emph> reputation, opinion, certainty, truth, shew, disparage, deserves, impartiality, critiques</p>
              
               <p><quote>We are forced in this Period, not only to <emph>make use of Authors</emph> who lived long after the Things they treat of were done, but also are otherwise of <emph>no great Credit</emph>; such as Nennius, and Geoffery of Monmouth, whom we sometimes make use of for want of those of better Authority.</quote></p>
               <p>—James Tyrrell, <title>The General History of England</title>, 1696
               </p>
             
             
             
            </slide>
            <lectureNote>
                <p>The corpus comprises 52 texts and 10.9 million words, collected from the EEBO-TCP files.</p>
               <p>The words closest to <q>credit</q> in vector space show a very close association with historiographic evaluation in this corpus. Words that tend to be used with <q>credit</q> include ones that connect with discussions of history as a discipline and historical research as a process (historian, authors, authour, historians, pains, antiquity); describe different kinds of historical sources (testimony, relation, account, story, romance, fictions); and evaluate validity or make arguments about the past (reputation, opinion, certainty, truth, shew, disparage, deserves, impartiality, critiques). </p>
           
            </lectureNote>
            
         </section>
         <section>
            <head>Corpus analysis</head>
            
            <slide>
            
               <p>Does this particular corpus actually contain language related to more financial meanings of credit?</p>
               
               <p><emph>V(<term>credit</term>) + V(<term>buy</term>):</emph> sell, buy, credit, carry, purchase, get, overplus, gain, profit</p>
               <p><quote>Looke how many houshoulders there are in Dublyne, so many Ale-brewers there be in the Towne, for every Houshoulders Wife is a Brewer. And (whatsoever she be otherwise) or let hir come from whence shee will, if her <emph>credit will serve to borrowe a Pan</emph>, and to buy but a measure of mault in the Market, she sets uppe Brewing.</quote></p>
               <p>—Barnabe Rich, <title>A New Description of Ireland</title>, 1610
               </p>
            </slide>
            <lectureNote>
           
               <p>Despite the strong associations between credit and historiographic validity, it is possible to find terms related to the commercial connotations of <q>credit</q> using vector math. In fact, the more financial connotations of <q>credit</q> are very much present in this collection of seventeenth-century histories, and even seem to be part of the appeal of credit as an evaluative framework.</p>
               
            </lectureNote>
            
         </section>
         <section>
            <head>Further exploration</head>
            
            <slide>
               <p>Can this method help to show the credit available to particular historians and historical figures?</p>
               
               <p><emph>V(<term>credit</term>) + V(<term>arthur</term>):</emph> arthur's, geffrey, geoffery, ieffrey, monmouth, story, geoffrey, historian, writer</p>
   <p><quote>Arthur was a Prince more worthy to be advanced by the truth of Records in <emph>warrantable credit</emph>, then by fables <emph>scandalized</emph> with poeticall fictions and hyperbolicall falshoods…Of Jeffrey Arthur, or Monke of Monmouth: lamentable it is, that the fame of this puissant Prince had not beene sounded by <emph>a more certaine Trumpet</emph>.</quote></p>
               <p>—John Speed, <title>The History of Great Britaine</title>, 1611
               </p> 
            </slide>
            <lectureNote>
               <p>These methods can also be used to investigate particular historians and historical figures. For example, the twelfth-century historian Geoffrey of Monmouth—who was widely recognized as the source for most of Britain’s Arthurian traditions and who was equally widely criticized for his inclusion of fictional materials—was particularly troubling for some British historians because his credit was so thoroughly bound up with King Arthur’s, whom most felt deserved a more reliable chronicler. This connection is quite evident in the model, which shows numerous variations of Geoffrey's name when Arthur and credit are queried together.</p>
            </lectureNote>
            
         </section>
         <section>
            <head>Scoping analysis</head>
            
            <slide>
              
               <p>Just how contentious was Geoffrey?</p>
               <p><emph>V(<term>geoffrey</term>):</emph> jeoffrey, monmouth's, monmouth, geoffery, ieffrey, geffrey, geoffry, history, weathamstead, gyraldus, story, arthur, fabulous, mistake</p>
               
               
               <p><quote>John <emph>Weathamstead</emph>, Abbot of Saint Albans, in his worke of English Affaires, accuseth Geoffrey of Monmouth, of <emph>meere Fabulousnesse</emph>.</quote></p>
               
               <p>—Richard Baker, <title>A Chronicle of the Kings of England</title>, 1643
               </p>
            </slide>
            <lectureNote>
                <p>In fact, the suspect credit available to Geoffrey of Monmouth is such a pervasive concern for the authors in this corpus that a simple query for one common spelling of his first name shows that the closest terms, in addition to variant spellings of <q>Geoffrey</q>, are those specific to this historian and to debates about his historicity, including two other historians, Giraldus Cambrensis and John Weathamstead, who had written skeptically about Geoffrey's work. <q>Arthur</q> is closely related to Geoffrey as well, again indicating the very close connection between historian and historical figure.</p>
            </lectureNote>
            
         </section>
         
         <section>
            <head>Showcase: Archival descriptions of LGBT Collections</head>
            <slide><p><emph>Framing questions</emph>
               </p>
               <list>
                  <item>What is the semantic relationship between formal data structure and archival description?</item>
                  <item>What does it mean to read archives as data in DH? What are some of the challenges and benefits?</item>
               </list>
               <p><quote>[T]he enormous scale of digital records has changed the way scholarly resources are read; increasingly, the emphasis is on being able to study large volumes of material rather than on being able to study individual texts. These radical changes mean that, in future, archives will no longer be conceived of as collections of texts, but as <emph>data to be made sense of</emph></quote> (120).</p>
                <p>—Moss, Michael, David Thomas, and Tim Gollins. <q>The Reconfiguration of the Archive as Data to Be Mined.</q> <title>Archivaria</title>, no. 86, 2019, pp. 118-151. </p>
            </slide>
            <lectureNote>
                <p>This project explores finding aids of LGBTQ archive collections, focusing on the relationship between archival description and identity. Using digital humanities methods (including word vector models), it explores the semantics of description and structured data as they relate to use, access, discoverability, and representation. Central questions for this project include: what is the semantic relationship between formal data structure and archival description? How do controlled vocabularies of structured data appear in computational analysis? How is identity described in structured data, and how do we access this information?
                  
               </p>
            </lectureNote>
         </section>
         <section>
            
            <head>Corpus building</head>
            <slide>
               <figure>
                  <graphic height="500px" url="../../../_utils/gfx/archive-grid.png"/> 
               </figure>
               <p>ArchiveGrid interface and data organization</p>
               
            </slide>
            <lectureNote>
                <p>This corpus contains 304 finding aids (1.6 million words) collected from ArchiveGrid (a digital repository of over five million archival descriptions) using the keyword search <q>lgbt OR queer</q>. The finding aids come from the four archives with the most records in this search. As documents, finding aids vary greatly in form and level of description. Across these differences, finding aids contain important metadata about archival collections: titles, creators, physical descriptions, abstracts, biographical or historical notes, scope and contents, and subject headings/indexing terms.
               </p>
     
            </lectureNote>
         </section>
        
         <section>
            
            <head>Corpus analysis</head>
            <slide>
               
               <figure>
                  <graphic height="300px" url="../../../_utils/gfx/corpus-cloud.png"/>
               </figure>
               <p>Word cloud of most frequent words (45) using Voyant Tools</p>
            </slide>
            <lectureNote>
           
               <p>All the files in this corpus are plain text, but their structures are an integral part of the data itself. Generating a word cloud with Voyant Tools, it is easy to see that the most frequent words in this corpus are <q>box</q> and <q>folder</q>. <q>Box</q> appears over 79,000 times and <q>folder</q> almost 65,000 times. Combined, these words make up 8.6% of the entire corpus word count (144,000 words). Given that structural or formal semantics are such a large part of this corpus, where do they manifest in computational analyses like word vector models?</p>
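Frequency observations like these (<q>box</q> and <q>folder</q> together at 8.6% of all tokens) can be reproduced with a few lines of standard-library Python; the sample text here is a toy stand-in for the 1.6-million-word corpus:

```python
from collections import Counter

# Toy stand-in for the finding-aid corpus.
text = "box 1 folder 2 letters box 3 folder 4 clippings box 5"
tokens = text.split()
counts = Counter(tokens)

# Combined share of the total token count for the two structural words.
share = (counts["box"] + counts["folder"]) / len(tokens)
print(counts["box"], counts["folder"], f"{share:.1%}")
```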
            </lectureNote>
         </section>
         
        
         <section>
            
            <head>Results: Clusters</head>
            <slide>
               <p><emph>Cluster [1]:</emph> includes, materials, publications, other, clippings, notes, material, articles, letters, miscellaneous, various, journal, manuscripts, reviews, newspaper   
          </p>
               <p><emph>Cluster [2]:</emph> access, permission, must, copyright, conditions, use, governing, researchers, publish, given, requests, restrictions, owner, submitted, behalf  
               </p>
               <p><emph>Cluster [3]:</emph> gay, lesbian, subject, women, issues, men, lgbt, bisexual, political, publisher, queer, liberation, transgender, politics, movement
               </p>
             
            </slide>
            <lectureNote>
               <p>Clustering is an interesting way to <q>read</q> a word vector model. These three clusters show words closely associated with one another in vector space, and each reflects a distinct part of this corpus. The first cluster concerns material description: the types of documents, items, and artifacts that these collections contain. The second concerns use and access: words that describe or define the processes of archiving, using, and accessing collections. The third deals with the subjects of the collections, including much of the terminology used in subject headings or keywords to define or describe them. Each cluster, to some extent, is indicative of the different structures and contents of this type of data.
                                 </p>
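The idea that clustering groups words by their proximity in vector space can be sketched as follows. The two-dimensional vectors, seed words, and single-step assignment here are all illustrative simplifications: a real clustering (e.g. k-means over the trained model) works on high-dimensional vectors and iterates.

```python
import math

# Toy 2-D "word vectors" standing in for the trained model's
# high-dimensional vectors (all words and values are illustrative).
vectors = {
    "clippings":   [0.9, 0.1],
    "manuscripts": [0.8, 0.2],
    "permission":  [0.1, 0.9],
    "copyright":   [0.2, 0.8],
}
seeds = {"materials": [1.0, 0.0], "access": [0.0, 1.0]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Assign each word to the seed vector it is most similar to: a
# one-step sketch of how clustering groups the model's vocabulary.
clusters = {name: [] for name in seeds}
for word, vec in vectors.items():
    best = max(seeds, key=lambda s: cosine(vec, seeds[s]))
    clusters[best].append(word)
```

Here the document-description words and the use-and-access words separate cleanly, mirroring the first two clusters on the slide.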
            </lectureNote>
         </section>
       <!--  <section>
            
            <head>Results: Queries</head>
            <slide>
               <p><emph>Before cleaning the corpus: </emph></p>
               <p><emph>V(<term>box</term>):</emph> box, folder, 29.b, 43, 49, 48, 29a, 44, 46, 38
               </p>
               <p><emph>V(<term>folder</term>):</emph> folder, box, 29b, 29a, 38, 49, 43, 48, 2, 1
               </p>
               <p><emph>After cleaning the corpus: </emph></p>
               <p><emph>V(<term>box</term>):</emph> box, folder, oversize, clamshell, preferred, artifact, citation, flat, cartons, boxes
               </p>
               <p><emph>V(<term>folder</term>):</emph> folder, citation, preferred, box name, oversize, artifact, identification, kanemoto, or
               </p>
             
            </slide>
            <lectureNote>
               <p>In digital humanities, there are many different explanations, definitions, and ways to understand data. One of the threads of discussion regarding data that relates to training and using word vector models is the “cleanliness” of data. Because Word2Vec treats a corpus as “a bag of words,” if you do not remove metadata or data structures, these will appear and may influence the model. All the results I am sharing today are from my original corpus without cleaning, meaning before I used regular expressions to take out significant instances of “box” and “folder.” However, before exploring subject-related queries, the results of the top-ten related words for “box” and “folder” <emph>before</emph> and <emph>after</emph> cleaning my corpus are striking. Before removing instances of these words, query results for both terms showed folder numbers, nothing more. Querying these words once the corpus has been cleaned shows a much more refined vector with descriptive words about the contents and items in these containers. As this example demonstrates, understanding your research question when using computational analysis is just as important as understanding your methods of data preparation: both can have a significant effect on how you can create or interpret your data. 
                  
               </p>
            </lectureNote>
         </section>-->
       
         <section>
            
            <head>Results: Queries</head>
            <slide>
            
               <p><emph>V(<term>lgbt</term>):</emph> lgbt, broader, groups, topics, highlighting, individuals, transgender, highlight, queer, jewish, intersexuality, related, lgbtq, intersex, media’s, promotes, educational, recreational, audiences, primarily
               </p>
               <p><emph>V(<term>queer</term>):</emph> queer, nation, asexual, intersex, questioning, ally, assimilation, lgbt, multicultural, ubiquitous, cybercenter, emerged, cultural, stages, cultures, jqcc, separatists, polyamory, lgbtqia, geek
               </p>
               
            </slide>
            <lectureNote>
               <p>This corpus was originally generated with the keywords <q>lgbt</q> and <q>queer</q> on ArchiveGrid, and their corresponding vectors are an interesting way to parse the relationship between these words throughout the finding aids. As the broader descriptive term, <q>lgbt</q> is related to a larger variety of words, including not just identity labels like <q>transgender</q> and <q>intersex</q> but also words relating to the use and access of archival materials: this vector includes <q>educational,</q> <q>promotes,</q> and <q>highlighting</q>, all of which relate to use rather than only to people. <q>Queer</q>, however, is a term with a much more varied semantic past and present, having been (and, for some, continuing to be) both a derogatory word and an identity label. Some of this history and context is reflected in the list of related words, especially <q>asexual,</q> <q>intersex,</q> <q>questioning,</q> and <q>assimilation</q>; the distinctive contexts for <q>queer</q> are also evident in its closer connection with <q>lgbtqia</q> than is the case for <q>lgbt</q>.
                                 </p>
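The kind of query behind these slides — ranking the vocabulary by similarity to a target word — can be sketched as below. The three-dimensional vectors are invented for illustration; a real query would run over the trained model's full vocabulary (e.g. with a word2vec library's most-similar function).

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Illustrative 3-D vectors only; real vectors are learned from the
# corpus and typically have 100+ dimensions.
vocab = {
    "lgbt":        [0.9, 0.1, 0.0],
    "transgender": [0.8, 0.2, 0.1],
    "queer":       [0.7, 0.1, 0.6],
    "box":         [0.0, 0.0, 1.0],
}

def most_similar(word, topn=2):
    """Rank the other vocabulary words by cosine similarity to `word`."""
    sims = {w: cosine(v, vocab[word]) for w, v in vocab.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:topn]
```

A query like `most_similar("lgbt")` returns the nearest neighbors in this toy space, just as the slide lists the top twenty neighbors in the trained model.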
            </lectureNote>
         </section>
         
         <section>
            
            <head>Future questions and exploration</head>
            <slide>
               <p><emph>Future questions: </emph></p>
               <list>
                  <item>How can word vector models (and other forms of computational analysis) be used to <q>discover</q> and locate problematic archival descriptions?</item>
                  <item>What are the identities that are being overlooked? What are the complexities of identities that language is erasing or replacing in archival collections?</item>
                  <item>Is there a difference in description semantics based on who is doing the describing (community or institution-based individuals)? </item>
                  <item>How can such an analysis help create recuperative and empathy-based description practices and standards?</item>
               </list>
            </slide>
            <lectureNote>
               <p>As this project progresses, the next step is to create a larger dataset of LGBT finding aids from across the country, paying attention to institutional differences. In the end, by using a variety of digital humanities methods (including other forms of textual and computational analysis), this project seeks to study how these tools may help scholars better understand and respond to the social implications of data structure and classification within archival materials.                   
               </p>
            </lectureNote>
         </section>
         <section>
            <head>Analogies in vector space</head>
            <slide>
               
               <quote><p>Word embedding models can help us reconstruct and explore this historically-situated network of conceptual relationships. A word embedding model trained on ECCO-TCP can answer not only the more trivial analogies of man is to woman as king is to…? (<q>queen</q>), and London is to England as Edinburgh is to…? (<q>Scotland</q>). We can also explore the very analogy Young proposed, and ask the model: Riches are to Virtue as Learning is to…?</p>
                  <figure>
                     <graphic height="300px" url="../../../_utils/gfx/heuser.genius.png"/>
                  </figure>
                  <p>Admittedly, I really don’t know anything yet about measures of statistical significance in analogy testing—but, to me, the fact that, out of thousands of possibilities, the 6th closest word vector to the composite, analogical vector V(<term>Virtue</term>-<term>Riches</term>+<term>Learning</term>) is, as Young argued, V(<term>Genius</term>), is remarkable.</p> 
               </quote>
               <p>—Ryan Heuser, <ref target="http://ryanheuser.org/word-vectors-1/">Word Vectors in the Eighteenth Century, Episode 1: Concepts</ref>, 2016.</p>
           
              
            </slide>
            <lectureNote>
               <p>Now let's walk through a slightly more complex use of word embedding models for research: how do analogies work? Can we outline how Heuser applies word2vec to his research questions, and work through the logic of the explanatory analogy? As we've seen already, projects like this bring together a corpus and a research question with a method and some particular queries; can we identify all of these for Heuser’s work? 
                   <!-- First, list out the king/queen analogy. Then fill in a chart with: human being, adult, female/male and human being, monarch, female/male. Write out woman - man + king and ask people to talk through the logic of the analogy. If you start with a vector of monarch-ness, and add to that a vector of gender difference, you get queen. Then fill in headings for "Originality" and "Imitation" and ask people to fill in which concepts appear where. This is based on an argument made by Edward Young in 1759—if you can't be good, then you'd better be rich.    -->
               </p>
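The vector arithmetic behind the king/queen analogy can be shown with hand-made toy vectors. These two-dimensional vectors and their values are invented for the demonstration; a trained model learns comparable offsets from co-occurrence patterns in the corpus.

```python
import math

# Hand-made toy vectors: dimension 0 is roughly "royalty" and
# dimension 1 is roughly "gender". All values are illustrative.
vocab = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
    "palace": [0.9, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# V(king) - V(man) + V(woman): remove the "male" offset from "king",
# add the "female" offset, then find the nearest remaining word.
target = [k - m + w for k, m, w in
          zip(vocab["king"], vocab["man"], vocab["woman"])]
candidates = [w for w in vocab if w not in ("king", "man", "woman")]
answer = max(candidates, key=lambda w: cosine(vocab[w], target))
```

This is the same arithmetic Heuser performs with V(Virtue) - V(Riches) + V(Learning), just in a space small enough to inspect by hand.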
            </lectureNote>
         </section>
         
         <section>
            <head>Analyzing results and testing queries</head>
            <slide>
               
              <list>
                 <item>What conclusion does Heuser reach about his results?</item>
                <item>How would you characterize the other terms on the list, apart from <quote>genius</quote>?</item>
               <item>Are there any other explanations that you can think of for the presence of <quote>genius</quote> on this list? </item>        
                <item>What other queries would you want to try, if you were pursuing this project?</item>
                <item>What do you think we would find if we tried the same query with a model trained on a different corpus?</item>
              </list>            
               
               
            </slide>
            <lectureNote>
           
               <p>
                Can we characterize the kind of research project that Heuser is pursuing here (thinking about the conclusions he actually draws)? Do you find these conclusions persuasive, or are there any other explanations that you can think of for the presence of <q>genius</q> in this list? What other evidence would you like to see to support this conclusion? What other queries would you want to try? Since we have a model trained on ECCO, using the same parameters that Heuser did, we can investigate these queries directly—what do we find? What do we see with other models? </p>
               <!-- Make sure to show nonsense analogies (shark-bird+wolf) and simple learning + virtue -->
            </lectureNote>
         </section>
         
         
        <!-- <section>
            <head>Expressing Relational Concepts</head>
            <slide>
               <quote><p> In other words, start with the semantic profile of <soCalled>Virtue</soCalled> expressed in V(<term>Virtue</term>); then, subtract from it the semantic profile of <soCalled>Riches</soCalled> expressed in V(<term>Riches</term>). The result is that the semantic aspects that they shared are removed, and only the aspects on which they differed left behind. This specific semantic contrast is captured in a new composite vector: V(<term>Virtue-Riches</term>). In natural language, V(<term>Virtue-Riches</term>) means Virtue <emph>as given by its distinctness from Riches</emph>. This new vector is not a word vector, but rather a particular <emph>relationship</emph> between word vectors—namely, a relationship of conceptual contrast, modeled as the relationship of mathematical subtraction.</p>
               
               <p>In this way, V(<term>Virtue-Riches</term>) is an independent concept from its components. Unlike its components, V(<term>Virtue</term>) and V(<term>Riches</term>), which, as words, have direct expression in language, V(<term>Virtue-Riches</term>) is instead a <emph>relationship</emph> between those expressions, emergent through their contrast. There is, then, a sense in which V(<term>Virtue-Riches</term>) has no direct expression in language; and therefore also a sense in which we might think of it as an artificial, even virtual, concept. And yet in spite of its ontological fragility—or, more interestingly, <emph>through</emph> it—V(<term>Virtue-Riches</term>) gives formal expression to a concept that couldn’t be more real and natural to the 18C: the concept of Virtue as contrasted with Riches.</p></quote>
               <p>—Ryan Heuser, <ref target="http://ryanheuser.org/word-vectors-1/">Word Vectors in the Eighteenth Century, Episode 1: Concepts</ref>, 2016.</p>
            </slide>
            <lectureNote>
               <p>Here, again, let's work through Heuser's explanation together. What questions do you have about the idea of subtracting concepts? Are there any similar contrasts that you are interested in exploring in your own corpora? What queries do you think you might try?</p>
            </lectureNote>
         </section>-->
         
         <section>
            <head>Try some queries of your own</head>
            <slide>
               <p>With either the event <ref target="http://sandbox.wwp.northeastern.edu/">sandbox</ref> or the full <ref target="http://lab.wwp.northeastern.edu/wwvt/">toolkit</ref>, break into pairs and choose a model to explore. Try a few queries and make some notes on what you discover.</p>
               <p>Think about:</p>
               <list><item>What did you find interesting/confusing/surprising?</item>
                  <item>What are some initial conclusions you can draw from these results?</item></list>
            </slide>
            <lectureNote>
               <p>When we try the same query in a model trained on the full corpus of WWO, we also see <q>genius</q> fairly high on the list, although the overall cosine similarities are lower—note that there are 84 million words in the ECCO corpus that Heuser used, and about 12 million words in Women Writers Online.
                  </p>
                <p>What do we make of this? What additional information would we need before we could draw any conclusions? What other queries could we try?</p>
            </lectureNote>
         </section>
         
         <section>
            <head>Discussion</head>
            <slide>
               <list>            
                  <item>What do we like about these methods, what do we find confusing?</item>
                  <item>What kinds of arguments can we make with these methods?</item>
                  <item>What kinds of arguments <emph>can’t</emph> we make?</item>
                  <item>What kinds of supporting evidence are necessary? What kinds of data would we need to support these arguments?</item>              
               </list>
            </slide>
            <lectureNote>
               <list>
                  <head>What kinds of arguments can we make?</head>
                  <item>Claims that examine not just words but also relationships between words, semantic concepts on a broader scale—as in Heuser, where the concept of virtue as opposed to riches is both expressible mathematically and something that was very real in the eighteenth century.</item>
                   <item>Arguments that focus on binaries, or that can be tested by studying relationships between words and concepts, such as <quote>virtue as distinguished from riches</quote>.</item>
                  <item>Arguments that can consider words in groups—note that in Heuser's results, <q>genius</q> is actually the sixth term on the list.</item>
               </list>
               <list>
                  <head>What kinds of arguments <emph>can’t</emph> we make?</head>
                  <item>Text-by-text analysis</item>
                  <item>Any kind of claim that relies on having identical results every time a model is trained</item>                  
               </list>
               <list>
                  <head>What kinds of data and supporting evidence are necessary?</head>
                  <item>Multiple models and varied queries</item>
                  <item>Technical validations</item>
                  <item>Evidence from other methodologies</item>
               </list>
            </lectureNote>
         </section></sectionGrp>
         
         
         
         
      </presentation>
   </text>
</TEI>
