<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://wwp-test.northeastern.edu/outreach/seminars/_utils/schema/yaps.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="https://wwp-test.northeastern.edu/outreach/seminars/_utils/schema/yaps.isosch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-stylesheet type="text/css" href="https://wwp-test.northeastern.edu/outreach/seminars/_utils/stylesheets/yaps-tei.css"?>
<!-- $Id: word_vectors_overview.xml 51374 2026-04-01 16:35:16Z aclark $ -->
<TEI xmlns="http://www.wwp.northeastern.edu/ns/yaps" version="5.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Word Vectors Institute: Introductions and Overview</title>
        <author>Julia Flanders</author>
      </titleStmt>
      <editionStmt>
        <edition>Word Vectors for the Thoughtful Humanist</edition>
      </editionStmt>
      <publicationStmt>
        <distributor>Women Writers Project (via website)</distributor>
        <address>
          <addrLine>url:mailto:wwp@neu.edu</addrLine>
        </address>
        <date when="2019-04-01"/>
        <availability status="restricted">
          <p>Copyright 2019 Syd Bauman, Julia Flanders, Sarah Connell, and the Women Writers
            Project</p>
          <p>This TEI-encoded XML file is available under the terms of the <ref
              target="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons
              Attribution-ShareAlike 3.0 (Unported)</ref> license.</p>
        </availability>
        <pubPlace>Boston, MA USA</pubPlace>
      </publicationStmt>
      <sourceDesc>
        <p>Opening session for the institute, including an overview of what will be covered.</p>
      </sourceDesc>
    </fileDesc>
    <revisionDesc>
      <change when="2021-06-24" who="jflanders.lfw">Updates for institute 3</change>
      <change when="2021-05-17" who="jflanders.lfw">Updates for institute 2</change>
      <change when="2019-07-12" who="jflanders.lfw">Initial draft</change>
    </revisionDesc>
  </teiHeader>
  <text>
    <presentation>
      <abstract>
        <p>This tutorial introduces the WWP curriculum on word embedding models.</p>
      </abstract>


      <section>
        <head>Overview</head>
        <slide>

          <p>This institute series: four institutes in all, generously funded by the National
            Endowment for the Humanities: <list>
              <item>July 2019: Introductory, research focused</item>
              <item>May 2021 (rescheduled from 2020): Introductory, teaching focused</item>
              <item>July 2021 (rescheduled from 2020): Intensive, research focused</item>
              <item><emph>May 2022</emph> (rescheduled from 2021): Intensive, teaching
                focused</item>
            </list>
          </p>
          <p>Our focus for this event: <list>
              <item>What are word embedding models? What's different about them? What are they good
                for?</item>
              <item>What do all the specialized terms mean?</item>
              <item>How do word embedding models work?</item>
              <item>How can we build fascinating and effective curricular materials that make use of
                word embedding models? How do we explain word embedding models in our teaching, at
                different levels and for different pedagogical contexts?</item>
              <!--<item>How do we explain and contextualize our results, and use them persuasively in
                research?</item>-->
            </list>
          </p>
        </slide>
        <lectureNote>

          <p>To situate this event a bit: <list>
              <item>this is the third of a series of four institutes, in which we're trying to
                approach the general topic of word embedding models from both a teaching and a
                research perspective, and also for audiences with different levels of comfort with
                programming</item>
              <item>so this third event focuses on <emph>research</emph> usage from an
                  <emph>intensive</emph> standpoint</item>
            </list></p>
          <p>We’re not expecting any prior knowledge of text analysis and certainly none of word
            embedding models (that’s why you’re here!) but we hope everyone will come away feeling
            comfortable with several things: <list>
              <item>What word embedding models are and how they differ from other text
                analysis/machine learning approaches</item>
              <item> The vocabulary and specialized terminology used to talk about word embedding
                models</item>
              <item> How word embedding models work: what is actually happening under the hood and
                how that affects the kinds of research and interpretive work we can do with this
                technique</item>
              <item> How to explain and contextualize these approaches, particularly in the context
                of our research and scholarship </item>
              <item>How to <emph>read and modify</emph> the R code used to train and query the
                models (but not write new R code from scratch)</item>
            </list></p>

          <p> A note on tools and scope: <list>
              <!--<item>How to use word embedding models using command-line tools like RStudio (although we are offering an optional session on days 2 and 3 to walk you through the process of setting that up)</item>
              <item>We have done the heavy lifting and advance work behind the scenes (although we
                will walk through the concepts carefully), so the web interface allows you to
                explore the resulting models. Our “intensive” institutes later in the grant will
                cover RStudio and the process of training models by hand in more detail. </item>
              
              -->
              <item>We have developed an easy-to-use web interface, which lets you query existing
                trained models; we'll use it a bit, and you may find it very useful for
                teaching</item>
              <item>But for this intensive workshop, we will be getting into the actual process of
                training and querying models on the command line</item>
              <item>We will be using RStudio, an integrated development environment (IDE) for
                writing and running R code</item>
            </list></p>

        </lectureNote>

        <tutorial>
          <p>To situate this walkthrough a bit: <list>
              <item>This walkthrough was developed using materials from the third of a series of
                four institutes, in which we approached the general topic of word embedding models
                from both a teaching and a research perspective. We designed this walkthrough for audiences with
                different levels of comfort with programming, so no advanced programming experience is required to follow along.</item>
              <item>The primary focus of this walkthrough is on <emph>research</emph> usage from an
                  <emph>intensive</emph> standpoint</item>
            </list></p>
          <p>We’re not expecting any prior knowledge of text analysis and certainly none of word
            embedding models (that’s why you’re here!) but we hope you will come away from this walkthrough feeling
            comfortable with several things: <list>
              <item>What word embedding models are and how they differ from other text
                analysis/machine learning approaches</item>
              <item> The vocabulary and specialized terminology used to talk about word embedding
                models</item>
              <item> How word embedding models work: what is actually happening under the hood and
                how that affects the kinds of research and interpretive work we can do with this
                technique</item>
              <item> How to explain and contextualize these approaches, particularly in the context
                of our research and scholarship </item>
              <item>How to <emph>read and modify</emph> the R code used to train and query the
                models (but not write new R code from scratch)</item>
            </list></p>

          <p> A note on tools and scope: <list>
              <!--<item>How to use word embedding models using command-line tools like RStudio (although we are offering an optional session on days 2 and 3 to walk you through the process of setting that up)</item>
              <item>We have done the heavy lifting and advance work behind the scenes (although we
                will walk through the concepts carefully), so the web interface allows you to
                explore the resulting models. Our “intensive” institutes later in the grant will
                cover RStudio and the process of training models by hand in more detail. </item>
              
              -->
              <item>We have developed an easy-to-use web interface, which lets you query existing
                trained models; we'll use it a bit, and you may find it very useful for
                teaching</item>
              <item>In this intensive guide, we will get into the actual process of training and
                querying models from the command line, adding more detail as we go</item>
              <item>We will be using RStudio, an integrated development environment (IDE) for
                writing and running R code</item>
            <item>Although we will use the terminology that practitioners use when referring to the tools and methods in this guide,
            we do not assume any prior knowledge; terms will be defined as they arise.</item>
            </list></p>

        </tutorial>


      </section>
      <section>
        <head>Finding the right level</head>
        <slide>
          <figure>
            <graphic height="600px" url="../../../_utils/gfx/w2v_range_of_complexity.png"/>
          </figure>
        </slide>
        <lectureNote>
          <p>This is also a sort of meta-workshop: <list>
              <item>Part of the goal of this grant is to explore ways of making word embedding
                models approachable and useful and persuasive to many different audiences without
                dumbing them down;</item>
              <item> we’re trying to develop appropriate explanatory narratives that are somewhere
                in between “word vectors are a fun tool! See the clusters!” and technical language
                that assumes deep expertise</item>
              <item>So we are going to be interested in thinking with you about that boundary: about
                what parts of this topic are especially challenging, and how we can best understand
                them and explain them to others: for instance, colleagues, students, and readers of
                articles where you draw on these techniques</item>
              <item>Your current unfamiliarity with the topic is a brief and precious resource for
                you as teachers: this is your moment to reflect on what is hardest to understand, so
                that you can anticipate the things others may find confusing or worth unpacking, and
                explain them in terms that are legible and appropriately pitched</item>
              <!--
                <item>Your current unfamiliarity with the topic is a brief and precious resource for you as teachers: this is your moment to reflect on what is hardest to understand, so that you can give your students the pacing and explanations they need</item>
                <item>So we are going to be interested in thinking with you about what parts of this
                topic are especially challenging, and how we can best understand them and explain
                them to others (for instance, colleagues, readers, or students) </item>-->
            </list></p>
        </lectureNote>
        <tutorial>
          <p>This walkthrough approaches word embedding models from a meta-perspective: <list>
            <item>Part of the goal is to explore ways of making word embedding
              models approachable, useful, and persuasive to many different audiences without
              oversimplifying them;</item>
            <item>The guide uses explanatory narratives that are somewhere
              in between “word vectors are a fun tool! See the clusters!” and technical language
              that assumes deep expertise. This way, word embedding models are demystified as much as possible while still
            familiarizing you with the terms that practitioners use to talk about the same processes.</item>
            <item>So we will be exploring that boundary a bit and asking you to consider the following:
              what parts of this topic are especially challenging, and how we can best understand
              those challenges and explain them to others: for instance, colleagues, students, and readers of
              articles where you draw on these techniques</item>
            <item>Your current unfamiliarity with the topic is a brief and precious resource for
              you as a researcher or a teacher (or both!): this is your moment to reflect on what is hardest to understand, so
              that you can anticipate the things others may find confusing or worth unpacking, and
              explain them in terms that are legible and appropriately pitched</item>
            <!--
                <item>Your current unfamiliarity with the topic is a brief and precious resource for you as teachers: this is your moment to reflect on what is hardest to understand, so that you can give your students the pacing and explanations they need</item>
                <item>So we are going to be interested in thinking with you about what parts of this
                topic are especially challenging, and how we can best understand them and explain
                them to others (for instance, colleagues, readers, or students) </item>-->
          </list></p>
        </tutorial>
      </section>
      <section>
        <head>A quick look at the schedule...</head>
        <slide>
          <p>Monday and Tuesday: <list>
              <item>An initial orientation and a showcase of some pedagogical projects</item>
              <item>A deeper explanation of terminology and concepts</item>
              <item>A walkthrough of some commented code samples and some hands-on
                experimentation</item>
            </list>
          </p>
          <p>Wednesday and Thursday: <list>
              <item>A close look at corpus and data preparation</item>
              <item>A close look at the model training process</item>
              <item>More hands-on experimentation using commented code walkthroughs</item>
            </list>
          </p>
          <p>Friday: <list>
              <item>More hands-on practice, and a walkthrough of the code to visualize word
                embedding models</item>
              <item>Wrapping up and next steps</item>
            </list>
          </p>
          <!--<p>Today: <list>
              <item>An initial orientation and some sample research questions</item>
              <item>Lunch (and optional lunchtime look under the hood, part 1)</item>
              <item>A deeper explanation of terminology and concepts</item>
              <item>Hands-on exploration of the word vector toolkit</item>
            </list>
          </p>
          <p>Tomorrow: <list>
              <item>The process of data preparation and model training</item>
              <item>Hands-on exploration and discussion</item>
              <item>Lunch (and optional lunchtime look under the hood, part 2)</item>
              <item>Small and large group discussion of research questions</item>
            </list>
          </p>
          <p>Friday: <list>
              <item>Hands-on work on your own projects</item>
              <item>Short project presentations</item>
              <item>Lunch</item>
              <item>More presentations</item>
              <item>Closing discussion, questions, next steps</item>
            </list>
          </p>-->
        </slide>
        <lectureNote>
          <p> Quick look at the schedule: <list>
              <item>Our basic strategy here is to examine and explain word embedding models several
                times, at increasing levels of detail, so that you have a chance to internalize one
                level of knowledge before we dive into the next deeper level. </item>
              <item> We’ll be working intensively with commented code walkthroughs: these are R
                programs with detailed comments and some specific places where you can make
                modifications and specify parameters; these are designed so that you don't have to
                actually <emph>write</emph> any R code, but can become familiar with how it works
                and how to adapt it</item> <item>We'll also spend time doing hands-on work in small
                groups so that you have a chance to practice and explore on your own</item>
              <item>During the workshop, we will be using a version of RStudio that is installed on
                a shared server, so that you (and we) don't have to deal with the complexities of
                getting RStudio running on everyone's individual computers. However, before the
                workshop on Wednesday and Thursday, for those who are interested, we will also do a
                walkthrough of how to download and install RStudio on your own computer; no
                obligation but if you're interested all are welcome.</item> <item>On the final day,
                we'll do a bit of experimentation with code to visualize word embedding models, and
                then we’ll wrap up with a discussion of next steps (including what would be involved
                in tackling RStudio and the command line if you’re so inclined).</item>
            </list>
          </p>
        </lectureNote>
        <tutorial>
          <p> A quick roadmap for this walkthrough: <list>
            <item>The strategy of this walkthrough is to examine and explain word embedding models several
              times, at increasing levels of detail, so that you have a chance to internalize one
              level of knowledge before we dive into the next deeper level. </item>
            <item> We’ll be working intensively with commented code walkthroughs: these are R
              programs with detailed comments and some specific places where you can make
              modifications and specify parameters; these are designed so that you don't have to
              actually <emph>write</emph> any R code, but can become familiar with how it works
              and how to adapt it</item> <item>We also include several hands-on exercises for you to try,
              so that you have the opportunity to put key concepts into practice.</item>
            <item>The code in this walkthrough is written in the programming language R and is run in
              RStudio, an integrated development environment (IDE) for R. We won't spend much time
              explaining how to download and install RStudio, so we recommend installing it on your
              own computer in advance, or connecting to a shared server where RStudio is already available.
            </item> 
            <item>Finally,
                we'll do a bit of experimentation with code to visualize word embedding models, and
                then we’ll wrap up with a discussion of possible next steps.</item>
          </list>
          </p>
        </tutorial>
      </section>
      <section>
        <head>Making notes</head>
        <slide>
          <p>What did you try?</p>
          <p>What settings did you use? (Which corpora, what query terms...etc.?)</p>
          <p>What result did you get?</p>
          <p>What didn't make sense?</p>
          <p>What do you want to remember to try later?</p>
        </slide>
        <lectureNote>
          <p>We've provided a fair amount of time for individual and small-group experimentation,
            and time for you to think about your own research projects</p>
          <p>However, this workshop will really just be a start, a chance to get comfortable with
            fundamental concepts</p>
          <p>I want to talk for a moment about some suggestions for how to take this work with you
            and continue it in your own time after you get home: <list>
              <item>For all of the sessions, we have a shared notes document [share link] for
                anything you want to write down that might be useful to the group, and we'll also
                ask you to make notes there during some of the small-group hands-on work</item>
              <item>we'd also like to suggest that you keep something like a lab notebook: an
                informal, personal (but somewhat detailed) record of what you tried, what worked,
                what questions you have, what you want to follow up on later</item>
              <item>More specifically, it's helpful to remember details like what words you queried,
                what corpora you were comparing, what settings you used </item>
              <item>We have created some samples and templates as inspiration, which are in our
                shared Google space</item>
              <item>Later on, these kinds of notes can also be useful in documenting your results,
                for purposes of writing about them in your research; very similar to documenting
                your bibliographic sources for a research article</item>
              <item>Screen shots can also be a convenient way to keep a record of a notable result. </item>
              <item>Questions?</item>
            </list>
          </p>
        </lectureNote>
        <tutorial>
          <p>We've provided a fair amount of space in the walkthrough for you to think about your own research projects, and 
          about how word embedding models might play a role in answering your research questions</p>
          <p>However, this walkthrough is a soft introduction: a chance to get comfortable with
            fundamental concepts, even though word embedding models are useful for much more</p>
          <p>Since we can't possibly cover every application of word embedding models for humanities researchers,
            we want to offer some suggestions for how to take this work with you
            and continue it on your own time after you have worked through the content provided here: <list>
              <item>We encourage you to have a dedicated space for notes that you can use to record
                anything that might be useful for a later exploration. At key points, we will also
                ask you to use your notetaking space to jot down some thoughts, particularly during some of the hands-on exercises</item>
              <item>We also suggest that, in addition to more traditional notetaking, you keep something like a lab notebook: an
                informal, personal (but somewhat detailed) record of what experiments you tried, what worked,
                what questions you have, and what you want to follow up on later</item>
              <item>More specifically, it's helpful to remember details like what words you queried,
                what corpora you were comparing, what settings you used </item>
              <item>We have created some samples and templates as inspiration, which you may use as a model</item>
              <item>Later on, these kinds of notes can also be useful in documenting your results,
                for purposes of writing about them in your research; very similar to documenting
                your bibliographic sources for a research article</item>
              <item>Screen shots can also be a convenient way to keep a record of a notable result. </item>
            </list>
          </p>
        </tutorial>
      </section>

      <section>
        <head>
          <soCalled>Model?</soCalled>
        </head>
        <slide>
          <figure>
            <graphic height="600px" url="../../../_utils/gfx/w2v_model2.png"/>
          </figure>

        </slide>
        <lectureNote>
          <p>So, with those preliminaries out of the way, let’s get into our first explanation of
            word embedding models. For this first explanatory pass through, we won't dwell in detail
            on the terminology or the mathematics: we'll keep to a sort of metaphorical level of
            explanation to get a feel for things.</p>
          <p>And the first term I want to talk about is the word <term>model</term>
            <list>
              <item><term>model</term> is a potent concept in digital humanities, because so much of
                what we do depends on models of one kind or another: creating digital
                representations of real-world objects and ideas, and using them to study those
                things</item>
              <item>in some of the earlier domains of DH we're used to thinking of models
                representationally: as static proxies for research objects (like texts or artifacts)
                that capture what is salient to us about those artifacts: a model as a TEI-encoded
                text</item>
              <item>in more recent domains such as machine learning, a <soCalled>model</soCalled> is
                more of a predictive or generative tool: something we can use to model the behavior
                of a system and not only learn more about it, but also produce new things that
                follow the rules and probabilities of the system: the kind of model that is
                represented by a schema</item></list></p>
          <p>Word-embedding models have properties of both, but in important respects are more like
            this latter type: <list>
              <item>they model the language of a corpus in a way that focuses on questions like "if
                I'm reading or writing this sentence, what's the most likely next word?" or "based
                on the words I'm seeing in this little region, what is the most likely word at the
                center of that region"?</item>
              <item>in other words, word-embedding models are interested in a probabilistic model of
                language that represents the interconnections between words as likelihoods based on
                proximity</item>
            </list></p>
          <p>The practical applications of this kind of modeling are familiar: predictive text on
            your phone! But in digital humanities, models of this kind are also valuable because
            they let us understand language better and help us do research on specific topics and
            historical formations. So where the machine-learning research in industry is focused on
            getting the most accurate predictions of what word I'm trying to type, through a
            somewhat abstract, de-historicized understanding of language, in digital humanities we
            need to pay close attention to language as represented in our specific corpora
            (representing a time period, a genre, a set of authors, etc.) and also to the
            assumptions we're making about language when we train our models.</p>
        </lectureNote>
        <tutorial>
          <p>So, with those preliminaries out of the way, let’s get into our first explanation of
            word embedding models. For this first explanatory pass through, we won't dwell in detail
            on the terminology or the mathematics: we'll keep to a sort of metaphorical level of
            explanation to get a feel for things.</p>
          <p>And the first key term we will cover is the word <term>model</term>
            <list>
              <item>The term <term>model</term> is a potent concept in digital humanities, because so much of
                what we do depends on models of one kind or another: creating digital
                representations of real-world objects and ideas, and using them to study those
                things</item>
              <item>in some of the earlier domains of DH, models tended to be thought of
                representationally: as static proxies for research objects (like texts or artifacts)
                that capture what is salient to us about those artifacts: a model as a TEI-encoded
                text</item>
              <item>in more recent domains such as machine learning, a <soCalled>model</soCalled> is
                regarded as more of a predictive or generative tool: something we can use to model the behavior
                of a system and not only learn more about it, but also produce new things (objects, items, or even new models) that
                follow the rules and probabilities of that system: this is the kind of model that can be
                represented by a schema</item></list></p>
          <p>Word-embedding models have properties of both of these perspectives on models, but in important respects are more like
            the latter type: <list>
              <item>word embedding models represent the language of a corpus in a way that focuses on questions like "if
                I'm reading or writing this sentence, what's the most likely next word?" or "based
                on the words I'm seeing in this little region, what is the most likely word at the
                center of that region"?</item>
              <item>in other words, word-embedding models are interested in a probabilistic model of
                language that represents the interconnections between words as likelihoods based on
                proximity</item>
            </list></p>
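          <p>To make this probabilistic intuition concrete, here is a minimal sketch (in Python,
            using a tiny invented sentence as the corpus) of counting which words fall inside a
            small window around a target word; real word2vec training learns dense vectors from
            exactly this kind of windowed context rather than from raw counts:</p>

```python
# A toy sketch of the "most likely words near X" intuition: count which
# words appear within a small window of a target word. (Real word2vec
# training learns dense vectors from these contexts; the counting here
# illustrates only the underlying idea. The sentence is invented.)
from collections import Counter

corpus = "she read the letter and she wrote the letter again".split()
window = 2          # how many words on each side count as "context"
target = "letter"

neighbors = Counter()
for i, word in enumerate(corpus):
    if word == target:
        start = max(0, i - window)
        context = corpus[start:i] + corpus[i + 1:i + 1 + window]
        neighbors.update(context)

# "the" falls in the window of both occurrences of "letter"
print(neighbors.most_common(3))
```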
          <p>The practical applications of this kind of modeling are likely already familiar: for example, predictive text on
            your phone! But in digital humanities, models of this kind are also valuable because
            they let us understand language better and help us do research on specific topics and
            historical formations. So where machine-learning research in industry is often focused on
            getting the most accurate predictions of what word a user is trying to type, through a
            somewhat abstract, de-historicized understanding of language, in digital humanities we
            need to pay close attention to language as represented in our specific corpora
            (representing a time period, a genre, a set of authors, etc.) and also to the
            assumptions we're making about language when we train our models. In other words, we are
            interested in keeping humanistic questions front and center as we train and use these models.</p>
        </tutorial>
      </section>

      <section>
        <head>A first look at word vectors</head>
        <slide>
          <figure>
            <graphic height="600px" url="../../../_utils/gfx/w2v_clusters.png"/>
          </figure>
        </slide>
        <lectureNote>

          <p>At the simplest level, a word embedding model is a model of a text corpus that
            represents word usage in the corpus by locating each word in
            <soCalled>space</soCalled></p>
          <p>Metaphorically, we can imagine that those spatial locations show us
              <soCalled>neighborhoods</soCalled> of words that tend to occur in the same
            contexts</p>
          <p>Another way to think about these <soCalled>neighborhoods</soCalled> is that they are
            answers to the question: <q>what are the words most likely to appear near word X?</q> or
              <q>what word X is most likely to appear in this context?</q></p>
          <p>So the clusters we see are groups of words that might be predicted by the same kinds of
            contexts. What can we imagine those contexts to be, based on the clusters we're seeing
            here? <list>
              <item>Start with cluster 5 (accompanying <soCalled>mad lib</soCalled>): words relating
                to expressions of risk and despair, unhappy futurity</item>
              <item>we can see how these words could plausibly fit into very similar contexts</item>
              <item>How about clusters 6 and 7? (righteous war; early modern female virtue?)</item>
              <item>Cluster 8 is a little different: not really a <soCalled>thematic</soCalled>
                cluster: what is the predictive <soCalled>context</soCalled> here?</item>
              <item>How about clusters 9 and 4?</item>
            </list>
          </p>
        </lectureNote>
        <tutorial>
          <p>At the simplest level, a word embedding model is a model of a text corpus that
            represents word usage in the corpus by locating each word in
            <soCalled>space</soCalled></p>
          <p>Metaphorically, we can imagine that those spatial locations show us
            <soCalled>neighborhoods</soCalled> of words that tend to occur in the same
            contexts</p>
          <p>Another way to think about these <soCalled>neighborhoods</soCalled> is that they are
answers to the question: <q>what are the words most likely to appear near word X?</q> or
            <q>what word is most likely to appear in this context?</q></p>
          <p>So the clusters we see are groups of words that might be predicted by the same kinds of
            contexts. What can we imagine those contexts to be, based on the clusters we're seeing
            here? <list>
              <item>Start with cluster 5 (accompanying <soCalled>mad lib</soCalled>): words relating
                to expressions of risk and despair, unhappy futurity</item>
              <item>we can see how these words could plausibly fit into very similar contexts</item>
              <item>How about clusters 6 and 7? (righteous war; early modern female virtue?)</item>
              <item>Cluster 8 is a little different: not really a <soCalled>thematic</soCalled>
                cluster: what is the predictive <soCalled>context</soCalled> here?</item>
              <item>How about clusters 9 and 4?</item>
            </list>
          </p>
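<p>To make this concrete, here is a minimal Python sketch of the question <q>what are the words most likely to appear near word X?</q> This is not the training algorithm itself, just a hand-rolled co-occurrence count, and the sample sentence is invented for illustration:</p>

```python
from collections import Counter

def neighbors(tokens, target, window=2):
    """Count every word that occurs within `window` positions of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

# An invented miniature "corpus" for illustration
text = ("the danger was imminent and the danger was approaching "
        "we apprehend the danger").split()
print(neighbors(text, "danger"))
```

<p>A real model refines millions of observations like these into vector positions, but its raw material is exactly this kind of windowed co-occurrence.</p>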
        </tutorial>
      </section>
      <section>
        <head>Thinking with vectors</head>
        <slide>
          <figure>
            <graphic height="600px" url="../../../_utils/gfx/w2v_vector_space2.png"/>
          </figure>
        </slide>
        <lectureNote>
          <p>So this is interesting in itself: <list>
              <item>these clusters of words tell us something about how our corpus uses
                language</item>
              <item>it shows semantic connections between words </item></list></p>
          <p>It's also interesting because we can do further analysis: <list>
              <item>These <soCalled>neighborhoods</soCalled> aren’t just clusters of words that are
                impressionistically near one another: they are positioned in a spatial relationship
                to the rest of the model </item>
              <item>that spatial relationship can be described mathematically</item>
              <item>the position of each word is represented by a vector (essentially, vectors are
                lines that aim out at different angles and distances) </item>
              <item>this means that we can actually compare the position of one word mathematically
                with the position of another word, and we can represent the difference in their
                positions as: <emph>another vector!</emph>
              </item>
              <item>We don’t want to examine that math just yet, but we can take advantage of it.
              </item>
            </list>
          </p>
<p>If you had a chance to read Ryan Heuser's analysis of <q>riches</q> and <q>virtue</q>,
            or Ben Schmidt's analysis of the Rate My Professor data, where he considers
              <quote>breaking down the gender binary</quote>, you saw both writers taking advantage
            of this same idea: <list>
              <item>that we can use these vectors, these spatialized relationships between words, as
                an analytical tool</item>
              <item>and that although in a sense <soCalled>space</soCalled> is a metaphor here (or
                at least a purely mathematical kind of reality), nonetheless it has a level of
                internal consistency and truth-value that means we can do meaningful analyses based
                on it.</item>
            </list>
          </p>
        </lectureNote>
        <tutorial>
          <p>Here is an interesting thought: <list>
            <item>these clusters of words tell us something about how our corpus uses
              language</item>
<item>these clusters show semantic connections between words: they show us how words are
            related to one another within the world of the corpus. To be precise, though, the model
            only knows the language it has access to, so the relationships these clusters represent
            are the relationships words have within this specific corpus, not in language more
            broadly</item></list></p>
          <p>Let's analyze this idea a little further: <list>
<item>These <soCalled>neighborhoods</soCalled> aren’t just clusters of words that are
              impressionistically near one another: they are positioned in a spatial relationship
              to the rest of the model. Words don't just have relationships with other words in a
              shared cluster; clusters also have relationships with other clusters, represented by
              their distance or proximity in what is called vector space</item>
            <item>because word embedding models represent words numerically, that spatial relationship can be described mathematically</item>
            <item>the position of each word is represented by what is called a vector (essentially, vectors are
              lines that aim out at different angles and distances; they allow us to figure out where exactly in vector space a word is located) </item>
            <item>since vectors allow us to locate words in this space, this means that we can actually compare the position of one word mathematically
              with the position of another word, and we can represent the difference in their
              positions as: <emph>another vector</emph>. This means that we can use addition and subtraction to analyze language!
            </item>
            <item>We don’t want to examine that math just yet, but we can take advantage of it.
            </item>
          </list>
          </p>
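<p>The idea that the difference between two word positions is itself a vector we can reason with can be sketched in a few lines of Python. The two-dimensional coordinates below are invented for illustration (a trained model learns tens or hundreds of dimensions automatically), but the arithmetic is the same:</p>

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Invented 2-D positions; a real model would learn these from a corpus
vecs = {
    "king":  (0.9, 0.8),
    "queen": (0.9, 0.1),
    "man":   (0.1, 0.8),
    "woman": (0.1, 0.1),
}

# The difference of two positions is itself a vector we can add to a third
target = tuple(k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"]))

# Which remaining word lies closest to king - man + woman?
best = max((w for w in vecs if w != "king"), key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```

<p>With a trained model, a library such as gensim lets you ask the same question with a call like <mentioned>most_similar(positive=…, negative=…)</mentioned>.</p>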
          <p>If you get a chance to read Ryan Heuser's analysis of <q>riches</q> and <q>virtue</q>,
            or Ben Schmidt's analysis of the Rate My Professor data, where he considers
            <quote>breaking down the gender binary</quote>, both writers are taking advantage of this same
            idea: <list>
              <item>that we can use these vectors, these spatialized relationships between words, as
                an analytical tool</item>
              <item>and that although in a sense <soCalled>space</soCalled> is a metaphor here (or
                at least a purely mathematical kind of reality), nonetheless it has a level of
                internal consistency and truth-value that means we can do meaningful analyses based
                on it.</item>
            </list>
          </p>
        </tutorial>
      </section>

      <section>
        <head>Locating words in vector space</head>
        <slide>
          <figure>
            <graphic height="600px" url="../../../_utils/gfx/w2v_word_proximities.png"/>
          </figure>

        </slide>
        <lectureNote>
          <p>So how do those words get located in this <soCalled>space</soCalled>? What does the
            spatial metaphor really mean?</p>

          <p>We will go into the details much more fully, very soon. But for this initial
            orientation: <list>
              <item>This model of our corpus, in which each word is represented by a vector, is
                created through a <soCalled>training</soCalled> process, in which a software program
                works its way through the text, over and over, making observations about what words
                appear near one another</item>
              <item>essentially, building a model of the corpus that addresses the question <q>If I
                  have word X, what words are most likely to appear nearby?</q></item>
              <item>at each observation, it adjusts the position of the words</item>
              <item>by the end of the training process, the model contains very detailed information
                about where each word is positioned relative to all or most of the others: this
                information is more detailed the more thoroughly we do the training</item>
              <item>this training process can be varied depending on what actual task or insight or
                research we're trying to support: if we are Google and we're trying to develop text
                prediction systems, the most interesting words will be the single word right after
                word X. On the other hand, if we're digital humanists and we're trying to understand
                discourse more generally, the words surrounding word X might all be equally
                interesting. And in fact different researchers might be interested in the words very
                close to word X (words that suggest how syntax behaves) or in the words more loosely
                associated (which might suggest conceptual connections)</item>
            </list>
          </p>
          <p>This slide shows some actual quotations from WWO where the word
              <mentioned>danger</mentioned> occurs: <list>
              <item>if we imagine the training process working its way through the text and making
                observations, we can see that when it encounters the word
                  <mentioned>danger</mentioned> it repeatedly sees words nearby like
                  <mentioned>approaching</mentioned>, <mentioned>imminent</mentioned>,
                  <mentioned>apprehend</mentioned>: terms that convey futurity, threat, warning,
                causality, states of knowledge: these establish a semantic context</item>
<item>there are also function-words that appear nearby that don't carry semantic
                associations, but do establish that <mentioned>danger</mentioned> is a noun that can
                be the object of prepositions like <mentioned>to</mentioned> and can itself govern
                prepositions like <mentioned>of</mentioned> (as in <mentioned>danger
                of</mentioned>), which would assist in the Google-word-prediction kinds of
                tasks.</item>
            </list>
          </p>
        </lectureNote>
        <tutorial>
          <p>So how do those words get located in this <soCalled>space</soCalled>? What does the
            spatial metaphor really mean?</p>
          
          <p>We will go into the details much more fully, very soon. But for this initial
            orientation: <list>
              <item>This model of our corpus, in which each word is represented by a vector, is
                created through a <soCalled>training</soCalled> process, in which a software program
                works its way through the text, over and over, making observations about what words
                appear near one another</item>
              <item>essentially, the program is working towards building a model of the corpus that addresses the question <q>If I
                have word X, what words are most likely to appear nearby?</q></item>
              <item>at each observation, it adjusts the position of the words to reflect its increased understanding of the corpus</item>
              <item>by the end of the training process, the model contains very detailed information
                about where each word is positioned relative to all or most of the others within the corpus's vocabulary: this
                information is more detailed the more thoroughly we do the training</item>
              <item>this training process can be varied depending on what actual task or insight or
                research we're trying to support: if we are Google and we're trying to develop text
                prediction systems, the most interesting words will be the single word right after
                word X. On the other hand, if we're digital humanists and we're trying to understand
                discourse more generally, the words surrounding word X might all be equally
                interesting. And in fact different researchers might be interested in the words very
                close to word X (words that suggest how syntax behaves) or in the words more loosely
                associated (which might suggest conceptual connections)</item>
            </list>
          </p>
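<p>The <soCalled>observations</soCalled> made during training can be sketched as (target, context) pairs drawn from a sliding window. This toy Python function is an illustration, not the real implementation, but it shows how a window setting controls which words count as context:</p>

```python
def training_pairs(tokens, window=2):
    """Enumerate the (target, context) observations one training pass
    would make over a single sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "danger was approaching the city".split()
pairs = training_pairs(sentence, window=1)
print(pairs)
```

<p>With <mentioned>window=1</mentioned> only immediate neighbors are observed, favoring syntactic information; widening the window pulls in the looser associations that suggest conceptual connections.</p>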
          <p>This slide shows some actual quotations from WWO where the word
            <mentioned>danger</mentioned> occurs: <list>
              <item>if we imagine the training process working its way through the text and making
                observations, we can see that when it encounters the word
                <mentioned>danger</mentioned> it repeatedly sees words nearby like
                <mentioned>approaching</mentioned>, <mentioned>imminent</mentioned>,
                <mentioned>apprehend</mentioned>: terms that convey futurity, threat, warning,
                causality, states of knowledge: these establish a semantic context</item>
<item>there are also function-words that appear nearby that don't carry semantic
                associations, but do establish that <mentioned>danger</mentioned> is a noun that can
                be the object of prepositions like <mentioned>to</mentioned> and can itself govern
                prepositions like <mentioned>of</mentioned> (as in <mentioned>danger
                of</mentioned>), which would assist in the Google-word-prediction kinds of
                tasks.</item>
            </list>
          </p>
        </tutorial>
      </section>
      <section>
        <head>Multidimensionality</head>
        <slide>
          <figure>
            <graphic height="600px" url="../../../_utils/gfx/w2v_word_relationships_coordinates.png"/>
            <!--<graphic height="600px" url="../../../_utils/gfx/w2v_word_relationships.png"/>-->
          </figure>

        </slide>
        <lectureNote>
          <p>You may be thinking, as I did, <q>words have many different associations: if
                <soCalled>location in space</soCalled> is representing the semantic affiliations of
              each word, how can a word be in multiple places at one time?</q>
            <list>
              <item>In three-dimensional space, this would indeed be very difficult</item>
              <item>but in our word vector model, there are enormous numbers of dimensions; very
                difficult to picture</item></list></p>
          <p>In this diagram, on the left, the word <mentioned>bank</mentioned> has two
            associations: <list>
              <item>with the semantic space of money, and with the semantic space of rivers</item>
<item>in this very simple view, each of those relationships is expressed as a single
                dimension (the <mentioned>river</mentioned> association is on the y axis and the
                  <mentioned>money</mentioned> association is on the x axis)</item>
              <item>each line only has dimensionality/distance on that one axis, and the location of
                  <mentioned>bank</mentioned> is thus defined by two dimensions (easy to draw on a
                slide)</item>
            </list>
          </p>
          <p>On the right, we have a more complicated situation: the word <mentioned>set</mentioned>
            has many more associations. We can't draw an equivalent diagram, but we can still
            imagine: <list>
              <item>each relationship is on a single, distinct dimension</item>
              <item>there are just way more than two or three of these dimensions (we have to
                imagine them all sprouting off in five-dimensional space)</item>
              <item>and the position of <mentioned>set</mentioned> is defined by five
                dimensions</item>
              <item>so it's not that the word is in five different places at a time, but rather that
                its unique location within this cloud of vectors is based on information about those
                five relationships</item>
            </list>
          </p>
          <p>If this feels baffling right now, don't worry--in my experience this idea takes a
            little time to sink in. Let it sit in your mind as a metaphor for now: a big cloud of
            words, with neighborhoods of related words; closer words are more closely related.</p>
          <p>Questions at this stage?</p>
        </lectureNote>
        <tutorial>
          <p>You may be thinking, <q>words have many different associations: if
            <soCalled>location in space</soCalled> is representing the semantic affiliations of
            each word, how can a word be in multiple places at one time?</q>
            <list>
              <item>In three-dimensional space, this would indeed be very difficult</item>
              <item>but in our word vector model, there are enormous numbers of dimensions; it's very
                difficult to picture</item></list></p>
          <p>In this diagram, on the left, the word <mentioned>bank</mentioned> has two
            associations: <list>
              <item>with the semantic space of money, and with the semantic space of rivers</item>
<item>in this very simple view, each of those relationships is expressed as a single
                dimension (the <mentioned>river</mentioned> association is on the y axis and the
                <mentioned>money</mentioned> association is on the x axis)</item>
              <item>each line only has dimensionality/distance on that one axis, and the location of
                <mentioned>bank</mentioned> is thus defined by two dimensions (easy to draw on a
                slide)</item>
            </list>
          </p>
          <p>On the right, we have a more complicated situation: the word <mentioned>set</mentioned>
            has many more associations. We can't draw an equivalent diagram, but we can still
            imagine: <list>
              <item>each relationship is on a single, distinct dimension</item>
              <item>there are just way more than two or three of these dimensions (we have to
                imagine them all sprouting off in five-dimensional space)</item>
              <item>and the position of <mentioned>set</mentioned> is defined by five
                dimensions</item>
              <item>so it's not that the word is in five different places at a time, but rather that
                its unique location within this cloud of vectors is based on information about those
                five relationships</item>
            </list>
          </p>
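<p>The left-hand diagram can also be sketched numerically. The coordinates below are invented, with the x axis standing for the <mentioned>money</mentioned> association and the y axis for the <mentioned>river</mentioned> association; the distance formula that locates <mentioned>bank</mentioned> relative to its neighbors works the same way in two dimensions as in hundreds:</p>

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Invented coordinates: x = "money" association, y = "river" association
coords = {
    "bank":   (0.7, 0.6),
    "loan":   (0.9, 0.0),
    "shore":  (0.0, 0.9),
    "teacup": (0.1, 0.1),
}

# Distance from "bank" to each of the other words
for word in ("loan", "shore", "teacup"):
    print(word, round(dist(coords["bank"], coords[word]), 2))
```

<p>In these made-up coordinates <mentioned>loan</mentioned> and <mentioned>shore</mentioned> both sit nearer to <mentioned>bank</mentioned> than the unrelated <mentioned>teacup</mentioned> does, which is the spatial intuition the slide is drawing on.</p>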
          <p>If this feels baffling right now, don't worry--this idea takes a
            little time to sink in. Let it sit in your mind as a metaphor for now: a big cloud of
            words, with neighborhoods of related words; words that are closer are more closely related semantically.</p>
        </tutorial>
      </section>

      <section>
        <head>Factors that affect the behavior of the model</head>
        <slide>
<p><emph>Size</emph> of the corpus: <list><item>a larger corpus supports more precise
                positioning of uncommon words</item></list></p>
          <p><emph>Content</emph> of the corpus: <list>
              <item>genre? </item>
              <item>uniformity of language?</item>
            </list></p>
          <p><emph>Preparation</emph> of the corpus: <list>
              <item>correction of errors (e.g. from OCR)</item>
              <item>elimination of <soCalled>noise</soCalled></item>
            </list>
          </p>
          <p>The <emph>training process</emph>: <list>
              <item>parameters!! (coming up soon)</item>
            </list>
          </p>
        </slide>
        <lectureNote>
          <p>I mentioned earlier that we need to be attentive and critical about how this model is
            created; there are a number of things that affect how a word embedding model will
            perform for us.</p>
          <p>The <emph>size</emph> of the corpus matters a lot (and you'll remember that we
            specified that you had to have at least a million words): <list>
              <item>this is because the training process, where we actually create the model, starts
                from zero information: everything the model knows about where words are located, it
                learns from that training process, which goes through the text and observes what
                words are near what other words</item>
              <item>for common words, the training process gets a lot of data very quickly, but for
                uncommon words there's less information available</item>
              <item>it takes a certain minimum size corpus to provide enough information about each
                word (from repeated usage) to make the model reasonably accurate in its
                representation of less common words</item>
              <item>what other factors might be in play here? When might we be able to get away with
                a smaller corpus?</item>
            </list>
          </p>
          <p>The content of the corpus also matters a lot: <list>
              <item>what if you have a corpus where there are no common words? (what would be an
                example of such a corpus?)</item>
              <item>what about a corpus in multiple languages?</item>
              <item>some genres are much more vocabulary-dense than others: for instance, poetry has
                more uncommon words, less filler; novels use more commonplace words; a corpus of
                technical documents might have a very large proportion of uncommon words (how might
                that affect our model?)</item>
            </list>
          </p>
          <p>The data preparation also matters a <emph>lot</emph> (and we're going to spend two
            whole sessions on this later on): <list>
              <item>remember that a <soCalled>word</soCalled> here is any token, any string of
                characters with space around it, so if the text has lots of typographical errors,
                each incorrect word will still count as a unique word; how might that affect our
                model?</item>
              <item>similarly, our corpus might contain things like page numbers, stage directions,
                running headers: would those be useful? inconvenient?</item> </list>
          </p>
          <p>And finally, the training process matters: <list>
              <item>during the training process, we can control various settings that affect what
                observations are made about the texts, and how that information is used</item>
              <item>we will also explore this at greater length over the next few days</item>
            </list>
          </p>
        </lectureNote>
        <tutorial>
          <p>As mentioned earlier, we need to be attentive and critical about how this model is
            created; there are a number of things that affect how a word embedding model will
            perform for us.</p>
          <p>The <emph>size</emph> of the corpus matters a lot (and you'll remember that we
            specified that you typically have to have at least a million words): <list>
              <item>this is because the training process, where we actually create the model, starts
                from zero information: everything the model knows about where words are located, it
                learns from that training process, which goes through the text and observes what
                words are near what other words</item>
              <item>for common words, the training process gets a lot of data very quickly, but for
                uncommon words there's less information available</item>
              <item>it takes a certain minimum size corpus to provide enough information about each
                word (from repeated usage) to make the model reasonably accurate in its
                representation of less common words</item>
              <item>what other factors might be in play here? When might we be able to get away with
                a smaller corpus?</item>
            </list>
          </p>
          <p>The content of the corpus also matters a lot: <list>
            <item>what if you have a corpus where there are no common words? (what would be an
              example of such a corpus?)</item>
            <item>what about a corpus in multiple languages?</item>
            <item>some genres are much more vocabulary-dense than others: for instance, poetry has
              more uncommon words, less filler; novels use more commonplace words; a corpus of
              technical documents might have a very large proportion of uncommon words (how might
              that affect our model?)</item>
          </list>
          </p>
          <p>The data preparation also matters a <emph>lot</emph> (and we're going to spend two
            whole sessions on this later on): <list>
              <item>remember that a <soCalled>word</soCalled> here is any token, any string of
                characters with space around it, so if the text has lots of typographical errors,
                each incorrect word will still count as a unique word; how might that affect our
                model?</item>
              <item>similarly, our corpus might contain things like page numbers, stage directions,
                running headers: would those be useful? inconvenient?</item> </list>
          </p>
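<p>A sense of why preparation matters comes through even in a trivial tokenizer. This sketch is one hypothetical choice among many (real preparation pipelines are more careful): it lowercases the text and keeps only alphabetic strings:</p>

```python
import re

def tokenize(text):
    """Lowercase and keep only alphabetic strings, so that 'Danger!' and
    'danger,' both count as the token 'danger'."""
    return re.findall(r"[a-z]+", text.lower())

raw = "Danger! The danger, p. 42, was imminent."
tokens = tokenize(raw)
print(tokens)
```

<p>Notice that the stray <mentioned>p</mentioned> from the page reference survives as a <soCalled>word</soCalled> even though the number is dropped: exactly the kind of noise that, multiplied across a corpus, distorts a model.</p>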
          <p>And finally, the training process matters: <list>
            <item>during the training process, we can control various settings that affect what
              observations are made about the texts, and how that information is used</item>
            <item>we will also explore this at greater length over the next few days</item>
          </list>
          </p>
        </tutorial>
      </section>
      <section>
        <head>Comparison with other forms of text analysis</head>
        <slide>
          <p>Other forms you might have heard of: <list>
              <item>Word frequency analysis and concordancing (for instance, Voyant Tools)</item>
              <item>Topic models</item>
            </list>
          </p>
        </slide>
        <lectureNote>
          <p>As part of our orientation, it may also be helpful to situate word embedding in
            relation to some other kinds of digital analysis we may already be familiar with; all of
            these are ways to get an understanding of <emph>texts at scale</emph></p>
          <p>Has anyone here already experimented with word frequency, for instance with Voyant
            tools? <list>
              <item>Just what it sounds like: computing the frequency of different words in the
                corpus, possibly comparing frequency of words between different texts</item>
              <item>including their relative frequency (that is, frequency that has been normalized,
                such as frequency per thousand words)</item>
              <item>useful as a way to get a sense of the vocabulary of a text</item>
              <item>can be used even on small collections and individual texts</item>
            </list>
          </p>
          <p>How about topic models: has anyone used those? For instance, tools like Mallet? <list>
              <item>Topic models are closer to word embedding models</item>
              <item>They are trained models: that is, we go through a training process that examines
                a text corpus and generates a model based on it</item>
              <item>A topic model assigns words to topics based on their occurrence within the same
                document: it gives you a view of the document collection that represents the
                  <soCalled>topics</soCalled> or patterns of word collocation that appear in
                them</item>
              <item>but it doesn't pay attention to where they occur within that document: it treats
                the whole document as a single <soCalled>bag of words</soCalled></item>
              <item>A topic model can be generated from a small text collection</item>
            </list>
          </p>
          <p>What's distinctive about word embedding models: <list>
              <item>they give you a view of semantic relationships and spaces within the model (i.e.
                the corpus) as a whole</item>
              <item>they pay much closer attention to word proximity than topic models do: they use
                information about the immediate context of a word</item>
              <item>they don't pay attention to individual documents during the training process
                (and there's no way to get back to the individual documents once the model is
                trained)</item>
              <item>they require a much larger corpus to get meaningful results</item>
              <item>they give us much more information about the semantics of individual words,
                whereas topic models mostly give us a view of the <soCalled>topics</soCalled> rather
                than the individual words in the topic</item>
            </list>
          </p>
          <p>The larger question of what word embedding models are distinctively good for is one
            that we will explore as a group in the rest of the institute!</p>
        </lectureNote>
        <tutorial>
          <p>As part of our orientation, it may also be helpful to situate word embedding in
            relation to some other kinds of digital analysis we may already be familiar with; all of
            these are ways to get an understanding of <emph>texts at scale</emph></p>
          <p>Have you ever experimented with word frequency, for instance with Voyant
            tools? <list>
              <item>This type of analysis is just what it sounds like: computing the frequency of different words in the
                corpus, possibly comparing frequency of words between different texts</item>
              <item>including their relative frequency (that is, frequency that has been normalized,
                such as frequency per thousand words)</item>
              <item>this method is useful as a way to get a sense of the vocabulary of a text</item>
              <item>it can be used even on small collections and individual texts</item>
            </list>
          </p>
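<p>Word frequency analysis is simple enough to sketch directly. This Python fragment counts tokens in an invented sample and normalizes to frequency per thousand words, the kind of relative frequency a tool like Voyant reports:</p>

```python
from collections import Counter

tokens = "the danger was great and the fear of danger was greater".split()
counts = Counter(tokens)
total = len(tokens)

# Relative frequency: raw count normalized per 1,000 words
for word, n in counts.most_common(3):
    print(word, n, round(n / total * 1000, 1))
```

<p>Because it is just counting, this method works on a single short text; no training and no minimum corpus size are needed.</p>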
          <p>How about topic models: is this methodology familiar to you? For instance, tools like Mallet? <list>
<item>Topic models are much more similar to word embedding models than word frequency
              counts are</item>
            <item>Topic models are trained models: that is, we go through a training process that examines
              a text corpus and generates a model based on it</item>
            <item>A topic model assigns words to topics based on their occurrence within the same
              document: it gives you a view of the document collection that represents the
              <soCalled>topics</soCalled> or patterns of word collocation that appear in
              them</item>
            <item>but it doesn't pay attention to where they occur within that document: it treats
              the whole document as a single <soCalled>bag of words</soCalled></item>
            <item>A topic model can be generated from a small text collection</item>
          </list>
          </p>
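The "bag of words" idea can be shown directly: a small Python sketch (the two sentences are invented) demonstrating that documents containing the same words in different orders are indistinguishable once word order is discarded:

```python
from collections import Counter

# Two invented "documents" containing the same words in a different order
doc_a = "love conquers all things".split()
doc_b = "all things love conquers".split()

# A bag-of-words representation keeps only the word counts,
# discarding word order entirely
bag_a = Counter(doc_a)
bag_b = Counter(doc_b)

print(bag_a == bag_b)   # True: word order is invisible to the model
```

This is exactly the information a topic model sees for each document, and exactly the information that word embedding models recover by attending to immediate word context.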
          <p>What's distinctive about word embedding models: <list>
            <item>they give you a view of semantic relationships and spaces within the model (i.e.
              the corpus) as a whole</item>
            <item>they pay much closer attention to word proximity than topic models do: they use
              information about the immediate context of a word</item>
            <item>they don't pay attention to individual documents during the training process
              (and there's no way to get back to the individual documents once the model is
              trained)</item>
            <item>they require a much larger corpus to get meaningful results</item>
            <item>they give us much more information about the semantics of individual words,
              whereas topic models mostly give us a view of the <soCalled>topics</soCalled> rather
              than the individual words in the topic</item>
          </list>
          </p>
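To give a feel for what a "view of semantic relationships and spaces" means in practice, here is a minimal sketch with invented three-dimensional vectors (a trained model would supply hundreds of dimensions learned from a large corpus) showing how cosine similarity measures closeness in the vector space:

```python
import math

# Invented toy word vectors, purely for illustration;
# real embeddings are learned, not hand-assigned
vectors = {
    "queen":  [0.9, 0.8, 0.1],
    "king":   [0.8, 0.9, 0.1],
    "carrot": [0.1, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_royal = cosine(vectors["queen"], vectors["king"])
sim_odd = cosine(vectors["queen"], vectors["carrot"])
print(sim_royal > sim_odd)   # True: "queen" sits nearer "king" than "carrot"
```

Queries against a trained model (nearest neighbors, analogies, and so on) are all built on comparisons like this one.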
          <p>The larger question of what word embedding models are distinctively good for is one
            that we will explore later on in this walkthrough!</p>
        </tutorial>
      </section>

      <section>
        <head>Disclaimers! Questions?</head>
        <slide>
          <p>Questions?</p>
        </slide>
        <lectureNote>
          <p>I should note here: we have been working hard to understand word embedding models and
            develop this curriculum; however, the underlying math is undeniably challenging. At some
            points in the next few days, I anticipate that you're going to have questions that we
            actually can't answer, because we haven't yet fully mastered that deeper layer. We're
            going to treat these as learning and teaching moments! After all, these are also
            questions that our students and colleagues will be asking us. So part of what we're
            exploring here is how to understand the boundaries of what we know, and how to respond
            effectively based on that knowledge, whatever level we may be at.</p>
          <p>Questions at this stage?</p>
        </lectureNote>
        <tutorial>
          <p>As a quick disclaimer: we have been working hard to understand word embedding models and
            develop this curriculum; however, the underlying math is undeniably challenging. At some
            points we anticipate that you're going to have questions that will not be easily answered
            or covered in detail in this walkthrough because we haven't yet fully mastered that deeper layer. Let's 
            treat these as learning and teaching moments! After all, these are also
            questions that our students and colleagues will be asking us. So part of what we're
            exploring here is how to understand the boundaries of what we know, and how to respond
            effectively based on that knowledge, whatever level we may be at.</p>
          
        </tutorial>
      </section>

    </presentation>
  </text>
</TEI>
