<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://wwp-test.northeastern.edu/outreach/seminars/_utils/schema/yaps.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="https://wwp-test.northeastern.edu/outreach/seminars/_utils/schema/yaps.isosch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-stylesheet type="text/css" href="../../../_utils/stylesheets/yaps-tei.css"?>
<!-- $Id: word_vectors_process.xml 51374 2026-04-01 16:35:16Z aclark $ -->
<TEI xmlns="http://www.wwp.northeastern.edu/ns/yaps" version="5.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Processes for Data Preparation and Model Training</title>
        <author>Julia Flanders</author>
        <author>Sarah Connell</author>
      </titleStmt>
      <editionStmt>
        <edition>Word Vectors for the Thoughtful Humanist, Northeastern University,
          2019-07</edition>
      </editionStmt>
      <publicationStmt>
        <distributor>Women Writers Project (via website)</distributor>
        <address>
          <addrLine>url:mailto:wwp@neu.edu</addrLine>
        </address>
        <date when="2019-07-17"/>
        <availability status="restricted">
          <p>Copyright 2007 Syd Bauman, Julia Flanders, and the Women Writers Project</p>
          <p>This TEI-encoded XML file is available under the terms of the <ref
              target="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons
              Attribution-ShareAlike 3.0 (Unported)</ref> license.</p>
        </availability>
        <pubPlace>Boston, MA USA</pubPlace>
      </publicationStmt>
      <notesStmt>
        <note>
          <p/>
        </note>
      </notesStmt>
      <sourceDesc>
        <p>Covers the essentials for data preparation and model training for word2vec.</p>
      </sourceDesc>
    </fileDesc>
    <revisionDesc>
      <change who="personography.xml#sconnell.yuw" when="2019-07-12">First-round proofing
        complete.</change>
      <change who="personography.xml#sconnell.yuw" when="2019-04-01">Created file.</change>
    </revisionDesc>
  </teiHeader>
  <text>
    <presentation>
      <abstract>
        <p>This tutorial contains an overview of the processes for data preparation and for training
          and testing word embedding models. It covers how to select and prepare texts, how to set
          parameters and train models, and how to test and validate trained models.</p>
      </abstract>
      <section>
        <head>Process overview</head>
        <slide>
          <figure>
            <graphic width="100%" url="../../../_utils/gfx/w2v_process_overview.png"/>
            <!-- Update from earlier slides -->
          </figure>
        </slide>
        <lectureNote>
          <p>Models like word2vec require large corpora of texts as their input but, fortunately,
            the data preparation processes are fairly straightforward. Essentially, all that's
            required is a large number of words in plain text. </p>
          <p>Because each model operates on the level of the full corpus, the text selection process
            is crucial, and is best accompanied by an analysis phase that considers: how the texts
            in the corpus pertain to the research questions at stake, how the characteristics of the
            corpus (such as variations in language or literary forms) could impact results, and
            which options for data preparation will be best suited for the project. </p>
          <p>After a corpus has been selected and prepared, the next step is model training. Here,
            again, it is useful to include an analysis phase to determine which configurations will
            be most effective with the corpus and research questions—followed by a testing process
            to see whether those configurations need to be adjusted. </p>
        </lectureNote>
        <tutorial>
          <p>This tutorial is going to focus on how to prepare your data for training and testing a word2vec model.</p>
          <p>Models like word2vec require large corpora of texts as their input but, fortunately,
            the data preparation processes are fairly straightforward. Essentially, all that's
            required is a large number of words in plain text. Plain text means that the text is machine-readable and not, for example, a PDF file.</p>
          <p>Because each model operates on the level of the full corpus, the text selection process
            is crucial, and is best accompanied by an analysis phase where you consider: how the texts
            in the corpus pertain to the research questions at stake, how the characteristics of the
            corpus (such as variations in language or literary forms) could impact results, and
            which options for data preparation will be best suited for the project. </p>
          <p>After a corpus has been selected and prepared, the next step is model training. Here,
            again, it is useful to include an analysis phase so that you can determine which parameters will
            be most effective for your corpus and research questions—followed by a testing process
            to see whether those configurations need to be adjusted. </p>
        </tutorial>

      </section>

      <section>
        <head>For example: Analyzing city reputation with Airbnb data</head>
        <slide>
          <figure>
            <graphic width="100%" url="../../../_utils/gfx/w2v_process_overview_airbnb.png"/>
          </figure>
        </slide>
        <lectureNote>
          <p>Let's add some greater specificity to that overview by walking through a brief example. </p>
          <p>A research project exploring the different characteristics ascribed to cities by
            tourists might decide to work with a set of Airbnb reviews. Many review corpora are
            available online—for a project focused on comparison, the researcher would want to
            select reviews for each city that are roughly similar in their composition—which would
            include the raw numbers and lengths of reviews, as well as factors such as the dates
            represented in each corpus. The researcher would then analyze their corpora to identify
            any features of the data that might impact results in unwanted ways—for example, if the
            dataset contained anything apart from the text of the reviews themselves, like reviewer
            handles or posting dates, the researcher would want to remove those. In training the
            model, the researcher would consider which parameters would work best with this corpus
            and research question; for example, reviews are likely to be short and focused in their
            language, so a fairly small window size would likely provide better results. Finally,
            the researcher would validate the model to test both its consistency and the accuracy of
            its representations of relationships between words. </p>

        </lectureNote>
        <tutorial>
          <p>Let's add some greater specificity to that broad overview by walking through a brief example. </p>
          <p>A research project exploring the different characteristics ascribed to cities by
            tourists might decide to work with a set of Airbnb reviews. Many review corpora are
            available online—for a project focused on comparison, the researcher would want to
            select reviews for each city that are roughly similar in their composition—which would
            include the raw numbers and lengths of reviews, as well as factors such as the dates
            represented in each corpus. The researcher would then analyze their corpora to identify
            any features of the data that might impact results in unwanted ways—for example, if the
            dataset contained anything apart from the text of the reviews themselves, like reviewer
            handles or posting dates, the researcher would want to remove those.</p> <p>In training the
            model, the researcher would consider which parameters would work best with this corpus
            and research question; for example, reviews are likely to be short and focused in their
            language, so a fairly small window size would likely provide better results. Finally,
            the researcher would validate the model to test both its consistency and the accuracy of
            its representations of relationships between words. </p>
        </tutorial>

      </section>
      <section>
        <head>Data preparation</head>
        <slide>
          <p>For more details on the decisions you can make at each stage of the data preparation
            process, see <ref
              target="https://www.wwp.northeastern.edu/outreach/seminars/_current/handouts/word_vectors/data_checklist.html"
              >this checklist</ref>.</p>
          <figure>
            <graphic width="100%" url="../../../_utils/gfx/w2v_process_overview1.png"/>

          </figure>
        </slide>
        <lectureNote>

          <p>As we've seen, the first stage of the overall process is assembling a corpus, analyzing
            it, and performing the necessary data preparation. </p>


        </lectureNote>
        
        <tutorial>
          <p>As we've seen, the first stage of the overall process is assembling a corpus, analyzing
            it, and performing the necessary data preparation. </p>
        </tutorial>

      </section>
      <section>
        <head>Data sources</head>
        <slide>
          <list>
            <item><emph>Pre-built plain-text corpora</emph>—such as from <ref
                target="https://docsouth.unc.edu/browse/collections.html">DocSouth</ref> or <ref
                target="http://graphics.cs.wisc.edu/WP/vep/">Visualizing Early Print</ref></item>
            <item><emph>Hand-assembled plain-text corpora</emph>—from sources such as <ref
                target="https://www.gutenberg.org/">Project Gutenberg</ref></item>
            <item><emph>TEI data</emph>—from sources such as the <ref
                target="http://ota.ox.ac.uk/tcp/">Text Creation Partnership</ref> or the <ref
                target="http://webapp1.dlib.indiana.edu/TEIgeneral/welcome.do?brand=wright">Wright
                American Fiction</ref> collections</item>
            <item><emph>Optical Character Recognition (OCR) data</emph>—either assembled from
              existing collections, such as the <ref target="https://archive.org/">Internet
                Archive</ref>, or generated from facsimiles using OCR software</item>
          </list>

        </slide>
        <lectureNote>
          <p>Now let's consider each step of the overall process in more detail. </p>
          <!-- For each of the below, ask the participants who are working with each kind of corpus to discuss their projects and data preparation. -->
          <p>Let's look at a few sample sources, beginning with ones that will require the least
            amount of intervention: pre-built corpora that are already in plain text format and that
            contain only or primarily the contents of the texts that the researcher wishes to
            analyze. These are often the easiest to use, and are generally curated by individuals
            with expertise in the collections, but will likely still require some additional
            processing. Selecting individual texts to build a corpus by hand is more time-consuming,
            but also provides finer control. And, of course, these two approaches can be
            combined, by merging a set of hand-curated texts with some or all of an existing corpus,
            or by modifying a corpus to remove or add texts. </p>

          <p>Some projects might work with texts that will require more extensive intervention. For
            example, working with a corpus of TEI texts will allow for considerable control over the
            final formats and contents of the plain-text files, but will also require additional
            preparation work in transforming the texts. </p>
          <p>In some cases, a project may need to work with plain-text files generated from optical
            character recognition, or OCR. Unless the OCR'd text has been subsequently reviewed and
            corrected, it is likely to contain a number of errors, which will lead to noise in the
            word2vec results. However, because vector space models operate at a large scale, it is
            still possible to get useful results, even from fairly messy data. </p>
        </lectureNote>
        <tutorial>
          <p>Now, let's consider each step of the overall process in more detail. </p>
          <!-- For each of the below, ask the participants who are working with each kind of corpus to discuss their projects and data preparation. -->
          <p>Let's look at a few sample sources, beginning with ones that will require the least
            amount of intervention: pre-built corpora that are already in plain text format and that
            contain only or primarily the contents of the texts that the researcher wishes to
            analyze. These are often the easiest to use, and are generally curated by individuals
            with expertise in the subject area of the corpus, but will likely still require some additional
            processing. </p>
            <p>Selecting individual texts to build a corpus by hand is more time-consuming,
            but also provides finer control and allows you to tailor the corpus more specifically to your research questions. And, of course, these two approaches can be
            combined, by merging a set of hand-curated texts with some or all of an existing corpus,
            or by modifying a corpus to remove or add texts. </p>
          
          <p>Some projects might work with texts that will require more extensive intervention. For
            example, working with a corpus of TEI texts will allow for considerable control over the
            final formats and contents of the plain-text files, but will also require additional
            preparation work in transforming the texts, for example removing unnecessary tags. </p>
          <p>In some cases, a project may need to work with plain-text files generated from optical
            character recognition, or OCR. Unless the OCR'd text has been subsequently reviewed and
            corrected, it is likely to contain a number of errors, which will lead to noise in the
            word2vec results. However, because vector space models operate at a large scale, it is
            still possible to get useful results, even from fairly messy data. </p>
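          <p>If you are assembling a plain-text corpus by hand, a few lines of R are enough to
            gather the files into a single character vector for later processing. This is only a
            minimal sketch: the <q>corpus</q> folder name and the one-file-per-text layout are
            assumptions, not requirements of any particular tool. <eg><![CDATA[# Gather every .txt file in a (hypothetical) "corpus" folder
files <- list.files("corpus", pattern = "\\.txt$", full.names = TRUE)

# Read each file and collapse it into one long string per text
texts <- vapply(files, function(f) {
  paste(readLines(f, warn = FALSE), collapse = " ")
}, character(1))

length(texts)  # one element per document in the corpus]]></eg></p>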
        </tutorial>

      </section>
      <section>
        <head>Data types</head>
        <slide>
          <figure>
            <graphic height="300px" url="../../../_utils/gfx/txt-ocr-tei.png"/>
          </figure>
        </slide>
        <lectureNote>
          <p>Looking at examples of several common data types, let's consider each in turn: 
            
          <list>
            <item>What do
              you think the challenges might be for projects working with each of these kinds of data?</item>
            <item>What can be done with these different kinds of data?</item>
            <item> What would be easier or harder to
              do?</item>
            <item> How might these different data types impact the results one would get in models
              trained on them?</item>
          </list>
          
          
          </p>
          
         

        </lectureNote>
        <tutorial>
          <p>Looking at examples of several common data types, let's consider each in turn: 
            
            <list>
              <item>What do
                you think the challenges might be for projects working with each of these kinds of data?</item>
              <item>What can be done with these different kinds of data?</item>
              <item> What would be easier or harder to
                do?</item>
              <item> How might these different data types impact the results one would get in models
                trained on them?</item>
            </list>
            
            
          </p>
        </tutorial>
      </section>
      <section>
        <head>Data analysis</head>
        <slide>
          <list>
            <head>Questions to consider:</head>
            <item>What kinds of metadata do you have? Can you use your metadata to select texts or
              subset your corpus?</item>
            <item>Are there any editorial artifacts from the texts’ transcription (figure
              descriptions, uncertainty markers, annotations)? Can these be systematically
              identified?</item>
            <item>Are there any features from within the documents themselves that you might want to
              remove (page numbers, labels)?</item>
            <item>How do you want to handle paratexts, such as tables of contents, appendices, or
              prefatory materials?</item>
            <item>Do you want to regularize spelling or any other aspects of your texts?</item>
          </list>
        </slide>
        <lectureNote>
          <p>Because our results depend so heavily on our input data—and because so much of this
            work happens at considerable abstraction from our texts—it's crucial to include a data
            analysis phase early in a project. </p>
          <p>In fact, data preparation and analysis should be iterative: reviewing texts,
            identifying where the data needs to be adjusted, making those changes, reviewing the
            results, identifying additional changes, and so on. It is also important to develop and
            implement a system for keeping track of all the changes made to the texts in a
            corpus.</p>
          <p>Some common information types that are often included with digital texts will need to
            be removed for most projects. One such example is metadata generated by the project
            publishing the digital collection. Others are editorially-authored text (such as
            annotations or descriptions of images), page numbers, and labels. Removing these is
            preferable both because they are not likely to be of interest in most cases and also
            because they can artificially introduce distance between closely related terms—remember
            that the model is relying on proximity to determine relatedness. </p>
          <p>In conducting data analysis, we review not only for which extratextual features are
            present but also for how they are being marked within the text, so that undesired
            features can be removed programmatically wherever possible. For example, if all of the
            page numbers in a document collection are enclosed in square brackets, and no other text
            is marked out in this way, we can use regular expressions to delete all the numerals
            inside of square brackets at once. </p>
          <p>For other document features, the goals of the project will impact which would best be
            removed or kept. These would include paratexts, such as indices, tables of contents, and
            advertisements, as well as features like stage directions. </p>
          <p>And finally, we may choose to manipulate the language of our documents directly, such
            as by regularizing or modernizing the spelling. Note that if you make substantial
            changes to the language of your documents, you will also want to maintain an unmodified
            corpus, so that you can investigate the impacts of your data manipulations. </p>

        </lectureNote>
        <tutorial>
          <p>Because our results depend so heavily on our input data—and because so much of this
            work happens at considerable abstraction from our texts—it's crucial to include a data
            analysis phase early in a project. </p>
          <p>In fact, data preparation and analysis should be iterative: reviewing texts,
            identifying where the data needs to be adjusted, making those changes, reviewing the
            results, identifying additional changes, and so on. It is also important to develop and
            implement a system for keeping track of all the changes made to the texts in a
            corpus.</p>
          <p>Some common information types that are often included with digital texts will need to
            be removed for most projects. One such example is metadata generated by the project
            publishing the digital collection. Others are editorially-authored text (such as
            annotations or descriptions of images), page numbers, and labels. Removing these is
            preferable both because they are not likely to be of interest in most cases and also
            because they can artificially introduce distance between closely related terms—remember
            that the model is relying on proximity to determine relatedness. </p>
          <p>In conducting data analysis, we review not only for which extratextual features are
            present but also for how they are being marked within the text, so that undesired
            features can be removed programmatically wherever possible. For example, if all of the
            page numbers in a document collection are enclosed in square brackets, and no other text
            is marked out in this way, we can use regular expressions to delete all the numerals
            inside of square brackets at once. </p>
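          <p>As a minimal sketch of that kind of programmatic removal in R (assuming the corpus is
            already loaded as a character vector called <q>texts</q>), a single regular expression
            strips bracketed page numbers from every document at once: <eg><![CDATA[# Remove bracketed page numbers such as [42] wherever they occur
texts <- gsub("\\[[0-9]+\\]", "", texts)]]></eg></p>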
          <p>For other document features, the goals of the project will impact which would best be
            removed or kept. These would include paratexts—such as indices, tables of contents, and
            advertisements—as well as features like stage directions. </p>
          <p>And finally, we may choose to manipulate the language of our documents directly, such
            as by regularizing or modernizing the spelling. Note that if you make substantial
            changes to the language of your documents, you will also want to maintain an unmodified
            corpus, so that you can investigate the impacts of your data manipulations. </p>
        </tutorial>
      </section>
      <section>
        <head>More on regularization and correction</head>
        <slide>
          <list>
            <item><emph>OCR Errors:</emph> f1sh, flsh, fifh</item>
            <item><emph>Errors and corrections:</emph> ibside → inside</item>
            <item><emph>Abbreviations and expansions:</emph> s. → shillings</item>
            <item><emph>Original spellings:</emph> queen, quean, queene, quéene</item>
          </list>


        </slide>
        <lectureNote>

          <p>Regularization and correction might involve many different kinds of data issues, from
            addressing particularly common OCR errors to applying automated regularization routines,
            particularly for historical documents. Projects working with document collections that
            have manually marked errors and their corrections, or editorial regularizations, will
            likely decide to keep the corrections and regularizations and omit the errors and
            original spellings. </p>
          <p>It might seem that more regularization is always better, but that's not necessarily the
            case. Decisions about regularization should take into account how spelling variations
            are operating in the input corpus, and should consider where original spellings and word
            usages might have implications for the interpretations that can be drawn from models
            trained on a corpus. For example, a collection might contain deliberate archaisms that
            are connected with poetic voice, which would be flattened in the regularized text.
            Nevertheless, regularization is worth considering, particularly for projects invested in
            exploring the contexts of particular terms over time: it might not be important whether
            the spelling is <q>queen,</q>
            <q>quean,</q> or <q>queene,</q> for a project studying discourse around queenship within
            a broad chronological frame.</p>

        </lectureNote>
        <tutorial>
          <p>Regularization and correction might involve many different kinds of data issues, from
            addressing particularly common OCR errors to applying automated regularization routines,
            particularly for historical documents. Projects working with document collections that
            have manually marked errors and their corrections, or editorial regularizations, will
            likely decide to keep the corrections and regularizations and omit the errors and
            original spellings. </p>
          <p>It might seem that more regularization is always better, but that's not necessarily the
            case. Decisions about regularization should take into account how spelling variations
            are operating in the input corpus, and should consider where original spellings and word
            usages might have implications for the interpretations that can be drawn from models
            trained on a corpus. For example, a collection might contain deliberate archaisms that
            are connected with poetic voice, which would be flattened in the regularized text.
            Nevertheless, regularization is worth considering, particularly for projects invested in
            exploring the contexts of language over time: it might not be important whether
            the spelling is <q>queen,</q>
            <q>quean,</q> or <q>queene,</q> for a project studying discourse around queenship within
            a broad chronological frame.</p>
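          <p>One lightweight way to apply such regularizations is a lookup table of variant
            spellings and their preferred forms. A minimal sketch in R follows; the particular
            variants are illustrative, and a real project would build its table from analysis of
            the corpus. <eg><![CDATA[# Hypothetical table of variant -> regularized spellings
variants <- c("queene" = "queen", "quean" = "queen")

# Replace whole words only, one variant at a time
for (v in names(variants)) {
  texts <- gsub(paste0("\\b", v, "\\b"), variants[[v]], texts, perl = TRUE)
}]]></eg></p>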
        </tutorial>

      </section>

      <section>
        <head>Final text preparation</head>
        <slide>
          <list>
            <item>Lowercasing all words</item>
            <item>Removing most punctuation</item>
            <item>Optional: tokenizing common phrases</item>
          </list>

          <p><emph>Before:</emph> Yet, the present era has give indisputable proofs, that woman is a
            thinking and an enlightened being! We have seen a Wollstonecraft, a Macaulay, a Sévigné;
            and many others, now living, who embellish the sphere of literary splendour, with genius
            of the first order. </p>
          <p><emph>After:</emph> yet the present era has give indisputable proofs that woman is a
            thinking and an enlightened being we have seen a wollstonecraft a macaulay a sévigné and
            many others now living who embellish the sphere of literary splendour with genius of the
            first order</p>
          <p><emph>Before:</emph> If the breath be short let it take an Electuary of Honey and
            Linseed, and anoint the ears and parts about them with Olive oil. </p>
          <p><emph>After:</emph> if the breath be short let it take an electuary of honey and
            linseed and anoint the ears and parts about them with olive_oil</p>



        </slide>
        <lectureNote>
          <p>Regardless of how a project approaches more extensive regularizations, most projects
            will choose to lowercase all of the words in the input corpus and remove most
            punctuation. Projects can also make decisions about how to handle cases such as
            contractions, which might be treated as either one word or two, as well as commonly
            occurring word-pairings, such as <q>olive oil</q>, which can be tokenized so that the
            model will treat them as single terms. </p>
          
          


        </lectureNote>
<tutorial>
  <p>Regardless of how a project approaches more extensive regularizations, most projects
    will choose to lowercase all of the words in the input corpus and remove most
    punctuation. Projects can also make decisions about how to handle cases such as
    contractions, which might be treated as either one word or two, as well as commonly
    occurring word-pairings, such as <q>olive oil</q>, which can be tokenized so that the
    model will treat them as single terms. </p>
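  <p>A minimal sketch of these final steps in R, again assuming the prepared corpus is held in a
    character vector called <q>texts</q>; the underscore convention for phrases such as
    <q>olive_oil</q> follows the example above. <eg><![CDATA[# Lowercase everything
texts <- tolower(texts)

# Remove punctuation, then collapse the extra whitespace left behind
texts <- gsub("[[:punct:]]+", " ", texts)
texts <- gsub("\\s+", " ", texts)

# Optional: join a known phrase into a single token
texts <- gsub("olive oil", "olive_oil", texts, fixed = TRUE)]]></eg></p>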
  

</tutorial>
      </section>
      <section>
        <head>A special case: TEI data</head>
        <slide>
          <p>
            <emph>Removing project-authored text:</emph>
            <eg><![CDATA[<figDesc>A portrait of a woman</figDesc>]]></eg>
          </p>
          <p>
            <emph>Removing speaker labels:</emph>
            <eg><![CDATA[<sp who="#ign">
 <speaker>Lady Ignorant</speaker>
  <p>Lord Husband! I can never have your company, for you are at all times writing,
   or reading, or turning your Globes, or peaking through your Perspective Glasse,
   or repeating Verses, or speaking Speaches to yourself.
  </p>
</sp>]]></eg>
          </p>
          <p>
            <emph>Advanced tokenization:</emph>
            <eg><![CDATA[<p>I was just returning from our <placeName>New London</placeName> summer
  home where I had seen the most marvelous performance of the 
  new <orgName><placeName>London</placeName> Symphony Orchestra</orgName> traveling 
  production of <title><persName>Peter</persName> and the Wolf</title>.
</p>]]></eg>
          </p>

        </slide>
        <lectureNote>
          <p>TEI data requires some additional work in transformation, but it also provides
            finer control over the contents of the input corpus. For example, if a TEI corpus is
            marking figure descriptions with <gi>figDesc</gi>, then the researcher does not have to
            use punctuation to identify figure descriptions—the <gi>figDesc</gi> element makes these
            figure descriptions both unambiguous and easy to remove programmatically. </p>
          <p>In a similar vein, we might decide to remove the contents of the <gi>speaker</gi>
            labels in all our texts. Speaker labels are a particularly challenging case for working
            with vector space models because the models are making inferences about meaning based on
            proximity, so, for example, if a collection has multiple repetitions of <q>Lady
              Ignorant</q> because one play includes a character of that name, that will have an
            impact on the results—and probably not one that the researcher was expecting.</p>
          <p>TEI can also be very powerful for selecting subsets of documents; for example, the TEI
              <gi>body</gi> element allows for the selection of the main body of a text, excluding
            front and back matter altogether. Finally, text encoding can be used for tokenization
            that is much more precise than automated routines. The markup here specifies that one of
            the instances of <q>New London</q> is a single place name, while the other is not. </p>


        </lectureNote>
        <tutorial>
          <p>TEI data requires some additional work in transformation, but it also provides
            finer control over the contents of the input corpus. For example, if a TEI corpus is
            marking figure descriptions with <gi>figDesc</gi>, then the researcher does not have to
            use punctuation to identify figure descriptions—the <gi>figDesc</gi> element makes these
            figure descriptions both unambiguous and easy to remove programmatically. </p>
          <p>In a similar vein, we might decide to remove the contents of the <gi>speaker</gi>
            labels in all our texts. Speaker labels are a particularly challenging case for working
            with vector space models because the models are making inferences about meaning based on
            proximity, so, for example, if a collection has multiple repetitions of <q>Lady
              Ignorant</q> because one play includes a character of that name, that will have an
            impact on the results—and probably not one that the researcher was expecting.</p>
          <p>TEI can also be very powerful for selecting subsets of documents; for example, the TEI
            <gi>body</gi> element allows for the selection of the main body of a text, excluding
            front and back matter altogether. </p>
          <p>Finally, text encoding can be used for tokenization
            that is much more precise than automated routines. The markup here specifies that one of
            the instances of <q>New London</q> is a single place name, while the other is not. </p>
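          <p>As a hedged sketch of what that transformation can look like in practice, the xml2
            package in R can drop unwanted elements and keep only the text of the <gi>body</gi>.
            The element names match the examples above; the file name is a placeholder.
            <eg><![CDATA[library(xml2)

doc <- read_xml("play.xml")   # placeholder file name
ns  <- xml_ns(doc)            # the TEI default namespace is exposed as "d1"

# Remove project-authored and unwanted content before extracting text
xml_remove(xml_find_all(doc, ".//d1:figDesc | .//d1:speaker", ns))

# Keep only the main body of the text, excluding front and back matter
body_text <- xml_text(xml_find_all(doc, ".//d1:body", ns))]]></eg></p>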
          
        </tutorial>

      </section>

      <section>
        <head>Training the model</head>
        <slide>
          <figure>
            <graphic width="100%" url="../../../_utils/gfx/w2v_process_overview2.png"/>
          </figure>
        </slide>
        <lectureNote>

          <p>After corpus selection and the initial data analysis and processing, the next step is
            training a model. Some of the options for training models will depend on which algorithm
            and programming language you're using. We'll be using the word2vec package in R as an
            exemplar, since that has substantial support for new learners, but we'll also consider
            some more generalizable aspects of this part of the process. </p>


        </lectureNote>
    <tutorial>
      <p>After corpus selection and the initial data analysis and processing, the next step is
        training a model. Some of the options for training models will depend on which algorithm
        and programming language you're using. We'll be using the word2vec package in R as an
        exemplar, since that has substantial support for new learners, but we'll also consider
        some more generalizable aspects of this part of the process. </p>
    </tutorial>
      </section>
      <section>
        <head>Setting parameters</head>
        <slide>
          <list>
            <head>Parameters:</head>
            <item><emph>vectors:</emph> controls the number of dimensions in the model</item>
            <item><emph>window:</emph> controls the number of <q>context words</q> on either side of
              the <q>target word</q> taken into consideration during the model training
              process</item>
            <item><emph>iterations:</emph> controls the number of cycles through the process of
              examining each word and its context and adjusting its positioning in vector
              space</item>
            <item><emph>negative samples:</emph> controls the number of
                <soCalled>negative</soCalled> or non-context words to be updated in an iteration of
              the model training process</item>
            <item><emph>threads:</emph> controls the number of processors to use in a computer when
              a model is being trained</item>
          </list>
          <figure>
            <graphic width="1100px" url="../../../_utils/gfx/window-comparison.png"/>
          </figure>
        </slide>
        <lectureNote>

          <p>As with text preparation, it is a good idea to vary and test the parameters used to
            train a model, since these can have a significant impact on results. We'll be covering a
            few general guidelines and principles here, but there is no single optimal approach to
            setting parameters; the best options will depend on a project's research goals and
            corpus characteristics.</p>
          <p>The <term>vectors</term> parameter controls the dimensionality of a model.
            <!-- This number will always be a necessary reduction because expressing all of the possible dimensions of words' 
    			relationships would require one dimension for every vocabulary term in the corpus, 
    			which would not only be computationally very expensive, but would also diffuse word relationships to the point
    			that it would be impossible to identify significant connections.--> Higher
            numbers of dimensions can make a model more precise, but will also increase training
            time and the possibility of random errors. Something between 100 and 500 vectors will
            work for most projects. </p>
          <p><term>Window size</term> controls the number of words on either side of the target word
            that the model treats as relevant context in predicting which words are likely to appear
            near each other. The size of the window affects the kinds of similarities between words
            that are brought to visibility: a larger window will tend to emphasize topical
            similarities, whereas a smaller window will tend to emphasize functional and syntactic
            similarities. If you are working with very short documents, such as Tweets, then a
            smaller window size will be preferable. </p>
          <p><term>Iterations</term> controls the number of passes through a corpus during the model
            training process. Remember that training a model involves repeatedly iterating over a
            corpus to predict which words are likeliest to appear in context with each other,
            refining those predictions over each iteration. Additional iterations will generally
            make a model more reliable, but the impact of increased iterations also depends on
            corpus size. With larger corpora, fewer iterations are necessary because the model has
            more data to work with. </p>
          <p>The <term>threads</term> parameter controls the number of processors to use on your
            computer during training, which impacts how much time it will take to train a model.
            Something in the range of 2 to 8 threads should work for most laptops. </p>
          <p><term>Negative sampling</term> controls how much the model's weights for individual
            words are adjusted each time it iterates through a corpus, which decreases training
            time. The <q>negative</q> words are the ones that are not context words for an
            individual target word (that is, words that do not appear within the window); when the
            model re-calculates the weights for that term, it will do so only for a sample of its
            negative words, as controlled by this parameter. For smaller datasets, a value between 5
            and 20 is recommended; larger datasets can use smaller values, between 2 and 5. </p>

        </lectureNote>
        <tutorial>
          <p>As with text preparation, it is a good idea to vary and test the parameters used to
            train a model, since these can have a significant impact on results. We'll be covering a
            few general guidelines and principles here, but there is no single optimal approach to
            setting parameters; the best options will depend on a project's research goals and
            corpus characteristics.</p>
          <p>The <term>vectors</term> parameter controls the dimensionality of a model. This number
            will always be a necessary reduction because expressing all of the possible dimensions
            of words' relationships would require one dimension for every vocabulary term in the
            corpus, which would not only be computationally very expensive, but would also diffuse
            word relationships to the point that it would be impossible to identify significant
            connections. Higher numbers of dimensions can make a model more precise, but will also
            increase training time and the possibility of random errors. Something between 100 and
            500 vectors will work for most projects. </p>
          <p><term>Window size</term> controls the number of words on either side of the target word
            that the model treats as relevant context in predicting which words are likely to appear
            near each other. The size of the window affects the kinds of similarities between words
            that are brought to visibility: a larger window will tend to emphasize topical
            similarities, whereas a smaller window will tend to emphasize functional and syntactic
            similarities. If you are working with very short documents, then a
            smaller window size will be preferable. </p>
          <p><term>Iterations</term> controls the number of passes through a corpus during the model
            training process. Remember that training a model involves repeatedly iterating over a
            corpus to predict which words are likeliest to appear in context with each other,
            refining those predictions over each iteration. Additional iterations will generally
            make a model more reliable, but the impact of increased iterations also depends on
            corpus size. With larger corpora, fewer iterations are necessary because the model has
            more data to work with. </p>
          <p>The <term>threads</term> parameter controls the number of processors to use on your
            computer during training, which impacts how much time it will take to train a model.
            Something in the range of 2 to 8 threads should work for most laptops. </p>
          <p><term>Negative sampling</term> controls how much the model's weights for individual
            words are adjusted each time it iterates through a corpus, which decreases training
            time. The <q>negative</q> words are the ones that are not context words for an
            individual target word (that is, words that do not appear within the window); when the
            model re-calculates the weights for that term, it will do so only for a sample of its
            negative words, as controlled by this parameter. For smaller datasets, a value between 5
            and 20 is recommended; larger datasets can use smaller values, between 2 and 5. </p>
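          <p>Putting those parameters together, a minimal sketch of a training call with the
            word2vec package in R might look like the following; the specific values are starting
            points to vary and test rather than recommendations for any particular corpus.
            <eg><![CDATA[library(word2vec)

model <- word2vec(
  x        = texts,   # character vector of prepared plain text
  type     = "cbow",
  dim      = 100,     # vectors: number of dimensions
  window   = 6,       # context words on either side of the target word
  iter     = 10,      # passes through the corpus
  negative = 5,       # negative samples per target word
  threads  = 4        # processors to use during training
)

embedding <- as.matrix(model)  # rows are words, columns are dimensions]]></eg></p>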
        </tutorial>
      </section>
      <section>
        <head>Validation and testing: principles</head>
        <slide>
          <figure>
            <graphic width="300px" url="../../../_utils/gfx/model-test.png"/>
          </figure>
        </slide>
        <lectureNote>
          <p>The next step is testing and validating a trained model. The most effective methods for
            testing models vary depending on a project's algorithm, goals, and corpus, but there are
            a few key principles at stake. The first is reliability: do models trained on the same
            datasets and with the same parameters produce acceptably similar results? These models
            are nondeterministic, which means that there will be some variation, but significant
            variations in results would indicate that it may be necessary to increase the number of
            iterations or otherwise adjust the parameters and input corpus. Otherwise, you can't
            draw any conclusions about your <emph>corpus</emph>, just about the particular
              <emph>model</emph> that you happen to be looking at. One way to test reliability is to
            train multiple models and then calculate the cosine similarities between sets of word
            pairs that the model should perform reasonably well on, to see if these are consistent
            from one model to another. </p>
          <p>Another principle looks at the quality of the model's placements of words in vector
            space, which boils down to: is the model showing results that actually make sense? One
            way to perform this kind of validation is to identify word relationships that should be
            close ones within a corpus—that is, words that should have high cosine similarities—and
            then calculate the cosine similarities for those pairs. Other common model evaluation
            techniques test performance with analogies or with word clustering—or with performance
            on particular tasks that the model will be used to accomplish. </p>

        </lectureNote>
    <tutorial>
      <p>The next step is testing and validating a trained model. The most effective methods for
        testing models vary depending on a project's algorithm, goals, and corpus, but there are
        a few key principles at stake. The first is reliability: do models trained on the same
        datasets and with the same parameters produce acceptably similar results? These models
        are nondeterministic, which means that there will be some variation, but significant
        variations in results would indicate that it may be necessary to increase the number of
        iterations or otherwise adjust the parameters and input corpus. Otherwise, you can't
        draw any conclusions about your <emph>corpus</emph>, just about the particular
        <emph>model</emph> that you happen to be looking at. One way to test reliability is to
        train multiple models and then calculate the cosine similarities between sets of word
        pairs that the model should perform reasonably well on, to see if these are consistent
        from one model to another. </p>
      <p>Another principle looks at the quality of the model's placements of words in vector
        space, which boils down to: is the model showing results that actually make sense? One
        way to perform this kind of validation is to identify word relationships that should be
        close ones within a corpus—that is, words that should have high cosine similarities—and
        then calculate the cosine similarities for those pairs. Other common model evaluation
        techniques test performance with analogies or with word clustering—or with performance
        on particular tasks that the model will be used to accomplish. </p>
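      <p>A minimal sketch of that reliability check with the word2vec package in R: train two
        models on the same data with the same parameters, then compare the cosine similarity of a
        word pair that matters for the project. The pair <q>queen</q> and <q>crown</q> is only an
        illustration. <eg><![CDATA[library(word2vec)

# Two models trained on the same corpus with the same parameters
model_a <- word2vec(x = texts, dim = 100, window = 6, iter = 10, threads = 4)
model_b <- word2vec(x = texts, dim = 100, window = 6, iter = 10, threads = 4)

# Plain cosine similarity between two word vectors
cosine <- function(emb, w1, w2) {
  a <- emb[w1, ]; b <- emb[w2, ]
  sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
}

# Acceptably similar values across the two models suggest reliability
cosine(as.matrix(model_a), "queen", "crown")
cosine(as.matrix(model_b), "queen", "crown")]]></eg></p>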
    </tutorial>
      </section>
      <section>
        <head>Validation and testing: pragmatics</head>
        <slide>
          <figure>
            <graphic width="100%" url="../../../_utils/gfx/toad-test.png"/>
          </figure>
        </slide>
        <lectureNote>

          <p>There are also some more immediate and pragmatic forms of validation that can be used
            as a reality check after training a model. One of the quickest ways to check the status
            of a model is to look at its clusters. If these are in line with what you know about your
            corpus, that's a good sign. If they're sheer nonsense, you might have a problem.
            Clusters can also help you to see where you might need to do more work in data
            preparation. </p>
          <p>Another pragmatic test is to look at cosine similarities for cases where the closest
            words in vector space should be fairly predictable. Days of the week are a useful
            example of this method; in most models, the closest words to any day of the week will be
            other days of the week. Months also work here. In thinking about which words to test
            with, remember that the model will be more accurate for more frequent words. For a quick
            check, basic tests with common words will suffice. More robust forms of testing should
            use words that are only moderately common, since those will be better at revealing
            frailties in the model—that is, you don't want to test your model only on the easiest
            things to get right. </p>

        </lectureNote>
        <tutorial>
          <p>There are also some more immediate and pragmatic forms of validation that can be used
            as a reality check after training a model. One of the quickest ways to check the status
            of a model is to look at its clusters. If these are in line with what you know about your
            corpus, that's a good sign. If they're sheer nonsense, you might have a problem.
            Clusters can also help you to see where you might need to do more work in data
            preparation. </p>
          <p>As a thought exercise, take a look at these clusters: what can you observe about each of them?
            The rather whimsically named <soCalled>Toad Test</soCalled> asks you to consider: which of 
            these clusters might be associated with rationality/accuracy/validity (as represented by
            Ada Lovelace) and which would indicate that your model might be a toad? Even clusters that
            surprise you at first might still be revealing useful things about language use in your
            corpus, but seeing large numbers of clusters in which you, as the person familiar with the corpus,
            cannot determine a relationship between the terms would suggest that there are issues
            with your corpus or your model (or both). 
          </p>
          <p>Another pragmatic test is to look at cosine similarities for cases where the closest
            words in vector space should be fairly predictable. Days of the week are a useful
            example of this method; in most models, the closest words to any day of the week will be
            other days of the week. Months also work here. In thinking about which words to test
            with, remember that the model will be more accurate for more frequent words. For a quick
            check, basic tests with common words will suffice. More robust forms of testing should
            use words that are only moderately common, since those will be better at revealing
            frailties in the model—that is, you don't want to test your model only on the easiest
            things to get right. </p>
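          <p>As a sketch of these quick checks with the word2vec package in R, assuming the trained
            model from the earlier sketch: ask for the nearest neighbors of a predictable word, and
            take a quick look at clusters by running k-means over the embedding matrix. The word
            <q>monday</q> and the choice of 25 clusters are illustrative only.
            <eg><![CDATA[# Nearest neighbors of a predictable word: other days of the week should rank highly
predict(model, "monday", type = "nearest", top_n = 10)

# A quick look at clusters via k-means on the embedding matrix
embedding <- as.matrix(model)
clusters  <- kmeans(embedding, centers = 25)
split(rownames(embedding), clusters$cluster)[1:3]  # words grouped into the first few clusters]]></eg></p>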
        </tutorial>
      </section>
      <section>
        <head>Questions? Discussion!</head>
        <slide>

          <list>
            <item>What kinds of manipulations do you think you will need to perform on your data? </item>
            <item>What kinds of inconsistencies in your data might you need to address? How do you
              think you might do so?</item>
            <item>Do you think you'll want to omit any sections of your texts?</item>
            <item>What kinds of questions do you want to investigate with your texts? What queries
              do you plan to try?</item>
          </list>
        </slide>


      </section>


    </presentation>
  </text>
</TEI>
