<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://wwp-test.northeastern.edu/outreach/seminars/_utils/schema/yaps.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="https://wwp-test.northeastern.edu/outreach/seminars/_utils/schema/yaps.isosch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-stylesheet type="text/css" href="../../../_utils/stylesheets/yaps-tei.css"?>
<!-- $Id: word_vectors_process.xml 51374 2026-04-01 16:35:16Z aclark $ -->
<TEI xmlns="http://www.wwp.northeastern.edu/ns/yaps" version="5.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Processes for Data Preparation and Model Training</title>
        <author>Julia Flanders</author>
        <author>Sarah Connell</author>
      </titleStmt>
      <editionStmt>
        <edition>Word Vectors for the Thoughtful Humanist, Northeastern University,
          2019-07</edition>
      </editionStmt>
      <publicationStmt>
        <distributor>Women Writers Project (via website)</distributor>
        <address>
          <addrLine>url:mailto:wwp@neu.edu</addrLine>
        </address>
        <date when="2019-07-17"/>
        <availability status="restricted">
          <p>Copyright 2007 Syd Bauman, Julia Flanders, and the Women Writers Project</p>
          <p>This TEI-encoded XML file is available under the terms of the <ref
              target="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons
              Attribution-ShareAlike 3.0 (Unported)</ref> license.</p>
        </availability>
        <pubPlace>Boston, MA USA</pubPlace>
      </publicationStmt>
      <notesStmt>
        <note>
          <p/>
        </note>
      </notesStmt>
      <sourceDesc>
        <p>Covers the essentials for data preparation and model training for word2vec.</p>
      </sourceDesc>
    </fileDesc>
    <revisionDesc>
      <change who="personography.xml#sconnell.yuw" when="2019-07-12">First-round proofing
        complete.</change>
      <change who="personography.xml#sconnell.yuw" when="2019-04-01">Created file.</change>
    </revisionDesc>
  </teiHeader>
  <text>
    <presentation>
      <abstract>
        <p>This tutorial contains an overview of the processes for data preparation and for training
          and testing word embedding models. It covers how to select and prepare texts, how to set
          parameters and train models, and how to test and validate trained models.</p>
      </abstract>
      <section>
        <head>Process overview</head>
        <slide>
          <figure>
            <graphic width="100%" url="../../../_utils/gfx/w2v_process_overview.png"/>
            <!-- Update from earlier slides -->
          </figure>
        </slide>
        <lectureNote>
          <p>Models like word2vec require large corpora of texts as their input but, fortunately,
            the data preparation processes are fairly straightforward. Essentially, all that's
            required is a large number of words in plain text. </p>
          <p>Because each model operates on the level of the full corpus, the text selection process
            is crucial, and is best accompanied by an analysis phase that considers: how the texts
            in the corpus pertain to the research questions at stake, how the characteristics of the
            corpus (such as variations in language or literary forms) could impact results, and
            which options for data preparation will be best suited for the project. </p>
          <p>After a corpus has been selected and prepared, the next step is model training. Here,
            again, it is useful to include an analysis phase to determine which configurations will
            be most effective with the corpus and research questions—followed by a testing process
            to see whether those configurations need to be adjusted. </p>
        </lectureNote>
        <tutorial>
          <p>This tutorial is going to focus on how to prepare your data for training and testing a word2vec model.</p>
          <p>Models like word2vec require large corpora of texts as their input but, fortunately,
            the data preparation processes are fairly straightforward. Essentially, all that's
            required is a large number of words in plain text. Plain text means that the text is machine-readable and not, for example, a PDF file.</p>
          <p>Because each model operates on the level of the full corpus, the text selection process
            is crucial, and is best accompanied by an analysis phase where you consider: how the texts
            in the corpus pertain to the research questions at stake, how the characteristics of the
            corpus (such as variations in language or literary forms) could impact results, and
            which options for data preparation will be best suited for the project. </p>
          <p>After a corpus has been selected and prepared, the next step is model training. Here,
            again, it is useful to include an analysis phase so that you can determine which parameters will
            be most effective for your corpus and research questions—followed by a testing process
            to see whether those configurations need to be adjusted. </p>
        </tutorial>

      </section>

      <section>
        <head>For example: Analyzing city reputation with Airbnb data</head>
        <slide>
          <figure>
            <graphic width="100%" url="../../../_utils/gfx/w2v_process_overview_airbnb.png"/>
          </figure>
        </slide>
        <lectureNote>
          <p>Let's add some greater specificity to that overview by walking through a brief example. </p>
          <p>A research project exploring the different characteristics ascribed to cities by
            tourists might decide to work with a set of Airbnb reviews. Many review corpora are
            available online—for a project focused on comparison, the researcher would want to
            select reviews for each city that are roughly similar in their composition—which would
            include the raw numbers and lengths of reviews, as well as factors such as the dates
            represented in each corpus. The researcher would then analyze their corpora to identify
            any features of the data that might impact results in unwanted ways—for example, if the
            dataset contained anything apart from the text of the reviews themselves, like reviewer
            handles or posting dates, the researcher would want to remove those. In training the
            model, the researcher would consider which parameters would work best with this corpus
            and research question; for example, reviews are likely to be short and focused in their
            language, so a fairly small window size would likely provide better results. Finally,
            the researcher would validate the model to test both its consistency and the accuracy of
            its representations of relationships between words. </p>

        </lectureNote>
        <tutorial>
          <p>Let's add some greater specificity to that broad overview by walking through a brief example. </p>
          <p>A research project exploring the different characteristics ascribed to cities by
            tourists might decide to work with a set of Airbnb reviews. Many review corpora are
            available online—for a project focused on comparison, the researcher would want to
            select reviews for each city that are roughly similar in their composition—which would
            include the raw numbers and lengths of reviews, as well as factors such as the dates
            represented in each corpus. The researcher would then analyze their corpora to identify
            any features of the data that might impact results in unwanted ways—for example, if the
            dataset contained anything apart from the text of the reviews themselves, like reviewer
            handles or posting dates, the researcher would want to remove those.</p> <p>In training the
            model, the researcher would consider which parameters would work best with this corpus
            and research question; for example, reviews are likely to be short and focused in their
            language, so a fairly small window size would likely provide better results. Finally,
            the researcher would validate the model to test both its consistency and the accuracy of
            its representations of relationships between words. </p>
        </tutorial>

      </section>
      <section>
        <head>Data preparation</head>
        <slide>
          <p>For more details on the decisions you can make at each stage of the data preparation
            process, see <ref
              target="https://www.wwp.northeastern.edu/outreach/seminars/_current/handouts/word_vectors/data_checklist.html"
              >this checklist</ref>.</p>
          <figure>
            <graphic width="100%" url="../../../_utils/gfx/w2v_process_overview1.png"/>

          </figure>
        </slide>
        <lectureNote>

          <p>As we've seen, the first stage of the overall process is assembling a corpus, analyzing
            it, and performing the necessary data preparation. </p>


        </lectureNote>
        
        <tutorial>
          <p>As we've seen, the first stage of the overall process is assembling a corpus, analyzing
            it, and performing the necessary data preparation. </p>
        </tutorial>

      </section>
      <section>
        <head>Data sources</head>
        <slide>
          <list>
            <item><emph>Pre-built plain-text corpora</emph>—such as from <ref
                target="https://docsouth.unc.edu/browse/collections.html">DocSouth</ref> or <ref
                target="http://graphics.cs.wisc.edu/WP/vep/">Visualizing Early Print</ref></item>
            <item><emph>Hand-assembled plain-text corpora</emph>—from sources such as <ref
                target="https://www.gutenberg.org/">Project Gutenberg</ref></item>
            <item><emph>TEI data</emph>—from sources such as the <ref
                target="http://ota.ox.ac.uk/tcp/">Text Creation Partnership</ref> or the <ref
                target="http://webapp1.dlib.indiana.edu/TEIgeneral/welcome.do?brand=wright">Wright
                American Fiction</ref> collections</item>
            <item><emph>Optical Character Recognition (OCR) data</emph>—either assembled from
              existing collections, such as the <ref target="https://archive.org/">Internet
                Archive</ref>, or generated from facsimiles using OCR software</item>
          </list>

        </slide>
        <lectureNote>
          <p>Now let's consider each step of the overall process in more detail. </p>
          <!-- For each of the below, ask the participants who are working with each kind of corpus to discuss their projects and data preparation. -->
          <p>Let's look at a few sample sources, beginning with ones that will require the least
            amount of intervention: pre-built corpora that are already in plain text format and that
            contain only or primarily the contents of the texts that the researcher wishes to
            analyze. These are often the easiest to use, and are generally curated by individuals
            with expertise in the collections, but will likely still require some additional
            processing. Selecting individual texts to build a corpus by hand is more time-consuming,
            but also provides finer control. And, of course, these two approaches can be
            combined, by merging a set of hand-curated texts with some or all of an existing corpus,
            or by modifying a corpus to remove or add texts. </p>

          <p>Some projects might work with texts that will require more extensive intervention. For
            example, working with a corpus of TEI texts will allow for considerable control over the
            final formats and contents of the plain-text files, but will also require additional
            preparation work in transforming the texts. </p>
          <p>In some cases, a project may need to work with plain-text files generated from optical
            character recognition, or OCR. Unless the OCR'd text has been subsequently reviewed and
            corrected, it is likely to contain a number of errors, which will lead to noise in the
            word2vec results. However, because vector space models operate at a large scale, it is
            still possible to get useful results, even from fairly messy data. </p>
        </lectureNote>
        <tutorial>
          <p>Now, let's consider each step of the overall process in more detail. </p>
          <!-- For each of the below, ask the participants who are working with each kind of corpus to discuss their projects and data preparation. -->
          <p>Let's look at a few sample sources, beginning with ones that will require the least
            amount of intervention: pre-built corpora that are already in plain text format and that
            contain only or primarily the contents of the texts that the researcher wishes to
            analyze. These are often the easiest to use, and are generally curated by individuals
            with expertise in the subject area of the corpus, but will likely still require some additional
            processing. </p>
            <p>Selecting individual texts to build a corpus by hand is more time-consuming,
            but also provides finer control and allows you to tailor the corpus more specifically to your research questions. And, of course, these two approaches can be
            combined, by merging a set of hand-curated texts with some or all of an existing corpus,
            or by modifying a corpus to remove or add texts. </p>
          
          <p>Some projects might work with texts that will require more extensive intervention. For
            example, working with a corpus of TEI texts will allow for considerable control over the
            final formats and contents of the plain-text files, but will also require additional
            preparation work in transforming the texts, for example removing unnecessary tags. </p>
          <p>In some cases, a project may need to work with plain-text files generated from optical
            character recognition, or OCR. Unless the OCR'd text has been subsequently reviewed and
            corrected, it is likely to contain a number of errors, which will lead to noise in the
            word2vec results. However, because vector space models operate at a large scale, it is
            still possible to get useful results, even from fairly messy data. </p>
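          <p>If you are assembling a plain-text corpus by hand, a few lines of R are enough to
            gather the files into a single character vector for later processing. This is only a
            minimal sketch: the <q>corpus</q> folder name and the one-file-per-text layout are
            assumptions, not requirements of any particular tool. <eg><![CDATA[# Gather every .txt file in a (hypothetical) "corpus" folder
files <- list.files("corpus", pattern = "\\.txt$", full.names = TRUE)

# Read each file and collapse it into one long string per text
texts <- vapply(files, function(f) {
  paste(readLines(f, warn = FALSE), collapse = " ")
}, character(1))

length(texts)  # one element per document in the corpus]]></eg></p>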
        </tutorial>

      </section>
      <section>
        <head>Data types</head>
        <slide>
          <figure>
            <graphic height="300px" url="../../../_utils/gfx/txt-ocr-tei.png"/>
          </figure>
        </slide>
        <lectureNote>
          <p>Looking at examples of several common data types, let's consider each in turn: 
            
          <list>
            <item>What do
              you think the challenges might be for projects working with each of these kinds of data?</item>
            <item>What can be done with these different kinds of data?</item>
            <item> What would be easier or harder to
              do?</item>
            <item> How might these different data types impact the results one would get in models
              trained on them?</item>
          </list>
          
          
          </p>
          
         

        </lectureNote>
        <tutorial>
          <p>Looking at examples of several common data types, let's consider each in turn: 
            
            <list>
              <item>What do
                you think the challenges might be for projects working with each of these kinds of data?</item>
              <item>What can be done with these different kinds of data?</item>
              <item> What would be easier or harder to
                do?</item>
              <item> How might these different data types impact the results one would get in models
                trained on them?</item>
            </list>
            
            
          </p>
        </tutorial>
      </section>
      <section>
        <head>Data analysis</head>
        <slide>
          <list>
            <head>Questions to consider:</head>
            <item>What kinds of metadata do you have? Can you use your metadata to select texts or
              subset your corpus?</item>
            <item>Are there any editorial artifacts from the texts’ transcription (figure
              descriptions, uncertainty markers, annotations)? Can these be systematically
              identified?</item>
            <item>Are there any features from within the documents themselves that you might want to
              remove (page numbers, labels)?</item>
            <item>How do you want to handle paratexts, such as tables of contents, appendices, or
              prefatory materials?</item>
            <item>Do you want to regularize spelling or any other aspects of your texts?</item>
          </list>
        </slide>
        <lectureNote>
          <p>Because our results depend so heavily on our input data—and because so much of this
            work happens at considerable abstraction from our texts—it's crucial to include a data
            analysis phase early in a project. </p>
          <p>In fact, data preparation and analysis should be iterative: reviewing texts,
            identifying where the data needs to be adjusted, making those changes, reviewing the
            results, identifying additional changes, and so on. It is also important to develop and
            implement a system for keeping track of all the changes made to the texts in a
            corpus.</p>
          <p>Some common information types that are often included with digital texts will need to
            be removed for most projects. One such example is metadata generated by the project
            publishing the digital collection. Others are editorially-authored text (such as
            annotations or descriptions of images), page numbers, and labels. Removing these is
            preferable both because they are not likely to be of interest in most cases and also
            because they can artificially introduce distance between closely related terms—remember
            that the model is relying on proximity to determine relatedness. </p>
          <p>In conducting data analysis, we review not only for which extratextual features are
            present but also for how they are being marked within the text, so that undesired
            features can be removed programmatically wherever possible. For example, if all of the
            page numbers in a document collection are enclosed in square brackets, and no other text
            is marked out in this way, we can use regular expressions to delete all the numerals
            inside of square brackets at once. </p>
          <p>For other document features, the goals of the project will impact which would best be
            removed or kept. These would include paratexts, such as indices, tables of contents, and
            advertisements, as well as features like stage directions. </p>
          <p>And finally, we may choose to manipulate the language of our documents directly, such
            as by regularizing or modernizing the spelling. Note that if you make substantial
            changes to the language of your documents, you will also want to maintain an unmodified
            corpus, so that you can investigate the impacts of your data manipulations. </p>

        </lectureNote>
        <tutorial>
          <p>Because our results depend so heavily on our input data—and because so much of this
            work happens at considerable abstraction from our texts—it's crucial to include a data
            analysis phase early in a project. </p>
          <p>In fact, data preparation and analysis should be iterative: reviewing texts,
            identifying where the data needs to be adjusted, making those changes, reviewing the
            results, identifying additional changes, and so on. It is also important to develop and
            implement a system for keeping track of all the changes made to the texts in a
            corpus.</p>
          <p>Some common information types that are often included with digital texts will need to
            be removed for most projects. One such example is metadata generated by the project
            publishing the digital collection. Others are editorially-authored text (such as
            annotations or descriptions of images), page numbers, and labels. Removing these is
            preferable both because they are not likely to be of interest in most cases and also
            because they can artificially introduce distance between closely related terms—remember
            that the model is relying on proximity to determine relatedness. </p>
          <p>In conducting data analysis, we review not only for which extratextual features are
            present but also for how they are being marked within the text, so that undesired
            features can be removed programmatically wherever possible. For example, if all of the
            page numbers in a document collection are enclosed in square brackets, and no other text
            is marked out in this way, we can use regular expressions to delete all the numerals
            inside of square brackets at once. </p>
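          <p>As a minimal sketch of that kind of programmatic removal in R (assuming the corpus is
            already loaded as a character vector called <q>texts</q>), a single regular expression
            strips bracketed page numbers from every document at once: <eg><![CDATA[# Remove bracketed page numbers such as [42] wherever they occur
texts <- gsub("\\[[0-9]+\\]", "", texts)]]></eg></p>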
          <p>For other document features, the goals of the project will impact which would best be
            removed or kept. These would include paratexts—such as indices, tables of contents, and
            advertisements—as well as features like stage directions. </p>
          <p>And finally, we may choose to manipulate the language of our documents directly, such
            as by regularizing or modernizing the spelling. Note that if you make substantial
            changes to the language of your documents, you will also want to maintain an unmodified
            corpus, so that you can investigate the impacts of your data manipulations. </p>
        </tutorial>
      </section>
      <section>
        <head>More on regularization and correction</head>
        <slide>
          <list>
            <item><emph>OCR Errors:</emph> f1sh, flsh, fifh</item>
            <item><emph>Errors and corrections:</emph> ibside → inside</item>
            <item><emph>Abbreviations and expansions:</emph> s. → shillings</item>
            <item><emph>Original spellings:</emph> queen, quean, queene, quéene</item>
          </list>


        </slide>
        <lectureNote>

          <p>Regularization and correction might involve many different kinds of data issues, from
            addressing particularly common OCR errors to applying automated regularization routines,
            particularly for historical documents. Projects working with document collections that
            have manually marked errors and their corrections, or editorial regularizations, will
            likely decide to keep the corrections and regularizations and omit the errors and
            original spellings. </p>
          <p>It might seem that more regularization is always better, but that's not necessarily the
            case. Decisions about regularization should take into account how spelling variations
            are operating in the input corpus, and should consider where original spellings and word
            usages might have implications for the interpretations that can be drawn from models
            trained on a corpus. For example, a collection might contain deliberate archaisms that
            are connected with poetic voice, which would be flattened in the regularized text.
            Nevertheless, regularization is worth considering, particularly for projects invested in
            exploring the contexts of particular terms over time: it might not be important whether
            the spelling is <q>queen,</q>
            <q>quean,</q> or <q>queene,</q> for a project studying discourse around queenship within
            a broad chronological frame.</p>

        </lectureNote>
        <tutorial>
          <p>Regularization and correction might involve many different kinds of data issues, from
            addressing particularly common OCR errors to applying automated regularization routines,
            particularly for historical documents. Projects working with document collections that
            have manually marked errors and their corrections, or editorial regularizations, will
            likely decide to keep the corrections and regularizations and omit the errors and
            original spellings. </p>
          <p>It might seem that more regularization is always better, but that's not necessarily the
            case. Decisions about regularization should take into account how spelling variations
            are operating in the input corpus, and should consider where original spellings and word
            usages might have implications for the interpretations that can be drawn from models
            trained on a corpus. For example, a collection might contain deliberate archaisms that
            are connected with poetic voice, which would be flattened in the regularized text.
            Nevertheless, regularization is worth considering, particularly for projects invested in
            exploring the contexts of language over time: it might not be important whether
            the spelling is <q>queen,</q>
            <q>quean,</q> or <q>queene,</q> for a project studying discourse around queenship within
            a broad chronological frame.</p>
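          <p>One lightweight way to apply such regularizations is a lookup table of variant
            spellings and their preferred forms. A minimal sketch in R follows; the particular
            variants are illustrative, and a real project would build its table from analysis of
            the corpus. <eg><![CDATA[# Hypothetical table of variant -> regularized spellings
variants <- c("queene" = "queen", "quean" = "queen")

# Replace whole words only, one variant at a time
for (v in names(variants)) {
  texts <- gsub(paste0("\\b", v, "\\b"), variants[[v]], texts, perl = TRUE)
}]]></eg></p>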
        </tutorial>

      </section>

      <section>
        <head>Final text preparation</head>
        <slide>
          <list>
            <item>Lowercasing all words</item>
            <item>Removing most punctuation</item>
            <item>Optional: tokenizing common phrases</item>
          </list>

          <p><emph>Before:</emph> Yet, the present era has give indisputable proofs, that woman is a
            thinking and an enlightened being! We have seen a Wollstonecraft, a Macaulay, a Sévigné;
            and many others, now living, who embellish the sphere of literary splendour, with genius
            of the first order. </p>
          <p><emph>After:</emph> yet the present era has give indisputable proofs that woman is a
            thinking and an enlightened being we have seen a wollstonecraft a macaulay a sévigné and
            many others now living who embellish the sphere of literary splendour with genius of the
            first order</p>
          <p><emph>Before:</emph> If the breath be short let it take an Electuary of Honey and
            Linseed, and anoint the ears and parts about them with Olive oil. </p>
          <p><emph>After:</emph> if the breath be short let it take an electuary of honey and
            linseed and anoint the ears and parts about them with olive_oil</p>



        </slide>
        <lectureNote>
          <p>Regardless of how a project approaches more extensive regularizations, most projects
            will choose to lowercase all of the words in the input corpus and remove most
            punctuation. Projects can also make decisions about how to handle cases such as
            contractions, which might be treated as either one word or two, as well as commonly
            occurring word-pairings, such as <q>olive oil</q>, which can be tokenized so that the
            model will treat them as single terms. </p>
          
          


        </lectureNote>
<tutorial>
  <p>Regardless of how a project approaches more extensive regularizations, most projects
    will choose to lowercase all of the words in the input corpus and remove most
    punctuation. Projects can also make decisions about how to handle cases such as
    contractions, which might be treated as either one word or two, as well as commonly
    occurring word-pairings, such as <q>olive oil</q>, which can be tokenized so that the
    model will treat them as single terms. </p>
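  <p>A minimal sketch of these final steps in R, again assuming the prepared corpus is held in a
    character vector called <q>texts</q>; the underscore convention for phrases such as
    <q>olive_oil</q> follows the example above. <eg><![CDATA[# Lowercase everything
texts <- tolower(texts)

# Remove punctuation, then collapse the extra whitespace left behind
texts <- gsub("[[:punct:]]+", " ", texts)
texts <- gsub("\\s+", " ", texts)

# Optional: join a known phrase into a single token
texts <- gsub("olive oil", "olive_oil", texts, fixed = TRUE)]]></eg></p>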
  

</tutorial>
      </section>
      <section>
        <head>A special case: TEI data</head>
        <slide>
          <p>
            <emph>Removing project-authored text:</emph>
            <eg><![CDATA[<figDesc>A portrait of a woman</figDesc>]]></eg>
          </p>
          <p>
            <emph>Removing speaker labels:</emph>
            <eg><![CDATA[<sp who="#ign">
 <speaker>Lady Ignorant</speaker>
  <p>Lord Husband! I can never have your company, for you are at all times writing,
   or reading, or turning your Globes, or peaking through your Perspective Glasse,
   or repeating Verses, or speaking Speaches to yourself.
  </p>
</sp>]]></eg>
          </p>
          <p>
            <emph>Advanced tokenization:</emph>
            <eg><![CDATA[<p>I was just returning from our <placeName>New London</placeName> summer
  home where I had seen the most marvelous performance of the 
  new <orgName><placeName>London</placeName> Symphony Orchestra</orgName> traveling 
  production of <title><persName>Peter</persName> and the Wolf</title>.
</p>]]></eg>
          </p>

        </slide>
        <lectureNote>
          <p>TEI data requires some additional work in transformation, but it also provides
            finer control over the contents of the input corpus. For example, if a TEI corpus is
            marking figure descriptions with <gi>figDesc</gi>, then the researcher does not have to
            use punctuation to identify figure descriptions—the <gi>figDesc</gi> element makes these
            figure descriptions both unambiguous and easy to remove programmatically. </p>
          <p>In a similar vein, we might decide to remove the contents of the <gi>speaker</gi>
            labels in all our texts. Speaker labels are a particularly challenging case for working
            with vector space models because the models are making inferences about meaning based on
            proximity, so, for example, if a collection has multiple repetitions of <q>Lady
              Ignorant</q> because one play includes a character of that name, that will have an
            impact on the results—and probably not one that the researcher was expecting.</p>
          <p>TEI can also be very powerful for selecting subsets of documents; for example, the TEI
              <gi>body</gi> element allows for the selection of the main body of a text, excluding
            front and back matter altogether. Finally, text encoding can be used for tokenization
            that is much more precise than automated routines. The markup here specifies that one of
            the instances of <q>New London</q> is a single place name, while the other is not. </p>


        </lectureNote>
        <tutorial>
          <p>TEI data requires some additional work in transformation, but it also provides
            finer control over the contents of the input corpus. For example, if a TEI corpus is
            marking figure descriptions with <gi>figDesc</gi>, then the researcher does not have to
            use punctuation to identify figure descriptions—the <gi>figDesc</gi> element makes these
            figure descriptions both unambiguous and easy to remove programmatically. </p>
          <p>In a similar vein, we might decide to remove the contents of the <gi>speaker</gi>
            labels in all our texts. Speaker labels are a particularly challenging case for working
            with vector space models because the models are making inferences about meaning based on
            proximity, so, for example, if a collection has multiple repetitions of <q>Lady
              Ignorant</q> because one play includes a character of that name, that will have an
            impact on the results—and probably not one that the researcher was expecting.</p>
          <p>TEI can also be very powerful for selecting subsets of documents; for example, the TEI
            <gi>body</gi> element allows for the selection of the main body of a text, excluding
            front and back matter altogether. </p>
          <p>Finally, text encoding can be used for tokenization
            that is much more precise than automated routines. The markup here specifies that one of
            the instances of <q>New London</q> is a single place name, while the other is not. </p>
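          <p>As a hedged sketch of what that transformation can look like in practice, the xml2
            package in R can drop unwanted elements and keep only the text of the <gi>body</gi>.
            The element names match the examples above; the file name is a placeholder.
            <eg><![CDATA[library(xml2)

doc <- read_xml("play.xml")   # placeholder file name
ns  <- xml_ns(doc)            # the TEI default namespace is exposed as "d1"

# Remove project-authored and unwanted content before extracting text
xml_remove(xml_find_all(doc, ".//d1:figDesc | .//d1:speaker", ns))

# Keep only the main body of the text, excluding front and back matter
body_text <- xml_text(xml_find_all(doc, ".//d1:body", ns))]]></eg></p>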
          
        </tutorial>

      </section>

      <section>
        <head>Training the model</head>
        <slide>
          <figure>
            <graphic width="100%" url="../../../_utils/gfx/w2v_process_overview2.png"/>
          </figure>
        </slide>
        <lectureNote>

          <p>After corpus selection and the initial data analysis and processing, the next step is
            training a model. Some of the options for training models will depend on which algorithm
            and programming language you're using. We'll be using the word2vec package in R as an
            exemplar, since that has substantial support for new learners, but we'll also consider
            some more generalizable aspects of this part of the process. </p>


        </lectureNote>
    <tutorial>
      <p>After corpus selection and the initial data analysis and processing, the next step is
        training a model. Some of the options for training models will depend on which algorithm
        and programming language you're using. We'll be using the word2vec package in R as an
        exemplar, since that has substantial support for new learners, but we'll also consider
        some more generalizable aspects of this part of the process. </p>
    </tutorial>
      </section>
      <section>
        <head>Setting parameters</head>
        <slide>
          <list>
            <head>Parameters:</head>
            <item><emph>vectors:</emph> controls the number of dimensions in the model</item>
            <item><emph>window:</emph> controls the number of <q>context words</q> on either side of
              the <q>target word</q> taken into consideration during the model training
              process</item>
            <item><emph>iterations:</emph> controls the number of cycles through the process of
              examining each word and its context and adjusting its positioning in vector
              space</item>
            <item><emph>negative samples:</emph> controls the number of
                <soCalled>negative</soCalled> or non-context words to be updated in an iteration of
              the model training process</item>
            <item><emph>threads:</emph> controls the number of processors to use in a computer when
              a model is being trained</item>
          </list>
          <figure>
            <graphic width="1100px" url="../../../_utils/gfx/window-comparison.png"/>
          </figure>
        </slide>
        <lectureNote>

          <p>As with text preparation, it is a good idea to vary and test the parameters used to
            train a model, since these can have a significant impact on results. We'll be covering a
            few general guidelines and principles here, but there is no single optimal approach to
            setting parameters; the best options will depend on a project's research goals and
            corpus characteristics.</p>
          <p>The <term>vectors</term> parameter controls the dimensionality of a model.
            <!-- This number will always be a necessary reduction because expressing all of the possible dimensions of words' 
    			relationships would require one dimension for every vocabulary term in the corpus, 
    			which would not only be computationally very expensive, but would also diffuse word relationships to the point
    			that it would be impossible to identify significant connections.--> Higher
            numbers of dimensions can make a model more precise, but will also increase training
            time and the possibility of random errors. Something between 100 and 500 vectors will
            work for most projects. </p>
          <p><term>Window size</term> controls the number of words on either side of the target word
            that the model treats as relevant context in predicting which words are likely to appear
            near each other. The size of the window affects the kinds of similarities between words
            that are brought to visibility: a larger window will tend to emphasize topical
            similarities, whereas a smaller window will tend to emphasize functional and syntactic
            similarities. If you are working with very short documents, such as Tweets, then a
            smaller window size will be preferable. </p>
          <p><term>Iterations</term> controls the number of passes through a corpus during the model
            training process. Remember that training a model involves repeatedly iterating over a
            corpus to predict which words are likeliest to appear in context with each other,
            refining those predictions over each iteration. Additional iterations will generally
            make a model more reliable, but the impact of increased iterations also depends on
            corpus size. With larger corpora, fewer iterations are necessary because the model has
            more data to work with. </p>
          <p>The <term>threads</term> parameter controls the number of processors to use on your
            computer during training, which impacts how much time it will take to train a model.
            Something in the range of 2 to 8 threads should work for most laptops. </p>
          <p><term>Negative sampling</term> controls how much the model's weights for individual
            words are adjusted each time it iterates through a corpus, which decreases training
            time. The <q>negative</q> words are the ones that are not context words for an
            individual target word (that is, words that do not appear within the window); when the
            model re-calculates the weights for that term, it will do so only for a sample of its
            negative words, as controlled by this parameter. For smaller datasets, a value between 5
            and 20 is recommended; larger datasets can use smaller values, between 2 and 5. </p>

        </lectureNote>
        <tutorial>
          <p>As with text preparation, it is a good idea to vary and test the parameters used to
            train a model, since these can have a significant impact on results. We'll be covering a
            few general guidelines and principles here, but there is no single optimal approach to
            setting parameters; the best options will depend on a project's research goals and
            corpus characteristics.</p>
          <p>The <term>vectors</term> parameter controls the dimensionality of a model. This number
            will always be a necessary reduction because expressing all of the possible dimensions
            of words' relationships would require one dimension for every vocabulary term in the
            corpus, which would not only be computationally very expensive, but would also diffuse
            word relationships to the point that it would be impossible to identify significant
            connections. Higher numbers of dimensions can make a model more precise, but will also
            increase training time and the possibility of random errors. Something between 100 and
            500 vectors will work for most projects. </p>
          <p><term>Window size</term> controls the number of words on either side of the target word
            that the model treats as relevant context in predicting which words are likely to appear
            near each other. The size of the window affects the kinds of similarities between words
            that are brought to visibility: a larger window will tend to emphasize topical
            similarities, whereas a smaller window will tend to emphasize functional and syntactic
            similarities. If you are working with very short documents, then a
            smaller window size will be preferable. </p>
          <p><term>Iterations</term> controls the number of passes through a corpus during the model
            training process. Remember that training a model involves repeatedly iterating over a
            corpus to predict which words are likeliest to appear in context with each other,
            refining those predictions over each iteration. Additional iterations will generally
            make a model more reliable, but the impact of increased iterations also depends on
            corpus size. With larger corpora, fewer iterations are necessary because the model has
            more data to work with. </p>
          <p>The <term>threads</term> parameter controls the number of processors to use on your
            computer during training, which impacts how much time it will take to train a model.
            Something in the range of 2 to 8 threads should work for most laptops. </p>
          <p><term>Negative sampling</term> controls how much the model's weights for individual
            words are adjusted each time it iterates through a corpus, which decreases training
            time. The <q>negative</q> words are the ones that are not context words for an
            individual target word (that is, words that do not appear within the window); when the
            model re-calculates the weights for that term, it will do so only for a sample of its
            negative words, as controlled by this parameter. For smaller datasets, a value between 5
            and 20 is recommended; larger datasets can use smaller values, between 2 and 5. </p>
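          <p>Putting those parameters together, a minimal sketch of a training call with the
            word2vec package in R might look like the following; the specific values are starting
            points to vary and test rather than recommendations for any particular corpus.
            <eg><![CDATA[library(word2vec)

model <- word2vec(
  x        = texts,   # character vector of prepared plain text
  type     = "cbow",
  dim      = 100,     # vectors: number of dimensions
  window   = 6,       # context words on either side of the target word
  iter     = 10,      # passes through the corpus
  negative = 5,       # negative samples per target word
  threads  = 4        # processors to use during training
)

embedding <- as.matrix(model)  # rows are words, columns are dimensions]]></eg></p>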
        </tutorial>
      </section>
      <section>
        <head>Validation and testing: principles</head>
        <slide>
          <figure>
            <graphic width="300px" url="../../../_utils/gfx/model-test.png"/>
          </figure>
        </slide>
        <lectureNote>
          <p>The next step is testing and validating a trained model. The most effective methods for
            testing models vary depending on a project's algorithm, goals, and corpus, but there are
            a few key principles at stake. The first is reliability: do models trained on the same
            datasets and with the same parameters produce acceptably similar results? These models
            are nondeterministic, which means that there will be some variation, but significant
            variations in results would indicate that it may be necessary to increase the number of
            iterations or otherwise adjust the parameters and input corpus. Otherwise, you can't
            draw any conclusions about your <emph>corpus</emph>, just about the particular
              <emph>model</emph> that you happen to be looking at. One way to test reliability is to
            train multiple models and then calculate the cosine similarities between sets of word
            pairs that the model should perform reasonably well on, to see if these are consistent
            from one model to another. </p>
          <p>Another principle looks at the quality of the model's placements of words in vector
            space, which boils down to: is the model showing results that actually make sense? One
            way to perform this kind of validation is to identify word relationships that should be
            close ones within a corpus—that is, words that should have high cosine similarities—and
            then calculate the cosine similarities for those pairs. Other common model evaluation
            techniques test performance with analogies or with word clustering—or with performance
            on particular tasks that the model will be used to accomplish. </p>

        </lectureNote>
    <tutorial>
      <p>The next step is testing and validating a trained model. The most effective methods for
        testing models vary depending on a project's algorithm, goals, and corpus, but there are
        a few key principles at stake. The first is reliability: do models trained on the same
        datasets and with the same parameters produce acceptably similar results? These models
        are nondeterministic, which means that there will be some variation, but significant
        variations in results would indicate that it may be necessary to increase the number of
        iterations or otherwise adjust the parameters and input corpus. Otherwise, you can't
        draw any conclusions about your <emph>corpus</emph>, just about the particular
        <emph>model</emph> that you happen to be looking at. One way to test reliability is to
        train multiple models and then calculate the cosine similarities between sets of word
        pairs that the model should perform reasonably well on, to see if these are consistent
        from one model to another. </p>
      <p>Another principle looks at the quality of the model's placements of words in vector
        space, which boils down to: is the model showing results that actually make sense? One
        way to perform this kind of validation is to identify word relationships that should be
        close ones within a corpus—that is, words that should have high cosine similarities—and
        then calculate the cosine similarities for those pairs. Other common model evaluation
        techniques test performance with analogies or with word clustering—or with performance
        on particular tasks that the model will be used to accomplish. </p>
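      <p>A minimal sketch of that reliability check with the word2vec package in R: train two
        models on the same data with the same parameters, then compare the cosine similarity of a
        word pair that matters for the project. The pair <q>queen</q> and <q>crown</q> is only an
        illustration. <eg><![CDATA[library(word2vec)

# Two models trained on the same corpus with the same parameters
model_a <- word2vec(x = texts, dim = 100, window = 6, iter = 10, threads = 4)
model_b <- word2vec(x = texts, dim = 100, window = 6, iter = 10, threads = 4)

# Plain cosine similarity between two word vectors
cosine <- function(emb, w1, w2) {
  a <- emb[w1, ]; b <- emb[w2, ]
  sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
}

# Acceptably similar values across the two models suggest reliability
cosine(as.matrix(model_a), "queen", "crown")
cosine(as.matrix(model_b), "queen", "crown")]]></eg></p>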
    </tutorial>
      </section>
      <section>
        <head>Validation and testing: pragmatics</head>
        <slide>
          <figure>
            <graphic width="100%" url="../../../_utils/gfx/toad-test.png"/>
          </figure>
        </slide>
        <lectureNote>

          <p>There are also some more immediate and pragmatic forms of validation that can be used
            as a reality check after training a model. One of the quickest ways to check the status
            of a model is to look at its clusters. If these are in line with what you know about your
            corpus, that's a good sign. If they're sheer nonsense, you might have a problem.
            Clusters can also help you to see where you might need to do more work in data
            preparation. </p>
          <p>Another pragmatic test is to look at cosine similarities for cases where the closest
            words in vector space should be fairly predictable. Days of the week are a useful
            example of this method; in most models, the closest words to any day of the week will be
            other days of the week. Months also work here. In thinking about which words to test
            with, remember that the model will be more accurate for more frequent words. For a quick
            check, basic tests with common words will suffice. More robust forms of testing should
            use words that are only moderately common, since those will be better at revealing
            frailties in the model—that is, you don't want to test your model only on the easiest
            things to get right. </p>

        </lectureNote>
        <tutorial>
          <p>There are also some more immediate and pragmatic forms of validation that can be used
            as a reality check after training a model. One of the quickest ways to check the status
            of a model is to look at its clusters. If these are in line with what you know about your
            corpus, that's a good sign. If they're sheer nonsense, you might have a problem.
            Clusters can also help you to see where you might need to do more work in data
            preparation. </p>
          <p>As a thought exercise, take a look at these clusters: what can you observe about each of them?
            The rather whimsically named <soCalled>Toad Test</soCalled> asks you to consider: which of 
            these clusters might be associated with rationality/accuracy/validity (as represented by
            Ada Lovelace) and which would indicate that your model might be a toad? Even clusters that
            surprise you at first might still be revealing useful things about language use in your
            corpus, but seeing large numbers of clusters in which you, as the person familiar with the corpus,
            cannot determine a relationship between the terms would suggest that there are issues
            with your corpus or your model (or both). 
          </p>
          <p>Another pragmatic test is to look at cosine similarities for cases where the closest
            words in vector space should be fairly predictable. Days of the week are a useful
            example of this method; in most models, the closest words to any day of the week will be
            other days of the week. Months also work here. In thinking about which words to test
            with, remember that the model will be more accurate for more frequent words. For a quick
            check, basic tests with common words will suffice. More robust forms of testing should
            use words that are only moderately common, since those will be better at revealing
            frailties in the model—that is, you don't want to test your model only on the easiest
            things to get right. </p>
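          <p>As a sketch of these quick checks with the word2vec package in R, assuming the trained
            model from the earlier sketch: ask for the nearest neighbors of a predictable word, and
            take a quick look at clusters by running k-means over the embedding matrix. The word
            <q>monday</q> and the choice of 25 clusters are illustrative only.
            <eg><![CDATA[# Nearest neighbors of a predictable word: other days of the week should rank highly
predict(model, "monday", type = "nearest", top_n = 10)

# A quick look at clusters via k-means on the embedding matrix
embedding <- as.matrix(model)
clusters  <- kmeans(embedding, centers = 25)
split(rownames(embedding), clusters$cluster)[1:3]  # words grouped into the first few clusters]]></eg></p>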
        </tutorial>
      </section>
      <section>
        <head>Questions? Discussion!</head>
        <slide>

          <list>
            <item>What kinds of manipulations do you think you will need to perform on your data? </item>
            <item>What kinds of inconsistencies in your data might you need to address? How do you
              think you might do so?</item>
            <item>Do you think you'll want to omit any sections of your texts?</item>
            <item>What kinds of questions do you want to investigate with your texts? What queries
              do you plan to try?</item>
          </list>
        </slide>


      </section>


    </presentation>
  </text>
</TEI>
