Word Vector Model Evaluation
By Avery Blankenship
This introduction covers some important concepts for understanding the evaluation of word embedding models and outlines the specific approach for evaluating such models adopted by the Women Writers Project (WWP). See the Evaluation Guide for Word Embedding Models for code that can be used to build a test set of word pairs and run an evaluation test on trained models in Python, or download this GitHub release to access a full ecology for model training and evaluation.
Evaluation of word embedding models (WEMs) can be a difficult task to approach. There is no “right” or “wrong” way to evaluate models, as model evaluation should be tailored to suit the goals of the research project. For many projects, the goal of evaluation is to test how well the model understands the language of the corpus. For other projects, the goal of evaluation may be to test how well the model understands the grammatical structure of a corpus’s text. Another goal of evaluation may be to test how well the model replicates bias in the corpus. No matter what the end goal of evaluation may be for your particular project, evaluation is often a necessary but opaque process with little professional agreement on how best to approach it. Here, I describe a method for evaluating models trained on pre-twentieth-century texts which attempts to address the issues that researchers may encounter when applying evaluation tasks customized for modern texts.
There are two methods for evaluating models that are considered fairly standard, both described in Mikolov et al. (2013) and in the Word2Vec documentation. The first method involves generating word pairs which you believe should be related in some way (or even words you would assume to have no relation) and surveying a group of people who will assign a similarity score to each pair (for example, a closely related pair might be scored 0.8 or 0.9). The survey results are then averaged and compared to the cosine similarities that the model generates for the same set of word pairings. There are a number of potential problems with using this method to evaluate models trained on pre-twentieth-century texts. First, if the corpus represents text across a broad timeline, language can change, sometimes dramatically, and differ from modern uses of the same words or phrases. The meaning of a word for a modern reader may not be in line with the way that same word is used in the specific texts selected for the model. This issue can, of course, be accounted for when selecting the word pairs for testing, but the potential for error is large, and the method requires that the survey-takers share an understanding of the model’s vocabulary that may be difficult to moderate for large corpora. There is also little documentation on how large this group of survey-takers should be in order to produce meaningful results, and since the survey-takers assign similarity scores based on personal assessment, those scores may have little to do with the cosine similarities produced by the model.
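If you do decide to try this survey-based approach, the comparison step itself is straightforward to script. The sketch below is a minimal illustration in Python, assuming a Gensim 4.x Word2Vec model saved as my_model.model and a small, hypothetical set of averaged survey scores; it computes the model's cosine similarity for each surveyed pair and uses Spearman's rank correlation (one common choice) to measure how closely the two sets of judgments agree.

# A minimal sketch of the survey-comparison step, assuming a Gensim 4.x
# Word2Vec model saved as "my_model.model"; the survey scores are hypothetical.
from gensim.models import Word2Vec
from scipy.stats import spearmanr

model = Word2Vec.load("my_model.model")

# Averaged human judgments for each word pair (0 = unrelated, 1 = nearly identical)
survey_scores = {
    ("milk", "cream"): 0.8,
    ("breakfast", "dinner"): 0.7,
    ("milk", "justice"): 0.1,
}

human, machine = [], []
for (word_a, word_b), score in survey_scores.items():
    # Skip any pair containing a word the model has never seen
    if word_a in model.wv and word_b in model.wv:
        human.append(score)
        machine.append(model.wv.similarity(word_a, word_b))

# Spearman's rho measures how well the two rankings of word pairs agree
rho, p_value = spearmanr(human, machine)
print(f"Spearman correlation between survey scores and model similarities: {rho:.3f}")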
The evaluation method I will describe here is based on the second standard method for model evaluation, which was first described alongside Word2Vec’s initial release by Mikolov et al. (2013). This method tests words which share a close relationship and evaluates how well the model captures those relationships based on their closeness in vector space. Although the task is not strictly about solving analogies (it is really an attempt to connect and compare words in a corpus), it is colloquially referred to as an “analogy task” by computer scientists and computational linguists, and I will use that term here as well. The words are given to the model in sets (for example, London is to England as Paris is to ?) and the model is asked to “solve” the analogy by choosing the most appropriate word to fill in the blank. The model accomplishes this by assuming that words a, b, c, and d share a relationship with one another, and thus will be positioned in vector space such that a - b = c - d. Rearranging gives d = c - a + b: subtracting the vector for “London” from the vector for “England” and adding the vector for “Paris” should produce a vector close to that of “France,” and the word whose vector is closest to that result is taken as the model’s answer. This method of evaluation introduces slightly more complexity than a single set of word pairings (for example, “milk, cream” or “breakfast, dinner”). However, a major limitation of the method is that it assumes each analogy has exactly one correct answer. For example, “sofa” might have a relationship to “living room,” but it is also highly likely that it has the same relationship, and may be just as close in vector space, to “den” or “sitting room” or “parlor.” If “living room,” “den,” “sitting room,” and “parlor” are all acceptable words to stand in for word d, then how can we ensure that any of these words can pass the evaluation task?
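Before turning to that problem, it may help to see what the basic vector arithmetic looks like in practice. In Gensim, the most_similar method performs it directly: words passed as "positive" have their vectors added, and words passed as "negative" have theirs subtracted. The sketch below is for illustration only; it assumes a Gensim 4.x model saved as my_model.model whose lowercased vocabulary contains the words from the example.

from gensim.models import Word2Vec

# A sketch of the single-answer analogy arithmetic, assuming a model saved as
# "my_model.model" that contains "london", "england", and "paris".
model = Word2Vec.load("my_model.model")

# d is roughly c - a + b: vector("paris") - vector("london") + vector("england")
candidates = model.wv.most_similar(
    positive=["england", "paris"],  # vectors that are added
    negative=["london"],            # vector that is subtracted
    topn=5,
)
for word, cosine in candidates:
    print(f"{word}\t{cosine:.3f}")
# If the model has learned the relationship, "france" should appear at or near
# the top of this list.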
In order to account both for the likelihood of language change over time and for the likelihood that a word shares similar relationships, and thus similar distances in vector space, with more than one other word, the WWP has adopted a modified version of the analogy task for both our own model testing and the code tutorials that we publish. This method draws upon the BATS (Bigger Analogy Test Set) data set and uses methods presented by the Vecto research team, whose proposed modification of the standard analogy test allows multiple words to be acceptable answers for word d. Since the evaluation test is self-contained and doesn’t require surveying, the chance that the cosine similarity scores assigned to each word pairing will be inconsistent is also lowered.
Of course, there are still some downsides to the BATS method, chiefly that the word pairings provided by the BATS team for plug-and-play evaluation may not match actual language usage in the corpus you are working with. For this reason, the WWP is developing a set of word pairings which more closely represent language usage in pre-twentieth-century texts and which are modeled on the BATS word sets. While these word sets are not perfect and do not reflect all possible language usage across such a wide timeline, the method described here can provide a framework if you are interested in creating your own set of word pairs for testing a model (or even a set of models), one that broadly captures important concepts in western literature within the timeframe, including concepts that may diverge from their modern uses.
Why is Model Evaluation So Difficult?
Because Word2Vec uses an unsupervised training task to train models, evaluating these models is not as simple as counting correct answers or missed words. An easy answer to the question “why are word embedding models so difficult to evaluate?” is that language is difficult to evaluate. First and foremost, the difficulty of evaluation is a matter of scale. No matter how large the corpus, no matter how robust the training method, WEMs cannot model all language that has ever existed. Thus, simply by selecting a corpus of texts for training, you ensure that the resulting model can only represent the behavior of the language in that specific corpus, using the vocabulary that the corpus contains.
On a basic level, a major contributor to poor model evaluation is what computer scientists call the “semantic gap.” Lev Manovich (2015) writes that “a human reader understands what the text is about, but a computer can only ‘see’ a set of letters separated by spaces” (22). How well the data can actually reflect a human’s understanding of some concept or phenomenon is variable and at times opaque. For this reason, it is important to narrow the kinds of evaluation benchmarks you set for the model to reflect this specificity. For example, you could not evaluate a WEM to determine how well it reflects the English language in its entirety, but you could evaluate how well the model understands the language of Shakespeare. Beginning the evaluation process with a specific, narrow set of benchmark questions is a crucial step: it both indicates which types of evaluation are best suited to addressing those benchmarks and keeps the evaluation process focused on specifics.
Richard Jean So, in “All Models Are Wrong,” argues that a model “allows the researcher to isolate aspects of an interesting phenomenon, and in discovering certain properties of such aspects, the researcher can continue revising the model to identify additional properties” (669). This illustrates a feature of evaluation that might help you think through your own evaluation tasks: the process is cyclical and may involve revising and revisiting both the model’s understanding of language and your own. The purpose of evaluation is to make a model “better,” whatever “better” means for your specific research questions. Gabi Kirilloff (2022) argues that, rather than viewing the results of computational methods as interpretations or “reads” of some phenomenon (which implies that the process is both linear and finite), the information gleaned from computational models should instead be thought of as a type of context building that can change and morph over time. Even when an evaluation task fails, or the model does not behave in the way that you had anticipated, this failure can be incredibly useful for reflecting on your understanding both of what the model is attempting to represent and of the phenomenon itself. This reflection can be as granular as reevaluating the quality of the text you initially trained the model on, or the process through which that text was prepared for training.
There are a number of phenomena which can impact the performance of a model and thus the success of evaluation. Since the evaluation process described here is aimed at those interested in pre-twentieth-century texts and WEMs, one particularly significant influence on model performance is OCR quality. Todorov and Colavizza (2022) have demonstrated through their work on WEMs and OCR quality that, regardless of the size of a model’s vocabulary, poor OCR will always have a significant impact on the model’s results. While the skip-gram and CBOW algorithms that Word2Vec uses tend to outperform algorithms such as BERT in terms of resiliency to OCR noise, the influence of noisy OCR on the quality of trained models can still be quite significant. OCR noise remains a major challenge in text analysis of historical documents, as Mutuvi et al. (2018) write, since it can stem from a broad range of circumstances: actual inconsistencies in spelling, low-quality software, faded ink on the original documents, and so on.
For example, let’s say that you are evaluating how well a WEM you trained represents the word “drama” in a set of nineteenth-century texts. If these texts were OCR’d and the OCR output contained noise, the model may struggle to capture the use of the word “drama” given the spelling mistakes that are common in noisy OCR: the word may appear correctly in one place but as “druama” in another, and so on.
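One quick, informal way to gauge whether this kind of fragmentation is present is to look for near-miss spellings of a key term in the model's vocabulary. The sketch below is a rough heuristic of my own rather than part of the evaluation workflow described here; it assumes a model saved as my_model.model and uses Python's difflib to list vocabulary entries that resemble corrupted variants of "drama".

import difflib
from gensim.models import Word2Vec

# A rough heuristic for spotting possible OCR variants of a word, assuming a
# Gensim 4.x model saved as "my_model.model"; not part of the evaluation code.
model = Word2Vec.load("my_model.model")
vocabulary = list(model.wv.key_to_index)

# List vocabulary entries whose spelling is close to "drama" (e.g. "druama")
variants = difflib.get_close_matches("drama", vocabulary, n=10, cutoff=0.8)
print(variants)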
An additional consideration when evaluating your model’s performance is the impact of the corpus’s content on the model’s understanding of important concepts. Machine learning models, as Ted Underwood (2019) writes, aren’t made using specific definitions or definitive knowledge but are instead made using examples of the concepts we wish to model. This method of learning by example makes machine learning models incredibly useful for humanities researchers, as it allows humanists to explore how well a model performs when there is no definitive knowledge or definition; this very same feature also makes these models “very apt to pick up the assumptions or prejudices latent in a particular selection of evidence” (xiv–xv).
Approaching Model Evaluation
The words used in the WWP’s testing set were obtained by training a WEM on the Women Writers Online, Early Modern 1080, ECCO–TCP, Victorian Women Writers Project, DocSouth’s North American Slave Narratives, and Wright American Fiction texts. These corpora were selected both for their lack of OCR noise (most of the texts are transcribed) and for their wide representation of genre across the timeframe of interest. The resulting model was trained on all texts from each of these collections. In addition to training a word embedding model on these texts, we also identified the one thousand most frequently used words across the entire combined corpus, excluding stopwords. These one thousand words were chosen so that we could be sure to select words for evaluation that are significantly present in the model’s corpus.
If you are interested in training your own model for the purposes of creating a set of word pairs, the WWP team will soon publish a Jupyter notebook with code templates that will allow you to extract text from any number of text files, including files nested within a folder or set of folders; clean and tokenize this text; train a WEM on these texts; and finally count the n most commonly used words in the corpus, where n is a number of your choosing. The script requires that users have Gensim 4.0 installed and that all text files are saved in a plain-text, machine-readable format (.txt). The notebook will also provide the model evaluation code discussed below.
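As a rough outline of what such a pipeline can look like, the sketch below is my own illustration rather than the forthcoming WWP notebook; the folder name, training parameters, and file names are placeholders. It gathers .txt files from a folder and its subfolders, tokenizes them with Gensim's simple_preprocess, trains a Word2Vec model, and counts the n most common tokens while excluding Gensim's stopword list.

from collections import Counter
from pathlib import Path

from gensim.models import Word2Vec
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

# A sketch of a training-and-counting pipeline; folder name, parameters, and
# output path are illustrative placeholders, not the WWP's own settings.
corpus_folder = Path("corpus")  # top-level folder; nested subfolders are included

# Read and tokenize every plain text file, producing one token list per document
documents = []
for path in sorted(corpus_folder.rglob("*.txt")):
    text = path.read_text(encoding="utf-8", errors="ignore")
    documents.append(simple_preprocess(text))  # lowercases and strips punctuation

# Train a skip-gram Word2Vec model on the tokenized documents
model = Word2Vec(
    sentences=documents,
    vector_size=100,
    window=5,
    min_count=5,
    sg=1,
    workers=4,
)
model.save("my_model.model")

# Count the n most frequently used words, excluding stopwords
n = 1000
counts = Counter(
    token for doc in documents for token in doc if token not in STOPWORDS
)
for word, count in counts.most_common(n):
    print(word, count)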
Once the top one thousand most commonly used words were obtained from this multi-collection corpus, the words were reviewed by WWP team members in order to select those we considered significant within the context of the corpus or representative of important concepts. Specific place names were, for the most part, not included in the final set of words, primarily because we wanted the set of testing words to be generalizable and useful across a broad spectrum of pre-twentieth-century texts from a range of locations.
The final number of words nominated for generating pairings was 600. In order to get an initial sense of which word pairings might be the most generative, we queried the model for the most similar words to each of the 600 words. We then sorted through the most similar words generated by the model and selected those we thought represented key relationships with the testing word. In some cases, we supplemented the word pairings with other words we thought significantly represented some concept or idea, using the most similar words generated by the model as a guide. I chose this process for two reasons. First, by selecting words that we already knew the model placed in close relationships, we could more accurately test how the evaluation code itself performed, especially when supplemented with additional words we came up with ourselves. Second, these are words which we knew were widely represented within the model’s vocabulary. It would be less useful to evaluate words that appear only rarely in the model’s corpus, since the purpose of this evaluation is to test how well the model understands the language of the corpus it is trained on. Cosine similarity scores are the primary measure described here for evaluating the model’s understanding of concepts because, as Kusner et al. (2015) write, distances between embedded word vectors are semantically significant.
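To illustrate what that review step can look like in code, the sketch below assumes a model saved as my_model.model (like the one from the sketch above) and a short, hypothetical list of candidate testing words; it simply prints each candidate's nearest neighbors so that a researcher can decide which relationships are worth turning into pairings.

from gensim.models import Word2Vec

# A sketch of the pairing-review step, assuming a model saved as
# "my_model.model"; the candidate words are hypothetical examples.
model = Word2Vec.load("my_model.model")
candidate_words = ["king", "breakfast", "sofa"]

for word in candidate_words:
    if word not in model.wv:
        print(f"{word}: not in the model's vocabulary")
        continue
    neighbors = model.wv.most_similar(word, topn=10)
    formatted = ", ".join(f"{w} ({score:.2f})" for w, score in neighbors)
    print(f"{word}: {formatted}")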
If you wish to generate your own word pairings, ideally you would train a model on either a selection of texts from your training corpus, or texts from a closely related corpus—the goal is to work with a corpus that is not identical to your own, but that you expect to have similar relationships between words.
Why should you develop your test set from a different model from the one you are evaluating? For example, let’s say you select the word “king” as a testing word and you query your model to see which words might make a good pair with your testing word. You determine that since the model gives the word “power” a high cosine similarity score with the word “king,” you should test these words as a pair. You run the evaluation code on this word pairing to discover that the cosine similarity of these words is high, which means the model is successfully understanding the connection between the two, right? Not necessarily. If you only select word pairings based on the top most frequently used words and words which share a high cosine similarity score with these words, you aren’t actually testing your model in a true sense—you can’t test the model if you pull all the “answers” to the test from the model itself. For this reason, we recommend either training a model with a different selection of comparable texts in order to generate your own word pairings or following a similar methodology so that your evaluation is more accurate.
Running the Model Evaluation Code
The code used for conducting the actual evaluation of the model is adapted from the IceBATS project, whose team of researchers developed an Icelandic version of the BATS analogy test and released the relevant code for conducting this analysis. The IceBATS code accepts plain text files in which each testing (or query) word is listed, followed by the acceptable pairings for that word.
I made very slight modifications to the evaluation code released by the IceBATS team so that it allows only the multiple-answer, BATS-style format for word-pair testing, rather than both the multiple-answer format and the A:B as C:D format. I made this decision primarily because single-answer analogy testing has many of the limitations listed above, and I wanted to prioritize a single type of evaluation as a result. For those interested in single-answer, A:B as C:D analogy testing, the built-in evaluation function that comes with Word2Vec can evaluate models in this format without any additional code. For those interested in testing other word embedding algorithms, the IceBATS team has also released alternate versions of the evaluation code, which can be accessed on the team’s GitHub repository.
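In Gensim 4, that built-in function is evaluate_word_analogies, which reads a plain text file of ": section" headers followed by four-word A B C D lines. The sketch below is a minimal example, assuming a model saved as my_model.model and an analogy file named analogies.txt in that format.

from gensim.models import Word2Vec

# A minimal sketch of Gensim's built-in single-answer analogy test, assuming a
# model saved as "my_model.model" and a test file named "analogies.txt".
model = Word2Vec.load("my_model.model")
score, sections = model.wv.evaluate_word_analogies("analogies.txt")
print(f"Overall accuracy: {score:.3f}")
for section in sections:
    correct, incorrect = len(section["correct"]), len(section["incorrect"])
    print(section["section"], correct, "correct,", incorrect, "incorrect")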
The required format of the word pairs for the modified, multiple-answer evaluation task is as follows:
: furniture
sofa living-room/parlor/sitting-room/den
table kitchen/dining-room
chair living-room/parlor/sitting-room/den/kitchen/dining-room
When the evaluation code is run, it reports both the number of words in the testing set that were not in the model’s vocabulary and the final accuracy score for the model. Because the code allows words that are not part of the model’s vocabulary to be included in the evaluation (that is, the evaluation does not stop when it encounters an out-of-vocabulary word), you don’t necessarily need to check for occurrences of each word you decide to include in the word pairings for testing, although you may want to, depending on the evaluation goals of the research project and on how many words the evaluation code reports as missing from the vocabulary.
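To make the file format and the accuracy calculation more concrete, here is a simplified sketch of a multiple-answer analogy check in the spirit of the BATS and IceBATS approach. It is an illustration only, not the IceBATS code or the WWP's adaptation of it, and it makes simplifying choices: it uses only the first acceptable answer of the source pair when forming each analogy and counts only the model's single top prediction.

from gensim.models import Word2Vec

# A simplified illustration of a multiple-answer analogy check; this is not
# the IceBATS code or the WWP's adaptation of it. It assumes a model saved as
# "my_model.model" and a test file ("pairs.txt") in the format shown above:
# ": category" headers followed by "word answer1/answer2/..." lines.
model = Word2Vec.load("my_model.model")

# Parse the test file into {category: [(query_word, [acceptable answers]), ...]}
categories, current = {}, None
with open("pairs.txt", encoding="utf-8") as handle:
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line.lstrip(": ")
            categories[current] = []
        else:
            word, answers = line.split(maxsplit=1)
            categories[current].append((word, answers.split("/")))

correct, attempted, skipped = 0, 0, 0
for category, pairs in categories.items():
    for a, b_answers in pairs:
        for c, d_answers in pairs:
            if a == c:
                continue
            # Form the analogy "a is to b as c is to ?", using the first
            # acceptable answer for a as b
            b = b_answers[0]
            if any(w not in model.wv for w in (a, b, c)):
                skipped += 1
                continue
            prediction = model.wv.most_similar(
                positive=[b, c], negative=[a], topn=1
            )[0][0]
            attempted += 1
            if prediction in d_answers:
                correct += 1

print(f"Analogies skipped because of out-of-vocabulary words: {skipped}")
if attempted:
    print(f"Accuracy: {correct / attempted:.3f}")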
Final Takeaways
Model evaluation remains one aspect of WEM research that researchers can’t quite pin down to a definitive method or prescribed approach. There are a number of approaches to evaluating the ability of a model to understand the vocabulary of a corpus, and much of the evaluation process depends on what you consider “understand the vocabulary” to mean. For some projects, “understand” can mean a more structural, grammatical understanding. For others, “understand” may mean a clear and consistent scoring of concepts or specific terminology. Whatever your evaluation needs are, evaluation methods and tasks should be modified to best answer the kinds of questions you are interested in answering.
Bibliography
Bode, Katherine. “The Equivalence of ‘Close’ and ‘Distant’ Reading; or, Toward a New Object for Data-Rich Literary History.” Modern Language Quarterly, vol. 78, no. 1, Mar. 2017, pp. 77–106. Silverchair, https://doi.org/10.1215/00267929-3699787.
Bowker, Geoffrey C., and Susan Leigh Star. Sorting Things Out: Classification and Its Consequences. MIT Press, 2000.
Cordell, Ryan. “‘Q i-Jtb the Raven’: Taking Dirty OCR Seriously.” Book History, vol. 20, no. 1, 2017, pp. 188–225. Project MUSE, https://doi.org/10.1353/bh.2017.0006.
D’Ignazio, Catherine, and Lauren F. Klein. Data Feminism. MIT Press, 2020.
Dobson, James. “Interpretable Outputs: Criteria for Machine Learning in the Humanities.” Digital Humanities Quarterly, vol. 15, no. 2, June 2020.
Kirilloff, Gabi. “Computation as Context: New Approaches to the Close/Distant Reading Debate.” College Literature, vol. 49, no. 1, 2022, pp. 1–25. Project MUSE, https://doi.org/10.1353/lit.2022.0000.
Le, Quoc V., and Tomas Mikolov. Distributed Representations of Sentences and Documents. arXiv:1405.4053, arXiv, 22 May 2014. arXiv.org, http://arxiv.org/abs/1405.4053.
Mikolov, Tomas, et al. “Distributed Representations of Words and Phrases and Their Compositionality.” Advances in Neural Information Processing Systems, vol. 26, Curran Associates, Inc., 2013. Neural Information Processing Systems.
Mutuvi, Stephen, et al. “Evaluating the Impact of OCR Errors on Topic Modeling.” Maturity and Innovation in Digital Libraries, edited by Milena Dobreva et al., Springer International Publishing, 2018, pp. 3–14. Springer Link, https://doi.org/10.1007/978-3-030-04257-8_1.
Nelson, Laura K. “Leveraging the Alignment between Machine Learning and Intersectionality: Using Word Embeddings to Measure Intersectional Experiences of the Nineteenth Century U.S. South.” Poetics, vol. 88, Oct. 2021, p. 101539. ScienceDirect, https://doi.org/10.1016/j.poetic.2021.101539.
Schöch, Christof. “Topic Modeling Genre: An Exploration of French Classical and Enlightenment Drama.” Digital Humanities Quarterly, vol. 11, no. 2, May 2017.
So, Richard Jean. “‘All Models Are Wrong.’” PMLA, vol. 132, no. 3, May 2017, pp. 668–73. Cambridge University Press, https://doi.org/10.1632/pmla.2017.132.3.668.
Todorov, Konstantin, and Giovanni Colavizza. An Assessment of the Impact of OCR Noise on Language Models. arXiv:2202.00470, arXiv, 26 Jan. 2022. arXiv.org, http://arxiv.org/abs/2202.00470.
Witmore, Michael. “Text: A Massively Addressable Object.” Debates in the Digital Humanities, edited by Matthew K. Gold, University of Minnesota Press, 2012.