Validation and assessment of research with word embedding models

The following questions and criteria are intended to help you think through what constitutes a persuasive use of word embedding models in humanities research. Understanding how researchers use word embedding models to answer larger research questions can help you navigate current research more effectively, and it can also help you understand your own models. Below, we suggest some questions to ask when you are evaluating the model you’re using, evaluating your own research argumentation, or evaluating other people’s research. We also include some potential pitfalls to avoid when using these methods. In general, these questions are best addressed after you have moved past the training and testing phase, because they can point out pitfalls you may have missed and may even prompt you to return to training and testing. The process of working with word embedding models is often quite iterative and can involve returning to previous steps or revisiting your research questions.

Evaluating models:

  • Are your results consistent across models? When you train a series of models on the same corpus using the same parameters, do you get consistent cosine similarities for the same sets of words? (Note: because training a model is a probabilistic process, you won’t get identical results from model to model, even if they’re trained on exactly the same corpus, but the results should be comparable.)
  • When you generate groups of “similar” terms (either by k-means clustering or using a query term), do you get plausibly related groups of words for common and moderately common query terms? If you don’t get plausible groupings for moderately common words, you should proceed with caution; if you don’t get plausible groupings even for common words, that is a sign the model may not be very useful (whether because of small corpus size or some other factor). Word embedding models are often evaluated on their ability to mimic human understandings of language: if your model groups words together in a way that doesn’t make sense to you as a human reader, something may have gone wrong in the training process, or there may be issues with your corpus.
  • When you do the various forms of vector math (addition, subtraction, analogies), do your results continue to seem plausible? Do analogies for common words work in a plausible way? It is useful to build a set of test analogies for your own corpus, since “common words” will vary from corpus to corpus and each test analogy should be relevant to your materials: if you are working with a corpus of recipes, for example, your analogies should probably be food-related. A sketch of these checks appears after this list.
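As a concrete starting point for the checks above, the sketch below trains several models on the same corpus with the same parameters and then compares cosine similarities, nearest neighbours, and a test analogy across them. This is only one possible approach: it assumes gensim 4.x, a plain-text file named corpus.txt (one tokenized sentence per line), and example terms from a hypothetical recipe corpus; substitute your own corpus file and query terms.

```python
# A minimal consistency check, assuming gensim 4.x and a plain-text corpus file
# ("corpus.txt", one tokenized sentence per line) standing in for your own corpus.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = list(LineSentence("corpus.txt"))   # hypothetical corpus file

# Train several models on the same corpus with the same parameters.
models = [
    Word2Vec(sentences, vector_size=100, window=5, min_count=5, seed=seed)
    for seed in (1, 2, 3)
]

# 1. Do the same word pairs get comparable cosine similarities across models?
test_pairs = [("bread", "flour"), ("boil", "simmer")]   # replace with pairs relevant to your corpus
for w1, w2 in test_pairs:
    sims = [m.wv.similarity(w1, w2) for m in models]
    print(w1, w2, ["%.3f" % s for s in sims])

# 2. Do nearest neighbours for a moderately common query term look plausible?
# (most_similar raises a KeyError if the word is not in the model's vocabulary.)
for m in models:
    print(m.wv.most_similar("butter", topn=10))

# 3. Does a corpus-appropriate analogy behave plausibly?
# e.g. roast : beef :: bake : ?  (expecting bread-like terms in a recipe corpus)
for m in models:
    print(m.wv.most_similar(positive=["beef", "bake"], negative=["roast"], topn=5))
```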

Assessing the persuasiveness of your own research arguments:

  • Are you able to articulate directly and clearly how the results of your analysis relate to your research questions? Does the model help to support the argument in a substantive way?
  • Would the research questions be interesting and useful if you weren’t using digital tools to answer them?
  • Have you chosen to omit certain aspects of the research simply because they seemed to require too much work, or because they lay outside your comfort zone? Could this omission potentially lead to a misunderstanding of your results?
  • Can you explain (have you explained) why you chose the specific parameters and other aspects of the process that are under your control? (An explanation of the form “I just tried it and it seemed OK” doesn’t count, nor does “other people seem to do it this way.”) Have you tried alternative settings and ascertained what impact they have on the results?
  • Can you explain (have you explained) why you chose the specific queries you did? Did you try synonyms, other spellings, etc.? Are you able to articulate how you selected those query terms and how they relate to your research questions?
  • Have you taken into account the non-deterministic aspects of the process? For instance, do your results rely on things that may vary from model to model (even with the same parameters), such as a specific word appearing at a specific cosine similarity to a given target word? One way to check is to look at results across a group of models, as sketched after this list. For more on model validation and assessment, see this tutorial on model training and validation. For an example of analysis that looks at results across a group of models, see this article by Gabriel Recchia.
  • Are you confident that your results aren’t influenced by a single anomalous text or other artifact of your corpus that isn’t relevant to your research questions (such as OCR errors)?
  • Can you locate evidence to confirm your results by looking at specific instances within your corpus?
  • Is it possible that your results aren’t as conclusive or remarkable as they appear? For instance, imagine for a moment that the opposite of what you’re claiming is true: would your evidence also support that possibility? Or is there a simpler explanation or argument that would also account for your results? It might be helpful to discuss this question with someone familiar with your research.
  • Imagine you needed to redo this analysis a year from now: would you be able to do so based on the information you have documented? Think about the information you would need, or information a reader would need, in order to follow your path again.
  • A bigger and more difficult question to consider: have you thought carefully about what the numbers mean? For instance, given that cosine similarity measures the proximity of two words in vector space, what kinds of cosine similarity numbers are relevant to your specific analysis and why? If you’re comparing cosine similarities for different groups of words, what do different levels of “similarity” really mean? This may not be a question you can answer, but it may be worth reflecting on in your discussion of results.
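One way to address the question about non-deterministic results is to aggregate across a group of models rather than relying on any single one. The sketch below is one illustration of that idea (not a reproduction of the method in the article linked above): it reports the mean and spread of a cosine similarity across models, and keeps only the nearest neighbours that every model agrees on. It assumes the models list trained in the earlier sketch, and the example terms are placeholders for words from your own corpus.

```python
# A sketch of checking whether a finding holds across several models,
# assuming the "models" list trained in the earlier sketch.
from statistics import mean, stdev

def similarity_across_models(models, w1, w2):
    """Mean and spread of the cosine similarity for a word pair across models."""
    sims = [m.wv.similarity(w1, w2) for m in models]
    return mean(sims), stdev(sims)

def stable_neighbours(models, query, topn=20):
    """Neighbours that appear in the top-n list of every model."""
    neighbour_sets = [
        {word for word, _ in m.wv.most_similar(query, topn=topn)}
        for m in models
    ]
    return set.intersection(*neighbour_sets)

# Example usage (replace these terms with words relevant to your own corpus):
print(similarity_across_models(models, "bread", "flour"))
print(stable_neighbours(models, "butter"))
```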

Assessing the persuasiveness of other people’s research arguments: all of the above, plus:

  • Do they provide specifics of how they got their results? Are they sharing the data corpus, the parameters they used, the specifics about the tool they used? (Note that there are several different algorithms for training word embedding models that can produce quite different results.)
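Keeping a record of the training configuration makes it much easier to share these specifics with readers, and to reproduce your own analysis later. A minimal sketch of one way to do this, assuming gensim 4.x; the file names and parameter values here are placeholders for illustration:

```python
# Record exactly how a model was trained, alongside the model itself.
import json
import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

params = {"vector_size": 100, "window": 5, "min_count": 5, "sg": 1, "epochs": 5, "seed": 1}

model = Word2Vec(list(LineSentence("corpus.txt")), **params)
model.save("recipes.model")   # hypothetical file names

# Save the training configuration so it can be reported and reused.
with open("recipes_training_record.json", "w") as f:
    json.dump({"corpus_file": "corpus.txt",
               "gensim_version": gensim.__version__,
               "algorithm": "skip-gram (sg=1)",
               "parameters": params}, f, indent=2)
```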

Avoiding potential gotchas and conceptual pitfalls:

  • Note that the top items in one list of results (for instance, words related to a particular query term) aren’t necessarily comparable in their level of relevance/salience to the top items in a different list; you need to look carefully at the cosine similarity numbers, and think about what level of proximity is relevant for your research analysis.
  • Cosine similarity does not equal synonymity: rather, it means that words are used in similar contexts (which could mean that they are actually antonyms, or words with related functions).
  • Words that are considered “common” (for purposes of validating a model) may not actually be common in your corpus, depending on its scope and composition. For instance, a corpus of legislative documents might not contain the words “king” and “queen” at all, which would make the frequently-used analogy of man:woman::king:queen irrelevant as a test.
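Before leaning on any “common” word or stock analogy to validate a model, it is worth confirming that the words actually occur in the model’s vocabulary, and how often. A minimal sketch, assuming gensim 4.x and a trained model as in the sketches above; replace the test words with terms that matter for your own corpus:

```python
# Check whether candidate test words are in the model's vocabulary, and how often
# they occurred in the training corpus.
test_words = ["man", "woman", "king", "queen"]   # replace with words relevant to your corpus

for word in test_words:
    if word in model.wv.key_to_index:
        print(f"{word}: appears {model.wv.get_vecattr(word, 'count')} times in the training corpus")
    else:
        print(f"{word}: not in the model's vocabulary (absent from the corpus or below min_count)")
```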