<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://wwp-test.northeastern.edu/outreach/seminars/_utils/schema/yaps.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="https://wwp-test.northeastern.edu/outreach/seminars/_utils/schema/yaps.isosch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-stylesheet type="text/css" href="https://wwp-test.northeastern.edu/outreach/seminars/_utils/stylesheets/yaps-tei.css"?>
<!-- $Id: word_vectors_intro.xml 51374 2026-04-01 16:35:16Z aclark $ -->
<TEI xmlns="http://www.wwp.northeastern.edu/ns/yaps" version="5.0">
	<teiHeader>
		<fileDesc>
			<titleStmt>
				<title>Introduction to Word Vectors</title>
				<author>Julia Flanders</author>
			</titleStmt>
			<editionStmt>
				<edition>Word Vectors for the Thoughtful Humanist</edition>
			</editionStmt>
			<publicationStmt>
				<distributor>Women Writers Project (via website)</distributor>
				<address>
					<addrLine>url:mailto:wwp@neu.edu</addrLine>
				</address>
				<date when="2019-04-01"/>
				<availability status="restricted">
					<p>Copyright 2019 Syd Bauman, Julia Flanders, Sarah Connell, and the Women
						Writers Project</p>
					<p>This TEI-encoded XML file is available under the terms of the <ref
							target="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons
							Attribution-ShareAlike 3.0 (Unported)</ref> license.</p>
				</availability>
				<pubPlace>Boston, MA USA</pubPlace>
			</publicationStmt>
			<sourceDesc>
				<p>Gives a conceptual overview of the most basic concepts, terminology, and
					processes of word embedding models.</p>
			</sourceDesc>
		</fileDesc>
		<revisionDesc>
			<change when="2021-06-24" who="jflanders.lfw">Updated for third institute</change>
			<change when="2021-05-17" who="jflanders.lfw">Updated for second institute</change>
			<change when="2019-04-01" who="jflanders.lfw">Initial draft</change>
		</revisionDesc>
	</teiHeader>
	<text>
		<presentation>
			<abstract>
				<p>This tutorial offers an introductory overview of word embedding models and
					their associated concepts and terminology.</p>
			</abstract>
			<section>
				<head>A Road Map</head>

				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_roadmap.png"/>
					</figure>

				</slide>
				<lectureNote>
					<p>As we've already seen, word vectors are complicated...</p>
					<p>The next few sessions are intended to offer an overview, from several
						different angles: <list>
							<item>a walk through the special concepts and terminology so that we're
								all comfortable with them</item>
							<item>a walk through the actual process of training and querying a
								model</item>
							<item>an exploration of the mathematical side of word embedding models:
								what do we mean by <soCalled>vector space</soCalled>?</item>
							<item>a review of the tool set we use with word embedding models: what
								are the actual technologies we use and what role do they
								play?</item>
						</list>
					</p>
					<p>Hopefully by the end, we'll have gone over the same material from enough
						different perspectives that it will all make perfect sense!</p>
					<p>And at various points, we'll take a step back and think about the explanatory
						process itself: what kinds of explanation might work best for different
						audiences (especially readers of our scholarship, project collaborators,
						colleagues, grant reviewers, also potentially our students)</p>
				</lectureNote>
				<tutorial>
					
					<p>Word vectors can be pretty complicated. This tutorial is designed to offer an
						overview of word vectors, from several different angles:</p>
					<list>
							<item>a walk through the special concepts and terminology</item>
							<item>a walk through the actual process of training and querying a
								model</item>
							<item>an exploration of the mathematical side of word embedding models:
								what do we mean by <soCalled>vector space</soCalled>?</item>
							<item>a review of the tool set we use with word embedding models: what
								are the actual technologies we use and what role do they
								play?</item>
						</list>
					
			
				</tutorial>
			</section>

			<section>
				<head>Corpus and model</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_corpus_model.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>We're going to hear the terms <term>corpus</term> and <term>model</term> a
						lot this week: let's look more closely at those terms</p>
					<p>Corpus: <list>
							<item>In the simplest sense, the corpus is the body of textual material
								we are analysing</item>
							<item>A set of documents in some machine-readable form, that is ready
								for the word2vec program to ingest</item>
							<item>Our corpus might be derived from a larger research collection (or
								several different collections), maybe in another format (like
								TEI/XML) that contains extra information that we take advantage of
								when we generate the corpus that will be fed into Word2Vec</item>
							<item>So to get from the <term>research collection</term> to the
									<term>corpus</term> we might need to do some data conversion:
								from XML (or some other format) to plain text (which is what the
								Word2Vec tool requires)</item>
							<item>And we might also need to do some cleaning and regularization, to
								tame the irregularities of the original research collection. A
								little later on, we'll think about data formats and cleaning
								processes in more detail. </item>
							<item>So when we talk about <soCalled>the corpus</soCalled> here, we're
								talking about the plain-text corpus that is ready to be fed into
								Word2Vec </item>
						</list>
					</p>
					<p>Model: <list>
							<item>As we've already noted, the term <term>model</term> is an
								important one in digital humanities: in general terms, a model is a
								representation of something we are interested in, that captures some
								features of importance, in a way that makes it easier for us to
								examine and learn about that object of interest. So for instance a
								TEI-encoded text is a <emph>representation of a text</emph> that
								makes the structure and content of that text easier for us to see
								and work with. A word-embedding model of a corpus is a
									<emph>representation of a corpus of texts</emph>, in a way that
								makes the semantic relationships between words easier for us to see
								and work with.</item>
							<item>Practically speaking, the <soCalled>model</soCalled> we will be
								dealing with is a processed version of the corpus, produced by the
								Word2Vec tool, which represents the positioning of each word within
								the model as a vector</item>
							<item>So for now the key point is: the corpus is a collection of
								documents, while the model is a processed, computed representation
								of the textual data contained in those documents</item>
						</list>
					</p>
					<p>The <term>data preparation</term> process is how you get from the research
						collection to the corpus</p>
					<p>The <term>training</term> process is how you get from the corpus to the
						model.</p>
				</lectureNote>
				<tutorial>
					<p>Two important terms to learn before beginning your work with word vectors
						are <term>corpus</term> and <term>model</term>. Let's take a closer look at
						those terms.</p>
					<p>Corpus: <list>
							<item>In the simplest sense, the corpus is the body of textual material
								you are analysing, a set of documents in some machine-readable form, ready
								for the word embedding algorithm to ingest</item>
							<item>Your corpus might be derived from a larger research collection (or
								several different collections), maybe in another format (like
								TEI/XML) that contains extra information that you should take
								advantage of when you generate the corpus that will be fed into
								Word2Vec. Generally, you want your corpus to consist of
								machine-readable texts (plain text).</item>
							<item>So to get from the <term>research collection</term> to the
									<term>corpus</term> you might need to do some data conversion:
								from XML (or some other format) to plain text (which is what the
								model training algorithm requires)</item>
							<item>And you might also need to do some cleaning and regularization, to
								tame the irregularities of the original research collection. Later
								in the tutorial, we'll look at data formats and cleaning processes
								in more detail. </item>
							<item>So <soCalled>the corpus</soCalled> here is a plain-text corpus
								that is ready to be fed into the model </item>
						</list>
					</p>
					<p>Model: <list>
							<item>The term <term>model</term> is an important one in digital
								humanities: in general terms, a model is a representation of
								something we are interested in, that captures some features of
								importance, in a way that makes it easier for us to examine and
								learn about that object of interest. So for instance a TEI-encoded
								text is a <emph>representation of a text</emph> that makes the
								structure and content of that text easier for us to see and work
								with. A word-embedding model of a corpus is a <emph>representation
									of a corpus of texts</emph>, in a way that makes the semantic
								relationships between words easier for us to see and work
								with.</item>
							<item>Practically speaking, the <soCalled>model</soCalled> in this
								tutorial is a processed version of the corpus, produced by the
								Word2Vec tool, which represents the positioning of each word within
								the model as a vector</item>
							<item>So for now the key point is: the corpus is a collection of
								documents, while the model is a processed, computed representation
								of the textual data contained in those documents</item>
						</list>
					</p>
					<p>Some other important terms that will appear in this tutorial are:</p>
					<list>
						<item>The <term>data preparation</term> process is how you get from the
							research collection to the corpus</item>
						<item>The <term>training</term> process is how you get from the corpus to
							the model.</item>
					</list>
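					<p>As a minimal sketch of the cleaning and regularization step, the following
						plain-Python function lowercases a text and strips punctuation. The
						specific choices here are illustrative, not prescriptive: every project
						decides for itself what to regularize.</p>

```python
# Minimal sketch of corpus cleaning/regularization. The choices here
# (lowercase everything, drop punctuation) are illustrative only; real
# projects make these decisions per corpus.
import re

def clean(text):
    text = text.lower()
    # keep letters, digits, and whitespace; replace everything else with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # collapse runs of whitespace into single spaces
    return " ".join(text.split())

print(clean("The Sacred, the Holy; and the Consecrated!"))
# prints: the sacred the holy and the consecrated
```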
				</tutorial>
			</section>

			<section>
				<head>Parameters</head>
				<slide>
					<figure>
						<graphic height="600px"
							url="../../../_utils/gfx/w2v_industrial_pasta_process_parameters.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>Remember that we said different researchers might want to use the model for
						different things, which would result in training/generating the model
						somewhat differently. The way we control that training process is by
						adjusting a set of <term>parameters</term>.</p>
					<p>You can think of the training process (where we take a corpus and create a
						model of it) as being sort of like an industrial operation: </p>
					<list>
						<item>you take some raw materials and feed them into a big machine, and on
							the other end you get out some product</item>
						<item>and this hypothetical machine has a whole bunch of knobs and levers on
							it that you use to control the settings</item>
						<item>in our word2vec model training, the parameters are those knobs and
							levers, that control the training process</item>
						<item>depending on how you adjust them, you get differently trained models
							with different behaviours</item>

					</list>
					<p>We'll take a quick look now at two of these parameters, so that you can get a
						sense of how they affect the training process; they also have an important
						impact on how we interpret the results of the model. Later in the week,
						we'll look at these parameters in more detail and think about the effect
						these specific settings have on our models.</p>
				</lectureNote>
				<tutorial>
					
					<p>Different researchers might want to use the model for different purposes,
						and depending on the goal of the research, the model might be trained
						somewhat differently. You control that training process by adjusting a set
						of settings known as <term>parameters</term>.</p>
					<p>The training process (where we take a corpus and create a model of it) is
						sort of like an industrial operation with the following steps: </p>
					<list>
						<item>you take some raw materials and feed them into a big machine, and on
							the other end you get out some product</item>
						<item>and this hypothetical machine has a whole bunch of knobs and levers on
							it that you use to control the settings</item>
						<item>in our word2vec model training, the parameters are those knobs and
							levers, that control the training process</item>
						<item>depending on how you adjust them, you get differently trained models
							with different behaviours</item>

					</list>
					<p>The tutorial will next walk you through some of these parameters, so that you
						can get a sense of how they affect the training process; they also have an
						important impact on how we interpret the results of the model. Later, we'll
						look at these parameters in more detail and think about the effect these
						specific settings have on our models. This tutorial will not cover all of
						the parameters that are available for the training process, so refer to the
							<ref target="https://radimrehurek.com/gensim/models/word2vec.html"
							>word2vec documentation</ref> for a list of additional parameters that
						may be of use to you.</p>

				</tutorial>
			</section>

			<section>
				<head>Window</head>
				<slide>
					<figure>
						<graphic height="800px" url="../../../_utils/gfx/w2v_window.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>The first parameter for us to consider is the concept of the
							<term>window</term></p>
					<p>And here we come to a fundamental assumption for a lot of text analysis: that
						words that are used together have something to do with one another</p>
					<p>What does it mean for words to be <soCalled>used together</soCalled>? <list>
							<item>right next to one another? all or nothing?</item>
							<item>more relevant the closer they are? sort of a gradient?</item>
							<item>contained within the same semantic construct, like a sentence or a
								paragraph? (problem: we're working with plain text so we don't have
								access to <soCalled>semantic constructs</soCalled>)</item>
						</list>
					</p>
					<p>In Word2Vec, instead of these, we use a <soCalled>window</soCalled>: <list>
							<item>a span of text of a specified length, like a viewing port that we
								move over the text that allows us to see X words at a time</item>
							<item>we can control the size of the window (it is one of the
									<term>parameters</term> we just talked about)</item>
							<item>the Word2Vec algorithm is like a bookworm reading its way through
								the text, bite by bite</item>
							<item>each taste is localized by the window: each bite gives the
								processor a set of words that are considered <soCalled>used
									together</soCalled></item>
							<item>and the size of the bite affects how many words are considered
								together in this way</item>
							<item>a bigger window lets us treat larger groups of words as
									<soCalled>related</soCalled></item>
							<item>[pause and discuss for a moment:] what might be the results for
								our analysis of a larger or smaller window? (imagine a window that
								is thousands of words, as big as an entire chapter; imagine a window
								that is only two words wide)</item>
						</list>
					</p>
					<p>Remember that this is a <term>machine learning</term> process and moreover it
						is an <term>unsupervised machine learning process</term>: one that starts
						from a state of complete ignorance and has to bootstrap itself.
								<list><item>So another way to imagine the approach being taken in
								the training process: picture that you have a big bag containing all
								the words in the corpus. You shake the bag and then dump it out on
								the floor. Now you start reading the corpus (i.e., the actual texts
								with their actual word order). </item>
							<item>Each time you read a word, you make observations about the words
								around it. </item>
							<item>Remember that this one observation doesn't give you any kind of
									<q>Truth!! 100%!!</q> about those words: it's just one little
								observed fact. Probabilistically, it contributes a tiny bit to our
								belief about the whole corpus. </item>
							<item>So based on those observations about word X, we move each of the
								context words a tiny bit closer to word X. Now we look at the next
									<q>word X</q> and its companions, and we move those words a
								little bit.</item>
							<item>note that the <soCalled>window</soCalled> is giving us two pieces
								of information: what's in the window, and what's not in the window.
								We'll come back to this in more detail later, but for now we can say
								that in addition to moving the words we <emph>do</emph> see, we also
								update the position of some of the words we <emph>don't</emph> see
								as we read the text. </item>
						</list>
					</p>
				</lectureNote>

				<tutorial>
					
					<p>The first parameter we will consider is the concept of the
							<term>window</term></p>
					<p>A fundamental assumption for many text analysis methods is: words that are
						used together have something to do with one another</p>
					<p>What does it mean for words to be <soCalled>used together</soCalled>? <list>
							<item>right next to one another? all or nothing?</item>
							<item>more relevant the closer they are? sort of a gradient?</item>
							<item>contained within the same semantic construct, like a sentence or a
								paragraph? (problem: we're working with plain text so we don't have
								access to <soCalled>semantic constructs</soCalled>)</item>
						</list>
					</p>
					<p>In Word2Vec, we use a <soCalled>window</soCalled> to define which words
						count as being used together. In the Word2Vec algorithm, a window means: <list>
							<item>a span of text of a specified length, like a viewing port that we
								move over the text that allows us to see X words at a time</item>
							<item>we can control the size of the window (it is one of the
									<term>parameters</term> we just talked about)</item>
							<item>the Word2Vec algorithm is like a bookworm reading its way through
								the text, bite by bite</item>
							<item>each taste is localized by the window: each bite gives the
								processor a set of words that are considered <soCalled>used
									together</soCalled></item>
							<item>and the size of the bite affects how many words are considered
								together in this way</item>
							<item>a bigger window lets us treat larger groups of words as
									<soCalled>related</soCalled></item>
							<item>Take a moment to ask yourself: what might the results of the
								analysis be with a larger or smaller window? (imagine a window that
								is thousands of words wide, as big as an entire chapter; now
								imagine a window that is only two words wide). Pausing to reflect
								on this question will help you make better decisions about the
								window size for your own models.</item>
						</list>
					</p>
					<p>Word2Vec is a <term>machine learning</term> process and moreover it is an
							<term>unsupervised machine learning process</term>: one that starts from
						a state of complete ignorance and has to bootstrap itself. <list><item>So
								another way to imagine the approach being taken in the training
								process: picture that you have a big bag containing all the words in
								the corpus. You shake the bag and then dump it out on the floor. Now
								you start reading the corpus (i.e., the actual texts with their
								actual word order). </item>
							<item>Each time you read a word, you make observations about the words
								around it. </item>
							<item>Keep in mind that this one observation doesn't give you any kind
								of <q>Truth!! 100%!!</q> about those words: it's just one little
								observed fact. Probabilistically, it contributes a tiny bit to your
								belief about the whole corpus. </item>
							<item>So based on those observations about word X, we move each of the
								context words a tiny bit closer to word X. Now we look at the next
									<q>word X</q> and its companions, and we move those words a
								little bit.</item>
							<item>note that the <soCalled>window</soCalled> is giving us two pieces
								of information: what's in the window, and what's not in the window.
								We'll come back to this in more detail later in the tutorial, but
								for now we can say that in addition to moving the words we
									<emph>do</emph> see, we also update the position of some of the
								words we <emph>don't</emph> see as we read the text. </item>
						</list>
					</p>
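					<p>The viewing-port idea can be sketched in plain Python (the sentence below is
						made up for illustration): for each target word, we collect the words that
						fall inside a symmetric window around it.</p>

```python
# Sketch: for each target word, collect the context words that fall
# inside a symmetric window of the given size around it.
def context_windows(tokens, window):
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        # the words before the target plus the words after it, within the window
        context = tokens[lo:i] + tokens[i + 1 : i + 1 + window]
        yield target, context

tokens = "the sacred shrine was holy".split()
for target, context in context_windows(tokens, window=2):
    print(target, context)
```

					<p>Each printed pair is one <soCalled>bite</soCalled> of the bookworm: a target
						word and the words the algorithm treats as used together with it.</p>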
				</tutorial>
			</section>

			<section>
				<head>Iterations</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_iterations2.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>We've talked about the creation of a model as a <term>training</term>
						process, and we've just imagined it as a bookworm eating its way through the
						text, repeatedly. The trained model is the representation of the probability
						that words appear within the same window. <list>
							<item>As we just noted, the model begins in a state of complete
								randomness: words dumped on the floor. But after one read through
								the corpus, the words on the floor have moved around a bit. The
								machine is learning! Now, if we repeat the process, we can move them
								a bit further--it might seem as if we're getting the same
								information as we got before, but because the words on the floor are
								now in different (better) positions already, what we're doing is
								refining that information further. </item>
							<item>each pass through the corpus provides another set of adjustments,
								making the model more accurate</item>
							<item>each of these passes is called an <term>iteration</term>, and the
								more iterations the training process does, the more accurate the
								model (but of course the more time the process takes)</item>
							<item>you can control the number of iterations: it is another of the
									<term>parameters</term> we mentioned a moment ago</item>
						</list>
					</p>
				</lectureNote>
				<tutorial>
					
					<p>By now, you have read about the creation of a model as a
							<term>training</term> process, and we've just imagined it as a bookworm
						eating its way through the text, repeatedly. The trained model is the
						representation of the probability that words appear within the same window. <list>
							<item>As we just noted, the model begins in a state of complete
								randomness: words dumped on the floor. But after one read through
								the corpus, the words on the floor have moved around a bit. The
								machine is learning! Now, if we repeat the process, we can move them
								a bit further--it might seem as if we're getting the same
								information as we got before, but because the words on the floor are
								now in different (better) positions already, what we're doing is
								refining that information further. </item>
							<item>each pass through the corpus provides another set of adjustments,
								making the model more accurate</item>
							<item>each of these passes is called an <term>iteration</term>, and the
								more iterations the training process does, the more accurate the
								model (but of course the more time the process takes)</item>
							<item>you can control the number of iterations: it is another of the
									<term>parameters</term> we mentioned before</item>
						</list>
					</p>
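					<p>As a toy sketch (deliberately not the real word2vec mathematics), you can
						picture each iteration nudging a word's position a small step toward a word
						it co-occurs with; repeated passes refine the position further.</p>

```python
# Toy sketch of iterative refinement (not the actual word2vec update rule):
# each pass moves a word's position a small fraction of the remaining
# distance toward a co-occurring word.
def train(position, target, iterations, step=0.1):
    for _ in range(iterations):
        position = position + step * (target - position)
    return position

print(round(train(0.0, 1.0, iterations=1), 3))   # one pass: 0.1
print(round(train(0.0, 1.0, iterations=20), 3))  # twenty passes: much closer to 1
```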
				</tutorial>
			</section>



			<section>
				<head>Vectors: a first look</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_vector.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>Let's look next at some terms that may seem most distant from our humanistic
						expertise: the ones that refer to the mathematical aspects of word embedding
						models. The word <mentioned>vector</mentioned> has come up already: what is
						a vector and how is it relevant in this case? We'll start with a simple
						explanation first, and then circle back a bit later for more detail.</p>
					<p>A vector is basically a line that has both a specific length and a specific
						direction or orientation in space:<list>
							<item>we can describe that line using coordinates: one coordinate for
								each axis of information we have about the line</item>
							<item>in this example, the vector is the thick black line that starts at
								the origin (the point where all three axes are at zero) and extends
								out to the point in space where the x axis (the blue number) is at
								3, the y axis (the red number) is at 2, and the z axis (the green
								number) is at zero</item>
							<item>its direction and length are defined by those three
								dimensions</item>
							<item>any questions? This may be new for some and probably not current
								knowledge for most of us!</item>
						</list>
					</p>
					<p>In a word-embedding model, the model represents a text corpus almost like a
						dandelion: as if each word were at the end of one of the little dandelion
						threads: <list>
							<item>each thread projects at a slightly different angle</item>
							<item>each word is located at a slightly different point in this cloud
								of words</item>
							<item>and words that are nearer to one another in meaning are also
								nearer to one another in vector space.</item>
						</list>
					</p>

				</lectureNote>
				<tutorial>
					
					<p>Let's look next at some terms that may seem most distant from our humanistic
						expertise: the ones that refer to the mathematical aspects of word embedding
						models. The word <mentioned>vector</mentioned> has come up already: what is
						a vector and how is it relevant in this case? We'll start with a simple
						explanation first, and then circle back a bit later for more detail.</p>
					<p>A vector is basically a line that has both a specific length and a specific
						direction or orientation in space:<list>
							<item>we can describe that line using coordinates: one coordinate for
								each axis of information we have about the line</item>
							<item>in this example, the vector is the thick black line that starts at
								the origin (the point where all three axes are at zero) and extends
								out to the point in space where the x axis (the blue number) is at
								3, the y axis (the red number) is at 2, and the z axis (the green
								number) is at zero</item>
							<item>its direction and length are defined by those three
								dimensions</item>
							<item>This may be new and probably not current knowledge for most
								humanists, so don't worry if you are feeling a little lost!</item>
						</list>
					</p>
					<p>A naive representation would give each word as many dimensions as there are
						distinct words in the corpus: far too many for any one person to wrap their
						head around, and unwieldy even for the computer. A word-embedding model
						reduces this to a few hundred dimensions, and represents a text corpus
						almost like a dandelion: as if each word were
						at the end of one of the little dandelion threads: <list>
							<item>each thread projects at a slightly different angle</item>
							<item>each word is located at a slightly different point in this cloud
								of words</item>
							<item>and words that are nearer to one another in meaning are also
								nearer to one another in vector space.</item>
						</list>
					</p>
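					<p>As a small sketch in plain Python, the example vector from the slide can be
						written as its three coordinates, and its length computed from them
						(Pythagoras generalized to three dimensions).</p>

```python
# Sketch: the example vector described by its three coordinates,
# and its length computed from them.
import math

v = (3, 2, 0)  # x, y, z coordinates of the example vector

# length is the square root of the sum of the squared coordinates
length = math.sqrt(sum(c * c for c in v))
print(round(length, 3))  # prints 3.606
```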
				</tutorial>
			</section>

			<section>
				<head>Cosine Similarity: What is a cosine anyway?</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_cosine.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>So what does it mean to be <term>near</term> something in vector space? How
						do we measure this kind of proximity or association? If we understand these
						vectors as lines whose directionality and length reflects word associations
						in the corpus, then the more closely aligned two vectors are (the more they
						are going in the same direction for the same distance), the
							<soCalled>nearer</soCalled> they are for our purposes.</p>
					<p>We can measure that alignment by using a mathematical expression called a
							<term>cosine</term>. What is a cosine? <list>
							<item>If we have two vectors (two lines extending out in different
								directions), what we really have is a triangle (the third leg would
								be the line connecting the ends of those two vectors)</item>
							<item>Within a triangle, a cosine is a way of representing an angle in
								relation to the lengths of its two legs</item>
							<item>the exact formula (for right triangles) is shown here on the
								slide, but even without parsing that in detail, we can say that the
								cosine of an angle is the ratio between the side adjacent to it and
								the hypotenuse (for triangles without a right angle, the formula is
								a little more complex)</item>
							<item>So when those two sides are very similar in length and direction,
								the cosine is going to get closer and closer to 1</item>
						</list>
					</p>

				</lectureNote>
				<tutorial>
					<p>Now that we have a basic understanding of what vectors are, what does it mean
						to be <term>near</term> something in vector space? How do we measure this
						kind of proximity or association? If we understand these vectors as lines
						whose directionality and length reflects word associations in the corpus,
						then the more closely aligned two vectors are (the more they are going in
						the same direction for the same distance), the <soCalled>nearer</soCalled>
						they are for our purposes.</p>
					<p>We can measure that alignment by using a mathematical expression called a
							<term>cosine</term>. What is a cosine? <list>
							<item>If we have two vectors (two lines extending out in different
								directions and connected at the same origin point), then when we
								draw a third line connecting those two vectors, what we really have
								is a triangle.</item>
							<item>Within a triangle, a cosine is a way of representing an angle in
								relation to the lengths of its two legs</item>
							<item>For right triangles, the cosine of an angle is the ratio between
								two of the triangle's sides: the side adjacent to the angle divided
								by the hypotenuse (for triangles without a right angle, the formula
								is a little more complex)</item>
							<item>So when the angle between the two vectors is very small (the
								vectors point in nearly the same direction), the cosine gets closer
								and closer to 1; as the angle widens toward a right angle, the
								cosine drops toward 0.</item>
						</list>
					</p>
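					<p>Cosine similarity can be computed directly from its definition: the dot
						product of two vectors divided by the product of their lengths. A minimal
						sketch in plain Python:</p>

```python
# Sketch: cosine similarity computed directly from its definition,
# the dot product divided by the product of the vector lengths.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity((3, 2, 0), (3, 2, 0)), 3))  # identical vectors: 1.0
print(round(cosine_similarity((1, 0, 0), (0, 1, 0)), 3))  # perpendicular vectors: 0.0
```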
				</tutorial>
			</section>

			<section>
				<head>Cosine Similarity</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_cosinesimilarity.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>So now we can come back to our question of how to measure
							<soCalled>nearness</soCalled>. In word embedding models the measure of
						nearness that we use is something called <term>cosine similarity</term>. <list>
							<item>Roughly speaking, this is a measure of the similarity of two
								vectors, based on the cosine of the angle between them</item>
							<item>As we've seen, the more similar the two vectors are, the closer
								the cosine of the angle between them gets to 1. In practice, the
								values of cosine similarity range between zero and one
								(mathematically a cosine can go as low as -1, for vectors pointing
								in opposite directions): two identical vectors have a cosine
								similarity of 1; two completely unrelated vectors have a cosine
								similarity of zero</item>
							<item>So the smaller the cosine similarity, the less similar the words
								are, and the farther apart they are in vector space</item>
							<item>We'll talk a bit later on about what level of similarity really
								counts as <soCalled>similar</soCalled>, and you'll get a feel for
								it</item>
							<item>In general, anything above .5 starts to feel meaningful</item>
						</list>
					</p>
					<p>So in this example (a real-world example from the WWP corpus), if we take the
						word <mentioned>sacred</mentioned> as our starting point, the words
							<mentioned>holy</mentioned> and <mentioned>consecrated</mentioned> are
						fairly close in meaning (and have high cosine similarity); the word
							<mentioned>shrine</mentioned> is more distant but still related enough
						to be interesting.</p>
					<p>So far so good? Questions?</p>
				</lectureNote>
				<tutorial>
					<p>So now we can come back to our question of how to measure
							<soCalled>nearness</soCalled>. In word embedding models the measure of
						nearness that we use is something called <term>cosine similarity</term>. <list>
							<item>Roughly speaking, this is a measure of the similarity of two
								vectors, based on the cosine of the angle between them</item>
							<item>As we've seen, the more similar the two vectors are, the closer
								the cosine of the angle between them gets to 1. In practice, the
								values of cosine similarity range between zero and one
								(mathematically a cosine can go as low as -1, for vectors pointing
								in opposite directions): two identical vectors have a cosine
								similarity of 1; two completely unrelated vectors have a cosine
								similarity of zero</item>
							<item>So the smaller the cosine similarity, the less similar the words
								are, and the farther apart they are in vector space</item>
							<item>Later in this walkthrough, we will cover what level of
								similarity really counts as <soCalled>similar</soCalled>, and
								you'll get a feel for it</item>
							<item>In general, anything above .5 starts to feel meaningful</item>
						</list>
					</p>
					<p>So in this example (a real-world example from the WWP corpus), if we take the
						word <mentioned>sacred</mentioned> as our starting point, the words
							<mentioned>holy</mentioned> and <mentioned>consecrated</mentioned> are
						fairly close in meaning (and have high cosine similarity); the word
							<mentioned>shrine</mentioned> is more distant but still related enough
						to be interesting.</p>
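					<p>To make the measure concrete, here is a minimal Python sketch of cosine
						similarity (the workshop itself works in R, and the three-dimensional
						vectors below are invented stand-ins, not real values from the WWP
						model):</p>

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths:
    # 1.0 for identical directions, near 0.0 for unrelated ones
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented 3-D vectors standing in for real vectors from a trained model
sacred = [0.9, 0.8, 0.1]
holy = [0.85, 0.75, 0.2]
shrine = [0.4, 0.3, 0.8]

print(cosine_similarity(sacred, holy))    # high: close neighbors
print(cosine_similarity(sacred, shrine))  # lower: more distant but related
```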

				</tutorial>
			</section>

			<section>
				<head>Querying</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_querying.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>So what can we do with this information? We've created a model of our corpus
						(a representation that helps us see some aspect of that information more
						clearly and easily): how do we use it?</p>
					<p>The first thing we might try is just querying the model about the
						neighborhood of a word we're interested in: essentially, asking it questions
						about where specific words are located and what is around them: <list>
							<!--<item>if we were working in an <soCalled>under-the-hood</soCalled> environment where we had full control over everything (writing our own R code), we could design very complex queries</item>
            <item>the vector toolkit we're using for this workshop offers some simpler options, but enough to get started with</item>-->
							<item>this slide shows a simple example using the WWP's Women Writers
								Vector Toolkit, but in this workshop we will be working in the
								RStudio environment that we looked at in the last session, so we can
								design much more complex queries</item>
							<item>we can enter a search term, and get back a list of the words that
								are closest to it in vector space: that is, words that are probably
								semantically related to it, based on the way those words appear
								together in the corpus</item>
							<item>as we can see from this list, these aren't necessarily synonyms:
								there are many different ways words can be
									<mentioned>related</mentioned> as we will discover </item>
							<item>but they are words that tend to appear in the same contexts as our
								query term (in this case, discussions of families and familial roles
								and relationships)</item>
						</list>
					</p>

				</lectureNote>
				<tutorial>
					<p>So what can you do with this information? At this point, you have created a model
						of your corpus (a representation that will help you see some aspect of that
						information more clearly and easily): how do you use it?</p>
					<p>A good first step is simply to query the model about the neighborhood of
						a word you're interested in: essentially, asking it questions about where
						specific words are located and what is around them: <list>
							<!--<item>if we were working in an <soCalled>under-the-hood</soCalled> environment where we had full control over everything (writing our own R code), we could design very complex queries</item>
            <item>the vector toolkit we're using for this workshop offers some simpler options, but enough to get started with</item>-->
							<item>If you're using the WWP's Women Writers Vector Toolkit, you can
								perform some of these basic queries using the pre-loaded models, but
								in this walkthrough we will be working with the model we just built
								so that we can ask more complex questions</item>
							<item>we can enter a search term, and get back a list of the words that are
								closest to it in vector space: that is, words that are probably
								semantically related to it, based on the way those words appear together
								in the corpus</item>
							<item>From this list, you might notice that these aren't necessarily
								synonyms: there are many different ways words can be
								<mentioned>related</mentioned> as we will discover </item>
							<item>but they are words that tend to appear in the same contexts as our
								query term (in this case, discussions of families and familial roles and
								relationships)</item>
						</list>
					</p>
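					<p>Conceptually, a nearest-word query just ranks the vocabulary by cosine
						similarity to the query term. Here is a minimal Python sketch with a toy
						hand-made <soCalled>model</soCalled> (in the walkthrough these vectors come
						from the word2vec model you trained, with many more dimensions, and the
						words and numbers below are invented):</p>

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# A toy hand-made "model" mapping words to invented 3-D vectors
model = {
    "mother": [0.9, 0.7, 0.1],
    "daughter": [0.85, 0.75, 0.15],
    "sister": [0.8, 0.8, 0.2],
    "river": [0.1, 0.2, 0.9],
}

def nearest(model, query, n=2):
    # Rank every other word by cosine similarity to the query term
    others = [w for w in model if w != query]
    others.sort(key=lambda w: cosine_similarity(model[query], model[w]), reverse=True)
    return others[:n]

print(nearest(model, "mother"))  # the closest words in this toy space
```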
				</tutorial>
			</section>
			
			<section>
				<head>Clustering</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_clustering2.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>Another way we can interact with the model is to ask it more generally,
							<q>where are your semantically dense zones?</q> Or <q>please show me
							some clusters of related words!</q>
					</p>
					<p>This process is somewhat similar to topic modeling: <list>
							<item>it says <q>What if we divided up the corpus into X different
									zones? where are the centers of those zones, and what is nearest
									to those centers?</q></item>
							<item>just as in topic modeling where we say, in effect, <q>if our
									corpus has X number of topics, what would they be?</q></item>
							<item>or if we were looking at a map of a region, we might say <q>if we
									were going to build ten new Home Depot stores, where should they
									go so that most people have the shortest drive? and who lives in
									those regions?</q></item>
						</list></p>
					<p>To generate these clusters (as part of the initial model training process): <list>
							<item>the modeling program runs a clustering algorithm that randomly
								chooses a number of locations within the vector space—in this case,
								three (like throwing a set of three darts at the map)</item>
							<item>then, it makes a series of adjustments to those locations to move
								them closer to actual <soCalled>population centers</soCalled>,
								places where words are close to one another within the vector space </item>
							<item>if we kept up the adjustment process for a long time, the
								locations would eventually settle into place, giving us three
								stable semantic zones within the model</item>
						</list>
					</p>

				</lectureNote>
				<tutorial>
					<p>Another way we can interact with the model is to ask it more generally,
							<q>where are your semantically dense zones?</q> Or <q>please show me
							some clusters of related words!</q>
					</p>
					<p>This process is somewhat similar to topic modeling: <list>
							<item>it says <q>What if we divided up the corpus into X different
									zones? where are the centers of those zones, and what is nearest
									to those centers?</q></item>
							<item>just as in topic modeling where we say, in effect, <q>if our
									corpus has X number of topics, what would they be?</q></item>
							<item>or if we were looking at a map of a region, we might say <q>if we
									were going to build ten new Home Depot stores, where should they
									go so that most people have the shortest drive? and who lives in
									those regions?</q></item>
						</list></p>
					<p>To generate these clusters (as part of the initial model training process): <list>
							<item>the modeling program runs a clustering algorithm that randomly
								chooses a number of locations within the vector space—in this case,
								three (like throwing a set of three darts at the map)</item>
							<item>then, it makes a series of adjustments to those locations to move
								them closer to actual <soCalled>population centers</soCalled>,
								places where words are close to one another within the vector space </item>
							<item>if we kept up the adjustment process for a long time, the
								locations would eventually settle into place, giving us three
								stable semantic zones within the model</item>
						</list>
					</p>
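					<p>The dart-throwing-and-adjusting process described above belongs to the
						k-means family of clustering algorithms. Here is a minimal Python sketch in
						two dimensions (real models cluster in hundreds of dimensions, and real
						implementations add refinements; the points below are invented):</p>

```python
import random

def assign(points, centers):
    # Each point joins the cluster whose center is nearest
    clusters = [[] for _ in centers]
    for p in points:
        dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
        clusters[dists.index(min(dists))].append(p)
    return clusters

def recenter(clusters, centers):
    # Move each center to the average position of its assigned points
    new = []
    for cluster, old in zip(clusters, centers):
        if cluster:
            new.append((sum(p[0] for p in cluster) / len(cluster),
                        sum(p[1] for p in cluster) / len(cluster)))
        else:
            new.append(old)  # a dart that hit empty space stays where it is
    return new

random.seed(1)
# Six points forming two obvious "population centers" in a toy 2-D space
points = [(0.1, 0.1), (0.2, 0.0), (0.0, 0.2), (0.9, 1.0), (1.0, 0.9), (1.1, 1.1)]
centers = [(random.random(), random.random()) for _ in range(2)]  # two random darts
for _ in range(10):  # a fixed number of adjustment rounds
    centers = recenter(assign(points, centers), centers)
print(sorted(centers))  # the centers settle near the two dense zones
```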
				</tutorial>
			</section>

			<section>
				<head>Clusters: an example</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_clustering3.png"/>
					</figure>
				</slide>
				<lectureNote>
					<!--Again, if we were writing code ourselves we could exert some fine control over this process, but in the toolkit we have a simple version:-->

					<p>So what we get at the end of the process is clusters of words that are like
							<soCalled>neighborhoods</soCalled> within the vector space: densely
						populated areas where words are grouped together around a concept or a
						textual phenomenon. <list>
							<item>in principle, the number of clusters is up to us</item>
							<item>in our own model training for this workshop, since we are working
								directly with the R code, we can choose how many clusters we
								want</item>
							<item>the WWVT doesn't give you this option but the WWP chose 150 as a
								reasonable number </item>
							<item>for the Toolkit we stop the process after 40 adjustments, so the
								clusters will come out a bit different every time you reset them,
								but when running the code yourself in RStudio you can control that
								more precisely.</item>
							<item>in this example, for instance (which shows three of the 150
								clusters), there's a cluster that's roughly associated with
								religiously-oriented death ceremony, and another one that is
								old-fashioned cavalry-oriented warfare, but the one in the middle is
								harder to describe as a <soCalled>concept</soCalled>: it's more like
								the space of dialogue and spoken language markers</item>
						</list>
					</p>

					<p>
						<emph>Check the time and consider stopping here!</emph>
					</p>
				</lectureNote>
				<tutorial>
					<p>So what we get at the end of the process is clusters of words that are like
							<soCalled>neighborhoods</soCalled> within the vector space: densely
						populated areas where words are grouped together around a concept or a
						textual phenomenon. <list>
							<item>in principle, the number of clusters is up to us</item>
							<item>in your own model training, since you are working directly with
								the code, you can choose how many clusters you want by adjusting
								that number in the code itself</item>
							<item>if you are using the Toolkit, the WWVT doesn't give you this
								option but the WWP chose 150 as a reasonable number </item>
							<item>for the Toolkit we stop the process after 40 adjustments, so the
								clusters will come out a bit different every time you reset them,
								but when running the code yourself you can control that more
								precisely. You might adjust these numbers to trade off accuracy
								against training speed.</item>
							<item>in this example, for instance (which shows three of the 150
								clusters), there's a cluster that's roughly associated with
								religiously-oriented death ceremony, and another one that is
								old-fashioned cavalry-oriented warfare, but the one in the middle is
								harder to describe as a <soCalled>concept</soCalled>: it's more like
								the space of dialogue and spoken language markers</item>
						</list>
					</p>

				</tutorial>
			</section>

			<section>
				<head>Vector Math 1</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_vectorMath1a.png"/>

						<!--              <graphic height="600px" url="../../../_utils/gfx/w2v_vectorMath1.png"/>
-->
					</figure>
				</slide>
				<lectureNote>
					<p>One more thing we can do to explore the word information in our vector space
						model: we can examine the relationships between words, taking advantage of
						the fact that each word is represented as a vector, which is essentially a
						list of numbers we can do arithmetic with</p>
					<p>To understand how this works, we need to envision a little more clearly how
						words are positioned in this vector space model: <list>
							<item>during the training process, the word2vec algorithm is examining
								the text (looking through its little window at successive groupings
								of words)</item>
							<item>and with each observation, it adjusts the position of the words in
								the model to take into account the word associations it
								observes</item>
							<item>so for a word like <mentioned>bank</mentioned>, it might observe
								some instances where that word is associated with words like
									<mentioned>funds</mentioned> and <mentioned>revenue</mentioned>,
								and so it moves <mentioned>bank</mentioned> closer to those words:
								it adds information that makes an association between these
								words</item>
							<item>then maybe it observes some other instances where
									<mentioned>bank</mentioned> is associated with words like
									<mentioned>river</mentioned> and <mentioned>lake</mentioned> and
									<mentioned>Hudson</mentioned>, and it moves the word
									<mentioned>bank</mentioned> a little closer to those
								words</item>
							<item>so by the end, each word is positioned in vector space in a way
								that reflects its associations (some weaker and some stronger) with
								many of the other words in the corpus</item>
							<item>we can think of each association as being like a rubber band that
								pulls a pair of words together; each word is being pulled in
								multiple different directions, with different strengths of
								association, and its net position is the result of all of those
								pulls</item>
						</list>
					</p>
				</lectureNote>
				<tutorial>
					<p>One more thing we can do to explore the word information in our vector space
						model: we can examine the relationships between words, taking advantage of
						the fact that each word is represented as a vector, which is essentially a
						list of numbers we can do arithmetic with</p>
					<p>To understand how this works, we need to envision a little more clearly how
						words are positioned in this vector space model: <list>
							<item>during the training process, the word2vec algorithm is examining
								the text (looking through its little window at successive groupings
								of words)</item>
							<item>and with each observation, it adjusts the position of the words in
								the model to take into account the word associations it
								observes</item>
							<item>so for a word like <mentioned>bank</mentioned>, it might observe
								some instances where that word is associated with words like
									<mentioned>funds</mentioned> and <mentioned>revenue</mentioned>,
								and so it moves <mentioned>bank</mentioned> closer to those words:
								it adds information that makes an association between these
								words</item>
							<item>then maybe it observes some other instances where
									<mentioned>bank</mentioned> is associated with words like
									<mentioned>river</mentioned> and <mentioned>lake</mentioned> and
									<mentioned>Hudson</mentioned>, and it moves the word
									<mentioned>bank</mentioned> a little closer to those
								words</item>
							<item>so by the end, each word is positioned in vector space in a way
								that reflects its associations (some weaker and some stronger) with
								many of the other words in the corpus</item>
							<item>we can think of each association as being like a rubber band that
								pulls a pair of words together; each word is being pulled in
								multiple different directions, with different strengths of
								association, and its net position is the result of all of those
								pulls</item>
						</list>
					</p>
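					<p>The rubber-band picture can be sketched as a tiny update rule: nudge a
						word's vector a small step toward the words it was observed with. The
						Python sketch below is only an analogy for the real word2vec update (which
						works through a neural network's gradients), and the positions are
						invented:</p>

```python
def nudge(word_vec, context_vecs, rate=0.1):
    # Move the word a small step toward the average of its observed
    # context words, like a rubber band pulling it in their direction
    dims = len(word_vec)
    avg = [sum(v[i] for v in context_vecs) / len(context_vecs) for i in range(dims)]
    return [w + rate * (a - w) for w, a in zip(word_vec, avg)]

# Invented 2-D positions; real models use dozens or hundreds of dimensions
bank = [0.5, 0.5]
money_context = [[0.9, 0.1], [1.0, 0.2]]   # e.g. funds, revenue
water_context = [[0.1, 0.9], [0.0, 1.0]]   # e.g. river, lake

bank = nudge(bank, money_context)  # pulled toward the financial zone
bank = nudge(bank, water_context)  # then pulled toward the river zone
print(bank)  # its net position reflects both pulls
```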
				</tutorial>
			</section>
			<section>
				<head>Vector Math 2</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_vectorMath2.png"/>
					</figure>
				</slide>

				<lectureNote>
					<p>We can use this information to tease out more specific semantic spaces for
						individual words: <list>
							<item>For instance, we might imagine that the word
									<mentioned>grace</mentioned> has some rubber band pulling it
								towards a word like <mentioned>beauty</mentioned>. What if we cut
								that rubber band? What part of the semantic field might pull
									<mentioned>grace</mentioned> more into their orbit if
									<mentioned>beauty</mentioned> were out of the running? We can
								find this out by <soCalled>subtracting</soCalled> the vector for
									<mentioned>beauty</mentioned> from the vector for
									<mentioned>grace</mentioned>: the result is a set of
								associations that are specifically religious </item>
							<item>Similarly, instead of cutting that rubber band, we might intensify
								its strength and allow it to pull <mentioned>grace</mentioned>
								towards it more strongly (putting <mentioned>grace</mentioned> into
								a zone where its aesthetic associations are most powerful). We can
								do this by <soCalled>adding</soCalled> the vector for
									<mentioned>beauty</mentioned> to the vector for
									<mentioned>grace</mentioned>.</item>
						</list>
					</p>
					<p>Note that words here are just proxies or symptoms (imperfect ones) for the
						concepts we might be interested in: <list>
							<item>As we think about what words to <soCalled>add</soCalled> or
									<soCalled>subtract</soCalled>, it's important to think about how
								those words are related to the concept we're trying to examine (and
								it's worth trying different words)</item>
							<item>Also, the semantic associations of words are very corpus-specific:
								in a corpus of financial documents, the term
									<mentioned>grace</mentioned> might be exclusively associated
								with the <mentioned>grace period</mentioned> for bill payment</item>
							<item>So knowing our corpus is really crucial</item>
						</list>
					</p>
				</lectureNote>
				<tutorial>
					<p>We can use this information to tease out more specific semantic spaces for
						individual words: <list>
							<item>For instance, we might imagine that the word
									<mentioned>grace</mentioned> has some rubber band pulling it
								towards a word like <mentioned>beauty</mentioned>. What if we cut
								that rubber band? What part of the semantic field might pull
									<mentioned>grace</mentioned> more into their orbit if
									<mentioned>beauty</mentioned> were out of the running? We can
								find this out by <soCalled>subtracting</soCalled> the vector for
									<mentioned>beauty</mentioned> from the vector for
									<mentioned>grace</mentioned>: the result is a set of
								associations that are specifically religious </item>
							<item>Similarly, instead of cutting that rubber band, we might intensify
								its strength and allow it to pull <mentioned>grace</mentioned>
								towards it more strongly (putting <mentioned>grace</mentioned> into
								a zone where its aesthetic associations are most powerful). We can
								do this by <soCalled>adding</soCalled> the vector for
									<mentioned>beauty</mentioned> to the vector for
									<mentioned>grace</mentioned>.</item>
						</list>
					</p>
					<p>Note that words here are just proxies or symptoms (imperfect ones) for the
						concepts we might be interested in: <list>
							<item>As you think about what words to <soCalled>add</soCalled> or
									<soCalled>subtract</soCalled>, it's important to think about how
								those words are related to the concept you're trying to examine (and
								it's worth trying different words)</item>
							<item>Also, the semantic associations of words are very corpus-specific:
								in a corpus of financial documents, the term
									<mentioned>grace</mentioned> might be exclusively associated
								with the <mentioned>grace period</mentioned> for bill payment</item>
							<item>So knowing our corpus is really crucial</item>
						</list>
					</p>
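					<p>The <soCalled>adding</soCalled> and <soCalled>subtracting</soCalled> here
						is ordinary element-wise vector arithmetic. A minimal Python sketch with
						invented two-dimensional stand-ins (imagine the first dimension as
						aesthetic association and the second as religious association):</p>

```python
def add(a, b):
    # Element-wise vector addition
    return [x + y for x, y in zip(a, b)]

def subtract(a, b):
    # Element-wise vector subtraction
    return [x - y for x, y in zip(a, b)]

# Invented 2-D stand-ins: read dimension 0 as "aesthetic" association
# and dimension 1 as "religious" association
grace = [0.6, 0.7]
beauty = [0.9, 0.1]

print(subtract(grace, beauty))  # aesthetic pull cut: the religious side dominates
print(add(grace, beauty))       # aesthetic pull intensified: that side dominates
```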
				</tutorial>

			</section>

			<section>
				<head>Validation</head>
				<slide>
					<p>How do you know your model is telling you something meaningful? <list>
							<item>Do you get consistent results from model to model?</item>
							<item>Do you get plausible word groupings?</item>
							<item>Does <soCalled>vector math</soCalled> work as you would
								expect?</item></list>
					</p>
					<figure>
						<graphic url="../../../_utils/gfx/bad_model_fish.png" height="600px"/>
					</figure>
				</slide>
				<lectureNote>
					<p>As we use our model in these various ways, we're going to get some results
						(hopefully) that look very predictable, and some others that look
						provocative and fascinating, and maybe some others that look bizarre and
						unexpected. How can we tell the difference between an interpretive
						breakthrough and a glitch resulting from some terrible flaw in our training
						process?</p>
					<p>Once we've generated a model, there are ways we can and should test it to see
						whether it is actually a useful representation that will give research
						results we can use. That testing process is called <term>validation</term>.
						To validate a model, we can ask questions like these:</p>
					<p>Are your results consistent across models? <list><item>When you train a
								series of models on the same corpus using the same parameters, do
								you get consistent cosine similarities for the same sets of words? </item>
							<item>(Note: because training a model is a probabilistic process, you
								won’t get identical results from model to model, even if they’re
								trained on exactly the same corpus, but the results should be
								comparable.)</item></list></p>
					<p>Do you get plausible word groupings? <list>
							<item>When you generate groups of <soCalled>similar</soCalled> terms
								(either by generating clusters, or by querying a specific word), do
								you get plausibly related groups of words for common and moderately
								common query terms? (Common within your corpus, that is!) </item>
							<item>If you don’t get plausible groupings for moderately common words,
								this would be a sign to proceed with caution; if you don’t get
								plausible groupings for even common words, this would be a sign that
								the model may not be very useful (this might be because of small
								corpus size, or some other factor).</item>
						</list></p>
					<p>Does <soCalled>vector math</soCalled> work as you would expect? <list>
							<item>When you do the various forms of vector math (addition,
								subtraction) do your results continue to seem plausible? </item>
						</list>
					</p>
					<p>
						<emph>If we didn't stop before, consider stopping now!</emph>
					</p>
				</lectureNote>
				<tutorial>
					<p>As you use your model in these various ways, you're going to get some results
						that hopefully look very predictable, and some others that look provocative
						and fascinating, and maybe some others that look bizarre and unexpected.
						With all these interesting results, how can we tell the difference between
						an interpretive breakthrough and a glitch resulting from some terrible flaw
						in our training process?</p>
					<p>Once we've generated a model, there are ways we can and should test it to see
						whether it is actually a useful representation that will give research
						results we can use. That testing process is called <term>validation</term>.
						To validate a model, we can ask questions like these:</p>
					<p>Are your results consistent across models? <list><item>When you train a
								series of models on the same corpus using the same parameters, do
								you get consistent cosine similarities for the same sets of words? </item>
							<item>It is important to note that because training a model is a
								probabilistic process, you won’t get identical results from model to
								model, even if they’re trained on exactly the same corpus, but the
								results should be comparable.</item></list></p>
					<p>Do you get plausible word groupings? <list>
							<item>When you generate groups of <soCalled>similar</soCalled> terms
								(either by generating clusters, or by querying a specific word), do
								you get plausibly related groups of words for common and moderately
								common query terms? (Common within your corpus, that is!) </item>
							<item>If you don’t get plausible groupings for moderately common words,
								this would be a sign to proceed with caution; if you don’t get
								plausible groupings for even common words, this would be a sign that
								the model may not be very useful (this might be because of small
								corpus size, or some other factor).</item>
						</list></p>
					<p>Does <soCalled>vector math</soCalled> work as you would expect? <list>
							<item>When you do the various forms of vector math (addition,
								subtraction) do your results continue to seem plausible? </item>
						</list>
					</p>
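					<p>A simple consistency check can be sketched in Python: compute the cosine
						similarity of the same word pair in two separately trained models and
						compare. The vectors below are invented; with a real pair of models you
						would loop over many word pairs:</p>

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Invented vectors for the same word pair from two separately trained models:
# the raw numbers differ (training is probabilistic), but the relationship
# between the words should be comparable
model_a = {"sacred": [0.9, 0.2], "holy": [0.8, 0.3]}
model_b = {"sacred": [0.2, 0.9], "holy": [0.3, 0.85]}

sim_a = cosine_similarity(model_a["sacred"], model_a["holy"])
sim_b = cosine_similarity(model_b["sacred"], model_b["holy"])
print(abs(sim_a - sim_b))  # a small gap is a good sign of consistency
```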
				</tutorial>
			</section>




			<section>
				<head>Circling back: another look at vectors</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_vector.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>Now that we've worked through the basic concepts, let's circle back and
						consider the whole picture of <soCalled>word vectors</soCalled> or
							<soCalled>word embedding models</soCalled>, and introduce a few
						additional complexities.</p>
					<p>[if starting the day here, check in and see if people want to recap
						anything]</p>
					<p>A quick review: we've already noted that a vector is basically a line that
						has both a specific length and a specific direction or orientation in space:<list>
							<item>so here again in this example, the vector is a line that starts at
								the origin (the point where all three axes are at zero) and extends
								out to the point in space where the x axis is at 3, the y axis is at
								2, and the z axis is at zero</item>
							<item>we can think of those three axes as representing three pieces of
								information: together, they constitute a unique vector in
								three-dimensional space. </item>
							<item>I'm going to pause here for a moment and let the diagram sink in a
								bit more, because at this stage in the explanation, it helps to have
								a sense of what the diagram is telling us. Does anyone want to test
								out their understanding of how those three axes (the x, y, and z)
								are contributing information to the direction and distance of that
								vector? Does everyone see how the blue number 3 comes from the blue
								x axis? etc.?</item>
						</list>
					</p>

				</lectureNote>
				<tutorial>
					<p>Now that we've worked through the basic concepts, let's circle back and
						consider the whole picture of <soCalled>word vectors</soCalled> or
							<soCalled>word embedding models</soCalled>, and introduce a few
						additional complexities.</p>
					<p>So far, we've already noted that a vector is basically a line that has both a
						specific length and a specific direction or orientation in space:<list>
							<item>so here again in this example, the vector is a line that starts at
								the origin (the point where all three axes are at zero) and extends
								out to the point in space where the x axis is at 3, the y axis is at
								2, and the z axis is at zero</item>
							<item>we can think of those three axes as representing three pieces of
								information: together, they constitute a unique vector in
								three-dimensional space. </item>
							<item>Pause here and let the diagram sink in a bit more, because at this
								stage in the explanation, it helps to have a sense of what the
								diagram is telling us. Try and test out your understanding of how
								those three axes (the x, y, and z) are contributing information to
								the direction and distance of that vector. Do you see how the blue
								number 3 comes from the blue x axis? etc.?</item>
						</list>
					</p>
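					<p>As a quick check on the diagram, the vector's length follows from its three
						coordinates by the Pythagorean formula, as in this short Python sketch:</p>

```python
import math

# The vector from the slide: x = 3, y = 2, z = 0
v = (3, 2, 0)

# Its length is the straight-line distance from the origin to that point,
# by the three-dimensional Pythagorean formula
length = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
print(length)  # about 3.61
```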
				</tutorial>

			</section>

			<section>
				<head>Words as vectors</head>
				<slide>
					<figure>
						<graphic height="400px" url="../../../_utils/gfx/w2v_vector_matrices.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>The example we were just looking at shows a vector defined by three
						dimensions: three different numbers representing three different axes of
						meaning. However, when we're working with word embedding models, we are
						working with vectors that are defined by many more dimensions. So in order
						to understand that scenario, we need to get a little more comfortable with
						two ideas: <list>
							<item>a vector is just an assemblage of dimensions</item>
							<item>each dimension represents an association that has been
								observed</item>
						</list>
					</p>
					<p>So let's take the first example on this slide (the idea may look familiar if
						you read the Jay Alammar <title>Illustrated Introduction to Word
							Embeddings</title>): <list>
							<item>Our little chart here shows three people (Jo, Lee, Robin) and for
								each person it shows an assemblage of dimensions</item>
							<item>each dimension represents an association</item>
							<item>in this case, that association has to do with the person's
								affinity for specific animals: perhaps through observations of how
								many pets of each type they have, or their response when they
								encounter the animal in the wild</item>
							<item>So each person in this chart is represented by a vector with five
								dimensions: a line in five-dimensional space</item>
							<item>if we want to compare two people and find out whether they tend to
								like the same animals, impressionistically we can say that people
								with <soCalled>high</soCalled> or <soCalled>low</soCalled> affinity
								for the same animals are similar: the color coding is highlighting
								this pattern.</item>
							<item>But if we want a quantitative way to talk about that similarity,
								we can use the measure called <term>cosine similarity</term>, which
								is a way of measuring the angle between two lines (strictly, the
								cosine of that angle: close to 1 when the lines point in nearly the
								same direction). </item>
							<item>Here, we're doing exactly the same thing, except that our
									<soCalled>lines</soCalled> are defined by five dimensions
								instead of three.</item>
							<item>the calculation isn't hard (you can find an Excel version on the
								web!) and what it shows us is that the cosine similarity between our
								two mammal-lovers is very high, whereas the similarity between the
								mammal-lovers and the person who prefers birds/lizards/beetles is
								quite low.</item>
						</list>
					</p>
					<p>Pause for questions and reflection!</p>
					<p>So, taking this a step further, let's look at the chart on the right: <list>
							<item>Here, instead of looking at people and their association with
								animals, we're looking at words and their association with other
								words</item>
							<item>those <soCalled>other words</soCalled> have been observed in
								proximity (in the <term>window</term>) with our <term>target
									word</term>, to a greater or lesser extent</item>
							<item>we're not giving numbers here, but imagine that the green boxes
								are the ones that were observed more often, and the orange boxes are
								the words that were observed less often, and maybe the
								greenish-yellow boxes were somewhere in the middle</item>
						</list>
					</p>
					<p>So what do we see when we look at the righthand chart? <list>
							<item>What kind of cosine similarity would we expect to find between
									<mentioned>danger</mentioned> and <mentioned>peril</mentioned>?
								A high or low similarity?</item>
							<item>How about between <mentioned>danger</mentioned> and
									<mentioned>horses</mentioned>? <mentioned>Horses</mentioned> and
									<mentioned>goats</mentioned>?</item>
						</list>
					</p>
					<p>A few interesting things to note: <list>
							<item>all of the words are contributing information to each of the
								vectors, even when the actual association observed is low (I'll come
								back to this in a minute)</item>
							<item>and in fact the chart goes way off the edge of the screen to the
								right: there could in principle be hundreds of words contributing to
								the distinctive vector that is <mentioned>danger</mentioned></item>
						</list>
					</p>
				</lectureNote>
				<tutorial>
					<p>The previous example shows a vector defined by three dimensions: three
						different numbers representing three different axes of meaning. However,
						when we're working with word embedding models, we are working with vectors
						that are defined by many more dimensions. So in order to understand what
						that all means, we need to get a little more comfortable with two ideas: <list>
							<item>a vector is just an assemblage of dimensions</item>
							<item>each dimension represents an association that has been
								observed</item>
						</list>
					</p>
					<p>So let's take the first example shown here (the idea may look familiar if you
						read the Jay Alammar <title>Illustrated Introduction to Word
							Embeddings</title>): <list>
							<item>The chart shows three people (Jo, Lee, Robin) and for each person
								it shows an assemblage of dimensions</item>
							<item>each dimension represents an association</item>
							<item>in this case, that association has to do with the person's
								affinity for specific animals: perhaps through observations of how
								many pets of each type they have, or their response when they
								encounter the animal in the wild</item>
							<item>So each person in the chart is represented by a vector with five
								dimensions: a line in five-dimensional space</item>
							<item>if we want to compare two people and find out whether they tend to
								like the same animals, impressionistically we can say that people
								with <soCalled>high</soCalled> or <soCalled>low</soCalled> affinity
								for the same animals are similar. The chart is color-coded to
								highlight this pattern.</item>
							<item>But if we want a quantitative way to talk about that similarity,
								we can use the measure called <term>cosine similarity</term>, which
								is a way of measuring the angle between two lines (strictly, the
								cosine of that angle: close to 1 when the lines point in nearly the
								same direction). </item>
							<item>Here, we're doing exactly the same thing, except that our
									<soCalled>lines</soCalled> are defined by five dimensions
								instead of three.</item>
							<item>the calculation isn't hard (you can even find an Excel version on
								the web), and what it shows us is that the cosine similarity
								between our two mammal-lovers is very high, whereas the similarity
								between the mammal-lovers and the person who prefers
								birds/lizards/beetles is quite low.</item>
						</list>
					</p>
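					<p>To make that calculation concrete, here is a minimal Python sketch of cosine
						similarity. The affinity numbers are invented for illustration (the chart
						on the slide is color-coded rather than numeric), but the formula is the
						standard one: the dot product of two vectors divided by the product of
						their lengths.</p>

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical affinity scores for five animals; each person is a
# vector with five dimensions, one per animal.
jo    = [0.9, 0.8, 0.9, 0.1, 0.2]   # mammal-lover
lee   = [0.8, 0.9, 0.8, 0.2, 0.1]   # mammal-lover
robin = [0.1, 0.2, 0.1, 0.9, 0.8]   # prefers birds and lizards

print(cosine_similarity(jo, lee))    # very high: close to 1
print(cosine_similarity(jo, robin))  # much lower
```

					<p>Word vectors work exactly the same way: only the number of dimensions
						changes.</p>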

					<p>So, taking this a step further, look at the chart on the right side: <list>
							<item>Here, instead of looking at people and their association with
								animals, we're looking at words and their association with other
								words</item>
							<item>those <soCalled>other words</soCalled> have been observed in
								proximity (in the <term>window</term>) with our <term>target
									word</term>, to a greater or lesser extent</item>
							<item>we're not giving numbers here, but imagine that the green boxes
								are the ones that were observed more often, and the orange boxes are
								the words that were observed less often, and maybe the
								greenish-yellow boxes were somewhere in the middle</item>
						</list>
					</p>
					<p>So what do we see when we look at the righthand chart? <list>
							<item>What kind of cosine similarity would we expect to find between
									<mentioned>danger</mentioned> and <mentioned>peril</mentioned>?
								A high or low similarity?</item>
							<item>How about between <mentioned>danger</mentioned> and
									<mentioned>horses</mentioned>? <mentioned>Horses</mentioned> and
									<mentioned>goats</mentioned>?</item>
						</list>
					</p>
					<p>A few interesting things to note: <list>
							<item>all of the words are contributing information to each of the
								vectors, even when the actual association observed is low (I'll come
								back to this in a minute)</item>
							<item>and in fact the chart goes way off the edge of the screen to the
								right: there could in principle be hundreds of words contributing to
								the distinctive vector that is <mentioned>danger</mentioned></item>
						</list>
					</p>
				</tutorial>
			</section>
			<section>
				<head>Negative Sampling</head>
				<slide>
					<figure>
						<graphic height="400px"
							url="../../../_utils/gfx/w2v_vector_matrices_embedding.png"/>
					</figure>
				</slide>

				<lectureNote>
					<p>So let's now add another concept. Cast your mind back to the little bookworm
						eating through the corpus, making observations about the words that are
							<soCalled>near</soCalled> the target word, and adjusting the position of
						the words within the model. The information about those words that it
						observes is being fed into our little chart here. But how about the words
						that <emph>aren't</emph> being observed? </p>
					<p>We mentioned earlier that these are also significant. When the bookworm takes
						a bite, there are a huge number of words that are not in that sample, and
						the model training process could (in principle) use that information to
						adjust all of the words in the corpus, moving them <emph>away</emph> from
						the target word. <emph>In practice</emph>, it doesn't adjust all of the
						words (since that would be too much work) but it adjusts <emph>some</emph>
						of the words: a random sample. This is called <term>negative
						sampling</term>, and it is one of the parameters we can adjust: we can say
						how many of these non-appearing words should have their positions updated
						with each observation. If we have a large negative sampling value, the model
						training will be more precise, but the training process will take a lot
						longer. </p>
					<p>Looking again at our chart: If time and computing power were no object, we
						could imagine the chart extending off to the right so that every word in the
						corpus is listed, and we could imagine the position of every word in the
						model being adjusted with each observation, so that both the positive and
						negative sampling information would be fully reflected in the model. We
						could think of this situation as a kind of <soCalled>perfect</soCalled>
						model: <list>
							<item>showing all words exerting some probabilistic influence on each
								other</item>
							<item>in terms of text prediction, all words have <emph>some</emph>
								probability of being <soCalled>the next word</soCalled> even if that
								probability is very, very low</item>
							<item>in this <soCalled>perfect</soCalled> model, the vector for each
								word has as many dimensions as there are words in the corpus</item>
						</list>
					</p>
					<p>Let's test this idea a little further: <list><item>imagine that the window is
								the size of the corpus: now all words are related to all other words
								equally! Let that sink in for a moment: our understanding of the
									<soCalled>relatedness</soCalled> of words is strongly determined
								by our observational parameters: it isn't intrinsic, it's something
								we control.</item>
							<item>And in fact in some forms of unsupervised modeling, such as topic
								modeling, the window is in effect the entire document: the training
								process asks <said>which words appear in the same
								document?</said></item>
							<item>But in word embedding models, our concept of relatedness is a bit
								more precise than this: we are interested in things that are
								happening more at the sentence or phrase level, where the
								association between words reflects the way writers are actually
								articulating specific ideas</item>
						</list>
					</p>
					<p>One more look at our <soCalled>perfect</soCalled> model: <list>
							<item>note that it contains a lot of empty space: places where we are
								noticing that in fact the word <mentioned>toothbrush</mentioned> is
								not related to the words <mentioned>danger</mentioned>,
									<mentioned>horses</mentioned>, etc.</item>
							<item>without getting too far into the weeds, it turns out that this
								empty space is a problem: largely because it makes the data set
								very, very large.</item></list>
					</p>
					<p>So what do we do about that?</p>
				</lectureNote>

				<tutorial>
					<p>So let's now add another concept. Imagine the model training process to be a
						little bookworm eating through the corpus, making observations about the
						words that are <soCalled>near</soCalled> the target word, and adjusting the
						position of the words within the model. The information about those words
						that it observes is being fed into our chart here. But how about the words
						that <emph>aren't</emph> being observed? </p>
					<p>These words are also significant. When the bookworm takes a bite, there are a
						huge number of words that are not in that sample, and the model training
						process could (in principle) use that information to adjust all of the words
						in the corpus, moving them <emph>away</emph> from the target word. <emph>In
							practice</emph>, it doesn't adjust all of the words (since that would be
						too much work) but it adjusts <emph>some</emph> of the words: a random
						sample. This is called <term>negative sampling</term>, and it is one of the
						training parameters we can adjust: we can say how many of these
						non-appearing words should have their positions updated with each
						observation. If we have a large negative sampling value, the model training
						will be more precise, but the training process will take a lot longer. </p>
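					<p>The adjustment itself can be sketched in a few lines. This is a simplified
						illustration of the skip-gram-with-negative-sampling update, with invented
						3-dimensional vectors; real word2vec training also updates the target
						vector and uses many more dimensions, but the push/pull logic is the
						same.</p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_update(target, context, negatives, lr=0.1):
    """One negative-sampling step (simplified sketch).

    Pull the observed context word toward the target, and push each
    sampled 'negative' word away from it. (Real word2vec also
    adjusts the target vector itself.)"""
    # Observed (positive) pair: the label is 1
    g = lr * (1.0 - sigmoid(dot(target, context)))
    for i in range(len(target)):
        context[i] += g * target[i]
    # Sampled negative words: the label is 0
    for neg in negatives:
        g = lr * (0.0 - sigmoid(dot(target, neg)))
        for i in range(len(target)):
            neg[i] += g * target[i]

# Tiny invented vectors for illustration
target    = [0.5, 0.1, 0.3]
context   = [0.4, 0.2, 0.1]    # a word observed in the window
negatives = [[0.3, 0.3, 0.3]]  # one word standing in for the random sample

before_pos = dot(target, context)
before_neg = dot(target, negatives[0])
sgns_update(target, context, negatives)
print(dot(target, context) > before_pos)        # True: pulled closer
print(dot(target, negatives[0]) < before_neg)   # True: pushed away
```

					<p>A larger negative sampling value just means a longer
						<code>negatives</code> list at every step: more work per observation, but
						more information in the model.</p>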
					<p>Looking again at our chart, if time and computing power were no object, we
						could imagine the chart extending off to the right so that every word in the
						corpus is listed, and we could imagine the position of every word in the
						model being adjusted with each observation, so that both the positive and
						negative sampling information would be fully reflected in the model. We
						could think of this situation as a kind of <soCalled>perfect</soCalled>
						model: <list>
							<item>showing all words exerting some probabilistic influence on each
								other</item>
							<item>in terms of text prediction, all words have <emph>some</emph>
								probability of being <soCalled>the next word</soCalled> even if that
								probability is very, very low</item>
							<item>in this <soCalled>perfect</soCalled> model, the vector for each
								word has as many dimensions as there are words in the corpus</item>
						</list>
					</p>
					<p>To test this idea a little further: <list><item>imagine that the window is
								the size of the corpus: now all words are related to all other words
								equally! Let that sink in for a moment: our understanding of the
									<soCalled>relatedness</soCalled> of words is strongly determined
								by our observational parameters: it isn't intrinsic, it's something
								we control.</item>
							<item>And in fact in some forms of unsupervised modeling, such as topic
								modeling, the window is in effect the entire document: the training
								process asks <said>which words appear in the same
								document?</said></item>
							<item>But in word embedding models, our concept of relatedness is a bit
								more precise than this: we are interested in things that are
								happening more at the sentence or phrase level, where the
								association between words reflects the way writers are actually
								articulating specific ideas</item>
						</list>
					</p>
					<p>One more look at our <soCalled>perfect</soCalled> model: <list>
							<item>note that it contains a lot of empty space: places where we are
								noticing that in fact the word <mentioned>toothbrush</mentioned> is
								not related to the words <mentioned>danger</mentioned>,
									<mentioned>horses</mentioned>, etc.</item>
							<item>without getting too far into the weeds, it turns out that this
								empty space is a problem: largely because it makes the data set
								very, very large.</item></list>
					</p>
					<p>So what do we do about that?</p>
				</tutorial>
			</section>
			<section>
				<head>Embedding!</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_embedding.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>To make the model more compact, and hence easier to process while you wait,
						clever people developed a technique called <term>embedding</term> which
						flattens the model: reducing it from a very large number of dimensions
						(like, thousands) to a somewhat smaller number of dimensions (like,
						hundreds).</p>
					<p>For those of you who may have read Edwin Abbott's <title>Flatland</title>,
						you might remember how when a sphere visits Flatland, the two-dimensional
						creatures there see it as a circle: a three-dimensional entity
							<term>flattened</term> or <term>projected</term> onto two dimensions.
						Something similar sometimes happens to Wile E. Coyote.</p>
					<p>We are not going to cover the mathematics, but we will look at a few of its
						effects and results. </p>
					<p>In simple terms: <list>
							<item>in our <soCalled>perfect</soCalled> model, remember that the
								position of each word in the model is a vector, and that vector is
								essentially a complicated multidimensional number</item>
							<item>each of the <term>dimensions</term> of that number is another word
								in the corpus, one number for each word (even the
									<soCalled>unrelated</soCalled> words)</item>
							<item>in the <soCalled>flattened</soCalled> version, that is no longer
								true: the position of each word is still a vector, but that vector's
								dimensions are no longer individual words, and the number of
								dimensions is no longer the total vocabulary of the corpus. </item>
							<item>instead, we choose the number of dimensions, as one of the
								parameters for the training process </item>
							<item>the <term>embedding</term> process then compresses the model down
								to that number of dimensions, and reduces the empty space of the
								unrelated words.</item></list>
					</p>
					<p>So by specifying the number of dimensions, we are in effect specifying how
						many other words each word's position takes into account: <list>
							<item>if we choose a very low number of dimensions, the model will have
								very little information about the word relationships within our
								corpus</item>
							<item>if we choose a very high number of dimensions, the model will have
								a lot of information about the word relationships in the
								corpus</item>
							<item>however, the <soCalled>sweet spot</soCalled> is also going to
								depend on the total size and total vocabulary of the corpus: for a
								corpus with a tiny vocabulary, a large number of dimensions may not
								be very useful.</item>
						</list>
					</p>
					<p>I'm afraid there's a little <soCalled>magic happens here</soCalled> at this
						stage: the mathematical details are a little out of scope for this
						institute, but there are some good sources in the readings for those who
						want to understand this more fully.</p>
				</lectureNote>

				<tutorial>
					<p>To make the model more compact, and hence easier to process while you wait,
						clever people developed a technique called <term>embedding</term> which
						flattens the model: reducing it from a very large number of dimensions
						(like, thousands) to a somewhat smaller number of dimensions (like,
						hundreds).</p>
					<p>If you have read Edwin Abbott's <title>Flatland</title>, you might remember
						how when a sphere visits Flatland, the two-dimensional creatures there see
						it as a circle: a three-dimensional entity <term>flattened</term> or
							<term>projected</term> onto two dimensions. Something similar sometimes
						happens to Wile E. Coyote.</p>
					<p>We are not going to cover the mathematics, but we will look at a few of its
						effects and results. </p>
					<p>In simple terms: <list>
							<item>in our <soCalled>perfect</soCalled> model, the position of each
								word in the model is a vector, and that vector is essentially a
								complicated multidimensional number</item>
							<item>each of the <term>dimensions</term> of that number is another word
								in the corpus, one number for each word (even the
									<soCalled>unrelated</soCalled> words)</item>
							<item>in the <soCalled>flattened</soCalled> version, that is no longer
								true: the position of each word is still a vector, but that vector's
								dimensions are no longer individual words, and the number of
								dimensions is no longer the total vocabulary of the corpus. </item>
							<item>instead, we choose the number of dimensions, as one of the
								parameters for the training process </item>
							<item>the <term>embedding</term> process then compresses the model down
								to that number of dimensions, and reduces the empty space of the
								unrelated words.</item></list>
					</p>
					<p>So by specifying the number of dimensions, we are in effect specifying how
						many other words each word's position takes into account: <list>
							<item>if we choose a very low number of dimensions, the model will have
								very little information about the word relationships within our
								corpus</item>
							<item>if we choose a very high number of dimensions, the model will have
								a lot of information about the word relationships in the
								corpus</item>
							<item>however, the <soCalled>sweet spot</soCalled> is also going to
								depend on the total size and total vocabulary of the corpus: for a
								corpus with a tiny vocabulary, a large number of dimensions may not
								be very useful.</item>
						</list>
					</p>
					<p>There's a little <soCalled>magic happens here</soCalled> at this stage: the
						mathematical details are a little out of scope for this institute, but
						there are some good sources in the suggested readings if you want to
						understand this more fully.</p>

				</tutorial>
			</section>

			<!--
      <section>
        <head>Words and dimensions</head>
        <slide>
            <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_vector_words.png"/>
            </figure>
        </slide>
        <lectureNote>
          <p>Each of those pieces of information, that contribute to the precise direction of the vector, is coming from a word in the corpus. Putting this a different way: when we train a model based on a corpus of words, each word contributes a dimension (an informational axis) to the location of all the other words in the corpus.</p>
          <p>So in this version of the diagram, instead of just having an "x" axis, we have an axis that is showing us how much of the "bank" vector's length and direction is contributed by the word <mentioned>funds</mentioned>. And instead of a y axis, we have an axis that reflects the influence of the word <mentioned>river</mentioned>. And instead of a z axis, we have an axis that reflects the lack of influence of the word <mentioned>toothbrush</mentioned>.</p>
          <p>Note that for purposes of this slide, we are talking about the <soCalled>perfect</soCalled>, non-flattened model where each word contributes a distinct dimension, not the <soCalled>flattened</soCalled> version.</p>
          <p>Let's pause and let that sink in:
          <list>
            <item>during the training process, as the little bookworm is eating its way through the corpus, over and over, looking at the window of words, it is gathering information about word relationships</item>
            <item>each word is potentially related to all of the other words in the corpus (maybe only very slightly, or not at all in the case of <mentioned>bank</mentioned> and <mentioned>toothbrush</mentioned>)</item>
            <item>each word contributes information about the relative location of all of the other words</item>
            <item>in vector space, that information takes the form of a dimension, just like the dimensions we're looking at on this slide</item>
            <item>so in this diagram (representing the location of each vector using three dimensions) is drawing information from a corpus with very, very few words</item>
          </list>
          </p>
        </lectureNote>
      </section>
      <section>
  <head>Way more dimensions...</head>
  <slide>
            <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_vector_higher_dimensions.png"/>
            </figure>
        </slide>
  <lectureNote>
    <p>With all that in mind, let's try picturing a higher-dimensional reality....</p>
    <p>With a real-world corpus, our word vectors are defined with <emph>way more</emph> than two or three dimensions, although this gets very difficult to draw and to visualize in our minds</p>
    <p>But let's try to imagine it:
      <list>
        <item>if we think of a vector (approximately) as a line that connects two points</item>
        <item>and if we think of each of those points as a piece of information that comes from a dimensional axis (that is, like one of our familiar three "dimensions" in geometry)</item>
        <item>then we can imagine, sort of, having more than three of these axes to work with: we would just have a lot more information about that line</item>
        <item>instead of specifying its path in 2- or 3-dimensional space, we'd be specifying it in higher-dimensional space</item>
        <item>and we would need a lot more coordinates</item>
      </list>
    </p>
       
      <p>In word embedding models, we might have hundreds of dimensions:
      
      <list>
        <item>each of those dimensions is one of those rubber bands we saw a moment ago, helping to determine the word's position in vector space</item>    
      <item>so a trained word embedding model represents a textual corpus as a huge multidimensional cloud of words</item>
      <item>each word represents a distinct vector</item>
      <item>the location of each word within that cloud is based on the words it tends to appear with in the corpus (so, words that tend to appear in the same contexts in the corpus will be near one another in vector space)</item>
    </list>
    
    </p>
  

          <p>At this stage we can also come back to look with more expert eyes at the parameters that we talked about earlier: the settings we can control as part of the model training process.</p>
          <p>Two parameters we're already familiar with:
          <list>
            <item>Window: the number of words on either side of our <term>target word</term> that are considered <soCalled>nearby</soCalled> or <soCalled>related</soCalled> to the target word</item>
            <item>Iterations: the number of times we run through the corpus during the model training (the number of times the <term>window</term> gets passed through the text)</item>
          </list>
          </p>
          <p>We can now add another: we can control the number of dimensions in our model. What does this really mean?
            <list>
              <item>We know that each word in our model is described by a vector that locates it in the vector space</item>
              <item>And we just talked about how that vector is based on the words that tend to appear nearby</item>
              <item>In fact, the numeric components of that vector (the distance travelled in each dimension of the vector space) are information about other words</item>
              <item>so by specifying the number of dimensions, we are in effect specifying how many other words each word's position takes into account</item>
              <item>if we choose a very low number of dimensions, the model will have very little information about the word relationships within our corpus</item>
              <item>if we choose a very high number of dimensions, the model will have a lot of information about the word relationships in the corpus</item>
              <item>however, the <soCalled>sweet spot</soCalled> is also going to depend on the total size and total vocabulary of the corpus: for a corpus with a tiny vocabulary, a large number of dimensions may not be very useful.</item>
            </list>
          </p>
        </lectureNote>
      </section>
      
          <section>
      <head>Negative sampling</head>
        <slide>
            <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_negative_sampling.png"/>
            </figure>
        </slide>
      <lectureNote>
        <p>Finally we come to the most abstruse parameter of all: negative sampling. To understand what this is, we need first to remind ourselves about the model training process:
          <list>
            <item>during the training process, each time we go through the corpus, with each observation we make we get more information about the relative positions of all the words in the corpus</item>
            <item>so with each iteration, our model becomes more and more accurate</item>
            <item>however, given that there are thousands of words represented in the model, if we update the information we have about every single word every time we go through the corpus, it's going to take a very long time to train the model, and our computer is going to have to work very hard</item></list></p>
        <p>So negative sampling is a way to reduce that work:
          <list>
            <item>instead of updating our information on every word in the corpus, we only update the words we directly observe within the window (so looking at this slide, which words would those be?)</item>
            <item>plus a random sampling of the other words in the model (in this slide: <mentioned>revolution</mentioned>, <mentioned>elderly</mentioned>, <mentioned>pinnacle)</mentioned></item>
            <item>the negative sampling parameter specifies how many random words to update with each observation</item>
          </list>
        </p>
        <p>Any ideas about what effect a large negative sampling value would have on our model and on the training process?</p>
        
      </lectureNote>
    </section> 
            <section>
        <head>Embedding (semi-technical)</head>
        <slide>
            <p>Embedding:
            <list>
              <item>A way of reducing the dimensionality of our vector space model</item>
              <item>Makes it more dense</item>
              <item>Makes it faster to process</item>
            </list>
            
            </p>
        </slide>
        <lectureNote>
          <p>We're now ready to explore one term we haven't defined yet: <term>embedding</term>, which is a curious term, especially in the context of the phrase <q>word embedding models</q>. To explain embedding I can offer a semi-technical view, and then a metaphorical view, and you can decide which works better.</p>
          <p>For the semi-technical view, we need to remember a few things:
          <list>
            <item>a <term>word vector</term> is a way of representing a single feature of a corpus: the behavior of a single word</item>
            <item>and a vector represents a word in a corpus as a set of dimensions; each dimension is a possible piece of information that narrows down the location of that word in vector space</item>
            <item>by default, without <soCalled>embedding</soCalled>, there would be as many dimensions as there are vocabulary terms in the corpus: potentially thousands</item>
            <item>each added dimension multiplies the time needed to do computations on the model; computers are fast, but not that fast</item>
            <item>and also, for math reasons I won't go into here, it turns out that those dimensions are mostly empty space anyway: metaphorically, most words actually don't have anything to do with each other!</item>
          </list>
          </p>
          <p>So <soCalled>embedding</soCalled> is a way of reducing the number of dimensions we're working with: embedding some of the dimensions in each other and eliminating the empty space.</p>
        </lectureNote>
      </section>
      
            <section>
        <head>Embedding (metaphorical)</head>
        <slide>
            <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_embedding.png"/>
            </figure>
        </slide>
        <lectureNote>
          <p>Metaphorically, we can imagine embedding as being sort of like flattening, or like a projection:
          <list>
            <item>For anyone who has read Edwin Abbott's <title>Flatland</title>, when a two-dimensional creature like a square sees a sphere, it sees it as a circle: that is, the sphere projected onto a 2-dimensional surface</item>
            <item>Or for fans of more modern culture, when Wile E. Coyote gets flattened by a falling object, all of his three-dimensional features get projected onto a single two-dimensional plane, making him denser and easier to process</item>
          </list>
          
          </p>
          <p>Questions at this stage?</p>
        </lectureNote>
      </section>
-->
			<!-- 
 
      <section>
        <head>Skip-gram, continuous bag-of-words (CBOW)</head>
        <slide>
             <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_cbow_skipgram1.png"/>
            </figure>
        </slide>
        <lectureNote>
          <p>This is where we really start to come face to face with the <soCalled>machine learning</soCalled> aspects of word embedding models: these two terms (and their difference) expose for our view some of the specifics of how the machine <soCalled>learns</soCalled> the model.</p>
          <list><item>Both of these terms refer to the process by which the training process examines the text (through the <term>window</term> we talked about earlier) and specifically what the computer does with the information it sees in that window.</item>
          <item>Each one approaches that process a little differently</item>
            <item>The difference between the two is not really important for us to master (it won't affect our analysis significantly)</item>
            <item>But it may help us see into that training process a little further, so let's see how it goes</item>
          </list>
          </lectureNote></section>
      
      <section>
        <head>Common ground</head>
         <slide>
             <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_cbow_skipgram2.png"/>
            </figure>
        </slide>
        <lectureNote>
        <p>Let's focus first on what these two approaches have in common:
        <list>
          <item>In both cases, as we discussed a moment ago, when we are training a model (creating a model based on a text corpus), the training process works its way through the corpus examining each word in the corpus and deciding where to locate it in vector space, based on the words around it</item>
          <item>In both cases, <soCalled>the words around it</soCalled> are determined by size of the <term>window</term> that we set</item>
        </list>
        </p>
          </lectureNote>
        </section>
      <section>
        <head>Continuous bag-of-words</head>
         <slide>
             <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_cbow_skipgram3.png"/>
            </figure>
        </slide>
        <lectureNote>
        <p>So far so good. The differences between the two are a little trickier:
        <list>
          <item>the continuous bag-of-words approach treats the entire contents of the window as a <term>bag of words</term>: an unordered group</item>
          <item>and what the training process does with this bag of words is attempt to predict the target word (in this case <mentioned>times</mentioned>) based on the context words and what it has learned about their relationship to the target word in its examination of the corpus</item>
          <item>the first few times through the process, these predictions are lousy!</item>
          <item>but each time the process eats its way through the text, it updates its knowledge of word relationships and improves its predictions (this is the <soCalled>machine learning</soCalled> at work)</item>
          <item>and when the predictions are no longer lousy, the model has been fully trained</item>
        </list>
        
        </p>  
</lectureNote></section>
      <section>
        <head>Skip-gram</head>
         <slide>
             <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_cbow_skipgram4.png"/>
            </figure>
        </slide>
        <lectureNote>
          <p>In the skip-gram version of this process, the training process is still eating its way through the text, but in this case it treats the contents of the <term>window</term> a little differently:
<list>
  <item>it considers the <term>target word</term> (in this case again, <mentioned>times</mentioned>) together with each word in the window</item>
  <item>and in each case, it tries to predict the <term>context word</term> (<mentioned>it</mentioned>, <mentioned>was</mentioned>, <mentioned>the</mentioned>, etc.) based on the target word</item>
  <item>The term <term>skip-gram</term> is analogous to <term>n-gram</term>; each pair of words is a <term>skip-gram</term> in the sense that it skips over the intervening words within the window.</item>
  <item>skip-gram <emph>classifies</emph> words based on context: it predicts context words based on target words</item>
  <item>as above, the first time through the training process, these predictions are terrible, but they get refined and improved with each iteration.</item>
</list>
</p>
          <p>What does this mean for the output? At this stage, we can note that the bag-of-words approach is a little better for smaller data sets, while the skip-gram approach is a little better for larger data sets, for reasons which I confess I don't completely understand at this point. </p>
          
          <p>[Detail if needed: <quote>This inversion might seem like an arbitrary choice, but statistically it has the effect that CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets. However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets.</quote> (https://www.tensorflow.org/tutorials/representation/word2vec)]</p>
          
         
            
        </lectureNote>
      </section>
       -->



			<section>
				<head>The word vector process: Data preparation</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_process_overview1.png"
						/>
					</figure>
				</slide>
				<lectureNote>
					<p>So another way to put this all together is to walk through the entire process
						in order, step by step. There are basically three major acts in this drama,
						very much like a classic comedy.</p>
					<p>In the first act, we set up the problem and introduce the main characters: <list>
							<item>We analyse our problem and establish a set of research questions
								we want to focus on</item>
							<item>We gather a corpus of documents that are relevant to this
								research; at this stage they may be a motley bunch cobbled together
								from various sources, with differing quality, accuracy,
								transcription conventions, etc.</item>
							<item>And we might do some data cleanup on the corpus to improve
								consistency or make the data better suited to our research: for
								instance, filtering out unnecessary information like page numbers,
								or regularizing/modernizing spelling </item>
						</list>
					</p>
					<p>As part of this process, we might discover things that cause us to reassess
						or expand our research question: so it's helpful to keep an open mind and be
						prepared to treat this as an iterative process.</p>
				</lectureNote>

				<tutorial>

					<p>So another way to put this all together is to walk through the entire process
						in order, step by step. There are basically three major acts in this drama,
						very much like a classic comedy.</p>
					<p>In the first act, we set up the problem and introduce the main characters: <list>
							<item>We analyse our problem and establish a set of research questions
								we want to focus on</item>
							<item>We gather a corpus of documents that are relevant to this
								research; at this stage they may be a motley bunch cobbled together
								from various sources, with differing quality, accuracy,
								transcription conventions, etc.</item>
							<item>And we might do some data cleanup on the corpus to improve
								consistency or make the data better suited to our research: for
								instance, filtering out unnecessary information like page numbers,
								or regularizing/modernizing spelling </item>
						</list>
					</p>
					<p>As part of this process, we might discover things that cause us to reassess
						or expand our research question: so it's helpful to keep an open mind and be
						prepared to treat this as an iterative process.</p>

				</tutorial>
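The cleanup step described above can be sketched in a few lines of Python. The specific regularizations here (dropping page-number lines, modernizing long-s, lowercasing) are illustrative assumptions about what a corpus might need, not the workshop's actual preprocessing script.

```python
import re

def clean_text(raw):
    """Minimal corpus cleanup: drop page-number lines, normalize case and spacing."""
    lines = raw.splitlines()
    # Drop lines that contain only a page number (an assumption about the source data)
    lines = [ln for ln in lines if not re.fullmatch(r"\s*\d+\s*", ln)]
    text = " ".join(lines)
    # Modernize long-s and lowercase everything (illustrative regularizations)
    text = text.replace("ſ", "s").lower()
    # Collapse runs of whitespace left over from joining lines
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("It was the beſt of times\n42\nit was the worst of times"))
# → "it was the best of times it was the worst of times"
```

Real projects usually accumulate many such small rules, which is one reason it pays to keep the cleanup script under version control alongside the corpus.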
			</section>

			<section>
				<head>The word vector process: Training the model</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_process_overview2.png"
						/>
					</figure>
				</slide>
				<lectureNote>
					<p>In the second act, we get the real meat of the plot: in this case, the
						process where we train our model and create a vector space representation of
						our corpus: <list>
							<item>First, we set the parameters for the training process: we choose
								the window size (which does what?); we set the number of iterations
								(which does what?); we set the number of dimensions (which does
								what?) and we set the negative sampling (which does what?)</item>
							<item>Second, we actually run the training process: our little
								caterpillar eats its way through the corpus, taking bigger or
								smaller bites (depending on window size), the number of times
								through depends on the number of iterations we set.</item>
							<item>And third, we validate the model: we test it for
								plausibility</item>
						</list>
					</p>
				</lectureNote>

				<tutorial>

				
						<p>In the second act, we get the real meat of the plot: in this case, the
							process where we train our model and create a vector space
							representation of our corpus: <list>
								<item>First, we set the parameters for the training process: we
									choose the window size (how many words on either side of our
									target word we want to look at); we set the number of iterations
									(how many times to run through the data); we set the number of
									dimensions (how much we want the model to be flattened) and we
									set the negative sampling (a large number will make the model
									more precise but will take longer to train)</item>
								<item>Second, we actually run the training process: our little
									caterpillar eats its way through the corpus, taking bigger or
									smaller bites (depending on window size), and the number of
									times through depends on the number of iterations we set. You
									may want to tweak the parameters depending on what your model
									looks like at the end of the training process. It can even be a
									good idea to train and save many models with different
									parameters so that you can test them out. However, make sure to
									note the parameters you chose for each model so that you can
									reproduce your results.</item>
								<item>And third, we validate the model: we test it for plausibility.
									Plausibility is going to mean something different to different
									researchers. Keep your research question in mind as you approach
									the validation stage as this may influence <emph>how</emph> you
									validate. </item>
							</list>
						</p>
				</tutorial>
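To make the window parameter concrete, here is a small sketch of the (target, context) pairs a skip-gram-style training pass would extract from a sentence. This is a toy illustration of the windowing idea only, not the WordVectors package's implementation; the function name is our own.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs as a skip-gram-style pass would see them.
    `window` counts words on either side of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

toks = "it was the best of times".split()
# With window=2, "the" pairs with the two words on each side
print([c for t, c in context_pairs(toks) if t == "the"])
# → ['it', 'was', 'best', 'of']
```

Widening the window multiplies the number of pairs the training process must consider, which is why window size trades off training time against how much context each word sees.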
			</section>

			<section>
				<head>The word vector process: Iteration and refinement</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_process_overview3.png"
						/>
					</figure>
				</slide>
				<lectureNote>
					<p>As before, this is an iterative process! <list>
							<item>When you're first training a model, it's a good idea to try
								different parameter settings just to find out what difference they
								make</item>
							<item>And when you validate the model, you might see something that
								prompts you to go back and change a parameter and try again: for
								instance, with a very small corpus, you might need to do extra
								iterations (because with a small corpus, there isn't as much
								information being generated about word relationships during each
								iteration, so you need to run the process more times to get the same
								level of accuracy)</item>
							<item>And the model training process might in turn send you back to the
								corpus: you might discover that your corpus is just too small and
								you need to go back and add some more materials. Or you might find
								that your corpus is too heterogeneous: maybe you'd like to try
								splitting it into two and treating them separately.</item>
						</list>
					</p>
				</lectureNote>

				<tutorial>
					<p>Like training, validation is an iterative process! <list>
							<item>As stated before, when you're first training a model, it's a good
								idea to try different parameter settings just to find out what
								difference they make; sometimes even a little change can have major
								downstream impact.</item>
							<item>And when you validate the model, you might see something that
								prompts you to go back and change a parameter and try again: for
								instance, with a very small corpus, you might need to do extra
								iterations (because with a small corpus, there isn't as much
								information being generated about word relationships during each
								iteration, so you need to run the process more times to get the same
								level of accuracy). Even if your queries are producing interesting
								results, if your model isn't valid then those results aren't really
								valid, either.</item>
							<item>The model training process might send you back to the corpus: you
								might discover that your corpus is just too small and you need to go
								back and add some more materials. Or you might find that your corpus
								is too heterogeneous: maybe you'd like to try splitting it into two
								and treating them separately. Sometimes even the method you used to
								prepare the data might need to be adjusted.</item>
						</list>
					</p>

				</tutorial>
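Much of validation comes down to spot-checking whether words that should be related really do sit near each other, and under the hood "near" means cosine similarity between vectors. A minimal sketch, using invented three-dimensional vectors (a trained model would have dozens or hundreds of dimensions, and these particular numbers are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means pointing the same way, near 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 3-dimensional vectors, invented for illustration
vecs = {
    "queen":  [0.9, 0.8, 0.1],
    "king":   [0.85, 0.75, 0.2],
    "turnip": [0.1, 0.05, 0.9],
}
# A plausibility check: "queen" should be closer to "king" than to "turnip"
print(cosine(vecs["queen"], vecs["king"]) > cosine(vecs["queen"], vecs["turnip"]))
# → True
```

If a spot-check like this fails for word pairs your corpus should clearly support, that is a signal to revisit the parameters or the corpus itself.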

			</section>

			<section>
				<head>The word vector process: Querying and research</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_process_overview4.png"
						/>
					</figure>
				</slide>
				<lectureNote>
					<p>In the final act, as with a proper comedy, we reach resolution and answers:
						this is where we can start querying our model and doing our research
						(although as we've seen, the corpus-building and model-training processes
						are also definitely integral to the research process).</p>
				</lectureNote>

				<tutorial>
					<p>Next, we are going to cover one of the most useful features of a word
						embedding model: the ability to ask it questions about your data. This is
						the stage where the actual research begins, though in any write-up involving
						word embedding models, you should make sure to share details about your
						training process as this is a very important stage in the research
						process!</p>

				</tutorial>
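A typical query, such as "which words are closest to X?", is just cosine similarity ranked across the vocabulary. A sketch of that ranking with invented toy vectors standing in for a trained model (the function names and numbers are ours, not the WordVectors API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def closest_to(word, vecs, n=2):
    """Rank every other word in the model by cosine similarity to `word`."""
    ranked = sorted(
        (w for w in vecs if w != word),
        key=lambda w: cosine(vecs[word], vecs[w]),
        reverse=True,
    )
    return ranked[:n]

# Invented 3-dimensional vectors standing in for a trained model
vecs = {
    "sea":        [0.9, 0.1, 0.1],
    "ocean":      [0.85, 0.2, 0.1],
    "ship":       [0.7, 0.5, 0.1],
    "parliament": [0.1, 0.1, 0.9],
}
print(closest_to("sea", vecs))
# → ['ocean', 'ship']
```

Tools like the WordVectors package and the Women Writers Vector Toolkit wrap exactly this kind of ranked-similarity query behind friendlier interfaces.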
			</section>





			<section>
				<head>Tools for word embedding models</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_tools.png"/>
					</figure>

				</slide>
				<lectureNote>
					<p>To wrap up this session, let's take a quick look at the tools we use for
						working with word embedding models.</p>
					<p>We can arrange them in order of abstraction: <list>
							<item>the most foundational <soCalled>tool</soCalled> in this set is the
								word embedding algorithms themselves. These are mathematical
								processes that perform computations that generate a <soCalled>word
									embedding</soCalled>: a representation of a corpus as a vector
								space that has been <soCalled>squashed</soCalled> or
									<soCalled>flattened</soCalled> in useful ways. The two main word
								embedding algorithms in common use are Word2Vec (developed by Tomas
								Mikolov at Google) and GloVe (developed by a research group at
								Stanford). For this workshop, we are using Word2Vec.</item>
							<item>When we want to actually run those algorithms on our data, we need
								to have a computer program that will do things like read in the
								corpus, run the algorithm on it, allow us to set parameters, etc. We
								could write one ourselves if we were clever that way but there
								already exist specific software packages we can use: specific
								implementations of the word embedding algorithms. Two in common use
								are the WordVectors package (written in R by Ben Schmidt) and the
								Gensim package (written in Python by a Czech researcher, Radim
								Řehůřek). For this workshop, we are using the WordVectors R
								package.</item>
							<item>In order to run these programs on your computer, you need to have
								an environment within which the programming language (R, Python) can
								operate: something that understands the R or Python language and can
								run it within the operating system on your computer. These software
								environments are sort of like sandboxes or life support systems for
								specific languages. Examples include RStudio, which is an
								environment for working in the R programming language and running R
								code, and Jupyter Notebooks, which are an environment for working in
								the Python programming language and running Python code. Within
								these environments, we can train models and we can also query and
								interact with them.
								<!--For this workshop, we sort of bypass this layer, because we are not running the word2vec code directly (we already did it for you when we pre-trained your models in preparation for the workshop)--></item>
							<item>An added option (which we're only touching on briefly in this
								workshop) is the Women Writers Vector Toolkit, which is a set of
								programs that create a web interface for Word2Vec, and allow you to
								query the trained models without having to use RStudio or interact
								directly with any of the underlying layers</item>
							<!-- <item>The final layer for this workshop is the Women Writers Vector Toolkit, which is how we will interact with the trained models. The toolkit and its exploratory interface are a set of programs that create a web interface for Word2Vec, and allow you to query the trained models without having to use RStudio or interact directly with any of the underlying layers</item>-->
						</list>
					</p>
					<p>Those layers are all sitting underneath us and they each have effects on the
						outcomes of our work: <list>
							<item>the environment we're working in is the result of a number of
								layers of decisions that could have been made differently</item>
							<!--<item>and in some cases, for some of you, a different choice in one of these layers might ultimately prove to be more suitable for your project</item>-->
							<item>and even if you don't want to make a different decision, in a
								teaching context you might want your students to understand the
								effects of a different set of choices</item>
							<item>so over time, you may want to revisit them as you gain more
								familiarity and comfort with these tools</item>
							<item>the important note to end on here is that this workshop is
								intended to be a starting point</item>
							<item>the things we observe about word vectors and how they work are not
								universal, but local and situational; however, we can learn a lot
								from these experiments </item>
						</list>
					</p>


				</lectureNote>
				<tutorial>
					<p>To finish this part of the tutorial, let's take a quick look at the tools we
						use for working with word embedding models.</p>
					<p>We can arrange them in order of abstraction: <list>
							<item>the most foundational <soCalled>tool</soCalled> in this set is the
								word embedding algorithms themselves. These are mathematical
								processes that perform computations that generate a <soCalled>word
									embedding</soCalled>: a representation of a corpus as a vector
								space that has been <soCalled>squashed</soCalled> or
									<soCalled>flattened</soCalled> in useful ways. The two main word
								embedding algorithms in common use are Word2Vec (developed by Tomas
								Mikolov at Google) and GloVe (developed by a research group at
								Stanford). For this tutorial, we are using Word2Vec.</item>
							<item>When we want to actually run those algorithms on our data, we need
								to have a computer program that will do things like read in the
								corpus, run the algorithm on it, allow us to set parameters, etc. We
								could write one ourselves if we were clever that way but there
								already exist specific software packages we can use: specific
								implementations of the word embedding algorithms. Two in common use
								are the WordVectors package (written in R by Ben Schmidt) and the
								Gensim package (written in Python by a Czech researcher, Radim
								Řehůřek). For this tutorial, we are using the WordVectors R
								package.</item>
							<item>In order to run these programs on your computer, you need to have
								an environment within which the programming language (R, Python) can
								operate: something that understands the R or Python language and can
								run it within the operating system on your computer. These software
								environments are sort of like sandboxes or life support systems for
								specific languages. Examples include RStudio, which is an
								environment for working in the R programming language and running R
								code, and Jupyter Notebooks, which are an environment for working in
								the Python programming language and running Python code. Jupyter
								Notebooks is designed to work with Python notebooks, meaning code
								that is intermixed with prose. If you want to just run the code
								itself, you may want to check out Python's IDLE, or Spyder, which
								comes preinstalled with Anaconda. Within these environments, we can
								train models and we can also query and interact with them.</item>
							<item>An added option (which we've touched on in this tutorial) is the
								Women Writers Vector Toolkit, which is a set of programs that create
								a web interface for Word2Vec, and allow you to query the trained
								models without having to use RStudio or interact directly with any
								of the underlying layers. This tool can be particularly useful if
								you just want to get comfortable with what the results of a model
								query look like or if the model you want to train is using a corpus
								available with the Toolkit.</item>
						</list>
					</p>
					<p>Those layers are all sitting underneath us and they each have effects on the
						outcomes of our work: <list>
							<item>the environment we're working in is the result of a number of
								layers of decisions that could have been made differently. Choosing
								an environment to work in is an important step that asks you to
								consider how you want to interact with the code and what your level
								of comfort is.</item>
							<item>So over time, you may want to revisit different environments as
								you gain more familiarity and comfort with these tools</item>
							<item>the important note to keep in mind is that this tutorial is
								intended to be a starting point. We have only begun to scratch the
								surface of all the cool stuff word embedding models can do!</item>
							<item>The things we observe about word vectors and how they work are not
								universal, but local and situational; however, we can learn a lot
								from these experiments, about both how models work and about our
								corpus itself.</item>
						</list>
					</p>

				</tutorial>

			</section>
			<section>
				<head>Discussion and questions</head>
				<slide>
					<p>Using this new information: <list>
							<item>Are there further questions or things that need more
								explanation?</item>
							<item>Any new perspectives on the examples we saw earlier?</item>
							<item>Any reflections on the explanatory process? What worked and what
								didn't?</item>
						</list>
					</p>
				</slide>
				<lectureNote>
					<p>So now let's take a step back, with this more detailed perspective: <list>
							<item>Are there further questions or things that need more
								explanation?</item>
							<item>Any new perspectives on the examples we saw earlier?</item>
							<item>Any reflections on the explanatory process? What worked and what
								didn't?</item>
						</list></p>
				</lectureNote>
			</section>

		</presentation>
	</text>
</TEI>
