Introduction to Word Vectors
Julia Flanders
2019-04-01
A Road Map
As we’ve already seen, word vectors are complicated...
The next few sessions are intended to offer an overview, from several
different angles:
- a walk through the special concepts and terminology so that we’re
all comfortable with them
- a walk through the actual process of training and querying a
model
- an exploration of the mathematical side of word embedding models:
what do we mean by vector space?
- a review of the tool set we use with word embedding models: what
are the actual technologies we use and what role do they
play?
Hopefully by the end, we’ll have gone over the same material from enough
different perspectives that it will all make perfect sense!
And at various points, we’ll take a step back and think about the explanatory
process itself: what kinds of explanation might work best for different
audiences (especially readers of our scholarship, project collaborators,
colleagues, grant reviewers, and potentially our students).
Corpus and model
We’re going to hear the terms corpus and model a
lot this week: let’s look more closely at those terms
Corpus:
- In the simplest sense, the corpus is the body of textual material
we are analysing
- A set of documents in some machine-readable form, ready
for the word2vec program to ingest
- Our corpus might be derived from a larger research collection (or
several different collections), maybe in another format (like
TEI/XML) that contains extra information that we take advantage of
when we generate the corpus that will be fed into Word2Vec
- So to get from the research collection to the
corpus we might need to do some data conversion:
from XML (or some other format) to plain text (which is what the
Word2Vec tool requires)
- And we might also need to do some cleaning and regularization, to
tame the irregularities of the original research collection. A
little later on, we’ll think about data formats and cleaning
processes in more detail.
- So when we talk about the corpus here, we’re
talking about the plain-text corpus that is ready to be fed into
Word2Vec
Model:
- As we’ve already noted, the term model is an
important one in digital humanities: in general terms, a model is a
representation of something we are interested in, that captures some
features of importance, in a way that makes it easier for us to
examine and learn about that object of interest. So for instance a
TEI-encoded text is a representation of a text that
makes the structure and content of that text easier for us to see
and work with. A word-embedding model of a corpus is a
representation of a corpus of texts, in a way that
makes the semantic relationships between words easier for us to see
and work with.
- Practically speaking, the model we will be
dealing with is a processed version of the corpus, produced by the
Word2Vec tool, which represents the positioning of each word within
the model as a vector
- So for now the key point is: the corpus is a collection of
documents, while the model is a processed, computed representation
of the textual data contained in those documents
The data preparation process is how you get from the research
collection to the corpus
The training process is how you get from the corpus to the
model.
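If it helps to make this concrete, here is a minimal sketch of that data preparation step in Python, using only the standard library. The tags, function names, and cleaning rules here are invented for illustration; a real TEI/XML pipeline would have to handle namespaces, notes, page numbers, and much more.

```python
# Sketch: converting an XML document to cleaned plain text for word2vec.
# Standard library only; a real TEI conversion is considerably richer.
import re
import xml.etree.ElementTree as ET

def xml_to_plain_text(xml_string):
    """Strip the markup, keeping only the text content."""
    root = ET.fromstring(xml_string)
    return " ".join(root.itertext())

def clean(text):
    """A crude regularization: lowercase, keep only letters and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())

doc = "<text><p>Sacred, holy words.</p><p>A consecrated shrine!</p></text>"
print(clean(xml_to_plain_text(doc)))  # "sacred holy words a consecrated shrine"
```

The output is the kind of flat, regularized text that the training step expects as its corpus.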
Parameters
Remember that we said different researchers might want to use the model for
different things, which would result in training/generating the model
somewhat differently. The way we control that training process is by
adjusting a set of parameters.
You can think of the training process (where we take a corpus and create a
model of it) as being sort of like an industrial operation:
- you take some raw materials and feed them into a big machine, and on
the other end you get out some product
- and this hypothetical machine has a whole bunch of knobs and levers on
it that you use to control the settings
- in our word2vec model training, the parameters are those knobs and
levers, that control the training process
- depending on how you adjust them, you get differently trained models
with different behaviours
We’ll take a quick look now at two of these parameters, so that you can get a
sense of how they affect the training process; they also have an important
impact on how we interpret the results of the model. Later in the week,
we’ll look at these parameters in more detail and think about the effect
these specific settings have on our models.
Window
The first parameter for us to consider is the concept of the
window
And here we come to a fundamental assumption for a lot of text analysis: that
words that are used together have something to do with one another
What does it mean for words to be used together?
- right next to one another? all or nothing?
- more relevant the closer they are? sort of a gradient?
- contained within the same semantic construct, like a sentence or a
paragraph? (problem: we’re working with plain text so we don’t have
access to semantic constructs)
In Word2Vec, instead of these, we use a window:
- a span of text of a specified length, like a viewing port that we
move over the text that allows us to see X words at a time
- we can control the size of the window (it is one of the
parameters we just talked about)
- the Word2Vec algorithm is like a bookworm reading its way through
the text, bite by bite
- each taste is localized by the window: each bite gives the
processor a set of words that are considered used
together
- and the size of the bite affects how many words are considered
together in this way
- a bigger window lets us treat larger groups of words as
related
- [pause and discuss for a moment:] what might be the results for
our analysis of a larger or smaller window? (imagine a window that
is thousands of words, as big as an entire chapter; imagine a window
that is only two words wide)
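To make the window concrete, here is a small Python sketch (the function name and the toy sentence are mine, not word2vec's internals) showing which pairs of words count as used together for a given window size:

```python
# Sketch: what the window "sees". For each target word, collect the
# words within `window` positions on either side of it.
def context_pairs(tokens, window):
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the sacred and holy shrine".split()
for target, ctx in context_pairs(tokens, window=1):
    print(target, "->", ctx)
```

Try changing `window=1` to `window=4`: with a wider window, distant pairs like "the" and "shrine" suddenly count as used together, which is exactly the discussion question above.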
Remember that this is a machine learning process and moreover it
is an unsupervised machine learning process: one that starts
from a state of complete ignorance and has to bootstrap itself.
- So another way to imagine the approach being taken in
the training process: picture that you have a big bag containing all
the words in the corpus. You shake the bag and then dump it out on
the floor. Now you start reading the corpus (i.e. the actual texts
with their actual word order).
- Each time you read a word, you make observations about the words
around it.
- Remember that this one observation doesn’t give you any kind of
Truth!! 100%!! about those words: it’s just one little
observed fact. Probabilistically, it contributes a tiny bit to our
belief about the whole corpus.
- So based on those observations about word X, we move each of the
context words a tiny bit closer to word X. Now we look at the next
word X and its companions, and we move those words a
little bit.
- note that the window is giving us two pieces
of information: what’s in the window, and what’s not in the window.
We’ll come back to this in more detail later, but for now we can say
that in addition to moving the words we do see, we also
update the position of some of the words we don’t see
as we read the text.
Iterations
We’ve talked about the creation of a model as a training
process, and we’ve just imagined it as a bookworm eating its way through the
text, repeatedly. The trained model is the representation of the probability
that words appear within the same window.
- As we just noted, the model begins in a state of complete
randomness: words dumped on the floor. But after one read through
the corpus, the words on the floor have moved around a bit. The
machine is learning! Now, if we repeat the process, we can move them
a bit further--it might seem as if we’re getting the same
information as we got before, but because the words on the floor are
now in different (better) positions already, what we’re doing is
refining that information further.
- each pass through the corpus provides another set of adjustments,
making the model more accurate
- each of these passes is called an iteration, and the
more iterations the training process does, the more accurate the
model (but of course the more time the process takes)
- you can control the number of iterations: it is another of the
parameters we mentioned a moment ago
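As a toy illustration (this is not word2vec's actual update rule, just the general shape of it), here is how repeated passes nudge a position closer and closer to where the observations point:

```python
# Toy illustration: each pass makes a small adjustment toward the
# observed neighbors, so more iterations refine the position further.
def train(position, observed_at, iterations, step=0.1):
    for _ in range(iterations):
        position += step * (observed_at - position)  # one small nudge per pass
    return position

start = 0.0            # the word dumped at a random spot on the floor
target = 1.0           # where its observed neighbors sit
after_1 = train(start, target, iterations=1)
after_20 = train(start, target, iterations=20)
print(after_1, after_20)  # 0.1 after one pass; roughly 0.88 after twenty
```

No single pass gets the word to the right place; accuracy accumulates across iterations, at the cost of more computing time.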
Vectors: a first look
Let’s look next at some terms that may seem most distant from our humanistic
expertise: the ones that refer to the mathematical aspects of word embedding
models. The word vector has come up already: what is
a vector and how is it relevant in this case? We’ll start with a simple
explanation first, and then circle back a bit later for more detail.
A vector is basically a line that has both a specific length and a specific
direction or orientation in space:
- we can describe that line using coordinates: one coordinate for
each axis of information we have about the line
- in this example, the vector is the thick black line that starts at
the origin (the point where all three axes are at zero) and extends
out to the point in space where the x axis (the blue number) is at
3, the y axis (the red number) is at 2, and the z axis (the green
number) is at zero
- its direction and length are defined by those three
dimensions
- any questions? This may be new for some and probably not current
knowledge for most of us!
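For the mathematically curious: the length of that example vector follows directly from its coordinates (the three-dimensional version of the Pythagorean theorem). A quick check in Python:

```python
# The vector on the slide runs from the origin to the point (3, 2, 0).
# Its length is the square root of the sum of the squared coordinates.
import math

x, y, z = 3, 2, 0
length = math.sqrt(x**2 + y**2 + z**2)
print(round(length, 3))  # about 3.606
```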
In a word-embedding model, the model represents a text corpus almost like a
dandelion: as if each word were at the end of one of the little dandelion
threads:
- each thread projects at a slightly different angle
- each word is located at a slightly different point in this cloud
of words
- and words that are nearer to one another in meaning are also
nearer to one another in vector space.
Cosine Similarity: What is a cosine anyway?
So what does it mean to be near something in vector space? How
do we measure this kind of proximity or association? If we understand these
vectors as lines whose directionality and length reflects word associations
in the corpus, then the more closely aligned two vectors are (the more they
are going in the same direction for the same distance), the
nearer they are for our purposes.
We can measure that alignment by using a mathematical expression called a
cosine. What is a cosine?
- If we have two vectors (two lines extending out in different
directions), what we really have is a triangle (the third leg would
be the line connecting the ends of those two vectors)
- Within a triangle, a cosine is a way of representing an angle in
relation to the lengths of its two legs
- the exact formula (for right triangles) is shown here on the
slide, but even without parsing it in detail, we can say that the
cosine of an angle is a ratio of side lengths: in a right triangle,
the side adjacent to the angle divided by the hypotenuse (for
triangles without a right angle, the formula is a little more
complex)
- So as the angle between the two vectors shrinks (the more nearly
they point in the same direction), the cosine gets closer and
closer to 1
Cosine Similarity
So now we can come back to our question of how to measure
nearness. In word embedding models the measure of
nearness that we use is something called cosine similarity.
- Roughly speaking, this is a measure of the similarity of two
vectors, based on the cosine of the angle between them
- As we’ve seen, the more similar the two vectors are, the closer
the cosine gets to 1. Strictly speaking, cosine similarity can range
from -1 to 1, but in practice with word vectors we mostly see values
between zero and one: two identical vectors have a cosine similarity
of 1; two unrelated vectors (at right angles to one another) have a
cosine similarity of zero
- So the smaller the cosine similarity, the less similar the words
are, and the farther apart they are in vector space
- We’ll talk a bit later on about what level of similarity really
counts as similar, and you’ll get a feel for
it
- In general, anything above .5 starts to feel meaningful
So in this example (a real-world example from the WWP corpus), if we take the
word sacred as our starting point, the words
holy and consecrated are
fairly close in meaning (and have high cosine similarity); the word
shrine is more distant but still related enough
to be interesting
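For those who like to see the arithmetic, here is a minimal cosine-similarity function in Python. The three-dimensional vectors for sacred, holy, and shrine are invented for illustration; they are not taken from the WWP model, where vectors have hundreds of dimensions.

```python
# Cosine similarity: the dot product of two vectors divided by the
# product of their lengths.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

sacred = [3.0, 2.0, 0.5]
holy = [2.8, 2.1, 0.4]    # nearly aligned with "sacred"
shrine = [1.0, 2.0, 2.5]  # related, but pointing somewhat elsewhere

print(round(cosine_similarity(sacred, holy), 3))    # high
print(round(cosine_similarity(sacred, shrine), 3))  # lower, still positive
```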
So far so good? Questions?
Querying
So what can we do with this information? We’ve created a model of our corpus
(a representation that helps us see some aspect of that information more
clearly and easily): how do we use it?
The first thing we might try is just querying the model about the
neighborhood of a word we’re interested in: essentially, asking it questions
about where specific words are located and what is around them:
- this slide shows a simple example using the WWP’s Women Writers
Vector Toolkit, but in this workshop we will be working in the
RStudio environment that we looked at in the last session, so we can
design much more complex queries
- we can enter a search term, and get back a list of the words that
are closest to it in vector space: that is, words that are probably
semantically related to it, based on the way those words appear
together in the corpus
- as we can see from this list, these aren’t necessarily synonyms:
there are many different ways words can be
related as we will discover
- but they are words that tend to appear in the same contexts as our
query term (in this case, discussions of families and familial roles
and relationships)
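In the workshop we will run these queries in RStudio, but the underlying operation can be sketched in a few lines of Python over a toy model (all the words and numbers below are invented):

```python
# Sketch of querying a model: given a toy embedding table, return the
# words nearest a query term, ranked by cosine similarity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

model = {
    "mother":   [2.0, 0.1, 1.8],
    "daughter": [1.9, 0.2, 1.7],
    "family":   [1.5, 0.4, 1.6],
    "ship":     [0.1, 2.0, 0.2],
}

def nearest(model, query, n=3):
    others = [(w, cosine(model[query], v)) for w, v in model.items() if w != query]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:n]

for word, sim in nearest(model, "mother"):
    print(word, round(sim, 3))
```

With these invented numbers, "daughter" and "family" rank high and "ship" ranks low: the same shape of result as the neighborhood query on the slide.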
Clustering
Another way we can interact with the model is to ask it more generally,
where are your semantically dense zones? Or please show me
some clusters of related words!
This process is somewhat similar to topic modeling:
- it says What if we divided up the corpus into X different
zones? where are the centers of those zones, and what is nearest
to those centers?
- just as in topic modeling where we say, in effect, if our
corpus has X number of topics, what would they be?
- or if we were looking at a map of a region, we might say if we
were going to build ten new Home Depot stores, where should they
go so that most people have the shortest drive? and who lives in
those regions?
To generate these clusters (as part of the initial model training process):
- the modeling program runs a clustering algorithm that randomly
chooses a number of locations within the vector space—in this case,
three (like throwing a set of three darts at the map)
- then, it makes a series of adjustments to those locations to move
them closer to actual population centers,
places where words are close to one another within the vector space
- if we kept up the adjustment process long enough, we would
eventually settle on a stable result: the three most significant
semantic zones within the model (strictly speaking, a good
arrangement rather than a guaranteed perfect one, since the outcome
depends partly on where the darts first landed)
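The darts-and-adjustments description above is essentially the k-means clustering algorithm. Here is a minimal Python sketch on two-dimensional points (the real toolkit does the same thing in the full vector space):

```python
# Minimal k-means sketch: throw k random "darts", then repeatedly move
# each one to the average position of the points nearest it.
import random

def kmeans(points, k, adjustments=40, seed=1):
    random.seed(seed)
    centers = random.sample(points, k)        # the darts
    for _ in range(adjustments):
        groups = [[] for _ in range(k)]
        for p in points:                      # assign each point to its nearest dart
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0])**2 + (p[1] - centers[c][1])**2)
            groups[i].append(p)
        for i, g in enumerate(groups):        # move each dart to its group's center
            if g:
                centers[i] = (sum(p[0] for p in g) / len(g),
                              sum(p[1] for p in g) / len(g))
    return centers

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(kmeans(points, k=2))  # two centers, one per "neighborhood" of points
```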
Clusters: an example
So what we get at the end of the process is clusters of words that are like
neighborhoods within the vector space: densely
populated areas where words are grouped together around a concept or a
textual phenomenon.
- in principle, the number of clusters is up to us
- in our own model training for this workshop, since we are working
directly with the R code, we can choose how many clusters we
want
- the WWVT doesn’t give you this option but the WWP chose 150 as a
reasonable number
- for the Toolkit we stop the process after 40 adjustments, so the
clusters will come out a bit different every time you reset them,
but when running the code yourself in RStudio you can control that
more precisely.
- in this example, for instance (which shows three of the 150
clusters), there’s a cluster that’s roughly associated with
religiously-oriented death ceremony, and another one that is
old-fashioned cavalry-oriented warfare, but the one in the middle is
harder to describe as a concept: it’s more like
the space of dialogue and spoken language markers
Check the time and consider stopping here!
Vector Math 1
One more thing we can do to explore the word information in our vector space
model: we can examine the relationships between words, taking advantage of
the fact that each word is represented as a vector, which is a kind of
number
To understand how this works, we need to envision a little more clearly how
words are positioned in this vector space model:
- during the training process, the word2vec algorithm is examining
the text (looking through its little window at successive groupings
of words)
- and with each observation, it adjusts the position of the words in
the model to take into account the word associations it
observes
- so for a word like bank, it might observe
some instances where that word is associated with words like
funds and revenue,
and so it moves bank closer to those words:
it adds information that makes an association between these
words
- then maybe it observes some other instances where
bank is associated with words like
river and lake and
Hudson, and it moves the word
bank a little closer to those
words
- so by the end, each word is positioned in vector space in a way
that reflects its associations (some weaker and some stronger) with
many of the other words in the corpus
- we can think of each association as being like a rubber band that
pulls a pair of words together; each word is being pulled in
multiple different directions, with different strengths of
association, and its net position is the result of all of those
pulls
Vector Math 2
We can use this information to tease out more specific semantic spaces for
individual words:
- For instance, we might imagine that the word
grace has some rubber band pulling it
towards a word like beauty. What if we cut
that rubber band? What part of the semantic field might pull
grace more into their orbit if
beauty were out of the running? We can
find this out by subtracting the vector for
beauty from the vector for
grace: the result is a set of
associations that are specifically religious
- Similarly, instead of cutting that rubber band, we might intensify
its strength and allow it to pull grace
towards it more strongly (putting grace into
a zone where its aesthetic associations are most powerful). We can
do this by adding the vector for
beauty to the vector for
grace.
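Here is a toy Python sketch of that subtraction. Every number is invented: the dimensions loosely stand for aesthetic and religious associations, purely so we can watch the rubber band being cut.

```python
# Toy sketch of vector arithmetic: subtracting the "beauty" vector from
# "grace" strips away the aesthetic pull and leaves a vector that sits
# nearer the religious terms. All values are invented for illustration.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

model = {
    "beauty":   [2.0, 0.2, 0.1],
    "lovely":   [1.8, 0.3, 0.2],
    "prayer":   [0.2, 2.0, 0.1],
    "blessing": [0.3, 1.9, 0.2],
}

grace = [1.5, 1.4, 0.2]   # pulled toward both the aesthetic and religious zones
stripped = [g - b for g, b in zip(grace, model["beauty"])]  # grace - beauty

ranked = sorted(model, key=lambda w: cosine(stripped, model[w]), reverse=True)
print(ranked[0])  # with these invented numbers: "prayer"
```

Adding the beauty vector instead (`g + b` in place of `g - b`) would pull the result the other way, toward "lovely" and "beauty".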
Note that words here are just proxies or symptoms (imperfect ones) for the
concepts we might be interested in:
- As we think about what words to add or
subtract, it’s important to think about how
those words are related to the concept we’re trying to examine (and
it’s worth trying different words)
- Also, the semantic associations of words are very corpus-specific:
in a corpus of financial documents, the term
grace might be exclusively associated
with the grace period for bill payment
- So knowing our corpus is really crucial
Validation
As we use our model in these various ways, we’re going to get some results
(hopefully) that look very predictable, and some others that look
provocative and fascinating, and maybe some others that look bizarre and
unexpected. How can we tell the difference between an interpretive
breakthrough and a glitch resulting from some terrible flaw in our training
process?
Once we’ve generated a model, there are ways we can and should test it to see
whether it is actually a useful representation that will give research
results we can use. That testing process is called validation.
To validate a model, we can ask questions like these:
Are your results consistent across models?
- When you train a
series of models on the same corpus using the same parameters, do
you get consistent cosine similarities for the same sets of words?
- (Note: because training a model is a probabilistic process, you
won’t get identical results from model to model, even if they’re
trained on exactly the same corpus, but the results should be
comparable.)
Do you get plausible word groupings?
- When you generate groups of similar terms
(either by generating clusters, or by querying a specific word), do
you get plausibly related groups of words for common and moderately
common query terms? (Common within your corpus, that is!)
- If you don’t get plausible groupings for moderately common words,
this would be a sign to proceed with caution; if you don’t get
plausible groupings for even common words, this would be a sign that
the model may not be very useful (this might be because of small
corpus size, or some other factor).
Does vector math work as you would expect?
- When you do the various forms of vector math (addition,
subtraction) do your results continue to seem plausible?
If we didn’t stop before, consider stopping now!
Circling back: another look at vectors
Now that we’ve worked through the basic concepts, let’s circle back and
consider the whole picture of word vectors or
word embedding models, and introduce a few
additional complexities.
[if starting the day here, check in and see if people want to recap
anything]
A quick review: we’ve already noted that a vector is basically a line that
has both a specific length and a specific direction or orientation in space:
- so here again in this example, the vector is a line that starts at
the origin (the point where all three axes are at zero) and extends
out to the point in space where the x axis is at 3, the y axis is at
2, and the z axis is at zero
- we can think of those three axes as representing three pieces of
information: together, they constitute a unique vector in
three-dimensional space.
- I’m going to pause here for a moment and let the diagram sink in a
bit more, because at this stage in the explanation, it helps to have
a sense of what the diagram is telling us. Does anyone want to test
out their understanding of how those three axes (the x, y, and z)
are contributing information to the direction and distance of that
vector? Does everyone see how the blue number 3 comes from the blue
x axis? etc.?
Words as vectors
The example we were just looking at shows a vector defined by three
dimensions: three different numbers representing three different axes of
meaning. However, when we’re working with word embedding models, we are
working with vectors that are defined by many more dimensions. So in order
to understand that scenario, we need to get a little more comfortable with
two ideas:
- a vector is just an assemblage of dimensions
- each dimension represents an association that has been
observed
So let’s take the first example on this slide (the idea may look familiar if
you read Jay Alammar’s Illustrated Introduction to Word
Embeddings):
- Our little chart here shows three people (Jo, Lee, Robin) and for
each person it shows an assemblage of dimensions
- each dimension represents an association
- in this case, that association has to do with the person’s
affinity for specific animals: perhaps through observations of how
many pets of each type they have, or their response when they
encounter the animal in the wild
- So each person in this chart is represented by a vector with five
dimensions: a line in five-dimensional space
- if we want to compare two people and find out whether they tend to
like the same animals, impressionistically we can say that people
with high or low affinity
for the same animals are similar: the color coding is highlighting
this pattern.
- But if we want a quantitative way to talk about that similarity,
we can use the measure called cosine similarity, which
is a way of measuring the angle between two lines.
- Here, we’re doing exactly the same thing, except that our
lines are defined by five dimensions
instead of three.
- the calculation isn’t hard (you can find an Excel version on the
web!) and what it shows us is that the cosine similarity between our
two mammal-lovers is very high, whereas the similarity between the
mammal-lovers and the person who prefers birds/lizards/beetles is
quite low.
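The same calculation in code: five-dimensional vectors for our three people (numbers invented to match the spirit of the chart), compared with cosine similarity exactly as before, just with five coordinates instead of three.

```python
# Cosine similarity works the same way in five dimensions as in three.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

#        dogs, cats, birds, lizards, beetles
jo    = [0.9, 0.8, 0.1, 0.1, 0.2]   # mammal-lover
lee   = [0.8, 0.9, 0.2, 0.1, 0.1]   # mammal-lover
robin = [0.1, 0.2, 0.9, 0.8, 0.7]   # prefers birds, lizards, and beetles

print(round(cosine(jo, lee), 2))    # high: the two mammal-lovers
print(round(cosine(jo, robin), 2))  # low: different animal affinities
```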
Pause for questions and reflection!
So, taking this a step farther, let’s look at the chart on the right:
- Here, instead of looking at people and their association with
animals, we’re looking at words and their association with other
words
- those other words have been observed in
proximity (in the window) with our target
word, to a greater or lesser extent
- we’re not giving numbers here, but imagine that the green boxes
are the ones that were observed more often, and the orange boxes are
the words that were observed less often, and maybe the
greenish-yellow boxes were somewhere in the middle
So what do we see when we look at the righthand chart?
- What kind of cosine similarity would we expect to find between
danger and peril?
A high or low similarity?
- How about between danger and
horses? Horses and
goats?
A few interesting things to note:
- all of the words are contributing information to each of the
vectors, even when the actual association observed is low (I’ll come
back to this in a minute)
- and in fact the chart goes way off the edge of the screen to the
right: there could in principle be hundreds of words contributing to
the distinctive vector that is danger
Negative Sampling
So let’s now add another concept. Cast our minds back to the little bookworm
eating through the corpus, making observations about the words that are
near the target word, and adjusting the position of
the words within the model. The information about those words that it
observes is being fed into our little chart here. But how about the words
that aren’t being observed?
We mentioned earlier that these are also significant. When the bookworm takes
a bite, there are a huge number of words that are not in that sample, and
the model training process could (in principle) use that information to
adjust all of the words in the corpus, moving them away from
the target word. In practice, it doesn’t adjust all of the
words (since that would be too much work) but it adjusts some
of the words: a random sample. This is called negative
sampling, and it is one of the parameters we can adjust: we can say
how many of these non-appearing words should have their positions updated
with each observation. If we have a large negative sampling value, the model
training will be more precise, but the training process will take a lot
longer.
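A rough sketch of the sampling step in Python (illustrative only: real word2vec weights this draw by word frequency rather than sampling uniformly):

```python
# Sketch of negative sampling: alongside the words seen in the window,
# draw a few random words from the rest of the vocabulary; those get
# pushed *away* from the target word.
import random

def negative_sample(vocabulary, window_words, k, seed=0):
    random.seed(seed)
    candidates = [w for w in vocabulary if w not in window_words]
    return random.sample(candidates, k)

vocabulary = ["sacred", "holy", "shrine", "toothbrush", "horses", "ledger", "river"]
window_words = {"sacred", "holy"}

negatives = negative_sample(vocabulary, window_words, k=2)
print(negatives)  # two words that were *not* in the window
```

The parameter `k` here plays the role of the negative sampling setting: how many unseen words get their positions updated per observation.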
Looking again at our chart: If time and computing power were no object, we
could imagine the chart extending off to the right so that every word in the
corpus is listed, and we could imagine the position of every word in the
model being adjusted with each observation, so that both the positive and
negative sampling information would be fully reflected in the model. We
could think of this situation as a kind of perfect
model:
- showing all words exerting some probabilistic influence on each
other
- in terms of text prediction, all words have some
probability of being the next word even if that
probability is very, very low
- in this perfect model, the vector for each
word has as many dimensions as there are words in the corpus
Let’s test this idea a little further:
- imagine that the window is
the size of the corpus: now all words are related to all other words
equally! Let that sink in for a moment: our understanding of the
relatedness of words is strongly determined
by our observational parameters: it isn’t intrinsic, it’s something
we control.
- And in fact in some forms of unsupervised modeling, like a topic
model, which operates on the whole document, the window size is in
effect the entire document: the model training process says
which words appear in the same document?
- But in word embedding models, our concept of relatedness is a bit
more precise than this: we are interested in things that are
happening more at the sentence or phrase level, where the
association between words reflects the way writers are actually
articulating specific ideas
One more look at our perfect model:
- note that it contains a lot of empty space: places where we are
noticing that in fact the word toothbrush is
not related to the words danger,
horses, etc.
- without getting too far into the weeds, it turns out that this
empty space is a problem: largely because it makes the data set
very, very large.
So what do we do about that?
Embedding!
To make the model more compact, and hence easier to process while you wait,
clever people developed a technique called embedding which
flattens the model: reducing it from a very large number of dimensions
(like, thousands) to a somewhat smaller number of dimensions (like,
hundreds).
For those of you who may have read Edwin Abbott’s Flatland,
you might remember how when a sphere visits Flatland, the two-dimensional
creatures there see it as a circle: a three-dimensional entity
flattened or projected onto two dimensions.
Something similar sometimes happens to Wile E. Coyote.
We are not going to cover the mathematics of it, but we will look at a few
effects/results.
In simple terms:
- in our perfect model, remember that the
position of each word in the model is a vector, and that vector is
essentially a complicated multidimensional number
- each of the dimensions of that number is another word
in the corpus, one number for each word (even the
unrelated words)
- in the flattened version, that is no longer
true: the position of each word is still a vector, but that vector’s
dimensions are no longer individual words, and the number of
dimensions is no longer the total vocabulary of the corpus.
- instead, we choose the number of dimensions, as one of the
parameters for the training process
- the embedding process then compresses the model down
to that number of dimensions, and reduces the empty space of the
unrelated words.
So by specifying the number of dimensions, we are in effect specifying how
many other words each word’s position takes into account:
- if we choose a very low number of dimensions, the model will have
very little information about the word relationships within our
corpus
- if we choose a very high number of dimensions, the model will have
a lot of information about the word relationships in the
corpus
- however, the sweet spot is also going to
depend on the total size and total vocabulary of the corpus: for a
corpus with a tiny vocabulary, a large number of dimensions may not
be very useful.
I’m afraid there’s a little ’magic happens here’ at this stage--the
mathematical details are a little out of scope for this institute, but there
are some good sources in the readings for those who want to understand this
more fully.
The word vector process: Data preparation
So another way to put this all together is to walk through the entire process
in order, step by step. There are basically three major acts in this drama,
very much like a classic comedy
In the first act, we set up the problem and introduce the main characters:
- We analyse our problem and establish a set of research questions
we want to focus on
- We gather a corpus of documents that are relevant to this
research; at this stage they may be a motley bunch cobbled together
from various sources, with differing quality, accuracy,
transcription conventions, etc.
- And we might do some data cleanup on the corpus to improve
consistency or make the data better suited to our research: for
instance, filtering out unnecessary information like page numbers,
or regularizing/modernizing spelling
As part of this process, we might discover things that cause us to reassess
or expand our research question: so it’s helpful to keep an open mind and be
prepared to treat this as an iterative process.
The word vector process: Training the model
In the second act, we get the real meat of the plot: in this case, the
process where we train our model and create a vector space representation of
our corpus:
- First, we set the parameters for the training process: we choose
the window size (which does what?); we set the number of iterations
(which does what?); we set the number of dimensions (which does
what?) and we set the negative sampling (which does what?)
- Second, we actually run the training process: our little
caterpillar eats its way through the corpus, taking bigger or
smaller bites (depending on window size), the number of times
through depends on the number of iterations we set.
- And third, we validate the model: we test it for
plausibility
The word vector process: Iteration and refinement
As before, this is an iterative process!
- When you’re first training a model, it’s a good idea to try
different parameter settings just to find out what difference they
make
- And when you validate the model, you might see something that
prompts you to go back and change a parameter and try again: for
instance, with a very small corpus, you might need to do extra
iterations (because with a small corpus, there isn’t as much
information being generated about word relationships during each
iteration, so you need to run the process more times to get the same
level of accuracy)
- And the model training process might in turn send you back to the
corpus: you might discover that your corpus is just too small and
you need to go back and add some more materials. Or you might find
that your corpus is too heterogeneous: maybe you’d like to try
splitting it into two and treating them separately.
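The point about small corpora needing extra iterations can be made concrete with some back-of-envelope arithmetic. This Python sketch is illustrative and the corpus sizes are invented; it simply counts how many (target, context) updates a training run sees:

```python
def training_updates(n_tokens, window, epochs):
    """Rough upper bound on the number of (target, context) updates in
    one training run: each token yields up to 2 * window pairs, and the
    whole scan repeats once per iteration (epoch)."""
    return n_tokens * 2 * window * epochs

# Invented numbers: a small corpus with extra iterations can see as
# many updates as a corpus ten times larger scanned only once.
small = training_updates(100_000, window=5, epochs=10)
large = training_updates(1_000_000, window=5, epochs=1)
print(small, large)  # → 10000000 10000000
```

This is only a rough count (it ignores sentence boundaries and subsampling), but it shows why raising the iteration count partly compensates for a small corpus.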
The word vector process: Querying and research
In the final act, as with a proper comedy, we reach resolution and answers:
this is where we can start querying our model and doing our research
(although as we’ve seen, the corpus-building and model-training processes
are also definitely integral to the research process)
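At bottom, querying a model means asking which word vectors lie closest together, usually measured by cosine similarity. Here is a minimal Python sketch with an invented three-dimensional "model" (real models use a hundred or more dimensions, and these toy vectors are made up purely for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity: the angle-based closeness measure that most
    word-vector queries rely on."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# A made-up toy "model": three words, three dimensions.
vectors = {
    "king":    [1.0, 0.0, 0.5],
    "queen":   [0.9, 0.1, 0.5],
    "cabbage": [0.0, 1.0, 0.0],
}

def nearest(word):
    """Return the other word whose vector is closest to `word`'s."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

print(nearest("king"))  # → queen (by construction of the toy vectors)
```

A "closest words" query in any of the real tools is this same computation, run against every word in the trained vocabulary.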
Tools for word embedding models
To wrap up this session, let’s take a quick look at the tools we use for
working with word embedding models
We can arrange them in order of abstraction:
- the most foundational tool in this set is the
word embedding algorithms themselves. These are the mathematical
procedures that generate a word
embedding: a representation of a corpus as a vector
space that has been squashed or
flattened in useful ways. The two main word
embedding algorithms in common use are Word2Vec (developed by Tomas
Mikolov at Google) and GloVe (developed by a research group at
Stanford). For this workshop, we are using Word2Vec.
- When we want to actually run those algorithms on our data, we need
to have a computer program that will do things like read in the
corpus, run the algorithm on it, allow us to set parameters, etc. We
could write one ourselves if we were clever that way but there
already exist specific software packages we can use: specific
implementations of the word embedding algorithms. Two in common use
are the wordVectors package (written in R by Ben Schmidt) and the
Gensim package (written in Python by a Czech researcher, Radim
Řehůřek). For this workshop, we are using the wordVectors R
package.
- In order to run these programs on your computer, you need to have
an environment within which the programming language (R, Python) can
operate: something that understands the R or Python language and can
run it within the operating system on your computer. These software
environments are sort of like sandboxes or life support systems for
specific languages. Examples include RStudio, which is an
environment for working in the R programming language and running R
code, and Jupyter Notebooks, which are an environment for working in
the Python programming language and running Python code. Within
these environments, we can train models and we can also query and
interact with them.
- An added option (which we’re only touching on briefly in this
workshop) is the Women Writers Vector Toolkit, which is a set of
programs that create a web interface for Word2Vec, and allow you to
query the trained models without having to use RStudio or interact
directly with any of the underlying layers
Those layers are all sitting underneath us and they each have effects on the
outcomes of our work:
- the environment we’re working in is the result of a number of
layers of decisions that could have been made differently
- and even if you don’t want to make a different decision, in a
teaching context you might want your students to understand the
effects of a different set of choices
- so over time, you may want to revisit them as you gain more
familiarity and comfort with these tools
- the important note to end on here is that this workshop is
intended to be a starting point
- the things we observe about word vectors and how they work are not
universal, but local and situational; however, we can learn a lot
from these experiments
Discussion and questions
So now let’s take a step back, with this more detailed perspective:
- Are there further questions or things that need more
explanation?
- Any new perspectives on the examples we saw earlier?
- Any reflections on the explanatory process? What worked and what
didn’t?