Word Vectors Training, Querying, and Validation
This walkthrough is a static version of an R notebook published by the Women Writers Project. In an environment such as RStudio, the code blocks below would be editable and interactive. For the full set of notebooks and more context on their usage, visit our GitHub repository.
Getting started
Using this File
This file is an introduction to training and querying a model with word2vec; it is designed to be used with our class’s RStudio Server instance.
Reminder on running code
To run a single line of code from an R Markdown file, put your cursor anywhere in that line of code and then hit command-enter or control-enter. If you want to run all of the code in a code snippet, you can hit the green triangle button on the right. If you want to run a particular section of code, highlight the section you want to run and hit command-enter or control-enter.
Much of our code will run almost instantly, but some things will take a few seconds or minutes, or even longer. You can tell code is still running if you see a red stop sign in the top-right corner of the console. If you’d like to stop a process, you can hit this stop sign. You will know that the code has finished running or been successfully stopped when you see a new > prompt at the bottom of the console.
Opening a new session: checking project and working directory
As a reminder, at the start of any new session, you should make sure that you have the right project open and you should check your working directory.
If you opened the “WordVectors” project file first, then you should already be working in the “WordVectors” project space. To confirm that you have the correct project open, check the top-right corner of the RStudio screen. If the project is not open, you can open it by going to File then Open Project... in the menu bar at the top of RStudio, or by clicking on the project file.
At the start of each new session, you should check your working directory with the code below. As long as you opened this file from the WordVectors project, your working directory should be in the right place: the “WordVectors” folder. If you do need to change your working directory, you can do so with the setwd() function.
getwd()
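If you do need to change your working directory, the line below shows what that might look like; the path here is only a placeholder, so substitute the actual location of your “WordVectors” folder.
# Example only: replace the placeholder path with the location of your own "WordVectors" folder
setwd("~/WordVectors")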
Loading packages
All the packages you will need for this exercise have been installed ahead of time on our RStudio Server instance, but you should load them using the library() function if this is a new session. You’ll have to load these packages every time you start a new session in RStudio.
When you run this code for the first time after you start a session, you’ll see a lot of text go through the console, possibly with some warning messages. Even if the text looks alarming, it probably won’t cause any issues. To confirm that the packages have loaded correctly, you can run this code a second time—if you see the code pop into the console with no additional text, that means you have loaded the packages properly and you are all set.
library(tidyverse)
library(tidytext)
library(magrittr)
library(devtools)
library(tsne)
library(wordVectors)
library(ggplot2)
library(lsa)
Training a model
Reading in text files
The code we will be using in this session is set up to require minimal editing, but that does mean that you need to have your input files in a very specific format. You should have a set of .txt files all saved in the same folder (without any files in subfolders). To add a new folder to RStudio Server, you should compress that folder on your computer (create a zip file) and then upload it to the data folder with the Upload button near the top of the Files menu (make sure you are already inside the data folder when you do this!).
This tutorial also comes with a small sample folder, called “WomensNovelsDemo”; it is not large enough to produce a useful model, but will run more quickly and so is useful for initial experimentation.
The following script allows you to “read in” multiple text files and combine them into a “tibble,” which is a type of data table. Think of it as being like a spreadsheet, with rows and columns organizing information.
First, we get a list of the files to read in (fileList), then we create a function (readTextFiles) to produce a tibble with two columns, filename and text, for each text file in the folder. Then, we run the function to combine everything into one tibble called combinedTexts.
There are some special requirements when you want to run code that is defining functions; unlike most of the time, where you can put your cursor anywhere in the line of code to run it, you need to have your cursor either at the beginning or the end of the code defining your function when you run it (or just select the whole thing). There are comments both before and after the code that defines the function, so you can see what its boundaries are.
The only thing you’ll need to change in the code below is the file path in the first line.
As long as you have the folder with your text files inside the data folder, you should only need to change the part after the slash (the part that reads “name_of_your_folder”). Remember that you can use tab to select the folder you want.
Make sure to change this one line before you run any of the code below.
# Change this line to match the name of the folder with your corpus
path2file <- "data/WomensNovelsDemo/"
# This will create a list of files in the folder
fileList <- list.files(path2file, full.names=TRUE)
# This is where you define a function to read in multiple text files and paste them into a tibble (remember that the code that defines functions must be run by putting your cursor at the beginning or end, or by selecting the whole section of code). You are only defining the function here; the next section of code is when you actually run the function.
readTextFiles <- function(file) {
message(file)
rawText = paste(scan(file, sep="\n", what="raw", strip.white=TRUE))
output = tibble(filename=gsub(path2file, "", file), text=rawText) %>%
group_by(filename) %>%
summarise(text = paste(rawText, collapse=" "))
return(output)
}
# This is where you run the function to create a tibble of combined files called "combinedTexts"
combinedTexts <- tibble(filename=fileList) %>%
group_by(filename) %>%
do(readTextFiles(.$filename))
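If you want to confirm that the read-in worked as expected, an optional check like the one below should show one row for each file in your folder, with filename and text columns.
# Optional check: the tibble should have one row per file in your corpus folder
nrow(combinedTexts)
combinedTexts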
Preparing text for word2vec
The section below defines several variables so that they can be used in training your model. Working with general names (such as “w2vInput”) for these variables lets you use them in the code that follows without having to change each instance; the first line is where you set up the specifics you need to distinguish one model from another.
You can pick any name you want in the first line of code below; make sure there are no spaces in the name you select and that it is descriptive enough that you will remember which corpus you were working from when you want to read in a trained model.
The only line in the block of code below that you need to change is the first one, but make sure to do this, or you will end up with a model file called “your_file_name.bin”!
The last line of this code section creates a single text file, with a name based on the one that you chose, combining all of the texts in your corpus.
# This section is where you define the variables you will be using to train your model; change the text in the first line to whatever you want to call your model file
baseFile <- "your_file_name"
w2vInput <- paste("data/",baseFile,".txt", sep = "")
w2vCleaned <- paste("data/",baseFile,"_cleaned.txt", sep="")
w2vBin <- paste("data/",baseFile,".bin", sep="")
#This line creates a single text file with all the texts in your corpus
combinedTexts$text %>% write_lines(w2vInput)
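If you’d like to confirm that the combined file was created, an optional check such as the one below reports whether the file exists and how large it is, in bytes.
# Optional check: confirm the combined corpus file was written and is not empty
file.exists(w2vInput)
file.size(w2vInput)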
Creating a vector space model
The code below is how you actually train your model. There are some parameters you might want to modify, or, if this is your first time training a model, you can also keep the defaults to start.
You can adjust the number of processors to use on your computer in training the model with the threads parameter; this will impact how quickly the model is trained.
The vectors parameter allows you to change the dimensionality of your model to include more or fewer dimensions. Higher numbers of dimensions can make your model more precise, but will also increase both training time and the possibility of random errors. A value between 100 and 500 will work for most projects.
The window parameter allows you to control the number of words on either side of the target word that the model treats as relevant context; the smaller the window, the closer the context words will be.
The iter parameter allows you to control how many times your corpus is read through during model training. If your corpus is on the smaller side, then increasing the number of iterations can improve the reliability of your results.
The negative_samples parameter allows you to control the number of “negative samples” used in training. During the training process, each iteration updates the information about the position of each word in the model (making it progressively more accurate). Because there are many thousands of words in the model, doing that update with every iteration is time-consuming and computationally costly. With negative sampling, instead of updating every word, the training process updates only the words directly observed within the window, plus a random sampling of the other words in the model. For smaller datasets, a value between 5 and 20 is recommended; for larger ones, you can use smaller values, between 2 and 5.
For more on these parameters, and other options that you have in training a model, see the code documentation.
This code will check if there is already a .bin file with the same name in the current directory—if there isn’t, it will train a new model. If there is, it will read in the existing one. If you ever want to overwrite a model you’ve already trained, make sure to delete or rename that model’s .bin file first.
# This controls how much of your computer's processing power the code is allowed to use.
THREADS <- 3
# prep_word2vec will prepare your corpus by cleaning and lowercasing your text with the `tokenizers` package. If you set the value of `bundle_ngrams` to be greater than 1, it will automatically join common bigrams into a single word.
prep_word2vec(origin=w2vInput, destination=w2vCleaned, lowercase=T, bundle_ngrams=1)
# The code below will train or read in a model
if (!file.exists(w2vBin)) {
w2vModel <- train_word2vec(
w2vCleaned,
output_file=w2vBin,
vectors=100,
threads=THREADS,
window=6, iter=10, negative_samples=15
)
} else {
w2vModel <- read.vectors(w2vBin)
}
Querying the model
Visualizing
We can get a glimpse of what the model looks like by plotting it in two dimensions. Keep in mind that the model actually has many more dimensions, so we are, in effect, flattening it. Though the visualization may be somewhat difficult to read, you should be able to see that similar words—words that are near each other in vector space—tend to clump together. The code below will likely take a minute or two to run, and your results will appear in the “Plots” window to the right (you can hit the “Zoom” button to get a better view).
As the code is running, you’ll see a set of lines in the console that will look something like “Epoch: Iteration #100 error is: 20.3048394873336”; note that this is not an error message! As the code runs, the values for “error” should decrease—this reflects increasing confidence about how to plot the vector representation.
w2vModel %>% plot(perplexity=10)
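If you would like to keep a copy of this visualization, one option is to draw it to a file instead of the Plots pane using base R’s graphics devices. The sketch below is just an example (the file name is a placeholder, and it assumes you have an “output” folder in your working directory); note that the projection will be recalculated, so this will take another minute or two to run.
# Optional: save the plot as a PNG file in the output folder instead of the Plots pane
png("output/model_plot.png", width=1200, height=1200)
w2vModel %>% plot(perplexity=10)
dev.off()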
Clustering
The following script provides a way to cluster words that are near each other in vector space, using the “k-means” clustering algorithm. Below, we choose 150 centers, or 150 points around which to cluster words. Then we select 10 random clusters and 15 words from each cluster to view. This code will also take a minute or two to run. You can change the number of centers, the number of clusters to view, or the number of words to see—you can also increase the number of iterations (the number of times the algorithm should adjust where the centers are and where terms are positioned in relation to those centers).
centers <- 150
clustering <- kmeans(w2vModel, centers=centers, iter.max=40)
sapply(sample(1:centers, 10), function(n) {
names(clustering$cluster[clustering$cluster==n][1:15])
})
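Rather than sampling clusters at random, you might also want to inspect the cluster that contains a particular word. The optional sketch below shows one way to do this, assuming that the word you pick (here, “girl”) actually appears in your model’s vocabulary.
# Optional: view 15 words from the cluster containing a particular word
# (this assumes the word appears in your model's vocabulary)
n <- clustering$cluster[["girl"]]
names(clustering$cluster[clustering$cluster==n][1:15])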
Closest to
To find the words closest to a particular word in vector space, fill in that term and then run the code below. If you want to see more or fewer words, you can change the number in the second argument. Make sure not to delete the quotation marks, and enter your word in lowercase.
w2vModel %>% closest_to("girl", 30) Closest to two terms
You might also want to see the words closest to a combination of two (or more) words. Notice that this will open a new window with the results because of the view() function. If you prefer to see your results in this format, you can paste “%>% view()” at the end of the code above; or, if you prefer to see your results in the console, you can delete “%>% view()” from the code below. Note that the code below also shows 20 results, instead of 30. If you want to continue adding terms, just follow the format in the example by putting a + between terms and putting each word in quotation marks.
# Closest to two terms
w2vModel %>% closest_to(~"girl"+"woman", 20) %>% view()
# Closest to more than two terms
w2vModel %>% closest_to(~"girl"+"woman"+"daughter"+"aunt"+"sister"+"lady", 20) %>% view()Closest to the difference between two terms
Or, you might want to look at the difference between two terms, to see which words are similar to one term but not another:
w2vModel %>% closest_to(~'rose'-'flower', 20)
Analogies
You can even construct analogies, such as in the example below. These use vector math to subtract the contexts associated with one word from those of another and then add a third term, which brings you to a new region of vector space where you will find terms associated with the distinction between the first two terms, plus the contexts of the third term.
In the classic example, you might start with the vector for “woman” and subtract the vector for “man”, thus producing a vector that represents the contexts for “woman” as distinct from those for “man”. You might then add a third term, such as “king”, to add its own contexts to the query. This lets you isolate a vector associated with something like femininity and then add a vector associated with royalty; you might expect to get a result like “queen”.
To frame this as an analogy: this lets you ask questions like “man” is to “king” as “woman” is to what?
w2vModel %>% closest_to(~"woman"-"man"+"king", 20)It is not always helpful to think about this approach strictly within the analogy framework; in many cases, it can be more productive to think about constructing a vector that represents the difference between two terms, and then adding the contexts of a third term. In the line of code below, for instance, we are constructing a vector that might be described as a wealth vector (by looking at the contexts for “rich” as distinct from “poor”) and adding to that vector the semantic space of clothing (by adding the contexts for “dress”). We might expect to get results associated with expensive clothing or the dress habits of the wealthy.
To experiment with this, try adding different third terms (perhaps “food” or “house”) or reverse the first two terms, to look at the contexts for poverty instead of wealth.
w2vModel %>% closest_to(~"rich"-"poor"+"dress", 20)Working with other models and exporting results
Working with other models and exporting results
Reading in existing model files
If you want to read in an existing model, you can do so with the code below (just replace “name_of_your_file” with the name of your file, and make sure you don’t delete the .bin extension or the quotation marks). If you follow the instructions above, all of your trained models will be saved as binary files (with a .bin extension) in your data folder. You only need to train each model once, and then you can use this code to read it in at the start of each new session.
You can also read in models trained by others if you upload their .bin file to your data folder using the Upload button near the top of the Files menu (make sure you are already inside the data folder when you do this!). After you have uploaded the files to RStudio Server, you can read them in with the code below.
After you’ve restarted RStudio (in addition to checking your working directory and loading your packages), you’ll also need to use the code below to read in your model again.
# Replace this with the path to the model you want to read in
w2vModel <- read.vectors("data/wwo-regularized.bin")Exporting queries
The code below will enable you to export the results from a particular query. To export query results, change the part after “w2vModel %>%” to match the query that you want to export. An example is filled in so that you can see what this looks like. You can also adjust the number of words in the results set, if you want to see more or fewer terms. If you’d like to export results from a different query, such as addition or subtraction, paste over the example query with the one that you want to export.
The first line of code defines the variable “w2vExport” as whatever query you set. The second line exports a CSV file (which you can open in any program on your computer that works with tabular data, including Excel and Numbers). You can call the file whatever you like by replacing the template text inside of the quotation marks. The CSV file will be exported to the “output” folder in your current working directory, and it will overwrite existing files with the same name, so make sure to rename the export file if you want to keep earlier versions. Make sure not to use any spaces in the file names you choose.
If you would like to then export any file from RStudio Server to your own computer, click the box next to the file to select it and then go to the “More” gear near the top of the Files menu. Then, choose “Export” and hit the “Download” button.
w2vExport <- w2vModel %>% closest_to("girl", 30)
#Change "name_of_your_query" to a descriptive name that you want to give to your export file. Don't put any spaces in the file name.
write.csv(file="output/name_of_your_query.csv", x=w2vExport)Exporting clusters
Exporting clusters
You can use a similar method to export your clusters; the code below will first generate a set of clusters and then export a specified (by you) number of terms from those clusters. As above, you can change the number of centers and iterations when you are generating the clusters; you can also change how many sets of clusters and words from each cluster to export.
The exporting mechanism is the same as with exporting queries above; you change the language in the quotation marks to match the name that you want to give your file. The export file can be fairly large, so this code might take a bit of time to run. And, again, you can follow the instructions above to export from RStudio Server to your own computer.
#Change "name_of_your_cluster" to a descriptive name that you want to give to your export file.
centers <- 150
centers <- 150
clustering <- kmeans(w2vModel, centers=centers, iter.max=40)
w2vExport <- sapply(sample(1:centers, 150), function(n) {
names(clustering$cluster[clustering$cluster==n][1:15])
})
write.csv(file="output/name_of_your_cluster.csv", x=w2vExport)Evaluating the Model
Below is a very simple test that will calculate the cosine similarities for a small set of word pairs that are likely to be related in many models. You can customize this list for your own corpus by editing the pairs below, or adding new ones (add as many as you like, but make sure to follow the same format as in the examples below). This code will produce a “model-test-results.csv” file with cosine similarity scores on these word pairs for every model in your folder. The results file will be in the “output” folder of your current working directory. This is meant to be an example of the kinds of testing that are used in model evaluation, and is not a substitute for more rigorous testing processes.
# Find every .bin file in the working directory (including subfolders)
files_list = list.files(pattern = "\\.bin$", recursive = TRUE)
rownames <- c()
data = list(c("away", "off"),
c("before", "after"),
c("cause", "effects"),
c("children", "parents"),
c("come", "go"),
c("day", "night"),
c("first", "second"),
c("good", "bad"),
c("last", "first"),
c("kind", "sort"),
c("leave", "quit"),
c("life", "death"),
c("girl", "boy"),
c("little", "small"))
data_list = list()
for(fn in files_list) {
  wwp_model = read.vectors(fn)
  sims <- c()
  for(pairs in data) {
    # Extract the raw numeric vectors for the two words in this pair
    vector1 <- c()
    for(x in wwp_model[[pairs[1]]]) {
      vector1 <- c(vector1, x)
    }
    vector2 <- c()
    for(x in wwp_model[[pairs[2]]]) {
      vector2 <- c(vector2, x)
    }
    sims <- c(sims, cosine(vector1, vector2))
  }
  # Store this model's similarity scores, keyed by file name
  data_list[[fn]] <- sims
}
for(pairs in data){
rownames <- c(rownames, paste(pairs[1], pairs[2], sep = "-"))
}
results <- structure(data_list,
class = "data.frame",
row.names = rownames
)
#If you want to give your results document a more specific name, you can edit "model-test-results" below.
write.csv(file="output/model-test-results.csv", x=results)
Credits and Thanks
This tutorial uses the wordVectors package developed by Ben Schmidt and Jian Li, itself based on the original word2vec code developed by Mikolov et al. The walkthrough was also informed by workshop materials authored by Schmidt, as well as by an exercise created by Thanasis Kinias and Ryan Cordell for the “Humanities Data Analysis” course, and a later version used in Elizabeth Maddock Dillon and Sarah Connell’s “Literature and Digital Diversity” class, both at Northeastern University.
This version of the walkthrough was developed as part of the Word Vectors for the Thoughtful Humanist series at Northeastern. Word Vectors for the Thoughtful Humanist has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project do not necessarily represent those of the National Endowment for the Humanities.