Word Vectors Installation, Training, Querying, and Validation

This walkthrough is a static version of an R notebook published by the Women Writers Project. In an environment such as RStudio, the code blocks below would be editable and interactive. For the full set of notebooks and more context on their usage, visit our GitHub repository.

Getting started

Using this File

This file is an introduction to training and querying a model using word2vec with R and RStudio on your own computer.

This is an R Markdown file, which means that it contains both text (what you’re reading now) that can be formatted for display on the web or as a pdf file, and snippets of code, which you can run right from the file.

Before running any code, you should do a quick check on your preferences for how RStudio will handle R Markdown files. Go to the Tools menu above, select “Global Options”, then select “R Markdown Preferences,” and make sure that “Show output inline for all R Markdown documents” is not selected.

To run a single line of code from an R Markdown file, put your cursor anywhere in that line of code and then hit command-enter or control-enter. If you want to run all of the code in a code snippet, you can hit the green triangle button on the right. If you want to run a particular section of code, highlight the section you want to run and hit command-enter or control-enter.

Much of our code will run almost instantly, but some things will take a few seconds or minutes, or even longer. You can tell code is still running if you see a red stop sign in the top-right corner of the console. If you’d like to stop a process, you can hit this stop sign. You will know that the code has finished running or has been successfully stopped when you see a new > prompt at the bottom of the console.

If you don’t see the stop sign but want to cancel a process, you can also hit control-C.

If you are running code in a block line by line, your cursor will automatically go down to the next line after each is run, so you can move through the block by repeatedly hitting command-enter or control-enter.

Below is a code block; try running the code in it.

print("Put your cursor on this line and try running it!")
print("Now run this line.")
print("Now run all three lines at once by selecting them and running them together or by hitting the green play triangle")

You can also run code directly in the console by typing or pasting it in and hitting enter. You will get the same results, but if you want to save code that you have written, it is better to keep it in the R Markdown file, since edits there will be saved. On the other hand, if you prefer to run some code but not make changes to your file, you can just run that in the console.

Projects

Projects are a way to organize your work in RStudio. If you opened the “WordVectors” project file first, then you should already be working in the “WordVectors” project space—and, as long as you have this project open, your files should be where you expect them to be. It will usually be easiest to start any session by opening the “WordVectors.Rproj” file, at least while you are getting used to working in RStudio.

To confirm that you have the correct project open, check the top-right corner of the RStudio screen. If the project is not open, you can open it by going to File then Open Project... in the menu bar at the top of RStudio, or by clicking on the project file. Always check at the beginning of a session to make sure you have the project open; if you don’t, it will likely cause errors. If you do hit an error, one of the first things you should check is whether your project is open.

This introduction provides some basic instructions to get you started, but it is not a substitute for learning the fundamentals of R and RStudio. For a very basic intro to some key principles, see the “Introduction to R and RStudio” in the WordVectors folder. For more detailed information, there are many helpful resources online, including tutorials by the Programming Historian and Lincoln Mullen’s Digital History Methods in R.

Downloading and installing R and RStudio

You can download R from the CRAN (Comprehensive R Archive Network) repository: https://cloud.r-project.org/. There are specific instructions for downloading to Linux, Mac OS X, and Windows machines.

To download RStudio see: https://rstudio.com/products/rstudio/download/.

Setting and checking your working directory

If this is a new RStudio session, you should check your working directory with the code below. As long as you opened this file from the WordVectors project, your working directory should be in the right place: the “WordVectors” folder.

You should check your working directory because if the working directory is not where you are expecting, then not much else in your files will work. Any time you see an error message that says a file does not exist in the current working directory, that’s a good sign your working directory isn’t where you think it is.

There are two lines of code in the block below; the first will allow you to check your working directory and the second will allow you to set your working directory with the setwd() function, if you ever need to change it. We’ve provided you with some template text that you can replace with a file path specific to your computer.

Navigating file paths can be a bit confusing, but, fortunately there is a trick you can use. If you try deleting the text inside of the quotation marks below and then hitting tab, you should get a pop-up with a view of the folder system you’re used to navigating; you can use this to fill in file paths whenever you need to.

# How to check your working directory (this is also an example of how you add a comment to your code—by typing "#")
getwd()

# How to set your working directory (do not run this unless you actually want to change your working directory—as long as you opened the project file first, you should not need to change your working directory!). Delete the hashtag to take the line below out of the comment, then fill your file path in. 

# setwd("path/to/your/directory")
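One quick way to confirm that you are in the right place is to check for the folders that the later code relies on. The sketch below assumes the layout described in this walkthrough, with “data” and “output” folders inside the working directory; the helper name checkProjectFolders is made up for illustration and is not part of any package.

```r
# Hypothetical helper: report which expected folders exist in a directory.
# The folder names "data" and "output" follow the layout used in this walkthrough.
checkProjectFolders <- function(projectDir = getwd(), folders = c("data", "output")) {
  found <- dir.exists(file.path(projectDir, folders))
  names(found) <- folders
  found
}

# Run the check against the current working directory
checkProjectFolders()
```

If either value comes back FALSE, that is a hint that the project is not open or that the working directory is somewhere unexpected.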

Installing packages

R Packages are collections of functions, variables, data, and documentation that you can use to get options that go beyond what’s available with the base R code. Like extensions for Google Chrome or plugins on WordPress, these are additions that you can install directly from the internet, in this case through R’s repository. Packages make it possible to do a wider range of tasks in R and are essential for working with word vectors.

To install an R package, you can navigate through the Tools menu (Tools –> Install Packages) or install packages directly by running code with the install.packages() function, as below. Whenever you are installing packages, make sure that you have an internet connection because the code pulls directly from online repositories.

Once you have installed a package on your computer you do not need to do so again.

Run each line of code below to install the packages you will need.

install.packages("tidyverse")
install.packages("tidytext")
install.packages("magrittr")
install.packages("devtools")
install.packages("tsne")
install.packages("lsa")
install.packages("ggplot2")
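Since packages only need to be installed once, it can be handy to check which ones are still missing before installing anything. The following is a minimal sketch of that check; findMissingPackages is a hypothetical helper name, and the install line is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Packages this walkthrough installs from CRAN
neededPackages <- c("tidyverse", "tidytext", "magrittr", "devtools",
                    "tsne", "lsa", "ggplot2")

# Hypothetical helper: return the packages that are not yet installed
findMissingPackages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

missingPackages <- findMissingPackages(neededPackages)
print(missingPackages)

# Uncomment the line below to install only what is missing
# if (length(missingPackages) > 0) install.packages(missingPackages)
```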

Loading packages

Once the packages have been installed, you should load them using the library() function. You only need to install packages once, but you’ll have to load them every time you start a new session in RStudio. You should make a habit of checking your working directory and loading your packages at the start of each session.

When you run this code for the first time after you start a session, you’ll see a lot of text go through the console, possibly with some warning messages. Even if the text looks alarming, it probably won’t cause any issues. To confirm that the packages have loaded correctly, you can run this code a second time—if you see the code pop into the console with no additional text, that means the packages loaded properly and you are all set.

library(tidyverse)
library(tidytext)
library(magrittr)
library(devtools)
library(tsne)
library(lsa)
library(ggplot2)

# Hold off on running the line below until after you get to the next section 
library(wordVectors)

Installing word2vec

Because the wordVectors package lives on GitHub, you will need to install it in a different way, using the install_github() function from the devtools package. That’s why we had you pause before loading wordVectors above. After you’ve installed the wordVectors package (by running the code snippet below), you can ignore or even delete the comment above and just load all the packages at once.

Make sure to load wordVectors after you install it; you can either scroll up and load it from the code block above, or you could try writing the code to load it yourself (either in the code block below or right in the console).

# This is the code to install the `wordVectors` package from GitHub
devtools::install_github('bmschmidt/wordVectors', force=TRUE)

Training a model

Reading in text files

The code we will be using in this session is set up to require minimal editing, but that does mean that you will need to have your input files in a very specific format. You need to have a set of .txt files all saved in the same folder (without any files in subfolders). The folder with your texts should be saved in the data folder.

This tutorial also comes with a small sample folder, called “WomensNovelsDemo”; it is not large enough to produce a useful model, but will run more quickly and so is useful for initial experimentation.

The following script allows you to “read in” multiple text files and combine them into a “tibble,” which is a type of data table. Think of it as being like a spreadsheet, with rows and columns organizing information.

First, we get a list of the files to read in (fileList), then we create a function (readTextFiles) to create a tibble with two columns, filename and text for each text file in the folder. Then, we run the function to combine everything into one tibble called combinedTexts.

There are some special requirements when you want to run code that is defining functions; unlike most of the time, where you can put your cursor anywhere in the line of code to run it, you need to have your cursor either at the beginning or the end of the code defining your function when you run it (or just select the whole thing and run it). There are comments both before and after the code that defines the function, so you can see what its boundaries are.

The only thing you’ll need to change in the code below is the file path in the first line.

As long as you have the folder with your text files inside the data folder, you should only need to change the part after the slash (the part that reads “WomensNovelsDemo” in the example below). Remember that you can use tab to navigate to the folder you want. Make sure to change that one line before you run any of the code below.

# Change this line to match the name of the folder with your corpus
path2file <- "data/WomensNovelsDemo/"

# This will create a list of files in the folder
fileList <- list.files(path2file, full.names=TRUE) 

# This is where you define a function to read in multiple text files and paste them into a tibble (remember that the code that defines functions must be run by putting your cursor at the beginning or end, or by selecting the whole section of code). You are only defining the function here; the next section of code is when you actually run the function.
readTextFiles <- function(file) { 
  message(file)
  rawText = paste(scan(file, sep="\n", what="raw", strip.white=TRUE))
  output = tibble(filename=gsub(path2file, "", file), text=rawText) %>% 
    group_by(filename) %>% 
    summarise(text = paste(rawText, collapse=" "))
  return(output)
}

# This is where you run the function to create a tibble of combined files called "combinedTexts"
combinedTexts <- tibble(filename=fileList) %>% 
  group_by(filename) %>% 
  do(readTextFiles(.$filename)) 
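To see what the resulting table looks like without touching your own corpus, here is a small self-contained sketch of the same idea. It is not the code above: it uses base R (readLines and a plain data frame rather than scan and a tibble), and the folder, file names, and sentences are invented for the demonstration.

```r
# Create a temporary folder with two tiny text files to stand in for a corpus
demoFolder <- file.path(tempdir(), "demoCorpus")
dir.create(demoFolder, showWarnings = FALSE)
writeLines(c("It was a dark night.", "The wind howled."),
           file.path(demoFolder, "novel1.txt"))
writeLines(c("She wrote a letter.", "", "He never replied."),
           file.path(demoFolder, "novel2.txt"))

# List the files, then read each one and collapse its lines into a single string
demoFiles <- list.files(demoFolder, pattern = "\\.txt$", full.names = TRUE)
demoTexts <- data.frame(
  filename = basename(demoFiles),
  text = vapply(demoFiles, function(f) {
    lines <- trimws(readLines(f, warn = FALSE))
    paste(lines[lines != ""], collapse = " ")  # drop blank lines, join the rest
  }, character(1), USE.NAMES = FALSE),
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(demoTexts)
```

Each row holds one file: its name in the first column and its full text, as a single string, in the second.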

Preparing text for word2vec

The section below defines several variables so that they can be called on in training your model. Working with general names (such as “w2vInput”) for these variables lets you use them in the code that follows without having to change each instance; the first line is where you set up the specifics you need to distinguish one model from another.

You can pick any name you want in the first line of code below; make sure there are no spaces in the name you select and that it is descriptive enough that you will remember which corpus you were working from when you want to read in a trained model.

The only line in the block of code below that you will need to change is the first one, but make sure to do this, or you will end up with a file called “your_file_name.bin”!

The last line of this code section creates a single text file, with a name based on the one that you chose, combining all of the texts in your corpus.

# This section is where you define the variables you will be using to train your model; don't forget to change the text in the first line to whatever you want to call your model file
baseFile <- "your_file_name"
w2vInput <- paste("data/",baseFile,".txt", sep = "")
w2vCleaned <- paste("data/",baseFile,"_cleaned.txt", sep="")
w2vBin <- paste("data/",baseFile,".bin", sep="")

# This line creates a single text file with all the texts in your corpus
combinedTexts$text %>% write_lines(w2vInput)
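If it is not obvious what the paste() calls produce, the sketch below builds the same three paths from a made-up model name and prints them. The demo variable names are chosen so that running this will not overwrite the variables defined above, and the check for spaces is an extra safeguard added here for illustration.

```r
# A made-up model name for illustration
demoBase <- "demo_model"

# Stop early if the name contains any whitespace
stopifnot(!grepl("\\s", demoBase))

# paste0() is shorthand for paste(..., sep = "")
demoInput   <- paste0("data/", demoBase, ".txt")
demoCleaned <- paste0("data/", demoBase, "_cleaned.txt")
demoBin     <- paste0("data/", demoBase, ".bin")

print(c(demoInput, demoCleaned, demoBin))
```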

Creating a vector space model

The code below is how you actually train your model. There are some parameters you might want to modify, or, if this is your first time training a model, you can also keep the defaults to start.

You can adjust the number of processors to use on your computer in training the model with the threads parameter; this will impact how quickly the model is trained.

The vectors parameter allows you to change the dimensionality of your model to include more or fewer dimensions. Higher numbers of dimensions can make your model more precise, but will also increase both training time and the possibility of random errors. A value between 100 and 500 will work for most projects.

The window parameter allows you to control the number of words on either side of the target word that the model treats as relevant context; the smaller the window, the closer the context words will be.
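To make the idea of a window concrete, the sketch below lists the context words around one target word in a toy sentence. It is a simplified illustration only: contextWords is a hypothetical helper, and the real training code handles details that are ignored here.

```r
# Hypothetical helper: list the words within `window` positions of a target word
contextWords <- function(words, targetIndex, window) {
  start <- max(1, targetIndex - window)
  end <- min(length(words), targetIndex + window)
  words[setdiff(start:end, targetIndex)]
}

sentence <- c("the", "young", "girl", "walked", "slowly", "through", "the", "garden")

# Context for "walked" (position 4) with a narrow window and a wider one
contextWords(sentence, targetIndex = 4, window = 2)
contextWords(sentence, targetIndex = 4, window = 6)
```

With a window of 2, only the immediate neighbors count as context; with a window of 6, every other word in this short sentence does.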

The iter parameter allows you to control how many times your corpus is read through during model training. If your corpus is on the smaller side, then increasing the number of iterations can improve the reliability of your results.

The negative_samples parameter allows you to control the number of “negative samples” used in training. During the training process, each iteration updates the information about the position of each word in the model (making it progressively more and more accurate). Because there are many thousands of words in the model, doing that update with every iteration is time-consuming and computationally costly. With negative sampling, instead of updating every word, the training process updates only the words directly observed within the window, plus a random sampling of the other words in the model. For smaller datasets, a value between 5 and 20 is recommended; for larger ones, you can use smaller values, between 2 and 5.
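The selection step described above can be sketched with a toy vocabulary. This is only an illustration of choosing which words to update: the vocabulary is invented, and the negatives are drawn uniformly at random for simplicity, whereas the actual word2vec code weights the draw by word frequency.

```r
set.seed(42)  # make the random draw repeatable

vocabulary <- c("girl", "woman", "garden", "letter", "king", "queen",
                "house", "river", "dress", "night", "money", "horse")

# Words actually observed in the window around the target word "girl"
observedContext <- c("woman", "garden")

# Candidate negatives: every vocabulary word that is neither the target nor observed
candidates <- setdiff(vocabulary, c("girl", observedContext))

# Draw 5 negative samples uniformly at random (a simplifying assumption)
negativeSamples <- sample(candidates, size = 5)

# Only these words would be updated in this training step
wordsToUpdate <- c(observedContext, negativeSamples)
print(wordsToUpdate)
```

Here only 7 of the 12 words are touched in the step; in a real vocabulary of many thousands of words, the savings are far larger.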

For more on these parameters, and other options that you have in training a model, see the code documentation.

This code will check whether a .bin file with the same name already exists in your data folder (at the path stored in w2vBin). If there isn’t one, it will train a new model; if there is, it will read in the existing one. If you ever want to overwrite a model you’ve already trained, make sure to delete or rename that model’s .bin file first.

# This controls how much of your computer's processing power the code is allowed to use. 
THREADS <- 3

# prep_word2vec will prepare your corpus by creating a single text file and cleaning and lowercasing your text with the `tokenizers` package. If you set the value of `bundle_ngrams` to be greater than 1, it will automatically join common bigrams into a single word. 
prep_word2vec(origin=w2vInput, destination=w2vCleaned, lowercase=T, bundle_ngrams=1)

# The code below will train or read in a model
if (!file.exists(w2vBin)) {
  w2vModel <- train_word2vec(
    w2vCleaned,
    output_file=w2vBin,
    vectors=100,
    threads=THREADS,
    window=6, iter=10, negative_samples=15
  )
} else {
  w2vModel <- read.vectors(w2vBin)
}
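The “train only if no saved file exists” pattern used above can be tried out safely with a stand-in. In the sketch below, trainDemoModel and loadOrTrain are hypothetical helpers, the “model” is just a small random matrix, and the file is saved with saveRDS() in a temporary folder rather than as a .bin file.

```r
# Hypothetical stand-in for a slow training step
trainDemoModel <- function() {
  message("Training a new model...")
  matrix(rnorm(20), nrow = 4,
         dimnames = list(c("girl", "woman", "king", "queen"), NULL))
}

# Train only if no saved file exists at this path; otherwise read the saved copy
loadOrTrain <- function(modelPath) {
  if (!file.exists(modelPath)) {
    model <- trainDemoModel()
    saveRDS(model, modelPath)
  } else {
    message("Reading existing model from ", modelPath)
    model <- readRDS(modelPath)
  }
  model
}

demoPath <- file.path(tempdir(), "demo_model.rds")
unlink(demoPath)                    # start from a clean slate for the demonstration

firstRun <- loadOrTrain(demoPath)   # no file yet, so this trains and saves
secondRun <- loadOrTrain(demoPath)  # file exists, so this reads it instead of retraining
```

This is why changing the training parameters has no effect until the old file is deleted or renamed: as long as the file exists, the training branch is skipped.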

Querying the model

Visualizing

We can get a glimpse of what the model looks like by plotting it in two dimensions. Keep in mind that the model actually has many more dimensions, so we are, in effect, flattening it. Though the visualization may be somewhat difficult to read, you should be able to see that similar words—words that are near each other in vector space—tend to clump together. The code below will likely take a minute or two to run, and your results will appear in the “Plots” window to the right (you can hit the “Zoom” button to get a better view).

As the code is running, you’ll see a set of lines in the console that will look something like “Epoch: Iteration #100 error is: 20.3048394873336”; note that this is not an error message! As the code runs, the values for “error” should decrease, which reflects increasing confidence about how to plot the vector representation.

w2vModel %>% plot(perplexity=10)

Clustering

The following script provides a way to cluster words that are near each other in vector space, using the “k-means” clustering algorithm. Below, we choose 150 centers, or 150 points around which to cluster words. Then we select 10 random clusters and 15 words from each cluster to view. This code will also take a minute or two to run. You can change the number of centers, the number of clusters to view, or the number of words to see—you can also increase the number of iterations (the number of times the algorithm should adjust where the centers are and where terms are positioned in relation to those centers).

centers <- 150
clustering <- kmeans(w2vModel, centers=centers, iter.max=40)

sapply(sample(1:centers, 10), function(n) {
  names(clustering$cluster[clustering$cluster==n][1:15])
})
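To see how the clustering results are structured, it can help to run kmeans() on something small enough to read at a glance. The sketch below uses six invented two-dimensional “word vectors” that fall into two obvious groups; the words and values are made up for illustration.

```r
set.seed(1)  # kmeans starts from random centers, so fix the seed for repeatability

# Toy "model": 6 words in 2 dimensions, forming two obvious groups
toyVectors <- rbind(
  girl  = c(1.0, 0.9),
  woman = c(0.9, 1.0),
  lady  = c(1.1, 1.0),
  river = c(-1.0, -0.9),
  lake  = c(-0.9, -1.1),
  sea   = c(-1.1, -1.0)
)

toyClustering <- kmeans(toyVectors, centers = 2, iter.max = 40)

# The cluster element is a named vector: names are words, values are cluster numbers
print(toyClustering$cluster)

# Words in the same cluster as "girl"
girlCluster <- toyClustering$cluster[["girl"]]
names(toyClustering$cluster[toyClustering$cluster == girlCluster])
```

One detail worth knowing: selecting elements 1:15 from a cluster with fewer than 15 words pads the result with NA values, so a few NAs in the clustering output do not indicate a problem.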

Closest to

To find the words closest to a particular word in vector space, fill in that term and then run the code below. If you want to see more words, just increase the number in the argument. Make sure not to delete the quotation marks, and enter your word in lowercase.

w2vModel %>% closest_to("girl", 30) 
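“Closeness” here is measured with cosine similarity, the same measure used in the evaluation section of this walkthrough. The sketch below computes it by hand on a few invented three-dimensional vectors, as a way of showing the kind of ranking involved; it is not the wordVectors implementation, and cosineSim is a helper defined here for illustration.

```r
# Cosine similarity between two numeric vectors
cosineSim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy vectors, invented for illustration (real models have many more dimensions)
toyModel <- rbind(
  girl  = c(0.9, 0.8, 0.1),
  woman = c(0.8, 0.9, 0.2),
  boy   = c(0.9, 0.2, 0.1),
  river = c(0.1, 0.1, 0.9)
)

# Score every word against "girl", then sort from most to least similar
simsToGirl <- apply(toyModel, 1, cosineSim, b = toyModel["girl", ])
sort(simsToGirl, decreasing = TRUE)
```

A word is always most similar to itself (a score of 1), which is why the query term tends to appear at the top of such a list.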

Closest to two terms

You might also want to see the words closest to a combination of two (or more) words. Notice that this will open a new window with the results because of the view() function. If you prefer to see your results in this format, you can paste “%>% view()” at the end of the code above; or, if you prefer to see your results in the console, you can delete “%>% view()” from the code below. Note that the code below also shows 20 results, instead of 30. If you want to continue adding terms, just follow the format as in the example by putting a + between each pair and putting each word in quotation marks.

# Closest to two terms
w2vModel %>% closest_to(~"girl"+"woman", 20) %>% view()

# Closest to more than two terms
w2vModel %>% closest_to(~"girl"+"woman"+"daughter"+"aunt"+"sister"+"lady", 20) %>% view()

Closest to the difference between two terms

Or, you might want to look at the difference between two terms, to see which words are similar to one term but not another:

w2vModel %>% closest_to(~'woman'-'man',20) 

Analogies

You can even construct analogies, as in the example below. These use vector math to subtract the contexts associated with one word from another and then add a third term, which brings you to a new region of vector space where you will find terms associated with the distinction between the first two terms plus the contexts of the third term.

In the classic example, you might start with the vector for “woman” and subtract the vector for “man”, thus producing a vector that represents the contexts for “woman” as distinct from those for “man”. You might then add a third term, such as “king”, to add its own contexts to the query. This would let you look at a vector associated with something like femininity and then add a vector associated with royalty; you might expect to get a result like “queen”.

Or, to frame this as an analogy: this lets you ask questions like “man” is to “king” as “woman” is to what?

w2vModel %>% closest_to(~"woman"-"man"+"king", 20)
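The arithmetic behind this query can be worked through by hand on a toy example. In the sketch below, every vector is hand-built and the four dimensions are given invented labels, so the result is tidy in a way that a real model’s results will not be; cosineSim is a helper defined here for illustration.

```r
cosineSim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hand-built vectors; the columns are invented "meaning" dimensions
analogyModel <- rbind(
  man   = c(0, 0, 1, 0),
  woman = c(1, 0, 1, 0),
  king  = c(0, 1, 1, 0),
  queen = c(1, 1, 1, 0),
  apple = c(0, 0, 0, 1)
)
colnames(analogyModel) <- c("feminine", "royal", "person", "fruit")

# woman - man + king, computed dimension by dimension
queryVector <- analogyModel["woman", ] - analogyModel["man", ] + analogyModel["king", ]
print(queryVector)

# Rank every word by its cosine similarity to the combined vector
analogySims <- apply(analogyModel, 1, cosineSim, b = queryVector)
sort(analogySims, decreasing = TRUE)
```

In this toy setup the combined vector lands exactly on “queen”. In a trained model the dimensions have no such labels and the match is only approximate, so the expected word may appear somewhere in the top results rather than first.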

It is not always helpful to think strictly within the analogy framework; in many cases, it can be more productive to think about constructing a vector that represents the difference between two terms, and then adding the contexts of a third term. In the line of code below, for instance, we are constructing a vector that might be described as a “wealth” vector (by looking at the contexts for “rich” as distinct from “poor”) and adding to that vector the semantic space of clothing (by adding the contexts for “dress”). We might expect to get results associated with expensive clothing or the dress habits of the wealthy.

To experiment with this, try adding different third terms (perhaps “food” or “house”) or reverse the first two terms, to look at the contexts for poverty instead of wealth.

w2vModel %>% closest_to(~"rich"-"poor"+"dress", 20)

Working with other models and exporting results

Reading in existing model files

If you want to read in an existing model, you can do so with the code below (just replace the example file name, “wwo-regularized.bin”, with the name of your file, and make sure you don’t delete the .bin extension or the quotation marks). If you follow the instructions above, all of your trained models will be saved as binary files (with a .bin extension) in your data folder. You only need to train each model once, and then you can use this code to read it in at the start of each new session.

You can also read in models trained by others if you save them to your data folder and then read them in with the code below.

Each time you restart RStudio, in addition to checking your working directory and loading your packages, you’ll need to use the code below to read in your model again.

# Replace this with the path to the model you want to read in
w2vModel <- read.vectors("data/wwo-regularized.bin")

Exporting queries

The code below will enable you to export the results from a particular query. To export query results, change the part after “w2vModel %>%” to match the query that you want to export. An example is filled in so that you can see what this looks like. You can also adjust the number of words in the results set, if you want to see more or fewer. If you’d like to export results from a different query, such as addition or subtraction, paste over the example query with the one that you want to export.

The first line of code defines the variable “w2vExport” as whatever query you set. The second line exports a CSV file (which you can open in any program on your computer that works with tabular data, including Excel and Numbers). You can call the file whatever you like by replacing the template text inside of the quotation marks. The CSV file will be exported to the “output” folder in your current working directory, and it will overwrite existing files with the same name, so make sure to rename the export file if you want to keep earlier versions. Make sure not to use any spaces in the file names you choose.

w2vExport <- w2vModel %>% closest_to("girl", 30) 

#Change "name_of_your_query" to a descriptive name that you want to give to your export file. Don't put any spaces in the file name.
write.csv(file="output/name_of_your_query.csv", x=w2vExport)
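Note that write.csv() does not create folders, so the export will fail if the “output” folder is missing. The sketch below shows one way to guard against that; it writes to a temporary location so that it can be run safely, and the data frame is an invented stand-in for real query results.

```r
# Use a temporary location for the demonstration; in the walkthrough this would be "output"
demoOutputDir <- file.path(tempdir(), "output")

# Create the folder only if it does not already exist
if (!dir.exists(demoOutputDir)) {
  dir.create(demoOutputDir, recursive = TRUE)
}

# A small stand-in for query results (invented column names and values)
demoExport <- data.frame(word = c("girl", "woman", "boy"),
                         similarity = c(1.00, 0.85, 0.80))

demoFile <- file.path(demoOutputDir, "demo_query.csv")
write.csv(demoExport, file = demoFile, row.names = FALSE)

# Read the file back in to confirm the export worked
read.csv(demoFile)
```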

Exporting clusters

You can use a similar method to export your clusters; the code below will first generate a set of clusters and then export a specified (by you) number of terms from those clusters. As above, you can change the number of centers and iterations when you are generating the clusters; you can also change how many sets of clusters and words from each cluster to export. The exporting mechanism is the same as with exporting queries above; you change the language in the quotation marks to match the name that you want to give your file. The export file can be fairly large, so this code might take a bit of time to run.

#Change "name_of_your_cluster" to a descriptive name that you want to give to your export file.
centers <- 150

clustering <- kmeans(w2vModel,centers=centers,iter.max = 40)

w2vExport <-sapply(sample(1:centers,150),function(n) {
  names(clustering$cluster[clustering$cluster==n][1:15])
})

write.csv(file="output/name_of_your_cluster.csv", x=w2vExport)

Evaluating the Model

Below is a very simple test that will calculate the cosine similarities for a small set of word pairs that are likely to be related in many models. You can customize this list for your own corpus by editing the pairs below, or adding new ones (add as many as you like, but make sure to follow the same format as in the examples below). This code will produce a “model-test-results.csv” file with cosine similarity scores on these word pairs for every model in your folder. The results file will be in the “output” folder of your current working directory. This is meant to be an example of the kinds of testing that are used in model evaluation, and is not a substitute for more rigorous testing processes.

files_list = list.files(pattern = "\\.bin$", recursive = TRUE)

rownames <- c()

data_frame <- data.frame()
data = list(c("away", "off"),
            c("before", "after"),
            c("cause", "effects"),
            c("children", "parents"),
            c("come", "go"),
            c("day", "night"),
            c("first", "second"),
            c("good", "bad"),
            c("last", "first"),
            c("kind", "sort"),
            c("leave", "quit"),
            c("life", "death"),
            c("girl", "boy"),
            c("little", "small"))

data_list = list()

for(fn in files_list) {
  
  wwp_model = read.vectors(fn)
  sims <- c()
  for(pairs in data)
  {
    vector1 <- c()
    for(x in wwp_model[[pairs[1]]]) {
      vector1 <- c(vector1, x)
    }
    
    vector2 <- c()
    for(x in wwp_model[[pairs[2]]]) {
      vector2 <- c(vector2, x)
    }
    
    sims <- c(sims, cosine(vector1,vector2))
    #f_name <- strsplit(fn, "/")[[1]][[2]]
    data_list[[fn]] <- sims
  }
  
}

for(pairs in data){
  rownames <- c(rownames, paste(pairs[1], pairs[2], sep = "-"))
}

results <- structure(data_list,
                     class     = "data.frame",
                     row.names = rownames
)


# If you want to give your results document a more specific name, you can edit "model-test-results" below.
write.csv(file="output/model-test-results.csv", x=results)
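When you customize the list of pairs, keep in mind that some words may not appear in every model. The sketch below shows one possible way to score pairs while marking such cases as NA. It runs on a small invented matrix rather than a trained model, and it is a separate illustration, not a drop-in replacement for the code above.

```r
cosineSim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy model with invented values; note that "quit" is deliberately absent
evalModel <- rbind(
  day   = c(0.9, 0.1, 0.2),
  night = c(0.8, 0.2, 0.3),
  leave = c(0.1, 0.9, 0.1),
  good  = c(0.2, 0.3, 0.9),
  bad   = c(0.3, 0.2, 0.8)
)

testPairs <- list(c("day", "night"), c("good", "bad"), c("leave", "quit"))

# Score each pair, returning NA when either word is missing from the model
pairScores <- vapply(testPairs, function(pair) {
  if (!all(pair %in% rownames(evalModel))) return(NA_real_)
  cosineSim(evalModel[pair[1], ], evalModel[pair[2], ])
}, numeric(1))

# Label each score with its word pair, joined by a hyphen
names(pairScores) <- vapply(testPairs, paste, character(1), collapse = "-")
print(round(pairScores, 3))
```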

Credits and Thanks

This tutorial uses the wordVectors package developed by Ben Schmidt and Jian Li, itself based on the original word2vec code developed by Mikolov et al. The walkthrough was also informed by workshop materials authored by Schmidt, as well as by an exercise created by Thanasis Kinias and Ryan Cordell for the “Humanities Data Analysis” course, and a later version used in Elizabeth Maddock Dillon and Sarah Connell’s “Literature and Digital Diversity” class, both at Northeastern University.

This version of the walkthrough was developed as part of the Word Vectors for the Thoughtful Humanist series at Northeastern. Word Vectors for the Thoughtful Humanist has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project do not necessarily represent those of the National Endowment for the Humanities.