Word Vectors Training, Querying, and Validation

This walkthrough is a static version of an R notebook published by the Women Writers Project. In an environment such as RStudio, the code blocks below would be editable and interactive. For the full set of notebooks and more context on their usage, visit our GitHub repository.

Getting started

Using this File

This file is an introduction to training and querying a model with word2vec; it is designed to be used with our class’s RStudio Server instance.

Reminder on running code

To run a single line of code from an R Markdown file, put your cursor anywhere in that line of code and then hit command-enter or control-enter. If you want to run all of the code in a code snippet, you can hit the green triangle button on the right. If you want to run a particular section of code, highlight the section you want to run and hit command-enter or control-enter.

Much of our code will run almost instantly, but some things will take a few seconds or minutes, or even longer. You can tell code is still running if you see a red stop sign in the top-right corner of the console. If you’d like to stop a process, you can hit this stop sign. You will know that the code has been run or successfully stopped when you see a new > prompt in the bottom of the console.

Opening a new session: checking project and working directory

As a reminder, at the start of any new session, you should make sure that you have the right project open and you should check your working directory.

If you opened the “WordVectors” project file first, then you should already be working in the “WordVectors” project space. To confirm that you have the correct project open, check the top-right corner of the RStudio screen. If the project is not open, you can open it by going to File then Open Project... in the menu bar at the top of RStudio, or by clicking on the project file.

At the start of each new session, you should check your working directory with the code below. As long as you opened this file from the WordVectors project, your working directory should be in the right place: the “WordVectors” folder. If you do need to change your working directory, you can do so with the setwd() function.

getwd()
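
If you do need to change your working directory, a call to setwd() like the one below will do it. Note that the path shown here is only an example; substitute the actual location of your “WordVectors” folder.

# The path below is just an example; replace it with the location of your "WordVectors" folder
setwd("~/WordVectors")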

Loading packages

All the packages you will need for this exercise have been installed ahead of time on our RStudio Server instance, but you should load them using the library() function if this is a new session. You’ll have to load these packages every time you start a new session in RStudio.

When you run this code for the first time after you start a session, you’ll see a lot of text go through the console, possibly with some warning messages. Even if the text looks alarming, it probably won’t cause any issues. To confirm that the packages have loaded correctly, you can run this code a second time—if you see the code pop into the console with no additional text, that means you have loaded the packages properly and you are all set.

library(tidyverse)
library(tidytext)
library(magrittr)
library(devtools)
library(tsne)
library(wordVectors)
library(ggplot2)
library(lsa)

Training a model

Reading in text files

The code we will be using in this session is set up to require minimal editing, but that does mean that you need to have your input files in a very specific format. You should have a set of .txt files all saved in the same folder (without any files in subfolders). To add a new folder to RStudio Server, you should compress that folder on your computer (create a zip file) and then upload it to the data folder with the Upload button near the top of the Files menu (make sure you are already inside the data folder when you do this!).

This tutorial also comes with a small sample folder, called “WomensNovelsDemo”; it is not large enough to produce a useful model, but will run more quickly and so is useful for initial experimentation.

The following script allows you to read in multiple text files and combine them into a “tibble,” which is a type of data table. Think of it as being like a spreadsheet, with rows and columns organizing information.

First, we get a list of the files to read in (fileList); then we create a function (readTextFiles) to produce a tibble with two columns, filename and text, for each text file in the folder. Finally, we run the function to combine everything into one tibble called combinedTexts.

There are some special requirements when you want to run code that is defining functions; unlike most of the time, where you can put your cursor anywhere in the line of code to run it, you need to have your cursor either at the beginning or the end of the code defining your function when you run it (or just select the whole thing). There are comments both before and after the code that defines the function, so you can see what its boundaries are.

The only thing you’ll need to change in the code below is the file path in the first line.

As long as you have the folder with your text files inside the data folder, you should only need to change the part after the slash (the part that reads “name_of_your_folder”). Remember that you can use tab to select the folder you want.

Make sure to change this one line before you run any of the code below.

# Change this line to match the name of the folder with your corpus
path2file <- "data/WomensNovelsDemo/"

# This will create a list of files in the folder
fileList <- list.files(path2file, full.names=TRUE) 

# This is where you define a function to read in multiple text files and paste them into a tibble (remember that the code that defines functions must be run by putting your cursor at the beginning or end, or by selecting the whole section of code). You are only defining the function here; the next section of code is when you actually run the function.
readTextFiles <- function(file) { 
  message(file)
  rawText = paste(scan(file, sep="\n", what="raw", strip.white=TRUE))
  output = tibble(filename=gsub(path2file, "", file), text=rawText) %>% 
    group_by(filename) %>% 
    summarise(text = paste(rawText, collapse=" "))
  return(output)
}

# This is where you run the function to create a tibble of combined files called "combinedTexts"
combinedTexts <- tibble(filename=fileList) %>% 
  group_by(filename) %>% 
  do(readTextFiles(.$filename)) 

Preparing text for word2vec

The section below defines several variables so that they can be used in training your model. Working with general names (such as “w2vInput”) for these variables lets you use them in the code that follows without having to change each instance; the first line is where you set up the specifics you need to distinguish one model from another.

You can pick any name you want in the first line of code below; make sure there are no spaces in the name you select and that it is descriptive enough that you will remember which corpus you were working from when you want to read in a trained model.

The only line in the block of code below that you need to change is the first one, but make sure to do this, or you will end up with a model file called “your_file_name.bin”!

The last line of this code section creates a single text file, with a name based on the one that you chose, combining all of the texts in your corpus.

# This section is where you define the variables you will be using to train your model; change the text in the first line to whatever you want to call your model file
baseFile <- "your_file_name"
w2vInput <- paste("data/",baseFile,".txt", sep = "")
w2vCleaned <- paste("data/",baseFile,"_cleaned.txt", sep="")
w2vBin <- paste("data/",baseFile,".bin", sep="")

# This line creates a single text file with all the texts in your corpus
combinedTexts$text %>% write_lines(w2vInput)

Creating a vector space model

The code below is how you actually train your model. There are some parameters you might want to modify; if this is your first time training a model, you can also keep the defaults to start.

The threads parameter allows you to adjust the number of processors on your computer that are used in training the model; this will impact how quickly the model is trained.

The vectors parameter allows you to change the dimensionality of your model to include more or fewer dimensions. Higher numbers of dimensions can make your model more precise, but will also increase both training time and the possibility of random errors. A value between 100 and 500 will work for most projects.

The window parameter allows you to control the number of words on either side of the target word that the model treats as relevant context; the smaller the window, the closer the context words will be.

The iter parameter allows you to control how many times your corpus is read through during model training. If your corpus is on the smaller side, then increasing the number of iterations can improve the reliability of your results.

The negative_samples parameter allows you to control the number of “negative samples” used in training. During the training process, each iteration updates the information about the position of each word in the model (making it progressively more and more accurate). Because there are many thousands of words in the model, doing that update with every iteration is time-consuming and computationally costly. With negative sampling, instead of updating every word, the training process updates only the words directly observed within the window, plus a random sampling of the other words in the model. For smaller datasets, a value between 5 and 20 is recommended; for larger ones, you can use smaller values, between 2 and 5.

For more on these parameters, and other options that you have in training a model, see the code documentation.
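
For example, you can open the built-in help page for the training function, which describes all of its parameters, by running the line below in the console:

# Open the documentation for the train_word2vec function
?train_word2vec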

This code will check if there is already a .bin file with the same name in the current directory—if there isn’t, it will train a new model. If there is, it will read in the existing one. If you ever want to overwrite a model you’ve already trained, make sure to delete or rename that model’s .bin file first.
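
If you would rather rename an existing model file from the console than from the Files pane, a base R call like the one below will do it; the file name here is just an example.

# The file name below is just an example; substitute the name of your existing .bin file
file.rename("data/your_file_name.bin", "data/your_file_name_old.bin")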

# This controls how much of your computer's processing power the code is allowed to use. 
THREADS <- 3

# prep_word2vec will prepare your corpus by cleaning and lowercasing your text with the `tokenizers` package. If you set the value of `bundle_ngrams` to be greater than 1, it will automatically join common bigrams into a single word. 
prep_word2vec(origin=w2vInput, destination=w2vCleaned, lowercase=T, bundle_ngrams=1)

# The code below will train or read in a model
if (!file.exists(w2vBin)) {
  w2vModel <- train_word2vec(
    w2vCleaned,
    output_file=w2vBin,
    vectors=100,
    threads=THREADS,
    window=6, iter=10, negative_samples=15
  )
} else {
  w2vModel <- read.vectors(w2vBin)
}

Querying the model

Visualizing

We can get a glimpse of what the model looks like by plotting it in two dimensions. Keep in mind that the model actually has many more dimensions, so we are, in effect, flattening it. Though the visualization may be somewhat difficult to read, you should be able to see that similar words—words that are near each other in vector space—tend to clump together. The code below will likely take a minute or two to run, and your results will appear in the “Plots” window to the right (you can hit the “Zoom” button to get a better view).

As the code is running, you’ll see a set of lines in the console that will look something like “Epoch: Iteration #100 error is: 20.3048394873336”; note that this is not an error message! As the code runs, the values for “error” should decrease; this reflects increasing confidence about how to plot the vector representation.

w2vModel %>% plot(perplexity=10)

Clustering

The following script provides a way to cluster words that are near each other in vector space, using the “k-means” clustering algorithm. Below, we choose 150 centers, or 150 points around which to cluster words. Then we select 10 random clusters and 15 words from each cluster to view. This code will also take a minute or two to run. You can change the number of centers, the number of clusters to view, or the number of words to see—you can also increase the number of iterations (the number of times the algorithm should adjust where the centers are and where terms are positioned in relation to those centers).

centers <- 150
clustering <- kmeans(w2vModel, centers=centers, iter.max=40)

sapply(sample(1:centers, 10), function(n) {
  names(clustering$cluster[clustering$cluster==n][1:15])
})

Closest to

To find the words closest to a particular word in vector space, fill in that term and then run the code below. If you want to see more or fewer words, you can change the number in the second argument. Make sure not to delete the quotation marks, and enter your word in lowercase.

w2vModel %>% closest_to("girl", 30) 

Closest to two terms

You might also want to see the words closest to a combination of two (or more) words. Notice that the code below will open a new window with the results because of the view() function. If you prefer to see your results in this format, you can paste “%>% view()” at the end of the code above; or, if you prefer to see your results in the console, you can delete “%>% view()” from the code below. Note that the code below also shows 20 results, instead of 30. If you want to continue adding terms, just follow the format in the example by putting a + between each pair of terms and putting each word in quotation marks.

# Closest to two terms
w2vModel %>% closest_to(~"girl"+"woman", 20) %>% view()

# Closest to more than two terms
w2vModel %>% closest_to(~"girl"+"woman"+"daughter"+"aunt"+"sister"+"lady", 20) %>% view()

Closest to the difference between two terms

Or, you might want to look at the difference between two terms, to see which words are similar to one term but not another:

w2vModel %>% closest_to(~"rose"-"flower", 20) 

Analogies

You can even construct analogies, such as in the example below. These use vector math to subtract the contexts associated with one word from another and then add a third term; this brings you to a new location in vector space, where you will find terms associated with the distinction between the first two terms plus the contexts of the third term.

In the classic example, you might start with the vector for “woman” and subtract the vector for “man”, thus producing a vector that represents the contexts for “woman” as distinct from those for “man”. You might then add a third term, such as “king”, to add its own contexts to the query. This would let you take a vector associated with something like femininity and add a vector associated with royalty; you might expect to get a result like “queen”.

To frame this as an analogy: this lets you ask questions like “man” is to “king” as “woman” is to what?

w2vModel %>% closest_to(~"woman"-"man"+"king", 20)

It is not always helpful to think about this approach strictly within the analogy framework; in many cases, it can be more productive to think about constructing a vector that represents the difference between two terms, and then adding the contexts of a third term. In the line of code below, for instance, we are constructing a vector that might be described as a wealth vector (by looking at the contexts for “rich” as distinct from “poor”) and adding to that vector the semantic space of clothing (by adding the contexts for “dress”). We might expect to get results associated with expensive clothing or the dress habits of the wealthy.

To experiment with this, try adding different third terms (perhaps “food” or “house”) or reverse the first two terms, to look at the contexts for poverty instead of wealth.

w2vModel %>% closest_to(~"rich"-"poor"+"dress", 20)
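
For instance, reversing the first two terms builds a vector oriented toward poverty rather than wealth, while keeping the clothing context:

w2vModel %>% closest_to(~"poor"-"rich"+"dress", 20)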

Working with other models and exporting results

Reading in existing model files

If you want to read in an existing model, you can do so with the code below (just replace the example path with the path to the model file you want to use, and make sure you don’t delete the .bin extension or the quotation marks). If you follow the instructions above, all of your trained models will be saved as binary files (with a .bin extension) in your data folder. You only need to train each model once, and then you can use this code to read it in at the start of each new session.

You can also read in models trained by others if you upload their .bin file to your data folder using the Upload button near the top of the Files menu (make sure you are already inside the data folder when you do this!). After you have uploaded the files to RStudio Server, you can read them in with the code below.

After you’ve restarted RStudio (in addition to checking your working directory and loading your packages), you’ll also need to use the code below to read in your model again.

# Replace this with the path to the model you want to read in
w2vModel <- read.vectors("data/wwo-regularized.bin")

Exporting queries

The code below will enable you to export the results from a particular query. To export query results, change the part after “w2vModel %>%” to match the query that you want to export; an example is filled in so that you can see what this looks like, and you can paste over it with any other query, such as addition or subtraction. You can also adjust the number of words in the results set if you want to see more or fewer terms.

The first line of code defines the variable “w2vExport” as whatever query you set. The second line exports a CSV file (which you can open in any program on your computer that works with tabular data, including Excel and Numbers). You can call the file whatever you like by replacing the template text inside of the quotation marks. The CSV file will be exported to the “output” folder in your current working directory, and it will overwrite existing files with the same name, so make sure to rename the export file if you want to keep earlier versions. Make sure not to use any spaces in the file names you choose.

If you would like to then export any file from RStudio Server to your own computer, click the box next to the file to select it and then go to the “More” gear near the top of the Files menu. Then, choose “Export” and hit the “Download” button.

w2vExport <- w2vModel %>% closest_to("girl", 30) 

#Change "name_of_your_query" to a descriptive name that you want to give to your export file. Don't put any spaces in the file name.
write.csv(file="output/name_of_your_query.csv", x=w2vExport)

Exporting clusters

You can use a similar method to export your clusters; the code below will first generate a set of clusters and then export a specified (by you) number of terms from those clusters. As above, you can change the number of centers and iterations when you are generating the clusters; you can also change how many sets of clusters and words from each cluster to export.

The exporting mechanism is the same as with exporting queries above; you change the language in the quotation marks to match the name that you want to give your file. The export file can be fairly large, so this code might take a bit of time to run. And, again, you can follow the instructions above to export from RStudio Server to your own computer.

centers <- 150

clustering <- kmeans(w2vModel, centers=centers, iter.max=40)

w2vExport <- sapply(sample(1:centers, 150), function(n) {
  names(clustering$cluster[clustering$cluster==n][1:15])
})

# Change "name_of_your_cluster" to a descriptive name that you want to give to your export file.
write.csv(file="output/name_of_your_cluster.csv", x=w2vExport)

Evaluating the Model

Below is a very simple test that will calculate the cosine similarities for a small set of word pairs that are likely to be related in many models. You can customize this list for your own corpus by editing the pairs below, or adding new ones (add as many as you like, but make sure to follow the same format as in the examples below). This code will produce a “model-test-results.csv” file with cosine similarity scores on these word pairs for every model in your folder. The results file will be in the “output” folder of your current working directory. This is meant to be an example of the kinds of testing that are used in model evaluation, and is not a substitute for more rigorous testing processes.

# Find all of the .bin model files in your current working directory (including subfolders)
files_list = list.files(pattern = "\\.bin$", recursive = TRUE)

rownames <- c()

data_frame <- data.frame()
# Word pairs to test; edit these or add new pairs following the same format
data = list(c("away", "off"),
            c("before", "after"),
            c("cause", "effects"),
            c("children", "parents"),
            c("come", "go"),
            c("day", "night"),
            c("first", "second"),
            c("good", "bad"),
            c("last", "first"),
            c("kind", "sort"),
            c("leave", "quit"),
            c("life", "death"),
            c("girl", "boy"),
            c("little", "small"))

data_list = list()

# Read in each model and calculate the cosine similarity for every word pair
for(fn in files_list) {
  
  wwp_model = read.vectors(fn)
  sims <- c()
  for(pairs in data)
  {
    vector1 <- c()
    for(x in wwp_model[[pairs[1]]]) {
      vector1 <- c(vector1, x)
    }
    
    vector2 <- c()
    for(x in wwp_model[[pairs[2]]]) {
      vector2 <- c(vector2, x)
    }
    
    sims <- c(sims, cosine(vector1,vector2))
    #f_name <- strsplit(fn, "/")[[1]][[2]]
    data_list[[fn]] <- sims
  }
  
}

# Build row labels like "away-off" for each word pair
for(pairs in data){
  rownames <- c(rownames, paste(pairs[1], pairs[2], sep = "-"))
}

# Combine the per-model similarity scores into a single data frame, one column per model
results <- structure(data_list,
                     class     = "data.frame",
                     row.names = rownames
)


# If you want to give your results document a more specific name, you can edit "model-test-results" below. 
write.csv(file="output/model-test-results.csv", x=results)

Credits and Thanks

This tutorial uses the wordVectors package developed by Ben Schmidt and Jian Li, itself based on the original word2vec code developed by Mikolov et al. The walkthrough was also informed by workshop materials authored by Schmidt, as well as by an exercise created by Thanasis Kinias and Ryan Cordell for the “Humanities Data Analysis” course, and a later version used in Elizabeth Maddock Dillon and Sarah Connell’s “Literature and Digital Diversity” class, both at Northeastern University.

This version of the walkthrough was developed as part of the Word Vectors for the Thoughtful Humanist series at Northeastern. Word Vectors for the Thoughtful Humanist has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project do not necessarily represent those of the National Endowment for the Humanities.