Word Vectors Primer
This primer covers core concepts for understanding and working with word vector models. Word vector models are a set of techniques from machine learning and natural language processing that model textual data numerically, using mathematical vectors to map semantic relationships between words in a corpus. These models can process large amounts of textual data to predict semantic, structural, and conceptual relationships between words. The tutorials below introduce some of the core concepts for understanding how word vector models operate and how they can be used to explore textual corpora. They then unpack some of the specific considerations for building a corpus and training a model. Finally, there is a series of hands-on exercises for training and querying word vector models, using the R programming language.
Set-up for Tutorials
The hands-on activities in these tutorials use the wordVectors package to train and query word2vec models with the R programming language. You can download a folder with RMarkdown walkthroughs, sample data, and guides from the WWP’s GitHub repository. The repository also contains a guide for working with the RMarkdown files.
Many of the concepts discussed below are also relevant for other implementations of word2vec, such as in Gensim. While these tutorials focus on word2vec implementations in R, the WWP will soon publish a set of notebooks for training and querying models in Python with Gensim.
The WWP also publishes the Women Writers Vector Toolkit, which includes pre-trained models that can be queried without running any code and which provides extensive readings, guides, and other resources.
An Overview of Word Vector Models
This tutorial provides a high-level overview of some important concepts for understanding word vector models. It discusses what the term “model” means in the context of machine learning, begins explaining what we mean when we say “vectors” and “vector space,” and situates word vector models relative to other forms of computational text analysis.
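As a rough illustration of what "vector space" means in practice, the sketch below compares words by the cosine of the angle between their vectors, a measure commonly used with word vector models. It is written in Python rather than R so that it runs with the standard library alone. The three-dimensional vectors and the helper names (toy_vectors, cosine_similarity, most_similar) are invented for this example and do not come from any trained model or from the tutorials; real models typically use many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "queen": [0.9, 0.8, 0.1],
    "king": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word` by cosine similarity."""
    target = vectors[word]
    candidates = [w for w in vectors if w != word]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

print(most_similar("queen", toy_vectors))
```

In this toy space, "queen" and "king" point in nearly the same direction, so their cosine similarity is close to 1, while "apple" points elsewhere and scores much lower.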
An Introduction to Word Vectors
This tutorial provides a more detailed and comprehensive introduction to word vector models. It walks through some key concepts and terminology, explains the processes for training and querying a model, discusses some of the mathematical frameworks for word vectors, and covers the tools and technologies available for working with word vector models.
Processes for Data Preparation and Model Training
This tutorial covers data preparation and model training in more detail. It discusses considerations for building a corpus, regularizing and cleaning textual data, making choices about model parameters, and validating trained models.
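To give a sense of what regularizing and cleaning textual data can involve, here is a minimal sketch in Python (standard library only). The specific choices it makes, lowercasing, replacing punctuation with spaces, and splitting on whitespace, are assumptions made for illustration rather than the tutorial's recommended pipeline, and the function name regularize is hypothetical.

```python
import re

def regularize(text):
    """Lowercase the text, replace punctuation with spaces, and split into tokens."""
    text = text.lower()
    # Replace anything that is not a word character or whitespace with a space.
    # This is an illustrative choice; a real project might treat hyphens,
    # apostrophes, or historical spellings differently.
    text = re.sub(r"[^\w\s]", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    return text.split()

print(regularize("Hello,  World! Word vectors; word VECTORS."))
```

Decisions like these affect which strings the model treats as the same word, which is why corpus preparation is usually documented alongside the model parameters.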
Hands-on Activities
This tutorial supplements the RMarkdown walkthroughs on GitHub with a set of frameworks, instructions, and screenshots. The walkthroughs are designed to be self-paced and, combined with this guide, provide a complete curriculum of code and instructions for learning the basics of R and RStudio; training, querying, and validating word2vec models; and performing some exploratory visualizations.
What next?
If you have finished this primer, you might be wondering where to go next. The Women Writers Vector Toolkit includes links to case studies, curricular materials, and readings, as well as an interface that lets you explore pre-trained models without running code. The WWP’s Resources page includes additional tutorials for learning more about text encoding, as well as further guides for working with word vectors.