<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://wwp-test.northeastern.edu/outreach/seminars/_utils/schema/yaps.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="https://wwp-test.northeastern.edu/outreach/seminars/_utils/schema/yaps.isosch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-stylesheet type="text/css" href="https://wwp-test.northeastern.edu/outreach/seminars/_utils/stylesheets/yaps-tei.css"?>
<!-- $Id: word_vectors_intro.xml 51374 2026-04-01 16:35:16Z aclark $ -->
<TEI xmlns="http://www.wwp.northeastern.edu/ns/yaps" version="5.0">
	<teiHeader>
		<fileDesc>
			<titleStmt>
				<title>Introduction to Word Vectors</title>
				<author>Julia Flanders</author>
			</titleStmt>
			<editionStmt>
				<edition>Word Vectors for the Thoughtful Humanist</edition>
			</editionStmt>
			<publicationStmt>
				<distributor>Women Writers Project (via website)</distributor>
				<address>
					<addrLine>url:mailto:wwp@neu.edu</addrLine>
				</address>
				<date when="2019-04-01"/>
				<availability status="restricted">
					<p>Copyright 2019 Syd Bauman, Julia Flanders, Sarah Connell, and the Women
						Writers Project</p>
					<p>This TEI-encoded XML file is available under the terms of the <ref
							target="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons
							Attribution-ShareAlike 3.0 (Unported)</ref> license.</p>
				</availability>
				<pubPlace>Boston, MA USA</pubPlace>
			</publicationStmt>
			<sourceDesc>
				<p>Gives a conceptual overview of the most basic concepts, terminology, and
					processes of word embedding models.</p>
			</sourceDesc>
		</fileDesc>
		<revisionDesc>
			<change when="2021-06-24" who="jflanders.lfw">Updated for third institute</change>
			<change when="2021-05-17" who="jflanders.lfw">Updated for second institute</change>
			<change when="2019-04-01" who="jflanders.lfw">Initial draft</change>
		</revisionDesc>
	</teiHeader>
	<text>
		<presentation>
			<abstract>
				<p>This tutorial offers an introductory overview of word embedding models and
					their associated concepts and terminology.</p>
			</abstract>
			<section>
				<head>A Road Map</head>

				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_roadmap.png"/>
					</figure>

				</slide>
				<lectureNote>
					<p>As we've already seen, word vectors are complicated...</p>
					<p>The next few sessions are intended to offer an overview, from several
						different angles: <list>
							<item>a walk through the special concepts and terminology so that we're
								all comfortable with them</item>
							<item>a walk through the actual process of training and querying a
								model</item>
							<item>an exploration of the mathematical side of word embedding models:
								what do we mean by <soCalled>vector space</soCalled>?</item>
							<item>a review of the tool set we use with word embedding models: what
								are the actual technologies we use and what role do they
								play?</item>
						</list>
					</p>
					<p>Hopefully by the end, we'll have gone over the same material from enough
						different perspectives that it will all make perfect sense!</p>
					<p>And at various points, we'll take a step back and think about the explanatory
						process itself: what kinds of explanation might work best for different
						audiences (especially readers of our scholarship, project collaborators,
						colleagues, grant reviewers, also potentially our students)</p>
				</lectureNote>
				<tutorial>
					
					<p>Word vectors can be pretty complicated. This tutorial is designed to offer an
						overview of word vectors, from several different angles:</p>
					<list>
							<item>a walk through the special concepts and terminology</item>
							<item>a walk through the actual process of training and querying a
								model</item>
							<item>an exploration of the mathematical side of word embedding models:
								what do we mean by <soCalled>vector space</soCalled>?</item>
							<item>a review of the tool set we use with word embedding models: what
								are the actual technologies we use and what role do they
								play?</item>
						</list>
					
			
				</tutorial>
			</section>

			<section>
				<head>Corpus and model</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_corpus_model.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>We're going to hear the terms <term>corpus</term> and <term>model</term> a
						lot this week: let's look more closely at those terms</p>
					<p>Corpus: <list>
							<item>In the simplest sense, the corpus is the body of textual material
								we are analysing</item>
							<item>A set of documents in some machine-readable form, that is ready
								for the word2vec program to ingest</item>
							<item>Our corpus might be derived from a larger research collection (or
								several different collections), maybe in another format (like
								TEI/XML) that contains extra information that we take advantage of
								when we generate the corpus that will be fed into Word2Vec</item>
							<item>So to get from the <term>research collection</term> to the
									<term>corpus</term> we might need to do some data conversion:
								from XML (or some other format) to plain text (which is what the
								Word2Vec tool requires)</item>
							<item>And we might also need to do some cleaning and regularization, to
								tame the irregularities of the original research collection. A
								little later on, we'll think about data formats and cleaning
								processes in more detail. </item>
							<item>So when we talk about <soCalled>the corpus</soCalled> here, we're
								talking about the plain-text corpus that is ready to be fed into
								Word2Vec </item>
						</list>
					</p>
					<p>Model: <list>
							<item>As we've already noted, the term <term>model</term> is an
								important one in digital humanities: in general terms, a model is a
								representation of something we are interested in, that captures some
								features of importance, in a way that makes it easier for us to
								examine and learn about that object of interest. So for instance a
								TEI-encoded text is a <emph>representation of a text</emph> that
								makes the structure and content of that text easier for us to see
								and work with. A word-embedding model of a corpus is a
									<emph>representation of a corpus of texts</emph>, in a way that
								makes the semantic relationships between words easier for us to see
								and work with.</item>
							<item>Practically speaking, the <soCalled>model</soCalled> we will be
								dealing with is a processed version of the corpus, produced by the
								Word2Vec tool, which represents the positioning of each word within
								the model as a vector</item>
							<item>So for now the key point is: the corpus is a collection of
								documents, while the model is a processed, computed representation
								of the textual data contained in those documents</item>
						</list>
					</p>
					<p>The <term>data preparation</term> process is how you get from the research
						collection to the corpus</p>
					<p>The <term>training</term> process is how you get from the corpus to the
						model.</p>
				</lectureNote>
				<tutorial>
					<p>Two important terms to learn before beginning your work with word vectors
						are <term>corpus</term> and <term>model</term>. Let's take a closer look at
						those terms.</p>
					<p>Corpus: <list>
							<item>In the simplest sense, the corpus is the body of textual material
								you are analysing, a set of documents in some machine-readable form, ready
								for the word embedding algorithm to ingest</item>
							<item>Your corpus might be derived from a larger research collection (or
								several different collections), maybe in another format (like
								TEI/XML) that contains extra information that you should take
								advantage of when you generate the corpus that will be fed into
								Word2Vec. Generally, you want your corpus to consist of
								machine-readable texts (plain text).</item>
							<item>So to get from the <term>research collection</term> to the
									<term>corpus</term> you might need to do some data conversion:
								from XML (or some other format) to plain text (which is what the
								model training algorithm requires)</item>
							<item>And you might also need to do some cleaning and regularization, to
								tame the irregularities of the original research collection. Later
								in the tutorial, we'll look at data formats and cleaning processes
								in more detail. </item>
							<item>So <soCalled>the corpus</soCalled> here is a plain-text corpus
								that is ready to be fed into the model </item>
						</list>
					</p>
					<p>Model: <list>
							<item>The term <term>model</term> is an important one in digital
								humanities: in general terms, a model is a representation of
								something we are interested in, that captures some features of
								importance, in a way that makes it easier for us to examine and
								learn about that object of interest. So for instance a TEI-encoded
								text is a <emph>representation of a text</emph> that makes the
								structure and content of that text easier for us to see and work
								with. A word-embedding model of a corpus is a <emph>representation
									of a corpus of texts</emph>, in a way that makes the semantic
								relationships between words easier for us to see and work
								with.</item>
							<item>Practically speaking, the <soCalled>model</soCalled> in this
								tutorial is a processed version of the corpus, produced by the
								Word2Vec tool, which represents the positioning of each word within
								the model as a vector</item>
							<item>So for now the key point is: the corpus is a collection of
								documents, while the model is a processed, computed representation
								of the textual data contained in those documents</item>
						</list>
					</p>
					<p>Some other important terms that will appear in this tutorial are:</p>
					<list>
						<item>The <term>data preparation</term> process is how you get from the
							research collection to the corpus</item>
						<item>The <term>training</term> process is how you get from the corpus to
							the model.</item>
					</list>
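					<p>As a minimal sketch of the cleaning and regularization step, the following
						plain-Python function lowercases a text and strips punctuation. The
						specific choices here are illustrative, not prescriptive: every project
						decides for itself what to regularize.</p>

```python
# Minimal sketch of corpus cleaning/regularization. The choices here
# (lowercase everything, drop punctuation) are illustrative only; real
# projects make these decisions per corpus.
import re

def clean(text):
    text = text.lower()
    # keep letters, digits, and whitespace; replace everything else with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # collapse runs of whitespace into single spaces
    return " ".join(text.split())

print(clean("The Sacred, the Holy; and the Consecrated!"))
# prints: the sacred the holy and the consecrated
```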
				</tutorial>
			</section>

			<section>
				<head>Parameters</head>
				<slide>
					<figure>
						<graphic height="600px"
							url="../../../_utils/gfx/w2v_industrial_pasta_process_parameters.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>Remember that we said different researchers might want to use the model for
						different things, which would result in training/generating the model
						somewhat differently. The way we control that training process is by
						adjusting a set of <term>parameters</term>.</p>
					<p>You can think of the training process (where we take a corpus and create a
						model of it) as being sort of like an industrial operation: </p>
					<list>
						<item>you take some raw materials and feed them into a big machine, and on
							the other end you get out some product</item>
						<item>and this hypothetical machine has a whole bunch of knobs and levers on
							it that you use to control the settings</item>
						<item>in our word2vec model training, the parameters are those knobs and
							levers, that control the training process</item>
						<item>depending on how you adjust them, you get differently trained models
							with different behaviours</item>

					</list>
					<p>We'll take a quick look now at two of these parameters, so that you can get a
						sense of how they affect the training process; they also have an important
						impact on how we interpret the results of the model. Later in the week,
						we'll look at these parameters in more detail and think about the effect
						these specific settings have on our models.</p>
				</lectureNote>
				<tutorial>
					
					<p>Different researchers might want to use the model for different purposes,
						and depending on the goal of the research, the model might be trained
						somewhat differently. You control that training process by adjusting a set
						of settings known as <term>parameters</term>.</p>
					<p>The training process (where we take a corpus and create a model of it) is
						sort of like an industrial operation with the following steps: </p>
					<list>
						<item>you take some raw materials and feed them into a big machine, and on
							the other end you get out some product</item>
						<item>and this hypothetical machine has a whole bunch of knobs and levers on
							it that you use to control the settings</item>
						<item>in our word2vec model training, the parameters are those knobs and
							levers, that control the training process</item>
						<item>depending on how you adjust them, you get differently trained models
							with different behaviours</item>

					</list>
					<p>The tutorial will next walk you through some of these parameters, so that you
						can get a sense of how they affect the training process; they also have an
						important impact on how we interpret the results of the model. Later, we'll
						look at these parameters in more detail and think about the effect these
						specific settings have on our models. This tutorial will not cover all of
						the parameters that are available for the training process, so refer to the
							<ref target="https://radimrehurek.com/gensim/models/word2vec.html"
							>word2vec documentation</ref> for a list of additional parameters that
						may be of use to you.</p>

				</tutorial>
			</section>

			<section>
				<head>Window</head>
				<slide>
					<figure>
						<graphic height="800px" url="../../../_utils/gfx/w2v_window.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>The first parameter for us to consider is the concept of the
							<term>window</term></p>
					<p>And here we come to a fundamental assumption for a lot of text analysis: that
						words that are used together have something to do with one another</p>
					<p>What does it mean for words to be <soCalled>used together</soCalled>? <list>
							<item>right next to one another? all or nothing?</item>
							<item>more relevant the closer they are? sort of a gradient?</item>
							<item>contained within the same semantic construct, like a sentence or a
								paragraph? (problem: we're working with plain text so we don't have
								access to <soCalled>semantic constructs</soCalled>)</item>
						</list>
					</p>
					<p>In Word2Vec, instead of these, we use a <soCalled>window</soCalled>: <list>
							<item>a span of text of a specified length, like a viewing port that we
								move over the text that allows us to see X words at a time</item>
							<item>we can control the size of the window (it is one of the
									<term>parameters</term> we just talked about)</item>
							<item>the Word2Vec algorithm is like a bookworm reading its way through
								the text, bite by bite</item>
							<item>each taste is localized by the window: each bite gives the
								processor a set of words that are considered <soCalled>used
									together</soCalled></item>
							<item>and the size of the bite affects how many words are considered
								together in this way</item>
							<item>a bigger window lets us treat larger groups of words as
									<soCalled>related</soCalled></item>
							<item>[pause and discuss for a moment:] what might be the results for
								our analysis of a larger or smaller window? (imagine a window that
								is thousands of words, as big as an entire chapter; imagine a window
								that is only two words wide)</item>
						</list>
					</p>
					<p>Remember that this is a <term>machine learning</term> process and moreover it
						is an <term>unsupervised machine learning process</term>: one that starts
						from a state of complete ignorance and has to bootstrap itself.
								<list><item>So another way to imagine the approach being taken in
								the training process: picture that you have a big bag containing all
								the words in the corpus. You shake the bag and then dump it out on
								the floor. Now you start reading the corpus (i.e., the actual texts
								with their actual word order). </item>
							<item>Each time you read a word, you make observations about the words
								around it. </item>
							<item>Remember that this one observation doesn't give you any kind of
									<q>Truth!! 100%!!</q> about those words: it's just one little
								observed fact. Probabilistically, it contributes a tiny bit to our
								belief about the whole corpus. </item>
							<item>So based on those observations about word X, we move each of the
								context words a tiny bit closer to word X. Now we look at the next
									<q>word X</q> and its companions, and we move those words a
								little bit.</item>
							<item>note that the <soCalled>window</soCalled> is giving us two pieces
								of information: what's in the window, and what's not in the window.
								We'll come back to this in more detail later, but for now we can say
								that in addition to moving the words we <emph>do</emph> see, we also
								update the position of some of the words we <emph>don't</emph> see
								as we read the text. </item>
						</list>
					</p>
				</lectureNote>

				<tutorial>
					
					<p>The first parameter we will consider is the concept of the
							<term>window</term></p>
					<p>A fundamental assumption for many text analysis methods is: words that are
						used together have something to do with one another</p>
					<p>What does it mean for words to be <soCalled>used together</soCalled>? <list>
							<item>right next to one another? all or nothing?</item>
							<item>more relevant the closer they are? sort of a gradient?</item>
							<item>contained within the same semantic construct, like a sentence or a
								paragraph? (problem: we're working with plain text so we don't have
								access to <soCalled>semantic constructs</soCalled>)</item>
						</list>
					</p>
					<p>In Word2Vec, we use a <soCalled>window</soCalled> to define which words
						count as being used together. In the Word2Vec algorithm, a window means: <list>
							<item>a span of text of a specified length, like a viewing port that we
								move over the text that allows us to see X words at a time</item>
							<item>we can control the size of the window (it is one of the
									<term>parameters</term> we just talked about)</item>
							<item>the Word2Vec algorithm is like a bookworm reading its way through
								the text, bite by bite</item>
							<item>each taste is localized by the window: each bite gives the
								processor a set of words that are considered <soCalled>used
									together</soCalled></item>
							<item>and the size of the bite affects how many words are considered
								together in this way</item>
							<item>a bigger window lets us treat larger groups of words as
									<soCalled>related</soCalled></item>
							<item>Take a moment to ask yourself: what might the results of the
								analysis be with a larger or smaller window? (imagine a window that
								is thousands of words wide, as big as an entire chapter; now
								imagine a window that is only two words wide). Pausing to reflect
								on this question will help you make better decisions about the
								window size for your own models.</item>
						</list>
					</p>
					<p>Word2Vec is a <term>machine learning</term> process and moreover it is an
							<term>unsupervised machine learning process</term>: one that starts from
						a state of complete ignorance and has to bootstrap itself. <list><item>So
								another way to imagine the approach being taken in the training
								process: picture that you have a big bag containing all the words in
								the corpus. You shake the bag and then dump it out on the floor. Now
								you start reading the corpus (i.e., the actual texts with their
								actual word order). </item>
							<item>Each time you read a word, you make observations about the words
								around it. </item>
							<item>Keep in mind that this one observation doesn't give you any kind
								of <q>Truth!! 100%!!</q> about those words: it's just one little
								observed fact. Probabilistically, it contributes a tiny bit to your
								belief about the whole corpus. </item>
							<item>So based on those observations about word X, we move each of the
								context words a tiny bit closer to word X. Now we look at the next
									<q>word X</q> and its companions, and we move those words a
								little bit.</item>
							<item>note that the <soCalled>window</soCalled> is giving us two pieces
								of information: what's in the window, and what's not in the window.
								We'll come back to this in more detail later in the tutorial, but
								for now we can say that in addition to moving the words we
									<emph>do</emph> see, we also update the position of some of the
								words we <emph>don't</emph> see as we read the text. </item>
						</list>
					</p>
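					<p>The viewing-port idea can be sketched in plain Python (the sentence below is
						made up for illustration): for each target word, we collect the words that
						fall inside a symmetric window around it.</p>

```python
# Sketch: for each target word, collect the context words that fall
# inside a symmetric window of the given size around it.
def context_windows(tokens, window):
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        # the words before the target plus the words after it, within the window
        context = tokens[lo:i] + tokens[i + 1 : i + 1 + window]
        yield target, context

tokens = "the sacred shrine was holy".split()
for target, context in context_windows(tokens, window=2):
    print(target, context)
```

					<p>Each printed pair is one <soCalled>bite</soCalled> of the bookworm: a target
						word and the words the algorithm treats as used together with it.</p>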
				</tutorial>
			</section>

			<section>
				<head>Iterations</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_iterations2.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>We've talked about the creation of a model as a <term>training</term>
						process, and we've just imagined it as a bookworm eating its way through the
						text, repeatedly. The trained model is the representation of the probability
						that words appear within the same window. <list>
							<item>As we just noted, the model begins in a state of complete
								randomness: words dumped on the floor. But after one read through
								the corpus, the words on the floor have moved around a bit. The
								machine is learning! Now, if we repeat the process, we can move them
								a bit further--it might seem as if we're getting the same
								information as we got before, but because the words on the floor are
								now in different (better) positions already, what we're doing is
								refining that information further. </item>
							<item>each pass through the corpus provides another set of adjustments,
								making the model more accurate</item>
							<item>each of these passes is called an <term>iteration</term>, and the
								more iterations the training process does, the more accurate the
								model (but of course the more time the process takes)</item>
							<item>you can control the number of iterations: it is another of the
									<term>parameters</term> we mentioned a moment ago</item>
						</list>
					</p>
				</lectureNote>
				<tutorial>
					
					<p>By now, you have read about the creation of a model as a
							<term>training</term> process, and we've just imagined it as a bookworm
						eating its way through the text, repeatedly. The trained model is the
						representation of the probability that words appear within the same window. <list>
							<item>As we just noted, the model begins in a state of complete
								randomness: words dumped on the floor. But after one read through
								the corpus, the words on the floor have moved around a bit. The
								machine is learning! Now, if we repeat the process, we can move them
								a bit further--it might seem as if we're getting the same
								information as we got before, but because the words on the floor are
								now in different (better) positions already, what we're doing is
								refining that information further. </item>
							<item>each pass through the corpus provides another set of adjustments,
								making the model more accurate</item>
							<item>each of these passes is called an <term>iteration</term>, and the
								more iterations the training process does, the more accurate the
								model (but of course the more time the process takes)</item>
							<item>you can control the number of iterations: it is another of the
									<term>parameters</term> we mentioned before</item>
						</list>
					</p>
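					<p>As a toy sketch (deliberately not the real word2vec mathematics), you can
						picture each iteration nudging a word's position a small step toward a word
						it co-occurs with; repeated passes refine the position further.</p>

```python
# Toy sketch of iterative refinement (not the actual word2vec update rule):
# each pass moves a word's position a small fraction of the remaining
# distance toward a co-occurring word.
def train(position, target, iterations, step=0.1):
    for _ in range(iterations):
        position = position + step * (target - position)
    return position

print(round(train(0.0, 1.0, iterations=1), 3))   # one pass: 0.1
print(round(train(0.0, 1.0, iterations=20), 3))  # twenty passes: much closer to 1
```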
				</tutorial>
			</section>



			<section>
				<head>Vectors: a first look</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_vector.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>Let's look next at some terms that may seem most distant from our humanistic
						expertise: the ones that refer to the mathematical aspects of word embedding
						models. The word <mentioned>vector</mentioned> has come up already: what is
						a vector and how is it relevant in this case? We'll start with a simple
						explanation first, and then circle back a bit later for more detail.</p>
					<p>A vector is basically a line that has both a specific length and a specific
						direction or orientation in space:<list>
							<item>we can describe that line using coordinates: one coordinate for
								each axis of information we have about the line</item>
							<item>in this example, the vector is the thick black line that starts at
								the origin (the point where all three axes are at zero) and extends
								out to the point in space where the x axis (the blue number) is at
								3, the y axis (the red number) is at 2, and the z axis (the green
								number) is at zero</item>
							<item>its direction and length are defined by those three
								dimensions</item>
							<item>any questions? This may be new for some and probably not current
								knowledge for most of us!</item>
						</list>
					</p>
					<p>In a word-embedding model, the model represents a text corpus almost like a
						dandelion: as if each word were at the end of one of the little dandelion
						threads: <list>
							<item>each thread projects at a slightly different angle</item>
							<item>each word is located at a slightly different point in this cloud
								of words</item>
							<item>and words that are nearer to one another in meaning are also
								nearer to one another in vector space.</item>
						</list>
					</p>

				</lectureNote>
				<tutorial>
					
					<p>Let's look next at some terms that may seem most distant from our humanistic
						expertise: the ones that refer to the mathematical aspects of word embedding
						models. The word <mentioned>vector</mentioned> has come up already: what is
						a vector and how is it relevant in this case? We'll start with a simple
						explanation first, and then circle back a bit later for more detail.</p>
					<p>A vector is basically a line that has both a specific length and a specific
						direction or orientation in space:<list>
							<item>we can describe that line using coordinates: one coordinate for
								each axis of information we have about the line</item>
							<item>in this example, the vector is the thick black line that starts at
								the origin (the point where all three axes are at zero) and extends
								out to the point in space where the x axis (the blue number) is at
								3, the y axis (the red number) is at 2, and the z axis (the green
								number) is at zero</item>
							<item>its direction and length are defined by those three
								dimensions</item>
							<item>This may be new and probably not current knowledge for most
								humanists, so don't worry if you are feeling a little lost!</item>
						</list>
					</p>
					<p>A naive representation would give each word as many dimensions as there are
						distinct words in the corpus: far too many for any one person to wrap their
						head around, and unwieldy even for the computer. A word-embedding model
						reduces this to a few hundred dimensions, and represents a text corpus
						almost like a dandelion: as if each word were
						at the end of one of the little dandelion threads: <list>
							<item>each thread projects at a slightly different angle</item>
							<item>each word is located at a slightly different point in this cloud
								of words</item>
							<item>and words that are nearer to one another in meaning are also
								nearer to one another in vector space.</item>
						</list>
					</p>
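					<p>As a small sketch in plain Python, the example vector from the slide can be
						written as its three coordinates, and its length computed from them
						(Pythagoras generalized to three dimensions).</p>

```python
# Sketch: the example vector described by its three coordinates,
# and its length computed from them.
import math

v = (3, 2, 0)  # x, y, z coordinates of the example vector

# length is the square root of the sum of the squared coordinates
length = math.sqrt(sum(c * c for c in v))
print(round(length, 3))  # prints 3.606
```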
				</tutorial>
			</section>

			<section>
				<head>Cosine Similarity: What is a cosine anyway?</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_cosine.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>So what does it mean to be <term>near</term> something in vector space? How
						do we measure this kind of proximity or association? If we understand these
						vectors as lines whose directionality and length reflects word associations
						in the corpus, then the more closely aligned two vectors are (the more they
						are going in the same direction for the same distance), the
							<soCalled>nearer</soCalled> they are for our purposes.</p>
					<p>We can measure that alignment by using a mathematical expression called a
							<term>cosine</term>. What is a cosine? <list>
							<item>If we have two vectors (two lines extending out in different
								directions), what we really have is a triangle (the third leg would
								be the line connecting the ends of those two vectors)</item>
							<item>Within a triangle, a cosine is a way of representing an angle in
								relation to the lengths of its two legs</item>
							<item>the exact formula (for right triangles) is shown here on the
								slide, but even without parsing that in detail, we can say that the
								cosine of an angle is the ratio between the side adjacent to it and
								the hypotenuse (for triangles without a right angle, the formula is
								a little more complex)</item>
							<item>So when those two sides are very similar in length and direction,
								the cosine is going to get closer and closer to 1</item>
						</list>
					</p>

				</lectureNote>
				<tutorial>
					<p>Now that we have a basic understanding of what vectors are, what does it mean
						to be <term>near</term> something in vector space? How do we measure this
						kind of proximity or association? If we understand these vectors as lines
						whose directionality and length reflects word associations in the corpus,
						then the more closely aligned two vectors are (the more they are going in
						the same direction for the same distance), the <soCalled>nearer</soCalled>
						they are for our purposes.</p>
					<p>We can measure that alignment by using a mathematical expression called a
							<term>cosine</term>. What is a cosine? <list>
							<item>If we have two vectors (two lines extending out in different
								directions and connected at the same origin point), then when we
								draw a third line connecting those two vectors, what we really have
								is a triangle.</item>
							<item>Within a triangle, a cosine is a way of representing an angle in
								relation to the lengths of its two legs</item>
							<item>For right triangles, the cosine of an angle is the ratio between
								two of the triangle's sides: the side adjacent to the angle divided
								by the hypotenuse (for triangles without a right angle, the formula
								is a little more complex)</item>
							<item>So when the angle between the two vectors is very small (the
								vectors point in nearly the same direction), the cosine gets closer
								and closer to 1; as the angle widens toward a right angle, the
								cosine drops toward 0.</item>
						</list>
					</p>
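					<p>Cosine similarity can be computed directly from its definition: the dot
						product of two vectors divided by the product of their lengths. A minimal
						sketch in plain Python:</p>

```python
# Sketch: cosine similarity computed directly from its definition,
# the dot product divided by the product of the vector lengths.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity((3, 2, 0), (3, 2, 0)), 3))  # identical vectors: 1.0
print(round(cosine_similarity((1, 0, 0), (0, 1, 0)), 3))  # perpendicular vectors: 0.0
```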
				</tutorial>
			</section>

			<section>
				<head>Cosine Similarity</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_cosinesimilarity.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>So now we can come back to our question of how to measure
							<soCalled>nearness</soCalled>. In word embedding models the measure of
						nearness that we use is something called <term>cosine similarity</term>. <list>
							<item>Roughly speaking, this is a measure of the similarity of two
								vectors, based on the cosine of the angle between them</item>
							<item>As we've seen, the more similar the two vectors are, the closer
								the cosine of the angle between them gets to 1. In practice, the
								values of cosine similarity range between zero and one
								(mathematically a cosine can go as low as -1, for vectors pointing
								in opposite directions): two identical vectors have a cosine
								similarity of 1; two completely unrelated vectors have a cosine
								similarity of zero</item>
							<item>So the smaller the cosine similarity, the less similar the words
								are, and the farther apart they are in vector space</item>
							<item>We'll talk a bit later on about what level of similarity really
								counts as <soCalled>similar</soCalled>, and you'll get a feel for
								it</item>
							<item>In general, anything above .5 starts to feel meaningful</item>
						</list>
					</p>
					<p>So in this example (a real-world example from the WWP corpus), if we take the
						word <mentioned>sacred</mentioned> as our starting point, the words
							<mentioned>holy</mentioned> and <mentioned>consecrated</mentioned> are
						fairly close in meaning (and have high cosine similarity); the word
							<mentioned>shrine</mentioned> is more distant but still related enough
						to be interesting.</p>
					<p>So far so good? Questions?</p>
				</lectureNote>
				<tutorial>
					<p>So now we can come back to our question of how to measure
							<soCalled>nearness</soCalled>. In word embedding models the measure of
						nearness that we use is something called <term>cosine similarity</term>. <list>
							<item>Roughly speaking, this is a measure of the similarity of two
								vectors, based on the cosine of the angle between them</item>
							<item>As we've seen, the more similar the two vectors are, the closer
								the cosine of the angle between them gets to 1. In practice, the
								values of cosine similarity range between zero and one
								(mathematically a cosine can go as low as -1, for vectors pointing
								in opposite directions): two identical vectors have a cosine
								similarity of 1; two completely unrelated vectors have a cosine
								similarity of zero</item>
							<item>So the smaller the cosine similarity, the less similar the words
								are, and the farther apart they are in vector space</item>
							<item>Later in this walkthrough, we will cover what level of
								similarity really counts as <soCalled>similar</soCalled>, and
								you'll get a feel for it</item>
							<item>In general, anything above .5 starts to feel meaningful</item>
						</list>
					</p>
					<p>So in this example (a real-world example from the WWP corpus), if we take the
						word <mentioned>sacred</mentioned> as our starting point, the words
							<mentioned>holy</mentioned> and <mentioned>consecrated</mentioned> are
						fairly close in meaning (and have high cosine similarity); the word
							<mentioned>shrine</mentioned> is more distant but still related enough
						to be interesting.</p>
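					<p>To make the measure concrete, here is a minimal Python sketch of cosine
						similarity (the workshop itself works in R, and the three-dimensional
						vectors below are invented stand-ins, not real values from the WWP
						model):</p>

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths:
    # 1.0 for identical directions, near 0.0 for unrelated ones
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented 3-D vectors standing in for real vectors from a trained model
sacred = [0.9, 0.8, 0.1]
holy = [0.85, 0.75, 0.2]
shrine = [0.4, 0.3, 0.8]

print(cosine_similarity(sacred, holy))    # high: close neighbors
print(cosine_similarity(sacred, shrine))  # lower: more distant but related
```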

				</tutorial>
			</section>

			<section>
				<head>Querying</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_querying.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>So what can we do with this information? We've created a model of our corpus
						(a representation that helps us see some aspect of that information more
						clearly and easily): how do we use it?</p>
					<p>The first thing we might try is just querying the model about the
						neighborhood of a word we're interested in: essentially, asking it questions
						about where specific words are located and what is around them: <list>
							<!--<item>if we were working in an <soCalled>under-the-hood</soCalled> environment where we had full control over everything (writing our own R code), we could design very complex queries</item>
            <item>the vector toolkit we're using for this workshop offers some simpler options, but enough to get started with</item>-->
							<item>this slide shows a simple example using the WWP's Women Writers
								Vector Toolkit, but in this workshop we will be working in the
								RStudio environment that we looked at in the last session, so we can
								design much more complex queries</item>
							<item>we can enter a search term, and get back a list of the words that
								are closest to it in vector space: that is, words that are probably
								semantically related to it, based on the way those words appear
								together in the corpus</item>
							<item>as we can see from this list, these aren't necessarily synonyms:
								there are many different ways words can be
									<mentioned>related</mentioned> as we will discover </item>
							<item>but they are words that tend to appear in the same contexts as our
								query term (in this case, discussions of families and familial roles
								and relationships)</item>
						</list>
					</p>

				</lectureNote>
				<tutorial>
					<p>So what can you do with this information? At this point, you have created a model
						of your corpus (a representation that will help you see some aspect of that
						information more clearly and easily): how do you use it?</p>
					<p>A good first step is simply to query the model about the neighborhood of
						a word you're interested in: essentially, asking it questions about where
						specific words are located and what is around them: <list>
							<!--<item>if we were working in an <soCalled>under-the-hood</soCalled> environment where we had full control over everything (writing our own R code), we could design very complex queries</item>
            <item>the vector toolkit we're using for this workshop offers some simpler options, but enough to get started with</item>-->
							<item>If you're using the WWP's Women Writers Vector Toolkit, you can
								perform some of these basic queries using the pre-loaded models, but
								in this walkthrough we will be working with the model we just built
								so that we can ask more complex questions</item>
							<item>we can enter a search term, and get back a list of the words that are
								closest to it in vector space: that is, words that are probably
								semantically related to it, based on the way those words appear together
								in the corpus</item>
							<item>From this list, you might notice that these aren't necessarily
								synonyms: there are many different ways words can be
								<mentioned>related</mentioned> as we will discover </item>
							<item>but they are words that tend to appear in the same contexts as our
								query term (in this case, discussions of families and familial roles and
								relationships)</item>
						</list>
					</p>
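					<p>Conceptually, a nearest-word query just ranks the vocabulary by cosine
						similarity to the query term. Here is a minimal Python sketch with a toy
						hand-made <soCalled>model</soCalled> (in the walkthrough these vectors come
						from the word2vec model you trained, with many more dimensions, and the
						words and numbers below are invented):</p>

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# A toy hand-made "model" mapping words to invented 3-D vectors
model = {
    "mother": [0.9, 0.7, 0.1],
    "daughter": [0.85, 0.75, 0.15],
    "sister": [0.8, 0.8, 0.2],
    "river": [0.1, 0.2, 0.9],
}

def nearest(model, query, n=2):
    # Rank every other word by cosine similarity to the query term
    others = [w for w in model if w != query]
    others.sort(key=lambda w: cosine_similarity(model[query], model[w]), reverse=True)
    return others[:n]

print(nearest(model, "mother"))  # the closest words in this toy space
```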
				</tutorial>
			</section>
			
			<section>
				<head>Clustering</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_clustering2.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>Another way we can interact with the model is to ask it more generally,
							<q>where are your semantically dense zones?</q> Or <q>please show me
							some clusters of related words!</q>
					</p>
					<p>This process is somewhat similar to topic modeling: <list>
							<item>it says <q>What if we divided up the corpus into X different
									zones? where are the centers of those zones, and what is nearest
									to those centers?</q></item>
							<item>just as in topic modeling where we say, in effect, <q>if our
									corpus has X number of topics, what would they be?</q></item>
							<item>or if we were looking at a map of a region, we might say <q>if we
									were going to build ten new Home Depot stores, where should they
									go so that most people have the shortest drive? and who lives in
									those regions?</q></item>
						</list></p>
					<p>To generate these clusters (as part of the initial model training process): <list>
							<item>the modeling program runs a clustering algorithm that randomly
								chooses a number of locations within the vector space—in this case,
								three (like throwing a set of three darts at the map)</item>
							<item>then, it makes a series of adjustments to those locations to move
								them closer to actual <soCalled>population centers</soCalled>,
								places where words are close to one another within the vector space </item>
							<item>if we kept up the adjustment process for a long time, the
								locations would eventually settle into place, giving us three
								stable semantic zones within the model</item>
						</list>
					</p>

				</lectureNote>
				<tutorial>
					<p>Another way we can interact with the model is to ask it more generally,
							<q>where are your semantically dense zones?</q> Or <q>please show me
							some clusters of related words!</q>
					</p>
					<p>This process is somewhat similar to topic modeling: <list>
							<item>it says <q>What if we divided up the corpus into X different
									zones? where are the centers of those zones, and what is nearest
									to those centers?</q></item>
							<item>just as in topic modeling where we say, in effect, <q>if our
									corpus has X number of topics, what would they be?</q></item>
							<item>or if we were looking at a map of a region, we might say <q>if we
									were going to build ten new Home Depot stores, where should they
									go so that most people have the shortest drive? and who lives in
									those regions?</q></item>
						</list></p>
					<p>To generate these clusters (as part of the initial model training process): <list>
							<item>the modeling program runs a clustering algorithm that randomly
								chooses a number of locations within the vector space—in this case,
								three (like throwing a set of three darts at the map)</item>
							<item>then, it makes a series of adjustments to those locations to move
								them closer to actual <soCalled>population centers</soCalled>,
								places where words are close to one another within the vector space </item>
							<item>if we kept up the adjustment process for a long time, the
								locations would eventually settle into place, giving us three
								stable semantic zones within the model</item>
						</list>
					</p>
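					<p>The dart-throwing-and-adjusting process described above belongs to the
						k-means family of clustering algorithms. Here is a minimal Python sketch in
						two dimensions (real models cluster in hundreds of dimensions, and real
						implementations add refinements; the points below are invented):</p>

```python
import random

def assign(points, centers):
    # Each point joins the cluster whose center is nearest
    clusters = [[] for _ in centers]
    for p in points:
        dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
        clusters[dists.index(min(dists))].append(p)
    return clusters

def recenter(clusters, centers):
    # Move each center to the average position of its assigned points
    new = []
    for cluster, old in zip(clusters, centers):
        if cluster:
            new.append((sum(p[0] for p in cluster) / len(cluster),
                        sum(p[1] for p in cluster) / len(cluster)))
        else:
            new.append(old)  # a dart that hit empty space stays where it is
    return new

random.seed(1)
# Six points forming two obvious "population centers" in a toy 2-D space
points = [(0.1, 0.1), (0.2, 0.0), (0.0, 0.2), (0.9, 1.0), (1.0, 0.9), (1.1, 1.1)]
centers = [(random.random(), random.random()) for _ in range(2)]  # two random darts
for _ in range(10):  # a fixed number of adjustment rounds
    centers = recenter(assign(points, centers), centers)
print(sorted(centers))  # the centers settle near the two dense zones
```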
				</tutorial>
			</section>

			<section>
				<head>Clusters: an example</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_clustering3.png"/>
					</figure>
				</slide>
				<lectureNote>
					<!--Again, if we were writing code ourselves we could exert some fine control over this process, but in the toolkit we have a simple version:-->

					<p>So what we get at the end of the process is clusters of words that are like
							<soCalled>neighborhoods</soCalled> within the vector space: densely
						populated areas where words are grouped together around a concept or a
						textual phenomenon. <list>
							<item>in principle, the number of clusters is up to us</item>
							<item>in our own model training for this workshop, since we are working
								directly with the R code, we can choose how many clusters we
								want</item>
							<item>the WWVT doesn't give you this option but the WWP chose 150 as a
								reasonable number </item>
							<item>for the Toolkit we stop the process after 40 adjustments, so the
								clusters will come out a bit different every time you reset them,
								but when running the code yourself in RStudio you can control that
								more precisely.</item>
							<item>in this example, for instance (which shows three of the 150
								clusters), there's a cluster that's roughly associated with
								religiously-oriented death ceremony, and another one that is
								old-fashioned cavalry-oriented warfare, but the one in the middle is
								harder to describe as a <soCalled>concept</soCalled>: it's more like
								the space of dialogue and spoken language markers</item>
						</list>
					</p>

					<p>
						<emph>Check the time and consider stopping here!</emph>
					</p>
				</lectureNote>
				<tutorial>
					<p>So what we get at the end of the process is clusters of words that are like
							<soCalled>neighborhoods</soCalled> within the vector space: densely
						populated areas where words are grouped together around a concept or a
						textual phenomenon. <list>
							<item>in principle, the number of clusters is up to us</item>
							<item>in your own model training, since you are working directly with
								the code, you can choose how many clusters you want by adjusting
								that number in the code itself</item>
							<item>if you are using the Toolkit, the WWVT doesn't give you this
								option but the WWP chose 150 as a reasonable number </item>
							<item>for the Toolkit we stop the process after 40 adjustments, so the
								clusters will come out a bit different every time you reset them,
								but when running the code yourself you can control that more
								precisely. You might adjust these numbers to trade off accuracy
								against training speed.</item>
							<item>in this example, for instance (which shows three of the 150
								clusters), there's a cluster that's roughly associated with
								religiously-oriented death ceremony, and another one that is
								old-fashioned cavalry-oriented warfare, but the one in the middle is
								harder to describe as a <soCalled>concept</soCalled>: it's more like
								the space of dialogue and spoken language markers</item>
						</list>
					</p>

				</tutorial>
			</section>

			<section>
				<head>Vector Math 1</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_vectorMath1a.png"/>

						<!--              <graphic height="600px" url="../../../_utils/gfx/w2v_vectorMath1.png"/>
-->
					</figure>
				</slide>
				<lectureNote>
					<p>One more thing we can do to explore the word information in our vector space
						model: we can examine the relationships between words, taking advantage of
						the fact that each word is represented as a vector, which is essentially a
						list of numbers we can do arithmetic with</p>
					<p>To understand how this works, we need to envision a little more clearly how
						words are positioned in this vector space model: <list>
							<item>during the training process, the word2vec algorithm is examining
								the text (looking through its little window at successive groupings
								of words)</item>
							<item>and with each observation, it adjusts the position of the words in
								the model to take into account the word associations it
								observes</item>
							<item>so for a word like <mentioned>bank</mentioned>, it might observe
								some instances where that word is associated with words like
									<mentioned>funds</mentioned> and <mentioned>revenue</mentioned>,
								and so it moves <mentioned>bank</mentioned> closer to those words:
								it adds information that makes an association between these
								words</item>
							<item>then maybe it observes some other instances where
									<mentioned>bank</mentioned> is associated with words like
									<mentioned>river</mentioned> and <mentioned>lake</mentioned> and
									<mentioned>Hudson</mentioned>, and it moves the word
									<mentioned>bank</mentioned> a little closer to those
								words</item>
							<item>so by the end, each word is positioned in vector space in a way
								that reflects its associations (some weaker and some stronger) with
								many of the other words in the corpus</item>
							<item>we can think of each association as being like a rubber band that
								pulls a pair of words together; each word is being pulled in
								multiple different directions, with different strengths of
								association, and its net position is the result of all of those
								pulls</item>
						</list>
					</p>
				</lectureNote>
				<tutorial>
					<p>One more thing we can do to explore the word information in our vector space
						model: we can examine the relationships between words, taking advantage of
						the fact that each word is represented as a vector, which is essentially a
						list of numbers we can do arithmetic with</p>
					<p>To understand how this works, we need to envision a little more clearly how
						words are positioned in this vector space model: <list>
							<item>during the training process, the word2vec algorithm is examining
								the text (looking through its little window at successive groupings
								of words)</item>
							<item>and with each observation, it adjusts the position of the words in
								the model to take into account the word associations it
								observes</item>
							<item>so for a word like <mentioned>bank</mentioned>, it might observe
								some instances where that word is associated with words like
									<mentioned>funds</mentioned> and <mentioned>revenue</mentioned>,
								and so it moves <mentioned>bank</mentioned> closer to those words:
								it adds information that makes an association between these
								words</item>
							<item>then maybe it observes some other instances where
									<mentioned>bank</mentioned> is associated with words like
									<mentioned>river</mentioned> and <mentioned>lake</mentioned> and
									<mentioned>Hudson</mentioned>, and it moves the word
									<mentioned>bank</mentioned> a little closer to those
								words</item>
							<item>so by the end, each word is positioned in vector space in a way
								that reflects its associations (some weaker and some stronger) with
								many of the other words in the corpus</item>
							<item>we can think of each association as being like a rubber band that
								pulls a pair of words together; each word is being pulled in
								multiple different directions, with different strengths of
								association, and its net position is the result of all of those
								pulls</item>
						</list>
					</p>
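					<p>The rubber-band picture can be sketched as a tiny update rule: nudge a
						word's vector a small step toward the words it was observed with. The
						Python sketch below is only an analogy for the real word2vec update (which
						works through a neural network's gradients), and the positions are
						invented:</p>

```python
def nudge(word_vec, context_vecs, rate=0.1):
    # Move the word a small step toward the average of its observed
    # context words, like a rubber band pulling it in their direction
    dims = len(word_vec)
    avg = [sum(v[i] for v in context_vecs) / len(context_vecs) for i in range(dims)]
    return [w + rate * (a - w) for w, a in zip(word_vec, avg)]

# Invented 2-D positions; real models use dozens or hundreds of dimensions
bank = [0.5, 0.5]
money_context = [[0.9, 0.1], [1.0, 0.2]]   # e.g. funds, revenue
water_context = [[0.1, 0.9], [0.0, 1.0]]   # e.g. river, lake

bank = nudge(bank, money_context)  # pulled toward the financial zone
bank = nudge(bank, water_context)  # then pulled toward the river zone
print(bank)  # its net position reflects both pulls
```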
				</tutorial>
			</section>
			<section>
				<head>Vector Math 2</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_vectorMath2.png"/>
					</figure>
				</slide>

				<lectureNote>
					<p>We can use this information to tease out more specific semantic spaces for
						individual words: <list>
							<item>For instance, we might imagine that the word
									<mentioned>grace</mentioned> has some rubber band pulling it
								towards a word like <mentioned>beauty</mentioned>. What if we cut
								that rubber band? What part of the semantic field might pull
									<mentioned>grace</mentioned> more into their orbit if
									<mentioned>beauty</mentioned> were out of the running? We can
								find this out by <soCalled>subtracting</soCalled> the vector for
									<mentioned>beauty</mentioned> from the vector for
									<mentioned>grace</mentioned>: the result is a set of
								associations that are specifically religious </item>
							<item>Similarly, instead of cutting that rubber band, we might intensify
								its strength and allow it to pull <mentioned>grace</mentioned>
								towards it more strongly (putting <mentioned>grace</mentioned> into
								a zone where its aesthetic associations are most powerful). We can
								do this by <soCalled>adding</soCalled> the vector for
									<mentioned>beauty</mentioned> to the vector for
									<mentioned>grace</mentioned>.</item>
						</list>
					</p>
					<p>Note that words here are just proxies or symptoms (imperfect ones) for the
						concepts we might be interested in: <list>
							<item>As we think about what words to <soCalled>add</soCalled> or
									<soCalled>subtract</soCalled>, it's important to think about how
								those words are related to the concept we're trying to examine (and
								it's worth trying different words)</item>
							<item>Also, the semantic associations of words are very corpus-specific:
								in a corpus of financial documents, the term
									<mentioned>grace</mentioned> might be exclusively associated
								with the <mentioned>grace period</mentioned> for bill payment</item>
							<item>So knowing our corpus is really crucial</item>
						</list>
					</p>
				</lectureNote>
				<tutorial>
					<p>We can use this information to tease out more specific semantic spaces for
						individual words: <list>
							<item>For instance, we might imagine that the word
									<mentioned>grace</mentioned> has some rubber band pulling it
								towards a word like <mentioned>beauty</mentioned>. What if we cut
								that rubber band? What part of the semantic field might pull
									<mentioned>grace</mentioned> more into their orbit if
									<mentioned>beauty</mentioned> were out of the running? We can
								find this out by <soCalled>subtracting</soCalled> the vector for
									<mentioned>beauty</mentioned> from the vector for
									<mentioned>grace</mentioned>: the result is a set of
								associations that are specifically religious </item>
							<item>Similarly, instead of cutting that rubber band, we might intensify
								its strength and allow it to pull <mentioned>grace</mentioned>
								towards it more strongly (putting <mentioned>grace</mentioned> into
								a zone where its aesthetic associations are most powerful). We can
								do this by <soCalled>adding</soCalled> the vector for
									<mentioned>beauty</mentioned> to the vector for
									<mentioned>grace</mentioned>.</item>
						</list>
					</p>
					<p>Note that words here are just proxies or symptoms (imperfect ones) for the
						concepts we might be interested in: <list>
							<item>As you think about what words to <soCalled>add</soCalled> or
									<soCalled>subtract</soCalled>, it's important to think about how
								those words are related to the concept you're trying to examine (and
								it's worth trying different words)</item>
							<item>Also, the semantic associations of words are very corpus-specific:
								in a corpus of financial documents, the term
									<mentioned>grace</mentioned> might be exclusively associated
								with the <mentioned>grace period</mentioned> for bill payment</item>
							<item>So knowing our corpus is really crucial</item>
						</list>
					</p>
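					<p>The <soCalled>adding</soCalled> and <soCalled>subtracting</soCalled> here
						is ordinary element-wise vector arithmetic. A minimal Python sketch with
						invented two-dimensional stand-ins (imagine the first dimension as
						aesthetic association and the second as religious association):</p>

```python
def add(a, b):
    # Element-wise vector addition
    return [x + y for x, y in zip(a, b)]

def subtract(a, b):
    # Element-wise vector subtraction
    return [x - y for x, y in zip(a, b)]

# Invented 2-D stand-ins: read dimension 0 as "aesthetic" association
# and dimension 1 as "religious" association
grace = [0.6, 0.7]
beauty = [0.9, 0.1]

print(subtract(grace, beauty))  # aesthetic pull cut: the religious side dominates
print(add(grace, beauty))       # aesthetic pull intensified: that side dominates
```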
				</tutorial>

			</section>

			<section>
				<head>Validation</head>
				<slide>
					<p>How do you know your model is telling you something meaningful? <list>
							<item>Do you get consistent results from model to model?</item>
							<item>Do you get plausible word groupings?</item>
							<item>Does <soCalled>vector math</soCalled> work as you would
								expect?</item></list>
					</p>
					<figure>
						<graphic url="../../../_utils/gfx/bad_model_fish.png" height="600px"/>
					</figure>
				</slide>
				<lectureNote>
					<p>As we use our model in these various ways, we're going to get some results
						(hopefully) that look very predictable, and some others that look
						provocative and fascinating, and maybe some others that look bizarre and
						unexpected. How can we tell the difference between an interpretive
						breakthrough and a glitch resulting from some terrible flaw in our training
						process?</p>
					<p>Once we've generated a model, there are ways we can and should test it to see
						whether it is actually a useful representation that will give research
						results we can use. That testing process is called <term>validation</term>.
						To validate a model, we can ask questions like these:</p>
					<p>Are your results consistent across models? <list><item>When you train a
								series of models on the same corpus using the same parameters, do
								you get consistent cosine similarities for the same sets of words? </item>
							<item>(Note: because training a model is a probabilistic process, you
								won’t get identical results from model to model, even if they’re
								trained on exactly the same corpus, but the results should be
								comparable.)</item></list></p>
					<p>Do you get plausible word groupings? <list>
							<item>When you generate groups of <soCalled>similar</soCalled> terms
								(either by generating clusters, or by querying a specific word), do
								you get plausibly related groups of words for common and moderately
								common query terms? (Common within your corpus, that is!) </item>
							<item>If you don’t get plausible groupings for moderately common words,
								this would be a sign to proceed with caution; if you don’t get
								plausible groupings for even common words, this would be a sign that
								the model may not be very useful (this might be because of small
								corpus size, or some other factor).</item>
						</list></p>
					<p>Does <soCalled>vector math</soCalled> work as you would expect? <list>
							<item>When you do the various forms of vector math (addition,
								subtraction) do your results continue to seem plausible? </item>
						</list>
					</p>
					<p>
						<emph>If we didn't stop before, consider stopping now!</emph>
					</p>
				</lectureNote>
				<tutorial>
					<p>As you use your model in these various ways, you're going to get some results
						that hopefully look very predictable, and some others that look provocative
						and fascinating, and maybe some others that look bizarre and unexpected.
						With all these interesting results, how can we tell the difference between
						an interpretive breakthrough and a glitch resulting from some terrible flaw
						in our training process?</p>
					<p>Once we've generated a model, there are ways we can and should test it to see
						whether it is actually a useful representation that will give research
						results we can use. That testing process is called <term>validation</term>.
						To validate a model, we can ask questions like these:</p>
					<p>Are your results consistent across models? <list><item>When you train a
								series of models on the same corpus using the same parameters, do
								you get consistent cosine similarities for the same sets of words? </item>
							<item>It is important to note that because training a model is a
								probabilistic process, you won’t get identical results from model to
								model, even if they’re trained on exactly the same corpus, but the
								results should be comparable.</item></list></p>
					<p>Do you get plausible word groupings? <list>
							<item>When you generate groups of <soCalled>similar</soCalled> terms
								(either by generating clusters, or by querying a specific word), do
								you get plausibly related groups of words for common and moderately
								common query terms? (Common within your corpus, that is!) </item>
							<item>If you don’t get plausible groupings for moderately common words,
								this would be a sign to proceed with caution; if you don’t get
								plausible groupings for even common words, this would be a sign that
								the model may not be very useful (this might be because of small
								corpus size, or some other factor).</item>
						</list></p>
					<p>Does <soCalled>vector math</soCalled> work as you would expect? <list>
							<item>When you do the various forms of vector math (addition,
								subtraction) do your results continue to seem plausible? </item>
						</list>
					</p>
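					<p>A simple consistency check can be sketched in Python: compute the cosine
						similarity of the same word pair in two separately trained models and
						compare. The vectors below are invented; with a real pair of models you
						would loop over many word pairs:</p>

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Invented vectors for the same word pair from two separately trained models:
# the raw numbers differ (training is probabilistic), but the relationship
# between the words should be comparable
model_a = {"sacred": [0.9, 0.2], "holy": [0.8, 0.3]}
model_b = {"sacred": [0.2, 0.9], "holy": [0.3, 0.85]}

sim_a = cosine_similarity(model_a["sacred"], model_a["holy"])
sim_b = cosine_similarity(model_b["sacred"], model_b["holy"])
print(abs(sim_a - sim_b))  # a small gap is a good sign of consistency
```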
				</tutorial>
			</section>




			<section>
				<head>Circling back: another look at vectors</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_vector.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>Now that we've worked through the basic concepts, let's circle back and
						consider the whole picture of <soCalled>word vectors</soCalled> or
							<soCalled>word embedding models</soCalled>, and introduce a few
						additional complexities.</p>
					<p>[if starting the day here, check in and see if people want to recap
						anything]</p>
					<p>A quick review: we've already noted that a vector is basically a line that
						has both a specific length and a specific direction or orientation in space:<list>
							<item>so here again in this example, the vector is a line that starts at
								the origin (the point where all three axes are at zero) and extends
								out to the point in space where the x axis is at 3, the y axis is at
								2, and the z axis is at zero</item>
							<item>we can think of those three axes as representing three pieces of
								information: together, they constitute a unique vector in
								three-dimensional space. </item>
							<item>I'm going to pause here for a moment and let the diagram sink in a
								bit more, because at this stage in the explanation, it helps to have
								a sense of what the diagram is telling us. Does anyone want to test
								out their understanding of how those three axes (the x, y, and z)
								are contributing information to the direction and distance of that
								vector? Does everyone see how the blue number 3 comes from the blue
								x axis? etc.?</item>
						</list>
					</p>

				</lectureNote>
				<tutorial>
					<p>Now that we've worked through the basic concepts, let's circle back and
						consider the whole picture of <soCalled>word vectors</soCalled> or
							<soCalled>word embedding models</soCalled>, and introduce a few
						additional complexities.</p>
					<p>So far, we've already noted that a vector is basically a line that has both a
						specific length and a specific direction or orientation in space:<list>
							<item>so here again in this example, the vector is a line that starts at
								the origin (the point where all three axes are at zero) and extends
								out to the point in space where the x axis is at 3, the y axis is at
								2, and the z axis is at zero</item>
							<item>we can think of those three axes as representing three pieces of
								information: together, they constitute a unique vector in
								three-dimensional space. </item>
							<item>Pause here and let the diagram sink in a bit more, because at this
								stage in the explanation, it helps to have a sense of what the
								diagram is telling us. Try and test out your understanding of how
								those three axes (the x, y, and z) are contributing information to
								the direction and distance of that vector. Do you see how the blue
								number 3 comes from the blue x axis? etc.?</item>
						</list>
					</p>
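					<p>As a quick check on the diagram, the vector's length follows from its three
						coordinates by the Pythagorean formula, as in this short Python sketch:</p>

```python
import math

# The vector from the slide: x = 3, y = 2, z = 0
v = (3, 2, 0)

# Its length is the straight-line distance from the origin to that point,
# by the three-dimensional Pythagorean formula
length = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
print(length)  # about 3.61
```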
				</tutorial>

			</section>

			<section>
				<head>Words as vectors</head>
				<slide>
					<figure>
						<graphic height="400px" url="../../../_utils/gfx/w2v_vector_matrices.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>The example we were just looking at shows a vector defined by three
						dimensions: three different numbers representing three different axes of
						meaning. However, when we're working with word embedding models, we are
						working with vectors that are defined by many more dimensions. So in order
						to understand that scenario, we need to get a little more comfortable with
						two ideas: <list>
							<item>a vector is just an assemblage of dimensions</item>
							<item>each dimension represents an association that has been
								observed</item>
						</list>
					</p>
					<p>So let's take the first example on this slide (the idea may look familiar if
						you read the Jay Alammar <title>Illustrated Introduction to Word
							Embeddings</title>): <list>
							<item>Our little chart here shows three people (Jo, Lee, Robin) and for
								each person it shows an assemblage of dimensions</item>
							<item>each dimension represents an association</item>
							<item>in this case, that association has to do with the person's
								affinity for specific animals: perhaps through observations of how
								many pets of each type they have, or their response when they
								encounter the animal in the wild</item>
							<item>So each person in this chart is represented by a vector with five
								dimensions: a line in five-dimensional space</item>
							<item>if we want to compare two people and find out whether they tend to
								like the same animals, impressionistically we can say that people
								with <soCalled>high</soCalled> or <soCalled>low</soCalled> affinity
								for the same animals are similar: the color coding is highlighting
								this pattern.</item>
							<item>But if we want a quantitative way to talk about that similarity,
								we can use the measure called <term>cosine similarity</term>, which
								is a way of measuring the angle between two lines (strictly, the
								cosine of that angle: close to 1 when the lines point in nearly the
								same direction). </item>
							<item>Here, we're doing exactly the same thing, except that our
									<soCalled>lines</soCalled> are defined by five dimensions
								instead of three.</item>
							<item>the calculation isn't hard (you can find an Excel version on the
								web!) and what it shows us is that the cosine similarity between our
								two mammal-lovers is very high, whereas the similarity between the
								mammal-lovers and the person who prefers birds/lizards/beetles is
								quite low.</item>
						</list>
					</p>
					<p>Pause for questions and reflection!</p>
					<p>So, taking this a step further, let's look at the chart on the right: <list>
							<item>Here, instead of looking at people and their association with
								animals, we're looking at words and their association with other
								words</item>
							<item>those <soCalled>other words</soCalled> have been observed in
								proximity (in the <term>window</term>) with our <term>target
									word</term>, to a greater or lesser extent</item>
							<item>we're not giving numbers here, but imagine that the green boxes
								are the ones that were observed more often, and the orange boxes are
								the words that were observed less often, and maybe the
								greenish-yellow boxes were somewhere in the middle</item>
						</list>
					</p>
					<p>So what do we see when we look at the righthand chart? <list>
							<item>What kind of cosine similarity would we expect to find between
									<mentioned>danger</mentioned> and <mentioned>peril</mentioned>?
								A high or low similarity?</item>
							<item>How about between <mentioned>danger</mentioned> and
									<mentioned>horses</mentioned>? <mentioned>Horses</mentioned> and
									<mentioned>goats</mentioned>?</item>
						</list>
					</p>
					<p>A few interesting things to note: <list>
							<item>all of the words are contributing information to each of the
								vectors, even when the actual association observed is low (I'll come
								back to this in a minute)</item>
							<item>and in fact the chart goes way off the edge of the screen to the
								right: there could in principle be hundreds of words contributing to
								the distinctive vector that is <mentioned>danger</mentioned></item>
						</list>
					</p>
				</lectureNote>
				<tutorial>
					<p>The previous example shows a vector defined by three dimensions: three
						different numbers representing three different axes of meaning. However,
						when we're working with word embedding models, we are working with vectors
						that are defined by many more dimensions. So in order to understand what
						that all means, we need to get a little more comfortable with two ideas: <list>
							<item>a vector is just an assemblage of dimensions</item>
							<item>each dimension represents an association that has been
								observed</item>
						</list>
					</p>
					<p>So let's take the first example shown here (the idea may look familiar if you
						read the Jay Alammar <title>Illustrated Introduction to Word
							Embeddings</title>): <list>
							<item>The chart shows three people (Jo, Lee, Robin) and for each person
								it shows an assemblage of dimensions</item>
							<item>each dimension represents an association</item>
							<item>in this case, that association has to do with the person's
								affinity for specific animals: perhaps through observations of how
								many pets of each type they have, or their response when they
								encounter the animal in the wild</item>
							<item>So each person in the chart is represented by a vector with five
								dimensions: a line in five-dimensional space</item>
							<item>if we want to compare two people and find out whether they tend to
								like the same animals, impressionistically we can say that people
								with <soCalled>high</soCalled> or <soCalled>low</soCalled> affinity
								for the same animals are similar. The chart is color-coded to
								highlight this pattern.</item>
							<item>But if we want a quantitative way to talk about that similarity,
								we can use the measure called <term>cosine similarity</term>, which
								is a way of measuring the angle between two lines (strictly, the
								cosine of that angle: close to 1 when the lines point in nearly the
								same direction). </item>
							<item>Here, we're doing exactly the same thing, except that our
									<soCalled>lines</soCalled> are defined by five dimensions
								instead of three.</item>
							<item>the calculation isn't hard (you can even find an Excel version on
								the web), and what it shows us is that the cosine similarity
								between our two mammal-lovers is very high, whereas the similarity
								between the mammal-lovers and the person who prefers
								birds/lizards/beetles is quite low.</item>
						</list>
					</p>
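					<p>To make that calculation concrete, here is a minimal Python sketch of cosine
						similarity. The affinity numbers are invented for illustration (the chart
						on the slide is color-coded rather than numeric), but the formula is the
						standard one: the dot product of two vectors divided by the product of
						their lengths.</p>

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical affinity scores for five animals; each person is a
# vector with five dimensions, one per animal.
jo    = [0.9, 0.8, 0.9, 0.1, 0.2]   # mammal-lover
lee   = [0.8, 0.9, 0.8, 0.2, 0.1]   # mammal-lover
robin = [0.1, 0.2, 0.1, 0.9, 0.8]   # prefers birds and lizards

print(cosine_similarity(jo, lee))    # very high: close to 1
print(cosine_similarity(jo, robin))  # much lower
```

					<p>Word vectors work exactly the same way: only the number of dimensions
						changes.</p>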

					<p>So, taking this a step further, look at the chart on the right side: <list>
							<item>Here, instead of looking at people and their association with
								animals, we're looking at words and their association with other
								words</item>
							<item>those <soCalled>other words</soCalled> have been observed in
								proximity (in the <term>window</term>) with our <term>target
									word</term>, to a greater or lesser extent</item>
							<item>we're not giving numbers here, but imagine that the green boxes
								are the ones that were observed more often, and the orange boxes are
								the words that were observed less often, and maybe the
								greenish-yellow boxes were somewhere in the middle</item>
						</list>
					</p>
					<p>So what do we see when we look at the righthand chart? <list>
							<item>What kind of cosine similarity would we expect to find between
									<mentioned>danger</mentioned> and <mentioned>peril</mentioned>?
								A high or low similarity?</item>
							<item>How about between <mentioned>danger</mentioned> and
									<mentioned>horses</mentioned>? <mentioned>Horses</mentioned> and
									<mentioned>goats</mentioned>?</item>
						</list>
					</p>
					<p>A few interesting things to note: <list>
							<item>all of the words are contributing information to each of the
								vectors, even when the actual association observed is low (I'll come
								back to this in a minute)</item>
							<item>and in fact the chart goes way off the edge of the screen to the
								right: there could in principle be hundreds of words contributing to
								the distinctive vector that is <mentioned>danger</mentioned></item>
						</list>
					</p>
				</tutorial>
			</section>
			<section>
				<head>Negative Sampling</head>
				<slide>
					<figure>
						<graphic height="400px"
							url="../../../_utils/gfx/w2v_vector_matrices_embedding.png"/>
					</figure>
				</slide>

				<lectureNote>
					<p>So let's now add another concept. Cast your mind back to the little bookworm
						eating through the corpus, making observations about the words that are
							<soCalled>near</soCalled> the target word, and adjusting the position of
						the words within the model. The information about those words that it
						observes is being fed into our little chart here. But how about the words
						that <emph>aren't</emph> being observed? </p>
					<p>We mentioned earlier that these are also significant. When the bookworm takes
						a bite, there are a huge number of words that are not in that sample, and
						the model training process could (in principle) use that information to
						adjust all of the words in the corpus, moving them <emph>away</emph> from
						the target word. <emph>In practice</emph>, it doesn't adjust all of the
						words (since that would be too much work) but it adjusts <emph>some</emph>
						of the words: a random sample. This is called <term>negative
						sampling</term>, and it is one of the parameters we can adjust: we can say
						how many of these non-appearing words should have their positions updated
						with each observation. If we have a large negative sampling value, the model
						training will be more precise, but the training process will take a lot
						longer. </p>
					<p>Looking again at our chart: If time and computing power were no object, we
						could imagine the chart extending off to the right so that every word in the
						corpus is listed, and we could imagine the position of every word in the
						model being adjusted with each observation, so that both the positive and
						negative sampling information would be fully reflected in the model. We
						could think of this situation as a kind of <soCalled>perfect</soCalled>
						model: <list>
							<item>showing all words exerting some probabilistic influence on each
								other</item>
							<item>in terms of text prediction, all words have <emph>some</emph>
								probability of being <soCalled>the next word</soCalled> even if that
								probability is very, very low</item>
							<item>in this <soCalled>perfect</soCalled> model, the vector for each
								word has as many dimensions as there are words in the corpus</item>
						</list>
					</p>
					<p>Let's test this idea a little further: <list><item>imagine that the window is
								the size of the corpus: now all words are related to all other words
								equally! Let that sink in for a moment: our understanding of the
									<soCalled>relatedness</soCalled> of words is strongly determined
								by our observational parameters: it isn't intrinsic, it's something
								we control.</item>
							<item>And in fact in some forms of unsupervised modeling, such as topic
								modeling, the window is in effect the entire document: the training
								process asks <said>which words appear in the same
								document?</said></item>
							<item>But in word embedding models, our concept of relatedness is a bit
								more precise than this: we are interested in things that are
								happening more at the sentence or phrase level, where the
								association between words reflects the way writers are actually
								articulating specific ideas</item>
						</list>
					</p>
					<p>One more look at our <soCalled>perfect</soCalled> model: <list>
							<item>note that it contains a lot of empty space: places where we are
								noticing that in fact the word <mentioned>toothbrush</mentioned> is
								not related to the words <mentioned>danger</mentioned>,
									<mentioned>horses</mentioned>, etc.</item>
							<item>without getting too far into the weeds, it turns out that this
								empty space is a problem: largely because it makes the data set
								very, very large.</item></list>
					</p>
					<p>So what do we do about that?</p>
				</lectureNote>

				<tutorial>
					<p>So let's now add another concept. Imagine the model training process to be a
						little bookworm eating through the corpus, making observations about the
						words that are <soCalled>near</soCalled> the target word, and adjusting the
						position of the words within the model. The information about those words
						that it observes is being fed into our chart here. But how about the words
						that <emph>aren't</emph> being observed? </p>
					<p>These words are also significant. When the bookworm takes a bite, there are a
						huge number of words that are not in that sample, and the model training
						process could (in principle) use that information to adjust all of the words
						in the corpus, moving them <emph>away</emph> from the target word. <emph>In
							practice</emph>, it doesn't adjust all of the words (since that would be
						too much work) but it adjusts <emph>some</emph> of the words: a random
						sample. This is called <term>negative sampling</term>, and it is one of the
						training parameters we can adjust: we can say how many of these
						non-appearing words should have their positions updated with each
						observation. If we have a large negative sampling value, the model training
						will be more precise, but the training process will take a lot longer. </p>
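					<p>The adjustment itself can be sketched in a few lines. This is a simplified
						illustration of the skip-gram-with-negative-sampling update, with invented
						3-dimensional vectors; real word2vec training also updates the target
						vector and uses many more dimensions, but the push/pull logic is the
						same.</p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_update(target, context, negatives, lr=0.1):
    """One negative-sampling step (simplified sketch).

    Pull the observed context word toward the target, and push each
    sampled 'negative' word away from it. (Real word2vec also
    adjusts the target vector itself.)"""
    # Observed (positive) pair: the label is 1
    g = lr * (1.0 - sigmoid(dot(target, context)))
    for i in range(len(target)):
        context[i] += g * target[i]
    # Sampled negative words: the label is 0
    for neg in negatives:
        g = lr * (0.0 - sigmoid(dot(target, neg)))
        for i in range(len(target)):
            neg[i] += g * target[i]

# Tiny invented vectors for illustration
target    = [0.5, 0.1, 0.3]
context   = [0.4, 0.2, 0.1]    # a word observed in the window
negatives = [[0.3, 0.3, 0.3]]  # one word standing in for the random sample

before_pos = dot(target, context)
before_neg = dot(target, negatives[0])
sgns_update(target, context, negatives)
print(dot(target, context) > before_pos)        # True: pulled closer
print(dot(target, negatives[0]) < before_neg)   # True: pushed away
```

					<p>A larger negative sampling value just means a longer
						<code>negatives</code> list at every step: more work per observation, but
						more information in the model.</p>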
					<p>Looking again at our chart, if time and computing power were no object, we
						could imagine the chart extending off to the right so that every word in the
						corpus is listed, and we could imagine the position of every word in the
						model being adjusted with each observation, so that both the positive and
						negative sampling information would be fully reflected in the model. We
						could think of this situation as a kind of <soCalled>perfect</soCalled>
						model: <list>
							<item>showing all words exerting some probabilistic influence on each
								other</item>
							<item>in terms of text prediction, all words have <emph>some</emph>
								probability of being <soCalled>the next word</soCalled> even if that
								probability is very, very low</item>
							<item>in this <soCalled>perfect</soCalled> model, the vector for each
								word has as many dimensions as there are words in the corpus</item>
						</list>
					</p>
					<p>To test this idea a little further: <list><item>imagine that the window is
								the size of the corpus: now all words are related to all other words
								equally! Let that sink in for a moment: our understanding of the
									<soCalled>relatedness</soCalled> of words is strongly determined
								by our observational parameters: it isn't intrinsic, it's something
								we control.</item>
							<item>And in fact in some forms of unsupervised modeling, such as topic
								modeling, the window is in effect the entire document: the training
								process asks <said>which words appear in the same
								document?</said></item>
							<item>But in word embedding models, our concept of relatedness is a bit
								more precise than this: we are interested in things that are
								happening more at the sentence or phrase level, where the
								association between words reflects the way writers are actually
								articulating specific ideas</item>
						</list>
					</p>
					<p>One more look at our <soCalled>perfect</soCalled> model: <list>
							<item>note that it contains a lot of empty space: places where we are
								noticing that in fact the word <mentioned>toothbrush</mentioned> is
								not related to the words <mentioned>danger</mentioned>,
									<mentioned>horses</mentioned>, etc.</item>
							<item>without getting too far into the weeds, it turns out that this
								empty space is a problem: largely because it makes the data set
								very, very large.</item></list>
					</p>
					<p>So what do we do about that?</p>
				</tutorial>
			</section>
			<section>
				<head>Embedding!</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_embedding.png"/>
					</figure>
				</slide>
				<lectureNote>
					<p>To make the model more compact, and hence easier to process while you wait,
						clever people developed a technique called <term>embedding</term> which
						flattens the model: reducing it from a very large number of dimensions
						(like, thousands) to a somewhat smaller number of dimensions (like,
						hundreds).</p>
					<p>For those of you who may have read Edwin Abbott's <title>Flatland</title>,
						you might remember how when a sphere visits Flatland, the two-dimensional
						creatures there see it as a circle: a three-dimensional entity
							<term>flattened</term> or <term>projected</term> onto two dimensions.
						Something similar sometimes happens to Wile E. Coyote.</p>
					<p>We are not going to cover the mathematics, but we will look at a few of its
						effects and results. </p>
					<p>In simple terms: <list>
							<item>in our <soCalled>perfect</soCalled> model, remember that the
								position of each word in the model is a vector, and that vector is
								essentially a complicated multidimensional number</item>
							<item>each of the <term>dimensions</term> of that number is another word
								in the corpus, one number for each word (even the
									<soCalled>unrelated</soCalled> words)</item>
							<item>in the <soCalled>flattened</soCalled> version, that is no longer
								true: the position of each word is still a vector, but that vector's
								dimensions are no longer individual words, and the number of
								dimensions is no longer the total vocabulary of the corpus. </item>
							<item>instead, we choose the number of dimensions, as one of the
								parameters for the training process </item>
							<item>the <term>embedding</term> process then compresses the model down
								to that number of dimensions, and reduces the empty space of the
								unrelated words.</item></list>
					</p>
					<p>So by specifying the number of dimensions, we are in effect specifying how
						many other words each word's position takes into account: <list>
							<item>if we choose a very low number of dimensions, the model will have
								very little information about the word relationships within our
								corpus</item>
							<item>if we choose a very high number of dimensions, the model will have
								a lot of information about the word relationships in the
								corpus</item>
							<item>however, the <soCalled>sweet spot</soCalled> is also going to
								depend on the total size and total vocabulary of the corpus: for a
								corpus with a tiny vocabulary, a large number of dimensions may not
								be very useful.</item>
						</list>
					</p>
					<p>I'm afraid there's a little <soCalled>magic happens here</soCalled> at this
						stage: the mathematical details are a little out of scope for this
						institute, but there are some good sources in the readings for those who
						want to understand this more fully.</p>
				</lectureNote>

				<tutorial>
					<p>To make the model more compact, and hence easier to process while you wait,
						clever people developed a technique called <term>embedding</term> which
						flattens the model: reducing it from a very large number of dimensions
						(like, thousands) to a somewhat smaller number of dimensions (like,
						hundreds).</p>
					<p>If you have read Edwin Abbott's <title>Flatland</title>, you might remember
						how when a sphere visits Flatland, the two-dimensional creatures there see
						it as a circle: a three-dimensional entity <term>flattened</term> or
							<term>projected</term> onto two dimensions. Something similar sometimes
						happens to Wile E. Coyote.</p>
					<p>We are not going to cover the mathematics, but we will look at a few of its
						effects and results. </p>
					<p>In simple terms: <list>
							<item>in our <soCalled>perfect</soCalled> model, the position of each
								word in the model is a vector, and that vector is essentially a
								complicated multidimensional number</item>
							<item>each of the <term>dimensions</term> of that number is another word
								in the corpus, one number for each word (even the
									<soCalled>unrelated</soCalled> words)</item>
							<item>in the <soCalled>flattened</soCalled> version, that is no longer
								true: the position of each word is still a vector, but that vector's
								dimensions are no longer individual words, and the number of
								dimensions is no longer the total vocabulary of the corpus. </item>
							<item>instead, we choose the number of dimensions, as one of the
								parameters for the training process </item>
							<item>the <term>embedding</term> process then compresses the model down
								to that number of dimensions, and reduces the empty space of the
								unrelated words.</item></list>
					</p>
					<p>So by specifying the number of dimensions, we are in effect specifying how
						many other words each word's position takes into account: <list>
							<item>if we choose a very low number of dimensions, the model will have
								very little information about the word relationships within our
								corpus</item>
							<item>if we choose a very high number of dimensions, the model will have
								a lot of information about the word relationships in the
								corpus</item>
							<item>however, the <soCalled>sweet spot</soCalled> is also going to
								depend on the total size and total vocabulary of the corpus: for a
								corpus with a tiny vocabulary, a large number of dimensions may not
								be very useful.</item>
						</list>
					</p>
					<p>There's a little <soCalled>magic happens here</soCalled> at this stage: the
						mathematical details are a little out of scope for this institute, but
						there are some good sources in the suggested readings if you want to
						understand this more fully.</p>

				</tutorial>
			</section>

			<!--
      <section>
        <head>Words and dimensions</head>
        <slide>
            <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_vector_words.png"/>
            </figure>
        </slide>
        <lectureNote>
          <p>Each of those pieces of information, that contribute to the precise direction of the vector, is coming from a word in the corpus. Putting this a different way: when we train a model based on a corpus of words, each word contributes a dimension (an informational axis) to the location of all the other words in the corpus.</p>
          <p>So in this version of the diagram, instead of just having an "x" axis, we have an axis that is showing us how much of the "bank" vector's length and direction is contributed by the word <mentioned>funds</mentioned>. And instead of a y axis, we have an axis that reflects the influence of the word <mentioned>river</mentioned>. And instead of a z axis, we have an axis that reflects the lack of influence of the word <mentioned>toothbrush</mentioned>.</p>
          <p>Note that for purposes of this slide, we are talking about the <soCalled>perfect</soCalled>, non-flattened model where each word contributes a distinct dimension, not the <soCalled>flattened</soCalled> version.</p>
          <p>Let's pause and let that sink in:
          <list>
            <item>during the training process, as the little bookworm is eating its way through the corpus, over and over, looking at the window of words, it is gathering information about word relationships</item>
            <item>each word is potentially related to all of the other words in the corpus (maybe only very slightly, or not at all in the case of <mentioned>bank</mentioned> and <mentioned>toothbrush</mentioned>)</item>
            <item>each word contributes information about the relative location of all of the other words</item>
            <item>in vector space, that information takes the form of a dimension, just like the dimensions we're looking at on this slide</item>
            <item>so in this diagram (representing the location of each vector using three dimensions) is drawing information from a corpus with very, very few words</item>
          </list>
          </p>
        </lectureNote>
      </section>
      <section>
  <head>Way more dimensions...</head>
  <slide>
            <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_vector_higher_dimensions.png"/>
            </figure>
        </slide>
  <lectureNote>
    <p>With all that in mind, let's try picturing a higher-dimensional reality....</p>
    <p>With a real-world corpus, our word vectors are defined with <emph>way more</emph> than two or three dimensions, although this gets very difficult to draw and to visualize in our minds</p>
    <p>But let's try to imagine it:
      <list>
        <item>if we think of a vector (approximately) as a line that connects two points</item>
        <item>and if we think of each of those points as a piece of information that comes from a dimensional axis (that is, like one of our familiar three "dimensions" in geometry)</item>
        <item>then we can imagine, sort of, having more than three of these axes to work with: we would just have a lot more information about that line</item>
        <item>instead of specifying its path in 2- or 3-dimensional space, we'd be specifying it in higher-dimensional space</item>
        <item>and we would need a lot more coordinates</item>
      </list>
    </p>
       
      <p>In word embedding models, we might have hundreds of dimensions:
      
      <list>
        <item>each of those dimensions is one of those rubber bands we saw a moment ago, helping to determine the word's position in vector space</item>    
      <item>so a trained word embedding model represents a textual corpus as a huge multidimensional cloud of words</item>
      <item>each word represents a distinct vector</item>
      <item>the location of each word within that cloud is based on the words it tends to appear with in the corpus (so, words that tend to appear in the same contexts in the corpus will be near one another in vector space)</item>
    </list>
    
    </p>
  

          <p>At this stage we can also come back to look with more expert eyes at the parameters that we talked about earlier: the settings we can control as part of the model training process.</p>
          <p>Two parameters we're already familiar with:
          <list>
            <item>Window: the number of words on either side of our <term>target word</term> that are considered <soCalled>nearby</soCalled> or <soCalled>related</soCalled> to the target word</item>
            <item>Iterations: the number of times we run through the corpus during the model training (the number of times the <term>window</term> gets passed through the text)</item>
          </list>
          </p>
          <p>We can now add another: we can control the number of dimensions in our model. What does this really mean?
            <list>
              <item>We know that each word in our model is described by a vector that locates it in the vector space</item>
              <item>And we just talked about how that vector is based on the words that tend to appear nearby</item>
              <item>In fact, the numeric components of that vector (the distance travelled in each dimension of the vector space) are information about other words</item>
              <item>so by specifying the number of dimensions, we are in effect specifying how many other words each word's position takes into account</item>
              <item>if we choose a very low number of dimensions, the model will have very little information about the word relationships within our corpus</item>
              <item>if we choose a very high number of dimensions, the model will have a lot of information about the word relationships in the corpus</item>
              <item>however, the <soCalled>sweet spot</soCalled> is also going to depend on the total size and total vocabulary of the corpus: for a corpus with a tiny vocabulary, a large number of dimensions may not be very useful.</item>
            </list>
          </p>
        </lectureNote>
      </section>
      
          <section>
      <head>Negative sampling</head>
        <slide>
            <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_negative_sampling.png"/>
            </figure>
        </slide>
      <lectureNote>
        <p>Finally we come to the most abstruse parameter of all: negative sampling. To understand what this is, we need first to remind ourselves about the model training process:
          <list>
            <item>during the training process, each time we go through the corpus, with each observation we make we get more information about the relative positions of all the words in the corpus</item>
            <item>so with each iteration, our model becomes more and more accurate</item>
            <item>however, given that there are thousands of words represented in the model, if we update the information we have about every single word every time we go through the corpus, it's going to take a very long time to train the model, and our computer is going to have to work very hard</item></list></p>
        <p>So negative sampling is a way to reduce that work:
          <list>
            <item>instead of updating our information on every word in the corpus, we only update the words we directly observe within the window (so looking at this slide, which words would those be?)</item>
            <item>plus a random sampling of the other words in the model (in this slide: <mentioned>revolution</mentioned>, <mentioned>elderly</mentioned>, <mentioned>pinnacle)</mentioned></item>
            <item>the negative sampling parameter specifies how many random words to update with each observation</item>
          </list>
        </p>
        <p>Any ideas about what effect a large negative sampling value would have on our model and on the training process?</p>
        
      </lectureNote>
    </section> 
            <section>
        <head>Embedding (semi-technical)</head>
        <slide>
            <p>Embedding:
            <list>
              <item>A way of reducing the dimensionality of our vector space model</item>
              <item>Makes it more dense</item>
              <item>Makes it faster to process</item>
            </list>
            
            </p>
        </slide>
        <lectureNote>
          <p>We're now ready to explore one term we haven't defined yet: <term>embedding</term>, which is a curious term, especially in the context of the phrase <q>word embedding models</q>. To explain embedding I can offer a semi-technical view, and then a metaphorical view, and you can decide which works better.</p>
          <p>For the semi-technical view, we need to remember a few things:
          <list>
            <item>a <term>word vector</term> is a way of representing a single feature of a corpus: the behavior of a single word</item>
            <item>and a vector represents a word in a corpus as a set of dimensions; each dimension is a possible piece of information that narrows down the location of that word in vector space</item>
            <item>by default, without <soCalled>embedding</soCalled>, there would be as many dimensions as there are vocabulary terms in the corpus: potentially thousands</item>
            <item>each added dimension multiplies the time needed to do computations on the model; computers are fast, but not that fast</item>
            <item>and also, for math reasons I won't go into here, it turns out that those dimensions are mostly empty space anyway: metaphorically, most words actually don't have anything to do with each other!</item>
          </list>
          </p>
          <p>So <soCalled>embedding</soCalled> is a way of reducing the number of dimensions we're working with: embedding some of the dimensions in each other and eliminating the empty space.</p>
        </lectureNote>
      </section>
      
            <section>
        <head>Embedding (metaphorical)</head>
        <slide>
            <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_embedding.png"/>
            </figure>
        </slide>
        <lectureNote>
          <p>Metaphorically, we can imagine embedding as being sort of like flattening, or like a projection:
          <list>
            <item>For anyone who has read Edwin Abbott's <title>Flatland</title>, when a two-dimensional creature like a square sees a sphere, it sees it as a circle: that is, the sphere projected onto a 2-dimensional surface</item>
            <item>Or for fans of more modern culture, when Wile E. Coyote gets flattened by a falling object, all of his three-dimensional features get projected onto a single two-dimensional plane, making him denser and easier to process</item>
          </list>
          
          </p>
          <p>Questions at this stage?</p>
        </lectureNote>
      </section>
-->
			<!-- 
 
      <section>
        <head>Skip-gram, continuous bag-of-words (CBOW)</head>
        <slide>
             <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_cbow_skipgram1.png"/>
            </figure>
        </slide>
        <lectureNote>
          <p>This is where we really start to come face to face with the <soCalled>machine learning</soCalled> aspects of word embedding models: these two terms (and their difference) expose for our view some of the specifics of how the machine <soCalled>learns</soCalled> the model.</p>
          <list><item>Both of these terms refer to the process by which the training process examines the text (through the <term>window</term> we talked about earlier) and specifically what the computer does with the information it sees in that window.</item>
          <item>Each one approaches that process a little differently</item>
            <item>The difference between the two is not really important for us to master (it won't affect our analysis significantly)</item>
            <item>But it may help us see into that training process a little further, so let's see how it goes</item>
          </list>
          </lectureNote></section>
      
      <section>
        <head>Common ground</head>
         <slide>
             <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_cbow_skipgram2.png"/>
            </figure>
        </slide>
        <lectureNote>
        <p>Let's focus first on what these two approaches have in common:
        <list>
          <item>In both cases, as we discussed a moment ago, when we are training a model (creating a model based on a text corpus), the training process works its way through the corpus examining each word in the corpus and deciding where to locate it in vector space, based on the words around it</item>
          <item>In both cases, <soCalled>the words around it</soCalled> are determined by size of the <term>window</term> that we set</item>
        </list>
        </p>
          </lectureNote>
        </section>
      <section>
        <head>Continuous bag-of-words</head>
         <slide>
             <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_cbow_skipgram3.png"/>
            </figure>
        </slide>
        <lectureNote>
        <p>So far so good. The differences between the two are a little trickier:
        <list>
          <item>the continuous bag-of-words approach treats the entire contents of the window as a <term>bag of words</term>: an unordered group</item>
          <item>and what the training process does with this bag of words is attempt to predict the target word (in this case <mentioned>times</mentioned>) based on the context words and what it has learned about their relationship to the target word in its examination of the corpus</item>
          <item>the first few times through the process, these predictions are lousy!</item>
          <item>but each time the process eats its way through the text, it updates its knowledge of word relationships and improves its predictions (this is the <soCalled>machine learning</soCalled> at work)</item>
          <item>and when the predictions are no longer lousy, the model has been fully trained</item>
        </list>
        
        </p>  
</lectureNote></section>
      <section>
        <head>Skip-gram</head>
         <slide>
             <figure>
              <graphic height="600px" url="../../../_utils/gfx/w2v_cbow_skipgram4.png"/>
            </figure>
        </slide>
        <lectureNote>
          <p>In the skip-gram version of this process, the training process is still eating its way through the text, but in this case it treats the contents of the <term>window</term> a little differently:
<list>
  <item>it considers the <term>target word</term> (in this case again, <mentioned>times</mentioned>) together with each word in the window</item>
  <item>and in each case, it tries to predict the <term>context word</term> (<mentioned>it</mentioned>, <mentioned>was</mentioned>, <mentioned>the</mentioned>, etc.) based on the target word</item>
  <item>The term <term>skip-gram</term> is analogous to <term>n-gram</term>; each pair of words is a <term>skip-gram</term> in the sense that it skips over the intervening words within the window.</item>
  <item>skip-gram <emph>classifies</emph> words based on context: it predicts context words based on target words</item>
  <item>as above, the first time through the training process, these predictions are terrible, but they get refined and improved with each iteration.</item>
</list>
</p>
          <p>What does this mean for the output? At this stage, we can note that the bag-of-words approach is a little better for smaller data sets, while the skip-gram approach is a little better for larger data sets, for reasons which I confess I don't completely understand at this point. </p>
          
          <p>[Detail if needed: <quote>This inversion might seem like an arbitrary choice, but statistically it has the effect that CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets. However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets.</quote> (https://www.tensorflow.org/tutorials/representation/word2vec)]</p>
          
         
            
        </lectureNote>
      </section>
       -->



			<section>
				<head>The word vector process: Data preparation</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_process_overview1.png"
						/>
					</figure>
				</slide>
				<lectureNote>
					<p>So another way to put this all together is to walk through the entire process
						in order, step by step. There are basically three major acts in this drama,
						very much like a classic comedy.</p>
					<p>In the first act, we set up the problem and introduce the main characters: <list>
							<item>We analyse our problem and establish a set of research questions
								we want to focus on</item>
							<item>We gather a corpus of documents that are relevant to this
								research; at this stage they may be a motley bunch cobbled together
								from various sources, with differing quality, accuracy,
								transcription conventions, etc.</item>
							<item>And we might do some data cleanup on the corpus to improve
								consistency or make the data better suited to our research: for
								instance, filtering out unnecessary information like page numbers,
								or regularizing/modernizing spelling </item>
						</list>
					</p>
					<p>As part of this process, we might discover things that cause us to reassess
						or expand our research question: so it's helpful to keep an open mind and be
						prepared to treat this as an iterative process.</p>
				</lectureNote>

				<tutorial>

					<p>So another way to put this all together is to walk through the entire process
						in order, step by step. There are basically three major acts in this drama,
						very much like a classic comedy.</p>
					<p>In the first act, we set up the problem and introduce the main characters: <list>
							<item>We analyse our problem and establish a set of research questions
								we want to focus on</item>
							<item>We gather a corpus of documents that are relevant to this
								research; at this stage they may be a motley bunch cobbled together
								from various sources, with differing quality, accuracy,
								transcription conventions, etc.</item>
							<item>And we might do some data cleanup on the corpus to improve
								consistency or make the data better suited to our research: for
								instance, filtering out unnecessary information like page numbers,
								or regularizing/modernizing spelling </item>
						</list>
					</p>
					<p>As part of this process, we might discover things that cause us to reassess
						or expand our research question: so it's helpful to keep an open mind and be
						prepared to treat this as an iterative process.</p>

				</tutorial>
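The cleanup step described above can be sketched in a few lines of Python. The specific regularizations here (dropping page-number lines, modernizing long-s, lowercasing) are illustrative assumptions about what a corpus might need, not the workshop's actual preprocessing script.

```python
import re

def clean_text(raw):
    """Minimal corpus cleanup: drop page-number lines, normalize case and spacing."""
    lines = raw.splitlines()
    # Drop lines that contain only a page number (an assumption about the source data)
    lines = [ln for ln in lines if not re.fullmatch(r"\s*\d+\s*", ln)]
    text = " ".join(lines)
    # Modernize long-s and lowercase everything (illustrative regularizations)
    text = text.replace("ſ", "s").lower()
    # Collapse runs of whitespace left over from joining lines
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("It was the beſt of times\n42\nit was the worst of times"))
# → "it was the best of times it was the worst of times"
```

Real projects usually accumulate many such small rules, which is one reason it pays to keep the cleanup script under version control alongside the corpus.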
			</section>

			<section>
				<head>The word vector process: Training the model</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_process_overview2.png"
						/>
					</figure>
				</slide>
				<lectureNote>
					<p>In the second act, we get the real meat of the plot: in this case, the
						process where we train our model and create a vector space representation of
						our corpus: <list>
							<item>First, we set the parameters for the training process: we choose
								the window size (which does what?); we set the number of iterations
								(which does what?); we set the number of dimensions (which does
								what?) and we set the negative sampling (which does what?)</item>
							<item>Second, we actually run the training process: our little
								caterpillar eats its way through the corpus, taking bigger or
								smaller bites (depending on window size), the number of times
								through depends on the number of iterations we set.</item>
							<item>And third, we validate the model: we test it for
								plausibility</item>
						</list>
					</p>
				</lectureNote>

				<tutorial>

				
						<p>In the second act, we get the real meat of the plot: in this case, the
							process where we train our model and create a vector space
							representation of our corpus: <list>
								<item>First, we set the parameters for the training process: we
									choose the window size (how many words on either side of our
									target word we want to look at); we set the number of iterations
									(how many times to run through the data); we set the number of
									dimensions (how much we want the model to be flattened) and we
									set the negative sampling (a large number will make the model
									more precise but will take longer to train)</item>
								<item>Second, we actually run the training process: our little
									caterpillar eats its way through the corpus, taking bigger or
									smaller bites (depending on window size), and the number of
									times through depends on the number of iterations we set. You
									may want to tweak the parameters depending on what your model
									looks like at the end of the training process. It can even be a
									good idea to train and save many models with different
									parameters so that you can test them out. However, make sure to
									note the parameters you chose for each model so that you can
									reproduce your results.</item>
								<item>And third, we validate the model: we test it for plausibility.
									Plausibility is going to mean something different to different
									researchers. Keep your research question in mind as you approach
									the validation stage as this may influence <emph>how</emph> you
									validate. </item>
							</list>
						</p>
				</tutorial>
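To make the window parameter concrete, here is a small sketch of the (target, context) pairs a skip-gram-style training pass would extract from a sentence. This is a toy illustration of the windowing idea only, not the WordVectors package's implementation; the function name is our own.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs as a skip-gram-style pass would see them.
    `window` counts words on either side of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

toks = "it was the best of times".split()
# With window=2, "the" pairs with the two words on each side
print([c for t, c in context_pairs(toks) if t == "the"])
# → ['it', 'was', 'best', 'of']
```

Widening the window multiplies the number of pairs the training process must consider, which is why window size trades off training time against how much context each word sees.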
			</section>

			<section>
				<head>The word vector process: Iteration and refinement</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_process_overview3.png"
						/>
					</figure>
				</slide>
				<lectureNote>
					<p>As before, this is an iterative process! <list>
							<item>When you're first training a model, it's a good idea to try
								different parameter settings just to find out what difference they
								make</item>
							<item>And when you validate the model, you might see something that
								prompts you to go back and change a parameter and try again: for
								instance, with a very small corpus, you might need to do extra
								iterations (because with a small corpus, there isn't as much
								information being generated about word relationships during each
								iteration, so you need to run the process more times to get the same
								level of accuracy)</item>
							<item>And the model training process might in turn send you back to the
								corpus: you might discover that your corpus is just too small and
								you need to go back and add some more materials. Or you might find
								that your corpus is too heterogeneous: maybe you'd like to try
								splitting it into two and treating them separately.</item>
						</list>
					</p>
				</lectureNote>

				<tutorial>
					<p>Like training, validation is an iterative process! <list>
							<item>As stated before, when you're first training a model, it's a good
								idea to try different parameter settings just to find out what
								difference they make; sometimes even a little change can have major
								downstream impact.</item>
							<item>And when you validate the model, you might see something that
								prompts you to go back and change a parameter and try again: for
								instance, with a very small corpus, you might need to do extra
								iterations (because with a small corpus, there isn't as much
								information being generated about word relationships during each
								iteration, so you need to run the process more times to get the same
								level of accuracy). Even if your queries are producing interesting
								results, if your model isn't valid then those results aren't really
								valid, either.</item>
							<item>The model training process might send you back to the corpus: you
								might discover that your corpus is just too small and you need to go
								back and add some more materials. Or you might find that your corpus
								is too heterogeneous: maybe you'd like to try splitting it into two
								and treating them separately. Sometimes even the method you used to
								prepare the data might need to be adjusted.</item>
						</list>
					</p>

				</tutorial>
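Much of validation comes down to spot-checking whether words that should be related really do sit near each other, and under the hood "near" means cosine similarity between vectors. A minimal sketch, using invented three-dimensional vectors (a trained model would have dozens or hundreds of dimensions, and these particular numbers are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means pointing the same way, near 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 3-dimensional vectors, invented for illustration
vecs = {
    "queen":  [0.9, 0.8, 0.1],
    "king":   [0.85, 0.75, 0.2],
    "turnip": [0.1, 0.05, 0.9],
}
# A plausibility check: "queen" should be closer to "king" than to "turnip"
print(cosine(vecs["queen"], vecs["king"]) > cosine(vecs["queen"], vecs["turnip"]))
# → True
```

If a spot-check like this fails for word pairs your corpus should clearly support, that is a signal to revisit the parameters or the corpus itself.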

			</section>

			<section>
				<head>The word vector process: Querying and research</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_process_overview4.png"
						/>
					</figure>
				</slide>
				<lectureNote>
					<p>In the final act, as with a proper comedy, we reach resolution and answers:
						this is where we can start querying our model and doing our research
						(although as we've seen, the corpus-building and model-training processes
						are also definitely integral to the research process).</p>
				</lectureNote>

				<tutorial>
					<p>Next, we are going to cover one of the most useful features of a word
						embedding model: the ability to ask it questions about your data. This is
						the stage where the actual research begins, though in any write-up involving
						word embedding models, you should make sure to share details about your
						training process as this is a very important stage in the research
						process!</p>

				</tutorial>
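A typical query, such as "which words are closest to X?", is just cosine similarity ranked across the vocabulary. A sketch of that ranking with invented toy vectors standing in for a trained model (the function names and numbers are ours, not the WordVectors API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def closest_to(word, vecs, n=2):
    """Rank every other word in the model by cosine similarity to `word`."""
    ranked = sorted(
        (w for w in vecs if w != word),
        key=lambda w: cosine(vecs[word], vecs[w]),
        reverse=True,
    )
    return ranked[:n]

# Invented 3-dimensional vectors standing in for a trained model
vecs = {
    "sea":        [0.9, 0.1, 0.1],
    "ocean":      [0.85, 0.2, 0.1],
    "ship":       [0.7, 0.5, 0.1],
    "parliament": [0.1, 0.1, 0.9],
}
print(closest_to("sea", vecs))
# → ['ocean', 'ship']
```

Tools like the WordVectors package and the Women Writers Vector Toolkit wrap exactly this kind of ranked-similarity query behind friendlier interfaces.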
			</section>





			<section>
				<head>Tools for word embedding models</head>
				<slide>
					<figure>
						<graphic height="600px" url="../../../_utils/gfx/w2v_tools.png"/>
					</figure>

				</slide>
				<lectureNote>
					<p>To wrap up this session, let's take a quick look at the tools we use for
						working with word embedding models.</p>
					<p>We can arrange them in order of abstraction: <list>
							<item>the most foundational <soCalled>tool</soCalled> in this set is the
								word embedding algorithms themselves. These are mathematical
								processes that perform computations that generate a <soCalled>word
									embedding</soCalled>: a representation of a corpus as a vector
								space that has been <soCalled>squashed</soCalled> or
									<soCalled>flattened</soCalled> in useful ways. The two main word
								embedding algorithms in common use are Word2Vec (developed by Tomas
								Mikolov at Google) and GloVe (developed by a research group at
								Stanford). For this workshop, we are using Word2Vec.</item>
							<item>When we want to actually run those algorithms on our data, we need
								to have a computer program that will do things like read in the
								corpus, run the algorithm on it, allow us to set parameters, etc. We
								could write one ourselves if we were clever that way but there
								already exist specific software packages we can use: specific
								implementations of the word embedding algorithms. Two in common use
								are the WordVectors package (written in R by Ben Schmidt) and the
								Gensim package (written in Python by a Czech researcher, Radim
								Řehůřek). For this workshop, we are using the WordVectors R
								package.</item>
							<item>In order to run these programs on your computer, you need to have
								an environment within which the programming language (R, Python) can
								operate: something that understands the R or Python language and can
								run it within the operating system on your computer. These software
								environments are sort of like sandboxes or life support systems for
								specific languages. Examples include RStudio, which is an
								environment for working in the R programming language and running R
								code, and Jupyter Notebooks, which are an environment for working in
								the Python programming language and running Python code. Within
								these environments, we can train models and we can also query and
								interact with them.
								<!--For this workshop, we sort of bypass this layer, because we are not running the word2vec code directly (we already did it for you when we pre-trained your models in preparation for the workshop)--></item>
							<item>An added option (which we're only touching on briefly in this
								workshop) is the Women Writers Vector Toolkit, which is a set of
								programs that create a web interface for Word2Vec, and allow you to
								query the trained models without having to use RStudio or interact
								directly with any of the underlying layers</item>
							<!-- <item>The final layer for this workshop is the Women Writers Vector Toolkit, which is how we will interact with the trained models. The toolkit and its exploratory interface are a set of programs that create a web interface for Word2Vec, and allow you to query the trained models without having to use RStudio or interact directly with any of the underlying layers</item>-->
						</list>
					</p>
					<p>Those layers are all sitting underneath us and they each have effects on the
						outcomes of our work: <list>
							<item>the environment we're working in is the result of a number of
								layers of decisions that could have been made differently</item>
							<!--<item>and in some cases, for some of you, a different choice in one of these layers might ultimately prove to be more suitable for your project</item>-->
							<item>and even if you don't want to make a different decision, in a
								teaching context you might want your students to understand the
								effects of a different set of choices</item>
							<item>so over time, you may want to revisit them as you gain more
								familiarity and comfort with these tools</item>
							<item>the important note to end on here is that this workshop is
								intended to be a starting point</item>
							<item>the things we observe about word vectors and how they work are not
								universal, but local and situational; however, we can learn a lot
								from these experiments </item>
						</list>
					</p>


				</lectureNote>
				<tutorial>
					<p>To finish this part of the tutorial, let's take a quick look at the tools we
						use for working with word embedding models.</p>
					<p>We can arrange them in order of abstraction: <list>
							<item>the most foundational <soCalled>tool</soCalled> in this set is the
								word embedding algorithms themselves. These are mathematical
								processes that perform computations that generate a <soCalled>word
									embedding</soCalled>: a representation of a corpus as a vector
								space that has been <soCalled>squashed</soCalled> or
									<soCalled>flattened</soCalled> in useful ways. The two main word
								embedding algorithms in common use are Word2Vec (developed by Tomas
								Mikolov at Google) and GloVe (developed by a research group at
								Stanford). For this tutorial, we are using Word2Vec.</item>
							<item>When we want to actually run those algorithms on our data, we need
								to have a computer program that will do things like read in the
								corpus, run the algorithm on it, allow us to set parameters, etc. We
								could write one ourselves if we were clever that way but there
								already exist specific software packages we can use: specific
								implementations of the word embedding algorithms. Two in common use
								are the WordVectors package (written in R by Ben Schmidt) and the
								Gensim package (written in Python by a Czech researcher, Radim
								Řehůřek). For this tutorial, we are using the WordVectors R
								package.</item>
							<item>In order to run these programs on your computer, you need to have
								an environment within which the programming language (R, Python) can
								operate: something that understands the R or Python language and can
								run it within the operating system on your computer. These software
								environments are sort of like sandboxes or life support systems for
								specific languages. Examples include RStudio, which is an
								environment for working in the R programming language and running R
								code, and Jupyter Notebooks, which are an environment for working in
								the Python programming language and running Python code. Jupyter
								Notebooks is designed to work with Python notebooks, meaning code
								that is intermixed with prose. If you want to just run the code
								itself, you may want to check out Python's IDLE, or Spyder, which
								comes preinstalled with Anaconda. Within these environments, we can
								train models and we can also query and interact with them.</item>
							<item>An added option (which we've touched on in this tutorial) is the
								Women Writers Vector Toolkit, which is a set of programs that create
								a web interface for Word2Vec, and allow you to query the trained
								models without having to use RStudio or interact directly with any
								of the underlying layers. This tool can be particularly useful if
								you just want to get comfortable with what the results of a model
								query look like or if the model you want to train is using a corpus
								available with the Toolkit.</item>
						</list>
					</p>
					<p>Those layers are all sitting underneath us and they each have effects on the
						outcomes of our work: <list>
							<item>the environment we're working in is the result of a number of
								layers of decisions that could have been made differently. Choosing
								an environment to work in is an important step that asks you to
								consider how you want to interact with the code and what your level
								of comfort is.</item>
							<item>So over time, you may want to revisit different environments as
								you gain more familiarity and comfort with these tools</item>
							<item>the important note to keep in mind is that this tutorial is
								intended to be a starting point. We have only begun to scratch the
								surface of all the cool stuff word embedding models can do!</item>
							<item>The things we observe about word vectors and how they work are not
								universal, but local and situational; however, we can learn a lot
								from these experiments, about both how models work and about our
								corpus itself.</item>
						</list>
					</p>

				</tutorial>

			</section>
			<section>
				<head>Discussion and questions</head>
				<slide>
					<p>Using this new information: <list>
							<item>Are there further questions or things that need more
								explanation?</item>
							<item>Any new perspectives on the examples we saw earlier?</item>
							<item>Any reflections on the explanatory process? What worked and what
								didn't?</item>
						</list>
					</p>
				</slide>
				<lectureNote>
					<p>So now let's take a step back, with this more detailed perspective: <list>
							<item>Are there further questions or things that need more
								explanation?</item>
							<item>Any new perspectives on the examples we saw earlier?</item>
							<item>Any reflections on the explanatory process? What worked and what
								didn't?</item>
						</list></p>
				</lectureNote>
			</section>

		</presentation>
	</text>
</TEI>
