
Implementing Word2vec in PyTorch from the Ground Up | by Jonah Breslow | Dec 2022


Photo by Brett Jordan on Unsplash

An important component of natural language processing (NLP) is the ability to translate words, phrases, or larger bodies of text into continuous numerical vectors. There are many techniques for accomplishing this task, but in this post we'll focus on a technique published in 2013 called word2vec.

Word2vec is an algorithm published by Mikolov et al. in a paper titled Efficient Estimation of Word Representations in Vector Space. The paper is worth reading, though I'll provide an overview as we build it from the ground up in PyTorch. Succinctly, word2vec uses a single-hidden-layer artificial neural network to learn dense word embeddings. These word embeddings allow us to identify words that have similar semantic meanings. Additionally, word embeddings allow us to apply algebraic operations. For example, "vector('King') - vector('Man') + vector('Woman') results in a vector that is closest to the vector representation of the word Queen" ("Efficient Estimation of Word Representations in Vector Space" 2).

Figure 1: Word Embedding Example (Credit: https://thegradient.pub/nlp-imagenet/)

Figure 1 is an example of word embeddings in three dimensions. Word embeddings can learn semantic relationships between words. The "Male-Female" example illustrates how the relationship between "man" and "woman" is similar to the relationship between "king" and "queen." Syntactic relationships can also be encoded by embeddings, as shown in the "Verb tense" example.

Before we get into the model overview and PyTorch code, let's start with an explanation of word embeddings.

Why do we even need word embeddings?

Computers are simply abstracted calculators. They are extremely efficient at mathematical computation. Any time we want to express our ideas to a computer, the language we must use is numerical. If we want to determine the sentiment of a Yelp review or the topic of a popular book, we need to first translate the text into vectors. Only then can we apply follow-on procedures to extract the information of interest from the text.

The simplest word embeddings

Word embeddings are precisely how we translate our ideas into a language computers can understand. Let's work through an example taken from the Wikipedia article on Python, which states "python consistently ranks as one of the most popular programming languages." This sentence contains 11 words, so why don't we create a vector of length 11 where each index takes the value 1 if the word is present and 0 if it isn't? This is commonly known as one-hot encoding.

python       = [1,0,0,0,0,0,0,0,0,0,0]
consistently = [0,1,0,0,0,0,0,0,0,0,0]
ranks        = [0,0,1,0,0,0,0,0,0,0,0]
as           = [0,0,0,1,0,0,0,0,0,0,0]
one          = [0,0,0,0,1,0,0,0,0,0,0]
of           = [0,0,0,0,0,1,0,0,0,0,0]
the          = [0,0,0,0,0,0,1,0,0,0,0]
most         = [0,0,0,0,0,0,0,1,0,0,0]
popular      = [0,0,0,0,0,0,0,0,1,0,0]
programming  = [0,0,0,0,0,0,0,0,0,1,0]
languages    = [0,0,0,0,0,0,0,0,0,0,1]
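As a quick aside, the sketch below shows how such one-hot vectors could be produced in plain Python. It is purely illustrative; the helper name one_hot is mine and not part of the notebook we build later.

sentence = "python consistently ranks as one of the most popular programming languages"
vocab = sentence.split()  # the 11 words of our toy vocabulary

def one_hot(word: str) -> list:
    # A vector of zeros with a single 1 at the word's index in the vocabulary
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("python"))       # [1, 0, 0, ..., 0]
print(one_hot("programming"))  # [0, ..., 0, 1, 0]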

This method of converting words into vectors is arguably the simplest. Yet, there are a few shortcomings that provide the motivation for word2vec embeddings. First, the length of the embedding vectors grows linearly with the size of the vocabulary. Once we need to embed millions of words, this method becomes problematic in terms of space complexity. Second, these vectors are sparse: each vector has only a single entry with value 1 and all remaining entries are 0. Once again, this is a significant waste of memory. Finally, each word vector is orthogonal to every other word vector, so there is no way to determine which words are most similar. I would argue that the words "python" and "programming" should be considered more similar to each other than "python" and "ranks." Unfortunately, the vector representation of each of these words is equally different from every other vector.

Improved Word Embeddings

Our goal is now more refined: can we create fixed-length embeddings that allow us to identify which words are most similar to each other? An example might be:

python      = [0.5, 0.8, -0.1]
ranks       = [-0.5, 0.1, 0.8]
programming = [0.9, 0.4, 0.1]

If we take the dot product of "python" and "ranks", we get (0.5)(-0.5) + (0.8)(0.1) + (-0.1)(0.8) = -0.25.

And if we take the dot product of "python" and "programming", we get (0.5)(0.9) + (0.8)(0.4) + (-0.1)(0.1) = 0.76.

Since the score between "python" and "ranks" is lower than that of "python" and "programming", we would say that "python" and "programming" are more similar. Generally, we won't use the raw dot product between two embeddings to compute a similarity score. Instead, we'll use the cosine similarity, since it removes the effect of vector norms and returns a more standardized score. Regardless, both of the issues we faced with the one-hot encoding method are solved: our embedding vectors have a fixed length and they allow us to compute similarity between words.
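Here is a small NumPy sketch, separate from the notebook we build later, that reproduces the two dot products above and shows the cosine similarity for comparison:

import numpy as np

# Toy embeddings from the example above
python_vec      = np.array([0.5, 0.8, -0.1])
ranks_vec       = np.array([-0.5, 0.1, 0.8])
programming_vec = np.array([0.9, 0.4, 0.1])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the vector norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(np.dot(python_vec, ranks_vec))                    # -0.25
print(np.dot(python_vec, programming_vec))              #  0.76
print(cosine_similarity(python_vec, programming_vec))   #  ~0.81 after removing the norms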

Now that we have a grasp of word embeddings, the question becomes how to learn these embeddings. This is where Mikolov's word2vec model comes into play. If you are unfamiliar with artificial neural networks, the following sections may be unclear, since word2vec is fundamentally based on this type of model. I highly recommend checking out Michael Nielsen's free online Neural Networks and Deep Learning book and 3Blue1Brown's YouTube series on neural networks if this material is new to you.

Skipgrams

Recall the sentence from earlier, "python consistently ranks as one of the most popular programming languages." Imagine someone didn't know the word "programming" and wanted to figure out its meaning. A reasonable approach is to examine the neighboring words for a clue about the meaning of this unknown word. They would find that it is surrounded by "popular" and "languages," words that hint at the possible meaning of "programming." This is precisely how the skipgram model works. Ultimately, we'll train a neural network to predict the neighboring context words, given an input word. In Figure 2, the green word is the unknown target word and the blue words surrounding it are the context words our neural network will be trained to predict.

Figure 2: The Skipgram Method

In this example, the window size is 2. This means each target word is paired with the 2 context words on either side that the model will need to predict. Since the word "ranks" has 2 words to its left and 2 words to its right, the resulting training data is 4 examples for this target word.
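A rough sketch of this pair generation is shown below. Note that the notebook we build later only keeps target words that have a full window on both sides; this sketch allows partial windows at the sentence edges for simplicity, and the variable names are illustrative.

tokens = "python consistently ranks as one of the most popular programming languages".split()
WINDOW = 2

pairs = []
for i, target in enumerate(tokens):
    # Collect up to WINDOW context words on each side of the target word
    for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
        if j != i:
            pairs.append((target, tokens[j]))

# For the target word "ranks" this yields 4 training pairs:
# ('ranks', 'python'), ('ranks', 'consistently'), ('ranks', 'as'), ('ranks', 'one')
print([p for p in pairs if p[0] == "ranks"])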

Model Architecture

The neural network used to learn these word embeddings is a single-hidden-layer feedforward network. The inputs to the network are the target words and the labels are the context words. The single hidden layer is the dimension in which we choose to embed our words. For this example, we'll use an embedding size of 300.

Figure 3: Skipgram Model Architecture

Let's go through an example of how this model works. If we want to embed a word, the first step is to find its index in the vocabulary. This index is then passed to the network as the row index into the embedding matrix. In Figure 3, the input word is the second entry in our vocabulary vector, which means we enter the green embedding matrix at the second row. This row is of length 300, the embedding dimension N. We then matrix-multiply this vector, which is the hidden layer, by a second embedding matrix of shape N x V to produce a vector of length V.

Notice that there are V columns in the second embedding matrix (the purple matrix). Each of these columns represents a word in the vocabulary. Another way to conceptualize this matrix multiplication is to recognize that it produces the dot product between the vector for the target word (the hidden layer) and every word in the vocabulary (the columns of the purple matrix). The result is a vector of length V, representing the context word predictions. Since our context window size is 2, we will have 4 prediction vectors of length V. We then compare these prediction vectors with the corresponding ground truth vectors to compute the loss that we backpropagate through the network to update the model parameters. In this case, the model parameters are the elements of the embedding matrices. The mechanics of this training procedure will be fleshed out in PyTorch code later on.

Negative Sampling

Photo by Edge2Edge Media on Unsplash

In the paper titled Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al. propose two improvements to the original word2vec model: negative sampling and subsampling.

In Figure 3, notice how each prediction vector is of length V. The ground truth vectors compared against each prediction vector will also be of length V, but they will be extremely sparse, since only a single element is labeled 1: the true context word the model is being trained to predict. This true context word will be referred to as the "positive context word". Every other word in the vocabulary, which is V - 1 words, will be labeled 0 since they are not the context word in the training example. All of these words will be referred to as "negative context words".

Mikolov et al. proposed a technique called negative sampling that reduces the size of the ground truth vector and, consequently, the prediction vector. This reduces the computational requirements of the network and expedites training. Instead of using all the negative context words, Mikolov proposed sampling a small number of negative context words from the V - 1 available ones using a conditional probability distribution.

In the code provided, I implement a negative sampling procedure that differs from the method Mikolov proposed. It was simpler to construct and still results in high-quality embeddings. In Mikolov's paper, the probability that a negative sample is chosen depends on the conditional probability of seeing the candidate word in the context of the target word. So, for every word in the vocabulary, we would generate a probability distribution over every other word in the vocabulary, representing the conditional probability of seeing that other word in the target word's context. The negative context words would then be sampled with a probability inversely proportional to that probability.

I implemented negative sampling in a slightly different way that avoids conditional distributions. First, I found the frequency of each word in the vocabulary, ignoring the conditional probability and using the overall frequency instead. Then, an arbitrarily large negative sampling array is populated with vocabulary indices in proportion to each word's frequency. For example, if the word "is" makes up 0.01% of the corpus and we decide the negative sampling array should be of size 1,000,000, then 100 elements (0.01% x 1,000,000) of the array will be populated with the vocabulary index of the word "is". Then, for every training example, we randomly sample a small number of elements from the negative sampling array. If this small number is 20 and the vocabulary is 10,001 words, we have just reduced the length of the prediction and ground truth vectors by 9,980 elements. This reduction speeds up model training considerably.

Subsampling

Subsampling is another method proposed by Mikolov et al. to reduce training time and improve model performance. The fundamental observation behind subsampling is that high-frequency words "provide less information value than the rare words" ("Distributed Representations of Words and Phrases and their Compositionality" 4). For instance, words like "is," "the," or "in" occur quite frequently and are highly likely to co-occur with many other words. This means the context words around these high-frequency words impart little contextual information about the high-frequency word itself. So, instead of using every word pair in the corpus, we sample the words with a probability inversely related to their frequency. The exact implementation details will be explained in the following section.
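For reference, the subsampling rule from the paper itself can be sketched as follows; the implementation described later in this post uses a slightly different formula, and the threshold value 1e-5 below is the paper's suggestion, not ours.

import numpy as np

# Mikolov et al. (2013): discard word w with probability P(w) = 1 - sqrt(t / f(w)),
# where f(w) is the word's relative frequency in the corpus and t is a small threshold.
def discard_probability(word_freq: float, t: float = 1e-5) -> float:
    return max(0.0, 1.0 - np.sqrt(t / word_freq))

print(discard_probability(1e-2))  # very frequent word -> discarded ~97% of the time
print(discard_probability(1e-5))  # word at the threshold -> never discarded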

With the overview of word embeddings, the word2vec architecture, negative sampling, and subsampling out of the way, let's dig into the code. Please note, there are frameworks that have abstracted away the implementation details of word2vec. These options are extremely powerful and offer the user extensibility. For example, gensim provides a word2vec API that includes additional capabilities such as using pretrained models and multi-word n-grams. However, in this tutorial we'll create a word2vec model without leveraging any of these frameworks.

All the code we'll review in this tutorial can be found on my GitHub. Please note, the code in the repository is subject to change as I work on it. For the purposes of this tutorial, a simplified version of the code is presented here in a Google Colab notebook.

Getting the Data

We will use a Wikipedia dataset called WikiText103, provided by PyTorch, to train our word2vec model. In the code below, you will see how I import and print the first few lines of the dataset. The first text comes from the Wikipedia article on Valkyria Chronicles III.

Setting Parameters and Configuration

Now we'll set the parameters and configuration values used throughout the rest of the code. Some of the parameters in this snippet relate to code that appears later in the notebook, so bear with me.

Here we construct a dataclass containing the parameters that define our word2vec model. The first section controls text preprocessing and skipgram construction. We will only consider words that occur at least 50 times, which is controlled by the MIN_FREQ parameter. SKIPGRAM_N_WORDS is the window size used when constructing the skipgrams, meaning we'll look at 8 words before and after the target word. T controls how we compute the subsampling probability: words with a frequency at or above the 85th percentile will have only a small probability of being kept, as described in the subsampling section above. NEG_SAMPLES is the number of negative samples to use for each training example, as described in the negative sampling section above. NS_ARRAY_LEN is the length of the negative sampling array that we will draw negative samples from. SPECIALS is the placeholder string for words that are excluded from the vocabulary because they do not meet the minimum frequency requirement. TOKENIZER specifies how we want to convert the corpus of text into tokens; the "basic_english" tokenizer normalizes the text and splits it on whitespace.

The second section defines the model configuration and hyperparameters. BATCH_SIZE is the number of documents in each minibatch used to train the network. EMBED_DIM is the dimensionality of the embedding we'll learn for each word in the vocabulary. EMBED_MAX_NORM is the maximum norm each embedding vector is allowed to have. N_EPOCHS is the number of epochs we'll train the model for. DEVICE tells PyTorch whether to use a CPU or a GPU to train the model. CRITERION is the loss function used; discussion of the loss function choice will continue when we cover the model training procedure.
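A sketch of what such a dataclass might look like, using the values quoted in this post, is shown below. The SPECIALS token and the EMBED_MAX_NORM value are assumptions on my part; check the repository for the exact definitions.

from dataclasses import dataclass, field
import torch
import torch.nn as nn

@dataclass
class Word2VecParams:
    # Preprocessing and skipgram construction
    MIN_FREQ: int = 50             # drop words occurring fewer than 50 times
    SKIPGRAM_N_WORDS: int = 8      # context window: 8 words on each side
    T: int = 85                    # percentile used for the subsampling threshold
    NEG_SAMPLES: int = 50          # negative samples per training example
    NS_ARRAY_LEN: int = 5_000_000  # length of the negative sampling array
    SPECIALS: str = "<unk>"        # placeholder token (assumed name)
    TOKENIZER: str = "basic_english"

    # Model configuration and hyperparameters
    BATCH_SIZE: int = 100
    EMBED_DIM: int = 300
    EMBED_MAX_NORM: float = 1.0    # assumed value
    N_EPOCHS: int = 5
    DEVICE: str = "cuda" if torch.cuda.is_available() else "cpu"
    CRITERION: nn.Module = field(default_factory=nn.BCEWithLogitsLoss)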

Building the Vocabulary

The next step in preparing the text data for our word2vec model is building a vocabulary. We will build a class called Vocab with methods that let us look up a word's index and frequency. We will also be able to look up a word by its index, as well as get the total count of words in the entire corpus of text.

Without reviewing every line in this code, note that the Vocab class has stoi, itos, and total_tokens attributes as well as get_index(), get_freq(), and lookup_token() methods. The following gist shows what these attributes and methods do.

stoi is a dictionary where the keys are words and the values are tuples of the index and the frequency of the key word. For example, the word "python" is the 13,898th most common word, occurring 403 times, so its entry in the stoi dictionary would be {"python": (13898, 403)}. itos is similar to stoi, but its key is the index value, so the entry for "python" would be {13898: ("python", 403)}. The total_tokens attribute is the total number of tokens in the entire corpus; in our example there are 77,514,579 words.

The get_index() method takes a word or a list of words as input and returns the index or the list of indices of those words. If we call Vocab.get_index("python"), the returned value is 13898. The get_freq() method takes a word or a list of words as input and returns the frequency of the words as an integer or a list of integers. If we call Vocab.get_freq("python"), the value returned is 403. Finally, the lookup_token() method takes an integer and returns the word that occupies that index. For example, calling Vocab.lookup_token(13898) returns "python".
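To make the interface concrete, here is a stripped-down sketch of a Vocab class matching the behavior described above. The real class in the repository has more bookkeeping, so treat this as an approximation.

from collections import Counter
from typing import Union

class Vocab:
    def __init__(self, counter: Counter, min_freq: int = 50, specials: str = "<unk>"):
        # Keep only words meeting the minimum frequency, ordered by frequency
        kept = [(w, c) for w, c in counter.most_common() if c >= min_freq]
        self.stoi = {specials: (0, 0)}
        self.stoi.update({w: (i + 1, c) for i, (w, c) in enumerate(kept)})
        self.itos = {i: (w, c) for w, (i, c) in self.stoi.items()}
        self.total_tokens = sum(counter.values())

    def get_index(self, word: Union[str, list]):
        if isinstance(word, list):
            return [self.get_index(w) for w in word]
        return self.stoi.get(word, (0, 0))[0]   # unknown words map to the special index

    def get_freq(self, word: Union[str, list]):
        if isinstance(word, list):
            return [self.get_freq(w) for w in word]
        return self.stoi.get(word, (0, 0))[1]

    def lookup_token(self, index: int) -> str:
        return self.itos[index][0]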

The final functions in the gist above are yield_tokens() and build_vocab(). The yield_tokens() function preprocesses and tokenizes the text; the preprocessing simply removes all characters that are not letters or digits. The build_vocab() function takes the raw Wikipedia text, tokenizes it, and then constructs a Vocab object. Once again, I won't go over every line in this function; the key takeaway is that it produces a Vocab object.

Building our PyTorch Dataloaders

The next step in the process is to construct the skipgrams with subsampling and then create dataloaders for our PyTorch model. For an overview of why dataloaders are so useful for PyTorch models that train on large amounts of data, take a look at the documentation.

This class is probably the most complex we have worked on yet, so let's go through each method thoroughly, starting with the last method, collate_skipgram(). We start by initializing two lists, batch_input and batch_output, each of which will be populated with vocabulary indices. Ultimately, the batch_input list will hold the index of each target word and the batch_output list will contain the positive context word indices for each target word. The first step is to loop over every text in the batch and convert all the tokens to the corresponding vocabulary indices:

for text in batch:
    text_tokens = self.vocab.get_index(self.tokenizer(text))

The next step checks that the text is long enough to generate training examples. Recall the sentence from earlier, "python consistently ranks as one of the most popular programming languages." It has 11 words. If we set SKIPGRAM_N_WORDS to 8, then a document that is 11 words long is not sufficient, since we cannot find a word in the document that has 8 context words before it as well as 8 context words after it.

if len(text_tokens) < self.params.SKIPGRAM_N_WORDS * 2 + 1:
    continue

Then we select the target word and a list of all the context words surrounding it, ensuring that we always have a full set of context words:

for idx in range(len(text_tokens) - self.params.SKIPGRAM_N_WORDS * 2):
    token_id_sequence = text_tokens[
        idx : (idx + self.params.SKIPGRAM_N_WORDS * 2 + 1)
    ]
    input_ = token_id_sequence.pop(self.params.SKIPGRAM_N_WORDS)
    outputs = token_id_sequence

Now we implement subsampling. We look up the probability that the target word should be discarded given its frequency and then remove it with that probability. We will see how these discard probabilities are computed shortly.

prb = random.random()
del_pair = self.discard_probs.get(input_)
if input_ == 0 or del_pair >= prb:
    continue  # skip the placeholder token (index 0) or discard by subsampling

Then, if the previous step did not remove the target word itself, we run the same subsampling procedure on the context words surrounding it. Finally, we append the resulting pairs to the batch_input and batch_output lists, respectively.

else:
    for output in outputs:
        prb = random.random()
        del_pair = self.discard_probs.get(output)
        if output == 0 or del_pair >= prb:
            continue
        else:
            batch_input.append(input_)
            batch_output.append(output)

How did we compute the discard probability for each word? Recall, the probability that we keep a word is inversely related to its frequency in the corpus. In other words, the higher the word's frequency, the more likely we are to discard it from the training data. The formula I used to compute the probability of discarding is:

Discard probability formula

This is slightly different from the formula proposed by Mikolov et al., but it achieves a similar goal. The small difference is the +t term in the denominator. If the +t were excluded from the denominator, words with a frequency greater than t would be effectively removed from the data, since the value inside the square root would be greater than 1. This formula is implemented in the _create_discard_dict() method, which creates a Python dictionary where the key is a word index and the value is the probability of discarding it. The next question is where t comes from. Recall that our Word2VecParams has the parameter T, which is set to 85 in our code. This means we find the 85th-percentile word frequency and set t to that value. This effectively makes the probability of keeping a word with a frequency at or above the 85th percentile close to, but slightly greater than, 0%. This computation is what the _t() method in the SkipGrams class does.
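Because the equation image does not survive in this text version, the sketch below uses an assumed variant, P(discard) = 1 - sqrt(t / (f + t)), which matches the description of a +t term in the denominator; the repository contains the exact formula. The percentile computation mirrors what the _t() method is described as doing.

import numpy as np

def t_threshold(frequencies: list, percentile: int = 85) -> float:
    # The 85th-percentile word frequency, as computed by the _t() method
    return float(np.percentile(frequencies, percentile))

def create_discard_dict(index_to_freq: dict, percentile: int = 85) -> dict:
    t = t_threshold(list(index_to_freq.values()), percentile)
    # Assumed variant of the discard probability with +t in the denominator
    return {idx: 1.0 - np.sqrt(t / (freq + t)) for idx, freq in index_to_freq.items()}

# Toy example: word index -> raw frequency
probs = create_discard_dict({1: 10, 2: 1_000, 3: 1_000_000})
print(probs)  # discard probability increases with word frequency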

Creating our Negative Sampling Array

The final step before we define the PyTorch model is to create the negative sampling array. The high-level goal is to create an array of length 5,000,000 and populate it with vocabulary indices in proportion to the frequency of the words in the vocabulary.

The _create_negative_sampling() method creates the array exactly as specified above. The one slight difference is that if a word's frequency implies it should get fewer than 1 entry in the negative sampling array, we ensure its index is still present in 1 element of the array so we don't lose the word entirely when sampling negative context words.

The sample() method returns a list of lists, where the number of inner lists equals the number of examples in the batch, and the number of samples within each inner list is the number of negative samples per example, which we set to 50 in the Word2VecParams dataclass.
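A simplified sketch of such a sampler is shown below; the repository version works with the Vocab object and torch tensors directly, so the names and details here are approximations.

import random

class NegativeSampler:
    def __init__(self, index_to_freq: dict, ns_array_len: int = 5_000_000):
        total = sum(index_to_freq.values())
        self.ns_array = []
        for idx, freq in index_to_freq.items():
            # Each word gets a share of the array proportional to its frequency,
            # with a floor of 1 entry so rare words can still be sampled
            n_entries = max(1, round(ns_array_len * freq / total))
            self.ns_array.extend([idx] * n_entries)

    def sample(self, n_examples: int, n_samples: int) -> list:
        # One list of negative context-word indices per example in the batch
        return [random.choices(self.ns_array, k=n_samples) for _ in range(n_examples)]

sampler = NegativeSampler({1: 900, 2: 90, 3: 10}, ns_array_len=1_000)
print(sampler.sample(n_examples=2, n_samples=5))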

Defining the PyTorch Model

Finally, we get to build the word2vec model in PyTorch. The idiomatic way to build a PyTorch neural network is to define the network's layers in the constructor and the forward pass of data through the network in a method called forward(). Thankfully, PyTorch has abstracted away the backward pass that updates the model parameters, so we don't have to compute gradients manually.

PyTorch models always inherit from the torch.nn.Module class. We will leverage the PyTorch embedding layer, which creates a lookup table of word vectors. The two layers we define in our word2vec model are self.t_embeddings, the target embeddings we are interested in learning, and self.c_embeddings, the secondary (purple) embedding matrix from Figure 3. Both of these embedding matrices are initialized randomly.

If we wanted to train every parameter in the network, we could forgo negative sampling at this point and the forward method would be a bit simpler. But negative sampling has been shown to improve model accuracy and reduce training time, so it's worth implementing. Let's dig into the forward pass.

Let's imagine we are passing a single training example through as a batch. In this example, inputs contains only the word associated with index 1. The first step of the forward pass is to look up this word's embedding in the self.t_embeddings table. Then we use the .view() method to reshape it so we have an individual vector for the input we pass through the network. In the actual implementation the batch size is 100, and the .view() method creates a (1 x N) matrix for the word in each of the 100 training examples in the batch. Figure 4 helps visualize what the first four lines of the forward() method do.

Figure 4: Forward Pass Input Embeddings

Then, for each input, we need to get the context word embeddings. For this example, say the positive context word is associated with the 8th index of the self.c_embeddings table and the negative context word is associated with the 6th index. In this toy example, we are using only 1 negative sample. Figure 5 visualizes what the next two lines of PyTorch do.

Figure 5: Forward Pass Context Word Embeddings

The target embedding vector is of size (1 x N) and our context embedding matrix is of size (N x 2). So, our matrix multiplication results in a matrix of size (1 x 2). Figure 6 shows what the final two lines of the forward() method accomplish.

Figure 6: Input and Context Matrix Multiplication

Another way to conceptualize this forward pass with negative sampling is to think of it as a dot product between the target word and each word in the context: the positive context word and all the negative context words. In this example, our positive context word was the 8th word in the vocabulary and the negative context word was the 6th word. The resulting (1 x 2) vector contains the logits for the two context words. Since we know the first context word is the positive one and the second is the negative one, the value should be large for the first element and small for the second. To achieve this, we'll use torch.nn.BCEWithLogitsLoss as the loss function. We will revisit the choice of loss function in a later section.
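Putting the pieces together, a sketch of such a model might look like the following. The layer names mirror the ones used in this post, but the rest is my approximation rather than the exact code from the repository.

import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 300, embed_max_norm: float = 1.0):
        super().__init__()
        self.t_embeddings = nn.Embedding(vocab_size, embed_dim, max_norm=embed_max_norm)
        self.c_embeddings = nn.Embedding(vocab_size, embed_dim, max_norm=embed_max_norm)

    def forward(self, inputs: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # inputs:  (batch,)                 target word indices
        # context: (batch, 1 + n_negative)  positive word first, then negative words
        target = self.t_embeddings(inputs).view(inputs.shape[0], 1, -1)  # (batch, 1, N)
        ctx = self.c_embeddings(context)                                 # (batch, 1 + n_neg, N)
        # Dot product between the target vector and every context vector
        logits = torch.bmm(target, ctx.transpose(1, 2))                  # (batch, 1, 1 + n_neg)
        return logits.squeeze(1)                                         # (batch, 1 + n_neg)

model = Model(vocab_size=10_001)
inputs = torch.tensor([1])           # one training example: target word index 1
context = torch.tensor([[8, 6]])     # positive context word 8, negative context word 6
print(model(inputs, context).shape)  # torch.Size([1, 2])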

The last 3 methods in the Model class are normalize_embeddings(), get_similar_words(), and get_similarity(). Without getting into the details of each, normalize_embeddings() scales every word embedding so that it is a unit vector (i.e., has a norm of 1). The get_similar_words() method takes a word and returns a list of the top-n most similar words, using cosine similarity as the similarity metric. In other words, it returns the words whose vector representations are "closest" to the word of interest, as measured by the angle between the two vectors. Finally, get_similarity() takes two words and returns the cosine similarity between their word vectors.
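For intuition, here is a standalone sketch of how a most-similar-words lookup can work on an embedding matrix; it is not the repository's method, just an illustration of the cosine-similarity idea.

import torch
import torch.nn.functional as F

def get_similar_words(word_idx: int, embeddings: torch.Tensor, itos: list, top_n: int = 5) -> list:
    normalized = F.normalize(embeddings, dim=1)       # scale every row to unit norm
    scores = normalized @ normalized[word_idx]        # cosine similarity to every word
    top = torch.topk(scores, top_n + 1)               # +1 because the word matches itself
    return [itos[int(i)] for i in top.indices if int(i) != word_idx][:top_n]

# Toy usage with random embeddings
emb = torch.randn(10, 300)
print(get_similar_words(3, emb, [f"word{i}" for i in range(10)]))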

Creating the Trainer

The final step of the process is to create a class that I call Trainer. The Trainer class code is as follows:

This class orchestrates all the previously developed code to train the model. Its methods are train(), _train_epoch(), _validate_epoch(), and test_testwords(). The train() method is the one we call to start model training. It loops over all the epochs and calls the _train_epoch() and _validate_epoch() methods. After each epoch is trained and validated, it prints out the test words by calling test_testwords() so we can visually check whether the embeddings are improving. The most important methods in this class are _train_epoch() and _validate_epoch(). They are very similar in what they do but have one small difference. Let's dig into the _train_epoch() method.

We first tell the model that it is in training mode using self.model.train(). This makes certain types of network layers behave as expected during training. Those layer types are not used in this model, but it is best practice to tell PyTorch that the model is training. The next step is to loop over each batch, get the positive and negative context words, and send them to the appropriate device (CPU or GPU). In other words, we create the context tensor, which takes the batched data from the dataloaders we built with the SkipGrams class and concatenates it with the negative samples we generated with our NegativeSampler class. Next we construct the ground truth tensor, y. Since we know the first element in the context tensor is the positive context word and all the following elements are negative context words, we create a tensor, y, where the first element is a 1 and all the following elements are 0s.

Now that we have our input data, the context data, and the ground truth labels, we can execute the forward pass. The first step is to tell PyTorch to set all the gradients to 0. Otherwise, the gradients would accumulate every time we pass a batch of data through the model, which is not the desired behavior here. Then we execute the forward pass with the following line:

outputs = self.model(inputs, context)

Next, the loss is computed. We are using the torch.nn.BCEWithLogitsLoss objective function since we have a binary classification problem where the first element of the tensor, y, is 1 and the following elements are 0. For more information on this loss function, please refer to the documentation. Sebastian Raschka's blog has an excellent overview of binary cross-entropy loss in PyTorch that can provide further insight as well.

loss = self.params.CRITERION(outputs, y)

PyTorch will automatically compute the gradients when we call backward() on this loss. The gradients contain all the information needed to make small adjustments to the model parameters and reduce the loss. This automatic computation is done in the line:

loss.backward()

The small updates to the model parameters are performed in the following line. Note, we are using the torch.optim.Adam optimizer. Adam is one of the most widely used optimization algorithms and is a descendant of stochastic gradient descent. I won't go into the details of Adam in this post, but note that it tends to be one of the faster optimization algorithms because it combines adaptive learning rates with gradient descent.

self.optimizer.step()

The _validate_epoch() method is very similar to _train_epoch() except it does not track gradients, nor does it update the model parameters with the optimizer step. This is all accomplished with the line with torch.no_grad(). Additionally, _validate_epoch() only uses the validation data, not the training data.
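The snippet below condenses the training-loop steps discussed above into a single sketch. The tensor shapes and attribute names follow this post, but the details are approximated rather than copied from the repository.

import torch

def train_epoch(model, dataloader, negative_sampler, criterion, optimizer, device, n_neg=50):
    model.train()
    for inputs, pos_context in dataloader:
        # Concatenate the positive context word with sampled negative context words
        negs = torch.tensor(negative_sampler.sample(len(inputs), n_neg))
        context = torch.cat([pos_context.view(-1, 1), negs], dim=1).to(device)
        inputs = inputs.to(device)

        # Ground truth: first column (positive word) is 1, the rest are 0
        y = torch.zeros(context.shape, dtype=torch.float, device=device)
        y[:, 0] = 1.0

        optimizer.zero_grad()              # reset accumulated gradients
        outputs = model(inputs, context)   # logits of shape (batch, 1 + n_neg)
        loss = criterion(outputs, y)       # BCEWithLogitsLoss
        loss.backward()                    # compute gradients
        optimizer.step()                   # update the embedding matrices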

Putting it All Together

Below is the word2vec notebook in its entirety:

I ran this notebook in a Google Colab instance with a GPU. As you can see, I trained the model for 5 epochs, with each epoch taking between 42 and 43 minutes, so the entire notebook ran in under 4 hours. Please feel free to play around with it and provide any feedback or questions!

Results

After training for just under 4 hours, observe the results in the snippet above. In addition to the decreasing loss, notice how the most similar words improved as training progressed. After the first epoch, the top 5 most similar words to "military" were: by, for, although, was, and any. After 5 epochs, the top 5 most similar words to "military" were: army, forces, officers, control, and soldiers. This, together with the decreasing loss, shows that the embeddings the model is learning accurately represent the semantic meaning of the words in the vocabulary.

Thanks for reading! Please leave a comment if you found this helpful or if you have any questions or concerns.

Conclusion

To conclude, we have reviewed a PyTorch implementation of word2vec with negative sampling and subsampling. This model allows us to transform words into continuous vectors in an n-dimensional vector space. The embedding vectors are learned such that words with similar semantic meaning are grouped close together. With enough training data and sufficient time to train, the word2vec model can also learn syntactic patterns in text data. Word embeddings are a foundational component of NLP and are critical in more advanced models such as Transformer-based large language models. Having a thorough understanding of word2vec is an extremely useful foundation for further NLP learning!
