Introduction to Word2Vec (Skip-gram) | by Zolzaya Luvsandorj | Dec, 2022

December 16, 2022

2

NLP Fundamentals

A short introduction to phrase embeddings in Python

When working with textual content knowledge, we have to rework textual content into numbers. There are alternative ways to signify textual content in numerical knowledge. Bag of phrases (a.okay.a. BOW) is a well-liked and easy strategy to signify textual content in numbers. Nonetheless, there isn’t any notion of phrase similarity in bag of phrases as a result of every phrase is represented independently. Consequently, embeddings of phrases like ‘nice’ and ‘superior’ are as comparable to one another as they’re to an embedding of the phrase ‘ebook’.

Phrase embedding is one other smart way of representing textual content with numbers. With this strategy, every phrase is represented by an embedding, a dense vector (i.e. an array of numbers). The strategy preserves relationship between phrases and is ready to seize phrase similarity. Phrases that seem in comparable contexts have nearer vectors within the vector area. Consequently, the phrase ‘nice’ is prone to have extra comparable embedding to ‘superior’ than ‘ebook’.

Photograph by Sebastian Svenson on Unsplash

On this submit, we are going to take a look at an summary of phrase embeddings, particularly a sort of embedding algorithm known as Word2Vec, and look underneath the hood to grasp how the algorithm operates on a toy instance in Python.

Picture by creator | Comparability of preprocessing an instance doc: “Good day world!” with two approaches. Assumed vocabulary dimension of 5 for bag of phrases strategy and embedding dimension of three for phrase embedding.

When utilizing the bag of phrases strategy, we rework textual content right into a doc time period matrix of m by n the place m is the variety of paperwork/textual content data and n is the variety of distinctive phrases throughout all paperwork. This normally ends in a giant sparse matrix. If you wish to familiarise with the strategy intimately, take a look at this tutorial.

In phrase embeddings, every phrase is represented by a vector, normally with a dimension of 100 to 300. Word2Vec is a well-liked technique to create embeddings. The fundamental instinct behind Word2Vec is that this: We are able to get helpful details about a phrase by observing its context/neighbours. In Word2Vec, there are two architectures or studying algorithms we are able to use to acquire vector illustration (simply one other phrase for embedding) of phrases: Continuous Bag of Phrases (a.okay.a. CBOW) and Skip-gram.
◼️ CBOW: Predict focus phrase given surrounding context phrases
◼️ Skip-gram: Predict context phrases given focus phrase (the main target of this submit)

At this stage, this may occasionally not make a lot sense. We’ll quickly take a look at an instance of this and it’ll develop into clearer.

When coaching embeddings utilizing the Skip-gram algorithm, we undergo the next three steps at a high-level:
◼️ Acquire textual content: We begin with unlabelled textual content corpus — so it’s an unsupervised studying downside.
◼️ Rework knowledge: Then, we preprocess the information and rearrange the preprocessed knowledge into focus phrases as function and context phrases as goal for a fictitious supervised studying downside. So, it turns into a multiclassification downside the place P(context phrase | focus phrase). Right here’s an instance of what this may appear to be on a single doc:

Picture by creator | We first preprocess textual content into tokens. Then, for every token as focus phrase, we discover the context phrases with a window dimension of two. This implies we contemplate 2 tokens earlier than and after the main target phrase as context phrases. We are able to see that not all tokens have 2 tokens earlier than and after in a small instance textual content like this. In these instances, we use the accessible tokens. On this instance, we use the time period phrase and token loosely and interchangeably.

Having a number of targets for a similar function is perhaps complicated to think about. Right here’s one other method to consider learn how to put together the information:

Primarily, we put together function and goal pairs.
◼️ Construct a easy neural community: Then, we practice a easy neural community with a single hidden layer for the supervised studying downside utilizing the newly constructed dataset. The primary motive we’re coaching a neural community is to get the skilled weights from the hidden layer which turns into the phrase embeddings. The embedding for phrases that happen in comparable contexts are usually comparable to one another.

Having lined the overview, it’s time to implement it in Python to consolidate what we have now realized.

For the reason that focus of this submit is to develop higher instinct of how the algorithm works, we are going to deal with constructing it ourselves slightly than utilizing pretrained Word2Vec embeddings to deepen our understanding.

🔗 Disclaimer: Whereas creating the code for this submit, I’ve closely used the next repositories:
◼️ word-embedding-creation by Eligijus112 (His Medium web page: Eligijus Bujokas)
◼️ word2vec_numpy by DerekChia

I wish to thank these superior authors for making their helpful work accessible for others. Their repositories are nice extra studying sources if you wish to deepen your understanding of word2vec.

🔨 Word2vec with Gensim

We’ll use this pattern toy dataset from Eligijus112’s repository together with his permission. Let’s import libraries and the dataset.

import numpy as np
import pandas as pd
from nltk.tokenize.regexp import RegexpTokenizer
from nltk.corpus import stopwords
from gensim.fashions import Word2Vec, KeyedVectors
from scipy.spatial.distance import cosineimport tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Enter, Denseimport matplotlib.pyplot as plt
import seaborn as sns
sns.set(model='darkgrid', context='discuss')textual content = ["The prince is the future king.",
"Daughter is the princess.",
"Son is the prince.",
"Only a man can be a king.",
"Only a woman can be a queen.",
"The princess will be a queen.",
"Queen and king rule the realm.", 
"The prince is a strong man.",
"The princess is a beautiful woman.",
"The royal family is the king and queen and their children.",
"Prince is only a boy now.",
"A boy will be a man."]

We’ll now preprocess the textual content very frivolously. Let’s create a perform that lowercases textual content, tokenises the paperwork into alphabetic tokens and removes stopwords.

The extent of preprocessing can range from implementation to implementation. In some implementations, one might select to do little or no preprocessing preserving the textual content virtually as it’s. On the opposite spectrum, one may also select to do a extra thorough preprocessing than this instance.

def preprocess_text(doc):
tokeniser = RegexpTokenizer(r"[A-Za-z]{2,}")
tokens = tokeniser.tokenize(doc.decrease())
key_tokens = [token for token in tokens 
if token not in stopwords.words('english')]
return key_tokenscorpus = []
for doc in textual content:
corpus.append(preprocess_text(doc))
corpus

Now every doc consists of tokens. We’ll construct Word2Vec utilizing Gensim on our customized corpus:

dimension = 2
window = 2word2vec0 = Word2Vec(corpus, min_count=1, vector_size=dimension, 
window=window, sg=1)
word2vec0.wv.get_vector('king')

Picture by creator

We select window dimension of two for the contexts. This implies we are going to look 2 tokens earlier than and after the main target token. dimension can be set to 2. This refers back to the dimension of the vector. We selected 2 as a result of we are able to simply visualise it in two-dimensional chart and we’re working with a really small textual content corpus. These two hyperparameters might be tuned with completely different values to enhance the usefulness of the phrase embeddings for a use case. Whereas getting ready Word2Vec, we ensured to make use of Skip-gram algorithm by specifying sg=1. As soon as the embedding is prepared, we are able to see the embedding for a token 'king'.

Let’s see how intuitive the embeddings are. We’ll decide a pattern phrase: 'king' and see if the phrases which are most much like it within the vector area is sensible. Let’s discover probably the most comparable 3 phrases to 'king':

n=3
word2vec0.wv.most_similar(optimistic=['king'], topn=n)

Picture by creator

This record of tuples present probably the most comparable phrases and their cosine similarity to 'king'. This outcome shouldn’t be unhealthy given we’re working with a really small knowledge.

Let’s put together a DataFrame of the embeddings for the vocabulary, the gathering of distinctive tokens:

embedding0 = pd.DataFrame(columns=['d0', 'd1'])
for token in word2vec0.wv.index_to_key:
embedding0.loc[token] = word2vec0.wv.get_vector(token)
embedding0

Now, we are going to visualise the tokens in two-dimensional vector area:

sns.lmplot(knowledge=embedding0, x='d0', y='d1', fit_reg=False, side=2)
for token, vector in embedding0.iterrows():
plt.gca().textual content(vector['d0']+.02, vector['d1']+.03, str(token), 
dimension=14)
plt.tight_layout()

🔗 If you wish to be taught extra about Word2Vec in Gensim, right here’s a tutorial by Radim Rehurek, the creator of Gensim.

Alright, this was a pleasant warm-up. Within the subsequent part, we are going to create a Word2Vec embedding ourselves.

🔨 Handbook Word2Vec — Strategy 1

We’ll begin by discovering the vocabulary from the corpus. We’ll assign a worth to every token within the vocabulary:

vocabulary = sorted([*set([token for document in corpus for token in 
document])])
n_vocabulary = len(vocabulary)
token_index ={token: i for i, token in enumerate(vocabulary)}
token_index

Now, we are going to make token pairs as a preparation for the neural community.

token_pairs = []for doc in corpus:
for i, token in enumerate(doc):
for j in vary(i-window, i+window+1):
if (j>=0) and (j!=i) and (j<len(doc)):
token_pairs.append([token] + [document[j]])n_token_pairs = len(token_pairs)
print(f"{n_token_pairs} token pairs")token_pairs[:5]

Token pairs are prepared however they’re nonetheless in textual content kind. Now we have to one-hot-encode them in order that they’re appropriate for the neural community.

X = np.zeros((n_token_pairs, n_vocabulary))
Y = np.zeros((n_token_pairs, n_vocabulary))for i, (focus_token, context_token) in enumerate(token_pairs):    
X[i, token_index[focus_token]] = 1
Y[i, token_index[context_token]] = 1
print(X[:5])

Now that the enter knowledge is prepared, we are able to construct a neural community with a single hidden layer:

tf.random.set_seed(42)
word2vec1 = Sequential([
Dense(units=dimension, input_shape=(n_vocabulary,), 
use_bias=False, name='embedding'),
Dense(units=n_vocabulary, activation='softmax', name='output')
])
word2vec1.compile(loss='categorical_crossentropy', optimizer='adam')
word2vec1.match(x=X, y=Y, epochs=100)

We specified hidden layer to haven’t any bias phrases. Since we would like the hidden layer to have linear activation, we didn’t have to specify. The variety of items within the layer displays the scale of the vector: dimension.

Let’s extract the weights, our embeddings, from the hidden layer.

embedding1 = pd.DataFrame(columns=['d0', 'd1'])
for token in token_index.keys():
ind = token_index[token]
embedding1.loc[token] = word2vec1.get_weights()[0][ind]
embedding1

Utilizing our new embeddings, let’s see probably the most comparable 3 phrases to 'king':

vector1 = embedding1.loc['king']
similarities = {}for token, vector in embedding1.iterrows():
theta_sum = np.dot(vector1, vector)
theta_den = np.linalg.norm(vector1) * np.linalg.norm(vector)
similarities[token] = theta_sum / theta_densimilar_tokens = sorted(similarities.objects(), key=lambda x: x[1], 
reverse=True)
similar_tokens[1:n+1]

Picture by creator

Nice, this is sensible. We are able to save the embeddings and cargo them utilizing Gensim. As soon as loaded into Gensim, we are able to examine our similarity calculation.

with open('embedding1.txt' ,'w') as text_file:
text_file.write(f'{n_vocabulary} {dimension}n')
for token, vector in embedding1.iterrows():
text_file.write(f"{token} {' '.be part of(map(str, vector))}n")
text_file.shut()embedding1_loaded = KeyedVectors.load_word2vec_format('embedding1.txt', binary=False)
embedding1_loaded.most_similar(optimistic=['king'], topn=n)

The similarities calculated by Gensim matches our guide calculations.

We’ll now visualise the embeddings within the vector area:

sns.lmplot(knowledge=embedding1, x='d0', y='d1', fit_reg=False, side=2)
for token, vector in embedding1.iterrows():
plt.gca().textual content(vector['d0']+.02, vector['d1']+.03, str(token), 
dimension=14)
plt.tight_layout()