Extending Word2Vec with Domain-Specific Labels

Choosing a proper representation of textual data is an important part of natural language processing. One option is using Word2Vec embeddings, i.e., dense vectors whose properties can, to a degree, capture the "meaning" of each word. One of the main disadvantages of Word2Vec is its inability to distinguish between antonyms. Motivated by this deficiency, this paper presents a Word2Vec extension for incorporating domain-specific labels. The goal is to improve the ability to differentiate between embeddings of words associated with different document labels or classes. This improvement is demonstrated on word embeddings derived from tweets related to a publicly traded company. Each tweet is given a label depending on whether its publication coincides with a stock price increase or decrease. The extended Word2Vec model then takes this label into account. The user can also set the weight of this label in the embedding creation process. Experimental results show that increasing this weight leads to a gradual decrease in cosine similarity between embeddings of words associated with different labels. This decrease in similarity can be interpreted as an improvement in the ability to distinguish between these words.


I. INTRODUCTION
Transformation of text into a numerical representation is an important part of any natural language processing (NLP) problem. At the level of words, one can choose between alternatives ranging from simple one-hot encoding to complex language models such as ELMo [9], BERT [3], or GPT-3 [2]. Word2Vec embeddings lie between these two extremes in terms of complexity.
Word embeddings are dense vectors, usually of several hundred dimensions, with the ability to at least partly capture the meaning of each word. This meaning is based on each word's context, i.e., the words that co-occur with a given target word. It is captured by the relative positions of different words in the embedding vector space. Words with similar meaning should be represented by vectors close to each other, while dissimilar words should be more distant. Moreover, basic operations such as addition or subtraction enable the derivation of representations for new words. A common example is subtracting the embedding of the word man from the embedding of the word king and then adding the embedding of the word woman; the result should be close to the embedding of the word queen.
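This analogy can be reproduced, for illustration, with pretrained embeddings loaded through the Gensim library; the pretrained vector set named below is only an example distributed with Gensim, not the model used in this paper.

import gensim.downloader as api

# Load a set of pretrained embeddings distributed with Gensim
# (any pretrained word vectors would serve for this illustration).
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman should land close to queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Antonyms such as good/bad often end up with relatively similar embeddings
print(vectors.similarity("good", "bad"))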
Originally proposed by [6], Word2Vec is an algorithm for creating word embeddings. It is based on a simple idea: words with similar meaning occur in similar contexts. This context is usually defined by the surrounding words. One issue with this line of thinking is that although Word2Vec works well for detecting synonyms, hyponyms, or hypernyms [5,14], it cannot easily distinguish between antonyms [11,1]. Hence, words such as good and bad are often represented by relatively similar embeddings.
Several authors, including [8] and [4], addressed this issue by extending the basic Word2Vec algorithm to consider not only the context of words but also thesaurus information. In the cited papers, several well-known public datasets such as WordNet or Roget are used. [10] claim that information about antonyms can be extracted from the geometry of the embedding space with a method called contrasting maps.
This paper describes a simple modification of the Word2Vec algorithm that considers domain-specific document labels during embedding creation. In contrast with the aforementioned work, the presented approach is more flexible and can be adapted to many domains of interest.
In this paper, the modification is applied in the domain of finance. The problem can be stated as follows: We have a set of tweets related to a specific publicly traded company. It is assumed that certain words occur more frequently when the company's stock price drops, while others are used more when the stock price rises. In many financial analysis tasks it would be useful to represent words associated with these opposite situations by vectors that are distant from each other. Can we improve on the basic Word2Vec algorithm by considering information about the stock price increase or decrease at the time of tweet publication?
Incorporating this information directly into word embeddings could be helpful in risk or return prediction. Some researchers, e.g., [12] and [13], already use the basic Word2Vec model for similar tasks, and this work could improve the predictive power of their models.
The remainder of this paper is organized as follows: Section II provides a basic overview of the Word2Vec algorithm, Section III describes a modification for considering domain-specific labels during embedding creation, and Section IV introduces the experiments performed to evaluate this modification. Results of these experiments are then presented and discussed in Section V. Finally, the presented work is summarized in Section VI, which also discusses future research opportunities.

II. WORD2VEC

The extension presented in this paper is based on the skip-gram model. When using this approach, the neural network takes the one-hot encoded target word as its input and tries to predict its multi-hot encoded context. The objective of the skip-gram model can be formulated as maximizing the average log probability [7]:

\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)    (1)

where T is the total number of target words, and c is the context window size, i.e., the number of words before or after the target word in the text that are considered as skip-gram output. The training sets for both the CBoW and skip-gram methods are practically identical and can be created easily from a corpus of text documents. After the neural network is trained, the embedding of a given word is simply the vector of weights of the connections between the input element representing that word in the one-hot encoding and all neurons of the hidden layer.
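As a concrete illustration of how the skip-gram training data is formed, the following minimal sketch generates (target, context) pairs from a tokenized text using a symmetric window of size c; the function and variable names are illustrative and not taken from the paper's implementation.

def skipgram_pairs(tokens, c=2):
    """Generate (target word, context word) pairs using a symmetric
    context window of size c around each target position."""
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(max(0, t - c), min(len(tokens), t + c + 1)):
            if j != t:
                pairs.append((target, tokens[j]))
    return pairs

# Example: a short (already preprocessed) tweet
print(skipgram_pairs(["disney", "stock", "price", "up"], c=2))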
In their subsequent paper, [7] extended the Word2Vec algorithm to improve both its performance in terms of training time and its accuracy. Two modifications were proposed:

- Frequent word subsampling: words that occur frequently in the text are omitted from the document with probability

P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}    (2)

where f(w_i) is the frequency of the word w_i and t is a threshold, usually around 10^{-5}. The most frequent words include articles such as the, a, and an, which do not carry much information about the contextual meaning of a specific word.
" Negative sampling: Leads to a modified objective function: which in practical terms causes only k words that are missing in a given context to update their embeddings during training. Experimental results show that these extensions lead to both more efficient training, and to better quality of final embeddings. Proposed extension implements word subsampling, but not negative sampling.

III. PROPOSED EXTENSION
As discussed in the introduction, one of Word2Vec's disadvantages is antonym representation. Antonyms such as good and bad are often surrounded by similar words, and hence their Word2Vec embeddings are relatively similar.
I propose an extension of the Word2Vec model that allows domain-specific document labels to be taken into account in order to better distinguish between words related to different classes. In contrast to the previous work presented in Section I, it does not rely on general-purpose thesauri.
The extension is based on the idea of modifying the output to be predicted by the neural network. In addition to the context (surrounding words) of a given input word, the neural network also has to predict the document class.
This extension further lets the user set the weight of the class label prediction relative to the word context prediction. The modified objective function can then be formulated as

\frac{1}{T}\sum_{t=1}^{T}\left[\sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t) + u \log p(d_t \mid w_t)\right]    (4)

where u is the label prediction weight and d_t is the document label associated with word w_t. This weight is implemented as a replication of the output neurons for label prediction u times.
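As an illustration only, the following Keras sketch builds a skip-gram-style network with a second output for the document label; instead of physically replicating the label output neuron u times as described above, it weights the label loss by u, which has the same effect on the objective. All layer sizes, names, and hyperparameters are assumptions, not the paper's actual implementation.

from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embedding_dim, u = 5000, 100, 10  # illustrative sizes and label weight

# Input: index of the target word (equivalent to a one-hot encoded input).
target_word = keras.Input(shape=(1,), dtype="int32")

# Hidden layer whose weights become the word embeddings after training.
embedding = layers.Embedding(vocab_size, embedding_dim)(target_word)
hidden = layers.Flatten()(embedding)

# Output 1: multi-hot encoded context words (the standard skip-gram output).
context_out = layers.Dense(vocab_size, activation="sigmoid", name="context")(hidden)

# Output 2: the document label (price increase/decrease), weighted by u.
label_out = layers.Dense(1, activation="sigmoid", name="label")(hidden)

model = keras.Model(inputs=target_word, outputs=[context_out, label_out])
model.compile(
    optimizer="adam",
    loss={"context": "binary_crossentropy", "label": "binary_crossentropy"},
    loss_weights={"context": 1.0, "label": float(u)},
)
model.summary()

Increasing u makes the label term dominate the training signal, which is what pushes apart the embeddings of words associated with different labels.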

IV. EXPERIMENT DESIGN
To verify that the extension helps the Word2Vec algorithm better consider domain-specific labels, an experiment was performed with the goal of creating embeddings from tweets related to a publicly traded company. Each tweet was labeled "1" if the stock price increased one day after tweet publication, and "0" if the stock price decreased. The difference between two consecutive daily closing prices was used to determine the label. The basic skip-gram model was then extended to predict not only the context of a word but also the label representing the stock price increase or decrease.
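A minimal sketch of this labeling step using pandas; the column names and the toy price series are hypothetical, since the paper does not describe the dataset layout.

import pandas as pd

# Hypothetical daily closing prices and tweets.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-02", "2020-03-03", "2020-03-04"]),
    "close": [118.3, 117.0, 120.5],
})
tweets = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-02", "2020-03-03"]),
    "text": ["$DIS looking strong today", "$DIS parks closed, not good"],
})

# Label = 1 if the next day's close is higher than the close on the
# publication day, 0 otherwise (difference of two consecutive closes).
prices = prices.sort_values("date").reset_index(drop=True)
prices["label"] = (prices["close"].shift(-1) > prices["close"]).astype(int)

tweets = tweets.merge(prices[["date", "label"]], on="date", how="left")
print(tweets)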
The hypothesis to test can be stated as follows: When the price change label is considered during embedding creation, the distance between embeddings of words occurring more frequently on price decreases and embeddings of words occurring more frequently on price increases will be greater than for embeddings created with the standard Word2Vec skip-gram model. Moreover, this distance should grow with the weight of the price change label, represented by u in equation (4).

A. Data and Preprocessing
The experiment utilized tweets related to The Walt Disney Company, which is publicly traded on the New York Stock Exchange under the DIS symbol. Tweets published between January 1, 2017 and December 31, 2020 were used to train the Word2Vec model. The $DIS "cashtag" was used to find tweets related to the company. A subset of 45,000 tweets containing 401,000 words was then randomly selected in order to reduce both the training time and memory use.
These tweets were preprocessed before they were used as input for the skip-gram model. The following steps were performed (a sketch of such a cleaning function is shown after this list):
- the text was transformed to lower-case,
- cashtag symbols ($), hashtag symbols (#), and user mention symbols (@) were removed,
- all other non-alphanumeric symbols were removed,
- all HTML tags were removed,
- all URLs were replaced by a "__URL__" placeholder,
- and all numbers were replaced by a "__NUMBER__" placeholder.
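The sketch below applies these steps with regular expressions; the exact order of operations and the regular expressions themselves are assumptions, not the paper's actual preprocessing code.

import re

def preprocess(tweet: str) -> str:
    """Apply the preprocessing steps listed above to a single tweet."""
    text = tweet.lower()
    text = re.sub(r"<[^>]+>", " ", text)                # remove HTML tags
    text = re.sub(r"https?://\S+", " __URL__ ", text)   # replace URLs with a placeholder
    text = re.sub(r"[$#@]", "", text)                   # remove cashtag/hashtag/mention symbols
    text = re.sub(r"\d+", " __NUMBER__ ", text)         # replace numbers with a placeholder
    text = re.sub(r"[^a-z0-9_\s]", " ", text)           # remove other non-alphanumeric symbols
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("$DIS up 5% today! <b>wow</b> https://example.com @someuser"))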

B. Hypothesis Evaluation
To evaluate the aforementioned hypothesis, the polarity of each word was calculated as the difference between the number of occurrences of a given word in tweets associated with a price increase (positive occurrences) and in tweets associated with a price decrease (negative occurrences).
Since the dataset contains different numbers of positive and negative tweets, the polarity of negative words was further modified using the following equation:

pol_m(w) = pol(w) \cdot \frac{N_{pos}}{N_{neg}}    (5)

where pol_m(w) is the modified polarity of word w, pol(w) is the basic polarity of the same word calculated as the difference between positive and negative occurrences, and N_pos and N_neg are the total numbers of tweets related to a price increase and a price decrease, respectively. The 75 most positive words were then paired with the 75 most negative words (the most positive word was paired with the most negative word, the second most positive with the second most negative, and so on). The distances between the embeddings of the paired words were then examined.
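A minimal sketch of this polarity computation and word pairing; the tokenization and the exact form of the class-imbalance correction are assumptions based on the description above.

from collections import Counter

def polarity_table(pos_tweets, neg_tweets):
    """pol(w) = occurrences in price-increase tweets minus occurrences in
    price-decrease tweets; negative polarities are rescaled by N_pos / N_neg
    (one plausible reading of the correction in equation (5))."""
    pos_counts = Counter(w for t in pos_tweets for w in t.split())
    neg_counts = Counter(w for t in neg_tweets for w in t.split())
    scale = len(pos_tweets) / len(neg_tweets)
    table = {}
    for w in set(pos_counts) | set(neg_counts):
        pol = pos_counts[w] - neg_counts[w]
        table[w] = pol * scale if pol < 0 else pol
    return table

def pair_top_words(polarities, n=75):
    """Pair the n most positive words with the n most negative ones
    (most positive with most negative, second with second, and so on)."""
    ranked = sorted(polarities, key=polarities.get, reverse=True)
    return list(zip(ranked[:n], ranked[::-1][:n]))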
Furthermore, a list of 10 specific antonym pairs was constructed. This list was then checked to make sure that one word in each pair indeed has a negative polarity while the other word's polarity is positive. These antonym pairs are listed in Table I. The distances between their embeddings were again examined.
To account for different vector norms, cosine similarity was used as the distance measure. According to Agudo [1], cosine similarity is also preferred for synonym or antonym detection tasks.
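For reference, a small sketch of the cosine similarity computation over a pair of embedding vectors (the vectors below are made up for the example):

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors; it depends only
    on the angle between the vectors, not on their norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))  # 0.5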
All experiments were implemented in Python 3.9 using well-known libraries including numpy, pandas, and Keras.

TABLE I
PAIRS OF ANTONYMS WHOSE DISTANCES WERE EXAMINED DURING EXPERIMENTS

Positive word   Negative word
buying          selling
upgraded        downgraded
raised          lowered
strength        weakness
ahead           delayed
bullish         bearish
up              down
above           below
high            low
positive        negative

V. EXPERIMENT RESULTS

Table II shows the cosine similarity measure for the antonym pairs listed above. These results include three different levels of label weight, as well as similarities derived from embeddings trained by the well-known Gensim Word2Vec implementation. With the label weight set to 1, the average cosine similarity is very close to the similarity exhibited by the Gensim embeddings. This small difference can be attributed to the random nature of the neural network training process. However, as the label weight increases, the similarity between embeddings starts to drop significantly. When the label weight is set to 100, the embeddings become almost orthogonal.
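For comparison purposes, the Gensim baseline can be trained along the following lines; the parameter values shown are illustrative, since the paper does not list the exact settings.

from gensim.models import Word2Vec

# `sentences` is assumed to be a list of tokenized, preprocessed tweets.
sentences = [["dis", "up", "after", "earnings"], ["dis", "down", "on", "park", "closures"]]

# Skip-gram (sg=1) with frequent word subsampling, mirroring the extended model's setup.
baseline = Word2Vec(sentences, vector_size=100, window=2, sg=1,
                    sample=1e-5, min_count=1, epochs=10, seed=1)

print(baseline.wv.similarity("up", "down"))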

These results are confirmed by the second test examining the 75 pairs of top positive and top negative words. Noteworthy negative words include "down", "coronavirus", "below", "downgraded", or "risk". Interesting positive words include "higher", "nice", "streaming", "up", or "nflx". Using the same label weight values as in Table II, the mean cosine similarities across all word pairs were 0.36 (u = 1), 0.315 (u = 10), and 0.046 (u = 100). The mean cosine similarity for the Gensim embeddings was 0.346. These results exhibit the same behavior as the results for the 10 selected antonym pairs.
Both experiments support the hypothesis stated in Section IV. Given a sufficiently large label weight, domain-specific labels can indeed help increase the distance between relevant word embeddings. Moreover, this distance increase grows with the label weight.

VI. CONCLUSION
This paper explored the possible utilization of domain-specific labels during word embedding creation with the Word2Vec algorithm. An extension of the skip-gram model was proposed and evaluated in an experiment where word embeddings were created from a dataset of tweets related to The Walt Disney Company. The results of this experiment show that the extension helps distinguish between words whose occurrence is correlated with different labels (in this case, stock price increase and decrease). Furthermore, the cosine similarity between such words decreases as the label weight increases.
The presented experiments were performed on a relatively small dataset of 45,000 tweets related to a single company. The proposed extension should therefore be examined on significantly larger amounts of data in the future. Moreover, the extension implementation used during the experiments is not ready to be deployed to production; further performance improvements are needed. Combining the extension with negative sampling should also be examined.
The extension was compared with a standard Word2Vec implementation provided by the Gensim library. Additional comparison to the models proposed by [8] or [4] would be beneficial. One of the potential benefits of the presented extension is its ability to consider any domain-specific labels instead of relying on a specific thesaurus.
The work presented in this paper is part of a larger project that aims to examine whether sentiment and other information extracted from social media can be used to improve mean-risk investment portfolio optimization models. In the future, I plan to examine the usefulness of the presented extension for stock price and risk prediction and compare it with complex language models such as BERT.