Vector Representation of Words: A Summary of Methods


PART I: Classical Machine Learning

Why vectorize words?

"Vectorization" can be understood as "numericalization". Why numericalize? Because text cannot be operated on mathematically, whereas numbers can. [明天探索者 2021] Whatever method is used to vectorize words, the goal is the same: to feed them into a training model for computation. In one sentence, the purpose is to put words into a vector space.

What is the relationship between word features in text and vectorization?

In NLP, characters, words, word frequencies, n-grams, part-of-speech tags, and so on can all be regarded as features; once these features are vectorized, they can be fed into a model for computation. For example, the bag-of-words model uses word-frequency features, while word2vec can be viewed as using co-occurrence features of text within a context window. [明天探索者 2021]

The figure below shows how the algorithms discussed in this article relate to one another.

Bag-of-Words (BoW): The BoW model separately matches and counts each element in the document to form a vector representation of a document. [Dongyang Yan 2020 Network-Based]

Approach: A document is mapped to a vector v = [x1, x2, ..., xn], where xi denotes the occurrence of the ith word among the basic terms.

        - The basic terms (lemmatized forms, e.g. ate --> eat, jumping --> jump, with stopwords such as 'a' and 'the' removed) are usually the top-n highest-frequency words collected from the dataset (note: from all documents, not just the single document being analysed).

        - The value of the occurrence feature can be binary, a term frequency, or a term frequency-inverse document frequency (TF-IDF) score. A binary value indicates whether the ith word is present in the document, which ignores the weight of words. The term frequency is the number of occurrences of each word. TF-IDF assumes that the importance of a word increases proportionally to its frequency in the document but is offset by its frequency across the whole corpus. [Dongyang Yan 2020 Network-Based]

Example (a short code sketch follows this example): 我喜欢水,很想喝水。("I like water and really want to drink water.") [Jonathan Hui]

        basic terms:[我,喜,欢,水,很,想,喝]

        test character: [水]

        binary: [0,0,0,1,0,0,0]

        term-frequency: [0,0,0,2,0,0,0]

        TF-IDF: ...
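
A minimal Python sketch that reproduces the binary and term-frequency vectors above (the variable names are illustrative, and TF-IDF is left out, as in the example):

```python
from collections import Counter

# Toy example from above: character-level basic terms taken from
# "我喜欢水,很想喝水", with 水 as the test character.
basic_terms = ["我", "喜", "欢", "水", "很", "想", "喝"]
document    = "我喜欢水很想喝水"   # punctuation removed
test_char   = "水"

freq = Counter(document)  # character-level counts over the document

# Binary: 1 at the position of the test character, 0 elsewhere.
binary = [1 if term == test_char else 0 for term in basic_terms]

# Term frequency: number of occurrences of the test character in the document,
# placed at its position among the basic terms.
tf = [freq[term] if term == test_char else 0 for term in basic_terms]

print(binary)  # [0, 0, 0, 1, 0, 0, 0]
print(tf)      # [0, 0, 0, 2, 0, 0, 0]
```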
 

Advantages: a straightforward method for representing text in a vector space. [Dongyang Yan 2020 Network-Based]

Disadvantages: 1) The occurrence value xi is matched and counted without considering the influence of other words, so much context information may be lost because correlated words are not handled.

        Example: Sen 1: 我想喝水。("I want to drink water.") Sen 2: 水想喝我。("Water wants to drink me.")

                basic terms: [我,想,喝,水]

                test characters: [我,想,喝,水]

                binary: [1,1,1,1]

        Each word in the basic terms occurs exactly once in both sentences, so the BoW model projects Sen 1 and Sen 2 onto the same vector, i.e. v1 = v2 = [1,1,1,1], even though the two sentences have opposite meanings. [Dongyang Yan 2020 Network-Based]

2) The traditional vector space model relies on exact word matching, i.e. the user's query terms are matched exactly against the words present in the vector space, so it cannot handle polysemy (one word with several meanings) or synonymy (several words with one meaning). In search, what we actually want to compare is not the words themselves but the meanings and concepts hidden behind them.


Word Embedding scheme:

This method introduces the dependence of one word on other words and is the most popular vector representation of document vocabulary. [Dhruvil Karani 2018 Introduction to Word]

In a vector space, words with similar contexts occupy nearby positions; in other words, similar words cluster together and dissimilar words repel each other. Mathematically, the cosine of the angle between such vectors should be close to 1, i.e. the angle should be close to 0. [Dhruvil Karani 2018 Introduction to Word]
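
As a quick illustration of the "cosine close to 1" criterion, here is a minimal sketch using NumPy; the three vectors are made-up placeholders rather than trained embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-dimensional vectors, purely for illustration.
v_water = np.array([0.9, 0.1, 0.3])
v_drink = np.array([0.8, 0.2, 0.4])
v_car   = np.array([-0.1, 0.9, -0.5])

print(cosine_similarity(v_water, v_drink))  # close to 1: similar contexts
print(cosine_similarity(v_water, v_car))    # much smaller: dissimilar contexts
```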

Implementation 1:

Word2Vec: a method for constructing an embedding that captures the local statistics of a corpus. The embedding can be obtained with either of two training methods (both involving neural networks): Continuous Bag Of Words (CBOW) or Skip-Gram; a short code sketch follows the two variants below. A very good video explanation can be found on Bilibili: word2vec_bilibili

具体做法:

                --> CBOW: predict the centre word from its surrounding context words. This algorithm takes the context of each word as input and tries to predict the word corresponding to that context. For the mathematical details, see [Xin Rong 2016 word2vec Parameter].

Advantages: according to Mikolov, 1) CBOW is faster (more computationally efficient) and 2) yields better representations for more frequent words.

Disadvantages: lower accuracy.

                --> Skip-Gram: predict the surrounding context words from the centre word. This model uses the target word (whose representation we want to generate) to predict its context, and in the process produces the representations. To some extent it can be viewed as a flipped (multiple-context) CBOW. For the mathematical details, see [Xin Rong 2016 word2vec Parameter].

Advantages: according to Mikolov, 1) Skip-Gram works well with small amounts of data and 2) represents rare words well, i.e. good accuracy.

Disadvantages: longer computation time.
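
A minimal training sketch with gensim (assuming gensim 4.x; the toy corpus and hyper-parameters are illustrative). The `sg` flag switches between the two variants: `sg=0` trains CBOW, `sg=1` trains Skip-Gram:

```python
from gensim.models import Word2Vec

# Tiny made-up corpus: each document is a list of tokens.
corpus = [
    ["i", "like", "water", "and", "want", "to", "drink", "water"],
    ["i", "really", "want", "to", "drink", "water"],
    ["cats", "drink", "milk"],
]

# sg=0 -> CBOW (context predicts the centre word), sg=1 -> Skip-Gram.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=200)

print(model.wv["water"][:5])           # first few dimensions of the embedding
print(model.wv.most_similar("drink"))  # nearest neighbours by cosine similarity
```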

Word2Vec advantages: by capturing local statistics, word2vec does very well on analogy tasks. [Sharma 2017 Vector]

Word2Vec disadvantages: word2vec relies only on local information about language.

It is unable to leverage the global statistics of the corpus, since the vectors are trained on separate local context windows. That is, the semantics learnt for a given word are affected only by the surrounding words. [Sharma 2017 Vector]

Implementation 2:

LSA: a method that efficiently exploits statistical information from global co-occurrence counts. The vector representation is obtained via a singular value decomposition (SVD).

Approach: 1) A term-frequency matrix X containing word counts per document (rows represent unique words and columns represent documents) is constructed from a large piece of text. 2) A mathematical technique called singular value decomposition (SVD) is used to decompose this matrix into the product of three simpler matrices: a term-concept matrix U, a singular-value matrix D, and a concept-document matrix V (a code sketch follows the summary below). [Letsche 1997 Large-scale information retrieval]

Quoting Wu Jun's summary in "Matrix Computation and Classification Problems in Text Processing": the three matrices have very clear physical meanings.

  • Each row of the first matrix U represents a class of words with related meanings; each non-zero element indicates the importance (or relevance) of a word within that class, with larger values meaning more relevant.
  • Each column of the last matrix V represents a class of articles on the same topic; each element indicates the relevance of an article within that class.
  • The middle matrix D expresses the correlation between the word classes and the article classes.

Therefore, a single singular value decomposition of the association matrix X simultaneously accomplishes synonym clustering and document classification, and also yields the correlation between each article class and each word class. [笨兔勿应, 博客园]
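
A minimal LSA sketch, assuming scikit-learn and NumPy are available; the corpus and the number of concepts k are illustrative. `CountVectorizer` builds the counts and `numpy.linalg.svd` performs the decomposition into the U, D, and V described above:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus; each string is one document.
docs = [
    "water drink water",
    "drink milk",
    "matrix decomposition svd",
]

# Term-document matrix X: rows = unique words, columns = documents.
X = CountVectorizer().fit_transform(docs).toarray().T

# X ≈ U @ diag(D) @ Vt; keep only the top-k singular values (the "concepts").
U, D, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k  = U[:, :k]    # term-concept matrix
D_k  = D[:k]       # concept strengths
Vt_k = Vt[:k, :]   # concept-document matrix

print(U_k.shape, D_k.shape, Vt_k.shape)  # (n_terms, 2) (2,) (2, n_docs)
```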

Advantages: 1) Efficiently exploits statistical information from global co-occurrence counts.

            2) LSA can handle the synonymy problem (several words with one meaning) that the traditional vector space model cannot solve.

Disadvantages: LSA cannot handle polysemy (one word with several meanings), because it maps each word to a single point in the latent semantic space; the different senses of a word therefore correspond to the same point and are not distinguished.

Implementation 3:

GloVe: this method captures both the global statistics and the local statistics of a corpus when putting words into a vector space. [Ganegedara 2019 Intuitive Guide to] GloVe builds on the observation that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. [Stanford GloVe]

Approach:        1) It utilises count data, which gives it the ability to capture global statistics:

a co-occurrence matrix is constructed, in which each cell Xij is a "strength" representing how often word i appears in the context of word j. [GloVe 算法原理, 知乎]

        2) It predicts surrounding words by performing a dynamic logistic regression.

The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence.

Since the logarithm of a ratio equals the difference of logarithms, this objective associates (the logarithms of) ratios of co-occurrence probabilities with vector differences in the word-vector space. [Stanford GloVe]
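
Written out, this is the GloVe weighted least-squares objective from the Stanford paper, J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2. Below is a sketch of the per-pair term in NumPy; the vectors, biases, and co-occurrence count are illustrative, and f uses the paper's standard x_max = 100, alpha = 0.75:

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X_ij): damps rare pairs, caps very frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_tilde_j, b_i, b_tilde_j, x_ij):
    """f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 for one co-occurring pair."""
    err = np.dot(w_i, w_tilde_j) + b_i + b_tilde_j - np.log(x_ij)
    return weight(x_ij) * err ** 2

# Illustrative values only: two random 5-d vectors, zero biases, co-occurrence count 3.
rng = np.random.default_rng(0)
print(pair_loss(rng.normal(size=5), rng.normal(size=5), 0.0, 0.0, x_ij=3.0))
```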

Advantages: GloVe combines the benefits of the word2vec skip-gram model on word-analogy tasks with those of matrix-factorization methods such as LSA that exploit global statistical information.

Disadvantages: it uses a lot of memory: the fastest way to construct the term-co-occurrence matrix is to keep it in RAM as a hash map and perform co-occurrence increments globally. [Sciforce 2018]

BERT

Advantages: BERT produces contextual embeddings: the same word receives different vectors in different sentences, which addresses the polysemy problem that the static embeddings above cannot solve.

Disadvantages: the model is large and computationally expensive to pre-train and to run at inference time.

