Poster Session 6 · Friday, December 5, 2025 4:30 PM → 7:30 PM
#3209
On the Emergence of Linear Analogies in Word Embeddings
Abstract
Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability P(w, w') of words w and w' in text corpora. The resulting vectors not only group semantically similar words but also exhibit a striking linear analogy structure---for example, king − man + woman ≈ queen---whose theoretical origin remains unclear.
Previous observations indicate that this analogy structure:
- (i) already emerges in the top eigenvectors of the matrix M(w, w') = P(w, w') / (P(w) P(w')),
- (ii) strengthens and then saturates as more eigenvectors of M, whose number controls the dimension of the embeddings, are included,
- (iii) is enhanced when using log M(w, w') rather than M(w, w'), and
- (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king--queen, man--woman) are removed from the corpus.
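The construction behind these observations can be sketched in a few lines. The snippet below is an illustrative NumPy sketch, not the paper's implementation: it uses random toy co-occurrence counts in place of corpus statistics, forms the ratio of joint to marginal probabilities, takes its logarithm, and embeds each word with the top-k eigenvectors of the resulting symmetric matrix, with k setting the embedding dimension.

```python
import numpy as np

# Toy symmetric co-occurrence counts C[w, w'] (a real experiment would
# count word pairs within context windows of a corpus such as Wikipedia).
rng = np.random.default_rng(0)
n_words = 50
C = rng.integers(1, 100, size=(n_words, n_words)).astype(float)
C = (C + C.T) / 2                      # symmetrize the counts

P = C / C.sum()                        # joint probability P(w, w')
p = P.sum(axis=1)                      # marginal probability P(w)
M = P / np.outer(p, p)                 # M(w, w') = P(w, w') / (P(w) P(w'))

# Taking log M rather than M reportedly strengthens the analogy structure.
logM = np.log(M)                       # all counts are >= 1, so M > 0

def embed(S, k):
    """Embed words with the top-k eigenvectors of a symmetric matrix S,
    scaled by the square roots of the eigenvalue magnitudes."""
    vals, vecs = np.linalg.eigh(S)
    top = np.argsort(np.abs(vals))[::-1][:k]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

E = embed(logM, k=10)                  # one 10-dimensional vector per word
```

With random counts the embeddings carry no semantics; the point is only the pipeline from co-occurrence statistics to spectral embeddings.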
To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)--(iv). It can be viewed as offering fine-grained insight into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and with the analogy benchmark introduced by Mikolov et al.
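To see why attribute-based interactions yield exact linear analogies, consider a minimal hypothetical instance (an illustration of the idea, not the paper's exact construction): each word is a binary vector of attributes, here [royal, male], and the log co-occurrence ratio is assumed to be a quadratic form log M(w, w') = a_w· Q a_w' with an interaction matrix Q. Because log M then factors through the attribute matrix, its spectral embeddings are linear functions of the attributes, and the analogy holds exactly.

```python
import numpy as np

# Words defined by binary semantic attributes [royal, male] (hypothetical).
words = ["king", "queen", "man", "woman"]
A = np.array([[1., 1.],   # king  = royal and male
              [1., 0.],   # queen = royal
              [0., 1.],   # man   = male
              [0., 0.]])  # woman = neither

# Assumed attribute-based interactions: log M(w, w') = a_w^T Q a_{w'},
# with an illustrative positive-definite interaction matrix Q.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
logM = A @ Q @ A.T          # symmetric, rank <= number of attributes

# Embed each word with the top eigenvectors of log M, scaled by
# the square roots of the eigenvalues.
vals, vecs = np.linalg.eigh(logM)
top = np.argsort(vals)[::-1][:2]          # the two non-trivial directions
E = vecs[:, top] * np.sqrt(vals[top])     # rows = word embeddings

# Embeddings are linear in the attributes, so the analogy is exact:
analogy = E[0] - E[2] + E[3]              # king - man + woman
assert np.allclose(analogy, E[1])         # equals queen
```

Since a_king − a_man + a_woman = a_queen at the attribute level, any embedding that is linear in the attributes inherits the analogy, which is the mechanism this toy construction isolates.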