3 papers across 3 sessions
Attention heads in text-generative models specialize in semantic and visual concepts. Leveraging this specialization, we can reliably suppress or enhance specific attributes in both language and vision-language tasks.
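As a minimal illustration of this kind of head-level intervention, the sketch below rescales one head's output in a toy self-attention block; the module, head index, and scale factor are hypothetical stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadScaledAttention(nn.Module):
    """Toy multi-head self-attention with one head's output rescaled.

    `target_head` and `scale` are illustrative: scale=0.0 suppresses
    whatever concept that head encodes, scale>1.0 enhances it.
    """

    def __init__(self, embed_dim=64, num_heads=4, target_head=2, scale=0.0):
        super().__init__()
        self.h, self.d = num_heads, embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)
        self.target_head, self.scale = target_head, scale

    def forward(self, x):  # x: (batch, seq, embed_dim)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, heads, seq, head_dim).
        q, k, v = (t.view(b, s, self.h, self.d).transpose(1, 2)
                   for t in (q, k, v))
        o = F.scaled_dot_product_attention(q, k, v)
        # Per-head gain: 1 everywhere except the targeted head.
        gain = torch.ones(self.h, device=x.device)
        gain[self.target_head] = self.scale
        o = o * gain.view(1, self.h, 1, 1)
        return self.out(o.transpose(1, 2).reshape(b, s, -1))

attn = HeadScaledAttention(scale=0.0)   # fully suppress head 2
out = attn(torch.randn(1, 10, 64))      # (1, 10, 64)
```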
We provide a method for accurate end-to-end FP4 training of Large Language Models.
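For intuition about what FP4 training involves at the tensor level, here is a minimal sketch of fake-quantization onto the FP4 (E2M1) value grid with a straight-through estimator; the per-tensor scaling and the STE are generic ingredients of quantized training, not necessarily the paper's recipe.

```python
import torch

# Representable magnitudes of FP4 (E2M1); the sign bit is handled separately.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(x: torch.Tensor) -> torch.Tensor:
    """Round x onto a scaled FP4 grid, with a straight-through estimator."""
    scale = x.abs().max().clamp(min=1e-8) / FP4_GRID[-1]   # per-tensor scale
    mag = (x / scale).abs().unsqueeze(-1)                  # (..., 1)
    nearest = FP4_GRID[(mag - FP4_GRID).abs().argmin(dim=-1)]
    q = torch.sign(x) * nearest * scale
    # Straight-through estimator: forward uses q, backward sees identity.
    return x + (q - x).detach()

w = torch.randn(4, 4, requires_grad=True)
fake_quant_fp4(w).sum().backward()
print(w.grad)   # all ones: gradients pass straight through the rounding
```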
We characterize the structure of embeddings obtained via gradient descent, showing that the attention mechanism provably selects important tokens.
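As a standard back-of-the-envelope version of this selection effect (generic softmax-attention notation, not the paper's actual theorem): once the important token's score leads every other score by a margin γ, its attention weight is exponentially close to 1.

```latex
\alpha_i = \frac{\exp(q^\top k_i)}{\sum_{j=1}^{n} \exp(q^\top k_j)},
\qquad
q^\top k_{i^\star} \ge q^\top k_j + \gamma \;\; (\forall j \ne i^\star)
\;\Longrightarrow\;
\alpha_{i^\star} \ge \frac{1}{1 + (n-1)\, e^{-\gamma}} .
```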