New Transformer architectures, FAL and FAL+, carefully reuse the output of the first attention layer: FAL cuts tensor-parallel (TP) communication for up to 44% faster training, while FAL+ improves perplexity.
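A minimal sketch of the reuse idea, under a simplified reading of the summary: the first layer computes self-attention once, and every later layer consumes that cached attention output (here through a per-layer linear projection) instead of running its own attention, so those layers skip the attention all-reduce that tensor parallelism would otherwise require. The class names, the projection, and all hyperparameters below are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class ReuseBlock(nn.Module):
    """Hypothetical FAL-style block: no attention of its own; it mixes in
    the cached first-layer attention output via a learned projection."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.reuse_proj = nn.Linear(d_model, d_model)  # assumed per-layer mixer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor, first_attn: torch.Tensor) -> torch.Tensor:
        # Reusing first_attn means no new attention computation here, and
        # hence no attention all-reduce under tensor parallelism.
        x = x + self.reuse_proj(self.norm1(first_attn))
        x = x + self.ffn(self.norm2(x))
        return x

class FALSketch(nn.Module):
    """Layer 0 computes attention once; layers 1..N-1 reuse its output."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, n_layers=6):
        super().__init__()
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_b = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn0 = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.blocks = nn.ModuleList(
            ReuseBlock(d_model, d_ff) for _ in range(n_layers - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm_a(x)
        first_attn, _ = self.attn(h, h, h, need_weights=False)  # computed once
        x = x + first_attn
        x = x + self.ffn0(self.norm_b(x))
        for blk in self.blocks:
            x = blk(x, first_attn)  # every later layer reuses first_attn
        return x

# Usage: y = FALSketch()(torch.randn(2, 16, 512))  # (batch, seq, d_model)
```

The communication saving in this reading comes from the reuse blocks containing only feed-forward sublayers and a projection, which halves the per-layer TP synchronization points relative to a standard block with both attention and FFN.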