PhD student, Korea Advanced Institute of Science & Technology
1 paper at NeurIPS 2025
New Transformer architectures, FAL and FAL+, carefully reuse the first layer's attention output: FAL cuts tensor-parallel (TP) communication for up to 44% faster training, while FAL+ improves perplexity.
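Purely as an illustration of the "reuse the first attention output" idea, here is a minimal PyTorch sketch, assuming the reuse roughly means caching the first layer's attention map and applying it in later layers with only a value projection. The module names (`ReuseAttentionBlock`, `ReuseAttentionTransformer`) and the value-only path are hypothetical simplifications, not the paper's actual FAL/FAL+ implementation:

```python
# Hypothetical sketch: later layers reuse the first layer's attention map,
# so they skip fresh query/key computation (and, under tensor parallelism,
# some of the associated synchronization).
import torch
import torch.nn as nn


class ReuseAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.v_proj = nn.Linear(d_model, d_model)  # value path used when reusing attention

    def forward(self, x, cached_attn=None):
        h = self.norm1(x)
        if cached_attn is None:
            # First layer: compute attention normally and expose the attention map.
            out, attn_weights = self.attn(
                h, h, h, need_weights=True, average_attn_weights=True
            )
        else:
            # Later layers: apply the cached attention map to a value projection only.
            out = cached_attn @ self.v_proj(h)
            attn_weights = cached_attn
        x = x + out
        x = x + self.mlp(self.norm2(x))
        return x, attn_weights


class ReuseAttentionTransformer(nn.Module):
    def __init__(self, n_layers: int, d_model: int, n_heads: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            ReuseAttentionBlock(d_model, n_heads) for _ in range(n_layers)
        )

    def forward(self, x):
        cached = None
        for i, block in enumerate(self.blocks):
            x, attn = block(x, cached_attn=cached)
            if i == 0:
                cached = attn  # keep the first layer's attention map for reuse
        return x


# Toy forward pass.
model = ReuseAttentionTransformer(n_layers=4, d_model=64, n_heads=4)
y = model(torch.randn(2, 16, 64))
print(y.shape)  # torch.Size([2, 16, 64])
```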