2 papers across 1 session
We interpret attention as a discrete-time Markov chain and show the effectiveness of this interpretation on various downstream tasks.
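The observation behind this reading can be sketched in a few lines. This is not the paper's implementation, only a minimal illustration under stated assumptions: after a row-wise softmax, every row of an attention matrix is a probability distribution, so the matrix can be treated as the transition matrix of a discrete-time Markov chain over tokens. The helper names (`softmax_rows`, `step`, `stationary`) and the 3-token logits are hypothetical and chosen only for illustration.

```python
import math

def softmax_rows(scores):
    """Row-wise softmax: each row becomes a probability distribution."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def step(dist, P):
    """One Markov-chain step: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary(P, iters=1000, tol=1e-12):
    """Power iteration starting from the uniform distribution."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(iters):
        nxt = step(dist, P)
        if max(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    return dist

# Hypothetical 3-token attention logits (illustrative values, not from the paper).
scores = [[2.0, 0.5, 0.1],
          [0.3, 1.5, 0.2],
          [0.1, 0.4, 1.0]]

P = softmax_rows(scores)   # row-stochastic: usable as a transition matrix
pi = stationary(P)         # long-run share of "attention mass" per token
```

Because softmax produces strictly positive entries, the resulting chain is irreducible and aperiodic, so the power iteration converges to a unique stationary distribution. Whether and how the paper uses multi-step transitions or stationary distributions is not stated in the summary above; the sketch only shows that such quantities are well defined.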