We develop the unbiased sliced Wasserstein RBF kernel to better measure cross-modal alignment between the acoustic and linguistic modalities for audio captioning and reasoning tasks.
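To make the idea of a sliced Wasserstein RBF kernel concrete, here is a minimal sketch in plain Python. It is not the unbiased estimator described above: it uses an ordinary Monte Carlo estimate of the squared sliced 2-Wasserstein distance and plugs it into an RBF form, exp(-gamma * SW^2). The function names, the `gamma`, `n_projections`, and `seed` parameters, and the toy "audio" and "text" embeddings are all assumptions made for illustration, and the sketch assumes both point sets have the same number of points.

```python
import math
import random


def sliced_wasserstein_sq(x, y, n_projections=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    x and y are lists of equal length holding points of the same dimension.
    (Hypothetical helper; the equal-size restriction is a simplification.)
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes equally sized point sets")
    rng = random.Random(seed)
    dim = len(x[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction and normalise it onto the unit sphere.
        theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]
        # Project both sets onto the direction and sort: for equal-size,
        # uniformly weighted 1-D samples, matching sorted values is optimal.
        px = sorted(sum(t * v for t, v in zip(theta, p)) for p in x)
        py = sorted(sum(t * v for t, v in zip(theta, p)) for p in y)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_projections


def sw_rbf_kernel(x, y, gamma=1.0, n_projections=64, seed=0):
    """RBF kernel on the (plain Monte Carlo) sliced Wasserstein distance."""
    return math.exp(-gamma * sliced_wasserstein_sq(x, y, n_projections, seed))


# Toy stand-ins for acoustic and linguistic embeddings (made-up values).
audio = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
text = [[0.5, 1.0], [1.0, 0.5], [2.0, 1.5]]
similarity = sw_rbf_kernel(audio, text, gamma=0.5)
```

Because the exponential is nonlinear, averaging over a finite number of random projections inside it generally gives a biased estimate of the kernel value, which is the kind of issue an unbiased construction would need to address.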