2 papers across 1 session
We develop hypothesis tests for fairness metrics that apply to both small and large subgroups.
STSBench is a benchmark that evaluates the capabilities of Multi-modal Large Language Models to reason about spatio-temporal actions.