Poster Session 6 East
Friday, December 13, 2024 4:30 PM → 7:30 PM
Poster #1210

Human-level shape inferences: A benchmark for evaluating the 3D understanding of vision models

tyler bonnen, Stephanie Fu, Yutong Bai, Thomas O'Connell, Yoni Friedman, Josh Tenenbaum, Alexei Efros

Abstract

Human visual abilities are a common inspiration for computer vision algorithms. Here we introduce a benchmark to directly evaluate the alignment between human observers and vision models on 3D shape inferences. Our experimental design requires zero-shot visual inferences about object shape: given three images, participants identify which images contain the same object and which contains a different one, in spite of considerable viewpoint variation. The dataset includes common objects (e.g., chairs) as well as abstract shapes (i.e., synthetic objects without semantic attributes), and controls for a number of shape-orthogonal image properties (e.g., lighting and background). After constructing over 2000 unique image triplets, we administer these tasks to human participants, collecting 35K trials of behavioral data from over 500 participants. With these data, we define a series of increasingly granular evaluation metrics using choice, reaction time, and eye-tracking measurements. We evaluate models optimized via contrastive (DINOv2) and masked autoencoding (MAE) self-supervision objectives, as well as language-image pretraining (CLIP). While there are underlying similarities between human and model choice behaviors, humans outperform all models by a wide margin, typically succeeding where models fail. Using the more granular evaluation metrics derived from reaction time and gaze data, we conclude by identifying potential sources of this divergence. This benchmark is designed to serve as an independent validation set, showcasing the utility of a multiscale approach to evaluating human-model alignment on 3D shape inferences.
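The sketch below illustrates one plausible way a frozen vision model could be scored on such a triplet task; it is not the authors' published evaluation code. The names (`embed`, `odd_one_out`, `triplet_accuracy`) and the decision rule (treat the most similar pair of embeddings as the "same object" and the remaining image as the "different" one) are assumptions for illustration, with any frozen encoder (e.g., DINOv2, MAE, or the CLIP image tower) standing in for `embed`.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def odd_one_out(embeddings):
    """Given three image embeddings, return the index of the image judged to
    contain a different object: the one NOT in the most-similar pair.
    (Assumed decision rule, not necessarily the benchmark's protocol.)"""
    pairs = [(0, 1), (0, 2), (1, 2)]
    sims = [cosine(embeddings[i], embeddings[j]) for i, j in pairs]
    i, j = pairs[int(np.argmax(sims))]      # most similar pair -> same object
    return ({0, 1, 2} - {i, j}).pop()       # remaining image -> different object

def triplet_accuracy(triplets, embed):
    """Fraction of triplets where the model's choice matches the label.
    `embed` is any frozen image encoder mapping an image to a 1-D feature vector."""
    correct = 0
    for images, oddball_index in triplets:  # images: list of three arrays
        feats = [embed(img) for img in images]
        correct += int(odd_one_out(feats) == oddball_index)
    return correct / len(triplets)

if __name__ == "__main__":
    # Toy demo with synthetic "embeddings" in place of real images and encoders.
    rng = np.random.default_rng(0)
    same = rng.normal(size=128)
    triplets = [([same + 0.1 * rng.normal(size=128),
                  same + 0.1 * rng.normal(size=128),
                  rng.normal(size=128)], 2)]
    print(triplet_accuracy(triplets, embed=lambda x: x))
```

Human choices can be scored against the same labels, so a per-triplet comparison of model and human accuracy (and, at finer granularity, reaction time or gaze measures on the trials where they diverge) follows directly from this kind of loop.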