Poster Session 2 West
Wednesday, December 11, 2024 4:30 PM → 7:30 PM
Poster #5206

The Multimodal Universe: Enabling Large-Scale Machine Learning with 100TBs of Astronomical Scientific Data

Eirini Angeloudi, Jeroen Audenaert, Micah Bowles, Benjamin M. Boyd, David Chemaly, Brian Cherinka, Ioana Ciucă, Miles Cranmer, Aaron Do, Matthew Grayling, Erin E. Hayes, Tom Hehir, Shirley Ho, Marc Huertas-Company, Kartheik Iyer, Maja Jablonska, Francois Lanusse, Henry Leung, Kaisey Mandel, Rafael Martínez-Galarza, Peter Melchior, Lucas Meyer, Liam Parker, Helen Qu, Jeff Shen, Michael Smith, Connor Stone, Mike Walmsley, John Wu

Abstract

We present the "Multimodal Universe", a large-scale multimodal dataset of scientific astronomical data, compiled specifically to facilitate machine learning research. Overall, the Multimodal Universe contains hundreds of millions of astronomical observations, constituting 100 TB of multi-channel and hyper-spectral images, spectra, and multivariate time series, as well as a wide variety of associated scientific measurements and "metadata". In addition, we include a range of benchmark tasks representative of standard practice for machine learning methods in astrophysics. This massive dataset will enable the development of large multimodal models specifically targeted towards scientific applications. All code used to compile the Multimodal Universe, along with a description of how to access the data, is available at https://github.com/MultimodalUniverse/MultimodalUniverse
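
For readers who want a sense of what access looks like in practice, the linked repository describes programmatic loading through the Hugging Face datasets library. The snippet below is a minimal sketch under that assumption; the dataset identifier "MultimodalUniverse/plasticc" is used purely as an illustrative example of one time-series subset and may differ from the actual identifiers documented in the repository.

    # Minimal access sketch (assumes distribution via the Hugging Face `datasets`
    # library, as described in the linked repository; the dataset name below is
    # illustrative, not authoritative).
    from datasets import load_dataset

    # Stream a single survey subset rather than downloading the full 100 TB corpus.
    dset = load_dataset("MultimodalUniverse/plasticc", split="train", streaming=True)

    # Inspect the fields of the first example (e.g. light-curve times, fluxes, metadata).
    example = next(iter(dset))
    print(example.keys())

Streaming is used here so that individual modalities or surveys can be explored without committing local storage to the entire corpus.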