DataComp-LM: In search of the next generation of training sets for language models
#5109
Amro Abbas, Alon Albalak, Kushal Arora, Hritik Bansal, Yonatan Bitton, Yair Carmon, Khyathi Chandu, Mayee Chen, Giannis Daras, Achal Dave, Alex Dimakis, Alaaeldin El-Nouby, Fartash Faghri, Alex Fang, Samir Yitzhak Gadre, Josh Gardner, Saurabh Garg, Dhruba Ghosh, Aaron Gokaslan, Dirk Groeneveld, Etash Guha, Suchin Gururangan, Reinhard Heckel, Cheng-Yu Hsieh, Gabriel Ilharco, Maor Ivgi, Jenia Jitsev, Matt Jordan, Sham Kakade, Sedrick Scott Keh, Maciej Kilian, Pang Wei Koh, Thomas Kollar, Jeffrey Li, Kyle Lo, Kalyani Marathe, Jean Mercat, Niklas Muennighoff, Marianna Nezhurina, Thao Nguyen, Sewoong Oh, Hadi Pouransari, Sarah Pratt, Sunny Sanyal, Ludwig Schmidt, Vaishaal Shankar, Rulin Shao, Georgios Smyrnis, Luca Soldaini, Shuran Song, Alexander Toshev, Igor Vasiljevic, Stephanie Wang, Mitchell Wortsman, Rui Xin, Luke Zettlemoyer, Hanlin Zhang, Jieyu Zhang