CSE Faculty Candidate Seminar - Yizhong Wang
Name: Yizhong Wang, Ph.D. candidate from the University of Washington
Date: Thursday, February 13, 2025 at 11:00 am
Location: Coda Building, Room 200
Link: The recording of this in-person seminar will be uploaded to CSE's MediaSpace
Coffee, drinks, and snacks provided!
Title: Building a Sustainable Data Foundation for AI
Abstract: The rapid expansion of AI is consuming data at an unprecedented scale. However, as pretraining on raw Internet data reaches diminishing returns and downstream applications grow increasingly complex, the question arises: what data paradigm will sustain the next generation of more powerful AI systems? Answering it requires a systematic rethinking of how we structure, create, and share data. In this talk, I will present my research on building language models that go beyond pretraining while improving their generalization through a data-centric perspective. First, I will introduce how unifying NLP tasks through task instructions enables broader generalization. The resulting technique, instruction tuning, redefines how task data should be structured and has transformed how people interact with language models. Second, I will explore synthetic data creation, in which models themselves are employed in the data production process. This led to Self-Instruct, the first framework to use language models to create diverse tasks and to demonstrate model self-improvement. Finally, I will discuss the development of Tülu and OLMo, two representative open models, highlighting the central role of data curation and open collaboration in advancing AI research. Together, these efforts to unify task representations, leverage synthetic data, and foster open data sharing have shaped the current AI research and application landscape. As these trends continue to evolve, they hold the potential to establish a scalable and sustainable data foundation for the future of AI.
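To make the two data ideas in the abstract concrete, here is a minimal, hypothetical sketch (not the speaker's actual code): a unified instruction/input/output record of the kind instruction tuning popularized, and a toy Self-Instruct-style loop that bootstraps new task instructions from a seed pool. The `generate_instruction` function is a stand-in for a real language-model call, and the exact-match novelty check stands in for the real pipeline's similarity-based filtering.

```python
import random
from dataclasses import dataclass

@dataclass
class TaskRecord:
    """One task in the unified instruction/input/output schema."""
    instruction: str  # natural-language description of the task
    input: str        # task-specific input (may be empty)
    output: str       # target response

# Instruction tuning's core move: heterogeneous NLP tasks share one schema.
seed_tasks = [
    TaskRecord("Translate the sentence to French.", "Good morning.", "Bonjour."),
    TaskRecord("Classify the sentiment as positive or negative.",
               "I loved this movie.", "positive"),
]

def generate_instruction(examples: list[TaskRecord]) -> str:
    """Hypothetical stand-in for a language-model call that proposes a new
    instruction from in-context examples; here it just samples templates."""
    templates = [
        "Summarize the following paragraph in one sentence.",
        "List three synonyms for the given word.",
        "Rewrite the sentence in formal English.",
    ]
    return random.choice(templates)

def self_instruct(seeds: list[TaskRecord], rounds: int) -> list[TaskRecord]:
    """Toy Self-Instruct-style loop: bootstrap new tasks from seeds, keeping
    only instructions not already in the pool (a crude novelty filter)."""
    pool = list(seeds)
    seen = {t.instruction for t in pool}
    for _ in range(rounds):
        demos = random.sample(pool, k=min(2, len(pool)))
        candidate = generate_instruction(demos)
        if candidate not in seen:  # drop duplicates of existing tasks
            seen.add(candidate)
            pool.append(TaskRecord(candidate, "", ""))  # inputs/outputs filled later
    return pool

if __name__ == "__main__":
    for task in self_instruct(seed_tasks, rounds=5):
        print(task.instruction)
```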
Bio: Yizhong Wang is a Ph.D. candidate at the University of Washington, advised by Hannaneh Hajishirzi and Noah Smith. He has also been a student researcher at the Allen Institute for Artificial Intelligence (AI2) for the past two years, co-leading the post-training efforts in building fully open language models (OLMo). His research focuses on the fundamental data challenges in AI development and on algorithms centered around data, particularly for building more general-purpose models. His work, including Super-NaturalInstructions, Self-Instruct, and Tülu, is widely used in building today's large language models. He has won multiple paper awards, including the ACL 2024 Best Theme Paper Award, the CCL 2020 Best Paper Award, and the ACL 2017 Outstanding Paper Award. He also serves on the program committees of top NLP and ML conferences and was an area chair for EMNLP 2024.
Media Contact
Mary High
mhigh7@gatech.edu