Evaluating the factuality of LLM-generated text
Abstract: Recent advances in large language models (LLMs) have enabled them to process texts that are millions of words long, fueling demand for long-form language processing tasks such as the summarization or translation of books. However, LLMs struggle to take full advantage of the information within such long contexts, which contributes to factually incorrect text generation. In this talk, I consider the growing problem of long-form evaluation: as the inputs and outputs of long-form tasks grow ever longer, how do we even measure progress? I propose a high-level framework (applicable to both human and automatic evaluation) that first decomposes a long-form text into simpler atomic units and then evaluates each unit along a specific aspect. I demonstrate the framework's effectiveness at evaluating factuality and faithfulness on tasks such as book summarization and biography generation. Overall, our experiments suggest that, despite their impressive capabilities, LLMs have a long way to go before their output can be trusted.
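To make the decompose-then-evaluate idea concrete, here is a minimal sketch of such a pipeline in Python. This is an illustration of the general recipe the abstract describes, not the speaker's actual implementation: the function names (`decompose`, `judge_claim`, `faithfulness_score`) are hypothetical, and `call_llm` is a placeholder for whatever LLM completion function you supply.

```python
# Hypothetical sketch of a decompose-then-verify evaluation loop.
# `call_llm` is any function mapping a prompt string to a model response;
# the prompts and helper names here are illustrative, not a real API.

from typing import Callable, List


def decompose(text: str, call_llm: Callable[[str], str]) -> List[str]:
    """Ask an LLM to split a long-form text into atomic factual claims."""
    prompt = (
        "Break the following text into a list of short, self-contained "
        "factual claims, one per line:\n\n" + text
    )
    response = call_llm(prompt)
    # Keep non-empty lines, dropping any leading bullet characters.
    return [line.strip("- ").strip() for line in response.splitlines() if line.strip()]


def judge_claim(claim: str, source: str, call_llm: Callable[[str], str]) -> bool:
    """Ask an LLM whether a single atomic claim is supported by the source."""
    prompt = (
        f"Source:\n{source}\n\nClaim: {claim}\n\n"
        "Is the claim supported by the source? Answer 'yes' or 'no'."
    )
    return call_llm(prompt).strip().lower().startswith("yes")


def faithfulness_score(generated: str, source: str,
                       call_llm: Callable[[str], str]) -> float:
    """Fraction of atomic claims in `generated` that the source supports."""
    claims = decompose(generated, call_llm)
    if not claims:
        return 0.0
    supported = sum(judge_claim(claim, source, call_llm) for claim in claims)
    return supported / len(claims)
```

The same loop works for human evaluation by replacing `judge_claim` with an annotator judgment; the key design choice is that each atomic unit is scored independently on one aspect, rather than asking for a single holistic judgment of a long text.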
Mohit Iyyer is an associate professor in computer science at the University of Massachusetts Amherst, with a primary research interest in natural language generation. He has received best paper awards at NAACL (2016, 2018), an outstanding paper award at EACL 2023, a best demo award at NeurIPS 2015, and the 2022 Samsung AI Researcher of the Year award. He obtained his PhD in computer science from the University of Maryland, College Park in 2017 and spent the following year as a researcher at the Allen Institute for Artificial Intelligence.