Many of us are used to setting alarms with Siri, asking Google for nearby restaurants or telling Alexa to turn up the lights. Even if we don't use these AI-powered assistants ourselves, we often see others using them. But when it comes to more complex tasks, such as drafting an email to the boss, the results can be baffling or simply a mess. This stark difference in outcomes raises an important question: do large language models (LLMs), the brains behind our virtual assistants and chatbots, really perform as we expect them to?
LLMs have become the cornerstone of modern AI applications. Models like OpenAI's GPT-4, Google's Gemini and Meta's Llama can generate human-like text, translate languages, write code and even craft poetry. The excitement around LLMs stems from their ability to handle a diverse range of tasks using a single model. This versatility offers immense potential: imagine a model helping a doctor summarise patient notes while also assisting a software engineer in debugging code.
However, this very diversity presents a significant challenge: how do we evaluate such a multifaceted tool? Traditional models are typically designed for specific tasks and evaluated against benchmarks tailored to those tasks. With LLMs, it is impractical to create benchmarks for every possible application they might be put to. This raises an essential question for researchers and users alike: how can we gauge where an LLM will perform well and where it might stumble?
The LLM dilemma
The crux of the problem lies in understanding human expectations. When deciding where to deploy an LLM, we naturally rely on our interactions with the model. If it performs well on one task, we might assume it will excel at related tasks. This generalisation process — where we infer the capabilities of a model based on limited interactions — is key to understanding and improving the deployment of LLMs.
In a new paper, MIT researchers Keyon Vafa, Ashesh Rambachan and Sendhil Mullainathan take a different approach. Their study, 'Do large language models perform the way people expect? Measuring the human generalisation function', explores how humans form beliefs about LLM capabilities and whether these beliefs align with the models' actual performance.
To start with, the researchers collected a substantial dataset of human generalisations. They surveyed participants, showing them examples of how an LLM responded to specific questions, and asked whether those responses influenced their beliefs about how the model would perform on other, related tasks. The data collection spanned 19,000 examples across 79 tasks, drawn from well-known benchmarks such as MMLU and BIG-Bench.
Analysing the data with natural language processing (NLP) techniques, they found that human generalisations are not random: they follow consistent, structured patterns that existing NLP methods can predict.
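To make the idea concrete, here is a minimal, purely illustrative sketch of that kind of prediction. It is not the researchers' actual pipeline; the survey records, labels and separator token below are invented. It simply trains a small text classifier to guess whether a person who has seen one model response would expect the model to handle a new question.

```python
# Toy sketch only -- NOT the paper's method. Hypothetical survey records:
# the example a participant saw, a new question, and a 0/1 label for
# "participant now expects the model to answer this correctly too".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

records = [
    ("Q: What is 12 x 9? A: 108", "What is 7 x 8?", 1),
    ("Q: What is 12 x 9? A: 108", "Summarise this legal contract.", 0),
    ("Q: Translate 'bonjour'. A: hello", "Translate 'gracias'.", 1),
    ("Q: Translate 'bonjour'. A: hello", "Prove this theorem about primes.", 0),
]

# Represent each (observed example, new question) pair as one string and fit
# a simple classifier that predicts the human generalisation.
texts = [seen + " [SEP] " + new for seen, new, _ in records]
labels = [label for _, _, label in records]
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Would someone who saw the arithmetic example expect success on more arithmetic?
print(model.predict_proba(["Q: What is 12 x 9? A: 108 [SEP] What is 15 x 6?"]))
```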
The researchers also evaluated how well different LLMs align with these human generalisations. They tested several models, including GPT-4, to see whether their performance matched human expectations, and discovered a paradox: larger, more capable models like GPT-4 often performed worse in high-stakes scenarios, precisely because users overestimated their capabilities. Smaller models, in contrast, sometimes aligned better with human expectations, leading to more reliable deployment in critical applications.
The researchers used a novel approach to evaluate model alignment. Instead of relying on fixed benchmarks, they modelled the human deployment distribution — the set of tasks humans choose based on their beliefs about the model’s capabilities. This method acknowledges that real-world use depends not just on the model’s abilities but also on human perceptions of those abilities.
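As a rough, back-of-the-envelope illustration of that idea (with entirely made-up numbers, not the paper's actual formulation), one can weight each model's per-task accuracy by how often users would choose to deploy it on that task. A model that looks stronger on benchmarks can then come out behind once over-confident deployment on high-stakes tasks is factored in.

```python
# Made-up numbers for illustration only. Hypothetical benchmark accuracy per task.
accuracy = {
    "large_model": {"trivia": 0.95, "medical_advice": 0.60},
    "small_model": {"trivia": 0.80, "medical_advice": 0.55},
}

# Hypothetical share of real-world use each model gets on each task, driven by
# user beliefs: impressive trivia answers lead people to over-deploy the large
# model on high-stakes medical questions.
deployment = {
    "large_model": {"trivia": 0.4, "medical_advice": 0.6},
    "small_model": {"trivia": 0.8, "medical_advice": 0.2},
}

for name in accuracy:
    # Accuracy weighted by where the model is actually used.
    expected = sum(deployment[name][t] * accuracy[name][t] for t in accuracy[name])
    print(name, round(expected, 2))
# Prints large_model 0.74 and small_model 0.75: the "stronger" model fares
# worse once misplaced human confidence shapes where it gets deployed.
```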
The findings are both fascinating and cautionary. They highlight that while larger LLMs have impressive capabilities, their misalignment with human generalisations can lead to significant deployment errors.
On the flip side, by understanding and modelling human generalisations, we can better align LLMs with user expectations. This could involve developing better interfaces that help users accurately gauge a model’s strengths and weaknesses or creating more targeted training data that helps models perform consistently across a broader range of tasks.