How Do We Evaluate Large Language Models?

It’s not as easy as you may think.

Large language models (LLMs), like OpenAI’s ChatGPT and Meta’s Llama, have been transforming our lives for a while now. Yet, with so many models to choose from, many people are wondering which model is “the best.” To answer this question, both researchers and users often turn to benchmarks and tests to see which model solved the hardest coding problems or got the highest SAT score. In this post, I will argue three points.

Neither benchmarks nor traditional tests are appropriate to evaluate the capabilities of modern LLMs.
LLMs exhibiting human-like abilities without possessing human-like intelligence and cognition add entirely new dimensions to the field of psychometrics.
Substantial research will be required to arrive at LLM assessments whose results can be interpreted with confidence.

Benchmarks

Benchmarks have traditionally been used to assess the performance of software and hardware. A benchmark evaluates a tool’s performance by having it complete a set of tasks for which it was specifically designed. An image classifier is benchmarked by having it classify a selection of images, and a computer processor is benchmarked by running a series of complex computations.

When it comes to LLMs, benchmarking is not straightforward. First, LLMs are not trained for any specific task: they can be used for text classification, but they are not text classifiers; they can be used to score essays, but they are not automated scorers – and so on. Therefore, any benchmark result depends not only on which LLM was used, but also on how it was used. This ambiguity detracts from the credibility of the results and often leads to debates, for example, about whether a different prompt would have led to different results.

Two other common problems with benchmarks are saturation, meaning that all recent models are getting close to perfect scores, and contamination, meaning that some or all of a benchmark’s elements are included in a model’s training data. Both problems are particularly acute in the case of LLMs because their progress is rapid and their training data contain virtually the entire internet.

Owing to these and other issues, many LLM benchmarks offer limited value in assessing the overall quality of an LLM. This shortcoming has sparked initiatives to benchmark the benchmarks according to various quality criteria. Such efforts aim to establish a set of high-quality benchmarks that comprised carefully crafted problem sets, monitored for saturation and contamination, and updated or recalibrated if necessary. In this regard, benchmarks are moving closer to traditional tests where such practices have been common from the beginning. However, shifting from benchmarking to testing AI brings its own set of challenges.

Tests

Virtually everyone has been tested at some point in their life, whether for college admission, professional licensure, or a driver's license. Such tests are markedly different from benchmarks. Most importantly, the ability or knowledge that a test assesses is too complex to be measured directly. For example, a student’s readiness for college cannot be tested by letting them attend a selection of undergraduate programs. Therefore, tests need to be carefully designed to be valid.

Consider two common types of validity evidence: predictive and content-related. Predictive evidence for a test’s validity can be established by the degree to which its score predicts important observable outcomes and performances. For example, SAT scores correlate well with various measures of academic success. Content-related evidence suggests that the test reflects the ability being tested. For instance, an algebra question set in a tennis context shouldn’t require knowledge of tennis rules, nor should it be answerable only by knowledge of tennis rules.

Validity issues inevitably arise when we let LLMs take tests designed for humans. Take predictive evidence: An LLM can ace the SAT, but it will not enroll in college; it can pass the bar exam with flying colors, but it will not represent clients in court—at least for the foreseeable future. Similar problems arise with content-related evidence. If a human scores high on an algebra test, one might infer that they understand and are able to apply the laws of algebra probed by the test’s items. In contrast, the question of how LLMs solve algebra problems and if they really learn generalizable laws is still largely unanswered. Typically, the more complex the construct being tested, the more speculative the interpretation of an LLM test score becomes: Does an LLM that scores high on a medical licensing exam really demonstrate knowledge of clinical medicine or patient management abilities?

However, with more tasks and responsibilities being delegated to LLMs, we are witnessing the emergence of early tests that are designed specifically for LLMs. For example, a company using an LLM for its customer service needs to test a new model before deploying it. While such tests might start as a collection of benchmarks and sanity checks, over time they tend to become more structured and include more sophisticated items that capture important aspects of challenges that previous models encountered and perhaps mishandled. Consequently, the test will become an increasingly informative indicator of a model’s ability to meet the company’s customer service needs.

While such “proto tests” are useful, they are often proprietary, limited in scope, and driven by operational needs rather than by scientific inquiry.

Research Challenges

As argued above, the distinct non-human intelligence of LLMs invalidates many of the assumptions underpinning test theory and psychometrics. Significant research efforts will be required to establish which tests are appropriate for LLMs and which interpretations of test results can be supported by scientifically sound experiments.

Moreover, large networks trained from scratch on enormous data sets are unlikely to remain the only systems with human-like abilities. For example, Joint-Embedding Predictive Architectures (JEPAs) learn in a more human fashion by directly observing and interacting with their environments, whereas neurosymbolic AIs are focused on symbolic reasoning and explicit knowledge representation. Hence, researchers might soon be faced with a multitude of different types of intelligence that give rise to the same abilities.

This raises fundamental questions: Can we define constructs independently of the underlying type of intelligence? Is, say, the ability to “think critically” the same for humans and various types of AI? If so, how should we measure it? Will each type of intelligence require its own test? For example, a critical thinking test might account for test takers’ varying degrees of literacy but will likely assume that all test takers can count and know the cardinal directions. For LLMs, the opposite is the case: they are highly literate by design but might be lacking in basic skills. As long as such differences are not accounted for, LLM test results will remain prone to misinterpretations.

Finally, there might be interesting cross-fertilization between AI tests and more established areas of psychometrics. For example, factors such as age, gender, culture, and education, in addition to neurological disorders, have been shown to impact cognitive processes in individuals. In this context, an AI can be seen as an extreme case of a neurodivergent intelligence. A better understanding of this extreme case could pave the way for more personalized, fairer, and more objective assessments, allowing learners with unique cognitive traits to demonstrate the full spectrum of their competencies.

In conclusion, while assessment of LLMs is a considerable challenge, my fellow researchers at ETS and I are excited by the opportunity to push boundaries and improve the techniques of modern psychometrics.

Michael Fauss is a research scientist at the ETS Research Institute. His work focuses on ethical AI.