TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance
Abstract
TAM-Eval is a framework and benchmark for evaluating large language models on comprehensive test suite maintenance tasks including creation, repair, and updating across multiple programming languages.
While Large Language Models (LLMs) have shown promise in software engineering, their application to unit testing remains largely confined to isolated test generation or oracle prediction, neglecting the broader challenge of test suite maintenance. We introduce TAM-Eval (Test Automated Maintenance Evaluation), a framework and benchmark designed to evaluate model performance across three core test maintenance scenarios: creation, repair, and updating of test suites. Unlike prior work limited to function-level tasks, TAM-Eval operates at the test file level, while maintaining access to full repository context during isolated evaluation, better reflecting real-world maintenance workflows. Our benchmark comprises 1,539 automatically extracted and validated scenarios from Python, Java, and Go projects. TAM-Eval supports system-agnostic evaluation of both raw LLMs and agentic workflows, using a reference-free protocol based on test suite pass rate, code coverage, and mutation testing. Empirical results indicate that state-of-the-art LLMs have limited capabilities in realistic test maintenance processes and yield only marginal improvements in test effectiveness. We release TAM-Eval as an open-source framework to support future research in automated software testing. Our data and code are publicly available at https://github.com/trndcenter/TAM-Eval.
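To make the setup concrete, the sketch below shows one way a file-level maintenance scenario with repository context could be represented and turned into a prompt. This is a minimal illustration under stated assumptions: the `Scenario` fields, `MaintenanceTask` labels, and `build_prompt` helper are hypothetical stand-ins for exposition, not the actual TAM-Eval schema (see the linked repository for the real format).

```python
# Illustrative sketch only: field names and prompt layout are assumptions,
# not the actual TAM-Eval data format.
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class MaintenanceTask(Enum):
    CREATE = "create"   # write a new test file for uncovered source code
    REPAIR = "repair"   # fix a test file that fails against the current code
    UPDATE = "update"   # adapt an existing test file to a source-code change


@dataclass
class Scenario:
    task: MaintenanceTask
    repo_root: Path       # full repository checkout available as context
    source_file: Path     # file under test
    test_file: Path       # test file to be created, repaired, or updated
    language: str         # "python" | "java" | "go"


def build_prompt(scenario: Scenario) -> str:
    """Assemble a file-level prompt with repository context (simplified)."""
    source = (scenario.repo_root / scenario.source_file).read_text()
    existing_tests = ""
    if scenario.task is not MaintenanceTask.CREATE:
        existing_tests = (scenario.repo_root / scenario.test_file).read_text()
    return (
        f"Task: {scenario.task.value} the test file {scenario.test_file}\n"
        f"--- Source under test ---\n{source}\n"
        f"--- Current test file ---\n{existing_tests}\n"
    )
```

The key point the sketch captures is that the unit of work is an entire test file, with the surrounding repository available as context, rather than an isolated function snippet.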
Community
🧪 TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance
What’s new:
Large Language Models (LLMs) have been widely explored for unit test generation, but real-world test suite maintenance — creating, updating, and repairing tests as code evolves — hasn't been systematically evaluated. This paper introduces TAM-Eval, a benchmark and evaluation framework that targets exactly these maintenance tasks in realistic software contexts.
Core contributions:
🔹 Benchmark and framework: TAM-Eval evaluates LLMs on three maintenance scenarios — creation, repair, and updating of test suites — at the test-file level with full repository context (not isolated function snippets).
🔹 Real-world dataset: The benchmark is built from 1,539 automatically extracted and validated scenarios from open-source projects in Python, Java, and Go.
🔹 Evaluation metrics: Instead of simple accuracy scores, the framework uses reference-free metrics that reflect real software quality:
Test suite pass rate
Code coverage change
Mutation testing outcomes
These align more closely with developers' goals during maintenance (a minimal scoring sketch follows this list).
🔹 Empirical results: State-of-the-art LLMs show only limited improvements on realistic maintenance workflows — indicating that current models still struggle with practical test suite evolution.
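To make the reference-free scoring idea concrete, here is a minimal sketch of how a maintained test suite could be compared against its pre-maintenance baseline using the three signals above. The `SuiteReport` fields and the `score` helper are illustrative assumptions, not the actual TAM-Eval API; the real framework obtains these numbers by executing the suite, measuring coverage, and running mutation testing with each project's own toolchain.

```python
# Minimal scoring sketch, assuming a harness has already produced
# pass/coverage/mutation numbers; names here are hypothetical.
from dataclasses import dataclass


@dataclass
class SuiteReport:
    passed: int            # test cases that passed
    total: int             # test cases executed
    coverage: float        # line coverage of the suite, in [0, 1]
    killed_mutants: int    # mutants detected by the suite
    total_mutants: int     # mutants generated for the code under test


def score(before: SuiteReport, after: SuiteReport) -> dict:
    """Compare the maintained suite against the pre-maintenance baseline."""
    pass_rate = after.passed / after.total if after.total else 0.0
    coverage_delta = after.coverage - before.coverage
    mutation_score = (
        after.killed_mutants / after.total_mutants if after.total_mutants else 0.0
    )
    return {
        "pass_rate": pass_rate,            # does the maintained suite run green?
        "coverage_delta": coverage_delta,  # did maintenance gain or lose coverage?
        "mutation_score": mutation_score,  # does the suite still catch injected faults?
    }
```

Because the comparison is against executable outcomes rather than a reference solution, any system (raw LLM or agentic workflow) can be evaluated with the same protocol.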
Why it matters:
Automated testing and maintenance are essential for high-quality software. Most benchmarks have focused on test generation at the function level; TAM-Eval shifts the focus to the maintenance workflows developers actually deal with, providing a new community standard for evaluating LLMs in software engineering contexts.
Open science: The TAM-Eval code and dataset are fully open-source, enabling future research and direct integration into evaluation pipelines.