Automated Unit Test Generation: From 'Cool Demo' to 'Actually Works'
Learning the hard way that getting from 'cool demo' to 'actually reliable' is the real challenge.
The Problem: Testing is Essential, Testing is Tedious
The Developer Reality: 30-40% of development time goes to testing, yet most codebases still have inadequate test coverage.
Here's a truth every developer knows but doesn't want to admit: writing tests sucks. It's not that tests aren't important – they absolutely are. It's that test writing feels like the programming equivalent of doing dishes. You know you need to do it, but you'd rather spend your time building the actual product.
[Screenshot: Typical developer workflow - features vs tests time allocation]
The eternal struggle: building features vs writing comprehensive tests
The numbers back this up. Developers typically spend 30-40% of their time on testing-related activities, yet most codebases still have inadequate test coverage. When deadlines loom, guess what gets cut first? It's not the shiny new feature – it's the comprehensive test suite that might catch bugs before users do.
This seemed like the perfect problem for AI to solve. LLMs can understand code, they can write code, and they can read documentation. In theory, they should be able to look at a function and its docstring and generate a complete test suite. Easy, right?
Why This Gets Hard Fast
Writing tests isn't just about generating code that runs. It's about anticipating edge cases, understanding the semantic intent behind documentation, and creating tests that actually catch bugs rather than just passing with the happy path.
- The naive AI approach: generate tests that look right
- Our systematic approach: generate tests that catch real bugs
What humans do when writing tests:
- Parse function signatures and understand valid inputs
- Read docstrings and infer intended behavior
- Think about edge cases, error conditions, and boundary values
- Write assertions that meaningfully test correctness
Current LLMs are decent at the first two, okay at the third, and surprisingly bad at the fourth. They'll generate tests that look plausible but fail to actually validate the function's behavior in meaningful ways.
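To make that concrete, here's a small illustration of my own (not taken from our dataset) of the gap between a test that merely looks right and one that actually constrains behavior. The function is written in the style of a HumanEval problem:

```python
# A function in the style of HumanEval problems.
def truncate_number(number: float) -> float:
    """Return the decimal part of a positive floating point number."""
    return number % 1.0

# The kind of plausible-looking test an LLM often produces:
# it runs, it passes, but it barely constrains the documented behavior.
def test_truncate_number_weak():
    assert truncate_number(3.5) == 0.5

# What a careful human writes: boundary values, exact integers,
# and values with awkward floating point representations.
def test_truncate_number_meaningful():
    assert truncate_number(3.5) == 0.5
    assert truncate_number(7.0) == 0.0                 # exact integer: no decimal part
    assert abs(truncate_number(1.33) - 0.33) < 1e-6
    assert abs(truncate_number(123.456) - 0.456) < 1e-6
```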
Building a Systematic Approach
I tackled this with my collaborator Ethan Yu, using the HumanEval+ dataset as our foundation. This gave us 164 programming problems with existing test suites – perfect for training and evaluation. But even with good data, we immediately hit problems.
🔧 The Dataset Challenge
Dataset Inconsistency Problem: Original HumanEval+ had wildly inconsistent test coverage - some functions had 1 test, others had 100+.
The original HumanEval+ dataset had wildly inconsistent test coverage. Some functions had one test case, others had over a hundred. When we tried to train models on this, they had no idea when to stop generating tests. We'd get incomplete outputs, token limit overruns, and models that couldn't follow basic instructions.
Our solution: We reconstructed the entire dataset. We standardized on 5, 7, 9, or 11 test cases per function and rebuilt the training data to be consistent. This wasn't glamorous work, but it was essential – you can't evaluate model performance if the model can't even generate well-formed output.
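To give a flavor of that reconstruction work, here's a minimal sketch of the down-sampling step, assuming a JSON-lines layout with a list of test assertions per problem. The field names, file names, and the rule for picking which bucket (5, 7, 9, or 11) a problem lands in are illustrative assumptions, not our exact pipeline:

```python
import json
import random

# Standardized suite sizes (assumption: largest bucket we can fill wins).
TARGET_SIZES = (5, 7, 9, 11)

def standardize(problem, rng):
    """Return a copy of the problem with a fixed-size test suite, or None."""
    tests = problem.get("tests", [])  # field name is an assumed schema
    fillable = [size for size in TARGET_SIZES if len(tests) >= size]
    if not fillable:
        return None  # too few tests to reach even the smallest bucket
    target = max(fillable)
    # Down-sample so every problem has a consistent, well-formed suite.
    return {**problem, "tests": rng.sample(tests, target)}

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducible rebuilds
    with open("humaneval_plus.jsonl") as fin, open("standardized.jsonl", "w") as fout:
        for line in fin:
            rebuilt = standardize(json.loads(line), rng)
            if rebuilt is not None:
                fout.write(json.dumps(rebuilt) + "\n")
```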
[System Architecture: Function Analysis → Test Generation → Evaluation Pipeline]
Standardized dataset → Model training → Dual-lens evaluation → Mutation testing
🔬 The Model Comparison
We tested multiple approaches:
- Fine-tuned LLaMA 3.0/3.1: Our attempt to specialize a general model for test generation
- CodeLLaMA: Hypothesis that code-specific models would perform better
- GPT-3.5: Baseline comparison with a widely-used model
- Gemini: Additional comparison point
The Evaluation Framework: Beyond "Does It Compile?"
Here's where most AI code generation projects fail: they focus on whether the generated code is syntactically correct and call it a day. But syntactically correct code that doesn't actually test the function is worse than useless – it gives you false confidence.
🔍 Syntax Validation
- Parse correctly
- Run without crashing
- Follow expected format

🧠 Semantic Accuracy
- Validate function behavior
- Test edge cases properly
- Catch actual bugs
Our dual-lens evaluation system:
- Syntax Validation: Does the generated code parse correctly? Can it run without crashing?
- Semantic Accuracy: Do the generated tests actually validate the function's behavior? We extracted inputs and expected outputs from generated assertions, ran them against the original functions, and checked whether the tests would pass or fail correctly.
Only tests that passed both criteria were considered successful. This revealed the real challenge: getting models to generate semantically meaningful tests, not just syntactically correct code.
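As a rough sketch of how the two lenses can be wired together (simplified relative to our actual pipeline, and only handling literal arguments), syntax validation is a parse check, while the semantic check pulls `assert f(args) == expected` statements apart and re-runs the inputs against the reference implementation:

```python
import ast

def syntax_valid(test_code: str) -> bool:
    """Lens 1: does the generated test code even parse?"""
    try:
        ast.parse(test_code)
        return True
    except SyntaxError:
        return False

def semantically_accurate(test_code: str, reference_fn) -> bool:
    """Lens 2 (simplified): for each `assert f(args) == expected` assertion,
    re-run the extracted inputs against the reference implementation and check
    that the expected value the model wrote down is actually correct."""
    checked = 0
    for node in ast.walk(ast.parse(test_code)):
        if not (isinstance(node, ast.Assert) and isinstance(node.test, ast.Compare)):
            continue
        if not isinstance(node.test.ops[0], ast.Eq):
            continue
        call, comparator = node.test.left, node.test.comparators[0]
        if not isinstance(call, ast.Call):
            continue
        try:
            # This sketch only handles literal arguments and expected values.
            args = [ast.literal_eval(a) for a in call.args]
            expected = ast.literal_eval(comparator)
        except ValueError:
            continue
        if reference_fn(*args) != expected:
            return False
        checked += 1
    return checked > 0

# A generated test with a wrong expected value passes lens 1 but fails lens 2.
generated = "assert truncate_number(3.5) == 0.7"
print(syntax_valid(generated))                              # True
print(semantically_accurate(generated, lambda x: x % 1.0))  # False
```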
Mutation Testing: Finding the Real Gaps
But we wanted to go further. Even if a model generates tests that work on the original function, how robust are those tests? Will they catch bugs if the function changes?
[Screenshot: Mutation testing dashboard showing introduced bugs vs detected bugs]
Our LLM-based mutation testing revealed which tests actually catch real bugs
Traditional mutation testing tools like mutmut weren't suitable for our custom test format, so we developed an LLM-based mutation approach using Gemini. We had it introduce subtle bugs into functions:
- Changing arithmetic operators (+ to -, * to /)
- Modifying logical conditions (<= to <)
- Tweaking return values slightly
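Here's a minimal sketch of the kill check itself. The mutant below is hand-written as a stand-in for what Gemini produced; a mutant counts as "killed" only if the suite passes on the original function but fails on the mutated one:

```python
def run_suite(fn, test_cases):
    """Return True if every (args, expected) pair passes against fn."""
    return all(fn(*args) == expected for args, expected in test_cases)

def mutation_killed(original_fn, mutant_fn, test_cases) -> bool:
    """Killed = suite passes on the original but fails on the mutant."""
    return run_suite(original_fn, test_cases) and not run_suite(mutant_fn, test_cases)

# Original function and a hand-written mutant: <= has been tightened to <.
def below_threshold(numbers, t):
    return all(n <= t for n in numbers)

def below_threshold_mutant(numbers, t):
    return all(n < t for n in numbers)

# A happy-path-only suite lets the mutant survive...
weak_suite = [(([1, 2, 3], 10), True)]
# ...while a suite with a boundary case at the threshold kills it.
strong_suite = weak_suite + [(([1, 2, 10], 10), True)]

print(mutation_killed(below_threshold, below_threshold_mutant, weak_suite))    # False
print(mutation_killed(below_threshold, below_threshold_mutant, strong_suite))  # True
```

Only the suite with a boundary case at the threshold kills the `<=` → `<` mutant, and that's exactly the kind of gap the mutation scores exposed.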
Then we tested whether our generated test suites would catch these mutations. The results were... humbling.
Real Results: The Good, Bad, and Ugly
📊 Model Performance Breakdown

| Model | Accuracy | Syntax Validity |
| --- | --- | --- |
| Gemini | 46.67% | 100% |
| GPT-3.5 | 40% | 83.3% |
| Our fine-tuned model | 38.46% | 86.7% |
Let's be honest about the numbers: these aren't terrible, but they're not production-ready either. And mutation testing revealed even bigger problems – our fine-tuned model dropped to just **26.92%** accuracy when tested against mutated functions.
Key Finding: Models are decent at generating tests that pass on original functions, but struggle to generate tests that catch real bugs.
What I Actually Learned
The technical results were interesting, but the bigger lessons were about the nature of building reliable AI systems:
- The Demo vs. Reality Gap is Real: easy to build impressive-looking test code, much harder to build tests that improve code quality.
- Evaluation is Everything: without rigorous evaluation frameworks, you can't distinguish between working systems and ones that just look like they work.
- Robustness is Rare: models that perform well on standard benchmarks often fail with small variations.
- Specialization Isn't Always Better: fine-tuned code-specific models didn't dramatically outperform general models like Gemini.
The Bigger Picture: Building Systems That Actually Work
This project reinforced something I've learned throughout my research: the gap between "cool AI demo" and "reliable system" is enormous, and bridging it requires systematic thinking about evaluation, robustness, and failure modes.
[Connection: Test Generation → Reasoning Evaluation → AI Alignment]
Same principles: systematic evaluation, robustness testing, reliable performance
The techniques we developed here – particularly the comprehensive evaluation framework and LLM-based mutation testing – directly inform my current work on evaluating reasoning capabilities in language models. The fundamental challenge is the same: how do you measure whether an AI system actually works, rather than just appears to work?
And the lesson about robustness connects to broader questions in AI alignment. If we're building AI systems that need to work reliably in the real world, we need to understand how they fail and why. Test generation might seem like a narrow technical problem, but the principles apply much more broadly.
The goal isn't just to generate code that looks right – it's to build systems that are reliable, robust, and actually improve the software development process. That's much harder than the initial demo, but it's the only way to create AI tools that developers will actually trust and use.