The Hunt for Flaky Tests: When Code Analysis Meets Reality
Building tools that help ML teams ship more reliable software.
The Problem: When Your Tests Can't Make Up Their Mind
The Flaky Test Problem: Tests that pass or fail at random undermine CI/CD reliability and waste developer time investigating false failures.
Ever had a test that passes on your machine but fails in CI? Welcome to the world of flaky tests – the kind of bug that makes you question your sanity.
Why ML libraries are vulnerable:
- Random number generators with inconsistent seeding
- Floating-point precision edge cases
- Inherently probabilistic algorithms
- GPU vs CPU computation differences
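As a quick illustration of the first two bullets, here is a minimal snippet using nothing but the standard library; the values are not from any particular test suite:

import random

# Floating-point addition is not associative: the same sum evaluated in a
# different order produces a slightly different result.
print((0.1 + 0.2) + 0.3)  # 0.6000000000000001
print(0.1 + (0.2 + 0.3))  # 0.6

# Without a fixed seed, every run of the test process draws different values,
# so any threshold assertion built on top of them can flip between runs.
print(random.random())  # different on every run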
The real cost: When developers start seeing random failures, they begin ignoring test results altogether. That's how you end up shipping broken code.
The Challenge: Finding Needles in Code Haystacks
The naive approach would be to run every test in a library hundreds of times and see which ones fail inconsistently. But that's expensive and time-consuming. Instead, I wanted to build a smarter approach: identify tests that are likely to be flaky based on their code structure, then focus testing efforts on those.
The key insight is that flaky tests often use "approximate assertions" – statements that check if values are "close enough" rather than exactly equal. Things like assert accuracy > 0.9 or np.testing.assert_allclose(output, expected, rtol=1e-6).
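As a hypothetical example of the pattern, invented for illustration rather than taken from any real library:

import numpy as np

def test_classifier_accuracy():
    # Hypothetical sketch: unseeded random data stands in for a real model run.
    rng = np.random.default_rng()  # no fixed seed
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] > 0).astype(int)
    predictions = (X[:, 0] + rng.normal(scale=0.3, size=200) > 0).astype(int)

    accuracy = (predictions == y).mean()
    # Approximate assertion: the accuracy hovers around 0.9, so the test can
    # pass or fail depending on the draw, which is exactly the pattern the tool flags.
    assert accuracy > 0.9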
But here's where it gets complicated: you can't just search for keywords. You need to understand the actual structure of the code, handle different assertion patterns, and accurately map assertions back to their test functions and classes.
The Technical Solution: AST-Powered Code Analysis
I built a systematic approach using Python's Abstract Syntax Tree (AST) to analyze test code structure.
[Architecture Diagram: AST Parser → Pattern Detection → CSV Output]
🔍 How It Works (Simple Version)
- Parse Code Structure: break down Python files into logical components (classes, functions, statements)
- Identify Test Context: find test classes and functions, track their location in the code hierarchy
- Hunt for Assertions: look for approximate assertion patterns across different testing frameworks
- Generate Report: output a structured CSV with filepath, test context, and assertion details
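A simplified, self-contained sketch of that four-step flow. For brevity it records every assert statement inside a test_ function rather than only approximate ones, and the CSV columns are illustrative, not the tool's exact schema (Python 3.9+ for ast.unparse):

import ast
import csv
import sys

def scan_file(filepath):
    # Step 1: parse the file into an AST.
    with open(filepath, "r", encoding="utf-8") as handle:
        tree = ast.parse(handle.read(), filename=filepath)

    rows = []
    # Step 2: find test functions; Step 3: collect the assertions inside them.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            for child in ast.walk(node):
                if isinstance(child, ast.Assert):
                    rows.append({
                        "filepath": filepath,
                        "test_function": node.name,
                        "lineno": child.lineno,
                        "assertion": ast.unparse(child.test),
                    })
    return rows

if __name__ == "__main__":
    # Step 4: write the structured CSV report.
    with open("report.csv", "w", newline="") as out:
        writer = csv.DictWriter(
            out, fieldnames=["filepath", "test_function", "lineno", "assertion"]
        )
        writer.writeheader()
        for path in sys.argv[1:]:
            writer.writerows(scan_file(path))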
🎯 Assertion Types Detected
- Direct Comparisons: assert accuracy > 0.9, assert loss < threshold
- Unittest Methods: assertGreater(), assertLess()
- NumPy Testing: assert_allclose(), assert_array_almost_equal()
- PyTorch Testing: torch.testing.assert_close()
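To make those categories concrete, here is a hypothetical test file containing one of each (names, values, and tolerances are invented; it assumes numpy and torch are installed):

import unittest
import numpy as np
import torch

class TestModelOutputs(unittest.TestCase):
    def test_accuracy_threshold(self):
        accuracy = 0.93  # stand-in for a real evaluation result
        assert accuracy > 0.9  # direct comparison

    def test_loss_bound(self):
        loss, threshold = 0.18, 0.25
        self.assertLess(loss, threshold)  # unittest method

    def test_numpy_output(self):
        output = np.array([1.0000001, 2.0])
        np.testing.assert_allclose(output, np.array([1.0, 2.0]), rtol=1e-6)  # NumPy testing

    def test_torch_output(self):
        output = torch.tensor([1.0, 2.0]) + 1e-7
        torch.testing.assert_close(output, torch.tensor([1.0, 2.0]))  # PyTorch testing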
Real-World Testing: What Actually Happened
🔬 The Flaky Test Hunt: No Luck This Time
Interesting Finding: 100% Pass Rates
When I ran the identified tests 100 times each, I got 100% pass rates. No flaky behavior detected.
Why this matters:
- Gap between theory and practice: "Potentially flaky" ≠ "actually flaky"
- Environmental factors: Deterministic seeds, test isolation, controlled environments
- Evaluation challenge: Static analysis has limits – runtime behavior requires actual execution
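For reference, the rerun step needs nothing more than subprocess and pytest. This is a simplified sketch, not the exact harness I used, and the test ID is a placeholder:

import subprocess

def rerun_test(test_id: str, runs: int = 100) -> int:
    """Run one pytest test repeatedly and count how many runs fail."""
    failures = 0
    for _ in range(runs):
        # pytest exits with a nonzero return code when the test fails.
        result = subprocess.run(["pytest", "-q", test_id], capture_output=True)
        if result.returncode != 0:
            failures += 1
    return failures

if __name__ == "__main__":
    # Placeholder test ID; substitute a real path::test_name.
    failed = rerun_test("tests/test_model.py::test_accuracy")
    print(f"{failed}/100 runs failed")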
Beyond the Code: What This Really Taught Me
Building this tool was as much about understanding software engineering practices as it was about AST parsing. The most valuable insights came from the things that didn't work as expected:
Static Analysis Has Limits: You can identify potentially problematic code patterns, but you can't predict runtime behavior without actually running the code. The tool found assertions that could be flaky, but many weren't in practice.
Context Matters: Different ML libraries have different testing cultures. Some are incredibly disciplined about deterministic testing, while others embrace controlled randomness. The tool had to be flexible enough to handle both approaches.
Tooling vs. Process: The real value isn't just in finding flaky tests – it's in giving teams a systematic way to audit their testing practices and identify potential reliability issues before they become problems.
The Bigger Picture: Building Better ML Software
This project connects to something I care deeply about: how do we build AI systems that teams can actually trust and deploy with confidence? Flaky tests are a small but important piece of that puzzle.
When your test suite is unreliable, you lose confidence in your codebase. When you lose confidence in your codebase, you ship less frequently and with more anxiety. When you ship less frequently, you iterate slower and build worse products. It's a vicious cycle.
The tool I built is one small piece of breaking that cycle – giving teams a way to systematically identify and address potential reliability issues in their testing infrastructure. It's not glamorous work, but it's the kind of foundation that makes everything else possible.
Technical Implementation Details
For those interested in the technical specifics, here's how the core components work:
AST Visitor Pattern
import ast

class ApproximateAssertionVisitor(ast.NodeVisitor):
    def __init__(self):
        self.assertions = []
        self.current_class = None
        self.current_function = None

    def visit_ClassDef(self, node):
        # Track the enclosing test class so assertions can be mapped back to it
        old_class = self.current_class
        self.current_class = node.name
        self.generic_visit(node)
        self.current_class = old_class

    def visit_FunctionDef(self, node):
        # Track test functions (by convention, names starting with 'test_')
        if node.name.startswith('test_'):
            old_function = self.current_function
            self.current_function = node.name
            self.generic_visit(node)
            self.current_function = old_function
Assertion Pattern Detection
The tool identifies several categories of approximate assertions:
- Direct Comparisons: assert accuracy > 0.9, assert loss < threshold
- Unittest Methods: self.assertGreater(), self.assertLess()
- NumPy Testing: np.testing.assert_allclose(), assert_array_almost_equal()
- PyTorch Testing: torch.testing.assert_close()
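A standalone sketch of how those categories could be recognized on AST nodes; it is deliberately simplified, and the set of method names checked here is just the list above rather than the tool's full matching rules:

import ast

# Call names treated as approximate assertions (the categories listed above).
APPROX_CALL_NAMES = {
    "assertGreater", "assertLess",
    "assert_allclose", "assert_array_almost_equal",
    "assert_close",
}

def is_approximate_assertion(node: ast.AST) -> bool:
    """Return True if an AST node looks like an approximate assertion."""
    # Direct comparisons in a bare assert, e.g. `assert accuracy > 0.9`.
    if isinstance(node, ast.Assert) and isinstance(node.test, ast.Compare):
        return any(isinstance(op, (ast.Gt, ast.GtE, ast.Lt, ast.LtE))
                   for op in node.test.ops)
    # Calls such as self.assertLess(...) or np.testing.assert_allclose(...).
    if isinstance(node, ast.Call):
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        return name in APPROX_CALL_NAMES
    return False

if __name__ == "__main__":
    sample = "assert accuracy > 0.9\nnp.testing.assert_allclose(a, b, rtol=1e-6)\n"
    for node in ast.walk(ast.parse(sample)):
        if is_approximate_assertion(node):
            print(ast.unparse(node))  # Python 3.9+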