The Hunt for Flaky Tests: When Code Analysis Meets Reality
Building tools that help ML teams ship more reliable software.
The Problem: When Your Tests Can't Make Up Their Mind
The Flaky Test Problem: Tests that pass or fail at random undermine CI/CD reliability and waste developer time investigating false failures.
Ever had a test that passes on your machine but fails in CI? Welcome to the world of flaky tests – the kind of bug that makes you question your sanity.
Why ML libraries are vulnerable:
- Random number generators with inconsistent seeding
- Floating-point precision edge cases
- Inherently probabilistic algorithms
- GPU vs CPU computation differences
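As a quick illustration of the first two bullets, here is a minimal snippet using nothing but the standard library; the values are not from any particular test suite:

import random

# Floating-point addition is not associative: the same sum evaluated in a
# different order produces a slightly different result.
print((0.1 + 0.2) + 0.3)  # 0.6000000000000001
print(0.1 + (0.2 + 0.3))  # 0.6

# Without a fixed seed, every run of the test process draws different values,
# so any threshold assertion built on top of them can flip between runs.
print(random.random())  # different on every run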
The real cost: When developers start seeing random failures, they begin ignoring test results altogether. That's how you end up shipping broken code.
The Challenge: Finding Needles in Code Haystacks
The naive approach would be to run every test in a library hundreds of times and see which ones fail inconsistently. But that's expensive and time-consuming. Instead, I wanted to build a smarter approach: identify tests that are likely to be flaky based on their code structure, then focus testing efforts on those.
The key insight is that flaky tests often use "approximate assertions" – statements that check if values are "close enough" rather than exactly equal. Things like assert accuracy > 0.9 or np.testing.assert_allclose(output, expected, rtol=1e-6).
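As a hypothetical example of the pattern, invented for illustration rather than taken from any real library:

import numpy as np

def test_classifier_accuracy():
    # Hypothetical sketch: unseeded random data stands in for a real model run.
    rng = np.random.default_rng()  # no fixed seed
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] > 0).astype(int)
    predictions = (X[:, 0] + rng.normal(scale=0.3, size=200) > 0).astype(int)

    accuracy = (predictions == y).mean()
    # Approximate assertion: the accuracy hovers around 0.9, so the test can
    # pass or fail depending on the draw, which is exactly the pattern the tool flags.
    assert accuracy > 0.9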
But here's where it gets complicated: you can't just search for keywords. You need to understand the actual structure of the code, handle different assertion patterns, and accurately map assertions back to their test functions and classes.
The Technical Solution: AST-Powered Code Analysis
I built a systematic approach using Python's Abstract Syntax Tree (AST) to analyze test code structure.
[Architecture Diagram: AST Parser → Pattern Detection → CSV Output]
🔍 How It Works (Simple Version)
- Parse Code Structure: break down Python files into logical components (classes, functions, statements)
- Identify Test Context: find test classes and functions, track their location in the code hierarchy
- Hunt for Assertions: look for approximate assertion patterns across different testing frameworks
- Generate Report: output a structured CSV with filepath, test context, and assertion details
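A simplified, self-contained sketch of that four-step flow. For brevity it records every assert statement inside a test_ function rather than only approximate ones, and the CSV columns are illustrative, not the tool's exact schema (Python 3.9+ for ast.unparse):

import ast
import csv
import sys

def scan_file(filepath):
    # Step 1: parse the file into an AST.
    with open(filepath, "r", encoding="utf-8") as handle:
        tree = ast.parse(handle.read(), filename=filepath)

    rows = []
    # Step 2: find test functions; Step 3: collect the assertions inside them.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            for child in ast.walk(node):
                if isinstance(child, ast.Assert):
                    rows.append({
                        "filepath": filepath,
                        "test_function": node.name,
                        "lineno": child.lineno,
                        "assertion": ast.unparse(child.test),
                    })
    return rows

if __name__ == "__main__":
    # Step 4: write the structured CSV report.
    with open("report.csv", "w", newline="") as out:
        writer = csv.DictWriter(
            out, fieldnames=["filepath", "test_function", "lineno", "assertion"]
        )
        writer.writeheader()
        for path in sys.argv[1:]:
            writer.writerows(scan_file(path))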
🎯 Assertion Types Detected
- Direct Comparisons: assert accuracy > 0.9, assert loss < threshold
- Unittest Methods: assertGreater(), assertLess()
- NumPy Testing: assert_allclose(), assert_array_almost_equal()
- PyTorch Testing: torch.testing.assert_close()
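To make those categories concrete, here is a hypothetical test file containing one of each (names, values, and tolerances are invented; it assumes numpy and torch are installed):

import unittest
import numpy as np
import torch

class TestModelOutputs(unittest.TestCase):
    def test_accuracy_threshold(self):
        accuracy = 0.93  # stand-in for a real evaluation result
        assert accuracy > 0.9  # direct comparison

    def test_loss_bound(self):
        loss, threshold = 0.18, 0.25
        self.assertLess(loss, threshold)  # unittest method

    def test_numpy_output(self):
        output = np.array([1.0000001, 2.0])
        np.testing.assert_allclose(output, np.array([1.0, 2.0]), rtol=1e-6)  # NumPy testing

    def test_torch_output(self):
        output = torch.tensor([1.0, 2.0]) + 1e-7
        torch.testing.assert_close(output, torch.tensor([1.0, 2.0]))  # PyTorch testing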
Real-World Testing: What Actually Happened
🔬 The Flaky Test Hunt: No Luck This Time
Interesting Finding: 100% Pass Rates
When I ran the identified tests 100 times each, I got 100% pass rates. No flaky behavior detected.
Why this matters:
- Gap between theory and practice: "Potentially flaky" ≠ "actually flaky"
- Environmental factors: Deterministic seeds, test isolation, controlled environments
- Evaluation challenge: Static analysis has limits – runtime behavior requires actual execution
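For reference, the rerun step needs nothing more than subprocess and pytest. This is a simplified sketch, not the exact harness I used, and the test ID is a placeholder:

import subprocess

def rerun_test(test_id: str, runs: int = 100) -> int:
    """Run one pytest test repeatedly and count how many runs fail."""
    failures = 0
    for _ in range(runs):
        # pytest exits with a nonzero return code when the test fails.
        result = subprocess.run(["pytest", "-q", test_id], capture_output=True)
        if result.returncode != 0:
            failures += 1
    return failures

if __name__ == "__main__":
    # Placeholder test ID; substitute a real path::test_name.
    failed = rerun_test("tests/test_model.py::test_accuracy")
    print(f"{failed}/100 runs failed")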
Beyond the Code: What This Really Taught Me
Building this tool was as much about understanding software engineering practices as it was about AST parsing. The most valuable insights came from the things that didn't work as expected:
Static Analysis Has Limits: You can identify potentially problematic code patterns, but you can't predict runtime behavior without actually running the code. The tool found assertions that could be flaky, but many weren't in practice.
Context Matters: Different ML libraries have different testing cultures. Some are incredibly disciplined about deterministic testing, while others embrace controlled randomness. The tool had to be flexible enough to handle both approaches.
Tooling vs. Process: The real value isn't just in finding flaky tests – it's in giving teams a systematic way to audit their testing practices and identify potential reliability issues before they become problems.
The Bigger Picture: Building Better ML Software
This project connects to something I care deeply about: how do we build AI systems that teams can actually trust and deploy with confidence? Flaky tests are a small but important piece of that puzzle.
When your test suite is unreliable, you lose confidence in your codebase. When you lose confidence in your codebase, you ship less frequently and with more anxiety. When you ship less frequently, you iterate slower and build worse products. It's a vicious cycle.
The tool I built is one small piece of breaking that cycle – giving teams a way to systematically identify and address potential reliability issues in their testing infrastructure. It's not glamorous work, but it's the kind of foundation that makes everything else possible.
Technical Implementation Details
For those interested in the technical specifics, here's how the core components work:
AST Visitor Pattern
import ast

class ApproximateAssertionVisitor(ast.NodeVisitor):
    def __init__(self):
        self.assertions = []
        self.current_class = None
        self.current_function = None

    def visit_ClassDef(self, node):
        # Track the enclosing test class so assertions can be mapped back to it
        old_class = self.current_class
        self.current_class = node.name
        self.generic_visit(node)
        self.current_class = old_class

    def visit_FunctionDef(self, node):
        # Track test functions (by convention, names starting with 'test_')
        if node.name.startswith('test_'):
            old_function = self.current_function
            self.current_function = node.name
            self.generic_visit(node)
            self.current_function = old_function
Assertion Pattern Detection
The tool identifies several categories of approximate assertions:
- Direct Comparisons: assert accuracy > 0.9, assert loss < threshold
- Unittest Methods: self.assertGreater(), self.assertLess()
- NumPy Testing: np.testing.assert_allclose(), assert_array_almost_equal()
- PyTorch Testing: torch.testing.assert_close()
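A standalone sketch of how those categories could be recognized on AST nodes; it is deliberately simplified, and the set of method names checked here is just the list above rather than the tool's full matching rules:

import ast

# Call names treated as approximate assertions (the categories listed above).
APPROX_CALL_NAMES = {
    "assertGreater", "assertLess",
    "assert_allclose", "assert_array_almost_equal",
    "assert_close",
}

def is_approximate_assertion(node: ast.AST) -> bool:
    """Return True if an AST node looks like an approximate assertion."""
    # Direct comparisons in a bare assert, e.g. `assert accuracy > 0.9`.
    if isinstance(node, ast.Assert) and isinstance(node.test, ast.Compare):
        return any(isinstance(op, (ast.Gt, ast.GtE, ast.Lt, ast.LtE))
                   for op in node.test.ops)
    # Calls such as self.assertLess(...) or np.testing.assert_allclose(...).
    if isinstance(node, ast.Call):
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        return name in APPROX_CALL_NAMES
    return False

if __name__ == "__main__":
    sample = "assert accuracy > 0.9\nnp.testing.assert_allclose(a, b, rtol=1e-6)\n"
    for node in ast.walk(ast.parse(sample)):
        if is_approximate_assertion(node):
            print(ast.unparse(node))  # Python 3.9+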