David Akinboro

Systematic Prompt Engineering for Code Generation

A methodical evaluation of prompt improvements on HumanEval using Gemini 1.5 Flash.

Fall 2024

The Challenge: Measuring Prompt Engineering Impact

Code generation models like Gemini can produce syntactically correct code that fails to solve the actual problem. While demos showcase impressive capabilities, systematic evaluation reveals the gap between "generates plausible code" and "generates correct code."

I used HumanEval—164 hand-crafted programming problems—to measure whether targeted prompt engineering could improve functional correctness. Each problem provides a function signature and docstring, requiring the model to generate working code that passes comprehensive test suites.
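For orientation, each task's prompt is a function header plus docstring that the model must complete. A quick look using the EvalPlus loader (field names beyond 'prompt' follow the standard HumanEval schema and are assumed here):

from evalplus.data import get_human_eval_plus

dataset = get_human_eval_plus()      # maps task_id -> problem dict
problem = dataset["HumanEval/0"]

# 'prompt' holds the signature and docstring the model sees;
# 'entry_point' names the function under test.
print(problem["prompt"])
print(problem["entry_point"])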

The key question: Can systematic prompt improvements measurably increase pass rates on a standard benchmark?

Baseline: Simple Prompt, Measured Results

I started with Gemini 1.5 Flash using a minimal system prompt:

Your role is to generate the function body based solely on the docstring

This basic approach established the baseline performance across all 164 HumanEval problems. Using greedy decoding (temperature=0) ensured reproducible results for accurate comparison.
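As a concrete sketch, the baseline setup might look like the following, assuming the google.generativeai client with the API key read from the environment (the exact wiring of the original run is not shown in this write-up):

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # assumed key location

BASELINE_SYSTEM = "Your role is to generate the function body based solely on the docstring"

# The system prompt is attached at model construction; greedy decoding is
# requested per call via GenerationConfig(temperature=0).
baseline_model = genai.GenerativeModel(
    "gemini-1.5-flash",
    system_instruction=BASELINE_SYSTEM,
)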

Baseline Performance

  • HumanEval Pass@1: 82.9%
  • HumanEval+ Pass@1: 75.6%

The baseline already performed well, making meaningful improvements challenging. Failed solutions typically showed algorithmic errors rather than syntax issues—problems that better prompting could potentially address.

Targeted Prompt Engineering

Instead of random prompt variations, I analyzed failure patterns and designed a more specific system prompt:

SYSTEM = '''
You are a Python coding assistant specializing in program synthesis.
Your task is to auto-complete Python functions using the information provided in the docstring.
The docstring describes the purpose of the function, the expected behavior, and examples of input-output pairs.
The generated function should fulfill the requirements specified in the docstring and pass the provided examples.
Ensure that the implementation is syntactically correct, logically aligned with the problem statement, and uses best practices for Python programming.
'''

This prompt addresses specific issues observed in failed solutions:

  • Explicit quality expectations: "syntactically correct, logically aligned with the problem statement"
  • Clear task definition: "auto-complete Python functions using the information provided in the docstring"
  • Test-aware mindset: "pass the provided examples"
  • Best practices emphasis: Encourages clean, maintainable code
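To compare prompts cleanly, only the system instruction changes between runs. A sketch of the improved setup, again assuming the google.generativeai client and the SYSTEM string above:

import google.generativeai as genai

# Same model and decoding settings as the baseline; only the system prompt differs.
improved_model = genai.GenerativeModel(
    "gemini-1.5-flash",
    system_instruction=SYSTEM,
)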

Measured Improvements

The systematic prompt engineering produced consistent improvements:

Improved Performance

  • HumanEval Pass@1: 84.1% (+1.2 percentage points)
  • HumanEval+ Pass@1: 78.0% (+2.4 percentage points)

While a 1.2-point gain may seem modest, it represents solving two additional problems out of 164. In code generation benchmarks, where state-of-the-art models cluster within a few percentage points, consistent improvements across problem types indicate systematic rather than random gains.

Case Study: Algorithmic Reasoning Improvement

HumanEval_10 (make_palindrome) illustrates the improvement pattern. This function should find the shortest palindrome that begins with the supplied string by identifying its longest palindromic suffix and appending the reverse of the prefix that precedes it.

Baseline failure: The generated code had incorrect palindrome-checking logic and failed on cases such as make_palindrome('cata'), which should return 'catac'.

Improved solution: The better prompt led to:

  • Correct implementation of the two-step palindrome algorithm
  • Helper function decomposition (is_palindrome)
  • Proper edge case handling for empty strings
  • Clear, maintainable code structure

This represents the core improvement: better reasoning about algorithmic requirements and code organization, not just different syntax.
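For reference, a minimal sketch of the two-step algorithm described above; this is illustrative rather than the model's verbatim output:

def is_palindrome(string: str) -> bool:
    return string == string[::-1]

def make_palindrome(string: str) -> str:
    """Shortest palindrome that begins with the supplied string."""
    if not string:
        return ''
    # Step 1: find the longest palindromic suffix.
    # Step 2: append the reverse of the prefix that precedes it.
    for i in range(len(string)):
        if is_palindrome(string[i:]):
            return string + string[:i][::-1]

# make_palindrome('cata') -> 'catac'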

Implementation and Evaluation

The evaluation used the EvalPlus framework, which extends HumanEval with additional test cases to catch edge case failures. The process:

  • Generate solutions: Process all 164 problems with both baseline and improved prompts
  • Functional testing: Execute generated code against comprehensive test suites
  • Performance measurement: Calculate pass@1 rates for both base and extended test sets
  • Failure analysis: Examine specific failures to understand improvement patterns

The systematic approach enabled controlled comparison between prompting strategies, avoiding the cherry-picking common in prompt engineering demonstrations.
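Because decoding is greedy, each problem gets exactly one sample, so pass@1 reduces to the fraction of problems whose single solution passes all tests. A small sketch (the results mapping here is hypothetical):

def pass_at_1(results: dict) -> float:
    """results maps task_id -> True if the generated solution passed all tests."""
    return sum(results.values()) / len(results)

# For example, 136 passing problems out of 164 gives 136/164 ≈ 0.829 (82.9%).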

Core Implementation

import os
import time

import google.generativeai as genai
from tqdm import tqdm


def program_synthesis_improved(input_prompt: str, model, **kwargs) -> str:
    # Greedy decoding (temperature=0) keeps generations reproducible.
    response = model.generate_content(
        input_prompt,
        generation_config=genai.GenerationConfig(temperature=0))

    # Strip Markdown code fences so only the raw Python source remains.
    return response.text.replace('```python', '').replace('```', '').strip()


def complete_improve_humaneval(model, dataset, workdir):
    for task_id, problem in tqdm(dataset.items()):
        name = task_id.replace("/", "_")
        prompt = problem['prompt']

        solution = program_synthesis_improved(prompt, model)
        os.makedirs(os.path.join(workdir, name), exist_ok=True)

        # EvalPlus expects one solution file per task directory.
        with open(os.path.join(workdir, name, '0.py'), 'w') as f:
            f.write(solution)

        time.sleep(2)  # Rate limiting between API calls
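Putting the pieces together, a usage sketch of the generation loop (the workdir name is illustrative, and the exact EvalPlus scoring invocation is left to its documentation):

from evalplus.data import get_human_eval_plus

dataset = get_human_eval_plus()
complete_improve_humaneval(improved_model, dataset, workdir="results/improved")

# Scoring then runs through EvalPlus's evaluator, which reports pass@1 on
# both the base HumanEval tests and the extended HumanEval+ tests.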

Key Insights

This work reinforced several principles for AI code generation:

Systematic beats intuitive: Measurable improvements came from analyzing failure patterns and designing targeted prompts, not creative guesswork.

Prompts as specifications: Effective prompts explicitly communicate quality expectations and task requirements rather than relying on model intuition.

Edge cases reveal gaps: The additional test cases in HumanEval+ exposed brittleness that base tests missed, highlighting the importance of comprehensive evaluation.

Small improvements matter: In mature benchmarks, consistent percentage point gains indicate meaningful progress in model reliability.

Technical Context

This work addresses the gap between impressive code generation demos and reliable production performance. While HumanEval tests relatively simple programming tasks, the systematic approach to prompt engineering applies to more complex scenarios.

The evaluation framework could extend to multi-turn interactions, test-driven development workflows, or integration with development tools—areas where reliable code generation becomes increasingly valuable for practical applications.

The methodology demonstrates how controlled experimentation can drive measurable improvements in AI code generation, moving beyond anecdotal evidence toward systematic optimization of model performance.

Built with: Python, Gemini 1.5 Flash, EvalPlus