ATDD-Driven AI Development: How Prompting and Tests Steer the Code
Explore how ATDD and AI combine to create better software through executable specifications and test-driven development.

In the future, code will just be specifications, and tests are those specifications: our tests will be our code, and the actual implementation will be generated by AI.
Acceptance Test-Driven Development (ATDD) can steer AI to generate reliable code. To validate this, I built DoubleUp!, a savings-tracking app for kids, and used it as an ATDD-driven AI development testbed. In this post, I explain the app's architecture, the ATDD-AI workflow, and lessons learned, referencing Dave Farley and Craig Statham's testing practices for AI alongside my own experiences.
The DoubleUp! Project
DoubleUp! is a web application that helps children track their savings and request matching contributions from parents. Key features include:
- Savings Dashboard: Displays current balance from a connected Savings account
- Double Request: Children can request parents to match a chosen amount
- Parent Notifications: Email notifications for parent approval workflows
- IP-Based Access Control: Restricts access to approved home networks
- Automated CI/CD: Full BDD (Behavior-Driven Development) test suite running on GitHub Actions. (Note: This project uses BDD's Gherkin syntax within an ATDD workflow - the terms are used somewhat interchangeably, with ATDD emphasizing the test-driven process and BDD providing the specification format.)
Why ATDD for AI Development?
"What if the specifications/the acceptance tests were the program?" Source
When my intent isn't crystal clear (and even when it is), AI can hallucinate features, drift from specifications, or introduce unintended changes. ATDD can prevent this by making the expected behavior executable and verifiable before any implementation is accepted—by me or by the AI.
Recommended Resource: Acceptance Testing Is the FUTURE of Programming — A perspective on how the next evolution in programming might not be a new language, but acceptance tests as the specification. The developer's job becomes writing clear, detailed, executable examples and letting the AI generate implementations.
Technical Architecture
The DoubleUp! project follows a structured BDD approach using Behave, with a clear separation between specifications, implementation, and tests. The repository is organized with:
doubleup/
├── features/ # BDD feature files (Gherkin)
│ ├── US-101_doubleup_dashboard.feature
│ ├── US-102_request_double.feature
│ ├── US-103_ip_restriction.feature
│ ├── US-104_parent_notification.feature
│ ├── US-105_allowance_history.feature
│ ├── US-106_github_actions_test_workflow.feature
│ └── US-107_amplify_hosting.feature
├── features/steps/ # Step definitions
│ ├── US-101_dashboard_steps.py
│ └── ...
├── src/ # Implementation code
│ └── api.py
├── tests/ # Unit/integration tests
├── frontend/ # Static frontend assets
├── run.sh # Environment setup & test runner
├── requirements.txt # Python dependencies
└── .github/workflows/ # GitHub Actions CI/CD
Each feature is tied to a specific user story (US-XXX) and has corresponding step definitions. This structure ensures complete traceability from specifications to tests to implementation.
The ATDD-Driven AI Development Workflow
1. Specification-First Development
In this approach, I start by defining executable specifications in Gherkin format that describe the desired behavior from the user's perspective. For example:
Feature: View Savings Balance (US-101)
  As a child user
  I want to view my current savings balance
  So that I can track my progress

  Scenario: View balance via API
    Given I have a balance of $50.25 in my Savings account
    When I request my current balance
    Then I should see my current balance displayed as $50.25
2. AI as Implementation Partner
With the specification in place, I use AI to:
- Generate step definitions for the BDD scenarios
- Produce the minimal implementation code needed to pass each acceptance test
- Refactor while maintaining test coverage
For example, when implementing the balance display feature, the AI helped create:
- API endpoints in src/api.py
- Step definitions in features/steps/US-101_dashboard_steps.py
- Supporting test infrastructure
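The "supporting test infrastructure" for Behave typically includes a features/environment.py with lifecycle hooks. The following is a minimal, illustrative sketch (assuming per-scenario state reset; the project's actual hooks may differ):

# features/environment.py -- illustrative sketch; the real project's hooks may differ
def before_all(context):
    # Configuration shared by every step definition
    context.app_config = {"account_id": "12345"}

def before_scenario(context, scenario):
    # Start each scenario with a clean slate so state cannot leak between tests
    context.balance = None
    context.amplify_url = None

def after_scenario(context, scenario):
    # Placeholder for cleanup (closing clients, deleting temp files, etc.)
    pass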
3. Strict Traceability
A key aspect of this approach is maintaining traceability between specifications, tests, and implementation. The project includes:
- A comprehensive traceability matrix linking user stories to scenarios and code
- Clear naming conventions (e.g., a US-101_ prefix for related files)
- Commit history that follows the RED-GREEN-REFACTOR cycle
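As an illustration, the traceability matrix can also be kept as machine-readable data. The snippet below is a hypothetical sketch built from the file names in the project layout above; the actual matrix format in the repository may differ:

# traceability.py -- hypothetical, machine-readable sketch of the traceability matrix
TRACEABILITY = {
    "US-101": {
        "feature": "features/US-101_doubleup_dashboard.feature",
        "steps": "features/steps/US-101_dashboard_steps.py",
        "implementation": "src/api.py",
    },
    # ... one entry per user story (US-102 through US-107)
}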
4. My ATDD Workflow: AI-First, Test-Driven
Here's how I approach ATDD-driven AI development in the DoubleUp! project, drawing from established TDD principles but adapted for AI collaboration:
- Write the expected behavior first: I create a .feature file in features/ describing the desired behavior in Gherkin syntax (for example, US-107_amplify_hosting.feature for AWS Amplify hosting capabilities).
- Prompt the AI to generate step definitions: I instruct the AI to implement step definitions that connect the Gherkin specifications to actual code.
- Iteratively develop with AI: Using a strict RED-GREEN-REFACTOR cycle:
  - Write a failing test (RED)
  - Prompt the AI to implement just enough code to make it pass (GREEN)
  - Work with the AI to refactor as needed (REFACTOR)
- Run BDD tests locally: Using ./run.sh --bdd, I verify that new code passes the acceptance tests. The script handles:
  - Python 3.11 environment setup
  - Virtual environment configuration
  - Dependency installation
  - Behave test execution
- Maintain traceability: Each implementation file is linked to its corresponding feature file in the traceability matrix, ensuring complete coverage and accountability.
- Enforce CI/CD discipline: GitHub Actions (implemented via US-106) runs the full test suite on every push, maintaining quality standards.
Prompting is the new coding:
In agentic development, my main job is to write clear, structured, and precise prompts/tests—one at a time. The AI generates the code, but ATDD ensures it's always aligned with my intent. I treat each prompt as a "commit" in the development process, enforcing discipline and traceability.
How Gherkin Features Become Executable Specifications
A key strength of the DoubleUp! project's workflow is that Gherkin feature files are not just documentation—they are executable specifications. Here's how the process works with our AWS Amplify hosting feature (US-107):
From Feature to Executable Test
flowchart TD
A[Gherkin Feature File] --> B[Behave Step Definitions]
B --> C[Implementation Code]
B -.->|Assertions| D[Test Outcome]
C --> D
Legend:
- Gherkin Feature File = features/US-101_doubleup_dashboard.feature
- Behave Step Definitions = features/steps/US-101_dashboard_steps.py
- Implementation Code = src/api.py
- Test Outcome = Pass/Fail (reported by ./run.sh --bdd)
Gherkin scenarios are mapped to Python step definitions using Behave. Step definitions call implementation code and make assertions. The test runner reports outcomes, which guide further development.
1. Write Gherkin Scenarios
I describe the desired behavior in plain English using Gherkin syntax in a .feature file:
Scenario: Deploy site with HTTPS via GitHub integration
  Given I have created a simple HTML frontend for DoubleUp!
  When I create a GitHub repository for my code
  And I push my code to the repository
  And I connect the GitHub repository to AWS Amplify
  Then the build should complete successfully
  And I should see a public HTTPS URL for my website
2. Map Steps to Python Functions
For each Gherkin step (Given/When/Then), I write a corresponding Python function in a step definition file:
@when('I connect the GitHub repository to AWS Amplify')
def step_connect_github_to_amplify(context):
    # This is where the GitHub repository is connected to AWS Amplify
    # In a real implementation, this would call the AWS Amplify API
    # For tests, we simulate this behavior
    context.amplify_url = "https://main.d123456abcdef.amplifyapp.com"
    assert context.amplify_url, "Failed to connect GitHub repository to AWS Amplify"
3. Execution
When you run behave features/, Behave reads the .feature files and, for each step, finds and executes the matching Python function. If a step is missing a definition, Behave will fail and report it. The Python functions can call your implementation code and make assertions to verify behavior.
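For scripts and CI jobs, Behave's exit code is enough to tell whether the run passed. As an illustration (this is not the project's run.sh, which wraps similar steps in shell), the runner can be invoked from Python like this:

# run_bdd.py -- minimal sketch of invoking the acceptance tests programmatically
import subprocess
import sys

def run_acceptance_tests() -> int:
    # Behave exits with a non-zero status when any scenario fails or a step is undefined
    result = subprocess.run(["behave", "features/"])
    return result.returncode

if __name__ == "__main__":
    sys.exit(run_acceptance_tests())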
4. Result
If all steps in all scenarios pass, the feature is considered implemented and correct. If any step fails (e.g., an assertion fails), the scenario fails, signaling a gap between intent and implementation.
Acceptance Tests vs. Unit Tests: The DoubleUp! Approach
In the DoubleUp! project, I use both acceptance tests (via Behave) and unit tests (via pytest), because they serve distinct but complementary purposes:
Behave Acceptance Tests (features/, features/steps/)
- Purpose: Define high-level behaviors from the user's perspective
- How: Written in Gherkin syntax with step definitions in Python
- Scope: Validate that the system meets user-focused specifications
- Value: Ensures the functionality meets user needs and creates living documentation
Example from DoubleUp!:
Scenario: View current savings balance via API
  Given I have a balance of $50.25 in my Savings account
  When I request my current balance
  Then I should see my current balance displayed as $50.25
Pytest Unit Tests (tests/)
- Purpose: Verify the correctness of individual functions or components
- How: Written as Python test functions
- Scope: Test API functions, edge cases, and error handling
- Value: Provides fast feedback on code changes and ensures technical correctness
Example from DoubleUp!:
def test_get_balance_with_valid_credentials():
    # Test that the API returns the correct balance when given valid credentials
    balance = get_balance("12345")
    assert balance == 50.25
A Key DoubleUp! Principle: Strict Traceability
In accordance with our project rules, no implementation file is created in /src without a corresponding BDD feature file in /features. This ensures complete traceability between specifications, tests, and code—a critical aspect when partnering with AI for development.
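Because the rule is mechanical, it can also be enforced automatically. The guard script below is a hypothetical sketch that assumes the machine-readable TRACEABILITY matrix sketched earlier; the real project may enforce the rule differently (for example, by review or convention):

# check_traceability.py -- hypothetical guard script; assumes the TRACEABILITY
# dict sketched earlier rather than the project's actual tooling
import sys
from pathlib import Path

from traceability import TRACEABILITY  # hypothetical module shown above

def untraced_sources(src_dir: Path = Path("src")) -> list[str]:
    # Every implementation file must be claimed by at least one user story
    traced = {entry["implementation"] for entry in TRACEABILITY.values()}
    return [p.as_posix() for p in sorted(src_dir.glob("*.py")) if p.as_posix() not in traced]

if __name__ == "__main__":
    missing = untraced_sources()
    if missing:
        print("Implementation files without a corresponding feature:", ", ".join(missing))
        sys.exit(1)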
Why Use Both?
- Acceptance tests ensure the system delivers the correct behavior and user value.
- Unit tests ensure the underlying code is robust, correct, and maintainable.
- Together: They provide confidence at both the system and component level, making the codebase easier to evolve and refactor safely.
Example: DoubleUp! Dashboard Implementation
Here's a concrete example from the DoubleUp! project:
Gherkin feature (features/US-101_doubleup_dashboard.feature):
Feature: DoubleUp Savings Dashboard
  In order to track my savings and get motivated by potential matching
  As a child with a savings account
  I want to see my current balance and matching status

  Scenario: View current savings balance via API
    Given I have a balance of $50.25 in my Savings account
    When I request my current balance
    Then I should see my current balance displayed as $50.25

  Scenario: View matching eligibility via API
    Given I have a balance of $50.25 in my Savings account
    And parental matching is available up to $100.00
    When I request my matching eligibility
    Then I should see that I am eligible for a $50.25 match
Step definitions (features/steps/US-101_dashboard_steps.py):
from behave import given, when, then
import json

@given('I have a balance of ${amount} in my Savings account')
def step_given_savings_balance(context, amount):
    # Convert the string amount (without $) to a float for calculations
    amount_float = float(amount.replace('$', ''))
    context.balance = amount_float
    context.app_config = {"account_id": "12345"}

@when('I request my current balance')
def step_when_request_balance(context):
    # In a real implementation, this would call the Savings API
    # For tests, we use the previously set balance
    assert context.balance is not None, "Balance not set in previous step"

@then('I should see my current balance displayed as ${amount}')
def step_then_see_balance(context, amount):
    # Convert the expected amount to a float for comparison
    expected_amount = float(amount.replace('$', ''))
    # Compare with the actual balance
    assert context.balance == expected_amount, f"Expected ${expected_amount} but got ${context.balance}"
Implementation (src/api.py):
import os
import json
from typing import Dict, Any, Tuple

def get_balance(account_id: str) -> float:
    """Get the current balance for a given account"""
    # In a production implementation, this would call an external API
    # For demo purposes, we simulate the API response
    return 50.25

def get_matching_eligibility(account_id: str) -> Tuple[float, float]:
    """Get the matching eligibility for a given account"""
    # Return the eligible match amount and the maximum match amount
    balance = get_balance(account_id)
    max_match = 100.00
    eligible_match = min(balance, max_match)
    return (eligible_match, max_match)
Unit test (tests/test_api.py):
import pytest
from src.api import get_balance, get_matching_eligibility

def test_get_balance():
    # Test the balance retrieval function
    balance = get_balance("12345")
    assert isinstance(balance, float)
    assert balance == 50.25

def test_get_matching_eligibility():
    # Test the matching eligibility function
    eligible_match, max_match = get_matching_eligibility("12345")
    assert eligible_match == 50.25
    assert max_match == 100.00
CI Integration: GitHub Actions Workflow
I use .github/workflows/atdd-tests.yml to make sure all tests run on every push:
name: ATDD Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run pytest
        run: pytest tests/
      - name: Run behave
        run: behave features/
Diagram: CI/CD Workflow for ATDD
flowchart TD
E[Code Commit & Push] --> F[GitHub Actions Workflow]
F --> G[Install Dependencies]
G --> H[Run pytest Unit Tests]
G --> I[Run Behave Acceptance Tests]
H --> J{All Tests Pass?}
I --> J
J -- Yes --> K[Merge/Deploy]
J -- No --> L[Fail Build & Notify]
Key Learnings from the ATDD-AI Approach
Through developing DoubleUp! with an ATDD-AI approach, I've discovered several critical insights:
- Precision in Prompting is Crucial
  - Clear, specific prompts yield better AI-generated code
  - Including test failures in prompts helps the AI understand the problem better
- Incremental Development Works Best
  - Small, focused test cases are easier for the AI to implement correctly
  - The RED-GREEN-REFACTOR cycle keeps progress measurable
- Documentation as Primary Artifact
  - With AI handling much of the implementation, your tests and documentation become your most valuable contribution
  - Well-documented feature files serve as living documentation
  - Traceability ensures no specifications are lost in implementation
- Testing as a Shared Language
  - Tests become the contract between human intent and AI implementation
  - Both human and AI can reason about behavior through test cases
- Strong Traceability
  - Maintaining links between user stories, features, and implementation files is crucial when the AI is generating code
  - It ensures nothing is created without corresponding specifications
The Future of AI-Assisted Development
This project demonstrates how ATDD can evolve in an AI-assisted development workflow. As AI becomes more capable, the developer's role shifts from writing implementation code to:
- Defining clear, testable specifications
- Creating and maintaining the test suite
- Guiding the AI through the development process
- Making architectural decisions
By treating tests as executable specifications and AI as a collaborative partner, we can achieve higher quality software with less manual coding. This shift in thinking builds on established Test-Driven Development principles but elevates them to a new level when paired with modern AI capabilities.
Lessons from Agentic ATDD: VibeTDD and Beyond
Experiments like VibeTDD (SAS hackathon) show that strict, test-first prompting keeps AI agents on track:
- Write one test at a time—never generate multiple tests at once
- Number and structure steps for clarity and consistency
- Commit before each new test to enable rollback
- Refactor explicitly—the AI needs direct prompts to improve code
- Fail tests first to validate the test's effectiveness
- Use Gherkin-style names for clarity and continuity
- Save prompt history at each TDD cycle
- Adapt prompts to the environment (e.g., OS changes)
- Prompt clearly and concisely—precision matters
- Watch prompt/token limits—optimize for efficiency
Getting Started
To explore this approach in your own projects:
- Start with clear user stories and acceptance criteria
- Write Gherkin scenarios before implementation
- Use AI to generate step definitions and implementation
- Maintain strict traceability between specifications and code
- Iterate with small, testable increments
This approach is inspired by both my own experience and by experiments like SAS's VibeTDD hackathon, where engineers guided an AI agent to build a working basketball game using strict TDD—without writing any code themselves. Their lessons, along with the concept of ATDD as a "fifth-generation programming language" (5GL) mentioned by Farley, reinforce a key idea: in AI-driven development, prompting and test-writing might become the new coding.
Final Thought
ATDD in this experiment is not theoretical—it's enforced, automated, and traceable. Every behavior is specified, tested, and validated in both local and CI environments. This ensures clarity, quality, and confidence in every change.
The vision: "What if the specifications—the acceptance tests—were the program?" In this experiment, that's the reality. I specify the problem and let the AI solve it, with ATDD as the contract.
Note: Thanks to Jeff Schneider for bringing Eval Driven Design (EDD) to my attention. While not covered in this experiment, EDD presents an intriguing approach to validating problem-solution fit that I may explore in future work.