ATDD-Driven AI Development: How Prompting and Tests Steer the Code
Explore how ATDD and AI combine to create better software through executable specifications and test-driven development.

In the future, code will just be specifications, and tests are those specifications: our tests will be our code, and the actual implementation will be generated by AI.
Acceptance Test-Driven Development (ATDD) can steer AI to generate reliable code. To validate this, I built DoubleUp!, a savings-tracking app for kids, and used it as an ATDD-driven AI development testbed. In this post, I explain the app's architecture, the ATDD-AI workflow, and lessons learned, referencing Dave Farley and Craig Statham's testing practices for AI alongside my own experiences.
The DoubleUp! Project
DoubleUp! is a web application that helps children track their savings and request matching contributions from parents. Key features include:
- Savings Dashboard: Displays current balance from a connected Savings account
- Double Request: Children can request parents to match a chosen amount
- Parent Notifications: Email notifications for parent approval workflows
- IP-Based Access Control: Restricts access to approved home networks
- Automated CI/CD: Full BDD (Behavior-Driven Development) test suite running on GitHub Actions. (Note: This project uses BDD's Gherkin syntax within an ATDD workflow - the terms are used somewhat interchangeably, with ATDD emphasizing the test-driven process and BDD providing the specification format.)
Why ATDD for AI Development?
"What if the specifications/the acceptance tests were the program?" Source
When my intent isn't crystal clear (and even when it is), AI can hallucinate features, drift from specifications, or introduce unintended changes. ATDD can prevent this by making the expected behavior executable and verifiable before any implementation is accepted—by me or by the AI.
Recommended Resource: Acceptance Testing Is the FUTURE of Programming — A perspective on how the next evolution in programming might not be a new language, but acceptance tests as the specification. The developer's job becomes writing clear, detailed, executable examples and letting the AI generate implementations.
Technical Architecture
The DoubleUp! project follows a structured BDD approach using Behave, with a clear separation between specifications, implementation, and tests. The repository is organized with:
doubleup/
├── features/ # BDD feature files (Gherkin)
│ ├── US-101_doubleup_dashboard.feature
│ ├── US-102_request_double.feature
│ ├── US-103_ip_restriction.feature
│ ├── US-104_parent_notification.feature
│ ├── US-105_allowance_history.feature
│ ├── US-106_github_actions_test_workflow.feature
│ └── US-107_amplify_hosting.feature
├── features/steps/ # Step definitions
│ ├── US-101_dashboard_steps.py
│ └── ...
├── src/ # Implementation code
│ └── api.py
├── tests/ # Unit/integration tests
├── frontend/ # Static frontend assets
├── run.sh # Environment setup & test runner
├── requirements.txt # Python dependencies
└── .github/workflows/ # GitHub Actions CI/CD
Each feature is tied to a specific user story (US-XXX) and has corresponding step definitions. This structure ensures complete traceability from specifications to tests to implementation.
The ATDD-Driven AI Development Workflow
1. Specification-First Development
In this approach, I start by defining executable specifications in Gherkin format that describe the desired behavior from the user's perspective. For example:
Feature: View Savings Balance (US-101)
  As a child user
  I want to view my current savings balance
  So that I can track my progress

  Scenario: View balance via API
    Given I have a balance of $50.25 in my Savings account
    When I request my current balance
    Then I should see my current balance displayed as $50.25
2. AI as Implementation Partner
With the specification in place, I use AI to:
- Generate step definitions for the BDD scenarios
- Produce the minimal implementation code needed to pass each acceptance test
- Refactor while maintaining test coverage
For example, when implementing the balance display feature, the AI helped create:
- API endpoints in src/api.py
- Step definitions in features/steps/US-101_dashboard_steps.py
- Supporting test infrastructure
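The "supporting test infrastructure" for Behave typically includes a features/environment.py with lifecycle hooks. The following is a minimal, illustrative sketch (assuming per-scenario state reset; the project's actual hooks may differ):

# features/environment.py -- illustrative sketch; the real project's hooks may differ
def before_all(context):
    # Configuration shared by every step definition
    context.app_config = {"account_id": "12345"}

def before_scenario(context, scenario):
    # Start each scenario with a clean slate so state cannot leak between tests
    context.balance = None
    context.amplify_url = None

def after_scenario(context, scenario):
    # Placeholder for cleanup (closing clients, deleting temp files, etc.)
    pass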
3. Strict Traceability
A key aspect of this approach is maintaining traceability between specifications, tests, and implementation. The project includes:
- A comprehensive traceability matrix linking user stories to scenarios and code
- Clear naming conventions (e.g., a US-101_ prefix for related files)
- Commit history that follows the RED-GREEN-REFACTOR cycle
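As an illustration, the traceability matrix can also be kept as machine-readable data. The snippet below is a hypothetical sketch built from the file names in the project layout above; the actual matrix format in the repository may differ:

# traceability.py -- hypothetical, machine-readable sketch of the traceability matrix
TRACEABILITY = {
    "US-101": {
        "feature": "features/US-101_doubleup_dashboard.feature",
        "steps": "features/steps/US-101_dashboard_steps.py",
        "implementation": "src/api.py",
    },
    # ... one entry per user story (US-102 through US-107)
}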
4. My ATDD Workflow: AI-First, Test-Driven
Here's how I approach ATDD-driven AI development in the DoubleUp! project, drawing from established TDD principles but adapted for AI collaboration:
- Write the expected behavior first: I create a .feature file in features/ describing the desired behavior in Gherkin syntax (for example, US-107_amplify_hosting.feature for AWS Amplify hosting capabilities).
- Prompt the AI to generate step definitions: I instruct the AI to implement step definitions that connect the Gherkin specifications to actual code.
- Iteratively develop with AI: Using a strict RED-GREEN-REFACTOR cycle:
  - Write a failing test (RED)
  - Prompt the AI to implement just enough code to make it pass (GREEN)
  - Work with the AI to refactor as needed (REFACTOR)
- Run BDD tests locally: Using ./run.sh --bdd, I verify that new code passes the acceptance tests. The script handles:
  - Python 3.11 environment setup
  - Virtual environment configuration
  - Dependency installation
  - Behave test execution
- Maintain traceability: Each implementation file is linked to its corresponding feature file in the traceability matrix, ensuring complete coverage and accountability.
- Enforce CI/CD discipline: GitHub Actions (implemented via US-106) runs the full test suite on every push, maintaining quality standards.
Prompting is the new coding:
In agentic development, my main job is to write clear, structured, and precise prompts/tests—one at a time. The AI generates the code, but ATDD ensures it's always aligned with my intent. I treat each prompt as a "commit" in the development process, enforcing discipline and traceability.
How Gherkin Features Become Executable Specifications
A key strength of the DoubleUp! project's workflow is that Gherkin feature files are not just documentation—they are executable specifications. Here's how the process works with our AWS Amplify hosting feature (US-107):
From Feature to Executable Test
flowchart TD
A[Gherkin Feature File] --> B[Behave Step Definitions]
B --> C[Implementation Code]
B -.->|Assertions| D[Test Outcome]
C --> D
Legend:
- Gherkin Feature File = features/US-101_doubleup_dashboard.feature
- Behave Step Definitions = features/steps/US-101_dashboard_steps.py
- Implementation Code = src/api.py
- Test Outcome = Pass/Fail (reported by ./run.sh --bdd)
Gherkin scenarios are mapped to Python step definitions using Behave. Step definitions call implementation code and make assertions. The test runner reports outcomes, which guide further development.
1. Write Gherkin Scenarios
I describe the desired behavior in plain English using Gherkin syntax in a .feature file:
Scenario: Deploy site with HTTPS via GitHub integration
  Given I have created a simple HTML frontend for DoubleUp!
  When I create a GitHub repository for my code
  And I push my code to the repository
  And I connect the GitHub repository to AWS Amplify
  Then the build should complete successfully
  And I should see a public HTTPS URL for my website
2. Map Steps to Python Functions
For each Gherkin step (Given/When/Then), I write a corresponding Python function in a step definition file:
@when('I connect the GitHub repository to AWS Amplify')
def step_connect_github_to_amplify(context):
    # This is where the GitHub repository is connected to AWS Amplify
    # In a real implementation, this would call the AWS Amplify API
    # For tests, we simulate this behavior
    context.amplify_url = "https://main.d123456abcdef.amplifyapp.com"
    assert context.amplify_url, "Failed to connect GitHub repository to AWS Amplify"
3. Execution
When you run behave features/, Behave reads the .feature files and, for each step, finds and executes the matching Python function. If a step is missing a definition, Behave will fail and report it. The Python functions can call your implementation code and make assertions to verify behavior.
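For scripts and CI jobs, Behave's exit code is enough to tell whether the run passed. As an illustration (this is not the project's run.sh, which wraps similar steps in shell), the runner can be invoked from Python like this:

# run_bdd.py -- minimal sketch of invoking the acceptance tests programmatically
import subprocess
import sys

def run_acceptance_tests() -> int:
    # Behave exits with a non-zero status when any scenario fails or a step is undefined
    result = subprocess.run(["behave", "features/"])
    return result.returncode

if __name__ == "__main__":
    sys.exit(run_acceptance_tests())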
4. Result
If all steps in all scenarios pass, the feature is considered implemented and correct. If any step fails (e.g., an assertion fails), the scenario fails, signaling a gap between intent and implementation.
Acceptance Tests vs. Unit Tests: The DoubleUp! Approach
In the DoubleUp! project, I use both acceptance tests (via Behave) and unit tests (via pytest), because they serve distinct but complementary purposes:
Behave Acceptance Tests (features/, features/steps/)
- Purpose: Define high-level behaviors from the user's perspective
- How: Written in Gherkin syntax with step definitions in Python
- Scope: Validate that the system meets user-focused specifications
- Value: Ensures the functionality meets user needs and creates living documentation
Example from DoubleUp!:
Scenario: View current savings balance via API
  Given I have a balance of $50.25 in my Savings account
  When I request my current balance
  Then I should see my current balance displayed as $50.25
Pytest Unit Tests (tests/)
- Purpose: Verify the correctness of individual functions or components
- How: Written as Python test functions
- Scope: Test API functions, edge cases, and error handling
- Value: Provides fast feedback on code changes and ensures technical correctness
Example from DoubleUp!:
def test_get_balance_with_valid_credentials():
    # Test that the API returns the correct balance when given valid credentials
    balance = get_balance("12345")
    assert balance == 50.25
A Key DoubleUp! Principle: Strict Traceability
In accordance with our project rules, no implementation file is created in /src without a corresponding BDD feature file in /features. This ensures complete traceability between specifications, tests, and code—a critical aspect when partnering with AI for development.
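Because the rule is mechanical, it can also be enforced automatically. The guard script below is a hypothetical sketch that assumes the machine-readable TRACEABILITY matrix sketched earlier; the real project may enforce the rule differently (for example, by review or convention):

# check_traceability.py -- hypothetical guard script; assumes the TRACEABILITY
# dict sketched earlier rather than the project's actual tooling
import sys
from pathlib import Path

from traceability import TRACEABILITY  # hypothetical module shown above

def untraced_sources(src_dir: Path = Path("src")) -> list[str]:
    # Every implementation file must be claimed by at least one user story
    traced = {entry["implementation"] for entry in TRACEABILITY.values()}
    return [p.as_posix() for p in sorted(src_dir.glob("*.py")) if p.as_posix() not in traced]

if __name__ == "__main__":
    missing = untraced_sources()
    if missing:
        print("Implementation files without a corresponding feature:", ", ".join(missing))
        sys.exit(1)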
Why Use Both?
- Acceptance tests ensure the system delivers the correct behavior and user value.
- Unit tests ensure the underlying code is robust, correct, and maintainable.
- Together: They provide confidence at both the system and component level, making the codebase easier to evolve and refactor safely.
Example: DoubleUp! Dashboard Implementation
Here's a concrete example from the DoubleUp! project:
Gherkin feature (features/US-101_doubleup_dashboard.feature):
Feature: DoubleUp Savings Dashboard
  In order to track my savings and get motivated by potential matching
  As a child with a savings account
  I want to see my current balance and matching status

  Scenario: View current savings balance via API
    Given I have a balance of $50.25 in my Savings account
    When I request my current balance
    Then I should see my current balance displayed as $50.25

  Scenario: View matching eligibility via API
    Given I have a balance of $50.25 in my Savings account
    And parental matching is available up to $100.00
    When I request my matching eligibility
    Then I should see that I am eligible for a $50.25 match
Step definitions (features/steps/US-101_dashboard_steps.py):
from behave import given, when, then
import json

@given('I have a balance of ${amount} in my Savings account')
def step_given_savings_balance(context, amount):
    # Convert the string amount (without $) to a float for calculations
    amount_float = float(amount.replace('$', ''))
    context.balance = amount_float
    context.app_config = {"account_id": "12345"}

@when('I request my current balance')
def step_when_request_balance(context):
    # In a real implementation, this would call the Savings API
    # For tests, we use the previously set balance
    assert context.balance is not None, "Balance not set in previous step"

@then('I should see my current balance displayed as ${amount}')
def step_then_see_balance(context, amount):
    # Convert the expected amount to a float for comparison
    expected_amount = float(amount.replace('$', ''))
    # Compare with the actual balance
    assert context.balance == expected_amount, f"Expected ${expected_amount} but got ${context.balance}"
Implementation (src/api.py):
import os
import json
from typing import Dict, Any, Tuple

def get_balance(account_id: str) -> float:
    """Get the current balance for a given account"""
    # In a production implementation, this would call an external API
    # For demo purposes, we simulate the API response
    return 50.25

def get_matching_eligibility(account_id: str) -> Tuple[float, float]:
    """Get the matching eligibility for a given account"""
    # Return the eligible match amount and the maximum match amount
    balance = get_balance(account_id)
    max_match = 100.00
    eligible_match = min(balance, max_match)
    return (eligible_match, max_match)
Unit test (tests/test_api.py):
import pytest
from src.api import get_balance, get_matching_eligibility

def test_get_balance():
    # Test the balance retrieval function
    balance = get_balance("12345")
    assert isinstance(balance, float)
    assert balance == 50.25

def test_get_matching_eligibility():
    # Test the matching eligibility function
    eligible_match, max_match = get_matching_eligibility("12345")
    assert eligible_match == 50.25
    assert max_match == 100.00
CI Integration: GitHub Actions Workflow
I use .github/workflows/atdd-tests.yml to make sure all tests run on every push:
name: ATDD Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run pytest
        run: pytest tests/
      - name: Run behave
        run: behave features/
Diagram: CI/CD Workflow for ATDD
flowchart TD
E[Code Commit & Push] --> F[GitHub Actions Workflow]
F --> G[Install Dependencies]
G --> H[Run pytest Unit Tests]
G --> I[Run Behave Acceptance Tests]
H --> J{All Tests Pass?}
I --> J
J -- Yes --> K[Merge/Deploy]
J -- No --> L[Fail Build & Notify]
Key Learnings from the ATDD-AI Approach
Through developing DoubleUp! with an ATDD-AI approach, I've discovered several critical insights:
- Precision in Prompting is Crucial
  - Clear, specific prompts yield better AI-generated code
  - Including test failures in prompts helps the AI understand the problem better
- Incremental Development Works Best
  - Small, focused test cases are easier for the AI to implement correctly
  - The RED-GREEN-REFACTOR cycle keeps progress measurable
- Documentation as Primary Artifact
  - With AI handling much of the implementation, your tests and documentation become your most valuable contribution
  - Well-documented feature files serve as living documentation
  - Traceability ensures no specifications are lost in implementation
- Testing as a Shared Language
  - Tests become the contract between human intent and AI implementation
  - Both human and AI can reason about behavior through test cases
- Strong Traceability
  - Maintaining links between user stories, features, and implementation files is crucial when the AI is generating code
  - It ensures nothing is created without corresponding specifications
The Future of AI-Assisted Development
This project demonstrates how ATDD can evolve in an AI-assisted development workflow. As AI becomes more capable, the developer's role shifts from writing implementation code to:
- Defining clear, testable specifications
- Creating and maintaining the test suite
- Guiding the AI through the development process
- Making architectural decisions
By treating tests as executable specifications and AI as a collaborative partner, we can achieve higher quality software with less manual coding. This shift in thinking builds on established Test-Driven Development principles but elevates them to a new level when paired with modern AI capabilities.
Lessons from Agentic ATDD: VibeTDD and Beyond
Experiments like VibeTDD (SAS hackathon) show that strict, test-first prompting keeps AI agents on track:
- Write one test at a time—never generate multiple tests at once
- Number and structure steps for clarity and consistency
- Commit before each new test to enable rollback
- Refactor explicitly—the AI needs direct prompts to improve code
- Fail tests first to validate the test's effectiveness
- Use Gherkin-style names for clarity and continuity
- Save prompt history at each TDD cycle
- Adapt prompts to the environment (e.g., OS changes)
- Prompt clearly and concisely—precision matters
- Watch prompt/token limits—optimize for efficiency
Getting Started
To explore this approach in your own projects:
- Start with clear user stories and acceptance criteria
- Write Gherkin scenarios before implementation
- Use AI to generate step definitions and implementation
- Maintain strict traceability between specifications and code
- Iterate with small, testable increments
This approach is inspired by both my own experience and by experiments like SAS's VibeTDD hackathon, where engineers guided an AI agent to build a working basketball game using strict TDD—without writing any code themselves. Their lessons, along with the concept of ATDD as a "fifth-generation programming language" (5GL) mentioned by Farley, reinforce a key idea: in AI-driven development, prompting and test-writing might become the new coding.
Final Thought
ATDD in this experiment is not theoretical—it's enforced, automated, and traceable. Every behavior is specified, tested, and validated in both local and CI environments. This ensures clarity, quality, and confidence in every change.
The vision: "What if the specifications—the acceptance tests—were the program?" In this experiment, that's the reality. I specify the problem and let the AI solve it, with ATDD as the contract.
Note: Thanks to Jeff Schneider for bringing Eval Driven Design (EDD) to my attention. While not covered in this experiment, EDD presents an intriguing approach to validating problem-solution fit that I may explore in future work.