Code Quality Gates: Using Claude Code Hooks to Block Code Smells on Every Write
Hooks are the quality enforcement layer most people don't have. I built a PostToolUse hook system that detects complexity, deep nesting, long functions, and duplicate code, then blocks Claude from moving on until it fixes them.
After building 30+ repositories with AI coding tools (web apps, CLI tools, data pipelines, infrastructure automation), I've seen a recurring pattern: AI generates code that works but can quietly degrade your codebase over time. Functions balloon to 60 lines. Nesting goes four levels deep. Cyclomatic complexity creeps past 15. And you don't notice until every change takes three times longer than it should.
These are the same structural problems human developers create. AI just creates them faster than any review process can catch them. By the time a human reviewer flags a 40-line function, the AI has already built three more on top of it.
I built a set of Claude Code hooks that analyze every file Claude writes or edits and block it from proceeding when the code violates quality thresholds. Hooks are the quality enforcement layer most people don't have. They close the gap between "code written" and "code checked" within an AI coding session.
Claude fixes the violations immediately, explains what it changed, and moves on. In a recent 50-file session, the hook blocked 12 times in the first 20 writes and twice in the last 30. The violation rate dropped as Claude's context filled with examples of blocked-then-fixed code.
The Problem: AI-Generated Code Smells
In a single session, Claude can generate hundreds of lines across a dozen files. Most of it works. But structural problems accumulate:
- Long functions that do five things instead of one
- Deep nesting that makes control flow hard to follow
- High cyclomatic complexity from chains of if/else/elif
- Too many parameters that signal a function is doing too much
- Duplicate code blocks that should be extracted into shared helpers
- Oversized files that need to be split into focused modules
You can put rules in CLAUDE.md that say "keep functions under 20 lines." Claude will try to follow them, but compliance varies: the model holds the line most of the time, then drifts when it's focused on solving the immediate problem. What you need is enforcement.
If you want to see how many of these smells your existing codebase already has, the Tuning and Overrides section below includes a one-liner that scans your project.
The Solution: PostToolUse Hooks
Claude Code hooks let you intercept operations at specific lifecycle points. The one that matters here is PostToolUse. It fires after Claude writes or edits a file, giving you a chance to inspect the result before Claude moves on.
The hook reads the file Claude just modified, runs it through a set of code smell detectors, and returns one of two outcomes:
- No violations: Claude continues normally.
- Violations found: The hook returns a `block` decision with a detailed report of what's wrong and how to fix it. Claude must address the violations before proceeding.
Getting Started
Requires Python 3.8+ (for AST end_lineno support).
Setup takes about 30 minutes: copy the files, configure the hook, and verify it works. Threshold tuning is an ongoing process. Expect to adjust limits over the first few sessions as you find what fits.
1. Create the hook files. The system uses four Python files in ~/.claude/hooks/: an entry point (check-complexity.py), a thresholds module (smell_types.py), a Python AST analyzer (smell_python.py), and an orchestration module (smell_checks.py). The Architecture section below shows what each module does with code excerpts you can use as a starting point.
```bash
mkdir -p ~/.claude/hooks
```
2. Configure the hook. Add this to ~/.claude/settings.json. If the file doesn't exist, create it with this content. If you already have settings (e.g., permissions), add the hooks key alongside them. If you already have a PostToolUse array, append to it rather than replacing it:
```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "python3 $HOME/.claude/hooks/check-complexity.py"
          }
        ]
      }
    ]
  }
}
```
The matcher ensures the hook only fires on Write and Edit tool calls, not on file reads, searches, or bash commands.
3. Install Lizard for multi-language support (optional but recommended). Without Lizard, JavaScript, TypeScript, Go, Java, Rust, and C/C++ files pass through with no analysis and no warning. The system fails silently for unsupported languages:
```bash
python3 -m pip install lizard
```
4. Verify the hook is working. Ask Claude to write a deliberately long function (30+ lines) and confirm you see the violation report inline in the conversation. If the hook never fires, check that the file path in settings.json is correct and that python3 ~/.claude/hooks/check-complexity.py runs without errors when invoked manually:
```bash
echo '{"tool_input":{"file_path":"any_file.py"}}' | python3 ~/.claude/hooks/check-complexity.py
```
What Gets Checked
The system enforces six thresholds:
| Metric | Limit | Why It Matters |
|---|---|---|
| Cyclomatic complexity | 10 | Functions with more decision paths are harder to test and reason about |
| Function length | 20 lines | Long functions are doing too much; extract methods |
| Nesting depth | 3 levels | Deep nesting obscures control flow; use guard clauses |
| Parameters per function | 4 | Too many params signal the function needs decomposition |
| File length | 300 lines | Large files should be split into focused modules |
| Duplicate blocks | 4+ lines, 2+ occurrences | Repeated code belongs in a shared helper |
Not all checks apply to all languages:
- Python (`.py`): All six checks via AST analysis
- JS, TS, Java, Go, Rust, C/C++, C#: Three checks via Lizard: cyclomatic complexity, function length, and parameter count. Nesting depth requires language-specific AST parsing that Lizard doesn't provide.
- File length and duplicate detection: Apply to all supported file types
- Unsupported types (YAML, Markdown, Dockerfiles, shell scripts): Silently skipped
These thresholds are my opinionated starting points. McCabe's work defines the complexity metric and Fowler's catalog describes the refactoring techniques, but neither prescribes these specific numbers as blocking thresholds. I tuned them based on what produces readable, maintainable code in my projects. Your codebase will need different numbers.
The 4-parameter limit is the one most likely to cause friction in practice. Many standard APIs exceed it: subprocess.run, requests.get with auth/headers/params/timeout, ORM query builders. When you're wrapping an external interface that takes 6 parameters, forcing a parameter object can add complexity rather than reduce it. Adjust this one first if it creates friction.
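When the parameter-count check does fire legitimately, the fix the hook recommends is grouping related values into an options object. A minimal sketch of that refactoring, with hypothetical names (this is illustrative, not code from the hook system):

```python
from dataclasses import dataclass


@dataclass
class ShippingOptions:
    """Groups the pricing knobs that would otherwise be three separate params."""
    tax_rate: float
    shipping_rate: float
    free_shipping_threshold: float


def quote(price: float, taxable: bool, opts: ShippingOptions) -> float:
    # Three parameters instead of five: related config travels together.
    total = price * (1 + opts.tax_rate) if taxable else price
    if total <= opts.free_shipping_threshold:
        total += opts.shipping_rate
    return total
```

The dataclass also gives you one place to add a new pricing knob later without touching every call site.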
The Go community routinely writes longer functions. Java codebases commonly exceed 300 lines per file. Test files regularly violate multiple thresholds: long setup, many parameters, large files with dozens of test cases. The system checks test files the same as production code by default. If that produces too much friction, raise the thresholds or add your test directories to the skip list (see the Tuning and Overrides section below).
Architecture: Four Modules, Clear Responsibilities
The hook system is split into four Python files in ~/.claude/hooks/, each under 200 lines. The excerpts below explain what each module does and give you enough to build your own version.
1. check-complexity.py: The Entry Point
This is the hook Claude Code invokes. It reads the PostToolUse JSON event from stdin, extracts the file path, runs the checks, and emits a blocking result when violations are found.
```python
import json
import os
import sys

# Allow importing sibling modules from the same directory.
# Claude Code invokes the hook as a standalone script, not as part
# of a Python package, so the directory isn't automatically importable.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from smell_checks import check_file, format_violations


def main() -> None:
    try:
        event = json.load(sys.stdin)
    except (json.JSONDecodeError, EOFError):
        sys.exit(0)

    file_path = event.get("tool_input", {}).get("file_path", "")
    if not file_path or not os.path.isfile(file_path):
        sys.exit(0)

    smells = check_file(file_path)
    if smells:
        reason = format_violations(file_path, smells)
        print(json.dumps({"decision": "block", "reason": reason}))
    sys.exit(0)


if __name__ == "__main__":
    main()
```
The try/except around json.load is defensive. A broken hook that crashes fails open. Claude proceeds as if no violation was found. This is the right default, but it means a bug in the hook silently disables all quality gates. If you suspect the hook isn't firing, test it manually with the command in the Getting Started section above.
When the hook prints {"decision": "block"}, Claude Code stops and presents the reason to the model. Claude fixes the violations and tries again, triggering the hook on the next edit. If the fix still violates, the cycle repeats.
In practice, Claude usually resolves violations within one or two retries. There's no built-in retry limit. If Claude and the hook disagree indefinitely, the cycle continues. If you see more than 3-4 consecutive blocks on the same file and function, Claude is likely stuck. Raise the threshold or add the file's directory to SKIP_DIRS to break the loop. Each retry adds roughly one model turn of token cost.
2. smell_types.py: Thresholds and Shared Types
All thresholds live in one place. Every detector imports from here, so changing a limit is a single-line edit:
```python
MAX_COMPLEXITY = 10
MAX_FUNCTION_LINES = 20
MAX_NESTING_DEPTH = 3
MAX_PARAMETERS = 4
MAX_FILE_LINES = 300
DUPLICATE_MIN_LINES = 4
DUPLICATE_MIN_OCCURRENCES = 2
```
Each smell kind also has a prescribed fix:
```python
FIXES = {
    "complexity": "Use extract-method, early returns, guard clauses, or lookup tables.",
    "long_function": "Extract helper functions for distinct logical steps.",
    "deep_nesting": "Use guard clauses and early returns to flatten control flow.",
    "too_many_params": "Group related parameters into a dataclass or options object.",
    "duplicate_block": "Extract repeated code into a shared helper function.",
    "long_file": "Split into smaller modules with clear single responsibilities.",
}
```
These fix instructions matter. When Claude gets blocked, it sees the violation detail and the recommended fix. It gets a concrete starting point instead of guessing which refactoring technique to apply.
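For illustration, here's a minimal sketch of how a report formatter might assemble the block reason from these fix strings. The `(kind, name, line, detail)` tuple shape and the `format_violations` name are my assumptions about the implementation, not the exact code:

```python
# Hypothetical sketch. LABELS and the abridged FIXES mirror the
# structures described above; the smell tuple shape is assumed.
LABELS = {
    "complexity": "COMPLEXITY",
    "long_function": "LONG FUNCTION",
    "deep_nesting": "DEEP NESTING",
    "too_many_params": "TOO MANY PARAMS",
}

FIXES = {  # abridged
    "complexity": "Use extract-method, early returns, guard clauses, or lookup tables.",
    "long_function": "Extract helper functions for distinct logical steps.",
    "deep_nesting": "Use guard clauses and early returns to flatten control flow.",
    "too_many_params": "Group related parameters into a dataclass or options object.",
}


def format_violations(file_path, smells):
    lines = [f"CODE SMELL VIOLATIONS in {file_path}:"]
    for kind, name, line, detail in smells:
        lines.append(f"[{LABELS[kind]}] {name}() line {line}: {detail}")
    lines.append("Fix these code smells before moving on:")
    # One fix instruction per distinct smell kind, in first-seen order.
    for kind in dict.fromkeys(kind for kind, _, _, _ in smells):
        lines.append(f"- {FIXES[kind]}")
    lines.append("Then notify the user what you fixed and why.")
    return "\n".join(lines)
```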
3. smell_python.py: Python AST Analysis
For Python files, the system uses the ast module to parse the source and walk every function definition. No external dependencies required. It calculates:
- Cyclomatic complexity: Counts decision points (`if`, `for`, `while`, `try`, `except`, `assert`, plus each `and`/`or` in boolean expressions). Note: each boolean operator adds one to the count, so `if a and b and c:` scores as complexity 4, not 2. This is a deviation from standard McCabe; tools that implement the classic metric (such as flake8's mccabe checker) don't count boolean operators, so expect higher numbers from this hook than from those tools. I find the stricter counting catches functions that hide complexity in long conditionals, but adjust if it produces too many false positives.
- Function line span: `end_lineno - lineno + 1`
- Nesting depth: Recursive walk counting nested control structures
- Parameter count: All args minus `self`/`cls`
The AST gives you the real parse tree, no regex heuristics, no text-pattern matching. One caveat: the function length check uses line span, which includes docstrings, comments, and blank lines within the function body. A function with a 15-line docstring and 5 lines of logic counts as 20+ lines. This is a real trade-off. It penalizes well-documented functions. If your team writes detailed docstrings (as PEP 257 recommends), raise MAX_FUNCTION_LINES to 30 or 35 to compensate.
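To make the counting concrete, here's a compact sketch of the two trickiest walks using the standard `ast` module. The function names are mine, and the real module may structure this differently:

```python
import ast

# Node types treated as decision points, per the counting rules above.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.ExceptHandler, ast.Assert)
# Node types that count toward nesting depth (illustrative set).
NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try)


def complexity(func: ast.FunctionDef) -> int:
    score = 1  # base execution path
    for node in ast.walk(func):
        if isinstance(node, DECISION_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # n operands means n-1 operators, so "a and b and c" adds 2.
            score += len(node.values) - 1
    return score


def nesting_depth(node: ast.AST, depth: int = 0) -> int:
    # Recursively track the deepest chain of nested control structures.
    deepest = depth
    for child in ast.iter_child_nodes(node):
        next_depth = depth + 1 if isinstance(child, NESTING_NODES) else depth
        deepest = max(deepest, nesting_depth(child, next_depth))
    return deepest
```

With this counting, a function whose body is `if a and b and c:` scores 1 (base) + 1 (`if`) + 2 (boolean operators) = 4, matching the example above.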
4. smell_checks.py: Orchestration
This module ties everything together. It determines which checker to use based on file extension, runs file-level checks (length, duplicates), and delegates language-specific analysis:
- Python files (`.py`): AST-based analysis via `smell_python`
- JavaScript, TypeScript, Java, Go, Rust, C, C++, C# (`.js`, `.jsx`, `.ts`, `.tsx`, `.java`, `.go`, `.rs`, `.c`, `.cpp`, `.cs`): Lizard-based analysis for cyclomatic complexity, function length, and parameter count
The orchestration layer skips files in common non-source directories:
```python
SKIP_DIRS = frozenset((
    "node_modules", "__pycache__", ".git", "dist", "build", ".next",
))
```
If any path component matches a skip directory, the file is ignored entirely. Add your own directories here to exclude generated code, vendored dependencies, or test directories if the thresholds don't fit your test style.
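The skip logic itself is a one-liner. A sketch of the check, assuming the directory-component matching described above (`should_skip` is my name for it, and the set repeats the one above so the snippet is self-contained):

```python
from pathlib import Path

SKIP_DIRS = frozenset((
    "node_modules", "__pycache__", ".git", "dist", "build", ".next",
))


def should_skip(file_path: str) -> bool:
    # True if any directory component (not the filename) is in the skip list.
    return any(part in SKIP_DIRS for part in Path(file_path).parts[:-1])
```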
The hook checks the entire file after each write, not just the lines Claude changed. If a file already has violations before Claude touches it, the hook will block on those pre-existing issues. This is by design (it prevents Claude from making a problematic file worse), but it can be surprising.
The duplicate block detector compares sliding windows of DUPLICATE_MIN_LINES (default 4) consecutive non-trivial lines and flags blocks that appear multiple times. It skips import statements, trivial syntax lines ({, }, else:), and empty lines. It catches exact textual clones, not semantic duplicates where variable names differ. In the codebases I've worked on, exact clones are the most common form of copy-paste duplication, but this varies by codebase.
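A sketch of that sliding-window comparison, simplified relative to the real detector (the trivial-line filter here is reduced to a few illustrative cases):

```python
from collections import defaultdict

DUPLICATE_MIN_LINES = 4
DUPLICATE_MIN_OCCURRENCES = 2
TRIVIAL = frozenset(("{", "}", "else:", "pass"))  # illustrative subset


def find_duplicates(source: str) -> dict:
    # Keep (line number, stripped text) for non-trivial, non-import lines.
    lines = [
        (i + 1, stripped)
        for i, raw in enumerate(source.splitlines())
        if (stripped := raw.strip())
        and stripped not in TRIVIAL
        and not stripped.startswith(("import ", "from "))
    ]
    # Map each window of consecutive lines to the line numbers where it starts.
    windows = defaultdict(list)
    for start in range(len(lines) - DUPLICATE_MIN_LINES + 1):
        key = tuple(text for _, text in lines[start:start + DUPLICATE_MIN_LINES])
        windows[key].append(lines[start][0])
    return {k: v for k, v in windows.items() if len(v) >= DUPLICATE_MIN_OCCURRENCES}
```

Because the window keys are exact stripped text, renaming a single variable in one copy defeats the detector, which is the textual-clone limitation described above.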
What It Looks Like in Practice
Say Claude writes a function like this:
```python
def process_order(order, user, cart, config, logger):
    if order.status == "pending":
        if user.is_verified:
            if cart.items:
                total = 0
                for item in cart.items:
                    if item.is_taxable:
                        tax = item.price * config.tax_rate
                        total += item.price + tax
                    else:
                        total += item.price
                if total > config.free_shipping_threshold:
                    shipping = 0
                else:
                    shipping = config.shipping_rate
                total += shipping
                if user.has_discount:
                    total *= (1 - user.discount_rate)
                order.total = total
                order.status = "processed"
                logger.info(f"Order processed: {order.total}")
                return order
            else:
                raise ValueError("Empty cart")
        else:
            raise PermissionError("Unverified user")
    else:
        raise ValueError(f"Invalid status: {order.status}")
```
The hook fires and Claude sees this violation report inline in the conversation:
```text
CODE SMELL VIOLATIONS in src/processor.py:
[COMPLEXITY] process_order() line 45: complexity=14 (max 10)
[LONG FUNCTION] process_order() line 45: 29 lines (max 20)
[DEEP NESTING] process_order() line 45: nesting depth 5 (max 3)
[TOO MANY PARAMS] process_order() line 45: 5 params (max 4)

Fix these code smells before moving on:
- Use extract-method, early returns, guard clauses, or lookup tables.
- Extract helper functions for distinct logical steps.
- Use guard clauses and early returns to flatten control flow.
- Group related parameters into a dataclass or options object.

Then notify the user what you fixed and why.
```
Claude refactors using guard clauses to flatten nesting, extracts helpers for distinct steps, and moves the logger to module level to reduce the parameter count without losing the logging behavior. Module-level loggers are standard Python. Passing loggers as parameters works too if you prefer explicit dependency injection for testability.
```python
import logging

logger = logging.getLogger(__name__)


def process_order(order, user, cart, config):
    _validate_order(order, user, cart)
    total = _calculate_total(cart, config)
    total = _apply_discount(total, user)
    order.total = total
    order.status = "processed"
    logger.info("Order processed: %s", order.total)
    return order


def _validate_order(order, user, cart):
    if order.status != "pending":
        raise ValueError(f"Invalid status: {order.status}")
    if not user.is_verified:
        raise PermissionError("Unverified user")
    if not cart.items:
        raise ValueError("Empty cart")


def _calculate_total(cart, config):
    subtotal = sum(_item_cost(item, config) for item in cart.items)
    if subtotal > config.free_shipping_threshold:
        return subtotal
    return subtotal + config.shipping_rate


def _item_cost(item, config):
    if item.is_taxable:
        return item.price * (1 + config.tax_rate)
    return item.price


def _apply_discount(total, user):
    if user.has_discount:
        return total * (1 - user.discount_rate)
    return total
```
The hook fires again on the new version. This time, no violations. Every function is under 20 lines, nesting never exceeds 2 levels, complexity is well under 10, and the logging behavior is preserved. Claude continues.
Multi-Language Support
Python gets the deepest analysis because the ast module ships with every Python installation. The system also covers JavaScript, TypeScript, Java, Go, Rust, and C/C++ through Lizard, a language-agnostic complexity analyzer.
Lizard provides cyclomatic complexity, function length, and parameter count. It doesn't do nesting depth analysis (that requires language-specific AST parsing), but it catches the most common structural problems across those languages. Here's what a TypeScript violation looks like:
```text
CODE SMELL VIOLATIONS in src/handlers/user.ts:
[COMPLEXITY] processUserData() line 12: complexity=11 (max 10)
[LONG FUNCTION] processUserData() line 12: 34 lines (max 20)
[TOO MANY PARAMS] processUserData() line 12: 5 params (max 4)
```
Same format, same blocking behavior, same fix instructions. The refactoring follows the same extract-method and guard-clause patterns shown in the Python example above.
If Lizard isn't installed, those languages get no coverage at all: no warning, no error, just silence. Python AST analysis always works since it uses the standard library.
Why Hooks Complement Rules
I've tried three approaches to keeping AI-generated code clean:
- Rules in `CLAUDE.md`: Claude reads them and tries to follow them. Compliance is inconsistent. Guidance, not enforcement.
- CI linters and pre-commit hooks: These catch problems on commit or push. But within a Claude Code session, the model can write dozens of files before you commit. Violations accumulate between commits.
- PostToolUse hooks that block on every write: Claude can't proceed until the code is clean. Violations get fixed at the point of creation, before they compound.
You want all three layers. Rules in CLAUDE.md set proactive expectations. Hooks enforce the measurable thresholds. CI linters and code review catch the things hooks can't: architectural problems, business logic errors, security reasoning, and anything that requires understanding across files. Each layer covers what the others miss.
Code can pass every threshold here and still be poorly designed: wrong abstractions, leaky boundaries, security vulnerabilities. These hooks catch mechanical, measurable problems so that human review can focus on the harder ones.
One observation: the violation rate drops within a session. After being blocked a few times, Claude starts proactively writing shorter functions and using guard clauses. This is context accumulation, not learning. The model's context window fills with examples of blocked-then-fixed code, so it mimics those patterns within the session. The effect resets completely across sessions, so the first few files always trigger more blocks.
Performance overhead: Each hook invocation spawns a Python process that parses the file and runs the checks. Based on timing across a dozen sessions on an M-series MacBook, each invocation adds roughly 100-300ms per Write or Edit. For typical sessions that's negligible. For a session that touches 40+ files, expect an extra 15-30 seconds of total overhead.
Tuning and Overrides
The default thresholds work for the greenfield projects I've tested on. Your codebase will probably need adjustments, especially if you're working on an existing project. Tuning isn't a one-time task: expect to revisit thresholds over the first few weeks as you learn which limits fit and which create unnecessary friction.
A practical way to calibrate: run the checks against your existing codebase and see which files violate. This one-liner pipes each Python file through the hook and prints any violations (change the -name glob for other file types):
```bash
for f in $(find src/ -name '*.py'); do
  echo "{\"tool_input\":{\"file_path\":\"$(pwd)/$f\"}}" \
    | python3 ~/.claude/hooks/check-complexity.py
done
```
If more than a third of your files produce violations, start lenient and tighten over time. These lenient defaults roughly double the tolerance for each metric, which lets most existing codebases pass while still catching the worst offenders:
```python
# All thresholds live in smell_types.py. Change them once,
# every detector picks up the new values.
MAX_COMPLEXITY = 15        # Default: 10
MAX_FUNCTION_LINES = 30    # Default: 20
MAX_NESTING_DEPTH = 4      # Default: 3
MAX_PARAMETERS = 5         # Default: 4
MAX_FILE_LINES = 500       # Default: 300
DUPLICATE_MIN_LINES = 6    # Default: 4
```
Test Files
Test files routinely violate these thresholds. Test functions are often long (setup, action, assertion), test files are large (many test cases), and fixtures increase parameter counts. If the hook blocks too aggressively on tests, add your test directories to SKIP_DIRS in smell_checks.py:
```python
SKIP_DIRS = frozenset((
    "node_modules", "__pycache__", ".git", "dist", "build", ".next",
    "tests", "test", "__tests__",  # Add your test directories
))
```
Handling False Positives
Sometimes the hook blocks on code that's genuinely fine: a 22-line function that reads clearly, or a 5-parameter function that maps to an external API. The system doesn't have an inline suppression mechanism like `# noqa` or `// eslint-disable-next-line`. The current escape hatches are:
- Raise the specific threshold in `smell_types.py` that's producing the most friction. This is usually the right first step.
- Add directories to `SKIP_DIRS` if a single directory is the problem (generated code, vendored dependencies, test suites).
- Adjust per-language expectations: if you routinely write Go or Java, raise function length and file length limits.
Per-function or per-file suppression (like `# smell: ignore`) is a gap. For team-scale adoption, it's closer to essential than nice-to-have. It's at the top of my list for a future version.
When the hook and the task requirements genuinely conflict (a state machine with 12 legitimate states, a validation function that must check 5 independent conditions), the only options today are raising the global threshold or adding the directory to SKIP_DIRS. This is the bluntest part of the system.
Interaction with Other Hooks
If you already have other PostToolUse hooks configured (formatters, linters), they all run independently. Each hook in the array fires in the order listed. A block decision from any hook stops Claude before later hooks run. There's no conflict between hooks, but the overhead stacks: three hooks that each take 200ms add 600ms per write.
What's Next
What I want to build next:
- Inline suppression: `# smell: ignore` for functions that legitimately need to exceed a threshold. This is the most requested feature and likely the next addition.
- Trend tracking: I still can't tell whether a codebase is getting cleaner or dirtier session over session.
- Per-project overrides: Let individual repos set stricter or looser thresholds via a local config file
- Integration with centralized-rules: I maintain a separate framework that loads context-specific development rules into AI coding tools via hooks. Connecting the two would let code smell thresholds follow the same progressive disclosure pattern: base thresholds for all projects, language-specific overrides for Go or Java, and project-level tuning where needed, all managed from one place instead of editing `smell_types.py` per repo.
The same hook pattern applies to other automation points: running bandit on every Python file for security scanning, verifying GPG/SSH-signed commits, or rejecting files with invisible Unicode control characters (CVE-2021-42574). The pattern generalizes beyond code smells. Any deterministic check that can inspect a file and return pass/fail works as a PostToolUse hook.
If you've built something similar or tuned this pattern for a language or codebase I haven't covered, I'd like to hear about it.
The Takeaway
Claude Code hooks move quality enforcement from "after the session" to "on every write." The AI writes code, the hook checks it, and violations get fixed at the point of creation, before they compound into structural debt across dozens of files.
This doesn't replace CI linters, code review, or architectural oversight. It catches the mechanical, measurable problems so that human review can focus on the harder ones: wrong abstractions, missing error handling, security reasoning. Once configured, the hooks run automatically with no ongoing maintenance unless you change thresholds.
Start with the Getting Started steps above and the code excerpts in the Architecture section to build your own.