Large-scale code refactors expose the sharpest limitations of AI-assisted development. Developers chaining AI agents for refactors quickly discover that a single monolithic prompt collapses under the weight of real codebases, producing inconsistent naming, broken interfaces, and silently dropped requirements. The model handshake pattern offers a structured alternative: a multi-step AI coding pipeline where each agent stage operates under a strict contract, consuming validated input and producing schema-conformant output for the next stage.
This article lays out the full architecture, complete with handoff schemas, orchestration scripts, validation gates, and error recovery. You get a repeatable handoff workflow across refactors spanning tens of files and multiple module boundaries.
Why Single-Prompt Refactoring Breaks Down at Scale
The Context Window Ceiling
Even models with large context windows cannot reliably maintain coherence across a full codebase in a single pass. Context size alone does not solve the consistency problem. A project of several hundred files — easily 500k+ tokens once contents, dependency relationships, and transformation instructions are combined — pushes hard against those limits. But the problem is not purely about fitting tokens. As the context grows, models progressively lose track of decisions made earlier in the prompt. The result is naming collisions where two generated files define the same function differently, inconsistent API signatures across modules that were supposed to share a contract, and requirements stated at the top of the prompt that simply vanish from the output. The model does not signal that it has forgotten something. It confidently produces code that contradicts its own earlier output.
The Coherence Problem
When developers break a refactor into sequential prompts manually, each prompt produces locally reasonable code. Globally, the outputs diverge. Type mismatches appear across file boundaries, and dependency chains break because one prompt assumed a module export signature that a later prompt changed. Architectural patterns drift within the same refactor: one file uses a factory pattern while another uses direct instantiation, both generated minutes apart with no awareness of each other. This is not a model quality issue. It is a structural consequence of treating each prompt as an isolated task.
When “Just Use a Better Model” Stops Working
Upgrading to a larger or more capable model yields diminishing returns against monolithic prompts — beyond roughly 50k tokens of combined context, coherence drops regardless of model. A model with a bigger context window still degrades as that window fills. A model with stronger reasoning still cannot enforce consistency across outputs it generates independently. The real gains come from decomposition and specialization: breaking the refactor into stages where each agent has a narrow, well-defined job, and enforcing consistency through structured contracts between stages rather than relying on the model’s ability to hold everything in memory.
The Model Handshake Pattern: Architecture Overview
What Is a Model Handshake?
A model handshake is a structured contract between two AI agent stages where the output schema of Agent N is the validated input schema of Agent N+1. It functions like the interface between compiler passes: parsing produces an AST that analysis consumes, analysis produces annotations that optimization consumes, and so on. Each pass has a defined input format, a defined output format, and no responsibility for the concerns of other passes. The handshake enforces this discipline at the AI agent boundary. If an agent’s output does not conform to the expected schema, the pipeline halts rather than propagating malformed data downstream.
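In code, the boundary discipline can be expressed as a small gate function. Here is a minimal sketch using the jsonschema package; the function name and error handling are illustrative, not part of any framework:

from jsonschema import validate, ValidationError

def handshake(payload: dict, schema: dict, stage_name: str) -> dict:
    """Gate at an agent boundary: the payload moves downstream only if it
    conforms to the agreed schema."""
    try:
        validate(instance=payload, schema=schema)
    except ValidationError as e:
        # Halt rather than let malformed data propagate downstream.
        raise RuntimeError(f"Handshake after {stage_name} failed: {e.message}") from e
    return payload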
Pipeline Stages for a Complex Refactor
A typical refactor pipeline has four stages; a minimal chaining sketch follows the list:
Analyst — Ingests codebase context (file listings, dependency graphs, framework metadata) and produces a structured refactor plan. This plan defines the scope, enumerates affected files, maps the dependency graph, and captures constraints like “preserve all public API signatures.”
Architect — Consumes the refactor plan and produces file-by-file transformation specifications. Each spec includes the target file, the transformation to apply, and the interface contracts that must hold between the transformed file and its dependents.
Implementer — Executes transformations one file or module at a time, following the Architect’s specs. It operates in dependency order so that foundational modules transform before the files that import them.
Reviewer — Validates the Implementer’s output against the original plan. It flags regressions, interface mismatches, missing transformations, and inconsistencies, then re-runs the Implementer on the failing files with error context.
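With these stages in place, the whole pipeline reduces to a loop over stage functions and their output schemas. A minimal chaining sketch, reusing the handshake gate above and assuming each stage is a callable from one validated payload to the next:

def run_stages(initial_payload: dict, stages: list) -> dict:
    """`stages` is a list of (name, stage_fn, output_schema) tuples in
    pipeline order: Analyst, Architect, Implementer, Reviewer."""
    payload = initial_payload
    for name, stage_fn, output_schema in stages:
        # Validate each stage's output before the next stage consumes it.
        payload = handshake(stage_fn(payload), output_schema, stage_name=name)
    return payload

This is a simplification: in the runner later in this article, the Implementer iterates over files in dependency order rather than returning a single payload, but the gate-between-stages shape is the same.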
Building the Handoff Contract
Designing the Structured Payload
Free-text output from one agent cannot be reliably parsed by the next. JSON schemas serve as the inter-agent communication protocol because they are machine-validatable, self-documenting, and force the model to organize its output into discrete, addressable fields. The critical fields for a refactor handoff are listed below; an example payload and the full schema follow the list:
scope — what is changing and why
affected_files — exhaustive list of files in play
interface_contracts — function signatures, types, and exports that must remain consistent
constraints — rules the refactor must not violate
prior_decisions — choices made by upstream stages that downstream stages must respect
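For concreteness, a hypothetical payload for a small Express-to-Fastify migration might look like this (file paths, names, and decisions are illustrative only):

{
  "scope": {
    "description": "Migrate the users API from Express to Fastify",
    "source_framework": "express",
    "target_framework": "fastify"
  },
  "affected_files": [
    { "path": "src/middleware/auth.js", "role": "middleware", "priority": 1 },
    { "path": "src/routes/users.js", "role": "route_handler", "priority": 2 }
  ],
  "dependency_graph": {
    "src/routes/users.js": ["src/middleware/auth.js"]
  },
  "interface_contracts": [
    {
      "name": "requireAuth",
      "file": "src/middleware/auth.js",
      "signature": "requireAuth(request, reply, done)",
      "used_by": ["src/routes/users.js"]
    }
  ],
  "constraints": ["preserve all public API signatures"],
  "prior_decisions": [
    {
      "decision": "Use Fastify hooks instead of wrapped Express middleware",
      "rationale": "Keeps the plugin model idiomatic"
    }
  ]
}

The full JSON Schema for the Analyst-to-Architect handoff formalizes these fields: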
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "AnalystToArchitectHandoff",
  "type": "object",
  "required": ["scope", "affected_files", "dependency_graph", "interface_contracts", "constraints"],
  "properties": {
    "scope": {
      "type": "object",
      "properties": {
        "description": { "type": "string" },
        "source_framework": { "type": "string" },
        "target_framework": { "type": "string" }
      },
      "required": ["description", "source_framework", "target_framework"]
    },
    "affected_files": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "path": { "type": "string" },
          "role": { "type": "string", "enum": ["route_handler", "middleware", "utility", "config", "test"] },
          "priority": { "type": "integer", "minimum": 1 }
        },
        "required": ["path", "role", "priority"]
      }
    },
    "dependency_graph": {
      "type": "object",
      "additionalProperties": {
        "type": "array",
        "items": { "type": "string" }
      }
    },
    "interface_contracts": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "file": { "type": "string" },
          "signature": { "type": "string" },
          "used_by": { "type": "array", "items": { "type": "string" } }
        },
        "required": ["name", "file", "signature", "used_by"]
      }
    },
    "constraints": {
      "type": "array",
      "items": { "type": "string" }
    },
    "prior_decisions": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "decision": { "type": "string" },
          "rationale": { "type": "string" }
        }
      }
    }
  }
}
Prompt Templates That Enforce the Contract
System prompts must explicitly instruct each agent to consume the upstream schema and produce output conforming to the downstream schema. Injecting the prior stage’s output as a structured JSON block, rather than as conversational history, prevents the model from treating it as a suggestion rather than a specification.
The Analyst-to-Architect prompt template:
You are a senior software architect. You will receive a structured refactor
plan in JSON format under the key REFACTOR_PLAN. Your task is to produce
file-level transformation specifications for every file listed in
REFACTOR_PLAN.affected_files.

REFACTOR_PLAN:
{analyst_output_json}

For each file, produce a JSON object with these exact fields:
- "path": the file path (must match an entry in affected_files)
- "transformation": a precise description of what changes
- "new_imports": array of new import statements required
- "removed_imports": array of imports to remove
- "interface_changes": object mapping old signatures to new signatures
- "dependencies": array of file paths this transformation depends on

Output a JSON object with a single key "file_specs" containing an array
of these objects. Do not include any text outside the JSON block.
Do not omit any file from affected_files. If a file requires no changes,
include it with transformation set to "no_change".
The Architect-to-Implementer prompt template follows the same pattern, instructing the Implementer to consume file_specs and produce transformed source code for one file at a time, referencing the interface contracts to maintain cross-file consistency.
You are a senior software engineer executing a framework migration. You will
receive a single file transformation spec in JSON format under FILE_SPEC,
and the original source code under ORIGINAL_SOURCE. You will also receive
INTERFACE_CONTRACTS listing all cross-file signatures that must be preserved.
FILE_SPEC:
{file_spec_json}
ORIGINAL_SOURCE:
{original_file_content}
INTERFACE_CONTRACTS:
{interface_contracts_json}
Produce the complete transformed source code for this file. Ensure all
function signatures match INTERFACE_CONTRACTS exactly. Do not modify
any export names unless FILE_SPEC.interface_changes explicitly maps them.
Output only the source code with no additional commentary.
Orchestrating the Pipeline
Choosing Your Orchestration Layer
The orchestration layer coordinates calls to different model APIs, passes handoff payloads between stages, and enforces validation gates. Options range from a plain Python script, to frameworks like LangChain or LangGraph, to IDE-integrated rules files in tools like Cursor or Windsurf. The trade-offs are real: LangChain provides built-in chain abstractions and observability hooks but adds dependency weight and can obscure the actual API calls. For example, debugging a 429 retry becomes harder when LangChain’s chain abstraction swallows the raw request. A plain Python script offers full transparency into what gets sent where, at the cost of implementing retry logic and logging from scratch. Starting with a Python script is advisable for understanding the mechanics before introducing framework abstractions.
Implementing the Pipeline Runner
This script requires the following dependencies. Python 3.10+ is recommended.
pip install "anthropic>=0.30" "openai>=1.0" "jsonschema>=4.0" "pydantic>=2.0"
Set your API keys as environment variables before running:
export ANTHROPIC_API_KEY="your-key-here"
export OPENAI_API_KEY="your-key-here"
The pipeline also requires two JSON Schema files—schemas/analyst_output.json and schemas/architect_output.json—that define the expected output structure for each stage. Create these files from the AnalystToArchitectHandoff schema and the Architect’s file_specs schema defined above.
import json
import os
import re
from collections import deque
from typing import List

import anthropic as _anthropic
import openai
import pydantic
from jsonschema import validate
from jsonschema import ValidationError as JsonSchemaValidationError
from pydantic import BaseModel

CLAUDE_MODEL = "claude-3-5-sonnet-20241022"
GPT_MODEL = "gpt-4o"
PROJECT_ROOT = os.path.realpath(os.getcwd())

def safe_open_project_file(path: str, mode: str = "r", encoding: str = "utf-8"):
    """Open a file only if it resolves within PROJECT_ROOT."""
    resolved = os.path.realpath(os.path.join(PROJECT_ROOT, path))
    if not resolved.startswith(PROJECT_ROOT + os.sep):
        raise ValueError(f"Path escape detected: {path!r} resolves outside project root")
    return open(resolved, mode, encoding=encoding)

claude = _anthropic.Anthropic()
gpt = openai.OpenAI()

with open("schemas/analyst_output.json", encoding="utf-8") as f:
    ANALYST_SCHEMA = json.load(f)
with open("schemas/architect_output.json", encoding="utf-8") as f:
    ARCHITECT_SCHEMA = json.load(f)

ARCHITECT_TEMPLATE = """\
You are a senior software architect. You will receive a structured refactor
plan in JSON format under the key REFACTOR_PLAN. Your task is to produce
file-level transformation specifications for every file listed in
REFACTOR_PLAN.affected_files.

REFACTOR_PLAN:
{analyst_output_json}

For each file, produce a JSON object with these exact fields:
- "path": the file path (must match an entry in affected_files)
- "transformation": a precise description of what changes
- "new_imports": array of new import statements required
- "removed_imports": array of imports to remove
- "interface_changes": object mapping old signatures to new signatures
- "dependencies": array of file paths this transformation depends on

Output a JSON object with a single key "file_specs" containing an array
of these objects. Do not include any text outside the JSON block.
Do not omit any file from affected_files. If a file requires no changes,
include it with transformation set to "no_change"."""

IMPLEMENTER_TEMPLATE = """\
You are a senior software engineer executing a framework migration. You will
receive a single file transformation spec in JSON format under FILE_SPEC,
and the original source code under ORIGINAL_SOURCE. You will also receive
INTERFACE_CONTRACTS listing all cross-file signatures that must be preserved.

FILE_SPEC:
{file_spec_json}

ORIGINAL_SOURCE:
{original_file_content}

INTERFACE_CONTRACTS:
{interface_contracts_json}

Produce the complete transformed source code for this file. Ensure all
function signatures match INTERFACE_CONTRACTS exactly. Do not modify
any export names unless FILE_SPEC.interface_changes explicitly maps them.
Output only the source code with no additional commentary."""

class FileSpec(BaseModel):
    path: str
    transformation: str
    new_imports: List[str] = []
    removed_imports: List[str] = []
    interface_changes: dict = {}
    dependencies: List[str] = []

class ArchitectOutput(BaseModel):
    file_specs: List[FileSpec]

    def validate_consistency(self, analyst_output: dict) -> List[str]:
        errors = []
        analyst_files = {f["path"] for f in analyst_output["affected_files"]}
        architect_files = {s.path for s in self.file_specs}
        missing = analyst_files - architect_files
        if missing:
            errors.append(f"Architect missed files: {missing}")
        contract_names = {c["name"] for c in analyst_output.get("interface_contracts", [])}
        referenced = set()
        for spec in self.file_specs:
            referenced.update(spec.interface_changes.keys())
        unreferenced = contract_names - referenced
        if unreferenced:
            errors.append(f"Unreferenced contracts: {unreferenced}")
        return errors

def _extract_claude_text(response) -> str:
    """Extract text from a Claude response, raising clearly on empty content."""
    if not response.content:
        raise RuntimeError(
            f"Claude returned empty content list. "
            f"Stop reason: {response.stop_reason!r}"
        )
    return response.content[0].text

def _strip_json_fences(text: str) -> str:
    """Remove markdown code fences that models sometimes wrap around JSON."""
    match = re.search(r"```(?:json)?\s*(\{.*?\}|\[.*?\])\s*```", text, re.DOTALL)
    return match.group(1) if match else text

def call_analyst(codebase_context: str) -> dict:
    try:
        response = claude.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=8192,
            temperature=0.2,
            system=(
                "You are a codebase analyst. Analyze the provided codebase "
                "and produce a refactor plan as JSON matching the "
                "AnalystToArchitectHandoff schema exactly."
            ),
            messages=[{"role": "user", "content": codebase_context}],
        )
    except _anthropic.RateLimitError as e:
        raise RuntimeError(f"Analyst rate-limited: {e}") from e
    except _anthropic.APIError as e:
        raise RuntimeError(f"Analyst API error: {e}") from e
    text = _extract_claude_text(response)
    try:
        result = json.loads(_strip_json_fences(text))
    except json.JSONDecodeError as e:
        raise json.JSONDecodeError(
            f"Analyst returned non-JSON. Raw: {text[:200]!r}", e.doc, e.pos
        ) from e
    validate(instance=result, schema=ANALYST_SCHEMA)
    return result

def call_architect(analyst_output: dict) -> dict:
    prompt = ARCHITECT_TEMPLATE.format(
        analyst_output_json=json.dumps(analyst_output, indent=2)
    )
    try:
        response = gpt.chat.completions.create(
            model=GPT_MODEL,
            messages=[
                {"role": "system", "content": "You are a senior software architect."},
                {"role": "user", "content": prompt},
            ],
            temperature=0.2,
            response_format={"type": "json_object"},
        )
    except openai.RateLimitError as e:
        raise RuntimeError(f"Architect rate-limited: {e}") from e
    except openai.APIError as e:
        raise RuntimeError(f"Architect API error: {e}") from e
    raw = response.choices[0].message.content
    result = json.loads(_strip_json_fences(raw))
    validate(instance=result, schema=ARCHITECT_SCHEMA)
    return result

def call_implementer(file_spec: dict, source: str, contracts: list) -> str:
    prompt = IMPLEMENTER_TEMPLATE.format(
        file_spec_json=json.dumps(file_spec, indent=2),
        original_file_content=source,
        interface_contracts_json=json.dumps(contracts, indent=2),
    )
    try:
        response = claude.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=8192,
            temperature=0.2,
            messages=[{"role": "user", "content": prompt}],
        )
    except _anthropic.RateLimitError as e:
        raise RuntimeError(f"Implementer rate-limited: {e}") from e
    except _anthropic.APIError as e:
        raise RuntimeError(f"Implementer API error: {e}") from e
    return _extract_claude_text(response)

def topological_sort(file_specs: list) -> list:
    """Kahn's algorithm — raises ValueError on cycles."""
    path_to_spec = {}
    for s in file_specs:
        if s["path"] in path_to_spec:
            raise ValueError(f"Duplicate path in file_specs: {s['path']!r}")
        path_to_spec[s["path"]] = s
    in_degree = {p: 0 for p in path_to_spec}
    dependents: dict[str, list[str]] = {p: [] for p in path_to_spec}
    for s in file_specs:
        for dep in s.get("dependencies", []):
            if dep in path_to_spec:
                in_degree[s["path"]] += 1
                dependents[dep].append(s["path"])
    queue = deque(p for p, d in in_degree.items() if d == 0)
    order = []
    while queue:
        p = queue.popleft()
        order.append(path_to_spec[p])
        for dependent in dependents[p]:
            in_degree[dependent] -= 1
            if in_degree[dependent] == 0:
                queue.append(dependent)
    if len(order) != len(file_specs):
        cycle_nodes = [p for p, d in in_degree.items() if d > 0]
        raise ValueError(f"Cyclic dependency detected among: {cycle_nodes}")
    return order

def run_pipeline(codebase_context: str):
    analyst_output = call_analyst(codebase_context)
    architect_output = call_architect(analyst_output)
    architect_model = ArchitectOutput(**architect_output)
    errors = architect_model.validate_consistency(analyst_output)
    if errors:
        raise ValueError(f"Consistency validation failed: {errors}")
    results = {}
    sorted_specs = topological_sort(architect_output["file_specs"])
    for spec in sorted_specs:
        with safe_open_project_file(spec["path"]) as f:
            source = f.read()
        results[spec["path"]] = call_implementer(
            spec, source, analyst_output["interface_contracts"]
        )
    return results
This script uses Claude for analysis and implementation (leveraging its strength with code comprehension and generation) and GPT-4o for architectural reasoning. This is one approach; either model can be used for any stage — choose based on your own evaluation. The handshake contract means the models do not need to share a conversation context.
Validation Gates Between Stages
Schema validation catches structural errors. Semantic validation catches logical errors. The validate_consistency() method on ArchitectOutput cross-checks that every file the Analyst identified appears in the Architect's specs, and that every interface contract is referenced by at least one transformation. The FileSpec and ArchitectOutput Pydantic models are defined alongside the pipeline runner above, so all references resolve correctly.
If validate_consistency() returns errors, the pipeline halts before the Implementer stage runs, preventing cascading failures.
Error Recovery and Self-Healing Strategies
Retry with Targeted Feedback
When a stage produces output that fails schema or semantic validation, feed the exact validation error back to the same agent alongside the original input. The retry prompt tells the agent exactly what failed, so it can correct its output without re-deriving the entire response from scratch. Cap retries at two or three attempts to prevent infinite loops when a model consistently cannot produce valid output for a given input.
The retry function below works for callables that accept a single string argument (such as call_analyst). For multi-argument callables like call_implementer, define a wrapper that bundles arguments into a single call or use a stage-specific retry strategy.
def call_with_retry(call_fn, prompt: str, max_retries: int = 3):
    """Retry a single-argument callable, accumulating error context on each failure.

    For call_analyst: pass the codebase context as `prompt`.
    For multi-argument functions like call_implementer, wrap them in a
    lambda or partial that accepts a single string before using this helper.
    """
    last_error = None
    current_prompt = prompt
    for attempt in range(max_retries):
        try:
            return call_fn(current_prompt)
        except (
            JsonSchemaValidationError,
            pydantic.ValidationError,
            json.JSONDecodeError,
            KeyError,
        ) as e:
            last_error = str(e)
            current_prompt = (
                f"{current_prompt}\n\n"
                f"PREVIOUS ERROR (attempt {attempt + 1}): {last_error}\n\n"
                f"Fix the output to resolve this error."
            )
    raise RuntimeError(
        f"Stage failed after {max_retries} retries. Last error: {last_error}"
    )
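Usage, assuming the variables from run_pipeline are in scope. For multi-argument stages, the note above suggests a wrapper; one crude but workable option lets the accumulated error context ride along with the ORIGINAL_SOURCE payload. Both snippets below are illustrative:

# Single-argument stage: pass the codebase context straight through.
plan = call_with_retry(call_analyst, codebase_context)

def implement_with_retry(spec: dict, source: str, contracts: list) -> str:
    # Hypothetical wrapper: error feedback is appended to the source string,
    # which the Implementer sees under ORIGINAL_SOURCE on the next attempt.
    return call_with_retry(
        lambda augmented_source: call_implementer(spec, augmented_source, contracts),
        source,
    )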
Fallback and Escalation Patterns
If the primary model fails validation repeatedly, switching to an alternative model for that stage can break the impasse. If GPT-4o fails validation three times, switch that stage to Claude, or vice versa. When automated recovery is exhausted, the pipeline should escalate to human review with a structured diff showing the expected schema, the actual output, and the specific validation failures. This gives the developer enough context to intervene surgically rather than debugging blind.
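A minimal backoff-then-fallback sketch, assuming both stage functions take a single string like the helpers above (the backoff parameters are arbitrary):

import time

def call_with_fallback(primary_fn, fallback_fn, prompt: str, max_attempts: int = 3):
    """Exponential backoff on the primary model's stage function, then one
    attempt with the alternate model before escalating to a human."""
    for attempt in range(max_attempts):
        try:
            return primary_fn(prompt)
        except RuntimeError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s between attempts
    return fallback_fn(prompt)  # e.g. a Claude-backed version of a GPT-4o stage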
Maintaining a Decision Log
Log every stage’s input and output to an append-only file. This helps you debug when the pipeline produces incorrect results and provides context for downstream stages. When the Implementer needs to understand why the Architect chose a particular transformation approach, the decision log makes that rationale accessible without bloating the handoff payload with explanatory prose.
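A minimal append-only logger sketch (the file name and record shape are assumptions):

import datetime
import json

def log_stage(stage: str, payload: dict, log_path: str = "pipeline_decisions.jsonl") -> None:
    """Append one JSONL record per stage so every handoff is replayable."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "stage": stage,
        "payload": payload,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")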
| Failure Type | Detection Point | Recovery Action |
| --- | --- | --- |
| Schema violation | JSON Schema / Pydantic validation | Retry with error message appended to prompt |
| Missing files in spec | Semantic consistency check | Retry Architect with explicit list of missing files |
| Conflicting interfaces | Cross-reference validation | Retry Architect; if persistent, escalate to human |
| Model timeout / rate limit | API error handling | Exponential backoff, then fallback to alternate model |
| Incoherent implementation | Reviewer stage | Re-invoke Implementer for specific files with error context |
Real-World Walkthrough: Migrating an Express.js API to Fastify
The Refactor Scenario
Take a 40-file Express.js API with route handlers, middleware chains, and shared utility modules. The target is Fastify with schema-based validation. This migration touches routing syntax, middleware patterns (Express middleware vs. Fastify hooks and plugins), request/response APIs, and validation approaches. No single prompt can hold 40 files of source code plus transformation instructions plus consistency requirements without exceeding context limits or losing coherence.
How the Pipeline Executed
The Analyst stage ingested file listings and dependency metadata, identifying all 40 files, 12 shared interfaces (authentication middleware, error handlers, response formatters), and 3 distinct middleware patterns requiring different Fastify equivalents. Per-file transformation specs came next from the Architect: Express app.use() middleware mapped to Fastify plugin registration, req.params/req.body access mapped to Fastify’s schema-validated request objects, and Express error middleware mapped to Fastify error handlers. Working in dependency order, the Implementer started with shared utilities and moved outward to route handlers. Two interface mismatches surfaced during review — route handlers referenced a middleware signature the Implementer had changed — so the pipeline re-ran the Implementer on those two files with error context. In one test run, the full pipeline completed in roughly 15 minutes of wall-clock time. Actual duration depends on model latency, retry count, and file count, but even this single run replaced what would typically take hours of manual prompt-by-prompt wrangling.
Pitfalls, Limitations, and When Not to Chain
The model handshake pattern carries real overhead. For refactors touching fewer than five files, the upfront investment in schema design and pipeline setup exceeds the time saved. Model API costs multiply across stages, particularly when retries are involved; check your provider's pricing page and budget accordingly. You must design handoff schemas before the first pipeline run, and poorly designed schemas produce poorly structured outputs regardless of model quality.

Non-determinism remains inherent: the same pipeline with the same inputs can produce different outputs between runs. Pinning model versions and setting low temperature values (0.1 to 0.2) reduces variance but does not eliminate it; the runner above already passes temperature=0.2 to every model call, including Claude's messages.create. For refactors where exact reproducibility is required, treat pipeline outputs as drafts subject to deterministic validation and human review. Run your existing test suite against every output file before merging; a minimal gate for that follows.
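A minimal deterministic gate, assuming the project's tests run from a shell command (npm test here is purely illustrative):

import subprocess

def gate_on_tests(test_command=("npm", "test")) -> None:
    """Deterministic gate: reject pipeline output if the existing suite fails."""
    result = subprocess.run(list(test_command), capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"Test suite failed (exit {result.returncode}):\n"
            f"{result.stdout}\n{result.stderr}"
        )

If the gate fails, route the failing files back through the Reviewer-to-Implementer loop with error context rather than patching by hand.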