Using AI to Migrate Legacy Code: Real Case Studies and Pitfalls

March 18, 2026 · Editorial Team · 8 min read · legacy-migration ai-coding refactoring

Legacy system migrations are expensive, slow, and risky. The code is usually poorly documented, the original developers are often gone, and the existing system has accumulated years of business logic embedded in places nobody expected.

AI coding agents have arguably changed the economics of legacy migrations more than those of any other kind of software work. This isn't hype: real teams are running COBOL-to-Python migrations, PHP-to-TypeScript rewrites, and VB6-to-C# conversions with AI agents doing the bulk of the translation work. The results are mixed enough to be worth discussing honestly.

What AI agents are good at in legacy migrations

Before the case studies, a clear picture of where AI adds real value:

Mechanical translation. Converting COBOL procedures to Python functions, PHP functions to TypeScript equivalents, SQL stored procedures to application-layer code. If the source is syntactically consistent (which old enterprise code often is), the agent can handle most of the translation accurately.

Documentation extraction. Legacy systems often have comments and variable names that encode business rules but no actual documentation. Agents are good at reading 5,000 lines of COBOL and producing a structured summary of what the business logic does.

Test generation from behavior. If you run the old system against a large set of inputs and capture the outputs, an agent can generate a test suite that codifies the old behavior. Then your migration target has to match those tests.

Pattern recognition. In large legacy codebases, the same patterns repeat hundreds of times (form validation, database access, report generation). Once you've figured out the correct translation for a pattern, the agent can apply it systematically.

What agents are bad at: understanding the historical intent behind seemingly wrong code (the bug that became a feature), knowing which parts of the code are actually called in production, and making architectural decisions about the target system.

Case study 1: COBOL payroll to Python

A financial services firm had a payroll calculation system written in COBOL in the late 1980s. The system ran fine for 35 years. The problem: only one person on the team could still read COBOL, that person was retiring, and the system needed to integrate with a modern HR platform.

The starting point. About 45,000 lines of COBOL across 80+ programs. Some programs were well-commented; others were not commented at all. The core payroll calculation logic was in a single 8,000-line program.

The workflow that worked:

Step 1 was documentation, not translation. The team fed the main payroll program to Claude 3.7 Sonnet in sections and asked it to document what each section did. Not translate: document. This produced a 40-page functional spec that the team's domain experts could review and correct.

This step alone was valuable enough to justify the AI tool costs. The documentation surfaced business rules nobody knew were in the code, including a special overtime calculation that applied only to employees hired before 1995 (handled by a flag in the employee record that the HR team had forgotten existed).

Step 2 was golden-file testing. The team ran the COBOL system against a year's worth of payroll inputs and captured every output. This produced roughly 600 test cases that the Python system had to match exactly.

Step 3 was incremental translation. They broke the 8,000-line COBOL program into logical sections (gross pay calculation, deductions, tax withholding, net pay) and had the agent translate one section at a time. Each translated section got integration-tested against the golden files before moving to the next.

What broke: The agent consistently struggled with COBOL's fixed-point decimal arithmetic. Python floats are binary floating point and cannot represent most decimal fractions exactly, so they don't behave identically to COBOL's COMP-3 packed decimal arithmetic in edge cases involving rounding. This caused subtle discrepancies in the golden-file tests. The fix was to use Python's decimal.Decimal with the rounding mode set explicitly to match COBOL's behavior.
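A minimal sketch of that fix, assuming the COBOL result field holds two decimal places (something like PIC S9(7)V99). It also assumes the usual COBOL semantics: a COMPUTE with the ROUNDED phrase rounds halves away from zero, and one without it truncates. The function names and sample figures are illustrative, not taken from the team's code.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

CENTS = Decimal("0.01")  # two decimal places (assumed picture clause)


def compute_rounded(a: Decimal, b: Decimal) -> Decimal:
    """Mimic COMPUTE RESULT ROUNDED = A * B: ties round away from zero."""
    return (a * b).quantize(CENTS, rounding=ROUND_HALF_UP)


def compute_truncated(a: Decimal, b: Decimal) -> Decimal:
    """Mimic COMPUTE RESULT = A * B without ROUNDED: extra digits are dropped."""
    return (a * b).quantize(CENTS, rounding=ROUND_DOWN)


hours = Decimal("1.07")
rate = Decimal("2.5")  # exact product is 2.675

print(compute_rounded(hours, rate))    # 2.68
print(compute_truncated(hours, rate))  # 2.67

# The float pitfall: 2.675 has no exact binary representation,
# so rounding it to two places goes the "wrong" way.
print(round(2.675, 2))                 # 2.67
```

Build Decimal values from strings or integers, never from floats, or the binary representation error is carried into the decimal value. Which rounding mode is correct depends on each COBOL statement, so it is worth checking them one by one against the golden files.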

Time and cost. The migration took 4 months with two developers, against the team's estimate of 18 months without AI assistance. The agent handled roughly 70% of the translation work; the remaining 30% required human understanding of the business logic and the arithmetic edge cases.

Case study 2: PHP monolith to TypeScript API

A mid-size e-commerce company had a PHP 5.x monolith running their entire catalog and order management system. Approximately 120,000 lines of PHP across 400+ files, no framework (pure procedural PHP), minimal tests.

This migration was structurally different from the COBOL case because the target was not just a language change but an architectural change: from a monolith to an API that would serve a new React frontend.

The main challenge. Procedural PHP monoliths have presentation, business logic, and database access tangled together. A single PHP file might query the database, format the results, and render HTML all in 200 lines. Moving to a layered architecture meant the agent had to simultaneously translate the code and separate concerns that had never been separated.

What worked: The team used a different strategy. Rather than translating the PHP directly to TypeScript, they used the AI to extract the business logic as a specification, then implemented the TypeScript from scratch using that specification.

The workflow: feed a PHP file to the agent and ask it to produce a specification document listing all business rules, validation rules, database operations, and edge cases. Do this for every file. Then use those specification documents (rather than the PHP) as the source for the TypeScript implementation.

This sounds like more work, but it had a crucial advantage: the TypeScript code was written for the target architecture from the start, not a mechanical translation of procedural code.

What the agent did directly: Standard model objects (product, order, customer), input validation rules, database query logic, email templates, price calculation functions. These were mechanical translations that the agent handled well.

What required human judgment: The order state machine (the PHP had implicit state transitions scattered across dozens of files), the tax calculation rules (which had silent special cases for certain product categories), and integration with the payment provider (where the PHP had accumulated a series of workarounds for specific error conditions).

Time and cost. The migration took 8 months with a team of three, against an estimate of 24+ months without AI. The gain wasn't just speed but quality: the resulting TypeScript codebase had 84% test coverage because the team generated tests alongside each extracted specification, something they never had time to do with the PHP.

The documentation-first approach

Both case studies share a common pattern that's worth making explicit: documentation before translation.

When you ask an AI agent to "translate this COBOL to Python," you're asking it to solve two problems simultaneously: understand what the code does and rewrite it in another language. This compounds errors. If the agent misunderstands what a section does (which happens), the translation will be wrong in a way that's hard to detect without thorough tests.

When you ask the agent to "explain what this code does" first, you get a chance to correct misunderstandings before they propagate into the translated code. A human reviewer can read the documentation, catch errors in understanding ("actually, that 1995 flag applies to union employees, not all employees"), and fix them before translation begins.

This adds time but dramatically reduces the cost of errors. In legacy migrations, an undetected business logic error that makes it to production is extremely expensive.

Setting up the agent workflow for migrations

Chunking strategy. Large files overwhelm context windows. Split them before feeding to the agent. For COBOL, natural boundaries are sections and paragraphs. For PHP, natural boundaries are functions. For SQL stored procedures, split by procedure.
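A rough sketch of boundary-based chunking, assuming fixed-format COBOL in which a paragraph header is a single name starting in column 8 and ending with a period. The PARAGRAPH_HEADER pattern is a simplification for illustration (it does not handle SECTION or DIVISION headers or free-format source), and split_at_boundaries is a hypothetical helper; for PHP or SQL you would pass a different pattern.

```python
import re

# Six sequence-number columns, a blank indicator column, then a paragraph
# name in Area A followed by a period. Simplified on purpose.
PARAGRAPH_HEADER = re.compile(r"^.{6} [A-Z0-9][A-Z0-9-]*\.\s*$")


def split_at_boundaries(source: str, boundary: re.Pattern[str]) -> list[str]:
    """Start a new chunk at every line that matches the boundary pattern."""
    chunks: list[list[str]] = []
    for line in source.splitlines():
        # Open a chunk on each boundary, and for any text before the first one.
        if boundary.match(line) or not chunks:
            chunks.append([])
        chunks[-1].append(line)
    return ["\n".join(chunk) for chunk in chunks]


SAMPLE = "\n".join([
    "000100 CALC-GROSS.",
    "000200     COMPUTE GROSS = HOURS * RATE.",
    "000300 CALC-NET.",
    "000400     COMPUTE NET = GROSS - DEDUCTIONS.",
])

for chunk in split_at_boundaries(SAMPLE, PARAGRAPH_HEADER):
    print(chunk)
    print("---")
```

Splitting on the language's own boundaries keeps each chunk meaningful on its own, which matters more than hitting an exact size. Very long paragraphs may still need a further split by hand.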

Keeping a translation dictionary. As you work through the migration, maintain a document that maps source patterns to target patterns:

COBOL COMPUTE to Python:
  COMPUTE RESULT = A * B → result = a * b (use Decimal for financial values)

PHP mysql_query() to TypeScript:
  Direct query → use /src/db/queries/ with Drizzle ORM

PHP session_start() / $_SESSION to TypeScript:
  → use the session middleware in /src/lib/session.ts

Share this with the agent at the start of each session. It dramatically reduces inconsistencies across translated files.
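One way to keep the dictionary consistent between sessions is to store it as data and render the same preamble every time. The sketch below is an assumption about how that could look: Mapping, TRANSLATION_DICTIONARY, and render_preamble are hypothetical names, and the entries simply restate the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    source: str  # pattern in the legacy code
    target: str  # agreed translation in the new codebase


TRANSLATION_DICTIONARY: dict[str, list[Mapping]] = {
    "COBOL COMPUTE to Python": [
        Mapping("COMPUTE RESULT = A * B", "result = a * b (use Decimal for financial)"),
    ],
    "PHP mysql_query() to TypeScript": [
        Mapping("Direct query", "use /src/db/queries/ with Drizzle ORM"),
    ],
    "PHP session_start() / $_SESSION to TypeScript": [
        Mapping("session_start() / $_SESSION",
                "use the session middleware in /src/lib/session.ts"),
    ],
}


def render_preamble(dictionary: dict[str, list[Mapping]]) -> str:
    """Render the dictionary as plain text to paste at the start of a session."""
    lines = ["Apply these agreed translations consistently:"]
    for heading, mappings in dictionary.items():
        lines.append(f"{heading}:")
        lines.extend(f"  {m.source} -> {m.target}" for m in mappings)
    return "\n".join(lines)


print(render_preamble(TRANSLATION_DICTIONARY))
```

Keeping the dictionary in version control alongside the migrated code means a new mapping gets reviewed like any other change.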

Golden file testing. This is the most valuable investment you can make before starting a migration. Run the old system against representative inputs, capture outputs, build a test suite. If you can't generate golden files (the old system is too tightly coupled to a production database), build them incrementally from production logs.
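A minimal sketch of a golden-file harness. It assumes both systems can be wrapped as functions from an input record to an output record; in practice the legacy side is more likely a batch job whose output files you parse. legacy_gross and new_gross are hypothetical stand-ins, and new_gross carries a deliberate bug so the mismatch report has something to show.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

System = Callable[[dict], dict]


def record_golden(cases: list[dict], legacy: System, path: Path) -> None:
    """Run the legacy system on every input and save the input/output pairs."""
    records = [{"input": case, "expected": legacy(case)} for case in cases]
    path.write_text(json.dumps(records, indent=2, sort_keys=True))


def find_mismatches(new_impl: System, path: Path) -> list[dict]:
    """Return one record per case where the new system disagrees with the old."""
    mismatches = []
    for record in json.loads(path.read_text()):
        actual = new_impl(record["input"])
        if actual != record["expected"]:
            mismatches.append({**record, "actual": actual})
    return mismatches


def legacy_gross(case: dict) -> dict:
    # Stand-in for the old system; amounts are integer cents to avoid floats.
    return {"gross_cents": case["hours"] * case["rate_cents"]}


def new_gross(case: dict) -> dict:
    # Deliberate bug for the demo: hours above 40 are silently capped.
    return {"gross_cents": min(case["hours"], 40) * case["rate_cents"]}


cases = [
    {"hours": 38, "rate_cents": 2150},
    {"hours": 45, "rate_cents": 2150},
]
golden_path = Path(tempfile.mkdtemp()) / "payroll_golden.json"
record_golden(cases, legacy_gross, golden_path)

mismatches = find_mismatches(new_gross, golden_path)
for m in mismatches:
    print(m["input"], "expected", m["expected"], "got", m["actual"])
```

Recording the inputs next to the expected outputs makes each failure reproducible without access to the old system, which matters once it is switched off.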

Incremental migration with feature flags. Don't do a big bang cutover. If possible, run old and new systems in parallel and switch routes one by one. The agent can help you set up the feature flag infrastructure.
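A sketch of per-route switching with a shadow mode, under the assumption that each route can be expressed as a handler function and that shadowed handlers have no side effects (running a payment or an email twice would not be acceptable). MigrationRouter and the pricing handlers are hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

Handler = Callable[[dict], dict]


@dataclass
class MigrationRouter:
    legacy: dict[str, Handler]
    new: dict[str, Handler]
    live: set[str] = field(default_factory=set)    # routes served by the new system
    shadow: set[str] = field(default_factory=set)  # routes compared but not served
    mismatches: list[tuple[str, dict]] = field(default_factory=list)

    def handle(self, route: str, request: dict) -> dict:
        if route in self.live:
            return self.new[route](request)
        result = self.legacy[route](request)
        if route in self.shadow:
            # Run the new code on the same request; never let it affect the caller.
            try:
                candidate = self.new[route](request)
            except Exception as exc:
                candidate = {"error": repr(exc)}
            if candidate != result:
                self.mismatches.append((route, request))
        return result


def legacy_price(req: dict) -> dict:
    return {"total": req["qty"] * 5}


def new_price(req: dict) -> dict:
    # Deliberate bug for the demo: large orders are priced at zero.
    return {"total": req["qty"] * 5 if req["qty"] < 10 else 0}


router = MigrationRouter(
    legacy={"/price": legacy_price},
    new={"/price": new_price},
    shadow={"/price"},
)
print(router.handle("/price", {"qty": 2}))   # systems agree
print(router.handle("/price", {"qty": 12}))  # legacy answer served, mismatch logged
print(router.mismatches)

router.live.add("/price")  # the flag flip, once mismatches stop appearing
```

Shadowing a route before flipping it turns production traffic into an extra source of golden-file cases.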

Common pitfalls

Trusting the translation without testing. The agent will produce plausible-looking code that has subtle errors in edge cases. Always test against known good outputs.

Losing implicit state. Procedural code often has implicit state (globals, session variables, module-level variables) that's hard to map to an object-oriented or functional target. The agent will sometimes miss these, producing code that works for the happy path but breaks for multi-step workflows.

Translating dead code. Legacy systems accumulate code that hasn't been called in years. The agent doesn't know what's dead and will dutifully translate it. Running a coverage tool on the old system under production or production-like traffic before the migration can tell you what's actually used.

Architecture decisions made by default. If you don't specify the target architecture, the agent will make choices. Sometimes those choices are fine; sometimes they embed anti-patterns that you'll be untangling for years. Define the target architecture explicitly before translation begins.

Realistic expectations

AI agents can cut legacy migration time by 50-70% in favorable conditions: consistent source language, clear separation between modules, adequate test coverage of the legacy system.

The unfavorable conditions that reduce this gain: highly procedural code with implicit state, business logic mixed with presentation, no test coverage on the legacy system, source code with significant variation in style (suggesting multiple original authors with different habits).

The agents do the mechanical translation work. The humans remain responsible for architecture decisions, edge case validation, and understanding the business logic. In practice the split looks like this: agents handle 60-75% of the lines translated; humans handle 100% of the judgment calls about what the translation should mean.

For setting up AI coding agents in complex projects generally, the monorepo strategies guide covers context management patterns that also apply to large legacy codebases. For the testing side of migrations, the TDD with AI agents guide covers golden-file testing and behavioral specification approaches in more detail.