Why LLM Agents Fail at Backend Code: They Do Not Just Write Code, They Forget Constraints

It is tempting to look at today’s coding agents and believe that backend development is about to become a one-shot prompt.

I do not buy it.

Not because LLM agents cannot write code. They clearly can. They can add endpoints, patch CRUD flows, generate migrations, update tests, and edit multiple files with impressive speed.

The real problem is more annoying: the most important parts of a backend system are often not written in the first line of the ticket.

Things like:

Who is allowed to call this endpoint?
Which service owns this field?
Will this change break old data?
Does this ORM query accidentally bypass tenant isolation?
If this workflow fails halfway through, what state should be rolled back?
Is that weird old code accidental mess, or a guardrail left there for a reason?

These constraints are not as obvious as “build a login API,” but they are what decide whether backend code can safely ship.

A paper discussed on Hacker News today captures this problem well: “Constraint Decay: The Fragility of LLM Agents in Backend Code Generation.” The phrase is sharp. LLM agents often do not fail immediately. They gradually lose track of the constraints they were supposed to preserve.

That is the dangerous part.

The agent starts by remembering the architecture, ORM, database schema, and API contract. Then, as the task gets longer, it begins to take shortcuts. The feature may still appear to work, but the structure slowly bends out of shape.

That is where backend AI coding gets risky.

Being able to write code is not the same as respecting system boundaries

Most people judge AI-generated code by asking one question: does it run?

That matters, of course. But for backend systems, “it runs” is a dangerously low bar.

An endpoint returning 200 does not mean it is correct. It may have skipped authorization. It may have written the wrong tenant ID. It may have put a slow task into a synchronous request path. It may fail under concurrency and quietly corrupt inventory, billing, or permissions.

These bugs rarely show up in a clean demo.

What makes LLM agents tricky is that they are very good at creating the feeling of completion. They update files, explain their reasoning, and show test output. You skim the result and think, “close enough.”

Backend systems punish “close enough.”

If a frontend button is wrong, someone sees it immediately. If backend code is wrong, you may only notice after data is corrupted, access control is breached, or money is calculated incorrectly.

This is why I dislike broad claims like “AI will replace programmers.” They flatten the problem too much. AI can replace many local coding actions, but backend engineering has never been only about actions. It is about boundaries.

A good backend engineer knows what not to touch. They know which constraints must survive. They know when a quick fix creates a long-term maintenance bomb.

What Constraint Decay actually means

The paper studies whether LLM agents can satisfy both functional requirements and structural constraints in multi-file backend generation.

The authors fix a unified API contract, then evaluate 80 greenfield generation tasks and 20 feature-implementation tasks across eight web frameworks. They do not stop at end-to-end behavioral tests; they also use static verifiers to check structural constraints.

The results are not comforting. As structural requirements accumulate, agent performance drops sharply. The abstract says capable configurations lose 30 points on average in assertion pass rates from baseline to fully specified tasks, while some weaker configurations approach zero.

That is not a tiny edge case.

It suggests that agents are much more sensitive to functional goals than structural constraints. Ask an agent to make an endpoint work, and it will try hard to make the endpoint work. Ask it to do that while preserving architecture, ORM rules, database structure, framework conventions, project style, and hidden boundaries, and it starts dropping things.

This is very similar to handing a backend project to a junior developer.

The junior developer may know how to code. They just do not know where the traps are. They see a slow query and write raw SQL. They see an awkward object boundary and bypass the service layer. They see complicated authorization and copy the nearest check without understanding it. Each move can be explained. Together, they deform the system.

An LLM agent is like a junior developer with incredible speed, short memory, and dangerous confidence.

A little harsh, but honestly, it fits.

Backend difficulty is mostly invisible constraint stacking

Backend code is rarely hard because the syntax is hard.

Most backend code is syntactically boring. The complexity comes from stacked constraints.

1. Permission constraints

Who can call this endpoint? Admins, organization members, project owners, anonymous users? Each boundary changes the implementation.

Agents can easily produce code that is functionally correct but too permissive.

For example, a user passes a project_id; the agent fetches the project and returns data. That looks normal. In a real system, you also need to check whether the user belongs to the organization, whether they have access to that project, and whether they can see that specific field.

Miss one check, and the demo still works. Production may not be so forgiving.
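To make the project_id example concrete, here is a minimal sketch of the three checks. The User and Project shapes, the Forbidden exception, and the billing_plan field are all hypothetical names I made up for illustration; a real system would pull these from its own models and policy layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: int
    org_ids: frozenset          # organizations the user belongs to (assumed shape)
    project_ids: frozenset      # projects the user was granted access to
    can_view_billing: bool = False


@dataclass(frozen=True)
class Project:
    id: int
    org_id: int
    name: str
    billing_plan: str           # hypothetical sensitive field


class Forbidden(Exception):
    """Raised when the caller may not see the requested project."""


def get_project_view(user: User, project: Project) -> dict:
    # Check 1: the user must belong to the organization that owns the project.
    if project.org_id not in user.org_ids:
        raise Forbidden("not a member of the owning organization")
    # Check 2: organization membership alone does not grant project access.
    if project.id not in user.project_ids:
        raise Forbidden("no access to this project")
    view = {"id": project.id, "name": project.name}
    # Check 3: field-level visibility for the sensitive field.
    if user.can_view_billing:
        view["billing_plan"] = project.billing_plan
    return view
```

The "fetch and return" version is this function with all three checks deleted. It passes every demo in which the caller happens to be a legitimate member.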

2. State constraints

Backend objects often live inside state machines.

Orders move from pending to paid to shipped to completed. Jobs move from queued to running to failed to retried. Accounts move from active to suspended to deleted.

Agents often focus on “set the status to X” and forget to ask whether that transition is valid.

These bugs are sneaky. Small test data passes. Real business data eventually accumulates impossible states.
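One way to keep "set the status to X" honest is an explicit transition table. This is only a sketch using the order states named above; the table contents and the InvalidTransition name are my assumptions, and a real project may allow more edges (cancellations, refunds).

```python
# Allowed order transitions, written down once instead of implied by scattered code.
ALLOWED_TRANSITIONS = {
    "pending": {"paid"},
    "paid": {"shipped"},
    "shipped": {"completed"},
    "completed": set(),  # terminal state
}


class InvalidTransition(Exception):
    """Raised when a status change is not in the transition table."""


def transition(current: str, target: str) -> str:
    # Unknown current states get an empty set, so every move out of them is rejected.
    allowed = ALLOWED_TRANSITIONS.get(current, set())
    if target not in allowed:
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

With a table like this, an agent that writes order.status = "shipped" directly has something concrete to violate, and a reviewer has something concrete to grep for.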

3. Data consistency constraints

A database is not a giant JSON file.

There are transactions, unique indexes, foreign keys, idempotency rules, concurrent updates, and historical compatibility issues.

LLM agents often know how to write data. They are less reliable at knowing when a transaction is required, when duplicate submissions must be blocked, and when concurrent writes need protection.

This matters most around money, inventory, credits, quotas, and permissions. You cannot validate these areas with “looks fine.”

4. Architecture constraints

Mature projects have layers: controller, service, repository, domain, jobs, events, middleware.

These layers can look tedious, but they keep complexity contained.

Agents often take the shortest path to the current task. They put business logic in the nearest file because that is where the edit is easiest.

Ask an agent to add a field, and it may query the database, build business rules, emit an event, and write logs directly in the controller. The feature passes. The architecture gets worse.
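For contrast, here is a framework-free sketch of the layered shape the shortcut destroys. Every class and function name below is hypothetical, and the in-memory repository stands in for a real database layer.

```python
class ProjectNotFound(Exception):
    pass


class ProjectRepository:
    """Data access only: no business rules live here."""

    def __init__(self):
        self._rows = {}

    def get(self, project_id):
        return self._rows.get(project_id)

    def save(self, project_id, row):
        self._rows[project_id] = row


class ProjectService:
    """Business rules and side effects (events) live here."""

    def __init__(self, repo, events):
        self.repo = repo
        self.events = events  # a plain list standing in for an event bus

    def rename(self, project_id, new_name):
        if not new_name.strip():
            raise ValueError("name must not be blank")
        row = self.repo.get(project_id)
        if row is None:
            raise ProjectNotFound(project_id)
        row = {**row, "name": new_name.strip()}
        self.repo.save(project_id, row)
        self.events.append(("project_renamed", project_id))
        return row


def rename_project_controller(service, request: dict):
    """Controller: translate the request, delegate, translate the result."""
    try:
        row = service.rename(request["project_id"], request["name"])
    except ValueError as exc:
        return 400, {"error": str(exc)}
    except ProjectNotFound:
        return 404, {"error": "project not found"}
    return 200, row
```

The controller knows nothing about storage or events. When an agent inlines the validation, the lookup, and the event into the controller, the feature still passes, and the next person to need rename from a background job has to copy it.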

5. Implicit requirements

The hardest constraints are the ones nobody wrote down.

A field cannot be removed because old mobile clients still depend on it. An API response looks strange because a large customer integrated against it years ago. A duplicated code path exists because a vendor bug forced it.

These things may not live in README files or tickets. They live in team memory.

Human maintainers forget them too. Agents have even less chance unless we deliberately surface them.

Why “smarter” frameworks can make agents worse

The paper includes an interesting observation: agents do better in small, explicit frameworks like Flask, and perform worse on average in convention-heavy environments like FastAPI and Django.

That makes sense.

An explicit framework is like a printed instruction manual. The route may be simple, even crude, but the pieces are visible.

A convention-heavy framework is more like a city transit system. You need to understand defaults, lifecycle hooks, dependency injection, ORM behavior, middleware order, and configuration conventions. Many rules are not written next to the code you are editing, but you still have to obey them.

Human developers learn these traps through experience. Agents infer them from context. When the context is incomplete or the task is long, the agent starts treating the framework like a generic codebase.

That is why “just let the agent handle the backend” sounds appealing but often turns into review hell.

The agent does not fail completely.

It does 70% impressively well, then quietly creates debt in the remaining 30%.

Backend incidents usually hide in that 30%.

The developer community is already searching for guardrails

What I find interesting about today’s Tech Radar is that the Constraint Decay paper was not the only signal. Other developer signals point in the same direction: agents cannot run without guardrails.

On GitHub Trending, Understand-Anything turns code into an interactive knowledge graph so developers and agents can explore, search, and ask questions about a codebase. codegraph focuses on local pre-indexed code knowledge graphs to reduce tokens and tool calls.

The implicit message is clear: if agents are going to write code well, they first need to understand codebases better.

Peter Steinberger also shared a small but excellent workflow idea: ask Codex to maintain a scratch-log during big refactors, recording decisions, tradeoffs, and review fixes. Later, you can read what the agent decided on your behalf and what you forgot to specify.

That is good engineering instinct.

Do not pretend the agent will always be right. Make it leave tracks.

A serious backend agent workflow should look less like “prompt once and accept the diff” and more like this:

Read project constraints
Write a plan
Record decisions during execution
Run tests at every stage
Have a human review the dangerous boundaries
Close with static checks and behavioral tests

That sounds slower, but backend work is supposed to be careful. If you skip the careful parts, production will collect the debt later.

How I would use an LLM agent for backend work

I would not ask an agent to simply “implement this backend feature.” That instruction is too wide.

I would split the job into layers.

Step 1: Make it restate the constraints

Before writing code, ask the agent to describe the constraints it has found:

Which models and tables are involved?
Which endpoints may be affected?
What are the permission boundaries?
Which state transitions are forbidden?
Which existing tests must keep passing?
How is similar functionality implemented elsewhere?

This is not ceremony. It checks whether the agent has actually read the map.

If it cannot explain the map, letting it write code faster is just a faster way to get lost.

Step 2: Make it find similar code first

Backend projects have local conventions.

Do not let the agent improvise from first principles. Ask it to find three similar implementations and follow the existing style.

That is usually more reliable than feeding it abstract architecture advice.

Principles are easy to forget. Nearby code is harder to ignore.

Step 3: Turn permissions, state, and consistency into acceptance criteria

Do not just say “create an order endpoint.”

Write constraints like:

Non-members cannot create orders in this project
Repeated requests must not create duplicate orders
If inventory is insufficient, the request fails and inventory is not deducted
Successful creation writes an audit log
Existing API response shapes remain unchanged

The more explicit these constraints are, the less room the agent has to hallucinate the missing parts.
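Acceptance criteria like these translate almost line for line into tests. The sketch below uses a hypothetical in-memory OrderService as the system under test, purely so the tests have something to run against; in a real project the tests would hit the actual endpoint, and the names here are my assumptions.

```python
class OrderService:
    """Hypothetical in-memory stand-in for the real order endpoint."""

    def __init__(self, members, stock):
        self.members = set(members)
        self.stock = stock
        self.orders = {}       # idempotency key -> order
        self.audit_log = []

    def create_order(self, user, key, qty):
        if user not in self.members:
            raise PermissionError("not a project member")
        if key in self.orders:
            return self.orders[key]  # repeated request returns the original order
        if qty > self.stock:
            raise ValueError("insufficient inventory")
        self.stock -= qty
        order = {"id": len(self.orders) + 1, "qty": qty}
        self.orders[key] = order
        self.audit_log.append(("order_created", user, order["id"]))
        return order


def test_non_member_cannot_create_order():
    svc = OrderService(members={"alice"}, stock=5)
    try:
        svc.create_order("mallory", "k1", 1)
    except PermissionError:
        pass
    else:
        raise AssertionError("non-member created an order")
    assert svc.orders == {}


def test_repeated_request_creates_one_order():
    svc = OrderService(members={"alice"}, stock=5)
    first = svc.create_order("alice", "k1", 2)
    second = svc.create_order("alice", "k1", 2)
    assert first == second and len(svc.orders) == 1 and svc.stock == 3


def test_insufficient_inventory_leaves_stock_untouched():
    svc = OrderService(members={"alice"}, stock=1)
    try:
        svc.create_order("alice", "k1", 2)
    except ValueError:
        pass
    else:
        raise AssertionError("order created without stock")
    assert svc.stock == 1 and svc.orders == {}


def test_success_writes_audit_log():
    svc = OrderService(members={"alice"}, stock=5)
    order = svc.create_order("alice", "k1", 1)
    assert svc.audit_log == [("order_created", "alice", order["id"])]


def test_response_shape_unchanged():
    svc = OrderService(members={"alice"}, stock=5)
    assert set(svc.create_order("alice", "k1", 1)) == {"id", "qty"}
```

Handing the agent the test names before it writes the implementation is a cheap way to make each constraint something that can fail, instead of something it has to remember.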

Step 4: Require a scratch-log

For any non-trivial backend change, I would ask the agent to keep a temporary log:

Files changed
Why each change was made
Tradeoffs it chose
Places it was uncertain
Fixes made after review

This is not busywork. It gives human reviewers a handle.

Without a log, you can only infer the agent’s reasoning from the diff. With a log, you can see where it started drifting.
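The log does not need tooling; a text file works. If you want the fields above to be checkable, though, a structure like the following is one possible shape. ScratchLog, ScratchLogEntry, and their field names are hypothetical and only mirror the list above.

```python
from dataclasses import dataclass, field


@dataclass
class ScratchLogEntry:
    file: str             # file changed
    why: str              # why the change was made
    tradeoff: str = ""    # tradeoff the agent chose, if any
    uncertain: str = ""   # where the agent was unsure
    review_fix: str = ""  # fix made after review


@dataclass
class ScratchLog:
    task: str
    entries: list = field(default_factory=list)

    def record(self, **fields) -> None:
        self.entries.append(ScratchLogEntry(**fields))

    def open_questions(self) -> list:
        """Entries a human reviewer should probably read first."""
        return [e for e in self.entries if e.uncertain]

    def render(self) -> str:
        lines = [f"Scratch log: {self.task}"]
        for e in self.entries:
            lines.append(f"- {e.file}: {e.why}")
            for label in ("tradeoff", "uncertain", "review_fix"):
                value = getattr(e, label)
                if value:
                    lines.append(f"  {label}: {value}")
        return "\n".join(lines)
```

The useful part is open_questions: it turns "places it was uncertain" into a short review queue instead of something buried in a long diff.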

Step 5: Layer the tests

Backend tasks should not only test the happy path.

I would require at least three categories:

Happy path: the feature works
Boundary path: permissions, state, retries, empty data, duplicate requests
Regression path: old behavior remains intact

If the project supports it, add static rules too: forbidden imports, layering constraints, ORM usage checks, schema compatibility checks.

Explicit guardrails frustrate agents in the best possible way. They turn vague expectations into things that can fail fast.
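A forbidden-import rule is the easiest static check to sketch. The version below uses Python's standard ast module; the rule itself (controller code must not import sqlalchemy or a hypothetical app.repositories package directly) is an assumption for illustration, not something taken from the paper.

```python
import ast

# Hypothetical layering rule: controllers must not reach past the service layer.
FORBIDDEN_PREFIXES = ("sqlalchemy", "app.repositories")


def forbidden_imports(source: str, forbidden=FORBIDDEN_PREFIXES) -> list:
    """Return the imported module names in `source` that break the rule."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"; treat that as no match.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Match the package itself or any submodule, but not lookalike names.
            if any(name == p or name.startswith(p + ".") for p in forbidden):
                found.append(name)
    return found
```

Run over every file in the controller directory in CI, a check like this fails the build the moment an agent takes the shortcut, which is much earlier than a reviewer would notice it.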

The real conclusion: agents are not programmer replacements; they are managed executors

I do not think Constraint Decay means AI coding is doomed.

I think it means AI coding is becoming more real.

The early question was: can it write code?

The better question now is: can it write correct code under layered constraints? If not, what workflow makes its mistakes smaller, earlier, and traceable?

That question matters more than another model leaderboard.

Backend development is not mainly about producing code. It is about keeping a system from losing control as it changes. LLM agents can dramatically speed up execution, but the faster execution gets, the clearer constraints must become.

An unconstrained agent is like an energetic new teammate: fast, positive, and eager to edit a lot of files.

Sounds great.

Until you discover it removed a transaction, bypassed authorization, and turned your state machine into a free fall.

So my view is simple: the next valuable layer in backend AI coding will not be merely “agents that write more code.” It will be workflows that help agents forget fewer constraints.

Whoever connects context, constraints, logs, tests, and review will turn agents from toys into production tools.

Everyone else gets a very fast intern who also buries landmines very fast.

References

Constraint Decay: The Fragility of LLM Agents in Backend Code Generation: https://arxiv.org/abs/2605.06445
Hacker News discussion: https://news.ycombinator.com/item?id=48256912
Understand-Anything: https://github.com/Lum1104/Understand-Anything
codegraph: https://github.com/colbymchenry/codegraph
Peter Steinberger on scratch-log: https://x.com/steipete/status/2058308112134635528
