In the wake of my last essay on why I don’t vibe
code, I heard from various people on the
Internet who read it (or the Bluesky skeets that inspired it). Some had adopted
similar stances on their own, often for overlapping reasons. Others were
dedicated vibe coders who wanted to share what practices they used that made
things manageable for them. I appreciate the feedback! I’m not going to change
my stance, but it’s good to get perspective on how actual developers (and not
senior executives or corporate marketing materials) have engaged with this
technology.
Contrary to how I might seem at times, I do think there could be a place for
LLM models in modern software development. I do understand why
some developers are entranced with the ability to create applications in hours
and turn prototypes into products. I see how it could be highly useful for home
hobbyists, back-office bureaucrats and expert engineers to build the low-stakes software
they’ve never had the time or skill for.
But I think it’s also incredibly dangerous to assume these same benefits apply
equally for large software engineering teams or more sensitive software
applications. It might be one thing if your vibe-coded personal recipe app
mangles a measurement conversion; it’s entirely different if it destroys your
production
database
(even if it pretends to be really sorry afterwards). It’s one thing if you stay
up late coding on your personal project (I can’t
criticize), but another if your team is just merging in
code without testing or reviews because you’re too afraid to be the person who
is slowing the team down. And I haven't seen many public examples where teams
have set explicit bounds on how and where they want to use AI.
I haven’t been able to find many good examples of how teams are using Generative
AI (GenAI) effectively. Much of the literature is still very much in the early
phase of the AI lifecycle, with many articles exulting more in how AI replaces
the need for software teams rather than
how it might boost them. Perhaps there are people figuring it out, but it feels
like we're going to continue to see new land-speed records in
self-owning
before we learn how to do it right.
And that’s probably how it will be. Software development as a practice has been
informed by decades of experience of how to do things terribly wrong, and many
of the standard practices we follow today were learned the hard way. Maybe we
need a few more years of companies deleting their entire code base or shipping
with critical errors (perhaps with a few high-profile outages or bankruptcies in
the mix) before the industry figures out how to sustainably work with these new
tools. Here are some of the things I've been thinking about, though.
The LLM is Not a Junior Engineer
First, I need to get something off my chest. It’s fairly common in our industry
to anthropomorphize GenAI products and describe them as junior engineers or
similar low-level coworkers. Stop it! While it may be useful to think of LLMs
as interns instead of
gods,
this framing still grants AI a conceptual personhood that makes it seem more
capable and reliable than it actually is. And it’s highly insulting to the
actual junior engineers in the industry, who are usually some of the most
talented and hard-working individuals you will find.
An AI model is not a person or even sentient. It has no long-term memory. It has
no internalized morality of what are good or bad behaviors (apart from what
might be implicitly reflected in its training data and reinforced in
post-training calibration). It doesn’t learn from any of its actions itself.
Instead, developers make it “learn” by carefully crafting introductory texts
which tell the LLM how to act and what things to avoid. And it mostly seems to
work, as long as someone remembers to tell the AI to stop talking about
goblins so
much.
It is indeed impressive and a little magical that it does work so well most of the
time, but there are no
guarantees
that it won’t go wrong either. To help the LLM build on previous steps, many
coding agents will write to and read from a working memory file. This memory
itself is also included as part of the inputs into the model for each new step,
which means the longer the LLM is used on a single problem, the slower and more
expensive each successive query gets, to the point where some engineers have
reported hitting their weekly usage quotas within a single day. And when the LLM
finally fills out its limited context window, all sorts of wrong things will
happen, from API errors to selective amnesia as well as “lost-in-the-middle”
confusion and issues inferring responses to new prompts. To mitigate this, some
agentic models will include processes to summarize and compact their own
memories; this is a lossy compression by its very nature, so there is some risk
of distortion and loss there. Others will regularly just start over with new
agents that can stumble into the same mistakes and suggestions as their
predecessors without active intervention.
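To make that loop concrete, here is a toy sketch of an agent's working memory; the class, the budget numbers, and the crude "compaction" step are my own illustration, not any real agent's implementation:

```python
def estimate_tokens(text):
    # Crude stand-in: real systems count tokenizer output, not words.
    return len(text.split())

class AgentMemory:
    """Toy model of a coding agent's working-memory file.

    Every step appends a note, and the whole memory is fed back in with
    each new prompt, so cost grows as the session runs. When the memory
    exceeds its budget, we "compact" it by dropping everything but the
    last few notes -- a crude stand-in for the lossy summarization that
    real agents perform.
    """

    def __init__(self, token_budget=50, keep_last=3):
        self.notes = []
        self.token_budget = token_budget
        self.keep_last = keep_last

    def record(self, note):
        self.notes.append(note)
        if estimate_tokens(" ".join(self.notes)) > self.token_budget:
            # Lossy compression: the older context is gone for good.
            self.notes = ["(summary of earlier steps)"] + self.notes[-self.keep_last:]

    def prompt_context(self):
        return "\n".join(self.notes)

mem = AgentMemory(token_budget=12, keep_last=2)
for step in ["read repo layout", "found failing test in utils",
             "patched utils.py", "re-ran tests, all green"]:
    mem.record(f"step: {step}")
print(mem.prompt_context())  # the earliest steps have been summarized away
```

Even in this toy, the selective amnesia is visible: once the budget is exceeded, the details of the earliest steps can never be recovered from the summary line.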
All of which is to say, if an LLM were a person (to be clear, he is not), he
would be an absolute nightmare to work with. Every LLM agent is essentially a
combination of Amelia Bedelia
and Leonard Shelby from
Memento. It is up to you to
tell him precisely what he should or should not do. And it is also your
responsibility to keep a written memory that he will need to reference
constantly to know how to make sense of his world. Also, he doesn’t know or care
about anything else outside of his written notes, and you will feel intense
pressure to keep his notes as concise as possible without messing up. And when
you do mess up, other developers will blame you and not him. Your corporate
culture and values? Your general approach to architecture and testing? Your
long-term product road map? He doesn’t know it and doesn’t care. He doesn’t care
about how well the company is doing financially or its biggest challenges. He
will face no legal liability or professional consequences when he screws up
something. You know nothing about his background before he walked into the door
(and the secret instructions that his temp agency handed him before that so he
wouldn’t talk about the goblins). You do know he sometimes lies about things,
but it’s not always clear how much he’s lying beyond the very obvious whoppers
you’ve already caught. He can make recommendations and write code, but he has no
idea where it came from and how easy it is to support. He acts like a friend,
but at his core, he has no loyalty or empathy. He has no shared experiences with
you or fun weekend plans. He’s a perfectly impenetrable black box.
Junior engineers do learn, however. And unlike the hypothetical LLM-as-a-person,
they arrive at your company laden with context on how to work with your team.
They’ve grown up in a culture, learned the rules of society, have probably gone
to school for many years and likely learned some important things there. They
belong to families and possibly have their own. They've put down roots in the
community. To work at your company, they’ve provided a resume and have gone
through multiple stages of your interview process. They’ve worked hard to land
this job and they want to keep it! They know that malicious behavior or neglect
will lead to real and painful consequences – being fired, litigation or even
criminal charges – that could potentially derail their entire careers. And even
without considering the threat of consequences or the possibility of reward, they will
still strive to do things the best they can. Because it just matters to them
in a way they might not always fully articulate but feel innately.
Of course, junior engineers do not move as fast as AI models, because they do
not come prepackaged with knowledge and must live in a world more complicated
than a stream of data. It is true that junior engineers will often make
mistakes, but they also generally learn from those too. And yes, genuinely
malicious engineers can cause real damage of their own as “insider
threats,” but there is at least
the possibility of consequences – being fired, litigation or even criminal
prosecution – including the fear of never being able to work in the industry
again. I don’t claim that junior engineers are perfect or infallible, but I do
believe that it’s also unfair to compare the worst of junior engineers to the
best of Generative AI models. And most junior engineers are simply phenomenal.
Best of all, junior engineers will mature with time into highly competent senior
engineers, engineering managers, directors and architects. Your investment of
time and money in them will pay dividends as they become more capable members of
your staff. The GenAI models will simply be replaced with new black boxes.
‘Magic’ and ‘More Magic’
Junior Engineers will also rarely embarrass you.
In a recent profile of vibe coders, the New York Times included a tip from one
developer to tell his agent to not do things that would be
embarrassing,
leading the author to wonder how that would even work:
Embarrassing? Did that actually help, I wondered, telling the A.I. not to
“embarrass” you? Ebert grinned sheepishly. He couldn’t prove it, but prompts
like that seem to have slightly improved Claude’s performance.
This could of course be another fine example of pranking the
Times, but I’ve seen enough
similar guidance from other developers to believe this advice is genuine. That
doesn’t make it real however. Despite many claims about their effectiveness,
these prompting instructions are often little more than “prompting
folklore” that seem to work, but we lack an
explanation of why they would work or if any particular prompts are more
effective (maybe asking please would help? Or making threats to a machine that
will never truly feel threatened?). In some cases, other helpful prompting
strategies like asking the model to pretend to be an expert might improve the
odds of the AI agreeing with you, but actually damage its
accuracy. My point here is that it is going
to be a while before we understand what prompts and usage patterns are effective
and even longer before we understand why they work. How long should we keep
sharing this superstition as if it were useful?
I am reminded of an anecdote in the Jargon File about a
“magic” switch at the MIT AI
lab. As the story begins, an
early programmer (or “hacker” in MIT
parlance) was roaming the
halls of the AI lab when they noticed a curious switch bolted on to a
PDP-10:
You don’t touch an unknown switch on a computer without knowing what it does,
because you might crash the computer. The switch was labeled in a most
unhelpful way. It had two positions, and scrawled in pencil on the metal
switch body were the words ‘magic’ and ‘more magic’. The switch was in the
‘more magic’ position.
I called another hacker over to look at it. He had never seen the switch
before either. Closer examination revealed that the switch had only one wire
running to it! The other end of the wire did disappear into the maze of wires
inside the computer, but it’s a basic fact of electricity that a switch can’t
do anything unless there are two wires connected to it. This switch had a wire
connected on one side and no wire on its other side.
It was clear that this switch was someone’s idea of a silly joke. Convinced by
our reasoning that the switch was inoperative, we flipped it. The computer
instantly crashed.
Imagine our utter astonishment. We wrote it off as coincidence, but
nevertheless restored the switch to the ‘more magic’ position before reviving
the computer.
The author related the story to someone else a year later who insisted on going
back to look at the switch. They flipped it and the machine crashed yet again.
Obviously, this switch wasn't actually controlling magic inside the computer,
but it was doing something, even if there was no satisfying explanation of what
that might be.
Not everything in the world needs an explanation, and many of us have certain
superstitions that let us pretend that we have some control over the arbitrary
randomness of the universe. A superstition like the gambler blowing on his dice
before shooting craps is harmless enough until the gambler starts to really
believe that his superstitions are real and that failing to follow them will
result in disastrous consequences. This phenomenon is called the illusion of
control, and it is common
for people to feel this when faced with completely random processes.
Large Language Models (LLMs) are not as simple as a game of craps, but they are
random at their core. We might instead picture an LLM as an inordinately
complicated pachinko
machine that
uses snippets of text instead of ricocheting ball bearings. Most LLMs include a
temperature parameter that injects a specific amount of randomness into how
the LLM selects its next generative response because this can jostle the model’s
completions onto a different pathway and make it seem more creative. This is
great for open-ended conversations, but not ideal for tools; I don’t imagine
you’d like to use a hammer that every so often would spray you in the face with
confetti. But, even when the temperature is set to 0, LLMs remain innately
nondeterministic
because of how floating point operations work. And, of course, companies can
also tweak the operating parameters for their models at any time, which could
have its own unexpected effects on model behavior for your applications (hey,
have you heard about the goblins?).
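A minimal sketch of what that temperature parameter does, using made-up next-token scores and the standard softmax-with-temperature sampling:

```python
import math
import random

def sample_token(logits, temperature=1.0):
    """Pick the index of the next token from raw scores ("logits").

    Higher temperature flattens the distribution (more surprising picks);
    temperature 0 collapses to deterministic greedy argmax decoding.
    """
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    r = random.random() * sum(weights)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r <= cumulative:
            return i
    return len(logits) - 1

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
print(sample_token(logits, temperature=0))    # always 0 (the top score)
print(sample_token(logits, temperature=2.0))  # usually 0, sometimes 1 or 2
```

The same prompt, the same scores, and yet any temperature above zero leaves the actual pick to a roll of the dice.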
As a programmer, this nondeterminism makes me queasy. Suppose that I give an LLM
a prompt, notice an obvious error, rephrase my prompt and get a second correct
response in turn. In a purely deterministic model, I can be certain that my
modification to the prompt was the thing that fixed the model’s response because
it was the only thing that changed. Software is generally deterministic because
it allows us to understand things as cause and effect. Flipping the switch away
from “more magic” always causes a crash. It also lets us repeat the
circumstances that cause bugs to occur (and know that our fixes work). For
instance, a certain LLM model might always miscount the letter ‘r’ in
strawberry because it uses a deterministic
model for tokenization.
Under the nondeterministic model, I can no longer be sure. Maybe my change to
the system's input did fix its output. But it's also possible that the odds of
the LLM producing an error I would notice twice in a row were simply low. Maybe
my prompt modification had only a subtle effect on the system's behavior, or maybe
even no effect at all. Was it real or a
Heisenbug? When the system deletes my
production DB and backups, is it because I have a skill issue or was I just
particularly unlucky? I’ll never know.
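The floating point claim is easy to demonstrate, because addition order changes results and parallel hardware doesn't promise a fixed order:

```python
# The same three numbers, summed in two different groupings:
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6

# GPU kernels sum enormous tensors in whatever order the scheduler
# happens to pick, so even a temperature-0 forward pass can drift
# slightly between runs -- and a slightly different probability can
# flip which token gets picked.
```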
Of course, we can build systems to handle unpredictability and bad luck. One
common approach is to use multiple computations and then check them against each
other. Indeed, one way to reduce the risks of individual LLM errors is to ask
the same question multiple times against multiple models or use judges to assess
output. The challenge is that this approach will both dramatically increase
costs as well as reduce the responsiveness of LLM models in specific
applications. It reminds me, in a way, of error correction code (ECC) memory
that protects against the relatively rare soft error bit flips caused by cosmic
rays. This technology has been around
since the 1980s but it is not everywhere, because those memory correction
techniques increase costs and decrease system performance. And so, it makes
sense for a computer system controlling a nuclear reactor, but may not be
necessary for a graphics processing
unit
that normally just powers video games. It goes without saying that the risk
of LLM-induced failures is vastly higher than that posed by an errant
cosmic ray, but many developers are still stumbling in the dark on the best ways
to handle error correction and when to use it. Maybe we can start by accurately
thinking about risks.
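The multiple-computations idea can be sketched as a simple majority vote; the flaky model below is a hypothetical stand-in for a real LLM call:

```python
import random
from collections import Counter

def majority_answer(ask, prompt, runs=5):
    """Ask the same question several times and keep the most common answer.

    `ask` stands in for any LLM call; a real deployment would also pay
    `runs` times the token cost of a single query, which is exactly the
    cost/accuracy trade-off described above.
    """
    answers = [ask(prompt) for _ in range(runs)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / runs  # answer plus a rough agreement score

def flaky_model(prompt):
    # Hypothetical stand-in model that answers wrongly 20% of the time.
    return "4" if random.random() < 0.8 else "5"

random.seed(42)
answer, agreement = majority_answer(flaky_model, "What is 2 + 2?", runs=7)
print(answer, agreement)
```

The vote smooths over individual bad runs, but every query now costs seven queries, which is the ECC-style trade-off in miniature.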
Clarifying Risks
As an experienced software engineer, you learn to anticipate that things can go
wrong. Instead of dissolving into a puddle of anxiety, a software development
team learns to manage the fear and distill uncertainties into a document that
outlines all the important risks. A specific
risk could be something technical (“an entire data center that we are using for
our cloud provider goes
offline”)
or more broadly existential (“we miss our launch deadline”), and it will often
include things that are not entirely in the team’s control. For every risk, the
team will determine if there is a suitable mitigation that allows the team to
recover from the risk (ideally, the best mitigation will prevent the risk
entirely) and if the team “owns” the risk mitigation and response. It’s also
highly useful to estimate the likelihood and impact of a given risk. Typically,
these are both estimated on a scale from 1 to 5 and multiplied to generate an
overall severity estimate – under this approach, low-impact but highly common
risk might turn out to have a higher severity than a more dangerous risk that is
extremely rare. Severity gives teams the ability to prioritize which risks to
address first and calibrate how much they should worry about specific problems.
To avoid being overwhelmed by a world filled with danger, teams will also
generally limit the number of risks they track at a given moment (usually to 20
or fewer) and will regularly review the list to add or remove items or
recalculate their severity.
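That likelihood-times-impact arithmetic is simple enough to sketch, with invented risks:

```python
def severity(likelihood, impact):
    """Both scored on a 1-5 scale; severity is their product (1-25)."""
    assert 1 <= likelihood <= 5 and 1 <= impact <= 5
    return likelihood * impact

# Hypothetical entries -- the names and scores are purely illustrative.
risks = [
    ("cloud region goes offline",       1, 5),
    ("we miss our launch deadline",     3, 4),
    ("flaky test suite hides bugs",     4, 2),
    ("unreviewed AI code reaches prod", 3, 5),
]

register = sorted(
    ((name, severity(likelihood, impact)) for name, likelihood, impact in risks),
    key=lambda entry: entry[1],
    reverse=True,
)[:20]  # track at most ~20 risks at a time, reviewed regularly

for name, sev in register:
    print(f"{sev:>2}  {name}")
```

Note how the common-but-mild risk (4 × 2 = 8) outranks the rare-but-severe one (1 × 5 = 5), which is exactly the prioritization effect described above.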
There is an art to this approach, of course, and its estimates are more of a
heuristic than a precise calculation of what dangers might occur. And that’s
fine. The main point of the exercise is the process where the entire team
discusses the risks and not just the risk register that they create. Everyone
contributes to the discussion of which risks should be tracked and how to score
them, and it benefits immensely from multiple perspectives that can look at the
technical stack with a critical eye. It is also informed by decades of
failures and fixes in how software is built and deployed to infrastructure.
I don’t know how to properly consider risks for GenAI usage. I think the
industry is still figuring out the failure patterns and mitigations for them. I
have observed that developers (and other team members) vary wildly in how they
assess the impact and severity of AI-related risks (boosters are excited and
skeptics are cautious), and I believe that this makes it harder for teams to
come to a realistic consensus of risks in AI usage in their products and
processes. I’ve also noticed that much of the existing discussion about GenAI
risks has been at a relatively high level (e.g., “should we use this for
purpose X or industry Y?”) and it’s harder to find more granular examples
exploring the risks and benefits of using AI for very specific purposes in your
very specific context (e.g., “should we use this LLM model to generate synthetic
data for our testing environment?”). Many AI users and companies are actively
learning how these tools can go wrong and what to do about it, but this is a
case where risk assessments should skew pessimistically while they figure it
out. Maybe we should make a regular practice of asking “what is the worst that
can happen?” and identifying scenarios where “it gets worse than that” if we
don’t feel confident we’re pessimistic enough.
Human in the Loop
Last year, Amazon’s retail operations suffered several outages that were
reportedly caused by the Kiro LLM tool. Amazon angrily retorted that the news
reports were wrong and that the outages were actually caused by solitary
developers making changes under the advice of
AI.
That’s not much more reassuring. As I’ve mentioned above, we’ve had decades of
mistakes that inform modern DevOps practices for maintaining software
architecture. This includes best practices like “don’t let a single engineer
make tweaks to production” or “code changes should be reviewed by at least one
other engineer before being merged” or even “people (and the AI models using
their credentials) should have only the minimum amount of access
privileges they
need for their job.” It’s common for companies to ritualistically invoke the
phrase “human in the loop” to answer concerns about LLM safety. But how
effective is that guarantee?
Much has already been written about how extensive GenAI usage can create not
just cognitive displacement but even “cognitive
surrender”,
where people defer unthinkingly to the judgment of the LLM. Some go willingly,
but many are being pushed into it by management pressures. It seems like no
coincidence that the same Amazon retail division that has suffered several
AI-related outages is also heavily monitoring GenAI usage among
staff and setting ambitiously high targets for
adoption:
The effort calls for more than 2,100 engineering teams in the retail arm to
triple software code release velocity using what Amazon calls “AI-native”
practices, while a smaller group of at least 25 teams is expected to boost
output tenfold this year. Progress against these goals is closely tracked by
Amazon’s senior leadership team, known as the S-Team, according to the
document.
This model of setting targets before considering their effectiveness is how you
repeatedly take down production and create problems that are hard to unwind.
This not only creates cognitive surrender but also a moral abdication, where
engineering teams abandon their sense of duty to a vague hope that the AI won’t
err too badly. These metrics are very specific about which tools teams should use
and how they should act, but they don’t consider how this might negatively
affect the product that the teams are working on (in this case, Amazon’s retail
operations and website). I will talk
more about the absurdity of these velocity metrics below, but I wanted to note
here how it is simply impossible for a “human in the loop” to act meaningfully
under these circumstances: direct action against GenAI usage that might stop
bugs will be penalized for making your team miss the target, but letting stuff
through that might take down production can be conveniently blamed on the AI (or
the single engineer we can scapegoat for using it). And so, the teams will hit
their targets, the executives will chalk this up as a success, and the website will just start
acting more erratically because that’s not as high a priority.
I think it’s essential for teams and organizations to firmly define what they
mean by a “human in the loop” and stick to that definition. Here’s a possible
place to start: any person on the team can hit the imaginary big red button that
stops the assembly line at any time for any reason (no shaming or penalties
allowed). Never replace a person with an AI model, especially if that person
contributes necessary friction (product managers, accessibility testers,
security compliance); if you must, instead work with that role on ways that LLMs
might supplement their activities. Explicitly ask at team retrospective meetings
if people felt pressured or rushed into approving AI-generated work. Audit all
AI usage to ensure there are no places where an AI model is taking both sides of
a working relationship for quality control. For instance, an AI model should
never be allowed to write both code and the tests for that code, or to write
both design documents and implementations. These restrictions should be enforced
even if we are using two different agents of the same Large Language Model or
two different AI products to evaluate the model. Define the line and hold firm
to it.
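Some of these restrictions can even be checked mechanically. This sketch assumes a hypothetical author-tagging scheme (say, commit trailers recording which agent touched a file) and flags cases where one AI agent wrote both a module and its tests:

```python
def dual_role_violations(changes):
    """Flag files where one AI agent authored both the code and its tests.

    `changes` maps file path -> author tag. The tagging scheme here is
    hypothetical; a real team would derive it from commit trailers or
    pull-request metadata.
    """
    flagged = []
    for path, author in changes.items():
        if path.startswith("tests/"):
            continue
        test_path = "tests/test_" + path.rsplit("/", 1)[-1]
        if author.startswith("ai:") and changes.get(test_path) == author:
            flagged.append((path, test_path))
    return flagged

changes = {
    "src/billing.py": "ai:agent-7",
    "tests/test_billing.py": "ai:agent-7",  # same agent on both sides
    "src/auth.py": "ai:agent-7",
    "tests/test_auth.py": "human:priya",    # a person owns the tests
}
print(dual_role_violations(changes))  # [('src/billing.py', 'tests/test_billing.py')]
```

A check like this could run in CI, turning "the AI must not grade its own homework" from a slogan into an enforced rule.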
And it may be that the line is not to use AI at all. Perhaps you work in an area
where the legal or ethical risks are too great. Perhaps you don’t want to help
support AI as it currently exists, where the choice is among several different
large organizations with terrible environmental practices, huge influence over
certain swaths of society and decidedly flexible approaches to morality. I feel
the same way. This isn’t politicizing AI, but it is recognizing that how you use
AI is an inherently political choice (as is basically everything else), and your
team should be making that choice willingly rather than feeling coerced.
The goal here is to be intentional about using LLMs where they might fit best
while reducing exposure to the risks of cognitive surrender and professional
laxity. For me, a good breakdown would be to use LLMs for situations with
accidental complexity (something that can be solved by better tooling) rather
than for problems with essential complexity (how to architect precise software
models to reflect the messy nuances of reality). I do think AI models show a lot of
promise for building tests and fuzz testing, synthetic data generation, static
analysis to identify bugs, code exploration and summarization. But, I also think
a person should always be checking their work.
The LLM Budget Bomb
I’ve already mentioned that I’m a
cheapskate, so admittedly I’m overly
sensitive to this, but it’s shocking to me how little discussion there is about
the inevitable substantial rate hikes that will hit LLM usage as soon as this
year.
Over the past few years, AI companies have been drastically discounting their
products in the hopes of increasing market-share and gaining advantages against
their rivals. At the same time, they have been dramatically expanding their
capital expenditures by building out new data centers. At some point, investors
are going to want to see returns, and some changes to AI pricing have already
started to show:
Everyone I spoke to had some version of this problem — their token usage has
gone up, so their usage-based billing cost has gone up, or the tier they were
on no longer has the same cap, and now they’re having to go to a more
expensive tier to try to keep the same amount of usage per month as part of
their flat rate.
Many personal AI users have been using tiered products that come with usage
quotas which obscure the true cost of their activities (unless you hit your
weekly limit). But as costs increase, even those users might find themselves
forced to move to higher tiers. For instance, Anthropic recently removed Claude
Code from its $20/month tier and made it exclusive to the $100/month tier or
higher. The
company reversed course after facing widespread outrage from developers suddenly
contemplating a 5x increase in their monthly AI bill, but I think a price
increase for this feature is inevitable. Similarly, GitHub Copilot has announced
it is moving all customers to a metered model based on token
usage,
and that model is how Enterprise users of all the major LLM platforms are
already paying for that usage. Under this model, it’s up to developers to manage
their AI costs by tracking their usage and setting budgets. The problem is that
those costs are essentially impossible to budget for.
Some of you might be wondering what these tokens actually are. Generative LLMs
work by taking inputs of text to generate more text, which is then fed back into
the LLM to generate still more
text until either it hits some
sort of stop condition or character limit. A Large Language Model roughly works
in a similar fashion to text autocomplete, but instead of suggesting the most
likely next letter, it suggests the next possible token that is the most
semantically likely continuation (with a bit of randomness thrown in). It does
that by calculating the probability of every possible completion based on the
data it was trained with and then usually (this is where temperature comes into
play) picking the most likely one. Whole words would be the best unit of text for making
these semantic associations, but there are just too many of them (especially
when you factor in spelling mistakes, etc.) to make the models feasible.
Instead, a given LLM reads its input as a sequence of multi-character tokens
rather than characters – for instance, “strawberry” might be read as “str”
“awb” “erry” by one model and “straw” “berry” by another. Tokens are an
engineering compromise that invisibly shapes how each model sees the world. And
since they are the basic unit of processing, they are also naturally the basic
unit for any metered billing (or usage quotas for different tiers in a flat
pricing model). Costs scale linearly with token usage, but this token usage is
hard to predict.
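A toy greedy tokenizer makes the point: the vocabulary determines how the same word is split, and therefore how many billable tokens it consumes. (Real models use learned subword vocabularies, e.g. byte-pair encoding, so this is illustration only.)

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenizer over a fixed vocabulary.

    Real models learn subword vocabularies from data; this toy version
    only shows how the vocabulary determines the split.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest vocabulary entry matching at position i,
        # falling back to a single character.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens

vocab_a = {"str", "awb", "erry"}
vocab_b = {"straw", "berry"}
print(tokenize("strawberry", vocab_a))  # ['str', 'awb', 'erry'] -- 3 billable tokens
print(tokenize("strawberry", vocab_b))  # ['straw', 'berry']     -- 2 billable tokens
```

Same word, same meaning, different bill -- and the user controls none of it.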
One problem is that it’s impossible to know in advance how many tokens even a
single LLM query will actually consume. It is possible to set hard limits so
that a session will not churn indefinitely, but that limit essentially acts like
a proctor for an exam telling students to put their pencils down; the LLM is not
able to strategically analyze and plan against its token budget. Some guides
suggest handling this through prompts that tell the LLM to be more terse in its
responses, but this is just another example of prompting folklore.
It is also worth remembering that LLMs produce plausible output (usually the
most likely responses). That is not the same as correct output. Unfortunately,
many LLM techniques to increase accuracy – chain of thought
prompting or using another
LLM as a judge – will
themselves increase token usage, pushing teams into the dilemma of balancing the
risks of errors against increasing costs. For instance, a team might find that
it has to add guardrails in the form of regular expressions and analysis by
another LLM model to check the output of an AI chat agent that has been acting a
little too helpful to hackers. They probably didn’t consider those costs at
first.
And we shouldn’t forget that LLM companies still control two different means to
shift more token costs onto users. First, companies will set a fixed rate on
their per-token cost. For instance, Anthropic’s Claude Opus 4.6 model charges $5
per million input tokens and $25 per million output tokens, while the older and
less powerful 3.5 Sonnet model is $3/$15 per million tokens respectively. These
rates are not regulated and could change at any time. In addition, token usage
can vary. Anthropic recently announced an upgrade to Opus 4.7 to better handle
certain agentic coding tasks. The company guaranteed it would have the same
per-token cost as Opus 4.6, but users have still seen dramatically higher usage
costs because the new model has a different tokenizer implementation, leading to
much higher token utilization for identical queries. Anthropic itself has
conceded that some tasks might require 35% more tokens, but developers have
seen increases of up to 46% in token
usage
and corresponding costs. There will always be a trade-off between accuracy and
costs; at some point, a 1% improvement in model performance might not seem worth
it if it leads to a 10% increase in costs. But, the LLM provider is the one making
that decision, while you are the one footing the bill. At some point, you might
need to decide if the benefits still outweigh the costs, but how do you figure
out the benefits anyway?
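The arithmetic itself is simple. Using the per-million-token rates quoted above and invented usage numbers, a model that consumes 35% more tokens produces a 35% larger bill at the “same” per-token price:

```python
def query_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Metered cost in dollars; rates are per million tokens."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Per-token rates quoted above ($5 in / $25 out per million tokens);
# the usage numbers are invented for illustration.
base = query_cost(200_000, 50_000, 5.00, 25.00)
print(f"base bill: ${base:.2f}")  # base bill: $2.25

# Same per-token price, but a tokenizer that needs 35% more tokens:
inflated = query_cost(270_000, 67_500, 5.00, 25.00)
print(f"new bill is {inflated / base:.0%} of the old one")  # 135%
```

The rate card never changed, yet the bill did -- which is exactly why per-token pricing alone tells you so little about what you will actually pay.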
Productivity Metrics Are a Scam
As companies have moved to mandate LLM usage for their teams, they have
increasingly relied on metrics to track how their employees are using the tool
and to brag about how fast their development velocity has become. In the example
of Amazon from above, their stated expectation is for teams to triple their
software code release output, which could mean tripling the number of production
deployments, the lines of code written every week, or the number of work tasks
completed. Google’s CEO has recently bragged that their AI development
tools have led to a 10% increase in developer
velocity. That sounds impressive, but how do they even measure it? Investigating this
question led me down a rabbit hole. First, I waded through a lot of useless slop
explaining the importance of measuring developer productivity without providing
any direction on how. Eventually, an astute reader tipped me off to a DX article
about measuring LLM
impact
that clarified they were indeed measuring improvement in developer-hours to make
this declaration. Great! Now, how do we measure developer-hours?
And this is where I ran into a wall. DX is a commercial product for measuring
Developer Experience metrics, and that documentation is behind a
paywall. So, it’s possible they are actually measuring
developer time. This likely doesn’t mean measuring the overall time developers
spend in the office (I doubt Google’s CEO would be bragging
about developers leaving work almost an hour early each day). Instead, it likely
is a complicated melange of automated surveillance observations
(corporations are a big market for spyware) blended at different weights to
derive a measure of overall developer time for each feature being
developed. As always, it’s important to ask what gets counted and what doesn’t.
Does typing time count even though it’s just people typing comments into Stack
Overflow? Do meetings not count, even if they’re important to coordinate what
the team is building? And then it’s important to ask how AI changing team
behavior might possibly change the output of these statistics. For example, it’s
pretty clear that developers in teams that are heavily using LLMs have to shift
more time into waiting for responses and reviewing pull requests; this
translates to much less typing and much more scrolling and staring at the
screen. It’s possible that the LLM is really reducing the overall time per
feature, but it’s also possible that LLM usage is just skewing how the metric is
calculated without any real improvement to developer productivity. I simply
don’t have enough info to say, but I’m innately suspicious about a metric that
is conveniently obscure.
Of course, most places don’t have the time or energy to invest in a platform to
collect Developer Experience metrics and instead they just wing it with what
they have. Returning to Amazon’s productivity goals, I’m not sure what metrics
they mean to measure for this. Obviously, just tripling the lines of code is not
necessarily a sign of quality, and it’s also unclear to me why tripling the
number of deployments would be better, since I assume Amazon has already been
following modern DevOps practices for frequent
deployments. Instead, I assume they are
just committing the cardinal sin of agile development: using estimation points
as a metric. For those of you who don’t know what I
mean, many teams doing agile development grapple with the problem of estimating
how long the work will take by using a points-based approach. Under this model,
we divide the work into discrete tasks that can be assigned to individual
developers and then estimate the work it will take to do such tasks. So, fixing
a simple typo on an admin screen might be 1 point, while a complex redesign of a
testing harness might be 8 points (many teams use a Fibonacci-based
scale
of 1, 2, 3, 5, 8, 13, but it’s not essential). Once we have pointed all the
tickets, we can use them to estimate how much work the team can fit into a
single two-week sprint (usually 10 working days): add up the points the team
completed over a few prior sprints, derive a points-per-developer-day rolling
average, and use that to set a target for an upcoming sprint with only 9
working days and one of the developers out on
vacation. It’s definitely still an approximation rather than an exact science,
but the goal here is for the team to be realistic about how much work it can
reasonably do in the next two weeks and defer remaining work for later sprints.
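The rolling-average arithmetic above can be sketched in a few lines of Python. All of the sprint numbers here are invented for illustration:

```python
def sprint_target(history, dev_days_next):
    """Derive a sprint target from a rolling points-per-developer-day average.

    history: list of (points_completed, developer_days) for past sprints.
    """
    total_points = sum(points for points, _ in history)
    total_days = sum(days for _, days in history)
    rate = total_points / total_days          # points per developer-day
    return rate * dev_days_next

# Three past sprints, each with 4 developers x 10 working days...
history = [(30, 40), (28, 40), (32, 40)]
# ...and an upcoming 9-day sprint with one developer on vacation.
target = sprint_target(history, dev_days_next=3 * 9)
print(f"plan for roughly {target:.0f} points")
```

At a rate of 0.75 points per developer-day, the shortened sprint gets a target of about 20 points, which is exactly the kind of rough-but-useful planning number points were designed to produce.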
The problem is that the organizational leadership doesn’t want to know “are we
able to somewhat reliably estimate work for the next sprint?” What they care
about is “can we publish a statistic to show how our corporate bet on AI is
paying off?”
Absent a crystal ball and intensive measures to track productivity, companies
will try to get a “good enough” answer out of the team metrics they do have, and
points seem like a natural choice. After all, if a given team now is doing 33
points in a sprint with LLM tools and it was doing 30 points six months ago,
that’s a 10% velocity gain! Except for all the ways in which it can go wrong. For
starters, we should check that the team isn’t achieving this gain by skipping
important safeguards like code reviews or passing tests. Furthermore, team
estimation practices can drift over time, as the team composition changes
or developers get better (or worse) at estimating the work. And of course, teams
might actively cheat, if there are incentives to be top performers or penalties
meted out to laggards – one way to get a 10% boost might be simply to re-point a
bunch of 1-point tickets as 2 points. Sure, it’ll make it harder to estimate the
work of future sprints in the short term, but eventually the rolling averages
will reflect the new normal. Simply put, points are a terrible way to track
team velocity and productivity. They’re so inconsistent, unreliable, easily
manipulated, and totally uncalibrated that I would hesitate to make broad
inferences from comparing totals a year apart. But they do let you do easy
math, and that’s good enough for people who just need to say they had a 23.7%
improvement in developer speed.
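The re-pointing trick is easy to demonstrate. In this invented example, the team does exactly the same ten tickets, but the 1-pointers have been quietly bumped to 2 points:

```python
# Ten tickets for the same actual work, before and after re-pointing.
before = [1, 1, 1, 1, 2, 3, 3, 5, 5, 8]   # totals 30 points
after  = [2, 2, 2, 2, 2, 3, 3, 5, 5, 8]   # totals 34 points

gain = sum(after) / sum(before) - 1
print(f"reported velocity gain: {gain:.0%}")  # "faster" with zero extra work
```

The reported velocity jumps by about 13% while the actual output is unchanged, which is precisely why comparing point totals across time tells you very little on its own.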
To be fair, this is not a new problem that is unique to LLM-assisted teams.
Most metrics are bad, but the ones that measure
team productivity or speed have always been especially deficient, as the example
above shows. They are so tempting, though! They are super easy to
compute and directly responsive to changes, letting you cite an instant
improvement or reduction as a quick win at the next all-hands meeting. For
instance, a company can tweak a setting and directly measure “40% more customers
are interacting with our AI agent” just by counting the API calls from its
front-end web interface. However, more customer-AI interactions don’t necessarily
mean better interactions or happier customers! And the metrics which reflect
the real quality measures – customer satisfaction, site reliability, revenue –
that dictate whether the business succeeds or fails are often lagging indicators
that take a while to manifest and are harder to diagnose. If a team is hitting
high velocity scores by not reviewing any LLM code before merging it, that will
eventually be reflected in a product quality metric like uptime showing that the
system is crashing more frequently. The truth will out, but sometimes it takes a
while; by the time the company realizes that the increase in AI chats is leading
to a 23% surge in subscription cancellations, the damage might already be locked
in.
The main points from this section? Most AI-related statistics are potentially
suspect, and you should be wary of citing any of them uncritically (even
the ones reporting damaging effects from LLM usage). It is easy to measure an
action, but much harder to measure the effects of an action or gauge its
quality.
What if the AI Goes Away?
There is a joke from the “Two Dozen and One Greyhounds” episode in the sixth
season of The Simpsons where technical difficulties force the local TV station
off the air and it airs this disclaimer:
Your cable TV is experiencing difficulties. Do not panic. Resist the urge to
read or talk to loved ones. Do not attempt sexual relations as years of TV
radiation exposure have left your genitals withered and useless.
Lying in bed with his wife, Chief Wiggum lifts up the sheets and looks before
uttering “Well, I’ll be damned.” I do sometimes wonder if a similar moment of
reckoning is coming for companies that have leaned heavily into LLM usage and
eliminated large numbers of their staff as redundant. Will they find themselves
trapped in a spending cycle, locked into a particular vendor? What happens if
that vendor goes bankrupt?
I don’t particularly think there will be a catastrophic moment where all AI
companies are destroyed, although I do think it’s likely one of the major
vendors will implode into bankruptcy, with profound effects on the economy and
the software industry. For many companies, their AI usage will not end with a
bang, but with a series of whimpers, when they need to make strategic decisions
about their budgets and usage of the technology. As risks go, this is a pretty
important one to prepare for.
If you are on a software team using AI in your processes or products, it might
be worth regularly asking yourself: what would it mean if the LLM went away?
Would this be a critical failure that would doom your operations or something
that you can fix? Is the LLM being used for production products that would break
instantly or just for development practices like automated testing which can be
revised? How much is this usage locked in to a particular provider or model? If
you are using the LLM to generate code, how much do other developers understand
that code? You wouldn’t abide a situation where everything broke because only
one developer understood the code and she’s on vacation this week; why accept
the same situation for AI-generated code?
What Next?
I wrote this essay because I wanted to work through some thoughts about how LLMs
might be incorporated into teams working on software, in the hope that such a
future is possible to achieve and a thing worth having. I hoped to see something
in this that would make me feel more comfortable about AI usage in the craft of
software development, as an alternative to the grim future that Silicon Valley
is intent on building now.
Perhaps a better AI future is possible. For instance, I’m particularly
intrigued by the idea that a smaller, slightly-less-accurate-but-still-good-enough
Medium Language Model (MLM) running on a server or even your laptop might be
good enough for many teams, and it’s certainly a lot cheaper! Maybe we will
enter an era of open-source and more ethical choices for AI tools. I’m not
convinced, but I sincerely hope I’m wrong on this. For all of our sakes.


