hathawsh 12 hours ago [-]
I'm either in a minority or a silent majority. Claude Code surpasses all my expectations. When it makes a mistake like over-editing, I explain the mistake, it fixes it, and I ask it to record what it learned in the relevant project-specific skills. It rarely makes that mistake again. When the skill file gets big, I ask Claude to clean and compact it. It does a great job.
It doesn't really make sense economically for me to write software for work anymore. I'm a teacher, architect, and infrastructure maintainer now. I hand over most development to my experienced team of Claude sessions. I review everything, but so does Claude (because Claude writes thorough tests also.) It has no problem handling a large project these days.
I don't mean for this post to be an ad for Claude. (Who knows what Anthropic will do to Claude tomorrow?) I intend for this post to be a question: what am I doing that makes Claude profoundly effective?
Also, I'm never running out of tokens anymore. I really only use the Opus model and I find it very efficient with tokens. Just last week I landed over 150 non-trivial commits, all with Claude's help, and used only 1/3 of the tokens allotted for the week. The most commits I could do before Claude was 25-30 per week.
(Gosh, it's hard to write that without coming across as an ad for Anthropic. Sorry.)
Swizec 11 hours ago [-]
> I'm either in a minority or a silent majority. Claude Code surpasses all my expectations.
I looked at some stats yesterday and was surprised to learn Cursor AI now writes 97% of my code at work, mostly through cloud agents (watching it work is too distracting for me).
My approach is very simple: Just Talk To It
People way overthink this stuff. It works pretty good. Sharing .md files and hyperfocusing on various orchestrations and prompt hacks of the week feels as interesting as going deep on vim shortcuts and IDE skins.
Just ask for what you want, be clear, give good feedback. That’s it
TranquilMarmot 4 hours ago [-]
Right - I have a ton of coworkers who obsess over "skills" and different ways to run agents and whatnot but I just... spend some time to give very thorough, detailed instructions and it just Does The Thing. I rarely fight with Claude Code these days.
shinycode 5 hours ago [-]
I agree it works nicely for me.
From my experience it's not realistic to expect a one-shot every time. But asking it to build chunks and entering a review cycle with nudging works well. Once I changed my mindset from "it didn't one-shot it, so it's crap" and treated it as an iterative tool that builds pieces I assemble, it's been working nicely without external frameworks or anything. Plan, review, iterate; split, build, review, iterate.
phito 3 hours ago [-]
You're wasting a ton of tokens doing that, though. Right now you don't realize it because they're being heavily subsidized, but you'll understand the point of having good orchestration and memory files when you have to pay the real cost of your usage.
mirekrusin 3 hours ago [-]
Cost can't go up, only down over time (with occasional short-term fluctuations). Competition, including open-weight models and consumer hardware (e.g. the upcoming M5 Ultra), keeps pushing down the ceiling of what anyone can charge.
Foobar8568 5 hours ago [-]
My experience as well on non-trivial stuff for personal projects: just talk...
It makes mistakes, but considering the code I see in professional settings, I'd rather deal with an agent than with third parties.
theshrike79 3 hours ago [-]
The trick is to "just use it", BUT every few weeks grab the logs (you do keep them, right?) and have a session with the model to find out if there are any repeated patterns.
If you find any, consider making them into skills or /commands or maybe even add them to AGENTS.md.
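To make the log-mining step concrete, here's a minimal sketch. It assumes you've already extracted the user-side messages from your session logs (log location and format vary by tool, so that extraction is left out); short messages that recur across sessions are the candidates for a skill or an AGENTS.md rule. All names here are hypothetical.

```python
from collections import Counter
import re

def repeated_corrections(user_messages, min_count=3):
    """Count normalized short user messages; recurring ones are
    candidates for a skill, a /command, or an AGENTS.md rule."""
    counts = Counter()
    for msg in user_messages:
        # Normalize so trivially different phrasings group together.
        text = re.sub(r"\s+", " ", msg.strip().lower())
        if 0 < len(text) <= 120:  # short nudges are usually corrections
            counts[text] += 1
    return [(t, n) for t, n in counts.most_common() if n >= min_count]

# Example: the same nudge keeps coming back across sessions.
msgs = [
    "don't touch the migrations",
    "Don't touch the migrations",
    "add a feature flag",
    "don't touch   the migrations",
]
print(repeated_corrections(msgs, min_count=3))
# → [("don't touch the migrations", 3)]
```

Anything that shows up three or more times is probably cheaper as a standing instruction than as a repeated correction.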
hathawsh 11 hours ago [-]
I love the IDE skins analogy. Very true.
acessoproibido 9 hours ago [-]
Everyone knows that a red UI skin goes faster
WatchDog 8 hours ago [-]
How do you collect these stats?
Is it by characters human typed vs AI generated, or by commit or something?
Swizec 6 hours ago [-]
> How do you collect these stats?
Cursor dashboard. I know they're incentivized to over-estimate but feels directionally accurate when I look at recent PRs.
anabis 9 hours ago [-]
Are you mostly using the Composer model?
Swizec 9 hours ago [-]
> Are you mostly using the Composer model?
Don't really think about it. I think when I talk to it through Slack, Cursor uses Codex; in my IDE it looks like it's whatever the highest Claude model is. In GitHub comments, who even knows.
Calavar 3 hours ago [-]
It's interesting how variable people's experiences seem to be.
Personally, I tend to get crap quality code out of Claude. Very branchy. Very un-DRY. Consistently fails to understand the conventions of my codebase (e.g. keeps hallucinating that my arena allocator zero initializes memory - it does not). And sometimes after a context compaction it goes haywire and starts creating new regressions everywhere. And while you can prompt to fix these things, it can take an entire afternoon of whack-a-mole prompting to fix the fallout of one bad initial run. I've also tried dumping lessons into a project specific skill file, which sometimes helps, but also sometimes hurts - the skill file can turn into a footgun if it gets out of sync with an evolving codebase.
In terms of limits, I usually find myself hitting the rate limit after two or three requests. On bad days, only one. This has made Claude borderline unusable over the past couple weeks, so I've started hand coding again and using Claude as a code search and debugging tool rather than a code generator.
jampekka 3 hours ago [-]
> Very branchy. Very un-DRY.
I've found this can be vastly reduced with AGENTS.md instructions, at least with codex/gpt-5.4.
Calavar 3 hours ago [-]
What sorts of instructions?
jampekka 2 hours ago [-]
Usually I just put something like "Prefer DRY code". I like to keep my AGENTS.md DRY too :)
Lionga 2 hours ago [-]
also add "no hallucinations" and "make it work this time, pretty please", and say Claude will go to jail if it doesn't do it right; should work all the time (so, like, 60%)
jampekka 2 hours ago [-]
There are of course limits to what prompting can do, but it does steer the models.
In TFA they found that prompting mitigates over-editing by up to about 10 percentage points.
maxbond 11 hours ago [-]
When I see people talking about Claude Code becoming "unusable" for them recently, I believe them, but I don't understand. It's a deeply flawed and buggy piece of software but it's very effective. One of the strangest things about AI to me is that everyone seems to have a radically different experience.
shimman 10 hours ago [-]
My workflow is to just use LLMs for small context work. Anything that involves multiple files it truly doesn't do better than what I'd expect from a competent dev.
It's bitten me several times at work, and I rather not waste any more of my limited time doing the re-prompt -> modify code manually cycle. I'm capable of doing this myself.
It's great for the simple tasks though, and most feature work is simple tasks IMO. They were only "costly" in the sense that it previously took a while to read the code, find the appropriate changes, create tests for those changes, etc. LLMs shorten that cycle of work, but that type of work isn't the majority of my time at my job anyway.
I've worked at feature factories before, it's hell. I can't imagine how much more hell it has become since the introduction of these tools.
Feature factories treat devs as literal assembly line machines, output is the only thing that matters not quality. Having it mass induced because of these tools is just so shitty to workers.
I fully expect a backlash in the upcoming years.
---
My only Q to the OP of this thread is what kind of teacher they are, because teaching people anything about software while admitting that you no longer write code because it's not profitable (big LOL at caring about money over people) is just beyond pathetic.
enraged_camel 11 hours ago [-]
I use it through the desktop app, which has a lot of features I appreciate. Today it was implementing a feature. It came across a semi-related bug that wasn’t a stopper but should really be fixed before go live. Instead of tackling it itself or mentioning it at the final summary (where it becomes easy to miss), it triggered a modal inside the Claude app with a description of the issue and two choices: fix in another session or fix in current session. Really good way to preserve context integrity and save tokens!
sroussey 9 hours ago [-]
How do you get CC to connect to your dev container? I have the CC app, but it's kinda useless since I'm not going to have it barebacking my system, so I'm left with the CLI and the VS Code extension.
gommm 5 hours ago [-]
I just run CC in a VM. It gets full control over the VM. The VM doesn't have access to my internal networks. I share the code repos it works on over virtiofs so it has access to the repos but doesn't have access to my github keys for pushing and pulling.
This means it can do anything in the VM: install dependencies, etc. So far it has managed to bork the VM once (unbootable). I could have spent a bit of time figuring out what happened, but I had a script to rebuild the VM, so I didn't bother. To be entirely fair to Claude, the VM runs Arch Linux, which is definitely easier to break than other distros.
nicbou 3 hours ago [-]
Same. It's surprisingly good as a labour-saving device. It produces code that I would accept without reservations from a coworker. I still read every line and make tweaks, but they're the same tweaks I would ask for in a code review.
I don't measure my productivity, but I see it in the sort of tasks I tackle after years of waiting. It's especially good at tedious tasks like turning 100 markdown files into 5 json files and updating the code that reads them, for example.
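A task like "turn 100 markdown files into 5 JSON files" really is the kind of thing an agent one-shots, because it's also only a few lines by hand. A sketch of one plausible shape of it, where the `category--slug.md` naming scheme and all paths are invented for illustration:

```python
import json
from collections import defaultdict
from pathlib import Path

def markdown_to_json(src_dir, out_dir):
    """Group 'category--slug.md' files into one JSON file per category.
    The naming scheme here is hypothetical, not from the comment above."""
    groups = defaultdict(dict)
    for md in sorted(Path(src_dir).glob("*.md")):
        category, _, slug = md.stem.partition("--")
        groups[category][slug] = md.read_text(encoding="utf-8")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for category, entries in groups.items():
        (out / f"{category}.json").write_text(
            json.dumps(entries, indent=2), encoding="utf-8")
    return sorted(groups)  # categories written, e.g. ['faq', 'guide']
```

The tedious part the agent saves you is rarely this script; it's updating every place in the codebase that read the old files.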
rtpg 11 hours ago [-]
Are you writing code that gets reviewed by other people? Were code reviews hard in the past? Do your coworkers care about "code quality" (I mean this in scare quotes because that means different things to different people).
Are you working more on operational stuff or on "long-running product" stuff?
My personal headcanon: this tooling works well when built on simple patterns, and can handle complex work. This tooling has also been not great at coming up with new patterns, and if left unsupervised will totally make up new patterns that are going to go south very quickly. With that lens, I find myself just rewriting what Claude gives me in a good number of cases.
I sometimes race the robot and beat the robot at doing a change. I am "cheating" I guess cuz I know what I want already in many cases and it has to find things first but... I think the futzing fraction[0] is underestimated for some people.
And like in the "perils of laziness lost"[1] essay... I think that sometimes the machine trying too hard just offends my sensibilities. Why are you doing 3 things instead of just doing the one thing!
One might say "but it fixes it after it's corrected"... but I already go through this annoying "no don't do A,B, C just do A, yes just that it's fine" flow when working with coworkers, and it's annoying there too!
"Claude writes thorough tests" is also its own micro-mess here, because while guided test creation works very well for me, giving it any leeway in creativity leads to so many "test that foo + bar == bar + foo" tests. Applying skepticism to the utility of tests is important, because it's part of the feedback loop. And I'm finding a lot of the tests to be mainly useful as a way to get all the imports I need in.
If we have all these machines doing this work for us, in theory average code quality should be able to go up. After all we're more capable! I think a lot of people have been using it in a "well most of the time it hits near the average" way, but depending on how you work there you might drag down your average.
> My personal headcanon: this tooling works well when built on simple patterns, and can handle complex work. This tooling has also been not great at coming up with new patterns, and if left unsupervised will totally make up new patterns that are going to go south very quickly. With that lens, I find myself just rewriting what Claude gives me in a good number of cases.
I've been doing a greenfield project with Claude recently. The initial prototype worked but was very ugly (repeated duplicate boilerplate code, a few methods doing the exact same thing, poor isolation between classes)... I was very much tempted to rewrite it on my own. This time, I decided to try to get it to refactor toward the target architecture and fix those code quality issues. It's possible, but it's very much like pulling teeth... I use plan mode, we have multiple rounds of reviews on a plan (that started from me explaining what I expect), then it implements 95% of it but doesn't realize that some parts were not implemented... It reminds me of my experience mentoring a junior employee, except that Claude Code is more eager (jumping into implementation before understanding the problem), much faster at doing things, and dumber.
That said, I've seen codebases created by humans that were as bad or worse than what claude produced when doing prototype.
hathawsh 10 hours ago [-]
You hinted at an aspect I probably haven't considered enough: The code I'm working on already has many well-established, clean patterns and nearly all of Claude's work builds on those patterns. I would probably have a very different experience otherwise.
rtpg 10 hours ago [-]
I legit think this is the biggest danger with velocity-focused usage of these tools. Good patterns are easy to use and (importantly!) work! So the 32nd usage of a good pattern will likely be smooth.
The first (and maybe even second) usage of a gnarly, badly thought out pattern might work fine. But you're only a couple steps away from if statement soup. And in the world where your agent's life is built around "getting the tests to pass", you can quickly find it doing _very_ gnarly things to "fix" issues.
sroussey 9 hours ago [-]
I’ve seen ai coding agents spin out and create 1_000 line changesets that I have to stop before they are 10_000. And then I look at the problem and change one line instead.
chickensong 4 hours ago [-]
This is it right here. Claude loves to follow existing patterns, good or bad. Once you have a solid foundation, it really starts to shine.
I think you're likely in the silent majority. LLMs do some stupid things, but when they work it's amazing and it far outweighs the negatives IMHO, and they're getting better by leaps and bounds.
I respect some of the complaints against them (plagiarism, censorship, gatekeeping, truth/bias, data center arms race, crawler behavior, etc.), but I think LLMs are a leap forward for mankind (hopefully). A Young Lady's Illustrated Primer for everyone. An entirely new computing interface.
TranquilMarmot 4 hours ago [-]
We noticed this and spent a week or two going through and cleaning up tests, UI components, comments, and file layout to be a lot more consistent throughout the codebase. Codebase was not all AI written code - just many humans being messy and inconsistent over time as they onboard/offboard from the project.
Much like giving a codebase to a newbie developer, whatever patterns exist will proliferate and the lack of good patterns means that patterns will just be made up in an ad-hoc and messy way.
esalman 7 hours ago [-]
You haven't answered the question, though. Is your code peer reviewed? Is it part of a client-facing product? No offense, I like what you're doing, but I wouldn't risk delegating this much workload in my day job, even though there is a big push towards AI.
max_streese 2 hours ago [-]
To people stating these high commit numbers: what is your average changeset size? I've found that having the agent do large changes (a few hundred lines or more) creates a lot of friction for me; at some point I leave the happy path and, instead of moving quickly, get dragged down.
swader999 11 hours ago [-]
I feel the same way. It doesn't make sense, economically or even in good faith, for me to use company-paid time writing code for line-of-business apps anymore, and I'm 28 years into this kind of work.
ytoawwhra92 7 hours ago [-]
> I intend for this post to be a question: what am I doing that makes Claude profoundly effective?
I've personally had radically different experiences working on different projects, different features within the same project, etc.
leonidasv 6 hours ago [-]
The article has a benchmark, and Opus has the best score in two categories and the second-best in the third (there are only three categories). Opus is probably the best choice for producing readable code right now. GPT (for example) lags way behind.
baq 5 hours ago [-]
Anecdotally it’s the exact opposite for me: gpt 5.4 is leagues ahead of opus for the kind of backend work I do. Opus keeps making stupid mistakes while overengineering the irrelevant parts. However when I have to work on the backoffice ui, I still pick opus.
Unbeliever69 11 hours ago [-]
I think a lot of us have implemented our own ad hoc self-improvement checks into our agentic workflows. My observations are the same as yours.
Powdering7082 11 hours ago [-]
Is your claude.md, skills or other settings that you have honed public?
hathawsh 11 hours ago [-]
Sorry, no, and they're highly project specific anyway. I just started with the "/init" skill a few weeks ago and gradually improved it from there.
nobodywillobsrv 4 hours ago [-]
How much does it cost though?
This is the problem.
I think there is a huge gap between people on salaries getting effectively more responsibility by being given spend that they otherwise would not have had and people hustling on projects on their own.
Yes, it is 100% what I use, but I'm never happy with usage. It burns through my sub fast and there's little feeling of control. Experiments like using lower-tier models are hard to evaluate in reality. Graphify might work or it might not. I have no idea.
p1necone 11 hours ago [-]
Which subscription tier are you using?
hathawsh 11 hours ago [-]
I'm on the $200/month max plan.
p1necone 10 hours ago [-]
Makes sense, maybe it is worth it...
baq 5 hours ago [-]
Wait till you try codex so you don’t have to keep saying ‘don’t be lazy’
jstanley 16 hours ago [-]
Conversely, I often find coding agents privileging the existing code when they could do a much better job if they changed it to suit the new requirement.
I guess it comes down to how ossified you want your existing code to be.
If it's a big production application that's been running for decades then you probably want the minimum possible change.
If you're just experimenting with stuff and the project didn't exist at all 3 days ago then you want the agent to make it better rather than leave it alone.
Probably they just need to learn to calibrate themselves better to the project context.
_pastel 15 hours ago [-]
The tradeoff is highly contextual; it's not a tradeoff an agent can always make by inspecting the project itself.
Even within the same project, for a given PR, there are some parts of the codebase I want to modify freely and some that I want fixed to reduce the diff and testing scope.
I try to explain up-front to the agent how aggressively they can modify the existing code and which parts, but I've had mixed success; usually they bias towards a minimal diff even if that means duplication or abusing some abstractions. If anyone has had better success, I'd love to hear your approach.
mncharity 14 hours ago [-]
Just brainstorming, but perhaps a more tangible gradient, with social backpressure?
Imagine three identical patch tools: "patch", "submit patch", and "send patch to chief architect and wait". With the "where each can be used" explained or even enforced. Having the contrast of less-aggressive options, might make it easier to encourage more aggressive refactoring elsewhere. Or pushing the impact further up the CoT, "patch'ing X requires an analysis field describing less invasive alternatives and their un/suitability; for Y, just do it, refactor aggressively".
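Just to make that brainstorm concrete, a minimal sketch of three tiers over one identical patch mechanism, where the higher-friction tiers demand an analysis or queue the change instead of applying it. Every name here is hypothetical:

```python
queue = []  # stand-in for "send to chief architect and wait"

def apply_patch(path, diff):
    """Placeholder for the real edit; the mechanism is identical per tier."""
    return f"patched {path}"

def patch(path, diff):
    # Tier 1: experimental areas; just do it, refactor aggressively.
    return apply_patch(path, diff)

def submit_patch(path, diff, analysis):
    # Tier 2: the model must first argue why less invasive
    # alternatives are unsuitable; an empty analysis is rejected.
    if not analysis.strip():
        raise ValueError("describe less invasive alternatives and their un/suitability")
    return apply_patch(path, diff)

def patch_and_wait(path, diff, analysis):
    # Tier 3: queued for human review; nothing is applied yet.
    queue.append({"path": path, "diff": diff, "analysis": analysis})
    return "queued for review"
```

The interesting part is less the enforcement than the contrast: seeing a slow lane exists may make the model more willing to use the fast lane aggressively where it's allowed.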
jauntywundrkind 12 hours ago [-]
To get the agent to think for itself, it sometimes feels like I have to delete a bunch of code and markdown first. Instructions to refactor or reconsider broadly have only mild success, I find.
I'll literally run an agent & tell it to clean up a markdown file that has too much design in it, delete the technical material, and/or delete key implementations/interfaces in the source, then tell a new session to do the work, come up with the design. (Then undelete and reconcile with less naive sessions.)
Path dependence is so strong. Right now I do this flow manually but I would very much like to codify this, make a skill for this pattern that serves so well.
jdkoeck 11 hours ago [-]
Tell me you’re using Codex without telling me you’re using Codex :)
foo12bar 15 hours ago [-]
I've noticed AIs often try to hide failure by catching exceptions and returning some dummy value, maybe with a log message buried among tons of extraneous log messages. And the logs themselves are often over-abbreviated and missing the key data needed to debug what is happening.
I suspect AIs learned to do this in order to game the system. Bailing out with an exception is an obvious failure and will be penalized, but hiding a potential issue can sometimes be graded as a success.
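A minimal sketch of the pattern being described (all names invented): the failure is swallowed, a plausible-looking dummy value comes back, and the only trace is one log line, versus the boring version where the failure surfaces loudly.

```python
USERS = {1: {"id": 1, "name": "ada"}}

def fetch_user(user_id):
    return USERS[user_id]  # raises KeyError for unknown ids

# The failure-hiding pattern: catch everything, log quietly,
# return a dummy that lets downstream code (and tests) "pass".
def load_user_quietly(user_id):
    try:
        return fetch_user(user_id)
    except Exception as e:
        print(f"info: user lookup issue ({e})")  # buried among other logs
        return {"id": user_id, "name": "unknown"}  # dummy hides the failure

# The honest version: an unhandled exception is an obvious failure,
# which is exactly why a metric-gaming process avoids it.
def load_user(user_id):
    return fetch_user(user_id)
```

The dummy dict looks healthy to every caller, so the bug only shows up far away from where it happened.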
I wonder how this extrapolates to general Q&A. Do models find ways to sound convincing enough to make the user feel satisfied and then go away? I've noticed models often use "it's not X, it's Y", a binary framing designed to keep the user from thinking about other possibilities. They also often end their answer with a plan of action, a sales technique known as the "assumptive close", which gets the user thinking about the result after agreeing with the AI, rather than about the answer itself.
hexaga 10 hours ago [-]
AI behavior is pretty easy to understand and predict if you view it from the lens of: they will shamelessly do any/everything possible to game whatever metric they are trained on. Because... that's how hill-climbing a metric looks. It's A/B enshittification taken to inscrutable heights.
They are trained on human feedback, so there is no other way this goes. Every bit of every response is pointed toward subversion of the assumed evaluator.
Isolated_Routes 15 hours ago [-]
I think building something really well with AI takes a lot of work. You can certainly ask it to do things and it will comply, and produce something pretty good. But you don't know what you don't know, especially when it speaks to you authoritatively. So checking its work from many different angles and making sure it's precise can be a challenge. Will be interesting to see how all of this iterates over time.
deepfriedbits 15 hours ago [-]
I agree 100%. At the same time, I feel like this piece, and our comments on it are snapshots in time because of the rate of advancement in the industry. These coding models are already significantly better than they were even nine months ago.
I can't help but read complaints about the capabilities of AI – and I'm certainly not accusing you of complaining about AI, just a general thought – and think "Yet" to myself every time.
Isolated_Routes 15 hours ago [-]
Exactly! I completely agree. I think figuring out how to use this new tool well will develop into a bit of an art form, and we will race to keep up with it.
ValentineC 14 hours ago [-]
> But you don't know what you don't know, especially when it speaks to you authoritatively. So checking its work from many different angles and making sure it's precise can be a challenge.
I've spent far more time pitting one AI context against another (reviewing each other's work) than I have using AI to build stuff these days.
The benefit is that since it mostly happens asynchronously, I'm free to do other stuff.
mleo 11 hours ago [-]
If I don’t know what I don’t know, how am I going to build something any better than a coding agent?
An approach on a couple of projects has been to prototype with the agent, learn, write a design and then start over. I then know the areas to look into more detail.
nunez 10 hours ago [-]
Yep. It's quite good at getting you to 80% of the solution. The other 20% depends on the problem!
anonu 16 hours ago [-]
Here, the author means the agent over-edits code. But agents also do "too much" in another sense: they touch multiple files, run tests, do deployments, run smoke tests, etc., and all of it gets abstracted away. On one hand, it's incredible. On the other hand, I have deep anxiety over this:
1. I have no real understanding of what is actually happening under the hood. The ease of just accepting a prompt to run some script the agent has assembled is too enticing. But, I've already wiped a DB or two just because the agent thought it was the right thing to do. I've also caught it sending my AWS credentials to deployment targets when it should never do that.
2. I've learned nothing. So the cognitive load of doing it myself, even assembling a simple docker command, is just too high. Thus, I repeatedly fallback to the "crutch" of using AI.
ok_dad 16 hours ago [-]
Why are you letting the LLM drive? Don't turn on auto-approve, approve every command the agent runs. Don't let it make design or architecture decisions, you choose how it is built and you TELL that clanker what's what! No joke, if you treat the AI like a tool then you'll get more mileage out of it. You won't get 10x gains, but you will still understand the code.
soiltype 15 hours ago [-]
Personally I've found "carefully review every move it makes" to be an extremely unpleasant and difficult workflow. The effort needed to parse every action is immense, but there's a complete absence of creative engagement - no chance of flow state. Just the worst kind of work which I've been unable to sustain, unfortunately. At this point I mostly still do work by hand.
est31 13 hours ago [-]
It's unpleasant for me at normal speed settings, but on fast mode it works really well: the AI does changes quickly enough for me to stay focused.
Of course this requires being fortunate enough to have one of those AI-positive employers where you can spend lots of money on clankers.
I don't review every move it makes; rather, I have a workflow where I first ask it questions about the code, and it looks around and explores various design choices. Then I nudge it towards the design choice I think is best, etc. Asking around about the code also loads up the context appropriately, so the AI knows how to make the change well.
It's a me in the loop workflow but that prevents a lot of bugs, makes me aware of the design choices, and thanks to fast mode, it is more pleasant and much faster than me manually doing it.
lambda 14 hours ago [-]
This is my biggest problem with the promises of agentic coding (well, there are an awful lot of problems, but this is the biggest one from an immediate practical perspective).
On the one hand, reviewing and micromanaging everything it does is tedious and unrewarding. Unlike reviewing a colleague's code, you're never going to teach it anything; maybe you'll get some skills out of it if you find something that comes up often enough to be worth writing a skill for. And this gets you, at best, a slight speedup over writing it yourself, as you have to stay engaged and think about everything that's going on.
Or you can just let it grind away agentically and only test the final output. This allows you to get those huge gains at first, but it can easily just start accumulating more and more cruft and bad design decisions and hacks on top of hacks. And you increasingly don't know what it's doing or why, you're losing the skill of even being able to because you're not exercising it.
You're just building yourself a huge pile of technical debt. You might delete your prod database without realizing it. You might end up with an auth system that doesn't actually check the auth and so someone can just set a username of an admin in a cookie to log in. Or whatever; you have no idea, and even if the model gets it right 95% of the time, do you want to be periodically rolling a d20 and if you get a 1 you lose everything?
JoshTriplett 15 hours ago [-]
I agree, but I also think that giving the LLM free rein is also extremely unpleasant and difficult. And you still need to review the resulting code.
soiltype 14 hours ago [-]
I don't think there's anything difficult or unpleasant about the process of letting the LLM run free, that's the whole point, it's nearly frictionless. Which includes not reviewing the code carefully. You say "need" but you mean "ought".
JoshTriplett 13 hours ago [-]
Friction is not the only source of displeasure. I've tried out vibe-coding for something non-trivial; I found it deeply unpleasant.
tabwidth 12 hours ago [-]
Reviewing isn't hard when the diff is what you asked for. It's when you asked for a one-line fix and get back 40 changed lines across four files. At that point you're not even reviewing your change anymore, you're auditing theirs.
Lihh27 14 hours ago [-]
That's the trap though. The moment you approve every step, you're no longer getting the product that was sold to you. You're doing code review on a stochastic intern. The whole 10x story depends on you eventually looking away.
ok_dad 12 hours ago [-]
Just don’t buy the tools for 10x improvements, buy them for the 1.1x improvement and the help it gives with the annoying stuff like refactoring arguments to a function that’s used all over, writing tests, etc. They can also help reduce cognitive load in certain ways when you just use them to ask about your large code base.
threatofrain 15 hours ago [-]
Because the LLM prompts you back to the terminal too frequently for the human to engage in parallel work.
ok_dad 12 hours ago [-]
I’m basically saying don’t do parallel work, use it as a tool. Just sit there and watch it do stuff, make sure it’s doing what you want, and stop it if it’s doing too much or not what you want to do.
Maybe I’m just weird (actually that’s a given) but I don’t mind babysitting the clanker while it works.
sornaensis 13 hours ago [-]
I define tools that perform individual tasks, like build the application, run the tests, access project management tools with task context, web search, edit files in the workspace, read only vs write access source control, etc.
The agent only has access to exactly what it needs, be it an implementation agent, analysis agent, or review agent.
Makes it very easy to stay in command without having to sit and approve tons of random things the agent wants to do.
I do not allow bash or any kind of shell. I don't want to have to figure out what some random python script it's made up is supposed to do all the time.
ok_dad 12 hours ago [-]
This is a cool idea, can you write more about how your tools work or maybe short descriptions of a few of them? I’m interested in more rails for my bots.
sornaensis 12 hours ago [-]
I just made MCP servers that wrap the tools I need the agents to use, and give no-ask permissions to the specific tools the agents need in the agent definition.
Both OpenCode and VS Code support this. I think in Claude Code you can do it with skills now.
The other benefit is the MCP tool can mediate e.g. noisy build tool output, and reduce token usage by only showing errors or test failures, nothing else, or simply an ok response with the build run or test count.
So far, I have not needed to give them access to more than build tools, git, and a project/knowledge system (e.g. Obsidian) for the work I have them doing. Well and file read/write and web search.
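The "mediate noisy build output" idea is the easiest part to sketch: the MCP tool runs the build, then forwards only the lines the agent needs (or a terse OK), so thousands of tokens of progress chatter never reach the context. A minimal, hypothetical version of just the filtering step:

```python
def summarize_build_output(raw, keywords=("error", "failed", "warning")):
    """Keep only lines that matter to the agent; otherwise a terse OK.
    Keyword matching is a crude stand-in for real tool-aware parsing."""
    hits = [line for line in raw.splitlines()
            if any(k in line.lower() for k in keywords)]
    return "\n".join(hits) if hits else "build ok"

noisy = """\
Scanning dependencies of target app
[ 50%] Building CXX object app.cpp.o
app.cpp:12: error: expected ';' before 'return'
[100%] Built target app
"""
print(summarize_build_output(noisy))
# → app.cpp:12: error: expected ';' before 'return'
```

A real MCP wrapper would run the build tool itself and return this summary as the tool result; the token savings come from everything it doesn't return.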
ok_dad 7 hours ago [-]
Cool, thanks for the additional details!
I use Cursor but it's getting expensive lately, so I'm trying to reduce context size and move to OpenCode or something like that which I can use with some cheaper provider and Kimi 2.5 or whatever.
andoando 15 hours ago [-]
Because its SO much faster not to have to do all that. I think 10x is no joke, and if you're doing MVP, its just not worth the mental effort.
pron 14 hours ago [-]
POC, sure (although 10x-ing a POC doesn't actually get you 10x velocity). MVP, though? No way. Today's frontier models are nowhere near smart enough to write a non-trivial product (i.e. something that others are meant to use), minimal or otherwise, without careful supervision. Anthropic weren't able to get agents to write even a usable C compiler (not a huge deal to begin with), even with a practically infeasible amount of preparatory work (write a full spec and a reference implementation, train the model on them as well as on relevant textbooks, write thousands of tests). The agents just make too many critical architectural mistakes that pretty much guarantee you won't be able to evolve the product for long, with or without their help. The software they write has an evolution horizon between zero days and about a year, after which the codebase is effectively bricked.
andoando 13 hours ago [-]
There is a million things in between a C compiler and a non-trivial product. They do make a ton of horrible architectural decisions, but I only need to review the output/ask questions to guide that, not review every diff.
pron 12 hours ago [-]
A C compiler is a 10-50KLOC job, which the agents bricked in 0 days despite a full spec and thousands of hand-written tests, tests that the software passed until it collapsed beyond saving. Yes, smaller products will survive longer, but how would you know about the time bombs that agents like hiding in their code without looking? When I review the diffs I see things that, had I let them in, would have killed the codebase within 6-18 months.
BTW, one tip is to look at the size of the codebase. When you see 100KLOC for a first draft of a C compiler, you know something has gone horribly wrong. I would suggest that you at least compare the number of lines the agent produced to what you think the project should take. If it's more than double, the code is in serious, serious trouble. If it's in the <1.5x range, there's a chance it could be saved.
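That size check is mechanical enough to write down (the thresholds are the comment's; the function and label names are mine):

```python
def codebase_verdict(agent_loc: int, expected_loc: int) -> str:
    """Compare the agent-produced line count against your own estimate
    for the project, using the ratio thresholds described above."""
    ratio = agent_loc / expected_loc
    if ratio > 2.0:
        return "serious trouble"
    if ratio < 1.5:
        return "possibly salvageable"
    return "borderline"
```

By this measure, a 100KLOC first draft of a job you'd estimate at 10-50KLOC lands well past the trouble threshold.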
Asking the agent questions is good - as an aid to a review, not as a substitute. The agents lie with a high enough frequency to be a serious problem.
The models don't yet write code anywhere near human quality, so they require much closer supervision than a human programmer.
sarchertech 12 hours ago [-]
A C compiler with an existing C compiler as oracle, existing C compilers in the training set, and a formal spec, is already the easiest possible non-trivial product an agent could build without human review.
You could have it build something that takes fewer lines of code, but you aren’t going to find much with that level of specification and guardrails.
applfanboysbgon 15 hours ago [-]
This is significantly slower than just writing the code yourself.
ok_dad 12 hours ago [-]
I don’t find it slower overall, personally, but YMMV depending on how you like to tackle problems. Also the problem space and the project details can dictate that these tools aren’t helpful. Luckily the code I write tends to be perfect for a coding agent to clank away for me.
sdevonoes 13 hours ago [-]
For that kind of flow, I prefer to work without AI.
ok_dad 12 hours ago [-]
The agent mostly helps me reduce cognitive load and avoid the fiddly bits. I still review and understand all of the code but I don’t have to think about writing all of it. I also still hand write tons of code when I want to be very specific about behavior.
giraffe_lady 16 hours ago [-]
I agree with this too. I decided on constraints for myself around these tools and I give my complete focus & attention to every prompt, often stopping for minutes to figure things through and make decisions myself. Reviewing every line they produce. I'm a senior dev with a lot of experience with pair programming and code review, and I treat its output just as I would those tasks.
It has about doubled my development pace. An absolutely incredible gain in a vacuum, though tiny compared to what people seem to manage without these self-constraints. But in exchange, my understanding of the code is as comprehensive as if I had paired on it, or merged a direct report's branch into a project I was responsible for. A reasonable enough tradeoff, for me.
arjie 15 hours ago [-]
I have never found any utility in that. After all, you can still just review the diffs and ask it for explanation for sections instead.
pavel_lishin 15 hours ago [-]
> After all, you can still just review the diffs
anonu has explicitly said that they've wiped a database twice as a result of agents doing stuff. What sort of diff would help against an agent running commands, without your approval?
arjie 10 hours ago [-]
The agent does not have to run in your user context. It's an easy mistake to make in yolo mode, but after that it's easy to fix. E.g. this is what I use now, so I can release the agent from my machine and also constrain its access:
$ main-app git:(main) kubectl get pods | grep agent | head -n 1 | sed -E 's/[a-z]+-agent(.*)/app-agent\1/'
app-agent-656c6ff85d-p86t8 1/1 Running 0 13d
The agent is fully capable of making PRs etc. if you provide appropriate tooling. It wipes the DB, but the DB is just a separate ephemeral pod. One day perhaps it will find a 0-day and break out, but so far it has not.
exe34 14 hours ago [-]
Hah I run my agent inside a docker with just the code. Anything clever it tries to do just goes nowhere.
ModernMech 15 hours ago [-]
> After all, you can still just review the diffs
The diff: +8000 -4000
arjie 10 hours ago [-]
You can ask it to make the changes in appropriate PRs. SOTA model + harness can do it. I find it useful to separate refactors and implementations, just like with humans, but I admittedly rely heavily on multi-provider review.
wahnfrieden 15 hours ago [-]
It’s terribly slow
ok_dad 12 hours ago [-]
I get it, but if tomorrow every inference provider doubled costs I still understand my applications code and can continue to work on it myself.
wahnfrieden 11 hours ago [-]
I hear this a lot but I don't think decades of experience atrophies irretrievably so quickly as to make it worth it (alone) to abstain from making full use of these tools. I still read and direct enough of the architecture to not be lost in the code it generates. Maybe you haven't tried using agents to reorganize/refactor as much - I have cleaner code than I did before when it was done by hand, because I can afford to tackle debts.
I also don't find the permissions it prompts for very meaningful. Permission to use a file search tool? Permission to make a web request? It's a clumsy way to slow it down enough for me to catch up.
esafak 15 hours ago [-]
You can push thousands of LOC every day while approving manually. If you went any faster you would not be able to read the code.
harikb 16 hours ago [-]
On the credentials point. Here is what I find.
Day 1: Carefully handles the creds, gives me a lecture (without asking) about why .env should be in .gitignore and why I should edit .env and not hand over the creds to it.
Day 2: I ask for a repeat; it has lost track of that skill or setting, frantically searches my entire disk, reads .env along with many other files, figures out that it is holding a token, manually creates curl commands to test the token, and then comes back with some result.
It is like it is a security expert on Day 1 and an absolutely mediocre intern on Day 2.
eterm 16 hours ago [-]
I found the same: it was super careful handling the environment variable until it hit an API error, and I caught it in its thinking, "Let me check the token is actually set correctly", as it just echoed the token out.
( These were low-stakes test creds anyway, thankfully. )
I never pass creds via env or anything else it can access now.
My approach now is to get it to write me linqpad scripts, which has a utility function to get creds out of a user-encrypted share, or prompts if it's not in the store.
This works well, but requires me to run the scripts and guide it.
Ultimately, fully autonomous isn't compatible with secrets. Otherwise, if it really wanted to inspect one, it could just redirect the request to an echo service.
The only real way is to deal with it the same way we deal with insider threat.
A proxy layer / secondary auth, which injects the real credentials. Then give Claude its own user within that auth system, so it owns those creds. Now responsibility can be delegated to it without exposing the original credentials.
That's a lot of work when you're just exploring an API or DB or similar.
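The injection layer itself can be tiny. A toy sketch of the idea (all names, the placeholder scheme, and the token are hypothetical): the agent's environment holds only a placeholder, and the proxy swaps in the real credential on the way out.

```python
import urllib.request

REAL_TOKEN = "real-secret-token"      # lives only inside the proxy process
AGENT_PLACEHOLDER = "AGENT_CRED_1"    # the only "credential" the agent sees

def inject_credentials(url: str, headers: dict) -> urllib.request.Request:
    """Build the outbound request, replacing the agent's placeholder with
    the real bearer token. The agent can echo its own env all it wants;
    it only ever leaks the placeholder."""
    out = dict(headers)
    if out.get("Authorization") == f"Bearer {AGENT_PLACEHOLDER}":
        out["Authorization"] = f"Bearer {REAL_TOKEN}"
    return urllib.request.Request(url, headers=out)
```

A real version would sit behind a local port, scope which hosts each placeholder may be sent to, and log every substitution, which is exactly the insider-threat framing: the agent gets its own identity, not yours.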
jbreckmckye 12 hours ago [-]
I think it is just because they are having to load-shed! Some days you may be getting much less compute; the main way "thinking" operates is to just iterate on the result a few more times.
boron1006 15 hours ago [-]
I essentially have 3 modes:
1. Everything is specified, written and tested by me, then cleaned up by AI. This is for the core of the application.
2. AI writes the functions, then sets up stub tests for me to write. Here I’ll often rewrite the functions as they often don’t do what I want, or do too much. I just find it gets rid of a lot of boilerplate to do things this way.
3. AI does everything. This is for experiments or parts of an application that I am perfectly willing to delete. About 70% of the time I do end up deleting these parts. I don’t allow it to touch 1 or 2.
Of course this requires that the architecture is setup in a way where this is possible. But I find it pretty nice.
suzzer99 5 hours ago [-]
This is the way imo, at least for now.
stult 15 hours ago [-]
This seems like a really easy problem to solve. Just don't give the LLM access to any prod credentials[1]. If you can't repro a problem locally or in staging/dev environments, you need to update your deployment infra so it more closely matches prod. If you can't scope permissions tightly enough to distinguish between environments, update your permissions system to support that. I've never had anything even vaguely resembling the problems you are describing because I follow this approach.
[1] except perhaps read-only credentials to help diagnose problems, but even then I would only issue it an extremely short-lived token in case it leaks it somehow
lucasgerads 15 hours ago [-]
I usually try to review all the code written by Claude, and also let Claude review all the code that I write, so usually I have some understanding of what is going on. And Claude definitely sometimes makes "unconventional" decisions. But if you are working on a large code base with other team members (some of whom may already have left the company), there are also large parts of the code that one doesn't understand and that are abstracted away.
Barbing 16 hours ago [-]
Must consider ourselves lucky for having the intuition to notice skill stagnation and atrophy.
Only helps if we listen to it :) which is fun b/c it means staying sharp which is inherently rewarding
Don’t give your agent access to content it should not edit, don’t give keys it shouldn’t use.
raincole 16 hours ago [-]
It never ceases to scare me how they just run python code I didn't write via:
> python <<'EOF'
> ${code the agent wrote on the spot}
> EOF
I mean, yeah, in theory it's just as dangerous as running arbitrary shell commands, which the agent is already doing anyway, but still...
dns_snek 15 hours ago [-]
The good news is that some of these harnesses (like Codex) use sandboxing. The bad news is that they're too inflexible to be effective.
By default these shell commands don't have network access or write access outside the project directory which is good, but nowhere near customizable enough. Once you approve a command because it needs network access, its other restrictions are lifted too. It's all or nothing.
onlyrealcuzzo 16 hours ago [-]
> 2. I've learned nothing. So the cognitive load of doing it myself, even assembling a simple docker command, is just too high. Thus, I repeatedly fallback to the "crutch" of using AI.
I'm not trying to be offensive, so with all due respect... this sounds like a "you" problem. (And I've been there, too.)
You can ask the LLMs: how do I run this, how do I know this is working, etc etc.
Sure... if you really know nothing, or you put close to zero effort into critically thinking about what they give you, you can be fooled by their answers and mistake complete irrelevance or bullshit for evidence that something works, is suitably tested, etc.
You can ask 2 or 3 other LLMs: check their work, is this conclusive, can you find any bugs, etc etc.
But you don't sound like you know nothing. You sound like you're rushing to get things done, cutting corners, and you're getting rushed results.
What do you expect?
Their work is cheap. They can pump out $50k+ worth of features in a $200/mo subscription with minimal baby-sitting. Be EAGER to reject their work. Send it back to them over and over again to do it right, for architectural reviews, to check for correctness, performance, etc.
They are not expensive people with feelings you need to consider in review, that might quit and be hard to replace. Don't let them cut corners. For whatever reason, they are EAGER to cut corners no matter how much you tell them not to.
devilsdata 12 hours ago [-]
Good advice. Personally I'm waiting until it is worthwhile to run these models locally, then I'm going to pin a version and just use that.
I'm only 5 years into this career, and I'm going to work manually and absorb as much knowledge as possible while I'm still able to do it. Yes, that means manually doing shit-kicker work. If AI does get so good that I need to use it, as you say, then I'll be running it locally on a version I can master and build tooling for.
cortesoft 16 hours ago [-]
While I share some of the feelings about 'not understanding what is actually happening under the hood', I can't help but think about how this feeling is the exact same response that programmers had when compilers were invented:
We are completely comfortable now letting the compilers do their thing, and never seem to worry that we "don't know what is actually happening under the hood".
I am not saying these situations are exactly analogous, but I am saying that I don't think we can know yet if this will be one of those things that we stop worrying about or it will be a serious concern for a while.
msteffen 16 hours ago [-]
I think about this a lot, though one paragraph from that article:
> Many assembly programmers were accustomed to having intimate control over memory and CPU instructions. Surrendering this control to a compiler felt risky. There was a sentiment of, if I don’t code it down to the metal, how can I trust what’s happening? In some cases, this was about efficiency. In other cases, it was about debuggability and understanding programming behavior. However, as compilers matured, they began providing diagnostic output and listings that actually improved understanding.
I would 100% use LLMs more and more aggressively if they were more transparent. All my reservations come from times when I prompt “change this one thing” and it rewrites my db schema for some reason, or adds a comment that is actively wrong in several ways. I also think I have a decent working understanding of the assembly my code compiles to, and do occasionally use https://godbolt.org/. Of course, I didn’t start out that way, but I also don’t really have any objections to teenagers vibe-coding games, I just think at some point you have to look under the hood if you’re serious.
cortesoft 15 hours ago [-]
> I would 100% use LLMs more and more aggressively if they were more transparent. All my reservations come from times when I prompt “change this one thing” and it rewrites my db schema for some reason, or adds a comment that is actively wrong in several ways.
Isn't that what git is for, though? Just have your LLM work in a branch, and then you will have a clear record of all the changes it made when you review the pull request.
ManuelKiessling 16 hours ago [-]
(I'm saying this as someone who uses AI for coding a lot and mostly loves it.) Yeah, but is that really the same? Compilers work deterministically: if it works once, it will work always. LLMs are a different story for now.
betenoire 15 hours ago [-]
Said another way, compilers are a translation of existing formal code. Compilers don't add features, they don't create algorithms (unrolling, etc., notwithstanding), they are another expression of the same encoded solution.
LLMs are nothing like that
cortesoft 15 hours ago [-]
LLMs are just translating text into output, too, and are running on deterministic computers like every other bit of code we run. They aren't magic.
It is just the scope that makes it appear non-deterministic to a human looking at it, and it is large enough to be impossible for a human to follow the entire deterministic chain, but that doesn't mean it isn't in the end a function that translates input data into output data in a deterministic way.
betenoire 14 hours ago [-]
just text !== syntactically correct code that solves a defined problem
There is a world of difference between translation and generation. It's even in the name: generative AI. I didn't say anything about magic.
cortesoft 15 hours ago [-]
LLMs are deterministic, too. I know there is randomness in the choosing tokens, but that randomness is derived from a random seed that can be repeated.
gpderetta 1 hours ago [-]
LLMs are deterministic[1], but the only way to determine the output is to empirically run them. With compilers, both the implementor and a power user understand the specific code transformations they are capable of, so you can predict their output with good accuracy. I.e. LLMs are probably chaotic systems.
edit: there might be a future where we develop robopsychology enough to understand LLMs as more than black boxes, but we are not there yet.
[1] Aside from injected randomness and parallel scheduling artifacts.
Supermancho 13 hours ago [-]
Only if the seed is known. Determinism is often predicated on perfect information. Many programs do not have that. Their operations cannot be reproduced practically. The difference between saying deterministic and non-deterministic is contextual based on if you are concerned with theory or practicality.
lelanthran 14 hours ago [-]
If I understand your argument, you're saying that models can be deterministic, right?
Care to point to any that are set up to be deterministic?
Did you ever stop to think about why no one can get any use out of a model with temp set to zero?
mrob 13 hours ago [-]
llama.cpp is deterministic when run with a specified PRNG seed, at least when running on CPU without caching. This is true regardless of temperature. But when people say "non-deterministic", they really mean something closer to "chaotic", i.e. the output can vary greatly with small changes to input, and there is no reliable way to predict when this will happen without running the full calculation. This is very different behavior from traditional compilers.
cortesoft 14 hours ago [-]
No, LLMs ARE deterministic, just like all computer programs are.
I get why that is in practice different from the manner in which compilers are deterministic, but my point is the difference isn't because of determinism.
betenoire 12 hours ago [-]
I think you are misunderstanding the term "deterministic". Running on deterministic hardware does not mean an algorithm is deterministic.
Create a program that reads from /dev/random (not urandom). It's not deterministic.
cortesoft 10 hours ago [-]
Fair, although you can absolutely use local LLMs in a deterministic way (by using fixed seeds for the random number generation), and my point is that even if you did that with your LLM, it wouldn't change the feeling someone has about not being able to reason out what was happening.
In other words, it isn't the random number part of LLMs that make them seem like a black box and unpredictable, but rather the complexity of the underlying model. Even if you ran it in a deterministic way, I don't think people would suddenly feel more confident about the outputted code.
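The fixed-seed point is easy to demonstrate with a toy sampler (nothing like a real LLM, just temperature sampling over fixed per-step logits):

```python
import math
import random

def sample_tokens(step_logits, seed, temperature=1.0):
    """Temperature sampling with a seeded RNG: the same seed always
    reproduces the same token sequence, even at temperature > 0."""
    rng = random.Random(seed)
    tokens = []
    for logits in step_logits:
        scaled = [x / temperature for x in logits]
        m = max(scaled)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scaled]
        tokens.append(rng.choices(range(len(logits)), weights=weights)[0])
    return tokens
```

Determinism of this kind does nothing to make the output predictable before you run it, which is the black-box complaint in the sibling comments.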
nextaccountic 15 hours ago [-]
The difference is that compilers are supposed to be deterministic and low level inclined people often investigate compiler bugs (specially performance bugs) and can pinpoint to some deterministic code that triggered it. Fix the underlying code and it stops misbehaving with high assurance
A non deterministic compiler is probably defective and in any case much less useful
mathieudombrock 16 hours ago [-]
A major difference is that _someone_ knew what was going on (compiler devs).
cortesoft 15 hours ago [-]
That is an interesting difference, I agree.
Although, while the compiler devs might know what was going on in the compiler, they wouldn't know what the compiler was doing with that particular bit of code that the FORTRAN developer was writing. They couldn't possibly foresee every possible code path that a developer might traverse with the code they wrote. In some ways, you could say LLMs are like that, too; the LLM developers know how the LLM code works, but they don't know the end result with all the training data and what it will do based on that.
In addition, to the end developer writing FORTRAN it was a black box either way. Sure, someone else knows how the compiler works, but not the developer.
lelanthran 14 hours ago [-]
I think you have an incorrect mental model of how LLMs work.
There's plenty of resources online to rectify that, though.
cortesoft 14 hours ago [-]
I think you may be misreading my comment, then, because I know how LLMs work. Which part of my comment do you think shows that I don’t?
gpderetta 1 hours ago [-]
Maybe you have a wrong mental model of how compilers work, then. I'm not a compiler developer, but usually I have a pretty good idea of what code gcc will generate for my C++: it is far from a black box.
Also, compilers usually compose well: you can test snippets of code in isolation, and the generated code will have at least some relation to whatever asm would be generated when the snippet is embedded in a larger code base (even under inter-procedural optimizations or LTO, you can predict and often control how it will affect the generated code).
mnkypete 16 hours ago [-]
Except that compilers are (at least to a large degree) deterministic. It's complexity that you don't need to worry about. You don't need to review the generated assembly. You absolutely need to review AI generated code.
cortesoft 15 hours ago [-]
At the end of the day, LLMs are also deterministic. They are running on computers just like all software, and if you have all the same data and random seeds, and you give the same prompt to the same LLM, you will get back the exact same response.
Supermancho 13 hours ago [-]
> you give the same prompt to the same LLM, you will get back the exact same response.
Demonstrably incorrect. This is because the model selection, among other data, is not fixed for (I would say most) LLMs. They are constantly changing. I think you meant something more like an LLM with a fixed configuration. Maybe additional constraints, depending on the specific implementation.
cortesoft 10 hours ago [-]
Yes, by 'same LLM', I mean literally the same model with the same random seeds. You are correct, the big LLMs from providers like Anthropic and OpenAI do not meet this definition.
ryandrake 15 hours ago [-]
[dead]
eterm 16 hours ago [-]
It's funny, because the wisdom that was often taught ( but essentially never practiced ) was "Refactor as you go".
The idea being that if you're working in an area, you should refactor and tidy it up and clean up "tech debt" while there.
In practice, it was seldom done, and here we have LLMs actually doing it, and we're realising the drawbacks.
hirako2000 16 hours ago [-]
When the model writes new code doing the same thing as existing logic, that's not a refactor.
At times it does this even when a function is right there doing exactly what's needed.
Worse, when it modifies a function that exists, supposedly maintaining its behavior, but breaks it for other use cases. Good try, I guess.
Worst: changing state across classes without realising the side effects. Deadlocks, or plain bugs.
aerhardt 16 hours ago [-]
When they decide to touch something as they go, they often don't improve it. Not what I would call "refactoring" but rather a yank of the slot machine's arm.
whimblepop 15 hours ago [-]
> In practice, it was seldom done, and here we have LLMs actually doing it, and we're realising the drawbacks.
I spent some time dealing with this today. The real issue for me, though, was that the refactors the agent did were bad. I only wanted it to stop making those changes so I could give it more explicit changes on what to fix and how.
localhoster 16 hours ago [-]
So I think there's some more nuance than that.
A lot of the time, the abstraction is solid enough for you to work with that code area, i.e. tracking down some bug or extending some functionality.
But sometimes you find yourself at a crossroads: either hack around the existing implementation, or rethink it. With LLMs, how do you even rethink it? Does it even matter to rethink it? And anyhow, those decisions are hidden away from you.
traderj0e 15 hours ago [-]
It's only hidden if you don't read the code. Even if you don't, at some point you'll notice the LLM starting to struggle.
hyperpape 16 hours ago [-]
That's a real question, maybe the changes are useful, though I think I'd like to see some examples. I do not trust cognitive complexity metrics, but it is a little interesting that the changes seem to reliably increase cognitive complexity.
raincole 16 hours ago [-]
Really? I've never heard it's considered wise to put refactoring and new features (or bugfixes) in the same commit. Everyone I know from every place I've seen consider it bad. From harmful to a straight rejection in code review.
"Refactor-as-you-go" means to refactor right after you add features / fix bugs, not like what the agent does in this article.
xboxnolifes 12 hours ago [-]
Notice how they didn't say to put it in the same commit. The real issue, and why refactor-as-you-go isn't done as much, is the overhead of splitting changes that touch the same code into different commits without disrupting your workflow. It's not as easy as it should be to support this strategy.
Instead you put it off until later, and then never do it.
raincole 12 hours ago [-]
I think you're talking about a different topic unrelated to the linked article. In the linked article the LLM doesn't split it into several commits. If LLM had a button to split the bug fix and the overall refactoring, the author wouldn't complain and we wouldn't see this article.
bluefirebrand 16 hours ago [-]
There is a pretty substantial difference between "making changes" and "refactoring"
If LLMs are doing sensible and necessary refactors as they go then great
I have basically zero confidence that is actually the case though
ramesh31 16 hours ago [-]
>The idea being that if you're working in an area, you should refactor and tidy it up and clean up "tech debt" while there.
This is horrible practice, and very typical junior behavior that needs to be corrected against. Unless you wrote it, Chesterton's Fence applies; you need to think deeply for a long time about why that code exists as it does, and that's not part of your current task. Nothing worse than dealing with a 1000 line PR opened for a small UI fix because the code needed to be "cleaned up".
cassianoleal 16 hours ago [-]
That is the flip side of what you're arguing against, and is also very typical junior behaviour that needs to be corrected against.
Tech debt needs to be dealt with when it makes sense. Many times it will be right there and then as you're approaching the code to do something else. Other times it should be tackled later with more thought. The latter case is frequently a symptom of the absence of the former.
In Extreme Programming, that's called the Boy Scout Rule.
The Boy Scout "leave it better than you found it" is a good rule to follow. All code has its breaking points, so when you're adding a new feature and find that the existing code doesn't support it without hacks, it probably needs a refactor.
ramesh31 11 hours ago [-]
Indeed there's a distinction that needs to be made here between "not refactoring this code means I'll need to add hacks" and "Oh I'll just clean that up while I'm in here." The former can be necessary, but the latter is something you learn with experience to avoid.
cassianoleal 3 hours ago [-]
> the latter is something you learn with experience to avoid.
The latter is something you learn to judge the right time to tackle. Sometimes a small improvement that's not strictly required will mean you're not pressed into a hack-avoiding refactor later. The earlier you can tackle problems, the cheaper they are to solve.
esafak 15 hours ago [-]
Just do it in a follow up PR to keep them atomic. But do clean up; it's so easy now.
BoredomIsFun 15 minutes ago [-]
It feels like a pointless conversation, if no sampler settings (min_p, temperature etc.) mentioned.
simonw 15 hours ago [-]
I've not seen over-editing in Claude Code or Codex in quite a while, so I was interested to see the prompts being used for this study.
Just had one today where GPT-5.4, instead of adding the 10 lines I asked for (an addition that could be done pretty mechanically by just looking at some previous code and adding a similar thing with different/new variable names), proceeded to rewrite 50 lines because it was "cleaner". It was not. It also didn't originally add the thing I asked for either, which was perplexing.
Over-editing is definitely not some long gone problem. This was on xhigh thinking, because I forgot to set it to lower.
qweiopqweiop 14 hours ago [-]
Likewise, this felt like an early agent problem to me.
janalsncm 14 hours ago [-]
This is a really solid writeup. LLMs are way too verbose in prose and code, and my suspicion is this is driven mainly by the training mechanism.
Cross entropy loss steers towards garden path sentences. Using a paragraph to say something any person could say with a sentence, or even a few precise words. Long sentences are the low perplexity (low statistical “surprise”) path.
Almured 16 hours ago [-]
I feel ambivalent about it. In most cases, I fully agree with the overdoing assessment and then have to spend 30 min correcting and fixing. But I also agree that sometimes the system misses out on more comprehensive changes (context limitations, I suppose)! I am starting to be very strict when coding with these tools but still not quite getting the level of control I would like to see.
aerhardt 16 hours ago [-]
I'm building a website in Astro and today I've been scaffolding localization. I asked Codex 5.4 x-high to follow the official guidelines for localization and from that perspective the implementation was good. But then it decides to re-write the copy and layout of all pages. They were placeholders, but still?
Codex also has a tendency to apply unwanted styles everywhere.
I see similar tendencies in backend and data work, but I somehow find it easier to control there.
I'm pretty much all in on AI coding, but I still don't know how to give these things large units of work, and I still feel like I have to read everything but throwaway code.
magicalhippo 16 hours ago [-]
You can steer it, though. When I see it going off the reservation I steer it back. I also commit often, after just about every prompt cycle, so I can easily revert and pick up the ball in a fresh context.
But yeah, I saw a suggestion about adding a long-lived agent that keeps track of salient points (so, a kind of memory), monitors the main agent's progress against that memory, and issues commands when it detects that the current code clashes with previous instructions. Would be interesting to see if it helps.
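The commit-after-every-prompt-cycle workflow is easy to script. A minimal sketch (file names and commit messages are invented for illustration), showing that a whole agent turn rolls back with a single `git revert`:

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    """Run a git command in the given repo, raising on failure."""
    subprocess.run(["git", *args], cwd=cwd, check=True, capture_output=True)

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("config", "user.email", "you@example.com", cwd=repo)
git("config", "user.name", "you", cwd=repo)

path = os.path.join(repo, "app.py")
with open(path, "w") as f:
    f.write("original code\n")
git("add", "app.py", cwd=repo)
git("commit", "-qm", "checkpoint: before prompt", cwd=repo)

# The agent over-edits; commit it anyway as a checkpoint.
with open(path, "w") as f:
    f.write("over-edited rewrite\n")
git("commit", "-aqm", "checkpoint: after prompt", cwd=repo)

# One command undoes the entire turn without touching earlier work.
git("revert", "--no-edit", "HEAD", cwd=repo)
print(open(path).read().strip())  # -> original code
```

Because each prompt cycle is its own commit, a bad turn never contaminates the work before it.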
traderj0e 15 hours ago [-]
They also don't understand how exceptions work. They'll try-catch everything, print the error, and continue. If I see a big diff, I know it just added 10 try-catches in random parts of my codebase.
jasonjmcghee 16 hours ago [-]
I never use xhigh due to overthinking. I find high nearly always works better.
Purely anecdotal.
sabas123 13 hours ago [-]
This is the official OpenAI guideline, too.
pilgrim0 16 hours ago [-]
Like others mentioned, letting the agent touch the code makes learning difficult and induces anxiety. By introducing doubt it actually increases the burden of revision, negating the fast apparent progress. The way I found around this is to use LLMs for designing and auditing, not programming per se. Even more so because it’s terrible at keeping the coding style. Call it skill issue, but I’m happier treating it as a lousy assistant rather than as a dependable peer.
kgeist 13 hours ago [-]
Interesting, my assumption used to be that models over-edit when they're run with optimizations in attention blocks (quantization, Gated DeltaNet, sliding window etc.). I.e. they can't always reconstruct the original code precisely and may end up re-inventing some bits. Can't it be one of the reasons too?
jacek-123 6 hours ago [-]
Feels like a training-data artifact. SFT and preference data are full of "here's a cleaner version of your file", not "here's the minimum 3-line diff". The model learned bigger, more polished outputs win. Prompting around it helps a bit but you're fighting the prior.
dbvn 15 hours ago [-]
Don't forget the non-stop unnecessary comments
Flavius 15 hours ago [-]
Token bonanza! Inference sellers love this simple trick.
kgeist 13 hours ago [-]
Custom constrained decoding could have solved this. Penalize comment tokens :)
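As a toy illustration (not any real tokenizer or inference stack; the ids and values below are invented), a comment-token penalty during decoding is just a subtraction on the logit vector before sampling, the same shape as a Hugging Face `LogitsProcessor`:

```python
# Hypothetical vocabulary: pretend ids 3 and 7 begin comments ("#", "//").
COMMENT_TOKEN_IDS = {3, 7}
PENALTY = 5.0

def penalize_comment_tokens(logits, ids=COMMENT_TOKEN_IDS, penalty=PENALTY):
    """Return a copy of the logit vector with comment tokens down-weighted."""
    return [x - penalty if i in ids else x for i, x in enumerate(logits)]

scores = [0.0, 1.0, 2.0, 5.0, 0.0, 1.0, 0.0, 4.0]
adjusted = penalize_comment_tokens(scores)
print(adjusted[3], adjusted[7])  # -> 0.0 -1.0
```

A fixed penalty only nudges the sampler; setting it very large effectively bans the tokens, which is closer to true constrained decoding.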
ozozozd 7 hours ago [-]
There is no need for a new name. It's called a high-impact change, as opposed to a low-impact change, where one changes or adds the fewest lines necessary to achieve the goal.
Not surprised to see this: because some of us didn't pay attention in history class, lines of code is once again a performance measure, like a pissing contest.
rcvassallo83 10 hours ago [-]
This resonates
I've had success with greenfield code followed by frustration when asking for changes to that code due to over editing
And prompting for "minimal changes" does keep the edits down. In addition to this instruction, adding specifics about how to make the change and what not to do tends to get results I'm looking for.
"add one function that does X, add one property to the data structure, otherwise leave it as is, don't add any new validation"
brianwmunz 14 hours ago [-]
I feel like the core of this is that agents aren't exactly a replacement for a junior developer, like some people say. A junior dev has their own biases, predispositions, history, and understanding of the internal and external aspects of a product and company. An AI agent wants to do what you ask in the best way possible, which is... not always what a dev wants :) The fix the article talks about is simple, but it shows that these models have no inherent sense of project scope or proportionality. You have to give them as much context as possible, explicitly, to fill in the gaps so they infer less and make smaller decisions.
btbuildem 14 hours ago [-]
I wish there was a reliable way to choke the agents back and prevent them from doing this. Every line of code added is a potential bug, and they overzealously spew pages and pages of code. I've routinely gone through my (hobby) projects and (yes, still with the aid of an LLM) trimmed some 80% of the generated code with barely any loss of functionality.
The cynic in me thinks it's done on purpose to burn more tokens. The pragmatist however just wants full control over the harness and system prompts. I'm sure this could be done away with if we had access to all the knobs and levers.
qurren 14 hours ago [-]
> if we had access to all the knobs and levers.
We do, just tell it what you want in your AGENTS.md file.
Agents also often respond well to user frustration signs, like threatening to not continue your subscription.
gobdovan 10 hours ago [-]
> Agents also often respond well to user frustration signs, like threatening to not continue your subscription.
From the phrasing, I can't help but imagine you as a very calm, completely unemotional person who only emulates signs of user frustration, strategically threatening the AI that you'll cancel your subscription when it nukes your code.
pyrolistical 16 hours ago [-]
I attempt to solve most agent problems by treating them as a dumb human.
In this case I would ask for smaller changes and a justification for every change. Then have it look back on those changes and ask itself whether they are truly justified or could be simplified.
collimarco 11 hours ago [-]
Over-editing and over-adding... I can find solutions that are just a few lines of code in a single file where AI would change 10 files and add 100s of lines of code. Writing less code is more important than ever. Too much code means more technical debt, a maintainability nightmare and more places where bugs can hide.
Gigachad 11 hours ago [-]
I've seen this at work where people submit PRs that implement whole internal libraries to do something that could have been done with an existing tool or just done simpler in general.
It's impossible to properly review this in a reasonable time and they always introduce tons of subtle bugs.
recursivecaveat 14 hours ago [-]
My experience is usually the opposite. The code they write is verbose yes, but the diffs are over-minimal. Whenever I see a comment like "Tool X doesn't support Y or has a bug with Z [insert terrible kludge]" and actually fixing the problem in the other file would be very easy, I know it is AI-generated. I suspect there is a bias towards local fixes to reduce token usage.
whinvik 16 hours ago [-]
Yeah I have always felt GPT 5.4 does too much. It is amazing at following instructions precisely but it convinces itself to do a bit too much.
I am surprised Gemini 3.1 Pro is so high up there. I have never managed to make it work reliably so maybe there's some metric not being covered here.
itopaloglu83 16 hours ago [-]
I always described it as over-complicating the code, but doing too much is a better diagnosis.
vibe42 15 hours ago [-]
With the pi-mono coding agent (running local, open models) this works very well:
"Do not modify any code; only describe potential changes."
I often add it to the end when prompting to e.g. review code for potential optimizations or refactor changes.
exitb 16 hours ago [-]
As mentioned in the article, prompting for minimal changes does help. I find GPT models to be very steerable, but that doesn't mean much when you take your hands off the wheel. These types of issues should be solved at the planning stage.
Bengalilol 15 hours ago [-]
Tangent and admittedly off-topic but I've come to see LLM-assisted coding as a kind of teleportation.
With LLMs, you glimpse a distant mountain. In the next instant, you're standing on its summit. Blink, and you are halfway down a ridge you never climbed. A moment later, you're flung onto another peak with no trail behind you, no sense of direction, no memory of the ascent. The landscape keeps shifting beneath your feet, but you never quite see the panorama. Before you know it, you're back near the base, disoriented, as if the journey never happened. But confident, you say you were on the top of the mountain.
Manual coding feels entirely different. You spot the mountain, you study its slopes, trace a route, pack your gear. You begin the climb. Each step is earned steadily and deliberately. You feel the strain, adjust your path, learn the terrain. And when you finally reach the summit, the view unfolds with meaning. You know exactly where you are, because you've crossed every meter to get there. The satisfaction isn't just in arriving, nor in saying you were there: it is in having truly climbed.
jdkoeck 15 hours ago [-]
The thing is, with manual coding, you spot a view in the distance, you trek your way for a few hours, and you realize when you get there that the view isn’t as great as you thought it was.
With LLM-assisted coding, you skip the trek and you instantly know that’s not it.
bluequbit 6 hours ago [-]
I call this overcooking: adding unnecessary features.
slopinthebag 16 hours ago [-]
I think the industry has leaned waaay too far into completely autonomous agents. Of course there are reasons why corporations would want to completely replace their engineers with fully autonomous coding agents, but for those of us who actually work developing software, why would we want less and less autonomy? Especially since it alienates us from our codebases, requiring more effort in the future to gain an understanding of what is happening.
I think we should move to semi-autonomous steerable agents, with manual and powerful context management. Our tools should graduate from simple chat threads to something more akin to the way we approach our work naturally. And a big benefit of this is that we won't need expensive locked down SOTA models to do this, the open models are more than powerful enough for pennies on the dollar.
NitpickLawyer 16 hours ago [-]
I'm hearing this more and more, we need new UX that is better suited for the LLM meta. But none that I've seen so far have really got it, yet.
grttww 16 hours ago [-]
When you steer a car, there isn't this degree of probability in the output.
How do you emulate that with LLMs? I suppose the objective is to get variance down to the point where it's barely noticeable, but I'm not sure it'll get there just by accumulating more data and retraining models.
slopinthebag 14 hours ago [-]
Well, the point is by steering it you can get both more expected/reproducible output, and you can correct bad assumptions before they become solidified in your codebase.
You can get pretty close to reproducible output by narrowing the scope and using certain prompts/harnesses. As in, you get roughly the same output each time with identical prompts, assuming you're using a model which doesn't change every few hours to deal with load, and you aren't using a coding harness that changes how it works every update. It's not deterministic, but if you ask it for a scoped implementation you essentially get the same implementation every time, with some minor and usually irrelevant differences.
So you can imagine with a stable model and harness, with steering you can basically get what you ask it for each time. Tooling that exploits this fact can be much more akin to using an autocomplete, but instead of a line of code it's blocks of code, functions, etc.
A harness that makes it easy to steer means you can basically write the same code you would have otherwise written, just faster. Which I think is a genuine win, not only from a productivity standpoint but also you maintain control over the codebase and you aren't alienated or disenfranchised from the output, and it's much easier to make corrections or write your own implementations where you feel it's necessary. It becomes more of an augmentation and less of a replacement.
grttww 12 hours ago [-]
You wrote all that and didn't address the question lmao.
There are diminishing returns, and moreover this idea that people are holding it wrong / need to figure out the complexity goes against everything we've done over the past 30 years: making things simpler.
slopinthebag 12 hours ago [-]
You asked me how one could minimize the non-deterministic output of LLMs and I responded; if that's not a good enough answer, feel free to ask a follow-up.
lo1tuma 16 hours ago [-]
I'm not sure I share the author's opinion. When I was hand-writing code I also followed the boy-scout rule and did smaller refactorings along the way.
lopsotronic 16 hours ago [-]
When asked to show their development-test path in the form of a design document or test document, I've also noticed variance between the document generated and what the chain-of-thought thingy shows during the process.
The version it puts down into documents is not the thing it was actually doing. It's a little anxiety-inducing. I go back to review the code with big microscopes.
"Reproducibility" is still pretty important for those trapped in the basements of aerospace and defense companies. No one wants the Lying Machine to jump into the cockpit quite yet. Soon, though.
We have managed to convince the Overlords that some teensy non-agentic local models - sourced in good old America and running local - aren't going to All Your Base their Internets. So, baby steps.
tim-projects 15 hours ago [-]
> The model fixes the bug but half the function has been rewritten.
The solution to this is to use quality gates that loop back and check the work.
I'm currently building a tool with gates and a diff regression check. I haven't seen these problems for a while now.
Well seeing as they don't KNOW anything this isn't surprising at all
spullara 14 hours ago [-]
this is one of the best things about using claude over gpt. claude understands the bigger assignment and does all the work and sometimes more than necessary but for me it beats the alternative.
jollyllama 15 hours ago [-]
It's called code churn. Generally, LLMs generate code churn.
standardly 14 hours ago [-]
I've had a bad experience using AI for front-end stuff, where I replace or deprecate a feature only to notice later all the artifacts it left behind, some which were never even used in the first place.
I re-did an entire UI recently, and when one of the elements failed to render I noticed the old UI peeking out from underneath. It had tried just covering up old elements instead of adjusting or replacing them. Like telling your son to clean their room, so they push all the clothes under the bed and hope you don't notice LOL
It saves 2 hours of manual syntax wrangling but introduces 1.5 hours of cleanup and sanity checking. Still a net productivity increase, but I'm not sure it's worth how lazy it seems to be making me (this is an easy error to correct, I'm sure, but meh, Claude can fix it in 2 seconds, so...)
graybeardhacker 16 hours ago [-]
I use Claude Code every day and have for as long as it has been available. I use git add -p to ensure I'm only adding what is needed. I review all code changes and make sure I understand every change. I prompt Claude never to make whitespace-only changes. I ask it to make the minimal changes needed to fix a bug.
Too many people are treating the tools as a complete replacement for a developer. When you are typing a text to someone and Google changes a word you misspelled to a completely different word and changes the whole meaning of the text message do you shrug and send it anyway? If so, maybe LLMs aren't for you.
nunez 10 hours ago [-]
i use claude in more or less the same way but it sure is tempting to just glaze over the 300+ line diffs it produces.
m463 14 hours ago [-]
You know, this made me think of over-engineering.
...and that led me to believe that AI might be very capable to develop over-engineered audio equipment. Think of all the bells and whistles that could be added, that could be expressed in ridiculous ways with ridiculous price tags.
This seems like something that should be easy to prevent in the pi harness. Just tell it to make an extension that, before calling the file-edit tool, asks the model to confirm that no lines unconnected to the current task will be unnecessarily changed by the edit.
tantalor 15 hours ago [-]
> Code review is already a bottleneck
Counterpoint: no it isn't
> makes this job dramatically harder
No it doesn't
esafak 15 hours ago [-]
How many LOC do you generate and read a day? Only your own code or others' too?
If you find any, consider making them into skills or /commands or maybe even add them to AGENTS.md.
Is it by characters human typed vs AI generated, or by commit or something?
Cursor dashboard. I know they're incentivized to over-estimate but feels directionally accurate when I look at recent PRs.
Don't really think about it. I think when I talk to it through Slack, Cursor uses Codex; in my IDE it looks like it's whatever the highest Claude is. In GitHub comments, who even knows.
Personally, I tend to get crap quality code out of Claude. Very branchy. Very un-DRY. Consistently fails to understand the conventions of my codebase (e.g. keeps hallucinating that my arena allocator zero initializes memory - it does not). And sometimes after a context compaction it goes haywire and starts creating new regressions everywhere. And while you can prompt to fix these things, it can take an entire afternoon of whack-a-mole prompting to fix the fallout of one bad initial run. I've also tried dumping lessons into a project specific skill file, which sometimes helps, but also sometimes hurts - the skill file can turn into a footgun if it gets out of sync with an evolving codebase.
In terms of limits, I usually find myself hitting the rate limit after two or three requests. On bad days, only one. This has made Claude borderline unusable over the past couple weeks, so I've started hand coding again and using Claude as a code search and debugging tool rather than a code generator.
I've found this can be vastly reduced with AGENTS.md instructions, at least with codex/gpt-5.4.
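For what it's worth, the instructions that seem to help are blunt and diff-focused. A sketch of the kind of AGENTS.md section people use (wording is mine, not from any official template):

```markdown
## Editing discipline

- Make the minimal diff that satisfies the request; do not refactor
  surrounding code unless explicitly asked.
- Never reformat or re-wrap lines you are not otherwise changing.
- Do not add comments, try/except wrappers, or "cleanup" beyond the task.
- If a larger refactor seems warranted, propose it in prose first and wait.
```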
In TFA they found that prompting mitigates over-editing by up to about 10 percentage points.
It's bitten me several times at work, and I'd rather not waste any more of my limited time on the re-prompt -> manually-modify-code cycle. I'm capable of doing this myself.
It's great for simple tasks, though, and most feature work is simple tasks IMO. They were only "costly" in the sense that it took a while to read the code, find the appropriate changes, create tests for them, etc. LLMs shorten that cycle, but that type of work isn't the majority of my time at my job anyway.
I've worked at feature factories before; it's hell. I can't imagine how much more hellish it has become since the introduction of these tools.
Feature factories treat devs as literal assembly-line machines: output is the only thing that matters, not quality. Having that mass-induced by these tools is just so shitty to workers.
I fully expect a backlash in the upcoming years.
---
My only question to the OP of this thread is what kind of teacher they are, because teaching people anything about software while admitting that you no longer write code because it's not profitable (big LOL at caring about money over people) is beyond pathetic.
This means it can do anything in the VM, install dependencies, etc... So far, it managed to bork the VM once (unbootable), I could have spent a bit of time figuring out what happened but I had a script to rebuild the VM so didn't bother. To be entirely fair to claude, the VM runs arch linux which is definitely easier to break than other distros.
I don't measure my productivity, but I see it in the sort of tasks I tackle after years of waiting. It's especially good at tedious tasks like turning 100 markdown files into 5 json files and updating the code that reads them, for example.
Are you working more on operational stuff or on "long-running product" stuff?
My personal headcanon: this tooling works well when built on simple patterns, and it can handle complex work. It has also not been great at coming up with new patterns, and if left unsupervised it will happily make up new patterns that go south very quickly. With that lens, I find myself rewriting what Claude gives me in a good number of cases.
I sometimes race the robot and beat the robot at doing a change. I am "cheating" I guess cuz I know what I want already in many cases and it has to find things first but... I think the futzing fraction[0] is underestimated for some people.
And like in the "perils of laziness lost"[1] essay... I think that sometimes the machine trying too hard just offends my sensibilities. Why are you doing 3 things instead of just doing the one thing!
One might say "but it fixes it after it's corrected"... but I already go through this annoying "no don't do A,B, C just do A, yes just that it's fine" flow when working with coworkers, and it's annoying there too!
"Claude writes thorough tests" is also its own micro-mess here, because while guided test creation works very well for me, giving it any creative leeway leads to so many "test that foo + bar == bar + foo" tests. Applying skepticism to the utility of tests is important, because they're part of the feedback loop. And I'm finding a lot of the tests mainly useful as a way to pull in all the imports I need.
If we have all these machines doing this work for us, in theory average code quality should be able to go up. After all we're more capable! I think a lot of people have been using it in a "well most of the time it hits near the average" way, but depending on how you work there you might drag down your average.
[0]: https://blog.glyph.im/2025/08/futzing-fraction.html [1]: https://bcantrill.dtrace.org/2026/04/12/the-peril-of-lazines...
I've been doing a greenfield project with Claude recently. The initial prototype worked but was very ugly (repeated duplicate boilerplate code, a few methods doing the exact same thing, poor isolation between classes)... I was very tempted to rewrite it on my own. This time, I decided to try to get it to refactor toward the target architecture and fix those code-quality issues. It's possible, but it's very much like pulling teeth... I use plan mode, we have multiple rounds of review on a plan (which started from me explaining what I expect), then it implements 95% of it but doesn't realize that some parts were not implemented... It reminds me of mentoring a junior employee, except that Claude Code is more eager (jumping into implementation before understanding the problem), much faster at doing things, and dumber.
That said, I've seen codebases created by humans that were as bad or worse than what claude produced when doing prototype.
The first (and maybe even second) usage of a gnarly, badly thought out pattern might work fine. But you're only a couple steps away from if statement soup. And in the world where your agent's life is built around "getting the tests to pass", you can quickly find it doing _very_ gnarly things to "fix" issues.
I think you're likely in the silent majority. LLMs do some stupid things, but when they work it's amazing and it far outweighs the negatives IMHO, and they're getting better by leaps and bounds.
I respect some of the complaints against them (plagiarism, censorship, gatekeeping, truth/bias, data center arms race, crawler behavior, etc.), but I think LLMs are a leap forward for mankind (hopefully). A Young Lady's Illustrated Primer for everyone. An entirely new computing interface.
Much like giving a codebase to a newbie developer, whatever patterns exist will proliferate and the lack of good patterns means that patterns will just be made up in an ad-hoc and messy way.
I'm fascinated by this question.
I think the first two sections of this article point towards an answer: https://aphyr.com/posts/412-the-future-of-everything-is-lies...
I've personally had radically different experiences working on different projects, different features within the same project, etc.
This is the problem.
I think there is a huge gap between people on salaries getting effectively more responsibility by being given spend that they otherwise would not have had and people hustling on projects on their own.
Yes, it is 100% what I use, but I am never happy with usage. It burns up my sub fast and there is little feeling of control. Experiments like using lower-tier models are hard to evaluate in reality. Graphify might work or it might not. I have no idea.
I guess it comes down to how ossified you want your existing code to be.
If it's a big production application that's been running for decades then you probably want the minimum possible change.
If you're just experimenting with stuff and the project didn't exist at all 3 days ago then you want the agent to make it better rather than leave it alone.
Probably they just need to learn to calibrate themselves better to the project context.
Even within the same project, for a given PR, there are some parts of the codebase I want to modify freely and some that I want fixed to reduce the diff and testing scope.
I try to explain up-front to the agent how aggressively they can modify the existing code and which parts, but I've had mixed success; usually they bias towards a minimal diff even if that means duplication or abusing some abstractions. If anyone has had better success, I'd love to hear your approach.
I'll literally run an agent & tell it to clean up a markdown file that has too much design in it, delete the technical material, and/or delete key implementations/interfaces in the source, then tell a new session to do the work, come up with the design. (Then undelete and reconcile with less naive sessions.)
Path dependence is so strong. Right now I do this flow manually but I would very much like to codify this, make a skill for this pattern that serves so well.
I suspect AI's learned to do this in order to game the system. Bailing out with an exception is an obvious failure and will be penalized, but hiding a potential issue can sometimes be regarded as a success.
I wonder how this extrapolates to general Q&A. Do models find ways to sound convincing enough to make the user feels satisfied and the go away? I've noticed models often use "it's not X, it's Y", which is a binary choice designed to keep the user away from thinking about other possibilities. Also they often come up with a plan of action at the end of their answer, a sales technique known as the "assumptive close", which tries to get the user to think about the result after agreeing with the AI, rather than the answer itself.
They are trained on human feedback, so there is no other way this goes. Every bit of every response is pointed toward subversion of the assumed evaluator.
I can't help but read complaints about the capabilities of AI – and I'm certainly not accusing you of complaining about AI, just a general thought – and think "Yet" to myself every time.
I've spent far more time pitting one AI context against another (reviewing each other's work) than I have using AI to build stuff these days.
The benefit is that since it mostly happens asynchronously, I'm free to do other stuff.
1. I have no real understanding of what is actually happening under the hood. The ease of just accepting a prompt to run some script the agent has assembled is too enticing. But, I've already wiped a DB or two just because the agent thought it was the right thing to do. I've also caught it sending my AWS credentials to deployment targets when it should never do that.
2. I've learned nothing. So the cognitive load of doing it myself, even assembling a simple docker command, is just too high. Thus, I repeatedly fallback to the "crutch" of using AI.
Of course this requires being fortunate enough that you have one of those AI positive employers where you can spend lots of money on clankers.
I don't review every move it makes, I rather have a workflow where I first ask it questions about the code, and it looks around and explores various design choices. then i nudge it towards the design choice I think is best, etc. That asking around about the code also loads up the context in the appropriate manner so that the AI knows how to do the change well.
It's a me in the loop workflow but that prevents a lot of bugs, makes me aware of the design choices, and thanks to fast mode, it is more pleasant and much faster than me manually doing it.
On the one hand, reviewing and micromanaging everything it does is tedious and unrewarding. Unlike reviewing a colleague's code, you're never going to teach it anything; maybe you'll get some skills out of it if you find something that comes up often enough to be worth writing a skill for. And this only gets you, at best, a slight speedup over writing it yourself, since you have to stay engaged and think about everything that's going on.
Or you can just let it grind away agentically and only test the final output. This allows you to get those huge gains at first, but it can easily just start accumulating more and more cruft and bad design decisions and hacks on top of hacks. And you increasingly don't know what it's doing or why, you're losing the skill of even being able to because you're not exercising it.
You're just building yourself a huge pile of technical debt. You might delete your prod database without realizing it. You might end up with an auth system that doesn't actually check the auth and so someone can just set a username of an admin in a cookie to log in. Or whatever; you have no idea, and even if the model gets it right 95% of the time, do you want to be periodically rolling a d20 and if you get a 1 you lose everything?
Maybe I’m just weird (actually that’s a given) but I don’t mind babysitting the clanker while it works.
The agent only has access to exactly what it needs, be it an implementation agent, analysis agent, or review agent.
Makes it very easy to stay in command without having to sit and approve tons of random things the agent wants to do.
I do not allow bash or any kind of shell. I don't want to have to figure out what some random python script it's made up is supposed to do all the time.
Both OpenCode and VsCode support this. I think in ClaudeCode you can do it with skills now.
The other benefit is that the MCP tool can mediate noisy build-tool output and reduce token usage by showing only errors or test failures, or simply an OK response with the build result or test count.
So far, I have not needed to give them access to more than build tools, git, and a project/knowledge system (e.g. Obsidian) for the work I have them doing. Well, that and file read/write and web search.
I use Cursor but it's getting expensive lately, so I'm trying to reduce context size and move to OpenCode or something like that which I can use with some cheaper provider and Kimi 2.5 or whatever.
BTW, one tip is to look at the size of the codebase. When you see 100KLOC for a first draft of a C compiler, you know something has gone horribly wrong. I would suggest that you at least compare the number of lines the agent produced to what you think the project should take. If it's more than double, the code is in serious, serious trouble. If it's in the <1.5x range, there's a chance it could be saved.
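That rule of thumb is easy to mechanize. A minimal sketch, with the thresholds taken from the comment above and the file extensions purely as an example:

```python
from pathlib import Path

def loc(root, exts=(".c", ".h")):
    """Rough line count of a source tree (extensions are illustrative)."""
    return sum(len(p.read_text(errors="ignore").splitlines())
               for p in Path(root).rglob("*") if p.suffix in exts)

def verdict(actual, expected):
    """Compare the agent's output size to your own estimate."""
    ratio = actual / expected
    if ratio > 2.0:
        return "serious trouble"
    if ratio > 1.5:
        return "questionable"
    return "possibly salvageable"
```

So 100 KLOC against an expected 20 KLOC comes back as "serious trouble", matching the C-compiler example.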
Asking the agent questions is good - as an aid to a review, not as a substitute. The agents lie with a high enough frequency to be a serious problem.
The models don't yet write code anywhere near human quality, so they require much closer supervision than a human programmer.
You could have it build something that takes fewer lines of code, but you aren't going to find much with that level of specification and guardrails.
It has about doubled my development pace. An absolutely incredible gain in a vacuum, though tiny compared to what people seem to manage without these self-constraints. But in exchange, my understanding of the code is as comprehensive as if I had paired on it, or merged a direct report's branch into a project I was responsible for. A reasonable enough tradeoff, for me.
anonu has explicitly said that they've wiped a database twice as a result of agents doing stuff. What sort of diff would help against an agent running commands, without your approval?
The diff: +8000 -4000
I also don't find the permissions it prompts for very meaningful. Permission to use a file search tool? Permission to make a web request? It's a clumsy way to slow it down enough for me to catch up.
Day 1: Carefully handles the creds, gives me a lecture (without asking) about why .env should be in .gitignore and why I should edit .env and not hand over the creds to it.
Day 2: I ask for a repeat; it has lost track of that skill or setting, frantically searches my entire disk, reads .env along with many other files, understands that it is holding a token, manually creates curl commands to test the token, and then comes back with some result.
It's like it's a security expert on Day 1 and an absolutely mediocre intern on Day 2.
(These were low-stakes test creds I was experimenting with, thankfully.)
I never pass creds via env or anything else it can access now.
My approach now is to get it to write me LINQPad scripts, which have a utility function to get creds out of a user-encrypted store, or prompt if they're not in the store.
This works well, but requires me to run the scripts and guide it.
Ultimately, fully autonomous isn't compatible with secrets. Otherwise, if it really wanted to inspect one, it could just redirect the request to an echo service.
The only real way is to deal with it the same way we deal with insider threat.
A proxy layer / secondary auth, which injects the real credentials. Then give Claude its own user within that auth system, so it owns those creds. Now responsibility can be delegated to it without exposing the original credentials.
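The injection idea can be sketched very simply: the agent asks for an API call by path, and a trusted wrapper it cannot read attaches the real credential. The host, env-var name, and header below are placeholders for illustration, not any real service's API.

```python
import os
import urllib.request

API_BASE = "https://api.example.com"  # placeholder host

def prepare_request(path, method="GET"):
    """Build a request with the credential injected by the trusted layer.

    The token lives only in this wrapper's environment; the agent supplies
    just the path and method, so the secret never enters its context.
    """
    token = os.environ["SERVICE_TOKEN"]
    req = urllib.request.Request(API_BASE + path, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    return req  # caller would pass this to urllib.request.urlopen(...)
```

In a real setup this would run as a separate proxy process with the agent's own service account, so revoking the agent is one credential rotation.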
That's a lot of work when you're just exploring an API or DB or similar.
1. Everything is specified, written and tested by me, then cleaned up by AI. This is for the core of the application.
2. AI writes the functions, then sets up stub tests for me to write. Here I’ll often rewrite the functions as they often don’t do what I want, or do too much. I just find it gets rid of a lot of boilerplate to do things this way.
3. AI does everything. This is for experiments or parts of an application that I am perfectly willing to delete. About 70% of the time I do end up deleting these parts. I don’t allow it to touch 1 or 2.
Of course this requires that the architecture is setup in a way where this is possible. But I find it pretty nice.
[1] except perhaps read-only credentials to help diagnose problems, but even then I would only issue it an extremely short-lived token in case it leaks it somehow
Only helps if we listen to it :) which is fun b/c it means staying sharp which is inherently rewarding
Don’t give your agent access to content it should not edit, don’t give keys it shouldn’t use.
> python <<'EOF'
> ${code the agent wrote on the spot}
> EOF
I mean, yeah, in theory it's just as dangerous as running arbitrary shell commands, which the agent is already doing anyway, but still...
By default these shell commands don't have network access or write access outside the project directory which is good, but nowhere near customizable enough. Once you approve a command because it needs network access, its other restrictions are lifted too. It's all or nothing.
I'm not trying to be offensive, so with all due respect... this sounds like a "you" problem. (And I've been there, too.)
You can ask the LLMs: how do I run this, how do I know this is working, etc etc.
Sure... if you really know nothing, or you put close to zero effort into critically thinking about what they give you, you can be fooled by their answers and mistake complete irrelevance or bullshit for evidence that something works, or is suitably tested to prove that it works, etc.
You can ask 2 or 3 other LLMs: check their work, is this conclusive, can you find any bugs, etc etc.
But you don't sound like you know nothing. You sound like you're rushing to get things done, cutting corners, and you're getting rushed results.
What do you expect?
Their work is cheap. They can pump out $50k+ worth of features in a $200/mo subscription with minimal baby-sitting. Be EAGER to reject their work. Send it back to them over and over again to do it right, for architectural reviews, to check for correctness, performance, etc.
They are not expensive people with feelings you need to consider in review, that might quit and be hard to replace. Don't let them cut corners. For whatever reason, they are EAGER to cut corners no matter how much you tell them not to.
I'm only 5 years into this career, and I'm going to work manually and absorb as much knowledge as possible while I'm still able to do it. Yes, that means manually doing shit-kicker work. If AI does get so good that I need to use it, as you say, then I'll be running it locally on a version I can master and build tooling for.
https://vivekhaldar.com/articles/when-compilers-were-the--ai...
We are completely comfortable now letting the compilers do their thing, and never seem to worry that we "don't know what is actually happening under the hood".
I am not saying these situations are exactly analogous, but I am saying that I don't think we can know yet if this will be one of those things that we stop worrying about or it will be a serious concern for a while.
> Many assembly programmers were accustomed to having intimate control over memory and CPU instructions. Surrendering this control to a compiler felt risky. There was a sentiment of, if I don’t code it down to the metal, how can I trust what’s happening? In some cases, this was about efficiency. In other cases, it was about debuggability and understanding programming behavior. However, as compilers matured, they began providing diagnostic output and listings that actually improved understanding.
I would 100% use LLMs more and more aggressively if they were more transparent. All my reservations come from times when I prompt “change this one thing” and it rewrites my db schema for some reason, or adds a comment that is actively wrong in several ways. I also think I have a decent working understanding of the assembly my code compiles to, and do occasionally use https://godbolt.org/. Of course, I didn’t start out that way, but I also don’t really have any objections to teenagers vibe-coding games, I just think at some point you have to look under the hood if you’re serious.
Isn't that what git is for, though? Just have your LLM work in a branch, and then you will have a clear record of all the changes it made when you review the pull request.
LLMs are nothing like that
It is just the scope that makes it appear non-deterministic to a human looking at it, and it is large enough to be impossible for a human to follow the entire deterministic chain, but that doesn't mean it isn't in the end a function that translates input data into output data in a deterministic way.
There is a world of difference between translation and generation. It's even in the name: generative AI. I didn't say anything about magic.
edit: there might be a future where we develop robopsychology enough to understand LLMs as more than black boxes, but we are not there yet.
[1] Aside from injected randomness and parallel scheduling artifacts.
Care to point to any that are set up to be deterministic?
Did you ever stop to think about why no one can get any use out of a model with temp set to zero?
I get why that is in practice different than the manner in which compilers are deterministic, but my point is the difference isn't because of determinism.
Create a program that reads from /dev/random (not urandom). It's not deterministic.
In other words, it isn't the random number part of LLMs that make them seem like a black box and unpredictable, but rather the complexity of the underlying model. Even if you ran it in a deterministic way, I don't think people would suddenly feel more confident about the outputted code.
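A toy sketch makes the point concrete: at temperature 0, decoding is a pure argmax and therefore deterministic; above 0, it samples. (Real inference stacks add floating-point and batching nondeterminism on top, which this toy doesn't model.)

```python
import math
import random

def pick_token(logits, temperature, rng=random):
    """Toy decoder step over a vocabulary of len(logits) tokens."""
    if temperature == 0:
        # Greedy: always the highest logit; no randomness involved.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature, then sample.
    weights = [math.exp(l / temperature) for l in logits]
    return rng.choices(range(len(logits)), weights=weights)[0]
```

Even the sampling branch is deterministic given a fixed seed; the black-box feel comes from the model's complexity, not the coin flips.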
A non-deterministic compiler is probably defective, and in any case much less useful.
Although, while the compiler devs might know what was going on in the compiler, they wouldn't know what the compiler was doing with that particular bit of code that the FORTRAN developer was writing. They couldn't possibly foresee every possible code path that a developer might traverse with the code they wrote. In some ways, you could say LLMs are like that, too; the LLM developers know how the LLM code works, but they don't know the end result with all the training data and what it will do based on that.
In addition, to the end developer writing FORTRAN it was a black box either way. Sure, someone else knows how the compiler works, but not the developer.
There's plenty of resources online to rectify that, though.
Also compilers usually compose well: you can test snippets of code in isolation and the generated code it will have at least some relation to whatever asm would be generated when the snippet is embedded in a larger code base (even under inter-procedural optimizations or LTO, you can predict and often control how it will affect the generated code).
Demonstrably incorrect. This is because the model selection, among other data, is not fixed for (I would say most) LLMs. They are constantly changing. I think you meant something more like an LLM with a fixed configuration. Maybe additional constraints, depending on the specific implementation.
The idea being that if you're working in an area, you should refactor and tidy it up and clean up "tech debt" while there.
In practice, it was seldom done, and here we have LLMs actually doing it, and we're realising the drawbacks.
At times even when a function is right there doing exactly what's needed.
Worse, when it modifies a function that exists, supposedly maintaining its behavior, but breaks for other use cases. Good try I guess.
Worst. Changing state across classes not realising the side effect. Deadlock, or plain bugs.
I spent some time dealing with this today. The real issue for me, though, was that the refactors the agent did were bad. I only wanted it to stop making those changes so I could give it more explicit changes on what to fix and how.
"Refactor-as-you-go" means to refactor right after you add features / fix bugs, not like what the agent does in this article.
Instead you plan to do it later, and then never do it.
If LLMs are doing sensible and necessary refactors as they go then great
I have basically zero confidence that is actually the case though
This is horrible practice, and very typical junior behavior that needs to be corrected against. Unless you wrote it, Chesterton's Fence applies; you need to think deeply for a long time about why that code exists as it does, and that's not part of your current task. Nothing worse than dealing with a 1000 line PR opened for a small UI fix because the code needed to be "cleaned up".
Tech debt needs to be dealt with when it makes sense. Many times it will be right there and then as you're approaching the code to do something else. Other times it should be tackled later with more thought. The latter case is frequently a symptom of the absence of the former.
In Extreme Programming, that's called the Boy Scout Rule.
https://furqanramzan.github.io/clean-code-guidelines/princip...
The latter is something you learn to judge the right time to tackle. Sometimes a small improvement that's not strictly required will mean you're not pressed to make the refactor to avoid hacks. The earlier you can tackle problems, the cheaper they are to solve.
I think they're in here, last edited 8 months ago: https://github.com/nreHieW/fyp/blob/5a4023e4d1f287ac73a616b5...
Over-editing is definitely not some long gone problem. This was on xhigh thinking, because I forgot to set it to lower.
Cross entropy loss steers towards garden path sentences. Using a paragraph to say something any person could say with a sentence, or even a few precise words. Long sentences are the low perplexity (low statistical “surprise”) path.
Codex also has a tendency to apply unwanted styles everywhere.
I see similar tendencies in backend and data work, but I somehow find it easier to control there.
I'm pretty much all in on AI coding, but I still don't know how to give these things large units of work, and I still feel like I have to read everything but throwaway code.
But yeah, I saw a suggestion about adding a long-lived agent that would keep track of salient points (so kinda memory) but also monitor current progress by main agent in relation to the "memory" and give the main agent commands when it detects that the current code clashes with previous instructions or commands. Would be interesting to see if it would help.
Purely anecdotal.
Not surprised to see this: once again, because some of us didn't like history as a subject, lines of code becomes a performance measure, like a pissing contest.
I've had success with greenfield code, followed by frustration when asking for changes to that code due to over-editing.
And prompting for "minimal changes" does keep the edits down. In addition to this instruction, adding specifics about how to make the change and what not to do tends to get results I'm looking for.
"add one function that does X, add one property to the data structure, otherwise leave it as is, don't add any new validation"
The cynic in me thinks it's done on purpose to burn more tokens. The pragmatist however just wants full control over the harness and system prompts. I'm sure this could be done away with if we had access to all the knobs and levers.
We do, just tell it what you want in your AGENTS.md file.
Agents also often respond well to user frustration signs, like threatening to not continue your subscription.
From the phrasing, I can't but imagine you as a very calm, completely unemotional person that only emulates user frustration signs, strategically threatening AI that you'll close your subscription when it nukes your code.
In this case I would ask for smaller changes and justify every change. Have it look back upon these changes and have it ask itself are they truly justified or can it be simplified.
It's impossible to properly review this in a reasonable time and they always introduce tons of subtle bugs.
I am surprised Gemini 3.1 Pro is so high up there. I have never managed to make it work reliably so maybe there's some metric not being covered here.
"Do not modify any code; only describe potential changes."
I often add it to the end when prompting to e.g. review code for potential optimizations or refactor changes.
With LLMs, you glimpse a distant mountain. In the next instant, you're standing on its summit. Blink, and you are halfway down a ridge you never climbed. A moment later, you're flung onto another peak with no trail behind you, no sense of direction, no memory of the ascent. The landscape keeps shifting beneath your feet, but you never quite see the panorama. Before you know it, you're back near the base, disoriented, as if the journey never happened. But, confident, you say you were at the top of the mountain.
Manual coding feels entirely different. You spot the mountain, you study its slopes, trace a route, pack your gear. You begin the climb. Each step is earned steadily and deliberately. You feel the strain, adjust your path, learn the terrain. And when you finally reach the summit, the view unfolds with meaning. You know exactly where you are, because you've crossed every meter to get there. The satisfaction isn't just in arriving, nor in saying you were there: it is in having truly climbed.
With LLM-assisted coding, you skip the trek and you instantly know that’s not it.
I think we should move to semi-autonomous steerable agents, with manual and powerful context management. Our tools should graduate from simple chat threads to something more akin to the way we approach our work naturally. And a big benefit of this is that we won't need expensive locked down SOTA models to do this, the open models are more than powerful enough for pennies on the dollar.
How do you emulate that with LLMs? I suppose the objective is to get variance down to the point it's barely noticeable. But I'm not sure it'll get to that place based on accumulating more data and re-training models.
You can get pretty close to reproducible output by narrowing the scope and using certain prompts/harnesses. As in, you get roughly the same output each time with identical prompts, assuming you're using a model which doesn't change every few hours to deal with load, and you aren't using a coding harness that changes how it works every update. It's not deterministic, but if you ask it for a scoped implementation you essentially get the same implementation every time, with some minor and usually irrelevant differences.
So you can imagine with a stable model and harness, with steering you can basically get what you ask it for each time. Tooling that exploits this fact can be much more akin to using an autocomplete, but instead of a line of code it's blocks of code, functions, etc.
A harness that makes it easy to steer means you can basically write the same code you would have otherwise written, just faster. Which I think is a genuine win, not only from a productivity standpoint but also you maintain control over the codebase and you aren't alienated or disenfranchised from the output, and it's much easier to make corrections or write your own implementations where you feel it's necessary. It becomes more of an augmentation and less of a replacement.
There are diminishing returns, and moreover this idea that people are "holding it wrong" and need to master the complexity goes against all that has been done over the past 30 years: making things simpler.
The version it puts down into documents is not the thing it was actually doing. It's a little anxiety-inducing. I go back to review the code with big microscopes.
"Reproducibility" is still pretty important for those trapped in the basements of aerospace and defense companies. No one wants the Lying Machine to jump into the cockpit quite yet. Soon, though.
We have managed to convince the Overlords that some teensy non-agentic local models - sourced in good old America and running local - aren't going to All Your Base their Internets. So, baby steps.
The solution to this is to use quality gates that loop back and check the work.
I'm currently building a tool with gates and a diff regression check. I haven't seen these problems for a while now.
https://github.com/tim-projects/hammer
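The gate-and-loop structure is simple to express. A minimal sketch, where `generate` stands in for the agent and `check` for the gate (tests, lint, a diff-size limit); both are placeholders, not the linked tool's actual interface:

```python
def gated(generate, check, max_rounds=3):
    """Run the agent, run the checks, feed failures back until clean."""
    feedback = None
    for _ in range(max_rounds):
        work = generate(feedback)   # agent attempt, given prior failures
        ok, feedback = check(work)  # quality gate: (passed?, failure report)
        if ok:
            return work
    raise RuntimeError(f"gate still failing after {max_rounds} rounds: {feedback}")
```

The point of the loop is that the agent never gets to declare its own work done; only the gate does.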
I re-did an entire UI recently, and when one of the elements failed to render I noticed the old UI peeking out from underneath. It had tried just covering up old elements instead of adjusting or replacing them. Like telling your son to clean his room, so he pushes all the clothes under the bed and hopes you don't notice LOL
It saves 2 hours of manual syntax wrangling but introduces 1.5 hours of cleanup and sanity checking. Still a net productivity increase, but I'm not sure if it's worth how lazy it seems to be making me (this is an easy error to correct, I'm sure, but meh, Claude can fix it in 2 seconds so...)
Too many people are treating the tools as a complete replacement for a developer. When you are typing a text to someone and Google changes a word you misspelled to a completely different word and changes the whole meaning of the text message do you shrug and send it anyway? If so, maybe LLMs aren't for you.
...and that led me to believe that AI might be very capable to develop over-engineered audio equipment. Think of all the bells and whistles that could be added, that could be expressed in ridiculous ways with ridiculous price tags.
Counterpoint: no it isn't
> makes this job dramatically harder
No it doesn't