There’s a thing that happens, after a few weeks of working with coding agents at a steady pace, where you stop thinking of yourself as the person typing and start thinking of yourself as the person seeing. The Latin word for vision is visio, from videre, “to see”; the Italian visione and English vision both keep that. It’s a much older idea than the modern “mission statement on a slide” usage. It means: I have, in my head, a picture of where this should go.
That picture is the thing I’m responsible for, now. The typing has been outsourced.
This blog post is an attempt to write down, honestly, what my day-to-day looks like a few weeks into running this experiment at full pace. It is also, as a worked example, the story of how SwiftBash, SwiftScript, SwiftPorts, and SwiftJS (four projects you’ve read about here, each with its own announcement post) finally clicked together into a single thing this weekend. It’s not a post about any of those four projects specifically. They have their own posts. It’s a post about the loop that built them.
The role of the experienced engineer
The thing I don’t have to do anymore is type. Not the code, not the tests, not the issue text, not the PR description, not the commit messages, not the review responses. All of that is agent work now. What I bring is upstream of all of it, the picture in my head: where these pieces want to fit, what shape the seams between them should be, which abstraction belongs in which package, what doesn’t yet exist but is going to need to.
It turns out a coding agent is brilliant at the mechanics, including the writing-things-down mechanics, and almost completely without opinion about which thing should be built. You have to bring the opinion. You have to have, very clearly, a picture of what good looks like, because the agent will happily produce code (and issues, and PRs) that look plausible all the way to the test failures, and your only edge is that you know what the answer should roughly resemble before the agent starts.
That’s the experienced-software-engineer part. The vision. The visio.
What follows is, structurally, my outer loop. It runs all day. There are usually two or three of these going in parallel across different repos.
The loop, step by step
1. An idea gets fleshed out into a GitHub issue.
The idea is mine. Everything else about the issue isn’t. I describe what I’m seeing (sometimes a sentence, sometimes a paragraph, often just a half-formed nudge) and an agent does the research, reads the surrounding code, sketches alternatives, asks me clarifying questions, and ultimately writes the full issue text: motivation, current state, proposed shape, acceptance criteria, out-of-scope. I edit. We go a couple of rounds. The acceptance bullets matter the most: they’re the answer key for everything that comes later. Without them, the agent that picks the issue up has nothing to hill-climb against. With them, almost everything else is mechanical.
The issues in the SwiftBash repo, the SwiftPorts one, and the new ShellKit repo all look like this. They’re long. They’re long on purpose. And essentially none of the prose in them was typed by me.
2. An agent picks up the issue and works on it locally.
Local development, local tests, a branch, eventually a pull request. The agent runs the test suite before opening the PR. Most of the time it’s green when I see it.
This is the step where I have to be present the most, even though it looks like it should be the most autonomous. The agent will hit a junction, usually a “should this live in package A or package B?” or an “I see two ways to model this; the type system doesn’t decide between them”, and stop and ask. Eighty percent of those questions are answered with “do it” or “make it so” (Picard, on the bridge of the Enterprise, has been a remarkably useful role model for this kind of work). But the other twenty percent are taste questions: I can see a simpler path the agent didn’t consider, and if I don’t tell it, Opus will earnestly produce the more elaborate one. Without the human at this junction, the agent overbuilds. Quietly, plausibly, but it overbuilds.
The frustrating shape this takes is that the agent will be working away while I’m at lunch or asleep, and at some point it’ll hit one of these questions, stop, and wait. I come back to the keyboard and realise nothing has happened in the last hour because of a stupid clarification it could have asked anyone. There is, right now, no good answer to “agent needs human input but human is not at the keyboard”. This is the part of the loop I’m not yet sure how to tighten.
3. Codex critiques the PR.
This is the step I’m most surprised I came to depend on. I have Codex configured to review every PR I open, and the comments are genuinely useful often enough that I read every single one. Maybe it’s a missing edge case. Maybe a path that wasn’t shell-quoted. Maybe an HTTP method that wasn’t in the allow-list. Maybe a Shell.current that should have been read but wasn’t.
The agent that wrote the PR addresses each comment: 👍 for the good catches (the majority; these reviews are a real second pair of eyes), 👎 for the false positives (rare, but they happen), and the conversation gets marked resolved. I read along.
4. CI runs on every commit. Five platforms.
Every push, GitHub Actions runs the full build-and-test matrix on macOS, iOS, Linux, Android, and Windows. The ambition, for SwiftBash and the surrounding projects, is that all five stay green forever. The CI configuration that gets us there is its own story, told in Four Green Checkmarks. It’s now five.
GitHub Actions is, to be honest, not fast. A Windows job can take half an hour; the full matrix takes longer. It’s also not free at scale: the only reason this loop is economically viable for me is that all of these projects are open source, and GitHub gives unlimited Actions minutes to public repositories. That’s the deal that makes the whole thing work, and it’s the reason I’m unlikely to start a closed-source experiment any time soon. The big payoff: I do not have to maintain a Windows box, an Android emulator, or a Linux VM on my desk. I write Swift on a Mac and watch four other platforms tell me whether I broke them.
5. The agent watches CI and reacts.
Some platforms are well-behaved. Some, Windows especially, have opinions. A test that quotes a path with forward slashes passes everywhere except where backslashes survive bash quoting differently. A POSIX exec-bit check is meaningless on a filesystem that doesn’t have one. getpid() is deprecated under WinSDK. BOOL is bridged differently. getaddrinfo lives behind a different import.
Each of these is small. Most of them are five-line fixes once you understand the platform quirk. The art is fixing them in a way that doesn’t un-fix the other four platforms.
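To make the shape of such a fix concrete, here is a minimal sketch of two of the quirks above handled in one place. It is not code from SwiftBash or its sibling repos; the function names and the Windows extension list are assumptions made up for illustration. The only real APIs used are Foundation’s ProcessInfo, FileManager, and URL.

```swift
import Foundation

/// Process ID via Foundation, so no platform needs a direct getpid() call.
func currentProcessID() -> Int32 {
    ProcessInfo.processInfo.processIdentifier
}

/// "Could this file be run?" answered per platform.
/// Hypothetical helper: the extension list below is an illustrative assumption.
func looksExecutable(atPath path: String) -> Bool {
    #if os(Windows)
    // No POSIX exec bit here, so fall back to a file-extension heuristic.
    let runnable: Set<String> = ["exe", "bat", "cmd", "com"]
    let ext = URL(fileURLWithPath: path).pathExtension.lowercased()
    return runnable.contains(ext) && FileManager.default.fileExists(atPath: path)
    #else
    // POSIX platforms: let Foundation consult the permission bits.
    return FileManager.default.isExecutableFile(atPath: path)
    #endif
}
```

Keeping the `#if os(Windows)` branch inside one small function means the callers stay identical on every platform, which is what keeps a Windows fix from disturbing the other four.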
6. Hill climbing.
This is the part of the day where the agent and I are at our most useful to each other. The success criterion is binary (all five checkmarks green) and there’s a finite stack of small, mechanical, platform-shaped failures to grind through. The agent reads the CI log, identifies the platform quirk, makes the fix, pushes, and waits. A Windows CI run can take up to thirty minutes. Sometimes a single iteration takes one fix; sometimes a fix surfaces a new failure underneath. You climb. You watch the altimeter.
This is where Opus is at its most quietly impressive. It’s not glamorous work. It’s patient, specific, mechanical work, exactly the kind of work that humans get bored with and start cutting corners on after the third Windows-only branch. The agent doesn’t get bored. As long as I keep the success criterion sharp (“five green, no continue-on-error shortcuts”), it keeps climbing. The recent push to lift Windows from “advisory” to “committed” (a few dozen platform-specific fixes, ending with a five-line workflow change to delete the continue-on-error: true gates) happened in a single focused stretch on the evening of May 8, somewhere between dinner and bedtime. About two and a half hours, end to end, against build steps that take half an hour each.
7. Five green checkmarks. Merge.
Repeat.
A cousin loop: external PRs
Increasingly the PRs I’m looking at are not from agents I started. They’re from outside contributors. And those contributors, increasingly, are using coding agents themselves.
This is a strange and slightly recursive new pattern. I have one of my coding agents review the incoming PR, sometimes with a few questions of my own sprinkled in, and the conversation that emerges is, in effect, two or three coding agents and two humans collaborating on the shape of a change. Sometimes the PR has missed a design consideration I had in my head; I’ll ask for changes. Sometimes one PR has lumped together three separable improvements; I’ll ask for it to be split. Sometimes the PR is just good (five green checkmarks, the design fits, Codex is happy) and I merge.
That is also part of the outer loop. The vision-holding extends across the boundaries of the repo.
The example: how SwiftBash and friends clicked together
Now the worked example. I’ll keep it deliberately high-level; each of these projects has its own announcement post for the gory detail.
Two weeks ago there was SwiftBash: a sandboxed bash interpreter, in pure Swift, with no Process and no fork. Then SwiftScript: a tree-walking Swift interpreter that needs no toolchain. Then SwiftPorts: pure-Swift reimplementations of gh, glab, git, jq, and the compression family. Then SwiftJS: a Node-shaped runtime on JavaScriptCore.
Four good pieces. Four separate good pieces. Each had its own private notion of “where does stdout go, what’s the working directory, what am I allowed to read, who am I.” The seams between them didn’t yet line up.
I could see, in my head, how they should fit. There needed to be a fifth package, a tiny one, that owned the runtime context: stdio, environment, sandbox, network policy, identity. The four runtimes would each adopt it. The bash interpreter would still own bash semantics; the Swift interpreter would still own Swift semantics; the JS runtime would still own JavaScriptCore. But the shell context, the substrate they all shared, would be one package. I called it ShellKit.
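As a rough illustration of what “one package owns the runtime context” can mean, here is a toy sketch. Every name in it (RuntimeContext, mayRead, mayConnect, the field names) is hypothetical and invented for this example; it is not ShellKit’s actual API, only one plausible shape for a shared value that several runtimes could consult.

```swift
import Foundation

// Hypothetical illustration only: these names are NOT ShellKit's API.
struct RuntimeContext {
    var environment: [String: String] = [:]       // environment variables
    var workingDirectory: String = "/"            // current directory
    var readableRoots: [String] = []              // sandbox: directories that may be read
    var allowedHosts: Set<String> = []            // network policy: host allow-list
    var user: String = "nobody"                   // identity
    var stdout: (String) -> Void = { print($0) }  // where output goes

    /// True when `path` lies under one of the readable roots.
    /// Paths containing ".." are rejected outright to keep the check simple.
    func mayRead(_ path: String) -> Bool {
        if path.split(separator: "/").contains("..") { return false }
        return readableRoots.contains { root in
            let prefix = root.hasSuffix("/") ? root : root + "/"
            return path == root || path.hasPrefix(prefix)
        }
    }

    /// True when `host` is on the allow-list (compared case-insensitively).
    func mayConnect(to host: String) -> Bool {
        allowedHosts.contains(host.lowercased())
    }
}

// One context value, handed to whichever interpreter runs the next stage.
var context = RuntimeContext()
context.readableRoots = ["/work"]
context.allowedHosts = ["api.github.com"]
context.stdout(context.mayRead("/work/notes.txt") ? "read allowed" : "read denied")
```

In a sketch like this, each runtime keeps its own language semantics and only asks the shared value what it is allowed to touch.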
That was the vision. Turning it into a stack of issues (one per repo, each with its own motivation and acceptance criteria) was an evening’s worth of back-and-forth with an agent that did the actual writing. The agent then implemented it across three repos over a single weekend: ShellKit got published, SwiftBash adopted it, SwiftPorts adopted it, the JavaScript runtime got every host-touching surface gated on the new shared Shell type, the SwiftScript shebang dispatch dropped to a five-line bridge. From the first ShellKit-adoption commit to the last green-Windows CI run was about seventeen hours of wall-clock time, almost all of it the agent toiling away on PRs while I pointed and reviewed.
What that produced is something I’m genuinely happy about: a single, composable Shell you can build up with bash commands, Swift port CLIs, a Swift interpreter, and a JS interpreter, hand a sandbox and a network allow-list to, and run a polyglot pipeline through. None of the seams creak. The bash shell pipes into jq which pipes into a Swift script which calls fetch from JavaScript, and every step honors the same sandbox.
Again, none of that paragraph is the point of this post. The point is: I had a picture, the picture was correct, and the loop took it from picture to working substrate over a weekend.
Some unexpected revelations
A handful of things have surprised me about working this way for the past few weeks.
The role inversion is real, and quieter than I expected. I thought I would feel less useful. I feel more useful. The decisions I make at the issue stage (what to include, what to scope out, what counts as done) propagate through the loop with extraordinary leverage, and they’re the only decisions that are still mine to make. A clear idea produces a clear issue produces a clean PR. A muddy idea produces a muddy issue produces a muddy PR. The thinking I do up front, before the agent ever drafts a word, dwarfs anything I’d save by skipping it.
Codex genuinely catches things. This is the one that surprised me most. I expected agent-on-agent review to be a kind of theatre. It’s not. The Codex comments find real bugs and real missed edge cases at a hit-rate that would be respectable for a careful human reviewer. I’d estimate ninety percent of the comments are worth a 👍 and an actual fix. Two coding agents conversing about a PR is a meaningfully different review than one agent acting alone.
Hill-climbing is exactly as good as the altimeter. Opus 4.7 is superb at the patient, repetitive, “fix one platform-quirk at a time” work, but only if the success criterion is unambiguous. The most frustrating moments in the Windows climb were the ones where the CI log was being truncated and the signal of progress was missing. The hill is climbable; the altimeter has to be reading.
Five platforms changes how you write code. When every commit gets immediately tested on macOS, iOS, Linux, Android, and Windows, your default reflexes change. You stop reaching for Foundation.Process. You stop assuming POSIX. You design for the smallest common surface and add platform-specific niceties on top, rather than the other way around. This isn’t a discipline I would have adopted on my own; the matrix imposed it, and I’m grateful it did.
The days of “weeks of work” are over for a whole class of task. The four-projects-clicking-together work in this post would have been, optimistically, two or three weeks of human effort. It was a weekend. I keep mentally re-calibrating what counts as a reasonable size for “this afternoon’s project”. The answer keeps growing.
How would you tighten this loop?
I’m going to close on the question I most want to ask other people, because I think the readers of this blog are exactly the people who’d have an answer.
There are a few soft spots in my loop that I have not yet figured out how to fix. They aren’t technical bottlenecks so much as coordination bottlenecks, and I think they’re the next interesting frontier.
The “agent stuck on a question while I’m asleep” problem. I described this above. The agent hits a taste question, stops, waits. The clock keeps running and nothing happens. I’d love a setup where the agent can post the question into a channel I’m watching from anywhere (Discord, Slack, an iOS notification, whatever) and I can reply with one tap, and it picks up where it left off. The pieces all exist. Nobody, as far as I can tell, has wired them together yet. If you have, please tell me.
The “manually pinging the reviewer” problem. Right now, when an external PR comes in, I still have to go to the right topic thread on my Discord and tell my reviewer-bot (“OpenClaw”) to Review PR #N. The review is good when it lands. The pinging is silly. Ideally the moment GitHub sends the new-PR notification, OpenClaw spins up, reviews the diff, and presents me with a one-screen summary plus two buttons: Good to merge and Request these changes. I tap one. Done. The review-trigger step shouldn’t need a human.
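One small piece of that wiring can be sketched without knowing anything about the bot itself: turning GitHub’s pull_request webhook payload into the same “Review PR #N” message that is typed by hand today. The sketch below assumes a webhook receiver already exists and hands over the raw request body; the type and function names are hypothetical, and how the message reaches the reviewer bot is deliberately left out. The `action` and `number` fields and the action values are part of GitHub’s documented payload.

```swift
import Foundation

/// The two fields of GitHub's `pull_request` webhook payload this sketch needs.
struct PullRequestEvent: Decodable {
    let action: String
    let number: Int
}

/// Hypothetical helper: returns the message for the reviewer bot,
/// or nil when the event should not trigger a review.
func reviewCommand(forWebhookBody body: Data) -> String? {
    guard let event = try? JSONDecoder().decode(PullRequestEvent.self, from: body) else {
        return nil  // not a pull_request payload, or malformed JSON
    }
    // Only react when a PR becomes reviewable, not on every label or comment.
    let triggers: Set<String> = ["opened", "reopened", "ready_for_review"]
    guard triggers.contains(event.action) else { return nil }
    return "Review PR #\(event.number)"
}
```

Returning an optional keeps the decision (“does this event deserve a review?”) separate from the delivery, so the same function could feed Discord, Slack, or anything else.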
Local CI runners for hill-climbing. Thirty-minute Windows builds on GitHub Actions are fine for the merge gate, but they’re a lousy iteration surface. I keep meaning to look at running a Windows VM and an Android emulator on something local, even just for the noisy hill-climb phase, before the change goes back up to the cloud matrix as the canonical check. If you’ve automated this in a way that doesn’t double your maintenance burden, I’d genuinely like to read about it.
There are surely more soft spots; the four I’ve listed (issue-stage questions, agent-asleep questions, manual review pings, slow CI hill-climbing) are the ones I bump into every day. I’m sure the reader bumps into different ones. I’d like to know which.
If you’ve found a way to tighten any part of this loop on an OSS project of your own (your own outer-loop diagram, your own Discord-and-bot incantation, a self-hosted runner setup that earns its keep, an “agent asks, human one-taps” flow that already works) I’d love to hear it. The repos are at Cocoanetics/SwiftBash, Cocoanetics/SwiftScript, Cocoanetics/SwiftPorts, and the new Cocoanetics/shellkit. Open an issue, write to me, or, better yet, post how you’ve solved a piece of this and link me the diagram. Right now everyone seems to be inventing their outer loop in private. I’d like for that to stop being the case.