Analytical Writing as Applied AI: a Method, Built From a Phone in Japan

NB: The research bundle behind this lives here: github.com/rorads/open-web-analysis – frozen rubric, raw evidence, scoring code and figures.

I’m writing this from a train in Japan, on holiday, with no laptop. The thing I want to talk about isn’t the holiday, though – it’s that an entire analytical project (a rubric, ~195 short texts scored against it, the aggregation code, the charts) got built from my phone, by talking to Claude Code through its iOS app. A year ago I might have managed the same, but the up-front work to get there would have been significant, and the outputs would have been questionable.

I’ve wanted to write more analytical pieces for a while – small, honest investigations where the evidence is scattered across the open web and the fun is in turning something qualitative into something you can count. I’ve done bespoke versions before: a graph-database take on aid traceability and a personal “Spotify Wrapped” from an old iTunes library. What I wanted this time was a repeatable method, so the limiting factor becomes which question is worth asking rather than how to wire it all up.

The method, briefly

The shape is deliberately boring – boring is what makes it trustworthy:

flowchart LR
    Q([A question]) --> R[Freeze the rubric]
    R --> C[(Build the corpus<br/>raw evidence)]
    C --> S[Score against rubric<br/>grounded, LLM agents]
    S --> A[Aggregate<br/>indices + fingerprint]
    A --> V[Visualise]
    V --> P([Blog post])
    S -. spot-check & recalibrate .-> R

    style Q fill:#2d8cff,stroke:#fff,stroke-width:2px,color:#fff
    style R fill:#1e293b,stroke:#fff,stroke-width:1px,color:#fff
    style C fill:#6366f1,stroke:#fff,stroke-width:2px,color:#fff
    style S fill:#10b981,stroke:#fff,stroke-width:2px,color:#fff
    style A fill:#f59e42,stroke:#fff,stroke-width:2px,color:#fff
    style V fill:#38bdf8,stroke:#fff,stroke-width:2px,color:#fff
    style P fill:#f43f5e,stroke:#fff,stroke-width:2px,color:#fff

Freeze the rubric before scoring anything; keep raw evidence immutable; ground every score in a quoted line rather than the model’s memory; then spot-check a sample, blind, to catch where it drifts. It’s the same provenance-first instinct behind the IATI work – build a tidy, inspectable data plane and let an agent reason over it, with the receipts attached.
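
Those four rules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in the repo: the rubric themes, the `Score` fields, the 0-3 scale and every function name here are hypothetical.

```python
import hashlib
import json
import random
from dataclasses import dataclass

# Hypothetical rubric: theme names and wording are illustrative only.
RUBRIC = {
    "belligerence": "References to war, enemies, weapons or bloodshed.",
    "landscape": "References to land, rivers, mountains or sea.",
}


def rubric_fingerprint(rubric: dict) -> str:
    """Hash the rubric so any edit made after scoring starts is detectable."""
    canonical = json.dumps(rubric, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Score:
    text_id: str
    theme: str
    value: int  # assumed 0-3 scale
    quote: str  # the line the model says justifies the score


def normalise(text: str) -> str:
    """Collapse whitespace and case so line breaks do not defeat matching."""
    return " ".join(text.split()).casefold()


def is_grounded(score: Score, corpus: dict[str, str]) -> bool:
    """Accept a score only if its quote really appears in the raw text."""
    raw = corpus.get(score.text_id)
    if raw is None or score.theme not in RUBRIC:
        return False
    quote = normalise(score.quote)
    return bool(quote) and quote in normalise(raw)


def blind_sample(scores: list[Score], k: int, seed: int = 0) -> list[tuple[str, str]]:
    """Pick items for human review, hiding the model's value and quote."""
    rng = random.Random(seed)
    picked = rng.sample(scores, min(k, len(scores)))
    return [(s.text_id, s.theme) for s in picked]
```

Recording `rubric_fingerprint(RUBRIC)` alongside every batch of scores is one way to show later that the rubric did not move mid-run, and a score that fails `is_grounded` would go back for rescoring rather than into the dataset.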

The first piece run through this is a slightly daft one about national anthems, and how much of their ‘belligerence’ they quietly drop from the verses people actually sing. I’ve written it up separately: what national anthems quietly stopped singing. The point here is the machinery, not the anthems.

What this replaces

There was a time I’d have reached for a topic model. Years ago I gave an ODI lunchtime lecture on exactly this sort of problem – using Latent Dirichlet Allocation to pull structure out of free text where rigid codelists fell short. LDA is a perfectly good tool, but it is hungry: it wants a large corpus, and it repays you with clusters of co-occurring words that you then have to squint at and name yourself. Point it at 195 short poems and you get mush.

The rubric-and-LLM approach wants the opposite of a big corpus – a few hundred short texts suit it nicely. You decide the themes up front, in plain language, and the model judges each text against them, quoting the line behind every score. The result reads more like a marking scheme than a word cloud, which makes it far easier to trust, and to argue with. Work that used to need a sizable corpus and an afternoon of interpretation now runs on a few hundred documents and gives back something a human can read straight off. That shift – from inferring fuzzy topics to scoring an explicit rubric on a small corpus – is what makes a project like the anthems one worth doing at all.

What actually changed

I’ve been doing AI-assisted work for years, so it’s worth being precise about what’s different this time, because it isn’t simply “the model got better” (though it did).

What changed is that the product surfaces and integrations arrived at the same moment the models became, for this kind of work, almost bulletproof. Claude Opus 4.5+ and GPT 5.2+ have crossed the line where I stop re-reading everything they write. When the domain is one I’d consider trivial given some effort and a bit of research, I generally trust the code completely. I’m now delegating rather than supervising, and the difference is mostly trust, which I think these models have now earned.

And the specific thing that made today possible is a stack, not a single tool: a strong model, plus GitHub as the substrate, plus cloud-provisioned ephemeral compute (Claude Code’s web/app version spins up a throwaway container, clones the repo, does the work, commits, and is reclaimed). The pattern that fell out of that was agentic and asynchronous – Claude Code fanned the anthem work across a dozen-plus background agents, scoring batches in parallel, and pinged me as each finished. My job shrank to the bits that genuinely need a human: setting the rubric, sanity-checking a sample, making the judgement calls (is the Pope a “monarch”?), and saying “ship it.” I’d kick off a fan-out, lock my phone, go and look at a temple, and come back to a validated dataset and a tidy commit. The laptop wasn’t a bottleneck because there was no laptop.
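
The fan-out itself was handled by Claude Code's background agents, not by anything I wrote, and I am not describing how it works internally. The general shape (split the corpus into batches, score them concurrently, collect results as each finishes) can still be sketched with Python's standard library. In the sketch below `score_batch` is a placeholder standing in for one agent's work, and the batch size and worker count are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def score_batch(batch: list[str]) -> dict[str, int]:
    """Placeholder for one agent's work: returns a dummy score per text id."""
    return {text_id: 0 for text_id in batch}


def fan_out(text_ids: list[str], batch_size: int = 16, max_workers: int = 12) -> dict[str, int]:
    """Score batches concurrently and merge results as each batch completes."""
    results: dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(score_batch, b) for b in chunk(text_ids, batch_size)]
        for future in as_completed(futures):
            # This is the point where a "batch finished" notification would fire.
            results.update(future.result())
    return results
```

With 195 text ids and a batch size of 16 this produces 13 batches, which is roughly the "dozen-plus" scale described above.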

What this means at work (and why I can’t do this at work)

Here’s the uncomfortable part: almost none of this is something I could do in my actual job.

The whole loop hinges on two things a holiday hobby can be relaxed about and a bank or a government absolutely cannot: authentication and trusted code execution. An agent that can clone a repo, run arbitrary code in a cloud container, and push commits is wonderful when the repo is a public hobby project and the blast radius is “a chart looks wrong.” Point the same capability at systems with real data and real consequences and every one of those steps becomes a controlled, audited, scoped decision. I can operate one-handed on a train precisely because no one sensible would let me operate this way on anything that matters.

Which is, I think, the actual story of enterprise and government AI adoption right now. The models are ready; the plumbing is the work. The high-impact effort isn’t prompting – it’s building the enabling infrastructure: identity and access management that agents can participate in safely, execution sandboxes with real guarantees, scoped credentials, provenance and audit baked in so a plain-language answer arrives with its receipts. That’s unglamorous platform work, and it’s exactly where the value (and the difficulty) sits.

It’s also where the moats are being dug. Simon Willison’s argument that Anthropic and OpenAI have found product-market fit – and his read on the token economics underneath (he worked out he’d run through about $2,180 of tokens in a month against a $200 subscription) – lands naturally here: the consumer-grade magic is, for now, generously subsidised and genuinely good, which is why I can burn tokens fan-out-scoring anthems from a proverbial deckchair without thinking about the bill. The defensible enterprise thing they’re selling is the trusted surface around it that lets a regulated organisation hand an agent the keys without flinching. The labs (and a lot of vendors) know this, and are building exactly that perimeter.

What’s next

More of these, now the scaffolding exists. If you’ve a question whose answer is buried in a pile of open-web text and wants to be a chart, that’s the shape I’m collecting. The anthems piece is first out of the gate.