The Fiefdom of Files

The Spelling of Others

18th of February, 2026

We're using language models to automate the wrong things; rather than removing creativity and human agency, let's automate the boring stuff. Independently, it would be great if there was a way to use language models to be more kind to other people, instead of trying to directly derive economic value from them all the time. So I was wondering: What is something that I do that is (1) boring, (2) automatable, and (3) kind? One answer is spel checkng. When reading blogs, I sometimes find errata and mail them to the author. It's repetitive, automatable, and usually the author appreciates the gesture.

The problem decomposes into the following steps. First we have to get inbound, maybe a couple thousand pieces of writing that plausibly have errors in them. Then, we identify the errors and the author's mail address using a language model. Finally, we send out our well-meaning spam.

As for the source of blogs by potentially receptive but still error-prone authors, the Hacker News front page is ideal. It is also famously easy to query; we use the Algolia endpoint. As a first filter for websites we don't care about, we filter out about 100 hand-picked common domains that are not blogs. The links are then crawled and fed to a language model. The model's job is to classify whether the page belongs to a single person (we only want to help out individuals for now) and to list spelling errors with confidence scores. The latter is why modern language models are the enabling technology here; spellcheckers have been around forever, but only now can we specify what kinds of errors we have in mind, and get meaningful probabilities of how sure the model is. When an error is found, another model is tasked with finding the email address, with a budget of two more hops on the author's website. Since we want to operate at scale, and don't need that much intelligence, we use a small model, Haiku 4.5. We fetch the posts on today's front page as a pilot.
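To give a sense of the ingestion step, here is a minimal sketch. The endpoint and the `front_page` tag are part of the public Algolia HN API; the blocklist contents and helper names are just illustrative.

```python
import requests
from urllib.parse import urlparse

# Hand-picked domains that are clearly not personal blogs (illustrative subset).
NON_BLOG_DOMAINS = {"github.com", "youtube.com", "nytimes.com", "arxiv.org", "en.wikipedia.org"}

def front_page_links():
    """Fetch the current HN front page via the public Algolia endpoint."""
    resp = requests.get(
        "https://hn.algolia.com/api/v1/search",
        params={"tags": "front_page", "hitsPerPage": 30},
        timeout=10,
    )
    resp.raise_for_status()
    for hit in resp.json()["hits"]:
        url = hit.get("url")
        if not url:
            continue  # skip Ask HN / text posts with no external link
        domain = urlparse(url).netloc.removeprefix("www.")
        if domain not in NON_BLOG_DOMAINS:
            yield url

if __name__ == "__main__":
    for link in front_page_links():
        print(link)
```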

Curiously, even for a task as simple as spellchecking (in 2026!), it's hard to get the model to consistently output what was intended. British vs. American spelling, slang, creative neologisms, stylizations ("oh nooooo" vs. "oh no"), text encoding mishaps, and more all lead to false positives. Even worse, once in a while the model fails for no particular reason, flagging something completely correct. Luckily, the confidence score plus some prompting affords enough maneuvering room to bring the false positive rate way down. We don't care much about false negatives, so I didn't spend much time debugging those. However, even after many iterations, the system is not robust enough to skip manual review; I still need to have at least a quick look at every flagged error. As for fetching email addresses, there are also many edge cases to consider, but those are more from the realm of web crawling, which is beside the point here.
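For concreteness, a stripped-down sketch of the classification call. The JSON contract, field names, and threshold are my own illustration, and the real prompt and ruleset went through many more iterations; the model ID is the Haiku name as of writing and may need adjusting.

```python
import json
from anthropic import Anthropic

client = Anthropic()

PROMPT = """You are checking a blog post for unintentional spelling errors.
Ignore British/American variants, slang, neologisms, stylization, code, and quotes.
Return bare JSON: {"single_author": bool, "errors": [{"wrong": str, "correct": str,
"context": str, "confidence": float}]} with confidence in [0, 1].

POST:
{post}
"""

CONFIDENCE_THRESHOLD = 0.9  # illustrative value, tuned against manual review

def check_post(post_text: str) -> list[dict]:
    msg = client.messages.create(
        model="claude-haiku-4-5",  # small, cheap model; adjust to whatever is current
        max_tokens=2048,
        messages=[{"role": "user", "content": PROMPT.replace("{post}", post_text)}],
    )
    # Assumes the model returns bare JSON; production code would validate this.
    result = json.loads(msg.content[0].text)
    if not result["single_author"]:
        return []  # only help out individual bloggers
    return [e for e in result["errors"] if e["confidence"] >= CONFIDENCE_THRESHOLD]
```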

The pilot led to 3/30 posts with addressable errors and available addresses. We are now ready to think about how we want the emails to read, which is arguably the most important part. Importantly, I decided not to hide behind the automation: I put my name on the emails and sent them from my school account. I also want to make it as easy as possible for the site owner to correct the error, so the context of the error is included for easy searching.

And analogously with an HTML list for multiple errors. The wording is a bit embellished in terms of my investment in the article, but since I ultimately manually check every error, I felt comfortable writing it like this. Also, since every post made it at least to the second page of HN, and I generally would want to encourage anyone writing a blog to continue, it seems fine to indiscriminately call everybody's work good.

Before sending the first mails, I stopped for a moment to reassess whether this was a good idea. What ultimately helped me decide was whether I would like to get the mail as an author. The answer was a definite yes; errors make a blog appear less professional, and when readers send me a version of the above email, I'm happy. So I ventured to send mail.

Prospects

For the pilot, a striking 3/3 authors mailed back within a day and explicitly thanked me! Meanwhile, the token cost for the experiment was $0.17. At long last, an engine that converts cash into thank-yous at significantly above market rate. (Which is to say, I'm pretty sure I would have gotten fewer thank-yous if I had donated $0.06 to each author.)

Next, we want to estimate how well the system works on some more samples, gaining confidence before the final run. We check one week of posts. Of the 62 posts with identified errors, I picked out the 31 where the mail address couldn't be determined, to figure out how well the address-fetching agent worked. For 20 of them, I couldn't find the address by manual search either. The remaining 11 I entered into the database by hand and used as additional datapoints to improve the agent. So, we would hope to lose at most 1/3 of errors to insufficient engineering on the email-address-fetching step, which is satisfactory. In at least one instance, the agent found an address I couldn't immediately find myself. Of the 62 posts, I was sufficiently confident to send mail for only 25 after manual review.

Since we don't want to send 2 emails to the same address, we should now move forward to crawling and analyzing every page that's in scope before mailing anyone, which turns out to be ~7000 pages. Naturally, being the brilliant systems programmer that I am, the run hit the Anthropic API so hard my account's access was suspended for a month. Consequently, the email search was done with gemini-3-flash-preview instead. Swapping out the model was surprisingly not a big problem.
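In hindsight, a simple client-side cap on concurrent requests would likely have avoided the suspension. A minimal asyncio sketch, with an illustrative limit and a placeholder where the actual agent call would go:

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 5  # illustrative; keep well under the provider's rate limits
semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

async def find_email(page_url: str) -> str | None:
    """Placeholder for the email-finding agent call (model-agnostic)."""
    async with semaphore:
        # Call the model provider here; the semaphore caps in-flight requests.
        await asyncio.sleep(1)  # crude extra pacing between calls
        return None

async def run(urls: list[str]) -> list[str | None]:
    return await asyncio.gather(*(find_email(u) for u in urls))

if __name__ == "__main__":
    print(asyncio.run(run(["https://example.com/post"])))
```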

In total, the full run yields 1743 posts with flagged errors, of which 743 have addresses, and those contain 1430 errors, all for $28. Next up was the manual validation of those 1430 errors, which proved a fun human-computer interaction problem. I ended up making a custom CLI that allows for one-keystroke classification of the errors, and in it I would spend the next few days of the project. This worked quite well. One marginal improvement that I wish I had made in hindsight is a parallel HTML renderer that fetches the webpages again and auto-scrolls to the mistake, because I had to click on most links and CTRL-F around to get more context. Over a total of ~9 hours, spending about 23 seconds per potential error, I confirmed only 837/1430.
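The review tool itself is nothing fancy; here is a stripped-down sketch of the one-keystroke loop (key bindings, field names, and the input file are my own illustration):

```python
import json
import sys
import termios
import tty

def getch() -> str:
    """Read a single keypress without waiting for Enter (POSIX only)."""
    fd = sys.stdin.fileno()
    old = termios.tcgetattr(fd)
    try:
        tty.setraw(fd)
        return sys.stdin.read(1)
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, old)

def review(errors: list[dict]) -> list[dict]:
    confirmed = []
    for e in errors:
        print(f"\n{e['url']}\n  '{e['wrong']}' -> '{e['correct']}'\n  ...{e['context']}...")
        print("[y] confirm  [n] reject  [q] quit")
        key = getch()
        if key == "q":
            break
        if key == "y":
            confirmed.append(e)
    return confirmed

if __name__ == "__main__":
    with open("flagged_errors.json") as f:  # hypothetical dump from the pipeline
        print(json.dumps(review(json.load(f)), indent=2))
```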

Since spell checking is (seemingly) so simple, it makes for a good toy problem to study surprising failure modes of language models. My reasons for rejecting flagged errors were diverse, including: not wanting to point out errors in the title because it felt too obvious, weird posts, low-effort posts, not actually a blog, the error is in code, or in poetry, weird-looking context, hallucinated or already-corrected error, hallucinated correction, hallucinated context, 404, server timeout, the error is intentional/ironic, it's just an abbreviation, it's just slang, it's a neologism, there are too many errors so the mail would look odd, there's an unflagged error right next to it, the author is just quoting someone else who made the error, the author didn't seem like they should/want to get mail. The biggest issue was comment sections of blogging platforms, as often the error would just be from a sloppy comment below an orthographically impeccable piece. On the other hand, the model did surprisingly well in some regards. For instance, I was surprised to find many zero-shot corrections of names of public figures. I really dread getting names wrong in my posts, so this is a great improvement over old-school spell checkers. In one instance, unexpectedly, it also pointed out a wrong gendered pronoun seemingly zero-shot.

As an aside, the project is heavily enabled by the blogging sphere being in English, a language that famously features arbitrary spelling. English being my second language, I curse it every day and wish it could be more like, say, Hungarian, in which such a thing as a spelling bee would be unthinkable. I certainly freshened up my knowledge going through these errors, even learning one word anew: "to kowtow" to somebody (not to "cow-tow") means to prostrate oneself, or to bow down before somebody. This comes from the Chinese 叩頭, ritualistic head-to-the-ground bowing before rulers or in religious worship.

The Great Mailing

The 837 errors were distributed over 404 addresses, meaning everyone got served 2.07 errors on average. By the way, the most common error by a large margin was "its" vs. "it's", with 20 occurrences. If we were to continue the experiment into the future as is, analyzing the first two pages of HN every day, we'd expect to mail about 1 person per day. Since I was using my university mail account, 404 mails was a sufficient quantity for me to be responsible for once in my life and contact Harvard's email webmaster to explain the plan; luckily, they were cool with it. Soon all of the blogging intelligentsia zoomed across my terminal.

The recipients of the mail campaign aren't passive in this story; on the contrary, some played a reverse card, informing me of issues like my personal website being down. Indeed, the night before I had played fast and loose with the DNS records and misconfigured them. Another recipient pointed out that the title tags on this blog aren't populated correctly. The circle of life, isn't it marvelous? Exactly one astute author (of 25midi.com) figured out that there was an experiment behind this, even using that exact term. In the phrase naming their blog post, the model had added the ever-suspicious em-dash where none was to be found in the original.

So far, one person out of 194 replies got back and informed me about a false positive, i.e. that I mistakenly pointed out something that wasn't an error. Since I would assume a false positive leads to a lower response rate, I would guess the true false positive rate is slightly higher than 0.6%. While not bad at all, I had hoped for around 0.1%, i.e. no observed false positives, because I genuinely hate to bother even a single person with this. I do believe the false positive rate at which the project stops being net-positive in utility is higher, however, maybe 5%.

Speaking of which, two authors replied indicating they had intentionally put in the spelling errors! In a time of slop, spelling errors do have merit, namely as an assurance to the reader that a human wrote the text. The topic of distinguishing posts from language model output came up in many replies, with several authors joking that though they corrected the error, they weren't too sad about it either, because it communicates to readers that a human wrote it. I think this is genuinely a good point, and one which I hadn't considered at the start. Ultimately though, I'd like to argue against spelling errors as proof-of-humanness, because they are very easy to fake. I personally would be easily fooled by a slop pipeline that simply includes an extra step of adding 1 or 2 mishaps. Instead I would advocate for getting weirder on the level of grammar, using your own freaky neologisms, or best of all cultivating a unique style or structure. A more fleshed-out argument is spelled out in That Squiggly, Treacherous Line. This is also why I didn't validate flagged words that are technically wrong and probably unintended, but sounded cool anyway. Some examples: combinated (in place of "combined"), underhood (as an adverb, instead of "under the hood"), SaaSes (plural of SaaS).

Some notable mentions among the errata found were in

Since this blog is also in the source dataset, the pipeline successfully identified two typos in The Fiefdom, which were promptly eliminated.

It's not trivial to find a mail text that works when addressing 272 very different people. For example, when writing the above template, I had a person of roughly equal social standing in mind. However, because I ended up addressing people of comparatively much higher social status, I felt a bit awkward and was relieved to learn that probably nobody took issue. One mail accidentally went out to someone I know, but don't know super well, and who is also more senior than me. That felt weird! I followed up manually there. Overall though, it's remarkable how there is just one international style in which a nobody like me can respectfully address everyone. Nerdy blogging is very egalitarian in comparison to other spaces; I imagine this experiment would go less well in fields like public policy. Studying math in uni, I always found it beautiful how first-year undergraduates regularly can and do point out the professor's mistakes. This somehow gave me conviction that I was studying the right thing, that math had better social norms than other disciplines. I still think that's the case to some degree.

Reception

Boy, did the replies come in. Of the 404 mails sent, I've gotten 194 responses so far, meaning we already have a response rate of ~50% after an average of 5 days since the emails went out. It feels really odd to get an email every five minutes! The experience has had me empathize with professors and celebrities claiming to have this so-called problem. I am always happy about getting mail, so even though these weren't very deep interactions for the most part, I felt strangely validated and was glad I had framed the text as an interpersonal thing rather than an experiment.

The content of the replies was again unanimously positive, some quite strongly worded even. While I had read at least a little of every post, this made me feel pretty bad; normally, someone who reads an article carefully enough to spot orthography mistakes is invested in the content. The authors were thanking me for my interest even though I'd just given the post a few-second skim. That's why I went back through every reply that mentioned the content in any way and made a point of spending at least a few minutes with the post. Of course, any reply with an explicit question received an answer, too. I strongly dislike the expression "to feel seen", for precisely the above reason. Yes, I'd made the authors "feel seen", but they often weren't actually seen, meaning I had to go back and fix that. I of course instituted the same must-read policy for people who were sweet enough to put a note on their page saying that I'd found the error.

Which brings us back to the original purpose of the project; I love the blogging ecosystem, so it's no chore to read even hundreds of posts. In going through them, I particularly enjoyed these:

In turn, many authors remarked they'd been enjoying my blog, and knew The Fiefdom before receiving my mail. It was surprising that people whose work I greatly respect were relating such things, which is encouraging me to continue working out in the open.

Survival

Because I'm a statistician, I couldn't help but work over the email data a bit. In particular, I was interested in the distribution of the time to first reply, and in what factors influence it. We are in the classic right-censored survival analysis setting, where the terminal event is the time of first reply. If someone replied, we know exactly when (no left-censoring), but if someone hasn't replied yet, we don't know if and when they will (right-censoring). Since the overwhelming majority of replies was positive, we don't filter out the 3-4 neutral-sentiment replies. Plotting the Kaplan-Meier estimator against some promising candidates, it turns out that, surprisingly, a Pareto distribution is the best fit. I initially guessed that thinner-tailed distributions like the Weibull would fit better. This replicates Eckmann, Moses and Sergi (2003). The Pareto fit looks very good, so I'm happy to use it to predict the future. It yields a 57.2% response rate after a month, up from the current 49.8% after 5 days, which seems reasonable.

[Figure: Kaplan-Meier estimator of time to first reply, plotted against candidate parametric fits]
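For the curious, the estimator is just a couple of calls with the lifelines library. The column names below are hypothetical, and I show a Weibull fit as the thinner-tailed comparison (a Pareto fit needs a custom parametric fitter, which I omit here):

```python
import pandas as pd
from lifelines import KaplanMeierFitter, WeibullFitter

# Hypothetical columns: hours from send to first reply (or to now, if no reply yet),
# and whether a reply was observed (1) or the observation is right-censored (0).
df = pd.read_csv("mail_log.csv")

kmf = KaplanMeierFitter()
kmf.fit(durations=df["hours_to_reply"], event_observed=df["replied"])
ax = kmf.plot_survival_function()

# A thinner-tailed parametric baseline for comparison.
wf = WeibullFitter()
wf.fit(df["hours_to_reply"], event_observed=df["replied"])
wf.plot_survival_function(ax=ax)

print(kmf.survival_function_.tail())
```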

I also fit a Cox proportional hazards model, with covariates (1) the number of errors included in the mail, (2) the average confidence of the spell-checker model over all errors in the mail (normalized from 0.9-1.0 to -1.0-1.0), and (3) a crude approximation of what percentage of people in EU+USA timezones were awake when the email was sent. The hazard ratios were 0.87 (95% CI 0.73-1.03), 0.96 (95% CI 0.89-1.04), and 3.02 (95% CI 0.60-15.2) for the three covariates, respectively. Not very inspiring; let's just stick with the Kaplan-Meier estimator.
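The Cox fit is similarly short with lifelines, again with hypothetical column names for the three covariates described above:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical columns: time to first reply (or time observed so far), reply indicator,
# number of errors, mean spell-checker confidence, and share of recipients awake at send time.
df = pd.read_csv("mail_log.csv")

cph = CoxPHFitter()
cph.fit(
    df[["hours_to_reply", "replied", "n_errors", "mean_confidence", "awake_share"]],
    duration_col="hours_to_reply",
    event_col="replied",
)
cph.print_summary()  # hazard ratios appear as exp(coef) in the summary table
```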

Reflections

Was this a good idea? Though it seemed risky at the outset, I think the reception was really great for a small time and capital investment; I received 185 response mails including the string "thank". Ignoring the manual labor, we bought thank-yous from a group of very cool and impressive people at $0.15 a piece, which is extremely good. I might continue using this pipeline as a blend of reading list and icebreaker: if I enjoy someone's writing and want to talk to them, it might be a nice gesture to put their blog into the grinder and start the conversation with some errata. Indeed, some good email chains with people have already developed! I also felt a certain sense of community, a sense of connectedness with the tech blogging sphere, because I got to read a lot of articles in a short time, engage in follow-up discussions about their content, and learn what people thought of this site, too.

What seemed clear is that it would be hard to scale the operation up much further. Most obviously, the manual review time dominates the token costs, roughly 32-fold if I paid myself $100 an hour (~9 h × $100 ≈ $900, versus $28 in tokens). Since we now have an N=1430 ground-truth dataset though, it should be easy to lower the false positive rate through a better ruleset for the classifying model. Switching to a more expensive model and/or using an ensemble would presumably iron out some of the hallucinations and weirder failure modes where a better ruleset doesn't help. Whether both together would eventually obviate the manual review is unclear. If the manual review were solved, the next bottlenecks would be, in order: time to actually read posts from authors who reply, the goodwill of the university webmaster, getting enough blog posts with high enough relevance, and efficiency/concurrency.

A sensible continuation would be to apply a similar method to higher-stakes scenarios like math/logic/citation errors in preprints, or inconsistencies in company annual reports. I suppose many startups are doing stuff like this these days? The main problem there is that semantic errors are much harder to check by hand, but that sounds tractable. If I find time, I'll try something like that next.