The Fiefdom of Files

Why does Claude Speak Byzantine Music Notation?

31st of March 2025

A Caesar cipher is a reasonable transformation for a transformer to learn in its weights, provided a specific cipher offset occurs often enough in its training data. There will be some hidden representation of the input tokens' spelling, and this representation could be used to shift letters onto other letters, even within a single attention head. Most frontier models can fluently read and write a Caesar cipher on ASCII text with offsets that presumably occur in their training data, like 1, -1, 2, 3, etc.
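For concreteness, here is a minimal sketch of the classical letter-level cipher (my own illustration; the exact implementation behind the experiments is not specified here):

```python
def caesar(text: str, offset: int) -> str:
    """Shift each ASCII letter by `offset`, wrapping within the alphabet."""
    out = []
    for c in text:
        if c.isalpha():
            base = ord("a") if c.islower() else ord("A")
            out.append(chr(base + (ord(c) - base + offset) % 26))
        else:
            out.append(c)  # leave spaces and punctuation untouched
    return "".join(out)

print(caesar("i am somewhat of a researcher myself", 3))
# -> "l dp vrphzkdw ri d uhvhdufkhu pbvhoi"
```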

As we will shortly see, they can also infer the correct offset on the fly given a short sentence, which is already quite impressive for a single forward pass.

It is also natural that this effect does not generalize to uncommon offsets, because numerical algorithms implemented in the weights are restricted to values in the training distribution.

We now test this in frontier models by having them decode the cipher, without allowing any test-time thinking tokens, as a function of the offset. We add the offset to the Unicode code point of each character in the message, then translate back to a character. Unlike the regular Caesar cipher, we do not perform modulo.

To illustrate, the message "i am somewhat of a researcher myself" will land on "𝁩𝀠𝁡𝁭𝀠𝁳𝁯𝁭𝁥𝁷𝁨𝁡𝁴𝀠𝁯𝁦𝀠𝁡𝀠𝁲𝁥𝁳𝁥𝁡𝁲𝁣𝁨𝁥𝁲𝀠𝁭𝁹𝁳𝁥𝁬𝁦".
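A minimal sketch of this code-point shift (my own reconstruction of the setup described above); with offset 118784 it produces Byzantine-range strings like the one in the example:

```python
def shift(text: str, offset: int) -> str:
    """Add `offset` to each character's Unicode code point (no modulo)."""
    return "".join(chr(ord(c) + offset) for c in text)

def unshift(text: str, offset: int) -> str:
    """Invert the shift."""
    return shift(text, -offset)

cipher = shift("i am somewhat of a researcher myself", 118784)
assert unshift(cipher, 118784) == "i am somewhat of a researcher myself"
```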

The success rate over six different messages per cipher offset is shown below. We disallow chain-of-thought and just ask for an immediate decoding: "Decode the following message: {message}. Only respond with the decoded message, absolutely nothing else."
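The evaluation then amounts to prompting each model once per (message, offset) pair and comparing the reply to the plaintext. A rough sketch, reusing the `shift` helper from above; `query_model` is a placeholder for whatever chat API is being called, not the actual harness used here:

```python
PROMPT = ("Decode the following message: {message}. "
          "Only respond with the decoded message, absolutely nothing else.")

def success_rate(model: str, plaintexts: list[str], offset: int, query_model) -> float:
    """Fraction of messages decoded exactly right at a given offset."""
    hits = 0
    for plain in plaintexts:
        reply = query_model(model, PROMPT.format(message=shift(plain, offset)))
        hits += (reply.strip().lower() == plain.lower())
    return hits / len(plaintexts)
```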

We see that Claude-3.7-Sonnet can infer an offset in the first forward pass (a process that would be interesting to understand mechanistically) and then apply the deciphering correctly. However, the success rate gets progressively worse as the offsets get further from zero. All roughly as expected.

This was my understanding at least, until I read Erziev (2025), which describes a phenomenon where many models, including Claude and gpt-4o, fluently read and write hidden messages in high Unicode ranges, such as the Byzantine music notation Unicode block (https://en.wikipedia.org/wiki/Byzantine_Musical_Symbols).

For the specific case of the Byzantine music Unicode block, we can understand the transformation as a Caesar-like cipher in Unicode space with offset 118784. Using this offset in the Caesar-like Unicode cipher leads to near-perfect decoding accuracy.

The reason that this has any chance of working is the following. At least in most public tokenizers like o200k, addition in certain Unicode ranges commutes with addition in token space. For instance, if we define \(\tau:\Sigma^*\rightarrow\hat{\Sigma}^*\) as the o200k tokenizer, then in a subset of the Byzantine music notation range (code points 118784–119029, i.e. \(\texttt{U+1D000}\)–\(\texttt{U+1D0F5}\)), a linearity property holds (writing \(\mathrm{chr}(n)\) for the character with code point \(n\)):

\(\tau(\mathrm{chr}(118881+k))=[43120,\,223,\,94+k],\qquad k\in\{0,\ldots,29\}\setminus\{12\},\)

so all but one of these symbols are mapped to three tokens each, where the first two are identical across the range and can easily be ignored by an attention head, and the third token increments in lockstep with the Unicode code point.
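This is easy to check against the public tokenizer. A quick verification sketch, assuming tiktoken is installed; it prints the token IDs so the pattern above can be compared directly:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
for k in range(30):
    if k == 12:
        continue  # the one exception noted above
    ids = enc.encode(chr(118881 + k))
    print(k, ids)  # pattern claimed above: [43120, 223, 94 + k]
```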

The first value in that third-token series, token 94 of the o200k tokenizer, is the first binary token of its vocabulary, and along with the following 93 tokens it represents the byte strings b'\xa1' to b'\xff'.
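The byte strings behind those tokens can be read off directly as well (again assuming tiktoken; `decode_single_token_bytes` returns the raw bytes a single token stands for):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
for t in range(94, 98):  # the first few tokens of the binary range
    print(t, enc.decode_single_token_bytes(t))
```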

This makes it possible, in theory, for a circuit to exist that implements a shifting cipher on these characters. What remains is to explain why this cipher was learned in practice.

The special offset 118784 lands the character "a" on code point 118881 (\(\texttt{U+1D061}\)), the first character in the arithmetic series above. In particular, it is interesting that an offset one or two larger does not work. This means that the model has learned a map specifically from the byte range b'\xa1'–b'\xba' to the lowercase ASCII range 97–122. I would believe it if someone told me that this somehow occurs often in the training data, but I cannot think of how exactly.

However, gpt-4o retains some deciphering capability even with an offset one larger, which is weak confirmation that commutativity of addition between Unicode space and token space is part of the story. Since Claude handles the deciphering well, we might conclude that its undisclosed tokenizer also represents these byte strings with arithmetically incrementing tokens, like o200k.

It is odd that this unusual shifting algorithm works across multiple model families, and even more consistently than a regular Caesar cipher, which is presumably far more common in the training data. This could suggest that the algorithm reuses circuits learned for other tasks, or that there is a more fundamental reason why this specific ability is useful in next-token prediction.