LLMs are getting better at character-level text manipulation

(blog.burkert.me)

68 points | by curioussquirrel 9 hours ago

26 comments

simonw 6 hours ago

If you take a look at the system prompt for Claude 3.7 Sonnet on this page you'll see: https://docs.claude.com/en/release-notes/system-prompts#clau...

> If Claude is asked to count words, letters, and characters, it thinks step by step before answering the person. It explicitly counts the words, letters, or characters by assigning a number to each. It only answers the person once it has performed this explicit counting step.

But... if you look at the system prompts on the same page for later models - Claude 4 and upwards - that text is gone.

Which suggests to me that Claude 4 was the first Anthropic model where they didn't feel the need to include that tip in the system prompt.
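The explicit counting step described in the quoted prompt is easy to picture in code. A minimal Python sketch follows; the function name `count_letter_explicitly` is made up for illustration and is not anything from Anthropic's prompt or tooling.

```python
def count_letter_explicitly(word: str, letter: str) -> tuple[list[str], int]:
    """Number every character, then count the ones that match `letter`."""
    # Assign a 1-based index to each character, as the prompt describes.
    numbered = [f"{i}:{ch}" for i, ch in enumerate(word, start=1)]
    # Count case-insensitively, since "R" and "r" are the same letter.
    total = sum(1 for ch in word if ch.lower() == letter.lower())
    return numbered, total


numbered, total = count_letter_explicitly("strawberry", "r")
print(" ".join(numbered))  # 1:s 2:t 3:r 4:a 5:w 6:b 7:e 8:r 9:r 10:y
print(total)               # 3
```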

kristianp 5 hours ago

Does that mean they've managed to post-train the thinking steps required to get these types of questions correct?

simonw 4 hours ago

That's my best guess, yeah.

ivape 5 hours ago

Or they’d rather use that context window space for more useful instructions for a variety of other topics.

astrange 5 hours ago

Claude's system prompt is still incredibly long and probably hurting its performance.

https://github.com/asgeirtj/system_prompts_leaks/blob/main/A...

jazzyjackson 2 hours ago

They ain't called guard rails for nothing! There's a whole world "off-road" but the big names are afraid of letting their superintelligence off the leash. A real shame we're letting brand safety get in the way of performance and creativity, but I guess the first New York Times article about a pervert or terrorist chat bot would doom any big name partnerships.

astrange an hour ago

Anthropic's entire reason for being is publishing safety papers along the lines of "we told it to say something scary and it said it", so of course they care about this.

malshe 5 hours ago

I play Quartiles in Apple News app daily (https://support.apple.com/guide/iphone/solve-quartiles-puzzl...). Occasionally when I get stuck, I use ChatGPT to find a word that uses four word fragments or tiles. It never worked before GPT 5. And with GPT 5 it works only with reasoning enabled. Even then, there is no guarantee it will find the correct word and may end up hallucinating badly.
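For what it's worth, this kind of tile search is a small brute-force problem for ordinary code. A rough Python sketch, assuming you have the puzzle's tiles and some word list on hand (the tiles and the one-word `WORDS` set below are made-up placeholders, not a real puzzle):

```python
from itertools import permutations


def find_tile_words(tiles: list[str], words: set[str], n_tiles: int = 4) -> list[str]:
    """Return every dictionary word formed by joining exactly n_tiles tiles in order."""
    found = set()
    for combo in permutations(tiles, n_tiles):
        candidate = "".join(combo)
        if candidate in words:
            found.add(candidate)
    return sorted(found)


# Hypothetical tiles and word list, for illustration only.
TILES = ["un", "der", "sta", "nd", "ing", "ble"]
WORDS = {"understand"}

print(find_tile_words(TILES, WORDS))  # ['understand']
```

With 20 tiles there are 20 * 19 * 18 * 17 = 116,280 ordered four-tile candidates, so an exhaustive check against a dictionary is cheap.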

jazzyjackson 2 hours ago

That's good. 1-800-CHATGPT really let me down today. I like calling it to explain acronyms and define words, since I travel with a flip phone without Google. Today I saw the word "littoral" and tried over and over to spell it out, but the model could only give me the definition for "literal" (admittedly a homonym, but hence spelling it out: Lima indigo Tango Tango Oscar Romeo Alpha Lima, to no avail).

I said "I know you're a robot and bad at spelling but listen..." And got cut off with a "sorry, my guidelines won't let me help with that request..."

Thankfully, the flip phone allows for some satisfaction when hanging up.

BoorishBears an hour ago

Did you try "literal but with an o"?

xwolfi an hour ago

I know this word, it's French and it means coastline, coastal, something at the edge of the land and sea! We use it in French a lot to describe positively a long coastline. I'm surprised it's used in an English context, but all French words can be used in English I guess if you're a bit "confiant" about it!

necovek 4 hours ago

I think the base64 decoding is interesting: in a sense, the model's training set likely had lots of base64-encoded data (imagine MIME data in emails, JSON, HTML...), but for it to decode successfully, it had to learn the decoding for every group of 4 base64 characters (which turn into 3 bytes). This could easily have been generated as training data, and I only wonder if each and every one of them was found enough times to end up in the weights?
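The 4-characters-to-3-bytes grouping is easy to see with Python's standard `base64` module. A small sketch (the sample string is arbitrary):

```python
import base64

encoded = base64.b64encode(b"Hello, world").decode("ascii")
print(encoded)  # SGVsbG8sIHdvcmxk

# Each group of 4 base64 characters decodes independently to 3 bytes.
chunks = []
for i in range(0, len(encoded), 4):
    group = encoded[i:i + 4]
    decoded = base64.b64decode(group)
    chunks.append(decoded)
    print(group, "->", decoded)

# With a 64-symbol alphabet there are 64**4 distinct unpadded 4-character groups.
print(64 ** 4)  # 16777216
```

So a model that memorized the mapping group by group would need to have seen on the order of 16.7 million distinct groups, which is the coverage question raised above.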

viraptor 4 hours ago

Why bother testing though? I was hoping this topic had finally died recently, but no. Someone's still interested in testing LLMs for something they're explicitly not designed for, and nobody is using them for this in practice. I really hope one day OpenAI will just add a "when asked about character level changes, insights and encodings, generate and run a program to answer it" to their system prompt so we can never hear about it again...

tkgally 3 hours ago

One reason for testing this is that it might indicate how accurately models can explain natural language grammar, especially for agglutinative and fusional languages, which form words by stringing morphemes together. When I tested ChatGPT a couple of years ago, it sometimes made mistakes identifying the components of specific Russian and Japanese words. I haven’t run similar tests lately, but it would be nice to know how much language learners can depend on LLM explanations about the word-level grammars of the languages they are studying.

Later: I asked three LLMs to draft such a test. Gemini’s [1] looks like a good start. When I have time, I’ll try to make it harder, double-check the answers myself, and then run it on some older and newer models.

[1] https://g.co/gemini/share/5eefc9aed193

gizmo686 2 hours ago

What you are testing for is fundamentally different than character level text manipulation.

A major optimization in modern LLMs is tokenization. This optimization is based on the assumption that we do not care about character level details, so we can combine adjacent characters into tokens, then train and run the main AI model on smaller strings built out of a much larger dictionary of tokens. Given this architecture, it is impressive that AIs can perform character level operations at all. They essentially need to reverse engineer the tokenization process.
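To make that concrete, here is a toy Python sketch of what the model sees. The vocabulary and the greedy longest-match rule are invented for illustration; real tokenizers are learned from data and differ in detail.

```python
# Hypothetical vocabulary mapping token strings to integer IDs.
VOCAB = {"straw": 101, "berry": 102, "s": 1, "t": 2, "r": 3, "a": 4,
         "w": 5, "b": 6, "e": 7, "y": 8}


def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization against VOCAB."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids


print(tokenize("strawberry"))  # [101, 102]
```

The model receives only `[101, 102]`. Nothing in those two IDs says the word contains three r's, so that fact has to be learned from training data.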

However, morphemes are semantically meaningful, so a quality tokenizer will tokenize at the morpheme level instead of the word level [0]. This is of particularly obvious importance in Japanese, as the lack of spaces between words means that the naive "tokenize on whitespace" approach is simply not possible.

We can explore the tokenizer of various models here: https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...

Looking at the words in your example, we see the tokenization of the Gemma model (closely related to Gemini) is:

  un-belie-vably
  dec-entral-ization
  bio-degradable
  mis-understanding
  anti-dis-establishment-arian-ism
  пере-писы-ваться
  pere-pis-y-vat-'-s-ya
  до-сто-примеча-тельность
  do-stop-rime-chat-el-'-nost-'
  пре-по-дава-тель-ница
  бе-зо-т-вет-ственности
  bezotvetstvennosti
  же-лез-нодоро-жный
  z-hele-zn-odoro-zh-ny-y
  食べ-させ-られた-くな-かった
  tab-es-aser-are-tak-unak-atta
  図書館
  tos-ho-kan
  情報-技術
  j-ō-h-ō- gij-utsu
  国際-関係
  kok-us-ai- kan-kei
  面白-くな-さ-そうだ

Further, the training data that is likely to be relevant to this type of query probably isolates the individual morphemes while talking about a bunch of words that use them, so it is a much shorter path for the AI to associate these close-but-not-quite-morpheme tokens with the actual sequence of tokens that corresponds to what we think of as a morpheme.

[0] Morpheme-level tokenization is itself a non-trivial problem. However, it was pretty well solved long before the current generation of AI.

tkgally an hour ago

Thanks for the explanation. Very interesting.

I notice that that particular tokenization deviates from the morphemic divisions in several cases, including ‘dec-entral-ization’, ‘食べ-させ-られた-くな-かった’, and ‘面白-くな-さ-そうだ.’ ‘dec’ and ‘entral’ are not morphemes, nor is ‘くな.’

neerajsi 3 hours ago

https://www.anthropic.com/news/analysis-tool

Seems like they already built this capability.

redox99 3 hours ago

Character level LLMs are used for detecting insults and toxic chat in video games and the like.

minimaxir 3 hours ago

Can you give an example of a video game explicitly using character-level LLMs? There were prototypes of char-rnns back in the day for chat moderation, but they had significant compute overhead.

jazzyjackson 2 hours ago

I figure an LLM would be way better at classifying insults than regexing against a bad word list. Why would character level be desirable?
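One possible reason, sketched in Python: a word-list regex matches exact spellings, so obfuscated spellings slip past unless the text is normalized character by character first. The word list and substitution table below are made-up placeholders, and this only illustrates the general idea, not how any particular game does moderation.

```python
import re

BAD_WORDS = ["idiot"]  # hypothetical list
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, BAD_WORDS)) + r")\b", re.IGNORECASE
)

# Hypothetical look-alike substitutions.
LEET = str.maketrans({"1": "i", "0": "o", "3": "e", "4": "a", "@": "a", "$": "s"})


def naive_flag(message: str) -> bool:
    """Match the word list against the raw message."""
    return PATTERN.search(message) is not None


def normalized_flag(message: str) -> bool:
    """Map look-alikes, drop everything that is not a letter or space, then match."""
    cleaned = re.sub(r"[^a-z ]", "", message.lower().translate(LEET))
    return PATTERN.search(cleaned) is not None


print(naive_flag("you 1d10t"))           # False
print(naive_flag("you i.d.i.o.t"))       # False
print(normalized_flag("you 1d10t"))      # True
print(normalized_flag("you i.d.i.o.t"))  # True
```

A model working on whole-word tokens faces a similar problem to the naive regex here, since unusual spellings split into unfamiliar token sequences.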

MountDoom 3 hours ago

I remember people making the exact same argument about asking LLMs math questions back when they couldn't figure out the answer to 18 times 7. "They are text token predictors, they don't understand numbers, can we put this nonsense to rest."

The whole point of LLMs is that they do more than we suspected they could. And there is value in making them capable of handling a wider selection of tasks. When an LLM started to count the number of "r"s in "strawberry", OpenAI was taking a victory lap.

minimaxir 3 hours ago

I made a response to this counterpoint in a blog post I wrote about a similar question posed to LLMs (how many b's are in blueberry): https://news.ycombinator.com/item?id=44878290

> Yes, asking an LLM how many b’s are in blueberry is an adversarial question in the sense that the questioner is expecting the LLM to fail. But it’s not an unfair question, and it’s objectively silly to claim that LLMs such as GPT-5 can operate at a PhD level, but can’t correctly count the number of letters in a word.

It's a subject that the Hacker News bubble and the real world treat differently.

brookst 3 hours ago

It’s like defending a test showing hammers are terrible at driving screws by saying many people are unclear on how to use tools.

It remains unsurprising that a technology that lumps characters together is not great at processing below its resolution.

Now, if there are use cases other than synthetic tests where this capability is important, maybe there’s something interesting. But just pointing out that one can’t actually climb the trees pictured on the map is not that interesting.

achierius 2 hours ago

And yet... now many of them can do it. I think it's premature to say "this technology is for X" when what it was originally invented for was translation, and every capability it has developed since then has been an immense surprise.

IncreasePosts 3 hours ago

Wouldn't an LLM that just tokenized by character be good at it?

hansonkd 4 hours ago

ChatGPT 5 is still pathetically bad at Roman numerals. I asked it to find the longest Roman numeral in a range. Its first guess was the highest number in the range, despite that being a short numeral. Its second guess, after help, was a longer numeral but outside the range. Its last guess was the correct longest numeral, but it miscounted how many characters it contained.
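For comparison, the task is a few lines of ordinary code. A Python sketch, where the range 1 to 100 is an arbitrary example (the comment does not say which range was used):

```python
ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
         (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]


def to_roman(n: int) -> str:
    """Convert 1 <= n <= 3999 to a standard Roman numeral."""
    if not 1 <= n <= 3999:
        raise ValueError("n must be between 1 and 3999")
    out = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def longest_roman(lo: int, hi: int) -> tuple[int, str]:
    """Return (number, numeral) for the longest numeral in [lo, hi].

    Ties go to the smallest number, because max() keeps the first maximum.
    """
    best = max(range(lo, hi + 1), key=lambda k: len(to_roman(k)))
    return best, to_roman(best)


print(longest_roman(1, 100))  # (88, 'LXXXVIII')
```

Note that the longest numeral (88, eight characters) is not the largest number in the range (100 is just "C"), which is exactly the trap described in the first guess above.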