There has been some reporting that Mandarin and similar languages provide several advantages over English for LLMs. First, the theory is that a language like Mandarin can encode more complex ideas in far fewer tokens, and therefore far less memory, than it would take in a language like English.
Some reports, which I can find and post here if necessary, claim this can lead to an overall performance difference of around 40%.
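For anyone who wants to sanity-check the token-count part of that claim, here is a minimal sketch using OpenAI's tiktoken library. The sentence pairs and the choice of the cl100k_base encoding are my own placeholders, not taken from any of the reports above, and token counts alone say nothing about downstream quality or hallucination rates.

    # Count tokens for parallel English/Chinese sentences with an off-the-shelf
    # BPE tokenizer. Fewer tokens means less context used for the same content,
    # but by itself that does not demonstrate any quality difference.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # placeholder choice of encoding

    pairs = [
        ("The weather is very nice today.", "今天天气很好。"),
        ("He decided to postpone the meeting until next week.", "他决定把会议推迟到下周。"),
    ]

    for en, zh in pairs:
        print(f"EN: {len(enc.encode(en)):2d} tokens | ZH: {len(enc.encode(zh)):2d} tokens")

Depending on the tokenizer, CJK text can actually come out longer in tokens even when it is shorter in characters, which is exactly why it is worth measuring rather than assuming.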
There is also a view that the way complex meanings are encoded in a pictographic/logographic-style script improves the inference stage and ultimately greatly reduces hallucinations.
There has been some work from Microsoft and others on compressing tokens from the user side. Other papers have suggested the advantages are so great that a new symbol-based language should be created and used for all of the training data.
Does anyone have any experience with this sort of LLM optimization? Are Mandarin and similar languages actually more efficient for LLMs?
I'm not saying the possibility isn't there, but I would want to see a really strong case made that syntax and grapheme/morpheme structure influence models this way.
If that were true, it would be interesting, because Chinese is anything but the precise language that something like Z, Coq, or APL tries to be: words have remarkably fluid, highly contextualised meanings. The opportunity for a mis-walk through the information space seems higher, not lower.
Sometimes a cigar is just a cigar, but Honey and Winnie the Pooh have two clear meanings now in China. As does Draco Malfoy. I can't see how this helps an LLM.
(I'm an AI skeptic, and a complete outsider in this space)
> I can't see how this helps an LLM.
Your examples aren't language-specific, though. English has plenty of words that have acquired twisted meanings too.
Take a language that is roughly equidistant from English and Chinese, like Malay, translate the same dataset into both languages, then train a model on each and benchmark them. I suspect the differences would be negligible.
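For what it's worth, here is a minimal sketch of the measurement half of that experiment, assuming the Hugging Face tokenizers library: train a BPE tokenizer with the same vocabulary size on each side of a parallel corpus and compare how many tokens each language needs for the same content. The inline sentence pairs and the vocab size are placeholders I made up, and the actual train-and-benchmark step on full models is left out.

    # Controlled tokenizer comparison: same content, same vocab size, two languages.
    # This only measures encoding length; it does not train or benchmark models.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    # Placeholder parallel data; a real run would use a large translated corpus.
    parallel = [
        ("The cat sat on the mat.", "猫坐在垫子上。"),
        ("I will call you tomorrow morning.", "我明天早上给你打电话。"),
        ("This experiment controls for tokenizer vocabulary size.", "这个实验控制了分词器的词表大小。"),
    ]

    def train_bpe(texts, vocab_size=500):
        """Train a byte-level BPE tokenizer with a fixed vocab size on one language."""
        tok = Tokenizer(models.BPE(unk_token="[UNK]"))
        tok.pre_tokenizer = pre_tokenizers.ByteLevel()
        trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
        tok.train_from_iterator(texts, trainer)
        return tok

    en_tok = train_bpe([en for en, _ in parallel])
    zh_tok = train_bpe([zh for _, zh in parallel])

    for en, zh in parallel:
        print(len(en_tok.encode(en).ids), "EN tokens |", len(zh_tok.encode(zh).ids), "ZH tokens")

The point of fixing the vocabulary size is to avoid conflating "the language is denser" with "the tokenizer happened to be trained mostly on English"; whatever differences survive that kind of control are the ones a 40% claim would have to rest on.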