Probably pay attention to tokenizers

(cybernetist.com)

225 points | by ingve a day ago

70 comments

  • kelseyfrog 16 hours ago

    Tokenizers aren't considered the "sexy" part of LLMs, but where others see boring, I see opportunity. Papers like xVal[1] point toward specialization strategies in tokenization. Spelling and letter tasks are another area that could benefit from innovation in tokenization.

    LLMs are notoriously bad at counting letters in words or performing simple oulipos of letter omission. GPT-4o, for example, writes a small Python program and executes it in order to count letter instances. We all know that tokenization effectively erases knowledge about letters in prompts and directly hurts performance on these tasks, yet we haven't found a way to solve it.

    1. https://ar5iv.labs.arxiv.org/html/2310.02989

    • bunderbunder 12 hours ago

      This was ages ago, in the pre-transformer era, and I can't find the link anymore. But once upon a time I read a great paper that demonstrated that most of the performance differences being reported among popular embedding models of the time were better explained by text cleaning and tokenization than they were by the embedding model itself.

      In other words, if you train a model using word2vec's preprocessing and GloVe's algorithm, the result looks more like a "standard-issue" word2vec model than a "standard-issue" GloVe model.

    • screye 14 hours ago

      Tokenizers face an odd compute issue.

      Since they're part of the pre-processing pipeline, you can't quickly test them for effectiveness; you have to restart a pretraining run to measure the downstream impact.

      Separately,

      As much as an attention module can do universal nonlinear transformations... I wonder if it makes sense to add specific modules for some math primitives as well. I remember that the executor paper [1] (a slight precursor to the "Attention Is All You Need" paper) created self-contained modules for operations like less-than, count, and sum, and then explicitly orchestrated them in the decoder.

      I'm surprised we haven't seen such solutions produce SOTA results from the math-AI or code-AI research communities.

      [1] https://arxiv.org/abs/1705.03633

    • IncreasePosts 16 hours ago

      What's the issue with character-level tokenization? (I assume it would be much better at count-the-letter tasks.) The article mentions it as an option but doesn't talk about why subword tokenization is preferred by most of the big LLMs out there.

      • stephantul 15 hours ago

        Using subwords makes your sequences shorter, which makes them cost less.

        Besides that, for alphabetic languages, there exists almost no relation between form and meaning. E.g., “ring” and “wing” differ by one letter but have no real common meaning. By picking the character or byte as your unit of representation, the model basically has to learn to distinguish ring and wing in context. This is a lot of work!

        So, while working on the character or byte level saves you some embeddings and thus makes your model smaller, it puts all of the work of distinguishing similar sequences with divergent meanings on the model itself, which means you need a larger model.

        By having subwords, a part of this distinguishing work already has been done by the vocabulary itself. As the article points out, this sometimes fails.

        • sundarurfriend 11 hours ago

          > Besides that, for alphabetic languages, there exists almost no relation between form and meaning.

          Also true for abugida-based languages, e.g. சரம் (saram = string) vs மரம் (maram = tree), and many more. I think your intention with specifying "alphabetic languages" was to say "non-logographic languages", right?

          • bunderbunder 10 hours ago

            I'll do you one more and say "non-Chinese languages". Written Japanese - including the kanji portion of the script - has the same characteristic.

            And even in Chinese it's a fairly weak relationship. A large portion of the meanings of individual characters come from sound loan. For example, the 英 in 英雄 means "hero", in 英语 it means "England", and in 精英 it means "flower". The relationship there is simple homophony.

            On the other hand, one thing you do get with written Chinese is that "1 character = 1 morpheme" very nearly works. So mechanistically breaking a text into a sequence of morphemes can be done pretty reliably without the aid of a semantic model or exhaustive hard-coded mapping. I think that for many other languages you can't even get close using only syntactic analysis.

            • thaumasiotes 7 hours ago

              > I'll do you one more and say "non-Chinese languages". Written Japanese - including the kanji portion of the script - has the same characteristic.

              Written Japanese is much more ideographic than written Chinese. Japanese spelling is determined, such as it is, by semantics. Chinese spelling is determined by sound. Thus, 女的, 娘们, and 妮子, all meaning 'girl' or 'woman', have no spelling in common because they are different words, while Japanese uses 女 for "jo" and "onna" despite a total lack of any relationship between those words.

          • stephantul 4 hours ago

            I was trying to say “at least for alphabetic languages”. I don’t like to say things about languages I can’t speak or write. So, no, it wasn’t my intention to say “non-logographic languages”

        • p1esk 11 hours ago

          Has anyone tried to combine a token embedding with some representation of the characters in the (sub)word? For example, use a 512-dimensional vector to represent a token, and reserve the last 12 values to spell out the word.

          • mattnewton 11 hours ago

            I'm not following - spell out the word how? Like put the actual bytes as numerical input to the transformer layer?

            • p1esk 9 hours ago

              Yes

              • stephantul 4 hours ago

                Not that I know of, but encoding orthography in a fixed-width vector usually carries the assumption that words with the same prefix are more similar. So there’s an alignment problem. You usually solve this using dynamic programming, but that doesn’t work in a vector.

                For example “parent” and “parents” are aligned, they share letters in the same position, but “skew” and “askew” share no letters in the same position.

                • p1esk 3 hours ago

                  The other 500 values in the skew/askew vectors will be similar, though. The 12 character values don’t need to be aligned; their function is to provide spelling. Adding such info will probably help an LLM answer questions requiring character-level knowledge (e.g. counting ‘r’s in ‘strawberry’).
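
                  Roughly what I have in mind, as a sketch (the dimensions and the character encoding are made up for illustration):

                    import numpy as np
                    from string import ascii_lowercase as letters

                    EMB_DIM, SPELL = 512, 12

                    def spell_features(token):
                        # a=1 ... z=26; anything else (and padding) is 0
                        codes = [letters.index(c) + 1 if c in letters else 0
                                 for c in token.lower()[:SPELL]]
                        codes += [0] * (SPELL - len(codes))
                        return np.array(codes, dtype=np.float32) / 26.0

                    def hybrid_embedding(learned, token):
                        # first 500 dims are the learned embedding, last 12 spell the token
                        vec = np.empty(EMB_DIM, dtype=np.float32)
                        vec[:EMB_DIM - SPELL] = learned[:EMB_DIM - SPELL]
                        vec[EMB_DIM - SPELL:] = spell_features(token)
                        return vec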

          • RicoElectrico 11 hours ago

            Well, fastText uses character n-grams to compute embeddings for out-of-vocabulary words. This is pre-transformers work BTW.

            • p1esk 4 hours ago

              IIRC, overlapping n-gram vectors are summed to form the token embedding - doesn’t that effectively destroy any character-level representation of the token? Doesn’t really make sense to me.

              • stephantul 4 hours ago

                It works because they use really large ngram values, up to 6. So most character-level information is in these subwords.

                • p1esk 3 hours ago

                  Let’s say we want to use 6-grams and build an embedding vector for the word “because”: we add integer vectors for “becaus” and “ecause”, right? For example: [1,2,3,4,5,6] + [2,3,4,5,6,2] = [3,5,7,9,11,8]. Obviously we cannot use this resulting numerical vector to spell the input word. Pretty much all character level info is lost.
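
                  IIRC the extraction side looks roughly like this (fastText wraps the word in < > boundary markers and uses n-grams of length 3-6 by default); the summing of the resulting n-gram vectors is the step that blends the positions together:

                    def char_ngrams(word, n_min=3, n_max=6):
                        # fastText-style: wrap the word in boundary markers first
                        w = f"<{word}>"
                        return [w[i:i + n]
                                for n in range(n_min, n_max + 1)
                                for i in range(len(w) - n + 1)]

                    # ['<be', 'bec', ..., 'ecause', 'cause>']
                    print(char_ngrams("because"))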

        • bunderbunder 12 hours ago

          I suspect that the holy grail here is figuring out how to break the input into a sequence of morphemes and non-morpheme lexical units.

          • thaumasiotes 7 hours ago

            What do you mean by non-morpheme lexical units? Syntactic particles, units too small to be morphemes? Lexical items that contain multiple morphemes?

            In either case, isn't this something we already do well?

      • SEGyges 15 hours ago

        Tokens are on average four characters, and the number of residual streams (and therefore the RAM) the LLM allocates to a given sequence is proportional to the number of input units. The FLOPs in the attention calculation are proportional to the square of that number.

        You can hypothetically try to ameliorate this by other means, but if you just naively drop from tokenization to character- or byte-level models, this is what goes wrong.
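
        Back-of-the-envelope, ignoring constants:

          tokens = 1_000
          chars = 4 * tokens                   # ~4 characters per token

          memory_ratio = chars / tokens        # residual streams / KV cache: linear
          flops_ratio = (chars / tokens) ** 2  # attention FLOPs: quadratic
          print(memory_ratio, flops_ratio)     # 4.0 16.0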

        • p1esk 3 hours ago

          4x seq length expansion doesn’t sound that bad.

      • Centigonal 15 hours ago

        I think it has to do with both performance (smaller tokens mean more tokens per sentence read and more runs per sentence generated) and with how embeddings work. You need a token for "dog" and a token for "puppy" to represent the relationship between the two as a dimension in latent space.

      • cma 13 hours ago

        Context-length compute and memory scale as N^2. Smaller tokens mean worse scaling, up to a point.

    • kaycebasques 14 hours ago

      > but where others see boring, I see opportunity

      I feel this way about embeddings

      This line of thought seems related to the old wisdom of finding innovative solutions by mucking around in the layer below whatever the "tools of the trade" are for your domain

    • doctorpangloss 13 hours ago

      > LLMs are notoriously bad at counting letters in words or performing simple oulipos of letter omission.

      If it were so simple, why hasn’t this already been dealt with?

      Multimodal VQA models also have had a hard time generalizing counting. Counting is not as simple as changing the tokenizer.

      • kelseyfrog 13 hours ago

        I'm saying the oulipo rule is simple, not the task, given current tokenization methods.

      • danielmarkbruce 8 hours ago

        Should the number 23 be tokenized as one token or two tokens?

        • doctorpangloss 5 hours ago

          It doesn’t matter. The challenge with counting doesn’t have to do with tokenization. Why this got into the zeitgeist, I don’t know.

          • imtringued an hour ago

            No LLM struggles with two-digit arithmetic. 100-digit addition is possible with the use of state-of-the-art position encodings. Counting is not bottlenecked by arithmetic at all.

            When you ask an LLM to count the number of "r" in the word Strawberry, the LLM will output a random number. If you ask it to separate the letters into S t r a w b e r r y, then each letter is tokenized independently and the attention mechanism is capable of performing the task.

            What you are doing is essentially denying that the problem exists.

        • tomrod 8 hours ago

          We already solved that with binary representation ;-)

        • thaumasiotes 7 hours ago

          Two. That's the reality.

          You interpret the token sequence by constructing a parse tree, but that doesn't require you to forget that the tokens exist.

          • danielmarkbruce 6 hours ago

            If you use standard BPE, you likely won't tokenize every number by its digits, depending on the data set used to create the tokenizer.

            The point is, you have a choice. You can do the tokenization however you like. The reason 23 is interesting is that there is a case to be made that a model is more likely to understand that 23 is related to Jordan if it's one token; if it's two tokens, it's more difficult. The opposite is true for math problems.

            The reality is whatever we want to make it. It's likely that current schemes are... suboptimal. In practice it would be great if every token were geometrically well spaced after embedding and preserved semantic information, among other things. The "other things" have taken precedence thus far.

    • Der_Einzige 12 hours ago

      I wrote a whole paper about this exact topic! (Syntactic, phonetic, and related constraints)

      https://aclanthology.org/2022.cai-1.2/

    • danielmarkbruce 9 hours ago

      And decoders.

  • Joker_vD 15 hours ago

    > You need to understand [the input data] before you can do anything meaningful with it.

    IMHO that's the main reason people turn to any sort of automated data-processing tools in the first place: they don't want to look at the input data. They'd rather have "the computer" look at it and maybe query them back with some additional info gathering requests. But thinking on their own? Ugh.

    So I boldly propose the new definition of AGI: it's the data-processing entity that will (at last!) reliably liberate you from having to look at your data before you start shoving this data into that processing entity.

    • bunderbunder 13 hours ago

      Over the past year I've encountered so many situations where a person's opinion of how well an LLM accomplishes a task actually says more about that person's reading comprehension skills than it does about the LLM's performance. This applies to both positive and negative opinions.

  • cranium 2 hours ago

    I finally understood the weirdness of tokenizers after watching the video Andrej Karpathy made: "Let's build the GPT Tokenizer" (https://www.youtube.com/watch?v=zduSFxRajkE).

    He goes through why we need them instead of raw byte sequences (too expensive) and how the Byte Pair Encoding algorithm works. Worth spending the two hours for a deeper understanding if you deal with LLMs.
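
    If you want to poke at a real vocabulary yourself, something like this works (assuming the tiktoken package is installed; cl100k_base is one of OpenAI's published encodings):

      import tiktoken

      enc = tiktoken.get_encoding("cl100k_base")
      ids = enc.encode("Probably pay attention to tokenizers")
      # decode each id separately to see how the text was split
      print([enc.decode([i]) for i in ids])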

  • HanClinto 12 hours ago

    I really appreciated this blog post, and in particular I appreciated the segment talking about typos.

    We were discussing this earlier this week -- I'm helping with a RAG-like application for a project right now, and we're concerned with how much small typos or formatting differences in users' queries can throw off our embedding distances.

    One thought was: should we be augmenting our training data (or at the very least, our pretraining data) with intentional typos / substitutions / capitalizations, just to help it learn that "wrk" and "work" are probably synonyms? I looked briefly around for typo augmentation for (pre)training and didn't see anything at first blush, so I'm guessing that if this is a common practice, it's called something else.
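
    Something like this is what I had in mind, purely as a sketch (the rates and edit types are arbitrary):

      import random

      def add_typos(text, rate=0.05, seed=None):
          # Randomly delete, duplicate, or transpose characters to simulate
          # noisy input, e.g. "work" -> "wrk", "woork", or "wrok".
          rng = random.Random(seed)
          out = []
          for ch in text:
              r = rng.random()
              if r < rate / 3:
                  continue                   # deletion
              elif r < 2 * rate / 3:
                  out.append(ch * 2)         # duplication
              elif r < rate and out:
                  out[-1], ch = ch, out[-1]  # transpose with previous char
                  out.append(ch)
              else:
                  out.append(ch)
          return "".join(out)

      print(add_typos("I have received the wrong package", rate=0.15, seed=0))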

    • tmikaeld 11 hours ago

      I work with full-text search, where this is common. Here are some approaches:

      Stemming: Reducing words to their base or root form (e.g., “working,” “worked” becoming “work”).

      Lemmatization: Similar to stemming, but more sophisticated, accounting for context (e.g., “better” lemmatizes to “good”).

      Token normalization: Standardizing tokens, such as converting “wrk” to “work” through predefined rules (case folding, character replacement).

      Fuzzy matching: Allowing approximate matches based on edit distance (e.g., “wrk” matches “work” due to minimal character difference).

      Phonetic matching: Matching words that sound similar, sometimes used to match abbreviations or common misspellings.

      Thesaurus-based search: Using a predefined list of synonyms or alternative spellings to expand search queries.

      Most of these rely on open and free lists you can use; check the sources of Manticore Search, for example.
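
      Two of these are easy to try from Python (difflib is in the standard library; the stemmer assumes the nltk package is installed):

        from difflib import get_close_matches  # edit-distance-style fuzzy matching
        from nltk.stem import PorterStemmer

        stemmer = PorterStemmer()
        print(stemmer.stem("working"), stemmer.stem("worked"))  # -> work work

        vocabulary = ["work", "word", "worm", "fork"]
        print(get_close_matches("wrk", vocabulary, n=1, cutoff=0.6))  # -> ['work']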

      • soared 10 hours ago

        Porter stemming is currently widely used in adtech for keywords.

      • thaumasiotes 7 hours ago

        > Lemmatization: Similar to stemming, but more sophisticated, accounting for context (e.g., “better” lemmatizes to “good”).

        I don't understand. How is that different from stemming? What's the base form of "better" if not "good"? The nature of the relationship between "better" and "good" is no different from that between "work" and "worked".

    • andix 11 hours ago

      For queries there is an easy solution: give the question/search term to an LLM and let it rephrase it. A lot of basic RAG examples do that.

      This might also work for indexing your data, but has the potential to get really expensive quickly.

    • bongodongobob 10 hours ago

      I'm glad this is mentioned. I've suspected that using correct grammar, punctuation, and spelling greatly impacts response quality. It's hard to quantify, so I've just decided to write my prompts in perfect English, just to be sure. I have a friend who prompts like he texts, and I've always felt he was getting lower-quality responses. Not unusable, just a little worse, and he needs to correct it more.

  • maytc an hour ago

    The difference in the dates example seems right to me: 20 October 2024 and 2024-20-10 are not the same.

    Dates in different locales can be written as yyyy-MM-dd, and the string could also be a catalog/reference number. So it seems right that their embedding similarity is not perfectly aligned.

    So, it's not a tokenizer problem. The text meant different things according to the LLM.

  • yoelhacks 14 hours ago

    I used to work on an app that very heavily leaned on Elasticsearch to do advanced text querying for similarities between a 1-2 sentence input and a corpus of paragraph+ length documents.

    It was fascinating how much tokenization strategies could affect a particular subset of queries. A really great example is a "W-4" or "W4". Standard tokenization might split on the "-" or split on letter/number boundaries. That input then becomes completely unidentifiable in the index, when it otherwise would have been a very rich factor in matching HR / salary / tax related content.
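
    You can see the effect directly with the _analyze API; a quick sketch, assuming a local Elasticsearch node:

      import requests

      resp = requests.post(
          "http://localhost:9200/_analyze",
          json={"analyzer": "standard", "text": "W-4 withholding"},
      )
      print([t["token"] for t in resp.json()["tokens"]])
      # the standard analyzer typically yields ['w', '4', 'withholding'],
      # losing "W-4" as a single unit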

    Different domain, but this doesn't shock me at all.

    • carom 12 hours ago

      The trained embedding vectors for the token equivalents of W4 and W-4 would be mapped to a similar space due to their appearance in the same contexts.

      • dangerlibrary 10 hours ago

        The point of the GP post is that the "w-4" token gave very different results from ["w", "-4"] or similar tokenizations where the "w" and "4" wound up in separate tokens.

    • AStrangeMorrow 8 hours ago

      Yes, I used to work on a system that had Elasticsearch and also some custom Word2Vec models. What had the most impact on the quality of the search in ES, and on the quality of our W2V models, was tokenization and a custom n-grams system.

  • Xenoamorphous 14 hours ago

    > One of the things I noticed over the past year is how a lot of developers who are used to developing in the traditional (deterministic) space fail to change the way they should think about problems in the statistical space which is ultimately what LLM apps are.

    I’m a developer and don’t struggle with this, where I really struggle is trying to explain this to users.

  • bcherry 14 hours ago

    It's kind of interesting because I think most people implementing RAG aren't even thinking about tokenization at all. They're thinking about embeddings:

    1. chunk the corpus of data (various strategies but they're all somewhat intuitive)

    2. compute embedding for each chunk

    3. generate search query/queries

    4. compute embedding for each query

    5. rank corpus chunks by distance to query (vector search)

    6. construct return values (e.g chunk + surrounding context, or whole doc, etc)

    So this article really gets at the importance of a hidden, relatively mundane-feeling operation that can have an outsized impact on the performance of the system. I do wish it had more concrete recommendations in the last section, and a code sample of a robust project with normalization, fine-tuning, and evals.
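
    For what it's worth, steps 2-5 in their simplest form look something like this with sentence-transformers (the same model used elsewhere in this thread; chunking and result construction omitted):

      from sentence_transformers import SentenceTransformer, util

      model = SentenceTransformer("all-MiniLM-L6-v2")

      chunks = [
          "Refunds: items can be returned within 30 days of delivery.",
          "Shipping normally takes 3-5 business days.",
      ]
      chunk_emb = model.encode(chunks, convert_to_tensor=True)  # step 2

      query = "how long do I have to send something back?"
      query_emb = model.encode(query, convert_to_tensor=True)   # step 4

      scores = util.cos_sim(query_emb, chunk_emb)[0]            # step 5
      best = int(scores.argmax())
      print(chunks[best], float(scores[best]))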

  • r_hanz 7 hours ago

    Very nicely written article. Personally, I find RAG (and more abstractly, vector search) the only mildly interesting development in the latest LLM fad, and have always felt that LLMs sit way too far down the diminishing-returns curve to be interesting. However, I can’t believe that tokenization, and embeddings in general, are not broadly considered the most paramount aspect of all deep learning. The latent space your model captures is the most important part of the whole pipeline, or else what is any deep learning model even doing?

  • halyax7 13 hours ago

    An issue I've seen in several RAG implementations is assuming that the target documents, however cleverly they're chunked, will be good search keys for incoming queries. Unless your incoming search text looks semantically like the documents you're searching over (not the case in general), you'll get bad hits. On a recent project, we saw a big improvement in retrieval relevance when we separated the search keys from the returned values (chunked documents) and used an LM to generate appropriate keys, which were then embedded. Appropriate in this case means "sentences like what the user might input if they're expecting this chunk back".

    • marlott 12 hours ago

      Interesting! So you basically got an LM to rephrase the search phrase/keys into the style of the target documents, then used that in the RAG pipeline? Did you do an initial search first to limit the documents?

      • NitpickLawyer 10 hours ago

        IIUC they're doing some sort of "q/a" for each chunk from documents, where they ask an LLM to "play the user role and ask a question that would be answered by this chunk". They then embed those questions, and match live user queries with those questions first, then maybe re-rank on the document chunks retrieved.

        • nullc 21 minutes ago

          I wonder if a backward-predicting LLM might do better.

          E.g. augment the data with "Certainly, the document matching your question is <document>" and then sample the backwards completion to get candidate questions.

  • woolr 12 hours ago

    Can't repro some of the numbers in this blog post, for example:

      from sentence_transformers import SentenceTransformer
      from sentence_transformers import util
    
      model = SentenceTransformer('all-MiniLM-L6-v2')
    
      data_to_check = [
        "I have recieved wrong package",
        "I hve recieved wrong package"
      ]
      embeddings = model.encode(data_to_check)
      util.cos_sim(embeddings, embeddings)
    
    Outputs:

      tensor([[1.0000, 0.9749],
            [0.9749, 1.0000]])
    • 1986 11 hours ago

      Your data differs from theirs - they have "I have received wrong package" vs "I hve received wrong pckage", you misspelled "received" in both and didn't omit an "a" from "package" in the "bad" data

  • andix 11 hours ago

    This is an awesome article, but I’m missing the part where solutions for each of the problems were discussed.

    Run a spell check before tokenizing? Maybe even tokenize the misspelled word and the potential corrected word next to each other like „misspld (misspelled)“?
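
    A crude sketch of that second idea, keeping the typo and appending the likely correction (vocabulary and cutoff are made up):

      from difflib import get_close_matches

      VOCAB = {"received", "wrong", "package", "misspelled"}

      def annotate_typos(text):
          # keep the original token but append a likely correction,
          # e.g. "misspld (misspelled)"
          out = []
          for word in text.split():
              if word.lower() in VOCAB:
                  out.append(word)
                  continue
              match = get_close_matches(word.lower(), VOCAB, n=1, cutoff=0.8)
              out.append(f"{word} ({match[0]})" if match else word)
          return " ".join(out)

      print(annotate_typos("I hve recieved wrong pckage"))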

    For the issue with the brand names the tokenizer doesn’t know, I have no idea how to handle it. This problem is probably even worse in less common languages, or in languages which use a lot of compound words.

  • ratedgene 14 hours ago

    Can someone expand on this?

    > Chunking is more or less a fixable problem with some clever techniques: these are pretty well documented around the internet;

    Curious about what chunking solutions are out there for different sets of data/problems

    • hansvm 13 hours ago

      It's only "solved" if you're okay with a 50-90% retrieval rate or have particularly nice data. There's a lot of stuff like "referencing the techniques from Chapter 2 we do <blah>" in the wild, and any chunking solution is unlikely to correctly answer queries involving both Chapter 2 and <blah>, at least not without significant false positive rates.

      That said, the chunking people are doing is worse than the SOTA. The core thing you want to do is understand your data well enough to ensure that any question, as best as possible, has relevant data within a single chunk. Details vary (maybe the details are what you're asking for?).

    • pphysch 13 hours ago

      Most data has semantic boundaries: whether tokens, words, lines, paragraphs, blocks, sections, articles, chapters, versions, etc. and ideally the chunking algorithm will align with those boundaries in the actual data. But there is a lot of variety.
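
      A bare-bones version that only respects paragraph boundaries (real pipelines layer in headings, token budgets, overlap, and so on):

        def chunk_by_paragraph(text, max_chars=1000):
            # split on blank lines, then pack paragraphs into chunks
            # without crossing the size limit
            chunks, current = [], ""
            for para in text.split("\n\n"):
                if current and len(current) + len(para) + 2 > max_chars:
                    chunks.append(current)
                    current = para
                else:
                    current = f"{current}\n\n{para}" if current else para
            if current:
                chunks.append(current)
            return chunks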

  • quirkot 12 hours ago

    Is this true?

    >> Do not panic! A lot of the large LLM vocabularies are pretty huge (30k-300k tokens large)

    Seems small by an order of magnitude (at least). English alone has 1+ million words.

    • macleginn 10 hours ago

      Most of these 1+ million words are almost never used, so 200k is plenty for English. Optimistically, we hope that rarer words would be longer and to some degree compositional (optim-ism, optim-istic, etc.), but unfortunately this is not what tokenisers arrive at (and you are more likely to get "opt-i-mis-m" or something like that). People have tried to optimise tokenisation and the main part of LLM training jointly, which leads to more sensible results, but this is unworkable for larger models, so we are stuck with inflated basic vocabularies.

      It is also probably possible now to go for even larger vocabularies, in the 1-2 million range (by factorising the embedding matrix, for example), but this does not lead to noticeable improvements in performance, AFAIK.

      • Der_Einzige 9 hours ago

        Performance would be massively improved on constrained text tasks. That alone makes it worth it to expand the vocabulary size.

    • mmoskal 12 hours ago

      Tokens are often sub-word, all the way down to bytes (which are implicitly understood as UTF8 but models will sometimes generate invalid UTF8...).

    • spott 5 hours ago

      BPE is complete. Every valid Unicode string can be encoded with any BPE tokenizer.

      BPE basically starts with a token for every possible byte value (text is treated as UTF-8 bytes) and then creates new tokens by merging common pairs (‘t’ followed by ‘h’ becomes a new token ‘th’).
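
      A toy version of that training loop, using characters as stand-ins for bytes to keep it readable (real tokenizers count pairs across a whole corpus):

        from collections import Counter

        def train_bpe(text, num_merges=10):
            seq = list(text)  # start from single characters ("bytes")
            merges = []
            for _ in range(num_merges):
                pairs = Counter(zip(seq, seq[1:]))   # count adjacent pairs
                if not pairs:
                    break
                (a, b), _ = pairs.most_common(1)[0]  # most frequent pair -> new token
                merges.append((a, b))
                new_seq, i = [], 0
                while i < len(seq):
                    if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                        new_seq.append(a + b)
                        i += 2
                    else:
                        new_seq.append(seq[i])
                        i += 1
                seq = new_seq
            return merges

        print(train_bpe("the theory of the thing", num_merges=5))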

  • Spivak 16 hours ago

    I think I take something different away from the article: yes, tokenizers are important, but they're a means to get at something much, much bigger, which is how to clean up and normalize unstructured data. It's a current endeavor of mine at $dayjob to do this in a way that can work reasonably well even for badly mangled documents. I don't have any silver bullets, at least nothing worthy of a blog post yet, but this is needed when dealing with OCR'd documents, so "post-OCR correction" turns up quite a few different approaches.

    And this is an aside, but I see folks using LLMs to do this correction in the first place. I don't think using LLMs to do correction in a multi-pass system is inherently bad, but I haven't been able to get good results out of "call/response" (i.e. a prompt to clean up this text). The best results come when you're running an LLM locally and cleaning incrementally, using token probabilities to guide you. You get some candidate words from your wordlist based on a fuzzy match of the text you do have, and candidate words predicted from the previous text, and when both align -- ding! It's (obviously) not the fastest method, however.

    • SEGyges 15 hours ago

      You might have better luck giving the LM the original document and having it generate its own OCR independently, then asking the LLM to tiebreak between its own generation and the OCR output, while the image is still in the context window, until it is satisfied that it got things correct.

    • 7thpower 14 hours ago

      This is interesting. What types of content are you using this approach on, and how does it handle semi-structured data? For instance, embedded tables.