Full LLM training and evaluation toolkit

(github.com)

114 points | by testerui 6 hours ago

4 comments

Might be worth updating the title to "SmolLM: state-of-the-art small language model trained on open datasets" (See the first table of https://huggingface.co/blog/smollm for benchmarks)

It was fascinating digging into this to find their dataset weights defined in a declarative YAML file [2]. 70% is from FineWeb/Common Crawl, but filtered using a classifier trained on Llama-70b's 0-5 ratings of the educational content of the text [3]. This is something we know small models like Phi-3 have been doing for a while, but it's great to see a fully open reproduction of it that beats their benchmarks. It definitely supports the idea that you can get even better reasoning at smaller model sizes by carefully filtering and curating your training data (and by generating good synthetic data from, or distilling, bigger models).
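For anyone who wants a feel for what declarative dataset weights mean in practice, here is a minimal sketch of weighted source sampling. Only the roughly 70% FineWeb share comes from the config discussed above; the other source names, their weights, and the `sample_sources` helper are placeholders I made up for illustration, not what the SmolLM YAML actually contains.

```python
import random

# Hypothetical mixture: only the ~70% share is taken from the comment above.
# The remaining names and weights are placeholders, not the real config.
MIXTURE = {
    "fineweb-edu": 0.70,
    "placeholder-source-a": 0.20,
    "placeholder-source-b": 0.10,
}


def sample_sources(mixture, n, seed=0):
    """Draw n source names with probability proportional to their weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


draws = sample_sources(MIXTURE, 10_000)
share = draws.count("fineweb-edu") / len(draws)
print(f"fineweb-edu share: {share:.2f}")
```

A real training pipeline would interleave token streams rather than source names, but the weighting idea is the same.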

You can see the 450k Llama educational value scores here: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-ll... Interestingly, I think the text scored 3 is really good, but the documents scored 5 tend to be content that isn't very reasoning- or information-heavy and just mentions education or a worksheet. For SmolLM they simply kept the documents with scores >= 3, so it doesn't matter a ton.
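The thresholding step described above is simple enough to sketch. This is a toy illustration, assuming each document already carries an integer 0-5 score; the `ScoredDoc` type and `keep_educational` function are hypothetical names, not part of the FineWeb-Edu or SmolLM code.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    text: str
    score: int  # educational-value rating from 0 to 5 (assumed precomputed)


def keep_educational(docs, min_score=3):
    """Keep only documents whose score meets the threshold."""
    return [doc for doc in docs if doc.score >= min_score]


docs = [
    ScoredDoc("celebrity gossip", 1),
    ScoredDoc("explanation of photosynthesis", 3),
    ScoredDoc("worksheet landing page", 5),
    ScoredDoc("product listing", 2),
]
kept = keep_educational(docs)
print([doc.text for doc in kept])
```

Because the cut is a plain >= 3, a mediocre document scored 5 and a strong one scored 3 are treated identically, which is why the quirks in the top bucket don't matter much.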

[2] https://github.com/huggingface/smollm/blob/9efce803bc7e37727...
[3] https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier

timhigins 4 hours ago

Update: While SmolLM was SOTA at the time of release in July, SmolLM 2 1.7B (which is the newest release) is not currently the best model under 2B params on https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

abeppu 4 hours ago

While it's great that this is open source, and I understand the pressure for smaller models that can be run in a wider range of contexts, I continue to be annoyed that authors keep posting comparisons to models which are slightly smaller.

On this page, SmolLM2-1.7B does a bit better than Qwen2.5-1.5B, which is ahead of Llama3.2-1B. At the next size level up, in other comparisons I've seen, e.g. Phi-3.5 (which is ~3.8B params) does a bit better than Llama 3.2 3B. Gemma 2 has a 9B size, Llama 3.1 has an 8B size, and I think when that came out Mistral had a 7B model -- so whenever a new "small" thing does "better" than its peers, we can't easily tell whether any of the many small choices the authors made were actually better, or whether the model is just benefiting from being slightly larger.

bashfulpup an hour ago

Pythia is stupidly easy to use.

Then hook up a simple test harness. This is like a grand total of 3 commands: git pull, install, then point it at a model and run.