Thank you unsloth for the amazing work once again!
The new sleep mode in vLLM is really amazing, and it seems the community hasn’t quite wrapped their heads around how much more accessible this makes RL training.
I’m reading a lot of dismaying posts on this thread, pushing the idea that only big labs should be doing RL. This couldn’t be further from the truth, folks should try it for themselves and see the outcomes!
Thank you! No worries at all! Yes! Sleep mode is super cool since it means the memory allocated for inference can be fully decoupled from training, which opens the door to much larger RL runs!
Unsloth/Denial is amazing.
Daniel I suppose?
Ye probs mis-spelt :)
You can now generate reasoning traces with tool calling, or format pre-existing datasets into the OpenAI Harmony format used by gpt-oss, with DeepFabric:
https://github.com/lukehinds/deepfabric/discussions/334
I love love love Unsloth and everything they do, so do not take what I am about to say as criticism of them.
But what's the point? GPT-OSS is regarded as a pretty bad open-source model compared to the latest DeepSeek or Qwen releases. Most attempts to use reinforcement learning, or any kind of post-training really, fail because the data you have is of worse quality and quantity than the data the model was originally trained on. So you get catastrophic forgetting and a model with lower general IQ than before fine-tuning.
This is true btw even if you use LoRA or better techniques that supposedly "mitigate" catastrophic forgetting. Even pyreft/ReFT, which in some cases touch only 0.001% of a model's parameters, cause these kinds of issues in my experiments.
So why should anyone except AI researchers and the big 4 AI providers care about fine-tuning? The vast majority of people who think they need fine-tuning actually need good-quality RAG/Agentic RAG systems: they can trivially add or remove data (machine unlearning doesn't work yet), RAG grounds the model and objectively makes it more accurate, and they keep full control over how retrieved content is used in the prompt context. On top of that, vector DBs/embeddings "easily" scale to billions of records.
Oh hey! Thanks for the love :)
The primary goal of the release and our notebook https://colab.research.google.com/github/unslothai/notebooks... was actually to showcase how to mitigate reward hacking in reinforcement learning - for example when RL learns to cheat, say by writing to global variables or editing the timer to game the benchmark. You can edit the notebook to do RL on other powerful models like Qwen, Llama etc. automatically with Unsloth as well via our automatic compiler! We also made sink attention and MoE inference super optimized for training - note Flash Attention 3 doesn't support the backward pass for attention sinks, so you'll have to use Unsloth.
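For a rough sense of what mitigating reward hacking can look like in code, here's a minimal, hypothetical reward-function sketch - not the notebook's actual implementation - assuming completions arrive as plain Python source strings: it rejects completions that touch global state or patch the timer before scoring anything.

```python
import ast

def antihack_speed_reward(completions, **kwargs):
    """Hypothetical reward: score generated kernels on speed, but give a
    large penalty if the code cheats (globals, timer patching, etc.)."""
    rewards = []
    banned = ("global ", "time.time = ", "timeit.default_timer = ")
    for code in completions:
        # Hard fail on obvious reward hacks before measuring anything.
        if any(token in code for token in banned):
            rewards.append(-10.0)
            continue
        try:
            ast.parse(code)  # must at least be valid Python
        except SyntaxError:
            rewards.append(-5.0)
            continue
        # In a real run you'd benchmark the kernel in a sandbox here and
        # return e.g. a speedup ratio; this is just a placeholder.
        rewards.append(1.0)
    return rewards
```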
gpt-oss tbh in our tests is a truly powerful model, especially the 120B variant - it's extremely popular in Western enterprises, partly because yes, it's from OpenAI, but also because the high reasoning mode, its censored nature and its reasoning capabilities are attractive. A big underutilized feature is its web search and intermediate tool calling, which it can do as part of its reasoning chain just like o3 or GPT-5.
Yes, RL isn't an all-powerful hammer, but it can solve so many new problems. For a financial institution, automatic trading strategies via RL. For an intelligence agency, decryption via RL. For a legal startup, possibly case breakthroughs; automatic drug candidates, etc. And yes, big labs want to automate all tasks via massive RL - being able to play Pokemon and all other games is one example. RL opened so many doors since you don't need any data, just one prompt like "make fast matrix multiplication kernels" plus reward functions - it allows many more interesting use cases where data is a constraint!
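And to make the "one prompt plus reward functions" point concrete, here's roughly what the wiring can look like with TRL's GRPOTrainer - the model id, hyperparameters and toy reward below are illustrative, not a tested recipe:

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Literally one prompt, repeated so the trainer has batches to sample from.
prompts = Dataset.from_dict({
    "prompt": ["Write a fast matrix multiplication kernel in Python."] * 64
})

def toy_reward(completions, **kwargs):
    # Placeholder reward; a real run would benchmark the generated kernel.
    return [1.0 if "def matmul" in c else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # illustrative small model
    reward_funcs=toy_reward,
    args=GRPOConfig(output_dir="grpo-demo", num_generations=4),
    train_dataset=prompts,
)
trainer.train()
```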
Can you elaborate on “decryption via RL”
Definitely not breaking any modern day standards, but from what I understand, some folks are trying it on simple ciphers or combinations of simple ciphers to first see if RL can help.
I think you can train a model to decrypt an encrypted message. My friend tried this, though only on simple examples. As long as we have the environment, we can do these things.
I’m sorry but I don’t buy for a second that you can do meaningful and even close to reliable decryption with RLHF on currently known secure ciphers.
Furthermore, I’m very worried that whoever may be paying for this is barking up the wrong tree. I feel that the damage done with extremely bad decryption attempts would massively outweigh the very few times when whatever it “decrypts” is meaningfully close to what the actual text was.
I'm aware of how easy certain things in surveillance are (i.e. n-gram analysis is enough to dox anyone on HN in like 10 words of text) - but even sort-of-decent decryption of SHA-256 would literally be a front-page-of-the-world achievement.
If you're going to be rude and arrogant, then the level of knowledge you exhibit has to match. SHA-256 "decryption" would be a front-page-of-the-world achievement because it would mean redefining foundational mathematics: it's not an encryption algorithm in the first place. The terms you're looking for are either finding a SHA-256 collision or breaking actual encryption algorithms like AES, RSA, ECC, etc.
SHA-256 is used as a primitive in the construction of certain encryption schemes, but by itself it never encrypts anything. If it did, you'd also have invented middle-out compression, since you could "encrypt" arbitrary-length input into 256 bits of output and recover it.
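A tiny standard-library illustration of that pigeonhole point, for anyone following along:

```python
import hashlib

# Inputs of any length map to the same fixed 256-bit (32-byte) digest,
# so there's nothing to "decrypt" the original message back out of.
for msg in [b"hi", b"x" * 10_000_000]:
    print(len(hashlib.sha256(msg).digest()))  # 32 both times
```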
Oh yes, if RL breaks SHA-256 that'll be revolutionary - but definitely not that. For now some folks are investigating basic combinations of old-school ciphers. Security applications of RL are most likely, for now, about automatically finding attack surfaces and creating defensive layers - I probably should have re-worded "decryption via RL" to just "security via RL", sorry!
I have compared instruction following of stock gpt-oss against stock Qwen, and the 20B outperformed all others, intelligently following instructions and reasoning correctly about tool calling for tools it hasn't been trained on. Additionally, it runs like a ~3B model, since only a few of its 32 experts are active per token. I don't know where the claim that it sucks comes from, but my evaluation of similarly competitive models showed it leading the pack by a lot.
The performance is great, but the censorship is ridiculous for me. I tried it as a backend for my game Guessix[1], but it would refuse for ridiculous reasons like "Cannot answer questions about copyrighted works like Harry Potter."
1. https://guessix.com/
Try the uncensored/jailbroken variants like openai-gpt-oss-20b-abliterated-uncensored-neo-imatrix
I just tried to ask it how to make crystal meth and it generated a very detailed step by step guide
I have heard that uncensored gpt-oss is not very good because it was trained mainly on synthetic data. Is that not true?
IIRC abliteration (ablation?) can be done without "training" and is pretty quick. It finds the direction in the model's activations related to the concept you want to ablate, and modifies the weights to "deactivate" it. Precision brain surgery, to anthropomorphize.
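For the curious, a heavily simplified sketch of the usual recipe (directional ablation: estimate a "refusal direction" from activations on harmful vs. harmless prompts, then project it out of a weight matrix) - the tensors below are random stand-ins, not any real model's weights:

```python
import torch

def remove_direction(W: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of W's output: W <- (I - d d^T) W,
    so y = W x no longer has any component along d."""
    d = direction / direction.norm()
    return W - torch.outer(d, d) @ W

# The refusal direction is usually estimated as the mean difference between
# activations on harmful and harmless prompts.
hidden = 16
acts_harmful = torch.randn(100, hidden)
acts_harmless = torch.randn(100, hidden)
refusal_dir = acts_harmful.mean(0) - acts_harmless.mean(0)

W_out = torch.randn(hidden, hidden)  # stand-in for an attention/MLP output projection
W_edited = remove_direction(W_out, refusal_dir)
```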
The problem with synthetic data would be that the censored information would not be in the training data at all.
Very interesting! Do the benchmarks hold up well or does it reduce performance in other areas too?
Use constrained generation
Do you mean like structured outputs? Unfortunately here the model is guided to explicitly tell you when you violate the rules and why, so it can confuse its system rules with the game rules and say you're not allowed to ask a question about copyrighted material, etc.
Mind explaining?
If you constrain the model to a JSON schema, most frivolous refusals go away.
And if you finetune on a few formatted examples the effect is even greater
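For anyone wondering what "constrain the model to a JSON schema" looks like in practice, here's a minimal sketch against an OpenAI-compatible endpoint that supports json_schema response formats - the endpoint URL, model name and game schema are made up for illustration:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")  # e.g. a local gpt-oss server

schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string", "enum": ["yes", "no", "maybe"]},
        "hint": {"type": "string"},
    },
    "required": ["answer", "hint"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are the hidden-word game master."},
        {"role": "user", "content": "Is the character from Harry Potter?"},
    ],
    # Constrained decoding: the model can only emit JSON matching this schema,
    # which leaves much less room for free-form refusals.
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "guess_reply", "schema": schema, "strict": True},
    },
)
print(resp.choices[0].message.content)
```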
Curious as well
I'm always so confused by those statements as well. Because just like you, I feel that the 20B version is really good at following instructions.
Some of the qwen models are too, but they seem to need a bit more handholding.
This is of course just anecdotal from my end. And I've been slacking on keeping up with evals while testing at home
Even if you're right about all of this it still doesn't refute my "fine-tuning doesn't work for 99% of customers including you" thesis.
Also, most mainstream AI benchmarks do not agree with you:
LMArena (https://lmarena.ai/leaderboard) has GPT-OSS 120B at #53 and GPT-OSS 20B at #69 (nice), which is extremely far from leading.
DeepSeek V3.1 is ranked #9, and is a solid 60+ elo points above GPT-OSS.
I know you're going to link some of the "ya but chatbot arena sucks cus of theoretical attacks against it" paper and the llama4 debacle, but here's more evidence that GPT-OSS blows:
https://livebench.ai/#/?q=GPT-oss
GPT-oss global average: 54.60
Deepseek V3.1 thinking global average: 70.75
Qwen 3 32B global average: 63.71
So bring receipts next time because I did.
Qwen 32B is a dense model. The competitors for GPT-OSS 20B are at most 8B dense models. You're comparing it against models it's not competing with and calling it crap. That's like saying a Ferrari is better than a Toyota. Sure, but only if we're comparing 0-60. If we add a budget of $20K, suddenly the Toyota starts looking more competitive.
I never claimed it was a frontier model. Just best in class for the performance it can achieve and the memory footprint it can fit in.
And btw OSS did super well on domain specific tests without fine tuning. A model I don’t need to fine tune beats one that does.
Excuse my ignorance, but what is a “dense” model? Why is comparing GPT-OSS 20B against 8B dense models “fair” but not when comparing it against 32B models?
So gpt-oss-20b is a sparse MoE model
Which means it has ~3B parameters active per token.
Qwen3-32B has 32B params active per token.
An MoE model will not activate most of its parameters for any given query, so most of the model does zero compute for a given token.
A dense model with 32B parameters means all 32B get used in the calculation for every token, and every calculation takes time (assuming a similar hidden size, which they all roughly have). For example Qwen-32B.
An MoE model might have, say, 80B parameters, but only 3B get used in the calculation for any given token. For example Qwen3-Next 80B A3B.
Performance comparison:
Qwen-32B => 56 tok/sec, 32 GB of VRAM
Qwen3-Next 80B A3B => 167 tok/sec, 85 GB of VRAM
So despite being close to 3x "bigger", Qwen3-Next is more than 3 times faster on the same compute. There's a catch though: because the subset of the model that gets activated changes from one token to the next, all 80B parameters still have to be loaded into memory.
So MoE performs much better with less compute, at the cost of more memory. It also performs better on benchmarks; it is simply the better model.
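Back-of-the-envelope numbers for that trade-off, using the usual ~2 FLOPs per active parameter per token approximation (real throughput depends on hardware and memory bandwidth, so treat this as illustrative only):

```python
# Rough per-token compute: ~2 FLOPs per *active* parameter.
def flops_per_token(active_params_b: float) -> float:
    return 2 * active_params_b * 1e9

dense_32b = flops_per_token(32)    # Qwen-32B: all 32B params active
moe_80b_a3b = flops_per_token(3)   # Qwen3-Next 80B A3B: ~3B active

print(f"dense 32B  : {dense_32b:.1e} FLOPs/token")
print(f"MoE 80B-A3B: {moe_80b_a3b:.1e} FLOPs/token")
print(f"compute ratio: {dense_32b / moe_80b_a3b:.0f}x")  # roughly an order of magnitude less compute
# Memory is the opposite story: the MoE still needs all 80B params resident.
```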
Similar techniques have long been used in ML to great success: rather than trying to create one brilliant model, create many that each have pros and cons, and then train a second model to figure out which one is best for the task in front of you. There's even a name for the practice: "ensemble models". MoE only kind of fits the definition (because you can't easily swap out the experts).
There are other factors, the big one being attention. That's why non-attention models like Mamba wipe the floor with anything else in terms of performance per FLOP (a unit of compute). When it comes to intelligence, however...
Qwen3-Next and Qwen3-30B-A3B are pretty decent proxies for GPT-OSS 120B and 20B respectively (and in fact are both MoEs with 3B active rather than 8B active parameters), and they lap GPT-OSS pretty hard in this specific benchmark, getting to #17 and #33 respectively. That said, it's hard to take benchmarks with more than a grain of salt, because the real-world tasks I use these models for always have a lot more variation than the benchmarks capture. I do view GPT-OSS as a pretty good alternative to the Qwen models in some cases, but there are tradeoffs - while I sometimes see better reasoning from GPT-OSS, the prompt adherence and overall flexibility of the Qwen models make them much better IMO as general-purpose local open-weight models.
Those models got released later than GPT-OSS. It’s like saying the Android phone released 6 months after the iPhone is faster. Maybe, but it also had 6 extra months of development.
You don't mention that MoE models horribly damage logprobs (by definition) and basically need a whole new theory of LLM sampling written for them.
Dense models are better for a reason, and the idea that "everyone is doing MoE now and dense models are dead" is total bunk nonsense.
You can quantize dense models, and 4-bit quantized Qwen 32B is still better than full-precision GPT-OSS. Luckily Unsloth even gives you tools to go down to 1.58 bits!
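For reference, loading a pre-quantized 4-bit dense model with Unsloth is only a couple of lines; a minimal sketch assuming the usual FastLanguageModel API, with an illustrative model id:

```python
from unsloth import FastLanguageModel

# Minimal sketch (model id illustrative): load a 4-bit quantized dense model
# so a 32B-class model fits in a single-GPU memory budget.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-32B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)
```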
I literally talked to 5 customers last week who needed fine-tuning, legitimately needed it. I get that if you're just doing basic RAG on text you generally don't, but that's only part of the ecosystem.
I didn’t say it didn’t need it, just that it needed it less. For example, previous models needed a lot of careful fine tuning to properly do tool calling. OSS does not need that.
This is a deeply untrue statement echoed by a lot of people unfortunately.
There was a post the other day on HN where a dev spent $2k doing RL on an open model and beat the frontier models on Web Voyager.
Research is packed with examples like this yet we keep hearing from the community that training doesn’t work.
Then consider voice/vision modalities where context engineering doesn’t work well at all. I just spent 2 years on the computer use problem and can tell you with absolute certainty that context won’t get you there (we tried and tried). The only thing that meaningfully moved the needle was RL.
I really wish this idea that only big labs can train models would die. It’s really hurting the ecosystem at this point.
I work for a D2C product, with a standard frontend/backend, and a RAG system. It's becoming annoying when the product manager (not a tech guy) keeps asking for a fine-tuned LLM: most people read news and assume they know what's best.
Lots of business decisions are based on “we prefer to use this model because it’s from OpenAI”, not necessarily because it’s the best.
I’ve also seen a lot of enterprises suddenly invest in fine-tuning their own models, even though there is absolutely no reason they should be doing that. But because there’s a huge “we need to do something with AI” directive from the top. Eg “if we fine tune an AI model to understand our custom query language it will prevent our data scientists from taking down production databases”. This is an actual example I encountered just last week.
So if there's a point, it's probably not that it's the best idea, but rather that enterprises are willing to buy it.
GPT-OSS models are amazing, and a lot of the bad press came from poor implementations of them in the usual tools and from people not understanding how to handle their unusual out-of-the-box quantization. Unsloth has done an amazing job unpacking the best approaches.
> So why should anyone except AI researchers and the big 4 AI providers care about fine-tuning?
I currently have Qwen3 Coder 30B A3B on prem for developers to use and it's pretty good for that: fits within two Nvidia L4 cards (with quantization), has about 60 tok/s performance and can even use tools with something like RooCode or OpenWebUI.
However, if anyone asks it something in Latvian, it messes up more often than not, like someone who has half-learned the language: it more or less uses the words for the right concepts, but not always the best pick, and it very often drops or misuses diacritics (ā, č, ē, ģ, ī, ķ, ļ, ņ, š, ū, ž). In short, some basis of the language is there, but in practice it's sadly still unusable.
So far, for working with Latvian text I need to keep EuroLLM running in parallel: it has great Latvian knowledge, but it simply knows less overall (not a model that's good for programming), doesn't call tools as well, and isn't really meant for long contexts: https://huggingface.co/collections/utter-project/eurollm-66b...
So my ideal model (for this use case) would be something along the lines of: around 30B, since I can't fit anything much bigger into VRAM for now; MoE; supports tool calling; okay programming knowledge; long enough context support; and can converse in Latvian so I don't need to run two models (running both means the context sizes that fit in VRAM are way too small for either of them).
Without finetuning, I just have to sit and wait for someone to release that, and it feels unlikely it'll just pop into existence (even the bigger EuroLLM model is nowhere to be seen, and TildeOpen performs badly). With finetuning, maybe I have a chance: take the Latvian Wikipedia or whatever other samples of the language in use I can get, filter down to topics I care about, maybe use EuroLLM to generate question/answer pairs for training from that input data, and then just run Unsloth over a weekend or something (who knows, maybe that'd be enough to bring Qwen3 up to speed).
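The data-prep half of that plan is mostly plumbing; a rough sketch, assuming EuroLLM has already produced question/answer pairs as JSONL (the field names and file paths are made up):

```python
import json

def to_chat_example(question: str, answer: str) -> dict:
    """Wrap an EuroLLM-generated Latvian Q/A pair in the chat format most
    SFT trainers (e.g. TRL's SFTTrainer) accept out of the box."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }

# Assume qa_pairs.jsonl holds {"question": ..., "answer": ...} records built
# from filtered Latvian Wikipedia paragraphs.
with open("qa_pairs.jsonl", encoding="utf-8") as src, \
     open("latvian_sft.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        rec = json.loads(line)
        dst.write(json.dumps(to_chat_example(rec["question"], rec["answer"]),
                             ensure_ascii=False) + "\n")
```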
Do most people care about that stuff? Probably not, but when you have to deal with a language that's less represented in the training data but will never have the budget to train something from scratch, finetuning is nice to have. RAG can't fix it not knowing how to use language.
Reads like AI slop / a mad marketer.