Results are modest, maybe 20-30% fewer training steps to reach target performance. This won't solve the problem of organic data exhaustion. We need 100x more data.
They didn't test against actual language-model pretraining, only against a random init. Consider three setups (sketched in code below):
- A: Pre-trained on their synthetic LSTM data -> fine-tuned on Wikipedia
- B: Pre-trained on different natural language corpus -> fine-tuned on Wikipedia
- C: Random initialization -> fine-tuned on Wikipedia
They only test A vs C, not A vs B.
This paper addresses the problem of running out of data. You can't do B when you've run out of data, so it's irrelevant.
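For concreteness, here's a minimal sketch of the three arms in toy form: tiny LSTM language models and random token tensors standing in for the actual corpora. All names, sizes, and step counts are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, N_BATCHES = 256, 64, 50

def make_corpus(seed):
    """Placeholder corpus: random token batches standing in for a real dataset."""
    g = torch.Generator().manual_seed(seed)
    return [torch.randint(0, VOCAB, (32, SEQ_LEN), generator=g) for _ in range(N_BATCHES)]

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 64)
        self.rnn = nn.LSTM(64, 128, batch_first=True)
        self.head = nn.Linear(128, VOCAB)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)

def train(model, corpus, steps):
    """Next-token training for a fixed number of steps; returns the final loss."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss = torch.tensor(0.0)
    for step in range(steps):
        x = corpus[step % len(corpus)]
        logits = model(x[:, :-1])
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, VOCAB), x[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

synthetic_lstm_data = make_corpus(0)   # stand-in for the paper's LSTM-generated data
other_natural_corpus = make_corpus(1)  # stand-in for a different natural-language corpus
wikipedia = make_corpus(2)             # stand-in for the shared fine-tuning target

results = {}
for name, pretrain_corpus in [("A: synthetic pre-train", synthetic_lstm_data),
                              ("B: natural-corpus pre-train", other_natural_corpus),
                              ("C: random init", None)]:
    torch.manual_seed(42)              # identical initialization across arms
    model = TinyLM()
    if pretrain_corpus is not None:
        train(model, pretrain_corpus, steps=200)
    results[name] = train(model, wikipedia, steps=200)  # fine-tune on the shared target

for name, loss in results.items():
    print(f"{name}: final fine-tuning loss {loss:.3f}")
```

With real corpora swapped in, arm B is the missing baseline: it would tell you whether the synthetic data helps because it is "universal", or just because any pre-training beats none.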
This is a cool concept, but I can't help wishing for more comparison between the treatment group and a control group that doesn't see any universal pre-training data.
It's good that they compare various model sizes, evaluation tasks, and random data generators. I just think the paper would prove its point more effectively if it showed that models of the same size that see this random data learn better from the evaluation data later on.
They could even pit the model's initial checkpoint, taken before universal pre-training, against the pre-trained checkpoint (sketched in code below). If the method works, the one that did UP should win.
Maybe I'm way off; I'll admit I've only skimmed it so far. It seems promising, I'm just wishing for some controls.
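For what it's worth, that checkpoint comparison is cheap to run. A compact sketch, again with a toy model and random-token stand-ins in place of the real UP data and target data (names are illustrative, not from the paper):

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy "language model": embed 32 context tokens, flatten, predict the next token.
model = nn.Sequential(nn.Embedding(256, 64), nn.Flatten(), nn.Linear(64 * 32, 256))

def lm_step(m, opt, x):
    """One training step on a toy objective: predict the last token from the rest."""
    loss = nn.functional.cross_entropy(m(x[:, :-1]), x[:, -1])
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def run(m, data, steps=100):
    opt = torch.optim.Adam(m.parameters(), lr=1e-3)
    loss = float("inf")
    for i in range(steps):
        loss = lm_step(m, opt, data[i % len(data)])
    return loss

up_data   = [torch.randint(0, 256, (16, 33)) for _ in range(20)]  # stand-in UP data
eval_data = [torch.randint(0, 256, (16, 33)) for _ in range(20)]  # stand-in target data

checkpoint_before_up = copy.deepcopy(model.state_dict())  # "initial checkpoint"
run(model, up_data)                                       # universal pre-training
checkpoint_after_up = copy.deepcopy(model.state_dict())

# Fine-tune both checkpoints identically and compare final losses.
for name, state in [("no UP", checkpoint_before_up), ("with UP", checkpoint_after_up)]:
    m = copy.deepcopy(model); m.load_state_dict(state)
    print(f"{name}: final fine-tuning loss {run(m, eval_data):.3f}")
```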
In figures 2, 4, and 6, the top left end of the training curves represents models that have not seen any pretraining data. In figure 5, they're represented by dashed curves.
Abstract: "We investigate the use of randomly generated data for the sake of pre-training a model. We justify this approach theoretically from the perspective of algorithmic complexity, building on recent research that shows that sequence models can be trained to approximate Solomonoff induction. We derive similar, but complementary theoretical results. We show empirically that synthetically generated data can be used to pre-train a model before the data is seen. We replicate earlier results that models trained this way show zero-shot in-context learning across a variety of datasets, and that this performance improves with scale. We extend earlier results to real-world data, and show that finetuning a model after pre-training offers faster convergence and better generalization."