The Problem of Data Selection in LLM Pretraining
Developing large language models entails substantial computational investment, especially when experimenting with diverse pretraining corpora. Evaluating datasets at full scale, on the order of billions of parameters and hundreds of billions of tokens, can consume hundreds of thousands of GPU hours per run. Consequently, practitioners resort to smaller-scale experiments as proxies for large-model behavior. Yet these "pilot" studies are rarely published, producing a fragmented landscape in which every laboratory repeats similar small-scale tests without shared benchmarks or methodologies. This opacity impedes reproducibility, underutilizes collective insights, and obscures the true trade-offs between development compute and final model performance.

DataDecide
To address these limitations, the Allen Institute for AI (AI2), in collaboration with the University of Washington and the University of Pennsylvania, today releases DataDecide, a comprehensive suite of controlled pretraining experiments spanning 25 distinct corpora and 14 model sizes from 4 million to 1 billion parameters. DataDecide's datasets include well-known sources such as Dolma, DCLM, RefinedWeb, C4, and FineWeb, along with variations produced by domain ablation, deduplication, quality filtering, and source mixing. Each model is trained at a fixed token-to-parameter ratio of 100 (100 tokens per parameter), reflecting the "overtraining" regime that optimizes inference efficiency. In total, over 1,050 models and more than 30,000 checkpoints, each evaluated across ten downstream tasks, are released to the public.
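To make the fixed ratio concrete, the short sketch below works out the token budget it implies for three of the named model sizes; the 6 × N × D training-FLOPs figure is a common rule-of-thumb estimate included only for rough orientation, not a number from the DataDecide release.

```python
# Token and compute budgets implied by the fixed 100-tokens-per-parameter
# ratio. The 6 * N * D FLOPs estimate is a standard rule of thumb, not a
# figure taken from the DataDecide paper.
TOKENS_PER_PARAM = 100

for params in (4e6, 150e6, 1e9):  # three of the 14 model sizes mentioned above
    tokens = TOKENS_PER_PARAM * params
    train_flops = 6 * params * tokens  # rough forward + backward cost
    print(f"{params / 1e6:7.0f}M params -> {tokens / 1e9:6.1f}B tokens, "
          f"~{train_flops:.1e} training FLOPs")
```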
Technical Structure and Practical Benefits
DataDecide orchestrates experiments along three axes:
- Data Recipes: Twenty-five well-documented pretraining corpora, each embodying different curation strategies (see Table 1 in the paper for full recipe specifications).
- Model Scale: Fourteen parameter configurations (4M–1B), programmatically derived via the OLMo model ladder to ensure consistent training hyperparameters across scales. Each non-target scale includes two "early-stop" seed runs, while the 1B-parameter models feature three full seed reruns to quantify variability (a back-of-envelope tally of the resulting grid follows this list).
- Evaluation Suite: The OLMES benchmark of ten multiple-choice tasks (e.g., MMLU, ARC Easy/Challenge, HellaSwag, MBPP, HumanEval) provides a multifaceted view of language understanding, commonsense reasoning, and code generation performance.
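As a sanity check on the headline model count, the sketch below tallies the grid under one plausible reading of the seed-run setup described above (a main run plus two early-stop seed runs at each non-target scale, three full seed runs at 1B); the exact accounting in the paper may differ.

```python
# Rough tally of the experimental grid: 25 recipes across 14 scales with seeds.
# Assumption (one plausible reading of the setup above): each of the 13
# non-target scales contributes a main run plus two early-stop seed runs, and
# the 1B target scale contributes three full seed runs, per data recipe.
N_RECIPES = 25
NON_TARGET_SCALES = 13        # 14 scales total, minus the 1B target
RUNS_PER_NON_TARGET = 1 + 2   # main run + two early-stop seed runs
RUNS_AT_TARGET = 3            # three full seed reruns at 1B

runs_per_recipe = NON_TARGET_SCALES * RUNS_PER_NON_TARGET + RUNS_AT_TARGET
print(runs_per_recipe, N_RECIPES * runs_per_recipe)  # 42 runs/recipe, 1050 models
```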
By releasing both the pretraining datasets and the corresponding models, DataDecide enables researchers to:
- Reuse checkpoints for new evaluations without retraining (a minimal loading sketch follows this list).
- Experiment with novel prediction methods (e.g., advanced scaling-law fits, smoothing techniques).
- Study benchmark sensitivity to training data and model scale.
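As a minimal sketch of the first point, the snippet below loads a released checkpoint with the Hugging Face transformers library and scores a prompt without any retraining. The repository id and revision are placeholders, not real identifiers; the actual names are listed in the DataDecide collection on Hugging Face.

```python
# Minimal sketch: reuse a released checkpoint for a new evaluation instead of
# retraining. Repository id and revision are placeholders, not the real
# DataDecide identifiers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "allenai/DataDecide-<recipe>-<size>"  # placeholder; see the HF collection
revision = "main"                               # or a specific intermediate checkpoint

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, vocab) scores for downstream evaluation
print(logits.shape)
```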
Key Findings and Quantitative Insights
DataDecide's systematic analysis yields four practical guidelines:
- Single-Scale Baseline Robustness: Ranking corpora by downstream accuracy at a single, small scale (e.g., 150M parameters) achieves roughly 80 percent decision accuracy for predicting the best dataset at the 1B-parameter target scale (a pairwise decision-accuracy sketch follows this list). In contrast, eight baseline scaling-law extrapolations do not surpass this simple heuristic, underscoring its cost-effectiveness.
- Task-Dependent Compute Sensitivity: The compute budget required for reliable decisions varies markedly by task. Benchmarks like MMLU and ARC Easy become predictable with less than 0.01 percent of the target compute, while HellaSwag and SocialIQA demand orders of magnitude more FLOPs to reach similar decision accuracy.
- Proxy Metric Selection: Continuous likelihood metrics, particularly the character-normalized average probability of correct continuations (CORRECT PROB) and total probability (TOTAL PROB), outperform discrete accuracy measures at small scales. This is most pronounced on code tasks (MBPP, HumanEval), where decision accuracy jumps from near-random to over 80 percent with CORRECT PROB as the proxy (a sketch of this metric also follows the list).
- Variance and Spread Matter: High decision accuracy correlates with low run-to-run variance (noise) and ample performance spread across datasets. Proxy metrics that reduce noise or amplify spread thus directly improve prediction reliability.
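To make the notion of decision accuracy concrete, here is a minimal sketch of one natural pairwise formulation: the fraction of recipe pairs for which a small-scale proxy agrees with the 1B target scale about which dataset is better. The scores are illustrative placeholders, not results from the paper.

```python
# Pairwise "decision accuracy": the fraction of recipe pairs for which a
# small-scale proxy ranks the two datasets the same way the 1B target does.
from itertools import combinations

def decision_accuracy(proxy_scores: dict, target_scores: dict) -> float:
    pairs = list(combinations(proxy_scores, 2))
    agree = sum(
        (proxy_scores[a] > proxy_scores[b]) == (target_scores[a] > target_scores[b])
        for a, b in pairs
    )
    return agree / len(pairs)

proxy = {"recipe_A": 0.41, "recipe_B": 0.38, "recipe_C": 0.44}   # e.g. 150M-scale accuracy
target = {"recipe_A": 0.55, "recipe_B": 0.57, "recipe_C": 0.60}  # e.g. 1B-scale accuracy
print(decision_accuracy(proxy, target))  # 2 of 3 pairs agree -> ~0.67
```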
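For the proxy-metric point, the following is a hedged sketch of a character-normalized correct-continuation likelihood. It is one plausible formulation of the CORRECT PROB idea named above, not the exact OLMES implementation, and the log-probabilities shown are made up.

```python
# One plausible character-normalized "correct prob" style proxy: the
# per-character likelihood the model assigns to the gold continuation.
import math

def char_normalized_prob(token_logprobs: list[float], continuation: str) -> float:
    # Sum the log-probabilities of the continuation's tokens, then normalize
    # by continuation length in characters before exponentiating.
    total_logprob = sum(token_logprobs)
    return math.exp(total_logprob / max(len(continuation), 1))

# Example: a 3-token gold answer with made-up per-token log-probs.
print(char_normalized_prob([-1.2, -0.4, -0.9], " the Eiffel Tower"))
```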
Concluding Perspective
DataDecide transforms pretraining data selection from an ad hoc art into a transparent, data-driven science. By open-sourcing all 25 corpora, 1,050 models, 30,000+ checkpoints, and evaluation scripts on Hugging Face and GitHub, AI2 invites the community to reproduce findings, extend evaluations to new benchmarks, and innovate on decision-making methods. As LLM development continues to demand ever-greater compute resources, DataDecide offers a principled framework for minimizing wasted experiments and maximizing insight, paving the way toward more efficient, reproducible, and collaborative AI research.
Check out the Paper, the models on Hugging Face, and the technical details.