Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
Abstract
Golden Goose synthesizes unlimited RLVR tasks from unverifiable internet text by creating multiple-choice question-answering versions of fill-in-the-middle tasks, enabling large-scale training and achieving state-of-the-art results in cybersecurity and other domains.
Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by the limited supply of existing verifiable data, on which improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting dataset, GooseReason-Cyber, sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
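The mask-and-distract recipe described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: in the real pipeline the masked step and the distractors would be produced by LLM prompts, whereas here they are passed in directly, and all function names are hypothetical.

```python
import random

def synthesize_mcqa(text: str, masked_step: str, distractors: list[str],
                    seed: int = 0) -> dict:
    """Build a verifiable multiple-choice fill-in-the-middle task from a
    source passage. `masked_step` and `distractors` stand in for LLM output."""
    assert masked_step in text, "masked step must appear in the source text"
    # Hide the key reasoning step in the passage.
    question = text.replace(masked_step, "[MASKED]", 1)
    # Shuffle the correct step among the distractors.
    options = distractors + [masked_step]
    random.Random(seed).shuffle(options)
    letters = "ABCDEFGH"[:len(options)]
    answer = letters[options.index(masked_step)]
    return {"question": question,
            "choices": dict(zip(letters, options)),
            "answer": answer}

def reward(prediction: str, task: dict) -> float:
    # RLVR-style verifiable reward: exact match on the chosen option letter.
    return 1.0 if prediction.strip().upper() == task["answer"] else 0.0
```

Because the answer key is known by construction, any such task is verifiable with a trivial exact-match reward, which is what makes otherwise unverifiable text usable for RL.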
Community
TL;DR: We introduce Golden Goose 🦢, a simple method that synthesizes unlimited RLVR tasks from unverifiable internet text by constructing multiple-choice fill-in-the-middle problems. This enables the use of reasoning-rich unverifiable corpora typically excluded from prior RLVR data curation (e.g., science textbooks), allowing RL to scale beyond the data saturation of existing RLVR datasets and achieving new SoTA results on 1.5B and 4B-Instruct models. In a real-world deployment to cybersecurity, where no prior RLVR data exists, Golden Goose synthesizes RLVR tasks from raw FineWeb scrapes, yielding a new SoTA 4B cybersecurity LLM that surpasses a 7B domain-specialized model.
Turning pre-training datasets into the best RLVR tasks! 🎉
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DARL: Encouraging Diverse Answers for General Reasoning without Verifiers (2026)
- ReasonTabQA: A Comprehensive Benchmark for Table Question Answering from Real World Industrial Scenarios (2026)
- WildSci: Advancing Scientific Reasoning from In-the-Wild Literature (2026)
- P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering (2026)
- Aletheia: What Makes RLVR For Code Verifiers Tick? (2026)
- JustRL: Scaling a 1.5B LLM with a Simple RL Recipe (2025)
- DARC: Decoupled Asymmetric Reasoning Curriculum for LLM Evolution (2026)
Thank you for sharing this research! Are there any plans to upload the cybersecurity LLM you created? I would be very interested in testing it out.
This reminds me a bit of the Reinforcement Pre-Training (RPT) approach.
Nice to see Cyber get some love!
Amazing work
Hi authors, thank you for the great work on Golden Goose!
We've been trying to reproduce the results reported in the paper, but have run into some issues and would appreciate your guidance.
Specifically:
- When we subsample the GooseReason dataset and run RL training, we are unable to reproduce the reported performance gains. We understand that subsampling may affect results, but the gap is larger than expected.
- More importantly, we constructed our own GooseReason-Cyber dataset following the procedure described in the paper and trained models on it, but the resulting cybersecurity benchmark performance falls significantly short of what is reported.
We suspect these discrepancies may stem from details that are not fully specified in the paper. Could you share more information on the following?
RL Training Hyperparameters:
- GRPO clip ratio (epsilon)
- Batch size and mini-batch size
- Number of rollouts per prompt
- Learning rate and scheduler details
- KL penalty coefficient (if used)
- Any other training tricks or implementation details that were critical to achieving the reported results
GooseReason-Cyber Data Construction:
- How did you sample or filter cybersecurity-related documents from FineWeb? Were there specific keywords, classifiers, or domain filters used?
- What was the filtering or quality control pipeline after generating the MCQA tasks (e.g., deduplication, difficulty filtering, distractor quality checks)?
We believe sharing these details would be very helpful for the community to build upon your work. Thank you in advance!
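On the FineWeb-filtering question above: one common approach for this kind of domain subsetting is a cheap keyword (or classifier) pass over documents. The sketch below is purely illustrative of that generic technique; the keyword list and threshold are assumptions, not the paper's actual pipeline.

```python
# Illustrative keyword list; the paper's actual filter (if keyword-based
# at all) is unspecified.
CYBER_KEYWORDS = [
    "vulnerability", "exploit", "malware", "phishing",
    "cve-", "ransomware", "penetration testing", "firewall",
]

def is_cyber_doc(text: str, min_hits: int = 3) -> bool:
    """Crude domain filter: keep a document if enough distinct
    cybersecurity terms appear in it."""
    lowered = text.lower()
    hits = sum(1 for kw in CYBER_KEYWORDS if kw in lowered)
    return hits >= min_hits
```

In practice such a keyword pass is usually only a first stage, followed by a trained domain classifier and the MCQA-level quality checks (deduplication, difficulty and distractor filtering) the comment asks about.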
Get this paper in your agent:

    hf papers read 2601.22975

Don't have the latest CLI?

    curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 2
Datasets citing this paper: 1
Spaces citing this paper: 0