Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
Abstract
Golden Goose synthesizes unlimited RLVR tasks from unverifiable internet text by creating multiple-choice question-answering versions of fill-in-the-middle tasks, enabling large-scale training and achieving state-of-the-art results in cybersecurity and other domains.
Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by the limited supply of existing verifiable data, on which improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting dataset, GooseReason-Cyber, sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
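The mask-and-distract recipe described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: in the real pipeline the masked step and the distractors would be produced by LLM prompts, whereas here they are passed in directly, and all function names are hypothetical.

```python
import random

def synthesize_mcqa(text: str, masked_step: str, distractors: list[str],
                    seed: int = 0) -> dict:
    """Build a verifiable multiple-choice fill-in-the-middle task from a
    source passage. `masked_step` and `distractors` stand in for LLM output."""
    assert masked_step in text, "masked step must appear in the source text"
    # Hide the key reasoning step in the passage.
    question = text.replace(masked_step, "[MASKED]", 1)
    # Shuffle the correct step among the distractors.
    options = distractors + [masked_step]
    random.Random(seed).shuffle(options)
    letters = "ABCDEFGH"[:len(options)]
    answer = letters[options.index(masked_step)]
    return {"question": question,
            "choices": dict(zip(letters, options)),
            "answer": answer}

def reward(prediction: str, task: dict) -> float:
    # RLVR-style verifiable reward: exact match on the chosen option letter.
    return 1.0 if prediction.strip().upper() == task["answer"] else 0.0
```

Because the answer key is known by construction, any such task is verifiable with a trivial exact-match reward, which is what makes otherwise unverifiable text usable for RL.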
Community
TL;DR: We introduce Golden Goose 🦢, a simple method that synthesizes unlimited RLVR tasks from unverifiable internet text by constructing multiple-choice fill-in-the-middle problems. This enables the use of reasoning-rich unverifiable corpora typically excluded from prior RLVR data curation (e.g., science textbooks), allowing RL to scale beyond the data saturation of existing RLVR datasets and achieving new SoTA results on 1.5B and 4B-Instruct models. In a real-world deployment to cybersecurity, where no prior RLVR data exists, Golden Goose synthesizes RLVR tasks from raw FineWeb scrapes, yielding a new SoTA 4B cybersecurity LLM that surpasses a 7B domain-specialized model.
Turning pre-training datasets into the best RLVR tasks! 🎉
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DARL: Encouraging Diverse Answers for General Reasoning without Verifiers (2026)
- ReasonTabQA: A Comprehensive Benchmark for Table Question Answering from Real World Industrial Scenarios (2026)
- WildSci: Advancing Scientific Reasoning from In-the-Wild Literature (2026)
- P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering (2026)
- Aletheia: What Makes RLVR For Code Verifiers Tick? (2026)
- JustRL: Scaling a 1.5B LLM with a Simple RL Recipe (2025)
- DARC: Decoupled Asymmetric Reasoning Curriculum for LLM Evolution (2026)
Thank you for sharing this research! Are there any plans to upload the cybersecurity LLM you created? I would be very interested in testing it out.
This reminds me a bit of the Reinforcement Pre-Training (RPT) approach.
Nice to see Cyber get some love!
Amazing work
Hi authors, thank you for the great work on Golden Goose!
We've been trying to reproduce the results reported in the paper, but have run into some issues and would appreciate your guidance.
Specifically:
- When we subsample the GooseReason dataset and run RL training, we are unable to reproduce the reported performance gains. We understand that subsampling may affect results, but the gap is larger than expected.
- More importantly, we constructed our own GooseReason-Cyber dataset following the procedure described in the paper and trained models on it, but the resulting cybersecurity benchmark performance falls significantly short of what is reported.
We suspect these discrepancies may stem from details that are not fully specified in the paper. Could you share more information on the following?
RL Training Hyperparameters:
- GRPO clip ratio (epsilon)
- Batch size and mini-batch size
- Number of rollouts per prompt
- Learning rate and scheduler details
- KL penalty coefficient (if used)
- Any other training tricks or implementation details that were critical to achieving the reported results
GooseReason-Cyber Data Construction:
- How did you sample or filter cybersecurity-related documents from FineWeb? Were there specific keywords, classifiers, or domain filters used?
- What was the filtering or quality control pipeline after generating the MCQA tasks (e.g., deduplication, difficulty filtering, distractor quality checks)?
We believe sharing these details would be very helpful for the community to build upon your work. Thank you in advance!
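On the FineWeb-filtering question above: one common approach for this kind of domain subsetting is a cheap keyword (or classifier) pass over documents. The sketch below is purely illustrative of that generic technique; the keyword list and threshold are assumptions, not the paper's actual pipeline.

```python
# Illustrative keyword list; the paper's actual filter (if keyword-based
# at all) is unspecified.
CYBER_KEYWORDS = [
    "vulnerability", "exploit", "malware", "phishing",
    "cve-", "ransomware", "penetration testing", "firewall",
]

def is_cyber_doc(text: str, min_hits: int = 3) -> bool:
    """Crude domain filter: keep a document if enough distinct
    cybersecurity terms appear in it."""
    lowered = text.lower()
    hits = sum(1 for kw in CYBER_KEYWORDS if kw in lowered)
    return hits >= min_hits
```

In practice such a keyword pass is usually only a first stage, followed by a trained domain classifier and the MCQA-level quality checks (deduplication, difficulty and distractor filtering) the comment asks about.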
Get this paper in your agent:

    hf papers read 2601.22975

Don't have the latest CLI?

    curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 2
Datasets citing this paper: 1
Spaces citing this paper: 0