II-Thought
Tue 25 Mar 2025

We introduce II-Thought-RL-v0, our first iteration to develop a large-scale, multi-domain Reinforcement Learning (RL) dataset. By providing a high-quality, large-scale dataset on RL question-answer pairs, we aim to advance reasoning research. This foundational step will pave the way for future iterations incorporating more complex reasoning traces.
In recent months, several datasets have been introduced to advance reasoning research, including PrimeIntellect’s SYNTHETIC-1, which spans a broad range of domains and tasks; Hugging Face’s OpenR1, with 220k high-quality problems and reasoning traces; General Reasoning, with 323k diverse samples; and DeepScaler, which provides 40k high-quality math data points used to train state-of-the-art 1.5B models. These efforts have been valuable and inspiring to our work. However, significant room for improvement remains in data quality, diversity, and integrity.
Upon closer examination, we identified notable benchmark contamination in several datasets. For instance, nearly 20 problems from Math-500 are duplicated in OpenR1, and at least 100 problems appear in both Math-500 and DeepScaler. Additionally, as General Reasoning is a crowd-sourced initiative, it contains a considerable amount of low-quality data that warrants further curation. Finally, there remains a gap in domain diversity and the availability of large-scale reinforcement learning datasets for reasoning tasks.
To address these issues, our approach is grounded in four core principles:
- Scale: We introduce a dataset of over 340k high-quality reasoning problems spanning multiple domains.
- Quality: We ensure high accuracy and relevance across all entries through rigorous filtering and refinement.
- Verifiability: Each sample is selectively chosen to support automated evaluation and reinforcement learning applications.
- Integrity: We implement strict deduplication and decontamination protocols, ensuring no overlap with proprietary benchmarks and maintaining clean, trustworthy training data.
This post will share insights into our dataset collection process, key findings, curation approach, preliminary analyses, and work-in-progress for future iterations.
1. Background
DeepSeek recently introduced R1, a reasoning model that quickly emerged as a transformative force in advancing cognitive reasoning within LLMs. R1 uses Group Relative Policy Optimization (GRPO) [5], paired with verifiable rewards and explicit reasoning prompts (<think> tags that nudge the model to start analyzing before giving the final answer). This approach doesn’t just tweak outputs; it encourages the model to think step by step, refining its logic and uncovering insights. The result is complete reasoning chains, fewer superficial answers, and rare moments of spontaneous clarity that mirror human cognition.
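To make the mechanics concrete, below is a minimal sketch of the group-relative advantage computation at the core of GRPO [5]: rewards for a group of responses sampled from the same prompt are normalized against the group’s own mean and standard deviation. The function and variable names are ours, for illustration only.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-response rewards within one sampled group, as in GRPO [5]."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four completions sampled for one prompt, scored by a binary verifiable reward.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers receive positive advantages
```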
These intricate reasoning traces can then be distilled into smaller models by fine-tuning on them. While DeepSeek has released several powerful distilled models based on the Qwen family [20], replicating their capabilities remains challenging for the open-source community.
High-quality reasoning stems from well-crafted question-answer pairs, with complex problems yielding deeper insight. Effective reinforcement learning must rely on such high-quality data. This approach enabled us to build a robust, multi-domain RL dataset and laid the groundwork for refined reasoning traces in our next iteration of supervised fine-tuning.
2. Proprietary RL Dataset
The success of DeepSeek R1 is primarily attributed to its collection of high-quality, verifiable question-answer pairs suitable for GRPO training. Here, “verifiable” refers to responses that can be deterministically confirmed or systematically validated, ensuring an objective and precise basis for reinforcement.
Tulu-3 [22] from AllenAI, one of the first models to achieve strong performance with Reinforcement Learning with Verifiable Reward (RLVR), emphasizes that creating data for RLVR requires prompts paired with binary verifier functions—constructing a dataset where a verifier function accompanies each input x.
For example, math problems with deterministic results—where the final answer is formatted explicitly (e.g., enclosed in a box)—are inherently verifiable. Similarly, coding tasks can be validated by running predefined test cases, producing unambiguous outcomes that confirm the correctness of a response.
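As an illustration, a verifier for boxed math answers can be little more than extracting the final \boxed{...} expression and comparing it to the ground truth. The sketch below is a simplified stand-in (helper names are ours); in practice we rely on Math-Verify, described in Section 6.

```python
import re

def extract_boxed(text):
    """Return the last \\boxed{...} answer in a model response (illustrative, not exhaustive)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def math_reward(response, gold):
    """Binary verifiable reward: 1.0 only if the boxed answer matches the ground truth."""
    answer = extract_boxed(response)
    return 1.0 if answer is not None and answer == gold.strip() else 0.0

print(math_reward("... therefore the result is \\boxed{42}.", "42"))  # 1.0
```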

Other more complex cases require the verifier function to be an LLM judge, for example, in the case of the medical domain:

This new paradigm and the success of R1 have shifted the focus of many datasets towards Reinforcement Learning (RL). For instance, NuminaMath [6][7] and Code Contest [10], initially developed for Supervised Fine-Tuning (SFT), are now leveraged for Reinforcement Learning. Similarly, Medical-o1 from HuatuoGPT-o1 [14] and Dria-Agent-a [23] are curated with only verifiable medical problems. Increasingly, datasets like DeepScaler [8], CuratedThoughts [21], and Big-Math [12] are specifically selected and designed to train verifiable RL models.
However, challenges persist. Few datasets address critical issues like data quality, reward hacking, scale, and domain diversity. Common problems include:
- Low-quality question-answer pairs: Ambiguous phrasing, broken formats, or irrelevant information.
- Reward hacking vulnerabilities: Simplistic question types like True/False or multiple-choice can be easily gamed [26].
- Limited scale and domain: Many datasets consist of only a few thousand samples within a narrow domain.
Given that reinforcement learning for advanced reasoning remains an emerging field, the availability of high-quality, large-scale datasets is paramount. To support and accelerate progress in this domain, we have compiled a proprietary collection of verifiable datasets encompassing a range of disciplines. Key characteristics include:
- Objective Verification: Encompassing mathematical, logical, and computational tasks, each response is subject to systematic validation.
- Diverse Domains: The dataset spans fully verifiable and semi-verifiable challenges, promoting adaptability across varying levels of complexity.
- LLM Evaluator: For semi-verifiable tasks, integrated LLM-based evaluation ensures alignment between rewards and reasoning processes.
Dataset | Domain | Source | Amount |
---|---|---|---|
Mix-Math | Math | | 906,052 |
NuminaMath-1.5 | Math | AI-MO/NuminaMath-1.5 [7] | 859,494 |
DeepScaler | Math | agentica-org/DeepScaleR-Preview-Dataset [8] | 40,315 |
CAMEL | Science | Camel-AI/{physics},{chemistry},{biology} [9] | 60,000 |
OpenTextBook | Science | Intelligent-Internet/Text-book-RL | 12,738 |
Code Contest | Code | deepmind/code_contest [10] | 13,328 |
Apps & Taco | Code | PrimeIntellect/SYNTHETIC-1 [11] | 106,071 |
Real World SWE | Code | PrimeIntellect/real-world-swe-problems [11] | 69,841 |
Python Codeforces | Code | matrixstudio/codeforces-python-submissions [13] | 621,356 |
Open-ICPC | Code | Intelligent-Internet/ICPC | 2,020 |
medical-o1-verifiable-problem | Medical | FreedomIntelligence/medical-o1-verifiable-problem [14] | 40,644 |
riddle_sense | Riddle | ink-usc/riddle_sense [15] | 3,510 |
Total | | | 2,734,369 |
In addition to publicly available datasets, our collection incorporates carefully selected problems from ICPC competitions and the IMO shortlist and verifiable science textbook questions crawled from the internet, further enriching the dataset’s complexity and variety.
3. The Data Dilemma: Quality Over Quantity
If reinforcement learning is the engine driving deep reasoning, data is the fuel, and right now the tank is full of impurities. After applying the criteria introduced by Big-Math [12] and further inspecting our collection, we found that the following problems plagued the datasets:
- Ambiguity: Vague tasks make it difficult to determine a single correct answer, making it impossible to assign proper rewards.

- Language or Formatting Issues: Questions with grammatical errors, poor formatting, or improper notation hinder clarity for both models and human annotators and are thus excluded.

- Irrelevant or Unnecessary Content: Questions combining unrelated topics cause confusion and detract from the learning objective. These are systematically filtered out.

- Redundant and Contaminated Examples: Upon reviewing the dataset, we identified numerous duplicated training examples and substantial overlap with prominent benchmarks such as GPQA, AIME 2024, AIME 2025, MATH500, and LiveCodeBench [16], [17], [21]. Duplicate data can lead to inefficient training and introduce biases, while benchmark contamination compromises the reliability and credibility of the model's performance.
Dataset | MATH500 | AIME2024 | AIME2025 | Gaokao-En | Olympiad Bench | AMC |
---|---|---|---|---|---|---|
AI-MO/NuminaMath-CoT | 8104/1 | 0 | 5 | 792/1 | 491/2 | 47 |
AI-MO/NuminaMath-1.5 | 6154/3 | 48/15 | 10/0 | 601/0 | 854/7 | 68 |
agentica-org/DeepScaleR-Preview-Dataset | 627/1 | 0 | 2 | 75/1 | 77 | 4 |
Intelligent-Internet/ICPC | 0 | 0 | 0 | 0 | 0 | 0 |
PrimeIntellect/SYNTHETIC-1 | 69 | 0 | 0 | 4 | 119 | 0 |
open-r1/OpenR1-Math-220k | 30 | 0 | 2 | 17 | 260/3 | 19 |
(Examples of contaminated training items overlapping with Humanity's Last Exam, Olympiad-Bench, AIME-2024/2025, and HuggingFaceH4/MATH-500.)
- Shallow Examples: Structured tasks such as multiple-choice, True/False, and Yes/No questions let models “hack” rewards without genuinely understanding the problem.
- Questions from Textbook-RL and Medical-o1:


- Questions Containing Answers: Questions explicitly including the correct answer within their text encourage memorization rather than genuine reasoning.

- Questions Containing Multiple Parts: Questions with multiple parts are difficult to verify and may degrade the training process.

- Dependence on External Visual Elements: Questions requiring external visuals like diagrams or charts without sufficient textual context are unsuitable for our text-focused reasoning dataset.

- Unsolvable problems

- Open-Ended questions

Recognizing these challenges, we’ve curated and introduced a new reinforcement learning dataset designed to push LLMs toward deep reasoning. Drawing from multiple proprietary sources, this large-scale curated collection prioritizes quality and relevance.
4. Data Processing Pipeline
The complete final cleaning pipeline is illustrated in the graph below:

Step 1: Deduplication and decontamination
To address these challenges, we first applied exact-match deduplication and then semantic-similarity filtering to systematically remove redundant entries, following the same pipeline as OpenThoughts [18], ensuring every task in the dataset offers unique and meaningful learning opportunities. We then meticulously decontaminated the dataset by filtering out examples closely resembling those in established benchmarks, using a 75% similarity threshold. Although this stringent measure might occasionally remove valuable content, it substantially reduces contamination risks, safeguarding the dataset's originality, integrity, and overall quality.
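A simplified sketch of this step is shown below, assuming a generic sentence-embedding model; the encoder choice and helper names are ours, not the exact production pipeline.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

def decontaminate(problems, benchmark, threshold=0.75):
    """Drop exact duplicates, then drop problems whose embedding similarity to any
    benchmark item reaches the threshold (0.75, as in our pipeline)."""
    # Step 1: exact-match deduplication (after light whitespace/case normalization).
    seen, unique = set(), []
    for p in problems:
        key = " ".join(p.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(p)

    # Step 2: semantic decontamination against the benchmark set.
    model = SentenceTransformer("all-MiniLM-L6-v2")   # assumption: any sentence encoder works here
    prob_emb = model.encode(unique, normalize_embeddings=True)
    bench_emb = model.encode(benchmark, normalize_embeddings=True)
    sims = prob_emb @ bench_emb.T                     # cosine similarity via normalized dot products
    keep = sims.max(axis=1) < threshold
    return [p for p, k in zip(unique, keep) if k]
```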
Step 2: Quality Filtering
We implement a rigorous, multi-stage verification process designed specifically for datasets like NuminaMath-1.5 and DeepScaler [7][8]. We aim to systematically remove incomplete, low-quality, and unverifiable problems, significantly improving dataset quality and reliability.
We introduce a two-stage filtering approach that leverages the complementary strengths of Gemini 2.0 Flash [16] and Qwen-2.5-32B-Instruct [20]:
First stage: Initial Quality Screening
- Model: Gemini 2.0 Flash
- Purpose: Quickly identify and eliminate questions exhibiting unclear phrasing, incomplete information, or structural issues
- Prompt: We follow the same prompt format as Magpie [19] to filter out low-quality samples, keeping only questions rated good or excellent

Second Stage: Comprehensive Quality Assurance
- Model: Qwen-2.5-32B-Instruct
- Purpose: Applies rigorous verification standards based on carefully defined quality criteria, ensuring only well-structured, clear, and verifiable questions are retained.


This meticulous two-stage verification ensures the final dataset is clean and highly reliable, explicitly optimized for reasoning distillation in smaller-scale models.
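For concreteness, a schematic version of the second-stage judge might look like the following. The rubric wording and client setup are placeholders rather than our exact prompts, assuming an OpenAI-compatible endpoint serving Qwen-2.5-32B-Instruct.

```python
from openai import OpenAI  # assumption: any OpenAI-compatible endpoint serving the judge model

RUBRIC = (
    "Rate the following question for use in RL training. Reply with exactly one word: "
    "excellent, good, average, or poor. A good question is self-contained, unambiguous, "
    "well-formatted, and has a single verifiable answer.\n\nQuestion:\n{question}"
)

def passes_quality_filter(client, question, model="Qwen2.5-32B-Instruct"):
    """Keep only questions the judge rates 'good' or 'excellent' (illustrative rubric)."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": RUBRIC.format(question=question)}],
        temperature=0.0,
    )
    verdict = reply.choices[0].message.content.strip().lower()
    return verdict in {"good", "excellent"}
```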
Final Stage: Rule-based curation
To conclude the data curation process, we apply specific length constraints and verification standards:
- Problem Statements:
- To ensure clarity and manageability, we remove any problems whose statements exceed 4096 characters or are shorter than 100 characters.
- Remove proof-based questions using regex, following CuratedThoughts [21]
- Filter out unparsable answers using the Math-Verify library from Hugging Face [24]
- Semi-Verifiable Problems: Answers to these questions are limited to 100 characters. Responses exceeding this length are filtered out to streamline the verification process performed by the LLM Judge.
- Coding Problems: We exclude coding problems with no test cases.

- Looking at the distribution, many problems contain more than 100 test cases, up to 1,226. We retain all original test cases in our release but recommend selecting only a representative subset to optimize efficiency during reinforcement learning training.
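A minimal sketch of these rule-based filters, with the thresholds listed above hard-coded, is shown below; the function and regex are illustrative simplifications, not our exact implementation.

```python
import re

# Assumption: a simplified stand-in for the proof-detection regex from CuratedThoughts [21].
PROOF_PATTERN = re.compile(r"\b(prove|show that|proof)\b", re.IGNORECASE)

def keep_problem(problem, answer, test_cases=None, is_code=False):
    """Rule-based curation sketch mirroring the constraints above."""
    if not (100 <= len(problem) <= 4096):   # statement length bounds
        return False
    if PROOF_PATTERN.search(problem):       # drop proof-based questions
        return False
    if is_code:
        return bool(test_cases)             # coding problems need at least one test case
    return len(answer) <= 100               # semi-verifiable answers must stay short for the LLM judge
```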
5. Data Analysis and Clustering
After completing the curation process, the following chart illustrates the comparison between the original dataset and the refined dataset in terms of data quantity:

To provide a clearer understanding of the dataset's composition and characteristics, we utilize both a treemap and a t-SNE plot to visualize data distribution:



An analysis of the final dataset reveals the distribution of data sources and domains. Notably, there exists a scarcity of data in several key domains—namely science, engineering, finance, and medicine—despite their potential to yield numerous complex, verifiable question-answer pairs. These fields, characterized by intricate problem-solving and multidisciplinary applications, are underrepresented in the current collection.
To further study the dataset, a t-SNE (t-Distributed Stochastic Neighbor Embedding) visualization of embeddings by domain, generated through proportional sampling, provides a detailed representation of the data distribution across subject areas. This dimensionality reduction technique projects high-dimensional data into a two-dimensional space, preserving local structures and revealing domain clustering patterns.
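For readers who want to reproduce a similar view, a minimal sketch is shown below; the embedding model, perplexity, and plotting details are our own choices rather than the exact settings used for the figure.

```python
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_domain_tsne(problems, domains):
    """Embed problems, project to 2-D with t-SNE, and color the points by domain."""
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(problems)  # assumption: any sentence encoder
    xy = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(emb)
    for domain in sorted(set(domains)):
        idx = [i for i, d in enumerate(domains) if d == domain]
        plt.scatter(xy[idx, 0], xy[idx, 1], s=3, label=domain)
    plt.legend(markerscale=3)
    plt.show()
```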


The visualization reveals that the “math” domain forms the largest and most centralized cluster, indicating a concentration of mathematical content. Adjacent clusters like “code” suggest a substantial representation of programming-related data, including software engineering tasks.

The word cloud reveals some common problems and patterns in the math dataset. A large number of terms such as “number,” “sum,” “value,” “integer,” “product,” and “determine” suggest a significant focus on numerical reasoning and algebraic problem-solving. Programming-related terms like “code,” “instruction,” “return,” “function,” and “test case” are also prominently featured, indicating a robust representation of coding and algorithmic tasks, including software engineering challenges (e.g., “swe_code”). Geometric and analytical terms such as “triangle,” “radius,” “area,” and “length” also represent geometric problems in the set.
Still, the word cloud exhibits a notable absence of terms associated with underrepresented domains such as science, engineering, finance, and medicine. This imbalance underscores an opportunity for enhancement. We aim to address this gap in future iterations by expanding the dataset to encompass a broader and more diverse range of domains, ensuring comprehensive coverage across multidisciplinary areas and enriching the dataset with tasks that reflect real-world complexity and variability. By incorporating additional data from these critical fields, we intend to strengthen the dataset’s utility as a resource for training AI systems capable of robust, domain-agnostic reasoning.
6. Reward Verifiers
We introduce a verification framework to evaluate LLM-generated responses against ground truth, supporting our RL dataset.
Our approach includes specialized verifiers for different tasks:
- Mathematical Problems: We use Math-Verify [24] to evaluate mathematical expressions with high precision. It processes LaTeX-formatted expressions, numerical values, and various equivalent forms of a number (e.g., 2, 4/2, √4). This flexibility ensures mathematically correct answers are recognized, regardless of format. The verifier compares LLM-generated results with expected values and outputs a binary correctness score: 0 for incorrect and 1 for correct answers.
- Algorithmic Coding Problems: We execute LLM-generated code using Sandbox Fusion [25], a secure and efficient code execution sandbox. By validating outputs against predefined test cases, we compute a correctness score as the fraction of passed test cases.
- Software Engineering Problems: We extend the SWE Verifier [26] to assess software engineering tasks. It utilizes difflib.SequenceMatcher to compare code patches and measure how accurately LLM-modified code aligns with expected changes, outputting a similarity score ranging from 0 to 1.
- Semi-Verifiable Problems: For domains such as textbooks, medical queries, and general knowledge tasks, we implement an LLM-as-Judge approach. Here, a language model evaluates LLM-generated responses against the ground truth, ensuring alignment through contextual and semantic verification.
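Below is a condensed sketch of two of these verifiers: the Math-Verify check (based on the library's documented parse/verify interface) and the patch-similarity score via difflib.SequenceMatcher. The sandboxed code execution and the LLM judge are omitted here.

```python
from difflib import SequenceMatcher
from math_verify import parse, verify  # Math-Verify [24]

def math_score(prediction, gold):
    """Binary reward: Math-Verify treats equivalent forms (e.g. 2 and 4/2) as matching."""
    return 1.0 if verify(parse(gold), parse(prediction)) else 0.0

def swe_score(model_patch, gold_patch):
    """Similarity reward in [0, 1]: how closely a generated patch matches the reference patch."""
    return SequenceMatcher(None, model_patch, gold_patch).ratio()

print(math_score("$\\frac{4}{2}$", "$2$"))           # 1.0, equivalent expressions are accepted
print(round(swe_score("-a\n+b\n", "-a\n+c\n"), 2))   # partial credit for near-identical patches
```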
7. Vietnamese Entrance Exam Benchmark
Following our rigorous decontamination process, it is worth taking a step back and reassessing the integrity of current proprietary math benchmarks. Many of today's models may have encountered parts of these benchmarks during pre-training and post-training, potentially inflating their performance. To more thoroughly test the reasoning capabilities of current models, including their performance in other domains, we introduce the Vietnamese Entrance Exam: a benchmark comprising 400 questions from the Vietnamese university entrance exam. As reasoning models become more capable, specialized benchmarks at an appropriate difficulty level and localization become increasingly important, and this benchmark also illustrates the process of creating one.
These problems were carefully curated from high-quality TeX sources or manually segmented and extracted using OCR from Gemini-2.0-Flash. Initially written in Vietnamese in multiple-choice format, all questions were subsequently translated into English. We then applied a reformulation process inspired by BigMath [12] to convert each multiple-choice question into a single, verifiable numerical answer. Using a set of predefined criteria, each multiple-choice question was first transformed into a question-answer pair. Another LLM judge then evaluated the reformulated version to verify the validity of the transformation. Finally, the judge used its judgment to rewrite the question and answer to meet its criteria, resulting in our final question and answer pair. Below are examples demonstrating the outcomes of our reformulation process.
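The sketch below illustrates the reformulation loop in a simplified form, assuming a generic chat-completions client. The prompts are illustrative paraphrases of our criteria, not the exact production prompts, and rejected rewrites are simply dropped here rather than rewritten by the judge.

```python
from openai import OpenAI  # assumption: any OpenAI-compatible chat endpoint and judge model

REFORMULATE = (
    "Rewrite this multiple-choice question as an open question with a single, verifiable "
    "numerical answer. Return only the rewritten question followed by the answer.\n\n{mcq}"
)
JUDGE = (
    "Does the rewritten question preserve the meaning of the original and admit exactly one "
    "verifiable numerical answer? Reply yes or no.\n\nOriginal:\n{mcq}\n\nRewritten:\n{rewrite}"
)

def ask(client, model, prompt):
    out = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])
    return out.choices[0].message.content

def reformulate(client, mcq, model):
    """Rewrite an MCQ into a verifiable QA pair, keeping it only if a judge accepts the rewrite."""
    rewrite = ask(client, model, REFORMULATE.format(mcq=mcq))
    verdict = ask(client, model, JUDGE.format(mcq=mcq, rewrite=rewrite))
    return rewrite if verdict.strip().lower().startswith("yes") else None
```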

Here is our evaluation of several popular reasoning models across different subjects. Notably, all models show considerable difficulty when tackling chemistry problems.
Model | O1 | O1-mini | O3-mini | DeepSeek-R1 | DeepSeek-R1-Distill-Qwen-32B | Qwen/QwQ-32B |
---|---|---|---|---|---|---|
Chemistry benchmark | 21.27 | 22.34 | 18.08 | 30.85 | 19.14 | 26.59 |
Physics benchmark | 52.63 | 56.84 | 60 | 74.73 | 57.89 | 73.68 |
Math benchmark | 50.2 | 65.02 | 72.48 | 80.24 | 72.43 | 81.89 |
The full benchmark dataset is available here: Vietnamese Entrance Exam.
8. Initial Experiments and Evaluations
To analyze the effectiveness of our work, we randomly select around 53,000 samples from our curated dataset and prepare them for reinforcement learning (RL) training. We use deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B as the base model and continue training it with GRPO.
For evaluation, we used EvalScope to evaluate the models and report pass accuracy across all benchmarks. The number of responses generated per problem is as follows:
- 64 responses:
AMC23, AIME24, AIME25
- 4 responses:
Minerva-Math, Math-Gaokao-2023-English, Math500, Olympiad-Bench, Vietnamese-Entrance-Math-Exam
- 1 response:
IFEval
Sampling Configs:
- Max context length: 16384
- Temperature: 0.6
- Top p: 0.95
- Top k: 40
- Seed: 42
Additionally, for Live-Code-Bench, we leverage QWQ-Evaluation to reproduce results using a max context length of 32768, averaging over 8 runs.
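A minimal sketch of how pass accuracy is computed from multiple sampled responses per problem is shown below; this is our own illustrative helper, not EvalScope's API.

```python
def pass_accuracy(per_problem_correct):
    """Average pass rate: mean over problems of the fraction of sampled responses verified correct.

    per_problem_correct[i][j] is True if the j-th sampled response to problem i passed verification
    (e.g., 64 samples for AIME, 4 for MATH-500, 1 for IFEval).
    """
    rates = [sum(c) / len(c) for c in per_problem_correct if c]
    return sum(rates) / len(rates)

# Toy example: two problems, four samples each.
print(pass_accuracy([[True, True, False, True], [False, False, True, False]]))  # 0.5
```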
The experiment shows that our model improves performance on all benchmarks.
Benchmark | DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B-Instruct | II-Thought-1.5B-Preview |
---|---|---|---|
AMC23 | 69.69 | 54.26 | 73.83 |
AIME24 | 29.43 | 10.73 | 30.45 |
AIME25 | 23.39 | 8.8 | 27.59 |
Olympiad Bench | 43.15 | 36.07 | 45.57 |
Math500 | 83.6 | 79.4 | 85.8 |
Math Gaokao 2023 English | 72.99 | 64.59 | 76.59 |
Minerva Math | 27.57 | 20.42 | 30.25 |
Vietnamese Entrance Math Exam | 40.32 | 38.63 | 42.67 |
LiveCodeBench | 16.66 | 18.59 | 19.55 |
IFEval | 44.24 | 40.89 | 44.84 |
Average | 45.10 | 37.45 | 44.84 |

Following the previous study [27], we also evaluate the thinking efficiency of deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B and II-RL-v1-1.5B across the following dimensions:
- Reasoning Tokens: The number of tokens generated in the model’s long chain of thought during reasoning. For R1-type reasoning models, this metric represents the number of tokens between the <think> and </think> markers.
- First Correct Tokens: The number of tokens from the start of the model’s reasoning process to the first position that can be recognized as the correct answer.
- Reflection Tokens: The number of tokens from the position of the first correct answer to the end of reasoning.
- Num Thought: This metric indicates the number of different thought paths the model generates during reasoning. It is calculated by counting the occurrences of generated marker words (e.g., “alternatively,” “but wait,” “let me reconsider”).
- Token Efficiency: The ratio of first correct tokens to the total number of reasoning tokens.
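A rough sketch of how these metrics can be computed from a model response is shown below; whitespace tokenization stands in for the real tokenizer, and the character position of the first correct answer is assumed to be located by a separate check.

```python
import re

THOUGHT_MARKERS = ("alternatively", "but wait", "let me reconsider")  # marker words named above

def thinking_stats(response, first_correct_char, tokenize=str.split):
    """Illustrative computation of the efficiency metrics described above."""
    m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    reasoning = m.group(1) if m else ""
    reasoning_tokens = len(tokenize(reasoning))
    first_correct_tokens = len(tokenize(reasoning[:first_correct_char]))
    return {
        "reasoning_tokens": reasoning_tokens,
        "first_correct_tokens": first_correct_tokens,
        "reflection_tokens": reasoning_tokens - first_correct_tokens,
        "num_thought": sum(reasoning.lower().count(w) for w in THOUGHT_MARKERS),
        "token_efficiency": first_correct_tokens / max(reasoning_tokens, 1),
    }
```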

We observe that our model requires only half the tokens to solve problems compared to the distilled model. The Distilled-R1 model generates more thought tokens, requiring additional reasoning traces before finalizing the correct answer path.
9. Conclusion and Future Research
High-quality data isn’t a luxury; it’s a necessity. Only through high-quality data can we reward sound reasoning traces. By pairing advanced techniques like GRPO with meticulously curated datasets, we’re not just teaching AI to respond better—we’re teaching it to think better with the signal it receives.
We introduce II-Thought, a dataset designed with rigorous standards for verifiability and diversity that excels in mathematical and coding tasks. Still, it reveals a scarcity of science, engineering, finance, and medicine data, as highlighted by the t-SNE visualization and word cloud analysis. Future iterations will address this imbalance by expanding into these underrepresented domains, enhancing the dataset’s multidisciplinary scope. This effort aims to train AI systems that reason across diverse real-world contexts, bridging the gap between pattern recognition and cognitive reasoning. We will then introduce complex, high-quality reasoning traces to this RL dataset, accelerating reasoning distillation from R1 to smaller models.
References
[1] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.
[2] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35, 27730-27744.
[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
[4] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., ... & He, Y. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
[5] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., ... & Guo, D. (2024). Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
[6] Li, J., Beeching, E., Tunstall, L., Lipkin, B., Soletskyi, R., Costa Huang, S., Rasul, K., Yu, L., Jiang, A., Shen, Z., Qin, Z., Dong, B., Zhou, L., Fleureau, Y., Lample, G., & Polu, S. (2024). NuminaMath. Numina. Hugging Face repository. https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf
[7] Li, J., Beeching, E., Tunstall, L., Lipkin, B., Soletskyi, R., Costa Huang, S., Rasul, K., Yu, L., Jiang, A., Shen, Z., Qin, Z., Dong, B., Zhou, L., Fleureau, Y., Lample, G., & Polu, S. (2024). NuminaMath. Numina. Hugging Face repository. https://huggingface.co/AI-MO/NuminaMath-1.5
[8] Luo, M., Tan, S., Wong, J., Shi, X., Tang, W., Roongta, M., Cai, C., Luo, J., Zhang, T., Li, E., Popa, R. A., & Stoica, I. (2025). DeepScaleR: Surpassing O1-Preview with a 1.5B model by scaling RL. Notion Blog. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2
[9] Li, G., Hammoud, H., Itani, H., Khizbullin, D., & Ghanem, B. (2023). Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36, 51991-52008.
[10] Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., ... & Vinyals, O. (2022). Competition-level code generation with alphacode. Science, 378(6624), 1092-1097.
[11] Mattern, J., Jaghouar, S., Basra, M., Straube, J., Di Ferrante, M., Gabriel, F., Ong, J. M., Weisser, V., & Hagemann, J. (2025). SYNTHETIC-1: Two million collaboratively generated reasoning traces from Deepseek-R1. https://www.primeintellect.ai/blog/synthetic-1-release
[12] Albalak, A., Phung, D., Lile, N., Rafailov, R., Gandhi, K., Castricato, L., ... & Haber, N. (2025). Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models. arXiv preprint arXiv:2502.17387.
[13] MatrixStudio. (n.d.). Codeforces Python Submissions. Hugging Face. Retrieved March 12, 2025, from https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions
[14] Chen, J., Cai, Z., Ji, K., Wang, X., Liu, W., Wang, R., Hou, J., & Wang, B. (2024). HuatuoGPT-o1, towards medical complex reasoning with LLMs. arXiv. https://arxiv.org/abs/2412.18925
[15] Lin, B. Y., Wu, Z., Yang, Y., Lee, D.-H., & Ren, X. (2021). RiddleSense: Reasoning about riddle questions featuring linguistic creativity and commonsense knowledge. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
[16] Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., ... & Batsaikhan, B. O. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
[17] Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., ... & Bowman, S. R. (2024). Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling.
[18] OpenThoughts Team. (2025, January). Open Thoughts. Retrieved from https://open-thoughts.ai
[19] Xu, Z., Jiang, F., Niu, L., Deng, Y., Poovendran, R., Choi, Y., & Lin, B. Y. (2024). Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing. arXiv preprint arXiv:2406.08464.
[20] Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., ... & Qiu, Z. (2024). Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
[21] Hochlehnert, A., Bhatnagar, H., Udandarao, V., Prabhu, A., & Bethge, M. (2025). CuratedThoughts: Data curation for RL training datasets. Retrieved from https://huggingface.co/datasets/bethgelab/CuratedThoughts
[22] Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., ... & Hajishirzi, H. (2024). Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.
[23] andthattoo. (n.d.). Verifiable Pythonic function calling. Retrieved March 12, 2025, from https://huggingface.co/blog/andthattoo/dria-agent-a
[24] Math-Verify from https://github.com/huggingface/Math-Verify/
[25] SandboxFusion from https://github.com/bytedance/SandboxFusion
[26] Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., ... & Yang, Z. (2025). Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599.
[27] Wang, Y., Liu, Q., Xu, J., Liang, T., Chen, X., He, Z., ... & Yu, D. (2025). Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs. arXiv preprint arXiv:2501.18585.
[28] Yue, A. S., Madaan, L., Moskovitz, T., Strouse, D. J., & Singh, A. K. (2024). HARP: A challenging human-annotated math reasoning benchmark. arXiv preprint arXiv:2412.08819.
[29] Gao, B., Song, F., Yang, Z., Cai, Z., Miao, Y., Dong, Q., ... & Chang, B. (2024). Omni-math: A universal olympiad level mathematic benchmark for large language models. arXiv preprint arXiv:2410.07985.