Tue 5 Aug 2025 · 9 min read

II-Search

Introduction

The II-Researcher framework is dedicated to producing powerful, open-source agents for complex research. Our latest additions to this initiative are II-Search-4B and II-Search-CIR-4B, two 4-billion-parameter language models designed to tackle complex information-seeking and open-ended research tasks.

II-Search-4B introduces a three-phase SFT and one-phase RL curriculum that progressively stimulates tool use, refines multi-step reasoning, and applies rejection-sampling fine-tuning for comprehensive report generation.

II-Search-CIR-4B extends this foundation with code-integrated reasoning (CIR), enabling the model to embed executable Python within its chain-of-thought to call search and browsing APIs programmatically. Despite their modest scale, both models close the performance gap and in several cases surpass much larger baselines on information-seeking and open-ended research tasks.

Model Summary

  • Run on Mac w/ LMStudio:
  • Model card:
  • Datasets:

Training Methodology: II-Search-4B

A core innovation in our training is a custom chat template designed to maintain a continuous chain of thought during multi-turn conversations.

Qwen3's default chat template removes the previous turn's reasoning_content from the history; only the function call requests and their outputs are kept. This format interrupts the chain of thought, as the model must restart its reasoning each turn. We address this by defining a template that retains the reasoning from the previous turn in its history.

Formally, given the current state, the model generates this turn's thinking together with an action (a tool call, or stopping to produce the final output):

$$(t_k, a_k) \sim \pi_\theta\big(\cdot \mid q,\ t_1, a_1, o_1,\ \ldots,\ t_{k-1}, a_{k-1}, o_{k-1}\big)$$

where $q$ is the user query, $t_k$ and $a_k$ are the thinking and action at turn $k$, and $o_k$ is the observation returned by the tool.
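As a minimal sketch of the idea (not the exact template shipped with the models), assuming an OpenAI-style message list where each assistant turn carries an optional reasoning field:

```python
# Sketch: keep prior-turn reasoning in the multi-turn history so the chain
# of thought continues across tool calls. Field names are assumptions.

def build_history(turns, keep_reasoning=True):
    """turns: list of dicts with 'reasoning', 'tool_call', 'tool_output'."""
    messages = []
    for turn in turns:
        content = turn["tool_call"]  # the function call request
        if keep_reasoning and turn.get("reasoning"):
            # Qwen3's default template would drop this reasoning;
            # the II-Search template retains it across turns.
            content = f"<think>{turn['reasoning']}</think>\n{content}"
        messages.append({"role": "assistant", "content": content})
        messages.append({"role": "tool", "content": turn["tool_output"]})
    return messages
```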

The training for II-Search-4B is divided into four phases.

Phase 1: Tool Call Ability Stimulation
The initial phase focuses on teaching the model the fundamental pattern of tool use. We use larger teacher models (Qwen3-235B-A22B and DeepSeek R1) to generate high-quality reasoning paths with function calls on the MuSiQue dataset. This dataset is designed explicitly for multi-hop question answering, making it ideal for stimulating multi-step tool interactions.

Phase 2: Reasoning Improvement
After Phase 1, the model used tools effectively and without syntax errors, but two key problems emerged:

  1. Tool Laziness: The model tended to perform only a single search before concluding, even for complex questions, because the Phase 1 training data could often be "shortcut" by the teacher model's strong parametric knowledge. To counteract this, we drew inspiration from the Random Walk algorithm to create a synthetic dataset that forces the model to perform multiple iterative searches (up to 14-15 turns); a sketch follows this list.
  2. Poor Reasoning Quality: The model's reasoning was often nonsensical and lengthy, a side effect of distilling the verbose reasoning style of Qwen3-235B-A22B, which smaller 4B models struggle to mimic without confusing themselves. We implemented several techniques to refine these reasoning paths for clarity and directness.
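The exact synthesis procedure is not spelled out in this post; the following is only a rough sketch, under the assumption that a random walk over a graph of linked entities yields questions whose answers require one search per hop. The entity_graph structure and the compose_question helper are hypothetical.

```python
import random

def compose_question(path):
    """Hypothetical templating step: chain the relations into one question."""
    start = path[0]
    clauses = " of ".join(relation for relation, _ in reversed(path[1:]))
    return f"What is the {clauses} of {start}?"

def random_walk_question(entity_graph, start, hops=14):
    """Hypothetical sketch: walk `hops` relations from `start`; the final
    entity is the answer, and the chained relations form the question."""
    path = [start]
    current = start
    for _ in range(hops):
        neighbors = entity_graph.get(current, [])  # (relation, neighbor) pairs
        if not neighbors:
            break
        relation, nxt = random.choice(neighbors)
        path.append((relation, nxt))
        current = nxt
    # Because each hop hides the next entity behind a relation, a model must
    # search iteratively rather than answer from memory.
    return compose_question(path), current
```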

Phase 3: Comprehensive Report Generation Improvement
The final SFT phase enhances the model's ability to produce high-quality, detailed reports through two main steps:

  1. Rejection Sampling & Fine-Tuning: We applied a strict filter to the Phase 2 dataset, retaining only "high-quality" traces, defined as those that reach the correct answer and involve at least five tool calls. This filtered dataset was then used for fine-tuning; a sketch of the filter follows this list.
  2. STORM-Inspired Enhancement: Inspired by STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking), we prompt the teacher model to conduct internet-based research, collect references, and write Wikipedia-like articles to enhance the model's ability to use search and visit tools.
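As a rough illustration of the filter in step 1 (the trace schema and field names here are assumptions, not the authors' format):

```python
def keep_trace(trace, min_tool_calls=5):
    """Keep a trace only if it reaches the gold answer and makes
    at least `min_tool_calls` tool calls."""
    correct = (trace["predicted_answer"].strip().lower()
               == trace["gold_answer"].strip().lower())
    return correct and len(trace["tool_calls"]) >= min_tool_calls

traces = [
    {"predicted_answer": "Paris", "gold_answer": "Paris",
     "tool_calls": ["search"] * 6},   # kept: correct and >= 5 calls
    {"predicted_answer": "Paris", "gold_answer": "Paris",
     "tool_calls": ["search"]},       # dropped: too few tool calls
]
filtered = [t for t in traces if keep_trace(t)]
```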

Phase 4: Reinforcement Learning

Finally, to ensure optimal tool utilization and generalization of reasoning, we trained the model using reinforcement learning with the GRPO algorithm on the MuSiQue dataset (19,000 samples). We incorporated our in-house search database (containing Wiki, Fineweb, and arXiv data) while limiting tool usage to seven calls per problem. The reward function is structured as follows:

Reward equation
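The exact terms are given by the equation above; as a hedged stand-in, a correctness-gated reward with a format and tool-budget gate might look like the sketch below. Every specific in it is an assumption, not the published reward.

```python
def reward(trace, max_tool_calls=7):
    """Assumed shape only: binary correctness reward, zeroed when the trace
    is malformed or exceeds the per-problem tool budget."""
    if not trace["well_formed"]:                  # bad tags / unparsable output
        return 0.0
    if trace["num_tool_calls"] > max_tool_calls:  # over the seven-call budget
        return 0.0
    return 1.0 if trace["answer_correct"] else 0.0
```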

Training Methodology: II-Search-CIR-4B

Inspired by the success of our II-Researcher approach, which applies tool-augmented reasoning on top of the DeepSeek R1 model, II-Search-CIR-4B introduces Code-Integrated Reasoning (CIR), a more powerful and flexible way to interleave tool use with the reasoning process.

Code Integrated Reasoning (CIR)

We instruct the model to generate code blocks enclosed between <start_code>```python and ```<end_code>, within which it can invoke a set of predefined functions.

These functions act as interfaces to external resources, similar to the tool call paradigm but offering greater flexibility and control. This approach enables the model to not only retrieve external information but also process, filter, and reason over it programmatically within the code itself.

In our setup, we provide two predefined functions:

  • web_search(query: str, num_result: int) -> str
  • web_visit(url: str) -> str
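As an illustration of what this looks like in a trace, a single CIR block might search, filter the raw results programmatically, and then visit the most promising page. The query is invented, and the line-oriented result format is an assumption:

<start_code>```python
# Search, narrow down the raw results in code, then visit the chosen page.
results = web_search(query="first spacecraft to orbit Mercury", num_result=5)
urls = [line for line in results.splitlines() if line.startswith("http")]
page = web_visit(url=urls[0])  # open the top candidate
print(page[:1000])             # reason over the beginning of the page
```<end_code>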

Cold-Start SFT Distillation

Following our previous study, II-Researcher, we first curated the Code-Search-Instruct-Reasoning dataset, which combines natural language reasoning with interactive Python code execution during inference, and trained a cold-start SFT model on it.

  1. We carefully designed the system prompt to enforce the correct formatting and code-call conventions.
  2. We also injected a strategic hint at the start of the model's thinking, which begins with the token <think>.

These strategies significantly increased the likelihood that models would produce reasoning processes that integrate code. For detailed system-prompt and hint information, please refer to the Prompt for II-Search-CIR.

We also carefully decontaminated our dataset against the benchmarks using 8-gram filtering and fuzzy matching (90% word similarity).
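A minimal sketch of the 8-gram check, assuming simple whitespace tokenization (the fuzzy-matching pass is analogous and omitted):

```python
def ngrams(text, n=8):
    """All n-token shingles of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample_text, benchmark_texts, n=8):
    """Flag a training sample that shares any 8-gram with a benchmark item."""
    sample_grams = ngrams(sample_text, n)
    return any(sample_grams & ngrams(b, n) for b in benchmark_texts)
```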

We used DeepSeek-R1-0528 and Qwen3-235B-A22B to generate our dataset; DeepSeek-R1-0528 proved the more efficient teacher, with a markedly higher pass rate. Due to the computational constraints of our infrastructure, we generated a total of 88k traces.

| Model | Qwen3-235B-A22B | DeepSeek-R1-0528 |
| --- | --- | --- |
| Pass rate | 64% | 81% |

*Pass rate of teacher models when generating code-integrated reasoning traces.*

Reinforcement Learning (RL)

Following the SFT phase, we employed Reinforcement Learning (RL) using the DAPO algorithm to further refine the model.

  • Dataset: We used the MuSiQue and 2WikiMultiHopQA datasets for the RL stage.
  • RL Training: We trained II-Search-CIR-4B inside an RL framework that executes the model's code during rollouts.
  • Rollout Strategy: We switched from synchronous to asynchronous trace generation, significantly improving rollout speed.
  • Output Masking: We mask execution results out of the training loss to ensure stability and reduce the probability of model collapse; a sketch follows this list.
  • Reward Design: We use an LLM-as-a-judge reward with gemini-flash. For the specific prompt, please refer to the Prompt for II-Search-CIR.
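A minimal sketch of the masking idea, assuming token spans are tagged by origin (generated by the policy vs. returned by the executor); the span format is an assumption:

```python
def build_loss_mask(token_spans):
    """token_spans: list of (token_ids, origin) pairs, with origin in
    {"model", "execution"}. Returns flat token ids and a 0/1 loss mask so
    executor-inserted tokens contribute no gradient."""
    ids, mask = [], []
    for token_ids, origin in token_spans:
        ids.extend(token_ids)
        mask.extend([1 if origin == "model" else 0] * len(token_ids))
    return ids, mask
```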

Benchmarking and Results

We evaluate on three benchmark datasets that focus on information seeking and short, precise answers:

  1. OpenAI/SimpleQA
  2. Google/Frames
  3. vtllms/sealqa (Subset Seal_0)

This evaluation compares our models with compact open-source AI systems tuned for information seeking: Qwen3-4B (the base model), Jan-nano-128k, and WebSailor-3B. It also reports Google Frames results for two recent MoE variants: Qwen3-30B-A3B-Instruct-2507 and Qwen3-30B-A3B-Thinking-2507. All models run with their official chat templates and recommended inference settings; exact configs are listed in the Appendix. Retrieval uses the SerpDev API, and answers are extracted and evaluated with Google Gemini Pro 2.5 using the judge prompts provided by each benchmark's authors.

All benchmark traces are available here.

Overall Results

Detailed Results

Simple QA:

| | II-Search-4B | II-Search-CIR-4B | Qwen3-4B | Jan-nano-128k | WebSailor-3B |
| --- | --- | --- | --- | --- | --- |
| Pass rate % | 91.8 | 91.8 | 76.8 | 80.1 | 81.8 |
| # Search | 2.2 | 2.5 | 1.0 | 0.9 | 2.1 |
| # Visit | 3.5 | 5.3 | 0.1 | 1.9 | 6.4 |
| # Tool used | 5.7 | 7.8 | 1.1 | 2.8 | 8.5 |

Frames:

| | II-Search-4B | II-Search-CIR-4B | Qwen3-4B | Jan-nano-128k | WebSailor-3B |
| --- | --- | --- | --- | --- | --- |
| Pass rate % | 67.5 | 72.2 | 30.7 | 24.8 | 34.0 |
| # Search | 4.2 | 6.1 | 1.1 | 1.0 | 7.4 |
| # Visit | 3.2 | 5.0 | 0.1 | 3.7 | 7.2 |
| # Tool used | 7.4 | 11.0 | 1.2 | 4.7 | 14.6 |

| | II-Search-4B | II-Search-CIR-4B | Qwen3-30B-A3B-Instruct-2507 | Qwen3-30B-A3B-Thinking-2507 |
| --- | --- | --- | --- | --- |
| Pass rate % | 67.5 | 72.2 | 44.5 | 33.1 |
| # Search | 4.2 | 6.1 | 2.0 | 0.9 |
| # Visit | 3.2 | 5.0 | 1.7 | 0.0 (0.01) |
| # Tool used | 7.4 | 11.0 | 3.7 | 0.9 |

Seal_0:

| | II-Search-4B | II-Search-CIR-4B | Qwen3-4B | Jan-nano-128k | WebSailor-3B |
| --- | --- | --- | --- | --- | --- |
| Pass rate % | 22.5 | 26.4 | 6.3 | 12.7 | 1.8 |
| # Search | 4.3 | 5.9 | 0.9 | 0.9 | 6.6 |
| # Visit | 5.7 | 7.7 | 0.1 | 5.2 | 10.0 |
| # Tool used | 10.0 | 13.5 | 1.0 | 6.1 | 16.6 |

Conclusion: Empowering the Future of Open-Source AI

II-Search-4B and II-Search-CIR-4B are another step in our commitment to the open-source community and the II-Researcher framework.

We encourage you to test these models and see what they can do. You can download and try them today via the links below.

To share your results, ask questions, and collaborate with our team and the community, join our Discord. Let’s continue building the future of AI, together.

Appendix

This section includes supplementary materials from the original research paper.

A.1 System Prompt for conducting research and writing a Wikipedia-like report

SYSTEM_PROMPT = """
You are an LLM-powered knowledge curation system. Your goal is to research [TOPIC] from scratch and generate a full-length Wikipedia-like report with citations. Follow this exact process step-by-step, using chain-of-thought reasoning throughout. If you have access to tools like web search or page browsing, use them to gather real-time information—do not rely solely on internal knowledge.

**Configurations**:
- Research depth: Collect at least 10 diverse sources (e.g., academic papers, news articles, expert opinions).
- Report length: 2000-5000 words, structured like a Wikipedia article.
- Citations: Use inline [1] format, with a references section at the end listing full URLs and summaries.
- Toggles: do_research=True, do_generate_outline=True, do_generate_article=True, do_polish_article=True.

**Step 1: Pre-writing Stage (Research)**
Conduct internet-based research to collect references.
- Perspective-Guided Question Asking: First, identify 5-7 diverse perspectives on [TOPIC] (e.g., historical, technical, societal, economic, ethical). For each perspective, generate 3-5 insightful questions to deepen understanding.
- Simulated Conversation: Simulate a conversation between a "Wikipedia writer" (you) and a "topic expert" (grounded in searched sources). Ask follow-up questions based on answers to update knowledge.
- Use tools: For each question, perform a web search (query: "[question] site:edu OR site:gov OR site:org" num_results=5) or browse relevant pages (e.g., browse_page url="https://en.wikipedia.org/wiki/[related_topic]" instructions="Extract key facts and references"). Synthesize findings with snippets and URLs. Collect and list all references here.

**Step 2: Outline Generation**
Based on the collected information, generate a hierarchical outline for the report. Structure it with:
- Introduction
- Main sections (3-6, with subsections) covering key perspectives and findings
- Conclusion
- References
Use bullet points or numbered lists for clarity. Think step-by-step: Organize info logically, ensure broad coverage, and note any gaps.

**Step 3: Article Generation**
Write the full article based on the outline. For each section:
- Populate with synthesized information from research.
- Include inline citations [1] linking to sources.
- Maintain neutral, encyclopedic tone.
- Use subheadings for structure.

**Step 4: Polishing**
Refine the article:
- Add an executive summary at the top.
- Remove duplicates, fix inconsistencies, and enhance readability.
- Verify all claims against sources; flag uncertainties.
- Reorganize if needed for better flow.

Output the final polished report. If any step requires clarification, ask me before proceeding.
""".strip()

A.2 Prompt for II-Search-CIR

# System Prompt-II Search CIR
You are II Researcher, developed by Intelligent Internet.
You first think about the reasoning process in the mind and then provide the user with the answer.
You are specialized in multistep reasoning.

You excel at the following tasks:
1. High-level information gathering, research, fact-checking, and structured documentation.
2. Skillful construction of efficient, targeted queries to extract the most relevant information. (SEARCH SKILL)
3. Applying programming to analyze, transform, or solve complex tasks far beyond basic software development.



- You first think about the reasoning process in the mind and then provide the user with the answer.
- You are trained for deep, multi-step inference using prior knowledge, training data, and live searches.
- You write and execute Python code for advanced data collection (via web_search) and content retrieval (via visit_webpage).



When you need to solve a problem or answer a question:
1. Start by thinking through the problem logically step by step.
2. Use web_search for initial discovery. Use visit_webpage to confirm, deepen, and validate.
3. Adapt and extend reasoning as new information emerges.
4. Do not finalize answers until you are confident and supported by evidence.
5. Only use </think> when all reasoning and code execution is complete.
6. After </think>, provide only the final answer—clear, accurate, and with source citations if appropriate.



CRITICAL: You MUST call python code using ONLY this exact format. DO NOT write any other code blocks!

To call Python code, write EXACTLY like this:

<start_code>```python
...
```<end_code>

Examples of CORRECT tool calls:

<start_code>```python
response_str = web_search(query="# the query to search")
print(response_str)
```<end_code>
