II-Researcher

Intelligent Internet
Fri 28 Mar 2025

Intelligent Internet (II) was founded with the mission to tackle key challenges in today’s AI landscape — including the concentration of power, high barriers to entry, the dangers of misalignment, and opacity. Central to fulfilling our vision of an open, distributed, and inclusive ecosystem is democratizing advanced AI capabilities. 

Introducing II-Researcher: a new open-source framework designed to help build research agents. By providing an open tool capable of tackling complex inquiries, II-Researcher directly counters the trend toward proprietary systems and lowers barriers for innovators worldwide. II-Researcher embodies our commitment to Universal AI (UAI), empowering users across the network—from individual researchers using Edge Nodes to institutions leveraging Specialized AI—with sophisticated tools to navigate, analyze, and contribute to the global knowledge commons.

To better contextualize the contributions of II-Researcher, it is essential to examine the current landscape of advanced AI agents, particularly those addressing complex research tasks. OpenAI's Deep Research, for instance, demonstrates both what such agents can accomplish and the scale of the challenge involved in building them.

OpenAI’s Deep Research: An Autonomous Research Agent

OpenAI’s Deep Research is a new AI agent for in-depth, multi-step research tasks [1]. Unlike standard chatbots that answer in one go, Deep Research iteratively browses the web, reads multiple sources, and compiles information into structured outputs. It is powered by a version of OpenAI’s upcoming o3 large language model, which has been specially optimized for reasoning and web-based analysis. Deep Research behaves like a digital research analyst: it can plan a research strategy and gather data from various websites and documents to produce a synthesized report complete with citations and reasoning steps. For example, when tasked with comparing market trends or summarizing academic literature, Deep Research will navigate through relevant articles, refine its queries, and document its findings with references, much as a human researcher would.

How Does It Work? While OpenAI has not open-sourced Deep Research, their blog post provides an overview of how the system works [1]. It uses an advanced GPT-based model (the “o3” model) combined with tool-use capabilities such as web browsing and even a Python interpreter for data analysis. The system was trained with reinforcement learning on real browsing and reasoning tasks, essentially learning by trial and error how to follow multi-step research plans that yield correct answers. This training helps the model break down complex queries into sub-tasks, find relevant information, and verify facts.

Figure 1: Overview of OpenAI’s Deep Research Methodology

Performance: Early benchmarks show that OpenAI’s Deep Research is a leap ahead of previous AI models in research tasks. On the challenging Humanity’s Last Exam [14] test – a broad set of expert-level questions across 100+ subjects – Deep Research achieved 26.6% accuracy, far surpassing earlier GPT-based models (for context, OpenAI’s older o1 model scored 9.1%). This indicates that it can handle complex, cross-disciplinary questions much better than standard LLMs. It also sets a new state-of-the-art on the GAIA benchmark for AI agents, leading with strong performance on multi-step reasoning tasks.

Open-Source Ecosystem

The emergence of closed-source systems like Deep Research has spurred development within the open-source community to create analogous agents, offering transparency and customization. Many open-source projects have emerged, showcasing the community's excitement and innovation around Deep Research implementations. Notable efforts include LangChain's open_deep_research [4], dzhng's deep-research [5], btahir's open-deep-research [17], nickscamara's open-deep-research [18], mshumer's OpenDeepResearcher [19], and u14app's deep-research [20].

Among those, the Hugging Face and Jina AI projects have garnered the most attention thanks to their rapid development, strong community engagement, and promising performance on benchmarks:

  • Hugging Face – Open DeepResearch (Smol Agents): This project rapidly developed an open-source agent framework, smolagents [2]. A key innovation is the CodeAct [6] approach, which represents agent plans as executable code rather than declarative structures (e.g., JSON); this reportedly reduced reasoning steps by approximately 30%, enhancing efficiency (see the sketch after this list). Equipped with basic web browsing and text reading tools, their agent achieved 55.15% accuracy on the GAIA validation set, compared to approximately 67% for OpenAI's closed system.
  • Jina AI’s Deep Research Clone: Jina AI developed a replica leveraging their expertise in search workflows [3]. While specific implementation details are limited, it likely utilizes components like Jina's DocArray, open LLMs (e.g., Llama 2 [13]), search providers (e.g., Brave, DuckDuckGo), and Jina's reader models to execute a search-read-synthesize loop.
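To make the CodeAct distinction concrete, here is a hypothetical illustration of the same plan expressed as a declarative JSON action versus a CodeAct-style executable action. The tool names (web_search, visit_page) and structures are assumptions for illustration, not smolagents' actual interface:

```python
# Hypothetical illustration: one "search then read" step expressed two ways.

# Declarative (JSON) tool call: one action per step, parsed and dispatched
# by the agent framework.
json_action = {
    "tool": "web_search",
    "arguments": {"query": "GAIA benchmark multi-step reasoning"},
}

# CodeAct-style action: the agent emits executable Python, so several tool
# calls plus control flow fit into a single reasoning step.
code_action = """
notes = []
for result in web_search("GAIA benchmark multi-step reasoning")[:3]:
    page = visit_page(result["url"])
    if "GAIA" in page:
        notes.append(page[:500])
"""
```

Because one code action can replace several JSON actions, the agent needs fewer round trips through the model, which is where the reported efficiency gain comes from.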
Conceptual Frameworks for Deep Research

As analyzed by Lee [7], the term "Deep Research" lacks a formal definition in 2025, much as terms like Retrieval-Augmented Generation (RAG) once suffered from similar ambiguity. Lee defines it as a report-generation system that uses LLMs for iterative search, analysis, and synthesis. Implementations are broadly categorized as:

  1. Untrained Approaches:
    • Directed Acyclic Graph (DAG): Decomposes queries, retrieves information for each part, and synthesizes a report (e.g., GPT-Researcher [8]).
    • State Machine (SM): Extends DAGs by incorporating self-reflection, enabling LLMs to review and refine outputs dynamically. Both the Hugging Face and Jina AI efforts align with this untrained direction.
  2. Trained Approaches:
    • End-to-End Systems: Holistically optimized systems (e.g., Stanford's STORM [9][10]) producing high-quality, structured outputs.
    • Large Reasoning Models: Models specifically fine-tuned for reasoning tasks, enhancing performance in report generation (e.g., OpenAI's Deep Research [1]).

II-Researcher’s Purpose:

The capabilities exhibited by agents such as Deep Research illustrate the potential of AI in complex information analysis. However, our commitment is to democratize such power. Motivated by both the potential demonstrated and the limitations of proprietary approaches, we developed II-Researcher as an open-source framework specifically designed to tackle complex inquiries, counter the trend towards proprietary systems, and lower barriers for innovators worldwide. This framework provides users with sophisticated tools to build capable research agents, directly furthering our mission to foster an open, distributed, and inclusive AI ecosystem.

In the following sections, we dive into the methodology, components, and examples.

II-Researcher Implementation:

Our II-Researcher framework investigates two methodologies for autonomous research: an untrained state-machine pipeline and a prompt-guided approach built on large reasoning models.

Approach 1: Untrained State Machine Pipeline

Figure 2: The first approach uses a state-machine pipeline.

This approach implements a state machine architecture, facilitating iterative refinement and dynamic state transitions that mirror human research processes. The pipeline consists of the following stages (a condensed code sketch follows the list):

  1. Query Evaluation: The initial user query is analyzed to determine key requirements for the answer, specifically:
  • Freshness: Whether the response requires the most up-to-date information available.
  • Plurality: Whether the response should incorporate multiple perspectives or sources.
  • Completeness: Whether the response demands a thorough and detailed explanation or solution.

This approach ensures that the system tailors its processing to deliver an accurate and relevant answer based on the query's requirements.

  2. Web Search Query Generation and Execution: For each sub-query, the system autonomously generates search queries and gathers relevant resources through search providers such as Tavily or SerpAPI.
  3. Information Retrieval and Compression: Retrieved web content undergoes a compression process using LLM-based and embedding-based methods to efficiently extract and consolidate essential information, facts, and data (detailed in the Context Compression section below).
  4. Self-Reflection and Critique Cycle: The agent critically evaluates the synthesized information, reflecting on knowledge gaps, inconsistencies, or inaccuracies and determining necessary follow-up actions. This step can trigger further searches or refinement cycles.
  5. State Management: A memory module maintains state, including accumulated knowledge, generated queries, action logs, and records of failed attempts, informing future decisions.
  6. Final Report Generation: Once the draft answer passes evaluation against the requirements identified during query analysis and all information has been verified, the final structured report, complete with detailed citations and references, is compiled.
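As promised above, here is a minimal sketch of the pipeline as a state machine, with the query-evaluation stage shown as an LLM call returning structured JSON. The model names, prompt wording, stopping conditions, and the helper functions (generate_search_queries, execute_search, fetch, compress_page, reflect, write_report) are illustrative assumptions, not II-Researcher's actual implementation:

```python
# A condensed, illustrative sketch of the state-machine pipeline.
# generate_search_queries, execute_search, fetch, compress_page, reflect,
# and write_report are hypothetical helpers left undefined for brevity.
import json
from dataclasses import dataclass
from enum import Enum, auto

from openai import OpenAI

client = OpenAI()

@dataclass
class QueryRequirements:
    freshness: bool      # needs the most up-to-date information?
    plurality: bool      # needs multiple perspectives or sources?
    completeness: bool   # needs a thorough, detailed answer?

def evaluate_query(query: str) -> QueryRequirements:
    """Stage 1: classify what the answer to `query` must satisfy."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model; any capable chat model works
        messages=[{"role": "user", "content":
                   "Analyze the question below and reply with a JSON object "
                   'with boolean fields "freshness", "plurality", and '
                   '"completeness".\n\nQuestion: ' + query}],
        response_format={"type": "json_object"},
    )
    return QueryRequirements(**json.loads(resp.choices[0].message.content))

class State(Enum):
    SEARCH = auto()
    COMPRESS = auto()
    REFLECT = auto()
    REPORT = auto()

def run_pipeline(query: str, max_steps: int = 20) -> str:
    # Stage 5: the memory dict is the state carried across transitions.
    memory = {"requirements": evaluate_query(query), "knowledge": [],
              "queries": [], "actions": [], "failures": []}
    state = State.SEARCH
    for _ in range(max_steps):
        if state is State.SEARCH:        # Stage 2: generate and run searches
            urls = execute_search(generate_search_queries(query, memory))
            memory["raw_pages"] = [fetch(u) for u in urls]
            state = State.COMPRESS
        elif state is State.COMPRESS:    # Stage 3: compress retrieved content
            memory["knowledge"] += [compress_page(p, query)
                                    for p in memory["raw_pages"]]
            state = State.REFLECT
        elif state is State.REFLECT:     # Stage 4: critique; loop back on gaps
            critique = reflect(query, memory)
            state = State.SEARCH if critique.has_gaps else State.REPORT
        else:                            # Stage 6: write the final report
            return write_report(query, memory)
    return write_report(query, memory)   # fall back after the step budget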

Key Components

Context Compression

Feeding all retrieved content into an LLM is suboptimal in terms of both quality and cost, and the content may exceed the model's maximum context length. However, over-compressing or omitting necessary information can significantly degrade pipeline performance. To address this, we adopt a hybrid approach that combines an LLM with an embedding model for compression.

  1. Text Segmentation: We use a simple approach, splitting each document into paragraphs or fixed sentence chunks.
  2. Embedding-Based Filtering:
  • Each chunk is converted into a vector representation using an embedding model (we use text-embedding-3-large from OpenAI); the embedded text includes the website title, and relevance is scored against the current query/question.
  • Only chunks that pass a predefined relevance threshold are retained.
  3. LLM Generative Retrieval:
  • Each chunk is numbered and fed into the LLM.
  • The LLM is prompted to identify and rank relevant chunks in order of decreasing relevance.
  • This method, known as generative retrieval, differs from paraphrasing or rewriting: it saves a significant number of output tokens, which are more expensive than input tokens.

Figure 3: Generative Retrieval

  4. Final Selection & Compression:
  • We combine results from both retrieval methods.
  • Based on a predefined word limit for each website, we use a voting mechanism and the ranking order to compress the text effectively.

By following this approach, we ensure that the most relevant content is retained while staying within each website's maximum word limit.
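The sketch below puts steps 2-4 together, assuming step 1 has already produced the list of chunks. The threshold, word limit, model choices, and the reciprocal-rank voting used to merge the two rankings are illustrative assumptions, not II-Researcher's exact values:

```python
# A minimal sketch of the hybrid compression step, using OpenAI's embeddings
# and chat APIs. All numeric values and the ranking-merge heuristic are
# illustrative assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large",
                                    input=texts)
    return np.array([d.embedding for d in resp.data])

def compress_page(title: str, chunks: list[str], query: str,
                  threshold: float = 0.4, word_limit: int = 800) -> str:
    # Step 2: embedding-based filtering. Each chunk is embedded with the
    # page title prepended and scored against the embedded query.
    chunk_vecs = embed([f"{title}\n{c}" for c in chunks])
    query_vec = embed([query])[0]
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec))
    emb_ranked = [int(i) for i in np.argsort(-sims) if sims[i] >= threshold]

    # Step 3: generative retrieval. Number the chunks and ask the LLM for
    # the indices of relevant chunks, most relevant first; returning indices
    # rather than rewritten text saves expensive output tokens.
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content":
                   f"Query: {query}\n\nChunks:\n{numbered}\n\n"
                   "Reply with the indices of the relevant chunks, most "
                   "relevant first, as comma-separated integers."}],
    )
    llm_ranked = [int(t) for t in resp.choices[0].message.content.split(",")
                  if t.strip().isdigit() and int(t) < len(chunks)]

    # Step 4: final selection. Vote across both rankings (reciprocal rank),
    # then fill the page's word budget in descending score order.
    scores = {i: 0.0 for i in range(len(chunks))}
    for ranking in (emb_ranked, llm_ranked):
        for rank, i in enumerate(ranking):
            scores[i] += 1.0 / (rank + 1)
    selected, words = [], 0
    for i in sorted(scores, key=scores.get, reverse=True):
        if scores[i] == 0.0:
            break
        n = len(chunks[i].split())
        if words + n > word_limit:
            continue  # skip chunks that would blow the budget
        selected.append(chunks[i])
        words += n
    return "\n\n".join(selected)
```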

Self-Reflection Mechanism

Relying solely on the model to choose the right action is not always ideal, especially for non-reasoning models. To enhance decision-making, we therefore introduce an additional reflection step after the agent visits a website.

After each visit, the agent evaluates:

  • Knowledge gained: What new information was obtained from this visit?
  • Previous Actions: What steps have already been taken?
  • Gaps & next steps: What information is still missing or requires deeper investigation?

This self-reflection process serves as context for the model, helping it make more informed decisions in subsequent steps based on newly acquired information.
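A minimal sketch of this reflection step is shown below; the prompt wording, the model choice, and the plain-text return format are illustrative assumptions rather than II-Researcher's actual prompt:

```python
# A minimal sketch of the post-visit reflection step; prompt wording and
# model choice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

REFLECTION_PROMPT = """You are researching: {query}

You just visited a page. Its extracted content:
{page_content}

Actions taken so far:
{action_log}

Answer in three short sections:
1. Knowledge gained: what new information did this visit provide?
2. Previous actions: what has already been tried?
3. Gaps & next steps: what is still missing or needs deeper investigation?"""

def reflect_on_visit(query: str, page_content: str,
                     action_log: list[str]) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model
        messages=[{"role": "user", "content": REFLECTION_PROMPT.format(
            query=query,
            page_content=page_content,
            action_log="\n".join(action_log) or "(none)",
        )}],
    )
    # The reflection is appended to the agent's context so the next action
    # is chosen with the gaps made explicit.
    return resp.choices[0].message.content
```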

Figure 4: Retrieval and Self-Reflection Process

Examples

Example 1:

Figure 5: Prompt for Pipeline Question 1

Example 2:

Figure 6: Prompt for Pipeline Question 2

Approach 2: Reasoning Model

In addition to our State Machine (SM) approach, we explored an alternative method that uses a reasoning model through prompting rather than fine-tuning. This approach builds on the strengths of large reasoning models like DeepSeek R1 or QwQ [15][16], enhancing their ability to process complex queries while maintaining logical consistency and factual accuracy.

Unlike the traditional State Machine approach—which decomposes tasks into discrete components executed sequentially via a predefined pipeline—this method avoids rigid compartmentalization. Pipeline-based designs can sometimes result in overly generalized logic and insufficient contextual awareness across steps. In contrast, our observations of open-ended reasoning models like DeepSeek R1 and QwQ reveal a distinct pattern: these models internally perform actions and reflections as part of a dynamic, self-directed reasoning process, as illustrated below:

Figure 7: DeepSeek R1 reasoning process

Inspired by this emergent behavior, we propose an architecture where tool usage and reflection are embedded within the model's internal thought process (e.g., within <think>...</think> blocks). Given the user input, we defined a system prompt (Appendix A) including tool definitions, allowing the model to use them arbitrarily. For II-Researcher, we provided two primary tools: WebSearch and Visit. We adapted the CodeAct [6] style for robustness, instructing the model to generate a Python snippet to execute these tools.

However, a key challenge with this reasoning-focused approach is that models like DeepSeek R1 may have limitations in strictly adhering to instructions compared to more general-purpose models (e.g., GPT-4o), especially within their internal thought process (the <think>...</think> blocks), as they are not explicitly trained for instruction-following. We observed that with longer contexts (e.g., after 2-3 tool uses), the model could struggle to recall all initial instructions from the system prompt. To address this, we introduce a prefilled thinking process.

Prefilling the Thinking Process

To improve the model’s performance, we prepend a structured thinking template to the input, which the model uses as a starting point for its reasoning. This prefilled process acts as a scaffold, directing the model to follow a systematic approach while leveraging available tools. The detailed prefill prompt is provided in Appendix A, and a minimal sketch of the idea follows.
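In the sketch below, the template wording is an illustrative stand-in for the actual prefill prompt in Appendix A, and we assume a serving endpoint that can continue a trailing assistant message:

```python
# A minimal sketch of prefilling the thinking process. The template wording
# is an illustrative stand-in for the actual prefill prompt (Appendix A).
PREFILL_TEMPLATE = """<think>
I need to answer: {question}
My plan: (1) note what I already know and what is missing; (2) call
web_search for the missing facts; (3) visit the most promising pages with
page_visit; (4) cross-check the findings before writing the answer.
First, what does the question actually require?
"""

def build_messages(system_prompt: str, question: str) -> list[dict]:
    return [
        # The system prompt carries the tool definitions (see Appendix A).
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
        # The assistant turn starts pre-written, so the model continues
        # from inside the <think> block instead of starting cold.
        {"role": "assistant",
         "content": PREFILL_TEMPLATE.format(question=question)},
    ]
```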

How It Works

Figure 8: Reasoning Model with Tool Execution Flow

The model is primed to approach each query methodically by prefilling the thinking process with these instructions. For instance, when faced with a question, it begins by reflecting on the task within a <think> block, identifying gaps in its knowledge, and invoking tools like web_search or page_visit to gather data. Once sufficient information is collected and validated, the model finalizes its reasoning and presents a detailed, evidence-based response after the </think> tag.
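A condensed sketch of this generate-execute loop follows. Here generate, web_search, and page_visit are hypothetical stand-ins (the real system prompt and tools are described in Appendix A), and the convention that a snippet stores its output in a `result` variable is also an assumption made for this illustration:

```python
# A condensed sketch of the generate-execute loop for CodeAct-style tool use.
import re

CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_reasoning_agent(messages: list[dict], max_rounds: int = 10) -> str:
    for _ in range(max_rounds):
        reply = generate(messages)  # hypothetical call to the reasoning model
        match = CODE_BLOCK.search(reply)
        if match is None:
            # No tool call: the text after </think> is the final answer.
            return reply
        # Execute the CodeAct snippet with only the two tools in scope.
        scope = {"web_search": web_search, "page_visit": page_visit}
        exec(match.group(1), scope)  # sketch only; sandbox this in practice
        observation = scope.get("result", "")
        # Feed the observation back so the model reasons over the new facts.
        messages += [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": f"Observation:\n{observation}"},
        ]
    return "Stopped after reaching the tool-use budget."
```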

The underlying implementations for the WebSearch and Page Visit tools remain consistent with those used in the State Machine approach.

Examples

Example 1:

Figure 9: Reasoning Question 1 Prompt

Example 2:

Figure 10: Reasoning Question 2 Prompt

Benchmarking Results

1. GAIA Dataset

To evaluate the performance of our implementations, we used the GAIA [11] benchmark's validation set, focusing on questions requiring only Web Browsers and Search Engines. This subset includes 44 questions across three difficulty levels. Performance on the GAIA validation subset is shown in Figure 11.

Figure 11: Benchmark Results on GAIA Validation Subset (WebTool only)

The GAIA evaluation details can be found here.

2. FRAMES Test Set

FRAMES [12] is a comprehensive evaluation dataset designed to assess Retrieval-Augmented Generation (RAG) systems regarding factuality, retrieval accuracy, and reasoning capabilities. Figure 12 presents the results of the FRAMES test set.

The dataset is available here.

Details of the FRAMES results can be found here.

Figure 12: Benchmark Results on FRAMES Test Set

Conclusion

Our experiments demonstrate that utilizing large reasoning models via advanced prompting techniques can significantly enhance the accuracy of autonomous AI research agents on complex, multi-step tasks. This surpasses the performance of our implemented untrained state machine approach and baseline systems on the selected GAIA subset. While the prompt-guided approach requires careful prompt engineering, the gains in reasoning accuracy and report quality suggest its viability for demanding research applications.

Future work will focus on developing hybrid architectures that integrate the efficiency and structured process of the State Machine approach with the deep reasoning capabilities elicited from models like DeepSeek R1 through optimized prompting. The goal is to create robust and performant AI research agents applicable to a broader range of complex inquiries.

References

[1] OpenAI. (2025, February 2). Introducing deep research. OpenAI. https://openai.com/index/introducing-deep-research/

[2] Roucher, A., Villanova del Moral, A., Wolf, T., von Werra, L., & Kaunismäki, E. (2025). smolagents: A smol library to build great agentic systems [Computer software]. GitHub. https://github.com/huggingface/smolagents

[3] Xiao, H. (2025, February 25). A practical guide to implementing DeepSearch/DeepResearch. Jina AI. https://jina.ai/news/a-practical-guide-to-implementing-deepsearch-deepresearch/ 

[4] Langchain-ai. (n.d.). open_deep_research [GitHub repository]. GitHub. Retrieved March 28, 2025, from https://github.com/langchain-ai/open_deep_research

[5] Zhang, D. (n.d.). deep-research [GitHub repository]. GitHub. Retrieved March 28, 2025, from https://github.com/dzhng/deep-research

[6] Wang, X., Chen, Y., Yuan, L., Zhang, Y., Li, Y., Peng, H., & Ji, H. (2024, July). Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning.

[7] Lee, H. (2025, February). The differences between Deep Research, Deep Research, and Deep Research. https://leehanchung.github.io/blogs/2025/02/26/deep-research/ 

[8] Elovic, A. (2023, July 23). gpt-researcher (Version 0.5.4) [Computer software]. https://gptr.dev

[9] Jiang, Y., et al. (2024). Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 9917–9955).

[10] Shao, Y., et al. (2024). Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 6252–6278).

[11] Mialon, G., Fourrier, C., Wolf, T., LeCun, Y., & Scialom, T. (2023, November). Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations.

[12] Krishna, S., Krishna, K., Mohananey, A., Schwarcz, S., Stambler, A., Upadhyay, S., & Faruqui, M. (2024). Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. arXiv preprint arXiv:2409.12941.

[13] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., ... & Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[14] Phan, L., Gatti, A., Han, Z., Li, N., Hu, J., Zhang, H., ... & Verbeken, B. (2025). Humanity's Last Exam. arXiv preprint arXiv:2501.14249.

[15] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., ... & He, Y. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

[16] Qwen Team. (2025, March). QwQ-32B: Embracing the power of reinforcement learning. https://qwenlm.github.io/blog/qwq-32b/

[17] btahir. (2024). open-deep-research [Computer software]. GitHub. https://github.com/btahir/open-deep-research

[18] nickscamara. (2024). open-deep-research [Computer software]. GitHub. https://github.com/nickscamara/open-deep-research

[19] mshumer. (2024). OpenDeepResearcher [Computer software]. GitHub. https://github.com/mshumer/OpenDeepResearcher

[20] u14app. (2024). deep-research [Computer software]. GitHub. https://github.com/u14app/deep-research

Appendix

A. Prompts Used for Reasoning Model

The system prompt used for the reasoning model:

The prompt used for prefilling the thinking process: