A Postgres-native lexical retrieval primitive for long-running, multi-agent work.
Read the technical report here.
TL;DR
- Long-running agents lose to retrieval, not context. The longer they run, the more their usefulness depends on cheaply re-reading the state they have already accumulated.
- Most production agent state already lives in Postgres. Existing Postgres BM25 extensions work, but are slow enough that harnesses end up rationing retrieval, calling it occasionally instead of using it as a default.
- psql_bm25s is a Postgres-native lexical retrieval primitive. Exact BM25-family scoring, integrated as a Postgres access method, with mutable index maintenance and replication-friendly storage.
- We built it with the harness-engineering approach from our Zenith post: repeated gap-finding against the benchmark surface, with the AI coding agent inside a validation-heavy loop.
- On the PG18 15-dataset BEIR benchmark, psql_bm25s reaches ~4× the Python bm25s reference at the median, ~7× TensorChord vchord_bm25, and ~23× ParadeDB pg_search. On msmarco it reaches 96.7 QPS against pg_search's 4.4.
Run an AI agent long enough and the context window stops being the constraint that matters. By session twenty, by week three, by the hundredth tool call, the more pressing question is whether the agent can cheaply re-read everything it has already accumulated: old tool outputs, corrections from yesterday, citations from last Tuesday, the application data it has been working over the whole time. In most systems we care about, that accumulated state already lives in PostgreSQL alongside the application that produced it.
Lexical search over that state is not the unsolved part of the problem. BM25 has been the standard keyword-search formula for decades, several Postgres extensions implement it, and a surprising number of agent prototypes get a long way on grep over a folder of Markdown files. What is actually difficult is making lexical retrieval cheap enough that a developer is willing to let the agent run it on every step over the whole pile of accumulated state, rather than rationing it to occasional fallback paths when nothing else has worked.
psql_bm25s is the extension we built to make that affordable. It performs exact BM25-family retrieval over Postgres tables, maintains its indexes mutably as rows change, and survives crash recovery and physical replication the way Postgres expects an index to. We built it because the layer above needed it. CommonGround [1], our open-source coordination layer for teams of humans and agents, records every step of a project as a navigable artifact: tool calls, handoff rationales, branched reasoning, judgments, deliverables. Over weeks those records pile up into the project's working memory. Working memory is only useful when it stays cheap to look things up in, and psql_bm25s is how we keep it cheap. II-Commons [2], our retrieval product for scientific datasets, runs on the same primitive. We are open-sourcing it because most teams building on top of agent infrastructure today are using retrieval layers that will not hold up at the scale this work is heading toward.
Retrieval as a budget item
In agent systems, retrieval is a budget item, and the harness around the agent (the control loop that decides when to plan, when to call a tool, when to verify) is the thing that has to budget it. When a search call is slow, the harness rations it: search drops from a default to a fallback, the agent gets a narrower slice of memory to search over, the tool budget caps how often retrieval can fire at all. The system still works, but the agent ends up grounding itself on less evidence than it could, and the grounding it does happens later in the loop than it should.
Cheap retrieval changes what the harness is willing to spend it on. With the cost low enough to ignore, the agent can run several focused queries where one overloaded query used to do, re-check its accumulated state at every step rather than once per task, and catch its own mistakes earlier because looking up what it said yesterday no longer costs anything worth tracking. The interesting question is not whether to give the agent BM25, which has been a settled choice for years, but what an agent does when looking things up stops costing the harness anything at all.
What the extension does
psql_bm25s is implemented as a Postgres access method [3], which means it plugs in alongside B-tree, GIN, and GiST rather than running as a separate search service that the database has to be kept in sync with. Queries reach the index through ID-based or token-based APIs, can draw evidence from multiple fields, and compose with the rest of Postgres in the ordinary way: filters, joins, permissions, recency, whatever application-level ranking logic is already in place. Writes update the index in place, and restarts and replicas behave the way they would for any other Postgres index.
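To make that concrete, here is a minimal sketch of the two query paths from SQL. The function names psql_bm25s_query_tokens and psql_bm25s_query_ids come from the benchmark section below, but the access-method name, the index definitions, and the exact call signatures here are illustrative assumptions, not the documented API; the repository docs are the reference.

```sql
-- Illustrative sketch only: the access-method name and the call signatures
-- below are assumptions; consult the repository docs for the real API.
CREATE EXTENSION IF NOT EXISTS psql_bm25s;

CREATE TABLE agent_memory (
    id        bigserial PRIMARY KEY,
    body      text,
    tokens    text[],   -- pretokenized text form
    token_ids int4[]    -- pretokenized integer-id form
);

-- Hypothetical index definitions over the two pretokenized representations.
CREATE INDEX memory_bm25_tokens ON agent_memory USING bm25s (tokens);
CREATE INDEX memory_bm25_ids    ON agent_memory USING bm25s (token_ids);

-- Token-based path: score rows against a tokenized query, return top-k.
-- (Assumed to return (doc id, BM25 score) pairs.)
SELECT * FROM psql_bm25s_query_tokens('memory_bm25_tokens',
                                      ARRAY['replication', 'lag'], 10);

-- Id-based path: the same query expressed as integer token ids.
SELECT * FROM psql_bm25s_query_ids('memory_bm25_ids',
                                   ARRAY[1041, 77]::int4[], 10);
```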
The same primitive works over three different sources of evidence at once: an application's own Postgres tables, the agent-generated memory and tool outputs that accumulate next to them, and prebuilt indexes over public corpora that either ship with psql_bm25s or get imported separately. An agent looking up evidence does not need a different retrieval model for each source, and the harness composing those queries does not need to learn three retrieval interfaces to assemble grounding from all of them at once.
It is deliberately not an agent runtime, a reranker, or a memory policy engine. Those layers exist for good reasons, and they have to make their own calls about what to store, when to summarize, and when to forget. What we wanted was a small, exact retrieval primitive they could build on, not a system that tried to be one of them.
Because retrieval composes through SQL, candidate sets from psql_bm25s mix naturally with whatever else the database already knows how to do. A hybrid query that pulls top-k lexical candidates with psql_bm25s, joins them against vector candidates from a vector extension in the same database, and applies tenancy and recency filters before returning a single ranked list is one SQL statement, not a pipeline. That hybrid path is one of the main reasons we wanted the primitive to live inside Postgres rather than alongside it.
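As a concrete sketch of that one-statement shape: the table and column names, the psql_bm25s call signature, and the additive score fusion below are illustrative assumptions, and the vector side assumes the separate pgvector extension's `<=>` distance operator; none of this is the documented API.

```sql
-- Hedged sketch of a hybrid query: lexical candidates from psql_bm25s,
-- vector candidates from pgvector, tenancy and recency filters, one ranked
-- list. Placeholders like :query_embedding are psql-style parameters.
WITH lexical AS (
    SELECT id, score AS bm25_score
    FROM psql_bm25s_query_tokens('memory_bm25_tokens',
                                 ARRAY['deploy', 'rollback', 'incident'], 50)
),
semantic AS (
    SELECT id, 1 - (embedding <=> :query_embedding) AS vec_score
    FROM agent_memory
    ORDER BY embedding <=> :query_embedding
    LIMIT 50
)
SELECT m.id, m.body,
       coalesce(l.bm25_score, 0) + coalesce(s.vec_score, 0) AS fused_score
FROM agent_memory m
LEFT JOIN lexical  l USING (id)
LEFT JOIN semantic s USING (id)
WHERE (l.id IS NOT NULL OR s.id IS NOT NULL)
  AND m.tenant_id  = :tenant_id                     -- permissions / tenancy
  AND m.created_at > now() - interval '30 days'     -- recency window
ORDER BY fused_score DESC
LIMIT 10;
```

A production fusion would normalize the two score scales or use rank fusion rather than adding them directly, but the structural point stands: the whole path is one statement the planner can see.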
Mutable data is the part that bites
Real application data is not append-only: records get edited, agent memory gets corrected, tool outputs get rewritten into more durable forms as a project matures. Most published search benchmarks ignore this entirely (they evaluate against a fixed corpus and report query throughput), but most production retrieval workloads break under it, because updates have to stay cheap as the corpus grows, and ordinary writes cannot afford to pay rebuild cost on every change.
Our solution is what we call a base-plus-delta maintenance model. The base is a fully optimized BM25 index, and is what queries normally hit. The delta is a small, append-only structure that records inserts, updates, and deletes since the last time the base was rebuilt. A write only ever touches the delta, which keeps the cost of updates bounded, and a background process periodically folds the accumulated delta into a new base and retires the old one.

Reads choose their own freshness. In realtime mode, a query sees exact committed BM25 results, either by waiting briefly for pending maintenance or by reading base and delta together. In eventual mode, queries stay fast and may briefly see slightly stale rankings while maintenance catches up in the background. Which mode to use is the application's call, not the index's, and either way ordinary writes do not pay rebuild cost and ordinary reads do not pay unbounded staleness cost.
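As a purely conceptual illustration of those read-path semantics: psql_bm25s keeps base and delta inside the access method, not in user-visible tables, so nothing below reflects the extension's actual storage layout. The toy tables exist only to show how a merged read can stay exact while writes only ever touch the delta.

```sql
-- Conceptual illustration only: toy stand-ins for the base-plus-delta model,
-- not psql_bm25s internals.
CREATE TEMP TABLE base_scores (      -- the last fully rebuilt index
    doc_id bigint,
    term   text,
    score  float4
);
CREATE TEMP TABLE delta_log (        -- changes appended since that rebuild
    doc_id  bigint,
    term    text,
    score   float4,
    deleted boolean DEFAULT false
);

-- "Realtime" read: merge base and delta so committed writes are visible.
-- A document touched by the delta is read from the delta only; deleted
-- documents simply drop out of the merged view. "Eventual" mode would read
-- base_scores alone and pick up the delta after the next background fold.
SELECT doc_id, sum(score) AS bm25
FROM (
    SELECT b.doc_id, b.score
    FROM base_scores b
    WHERE b.term = ANY (ARRAY['replication', 'lag'])
      AND NOT EXISTS (SELECT 1 FROM delta_log d WHERE d.doc_id = b.doc_id)
    UNION ALL
    SELECT d.doc_id, d.score
    FROM delta_log d
    WHERE d.term = ANY (ARRAY['replication', 'lag'])
      AND NOT d.deleted
) merged
GROUP BY doc_id
ORDER BY bm25 DESC
LIMIT 10;
```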
Structured retrieval, not a search box
Agent state rarely lives in a single text column. A memory record has a statement, a source, a timestamp, a confidence label, sometimes a task context; an application record has names, statuses, owners. We wanted lexical evidence to keep that structure intact at the point the agent received it, so that title matches stay distinguishable from body matches and field-scoped evidence is preserved, and we wanted SQL to be able to apply filters, joins, and permissions over the candidate set before the ranked list ever reached the model.
Grounding is not just finding text. It is finding the right text with enough surrounding context attached for whatever consumes it next to use it safely. The retrieval primitive should not be the one deciding what "safely" means in any given system, since that is the application's call to make, but it has to leave enough structure intact for the application to do the deciding.
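A hedged sketch of what that looks like in practice: field-scoped lexical candidates joined back to their structured rows, with a structured filter applied before anything reaches the model. The table and index names, the call signature, and the per-field weights are illustrative assumptions.

```sql
-- Hedged sketch: field-aware retrieval that keeps the record's structure.
-- Nothing here is the documented API; the 2:1 title weighting is arbitrary.
WITH title_hits AS (
    SELECT id, score
    FROM psql_bm25s_query_tokens('memory_title_idx',
                                 ARRAY['postgres', 'replication'], 100)
),
body_hits AS (
    SELECT id, score
    FROM psql_bm25s_query_tokens('memory_body_idx',
                                 ARRAY['postgres', 'replication'], 100)
)
SELECT m.id, m.statement, m.source, m.confidence, m.created_at,
       2.0 * coalesce(t.score, 0) + coalesce(b.score, 0) AS fused
FROM memory_records m
LEFT JOIN title_hits t USING (id)
LEFT JOIN body_hits  b USING (id)
WHERE (t.score IS NOT NULL OR b.score IS NOT NULL)
  AND m.confidence >= 0.5          -- structured filter before ranking
ORDER BY fused DESC
LIMIT 20;
```

Title matches and body matches keep separate scores all the way to the fusion step, and the row handed onward still carries its source, timestamp, and confidence label rather than arriving as bare text.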
Performance
Our current public benchmark is the PG18 15 × 5 BEIR [4] matrix: fifteen standard BEIR retrieval datasets, top-k = 1000, run on a Google Cloud n2-standard-16 against PostgreSQL 18. We compare two psql_bm25s paths, one over pretokenized integer arrays (int4[], queried with psql_bm25s_query_ids) and one over text arrays (text[], queried with psql_bm25s_query_tokens), against the Python bm25s reference implementation [5], ParadeDB pg_search, and TensorChord vchord_bm25.
Query throughput across all fifteen datasets, expressed as a multiple of the Python reference:

[Figure: per-dataset query throughput, normalized to the Python bm25s baseline]
The three largest workloads are the ones we care about most, since they most closely match the corpora long-running agents tend to produce: large document counts, many queries, query-time cost setting the user-visible latency.


The other side of the tradeoff, and the part worth being honest about, is build time. The bm25s formulation pushes work into indexing, precomputing per-term scores in sparse form, so that repeated queries are cheaper at runtime, and that is only a trade worth making if construction itself stays operationally reasonable. In the current matrix, total build for the ids path is 0.31× the Python reference and the text[] path is 0.52×, which is slower than the cheapest BM25 build one could write but comfortably inside the range where the query-time gains pay for the construction cost.
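For orientation, the standard Okapi BM25 per-term contribution is shown below; the exact scoring variant psql_bm25s ships may differ in details.

```latex
% Standard Okapi BM25: k1 and b are the usual free parameters, N the corpus
% size, n(t) the number of documents containing term t, f(t,d) the term
% frequency in document d, and avgdl the average document length.
\mathrm{score}(q,d) = \sum_{t \in q} \mathrm{IDF}(t)\,
  \frac{f(t,d)\,(k_1+1)}{f(t,d) + k_1\left(1 - b + b\,\tfrac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(\frac{N - n(t) + 0.5}{n(t) + 0.5} + 1\right)
```

Every factor in the summand depends only on the term and the document, not on the query. That is what makes eager scoring possible: the per-(term, document) contributions can be computed once at index time and stored sparsely, so a query reduces to summing precomputed rows for its terms.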
How we built it
The second half of this post is about the workflow we used to build psql_bm25s: the harness, in the sense our Zenith post [6] develops the term. Zenith works through harness design for agents executing long, open-ended tasks (state preservation, gap-finding, revisable planning, independent verification, stopping discipline) and argues that long-horizon agent performance is a control problem rather than a model problem. The same framing applies one layer down, to systems engineering with an AI coding agent inside the loop. Most of what made this project work was not the agent's per-session output, but the structure of the loop around it that decided which output got to stay.
The failure modes Zenith identifies (premature completion, self-reported "done," stale plans, no principled stopping rule) show up just as readily when the agent is editing a database extension as when it is building a product.
What changes between the two cases is what "verified" means. For a product build, verification eventually lands on a user-testing layer that opens the real surface and checks rendered behavior against a spec. For a Postgres extension, verification lands on a benchmark matrix that has to hold up under access-method correctness, write-ahead logging, replication, and mutable maintenance constraints simultaneously.
In Zenith's vocabulary, what we ran was closer to RALPH applied to optimization branches than to Zenith-style adaptive orchestration. Each pass asked the same gap-finding question, except the gap was not between current artifact and product spec but between current optimization and a benchmark surface that had to be true under every Postgres operating constraint at once. The question shifted accordingly:
| Harness | Gap-finding question |
| --- | --- |
| RALPH | "What is still missing?" |
| psql_bm25s harness | "Does this still hold under every constraint that matters?" |
We did not need an orchestrator that could synthesize new workers or change the shape of the work at runtime; what we needed was an outer loop strict enough that an optimization could not get through unless it survived measurement under all of those constraints together.
The harness loop
The rule was simple. No optimization landed unless it had passed through the same loop: state a hypothesis about the hot path, implement the change, run before-and-after benchmarks, compare under every constraint that mattered, and decide whether to keep it. Ideas that sounded plausible did not get to stay on those grounds alone; ideas earned their place by surviving the loop. Most of the engineering work of the project, in retrospect, was building the loop and then trusting it under pressure.
The validation surface had a lot of axes. Every optimization had to preserve exact BM25 ranking, fit a working Postgres access method, maintain its index mutably as rows changed, and behave correctly under write-ahead logging, crash recovery, and physical replication, while at the same time exposing SQL-native query APIs, keeping foreground reads and writes predictable, and scaling throughput at corpus sizes that mattered. Each of those constraints is individually tractable, but the hard part was that every change had to satisfy all of them at once. A faster query path that breaks replication is not faster, and a cleaner maintenance design that doubles foreground latency is not cleaner.

Inside that outer loop, each agent session ran under a tight boundary: a scoped task, a baseline to beat, a benchmark command that produced a number, and a rule for what to do when the number went the wrong way. Within those boundaries the agent was useful for exactly the kind of work it is genuinely good at, like tracing hot paths through unfamiliar code, writing focused C and SQL changes, running benchmarks, and summarizing branch comparisons faithfully. We did not treat its output as architecture decisions; we treated it as execution and exploration inside a validation-heavy system. The difference is small in any single session and compounds substantially over the lifetime of a project.
Failure usually arrived in one of four shapes: correctness broke, the benchmark signal was negative or weak, the change helped one path while breaking another, or the gain was real but did not justify the new complexity it brought with it. The temptation in any of those is to keep prompting the agent until the branch looks better, and we tried to resist it. The right move was to rerun the loop if the hypothesis still seemed valid, repair the implementation if the issue was local, or otherwise drop the branch and write down what we had learned from it. The agent made exploration cheap; the harness made dropping things safe.
What stayed
The visible output of the project is the extension itself and the BEIR matrix. The less visible output, and the one that will matter more over the next few years of work, is the paper trail underneath it: benchmark archives, raw-data records, branch comparisons, and closeout notes for every optimization campaign that ran. Performance regressions are almost always historical problems, in the sense that a change six months from now will tend to reopen a question this project already answered, and without the records the team has to rediscover the same boundary the hard way. With the records, old experiments stay part of the design memory rather than fading from it.
Performance for an in-database retrieval layer is not only query throughput. It is also build time, mutable maintenance, crash recovery, replication, and the cost of keeping foreground operations boring, and the archive is what kept all of those dimensions connected to each other rather than optimized in isolation. Without it, the default outcome is that one dimension improves at the others' expense, slowly enough that nobody notices until the regression is hard to back out of.
The broader pattern this project sits inside is the one Zenith argues for from the agent side: AI-assisted work scales when the control layer around the model gets stronger, not when the model gets stronger on its own. The agent let us implement faster, the harness let us learn faster, and the benchmark archive let what we learned stay around for future work. By the end, the project felt less like ordinary feature development and more like a small lab: wide exploration, fast measurement, aggressive pruning, durable records of what survived.
For a retrieval primitive inside Postgres, writing code was not the hard part. The hard part was knowing which code should remain.
Summary
- Long-running agents lose to retrieval, not context. A retrieval primitive fast enough to use freely changes what the harness above it is willing to spend retrieval on.
- Postgres is already where most production agent state lives. A native, mutable, replication-friendly BM25 primitive lets that state become an asset rather than a static log.
- Base-plus-delta maintenance lets writes stay cheap as the corpus grows, with the freshness/latency tradeoff exposed to the application.
- Performance for an in-database retrieval layer isn't only query throughput. It's also build time, maintenance, recovery, and replication, and those have to hold up together rather than in isolation.
- Harness engineering applied to systems work looks like RALPH against the benchmark surface: the coding agent inside a tight outer loop, ideas earning their place through measurement rather than plausibility.
Documentation
For readers who want to go deeper, the project repository on GitHub carries the full technical report and architecture document, alongside reference material on the API surface, query semantics, multi-field search, multicolumn fusion indexes, and field-aware indexes. The benchmark and raw-data trail behind the numbers in this post is also published there, including the full performance index and the per-dataset cross-engine matrix that the figures above summarize.
Links
- Source: https://github.com/Intelligent-Internet/psql_bm25s
- Technical report: https://github.com/Intelligent-Internet/psql_bm25s/blob/main/docs/technical-report.md
- Common Ground: https://github.com/Intelligent-Internet/CommonGround
- Try II-Commons: commons.ii.inc
Acknowledgements
The algorithmic starting point was bm25s / BM25-Sparse [5], by Xing Han Lù. The paper made the eager sparse scoring tradeoff clear and the project's source made it legible; both shortened the runway between "this looks promising" and "this is worth carrying into Postgres." The C implementation, the access-method integration, the WAL and recovery behavior, the maintenance design, and the SQL-facing surfaces are ours, but the retrieval model started there.
References
[1] Intelligent Internet (2026). Common Ground Core: From Agent Chaos to Structured Intelligence. https://ii.inc/web/blog/post/common-ground-core-cgc
[2] Intelligent Internet. II-Commons. https://commons.ii.inc/
[3] PostgreSQL Documentation. Index Access Method Interface Definition. https://www.postgresql.org/docs/current/indexam.html
[4] Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS Datasets and Benchmarks Track. https://arxiv.org/abs/2104.08663
[5] Lù, X. H. (2024). BM25S: Orders of magnitude faster lexical search via eager sparse scoring. https://arxiv.org/abs/2407.03618. Project: bm25s.github.io
[6] Intelligent Internet (2026). From RALPH to Zenith: Designing harnesses for long-running agents. https://ii.inc/web/blog/post/zenith-research