Most RAG stacks are assembled from separate components. Arculae is built on an integrated database designed for precision retrieval.
Arculae's retrieval infrastructure is powered by chonkyDB, a proprietary database purpose-built for the demands of knowledge-intensive AI workloads.
Most RAG systems are assembled from loosely coupled third-party components: a vector database here, a graph store there, a full-text engine somewhere else, glued together with middleware and hope. chonkyDB takes a fundamentally different approach.
Vector search, knowledge graphs, full-text retrieval, semantic tagging, and temporal queries are unified in a single, natively integrated system: no external database components, no precision lost at system boundaries.
It is built to treat each knowledge object as one coherent unit across multiple retrieval paths — so vectors, graph links, and lexical signals stay consistent as the underlying material evolves.
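The single-object idea can be pictured with a short sketch. chonkyDB's internal representation is proprietary, so every name below is hypothetical; the point is only that one record carries its vector, graph links, and lexical terms together, and an update refreshes all of them atomically:

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeObject:
    """Hypothetical sketch: one record feeding every retrieval path."""
    doc_id: str
    text: str
    embedding: list[float] = field(default_factory=list)
    graph_edges: set[str] = field(default_factory=set)    # linked doc_ids
    lexical_terms: set[str] = field(default_factory=set)  # full-text terms
    version: int = 0

    def update_text(self, new_text: str, embed) -> None:
        # One update refreshes all surfaces together, so vectors,
        # graph links, and lexical signals never drift out of sync.
        self.text = new_text
        self.embedding = embed(new_text)
        self.lexical_terms = set(new_text.lower().split())
        self.version += 1

obj = KnowledgeObject("doc-1", "initial draft")
# Toy embedder stands in for a real model.
obj.update_text("revised pricing policy", embed=lambda t: [float(len(t))])
```

In a bolted-together stack, each of those three fields would live in a different database, and the failure mode is exactly the drift this sketch rules out.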
Arculae goes beyond basic “RAG.” We treat retrieval as a knowledge runtime: an orchestration layer that combines retrieval, access control, policy enforcement, and audit trails as one operation.
This is what makes proprietary knowledge agent-callable in production: low-latency precision plus governance, not a pile of loose embeddings.
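A knowledge-runtime call can be pictured as one operation rather than a retrieval with bolt-on checks. The sketch below is illustrative only (the function and record names are invented, not Arculae's API): ranking, the policy gate, and the audit record happen inside a single call that returns only what the calling agent is cleared to see.

```python
import time
from dataclasses import dataclass

@dataclass
class Hit:
    doc_id: str
    score: float
    policy_tags: frozenset

AUDIT_LOG: list[dict] = []

def retrieve(query: str, agent_clearance: set,
             index: list[Hit], k: int = 3) -> list[Hit]:
    """Hypothetical one-shot call: retrieval, policy gate, audit trail."""
    # 1. Rank candidates (stand-in scoring: already attached to hits).
    ranked = sorted(index, key=lambda h: h.score, reverse=True)
    # 2. Policy gate: the agent only sees documents whose tags it clears.
    allowed = [h for h in ranked if h.policy_tags <= agent_clearance][:k]
    # 3. Audit trail: every retrieval is recorded as a governance event.
    AUDIT_LOG.append({
        "ts": time.time(),
        "query": query,
        "returned": [h.doc_id for h in allowed],
        "withheld": len(ranked) - len(allowed),
    })
    return allowed

index = [
    Hit("public-faq", 0.91, frozenset({"public"})),
    Hit("board-memo", 0.95, frozenset({"confidential"})),
]
hits = retrieve("pricing strategy", {"public"}, index)
```

Note that the highest-scoring hit is withheld, and the audit record says so: governance is part of the result, not an afterthought.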
Small retrieval errors turn proprietary knowledge into generic output. Precision turns it into insight.
When an AI agent queries an Arcula, the quality of the answer depends entirely on the quality of the retrieval.
A near-miss in semantic search, a missed connection in the knowledge graph, a stale temporal reference: any of these turns proprietary knowledge into noise.
chonkyDB exists because off-the-shelf solutions weren't precise enough for what we're building. The margin between a useful insight and a generic response is often a single retrieval decision, and that decision needs to be right.
In the knowledge economy, retrieval is also a governance event: policy gates, exposure budgets, and audit traces are part of correctness. “Agent-callable” knowledge isn't just retrievable — it's attributable, controllable, and compliant by design.
The cards below show current benchmark snapshots for chonkyDB and the comparison systems, together with downloadable run documents.
The new code retrieval benchmark below publishes the full dual-surface KPI matrix across repository retrieval baselines.
Snapshot date: 2026-03-17
Long PDFs don’t fit into an AI model’s context window. So the question becomes: what do you keep when you have only a small token budget?
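One common answer (a generic sketch, not a description of Arculae's internals) is greedy selection under a token budget: score each chunk against the query, then pack the highest-scoring chunks until the budget is spent.

```python
def pack_context(chunks: list[tuple[str, float]], budget: int) -> list[str]:
    """Greedy budget packing over (text, relevance_score) pairs.

    Token counts are approximated by whitespace word count; a real
    system would use the model's own tokenizer.
    """
    picked, used = [], 0
    # Highest relevance first; skip anything that would bust the budget.
    for text, _score in sorted(chunks, key=lambda c: c[1], reverse=True):
        cost = len(text.split())
        if used + cost <= budget:
            picked.append(text)
            used += cost
    return picked

chunks = [
    ("intro boilerplate about the company", 0.10),
    ("pricing table for enterprise tier", 0.92),
    ("refund policy for annual contracts", 0.85),
]
context = pack_context(chunks, budget=11)
```

With an 11-word budget, the two relevant chunks fit and the boilerplate is dropped; the single retrieval decision that matters is which chunk never makes it in.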
Whether you're sitting on valuable domain expertise, running a research group, or building AI agents that need non-generic insight, we'd like to hear from you.