Retrieval · Systems

LEANN: A Low-Storage Vector Index

Vector search without storing the vectors.

Semantic search keeps an embedding for every chunk of your data, and that store outgrows the data itself. LEANN deletes it, then recomputes the handful of embeddings each query actually touches. The index drops to 5% of the corpus and the answers stay the same.

Explaining the paperLEANN: A Low-Storage Vector IndexWang, Li, Liu, Wu, Mao, Zhao, Yan, Xu, Zhou, Stoica, Min, Zaharia, Gonzalez · UC Berkeley · MLSys 2026 · arXiv:2506.08276 ↗

Your documents fit on your laptop, but the index that searches them outgrows the machine.

An arithmetic problem hides inside every "chat with your documents" app. To search text by meaning rather than keywords, you run every chunk of it through an embedding model, an encoder that maps text to a vector of numbers placed so that related passages land near each other. Then you store all those vectors, and at query time you embed the question and look for its nearest neighbors. The vectors are the searchable form of your data. They are also enormous.

The paper's running example makes it concrete. It starts from a 76 GB dump of Wikipedia text cut into 256-token chunks: 60 million of them. Embed each with Contriever, a standard BERT-sized retrieval encoder that outputs 768 numbers per chunk. At four bytes per number, each roughly 1 KB chunk of text now carries a 3 KB vector: the description is three times the size of the thing it describes. Across the corpus that is 173 GB of embeddings (the paper's count, in binary gigabytes; multiply it out in decimal and you get 184), plus 15 GB of index structure on top. You need 188 GB of storage to search 76 GB of text, and the popular indexes do not even fit it in RAM on a 32 GB machine. LEANN, out of UC Berkeley, gets the same search quality out of 4 GB.

The index outweighs the library

First, state the problem precisely. Vector search returns, for a query vector, the $k$ stored vectors closest to it (the "top- $k$ "). Doing that exactly means comparing the query against all 60 million vectors, a linear scan, which is why every practical system uses approximate nearest neighbor search (ANNS): an index that finds most of the true top- $k$ while computing only a tiny fraction of the distances. How much of the truth you get is the recall:

\mathrm{Recall@}k \;=\; \frac{\lvert S_q \cap S'_q \rvert}{k}

(2)

where $S_q$ is the true top- $k$ set and $S'_q$ is what the index returned. The paper targets 90% recall at $k=3$ : of the three best chunks for your question, the index hands the language model 2.7 of them on average. Below that, retrieval-augmented generation (RAG), the setup where a model reads retrieved chunks before answering, starts answering from the wrong pages.

Storage is what breaks this down on a laptop. The 188 GB above is HNSW, the standard graph index, and it is not the worst offender. Click through the ledger:

Figure 1 · the storage ledger

LEANN · PQ codes 2 GB + pruned graph 2 GB · No stored embeddings: 2 GB of rough PQ codes plus a 2 GB pruned graph. 5% of the corpus, same 25.5% downstream accuracy as HNSW.

What it costs to make a 76 GB corpus searchable, from the paper's Table 1 and Figure 3. Amber is stored vectors, violet is index metadata. The dashed line is the corpus itself: HNSW and DiskANN store more than twice the data they serve. LEANN's 4 GB is the entire index.

Compression is the obvious response, specifically product quantization (PQ), which we will meet properly later. Squeezing the vectors 35×, down to 5 GB, is enough compression to matter and enough error to hurt: downstream question-answering accuracy falls to 17.9%, below the 18.3% of BM25, the lexical ranking function from the 1990s that vector search was supposed to retire. And the 15 GB of graph metadata is untouched by vector compression, so even a perfect quantizer leaves a 20 GB floor. Compression is the only knob for how you store the embeddings, and these numbers show it running out of room: squeeze hard and accuracy craters, squeeze gently and the 20 GB floor stays. If no setting of how you store them gets you past that floor, you are left deciding whether to store them at all.

A query touches a sliver of the graph

The way past that floor is in how the best indexes already work. There are two main families. An inverted file index (IVF) clusters the vectors with k-means, stores a centroid per cluster, and answers a query by scanning the centroids and then checking every vector inside the few nearest clusters. A proximity graph connects each vector to a few dozen near neighbors and answers a query by walking: the state of the art, because it reaches a given recall with far fewer distance computations than anything else. LEANN builds on the graph family, specifically on HNSW.

The walk is called best-first search, and it is simple enough to hold in your head. Keep a short list of the best candidates seen so far, at most ef of them, ordered by distance to the query. Start the list with a fixed entry node. Then loop: take the closest candidate you have not expanded yet, compute the distance to each of its graph neighbors, and let any neighbor good enough to make the list push a worse candidate off the end. When every node on the list has been expanded and nothing new can break in, the top $k$ of the list is your answer. The knob ef is the quality dial: a longer list explores more dead ends before settling, which costs more distance computations and misses less.

Why does a greedy walk find the nearest neighbors at all? Each hop moves to a node closer to the query, and because every node links to its own neighborhood, "closer" compounds: it is the six-degrees game, where you reach a stranger by always handing the message to whichever acquaintance is nearest the target. The candidate list keeps a purely greedy walk from getting stuck. A pure greedy walk would get stuck the first time every neighbor looks worse; the ef-sized shortlist remembers runner-up paths and lets the search back out of dead ends.

Count how many nodes the walk actually visits. Empirically the count grows roughly logarithmically with corpus size (a regularity, not a theorem; HNSW proves it only for idealized graphs, and the paper itself hedges to "polylogarithmic"). At 60 million vectors a search at high recall computes distances to a few thousand nodes, around 0.005% of the corpus. Per query, 99.99% of that 173 GB embedding store is dead weight. Exploiting this was hard for one reason: which 0.005% you need depends on the query and is only revealed hop by hop during the walk, so you cannot prefetch it, and a conventional index keeps everything on disk just in case. The figure shows how little of the corpus a search actually touches:

Figure 2 · best-first search, counted

ef8

Best-first search on a 900-chunk toy corpus wired HNSW-style. Tap anywhere to move the query, then play or scrub the walk. Teal nodes are the only ones whose embeddings the search ever computes; amber rings are the true top-3. Drag ef to 1 and the under-provisioned walk can miss a neighbor; drag it to 32 and the cost grows while the answer stops improving. In 768 dimensions this dial stretches much further before saturating, but the shape survives: the walk touches only a small fraction, and which part is only revealed as it goes.

The counter in the top-left of the figure shows the running tally of how many of the 900 embeddings the search has actually had to compute. A structure that examines a few dozen nodes out of 900 here, or a few thousand out of 60 million at scale, is a machine for ignoring data, and that is precisely what the recompute design relies on.

Recompute the embeddings instead of storing them

LEANN does not store the embeddings. Store the graph and the original text, and when the walk needs a node's embedding, recompute it on the spot by running that chunk back through the same encoder that built the index. The same model on the same text produces the same vector, so the recomputed vectors are identical to what the walk would have read from disk, and the answers do not change. Instead of storing 60 million precomputed vectors, you keep only what you need to remake them: the text (which is your data, already on disk) and the encoder (which any dense retriever already needs at query time, to embed the query itself).

If you have trained large models, you have met this trade under another name: gradient checkpointing, where you discard activations during the forward pass and recompute them during the backward pass, buying memory with compute. The two methods differ in what they recompute. Checkpointing re-runs a fixed, predictable schedule of everything; LEANN re-runs a tiny, query-dependent subset selected during the search, and the entire design depends on that subset staying small.

The subset is small, but recomputing it carries a cost. Each touched node is now one forward pass of a 110M-parameter encoder over a 256-token chunk, so a few thousand touches is a few seconds of GPU work, where HNSW answered in 50 milliseconds. That sounds disqualifying until you look at what the seconds sit next to. In the paper's RAG setup, a local 4B-parameter model takes 20.86 s to generate the answer; HNSW's 0.05 s search is 0.2% of the wait. The request is sequential (search finishes, then generation starts), so retrieval is not hidden behind anything; it is small relative to the stage everyone is already waiting on. Slowing retrieval 50×, from 0.05 s to 2.48 s, moves the end-to-end request from 20.9 s to 23.3 s, about 10% (the paper's Table 1). The design trades on that asymmetry: spend the part of the latency budget nobody is using, and buy back 184 GB of disk with it.

That trade also constrains which index family you can build on. Once each distance computation costs an encoder forward pass instead of a memory read, the index's job description changes: minimize how many vectors you look at, at almost any other cost. Graphs touch a polylogarithmic number of vectors. IVF, by contrast, must scan every vector in each probed cluster, and with the standard $\sqrt{N}$ clusters that is on the order of $\sqrt{N}$ embeddings per probe, roughly 7,300 chunks per cluster at 60 million. The paper's IVF-Recompute baseline (the strategy of EdgeRAG, which generates embeddings online inside an IVF index) pays exactly that: 307.6 s per query on NQ (Natural Questions, a factual open-domain QA benchmark) against LEANN's 2.48 s, with the gap reaching up to 200× across datasets and hardware. Same recompute-don't-store idea, two orders of magnitude apart, decided entirely by how many embeddings the index forces you to produce.

Even a polylogarithmic sliver, though, is recomputed wastefully if you do it naively: the encoder runs on every neighbor the walk reaches, most of which are poor matches, and it runs on them a handful at a time, which starves the GPU. LEANN's two latency mechanisms attack those two wastes in order.

Spend the encoder only where it counts

The first source of waste is that not every neighbor deserves a forward pass. To triage them, LEANN keeps one piece of stored vector data after all, a product quantization table. PQ chops the 768-dimensional vector into roughly 30 sub-vectors and replaces each with the id of its nearest match from a 256-entry codebook learned for that slice. A 3,072-byte vector becomes ~30 one-byte ids, about 100× smaller, 2 GB for the whole corpus, and distances can be estimated straight from the ids with table lookups, no encoder involved. (The paper calls this a "100× smaller codebook"; strictly, it is the per-vector codes that shrink. The codebook itself is under a megabyte.) The cost of 100× compression shows up in accuracy: reconstruct a vector from 30 bytes and the distance you compute from it is systematically off, good enough to say "promising" or "dud," not good enough to say "third nearest versus ninth."

So there are now two distance oracles: a cheap one that is somewhat wrong and an expensive one that is exactly right. The design question is which oracle the walk uses to choose its next hop. DiskANN, the canonical disk-based graph index, uses the cheap one for navigation: PQ codes in RAM determine every hop, and the exact vectors (which arrive with each disk read the walk was making anyway) only re-rank the finalists at the end. That works at DiskANN's moderate compression. At LEANN's 100×, it fails in a way re-ranking cannot repair: the distorted distances send the walk into detours, true neighbors never enter the candidate list at all, and you cannot re-rank what you never reached.

The repair that re-ranking offers is real, but it comes too late. A walk produces an ordered list of nodes it chose to expand, and re-ranking only re-sorts that finished list with exact distances; it cannot reach back and visit a node the walk never expanded. So the failure is one-directional: PQ noise at 100× can knock a true neighbor far enough down the candidate ordering that the walk steps past it toward a quantization artifact, that artifact's neighbors get expanded instead, and the walk continues into a region that holds none of the answer. The true neighbor was never reached, so it is never in the list to re-rank, and no amount of exact re-sorting at the end recovers a route that turned the wrong way at the start.

LEANN inverts the roles. The walk is ordered by exact, recomputed distances, so it stays near the true neighbors; the PQ estimates are demoted to nominating which candidates are worth recomputing. The asymmetry holds because PQ noise can only ever cost an extra recompute on a candidate that proves worthless, never a wrong turn, because the wrong-turn decision, which node to expand next, is made on an exact distance every time. Concretely, the search keeps two queues: the exact queue EQ (the usual best-first list, admitting only recomputed embeddings) and an approximate queue AQ holding a PQ estimate for every neighbor the walk has reached. Each hop PQ-scores all the new neighbors for fractions of a microsecond, then recomputes only the top $\alpha\%$ of AQ and moves them into EQ. The cheap PQ score is a first cut that selects which candidates get the expensive encoder pass. And AQ never discards anyone, so a candidate passed over early can still be recomputed later, once the walk reaches its neighborhood and its relative standing improves. Run both designs on the same corpus, same query, same quantization noise:

Figure 3 · who steers the walk

α15%

The paper's two-level search against the design it rejects. Teal nodes cost an encoder pass; hollow amber nodes were scored from 30-byte codes for almost nothing. In two-level mode the walk finds the true top-3 while recomputing fewer embeddings than recompute-everything; at

\alpha

= 100% the savings vanish (that end of the slider is the naive search), and at 5% even exact steering starts to miss. Switch to PQ steers and the distorted distances push the walk into detours: a handful of recomputes, the wrong neighbors.

The second source of waste is that the surviving recomputations arrive in dribbles. Best-first search is sequential by definition (the next hop depends on the last), so a naive implementation hands the GPU work a few chunks at a time, and a GPU fed five chunks at once runs at a fraction of its throughput; most of the wall-clock goes to launch overhead and idle compute units, and a batch of 64 costs barely more than a batch of 8. LEANN loosens the search order to fill the machine: nominated candidates accumulate across hops until roughly 64 are waiting, then one batched encoder call scores them all. The ordering goes slightly stale, a candidate may be expanded a hop later than strict best-first would have, and recall is unaffected. The full search now looks like this:

# one LEANN query (the paper's Algorithm 2, plus batching)
EQ = queue(ef)          # exact queue: only RECOMPUTED embeddings get in
AQ = heap()             # approx queue: cheap PQ-table distances, keeps all
push(EQ, entry, exact_dist(q, entry))
while EQ has an unvisited node:
    u = closest unvisited node in EQ    # the walk advances using exact distances
    for v in neighbors(u):
        if v not seen:
            push(AQ, v, pq_dist(q, v))  # ~30-byte codes, no encoder call
    C = top alpha% of AQ, excluding EQ  # nominate the promising few
    batch += C
    if len(batch) >= 64:                # dynamic batching: fill the GPU
        embs = encoder(text[batch])     # ONE batched forward pass
        for c, e in zip(batch, embs):
            push(EQ, c, dist(q, e))
        batch = []
return k closest in EQ

Measured on the real system, the triage alone (two-level search) speeds retrieval up 1.4× on average, up to 1.6×; adding dynamic batching takes the average to 1.8× and the peak to 2.0×. HotpotQA, whose multi-hop questions force the longest walks, gains the most from batching, which is what you would expect: longer walks leave more candidates waiting around to be batched.

Prune the graph, keep the high-degree nodes

Recomputation deletes the 173 GB. The remaining 15 GB is the graph itself: each node's list of neighbor ids, four bytes each. With the standard settings of faiss (Meta's widely used vector-search library), the base layer allows up to 60 neighbors per node (the build parameter $M{=}30$ , doubled at the bottom layer by convention), and the paper's arithmetic makes the burden vivid: round that 60-neighbor cap up to 64 for clean arithmetic, and a node spends 256 bytes describing its wiring, a quarter of the size of the 1 KB chunk it points to. Once the embeddings are gone, this metadata accounts for the entire index. To shrink it, cut edges. The question is which ones.

Two obvious approaches fail. Delete edges at random, or rebuild the graph with a uniformly lower degree cap, and accuracy collapses well before the storage halves, because what you are spending is connectivity, which the walk depends on. The way out is that edges are not equally valuable. In a built HNSW graph the degree distribution is heavily skewed, and search traffic is skewed harder: a 2024 study of HNSW (its title: "the 'H' in HNSW stands for hubs") found that walks spend their early hops on a small set of heavily connected, heavily visited hub nodes that emerge from high-dimensional geometry rather than from anything the construction plans. Most walks route through one of these hubs and then step out to the sparsely connected nodes near the answer. (That study identifies hubs by measured traffic; LEANN cuts by degree, a static stand-in for traffic that you can read off the graph without running queries.)

Why hubs form at all explains why preserving them works. In 768 dimensions almost every pair of random points sits at nearly the same distance: volume piles up so far from any center that the spread of distances collapses, and most points are roughly equidistant from most others. When you wire each point to its nearest neighbors under those flattened distances, a few points end up sitting just slightly closer to a great many others than their rivals do, and those few collect a disproportionate share of the edges and, with them, a disproportionate share of the traffic, because almost every walk finds it cheapest to route through one of them. Nobody designed this concentration; it precipitates out of the geometry the moment you build a proximity graph in high dimensions. That is exactly why the pruning rule can be so aggressive on everything else. The hubs are not an incidental feature competing for the storage budget, they are the structure that carries most of the walk's traffic, so keeping the budget on them and cutting edges from the low-degree nodes keeps the part that carries navigation and throws away the part that mostly duplicated it.

So the pruning rule protects the hubs. Formally the paper poses it as a constrained optimization: keep the graph under a storage budget $B$ and above a recall floor $\tau$ while minimizing how many nodes each search must recompute:

\min \; T(G_1)=\sum_{i=1}^{ef}\lvert V_i\rvert \quad \text{s.t.}\quad \mathrm{Space}(G_1)\le B,\qquad \mathrm{Acc}(G_1)\ge \tau

(3)

and solves it with a heuristic in one pass. Re-insert every node, but hand out two different edge allowances: most nodes may keep only $m$ outgoing links (a fifth of $M$ ), while the top 2% of nodes by degree keep the full $M$ . The asymmetry that matters is in the back-links: when a new node is inserted, any existing node may accept a back-link from it up to the high cap $M$ , not its own little $m$ . A low-degree node cannot keep many edges of its own, but it is never denied a link up to a hub. (Standard HNSW shrinks any over-full neighbor list back to the cap; letting ordinary nodes stay attached to hubs at the high cap is LEANN's modification, and without it the hubs would be unreachable from the low-degree nodes they exist to serve.) Cut the toy corpus three ways at the same edge budget and try to search what remains:

Figure 4 · three ways to halve a graph

ef16

strategy	edges	recall@3	computed / query
original	100%	100%	72
keep hubs	37%	73%	40
random	39%	50%	49
nearest-only	37%	3%	20

keep hubs · 37% of edges · recall 73% · 40 embeddings per query at ef=16

Node size is degree; amber marks the original top-2% hubs. At the same edge budget, keep hubs stays close to the original graph's recall, random bleeds recall everywhere, and nearest-only (a uniform low cap that keeps only each node's shortest edges) strands the walk on its entry island: long edges had been carrying the cross-region connections. Cut to a third and the recall drop steepens sharply. The ef slider shows what searching harder can and cannot rescue.

On the real 60M-node graph the gain is striking: cut the average degree from 18 to 9 and the recall-versus-recomputation curve barely moves, while the stored graph shrinks from faiss's 15 GB padded layout to a 2 GB packed adjacency list (compressed sparse row: 60M nodes × 9 neighbors × 4 bytes). The pruned graph matches the original, while at the same storage budget random pruning needs up to 1.8× more recomputations to hit the same recall and the uniform low cap needs up to 5.8× more, when it gets there at all (it cannot reach the 94% and 96% recall targets).

Building an index you cannot hold

Constructing the graph poses a chicken-and-egg problem: it requires searching the graph, insertion by insertion, which seems to need every embedding live at once: a 173 GB peak during a build whose entire purpose is never to store 173 GB. LEANN sidesteps it with sharding that respects geometry: run k-means on a small sample of the corpus to get $k$ centroids, then embed each passage once, record its two nearest centroids, and discard the embedding immediately. Each shard is then built alone (re-embed its members, wire the subgraph, discard again), so peak storage is one shard's worth, and the shard graphs are finally merged: a passage that lives in two shards keeps its higher HNSW level and the union of its edge lists, randomly dropped back to the cap.

The two-centroid assignment stitches the shards back together. Passages near a cluster boundary belong to both shards, so the merged graph inherits edges that cross the boundary, and the k-means grouping means each shard wires passages among their true semantic neighbors rather than random strangers. Built this way with 15 shards, peak construction storage drops about 5×, and the merged graph's recall-versus-recomputation curve sits nearly on the unsharded one. Shard randomly instead and the curve degrades sharply: the stitching only works if the shards were split along natural cluster boundaries.

Updates get the same treatment. A naive insertion into a recompute-only graph costs $\mathcal{O}(M \cdot efC + efC^2 + M^3)$ embedding computations (here efC is the ef used while building, the construction-time queue length) across its three phases (find candidate neighbors, select among them, then re-trim every neighbor that accepted a back-link). A distance cache kills the redundant re-computations inside the select-and-trim phases, and replacing the careful trim with random selection flattens the rest, landing at $\mathcal{O}(M \cdot efC)$ : measured, 63.3× faster than naive insertion. Batched additions buffer new passages on the side, where queries search them alongside the graph until they are merged in asynchronously. Deletions set a flag that filters results, deferring surgery on the graph until enough tombstones accumulate. All three keep the storage ceiling in place throughout the index's lifetime.

What a query costs now

Everything above is in service of one trade, storage for query-time compute, so judge it on both sides of the ledger at once. Storage first: 4 GB against HNSW's 188 GB on the Wikipedia corpus, a 47× reduction (the paper rounds to "up to 50×"), under 5% of the raw data. On the paper's personal-scale corpora (financial filings, the Enron email archive, LAION images) the savings hold at 97% to 98% versus HNSW. Latency is the side that has to hold up when timed, and it breaks down per dataset, because the overhead is not one number:

Figure 5 · the request, end to end

LEANN · retrieval 2.48 s + generation 20.86 s = 10.6% of the request

Every method tuned to 90% recall@3 on the 60M-chunk corpus, RTX 4090, from the paper's Table 2. Teal is retrieval, amber is the 4B-model generation that follows. LEANN's retrieval is 50× slower than HNSW's and still 1.6% to 23.4% of the request; IVF-Recompute, the same recompute idea on the wrong index, runs off the scale. HNSW and IVF carry an asterisk: their bars were measured on a bigger server, because 188 GB does not fit the 32 GB testbed at all.

Accuracy does not move: at the same recall target LEANN's downstream RAG accuracy ties HNSW's exactly (25.5% on NQ), which is the designed outcome, since the embeddings are identical either way. Against the methods that fit in comparable storage there is no contest: LEANN beats BM25 by up to 11.8 exact-match points and 35×-compressed PQ by up to 11.3, the two baselines that fail to reach 90% recall at all. The gains concentrate on factual benchmarks (NQ, TriviaQA) where fetching the right passage is most of the battle, and shrink on GPQA (graduate-level questions Wikipedia barely covers) and HotpotQA (multi-hop questions, while this setup retrieves in a single hop).

The fine print sets the boundaries. The latency trade depends on generation staying expensive: on GPQA, where the model reasons for 69.6 s, LEANN's overhead is 1.6%; on HotpotQA on the same GPU it is 23.4%. Move to an M1 Ultra Mac, where the encoder runs on weaker silicon, and retrieval stretches to 13.8 s on NQ (23% of the request) and 44.8 s on HotpotQA, half the total wait, though HNSW and IVF do not run on that machine at all for want of RAM. Seconds-class retrieval is the wrong tool for search-as-you-type, and if your generation is a fast cloud model streaming its first token in under a second, the spare slice of latency this design uses does not exist. The knobs the paper measures point the same direction: swap Contriever for the 34M-parameter GTE-small and retrieval speeds up 2.3× with accuracy within 2%; let the disk budget breathe and caching 10% of embeddings buys 1.5× (the same traffic skew that justifies hub-preserving pruning means a small cache catches 41.9% of lookups).

Every vector index before this one was designed under the assumption that a distance computation is a memory read, so the embeddings must exist on disk; storage was the cost of speed. LEANN treats that assumption as a choice: a graph walk touches so few vectors that you can pay encoder-rates for them at query time, and then the index only references the data instead of duplicating it. 76 GB of documents, 4 GB of index, and a search that finishes well before the language model starts answering.

Provenance Verified against primary literature

HNSW (2016)Best-first graph search, the ef quality knob, and the layer-0 degree cap of 2M (60 at M=30).

PQ (Jégou et al., 2011)Product quantization: byte-sized codes, table-lookup distances, and the error that grows with compression.

DiskANN (2019)The design LEANN inverts: PQ steers the walk, exact vectors re-rank from the same disk reads already in flight.

Hub highways (2024)Search traffic concentrates on a few hub nodes in high-dimensional proximity graphs.

EdgeRAG (2024)IVF-style on-device recomputation, the O(√N) baseline LEANN beats by up to 200×.

correctionTwo of the paper’s phrases need care: the "O(log N)" recomputation count is an empirical regularity (HNSW proves it only for idealized graphs; the paper itself hedges to "polylogarithmic"), and the "100× smaller codebook" actually describes the per-vector codes. The codebook proper is under a megabyte. We teach both as such.

Questions you might still have

If the embeddings are gone, what exactly is on disk?
Four things: your original text (which you were keeping anyway), a pruned adjacency list (~2 GB at 60M chunks), ~30-byte PQ codes per chunk (~2 GB), and the encoder weights. The encoder is not an extra cost: any dense retriever already needs it at query time to embed the query itself.

Does recomputing change the search results?
No. The same encoder that built the index re-embeds the same text, so the recomputed vectors match the ones stored on disk. At the same 90% recall target, LEANN’s downstream RAG accuracy ties HNSW exactly (25.5%).

What happens without a serious GPU?
It runs, slower. On an M1 Ultra Mac, retrieval takes 13.8 s on NQ (23% of the request) and 44.8 s on HotpotQA (50%). The storage math is unchanged; the latency advantage shrinks as the encoder slows. HNSW and IVF do not run there at all: they need more RAM than the machine has.

Can you cache the hot embeddings and skip some recomputation?
Yes, and the paper measures it: storing 10% of embeddings (the high-degree nodes) gives a 1.5× speedup with cache hit rates up to 41.9%. The same traffic skew that justifies hub-preserving pruning makes a small cache disproportionately effective.

Footnotes & further reading

The paper: Wang, Li, Liu, Wu, Mao, Zhao, Yan, Xu, Zhou, Stoica, Min, Zaharia, Gonzalez, LEANN: A Low-Storage Vector Index (UC Berkeley, 2025; to appear at MLSys 2026). Code. Numbers in this explainer follow the v2 experiments (RTX 4090 + Qwen3-4B); v1 used an A10 and a 1B model, so quotes of the two versions differ.
The graph index LEANN builds on: Malkov & Yashunin, Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (2016). The 60-neighbor cap is faiss's layer-0 convention of 2M with M=30; the O(log N) scaling is proven only under idealized assumptions and observed to bend at the 200M scale in HNSW's own plots.
Product quantization: Jégou, Douze, Schmid, Product quantization for nearest neighbor search (TPAMI 2011). Their GIST results already show recall collapsing at aggressive compression on high-dimensional vectors, the regime LEANN's 100× codes live in.
The inverted design: Subramanya et al., DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node (NeurIPS 2019): PQ in RAM steers, full vectors ride along with each disk read and re-rank at no added I/O. Its degree knob is called R, not M; the paper's "DiskANN with M=60" borrows HNSW's name.
The hub finding: Munyampirwa, Lakshman, Coleman, Down with the Hierarchy: The 'H' in HNSW Stands for Hubs (2024). Hubs there are defined by measured search traffic; LEANN's top-2%-by-degree rule is a static proxy motivated by, not taken from, that result.
The recompute-on-IVF predecessor: Seemakhupt et al., EdgeRAG: Online-Indexed RAG for Edge Devices (2024). EdgeRAG also caches and keeps stored embeddings for its slowest clusters; LEANN's IVF-Recompute baseline adopts the recompute idea with LEANN's own IVF settings.
The encoders: Izacard et al., Unsupervised Dense Information Retrieval with Contrastive Learning (Contriever, 2021), a BERT-base model with mean pooling; and Li et al., Towards General Text Embeddings with Multi-stage Contrastive Learning (GTE, 2023), whose small 384-dimensional variant powers the 2.3× ablation.
One gap between paper and package: as of this writing the released library ships with the headline mechanisms off by default (degree-based pruning disabled, recompute ratio at "everything," batching unset); the paper's numbers come from its experimental configuration. This matters if you benchmark the pip install and wonder where the 50× went.

Got feedback?

How could this explainer be improved? Found an error, or something unclear? I read every message.