LEANN: A Low-Storage Vector Index
Vector search without storing the vectors.
Semantic search keeps an embedding for every chunk of your data, and that store outgrows the data itself. LEANN deletes it, then recomputes the handful of embeddings each query actually touches. The index drops to 5% of the corpus and the answers stay the same.
Explaining the paperLEANN: A Low-Storage Vector IndexYour documents fit on your laptop. The index that searches them does not. Now what?
An arithmetic problem hides inside every "chat with your documents" app. To search text by meaning rather than keywords, you run every chunk of it through an embedding model, an encoder that maps text to a vector of numbers placed so that related passages land near each other. Then you store all those vectors, and at query time you embed the question and look for its nearest neighbors. The vectors are the searchable form of your data. They are also enormous.
The paper's running example makes it concrete. Take a 76 GB dump of Wikipedia text and cut it into 256-token chunks: 60 million of them. Embed each with Contriever, a standard BERT-sized retrieval encoder that outputs 768 numbers per chunk. At four bytes per number, each roughly 1 KB chunk of text now carries a 3 KB vector: the description is three times the size of the thing it describes. Across the corpus that is 173 GB of embeddings (the paper's count, in binary gigabytes; multiply it out in decimal and you get 184), plus 15 GB of index structure on top. You need 188 GB of storage to search 76 GB of text, and the popular indexes do not even fit it in RAM on a 32 GB machine. LEANN, out of UC Berkeley, gets the same search quality out of 4 GB.
The index outweighs the library
Before the fix, the problem deserves precision. Vector search returns, for a query vector, the stored vectors closest to it (the "top-"). Doing that exactly means comparing the query against all 60 million vectors, a linear scan, which is why every practical system uses approximate nearest neighbor search (ANNS): an index that finds most of the true top- while computing only a tiny fraction of the distances. How much of the truth you get is the recall:
where is the true top- set and is what the index returned. The paper targets 90% recall at : of the three best chunks for your question, the index hands the language model 2.7 of them on average. Below that, retrieval-augmented generation (RAG), the setup where a model reads retrieved chunks before answering, starts answering from the wrong pages.
Storage is where ambition meets the laptop. The 188 GB above is HNSW, the standard graph index, and it is not the worst offender. Click through the ledger:
LEANN · PQ codes 2 GB + pruned graph 2 GB · No stored embeddings: 2 GB of rough PQ codes plus a 2 GB pruned graph. 5% of the corpus, same 25.5% downstream accuracy as HNSW.
The obvious fix is compression, and it has a name: product quantization (PQ), which we will meet properly later. Squeezing the vectors 35×, down to 5 GB, is enough compression to matter and enough error to hurt: downstream question-answering accuracy falls to 17.9%, below the 18.3% of BM25, the lexical ranking function from the 1990s that vector search was supposed to retire. And the 15 GB of graph metadata is untouched by vector compression, so even a perfect quantizer leaves a 20 GB floor. Compression is the only knob for how you store the embeddings, and these numbers show it running out of room: squeeze hard and accuracy craters, squeeze gently and the 20 GB floor stays. If no setting of how you store them gets you past that floor, the lever left to pull is whether you store them at all.
A query touches a sliver of the graph
The escape hatch is hiding in how the best indexes already work. There are two main families. An inverted file index (IVF) clusters the vectors with k-means, stores a centroid per cluster, and answers a query by scanning the centroids and then checking every vector inside the few nearest clusters. A proximity graph connects each vector to a few dozen near neighbors and answers a query by walking: the state of the art, because it reaches a given recall with far fewer distance computations than anything else. LEANN builds on the graph family, specifically on HNSW.
The walk is called best-first search, and it is simple enough to hold in your head. Keep a short list of the best candidates seen so far, at most ef of them, ordered by distance to the query. Start the list with a fixed entry node. Then loop: take the closest candidate you have not expanded yet, compute the distance to each of its graph neighbors, and let any neighbor good enough to make the list push a worse candidate off the end. When every node on the list has been expanded and nothing new can break in, the top of the list is your answer. The knob ef is the quality dial: a longer list explores more dead ends before settling, which costs more distance computations and misses less.
Why does a greedy walk find the nearest neighbors at all? Each hop moves to a node closer to the query, and because every node links to its own neighborhood, "closer" compounds: it is the six-degrees game, where you reach a stranger by always handing the message to whichever acquaintance is nearest the target. The candidate list is what keeps greed honest. A pure greedy walk would get stuck the first time every neighbor looks worse; the ef-sized shortlist remembers runner-up paths and lets the search back out of dead ends.
Now the fact the whole paper stands on: count how many nodes the walk actually visits. Empirically the count grows roughly logarithmically with corpus size (a regularity, not a theorem; HNSW proves it only for idealized graphs, and the paper itself hedges to "polylogarithmic"). At 60 million vectors a search at high recall computes distances to a few thousand nodes, around 0.005% of the corpus. Per query, 99.99% of that 173 GB embedding store is dead weight. The catch, and the reason nobody had cashed this in: which 0.005% you need depends on the query and is only revealed hop by hop during the walk, so you cannot prefetch it, and a conventional index keeps everything on disk just in case. Watch how little of the corpus a search actually lights up:
ef to 1 and the under-provisioned walk can miss a neighbor; drag it to 32 and the cost grows while the answer stops improving. In 768 dimensions this dial stretches much further before saturating, but the shape survives: the walk touches a sliver, and the sliver is only revealed as it goes.The number to watch is the counter in the top-left of the figure, the running tally of how many of the 900 embeddings the search has actually had to compute. A structure that examines a few dozen nodes out of 900 here, or a few thousand out of 60 million at scale, is a machine for ignoring data, and that is precisely what makes the next move possible.
Trade the warehouse for the recipe
LEANN's core decision: do not store the embeddings. Store the graph and the original text, and when the walk needs a node's embedding, recompute it on the spot by running that chunk back through the same encoder that built the index. The same model on the same text produces the same vector, so the walk sees exactly what it would have read from disk, and the answers do not change. Instead of a warehouse of 60 million precomputed vectors, you keep the recipe: the text (which is your data, already on disk) and the encoder (which any dense retriever already needs at query time, to embed the query itself). Nothing new is stored; something is re-cooked on demand.
If you have trained large models, you have met this trade under another name: gradient checkpointing, where you discard activations during the forward pass and recompute them during the backward pass, buying memory with compute. The difference is what gets recomputed. Checkpointing re-runs a fixed, predictable schedule of everything; LEANN re-runs a tiny, query-dependent subset that the search itself chooses, and the entire design lives or dies on that subset staying small.
Small, but not free. Each touched node is now one forward pass of a 110M-parameter encoder over a 256-token chunk, so a few thousand touches is a few seconds of GPU work, where HNSW answered in 50 milliseconds. That sounds disqualifying until you look at what the seconds sit next to. In the paper's RAG setup, a local 4B-parameter model takes 20.86 s to generate the answer; HNSW's 0.05 s search is 0.2% of the wait. The request is sequential (search finishes, then generation starts), so retrieval is not hidden behind anything; it is small relative to the stage everyone is already waiting on. Slowing retrieval 50×, from 0.05 s to 2.48 s, moves the end-to-end request from 20.9 s to 23.3 s, about 10% (the paper's Table 1). That asymmetry is the bet: spend the part of the latency budget nobody is using, and buy back 184 GB of disk with it.
The bet also settles which index family you must build on. Once each distance computation costs an encoder forward pass instead of a memory read, the index's job description changes: minimize how many vectors you look at, at almost any other cost. Graphs touch a polylogarithmic sliver. IVF, by contrast, must scan every vector in each probed cluster, and with the standard clusters that is on the order of embeddings per probe, roughly 7,300 chunks per cluster at 60 million. The paper's IVF-Recompute baseline (the strategy of EdgeRAG, which generates embeddings online inside an IVF index) pays exactly that: 307.6 s per query on NQ against LEANN's 2.48 s, with the gap ranging from 21× to 200× across datasets and hardware. Same recompute-don't-store idea, two orders of magnitude apart, decided entirely by how many embeddings the index forces you to produce.
Even a polylogarithmic sliver, though, is recomputed wastefully if you do it naively: the encoder runs on every neighbor the walk glances at, most of which are duds, and it runs on them a handful at a time, which starves the GPU. LEANN's two latency mechanisms attack those two wastes in order.
Spend the encoder only where it counts
First waste: not every neighbor deserves a forward pass. To triage them, LEANN keeps one piece of stored vector data after all, a product quantization table. PQ chops the 768-dimensional vector into roughly 30 sub-vectors and replaces each with the id of its nearest match from a 256-entry codebook learned for that slice. A 3,072-byte vector becomes ~30 one-byte ids, about 100× smaller, 2 GB for the whole corpus, and distances can be estimated straight from the ids with table lookups, no encoder involved. (The paper calls this a "100× smaller codebook"; strictly, it is the per-vector codes that shrink. The codebook itself is under a megabyte.) The price of 100× is accuracy: reconstruct a vector from 30 bytes and the distance you compute from it is systematically off, good enough to say "promising" or "dud," not good enough to say "third nearest versus ninth."
So there are now two distance oracles: a cheap one that is somewhat wrong and an expensive one that is exactly right. The design question is which one steers the walk. DiskANN, the canonical disk-based graph index, lets the cheap one steer: PQ codes in RAM guide every hop, and the exact vectors (which ride along free with each disk read) only re-rank the finalists at the end. That works at DiskANN's moderate compression. At LEANN's 100×, it fails in a way re-ranking cannot repair: the distorted distances steer the walk itself into detours, true neighbors never enter the candidate list at all, and you cannot re-rank what you never reached.
The repair re-ranking offers is real but it comes too late. A walk produces an ordered list of nodes it chose to expand, and re-ranking only re-sorts that finished list with exact distances; it cannot reach back and visit a node the walk never expanded. So the failure is one-directional: PQ noise at 100× can knock a true neighbor far enough down the candidate ordering that the walk steps past it toward a quantization artifact, that artifact's neighbors get expanded instead, and the walk wanders off into a region that holds none of the answer. The true neighbor was never reached, so it is never in the list to re-rank, and no amount of exact re-sorting at the end recovers a route that turned the wrong way at the start. A wrong steering decision is permanent in a way a wrong final ordering is not.
LEANN inverts the roles. Exact, recomputed distances steer the walk, so it never loses the trail; the PQ estimates are demoted to nominating which candidates are worth recomputing. The asymmetry is the whole point: PQ noise can only ever cost an extra recompute on a candidate that turns out to be a dud, never a wrong turn, because the wrong-turn decision, which node to expand next, is made on an exact distance every time. Concretely, the search keeps two queues: the exact queue EQ (the usual best-first list, admitting only recomputed embeddings) and an approximate queue AQ holding a PQ estimate for every neighbor the walk has glanced at. Each hop PQ-scores all the new neighbors for fractions of a microsecond, then recomputes only the top of AQ and promotes them into EQ. It is a résumé screen before the on-site: cheap scores decide who gets the expensive interview. With one improvement over a real hiring pipeline: AQ never discards anyone, so a candidate passed over early can still get the call later, once the walk drifts into its neighborhood and its relative standing improves. Run both designs on the same corpus, same query, same quantization noise:
Second waste: the surviving recomputations arrive in dribbles. Best-first search is sequential by definition (the next hop depends on the last), so a naive implementation hands the GPU work a few chunks at a time, and a GPU fed batches of five runs at a fraction of its throughput; most of the wall-clock goes to launch overhead and idle compute units, and a batch of 64 costs barely more than a batch of 8. LEANN loosens the search order to fill the machine: nominated candidates accumulate across hops until roughly 64 are waiting, then one batched encoder call scores them all. The ordering goes slightly stale, a candidate may be expanded a hop later than strict best-first would have, and recall does not care. The whole search now looks like this:
# one LEANN query (the paper's Algorithm 2, plus batching)
EQ = queue(ef) # exact queue: only RECOMPUTED embeddings get in
AQ = heap() # approx queue: cheap PQ-table distances, keeps all
push(EQ, entry, exact_dist(q, entry))
while EQ has an unvisited node:
u = closest unvisited node in EQ # exact distances steer the walk
for v in neighbors(u):
if v not seen:
push(AQ, v, pq_dist(q, v)) # ~30-byte codes, no encoder call
C = top alpha% of AQ, excluding EQ # nominate the promising few
batch += C
if len(batch) >= 64: # dynamic batching: fill the GPU
embs = encoder(text[batch]) # ONE batched forward pass
for c, e in zip(batch, embs):
push(EQ, c, dist(q, e))
batch = []
return k closest in EQMeasured on the real system, the triage alone (two-level search) speeds retrieval up 1.4× on average, up to 1.6×; adding dynamic batching takes the average to 1.8× and the peak to 2.0×. HotpotQA, whose multi-hop questions force the longest walks, gains the most from batching, which is what you would expect: longer walks leave more candidates waiting around to be batched.
Prune the graph, spare the hubs
Recomputation deletes the 173 GB. The remaining 15 GB is the graph itself: each node's list of neighbor ids, four bytes each. With faiss's standard settings the base layer allows up to 60 neighbors per node (the build parameter , doubled at the bottom layer by convention), and the paper's arithmetic makes the burden vivid: a node with 64 neighbors spends 256 bytes describing its wiring, a quarter of the size of the 1 KB chunk it points to. Once the embeddings are gone, this metadata is the index. To shrink it, cut edges. The question is which ones.
The two obvious answers fail. Delete edges at random, or rebuild the graph with a uniformly lower degree cap, and accuracy collapses well before the storage halves, because what you are spending is connectivity, the very thing the walk runs on. The way out is that edges are not equally valuable. In a built HNSW graph the degree distribution is heavily skewed, and search traffic is skewed harder: a 2024 study of HNSW (its title: "the 'H' in HNSW stands for hubs") found that walks spend their early hops on a small set of heavily connected, heavily visited hub nodes, a highway system that emerges from high-dimensional geometry rather than from anything the construction plans. The hubs are the interchanges; the rest of the graph is local streets. (That study identifies hubs by measured traffic; LEANN cuts by degree, a static stand-in for traffic that you can read off the graph without running queries.)
Why hubs form at all is worth a moment, because it explains why preserving them works. In 768 dimensions almost every pair of random points sits at nearly the same distance: volume piles up so far from any center that the spread of distances collapses, and most points are roughly equidistant from most others. When you wire each point to its nearest neighbors under those flattened distances, a few points end up sitting just slightly closer to a great many others than their rivals do, and those few collect a disproportionate share of the edges and, with them, a disproportionate share of the traffic, because almost every walk finds it cheapest to route through one of them. Nobody designed the highways; they precipitate out of the geometry the moment you build a proximity graph in high dimensions. That is exactly why the pruning rule can be so aggressive on everything else. The hubs are not an incidental feature competing for the storage budget, they are the structure the walk actually rides, so spending the budget on them and starving the side streets keeps the part that carries navigation and throws away the part that mostly duplicated it.
So the pruning rule protects the interchanges. Formally the paper poses it as a constrained optimization: keep the graph under a storage budget and above a recall floor while minimizing how many nodes each search must recompute:
and solves it with a heuristic in one pass. Re-insert every node, but hand out two different edge allowances: most nodes may keep only outgoing links (a fifth of ), while the top 2% of nodes by degree keep the full . The asymmetry that does the quiet work is in the back-links: when a new node is inserted, any existing node may accept a back-link from it up to the high cap , not its own little . A modest node cannot hoard edges of its own, but it is never denied its on-ramp to a hub. (Standard HNSW shrinks any over-full neighbor list back to the cap; letting ordinary nodes stay attached to hubs at the high cap is LEANN's modification, and without it the highways would be unreachable from the side streets they exist to serve.) Cut the toy corpus three ways at the same edge budget and try to search what remains:
| strategy | edges | recall@3 | computed / query |
|---|---|---|---|
| original | 100% | 100% | 72 |
| keep hubs | 37% | 73% | 40 |
| random | 39% | 50% | 49 |
| nearest-only | 37% | 3% | 20 |
keep hubs · 37% of edges · recall 73% · 40 embeddings per query at ef=16
On the real 60M-node graph the result is striking: cut the average degree from 18 to 9 and the recall-versus-recomputation curve barely moves, while the stored graph shrinks from faiss's 15 GB padded layout to a 2 GB packed adjacency list (compressed sparse row: 60M nodes × 9 neighbors × 4 bytes). The pruned graph matches the original, while at the same storage budget random pruning needs up to 1.8× more recomputations to hit the same recall and the uniform low cap needs up to 5.8× more, when it gets there at all (it cannot reach the 94% and 96% recall targets). Every edge you keep on a hub buys more navigation than an edge anywhere else.
Building an index you cannot hold
One paradox remains. Constructing the graph requires searching it, insertion by insertion, which seems to need every embedding live at once: a 173 GB peak during a build whose entire purpose is never to store 173 GB. LEANN sidesteps it with sharding that respects geometry: run k-means on a small sample of the corpus to get centroids, then embed each passage once, record its two nearest centroids, and discard the embedding immediately. Each shard is then built alone (re-embed its members, wire the subgraph, discard again), so peak storage is one shard's worth, and the shard graphs are finally merged: a passage that lives in two shards keeps its higher HNSW level and the union of its edge lists, randomly dropped back to the cap.
The two-centroid assignment is the stitch. Passages near a cluster boundary belong to both shards, so the merged graph inherits edges that cross the boundary, and the k-means grouping means each shard wires passages among their true semantic neighbors rather than random strangers. Built this way with 15 shards, peak construction storage drops about 5×, and the merged graph's recall-versus-recomputation curve sits nearly on the unsharded one. Shard randomly instead and the curve degrades sharply: the stitching only works if the pieces were cut along the data's own seams.
Updates get the same treatment. A naive insertion into a recompute-only graph costs embedding computations across its three phases (find candidate neighbors, select among them, then re-trim every neighbor that accepted a back-link). A distance cache kills the redundant re-computations inside the select-and-trim phases, and replacing the careful trim with random selection flattens the rest, landing at : measured, 63.3× faster than naive insertion. Batched additions buffer new passages on the side, where queries search them alongside the graph until they are merged in asynchronously. Deletions set a flag that filters results, deferring surgery on the graph until enough tombstones accumulate. The point of all three: the storage ceiling holds for the index's whole life, not just the day it ships.
What a query costs now
Everything above is in service of one trade, storage for query-time compute, so judge it on both sides of the ledger at once. Storage first: 4 GB against HNSW's 188 GB on the Wikipedia corpus, a 47× reduction (the paper rounds to "up to 50×"), under 5% of the raw data. On the paper's personal-scale corpora (financial filings, the Enron email archive, LAION images) the savings hold at 97% to 98% versus HNSW. Latency is the side that has to survive contact with a stopwatch, and it is worth exploring per dataset, because the overhead is not one number:
LEANN · retrieval 2.48 s + generation 20.86 s = 10.6% of the request
Accuracy does not move: at the same recall target LEANN's downstream RAG accuracy ties HNSW's exactly (25.5% on NQ), which is the designed outcome, since the walk sees the same embeddings either way. Against the methods that fit in comparable storage there is no contest: LEANN beats BM25 by up to 11.8 exact-match points and 35×-compressed PQ by up to 11.3, the two baselines that fail to reach 90% recall at all. The gains concentrate on factual benchmarks (NQ, TriviaQA) where fetching the right passage is most of the battle, and shrink on GPQA (graduate-level questions Wikipedia barely covers) and HotpotQA (multi-hop questions, while this setup retrieves in a single hop).
The fine print is honest and worth reading. The latency bet depends on generation staying expensive: on GPQA, where the model reasons for 69.6 s, LEANN's overhead is 1.6%; on HotpotQA on the same GPU it is 23.4%. Move to an M1 Ultra Mac, where the encoder runs on weaker silicon, and retrieval stretches to 13.8 s on NQ (23% of the request) and 44.8 s on HotpotQA, half the total wait, though HNSW and IVF do not run on that machine at all for want of RAM. Seconds-class retrieval is the wrong tool for search-as-you-type, and if your generation is a fast cloud model streaming its first token in under a second, the quiet stretch of latency this design spends does not exist. The knobs the paper measures point the same direction: swap Contriever for the 33M-parameter GTE-small and retrieval speeds up 2.3× with accuracy within 2%; let the disk budget breathe and caching 10% of embeddings buys 1.5× (the traffic skew from the hubs section means a small cache catches 41.9% of lookups).
The idea is bigger than the benchmark. Every vector index before this one was designed under the assumption that a distance computation is a memory read, so the embeddings must exist on disk; storage was the cost of speed. LEANN notices that the assumption is a choice: a graph walk touches so few vectors that you can pay encoder-rates for them at query time, and then the index is only the map, never the territory. 76 GB of documents, 4 GB of map, and a search that arrives before the language model has cleared its throat.
Questions you might still have
If the embeddings are gone, what exactly is on disk?
Four things: your original text (which you were keeping anyway), a pruned adjacency list (~2 GB at 60M chunks), ~30-byte PQ codes per chunk (~2 GB), and the encoder weights. The encoder is not an extra cost: any dense retriever already needs it at query time to embed the query itself.
Does recomputing change the search results?
No. The same encoder that built the index re-embeds the same text, so the walk sees the same vectors it would have read from disk. At the same 90% recall target, LEANN’s downstream RAG accuracy ties HNSW exactly (25.5%).
What happens without a serious GPU?
It runs, slower. On an M1 Ultra Mac, retrieval takes 13.8 s on NQ (23% of the request) and 44.8 s on HotpotQA (50%). The storage math is unchanged; the latency bet thins as the encoder slows. HNSW and IVF do not run there at all: they need more RAM than the machine has.
Can you cache the hot embeddings and skip some recomputation?
Yes, and the paper measures it: storing 10% of embeddings (the high-degree nodes) gives a 1.5× speedup with cache hit rates up to 41.9%. The same traffic skew that justifies hub-preserving pruning makes a small cache disproportionately effective.
Footnotes & further reading
- The paper: Wang, Li, Liu, Wu, Mao, Zhao, Yan, Xu, Zhou, Stoica, Min, Zaharia, Gonzalez, LEANN: A Low-Storage Vector Index (UC Berkeley, 2025; to appear at MLSys 2026). Code. Numbers in this explainer follow the v2 experiments (RTX 4090 + Qwen3-4B); v1 used an A10 and a 1B model, so quotes of the two versions differ.
- The graph index LEANN builds on: Malkov & Yashunin, Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (2016). The 60-neighbor cap is faiss's layer-0 convention of 2M with M=30; the O(log N) scaling is proven only under idealized assumptions and observed to bend at the 200M scale in HNSW's own plots.
- Product quantization: Jégou, Douze, Schmid, Product quantization for nearest neighbor search (TPAMI 2011). Their GIST results already show recall collapsing at aggressive compression on high-dimensional vectors, the regime LEANN's 100× codes live in.
- The inverted design: Subramanya et al., DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node (NeurIPS 2019): PQ in RAM steers, full vectors ride along with each disk read and re-rank for free. Its degree knob is called R, not M; the paper's "DiskANN with M=60" borrows HNSW's name.
- The hub finding: Munyampirwa, Lakshman, Coleman, Down with the Hierarchy: The 'H' in HNSW Stands for Hubs (2024). Hubs there are defined by measured search traffic; LEANN's top-2%-by-degree rule is a static proxy motivated by, not taken from, that result.
- The recompute-on-IVF predecessor: Seemakhupt et al., EdgeRAG: Online-Indexed RAG for Edge Devices (2024). EdgeRAG also caches and keeps stored embeddings for its slowest clusters; LEANN's IVF-Recompute baseline adopts the recompute idea with LEANN's own IVF settings.
- The encoders: Izacard et al., Unsupervised Dense Information Retrieval with Contrastive Learning (Contriever, 2021), a BERT-base model with mean pooling; and Li et al., Towards General Text Embeddings with Multi-stage Contrastive Learning (GTE, 2023), whose small 384-dimensional variant powers the 2.3× ablation.
- One gap between paper and package: as of this writing the released library ships with the headline mechanisms off by default (degree-based pruning disabled, recompute ratio at "everything," batching unset); the paper's numbers come from its experimental configuration. Worth knowing if you benchmark the pip install and wonder where the 50× went.
How could this explainer be improved? Found an error, or something unclear? I read every message.