HNSW vs CAGRA
Many applications, such as recommendation systems, search engines, NLP or computer-vision embedding pipelines, and anomaly detection, require finding nearest neighbors in high-dimensional datasets. At scale (millions to billions of vectors), exact search becomes prohibitively expensive, so approximate nearest neighbor (ANN) algorithms are widely used instead.
Today, we want to compare two common graph-based contenders for ANN search: HNSW and CAGRA.
For each algorithm, we'll cover the graph's construction, search mechanics, and performance trade-offs. We'll wrap up with some guidance on how to choose the right approach.

Understanding CAGRA
To understand CAGRA, we must first understand CUDA.

What is CUDA?
CUDA (Compute Unified Device Architecture) is NVIDIA's programming model and platform for general-purpose parallel computing on GPUs. CUDA allows you to leverage the parallelism of GPUs for non-graphical tasks. In other words, you can divide arbitrary code into smaller sub-tasks (called kernels) that execute in parallel across the GPU's many cores. In computationally intensive settings, this offers a significant performance advantage over traditional CPU programs, which execute largely sequentially.

CAGRA extends CUDA
CAGRA applies this concept to the ANN search problem. CAGRA (CUDA ANNS GRAph-based) is a GPU-accelerated graph-based ANN algorithm, utilizing parallel processing with NVIDIA GPUs to serve high-throughput, high-performance workloads.
The CAGRA graph is described by its creators as "search implementation centric and heuristically optimized". Under the hood, the CAGRA graph is a version of the NSW graph optimized to run on GPUs. Both NSW and CAGRA use a flat, directed graph structure. But where NSW adds or removes connections between nodes dynamically based on a greedy proximity heuristic (preserving only the most valuable, non-redundant connections), CAGRA has a fixed out-degree: each vertex has a predetermined, uniform number of connections set by the graph_degree parameter. With a uniform number of connections, CAGRA can evenly balance load across GPU compute units and maximize resource utilization within each one.

How does graph construction work with CAGRA?
CAGRA graph construction takes a few steps. First, CAGRA builds an initial k-nearest neighbor (kNN) graph using NN-descent, where each node is connected to its k = intermediate_graph_degree nearest neighbors. This intermediate degree is typically set to 2-3x graph_degree, the parameter that controls the final number of edges per node in the finished CAGRA graph. For example, if the final graph is expected to have 64 edges per node (graph_degree=64), the intermediate graph might start with 128 or 192 edges per node. Keep in mind that a higher intermediate_graph_degree results in a denser, higher-quality intermediate graph that yields higher recall during search, but at the cost of a longer graph build.
Next, CAGRA must sort neighbors by distance. For every node, CAGRA sorts the list of its neighbors based on distance to the source node, where the closest nodes come first. This is a GPU-efficient process as each node's list is independent and can be parallelized.
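These first two steps can be sketched in plain Python. Brute force stands in here for NN-descent (which is far more efficient at scale), and the function name is made up for illustration:

```python
import math

def knn_graph(vectors, k):
    """Build a kNN graph: for each node, the indices of its k nearest
    neighbors, sorted closest-first."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    graph = []
    for i, v in enumerate(vectors):
        # Each node's neighbor list is independent of the others, which is
        # exactly what makes this step easy to parallelize on a GPU.
        others = [(dist(v, u), j) for j, u in enumerate(vectors) if j != i]
        others.sort()  # closest neighbors first
        graph.append([j for _, j in others[:k]])
    return graph

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
g = knn_graph(points, k=2)
# g[0] == [1, 2]: nodes 1 and 2 are both at distance 1.0 from node 0.
```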
At this point, we have a massive, dense, well-connected kNN graph. However, it's too big and redundant! To fix this, CAGRA must optimize and shrink the graph to improve recall and throughput. This primarily includes two techniques:
- Reordering edges and pruning. Each edge, with neighbor lists still sorted by distance, is assigned a significance "rank" based on the degree to which it can be replaced by a short detour through another neighbor. Edges are then reordered to favor diversity instead of distance, preferring neighbors that lead to different parts of the graph rather than only nearby nodes. Perhaps counterintuitively, this boosts the graph's local reachability (i.e., "from one node, how many other nodes can you reach within a few steps?") because each node's neighbors lead to more unique areas. After reordering, only the top graph_degree edges are retained; the rest are pruned to remove redundancy.
- Adding reverse edges and re-pruning. After the initial pruning, some nodes might be hard to reach from others. To fix that, CAGRA adds reverse edges (edges that point backward from target nodes to source nodes) to improve strong connectivity. Another way of thinking about this is global reachability, i.e., "can you get from any node to any other node in the graph?" This step makes it significantly less likely that parts of the graph become unreachable. After adding all of the reverse edges, CAGRA prunes once more to keep the total out-degree fixed at graph_degree: each reverse edge is assumed to have the same rank as its original forward edge, and only the graph_degree highest-ranked edges per node are kept.
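A toy version of the reverse-edge step might look like the following (the helper name is hypothetical; it assumes adjacency lists are already rank-ordered by the reordering step):

```python
def add_reverse_and_prune(graph, degree):
    """Add reverse edges for connectivity, then re-prune each node back to a
    fixed out-degree. Assumes each adjacency list is already rank-ordered
    (best edges first), as produced by the reordering step."""
    ranked = [list(nbrs) for nbrs in graph]
    for u, nbrs in enumerate(graph):
        for rank, v in enumerate(nbrs):
            if u not in ranked[v]:
                # A reverse edge inherits the rank of its forward edge.
                ranked[v].insert(min(rank, len(ranked[v])), u)
    # Keep only the `degree` highest-ranked edges per node.
    return [nbrs[:degree] for nbrs in ranked]

pruned = add_reverse_and_prune([[1], [2], []], degree=2)
# Node 2, previously a dead end, now points back to node 1.
```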
How does CAGRA search work?
CAGRA search begins by randomly selecting a tunable number of entry points (candidate nodes) from the graph. This starting set controls the breadth of initial exploration and is tunable based on the desired trade-off between recall and latency.
Each candidate node is explored independently and in parallel across GPU thread blocks. Within each block, the search proceeds greedily:
- For the current node, distances between the query vector and each of its child (neighbor) vectors are computed in parallel.
- Each visited vertex and its corresponding distance are stored in a sorted internal list of the current best nearest neighbors.
- A fixed number of the closest child vectors (controlled by the search_width parameter) are selected for further exploration.
- Graph traversal expands outward until termination, when either every vertex has been visited or (more likely) the nearest neighbors list reaches a length of itopk_size.
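The loop above can be mimicked single-threaded as a rough sketch; parameter names mirror the cuVS ones mentioned in the text, but the function itself is illustrative, not the real implementation:

```python
import math
import random

def cagra_search(vectors, graph, query, k, itopk_size=8, search_width=2,
                 n_entries=4, seed=0):
    """Single-threaded sketch of CAGRA-style search; the real algorithm runs
    every step below in parallel on the GPU."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Randomly chosen entry points seed the internal best-neighbors list.
    rng = random.Random(seed)
    entries = rng.sample(range(len(vectors)), min(n_entries, len(vectors)))
    top = sorted((dist(vectors[i], query), i) for i in entries)
    visited = {i for _, i in top}
    explored = set()
    while True:
        # Select the search_width closest not-yet-explored candidates.
        frontier = [i for _, i in top if i not in explored][:search_width]
        if not frontier:
            break  # every candidate in the list has been expanded
        for node in frontier:
            explored.add(node)
            for nbr in graph[node]:  # distances computed in parallel on GPU
                if nbr not in visited:
                    visited.add(nbr)
                    top.append((dist(vectors[nbr], query), nbr))
        top = sorted(top)[:itopk_size]  # keep the internal list bounded
    return [i for _, i in top[:k]]

vectors = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
graph = [[1], [0, 2], [1, 3], [2, 4], [3]]
# The chain graph lets the search walk from any entry point to node 0.
result = cagra_search(vectors, graph, query=(0.0,), k=2)  # -> [0, 1]
```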
The Performance Characteristics of CAGRA
CAGRA's search and graph layout are designed to maximize GPU utilization. Because each node has a fixed out-degree, traversal can be parallelized efficiently across thousands of GPU cores with minimal branching.
This parallelism occurs at two levels: multiple search entry points (multi-block parallelism across cores) and neighbor distance computations (intra-block parallelism across thread groups, i.e., warp "teams"). With uniform node degrees, we get predictable memory access patterns, which reduce divergence and improve cache efficiency, both crucial for high-throughput GPU workloads.
CAGRA's architecture enables it to sustain extremely high QPS even at billion-scale datasets, particularly in batch query scenarios where multiple queries can be scheduled concurrently on GPU streams. These performance advantages are most pronounced on larger batch sizes. For small-batch or single-query workloads, CAGRA may be outperformed by CPU-based algorithms (like HNSW) which have lower overhead and faster warm start.
Notably, CAGRA doesn't require the dataset to be preloaded into GPU memory. However, if it isn't preloaded, the GPU sits mostly idle waiting on data transfers, reducing the system's net efficiency.

HNSW
HNSW (Hierarchical Navigable Small World) is a graph-based ANN algorithm designed to achieve high recall and fast search performance on CPUs. Unlike GPU-optimized systems like CAGRA, HNSW achieves efficiency through dense graph organization and hierarchical search heuristics that minimize the number of distance computations needed per query.
Conceptually, HNSW can be thought of as a multi-layer graph structure that balances global navigability ("can you quickly jump from one region of the graph to another?") with local precision ("can you home in on the nearest neighbors once in the right region?").

How are HNSW graphs structured?
The HNSW graph is composed of multiple layers of increasing sparsity: Layer 0 (the base layer) contains every vector, while each higher layer contains a progressively smaller subset of nodes.
HNSW's layers are linked together through nodes that appear in multiple contiguous layers: a node assigned to a given layer also exists in every layer below it, acting as a connection point between levels. This design allows the top-down search process to skip large portions of the graph when traversing between distant regions, similar to a skip list.

How are HNSW graphs constructed?
HNSW graphs start with nodes being inserted one by one into the graph. Then, for each node:
- A random maximum layer level is assigned.
- Starting from the topmost layer, the HNSW algorithm performs a greedy search to find the nearest neighbor of the new node within that layer.
- The search result serves as the entry point for the next lower layer, where another greedy search is performed to find closer connections.
- This process repeats downward layer by layer until reaching Layer 0.
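The random layer assignment in the first step follows an exponentially decaying distribution, which is what makes higher layers sparser. A sketch (m_L = 1/ln(M) is the level-normalization constant from the HNSW paper):

```python
import math
import random

def random_level(M, rng):
    """Draw a node's maximum layer: P(level >= L) = (1/M)**L, so each higher
    layer holds roughly 1/M as many nodes as the layer below it."""
    m_L = 1.0 / math.log(M)  # level-normalization constant from the HNSW paper
    return int(-math.log(rng.random()) * m_L)

rng = random.Random(42)
levels = [random_level(M=16, rng=rng) for _ in range(10_000)]
base_fraction = levels.count(0) / len(levels)
# With M=16, about 93-94% of nodes appear only on Layer 0.
```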
Once the bottom layer is reached, the new node connects to its M nearest neighbors, where the M parameter defines the graph's average out-degree. To maintain navigability and prevent redundancy, the algorithm prunes overly similar edges using the heuristic that each node should connect to diverse regions of its neighborhood rather than only to its closest points. The resulting graph self-organizes into a structure that is both compact and highly navigable, without requiring a full pairwise (brute-force) distance computation.

How does the HNSW search process work?
The HNSW search process begins at the topmost layer from a single entry point. This might be a randomly selected node, the most recently inserted one, or a preselected universal entry node.
Then, within each layer, the HNSW algorithm performs a greedy search:
- From the current node, it explores neighboring nodes and moves to whichever one is closest to the query vector.
- This continues until no closer neighbor can be found within the current layer (or the max number of explorations per layer is hit).
Once the search converges in a layer, the algorithm descends one level lower and repeats the process using the best node from the previous layer as the entry point. Finally, when the search reaches Layer 0 (the base graph), it performs a best-first search:
- A small candidate pool (controlled by the ef_search parameter) is maintained, and new candidates are added as better nodes are discovered.
- The search expands outward until the candidate pool stabilizes or the max number of explorations in the bottom layer is hit. Only then are the top-k closest results returned.
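The Layer-0 phase can be sketched as a classic best-first search with a bounded result heap. The function name is hypothetical, and ef plays the role of ef_search:

```python
import heapq
import math

def layer0_search(vectors, graph, query, entry, k, ef):
    """Best-first search over the base layer with a candidate pool bounded by
    ef, in the spirit of HNSW's ef_search parameter."""
    def dist(i):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vectors[i], query)))

    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap: best candidate on top
    results = [(-dist(entry), entry)]     # max-heap: worst result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break  # best remaining candidate is worse than the worst result
        for nbr in graph[node]:
            if nbr in visited:
                continue
            visited.add(nbr)
            dn = dist(nbr)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nbr))
                heapq.heappush(results, (-dn, nbr))
                if len(results) > ef:
                    heapq.heappop(results)  # pool is full: drop the worst
    return sorted((-d, i) for d, i in results)[:k]

vectors = [(0.0,), (1.0,), (2.0,), (3.0,)]
graph = [[1], [0, 2], [1, 3], [2]]
hits = layer0_search(vectors, graph, query=(2.2,), entry=0, k=2, ef=3)
# Nodes 2 and 3 are the two closest to 2.2.
```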
The Performance Characteristics of HNSW
HNSW's hierarchical structure provides an elegant balance between global reachability and local accuracy. Upper layers function as navigational shortcuts that allow searches to "zoom in" near the correct region of the dataset before performing local refinement.
Since HNSW runs on CPUs, its performance depends heavily on memory locality and cache efficiency. The algorithm's graph layout and limited branching factor (M) help ensure predictable memory access and low per-query overhead. The ef_search parameter directly controls the trade-off between recall and latency: higher values improve accuracy by exploring more candidates but require more distance computations.
In practice, HNSW delivers exceptional single-query latency on CPUs, making it well-suited for low-throughput or latency-sensitive workloads. However, its lack of native parallelism and hierarchical structure make it less ideal for GPU acceleration or massive batch processing compared to architectures like CAGRA.
Additionally, HNSW implementations can apply quantization, where vectors are snapped to a grid of values, trading a small accuracy loss for massive memory savings (up to 10x less space). CAGRA, meanwhile, could also be combined with quantization, but the memory improvements are less explored, especially since the dataset doesn't need to be preloaded into GPU memory.

Choosing between CAGRA and HNSW
Generally speaking, you should choose CAGRA when you have NVIDIA GPUs available and need to serve high-throughput, large-batch query workloads, especially over very large datasets.
Meanwhile, choose HNSW when you're running on CPUs, when workloads are single-query or latency-sensitive, or when memory efficiency (for example, via quantization) matters more than raw batch throughput.
If you are interested in exploring a performant, scalable, and efficient HNSW-powered vector database, consider trying Vectroid.

Resources

Written by Vectroid Team
Engineering Team