Technical Guides · 12 min read

HNSW vs CAGRA: GPU vs CPU ANN Algorithms

A comprehensive comparison of HNSW and CAGRA, two leading graph-based Approximate Nearest Neighbor (ANN) algorithms. Learn how HNSW's CPU-optimized hierarchical structure compares to CAGRA's GPU-accelerated parallel architecture, and discover which is best for your workload.

Vectroid Team

Engineering Team

#vector-search #ann-algorithms #hnsw #cagra #gpu #cuda #performance #graph-based #technical-guide


Many applications, such as recommendation systems, search engines, anomaly detection, and systems built on embeddings from NLP or computer vision models, require finding nearest neighbors in high-dimensional datasets. But exact search becomes prohibitively expensive at scale (millions to billions of vectors); as a consequence, approximate nearest neighbor (ANN) algorithms are widely used as a substitute.
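
To see why exact search becomes expensive, consider that a brute-force query must compute a distance to every vector in the dataset, O(N · d) work per query. A small illustrative NumPy example:

```python
# Exact k-nearest-neighbor search by brute force: one distance computation
# per dataset vector, which is the cost ANN algorithms avoid at scale.
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 100_000, 128, 10          # dataset size, dimensionality, neighbors
dataset = rng.standard_normal((N, d), dtype=np.float32)
query = rng.standard_normal(d, dtype=np.float32)

# Squared L2 distance from the query to every vector in the dataset.
dists = np.sum((dataset - query) ** 2, axis=1)

# Indices of the k closest vectors (argpartition avoids a full sort).
nearest = np.argpartition(dists, k)[:k]
nearest = nearest[np.argsort(dists[nearest])]
print(nearest, dists[nearest])
```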

Today, we want to compare two common graph-based contenders for ANN search:

  • HNSW (Hierarchical Navigable Small World), a CPU-optimized, hierarchical graph-based ANN algorithm optimized for low-latency single-query performance.
  • CAGRA (CUDA ANNS Graph-based), a GPU-optimized, flat graph-based ANN algorithm optimized for parallelization in high-throughput workloads and billion-scale datasets.
    For each algorithm, we'll cover the graph's construction, search mechanics, and performance trade-offs. We'll wrap up with some guidance on how to choose the right approach.

    Understanding CAGRA

    To understand CAGRA, we must first understand CUDA.

    What is CUDA?

    CUDA (Compute Unified Device Architecture) is NVIDIA's programming model and platform for general-purpose parallel computing on GPUs. CUDA allows you to leverage the parallelism of GPUs for non-graphical tasks. In other words, you can write functions (called kernels) whose work is divided into many small sub-tasks that execute in parallel across the GPU's many cores. In computationally intensive settings, this offers a significant performance advantage over traditional CPU programs, which execute largely sequentially.
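
    To make this concrete, here is a minimal kernel sketch using Numba's Python bindings for CUDA (an illustrative choice on our part; CUDA C++ is the canonical interface, and running this requires an NVIDIA GPU with the CUDA toolkit):

```python
# Each GPU thread computes one element of the output; thousands of threads
# execute the kernel concurrently across the GPU's cores.
import numpy as np
from numba import cuda

@cuda.jit
def scaled_add(a, b, out):
    i = cuda.grid(1)          # this thread's global index
    if i < out.size:
        out[i] = 2.0 * a[i] + b[i]

n = 1_000_000
a = np.ones(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block  # cover all n elements
scaled_add[blocks, threads_per_block](a, b, out)  # Numba copies data to/from the GPU
print(out[:3])  # [3. 3. 3.]
```

    CAGRA extends CUDA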

    CAGRA applies this concept to the ANN search problem. CAGRA (CUDA ANNs GRAph-based) is a GPU-accelerated, graph-based ANN algorithm that uses parallel processing on NVIDIA GPUs to serve high-throughput, high-performance workloads.

    The CAGRA graph is described by its creators as "search implementation centric and heuristically optimized". Under the hood, the CAGRA graph is a version of the NSW graph optimized to run on GPUs. Both NSW and CAGRA use a flat, directed graph structure. But where NSW adds or removes connections between nodes dynamically based on a greedy proximity heuristic (preserving only the most valuable, non-redundant connections), CAGRA has a fixed out-degree: each vertex has a predetermined, uniform number of connections set by the graph_degree parameter. With a uniform number of connections, CAGRA can optimize GPU resources by evenly balancing load across compute cores and maximizing resource utilization within each unit.

    How does graph construction work with CAGRA?

    CAGRA graph construction happens in a few steps. First, CAGRA builds an initial k-nearest neighbor (kNN) graph using NN-descent, where each node is connected to its k = intermediate_graph_degree nearest neighbors. This intermediate degree is usually 2d or 3d, where d = graph_degree, the parameter that controls the final number of edges per node in the finished CAGRA graph. For example, if the final graph is expected to have 64 edges per node (graph_degree=64), the initial graph might start with 128 or 192 edges per node (intermediate_graph_degree=128 or 192). Keep in mind that a higher intermediate_graph_degree results in a denser, higher-quality intermediate graph that yields higher recall during search, but at the cost of slower graph construction; likewise, a higher final graph_degree improves recall at the cost of slower searches.
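
    As a concrete sketch, building a CAGRA index with these parameters might look like the following via the cuVS Python API (function and parameter names follow the cuVS documentation, but treat the exact signatures as assumptions to verify against your installed version):

```python
# Build a CAGRA graph: the intermediate kNN graph starts at degree 128
# (~2x the final degree) and is later pruned down to graph_degree=64.
import cupy as cp
from cuvs.neighbors import cagra

dataset = cp.random.random((1_000_000, 128), dtype=cp.float32)

params = cagra.IndexParams(
    intermediate_graph_degree=128,
    graph_degree=64,
    build_algo="nn_descent",   # construct the initial kNN graph with NN-descent
)
index = cagra.build(params, dataset)
```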

    Next, CAGRA must sort neighbors by distance. For every node, CAGRA sorts its list of neighbors by distance to the source node, with the closest nodes first. This is a GPU-efficient process, as each node's list is independent and can be sorted in parallel.
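
    This per-node independence is easy to picture: with the candidate lists laid out as a matrix, every row can be sorted at once. An illustrative CuPy sketch (not cuVS internals):

```python
# Sorting each node's neighbor list by distance is an embarrassingly
# parallel, per-row operation on the GPU.
import cupy as cp

num_nodes, k = 10_000, 128
# dists[i, j] = distance from node i to its j-th candidate neighbor
dists = cp.random.random((num_nodes, k), dtype=cp.float32)
neighbor_ids = cp.random.randint(0, num_nodes, size=(num_nodes, k))

order = cp.argsort(dists, axis=1)  # every row is sorted independently
sorted_neighbors = cp.take_along_axis(neighbor_ids, order, axis=1)
```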

    At this point, we have a massive, dense, well-connected kNN graph. However, it's too big and redundant! To fix this, CAGRA must optimize and shrink the graph to improve recall and throughput. This primarily includes two techniques:

    1. Reordering Edges and Pruning. Each edge, sorted by distance, is assigned a significance "rank" based on the degree to which it can be replaced by a short detour (a simplified sketch of this ranking follows the list). Edges are then reordered to favor diversity instead of distance, connecting neighbors that lead to different parts of the graph instead of only nearby nodes. Perhaps counterintuitively, this helps boost the graph's local reachability (i.e., "from one node, how many other nodes can you reach within a few steps?") because each node's neighbors lead to more unique areas. After reordering, only the top graph_degree edges are retained; the rest are pruned to remove redundancy.
    2. Reverse Edge Addition and Re-Pruning. After the initial pruning, some nodes might not be easy to reach from others. To fix that, CAGRA adds reverse edges (edges that point backward from target nodes to source nodes) to improve strong connectivity. Another way of thinking about this is global reachability, i.e., "can you get from any node to any other node in the graph?". This process makes it significantly less likely that parts of the graph become unreachable. After adding all of the reverse edges, CAGRA needs to prune once more in order to keep the total out-degree fixed at graph_degree. Accordingly, CAGRA assumes that each reverse edge has the same rank as its original edge, then prunes to keep only the graph_degree highest-ranked edges overall for each node.
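
    Below is a deliberately simplified Python sketch of the detour-based ranking and pruning idea from step 1 (illustrative pseudocode in the spirit of the CAGRA paper, not the cuVS implementation; `neighbors` and `dist` are hypothetical stand-ins for the real data structures):

```python
# An edge (u, v) matters less if a two-hop detour u -> w -> v exists whose
# hops are both shorter than the direct edge; its "rank" counts such detours.
def detour_rank(u, neighbors, dist):
    ranks = {}
    for v in neighbors[u]:
        ranks[v] = sum(
            1
            for w in neighbors[u]
            if w != v
            and dist(u, w) < dist(u, v)   # first hop is shorter
            and dist(w, v) < dist(u, v)   # second hop is shorter
        )
    return ranks

def prune(u, neighbors, dist, graph_degree):
    # Keep the graph_degree edges with the fewest detours (most important).
    ranks = detour_rank(u, neighbors, dist)
    return sorted(neighbors[u], key=lambda v: ranks[v])[:graph_degree]
```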

    How does CAGRA search work?

    CAGRA search begins by randomly selecting a tunable number of entry points (candidate nodes) from the graph. This starting set controls the breadth of initial exploration and is tunable based on the desired trade-off between recall and latency.

    Each candidate node is explored independently and in parallel across GPU thread blocks. Within each block, the search proceeds greedily:

    1. For the current node, distances between the query vector and each of its child (neighbor) vectors are computed in parallel.
    2. Each visited vertex and its corresponding distance are stored in a sorted internal list of the current best nearest neighbors.
    3. A fixed number of the closest child vectors (controlled by the search_width parameter) are selected for further exploration.
    4. Graph traversal expands outward until termination, when either every reachable vertex has been visited or (more likely) the internal list of the best itopk_size candidates stops improving. (A usage sketch follows this list.)
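
    Continuing the cuVS sketch from above, the search-side parameters map directly onto these steps (again, the exact signatures are assumptions to check against your cuVS version):

```python
# Batch search against the CAGRA index built earlier.
import cupy as cp

queries = cp.random.random((10_000, 128), dtype=cp.float32)

search_params = cagra.SearchParams(
    itopk_size=64,     # size of the internal list of best candidates
    search_width=4,    # closest children expanded per iteration
)
distances, neighbors = cagra.search(search_params, index, queries, k=10)
```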

    The Performance Characteristics of CAGRA

    CAGRA's search and graph layout are designed to maximize GPU utilization. Because each node has a fixed out-degree, traversal can be parallelized efficiently across thousands of GPU cores with minimal branching.

    This parallelism occurs at two levels: multiple search entry points (multi-block parallelism across cores) and neighbor distance computations (intra-block parallelism across thread groups, i.e., warp "teams"). With uniform node degrees, we get predictable memory access patterns, which reduce divergence and improve cache efficiency, both crucial for high-throughput GPU workloads.

    CAGRA's architecture enables it to sustain extremely high QPS even on billion-scale datasets, particularly in batch scenarios where multiple queries can be scheduled concurrently on GPU streams. These performance advantages are most pronounced at larger batch sizes. For small-batch or single-query workloads, CAGRA may be outperformed by CPU-based algorithms (like HNSW), which have lower overhead and a faster warm start.

    Notably, CAGRA doesn't require the dataset to be preloaded into GPU memory. However, if memory isn't preloaded, the GPU will be mostly idle, reducing the system's net efficiency.

    HNSW

    HNSW (Hierarchical Navigable Small World) is a graph-based ANN algorithm designed to achieve high recall and fast search performance on CPUs. Unlike GPU-optimized systems like CAGRA, HNSW achieves efficiency through dense graph organization and hierarchical search heuristics that minimize the number of distance computations needed per query.

    Conceptually, HNSW can be thought of as a multi-layer graph structure that balances global navigability ("can you quickly jump from one region of the graph to another?") with local precision ("can you home in on the nearest neighbors once in the right region?").

    How are HNSW graphs structured?

    The HNSW graph is composed of multiple layers of increasing sparsity:

  • The bottom layer (Layer 0) contains all nodes. It serves as the fully navigable data layer where actual nearest neighbor search occurs.
  • Each higher layer is a progressively sparser "shortcut" graph built over a subset of the nodes.
  • The maximum layer assignment for each node is determined randomly from an exponentially decaying probability distribution, meaning only a few nodes occupy the top layers while most appear only in the lower layers (a short sketch of this rule follows the list).
  • HNSW layers are linked because each node appears in every layer up to its assigned maximum, so a search can step from a node in one layer directly to the same node in the layer below. This design allows the top-down search process to skip large portions of the graph when traversing between distant regions, similar to a skip list.
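
    As a quick illustration of that exponential decay, here is the layer-assignment rule from the HNSW paper, level = floor(-ln(U) * mL), with mL commonly set to 1 / ln(M):

```python
# Most nodes land on Layer 0; each higher layer holds exponentially fewer.
import math
import random
from collections import Counter

M = 16                    # target out-degree, which also drives layer probability
mL = 1.0 / math.log(M)    # normalization factor from the HNSW paper

def random_level() -> int:
    # 1.0 - random.random() lies in (0, 1], so the log is always defined.
    return int(-math.log(1.0 - random.random()) * mL)

counts = Counter(random_level() for _ in range(1_000_000))
for level in sorted(counts):
    print(level, counts[level])  # roughly 937k at 0, 59k at 1, 3.7k at 2, ...
```

    How are HNSW graphs constructed?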

    HNSW graphs are built by inserting nodes one by one. Then, for each new node:

    1. A random maximum layer level is assigned.
    2. Starting from the topmost layer, the HNSW algorithm performs a greedy search to find the nearest neighbor of the new node within that layer.
    3. The search result serves as the entry point for the next lower layer, where another greedy search is performed to find closer connections.
    4. This process repeats downward layer by layer until reaching Layer 0.

    Once the bottom layer is reached, the new node connects to its M nearest neighbors, where the M parameter defines the graph's average out-degree. To maintain navigability and prevent redundancy, the algorithm prunes overly similar edges using a heuristic: each node should connect to diverse regions of its neighborhood rather than only to its closest points. The resulting graph self-organizes into a structure that is both compact and highly navigable, without requiring a full pairwise (brute-force) distance computation. A build sketch using a popular open-source implementation follows.
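
    As promised, here is what building an HNSW index looks like with the open-source hnswlib library (one popular implementation among several; the parameters mirror those described above):

```python
# Build an HNSW index over random vectors.
import hnswlib
import numpy as np

dim, num_elements = 128, 100_000
data = np.random.random((num_elements, dim)).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)  # metric space and dimensionality
index.init_index(
    max_elements=num_elements,
    M=16,                 # max connections per node per layer
    ef_construction=200,  # candidate pool size used during insertion
)
index.add_items(data)     # nodes are inserted one by one, as described above
```

    How does the HNSW search process work?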

    The HNSW search process begins at the topmost layer from a single entry point, which may be randomly selected, the most recently inserted node, or a preselected universal entry node.

    Then, within each layer, the HNSW algorithm performs a greedy search:

    1. From the current node, it explores neighboring nodes and moves to whichever one is closest to the query vector.
    2. This continues until no closer neighbor can be found within the current layer (or the max number of explorations per layer is hit).

    Once the search converges in a layer, the algorithm descends one level lower and repeats the process using the best node from the previous layer as the entry point. Finally, when the search reaches Layer 0 (the base graph), it performs a best-first search:

    1. A small candidate pool (controlled by the ef_search parameter) is maintained, and new candidates are added as better nodes are discovered.
    2. The search expands outward until the candidate pool stabilizes or the max number of explorations in the bottom layer is hit. Only then are the top-k closest results returned. (A query sketch follows this list.)
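
    Continuing the hnswlib example from above, ef_search (called simply ef in hnswlib) sets the candidate pool size for exactly this best-first phase:

```python
# Query the index built earlier; larger ef = higher recall, higher latency.
index.set_ef(100)  # candidate pool size; must be at least k
queries = np.random.random((1_000, dim)).astype(np.float32)
labels, distances = index.knn_query(queries, k=10)
```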

    The Performance Characteristics of HNSW

    HNSW's hierarchical structure provides an elegant balance between global reachability and local accuracy. Upper layers function as navigational shortcuts that allow searches to "zoom in" near the correct region of the dataset before performing local refinement.

    Since HNSW runs on CPUs, its performance depends heavily on memory locality and cache efficiency. The algorithm's graph layout and limited branching factor (M) help ensure predictable memory access and low per-query overhead. The ef_search parameter directly controls the trade-off between recall and latency: higher values improve accuracy by exploring more candidates but require more distance computations.

    In practice, HNSW delivers exceptional single-query latency on CPUs, making it well-suited for low-throughput or latency-sensitive workloads. However, its hierarchical structure and lack of native parallelism make it less suitable for GPU acceleration or massive batch processing than architectures like CAGRA.

    Additionally, HNSW implementations can incorporate quantization, where vectors are snapped to a coarser grid of values, trading a small loss in accuracy for large memory savings (up to 10x). CAGRA could hypothetically be combined with quantization as well, but the memory improvements there are less explored, especially since the dataset doesn't need to be preloaded into GPU memory.
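
    As a minimal illustration of the idea, here is simple int8 scalar quantization in NumPy, which already yields a 4x memory reduction (the 10x figure above typically involves more aggressive schemes, such as product quantization):

```python
# Snap float32 vectors onto a 256-level grid stored as int8 (1 byte per dim).
import numpy as np

vectors = np.random.standard_normal((100_000, 128)).astype(np.float32)

lo, hi = vectors.min(), vectors.max()
scale = 255.0 / (hi - lo)

quantized = np.round((vectors - lo) * scale - 128).astype(np.int8)
dequantized = (quantized + 128.0) / scale + lo   # approximate reconstruction

print(vectors.nbytes / quantized.nbytes)         # 4.0
print(np.abs(vectors - dequantized).max())       # small reconstruction error
```

    Choosing between CAGRA and HNSW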

    Generally speaking, you should choose CAGRA when:

  • You have NVIDIA GPUs to use. During graph construction, the enforcement of a uniform degree and reordering of edges help maximize GPU utilization and maintain high local and global reachability.
  • You're working with ultra-large datasets. CAGRA scales efficiently to billion-scale datasets with very high-dimensional vectors, particularly when the dataset can be preloaded into GPU memory or streamed effectively. Keep in mind, though, that per-device GPU memory is limited (on the order of tens to a couple hundred gigabytes) and expensive, whereas CPU memory is far more available and affordable.
  • You want maximum throughput. CAGRA excels at high-throughput workloads due to parallelized search across many thread blocks. Single-query latency can be competitive as well, though it is not CAGRA's main strength.
  • Batch queries are common (or you need massive parallelization for some other reason). CAGRA really shines at parallelism, so it is ideal for batch queries, high-throughput scenarios, or billion-scale datasets where GPU compute can be fully utilized.
    Meanwhile, choose HNSW when:

  • You are CPU-bound (or dataset sizes / workflow requirements don't justify GPU acceleration). HNSW is generally preferable when GPU acceleration isn't available or necessary. In those cases, HNSW performs extremely well with minimal configuration (though some implementations may require parameter tuning).
  • You need low-latency single-query responses. HNSW excels at low-latency workloads, particularly single-query or small-batch scenarios. Its overhead per query is small, making it suitable for applications that require fast responses without GPUs.
  • Simpler deployment or memory footprint is important. HNSW has a compact, local-friendly implementation that doesn't require external resources or complex memory management. It's typically easier to deploy on standard hardware or in environments with limited resources.
    If you are interested in exploring a performant, scalable, and efficient HNSW-powered vector database, consider trying Vectroid.

    Resources

  • RAPIDS cuVS - CAGRA
  • What is the CUDA Toolkit?
  • NVIDIA cuVS
  • Accelerating Vector Search: Fine-Tuning GPU Index Algorithms
  • CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs