AWS recently announced the general availability of GPU-accelerated vector (k-NN) indexing on Amazon OpenSearch Service. You can now build billion-scale vector databases in under an hour and index vectors up to 10 times faster at a quarter of the cost. The feature dynamically attaches serverless GPUs to boost domains and collections running on CPU-based instances. With it, you can scale AI applications quickly, innovate faster, and run vector workloads leaner.
In this post, we discuss the benefits of GPU-accelerated vector indexing, explore key use cases, and share performance benchmarks.
Overview of vector search and vector indexes
Vector search is a technique that improves search relevance and is a cornerstone of generative AI applications. It uses an embeddings model to convert content into numerical encodings (vectors), enabling content matching by semantic similarity rather than by keywords alone. You can build vector databases by ingesting vectors into OpenSearch Service and building indexes that support searches across billions of vectors in milliseconds.
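For readers new to OpenSearch vector search, the following is a minimal sketch of what a vector index and a k-NN query look like. It is illustrative only: the endpoint, credentials, index name, field names, and dimension are placeholders you would replace with your own.

```python
import requests

# Placeholders: replace with your domain endpoint, credentials, and your model's dimension.
ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"
AUTH = ("master-user", "master-password")
INDEX = "products"

# Create an index with a knn_vector field backed by an HNSW graph (Faiss engine).
index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,          # must match your embeddings model
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "innerproduct",
                    "parameters": {"m": 16, "ef_construction": 100},
                },
            },
            "title": {"type": "text"},      # non-vector fields use inverted indexes
        }
    },
}
requests.put(f"{ENDPOINT}/{INDEX}", json=index_body, auth=AUTH).raise_for_status()

# Query by semantic similarity: pass a query vector produced by the same embeddings model.
query = {
    "size": 5,
    "query": {"knn": {"embedding": {"vector": [0.1] * 1024, "k": 5}}},
}
hits = requests.post(f"{ENDPOINT}/{INDEX}/_search", json=query, auth=AUTH).json()
```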
Challenges with scaling vector databases
Customers are increasingly scaling vector databases to multi-billion-vector scale on OpenSearch Service to power generative AI applications, product catalogs, knowledge bases, and more. Applications are also becoming increasingly agentic, integrating AI agents that rely on vector databases for high-quality search results across enterprise data sources to enable chat-based interactions and automation.
However, there are challenges on the way to billion scale. First, multi-million to billion-scale vector indexes take hours to days to build. These indexes use algorithms like Hierarchical Navigable Small Worlds (HNSW) to deliver high-quality, millisecond searches at scale, but they require far more compute power to build than traditional indexes. Additionally, you must rebuild your indexes every time your model changes, such as when switching vendors or versions, or after fine-tuning. Some use cases, such as personalized search, require models to be fine-tuned daily to adapt to evolving user behavior. Because all vectors must be regenerated when the model changes, the index must be rebuilt. HNSW indexes can also degrade after significant updates and deletes, so they must be rebuilt to regain accuracy.
Finally, as your agentic applications become more dynamic, your vector database must scale to handle heavy streaming ingestion, updates, and deletes while maintaining low search latency. If search and indexing share the same infrastructure, these intensive processes compete for limited compute and RAM, and search latency can degrade.
Solution overview
You can overcome these challenges by enabling GPU-accelerated indexing on OpenSearch Service 3.1+ domains or collections. GPU acceleration activates dynamically, for example in response to a reindex command on a million-plus-vector index. During activation, index-build tasks are offloaded to GPU servers running NVIDIA cuVS, which build HNSW graphs with far greater speed and efficiency by parallelizing vector operations. Inverted indexes continue to use your cluster's CPUs for indexing and search on non-vector data. These indexes operate alongside HNSW to support keyword, hybrid, and filtered vector search, and they require relatively little compute to build compared to HNSW.
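As a concrete illustration, rebuilding a large vector index, for example after regenerating embeddings into a staging index or to repair a degraded graph, is an ordinary Reindex API call; on a million-plus-vector index, this is the kind of operation the service can offload to GPUs. The sketch below uses placeholder endpoint, credentials, and index names.

```python
import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder
AUTH = ("master-user", "master-password")                  # placeholder

# Copy documents (including their vectors) into a destination index, which
# rebuilds that index's HNSW graphs -- the compute-heavy step that GPU
# acceleration offloads. wait_for_completion=false returns a task ID so the
# long-running rebuild can be tracked with the Tasks API.
reindex_body = {
    "source": {"index": "products-v1"},
    "dest": {"index": "products-v2"},
}
resp = requests.post(
    f"{ENDPOINT}/_reindex",
    params={"wait_for_completion": "false"},
    json=reindex_body,
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json()["task"])  # poll GET /_tasks/<task-id> for progress
```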
GPU acceleration is enabled as a cluster-level configuration, but it can be disabled on individual indexes. The feature is serverless, so you don't have to manage GPU instances; you simply pay per use through OpenSearch Compute Units (OCUs).
The following diagram illustrates how this feature works.

The workflow consists of the following steps:
- You write vectors into your domain or collection using the existing APIs: bulk, reindex, index, update, delete, and force merge.
- GPU acceleration is activated when the indexed vector data surpasses a configured threshold within a refresh interval.
- This leads to a secure, single-tenant assignment of GPU servers to your cluster from a multi-tenant warm pool of GPUs managed by OpenSearch Service.
- Within milliseconds, OpenSearch Service initiates and offloads HNSW operations.
- When the write volume falls below the threshold, the GPU servers are scaled down and returned to the warm pool.
This automation is fully managed. You pay only for acceleration time, which you can monitor from Amazon CloudWatch.
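Notably, no new APIs are involved. As a sketch (again with placeholder endpoint and names), a standard bulk write and an optional force merge look exactly the same whether or not GPUs are attached behind the scenes; acceleration engages and releases based purely on write volume.

```python
import json
import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder
AUTH = ("master-user", "master-password")                  # placeholder
INDEX = "products"

def bulk_index(docs):
    """Write documents through the standard _bulk API. If GPU acceleration is
    enabled and write volume crosses the service threshold, the HNSW build is
    offloaded automatically; nothing in this request changes."""
    lines = []
    for doc_id, doc in enumerate(docs):
        lines.append(json.dumps({"index": {"_index": INDEX, "_id": str(doc_id)}}))
        lines.append(json.dumps(doc))
    resp = requests.post(
        f"{ENDPOINT}/_bulk",
        data="\n".join(lines) + "\n",
        headers={"Content-Type": "application/x-ndjson"},
        auth=AUTH,
    )
    resp.raise_for_status()
    return resp.json()

# Vectors would normally come from your embeddings model; constants keep the sketch short.
bulk_index([{"embedding": [0.1] * 1024, "title": f"doc {i}"} for i in range(1_000)])

# Optional (domains only): merge segments after a large ingest so searches read
# from an optimized storage layout.
requests.post(
    f"{ENDPOINT}/{INDEX}/_forcemerge",
    params={"max_num_segments": 1},
    auth=AUTH,
).raise_for_status()
```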
This feature isn't just designed for ease of use; it delivers the benefits of GPU acceleration without the economic downsides. For example, a domain sized to host 1 billion 1,024-dimension vectors compressed 32 times (using binary quantization) takes three r8g.12xlarge.search instances to provide the required 1.15 TB of RAM. A design that runs the domain on GPU instances instead would need six g6.12xlarge instances to provide the same RAM, resulting in 2.4 times higher cost and excess GPU capacity. This solution delivers efficiency by providing the right amount of GPUs only when you need them, so you gain speed along with cost savings.
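The arithmetic behind that comparison is simple. The sketch below uses published EC2 memory figures for the two instance types (assumed here for illustration, not sizing guidance): the GPU-based design needs twice as many instances just to match the CPU fleet's RAM, which is where the roughly 2.4 times higher cost comes from.

```python
# Instance memory per the EC2 instance specs (assumed figures, illustration only).
R8G_12XLARGE_MEM_GIB = 384   # memory-optimized Graviton CPU instance
G6_12XLARGE_MEM_GIB = 192    # GPU instance with 4x NVIDIA L4

cpu_fleet_gib = 3 * R8G_12XLARGE_MEM_GIB   # 1,152 GiB across three CPU nodes
gpu_fleet_gib = 6 * G6_12XLARGE_MEM_GIB    # 1,152 GiB across six GPU nodes

# Both fleets land at roughly the RAM the example calls for, but the GPU fleet
# needs twice the instance count (half the memory per node) and leaves its GPUs
# idle outside of index builds.
print(cpu_fleet_gib, gpu_fleet_gib)
```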
Use cases and benefits
This feature has three main use cases and benefits:
- Build large-scale indexes faster, increasing productivity and innovation speed
- Reduce cost by lowering Amazon OpenSearch Serverless indexing OCU usage, or by downsizing domains with write-heavy vector workloads
- Accelerate writes and lower search latency, improving the user experience of your dynamic AI applications
In the following sections, we discuss these use cases in more detail.
Build large-scale indexes faster
We benchmarked index builds for 1M, 10M, 113M, and 1B vector test cases to demonstrate the speed gains on both domains and collections. Speed gains ranged from 6.4 to 13.8 times faster. These tests were performed with production configurations (Multi-AZ with replication) and default GPU service limits. All tests were run on right-sized search clusters, and the CPU-only tests had CPU utilization maxed out during indexing. The following chart illustrates the relative speed gains from GPU acceleration on managed domains.

The total index build time on domains includes a force merge to optimize the underlying storage engine for search performance. During normal operation, merges are automatic; when benchmarking domains, however, we perform a manual merge after indexing so that the impact of merging is consistent across tests. The following table summarizes the index build benchmarks and dataset references for domains.
We ran the same performance tests on collections. Performance differs on OpenSearch Serverless because its serverless architecture involves trade-offs such as automatic scaling, which introduces a ramp-up period before peak performance is reached. The following table summarizes these results.
OpenSearch Serverless doesn't support force merge, so the full benefit of GPU acceleration can be delayed until the automatic background merges complete. The default minimum OCUs had to be increased for tests beyond 1 million vectors to handle the higher indexing throughput.
Reduce cost
Our serverless GPU design uniquely delivers both speed gains and cost savings. With OpenSearch Serverless, your net indexing costs are reduced as long as your indexing workloads are significant enough to activate GPU acceleration. The following table presents the OCU usage and cost consumption from the previous index build tests.
The vector acceleration OCUs offload and reduce indexing OCUs. Total OCU usage is lower with GPUs because the index is built more efficiently, resulting in cost savings.
With managed domains, cost savings are situational because search and indexing infrastructure isn't decoupled as it is on OpenSearch Serverless. However, if you have a write-heavy, compute-bound vector search application (that is, your domain is sized with enough vCPUs to sustain write throughput), you can downsize your domain.
The following benchmarks demonstrate the efficiency gains from GPU acceleration. We measured the infrastructure costs during the indexing tasks. GPU acceleration adds the cost of GPUs at $0.24 per OCU-hour. However, because indexes are built faster and more efficiently, it's more economical to use GPUs to reduce CPU utilization on your domain and downsize it.
*Domains are running a high-availability configuration without any cost optimizations
Accelerate writes, lower search latency
In experienced hands, domains offer operational control and the ability to achieve excellent scalability, performance, and cost optimization. However, those operational responsibilities include managing indexing and search workloads on shared infrastructure. If your vector deployment involves heavy, sustained streaming ingestion, updates, and deletes, you may observe higher search latencies on your domain. As illustrated in the following chart, as vector writes increase, CPU utilization rises to support HNSW graph building, and concurrent search latency increases because of competition for compute and RAM.

You could resolve the problem by adding data nodes to increase your domain's compute capacity, but enabling GPU acceleration is simpler and cheaper. As the chart illustrates, GPU acceleration frees up CPU and RAM on your domain, helping you sustain low and stable search latency under high write throughput.
Get started
Ready to get started? If you already have an OpenSearch Service vector deployment, use the AWS Management Console, AWS Command Line Interface (AWS CLI), or API to enable GPU acceleration on your OpenSearch 3.1+ domain or vector collection, and test it with your existing indexing workloads. If you're planning to build a new vector database, try our new vector ingestion feature, which simplifies vector ingestion and indexing and automates optimizations. Check out this demonstration on YouTube.
Acknowledgments
The authors would like to thank Manas Singh, Nathan Stephens, Jiahong Liu, Ben Gardner, and Zack Meeks from NVIDIA, and Yigit Kiran and Jay Deng from AWS for their contributions to this post.

