What Google’s TurboQuant can and can’t do for AI’s spiraling cost




ZDNET’s key takeaways

  • Google’s TurboQuant can dramatically reduce AI memory usage.
  • TurboQuant is a response to the spiraling cost of AI.
  • A positive outcome is making AI more accessible by lowering inference costs.

With the cost of artificial intelligence skyrocketing thanks to soaring prices for computer components such as memory, Google last week responded with a proposed technical innovation called TurboQuant.

TurboQuant, which Google researchers described in a blog post, is reminiscent of the DeepSeek AI moment: a serious attempt to cut the cost of AI. By shrinking AI’s memory usage and making models much more efficient, it could have a lasting benefit.

Also: What is DeepSeek AI? Is it safe? Here’s everything you need to know

Even so, just as DeepSeek did not stop massive investment in AI chips, observers say TurboQuant will likely lead to continued growth in AI investment. It’s the Jevons paradox: Make something more efficient, and it ends up increasing overall usage of that resource. 

However, TurboQuant is an approach that may help run AI locally by slimming the hardware demands of a large language model. 

More memory, more money 

The big cost factor for AI at the moment, and probably for the foreseeable future, is the ever-greater use of memory and storage technologies. AI is data-hungry, introducing a reliance on memory and storage unprecedented in the history of computing. 

TurboQuant, first described by Google researchers in a paper a year ago, employs “quantization” to reduce the number of bits and bytes required to represent the data. 

Also: Why you’ll pay more for AI in 2026, and 3 money-saving tips to try

Quantization is a form of data compression that uses fewer bits to represent the same value. In the case of TurboQuant, the focus is on what’s called the “key-value cache,” or, for shorthand, “KV cache,” one of the biggest memory hogs of AI. 
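In code, the idea looks something like this. The 3-bit uniform quantizer below is a generic sketch of quantization, not TurboQuant’s actual scheme:

```python
import numpy as np

def quantize_3bit(x):
    """Map each value to one of 2**3 = 8 evenly spaced levels."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 7                                # 8 levels -> 7 steps
    codes = np.round((x - lo) / scale).astype(np.uint8)  # 3-bit integer codes
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Recover an approximation of the original values."""
    return lo + codes * scale

x = np.array([0.12, -0.53, 0.87, 0.05, -0.91, 0.33], dtype=np.float16)
codes, lo, scale = quantize_3bit(x.astype(np.float32))
x_hat = dequantize(codes, lo, scale)

# 16 bits per value shrinks to 3 bits, roughly a 5.3x reduction,
# at the price of a small reconstruction error.
print(codes)
print(np.abs(x.astype(np.float32) - x_hat).max())
```

The compressed tensor stores only the 3-bit codes plus two floats (the offset and step size), and each recovered value is off by at most half a quantization step.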

When you type into a chatbot such as Google’s Gemini, the AI has to compare what you’ve typed to a repository of numeric representations that serves as a kind of database.

The thing that you type is called the query, and it is matched against data held in memory, called a key, to find a numeric match. Basically, it’s a similarity score. The key is then used to retrieve from memory exactly which words should be returned to you as the AI’s response, known as the value. 

Normally, every time you type, the AI model must calculate a new key and value, which can slow the whole operation. To speed things up, the machine retains a key-value cache in memory to store recently used keys and values. 
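The mechanics can be sketched in a few lines of Python. This toy example, with made-up projection matrices and a tiny embedding size, shows how each new token’s key and value are computed once and appended to a growing cache:

```python
import numpy as np

d = 4                      # tiny embedding size for the demo
Wk = np.eye(d)             # stand-in key projection matrix
Wv = np.eye(d) * 2.0       # stand-in value projection matrix

k_cache, v_cache = [], []  # the "KV cache": grows by one entry per token

def attend(query, token_embedding):
    # Compute this token's key and value once, then keep them in the cache
    # so earlier tokens never need to be reprocessed.
    k_cache.append(token_embedding @ Wk)
    v_cache.append(token_embedding @ Wv)
    K = np.stack(k_cache)              # (seq_len, d)
    V = np.stack(v_cache)
    scores = K @ query                 # similarity of the query to every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax over similarity scores
    return weights @ V                 # weighted mix of cached values

rng = np.random.default_rng(0)
for _ in range(5):
    out = attend(rng.standard_normal(d), rng.standard_normal(d))

print(len(k_cache))  # one cached key (and value) per token seen so far
```

Note that the cache trades memory for speed: every token ever processed leaves behind a key and a value that must stay resident.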

The cache then becomes its own problem: The more you work with a model, the more memory the key-value cache takes up. “This scaling is a significant bottleneck in terms of memory usage and computational speed, especially for long context models,” according to Google lead author Amir Zandieh and colleagues.

Also: AI isn’t getting smarter, it’s getting more power hungry – and expensive

Making things worse, AI models are increasingly being built with longer context windows, the amount of input a model can consider at once, which means more keys and values to keep track of. That gives the model more search options, potentially improving accuracy. Gemini 3, the current version, made a big leap in context window to one million tokens. Prior state-of-the-art models such as OpenAI’s GPT-4 had a context window of just 32,768 tokens. A larger context window also increases the amount of memory a key-value cache consumes. 
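A back-of-envelope calculation shows why long context windows strain memory. The model dimensions below are assumptions for illustration only, not the actual specifications of Gemini or GPT-4:

```python
# Rough KV-cache memory for a long context window. All model dimensions
# here are illustrative assumptions, not any real model's specs.

layers = 32          # transformer layers (assumed)
kv_heads = 8         # key/value attention heads (assumed)
head_dim = 128       # dimension per head (assumed)
bytes_fp16 = 2       # 16-bit precision
bytes_3bit = 3 / 8   # 3-bit storage, as in the quantized cache

def kv_cache_bytes(context_tokens, bytes_per_value):
    # 2x for keys AND values, per layer, per head, per head dimension.
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value

ctx = 1_000_000  # a one-million-token context window
fp16_gb = kv_cache_bytes(ctx, bytes_fp16) / 1e9
q3_gb = kv_cache_bytes(ctx, bytes_3bit) / 1e9
print(f"fp16 KV cache: {fp16_gb:.0f} GB, 3-bit: {q3_gb:.0f} GB")
```

Under these assumed dimensions, a million-token cache at 16-bit precision runs to well over a hundred gigabytes, while 3-bit storage cuts it by a factor of 16/3, just over five-fold.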

Speeding up quantization for real-time

The solution to that expanding KV cache is to quantize the keys and the values so the whole thing takes up less space. Zandieh and team claim in their blog post that the data compression is “massive” with TurboQuant. “Reducing the KV cache size without compromising accuracy is essential,” they write.

Quantization has been used by Google and others for years to slim down neural networks. What’s novel about TurboQuant is that it’s meant to quantize in real time. Previous compression approaches reduced the size of a neural network at compile time, before it is run in production. 

Also: Nvidia wants to own your AI data center from end to end

That’s not good enough, observed Zandieh. The KV cache is a living digest of what’s learned at “inference time,” when people are typing to an AI bot, and the keys and values are changing. So, quantization has to happen fast enough and accurately enough to keep the cache small while also staying up to date. The “turbo” in TurboQuant implies this is a lot faster than traditional compile-time quantization. 

Two-stage approach

TurboQuant has two stages. First, the queries and keys are compressed. This can be done geometrically because queries and keys are vectors of data, which can be pictured as arrows on an X-Y graph and rotated around the origin. The researchers call these rotations “PolarQuant.” By applying random rotations with PolarQuant and then recovering the original vector, they find a representation that uses fewer bits while still preserving accuracy.

As they put it, “PolarQuant acts as a high-efficiency compression bridge, converting Cartesian inputs into a compact Polar ‘shorthand’ for storage and processing.”
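As an illustration of the polar idea (not TurboQuant’s actual algorithm), a two-dimensional point can be stored as a full-precision magnitude plus a coarsely quantized angle, and still be recovered closely:

```python
import math

# Illustrative Cartesian-to-polar "shorthand": store (x, y) as a magnitude
# plus an 8-bit angle code. The bit width and rounding scheme are
# illustrative assumptions, not TurboQuant's actual quantizer.

def to_polar_codes(x, y, angle_bits=8):
    r = math.hypot(x, y)                 # magnitude, kept at full precision
    theta = math.atan2(y, x)             # angle in [-pi, pi]
    levels = 2 ** angle_bits
    code = round((theta + math.pi) / (2 * math.pi) * (levels - 1))
    return r, code

def from_polar_codes(r, code, angle_bits=8):
    levels = 2 ** angle_bits
    theta = code / (levels - 1) * 2 * math.pi - math.pi
    return r * math.cos(theta), r * math.sin(theta)

x, y = 0.6, -0.8
r, code = to_polar_codes(x, y)
x_hat, y_hat = from_polar_codes(r, code)
print(abs(x - x_hat), abs(y - y_hat))  # small reconstruction error
```

Even with only 256 possible angles, the recovered point lands within a fraction of a percent of the original, which is the sense in which a polar encoding can act as a compact shorthand.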

[Image: Google’s PolarQuant illustration. Source: Google]

The compressed vectors still produce errors when the comparison between the query and the key, known as the “inner product” of the two vectors, is performed. To fix that, they use a second method, QJL, introduced by Zandieh in 2024. That approach keeps one of the two vectors in its original state, so a compressed (quantized) vector is multiplied by a full-precision one, which keeps the inner product more accurate than if both sides were compressed.
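The asymmetric trick can be sketched as follows. This toy example uses a simple 4-bit uniform quantizer rather than the actual QJL transform: it multiplies a quantized key by a full-precision query and checks the error against a worst-case bound:

```python
import numpy as np

# Asymmetric inner product: quantize only the stored key, keep the incoming
# query at full precision. The 4-bit uniform quantizer is an illustrative
# stand-in, not the QJL transform itself.

def quantize(v, bits=4):
    """Round each element to the nearest of 2**bits evenly spaced levels."""
    lo, hi = v.min(), v.max()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((v - lo) / step) * step

rng = np.random.default_rng(1)
d = 64
query = rng.standard_normal(d)   # full precision, never quantized
key = rng.standard_normal(d)
key_q = quantize(key)            # the compressed key held in the cache

exact = query @ key              # the true similarity score
approx = query @ key_q           # quantized key x full-precision query

err = abs(exact - approx)
bound = np.abs(query) @ np.abs(key - key_q)  # triangle-inequality bound
print(err, bound)
```

Because one side of the product carries no quantization error at all, the scoring error is bounded by the key’s rounding error alone, rather than compounding errors from both vectors.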

They tested TurboQuant by applying it to Meta Platforms’ open-source Llama 3.1-8B AI model, and found that “TurboQuant achieves perfect downstream results across all benchmarks while reducing the key value memory size by a factor of at least 6x”, a six-fold reduction in the amount of KV cache needed.

The approach also differs from other methods for compressing the KV cache, such as the approach taken last year by DeepSeek, which constrained key and value searches to speed up inference.

Also: DeepSeek claims its new AI model can cut the cost of predictions by 75% – here’s how

In another test, using Google’s Gemma open-source model and models from French AI startup Mistral, “TurboQuant proved it can quantize the key-value cache to just 3 bits without requiring training or fine-tuning and causing any compromise in model accuracy,” they wrote, “all while achieving a faster runtime than the original LLMs (Gemma and Mistral).” 

“It is exceptionally efficient to implement and incurs negligible runtime overhead,” they observed.

[Image: TurboQuant performance results. Source: Google]

Will AI be any cheaper?

Zandieh and team expect TurboQuant to have a significant impact on the production use of AI inference. “As AI becomes more integrated into all products, from LLMs to semantic search, this work in fundamental vector quantization will be more critical than ever,” they wrote. 

Also: Want to try OpenClaw? NanoClaw is a simpler, potentially safer AI agent

But will it really reduce the cost of AI? Yes and no. 

In an age of agentic AI, in which programs such as OpenClaw operate autonomously, there are many parts to AI besides the KV cache. Other uses of memory, such as retrieving and storing database records, will ultimately affect an agent’s efficiency over the long term. 

Observers of the AI chip market argued last week that, just as DeepSeek AI’s efficiency gains didn’t slow AI investment last year, neither will TurboQuant.

Vivek Arya, a Merrill Lynch analyst who covers AI chips, wrote to his clients, who were worried about DRAM maker Micron Technology, that TurboQuant will simply make more efficient use of AI memory. The “6x improvement in memory efficiency [will] likely [lead] to 6x increase in accuracy (model size) and/or context length (KV cache allocation), rather than 6x decrease in memory,” wrote Arya.

Also: AI agents of chaos? New research shows how bots talking to bots can go sideways fast

What TurboQuant can do, though, is make some individual instances of AI more economical, especially for local deployment. 

For example, a swelling KV cache and longer context windows may prove less of a burden when running some AI models on limited hardware budgets. That will be a relief for users of OpenClaw who want their MacBook Neo or Mac mini to serve as a budget local AI server. 






