Large Language Models (LLMs) require significant memory for inference, often more than a single GPU provides. The unified memory and high-bandwidth NVLink-C2C interconnect of NVIDIA's Grace Blackwell and Grace Hopper architectures let the CPU and GPU share one memory space efficiently, making it practical to work with models and datasets that exceed GPU memory limits and improving the efficiency of LLM fine-tuning, KV cache offload, inference, and scientific computing.
Large Language Models (LLMs) are at the forefront of AI innovation, but their sheer size often complicates inference. Models such as Llama 3 70B and Llama 4 Scout 109B need more memory than traditional GPUs offer, leading to out-of-memory (OOM) errors. NVIDIA's Grace Blackwell and Grace Hopper architectures address this with unified memory and high-bandwidth CPU-GPU interconnects.
The NVIDIA Grace Blackwell and Grace Hopper architectures use NVLink-C2C, a 900 GB/s memory-coherent interconnect, to create a single unified memory address space shared by the CPU and GPU. Both processors can access and operate on the same data without explicit data transfers, which removes a common source of OOM failures when a model outgrows GPU memory [1].
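In practice, software taps this unified address space by allocating GPU memory as CUDA managed memory, so a single allocation can exceed the GPU's physical HBM. The following is a minimal PyTorch sketch along the lines of the approach described in [1]; it assumes a Grace Hopper or Grace Blackwell node with a recent PyTorch and the RAPIDS Memory Manager (RMM) installed, and the tensor size is purely illustrative.

```python
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator

# Switch PyTorch's CUDA allocator to CUDA managed (unified) memory before any
# GPU allocation happens. Managed pages live in one address space that both
# the Grace CPU and the GPU can touch; they migrate over NVLink-C2C on demand
# instead of requiring explicit host-to-device copies.
rmm.reinitialize(managed_memory=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

# An FP16 buffer of ~200 GiB: far larger than the GPU's HBM, but valid here
# because pages can spill into the CPU-attached LPDDR memory.
n_elements = 100 * 1024**3  # 100 Gi elements x 2 bytes = 200 GiB
big = torch.empty(n_elements, dtype=torch.float16, device="cuda")
big.fill_(1.0)
print(big.numel() * big.element_size() / 1e9, "GB allocated")
```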
For instance, loading Llama 3 70B and Llama 4 Scout 109B in half precision (FP16) requires approximately 140 GB and 218 GB of memory, respectively. During inference, these models also need additional data structures such as the key-value (KV) cache, which grows with context length and batch size: a KV cache covering a 128K-token context window for a single user (batch size 1) consumes about 40 GB with Llama 3 70B, and this scales linearly with the number of concurrent users. In a production deployment, attempting to load such a model entirely into GPU memory would trigger an OOM error. The unified memory in Grace Hopper and Grace Blackwell instead expands the total addressable memory, making it feasible to work with models and datasets that would otherwise be too large for the GPU alone [1].
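As a sanity check on these numbers, the arithmetic is straightforward: FP16 stores 2 bytes per value, and the KV cache holds one key and one value vector per layer, per KV head, per token. The sketch below assumes Llama 3 70B's published configuration (80 layers, grouped-query attention with 8 KV heads of head dimension 128); treat the exact constants as illustrative.

```python
BYTES_FP16 = 2  # bytes per value in half precision

def weight_memory_gb(num_params: float) -> float:
    """Memory needed just to hold the model weights in FP16."""
    return num_params * BYTES_FP16 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch_size: int = 1) -> float:
    """KV cache size: 2 entries (K and V) per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * context_len * batch_size * BYTES_FP16 / 1e9

print(f"Llama 3 70B weights, FP16:        ~{weight_memory_gb(70e9):.0f} GB")   # ~140 GB
print(f"Llama 4 Scout 109B weights, FP16: ~{weight_memory_gb(109e9):.0f} GB")  # ~218 GB
print(f"128K-token KV cache, batch 1:     ~{kv_cache_gb(80, 8, 128, 128 * 1024):.0f} GB")  # ~43 GB
```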
The high-bandwidth NVLink-C2C link and the unified memory architecture in Grace Hopper and Grace Blackwell improve the efficiency of LLM fine-tuning, KV cache offload, inference, scientific computing, and more. For example, when a model is loaded on an NVIDIA GH200 Grace Hopper Superchip, it can use the 96 GB of high-bandwidth GPU memory and also address the 480 GB of LPDDR memory attached to the CPU without explicit data transfers, so weights and KV cache that exceed GPU memory remain directly accessible [1].
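Putting the pieces together, an end-to-end sketch might load a checkpoint whose FP16 weights (~140 GB) exceed the GH200's 96 GB of HBM and rely on unified memory to spill into the CPU-attached LPDDR. This example assumes the Hugging Face transformers library and the meta-llama/Meta-Llama-3-70B-Instruct checkpoint, both illustrative choices rather than part of the original post.

```python
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator
from transformers import AutoModelForCausalLM, AutoTokenizer

# Enable managed (unified) memory before the first CUDA allocation.
rmm.reinitialize(managed_memory=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # illustrative large checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
# ~140 GB of FP16 weights do not fit in 96 GB of HBM, but with unified memory
# the surplus resides in the 480 GB of LPDDR attached to the Grace CPU and
# migrates to the GPU over NVLink-C2C as layers are touched.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

prompt = "Unified CPU-GPU memory matters for large models because"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```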
The unified memory architecture and high-bandwidth interconnects in NVIDIA's Grace Blackwell and Grace Hopper architectures enable efficient memory sharing between CPU and GPU, making it possible to work with large models and datasets that exceed traditional GPU memory limits. This improves the efficiency of LLM fine-tuning, KV cache offload, inference, and scientific computing, and positions NVIDIA to maintain its leadership in the AI chip market [2].
References:
[1] https://developer.nvidia.com/blog/accelerate-large-scale-llm-inference-and-kv-cache-offload-with-cpu-gpu-memory-sharing/
[2] https://www.ainvest.com/news/nvidia-market-dominance-jeopardy-assessing-threat-broadcom-openai-chip-alliance-2509/