"Accelerating Large-Scale LLM Inference with CPU-GPU Memory Sharing on NVIDIA Grace Hopper and Blackwell Architectures"
By Ainvest
Friday, September 5, 2025, 1:33 pm ET
Large Language Models (LLMs) require significant memory for inference, but the unified memory and high-bandwidth interconnect of NVIDIA's Grace Hopper and Grace Blackwell architectures enable efficient memory sharing between the CPU and GPU, making it possible to work with models and datasets that exceed traditional GPU memory limits. This setup improves the efficiency of LLM fine-tuning, KV cache offload, inference, and scientific computing.
Large Language Models (LLMs) are at the forefront of AI innovation, but their massive size often complicates efficient inference. Models like Llama 3 70B and Llama 4 Scout 109B require substantial memory, which can exceed traditional GPU limits and lead to out-of-memory (OOM) errors. NVIDIA's Grace Hopper and Grace Blackwell architectures address this through unified memory and high-bandwidth interconnects.
The Grace Hopper and Grace Blackwell architectures use NVLink-C2C, a 900 GB/s memory-coherent interconnect, to create a unified memory address space shared by the CPU and GPU. This lets the CPU and GPU access and operate on the same data without explicit data transfers, avoiding the OOM errors described above [2].
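To make the programming model concrete, below is a minimal CUDA sketch (not from the cited article) in which the CPU and GPU operate on a single managed allocation with no cudaMemcpy calls. On Grace Hopper and Grace Blackwell, NVLink-C2C makes these accesses hardware-coherent; on other platforms the same API works through on-demand page migration.

```cuda
// Minimal sketch: CPU and GPU touch the same allocation without explicit copies.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;

    // One allocation visible to both CPU and GPU via a shared address space.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // CPU writes
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // GPU reads/writes the same pointer
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);               // CPU reads the GPU's result
    cudaFree(data);
    return 0;
}
```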
For instance, loading the Llama 3 70B and Llama 4 Scout 109B models in half precision (FP16) requires approximately 140 GB and 218 GB of memory, respectively. During inference, these models also need additional data structures such as the key-value (KV) cache, which grows with context length and batch size. A KV cache representing a 128k-token context window for a single user (batch size 1) consumes about 40 GB of memory with Llama 3 70B, and this scales linearly with the number of users. In a production deployment, attempting to load such large models entirely into GPU memory could result in an OOM error. The unified memory architecture of NVIDIA Grace Hopper and Grace Blackwell, however, expands the total available memory, making it feasible to work with models and datasets that would otherwise be too large for the GPU alone [2].
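These figures can be reproduced with simple arithmetic. The host-only sketch below (not from the cited article) assumes the commonly published Llama 3 70B configuration of 80 layers, 8 KV heads (grouped-query attention), and a head dimension of 128, all in FP16; with those values the weights come to roughly 140 GB and the 128k-token KV cache for one user to about 40 GiB.

```cuda
// Back-of-envelope sizing for the figures quoted above (plain host code).
#include <cstdio>

int main() {
    const double GiB = 1024.0 * 1024.0 * 1024.0;
    const double bytes_fp16 = 2.0;

    // Weights: parameter count x 2 bytes (FP16)
    printf("Llama 3 70B weights : %.0f GB\n", 70e9  * bytes_fp16 / 1e9);
    printf("Llama 4 Scout 109B  : %.0f GB\n", 109e9 * bytes_fp16 / 1e9);

    // KV cache per token: 2 (K and V) x layers x kv_heads x head_dim x bytes
    const double layers = 80, kv_heads = 8, head_dim = 128;   // assumed Llama 3 70B config
    const double per_token = 2 * layers * kv_heads * head_dim * bytes_fp16;
    const double context   = 128 * 1024;                      // 128k-token window, batch 1
    printf("KV cache per token  : %.0f KB\n", per_token / 1024);
    printf("KV cache @ 128k ctx : %.1f GiB\n", per_token * context / GiB);
    return 0;
}
```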
The high-bandwidth NVLink-C2C connection and unified memory architecture found in Grace Hopper and Grace Blackwell improve the efficiency of LLM fine-tuning, KV cache offload, inference, scientific computing, and more. For example, when a model is loaded onto a platform like the NVIDIA GH200 Grace Hopper Superchip, the GPU can use its 96 GB of high-bandwidth memory while also accessing the 480 GB of LPDDR memory attached to the CPU, with no explicit data transfers. This expands the total available memory, making it feasible to work with models and datasets that would otherwise be too large for the GPU alone [2].
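One way to express this placement in code, sketched below under the assumption of a GH200-class unified-memory system, is to allocate a region larger than the GPU's HBM and advise the CUDA driver to keep it resident in CPU (LPDDR) memory. Production inference stacks manage KV-cache offload at a much finer grain, so this illustrates only the mechanism, not the article's implementation.

```cuda
// Sketch: oversubscribing GPU memory on a unified-memory platform such as GH200.
// The allocation can exceed the 96 GB of HBM; pages advised to prefer the CPU
// live in the Grace LPDDR pool and are fetched over NVLink-C2C when accessed.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // e.g. a KV-cache region larger than a single GPU's HBM (illustrative size)
    const size_t kv_bytes = 200ull * 1024 * 1024 * 1024;   // 200 GiB
    void* kv_cache = nullptr;

    if (cudaMallocManaged(&kv_cache, kv_bytes) != cudaSuccess) {
        printf("allocation failed\n");
        return 1;
    }

    // Hint the driver to keep these pages in CPU (LPDDR) memory by default;
    // the GPU can still read them directly through the coherent interconnect.
    cudaMemAdvise(kv_cache, kv_bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemAdvise(kv_cache, kv_bytes, cudaMemAdviseSetAccessedBy, 0 /* GPU 0 */);

    cudaFree(kv_cache);
    return 0;
}
```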
In short, the unified memory architecture and high-bandwidth interconnects in NVIDIA's Grace Hopper and Grace Blackwell architectures enable efficient memory sharing between the CPU and GPU, making it possible to work with large models and datasets that exceed traditional GPU memory limits. This capability improves LLM fine-tuning, KV cache offload, inference, and scientific computing efficiency, and positions NVIDIA to maintain its leadership in the AI chip market.
References:
[1] https://developer.nvidia.com/blog/accelerate-large-scale-llm-inference-and-kv-cache-offload-with-cpu-gpu-memory-sharing/
[2] https://www.ainvest.com/news/nvidia-market-dominance-jeopardy-assessing-threat-broadcom-openai-chip-alliance-2509/

