
---
id: gb10-superchip
title: NVIDIA GB10 Grace Blackwell Superchip
status: established
source_sections: "Web research: NVIDIA newsroom, WCCFTech, Phoronix, The Register, Arm"
related_topics: [memory-and-storage, ai-frameworks, ai-workloads, connectivity, physical-specs]
key_equations: [flops-fp4, nvlink-c2c-bandwidth]
key_terms: [gb10, grace-blackwell, superchip, cortex-x925, cortex-a725, blackwell-gpu, tensor-core, cuda-core, nvlink-c2c, soc]
images: []
examples: []
open_questions:
  - Exact clock speeds for CPU and GPU dies under sustained load
  - Detailed per-precision TFLOPS breakdown (FP4/FP8/FP16/FP32/FP64)
  - Thermal throttling behavior and sustained vs. peak performance
---

NVIDIA GB10 Grace Blackwell Superchip

The GB10 is a system-on-a-chip (SoC) that combines an NVIDIA Grace CPU and an NVIDIA Blackwell GPU on a single package, connected via NVLink Chip-to-Chip (NVLink-C2C) interconnect. It is the core silicon in the Dell Pro Max GB10 and the NVIDIA DGX Spark.

1. Architecture Overview

The GB10 is composed of two distinct compute dies:

  • CPU tile: Designed by MediaTek, based on the ARMv9.2 architecture
  • GPU tile: Designed by NVIDIA, based on the Blackwell architecture

These are stitched together using TSMC's 2.5D advanced packaging technology and connected via NVIDIA's proprietary NVLink-C2C interconnect, which provides 600 GB/s of bidirectional bandwidth between the CPU and GPU dies.
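For a rough feel of what 600 GB/s means in practice, the transfer time for a bulk payload follows directly. This is a back-of-envelope sketch; the 40 GB example payload and the `transfer_time_s` helper are illustrative assumptions, not figures from NVIDIA:

```python
# Back-of-envelope: how long a bulk transfer takes at NVLink-C2C's quoted
# 600 GB/s bidirectional bandwidth. The 40 GB payload is an arbitrary
# example (roughly the FP4 weights of a large model), not a spec figure.

NVLINK_C2C_GB_S = 600  # bidirectional bandwidth quoted in this note

def transfer_time_s(payload_gb: float, bandwidth_gb_s: float = NVLINK_C2C_GB_S) -> float:
    """Seconds to move `payload_gb` gigabytes at `bandwidth_gb_s` GB/s."""
    return payload_gb / bandwidth_gb_s

print(f"40 GB over NVLink-C2C: {transfer_time_s(40):.3f} s")
```

Sustained achievable bandwidth will be lower than the peak figure, so treat the result as a lower bound on real transfer time.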

2. CPU: Grace (ARM)

The Grace CPU portion contains 20 cores in a big.LITTLE-style configuration:

  • 10x ARM Cortex-X925 — high-performance cores
  • 10x ARM Cortex-A725 — efficiency cores

Architecture: ARMv9.2

This is the same Grace CPU lineage used in NVIDIA's data center Grace Hopper and Grace Blackwell products, adapted for desktop power envelopes.

3. GPU: Blackwell

The Blackwell GPU portion features:

  • 6,144 CUDA cores (comparable to the RTX 5070 core count)
  • 5th-generation Tensor Cores — optimized for AI inference and training
  • Peak performance: 1 PFLOPS (1,000 TFLOPS) at FP4 precision

The Tensor Cores are the key differentiator for AI workloads, providing hardware acceleration for mixed-precision matrix operations used in deep learning.
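To put the FP4 figure in perspective, a common rule of thumb (an assumption here, not from this note) is that transformer inference costs roughly 2 FLOPs per parameter per token; a compute-bound token rate then falls out directly. The 30% utilization figure is likewise a hypothetical placeholder:

```python
# Rough compute-bound inference estimate from the quoted 1 PFLOP
# (1e15 FLOP/s) FP4 peak. The 2-FLOPs-per-parameter-per-token rule and
# the 30% utilization figure are assumptions, not GB10 measurements.

PEAK_FP4_FLOPS = 1e15

def tokens_per_second(params: float, utilization: float = 0.3) -> float:
    """Upper-bound token rate for a `params`-parameter model at FP4."""
    flops_per_token = 2 * params  # rule-of-thumb cost of one forward pass
    return PEAK_FP4_FLOPS * utilization / flops_per_token

print(f"70B-parameter model: ~{tokens_per_second(70e9):,.0f} tokens/s")
```

In practice, single-stream token generation is usually memory-bandwidth-bound rather than compute-bound, so this is an upper bound, not a prediction.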

4. Interconnect: NVLink-C2C

The CPU and GPU communicate via NVLink Chip-to-Chip (NVLink-C2C):

  • Bidirectional bandwidth: 600 GB/s
  • Enables unified coherent memory — both CPU and GPU see the same 128GB LPDDR5X pool
  • Eliminates the PCIe bottleneck found in traditional discrete GPU systems

This coherent memory architecture means there is no need to explicitly copy data between "host" and "device" memory, simplifying AI development workflows.
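Because both dies see the same pool, a quick capacity check shows what model weights alone could occupy. This is a sketch: the bytes-per-parameter table is a standard assumption, and KV cache, activations, and OS overhead are ignored:

```python
# Weights-only capacity of the 128 GB coherent pool at common precisions.
# Ignores KV cache, activations, and OS overhead (assumptions, see above).

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def max_params_billions(pool_gb: float = 128, precision: str = "fp4") -> float:
    """Largest parameter count (in billions) whose weights fit in `pool_gb`."""
    return pool_gb / BYTES_PER_PARAM[precision]

for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{max_params_billions(precision=precision):.0f}B parameters")
```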

5. Power Envelope

  • System TDP: ~140 W (from related specifications)
  • External PSU: 280 W USB Type-C adapter (headroom for storage, networking, peripherals)
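Dividing the headline figures gives an implied peak-efficiency number (peak FP4 throughput over system TDP; a sketch, not a measured sustained value):

```python
# Implied peak AI efficiency from this note's figures: 1,000 TFLOPS FP4
# against the ~140 W system TDP. A peak/TDP ratio, not a measurement.

peak_fp4_tflops = 1000.0
system_tdp_w = 140.0

tflops_per_watt = peak_fp4_tflops / system_tdp_w
print(f"~{tflops_per_watt:.1f} FP4 TFLOPS per watt (peak)")
```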

Key Relationships