---
id: multi-unit-stacking
title: "Multi-Unit Stacking"
status: established
source_sections: "NVIDIA DGX Spark User Guide: Spark Stacking, Jeff Geerling review, ServeTheHome review"
related_topics: [connectivity, gb10-superchip, ai-workloads, memory-and-storage]
key_equations: []
key_terms: [connectx-7, smartnic, qsfp, stacking, mpi, nccl, slurm, kubernetes]
images: []
examples: []
open_questions:
  - "Performance overhead of inter-unit communication vs. single unit (quantified)"
  - "Can more than 2 units be stacked?"
  - "Actual tokens/sec for 405B models on stacked configuration"
---

# Multi-Unit Stacking

Two Dell Pro Max GB10 units can be connected to form a distributed compute cluster, effectively doubling the available compute and memory for running larger AI models.

## 1. How It Works

Each Dell Pro Max GB10 has **2x QSFP56 200 Gbps ports** powered by the NVIDIA ConnectX-7 SmartNIC. These ports enable direct unit-to-unit connection:

- **Combined memory:** 256 GB (128 GB per unit, NOT unified — distributed across nodes)
- **Combined compute:** 2 PFLOP FP4 (1 PFLOP per unit)
- **Interconnect:** 200GbE RDMA via QSFP56 DAC cable
- **CX-7 ports support Ethernet configuration only** — no InfiniBand (T1 Documented)

## 2. Required Hardware

### Approved QSFP DAC Cables (T1 Documented, NVIDIA DGX Spark User Guide)

| Manufacturer | Part Number   | Description                          |
|--------------|---------------|--------------------------------------|
| Amphenol     | NJAAKK-N911   | QSFP to QSFP112, 32AWG, 400mm, LSZH  |
| Amphenol     | NJAAKK0006    | 0.5m variant                         |
| Luxshare     | LMTQF022-SD-R | QSFP112 400G DAC Cable, 400mm, 30AWG |

These are short DAC (Direct Attach Copper) cables; the units are designed to sit directly on top of each other.
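The combined-memory figure above implies a rough ceiling on model size. A back-of-envelope sketch (my own arithmetic, assuming 0.5 bytes per parameter at FP4 and an illustrative 20% headroom for activations, KV cache, and runtime overhead):

```python
def fp4_capacity_b_params(memory_gb: float, headroom: float = 0.2) -> float:
    """Rough max parameter count (in billions) for FP4 weights.

    FP4 stores one weight in 4 bits = 0.5 bytes. `headroom` reserves a
    fraction of memory for activations, KV cache, and runtime overhead
    (the 0.2 default is an illustrative assumption, not a documented figure).
    """
    usable_bytes = memory_gb * 1e9 * (1 - headroom)
    return usable_bytes / 0.5 / 1e9  # billions of parameters

single = fp4_capacity_b_params(128)  # one unit
dual = fp4_capacity_b_params(256)    # two stacked units

print(f"single unit: ~{single:.0f}B params, dual: ~{dual:.0f}B params")
```

With the 20% headroom assumption this lands near ~205B and ~410B parameters, broadly consistent with the ~200B / ~405B figures in the model capacity table.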
## 3. Software Configuration (T1 Documented, NVIDIA DGX Spark User Guide)

### Prerequisites

- Two DGX Spark / Dell Pro Max GB10 systems
- Both running Ubuntu 24.04 (or later) with NVIDIA drivers installed
- Internet connectivity for initial setup
- Root/sudo access on both systems

### Network Setup

**Option 1 — Automatic (Recommended):** Use NVIDIA's netplan playbook downloaded from their repository, applied via standard `netplan apply` commands.

**Option 2 — Manual static IP:**

- Interface name: `enP2p1s0f1np1`
- Node 1: `192.168.100.10/24`
- Node 2: `192.168.100.11/24`
- Verify with a ping test between nodes

### SSH Configuration

The NVIDIA discovery script automates **passwordless SSH** between nodes, which is required for MPI communication.

### Communication Frameworks

- **MPI** — inter-process CPU communication between nodes
- **NCCL v2.28.3** — GPU-accelerated collective operations across nodes

### Verification

1. Ping connectivity test between nodes
2. Interface verification: `ip a` and `ethtool`
3. NCCL test suite execution (via NVIDIA playbook)

## 4. How It Appears to Software

Stacking does **NOT** present as a single logical device. It creates a **2-node distributed cluster** requiring explicit multi-node code:

- Frameworks must use distributed execution (e.g., PyTorch Distributed, Megatron-LM)
- MPI handles inter-process communication
- NCCL handles GPU-to-GPU tensor transfers across the 200GbE link
- This is fundamentally different from a single larger GPU — there is communication overhead

## 5. Model Capacity

| Configuration | Memory | Max Model Size (approx) |
|---------------|--------|-------------------------|
| Single unit   | 128 GB | ~200B parameters (FP4)  |
| Dual stacked  | 256 GB | ~405B parameters (FP4)  |

This enables running models like **Llama 3.1 405B** (with quantization) that would not fit in a single unit's memory.
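For the manual static-IP option, a netplan fragment along these lines would assign node 1's address (a hedged sketch only; the file name is hypothetical and NVIDIA's playbook is the authoritative source, but the interface name and addresses are those given in the manual setup):

```yaml
# /etc/netplan/99-stacking.yaml  (hypothetical file name)
network:
  version: 2
  ethernets:
    enP2p1s0f1np1:            # CX-7 port named in the manual setup
      addresses:
        - 192.168.100.10/24   # use 192.168.100.11/24 on node 2
```

Followed by `sudo netplan apply` on each node and the ping verification described above.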
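The point that stacking requires explicit multi-node code can be made concrete. PyTorch Distributed's `env://` rendezvous reads `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` from the environment; a stdlib-only sketch of the per-node values for the two-node setup above (the port number is an illustrative convention, not documented for this setup):

```python
def rendezvous_env(node_rank: int,
                   master_addr: str = "192.168.100.10",
                   master_port: int = 29500,
                   world_size: int = 2) -> dict:
    """Environment variables torch.distributed's env:// init method reads.

    node_rank 0 is the master (node 1 above); 29500 is a conventional
    example port, not a figure from the DGX Spark documentation.
    """
    return {
        "MASTER_ADDR": master_addr,     # node 1's static IP on the 200GbE link
        "MASTER_PORT": str(master_port),
        "RANK": str(node_rank),         # 0 on node 1, 1 on node 2
        "WORLD_SIZE": str(world_size),  # two stacked units
    }

# On node 2 one would export these variables, then call
# torch.distributed.init_process_group(backend="nccl", init_method="env://")
# so NCCL carries tensor traffic over the 200GbE RDMA link.
print(rendezvous_env(1))
```

Launchers such as `torchrun` set these variables automatically; the sketch just shows what the rendezvous needs to know about the two nodes.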
## 6. Scaling Beyond 2 Units

The documentation mentions potential for:

- **Job orchestration with Slurm or Kubernetes**
- **Containerized execution with Singularity or Docker**

Whether more than 2 units can be practically clustered is not explicitly documented, but the 200GbE RDMA networking and Slurm/K8s support suggest it is architecturally possible.

## 7. Physical Configuration

The compact form factor (150x150x51mm per unit) is designed to be **stackable** — two units sit on top of each other on a desk, connected via short (400-500mm) QSFP DAC cables.

## Key Relationships

- Connected via: [[connectivity]] (QSFP56/ConnectX-7 ports)
- Extends capacity of: [[ai-workloads]]
- Doubles resources from: [[gb10-superchip]], [[memory-and-storage]]
- Software stack: [[dgx-os-software]]
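If clusters larger than two units turn out to be practical, the rank bookkeeping used by MPI/NCCL launchers generalizes directly. A small illustrative sketch of the standard global-rank mapping (nothing here is from the DGX Spark documentation; the 4-unit case is hypothetical):

```python
def global_rank(node_rank: int, local_rank: int, gpus_per_node: int = 1) -> int:
    """Standard mapping from (node, local GPU) to a global rank.

    Each GB10 unit exposes a single GPU, so gpus_per_node defaults to 1
    and global rank equals node rank; the parameter is kept for generality.
    """
    return node_rank * gpus_per_node + local_rank

# A hypothetical 4-unit cluster: world size 4, one rank per unit.
ranks = [global_rank(n, 0) for n in range(4)]
print(ranks)  # [0, 1, 2, 3]
```

Slurm and `torchrun` compute exactly this mapping from their node/process counts, which is why the same distributed code would run unchanged on a larger cluster.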