---
id: multi-unit-stacking
title: "Multi-Unit Stacking"
status: established
source_sections: "NVIDIA DGX Spark User Guide: Spark Stacking, Jeff Geerling review, ServeTheHome review"
related_topics: [connectivity, gb10-superchip, ai-workloads, memory-and-storage]
key_equations: []
key_terms: [connectx-7, smartnic, qsfp, stacking, mpi, nccl, slurm, kubernetes]
images: []
examples: []
open_questions:
  - "Performance overhead of inter-unit communication vs. single unit (quantified)"
  - "Can more than 2 units be stacked?"
  - "Actual tokens/sec for 405B models on stacked configuration"
---

# Multi-Unit Stacking

Two Dell Pro Max GB10 units can be connected to form a distributed compute cluster, effectively doubling the available compute and memory for running larger AI models.

## 1. How It Works

Each Dell Pro Max GB10 has **2x QSFP56 200 Gbps ports** powered by the NVIDIA ConnectX-7 SmartNIC. These ports enable direct unit-to-unit connection:

- **Combined memory:** 256 GB (128 GB per unit, NOT unified — distributed across nodes)
- **Combined compute:** 2 PFLOP FP4 (1 PFLOP per unit)
- **Interconnect:** 200GbE RDMA via QSFP56 DAC cable
- **CX-7 ports support Ethernet configuration only** — no InfiniBand (T1 Documented)

## 2. Required Hardware

### Approved QSFP DAC Cables (T1 Documented, NVIDIA DGX Spark User Guide)

| Manufacturer | Part Number   | Description                          |
|--------------|---------------|--------------------------------------|
| Amphenol     | NJAAKK-N911   | QSFP to QSFP112, 32AWG, 400mm, LSZH  |
| Amphenol     | NJAAKK0006    | 0.5m variant                         |
| Luxshare     | LMTQF022-SD-R | QSFP112 400G DAC Cable, 400mm, 30AWG |

These are short DAC (Direct Attach Copper) cables; the units are designed to sit directly on top of each other.
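The combined-memory figure above implies a rough ceiling on model size. A back-of-envelope sketch (my own arithmetic, assuming 0.5 bytes per parameter at FP4 and an illustrative 20% headroom for activations, KV cache, and runtime overhead):

```python
def fp4_capacity_b_params(memory_gb: float, headroom: float = 0.2) -> float:
    """Rough max parameter count (in billions) for FP4 weights.

    FP4 stores one weight in 4 bits = 0.5 bytes. `headroom` reserves a
    fraction of memory for activations, KV cache, and runtime overhead
    (the 0.2 default is an illustrative assumption, not a documented figure).
    """
    usable_bytes = memory_gb * 1e9 * (1 - headroom)
    return usable_bytes / 0.5 / 1e9  # billions of parameters

single = fp4_capacity_b_params(128)  # one unit
dual = fp4_capacity_b_params(256)    # two stacked units

print(f"single unit: ~{single:.0f}B params, dual: ~{dual:.0f}B params")
```

With the 20% headroom assumption this lands near ~205B and ~410B parameters, broadly consistent with the ~200B / ~405B figures in the model capacity table.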
## 3. Software Configuration (T1 Documented, NVIDIA DGX Spark User Guide)

### Prerequisites

- Two DGX Spark / Dell Pro Max GB10 systems
- Both running Ubuntu 24.04 (or later) with NVIDIA drivers installed
- Internet connectivity for initial setup
- Root/sudo access on both systems

### Network Setup

**Option 1 — Automatic (Recommended):** Use NVIDIA's netplan playbook downloaded from their repository, applied via standard `netplan apply` commands.

**Option 2 — Manual static IP:**

- Interface name: `enP2p1s0f1np1`
- Node 1: `192.168.100.10/24`
- Node 2: `192.168.100.11/24`
- Verify with a ping test between nodes

### SSH Configuration

The NVIDIA discovery script automates **passwordless SSH** between nodes, which is required for MPI communication.

### Communication Frameworks

- **MPI** — inter-process CPU communication between nodes
- **NCCL v2.28.3** — GPU-accelerated collective operations across nodes

### Verification

1. Ping connectivity test between nodes
2. Interface verification: `ip a` and `ethtool`
3. NCCL test suite execution (via NVIDIA playbook)

## 4. How It Appears to Software

Stacking does **NOT** present as a single logical device. It creates a **2-node distributed cluster** requiring explicit multi-node code:

- Frameworks must use distributed execution (e.g., PyTorch Distributed, Megatron-LM)
- MPI handles inter-process communication
- NCCL handles GPU-to-GPU tensor transfers across the 200GbE link
- This is fundamentally different from a single larger GPU — there is communication overhead

## 5. Model Capacity

| Configuration | Memory | Max Model Size (approx) |
|---------------|--------|-------------------------|
| Single unit   | 128 GB | ~200B parameters (FP4)  |
| Dual stacked  | 256 GB | ~405B parameters (FP4)  |

This enables running models like **Llama 3.1 405B** (with quantization) that would not fit in a single unit's memory.
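For the manual static-IP option, a netplan fragment along these lines would assign node 1's address (a hedged sketch only; the file name is hypothetical and NVIDIA's playbook is the authoritative source, but the interface name and addresses are those given in the manual setup):

```yaml
# /etc/netplan/99-stacking.yaml  (hypothetical file name)
network:
  version: 2
  ethernets:
    enP2p1s0f1np1:            # CX-7 port named in the manual setup
      addresses:
        - 192.168.100.10/24   # use 192.168.100.11/24 on node 2
```

Followed by `sudo netplan apply` on each node and the ping verification described above.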
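The point that stacking requires explicit multi-node code can be made concrete. PyTorch Distributed's `env://` rendezvous reads `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` from the environment; a stdlib-only sketch of the per-node values for the two-node setup above (the port number is an illustrative convention, not documented for this setup):

```python
def rendezvous_env(node_rank: int,
                   master_addr: str = "192.168.100.10",
                   master_port: int = 29500,
                   world_size: int = 2) -> dict:
    """Environment variables torch.distributed's env:// init method reads.

    node_rank 0 is the master (node 1 above); 29500 is a conventional
    example port, not a figure from the DGX Spark documentation.
    """
    return {
        "MASTER_ADDR": master_addr,     # node 1's static IP on the 200GbE link
        "MASTER_PORT": str(master_port),
        "RANK": str(node_rank),         # 0 on node 1, 1 on node 2
        "WORLD_SIZE": str(world_size),  # two stacked units
    }

# On node 2 one would export these variables, then call
# torch.distributed.init_process_group(backend="nccl", init_method="env://")
# so NCCL carries tensor traffic over the 200GbE RDMA link.
print(rendezvous_env(1))
```

Launchers such as `torchrun` set these variables automatically; the sketch just shows what the rendezvous needs to know about the two nodes.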
## 6. Scaling Beyond 2 Units

The documentation mentions potential for:

- **Job orchestration with Slurm or Kubernetes**
- **Containerized execution with Singularity or Docker**

Whether more than 2 units can be practically clustered is not explicitly documented, but the 200GbE RDMA networking and Slurm/K8s support suggest it is architecturally possible.

## 7. Physical Configuration

The compact form factor (150x150x51mm per unit) is designed to be **stackable** — two units sit on top of each other on a desk, connected via short (400-500mm) QSFP DAC cables.

## Key Relationships

- Connected via: [[connectivity]] (QSFP56/ConnectX-7 ports)
- Extends capacity of: [[ai-workloads]]
- Doubles resources from: [[gb10-superchip]], [[memory-and-storage]]
- Software stack: [[dgx-os-software]]
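If clusters larger than two units turn out to be practical, the rank bookkeeping used by MPI/NCCL launchers generalizes directly. A small illustrative sketch of the standard global-rank mapping (nothing here is from the DGX Spark documentation; the 4-unit case is hypothetical):

```python
def global_rank(node_rank: int, local_rank: int, gpus_per_node: int = 1) -> int:
    """Standard mapping from (node, local GPU) to a global rank.

    Each GB10 unit exposes a single GPU, so gpus_per_node defaults to 1
    and global rank equals node rank; the parameter is kept for generality.
    """
    return node_rank * gpus_per_node + local_rank

# A hypothetical 4-unit cluster: world size 4, one rank per unit.
ranks = [global_rank(n, 0) for n in range(4)]
print(ranks)  # [0, 1, 2, 3]
```

Slurm and `torchrun` compute exactly this mapping from their node/process counts, which is why the same distributed code would run unchanged on a larger cluster.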