Expand knowledge base with NVIDIA docs, reviews, and Dell owner's manual
Phase 2: Integrated NVIDIA DGX Spark User Guide, Jeff Geerling review, ServeTheHome review, and NVIDIA Developer Forums. Resolved 18 open questions including environmental specs, stacking config, benchmarks, Docker support, UEFI settings, and firmware update procedures.

Phase 3: Parsed the full Dell Owner's Manual (Rev A01, Dec 2025, 45 pages). Applied critical corrections: SSD is PCIe Gen4 (not Gen5) and supports both M.2 2230 and 2242; HDMI is 2.1a (not 2.1b); operating temp is 0-35°C (Dell) vs 5-30°C (NVIDIA). Added the complete BIOS menu structure, SSD replacement procedure, display resolutions, wireless module details, and full environmental specifications. 32 open questions now resolved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
16 changed files with 745 additions and 195 deletions
- CLAUDE.md (2)
- context/ai-frameworks.md (14)
- context/ai-workloads.md (29)
- context/connectivity.md (66)
- context/dgx-os-software.md (43)
- context/equations-and-bounds.md (62)
- context/gb10-superchip.md (24)
- context/memory-and-storage.md (56)
- context/multi-unit-stacking.md (92)
- context/open-questions.md (113)
- context/physical-specs.md (105)
- context/setup-and-config.md (151)
- context/skus-and-pricing.md (18)
- phases/phase-02-deep-research.md (77)
- phases/phase-03-owners-manual.md (88)
- reference/sources/dell-pro-max-with-gb10-fcm1253-om-en-us.pdf (BIN)
@@ -1,52 +1,108 @@ context/multi-unit-stacking.md
---
id: multi-unit-stacking
title: "Multi-Unit Stacking"
status: established
source_sections: "NVIDIA DGX Spark User Guide: Spark Stacking, Jeff Geerling review, ServeTheHome review"
related_topics: [connectivity, gb10-superchip, ai-workloads, memory-and-storage]
key_equations: []
key_terms: [connectx-7, smartnic, qsfp, stacking, mpi, nccl, slurm, kubernetes]
images: []
examples: []
open_questions:
- "Performance overhead of inter-unit communication vs. single unit (quantified)"
- "Can more than 2 units be stacked?"
- "Actual tokens/sec for 405B models on stacked configuration"
---

# Multi-Unit Stacking

Two Dell Pro Max GB10 units can be connected together to create a distributed compute cluster, effectively doubling the available compute and memory for running larger AI models.
## 1. How It Works

Each Dell Pro Max GB10 has **2x QSFP56 200 Gbps ports** powered by the NVIDIA ConnectX-7 SmartNIC. These ports enable direct unit-to-unit connection:

- **Combined memory:** 256 GB (128 GB per unit, NOT unified — distributed across nodes)
- **Combined compute:** 2 PFLOP FP4 (1 PFLOP per unit)
- **Interconnect:** 200GbE RDMA via QSFP56 DAC cable
- **CX-7 ports support Ethernet configuration only** — no InfiniBand (T1 Documented)
## 2. Required Hardware

### Approved QSFP DAC Cables (T1 Documented, NVIDIA DGX Spark User Guide)

| Manufacturer | Part Number   | Description                          |
|--------------|---------------|--------------------------------------|
| Amphenol     | NJAAKK-N911   | QSFP to QSFP112, 32AWG, 400mm, LSZH  |
| Amphenol     | NJAAKK0006    | 0.5m variant                         |
| Luxshare     | LMTQF022-SD-R | QSFP112 400G DAC Cable, 400mm, 30AWG |

These are short DAC (Direct Attach Copper) cables. The units are designed to sit directly on top of each other.
## 3. Software Configuration (T1 Documented, NVIDIA DGX Spark User Guide)

### Prerequisites

- Two DGX Spark / Dell Pro Max GB10 systems
- Both running Ubuntu 24.04 (or later) with NVIDIA drivers installed
- Internet connectivity for initial setup
- Root/sudo access on both systems

### Network Setup

**Option 1 — Automatic (Recommended):** Use NVIDIA's netplan playbook, downloaded from their repository and applied via standard `netplan apply` commands.

**Option 2 — Manual static IP:**

- Interface name: `enP2p1s0f1np1`
- Node 1: `192.168.100.10/24`
- Node 2: `192.168.100.11/24`
- Verify with a ping test between nodes
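For the manual option, the addressing above corresponds to a netplan fragment along these lines (a sketch for node 1; the file name is an assumption, not from the guide):

```yaml
# /etc/netplan/99-stacking.yaml (node 1; use 192.168.100.11/24 on node 2)
network:
  version: 2
  ethernets:
    enP2p1s0f1np1:          # ConnectX-7 QSFP56 interface
      addresses:
        - 192.168.100.10/24
```

Apply on each node with `sudo netplan apply`, then confirm reachability with `ping 192.168.100.11` from node 1.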
### SSH Configuration

The NVIDIA discovery script automates **passwordless SSH** between nodes, required for MPI communication.
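What the discovery script automates can also be done by hand; a minimal sketch (the key path and remote username are illustrative, not from the guide):

```shell
# Generate a passphrase-less key pair on node 1 (path is illustrative)
ssh-keygen -t ed25519 -N "" -f ./stacking_key -q
cat ./stacking_key.pub
# Authorize it on node 2, then confirm there is no password prompt:
#   ssh-copy-id -i ./stacking_key.pub user@192.168.100.11
#   ssh -i ./stacking_key user@192.168.100.11 hostname
```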
### Communication Frameworks

- **MPI** — inter-process CPU communication between nodes
- **NCCL v2.28.3** — GPU-accelerated collective operations across nodes

### Verification

1. Ping connectivity test between nodes
2. Interface verification: `ip a` and `ethtool`
3. NCCL test suite execution (via NVIDIA playbook)
## 4. How It Appears to Software

Stacking does **NOT** present as a single logical device. It creates a **2-node distributed cluster** requiring explicit multi-node code:

- Frameworks must use distributed execution (e.g., PyTorch Distributed, Megatron-LM)
- MPI handles inter-process communication
- NCCL handles GPU-to-GPU tensor transfers across the 200GbE link
- This is fundamentally different from a single larger GPU — there is communication overhead
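To make the distinction concrete, here is a minimal PyTorch Distributed sketch. On the stacked pair it would run on both nodes with `backend="nccl"`; the gloo fallback, port, and launch flags below are illustrative, not from the guide:

```python
import torch
import torch.distributed as dist

def allreduce_demo(rank=0, world_size=1, backend="gloo", port=29500):
    # On the stacked pair: backend="nccl", world_size=2, launched on each
    # node, rendezvousing at e.g. 192.168.100.10:<port> over the 200GbE link.
    dist.init_process_group(backend,
                            init_method=f"tcp://127.0.0.1:{port}",
                            rank=rank, world_size=world_size)
    t = torch.tensor([float(rank + 1)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # summed across all ranks
    dist.destroy_process_group()
    return t.item()

if __name__ == "__main__":
    print(allreduce_demo())  # single-process run for illustration
```

The point is that nothing here looks like addressing one big GPU: the program must be launched per node, and every cross-node tensor movement is an explicit collective.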
## 5. Model Capacity

| Configuration | Memory | Max Model Size (approx) |
|---------------|--------|-------------------------|
| Single unit   | 128 GB | ~200B parameters (FP4)  |
| Dual stacked  | 256 GB | ~405B parameters (FP4)  |

This enables running models like **Llama 3.1 405B** (with quantization) that would not fit in a single unit's memory.
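A quick back-of-envelope check of the table (the 20% overhead factor for KV cache and activations is an assumption, not from the sources):

```python
def fp4_footprint_gib(params_billion, overhead=1.20):
    """Approximate FP4 footprint: 0.5 bytes/parameter plus runtime overhead."""
    return params_billion * 1e9 * 0.5 * overhead / 2**30

# 200B fits a single 128 GB unit; 405B needs the stacked 256 GB
print(round(fp4_footprint_gib(200), 1))   # ~111.8 GiB
print(round(fp4_footprint_gib(405), 1))   # ~226.3 GiB
```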
## 6. Scaling Beyond 2 Units

The documentation mentions potential for:

- **Job orchestration with Slurm or Kubernetes**
- **Containerized execution with Singularity or Docker**

Whether more than 2 units can be practically clustered is not explicitly documented, but the 200GbE RDMA networking and Slurm/K8s support suggest it is architecturally possible.

## 7. Physical Configuration

The compact form factor (150x150x51mm per unit) is designed to be **stackable** — two units sit on top of each other on a desk, connected via short (400-500mm) QSFP DAC cables.

## Key Relationships

- Connected via: [[connectivity]] (QSFP56/ConnectX-7 ports)
- Extends capacity of: [[ai-workloads]]
- Doubles resources from: [[gb10-superchip]], [[memory-and-storage]]
- Software stack: [[dgx-os-software]]
@@ -0,0 +1,77 @@ phases/phase-02-deep-research.md
# Phase 2: Deep Research — Reviews, Official Docs, Community Data

**Date:** 2026-02-14
**Goal:** Fill gaps from Phase 1 by integrating official NVIDIA documentation, independent reviews, and community findings

## What Was Done

1. Attempted to access the Dell Owner's Manual PDF — blocked by 403 errors on the Dell support site and manuals.plus
2. Found and ingested the **NVIDIA DGX Spark User Guide** (HTML version at docs.nvidia.com) — the authoritative hardware/software reference
3. Ingested **Jeff Geerling's review** — power measurements, benchmarks, thermal analysis, networking throughput
4. Ingested the **ServeTheHome review** — power draw by workload, noise levels, LLM benchmarks, port layout details
5. Ingested the **Tom's Hardware review** — overall verdict, rating (4/5)
6. Ingested **NVIDIA Developer Forums** — SSD replacement details (M.2 2242 form factor confirmed)
7. Ingested the **ServeTheHome firmware article** — Dell vs NVIDIA firmware signing, update procedures

## Sources Added

- NVIDIA DGX Spark User Guide: Hardware Overview (docs.nvidia.com/dgx/dgx-spark/hardware.html)
- NVIDIA DGX Spark User Guide: System Overview (docs.nvidia.com/dgx/dgx-spark/system-overview.html)
- NVIDIA DGX Spark User Guide: First Boot (docs.nvidia.com/dgx/dgx-spark/first-boot.html)
- NVIDIA DGX Spark User Guide: UEFI Settings (docs.nvidia.com/dgx/dgx-spark/uefi-settings.html)
- NVIDIA DGX Spark User Guide: Spark Stacking (docs.nvidia.com/dgx/dgx-spark/spark-clustering.html)
- NVIDIA DGX Spark User Guide: Software (docs.nvidia.com/dgx/dgx-spark/software.html)
- Jeff Geerling: "Dell's version of the DGX Spark fixes pain points" (jeffgeerling.com)
- ServeTheHome: "NVIDIA DGX Spark Review" (servethehome.com) — pages 1 and 4
- ServeTheHome: "NVIDIA DGX Spark and Dell Partner GB10 Firmware" (servethehome.com)
- Tom's Hardware: "Nvidia DGX Spark review" (tomshardware.com)
- NVIDIA Developer Forums: "Exchange internal SSD" thread (forums.developer.nvidia.com)
- Storage Review teardown (referenced in forums)

## Key Findings

### Resolved 18 open questions

- SSD is a user-replaceable FRU (M.2 2242 PCIe Gen5)
- Memory: 256-bit, 16 channels
- Docker + NVIDIA Container Runtime pre-installed
- Environmental: 5-30°C, 10-90% humidity, up to 3,000m altitude
- Noise: < 40 dB at 1-1.5m
- Cooling: dual-fan + dense heatsink, front-to-back airflow
- Firmware: apt + fwupdmgr (Dell uses different signed firmware from NVIDIA)
- PXE boot supported via UEFI
- First boot: 10-step wizard fully documented
- Stacking: specific QSFP DAC cables documented, MPI + NCCL v2.28.3, Ethernet-only
- Stacking is a 2-node distributed cluster (NOT a single logical device)
- QSFP ports usable for general 200GbE networking
- Benchmark data: Llama 3.2 3B ~100 tok/s, GPT-OSS-120B ~14.5 tok/s
- Dell design prevents thermal throttling (better than the DGX Spark reference)
- Power draw profiled across all workload types
- HDMI display compatibility issue documented
- 2-year support guarantee
- Dell PSU is 280W vs DGX Spark 240W

### New data added

- RT Cores (4th gen), NVENC, NVDEC specs
- FP64 HPL benchmark: ~675 GFLOPS
- Gaming performance (Cyberpunk, Doom Eternal via FEX/Proton)
- USB ports are USB 3.2 Gen 2x2
- QSFP connected via x4 PCIe Gen 5
- Regulatory model: D21U / D21U001
- DGX Spark weight (1.2kg) vs Dell (1.31kg) differentiated
- UEFI menu structure documented
- Software stack: NVIDIA Sync, DGX Dashboard, AI Enterprise, NGC confirmed
- Stacking cable part numbers (Amphenol, Luxshare)
- Stacking IP configuration (192.168.100.10/11)

## What Changed

All 12 context files updated. open-questions.md rebuilt with 18 resolved items.

## Remaining Gaps

- Dell Owner's Manual PDF still not ingested (403 from all sources — needs manual download)
- DGX Spark UEFI Manual (separate document, not yet found online)
- Exact TFLOPS for FP8/FP16/FP32 still inferred
- No Llama 3.3 70B specific tokens/sec benchmark
- No fine-tuning benchmarks
- No image generation benchmarks
@@ -0,0 +1,88 @@ phases/phase-03-owners-manual.md
# Phase 3: Dell Owner's Manual Integration

**Date:** 2026-02-14
**Goal:** Parse the full Dell Pro Max GB10 FCM1253 Owner's Manual (Rev A01, Dec 2025, 45 pages) and integrate it into the knowledge base

## Source

- `reference/sources/dell-pro-max-with-gb10-fcm1253-om-en-us.pdf`
- Extracted via PyMuPDF (fitz); all 45 pages parsed successfully
## Critical Corrections Made

These findings from the authoritative Dell Owner's Manual contradict earlier data from web research and NVIDIA forums:

1. **SSD is PCIe Gen4, NOT Gen5** — NVIDIA Developer Forums and the Storage Review teardown claimed Gen5. The Dell Owner's Manual (Rev A01, Dec 2025) says PCIe Gen4 NVMe, up to 64 GT/s.
2. **Supports BOTH M.2 2230 AND M.2 2242** — Earlier research indicated 2242-only. The manual lists 2230 (1TB TLC, 2TB QLC) and 2242 (1TB/4TB TLC SED Opal 2.0).
3. **HDMI is 2.1a, NOT 2.1b** — Initial web research from the Dell product page suggested 2.1b. The manual confirms 2.1a (matching the DGX Spark spec).
4. **Operating temperature range is 0-35°C (Dell), not 5-30°C (NVIDIA)** — Dell's spec is wider. This may reflect Dell's improved thermal design.
5. **Bottom cover screws are Torx (M2x4.4), not Phillips** — Earlier sources said Phillips. The manual specifies a T5 or T8 Torx screwdriver.
6. **Weight range is 1.22-1.34 kg** — Not a single 1.31 kg figure; it varies by configuration.

## New Data Added

### Hardware Details

- Processor cache: 16 MB
- All ports are on the BACK (the front has no ports)
- Power button: press to turn on, hold 4 seconds to force shutdown
- Service tag location: bottom of unit
- Rubber base plate: magnetically attached, pry from left/right gaps
- Bottom cover: 4x M2x4.4 Torx screws
- SSD: 1x M2x2 screw, thermal pads on top and bottom
- Required tools: Phillips #0, T5 or T8 Torx

### Display Specifications

- USB-C DP 1.4a: max 7680x4320 at 60 Hz (8K@60)
- HDMI 2.1a: max 7680x4320 at 30 Hz (8K@30)
- MST (Multi-Stream Transport): not supported
- Cable recommendation: connect right to left, ≤6.5mm width

### Networking

- Realtek RTL8127-CG (10GbE Ethernet controller)
- AzureWave AW-EM637 Wi-Fi module (2.4/5/6 GHz)
- Encryption: 128-bit AES-CCMP, 256-bit AES-GCMP, 256-bit AES-GMAC

### Power Adapter

- Dimensions: 23x78x162 mm
- Input: 100-240VAC, 50-60Hz
- Multi-voltage output: 48V/36V/28V/20V/15V/9V/5V

### Environmental (Dell-specific)

- Operating: 0-35°C, 10-90% humidity, -15.2m to 3048m altitude
- Storage: -40 to 65°C, 0-95% humidity, -15.2m to 10668m altitude
- Vibration: 0.66 GRMS operating, 1.30 GRMS storage
- Shock: 110G operating, 160G storage (2ms half-sine)
- Airborne contaminants: G1 per ISA-S71.04-1985

### BIOS/UEFI (Full Structure)

- Entry: Delete key at Dell logo (BIOS), F7 (one-time boot menu)
- Navigation: F1=Help, F2=Restore, F3=Defaults, F4=Save & Exit, ESC=Exit
- Menus: Main, Advanced, Security, Boot, Save & Exit
- Advanced: Platform Configuration (iGPU Memory Carveout, DRAM Encryption, Watchdog Timer), Network Stack, NVMe Config with self-test, VLAN, TLS Auth
- Security: Secure Boot, Media Sanitization, TCG Storage (Opal), Expert Key Management, password policies
- Boot: boot priorities, Fast Boot, Quiet Boot, custom boot options

### SKU Details

- 2TB model uses M.2 2230 QLC (no SED)
- 4TB model uses M.2 2242 TLC with Opal 2.0 SED
- Additional 1TB options exist (2230 TLC and 2242 TLC SED)

### Troubleshooting

- Network power cycle procedure (7 steps)
- Force shutdown: hold power 4 seconds
- NVMe self-test via BIOS

## Files Updated

All 12 context files updated. 14 additional resolved questions added to open-questions.md.

## Remaining Gaps

- Figures/diagrams from the manual are referenced but not visually captured (PDF images)
- Full UEFI Manual (separate document referenced in the Dell manual) not yet found
- DGX Spark-specific BIOS differences (if any) unknown