Integrate NVIDIA Spark playbooks: CUDA sm_121, TensorRT-LLM, fine-tuning, Ollama, ComfyUI
Phase 4: Parsed all 9 playbooks from build.nvidia.com/spark. Key findings: CUDA compute capability sm_121, toolkit 13.0, TensorRT-LLM confirmed, fine-tuning scripts (SFT/LoRA/QLoRA up to 70B), Nemotron-3-Nano 30B MoE, speculative decoding (EAGLE-3/Draft-Target), ComfyUI image gen, Ollama + Open WebUI, RAPIDS scientific computing, DGX Dashboard on port 11000, NVIDIA Sync full documentation. 11 questions resolved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8 changed files with 214 additions and 23 deletions

- CLAUDE.md (1 changed line)
- context/ai-frameworks.md (67 changed lines)
- context/ai-workloads.md (55 changed lines)
- context/dgx-os-software.md (24 changed lines)
- context/equations-and-bounds.md (2 changed lines)
- context/gb10-superchip.md (1 changed line)
- context/open-questions.md (17 changed lines)
- phases/phase-04-spark-playbooks.md (70 changed lines)
@@ -0,0 +1,70 @@

# Phase 4: NVIDIA Spark Playbooks Integration

**Date:** 2026-02-14
**Goal:** Integrate official NVIDIA playbooks from build.nvidia.com/spark into the knowledge base

## Source

- https://build.nvidia.com/spark (main page, 9 playbooks + connection guide)

|||
## Key Discoveries

### Critical Technical Facts (previously unknown)

1. **CUDA compute capability: `sm_121`** — required for compiling CUDA kernels on Blackwell GB10 (`-DCMAKE_CUDA_ARCHITECTURES="121"`)
2. **CUDA toolkit version: 13.0** — PyTorch wheels use the `cu130` index
3. **DGX Dashboard runs on port 11000** — JupyterLab ports are listed in `/opt/nvidia/dgx-dashboard-service/jupyterlab_ports.yaml`
4. **TensorRT-LLM confirmed** — container `tensorrt-llm/release:1.2.0rc6`
5. **PyTorch NGC container:** `nvcr.io/nvidia/pytorch:25.11-py3`
6. **RAPIDS container:** version 25.10
7. **UMA buffer cache flush:** `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'`

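Taken together, the facts above amount to a short setup recipe. A minimal sketch, assuming CUDA toolkit 13.0 is installed and a CMake-based project; the cache-flush command is quoted verbatim from the playbooks, the rest is ordinary toolchain usage rather than Spark-specific instructions:

```shell
# Configure a CUDA project for Blackwell GB10 (compute capability sm_121).
# The project layout is hypothetical; the architecture flag is from the playbooks.
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="121" .
cmake --build build -j"$(nproc)"

# PyTorch wheels for CUDA 13.0 come from the cu130 index noted above.
pip install torch --index-url https://download.pytorch.org/whl/cu130

# Flush the UMA buffer cache before memory-sensitive runs (verbatim from the playbooks).
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```

These are device-bound configuration commands, meant to be run on the Spark itself, not a portable script.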
|||
### Fine-Tuning (fully documented)

- **Full SFT:** Llama 3.2 3B (all parameters, bfloat16)
- **LoRA:** Llama 3.1 8B (rank 8 by default)
- **LoRA + FSDP:** Llama 3.1 70B (multi-node via Docker Swarm)
- **QLoRA 4-bit:** Llama 3.1 70B (single unit)
- Dependencies: transformers, peft, datasets, trl, bitsandbytes

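A quick back-of-envelope check on why 70B QLoRA is plausible on a single unit: at 4 bits per weight, the base model alone needs roughly 35 GB before activations, KV cache, or adapter/optimizer state. The arithmetic below is my own estimate, not a figure from the playbooks:

```shell
# Rough 4-bit weight footprint for a 70B-parameter model (estimate only;
# ignores KV cache, LoRA adapters, optimizer state, and framework overhead).
PARAMS=70000000000
WEIGHT_BYTES=$((PARAMS / 2))             # 4 bits = 0.5 byte per parameter
WEIGHT_GB=$((WEIGHT_BYTES / 1000000000))
echo "base weights: ~${WEIGHT_GB} GB"    # prints "base weights: ~35 GB"
```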
|||
### Inference Tools

- **llama.cpp:** build with CUDA sm_121; provides an OpenAI-compatible API (streaming, function calling)
- **Nemotron-3-Nano 30B:** MoE (3B active parameters), ~38 GB at Q8, built-in reasoning/tool-calling
- **Speculative decoding:** EAGLE-3 (built-in drafting) and Draft-Target (8B + 70B, FP4)
- **Ollama + Open WebUI:** Docker container; ports 12000 (via Sync) or 8080 (direct)

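To illustrate the OpenAI-compatible surface, a sketch of a chat completion request against a locally running llama.cpp server; the port, model name, and prompt are placeholders of mine, not values from the playbooks:

```shell
# Hypothetical request to llama.cpp's OpenAI-compatible chat endpoint.
# Assumes a server is already running locally; port 8080 is a placeholder.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nemotron-3-nano",
        "messages": [{"role": "user", "content": "Summarize speculative decoding in one sentence."}],
        "stream": false
      }'
```

The same endpoint shape is what Open WebUI and most OpenAI-client libraries expect, which is why the llama.cpp and Ollama paths are interchangeable from the client side.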
|||
### Image Generation

- **ComfyUI** confirmed working (SD, SDXL, Flux) on port 8188
- Native Blackwell GPU acceleration with CUDA 13.0

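A sketch of bringing ComfyUI up on the port noted above; the clone location is arbitrary, and `--listen`/`--port` are standard ComfyUI flags rather than anything Spark-specific:

```shell
# Hypothetical ComfyUI launch; assumes PyTorch is already installed
# (e.g. from the cu130 index mentioned earlier).
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt
python main.py --listen 0.0.0.0 --port 8188   # matches the port noted above
```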
|||
### Scientific Computing

- **scRNA-seq:** RAPIDS-singlecell, ~130 s full pipeline, exact nearest-neighbor graph
- **Portfolio optimization:** cuOpt + cuML, Mean-CVaR model, ~7 min pipeline

|||
### Development Environment

- **VS Code:** ARM64 .deb install, or remote SSH via Sync
- **Cursor:** remote SSH via Sync
- **NVIDIA AI Workbench:** launchable via Sync
- **NVIDIA Sync:** full details documented (SSH key automation, mDNS, port forwarding)

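For the remote-SSH workflows above, Sync automates key setup and forwarding; a manual equivalent can be sketched as an SSH config entry. The hostname and user are assumptions of mine (only the Dashboard port 11000 comes from the playbooks). The sketch writes to a local file; append it to `~/.ssh/config` to actually use it:

```shell
# Hypothetical SSH config entry for VS Code / Cursor remote sessions.
# "spark.local" stands in for the device's mDNS name; "nvidia" for your user.
cat > spark_ssh_config <<'EOF'
Host spark
    HostName spark.local
    User nvidia
    ForwardAgent yes
    # forward the DGX Dashboard (port 11000, from the playbooks)
    LocalForward 11000 localhost:11000
EOF
```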
|||
## Files Updated

- `context/gb10-superchip.md` — sm_121 CUDA architecture
- `context/ai-frameworks.md` — major expansion: CUDA 13.0, TensorRT-LLM, Ollama, ComfyUI, NGC containers, UMA tip
- `context/ai-workloads.md` — fine-tuning scripts, Nemotron, speculative decoding, image generation, scientific computing
- `context/dgx-os-software.md` — NVIDIA Sync §8 (full detail), DGX Dashboard §9 (port, features)
- `context/setup-and-config.md` — NVIDIA Sync cross-reference
- `context/equations-and-bounds.md` — sm_121, CUDA 13.0
- `context/open-questions.md` — 11 new resolved questions, 1 new open question
- `CLAUDE.md` — Phase 4 added to history

|||
## Remaining Gaps

- Quantitative speculative-decoding speedup (tokens/sec improvement not published)
- ComfyUI image generation benchmarks (images/sec)
- Fine-tuning wall-clock times
- Full list of Ollama-compatible models tested on GB10