From fbdcc807b38968b5100819f1414aba322e251419 Mon Sep 17 00:00:00 2001 From: Joe DiPrima Date: Sat, 14 Feb 2026 16:18:26 -0600 Subject: [PATCH] Integrate NVIDIA Spark playbooks: CUDA sm_121, TensorRT-LLM, fine-tuning, Ollama, ComfyUI Phase 4: Parsed all 9 playbooks from build.nvidia.com/spark. Key findings: CUDA compute capability sm_121, toolkit 13.0, TensorRT-LLM confirmed, fine-tuning scripts (SFT/LoRA/QLoRA up to 70B), Nemotron-3-Nano 30B MoE, speculative decoding (EAGLE-3/Draft-Target), ComfyUI image gen, Ollama+Open WebUI, RAPIDS scientific computing, DGX Dashboard on port 11000, NVIDIA Sync full documentation. 11 questions resolved. Co-Authored-By: Claude Opus 4.6 --- CLAUDE.md | 1 + context/ai-frameworks.md | 67 +++++++++++++++++++++++----- context/ai-workloads.md | 55 +++++++++++++++++++---- context/dgx-os-software.md | 24 +++++++++- context/equations-and-bounds.md | 2 + context/gb10-superchip.md | 1 + context/open-questions.md | 17 +++++++- phases/phase-04-spark-playbooks.md | 70 ++++++++++++++++++++++++++++++ 8 files changed, 214 insertions(+), 23 deletions(-) create mode 100644 phases/phase-04-spark-playbooks.md diff --git a/CLAUDE.md b/CLAUDE.md index ee0db34..bc1547d 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -167,3 +167,4 @@ Dell Pro Max GB10 (product) | 1 | 2026-02-14 | Initial knowledge base created from web research | | 2 | 2026-02-14 | Deep research: NVIDIA docs, reviews, 18 questions resolved | | 3 | 2026-02-14 | Dell Owner's Manual (Rev A01) integrated, critical corrections applied | +| 4 | 2026-02-14 | NVIDIA Spark playbooks: CUDA sm_121, TensorRT-LLM, fine-tuning, Sync, Dashboard, ComfyUI, Ollama | diff --git a/context/ai-frameworks.md b/context/ai-frameworks.md index e232fc6..668893e 100644 --- a/context/ai-frameworks.md +++ b/context/ai-frameworks.md @@ -2,10 +2,10 @@ id: ai-frameworks title: "AI Frameworks and Development Tools" status: established -source_sections: "Web research: NVIDIA newsroom, Arm learning paths, NVIDIA DGX Spark User Guide" +source_sections: "Web research: NVIDIA newsroom, Arm learning paths, NVIDIA DGX Spark User Guide, build.nvidia.com/spark playbooks" related_topics: [dgx-os-software, gb10-superchip, ai-workloads] key_equations: [] -key_terms: [pytorch, nemo, rapids, cuda, ngc, jupyter, tensorrt, llama-cpp, docker, nvidia-container-runtime, fex] +key_terms: [pytorch, nemo, rapids, cuda, ngc, jupyter, tensorrt, tensorrt-llm, llama-cpp, docker, nvidia-container-runtime, fex, ollama, comfyui, sm_121, cu130, speculative-decoding] images: [] examples: [] open_questions: @@ -38,31 +38,74 @@ The Dell Pro Max GB10 supports a broad AI software ecosystem, pre-configured thr ## 2. Inference Tools -### CUDA Toolkit -- Low-level GPU compute API -- Compiler (nvcc) for custom CUDA kernels -- Profiling and debugging tools +### CUDA Toolkit (v13.0) +- **CUDA compute capability:** `sm_121` (Blackwell on GB10) — use `-DCMAKE_CUDA_ARCHITECTURES="121"` when compiling +- **PyTorch CUDA wheels:** `cu130` (e.g., `pip3 install torch --index-url https://download.pytorch.org/whl/cu130`) +- Low-level GPU compute API, compiler (nvcc), profiling and debugging tools ### llama.cpp - Quantized LLM inference engine - ARM-optimized builds available for GB10 - Supports GGUF model format +- Build with CUDA: `cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121"` (T1, build.nvidia.com/spark) +- Provides **OpenAI-compatible API** via `llama-server` (chat completions, streaming, function calling) - Documented in [Arm Learning Path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) -### TensorRT (expected) -- NVIDIA's inference optimizer -- Blackwell architecture support expected +### TensorRT-LLM +- NVIDIA's LLM inference optimizer — **confirmed available** (T1, build.nvidia.com/spark) +- Container: `tensorrt-llm/release:1.2.0rc6` +- Supports **speculative decoding** for faster inference: + - **EAGLE-3:** Built-in drafting head, no separate draft model needed + - **Draft-Target:** Pairs small (8B) and large (70B) models, uses FP4 quantization +- Configurable KV cache memory fraction for memory management + +### Ollama +- LLM runtime with model library — runs via Docker on GB10 (T1, build.nvidia.com/spark) +- Container: `ghcr.io/open-webui/open-webui:ollama` (bundles Open WebUI + Ollama) +- Models available from ollama.com/library (e.g., `gpt-oss:20b`) +- Port: 12000 (via NVIDIA Sync) or 8080 (direct) ## 3. Development Environment -- **DGX Dashboard** — web-based system monitor with integrated JupyterLab (T0 Spec) +- **DGX Dashboard** — web-based system monitor at `http://localhost:11000` with integrated JupyterLab (T0 Spec). JupyterLab ports configured in `/opt/nvidia/dgx-dashboard-service/jupyterlab_ports.yaml`. +- **VS Code** — ARM64 .deb available; also remote SSH via NVIDIA Sync or manual SSH (T1, build.nvidia.com/spark) +- **Cursor** — supported via NVIDIA Sync remote SSH launch (T1, build.nvidia.com/spark) +- **NVIDIA AI Workbench** — launchable via NVIDIA Sync (T1, build.nvidia.com/spark) - **Python** — system Python with AI/ML package ecosystem - **NVIDIA NGC Catalog** — library of pre-trained models, containers, and SDKs - **Docker + NVIDIA Container Runtime** — pre-installed for containerized workflows (T0 Spec) - **NVIDIA AI Enterprise** — enterprise-grade AI software and services -- **Tutorials:** https://build.nvidia.com/spark +- **Tutorials & Playbooks:** https://build.nvidia.com/spark + +### Key NGC Containers (confirmed ARM64) + +| Container | Tag | Use Case | +|-----------|-----|----------| +| `nvcr.io/nvidia/pytorch` | `25.11-py3` | PyTorch training & fine-tuning | +| `tensorrt-llm/release` | `1.2.0rc6` | Optimized LLM inference | +| RAPIDS | `25.10` | GPU-accelerated data science | +| `ghcr.io/open-webui/open-webui` | `ollama` | Open WebUI + Ollama LLM chat | + +## 4. Image Generation + +### ComfyUI +- Node-based image generation UI for Stable Diffusion, SDXL, Flux, etc. (T1, build.nvidia.com/spark) +- Runs natively on GB10 Blackwell GPU +- Requires: Python 3.8+, CUDA toolkit, PyTorch with `cu130` +- Port: 8188 (`--listen 0.0.0.0` for remote access) +- Storage: ~20 GB minimum (plus model files, e.g., SD 1.5 ~2 GB) + +## 5. UMA Memory Management Tip + +DGX Spark uses Unified Memory Architecture (UMA) — CPU and GPU share the same LPDDR5X pool. If GPU memory appears low due to filesystem buffer cache: + +```bash +sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' +``` + +This frees cached memory back to the unified pool without data loss. (T1, build.nvidia.com/spark) -## 4. Software Compatibility Notes +## 6. Software Compatibility Notes Since the GB10 is an ARM system: diff --git a/context/ai-workloads.md b/context/ai-workloads.md index 1a5a0fa..63e1bac 100644 --- a/context/ai-workloads.md +++ b/context/ai-workloads.md @@ -2,18 +2,17 @@ id: ai-workloads title: "AI Workloads and Model Capabilities" status: established -source_sections: "Web research: NVIDIA newsroom, Dell product page, WCCFTech, Jeff Geerling, ServeTheHome, Tom's Hardware" +source_sections: "Web research: NVIDIA newsroom, Dell product page, WCCFTech, Jeff Geerling, ServeTheHome, Tom's Hardware, build.nvidia.com/spark playbooks" related_topics: [gb10-superchip, memory-and-storage, ai-frameworks, multi-unit-stacking] key_equations: [model-memory-estimate] -key_terms: [llm, inference, fine-tuning, quantization, fp4, fp8, fp16, parameter-count] +key_terms: [llm, inference, fine-tuning, quantization, fp4, fp8, fp16, parameter-count, lora, qlora, sft, fsdp, speculative-decoding, nemotron, comfyui, rapids-singlecell] images: [] examples: [llm-memory-estimation.md] open_questions: - "Tokens/sec for Llama 3.3 70B specifically (only 3B and GPT-OSS-120B benchmarked so far)" - "Maximum batch size for inference at various model sizes" - - "Fine-tuning performance — how long to SFT a 7B model on this hardware?" - - "Stable Diffusion / image generation performance" - "Training from scratch — is it practical for any meaningful model size?" + - "Speculative decoding speedup factor (tokens/sec improvement not quantified yet)" --- # AI Workloads and Model Capabilities @@ -34,6 +33,7 @@ The Dell Pro Max GB10 is designed primarily for **local AI inference and fine-tu | Llama 3.2 3B | ~100 tokens/sec | — | Jeff Geerling | | GPT-OSS-120B | ~14.5 tokens/sec | INT4 | ServeTheHome | | Llama 3.1 70B | Competitive w/ Ryzen AI Max+ 395 | — | Jeff Geerling | +| Nemotron-3-Nano 30B | Runs (MoE, 3B active) | Q8_K | build.nvidia.com/spark | | HPL (Linpack) FP64 | ~675 GFLOPS | FP64 | Jeff Geerling | | Geekbench 6 | Comparable to Ryzen AI Max+ 395; trails Apple M3 Ultra | — | Jeff Geerling | @@ -41,6 +41,8 @@ The Dell Pro Max GB10 is designed primarily for **local AI inference and fine-tu **INT4 inference** on GPT-OSS-120B is roughly equivalent to an RTX 5070's performance (T2, ServeTheHome). +**Nemotron-3-Nano 30B** is a MoE architecture (30B total, 3B active params) requiring ~38 GB GPU memory at Q8. Provides OpenAI-compatible API via llama.cpp server. (T1, build.nvidia.com/spark) + ## 2. Model Size vs. Memory With 128 GB of unified memory, the system can hold: @@ -61,21 +63,58 @@ With 128 GB of unified memory, the system can hold: - Interactive chat, code generation, document analysis - Privacy-sensitive applications (medical, legal, financial) -### Fine-Tuning -- Supervised fine-tuning (SFT) of models using NVIDIA NeMo -- LoRA/QLoRA for parameter-efficient fine-tuning of larger models -- Custom domain adaptation +### Fine-Tuning (T1 Documented, build.nvidia.com/spark) + +NVIDIA provides official fine-tuning scripts with four approaches: + +| Script | Model | Method | Notes | +|--------|-------|--------|-------| +| Full SFT | Llama 3.2 3B | All parameters trainable | Fits in memory at bfloat16 | +| LoRA | Llama 3.1 8B | Parameter-efficient adapters | `lora_rank=8` default | +| LoRA + FSDP | Llama 3.1 70B | Distributed across 2 units | Multi-node via Docker Swarm | +| QLoRA (4-bit) | Llama 3.1 70B | Quantized base + LoRA | Fits on single unit | + +- Container: `nvcr.io/nvidia/pytorch:25.11-py3` +- Dependencies: `transformers`, `peft`, `datasets`, `trl`, `bitsandbytes` +- Key params: `--batch_size`, `--seq_length` (default 2048), `--num_epochs`, `--gradient_checkpointing` +- Dataset: Alpaca (configurable `--dataset_size`, default 512 samples) +- Multi-node: Docker Swarm + FSDP for 2-unit distributed training ### AI Prototyping - Rapid iteration on model architectures - Dataset preprocessing with RAPIDS - Experiment tracking and evaluation +### Image Generation (T1 Documented, build.nvidia.com/spark) +- **ComfyUI** confirmed working — node-based UI for Stable Diffusion, SDXL, Flux +- Runs natively on Blackwell GPU with CUDA 13.0 +- See [[ai-frameworks]] §4 for setup details + +### Speculative Decoding (T1 Documented, build.nvidia.com/spark) +- Accelerates LLM inference by using a small draft model to predict tokens verified by the large model +- **EAGLE-3:** Built-in drafting head (no separate model needed) +- **Draft-Target:** Pairs 8B draft + 70B target with FP4 quantization +- Uses TensorRT-LLM container (`tensorrt-llm/release:1.2.0rc6`) +- Configurable `max_draft_len` (1-8 tokens) and KV cache memory fraction + ### Data Science - GPU-accelerated analytics with RAPIDS - Large-scale data processing - Graph analytics +### Scientific Computing (T1 Documented, build.nvidia.com/spark) + +**Single-cell RNA Sequencing:** +- RAPIDS-singlecell library (GPU-accelerated, follows Scanpy API) +- Full scRNA-seq pipeline in ~130 seconds (preprocessing ~21s, clustering/DE ~104s) +- Requires ~40 GB unified memory +- Computes exact nearest-neighbor graph (vs. Scanpy's approximate) + +**Portfolio Optimization:** +- cuOpt LP/MILP solvers + cuML for GPU-accelerated KDE +- Mean-CVaR (Conditional Value-at-Risk) modeling +- Full pipeline in ~7 minutes + ### Gaming (bonus, not primary use case) Surprisingly, ARM Linux gaming works via FEX (x86-to-ARM translation) + Steam/Proton: - Cyberpunk 2077: ~100 fps at 1080p, low settings (T2, Jeff Geerling) diff --git a/context/dgx-os-software.md b/context/dgx-os-software.md index f341b56..8e71a27 100644 --- a/context/dgx-os-software.md +++ b/context/dgx-os-software.md @@ -41,7 +41,7 @@ The system ships ready to run AI workloads with: - **NVIDIA drivers** — optimized for GB10 Blackwell GPU - **Docker + NVIDIA Container Runtime** — container support out of the box (T0 Spec) - **NVIDIA Sync** — cross-platform desktop app for remote device management (see §8) -- **DGX Dashboard** — system monitoring with integrated JupyterLab +- **DGX Dashboard** — system monitoring web UI at `http://localhost:11000` with integrated JupyterLab (see §9) - **NGC** — access to NVIDIA GPU Cloud containerized applications and models - **AI Enterprise** — enterprise-grade AI software assets and services - **Python** — system Python plus development environments @@ -145,6 +145,28 @@ NVIDIA Sync is a **cross-platform desktop application** (macOS, Windows, Linux) - **Connection timeout during boot:** Wait for device to fully boot - **Authentication failure:** Reconfigure connection in Sync app +## 9. DGX Dashboard (T1 Documented, build.nvidia.com/spark) + +DGX Dashboard is a locally-hosted web application for system management and development. + +### Access + +- **Local:** `http://localhost:11000` or desktop shortcut in Ubuntu app launcher +- **Remote via NVIDIA Sync:** Automatic SSH tunnel (recommended) +- **Remote via manual SSH:** `ssh -L 11000:localhost:11000 user@spark-ip` + +For JupyterLab remote access, also forward the user-specific port from: +`/opt/nvidia/dgx-dashboard-service/jupyterlab_ports.yaml` + +### Features + +- **GPU/system monitoring** — real-time resource utilization panels and telemetry +- **JupyterLab** — one-click launch with pre-configured Python virtual environments + - Working directory: `/home//jupyterlab` + - Requirements tracking via `requirements.txt` +- **System updates** — package and firmware update management via GUI +- **Settings** — system configuration interface + ## Key Relationships - Runs on: [[gb10-superchip]] diff --git a/context/equations-and-bounds.md b/context/equations-and-bounds.md index e5cf695..aa91acd 100644 --- a/context/equations-and-bounds.md +++ b/context/equations-and-bounds.md @@ -38,6 +38,8 @@ Reference for all quantitative specifications, formulas, and validation ranges f - **Copy engines:** 2 (T0 Spec) - **NVENC:** 1 (T0 Spec) - **NVDEC:** 1 (T0 Spec) +- **CUDA compute capability:** sm_121 (T1, build.nvidia.com/spark) +- **CUDA toolkit:** 13.0 / cu130 (T1, build.nvidia.com/spark) ## 2. Memory diff --git a/context/gb10-superchip.md b/context/gb10-superchip.md index a910532..649fc77 100644 --- a/context/gb10-superchip.md +++ b/context/gb10-superchip.md @@ -48,6 +48,7 @@ The Blackwell GPU portion features: - **4th-generation RT Cores** — ray tracing acceleration (T0 Spec) - **1x NVENC / 1x NVDEC** — hardware video encode/decode engines (T0 Spec) - **2 copy engines** (T0 Spec) +- **CUDA compute capability:** `sm_121` (T1 Documented, build.nvidia.com/spark — required when compiling CUDA kernels with `-DCMAKE_CUDA_ARCHITECTURES="121"`) - Peak performance: **1 PFLOP (1,000 TFLOPS) at FP4 precision with sparsity** The Tensor Cores are the key differentiator for AI workloads, providing hardware acceleration for mixed-precision matrix operations used in deep learning. diff --git a/context/open-questions.md b/context/open-questions.md index 6f446a9..b09708d 100644 --- a/context/open-questions.md +++ b/context/open-questions.md @@ -86,11 +86,14 @@ Catalog of known unknowns, research gaps, and unresolved questions about the Del - *Status:* Only Llama 3.2 3B (~100 tok/s) and GPT-OSS-120B (~14.5 tok/s) benchmarked. - *Would resolve:* Most common use case performance - **Q:** Fine-tuning time estimates for common model sizes? - - *Status:* Unknown. + - *Status:* Partially resolved — scripts and methods documented (Full SFT 3B, LoRA 8B, QLoRA 70B) but wall-clock times not published. - *Would resolve:* Training workflow planning - **Q:** Stable Diffusion / image generation performance? - - *Status:* Unknown. + - *Status:* **Partially resolved** — ComfyUI confirmed working with SD 1.5. Quantitative benchmarks (images/sec) not published. - *Would resolve:* Non-LLM AI workload suitability +- **Q:** Speculative decoding speedup factor? + - *Status:* EAGLE-3 and Draft-Target methods documented. Quantitative speedup (tokens/sec improvement) not published. + - *Would resolve:* Inference optimization ROI --- @@ -130,3 +133,13 @@ Catalog of known unknowns, research gaps, and unresolved questions about the Del | 2026-02-14 | Power adapter dimensions? | 23 x 78 x 162 mm, multi-voltage output (5V-48V) | Dell Owner's Manual Rev A01 | | 2026-02-14 | USB-C MST support? | Not supported (single display per port only) | Dell Owner's Manual Rev A01 | | 2026-02-14 | Service tools required? | Phillips #0, T5 or T8 Torx screwdriver | Dell Owner's Manual Rev A01 | +| 2026-02-14 | CUDA compute capability / SM architecture? | sm_121 (compile with `-DCMAKE_CUDA_ARCHITECTURES="121"`) | build.nvidia.com/spark | +| 2026-02-14 | CUDA toolkit version? | CUDA 13.0 (PyTorch wheels: cu130) | build.nvidia.com/spark | +| 2026-02-14 | DGX Dashboard URL/port? | `http://localhost:11000` | build.nvidia.com/spark | +| 2026-02-14 | TensorRT-LLM availability? | Confirmed — container `tensorrt-llm/release:1.2.0rc6` | build.nvidia.com/spark | +| 2026-02-14 | Fine-tuning methods supported? | Full SFT (3B), LoRA (8B), QLoRA 4-bit (70B), FSDP multi-node | build.nvidia.com/spark | +| 2026-02-14 | Image generation support? | ComfyUI confirmed (SD, SDXL, Flux) on port 8188 | build.nvidia.com/spark | +| 2026-02-14 | Ollama / Open WebUI support? | Yes — Docker container, port 12000 (Sync) or 8080 (direct) | build.nvidia.com/spark | +| 2026-02-14 | NVIDIA Sync details? | Cross-platform app, SSH key automation, VS Code/Cursor/Dashboard launch, port forwarding | build.nvidia.com/spark | +| 2026-02-14 | PyTorch NGC container? | `nvcr.io/nvidia/pytorch:25.11-py3` (ARM64) | build.nvidia.com/spark | +| 2026-02-14 | Speculative decoding methods? | EAGLE-3 (built-in drafting) and Draft-Target (8B+70B) | build.nvidia.com/spark | diff --git a/phases/phase-04-spark-playbooks.md b/phases/phase-04-spark-playbooks.md new file mode 100644 index 0000000..75cdc40 --- /dev/null +++ b/phases/phase-04-spark-playbooks.md @@ -0,0 +1,70 @@ +# Phase 4: NVIDIA Spark Playbooks Integration + +**Date:** 2026-02-14 +**Goal:** Integrate official NVIDIA playbooks from build.nvidia.com/spark into knowledge base + +## Source + +- https://build.nvidia.com/spark (main page, 9 playbooks + connection guide) + +## Key Discoveries + +### Critical Technical Facts (previously unknown) + +1. **CUDA compute capability: `sm_121`** — required for compiling CUDA kernels on Blackwell GB10 (`-DCMAKE_CUDA_ARCHITECTURES="121"`) +2. **CUDA toolkit version: 13.0** — PyTorch wheels use `cu130` index +3. **DGX Dashboard runs on port 11000** — JupyterLab ports in `/opt/nvidia/dgx-dashboard-service/jupyterlab_ports.yaml` +4. **TensorRT-LLM confirmed** — container `tensorrt-llm/release:1.2.0rc6` +5. **PyTorch NGC container:** `nvcr.io/nvidia/pytorch:25.11-py3` +6. **RAPIDS container:** version 25.10 +7. **UMA buffer cache flush:** `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'` + +### Fine-Tuning (fully documented) + +- **Full SFT:** Llama 3.2 3B (all parameters, bfloat16) +- **LoRA:** Llama 3.1 8B (rank 8 default) +- **LoRA + FSDP:** Llama 3.1 70B (multi-node via Docker Swarm) +- **QLoRA 4-bit:** Llama 3.1 70B (single unit) +- Dependencies: transformers, peft, datasets, trl, bitsandbytes + +### Inference Tools + +- **llama.cpp:** Build with CUDA sm_121, provides OpenAI-compatible API (streaming, function calling) +- **Nemotron-3-Nano 30B:** MoE (3B active), ~38 GB at Q8, built-in reasoning/tool-calling +- **Speculative Decoding:** EAGLE-3 (built-in drafting) and Draft-Target (8B+70B, FP4) +- **Ollama + Open WebUI:** Docker container, ports 12000 (Sync) or 8080 (direct) + +### Image Generation + +- **ComfyUI** confirmed working (SD, SDXL, Flux) on port 8188 +- Native Blackwell GPU acceleration with CUDA 13.0 + +### Scientific Computing + +- **scRNA-seq:** RAPIDS-singlecell, ~130s full pipeline, exact nearest-neighbor graph +- **Portfolio Optimization:** cuOpt + cuML, Mean-CVaR model, ~7 min pipeline + +### Development Environment + +- **VS Code:** ARM64 .deb install or remote SSH via Sync +- **Cursor:** Remote SSH via Sync +- **NVIDIA AI Workbench:** Launchable via Sync +- **NVIDIA Sync:** Full details documented (SSH key automation, mDNS, port forwarding) + +## Files Updated + +- `context/gb10-superchip.md` — sm_121 CUDA architecture +- `context/ai-frameworks.md` — Major expansion: CUDA 13.0, TensorRT-LLM, Ollama, ComfyUI, NGC containers, UMA tip +- `context/ai-workloads.md` — Fine-tuning scripts, Nemotron, speculative decoding, image gen, scientific computing +- `context/dgx-os-software.md` — NVIDIA Sync §8 (full detail), DGX Dashboard §9 (port, features) +- `context/setup-and-config.md` — NVIDIA Sync cross-reference +- `context/equations-and-bounds.md` — sm_121, CUDA 13.0 +- `context/open-questions.md` — 11 new resolved questions, 1 new open question +- `CLAUDE.md` — Phase 4 added to history + +## Remaining Gaps + +- Quantitative speculative decoding speedup (tokens/sec improvement not published) +- ComfyUI image generation benchmarks (images/sec) +- Fine-tuning wall-clock times +- Full list of Ollama-compatible models tested on GB10