# Worked Example: LLM Memory Estimation on Dell Pro Max GB10

## Problem

Estimate whether Llama 3.3 70B can run on a single Dell Pro Max GB10, and at what precision.

## Given

- **Model:** Llama 3.3 70B (70 billion parameters)
- **Available memory:** 128 GB unified LPDDR5X
- **Usable memory:** ~110 GB (after OS, framework, and runtime overhead)

## Calculation

### Step 1: Raw Model Weight Memory

| Precision | Bytes/Param | Memory for 70B |
|-----------|-------------|------------------|
| FP4 | 0.5 | 70 × 0.5 = 35 GB |
| FP8/INT8 | 1.0 | 70 × 1.0 = 70 GB |
| FP16 | 2.0 | 70 × 2.0 = 140 GB |
| FP32 | 4.0 | 70 × 4.0 = 280 GB |

### Step 2: Total Memory with Overhead (1.3× multiplier)

| Precision | Weights | Total (~1.3×) | Fits in 110 GB? |
|-----------|---------|---------------|-----------------|
| FP4 | 35 GB | ~46 GB | Yes |
| FP8/INT8 | 70 GB | ~91 GB | Yes |
| FP16 | 140 GB | ~182 GB | No |
| FP32 | 280 GB | ~364 GB | No |

### Step 3: Conclusion

- **FP4 quantized:** Fits comfortably (46/110 GB ≈ 42% utilization), leaving plenty of room for a large KV cache and bigger batch sizes.
- **FP8/INT8 quantized:** Fits (91/110 GB ≈ 83% utilization). Tight but workable for single-request inference.
- **FP16 (half precision):** Does NOT fit on a single unit; would require 2-unit stacking (see [[multi-unit-stacking]]).
- **FP32 (full precision):** Does NOT fit even with 2-unit stacking (364 GB > 2 × 110 GB).

## Verification

NVIDIA confirms Llama 3.3 70B runs locally on a single GB10 unit. This is consistent with FP8 or FP4 quantized inference, both of which our calculation shows fitting within the memory budget.

## Sources

- Memory specs: [[memory-and-storage]]
- Estimation formulas: [[equations-and-bounds]]
- Model capabilities: [[ai-workloads]]
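The two calculation steps above reduce to `params × bytes-per-param × overhead`, compared against usable memory. A minimal Python sketch, assuming the 1.3× overhead multiplier and ~110 GB usable-memory figure from this note (rules of thumb, not measured values):

```python
# Memory-fit estimate for a 70B-parameter model, per the worked example.
# PARAMS_B is in billions, so PARAMS_B * bytes/param gives GB directly.

PARAMS_B = 70        # Llama 3.3 70B
USABLE_GB = 110      # assumed usable memory after OS/framework overhead
OVERHEAD = 1.3       # assumed multiplier for KV cache, activations, buffers

BYTES_PER_PARAM = {"FP4": 0.5, "FP8/INT8": 1.0, "FP16": 2.0, "FP32": 4.0}

def estimate(precision: str) -> tuple[float, float, bool]:
    """Return (weight GB, total GB with overhead, fits in usable memory)."""
    weights = PARAMS_B * BYTES_PER_PARAM[precision]
    total = weights * OVERHEAD
    return weights, total, total <= USABLE_GB

for p in BYTES_PER_PARAM:
    w, t, fits = estimate(p)
    print(f"{p:9s} weights={w:6.1f} GB  total~{t:6.1f} GB  fits={fits}")
```

Running this reproduces the tables: FP4 and FP8/INT8 land under the 110 GB budget, FP16 and FP32 do not.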