# Worked Example: LLM Memory Estimation on Dell Pro Max GB10

## Problem

Estimate whether Llama 3.3 70B can run on a single Dell Pro Max GB10, and at what precision.

## Given

- **Model:** Llama 3.3 70B (70 billion parameters)
- **Available memory:** 128 GB unified LPDDR5X
- **Usable memory:** ~110 GB (after OS, framework, and runtime overhead)

## Calculation

### Step 1: Raw Model Weight Memory

| Precision | Bytes/Param | Memory for 70B |
|-----------|-------------|------------------|
| FP4 | 0.5 | 70 × 0.5 = 35 GB |
| FP8/INT8 | 1.0 | 70 × 1.0 = 70 GB |
| FP16 | 2.0 | 70 × 2.0 = 140 GB |
| FP32 | 4.0 | 70 × 4.0 = 280 GB |

### Step 2: Total Memory with Overhead (1.3× multiplier)

| Precision | Weights | Total (~1.3×) | Fits in 110 GB? |
|-----------|---------|---------------|-----------------|
| FP4 | 35 GB | ~46 GB | Yes |
| FP8/INT8 | 70 GB | ~91 GB | Yes |
| FP16 | 140 GB | ~182 GB | No |
| FP32 | 280 GB | ~364 GB | No |

### Step 3: Conclusion

- **FP4 quantized:** Fits comfortably (46/110 GB ≈ 42% utilization), leaving plenty of room for a large KV cache and bigger batch sizes.
- **FP8/INT8 quantized:** Fits (91/110 GB ≈ 83% utilization). Tight but workable for single-request inference.
- **FP16 (half precision):** Does NOT fit on a single unit; would require 2-unit stacking (see [[multi-unit-stacking]]).
- **FP32 (full precision):** Does NOT fit even with 2-unit stacking (364 GB > 2 × 110 GB).

## Verification

NVIDIA confirms Llama 3.3 70B runs locally on a single GB10 unit. This is consistent with FP8 or FP4 quantized inference, both of which our calculation shows fitting within the memory budget.

## Sources

- Memory specs: [[memory-and-storage]]
- Estimation formulas: [[equations-and-bounds]]
- Model capabilities: [[ai-workloads]]
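The two calculation steps above reduce to `params × bytes-per-param × overhead`, compared against usable memory. A minimal Python sketch, assuming the 1.3× overhead multiplier and ~110 GB usable-memory figure from this note (rules of thumb, not measured values):

```python
# Memory-fit estimate for a 70B-parameter model, per the worked example.
# PARAMS_B is in billions, so PARAMS_B * bytes/param gives GB directly.

PARAMS_B = 70        # Llama 3.3 70B
USABLE_GB = 110      # assumed usable memory after OS/framework overhead
OVERHEAD = 1.3       # assumed multiplier for KV cache, activations, buffers

BYTES_PER_PARAM = {"FP4": 0.5, "FP8/INT8": 1.0, "FP16": 2.0, "FP32": 4.0}

def estimate(precision: str) -> tuple[float, float, bool]:
    """Return (weight GB, total GB with overhead, fits in usable memory)."""
    weights = PARAMS_B * BYTES_PER_PARAM[precision]
    total = weights * OVERHEAD
    return weights, total, total <= USABLE_GB

for p in BYTES_PER_PARAM:
    w, t, fits = estimate(p)
    print(f"{p:9s} weights={w:6.1f} GB  total~{t:6.1f} GB  fits={fits}")
```

Running this reproduces the tables: FP4 and FP8/INT8 land under the 110 GB budget, FP16 and FP32 do not.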