Model Selection and Optimization Strategy
Estimated reading time: 20 minutes.
Objective
- Learn the practical decision frameworks for choosing model architectures, calculating GPU memory requirements, selecting quantization strategies, and determining when multi-GPU configurations are necessary.
Model Selection: Dense vs Mixture-of-Experts (MoE)
As a platform engineer, choosing between Dense and MoE architectures is about managing your GPU compute and VRAM budget effectively.
Dense Models
In a dense model, every parameter is activated for every token generated. For a 70B-parameter model, every token processed involves computations across all 70B parameters.
Pros:
- Highly predictable performance
- Easier to optimize for specific hardware
- Better for complex, multi-step reasoning tasks
- Lower total VRAM footprint
Cons:
- Computationally expensive per token
- Scaling intelligence linearly increases cost
Mixture-of-Experts (MoE) Models
An MoE model has a massive total parameter count (e.g., 600B+), but for any given token, only a small fraction (the "experts") are activated by a router.
Example: Mixtral 8×7B combines 8 experts of 7B parameters each (~47B total, since the experts share attention layers), but activates only 2 experts per token (~13B active).
Pros:
- You get the "intelligence" of a massive model with the "compute cost" of a much smaller one
- Faster generation (higher throughput)
- Excellent for specialized tasks (coding, writing, analysis)
Cons:
- Enormous VRAM footprint: ALL parameters must reside in memory
- Complex infrastructure requirements
- Router mistakes can degrade performance
Decision Framework
Use this matrix to select the right architecture:
| Need | Recommended Architecture | Reasoning |
|---|---|---|
| Highest Speed/Throughput | MoE (Mixtral, DeepSeek-V3) | Lower active compute per token |
| Limited GPU Memory | Dense (Llama, Mistral) | Fits in smaller total VRAM |
| Complex Multi-Step Logic | Dense (Large Parameter) | All weights available for every token |
| Diverse Multi-Tasking | MoE | Specialization via expert routing |
| High Concurrency (50+ users) | MoE | Better throughput per GPU |
| Low Concurrency (<50 users) | Dense | Don't pay for idle VRAM |

If you have high traffic with many concurrent users, MoE is usually cheaper because you can process more tokens per second per GPU. If you have low, intermittent traffic, Dense is better because you aren't paying to keep massive weights "idle" in VRAM.
Model Sizing: Calculating VRAM Requirements
Before deployment, you must calculate whether the model fits in available GPU memory. The total VRAM required is the sum of three components: model weights + KV cache + overhead.
Component 1: Model Weights
Calculate based on Total Parameters and Precision (bits per parameter).
| Precision | Bytes per Parameter | Example: 70B Model |
|---|---|---|
| FP16 / BF16 | 2 bytes | 70B × 2 = 140 GB |
| INT8 / FP8 | 1 byte | 70B × 1 = 70 GB |
| INT4 | 0.5 bytes | 70B × 0.5 = 35 GB |
Component 2: KV Cache
The "memory" of the conversation. This stores keys and values from the attention mechanism for all previous tokens in the context.
Calculation Formula:
KV_cache = 2 × batch × context × layers × hidden_dim × bytes
Where:
- 2 = Keys + Values (stored separately)
- batch = concurrent sequences (use 1 for sizing)
- context = context window length in tokens
- layers = number of transformer layers
- hidden_dim = KV hidden dimension of the attention layers (num_kv_heads × head_dim); for Grouped Query Attention models such as Llama this is much smaller than the model's full hidden dimension
- bytes = precision (FP16=2, FP8=1, INT8=1)
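A small sketch of the formula in Python, assuming a GQA model where the KV dimension is num_kv_heads × head_dim (the helper name is ours):

```python
def kv_cache_gb(context_len: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: float = 1.0, batch: int = 1) -> float:
    """KV cache in GB: 2 (K and V) x batch x context x layers x kv_dim x bytes."""
    kv_dim = kv_heads * head_dim  # per-layer KV hidden dimension (GQA)
    total_bytes = 2 * batch * context_len * layers * kv_dim * bytes_per_value
    return total_bytes / 1e9

# Llama-70B-style model (80 layers, 8 KV heads, head dim 128), FP8 cache:
print(kv_cache_gb(12_000, 80, 8, 128))    # ~2 GB
print(kv_cache_gb(128_000, 80, 8, 128))   # ~21 GB
```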
Common Model Architectures:
Llama models use Grouped Query Attention (GQA) which reduces KV cache size:
| Model | Layers | KV Heads | Head Dim | KV Cache (128K, FP8) |
|---|---|---|---|---|
| Llama-7B | 32 | 8 | 128 | ~8 GB |
| Llama-70B | 80 | 8 | 128 | ~20 GB |
| Llama-135B | 120 | 8 | 128 | ~30 GB |
Example: Llama-70B at different context lengths (FP8 KV cache)
- 12K context: 2 × 1 × 12,000 × 80 × 8 × 128 × 1 ≈ 2 GB
- 32K context: 2 × 1 × 32,000 × 80 × 8 × 128 × 1 ≈ 5 GB
- 128K context: 2 × 1 × 128,000 × 80 × 8 × 128 × 1 ≈ 20 GB

FP8 KV cache is industry standard: Modern inference engines (vLLM, TGI) default to FP8 precision for the KV cache even when model weights are FP16/INT8. This saves significant memory with negligible accuracy impact. Example: Llama-70B FP16 (140 GB weights) + 128K context = 140 GB + 20 GB = 160 GB total.

KV cache grows linearly with context length. For very long contexts (512K+ tokens), the KV cache can exceed the model weight size. Always account for the KV cache in sizing calculations.
Component 3: Overhead
Framework overhead includes:
- CUDA kernels and runtime (~2-3 GB)
- Intermediate activations
- Operating system overhead
Rule of thumb: Add 10-15% safety margin to your total calculation.
Complete Sizing Example
Scenario: Deploy Llama-70B model with FP16 weights and 128K context window
Model Specs:
- Parameters: 70B
- Layers: 80
- KV heads: 8 (Grouped Query Attention)
- Head dimension: 128

Calculation:
- Weights (FP16): 70B × 2 bytes = 140 GB
- KV Cache (FP8, standard for inference): 2 × 1 × 128,000 × 80 × 8 × 128 × 1 byte = 20,971,520,000 bytes ≈ 20 GB
- Overhead (10% margin): (140 + 20) × 0.10 ≈ 16 GB
- Total VRAM Required: 140 + 20 + 16 = 176 GB

Hardware Decision:
- Single H100-80GB? No (80 GB < 176 GB)
- Two H100-80GB via NVLink? No (160 GB covers the weights and KV cache but leaves no room for overhead)
- Three H100-80GB via NVLink? Yes (240 GB with 64 GB margin) ✓
- Two A100-80GB? No (160 GB insufficient)
- Three A100-80GB? Yes (240 GB with 64 GB margin) ✓

Best choice: 3× H100-80GB or 3× A100-80GB for a safe production margin. Note that tensor parallelism requires the attention-head count to be evenly divisible by the TP degree (see the parallelism section below), so TP=3 is not valid for Llama-70B; a 3-GPU deployment would rely on pipeline parallelism, while a pure tensor-parallel deployment would use 4 GPUs with TP=4.
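Putting the three components together, a minimal sizing sketch that mirrors the worked example above (the function names and the 10% overhead default are assumptions, not a standard API):

```python
import math

def total_vram_gb(weights_gb: float, kv_cache_gb: float, overhead_frac: float = 0.10) -> float:
    """Total VRAM = weights + KV cache + overhead margin (10% by default)."""
    return (weights_gb + kv_cache_gb) * (1 + overhead_frac)

def gpus_needed(total_gb: float, gpu_capacity_gb: float = 80.0) -> int:
    """Smallest number of GPUs whose combined memory covers the requirement."""
    return math.ceil(total_gb / gpu_capacity_gb)

total = total_vram_gb(weights_gb=140.0, kv_cache_gb=20.0)  # Llama-70B FP16, 128K context
print(total)               # ~176 GB
print(gpus_needed(total))  # 3 x 80 GB GPUs
```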
Quantization: Trading Precision for Efficiency
Quantization reduces the precision of model weights and activations, lowering memory requirements and potentially increasing speed.
Why Quantize?
Memory Savings Example: Llama-405B model
| Precision | Model Size | GPUs Required (80 GB) |
|---|---|---|
| FP16 | 810 GB | 11 GPUs |
| INT8/FP8 | 405 GB | 6 GPUs |
| INT4 | 203 GB | 3 GPUs |
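The table values follow from the same weights-only arithmetic; a short sketch (it deliberately ignores KV cache and runtime overhead, as the table does):

```python
import math

# Weights-only sizing for a 405B model at different precisions.
bytes_per_param = {"FP16": 2.0, "INT8/FP8": 1.0, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    size_gb = 405e9 * nbytes / 1e9
    gpus = math.ceil(size_gb / 80)  # 80 GB GPUs
    print(f"{precision}: {size_gb:.1f} GB -> {gpus} GPUs")
```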
Benefits:
- Reduces GPU count: Lower hardware costs
- Increases batch size: More memory for KV cache
- Can improve speed: Low-precision tensor cores accelerate computation
- Enables deployment: Models that wouldn't fit now fit
Trade-offs:
- Potential accuracy loss: Lower precision can reduce model quality
- Not all models quantize well: Some architectures are more sensitive than others
- Requires validation: Must test accuracy on your specific tasks
Quantization Formats and Hardware Alignment
Different GPU architectures have specialized hardware for different precision formats:
| Quantization Format | Best Hardware | Use Case |
|---|---|---|
| W4A16 (4-bit weights, FP16 activations) | Any GPU | Memory-constrained deployments, edge devices |
| W8A8-INT8 (8-bit weights, INT8 activations) | Ampere, Turing | High-throughput inference on older GPUs |
| W8A8-FP8 (8-bit weights, FP8 activations) | Hopper, Blackwell | Accuracy-sensitive with speed requirements |
| FP8 with 2:4 Sparsity | Hopper, Blackwell | Maximum performance on modern hardware |

Hardware Alignment Matters: Using INT8 on Ampere GPUs or FP8 on Hopper GPUs leverages dedicated tensor cores for maximum acceleration. Mismatched quantization formats run on standard CUDA cores, losing performance benefits.
Quantization Decision Framework
Ask these questions to select the right quantization scheme:
| Question | How It Drives the Decision |
|---|---|
| 1. What GPU architecture? | Ampere/Turing → INT8; Hopper/Blackwell → FP8 |
| 2. How much accuracy loss is acceptable? | <0.5% drop → W8A8 with GPTQ; 1-3% acceptable → W4A16 with AWQ; 3-5% acceptable → INT4 |
| 3. What's the workload type? | Online/interactive → weight-only quantization; Batch/offline → weight + activation quantization |
| 4. How much VRAM do you have? | Severely limited → INT4; Moderately limited → INT8/FP8; Abundant → consider staying at FP16 |

Always validate quantized model accuracy on your specific tasks before production deployment. Benchmark scores don't guarantee performance on your use cases.
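As an illustration only, the decision table can be encoded as a small helper; the thresholds and return labels below are assumptions mirroring the table, not a library API:

```python
def suggest_quantization(gpu_arch: str, acceptable_loss_pct: float) -> str:
    """Toy helper mirroring the decision table above (thresholds are illustrative)."""
    if acceptable_loss_pct < 0.5:
        # Accuracy-sensitive: 8-bit weights and activations, matched to the GPU generation
        return "W8A8-FP8" if gpu_arch in ("hopper", "blackwell") else "W8A8-INT8"
    if acceptable_loss_pct <= 3.0:
        return "W4A16 (e.g., AWQ)"  # weight-only 4-bit keeps activations in FP16
    return "INT4"                    # maximum memory savings; validate carefully

print(suggest_quantization("hopper", 0.3))  # W8A8-FP8
print(suggest_quantization("ampere", 2.0))  # W4A16 (e.g., AWQ)
```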
GPU Parallelism: When and How to Use Multiple GPUs
When a model doesn’t fit on a single GPU, you need to split it across multiple GPUs using parallelism strategies.
When Do You Need Multi-GPU?
Single GPU sufficiency check:
- Calculate total VRAM needed (weights + KV cache + overhead)
- Compare to the largest available GPU (e.g., H100-80GB)
- If the model fits with a 20% margin → single-GPU deployment
- If the model doesn't fit → multi-GPU required

Example: Llama-70B in FP16
- VRAM needed: 140 GB (weights) + 20 GB (128K context KV cache) + 16 GB (overhead) = 176 GB
- H100-80GB capacity: 80 GB
- Verdict: Requires multi-GPU (176 GB doesn't fit in 80 GB)
Tensor Parallelism (TP)
What it does: Splits each model layer across multiple GPUs. All GPUs work on the same batch of tokens simultaneously.
When to use:
- Model is too large for a single GPU
- GPUs are connected via high-speed interconnect (NVLink, NVSwitch)
- All GPUs are in the same physical node

Network requirements:
- NVLink (900 GB/s): Excellent for TP
- PCIe (64 GB/s): Acceptable for TP but slower
- Ethernet (10-100 Gb/s): Too slow for TP

Configuration:
- The number of attention heads must be evenly divisible by the TP degree
- Common TP degrees: 2, 4, 8

Example: Llama-70B across 4× H100 via NVLink
- Each GPU holds 1/4 of each layer (~35 GB per GPU for FP16 weights)
- All 4 GPUs process each token together
- Total VRAM per GPU: ~43 GB (fits comfortably in 80 GB)
Never use Tensor Parallelism across slow networks (Ethernet/PCIe Gen3). TP requires constant communication between GPUs during every forward pass. Slow networks cause 80-90% GPU utilization loss.
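A rough per-GPU estimate under tensor parallelism, assuming weights and KV cache shard evenly across the TP group while a few gigabytes of runtime overhead are paid on every GPU (the function and the 3 GB default are illustrative):

```python
def vram_per_gpu_tp(weights_gb: float, kv_cache_gb: float,
                    tp_degree: int, overhead_gb: float = 3.0) -> float:
    """Approximate per-GPU memory when weights and KV cache are sharded TP-ways."""
    return (weights_gb + kv_cache_gb) / tp_degree + overhead_gb

# Llama-70B FP16 (140 GB weights), 128K context (~20 GB FP8 KV cache), TP=4:
print(vram_per_gpu_tp(140.0, 20.0, tp_degree=4))  # ~43 GB per GPU
```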
Pipeline Parallelism (PP)
What it does: Splits model layers sequentially across GPUs. Each GPU holds complete layers and processes tokens in pipeline fashion.
Instead of splitting a single layer across multiple GPUs (as in Tensor Parallelism), Pipeline Parallelism splits the entire model vertically by groups of layers. For example, in a 40-layer model, GPU 0 might handle layers 1–10, GPU 1 handles 11–20, and so on.
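A naive even split of layers across pipeline stages might look like the sketch below (real engines balance stages by compute and memory, not just layer count; layer indices here are 0-based):

```python
def split_layers(num_layers: int, num_gpus: int) -> list[range]:
    """Assign contiguous blocks of layers to each pipeline stage, as evenly as possible."""
    base, extra = divmod(num_layers, num_gpus)
    stages, start = [], 0
    for gpu in range(num_gpus):
        count = base + (1 if gpu < extra else 0)
        stages.append(range(start, start + count))
        start += count
    return stages

# A 40-layer model on 4 GPUs: layers 0-9, 10-19, 20-29, 30-39
for gpu, stage in enumerate(split_layers(40, 4)):
    print(f"GPU {gpu}: layers {stage.start}-{stage.stop - 1}")
```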
When to use:
- Model spans multiple physical nodes
- Interconnect is slower (InfiniBand, Ethernet)
- Prefer batch throughput over latency

Network requirements:
- InfiniBand (100-400 Gb/s): Good for PP
- Ethernet (100+ Gb/s): Acceptable for PP

Trade-offs:
- Higher latency than TP (tokens pass through GPUs sequentially)
- Better batch efficiency
- Works across nodes
Choosing the Right Parallelism Strategy
| Scenario | Recommended Strategy | Configuration | Reasoning |
|---|---|---|---|
| Model barely doesn't fit on a single GPU | Consider quantization first | INT8 or FP8 | Simpler than multi-GPU |
| Model fits on 2-8 GPUs in the same node | Tensor Parallelism (TP) | TP=2 to TP=8 via NVLink | Best latency, simple setup |
| Model requires 8+ GPUs across nodes | Hybrid TP+PP | TP=8 per node, PP=2+ across nodes | Balance latency and scale |
| Slow interconnect between nodes | Pipeline Parallelism (PP) | PP across nodes | Avoid TP communication overhead |

Before configuring multi-GPU parallelism, always consider quantization first. A 70B model in FP16 requiring 2 GPUs might fit on a single GPU with INT8 quantization, simplifying deployment significantly.
Summary: Putting It All Together
When deploying a new model, follow this decision sequence:
1. Select architecture (Dense vs MoE) based on workload characteristics
2. Calculate VRAM requirements (weights + KV cache + overhead)
3. Evaluate quantization options aligned to your GPU hardware
4. Determine GPU configuration:
   - Single GPU if the model fits with margin
   - Multi-GPU with TP if NVLink is available
   - Multi-GPU with PP if spanning nodes
5. Validate the configuration with a test deployment
Key takeaways:
- Dense models: predictable, easier to deploy, better for reasoning
- MoE models: faster, higher throughput, require more VRAM
- Always calculate VRAM before deployment to avoid surprises
- Quantization can reduce GPU count significantly
- Match quantization format to GPU architecture (INT8 for Ampere, FP8 for Hopper)
- Use TP for single-node, high-speed interconnect scenarios
- Use PP for multi-node or slower interconnect scenarios
- Consider quantization before multi-GPU as a simpler solution
What’s Next
Ready to practice these concepts? Continue to Section 2: Sizing and Parallelism Lab for hands-on exercises in VRAM calculation, parallelism planning, and quantization selection.