Single-Node Multi-GPU

LLMBoost provides intelligent multi-GPU parallelism to maximize performance and handle large models efficiently. Scale inference across multiple GPUs on a single server with automatic or manual configuration.

Why Multi-GPU Matters

  • Handle Larger Models - Deploy models that don't fit on a single GPU
  • Increased Throughput - Serve more concurrent requests
  • Flexible Scaling - Use exactly the resources you need
  • Automatic Configuration - Let LLMBoost choose optimal settings

[Diagram: single-node multi-GPU parallelism]


Parallelism Strategies

LLMBoost supports multiple dimensions of parallelism that can be used independently or combined:

1. Tensor Parallelism (TP)

Shards tensors within a layer across GPUs for models too large to fit on a single GPU.

  • When to use: Model exceeds single GPU memory
  • Parameter: --tp <N> or tp=<N>
  • Example: --tp 2 splits the model across 2 GPUs

How it works: Each GPU holds a portion of each layer and processes its slice of the computation in parallel.
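
The sharding idea can be sketched in plain Python (a toy illustration, not LLMBoost's actual kernels): each "GPU" multiplies the activations against its column slice of the weight matrix, and the partial outputs are gathered back together.

```python
def matmul(x, W):
    """Plain matrix multiply: x is m x k, W is k x n."""
    return [[sum(x[i][t] * W[t][j] for t in range(len(W)))
             for j in range(len(W[0]))] for i in range(len(x))]

x = [[1.0, 2.0], [3.0, 4.0]]   # activations (2 x 2)
W = [[1.0, 0.0, 2.0, 1.0],     # full weight matrix (2 x 4)
     [0.0, 1.0, 1.0, 2.0]]

# TP degree 2: "GPU 0" holds W's first two columns, "GPU 1" the rest
W0 = [row[:2] for row in W]
W1 = [row[2:] for row in W]

# Each device multiplies against its own shard independently...
y0, y1 = matmul(x, W0), matmul(x, W1)

# ...then an all-gather concatenates the partial outputs
y_tp = [a + b for a, b in zip(y0, y1)]

assert y_tp == matmul(x, W)  # identical to the unsharded result
```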

2. Data Parallelism (DP)

Replicates the model across GPUs to handle concurrent requests efficiently.

  • When to use: Multiple concurrent users or high request volume
  • Parameter: --dp <N> or dp=<N>
  • Example: --dp 4 creates 4 model replicas

How it works: Multiple complete copies of the model process different requests independently.
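
A minimal sketch of the dispatch pattern (round-robin here is an illustrative policy; LLMBoost's actual scheduler may balance load differently):

```python
from itertools import cycle

# DP degree 4: four independent model replicas, each on its own GPU(s)
replicas = [f"replica-{i}" for i in range(4)]
dispatch = cycle(replicas)

# Incoming requests are routed round-robin across the replicas,
# so each copy of the model serves a different slice of the traffic
assignments = [(req, next(dispatch)) for req in range(8)]
# requests 0 and 4 both land on replica-0, 1 and 5 on replica-1, etc.
```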

3. Pipeline Parallelism (PP)

Coming Soon

Pipeline Parallelism support is under development for ultra-large models.


Choosing the Right Strategy

| Use Case | Recommended Strategy | Example |
| --- | --- | --- |
| Model doesn't fit on 1 GPU | Tensor Parallelism | --tp 2 or --tp 4 |
| Multiple concurrent users | Data Parallelism | --dp 4 or --dp 8 |
| Maximize GPU utilization | Combine TP + DP | --tp 2 --dp 4 |
| Automatic optimization | Set both to 0 | --tp 0 --dp 0 |
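
The table above can be captured as a rough heuristic. The function below is illustrative only, not LLMBoost's internal auto-tuner: pick the smallest power-of-two TP degree that fits the model, then spend the remaining GPUs on DP.

```python
def choose_parallelism(model_mem_gb: float, gpu_mem_gb: float, num_gpus: int):
    """Illustrative heuristic (not LLMBoost's auto-tuner): smallest
    power-of-two TP that fits the model weights, rest goes to DP."""
    tp = 1
    while model_mem_gb / tp > gpu_mem_gb and tp < num_gpus:
        tp *= 2
    dp = num_gpus // tp
    return tp, dp

# Llama-3.1-70B in FP16 (~140 GB) on 8x A100-40GB -> tp=4, dp=2
print(choose_parallelism(140, 40, 8))
```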

Usage Examples

Automatic Configuration

Let LLMBoost automatically determine the optimal parallelism strategy:

# LLMBoost automatically configures tp and dp
lbh serve meta-llama/Llama-3.1-70B-Instruct

Manual Configuration

Specify parallelism manually for fine-grained control:

# Tensor parallelism across 2 GPUs
lbh serve meta-llama/Llama-3.1-70B-Instruct -- \
--tp 2 --dp 1

# Data parallelism across 4 GPUs
lbh serve meta-llama/Llama-3.1-8B-Instruct -- \
--tp 1 --dp 4

# Combined: 2x TP, 4x DP (uses 8 GPUs total)
lbh serve meta-llama/Llama-3.1-70B-Instruct -- \
--tp 2 --dp 4

Configuration Parameters

Core Parallelism Settings

| Parameter | Description | Values | Default |
| --- | --- | --- | --- |
| --tp / tp | Tensor parallelism degree | Integer or 0 (auto) | 0 (auto) |
| --dp / dp | Data parallelism degree | Integer or 0 (auto) | 0 (auto) |

Memory and Performance Tuning

| Parameter | Description | Default |
| --- | --- | --- |
| --max_model_len | Maximum sequence length | Auto-detected |
| --max_num_seqs | Max concurrent sequences | Auto-configured |
| --max_num_batched_tokens | Max tokens in prefill batch | Auto-configured |

Example with Advanced Settings

llmboost serve \
--model_name meta-llama/Llama-3.1-70B-Instruct \
--tp 2 --dp 4 \
--max_model_len 8192 \
--max_num_seqs 256

Performance Considerations

GPU Memory Requirements

Estimate GPU memory needed:

Memory per GPU ≈ (Model Size / TP degree) + KV Cache

Examples (approximate):

  • Llama-3.1-8B (FP16): ~16 GB => Fits on 1x A100 (40GB)
  • Llama-3.1-70B (FP16): ~140 GB => Requires --tp 2 or higher on A100s
  • Llama-3.1-70B (FP8): ~70 GB => Fits on 2x A100 with --tp 2
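
Based on the formula above, a quick back-of-the-envelope helper (2 bytes per parameter assumes FP16, 1 byte assumes FP8; the KV cache term is an input here, not computed):

```python
def mem_per_gpu_gb(params_billion: float, bytes_per_param: float,
                   tp: int, kv_cache_gb: float = 0.0) -> float:
    """Rough estimate from the formula above:
    memory per GPU ~= (model size / TP degree) + KV cache."""
    model_gb = params_billion * bytes_per_param
    return model_gb / tp + kv_cache_gb

# Llama-3.1-70B at FP16 (2 bytes/param) with --tp 4:
# ~140 GB / 4 = 35 GB per GPU before KV cache
print(mem_per_gpu_gb(70, 2, 4))
```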

Throughput Optimization

For maximum throughput:

  1. Start with automatic configuration: --tp 0 --dp 0
  2. Monitor GPU utilization: Ensure all GPUs are used
  3. Increase data parallelism: If GPUs are underutilized
  4. Fine-tune batch sizes: Adjust --max_num_seqs for your workload
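
The steps above can be expressed as a toy decision rule (the 50% and 90% utilization thresholds are hypothetical; tune them for your workload):

```python
def suggest_next_step(gpu_utils, dp, num_gpus):
    """Toy decision rule mirroring the steps above. The 0.5/0.9
    thresholds are hypothetical; assumes TP = 1, so doubling DP
    needs dp * 2 GPUs."""
    if min(gpu_utils) < 0.5 and dp * 2 <= num_gpus:
        return "increase --dp"
    if max(gpu_utils) > 0.9:
        return "tune --max_num_seqs"
    return "keep current config"

# Two of four workers are nearly idle with spare GPUs available
print(suggest_next_step([0.3, 0.35, 0.9, 0.88], dp=2, num_gpus=8))
# -> increase --dp
```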

Best Practices

  • Use auto-configuration first - Let LLMBoost optimize for you
  • TP must divide the model evenly - Choose TP degrees that evenly divide the model's attention head count
  • Total GPUs = TP × DP - Plan your GPU allocation accordingly
  • Monitor memory usage - Leave headroom for KV cache growth
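
The last two practices lend themselves to a quick sanity check before launching (illustrative only; LLMBoost performs its own validation, and 64 is Llama-3.1-70B's attention head count):

```python
def validate_config(tp: int, dp: int, num_gpus: int, num_heads: int) -> None:
    """Sanity-check a TP/DP plan against available GPUs and the
    model's attention head count (illustrative, not LLMBoost's own
    validation logic)."""
    assert tp * dp <= num_gpus, "Total GPUs = TP x DP exceeds available GPUs"
    assert num_heads % tp == 0, "TP must evenly divide the attention head count"

# Llama-3.1-70B (64 attention heads) with --tp 2 --dp 4 on 8 GPUs: OK
validate_config(tp=2, dp=4, num_gpus=8, num_heads=64)
```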


Verification

After starting the server with LLMBOOST_LOG_LEVEL=DEBUG, the parallelism configuration will be printed in the terminal. You should see output similar to:

# Auto TP: 2 Auto DP: 4

Test with concurrent requests to see data parallelism in action:

from openai import OpenAI
import concurrent.futures

client = OpenAI(base_url="http://localhost:8011/v1", api_key="-")

def make_request(i):
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[{"role": "user", "content": f"Request {i}"}],
        max_tokens=50,
    )
    return response.choices[0].message.content

# Send 10 concurrent requests
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(make_request, range(10)))

print(f"Completed {len(results)} concurrent requests")

Troubleshooting

Out of Memory Errors

If you encounter Out of Memory (OOM) errors:

  1. Increase TP degree: Reduce memory per GPU
  2. Reduce max_model_len: Decrease context window
  3. Lower gpu_memory_utilization: Reduce to 0.85 or 0.80
  4. Use quantization: Enable FP8 quantization
# Increase TP and enable FP8 quantization
llmboost serve \
--model_name meta-llama/Llama-3.1-70B-Instruct \
--tp 4 \
--quantization fp8 \
--gpu_memory_utilization 0.85

Underutilized GPUs

If some GPUs show low utilization:

  1. Increase DP degree: Add more workers
  2. Increase concurrent requests: Higher load
  3. Adjust max_num_seqs: Allow more concurrent sequences


Questions? Contact contact@mangoboost.io