Single-Node Multi-GPU
LLMBoost provides intelligent multi-GPU parallelism to maximize performance and handle large models efficiently. Scale inference across multiple GPUs on a single server with automatic or manual configuration.
Why Multi-GPU Matters
- Handle Larger Models - Deploy models that don't fit on a single GPU
- Increased Throughput - Serve more concurrent requests
- Flexible Scaling - Use exactly the resources you need
- Automatic Configuration - Let LLMBoost choose optimal settings
Parallelism Strategies
LLMBoost supports multiple dimensions of parallelism that can be used independently or combined:
1. Tensor Parallelism (TP)
Shards tensors within a layer across GPUs for models too large to fit on a single GPU.
- When to use: Model exceeds single GPU memory
- Parameter: `--tp <N>` or `tp=<N>`
- Example: `--tp 2` splits the model across 2 GPUs
How it works: Each GPU holds a portion of each layer and processes its slice of the computation in parallel.
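The sharding idea can be illustrated with plain NumPy. This is a minimal sketch of 2-way tensor parallelism for a single linear layer, not LLMBoost internals: each "GPU" holds half of the weight columns, computes its slice, and the slices are gathered into the full output.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # batch of activations
W = rng.standard_normal((8, 16))     # full weight matrix

shards = np.split(W, 2, axis=1)      # column shards, one per "GPU"
partial = [x @ w for w in shards]    # each GPU computes its slice
y = np.concatenate(partial, axis=1)  # gather the slices

assert np.allclose(y, x @ W)         # matches the unsharded result
```

In a real deployment the gather is a collective communication across GPUs, which is why TP works best with fast inter-GPU links.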
2. Data Parallelism (DP)
Replicates the model across GPUs to handle concurrent requests efficiently.
- When to use: Multiple concurrent users or high request volume
- Parameter: `--dp <N>` or `dp=<N>`
- Example: `--dp 4` creates 4 model replicas
How it works: Multiple complete copies of the model process different requests independently.
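Conceptually, data parallelism behaves like a scheduler spreading independent requests over identical replicas. The sketch below shows one simple policy (round-robin); LLMBoost's actual dispatch logic may differ.

```python
from itertools import cycle

# dp=4 -> 4 complete copies of the model
replicas = [f"replica-{i}" for i in range(4)]
scheduler = cycle(replicas)  # round-robin over the replicas

# Assign 8 incoming requests; each replica handles every 4th request
# independently of the others.
assignments = {req: next(scheduler) for req in range(8)}
print(assignments[0], assignments[4])  # both land on replica-0
```

Because replicas never communicate during inference, DP scales throughput almost linearly as long as there are enough concurrent requests to keep every replica busy.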
3. Pipeline Parallelism (PP)
Pipeline Parallelism support is under development for ultra-large models.
Choosing the Right Strategy
| Use Case | Recommended Strategy | Example |
|---|---|---|
| Model doesn't fit on 1 GPU | Tensor Parallelism | --tp 2 or --tp 4 |
| Multiple concurrent users | Data Parallelism | --dp 4 or --dp 8 |
| Maximize GPU utilization | Combine TP + DP | --tp 2 --dp 4 |
| Automatic optimization | Set both to 0 | --tp 0 --dp 0 |
Usage Examples
- Using LLMBoost Hub
- Manual Setup
- LLMBoost SDK
Automatic Configuration (Recommended)
Let LLMBoost automatically determine the optimal parallelism strategy:
```bash
# LLMBoost automatically configures tp and dp
lbh serve meta-llama/Llama-3.1-70B-Instruct
```
Manual Configuration
Specify parallelism manually for fine-grained control:
```bash
# Tensor parallelism across 2 GPUs
lbh serve meta-llama/Llama-3.1-70B-Instruct -- \
  --tp 2 --dp 1

# Data parallelism across 4 GPUs
lbh serve meta-llama/Llama-3.1-8B-Instruct -- \
  --tp 1 --dp 4

# Combined: 2x TP, 4x DP (uses 8 GPUs total)
lbh serve meta-llama/Llama-3.1-70B-Instruct -- \
  --tp 2 --dp 4
```
Automatic Configuration
```bash
# Inside LLMBoost container
llmboost serve \
  --model_name meta-llama/Llama-3.1-70B-Instruct \
  --tp 0 --dp 0  # Automatic
```
Manual Configuration
```bash
# Tensor parallelism
llmboost serve \
  --model_name meta-llama/Llama-3.1-70B-Instruct \
  --tp 2 --dp 1

# Data parallelism
llmboost serve \
  --model_name meta-llama/Llama-3.1-8B-Instruct \
  --tp 1 --dp 4

# Combined parallelism
llmboost serve \
  --model_name meta-llama/Llama-3.1-70B-Instruct \
  --tp 2 --dp 4
```
The code below should be executed inside the LLMBoost Docker container, which you can start using the manual Docker setup.

```python
from llmboost import LLMBoost

if __name__ == '__main__':
    # Automatic configuration
    llm = LLMBoost(
        model_name="meta-llama/Llama-3.1-70B-Instruct",
        tp=0,  # Automatic
        dp=0   # Automatic
    )

    # Manual tensor parallelism
    llm = LLMBoost(
        model_name="meta-llama/Llama-3.1-70B-Instruct",
        tp=2,  # Split across 2 GPUs
        dp=1
    )

    # Manual data parallelism
    llm = LLMBoost(
        model_name="meta-llama/Llama-3.1-8B-Instruct",
        tp=1,
        dp=4   # 4 model replicas
    )

    # Combined parallelism
    llm = LLMBoost(
        model_name="meta-llama/Llama-3.1-70B-Instruct",
        tp=2,  # 2-way tensor parallelism
        dp=4   # 4 data parallel workers (8 GPUs total)
    )

    llm.start()
```
Configuration Parameters
Core Parallelism Settings
| Parameter | Description | Values | Default |
|---|---|---|---|
| --tp / tp | Tensor parallelism degree | Integer or 0 (auto) | 0 (auto) |
| --dp / dp | Data parallelism degree | Integer or 0 (auto) | 0 (auto) |
Memory and Performance Tuning
| Parameter | Description | Default |
|---|---|---|
--max_model_len | Maximum sequence length | Auto-detected |
--max_num_seqs | Max concurrent sequences | Auto-configured |
--max_num_batched_tokens | Max tokens in prefill batch | Auto-configured |
Example with Advanced Settings
```bash
llmboost serve \
  --model_name meta-llama/Llama-3.1-70B-Instruct \
  --tp 2 --dp 4 \
  --max_model_len 8192 \
  --max_num_seqs 256
```
Performance Considerations
GPU Memory Requirements
Estimate GPU memory needed:
Memory per GPU ≈ (Model Size / TP degree) + KV Cache
Examples (approximate):
- Llama-3.1-8B (FP16): ~16 GB => Fits on 1x A100 (40GB)
- Llama-3.1-70B (FP16): ~140 GB => Requires `--tp 2` or higher on A100s
- Llama-3.1-70B (FP8): ~70 GB => Fits on 2x A100 with `--tp 2`
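The formula above can be turned into a quick back-of-the-envelope estimator. This is an illustrative sketch, not an LLMBoost utility: it assumes 2 bytes per parameter for FP16, and the KV-cache term is a placeholder you should size for your own sequence lengths and batch sizes.

```python
def memory_per_gpu_gb(params_billions, tp, bytes_per_param=2, kv_cache_gb=8):
    """Rough per-GPU memory: (model size / TP degree) + KV cache."""
    weights_gb = params_billions * 1e9 * bytes_per_param / 1e9 / tp
    return weights_gb + kv_cache_gb

# Llama-3.1-70B at FP16 with --tp 2: 140 GB / 2 + 8 GB KV cache
print(memory_per_gpu_gb(70, tp=2))  # prints 78.0
```

A result near or above your GPU's capacity means you should raise the TP degree, quantize (e.g. FP8 halves `bytes_per_param`), or shorten the context.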
Throughput Optimization
For maximum throughput:
- Start with automatic configuration: `--tp 0 --dp 0`
- Monitor GPU utilization: Ensure all GPUs are used
- Increase data parallelism: If GPUs are underutilized
- Fine-tune batch sizes: Adjust `--max_num_seqs` for your workload
Best Practices
- Use auto-configuration first - Let LLMBoost optimize for you
- TP must divide the model evenly - Use TP degrees that evenly divide the model's attention-head count (powers of two are a safe choice)
- Total GPUs = TP × DP - Plan your GPU allocation accordingly
- Monitor memory usage - Leave headroom for KV cache growth
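The Total GPUs = TP × DP rule is easy to check before launching. The helper below is hypothetical (not part of the LLMBoost SDK), but shows the arithmetic to validate against your node's GPU count.

```python
def plan_gpus(tp, dp, available):
    """Verify a TP x DP layout fits on the available GPUs."""
    needed = tp * dp
    if needed > available:
        raise ValueError(f"tp*dp = {needed} exceeds {available} available GPUs")
    return needed

# --tp 2 --dp 4 on an 8-GPU server uses exactly 8 GPUs
assert plan_gpus(2, 4, available=8) == 8
```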
Verification
After starting the server with LLMBOOST_LOG_LEVEL=DEBUG, the parallelism configuration will be printed in the terminal. You should see output similar to:
```
# Auto TP: 2 Auto DP: 4
```
Test with concurrent requests to see data parallelism in action:
```python
from openai import OpenAI
import concurrent.futures

client = OpenAI(base_url="http://localhost:8011/v1", api_key="-")

def make_request(i):
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[{"role": "user", "content": f"Request {i}"}],
        max_tokens=50
    )
    return response.choices[0].message.content

# Send 10 concurrent requests
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(make_request, range(10)))

print(f"Completed {len(results)} concurrent requests")
```
Troubleshooting
Out of Memory Errors
If you encounter Out of Memory (OOM) errors:
- Increase TP degree: Reduce memory per GPU
- Reduce max_model_len: Decrease context window
- Lower gpu_memory_utilization: Reduce to 0.85 or 0.80
- Use quantization: Enable FP8 quantization
```bash
# Increase TP and enable FP8 to reduce per-GPU memory pressure
llmboost serve \
  --model_name meta-llama/Llama-3.1-70B-Instruct \
  --tp 4 \
  --quantization fp8 \
  --gpu_memory_utilization 0.85
```
Underutilized GPUs
If some GPUs show low utilization:
- Increase DP degree: Add more workers
- Increase concurrent requests: Higher load
- Adjust max_num_seqs: Allow more concurrent sequences
Next Steps
- Streaming - Enable real-time token generation
- LBH Advanced Usage - Advanced container workflows
- LLMBoost Speedup - See LLMBoost's speedups
Questions? Contact contact@mangoboost.io