Single-Node Multi-GPU

LLMBoost provides intelligent multi-GPU parallelism to maximize performance and handle large models efficiently. Scale inference across multiple GPUs on a single server with automatic or manual configuration.

Why Multi-GPU Matters

  • Handle Larger Models - Deploy models that don't fit on a single GPU
  • Increased Throughput - Serve more concurrent requests
  • Flexible Scaling - Use exactly the resources you need
  • Automatic Configuration - Let LLMBoost choose optimal settings

[Diagram: single-node multi-GPU parallelism]


Parallelism Strategies

LLMBoost supports multiple dimensions of parallelism that can be used independently or combined:

1. Tensor Parallelism (TP)

Shards tensors within a layer across GPUs for models too large to fit on a single GPU.

  • When to use: Model exceeds single GPU memory
  • Parameter: --tp <N> or tp=<N>
  • Example: --tp 2 splits the model across 2 GPUs

How it works: Each GPU holds a portion of each layer and processes its slice of the computation in parallel.
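
The sharding idea can be sketched in plain Python (a toy illustration, not LLMBoost's actual kernels): each "GPU" multiplies the activations against its column slice of the weight matrix, and the partial outputs are gathered back together.

```python
def matmul(x, W):
    """Plain matrix multiply: x is m x k, W is k x n."""
    return [[sum(x[i][t] * W[t][j] for t in range(len(W)))
             for j in range(len(W[0]))] for i in range(len(x))]

x = [[1.0, 2.0], [3.0, 4.0]]   # activations (2 x 2)
W = [[1.0, 0.0, 2.0, 1.0],     # full weight matrix (2 x 4)
     [0.0, 1.0, 1.0, 2.0]]

# TP degree 2: "GPU 0" holds W's first two columns, "GPU 1" the rest
W0 = [row[:2] for row in W]
W1 = [row[2:] for row in W]

# Each device multiplies against its own shard independently...
y0, y1 = matmul(x, W0), matmul(x, W1)

# ...then an all-gather concatenates the partial outputs
y_tp = [a + b for a, b in zip(y0, y1)]

assert y_tp == matmul(x, W)  # identical to the unsharded result
```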

2. Data Parallelism (DP)

Replicates the model across GPUs to handle concurrent requests efficiently.

  • When to use: Multiple concurrent users or high request volume
  • Parameter: --dp <N> or dp=<N>
  • Example: --dp 4 creates 4 model replicas

How it works: Multiple complete copies of the model process different requests independently.
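
A minimal sketch of the dispatch pattern (round-robin here is an illustrative policy; LLMBoost's actual scheduler may balance load differently):

```python
from itertools import cycle

# DP degree 4: four independent model replicas, each on its own GPU(s)
replicas = [f"replica-{i}" for i in range(4)]
dispatch = cycle(replicas)

# Incoming requests are routed round-robin across the replicas,
# so each copy of the model serves a different slice of the traffic
assignments = [(req, next(dispatch)) for req in range(8)]
# requests 0 and 4 both land on replica-0, 1 and 5 on replica-1, etc.
```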

3. Pipeline Parallelism (PP)

Coming Soon

Pipeline Parallelism support is under development for ultra-large models.


Choosing the Right Strategy

| Use Case | Recommended Strategy | Example |
| --- | --- | --- |
| Model doesn't fit on 1 GPU | Tensor Parallelism | --tp 2 or --tp 4 |
| Multiple concurrent users | Data Parallelism | --dp 4 or --dp 8 |
| Maximize GPU utilization | Combine TP + DP | --tp 2 --dp 4 |
| Automatic optimization | Set both to 0 | --tp 0 --dp 0 |
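
The table above can be captured as a rough heuristic. The function below is illustrative only, not LLMBoost's internal auto-tuner: pick the smallest power-of-two TP degree that fits the model, then spend the remaining GPUs on DP.

```python
def choose_parallelism(model_mem_gb: float, gpu_mem_gb: float, num_gpus: int):
    """Illustrative heuristic (not LLMBoost's auto-tuner): smallest
    power-of-two TP that fits the model weights, rest goes to DP."""
    tp = 1
    while model_mem_gb / tp > gpu_mem_gb and tp < num_gpus:
        tp *= 2
    dp = num_gpus // tp
    return tp, dp

# Llama-3.1-70B in FP16 (~140 GB) on 8x A100-40GB -> tp=4, dp=2
print(choose_parallelism(140, 40, 8))
```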

Usage Examples

Automatic Configuration

Let LLMBoost automatically determine the optimal parallelism strategy:

# LLMBoost automatically configures tp and dp
lbh serve meta-llama/Llama-3.1-70B-Instruct

Manual Configuration

Specify parallelism manually for fine-grained control:

# Tensor parallelism across 2 GPUs
lbh serve meta-llama/Llama-3.1-70B-Instruct -- \
--tp 2 --dp 1

# Data parallelism across 4 GPUs
lbh serve meta-llama/Llama-3.1-8B-Instruct -- \
--tp 1 --dp 4

# Combined: 2x TP, 4x DP (uses 8 GPUs total)
lbh serve meta-llama/Llama-3.1-70B-Instruct -- \
--tp 2 --dp 4

Configuration Parameters

Core Parallelism Settings

| Parameter | Description | Values | Default |
| --- | --- | --- | --- |
| --tp / tp | Tensor parallelism degree | Integer or 0 (auto) | 0 (auto) |
| --dp / dp | Data parallelism degree | Integer or 0 (auto) | 0 (auto) |

Memory and Performance Tuning

| Parameter | Description | Default |
| --- | --- | --- |
| --max_model_len | Maximum sequence length | Auto-detected |
| --max_num_seqs | Max concurrent sequences | Auto-configured |
| --max_num_batched_tokens | Max tokens in prefill batch | Auto-configured |

Example with Advanced Settings

llmboost serve \
--model_name meta-llama/Llama-3.1-70B-Instruct \
--tp 2 --dp 4 \
--max_model_len 8192 \
--max_num_seqs 256

Performance Considerations

GPU Memory Requirements

Estimate GPU memory needed:

Memory per GPU ≈ (Model Size / TP degree) + KV Cache

Examples (approximate):

  • Llama-3.1-8B (FP16): ~16 GB => Fits on 1x A100 (40GB)
  • Llama-3.1-70B (FP16): ~140 GB => Requires --tp 2 or higher on A100s
  • Llama-3.1-70B (FP8): ~70 GB => Fits on 2x A100 with --tp 2
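
Based on the formula above, a quick back-of-the-envelope helper (2 bytes per parameter assumes FP16, 1 byte assumes FP8; the KV cache term is an input here, not computed):

```python
def mem_per_gpu_gb(params_billion: float, bytes_per_param: float,
                   tp: int, kv_cache_gb: float = 0.0) -> float:
    """Rough estimate from the formula above:
    memory per GPU ~= (model size / TP degree) + KV cache."""
    model_gb = params_billion * bytes_per_param
    return model_gb / tp + kv_cache_gb

# Llama-3.1-70B at FP16 (2 bytes/param) with --tp 4:
# ~140 GB / 4 = 35 GB per GPU before KV cache
print(mem_per_gpu_gb(70, 2, 4))
```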

Throughput Optimization

For maximum throughput:

  1. Start with automatic configuration: --tp 0 --dp 0
  2. Monitor GPU utilization: Ensure all GPUs are used
  3. Increase data parallelism: If GPUs are underutilized
  4. Fine-tune batch sizes: Adjust --max_num_seqs for your workload
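
The steps above can be expressed as a toy decision rule (the 50% and 90% utilization thresholds are hypothetical; tune them for your workload):

```python
def suggest_next_step(gpu_utils, dp, num_gpus):
    """Toy decision rule mirroring the steps above. The 0.5/0.9
    thresholds are hypothetical; assumes TP = 1, so doubling DP
    needs dp * 2 GPUs."""
    if min(gpu_utils) < 0.5 and dp * 2 <= num_gpus:
        return "increase --dp"
    if max(gpu_utils) > 0.9:
        return "tune --max_num_seqs"
    return "keep current config"

# Two of four workers are nearly idle with spare GPUs available
print(suggest_next_step([0.3, 0.35, 0.9, 0.88], dp=2, num_gpus=8))
# -> increase --dp
```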

Best Practices

  • Use auto-configuration first - Let LLMBoost optimize for you
  • TP must divide the model evenly - Choose TP degrees that evenly divide the model's attention head count
  • Total GPUs = TP × DP - Plan your GPU allocation accordingly
  • Monitor memory usage - Leave headroom for KV cache growth
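
The last two practices lend themselves to a quick sanity check before launching (illustrative only; LLMBoost performs its own validation, and 64 is Llama-3.1-70B's attention head count):

```python
def validate_config(tp: int, dp: int, num_gpus: int, num_heads: int) -> None:
    """Sanity-check a TP/DP plan against available GPUs and the
    model's attention head count (illustrative, not LLMBoost's own
    validation logic)."""
    assert tp * dp <= num_gpus, "Total GPUs = TP x DP exceeds available GPUs"
    assert num_heads % tp == 0, "TP must evenly divide the attention head count"

# Llama-3.1-70B (64 attention heads) with --tp 2 --dp 4 on 8 GPUs: OK
validate_config(tp=2, dp=4, num_gpus=8, num_heads=64)
```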


Verification

After starting the server with LLMBOOST_LOG_LEVEL=DEBUG, the parallelism configuration will be printed in the terminal. You should see output similar to:

# Auto TP: 2 Auto DP: 4

Test with concurrent requests to see data parallelism in action:

from openai import OpenAI
import concurrent.futures

client = OpenAI(base_url="http://localhost:8011/v1", api_key="-")

def make_request(i):
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[{"role": "user", "content": f"Request {i}"}],
        max_tokens=50,
    )
    return response.choices[0].message.content

# Send 10 concurrent requests
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(make_request, range(10)))

print(f"Completed {len(results)} concurrent requests")

Troubleshooting

Out of Memory Errors

If you encounter Out of Memory (OOM) errors:

  1. Increase TP degree: Reduce memory per GPU
  2. Reduce max_model_len: Decrease context window
  3. Lower gpu_memory_utilization: Reduce to 0.85 or 0.80
  4. Use quantization: Enable FP8 quantization
# Increase TP and enable FP8 quantization
llmboost serve \
--model_name meta-llama/Llama-3.1-70B-Instruct \
--tp 4 \
--quantization fp8 \
--gpu_memory_utilization 0.85

Underutilized GPUs

If some GPUs show low utilization:

  1. Increase DP degree: Add more workers
  2. Increase concurrent requests: Higher load
  3. Adjust max_num_seqs: Allow more concurrent sequences


Questions? Contact contact@mangoboost.io