Skip to main content

Multi-Node Deployment

Deploy LLMBoost across multiple nodes in a Kubernetes cluster with automatic orchestration, load balancing, and monitoring. Scale your inference infrastructure to handle production workloads with enterprise-grade reliability.

Why Multi-Node Matters

Horizontal Scaling - Distribute models across multiple servers
High Availability - Eliminate single points of failure
Enterprise Ready - Built-in monitoring and management UIs
Automatic Orchestration - Kubernetes-native deployment with Helm

Multi-Node Deployment Diagram


Prerequisites

Required Infrastructure

Before deploying multi-node LLMBoost, ensure you have the following installed AND running:

  • Kubernetes (1.32.0 or higher)
  • CNI (Calico or Flannel)
  • Helm (version 3.19 or higher https://helm.sh/docs/intro/install/)
  • Latest CUDA and ROCm drivers for Nvidia and AMD GPUs, respectively

Required Tools

ToolPurposeVerification
KubernetesContainer orchestrationkubectl version
HelmPackage managementhelm version
DockerContainer runtimedocker ps

Verify Prerequisites

# Verify Kubernetes cluster is accessible
kubectl cluster-info # Should show: `Kubernetes control plane is running at https://<manager-node>:<port>'

# Verify Helm is installed
helm version # Should be v3.19 or higher

# Verify Docker is running
docker ps
docker --version # Should be 27.3 or higher

# Verify lbh is installed
lbh --version # Should be v0.3.0 or higher

Architecture Overview

LLMBoost multi-node deployment uses a Kubernetes-native architecture:

  • Custom Resource Definitions (CRDs): Define model deployments declaratively
  • Operator: Automatically reconciles desired state with actual deployment
  • Load Balancing: Kubernetes Services distribute traffic across model replicas
  • Monitoring: Built-in Grafana dashboards for metrics visualization
  • Management UI: Headlamp-based interface for cluster administration

Usage Examples

Step 1: Install Cluster Infrastructure

Where to Run Commands

All lbh cluster commands should be run on the control/manager node with access to the Kubernetes cluster. Please contact your cluster administrator if unsure.

Install the LLMBoost multi-node infrastructure using:

lbh cluster install

This command will:

  • Create the llmboost namespace
  • Install the LLMBoost Operator via Helm chart
  • Start the management and monitoring services
  • If a cluster configuration file exists at ~/.llmboost_hub/cluster_config.json, it will also deploy the specified model deployments automatically.

After installation completes, you'll receive:

  • Management UI Token: For accessing the Kubernetes dashboard
  • Monitoring UI Credentials: Username and password for the Monitoring UI
note

After installing the helm chart using lbh cluster install, full unmasked credentials are displayed in the output. To see them again at any time, use lbh cluster --show-secrets.

Step 2: Configure Model Deployments

Create a cluster configuration file (e.g., cluster_config.json):

Basic functional example

Deploy one model on any node (node_replicas: 1 means LLMBoost will pick one available node):

{
"schema_version": "1.0",
"cluster": {
"name": "production-cluster"
},
"model_deployments": [
{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"node_replicas": 1
}
]
}

Advanced example template

{
"schema_version": "1.0",
"cluster": {
"name": "production-cluster",
"huggingfaceToken": "hf_xxx"
},
"model_deployments": [
{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"node_replicas": 3
},
{
"model": "meta-llama/Llama-3.1-70B-Instruct",
"resource_selector": [
{"id": "k8s-node-1", "port": 8011},
{"id": "k8s-node-2", "port": 8011}
]
}
]
}
Node Names

The above template uses example node IDs (k8s-node-1, k8s-node-2). Replace these with the actual node names or IDs in your Kubernetes cluster. You can retrieve node names using:

kubectl get nodes

Explanation of cluster_config.json

(Click to expand/collapse)

Configuration File Structure

The cluster_config.json file defines your multi-node deployment configuration. It consists of three main sections:

Top-Level Keys:

KeyTypeRequiredDescription
schema_versionStringYesConfiguration schema version (currently "1.0")
clusterObjectYesCluster-wide settings including name and authentication
model_deploymentsArrayYesList of model deployment configurations

Cluster Settings (cluster object):

KeyTypeRequiredDescriptionExample
nameStringYesUnique identifier for your cluster"production-cluster"
huggingfaceTokenStringNoHugging Face API token for downloading private models"hf_xxx..."

Model Deployment Configuration (model_deployments array items):

KeyTypeRequiredDescriptionExample
modelStringYesFull Hugging Face model name (org/model)"meta-llama/Llama-3.1-8B-Instruct"
docker_imageStringNoCustom Docker image (auto-detected if omitted)"mangollm/mb-llmboost:latest"
model_pathStringNoCustom path to model files inside container"/workspace/custom/path"
node_replicasIntegerEither this OR resource_selectorNumber of replicas to auto-distribute across nodes3
resource_selectorArrayEither this OR node_replicasExplicit list of node assignments with portsSee below

Resource Selector Options (resource_selector array items):

KeyTypeRequiredDefaultDescriptionExample
idStringYesN/AKubernetes node name/identifier"k8s-node-1"
portIntegerNo8011Service port for this deployment8012

Deployment Strategies

Use node_replicas when:

  • You want automatic distribution across available nodes
  • Your cluster has homogeneous nodes (similar GPU/CPU specs)
  • You don't need control over which specific nodes run the model
Mutually Exclusive

You cannot use both node_replicas and resource_selector in the same deployment. Choose one strategy per model.

{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"node_replicas": 4 // LLMBoost automatically selects 4 nodes
}

Use resource_selector when:

  • You need explicit control over node placement
  • Different nodes have different capabilities or locations
  • You want to assign different ports per node
  • You're targeting specific hardware configurations
{
"model": "meta-llama/Llama-3.1-70B-Instruct",
"resource_selector": [
{"id": "gpu-node-high-mem-1", "port": 8011},
{"id": "gpu-node-high-mem-2", "port": 8011}
]
}

Step 3: Deploy Models

By default, lbh cluster install tries to automatically deploy the cluster configuration file at ~/.llmboost_hub/cluster_config.json during installation.

# Deploy with default config location: ~/.llmboost_hub/cluster_config.json
lbh cluster deploy

To disable the default deployment and deploy, use the -f flag to point to your configuration file.

lbh cluster deploy -f /path/to/cluster_config_1.json

This command will create the necessary Kubernetes deployment files (stored at ~/.llmboost_hub/model_deployments/) and apply them to your cluster.

Step 4: Verify Services are Ready

Before accessing the services, verify that they are ready to accept requests:

# Check if services are running on worker nodes
# Replace with your worker node IPs
curl <worker-node-ip>:30080/api/status

# Expected output:
# {"status":"running","server_name":"meta-llama/Llama-3.1-8B-Instruct"}

Step 5: Monitor Deployment Status

lbh cluster status

To view access credentials:

# Show masked credentials
lbh cluster status --show-secrets

# Show full unmasked credentials
lbh -v cluster status --show-secrets

Sample Output:

Models: 4/4             Mgmt.: 16/16

Pod Status Restarts Message
------------------------------------- -------- -------- -------
llama-3-1-8b-instruct-abc123 Running 0
llama-3-1-8b-instruct-def456 Running 0
llama-3-1-70b-instruct-node1-xyz789 Running 0
llama-3-1-70b-instruct-node2-mno345 Running 0

Service URLs:
Service URL
-------------------- --------------------------
Monitoring (Grafana) http://cluster-node:30082
Ingress Controller http://cluster-node:30080

Step 6: View Logs

lbh cluster logs

Filter logs by category:

# View only model deployment logs
lbh cluster logs --models

# View management pod logs
lbh cluster logs --management

Filter logs by pod or pattern:

# Filter to specific pod
lbh cluster logs --pod llama-3-1-8b-instruct-abc123

# Show last 50 lines with error filtering
lbh cluster logs --models --tail 50 --grep ERROR

Step 7: Remove Deployments (Optional)

lbh cluster remove meta-llama/Llama-3.1-8B-Instruct

Remove all deployments:

# With confirmation prompt
lbh cluster remove --all

# Skip confirmation
lbh cluster remove --all --force

Step 8: Uninstall Infrastructure (Optional)

lbh cluster uninstall

Skip confirmation prompt:

lbh cluster uninstall --force
warning

The namespace llmboost is not automatically deleted. To completely remove:

kubectl delete namespace llmboost

Configuration Parameters

Cluster Configuration

FieldDescriptionRequired
schema_versionConfiguration schema versionYes
cluster.nameCluster identifierYes
cluster.huggingfaceTokenHF token for private modelsNo

Model Deployment Configuration

FieldDescriptionRequired
modelHugging Face model nameYes
docker_imageCustom Docker imageNo (auto-detected)
model_pathCustom model path in containerNo
node_replicasNumber of replicas (auto-distributed)Either this or resource_selector
resource_selectorExplicit node assignmentsEither this or node_replicas

Resource Selector Options

FieldDescriptionDefault
idNode identifierRequired
portService port8011

Environment Variables

Configure LLMBoost Hub cluster operations:

VariableDefaultDescription
LBH_CLUSTER_CONFIG_PATH$LBH_HOME/cluster_config.jsonDefault cluster config file
LBH_KUBE_MODEL_DEPLOYMENTS_PATH$LBH_HOME/k8s/deployments/Generated manifests directory
KUBECONFIG~/.kube/configKubernetes config file

Monitoring and Management

Access Management UI (Headlamp)

The management UI is served from port 30080 of the control/manager node under the URL path /manage/.

Access Steps:

  1. Get the Management UI URL from lbh cluster status
  2. Navigate to http://<manager-node-ip>:30080/manage/ in your browser
    • Use localhost:30080/manage/ if accessing from the manager node
    • Note: The trailing / is required
  3. On first access, you'll be prompted for a login token
  4. Enter the token displayed during lbh cluster install or retrieve it with:
    lbh cluster status --show-secrets
    # Or get it directly from Kubernetes
    kubectl get secret -n llmboost headlamp-admin-token \
    -o jsonpath='{.data.token}' | base64 -d
  5. After successful login, you'll access the cluster management interface

Key Views and Capabilities:

Map View

Management UI - Map View

The Map view provides a visual topology of your cluster, displaying how deployments and workloads are distributed across nodes. This view helps you quickly identify resource allocation patterns and spot nodes with heavy workloads or potential bottlenecks. Each node shows the number of running pods and deployments, making it easy to assess cluster balance at a glance.

Namespace Overview

Management UI - Namespace

The namespace view displays all resources within the llmboost namespace, including deployments, DaemonSets, StatefulSets, and pods. This centralized view allows you to monitor resource counts, check pod readiness status, and quickly identify any failed or pending resources. Use this screen to verify that all components of your LLMBoost deployment are running as expected.

Nodes List

Management UI - Nodes

The Nodes view shows all Kubernetes nodes in your cluster with their health status, roles, IP addresses, and versions. This screen is essential for verifying node availability and identifying which nodes are ready to accept workloads. You can quickly see the age of each node and access detailed information about CPU, memory, and other resource capacities.

Cluster Overview

Management UI - Overview

The Overview page provides high-level cluster health metrics including total CPU units, memory capacity, pod counts, and node readiness statistics. The Events section displays recent Kubernetes events with timestamps, helping you troubleshoot deployments by showing pod scheduling, container creation, and image pulling activities. This is your first stop when diagnosing deployment issues or tracking recent changes.

Pods Management

Management UI - Pods

The Pods view lists all running pods across your cluster with their namespace, restart counts, readiness status, IP addresses, and assigned nodes. This detailed view allows you to monitor individual pod health, identify pods that are restarting frequently, and access pod logs for debugging. Use this screen to track which nodes are hosting specific model deployments and verify that pods are distributed as intended.

Access Monitoring Dashboard (Grafana)

The monitoring UI is accessible at port 30080 under the path /monitor/ of the control/manager node.

Access Steps:

  1. Get the Monitoring URL from lbh cluster status
  2. Navigate to http://<manager-node-ip>:30080/monitor/ in your browser
    • Note: The trailing / is required
  3. Login with credentials from:
    lbh cluster status --show-secrets
    # Or retrieve directly
    kubectl get secret -n llmboost grafana-admin-secret \
    -o jsonpath='{.data.admin-user}' | base64 -d
    kubectl get secret -n llmboost grafana-admin-secret \
    -o jsonpath='{.data.admin-password}' | base64 -d

LLMBoost-Specific Metrics:

LLMBoost exposes the following metrics for each node:

MetricDescription
num_active_requestsNumber of ongoing chat/completion requests in the node
num_total_requests_totalTotal number of requests served by the LLMBoost engine

GPU Metrics:

Platform-specific GPU metrics are also available:

Available Dashboards:

GPU and CPU Metrics Dashboard

Monitoring - GPU and CPU

This dashboard tracks real-time GPU utilization across all GPUs in your cluster, with separate time-series graphs for each GPU on each node. The CPU metrics section displays utilization and memory consumption per node, helping you identify compute bottlenecks and optimize resource allocation. Use these graphs to ensure GPUs are being fully utilized during inference and to detect idle resources that could handle additional workloads.

Detailed GPU Metrics

Monitoring - GPU Details

The detailed GPU metrics dashboard provides hardware-level insights including ECC (Error-Correcting Code) errors, VRAM throughput, memory utilization per GPU, and junction temperatures. These metrics are critical for identifying hardware issues, thermal throttling, or memory saturation that could impact inference performance. Monitor temperature trends to ensure cooling systems are adequate and watch VRAM utilization to verify models fit comfortably within available GPU memory.

Network Performance Metrics

Monitoring - Network

This dashboard monitors network health across your cluster, tracking receive bandwidth, TCP retransmit counts, and dropped packets per network interface. Network metrics help you identify connectivity issues, bandwidth saturation, or packet loss that could slow down distributed inference or cause request timeouts. High retransmit counts or dropped packets indicate network problems that should be investigated to maintain optimal multi-node communication.


Using the Inference Services

Accessing Individual Endpoints

Inference services on worker nodes listen for requests on port 30080. You can connect to individual nodes using the OpenAI-compatible API:

from openai import OpenAI

# Connect to a specific worker node
client = OpenAI(
base_url="http://<worker-node-ip>:30080/api/v1",
api_key="-"
)

response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is multi-node inference?"}
],
stream=False
)

print(response.choices[0].message.content)

Load Balancing Across Multiple Nodes

For optimal performance, distribute requests across multiple worker nodes:

import threading
from queue import Queue
from openai import OpenAI

# Define prompts
prompts = [
"How does multithreading work in Python?",
"Write me a Fibonacci generator in Python",
"Which pet should I get, dog or cat?",
"How do I fine-tune an LLM model?"
]

# Thread worker for sending requests
def run_thread(host, queue: Queue):
client = OpenAI(
base_url=f"http://{host}/api/v1",
api_key="-"
)
while not queue.empty():
prompt = queue.get()
chat_completion = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt}
],
stream=False
)
print(
f"-------------------------------------------------------------------\n"
f"Question: {prompt}\nAnswer: {chat_completion.choices[0].message.content}"
)

# Worker node endpoints
hosts = ["10.4.16.1:30080", "10.4.16.2:30080"]

threads = []
queue = Queue()

# Populate the request queue
for prompt in prompts:
queue.put(prompt)

# Launch threads for each host
for host in hosts:
t = threading.Thread(target=run_thread, args=(host, queue))
threads.append(t)
t.start()

# Wait for all threads to complete
for thread in threads:
thread.join()

Save this as client.py and run:

python client.py

Best Practices

Resource Planning

Estimate cluster requirements:

  • Model size: Larger models need more GPUs per node
  • Expected load: Higher concurrency requires more replicas
  • GPU memory: Ensure nodes have sufficient VRAM per model

Deployment Strategy

Use node_replicas for:

  • Homogeneous clusters with similar nodes
  • Simple scaling without node-specific constraints

Use resource_selector for:

  • Node-specific port assignments
  • Explicit control over which nodes run which models

High Availability

Ensure redundancy:

  • Deploy at least 3 replicas for production workloads
  • Distribute replicas across availability zones
  • Monitor pod health and automatic restarts

Troubleshooting

Pods Not Starting

Check pod status:

lbh cluster status
kubectl describe pod <pod-name> -n llmboost

Common issues:

  • Insufficient GPU resources on nodes
  • Image pull errors (check Docker credentials)
  • Invalid Hugging Face token for private models

Out of Memory Errors

Solutions:

  1. Deploy to nodes with more GPUs: Select nodes with higher GPU capacity
  2. Reduce model size: Use smaller models or quantization
  3. Check node resources: Verify available GPU memory
# Check node GPU resources
kubectl describe nodes | grep -A 5 "Allocated resources"

Service Not Accessible

Verify service configuration:

kubectl get svc -n llmboost
kubectl get endpoints -n llmboost

Check:

  • Service type (LoadBalancer, NodePort, ClusterIP)
  • Firewall rules and network policies
  • Ingress controller configuration

Deployment Not Reconciling

Check operator logs:

lbh cluster logs --management --pod operator
kubectl logs -n llmboost deployment/llmboost-operator

Verify:

  • Operator is running: kubectl get pods -n llmboost | grep operator
  • CRD is installed: kubectl get crd llmboostdeployments.mangoboost.io

Next Steps


Questions? Contact contact@mangoboost.io