Multi-Node Deployment
Deploy LLMBoost across multiple nodes in a Kubernetes cluster with automatic orchestration, load balancing, and monitoring. Scale your inference infrastructure to handle production workloads with enterprise-grade reliability.
Why Multi-Node Matters
- Horizontal Scaling - Distribute models across multiple servers
- High Availability - Eliminate single points of failure
- Enterprise Ready - Built-in monitoring and management UIs
- Automatic Orchestration - Kubernetes-native deployment with Helm
Prerequisites
Before deploying multi-node LLMBoost, ensure you have the following installed AND running:
- Kubernetes (1.32.0 or higher)
- CNI (Calico or Flannel)
- Helm (version 3.19 or higher, https://helm.sh/docs/intro/install/)
- Latest CUDA and ROCm drivers for Nvidia and AMD GPUs, respectively
Required Tools
| Tool | Purpose | Verification |
|---|---|---|
| Kubernetes | Container orchestration | kubectl version |
| Helm | Package management | helm version |
| Docker | Container runtime | docker ps |
Verify Prerequisites
# Verify Kubernetes cluster is accessible
kubectl cluster-info # Should show: "Kubernetes control plane is running at https://<manager-node>:<port>"
# Verify Helm is installed
helm version # Should be v3.19 or higher
# Verify Docker is running
docker ps
docker --version # Should be 27.3 or higher
# Verify lbh is installed
lbh --version # Should be v0.3.0 or higher
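If you prefer to script these checks, a short helper like the following can run them in one pass. This is a minimal sketch: the tool names match the table above, and enforcing the exact version thresholds is left to you.
import shutil
import subprocess

# Tools from the table above, each paired with its verification command
CHECKS = {
    "kubectl": ["kubectl", "version", "--client"],
    "helm": ["helm", "version"],
    "docker": ["docker", "--version"],
    "lbh": ["lbh", "--version"],
}

for tool, cmd in CHECKS.items():
    if shutil.which(tool) is None:
        print(f"[MISSING] {tool} is not on PATH")
        continue
    result = subprocess.run(cmd, capture_output=True, text=True)
    status = "OK" if result.returncode == 0 else "ERROR"
    print(f"[{status}] {tool}: {(result.stdout or result.stderr).strip()}")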
Architecture Overview
LLMBoost multi-node deployment uses a Kubernetes-native architecture:
- Custom Resource Definitions (CRDs): Define model deployments declaratively
- Operator: Automatically reconciles desired state with actual deployment
- Load Balancing: Kubernetes Services distribute traffic across model replicas
- Monitoring: Built-in Grafana dashboards for metrics visualization
- Management UI: Headlamp-based interface for cluster administration
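Because model deployments are ordinary Kubernetes custom resources, they can be inspected programmatically as well as through kubectl. Below is a minimal sketch using the official kubernetes Python client; it assumes the client is installed (e.g., pip install kubernetes) and that your kubeconfig points at the cluster.
from kubernetes import client, config

# Load credentials from the active kubeconfig
config.load_kube_config()

# LLMBoostDeployment resources live in the mangoboost.io/v1 API group
api = client.CustomObjectsApi()
deployments = api.list_namespaced_custom_object(
    group="mangoboost.io",
    version="v1",
    namespace="llmboost",
    plural="llmboostdeployments",
)
for item in deployments.get("items", []):
    print(item["metadata"]["name"])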
Usage Examples
- Using LLMBoost Hub
- Manual Setup (kubectl/helm)
Using LLMBoost Hub
Step 1: Install Cluster Infrastructure
All lbh cluster commands should be run on the control/manager node with access to the Kubernetes cluster. Please contact your cluster administrator if unsure.
Install the LLMBoost multi-node infrastructure using:
lbh cluster install
This command will:
- Create the llmboost namespace
- Install the LLMBoost Operator via Helm chart
- Start the management and monitoring services
- If a cluster configuration file exists at ~/.llmboost_hub/cluster_config.json, deploy the specified model deployments automatically
After installation completes, you'll receive:
- Management UI Token: For accessing the Kubernetes dashboard
- Monitoring UI Credentials: Username and password for the Monitoring UI
After the Helm chart is installed by lbh cluster install, the full unmasked credentials are displayed in the output. To see them again at any time, run lbh cluster status --show-secrets (masked) or lbh -v cluster status --show-secrets (unmasked).
Step 2: Configure Model Deployments
Create a cluster configuration file (e.g., cluster_config.json):
Basic functional example
Deploy one model on any node (node_replicas: 1 means LLMBoost will pick one available node):
{
"schema_version": "1.0",
"cluster": {
"name": "production-cluster"
},
"model_deployments": [
{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"node_replicas": 1
}
]
}
Advanced example template
{
"schema_version": "1.0",
"cluster": {
"name": "production-cluster",
"huggingfaceToken": "hf_xxx"
},
"model_deployments": [
{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"node_replicas": 3
},
{
"model": "meta-llama/Llama-3.1-70B-Instruct",
"resource_selector": [
{"id": "k8s-node-1", "port": 8011},
{"id": "k8s-node-2", "port": 8011}
]
}
]
}
The above template uses example node IDs (k8s-node-1, k8s-node-2). Replace these with the actual node names or IDs in your Kubernetes cluster. You can retrieve node names using:
kubectl get nodes
Explanation of cluster_config.json
cluster_config.json Configuration File Structure
The cluster_config.json file defines your multi-node deployment configuration. It consists of three main sections:
Top-Level Keys:
| Key | Type | Required | Description |
|---|---|---|---|
| schema_version | String | Yes | Configuration schema version (currently "1.0") |
| cluster | Object | Yes | Cluster-wide settings including name and authentication |
| model_deployments | Array | Yes | List of model deployment configurations |
Cluster Settings (cluster object):
| Key | Type | Required | Description | Example |
|---|---|---|---|---|
| name | String | Yes | Unique identifier for your cluster | "production-cluster" |
| huggingfaceToken | String | No | Hugging Face API token for downloading private models | "hf_xxx..." |
Model Deployment Configuration (model_deployments array items):
| Key | Type | Required | Description | Example |
|---|---|---|---|---|
| model | String | Yes | Full Hugging Face model name (org/model) | "meta-llama/Llama-3.1-8B-Instruct" |
| docker_image | String | No | Custom Docker image (auto-detected if omitted) | "mangollm/mb-llmboost:latest" |
| model_path | String | No | Custom path to model files inside container | "/workspace/custom/path" |
| node_replicas | Integer | Either this OR resource_selector | Number of replicas to auto-distribute across nodes | 3 |
| resource_selector | Array | Either this OR node_replicas | Explicit list of node assignments with ports | See below |
Resource Selector Options (resource_selector array items):
| Key | Type | Required | Default | Description | Example |
|---|---|---|---|---|---|
| id | String | Yes | N/A | Kubernetes node name/identifier | "k8s-node-1" |
| port | Integer | No | 8011 | Service port for this deployment | 8012 |
Deployment Strategies
Use node_replicas when:
- You want automatic distribution across available nodes
- Your cluster has homogeneous nodes (similar GPU/CPU specs)
- You don't need control over which specific nodes run the model
You cannot use both node_replicas and resource_selector in the same deployment. Choose one strategy per model.
{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"node_replicas": 4 // LLMBoost automatically selects 4 nodes
}
Use resource_selector when:
- You need explicit control over node placement
- Different nodes have different capabilities or locations
- You want to assign different ports per node
- You're targeting specific hardware configurations
{
"model": "meta-llama/Llama-3.1-70B-Instruct",
"resource_selector": [
{"id": "gpu-node-high-mem-1", "port": 8011},
{"id": "gpu-node-high-mem-2", "port": 8011}
]
}
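Before deploying, you can sanity-check a configuration file against the rules above with a short script like the one below. This is a minimal sketch; lbh performs its own validation at deploy time.
import json
import sys

def validate(path):
    with open(path) as f:
        cfg = json.load(f)
    errors = []
    for key in ("schema_version", "cluster", "model_deployments"):
        if key not in cfg:
            errors.append(f"missing top-level key: {key}")
    for i, dep in enumerate(cfg.get("model_deployments", [])):
        if "model" not in dep:
            errors.append(f"deployment {i}: missing 'model'")
        # Exactly one placement strategy per model (see the note above)
        if ("node_replicas" in dep) == ("resource_selector" in dep):
            errors.append(f"deployment {i}: use exactly one of node_replicas or resource_selector")
    return errors

if __name__ == "__main__":
    errs = validate(sys.argv[1])
    print("\n".join(errs) or "config looks valid")
Run it as python validate_config.py cluster_config.json before handing the file to lbh cluster deploy.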
Step 3: Deploy Models
By default, lbh cluster install tries to automatically deploy the cluster configuration file at ~/.llmboost_hub/cluster_config.json during installation.
# Deploy with default config location: ~/.llmboost_hub/cluster_config.json
lbh cluster deploy
To deploy a specific configuration file instead of the default, point to it with the -f flag:
lbh cluster deploy -f /path/to/cluster_config_1.json
This command will create the necessary Kubernetes deployment files (stored at ~/.llmboost_hub/model_deployments/) and apply them to your cluster.
Step 4: Verify Services are Ready
Before accessing the services, verify that they are ready to accept requests:
# Check if services are running on worker nodes
# Replace with your worker node IPs
curl <worker-node-ip>:30080/api/status
# Expected output:
# {"status":"running","server_name":"meta-llama/Llama-3.1-8B-Instruct"}
Step 5: Monitor Deployment Status
lbh cluster status
To view access credentials:
# Show masked credentials
lbh cluster status --show-secrets
# Show full unmasked credentials
lbh -v cluster status --show-secrets
Sample Output:
Models: 4/4 Mgmt.: 16/16
Pod Status Restarts Message
------------------------------------- -------- -------- -------
llama-3-1-8b-instruct-abc123 Running 0
llama-3-1-8b-instruct-def456 Running 0
llama-3-1-70b-instruct-node1-xyz789 Running 0
llama-3-1-70b-instruct-node2-mno345 Running 0
Service URLs:
Service URL
-------------------- --------------------------
Monitoring (Grafana) http://cluster-node:30082
Ingress Controller http://cluster-node:30080
Step 6: View Logs
lbh cluster logs
Filter logs by category:
# View only model deployment logs
lbh cluster logs --models
# View management pod logs
lbh cluster logs --management
Filter logs by pod or pattern:
# Filter to specific pod
lbh cluster logs --pod llama-3-1-8b-instruct-abc123
# Show last 50 lines with error filtering
lbh cluster logs --models --tail 50 --grep ERROR
Step 7: Remove Deployments (Optional)
lbh cluster remove meta-llama/Llama-3.1-8B-Instruct
Remove all deployments:
# With confirmation prompt
lbh cluster remove --all
# Skip confirmation
lbh cluster remove --all --force
Step 8: Uninstall Infrastructure (Optional)
lbh cluster uninstall
Skip confirmation prompt:
lbh cluster uninstall --force
The namespace llmboost is not automatically deleted. To completely remove:
kubectl delete namespace llmboost
Manual Setup (kubectl/helm)
Step 1: Install Helm Chart
Add and update Helm repository:
helm repo add llmboost https://mangoboost.github.io/llmboost-helm
helm repo update
Verify chart availability:
helm search repo llmboost
Install the chart:
helm install llmboost llmboost/llmboost \
-n llmboost \
-f values.yaml
Step 2: Retrieve Access Credentials
Management UI Token (Headlamp):
kubectl get secret -n llmboost headlamp-admin-token \
-o jsonpath='{.data.token}' | base64 -d
Monitoring Username (Grafana):
kubectl get secret -n llmboost grafana-admin-secret \
-o jsonpath='{.data.admin-user}' | base64 -d
Monitoring Password (Grafana):
kubectl get secret -n llmboost grafana-admin-secret \
-o jsonpath='{.data.admin-password}' | base64 -d
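If you are scripting against the cluster, the same secrets can be read with the kubernetes Python client instead of kubectl. A sketch, assuming the client is installed:
import base64

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Secret values are base64-encoded, exactly as with the kubectl commands above
secret = v1.read_namespaced_secret("grafana-admin-secret", "llmboost")
print(base64.b64decode(secret.data["admin-user"]).decode())
print(base64.b64decode(secret.data["admin-password"]).decode())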
Step 3: Create Model Deployment Manifest
Create a file llama-deployment.yaml:
apiVersion: mangoboost.io/v1
kind: LLMBoostDeployment
metadata:
name: llama-3-1-8b-instruct
namespace: llmboost
spec:
# The URL prefix for OpenAI API endpoint
apiEndpoint: /api
deployments:
# The deployment name can be customized by the user
amd-deployment:
gpu:
# Deploy on the machines with AMD GPUs
type: amd
# Or, deploy on the machines with Nvidia GPUs
# type: nvidia
# Your Huggingface token (optional)
huggingfaceToken: hf_xxxx
# Use "mangollm/mb-llmboost-nvidia:latest" for Nvidia deployments
image: mangollm/mb-llmboost-rocm:latest
# Currently, only "serve" is supported
mode: serve
# Model name from the Huggingface repository
modelName: meta-llama/Llama-3.1-8B-Instruct
# Optionally, specify the name(s) of the Kubernetes node(s) on which the pods will be deployed
# Otherwise, the pods will be deployed on all nodes that have the matching GPU
# nodes:
# - node-1
# - node-2
# Multiple heterogeneous deployments can be specified - the load balancer will tie them together under a single endpoint
# nvidia-deployment:
# gpu:
# type: nvidia
# ...
Step 4: Deploy Model
kubectl apply -f llama-deployment.yaml
Verify deployment:
# Check deployment resource
kubectl get llmboostdeployments -n llmboost
# Check pods
kubectl get pods -n llmboost -l llmboost/deployment-group-name=llama-3-1-8b-instruct
Step 5: Check Deployment Status
List all deployments:
kubectl get llmboostdeployments -n llmboost
View deployment details:
kubectl describe llmboostdeployment llama-3-1-8b-instruct -n llmboost
Check pod status:
kubectl get pods -n llmboost
View pod logs:
kubectl logs -n llmboost <pod-name> --all-containers=true
Step 6: Test the API Endpoint
Query the API endpoint for status
curl http://<worker node IP>:30080/api/status
# After successful deployment, the endpoint should return the following response
{"status":"running","server_name":"meta-llama/Llama-3.1-8B-Instruct"}
Note the /api prefix in the URL, which was specified in the spec.apiEndpoint field of the manifest.
Step 7: Remove Deployment
Delete specific deployment:
kubectl delete llmboostdeployment llama-3-1-8b-instruct -n llmboost
Delete all deployments:
kubectl delete llmboostdeployments --all -n llmboost
Step 8: Uninstall Helm Chart
Delete the existing llmboostdeployments
kubectl get llmboostdeployments -n llmboost
kubectl delete llmboostdeployments <deployment 1> <deployment 2> ... -n llmboost
Uninstall the LLMBoost Helm chart
helm uninstall llmboost -n llmboost
Remove namespace (optional):
kubectl delete namespace llmboost
Configuration Parameters
Cluster Configuration
| Field | Description | Required |
|---|---|---|
| schema_version | Configuration schema version | Yes |
| cluster.name | Cluster identifier | Yes |
| cluster.huggingfaceToken | HF token for private models | No |
Model Deployment Configuration
| Field | Description | Required |
|---|---|---|
| model | Hugging Face model name | Yes |
| docker_image | Custom Docker image | No (auto-detected) |
| model_path | Custom model path in container | No |
| node_replicas | Number of replicas (auto-distributed) | Either this or resource_selector |
| resource_selector | Explicit node assignments | Either this or node_replicas |
Resource Selector Options
| Field | Description | Default |
|---|---|---|
| id | Node identifier | Required |
| port | Service port | 8011 |
Environment Variables
Configure LLMBoost Hub cluster operations:
| Variable | Default | Description |
|---|---|---|
| LBH_CLUSTER_CONFIG_PATH | $LBH_HOME/cluster_config.json | Default cluster config file |
| LBH_KUBE_MODEL_DEPLOYMENTS_PATH | $LBH_HOME/k8s/deployments/ | Generated manifests directory |
| KUBECONFIG | ~/.kube/config | Kubernetes config file |
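These variables are read from the process environment, so they can be overridden per invocation. For example, here is a minimal sketch that points lbh at a non-default kubeconfig and config file for a single deploy; the paths are hypothetical, and the same effect is achieved with a shell export.
import os
import subprocess

env = os.environ.copy()
# Hypothetical override paths; the defaults are listed in the table above
env["KUBECONFIG"] = "/etc/kubernetes/admin.conf"
env["LBH_CLUSTER_CONFIG_PATH"] = "/srv/llmboost/cluster_config.json"

subprocess.run(["lbh", "cluster", "deploy"], env=env, check=True)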
Monitoring and Management
Access Management UI (Headlamp)
The management UI is served from port 30080 of the control/manager node under the URL path /manage/.
Access Steps:
- Get the Management UI URL from lbh cluster status
- Navigate to http://<manager-node-ip>:30080/manage/ in your browser
  - Use localhost:30080/manage/ if accessing from the manager node
  - Note: The trailing / is required
- On first access, you'll be prompted for a login token
- Enter the token displayed during lbh cluster install, or retrieve it with lbh cluster status --show-secrets
  # Or get it directly from Kubernetes
  kubectl get secret -n llmboost headlamp-admin-token \
    -o jsonpath='{.data.token}' | base64 -d
- After successful login, you'll access the cluster management interface
Key Views and Capabilities:
Map View

The Map view provides a visual topology of your cluster, displaying how deployments and workloads are distributed across nodes. This view helps you quickly identify resource allocation patterns and spot nodes with heavy workloads or potential bottlenecks. Each node shows the number of running pods and deployments, making it easy to assess cluster balance at a glance.
Namespace Overview

The namespace view displays all resources within the llmboost namespace, including deployments, DaemonSets, StatefulSets, and pods. This centralized view allows you to monitor resource counts, check pod readiness status, and quickly identify any failed or pending resources. Use this screen to verify that all components of your LLMBoost deployment are running as expected.
Nodes List

The Nodes view shows all Kubernetes nodes in your cluster with their health status, roles, IP addresses, and versions. This screen is essential for verifying node availability and identifying which nodes are ready to accept workloads. You can quickly see the age of each node and access detailed information about CPU, memory, and other resource capacities.
Cluster Overview

The Overview page provides high-level cluster health metrics including total CPU units, memory capacity, pod counts, and node readiness statistics. The Events section displays recent Kubernetes events with timestamps, helping you troubleshoot deployments by showing pod scheduling, container creation, and image pulling activities. This is your first stop when diagnosing deployment issues or tracking recent changes.
Pods Management

The Pods view lists all running pods across your cluster with their namespace, restart counts, readiness status, IP addresses, and assigned nodes. This detailed view allows you to monitor individual pod health, identify pods that are restarting frequently, and access pod logs for debugging. Use this screen to track which nodes are hosting specific model deployments and verify that pods are distributed as intended.
Access Monitoring Dashboard (Grafana)
The monitoring UI is accessible at port 30080 under the path /monitor/ of the control/manager node.
Access Steps:
- Get the Monitoring URL from lbh cluster status
- Navigate to http://<manager-node-ip>:30080/monitor/ in your browser
  - Note: The trailing / is required
- Login with credentials from lbh cluster status --show-secrets
  # Or retrieve directly
  kubectl get secret -n llmboost grafana-admin-secret \
    -o jsonpath='{.data.admin-user}' | base64 -d
  kubectl get secret -n llmboost grafana-admin-secret \
    -o jsonpath='{.data.admin-password}' | base64 -d
LLMBoost-Specific Metrics:
LLMBoost exposes the following metrics for each node:
| Metric | Description |
|---|---|
| num_active_requests | Number of ongoing chat/completion requests in the node |
| num_total_requests_total | Total number of requests served by the LLMBoost engine |
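These counters can also be consumed outside Grafana through the Prometheus HTTP API. Below is a minimal sketch that reads num_active_requests per node; the Prometheus address is an assumption, so check where your monitoring stack actually exposes it.
import json
import urllib.parse
import urllib.request

# Assumed Prometheus location; substitute your cluster's actual address
PROMETHEUS = "http://<manager-node-ip>:9090"

params = urllib.parse.urlencode({"query": "num_active_requests"})
with urllib.request.urlopen(f"{PROMETHEUS}/api/v1/query?{params}") as resp:
    data = json.load(resp)

for series in data["data"]["result"]:
    print(series["metric"], series["value"][1])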
GPU Metrics:
Platform-specific GPU metrics are also available:
- NVIDIA (DCGM)
- AMD (Device Metrics Exporter)
Available Dashboards:
GPU and CPU Metrics Dashboard

This dashboard tracks real-time GPU utilization across all GPUs in your cluster, with separate time-series graphs for each GPU on each node. The CPU metrics section displays utilization and memory consumption per node, helping you identify compute bottlenecks and optimize resource allocation. Use these graphs to ensure GPUs are being fully utilized during inference and to detect idle resources that could handle additional workloads.
Detailed GPU Metrics

The detailed GPU metrics dashboard provides hardware-level insights including ECC (Error-Correcting Code) errors, VRAM throughput, memory utilization per GPU, and junction temperatures. These metrics are critical for identifying hardware issues, thermal throttling, or memory saturation that could impact inference performance. Monitor temperature trends to ensure cooling systems are adequate and watch VRAM utilization to verify models fit comfortably within available GPU memory.
Network Performance Metrics

This dashboard monitors network health across your cluster, tracking receive bandwidth, TCP retransmit counts, and dropped packets per network interface. Network metrics help you identify connectivity issues, bandwidth saturation, or packet loss that could slow down distributed inference or cause request timeouts. High retransmit counts or dropped packets indicate network problems that should be investigated to maintain optimal multi-node communication.
Using the Inference Services
Accessing Individual Endpoints
Inference services on worker nodes listen for requests on port 30080. You can connect to individual nodes using the OpenAI-compatible API:
from openai import OpenAI
# Connect to a specific worker node
client = OpenAI(
base_url="http://<worker-node-ip>:30080/api/v1",
api_key="-"
)
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is multi-node inference?"}
],
stream=False
)
print(response.choices[0].message.content)
Load Balancing Across Multiple Nodes
For optimal performance, distribute requests across multiple worker nodes:
import threading
from queue import Queue, Empty
from openai import OpenAI
# Define prompts
prompts = [
"How does multithreading work in Python?",
"Write me a Fibonacci generator in Python",
"Which pet should I get, dog or cat?",
"How do I fine-tune an LLM model?"
]
# Thread worker for sending requests
def run_thread(host, queue: Queue):
client = OpenAI(
base_url=f"http://{host}/api/v1",
api_key="-"
)
    while True:
        try:
            # get_nowait avoids a race: with empty()+get(), another thread can
            # drain the queue between the check and the get, blocking forever
            prompt = queue.get_nowait()
        except Empty:
            break
chat_completion = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt}
],
stream=False
)
print(
f"-------------------------------------------------------------------\n"
f"Question: {prompt}\nAnswer: {chat_completion.choices[0].message.content}"
)
# Worker node endpoints
hosts = ["10.4.16.1:30080", "10.4.16.2:30080"]
threads = []
queue = Queue()
# Populate the request queue
for prompt in prompts:
queue.put(prompt)
# Launch threads for each host
for host in hosts:
t = threading.Thread(target=run_thread, args=(host, queue))
threads.append(t)
t.start()
# Wait for all threads to complete
for thread in threads:
thread.join()
Save this as client.py and run:
python client.py
Best Practices
Resource Planning
Estimate cluster requirements:
- Model size: Larger models need more GPUs per node
- Expected load: Higher concurrency requires more replicas
- GPU memory: Ensure nodes have sufficient VRAM per model
Deployment Strategy
Use node_replicas for:
- Homogeneous clusters with similar nodes
- Simple scaling without node-specific constraints
Use resource_selector for:
- Node-specific port assignments
- Explicit control over which nodes run which models
High Availability
Ensure redundancy:
- Deploy at least 3 replicas for production workloads
- Distribute replicas across availability zones
- Monitor pod health and automatic restarts
Troubleshooting
Pods Not Starting
Check pod status:
lbh cluster status
kubectl describe pod <pod-name> -n llmboost
Common issues:
- Insufficient GPU resources on nodes
- Image pull errors (check Docker credentials)
- Invalid Hugging Face token for private models
Out of Memory Errors
Solutions:
- Deploy to nodes with more GPUs: Select nodes with higher GPU capacity
- Reduce model size: Use smaller models or quantization
- Check node resources: Verify available GPU memory
# Check node GPU resources
kubectl describe nodes | grep -A 5 "Allocated resources"
Service Not Accessible
Verify service configuration:
kubectl get svc -n llmboost
kubectl get endpoints -n llmboost
Check:
- Service type (LoadBalancer, NodePort, ClusterIP)
- Firewall rules and network policies
- Ingress controller configuration
Deployment Not Reconciling
Check operator logs:
lbh cluster logs --management --pod operator
kubectl logs -n llmboost deployment/llmboost-operator
Verify:
- Operator is running: kubectl get pods -n llmboost | grep operator
- CRD is installed: kubectl get crd llmboostdeployments.mangoboost.io
Next Steps
- Single-Node Multi-GPU - Optimize GPU utilization on single servers
- Streaming - Enable real-time token generation
- OpenAI API Compatible - Use standard API clients
Questions? Contact contact@mangoboost.io