Quick Start Guide
Get LLMBoost up and running in under 5 minutes. This guide shows you how to deploy your first model and make inference requests using LLMBoost Hub (lbh), our recommended CLI tool for streamlined deployment.
Prerequisites
Before you begin, ensure you have:
- LLMBoost License (Contact contact@mangoboost.io to obtain a license key and Docker credentials)
- Python 3.11+ installed
- Docker 27.3.1+ installed and configured for passwordless (non-sudo) execution (see Docker's post-installation steps for Linux)
- AMD GPU with ROCm 6.3+
- HuggingFace account for model access
We recommend using LLMBoost Hub (lbh) for the easiest setup experience. Advanced users can also use the manual Docker approach.
Using LLMBoost Hub (lbh)
LLMBoost Hub is the easiest way to deploy and manage LLM inference with LLMBoost.
Step 0: (Optional) Create new Python 3.11+ Virtual Environment
We recommend using a virtual environment to avoid dependency conflicts and permission issues.
Using uv:

pip install --user uv
uv venv lbh-venv --seed --python 3.11  # Or higher
source lbh-venv/bin/activate
which python  # Verify python version
which pip  # Verify pip version (should be in same directory as python)

Using conda:

conda create -n lbh-env python=3.11 -y
conda activate lbh-env
which python  # Verify python version
which pip  # Verify pip version (should be in same directory as python)

Using venv:

python3.11 -m venv lbh-env
source lbh-env/bin/activate
which python  # Verify python version
which pip  # Verify pip version (should be in same directory as python)
Whichever approach you choose, activate the virtual environment in each new terminal session before running lbh commands, or set up automatic activation with a tool such as direnv or a shell-specific configuration.
To exit the virtual environment, simply run deactivate.
Step 1: Install LLMBoost Hub
Warning: Do not run lbh as the root user. Running lbh commands with sudo can lead to permission issues with Docker and Python packages. Always run lbh as a regular user.
pip install llmboost_hub
Troubleshooting Installation Issues
If you encounter issues during installation, try the following:
- Ensure you are in the correct (virtual) environment (which python).
- Ensure your pip is up to date by running pip install --upgrade pip.
- Ensure Python 3.11+ is being used, and that the matching pip is used (which python and which pip should point to the same directory).
- Try downloading from TestPyPI:
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple llmboost_hub --pre
- If you are behind a proxy, configure pip to use the proxy:
pip install llmboost_hub --proxy http://user:password@proxyserver:port
Verify installed version is 0.3.0 or higher:
lbh --version
Step 2: Authenticate
Authenticate with your LLMBoost license, HuggingFace, and Docker:
# Authenticate LLMBoost license (required only once)
lbh login
# Login to HuggingFace (or set HF_TOKEN environment variable)
hf auth login
# Login to Docker (provided by MangoBoost)
docker login -u <docker_username>
Step 3: Fetch List of Optimized Models
lbh fetch # required only once a week to sync model list
Step 4: Serve Your First Model
Deploy a model with a single command:
lbh serve meta-llama/Llama-3.1-8B-Instruct
This command automatically:
- Downloads the LLMBoost Docker image
- Downloads the model from HuggingFace
- Starts the inference server on port 8011
- Configures optimal settings for your hardware
The first run will take a few minutes to download the Docker image and model. Subsequent runs will be much faster.
To quickly restart the server in the future, use the -r|--restart flag:
lbh serve --restart meta-llama/Llama-3.1-8B-Instruct
This automatically restarts the existing container (without re-downloading assets). For more options, see the Command Reference.
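If you are scripting around lbh serve, you may want to wait until the server is up before sending requests. Below is a minimal Python sketch of a readiness poll; it assumes the server exposes the standard OpenAI-style GET /v1/models endpoint on the default port 8011 once it is ready (an assumption, not a documented LLMBoost guarantee).

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url="http://localhost:8011/v1",
                     timeout=300.0, interval=5.0):
    """Poll the server until it responds; True when ready, False on timeout.

    Assumes an OpenAI-style GET /models endpoint becomes available
    once the inference server is up.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not up yet; retry after a short pause
        time.sleep(interval)
    return False
```

You could call wait_until_ready() right after launching lbh serve in the background and only proceed to issue requests once it returns True.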
Step 5: Make Your First Request
Once the server is ready, test it with a simple request:
lbh test meta-llama/Llama-3.1-8B-Instruct --query "Explain LLMs in simple terms."
Or use curl:
curl http://localhost:8011/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain LLMs in simple terms."}
]
}'
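Because the endpoint is OpenAI API compatible, you can also call it from Python. The sketch below uses only the standard library and mirrors the curl request above (same model name, default port 8011); the helper names are illustrative, not part of LLMBoost.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8011/v1"  # LLMBoost's default port


def build_payload(user_message: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    }


def chat(user_message: str) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat("Explain LLMs in simple terms."))
```

Since the API follows the OpenAI schema, official OpenAI client libraries pointed at http://localhost:8011/v1 should work as well.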
Step 6: Stop Serving the Model
To stop serving the model, use the following command:
lbh stop meta-llama/Llama-3.1-8B-Instruct
Step 7: Explore More Commands
# View all running containers
lbh list
# Attach to a running container
lbh attach meta-llama/Llama-3.1-8B-Instruct
# Start the container with custom docker args
lbh run meta-llama/Llama-3.1-8B-Instruct -- --network host # custom docker args
# Get help
lbh --help
# or for any command
lbh <command> --help
Command Workflow
The typical workflow for deploying and managing models with LLMBoost Hub:
lbh login # Authenticate with LLMBoost license
lbh fetch [model] # Search for available models (optional filter)
lbh list [model] # Show local images and their status
lbh prep <Repo/Model-Name> # Download Docker image and model assets
lbh run <Repo/Model-Name> # Start container
lbh serve <Repo/Model-Name> # Start LLMBoost inference server
lbh test <Repo/Model-Name> # Send a test request
lbh stop <Repo/Model-Name> # Stop container
Complete Example
# Authenticate (one-time setup)
lbh login
# Search for Llama models
lbh fetch llama
# Check local status
lbh list llama
# Download the model and the required Docker image
lbh prep meta-llama/Llama-3.1-8B-Instruct
# Launch the Docker container (accepts custom docker args)
lbh run meta-llama/Llama-3.1-8B-Instruct
# Launch the inference server (if the model or container isn't available,
# this automatically runs the prep and run steps with default settings)
lbh serve meta-llama/Llama-3.1-8B-Instruct
# Test the deployment
lbh test meta-llama/Llama-3.1-8B-Instruct
# Stop when done
lbh stop meta-llama/Llama-3.1-8B-Instruct
# Restart serving
lbh serve --restart meta-llama/Llama-3.1-8B-Instruct
See the complete Command Reference for all available commands.
Manual Docker Setup
For advanced users who prefer direct Docker control or custom configurations.
Step 1: Set Environment Variables
export MODEL_PATH=<absolute_path_to_model_directory>
export LICENSE_FILE=<absolute_path_to_license_file>
export HF_TOKEN=<your_huggingface_token>
- MODEL_PATH: Absolute path to your local model directory
- LICENSE_FILE: Path to your LLMBoost license file (contact contact@mangoboost.io if needed)
- HF_TOKEN: Get from huggingface.co/settings/tokens
Step 2: Pull LLMBoost Docker Image
Contact the MangoBoost team for access to the Docker image, then pull it:
docker pull mangollm/mb-llmboost-rocm:1.7.0
Step 3: Run LLMBoost Container
docker run -it --rm \
--network host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/dri:/dev/dri \
--device=/dev/kfd:/dev/kfd \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HF_TOKEN=$HF_TOKEN \
mangollm/mb-llmboost-rocm:1.7.0 \
bash
Step 4: Start Inference Server
Inside the container, start the inference server:
llmboost serve --model_name meta-llama/Llama-3.1-8B-Instruct
The server will start on port 8011 by default. Wait for the "Ready" message before making requests.
Step 5: Make Your First Request
From another terminal (on the host), test the server:
curl http://localhost:8011/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain LLMs in simple terms."}
]
}'
Next Steps
Now that you have LLMBoost running, explore its powerful features:
- OpenAI API Compatible - Use OpenAI client libraries with zero code changes
- SLO Aware Serving - Optimize for latency and throughput
- Multi-GPU Support - Scale inference across GPUs
- Multi-Node Deployment - Scale across Kubernetes clusters
- Streaming Responses - Real-time token-by-token output
- Vision Models - Deploy multimodal image-to-text models
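As an illustration of the streaming feature listed above: OpenAI-compatible servers typically stream completions as server-sent events ("data:" lines carrying delta chunks, terminated by "data: [DONE]"). The sketch below assumes LLMBoost follows that standard format; the function names are illustrative.

```python
import json
import urllib.request


def parse_sse_chunk(line: str):
    """Extract the token text from one OpenAI-style SSE line, or None."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None  # End-of-stream sentinel
    delta = json.loads(data)["choices"][0]["delta"]
    return delta.get("content")


def stream_chat(prompt, base_url="http://localhost:8011/v1"):
    """Send a streaming chat request and yield tokens as they arrive."""
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # HTTPResponse iterates line by line
            token = parse_sse_chunk(raw.decode("utf-8"))
            if token:
                yield token


# Example (requires a running server):
# for tok in stream_chat("Explain LLMs in simple terms."):
#     print(tok, end="", flush=True)
```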
Configuration & Commands
- Configuration Options - Customize your deployment
- Command Reference - Complete lbh CLI documentation