Vision (Multimodal Models)
LLMBoost supports multimodal vision models that can understand and reason about images, enabling applications like image captioning, visual question answering, and content moderation.
Why Vision Models Matter
- Understand Images - Describe, analyze, and reason about visual content
- Visual QA - Answer questions about images
- Content Moderation - Analyze images for safety and compliance
- Accessibility - Generate alt text and descriptions
- Multimodal Chat - Combine text and images in conversations
Supported Models
LLMBoost supports popular vision-language models including:
- Qwen-VL series (e.g. Qwen/Qwen2.5-VL-72B-Instruct)
- LLaVA models (e.g. llava-hf/llava-1.5-7b-hf)
Getting Started
Using LLMBoost Hub

# Serve a vision model
lbh serve Qwen/Qwen2.5-VL-72B-Instruct -- --query_type image --limit_mm_per_prompt 10

The inference service will listen on localhost:8011 once ready.

Manual Setup

Run the following inside the LLMBoost Docker container, which you can start using the manual Docker setup. Inside the container, launch the service with:
llmboost serve \
--model_path Qwen/Qwen2.5-VL-72B-Instruct \
--query_type image \
--max_model_len 8192 \
--limit_mm_per_prompt 10 \
--tp 1 --dp 1
Parameters:

- --query_type image: Enable image processing
- --limit_mm_per_prompt 10: Maximum images per request
- --max_model_len 8192: Maximum sequence length
Usage Examples
Single Image Analysis
Python (OpenAI):
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8011/v1",
    api_key="-"
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://picsum.photos/id/237/640/480.jpg"}
                }
            ]
        }
    ],
    max_tokens=512
)

print(response.choices[0].message.content)
curl:
curl -X POST http://localhost:8011/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-VL-72B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://picsum.photos/id/237/640/480.jpg"
}
}
]
}
],
"max_tokens": 512
}'
Multiple Images
Analyze multiple images in a single request:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8011/v1", api_key="-")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two images. What are the differences?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://picsum.photos/id/236/640/480.jpg"}
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://picsum.photos/id/238/640/480.jpg"}
                }
            ]
        }
    ],
    max_tokens=512
)

print(response.choices[0].message.content)
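When requests carry several images, the nested message payload gets repetitive to write by hand. A small helper can assemble it from a prompt and a list of URLs. This is a sketch, not part of LLMBoost; build_image_message is a hypothetical name:

```python
# Hypothetical helper (not an LLMBoost API) that assembles an OpenAI-style
# multimodal user message from a text prompt and any number of image URLs.
def build_image_message(prompt, image_urls):
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}

msg = build_image_message(
    "Compare these two images. What are the differences?",
    [
        "https://picsum.photos/id/236/640/480.jpg",
        "https://picsum.photos/id/238/640/480.jpg",
    ],
)
```

The returned dict can be passed directly as an element of the `messages` list in the example above.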
Conversational Vision
Build multi-turn conversations with images:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8011/v1", api_key="-")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful AI assistant that can see and understand images."
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this vacation photo?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://picsum.photos/id/237/640/480.jpg"}
                }
            ]
        },
        {
            "role": "assistant",
            "content": "This image shows a black dog resting peacefully in a serene natural setting."
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What might happen next in this adventure?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://picsum.photos/id/128/640/480.jpg"}
                }
            ]
        }
    ],
    max_tokens=512
)

print(response.choices[0].message.content)
Image Input Formats
LLMBoost supports multiple ways to provide images:
1. Remote URLs (Recommended)
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.jpg"
}
}
Supported formats: JPEG, PNG, GIF, WebP
2. Base64 Encoded
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
# Use in request
{
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{encode_image('path/to/image.jpg')}"
}
}
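When images arrive in memory (e.g. from an upload) rather than on disk, a variant that turns raw bytes directly into a complete data URL avoids the temp-file round trip. data_url_from_bytes is a hypothetical helper, not an LLMBoost API:

```python
import base64

def data_url_from_bytes(data, mime="image/jpeg"):
    # Base64-encode raw image bytes and wrap them in a data URL
    # usable as the "url" value of an "image_url" content part.
    b64 = base64.b64encode(data).decode("utf-8")
    return f"data:{mime};base64,{b64}"

url = data_url_from_bytes(b"abc", mime="image/png")
# → "data:image/png;base64,YWJj"
```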
3. Detail Parameter
Control image resolution processing:
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.jpg",
"detail": "auto" # Options: "auto", "low", "high"
}
}
- auto: Automatically choose based on image size (recommended)
- low: Faster processing, lower detail
- high: Higher detail, slower processing
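If you want explicit control rather than "auto", one illustrative client-side policy (an assumption for this sketch, not an LLMBoost rule) is to request low detail for small images and high detail for large ones:

```python
def choose_detail(width, height, high_threshold=1024):
    # Illustrative heuristic: small images rarely benefit from "high"
    # detail, while large images may lose content at "low".
    if max(width, height) >= high_threshold:
        return "high"
    return "low"

part = {
    "type": "image_url",
    "image_url": {
        "url": "https://example.com/image.jpg",
        "detail": choose_detail(640, 480),
    },
}
```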
Configuration Parameters
Server Configuration
llmboost serve \
  --model_path Qwen/Qwen2.5-VL-72B-Instruct \
  --query_type image \
  --max_model_len 8192 \
  --limit_mm_per_prompt 10 \
  --tp 1 --dp 1
| Parameter | Description | Default |
|---|---|---|
| --query_type | Input type (must be image for vision) | text |
| --limit_mm_per_prompt | Maximum images per request | 1 |
| --max_model_len | Maximum sequence length | 8192 |
Streaming with Vision Models
Vision models support streaming just like text models:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8011/v1", api_key="-")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://picsum.photos/id/237/640/480.jpg"}
                }
            ]
        }
    ],
    stream=True,  # Enable streaming
    max_tokens=512
)

# Stream tokens as they're generated; some chunks carry an empty delta
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
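If you also need the complete text after streaming finishes, accumulate the deltas as they arrive. The sketch below shows just that accumulation logic, exercised with stand-in chunk objects (types.SimpleNamespace) shaped like the streaming response, so it runs without a server:

```python
from types import SimpleNamespace

def collect_stream(chunks):
    # Join the non-empty delta.content pieces of a streamed response.
    parts = []
    for chunk in chunks:
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)

# Stand-in chunks mimicking the streaming API's object shape.
fake_chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="A black "))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="dog."))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))]),
]
text = collect_stream(fake_chunks)
# → "A black dog."
```

In real use, pass the `response` object from a `stream=True` call in place of `fake_chunks`.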
Use Cases
1. Image Captioning
Generate descriptions for accessibility:
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Generate an alt text description for this image."},
                {"type": "image_url", "image_url": {"url": "https://..."}}
            ]
        }
    ]
)
2. Visual Question Answering
Answer specific questions about images:
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How many people are in this image?"},
                {"type": "image_url", "image_url": {"url": "https://..."}}
            ]
        }
    ]
)
3. Content Moderation
Analyze images for safety:
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Is this image appropriate for a family audience?"},
                {"type": "image_url", "image_url": {"url": "https://..."}}
            ]
        }
    ]
)
4. Visual Search & Shopping
Extract product information:
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the product features and colors in this image."},
                {"type": "image_url", "image_url": {"url": "https://..."}}
            ]
        }
    ]
)
Troubleshooting
Out of Memory
# Increase tensor parallelism
lbh serve Qwen/Qwen2.5-VL-72B-Instruct -- --query_type image --limit_mm_per_prompt 10 --tp 2
Image Not Loading
- Verify URL is accessible from server
- Check image format (JPEG, PNG, GIF, WebP)
- Ensure image size < 20MB
- Try base64 encoding as alternative
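A quick local check of format and size before uploading can rule out the client-side causes above. The magic-number prefixes below are the standard file signatures for these formats; the 20MB cap mirrors the size limit mentioned above, and validate_image_bytes is a hypothetical helper, not an LLMBoost API:

```python
def validate_image_bytes(data, max_bytes=20 * 1024 * 1024):
    # Identify JPEG/PNG/GIF/WebP by their leading magic bytes and enforce
    # the size limit; returns (ok, detected_format_or_reason).
    if len(data) > max_bytes:
        return False, "too large"
    if data.startswith(b"\xff\xd8\xff"):
        return True, "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return True, "png"
    if data.startswith(b"GIF87a") or data.startswith(b"GIF89a"):
        return True, "gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return True, "webp"
    return False, "unsupported format"
```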
Slow Processing
- Reduce --max_model_len
- Use "detail": "low" for simple images
Next Steps
- Streaming - Enable real-time responses with vision
- Single-Node Multi-GPU - Scale large vision models
- Python SDK - Integrate vision models into Python apps
Questions? Contact contact@mangoboost.io