Vision (Multimodal Models)
LLMBoost supports multimodal vision models that can understand and reason about images, enabling applications like image captioning, visual question answering, and content moderation.
Why Vision Models Matter
- Understand Images - Describe, analyze, and reason about visual content
- Visual QA - Answer questions about images
- Content Moderation - Analyze images for safety and compliance
- Accessibility - Generate alt text and descriptions
- Multimodal Chat - Combine text and images in conversations
Supported Models
LLMBoost supports popular vision-language models including:
- Qwen-VL series (e.g. Qwen/Qwen2.5-VL-72B-Instruct)
- LLaVA models (e.g. llava-hf/llava-1.5-7b-hf)
Getting Started
Using LLMBoost Hub

# Serve a vision model
lbh serve Qwen/Qwen2.5-VL-72B-Instruct -- --query_type image --limit_mm_per_prompt 10

The inference service will listen on localhost:8011 once ready.

Manual Setup

Run the following inside the LLMBoost Docker container, which you can start using the manual Docker setup. Inside the container, launch the service with:
llmboost serve \
--model_path Qwen/Qwen2.5-VL-72B-Instruct \
--query_type image \
--max_model_len 8192 \
--limit_mm_per_prompt 10 \
--tp 1 --dp 1
Parameters:

- --query_type image: Enable image processing
- --limit_mm_per_prompt 10: Maximum images per request
- --max_model_len 8192: Maximum sequence length
Usage Examples
Single Image Analysis
Python (OpenAI):
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8011/v1",
    api_key="-"
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://picsum.photos/id/237/640/480.jpg"}
                }
            ]
        }
    ],
    max_tokens=512
)

print(response.choices[0].message.content)
curl:
curl -X POST http://localhost:8011/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-VL-72B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://picsum.photos/id/237/640/480.jpg"
}
}
]
}
],
"max_tokens": 512
}'
Multiple Images
Analyze multiple images in a single request:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8011/v1", api_key="-")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two images. What are the differences?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://picsum.photos/id/236/640/480.jpg"}
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://picsum.photos/id/238/640/480.jpg"}
                }
            ]
        }
    ],
    max_tokens=512
)

print(response.choices[0].message.content)
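When requests carry several images, the nested message payload gets repetitive to write by hand. A small helper can assemble it from a prompt and a list of URLs. This is a sketch, not part of LLMBoost; build_image_message is a hypothetical name:

```python
# Hypothetical helper (not an LLMBoost API) that assembles an OpenAI-style
# multimodal user message from a text prompt and any number of image URLs.
def build_image_message(prompt, image_urls):
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}

msg = build_image_message(
    "Compare these two images. What are the differences?",
    [
        "https://picsum.photos/id/236/640/480.jpg",
        "https://picsum.photos/id/238/640/480.jpg",
    ],
)
```

The returned dict can be passed directly as an element of the `messages` list in the example above.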
Conversational Vision
Build multi-turn conversations with images:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8011/v1", api_key="-")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful AI assistant that can see and understand images."
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this vacation photo?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://picsum.photos/id/237/640/480.jpg"}
                }
            ]
        },
        {
            "role": "assistant",
            "content": "This image shows a black dog resting peacefully in a serene natural setting."
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What might happen next in this adventure?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://picsum.photos/id/128/640/480.jpg"}
                }
            ]
        }
    ],
    max_tokens=512
)

print(response.choices[0].message.content)
Image Input Formats
LLMBoost supports multiple ways to provide images:
1. Remote URLs (Recommended)
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.jpg"
}
}
Supported formats: JPEG, PNG, GIF, WebP
2. Base64 Encoded
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
# Use in request
{
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{encode_image('path/to/image.jpg')}"
}
}
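When images arrive in memory (e.g. from an upload) rather than on disk, a variant that turns raw bytes directly into a complete data URL avoids the temp-file round trip. data_url_from_bytes is a hypothetical helper, not an LLMBoost API:

```python
import base64

def data_url_from_bytes(data, mime="image/jpeg"):
    # Base64-encode raw image bytes and wrap them in a data URL
    # usable as the "url" value of an "image_url" content part.
    b64 = base64.b64encode(data).decode("utf-8")
    return f"data:{mime};base64,{b64}"

url = data_url_from_bytes(b"abc", mime="image/png")
# → "data:image/png;base64,YWJj"
```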
3. Detail Parameter
Control image resolution processing:
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.jpg",
"detail": "auto" # Options: "auto", "low", "high"
}
}
- auto: Automatically choose based on image size (recommended)
- low: Faster processing, lower detail
- high: Higher detail, slower processing
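If you want explicit control rather than "auto", one illustrative client-side policy (an assumption for this sketch, not an LLMBoost rule) is to request low detail for small images and high detail for large ones:

```python
def choose_detail(width, height, high_threshold=1024):
    # Illustrative heuristic: small images rarely benefit from "high"
    # detail, while large images may lose content at "low".
    if max(width, height) >= high_threshold:
        return "high"
    return "low"

part = {
    "type": "image_url",
    "image_url": {
        "url": "https://example.com/image.jpg",
        "detail": choose_detail(640, 480),
    },
}
```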
Configuration Parameters
Server Configuration
llmboost serve \
  --model_path Qwen/Qwen2.5-VL-72B-Instruct \
  --query_type image \
  --max_model_len 8192 \
  --limit_mm_per_prompt 10 \
  --tp 1 --dp 1
| Parameter | Description | Default |
|---|---|---|
| --query_type | Input type (must be image for vision) | text |
| --limit_mm_per_prompt | Maximum images per request | 1 |
| --max_model_len | Maximum sequence length | 8192 |
Streaming with Vision Models
Vision models support streaming just like text models:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8011/v1", api_key="-")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://picsum.photos/id/237/640/480.jpg"}
                }
            ]
        }
    ],
    stream=True,  # Enable streaming
    max_tokens=512
)

# Stream tokens as they're generated; some chunks carry an empty delta
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
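If you also need the complete text after streaming finishes, accumulate the deltas as they arrive. The sketch below shows just that accumulation logic, exercised with stand-in chunk objects (types.SimpleNamespace) shaped like the streaming response, so it runs without a server:

```python
from types import SimpleNamespace

def collect_stream(chunks):
    # Join the non-empty delta.content pieces of a streamed response.
    parts = []
    for chunk in chunks:
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)

# Stand-in chunks mimicking the streaming API's object shape.
fake_chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="A black "))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="dog."))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))]),
]
text = collect_stream(fake_chunks)
# → "A black dog."
```

In real use, pass the `response` object from a `stream=True` call in place of `fake_chunks`.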
Use Cases
1. Image Captioning
Generate descriptions for accessibility:
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Generate an alt text description for this image."},
                {"type": "image_url", "image_url": {"url": "https://..."}}
            ]
        }
    ]
)
2. Visual Question Answering
Answer specific questions about images:
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How many people are in this image?"},
                {"type": "image_url", "image_url": {"url": "https://..."}}
            ]
        }
    ]
)
3. Content Moderation
Analyze images for safety:
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Is this image appropriate for a family audience?"},
                {"type": "image_url", "image_url": {"url": "https://..."}}
            ]
        }
    ]
)
4. Visual Search & Shopping
Extract product information:
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the product features and colors in this image."},
                {"type": "image_url", "image_url": {"url": "https://..."}}
            ]
        }
    ]
)
Troubleshooting
Out of Memory
# Increase tensor parallelism
lbh serve Qwen/Qwen2.5-VL-72B-Instruct -- --query_type image --limit_mm_per_prompt 10 --tp 2
Image Not Loading
- Verify URL is accessible from server
- Check image format (JPEG, PNG, GIF, WebP)
- Ensure image size < 20MB
- Try base64 encoding as alternative
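A quick local check of format and size before uploading can rule out the client-side causes above. The magic-number prefixes below are the standard file signatures for these formats; the 20MB cap mirrors the size limit mentioned above, and validate_image_bytes is a hypothetical helper, not an LLMBoost API:

```python
def validate_image_bytes(data, max_bytes=20 * 1024 * 1024):
    # Identify JPEG/PNG/GIF/WebP by their leading magic bytes and enforce
    # the size limit; returns (ok, detected_format_or_reason).
    if len(data) > max_bytes:
        return False, "too large"
    if data.startswith(b"\xff\xd8\xff"):
        return True, "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return True, "png"
    if data.startswith(b"GIF87a") or data.startswith(b"GIF89a"):
        return True, "gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return True, "webp"
    return False, "unsupported format"
```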
Slow Processing
- Reduce --max_model_len
- Use "detail": "low" for simple images
Next Steps
- Streaming - Enable real-time responses with vision
- Single-Node Multi-GPU - Scale large vision models
- Python SDK - Integrate vision models into Python apps
Questions? Contact contact@mangoboost.io