
Vision (Multimodal Models)

LLMBoost supports multimodal vision models that can understand and reason about images, enabling applications like image captioning, visual question answering, and content moderation.

Why Vision Models Matter

  • Understand Images - Describe, analyze, and reason about visual content
  • Visual QA - Answer questions about images
  • Content Moderation - Analyze images for safety and compliance
  • Accessibility - Generate alt text and descriptions
  • Multimodal Chat - Combine text and images in conversations


Supported Models

LLMBoost supports popular vision-language models including:

  • Qwen-VL series (e.g. Qwen/Qwen2.5-VL-72B-Instruct)
  • LLaVA models (e.g. llava-hf/llava-1.5-7b-hf)

Getting Started

Deploy a Vision Model

# Serve a vision model
lbh serve Qwen/Qwen2.5-VL-72B-Instruct -- --query_type image --limit_mm_per_prompt 10

The inference service will listen on localhost:8011 once ready.
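Before sending requests, you can wait for the server to come up. The sketch below is a hypothetical helper (not part of LLMBoost) that polls the `/v1/models` endpoint, assuming the service exposes the standard OpenAI-compatible route at that path:

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url, timeout_s=120.0, poll_interval=2.0):
    """Poll the OpenAI-compatible /v1/models endpoint until the server responds."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(poll_interval)  # not up yet; retry
    return False

# Example: wait_until_ready("http://localhost:8011")
```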


Usage Examples

Single Image Analysis

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8011/v1",
    api_key="-"
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://picsum.photos/id/237/640/480.jpg"
                    }
                }
            ]
        }
    ],
    max_tokens=512
)

print(response.choices[0].message.content)
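Assembling the nested content list by hand gets repetitive. One way to factor it out is a small helper; `image_message` below is a hypothetical name, not part of the OpenAI SDK or LLMBoost:

```python
def image_message(text, *image_urls):
    """Build an OpenAI-style user message mixing one text part and N image parts."""
    parts = [{"type": "text", "text": text}]
    parts += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return {"role": "user", "content": parts}

# Example: messages=[image_message("What's in this image?",
#                                  "https://picsum.photos/id/237/640/480.jpg")]
```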

Multiple Images

Analyze multiple images in a single request:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8011/v1", api_key="-")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Compare these two images. What are the differences?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://picsum.photos/id/236/640/480.jpg"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://picsum.photos/id/238/640/480.jpg"
                    }
                }
            ]
        }
    ],
    max_tokens=512
)

print(response.choices[0].message.content)

Conversational Vision

Build multi-turn conversations with images:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8011/v1", api_key="-")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful AI assistant that can see and understand images."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in this vacation photo?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://picsum.photos/id/237/640/480.jpg"
                    }
                }
            ]
        },
        {
            "role": "assistant",
            "content": "This image shows a black dog resting peacefully in a serene natural setting."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What might happen next in this adventure?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://picsum.photos/id/128/640/480.jpg"
                    }
                }
            ]
        }
    ],
    max_tokens=512
)

print(response.choices[0].message.content)

Image Input Formats

LLMBoost supports multiple ways to provide images:

1. Direct URL

{
    "type": "image_url",
    "image_url": {
        "url": "https://example.com/image.jpg"
    }
}

Supported formats: JPEG, PNG, GIF, WebP

2. Base64 Encoded

import base64

def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode('utf-8')

# Use in request
{
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{encode_image('path/to/image.jpg')}"
}
}
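A complete data URL also needs the MIME-type prefix. The self-contained sketch below wraps the encoding step into a hypothetical `to_data_url` helper and checks it against a throwaway file standing in for a real JPEG:

```python
import base64
import tempfile

def to_data_url(image_path, mime="image/jpeg"):
    """Read an image file and return it as a base64 data URL."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"

# Quick self-check with a throwaway file in place of a real image
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(b"\xff\xd8\xff\xe0 fake jpeg bytes")
    path = tmp.name

url = to_data_url(path)
print(url[:30])  # prints something like data:image/jpeg;base64,/9j/...
```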

3. Detail Parameter

Control image resolution processing:

{
    "type": "image_url",
    "image_url": {
        "url": "https://example.com/image.jpg",
        "detail": "auto"  # Options: "auto", "low", "high"
    }
}
  • auto: Automatically choose based on image size (recommended)
  • low: Faster processing, lower detail
  • high: Higher detail, slower processing

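If you build image parts programmatically, validating the detail value up front avoids a round trip to the server. `image_part` below is a hypothetical helper name, not an LLMBoost API:

```python
def image_part(url, detail="auto"):
    """Build an image_url content part with an explicit detail setting."""
    if detail not in ("auto", "low", "high"):
        raise ValueError(f"unsupported detail: {detail!r}")
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}

# Example: image_part("https://example.com/image.jpg", detail="low")
```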
Configuration Parameters

Server Configuration

llmboost serve \
    --model_path Qwen/Qwen2.5-VL-72B-Instruct \
    --query_type image \
    --max_model_len 8192 \
    --limit_mm_per_prompt 10 \
    --tp 1 --dp 1
| Parameter | Description | Default |
| --- | --- | --- |
| --query_type | Input type (must be image for vision) | text |
| --limit_mm_per_prompt | Maximum images per request | 1 |
| --max_model_len | Maximum sequence length | 8192 |

Streaming with Vision Models

Vision models support streaming just like text models:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8011/v1", api_key="-")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in detail."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://picsum.photos/id/237/640/480.jpg"
                    }
                }
            ]
        }
    ],
    stream=True,  # Enable streaming
    max_tokens=512
)

# Stream tokens as they're generated (the final chunk's delta.content can be None)
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
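When you need the full text rather than incremental printing, the accumulation logic can be isolated and tested offline. The sketch below operates on any objects shaped like OpenAI streaming chunks; the helper names are hypothetical:

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Join streamed delta contents into one string, skipping empty/None deltas."""
    pieces = []
    for chunk in chunks:
        if chunk.choices and chunk.choices[0].delta.content:
            pieces.append(chunk.choices[0].delta.content)
    return "".join(pieces)

# Demo with stand-in objects shaped like OpenAI streaming chunks
def fake_chunk(content):
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=content))])

print(collect_stream([fake_chunk("A black "), fake_chunk(None), fake_chunk("dog.")]))  # → A black dog.
```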

Use Cases

1. Image Captioning

Generate descriptions for accessibility:

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Generate an alt text description for this image."},
                {"type": "image_url", "image_url": {"url": "https://..."}}
            ]
        }
    ]
)

2. Visual Question Answering

Answer specific questions about images:

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How many people are in this image?"},
                {"type": "image_url", "image_url": {"url": "https://..."}}
            ]
        }
    ]
)

3. Content Moderation

Analyze images for safety:

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Is this image appropriate for a family audience?"},
                {"type": "image_url", "image_url": {"url": "https://..."}}
            ]
        }
    ]
)

4. Visual Search & Shopping

Extract product information:

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the product features and colors in this image."},
                {"type": "image_url", "image_url": {"url": "https://..."}}
            ]
        }
    ]
)

Troubleshooting

Out of Memory

# Increase tensor parallelism
lbh serve Qwen/Qwen2.5-VL-72B-Instruct -- --query_type image --limit_mm_per_prompt 10 --tp 2

Image Not Loading

  • Verify URL is accessible from server
  • Check image format (JPEG, PNG, GIF, WebP)
  • Ensure image size < 20MB
  • Try base64 encoding as alternative
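For local files, the format and size checks above can be run client-side before a request is sent. This is a minimal sketch assuming the limits listed in this section (20MB cap, JPEG/PNG/GIF/WebP); `preflight_image` is a hypothetical helper, not part of LLMBoost:

```python
import os

MAX_IMAGE_BYTES = 20 * 1024 * 1024  # 20MB limit noted above
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}

def preflight_image(image_path):
    """Raise a descriptive error before sending an image the server may reject."""
    ext = os.path.splitext(image_path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported image format: {ext or '(no extension)'}")
    if os.path.getsize(image_path) >= MAX_IMAGE_BYTES:
        raise ValueError("image is 20MB or larger; resize or compress it first")
```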

Slow Processing

  • Reduce --max_model_len
  • Use "detail": "low" for simple images

Questions? Contact contact@mangoboost.io