# Streaming

LLMBoost supports real-time, token-by-token streaming for interactive applications, enabling responsive user experiences with immediate feedback as the model generates text.
## Why Streaming Matters

- Better UX - Users see responses immediately, not after completion
- Lower perceived latency - Start displaying content while generation continues
- Interactive applications - Ideal for chatbots and assistants
- Efficient for long outputs - Display tokens as they're generated
## How Streaming Works

With streaming enabled:

- The request is sent to the LLMBoost server
- Tokens are generated one at a time
- Each token is sent to the client as soon as it is produced
- The client displays tokens in real time
- The stream ends when generation completes

Without streaming: wait for the entire response => display it all at once.
With streaming: display each token as it arrives => immediate feedback.
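The contrast can be sketched with a toy token generator (illustrative only; `fake_generate` is a stand-in for the model, not part of LLMBoost):

```python
import time

def fake_generate(text):
    """Toy stand-in for a model: yields one token at a time."""
    for token in text.split():
        time.sleep(0.05)  # simulate per-token generation latency
        yield token + " "

def non_streaming(prompt="Once upon a time"):
    # Without streaming: collect everything, then hand it over at once.
    # The user sees nothing until the join completes.
    return "".join(fake_generate(prompt))

def streaming(prompt="Once upon a time"):
    # With streaming: display each token the moment it is produced.
    for token in fake_generate(prompt):
        print(token, end="", flush=True)
    print()
```

Both functions take the same total time; only `streaming` gives the user feedback before generation finishes.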
## Server Configuration

- Using LLMBoost Hub
- Manual Setup
### Start the Streaming Server

```shell
# Deploy a model
lbh serve meta-llama/Llama-3.1-8B-Instruct --port 8011
```

The server will be available at `http://localhost:8011` by default.
### Start the Server

The command below should be executed inside the LLMBoost Docker container, which you can start by following the manual Docker setup. Inside the container, launch the service with:

```shell
llmboost serve --model_name meta-llama/Llama-3.1-8B-Instruct
```

The server will be available at `http://localhost:8011` by default.
## Usage Examples

### OpenAI API (Streaming)

Before sending prompts with the OpenAI API, make sure the server from the previous section is running.

- Python (Sync)
- Python (Async)
- curl
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8011/v1",
    api_key="-"
)

# Enable streaming with stream=True
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short story about AI, in 100 words."}
    ],
    stream=True,  # Enable streaming
    max_tokens=1024
)

# Process tokens as they arrive
for chunk in response:
    # delta.content can be None (e.g., in the final chunk), so guard against it
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # New line at end
```
Output: Tokens appear one by one in real-time.
```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:8011/v1",
    api_key="-"
)

async def stream_response():
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a short story about AI, in 100 words."}
        ],
        stream=True,
        max_tokens=1024
    )

    # Async iteration over chunks
    async for chunk in response:
        # delta.content can be None (e.g., in the final chunk), so guard against it
        if chunk.choices and chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

asyncio.run(stream_response())
```
```shell
curl http://localhost:8011/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Write a short story about AI, in 100 words."
      }
    ],
    "stream": true,
    "max_tokens": 1024
  }'
```
Output: Server-sent events (SSE) format with token chunks.
### LLMBoost SDK (Streaming)

- Synchronous
- Asynchronous

The code below should be executed inside the LLMBoost Docker container, which you can start by following the manual Docker setup.
```python
from llmboost import LLMBoost

if __name__ == "__main__":
    # Enable streaming in initialization
    llm = LLMBoost(
        model_name="meta-llama/Llama-3.1-8B-Instruct",
        streaming=True,  # Enable streaming
        tp=1,
        dp=1,
        max_tokens=1024
    )
    llm.start()

    # Submit prompt
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short story about AI, in 100 words."}
    ]
    prompt = llm.apply_format(messages)
    llm.issues_inputs(prompt)

    # Stream tokens as they're generated
    resp_finish = False
    while not resp_finish:
        outputs = llm.get_output()
        for output in outputs:
            print(output['val'], end="", flush=True)
            if output['finished']:
                resp_finish = True
                break
    print()  # New line

    llm.stop()
```
The code below should be executed inside the LLMBoost Docker container, which you can start by following the manual Docker setup.

```python
import asyncio
from llmboost import LLMBoost

async def stream_with_llmboost():
    # Enable async output and streaming
    llm = LLMBoost(
        model_name="meta-llama/Llama-3.1-8B-Instruct",
        streaming=True,
        enable_async_output=True,  # Enable async mode
        tp=1,
        dp=1,
        max_tokens=1024
    )
    llm.start()

    # Submit prompt
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short story about AI, in 100 words."}
    ]
    prompt = llm.apply_format(messages)
    llm.issues_inputs(prompt)

    # Async streaming
    while True:
        output = await llm.aget_output()
        print(output['val'], end="", flush=True)
        if output['finished']:
            break
    print()

    llm.stop()

if __name__ == "__main__":
    asyncio.run(stream_with_llmboost())
```
## Streaming Response Format

### Server-Sent Events (SSE)

Streaming responses use the SSE format:

```
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1699999999,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"role":"assistant","content":"Once"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1699999999,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1699999999,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"content":" a"},"finish_reason":null}]}

...

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1699999999,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: {"usage": {"prompt_tokens": 54, "completion_tokens": 144, "total_tokens": 198}}

data: [DONE]
```
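If you consume the stream over raw HTTP instead of through the OpenAI client, each SSE line can be decoded with a small helper like the sketch below (an illustrative function, not part of the LLMBoost SDK):

```python
import json

def delta_from_sse_line(line: str):
    """Extract the delta content from one 'data: ...' SSE line.

    Returns the token text, or None for non-content lines
    (the [DONE] sentinel, the usage record, empty deltas).
    """
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        return None
    chunk = json.loads(payload)
    choices = chunk.get("choices")
    if not choices:
        return None  # e.g., the trailing usage record
    return choices[0].get("delta", {}).get("content")

# Example with a line from the stream above:
line = ('data: {"id":"chatcmpl-123","object":"chat.completion.chunk",'
        '"created":1699999999,"model":"meta-llama/Llama-3.1-8B-Instruct",'
        '"choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}')
print(repr(delta_from_sse_line(line)))  # ' upon'
```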
### Chunk Structure

Each chunk contains:

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion.chunk",
  "created": 1699999999,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "token" // New token content
      },
      "finish_reason": null // null until stream ends
    }
  ]
}
```

The last chunk has `"finish_reason": "stop"` and an empty delta.
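When the application needs the complete text after all (for example, to parse a JSON answer once generation finishes), the deltas can simply be accumulated. A minimal sketch over chunk-shaped dicts like the one above (illustrative helper, not part of the SDK):

```python
def accumulate(chunks):
    """Join the delta contents of a chunk sequence into the full message."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason

chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": "Once"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": " upon"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]
print(accumulate(chunks))  # ('Once upon', 'stop')
```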
## Performance Comparison

| Mode | Time to First Token | Total Time | User Experience |
|---|---|---|---|
| Streaming | ~50-200ms | Same | Excellent |
| Non-Streaming | N/A | Same | Good |
Streaming provides significantly better perceived performance even though total generation time is similar. Users see output immediately instead of waiting for completion.
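Time to first token can be measured against any chunk iterator. A minimal sketch (the `fake_stream` generator here is a stand-in for a real streaming response such as `client.chat.completions.create(..., stream=True)`):

```python
import time

def time_to_first_token(chunks):
    """Return (seconds until first chunk, total seconds) for a chunk iterator."""
    start = time.monotonic()
    first = None
    for _ in chunks:
        if first is None:
            first = time.monotonic() - start
    return first, time.monotonic() - start

def fake_stream():
    # Stand-in stream: first token after ~0.1 s, then a few more tokens
    time.sleep(0.1)
    yield "Once"
    for tok in [" upon", " a", " time"]:
        time.sleep(0.02)
        yield tok

ttft, total = time_to_first_token(fake_stream())
print(f"TTFT: {ttft:.2f}s, total: {total:.2f}s")
```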
## Use Cases

### When to Use Streaming

- Chatbots and assistants - Interactive conversations
- Content generation - Long-form articles, stories
- Code generation - Real-time code completion
- Translation - Immediate feedback on translated text
- Summarization - Progressive summary display

### When NOT to Use Streaming

- Offline batch processing - Processing many requests offline
- Structured outputs - Need the complete JSON before parsing
- Analytics - Only the final result matters
- Maximum throughput - Non-streaming can be slightly faster in batch scenarios
## Troubleshooting

### Delayed First Token

If time to first token is slow:

- Check GPU utilization - Ensure GPUs aren't overloaded
- Reduce batch size - Lower `--max_num_seqs`
- Enable prefill optimization - Use `--enable_chunked_prefill`
### Choppy Streaming

If tokens arrive in bursts:

- Network buffering - Check network configuration
- Server load - Reduce concurrent requests
- Client buffering - Adjust client buffer sizes
## Next Steps

- Vision - Stream responses from multimodal models
- OpenAI API Compatible - Full API reference

Questions? Contact contact@mangoboost.io