
Streaming

LLMBoost supports real-time token-by-token streaming for interactive applications, enabling responsive user experiences with immediate feedback as the model generates text.

Why Streaming Matters

Better UX - Users see responses immediately, not after completion
Lower Perceived Latency - Start displaying content while generation continues
Interactive Applications - Perfect for chatbots and assistants
Efficient for Long Outputs - Display tokens as they're generated


How Streaming Works

With streaming enabled:

  1. Request is sent to LLMBoost server
  2. Tokens are generated one at a time
  3. Each token is sent immediately to the client
  4. Client displays tokens in real-time
  5. Stream ends when generation completes

Without streaming: Wait for entire response => Display all at once
With streaming: Display each token => Immediate feedback
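The difference can be sketched with a toy token generator. Note this generator and its delay are illustrative stand-ins for the model, not part of LLMBoost:

```python
import time

def generate_tokens():
    # Stand-in for a model producing tokens one at a time
    for token in ["Once", " upon", " a", " time", "."]:
        time.sleep(0.01)  # simulated per-token generation latency
        yield token

# Without streaming: collect the whole response, then display it at once
full_response = "".join(generate_tokens())
print(full_response)

# With streaming: display each token the moment it is produced
for token in generate_tokens():
    print(token, end="", flush=True)
print()
```

Both loops produce the same text; the streaming loop simply starts showing it after the first token instead of after the last.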

Streaming Diagram

(Diagram: the client sends a request, and the server streams generated tokens back one at a time until generation completes.)

Server Configuration

Start the Streaming Server

# Deploy a model
lbh serve meta-llama/Llama-3.1-8B-Instruct --port 8011

The server will be available at http://localhost:8011 by default.


Usage Examples

OpenAI API (Streaming)

Before sending prompts with the OpenAI API, create a client pointed at the LLMBoost server, then enable streaming with stream=True:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8011/v1",
    api_key="-"
)

# Enable streaming with stream=True
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short story about AI, in 100 words."}
    ],
    stream=True,  # Enable streaming
    max_tokens=1024
)

# Process tokens as they arrive
for chunk in response:
    # The final chunk has an empty delta, so guard against None content
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

print()  # New line at end

Output: Tokens appear one by one in real-time.
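If you also need the complete message after streaming (for example, to store the conversation history), accumulate the deltas as they arrive. The sketch below simulates the client's chunk objects with SimpleNamespace so it runs without a server; against a live server you would iterate over response instead:

```python
from types import SimpleNamespace

def simulated_chunks():
    # Stand-in for the chunks the OpenAI client yields when stream=True
    for text in ["Once", " upon", " a", " time", "."]:
        delta = SimpleNamespace(content=text)
        choice = SimpleNamespace(index=0, delta=delta, finish_reason=None)
        yield SimpleNamespace(choices=[choice])
    # Final chunk: empty delta, finish_reason="stop"
    delta = SimpleNamespace(content=None)
    choice = SimpleNamespace(index=0, delta=delta, finish_reason="stop")
    yield SimpleNamespace(choices=[choice])

parts = []
for chunk in simulated_chunks():
    if chunk.choices and chunk.choices[0].delta.content:
        parts.append(chunk.choices[0].delta.content)
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

full_message = "".join(parts)  # complete text, ready to store or log
```

The same pattern (display each delta, append it to a buffer) works unchanged with the real client.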

LLMBoost SDK (Streaming)

The script below should be run inside the LLMBoost Docker container, which you can start using the manual Docker setup.

from llmboost import LLMBoost

if __name__ == "__main__":
    # Enable streaming in initialization
    llm = LLMBoost(
        model_name="meta-llama/Llama-3.1-8B-Instruct",
        streaming=True,  # Enable streaming
        tp=1,
        dp=1,
        max_tokens=1024
    )
    llm.start()

    # Submit prompt
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short story about AI, in 100 words."}
    ]

    prompt = llm.apply_format(messages)
    llm.issue_inputs(prompt)

    # Stream tokens as they're generated
    resp_finish = False
    while not resp_finish:
        outputs = llm.get_output()

        for output in outputs:
            print(output['val'], end="", flush=True)

            if output['finished']:
                resp_finish = True
                break

    print()  # New line
    llm.stop()

Streaming Response Format

Server-Sent Events (SSE)

Streaming responses use the SSE format:

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1699999999,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"role":"assistant","content":"Once"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1699999999,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1699999999,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"content":" a"},"finish_reason":null}]}

...

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1699999999,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: {"usage": {"prompt_tokens": 54, "completion_tokens": 144, "total_tokens": 198}}

data: [DONE]
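Clients that do not use the OpenAI SDK can consume these SSE lines directly. The sketch below parses data: lines of the shape shown above; the sample lines are hard-coded for illustration, whereas a real client would read them from the HTTP response body:

```python
import json

def parse_sse_line(line: str):
    """Return the delta content from one SSE data line, or None."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    event = json.loads(payload)
    choices = event.get("choices", [])
    if not choices:
        return None  # e.g. the trailing usage event has no choices
    return choices[0].get("delta", {}).get("content")

# Sample lines in the format shown above (abbreviated for illustration)
sample_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Once"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    'data: [DONE]',
]

text = "".join(c for line in sample_lines if (c := parse_sse_line(line)))
print(text)
```

The empty delta on the final chunk and the [DONE] sentinel both parse to None, so only real token content is accumulated.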

Chunk Structure

Each chunk contains:

{
  "id": "chatcmpl-123",
  "object": "chat.completion.chunk",
  "created": 1699999999,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "token"   // New token content
      },
      "finish_reason": null  // null until stream ends
    }
  ]
}

Last chunk has "finish_reason": "stop" and empty delta.


Performance Comparison

Mode          | Time to First Token | Total Time | User Experience
Streaming     | ~50-200 ms          | Same       | Excellent (best for UX)
Non-Streaming | N/A                 | Same       | Good

Streaming provides significantly better perceived performance even though total generation time is similar. Users see output immediately instead of waiting for completion.
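Time to first token can be measured by timestamping the first chunk. The sketch below times a simulated stream (the generator and its sleeps are illustrative); with a live server you would wrap the for chunk in response: loop the same way:

```python
import time

def simulated_stream():
    time.sleep(0.05)  # simulated prefill before the first token
    for token in ["Hello", " world"]:
        time.sleep(0.01)  # simulated per-token decode time
        yield token

start = time.monotonic()
first_token_time = None

for token in simulated_stream():
    if first_token_time is None:
        # Time to first token: prefill plus first decode step
        first_token_time = time.monotonic() - start
    print(token, end="", flush=True)
print()

total_time = time.monotonic() - start
print(f"TTFT: {first_token_time*1000:.0f} ms, total: {total_time*1000:.0f} ms")
```

This makes the table's point concrete: total time is the same either way, but the user starts reading at TTFT rather than at total time.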


Use Cases

When to Use Streaming

  • Chatbots and assistants - Interactive conversations
  • Content generation - Long-form articles, stories
  • Code generation - Real-time code completion
  • Translation - Immediate feedback on translated text
  • Summarization - Progressive summary display

When NOT to Use Streaming

  • Offline batch processing - Processing many requests offline
  • Structured outputs - Need complete JSON before parsing
  • Analytics - Only care about final result
  • Maximum throughput - Non-streaming can be slightly faster in batch scenarios

Troubleshooting

Delayed First Token

If time to first token is slow:

  1. Check GPU utilization - Ensure GPUs aren't overloaded
  2. Reduce batch size - Lower --max_num_seqs
  3. Enable prefill optimization - Use --enable_chunked_prefill

Choppy Streaming

If tokens arrive in bursts:

  1. Network buffering - Check network configuration
  2. Server load - Reduce concurrent requests
  3. Client buffering - Adjust client buffer sizes



Questions? Contact contact@mangoboost.io