# Streaming

LLMBoost supports real-time, token-by-token streaming for interactive applications, enabling responsive user experiences with immediate feedback as the model generates text.
## Why Streaming Matters

- Better UX - Users see responses immediately, not after completion
- Lower perceived latency - Start displaying content while generation continues
- Interactive applications - Ideal for chatbots and assistants
- Efficient for long outputs - Display tokens as they're generated
## How Streaming Works

With streaming enabled:

- The request is sent to the LLMBoost server
- Tokens are generated one at a time
- Each token is sent to the client as soon as it is produced
- The client displays tokens in real time
- The stream ends when generation completes

Without streaming: wait for the entire response => display it all at once.
With streaming: display each token as it arrives => immediate feedback.
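The contrast can be sketched with a toy token generator (illustrative only; `fake_generate` is a stand-in for the model, not part of LLMBoost):

```python
import time

def fake_generate(text):
    """Toy stand-in for a model: yields one token at a time."""
    for token in text.split():
        time.sleep(0.05)  # simulate per-token generation latency
        yield token + " "

def non_streaming(prompt="Once upon a time"):
    # Without streaming: collect everything, then hand it over at once.
    # The user sees nothing until the join completes.
    return "".join(fake_generate(prompt))

def streaming(prompt="Once upon a time"):
    # With streaming: display each token the moment it is produced.
    for token in fake_generate(prompt):
        print(token, end="", flush=True)
    print()
```

Both functions take the same total time; only `streaming` gives the user feedback before generation finishes.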
## Server Configuration

- Using LLMBoost Hub
- Manual Setup
### Start the Streaming Server

```shell
# Deploy a model
lbh serve meta-llama/Llama-3.1-8B-Instruct --port 8011
```

The server will be available at `http://localhost:8011` by default.
### Start the Server

The command below should be executed inside the LLMBoost Docker container, which you can start by following the manual Docker setup. Inside the container, launch the service with:

```shell
llmboost serve --model_name meta-llama/Llama-3.1-8B-Instruct
```

The server will be available at `http://localhost:8011` by default.
## Usage Examples

### OpenAI API (Streaming)

Before sending prompts with the OpenAI API, make sure the server from the previous section is running.

- Python (Sync)
- Python (Async)
- curl
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8011/v1",
    api_key="-"
)

# Enable streaming with stream=True
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short story about AI, in 100 words."}
    ],
    stream=True,  # Enable streaming
    max_tokens=1024
)

# Process tokens as they arrive
for chunk in response:
    # delta.content can be None (e.g., in the final chunk), so guard against it
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # New line at end
```
Output: Tokens appear one by one in real-time.
```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:8011/v1",
    api_key="-"
)

async def stream_response():
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a short story about AI, in 100 words."}
        ],
        stream=True,
        max_tokens=1024
    )

    # Async iteration over chunks
    async for chunk in response:
        # delta.content can be None (e.g., in the final chunk), so guard against it
        if chunk.choices and chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

asyncio.run(stream_response())
```
```shell
curl http://localhost:8011/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Write a short story about AI, in 100 words."
      }
    ],
    "stream": true,
    "max_tokens": 1024
  }'
```
Output: Server-sent events (SSE) format with token chunks.
### LLMBoost SDK (Streaming)

- Synchronous
- Asynchronous

The code below should be executed inside the LLMBoost Docker container, which you can start by following the manual Docker setup.
```python
from llmboost import LLMBoost

if __name__ == "__main__":
    # Enable streaming in initialization
    llm = LLMBoost(
        model_name="meta-llama/Llama-3.1-8B-Instruct",
        streaming=True,  # Enable streaming
        tp=1,
        dp=1,
        max_tokens=1024
    )
    llm.start()

    # Submit prompt
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short story about AI, in 100 words."}
    ]
    prompt = llm.apply_format(messages)
    llm.issues_inputs(prompt)

    # Stream tokens as they're generated
    resp_finish = False
    while not resp_finish:
        outputs = llm.get_output()
        for output in outputs:
            print(output['val'], end="", flush=True)
            if output['finished']:
                resp_finish = True
                break
    print()  # New line

    llm.stop()
```
The code below should be executed inside the LLMBoost Docker container, which you can start by following the manual Docker setup.

```python
import asyncio
from llmboost import LLMBoost

async def stream_with_llmboost():
    # Enable async output and streaming
    llm = LLMBoost(
        model_name="meta-llama/Llama-3.1-8B-Instruct",
        streaming=True,
        enable_async_output=True,  # Enable async mode
        tp=1,
        dp=1,
        max_tokens=1024
    )
    llm.start()

    # Submit prompt
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short story about AI, in 100 words."}
    ]
    prompt = llm.apply_format(messages)
    llm.issues_inputs(prompt)

    # Async streaming
    while True:
        output = await llm.aget_output()
        print(output['val'], end="", flush=True)
        if output['finished']:
            break
    print()

    llm.stop()

if __name__ == "__main__":
    asyncio.run(stream_with_llmboost())
```
## Streaming Response Format

### Server-Sent Events (SSE)

Streaming responses use the SSE format:

```
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1699999999,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"role":"assistant","content":"Once"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1699999999,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1699999999,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"content":" a"},"finish_reason":null}]}

...

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1699999999,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: {"usage": {"prompt_tokens": 54, "completion_tokens": 144, "total_tokens": 198}}

data: [DONE]
```
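If you consume the stream over raw HTTP instead of through the OpenAI client, each SSE line can be decoded with a small helper like the sketch below (an illustrative function, not part of the LLMBoost SDK):

```python
import json

def delta_from_sse_line(line: str):
    """Extract the delta content from one 'data: ...' SSE line.

    Returns the token text, or None for non-content lines
    (the [DONE] sentinel, the usage record, empty deltas).
    """
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        return None
    chunk = json.loads(payload)
    choices = chunk.get("choices")
    if not choices:
        return None  # e.g., the trailing usage record
    return choices[0].get("delta", {}).get("content")

# Example with a line from the stream above:
line = ('data: {"id":"chatcmpl-123","object":"chat.completion.chunk",'
        '"created":1699999999,"model":"meta-llama/Llama-3.1-8B-Instruct",'
        '"choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}')
print(repr(delta_from_sse_line(line)))  # ' upon'
```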
### Chunk Structure

Each chunk contains:

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion.chunk",
  "created": 1699999999,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "token" // New token content
      },
      "finish_reason": null // null until stream ends
    }
  ]
}
```

The last chunk has `"finish_reason": "stop"` and an empty delta.
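When the application needs the complete text after all (for example, to parse a JSON answer once generation finishes), the deltas can simply be accumulated. A minimal sketch over chunk-shaped dicts like the one above (illustrative helper, not part of the SDK):

```python
def accumulate(chunks):
    """Join the delta contents of a chunk sequence into the full message."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason

chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": "Once"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": " upon"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]
print(accumulate(chunks))  # ('Once upon', 'stop')
```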
## Performance Comparison

| Mode | Time to First Token | Total Time | User Experience |
|---|---|---|---|
| Streaming | ~50-200ms | Same | Excellent |
| Non-Streaming | N/A | Same | Good |
Streaming provides significantly better perceived performance even though total generation time is similar. Users see output immediately instead of waiting for completion.
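Time to first token can be measured against any chunk iterator. A minimal sketch (the `fake_stream` generator here is a stand-in for a real streaming response such as `client.chat.completions.create(..., stream=True)`):

```python
import time

def time_to_first_token(chunks):
    """Return (seconds until first chunk, total seconds) for a chunk iterator."""
    start = time.monotonic()
    first = None
    for _ in chunks:
        if first is None:
            first = time.monotonic() - start
    return first, time.monotonic() - start

def fake_stream():
    # Stand-in stream: first token after ~0.1 s, then a few more tokens
    time.sleep(0.1)
    yield "Once"
    for tok in [" upon", " a", " time"]:
        time.sleep(0.02)
        yield tok

ttft, total = time_to_first_token(fake_stream())
print(f"TTFT: {ttft:.2f}s, total: {total:.2f}s")
```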
## Use Cases

### When to Use Streaming

- Chatbots and assistants - Interactive conversations
- Content generation - Long-form articles, stories
- Code generation - Real-time code completion
- Translation - Immediate feedback on translated text
- Summarization - Progressive summary display

### When NOT to Use Streaming

- Offline batch processing - Processing many requests offline
- Structured outputs - Need the complete JSON before parsing
- Analytics - Only the final result matters
- Maximum throughput - Non-streaming can be slightly faster in batch scenarios
## Troubleshooting

### Delayed First Token

If time to first token is slow:

- Check GPU utilization - Ensure GPUs aren't overloaded
- Reduce batch size - Lower `--max_num_seqs`
- Enable prefill optimization - Use `--enable_chunked_prefill`
### Choppy Streaming

If tokens arrive in bursts:

- Network buffering - Check network configuration
- Server load - Reduce concurrent requests
- Client buffering - Adjust client buffer sizes
## Next Steps

- Vision - Stream responses from multimodal models
- OpenAI API Compatible - Full API reference

Questions? Contact contact@mangoboost.io