← Overview
03 · Inference

Your model, streaming on the same line.

Tokens straight off the GPU to the client you already use.

Drop-in compatible

Point your OpenAI-compatible client at the line and go. Nothing in your app has to change.

Faster first token

Tokens leave the GPU and start arriving immediately, so the answer begins sooner.

Just another track

Inference shares the same transport as voice, video and robotics. One line to operate, not five.

Inference · FAQ

Questions builders ask about your model, streaming on the same line..

Anything llama.cpp loads — Qwen 2.5, Llama 3 family, Mistral, Gemma, Phi, the lot — plus first-class ONNX Runtime for the embedding/classifier side. GGUF in, OpenAI-compatible SSE out.

For the same reason voice does: a single dropped packet stalls every concurrent stream on HTTP/2 (TCP head-of-line blocking). With 16 concurrent users on one connection, that means everyone's tokens pause on one loss. QUIC streams are independent — only the affected user feels it.

GGUF weights are mmap'd at process start; first token after warm pool is ~30 ms TTFT on a single-shard CPU box. Cold model load (changing model mid-flight) is whatever your disk can do — typically 200–800 ms for a 7B Q4.

Yes. Drop a GGUF into the tenant's weights bucket and point the agent at it. We don't re-quantize; what you upload is what runs.

Both. The default surface is OpenAI-compatible SSE — drop our endpoint into any client that already speaks `openai.chat.completions.stream(...)` and it works. Tool-calling, JSON mode, and streaming function call deltas all pass through. See OpenAI's streaming docs for the wire format we mirror.

Put it on one line.

Telequick ships every modality on the same transport.