LLM - OpenAI
QStash has built-in support for calling LLM APIs. This allows you to take advantage of QStash features such as retries, callbacks, and batching while using LLM APIs.
QStash is especially useful for LLM processing because LLM response times are often highly variable. When accessing LLM APIs from serverless runtimes, invocation timeouts are a common issue. QStash offers an HTTP timeout of 2 hours, which is sufficient for most LLM use cases. By using callbacks and workflows, you can easily manage the asynchronous nature of LLM APIs.
QStash LLM API
You can publish (or enqueue) a single LLM request or a batch of LLM requests using all existing QStash features natively. To do this, specify the destination api as llm with a valid provider. The body of the published or enqueued message should contain a valid chat completion request. For these integrations, you must specify the Upstash-Callback header so that you can process the response asynchronously. Note that streaming chat completions cannot be used with these integrations; use the chat API for streaming completions.
All the examples below can be used with OpenAI-compatible LLM providers.
Publishing a Chat Completion Request
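A minimal sketch using the TypeScript SDK (@upstash/qstash): the destination api is set to llm with the openai provider helper, the chat completion request goes in the body, and the callback receives the response asynchronously. The callback URL, model, and environment variable names here are placeholders.

```typescript
import { Client, openai } from "@upstash/qstash";

const client = new Client({ token: process.env.QSTASH_TOKEN! });

// Publish a single chat completion request; the response is delivered
// asynchronously to the callback URL once the LLM finishes.
const result = await client.publishJSON({
  api: { name: "llm", provider: openai({ token: process.env.OPENAI_API_KEY! }) },
  body: {
    model: "gpt-4o",
    messages: [
      { role: "user", content: "Write a haiku about serverless queues." },
    ],
  },
  callback: "https://example.com/api/llm-callback", // placeholder endpoint that receives the completion
});

console.log("Message ID:", result.messageId);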
Enqueueing a Chat Completion Request
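A sketch of the same request enqueued on a named QStash queue so completions are processed in order, assuming the SDK's queue(...).enqueueJSON method. The queue name, callback URL, and keys are placeholders.

```typescript
import { Client, openai } from "@upstash/qstash";

const client = new Client({ token: process.env.QSTASH_TOKEN! });

// Enqueue the chat completion request on a named queue instead of publishing it directly.
const result = await client
  .queue({ queueName: "llm-queue" }) // hypothetical queue name
  .enqueueJSON({
    api: { name: "llm", provider: openai({ token: process.env.OPENAI_API_KEY! }) },
    body: {
      model: "gpt-4o",
      messages: [
        { role: "user", content: "Summarize the benefits of message queues." },
      ],
    },
    callback: "https://example.com/api/llm-callback",
  });

console.log("Message ID:", result.messageId);
```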
Sending Chat Completion Requests in Batches
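A sketch of sending several chat completion requests in one call, assuming the SDK's batchJSON method; each entry carries its own api, body, and callback. All URLs, models, and keys are placeholders.

```typescript
import { Client, openai } from "@upstash/qstash";

const client = new Client({ token: process.env.QSTASH_TOKEN! });

// Send multiple chat completion requests in a single batch call.
const results = await client.batchJSON([
  {
    api: { name: "llm", provider: openai({ token: process.env.OPENAI_API_KEY! }) },
    body: {
      model: "gpt-4o",
      messages: [{ role: "user", content: "Describe QStash in one sentence." }],
    },
    callback: "https://example.com/api/llm-callback",
  },
  {
    api: { name: "llm", provider: openai({ token: process.env.OPENAI_API_KEY! }) },
    body: {
      model: "gpt-4o",
      messages: [{ role: "user", content: "List three use cases for callbacks." }],
    },
    callback: "https://example.com/api/llm-callback",
  },
]);

console.log(results);
```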
Retrying After Rate Limit Resets
When rate limits are exceeded, QStash automatically schedules the retry of publishing or enqueueing chat completion tasks based on the reset time of the rate limits. This avoids retrying prematurely, when the request would certainly fail again because the limit has not yet reset.
Analytics via Helicone
Helicone is a powerful observability platform that provides valuable insights into your LLM usage. Integrating Helicone with QStash is straightforward.
To enable Helicone observability in QStash, you simply need to pass your Helicone API key when initializing your model. Here’s how to do it for both custom models and OpenAI:
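A sketch of both cases, assuming the SDK accepts an analytics option on the api field and exports a custom provider helper for OpenAI-compatible endpoints; the Helicone key, base URL, models, and callback below are placeholders.

```typescript
import { Client, custom, openai } from "@upstash/qstash";

const client = new Client({ token: process.env.QSTASH_TOKEN! });

// OpenAI with Helicone analytics enabled.
await client.publishJSON({
  api: {
    name: "llm",
    provider: openai({ token: process.env.OPENAI_API_KEY! }),
    analytics: { name: "helicone", token: process.env.HELICONE_API_KEY! },
  },
  body: {
    model: "gpt-4o",
    messages: [{ role: "user", content: "Hello!" }],
  },
  callback: "https://example.com/api/llm-callback",
});

// A custom OpenAI-compatible provider with Helicone analytics.
await client.publishJSON({
  api: {
    name: "llm",
    provider: custom({
      token: process.env.PROVIDER_API_KEY!, // placeholder provider key
      baseUrl: "https://api.together.xyz", // placeholder OpenAI-compatible base URL
    }),
    analytics: { name: "helicone", token: process.env.HELICONE_API_KEY! },
  },
  body: {
    model: "meta-llama/Llama-3-8b-chat-hf",
    messages: [{ role: "user", content: "Hello!" }],
  },
  callback: "https://example.com/api/llm-callback",
});
```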