3 comments

  • vienneraphael 9 hours ago

    For non-urgent workflows I mostly use Batch APIs. As you said, the bare batch APIs are a pain to use.

    On top of that, I would add that most async-to-batch libraries force users to learn a new framework or refactor their existing code, which is a huge friction in itself.

    I've been in those trenches as a developer, so I built a literal two-line Python lib that gets you from async to batch: https://github.com/vienneraphael/batchling

    You don't need to change your code, and it supports most providers (Anthropic, Gemini, Groq, Mistral, OpenAI, Together, Vertex, xAI, Doubleword) and all imaginable Python frameworks (LangChain, PydanticAI, Instructor, DSPy, LiteLLM, Pydantic Evals, ...)
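    For context, the "bare batch API" workflow being contrasted above looks roughly like this. This is a sketch assuming OpenAI's Batch API; the model name and prompts are placeholders, and the upload/poll steps are left as comments since they need credentials:

    ```python
    import json

    # Each batch request is one JSONL line with a unique custom_id,
    # an HTTP method, the target endpoint, and the full request body.
    prompts = ["Classify: great product", "Classify: terrible service"]
    lines = [
        json.dumps({
            "custom_id": f"task-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": p}],
            },
        })
        for i, p in enumerate(prompts)
    ]
    jsonl = "\n".join(lines)

    # Then (not run here): upload the JSONL file, create the batch,
    # poll until it completes, download the output file, and re-join
    # each result to its originating request via custom_id.
    # from openai import OpenAI
    # client = OpenAI()
    # f = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
    # batch = client.batches.create(input_file_id=f.id,
    #                               endpoint="/v1/chat/completions",
    #                               completion_window="24h")
    ```

    The request-construction, file juggling, polling, and result re-association are the friction that wrapper libraries like the one above aim to hide.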

  • alexliu79 10 hours ago

    We’ve been exploring one specific cost issue in AI products: many async-friendly LLM workloads still run synchronously, which creates unnecessary spend, since provider batch endpoints are typically discounted relative to the real-time APIs.

    I’m curious how people here are handling this in practice for evals, extraction pipelines, classification jobs, or other multi-step workflows.

    Are you using batch APIs already? Building internal tooling? Or just accepting the extra cost because batch workflows are too painful to adopt?

    We’ve been building an open-source library called ParaLLeM to make it easier to move agent workflows from sync to batch without rewriting everything, and I’d love to understand how others are approaching this problem.

    Repo: https://github.com/parallem-ai/parallem

  • c_chenfeng 10 hours ago

    [dead]