Backed by Y Combinator

Optimize LLM context
by removing input bloat

Bear-1.2 compression removes low-signal tokens from your prompts before they hit your LLM.

Backed by people behind

Hugging Face
Silo
Wolt
Y Combinator
Supercell
SVA

Save tokens and improve accuracy on your agent's background knowledge

Bear-1.2 compresses your agent's background knowledge before it enters the context window.


Intelligent semantic processing

The bear-1 and bear-1.2 models process tokens based on context and semantic intent. Compression is deterministic and low-latency.

In its most fundamental sense, compression is the process of encoding information using fewer bits or resources than the original representation by identifying and eliminating statistical redundancies or irrelevant data within a dataset. Whether applied to digital media, text, or the high-dimensional vector spaces of Large Language Models, compression relies on the principle that most raw information contains noise or repeating patterns that do not contribute new meaning. By applying an algorithm, or in this case an ML-based model, to map the input data into a more compact form, you essentially distill the signal from the noise. In the context of ML inputs, this means transforming long-form text into a dense, mathematically efficient representation that preserves the original semantic intent and logical relationships while significantly reducing the physical token count, thereby allowing a system to process more information within the same fixed computational window or budget.

One API call

Send text in, get compressed text back. Drop it in before your LLM call. That's the entire integration.

POST api.thetokencompany.com/v1/compress
{
  "model": "bear-1.1",
  "input": "Your long text to compress..."
}

response
{
  "output": "Compressed text...",
  "original_input_tokens": 1284,
  "output_tokens": 436
}
Read the docs
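A minimal Python sketch of that round trip. The endpoint and JSON fields come from the example above; the Bearer auth header is an assumption, so check the docs for the real scheme:

```python
import json
import urllib.request

API_URL = "https://api.thetokencompany.com/v1/compress"

def compress(text: str, api_key: str, model: str = "bear-1.2") -> dict:
    """POST text to the compression endpoint and return the parsed JSON response."""
    payload = json.dumps({"model": model, "input": text}).encode()
    req = urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def savings(original_input_tokens: int, output_tokens: int) -> float:
    """Fraction of input tokens removed, from the response's token counts."""
    return 1 - output_tokens / original_input_tokens
```

With the sample response above (1284 tokens in, 436 out), `savings(1284, 436)` is about 0.66, i.e. roughly two-thirds removed.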

Use cases

LLM Entertainment & Gaming

Longer memories, richer worlds, same budget.

Meeting Transcription

Distill hours of calls into signal-dense context.

Web Scraping

Strip boilerplate from crawled pages before ingest.

Document Analysis

Fit more PDFs and reports into one context window.

Frequently asked questions

Compression, costs, accuracy, and how this fits into an existing LLM stack.

How can I reduce my OpenAI API bill?

Most production apps spend 60-80% of their LLM bill on input tokens, not output: system prompts, conversation history, retrieved context. Shrinking input is where there's the most to save. bear-1.2 removes tokens the downstream model wasn't using anyway. At standard settings that's around two-thirds off, with accuracy at the uncompressed baseline.
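As rough back-of-envelope arithmetic, assuming input is 70% of the bill (the midpoint of the range above) and compression removes two-thirds of input tokens:

```python
def bill_reduction(input_share: float = 0.70, removed_fraction: float = 2 / 3) -> float:
    """Fraction of the total LLM bill saved when `removed_fraction` of input
    tokens are removed and input is `input_share` of total spend."""
    return input_share * removed_fraction

# 0.70 * 2/3 -> roughly 47% of the total bill
print(f"{bill_reduction():.0%}")
```

The two parameters are assumptions you should replace with your own billing breakdown and measured compression ratio.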

What is prompt compression?

A learned transformation that makes your prompt shorter while keeping the parts the model actually uses to answer. Summarization paraphrases your prompt in new language, which throws away verbatim details. Truncation just drops the tail. Compression is different. A model trained for the job picks out redundant tokens (boilerplate, filler, restated context) and removes them, so the result is shorter and cheaper but still produces the same answer from GPT, Claude, or anything else.

Does compressing prompts hurt accuracy?

It depends on how aggressively you compress. Run aggressive compression (around two-thirds off) and accuracy stays at the uncompressed baseline. Compress lightly and accuracy actually goes up by several points on standard evals. The reason it works: most of the tokens being stripped are ones the model was already ignoring, so what's left has a better signal-to-noise ratio. To check either mode on your own workload, run your existing eval suite on compressed inputs and compare.

See the full benchmarks
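A sketch of that eval comparison, assuming you have a list of prompt/expected pairs plus callables for your LLM and for compression (all names here are placeholders, not part of any API):

```python
from typing import Callable

def compare_accuracy(
    cases: list[dict],
    run_llm: Callable[[str], str],
    compress: Callable[[str], str],
) -> tuple[float, float]:
    """Score the same eval set twice: once on raw prompts, once on compressed
    prompts. Returns (raw_accuracy, compressed_accuracy)."""
    raw = sum(run_llm(c["prompt"]) == c["expected"] for c in cases)
    compressed = sum(run_llm(compress(c["prompt"])) == c["expected"] for c in cases)
    return raw / len(cases), compressed / len(cases)
```

Swap in your real grader (exact match, LLM-as-judge, etc.); the point is only that both runs share the same cases and scoring.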

Will prompt compression work with GPT, Claude, and Gemini?

Yes. The API transforms the text of your prompt, so anything you can pass as a string to a chat completion endpoint works. We've tested it against OpenAI (GPT-4o, GPT-5 family), Anthropic (Claude Sonnet 4.5 / 4.6, Opus), Google (Gemini 2.5 Pro), and open models like Llama and Qwen. You're not locked into any provider.

How fast is the compression API?

Around 6ms of compression overhead for a 10K-token prompt, and well under 120ms even at 200K. The latency cost usually pays for itself: a shorter prompt means faster time-to-first-token from the downstream LLM, so end-to-end round-trip often goes down with compression in the loop, not up.

See the latency numbers

Is the OpenAI API too expensive at scale?

It can be. A B2B product with 10K daily users sending around 8 messages a day at typical prompt sizes lands near $35-40K/month, and that grows with traffic. Most of the spend is input tokens, which is what compression cuts. Trimmed conversation history and shorter outputs help on top, but input is the biggest line item.
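A hypothetical version of that arithmetic. The per-message token count and per-million rate below are illustrative assumptions, not quoted prices:

```python
def monthly_input_cost(users: int, msgs_per_day: int, tokens_per_msg: int,
                       usd_per_million: float, days: int = 30) -> float:
    """Monthly spend on input tokens alone."""
    tokens = users * msgs_per_day * tokens_per_msg * days
    return tokens / 1_000_000 * usd_per_million

# e.g. 10K users, 8 messages/day, ~5K input tokens/message, $3 per million input tokens
monthly_input_cost(10_000, 8, 5_000, 3.0)  # 36_000.0 -> in the $35-40K/month range
```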

How does pricing work?

We charge per token saved, not per token sent. If a 10M-token prompt comes out at 6M after compression, you pay for the 4M we removed, not the 10M you sent in. There's a free tier to start, a Pro plan at $0.30 per million tokens saved, and Enterprise pricing for higher volumes. Full numbers and a worked example are on the pricing page.
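Using only the numbers in this paragraph (Pro at $0.30 per million tokens saved), the worked example looks like:

```python
def compression_cost(input_tokens: int, output_tokens: int,
                     usd_per_million_saved: float = 0.30) -> float:
    """Price is charged on tokens removed, not tokens sent."""
    saved = input_tokens - output_tokens
    return saved / 1_000_000 * usd_per_million_saved

compression_cost(10_000_000, 6_000_000)  # 4M tokens saved -> $1.20
```

See the pricing page for the authoritative numbers; the free tier and Enterprise rates are not modeled here.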

How is this different from summarization or truncation?

Summarization rewrites your prompt in new language, which loses verbatim details the model often needs: proper nouns, numbers, code, exact phrasings. Truncation just drops the tail and loses whatever was there. Compression keeps the tokens the model cares about and removes the ones it doesn't, character-for-character from your original input. The exact details survive.
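One way to sanity-check that property on your own data: since compression removes tokens rather than rewriting them, the output should read as an in-order subsequence of the input. A loose word-level checker (a hypothetical helper, not part of the API; real tokenization is finer-grained than whitespace splitting):

```python
def is_ordered_subsequence(compressed: str, original: str) -> bool:
    """True if every whitespace-separated word of `compressed` appears
    in `original`, in the same order."""
    it = iter(original.split())
    return all(word in it for word in compressed.split())
```

Summarized text would typically fail this check (new wording, reordered facts); truncated text passes it trivially but loses the tail.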

Where does compression help the most?

Long chatbot conversations are the canonical fit. Every new turn re-ships the prior turns, so token count grows linearly with conversation length. Agent loops have the same shape on a smaller timescale, with each iteration re-reading the reasoning trace. The other big bucket is document work: search, retrieval, PDF ingestion, scraped HTML. All of that is mostly boilerplate that compresses cleanly.

Ready to compress?

Access the compression API.