Backed by Y Combinator

Optimize LLM context
by removing input bloat

Bear-1.2 compression removes low-signal tokens from your prompts before they hit your LLM.

Backed by people behind

Hugging Face
Silo
Wolt
Y Combinator
Supercell
SVA

Save tokens and improve accuracy on your agent's background knowledge

Bear-1.2 compresses your agent's background knowledge before it enters the context window.


Intelligent semantic processing

The bear-1 and bear-1.2 models process tokens based on context and semantic intent. Compression is deterministic and low-latency.

In its most fundamental sense, compression is the process of encoding information using fewer bits or resources than the original representation by identifying and eliminating statistical redundancies or irrelevant data within a dataset. Whether applied to digital media, text, or the high-dimensional vector spaces of Large Language Models, compression relies on the principle that most raw information contains noise or repeating patterns that do not contribute new meaning. By applying an algorithm, or in this case an ML-based model, to map the input data into a more compact form, you essentially distill the signal from the noise. In the context of ML inputs, this means transforming long-form text into a dense, mathematically efficient representation that preserves the original semantic intent and logical relationships while significantly reducing the physical token count, thereby allowing a system to process more information within the same fixed computational window or budget.

One API call

Send text in, get compressed text back. Drop it in before your LLM call. That's the entire integration.

POST api.thetokencompany.com/v1/compress
{
  "model": "bear-1.1",
  "input": "Your long text to compress..."
}

response
{
  "output": "Compressed text...",
  "original_input_tokens": 1284,
  "output_tokens": 436
}
Read the docs
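A minimal Python sketch of that round trip. The endpoint and JSON fields come from the example above; the Bearer auth header is an assumption, so check the docs for the real scheme:

```python
import json
import urllib.request

API_URL = "https://api.thetokencompany.com/v1/compress"

def compress(text: str, api_key: str, model: str = "bear-1.2") -> dict:
    """POST text to the compression endpoint and return the parsed JSON response."""
    payload = json.dumps({"model": model, "input": text}).encode()
    req = urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def savings(original_input_tokens: int, output_tokens: int) -> float:
    """Fraction of input tokens removed, from the response's token counts."""
    return 1 - output_tokens / original_input_tokens
```

With the sample response above (1284 tokens in, 436 out), `savings(1284, 436)` is about 0.66, i.e. roughly two-thirds removed.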

Use cases

LLM Entertainment & Gaming

Longer memories, richer worlds, same budget.

Meeting Transcription

Distill hours of calls into signal-dense context.

Web Scraping

Strip boilerplate from crawled pages before ingest.

Document Analysis

Fit more PDFs and reports into one context window.

Frequently asked questions

Compression, costs, accuracy, and how this fits into an existing LLM stack.

How can I reduce my OpenAI API bill?

Most production apps spend 60-80% of their LLM bill on input tokens, not output: system prompts, conversation history, retrieved context. Shrinking input is where there's the most to save. bear-1.2 removes tokens the downstream model wasn't using anyway. At standard settings that's around two-thirds off, with accuracy at the uncompressed baseline.
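As rough back-of-envelope arithmetic, assuming input is 70% of the bill (the midpoint of the range above) and compression removes two-thirds of input tokens:

```python
def bill_reduction(input_share: float = 0.70, removed_fraction: float = 2 / 3) -> float:
    """Fraction of the total LLM bill saved when `removed_fraction` of input
    tokens are removed and input is `input_share` of total spend."""
    return input_share * removed_fraction

# 0.70 * 2/3 -> roughly 47% of the total bill
print(f"{bill_reduction():.0%}")
```

The two parameters are assumptions you should replace with your own billing breakdown and measured compression ratio.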

What is prompt compression?

A learned transformation that makes your prompt shorter while keeping the parts the model actually uses to answer. Summarization paraphrases your prompt in new language, which throws away verbatim details. Truncation just drops the tail. Compression is different. A model trained for the job picks out redundant tokens (boilerplate, filler, restated context) and removes them, so the result is shorter and cheaper but still produces the same answer from GPT, Claude, or anything else.

Does compressing prompts hurt accuracy?

It depends on how aggressively you compress. Run aggressive compression (around two-thirds off) and accuracy stays at the uncompressed baseline. Compress lightly and accuracy actually goes up by several points on standard evals. The reason it works: most of the tokens being stripped are ones the model was already ignoring, so what's left has a better signal-to-noise ratio. To check either mode on your own workload, run your existing eval suite on compressed inputs and compare.

See the full benchmarks
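A sketch of that eval comparison, assuming you have a list of prompt/expected pairs plus callables for your LLM and for compression (all names here are placeholders, not part of any API):

```python
from typing import Callable

def compare_accuracy(
    cases: list[dict],
    run_llm: Callable[[str], str],
    compress: Callable[[str], str],
) -> tuple[float, float]:
    """Score the same eval set twice: once on raw prompts, once on compressed
    prompts. Returns (raw_accuracy, compressed_accuracy)."""
    raw = sum(run_llm(c["prompt"]) == c["expected"] for c in cases)
    compressed = sum(run_llm(compress(c["prompt"])) == c["expected"] for c in cases)
    return raw / len(cases), compressed / len(cases)
```

Swap in your real grader (exact match, LLM-as-judge, etc.); the point is only that both runs share the same cases and scoring.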

Will prompt compression work with GPT, Claude, and Gemini?

Yes. The API transforms the text of your prompt, so anything you can pass as a string to a chat completion endpoint works. We've tested it against OpenAI (GPT-4o, GPT-5 family), Anthropic (Claude Sonnet 4.5 / 4.6, Opus), Google (Gemini 2.5 Pro), and open models like Llama and Qwen. You're not locked into any provider.

How fast is the compression API?

Around 6ms of compression overhead for a 10K-token prompt, and well under 120ms even at 200K. The latency cost usually pays for itself: a shorter prompt means faster time-to-first-token from the downstream LLM, so end-to-end round-trip often goes down with compression in the loop, not up.

See the latency numbers

Is the OpenAI API too expensive at scale?

It can be. A B2B product with 10K daily users sending around 8 messages a day at typical prompt sizes lands near $35-40K/month, and that grows with traffic. Most of the spend is input tokens, which is what compression cuts. Trimmed conversation history and shorter outputs help on top, but input is the biggest line item.
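A hypothetical version of that arithmetic. The per-message token count and per-million rate below are illustrative assumptions, not quoted prices:

```python
def monthly_input_cost(users: int, msgs_per_day: int, tokens_per_msg: int,
                       usd_per_million: float, days: int = 30) -> float:
    """Monthly spend on input tokens alone."""
    tokens = users * msgs_per_day * tokens_per_msg * days
    return tokens / 1_000_000 * usd_per_million

# e.g. 10K users, 8 messages/day, ~5K input tokens/message, $3 per million input tokens
monthly_input_cost(10_000, 8, 5_000, 3.0)  # 36_000.0 -> in the $35-40K/month range
```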

How does pricing work?

We charge per token saved, not per token sent. If a 10M-token prompt comes out at 6M after compression, you pay for the 4M we removed, not the 10M you sent in. There's a free tier to start, a Pro plan at $0.30 per million tokens saved, and Enterprise pricing for higher volumes. Full numbers and a worked example are on the pricing page.
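Using only the numbers in this paragraph (Pro at $0.30 per million tokens saved), the worked example looks like:

```python
def compression_cost(input_tokens: int, output_tokens: int,
                     usd_per_million_saved: float = 0.30) -> float:
    """Price is charged on tokens removed, not tokens sent."""
    saved = input_tokens - output_tokens
    return saved / 1_000_000 * usd_per_million_saved

compression_cost(10_000_000, 6_000_000)  # 4M tokens saved -> $1.20
```

See the pricing page for the authoritative numbers; the free tier and Enterprise rates are not modeled here.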

How is this different from summarization or truncation?

Summarization rewrites your prompt in new language, which loses verbatim details the model often needs: proper nouns, numbers, code, exact phrasings. Truncation just drops the tail and loses whatever was there. Compression keeps the tokens the model cares about and removes the ones it doesn't, character-for-character from your original input. The exact details survive.
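One way to sanity-check that property on your own data: since compression removes tokens rather than rewriting them, the output should read as an in-order subsequence of the input. A loose word-level checker (a hypothetical helper, not part of the API; real tokenization is finer-grained than whitespace splitting):

```python
def is_ordered_subsequence(compressed: str, original: str) -> bool:
    """True if every whitespace-separated word of `compressed` appears
    in `original`, in the same order."""
    it = iter(original.split())
    return all(word in it for word in compressed.split())
```

Summarized text would typically fail this check (new wording, reordered facts); truncated text passes it trivially but loses the tail.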

Where does compression help the most?

Long chatbot conversations are the canonical fit. Every new turn re-ships the prior turns, so token count grows linearly with conversation length. Agent loops have the same shape on a smaller timescale, with each iteration re-reading the reasoning trace. The other big bucket is document work: search, retrieval, PDF ingestion, scraped HTML. All of that is mostly boilerplate that compresses cleanly.

Ready to compress?

Access the compression API.