Ask once,
recall forever.
Stop paying for identical prompts. Recallm wraps your OpenAI-compatible client to return instant cached responses for semantically similar queries — no proxy, no infrastructure changes.
```python
from openai import OpenAI
from llm_semantic_cache import (
    SemanticCache, CacheConfig, InMemoryStorage
)

client = OpenAI()
cache = SemanticCache(
    storage=InMemoryStorage(),
    config=CacheConfig(threshold=0.92),
)
create = cache.wrap(client.chat.completions.create)

# First call — LLM is invoked, result cached
response = create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Capital of France?"}],
    cache_context={"user_id": "u123"},
)

# Semantically equivalent — cache hit, no network round-trip
response = create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "What is France's capital?"}],
    cache_context={"user_id": "u123"},
)
```
## How it works
### Intercept

You wrap your client's `create` method once. Recallm intercepts every request before it leaves your application — no proxy servers, no infrastructure changes, no new services to run.
### Embed & search
The prompt is converted to a vector locally using a fast ONNX model (~20MB, sub-10ms on CPU). We search your storage backend for entries with cosine similarity above your threshold.
### Recall or forward
Hit: the cached response is returned instantly. Miss: the original call proceeds, the response is stored, and future similar prompts benefit from the cache. You always get a response.
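Putting the three steps together, the recall-or-forward loop can be sketched as follows. The `embed`, `search`, and `store` callables here are placeholders for illustration, not Recallm's actual internals:

```python
from typing import Any, Callable, Optional

def cached_call(prompt: str,
                forward: Callable[[str], Any],
                embed: Callable[[str], list[float]],
                search: Callable[[list[float]], Optional[Any]],
                store: Callable[[list[float], Any], None]) -> Any:
    """Return a cached response for a similar prompt, else forward to the LLM."""
    vec = embed(prompt)            # local embedding (ONNX in the real library)
    hit = search(vec)              # nearest stored entry above the threshold
    if hit is not None:
        return hit                 # recall: no LLM call made
    response = forward(prompt)     # miss: the original request proceeds
    store(vec, response)           # future similar prompts will hit
    return response
```

Either branch returns a response, which is why a miss costs nothing beyond the embedding step you would not otherwise pay for.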
## Honest hit rates
Semantic caching is not magic. Hit rates depend entirely on how repetitive your workload is.
| Use case | Expected hit rate | Why |
|---|---|---|
| FAQ / support bot | 40–70% | High prompt repetition, forgiving similarity |
| Document summarization | 20–50% | Same docs re-processed, template prompts |
| General chat assistant | 5–15% | High diversity, dynamic context |
| Code generation | 3–10% | Exact problem statements vary, strict threshold |
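Before committing, you can measure your own hit rate by counting calls on both sides of the cache. This `HitCounter` wrapper is a hypothetical measurement aid, not a Recallm API:

```python
from typing import Any, Callable

class HitCounter:
    """Count how often a cache layer avoids calling the underlying function."""

    def __init__(self) -> None:
        self.requests = 0
        self.misses = 0

    def count_requests(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        def wrapper(**kwargs: Any) -> Any:
            self.requests += 1        # every call, hit or miss
            return fn(**kwargs)
        return wrapper

    def count_misses(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        def wrapper(**kwargs: Any) -> Any:
            self.misses += 1          # only runs when the cache forwards
            return fn(**kwargs)
        return wrapper

    @property
    def hit_rate(self) -> float:
        return 1 - self.misses / self.requests if self.requests else 0.0
```

Placed outside and inside the cache respectively — e.g. `counter.count_requests(cache.wrap(counter.count_misses(client.chat.completions.create)))` — the two counters diverge only on hits, so their ratio is the hit rate on your real traffic.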
If Recallm doesn't help your workload, the benchmarks page will tell you why before you ship it.
## Known limitations

- `stream=True` bypasses the cache entirely — streaming responses are not cacheable in v0.1.0
- The Redis backend is not suitable for namespaces with more than 5,000 entries without partitioning
- Sync callers using `RedisStorage` have no timeout protection in v0.1.0
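Given the streaming limitation, one way to keep a single call site is to route `stream=True` requests straight to the unwrapped client. This router is a workaround sketch, not part of the library:

```python
from typing import Any, Callable

def make_create(raw_create: Callable[..., Any],
                cached_create: Callable[..., Any]) -> Callable[..., Any]:
    """Route streaming requests around the cache; send everything else through it."""
    def create(**kwargs: Any) -> Any:
        if kwargs.get("stream"):
            return raw_create(**kwargs)    # v0.1.0 cannot cache streamed responses
        return cached_create(**kwargs)
    return create
```

Callers then use the returned `create` for both modes, and only non-streaming traffic contributes to (and benefits from) the cache.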