Python Library · MIT License · v0.1.0

Ask once,
recall forever.

Stop paying for identical prompts. Recallm wraps your OpenAI-compatible client to return instant cached responses for semantically similar queries — no proxy, no infrastructure changes.

Get started →
app.py
from openai import OpenAI
from llm_semantic_cache import (
    SemanticCache, CacheConfig, InMemoryStorage
)

client = OpenAI()
cache = SemanticCache(
    storage=InMemoryStorage(),
    config=CacheConfig(threshold=0.92),
)
create = cache.wrap(client.chat.completions.create)

# First call — LLM is invoked, result cached
response = create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Capital of France?"}],
    cache_context={"user_id": "u123"},
)

# Semantically equivalent — cache hit, no LLM call
response = create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "What is France's capital?"}],
    cache_context={"user_id": "u123"},
)

Step 01

Intercept

You wrap your client's create method once. Recallm intercepts every request before it leaves your application — no proxy servers, no infrastructure changes, no new services to run.
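Conceptually, the intercept-then-delegate pattern is just a closure around the original method. The sketch below is a hypothetical illustration — the `lookup`, `store`, and key-building details are assumptions, not Recallm's actual internals:

```python
import json

def wrap(create_fn, lookup, store):
    """Sketch of a cache-checking wrapper (hypothetical internals,
    not Recallm's actual implementation)."""
    def cached_create(**kwargs):
        # Pop cache-only arguments so they never reach the real client.
        context = kwargs.pop("cache_context", None)
        # Assumption: key the cache on the serialized messages.
        key = json.dumps(kwargs.get("messages", []), sort_keys=True)
        hit = lookup(key, context)
        if hit is not None:
            return hit                      # hit: no network call
        response = create_fn(**kwargs)      # miss: forward to the LLM
        store(key, context, response)
        return response
    return cached_create
```

Because the wrapped function keeps the original call signature, nothing upstream of your application has to change.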

Step 02

Embed & search

The prompt is converted to a vector locally using a fast ONNX model (~20MB, sub-10ms on CPU). We search your storage backend for entries with cosine similarity above your threshold.
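To make "cosine similarity above your threshold" concrete, here is a dependency-free sketch of the lookup math. `entries` and `search` are illustrative names; the real search is delegated to your storage backend:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, entries, threshold=0.92):
    """Return the best cached response at or above the threshold, or None.

    entries: list of (embedding_vector, cached_response) pairs.
    """
    best, best_sim = None, threshold
    for vec, response in entries:
        sim = cosine(query_vec, vec)
        if sim >= best_sim:
            best, best_sim = response, sim
    return best
```

A higher threshold means fewer but safer hits; 0.92 is the default shown in the example above.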

Step 03

Recall or forward

Hit: the cached response is returned instantly. Miss: the original call proceeds, the response is stored, and future similar prompts benefit from the cache. You always get a response.


  • <10ms added latency per lookup (on CPU)
  • 0 external services required (runs entirely in-process)
  • 40–70% cost reduction for support & FAQ workloads
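The cost-reduction figure is plain arithmetic: every cache hit is an LLM call you do not pay for. A back-of-envelope sketch (`monthly_savings` is an illustrative helper, not part of the library; the example numbers are assumptions):

```python
def monthly_savings(requests, cost_per_call, hit_rate):
    """Back-of-envelope estimate: cached hits avoid the LLM call entirely."""
    return requests * cost_per_call * hit_rate

# e.g. a support bot: 100k requests/month at $0.002 each, 50% hit rate
monthly_savings(100_000, 0.002, 0.50)  # roughly $100/month saved
```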

Semantic caching is not magic. Hit rates depend entirely on how repetitive your workload is.

Use case                 Expected hit rate   Why
FAQ / support bot        40–70%              High prompt repetition, forgiving similarity
Document summarization   20–50%              Same docs re-processed, template prompts
General chat assistant   5–15%               High diversity, dynamic context
Code generation          3–10%               Exact problem statements vary, strict threshold

If Recallm doesn't help your workload, the benchmarks page will tell you why before you ship it.


  • stream=True bypasses the cache entirely — streaming responses are not cacheable in v0.1.0
  • Redis backend is not suitable for namespaces > 5,000 entries without partitioning
  • Sync callers using RedisStorage have no timeout protection in v0.1.0
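For the Redis limitation above, one common workaround is sharding a large namespace by key hash so each partition stays under the entry guideline. A sketch (`partition_namespace` is a hypothetical helper, not a Recallm API):

```python
import hashlib

def partition_namespace(base, user_id, shards=16):
    """Map a user to one of `shards` sub-namespaces so each Redis
    namespace stays small (hypothetical helper, not part of Recallm)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % shards
    return f"{base}:{shard}"
```

The mapping is deterministic, so the same user always lands in the same shard and cache hits are preserved.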

Getting started · Configuration · Storage backends · GitHub