Python Library · MIT License · v0.1.0

Ask once,
recall forever.

Stop paying for identical prompts. Recallm wraps your OpenAI-compatible client to return instant cached responses for semantically similar queries — no proxy, no infrastructure changes.

Get started →
app.py
from openai import OpenAI
from llm_semantic_cache import (
    SemanticCache, CacheConfig, InMemoryStorage
)

client = OpenAI()
cache = SemanticCache(
    storage=InMemoryStorage(),
    config=CacheConfig(threshold=0.92),
)
create = cache.wrap(client.chat.completions.create)

# First call — LLM is invoked, result cached
response = create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Capital of France?"}],
    cache_context={"user_id": "u123"},
)

# Semantically equivalent — cache hit, no LLM call
response = create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "What is France's capital?"}],
    cache_context={"user_id": "u123"},
)

Step 01

Intercept

You wrap your client's create method once. Recallm intercepts every request before it leaves your application — no proxy servers, no infrastructure changes, no new services to run.
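Conceptually, the intercept-then-delegate pattern is just a closure around the original method. The sketch below is a hypothetical illustration — the `lookup`, `store`, and key-building details are assumptions, not Recallm's actual internals:

```python
import json

def wrap(create_fn, lookup, store):
    """Sketch of a cache-checking wrapper (hypothetical internals,
    not Recallm's actual implementation)."""
    def cached_create(**kwargs):
        # Pop cache-only arguments so they never reach the real client.
        context = kwargs.pop("cache_context", None)
        # Assumption: key the cache on the serialized messages.
        key = json.dumps(kwargs.get("messages", []), sort_keys=True)
        hit = lookup(key, context)
        if hit is not None:
            return hit                      # hit: no network call
        response = create_fn(**kwargs)      # miss: forward to the LLM
        store(key, context, response)
        return response
    return cached_create
```

Because the wrapped function keeps the original call signature, nothing upstream of your application has to change.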

Step 02

Embed & search

The prompt is converted to a vector locally using a fast ONNX model (~20MB, sub-10ms on CPU). We search your storage backend for entries with cosine similarity above your threshold.
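To make "cosine similarity above your threshold" concrete, here is a dependency-free sketch of the lookup math. `entries` and `search` are illustrative names; the real search is delegated to your storage backend:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, entries, threshold=0.92):
    """Return the best cached response at or above the threshold, or None.

    entries: list of (embedding_vector, cached_response) pairs.
    """
    best, best_sim = None, threshold
    for vec, response in entries:
        sim = cosine(query_vec, vec)
        if sim >= best_sim:
            best, best_sim = response, sim
    return best
```

A higher threshold means fewer but safer hits; 0.92 is the default shown in the example above.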

Step 03

Recall or forward

Hit: the cached response is returned instantly. Miss: the original call proceeds, the response is stored, and future similar prompts benefit from the cache. You always get a response.


  • <10ms added latency per lookup (on CPU)
  • 0 external services required (runs entirely in-process)
  • 40–70% cost reduction for support & FAQ workloads
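The cost-reduction figure is plain arithmetic: every cache hit is an LLM call you do not pay for. A back-of-envelope sketch (`monthly_savings` is an illustrative helper, not part of the library; the example numbers are assumptions):

```python
def monthly_savings(requests, cost_per_call, hit_rate):
    """Back-of-envelope estimate: cached hits avoid the LLM call entirely."""
    return requests * cost_per_call * hit_rate

# e.g. a support bot: 100k requests/month at $0.002 each, 50% hit rate
monthly_savings(100_000, 0.002, 0.50)  # roughly $100/month saved
```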

Semantic caching is not magic. Hit rates depend entirely on how repetitive your workload is.

Use case                 Expected hit rate   Why
FAQ / support bot        40–70%              High prompt repetition, forgiving similarity
Document summarization   20–50%              Same docs re-processed, template prompts
General chat assistant   5–15%               High diversity, dynamic context
Code generation          3–10%               Exact problem statements vary, strict threshold

If Recallm doesn't help your workload, the benchmarks page will tell you why before you ship it.


  • stream=True bypasses the cache entirely — streaming responses are not cacheable in v0.1.0
  • Redis backend is not suitable for namespaces > 5,000 entries without partitioning
  • Sync callers using RedisStorage have no timeout protection in v0.1.0
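For the Redis limitation above, one common workaround is sharding a large namespace by key hash so each partition stays under the entry guideline. A sketch (`partition_namespace` is a hypothetical helper, not a Recallm API):

```python
import hashlib

def partition_namespace(base, user_id, shards=16):
    """Map a user to one of `shards` sub-namespaces so each Redis
    namespace stays small (hypothetical helper, not part of Recallm)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % shards
    return f"{base}:{shard}"
```

The mapping is deterministic, so the same user always lands in the same shard and cache hits are preserved.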

Getting started · Configuration · Storage backends · GitHub