Benchmarks
Expected hit rates
| Use case | Expected hit rate | Why |
|---|---|---|
| FAQ / support bot | 40–70% | Users repeat intent with small wording changes, so semantic similarity captures many near-duplicates. |
| Document summarization | 20–50% | Teams often reprocess the same corpus with templated prompts, creating moderate repetition. |
| General chat assistant | 5–15% | Open-ended chats vary heavily in user intent, so repeated semantic intent is limited. |
| Code generation | 3–10% | Requests are precise and the threshold is strict, so near-matches are less likely to qualify as safe hits. |
These ranges are realistic workload expectations, not synthetic “best-case” numbers. Prompt distributions in this project intentionally include variation and noise so the benchmark reflects production-like repetition, not perfect duplicates.
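The "Why" column above depends on how a similarity threshold turns near-duplicates into hits. The following is a minimal sketch of that decision using plain cosine similarity. It is not the library's implementation: the helper names, the toy 3-dimensional vectors, and the numeric thresholds (0.95 and 0.995) are assumptions chosen for illustration, and the actual value behind a named threshold such as "balanced" is not stated here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def is_hit(query_vec, cached_vec, threshold):
    """Hypothetical hit rule: similarity must reach the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold


# Toy vectors standing in for real embeddings (illustrative only).
cached = [1.0, 0.0, 0.0]
near_duplicate = [0.9, 0.1, 0.0]   # small wording change
unrelated = [0.0, 1.0, 0.0]        # different intent

for name, vec in [("near_duplicate", near_duplicate), ("unrelated", unrelated)]:
    for threshold in (0.95, 0.995):
        print(name, threshold, is_hit(vec, cached, threshold))
```

With these toy values the near-duplicate qualifies at the looser threshold but not at the stricter one, which is the effect the code generation row describes: a strict threshold rejects more near-matches and lowers the hit rate.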
How to run
How to read results
Output is a markdown table with:
- Use Case: workload profile being simulated.
- Expected Hit Rate: target range based on typical production behavior.
- Actual Hit Rate: measured hits divided by total calls for this run.
- Hits / Misses / Total: raw counts behind the percentage.
A result is “good” when the actual hit rate lands inside or near the expected range for that workload. If a run is far below the range, tune threshold and cache_context. If it is far above the range, validate answer quality to make sure matching has not been loosened too far.
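The comparison between actual and expected hit rate can be sketched in a few lines. This is an illustrative helper, not part of the benchmark code: the function names, the 5-percentage-point tolerance used to interpret "near", and the sample counts are all assumptions.

```python
def hit_rate(hits, misses):
    """Actual hit rate as a fraction: hits / (hits + misses)."""
    total = hits + misses
    return hits / total if total else 0.0


def assess(actual, expected_low, expected_high, tolerance=0.05):
    """Classify a run against an expected range.

    `tolerance` is a hypothetical allowance for "near the range".
    """
    if actual < expected_low - tolerance:
        return "below"   # consider tuning threshold and cache_context
    if actual > expected_high + tolerance:
        return "above"   # validate answer quality before trusting the hits
    return "ok"


# Made-up counts checked against the FAQ / support bot range (40-70%).
for hits, misses in [(52, 48), (20, 80), (90, 10)]:
    rate = hit_rate(hits, misses)
    print(f"{rate:.0%}", assess(rate, 0.40, 0.70))
```

Keeping the tolerance explicit makes it easy to tighten or relax what counts as "near" for a given workload.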
Benchmark limitations
CI benchmarks use a deterministic fake embedder, so their hit rates differ from those produced by real embedding models. Always run with a real embedder before drawing production conclusions about quality, latency, or expected savings.
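To see why a fake embedder limits what the numbers can tell you, consider one common way to build such a thing: hashing the text into a fixed vector. The sketch below is an assumption about the general technique, not a description of this project's CI embedder, and `fake_embed` is a hypothetical name.

```python
import hashlib
import math


def fake_embed(text, dim=8):
    """Deterministic pseudo-embedding derived from a SHA-256 digest.

    Hypothetical stand-in for a CI fake embedder. `dim` must be <= 32,
    the number of bytes in a SHA-256 digest.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Centre byte values around zero; 127.5 guarantees no component is 0.
    raw = [b - 127.5 for b in digest[:dim]]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]  # unit-length vector


a = fake_embed("How do I reset my password?")
b = fake_embed("How do I reset my password?")
c = fake_embed("How can I reset my password?")

print(a == b)  # identical text always maps to the same vector
print(a == c)  # a paraphrase maps to a different, unrelated vector
```

An embedder like this is fully reproducible, which suits CI, but the similarity between two paraphrases carries no semantic signal. Hit rates measured this way exercise the cache plumbing rather than the matching quality of a real model.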
Running with a real embedder
Swap in a real embedder when constructing SemanticCache in benchmark code:
from llm_semantic_cache import CacheConfig, SemanticCache
from llm_semantic_cache.embeddings import FastEmbedEmbedder, SentenceTransformerEmbedder
from llm_semantic_cache.storage.memory import InMemoryStorage

cache_fastembed = SemanticCache(
    storage=InMemoryStorage(),
    config=CacheConfig(threshold="balanced"),
    embedder=FastEmbedEmbedder("all-MiniLM-L6-v2"),
)

cache_torch = SemanticCache(
    storage=InMemoryStorage(),
    config=CacheConfig(threshold="balanced"),
    embedder=SentenceTransformerEmbedder("all-MiniLM-L6-v2"),
)