Kuga

llmframe

Open-sourced a production-grade RAG + multi-LLM framework — hybrid retrieval, cross-encoder reranking, swappable providers.

Jul 5, 2026

Problem

Most RAG setups lean on vector search alone, and pure semantic similarity quietly fails on exact names, product codes, and version numbers: the cases where a keyword match matters more than a semantic one. On top of that, teams that build on a single LLM provider end up coupled to its API shape and pricing, with no fallback if it degrades or goes down. I wanted a framework that treated hybrid retrieval and provider-agnostic orchestration as defaults, not add-ons, and that was a tested, deployable piece of software rather than a notebook demo.

Constraints

Solo, open-source, built and maintained in public. No funding or team to lean on for review, so correctness had to come from tests and structure rather than a second pair of eyes. Had to run on a genuinely free local path (local embeddings, zero API cost) as well as a hosted path (OpenAI embeddings, Claude/GPT-4o generation), since an open-source tool that only works with paid keys limits who can actually use it.

What I built

A three-stage retrieval pipeline: vector search (ChromaDB) and BM25 keyword search run independently, get fused via Reciprocal Rank Fusion, then the fused candidates go through cross-encoder reranking before reaching the LLM. Streaming FastAPI service on top, with swappable LLM providers (Claude, GPT-4o) and swappable embeddings (OpenAI or a zero-cost local model) selected through a single config object rather than provider-specific code paths. 25 tests covering the pipeline, the chunker, and the API. Docker and docker-compose for one-command deployment. Packaged and published to PyPI with an automated GitHub Actions release workflow.
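
One way the single-config selection could look, as a minimal sketch rather than llmframe's actual interface. `LLMProvider`, `EchoProvider`, `ShoutProvider`, `PROVIDERS`, `ProviderConfig`, and `build_provider` are all hypothetical names, and the stand-in providers return canned text instead of calling Claude or GPT-4o.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Protocol


class LLMProvider(Protocol):
    """Hypothetical interface: every provider streams text chunks."""

    def stream(self, prompt: str) -> Iterator[str]: ...


class EchoProvider:
    """Stand-in for a hosted provider; yields the prompt word by word."""

    def stream(self, prompt: str) -> Iterator[str]:
        yield from prompt.split()


class ShoutProvider:
    """Second stand-in with the same shape but different output."""

    def stream(self, prompt: str) -> Iterator[str]:
        yield from prompt.upper().split()


# Registry keyed by the config value, so call sites never branch on provider.
PROVIDERS: dict[str, Callable[[], LLMProvider]] = {
    "echo": EchoProvider,
    "shout": ShoutProvider,
}


@dataclass
class ProviderConfig:
    llm_provider: str = "echo"  # hypothetical field name


def build_provider(config: ProviderConfig) -> LLMProvider:
    """Look up the provider named in the config, failing loudly on typos."""
    try:
        return PROVIDERS[config.llm_provider]()
    except KeyError:
        raise ValueError(f"unknown provider: {config.llm_provider!r}") from None
```

Swapping providers is then a one-field change, for example `build_provider(ProviderConfig(llm_provider="shout"))`, and adding a provider means adding one registry entry rather than a new code path.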

Key decision

Fuse before reranking rather than choosing one retrieval method over the other. Reciprocal Rank Fusion (score = sum of 1/(k + rank) over the lists a chunk appears in, with rank starting at 1) merges vector and BM25 results without a manually tuned blend weight: a chunk that ranks well in either list surfaces, and one that ranks well in both rises further. Reranking then runs on the fused set, not on each list separately, so the expensive cross-encoder pass scores each candidate once, after both retrieval signals have already been combined. The alternative, reranking each list separately and then merging, would spend the reranker's cost twice on every chunk that appears in both lists, which fusion deduplicates for free.
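
The cost argument can be made concrete with a toy count. This is a sketch under stated assumptions: chunk ids are plain strings, `rrf_fuse` is a simplified fusion helper, and `fake_rerank` is a hypothetical stand-in for the cross-encoder that only records how many candidates it was asked to score.

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: score = sum of 1 / (k + rank), rank starting at 1."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, cid in enumerate(ranked, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


def fake_rerank(candidates: list[str], calls: list[int]) -> list[str]:
    """Hypothetical cross-encoder stand-in: records how many candidates it scored."""
    calls.append(len(candidates))
    return candidates  # a real reranker would reorder by relevance


vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]

# Fuse first: chunks found by both retrievers collapse before the expensive pass.
fused_calls: list[int] = []
fake_rerank(rrf_fuse([vector_hits, bm25_hits]), fused_calls)

# Rerank each list separately: shared chunks are scored twice.
separate_calls: list[int] = []
fake_rerank(vector_hits, separate_calls)
fake_rerank(bm25_hits, separate_calls)

print(sum(fused_calls), sum(separate_calls))  # 4 6
```

With two chunks shared between the lists, fusing first means four cross-encoder scorings instead of six; the gap grows with the overlap between the two retrievers.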

Outcome

Live on PyPI as an installable package, MIT-licensed, with CI running the full test suite (pipeline, chunker, API) on every push. Hybrid search and reranking are both feature-flagged (off by default) so the zero-cost local path stays the fastest way to try it, while `HYBRID_SEARCH_ENABLED` and `RERANKER_ENABLED` env vars unlock the full pipeline for production use.

Hindsight

Building the provider abstraction before I had a second provider fully wired in cost me some rework once GPT-4o's tool-call shape diverged from Claude's in ways the first interface didn't anticipate. If I did it again I'd implement two providers concurrently from the start instead of retrofitting the abstraction after the first one worked. I'd also add an evals harness earlier — right now correctness is unit-test-driven, and retrieval quality (not just "does it run") is the harder thing to catch regressions in.
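
A retrieval-quality check of the kind described could start as small as this. It is a sketch, not part of llmframe: `recall_at_k`, the labeled query set, and `fake_retrieve` are hypothetical, and a real harness would call the actual pipeline against a curated set of queries.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Hypothetical labeled set: query -> chunk ids a human marked as relevant.
LABELED = {
    "what changed in v2.3.1": {"changelog-7"},
    "pricing for SKU-4411": {"catalog-12", "catalog-13"},
}


def fake_retrieve(query: str) -> list[str]:
    """Stand-in for the real pipeline; returns canned rankings."""
    canned = {
        "what changed in v2.3.1": ["changelog-7", "readme-1"],
        "pricing for SKU-4411": ["catalog-12", "faq-3"],
    }
    return canned[query]


scores = [recall_at_k(fake_retrieve(q), rel, k=2) for q, rel in LABELED.items()]
mean_recall = sum(scores) / len(scores)
print(mean_recall)  # 0.75
```

Tracking a number like this across commits is what would catch a retrieval regression that unit tests pass straight through.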

Architecture notes

Three stages do the work: retrieval, fusion, and reranking. Vector search and BM25 keyword search run independently against the same chunk store, their ranked lists are merged with Reciprocal Rank Fusion, and only the fused top-N go through the cross-encoder.

Reciprocal Rank Fusion

def reciprocal_rank_fusion(
    vector_results: list[SearchResult],
    bm25_results: list[tuple[str, float]],
    chunk_map: dict[str, SearchResult],
    rrf_k: int = 60,
    top_n: int = 5,
) -> list[SearchResult]:
    # Accumulate 1 / (rrf_k + rank) per chunk id; enumerate is 0-based, hence the +1.
    scores: dict[str, float] = {}
    for rank, result in enumerate(vector_results):
        cid = result.chunk.id
        scores[cid] = scores.get(cid, 0.0) + 1.0 / (rrf_k + rank + 1)
    # BM25 contributes by rank only; its raw score is ignored.
    for rank, (chunk_id, _) in enumerate(bm25_results):
        scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank + 1)
    # Keep the ids with the top_n highest fused scores.
    sorted_ids = sorted(scores, key=lambda x: scores[x], reverse=True)[:top_n]
    ...

No manual weight to tune between "how much do I trust vector similarity vs. keyword overlap" — a chunk's fused score is just how consistently it ranked well across both signals.
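
Because fusion looks only at ranks, the two retrievers never need their scores put on a common scale. The sketch below illustrates that under simplifying assumptions: chunk ids are plain strings, `rrf_order` is a trimmed stand-in for the function above, and the BM25 scores are made-up numbers.

```python
def rrf_order(
    vector_ids: list[str],
    bm25_scored: list[tuple[str, float]],
    k: int = 60,
) -> list[str]:
    """Fuse by rank only; raw BM25 scores are deliberately ignored."""
    scores: dict[str, float] = {}
    for rank, cid in enumerate(vector_ids, start=1):
        scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    for rank, (cid, _raw) in enumerate(bm25_scored, start=1):
        scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


vector_ids = ["x", "y", "z"]
bm25 = [("y", 12.4), ("w", 9.1), ("x", 3.3)]
# Same ranking, with scores on a very different scale.
bm25_rescaled = [(cid, score * 100) for cid, score in bm25]

# Rescaling the raw scores cannot change the fused order.
assert rrf_order(vector_ids, bm25) == rrf_order(vector_ids, bm25_rescaled)
print(rrf_order(vector_ids, bm25))  # ['y', 'x', 'w', 'z']
```

Here `y` (second in the vector list, first in BM25) edges out `x` (first and third), and both beat the chunks that appear in only one list. A weighted score blend would instead need the cosine similarities and BM25 scores normalized before any weight meant anything.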

Config-driven, not code-driven providers

Both LLM provider and embedding provider are config fields, not import-time choices:

from dataclasses import dataclass

@dataclass
class RAGConfig:
    # Chunking
    chunk_size: int = 1000
    chunk_overlap: int = 200
    # Retrieval
    max_retrieved_chunks: int = 5
    embedding_provider: str = "openai"
    embedding_model: str = "text-embedding-3-small"
    # Feature flags, both off by default
    reranker_enabled: bool = False
    hybrid_search_enabled: bool = False

Hybrid search and reranking are both off by default. Combined with local embeddings (set through `embedding_provider`, which defaults to "openai" in the excerpt above), that gives the zero-cost local path, the fastest way to try the framework; production deployments turn the full pipeline on with two environment variables.
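
A minimal sketch of how those two variables might map onto the config. The names `HYBRID_SEARCH_ENABLED` and `RERANKER_ENABLED` come from the Outcome section; the `env_flag` helper, the accepted truthy spellings, `load_flags`, and the trimmed `FlagConfig` dataclass are assumptions for illustration, not llmframe's actual loader.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings


def env_flag(name: str, env: Mapping[str, str], default: bool = False) -> bool:
    """Read a boolean flag: unset returns default, anything outside _TRUTHY is False."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


@dataclass
class FlagConfig:
    """Trimmed stand-in holding only the two feature flags."""

    reranker_enabled: bool = False
    hybrid_search_enabled: bool = False


def load_flags(env: Mapping[str, str] = os.environ) -> FlagConfig:
    """Build the flag config from an environment-like mapping."""
    return FlagConfig(
        reranker_enabled=env_flag("RERANKER_ENABLED", env),
        hybrid_search_enabled=env_flag("HYBRID_SEARCH_ENABLED", env),
    )
```

Passing the environment in as a mapping keeps the loader testable without touching the real process environment, for example `load_flags({"HYBRID_SEARCH_ENABLED": "true"})`.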

Tech stack

Python, FastAPI, ChromaDB, rank-bm25, cross-encoder reranking, Claude / GPT-4o, Docker, GitHub Actions, PyPI