llmframe
Open-sourced a production-grade RAG + multi-LLM framework — hybrid retrieval, cross-encoder reranking, swappable providers.
Jul 5, 2026
Problem
Constraints
What I built
Key decision
Outcome
Hindsight
Architecture notes
Retrieval runs in three stages: vector search, BM25 keyword search, and cross-encoder reranking. Vector search and BM25 run independently against the same chunk store, their rankings are merged with Reciprocal Rank Fusion (RRF), and only the fused top-N candidates go through the cross-encoder.
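As a rough sketch of that flow (not llmframe's actual code): `retrieve`, `Retriever`, `Fuser`, and `Scorer` are hypothetical names I'm using for illustration, and the cross-encoder is stood in for by any callable that scores a (query, text) pair.

```python
from typing import Callable

# Hypothetical type aliases for illustration; these are not llmframe's API.
Retriever = Callable[[str], list[str]]                # query -> ranked chunk ids
Fuser = Callable[[list[str], list[str]], list[str]]   # two rankings -> fused top-N ids
Scorer = Callable[[str, str], float]                  # (query, chunk text) -> relevance


def retrieve(
    query: str,
    vector_search: Retriever,
    bm25_search: Retriever,
    fuse: Fuser,
    score: Scorer,
    chunk_text: dict[str, str],
) -> list[str]:
    # Stages 1 and 2 are independent: neither retriever sees the other's output.
    vector_ids = vector_search(query)
    bm25_ids = bm25_search(query)
    # Fusion narrows the candidate set before the expensive pairwise stage.
    candidates = fuse(vector_ids, bm25_ids)
    # Stage 3: only the fused candidates are scored against the query.
    return sorted(
        candidates,
        key=lambda cid: score(query, chunk_text[cid]),
        reverse=True,
    )


# Toy stand-ins so the sketch runs end to end.
chunks = {"a": "reset your password", "b": "billing history", "c": "password policy"}
scored_texts: list[str] = []


def toy_score(query: str, text: str) -> float:
    scored_texts.append(text)  # record how many chunks reach the scorer
    return float(len(set(query.split()) & set(text.split())))


ranked = retrieve(
    "password reset",
    vector_search=lambda q: ["b", "a", "c"],
    bm25_search=lambda q: ["c", "a", "b"],
    fuse=lambda v, k: list(dict.fromkeys(v + k))[:2],  # placeholder fusion, top 2
    score=toy_score,
    chunk_text=chunks,
)
print(ranked)             # ['a', 'b']
print(len(scored_texts))  # 2: chunk "c" never reaches the scorer
```

The point of the ordering is cost: the scorer is called once per fused candidate, not once per chunk in the store.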
Reciprocal Rank Fusion
def reciprocal_rank_fusion(
    vector_results: list[SearchResult],
    bm25_results: list[tuple[str, float]],
    chunk_map: dict[str, SearchResult],
    rrf_k: int = 60,
    top_n: int = 5,
) -> list[SearchResult]:
    # Accumulated RRF score per chunk id.
    scores: dict[str, float] = {}
    # Ranks are 0-based here, so the top hit contributes 1 / (rrf_k + 1).
    for rank, result in enumerate(vector_results):
        cid = result.chunk.id
        scores[cid] = scores.get(cid, 0.0) + 1.0 / (rrf_k + rank + 1)
    # BM25 raw scores are ignored; only the rank position matters.
    for rank, (chunk_id, _) in enumerate(bm25_results):
        scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank + 1)
    # Keep the top_n chunk ids by fused score.
    sorted_ids = sorted(scores, key=lambda x: scores[x], reverse=True)[:top_n]
    ...
There is no manual weight to tune between vector similarity and keyword overlap. RRF uses only rank positions, so a chunk's fused score reflects how consistently it ranked well across both signals.
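To make the arithmetic concrete, here is a small self-contained sketch of the same scoring rule on plain chunk ids. The helper `rrf_scores` and the toy rankings are mine, for illustration only; the formula and the default `rrf_k = 60` match the function above.

```python
def rrf_scores(rankings: list[list[str]], rrf_k: int = 60) -> dict[str, float]:
    """Sum 1 / (rrf_k + rank + 1) over every ranking a chunk id appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking):  # rank is 0-based
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (rrf_k + rank + 1)
    return scores


vector_ranking = ["a", "b", "c"]  # toy data
bm25_ranking = ["d", "b", "a"]    # toy data

scores = rrf_scores([vector_ranking, bm25_ranking])
fused = sorted(scores, key=lambda cid: scores[cid], reverse=True)
print(fused)  # ['a', 'b', 'd', 'c']
```

Chunk `d` is BM25's top hit but missing from the vector results, so it scores 1/61 (about 0.0164). Chunk `b` is second in both lists and scores 2/62 (about 0.0323), which puts it well ahead of `d`. Chunk `a` (first and third) edges out `b` at 1/61 + 1/63.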
Config-driven, not code-driven providers
Both the LLM provider and the embedding provider are config fields, not import-time choices:
from dataclasses import dataclass

@dataclass
class RAGConfig:
    chunk_size: int = 1000
    chunk_overlap: int = 200
    max_retrieved_chunks: int = 5
    embedding_provider: str = "openai"
    embedding_model: str = "text-embedding-3-small"
    reranker_enabled: bool = False
    hybrid_search_enabled: bool = False
Hybrid search and reranking are both off by default. The zero-cost local path (local embeddings, no reranker) is the fastest way to try the framework; production deployments turn the full pipeline on with two environment variables.
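One way the environment toggle could look, as a sketch under stated assumptions: the variable names `RAG_RERANKER_ENABLED` and `RAG_HYBRID_SEARCH_ENABLED`, the helper `_env_flag`, and `config_from_env` are all hypothetical (the framework's actual names may differ), and the config is trimmed to the two flags in question.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass
class RAGConfig:
    # Trimmed to the two flags discussed here; both default to off.
    reranker_enabled: bool = False
    hybrid_search_enabled: bool = False


def _env_flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Read a boolean flag; an unset variable keeps the default."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def config_from_env(env: Optional[Mapping[str, str]] = None) -> RAGConfig:
    # Hypothetical variable names, chosen for this example only.
    env = os.environ if env is None else env
    defaults = RAGConfig()
    return RAGConfig(
        reranker_enabled=_env_flag(
            env, "RAG_RERANKER_ENABLED", defaults.reranker_enabled
        ),
        hybrid_search_enabled=_env_flag(
            env, "RAG_HYBRID_SEARCH_ENABLED", defaults.hybrid_search_enabled
        ),
    )


local = config_from_env({})  # nothing set: cheap path
prod = config_from_env(
    {"RAG_RERANKER_ENABLED": "true", "RAG_HYBRID_SEARCH_ENABLED": "1"}
)
print(local)
print(prod)
```

Passing the environment in as a mapping keeps the loader testable without touching the real process environment.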