A retrieval-augmented generation system that answers questions about municipal policy documents using TF-IDF and BM25 retrieval with term-overlap re-ranking.
Four-page Streamlit application for document retrieval and evaluation
TF-IDF and BM25 retrieval evaluated on 30 municipal policy questions
Fifteen synthetic municipal policy documents covering Calgary bylaws, transit, water services, housing, parks, emergency management, and more are chunked into overlapping text segments of 500 characters, with a 50-character overlap between consecutive segments. Two retrieval methods are compared: TF-IDF with cosine similarity and Okapi BM25. A term-overlap re-ranker is applied as a second pass to improve ranking quality. Evaluation uses 30 hand-crafted questions with ground-truth document IDs and reports precision@k, recall@k, and mean reciprocal rank (MRR).
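The chunking step described above can be sketched as a sliding character window. This is a minimal illustration, not the project's actual code: the function name `chunk_text` and its parameters are hypothetical, and it assumes the overlap is measured in characters, so each chunk starts 450 characters after the previous one.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows of `size` that share `overlap` characters.

    Hypothetical helper for illustration; defaults mirror the 500/50 setting
    mentioned in the overview.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")

    step = size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk made up purely of overlap is emitted.
        if start + size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(1000))
    pieces = chunk_text(sample)
    print([len(p) for p in pieces])
```

A character window is simple and deterministic, but it can cut words and sentences in half. The overlap reduces the chance that a relevant phrase is split across two chunks with neither containing it whole.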
Set up the project locally in three commands
The document corpus consists of 15 synthetic municipal policy texts modeled on Calgary city bylaws, strategic plans, and public service descriptions. Topics include land use zoning, public transit, water services, affordable housing, parks and recreation, emergency management, climate action, transportation infrastructure, business licensing, community safety, waste services, economic development, snow control, arts and culture, and property assessment. The 30 evaluation questions provide two questions per document, and each question is paired with the ID of its ground-truth relevant document.
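As a rough sketch of how the three reported metrics can be computed from a ranked list and a ground-truth set, the following assumes that retrieved chunks have already been mapped back to document IDs and deduplicated. The function names and the `doc_XX` identifiers are hypothetical and do not come from the project.

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked list, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical query: the single relevant document is ranked second.
    ranked = ["doc_03", "doc_07", "doc_01"]
    relevant = {"doc_07"}
    print(precision_at_k(ranked, relevant, k=3))
    print(recall_at_k(ranked, relevant, k=3))
    print(reciprocal_rank(ranked, relevant))
```

With one relevant document per question, precision@k can be at most 1/k, so recall@k and MRR are usually the more informative numbers when comparing the two retrievers.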