DSpace Repository

Optimizing Hybrid Sparse–Dense Retrieval in Retrieval-Augmented Generation Pipelines for Balanced Latency and Accuracy


dc.contributor.author Mamun, Abdullah Al
dc.date.accessioned 2026-04-22T06:06:01Z
dc.date.available 2026-04-22T06:06:01Z
dc.date.issued 2025-12-27
dc.identifier.citation SWT en_US
dc.identifier.uri http://dspace.daffodilvarsity.edu.bd:8080/handle/123456789/16993
dc.description Thesis en_US
dc.description.abstract Retrieval-Augmented Generation (RAG) has become a central approach for grounding large language models (LLMs) in external knowledge so that they can answer questions reliably and with current information. At the same time, interactive applications demand low latency, high throughput, and a stable user experience, creating a tension between speed and answer quality. This thesis investigates how a hybrid sparse–dense retrieval design, coupled with generation- and system-level optimizations, can improve the latency–accuracy trade-off in a realistic RAG pipeline. The main objective is to design, implement, and empirically evaluate an integrated set of techniques that raise answer accuracy while reducing end-to-end response time. The study was conducted using a production-style RAG pipeline evaluated on a filtered split of the Natural Questions dataset. The filtered corpus contains 86,213 question–answer pairs, and a random sample of 1,000 questions was used for benchmarking so that experiments remained tractable while remaining representative. A dual-index retrieval layer combining BM25 with dense sentence embeddings was implemented, together with adaptive top-k retrieval driven by a heuristic query difficulty score and lightweight query expansion. At the generation layer, the pipeline employed a sequence-to-sequence LLM with a confidence-based early exit criterion that stopped decoding when predictions stabilized. System-level orchestration included stale prefetching of retrieval for upcoming batches, reuse of components across runs, and parallel workers for retrieval and generation. The baseline and optimized pipelines were compared under identical conditions using accuracy, retrieval and generation metrics, latency, time-to-first-token (TTFT), and throughput. en_US
dc.description.sponsorship DIU en_US
dc.language.iso en_US en_US
dc.publisher Daffodil International University en_US
dc.subject Retrieval-Augmented Generation (RAG) en_US
dc.subject Information Retrieval Systems en_US
dc.subject Hybrid Sparse–Dense en_US
dc.subject Retrieval Latency Optimization en_US
dc.title Optimizing Hybrid Sparse–Dense Retrieval in Retrieval-Augmented Generation Pipelines for Balanced Latency and Accuracy en_US
dc.type Thesis en_US
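
The abstract describes a dual-index retrieval layer (BM25 plus dense embeddings) with adaptive top-k driven by a heuristic query difficulty score, but this record does not say how scores are fused or how difficulty is computed. The following is only an illustrative sketch under stated assumptions, not the thesis's implementation: the min-max normalization, the weighted-sum fusion with weight `alpha`, the word-count difficulty heuristic, and every function name and default value here are hypothetical. The sparse and dense scores are assumed to be precomputed per document.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def difficulty(query: str) -> float:
    """Hypothetical heuristic: longer queries count as harder, capped at 1.0."""
    return min(len(query.split()) / 20.0, 1.0)


def adaptive_k(query: str, k_min: int = 3, k_max: int = 10) -> int:
    """Interpolate the retrieval depth between k_min and k_max by difficulty."""
    return k_min + round(difficulty(query) * (k_max - k_min))


def hybrid_top_k(
    query: str,
    sparse: Dict[str, float],  # assumed: BM25 score per document id
    dense: Dict[str, float],   # assumed: embedding similarity per document id
    alpha: float = 0.5,        # assumed fusion weight on the sparse score
) -> List[Tuple[str, float]]:
    """Fuse normalized sparse and dense scores and keep an adaptive top-k."""
    s, d = min_max(sparse), min_max(dense)
    fused = {
        doc: alpha * s.get(doc, 0.0) + (1.0 - alpha) * d.get(doc, 0.0)
        for doc in set(s) | set(d)
    }
    # Sort by descending fused score, breaking ties by document id.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[: adaptive_k(query)]


if __name__ == "__main__":
    sparse_scores = {"d1": 12.0, "d2": 4.0, "d3": 8.0}
    dense_scores = {"d1": 0.2, "d2": 0.9, "d4": 0.6}
    for doc_id, score in hybrid_top_k("who wrote hamlet", sparse_scores, dense_scores):
        print(doc_id, round(score, 3))
```

A document missing from one index contributes 0.0 for that component, so it can still be retrieved on the strength of the other index. Other fusion schemes, such as reciprocal rank fusion, would fit the same interface.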

