| dc.description.abstract |
Retrieval-Augmented Generation (RAG) has become a central approach for grounding large language models (LLMs) in external knowledge so that they can answer questions reliably and with current information. At the same time, interactive applications demand low latency, high throughput, and a stable user experience, creating a tension between speed and answer quality. This thesis investigates how a hybrid sparse–dense retrieval design, coupled with generation- and system-level optimizations, can improve the latency–accuracy trade-off in a realistic RAG pipeline. The main objective is to design, implement, and empirically evaluate an integrated set of techniques that can raise answer accuracy while reducing end-to-end response time. The study used a production-style RAG pipeline evaluated on a filtered split of the Natural Questions dataset. The filtered corpus contains 86,213 question–answer pairs, and a random sample of 1,000 questions was used for benchmarking so that experiments remained tractable while still being representative. A dual-index retrieval layer combining BM25 with dense sentence embeddings was implemented, together with adaptive top-k retrieval driven by a heuristic query difficulty score and lightweight query expansion. At the generation layer, the pipeline employed a sequence-to-sequence LLM with a confidence-based early-exit criterion that stopped decoding when predictions stabilized. System-level orchestration included stale prefetching of retrieval for upcoming batches, reuse of components across runs, and parallel workers for retrieval and generation. The baseline and optimized pipelines were compared under identical conditions using accuracy, retrieval and generation metrics, latency, time-to-first-token (TTFT), and throughput. |
en_US |