Abstract:
Vision-language models (VLMs) are powerful, but they can produce text that looks plausible yet is not grounded in the image. In particular, relationship hallucinations, where a model describes an incorrect relation between two correctly identified objects, are especially damaging to user trust and downstream use. This thesis presents REL-FIX, a training-free, scene-graph-guided framework that detects and corrects relation-level hallucinations in small vision-language models without requiring expensive retraining or large-scale judge models. REL-FIX decomposes long-form VLM outputs into (subject, relation, object) triplets, diagnoses hallucinations at the triplet level against ground-truth scene graphs from the Tri-HE benchmark, and then applies a two-stage correction mechanism: it generates candidate relations constrained by the scene graph and verifies them with a lightweight LLM judge. The pipeline emphasizes low-resource reproducibility by using a compact generative VLM, Qwen2-VL-2B-Instruct, together with accessible LLM judges: Mistral 7B and, for cross-checking, a commercial Gemini variant. Experiments on the 300-image Tri-HE split show that REL-FIX substantially lowers relation hallucination rates while remaining cost-effective. With the Gemini judge, the question-level hallucination rate fell from 0.421 to 0.263 and the relation hallucination rate from 0.341 to 0.196. With the Mistral judge, the framework still reduced errors meaningfully, showing that open-source judges can enable practical, low-resource correction. Further analysis shows that REL-FIX is particularly effective at repairing relational errors, with smaller but positive effects on object-level errors. Remaining challenges include the reliance on high-quality scene graphs and noise in triplet extraction; both are discussed along with directions for extending the method to automatically inferred scene graphs and multi-hop reasoning.
In sum, REL-FIX offers a modular, training-free approach to improving the factual consistency of small VLM outputs. It demonstrates that fine-grained, scene-graph-guided correction can make small models significantly more reliable on tasks that require precise relational understanding.
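
The triplet-level diagnosis and two-stage correction summarized above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the names `Triplet`, `is_hallucinated`, `candidate_relations`, `correct`, and `hallucination_rate` are hypothetical, exact matching against the scene graph stands in for the diagnosis step, and the judge is an arbitrary callable where the actual pipeline would query an LLM such as Mistral 7B or Gemini. The rate computed here is a simple per-triplet fraction on toy data, not the benchmark metric.

```python
from typing import Callable, List, NamedTuple, Set


class Triplet(NamedTuple):
    subject: str
    relation: str
    obj: str


def is_hallucinated(triplet: Triplet, scene_graph: Set[Triplet]) -> bool:
    # Assumption: exact match against the ground-truth graph stands in
    # for the diagnosis step.
    return triplet not in scene_graph


def candidate_relations(triplet: Triplet, scene_graph: Set[Triplet]) -> List[str]:
    # Stage 1: propose only relations that the scene graph records
    # between the same subject and object.
    return sorted(
        {g.relation for g in scene_graph
         if g.subject == triplet.subject and g.obj == triplet.obj}
    )


def correct(triplet: Triplet, scene_graph: Set[Triplet],
            judge: Callable[[Triplet], bool]) -> Triplet:
    # Stage 2: keep the first candidate that the judge accepts.
    if not is_hallucinated(triplet, scene_graph):
        return triplet
    for relation in candidate_relations(triplet, scene_graph):
        candidate = triplet._replace(relation=relation)
        if judge(candidate):
            return candidate
    return triplet  # no verified candidate: leave the triplet unchanged


def hallucination_rate(triplets: List[Triplet], scene_graph: Set[Triplet]) -> float:
    # Fraction of triplets not supported by the scene graph.
    if not triplets:
        return 0.0
    return sum(is_hallucinated(t, scene_graph) for t in triplets) / len(triplets)


# Toy data, invented for illustration only.
scene_graph = {Triplet("man", "riding", "horse"), Triplet("horse", "on", "beach")}
predicted = [Triplet("man", "feeding", "horse"), Triplet("horse", "on", "beach")]


def judge(t: Triplet) -> bool:
    # Placeholder judge; a real one would prompt an LLM.
    return t in scene_graph


corrected = [correct(t, scene_graph, judge) for t in predicted]
print(hallucination_rate(predicted, scene_graph))
print(hallucination_rate(corrected, scene_graph))
```

In this toy case the incorrect relation "feeding" is replaced by "riding", the only relation the scene graph records between "man" and "horse". A triplet whose subject and object pair is absent from the graph has no candidates and is returned unchanged.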