REL-FIX: Scene Graph Guided Fine-Grained AI Correction of Relationship Hallucinations in Vision-Language Models

Prima, Jafrin Alam

dc.contributor.author	Prima, Jafrin Alam
dc.date.accessioned	2026-04-25T09:24:00Z
dc.date.available	2026-04-25T09:24:00Z
dc.date.issued	2025-12-30
dc.identifier.citation	SWT	en_US
dc.identifier.uri	http://dspace.daffodilvarsity.edu.bd:8080/handle/123456789/17029
dc.description	Thesis Report	en_US
dc.description.abstract	Vision-language models (VLMs) are powerful, but they can produce text that looks plausible yet is not grounded in the image. Relationship hallucinations, where a model describes an incorrect relation between two correctly identified objects, are especially damaging to trust and downstream use. This thesis presents REL-FIX, a training-free, scene-graph-guided framework that detects and corrects relation-level hallucinations in small vision-language models without expensive retraining or large-scale judges. REL-FIX decomposes long-form VLM outputs into (subject, relation, object) triplets, diagnoses hallucinations at the triplet level against ground-truth scene graphs from the Tri-HE benchmark, and then applies a two-stage correction mechanism: it generates candidate relations constrained by the scene graph and verifies them with a lightweight LLM judge. The pipeline emphasizes low-resource reproducibility by using a compact generative VLM, Qwen2-VL-2B-Instruct, together with accessible LLM judges: Mistral 7B and a commercial Gemini variant for cross-checking. Experiments on the 300-image Tri-HE split show that REL-FIX substantially lowers relation hallucination rates while remaining cost-effective. With the Gemini judge, the question-level hallucination rate fell from 0.421 to 0.263 and the relation hallucination rate from 0.341 to 0.196. With the Mistral judge, the framework still reduced errors meaningfully, showing that open-source judges can enable practical, low-resource correction. Analysis shows that REL-FIX is particularly effective at repairing relational errors, with smaller but positive effects on object-level errors. Remaining challenges include reliance on high-quality scene graphs and noise in triplet extraction; these are discussed along with directions for extending the method to automatically inferred scene graphs and multi-hop reasoning.
In sum, REL-FIX offers a modular, training-free approach to improving the factual consistency of small-VLM outputs. It demonstrates that fine-grained, scene-graph-guided correction can make small models significantly more reliable for tasks that require precise relational understanding.	en_US
dc.description.sponsorship	DIU	en_US
dc.language.iso	en_US	en_US
dc.publisher	Daffodil International University	en_US
dc.subject	Deep Learning	en_US
dc.subject	Vision-Language Models	en_US
dc.subject	Scene Graphs	en_US
dc.subject	Relationship Hallucination	en_US
dc.subject	Fine-Grained AI Correction	en_US
dc.title	REL-FIX: Scene Graph Guided Fine-Grained AI Correction of Relationship Hallucinations in Vision-Language Models	en_US
dc.type	Thesis	en_US
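
The triplet-level diagnosis and two-stage correction described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the `Triplet` type, the function names, the exact-match hallucination test, and the stub judge are all assumptions made for illustration. In the thesis the judge is an LLM (Mistral 7B or Gemini) and the triplets are extracted from Qwen2-VL-2B-Instruct outputs; here both are replaced by hand-written stand-ins.

```python
from typing import Callable, Iterable, List, NamedTuple, Set


class Triplet(NamedTuple):
    """Hypothetical (subject, relation, object) record."""
    subject: str
    relation: str
    obj: str


def is_hallucinated(triplet: Triplet, scene_graph: Set[Triplet]) -> bool:
    # Assumption: a triplet is hallucinated if it is not in the scene graph.
    return triplet not in scene_graph


def candidate_relations(triplet: Triplet, scene_graph: Set[Triplet]) -> List[str]:
    # Stage 1: candidates are the relations the scene graph records
    # for the same subject/object pair.
    return sorted(
        {t.relation for t in scene_graph
         if t.subject == triplet.subject and t.obj == triplet.obj}
    )


def correct_triplet(triplet: Triplet, scene_graph: Set[Triplet],
                    judge: Callable[[Triplet], bool]) -> Triplet:
    # Stage 2: accept the first candidate the judge verifies;
    # otherwise return the triplet unchanged.
    if not is_hallucinated(triplet, scene_graph):
        return triplet
    for relation in candidate_relations(triplet, scene_graph):
        fixed = triplet._replace(relation=relation)
        if judge(fixed):
            return fixed
    return triplet


def hallucination_rate(triplets: Iterable[Triplet],
                       scene_graph: Set[Triplet]) -> float:
    # Fraction of triplets not supported by the scene graph.
    items = list(triplets)
    if not items:
        return 0.0
    return sum(is_hallucinated(t, scene_graph) for t in items) / len(items)


def relative_reduction(before: float, after: float) -> float:
    # Relative drop in a rate, e.g. for the before/after figures in the abstract.
    return (before - after) / before


# Toy data (invented for this sketch, not taken from Tri-HE).
scene_graph = {Triplet("man", "riding", "horse"), Triplet("horse", "on", "grass")}
predicted = [Triplet("man", "feeding", "horse"), Triplet("horse", "on", "grass")]


def stub_judge(triplet: Triplet) -> bool:
    # Stand-in for the LLM judge: accepts every candidate.
    return True


corrected = [correct_triplet(t, scene_graph, stub_judge) for t in predicted]
rate_before = hallucination_rate(predicted, scene_graph)
rate_after = hallucination_rate(corrected, scene_graph)
```

Keeping the judge as a plain callable mirrors the modularity the abstract claims: swapping one judge for another would, under this sketch, change only the function passed to `correct_triplet`. `relative_reduction` can be applied to the reported rates (0.421 to 0.263, and 0.341 to 0.196) to express them as relative drops.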