DSpace Repository

SkinBench : A Multimodal LLM Benchmark for Skin Disease Diagnosis

dc.contributor.author Akter, Swarna
dc.date.accessioned 2026-04-22T06:06:48Z
dc.date.available 2026-04-22T06:06:48Z
dc.date.issued 2025-12-27
dc.identifier.citation SWT en_US
dc.identifier.uri http://dspace.daffodilvarsity.edu.bd:8080/handle/123456789/16996
dc.description Thesis Report en_US
dc.description.abstract Large Language Models (LLMs) are increasingly used in medical diagnosis, particularly for analyzing complex data such as medical imaging and patient histories. However, current models often lack depth of reasoning and clinical accuracy when applied in real-world healthcare settings. To address this, we introduce SkinBench, a new benchmark and assessment framework focused on the diagnosis of skin diseases. It is built on a multi-agent system comprising three specialized agents: (1) DescribeLLM, which describes the clinical scenario; (2) DoctorLLM, which acts as a clinician, asking questions and reasoning toward a diagnosis; and (3) EvalLLM, which assesses the quality of the diagnostic outcome. SkinBench contains 500 skin disease cases, each consisting of images and written reasoning together with model-generated dialogues and responses, of which 10 percent are checked by real doctors. As an initial screening step, we measure the standalone diagnostic accuracy of seven representative LLMs on these 500 cases: GPT-5.1 has the highest accuracy at 98.8%, followed by Mistral at 98.6%, GPT-4o at 95.6%, DeepSeek at 94.0%, Qwen at 91.4%, Llama 3.2 at 90.6%, and GPT-3.5 Turbo with the lowest accuracy at 88.6%. This ranking reveals significant performance differences among models before multi-agent reasoning is applied, which motivates the development of a more structured and realistic benchmark.
We thus evaluate seven common skin diseases in Bangladesh: Scabies, Psoriasis, Monkeypox and Chickenpox, Candidiasis and Tinea, Atopic Dermatitis and Seborrheic Dermatitis, and Acne and Impetigo. We assess both open-source and closed-source LLMs on three main dimensions: (1) the accuracy of skin disease diagnosis, (2) the consistency of reasoning across multi-turn conversational interactions, and (3) the quality and explainability of clinical results. en_US
dc.description.sponsorship DIU en_US
dc.language.iso en_US en_US
dc.publisher Daffodil International University en_US
dc.subject Skin Disease Diagnosis en_US
dc.subject Multimodal Large Language Models (LLM) en_US
dc.subject Benchmark Dataset (SkinBench) en_US
dc.subject Medical AI Evaluation en_US
dc.title SkinBench : A Multimodal LLM Benchmark for Skin Disease Diagnosis en_US
dc.type Thesis en_US
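
The abstract names three agents (DescribeLLM, DoctorLLM, EvalLLM) and reports standalone accuracy over 500 cases, but this record does not give their interfaces. The following is a minimal Python sketch of how such a pipeline and accuracy score could be wired together. Everything in it is an assumption made for illustration: the `Case` fields, the function names `run_case` and `accuracy_percent`, the callable signatures, and the toy stand-in agents. It also collapses DoctorLLM's multi-turn questioning into a single call, so it should not be read as the thesis's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One benchmark case. Both field names are assumptions for this sketch."""
    image_path: str
    true_label: str


def run_case(
    case: Case,
    describe_llm: Callable[[str], str],
    doctor_llm: Callable[[str], str],
    eval_llm: Callable[[str, str], bool],
) -> bool:
    """Pass one case through the three agents in the order the abstract lists them."""
    description = describe_llm(case.image_path)  # DescribeLLM: describe the clinical scenario
    diagnosis = doctor_llm(description)          # DoctorLLM: reason to a diagnosis (single turn here)
    return eval_llm(diagnosis, case.true_label)  # EvalLLM: judge the diagnostic outcome


def accuracy_percent(results: List[bool]) -> float:
    """Share of cases judged correct, as a percentage."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(results) / len(results)


# Toy stand-ins so the sketch runs without any model or API access.
def toy_describe(image_path: str) -> str:
    return f"itchy rash with burrow-like lines ({image_path})"


def toy_doctor(description: str) -> str:
    return "Scabies" if "burrow" in description else "Psoriasis"


def toy_eval(diagnosis: str, true_label: str) -> bool:
    return diagnosis.strip().lower() == true_label.strip().lower()


cases = [Case("case_001.jpg", "Scabies"), Case("case_002.jpg", "Psoriasis")]
results = [run_case(c, toy_describe, toy_doctor, toy_eval) for c in cases]
print(accuracy_percent(results))
```

Under this scoring, 494 correct answers out of 500 cases give 98.8, which matches the top reported accuracy when that figure is read as a percentage. Passing the agents in as plain callables keeps the sketch independent of any particular model provider.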