Abstract:
Large Language Models (LLMs) are increasingly used in medical diagnosis, particularly for analyzing complex data such as medical images and patient histories. However, current models often lack the reasoning depth and clinical accuracy needed in real-world healthcare settings. To address this gap, we introduce SkinBench, a new benchmark and assessment framework focused on skin disease diagnosis. It is built on a multi-agent system comprising three specialized agents: (1) DescribeLLM, which describes the clinical scenario; (2) DoctorLLM, which acts as a clinician, asking questions and reasoning toward a diagnosis; and (3) EvalLLM, which assesses the quality of the diagnostic outcome. SkinBench contains 500 skin disease cases, each consisting of images and written reasoning together with model-generated dialogues and responses, 10 percent of which are checked by doctors. As an initial screening step, we measure the standalone diagnostic accuracy of seven representative LLMs on these 500 cases: GPT-5.1 achieves the highest accuracy at 98.8%, followed by Mistral at 98.6%, GPT-4o at 95.6%, DeepSeek at 94.0%, Qwen at 91.4%, Llama 3.2 at 90.6%, and GPT-3.5 Turbo with the lowest at 88.6%. This ranking reveals substantial performance differences among models before multi-agent reasoning is applied, which motivates a more structured and realistic benchmark.
We therefore evaluate seven common skin diseases in Bangladesh (Scabies, Psoriasis, Monkeypox and Chickenpox, Candidiasis and Tinea, Atopic Dermatitis and Seborrheic Dermatitis, and Acne and Impetigo) and assess both open-source and closed-source LLMs along three main dimensions: (1) the accuracy of skin disease diagnoses, (2) the consistency of reasoning across multi-turn conversational interactions, and (3) the quality and explainability of clinical results.
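The three-agent flow described above can be sketched roughly as follows. This is a minimal illustration and not the SkinBench implementation: the `Agent` signature, the `run_case` function, the `DIAGNOSIS:` reply convention, the turn limit, and the choice to have DescribeLLM answer DoctorLLM's questions are all assumptions made for the example, and the stub agents stand in for real model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interface: every agent maps a text prompt to a text reply.
Agent = Callable[[str], str]


@dataclass
class CaseResult:
    description: str
    dialogue: List[str] = field(default_factory=list)
    diagnosis: str = ""
    evaluation: str = ""


def run_case(case_text: str, describe_llm: Agent, doctor_llm: Agent,
             eval_llm: Agent, max_turns: int = 3) -> CaseResult:
    """Run one case through describe -> multi-turn doctor dialogue -> evaluation."""
    description = describe_llm(case_text)
    result = CaseResult(description=description)
    context = description

    for _ in range(max_turns):
        reply = doctor_llm(context)
        result.dialogue.append(reply)
        # Assumed convention: a reply starting with "DIAGNOSIS:" ends the dialogue.
        if reply.startswith("DIAGNOSIS:"):
            result.diagnosis = reply[len("DIAGNOSIS:"):].strip()
            break
        # Assumption: the describing agent answers the clinician's question.
        answer = describe_llm(f"{case_text}\nQuestion: {reply}")
        context += f"\nDoctor: {reply}\nAnswer: {answer}"

    transcript = "\n".join(result.dialogue)
    result.evaluation = eval_llm(
        f"Case: {description}\nDialogue:\n{transcript}\nDiagnosis: {result.diagnosis}"
    )
    return result


# Stub agents for demonstration only; real agents would call an LLM.
def describe_stub(prompt: str) -> str:
    return "Patient note: " + prompt


def doctor_stub(context: str) -> str:
    if "Answer:" in context:
        return "DIAGNOSIS: scabies"
    return "QUESTION: Is the itching worse at night?"


def eval_stub(prompt: str) -> str:
    return "acceptable" if "Diagnosis: scabies" in prompt else "unacceptable"
```

Calling `run_case("itchy rash between the fingers", describe_stub, doctor_stub, eval_stub)` runs one question-and-answer turn before the stub clinician commits to a diagnosis, and the evaluator then judges the full transcript. Keeping the agents as plain callables lets open-source and closed-source models be swapped in without changing the loop.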