DSpace Repository

Zero-Shot Anomaly Detection in Industrial Manufacturing Using Vision Transformers and Conditional Diffusion Models

dc.contributor.author Shanto, Md Mahfuzur Rahman
dc.date.accessioned 2026-04-22T05:56:26Z
dc.date.available 2026-04-22T05:56:26Z
dc.date.issued 2025-12-30
dc.identifier.citation SWT en_US
dc.identifier.uri http://dspace.daffodilvarsity.edu.bd:8080/handle/123456789/16975
dc.description Thesis Report en_US
dc.description.abstract Quality control in industrial manufacturing requires automated anomaly detection systems that can detect defects without large amounts of labeled data. This thesis introduces a new multi-path zero-shot anomaly detection architecture that combines Vision Transformer (ViT)-based memory banks, conditional diffusion models, and CLIP-based zero-shot detection in a manufacturing setting. In contrast to current methods that apply a single detection strategy, the proposed framework integrates three complementary paths: Path A uses hierarchical ViT features (blocks 5 and 11) with k-nearest-neighbor memory banks to detect textural anomalies efficiently; Path B uses conditional diffusion models guided by ViT features to identify structural anomalies via reconstruction; and Path C performs semantic zero-shot detection using CLIP vision-language understanding, with no training data. Extensive testing on 10 classes of the MVTec Anomaly Detection dataset shows an average image-level AUROC of 96.41%, an average pixel-level AUROC of 96.75%, and an image-level AUPR of 98.30%. The framework outperforms the zero-shot approaches WinCLIP (91.8%), AnomalyCLIP (91.5%), AnoVL (92.5%), and DZAD (93.5%) by 2.9-4.9 percentage points, and achieves performance similar to state-of-the-art supervised models without using category-specific training examples. Systematic ablation experiments support the design choices and demonstrate that hierarchical feature extraction improves detection by 2.18% compared to single-layer methods, and that multi-path integration provides robustness to diverse defects. The framework attains real-time inference at 67 FPS (15 ms per image) on the main detection path, making it suitable for real-world industrial use.
The study makes three main contributions: (1) the first multi-path zero-shot model that combines ViT, diffusion models, and CLIP to detect anomalies in industrial settings; (2) technical contributions including hierarchical feature extraction, ViT-conditioned diffusion, and patch-level CLIP analysis; and (3) an extensive evaluation showing that the model is competitive with supervised methods while, as a zero-shot model, requiring no training data. The zero-shot capability offers significant practical advantages for manufacturers deploying quality control systems across numerous product variants, reducing data collection costs and enabling immediate deployment to new production lines. en_US
dc.description.sponsorship DIU en_US
dc.language.iso en_US en_US
dc.publisher Daffodil International University en_US
dc.subject Zero-Shot Learning en_US
dc.subject Anomaly Detection en_US
dc.subject Vision Transformers en_US
dc.subject Diffusion Models en_US
dc.subject CLIP en_US
dc.title Zero-Shot Anomaly Detection in Industrial Manufacturing Using Vision Transformers and Conditional Diffusion Models en_US
dc.type Thesis en_US
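
The abstract's Path A (patch features scored against a k-nearest-neighbor memory bank) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the toy 2-D vectors standing in for ViT patch features, the Euclidean distance, the mean-of-k patch score, and the max-over-patches image score are all assumptions chosen for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def patch_score(patch, memory_bank, k=3):
    """Anomaly score of one patch: mean distance to its k nearest
    reference features (assumed scoring rule, not from the thesis)."""
    dists = sorted(euclidean(patch, ref) for ref in memory_bank)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)

def image_score(patches, memory_bank, k=3):
    """Image-level score: the most anomalous patch decides (assumption)."""
    return max(patch_score(p, memory_bank, k) for p in patches)

# Toy memory bank of reference features; in the thesis these would be
# hierarchical ViT features, here they are hypothetical 2-D vectors.
memory_bank = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]

normal_patches = [[0.05, 0.05], [0.0, 0.05]]   # close to the bank
defect_patches = [[0.05, 0.05], [2.0, 2.0]]    # one patch far from the bank

normal = image_score(normal_patches, memory_bank)
defect = image_score(defect_patches, memory_bank)
print(f"normal image score: {normal:.3f}")
print(f"defect image score: {defect:.3f}")
```

In this sketch a single distant patch is enough to raise the image-level score, which is why the maximum over patches is used; thresholding that score would give a normal/defective decision.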

