Abstract:
Industrial manufacturing quality control requires automated anomaly detection systems that can detect defects without large amounts of labeled data. This thesis introduces a new multi-path zero-shot anomaly detection architecture that combines Vision Transformer (ViT)-based memory banks, conditional diffusion models, and CLIP-based zero-shot detection in a manufacturing setting. In contrast to current methods, which apply a single detection strategy, the proposed framework integrates three complementary paths: Path A uses hierarchical ViT features (blocks 5 and 11) with k-nearest-neighbor memory banks to detect textural anomalies efficiently; Path B uses conditional diffusion models guided by ViT features to identify structural anomalies via reconstruction; and Path C performs semantic zero-shot detection using CLIP vision-language understanding, with no training data. Extensive testing on 10 classes of the MVTec Anomaly Detection dataset shows strong results, with an average image-level AUROC of 96.41%, an average pixel-level AUROC of 96.75%, and an image-level AUPR of 98.30%. The framework outperforms existing zero-shot approaches (WinCLIP (91.8%), AnomalyCLIP (91.5%), AnoVL (92.5%), and DZAD (93.5%)) by 2.9-4.9 percentage points, and achieves performance similar to state-of-the-art supervised models without using category-specific training examples. Systematic ablation experiments support the design choices and demonstrate that hierarchical feature extraction improves detection by 2.18% compared to single-layer methods, and that multi-path integration provides robustness to diverse defects. The framework attains real-time inference at 67 FPS (15 ms per image) on the main detection path, making it suitable for real-world industrial deployment.
The study makes three main contributions: (1) the first multi-path zero-shot model that combines ViT, diffusion models, and CLIP for industrial anomaly detection; (2) technical contributions including hierarchical feature extraction, ViT-conditioned diffusion, and patch-level CLIP analysis; and (3) an extensive evaluation showing that the model is competitive with supervised methods while, as a zero-shot approach, requiring no category-specific training data. The zero-shot capability offers significant practical advantages for manufacturers deploying quality control systems across numerous product variants, reducing data collection costs and enabling immediate deployment to new production lines.
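As an illustration of the memory-bank idea behind Path A, the following minimal sketch scores patches by their distance to the k nearest entries in a bank of reference features. It is not the thesis implementation. The function names (`knn_anomaly_scores`, `image_score`), the concatenation of block-5 and block-11 features, the value of k, the max-over-patches image score, and the random stand-in features are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def knn_anomaly_scores(patch_features, memory_bank, k=3):
    """Score each patch by its mean Euclidean distance to its k nearest bank entries."""
    # Pairwise distances, shape (num_patches, bank_size).
    diffs = patch_features[:, None, :] - memory_bank[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    k = min(k, memory_bank.shape[0])
    # np.partition places the k smallest distances (unordered) in the first k columns.
    nearest = np.partition(dists, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)


def image_score(patch_scores):
    """Assumed aggregation: the image-level score is the largest patch score."""
    return float(patch_scores.max())


# Random vectors stand in for ViT patch features (hypothetical data).
rng = np.random.default_rng(0)
dim = 8
bank_block5 = rng.normal(size=(200, dim))
bank_block11 = rng.normal(size=(200, dim))
# Assumption: hierarchical features are combined by concatenation.
memory_bank = np.concatenate([bank_block5, bank_block11], axis=1)

# "Normal" patches lie close to bank entries; one patch is shifted far away.
normal_patches = memory_bank[:16] + 0.01 * rng.normal(size=(16, 2 * dim))
anomalous_patches = normal_patches.copy()
anomalous_patches[3] += 5.0

normal_scores = knn_anomaly_scores(normal_patches, memory_bank)
anomalous_scores = knn_anomaly_scores(anomalous_patches, memory_bank)
normal_image_score = image_score(normal_scores)
anomalous_image_score = image_score(anomalous_scores)

print("normal image score:", normal_image_score)
print("anomalous image score:", anomalous_image_score)
print("most anomalous patch index:", int(anomalous_scores.argmax()))
```

In this toy setup the shifted patch receives the largest score, so the per-patch scores also serve as a coarse localization map. A practical system would typically replace the brute-force distance computation with an approximate nearest-neighbor index.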