Abstract
The integration of Vision Transformers (ViTs) into medical imaging has significantly improved the diagnostic accuracy of breast ultrasound (BUS) analysis by capturing global semantic context. However, the high computational complexity of these models renders them unsuitable for Point-of-Care (POC) applications, where portable ultrasound devices rely on low-power, edge-computing hardware. This study proposes a novel cross-architecture knowledge distillation framework designed to bridge the gap between high-performance diagnostics and real-time efficiency. We distill the structural knowledge of a computationally heavy hybrid ViT-ConvNeXt teacher into an ultra-lightweight MobileNet-V3 student. By leveraging soft-target supervision, the student model inherits the global reasoning capabilities of the transformer while retaining the inductive bias and speed of a CNN. Experimental validation on an independent test set of the BUSI dataset demonstrates that the distilled student achieves a diagnostic accuracy of 95.06%, effectively matching the teacher model. Crucially, the student model reduces the storage footprint by 74x (from 438.8 MB to 5.9 MB) and accelerates inference by 15x, achieving a processing rate of 61.46 frames per second (FPS) on a standard CPU. These results confirm that the proposed framework satisfies the latency requirements for real-time video analysis, enabling the deployment of specialist-level cancer detection on handheld, battery-powered ultrasound devices without the need for cloud connectivity or GPU acceleration.
