A Comprehensive Survey on AIOps for Alert Storm Management in Microservices: Understanding, Techniques, and Metrics

Date

2026-1

Type

Conference paper

Conference title

Author(s)

Manar Arif

Pages

314 - 331

Abstract

Artificial Intelligence for IT Operations (AIOps) has emerged as a critical paradigm for maintaining the high availability and reliability of large-scale distributed systems. As modern service architectures evolve alongside advancements in IT infrastructure, their growing complexity, marked by escalating system scales and intricate dependencies among components, has intensified challenges in operational management, particularly in addressing alert storms. Despite the proliferation of AIOps methodologies, a systematic and comprehensive survey dedicated to analyzing alert storm phenomena remains absent in the literature. To bridge this gap, this study presents a novel survey that rigorously examines alert storm identification, characterization, and summarization within AIOps-driven systems. Through a structured review, our work contributes threefold to the field. First, we synthesize foundational research on microservice architectures, incident management, and alert handling, establishing a cohesive framework for understanding AIOps’ role in mitigating operational disruptions. Second, we classify existing methodologies into four distinct categories based on their technical approaches, elucidating their strengths and limitations. Finally, we evaluate performance metrics employed in alert storm management, offering insights into their applicability and efficacy. This survey not only consolidates critical knowledge for researchers and practitioners but also highlights future directions for advancing AIOps in complex, dependency-rich environments.
