FAULT-TOLERANT MACHINE LEARNING SYSTEMS: LEVERAGING MICROSERVICES FOR RESILIENT AI APPLICATIONS

Authors

Miguel De Rafael
Artificial Intelligence / DevOps, Australia.

Keywords:

Microservices, Fault Tolerance, Machine Learning, Resilient AI, Container Orchestration, AI Reliability

Synopsis

Purpose: This paper investigates how microservice architectures can be used to build fault-tolerant machine learning (ML) systems for resilient AI applications.

Design/methodology/approach: This study reviews literature across distributed systems, microservices, and fault-tolerant ML, and integrates architectural diagrams and empirical tables to frame a robust system architecture for resilient AI deployment.

Findings: Microservices improve modularity and isolate failures, which enables graceful degradation and service recovery. Container orchestration and stateful fault detection models further improve reliability.
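
To illustrate failure isolation and graceful degradation, the following is a minimal sketch of a circuit breaker placed in front of a model-inference service. It is not taken from the paper: the class name `CircuitBreaker`, its parameters (`max_failures`, `reset_after`, `clock`), and the `primary`/`fallback` callables are all hypothetical names chosen for illustration, and a production system would more likely rely on an existing resilience library or a service mesh.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Hypothetical helper: stops calling a failing service for a cool-down period."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable clock, eases testing
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """Return True while calls to the primary service should be skipped."""
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: permit a trial call (half-open state).
            # One more failure will reopen the breaker immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return False
        return True

    def call(
        self,
        primary: Callable[..., Any],
        fallback: Callable[..., Any],
        *args: Any,
    ) -> Any:
        """Call primary; degrade to fallback on failure or while open."""
        if self.is_open():
            return fallback(*args)
        try:
            result = primary(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback(*args)
        self.failures = 0  # a success closes the breaker fully
        return result
```

In this sketch, `primary` would be a call to a remote inference microservice and `fallback` might return a cached prediction or a simpler local model, so a failing service degrades the response quality rather than failing the whole request.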

Practical implications: Engineers and AI architects can apply these insights to build scalable, dependable ML solutions in high-availability environments such as healthcare, finance, and autonomous systems.

Originality/value: This paper consolidates fragmented insights from several disciplines into a unified perspective on fault-tolerant AI system design, supported by structured diagrams and implementation metrics.

Published

April 19, 2021