AN AI-ENABLED MICROSERVICES ARCHITECTURE FOR REAL-TIME MACHINE LEARNING INFERENCE IN CLOUD-NATIVE SYSTEMS
Keywords:
Microservices, Cloud-Native, Machine Learning Inference, Kubernetes, Real-Time AI, Model Serving, Scalability
Synopsis
Purpose: This paper proposes a cloud-native architecture that integrates AI-enabled microservices to facilitate real-time machine learning inference with scalable and efficient deployment mechanisms.
Design/methodology/approach: A hybrid microservices design was implemented, combining container orchestration (Kubernetes) with AI inference runtimes such as NVIDIA TensorRT and ONNX Runtime. Diagrams illustrate the end-to-end architecture, while tables compare latency and resource efficiency across deployments.
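To make the serving layer concrete, the sketch below shows one way such an inference microservice could look: a single HTTP endpoint wrapping an ONNX Runtime session, suitable for packaging into a container image and replicating behind a Kubernetes Service. The web framework (FastAPI), model path, and endpoint route are illustrative assumptions, not details taken from the paper.

    import numpy as np
    import onnxruntime as ort
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    # Load the ONNX model once at container start; path is an assumption.
    session = ort.InferenceSession("model.onnx")
    input_name = session.get_inputs()[0].name

    class PredictRequest(BaseModel):
        features: list[float]  # one flat feature vector per request

    @app.post("/predict")
    def predict(req: PredictRequest):
        # Shape the request into a (1, n) float32 batch and run inference.
        x = np.asarray(req.features, dtype=np.float32)[np.newaxis, :]
        outputs = session.run(None, {input_name: x})
        return {"prediction": outputs[0].tolist()}

Served with a standard ASGI server (e.g., uvicorn), replicas of this container can be scaled horizontally while the orchestrator handles routing, health checks, and restarts.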
Findings: Microservices-based deployment improves performance, reduces inference latency, and enables model versioning and rollback. Containerization further ensures portability across environments.
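As a hedged illustration of the versioning-and-rollback finding, the sketch below keeps several model versions loaded side by side and switches the served version atomically; rollback is simply re-promoting a previously registered version. The registry class, version labels, and file layout are hypothetical, not the paper's implementation.

    import onnxruntime as ort

    class ModelRegistry:
        """Holds several loaded model versions; one is marked active."""

        def __init__(self):
            self._sessions = {}  # version label -> ort.InferenceSession
            self._active = None

        def register(self, version, path):
            # Load up front so promotion and rollback are instantaneous.
            self._sessions[version] = ort.InferenceSession(path)

        def promote(self, version):
            if version not in self._sessions:
                raise KeyError(f"unknown model version: {version}")
            self._active = version

        def predict(self, feeds):
            # Serve with whichever version is currently active.
            return self._sessions[self._active].run(None, feeds)

    registry = ModelRegistry()
    registry.register("v1", "models/v1/model.onnx")
    registry.register("v2", "models/v2/model.onnx")
    registry.promote("v2")  # cut over to the new model
    registry.promote("v1")  # rollback: re-promote the previous version

In a Kubernetes setting the same effect is typically achieved with rolling updates and rollbacks at the Deployment level; the in-process registry above simply makes the mechanism explicit.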
Practical implications: The proposed architecture can be applied across IoT systems, autonomous monitoring, and financial analytics platforms where low-latency inference is critical.
Originality/value: The novelty lies in integrating AI inference serving into a microservices pipeline while satisfying real-time latency constraints and supporting elastic scaling in cloud-native environments.
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.