Microservices-Based Model Serving Strategies for High-Throughput and Low-Latency AI Applications

Authors

David Boud Weber
Senior Cloud-Native AI Platform Engineer, France

Keywords:

Microservices, AI Model Serving, Latency Optimization, High Throughput, Kubernetes, Model Deployment, Cloud-native AI

Synopsis

Purpose: This study explores the implementation of microservices architectures in AI model serving to achieve low latency and high throughput, addressing critical requirements of real-time AI systems in domains like healthcare, finance, and IoT.

Design/methodology/approach: We adopt a comparative and synthesis-based approach, analyzing existing architectures and tools used for microservices-based model serving, and propose an optimized deployment pipeline using container orchestration and inference optimization.

Findings: Microservices improve fault isolation, scalability, and deployment agility in model serving. However, managing latency requires architectural trade-offs, such as choosing between batched and per-request inference, balancing GPU utilization against queueing delay, and tuning service mesh overhead.
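The batching-versus-per-request trade-off mentioned above can be sketched as a simple dynamic batcher: requests queue briefly so the accelerator sees larger batches, bounded by a latency budget. This is an illustrative sketch, not an implementation from the paper; the class and parameter names are hypothetical.

```python
import time
from collections import deque


class DynamicBatcher:
    """Collects individual inference requests into batches, trading a small
    queueing delay for higher throughput on batch-friendly hardware (GPUs)."""

    def __init__(self, max_batch_size=8, max_wait_ms=5.0):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_ms / 1000.0
        self.queue = deque()  # (arrival_time, request) pairs

    def submit(self, request):
        """Enqueue a single inference request with its arrival timestamp."""
        self.queue.append((time.monotonic(), request))

    def next_batch(self):
        """Return a batch when it is full, or when the oldest request has
        exceeded the latency budget; otherwise return None (keep waiting)."""
        if not self.queue:
            return None
        full = len(self.queue) >= self.max_batch_size
        expired = time.monotonic() - self.queue[0][0] >= self.max_wait_s
        if not (full or expired):
            return None
        n = min(self.max_batch_size, len(self.queue))
        return [self.queue.popleft()[1] for _ in range(n)]
```

The `max_wait_ms` budget is the knob that trades tail latency for throughput: a larger value yields fuller batches at the cost of added per-request delay.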

Practical implications: Organizations deploying AI services in production can use modular microservices to reduce downtime, streamline model rollouts, and maintain real-time responsiveness with orchestration frameworks like Kubernetes and Knative.
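As one concrete illustration of the orchestration approach above, a model server can be deployed as a Knative Service that autoscales on in-flight request concurrency. This is a minimal sketch; the service name, image, and concurrency target are hypothetical placeholders.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: model-server            # hypothetical service name
spec:
  template:
    metadata:
      annotations:
        # Scale replicas based on in-flight requests per pod.
        autoscaling.knative.dev/target: "10"
    spec:
      containers:
        - image: registry.example.com/model-server:latest  # hypothetical image
          ports:
            - containerPort: 8080
```

Concurrency-based autoscaling lets the platform add replicas under load bursts and scale back when idle, which is what underpins the rollout and responsiveness benefits claimed above.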

Originality/value: This paper consolidates research insights into a modern reference architecture for serving AI models in high-throughput and low-latency environments, offering a blueprint for cloud-native AI deployment.

References

(1) Chung, E., Fowers, J., Ovtcharov, K., & Papamichael, M. (2018). Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro. https://ieeexplore.ieee.org/document/8344479

(2) Fowers, J., Ovtcharov, K., Papamichael, M., et al. (2018). A Configurable Cloud-Scale DNN Processor for Real-Time AI. ACM/IEEE ISCA. https://ieeexplore.ieee.org/document/8416814

(3) Gummadi, V. P. K. (2019). Microservices architecture with APIs: Design, implementation, and MuleSoft integration. Journal of Electrical Systems, 15(4), 130–134. https://doi.org/10.52783/jes.9328

(4) Ishakian, V., & Muthusamy, V. (2018). Serving Deep Learning Models in a Serverless Platform. IEEE IC2E. https://arxiv.org/pdf/1710.08460

(5) Kaul, D. (2019). Blockchain-Powered Cyber-Resilient Microservices. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5096255

(6) Kannan, R. S., Subramanian, L., Raju, A., & Ahn, J. (2019). Grandslam: Guaranteeing SLAs for Jobs in Microservices Execution Frameworks. ACM SoCC. https://dl.acm.org/doi/10.1145/3302424.3303958

(7) Kumar, T. V. (2018). Event-Driven App Design for High-Concurrency Microservices. PhilPapers. https://philpapers.org/rec/VAREAD-5

(8) Liu, B. (2019). Study and Benchmarking of AI Model Serving Systems on Edge and Cloud. Aalto University. https://aaltodoc.aalto.fi/items/8736139c

(9) Liu, M., Peter, S., Krishnamurthy, A., et al. (2019). E3: Energy-Efficient Microservices on SmartNIC-Accelerated Servers. USENIX ATC. https://www.usenix.org/conference/atc19/presentation/liu-ming

(10) Miguel, L. B., Takabayashi, D., & Pizani, J. R. (2018). Marvin - Open Source Artificial Intelligence Platform. PMLR. http://proceedings.mlr.press/v82/miguel18a.html

(11) Rasi, N. (2018). TensorFlow Microservices with Quality of Service Guarantees. Politecnico di Milano. https://www.politesi.polimi.it/handle/10589/152321

Published

May 14, 2021