Fault Tolerant Distributed Systems Design Using Predictive Analytics and Self-Healing Mechanisms
Keywords:
Distributed Systems, Fault Tolerance, Predictive Analytics, Self-Healing, Resilience, Machine Learning, System Reliability
Synopsis
Distributed systems are foundational to modern computing, powering applications across industries. As these systems scale, component failures become more frequent, and fault tolerance becomes imperative. This paper explores the integration of predictive analytics and self-healing mechanisms to enhance fault tolerance in distributed systems. Predictive models can identify potential system failures in advance, allowing for preemptive mitigation. Self-healing systems autonomously recover from faults, minimizing downtime and human intervention. This work evaluates existing models and proposes a hybrid architecture that combines predictive insights with autonomous correction to support resilient, adaptive distributed computing.
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.