Integrating Structured and Unstructured Data for Holistic Insights in Data Science Workflows

Authors

Afreyea Dzidzorli
NLP Specialist, Ghana.
Ohemaa Shauntee
Advanced Analytics Specialist, Ghana.

Keywords

Structured data, Unstructured data, Data integration, Data science workflow, Hybrid analytics

Synopsis

Data science workflows increasingly need to integrate heterogeneous sources, specifically structured and unstructured data, to generate comprehensive insights. Structured data, typified by relational database tables, is normalized and directly queryable, while unstructured data such as text, images, and logs adds contextual depth. This paper examines integration methods, their benefits and challenges, and practical workflows. We present integration frameworks and modeling approaches, and report results on a synthetic dataset that combines relational records with text sentiment data. The findings indicate that integrated data improves predictive performance and contextual understanding compared with siloed analyses.

IJDSE

Published

March 11, 2022