Autonomous Data Quality Management via ML in Cloud Warehouses

Aditi Namdeo

doi:10.21590/ijhit.06.04.14

Published: 2024-12-27

Keywords:

Autonomous Data Quality, Machine Learning, Cloud Data Warehouses, Data Profiling, Anomaly Detection, Data Governance, Reinforcement Learning

Aditi Namdeo

Independent Researcher, Northeastern University, Boston, USA

Abstract

As the volume, velocity, and variety of data grow, organizations need a degree of Autonomous Data Quality Management (ADQM) in their cloud data warehouses. Traditional rule-based data quality approaches, once effective, break down when schemas and data patterns change frequently and rapidly in cloud and distributed environments: rules defined per subject area do not yield reliable data, and rules defined per schema must contend with many distributed schemas that differ in attributes and patterns. Data can, for example, arrive with missing values or duplicated records. To address data quality problems in real time, this paper proposes a machine learning (ML) based autonomous data quality management framework that detects, classifies, and corrects data quality problems inside cloud warehouses. The framework comprises four layers: an ingestion layer that supports a variety of data types, from structured data such as relational tables to semi-structured and unstructured data such as JSON and text files; a profiling layer that computes data statistics; a detection layer that combines anomaly detection models, supervised classifiers, and clustering models to find inconsistencies; and a remediation layer built on reinforcement-learning feedback loops that automatically corrects data through imputation, deduplication, and schema alignment. The framework is deployed on scalable cloud warehouse systems, which allows continuous monitoring and adaptation to change. Experimental evaluations show that the system detects quality problems more accurately and requires less human effort than conventional ETL-based quality systems.
The goal of the proposed solution is to increase trust in data, lower operational costs, and support trusted analytics in enterprise cloud environments. Future work includes integrating large language models (LLMs) into the framework for semantic data quality (DQ) reasoning, adopting domain transfer learning to handle varied cloud application scenarios, ensuring the privacy and security of the DQ process in multi-tenant cloud warehouses, introducing federated learning techniques, and making the DQ process more explainable and adaptable to realistic enterprise requirements at production scale.
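
The profile, detect, and correct flow described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`profile`, `detect_anomalies`, `impute`, `deduplicate`), the sample rows, and the threshold are hypothetical, and a simple robust-statistics detector (modified z-score) stands in for the ML models and reinforcement-learning feedback loop that the abstract mentions.

```python
from statistics import median


def profile(values):
    """Profile one numeric column: missing rate, median, and MAD.

    Assumes at least one non-missing value (None marks a missing value).
    """
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median([abs(v - med) for v in present])  # median absolute deviation
    return {
        "missing_rate": 1 - len(present) / len(values),
        "median": med,
        "mad": mad,
    }


def detect_anomalies(values, stats, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold.

    The 3.5 cutoff is a common rule of thumb, used here as an assumption.
    """
    if stats["mad"] == 0:
        return []  # no spread, so the score is undefined
    return [
        i
        for i, v in enumerate(values)
        if v is not None
        and 0.6745 * abs(v - stats["median"]) / stats["mad"] > threshold
    ]


def impute(values, stats):
    """Replace missing values with the column median."""
    return [stats["median"] if v is None else v for v in values]


def deduplicate(rows, key):
    """Keep the first row seen for each value of the key field."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Hypothetical sample data: one duplicated id, one missing amount, one outlier.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 12.0},
    {"id": 4, "amount": 11.0},
    {"id": 5, "amount": 500.0},
]

clean_rows = deduplicate(rows, key="id")
amounts = [r["amount"] for r in clean_rows]
stats = profile(amounts)
anomalies = detect_anomalies(amounts, stats)
repaired = impute(amounts, stats)
```

In a real deployment each step would run against warehouse tables rather than in-memory lists, and the choice of correction would be driven by learned feedback rather than a fixed median rule.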

Issue

Vol. 6 No. 04 (2024): International Journal of Humanities and Information Technology

Section

Articles

How to Cite

Autonomous Data Quality Management via ML in Cloud Warehouses. (2024). International Journal of Humanities and Information Technology, 6(04), 124-131. https://doi.org/10.21590/ijhit.06.04.14
