Autonomous Data Quality Management via ML in Cloud Warehouses

Aditi Namdeo

Abstract

Given the growing volume, velocity and variety of data, companies need some degree of Autonomous Data Quality Management (ADQM) in their cloud data warehouses. When data is deployed to cloud or distributed environments that change frequently, defining data quality rules per subject area is not sufficient to guarantee reliable data. Defining them per schema does not help either, because there can be many schemas, themselves distributed, with different attributes or patterns. Traditional rule-based data quality approaches, which used to be effective, become ineffective when schemas and data patterns change often and rapidly in such environments; for example, records can arrive with missing values or be duplicated during quality-improvement processing. To address data quality problems in real time, this article proposes a machine learning (ML) based automatic data quality management mechanism that locates, categorizes and rectifies data quality problems inside cloud warehouses. The framework comprises four layers: an ingestion layer that supports a variety of data types, from structured data such as relational tables to semi-structured and unstructured data such as JSON and text files; a profiling layer that computes data statistics; a detection layer that uses anomaly detection models, supervised classifiers and clustering models to find inconsistencies; and a remediation layer built on reinforcement-learning feedback loops that automatically corrects data through imputation, deduplication and schema alignment. The framework is deployed on scalable cloud warehouse systems, which makes monitoring and adaptation to change possible. Experimental evaluations show that the system detects data quality problems more accurately and requires less human effort than conventional ETL-based quality systems.
The goal of the proposed solution is to increase trust in the data, lower operational costs and provide trusted analytics in enterprise cloud environments. Future work includes integrating large language models (LLMs) into the framework for semantic data quality (DQ) reasoning, adopting domain transfer learning to handle different cloud application scenarios, ensuring the privacy and security of the DQ process in multi-tenant cloud warehouses, introducing federated learning techniques, and making the DQ process more explainable and adaptable to realistic enterprise requirements and use cases at scale in production multi-tenant cloud settings.
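
As a rough illustration of the profile, detect and correct flow described in the abstract, the following minimal Python sketch may help. It is not the article's implementation: it swaps in simple robust statistics (median and median absolute deviation) for the ML models, omits the reinforcement-learning feedback loop and schema alignment, and all function names, the `amount` column and the threshold of 3.5 are assumptions chosen for illustration.

```python
from statistics import median

def profile(rows, column):
    """Profiling step: basic statistics for one numeric column."""
    values = [r[column] for r in rows if r[column] is not None]
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    return {"missing": len(rows) - len(values), "median": med, "mad": mad}

def detect_anomalies(rows, column, stats, threshold=3.5):
    """Detection step: return indices of rows whose modified z-score
    exceeds the threshold (a stand-in for a learned anomaly detector)."""
    if stats["mad"] == 0:
        return []  # no spread to measure against
    flagged = []
    for i, r in enumerate(rows):
        v = r[column]
        if v is None:
            continue
        score = 0.6745 * abs(v - stats["median"]) / stats["mad"]
        if score > threshold:
            flagged.append(i)
    return flagged

def remediate(rows, column, stats):
    """Remediation step: drop exact duplicate rows, then impute
    missing values in the column with the profiled median."""
    seen, cleaned = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        fixed = dict(r)
        if fixed[column] is None:
            fixed[column] = stats["median"]
        cleaned.append(fixed)
    return cleaned

# Hypothetical sample data: one duplicate, one missing value, one outlier.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 12.0},
    {"id": 2, "amount": 12.0},
    {"id": 3, "amount": None},
    {"id": 4, "amount": 11.0},
    {"id": 5, "amount": 500.0},
]
stats = profile(rows, "amount")
suspect_indices = detect_anomalies(rows, "amount", stats)
cleaned = remediate(rows, "amount", stats)
```

In this sketch the flagged rows are only reported, not altered; in a framework like the one proposed, deciding whether to correct, quarantine or accept a flagged record would presumably be the job of the feedback loop.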
