Managing Data Lakes Across a Hybrid Cloud Environment
Not long ago, experts were still debating whether the cloud was a fad or the real thing. The verdict is in: having realized that Hadoop isn't the be-all and end-all, more data-driven organizations have concluded that the benefits of cloud computing are quite real.
Today, cloud service providers manage the elasticity and administration of the Hadoop cluster, so you can simply focus on analyzing your data. But this doesn't mean all your problems are solved.
The simplicity and low cost of cloud computing make it possible for business teams to spin up their own cloud deployments, so organizations often find themselves with a proliferation of cloud instances, sometimes on different cloud platforms. Furthermore, many organizations keep a lot of data in legacy systems on premises.
Since many of these applications and systems were developed using technology that doesn't work in the cloud, moving all that data to the cloud isn't always feasible. It's a long process fraught with high cost, huge disruption due to application downtime, and potential data loss. Consequently, most enterprises leave legacy data on premises and adopt a cloud-first policy for new projects.
Does the hybrid cloud approach, in which data is distributed across multiple environments, create more complexity in terms of governance and access? Separating data in this way does make it harder to govern it in a unified fashion and to consolidate it for analytics. Enterprises can build a catalog to create a logical data lake where analysts can find the data they need regardless of where it is stored and provision it to the cloud as needed. In such an environment, some data (e.g., IoT) can be kept only in the cloud while other data (e.g., transactional) can be kept on premises and uploaded to the cloud for analysis as needed. This unified model for analytics brings together the best of all worlds.
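To make the idea of a logical data lake concrete, here is a minimal Python sketch. The names (CatalogEntry, LogicalDataLake, provision_to_cloud) and the storage paths are hypothetical, not part of any particular catalog product; the point is simply that the catalog records where each data set lives and can move an on-premises data set to the cloud when analysts need it:

    from dataclasses import dataclass, field
    from enum import Enum

    class Location(Enum):
        ON_PREM = "on_prem"   # e.g., legacy HDFS or relational systems
        CLOUD = "cloud"       # e.g., an S3-backed data lake

    @dataclass
    class CatalogEntry:
        """One logical data set, wherever it physically lives."""
        name: str
        location: Location
        uri: str                                # physical path or connection string
        tags: set = field(default_factory=set)  # business labels assigned by discovery

    class LogicalDataLake:
        """A catalog that lets analysts find data by name or tag, not by system."""
        def __init__(self):
            self._entries = {}

        def register(self, entry):
            self._entries[entry.name] = entry

        def find(self, tag):
            return [e for e in self._entries.values() if tag in e.tags]

        def provision_to_cloud(self, name):
            """Record (and, in a real system, trigger) a copy of an on-premises data set to the cloud."""
            entry = self._entries[name]
            if entry.location is Location.ON_PREM:
                entry = CatalogEntry(entry.name, Location.CLOUD,
                                     "s3://analytics-lake/" + entry.name, entry.tags)
                self._entries[name] = entry
            return entry

    # Example: transactional data stays on premises until analysis requires it.
    lake = LogicalDataLake()
    lake.register(CatalogEntry("orders", Location.ON_PREM,
                               "jdbc:oracle://legacy/orders", {"transactional"}))
    lake.register(CatalogEntry("sensor_readings", Location.CLOUD,
                               "s3://iot-lake/sensor_readings", {"iot"}))
    print(lake.provision_to_cloud("orders").uri)   # -> s3://analytics-lake/orders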
Given the scope and complexity of this endeavor, the catalog must be driven in large part by automation, with some level of human interaction to ensure ongoing accuracy. With data coming in from multiple sources, the same data items are often labeled differently with cryptic technical field names, if they're labeled at all. Checking for this manually is impossible, since most enterprises have millions of data sets and hundreds of millions of fields. Yet the work cannot be skipped: without knowing what's in each field, decision makers cannot leverage the data effectively, and the governance team cannot ensure it's used in a compliant manner.
The automated catalog provides a single point of governance to ensure data is findable, understandable, tracked, trusted, and compliant. To do so, it must inspect the files and data themselves, crawling Hadoop systems and object stores like S3 to look for new or changed files. The catalog then needs to parse each file to determine its schema and profile it to collect statistics on each field. Next, the catalog needs to analyze relational databases and process changes to the tables to keep the statistics up to date.
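As an illustration of this crawl-parse-profile loop, here is a minimal Python sketch that uses a local directory to stand in for an object store; the function names and the /data/landing path are assumptions for the example. It detects new or changed files by modification time, treats a CSV header as the schema, and collects simple per-field statistics:

    import csv
    import os
    from collections import Counter

    def crawl(root, seen):
        """Return files under `root` that are new or changed since the last crawl."""
        changed = []
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                mtime = os.path.getmtime(path)
                if seen.get(path) != mtime:   # new file, or modified since last pass
                    seen[path] = mtime
                    changed.append(path)
        return changed

    def profile_csv(path):
        """Infer a CSV file's schema from its header and collect per-field statistics."""
        stats = {}
        with open(path, newline="") as f:
            reader = csv.DictReader(f)        # header row becomes the schema
            for col in reader.fieldnames or []:
                stats[col] = {"count": 0, "nulls": 0, "values": Counter()}
            for row in reader:
                for col, value in row.items():
                    if col not in stats:      # ignore ragged rows
                        continue
                    stats[col]["count"] += 1
                    if value in ("", None):
                        stats[col]["nulls"] += 1
                    else:
                        stats[col]["values"][value] += 1
        return stats

    # Example: one incremental crawl-and-profile pass over a landing directory.
    seen_files = {}
    for path in crawl("/data/landing", seen_files):
        if path.endswith(".csv"):
            print(path, {col: s["count"] for col, s in profile_csv(path).items()})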
These statistics give analysts an overview of what's in each data set and an idea of whether it would be useful. They are also leveraged for automated discovery, helping classify each field automatically and assign it an appropriate business label or tag. Such tags can be used to apply appropriate access policies and data quality rules as well as to manage compliance with governmental and internal regulations such as GDPR and HIPAA. In addition to analyzing each data set and keeping the analysis up to date, the catalog must capture each data set's lineage across the hybrid enterprise, either by importing it from ETL tools and data platform auditing tools or by inferring it from the statistical information collected on the data sets. This lineage information also helps analysts understand and trust a data set and is relevant to regulatory compliance.
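As a rough sketch of how statistics-driven tagging might work, the following Python example assigns a business tag when most sampled values in a field match a pattern. The classifier names, regular expressions, and threshold are illustrative assumptions, not a description of any specific product's discovery logic:

    import re

    # Illustrative classifiers: a business tag and a predicate over sample values.
    CLASSIFIERS = {
        "email":  lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
        "us_ssn": lambda v: re.fullmatch(r"\d{3}-\d{2}-\d{4}", v) is not None,
    }

    # Tags that trigger governance actions, e.g. restricted access under GDPR or HIPAA.
    SENSITIVE_TAGS = {"email", "us_ssn"}

    def classify_field(sample_values, threshold=0.9):
        """Assign business tags to a field when most sampled values match a pattern."""
        values = [v for v in sample_values if v]
        tags = set()
        if not values:
            return tags
        for tag, matches in CLASSIFIERS.items():
            hits = sum(1 for v in values if matches(v))
            if hits / len(values) >= threshold:
                tags.add(tag)
        return tags

    # Example: tag a field profiled by the crawler, then flag it for access policies.
    tags = classify_field(["alice@example.com", "bob@example.com", ""])
    if tags & SENSITIVE_TAGS:
        print("apply restricted-access policy:", tags)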
With automated cataloging in place, the enterprise can leverage the benefits of a hybrid cloud environment while being able to find, understand, and govern its disparate data in a centralized and uniform fashion. Organizations can quickly meet GDPR and other regulatory requirements while keeping their data in a permanent state of compliance. Self-service analytics becomes a reality, enabling the people who immediately understand the potential value of certain data to actually put that data to use. And, ultimately, enterprises can achieve greater agility, innovation, competitiveness, and revenue.
This, after all, is what big data has been promising for years.
Alex Gorelik is CEO of Waterline Data (www.waterlinedata.com).