The Four Essential Zones of a Healthcare Data Lake

January 16, 2018
Bryan Hinton

Senior Vice President and General Manager, DOS Platform Business

Article Summary

The role of a data lake in healthcare analytics is essential in that it creates broad data access and usability across the enterprise. It has symbiotic relationships with an enterprise data warehouse and a data operating system.

To avoid turning the data lake into a black lagoon, it should feature four specific zones that optimize the analytics experience for multiple user groups:

1. Raw data zone.
2. Refined data zone.
3. Trusted data zone.
4. Exploration zone.

Each zone is defined by the level of trust in the resident data, the data structure and future purpose, and the user type.

Understanding and creating zones in a data lake behooves leadership and management responsible for maximizing the return on this considerable investment of human, technical, and financial resources.

Health Catalyst has published articles describing early- and late-binding data warehouse architectures, comparing data lakes to data warehouses, and explaining how health systems can leverage unique data lake functions within their existing analytic platforms.

The evolving healthcare data environment created the need for data lakes, but they are a significant IT investment. Understanding the relationship between an enterprise data warehouse (EDW) and a data lake, as well as the structural components of a data lake (the zones), is fundamental to investing in the right technology with the appropriate financial and human resources.

Why a Data Lake Is Necessary

In healthcare today, outcomes improvement efforts are fueled by limited information, primarily healthcare encounter data (Figure 1).

Diagram showing how the human health data ecosystem is large, though we use very little of it

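To learn more, bring the picture into focus, and understand what truly affects outcomes, we need genomic and familial data, outcomes data, 24/7 biometric data, consumer data, and socioeconomic data. The complete data ecosystem needed for outcomes improvement at scale will increase the total volume of healthcare data tenfold. According to a 2014 report from International Data Corporation, the healthcare digital universe is growing 48 percent per year. In 2013, the industry generated 4.4 zettabytes (10^21 bytes) of data. By 2020, it will generate 44 zettabytes. Unfortunately, this data volume would explode the data warehouse of most organizations. Fortunately, a data lake can handle this volume.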

The Benefits of a Data Lake

The benefits of a data lake as a supplement to an EDW are numerous in terms of scale, schema, processing workloads, data accessibility, data complexity, and data usability:

  • A data lake, typically designed using Apache Hadoop, is the preferred choice for larger structured and unstructured datasets coming from multiple internal and external sources, such as radiology, physician notes, and claims. This removes data silos.
  • A data lake doesn’t demand definitions on the data it ingests. The data can be refined once the questions are known.
  • A data lake offers great flexibility on the tools and technology used to run queries. These benefits are instrumental to socializing data access and developing a data-driven culture across the organization.
  • A data lake is prepared for the future of healthcare data with the ability to integrate patient data from implanted monitors and wearable fitness devices.

The Data Lake’s Strength Leads to a Weakness

A data lake can scale to petabytes of both structured and unstructured data and can ingest data at a variety of speeds, from batch to real-time. Unfortunately, these capabilities have led to a negative side effect. Gartner’s hype cycle for 2017 shows that data lakes have passed the “peak of inflated expectations” and have started the slide into the “trough of disillusionment.” This isn’t surprising. Often, an industry develops a concept thinking it will solve world hunger, then learns its real-life limitations.


Understanding and creating zones within a data lake are the keys to draining the swamp.

The Four Zones of a Data Lake

Data lake zones form a structural governance to the assets in the data lake. To define zones, Zaloni excerpts content from the ebook, “Big Data: Data Science and Advanced Analytics.” The book’s authors write that “zones allow the logical and/or physical separation of data that keeps the environment secure, organized, and agile.” Zones are physically created through “exclusive servers or clusters,” or virtually created through “the deliberate structuring of directories and access privileges.”
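As a rough illustration of zones created virtually through "the deliberate structuring of directories and access privileges," the sketch below models each zone as a directory under one lake root with its own read and write roles. The root path `/datalake`, the role names, and the specific permissions are all hypothetical, chosen only to show the idea; a real deployment would enforce these rules in the storage layer rather than in application code.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical root directory for the data lake (not from the article).
LAKE_ROOT = PurePosixPath("/datalake")


@dataclass(frozen=True)
class ZoneLayout:
    """One zone: a directory plus the roles allowed to read and write it."""
    name: str
    readers: frozenset
    writers: frozenset

    @property
    def path(self) -> PurePosixPath:
        return LAKE_ROOT / self.name


# Role names and permissions are illustrative assumptions only.
ZONES = {
    zone.name: zone
    for zone in (
        ZoneLayout("raw", frozenset({"data_engineer"}), frozenset({"ingest_service"})),
        ZoneLayout("refined", frozenset({"data_engineer", "analyst"}), frozenset({"data_engineer"})),
        ZoneLayout("trusted", frozenset({"data_engineer", "analyst", "all_staff"}), frozenset({"data_engineer"})),
        ZoneLayout("exploration", frozenset({"analyst", "data_scientist"}), frozenset({"analyst", "data_scientist"})),
    )
}


def can_access(role: str, zone_name: str, mode: str = "read") -> bool:
    """Return True if the role may read (or write) the named zone."""
    zone = ZONES[zone_name]
    allowed = zone.readers if mode == "read" else zone.writers
    return role in allowed
```

For example, `can_access("analyst", "trusted")` is true under these assumed permissions, while `can_access("analyst", "raw", mode="write")` is false.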

Healthcare analytics architectures need a data lake to collect the sheer volume of raw data that comes in from the various transactional source systems used in healthcare (e.g., EMR data, billing data, costing data, ERP data, etc.). Data then populates into various zones within the data lake. To effectively allocate resources for building and managing the data lake, it helps to define each zone, understand their relationships with one another, know the types of data stored in each zone, and identify each zone’s typical user.

Data lakes are divided into four zones (Figure 2). Organizations may label these zones differently according to individual or industry preference, but their functions are essentially the same.

Visualization of data lake zones
Figure 2: Data lake zones.

The Raw Data Zone


The Trusted Data Zone

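Source data is ingested into the EDW and then used to build shared data marts in the trusted data zone. At this point, terminology is standardized (e.g., RxNorm, SNOMED, etc.). The trusted data zone holds data that serves as universal truth across the organization. A broader group of people has applied extensive governance to this data, which has more comprehensive definitions that the entire organization can stand behind. Trusted data could include building blocks such as the number of ED visits in a given period, inpatient admission rates per year, or the number of members in a risk-based contract.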

The Refined Data Zone


Refined data is used by a broad group of people but is not yet blessed by everyone in the organization. In other words, people beyond specific subject areas may not be able to derive meaning from refined data. A subject area mart (SAM) gets promoted to the trusted zone when the definitions applied to its data elements have been accepted by a much larger group of people.

The Exploration (Sandbox) Zone

Anyone can decide to move data from the raw, trusted, or refined zones into the exploration zone. Here, data from all of these zones can be morphed for private use. Once information has been vetted, it is promoted for broader use in the refined data zone.
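The movement rules described in this article can be summarized as a small set of allowed transitions. The sketch below is one hypothetical way to encode them: data may be copied from the raw, refined, or trusted zone into the exploration zone freely, while promotion from exploration to refined, or from refined to trusted, requires vetting. The names `ZoneName`, `ALLOWED_MOVES`, and `can_move` are illustrative and do not come from any product.

```python
from enum import Enum


class ZoneName(Enum):
    RAW = "raw"
    REFINED = "refined"
    TRUSTED = "trusted"
    EXPLORATION = "exploration"


# Maps (source, target) to whether the move requires vetting first.
ALLOWED_MOVES = {
    (ZoneName.RAW, ZoneName.EXPLORATION): False,
    (ZoneName.REFINED, ZoneName.EXPLORATION): False,
    (ZoneName.TRUSTED, ZoneName.EXPLORATION): False,
    # Promotions: the data must be vetted by the relevant group first.
    (ZoneName.EXPLORATION, ZoneName.REFINED): True,
    (ZoneName.REFINED, ZoneName.TRUSTED): True,
}


def can_move(source: ZoneName, target: ZoneName, vetted: bool = False) -> bool:
    """Return True if data may move from source to target."""
    if (source, target) not in ALLOWED_MOVES:
        return False
    needs_vetting = ALLOWED_MOVES[(source, target)]
    return vetted or not needs_vetting
```

Under this sketch, `can_move(ZoneName.RAW, ZoneName.EXPLORATION)` succeeds without vetting, while `can_move(ZoneName.EXPLORATION, ZoneName.REFINED)` succeeds only when `vetted=True`.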

Zones and Their Data Definitions

For an example of the data type in each zone, consider length of stay (LOS). There are dozens of ways to define LOS using ED presentation time, admit time, registration time, cut time, post-observation time, and discharge time. The clinical definition of LOS for an appendectomy may be from cut time to discharge time, but the corporate definition may be from admit time to discharge time. A SAM that focuses on appendectomy might choose to use the clinical definition, which doesn’t apply to the global definition (i.e., the definition in the trusted zone). For an individual SAM definition of LOS to be promoted to the trusted zone, it needs to be vetted through a broader group of people to confirm it has universal application.
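To make the difference between the two LOS definitions concrete, here is a minimal sketch that computes length of stay for one made-up appendectomy encounter under both. The field names, the timestamps, and the definition labels are assumptions for illustration, not a real schema.

```python
from datetime import datetime

# A made-up encounter; field names are hypothetical.
encounter = {
    "admit_time": datetime(2018, 1, 10, 8, 0),
    "cut_time": datetime(2018, 1, 10, 14, 0),
    "discharge_time": datetime(2018, 1, 11, 14, 0),
}

# Each definition names the start and end timestamps it uses.
LOS_DEFINITIONS = {
    "corporate": ("admit_time", "discharge_time"),
    "appendectomy_clinical": ("cut_time", "discharge_time"),
}


def length_of_stay_hours(enc: dict, definition: str) -> float:
    """Compute LOS in hours under the named definition."""
    start_field, end_field = LOS_DEFINITIONS[definition]
    delta = enc[end_field] - enc[start_field]
    return delta.total_seconds() / 3600


print(length_of_stay_hours(encounter, "corporate"))              # 30.0
print(length_of_stay_hours(encounter, "appendectomy_clinical"))  # 24.0
```

The same encounter yields 30 hours under the corporate definition and 24 hours under the clinical one, which is why a SAM-level definition needs broader vetting before it can stand in for the definition in the trusted zone.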

Directors who have financial responsibility over a single line of business may need to evaluate their department’s productivity. They may need to see things a certain way, such as excluding corporate overhead, over which they have no control. This is what makes the SAM more specific to one area. The data definition has been vetted and agreed to by a group of people, though it has yet to reach global agreement.

The Right Technology for the Right Zone

Different technology can run on top of different zones in a data lake. The data lake itself typically runs on Hadoop, which is optimal for handling huge data volumes. Relational databases such as SQL Server are more user-friendly and can serve data to a larger user base. SQL queries can run on top of Hadoop to produce data marts and SAMs in the trusted and refined zones.

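Hortonworks refers to a Connected Data Architecture, in which data pools need to ensure that connected data can flow freely to wherever it is most beneficial to the business, so that value can be derived from it. Zones may not use the same data technology. Most of the data will reside in the data lake, but the more refined zones may have portions of data that reside in a data warehouse or in smaller data marts.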


Data Lakes Are Integral to a Larger Operating System

Earlier we said that sheer data volume has turned data lakes into data swamps, a problem that is remedied through a larger healthcare analytics ecosystem. Part or all of a data operating system can be deployed on top of any healthcare data lake. The Health Catalyst® Data Operating System (DOS™) (Figure 3) can index, catalog, analyze, and provide insights from the terabytes and growing data assets in a health system: attributes that can provide IT departments, clinicians, population health managers, financial leaders, and health system leaders with the knowledge they need to produce massive outcomes improvements.

Diagram of DOS
Figure 3: The Health Catalyst Data Operating System.

DOS enables a data lake to be built with the required governance and meaning added to the data so it is easily organized into the appropriate zones. Data can then be used according to zone by the various data consumers in a health system. DOS also allows data to be analyzed and consumed by the Fabric Services layer to accelerate the development of innovative data-first applications.

The Future of Data Lakes

The volume of healthcare data is mushrooming, and data architectures need to get ahead of the growth. Vast volumes of data will continue to flow into the EDW.


To prevent data lakes from becoming mired in the petabytes of data now swamping healthcare, the new architecture presented by the data operating system offers a breakthrough in analytics engineering that can renew the life of a data lake and accommodate the big-bang growth of healthcare data.
