What should a Data Analyst know about Data Warehousing? Part 2:


Architecture for Data Warehousing

Data Warehouse Environment.
Data Warehousing for Data Analyst Part2
Data Warehouse Environment

This is the continuation. The previous part was describing the Architecture of Data. Next, we discuss the data warehousing architecture and data warehousing requirements in terms of data analysis.

Data analysis and data warehousing greatly complement each other. First, data is what data analysis starts with, and second, there isn’t much point in building a warehouse if you are not going to analyze the data.

A data warehouse underlies decision support processing that provides consistent views of the business across different organizational units and time. It enables reconciling diverse reports because they have the same underlying source and definitions. 

Decision support processing serves the needs of management in making more informed decisions, which will lead to better results. 

A well-organized data warehouse environment requires an architecture that begins by looking at the whole and then drills-down to the details. Details are only important when viewing in a wider context. The major components of the architecture are:

  • Source Systems 
  • Extraction, Transformation, and Load tools 
  • Central Repository 
  • Metadata Repository 
  • Data Marts 
  • Analytic Sandbox 
  • End-User Community 

Data originates in the source systems and flows through the components of the data warehouse environment. Typically, a significant transformation of data occurs when moving from the operational level to the data warehouse level. Finally, it should deliver valuable information to the end-user community. 

This infrastructure comprises hardware, software, and networks and surely must be robust to meet the growing data and processing requirements.

Source Systems

Source systems include operational systems and external data feeds. They are focused on operational efficiency, not on decision support. So, the data might be redundant, the same information represented multiple times or in different ways, information of interest may be gathered not as intended, and so on. 

One of the challenges in data warehousing is developing a consistent definition to be used throughout the organization. That is where metadata helps. However, gathering data in a consistent format is almost always the most expensive part of implementing a data warehousing solution in the organization. 

Other challenges offered by the source systems are related to getting access to the data because the different systems might be used in different parts of the organization, the software is highly customized and/or uses complicated file structure, sharing features are limited, and, to add more, systems may be geographically dispersed.

Extraction, Transformation, and Load

Extraction, transformation, and load (ETL) tools automate the process of converting, reformatting, integrating, mapping, and moving data from multiple operational sources to other environments. 

Traditional ETL process.
Traditional ETL process.

For example, the job of an ETL tool might be, in particular,  to convert the various representations into a common customer dimension, and to make sure that the same customer record from different systems is used to create or update a single customer record in the customer dimension. See the following Figure.

An ETL tool extracts data from multiple tables to create the customer dimension.
An ETL tool extracts data from multiple tables to create the customer dimension.
Source: The Enterprise Big Data Lake Delivering the Promise of Big Data and Data Science by Alex Gorelik.

Organizations use ETL tools to safely move their data from one system to another. Each end to end process of that kind is called a pipeline. One organization might have hundreds of ETL pipelines. New pipelines are created along with the new business requirements. 

ETL tools provide a graphical interface that helps end-users to understand what is happening to the data. These tools are able to verify data and spit out error reports. 

Existing ETL tools connect to most databases. These tools are so powerful and the ETL process is so complex that only under very unusual circumstances does building the custom ETL make sense for an organization. 

Central Repository 

Always consider data warehousing as a process. In terms of data warehousing, a process is something that can adjust to users’ needs as they are clarified and change over time.

Business by itself is a rather dynamic and evolving environment. In addition, the more users learn about data and about the business, the more changes and enhancements they will require. That means that a central repository that is unable to support continuous system updates in order to meet end-user needs will become a little-used system. 

The central repository is usually a relational database. Its key component is a data model that describes the structure of the data inside a database.

One of the critical design issues in the central repository is granularity. It affects the volume of data and the type of query that can be answered. The volume of data in a warehouse is traded off against the level of detail of a query. The lower the level of granularity, the more versatile the query that can be issued and vice versa.

Granularity.
Granularity.
Source: Building the Data Warehouse by W. H. Inmon.

One of the advantages of relational databases is their ability to run on scalable machines by taking advantage of  Massively Parallel Processing systems (MPPs).

MPPs that are referred to as grid computing behave like separate computers connected by a very high-speed network. Some nodes may be specialized for processing and have minimal disk storage, when others may be specialized for the storage and have lots of disk capacity.

Uniprocessor. A processing unit communicates to memory and disk over a local bus. The speed of the processor, bus, disk, and memory limits performance and scalability.
Uniprocessor. A processing unit communicates to memory and disk over a local bus. The speed of the processor, bus, disk, and memory limits performance and scalability.
The Massively Parallel Processor has a shared-nothing architecture. It introduces a switch that connects independent processor/memory/disk components.
The Massively Parallel Processor has a shared-nothing architecture. It introduces a switch that connects independent processor/memory/disk components.

Metadata Repository 

Metadata repository is a component that contains and manages metadata (data about data). It is a vital component of the data warehouse environment.

It will be very frustrating for you to try to solve a problem using archival data if the meaning of the contents of the analyzed fields is unclear. Metadata provides you tools for browsing through the contents of the data warehouse. 

A good metadata repository should contain the following:

  • Data warehouse structure description
  • The annotated logical data model explaining the entities and attributes, including valid values
  • Mapping from the operational environment to the source systems
  • The physical schema
  • Mapping from the logical model to the physical schema
  • Operational metadata
  • The algorithms used for summarization
  • Common views and formulas for accessing the data
  • Security and access control information
  • System performance data
  • Business metadata

A metadata repository should provide this information in a comprehensive way so that users can understand what data is contained in the data warehouse, as well as how to access and use it. 

Data Marts

A data mart is a specialized system that brings together the data specifically suited to the needs of a department or related application. The data mart of the marketing department will have one structure, the data mart of sales or a finance department will have another, and so on. 

The data in data marts are fed from a central repository and in some cases from external sources, for example in case of using geographic data and information. 

Data that is taken from a central repository is recast. Consider the creation of “convenience” fields. Suppose the following elements are found in the data warehouse environment: Gross sales price, Tax, Commission, Shipping charge. If you want to do an analysis on net sales, it is better to enter the data as a result of a calculation:

Net sale = Gross sale price – (Tax + Commission + Shipping charge)

This saves space as only one data element is stored instead of four, and resources because data should only be calculated once.

Data marts leverage dimensional concepts. The dimensional model is good for servicing the needs of only one particular set of users at the expense of all others.

Dimensional models implemented in relational database management systems are called star schemas. Dimensional models implemented in multidimensional database environments are called online analytical processing (OLAP) cubes

Star Schema and OLAP Cube.
Star Schema and OLAP Cube.
Source: The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling by Ralph Kimball and Margy Ross.

Data in an OLAP cube is stored and indexed using dimensional data technique. OLAP cube engine creates and manages precalculated summary tables. Therefore, OLAP provides superior query performance because of the advanced precalculations, indexing strategies, and other optimizations. 

As OLAP systems design the data structure keeping in mind users’ needs, it is easy for the end-users to drill down or up by adding or removing attributes from their analyses with the best performance without issuing new queries. 

Data marts play a very important role in the data warehouse environment since they usually have the largest number of users.

Data marts provide an information system with the structure that allows an organization to have very fast and flexible access to data, to slice and dice pre-summarized data any number of ways, and to explore the relationship between summary and detail data. Therefore data marts are often associated with reporting systems. 

Another important type of data mart is analytic sandboxes.

Analytic Sandboxes

Normal data access and heavy statistical processing do not mix well. When a heavy amount of statistical analysis hits the machine that is being used to service a data warehouse, normal processing might come to a dead halt.

Significant contention for machine resources, if happens frequently, becomes a problem. In such a case, it is wise to think of an analytic sandbox.

Analytic sandbox is an exploratory environment for ad hoc data analysis, where data analysts may experiment with data using their preferred languages and programming environments. It provides a good foundation for heavy data analysis. And once it is built, there will be no more contention problems with the data warehouse. 

You may want to use external data for your analysis, for example, to compare external results versus internal. It might be challenging to fit external data into a central repository due to the tight integration that is required. But it perfectly fits into the analytic sandbox. Therefore, external data can go into the analytic sandbox without passing through the central repository.

Analytic Sandbox

Analytic sandboxes are necessary for complex data structures that do not fit into relational databases, such as images, audio, video, and so on.

Analytic sandboxes are irreplaceable for ad hoc analysis on large volumes of data, for instance, web and telecommunications applications where data sources can grow by hundreds of millions or billions of transactions per day. 

Compared to standard data marts or reporting systems, analytic sandboxes can support more advanced and computationally intensive techniques such as Monte Carlo simulation or Bayesian models.

Analytic sandboxes leverage the power of vertical partitioning, in-memory processing, in-database analytics, cloud computing, MapReduce, and other advanced technologies for the best performance.

End-User Community

Users are the most important component in a data warehouse environment because without them it makes no sense. Different types of users have a different set of unique characteristics.

Analysts dig into data to discern patterns and create ad hoc reports. Analysts want to access as much data as possible. They use a wide range of statistical tools and create queries that might be enormous in size. They do place a heavy load on data warehouses and need fast access to consistent information.

Analysts’ main requirements on a data warehouse include a responsive system, consistent data, and the ability to drill down to the customer level and even to the transaction level detail. 

Business users are the most common data warehouse users. Their needs determine the architecture of the data warehouse, as well as the development of decision support applications. 

Business users leverage direct access to the data warehouse environment, having accurate, consistent, and easy-to-create reports. Their load on the data warehouse environment is predictable as they do the same activity on a routine basis, repeatedly running the same type of query. 

In most cases, they access data marts using powerful BI tools. 

Sometimes they might drill down to the central repository or benefit special ad hoc reports prepared by data analysts. 

Application developers have certain data warehouse requirements in order to develop stable and robust applications. They want to be sure that the applications are protected from changes in the structure of the data warehouse. Any updates should have a minimal impact. Also, developers need access to the metadata repository, which provides documentation on the structure of the data. The developers should be able to provide feedback on the structure of the data warehouse in order to improve the warehouse. 

Why is data warehousing so valuable for data analysts? 

The data warehouse contains a source of data needed for data analysis. This data is cleansed, integrated, and organized. This is exactly what data analysts might want. 

Data warehousing can be described as load-and-access processing. Data is integrated, transformed, and loaded into the central repository from the operational systems and external data feeds. After that, the data is accessed and analyzed there. If needed, corrections or adjustments can be made to the data warehouse. The meaning of the fields is well defined and accessible through metadata. Incorporating new data into the analysis is as simple as figuring out what data is available through metadata and retrieving it from the repository. Clean and easily accessed data allows data analysts to spend more time on analysis rather than moving data. 

Data warehouse environments must be able to manage massive amounts of data because they are optimized around a very different set of parameters and contain detailed data, summary data, and historical data gathered from a wide variety of sources. 

Large amounts of clean and consistent data in line with scalable hardware and software systems promote data analysis and boost its application. 

Other abilities of a good data warehouse environment should include:

  • easy data indexing and data monitoring
  • robust interface with a wide number of technologies 
  • parallel data processing
  • metadata control of the warehouse environment
  • support compound keys 
  • index-only processing 
  • recreation of accurate snapshots of history
  • ad hoc reporting

Such a data warehouse environment provides a good system for analyzing data, it enables hypothesis testing and simplifies efforts to measure the effects of taken actions. Measurement enables an organization to learn from its data and to draw valuable conclusions.

The importance of a data warehouse also lies in automating existing processes, such as putting reports online and providing existing applications with a clean source of data.

Online Analytic Processing (OLAP) enables users to summarize information along several dimensions at one time. It allows users to drill down to detailed data, provides visualization tools, and is a great way to distribute information to multiple end-users to meet their reporting needs.

Although building a data warehouse is challenging for any organization, if done correctly, it provides great rewards.

Data warehousing can greatly benefit data analysis efforts. By bringing the data together in one place and leveraging scalable hardware and software, data warehouses empower analysts to answer the most complex questions. The detailed, historical data found in the data warehouse can be examined in many ways revealing business patterns and business insights. It gives business users a better understanding of data and business. It supports their decisions and helps to significantly improve business processes.