What Should a Data Analyst Know about Data Warehousing? Part 1:


Architecture of Data 

Levels of data abstraction
The higher the level of data abstraction, the smaller the data size.

Contemporary organizations generate and accumulate large amounts of data, which, however, does not mean accumulating valuable information to support business decisions. Data analysis is what is required to make sense of the data. 

The most time-consuming efforts behind data analysis are identifying, acquiring, understanding, and cleansing data. 

Data warehousing is a process of bringing together disparate data from throughout an organization and optimizing it for access and analysis. It includes data transformations, cleansing, filtering, aggregations, and has strict requirements for clean and consistent data. 

That’s why a data warehouse is a perfect mate for a data analyst on the way of transforming data into valuable information.

The following is not a complete coverage of the subject, but will definitely help you get started.

The architecture of data addresses how the data is organized and structured to support the development, maintenance, and use of the data. Its purpose is to create a blueprint that shows how the organization is going to process and store data. The data hierarchy and descriptions make it easy to navigate the data warehouse environment.

Consider the following levels of abstraction: 

Business Rules

Metadata 

Data Models

Summary Data 

Operational Data

The levels of data abstraction are important for data analysis. A well-designed system allows you to drill down through these levels in order to get the detailed data that support every summarisation. 

Operational / Transaction Data

Each time a transaction is registered in the operational system it is stored somewhere. In terms of data analysis, such data is a valuable raw material that best describes user behavior.

A robust data warehousing approach should consider the fact that the operational system or business process might and will change over time, but even in such cases, data need to be compatible.

Imagine a chain of supermarkets, or a telephone company, or a bank – it is easy to see that the volume of data from the operational system is enormous. 

Important information about a particular data warehouse environment that you should consider is how long raw operational data is kept and the granularity level of the data that is kept for a longer period.

Summary Data

Operational Summary Data is derived from transactions and plays the same role. Keep in mind the difference, which is that summaries are for a period of time and transactions represent the events. 

These summaries of transactions may result in other transactions, for example, a monthly telephone bill (a summary) and its payment (a transaction). 

Decision-Support Summary Data is the data used for making decisions about the business, for example, marketing or financial reports.

Operational system is optimized for the operations but not for the analysis, therefore it is not wise to use one system for all purposes. The data warehouses and data marts are used to provide a decision-support system at the customer level. And it can be difficult for data warehouses to provide consistent definitions and layouts so that similar reports give similar results, regardless of which business systems produce them or when they are created.

An interesting thing, that on one hand summaries may hide the information as they aggregate data together, but on the other hand, summaries may bring information to the surface, as some information can be seen only when the data is aggregated.

Data Models

No matter how the data is stored, there are three ways of describing the layout: 

  • high-level model (Entity Relationship Diagram), 
  • mid-level model (Data Item Set), 
  • low-level model (Physical Model).

The high-level data model (Entity Relationship Diagram) shows entities and relationships.

Entity Relationship Diagram (ERD) for the manufacturing environment
Entity Relationship Diagram for the manufacturing environment

What entities belong in the scope of the model are determined by the scope of integration. The scope of integration should be written in a language understandable to a business user and should clearly define the boundaries of the data model.

The scope of integration determines the proportion of the enterprise to be reflected in the data model
The scope of integration determines the proportion of the enterprise to be reflected in the data model

The following corporate entity relationship diagram is composed of many individual entity relationship diagrams that represent the different views of users across the organization.

Corporate ERD constructed from the users view ERDs

A mid-level model (Data Item Set) is created for each major subject area, or entity, in the high-level data model. 

The Relationship between the entity relationship diagram (ERD) and the Data Item Set (DIS).
The Relationship between the ERD and the DIS.
Each entity is defined by its own DIS.

The mid-level model contains four basic constructs:

  • Primary grouping of data — holds data attributes that exist only once for each major subject area;
  • Secondary grouping of data — holds data attributes that can exist multiple times for each major subject area;
  • Connector — relates data from one grouping to another. The convention used to indicate a connector is an underlining of a foreign key.
  • Type of data — is indicated by a line leading to the right of a grouping of data. The grouping of data to the left is the supertype. The grouping of data to the right is the subtype of data.
Four constructs of the mid-level data model

After a relationship is identified at the high-level, it is manifested by a pair of connector relationships at the mid-level. See an example in the following figure.

The relationships identified in the entity relationship diagram (ERD) are reflected by connectors in the data item set (DIS).
The relationships identified in the entity relationship diagram (ERD) are reflected by connectors in the data item set (DIS).
Note that only one connector — from acctno to customer — is shown in this diagram. In reality, another connector from customer to acctno would be shown elsewhere in the DIS for customer.

Source: Building the Data Warehouse by W. H. Inmon

Example of the data item set:

An expanded data item set (DIS) showing the different types of loans that a bank may support.
An expanded data item set (DIS) showing the different types of loans that a bank may support.

Source: Building the Data Warehouse by W. H. Inmon

The following figure illustrates the corporate data item set is made up of the data item sets created as a result of each user view session.

The corporate DIS created from multiple DISs.
The corporate DIS created from multiple DISs.

The low-level model (Physical Data Model) is created from the mid-level data model by extending the mid-level data model to include keys and physical characteristics of the model. After granularity and partitioning are taking into account, multiply of other physical design activities are embedded into the design, such as arrays of data, merging tables, selective redundancy, further separation of data, derived data, preformatting, preallocation, relationship artifacts, prejoining tables. Sometimes, the physical model of a database is really very complicated.

The most common way to store data is in a relational database. The relational approach starts with the organization of data into a table. Different columns are in each row of the table. 

A simple table.
A simple table.

Relational technology uses keys and foreign keys to establish relationships between rows of data.

A key-foreign key relationship.
A key-foreign key relationship.

You may see in the following figure how different tables are connected by means of a series of key-foreign key relationships.

A relational database design.
A relational database design.

Relational database design tries to create tables with minimum redundancy and the least possible number of fields. Such databases are called normalized. As data warehouses typically contain data from multiple sources and applications, each with its own schema, data from these sources has to be normalized to convert it to a single schema. 

Normalized relational databases are a powerful way of storing and accessing data. But the goal of their implementation is usually to support the rapid update of large volumes of data as well as to smooth processing of a huge number of transactions. 

As a data analyst, you might be interested in combining data together. Probably you may use many queries, which requires several joins, aggregations, and subqueries. That may lead to the situation of preventing transactions from accessing necessary tables to complete a request, and this will put you and other users on hold.

The other database design approach is called the multidimensional approach. At its center is the star schema. This schema consists of a set of dimension and fact tables.

A fact table is a structure that contains many occurrences of data. Surrounding the fact table are dimensions tables that represent the entities being analyzed. Consider sales where there may be a customer dimension table containing all customer attributes (name, address, etc.), a time dimension table (date, fiscal year, etc.), and a product dimension table (make, model, etc.). See the following figure.

Tables in a simple star schema.

Source: The Enterprise Big Data Lake Delivering the Promise of Big Data and Data Science by Alex Gorelik

As a rule, a star schema contains one fact table. However, more than one fact table can be combined in database design to create a snowflake structure.

A snowflake structure.
A snowflake structure.

In a snowflake structure, different fact tables are connected by means of sharing one or more common dimensions. Sometimes these shared dimensions are called conformed dimensions

Note that the main differences between the multidimensional and the relational structure is in terms of flexibility and performance. The relational model is highly flexible but is not optimized for performance for any user. The multidimensional model is highly efficient at servicing the needs of one user community, but it is not good at flexibility. 

The cause of the differences in the original shaping of the models themselves. The relational environment is shaped by the corporate data model. The star schema is shaped by the end-user requirements. The same is shown in the following figure.

The relational model is shaped from a data model. A star join is shaped from user requirements.

Despite performance limitations, since the relational model supports the reshaping of data, the relational model is good for indirect access to data. If you want to get the good performance necessary for data analysis from a relational model, it is optimal to extract data from the model and recast it into a form suitable for quick and efficient direct access to data.

Metadata

Metadata (also called data about data) is an important component of the data warehouse environment. It goes beyond the data model and acts as an index to the contents of the database. Metadata provides discipline for a data warehouse environment because all changes to the warehouse must be reflected in the metadata and communicated to the users. 

Typical items that metadata stores are: 

  • Structure of data 
  • The values legally allowed in each field 
  • A description of the contents of each field 
  • Source data feeding the data warehouse 
  • Transformation of data
  • Data model 
  • History of extracts

Metadata provides assistance for data analysts in tracking down and understanding the data. 

When you approach a data warehouse with no metadata, you do not know where to begin with. You need to investigate what data is available in the database and what isn’t. This will lead to a significant waste of time, and even after you will conclude the investigation, there is no guarantee that you will find the right data or correctly interpret the data encountered. 

Metadata, however, will allow you to quickly navigate through the data and avoid any time waste.

Metadata has great value when it comes to the issue of external data. It is through metadata that external data is registered, accessed, and controlled in the data warehouse environment. It encompasses Document ID, Date of entry into the warehouse, Description, Source, Classification, etc.

Business Rules

Business rules – the highest level of abstraction – describe why relationships exist and how they are applied. Business rules place some form of constraint on a specific aspect of the database, such as the elements in a field specification for a particular field or the characteristics of a given relationship.

Business rules are based on the way the organization perceives and uses its data, which derives from the manner in which the organization operates its business. 

For example, a dance studio decided to oblige all instructors to teach at least one class but to limit them, however, to no more than eight classes. This leads to establishing a new business rule which is: 

‘An instructor must teach one class, but no more than eight.’

The following figure illustrates how this business rule affects the relationship diagram for tables from a dance studio database: 

Before the new business rule is established:

Data Warehouse-Relationship Diagram (Dance Studio 01).

After the new business rule is established: 

Data Warehouse-Relationship Diagram (Dance Studio 02).

As each instructor must be associated with at least one class, a Restrict deletion rule is established as well:

Data Warehouse-Relationship Diagram (Dance Studio 03)
Source: Database Design for Mere Mortals by Michael J. Hernandez

Business rules have a close relationship with data analysis. Such a technique as decision trees produces explicit rules. Often these rules may already be known! 

For instance, a model of direct mail response that ends up targeting wealthy areas may possibly reflect the fact that the historical data used to build the model was focused only on those areas. 

Investing your time in exploring the architecture of data of a particular data warehouse environment is wise because this will reward you by eliminating the significant waste of time and effort during the data analysis. It will be much easier for you to understand the data, to navigate through it, you will get acquainted with the business processes of the organization, and, of course, you do not want to analyze what might already be known.

To be continued. The next part describes the Architecture for Data Warehousing.