As a data architect, I am often amazed at how many with the same title really do not understand the core differences between dimensional model structures in my industry.
In fact, it is a rarity to find the data architect willing to be challenged without personalities coming into the mix. As as we all know, when you fight the person and not the problem, you end up with hurt feelings and resentments in the workplace. For an EDW, this can lead to people resigned and not willing to stand up for what they think, which leads to the ‘sheep’-like syndrome called ‘Group Think’, thus resulting in an EDW that is built off of emotion and not educated beliefs or best practices.
I thought I would take the time to explain some of the differences between dimensions to enable you, reader, to stand up for the correct approach when you are faced with a contended data model designed by you.
Remember, they are not attacking you, just the model. Look at it as a chance to grow and learn from others. Maybe, just maybe, having 4 or 6 eyes is better than just your 2 and your model, already attributed to you from a recognition perspective, will be that much better.
Here we go:
Slowly Changing Dimensions (SCD – of the Type II variety is where I become a Type II gal)
- What in the world is a SCD anyway?
- Ralph Kimball defined this his 1996 book as the following:
(Kimball, 1996), a slowly changing dimension is a dimension table in which a new row is created each time the underlying component entity changes some important characteristic. Its purpose is to record the state of the dimension entity at the time each transaction took place.
This concept is hard for for those who have primarily dealt with changes as handled in operational systems: no matter how a customer changes, we want to ensure that we have only one customer row in the customer table.
Thus each row in a slowly changing dimension does not correspond to a different entity but a different “state” of that entity—a “snapshot” of the entity at a point in time.
To create a slowly changing dimension table, the following design steps are required:
- Define which attributes of the dimension entity need to be tracked over time. This defines the conditions for creating each new dimensional instance.
- Generalize the key of the dimensional table to enable tracking of state changes. Usually this involves adding a version number to the original key of the dimension table.
Apart from the generalized key, the structure of the slowly changing dimension is the same as the original dimension. However, insertion and update processes for the table will need to be modified significantly.
Splitting Dimensions: “Tiny-Dimensions”
In practice, dimension tables often consist of millions of rows, making them unmanageable for browsing purposes. To address this issue, the most heavily used attributes (e.g., demographic fields for customer dimensions) may be separated into a mini-dimension table. This can improve performance significantly for the most common queries. The mini-dimension should contain a subset of attributes that can be efficiently browsed. As a rule of thumb, there should be fewer than 100,000 combinations of attribute values in a mini-dimension (i.e., fewer than 100,000 rows) to facilitate efficient browsing (Kimball, 1996). The number of attribute value combinations in a tiny dimension can be limited by:
- Including attributes in the mini-dimension that have discrete values (i.e., whose underlying domains consist of a fixed set of values).
- Grouping continuously valued attributes into “bands.” For example, age could be converted to a set of discrete ranges such as child (0-17), young (18-29), adult (30-45), mature (45-64), and senior (65+).
Figure 4. Mini-Dimensional Table (Detailed Level Design)
Figure 4 shows how customer demographics in the Order Item star could be stratified out into a mini-dimension table. Some of the attributes in the original table have been transformed to reduce the number of rows in the mini-dimension table:
# Employees and Revenue have been converted to ranges.
Date of First Order has been converted to Years of Service, reducing the number of combinations to years’ multiples rather than all dates combos that are possible.
Rule of Thumb:
Rather than create a lot of very small dimension tables, these may be combined into a single dimension, with each row representing a valid combination of values. As a rule of thumb, there should be no more than seven dimensions in each star schema to ensure that it is cognitively manageable in size (following the “seven, plus or minus two” principle).
Dealing with Non-Hierarchical Data
A major source of complexity in dimensional modeling is dealing with non-hierarchically structured data.
Dimensional models assume an underlying hierarchical structure and therefore exclude data that is naturally non-hierarchical.
So what do we do if important decision-making data is stored in the form of many-to-many relationships? In this section, we describe how to handle some particular types of non-hierarchical structures that commonly occur in practice:
- Many-to-many relationships: these define network structures among entities, and cause major headaches in dimensional modeling because they occur so frequently in ER models.
- Recursive relationships: these represent “hidden” hierarchies, in which the levels of the hierarchy are represented in data instances rather than the data structure.
- Generalization hierarchies: subtypes and supertypes require special handling in dimensional modeling, because of the issue of optional attributes. These represent hierarchies at the meta-data level only—data instances are not hierarchically related to each other so they cannot be treated as hierarchies for dimensional modeling purposes.
- Many-To-Many Relationships
Many-to-many relationships cause major headaches in dimensional modeling for two reasons. Firstly, they define network structures and therefore do not fit the hierarchical structure of a dimensional model. Secondly, they occur very commonly in practice. Here we consider three types of many-to-many relationships which commonly occur in practice:
- Time-dependent (history) relationships
- Generic (multiple role) relationships
- Multi-valued dependencies (“true” many-tomany relationships)
To include such relationships in a dimensional model generally requires converting them to one-to-many (hierarchical) relationships.
Type 1. Time-Dependent (Historical) Relationships
A special type of many-to-many relationship that occurs commonly in data warehousing applications is one which records the history of a single-valued relationship or attribute over time. That is, the attribute or relationship has only one value at a specific point in time, but has multiple values over time. For example, suppose that the history of employee positions is maintained in the example data model. As shown in Figure 5, the many-to-many relationship which results (Employee Position History) breaks the hierarchical chain and Position Type can no longer be collapsed into its associated component entity (Employee).
Figure 5. Time-Dependent (Historical) Relationship
There are two ways to handle this situation:
- Ignore history: Convert the historical relationship to a “point in time” relationship, which records the current value of the relationship or attribute. In the example, this would mean converting Employee Position History to a one-to-many relationship between Employee and Position Type, which records the employee’s current position. Position Type can then be collapsed into the Employee entity, and the Employee dimension will record the employee’s current position (i.e., at the time of the query). The history of previous positions (including their position at the time of the order) will be lost. A disadvantage of this solution is that it may result in (apparently) inconsistent results to queries about past events.
- Slowly changing dimension: Define Employee as a slowly changing dimension and create a new instance in the Employee dimension table when an employee changes position. This means that Position Type becomes single valued with respect to Employee, since each instance in the Employee table now represents a snapshot at a point in time, and an employee can only have one position at a point in time. Position Type can again be collapsed into the Employee dimension. The difference between this and the previous solution is that the position recorded in the dimension table is the employee’s position at the time of the order, not at the time of the query.
Type 2. Generic (Multiple Role) Relationships
Another situation that commonly occurs in practice is when a many-to-many relationship is used to represent a fixed number of different types of relationships between the same two entities. These correspond to different roles that an entity may play in relation to another entity. For example, in the example data model, an employee may have a number of possible roles in an order:
- They may receive the order
- They may approve the order
- They may dispatch the order
Figure 6. Generic (Multiple Role) Relationship converted to Specific Relationships
The intersection entity between Employee and Order means that Employee cannot be considered as a component of the Order transaction, and therefore orders cannot be analyzed by employee characteristics. To convert such a structure to dimensional form, the different types of relationships or roles represented by the generic relationship need to be factored out into separate one-to-many relationships (Figure 6). Once this is done, Employee becomes a component of the Order transaction, and can form a dimension in the resulting star schema.
Type 3. Multi-Valued Dependency (True Many-To-Many Relationship)
The final type of many-to-many relationship is when a true multi-valued dependency (MVD) exists between two entities: that is, when many entities of one type can be associated in the same type of relationship with many entities of another type at a single point in time. For example, in Figure 7, each customer may be involved in multiple industries. The intersection entity Customer Industry “breaks” the hierarchical chain and the industry hierarchy cannot be collapsed into the Customer component entity.
Figure 7. Multi-Valued Dependency (MVD)
One way of handling this is to convert the Customer Industry relationship to a 1:M relationship, by identifying the main or “principal” industry for each customer.
While each customer may be involved in many industries, there will generally be one industry in which they are primarily involved (e.g., earn most of their revenue). This converts the relationship into a one-to-many relationship, which means it can then be collapsed into the Customer table (Figure 8). Manual conversion effort may be required to identify which is the main industry if this is not recorded in the underlying production system.
Figure 8. Conversion of Many-To-Many Relationships to One-To-Many Recursive Relationships
Second, #2 from above:
Hierarchies are commonly represented in ER models using recursive relationships. Using a recursive relationship, the levels of the hierarchy are represented as data instances rather than as entities. More flexible for sure…However, such structures are less useful in a data warehousing environment, as they reduce understandability to end users and increase complexity of queries.
In converting an ER model to dimensional form, recursive relationships must be converted to explicit hierarchies, with each level shown as a separate entity. To convert this to dimensional form, each row in the Industry Classification Type entity becomes a separate entity. Once this is done, the levels of the hierarchy (which become classification entities) can be easily collapsed to form dimensions.
Figure 9. Conversion of Recursive Relationship to Explicit Hierarchy
Lastly, #3 from above:
In the simplest case, supertype/subtype relationships can be converted to dimensional form by merging the subtypes into the supertype and creating a “type” entity to distinguish between them. This can then be converted to a dimensional model in a straightforward manner as it forms a simple (two level) hierarchy. This will result in a dimension table with optional attributes for each subtype. This is the recommended approach when there are a relatively small number of subtype specific attributes and/or relationships. In the more complex case—when there are many subtype-specific attributes and when different transaction-entity attributes are applicable for different subtypes—separate dimensional models may need to be created for the supertype and each of the subtypes. These are called heterogeneous star schemas, and are the dimensional equivalent of subtypes and supertypes. In general, this will result in n+1 star schemas, where n is the number of subtypes (see Figure 10):
Figure 10. Heterogeneous Star Schemas ("Dimensional Subtyping")
- One Core Star Schema (“dimensional supertype”): This consists of a core fact table, a core dimension table plus other (non-subtyped) dimension tables. The core dimension table will contain all attributes of the supertype, while the core fact table will contain transaction attributes (facts) that are common to all subtypes.
- Multiple Custom Star Schemas (“dimensional subtypes”): A separate Customer Star Schema should be optionally created for each subtype in the underlying ER model. Each custom star schema will consist of a custom fact table, a custom dimension table plus other (non-subtyped) dimension tables.
- Each custom dimension table will contain all common attributes plus attributes specific to that subtype. The custom fact table will contain all common facts plus facts applicable only to that subtype.