Tuesday 16 September 2014

DW_Fundamentals



Data Warehouse Project :Phases in Life Cycle
This is the End-to End process of an application in the organization (From requirements gathering to production support).
There are 4 phases in the life-cycle
  1. Requirements Gathering
  2. Development
    1. ETL Development
    2. Report Development
  3. Testing
    1. ETL Testing
    2. Report Testing
  1. Production or Support
Requirements Gathering Phase(HLD & LLD) in Data Warehouse Project Life Cycle
Business Analyst(BA): He is the person or a mediator between the client and the organization.
  • First Business Analyst is going to interact with the client to know about the client business and pain-areas, from that business analysis BA will prepare the document called as Business requirement document or BRD or FSD(Functional Specification document) or SPEC.
  • From the BRD,the development team will prepare the HLD and LLD.
HLD(High-level-design-document): Divide the BRD into different modules for requirement analysis, prepared by developers. An HLD document is prepared based on Business Requirement.
LLD(Low-level-design-document):  Gives the detailed information about each entity in the BRD or HLD, prepared by developers. It gives the details descriptions of the each module in details which is completely based on HLD.
  • GAP Analysis: This is the comparison between client requirement and reporting tool features like is the tool fulfill all the requirements of the client or not.
  • POC(Proof of Concept): This is the dummy or a duplicate model where the development team will cook the data manually and prepares a sample model which is going to send for the client's approval.
Development Phase in Data Warehouse Project Life Cycle
There are 2 parts in development

  • ETL development: ETL developers will prepare a data-model with all dimensions and facts.Also build an integrated data warehouse from the heterogeneous data sources.
  • Reporting development: Once the DWH is built,the reporters will configure the repository and generate the reports as per the client's requirement.

Production or Support Phase in Data Warehouse Project Life Cycle

The company needs to give the application support for one month by default.After that based on the client's decision, the development company or some other company will give the production support.
Dimensions and Measures in Datawarehouse
Data warehouse consist of both dimensions and measures.
Dimensions are descriptive details about various objects allowing for their detailed analysis. For example time dimension allows us to see object with respect to year, quarter, month, day and hour. And customer dimensions helps in analyzing the customer who affects the business more.
Measures unlike dimensions give the fact numeric value instead of the detailed analysis.
Different type of measures are listed below:
  • Additive measures are measures that can be added across all dimensions. For example customer count in numbers can be added across all dimensions.
  • Semi-additive measures are measures that can be added across some, but not all dimensions. For example the bank account balance is simply a snapshot in time and cannot be added over time. However you could add multiple accounts of the same customer to get the total balance for that customer.
  • Non-additive measures are measures that cannot be added across any dimensions. For example the procurement is simply a snapshot in time and cannot be summed over time. Also you cannot combine procurement for various items.
The Types of Fact Table are
  1. Snapshot
  2. Cumulative
  3. Factless Fact Table
Snapshot
This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.
Eg: Daily balances fact can be summed up through the customers dimension but not through the time dimension.

Cumulative
This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.
Eg: Sales fact

Factless Fact Table
In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called “Factless Fact tables”.
Eg: A fact table which has only product key and date key is a factless fact. There are no measures in this table. But still you can get the number products sold over a period of time.

What is a FACTLESS FACT TABLE?Where we use Factless Fact
We know that fact table is a collection of many facts and measures having multiple keys joined with one or more dimesion tables.Facts contain both numeric and additive fields.But factless fact table are different from all these.
A factless fact table is fact table that does not contain fact.They contain only dimesional keys and it captures events that happen only at information level but not included in the calculations level.just an information about an event that happen over a period.

A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts. They are often used to record events or coverage information. Common examples of factless fact tables include:
  • Identifying product promotion events (to determine promoted products that didn’t sell)
  • Tracking student attendance or registration events
  • Tracking insurance-related accident events
  • Identifying building, facility, and equipment schedules for a hospital or university
Factless fact tables are used for tracking a process or collecting stats. They are called so because, the fact table does not have aggregatable numeric values or information.There are two types of factless fact tables: those that describe events, and those that describe conditions. Both may play important roles in your dimensional models.

Factless fact tables for Events
The first type of factless fact table is a table that records an event. Many event-tracking tables in dimensional data warehouses turn out to be factless.Sometimes there seem to be no facts associated with an important business process. Events or activities occur that you wish to track, but you find no measurements. In situations like this, build a standard transaction-grained fact table that contains no facts.
For eg.

 https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgv1ILBDAGpgtKkMYjoCThmC_NbyiKNldhXtLwoxcThTCSVcrvHuPKwjDC4RxmZcLDqTdeT5aTW2-Y3NqNEH7YPLu9X17Or8V_azUrNurNkiuNvJMLaXGgoxmIddNm5hJJnOZOcQHPud85B/s1600/employee_leave.png

The above fact is used to capture the leave taken by an employee.Whenever an employee takes leave a record is created with the dimensions.Using the fact FACT_LEAVE we can answer many questions like
  • Number of leaves taken by an employee
  • The type of leave an employee takes
  • Details of the employee who took leave
Factless fact tables for Conditions
Factless fact tables are also used to model conditions or other important relationships among dimensions. In these cases, there are no clear transactions or events.It is used to support negative analysis report. For example a Store that did not sell a product for a given period.  To produce such report, you need to have a fact table to capture all the possible combinations.  You can then figure out what is missing.
For eg, fact_promo gives the information about the products which have promotions but still did not sell
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUdaWO9BtfkMQCJFy9So9HYFbaPJ-VggRIZAYQxT22mKVxc93ZdItdjPm_b5PqFHcyI_WTCLzyY_j7FCBNKcCnBcelBd0Wqo2hCbU9tYUZl22JT6Nq6b6RpD771HxsGHl3drpJS_xBUcbT/s1600/promotional_ewvent.png
This  fact answers the below questions:
  • To find out products that have promotions.
  • To find out products that have promotion that sell.
  • The list of products that have promotion but did not sell.
This kind of factless fact table is used to track conditions, coverage or eligibility.  In Kimball terminology, it is called a "coverage table."

Note:
We may have the question that why we cannot include these information in the actual fact table .The problem is that if we do so then the fact size will increase enormously .

Factless fact table is crucial in many complex business processes. By applying you can design a dimensional model that has no clear facts to produce more meaningful information for your business processes.Factless fact table itself can be used to generate the useful reports.

Types of Dimensions
Dimension
A dimension table typically has two types of columns, primary keys to fact tables and textual\descriptive data.
Eg: Time, Customer

Types of Dimensions
  1. Slowly Changing Dimensions
  2. Rapidly Changing Dimensions
  3. Junk Dimensions
  4. Inferred Dimensions
  5. Conformed Dimensions
  6. Degenerate Dimensions
  7. Role Playing Dimensions
  8. Shrunken Dimensions
  9. Static Dimensions 
Slowly Changing Dimensions
Attributes of a dimension that would undergo changes over time. It depends on the business requirement whether particular attribute history of changes should be preserved in the data warehouse. This is called a slowly changing attribute and a dimension containing such an attribute is called a slowly changing dimension.



Rapidly Changing Dimensions
A dimension attribute that changes frequently is a rapidly changing attribute. If you don’t need to track the changes, the rapidly changing attribute is no problem, but if you do need to track the changes, using a standard slowly changing dimension technique can result in a huge inflation of the size of the dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table. This new dimension is called a rapidly changing dimension.

Junk Dimensions
A junk dimension is a single table with a combination of different and unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by rapidly changing dimensions.

Inferred Dimensions
While loading fact records, a dimension record may not yet be ready. One solution is to generate a surrogate key with null for all the other attributes. This should technically be called an inferred member, but is often called an inferred dimension.

Conformed Dimensions
A dimension that is used in multiple locations is called a conformed dimension. A conformed dimension may be used with multiple fact tables in a single database, or across multiple data marts or data warehouses.

Degenerate Dimensions
A degenerate dimension is when the dimension attribute is stored as part of fact table, and not in a separate dimension table. These are essentially dimension keys for which there are no other attributes. In a data warehouse, these are often used as the result of a drill through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.

Role Playing Dimensions
A role-playing dimension is one where the same dimension key — along with its associated attributes — can be joined to more than one foreign key in the fact table. For example, a fact table may include foreign keys for both ship date and delivery date. But the same date dimension attributes apply to each foreign key, so you can join the same dimension table to both foreign keys. Here the date dimension is taking multiple roles to map ship date as well as delivery date, and hence the name of role playing dimension.

Shrunken Dimensions
A shrunken dimension is a subset of another dimension. For example, the orders fact table may include a foreign key for product, but the target fact table may include a foreign key only for productcategory, which is in the product table, but much less granular. Creating a smaller dimension table, with productcategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the product dimension is snowflaked, there is probably already a separate table for productcategory, which can serve as the shrunken dimension.

Static Dimensions
Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with status codes — or it can be generated by a procedure, such as a date or time dimension.

What is a Junk Dimension in Datawarehousing

The junk dimension is simply a structure that provides a convenient place to store the junk attributes. It is just a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension.
In OLTP tables that are full of flag fields and yes/no attributes, many of which are used for operational support and have no documentation except for the column names and the memory banks of the person who created them. Not only do those types of attributes not integrate easily into conventional dimensions such as Customer, Vendor, Time, Location, and Product, but you also don’t want to carry bad design into the data warehouse.However, some of the miscellaneous attributes will contain data that has significant business value, so you have to do something with them.

This scenario is especially common in legacy systems and databases that were created without solid, underlying design principles. Column names such as Completed, Packed, Shipped, Received, Delivered, and Returned (each with yes/no data values) are very common, and they do have business value.These miscellaneous indicators and flags that don't logically belong to the core dimension tables.  They are either too valuable to ignore or exclude.Often the meaning of the flags and text attributes is obscure. This situation leaves the designer with a number of bad alternatives

Designers sometimes want to treat them as Fact or make it into numerous small Dimensional tables. However, all of these options are less than ideal. Discarding the data can be dangerous because the miscellaneous values, flags, and yes/no fields might contain valuable business data. Including the miscellaneous attributes in the fact table could cause the fact table to swell to alarming proportions, especially if you have more than just a few miscellaneous attributes. The increased size of the fact table could cause serious performance problems because of the reduced number of records per physical I/O. Even if you tried to index these fields to minimize the performance problems, you still wouldn’t gain anything because so many of the miscellaneous fields contain flag values such as 0 and 1; Y and N; or open, pending, and closed.
A third, less obvious but preferable, solution is to incorporate a Junk Dimension as a holding place for these flags and indicators.
Advantage of junk dimension:
  • It provides a recognizable location for related codes, indicators and their descriptors in a dimensional framework.
  • This avoids the creation of multiple dimension tables.
  • Provide a smaller, quicker point of entry for queries compared to performance when these attributes are directly in the fact table. 
  • An interesting use for a junk dimension is to capture the context of a specific transaction.  While our common, conformed dimensions contain the key dimensional attributes of interest, there are likely attributes about the transaction that are not known until the transaction is processed.
  •  https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7mGYhB8ab-XRf4GwuEoVRpINF9N4bjOlkvUUyjj6Wjd-N92wV7K6RSf-IJMAR3SMedV4uhLiUIrvOC2pMTLS9TMpSHHcUkiea56OfZ48PaB8KZsEU2wOKbcHNtE80D8Rh8nHaHJaIOXbm/s1600/junk.jpg
Above figure shows a junk dimension. As in any dimensional design, each of the rows in the fact table will be associated with a row in this junk dimension.

Simple Datawarehouse - Junk Dimension
You want to keep the data warehouse design as simple and straightforward as possible, so that users will be able to access data easily. Miscellaneous attributes that contain business value are a challenge to include in your data warehouse design because they don’t fit neatly into conventional dimensions, and if improperly handled, can cause the data warehouse to swell in size and perform suboptimally. By placing miscellaneous attributes into junk dimensions, you can circumvent both of these problems.