Data
Warehouse Project :Phases in Life Cycle
This is the End-to End process
of an application in the organization (From requirements gathering to
production support).
There are 4 phases in the life-cycle
- ETL Testing
- Report Testing
- Production or Support
Requirements
Gathering Phase(HLD & LLD) in Data Warehouse Project Life Cycle
Business
Analyst(BA): He is the person or a mediator between the client and the
organization.
- First Business Analyst is going to interact with the client to know about the client business and pain-areas, from that business analysis BA will prepare the document called as Business requirement document or BRD or FSD(Functional Specification document) or SPEC.
- From the BRD,the development team will prepare the HLD and LLD.
HLD(High-level-design-document): Divide the BRD into different modules for requirement
analysis, prepared by developers. An HLD document is prepared based on Business
Requirement.
LLD(Low-level-design-document):
Gives the detailed information about each entity in the BRD or HLD,
prepared by developers. It gives the details descriptions of the each module in
details which is completely based on HLD.
- GAP Analysis: This is the comparison between client requirement and reporting tool features like is the tool fulfill all the requirements of the client or not.
- POC(Proof of Concept): This is the dummy or a duplicate model where the development team will cook the data manually and prepares a sample model which is going to send for the client's approval.
Development
Phase in Data Warehouse Project Life Cycle
There
are 2 parts in development
- ETL development: ETL developers will prepare a data-model with all dimensions and facts.Also build an integrated data warehouse from the heterogeneous data sources.
- Reporting development: Once the DWH is built,the reporters will configure the repository and generate the reports as per the client's requirement.
Production or Support Phase in Data Warehouse Project Life Cycle
The company
needs to give the application support for one month by default.After that based
on the client's decision, the development company or some other company will
give the production support.
Dimensions
and Measures in Datawarehouse
Data warehouse consist of both
dimensions and measures.
Dimensions
are descriptive details about various objects allowing for their detailed
analysis. For example time dimension allows us to see object with respect to
year, quarter, month, day and hour. And customer dimensions helps in analyzing
the customer who affects the business more.
Measures unlike
dimensions give the fact numeric value instead of the detailed analysis.
Different type of measures are
listed below:
- Additive measures are measures that can be added across all dimensions. For example customer count in numbers can be added across all dimensions.
- Semi-additive measures are measures that can be added across some, but not all dimensions. For example the bank account balance is simply a snapshot in time and cannot be added over time. However you could add multiple accounts of the same customer to get the total balance for that customer.
- Non-additive measures are measures that cannot be added across any dimensions. For example the procurement is simply a snapshot in time and cannot be summed over time. Also you cannot combine procurement for various items.
The Types of
Fact Table are
- Snapshot
- Cumulative
- Factless Fact Table
Snapshot
This type of fact table describes
the state of things in a particular instance of time, and usually includes more
semi-additive and non-additive facts. The second example presented here is a
snapshot fact table.
Eg: Daily balances fact can be
summed up through the customers dimension but not through the time dimension.
Cumulative
This type of fact table describes
what has happened over a period of time. For example, this fact table may describe
the total sales by product by store by day. The facts for this type of fact
tables are mostly additive facts. The first example presented here is a
cumulative fact table.
Eg: Sales fact
Factless Fact Table
In
the real world, it is possible to have a fact table that contains no measures
or facts. These tables are called “Factless Fact tables”.
Eg:
A fact table which has only product key and date key is a factless fact.
There are no measures in this table. But still you can get the number products
sold over a period of time.
What
is a FACTLESS FACT TABLE?Where we use Factless Fact
We know that fact table is a
collection of many facts and measures having multiple keys joined with one or
more dimesion tables.Facts contain both numeric and additive fields.But factless
fact table are different from all these.
A factless fact table
is fact table that does not contain fact.They contain only dimesional keys and
it captures events that happen only at information level but not included in
the calculations level.just an information about an event that happen over a
period.
A factless fact table captures the
many-to-many relationships between dimensions, but contains no numeric or
textual facts. They are often used to record events or coverage information.
Common examples of factless fact tables include:
- Identifying product promotion events (to determine promoted products that didn’t sell)
- Tracking student attendance or registration events
- Tracking insurance-related accident events
- Identifying building, facility, and equipment schedules for a hospital or university
Factless fact tables are used for
tracking a process or collecting stats. They are called so because, the fact
table does not have aggregatable numeric values or information.There are two
types of factless fact tables: those that describe events, and those that
describe conditions. Both may play important roles in your dimensional models.
Factless fact tables for Events
The first type of factless fact table is a table that records an event. Many event-tracking tables in dimensional data warehouses turn out to be factless.Sometimes there seem to be no facts associated with an important business process. Events or activities occur that you wish to track, but you find no measurements. In situations like this, build a standard transaction-grained fact table that contains no facts.
For eg.
The above fact is used to capture
the leave taken by an employee.Whenever an employee takes leave a record is
created with the dimensions.Using the fact FACT_LEAVE we can answer many questions
like
- Number of leaves taken by an employee
- The type of leave an employee takes
- Details of the employee who took leave
Factless fact tables for Conditions
Factless fact tables are also used to model conditions or other important relationships among dimensions. In these cases, there are no clear transactions or events.It is used to support negative analysis report. For example a Store that did not sell a product for a given period. To produce such report, you need to have a fact table to capture all the possible combinations. You can then figure out what is missing.
Factless fact tables are also used to model conditions or other important relationships among dimensions. In these cases, there are no clear transactions or events.It is used to support negative analysis report. For example a Store that did not sell a product for a given period. To produce such report, you need to have a fact table to capture all the possible combinations. You can then figure out what is missing.
For eg, fact_promo gives the information about the products
which have promotions but still did not sell
This
fact answers the below questions:
- To find out products that have promotions.
- To find out products that have promotion that sell.
- The list of products that have promotion but did not sell.
This kind of factless fact table is
used to track conditions, coverage or eligibility. In Kimball
terminology, it is called a "coverage table."
Note:
We may have the question that why we
cannot include these information in the actual fact table .The problem is that
if we do so then the fact size will increase enormously .
Factless fact table is crucial in
many complex business processes. By applying you can design a dimensional model
that has no clear facts to produce more meaningful information for your
business processes.Factless fact table itself can be used to generate the
useful reports.
Types
of Dimensions
Dimension
A dimension table typically has two types of
columns, primary keys to fact tables and textual\descriptive data.
Eg: Time, Customer
Types of Dimensions
- Slowly Changing Dimensions
- Rapidly Changing Dimensions
- Junk Dimensions
- Inferred Dimensions
- Conformed Dimensions
- Degenerate Dimensions
- Role Playing Dimensions
- Shrunken Dimensions
- Static Dimensions
Slowly Changing
Dimensions
Attributes of a dimension that would undergo changes over
time. It depends on the business requirement whether particular attribute
history of changes should be preserved in the data warehouse. This is called a
slowly changing attribute and a dimension containing such an attribute is
called a slowly changing dimension.
Rapidly Changing
Dimensions
A dimension attribute that changes frequently is a rapidly
changing attribute. If you don’t need to track the changes, the rapidly
changing attribute is no problem, but if you do need to track the changes,
using a standard slowly changing dimension technique can result in a huge
inflation of the size of the dimension. One solution is to move the attribute
to its own dimension, with a separate foreign key in the fact table. This new
dimension is called a rapidly changing dimension.
Junk Dimensions
A junk dimension is a single table with a combination of
different and unrelated attributes to avoid having a large number of foreign
keys in the fact table. Junk dimensions are often created to manage the foreign
keys created by rapidly changing dimensions.
Inferred Dimensions
While loading fact records, a dimension record may not yet
be ready. One solution is to generate a surrogate key with null for all the
other attributes. This should technically be called an inferred member, but is
often called an inferred dimension.
Conformed Dimensions
A dimension that is used in multiple locations is called a
conformed dimension. A conformed dimension may be used with multiple fact
tables in a single database, or across multiple data marts or data warehouses.
Degenerate Dimensions
A degenerate dimension is when the dimension attribute is
stored as part of fact table, and not in a separate dimension table. These are
essentially dimension keys for which there are no other attributes. In a data
warehouse, these are often used as the result of a drill through query to analyze
the source of an aggregated number in a report. You can use these values to
trace back to transactions in the OLTP system.
Role Playing
Dimensions
A role-playing dimension is one where the same dimension key
— along with its associated attributes — can be joined to more than one foreign
key in the fact table. For example, a fact table may include foreign keys for
both ship date and delivery date. But the same date dimension attributes apply
to each foreign key, so you can join the same dimension table to both foreign
keys. Here the date dimension is taking multiple roles to map ship date as well
as delivery date, and hence the name of role playing dimension.
Shrunken Dimensions
A shrunken dimension is a subset of another dimension. For
example, the orders fact table may include a foreign key for product, but the
target fact table may include a foreign key only for productcategory, which is
in the product table, but much less granular. Creating a smaller dimension
table, with productcategory as its primary key, is one way of dealing with this
situation of heterogeneous grain. If the product dimension is snowflaked, there
is probably already a separate table for productcategory, which can serve as
the shrunken dimension.
Static Dimensions
Static dimensions are not extracted from the original data
source, but are created within the context of the data warehouse. A static
dimension can be loaded manually — for example with status codes — or it can be
generated by a procedure, such as a date or time dimension.
What is a Junk Dimension in Datawarehousing
The
junk dimension is simply a structure that provides a convenient place to store
the junk attributes. It is just a collection of random transactional codes,
flags and/or text attributes that are unrelated to any particular dimension.
In
OLTP tables that are full of flag fields and yes/no attributes, many of which
are used for operational support and have no documentation except for the
column names and the memory banks of the person who created them. Not only do
those types of attributes not integrate easily into conventional dimensions
such as Customer, Vendor, Time, Location, and Product, but you also don’t want
to carry bad design into the data warehouse.However, some of the miscellaneous
attributes will contain data that has significant business value, so you have
to do something with them.
This scenario is especially common in legacy systems and databases that were created without solid, underlying design principles. Column names such as Completed, Packed, Shipped, Received, Delivered, and Returned (each with yes/no data values) are very common, and they do have business value.These miscellaneous indicators and flags that don't logically belong to the core dimension tables. They are either too valuable to ignore or exclude.Often the meaning of the flags and text attributes is obscure. This situation leaves the designer with a number of bad alternatives
Designers sometimes want to treat them as Fact or make it into numerous small Dimensional tables. However, all of these options are less than ideal. Discarding the data can be dangerous because the miscellaneous values, flags, and yes/no fields might contain valuable business data. Including the miscellaneous attributes in the fact table could cause the fact table to swell to alarming proportions, especially if you have more than just a few miscellaneous attributes. The increased size of the fact table could cause serious performance problems because of the reduced number of records per physical I/O. Even if you tried to index these fields to minimize the performance problems, you still wouldn’t gain anything because so many of the miscellaneous fields contain flag values such as 0 and 1; Y and N; or open, pending, and closed.
A
third, less obvious but preferable, solution is to incorporate a Junk
Dimension as a holding place for these flags and indicators.
Advantage
of junk dimension:
- It provides a recognizable location for related codes, indicators and their descriptors in a dimensional framework.
- This avoids the creation of multiple dimension tables.
- Provide a smaller, quicker point of entry for queries compared to performance when these attributes are directly in the fact table.
- An interesting use for a junk dimension is to capture the context of a specific transaction. While our common, conformed dimensions contain the key dimensional attributes of interest, there are likely attributes about the transaction that are not known until the transaction is processed.
Above
figure shows a junk dimension. As in any dimensional design, each of the rows
in the fact table will be associated with a row in this junk dimension.
Simple
Datawarehouse - Junk Dimension
You
want to keep the data warehouse design as simple and straightforward as
possible, so that users will be able to access data easily. Miscellaneous
attributes that contain business value are a challenge to include in your data
warehouse design because they don’t fit neatly into conventional dimensions,
and if improperly handled, can cause the data warehouse to swell in size and
perform suboptimally. By placing miscellaneous attributes into junk dimensions,
you can circumvent both of these problems.