Parallel processing
A data warehouse is a central, integrated database containing data from the heterogeneous source systems in an organization. The data is transformed to eliminate inconsistencies, aggregated to summarize it, and loaded into the data warehouse. This database can be accessed by multiple users, ensuring that each group in the organization works with valuable, stable data.
To process large volumes of data from heterogeneous source systems effectively, ETL (Extraction, Transformation and Load) software implements parallel processing. There are two basic types of parallel processing: pipeline parallelism and partition parallelism. IBM Information Server (InfoSphere DataStage) allows you to use both of these methods.
The following sections illustrate these methods using a simple parallel job which extracts data from a data source, transforms it in some way, then writes it to another data source. In all cases this job would appear the same on your Designer canvas, but you can configure it to behave in different ways (which are shown diagrammatically).
Pipeline Parallelism:
DataStage pipelines data (where possible) from one stage to the next, and nothing has to be done for this to happen. All the stages in a job operate simultaneously: each downstream stage starts processing as soon as data becomes available from the upstream stage. Pipeline parallelism eliminates the need to store intermediate results to disk.
If
you ran the example job on a system with at least three processors, the stage
reading would start on one processor and start filling a pipeline with the data
it had read. The transformer stage would start running on another processor as
soon as there was data in the pipeline, process it and start filling another pipeline.
The stage writing the transformed data to the target database would similarly
start writing as soon as there was data available. Thus all three stages are
operating simultaneously. If you were running sequentially, there would only be one instance of each stage. If you were running in parallel, there would be as many instances as you had partitions.
Conceptual representation of the same job using pipeline parallelism
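The same idea can be sketched outside DataStage. The following is a minimal, hypothetical Python analogy (not DataStage code) in which an extract, a transform and a load stage are connected by in-memory queues, so each downstream stage starts consuming rows as soon as the upstream stage produces them:

import threading
import queue

SENTINEL = object()  # marks the end of the stream

def extract(out_q):
    # Reading stage: starts filling the pipeline as rows are read.
    for row in range(10):                 # stand-in for a real data source
        out_q.put(row)
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    # Transformer stage: runs as soon as data appears in the pipeline.
    while (row := in_q.get()) is not SENTINEL:
        out_q.put(row * 2)                # stand-in for a real transformation
    out_q.put(SENTINEL)

def load(in_q):
    # Writing stage: writes each row as soon as it becomes available.
    while (row := in_q.get()) is not SENTINEL:
        print("loaded", row)

q1, q2 = queue.Queue(), queue.Queue()
stages = [
    threading.Thread(target=extract, args=(q1,)),
    threading.Thread(target=transform, args=(q1, q2)),
    threading.Thread(target=load, args=(q2,)),
]
for s in stages:
    s.start()
for s in stages:
    s.join()

All three stages run concurrently; no stage waits for the previous one to finish its whole input, and nothing is written to intermediate disk storage.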
Partition Parallelism:
The aim of most partitioning operations is to end up with a set of partitions that are as near equal in size as possible, ensuring an even load across processors. Partitioning is ideal for handling very large quantities of data: the data is broken into partitions, and each partition is handled by a separate instance of the job stages.
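As a rough illustration in plain Python (hypothetical code, not a DataStage partitioner), two common ways of producing near-equal partitions are round-robin partitioning, which deals rows out evenly, and hash partitioning, which keeps rows with the same key value in the same partition:

def round_robin_partition(rows, n):
    # Deal rows out like cards; partition sizes differ by at most one.
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(rows, n, key):
    # Rows with the same key value always land in the same partition.
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts

rows = [{"id": i, "amount": i * 10} for i in range(100)]
print([len(p) for p in round_robin_partition(rows, 4)])   # [25, 25, 25, 25]
print([len(p) for p in hash_partition(rows, 4, key=lambda r: r["id"])])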
Imagine
you have the same simple job as described above, but that it is handling very
large quantities of data. In this scenario you could use the power of parallel
processing to your best advantage by partitioning the data into a number of
separate sets, with each partition being handled by a separate instance of the
job stages.
Using
partition parallelism the same job would effectively be run simultaneously by
several processors, each handling a separate subset of the total data.
At
the end of the job the data partitions can be collected back together again and
written to a single data source.
Conceptual
representation of job using partition parallelism
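Continuing the hypothetical Python analogy, the same transformation can be run over each partition in a separate process, and the partition results collected back together into a single output at the end:

from multiprocessing import Pool

def transform_partition(rows):
    # Each worker process handles one partition independently.
    return [{"id": r["id"], "amount": r["amount"] * 2} for r in rows]

if __name__ == "__main__":
    rows = [{"id": i, "amount": i * 10} for i in range(100)]
    partitions = [rows[i::4] for i in range(4)]           # round-robin split
    with Pool(processes=4) as pool:
        results = pool.map(transform_partition, partitions)
    # Collect the partitions back together and write to a single target.
    collected = [row for part in results for row in part]
    print(len(collected))                                  # 100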
Parallel processing environments:
The environment in which you run
your DataStage jobs is defined by your system's architecture and hardware
resources.
All parallel-processing environments can be categorized as one of the following:
- SMP (Symmetric Multiprocessing)
- Clusters or MPP (Massively Parallel Processing)
SMP (symmetric multiprocessing),
shared memory:
Symmetric multiprocessing (SMP) involves a multiprocessor hardware and software architecture in which two or more identical processors
connect to a single, shared main memory, have full access to all I/O
devices, and are controlled by a single operating system instance that treats
all processors equally, reserving none for special purposes. Most
multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP
architecture applies to the cores, treating them as separate processors.
SMP systems are tightly coupled multiprocessor systems with a pool of homogeneous processors running independently. Each processor executes different programs and works on different data, while sharing common resources (memory, I/O devices, the interrupt system and so on) over a system bus or a crossbar.
- Some hardware resources may be shared among processors.
- Processors communicate via shared memory and have a single operating system.
- All CPUs share system resources.
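A minimal, hypothetical Python sketch of this shared-memory style: several threads in one process update a single shared structure, coordinating through a lock, much as SMP processors share one main memory:

import threading

totals = {"rows": 0}            # one structure shared by all workers
lock = threading.Lock()

def worker(rows):
    # Every thread sees the same 'totals' dict in the same address space.
    for _ in rows:
        with lock:              # coordinate access to the shared resource
            totals["rows"] += 1

threads = [threading.Thread(target=worker, args=(range(1000),)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(totals["rows"])           # 4000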
MPP (massively parallel processing),
shared-nothing:
In Massively Parallel Processing (MPP), data is partitioned across multiple servers or nodes, with each server/node having its own memory and processors to process data locally. All communication is via a network interconnect; there is no disk-level sharing or contention to be concerned with (i.e. it is a 'shared-nothing' architecture).
Massively parallel processing (MPP) is a form of collaborative processing of the same program by two or more processors, with each processor handling a different segment (dataset) of the data.
As mentioned earlier, the main characteristic of MPP is data distribution. Data is distributed across segments to achieve data and processing parallelism. This is achieved using partitioning techniques: the data is split into segments (datasets) and distributed across the available nodes.
- An MPP system can be viewed as a set of connected SMPs.
- Each processor has exclusive access to hardware resources.
- MPP systems are physically housed in the same box.
MPP Architecture in Teradata
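By contrast with the shared-memory sketch above, a hypothetical shared-nothing sketch: each worker process owns its own slice of the data and its own memory, and the only communication is a partial result sent back over a queue (standing in for the network interconnect):

from multiprocessing import Process, Queue

def node(rows, out_q):
    # Each node works only on its own local data; nothing is shared.
    out_q.put(sum(rows))        # ship a partial result over the "interconnect"

if __name__ == "__main__":
    partitions = [range(0, 25), range(25, 50), range(50, 75), range(75, 100)]
    out_q = Queue()
    procs = [Process(target=node, args=(p, out_q)) for p in partitions]
    for p in procs:
        p.start()
    partials = [out_q.get() for _ in procs]
    for p in procs:
        p.join()
    print(sum(partials))        # 4950, combined from the per-node results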
Cluster Systems:
- UNIX systems connected via networks
- Cluster systems can be physically dispersed.
Understanding these processing methods and environments enabled me to understand the overall parallel job architecture in DataStage.
Article Source: http://EzineArticles.com/7198379