Parallel processing
A data warehouse is a central, integrated database containing data from the heterogeneous source systems in an organization. The data is transformed to eliminate inconsistencies, aggregated to summarize it, and loaded into the data warehouse. This database can be accessed by multiple users, ensuring that each group in the organization works with valuable, stable data.
To process large volumes of data from heterogeneous source systems effectively, ETL (Extraction, Transformation and Load) software implements parallel processing. There are two basic types of parallel processing: pipeline parallelism and partition parallelism. IBM Information Server (InfoSphere DataStage) allows you to use both of these methods.
The following sections illustrate these methods using a simple parallel job which extracts data from a data source, transforms it in some way, then writes it to another data source. In all cases this job would appear the same on your Designer canvas, but you can configure it to behave in different ways (which are shown diagrammatically).
Pipeline Parallelism:
DataStage pipelines data (where possible) from one stage to the next, and nothing has to be done for this to happen. All the stages in a job operate simultaneously: each downstream stage starts processing as soon as data becomes available from the upstream stage. Pipeline parallelism eliminates the need to store intermediate results to disk.
If
you ran the example job on a system with at least three processors, the stage
reading would start on one processor and start filling a pipeline with the data
it had read. The transformer stage would start running on another processor as
soon as there was data in the pipeline, process it and start filling another pipeline.
The stage writing the transformed data to the target database would similarly
start writing as soon as there was data available. Thus all three stages are
operating simultaneously. If you were running sequentially, there would only be one instance of each stage. If you were running in parallel, there would be as many instances as you had partitions.
Conceptual representation of the same job using pipeline parallelism
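The same idea can be sketched outside DataStage. The following is a minimal, hypothetical Python analogy (not DataStage code) in which an extract, a transform and a load stage are connected by in-memory queues, so each downstream stage starts consuming rows as soon as the upstream stage produces them:

import threading
import queue

SENTINEL = object()  # marks the end of the stream

def extract(out_q):
    # Reading stage: starts filling the pipeline as rows are read.
    for row in range(10):                 # stand-in for a real data source
        out_q.put(row)
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    # Transformer stage: runs as soon as data appears in the pipeline.
    while (row := in_q.get()) is not SENTINEL:
        out_q.put(row * 2)                # stand-in for a real transformation
    out_q.put(SENTINEL)

def load(in_q):
    # Writing stage: writes each row as soon as it becomes available.
    while (row := in_q.get()) is not SENTINEL:
        print("loaded", row)

q1, q2 = queue.Queue(), queue.Queue()
stages = [
    threading.Thread(target=extract, args=(q1,)),
    threading.Thread(target=transform, args=(q1, q2)),
    threading.Thread(target=load, args=(q2,)),
]
for s in stages:
    s.start()
for s in stages:
    s.join()

All three stages run concurrently; no stage waits for the previous one to finish its whole input, and nothing is written to intermediate disk storage.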
Partition Parallelism:
The aim of most partitioning operations is to end up with a set of partitions that are as near equal in size as possible, ensuring an even load across processors. Partitioning is ideal for handling very large quantities of data: the data is broken into partitions, and each partition is handled by a separate instance of the job stages.
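As a rough illustration in plain Python (hypothetical code, not a DataStage partitioner), two common ways of producing near-equal partitions are round-robin partitioning, which deals rows out evenly, and hash partitioning, which keeps rows with the same key value in the same partition:

def round_robin_partition(rows, n):
    # Deal rows out like cards; partition sizes differ by at most one.
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(rows, n, key):
    # Rows with the same key value always land in the same partition.
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts

rows = [{"id": i, "amount": i * 10} for i in range(100)]
print([len(p) for p in round_robin_partition(rows, 4)])   # [25, 25, 25, 25]
print([len(p) for p in hash_partition(rows, 4, key=lambda r: r["id"])])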
Imagine
you have the same simple job as described above, but that it is handling very
large quantities of data. In this scenario you could use the power of parallel
processing to your best advantage by partitioning the data into a number of
separate sets, with each partition being handled by a separate instance of the
job stages.
Using
partition parallelism the same job would effectively be run simultaneously by
several processors, each handling a separate subset of the total data.
At
the end of the job the data partitions can be collected back together again and
written to a single data source.
Conceptual
representation of job using partition parallelism
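Continuing the hypothetical Python analogy, the same transformation can be run over each partition in a separate process, and the partition results collected back together into a single output at the end:

from multiprocessing import Pool

def transform_partition(rows):
    # Each worker process handles one partition independently.
    return [{"id": r["id"], "amount": r["amount"] * 2} for r in rows]

if __name__ == "__main__":
    rows = [{"id": i, "amount": i * 10} for i in range(100)]
    partitions = [rows[i::4] for i in range(4)]           # round-robin split
    with Pool(processes=4) as pool:
        results = pool.map(transform_partition, partitions)
    # Collect the partitions back together and write to a single target.
    collected = [row for part in results for row in part]
    print(len(collected))                                  # 100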
Parallel processing environments:
The environment in which you run
your DataStage jobs is defined by your system's architecture and hardware
resources.
All parallel-processing environments can be categorized as one of the following:
- SMP (Symmetric Multiprocessing)
- Clusters or MPP (Massively Parallel Processing)
SMP (symmetric multiprocessing),
shared memory:
Symmetric multiprocessing (SMP) involves a multiprocessor hardware and software architecture in which two or more identical processors
connect to a single, shared main memory, have full access to all I/O
devices, and are controlled by a single operating system instance that treats
all processors equally, reserving none for special purposes. Most
multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP
architecture applies to the cores, treating them as separate processors.
SMP systems are tightly coupled multiprocessor systems with a pool of homogeneous processors running independently. Each processor executes different programs and works on different data, while sharing common resources (memory, I/O devices, the interrupt system and so on) over a system bus or a crossbar.
- Some hardware resources may be shared among processors.
- Processors communicate via shared memory and have a single operating system.
- All CPUs share system resources.
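A minimal, hypothetical Python sketch of this shared-memory style: several threads in one process update a single shared structure, coordinating through a lock, much as SMP processors share one main memory:

import threading

totals = {"rows": 0}            # one structure shared by all workers
lock = threading.Lock()

def worker(rows):
    # Every thread sees the same 'totals' dict in the same address space.
    for _ in rows:
        with lock:              # coordinate access to the shared resource
            totals["rows"] += 1

threads = [threading.Thread(target=worker, args=(range(1000),)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(totals["rows"])           # 4000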
MPP (massively parallel processing),
shared-nothing:
In Massively Parallel Processing (MPP), data is partitioned across multiple servers or nodes, with each server/node having its own memory and processors to process data locally. All communication is via a network interconnect; there is no disk-level sharing or contention to be concerned with (i.e. it is a 'shared-nothing' architecture).
Massively parallel processing (MPP) is a form of collaborative processing of the same program by two or more processors, with each processor handling a different segment (dataset) of the data.
As mentioned earlier, the main characteristic of MPP is data distribution. Data is distributed across segments to achieve data and processing parallelism. This is achieved using partitioning techniques: the data is split into segments (datasets) and distributed across the available nodes.
- An MPP system can be viewed as a set of connected SMPs.
- Each processor has exclusive access to hardware resources.
- MPP systems are physically housed in the same box.
MPP Architecture in Teradata
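By contrast with the shared-memory sketch above, a hypothetical shared-nothing sketch: each worker process owns its own slice of the data and its own memory, and the only communication is a partial result sent back over a queue (standing in for the network interconnect):

from multiprocessing import Process, Queue

def node(rows, out_q):
    # Each node works only on its own local data; nothing is shared.
    out_q.put(sum(rows))        # ship a partial result over the "interconnect"

if __name__ == "__main__":
    partitions = [range(0, 25), range(25, 50), range(50, 75), range(75, 100)]
    out_q = Queue()
    procs = [Process(target=node, args=(p, out_q)) for p in partitions]
    for p in procs:
        p.start()
    partials = [out_q.get() for _ in procs]
    for p in procs:
        p.join()
    print(sum(partials))        # 4950, combined from the per-node results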
Cluster Systems:
- UNIX systems connected via networks
- Cluster systems can be physically dispersed.
Understanding these processing methods and environments enabled me to understand the overall parallel job architecture in DataStage.
Article Source: http://EzineArticles.com/7198379