DATASTAGE
# How do you run a DataStage job from the command line?
Ans=> Use the "dsjob" command as follows:
dsjob -run -jobstatus projectname jobname
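For example, a hedged sketch (the project and job names here are hypothetical):
dsjob -run -jobstatus MyProject MyJob
dsjob -jobinfo MyProject MyJob
dsjob -logsum MyProject MyJob
With -jobstatus, dsjob waits for the job to finish and its exit code reflects the final job status, so a calling shell script can branch on it.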
Stage Variable - An intermediate processing variable that retains its value while rows are read; its value is not passed directly to a target column.
Derivation - An expression that specifies the value to be passed to the target column.
Constraint - A filter condition that limits the number of records coming from the input according to a business rule.
The evaluation order is: stage variables -> constraints -> derivations.
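As an illustration of that order (link, column and variable names are hypothetical):
Stage variable:  svTotal = InLink.QTY * InLink.PRICE
Constraint (on the output link):  svTotal > 0
Derivation (for OutLink.TOTAL_AMOUNT):  svTotal
For each input row the stage variables are evaluated first, then the constraint decides whether the row reaches the output link, and finally the derivations populate the target columns.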
# What is the difference between change capture and change apply stages?
Ans=> The Change Capture stage is used to find the differences between two sources: a before dataset and an after dataset. The before dataset is the original (reference) data, and the after dataset is the changed version being compared against it. The stage adds a column called "change_code" to its output; from this code you can tell what kind of change each row represents - insert, delete or edit.
The Change Apply stage is used along with the Change Capture stage. It reads the change_code produced by Change Capture and applies the corresponding changes to the before dataset.
In short: Change Capture captures the changes between the two sources, and Change Apply applies those changes to produce the output.
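With the default Change Capture settings the change_code values are 0 = copy, 1 = insert, 2 = delete and 3 = edit, so a downstream Transformer or Filter stage can route rows with constraints such as the following (the link and column names are illustrative):
InLink.change_code = 1   (routes the inserts)
InLink.change_code = 3   (routes the edits/updates)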
# Types of Parallel Processing?
Ans=> Parallel processing is broadly classified into 2 types:
a) SMP - Symmetric Multiprocessing (processors share memory and disk).
b) MPP - Massively Parallel Processing (each processor has its own memory and disk).
# How to handle Date conversions in
Datastage? Convert a mm/dd/yyyy format to yyyy-dd-mm?
Ans=> a) "Iconv" function - Internal Conversion.
b) "Oconv" function - External Conversion.
The function to convert mm/dd/yyyy format to yyyy-dd-mm is:
Oconv(Iconv(FieldName, "D/MDY[2,2,4]"), "D-YDM[4,2,2]")
# Functionality of Link Partitioner and Link
Collector?
Link Partitioner: It actually
splits data into various partitions or data flows using various partition
methods.
Link Collector: It collects the data coming from the partitions, merges it into a single data flow and loads it to the target.
# What is Data
Stage Engine?
It is the DataStage server engine (DSEngine), derived from the UniVerse database engine, running in the background on the server.
# What does a
Config File in parallel extender consist of?
A config file consists of the following:
a) Number of processes or nodes.
b) Actual disk storage locations.
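As a sketch, a minimal single-node configuration file (config.apt) might look like the following; the node name, fastname and paths are illustrative:
{
  node "node1"
  {
    fastname "etlserver1"
    pools ""
    resource disk "/data/ds/datasets" {pools ""}
    resource scratchdisk "/data/ds/scratch" {pools ""}
  }
}
Adding more node entries (or more disk/scratchdisk resources) is how the degree of parallelism is scaled without changing the job design.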
# What are Active and Passive stages?
Active Stage: Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data and converting data from one data type to another. E.g. Transformer, Aggregator, Sort, Row Merger etc.
Passive Stage: A passive stage handles access to databases or files for the extraction or writing of data. E.g. IPC stage, file stages, UniVerse, UniData, DRS stage etc.
# What are types of
Hashed File?
Hashed files are broadly classified into 2 types:
a) Static - subdivided into 17 types based on the primary-key pattern.
b) Dynamic - subdivided into 2 types: i) Generic ii) Specific.
Dynamic files do not perform as well as a well-designed static file, but they do perform better than a badly designed one. When creating a dynamic file you can specify a number of parameters (although all of these have default values).
By default a hashed file is "Dynamic - Type Random 30 D".
# What is DataStage Parallel Extender?
Parallel Extender provides parallel processing of data extraction and transformation. There are two types of parallel processing:
1) pipeline parallelism
2) partition parallelism.
# What is a conductor
node?
Ans=> Every parallel job run has a conductor process, which is where execution starts; a section leader process for each processing node; a player process for each set of combined operators; and an individual player process for each uncombined operator.
Whenever you want to kill a run, you should destroy the player processes first, then the section leader processes, and then the conductor process.
# What is DataStage DSTX?
Ans=> DataStage Transformation Extender was developed, owned and marketed by Mercator (e.g. version 6.5.2 is Mercator). Later Ascential Software purchased it and named it DataStage Transformation Extender (DSTX), e.g. version 7.5.1.2 is DSTX. Later still, IBM purchased Ascential Software and renamed the product IBM WTX (WebSphere Transformation Extender), e.g. version 8.1 is WTX.
DSTX/WTX is middleware used for complex transformations and is one of the major players in applications such as EDI and SWIFT. It processes live data and is an EAI tool. Its major rival software is Gentran GIS.
Like DataStage PX it has a designer where "maps" (the equivalent of jobs) are designed. Its metadata is defined in type trees using the Type Tree Maker.
# Differences
between server and parallel transformers ?
1. The main difference is that the server Transformer supports BASIC transforms only, while the parallel Transformer supports both BASIC and parallel transforms.
2. The server Transformer is BASIC-language compatible (BASIC compiler); the parallel Transformer is C++-language compatible (C++ compiler).
3. The server Transformer supports multiple input links (one stream link plus reference links), but the parallel Transformer accepts only a single input link.
4. The server Transformer accepts routines written in BASIC; the parallel Transformer accepts routines written in C/C++.
# Explain what is meant by hashing algorithm?
Ans=> The hashing algorithm is used when you create hashed files in server jobs. It takes the key field and uses it to distribute records across the groups of the hashed file while the job runs; CPU utilization during this process may vary based on the server configuration.
# How to remove duplicates in
datastage?
Ans=> In Parallel Extender (Enterprise Edition) we have the Remove Duplicates stage; using that we can remove the duplicates. Alternatively, duplicates can be removed using a Copy, Sort and Join stage combination, or by comparing the current row with the previous row in a Transformer (using stage variables).
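A sketch of the Transformer approach, assuming the input is already sorted on the key (the column and variable names are hypothetical):
Stage variable 1:  svIsDup = If InLink.CUST_ID = svPrevKey Then 1 Else 0
Stage variable 2:  svPrevKey = InLink.CUST_ID
Output link constraint:  svIsDup = 0
Declare svIsDup above svPrevKey so the comparison sees the previous row's key before it is overwritten with the current one.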
#What is the exact
difference b/w LOOKUP stage , JOIN stage and MERGE stage?
Lookup stage:
1. Can return multiple matching rows from only one reference link.
2. Can reject rows based on a constraint.
3. Can be set to fail the job on a lookup failure.
4. Does not need partitioned and sorted input.
Merge stage:
1. Can reject rows that fail to find a match.
2. Does need partitioned and sorted input.
Join stage:
1. No reject link.
2. Does need partitioned and sorted input.
# How many types of lookup stages are there in DataStage Parallel Extender?
Ans=> There is only one Lookup stage in Parallel Extender and it is of two types: normal lookup and sparse lookup. Mostly we use normal lookup for reference datasets.
# How can I handle the after-job subroutine in the Transformer stage?
Ans=> In the Transformer stage, click the stage properties button on the toolbar (left-hand corner). It displays a stage properties page like all the other job properties pages. There you can specify your before- and after-stage subroutines.
# How to enable runtime column propagation in DataStage? Where does this option exist?
Ans=> There is an option in DataStage Administrator -> Projects tab -> Properties button -> General tab -> "Enable Runtime Column Propagation for Parallel Jobs".
If you enable this, you can select runtime column propagation to specify that columns encountered by a stage in a parallel job can be used even if they are not defined explicitly in the metadata.
You can see the runtime column propagation option in most of the active stages, on the Output tab (if it exists), in the Columns sub-tab.
It is available in 3 places:
1) In DataStage Administrator: open any project's properties and you can see the Runtime Column Propagation option.
2) In the job properties you can see the same option.
3) On the Columns tab of the input or output tab of any processing stage.
Use runtime column propagation according to your requirement; if you set it at project level in Administrator, it applies to all the jobs.
# How to load the top 3 marks (column name) from a sequential file to the target?
Ans=> First sort the data on the Marks column in descending order using a Sort stage, then use a Transformer stage.
In the constraint use the system variable @INROWNUM (@INROWNUM <= 3).
In this way you get the top 3 marks.
# What is the difference between a routine, a transform and a function? Give some examples of each.
Ans=> Routines and functions sound the same; functions are a subset of routines.
Routine / Function: takes input data, applies some business logic and returns output data. Examples: StringDecode, KeyMgtGetNextValue.
Transform: takes input data, transforms it to another form of data and returns it. Examples: TIMESTAMP, NullToEmpty.
Transforms are used to transform data from one form to another, whereas routines are used to implement business logic. A routine is like a procedure which is called to perform a certain operation.
# How to unlock locked jobs in DataStage?
Ans=> Go to Director -> Job -> Cleanup Job Resources and note the PID of the locking process. Then:
1. Go to Administrator.
2. Open the project's Properties and click Command.
3. Type DS.TOOLS.
4. From the options displayed, select the 4th option, then the 6th option.
5. Enter the PID number to release the lock.
# What type of files are created when we create a hash file?
Ans=> Three files are created: .dat, .type and .over (the data, type and overflow files).
#What are Static
Hash files and Dynamic Hash files?
As the names themselves suggest what they mean. In general we use Type-30 dynamic hashed files. The data file has a default size limit of 2 GB, and the overflow file is used if the data exceeds the 2 GB size.
# What is the default size of a hash file? How to resize a hash file?
Ans=> 999 MB for reading data from the file and 999 MB for writing data into the file - approximately 2 GB for the whole hash file.
For a hash file larger than 2 GB you need to resize it: go to the DSEngine path, LOGTO the particular project and run:
>RESIZE VK6 * * * 64BIT
>ANALYZE.FILE VK3
Then reload the data into the hash file and you can use it.
# While using an Oracle database as target or source, why do we use the OCI stage and not the ODBC stage?
Ans=> The reason is performance: the OCI stage is faster when dealing with bulk data, whereas the ODBC stage is slower. But there is no rule that we should use only the OCI stage when dealing with an Oracle database.
Using OCI we can transfer data faster than with ODBC. OCI supports OraBulk and plug-ins, and using these, bulk loads can be done quickly.
Since OCI is the Oracle Call Interface, to connect to Oracle we use the native drivers specified by Oracle rather than the drivers of some other software.
When using an Oracle database as a source in parallel jobs, we use the Oracle Enterprise stage because it retrieves bulk data quickly, whereas the ODBC stage goes through ODBC drivers and is slow when retrieving bulk data.
# What are the different types of links in DataStage?
Ans=> There are 3 different links in DataStage:
1. Stream link - a straight link carrying the main data flow.
2. Reference link - acts like a lookup.
3. Reject link - used to capture rejected rows (e.g. in parallel jobs).
In server jobs two kinds of link are commonly distinguished:
1. Primary (stream) link - the straight, solid line: ________________>
2. Reference link - used for lookups, drawn as a dashed line: -------------------->
# What is DataStage Parallel Extender?
It is the parallel version of DataStage, called PX. It processes data using partitioning/collecting methods (algorithms) and is very fast compared to the older (server) version. Partitioning can be defined via the configuration file, which stores the number of nodes and resource-pool information. Various new stages were added in the PX version, for example Data Set, File Set, Row Generator, Lookup and many more.
# Displaying a date like '2009-01-05 14:29:27'
=> Oconv('2008-11-27','D-YMD[4,2,2]') : ' ' : Oconv('14:29:27','MTS')
#. Convert '2008-12-31' to '20081231'
=> (Digits(toT1.START_DATE)) or
trim(toT1.START_DATE, '-', 'A') or
(toT1.START_DATE[1,4] : toT1.START_DATE[6,2] : toT1.START_DATE[9,2])
#. Convert '20081231' to '2008-12-31'
=> Oconv(Iconv(toT1.LAST_INCDR_DATE,"DYMD[4,2,2]"),"D-YMD[4,2,2]")
Oconv(Iconv(toT1.START_DT,'D-DMY[2,2,2]'),'D-YMD[4,2,2]')
toT1.START_DATE[1,4] : "-" : toT1.START_DATE[5,2] : "-" : toT1.START_DATE[7,2]
#. Convert '01-JAN-09' to '2009-01-01'
=> Oconv(Iconv(toT1.START_DT,'D-DMAY[2,3,2]'),'D-YMD[4,2,2]')
#. Subtract one day from a date (e.g. to get '2009-01-30')
=> Oconv(Iconv(RecordDate,'D-YMD[4,2,2]') - 1, 'D-YMD[4,2,2]')
#. Pass the current record date and time -->
=> Oconv(@DATE,"D-YMD[4,2,2]") : " " : Oconv(@TIME,"MTS")
#. Display only the month from '20090112134546' (i.e. '01')
=> Oconv(Iconv(toT.CALL_ORIGINATION_DATE[5,2],"DYMD"),"DM")
#. Trigger the next job only if the previous job finished OK or finished with warnings (see log)
Link1_name.$JobStatus=1 or Link1_name.$JobStatus=2
e.g. J0GLedCurrMon.$JobStatus=1 or J0GLedCurrMon.$JobStatus=2
#. Call procedures in OCI stage After/Before SQL =>
=> CALL DEVDBA.ANLY_PIB_CDMA_STAGING ('A')
#. Default array size in OCI stages --> 32767
#. sqlldr parameters
DIRECT=TRUE SKIP=1 ERRORS=999999 READSIZE=5120000 STREAMSIZE=5120000 COLUMNARRAYROWS=50000
or
DIRECT=N PARALLEL=TRUE SKIP=1 ERRORS=100000 READSIZE=20971520 BINDSIZE=20971520 ROWS=100000
Example invocation:
sqlldr userid=#UserSum#/#PwdSum#@#DsnSum# \
  control=#SrcPathPRS#/daily/S_PURR.ctl \
  data=#SrcPathPRS#/daily/S_PRS_CURR_#CurrYYYYMM#.dat \
  log=#SrcPathPRS#/daily/S_PRS_CURR.log \
  bad=#SrcPathPRS#/daily/S_PRS_CURR.bad \
  direct=true skip=1 errors=500000 READSIZE=5120000 streamsize=5120000 COLUMNARRAYROWS=20000
# Compare and contrast ODBC and Plug-In stages?
ODBC:    a) Poor performance.
         b) Can be used for a variety of databases.
         c) Can handle stored procedures.
Plug-In: a) Good performance.
         b) Database specific (only one database).
         c) Cannot handle stored procedures.
# What is the flow of loading data into fact
& dimensional tables?
Fact table - a table with a collection of foreign keys corresponding to the primary keys in the dimension tables; it consists of fields with numeric measures.
Dimension table - a table with a unique primary key.
Load - data should first be loaded into the dimension tables. Based on the primary key values in the dimension tables, the data is then loaded into the fact table.
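A hedged SQL sketch of this load order, using hypothetical staging, dimension and fact tables:
-- 1. Load the dimension first (surrogate keys generated from a sequence)
INSERT INTO DIM_PRODUCT (PRODUCT_KEY, PRODUCT_CODE, PRODUCT_NAME)
SELECT PRODUCT_SEQ.NEXTVAL, s.PRODUCT_CODE, s.PRODUCT_NAME
FROM   STG_PRODUCT s;
-- 2. Then load the fact, resolving each foreign key against the dimension
INSERT INTO FACT_SALES (PRODUCT_KEY, SALE_DATE, QTY, AMOUNT)
SELECT d.PRODUCT_KEY, s.SALE_DATE, s.QTY, s.AMOUNT
FROM   STG_SALES   s
JOIN   DIM_PRODUCT d ON d.PRODUCT_CODE = s.PRODUCT_CODE;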
# Why do you use SQL LOADER or OCI STAGE?
When the source data is enormous, or for bulk data, we can use the OCI stage and SQL*Loader; depending on the source, the data will be transferred very quickly to the data warehouse by using SQL*Loader.
#.What are OConv () and Iconv () functions
and where are they used?
Iconv() - converts a string to an internal storage format.
Oconv() - converts an expression to an output format.
# What are Routines and where/how are they
written and have you written any routines before?
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following are the different types of routines:
1) Transform functions
2) Before/after job subroutines
3) Job control routines
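For example, a simple server transform function written in DataStage BASIC might look like this (the routine name and logic are purely illustrative; the value placed in Ans is what the routine returns):
* Transform function UpperTrim(Arg1) - returns the argument trimmed and upper-cased
Ans = UPCASE(TRIM(Arg1))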
# What are all the third party tools used in
DataStage?
Autosys, TNG and Event Coordinator are some of the ones I know of and have worked with, along with Co-Sort.
# How can we implement Lookup in DataStage
Server jobs?
We can use a hashed file as a lookup in server jobs. The hashed file needs at least one key column. Hashed files store data based on a hashing algorithm and key values, so lookups against them are fast. A database stage (e.g. DB2 or ODBC) can also be used for lookups.
In the Enterprise Edition, the Lookup stage can be used for doing lookups.
So on the server canvas we can perform two kinds of direct lookups: one by using a hashed file and the other by using a database/ODBC stage as the lookup.
#.How can we
improve the performance of DataStage jobs?
Performance and tuning of DataStage jobs:
1. Establish baselines.
2. Avoid using only one flow for tuning/performance testing.
3. Work in increments.
4. Evaluate data skew.
5. Isolate and solve.
6. Distribute file systems to eliminate bottlenecks.
7. Do not involve the RDBMS in initial testing.
8. Understand and evaluate the tuning knobs available.
# How do you eliminate duplicate rows?
In SQL: DELETE FROM table_name WHERE rowid NOT IN (SELECT MAX(rowid) FROM table_name GROUP BY column_name);
DataStage provides a Remove Duplicates stage in the Enterprise Edition. Using that stage we can eliminate duplicates based on a key column.
Duplicates can also be eliminated by loading the corresponding data into a hashed file: specify the columns on which you want to deduplicate as the keys of the hash.
So removal of duplicates can be done in two ways:
1. Use the Remove Duplicates stage, or
2. Use GROUP BY on all the columns used in the SELECT; the duplicates will go away.
#.What about System
variables?
DataStage
provides a set of variables containing useful system information that you can
access from a transform or routine. System variables are read-only.
@DATE - The internal date when the program started. See the Date function.
@DAY - The day of the month extracted from the value in @DATE.
@FALSE - The compiler replaces the value with 0.
@FM - A field mark, Char(254).
@IM - An item mark, Char(255).
@INROWNUM - Input row counter. For use in constraints and derivations in Transformer stages.
@OUTROWNUM - Output row counter (per link). For use in derivations in Transformer stages.
@LOGNAME - The user login name.
@MONTH - The current month extracted from the value in @DATE.
@NULL - The null value.
@NULL.STR - The internal representation of the null value, Char(128).
@PATH - The pathname of the current DataStage project.
@SCHEMA - The schema name of the current DataStage project.
@SM - A subvalue mark (a delimiter used in UniVerse files), Char(252).
@SYSTEM.RETURN.CODE - Status codes returned by system processes or commands.
@TIME - The internal time when the program started. See the Time function.
@TM - A text mark (a delimiter used in UniVerse files), Char(251).
@TRUE - The compiler replaces the value with 1.
@USERNO - The user number.
@VM - A value mark (a delimiter used in UniVerse files), Char(253).
@WHO - The name of the current DataStage project directory.
@YEAR - The current year extracted from @DATE.
REJECTED - Can be used in the constraint expression of a Transformer stage output link. REJECTED is initially TRUE, but is set to FALSE whenever an output link is successfully written.
#.What is the difference between server jobs & parallel jobs?
Here is the difference. Server jobs: these are available if you have installed DataStage Server. They are compiled and run on the DataStage server, connecting to other data sources as necessary. They can run on SMP and MPP machines, but performance is lower, i.e. speed is less.
Parallel jobs: these are only available if you have installed the Enterprise Edition (they are also available if you have DataStage 6.0 PX or DataStage 7.0 installed). They are compiled and run on a DataStage UNIX server and can run in parallel on SMP, MPP and cluster systems; they can also run on a separate z/OS (USS) machine if required. Parallel jobs are especially useful if you have large amounts of data to process; performance is higher, i.e. speed is high.
#.What is a hashing algorithm and explain briefly how it works?
Hashing is key-to-address translation. This means the value of a key is transformed into a disk address by means of an algorithm, usually a relative block and an anchor point within the block. How well the algorithms work is closely related to statistical probability.
It sounds fancy but these algorithms
are usually quite simple and use division and remainder techniques. Any good
book on database systems will have information on these techniques.
Interesting to note that these
approaches are called "Monte Carlo Techniques" because the behavior
of the hashing or randomizing algorithms can be simulated by a roulette wheel
where the slots represent the blocks and the balls represent the records (on
this roulette wheel there are many balls not just one).
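As a simplified illustration only (not the exact algorithm DataStage uses), a modulo-style hash could be written as:
GroupNumber = Mod(Checksum(KeyValue), Modulo) + 1
Records whose keys hash to the same group number are stored in the same block of the file, so a lookup on the key can go straight to that block instead of scanning the whole file.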
#.What are the enhancements made in DataStage 7.5 compared with 7.0?
Many new stages were introduced compared to DataStage version 7.0. In server jobs we got the Stored Procedure stage and the Command stage, and a generate-report option on the File tab. In job sequences many activities such as Start Loop, End Loop, Terminate Loop and User Variables were introduced. In parallel jobs the Surrogate Key stage and Stored Procedure stage were introduced. For all other specifications, please refer to the manual.
#.What is a data set? And what is a file set?
I assume you are referring to the Lookup File Set, which is only used by Lookup stages.
Data Set: DataStage parallel extender jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs.
File Set: DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns.
# How does the hashed file do a lookup in server jobs? How does it compare the key values?
A hashed file is used for two purposes: 1. to remove duplicate records, and 2. for reference lookups.
A hashed file record has three parts: the hashed key, the key header and the data portion. By using the hashing algorithm and the key values, the lookup is faster.
Configuration Files Related
- APT_CONFIG_FILE is the environment variable through which DataStage determines the configuration file to be used (one can have many configuration files for a project). In fact, this is what is generally used in production. However, if this environment variable is not defined, then how does DataStage determine which file to use?
- If the APT_CONFIG_FILE environment variable is not defined then DataStage looks for the default configuration file (config.apt) in the following paths:
- Current working directory.
- INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top level directory of DataStage installation.
- What are the different options a logical node can have in the configuration file?
- fastname – The fastname is the physical node name that stages use to open connections for high volume data transfers. The attribute of this option is often the network name. Typically, you can get this name by using Unix command ‘uname -n’.
- pools – Name of the pools to which the node is assigned to. Based on the characteristics of the processing nodes you can group nodes into set of pools.
- A pool can be associated with many nodes and a node can be part of many pools.
- A node belongs to the default pool unless you explicitly specify a pools list for it and omit the default pool name ("") from the list.
- A parallel job or specific stage in the parallel job can be constrained to run on a pool (set of processing nodes).
- In case job as well as stage within the job are constrained to run on specific processing nodes then stage will run on the node which is common to stage as well as job.
- resource – resource resource_type "location" [{pools "disk_pool_name"}] | resource resource_type "value". resource_type can be canonicalhostname (which takes the quoted Ethernet name of a node in the cluster that is not connected to the conductor node by the high-speed network), or disk (to read/write persistent data to this directory), or scratchdisk (the quoted absolute path name of a directory on a file system where intermediate data will be temporarily stored; it is local to the processing node), or RDBMS-specific resources (e.g. DB2, INFORMIX, ORACLE, etc.).
- How does DataStage decide on which processing node a stage should run?
- If a job or stage is not constrained to run on specific nodes then the parallel engine executes the parallel stage on all nodes defined in the default node pool (the default behavior).
- If the stage is constrained then the constrained processing nodes are chosen while executing the parallel stage. (Refer to 2.2.3 for more detail).
- When configuring an MPP, you specify the physical nodes in your system on which the parallel engine will run your parallel jobs. This is called Conductor Node. For other nodes, you do not need to specify the physical node. Also, You need to copy the (.apt) configuration file only to the nodes from which you start parallel engine applications. It is possible that conductor node is not connected with the high-speed network switches. However, the other nodes are connected to each other using a very high-speed network switches. How do you configure your system so that you will be able to achieve optimized parallelism?
- Make sure that none of the stages are specified to be run on the conductor node.
- Use conductor node just to start the execution of parallel job.
- Make sure that conductor node is not the part of the default pool.
- Although parallelization increases the throughput and speed of the process, why is maximum parallelization not necessarily the optimal parallelization?
- DataStage creates one process for every stage for each processing node. Hence, if the hardware resources are not available to support the maximum parallelization, the performance of the overall system goes down. For example, suppose we have an SMP system with three CPUs and a parallel job with 4 stages. We have 3 logical nodes (one corresponding to each physical node, say a CPU). Now DataStage will start 3*4 = 12 processes, which have to be managed by a single operating system. Significant time will be spent in switching context and scheduling the processes.
- Since we can have different logical processing nodes, it is possible that some nodes will be more suitable for some stages while other nodes will be more suitable for other stages. So how do you decide which node will be suitable for which stage?
- If a stage is performing a memory-intensive task then it should be run on a node which has more memory and scratch disk space available for it. E.g. sorting data is a memory-intensive task and it should be run on such nodes.
- If some stage depends on licensed version of software (e.g. SAS Stage, RDBMS related stages, etc.) then you need to associate those stages with the processing node, which is physically mapped to the machine on which the licensed software is installed. (Assumption: The machine on which licensed software is installed is connected through other machines using high speed network.)
- If a job contains stages, which exchange large amounts of data then they should be assigned to nodes where stages communicate by either shared memory (SMP) or high-speed link (MPP) in most optimized manner.
- Basically nodes are nothing but sets of machines (especially in MPP systems). You start the execution of parallel jobs from the conductor node. The conductor node creates a shell on the remote machines (depending on the processing nodes) and copies the same environment to them. However, it is possible to create a startup script which will selectively change the environment on a specific node. This script has a default name of startup.apt. However, like the main configuration file, we can also have many startup configuration files. The appropriate configuration file can be picked up using the environment variable APT_STARTUP_SCRIPT. What is the use of the APT_NO_STARTUP_SCRIPT environment variable?
- Using APT_NO_STARTUP_SCRIPT environment variable, you can instruct Parallel engine not to run the startup script on the remote shell.
- What are the generic things one must follow while creating a configuration file so that optimal parallelization can be achieved?
- Consider avoiding the disk/disks that your input files reside on.
- Ensure that the different file systems mentioned as the disk and scratchdisk resources hit disjoint sets of spindles even if they’re located on a RAID (Redundant Array of Inexpensive Disks) system.
- Know what is real and what is NFS:
- Real disks are directly attached, or are reachable over a SAN (storage-area network -dedicated, just for storage, low-level protocols).
- Never use NFS file systems for scratchdisk resources, remember scratchdisk are also used for temporary storage of file/data during processing.
- If you use NFS file system space for disk resources, then you need to know what you are doing. For example, your final result files may need to be written out onto the NFS disk area, but that doesn’t mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. Better to setup a “final” disk pool, and constrain the result sequential file or data set to reside there, but let intermediate storage go to local or SAN resources, not NFS.
- Know what data points are striped (RAID) and which are not. Where possible, avoid striping across data points that are already striped at the spindle level.
DATASTAGE ADMIN RELATED QUESTIONS/ANSWERS
1. DSEngine stop and start process (as the operator user)
-bash-3.00$ cd /u01/app/datastage/product/Ascential/DataStage/DSEngine
-bash-3.00$ . ./dsenv
-bash-3.00$ cd bin
-bash-3.00$ ./rfcprocess stop
-bash-3.00$ ps -ef | grep rfc
-bash-3.00$ ./rfcprocess start
-bash-3.00$ ps -ef | grep rfc
_______________________________________
Ask all users to log out from the system, then check.
_____________________________________________
As the dsadm user:
cd /u01/app/datastage/product/Ascential/DataStage/DSEngine
. ./dsenv
bin/uv -admin -stop
Then run these commands and kill the processes =>
netstat -a | grep dsr
netstat -a | grep dsrpc    -> for the instance you want to shut down, expect that the server instance is listening and there is no ESTABLISHED, FIN-WAIT or CLOSE-WAIT connection
ps -ef | grep dsrpcd       -> expect the instance you want to shut down to be returned
ps -ef | grep dscs         -> for the instance you want to shut down, expect nothing returned
ps -ef | grep slave        -> for the instance you want to shut down, expect nothing returned
ps -ef | grep phantom      -> for the instance you want to shut down, expect nothing returned
ps -ef | grep osh          -> for the instance you want to shut down, expect nothing returned
ps -ef | grep operator     -> find all processes running as operator
ps -ef | grep dsapi
ps -ef | grep sortcl       -> find all processes running sortcl
ps -ef | grep sqlldr       -> find all processes running sqlldr
lsof -i | grep process_id
If a process still exists then list it out and kill it:
netstat -a | grep dsr
P0471DSSEL-D1.dsrpc 10.8.80.72.1109 64891 119 49640 0 ESTABLISHED
lsof -i | grep 1109
________________________________
In short =>
cd /u01/app/datastage/product/Ascential/DataStage/DSEngine
. ./dsenv
bin/uv -admin -stop
bin/uv -admin -start
2. If a job gets locked, how to unlock it =>
Ans a) Go to the Director => Job => Cleanup Job Resources.
Ans b) Go to the Unix prompt, then the DSEngine path:
-bash-3.00$ cd /u14/app/datastagepx/product/Ascential/DataStage/DSEngine
-bash-3.00$ . ./dsenv
-bash-3.00$ bin/uv
>LOGTO DSS_RTL_PROD
>DS.TOOLS
3. How to find out editable jobs in any project.
cd `cat /.dshome`
or
cd /u01/app/datastage/product/Ascential/DataStage/DSEngine
. ./dsenv
bin/uv
SELECT NAME, CATEGORY, READONLY FROM DS_JOBS WHERE NAME NOT LIKE '\\%' AND READONLY = 'NRO';
SELECT COUNT(1) FROM DS_JOBS;
SELECT * FROM DS_JOBS WHERE NAME='CopyOfJ1PubSrc';  ==> for viewing the exact location
_______________________________________
To find the jobs that use a particular hashed file as input:
select category as catgry fmt '25l', ds_jobs.name as job_name fmt '35l'
from ds_jobobjects, ds_jobs
where ds_jobobjects.objidno = ds_jobs.jobno
and ds_jobobjects.oletype = 'CHashedInput'
and eval ds_jobobjects."@record<6>" = 'h_cMD_ct_2';
4. To check which jobs use a particular hashed file as output:
select category as catgry fmt '25l', ds_jobs.name as job_name fmt '35l'
from ds_jobobjects, ds_jobs
where ds_jobobjects.objidno = ds_jobs.jobno
and ds_jobobjects.oletype = 'CHashedOutput'
and eval ds_jobobjects."@record<6>" = 'H_ERC_PMD_DETAILS';
5. For checking CPU speed and running process counts =>
operator-17:20:28$ vmstat 2 5
 kthr      memory            page            disk          faults      cpu
 r b w   swap    free   re   mf   pi  po fr de sr s0 s1 s6 s1   in    sy    cs  us sy id
26 9 0 140049512 136446232 13158 3309 86566 38 38 0 1 9 0 -0 0 21741 218915 33053 41 28 31
6. For checking users and CPU idle counts => sar 2 10
operator-14:23:41$ sar 2 10
SunOS P0471DSSEL 5.10 Generic_138888-03 sun4u 09/10/2009
14:23:50  %usr  %sys  %wio  %idle
14:23:55    86    14     0      0
14:24:00    85    15     0      0
7. For checking processes running on the server =>
operator-14:24:46$ top
last pid: 14675;  load averages: 95.49, 92.49, 93.82    18:29:32
1016 processes: 950 sleeping, 48 running, 2 zombie, 7 stopped, 9 on cpu
CPU states: 0.0% idle, 25.8% user, 74.2% kernel, 0.0% iowait, 0.0% swap
Memory: 160G real, 125G free, 45G swap in use, 102G swap free
  PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME   CPU COMMAND
 7678 operator  25  32    0   18G 2653M cpu/531  7:05 1.15% sortcl
14960 operator  11  43    0   26M   12M cpu/19  27:25 0.70% sortcl
8. How to bind CPUs to a process (so that the process can occupy only the defined CPUs; the remaining CPUs stay free for other processes).
=> psrinfo => shows all CPUs, i.e. how many CPUs are in the server:
0 on-line since 10/16/2009 23:12:42
1 on-line since 10/16/2009 23:12:43
2 on-line since 10/16/2009 23:12:43
532 on-line since 10/16/2009 23:12:43
534 on-line since 10/16/2009 23:12:43
535 on-line since 10/16/2009 23:12:43
=> psrinfo | wc -l => check the CPU count on the server:
48
=> psrset -i => shows how many processor sets (groups of CPUs) are defined on the machine. Defining the sets is done by the root user only.
operator-10:56:34$ psrset -i
user processor set 1: processors 4 516 5 517
user processor set 2: processors 8 520 9 521 10 522 11 523
user processor set 3: processors 12 524 14 526
user processor set 4: processors 16 528 17 529
user processor set 5: processors 20 532 21 533 22 534 23 535
=> After starting the main job or main script, execute these commands to bind the process IDs:
ps -ef | grep -i bpst | awk '{print $2}' | grep -v -w 1 | xargs sudo /usr/sbin/psrset -b 1
ps -ef | grep -i bpst | awk '{print $3}' | grep -v -w 1 | xargs sudo /usr/sbin/psrset -b 1
=> After those processes complete, execute this to unbind all bound process IDs:
sudo /usr/sbin/psrset -q | awk '{print $3}' | awk -F":" '{print $1}' | xargs sudo /usr/sbin/psrset -u
DATAWAREHOUSING CONCEPTS
# Types of Dimensional Modeling?
Dimensional modeling is a logical design technique that seeks to present the data in a standard framework that is intuitive and allows high-performance access.
Dimensional modeling is subdivided into 2 types:
a) Star schema - simple and much faster; denormalized form.
b) Snowflake schema - more complex, with more granularity; more normalized form.
#What
is Star Schema?
A star schema is a denormalized multi-dimensional model. It contains a centralized fact table surrounded by dimension tables.
Dimension table: contains a primary key and descriptive attributes referenced by the fact table.
Fact table: contains foreign keys to the dimension tables, measures and aggregates.
#What
is Snowflake schema?
It is a partially normalized dimensional model in which at least one dimension is represented by two or more hierarchically related tables.
# Why do we need a surrogate key?
It is used for integrating data from multiple sources and can serve better than a natural primary key.
It helps with index maintenance, joins, table size, key updates, disconnected inserts and partitioning.
#Explain
Types of Fact Tables?
Factless fact: contains only foreign keys to the dimension tables (no measures).
Additive fact: measures can be added across any dimension.
Semi-additive fact: measures can be added across some dimensions only. E.g. percentage, discount.
Non-additive fact: measures cannot be added across any dimension. E.g. average.
Conformed fact: the measures of two fact tables are defined identically, so the facts can be measured across the dimensions with the same set of measures.
#
Explain the Types of Dimension Tables?
Conformed dimension: a dimension table that is connected to more than one fact table, with the granularity defined in the dimension table common across those fact tables.
Junk dimension: a dimension table that contains only flags.
Monster dimension: a dimension that changes rapidly.
Degenerate dimension: a dimension value stored directly in the fact table (line-item-oriented fact table design).
#What
index is created on Data Warehouse?
Bitmap index is created in Data Warehouse.
# What is Full load
& Incremental or Refresh load?
Answer: Full Load: completely
erasing the contents of one or more tables and reloading with fresh data.
Incremental Load: applying ongoing
changes to one or more tables based on a predefined schedule.
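A hedged SQL sketch of an incremental extract, assuming a hypothetical LAST_UPDATED audit column and a stored last-run timestamp:
SELECT *
FROM   SRC_ORDERS
WHERE  LAST_UPDATED > :last_successful_run_time;
A full load, by contrast, would simply truncate the target table and reload everything from the source.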
# Compare ETL & Manual development?
Answer: ETL - the process of extracting data from multiple sources (e.g. flat files, XML, COBOL, SAP etc.) is much simpler with the help of tools.
Manual - loading data from anything other than flat files and Oracle tables needs more effort.
ETL - high and clear visibility of the logic.
Manual - complex and not so user-friendly visibility of the logic.
ETL - contains metadata, and changes can be made easily.
Manual - no metadata concept, and changes need more effort.
ETL - error handling, log summaries and load progress make life easier for the developer and maintainer.
Manual - needs maximum effort from a maintenance point of view.
ETL - can handle historic data very well.
Manual - as data grows, the processing time degrades.
These are some differences between manual and ETL development.
# What is the difference between a data
warehouse and a data mart?
This is a heavily debated issue. There are inherent similarities between the basic constructs used to design a data warehouse and a data mart. In general a data warehouse is used at an enterprise level, while a data mart is used at a business division/department level. A data mart contains only the subject-specific data required for local analysis.
# What is Index . How many type of Index .
Ans=>Indexes are optional structures associated with tables and
cluster.
Type of Index :-
- B-Tree indexes
- B-Tree cluster indexes
- Hash cluster indexes
- Reverse key indexes
- Bitmap Indexes .
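For example, in Oracle a bitmap index suits a low-cardinality fact-table foreign key, while an ordinary B-tree index suits a more selective column (table and column names are illustrative):
CREATE BITMAP INDEX IDX_SALES_PRODUCT ON FACT_SALES (PRODUCT_KEY);
CREATE INDEX IDX_CUST_NAME ON DIM_CUSTOMER (CUSTOMER_NAME);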
#. How can you implement slowly changing dimensions in DataStage? Explain.
SCDs are of three types. Type 1 - overwrite the change. Type 2 - version the modified change. Type 3 - keep history of the modified change by adding a new column to hold the previous value.
SCD Type 1 --> just use "insert rows else update rows" or "update rows else insert rows" in the update action of the target.
SCD Type 2 --> you have to use one hash file to look up the target, take 3 instances of the target, give different conditions depending on the process, give different update actions in the target, and use system variables like Sysdate and Null.
We can handle SCDs in the following ways:
Type I: just overwrite;
Type II: we need versioning and dates;
Type III: add old and new copies of certain important fields.
Hybrid dimensions: a combination of Type II and Type III.
Yes, you can implement Type 1, Type 2 or Type 3.
Let me try to explain Type 2 with a timestamp.
Step 1: The timestamp is created via a shared container; it returns the system time and a key. To satisfy the lookup condition we create a key column using the Column Generator.
Step 2: Our source is a Data Set and the lookup table is an Oracle OCI stage. Using the Change Capture stage we find the differences; the stage returns a value in change_code, and based on that value we know whether the row is an insert, edit or delete. If it is an insert, we write the row with the current timestamp and keep the old row, with its old timestamp, as history.
# how to implement type2 slowly changing
dimensions in data stage?explain with example?
We can handle SCDs in the following ways:
Type 1: just use "Insert rows Else Update rows" or "Update rows Else Insert rows" in the update action of the target.
Type 2: use the steps as follows:
a) Use one hash file to look up the target.
b) Take 3 instances of the target.
c) Give different conditions depending on the process.
d) Give different update actions in the target.
e) Use system variables like Sysdate and Null.
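A minimal SQL sketch of the Type 2 idea, assuming a hypothetical DIM_CUSTOMER table with EFF_DATE, END_DATE and CURRENT_FLAG columns:
-- close the current version of the changed row
UPDATE DIM_CUSTOMER
SET    END_DATE = CURRENT_DATE, CURRENT_FLAG = 'N'
WHERE  CUST_ID = :cust_id AND CURRENT_FLAG = 'Y';
-- insert the new version with a fresh surrogate key
INSERT INTO DIM_CUSTOMER (CUST_SK, CUST_ID, ADDRESS, EFF_DATE, END_DATE, CURRENT_FLAG)
VALUES (CUST_SEQ.NEXTVAL, :cust_id, :new_address, CURRENT_DATE, NULL, 'Y');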