Friday, 5 December 2014

DatastageFAQ's with Answers_v1.0




                                                            DATASTAGE

Using "dsjob" command as follows. dsjob -run -jobstatus projectname jobname  

Stage Variable - An intermediate processing variable that retains value during read and doesnt pass the value into target column.
Derivation - Expression that specifies value to be passed onto the target 
Constraint- is like a filter condition which limits the number of records coming from input according to business rule.
The right order is => stage variables ->constraints ->Derivations

# What is the difference between change capture and change apply stages?
Ans=> Change capture stage is used to get the difference between two sources i.e. after dataset and before dataset. The source which is used as a reference to capture the changes is called after dataset. The source in which we are looking for the change is called before dataset. This change capture will add one field called "chage code" in the output from this stage. By this change code one can recognize which kind of change this is like whether it is delete, insert or update.
Change apply stage is used along with change capture stage. It takes change code from the change capture stage and apply all the changes in the before dataset based on the change code.
Change Capture is used to capture the changes between the two sources.
Change Apply will apply those changes in the output file.

# Types of Parallel Processing?
Ans=> Parallel Processing is broadly classified into 2 types.
a) SMP - Symmetrical Multi Processing.
b) MPP - Massive Parallel Processing.

# How to handle Date conversions in Datastage? Convert a mm/dd/yyyy format to yyyy-dd-mm?
Ans=>a) We use a) "Iconv" function - Internal Conversion.
b) "Oconv" function - External Conversion.
Function to convert mm/dd/yyyy format to yyyy-dd-mm is
Oconv (Iconv (Filedname,"D/MDY[2,2,4]"),"D-MDY[2,2,4]")

# Functionality of Link Partitioner and Link Collector?
Link Partitioner: It actually splits data into various partitions or data flows using various partition methods.
Link Collector: It collects the data coming from partitions, merges it into a single data flow and loads to target.

# What is Data Stage Engine?
It is a JAVA engine running at the background.

# What does a Config File in parallel extender consist of?
 Config file consists of the following.
a) Number of Processes or Nodes.
b) Actual Disk Storage Location.

#what are Active and Passive stages?
Active Stage: Active stage model the flow of data and provide mechanisms for combining data streams, aggregating data and converting data from one data type to another. Eg, Transformer, aggregator, sort, Row Merger etc.
Passive Stage: A Passive stage handles access to Database for the extraction or writing of data. Eg, IPC stage, File types, Universe, Unidata, DRS stage etc.

# What does a Config File in parallel extender consist of?
A) Config file consists of the following.
a) Number of Processes or Nodes.
b) Actual Disk Storage Location.

# What are types of Hashed File?
Hashed File is classified broadly into 2 types.
a) Static - Sub divided into 17 types based on Primary Key Pattern.
b) Dynamic - sub divided into 2 types
i) Generic ii) Specific.
Dynamic files do not perform as well as a well, designed static file, but do perform better
than a badly designed one. When creating a dynamic file you can specify the following
Although all of these have default values)
By Default Hashed file is "Dynamic - Type Random 30 D"

#What is DataStage parallel Extender ?
Parallel extender is that the parallel processing of data extraction and transformation application . there are two types of parallel processing
1) pipeline parallelism
2) partition parallelism.

# What is a conductor node?
Ans->Actually every process contains a conductor process where the execution was started and a section leader process for each processing node and a player process for each set of combined operators and a individual player process for each uncombined operator.
 
         when ever we want to kill a process we should have to destroy the player process and then section leader process and then conductor process.

# What is datastage DSTX?
Ans=>DataStage Transformation Extender was developed owned and marketed my Meractor(an aviation company). e.g. Version 6.5.2 is Mercator. Later on Ascential Softwares purchased it and named it as DataStage Transformation Extender(DSTX).e.g. Version 7.5.1.2 is DSTX Later on IBM purchased Asential Softwares and renamed the product as IBM-WTX(Websphere Transformation Extender) e.g. Version 8.1 is WTX
                 DSTX/WTX is a middleware that is used in complex transformations and one of the major players in applications like EDI and SWIFT. It processes live data and is an EAI tool.Its major rival software is Gentran GIS.
                  Like DSPX it has a designer where "maps" (equivalent of jobs) are written down/designed. Its metadata is defined in Type Trees in Type Tree maker.

# Differences between server and parallel transformers ?
1.The main difference is server Transformer supports basic transforms only,but in pararllel both basic and pararllel transforms.
2.server transformer is basic language(Basice Compiler) compatability,pararllel transformer is c++ language compatabillity(C++ Compiler ).
3. Server transformer supports Multiple inputs (One main other refrence links) but Parallel does’nt.
4. Multiple input linkds--single input link
5. Accepts routines which r written in basic language-- in c/c++ language




# Explain What is mean by Hashing Algorithm? 
Ans=>Hashing alogarithm is used when you create hash files in server jobs. it takes key field and distribute records avilable cpus in DS sercver while running the DS job. the cpu utiliztion in this process may vary based on server configuration.

# How to remove duplicates in datastage? 
We have the remove duplicate stage in the pararllel extender enterprise edition using that we eill remove the duplicates (or) Using copy,sort,join stage we will remove the duplicates.
Use rowprevious compare function in transformer.

#What is the exact difference b/w LOOKUP stage , JOIN stage and MERGE stage?
Lookup stage:
1.Can only return multiple matching rows from one reference.
2. Can reject rows based on constraint.
3. Can set failure.
4. Does not need partitioned and sorted input.
Merge stage:
1. Can reject rows if fail to find match.
2. Does need partitioned and sorted input.
Join stage.
1. No reject.
2. Does need partitioned and sorted input.

#What is the difference between the routine,transform and function?give some examples for each? 
Ans=>main difference is routines are return the value transform are cannot return the value.
Routine / Function: taking input data and building some buziness logic and give output data
Examples: StringDecode KeyMgtGetNextValue
Transform : taking input data and transforms to another form of data and return
Examples: TIMESTAMP NullToEmpty

#How many types of look up stages in datastage parallel ext?
Ans=> There is only one look up stage in parallel extender and it is of two types one is normal look up and another one is Sparse look up mostly we normal look up for the reference datasets.

#How can i handle the after job subroutine in the transformer stage?
Ans=>In Transformer Stage click on properities tab in tool bar(Left corner tab). It displays stage properities page like all the other job properities page. There you can specify your before after stage subroutins.

#How to enable the runtime column propagation in the datastage?Where does this option exists?
Ans=>There is an option in data stage administrator->projects tab->properties button->general tab-> enable runtime column propogation for parallel jobs.
                If you enable this you can select runtime propogation to specify that columns encountered by a stage in a parallel job can be used even if they are not defined in the meta data explicitly.
               You can see the runtime propogation option in most of the active stages in the output tab(if exists) in columns sub tab.

It is available in 3 places
1)When u open the ds Administrator ; there open any project properties;there u can see the Runtime Column Propagation.
2) If u open the job properties there u can see the same option.
3) When u click the column tab of input or output tab of any processing stage there u will have the option.
         use Runtime column Propagation according to ur requirement.If u mention at project level in Administrator it'll be fixed to all the jobs.
#how to load the top 3 marks(column name) from seqential file to the target ?
Ans=>First sort the data on Marks column using sort stage then use transformer stage.
In constraints use system variable @INROWNUM (@INROWNUM< 3).
In this way you can get top 3 marks.

#what is the difference between the routine,transform and function?give some examples for each?
Ans=>routine and function are sounds same functions are subset of routines..
Routine / Function : taking input data and building some buziness logic and give output data
Examples: StringDecode KeyMgtGetNextValue
Transform : taking input data and transforms to another form of data and return
Examples: TIMESTAMP NullToEmpty
HiTransformers are used to transform the data from one from to other form where as routines are used to implement the business logic.
Routine : Routine is like procedure which is called to process certain operation.

#how to remove the locked jobs using datastage?
Ans=>Go to Director Tools then click Clear the Job Resources optionthere u note the PID Number 1.Go to Administrator 2.Properties3.Command 4.type ds.tools5. u will get options6.select 4th option6.select the 6th option7.Then enter the PID Number.

#what type of files are created when we create the hash file?
Ans=>These are the three types of files will be created .dat .type .over

#What are Static Hash files and Dynamic Hash files?
As the names itself suggest what they mean. In general we use Type-30 dynamic Hash files. The Data file has a default size of 2Gb and the overflow file is used if the data exceeds the 2GB size.

#what is the default size of the hash file? How to Resize the hash file ?
Ans=>
999 MB for reading data into file.
999 MB for writing data into file.
Thats Approx.2GB for whole Hash File.

For More than 2 GB hash File need to Resize that hash file then Go to DSEngine Path and LOGTO to particular project and ..
>RESIZE VK6 * * * 64BIT
>ANALYZE.FILE VK3
Then Reload the data into hash file then u can use.


#While using oracle database as target or Source why we will use OCI stage why don't we use ODBC stage?
Ans=>The reason is performance,OCI stage is faster while dealing with bulkdata and where it comes to ODBC stage its slow.But one thing ther is no rule that we Should use only OCI stage while dealing withoracle database.
Useing OCI we can transefer the data rather than ODBC.
OCI supports OraBulk and PlugIns useing these concepts Bulk loads possible quickly.
Since OCI is Oracle Connectivity Interface.so connect to oracle we have have user drivers which specified by oracle not the drivers od some other software.
While using Oracle Database as source we use oracle enterprise stage because it retrieves bulk of data fastly where as the odbc stage has drivers and are slow while retreving bulk of data.
The reason is performance OCI stage is faster while dealing with bulkdata and where it comes to ODBC stage its slow.But one thing ther is no rule that we Should use only OCI stage while dealing withoracle database.


#What are the different type of links in Datastage?
Ans=> they are 3 different links in the datastage
1.stream link means stright link
2.Reference link it acts like a lookup.
3.Rejected link used in parallel jobs
=>two links avilable in datastage
1.primary link Means strightly forwarded standard line ________________>
2. reference link means: reference link means:horizentling table  -------------------->

# What is DataStage parallel Extender ?
its a new version of datastage called PX it process data using partion/collection methods (algorithms) it is very fast compare to older (server) version..partioning can be defined via Config file which stores no of nodes and resource pool information. there are various new stages added to PX version for example dataset fileset row generator look up stage...and many morecheers
                                Parallel extender is that the parallel processing of data extraction and transformation application . there are two types of parallel processing
1) pipeline parallelism
2) partition parallelism.

# Displaying Date like '2009-01-05 14:29:27'
            => Oconv('2008-11-27','D-YMD[4,2,2]'):' ':Oconv('14:29:27','MTS')

#. Convert '2008-12-31' to '20081231'
=> (Digits (toT1.START_DATE)) or  
trim(toT1.START_DATE, '-','A') or
             (toT1.START_DATE[1,4]:toT1.START_DATE[6,2]:toT1.START_DATE[8,2])

#.Convert '20081231' to '2008-12-31'
=> Oconv(IConv(toT1.LAST_INCDR_DATE,"DYMD[4,2,2]"),"D-YMD[4,2,2]")
                   Oconv(Iconv(toT1.START_DT,'D-DMY,[2,2,2]') ,'D-YMD,[4,2,2]')
                  START_DATE[1,4]:"-":toT1.START_DATE[5,2]:"-":toT1.START_DATE[7,2]

#.Convert '01-JAN-09' to '2008-12-31'
=> OCONV(ICONV(toT1.START_DT,'D-DMAY[2,3,2]'),'D-YMD[4,2,2]')

#.Convert date -1 and '2009-01-30'
=>  Oconv(Iconv(RecordDate,'D-YMD[4,2,2]')-1,'D-YMD[4,2,2]')

#.Pass Record Date -->
=> Oconv(@DATE,"D-YMD[4,2,2]") : " ": oconv(@TIME,"MTS")

#.Display only Month from '20090112134546' only '01'
=>Oconv(Iconv(toT.CALL_ORIGINATION_DATE[5,2],"DYMD"),"DM")


#. Want to Trigger next job if previous job is finished with see log/finished properly
Link1_name.$JobStatus=1 or Link1_name.$JobStatus=2
J0GLedCurrMon.$JobStatus=1 or J0GLedCurrMon.$JobStatus=2


#. Call procedures in OCI Stages After/Before Queries=>
=>CALL DEVDBA.ANLY_PIB_CDMA_STAGING ('A')

#. Default Array Size  in OCI Stages--> 32767

#. Sqlldr param’s
DIRECT=TRUE SKIP=1 ERRORS=999999 READSIZE=5120000 STREAMSIZE=5120000 COLUMNARRAYROWS=50000
DIRECT=N PARALLEL=TRUE SKIP=1 ERRORS=100000 READSIZE=20971520 BINDSIZE=20971520 ROWS=100000

Sqlldr userid=#UserSum#/#PwdSum#@#DsnSum#
control=#SrcPathPRS#/daily/S_PURR.ctl
data=#SrcPathPRS#/daily/S_PRS_CURR_#CurrYYYYMM#.dat
log=#SrcPathPRS#/daily/S_PRS_CURR.log
bad=#SrcPathPRS#/daily/S_PRS_CURR.bad
direct=true skip=1 errors=500000  READSIZE=5120000 streamsize=5120000 COLUMNARRAYROWS=20000

# What does a Config File in parallel extender consist of?
Config file consists of the following.
a) Number of Processes or Nodes.
b) Actual Disk Storage Location.

# Types of Parallel Processing?
 Parallel Processing is broadly classified into 2 types.
a) SMP - Symmetrical Multi Processing.
b) MPP - Massive Parallel Processing.

# Compare and Contrast ODBC and Plug-In stages?
ODBC : a) Poor Performance.
b) Can be used for Variety of Databases.
c) Can handle Stored Procedures.
Plug-In: a) Good Performance.
b) Database specific.(Only one database)
c) Cannot handle Stored Procedures.

# What is the flow of loading data into fact & dimensional tables?
Fact table - Table with Collection of Foreign Keys corresponding to the Primary Keys in Dimensional table. Consists of fields with numeric values.
Dimension table - Table with Unique Primary Key.
Load - Data should be first loaded into dimensional table. Based on the primary key values in dimensional table, the data should be loaded into Fact table.

# Why do you use SQL LOADER or OCI STAGE?
When the source data is anormous or for bulk data we can use OCI and SQL loader depending upon the source Data Will transfer very quickly to the Data Warehouse by using SQL Loader.

#.What are OConv () and Iconv () functions and where are they used?  
IConv() - Converts a string to an internal storage format
OConv() - Converts an expression to an output format.

# What are Routines and where/how are they written and have you written any routines before?   
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit. The following are different types of routines:
   1) Transform functions
   2) Before-after job subroutines
   3) Job Control routines
# What are all the third party tools used in DataStage?
Autosys,TNG,event coordinator are some of them that I know and worked with co-sort

# How can we implement Lookup in DataStage Server jobs?
We can use a Hash File as a lookup in server jobs. The hash file needs atleast one key column to create.  by using the hashed files u can implement the lookup in datasatge,hashed files stores data based on hashed algorithm and key values The DB2 stage can be used for lookups.
In the Enterprise Edition, the Lookup stage can be used for doing lookups.
In server canvs we can perform 2 kinds of direct lookups
One is by using a hashed file and the other is by using Database/ODBC stage as a lookup.

#.How can we improve the performance of DataStage jobs?
  Performance and tuning of DS jobs:
       1.Establish Baselines
       2.Avoid the Use of only one flow for tuning/performance testing
       3.Work in increment
       4.Evaluate data skew
       5.Isolate and solve
       6.Distribute file systems to eliminate bottlenecks
       7.Do not involve the RDBMS in intial testing
       8.Understand and evaluate the tuning knobs available.

# How do you eliminate duplicate rows?
Delete from from table name where rowid not in(select max/min(rowid)from emp group by column name)
 Data Stage provides us with a stage Remove Duplicates in Enterprise edition. Using that stage we can eliminate the duplicates based on a key column.
  The Duplicates can be eliminated by loading thecorresponding data in the Hash file. Specify the columns on which u want to eliminate as the keys of hash.
removal of duplicates done in two ways:
1. Use "Duplicate Data Removal" stage  or
2. use group by on all the columns used in select , duplicates will go away.

#.What about System variables?
DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only.
 @DATE The internal date when the program started. See the Date function.
 @DAY The day of the month extracted from the value in @DATE.
 @FALSE The compiler replaces the value with 0.
 @FM A field mark, Char(254). 
 @IM An item mark, Char(255).
 @INROWNUM Input row counter. For use in constrains and derivations in Transformer stages.
 @OUTROWNUM Output row counter (per link). For use in derivations in Transformer stages.
 @LOGNAME The user login name. 
 @MONTH The current extracted from the value in @DATE.
 @NULL The null value. 
 @NULL.STR The internal representation of the null value, Char(128). 
 @PATH The pathname of the current DataStage project.
 @SCHEMA The schema name of the current DataStage project.
 @SM A subvalue mark (a delimiter used in UniVerse files), Char(252). 
 @SYSTEM.RETURN.CODE 
Status codes returned by system processes or commands. 
 @TIME The internal time when the program started. See the Time function.
 @TM A text mark (a delimiter used in UniVerse files), Char(251).
 @TRUE The compiler replaces the value with 1.
 @USERNO The user number. 
 @VM A value mark (a delimiter used in UniVerse files), Char(253).
 @WHO The name of the current DataStage project directory. 
 @YEAR The current year extracted from @DATE.
 REJECTED Can be used in the constraint expression of a Transformer stage of an output link. REJECTED is initially TRUE, but is set to FALSE whenever an output link is successfully written.

#.what is difference between serverjobs & paraller jobs
Here is the diff Server jobs. These are available if you have installed DataStage Server. They run on the DataStage Server, connecting to other data sources as necessary.
Parallel jobs:These are only available if you have installed Enterprise Edition. These run on DataStage servers that are SMP, MPP, or cluster systems. They can also run on a separate z/OS (USS) machine if required.
The Parallel jobs are also available if you have Datastage 6.0 PX, or Datastage 7.0 versions installed.
The Parallel jobs are especially usefule if you have large amounts of data to process.
Parallel jobs can be run only on cluster machines .Here performance is high i.e speed is high
Server jobs: These are compiled and run on DataStage Server
Parallel jobs: These are available only if you have Enterprise Edition installed. These are compiled and run on a DataStage Unix Server, and can be run in parallel on SMP, MPP, and cluster systems.
Server jobs can be run on SMP,MPP machines.Here performance is low i.e speed is less.

#.what is hashing algorithm and explain breafly how it works?
hashing is key-to-address translation. This means the value of a key is transformed into a disk address by means of an algorithm, usually a relative block and anchor point within the block. It's closely related to statistical probability as to how well the algorithms work.
          It sounds fancy but these algorithms are usually quite simple and use division and remainder techniques. Any good book on database systems will have information on these techniques.
          Interesting to note that these approaches are called "Monte Carlo Techniques" because the behavior of the hashing or randomizing algorithms can be simulated by a roulette wheel where the slots represent the blocks and the balls represent the records (on this roulette wheel there are many balls not just one).

#.what are the enhancements made in datastage 7.5 compare with 7.0
Many new stages were introduced compared to datastage version 7.0. In server jobs we have stored procedure stage, command stage and generate report option was there in file tab. In job sequence many stages like startloop activity, end loop activity,terminate loop activity and user variables activities were introduced. In parallel jobs surrogate key stage, stored procedure stage were introduced. For all other specifications, please refer to the manual.raj.

#.what is data set? and what is file set?
I assume you are referring Lookup fileset only.It is only used for lookup stages only.Dataset: DataStage parallel extender jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs.FileSet: DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns.

# How the hash file is doing lookup in serverjobs?How is it comparing the key values?
Hashed File is used for two purpose:  1. Remove Duplicate Records  2. Then Used for reference lookups.
The hashed file contains 3 parts: Each record having Hashed Key, Key Header and Data portion.By using hashed algorith and the key valued the lookup is faster.

Configuration Files Related

  1. APT_CONFIG_FILE is the file using which DataStage determines the configuration file (one can have many configuration files for a project) to be used. In fact, this is what is generally used in production. However, if this environment variable is not defined then how DataStage determines which file to use?
    1. If the APT_CONFIG_FILE environment variable is not defined then DataStage look for default configuration file (config.apt) in following path:
      1. Current working directory.
      2. INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top level directory of DataStage installation.
  2. What are the different options a logical node can have in the configuration file
    1.  fastname – The fastname is the physical node name that stages use to open connections for high volume data transfers. The attribute of this option is often the network name. Typically, you can get this name by using Unix command ‘uname -n’.
    2. pools – Name of the pools to which the node is assigned to. Based on the characteristics of the processing nodes you can group nodes into set of pools.  
      1. A pool can be associated with many nodes and a node can be part of many pools. 
      2. A node belongs to the default pool unless you explicitly specify apools list for it, and omit the default pool name (“”) from the list. 
      3. A parallel job or specific stage in the parallel job can be constrained to run on a pool (set of processing nodes). 
        1.  In case job as well as stage within the job are constrained to run on specific processing nodes then stage will run on the node which is common to stage as well as job. 
    3. resource –  resource resource_type “location” [{pools “disk_pool_name”}]  | resource resource_type “value” . resource_type can be canonicalhostname (Which takes quoted ethernet name of a node in cluster that is unconnected to Conductor node by the hight speed network.) or disk (To read/write persistent data to this directory.) or scratchdisk (Quoted absolute path name of a directory on a file system where intermediate data will be temporarily stored. It is local to the processing node.) or RDBMS Specific resourses (e.g. DB2, INFORMIX, ORACLE, etc.)   
  3.  How datastage decides on which processing node a stage should be run
    1. If a job or stage is not constrained to run on specific nodes then parallel engine executes a parallel stage on all nodes defined in the default node pool. (Default Behavior) 
    2. If the node is constrained then the constrained processing nodes are choosen while executing the parallel stage. (Refer to 2.2.3 for more detail). 
  4. When configuring an MPP, you specify the physical nodes in your system on which the parallel engine will run your parallel jobs. This is called Conductor Node. For other nodes, you do not need to specify the physical node.  Also, You need to copy the (.apt) configuration file only to the nodes from which you start parallel engine applications. It is possible that conductor node is not connected with the high-speed network switches. However, the other nodes are connected to each other using a very high-speed network switches. How do you configure your system so that you will be able to achieve optimized parallelism?
    1. Make sure that none of the stages are specified to be run on the conductor node.  
    2. Use conductor node just to start the execution of parallel job.  
    3. Make sure that conductor node is not the part of the default pool.  
  5. Although, parallelization increases the throughput and speed of the process, why maximum parallelization is not necessarily the optimal parallelization?  
    1. Datastage creates one process for every stage for each processing node.  Hence, if the hardware resource is not available to support the maximum parallelization, the performance of overall system goes down. For example, suppose we have a SMP system with three CPU and a Parallel job with 4 stage. We have 3 logical node (one corresponding to each physical node (say CPU)). Now DataStage will start 3*4 = 12 processes, which has to be managed by a single operating system. Significant time will be spent in switching context and scheduling the process. 
  6. Since we can have different logical processing nodes, it is possible that some node will be more suitable for some stage while other nodes will be more suitable for other stages. So, when to decide which node will be suitable for which stage?  
    1. If a stage is performing a memory intensive task then it should be run on a node which has more disk space available for it. E.g. sorting a data is memory intensive task and it should be run on such nodes.  
    2. If some stage depends on licensed version of software (e.g. SAS Stage, RDBMS related stages, etc.) then you need to associate those stages with the processing node, which is physically mapped to the machine on which the licensed software is installed. (Assumption: The machine on which licensed software is installed is connected through other machines using high speed network.)  
    3. If a job contains stages, which exchange large amounts of data then they should be assigned to nodes where stages communicate by either shared memory (SMP) or high-speed link (MPP) in most optimized manner.  
  7. Basically nodes are nothing but set of machines (specially in MPP systems). You start the execution of parallel jobs from the conductor node. Conductor nodes creates a shell of remote machines (depending on the processing nodes) and copies the same environment on them. However, it is possible to create a startup script which will selectively change the environment on a specific node. This script has a default name of startup.apt. However, like main configuration file, we can also have many startup configuration files. The appropriate configuration file can be picked up using the environment variable APT_STARTUP_SCRIPT. What is use of APT_NO_STARTUP_SCRIPT environment variable
    1. Using APT_NO_STARTUP_SCRIPT environment variable, you can instruct Parallel engine not to run the startup script on the remote shell. 
  8. What are the generic things one must follow while creating a configuration file so that optimal parallelization can be achieved
    1. Consider avoiding the disk/disks that your input files reside on. 
    2.  Ensure that the different file systems mentioned as the disk and scratchdisk resources hit disjoint sets of spindles even if they’re located on a RAID (Redundant Array of Inexpensive Disks) system. 
    3. Know what is real and what is NFS:  
      1. Real disks are directly attached, or are reachable over a SAN (storage-area network -dedicated, just for storage, low-level protocols). 
      2. Never use NFS file systems for scratchdisk resources, remember scratchdisk are also used for temporary storage of file/data during processing. 
      3. If you use NFS file system space for disk resources, then you need to know what you are doing. For example, your final result files may need to be written out onto the NFS disk area, but that doesn’t mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. Better to setup a “final” disk pool, and constrain the result sequential file or data set to reside there, but let intermediate storage go to local or SAN resources, not NFS.  
    4. Know what data points are striped (RAID) and which are not. Where possible, avoid striping across data points that are already striped at the spindle level.





   DATASTAGE ADMIN RELATED QUE/ANS

1. DsEngine Stop and start Process

with Operator User

-bash-3.00$ cd /u01/app/datastage/product/Ascential/DataStage/DSEngine
-bash-3.00$ . ./dsenv
-bash-3.00$ cd bin
-bash-3.00$ ./rfcprocess stop
-bash-3.00$ ps -ef | grep rfc

-bash-3.00$ ./rfcprocess start
-bash-3.00$ ps -ef | grep rfc
_______________________________________
ask all user's to logout from system then check ..
_____________________________________________

Dsadm User

cd /u01/app/datastage/product/Ascential/DataStage/DSEngine
. ./dsenv
bin/uv –admin –stop

Then Fire These Commands  and kill the processes=>
netstat –a | grep dsr
netstat -a | grep dsrpc-> for the instance you want to shut down, expect the server instance is listening, and no established connection, nor FIN-WAIT nor CLOSE-WAIT connection
ps -ef | grep dsrpcd-> expect the instance you want to shut down returned
ps -ef | grep dscs -> for the instance you want to shut down, expect no returned
ps -ef | grep slave-> for the instance you want to shut down, expect no returned
ps -ef | grep phantom-> for the instance you want to shut down, expect no returned
ps -ef | grep osh   -> for the instance you want to shut down, expect no returned
ps –ef | grep operator ->find out all processes running with operator
ps –ef | grep dsapi
ps –ef | grep sortcl ->find out all processes running with sortcl
ps –ef | grep sqlldr ->find out all processes running with sqlldr

lsof -i | grep process_id

If process still exists then list out  and kill
netstat –a | grep dsr
P0471DSSEL-D1.dsrpc  10.8.80.72.1109      64891    119 49640      0 ESTABLISHED
Isof –I | grep 1109
________________________________

in Short =>
cd /u01/app/datastage/product/Ascential/DataStage/DSEngine
. ./dsenv
bin/uv –admin –stop
bin/uv –admin –start


2.If Job’s got locked then how to unlock=>
Ans. A) Go to the Director =>
               Go to Job => Cleanup Resourses
Ans b) Go to unix Prompt Then DSEngine Path
             -bash-3.00$ cd /u14/app/datastagepx/product/Ascential/DataStage/DSEngine
-bash-3.00$ . ./dsenv
-bash-3.00$ bin/uv
>logto DSS_RTL_PROD
>DS.TOOLS

3. How to Find out  Editable job’s from Any project.
cd `cat /.dshome` or
cd /u01/app/datastage/product/Ascential/DataStage/DSEngine
. ./dsenv
bin/uv
SELECT NAME ,CATEGORY ,READONLY FROM DS_JOBS WHERE NAME NOT LIKE '\\%' AND READONLY = 'NRO' ;
SELECT COUNT(1) FROM DS_JOBS ;
Select * from DS_JOBS where NAME='CopyOfJ1PubSrc'; ==> For viewing exact loaction
_______________________________________
select category   as catgry  fmt '25l',ds_jobs.name as job_name fmt '35l'
from ds_jobobjects,ds_jobs
where ds_jobobjects.objidno = ds_jobs.jobno
and ds_jobobjects.oletype = 'chASHEDiNPUT'
and eval ds_jobobjects."@record<6>" = 'h_cMD_ct_2' ;

4. To check with particular hash file OutPut
select category   as catgry  fmt '25l',ds_jobs.name as job_name fmt '35l'
from   ds_jobobjects,ds_jobs
where  ds_jobobjects.objidno = ds_jobs.jobno
and    ds_jobobjects.oletype = 'chASHEDoUTPUT'
and    eval ds_jobobjects."@record<6>" = 'H_ERC_PMD_DETAILS' ;

5.For checking CPU Speed and Running processes counts=>
operator-17:20:28$vmstat 2 5
 kthr                   memory                       page                            disk               faults                     cpu
 r    b  w     swap             free           re    mf     pi     po fr de   sr s0 s1 s6 s1      in        sy     cs       us sy id
 26  9  0   140049512  136446232    13158 3309 86566 38 38 0    1   9   0 -0  0    21741 218915 33053   41 28 31

6.For checking user's and CPU Idle count's=>sar 2 10
operator-14:23:41$sar 2 10
SunOS P0471DSSEL 5.10 Generic_138888-03 sun4u    09/10/2009
14:23:50    %usr    %sys    %wio   %idle
14:23:55      86      14       0       0
14:24:00      85      15       0       0

7.For checking processes running on Server=>
operator-14:24:46$top
last pid: 14675;  load averages: 95.49, 92.49, 93.82     18:29:32
1016 processes:950 sleeping, 48 running, 2 zombie, 7 stopped, 9 on cpu
CPU states:  0.0% idle, 25.8% user, 74.2% kernel,  0.0% iowait,  0.0% swap
Memory: 160G real, 125G free, 45G swap in use, 102G swap free

   PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
  7678 operator  25  32    0   18G 2653M cpu/531   7:05  1.15% sortcl
 14960 operator  11  43    0   26M   12M cpu/19  27:25  0.70% sortcl


8. How To Bind The CPU’s for any processes.(Means That particular process can occupy  only defined CPU’s Only remaining other are free for another process)

=>psrinfo => this command show's overall CPU's -How many cpu's is Server.
0       on-line   since 10/16/2009 23:12:42
1       on-line   since 10/16/2009 23:12:43
2       on-line   since 10/16/2009 23:12:43
532     on-line   since 10/16/2009 23:12:43
534     on-line   since 10/16/2009 23:12:43
535     on-line   since 10/16/2009 23:12:43

=>psrinfo | wc -l => Check CPU's Count on Server.
      48

=>psrset -i => this command show's how many nodes (CPU's set) Defined in Machine.
                          This process will done by Root User Only.
operator-10:56:34$psrset -i
user processor set 1: processors 4 516 5 517
user processor set 2: processors 8 520 9 521 10 522 11 523
user processor set 3: processors 12 524 14 526
user processor set 4: processors 16 528 17 529
user processor set 5: processors 20 532 21 533 22 534 23 535

=>After Execution of  main job or Main Script execute these command's for binding that process ID.
ps -ef|grep -i bpst|awk '{print $2}'|grep -v -w 1|xargs sudo /usr/sbin/psrset -b 1
ps -ef|grep -i bpst|awk '{print $3}'|grep -v -w 1|xargs sudo /usr/sbin/psrset -b 1

=> After Completion of that processes execute these for unbinding all binded process id's.
sudo /usr/sbin/psrset -q|awk '{print $3}'|awk -F":" '{print $1}'|xargs sudo /usr/sbin/psrset -u
















DATAWAREHOUSING CONCEPTS

# Types of Dimensional Modeling?
Dimensional Modeling is a logical design technique that seeks to present the data
in a standard framework that is, intuitive and allows for high performance access.
A) Dimensional modeling is again sub divided into 2 types.
a) Star Schema - Simple & Much Faster. Denormalized form.
b) Snowflake Schema - Complex with more Granularity. More normalized form.

#What is Star Schema?
Star Schema is a de-normalized multi-dimensional model. It contains centralized fact
tables surrounded by dimensions table.
Dimension Table: It contains a primary key and description about the fact table.
Fact Table: It contains foreign keys to the dimension tables, measures and aggregates.

#What is Snowflake schema?
It is partially normalized dimensional model in which at two represents least one
dimension or more hierarchy related tables.

# Why we need surrogate key?
It is used for integrating the data may help better for primary key.
Index maintenance, joins, table size, key updates, disconnected inserts and partitioning.

#Explain Types of Fact Tables?
Factless Fact: It contains only foreign keys to the dimension tables.
Additive Fact: Measures can be added across any dimensions.
Semi-Additive: Measures can be added across some dimensions. Eg, % age, discount
Non-Additive: Measures cannot be added across any dimensions. Eg, Average
Conformed Fact: The equation or the measures of the two fact tables are the same under
the facts are measured across the dimensions with a same set of measures.

# Explain the Types of Dimension Tables?
Conformed Dimension: If a dimension table is connected to more than one fact table,
the granularity that is defined in the dimension table is common across between the fact tables.
Junk Dimension: The Dimension table, which contains only flags.
Monster Dimension: If rapidly changes in Dimension are known as Monster Dimension.
De-generative Dimension: It is line item-oriented fact table design.

#What index is created on Data Warehouse?
Bitmap index is created in Data Warehouse.

# What is Full load & Incremental or Refresh load?
Answer: Full Load: completely erasing the contents of one or more tables and reloading with fresh data.
Incremental Load: applying ongoing changes to one or more tables based on a predefined schedule.

# Compare ETL & Manual development? 
Answer :  ETL - The process of extracting data from multiple sources.(ex. flat files,XML, COBOL, SAP etc) is more simpler with the help of tools.
Manual - Loading the data other than flat files and oracle table need more effort.
ETL - High and clear visibilty of logic.
Manual - complex and not so user friendly visibilty of logic.
ETL - Contains Meta data and changes can be done easily.
Manual - No Meta data concept and changes needs more effort.
ETL- Error hadling,log summary and load progess makes life easier for developer and maintainer.
Manual - need maximum effort from maintainance point of view.
ETL - Can handle Historic data very well.
Manual - as data grows the processing time degrads.
These are some differences b/w manual and ETL developement.

# What is the difference between a data warehouse and a data mart?
This is a heavily debated issue. There are inherent similarities between the basic constructs used to design a data warehouse and a data mart. In general a Data Warehouse is used on an enterprise level, while Data Marts is used on a business division/department level. A data mart only contains the required subject specific data for local analysis.

# What is Index . How many type of Index .
Ans=>Indexes are optional structures associated with tables and cluster.
Type of Index :-
  1. B-Tree indexes
  2. B-Tree cluster indexes
  3. Hash cluster indexes
  4. Reverse key indexes
  5. Bitmap Indexes .

#. how can u implement slowly changed dimensions in datastage? explain?
SCDs are three typesType 1- Modify the changeType 2- Version the modified changeType 3- Historical versioning of modified change by adding a new column to update the changed data
SCD type1 -- >> just use 'insert rows else update rows'  or  ' update rows  else  insert rows'  in update action of target
SCD type2 à> u have use one hash file to look -up the target ,take 3 instance of target ,give diff condns depending on the process,give diff update actions in target ,use system variables like sysdate ,null
We can handle SCD in the following ways
Type I: Just overwrite;
Type II: We need versioning and dates;
Type III: Add old and new copies of certain important fields.
Hybrid Dimensions: Combination of Type II and Type III
yes you can implement Type1 Type2 or Type 3.
Let me try to explain Type 2 with time stamp.
Step :1 time stamp we are creating via shared container. it return system time and one key. For satisfying the lookup condition we are creating a key column by using the column generator.
Step 2: Our source is Data set and Lookup table is oracle OCI stage. by using the change capture stage we will find out the differences. the change capture stage will return a value for chage_code. based on return value we will find out whether this is for insert , Edit,  or update. if it is insert we will modify with current timestamp and the old time stamp will keep as history.

# how to implement type2 slowly changing dimensions in data stage?explain with example?
We can handle SCD in the following ways

Type 1: Just use, “Insert rows Else Update rows”
Or “Update rows Else Insert rows”, in update action of target

Type 2: Use the steps as follows
a)     U have use one hash file to Look-Up the target
b)     Take 3 instances of target
c)     Give different conditions depending on the process
d)     Give different update actions in target
e)     Use system variables like Sysdate and Null.


2 comments: