Saturday 8 February 2014

Unix in Datastage



How to run a shell script within the scope of a Datastage job?


By using "ExcecSH" command at Before/After job properties.

Select the Edit tab in the toolbar -> choose Job Properties -> select the job parameters -> choose the before/after routines -> select the ExecSH command.
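For example, ExecSH could call a small housekeeping script before the job runs. A minimal sketch (the script path and file names are hypothetical):

  #!/bin/sh
  # before_job.sh - hypothetical before-job housekeeping script
  # Archive the previous run's output file before the job writes a new one.
  OUT_DIR=/data/project/output
  ARCHIVE_DIR=/data/project/archive

  if [ -f "$OUT_DIR/daily_extract.txt" ]; then
      mv "$OUT_DIR/daily_extract.txt" "$ARCHIVE_DIR/daily_extract.`date +%Y%m%d`.txt"
  fi

You would then pick ExecSH as the before-job subroutine and supply /data/project/scripts/before_job.sh (plus any arguments) as its input value.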


How to use the command line to import/export parallel jobs on Unix

dsjob -import <project name> <DSX file name (containing the executable jobs)> <job name(s)>
Make sure you understand that this import only brings in job executables, which usually isn't what people want (for a full export, use the Manager tool).
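For example, a hypothetical invocation following the syntax above (the file and job names are made up; some releases expect the job names after a -JOBS flag, so check dsjob's help for your version):

  dsjob -import dstage1 /tmp/job_executables.dsx Job1 Job2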

dsimport command

The dsimport command is as follows:
dsimport.exe /D=domain /H=hostname /U=username /P=password
/NUA project|/ALL|/ASK dsx_pathname1 dsx_pathname2 ...
The arguments are as follows:
  • domain or domain:port_number. The application server name. This can also optionally have a port number.
  • hostname. The IBM® InfoSphere® DataStage® Server to which the file will be imported.
  • username. The user name to use for connecting to the application server.
  • password. The user's password.
  • NUA. Include this flag to disable usage analysis. This is recommended if you are importing a large project.
  • project, /ALL, or /ASK. Specify a project to import the components to, or specify /ALL to import to all projects or /ASK to be prompted for the project to which to import.
  • dsx_pathname. The file to import from. You can specify multiple files if required.
For example, the following command imports the components in the file jobs.dsx into the project dstage1 on the R101 server:
dsimport.exe /D=domain:9080 /U=wombat /P=w1ll1am dstage1 /H=R101 C:/scratch/jobs.dsx
When importing jobs or parameter sets with environment variable parameters, the import adds that environment variable to the project definitions if it is not already present. The value for the project definition of the environment variable is set to an empty string, because the original default for the project is not known. If the environment variable value is set to $PROJDEF in the imported component, the import warns you that you need to set the environment variable value in the project yourself.


dsexport command

The dsexport command is as follows:
dsexport.exe /D domain /H hostname 
/U username /P password /JOB jobname 
/XML /EXT /EXEC /APPEND project pathname1
The arguments are as follows:
  • domain or domain:port_number. The application server name. This can also optionally have a port number.
  • hostname. The InfoSphere DataStage server from which the file will be exported.
  • username. The user name to use for connecting to the application server.
  • password. The user's password.
  • jobname. A particular job to export.
  • project. Specify the project to export the components from.
  • pathname. The file to which to export.
The command takes the following options:
  • /XML – export in XML format, only available with the /JOB=jobname option.
  • /EXT – export external values, only available with the /XML option.
  • /EXEC – export the job executable only, only available with /JOB=jobname and when /XML is not specified.
  • /APPEND – append to an existing dsx file, only available with the /EXEC option.
For example, the following command exports the project dstage2 from the R101 server to the file dstage2.dsx:
dsexport.exe /D domain:9080 /H R101 /U billg /P paddock 
dstage2 C:/scratch/dstage2.dsx
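Combining the options above, the following would append just the executable of a single job (the job name myjob is hypothetical) to an existing export file:

  dsexport.exe /D domain:9080 /H R101 /U billg /P paddock /JOB myjob /EXEC /APPEND dstage2 C:/scratch/dstage2.dsx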

How to Run a Datastage Job from the Unix Command Line?

Most data warehousing projects require your jobs to run in batches at specified time slots. In such cases the Datastage jobs are usually scheduled using an external scheduling tool like ESP Scheduler, Control-M, Autosys, etc. This is made possible by writing scripts that run your jobs through the command line. I would consider the command line a very powerful interface to Datastage, one that lets us do more than just run the normal job. The guides in the Datastage documentation are very helpful for exploring the various things that can be done through the command line; here, however, I plan on giving you the basics you will need to carry out your execution.
In UNIX, the Datastage home directory location will always be specified in the ".dshome" file, which is present in the root directory. Before you can run your Datastage commands, you will have to run the following commands:
cd `cat /.dshome`
This will change the location to the home directory. By default this will be /opt/IBM/InformationServer/Server/DSEngine
. ./dsenv > /dev/null 2>&1
This will run the dsenv file, which contains all the environment variables. Without doing this, your Datastage commands won't run from the command prompt.
After you have done this you can use any Datastage command for interacting with the server. The main command is 'dsjob', which is used not only to run jobs but for a wide variety of tasks. Let's look at the various ways you can use the dsjob command.
To run a job:
Using the dsjob command you can start, stop, reset, or run the job in validation mode.
dsjob -run -mode VALIDATE/RESET/RESTART project_name job_name
This command runs the job in validation mode; similarly, you can use RESET or RESTART instead of VALIDATE depending on what type of run you want. If you want a normal run then you do not need to specify the -mode keyword, as shown below:
dsjob -run project_name job_name|job_name.invocationid
Running with the invocation id means the job is run with that specific invocation id.
If you have parameters or parameter set values to set, this can also be done as shown below:
dsjob -run -param variable_name="VALUE" -param psParameterSet="vsValueSet" project_name job_name
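Putting the environment setup and the run command together, a minimal wrapper script might look like the sketch below. The project, job, and parameter names are placeholders; -jobstatus makes dsjob wait for the job to finish and reflect its final status in the exit code.

  #!/bin/sh
  # Minimal dsjob wrapper sketch - names and paths are placeholders
  PROJECT=my_project
  JOB=my_job

  # Locate the Datastage home directory and source the environment
  cd `cat /.dshome`
  . ./dsenv > /dev/null 2>&1

  # Run the job with one (hypothetical) parameter and wait for completion
  bin/dsjob -run -jobstatus -param SRC_DIR=/data/input $PROJECT $JOB
  rc=$?
  echo "dsjob exited with status $rc"
  exit $rc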
To stop a job:
Stopping a job is fairly simple. You might not actually need it, but it's still worth a look. It acts the same way as stopping a running job from the Datastage Director.
dsjob -stop project_name job_name|job_name.invocationid
To list projects, jobs, stages in jobs, links in jobs, parameters in jobs and invocations of jobs
dsjob can very easily give you all of the above based on different keywords. This is useful if you want a report of what's being used in which project.
The various commands are shown below
'dsjob -lprojects' will give you a list of all the projects on the server.
'dsjob -ljobs project_name' will give you a list of the jobs in a particular project.
'dsjob -lstages project_name job_name' will give you a list of all the stages used in your job. Replacing -lstages with -llinks will give you a list of all the links in your job. Using -lparams will give you a list of all the parameters used in your job. Using -linvocations will give you a list of all the invocations of your multiple instance job.
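For instance, a quick inventory script could combine these options to dump the parameters of every job in a project (a sketch; the project name is a placeholder):

  #!/bin/sh
  # Sketch: list the parameters of every job in a project
  PROJECT=my_project
  cd `cat /.dshome`
  . ./dsenv > /dev/null 2>&1

  for job in `bin/dsjob -ljobs $PROJECT`; do
      echo "=== $job ==="
      bin/dsjob -lparams $PROJECT $job
  done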
To generate reports of a job
You can get the basic information of a job by using the 'jobinfo' option as shown below
dsjob -jobinfo  project_name job_name
Running this command will give you a short report of the job, including the current status of the job, the name of any controlling job, the date and time when the job started, the wave number of the last or current run (an internal InfoSphere Datastage reference number), and the user status.
You can get a more detailed report using the below command
dsjob -report  project  job_name BASIC|DETAIL|XML
BASIC means that your report will contain very basic information like the start/end time of the job, the time elapsed, and the current status of the job. DETAIL, as the name indicates, will give you a very detailed report on the job, down to the stage and link level. XML gives you the same detailed report in an XML format.
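As a sketch, the same idea extends across a whole project: you could generate an XML report for every job and archive them for later processing (paths and names are placeholders):

  #!/bin/sh
  # Sketch: write one XML report per job in a project
  PROJECT=my_project
  REPORT_DIR=/tmp/ds_reports
  cd `cat /.dshome`
  . ./dsenv > /dev/null 2>&1

  mkdir -p $REPORT_DIR
  for job in `bin/dsjob -ljobs $PROJECT`; do
      bin/dsjob -report $PROJECT $job XML > $REPORT_DIR/$job.xml
  done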
To access logs:
You can use the below command to get the latest 5 fatal errors from the log of the job that was just run:
dsjob -logsum -type FATAL -max 5 project_name job_name
You can get different types of information based on the keyword you specify for -type. A full list of allowable types is available in the help guide for reference.
There are a number of other options also available to get different log information. You can explore this in more detail in the developer guide. With the Datastage commands you can administer jobs, run jobs, maintain jobs, handle errors, prepare meaningful job logs and even prepare reports.  The possibilities are endless. If you like to code then you won’t mind spending your time exploring the command line options available.
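For example, a post-run check might pull the latest fatal entries and raise a flag only when there are any. A sketch; the names are placeholders, and the exact -logsum output format varies between versions:

  #!/bin/sh
  # Sketch: surface recent fatal log entries for a job
  PROJECT=my_project
  JOB=my_job
  cd `cat /.dshome`
  . ./dsenv > /dev/null 2>&1

  fatals=`bin/dsjob -logsum -type FATAL -max 5 $PROJECT $JOB`
  if [ -n "$fatals" ]; then
      echo "Fatal log entries found for $JOB:"
      echo "$fatals"
      exit 1
  fi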

ICONV, OCONV

ICONV and OCONV functions are quite often used to handle data in Datastage.
ICONV converts a string to an internal storage format and OCONV converts an expression to an output format.

Syntax:
           Iconv(string, conversion_code)
           Oconv(expression, conversion_code)

Note: these functions are available only in server jobs, not in parallel jobs.


ICONV

The following examples show date conversions from external to internal format. All of these ICONV calls return the internal date 14143:

DateConversions:
  PRINT ICONV("20 SEP 2006","D")
  PRINT ICONV("09-20-2006","D")
  PRINT ICONV("09/20/2006","D")
The following example shows time conversions from external to internal format:

TimeConversions:
  PRINT ICONV("13:21","MT");       ! Returns 48060
  PRINT ICONV("1:21PM","MT");      ! Returns 48060
  PRINT ICONV("13:21:01","MT");    ! Returns 48061
  PRINT ICONV("13:21:01.65","MT"); ! Returns 48061.65

OCONV

The following example shows date conversions:

DateConversions:
  ! Month Abbreviation Formats:
  PRINT OCONV(0,"D");       ! "31 DEC 1967"
  PRINT OCONV(14100,"D");   ! "08 AUG 2006"
  PRINT OCONV(14100,"D2");  ! "08 AUG 06"
  PRINT OCONV(DATE(),"D");  ! current date in above format
  PRINT OCONV(@DATE,"D");   ! current date in above format
  PRINT OCONV(14120,"D-")   ! "08-28-2006"
  PRINT OCONV(14120,"D/")   ! "08/28/2006"
  PRINT OCONV(14120,"DE")   ! "28/08/2006"
  PRINT OCONV(14120,"D2/")  ! "08/28/06"
  PRINT OCONV(14120,"D2-E") ! "28-08-06"
The following example shows time conversions:

TimeConversions:
  PRINT OCONV(0,"MT")
  PRINT OCONV(TIME(),"MT")
  PRINT OCONV(TIME(),"MTH")
  PRINT OCONV(TIME(),"MTS")
  PRINT OCONV(TIME(),"MTS.")
  PRINT OCONV(TIME(),"MTHS*")

APT_CONFIG_FILE

{
node "node1"
{
fastname "SVR1"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
node "node2"
{
fastname "SVR1"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
node "node3"
{
fastname "SVR2"
pools "" "sort"
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
}
This is a 3 node configuration file. Let's go through the basic entries and what they represent.
Fastname – This refers to the node name on a fast network. From this we can infer that node1 and node2 are on the same physical server (SVR1). However, if we look at node3 we can see that it is on a different physical server (identified by SVR2). So in node1 and node2 all the resources are shared: the disk and scratch disk specified are actually shared between those two logical nodes. Node3, on the other hand, has its own disk and scratch disk space.
Pools – Pools allow us to associate different processing nodes based on their functions and characteristics. If you see an entry like "node0" or one of the reserved node pools like "sort", "db2", etc., then the node is part of the specified pool. A node is by default associated with the default pool, which is indicated by "". Now if you look at node3 you can see that it is also associated with the sort pool. This ensures that the sort stage will run only on nodes that are part of the sort pool.
Resource disk – This specifies the location on your server where the processing node will write all the data set files. As you might know, when Datastage creates a dataset, the file you see does not contain the actual data; it points to the place where the actual data is stored. Where that dataset data is stored is specified in this line.
Resource scratchdisk – The location of temporary files created during Datastage processes like lookups and sorts is specified here. If the node is part of the sort pool then the scratch disk can also be made part of the sort scratch disk pool, which ensures that the temporary files created during sorts are stored only in this location. If such a pool is not specified then Datastage checks whether any scratch disk resources belong to the default scratch disk pool on the nodes that sort is specified to run on, and if so uses that space.
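Which configuration file a job runs with is controlled by the APT_CONFIG_FILE environment variable. A minimal sketch of pointing one run at a specific file from the command line, assuming the job exposes $APT_CONFIG_FILE as an environment-variable parameter (the path, project, and job names are placeholders):

  #!/bin/sh
  # Sketch: run a job against a specific configuration file
  cd `cat /.dshome`
  . ./dsenv > /dev/null 2>&1
  bin/dsjob -run -param '$APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/3node.apt' my_project my_job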

SMP, MPP

SMP (Symmetric Multi-Processing) is an architecture in which hardware resources are shared symmetrically among processors. The processors communicate via shared memory and run under a single operating system.
A symmetric multiprocessor system is a multiprocessor system with centralized shared memory, called main memory, operating under a single operating system with two or more homogeneous processors; that is, it is not a heterogeneous computing system.


MPP (massively parallel processing) is the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory. Typically, MPP processors communicate using some messaging interface.


In terms of software, there are two different types of database software: symmetric multiprocessing (SMP) and massively parallel processing (MPP). 
An SMP database system is a database system that runs on one or more machines with several identical processors sharing the same disk storage. When an SMP database system runs on more than one machine, it is called a clustered configuration. The database is physically located in a single disk storage system. 
Examples of SMP database systems are SQL Server, Oracle, DB/2, Informix, and Sybase. 
An MPP database system is a database system that runs on more than one machine where each machine has its own disk storage. The database is physically located in several disk storage systems that are interconnected to each other. An MPP database system is also known as a parallel database system. Examples of MPP database systems are Teradata, Neoview, Netezza, and DATAllegro.


The machines in SMP and MPP database systems are called nodes. 
An MPP database system is faster and more scalable than an SMP database system. In an MPP database system, a table is physically located in several nodes, each with its own storage.

Tuesday 4 February 2014

SQL Query


Query to display the middle records, dropping the first 5 and last 5 records of the emp table

        select * from emp where rownum<=(select count(*)-5 from emp) minus select * from emp where rownum<=5;


Query to display first N records
            select * from(select * from emp order by rowid) where rownum<=&n;


Query to display odd records only?
            Q).  select * from emp where (rowid,1) in (select rowid,mod (rownum,2) from emp);


Query to display even records only?

            Q.)  select * from emp where (rowid,0) in (select rowid,mod (rownum,2) from emp);


How to display duplicate rows in a table?

         Q).  select * from emp where deptno=any
                                (select deptno from emp having count(deptno)>1 group by deptno);


Query to display 3rd highest and 3rd lowest salary?

 Q). select * from emp e1 where 3=(select  count(distinct sal) from emp e2 where e1.sal<=e2.sal)
            union
       select * from emp e3 where 3=(select count(distinct sal) from emp e4 where e3.sal>=e4.sal);


Query to display Nth record from the table?

           Q). select * from emp where rownum<=&n minus select * from emp where  rownum<&n;


Query to display the records from M to N?

Q.)  select ename from emp group by rownum,ename having rownum>1 and rownum<6;
             select deptno,ename,sal from emp where rowid in(select rowid from emp
            where rownum<=7 minus select rowid from emp where rownum<4);
             select * from emp where rownum<=7 minus select * from emp where rownum<5;


Query to delete the duplicate records?

             Q). delete from dup where rowid not in(select max(rowid)from dup group by eno);



Query to display the duplicate records?

            Q).  select * from dup where rowid not in(select max(rowid)from dup group by eno);



Query for joining two tables(OUTER JOIN)?

Q).  select e.ename,d.deptno from emp e,dept d where e.deptno(+)=d.deptno order by e.deptno;
Full Outer Join:
select empno,ename,sal,dept.* from emp full outer join dept on emp.deptno=dept.deptno;
Right Outer Join:
select empno,ename,sal,dept.* from emp right outer join dept on emp.deptno=dept.deptno;
Left Outer Join:
select empno,ename,sal,dept.* from emp left outer join dept on emp.deptno=dept.deptno;



Query for joining table it self(SELF JOIN)?

Q).  select e.ename "employee name", e1.ename "manager name" from emp e, emp e1 where e.mgr=e1.empno;


Query for combining two tables(INNER JOIN)?

select emp.empno,emp.ename,dept.deptno from emp,dept where emp.deptno=dept.deptno;
By using aliases:
select e.empno,e.ename,d.deptno from emp e,dept d where e.deptno=d.deptno;
select empno,ename,sal,dept.* from emp join dept on emp.deptno=dept.deptno;



Find the Nth highest or lowest salary?

for maximum:
select * from emp where sal in(select min(sal)from
(select sal from emp group by sal order by sal desc)
where rownum<=&n);
select * from emp a where &n=(select count(distinct(sal)) from emp b where a.sal<=b.sal);

for minimum:
select * from emp where sal in(select max(sal) from(select sal from emp group by sal order by sal asc) where rownum<=&n);
select * from emp a where &n=(select count(distinct(sal)) from emp b where a.sal>=b.sal);



Find the lowest 5 employee salaries?

Q).  select * from (select * from emp order by sal asc) where rownum<6;



Find the top 5 employee salaries queries

select * from (select * from emp order by sal desc) where rownum<6;



Find lowest salary queries

select * from emp where sal=(select min(sal) from emp);



Find highest salary queries


select * from emp where sal=(select max(sal) from emp);