Wednesday 29 January 2014

Configuring DB2 database connectivity in a parallel environment in DataStage

Configure DataStage to run on all processing nodes that function as DB2 servers.

Steps

To configure InfoSphere DataStage to access IBM DB2 servers, include a node definition for each IBM DB2 server node in the configuration file. Also, complete the following steps.
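For example, a node definition for a DB2 server node in the parallel configuration file might look roughly like this (a sketch only; the node name db2node1, host name db2server1, pool name, and resource paths are hypothetical and depend on your installation):

{
  node "db2node1"
  {
    fastname "db2server1"
    pools "" "db2"
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/scratch" {pools ""}
  }
}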


Running the IBM DB2 configuration script

Use the DB2 configuration script to configure access to IBM DB2 databases.
Procedure
From the Linux or UNIX command line, run the $APT_ORCHHOME/bin/db2setup.sh script once for each IBM DB2 database that users will access. Pass the database name as an argument.
For example, the following command calls db2setup.sh to configure access to the database db2db:
db2setup.sh db2db


Granting user privileges to run IBM DB2 Enterprise stage jobs

You must grant privileges to each user who runs jobs that contain the DB2 Enterprise stage.
Procedure
Run the $APT_ORCHHOME/bin/db2grant.sh script once for each user. Use the following syntax:
db2grant.sh database_name user_name
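For example, the following call (the user name etluser is hypothetical) grants the required privileges on the database db2db:
db2grant.sh db2db etluser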

Accessing a remote IBM DB2 server from IBM DB2 Enterprise stages

You use a DB2 client to connect from an IBM DB2 Enterprise stage to a remote IBM DB2 server.
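Before the stage can connect, the remote database typically has to be cataloged on the DB2 client. A rough sketch using standard DB2 CLP commands (the node name db2node, host db2host.example.com, port 50000, and database db2db are all hypothetical):
db2 catalog tcpip node db2node remote db2host.example.com server 50000
db2 catalog database db2db at node db2node
db2 terminate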

DataStage Architecture

This is the information as per my knowledge.
What is the architecture of DataStage?
DataStage uses a client/server architecture.
The client/server layout differs from version to version. The latest version at the time of writing is DataStage 8.7.
1. DataStage 7.5 (7.5.1 or 7.5.2) version - standalone
DataStage 7.5 was a standalone version: the DataStage engine, services, and repository (metadata) were all installed on one server, while the client was installed on the local PC and accessed the server through the DS client. Users are created on the UNIX/Windows DataStage server and added to the dstage group (dsadm is the owner of DataStage, and dstage is its group). To give access to a new user, create a new UNIX/Windows user on the DS server and add it to the dstage group; the user will then have access to the DataStage server from the client.
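For example, on a Linux DataStage server the new user might be created and added to the group roughly as follows (a sketch; the user name newdsuser is hypothetical and the exact commands depend on your operating system):
useradd -m -g dstage newdsuser
passwd newdsuser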
Client components & server components
There are four client components:
  1. DataStage Designer
  2. DataStage Administrator
  3. DataStage Director
  4. DataStage Manager
DataStage Designer is used to design jobs. All DataStage development activities are done here. A DataStage developer should know this part very well.
DataStage Manager is used to import and export projects and to view and edit the contents of the repository. This is handled by the DataStage operator/administrator.
DataStage Administrator is used for creating and deleting projects and setting environment variables. This is handled by the DataStage administrator.
DataStage Director is used to run, validate, and schedule jobs. This is handled by the DataStage developer/operator.
Server components
DS server: runs executable server jobs, under the control of the DS Director, that extract, transform, and load data into a DWH.
DS Package Installer: a user interface used to install packaged DS jobs and plug-ins.
Repository or project: a central store that contains all the information required to build a DWH or data mart.

More reference on DataStage 7.5
2. DataStage 8.0 (8.1 and 8.5) version - standalone
DataStage 8 was also a standalone version, but here the DataStage engine and services are on the DataStage server while the repository (metadata) database is installed on an Oracle/DB2 database server; the client is installed on the local PC and accesses the servers through the DS client.
Metadata (repository): This is created as one database with two schemas (xmeta and iauser). It can be set up as a RAC database (Active/Active on two servers, so that if one database fails the other takes over without the running DataStage jobs losing their connections), where
  1. xmeta: holds information about the projects and the DataStage software
  2. iauser: holds information about the DataStage users in IIS or the web console
Note: we can install two or three DataStage instances on the same server (for example ds-8.0, ds-8.1, and ds-8.5) and bring up whichever version we want to work on. This reduces the hardware cost, but only one instance can be up and running at a time.
DataStage 8 was also a standalone version, but here three components were introduced as distinct pieces:
1. Information Server (IIS) - isadmin
2. WebSphere Application Server - wasadmin
3. DataStage server - dsadm
1. The IIS, also called the DataStage web console, was introduced; it holds all the user information for DataStage. It is generally accessed in a web browser and does not require any DataStage software installation.
After the DataStage installation, the IIS (web console) is generated with isadmin as the administrator to manage it. Once we log in to the web console as isadmin, we need to map the dsadm user in the engine credentials (dsadm is the UNIX/Windows user created on the DataStage server in the dstage group). After the mapping, new users can be created in the same user components. (Note: a user "xxx" created here is internally tagged to the mapped dsadm user, which internally makes the connection between the UNIX DataStage server and the IIS web console. All files, projects, and so on created by "xxx" are owned by the dsadm user on the UNIX server.)
We can restrict the "xxx" users here to access only one or two projects.

Client components & server components
The client components are:
  1. DataStage Designer
  2. DataStage Administrator
  3. DataStage Director
  4. IBM import/export manager
  5. Web console
  6. IBM InfoSphere DataStage and QualityStage Multi-Client Manager
  7. Others I have not come across
DataStage Designer is used to design jobs. All DataStage development activities are done here. A DataStage developer should know this part very well.
DataStage Administrator is used for creating and deleting projects and setting environment variables. This is handled by the DataStage administrator.
DataStage Director is used to run, validate, and schedule jobs. This is handled by the DataStage developer/operator.
The IBM import/export manager is used to import and export projects and to view and edit the contents of the repository. This is handled by the DataStage operator/administrator.
The web console is used to create DataStage users and perform administration. This is handled by the DataStage administrator.
The Multi-Client Manager is used to install multiple clients (for example ds-7.5, ds-8.1, or ds-8.5) on the local PC and to switch to whichever version is required. This is used by all roles: DataStage developer, operator, and administrator.
Server components: the DataStage engine, the services (IIS/WebSphere), and the metadata repository database, as described above.

3. DataStage 8.5 version - cluster (HA: High Availability clusters)
DataStage 8.5 is also available in an HA (High Availability) cluster setup. Everything functions the same as in DataStage 8.5 standalone, but the hardware and software structure is different:
1. The DataStage engine tier is on a separate server (two servers, Active/Active or Active/Passive),
2. The services tier is on a separate server (two servers, Active/Active or Active/Passive), and
3. The metadata database (repository) tier is on a separate server, installed on an Oracle/DB2 database server with RAC (two database servers in Active/Active mode, so if one database fails the other takes over immediately with no connection lost).
The whole DataStage HA setup is designed so that a failure in any part (engine, services, or metadata tier) automatically switches to the other active server without the currently running DataStage jobs losing their connections. This is an amazing setup; it is implemented in our Citibank project and I am lucky to work on it.
We can also have multiple DataStage engine tiers (for example Singapore, Malaysia, Thailand, and Russia: four engine tiers) running against the same two services tiers and metadata database tiers. (This reduces the hardware cost.)



Monday 27 January 2014

Sort stage - Resorting on Sub-Groups

Basics - Sort stage Basics  <--- Link

Resorting on Sub-Groups

Use Sort Key Mode property to re-use key column groupings from previous sorts
              – Uses significantly less memory / disk!
                        • Sort is now on previously sorted key-column groups, not the entire dataset
                       • Outputs rows after each group

Key column order is important!
              – Must be consistent across sort stages to be able to sub-sort on the same keys

Must retain incoming sort order and partitioning (SAME) between the sort stages




Don’t Sort (Previously Grouped)
What’s the difference between “Don’t Sort (Previously sorted)” and “Don’t Sort (Previously grouped)”?

When rows were previously grouped by a key, all the rows with the same key value are grouped together.
              – But the groups of rows are not necessarily in sort order.

When rows were previously sorted by a key, all the rows with the same key value are likewise grouped together and, moreover, the groups are in sort order. For example, the key sequence B,B,A,A,C,C is grouped but not sorted, whereas A,A,B,B,C,C is both grouped and sorted.

In either case the Sort stage can be used to sort by a sub-key within each group of rows



Inserted Sorts


By default, tsort operators are inserted into the score as necessary
– Before any stage that requires matched key values (Join, Merge, RemDups)

Only inserted if the user has NOT explicitly defined a sort

Check the job score for inserted tsort operators
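One way to make the score visible in the job log is the APT_DUMP_SCORE environment variable. A rough sketch, assuming $APT_DUMP_SCORE has been added as an environment-variable job parameter in the job design (the project name dstage1 and job name TestJob are hypothetical):
dsjob -run -param '$APT_DUMP_SCORE=True' dstage1 TestJob
The generated score, including any inserted tsort operators, then appears in the Director log for that run.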

Changing Inserted Sorting Behavior


Set $APT_SORT_INSERTION_CHECK_ONLY or $APT_NO_SORT_INSERTION to change the behavior of automatically inserted sorts

 Set $APT_SORT_INSERTION_CHECK_ONLY
– The inserted sort operators only VERIFY that the data is sorted
• If data is not sorted properly at runtime, the job aborts
• Recommended only on a per-job basis during performance tuning


Set $APT_NO_SORT_INSERTION to stop inserted sorts entirely

Partitioner insertion, sort insertion

Partitioner insertion and sort insertion each make writing a flow easier by alleviating the need for a user to think about either partitioning or sorting data. By examining the requirements of operators in the flow, the parallel engine can insert partitioners, collectors and sorts as necessary within a dataflow.
However, there are some situations where these features can be a hindrance.
If data is pre-partitioned and pre-sorted, and the InfoSphere® DataStage® job is unaware of this, you could disable automatic partitioning and sorting for the whole job by setting the following environment variables while the job runs:
  • APT_NO_PART_INSERTION
  • APT_NO_SORT_INSERTION
You can also disable partitioning on a per-link basis within your job design by explicitly setting a partitioning method of Same on the Partitioning tab of the Input page of the stage the link is input to.
To disable sorting on a per-link basis, insert a Sort stage on the link, and set the Sort Key Mode option to Don't Sort (Previously Sorted).
We advise that average users leave both partitioner insertion and sort insertion alone, and that power users perform careful analysis before changing these options.
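For example, one way to disable both insertions for a single run from the command line (a sketch, assuming both variables have been added as environment-variable job parameters in the job design; the project name dstage1 and job name LoadJob are hypothetical):
dsjob -run -param '$APT_NO_PART_INSERTION=True' -param '$APT_NO_SORT_INSERTION=True' dstage1 LoadJob
Alternatively, the same variables can be set at the project level in the Administrator client so that they apply to every job in the project.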


Sunday 26 January 2014

Reading and writing fixed-length fields using the Sequential File stage.

Certain considerations apply when reading and writing fixed-length fields using the Sequential File stage.
  • If reading columns that have an inherently variable-width type (for example, integer, decimal, or varchar), set the Field Width property to specify the actual fixed width of the input column. Do this by selecting Edit Row... from the shortcut menu for a particular column on the Columns tab, and specify the width in the Edit Column Meta Data dialog box.
  • If writing fixed-width columns with types that are inherently variable-width, then set the Field Width property and the Pad char property in the Edit Column Meta Data dialog box to match the width of the output column.
Other considerations are as follows:
  • If a column is nullable, you must define the null field value and length in the Edit Column Meta Data dialog box.
  • Be careful when reading delimited, bounded-length varchar columns (that is, varchars with the length option set). If the source file has fields that are longer than the maximum varchar length, the extra characters are silently discarded. Set the environment variable APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS to reject these records instead.
  • Avoid reading from sequential files using the Same partitioning method. Unless you have specified more than one source file, this will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless you explicitly repartition.

Collector Methods


(Auto)
– Eagerly read any row from any input partition
– Output row order is undefined (non-deterministic)
– This is the default collector method
Generally, Auto is the fastest and most efficient method of collection

Round Robin
– Pick rows from input partitions in round-robin order
– Slower than Auto, rarely used
The round robin collector can be used to reconstruct the original (sequential) row order for round-robin partitioned inputs
– As long as intermediate processing (e.g. sort, aggregator) has not altered row order or reduced the number of rows
– Rarely used


Ordered
– Read all rows from the first partition, then the second, and so on
– Preserves the order that exists within partitions
Ordered is only appropriate when sorted input has been range partitioned
– No sort is required to produce sorted output when the partitions have been sorted
– Rarely used, as range partitioning is also rarely used

Sort Merge
– Produces a single (sequential) stream of rows sorted on specified key columns from input sorted on those keys
– Row order is not preserved for non-key columns (non-stable sort)
To generate a single stream of sorted data, use the Sort Merge collector
– Input data must be sorted on these keys
– Sort Merge does not perform a sort

 

Preserve partitioning flag


A stage can also request that the next stage in the job preserves whatever partitioning it has implemented. It does this by setting the preserve partitioning flag for its output link. Note, however, that the next stage might ignore this request.
In most cases you are best leaving the preserve partitioning flag in its default state. The exception is where preserving existing partitioning is important. The flag will not prevent repartitioning, but it will warn you that it has happened when you run the job.

If the Preserve Partitioning flag is cleared, the current stage does not care what the next stage in the job does about partitioning.

On some stages, the Preserve Partitioning flag can be set to Propagate. In this case the stage sets the flag on its output link according to what the previous stage in the job has set. If the previous stage is also set to Propagate, the setting from the stage before that is used, and so on, until a Set or Clear flag is encountered earlier in the job. If the stage has multiple inputs and has the flag set to Propagate, its Preserve Partitioning flag is set if it is set on any of the inputs, or cleared if all the inputs are clear.