To me, look-up is the single most important (and difficult) transformation that we need to consider while tuning performance of Informatica jobs. The choice and use of correct type of Look-Up can dramatically vary the session performance in Informatica. So let’s delve deeper into this.
When you cache a look-up, Informatica fires the lookup query to database and brings in the entire set of data from the database to the Cache file directory of the Informatica server. Informatica then uses this cache file data whenever the lookup is called in the mapping flow during the session runtime. By doing this Informatica saves the time (and effort) to go back to remote database every time the lookup is called in the mapping.
In case of an un-cached or non-cached lookup, the opposite thing happens. Here Informatica goes to database to fetch data every time the lookup is called inside the mapping. How often the lookup is called depends on number of records coming from source (exactly equalling to number of records in case of a connected lookup)
There are couple of things that you need to consider while opting for any particular type of lookup. I tend to follow these general guidelines:
Above guidelines are based on the following mathematical approach that suggests when to use cache lookup and when to use uncached lookup.
N = Number of records coming from source (For the purpose of simplicity, let’s assume N also equals to the number of times the lookup will be called)
M = Number of records retrieved from the Lookup query
t = time required to retrieve a single record from database based on the Lookup query
T = time required to retrieve all the records from database based on the Lookup query = M * t
L = Time required to build the lookup index cache for cached lookup = f(N)
C = Network Time required to do one round trip for data fetch from database
In case of an uncached lookup, total time required will be:
T uncached = N * t + N * C
In case of a cached lookup, total time required to build the cache will be:
Tcached = T + L + C
In the above equation we can put L = f(N) and since C is generally very small number, we can ignore C and rewrite the above equation as –
Tcached = T + f(N)
Now if we assume that the database table has INDEX defined upon it and the index is being used while retrieving records from database then, we can further assume –
T = M * t
Replacing the above value in our earlier equation for Tcached we get –
Tcached= M * t + f(N)
We should use uncached lookup if –
Tcached > Tuncached
=> M * t + f(N) > N * t + N * C
Dividing all sides by N, we get –
C + t < (M/N)*t + f(N)/N
For a given Informatica and database setup, C, t & f(N) are fixed.
So the above equation is more likely to be true if (M/N) ratio is on higher side, i.e. N << M – that is – number of records coming from source is considerably less than number of records present in the lookup table (Refer back to point no. 1 in the discussion above).
Similarly, if N >> M – that is number of records coming from source is considerably high than the number of records present in lookup table, then we should use cached lookup.
When Informatica fires the lookup query to database, it appends a ORDER BY clause at the end of the query. However you can suppress this default behaviour by appending a comment “--“at the end of the override query. You should consider following factors regarding default Informatica Lookup ORDER BY –
WARNING! Found unsorted input when building Lookup cache of [XXX_XXX_XXX] (Current number of entries in the index cache = 1). Continue building the cache by insertion. This will cause it to take longer and the index cache may require more space.
The above warning requires little explanation. If you suppress the ORDER BY, then data fetched from database will not be sorted. Hence Informatica will require to again sort this data in its own cache space in order to generate the Index cache (yes! index cache can not be built on unsorted data). This means in any case, data has to be sorted - either from database or inside the Informatica. From our experience we have noticed that it's better to get data sorted in database level than in informatica level.
However, what you can probably do is: you can suppress the default ORDER BY with '--' and write your own ORDER BY only on those columns (lookup ports) that are being used in the joining. Let's say, you are performing a lookup in the "Employee" table based on emp_id and returning name, salary, department and date_of_joining columns. If you don't do anything, the default Lookup query that Informatica sends to database will be:
SELECT emp_id, name, salary, department, date_of_joining FROM Employee ORDER BY emp_id, name, salary, department, date_of_joining
But you know that you do not need to perform the ORDER BY on all those columns since joining column is emp_id and index cache will be built on only that column. Hence you can put your own order by in the SQL Override and suppress the default ORDER BY, like below:
SELECT emp_id, name, salary, department, date_of_joining FROM Employee ORDER BY emp_id --
This will ensure, unnecessary columns are not included in the database query
I recommend you always perform ORDER BY at least on the columns which are being used in the joining conditions in the lookup (i.e. lookup ports used in joining)
Remember Informatica creates its own index on the joining columns and therefore if the joining columns are ordered, Informatica would require less space (and time) to create the indexes.
Persistent Cache lookup can be Holy Grail for the cached lookups – if you know how to use them. Generally speaking, if the source data in the underlying lookup tables are not changing between consecutive sessions runs then one may use persistent lookup cache.
In short, to use a persistent cache, you have to check the following options in Lookup Transformation properties tab:
Once you do that, cache file created by Informatica session will *NOT* be deleted from the Cache directory and the same cache file will be used in all the consecutive runs. Advantage of doing this is you need not spend time building the same cache every time the session executes. However if the source data for the lookup changes meanwhile then you must refresh the cache by either of the following two options:
If the persistent cache lookup is reusable, then you can share the same lookup in multiple mapping without rebuilding the cache in each one of them. And you can have one additional mapping with re-cache option enabled for this lookup, which you can run whenever you need to refresh the cache file.
Also note that there are some disadvantages of using persistent cache lookup.
Without going much deeper in to the disadvantage part, I would like to mention that there is one particular disadvantage that you must take into account while opting for persistent cache option for big lookup tables. If the cache file size of your lookup table is more than 2GB, most likely Informatica will create multiple cache files for one lookup wherein maximum file size for each file will be 2GB.
If those are data cache files, Informatica will name them as .dat1, .dat2, .dat3 … etc. and corresponding index cache files will be named as .idx1, .idx2, .idx3 … etc.
Also note that in many flavors of UNIX (e.g. HP-UX 11i), NOLARGEFILES is a default option for the file system. This option prevents applications or users to create file larger than 2GB. You can check whether LARGEFILE option is enabled in your server by issuing the following command:
getconf FILESIZEBITS /mountpoint_name
However, please note that irrespective of the fact whether LARGEFILE option is enabled or disabled, Informatica will not create cache sized above 2GB (This is true for both 32-bit and 64-bit versions of Informatica)
My personal opinion is breaking one single data or index cache file into multiple files may slow down the lookup performance. Hence if your lookup cache size is more than 2GB then if possible consider the option of joining the lookup source table in the database level itself instead of building lookup cache.
If all the other factors remain same, choice of connected and unconnected lookup can impact lookup performance in one particular case. In case of a connected lookup, since the lookup is connected in the data flow pipeline so the lookup will be called for each record coming from source, irrespective of the fact whether the data returned from lookup is at all used later in the mapping or not. In case of unconnected lookup, this can be controlled by calling the lookup only when it is really needed.
To illustrate the difference, let’s consider the following example. Suppose you are loading sales data and if the sales is done in one specific geography location (say North America), then you want to populate one additional column in your target table with GST amount (which is based on which American State the sales was made). Suppose one record in the source is like this:
SalesID = 100067345 SalesLocation = CA Geography = North America SalesAmount = 100.54 ProductPartNo = PX937
In case of connected lookup, the lookup will be called for all the records irrespective of the fact whether the sales Geography is in North America or not. In case of unconnected lookup, you can write something like this inside an Expression Transformation, which will ensure the lookup is only called when Sales Geography is Northa America:
IIF (SALES_GEOGRAPHY = ‘North America’, :LKP.LKP_GET_GST_PCT_PER_STATE(SALES_LOCATION), 0)
Suppose out of 1,000,000 records, the above condition matches only for 100,000 records (10%). Then you save calling the lookup unnecessarily for 90% of cases.
Informatica may build the lookup cache even before the first row reaches the Lookup transformation (Pre-build scenario) or it may build the lookup cache only when the first row reaches the lookup transformation (On demand scenario). You can let Informatica decide what it wants to do (pre-build or on demand) or you may explicitly tell Informatica what you want to do. All you need to do is set the “Pre-Build Lookup Cache” option to “Always allowed” under the lookup transformation Properties tab to force Informatica build all the lookup caches beforehand.
So what’s the difference? How does it impact the performance? Generally speaking pre-building all the lookup cache beforehand may give you occasional performance boost as once the source qualifier starts pulling data, the processing of that data need not wait any more for the completion of building of lookup caches.
Performance boost of pre-building lookup caches is more apparent in cases where the source query itself is taking considerable amount of time to return rows so the same time can be simultaneously used to pre-build all the lookup caches instead of waiting for the source to return results and then starting to build other lookup caches on demand.
I said the performance benefit is occasional because there are number of other factors that need to be considered before setting Pre-Build Lookup Cache to Always allowed.
To me, this option which is available in session level under “Config Object” tab, is more of a limiting option rather than a performance augmenting option. This option determines how many additional concurrent pipelines Integration service can use while trying to build lookup caches simultaneously. I prefer setting it to “Auto”.
If you set it to Auto, Informatica will decide the value at run time while building the caches concurrently. Below I describe how Informatica will typically behave when you set this to “Auto”. Suppose one mapping has 4 lookup transformations that Informatica decided to build concurrently. It will give a message like below first in the session log:
Enabled using [2 (auto)] additional concurrent pipelines to build lookup caches. (Session likely will build or refresh  lookup caches;  on-demand only)
The above message means that Informatica has started 2 pipelines to handle 2 different lookups already. So this means 2 more lookups are yet to be built. When Informatica starts building those 2 more lookups, based on the situation whether the already allocated pipelines for the first 2 lookups are free by that time or not, it will generate any of the below two messages respectively:
Starting additional concurrent pipeline to build the lookup cache needed by Lookup transformation [LKP_XXX_XXX_XXX]
Lookup cache of [LKP_XXX_XXX_XXX], previously started by an additional concurrent pipeline, is now needed on demand.
The first message will appear if –
Now that we understand how additional pipeline behaves, let’s see how we can use this property to augment session performance. As I described before, the best thing to do with this is to set it to “Auto” which will anyway use the maximum possible pipelines when lookups are required to be built concurrently. However if you want to limit the concurrency then you can specify a numeric number there which will mean that no more additional pipelines will be created after it reaches the specified number. If any more additional pipeline is required, then it will wait for any current pipeline to free up first so that Informatica can reuse this. If you set it to 0 (zero), all lookups will be built serially.