Saturday, March 12, 2011

List of Forthcoming articles in DWBIConcepts.com

Currently working on the following articles for www.DWBIConcepts.com:

1. A primer on Informatica MetaData Repository Tables

2. Benefits and troubles of Informatica Persistent Cache

3. Writing SQL Query to extract simple information from Informatica Metadata tables


Target to publish by 20th March 2011

Blogged using iPhone

Monday, August 30, 2010

Should we keep Index and Data in separate tablespaces?

Index and Data in Separate Tablespaces

Recently I got caught up in a developer-DBA argument regarding placing database indexes in a separate tablespace from the data tablespace. These developers and DBAs belong to a project where all data and indexes are stored in different tablespaces.

But there are still a few indexes on a daily truncate-and-load type temporary table that are managed through the ETL code. That is, those indexes are dropped before the batch load and recreated after it. The trouble here is that the ETL jobs create the indexes in the data tablespace instead of the index tablespace.
While the DBA wants the developers to change the code to create the indexes in the proper tablespace, the developers' argument is: why is creating indexes in a separate tablespace so important?
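
For what it's worth, the code change the DBA is asking for is a one-liner. A minimal sketch, with a hypothetical index, table and tablespace name:

CREATE INDEX idx_tmp_load_key ON tmp_daily_load (load_key)
TABLESPACE idx_tbl_space;  -- explicitly name the index tablespace

Without the TABLESPACE clause, Oracle creates the index in the default tablespace of the schema owner, which in this project happens to be the data tablespace.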

Why do you need a separate Tablespace for Indexes?

The DBA’s argument is that a separate tablespace for indexes is needed for performance reasons.

And that is of course wrong.

Putting all your indexes in a separate tablespace from your data does not by itself increase performance (a tablespace is not a 'performance thing'). Table data and index blocks are not read simultaneously: Oracle reads the index first, followed by the data. So what really needs to be achieved for performance is evenly distributed I/O over all available devices. Separating data and index does not necessarily achieve balanced I/O. And if one really wants evenly distributed I/O, a far better idea is to take all the available devices, create one really big stripe set, and put everything in it.

So the DBA is wrong if s/he wants it for performance reasons.



But then, is there any other benefit to separating index and data tablespaces?

There can be some administrative plus points, though. For example, separating index and data might help in recovery: if you place all of your indexes on a separate disk and that disk becomes corrupt, you don't need a backup to restore it, because you can simply rebuild the indexes on a new device.
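
The rebuild itself is a one-liner per index. A minimal sketch, assuming a hypothetical index name and a recreated index tablespace:

ALTER INDEX idx_tmp_load_key REBUILD TABLESPACE idx_tbl_space;

The affected index names can be pulled from DBA_INDEXES (filtering on TABLESPACE_NAME) to generate one such statement per index.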

Sunday, August 8, 2010

Testing in data warehouse projects

Recently Arun Sundararaman from Accenture posted an article on DWBI testing on Information-management.com. The article can be found here. It brings in a very timely discussion on the state of testing methodologies for data warehousing projects.

DWBI testing is so far the least explored area of the data warehousing domain. Most data warehousing projects that fail do so not in the implementation phase but in the user acceptance phase. This is largely because end users often find their data warehouse generating unacceptable reports (or reports with numbers outside their "tolerance" limit) when compared with known business scenarios. Whatever the root cause, proper testing is the only way of detecting and fixing those issues.

Unfortunately, in the current data warehousing context, the only viable method of testing is manual SQL scripting. Metadata management tools fail miserably when "SQL Override" or "Stored Procedures" are used in the ETL phase. But that is not the only obstacle to automated testing. The main issue is that we are yet to come up with a generic testing strategy for data warehouse data reconciliation.
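
To illustrate what that manual SQL scripting typically looks like, here is a minimal reconciliation sketch; the table, column and DB link names are hypothetical:

-- Compare row count and a control total between source and target
SELECT 'SOURCE' AS side, COUNT(*) AS row_cnt, SUM(sales_amount) AS amt_total
FROM src_sales@src_db_link
WHERE sale_date >= DATE '2010-07-01'
UNION ALL
SELECT 'TARGET', COUNT(*), SUM(sales_amount)
FROM dw_sales_fact
WHERE sale_date >= DATE '2010-07-01';

Every such check has to be hand-written per table and per measure, which is exactly why a generic, tool-driven reconciliation strategy is so badly needed.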

I believe it is high time that data warehousing practitioners, both individuals and organizations, took data warehouse testing seriously and developed a common methodology for it.

Saturday, July 10, 2010

Comparing CTAS, COPY and Direct Path Load in Oracle

OK. Here is a simple task that I have been trying to achieve all through tonight.

Loading Huge Table Over DBLINK


I have a big (in fact, very big) table in one database (SRCDB) and I am trying to pull the data from that table into a table in a different database (TGTDB). SRCDB and TGTDB reside on different HP-UX servers connected over the network. I have only the SELECT privilege on the SRCDB table. The table has no index (and I am not allowed to create an index, or any database object for that matter, in SRCDB). But the SRCDB table has many partitions, and I am supposed to pull from only one of them.

Let's suppose the SRCDB table has 10 partitions, each with 500 million records. And as I said above, I need to pull data from only one partition to the target. So what is the best-suited strategy here?

My Options


When I started to think about this, the following options came to mind:
1. Using Transportable Table Space
2. Using CTAS over DBLink
3. Using direct load path
4. Copy command
5. Data Pump

Comparing Copy, Direct Path and CTAS


I started with transportable tablespaces, as this is supposed to be the fastest option because it copies the whole tablespace from one database to the other. But I soon realized that I cannot use the transportable tablespace option, since I do not have any DDL permission on the source side. In order to use transportable tablespaces, one needs to alter the tablespace to make it read-only, like below:

ALTER TABLESPACE src_tbl_space READ ONLY;

Once I ruled out the transportable tablespace option, it didn't take much time to rule out the COPY option as well. Not only is COPY deprecated, it is also inherently slower than the other options. So at this point I had only the CTAS and direct path load options open. To use either of them, I knew I must first create a DB link at my end to access the data from the remote end; a DB link lets me SELECT remote data as easily as I SELECT from my local system. So without wasting much time, I first created a DB link (sketched below) and then wrote a direct path INSERT script.
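
For completeness, here is a minimal sketch of the link creation; the link name, credentials and TNS alias are all hypothetical:

CREATE DATABASE LINK srcdb_link
CONNECT TO src_reader IDENTIFIED BY src_password
USING 'SRCDB_TNS';  -- TNS alias pointing to the SRCDB server

With the link in place, the direct path INSERT looked like this: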

INSERT /*+ APPEND */
INTO TGT_TABLE
SELECT *
FROM SRC_TABLE@SRCDB_LINK;  -- pull the remote table over the DB link

Voila! This method is faster than a conventional insert (that is, a plain INSERT INTO statement) because it writes data directly into the datafiles, bypassing the buffer cache. It also loads above the high-water mark of the target table, thereby increasing performance (note that the target must have no enabled triggers or referential integrity constraints, otherwise Oracle silently falls back to a conventional insert). So this looked like a perfect solution. But then I paused.
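
One quick way to confirm the load really went direct path: until you commit, the same session cannot read a table it has just direct-path loaded. A small sanity check:

INSERT /*+ APPEND */ INTO TGT_TABLE SELECT * FROM SRC_TABLE@SRCDB_LINK;
SELECT COUNT(*) FROM TGT_TABLE;  -- raises ORA-12838 if the insert went direct path
COMMIT;

If that SELECT succeeds without error before the COMMIT, Oracle has silently fallen back to a conventional insert.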
Somewhere I had heard that CTAS (CREATE TABLE AS SELECT) is faster than the direct path method, so I thought of giving it a try. The only issue with CTAS is that it creates the target table, so if the target table is pre-existing, I cannot use this method. Fortunately, my target table is not pre-existing, so I can use CTAS. Moreover, CTAS statements that reference remote objects can run in parallel. So I fired a statement like this:

CREATE TABLE TGT_TABLE
NOLOGGING PARALLEL 8
AS
SELECT *
FROM SRC_TABLE PARTITION (my_partition) @SRCDB_LINK;

And guess what? I encountered the following:
ORA-14100: partition extended table name cannot refer to a remote object.

(At this point I realized that even a simple task like this can give me such a hard time. But I guess I can't help it - that's the life with databases!)

So I realized the only option left was to use a WHERE clause and hope for partition pruning. So I did:

CREATE TABLE TGT_TABLE
NOLOGGING PARALLEL 8
AS
SELECT *
FROM SRC_TABLE@SRCDB_LINK
WHERE record_date BETWEEN TO_DATE('02-Jan-2010', 'DD-Mon-YYYY')
                      AND TO_DATE('10-Jan-2010', 'DD-Mon-YYYY');

And finally I am done. At the end of the exercise, I know how to bring a big table over from a remote database in the most efficient way available to me.

Friday, June 25, 2010

Can an Oracle parallel hint be evil?

It is often said of Oracle query hints that "a hint, if ignored by Oracle, is just a comment". But I didn't think that a "mere" hint could also lead to a job failure.

My friend put a parallel hint in one of the SQL queries in a job, and the SQL threw an ORA-01652 error (unable to extend temp segment). The first thing that came to my mind was: "how come a parallel hint is causing a temp space failure?" I do understand that a parallel hint can require much more system resources to process the query - BUT - the total amount of temp space required by that query should remain the same. After all, the amount of data in the table that the query accesses does not increase when we run the query through multiple parallel threads.
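
I don't have my friend's exact SQL, but any query with a heavy sort or hash aggregation shows the same behaviour; all names in this sketch are hypothetical:

SELECT /*+ PARALLEL(s, 16) */
       customer_id,
       SUM(sales_amount) AS total_amount
FROM   sales_history s
GROUP  BY customer_id
ORDER  BY customer_id;  -- the sort/aggregation is what spills to temp

Each parallel slave gets its own sort work area, and whatever spills from those work areas lands in the temporary tablespace.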

But I was wrong. Very, very wrong.

I put the question to Tom Kyte's forum, AskTom, and here is the answer I got:


It looks like the PX coordinators can require some "extra" temp space when they combine the results from the different parallel processes.
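
If you hit this yourself, a simple way to watch where the temp space is going while the parallel query runs (this sketch assumes you have SELECT privilege on the V$ views):

-- Temp usage per session, in blocks; multiply by the temp tablespace
-- block size (commonly 8 KB) to get bytes
SELECT s.sid,
       s.username,
       u.tablespace,
       u.segtype,  -- SORT, HASH, etc.
       u.blocks
FROM   v$tempseg_usage u
JOIN   v$session s ON s.saddr = u.session_addr
ORDER  BY u.blocks DESC;

Running this during the load shows each PX slave holding its own temp segment, on top of whatever the coordinator itself grabs.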

Did you know that?