Science of Data

Posts

Showing posts from 2010

Should we keep Index and Data in separate tablespace?

Index and Data in Separate Tablespaces Recently I got caught up in a developer-DBA argument regarding placing of database indexes in separate tablespace from data tablespace. These developers and DBAs belong to a project where all data and indexes are stored in different tablespaces. But still there are a few indexes on a daily truncate-and-load type temporary table that are managed through the ETL code. Meaning, those indexes are dropped before batch loading and recreated after the load. The trouble here is - the ETL jobs create the index in the data tablespace instead of creating them in the index tablespace. While DBA wants developers to change the code to create the indexes in the proper tablespace, developers’ argument is why creating indexes in a different tablespace are so important? Why do you need a separate Tablespace for Indexes? The DBA’s argument is they need a separate tablespace for indexes for performance reasons. And that is off course wrong. Putting all your indexes i...

Testing in data warehouse projects

Recently Arun Sundararaman from Accenture has posted an article on DWBI Testing in Information-management.com. The article can be found out here . The article brings in a very timely discussion on the state of testing methodologies for data warehousing projects. DWBI testing is so far the least explored area in the data warehousing domain. Majority of data warehousing projects that fail, rarely fail in the implementation phase, rather they mostly fail in the user acceptance phase. This is largely due to the fact that end users often find their data warehouse generating unacceptable reports (Or reports generating numbers outside their "tolerance" limit) while compared to actually known business scenarios. Whatever be the root cause of that, proper testing is the only way of detecting and fixing those issues. Unfortunately, in the current data warehousing context, the only viable method of testing is through manual SQL scripting. Metadata management tools fail miserably if ...

Compare between CTAS, COPY and DIRECT PATH Load in Oracle

OK. Here is a simple task that I am trying to achieve all through out tonight. Loading Huge Table Over DBLINK I have a big (infact very big) table on one database (SRCDB) and I am trying to pull the data from that table to a table in different database (TGTDB). Both SRCDB and TGTDB reside in different HP Unix servers connected over network. And I have only SELECT privilege on the SRCDB table. The table has no index (And I am not allowed to create an index, or any database object for that matter, in the SRCDB). But the SRCDB table has many partitions, only one of which I am supposed to pull from that. Let's suppose the SRCDB table has 10 partitions. Each partition has 500 million records. And as I said above, I need to pull data from only one partition to the target. So what will be the best suited strategy here? My Options When I started to think about this, following options came into my mind: 1. Using Transportable Table Space 2. Using CTAS over DBLink 3. Using direct load path 4...

Can a Oracle parallel hint be evil?

About Oracle query hint, this is often said that - "An hint, if ignored by Oracle, is just a comment". But I didn't think that a "mere" hint can lead to a job failure also. My friend put a parallel hint in one of the SQL query in a job. And the SQL threw Ora-01652 error (Unable to extend temp space). The first thing that came to my mind is - "how come a parallel hint is causing temp space failure?". I do understand that parallel hint can require much more system resources for processing the query - BUT - the total amount of temp space required by that query should remain same. After all the amount of data in the table that the query accesses does not increase when we access the query through multiple parallel threads. But I was wrong. Very Very wrong. I put the question to Tom Kyte's forum asktom and here is the answer I got, Looks like the px coordinators can require some "extra" spaces when they combine the results from different paral...

Google Insight - A simple implementation of Data Mining

Not many people know about Google Insight . Google Insight is a web based data mining tool from Google that analyze the search patterns across specific regions and time frame. We are always subjecting Google with different search queries, Google use these search queries to analyze trends of those searches across different dimensions. Consider the phrase "Data Warehousing". Many people are searching this term in Google for so many years, so many times. Google can use that data to plot the trends like "Popularity" of this phrase over time. Check it below, I believe this is a nice little data mining tool from Google that can be utilized free of cost to understand the mining patterns.

Informatica Incremental Aggregation

Saurav has posted a new article here on Incremental Aggregation Using Informatica . The need of incremental aggregation arise when we capture our source data (transactional data) incrementally in a frequency faster than the aggregation period. Take this example, a data warehouse system is refreshed every night from source data. The data warehouse has a monthly aggregated table. So it is obvious that every day's data you need to aggregate and put together in the monthly table. But in stead of loading the monthly table at month end, if you consider loading this monthly table everyday or every week or bi-monthly, then incremental aggregation is possibly the best option for you. Now performance wise, it remains an open question on how good is Informatica in doing incremental aggregation. I think Saurav might consider an other article by putting informatica in test with considerable data volume.

A List of the future articles scheduled to be published in www.akashmitra.com

Following are the list of articles I am planning to publish in this April'10 in the site, Oracle 1. Detail Oracle Server Architecture 2. A list of common and effective Oracle hints 3. How to read and interpret AUTO TRACE result 4. Brief Oracle Index tutorial for the application developers Informatica 1. All about Informatica Partitioning 2. All about Informatica LookUps Apart from these, I am also planning to introduce a new section on Data Warehousing Project Management in the site. Let me know your comments..

Companion Blog for www.akashmitra.com

This blog is a companion blog for www.akashmitra.com . Akashmitra.com is a dedicated website created for the data warehousing practitioners around the world. Visit akashmitra.com to read latest data warehousing news, articles, white papers and tutorials. This blog will periodically post the links to the newly published contents in akashmitra.com. All the future releases, upcoming articles, major changes etc. can be viewed, requested or discussed here. This blog can be used as an way for the akashmitra.com users to communicate with the team behind akashmitra.com.