Working around duplicates

I think I’ve got my duplicate rows in the student schedule table (SQL data warehouse) problem solved. The problem appears to be invalid data held in two of the five tables from where I was pulling the information. Unfortunately, there isn’t an easy way to fix the problem on the other MS SQL server so I’ve changed my DTS package to compensate.

A comment was left with the idea of using a trigger on the table that would initiate when duplicates were found. This trigger would move the duplicates to another table. That looked to be a good idea until I realized how many duplicates I had (over 2 million) so I’ve gone at it with a different approach that will utilize less server resources.

My original DTS package had one data transfer step that pulled everything I needed for the student schedule from the school’s live database. I’ve broken it into two steps that will use two temp tables and a final table (instead of just one temp table). I decided to break it into two steps (one teacher, one student) after seeing that I didn’t get duplicate data if I only pulled student information without tying in the teacher schedule in the same step. The same was true if I only pulled the teacher schedule.

The first step (for schedule data, not the entire DTS package) pulls the teacher schedule information into one temp table and the second step pulls the student schedule information into the other temp table. The tables are then merged into a single table by matching up the course ID number, school ID and term ID between the teacher and student temp tables. The final step truncates the temporary tables since that information is no longer required.

The new package creates about 740,000 rows (number of students x number of classes x number of terms) instead of over 2.5 million. It only takes about 35 seconds for the package to run. The old warehouse had a DTS package that would do this but it would include all the duplicates and take over 5 minutes to run.

This was the last big hurdle in creating the new data warehouse. I will add this DTS package to the main package that calls all the packages needed to create/update the warehouse, at the first of the year, when I get in to work tomorrow. I will also test the entire package to see how long the entire process takes. If I’ve guessed correctly, judging by how long each package takes individually, the entire process should take less than 1.5 minutes. I know that seems to be quite a long time but there are several packages that run to populate the warehouse with current information. One and a half minutes is still a lot lower than the 10+ minutes the previous warehouse packages took to completely run.

Hating people is like burning down your own house to get rid of a rat. – Henry Emerson Fosdick

18.Dec.05 General, Microsoft SQL


You can follow any responses to this entry through the RSS 2.0 feed.
You can leave a response, or trackback from your own site.

Leave a Reply