Wednesday, March 7, 2012

Only transform unique rows

Hi,

I have created an SSIS package that includes a Data Flow to transform my data. The Data Flow Source for this Data Flow is a SQL Server table with approx. 60 million rows. These rows ought to be unique, but aren't quite, so I would like to ensure that only one copy of each row makes it through to my Data Flow Destination, another SQL Server table.

What is the most efficient way to ensure this? Is there a Data Flow Transformation that can guarantee it, should I write a SQL statement that deletes all duplicates before the Data Flow runs, or should I take a third approach?

- Sune

I can see two ways to do so:

1. OLE DB source -> Sort -> OLE DB dest

Configure Sort transform to "Remove rows with duplicate sort values".

2. OLE DB source -> Aggregate -> OLE DB dest

Configure Aggregate transform to group by all columns.

No. 2 might be more efficient. Would you like to try it and tell us the results? I'm very interested to know.

Actually, there is a third way: use a T-SQL statement to do the dedupe. This can be much more efficient than either of the two options mentioned above, since the engine is optimized for this kind of work, especially if it can use an index. I think you would have to run some benchmarks to determine which is optimal for your particular scenario.
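The T-SQL route mentioned above can be sketched as a plain `INSERT ... SELECT DISTINCT`, letting the engine keep one copy of each row. The table and column names below are hypothetical, and the demo uses SQLite through Python purely to keep the example self-contained; the same statement runs unchanged on SQL Server:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical source table containing duplicate rows.
cur.execute("CREATE TABLE src (id INTEGER, name TEXT)")
cur.executemany(
    "INSERT INTO src VALUES (?, ?)",
    [(1, "a"), (1, "a"), (2, "b"), (3, "c"), (3, "c")],
)

# Destination table with the same schema.
cur.execute("CREATE TABLE dest (id INTEGER, name TEXT)")

# Let the engine dedupe: DISTINCT over all columns keeps exactly one
# copy of each row. On SQL Server, an index covering these columns
# can let the optimizer satisfy the DISTINCT without an extra sort.
cur.execute("INSERT INTO dest SELECT DISTINCT id, name FROM src")

rows = cur.execute("SELECT id, name FROM dest ORDER BY id").fetchall()
print(rows)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

For the 60-million-row case in the question, the relative cost of this versus the Sort or Aggregate transforms will depend on indexing and memory, which is why benchmarking each approach is worthwhile.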

Thanks,
Matt

I have actually been using a T-SQL statement up until now, and it has been working okay. However, all of a sudden performance has degraded significantly, which made me wonder whether there were any alternative ways.

I hope to get around to running some benchmarks during the weekend, and I promise to post the results.

Thanks for your help so far.

- Sune
