
SSIS is for ETL, ELT or EL… clamor continued

September 3, 2013

Posted by msrviking in Business Intelligence, Data Integration, Design, Integration Services.

In yesterday’s post I mentioned that we had to design and implement an SSIS solution that does only EL, and I griped that it addressed only the non-functional requirements. It was purely a process of extraction and loading: EL.

Today let me start off with the definitions of ETL and ELT, and then the new term EL, where the T is missing, or supposedly not to be considered. While writing this post I realized I should add a few notes on what ETL and ELT mean, and on the differences between the two implementations. I could have listed the points in a neat comparison table, but blame it on my laziness that I am sharing these links instead. Trust me, I have read them, and I fully acknowledge the relevance of their technical content to ETL vs ELT.

http://www.dataacademy.com/files/ETL-vs-ELT-White-Paper.pdf

http://blog.performancearchitects.com/wp/2013/06/13/etl-vs-elt-whats-the-difference/

In the end, ELT means that you have powerful infrastructure at your destination to handle the transformation using set-based operations, but you then need a staging area there in which to transform the data before pushing it into the destination tables. In ETL processes, by contrast, the tool has the power to do the transformation on its own infrastructure, in parallel and in memory, away from the destination systems. After the data is transformed, it is loaded into the databases for the business to consume.
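
To make the two flows concrete, here is a rough sketch in Python, using an in-memory SQLite database as a stand-in for the destination. Everything in it is made up for illustration (the `orders` tables, the sample rows, the function names); it is not SSIS. In `etl` the rows are transformed in the tool's memory before loading; in `elt` the raw rows are loaded into a staging table and the destination transforms them with a single set-based statement.

```python
import sqlite3

# Hypothetical source rows: (order_id, amount in cents, currency code in mixed case)
SOURCE_ROWS = [(1, 1250, "usd"), (2, 900, "Eur"), (3, 4000, "USD")]


def etl(dest):
    """ETL: transform in the tool's memory, then load the finished rows."""
    transformed = [(oid, cents / 100.0, cur.upper()) for oid, cents, cur in SOURCE_ROWS]
    dest.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", transformed)


def elt(dest):
    """ELT: load raw rows into staging, then transform inside the destination."""
    dest.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", SOURCE_ROWS)
    # One set-based statement does the whole transformation in the database engine.
    dest.execute(
        "INSERT INTO orders_elt "
        "SELECT order_id, amount_cents / 100.0, UPPER(currency) FROM staging_orders"
    )


def run():
    dest = sqlite3.connect(":memory:")
    dest.executescript("""
        CREATE TABLE staging_orders (order_id INTEGER, amount_cents INTEGER, currency TEXT);
        CREATE TABLE orders_etl (order_id INTEGER, amount REAL, currency TEXT);
        CREATE TABLE orders_elt (order_id INTEGER, amount REAL, currency TEXT);
    """)
    etl(dest)
    elt(dest)
    etl_rows = dest.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
    elt_rows = dest.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
    return etl_rows, elt_rows
```

Both paths end with the same rows in the destination; the difference is only where the transformation work is done and whether a staging table is needed.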

With these principles in mind, we still had to implement EL rather than ETL or ELT. Personally I didn’t think the approach was a good one, since the same thing could have been done easily, without any development effort, using CDC for Oracle in SQL Server or Replication. However, I did find some finer advantages in this approach, and I remember noting them in one of my earlier posts. They are:

  1. SSIS doesn’t need any changes on the source end, other than configuring a connector to pull the data.

    Replication from Oracle, by contrast, has to create temporary tables and triggers in the source schema, which is an overhead on top of the changes to the source schema itself. No business or system owner is likely to accept that. CDC for Oracle in SQL Server goes a long way towards addressing this problem, but enabling certain properties on the source instance adds overhead of its own from a security standpoint.

  2. SSIS is essentially for flow control of data, and we can sync data in a preset order of steps. For example, the master tables’ data should be ported first, and then the transactional tables’ data.
  3. SSIS can be configured to pull data from the source based on business criteria, by using parameters in the WHERE clauses.
  4. SSIS gives finer control over error logging and handling, and of course we can have retry packages that pull in the data that failed to load.
  5. SSIS also gives me good control over defensive programming, so that all the business data ends up in place.
  6. SSIS can be tweaked and tuned for optimal performance from extraction through loading.
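
Points 2 to 4 can be sketched outside SSIS as well. The Python below is only an illustration under my own assumptions: two in-memory SQLite databases stand in for Oracle and SQL Server, and the table names, the `modified_on` column and the `run_el` helper are all hypothetical. It loads the master table before the transactional one, filters the extract with a parameter in the WHERE clause, and logs and retries a failed table before giving up.

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("el")

# Load order: the master (parent) table first, then the transactional table.
LOAD_ORDER = ["customers", "orders"]


def copy_table(src, dest, table, since):
    """Extract rows modified on or after `since` and load them as-is (no transform)."""
    rows = src.execute(
        f"SELECT * FROM {table} WHERE modified_on >= ?", (since,)
    ).fetchall()
    if rows:
        placeholders = ", ".join("?" * len(rows[0]))
        with dest:  # one transaction per table: commit on success, roll back on error
            dest.executemany(
                f"INSERT OR REPLACE INTO {table} VALUES ({placeholders})", rows
            )
    return len(rows)


def run_el(src, dest, since, max_attempts=3):
    loaded = {}
    for table in LOAD_ORDER:
        for attempt in range(1, max_attempts + 1):
            try:
                loaded[table] = copy_table(src, dest, table, since)
                log.info("loaded %d rows into %s", loaded[table], table)
                break
            except sqlite3.Error as exc:
                log.warning("attempt %d for %s failed: %s", attempt, table, exc)
        else:
            # Stop so a dependent table is never loaded ahead of its master table.
            raise RuntimeError(f"giving up on {table} after {max_attempts} attempts")
    return loaded


def demo():
    src = sqlite3.connect(":memory:")
    dest = sqlite3.connect(":memory:")
    ddl = """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, modified_on TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, modified_on TEXT);
    """
    src.executescript(ddl)
    dest.executescript(ddl)
    src.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Asha", "2013-08-30"), (2, "Ben", "2013-09-02")],
    )
    src.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, "2013-08-31"), (11, 2, "2013-09-02")],
    )
    src.commit()
    return run_el(src, dest, since="2013-09-01")
```

Calling `demo()` copies only the rows modified on or after the cut-off date, one table at a time in the declared order. In a real package the same ideas would map to precedence constraints, parameterized source queries and error-handling paths rather than to Python code.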

Overall, we have control over what happens while the data is extracted and loaded. Now I can’t help thinking that all of these are perfectly good design guidelines for any ETL or ELT process, and how I wish we were not applying them to just an EL process, huh.

Cheers!

Oracle to SQL Server data synchronization: what are the other ways?

May 17, 2013

Posted by msrviking in Architecture, Configuration, Data Integration, Heterogeneous, Integration, Integration Services, Replication.

In this series I have been talking about environment setup, configuring the box with the appropriate drivers, setting up the Oracle Publisher, and then, in a short post, Transactional Replication. While doing this I had my share of issues in setting up and configuring the whole workflow, so a parallel thread was running in my head asking only one question: “Is this the only way to synchronize data from Oracle?”

The only other option that looked viable was SSIS, and with SSIS 2012 a lot more features have come in to help the situation. I haven’t yet worked with SSIS and CDC for Oracle as described in the links below, but I intend to sometime.

Here are a few links I found when I Googled the topic:

1. Microsoft SQL Server 2012 Change Data Capture for Oracle by Attunity Topics (http://social.technet.microsoft.com/wiki/contents/articles/7644.microsoft-sql-server-2012-change-data-capture-for-oracle-by-attunity-topics.aspx)

2. CDC for Oracle in SQL Server 2012 (http://blogs.msdn.com/b/mattm/archive/2012/03/26/cdc-for-oracle-in-sql-server-2012.aspx)

3. CDC in SSIS for SQL Server 2012 (http://www.mattmasson.com/2011/12/cdc-in-ssis-for-sql-server-2012-2/)

4. Installing Microsoft SQL Server 2012 Change Data Capture for Oracle by Attunity (http://social.technet.microsoft.com/wiki/contents/articles/7647.installing-microsoft-sql-server-2012-change-data-capture-for-oracle-by-attunity.aspx)

5. Change Data Capture (CDC) for Oracle video now available (http://www.mattmasson.com/2012/09/change-data-capture-cdc-for-oracle-video-now-available/)

6. SQL Server 2012 CDC for Oracle – a Review of One Implementation (http://blogs.msdn.com/b/dfurman/archive/2013/04/15/sql-server-2012-cdc-for-oracle-a-review-of-one-implementation.aspx)

7. Troubleshoot CDC Instance Errors in Microsoft Change Data Capture for Oracle by Attunity (http://social.technet.microsoft.com/wiki/contents/articles/7642.troubleshoot-cdc-instance-errors-in-microsoft-change-data-capture-for-oracle-by-attunity.aspx)

Although I say that SSIS is the best option, I have yet to test it, and I wanted to list the differences between implementing with Replication and with SSIS. So I built a quick comparison table between Replication and SSIS, from the perspective of DDL and performance to start with; it is a work in progress. As of now I can’t say when I will complete the table with other factors included. Nevertheless, I think it should help you folks too, just in case.

| Feature Category | Feature | SSIS | Replication | Notes |
|---|---|---|---|---|
| DDL / schema changes | New column addition; change in mapped (source) column name; change in data type of source column; drop column in the source table | Yes (customized coding and SMO) | No | SSIS: SMO-based programming can help detect changes at the source and remap them in the destination. Replication: the Subscriber has to be dropped, and the Publisher has to add the new table or regenerate the snapshot in the case of an Oracle publisher. |
| Performance | On source table or system | Impacted | Impacted | Replication: SQL Server replication creates triggers on the published tables to track data changes at row level, and such triggers could be an overhead on the Oracle published tables. SSIS: the performance impact on the source table is lower than with replication, since no triggers are created to track data changes; this could be achieved using CDC for Oracle with SSIS 2012. |
| Performance | On destination table or system | Impacted | Impacted | Replication: with large data changes (inserts and updates), transactional replication could cause a performance bottleneck on the destination/subscriber tables, because the DML statements cause lock escalation on the tables; another option for large data changes is snapshot replication. SSIS: the impact is lower and can be controlled with batch operations. |
| Synchronization | Real-time | Supported | Supported | Replication: transactional replication achieves real-time data sync between Oracle and SQL Server with very minimal latency, but large data changes could impact performance. SSIS: with CDC for Oracle using SSIS 2012, real-time synchronization is achievable without the trigger overhead, since no triggers are created on the Oracle tables, and capturing the changes needs little configuration or maintenance because CDC creates the change tables in SQL Server. The captured changes are applied by SQL Server through the CDC tables. |
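
For the DDL row, the idea of detecting source changes so the destination can be remapped boils down to comparing column metadata on both sides. SMO is a .NET API, so the Python below is not SMO; it is just a small sketch of the comparison itself, assuming SQLite's `PRAGMA table_info` as a stand-in for the real catalogs, with a made-up `items` table and hypothetical helper names.

```python
import sqlite3


def columns(conn, table):
    """Return {column_name: declared_type} for a table, read from SQLite's catalog."""
    # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    return {
        row[1]: row[2].upper()
        for row in conn.execute(f"PRAGMA table_info({table})")
    }


def diff_schema(source_cols, dest_cols):
    """Classify the differences between source and destination column metadata."""
    shared = set(source_cols) & set(dest_cols)
    return {
        "added": sorted(set(source_cols) - set(dest_cols)),
        "dropped": sorted(set(dest_cols) - set(source_cols)),
        "retyped": sorted(c for c in shared if source_cols[c] != dest_cols[c]),
    }


def demo():
    src = sqlite3.connect(":memory:")
    dest = sqlite3.connect(":memory:")
    # The source has a renamed column (name -> title), a new column (sku)
    # and a changed data type (price) compared with the destination.
    src.execute("CREATE TABLE items (id INTEGER, title TEXT, price REAL, sku TEXT)")
    dest.execute("CREATE TABLE items (id INTEGER, name TEXT, price INTEGER)")
    return diff_schema(columns(src, "items"), columns(dest, "items"))
```

Note that a renamed column shows up as one added and one dropped column, because metadata alone cannot tell a rename from a drop followed by an add. Deciding how to remap it would still need custom logic.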

Please feel free to share your thoughts or comments. Happy reading.

Cheers!