
A year and more gone by… no posts March 23, 2015

Posted by msrviking in Configuration, Data Integration, Design, Integration, MySQL, Oracle, Performance tuning.

Only I know how much I have missed blogging about everything I learnt on MySQL and Oracle RDS over the last 14 months, and how much I wished all of that work had been on SQL Server. Well, it isn't always the way you want it to be.

There was so much learning over this period about the way things are done in MySQL and Oracle at the design, development and performance engineering stages that I could spend several days blogging about it. Over the next few weeks to months I will be writing about a variety of these learnings, starting from design and coding through performance tuning for concurrency and scalability. Although the topics will primarily be around MySQL and Oracle, I might map some of the work onto equivalent SQL Server features and implementations.

Here is a brief preview of the topics; each will possibly grow into its own detailed sub-series with that mapping.

MySQL:

  1. What does the physical design checklist look like?
  2. What are the best instance-level configurations?
  3. What are the optimal working configurations for the host hardware and OS?
  4. What are the ways to optimize execution plans?
  5. What are the coding best practices?
  6. What are the ways to troubleshoot deadlocking problems? (a small teaser follows below)
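
To give a flavor of what I mean, here is a minimal sketch of the kind of first-pass checks I run on a MySQL instance for configuration and deadlocks; nothing here is specific to my setup, these are standard MySQL commands.

    -- Instance-level configuration: a couple of the InnoDB settings I look at first
    SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
    SHOW GLOBAL VARIABLES LIKE 'innodb_log_file_size';

    -- Deadlock troubleshooting: the LATEST DETECTED DEADLOCK section of this
    -- output shows the two transactions and the locks they were waiting on
    SHOW ENGINE INNODB STATUS;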

Oracle RDS:

  1. What are the best practices for coding in PL/SQL?
  2. What are the best practices for designing and building an integration DB?
  3. What are the ways to optimize execution plans? (a small teaser follows below)
  4. What are the ways to monitor for performance bottlenecks?
  5. What are the ways to achieve concurrency, scalability and performance?
  6. What is not doable when compared to on-premise instances?
  7. How do you administer the instance?
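
As a teaser for the execution-plan topic, here is a minimal sketch of how I usually start looking at a plan on Oracle (RDS or otherwise); the table and bind variable names are made up for illustration.

    -- Generate the plan for a hypothetical query
    EXPLAIN PLAN FOR
    SELECT o.order_id, o.order_total
    FROM   orders o
    WHERE  o.customer_id = :cust_id;

    -- Read back the plan that was just generated
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);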

Happy reading.

SSIS is for ETL, ELT or EL… clamor continued September 3, 2013

Posted by msrviking in Business Intelligence, Data Integration, Design, Integration Services.

In yesterday's post I mentioned that we had to design and implement an SSIS solution that does only EL, and aired my gripe about addressing just the non-functional requirements. It really was only a process of extraction and loading – EL.

Today let me start off with the definitions of ETL and ELT, and then the new term EL, where the T is missing, or supposedly not to be considered here. While writing this post I realized I should put across a few notes on what ETL or ELT means, and the differences between these implementations. I could probably have listed the points in a neat comparison table, but blame it on my laziness that I am simply sharing these links instead. Trust me, I have read them, and I completely acknowledge the relevance of their technical content to ETL vs ELT.

http://www.dataacademy.com/files/ETL-vs-ELT-White-Paper.pdf

http://blog.performancearchitects.com/wp/2013/06/13/etl-vs-elt-whats-the-difference/

In the end, ELT means you have a powerful destination infrastructure that can handle the transformation using set-based operations: the data is landed in a staging area on the destination, transformed there, and pushed into the destination tables. In ETL, by contrast, the tool itself has the power to do the transformation, in parallel and in-memory on its own infrastructure, away from the destination system. After the data is transformed it is loaded into the databases for the business to consume.
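
To make the distinction concrete, here is a minimal, hypothetical sketch of the "T" being done inside the destination database with a single set-based statement, which is the ELT flavor; the table and column names are made up.

    -- Data has already been landed as-is into a staging table by the E and L steps.
    -- The transformation is one set-based statement run on the destination.
    INSERT INTO dbo.SalesFact (OrderID, OrderDate, NetAmount)
    SELECT s.order_id,
           CAST(s.order_date AS date),
           s.gross_amount - s.discount_amount    -- the "T", done in the database
    FROM   staging.Orders AS s
    WHERE  s.order_status = 'COMPLETE';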

So, with these principles in mind, we still had to implement EL instead of ETL or ELT. I personally didn't find it a good approach to implement, since the same thing could have been done with little development effort using CDC for Oracle in SQL Server, or Replication. However, I did find some finer pros in this approach, and I remember noting them in one of my earlier posts here. A few of them are:

  1. SSIS doesn't need any additional changes on the source end, except for configuring a connector to pull the data.

    Replication from Oracle, on the other hand, has to create temporary tables and triggers in the source schema, which is an overhead as well as a change to the source schema. That is definitely not acceptable to any business or system owner. CDC for Oracle in SQL Server helped to a large extent with this problem, but enabling certain properties on the source instance would be an overhead from a security standpoint.

  2. SSIS is essentially about flow control of data, and we could sync data in a pre-set precedence of steps. For example, the master tables' data should be ported first, and then the transactional tables.
  3. SSIS could be configured to pull data from the source based on business criteria, by using parameters in the WHERE clauses (see the sketch after this list).
  4. SSIS gives finer control over error logging and handling, and of course we could have retry packages to pull in data after a failure.
  5. SSIS also gives me good control over defensive programming, so that all the business data is in place.
  6. SSIS could be tweaked and tuned for optimal performance from extraction through loading.
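
For point 3 above, here is a hypothetical sketch of what such a source query could look like; in the package the bind variables would be mapped to SSIS parameters or variables, and the table and column names are made up.

    -- Pull one "chunk" from the Oracle source, bounded by package-supplied dates
    SELECT t.txn_id, t.txn_date, t.amount, t.customer_id
    FROM   app_schema.transactions t
    WHERE  t.txn_date >= :window_start
    AND    t.txn_date <  :window_end;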

Overall, we have control over what happens while the data is extracted and loaded. Now I am realizing that all of these are perfectly good design guidelines for any ETL or ELT, and how I wish it hadn't been just an EL process, huh.

Cheers!

SSIS for Extraction & Loading (EL) only September 2, 2013

Posted by msrviking in Business Intelligence, Data Integration, DBA Rant, Design, Heterogeneous, Integration, Integration Services.

There was a series of posts earlier on the different ways to pull data from Oracle to SQL Server, where I covered porting data through Replication, CDC for Oracle, and a few other options such as SSIS. I thought I would give a starter to a series of posts on how we went about picking SSIS as the way to bring data from Oracle to SQL Server.

The major drivers (business and non-functional) for picking the solution were:

  • The destination database on SQL Server should be a replica of the source
  • No business rules or transformations need to be applied when the data is pulled
  • The performance of the pull and load should be optimal, with no overheads or bottlenecks on either the source or the destination

The destination system was referred to as an ODS, but it surely isn't the same as the Operational Data Store of DWH systems. Sadly, that naming convention had already been adopted by the shop where we had to implement the solution, so you will see me using the word ODS for the sake of explaining the implementation; I am sorry for propagating the incorrect usage. You will probably see a few more conventions and standards being skewed in this implementation, along with a bit of my commentary on how it could have been handled better.

So the ODS was to hold data that would be used by downstream business analytics applications for the business users, with the intent of providing self-service BI. The ODS was to be an exact copy of the source, in both schema and data. The job was to pull data from the source without any performance overhead or failures. Not to forget, there needn't be any transformation of the data. In the end we use SSIS only as an Extraction & Loading tool instead of ETL – Extraction, Transformation and Loading.

This sounds simple, eh? But beware, it wasn't that simple, because the design had to be kept simple while addressing all of the above factors. The SSIS package(s) had to handle the following at a non-functional level:

  • The package should be configurable to pick up and pull data from any set date (see the watermark sketch after this list)
  • The package should be re-runnable from the point of failure
  • The package should have the configurability to address performance needs – package execution time should be short, the source should not be overloaded, and the destination should be working at full efficiency while the data is loading
  • The package should be configurable to pull data from the source in preset chunks. The chunks could be on a period basis – days/months/years – or a number of records per period
  • The package should have an option to flag any dependent packages to run or not run during initial and incremental loads
  • The package should have a defensive design to handle a bunch of different types of errors
  • The package should have error logging and handling at the package level and at the record level
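
One way (certainly not the only way) to get the "configurable start date" and "restart from the point of failure" behavior is a small watermark table on the destination; the sketch below is illustrative T-SQL with made-up names.

    -- Control table that records the last successfully loaded window per source table
    CREATE TABLE dbo.LoadControl
    (
        SourceTableName   sysname       NOT NULL PRIMARY KEY,
        LastLoadedThrough datetime2(0)  NOT NULL,
        LastRunStatus     varchar(20)   NOT NULL
    );

    -- The package reads the watermark at the start of a run...
    SELECT LastLoadedThrough
    FROM   dbo.LoadControl
    WHERE  SourceTableName = N'TRANSACTIONS';

    -- ...and advances it only after the chunk has loaded successfully, so a
    -- failed run simply restarts from the same watermark.
    DECLARE @WindowEnd datetime2(0) = '2013-09-01';   -- supplied by the package in practice
    UPDATE dbo.LoadControl
    SET    LastLoadedThrough = @WindowEnd,
           LastRunStatus     = 'SUCCESS'
    WHERE  SourceTableName   = N'TRANSACTIONS';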

In toto, this solution was more about how we design for the non-functional requirements and implement them, leaving aside the functional side of the data. This type of implementation is half-blind, and I will talk more about the cons we had to face when I get into the details of each implementation.

A post after a few weeks' gap, and one with learnings and summaries of how it could have been done better.

CDIs-Non Functional Requirements-Other few May 21, 2013

Posted by msrviking in Architecture, Business Intelligence, Data Integration, Design, Integration, Security.

In the CDIs-Non Functional Requirements series I have covered NFRs like Performance, Scalability, Availability, Reliability, and Utilization Benchmarks that are useful when building a CDI system. In this post I shall talk about lesser known but important NFRs: Maintainability, Configurability and Security. I wouldn't strictly term these NFRs, but questions on these topics will give you an insight into how the system is expected to behave from a business and technical perspective. Also, they don't work in silos; they link back to some of the NFRs mentioned above.

Maintainability

  1. What is the maintenance/release cycle?

    This question gives us an idea of the practices followed by the business and technical teams around the release cycle. Each release cycle typically means a change in the source system which may ripple down to downstream applications. The more frequent the release cycles, the harder the job of maintaining the code base. To avoid long-term overheads, the system, the data model and the low-level design of the ETLs should be built on the assumption that such changes will be constant and frequent.

  2. How frequently do source or target structures change? What is the probability of such change?

    This point is related to the first question, but elicits information one level deeper by asking, "if there are more maintenance cycles, what are those changes at the source, and what are they expected to be at the target?". If the changes are constant, frequent and not very complex, then the data model and the ETLs have to be configurable enough to accommodate 'certain' changes in the source. That configurability comes with a rider and a trade-off against other NFRs like performance: changes in the data source can affect the performance of the ETL, and sometimes the business SLA can no longer be met.

Having said this, the next NFR is closely related to Maintainability.

Configurability

The answers to the questions under this topic tend to be challenging for both the business and technical teams. Not everyone is sure what should and shouldn't be configurable, based on the changes expected from the business at the source system level. You will get answers like "not sure", "may be", "quite possible in the near future", or "probably" the source will change, while what exactly will change remains an open question. Providing an appropriate solution at the different layers then becomes a daunting task for the technical team.

  1. How many different/divergent sources (different source formats) are expected to be supported?

    The answer to this question helps in understanding which source formats (csv, tsv, xml, etc.) have to be supported; if they differ widely, alternate design practices can be implemented on the target to provide extensibility across all formats.

  2. What kind of enhancements/changes to source formats are expected to come in?

    An answer to this point helps in deciding whether to use abstract transformations or reusable mappings.

  3. What is the probability of getting new sources added? How frequently does the source format change? How divergent will the new source formats be?

    This information helps in knowing how often the source formats change, and whether that is with the existing sources or with new ones. Again, it helps in deciding between abstract transformations and reusable mappings.

The last NFR is Security, which is usually the least preferred in any system architecture and design, but the most important.

Security

In the case of CDI we are dealing with sensitive customer information and transaction details. It is important to understand how the business treats this type of data, and how the security and compliance teams want to handle data that is gathered from different source systems and consolidated in a single place – the "data hub". The questions below cover the data-protection level rather than access levels or the privileges of different users.

  1. Are there any special security requirements (such as data encryption or privacy) applicable?

    The answer to this question is usually "no", but there are certain fields brought in from CRM and ERP systems that need to be protected from misuse in case of a security breach. It is best to walk through this question with a real scenario, and then take a decision on whether to enable data or database encryption (see the sketch after these questions).

  2. Are there any logging and auditing requirements?

    This is rarely required, since the data from the different systems is mostly massaged and made available in reporting form through a different data sink. A discussion here helps in deciding whether security should be handled at the reporting level (by enabling the different security features there) rather than inside the heavy data-processing jobs.
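
For the encryption question above, and only if the data hub happens to be on SQL Server, the usual starting point for database-level protection is Transparent Data Encryption; a minimal sketch follows, with placeholder database, key and certificate names.

    -- Server-level key and certificate that protect the database encryption key
    USE master;
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<use a strong password>';
    CREATE CERTIFICATE TDECert WITH SUBJECT = 'TDE certificate for the data hub';

    -- Encrypt the (hypothetical) data hub database
    USE DataHub;
    CREATE DATABASE ENCRYPTION KEY
        WITH ALGORITHM = AES_256
        ENCRYPTION BY SERVER CERTIFICATE TDECert;
    ALTER DATABASE DataHub SET ENCRYPTION ON;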

I hope all these posts on NFRs for CDIs help you in architecting and designing a data hub system that is highly available, scalable, high performing and reliable.

Cheers!

Oracle to SQL Server Replication-Design considerations May 21, 2013

Posted by msrviking in Configuration, Data Integration, Design, Heterogeneous, Integration, Replication.

I think I have covered all the best points for configuring transactional replication between Oracle and SQL Server in my previous posts. In this post I would like to give my 2 cents on the design considerations and limitations of Oracle publishers that I had to keep in mind during this exercise. There is a neat list of other points published in this Technet article.

  1. Configuration of Oracle Publisher with options

    Per the above article, Oracle publishing can be done using either the Oracle Gateway or the Oracle Complete option. The main difference between the two is that Oracle Gateway provides better performance but doesn't allow multiple transactional publications of a single published (or to-be-published) table, whereas with the Oracle Complete option the table can be included in several publications, at the cost of lower performance.

    I tried digging around on Google to see how these two options work as described, because I didn't try it on the setup I built to find the difference. I didn't find any reliable sources that discuss the difference in detail, but one of the posts by Richard Douglas (Editor in Chief of SQLServerPedia) gives a short description: "Gateway is for high performance replication solutions. Complete is when you want to transform your data or to do row filtering". I might have to try it out to see what that means in practice.

    I remember seeing row filtering in a transactional replication set up with the Oracle Gateway option, and I am yet to find out how this is different when Oracle Complete is chosen. For now I shall park this limitation for future trials and share my findings.

  2. This point (only a few objects are supported) is quite obvious, because the similarities between Oracle and SQL Server at the object-implementation level are few. So,

    What are the objects that are supported?

    1. Tables
    2. Index-organized tables
    3. Indexes
    4. Materialized views (as tables)

    What are those objects that cannot be replicated, but can exist on published tables?

    1. Domain-based indexes
    2. Function-based indexes
    3. Defaults
    4. Check constraints
    5. Foreign keys
    6. Storage options (tablespaces, clusters, etc.)

    What are the objects that cannot be replicated at all?

    1. Nested tables
    2. Views
    3. Packages, package bodies, procedures, and triggers
    4. Queues
    5. Sequences
    6. Synonyms

    Okay, with this list we know what can and cannot be replicated. I picked the list from the above post, but I was hunting for the reason why these objects cannot be replicated. The simple reason that keeps coming to my mind is that some of the objects in the 'cannot' list have no equivalents in SQL Server. Okay, so aren't there workarounds for these objects? For example, packages are bundles of programmed objects – package variables, cursors and a few others – which have to be converted through a specific process, and that process cannot be Replication. That exercise would be a migration of objects from Oracle to SQL Server, and surely cannot be part of a replication topology. Does this apply to simpler objects too? I don't have a technical explanation for that either, so I am concluding with my 2 cents; meanwhile I shall see if I can find better "reasons".

    Here is a link from Microsoft on what it means to convert some of the objects listed above. This was one of the things that helped me conclude, logically, that this is part of a migration from Oracle to SQL Server and cannot be part of Replication at a conceptual or architectural level (the sketch below illustrates this with a sequence).
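
To illustrate why these objects are a migration concern rather than a replication one, take a sequence: it cannot ride along in the replication topology, but its behavior can be recreated on the SQL Server side (SQL Server 2012 and later) as part of a conversion. The names below are made up.

    -- Oracle side (what exists at the source):
    --   CREATE SEQUENCE order_seq START WITH 1 INCREMENT BY 1;

    -- SQL Server side (what a conversion might produce):
    CREATE SEQUENCE dbo.OrderSeq
        AS bigint
        START WITH 1
        INCREMENT BY 1;

    -- Consumed the same way the Oracle sequence would be, e.g. in an insert or default
    SELECT NEXT VALUE FOR dbo.OrderSeq AS NextOrderId;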

Since my setup was only to test the new Oracle Publisher I didn't have to keep tabs on the other design considerations, but I am sure I will have to deal with them too when this moves into full deployment mode rather than a POC.

I hope this short post helps you folks out there, and happy reading.

The usual ask – please feel free to share your thoughts or comments.

Cheers!