
SQL Server 2012 CDC for Oracle July 5, 2013

Posted by msrviking in Architecture, Change Data Capture, Configuration, Data Integration, Heterogeneous, Integration, Integration Services, Oracle.

Sometime in June 2013, I had to work on an alternate option to Replication to replicate data from Oracle to SQL Server 2012. The primary requirements were

  1. There should not be any performance impact on the Oracle instance, and the CDC should work seamlessly with a RAC setup having multiple log file locations, reading DML changes irrespective of where the transaction happens (on different nodes of the Oracle RAC)
  2. There should not be any additional objects (temporary tables, triggers) created in the Oracle instance, unlike what happens in a Replication setup
  3. The DDL changes on the source should be manageable with appropriate configuration

Keeping all this in mind I brought up CDC for Oracle in SQL Server 2012 as the next best option and built a POC. As a result of the POC I had documented all the steps, or rather the implementation picture of the POC, for my future reference. Today I decided to share it with you all so that you too could benefit from the implementation.

So here go the excerpts from the document. Sorry, I couldn’t convert the document into blog style, though.

Cheers and Enjoy.

Change Data Capture (CDC)

Overview

Change Data Capture (CDC) is a way of tracking and capturing changed data on a real-time basis.

Data changes caused by DML operations (insert, update, delete) are logged into the log files of the source database. These log files serve as input to the capture process, which runs asynchronously. The capture process reads the log files and adds the change information to the tracking tables that are associated with the respective source tables.

The below diagram (as represented in MSDN) summarizes the process of change data capture.


The data captured in the tracking tables is consumed by SQL Server Integration Services to refresh the end business tables in an Operational Data Store (ODS).

Internals of SQL Server 2012 CDC for Oracle

Change data capture reads data changes from the data sources and synchronizes them incrementally from source to target tables, in near real-time.

The Change Data Capture for Oracle setup has these components

Oracle Database Instance

The data changes happening on the Oracle source tables are captured into redo log files, and these redo log files are archived for reading by Oracle LogMiner.

The Oracle LogMiner can read the archived redo log files only if, on the Oracle database (a hedged sketch of both settings follows this list):

  1. Archive log mode is enabled
  2. Supplemental logging is enabled
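
As a rough sketch of those two pre-requisites, the statements below are what I used during the POC, run from SQL*Plus as a SYSDBA user; the exact steps depend on your Oracle version and privileges, so verify them against your own environment.

    -- Check whether the database is already running in archive log mode
    SELECT log_mode FROM v$database;                    -- expect ARCHIVELOG

    -- Enable archive log mode (needs the database mounted, not open)
    SHUTDOWN IMMEDIATE;
    STARTUP MOUNT;
    ALTER DATABASE ARCHIVELOG;
    ALTER DATABASE OPEN;

    -- Enable minimal database-level supplemental logging
    ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;

    -- Verify supplemental logging is on
    SELECT supplemental_log_data_min FROM v$database;   -- expect YES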

CDC Service & CDC for Oracle Designer

SQL Server 2012 has a new feature called CDC for Oracle, available as part of Service Pack 1 (SP1). The feature comes with two installables shipped with SP1, which install the CDC Service Configuration and the Oracle CDC (Designer) Configuration utilities.

Console for CDC Service

A pre-requisite before creating a CDC service is to have a system database called MSXDBCDC created on the SQL Server instance. The database is created by executing the step Prepare SQL Server through the CDC service configuration console. The database consists of tables that store the configured databases, services, and trace (diagnostics) information.
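
For a quick sanity check after running Prepare SQL Server, it helps to peek into MSXDBCDC. The table names below (dbo.xdbcdc_databases, dbo.xdbcdc_services, dbo.xdbcdc_trace) are the ones I saw in the POC deployment; verify them against your own installation before relying on them.

    USE MSXDBCDC;

    -- Databases that have been registered as CDC enabled databases
    SELECT * FROM dbo.xdbcdc_databases;

    -- CDC services configured for this SQL Server instance
    SELECT * FROM dbo.xdbcdc_services;

    -- Diagnostic / trace rows written by the CDC service
    SELECT TOP (50) * FROM dbo.xdbcdc_trace;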

All CDC operations use the CDC service to synchronize data, and the service is created as a Windows service. The CDC Service management console is used to configure the CDC service; a CDC service can be configured and run on any server, including one other than the ODS or Integration Services (IS) systems. Also, more than one CDC service can run on a dedicated system.

Designer console for CDC for Oracle

A CDC for Oracle instance is configured under a CDC service, and multiple CDC instances can be configured against an Oracle database using the CDC designer console.

A CDC instance for Oracle

  1. Has a connection string to the Oracle instance. The connection string is configured against the SCAN name or a VIP name
  2. Has multiple source tables (with their respective primary keys) and with selected capture columns. Each source table with capture columns has a corresponding supplemental logging script which is executed on the Oracle instance (see the sketch after this list). The supplemental logging on the table captures the data changes on those columns into the redo log files
  3. Creates a CDC enabled database on SQL Server
  4. Has an option of enabling advanced properties like cdc_stop_on_breaking_schema_changes or trace. These properties help in managing the instance behavior upon a schema change in the source table, and in getting additional diagnostics for troubleshooting issues
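
The designer generates the supplemental logging script for each selected table. As a hedged illustration of what such a script typically contains, the statement below enables table-level supplemental logging for the sample DEPT table; the schema and table names are placeholders from the POC, and the generated script may use a named log group instead.

    -- Log before-images of all columns of CDC_POC.DEPT into the redo log,
    -- so LogMiner can reconstruct the change records for the CDC instance
    ALTER TABLE CDC_POC.DEPT ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;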

SQL Server

    CDC enabled database

A CDC instance creates a CDC enabled database, and each such database has mirror tables and capture instances (tables) representing each source table, plus system tables that provide information on the change capture objects, columns, DDL history, and the health of the CDC instance.

The CDC enabled database is by default created with a case-sensitive collation, which can be changed by modifying the database deployment script generated by the CDC instance designer.
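
If you do edit the generated deployment script, the change usually amounts to adjusting the COLLATE clause on the CREATE DATABASE statement. A minimal sketch, with a hypothetical database name chosen purely for illustration; verify the implications for the CDC instance before changing the collation.

    -- Deployment script excerpt: the collation of the CDC enabled database is set here
    CREATE DATABASE CDC_POC_DB
    COLLATE SQL_Latin1_General_CP1_CS_AS;   -- default is a case-sensitive (_CS_) collation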

Mirror tables

Each source table in Oracle has a corresponding mirror table in the CDC enabled database in SQL Server. The mirror table is a like-for-like copy of the source table, except that the data types and lengths are mapped to their SQL Server equivalents. Any schema change on the source table is verified by the CDC process by checking against the schema of the mirror table.

The mirror tables are always empty and should not be used as business tables. A mirror table is identified with a specific schema owner and is not created under the default dbo schema.

     A sample name of mirror table is CDC_POC.CDC_POC_DEPT.

Capture instances

Each source table has a capture instance, and the capture instance is created as a system table. The CDC process populates the capture instances with the data that has changed in the source. These tables have the columns of the source (mirror) table, plus columns that store the start and end Log Sequence Numbers (LSN), a sequence value, the type of operation, and an update mask.

Here is the sample structure of a capture instance

Column Name – Purpose of the column

__$start_lsn – LSN associated with the commit transaction of the change.

__$seqval – Sequence or order of the rows that the transaction has affected.

__$operation – The type of DML operation: 1 – Delete, 2 – Insert, 3 – Update (old value), 4 – Update (new value).

__$update_mask – Bit mask of the column ordinals.
The capture instance always has cdc as its schema identifier, and its name ends with _CT.

A sample name looks like cdc.CDC_POC_DEPT_CT.

The capture instance holds information on the DML changes that are committed on the source table. This table provides inputs in the form of changes (the data that changed, the columns whose values changed, and the type of change) for the SSIS packages to refresh the business tables.

The capture instance can grow in size depending on the volume of changes captured from the source. This table could be purged on a regular basis in order to keep the table and database size within standards.
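
To see what a capture instance holds, a plain query against the _CT table is enough. The database and table names below are the samples used in this POC; the ordering columns are the standard CDC metadata columns described above.

    USE CDC_POC_DB;   -- hypothetical name of the CDC enabled database

    -- Changed rows in commit order; __$operation: 1 = delete, 2 = insert,
    -- 3 = update (old value), 4 = update (new value)
    SELECT *
    FROM cdc.CDC_POC_DEPT_CT
    ORDER BY __$start_lsn, __$seqval, __$operation;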

Business tables

The business tables contain the data that is used by the business applications. A business table can follow any standard naming convention and can be created in any other database; it need not be in the CDC enabled database. That database can even be hosted on another SQL Server instance.

The schema structure of the business table could be the same as the source or mirror table, but the data types could vary. If the data types differ, the SSIS packages should include the appropriate data type conversions.

The business table is initially populated through SSIS as a full load, and then incremental loads are applied to refresh the data. The incremental loads fetch the data changes from the capture instance tables and perform insert or merge operations on the end tables.
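
As a rough T-SQL sketch of what the incremental apply boils down to (the SSIS package does the equivalent with the CDC Splitter and the OLE DB components), the statement below merges the latest captured values into a hypothetical DEPT business table. The column names and the simplistic change filtering are assumptions for illustration only; a real package processes the LSN range tracked in cdc_state and handles multiple changes per key.

    -- Apply inserts and updated values from the capture instance to the business table.
    -- Assumes DEPTNO is the key; __$operation 2 = insert, 4 = update (new values).
    MERGE dbo.DEPT AS tgt
    USING (
        SELECT DEPTNO, DNAME, LOC
        FROM cdc.CDC_POC_DEPT_CT
        WHERE __$operation IN (2, 4)
    ) AS src
        ON tgt.DEPTNO = src.DEPTNO
    WHEN MATCHED THEN
        UPDATE SET tgt.DNAME = src.DNAME, tgt.LOC = src.LOC
    WHEN NOT MATCHED THEN
        INSERT (DEPTNO, DNAME, LOC) VALUES (src.DEPTNO, src.DNAME, src.LOC);

    -- Deletes (__$operation = 1) would be applied through a separate DELETE statement.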

Overall process

    The overall CDC process is summarized in the below diagram.


The CDC service fetches the data changes from the Oracle source tables. It connects to the Oracle database instance, invokes Oracle LogMiner, and requests the data changes that have been committed on a source table. LogMiner reads the archived redo logs and returns the changes for the columns that are enabled with supplemental logging.

The CDC service then populates the changed data into the capture instances, along with the log sequence number, the data that changed in each column, and the type of operation.

The CDC service uses the CDC instance information to connect to an Oracle instance and to fetch the data changes in a particular source table for the specified columns.

Implementation Scenarios

Data Synchronization – Connecting to Load Balancer

The below diagram summarizes a scenario where the CDC for Oracle instance is configured against Oracle RAC through the SCAN IP name.


Oracle Cluster

The Oracle instance is in a RAC configuration, where Node 1 (xxxxx01.xxxx.net) and Node 2 (xxxxx02.xxxx.net) are load balanced and are connected to by using a SCAN IP. The two nodes share the same storage area, and the redo log files generated by these nodes are archived and stored in file locations in the storage area. There are multiple file locations, and each location is created based on the day, month and year. Each folder could have multiple archived redo log files.

SQL Server

The SQL Server system hosts the CDC service and the CDC Designer for Oracle tools.

The CDC service is created using the CDC service management console and runs as a Windows service using the Local System account.

The CDC for Oracle instance is created using the CDC designer console. The CDC instance connects to the Oracle instance using the SCAN name – xxx-rac01-scan.xxxx.net. The CDC process reads the archived logs using LogMiner and writes to the CDC enabled DB. The data changes that are captured are written into the capture instances in SQL Server.

The mirror tables are empty and are never populated, either during the initial full load or during the data refreshes through incremental loads.

    ODS

The business tables are created in the operational data store, and each table is populated with a full load from the source for the first time. The data changes captured through the CDC process are applied to these tables through incremental loads.

    SSIS

SQL Server Integration Services (SSIS) runs on a different system. The purpose of SSIS is to perform the initial full load into the business tables and, once CDC synchronization is set up, to run the data refreshes on the business tables in the ODS.

There are two packages, which perform the full load and the incremental load. The below summary lists the components used, the purpose of each component, and the overall process executed by each package.

Initial Load package

Purpose of the package: The package is executed only once, as a full load, when CDC is set up or re-configured. It deletes the data in the destination business table and populates the latest data from the Oracle source table using the Oracle Connector. After the data is populated from Oracle into the SQL Server business table in the ODS, the system table cdc_state is updated with an LSN value that is referenced as the next starting point for reading the data changes from the capture instances.

Components used:

Execute SQL Task – Deletes the data from the business table before a full load.

Data Flow Task – Populates data from the Oracle source table into the business table using the Oracle connector from Attunity.

CDC Control Task – Sets an LSN value in the cdc_state table (a system table created when the package is executed) as a bookmark. This value will be referenced by the incremental package for reading changed data.

Incremental Load package

Purpose of the package: The package is executed after the first initial full load. It can be scheduled to fetch the data changes and refresh the end business tables in the ODS. The CDC Source component marks the processing range of the data based on the last saved LSN value. New rows are inserted through the OLE DB destination, while updates and deletes are applied through the OLE DB Command. After the data refresh on the destination table is complete, the process is marked as completed and the latest LSN value is saved into cdc_state.

Components used:

CDC Control Task – Reads the last saved LSN value in cdc_state and marks the start of fetching data from the capture instance.

Data Flow Task – Reads the capture instance for All or Net data changes using the CDC Source component, starting from the last LSN value saved in cdc_state. The CDC Splitter component splits the data read from the CDC Source into three different operations – Insert, Update and Delete – and directs the rows based on the DML operation: inserts to the OLE DB destination, updates and deletes to the OLE DB Command.

CDC Control Task – Marks the end of the read, and records the state of reading and populating data into the business tables.


Data Synchronization – Connecting to RAC Nodes

The below diagram summarizes a scenario where the CDC for Oracle instances are configured against the Oracle RAC nodes through the VIP names (individual nodes).


This section describes the approach adopted to verify that SQL Server 2012 CDC for Oracle works with individual Oracle RAC nodes and with multiple redo log files in different locations.

The first and second diagrams above show that the Oracle CDC instances are connected to Node 1 (xxxxx01.xxxx.net) through the VIP name xxxxx01.xxxx.net:1521/orcl and to Node 2 (xxxxx02.xxxx.net) through the VIP name xxxxx02.xxxx.net:1521/orcl, respectively. CDC for Oracle reads the changed data that is committed independently on each node.

As part of the verification, the following steps were executed (a hedged SQL sketch of these steps follows the list):

On Node 1 or Node 2:

  1. Logged on to the Oracle nodes using the connection string uname/pwd@ xxxxx01.xxxx.net:1521/orcl or uname/pwd@ xxxxx02.xxxx.net:1521/orcl
  2. Ran a SELECT query against a sample table under the same schema and captured the existing data in the table for later validation
  3. Ran DML (UPDATE, DELETE, INSERT) statements on each node, and committed the transactions
  4. Validated the DML changes done on each node by running SELECT statements
  5. Verified that the CDC instances were capturing the changed data by running SELECT statements on the capture instances
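
For reference, this is roughly what those steps looked like in SQL. The table, columns and connection details are placeholders from the POC, so treat this as an illustrative sketch rather than the exact script that was run.

    -- On Node 1 (connected as uname@xxxxx01.xxxx.net:1521/orcl) or Node 2
    SELECT * FROM CDC_POC.DEPT;                              -- capture the existing data

    UPDATE CDC_POC.DEPT SET LOC = 'BLR' WHERE DEPTNO = 10;   -- sample DML on this node
    INSERT INTO CDC_POC.DEPT (DEPTNO, DNAME, LOC) VALUES (50, 'POC', 'HYD');
    DELETE FROM CDC_POC.DEPT WHERE DEPTNO = 40;
    COMMIT;                                                  -- changes are logged on this node

    SELECT * FROM CDC_POC.DEPT;                              -- validate the DML on the node

    -- On SQL Server: verify the CDC instance captured the committed changes
    SELECT * FROM cdc.CDC_POC_DEPT_CT ORDER BY __$start_lsn, __$seqval;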

The result after these steps was that the CDC instances (configured against each node) were reading the DML changes done on each node, and the changes were captured in the capture instances on SQL Server.

The above result shows that CDC for Oracle works in a RAC environment, where the data is read irrespective of the node on which the transactions are committed. It also confirms that CDC for Oracle instances capture data changes that are logged into multiple archived redo log files in multiple locations.

The key to the above conclusion is that the CDC process fetches the data changes by calling Oracle LogMiner, which in turn reads the multiple archived redo log files in different locations.

The captured data in the capture instances is refreshed onto the business tables using the incremental load package process described in the SSIS sub-section under SQL Server, above.

Schema changes

This section explains the DDL statements that impact the Oracle CDC instances, and the steps to apply the changes on the destination database.

SQL Server 2012 CDC for Oracle does not propagate schema changes on Oracle tables to SQL Server. However, an advanced property named cdc_stop_on_breaking_schema_changes helps in capturing either accurate and consistent data or the latest data. This property can be set for each CDC instance under the Advanced Properties tab.

The values that can be set for this property are 0 and 1 (a sketch of where the property is stored follows this list). In the event of a schema change on the Oracle source table, when cdc_stop_on_breaking_schema_changes is

  1. Set to 1, data capture is stopped; this helps in capturing consistent data, with exceptions in a few scenarios
  2. Set to 0, the corresponding capture instance in SQL Server is dropped; this helps in capturing the latest data
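
Out of curiosity I also checked where the designer persists these advanced properties; in the POC they showed up as rows in a configuration table inside the CDC enabled database. The table name below (cdc.xdbcdc_config) is what I observed and is stated here as an assumption, so verify it in your own deployment before relying on it.

    USE CDC_POC_DB;   -- hypothetical CDC enabled database name

    -- Advanced properties configured for this CDC instance,
    -- e.g. cdc_stop_on_breaking_schema_changes and trace
    SELECT * FROM cdc.xdbcdc_config;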

If schema changes are required on the source, and if these changes should be propagated to the capture instance in SQL Server and to the business table, then a series of steps has to be followed.

The below table summarizes the different DDL statements, their impact on the CDC instance capture process, and the way to mitigate each schema change.

Type of Schema Change: Adding a new column to the source table

Impact: If cdc_stop_on_breaking_schema_changes is set to 0, the capture instance table is dropped. If it is set to 1, CDC will continue to capture data for the existing columns, ignoring the new column.

Action steps (when the property is set to 1 and the schema change has to be applied):

1. Back up the existing CT table and the cdc_state table
2. Drop the source table from the CDC instance
3. Add the altered source table back to the CDC instance
4. Import the historical data from the backed-up CT table into the new capture instance
5. Modify the SSIS package to capture the new column for the initial and incremental loads from the last LSN

Type of Schema Change: Dropping an existing column from the source table

Impact: If cdc_stop_on_breaking_schema_changes is set to 0, the capture instance table is dropped. If it is set to 1, CDC will stop data capture with an error.

Note: A column can be dropped on the source table only after the supplemental logging on that table is dropped.

Action steps (when the property is set to 1 and the schema change has to be applied):

1. Drop the supplemental logging on the source table
2. Drop the column on the source table
3. Back up the existing CT table and the cdc_state table
4. Drop the source table from the CDC instance
5. Add the altered source table back to the CDC instance
6. Import the historical data from the backed-up CT table into the new capture instance
7. Modify the SSIS package to reflect the dropped column for the initial and incremental loads from the last LSN

Type of Schema Change: Renaming an existing column on the source table

Impact: If cdc_stop_on_breaking_schema_changes is set to 0, the capture instance table is dropped. If it is set to 1, CDC will stop data capture with an error.

Action steps (when the property is set to 1 and the schema change has to be applied):

1. Back up the existing CT table and the cdc_state table
2. Drop the source table from the CDC instance
3. Add the altered source table back to the CDC instance
4. Import the historical data from the backed-up CT table into the new capture instance
5. Modify the SSIS package to capture the renamed column for the initial and incremental loads from the last LSN

Type of Schema Change: Change in data type or data type length on the source table

Impact: If cdc_stop_on_breaking_schema_changes is set to 0 and the data type is changed (for example from int to varchar), the capture instance table is dropped. If it is set to 1, then:

1. If the data type change is, for example, from varchar to int, a NULL value is captured although the different row operation states (3 and 4) are logged
2. If only the data type length is changed, only row operation state 3 is captured

Action steps (when the property is set to 1 and the schema change has to be applied):

1. Back up the existing CT table and the cdc_state table
2. Drop the source table from the CDC instance
3. Add the altered source table back to the CDC instance
4. Import the historical data from the backed-up CT table into the new capture instance
5. Modify the SSIS package to handle the changed data type for the initial and incremental loads from the last LSN
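
Step 1 in each of the action sequences above (backing up the existing CT table and the cdc_state table) can be as simple as copying the rows into archive tables before the CDC instance is reconfigured. A minimal sketch with hypothetical backup table names; cdc_state lives wherever the SSIS CDC Control Task created it, which may be the ODS database rather than the CDC enabled database.

    USE CDC_POC_DB;   -- hypothetical CDC enabled database name

    -- Preserve the already captured changes before the table is dropped from the CDC instance
    SELECT * INTO dbo.CDC_POC_DEPT_CT_backup FROM cdc.CDC_POC_DEPT_CT;

    -- Preserve the CDC state (last processed LSN bookmark) used by the SSIS packages
    SELECT * INTO dbo.cdc_state_backup FROM dbo.cdc_state;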

A schema change on the source table should be applied to the business tables through a series of steps and a defined procedure, to keep the captured data accurate and consistent.

CDIs-Non Functional Requirements-Other few May 21, 2013

Posted by msrviking in Architecture, Business Intelligence, Data Integration, Design, Integration, Security.

In the series of CDIs-Non Functional Requirements I had covered NFRs like Performance, Scalability, Availability, Reliability, and Utilization Benchmarks that could be useful to build a CDI system. In this post I shall talk about less-known but important NFRs like Maintainability, Extensibility and Security. I wouldn’t strictly term these as NFRs, but questions on these topics will help you get an insight into how the system is expected to behave, from a business and technical perspective. Also, these don’t work in silos; instead they link back to some of the ones mentioned in the above list.

Maintainability

  1. What is the maintenance/release cycle?

    This question will give us an idea of the practices that are followed by the business and technical teams around the release cycle. Each release cycle would mean a change in the source system, which may be applicable to down-stream applications. The more release cycles there are, the more difficult the job of maintaining the code base. To avoid long-term overheads, the system, the data model and the low-level design of the ETLs should be carefully built considering that these changes would be constant and frequent.

  2. How frequently do source or target structures change? What is the probability of such change?

    This point is related to the first question, but elicits information one level deeper by asking “if there are more maintenance cycles, what are those changes at the source and what are they expected to be in the target”. If the changes are constant, frequent and less complex, then the data model and the ETLs have to be configurable to accommodate ‘certain’ changes in the source. The configurability comes with a rider and a tradeoff on other NFRs like performance. The changes on the data source could affect the performance of the ETL, and sometimes the business-laid SLA can’t be met.

Now having said this, I presume the next NFR will be closely related with Maintainability.

Configurability

The answers to queries under this topic are likely to be challenging for the business and technical teams. Not everyone is sure of what should be configurable and what shouldn’t be, based on the changes that are expected from the business at the source system level. One gets answers like “not sure”, “may be”, “quite possible in the near future”, “probably” the source will change, while what the change will be remains a question. Providing an appropriate solution at different layers then becomes a daunting task for the technical team.

  1. How many different/divergent sources (different source formats) are expected to be supported?

    The answer to this question will help in understanding which source formats (csv, tsv, xml, etc.) have to be supported; if they differ widely, alternate design practices could be implemented on the target to provide extensibility to all formats.

  2. What kind of enhancements/changes to source formats are expected to come in?

    An answer to this point would help in deciding whether there should be abstract transformations or reusable mappings.

  3. What is the probability of getting new sources added? How frequently does the source format change? How divergent will the new source formats be?

    This information will help in knowing how often the source formats change, and whether that is with existing sources or with new ones. Again, it would also help in deciding between abstract transformations and reusable mappings.

The last NFR is Security, which is usually the last one considered in any system architecture and design, but is most important.

Security

In the case of CDI we are dealing with sensitive customer information and transaction details. It is important to understand how the business treats this type of data, and how the security compliance team views data being gathered from different source systems and consolidated at a single place – the “Data hub”. The below questions cover the data protection level rather than access levels or privileges of different users.

  1. Are there any special security requirements (such as data encryption or privacy) applicable?

    The answer to this question would usually be “no”, but there are certain fields brought in from CRM and ERP systems that need to be protected from misuse in case of a security breach. It is suggested that this question be explained well with a real scenario, after which a decision on whether to enable data or database encryption can be taken.

  2. Are there any logging and auditing requirements?

    This is rarely required, since the data from different systems is mostly massaged and made available in reporting format through a different data sink. A discussion here would help in deciding whether security should be handled at the reporting level (enabling different security features) rather than in the massive data processes.

I hope all these posts on NFRs for CDIs help you in architecting and designing a Data Hub system that is highly available, scalable, high performing, and reliable.

Cheers!

Oracle to SQL Server data synchronization–What are the other ways? May 17, 2013

Posted by msrviking in Architecture, Configuration, Data Integration, Heterogeneous, Integration, Integration Services, Replication.

In this series I have been talking about environment setup, configuring the box with appropriate drivers, setting up the Oracle Publisher, and then a short post on Transactional Replication. While doing this I had my share of issues in setting up and configuring the whole workflow, so a parallel thread was running in my head asking only one question – “Is this the only way to synchronize data from Oracle?”.

The only other option which looked viable was using SSIS, and with SSIS 2012 a lot more features have come in to help the situation. I haven’t worked on using SSIS – CDC for Oracle per the below links, but intend to do it sometime.

Here are a few links I found when I Googled around on this topic:

1. Microsoft SQL Server 2012 Change Data Capture for Oracle by Attunity Topics (http://social.technet.microsoft.com/wiki/contents/articles/7644.microsoft-sql-server-2012-change-data-capture-for-oracle-by-attunity-topics.aspx)

2. CDC for Oracle in SQL Server 2012 (http://blogs.msdn.com/b/mattm/archive/2012/03/26/cdc-for-oracle-in-sql-server-2012.aspx)

3. CDC in SSIS for SQL Server 2012 (http://www.mattmasson.com/2011/12/cdc-in-ssis-for-sql-server-2012-2/)

4. Installing Microsoft SQL Server 2012 Change Data Capture for Oracle by Attunity (http://social.technet.microsoft.com/wiki/contents/articles/7647.installing-microsoft-sql-server-2012-change-data-capture-for-oracle-by-attunity.aspx)

5. Change Data Capture (CDC) for Oracle video now available (http://www.mattmasson.com/2012/09/change-data-capture-cdc-for-oracle-video-now-available/)

6. SQL Server 2012 CDC for Oracle – a Review of One Implementation (http://blogs.msdn.com/b/dfurman/archive/2013/04/15/sql-server-2012-cdc-for-oracle-a-review-of-one-implementation.aspx)

7. Troubleshoot CDC Instance Errors in Microsoft Change Data Capture for Oracle by Attunity (http://social.technet.microsoft.com/wiki/contents/articles/7642.troubleshoot-cdc-instance-errors-in-microsoft-change-data-capture-for-oracle-by-attunity.aspx)

Although I say that SSIS is the best option, and it is yet to be tested, I wanted to list the differences between implementing with Replication and with SSIS. So I built a quick comparison table between Replication and SSIS, from the perspective of DDL and Performance to start with, and this is work-in-progress. As of now I am unable to judge the timeline I should set to complete this comparison with other factors included. Nevertheless, I am thinking this should help you folks too, just in case.

Feature Category: DDL / Schema Changes

Features: new column addition; change in mapped (source) column name; change in data type of the source column; drop of a column in the source table

SSIS: Yes (with customized coding and SMO)

Replication: No

Notes: SSIS – SMO-based programming can help in detecting changes at the source and in remapping at the destination. Replication – the subscriber has to be dropped /deleted, and the publisher should add the new table or regenerate the snapshot in the case of an Oracle publisher.

Feature Category: Performance

Feature: On the source table or system

SSIS: Impacted

Replication: Impacted

Notes: Replication – triggers are created by SQL Server replication on the published tables to track data changes at row level, and such triggers could be an overhead on the Oracle published tables. SSIS – the impact on the performance of the source table is comparatively lower than with replication, since no triggers are created for tracking the data changes; this could be achieved using CDC for Oracle with SSIS 2012.

Feature: On the destination table or system

SSIS: Impacted

Replication: Impacted

Notes: Replication – in the case of large data changes (inserts and updates), transactional replication could cause a performance bottleneck on the destination /subscriber tables because the DML statements cause lock escalation on the tables; another option for large data changes is to use snapshot replication. SSIS – the impact is lesser and can be controlled by batch operations.

Feature Category: Synchronization

Feature: Real-time

SSIS: Supported

Replication: Supported

Notes: Replication – transactional replication helps in achieving real-time data synchronization between Oracle and SQL Server; however, if the data changes are large then this could impact performance. There is very minimal latency of data synchronization. SSIS – with CDC for Oracle using SSIS 2012, near real-time synchronization is achievable without the trigger-related performance overhead, since no triggers are created on the Oracle tables, and the data changes don’t need any additional configuration, maintenance or setup because of the capability of CDC to create change tables in SQL Server. The data captured as part of the changes is applied to SQL Server through the CDC tables.
Please feel free to share your thoughts or comments. Happy reading.

Cheers!

CDIs-Non Functional Requirement-Utilization Benchmarks May 16, 2013

Posted by msrviking in Architecture, Business Intelligence, Data Integration, Integration.

In this post I shall share my thoughts on how these questions would help in designing a system that meets the response time and performance requirements. Some of these are about the history of the existing systems and a few are about the future requirements. Lots of English, eh? The straight message is that the answers to these questions will help me in gathering “baselines and benchmarks” of the past and future. And once I know them, I should have

  1. A system designed that performs with low latency (quick response time) for any operation
  2. Built an infrastructure (especially servers) that would meet future performance requirements

Let me group these questions based on purpose and tag with some key words, for easy understanding.

Q# – Question – Purpose and keywords

1. How many rows are created every day in transactional DBs? – To get current baselines on how many rows are generated from business transactions (row size, storage space, disk size, CPU size, computation power)

2. What is the size of data that is generated across LOBs? – To get current baselines on how many rows are generated from business transactions per line of business (row size, storage space, disk size)

3. What is the agreeable processing time of data in the data hub? – To understand future benchmarks of the data hub (CPU sizing)

4. How stale can the reporting data be? – To understand future benchmarks of the data hub (storage sizing)

5. How much of the processed data and data used for reporting has to be archived /purged? – To understand future benchmarks on how the archived data would be used in the reporting system (server sizing, storage sizing)

6. How many years of data have to be archived in the system? – To understand future benchmarks on how the archived data would be used in the reporting system (server sizing, storage sizing)

7. How many years of yester years’ data have to be processed? – To understand future benchmarks on how the archived data would be used in the reporting system (server sizing, storage sizing, CPU sizing, RAM sizing)

8. What will be the year-on-year growth of data in the transactional source system? – To understand current baselines and future benchmarks on the data that needs to be processed (server sizing, storage sizing, CPU sizing, RAM sizing)

9. What other sources of data could be added during a period of 1-2 years, and how much of these data sources could provide data to the data hub? – To understand current baselines and future benchmarks on the data that needs to be processed (server sizing, storage sizing, CPU sizing, RAM sizing)

10. What are the agreeable database system downtime hours? – To understand future benchmarks on how quickly the data processing should complete (CPU sizing, RAM sizing)

11. What is the possible backup and recovery time required for the Data Hub and Reporting system? – To understand how quickly the whole data hub and BI system should be back up and running (server sizing, storage sizing, CPU sizing, RAM sizing)

12. How many users will be added year on year for the reporting system, and what are the types of users? – To understand future benchmarks on how the reporting and data hub system resources will be utilized (server sizing, storage sizing, CPU sizing, RAM sizing)

13. Are there any other external systems that would require the data from the data hub? – To understand current baselines of data processing and BI systems (server sizing)

14. How many rows of the transactional /business system need to be processed for the Data Hub?

15. How much data is currently processed for reports?

16. What type of data processing queries exist in the system which provide Static /Ad-Hoc reports?

17. What types of reports are currently available, and what is the resource usage?

18. What are the query profiling inputs for these data processing queries /reporting queries (CPU usage, memory usage, disk usage)?

Questions 14 to 18 share the same purpose – To understand current baselines of data processing and BI systems (server sizing).

In the follow-up post for this NFR I shall let you all know how I translated these inputs into design and server sizing. Until then, happy reading.

Please feel free to share your comments.

Cheers!


CDIs-Non Functional Requirement-Reliability May 6, 2013

Posted by msrviking in Architecture, Business Intelligence, Data Integration, Integration.

This is a strange NFR, but a very important one. The word reliable sounds like plain English and looks less applicable here?! Well, no. We are talking of reliability in the sense of data reliability in a data integration system – CDI. Please note a point I had mentioned earlier: in a CDI there is no change in the meaning of data; the effort is to align the organization-wide perspective of what the data in the CDI system means and how true it is. It is necessary that all business and technical teams agree with the definition of these data terms. After this is agreed, the technical teams have to implement the business and solution logic in the system to bring that “value and truth” to the data or information that is consolidated from different sources.

In this post, I shall share a few thoughts on the questions that we need to find answers to.

  • What is the tolerance level or percentage of erroneous data that is permitted?

With the above context set, this question should gather information in order to plan error handling and threshold levels in case of any issue. The usual way is to have a failed process restarted or aborted. To do this we need to identify the tolerance (threshold levels) and handle errors so as to retain the essential, business-valid data.

  • Is there any manual process defined to handle potential failure cases?

There should always be a process which restarts any process that failed or was aborted due to technical or business validation errors, whether this is done manually or is automated. In less tolerant systems this process should be either semi-automated or fully automated, and manual only in some cases. A known fact and a familiar thought too, but we should elicit this from the stakeholders explicitly so that everyone understands the business information at the same level.

  • Under what conditions is it permissible for the system/process to not be completely accurate, and what is the frequency/probability of such recurrence?

This question is not to be mistaken to mean that we could ignore certain levels of error in the system and leave them unattended. Rather, this pointer helps in understanding which processes should be considered business critical, and the inputs gathered here could and should be used in designing a solution to handle errors – especially the business errors across all the modules, based on the importance agreed with the business stakeholders.

Although there are just three questions in this NFR, all the inputs work in tandem with the other NFRs, helping to build a less error-prone, fault-tolerant and robust CDI. There is a lot of business importance to these questions, which actually feeds into the solution building and the design of components, modules, and the data model.

So please share your thoughts or feel free to comment on this NFR.

Cheers!

CDIs-Non Functional Requirement-Availability March 28, 2013

Posted by msrviking in Architecture, Business Intelligence, Data Integration, High Availability, Integration.

The word availability is so commonly used while building systems, and there are different responses to it, often spontaneous too. I would like to share a few of those which I know of and have heard most often.

  • I want the system to be available all the time – 24×7. This response is not from recent times, but from maybe at least 5 years back, when the business, application or program managers didn’t have much knowledge of what system availability means.
  • The business needs the system to be available during the weekdays, and during the peak hours between 8 AM – 4 PM, with an acceptable failure or non-availability time of 5 minutes. During the weekends the business could accept a few hours of downtime during the early hours of the day for any patch or software upgrades. This is the more stringent expectation, for applications whose database systems have to be continuously available for the business.

Thankfully the responses to the availability questions are better, and getting better, as more techno-functional business managers are involved in building a new application or system. It has been saving my day, since I don’t have to explain a bunch of terms, formulae and what not.

Availability in the world of a CDI solution is a little different, although the system should be accessible to users all the while. What is that small difference? I had been dealing with transactional systems extensively, and the idea of availability changed when I had to look at this NFR from the perspective of CDI or a data integration solution. Trust me, as I write this post I couldn’t pin down the exact point of difference between the context of a transactional system and a data integration solution. I shall try defining it for both, and hopefully that gives some sense.

Availability in the transactional system – the system should be up and running within the defined SLAs, except for unwarranted outages. In case of failure the system should be recovered within the SLA-defined period. This could be addressed by the use of several availability features in SQL Server, and usually transactional systems are less bulky than CDI /Data Hub databases. The key points that come to the top of my head are

  1. No business /user data should be lost during a transaction, and data has to be accessible all the time
  2. No transaction should be a failure because of a system or process failure
  3. And in case of any failure the system should handle the failed transaction, and data should be recovered

Availability in data integration systems – the system should be up and running, available for business users to consume the aggregated data from various sources. Again, this too within the pre-defined and agreed SLAs; the availability requirements of some of these bulkier databases could be addressed by different availability features of SQL Server. The key points are

  1. No business data should be lost during transition (porting time and path) and there should be enough mechanisms to track back the lost row or record until the originating point
  2. Any lost row or record should not be because of a hardware or system failure; it may only be because of a failure in the implemented business processes, and this should be recorded in the system
  3. In case of any hardware failure, the system should be up and running within agreed SLAs and it is acceptable to have little longer period of recovery or restoration
  4. And the data should be available near real-time, accurate and on time

I believe that after writing the above lists I am probably bringing out that ‘thin’ difference between the availability of integration solutions and transactional systems. For both systems data is most important, and for that the system, processes, and hardware should be in place – that’s the objective. Great! Now that this knowledge sharing is done, I am going to get into the questions I put forward to learn what the availability requirement is, and also to judge what it all means for the customer. I am sure there are organizations who have adopted an MDM solution, but MDM-Data Hub or CDI is seldom done because of its direct or indirect impact on the business. Okay, I am not getting into that part of the discussion... so here are the pointers on which we should gather inputs.

  • How important is the system and how quickly does it need to be returned to online in case of an outage or failure?

This question is to address the recoverability or restoration of the system in case of a hardware or system process failure. In the end, the response to this question helps in defining the SLA for a data hub database and its downstream eco-system data sources.

  • How up-to-date does your information need to be?

The question here looks closest to a near real-time data requirement, but hold on and look at it once again. The response helps address “Data Staleness” – in short, how old the data can be for the business to bear with, and technically how often the data refresh should happen.

  • Do you need real-time or are delays acceptable?

This question is an off-shoot of the previous one, and the response will set the expectations from the business and techno-functional teams on whether there should be real-time data pulls, processing and analytics.

  • What is the data retention policy?

The last question addresses the archival policy, and also gives an idea of what type of availability feature should be used to make large volumes of data available as part of the recovery process.

In the end I probably managed to pull in the questions for the tricky word “availability” in the context of CDI. All the inputs here would help in designing a solution that meets the requirements of availability – data staleness, data retention, and data recoverability.

I shall stop here and of course I feel this is the end of the NFR availability. Please feel free to comment and share your thoughts.

CDIs-Non Functional Requirement–Scalability March 27, 2013

Posted by msrviking in Architecture, Business Intelligence, Data Integration, Integration.

In the last post I spoke about the Non Functional Requirement – Performance, and in this post I am going to talk a bit about Scalability, which is the parameter most closely related to performance too. Generally, during a requirement gathering phase this requirement is documented as – need a scalable solution that should have high performance or quick response time. Well, I wouldn’t write the requirement that way technically, but from anyone else I would expect this type of “single line” statement.

What is the definition of scalability, and what are the finer line items that need to be considered for a CDI solution to be scalable?

Scalability definition from Wikipedia (http://en.wikipedia.org/wiki/Scalability)

Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth.

and the definition of the related word Scalable system is

A system whose performance improves after adding hardware, proportionally to the capacity added.

All this sounds simple, yes, but these sentences carry a lot of meaning, and realizing them in an implementation is the task for any engineer, manager, technology specialist or architect. Now that this is done, let me get into the next point, and the most important one for this post.

A scalable database system is a system that can handle additional users, more transactions in the future, and data growth, and all this with speed, precision and accuracy. Any system can be scaled up or out, and when it comes to a database system like SQL Server, the scale-up option is the easiest but could get costlier, while scale-out needs to be well planned. When I say well planned, it starts from understanding the non functional requirements, then solution design, data model design, development, and capacity planning with infrastructure setup.

Now that we know the definitions of scalability and a scalable database system, I would like to list down the questions that help me in deciding the needs of a scalable database system in a CDI solution.

  • What is the expected annual growth in data volume (for each of the source systems) in the next three to five years?

This is in the same line as what a scalable database system should meet; however, the importance is greater when dealing with CDI – a data integration solution. An approximate projection of the data growth of different data sources would help in deciding whether the CDI solution’s database could handle that growing data. The volumes of data play a very important part in meeting your response times or the SLAs for data processing, and then publishing for analytics consumption.

  • What is the projected increase in number of source systems?

Similar question as above, but an idea of how many data source systems should be integrated along with the volume of data generated from these systems would help in knowing the volume of data that needs to be handled.

  • How many jobs to be executed in parallel?

I couldn’t disagree with an already documented point on this question, so I am repeating what someone has already said meaningfully about it.

There is a difference between executing jobs sequentially and doing so in parallel. The answer to this question impacts performance and could affect the degree to which the given application can scale.

  • Is there any need for distributed processing?

An answer to this question will give an idea of the existing systems’ distributed processing capabilities and capacities, and the organization’s policies towards distributed processing. But this is only at the level of understanding the client’s requirements, and should be taken as leads or inputs to decide whether distributed processing should be considered or not. Distributed processing of data is a double-edged decision, and should be brought into the solution only after plenty of thought on the trade-offs.

The list of questions for scalability is short, but all of these need to be addressed to have a high-performing and scalable system. I have nothing to give as expert advice, but my last suggestion is to get these responses so that any CDI solution behaves predictably.

Please feel free to comment and let me know your thoughts too.

CDIs–Non Functional Requirement–Performance March 12, 2013

Posted by msrviking in Architecture, Business Intelligence, Integration.

In today’s post I will talk about the NFR which I consider as one of the most important for the stability of the system for a CDI implementation – Performance.

This topic is vast and intense, and there are several factors that need to be considered while architecting or designing a system. A thought passes through my mind every time, even now – writing principles to achieve performance and scalability is easy, and mostly scholarly, but the challenge lies in the realization phase. If a system’s performance and scalability have to be achieved, then the appropriate solution blueprint, the right design, the best coding, and precise testing and deployment have to be in place. See, as I said, the moment this topic is opened up so much lateral and tangential thinking comes into play. Now I am going to hold that and talk about this NFR’s significance, and then how it could be used for building a CDI solution.

Performance – A short description and explanation that I picked from one of the sources I have shared in my earlier post, but modified for my better understanding and of course to suit the solution that I was building for the client.

So here it is: “performance is about the resources used to service a request and how quickly an operation can be completed”, e.g., response time, number of events processed. It’s known that in a CDI type solution, where the data is processed in GBs to TBs and in a large number of cycles, response time could be a less appropriate measure, but identifying a pre-determined response time for a particular process will help in defining the SLAs (SLAs not from a monitoring or maintenance perspective) for the downstream applications on when the data would be available.

I will give a short explanation here of what I intended by defining performance this way. Say, for example, I have to bring data from different data sources, apply lots of cleansing rules on the in-flow data, massage it for the intermediary schema, then map and publish it per the needs of the business, and this is done on a few millions of rows across different tables, or on GBs of data. I can’t assess the performance of batch applications on a per-row basis or in seconds. In a transactional system I would have gone by a response time in seconds, but that doesn’t mean I can define that the job /batch application should finish the processing in the one to four hours defined in the maintenance windows. I would prefer to abide by the business needs of the maintenance windows; however, while looking for information on this NFR I am pre-defining the system’s behavior – predictably – and also ensuring that to meet this NFR I have batch applications, real-time processing, and an appropriate design to handle the volume and velocity of the data. Yes, too much thought process and explanation, but the short message of the story is: “Meet the business needs of the maintenance window by defining the SLAs for the system”. This is key for an integration solution, or specifically for CDI.

Now I shall get back to the questions I posed to assess what performance, or rather what SLA levels, the system should adhere to. The below list of questions is a repeat from my previous post, but I shall add a bit more text on why each question was asked.

  • What is the batch window available for completion of the complete batch cycle?

The batch window is the number of hours available for the CDI system to process data. This usually is a mutually agreed input from the business and the IT folks, and knowing this information helps in deciding whether the system can meet the proposed number of hours given the existing data sources, the complexity of the processing logic, and the volumes of data that need to be processed.

  • What is the batch window available for completion of individual jobs?

The response to this question will help in deciding the dependencies of the batch jobs, and also in designing the batches efficiently.

  • What is the frequency of the batch that could be considered by default?

The inputs on the frequency of the job will help in assessing how the loads on the system would vary across different weekdays. In the end this information would essentially help in deciding the frequencies of jobs and how to optimally use the resources of the system, knowing that a typical day may not have the requisite load.

  • What are the data availability SLAs provided by the source systems of the batch load?

The source systems’ data availability would help in assessing whether the proposed SLAs can be met or not. Essentially, the SLA is realistic only if the data availability of the source system is close to what is being considered as the SLA requirement for CDI.

  • What is the expected load (peak, off-peak, average)?

This would be more of a response from the IT team who would have set internal benchmarks on peak, off-peak and average load that a particular system should have. The historical data of the existing source systems or integration systems would be good references, and the answer to this question would help in designing optimal batches, frequencies and in proposing the relevant hardware for implementation.

  • What is the hardware and software configuration of the environment where the batch cycle could be run?

Strange question, isn’t it!? Yes indeed, but trust me, this question’s response sets the pulse of what is expected from the CDI’s hardware and software – whether it is to be procured based on fresh inputs, or whether the solution has to be co-hosted on existing systems with pre-defined or prescribed hardware and software.

The following question will help in digging deeper into the standards that need to be adhered to for new solutions.

  • Are the resources available exclusive or shared?

The last and final question in evaluating the expected performance behavior is to find out whether the new solution has to be hosted on existing systems or on new systems in toto.

This has been a long post, and I am thinking I should pause for now. In the next post I shall talk about the NFR – Scalability.

Please feel free to comment and let me know your thoughts.

CDIs – Non functional requirements January 24, 2013

Posted by msrviking in Architecture, Business Intelligence, Integration.

It’s been quite some time since I posted on this topic, and I believe it’s time to share the next post, which is to do with the technical work I had done in this assignment.

NFR – Non functional requirements! Are you aware of this term? Yes, you would be, should be and could be. But when I talk to Architects, DBAs, and Project Management Teams, I notice that these groups of people understand it at a layman’s level, but don’t understand it in much depth. It usually ends with “Yes, NFRs are very important – like response time, total number of users, number of concurrent users”.

Somehow I don’t like stopping at this level; if you are building a solution for a transactional or analytical system, you will have to go deeper to know the behavior of the system in terms of Availability, Scalability, Performance, and Multi-Tenancy when the business is at work. The words I have brought up here are well known, but when you start digging further you will notice we would be covering a lot more. In the end, the built solution should comply with the NFRs as an integrated piece of all the above-said factors.

So for the CDI solution I had built, I considered the below list as the most important NFRs. I don’t believe this is an exhaustive list, but I know these had to be addressed for me to build the architecture.

· Availability

o Disaster Recovery

§ How important is the system and how quickly does it need to be returned to online in case of an outage or failure?

· Data

o Data Staleness

§ How up-to-date does your information need to be?

§ Do you need real-time or are delays acceptable?

o Data Retention

o Internationalization

· Performance

o What is the batch window available for completion of the complete batch cycle?

o What is the batch window available for completion of individual jobs?

o What is the frequency of the batch that could be considered by default?

o What are the data availability SLAs provided by the source systems of the batch load?

o What is the expected load (peak, off-peak, average)?

o What is the hardware and software configuration of the environment where the batch cycle could be run?

o Are the resources available exclusive or shared?

· Scalability

o What is the expected annual growth in data volume (for each of the source systems) in the next three to five years?

o What is the projected increase in number of source systems?

o How many jobs are to be executed in parallel?

o Is there any need for distributed processing?

· Reliability

o What is the tolerance level or percentage of erroneous data that is permitted?

o Is there any manual process defined to handle potential failure cases?

o Under what conditions is it permissible for the system/process to not be completely accurate, and what is the frequency/probability of such recurrence?

· Maintainability

o What is the maintenance/release cycle?

o How frequently do source or target structures change? What is the probability of such change?

· Extensibility

o How many different/divergent sources (different source formats) are expected to be supported?

o What kind of enhancements/changes to source formats are expected to come in?

o What is the probability of getting new sources added? How frequently does the source format change? How divergent will the new source formats be?

· Security

o Are there any special security requirements (such as data encryption or privacy) applicable?

o Are there any logging and auditing requirements?

· Capacity

o How many rows are created every day in the transactional DB, CRM, and ERP? (A back-of-the-envelope sizing sketch based on figures like this follows the list.)

o What is the size of data that is generated across LOBs in DB, CRM, ERP systems?

o What is the agreeable processing time of data in data hub?

o How stale can the reporting data be?

o What are the agreeable database system downtime hours?

§ Administration & Maintenance

§ Data Refresh/Processing

o How many concurrent users will access the report/application?

o What is the total number of users expected to use the reporting system?

o What are the expected complexity levels of the reporting solution? (High, Medium, Low)

o How much of the processed data and data used for reporting has to be archived/purged?

o How many years of data have to be archived in the system?

o How many years of historical data have to be processed?

o What is the possible backup and recovery time required for the Data Hub and Reporting system?

o What are the availability requirements of the data hub and reporting system?

o How many users will be added year on year for the reporting system, and what are the types of users?

o What will be the year-on-year growth of data in the transactional source system?

o What other sources of data could be added over a period of 1-2 years, and how much data could these sources provide to the data hub?

o Are there any other external systems that would require the data from data hub?

o How many rows from the transactional DB, CRM, and ERP need to be processed for the Data Hub?

o How much data is currently processed for reports?

o What types of data processing queries exist in the system that provide static/ad-hoc reports?

o What types of reports are currently available, and what is the resource usage?

o What are the query profiling inputs for these data processing/reporting queries?

§ CPU usage

§ Memory usage

§ Disk usage

· Resource Utilization

o Is there an upper limit on CPU time, processing capacity, etc.?

o Are there any limitations on the memory that can be consumed?

o Does the target store/database/file need to be available 24x7?

o Is any downtime allowed?

o Are there any peak or off-peak hours during which loading can happen?

o Are there crucial SLAs that need to be met?

o If SLAs are missed, is there any critical system/business impact?
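
To show how the answers to the capacity questions translate into numbers, here is a back-of-the-envelope sizing sketch in T-SQL. The figures (rows per day, average row size, annual growth) are purely illustrative assumptions and would be replaced by the answers gathered for each source system.

-- Hedged sizing sketch with illustrative figures only.
DECLARE @RowsPerDay      bigint       = 500000;  -- from the "rows created per day" answer
DECLARE @AvgRowSizeBytes int          = 400;     -- average row size across tracked tables
DECLARE @AnnualGrowth    decimal(4,2) = 0.20;    -- assumed 20% year-on-year growth
DECLARE @Years           int          = 3;

;WITH Projection AS
(
    SELECT 1 AS Yr,
           CAST(@RowsPerDay * 365 AS bigint) AS RowsPerYear
    UNION ALL
    SELECT Yr + 1,
           CAST(RowsPerYear * (1 + @AnnualGrowth) AS bigint)
    FROM Projection
    WHERE Yr < @Years
)
SELECT Yr,
       RowsPerYear,
       CAST(RowsPerYear * @AvgRowSizeBytes / 1024.0 / 1024 / 1024 AS decimal(10,2)) AS ApproxGB
FROM Projection;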

This list was prepared after researching the web for similar implementations, best practices, and standards, and drawing on past experience.

I am sharing a few links where I found information on capacity planning that includes questions around NFRs:

http://searchdatamanagement.techtarget.com/news/2240033663/Capacity-planning-for-the-data-warehouse-environment

www.information-management.com/newsletters/data_management_warehouse_store_integration-10020567-1.html?zkPrintable=1&nopagination=1

Please feel free to comment.

Cheers and Enjoy!

Few learnings..suggestions based on experience January 10, 2013

Posted by msrviking in Architecture, Business Intelligence, Integration.
Tags: , , , ,
add a comment

This is a continuation post in my CDI series, though less technical, and I felt these points were essential for the success of CDI projects. Some of these are pointers drawn from industry CDI experiences, and a few from my own.

The list isn't exhaustive, but the items are of high importance and criticality. So let's start off.

  1. The type of CDI to be adopted for implementation needs buy-in from several teams. It starts with the business teams, the technical teams (usually deployment), and most importantly senior management. So stakeholder participation is one of the topmost important points.
  2. Constant engagement with business teams is critical for understanding the functional requirements and should be maintained through design.
  3. The CDI is meant to deliver a customer-centric solution, which involves bringing customer and customer-related information from different lines of business into one repository. Sometimes the customer information from several businesses is so overwhelming that it can lead to scope leakage and schedule slips. Why does it matter to a technical person like me? Well, the solution and the design would take the hit.
     To avoid pain in later stages, it's essential to focus on small and easy business groups for implementation. This could contradict the purpose of the solution and design – "the solution and design should cover customer-centricity without loss of any information" – but it is harmless when applied to implementation alone.
  4. Establish a Data Governance team that monitors the enterprise-level information flow into and out of the CDI. This team has to work closely with the business teams from the early phases of the project, continue to work with the implementation teams, and be responsible for the data after release to production.
  5. The last but not the least point is having a proficient testing team that understands the functional requirements along with the implementation teams and carries out well-defined test strategies at each phase of the project once development kicks off.

I haven't explained why I listed the above points because they are self-explanatory and their importance is evident.

I am sharing a few links that I had been reading from a project execution perspective; these links talk about more than just the nuances of executing such projects and are surely worth sharing.

http://searchcrm.techtarget.com/quiz/Are-you-a-CDI-know-it-all

http://searchcrm.techtarget.com/news/1206714/Microsoft-goes-outside-for-CDI-help

http://searchbusinessintelligence.techtarget.in/news/2240021901/Homegrown-versus-commercial-customer-data-integration-tools-Finding-the-right-balance

http://searchdatamanagement.techtarget.com/news/1187585/Microsoft-tackles-its-own-CDI-project?bucket=NEWS

http://searchcio.techtarget.in/tutorial/Customer-data-integration-CDI-A-Quick-Guide

Cheers and Enjoy!