
SSIS is for ETL, ELT or EL… the clamor continued September 3, 2013

Posted by msrviking in Business Intelligence, Data Integration, Design, Integration Services.

In yesterday’s post I mentioned that we had to design and implement an SSIS solution that does only EL, and I shared my gripe about addressing only the non-functional requirements. It was purely a process of extraction and loading – EL.

Today let me start off with the definitions of ETL and ELT, and then the new term EL, where the T is missing, or supposedly not to be considered here. While writing this post I realized I should put across a few notes on what ETL and ELT mean, and the differences between these implementations. I could have listed the points as a neat comparison table, but blame it on my laziness that I am simply sharing these links. Trust me, I have read them, and I completely acknowledge their technical relevance to ETL vs. ELT.

http://www.dataacademy.com/files/ETL-vs-ELT-White-Paper.pdf

http://blog.performancearchitects.com/wp/2013/06/13/etl-vs-elt-whats-the-difference/

In the end, ELT means you have a powerful destination infrastructure that can handle the transformation using set-based operations, with a staging area in the destination where the data is transformed before being pushed into the destination tables. In ETL processes, on the other hand, the tool does the transformation in its own parallel, in-memory infrastructure, away from the destination system, and only after the data is transformed is it loaded into the databases for the business to consume.
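To make the ELT side concrete, here is a minimal T-SQL sketch of a set-based transformation from a staging table into a destination table inside the target database. The schema, table, and column names are hypothetical and only illustrate the pattern.

```sql
-- Hypothetical ELT step: data has already been bulk-loaded into stg.Orders.
-- The transformation runs as a single set-based statement inside the
-- destination, letting the database engine do the heavy lifting.
INSERT INTO dw.FactOrders (OrderID, CustomerKey, OrderDateKey, NetAmount)
SELECT  s.OrderID,
        c.CustomerKey,
        CONVERT(int, CONVERT(char(8), s.OrderDate, 112)) AS OrderDateKey,
        s.GrossAmount - s.DiscountAmount                 AS NetAmount
FROM    stg.Orders AS s
JOIN    dw.DimCustomer AS c
        ON c.CustomerBusinessKey = s.CustomerID
WHERE   NOT EXISTS (SELECT 1 FROM dw.FactOrders AS f WHERE f.OrderID = s.OrderID);
```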

So, with these principles in mind, we still had to implement EL instead of ETL or ELT. Personally I didn’t find the approach a good one to implement, since it could have been done easily, without any development effort, using CDC for Oracle in SQL Server or Replication. However, I found some finer pros in this approach, and I remember noting them in one of my earlier posts over here. A few of them are:

  1. SSIS doesn’t need any additional changes on the source end, except for configuring a connector to pull the data.

    Replication from Oracle, in contrast, has to create temporary tables and triggers in the source schema, which is an overhead on top of being a change to the source schema, and that is definitely not acceptable to any business or system owner. CDC for Oracle in SQL Server helps to a larger extent with the above problem, but enabling certain properties on the source instance is an overhead from a security standpoint.

  2. SSIS is essentially a control flow for data, so we could synch data on a preset precedence of steps. For example, the master tables’ data should be ported first, and then the transactional tables.
  3. SSIS could be configured to pull data from the source based on business criteria by using parameters in the WHERE clause (see the sketch after this list).
  4. SSIS gives finer control over error logging and handling, and of course we could have retry packages to pull in data after a failure.
  5. SSIS also gives me perfect control over defensive programming so that all business data is in place.
  6. SSIS could be tweaked and tuned for optimal performance from extraction through loading.
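As an illustration of point 3, here is a minimal sketch of the kind of parameterized source query an SSIS data flow could use. The table, columns, and variable names are hypothetical; in an OLE DB Source the ? placeholders would be mapped to package variables.

```sql
-- Hypothetical source query for an SSIS data flow: pull only the rows
-- that match the business criteria passed in as package variables.
-- In an OLE DB Source the ? markers are mapped to SSIS variables.
SELECT  OrderID, CustomerID, OrderDate, GrossAmount
FROM    dbo.Orders
WHERE   OrderDate >= ?      -- e.g. @[User::ExtractFromDate]
  AND   OrderDate <  ?      -- e.g. @[User::ExtractToDate]
  AND   RegionCode = ?;     -- e.g. @[User::RegionCode]
```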

Overall we have control over what happens while the data is extracted and loaded. Now I am wondering: all of these are perfectly good design guidelines for any ETL or ELT, and how I wish we had not been restricted to just an EL process, huh.

Cheers!


SSIS for Extraction & Loading (EL) only September 2, 2013

Posted by msrviking in Business Intelligence, Data Integration, DBA Rant, Design, Heterogeneous, Integration, Integration Services.

There was a series of earlier posts on the different ways to pull data from Oracle to SQL Server, where I shared how to port data through Replication, CDC for Oracle, and a few others like SSIS. I thought I would give a starter to a series of post(s) on how we went about picking the option of implementing SSIS as the way to bring data from Oracle to SQL Server.

The major drivers (business and non-functional) for picking the solution were:

  • The destination database on SQL Server should be a replica of the source
  • No business rules or transformations need to be implemented when data is pulled
  • The performance of pulling and loading should be optimal, with no overhead or bottlenecks on either the source or the destination

The destination system was referred to as an ODS, but it surely isn’t the same as the Operational Data Store of DWH systems. Sadly, that naming convention had already been adopted by the shop where we had to implement the solution, so you will see me using the word ODS for the sake of explaining the implementation, and I am sorry for the incorrect usage of the word. You will probably see a few more conventions and standards skewed in this implementation, and of course a bit of my commentary on how it could have been handled better.

So the ODS was to hold data that would be used by downstream business analytics applications for the business users, with the intent of providing self-service BI. The ODS had to be an exact copy of the source, in schema and in data. The job was to pull data from the source without any performance overhead or failures, and, not to forget, there needn’t be any transformation of the data. In the end we used SSIS only as an Extraction & Loading tool instead of ETL – Extraction, Transformation and Loading.

This sounds simple, eh? But beware, it wasn’t that simple, because the design had to be kept simple while addressing all of the above factors. The SSIS package(s) had to handle these at a non-functional level:

  • The package should be configurable to pick up and pull data from any set date
  • The package should be re-runnable from the point of failure
  • The package should be configurable to address performance needs – the package execution time should be low, the source should not be overloaded, and the destination should run at full efficiency while data is loading
  • The package should be configurable to pull data from the source in preset chunks. The chunks could be based on a period – days /months /years – or a number of records per period (see the watermark sketch after this list)
  • The package should have an option to flag whether dependent packages run or not during initial and incremental loads
  • The package should have a defensive design to handle a bunch of different types of errors
  • The package should have error logging and handling at the package level and at the record level
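To give a flavour of the “set date”, “re-runnable from the point of failure”, and “preset chunk” requirements above, here is a minimal T-SQL sketch of a watermark-driven extraction that an SSIS package could wrap in a loop. The control table, source table, and chunk size are hypothetical.

```sql
-- Hypothetical control table: one row per source table holding the last
-- successfully loaded point in time, so a failed run can resume from it.
CREATE TABLE etl.LoadWatermark
(
    SourceTable   sysname       NOT NULL PRIMARY KEY,
    LastLoadedTo  datetime2(0)  NOT NULL
);

-- One chunk of the incremental pull: from the stored watermark up to a
-- preset chunk boundary (here, 1 day). SSIS would run this inside a loop,
-- advancing the watermark only after the chunk loads successfully.
DECLARE @from datetime2(0), @to datetime2(0);

SELECT @from = LastLoadedTo
FROM   etl.LoadWatermark
WHERE  SourceTable = N'dbo.Orders';

SET @to = DATEADD(DAY, 1, @from);

SELECT  OrderID, CustomerID, OrderDate, GrossAmount
FROM    dbo.Orders
WHERE   OrderDate >= @from
  AND   OrderDate <  @to;

-- After the destination load commits, advance the watermark.
UPDATE etl.LoadWatermark
SET    LastLoadedTo = @to
WHERE  SourceTable = N'dbo.Orders';
```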

In toto, this solution was more about how we design for the non-functional requirements and implement them, leaving aside the functional side of the data. This type of implementation is half-blind, and I will talk more about the cons we had to face when I get into the details of each implementation.

A post after a few weeks’ gap, and one with learnings and summaries on how it could have been done better.

CDIs-Non Functional Requirements-Other few May 21, 2013

Posted by msrviking in Architecture, Business Intelligence, Data Integration, Design, Integration, Security.

In the CDIs – Non Functional Requirements series I have covered NFRs like Performance, Scalability, Availability, Reliability, and Utilization Benchmarks that are useful for building a CDI system. In this post I shall talk about less known but important NFRs: Maintainability, Extensibility and Security. I wouldn’t strictly term these NFRs, but questions on these topics will give you an insight into how the system is expected to behave from a business and technical perspective. Also, these don’t work in silos; instead they link back to some of the ones mentioned in the list above.

Maintainability

  1. What is the maintenance/release cycle?

    This question gives us an idea of the practices followed by the business and technical teams around the release cycle. Each release cycle means a change in the source system, which may be applicable to downstream applications. The more release cycles there are, the more difficult the job of maintaining the code base. To avoid long-term overhead, the system, the data model and the low-level design of the ETLs should be built with the expectation that these changes will be constant and frequent.

  2. How frequently do source or target structures change? What is the probability of such change?

    This point is related to the first question, but elicits information one level deeper by asking: if there are many maintenance cycles, what are those changes at the source, and what are they expected to be in the target? If the changes are constant, frequent and less complex, then the data model and the ETLs have to be configurable to accommodate ‘certain’ changes in the source. That configurability comes with a rider and a tradeoff against other NFRs like performance: changes in the data source can affect the performance of the ETL, and sometimes the business-laid SLA can’t be met.

Having said this, I presume the next NFR is closely related to Maintainability.

Extensibility

The answers to queries under this topic tend to be challenging for the business and technical teams. Not everyone is sure of what should and shouldn’t be configurable, based on the changes expected from the business at the source system level. You will get answers like “not sure”, “maybe”, “quite possible in the near future”, “probably” the source will change, and exactly what will change remains an open question. Providing an appropriate solution at the different layers is then a daunting task for the technical team.

  1. How many different/divergent sources (different source formats) are expected to be supported?

    The answer to this question helps in understanding which source formats (CSV, TSV, XML, etc.) have to be supported; if they diverge widely, alternate design practices can be implemented on the target to provide extensibility to all formats.

  2. What kind of enhancements/changes to source formats are expected to come in?

    An answer to this point helps in deciding whether there should be abstract transformations or reusable mappings.

  3. What is the probability of getting new sources added? How frequently does the source format change? How divergent will the new source formats be?

    This information tells us how often the source formats change, and whether the change comes from existing sources or from new ones. Again, it helps in deciding between abstract transformations and reusable mappings.

The last NFR is Security, which is usually the last one preferred in any system architecture and design, but one of the most important.

Security

In the case of CDI we are dealing with sensitive customer information and transaction details. It is important to understand how the business treats this type of data, and how the security compliance team wants to handle data that is gathered from different source systems and consolidated in a single place – the “Data hub”. The questions below cover data protection rather than the access levels or privileges of different users.

  1. Are there any special security requirements (such as data encryption or privacy) applicable?

    The answer to this question is usually “no”, but there are certain fields brought in from CRM and ERP systems that need to be protected from misuse in case of a security breach. I suggest walking through this question with a real scenario, and then taking the decision on whether data or database encryption should be enabled.
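If the decision goes toward database-level encryption, a minimal T-SQL sketch of enabling Transparent Data Encryption on the hub database might look like the following; the database name, certificate name, and password are hypothetical.

```sql
-- Hypothetical example of enabling Transparent Data Encryption (TDE)
-- on the data hub database, so data at rest is protected.
USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPasswordHere>';
CREATE CERTIFICATE DataHubTdeCert WITH SUBJECT = 'Data hub TDE certificate';
-- Back up the certificate and its private key; restores depend on it.

USE DataHub;
CREATE DATABASE ENCRYPTION KEY
    WITH ALGORITHM = AES_256
    ENCRYPTION BY SERVER CERTIFICATE DataHubTdeCert;

ALTER DATABASE DataHub SET ENCRYPTION ON;
```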

  2. Are there any logging and auditing requirements?

    This is the least demanded requirement, since the data from the different systems is mostly massaged and made available in reporting form through a different data sink. A discussion here helps in deciding whether security should be handled at the reporting level (by enabling the different security features there) rather than inside the massive data processes.

I hope all these posts on NFRs for CDIs help you in architecting and designing a Data Hub system that is highly available, scalable, high performing, and most of all reliable.

Cheers!

CDIs-Non Functional Requirement-Utilization Benchmarks May 16, 2013

Posted by msrviking in Architecture, Business Intelligence, Data Integration, Integration.

In this post I shall share my thoughts on how these questions help in designing a system that meets response time and performance expectations. Some of them are about the history of the existing systems, and a few are about future requirements. A lot of English, eh? The straight message is that the answers to these questions help me gather “baselines and benchmarks” of the past and the future. Once I know those, I should have:

  1. A system designed to perform with low latency (quick response time) for any operation
  2. An infrastructure (especially servers) built to meet future performance requirements

Let me group these questions by purpose and tag them with some keywords, for easy understanding.

  1. How many rows are created every day in the transactional DBs? – To get current baselines on how many rows are generated from business transactions (row size, storage space, disk size, CPU size, computation power)
  2. What is the size of the data that is generated across LOBs? – To get current baselines on how many rows are generated from business transactions per line of business (row size, storage space, disk size)
  3. What is the agreeable processing time of data in the data hub? – To understand future benchmarks of the data hub (CPU sizing)
  4. How stale can the reporting data be? – To understand future benchmarks of the data hub (storage sizing)
  5. How much of the processed data and data used for reporting has to be archived /purged? – To understand future benchmarks on how the archived data would be used in the reporting system (server sizing, storage sizing)
  6. How many years of data have to be archived in the system? – To understand future benchmarks on how the archived data would be used in the reporting system (server sizing, storage sizing)
  7. How many years of previous years’ data have to be processed? – To understand future benchmarks on how the archived data would be used in the reporting system (server sizing, storage sizing, CPU sizing, RAM sizing)
  8. What will be the year-on-year growth of data in the transactional source system? – To understand current baselines and future benchmarks on the data that needs to be processed (server sizing, storage sizing, CPU sizing, RAM sizing)
  9. What other sources of data could be added over a period of 1-2 years, and how much data could these sources provide to the data hub? – To understand current baselines and future benchmarks on the data that needs to be processed (server sizing, storage sizing, CPU sizing, RAM sizing)
  10. What are the agreeable database system downtime hours? – To understand future benchmarks on how quickly the data processing should complete (CPU sizing, RAM sizing)
  11. What is the possible backup and recovery time required for the Data Hub and Reporting system? – To understand how quickly the whole data hub and BI system should be back up and running (server sizing, storage sizing, CPU sizing, RAM sizing)
  12. How many users will be added year on year for the reporting system, and what are the types of users? – To understand future benchmarks on how the reporting and data hub system resources will be utilized (server sizing, storage sizing, CPU sizing, RAM sizing)
  13. Are there any other external systems that would require the data from the data hub? – To understand current baselines of data processing and BI systems (server sizing)
  14. How many rows of the transactional /business system need to be processed for the Data Hub?
  15. How much data is currently processed for reports?
  16. What type of data processing queries exist in the system which provide static /ad-hoc reports?
  17. What types of reports are currently available, and what is the resource usage?
  18. What are the query profiling inputs for these data processing queries /reporting queries (CPU usage, memory usage, disk usage)?

Questions 14-18 share a common purpose: to understand current baselines of the data processing and BI systems (server sizing).
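Purely as an illustration of how the answers above feed into sizing, here is a back-of-the-envelope T-SQL sketch with entirely hypothetical numbers (2 million rows per day, roughly 1 KB per row, three years of history retained); it is not the sizing from this engagement.

```sql
-- Hypothetical baseline-to-sizing arithmetic, before indexes and overheads.
SELECT
    CAST(2000000 AS bigint) * 1024            AS BytesPerDay,
    CAST(2000000 AS bigint) * 1024 * 365 * 3  AS RawBytesForThreeYears,
    CAST(2000000 AS bigint) * 1024 * 365 * 3
        / (1024.0 * 1024 * 1024 * 1024)       AS ApproxTerabytes;
```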

In the follow-up post for this NFR I shall let you all know how I translated these inputs into the design and server sizing. Until then, happy reading.

Please feel free to share your comments.

Cheers!


CDIs-Non Functional Requirement-Reliability May 6, 2013

Posted by msrviking in Architecture, Business Intelligence, Data Integration, Integration.

This is a strange NFR, but a very important one. The word “reliable” sounds like plain English and looks less applicable here?! Well, no. We are talking about reliability in the sense of data reliability in a data integration system – a CDI. Please note a point I mentioned earlier: in a CDI there is no change in the meaning of data; we are trying to align, across the organization, a common perspective on what the data in the CDI system means and how true it is. It is necessary that all business and technical teams agree on the definitions of these data terms. Once that is agreed, the technical teams have to implement the business and solution logic in the system to bring that “value and truth” to the data or information that is consolidated from the different sources.

In this post, I shall share a few thoughts on the questions we need to find answers to.

  • What is the tolerance level or percentage of erroneous data that is permitted?

With the above context set, this question gathers the information needed to plan error handling and threshold levels in case of any issue. The usual way is to have a failed process restarted, or aborted. To do this we need to identify the tolerance (threshold levels) and handle errors so that we still get the essential, business-valid data.
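As an illustration only, here is a minimal T-SQL sketch of how such a tolerance level could be enforced after a load; the audit table, run identifier, and 2% threshold are hypothetical.

```sql
-- Hypothetical post-load check: fail the run (so it can be restarted or
-- reviewed) when the share of rejected rows exceeds the agreed tolerance.
DECLARE @RunId int = 42;                   -- supplied by the load process
DECLARE @Tolerance decimal(5, 2) = 2.00;   -- agreed threshold, in percent
DECLARE @ErrorPct decimal(5, 2);

SELECT @ErrorPct = 100.0 * SUM(RejectedRows) / NULLIF(SUM(SourceRows), 0)
FROM   etl.LoadRunAudit
WHERE  RunId = @RunId;

IF @ErrorPct > @Tolerance
    THROW 50001, 'Erroneous data exceeds the agreed tolerance level.', 1;
```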

  • Is there any manual process defined to handle potential failure cases?

There should always be a process that restarts any process that failed or was aborted due to technical or business validation errors, whether that is done manually or automatically. In less tolerant systems this process should be either semi-automated or fully automated, and manual only in some cases. A known fact and a familiar thought, but we should elicit it from the stakeholders explicitly so that everyone understands the business information at the same level.

  • Under what conditions is it permissible for the system/process to not be completely accurate, and what is the frequency/probability of such recurrence?

This question is not to be mistaken to mean that certain levels of error in the system could be ignored and left unattended. Rather, it helps in understanding which processes are business critical, and the inputs gathered here could and should be used in designing the error handling – especially for the business rules across all the modules, based on the importance agreed with the business stakeholders.

Although there are just three questions in this NFR, all the inputs work in tandem with the other NFRs, helping to build a less error-prone, fault-tolerant and robust CDI. There is a lot of business importance in these questions, and the answers feed directly into the solution building and the design of the components, modules, and data model.

So please share your thoughts or feel free to comment on this NFR.

Cheers!

CDIs-Non Functional Requirement-Availability March 28, 2013

Posted by msrviking in Architecture, Business Intelligence, Data Integration, High Availability, Integration.

The word availability is so commonly used while building systems, and it draws different, often spontaneous, responses. I would like to share a few of those that I know of and hear most often.

  • I want the system to be available all the time – 24×7. This response is not from recent times, but from maybe at least 5 years back, when the business, application or program managers didn’t have much knowledge of what system availability means.
  • The business needs the system to be available on weekdays, during the peak hours between 8 AM and 4 PM, with an acceptable failure or non-availability time of 5 minutes. During the weekends the business can accept a few hours of downtime in the early hours of the day for patches or software upgrades. This is the more stringent kind of response, for applications whose database systems have to be continuously available to the business.

Thankfully the responses to the availability questions are better, and keep getting better as more techno-functional business managers get involved in building a new application or system. It has been saving my day; I don’t have to explain a bunch of terms, formulae and what not.

Availability in the world of a CDI solution is a little different, although the system should still be accessible to users all the while. What is that small difference? I had been dealing with transactional systems extensively, and my idea of availability changed when I had to look at this NFR from the perspective of a CDI or data integration solution. Trust me, as I write this post I can’t pin down the exact point of difference between the transactional context and the data integration context, so I shall try defining it for both, and hopefully that gives some sense.

Availability in a transactional system: the system should be up and running within the defined SLAs, except for unwarranted outages, and in case of failure it should be recovered within the defined SLA period. This can be addressed by several availability features in SQL Server, and transactional systems are usually less bulky than CDI /Data Hub databases. The key points that come to the top of my head are:

  1. No business or user data should be lost during a transaction, and data has to be accessible all the time
  2. No transaction should fail because of a system or process failure
  3. In case of any failure, the system should handle the failed transaction and the data should be recovered

Availability in a data integration system: the system should be up and running – available for business users to consume the aggregated data from the various sources – again within pre-defined and agreed SLAs, and some of the availability requirements of these bulkier databases can be addressed by the different availability features of SQL Server. The key points are:

  1. No business data should be lost in transit (porting time and path), and there should be enough mechanisms to trace a lost row or record back to its originating point (see the sketch after this list)
  2. Any lost row or record should not be due to a hardware or system failure; it may be due to a failure in the implemented business processes, and that should be recorded in the system
  3. In case of any hardware failure, the system should be up and running within the agreed SLAs, and it is acceptable to have a somewhat longer period of recovery or restoration
  4. The data should be available near real-time, accurate and on time
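To illustrate the “trace a lost row back to its origin” point above, here is a minimal sketch of a row-count reconciliation check between a source extract and the hub table; the audit table, hub table, and batch column are hypothetical.

```sql
-- Hypothetical reconciliation: compare the number of rows extracted for a
-- batch with the number actually landed in the hub, so missing records can
-- be traced back to their originating load.
SELECT  s.BatchId,
        s.ExtractedRows,
        COUNT(h.OrderID)                    AS LoadedRows,
        s.ExtractedRows - COUNT(h.OrderID)  AS MissingRows
FROM    etl.ExtractAudit AS s
LEFT JOIN hub.Orders     AS h
        ON h.LoadBatchId = s.BatchId
GROUP BY s.BatchId, s.ExtractedRows
HAVING  s.ExtractedRows <> COUNT(h.OrderID);
```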

I believe that after writing the above lists I have probably brought out that ‘thin’ difference between the availability of integration solutions and transactional systems. For both, the data is what matters most, and the system, processes and hardware should be in place for it; that’s the objective. Great! Now that this knowledge sharing is done, I am going to get into the questions I put forward to learn what the availability requirement is, and also to judge what it all means for the customer. I am sure there are organizations that have adopted an MDM solution, but an MDM Data Hub or CDI is seldom done because of its direct or indirect impact on the business. Okay, I am not getting into that part of the discussion.. so here are the pointers on which we should gather inputs.

  • How important is the system and how quickly does it need to be returned to online in case of an outage or failure?

This question addresses the recoverability or restoration of the system in case of a hardware or system process failure. In the end, the response helps in defining the SLA for the data hub database and its downstream ecosystem of data sources.

  • How up-to-date does your information need to be?

The question here looks closest to a near real-time data requirement, but please hold on and look at it once again. The response helps address “data staleness”: in short, how old the data can be for the business to bear with, and technically how often the data refresh should happen.

  • Do you need real-time or are delays acceptable?

This question is an offshoot of the previous one, and the response sets the expectation, from the business and techno-functional teams, of whether there should be real-time data pulls, processing and analytics.

  • What is the data retention policy?

The last question addresses the archival policy, and also gives an idea of what type of availability feature should be used to make large volumes of data available as part of the recovery process.

In the end I probably managed to pull together the questions for the tricky word “availability” in the context of CDI. All the inputs here help in designing a solution that meets the availability requirements – data staleness, data retention, and data recoverability.

I shall stop here, and I feel this is the end of the NFR availability. Please feel free to comment and share your thoughts.

CDIs-Non Functional Requirement–Scalability March 27, 2013

Posted by msrviking in Architecture, Business Intelligence, Data Integration, Integration.

In the last post I spoke about the Non Functional Requirement – Performance, and in this post I am going to talk a bit about Scalability, which is the parameter most closely related to performance. Generally, during a requirements gathering phase this requirement gets documented as “need a scalable solution that should have high performance or quick response time”. I wouldn’t write such a note myself, but from anyone else I would expect this type of “single line” statement.

What is the definition of scalability, and what are the finer line items that need to be considered for a CDI solution to be scalable?

Scalability definition from Wikipedia (http://en.wikipedia.org/wiki/Scalability)

Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth.

and the definition of the related term, a scalable system, is

A system whose performance improves after adding hardware, proportionally to the capacity added.

All this sounds simple, yes, true, but these sentences carry great meaning, and realizing them in an implementation is the real task for any engineer, manager, technology specialist or architect. Now that this is done, let me get to the next point, and the most important one for this post.

A scalable database system is one that can handle additional users, a larger number of transactions and more in future, and data growth, all with speed, precision and accuracy. Any system can be scaled up or out, and when it comes to a database system like SQL Server, the scale-up option is the easiest but can get costly, while scale-out needs to be well planned. When I say well planned, it starts from understanding the non-functional requirements, through solution design, data model design, development, and capacity planning with the infrastructure setup.

Now that we know the definitions of scalability and a scalable database system, I would like to list the questions that help me decide the needs of a scalable database system in a CDI solution.

  • What is the expected annual growth in data volume (for each of the source systems) in the next three to five years?

This is in the same line as what any scalable database system should meet, but its importance is greater when dealing with a CDI – a data integration solution. An approximate projection of the data growth of the different data sources helps in deciding whether the CDI solution’s database can handle that growing data. The volume of data plays a very important part in meeting the response times or the SLAs for data processing, and then publishing for analytics consumption.

  • What is the projected increase in number of source systems?

A similar question to the one above, but an idea of how many data source systems have to be integrated, along with the volume of data generated by these systems, helps in knowing the total volume of data that needs to be handled.

  • How many jobs need to be executed in parallel?

I couldn’t disagree with an already documented point on this question, so I am repeating what someone else said meaningfully:

There is a difference between executing jobs sequentially and doing so in parallel. The answer to this question impacts performance and could affect the degree to which the given application can scale.

  • Is there any need for distributed processing?

The answer to this question gives a sense of the existing systems’ distributed processing capabilities and capacities, and of the organization’s policies towards distributed processing. But this is only at the level of understanding the client’s requirements, and should be taken as leads or inputs to decide whether distributed processing should be considered or not. Distributed processing of data is a double-edged decision, and should be brought into the solution only after plenty of thought about the trade-offs.

The list of questions for scalability is short, but all of them are essential to address in order to have a high-performing and scalable system. I have nothing to give as expert advice, but my last suggestion is to get these responses so that you have predictable behavior from any CDI solution.

Please feel free to comment and let me know your thoughts too.

CDIs–Non Functional Requirement–Performance March 12, 2013

Posted by msrviking in Architecture, Business Intelligence, Integration.

In today’s post I will talk about the NFR which I consider one of the most important for the stability of a CDI implementation – Performance.

This topic is vast and intense, and there are several factors that need to be considered while architecting or designing a system. A thought passes through my mind whenever this comes up, even now: writing down principles for achieving performance and scalability is easy, and mostly scholarly, but the challenge lies in the realization phase. If a system’s performance and scalability have to be achieved, then the appropriate solution blueprint, right design, best coding, precise testing and deployment all have to be in place. See, as I said, the moment this topic is opened up, so much lateral and tangential thinking comes into play. Now I am going to hold that, and let’s talk about this NFR’s significance and then how it can be used for building a CDI solution.

Performance – a short description and explanation that I picked from one of the sources I shared in my earlier post, modified for my better understanding and of course to suit the solution I was building for the client.

So here it is: “performance is about the resources used to service a request and how quickly an operation can be completed”, e.g., response time, number of events processed. It is known that in a CDI-type solution, where GBs to TBs of data are processed in a large number of cycles, response time may be a less appropriate measure, but identifying a pre-determined completion time for a particular process helps in defining the SLAs (SLAs not from a monitoring or maintenance perspective) for the downstream applications on when the data will be available.

I will give a short explanation of what I intended by defining performance this way. Say, for example, I have to bring data from different data sources, apply a lot of cleansing rules to the in-flowing data, massage it for the intermediary schema, then map and publish per the needs of the business, and all of this is done on a few million rows across different tables, or on GBs of data. I can’t assess the performance of batch applications on a per-row or per-second basis. In a transactional system I would have gone by a response time measured in seconds, but that doesn’t mean I can simply declare that the job or batch application should finish its processing within the one to four hours defined in the maintenance windows. I would prefer to abide by the business’s needs for the maintenance window; while gathering information for this NFR I am pre-defining the system’s behavior – predictably – and also ensuring that, to meet this NFR, I have batch applications, real-time processing, and an appropriate design to handle the volume and velocity of the data. Yes, too much thought process and explanation, but the short message of the story is: “Meet the business needs of the maintenance window by defining the SLAs for the system.” This is key for an integration solution, and for CDI in particular.
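As an illustration of tracking such process-level SLAs, here is a minimal sketch of a run log and a query comparing actual durations against an agreed window; the table, the process names, and the 240-minute SLA are hypothetical.

```sql
-- Hypothetical run log: one row per batch execution.
CREATE TABLE etl.BatchRunLog
(
    RunId        int IDENTITY(1, 1) PRIMARY KEY,
    ProcessName  sysname      NOT NULL,
    StartedAt    datetime2(0) NOT NULL,
    FinishedAt   datetime2(0) NULL
);

-- Flag runs that exceeded the agreed SLA of 240 minutes.
SELECT  ProcessName,
        StartedAt,
        DATEDIFF(MINUTE, StartedAt, FinishedAt) AS DurationMinutes
FROM    etl.BatchRunLog
WHERE   FinishedAt IS NOT NULL
  AND   DATEDIFF(MINUTE, StartedAt, FinishedAt) > 240;
```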

Now I shall get back to the questions I posed to assess what performance, or rather what SLA levels, the system should adhere to. The list of questions below repeats my previous post, but I shall add a bit more text on why each question was asked.

  • What is the batch window available for completion of the full batch cycle?

The batch window is the number of hours available for the CDI system to process data. It is usually a mutually agreed input from the business and the IT folks, and knowing it helps in deciding whether the system can meet the proposed number of hours given the existing data sources, the complexity of the processing logic, and the volume of data that needs to be processed.

  • What is the batch window available for completion of individual jobs?

The response to this question helps in deciding the dependencies between batch jobs, and also in designing the batches efficiently.

  • What is the frequency of the batch that could be considered by default?

The frequency of the jobs helps in assessing how the load on the system varies across the different weekdays. In the end this information essentially helps in deciding the frequencies of jobs and how to use the resources of the system optimally, knowing that a typical day may not carry the full load.

  • What are the data availability SLAs provided by the source systems of the batch load?

The source systems’ data availability helps in assessing whether the proposed SLAs can be met or not. Essentially, the SLA is only realistic if the data availability of the source systems is close to what is being considered as the SLA requirement for the CDI.

  • What is the expected load (peak, off-peak, average)?

This is more of a response from the IT team, who would have set internal benchmarks on the peak, off-peak and average load a particular system should carry. The historical data of the existing source or integration systems is a good reference, and the answer to this question helps in designing optimal batches and frequencies, and in proposing the relevant hardware for the implementation.

  • What is the hardware and software configuration of the environment where the batch cycle could be run?

A strange question, isn’t it!? Yes indeed, but trust me, the response to this question sets the pulse on what is expected from the CDI’s hardware and software: whether it will be procured based on fresh inputs, or whether the solution has to be co-hosted on existing systems with pre-defined or prescribed hardware and software.

The following question digs deeper into the standards that need to be adhered to for new solutions.

  • Are the resources available exclusive or shared?

The last and final question in evaluating the expected performance behavior is to find out whether the new solution has to be hosted on existing systems or entirely on new ones.

This has been a long post, and I am thinking I should pause for now. In the next post I shall talk about the NFR – Scalability.

Please feel free to comment and let me know your thoughts.

CDIs – Non functional requirements January 24, 2013

Posted by msrviking in Architecture, Business Intelligence, Integration.

It’s been quite some time since I posted on this topic, and I believe it’s time to share the next post, which is about the technical work I did in this assignment.

NFR – non-functional requirements! Are you aware of this term? Yes, you would be, should be and could be. But when I talk to architects, DBAs and project management teams, I notice that these groups understand it at a layman level but not in much depth. It usually ends with “Yes, NFRs are very important, like response time, total number of users, number of concurrent users”.

Somehow I don’t like stopping at that level. If you are building a solution for a transactional or analytical system, you have to go deeper to know the behavior of the system in terms of Availability, Scalability, Performance and Multi-Tenancy when the business is at work. The words I have brought up here are well known, but when you start digging further you will notice we cover a lot more. In the end, the built solution should comply with the NFRs as an integrated piece of all the factors above.

So for the CDI solution I built, I considered the list below as the most important NFRs. I don’t believe this is an exhaustive list, but I know these are the ones to be addressed for me to build the architecture.

· Availability

o Disaster Recovery

§ How important is the system and how quickly does it need to be returned to online in case of an outage or failure?

· Data

o Data Staleness

§ How up-to-date does your information need to be?

§ Do you need real-time or are delays acceptable?

o Data Retention

o Internationalization

· Performance

o What is the batch window available for completion of the full batch cycle?

o What is the batch window available for completion of individual jobs?

o What is the frequency of the batch that could be considered by default?

o What are the data availability SLAs provided by the source systems of the batch load?

o What is the expected load (peak, off-peak, average)?

o What is the hardware and software configuration of the environment where the batch cycle could be run?

o Are the resources available exclusive or shared?

· Scalability

o What is the expected annual growth in data volume (for each of the source systems) in the next three to five years?

o What is the projected increase in number of source systems?

o How many jobs need to be executed in parallel?

o Is there any need for distributed processing?

· Reliability

o What is the tolerance level or percentage of erroneous data that is permitted?

o Is there any manual process defined to handle potential failure cases?

o Under what conditions is it permissible for the system/process to not be completely accurate, and what is the frequency/probability of such recurrence?

· Maintainability

o What is the maintenance/release cycle?

o How frequently do source or target structures change? What is the probability of such change?

· Extensibility

o How many different/divergent sources (different source formats) are expected to be supported?

o What kind of enhancements/changes to source formats are expected to come in?

o What is the probability of getting new sources added? How frequently does the source format change? How divergent will the new source formats be?

· Security

o Are there any special security requirements (such as data encryption or privacy) applicable?

o Are there any logging and auditing requirements?

· Capacity

o How many rows are created every day in transactional DB, CRM and ERP?

o What is the size of data that is generated across LOBs in DB, CRM, ERP systems?

o What is the agreeable processing time of data in data hub?

o How stale can the reporting data be?

o What are the agreeable database system downtime hours?

§ Administration & Maintenance

§ Data Refresh /Processing

o How many concurrent users will access the report /application?

o What is the total number of users expected to use the reporting system?

o What are the expected complexity levels of reporting solution? (High, Medium, Low)

o How much of the processed data and data used for reporting has to be archived /purged?

o How many years of data have to be archived in the system?

o How many years of previous years’ data have to be processed?

o What is the possible backup and recovery time required for the Data Hub and Reporting system?

o What are the availability requirements of the data hub and reporting system?

o How many users will be added year on year for the reporting system, and what are the types of users?

o What will be year on year growth of data in the transactional source system?

o What could be the other sources of data that could be added during a period of 1-2 years, and how much of these data sources could provide data to data hub?

o Are there any other external systems that would require the data from data hub?

o How many rows of the transactional DB, CRM and ERP needs to be processed for the Data Hub?

o How much data is currently processed for reports?

o What type of data processing queries exist in the system which provide Static /Ad-Hoc reports?

o What types of reports are currently available, and what is the resource usage?

o What are the query profiling inputs for these data processing queries /reporting queries?

§ CPU usage

§ Memory usage

§ Disk usage

· Resource Utilization

o Is there an upper limit for CPU time or central processing systems, etc.?

o Are there any limitations on memory that can be consumed?

o Does the target store/database/file need to be available 24×7?

o Is any downtime allowed?

o Is there any peak or off peak hours during which loading can happen?

o Are there crucial SLAs that need to be met?

o If SLAs are missed, is there any critical system/business impact?

This list was prepared after researching the web for similar implementations, best practices and standards, and based on past experience.

I am sharing a few links where I found information on capacity planning that included questions around NFRs:

http://searchdatamanagement.techtarget.com/news/2240033663/Capacity-planning-for-the-data-warehouse-environment

www.information-management.com/newsletters/data_management_warehouse_store_integration-10020567-1.html?zkPrintable=1&nopagination=1

Please feel free to comment.

Cheers and Enjoy!

A few learnings.. suggestions based on experience January 10, 2013

Posted by msrviking in Architecture, Business Intelligence, Integration.

This is a continuation post of my CDI series, though less technical; I felt these points are very essential for the success of CDI projects. Some of these are pre-meditated pointers based on industry CDI experiences, and a few come from my own experience.

The list isn’t exhaustive, but the items have high importance and higher criticality. So let’s start off.

  1. The type of CDI to be adopted for implementation needs a lot of stake from several teams. It starts with the business teams, the technical teams (usually deployment), and most importantly senior management. So stakeholder participation is one of the top-most important points.
  2. Constant engagement with the business teams is critical for understanding the functional requirements, and should be maintained through design.
  3. CDI is meant to bring a customer-centric solution, which involves bringing customer and customer-related information from different lines of business into one repository. Sometimes the customer-related information from the several businesses is so overwhelming that it can lead to scope leakage and schedule hits. What does that matter to a technical person like me? Well, the solution and the design get hit and hurt.
    To avoid pain in the later stages, it is essential to focus on small and easy business groups for the implementation. This could contradict the purpose of the solution & design – “the solution and design should cover customer-centricity without loss of any information” – but it is harmless when we consider it for implementation only.
  4. Establish a Data Governance team to monitor the enterprise-level information flowing into and out of the CDI. This team has to work closely with the business teams from the early phases of the project, continue to work with the implementation teams, and remain responsible for the data after the release to production.
  5. The last but not least point is having a proficient testing team that understands the functional requirements along with the implementation teams, and carries out well-defined test strategies at each phase of the project once development is kicked off.

I haven’t explained why I listed the above points, because they are self-explanatory and their importance speaks for itself.

I am sharing a few links that I had been reading from a project execution perspective; these links talk about more than the nuances of executing such projects and are surely worth the share.

http://searchcrm.techtarget.com/quiz/Are-you-a-CDI-know-it-all

http://searchcrm.techtarget.com/news/1206714/Microsoft-goes-outside-for-CDI-help

http://searchbusinessintelligence.techtarget.in/news/2240021901/Homegrown-versus-commercial-customer-data-integration-tools-Finding-the-right-balance

http://searchdatamanagement.techtarget.com/news/1187585/Microsoft-tackles-its-own-CDI-project?bucket=NEWS

http://searchcio.techtarget.in/tutorial/Customer-data-integration-CDI-A-Quick-Guide

Cheers and Enjoy!