jump to navigation

CDIs-Non Functional Requirement-Availability March 28, 2013

Posted by msrviking in Architecture, Business Intelligence, Data Integration, High Availability, Integration.
Tags: , , , , ,
add a comment

The word availability so commonly used while building systems, and there are different responses, and of course spontaneous too. I would like to share few of those which I know of, and used more often.

  • I want the system to be available all the time – 24×7. This response was not of recent times, but maybe at least 5 years back where the business or application or program managers didn’t have knowledge on what a system availability means.
  • The business needs system to be available during the weekdays, and during the peak hours between 8 AM – 4 PM with acceptable failure time or non availability time of 5 mins. And during the weekends the business could accept few down time hours during the early hours of the day for any patch or software upgrades. This is more stringent for applications that have database systems to be continuously available for the business.

Thankfully the response to the availability questions is better and is getting better as more techno-functional business managers are involved in a new application or system building. It has been saving my day, where I don’t have to explain bunch of terms, formulae and what not.

Availability in the world of CDI solution is little different although the system should be available for any users accessibility all the while. What is that small difference? I had been dealing with transactional systems extensively, and the idea of availability changed when I had to look at this NFR from the perspective of CDI or a data integration solution. Trust me as I am writing this post, I couldn’t figure out the exact point that gives difference in the context of transactional system and data integration solution. I shall try defining for both, and hopefully that gives some sense.

Availability in the transactional system – the system should be up and running in the defined SLAs, except for unwarranted outages. In case of failure the system should be recovered in defined period of SLAs. This could be addressed by use of several availability features in SQL Server, and usually transactional systems are less bulkier than CDI /Data Hub databases. The key points that come on top of my head are

  1. No business /user data should be lost during a transaction, and data has to be accessible all the time
  2. No transaction should be a failure because of a system or process failure
  3. And in case of any failure the system should handle the failed transaction, and data should be recovered

Availability in the data integration systems – the system should be up and running – available for business users to consume the aggregated data from various sources. Again these too in the pre-defined and agreed SLAs, and some of these bulkier databases availability requirements could be addressed by different availability features of SQL Server. The key points are

  1. No business data should be lost during transition (porting time and path) and there should be enough mechanisms to track back the lost row or record until the originating point
  2. In case of any lost row or record should not be because of a hardware or system but could be because of implemented business processes failure, and this should be recorded in the system
  3. In case of any hardware failure, the system should be up and running within agreed SLAs and it is acceptable to have little longer period of recovery or restoration
  4. And the data should be available near real-time, accurate and on time

I believe after writing the above list I am probably bringing out that ‘thin’ difference between availability of integration solutions and transactional system. For both the systems data is most important and for this system, processes, and hardware should be in place, and that’s the objective. Great!, now this knowledge sharing is done, I am going to get into those questions which I put forward to know what is the availability requirement, and also judge what it all means for the customer. I am sure there are organizations who have adopted MDM solution, but MDM-Data Hub or CDI is seldom done because of its impact on the business directly or indirectly. Okay, I am not getting into that part of the discussion..so here are those pointers that we should gather inputs.

  • How important is the system and how quickly does it need to be returned to online in case of an outage or failure?

This question is to address the recoverability or restoration of the system in case of a hardware or system process failure. At the end the response to this question helps in defining SLA for a data hub database, and its downstream eco-system data sources.

  • How up-to-date does your information need to be?

The question here looks closest to near real-time data requirement, but please hold and look at it once again. The response would help to address on “Data Staleness”. In short how old can the business bear with, and technically how often should the data refresh happen.

  • Do you need real-time or are delays acceptable?

This question is off-shoot of the previous one, and response for this question will set the expectations from business and techno-functional teams if there should be real-time data pulls, processing and analytics.

  • What is the data retention policy?

The last question is to address the archival policy, and also to give an idea on what type of availability feature should be used to make large volumes of data available as part recovery of process.

At the end I probably managed pulling in questions for the tricky word “availability” in the context of CDI. All the inputs out here would help in designing a solution that should meet the requirements of availability – data staleness, data retention, and data recoverability.

I shall stop here and of course I feel this is the end of the NFR availability. Please feel free to comment and share your thoughts.

SQL Server 2005 DB Availability Features December 4, 2008

Posted by msrviking in High Availability.
add a comment

Sometime ago I had to come up with a document in my IT shop talking and comparing different SQL Server 2005 High Availability features.

With inputs from several SQL Server websites, MSDN, blog posts from my favorite bloggers, and other resources I managed to bring up what is in here. Also, at the end I brought a table which is a checklist of all the available features.

What is availability?

The simple definition of the availability of a system is the amount of time a system is up in a given year. Hence if a system is needed to be available for 99.99%, then it can’t be down for more than about 52 minutes in a year.

What is high availability?

A highly available system is one that can potentially nullify effects of the failure and maintain availability to such a degree that end users are not affected by the failure.

Higher availability in SQL Server world would revolve around these questions

Ø Are the instances up and running?

Ø Are all databases always available for the business?

Ø What if we bring the instance or database down as part of planned downtime, will the business tolerate this planned downtime?

Ø What if the instance or database crashes as part of an unplanned downtime, will the business tolerate such unplanned downtime?

In order to bring the instance or database up and running we have following high availability features in SQL Server 2005

Ø Database mirroring (New in SQL Server 2005)

Ø Failover clustering

Ø Replication

Ø Log shipping

Scalability vs. High Availability:

Availability and Scalability may seem to be conceptually related but they are actually not. With high availability, we guarantee a certain level of up-time, whereas scalability is primarily to get better performance through utilization of more resources.

In scalability, we have two different solutions for achieving better performance and to balance the load on SQL Server – scaling-up and scaling-out. Scaling up is about increasing resources like memory, CPU, disk controllers etc and is at hardware level. Scale out is using additional servers to distribute the load. This document is focused on availability options; a subsequent document will highlight scalability options in SQL Server 2005.

Description of High Availability options

Database mirroring

In database mirroring, an originating SQL Server 2005 instance continuously sends a database’s transaction log records to a copy of the database on another standby SQL Server instance. The originating database and server have the role of principal, and the receiving database and server have the role of mirror. The principal and mirror servers are separate instances of SQL Server 2005.

In all SQL Server databases, data changes are recorded in the transaction log before any changes to actual data pages are made. The transaction log records are placed first in a database’s log buffer in memory, and then flushed to disk (or ‘hardened’) as quickly as possible. In database mirroring, as the principal server writes the principal database’s log buffer to disk, it simultaneously sends that block of log records to the mirror instance. When the mirror server receives a block of log records, it places the log records first into the mirror database’s log buffer and then hardens them to disk as quickly as possible.

The principal and mirror servers are each considered a partner in the database mirroring session. A database mirroring session consists of a relationship between the partner servers when they mirror a database from one partner to another. A given partner server may have the principal role for one database and a mirror role for a different database.

In addition to the two partner servers (principal and mirror) a database mirroring session may have an optional third server, called the witness. The witness server’s role is to enable automatic failover. When database mirroring is used for high availability, if a principal server suddenly fails, if the mirror server has confirmation from the witness, it can automatically take on the role of principal and make its database available within a few seconds. In such a mode, when using ADO.Net or the SQL native client, the application can failover transparently using a connection string that looks, for example, like:

Data Source=myServerAddress; Failover Partner=myMirrorServer; Initial Catalog=myDataBase; Integrated Security=True;

There are two types of transfer mechanism for a database mirroring session: Synchronous and Asynchronous.

Synchronous transfer: The principal server hardens its principal database log records to disk; it also sends them to the mirror. The principal then waits for a response from the mirror server. The mirror responds when it has hardened those same log records to the mirror’s log disk.

Asynchronous transfer: The principal server will not wait for an acknowledgment from the mirror that the mirror has hardened a block of transaction records. The mirror will attempt to keep up with the principal, by recording transactions as quickly as possible, but some transactions may be lost if the principal suddenly fails and you force the mirror into service.

Failover clustering

A failover cluster consists of a set of redundant servers (called nodes) that share an external disk system. Clustering relies on the clustering capability of the Windows operating system. Windows must hence be configured for clustering. The server editions that support clustering in Windows 2003 are Enterprise and Datacenter. This mechanism also relies on a quorum disk that is shared by both instances, and holds all database files. This is typically a shared disk array.

During a cluster failover, a virtual SQL Server instance moves from one node to another. As a result, a cluster failover appears to external applications as if the virtual SQL Server instance is briefly unavailable (usually for less than a minute), then available again. The instance seemingly just stops and restarts. Behind the scenes, when one SQL Server instance located on one physical server becomes unavailable, Windows closes the database data files that the instance had open on a commonly shared disk space, starts another SQL Server instance on another physical server, opens the same data files, and takes over the virtual server name and virtual IP address of the failed instance. Open connections are closed during the failover, all committed transactions are rolled forward during the redo phase of database recovery in secondary node, and the undo phase will rollback the uncommitted records.

There are two types of failover clustering in Windows: Active/Passive and Active/Active.

Active/Passive means that your cluster has an active node and a passive mode. If your active node failed, then its defined resources would shift to the passive node and it would become active. The passive node is not accessible unless an accident occurs and the resources shifted.

In Active/Active clustering, both nodes are accessible and active. If a node fails, then its resources would shift to the other active node. The node that survives would then carry the load for both nodes.

Transactional Replication

Of the various types of replication supported by SQL Server, transactional replication is used for high availability primarily because it can guarantee transactional consistency between the publisher and subscribers, and also because its performance for continuous replication is much better than that of other replication methods. The most common high availability scenario for transactional replication occurs when you copy data from one database – the publisher – to one or more subscribers through a distribution server. If the publisher fails, one of the subscribers can become a read/write server and accept data updates—and even become a publisher to the other subscribers.

Log shipping

Log shipping allows you to automatically send transaction log backups from one database (known as the primary database) to a secondary database on another server (known as the secondary server). At the secondary server, these transaction log backups are restored to the secondary database, keeping it closely synchronized with the primary database. An optional third server, known as the monitor server records the history and status of backup and restores operations and optionally raises alerts if these operations fail to occur as scheduled.

When / Why should you the below availability features:

If any of your answers to the below questions is “Yes”, then that availability feature is for you.

Database mirroring:

1) Do you need a quick (<3-5 sec) automatic failover?

2) Do you want to not have the hardware expenses of a cluster?

3) Are you okay with a small performance overhead due to mirroring in the high protection mode, or with a small potential data loss in the high performance mode?

4) Do you need to protect individual databases (as opposed to all databases on an instance?

5) Do you want an automatic application redirection to standby?

6) Do you want to have standby over WAN?

7) Do you want to access standby for read-only access to a snapshot of the data?

Failover clustering:

1) Do you have hardware like a large shared cluster disk array, a dual NIC, a high capacity network, high capacity CPU and RAM listed in MS Windows 2003 HCL?

2) Do you need databases failover without any performance overhead?

3) Are you okay with a downtime of about 1-2 minutes during failover?

4) Do you need all databases of an instance as highly available?

5) Do you need a scalable solution for database availability?

Transactional replication and Log shipping:

1) Do you tolerate database availability delays that result from a manual failover?

2) Are you okay with clients not being completely transparent to failover?

3) Do you need to spread the standby servers across your WAN?

4) Do you want to balance the load across servers by accessing a read-only copy of the primary database?

Checklist of availability features

Category

Availability Feature

Database Mirroring

Failover Clustering

Log Shipping

Transactional Replication

Failover Characteristics

Standby type

Hot

Hot

Warm

Hot

Failure detection

Yes

Yes

No

No

Automatic failover

Yes

Yes

No

No

Masks disk failure

Yes

No

Yes

Yes

Masks SQL process failure

Yes

Yes

Yes

Yes

Masks other process failure

No

Yes

No

No

Metadata support

User Database

All databases (System and User Database)

User database

User database

Transactional consistency

Yes

Yes

Yes

Yes

Transactions are current

Yes

Yes

No

Yes

Perceived downtime

< 3 seconds

<20 seconds and db recovery time

Seconds plus recovery time

Seconds plus role change time

Transparent to client

Yes

Yes

No

No

Automatic role change

Yes

Yes

No

No

Configuration

Special hardware needed

No

Yes

No

No

Distance limit

Unlimited

Limited

Dispersed

Dispersed

Complexity

Some

More

Some

Some

Standby – accessible

Yes

No

Yes

Yes

Impact on performance

Some

No impact

Some

Some (Depending kind of subscription)

Impact on backup strategy

No impact

No impact

Some

No impact

Version availability

Enterprise of SQL 2005 – All features
Standard of SQL 2005 – Limited features

Enterprise and Standard Edition of SQL 2005

All versions except Express Edition

All versions except Express Edition

Additional server role

Witness

None

Monitor

Distributor

Redundant storage locations

Yes

No

Yes

Yes

Scope of availability

Database

Instance

Database

Database

Management

Load delay on secondary

No

No

Yes

No

Load balancing

No

No

No

No

Scalable

No

Yes

Yes

Yes

Client redirect

Limited support

No

No

No

Monitoring

Third party tools required

No

Yes

Yes

No

Customized scripts

Yes

Yes

Yes

No

Expertise required

Medium

High

Low

Medium

Troubleshooting

Complexity

Medium

High

Low

Medium

Expertise required

Medium

High

Low

Medium

Maintenance and Support

Complexity

Medium

High

Low

Medium

Expertise required

Medium

High

Low

Medium

Support

Support from Microsoft

New feature, so MS support is limited

Well defined support from MS

Well defined support from MS

Difficult to troubleshoot even by MS (sometimes)

Cost

Solution Cost

Low

High

Low

Low

Security

Complexity

Low

High

Low

Low

Expertise required

Low

High

Low

Low

Why don’t you guys tell me how did you find this consolidation?

Cheers!