Sometime ago I had to come up with a document in my IT shop talking and comparing different SQL Server 2005 High Availability features.
With inputs from several SQL Server websites, MSDN, blog posts from my favorite bloggers, and other resources I managed to bring up what is in here. Also, at the end I brought a table which is a checklist of all the available features.
What is availability?
The simple definition of the availability of a system is the amount of time a system is up in a given year. Hence if a system is needed to be available for 99.99%, then it can’t be down for more than about 52 minutes in a year.
What is high availability?
A highly available system is one that can potentially nullify effects of the failure and maintain availability to such a degree that end users are not affected by the failure.
Higher availability in SQL Server world would revolve around these questions
Ø Are the instances up and running?
Ø Are all databases always available for the business?
Ø What if we bring the instance or database down as part of planned downtime, will the business tolerate this planned downtime?
Ø What if the instance or database crashes as part of an unplanned downtime, will the business tolerate such unplanned downtime?
In order to bring the instance or database up and running we have following high availability features in SQL Server 2005
Ø Database mirroring (New in SQL Server 2005)
Ø Failover clustering
Ø Replication
Ø Log shipping
Scalability vs. High Availability:
Availability and Scalability may seem to be conceptually related but they are actually not. With high availability, we guarantee a certain level of up-time, whereas scalability is primarily to get better performance through utilization of more resources.
In scalability, we have two different solutions for achieving better performance and to balance the load on SQL Server – scaling-up and scaling-out. Scaling up is about increasing resources like memory, CPU, disk controllers etc and is at hardware level. Scale out is using additional servers to distribute the load. This document is focused on availability options; a subsequent document will highlight scalability options in SQL Server 2005.
Description of High Availability options
Database mirroring
In database mirroring, an originating SQL Server 2005 instance continuously sends a database’s transaction log records to a copy of the database on another standby SQL Server instance. The originating database and server have the role of principal, and the receiving database and server have the role of mirror. The principal and mirror servers are separate instances of SQL Server 2005.
In all SQL Server databases, data changes are recorded in the transaction log before any changes to actual data pages are made. The transaction log records are placed first in a database’s log buffer in memory, and then flushed to disk (or ‘hardened’) as quickly as possible. In database mirroring, as the principal server writes the principal database’s log buffer to disk, it simultaneously sends that block of log records to the mirror instance. When the mirror server receives a block of log records, it places the log records first into the mirror database’s log buffer and then hardens them to disk as quickly as possible.
The principal and mirror servers are each considered a partner in the database mirroring session. A database mirroring session consists of a relationship between the partner servers when they mirror a database from one partner to another. A given partner server may have the principal role for one database and a mirror role for a different database.
In addition to the two partner servers (principal and mirror) a database mirroring session may have an optional third server, called the witness. The witness server’s role is to enable automatic failover. When database mirroring is used for high availability, if a principal server suddenly fails, if the mirror server has confirmation from the witness, it can automatically take on the role of principal and make its database available within a few seconds. In such a mode, when using ADO.Net or the SQL native client, the application can failover transparently using a connection string that looks, for example, like:
Data Source=myServerAddress; Failover Partner=myMirrorServer; Initial Catalog=myDataBase; Integrated Security=True;
There are two types of transfer mechanism for a database mirroring session: Synchronous and Asynchronous.
Synchronous transfer: The principal server hardens its principal database log records to disk; it also sends them to the mirror. The principal then waits for a response from the mirror server. The mirror responds when it has hardened those same log records to the mirror’s log disk.
Asynchronous transfer: The principal server will not wait for an acknowledgment from the mirror that the mirror has hardened a block of transaction records. The mirror will attempt to keep up with the principal, by recording transactions as quickly as possible, but some transactions may be lost if the principal suddenly fails and you force the mirror into service.
Failover clustering
A failover cluster consists of a set of redundant servers (called nodes) that share an external disk system. Clustering relies on the clustering capability of the Windows operating system. Windows must hence be configured for clustering. The server editions that support clustering in Windows 2003 are Enterprise and Datacenter. This mechanism also relies on a quorum disk that is shared by both instances, and holds all database files. This is typically a shared disk array.
During a cluster failover, a virtual SQL Server instance moves from one node to another. As a result, a cluster failover appears to external applications as if the virtual SQL Server instance is briefly unavailable (usually for less than a minute), then available again. The instance seemingly just stops and restarts. Behind the scenes, when one SQL Server instance located on one physical server becomes unavailable, Windows closes the database data files that the instance had open on a commonly shared disk space, starts another SQL Server instance on another physical server, opens the same data files, and takes over the virtual server name and virtual IP address of the failed instance. Open connections are closed during the failover, all committed transactions are rolled forward during the redo phase of database recovery in secondary node, and the undo phase will rollback the uncommitted records.
There are two types of failover clustering in Windows: Active/Passive and Active/Active.
Active/Passive means that your cluster has an active node and a passive mode. If your active node failed, then its defined resources would shift to the passive node and it would become active. The passive node is not accessible unless an accident occurs and the resources shifted.
In Active/Active clustering, both nodes are accessible and active. If a node fails, then its resources would shift to the other active node. The node that survives would then carry the load for both nodes.
Transactional Replication
Of the various types of replication supported by SQL Server, transactional replication is used for high availability primarily because it can guarantee transactional consistency between the publisher and subscribers, and also because its performance for continuous replication is much better than that of other replication methods. The most common high availability scenario for transactional replication occurs when you copy data from one database – the publisher – to one or more subscribers through a distribution server. If the publisher fails, one of the subscribers can become a read/write server and accept data updates—and even become a publisher to the other subscribers.
Log shipping
Log shipping allows you to automatically send transaction log backups from one database (known as the primary database) to a secondary database on another server (known as the secondary server). At the secondary server, these transaction log backups are restored to the secondary database, keeping it closely synchronized with the primary database. An optional third server, known as the monitor server records the history and status of backup and restores operations and optionally raises alerts if these operations fail to occur as scheduled.
When / Why should you the below availability features:
If any of your answers to the below questions is “Yes”, then that availability feature is for you.
Database mirroring:
1) Do you need a quick (<3-5 sec) automatic failover?
2) Do you want to not have the hardware expenses of a cluster?
3) Are you okay with a small performance overhead due to mirroring in the high protection mode, or with a small potential data loss in the high performance mode?
4) Do you need to protect individual databases (as opposed to all databases on an instance?
5) Do you want an automatic application redirection to standby?
6) Do you want to have standby over WAN?
7) Do you want to access standby for read-only access to a snapshot of the data?
Failover clustering:
1) Do you have hardware like a large shared cluster disk array, a dual NIC, a high capacity network, high capacity CPU and RAM listed in MS Windows 2003 HCL?
2) Do you need databases failover without any performance overhead?
3) Are you okay with a downtime of about 1-2 minutes during failover?
4) Do you need all databases of an instance as highly available?
5) Do you need a scalable solution for database availability?
Transactional replication and Log shipping:
1) Do you tolerate database availability delays that result from a manual failover?
2) Are you okay with clients not being completely transparent to failover?
3) Do you need to spread the standby servers across your WAN?
4) Do you want to balance the load across servers by accessing a read-only copy of the primary database?
Checklist of availability features
|
Category |
Availability Feature |
Database Mirroring |
Failover Clustering |
Log Shipping |
Transactional Replication |
|
Failover Characteristics |
Standby type |
Hot |
Hot |
Warm |
Hot |
|
Failure detection |
Yes |
Yes |
No |
No |
|
|
Automatic failover |
Yes |
Yes |
No |
No |
|
|
Masks disk failure |
Yes |
No |
Yes |
Yes |
|
|
Masks SQL process failure |
Yes |
Yes |
Yes |
Yes |
|
|
Masks other process failure |
No |
Yes |
No |
No |
|
|
Metadata support |
User Database |
All databases (System and User Database) |
User database |
User database |
|
|
Transactional consistency |
Yes |
Yes |
Yes |
Yes |
|
|
Transactions are current |
Yes |
Yes |
No |
Yes |
|
|
Perceived downtime |
< 3 seconds |
<20 seconds and db recovery time |
Seconds plus recovery time |
Seconds plus role change time |
|
|
Transparent to client |
Yes |
Yes |
No |
No |
|
|
Automatic role change |
Yes |
Yes |
No |
No |
|
|
Configuration |
Special hardware needed |
No |
Yes |
No |
No |
|
Distance limit |
Unlimited |
Limited |
Dispersed |
Dispersed |
|
|
Complexity |
Some |
More |
Some |
Some |
|
|
Standby – accessible |
Yes |
No |
Yes |
Yes |
|
|
Impact on performance |
Some |
No impact |
Some |
Some (Depending kind of subscription) |
|
|
Impact on backup strategy |
No impact |
No impact |
Some |
No impact |
|
|
Version availability |
Enterprise of SQL 2005 – All features |
Enterprise and Standard Edition of SQL 2005 |
All versions except Express Edition |
All versions except Express Edition |
|
|
Additional server role |
Witness |
None |
Monitor |
Distributor |
|
|
Redundant storage locations |
Yes |
No |
Yes |
Yes |
|
|
Scope of availability |
Database |
Instance |
Database |
Database |
|
|
Management |
Load delay on secondary |
No |
No |
Yes |
No |
|
Load balancing |
No |
No |
No |
No |
|
|
Scalable |
No |
Yes |
Yes |
Yes |
|
|
Client redirect |
Limited support |
No |
No |
No |
|
|
Monitoring |
Third party tools required |
No |
Yes |
Yes |
No |
|
Customized scripts |
Yes |
Yes |
Yes |
No |
|
|
Expertise required |
Medium |
High |
Low |
Medium |
|
|
Troubleshooting |
Complexity |
Medium |
High |
Low |
Medium |
|
Expertise required |
Medium |
High |
Low |
Medium |
|
|
Maintenance and Support |
Complexity |
Medium |
High |
Low |
Medium |
|
Expertise required |
Medium |
High |
Low |
Medium |
|
|
Support |
Support from Microsoft |
New feature, so MS support is limited |
Well defined support from MS |
Well defined support from MS |
Difficult to troubleshoot even by MS (sometimes) |
|
Cost |
Solution Cost |
Low |
High |
Low |
Low |
|
Security |
Complexity |
Low |
High |
Low |
Low |
|
Expertise required |
Low |
High |
Low |
Low |
Why don’t you guys tell me how did you find this consolidation?
Cheers!