Best Practices for Continuous Application Availability


Gartner IT Security Summit 2005

Donna Scott

6–8 June 2005, Marriott Wardman Park Hotel, Washington, District of Columbia

These materials can be reproduced only with Gartner's written approval. Such approvals must be requested via email — [email protected].


High-Profile Downtime Is Down

High-profile downtime incidents in 2004 were much less frequent than in 1999-2001. Because they are less frequent, they become part of “doing business.” That does not mean there is no cost of downtime; it just means that people/customers are more accommodating when they are rarely vs. frequently impacted.

Date | Event | Cause
10/8/04 | 4-day slowdown/outage of PayPal | Code changes
9/14/04 | 5-hour FAA radio outage disrupted air travel; some planes flew dangerously close | Lack of maintenance
8/1/04 | 3-hour plane grounding, American & U.S. Airways | Unintentional user error

Since 2001, a greater number of organizations have started to systematically reduce risks in their IT environments, and therefore improve end-to-end availability. This included an emphasis on designing for availability and managing for availability through improving problem and change management processes. As their availability levels rose, there have been fewer high-profile downtime instances in the news. You certainly can still find them, as shown in the graphic, but they are less frequent. Because they are less frequent, they become part of "doing business" and have less overall impact than when they were occurring frequently. That does not mean that there is not a cost of downtime; it just means that people/customers are more accommodating when they are rarely impacted (vs. frequently impacted).

While all enterprises still do have downtime, the levels of uptime have risen. Still, there is increased desire to continue to improve, that is, to operate critical IT services and applications 24 hours per day, seven days per week (24x7), because of IT-business process interdependencies. The complexity of today’s IT infrastructures and applications makes managing these systems to high levels of availability difficult. Achieving continuous availability requires a multipronged strategy that addresses and mitigates risks of failures and planned maintenance/upgrade. Continuous availability must be designed in and requires substantial levels of cross-organizational people/process discipline and control. This presentation focuses on strategies to achieve continuous end-to-end IT service/application availability.

© 2005 Gartner, Inc. and/or its affiliates. All rights reserved. Reproduction of this publication in any form without prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information. Gartner shall have no liability for errors, omissions or inadequacies in the information contained herein or for interpretations thereof. The reader assumes sole responsibility for the selection of these materials to achieve its intended results. The opinions expressed herein are subject to change without notice.

Donna Scott C2, SEC11, 6/05, AE


Client Issues
1. How will enterprises define and measure continuous availability, and how much does it cost?
2. How should IT services be architected for continuous availability?
3. What IT process best practices and strategies will enterprises adopt to achieve continuous IT service availability?

Less Downtime Supported by Data Center Conference Polling Results

[Bar chart: percentage of respondents in each unplanned-downtime category for their most-critical applications. Series: Dec. 2000 Data Center Conf. poll results (n=N/A), Dec. 2003 Data Center Conf. poll results (n=151), Dec. 2004 Data Center Conf. poll results (n=165). Vertical axis: 0 to 60%. Categories: Average, Very Good, Outstanding, Best in Class, 100% Availability.]

Key
Average: <=98%; >= 175 downtime hours per year
Very Good: 99%; <= 87 downtime hours per year
Outstanding: 99.5%; <= 43 downtime hours per year
Best-in-Class: 99.9%; <= 9 downtime hours per year
100% Availability: zero unplanned downtime

The results from a poll conducted at Gartner's Data Center Conference in December 2004 as well as in December 2003 indicate that many enterprises have made outstanding progress in improving application availability since 2000. Business processes' increasing reliance on IT and the pervasive ideal of the real-time enterprise will continue to drive this trend toward around-the-clock availability. To achieve this, enterprises must have a multipronged strategy that addresses application architecture, technology infrastructure and IT process maturity. These conclusions are based on the findings of a poll conducted among CIOs, heads of IT operations and data center managers attending Gartner's Data Center Conference in December 2004 as well as in December 2003 (which attracted more than 1,400 attendees, who were given electronic polling devices with selected questions inserted in each conference presentation). Respondents answered two questions that focused on end-to-end availability levels for their enterprises' most-critical applications. Responses to questions on unplanned downtime were compared to results from a similar poll in 2000. Although Gartner recognizes that respondents don't necessarily represent a statistically significant distribution, we believe the results of these polls are of interest to users.


Client Issue: How will enterprises define and measure continuous availability, and how much does it cost?

Tactical Guideline: Most enterprises that are seeking to improve availability initially focus on reducing unplanned downtime.

Availability Defined in User, Not Component Terms

Continuous availability provides expanded user access to application services (24x7 or near 24x7) while also providing access to a high percentage (equal to or more than 99 percent) of scheduled time, despite unscheduled incidents.

[Diagram: Continuous Availability combines High Availability (minimizing unplanned downtime) and Continuous Operations (minimizing planned downtime). Other labels in the diagram: Fault Avoidance, Rapid Recovery, Availability Reporting/Metrics, Integrity, Application Performance Management.]

A highly available IT service provides user access to applications and data for a minimum of 99 percent of scheduled time, despite unscheduled incidents. It typically implies the ability to eliminate/avoid (via error detection, circumvention, correction and recovery) or minimize (via rapid restart) unscheduled outages. Implied in high availability is application and data integrity, as well as acceptable (as defined by the user) application performance. Ultimately, high availability must be measured from a user’s perspective. If a user can’t access an IT service (set of applications and their underlying infrastructure) during scheduled hours, the application is considered unavailable.

Most business-critical applications, including business-to-business (B2B), business-to-consumer (B2C) and enterprise applications, have some planned or scheduled downtime to perform maintenance. Scheduled downtime should be negotiated with users to avoid times of peak or seasonal business demand. Further, scheduled downtime should be clearly communicated to users to set expectations and avoid the dissatisfaction that occurs when they try to access a site or services that are unavailable.

A continuously operable site enables access during expanded hours, often near 24x7 or the full 24x7. Continuous availability is the combination of high availability and continuous operations, and it enables expanded hours of user access (near 24x7 or 24x7) a high percentage (99 percent or more) of the time.

Conclusion: 24x7 availability is “designed in,” not bought; is expensive; and requires a strategy and plan.




Strategic Planning Assumption: The share of large enterprises and outsourcers that measure end-to-end service availability will rise from 25 percent today to more than 50 percent by 2007 and 75 percent by 2009 (0.8 probability).

Best Practice #1: Measure to the Users’ View

[Diagram: the end-to-end components behind a user transaction: Desktop, Network, Internet, Servers, Storage, Database, Middleware and Production Objects, Application Code, Business Rules, Transactions.]

“Many enterprises find it difficult to measure end-to-end IT service availability, and use response time measures as a proxy for availability.”

IT Services/Application Availability Reporting Best Practices
• Measure on IT services/products, not individual components.
• General formula = 1 - (total downtime minutes/total available minutes).
• Weigh downtime by pain index, typically the number of users affected, weighted for the severity of the impact.
• Predetermine the conditions that constitute “downtime.” If an e-commerce site has 1 percent of functions down, does this “count”?
• Map downtime conditions to severity/priority levels in the service desk and event management systems.
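The general formula and the pain-index weighting can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names, the per-outage weights (fraction of users affected and a severity factor) and the sample outages are all assumptions made for the example.

```python
def availability(total_downtime_minutes: float, total_available_minutes: float) -> float:
    """General formula from the slide: 1 - (downtime minutes / available minutes)."""
    if total_available_minutes <= 0:
        raise ValueError("total_available_minutes must be positive")
    return 1 - total_downtime_minutes / total_available_minutes


def weighted_downtime_minutes(outages) -> float:
    """Sum outage minutes, each scaled by a hypothetical pain index.

    `outages` holds (minutes, fraction_of_users_affected, severity_weight) tuples.
    Both weights are illustrative assumptions, not a Gartner definition.
    """
    return sum(minutes * users * severity for minutes, users, severity in outages)


# A 30-day month of 24x7 scheduled service.
scheduled_minutes = 30 * 24 * 60  # 43,200 minutes

# One full 60-minute outage, plus a 120-minute partial outage that hit
# 10 percent of users at half severity (assumed sample data).
outages = [(60, 1.0, 1.0), (120, 0.10, 0.5)]

raw = availability(sum(minutes for minutes, _, _ in outages), scheduled_minutes)
weighted = availability(weighted_downtime_minutes(outages), scheduled_minutes)
print(f"raw: {raw:.4%}  weighted: {weighted:.4%}")
```

The unweighted figure counts all 180 outage minutes, while the weighted figure counts only 66 "pain" minutes, which is why agreeing on the downtime conditions and weights in advance matters.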

Key Issue: How will enterprises define and measure continuous availability, and how much does it cost?

As IT organizations move toward IT service management, providing various levels of service to customers at various levels of cost, they must also regularly measure and report their service performance. A critical aspect of service performance is end-to-end application service availability. The end-to-end service to be measured is typically defined jointly by the business and IT organization as part of the service-level management process. It includes a set of applications and underlying infrastructure that are critical to a business process. Examples may include all applications and infrastructure associated with e-commerce, call center or enterprise resource planning. Fulfilling the end-to-end service requirements may be done in-house or outsourced, or a combination of both.

Although most IT organizations and outsourcers measure the availability of IT components, most do not yet measure end-to-end service availability. However, the trend for measurement is consistent with the IT service management trend and, as a result, the share of large enterprises and outsourcers that measure end-to-end service availability will rise from 25 percent today to more than 50 percent by 2007 and 75 percent by 2009 (0.8 probability). Most enterprises don’t measure end-to-end service availability today because it’s difficult to do, crosses many business, IT, outsourcer, organizational and process boundaries, and requires significant manual effort to determine outage business impact and cost. Furthermore, many enterprises are just starting to organize IT for service management.

Action Item: You can’t improve what you don’t measure. Develop and implement a method for measuring, from the user’s perspective, the end-to-end availability of the set of application functions critical to their business process.



Strategic Planning Assumption: By 2007, more than 50 percent of enterprises will classify IT service availability and disaster recovery requirements during the early phases of the project life cycle, an increase from 20 percent today (0.7 probability).

Tactical Guideline: Breaking the service level down into manageable components is the best way to gain confidence that service levels are realistic and achievable.

Best Practice #2: Determine SLAs Early in Life Cycle; Classify and Consider SLA Chain

Class | SLA | Typical IT Services
1-RTE | 24x7, 99.9; RTO=2 hrs; RPO=0 | Customer/Partner Facing; Significant Revenue and/or Service Impact
2-Critical | 24x6-3/4, 99.5; RTO<8 hrs; RPO<4 hrs | Supply Chain; Medium Impact on Customer Service
3-Important | 18x7, 99.2; RTO=72 hrs; RPO=24 hrs | Back Office Applications

SLA Chain: Business Unit, IT organization, Internal and External Service Providers

[Diagram: SLA chain components: Outsourced Facilities/Environment, Application, Outsourced Wide-Area Networks, Servers, Systems Software, Database.]

Key Issue: How will enterprises define and measure continuous availability, and how much does it cost?

Business requirements for application service availability and disaster recovery should be defined during the business requirements phase. Ignoring requirements early often results in a solution that does not meet requirements and ultimately requires significant re-architecture to improve service. We recommend a classification scheme of supported service levels and associated costs. These drive tasks and spending in development/application architecture, system architecture and operations. Business managers then develop a business case for a particular classification of service.

When moving toward IT service management (ITSM), enterprises define IT services that are meaningful to their customers, and define formal service-level agreements (SLAs) that set out service quality commitments. Enterprises not yet ready for ITSM, but wanting to improve availability levels, should follow the same process and set informal SLAs from which to monitor their progress toward goals. Whether formal or not, these measures must “trickle down” into IT and supplier performance goals that correlate with the end-to-end SLA. It is vital for IT organizations to identify realistic targets internally and summarize them to ensure that the BU/IT organization service-level target is realistic. The target numbers are not “guessed at,” but are based on actual experience.

Action Item: When moving to IT service management, IT organizations should ideally measure their performance for six to 12 months prior to agreeing to formal BU/IT organization SLAs.
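One way to check whether component targets in an SLA chain add up to a realistic end-to-end target is to multiply them. The sketch below assumes the components are all required (in series) and fail independently, which is a simplification; the component names follow the SLA chain diagram, but the target values are hypothetical and not taken from the presentation.

```python
from math import prod


def chain_availability(component_availabilities) -> float:
    """End-to-end availability of components in series, assuming independent failures."""
    return prod(component_availabilities)


# Hypothetical component targets, expressed as fractions of scheduled time.
sla_chain = {
    "facilities": 0.9999,
    "wide_area_network": 0.999,
    "servers": 0.999,
    "systems_software": 0.9995,
    "database": 0.9995,
    "application": 0.999,
}

end_to_end = chain_availability(sla_chain.values())
meets_class_1 = end_to_end >= 0.999  # Class 1-RTE target of 99.9 percent

print(f"end-to-end: {end_to_end:.4%}, meets 99.9%: {meets_class_1}")
```

Under these assumed figures, six components that each look strong on their own combine to roughly 99.6 percent, short of a 99.9 percent end-to-end commitment. This is the kind of gap that measuring for several months before signing a formal SLA is meant to expose.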



Strategic Planning Assumption: Through 2007, no more than 10 percent of critical application services will achieve 99.9 percent (best-in-class) availability (0.8 probability).

Best Practice #3: Know How Your IT Services Availability Metrics Stack Up

Hours Down/Year/IT Service

Ranking | Unplanned | Planned
Average | More than 175 hours (less than 98%) | More than 250 hours
Very Good | Between 70 and 87 hours (99% - 99.2%) | Less than 200 hours
Outstanding | Less than 43 hours (99.5%) | Less than 50 hours
Best in Class | Less than 9 hours (99.9%) | Less than 12 hours

Key Issue: What IT process best practices and strategies will enterprises adopt to achieve continuous IT service availability?

Business executives want the IT organization to deliver services like a utility: turn it on and it is always there; turn it up or down as needed. IT services, however, are not standardized or commoditized; rather, service levels reflect the interworking of components, services and management domains. The end result reflects the infrastructure’s ability to deliver service, and also depends on the design of the application and the other services being delivered by shared resources. Wishful thinking by those delivering or consuming the service will not guarantee availability levels.

Here we provide an end-to-end IT service availability ranking based on annual service downtime, categorized independently for planned vs. unplanned downtime. It shows average, very good, outstanding and best-in-class levels of availability. Most enterprises would not bother measuring availability for services requiring average availability levels, but they would reserve their efforts for those mission-critical services requiring higher levels of availability.

Action Item: Benchmarking can be a useful exercise, but ultimately, you should measure against your business requirements, and not against some other enterprise’s requirements or achievements.
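The percentages and annual hours in the ranking are two views of the same quantity. As a rough sketch, assuming a 24x7 service measured over a 365-day year (the function names are illustrative), the conversion in both directions looks like this:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def downtime_hours_per_year(availability_pct: float) -> float:
    """Annual downtime implied by an availability percentage for a 24x7 service."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def availability_pct(downtime_hours: float) -> float:
    """Availability percentage implied by a number of downtime hours per year."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


for pct in (98.0, 99.0, 99.5, 99.9):
    print(f"{pct}% -> {downtime_hours_per_year(pct):.1f} hours/year")
```

The results (about 175, 88, 44 and 9 hours) line up with the rounded thresholds used in the ranking. A service with fewer scheduled hours than 24x7 would use its own scheduled total in place of 8,760.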




Tactical Planning Guideline: A continuously available IT service will cost at least 3.5 times a standard, non-highly available service.

Best Practice #4: Know Your Costs of Delivering Availability; Use in Project Justification

From a design/development perspective, it costs less to do it right the first time than to retrofit it later. Standards are necessary to make the process repeatable.

[Stacked bar chart: relative cost, on a scale of 1X to 4X, of a standard application, a high-availability (HA) application and a continuous-availability (CA) application. Cost layers labeled in the chart: cost of standard application; redundancy; HA design/development/testing; HA support; operations/management; additional technology costs for CA; CA design, testing, scheduling, operations and impact analysis. Plus one-time project costs for retrofit/redesign and for development, service and operations architecture and standards, including change management.]

CA: Continuous availability
HA: High availability

Key Issue: How will enterprises define and measure continuous availability, and how much does it cost?

Building highly available applications is expensive, costing about 2.5 times as much as a standard (not highly available) application. The costs are not just capital costs, such as in redundancy, but also come in the form of greater diligence in IT processes (such as in performance monitoring). Continuous availability is even more expensive, at least 3.5 times the cost of a standard application. Most enterprises define satisfactory levels of availability for mission-critical, externally accessed Web applications at 99.5 percent (about 43 hours per year of unplanned downtime), and two to eight hours per month of planned downtime.

A key differentiator for those enterprises justifying high availability and near-continuous operations is an investment in architectures and standards for application design/development/testing and operations. Once development standards are implemented, it costs little more to design and develop a highly available application than a standard application. Disregarding availability requirements during design frequently causes costly “re-architecting” at a later date to meet growing availability requirements.

Action Item: Considering availability requirements during design will save money throughout the life of the IT service, by avoiding the costs of retrofitting and re-architecting at a later date.




Client Issue: How should IT services be architected for continuous availability?

Strategic Imperative: Achieving high levels of application service availability requires infusing it into corporate culture. A critical success factor is defining repeatable processes through the creation of infrastructure, software and operational architectures.

Best Practice #5: Invest in Service-Level Management and Architecture Standards

Class 1 “Gold”: 99.5%–99.9%; downtime per year: eight to 43 hours; price: “3*X”
Infrastructure Reqs.
• Parallel cluster
• “Hot plug” hardware
• Use of GA products
• Spare parts on-site
Software Reqs.
• Auto-recovery
• No transaction re-entry
• Replicated database
• Test env. = prod. env.
Operations Reqs.
• Real-time alarming with business impact
• Capacity planning
• Outage analysis, prevention

Class 2 “Silver”: 99.0%–99.5%; downtime per year: 43 to 87 hours; price: “2*X”
Infrastructure Reqs.
• Redundant arch.
• Auto-failover
• Consistent config.
• Vendor MTTR SLA
Software Reqs.
• App. design failover
• Auto-diagnostics
• Scalable
• Secure
Operations Reqs.
• Proactive tuning
• Proactive availability & performance mgmt.
• Proactive problem mgmt./root-cause analysis

Class 3 “Standard”: 98%–99%; downtime per year: 87 to 175 hours; price: “X”
Infrastructure Reqs.
• Stand-alone servers
• Auto-restart
• Tested backup
• No SPOF - user NW
Software Reqs.
• Application start/stop
• Tested recovery plan
• User account testing
• Change management
Operations Reqs.
• Change management
• Event monitoring
• Backup practices
• Well-trained staff
• Tested recovery plan

Achieving high levels of availability requires enterprises to understand the impact of architecture on availability. We recommend that availability SLAs be tied to architectural standards for infrastructure, software and operations. These policies ensure that a repeatable process is used to deliver the level of availability specified. They build SLA planning and execution into the infrastructure design, application design and operational design.

In the slide example, three levels of SLAs are offered for acceptable levels of unplanned downtime: standard (98 percent to 99 percent availability), silver (99 percent to 99.5 percent) and gold (99.5 percent to 99.9 percent). Specifying architectural requirements provides justification for the increase in cost for higher levels of service. It also provides a basis for benchmarking costs of service with external service providers. Furthermore, based on the architectural requirements for achieving higher quality of service, an IT organization can provide standard multipliers required to achieve the higher levels of service. For example, if the standard SLA is “X,” the silver level may be “2*X,” while the gold level may be “3*X.” This will help the business process/application owner and relationship manager during the negotiation process (where sometimes requirements change based on available budget).
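During negotiation, the class table works like a simple lookup from a required availability level to a price multiplier. The following sketch encodes the slide's example ranges and multipliers; the class and function names, the handling of range boundaries and the sample price are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceClass:
    name: str
    max_availability_pct: float  # upper end of the class range on the slide
    price_multiplier: int        # multiple of the standard price "X"


# Ordered from cheapest to most expensive, using the slide's example values.
SERVICE_CLASSES = (
    ServiceClass("Standard", 99.0, 1),
    ServiceClass("Silver", 99.5, 2),
    ServiceClass("Gold", 99.9, 3),
)


def cheapest_class(required_availability_pct: float) -> ServiceClass:
    """Return the lowest-priced class whose range reaches the requirement.

    Assumption: a requirement exactly on a boundary maps to the cheaper class.
    """
    for service_class in SERVICE_CLASSES:
        if required_availability_pct <= service_class.max_availability_pct:
            return service_class
    raise ValueError("requirement exceeds the highest class offered")


def quote(required_availability_pct: float, standard_price: float) -> float:
    """Price the requirement as multiplier * X, where X is the standard price."""
    return cheapest_class(required_availability_pct).price_multiplier * standard_price


print(cheapest_class(99.2).name, quote(99.2, 100_000))
```

With a hypothetical standard price of 100,000, a 99.2 percent requirement lands in the silver class at 200,000. If the budget does not stretch that far, the same table shows what the business owner gives up by settling for standard.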




Strategic Planning Assumptions: Through 2008, less than 20 percent of large enterprises will systematically design application architectures to achieve near-continuous availability (0.8 probability). Through 2008, 80 percent of large enterprises will continue to rely on the infrastructure to achieve near-continuous availability, rather than systematically build it into the application architecture (0.8 probability).

Best Practice #6: Invest in Holistic, Resilient Application and Infrastructure Architectures

[Diagram: IT Service Design Principles for CA, grouped into strategies to reduce application downtime and strategies to reduce application downtime impact.]
• Use Stateless but Persistent Architectures
• Embed Management Instrumentation
• Use Asynchronous Application Integration/Messaging
• Architect for Redundancy, Failover and Horizontal Scaling
• Architect for Active/Active; User Transparency through Failures
• Design for Automated, Rolling Updates, Change, Scale
• Partition Databases
• Design Apps to Communicate with Users
• Design for Functional Degradation

Client Issue: How should IT services be architected for continuous availability?

Designing for continuous availability reduces downtime or minimizes user impact for any outage that occurs. Although unexpected component outages cannot be completely eliminated, masking outages from users creates the perception of uninterrupted availability. This approach is the underpinning of any availability strategy.

Although enterprises would like to rely on the underlying technology infrastructure exclusively for the quality of service of their applications, application availability also depends on application design. For example, component architectures make it easier to upgrade an application “in flight.” Stateless application components allow for a more flexible, powerful use of application server load balancing, providing for greater levels of availability as well as user access persistency, despite component outages. Moreover, decoupled connections between components and applications are more tolerant of failures than synchronous connections. Effective applications management starts at development time with instrumentation. Clearly identified "interfaces" between components or services, transactions and business processes are primary candidates for this activity.

Action Item: Base infrastructure resiliency on levels of application resiliency; high levels of application resiliency will modify requirements for infrastructure resiliency. Consideration of application availability must be present from the start of the project and not delayed until application deployment because architectural changes cannot easily be reversed.
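Two of the design principles, functional degradation and communicating with users, can be illustrated with a small wrapper that falls back to a reduced answer and tells the user why. This is only a sketch: the inventory functions, the response shape and the notice text are hypothetical and do not describe any particular product.

```python
def with_degradation(primary, fallback, notice):
    """Wrap `primary` so that a connection failure returns `fallback` output plus a user notice."""
    def call(*args, **kwargs):
        try:
            return {"data": primary(*args, **kwargs), "degraded": False, "notice": None}
        except (ConnectionError, TimeoutError):
            # Mask the component outage: serve reduced function and tell the user.
            return {"data": fallback(*args, **kwargs), "degraded": True, "notice": notice}
    return call


def live_inventory(sku):
    """Hypothetical call to a back-end component that is currently down."""
    raise ConnectionError("inventory service unreachable")


def cached_inventory(sku):
    """Hypothetical reduced-function answer that needs no back-end call."""
    return {"sku": sku, "in_stock": None}


get_inventory = with_degradation(
    live_inventory,
    cached_inventory,
    notice="Stock levels are temporarily unavailable; ordering still works.",
)
result = get_inventory("A-100")
print(result["degraded"], result["notice"])
```

The user still gets a response and an explanation instead of an error page, so the outage of one component does not count as an outage of the whole service from the user's view.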



Strategic Imperative: For short (measured in seconds) and transparent recovery, recovery must be designed into the application architecture, rather than added as infrastructure features as an afterthought.

Best Practice #7: Invest in Multisite Architectures for Built-In Disaster Recovery

[Diagram: a primary site and a secondary site, each with a Geographic Load Balancer, Site Load Balancer, Web Server Clusters, Application Server Clusters, Database Server Clusters and Disk. Replication options between the sites, by layer:]
• Application Replication
• DB/Host Replication: IBM, Microsoft, NSI, Oracle, Quest, Veritas
• Remote Copy: EMC, Hitachi, HP, IBM
• PIT Image, Tape B/U

Client Issue: How should IT services be architected for continuous availability?

For application services with continuous availability requirements, including short recovery time objective (RTO) and recovery point objective (RPO), multisite architectures are used. Often, a new real-time enterprise (RTE) application service starts with a single-site architecture and migrates to multiple sites as its risks grow. Multiple sites complicate application architecture design (for example, load balancing, database partitioning, database replication and site synchronization must be designed into the architecture).

For nontransaction processing applications, multiple sites run concurrently, connecting users to the closest or least-used site. To reduce complexity, most transaction processing (TP) applications replicate databases (or disks) to an alternate site, but the alternate databases are idle unless a disaster occurs. Then, a switch to the alternate site can typically be accomplished in 15 to 30 minutes. Some enterprises prefer to partition databases and split the TP load between sites, and then consolidate data later for decision support and reporting. This reduces the impact of a site outage, affecting only a portion of the user base. A small number of organizations prefer more-complex architectures with bidirectional replication between sites to maintain a single database image and the highest levels of availability.

Action Item: The shorter the requirement for failover, the closer the replication must be to the application level in the architecture.
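The routing decision made by a geographic load balancer for concurrently running sites can be sketched as "send the user to the least-used site that is still healthy." The example below is a simplified illustration under assumed names and data; real load balancers also weigh proximity, capacity and health-check history.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    healthy: bool
    active_sessions: int


def route(sites) -> Site:
    """Pick the least-used healthy site; raise if no site is healthy."""
    candidates = [site for site in sites if site.healthy]
    if not candidates:
        raise RuntimeError("no healthy site available")
    return min(candidates, key=lambda site: site.active_sessions)


# Hypothetical two-site deployment.
primary = Site("primary", healthy=True, active_sessions=120)
secondary = Site("secondary", healthy=True, active_sessions=80)

first_choice = route([primary, secondary]).name   # least-used site wins

secondary.healthy = False                         # simulate a site outage
after_outage = route([primary, secondary]).name   # traffic shifts to the surviving site

print(first_choice, after_outage)
```

Because both sites already carry traffic, losing one removes capacity but does not require a 15-to-30-minute switch, which is the trade-off against the simpler idle-alternate design.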



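Best Practice Case Study: UPS Architects for 100 Percent Availability

[Diagram: Customers and DIAD devices connect to the New Jersey Data Center and the Georgia Data Center, with data replication between the two centers; UPS Operating Centers also feed data in. Data flows are labeled Critical Information and Important Information.]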

Client Issue: How should IT services be architected for continuous availability?

UPS architected its package delivery status data so that customers could access it over the Internet 24x7. An architectural overview follows and refers to the graphic.

• At the time of package delivery, UPS drivers record package data electronically into a PDA-like device called a DIAD. After each package delivery, the driver transmits the data over a cellular network, in real time, to a mainframe in New Jersey (NJ). If NJ is inaccessible, the DIAD transmits to the Georgia (GA) center.
• As soon as the data is received by a front-end collection application in the NJ or GA data centers, it's written to an IMS database and to WebSphere MQ; the latter replicates the transaction to the other data center.
• Both locations run application tasks to take the IMS data and apply the updates/inserts to the IBM DB2 database. This process enables up-to-date package data status from the database.
• At the end of each day, UPS drivers physically go to one of the 1,800 regional package processing centers, and all data collected in their DIADs is transferred to distributed Intel/Windows systems.
• All data is transferred daily from the regional facilities via FTP to the NJ data center and processed into the IBM DB2 databases. For added protection, all data is stored in the regional centers for up to three days.

Action Item: Companies that want uninterrupted, 24x7 access to business applications and data must consider certain requirements in the early phases of IT enterprise architecture design.
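The real-time path in the UPS example (transmit to the primary data center, fall back to the secondary, then replicate to the other site) can be sketched as follows. This is only an illustration of the pattern, not UPS's implementation: the `DataCenter` class, its `reachable` flag and the in-memory `records` list standing in for the database and the message-queue replication are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataCenter:
    """Hypothetical stand-in for one data center (for example NJ or GA)."""
    name: str
    reachable: bool = True
    records: list = field(default_factory=list)  # stands in for the local database

    def store(self, record: dict) -> None:
        self.records.append(record)


def transmit(record: dict, primary: DataCenter, secondary: DataCenter) -> str:
    """Send a record to the primary site, or to the secondary if the primary is down.

    The receiving site stores the record and replicates it to the other site
    when that site is reachable. Returns the name of the receiving site.
    """
    if primary.reachable:
        receiver, peer = primary, secondary
    elif secondary.reachable:
        receiver, peer = secondary, primary
    else:
        raise ConnectionError("no data center reachable; keep the record on the device")

    receiver.store(record)
    if peer.reachable:
        peer.store(record)  # replication; a real system would queue this if the peer is down
    return receiver.name
```

In this sketch the handheld device never needs to know which site ends up holding the data; it only needs an ordered list of sites to try.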



Client Issue: What IT process best practices and strategies will enterprises adopt to achieve continuous IT service availability?

Strategic Planning Assumption: Through 2005, fewer than 25 percent, and through 2007, fewer than 35 percent, of large enterprises will achieve IT service management maturity (0.8 probability).

Best Practice #8: Know Your IT Management Process Maturity Level

Level  Maturity   Processes
4      Value      IT/Business Metric Linkage
3      Service    Capacity Planning, Service-Level Management
2      Proactive  Performance, Change, Problem Management; Configuration, Availability Mgmt.; Automation, Job Scheduling
1      Reactive   Event Up/Down, Console, Trouble Ticket, Backup, Topology, Inventory
0      Chaotic    Multiple Help Desks, Nonexistent IT Operations, User Call Notification

Service Management Benefits
• Service quality
• Customer satisfaction
• Understanding of costs; benchmarking
• Labor costs
• Risk

IT organizations evolve over time toward IT management process maturity. In the more-traditional mainframe data centers, it took 10 to 20 years to achieve Level 3 service management maturity. By comparison, distributed heterogeneous computing is relatively new, and few IT organizations have reached Level 3 maturity.

Each maturity level provides the foundation for the higher levels. People, processes and tools must be in place at one level before the enterprise will be able to proceed to the next level. Even as IT organizations move up, there will always be additional work, continuous engineering and improvements going on at each level.

Business processes that are heavily dependent on IT require a minimum of Level 3 IT management process maturity to have the necessary rigor and predictability in the delivery of IT services. Therefore, setting business/IT SLAs requires that all objects within the IT service (for example, systems, storage, networks, applications, database, OS software, facilities and environmental factors) also be managed at Level 3 maturity. Otherwise, the weak link (the objects managed at lower levels) will cause the entire service to fail to meet the business SLAs.

Action Item: Understand where you are today in the IT management process maturity model, set a goal of what level you need to reach to best support the business, then use the model to help guide investments in people, processes and technology to achieve higher levels of maturity.
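The weak-link argument can be made concrete with a short sketch: a service is treated as only as mature as its least mature component. The component names and levels below are invented for illustration, and the function name `service_maturity` is an assumption, not part of the maturity model itself.

```python
MATURITY_NAMES = {0: "Chaotic", 1: "Reactive", 2: "Proactive", 3: "Service", 4: "Value"}


def service_maturity(component_levels: dict) -> tuple:
    """Return (level, weakest components) for a service.

    The service is treated as only as mature as its least mature component.
    """
    if not component_levels:
        raise ValueError("at least one component is required")
    lowest = min(component_levels.values())
    weakest = sorted(name for name, lvl in component_levels.items() if lvl == lowest)
    return lowest, weakest


# Hypothetical assessment of one IT service.
levels = {"systems": 3, "storage": 3, "network": 2, "applications": 3, "facilities": 3}
level, weakest = service_maturity(levels)
print(f"Effective maturity: Level {level} ({MATURITY_NAMES[level]}); weakest: {weakest}")
```

Listing the weakest components alongside the level shows where the next investment in people, processes or tools would raise the effective maturity of the whole service.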




Strategic Imperative: Through 2007, excessive downtime will cause most IT organizations to increase their attention on IT process re-engineering.

Best Practice #9: Know Why Your IT Service Is Down

Unplanned Downtime
• 40% Application Failure
• 40% Operations Errors
• 20% Environmental Factors, HW, OS, Power, Disasters

Planned Downtime
• 65% Application and Database
• 13% Hardware, Systems Software
• 10% Backup and Recovery
• 10% Batch Application Processing
• 2% Physical Plant/Environmentals

Client Issue: What IT process best practices and strategies will enterprises adopt to achieve continuous IT service availability?

Based on extensive feedback from clients, we estimate that, on average, 40 percent of unplanned mission-critical application downtime is caused by application failures (including “bugs,” performance issues or changes to applications that cause problems); 40 percent by operator errors (including incorrectly performing, or not performing, an operations task); and about 20 percent by hardware (for example, server and network), operating systems, environmental factors (for example, heating, cooling and power failures), and natural or manmade disasters.

To address the 80 percent of unplanned downtime caused by “people failures” (vs. technology failures or disasters), enterprises should invest in improving their change and problem management processes (to reduce the downtime caused by application failures); automation tools, such as job scheduling and event management (to reduce the downtime caused by operator errors); and improving availability through application architecture. The balance should be addressed by eliminating single points of failure through redundancy or reducing time-to-repair through technology support/maintenance agreements.

To reduce planned downtime, investing in change management and application/DBMS architecture/design/development processes will have the greatest effect and the highest return on investment.

Action Item: Enterprises will reach an availability “wall” over which they can’t climb, unless they invest in re-engineering IT processes. These processes include availability, change, configuration, problem and performance management, as well as application architecture and capacity planning.
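As a rough worked example of the 40/40/20 split, the sketch below allocates a total unplanned-downtime figure across the three cause groups. The shares are the estimates quoted above; the 50-hour total and the function name `allocate_downtime` are made up for illustration.

```python
UNPLANNED_SHARES = {
    "application failures": 0.40,
    "operations errors": 0.40,
    "hardware, OS, environmental factors, disasters": 0.20,
}


def allocate_downtime(total_hours: float, shares: dict) -> dict:
    """Split a total downtime figure across causes in proportion to their shares."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return {cause: total_hours * share for cause, share in shares.items()}


breakdown = allocate_downtime(50.0, UNPLANNED_SHARES)  # 50 hours is a made-up figure
people_related = breakdown["application failures"] + breakdown["operations errors"]
```

With these shares, four out of every five unplanned downtime hours fall into the two “people failure” groups, which is why process and automation investments are listed ahead of hardware redundancy.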




Strategic Planning Assumption: Through 2007, fewer than 10 percent of large enterprises will assess business and IT performance on availability metrics (0.8 probability).

Best Practice #10: Collect the Right Metrics for Downtime Analysis and Future Prevention

Drive the right behavior with employee and department performance metrics

IT services metrics/trending:
• Frequency of unplanned outages
• Mean-time-to-resolution/repair
• Service downtime and impact
• Response time
• Root-cause analysis/postmortem

Root Cause Coding (Multilevel): Hardware; People (Human Error); Application Software; System Software; Facilities/Environment; Service Provider; Process; Technology; External Events; Capacity

Further classification: Preventable; Unpreventable

Client Issue: What IT process best practices and strategies will enterprises adopt to achieve continuous IT service availability?

Measuring and trending end-to-end availability will help the IT organization understand how it is doing relative to defined service-level goals. However, improving availability requires a more-granular understanding of availability and the root cause of outages, so that similar outages may be prevented. Root-cause analysis consists of postmortem outage analysis by a cross-functional IT team, to identify the reason the outage occurred. This may be, for example, a component failure, an application failure, changes that resulted in unanticipated problems or a configuration inconsistency. Further classification to determine whether the outage was preventable under existing processes will aid in correcting human error and process failures. However, some outages may be preventable only with additional investment, which the enterprise and IT organization must justify, or else accept the outage risk.

IT and business performance metrics can be counter to the goals of availability. For example, if developers are assessed on code timeliness and not quality, then this will lead to “buggy” code, causing downtime. Few enterprises define availability as a goal across business and IT. However, those that have done so find significant benefits in availability, planning and less “fire-fighting.”

Action Item: Consider which metrics drive the desired behavior to achieve high levels of availability.
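A minimal sketch of how such metrics might be derived from an outage log follows. The `Outage` record layout and the `summarize` function are assumptions for illustration; the root-cause values are meant to come from whatever coding scheme the enterprise adopts.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Outage:
    start: datetime
    end: datetime
    root_cause: str      # e.g. "Application Software", "People (Human Error)"
    preventable: bool    # result of the postmortem classification
    planned: bool = False


def summarize(outages: list) -> dict:
    """Compute outage frequency, mean time to repair and root-cause counts."""
    unplanned = [o for o in outages if not o.planned]
    minutes = [(o.end - o.start).total_seconds() / 60 for o in unplanned]
    return {
        "unplanned_outages": len(unplanned),
        "total_downtime_min": sum(minutes),
        "mttr_min": sum(minutes) / len(minutes) if minutes else 0.0,
        "by_root_cause": Counter(o.root_cause for o in unplanned),
        "preventable": sum(1 for o in unplanned if o.preventable),
    }
```

Trending the output of a function like this per month or per service gives the frequency, repair-time and root-cause views listed on the slide from a single data source.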




Strategic Planning Assumption: Through 2008, fewer than 5 percent of successful Internet attacks will exploit a "day-zero" vulnerability (0.8 probability).

Best Practice #11: Invest in Security; We Have Met the Enemy, and They Are Us

Vulnerabilities Exploited (chart categories): Old Patch, Recent Patch, New Vulnerability, Misconfiguration

The percentage of vulnerabilities that are attacked within one month of the patch release will double from 15 percent in 2003 to 30 percent by 2006 (0.7 probability).

Client Issue: What IT process best practices and strategies will enterprises adopt to achieve continuous IT service availability?

Most vulnerabilities are known about long before an attack happens. However, it is tremendously expensive and complex for companies to patch software on all desktops and servers. Patches often require patches themselves, and many patches (at least 30 percent) “break” at least one corporate application. To many IT operations organizations, patching is seen as riskier than getting attacked.

The security industry likes to hype the possibility of “day-zero” attacks, that is, attacks that exploit vulnerabilities that no one knew about prior to the attack. However, day-zero attacks will represent less than 5 percent of successful attacks through 2008, as attackers continue to focus on reverse engineering patches to develop exploit code. Software developers should focus more on reducing configuration errors (making default “out of the box” configurations more secure) than on reducing “attack surfaces” (external interfaces that provide traction for attacks).

Action Item: The most-important step to increase Internet security is to stop using software that has frequent security vulnerabilities. Barring that, greatly increase the resources that you apply to security to assure that all software is safely configured and patched.




Strategic Planning Assumption: Through 2008, investments in change management processes will have the highest impact on IT service levels (0.8 probability).

Best Practice #12: Invest in IT Change Management

IT Change Management Benefits:
• Availability
• IT/business alignment
• Customer satisfaction

IT Change Management Key Components:
• Operational Change Management
• Project Change Management

IT Operational Change Management:
• Request
• Change Implementation
• Change Monitoring

Client Issue: What IT process best practices and strategies will enterprises adopt to achieve continuous IT service availability?

IT change management is a process that enables an enterprise to modify any part of its IT and communications environment, and supports the acceptance, approval and implementation of the modifications. The goal is to enable controlled changes while preserving the integrity and service quality of the production environment. Business processes rely on IT and expect it to be available and provide high service quality. Enterprises can’t achieve this quality without effective IT change management processes. This includes project change management (managing the design, development, testing and implementation of change) and operational change management (managing the approval, scheduling and coordination of change). Improving change management processes is one of the best investments enterprises can make, as availability can increase by 25 percent to 35 percent.

Change management is difficult because it requires changes in human behavior. No longer can changes be made by an individual in isolation; they become public for the betterment of the enterprise as a whole to reduce business and technical risk. Reshaping users’ behavior requires a significant amount of education to raise awareness. Support by senior management also is needed to reinforce the consequences of breaching the process.

Action Item: Clients must document and instrument their change management process across development, operations and the lines of business.
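One way to "instrument" an operational change process is to model each change as a record that can only move through approved steps. The sketch below is a hypothetical illustration: the state names and the transition table are assumptions loosely based on the request, approval, scheduling, implementation and monitoring steps named above, not a prescribed workflow.

```python
# Hypothetical lifecycle states; the transition table is an assumption for illustration.
ALLOWED = {
    "requested": {"approved", "rejected"},
    "approved": {"scheduled"},
    "scheduled": {"implemented"},
    "implemented": {"monitored", "rolled_back"},
    "monitored": {"closed", "rolled_back"},
    "rolled_back": {"closed"},
    "rejected": set(),
    "closed": set(),
}


class ChangeRequest:
    """A change that records every state it passes through."""

    def __init__(self, summary: str):
        self.summary = summary
        self.state = "requested"
        self.history = ["requested"]

    def advance(self, new_state: str) -> None:
        """Move to new_state, refusing any step that skips the process."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
        self.history.append(new_state)
```

Because a request cannot jump from "requested" straight to "implemented", an unapproved change shows up as an error rather than as a surprise in production, and the `history` list gives an audit trail for each change.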




Strategic Imperative: We estimate that unplanned downtime can be reduced by 20 percent to 40 percent through testing.

Best Practice #13: Invest in Testing
• Application Development/Quality Assurance
• Infrastructure Changes
• Security Patches
• Operations/Production Control
• Failover/Fail-back
• Business Continuity and Disaster Recovery
• Configuration Audits

Client Issue: What IT process best practices and strategies will enterprises adopt to achieve continuous IT service availability?

Availability should be designed in, but testing should provide additional confidence. Testing should be done for all changes in the environment prior to deployment, including application, application integration, infrastructure, software, security patches and operations/production control changes. In addition, all changes should have a roll-back plan, should unforeseen problems occur as a result of the change.

Moreover, it is critical to test automated failover architectures, to ensure they will work when needed. Failover clustering frequently fails due to out-of-sync conditions, and testing helps ensure that the configurations are consistent and provides confidence that the architecture will work as planned. Configuration audits (using software that compares a “gold” configuration to actual) can help with this process and enable more consistent configurations, which will improve quality of service.

Finally, testing of business continuity plans is vital to ensuring recovery in the event of a disaster scenario, including testing of technology and people processes (for example, crisis management, public relations and interface with the press, damage assessment, invocation of plans, execution of plan and more).

Action Item: Invest in comprehensive testing of all changes to the production environment. Audit configurations for compliance and enforcement.
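The configuration audit idea (comparing a “gold” configuration to the actual one) can be sketched in a few lines. The flat key/value layout, the setting names in the usage example and the function names `audit_config` and `in_sync` are assumptions for illustration; commercial audit tools work on far richer data.

```python
def audit_config(gold: dict, actual: dict) -> dict:
    """Compare an actual configuration against the gold configuration.

    Returns settings that are missing, unexpected, or that differ in value.
    """
    missing = {k: gold[k] for k in gold.keys() - actual.keys()}
    unexpected = {k: actual[k] for k in actual.keys() - gold.keys()}
    changed = {
        k: {"gold": gold[k], "actual": actual[k]}
        for k in gold.keys() & actual.keys()
        if gold[k] != actual[k]
    }
    return {"missing": missing, "unexpected": unexpected, "changed": changed}


def in_sync(gold: dict, actual: dict) -> bool:
    """True when the node matches the gold configuration exactly."""
    return not any(audit_config(gold, actual).values())
```

Running `in_sync` against every node of a failover cluster before a switchover test is one simple way to catch the out-of-sync conditions mentioned above.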




Strategic Imperative: Reduce mean-time-to-repair by providing the operations command center with meaningful tools, skills and procedures.

Best Practice #14 Case Study: Invest in Reduced MTTR through Effective 24x7 L1 Support

EDS Case Study: Benefits: Improved Incident Response Time and Productivity
• L1 OC alert-to-incident time: 7.9 minutes (vs. 14.7 minutes)
• Reduced alert volume by 72 percent and number of tickets by 25 percent
• Enterprise Clients supported by OC FTE rose from 1.2 to 4.2
• Improved operator job satisfaction

Outsourcing Client: Pre-Assessment and Investment
• Monitoring rules
• Restoration procedures
• Pre-established SLAs and inheritance dependency tree

Operations Center (L1)
• Integrated control center (i.e., CA, InfoVista, Opsware SMARTS); documented procedures enable OC to focus on fast service restoration
• Integrated, automated ticketing and escalation (with info attached)
• SLA rules dynamically determine severity and process rules for L2/L3

Client Issue: What IT process best practices and strategies will enterprises adopt to achieve continuous IT service availability?

To provide more value to its outsourcing clients at a lower cost, EDS has embarked on a broad strategy for RTI, which it calls the Agile Enterprise. As a foundation, EDS has implemented the Agile Control Center, which focuses on incident/problem prevention and fast restoration. While most companies have weak support processes where operators work in a reactive manner (with too many alerts and inadequate procedures and tools to restore service), EDS implements smarter and more-integrated processes, giving its operators the tools and procedures to do their jobs, thus increasing L1 incident resolution and shortening overall time-to-repair.

To achieve these benefits, EDS invests skilled resources in creating monitors, alert correlation/suppression rules, restoration procedures, and SLAs associated with IT service topology/resource dependencies. L1 operators view all alerts from a single pane of glass and have specific procedures to follow when alerts occur. Drill-down tools are integrated with the alerts (with a "right click"), so, for example, the operator can determine whether changes were made to a server that is experiencing slow response time. All actions taken are automatically attached to the incident ticket. Further, ticket severity level is dynamically assigned based on SLAs, thus ensuring L2/L3 are working on the right priorities.

Action Item: Invest in building monitors, correlation and restoration procedures to enable L1 support to achieve lower mean-time-to-repair for IT service outages.
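Two of the mechanisms described in the case study, alert suppression and SLA-driven ticket severity, can be illustrated with a small sketch. Nothing here reflects EDS's actual tooling: the `Alert` record, the SLA tier names, the severity mapping and the dependency dictionary are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mapping from SLA tier to ticket severity (1 = most urgent).
SEVERITY_BY_SLA = {"gold": 1, "silver": 2, "bronze": 3}


@dataclass(frozen=True)
class Alert:
    resource: str   # component that raised the alert
    message: str


def correlate(alerts: list) -> list:
    """Suppress duplicates: keep the first alert seen for each (resource, message)."""
    seen, kept = set(), []
    for alert in alerts:
        if alert not in seen:
            seen.add(alert)
            kept.append(alert)
    return kept


def ticket_severity(alert: Alert, depends_on: dict, service_sla: dict) -> int:
    """Severity follows the strictest SLA among services that depend on the resource."""
    tiers = [service_sla[s] for s, resources in depends_on.items() if alert.resource in resources]
    if not tiers:
        return max(SEVERITY_BY_SLA.values())  # no known service impact: lowest priority
    return min(SEVERITY_BY_SLA[t] for t in tiers)
```

The dependency dictionary plays the role of the service topology: an alert on a shared database inherits the severity of the most critical service that relies on it, so L2/L3 staff see the highest-impact tickets first.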




Recommendations
• Develop an availability strategy, plan and architecture (reducing planned and unplanned downtime) that crosses business units, customer relationship management, applications, architecture, IT infrastructure and operations.
• Design application architectures for continuous availability, reducing complexity where possible.
• Invest in maturing IT management processes toward service-level management. IT processes “cut through” the individual department silos. Begin setting targets and measuring service levels, even if informally.
• Don’t be complacent; actively test, test, test…all changes, integration of changes, switchover processes and business continuity plans.
• To quickly drive availability improvements, set corporate goals and performance metrics for business units and IT.


