PUBLICLY AVAILABLE SPECIFICATION
PAS 77:2006
IT Service Continuity Management Code of Practice
ICS code: 35.020 NO COPYING WITHOUT BSI PERMISSION EXCEPT AS PERMITTED BY COPYRIGHT LAW
This Publicly Available Specification comes into effect on 11 August 2006
Amd. No.
Date
Comments
© BSI 11 August 2006 ISBN 0 580 49047 5
Contents

Foreword
Introduction
1 Scope
2 Terms and definitions
3 Abbreviations
4 IT Service Continuity management
5 IT Service Continuity strategy
6 Understanding risks and impacts within your organization
7 Conducting business criticality and risk assessments
8 IT Service Continuity plan
9 Rehearsing an IT Service Continuity plan
10 Solutions architecture and design considerations
11 Buying Continuity Services
Annex A (informative) Conducting business criticality and risk assessments
Annex B (informative) IT Architecture Considerations
Annex C (informative) Virtualization
Annex D (informative) Types of site models
Annex E (informative) High availability
Annex F (informative) Types of resilience
Bibliography
Foreword

This Publicly Available Specification (PAS) has been prepared by the British Standards Institution (BSI) in partnership with Adam Continuity, Dell Corporation, Unisys, and SunGard.

Acknowledgement is given to the following organizations that have been involved in the development of this code of practice.

• Adam Continuity
• Dell Corporation
• SunGard
• Unisys
Contributors:

• Oscar O’Connor, Lead Author
• John Pollard
• Richard Pursey
• Andrew Roles
• Brian Hayden
• Douglas Craig
• Stafford Hunt

As a code of practice, this PAS takes the form of guidance and recommendations. It should not be quoted as if it is a specification and particular care should be taken to ensure that claims of compliance are not misleading.

This Publicly Available Specification has been prepared and published by BSI, which retains its ownership and copyright. BSI reserves the right to withdraw or amend this Publicly Available Specification on receipt of authoritative advice that it is appropriate to do so. This Publicly Available Specification will be reviewed at intervals not exceeding two years, and any amendments arising from the review will be published as an amended Publicly Available Specification and publicized in Update Standards.

This Publicly Available Specification is not to be regarded as a British Standard.

This Publicly Available Specification does not purport to include all the necessary provisions of a contract. Users are responsible for its correct application.
Attention is drawn to the following statutory instruments and regulations:

• Basel II: International Convergence of Capital Measurement and Capital Standards: a Revised Framework, Basel. Bank for International Settlements Press and Communications, 2005.
• The Civil Contingencies Act 2004. Cabinet Office: The Stationery Office.
• The Data Protection Act 1998. British Parliament: The Stationery Office.
• The Higgs Report on the Role of Non-Executive Directors: Department of Trade and Industry: The Stationery Office, 2001.
• The Sarbanes-Oxley Act, 107th Congress of the United States of America, 2002.
• The Turnbull Report on Corporate Governance: Department of Trade and Industry: The Stationery Office, 1998.
• The Orange Book Management of Risk – Principles and Concepts: HM Treasury, 2004.
Compliance with this Publicly Available Specification does not of itself confer immunity from legal obligations.
Introduction

This code of practice provides guidance on IT Service Continuity Management (ITSCM). It is intended to complement, rather than replace or supersede, other publications such as PAS 56, BS ISO/IEC 20000, BS ISO/IEC 17799:2005 and ISO 9001 (see Bibliography for further information).

• PAS 56 provides guidance on best practice in Business Continuity Management, and while it mentions the need for IT Service Continuity it does not provide the detailed guidelines found in this code of practice;
• BS ISO/IEC 20000 provides guidance on best practice in Service Management and, like PAS 56, mentions IT Service Continuity, but not at the level of detail presented in this code of practice;
• BS ISO/IEC 17799:2005 provides detailed guidance on best practice in information security management, which is one aspect of IT Service Continuity Management. This code of practice does not directly address information security or physical and environmental security as these areas are covered by BS ISO/IEC 17799:2005;
• ISO 9001 provides guidance on best practice in Quality Management Systems. When implementing any recommendations found within this code of practice, the reader is encouraged to apply the quality assurance and control recommendations found in ISO 9001.

Many organizations believe that a loss of systems infrastructure will not happen to them or that a loss of such infrastructure will have a relatively low impact. However, while many of those organizations might believe that they have invested in adequate systems resilience, it is often apparent that such confidence is misplaced. In an age in which information technology is becoming ever more pervasive and increasingly critical within the day-to-day operations of many organizations, it is clear that the ability to continue to operate with any degree of success is likely to be severely compromised following loss of IT services. In addition it is evident that the duration of a tolerable IT outage is becoming ever shorter.

As Figure 1 suggests, there is a continuous cycle in the relationships between several important documents. The IT Strategy defines the organization’s key policies and direction regarding information technology, systems and services. From this, the IT Service Continuity Strategy can be defined to ensure that the policies and standards for IT Service Continuity directly and explicitly support the objectives set out in the IT Strategy. This then enables the organization to define its IT Architecture based upon the requirements and objectives set out in both the IT Strategy and ITSC Strategy. Once the architecture is defined, the organization can then define IT Service Continuity Plans for each element of the architecture. Feedback from (amongst many other sources) rehearsing the ITSC Plans can subsequently be used as input to the next iteration of the IT Strategy.
Figure 1 – The relationship between the IT Strategy, ITSC Strategy, IT Architecture and ITSC Plan

[Figure: a continuous cycle linking IT Strategy, IT Service Continuity Strategy, IT Architecture and IT Service Continuity Plan]
Whilst it is true that major events such as bombs, fires and floods make headline news, the majority of IT related incidents fall into the category of ‘quiet calamities’ that only affect an individual or a small subset of the organization. Examples of such common incidents include the theft of a mobile worker’s notebook computer, the failure of an important business application and corruption of important or confidential data. These incidents have the potential to damage an organization’s brand or public image and its reputation, not to mention its revenues and customer service. Such damage has the potential to destroy that organization unless appropriate action is taken to implement IT Service Continuity (ITSC).

In order to retain an appropriate sense of perspective, this document refers to ‘incidents’ and ‘events’ rather than ‘disasters’. Since the Asian Tsunami of 2004 and Hurricane Katrina in 2005, the phrase ‘disaster recovery’ has taken on dimensions previously unknown and the authors felt it was inappropriate to describe the failure of IT systems, however disruptive, using the same language.

Throughout the document the reader may encounter terminology which is used in other standards. To avoid ambiguity the reader should refer to the definition section to understand how such terminology is used in this document, which may differ from other standards.

This document is intended to be read by a number of different audiences:
• Executive and Senior Management – to gain a high level understanding of the fundamental interdependencies between Corporate Governance, Business Continuity and IT Service Continuity in order to make better-informed investment decisions relating to ITSCM;
• Middle Management – to understand how decisions should be made regarding IT Service Continuity such that critical business processes survive disruption (ideally) or at the very least have the ability to recover from disruption in timescales required by the organization;
• IT Management – to understand the decision making processes required in order to ensure that IT Service Continuity strategies and plans fully support business priorities;
• IT Support and Operations – to gain a practical insight into how IT Service Continuity strategies should be drawn up and implemented in such a way as to add value to the organization as well as protecting it from IT-related incidents;
• Regulators, auditors, insurance and benchmarking organizations – to understand what best practice in IT Service Continuity Management implies for organizations so that these measures can be assessed as part of wider reviews of Corporate Governance and resilience.

This code of practice is designed for organizations of all shapes and sizes, whether in the private or public sectors.
It should not be regarded as a step-by-step guide to implementing IT Service Continuity Management but as guidance on the aspects of ITSCM which organizations should consider when investing in this area. Not all activities described herein will be applicable or appropriate for all organizations. In particular, small organizations should aim to use this code of practice as a reference guide in order to help them make informed decisions about what level of ITSCM would be appropriate for them given their individual characteristics.

Throughout this code of practice certain terms have been used which may cause confusion. Such confusion is naturally not the intention of the authors, so the following guidance should be borne in mind when reading this document:

• The term ‘business’ is used when referring to the non-IT elements of an organization. This should not be taken to imply that this code of practice is aimed purely at private sector or commercial bodies. In each such instance, the term is used merely as convenient shorthand to avoid over-complicating the language used herein.
• This code of practice refers to ‘rehearsing’ ITSC Plans. Other publications in this field have referred to ‘testing’ and also to ‘exercising’. The authors regard these terms as largely interchangeable and have opted to use the term rehearsing in this context as it implies not only testing that ITSC Plans are accurate and capable of being implemented, but also that the people required to implement them are guided, supported and provided with feedback on their own personal performance as well as that of the Plans. The authors did not feel that either of the other terms used in other publications quite conveyed the necessary emphasis on this aspect.
• The term ‘data centre’ is used to imply any location or facility where core information technology services are housed, whether that be the ultra-modern data centres that major organizations use or under the desk where a one-person business keeps its file server. No inference should be drawn regarding the applicability of guidance or recommendations to any type of non-data centre environment.
1 Scope

This Publicly Available Specification (PAS) explains the principles and some recommended techniques for IT Service Continuity management. It is intended for use by persons responsible for implementing, delivering and managing IT Service Continuity within an organization.

This PAS provides a generic framework and guidelines for a continuity programme including the following topics.

• What the required management structure, roles and responsibilities for implementing IT Service Continuity management are.
• How business criticality, risk assessments and business impact assessments should be performed to produce useable results.
• What business continuity plans contain and the steps required to respond to, and recover from, the identified risks within the context of specified business processes.
• How the development, rehearsal and deployment of the Business Continuity plan does not have to cost more in terms of money, risk or reputation than taking no action.
• Why a framework and capability should be developed for the organization to respond effectively to unexpected disruption.

This document is not intended to be used as step-by-step instructions for conducting any of the activities described herein. It is intended to provide an overview of a complete process on the assumption that information will already exist within the organization that would be identified by activities described in this document. Where this is the case, users of this document are encouraged to review the information in their possession to ensure that it includes all of the details required, and that it is up-to-date and accurate.
2 Terms and definitions

For the purpose of this PAS, the following terms and definitions apply.

2.1 abnormal service
level of service that deviates from the levels agreed for normal operations
NOTE Usually as a result of an incident causing disruption to normal service levels.

2.2 action plan
schedule of activities, lead times and dependencies of activities in order to address a particular requirement

2.3 asynchronous replication
periodic physical replication of data from one storage system to another
NOTE Typically over a wide area network.

2.4 atomic
requirement, transaction or objective which is self-contained, i.e. cannot be broken down further

2.5 audit log shipping
automated process for transferring records of transactions (audit logs) between primary and secondary systems

2.6 business continuity management plan
document that sets out to ensure resumption of critical business functions in the event of either an incident or unforeseen event that threatens the business

2.7 clustered system
two or more computer systems configured in such a manner that in the event of failure of a system or service run on it, operation is transferred to another system within the cluster

2.8 cold back-up site
provides the space but not the infrastructure needed to resume operations quickly

2.9 continuity procedures
set of predefined procedures to be followed in the event of an incident which disrupts normal service levels

2.10 data availability
measure of a system’s ability to deliver a predetermined level of data access during a system failure

2.11 dependency modelling
activity used to determine the inter-relationships and dependencies between functions and/or processes and how they affect the system or organization as a whole

2.12 disk imaging
method of copying a complete hard disk of a computer into a single file from which the gathered image can be distributed to a single or multiple computers to minimize the time and effort for the creation of computers that will have identical software and configurations to the original

2.13 domain
logical association of a defined environment and the assets within the pre-defined environment

2.14 downtime vs. cost vs. benefit model
model which analyses the costs of downtime and of the measures required to minimize downtime in the event of an incident and compares them against the benefits available to the organization from services being resumed

2.15 duplexed
ability to simultaneously send and receive data through a medium in both directions
NOTE When used to describe disk devices or disk connectivity it implies ‘duplication’ or ‘mirroring’.

2.16 fail-back
return of service/operation from fail-over site

2.17 fail-over
ability for services offered by a component, server or system to automatically be undertaken by another component, server or system in the event of its failure so that the loss of that component, server or system has a minimal impact on the service or services offered
2.18 failure modes and effects analysis (FMEA)
structured quality method to identify and counter weak points in the early conception phase of products and processes 1)

2.19 incident
event that disrupts normal IT services
NOTE This usage differs from that in ITIL [1].

2.20 incident recovery
activities required to respond effectively to an incident, with the primary objective being to ensure the resumption of normal service levels

2.21 I/O Processors
allow servers, workstations and storage subsystems to transfer data faster, reduce communication bottlenecks, and improve overall system performance by offloading I/O processing functions from the host CPU 2)

2.22 IP Address
logical address of a system within an IP network
NOTE The IP address uniquely identifies computers on a network. An IP address can be private, for use on a Local Area Network (LAN), or public, for use on the Internet or other WAN. IP addresses can be determined statically (assigned to a computer by a system administrator) or dynamically (assigned by another device on the network on demand).

2.23 IT Architecture
overall design of an organization’s information technology and services including both physical and logical entities

2.24 IT Infrastructure
physical devices which comprise an organization’s information technology and services architecture

2.25 IT Service
set of related information technology and probably non-information technology functionality, which is provided to end-users as a service
NOTE Examples of IT services include messaging, business applications, file and print services, network services, and help desk services 3).

2.26 IT Service Continuity Management
supports the overall Business Continuity Management process by ensuring that the required information technology technical and services facilities (including computer systems, networks, applications, telecommunications, technical support and service desk) can be recovered within required, and agreed, business timescales

2.27 last mile telecoms provider
organization responsible for the provision of telecommunications services from the national or local telecommunications infrastructure to a specific location

2.28 latency
delay due to the time it takes to transmit data from one location to another

2.29 maintenance procedures
procedures applied by an organization to ensure that their IT Infrastructure is maintained in optimum condition through both proactive and reactive measures

2.30 monte carlo analysis
means of statistical evaluation of mathematical functions using random samples, often used in risk analysis of highly complex systems

2.31 Network Attached Storage (NAS)
storage device that can be attached to the network for the purpose of file sharing
NOTE In essence a NAS device is simply a file server.

2.32 network protocol
technological rules, codes, encryption, data transmission and receiving techniques which allow networks to operate

1) http://www.fmeainfocentre.com/
2) http://www.intel.com/design/iio/
3) http://whatis.com
2.33 operations bridge
central facility used for monitoring and managing systems, services and networks

2.34 paper test
mechanism for proving the hypothetical effectiveness of a process by working through scenarios in a discursive forum

2.35 point in time (PIT)
consistent copy of the data taken at the same instant in time for one or more systems

2.36 recovery procedures
procedures which result in the restoration of services following an incident

2.37 redundant routing
resilient approach to data networking in which there are a minimum of two routes from each node in the network

2.38 rehearsing
the critical testing of ITSC strategies and ITSCs, rehearsing the roles of team members and staff, and testing the recovery or continuity of an organization’s systems (e.g. technology, telephony, administration) to demonstrate ITSC competence and capability
NOTE A rehearsal may involve invoking business continuity procedures but is more likely to involve the simulation of a business continuity incident, announced or unannounced, in which participants role-play in order to assess what issues may arise, prior to a real invocation.

2.39 replication appliance
device which provides functionality to replicate data to other storage systems

2.40 risk
combination of the probability of an event and its consequence [ISO Guide 73:2002]

2.41 risk communication
exchange or sharing of information about risk between the decision-maker and other stakeholders [ISO Guide 73:2002]

2.42 risk management plan
document that sets out to define a list of activities, lead times and dependencies in order to mitigate one or more identified risks

2.43 risk mitigation
set of actions that will affect either the probability of the risk occurring or its impact should the risk occur. These are summarized as transferring the risk, tolerating the risk, terminating the risk or treating the risk

2.44 risk monitoring
iterative process of the risk owner checking and reporting on any changes in status of the risk log in terms of risk proximity, impact and response

2.45 stateful/stateless
describe whether a computer or computer program is designed to note and remember one or more preceding events in a given sequence of interactions with a user, another computer or program, a device, or other outside element
NOTE Stateful means the computer or program keeps track of the state of interaction, usually by setting values in a storage field designated for that purpose. Stateless means there is no record of previous interactions and each interaction request has to be handled based entirely on information that comes with it. Stateful and stateless are derived from the usage of state as a set of conditions at a moment in time. (Computers are inherently stateful in operation, so these terms are used in the context of a particular set of interactions, not of how computers work in general).

2.46 storage array
two or more hard disk drives working in unison to improve fault tolerance and performance

2.47 synchronous replication
instantaneous physical replication of data from one storage area to another, typically over a high speed interconnect such as fibre channel

2.48 test scripts
definition of the specific tests to be enacted when proving the functionality and operation of a system or service
2.49 vulnerability report
report which identifies the specific vulnerabilities of a specific system or service

2.50 work schedule
defined set of activities and deliverables which, once completed, will result in the desired outcome of a procedure or project

2.51 Zero Data Loss (ZDL)
remote replication method that guarantees not to lose any live data

2.52 zoning
allocation of resources for device load balancing and for selectively allowing access to data only to specific systems
NOTE Zoning allows an administrator to control who can see what is in a SAN.

3 Abbreviations

For the purpose of this PAS, the following abbreviations apply.

BCM     Business Continuity Management
BCMP    Business Continuity Management Plan
BCMT    Business Continuity Management Team
BCSG    Business Continuity Steering Group
CMT     Crisis Management Team
DAS     Direct Attached Storage
DBMS    Database Management System
IMT     Incident Management Team
I/O     Input/Output
IT      Information Technology (also includes Information Systems (IS))
ITIL    Information Technology Infrastructure Library
ITSC    Information Technology Service Continuity
NAS     Network Attached Storage
OS      Operating System
RAID    Redundant Array of Independent Disks
RPO     Recovery Point Objective
RTO     Recovery Time Objective
SAN     Storage Area Network
UPS     Uninterruptible Power Supply
WAN     Wide Area Network
4 IT Service Continuity management

Information Technology Service Continuity (ITSC) is the collection of policies, standards, processes and tools through which organizations not only improve their ability to respond when major system failures occur but also improve their resilience to major incidents such that critical systems and services do not fail. It is related to a number of disciplines and should be undertaken with a complete and thorough understanding of the organization’s policies, standards, processes and supporting services for:

a) Business Continuity Management;
b) Major Incident and Crisis Management;
c) Corporate Governance and Risk Management;
d) Information Technology (IT) Governance;
e) Information Security and Data Protection.

ITSC management should also have a significant influence on IT strategy to identify information systems and services which require high levels of resilience, availability and capacity.

The purpose of risk management and ITSC management is not simply to be able to say that risk-based control mechanisms have been implemented. The management of risk can result in many tangible and intangible benefits to the organization if implemented with commitment and the right motivation. Risk management can be used to improve product quality, productivity, financial performance and working conditions. These benefits should be at the forefront of the participants’ thinking throughout the process.

Risk management can be seen as a positive approach to improving all aspects of the organization’s performance. Every stakeholder can make a significant, positive contribution by considering, on a regular basis, the ways in which the organization’s ability to achieve its objectives could be at risk. In order to do so, the organization’s objectives should be communicated clearly to everyone involved in their achievement and communication on risks should be encouraged.

ITSC management addresses risks that could cause a sudden and serious impact, such that they could immediately threaten the continuity of the business. These typically include:

a) loss, damage or denial of access to key infrastructure services;
b) failure or non-performance of critical providers, distributors or other third parties;
c) loss or corruption of key information;
d) sabotage, extortion or commercial espionage;
e) deliberate infiltration or attack on critical information systems.

Business Continuity Management (BCM) is concerned with managing risks to ensure that at all times an organization can continue operating to, at least, a pre-determined minimum level. The BCM process involves reducing the risk to an acceptable level and planning for the recovery of business processes should a risk materialize and a disruption to the business occur. In essence ITSC management should be a part of the overall Business Continuity plan and not dealt with in isolation.
5 IT Service Continuity strategy

5.1 Defining an IT Service Continuity strategy

NOTE The following frameworks are well regarded and respected and can be used for additional information when creating a comprehensive IT Strategy that will assist in clarifying and defining your ITSC Strategy. For additional reading please refer to:

• CMM, Capability Maturity Model, http://www.itservicecmm.org;
• CobiT 4.0, Control Objectives for Information and related Technology, http://www.itgi.org;
• ITIL, IT Infrastructure Library, http://www.itil.co.uk.

An ITSC strategy should define the direction and high-level methods that should meet IT service level objectives. It should ensure a business is never compromised by a lack of IT availability beyond acceptable, predefined and regularly reviewed levels of uptime and performance.

The ITSC strategy should be agreed at Board level and ideally be fully endorsed by the CEO. A Board member should be accountable for the strategy and be referred to when deciding on new business initiatives including mergers and acquisitions, directional change and any decision that could have an impact on ITSC.

In devising an organization’s ITSC strategy, it is advisable to consider four discrete but linked stages in the management of a major incident, as also shown in Figure 2:

a) initial response – covering the initial actions required to ensure the safety and welfare of people affected by the incident, to activate the relevant incident management teams and determine the level of response which is appropriate to the incident;

b) service recovery – this may take place in a number of stages depending upon the needs and scale of the organization but should involve the restoration of all required services in priority order to pre-agreed (possibly degraded) levels of service;

c) service delivery in abnormal circumstances – until the organization is ready and able to resume normal service operations there is still a need to continue to operate required services at the pre-agreed service levels until the circumstances permit these ‘abnormal services’ to be failed back to ‘business as usual’ and decommissioned;

d) normal service resumption – as with service recovery, the resumption of normal service may take place in stages according to the needs and priorities of the organization. Only when each service has been validated and verified as being ‘back to normal’ should the secondary systems be decommissioned. This stage is only complete when all of the organization’s IT services are restored to normal service levels.

Figure 2 – Major Incident Management

[Figure: four linked stages in sequence: Initial response, Service recovery, Service delivery in abnormal circumstances, Resumption of normal service]
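The four stages of major incident management described in 5.1 can be sketched as a simple ordered life cycle. The following Python fragment is an illustration only, not part of this code of practice: the names `Stage`, `next_stage` and `resumption_complete` are hypothetical, and the sketch assumes the stages are always worked through in the order listed.

```python
from enum import Enum
from typing import Dict, Optional


class Stage(Enum):
    """The four linked stages of major incident management (see 5.1)."""
    INITIAL_RESPONSE = 1
    SERVICE_RECOVERY = 2
    ABNORMAL_SERVICE_DELIVERY = 3
    NORMAL_SERVICE_RESUMPTION = 4


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after the last stage."""
    if current is Stage.NORMAL_SERVICE_RESUMPTION:
        return None
    return Stage(current.value + 1)


def resumption_complete(services_back_to_normal: Dict[str, bool]) -> bool:
    """The final stage is complete only when every IT service has been
    validated as restored to its normal service level."""
    return all(services_back_to_normal.values())


# Hypothetical example: two services, one still running on secondary systems.
status = {"messaging": True, "order processing": False}
print(next_stage(Stage.SERVICE_RECOVERY).name)
print(resumption_complete(status))
```

In practice stages b) and d) may each proceed service by service, so a real plan would track the stage per service rather than for the organization as a whole.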
The ITSC strategy should enable the organization to plan for and rehearse the whole life cycle of a major incident from the point of initial disruption, through the recovery, to abnormal service to the point where normal service levels are once again guaranteed.

The strategy should be developed from a clear understanding of the organization’s need for IT services and the agreed service levels that are required from time to time, taking into account:

a) priority for key business units at given moments in time;
b) peak loads on business;
c) strategically important business periods e.g. reporting periods, manufacturing deadlines etc;
d) compliance with Business Continuity Management Plans and objectives;
e) investment vs. risk;
f) impact of failure or loss;
g) recovery time objectives;
h) acceptable levels of downtime and performance;
i) system changes and upgrades;
j) new projects;
k) interdependencies;
l) compliance with legislation;
m) deadline management;
n) rehearsing recovery plans;
o) data protection;
p) data availability;
q) plan maintenance;
r) education and awareness programmes for all IT staff.

The strategy should not define the detailed tactics but should set the direction of the individual components of an ITSC plan.

5.2 Creating an ITSC strategy

The ITSC strategy should be a by-product of a Business Continuity Management Plan (BCMP) but can be defined without one. Where a BCMP exists, those responsible for IT service levels are likely to have contributed to the plan and already be aware of the implications of that plan on the IT strategy and direction. As shown in Figure 3, an ITSC strategy should have six main elements, all of which are part of a continuous cyclical process.
Figure 3 – ITSC strategy elements (elements shown in the cycle: Understand Requirements; Understand Dependencies; Instill Continuity Culture; Rehearse, Exercise and Audit; Review Strategy and Update Objectives; Monitor)
5.4 Identifying requirements and weaknesses
The foundation of an ITSC strategy should be to ensure an embedded resilience throughout an IT infrastructure. The Head of IT should commission an internal review of all areas of potential weakness from single points of failure through to redundancy, supply chain dependence and general IT housekeeping processes such as secure back-up and restore technology. From this review, a strategy of improved resilience can be determined.
An ITSC strategy should make use of history and trend reports highlighting downtime experience, proven areas of weakness and service level reports. A vulnerability report is highly likely to involve expenditure. Key to implementing an ITSC strategy is to measure the costs of downtime vs. resilience expenditure i.e. impact and risk vs. cost. This should be done by working with the Board to calculate the cost of downtime on key business functions on an hourly basis. This determines budgets and also focuses attention on service level agreements required as a result of measured downtime.
Department Heads should provide their service level requirements and the level of uptime they require within a steady state as well as the recovery time objectives after an incident. These should be measured using a downtime vs. cost vs. benefit model.
Advancements in technology carry an inherent demand for constant change and improvement. Enhancements to an IT infrastructure should be planned, rehearsed, and carefully managed with clear contingency plans in place should the implementation fail. The ITSC strategy should take into account a tolerance for downtime for major system upgrades and changes. Should scheduled downtime be unacceptable then plans should be put in place for duplicate environments running parallel systems. Agreement should be reached on levels of foreseen downtime prior to defining the strategy.
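The downtime vs. cost vs. benefit comparison described in 5.4 can be sketched as a small calculation. This is an illustrative sketch only: the class and function names, the hourly cost figure and the two resilience options are hypothetical and are not taken from this PAS.

```python
from dataclasses import dataclass


@dataclass
class ResilienceOption:
    """A candidate resilience measure (all figures are hypothetical)."""
    name: str
    annual_cost: float              # yearly spend on the measure
    expected_downtime_hours: float  # downtime still expected per year with it in place


def annual_downtime_cost(hourly_cost: float, downtime_hours: float) -> float:
    """Cost of downtime, using the hourly cost agreed with the Board."""
    return hourly_cost * downtime_hours


def net_benefit(option: ResilienceOption, hourly_cost: float,
                baseline_downtime_hours: float) -> float:
    """Downtime cost avoided per year minus the annual cost of the measure."""
    avoided_hours = baseline_downtime_hours - option.expected_downtime_hours
    return annual_downtime_cost(hourly_cost, avoided_hours) - option.annual_cost


# Hypothetical inputs: 5,000 per hour of downtime, 40 hours of downtime per year today.
HOURLY_COST = 5_000.0
BASELINE_HOURS = 40.0
standby = ResilienceOption("warm standby", annual_cost=60_000.0, expected_downtime_hours=10.0)
mirrored = ResilienceOption("mirrored site", annual_cost=200_000.0, expected_downtime_hours=2.0)

for option in (standby, mirrored):
    print(option.name, net_benefit(option, HOURLY_COST, BASELINE_HOURS))
```

With these assumed figures the cheaper option pays for itself and the more expensive one does not, which is the kind of result that would inform the budget discussion at Board level. A real model would also weigh risk likelihood and non-financial impacts.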
Due consideration and research should be undertaken on current and forthcoming legislative requirements as well as good practice guidelines on aspects of business continuity and IT resilience. The Board (or equivalent) and especially Non-Executive Directors, if present, should steer an organization towards compliance and healthy business management.
The ITSC strategy should also include an ongoing and continuous process for change management including involvement of third party suppliers as well as internal customers.
An agreement should be reached at Board or Executive level on levels of investment and priorities for expenditure on IT resilience. It should agree on its policy on outsourcing risk management to third party suppliers, e.g. incident recovery companies and third party maintenance organizations (see Clause 11).
5.5 Management structure and roles
5.5.1 General
The management structure should be a standard, three-tiered structure used widely within both the private and public sectors for major incident and crisis management. The three tiers are:
a) Bronze – operational level: Incident Management Team (IMT);
b) Silver – tactical level: Business Continuity Management Team (BCMT);
c) Gold – strategic level: Crisis Management Team (CMT).
In terms of relative size and subsidiarity, the relationship between these teams is illustrated in Figure 4.
Figure 4 – Management structure (tiers shown: Gold – Crisis Management Team; Silver – Business Continuity Management Team; Bronze – Incident Management Team)
d) all events transpiring during the disruption, their effects and likely causes;
e) all actions taken and evidence of their results;
f) all communication in relation to the disruption, including the other parties involved, the nature of the communication and what information was passed in each direction.
This journal should cover the period from the time the team is activated to the time it stands down. All entries in the journal should include details of the date and time the entry was made, and by whom. The completed journals should be used to support future reviews of business and IT service continuity plans and their effectiveness. Therefore, stringent change control should be applied on these journals, and no changes of any nature should be permitted once the team has stood down.
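Where the journal is kept electronically, the rules above (every entry carries a date, time and author, and nothing changes after stand-down) could be enforced along the following lines. This is a minimal sketch under stated assumptions: the class names, the `kind` values and the in-memory storage are hypothetical, and a real journal would also need durable, tamper-evident storage.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class JournalEntry:
    """One immutable journal line: when it was made, by whom, and what happened."""
    recorded_at: datetime
    recorded_by: str
    kind: str      # hypothetical values: "event", "action" or "communication"
    details: str


class IncidentJournal:
    """Append-only journal that refuses any entry once the team has stood down."""

    def __init__(self) -> None:
        self._entries: list = []
        self._stood_down = False

    def record(self, recorded_by: str, kind: str, details: str,
               recorded_at: Optional[datetime] = None) -> JournalEntry:
        if self._stood_down:
            raise RuntimeError("journal is closed: the team has stood down")
        stamp = recorded_at or datetime.now(timezone.utc)
        entry = JournalEntry(stamp, recorded_by, kind, details)
        self._entries.append(entry)
        return entry

    def stand_down(self) -> None:
        """Close the journal; no further entries are accepted."""
        self._stood_down = True

    @property
    def entries(self) -> tuple:
        """Read-only view of the entries in the order they were made."""
        return tuple(self._entries)
```

Entries are frozen dataclasses and the journal exposes only a tuple, so existing lines cannot be edited through this interface; corrections would be made by adding a new entry before stand-down.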
5.6 IT Service Continuity in a changing environment
Business is by its very nature dynamic. It changes regularly and with change comes risk; not only risk of failure but risk of destabilizing existing policies and strategies. Therefore, the ITSC strategy should be resilient to change and also adaptable. The key factors that should be considered to ensure that the ITSC strategy and plans remain appropriate for the organization as it and its environment change include the following.
a) Board level responsibility and accountability for the ITSC strategy should help keep the strategy current as the organization changes, develops and grows. BCM and ITSCM should be a high-profile ingredient of Board level thinking and should be the most important aspect of any continuity plan.
b) The change management process should include all parties responsible for the ITSC strategy, both its compilation and its delivery. No change to the IT infrastructure should be considered until the implications of the change have been assessed and understood and contingency plans are rehearsed.
c) The procurement process for new IT systems should include sign-off that resilience has not been compromised by even the most simple of upgrades or improvements. Non-IT expenditure could still have an impact on IT resilience such as recruitment (system overheads), marketing campaigns (web site activity) etc.
d) Due diligence on merger and acquisition (M&A) activity should include a resilience assessment. Often, M&A activity can bring perceived cost saving benefits such as branch or office closure. This can also reduce resilience through loss of fail-over sites, loss of secondary systems, and inherent redundancy.
e) Service levels (e.g. uptime statistics) should be reviewed as a Board agenda item each month. Trend analysis can show even a slight decline in service which can be an indicator of bigger problems.
f) Testing and rehearsing contingency and recovery plans should be an essential ingredient in keeping an ITSC strategy current. Ensuring that a department or application can be recovered fully after failure helps to minimize simple errors and problems. This includes performing complete data back-ups as well as testing third party suppliers.
g) Suppliers’ ability to maintain appropriate levels of service should be regularly assessed. Including suppliers such as incident recovery and maintenance providers in the change management loop is highly recommended.
h) Remunerating staff against service levels can help ensure the relevant level of awareness reaches all levels of the organization.
i) An internal/external audit of plans.
6 Understanding risks and impacts within your organization
6.1 General
Risks are prevalent within any environment. Before commencing any ITSC programme there should be an understanding of potential risks and impacts.
The loss of IT (staff, management or infrastructure) typically results in the loss of the ability to operate and manage an organization’s systems infrastructure, with the resultant degradation or loss of critical applications and data. How this affects an organization depends on what it does, its key processes and their dependence on technology and the duration of that disruption. For example, businesses in the financial sector frequently depend on financial and/or market information feeds and applications in order to manage time bound investments or transactions. An inability to manage investments and other financial vehicles would have a potentially serious impact on an organization’s balance sheet and revenues.
In order to fully understand how a disruption in IT service can affect an organization it is necessary to conduct a business criticality and risk assessment (see Annex A) which will identify critical activities and the degree to which these are dependent on IT. It should also identify the required recovery timescales (RTOs) for IT services which are vital in the implementation of those critical activities as well as the currency of the data which is used in the recovery of those IT services.
6.2 Vulnerability assessment
In parallel with an impact analysis, the potential vulnerabilities prevalent within IT service delivery which might give rise to disruption should be determined. This information can be obtained through a risk assessment which should review the IT infrastructure’s exposure in terms of:
a) system resilience and availability;
b) key suppliers and agreements;
c) documentation;
d) hardware and software assets;
e) storage;
f) back-up regimes;
g) staff exposure;
h) staff training;
i) location of buildings and facilities;
j) IT security;
k) systems monitoring;
l) power;
m) data communications;
n) archiving;
o) IT environment and monitoring;
p) telephony;
q) any other relevant exposure.
Every organization’s risk level will be different; however, the outcome of the risk assessment should provide it with sufficient information to evaluate its vulnerabilities in a rational manner and to decide how to deal with them, whether by eliminating the risk altogether, by investing in resources to mitigate the exposure, or by preparing beforehand for the consequences of the risk, such as having appropriate incident management in place.
By adopting this twin track approach at the start of the ITSC programme one should obtain an understanding of the organization’s dependencies on IT infrastructure in terms of the impacts of infrastructure failure (as a whole or in part) and an appreciation of the vulnerabilities present which could give rise to an incident which precipitates those impacts.
7 Conducting business criticality and risk assessments
A critical initial activity in the development of an ITSC strategy or plan is to identify all business processes and the departments or business functions responsible for their operation and to categorize each function and process according to its criticality to the business. The next step is to identify all IT services which support each business process and assess their criticality to the operation of those business processes.
NOTE 1 More detailed guidance is available in Annex A.
NOTE 2 Specific guidance on conducting risk assessments relating to information security can also be found in BS ISO/IEC 17799:2005.
ITSC management addresses the ways in which the following types of activity could be disrupted, stopped or have their performance degraded to unacceptable levels:
a) operation of IT services and processes;
b) IT service resumption following a disruption or failure;
c) new IT service or information systems development projects;
d) readiness and operation of ITSC required to comply with statutory or regulatory requirements.
The organization should be regarded in two ways:
a) Physically: an organization exists on one or more sites, each site comprising buildings, which can be broken down in a variety of ways (floor, wing, corridor, office etc.);
b) Organizationally: most organizations are structured into a number of Directorates, each of which comprises a number of functions, which comprise departments, processes and activities. Naturally this naming convention is not intended to be an accurate description of all organizations but a theme which can be readily recognized.
It is possible for each physical component to support a number of organizational components. It is equally possible for a single organizational component to be situated in a number of different physical locations. In order to avoid duplication of effort the risk assessment process should examine the organization and its IT services from both of these perspectives.
The process of assessing business criticality and risk should be managed to ensure that the assessment of physical risks is coordinated with, but not dominated by, the assessment of organizational risks. Neither assessment is more important than the other, but each has its part to play in ensuring that the business as a whole adopts a position in which all types of risk are managed as effectively as possible. The inherent complexity in all organizations implies that any risk assessment method should be adaptable to the different circumstances within each part of the organization.
NOTE See Annex A for details on how to conduct business criticality and risk assessments.
The criticality of each business process should have a direct impact on the criticality of supporting IT services. Suggested designations are shown in Table 1.
Table 1 – Business criticality categories
Category | Impact
Mandatory | Vital to enable the organization to meet statutory or other (internally or externally) imposed requirements.
Critical | Vital to the day-to-day operation of the organization.
Strategic | Important for the implementation of the long term strategy.
Tactical | Important for the achievement of the short to medium term performance objectives of the organization.
NOTE If a system or service cannot readily be assigned to any of these categories, the organization may wish to consider whether that system or service has any ongoing purpose. If however a system or service can be assigned to more than one category the organization should decide on which category designation will be used.
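One possible way to let business process criticality drive IT service criticality is to give each IT service the highest category among the processes it supports. The sketch below assumes a ranking of the Table 1 categories (Mandatory highest, Tactical lowest); that ranking, the function name and the sample processes and services are assumptions made for illustration and are not stated in this PAS.

```python
from enum import IntEnum


class Criticality(IntEnum):
    """Table 1 categories; the numeric ranking is an assumption, not part of Table 1."""
    TACTICAL = 1
    STRATEGIC = 2
    CRITICAL = 3
    MANDATORY = 4


def service_criticality(process_criticality: dict, supports: dict) -> dict:
    """Give each IT service the highest category among the processes it supports."""
    return {
        service: max(process_criticality[process] for process in processes)
        for service, processes in supports.items()
    }


# Hypothetical business processes and the IT services that support them.
processes = {
    "regulatory reporting": Criticality.MANDATORY,
    "order processing": Criticality.CRITICAL,
    "market research": Criticality.STRATEGIC,
}
services = {
    "finance database": ["regulatory reporting", "order processing"],
    "analytics platform": ["market research"],
}

for name, level in service_criticality(processes, services).items():
    print(name, level.name)
```

Taking the maximum also gives a simple rule for a service that could fall into more than one category, although the organization may prefer to decide such cases individually.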
8 IT Service Continuity plan
8.1 Definition of an ITSC plan
The ITSC plan is a simple, clear, unambiguous and all encompassing set of documents that define the actions required to restore IT services in the event of an incident. An ITSC plan is a series of working documents which are constantly rehearsed, updated, modified and improved. Depending on the organization’s requirements, the ITSC plan can be one document, or a series of connected documents. It can be printed on paper or held as electronic/on-line documents. However, the ITSC plan should be readily available in the right place at the right time and to the right people when an incident occurs, which might mean having hard copies accessible. The ITSC plan for each service should provide detailed procedures and step-by-step guidelines for each stage in the incident management process, as described in Figure 2 in Clause 5.1.
8.2 Defining an architecture
Before building an ITSC plan the IT infrastructure should be reviewed to determine whether it has all the components and technology required to allow IT services to continue in the event of an incident. If not, then the systems should be updated to include ITSC components, such as resilient, high availability or redundant systems and data replication mechanisms. This should be done by defining an IT architecture that includes these components. Much like the architecture of an office block will include fire escapes and emergency exits, the IT architecture may include components whose sole purpose is to ensure service continuity. There are a number of common IT models which can be adopted to facilitate ITSC. Building IT architecture for a site doesn’t have to be an onerous task; commonly accepted models for IT resiliency and ITSC can be used (see 10.3). Selection of the appropriate model(s) depends on many things including IT architecture and service continuity considerations.
8.3 Key Service Continuity Factors
There are three key factors which should be balanced prior to deciding on the IT architecture:
a) Recovery Time Objective (RTO): How quickly after an incident the IT service needs to be restored.
b) Recovery Point Objective (RPO): The point in the processing cycle where the IT service can be resumed.
NOTE This could be at some consistent point prior to the incident e.g. the time of the last back-up. It comes down to answering the question: ‘How much live data can I afford to lose?’ If the answer is none then this will have a big impact on the third factor, cost.
c) Cost: Typically the smaller the RTO and RPO values, the higher the cost of the solution. Essentially the cost of the technology increases as the time to recover and the amount of data that can be lost decrease.
Since the availability and cost of technology solutions change over time, these decisions should be reviewed on a regular basis.
See Annex B for a more detailed discussion of IT Architecture considerations which influence service continuity. See also Annex C for a detailed discussion on virtualization and how such technologies might be used to build resilience into the IT Architecture and also assist continuity planning.
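The balance between RTO, RPO and cost can be illustrated by filtering candidate solutions on the required RTO and RPO and then taking the cheapest one that remains. This is only a sketch: the option names and every figure below are hypothetical, and real selection would involve the architecture considerations in Annex B.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContinuityOption:
    """A candidate continuity solution (all values are illustrative)."""
    name: str
    rto_hours: float    # time needed to restore the service
    rpo_hours: float    # maximum age of the data that can be recovered
    annual_cost: float


def cheapest_option(options: list, required_rto_hours: float,
                    required_rpo_hours: float) -> Optional[ContinuityOption]:
    """Return the lowest-cost option meeting both objectives, or None if none does."""
    candidates = [
        option for option in options
        if option.rto_hours <= required_rto_hours
        and option.rpo_hours <= required_rpo_hours
    ]
    return min(candidates, key=lambda option: option.annual_cost, default=None)


# Hypothetical catalogue: cost rises as RTO and RPO shrink.
OPTIONS = [
    ContinuityOption("tape restore", rto_hours=48, rpo_hours=24, annual_cost=20_000),
    ContinuityOption("log shipping", rto_hours=8, rpo_hours=1, annual_cost=80_000),
    ContinuityOption("synchronous mirroring", rto_hours=1, rpo_hours=0, annual_cost=250_000),
]

choice = cheapest_option(OPTIONS, required_rto_hours=12, required_rpo_hours=2)
print(choice.name if choice else "no option meets the objectives")
```

Setting the required RPO to zero leaves only the most expensive option in this catalogue, which mirrors the point made in the NOTE about the effect on cost of being unable to lose any live data.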
8.4 Populating the IT Service Continuity plan
8.4.1 General
If the IT infrastructure supports multiple services, for example a bank could provide separate independent cashier and mortgage application services, then the ITSC plan should be considered in multiple ways. One aspect is total failure of a site (or sites), another is the failure of individual IT services within a site. An ITSC plan should be part of a wider Business Continuity Management Plan and, as such, should adhere to any standards and terminology defined by that plan. If following an ITIL model for incident and problem management then the ITSC plan should also fall in line with ITIL processes. The model ITSC plan should contain the procedures to follow from initial response through to resumption of normal service following an incident (see 5.1).
8.4.2 Teams to populate the ITSC plan
In order to populate each part of the plan the following should be prepared.
a) Nominate members of management to form Incident Management Teams. The main role of these teams is to manage the recovery processes for each technology platform, each IT service and all required site facilities. The members of these teams should be trained to understand their responsibilities in the event of an incident.
b) Develop escalation and process flow charts so that, once the decision has been made to invoke, the correct ITSC procedures are followed to allow recovery to commence as quickly as possible.
c) Develop detailed procedures specifying how to recover each component of the IT systems. Although operations
synchronous remote mirroring of the savings database. Recovery takes the form of enabling the remote mirrors on the remote system, recovering the database environment and then allowing branch traffic to access the system from the remote site. The mortgage system uses a combination of tape back-ups and audit log shipping. To recover this environment, first reload the last known copy of the database from tape and then bring it up to date by reapplying the audit records read from the audit logs. The insurance system is a high availability clustered system which automatically fails over to the back-up site to provide almost uninterrupted service.
NOTE In this example there are no interdependencies between the individual systems. This may not be the case in reality. Quite often one system needs to be recovered before another can be brought on line.
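Where interdependencies do exist, a valid recovery sequence can be derived from a dependency map with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The dependencies shown are hypothetical and added purely for illustration; in the example above the three systems are independent.

```python
from graphlib import CycleError, TopologicalSorter


def recovery_order(depends_on: dict) -> list:
    """Order systems so that each is recovered after everything it depends on.

    `depends_on` maps a system to the set of systems that must be recovered first.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependencies: both mortgage and insurance need savings first.
dependencies = {
    "savings": set(),
    "mortgage": {"savings"},
    "insurance": {"savings"},
}

print(recovery_order(dependencies))

try:
    recovery_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("circular dependency: no valid recovery order exists")
```

A circular dependency is worth detecting during planning rather than during an invocation, since it means that no recovery order can satisfy the documented prerequisites.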
Figure 5 – Example of a high level process flow chart for service continuity management
Flow steps shown in the figure:
Disaster / Major Component Event; contact EMT members; assess scale of disaster.
Decision: Main site still usable and safe? If no, prepare backup site for full production running: switch all branch networks to remote site; re-route help desk and operations calls to remote site; notify branches of major disaster invocation; call in Disaster Operations Team; establish disaster operations bridge at remote site; switch remote access ports to remote site; end site preparation. If yes, continue.
Decision: Savings systems available? If no, failover savings systems to remote backup: UP mirror of savings database packs; short recovery of DBMS environment; restart DBMS support runs; allow branch traffic for savings systems; end of recovery of savings systems. If yes, continue.
Decision: Mortgage systems available? If no, failover mortgage systems to remote backup: reload mortgage systems from last PIT backup tapes; re-apply DBMS audit logs to mortgage database; validate mortgage database for corrupted entries; restart DBMS support runs; allow branch traffic for mortgage systems; end of recovery of mortgage systems. If yes, continue.
Decision: Insurance systems available? If no, failover insurance systems to remote backup: cluster failover insurance systems to remote backup; allow branch traffic for insurance systems; end of recovery of insurance systems. If yes, continue.
End failover checks.
Each process in this flow chart should be documented separately, with its own flowchart if necessary highlighting each task that forms the process. The documented procedures should provide detailed step-by-step instructions. The level of detail required in the plan will depend on the skill level of the intended audience. Each task shown in the top level process flow chart should be accompanied by a summary sheet containing the items shown in Figure 6.
Figure 6 – Task summary sheet
Task A-4: Call-in the fail-over operations team
Task description: Contact remote site on-call operations staff and request extra coverage at the remote site.
Essential documentation: Current remote site Operations Contact List – Contact-List.doc; Emergency Call Out Procedure – Emergency-Call-Out-.doc
Action takes place at: Wolverhampton Back-up Site
Task completed by: Remote Site Operations Support Manager
Preceding tasks: A-3
Time to complete task: 10 minutes
Requestor: Incident Management Team (BCM Manager)
Full description/reason for action: There is a need to provide full operations coverage at the remote site to augment normal skeleton staff. Thus need to invoke emergency on-call procedures for operations.
Status check: Ensure that the section below is completed and signed-off.
Name:
Signature:
Time:
Status and Comments:
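If task summary sheets are held electronically, the "Preceding tasks" field can be used to work out which tasks are ready to start at any point in a recovery. The following is a minimal sketch under stated assumptions: the field names are a hypothetical mapping of the sheet in Figure 6, and the details given for task A-3 are placeholders because Figure 6 only shows task A-4.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSummary:
    """A subset of the fields on a task summary sheet (field names are hypothetical)."""
    task_id: str
    title: str
    completed_by: str
    minutes_to_complete: int
    preceding_tasks: list = field(default_factory=list)


def ready_tasks(tasks: list, completed_ids: set) -> list:
    """Return the ids of tasks not yet done whose preceding tasks are all complete."""
    return [
        task.task_id
        for task in tasks
        if task.task_id not in completed_ids
        and all(previous in completed_ids for previous in task.preceding_tasks)
    ]


TASKS = [
    # A-3 is a placeholder: its content does not appear in Figure 6.
    TaskSummary("A-3", "placeholder preceding task", "unspecified", 15),
    TaskSummary("A-4", "Call-in the fail-over operations team",
                "Remote Site Operations Support Manager", 10,
                preceding_tasks=["A-3"]),
]

print(ready_tasks(TASKS, completed_ids=set()))
print(ready_tasks(TASKS, completed_ids={"A-3"}))
```

The same records could feed the status check at the foot of each sheet, for example by adding a task id to the completed set only once the sheet has been signed off.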
8.4.7 Fail-back
Although it may not be possible to plan for all post fail-over scenarios, where for example there has been total devastation of the production site, basic planning should be undertaken and the high level steps understood. When returning service to the original system or site, detailed plans should be created for the fail-back process. In these circumstances it is unlikely that fail-back will be a straightforward reversal of the fail-over steps and a separate set of procedures is likely to be required. Thus a full fail-back plan should be in place with the same quality and standard of documentation as for the fail-over. Figure 7 shows an example fail-back plan for the fictitious fail-over considered in Clause 8.4.6.
Figure 7 – Example of a high level process flow chart for fail-back
Flow steps shown in the figure:
EMT request fail-back; orderly shutdown of backup site.
Prepare production site for full production running: switch all branch networks to production site; re-route help desk and operations calls on production site; establish operations bridge at production site; switch remote access ports to production site; end site preparation.
Restore savings systems to production site: UP production mirror of savings database packs; short recovery of DBMS environment; restart DBMS support runs; allow branch traffic for savings systems; end of recovery of savings systems.
Restore mortgage systems to production site: reload mortgage systems backup tapes; re-apply DBMS audit logs to mortgage database; restart DBMS support runs; allow branch traffic for mortgage systems; end of recovery of mortgage systems.
Restore insurance systems to production site: cluster failover insurance system to production site; allow branch traffic for insurance systems; end of recovery of insurance systems.
End fail-back.
9 Rehearsing an IT Service Continuity plan
9.1 Introduction
The delivery of, and the feedback from, any rehearsal is one of the most interesting and fruitful parts of any business continuity programme. However, its success depends almost entirely on the way in which it is approached and developed. Good solid preparation ensures a sound delivery and everybody benefits from the exercise. Poor preparation leads to an ineffective rehearsal and the whole programme suffers. One unsatisfactory experience in an ill-conceived rehearsal will cause most participants to want to distance themselves from the whole concept of business continuity.
On the other hand, a well-prepared exercise will provide all of the participants with a profitable experience. They will be fully engaged in the opportunity to learn from practical experiences. Thus they will become more competent whilst gaining confidence in themselves as well as the plans and procedures.
It is important for the organization’s staff to be aware of and to recognize the differences between a service continuity rehearsal and an actual invocation. The main difference is the high degree of planning and preparation that is required for each rehearsal, both to ensure that there is little or no impact upon the live systems and to ensure that the rehearsal objectives are met wherever possible. All resources identified should be booked and made available for the planning and preparation required for the rehearsals. Also, during any rehearsal the live systems will still be running and therefore have to be maintained and supported.
9.2 Roles and responsibilities
The service continuity manager:
a) is responsible for service continuity;
b) is the service continuity management process owner;
c) leads the development of the service continuity recovery plan;
d) is the person who invokes the service continuity recovery plan;
e) is a senior member of the IT function;
f) does not need to be technical;
g) should understand the IT priorities of the users;
h) should not delegate responsibility;
i) should have cover during absence.
The service continuity recovery team:
a) participates in the rehearsing and invocation of the service continuity recovery plan;
b) includes technical staff for technical procedures;
c) includes users for rehearsing and during actual invocation;
d) includes departmental representatives for communication and coordination (in rehearsing and in invocation);
e) is led by the service continuity manager.
9.3 Rehearsal guidelines
Staff resources, costs and implications should be considered by the organization when planning for a rehearsal.
Staff resources are the most important element as without them the rehearsal would be difficult, if not impossible, to conduct. The staff resources should have the appropriate skills for any rehearsal, including appropriate platform knowledge, storage management knowledge and application knowledge.
NOTE These resources are not only required for the actual rehearsal but also for pre-rehearsal meetings and should allow sufficient time for preparation and planning. It is essential for senior management ‘buy-in’ to this.
Costs are perhaps the most sensitive consideration of any rehearsal as they are not insignificant. Therefore, each rehearsal should be scoped in order to leverage maximum rewards/benefits and strive towards the organization’s overall continuity objectives.
There are implications to conducting rehearsals which the organization’s senior management need to be made aware of. For example, whilst preparing, planning and attending the rehearsal, staff are not doing their day job and are therefore impacting upon existing services, processes, systems and projects. Senior management should be aware of this and plan accordingly.
Whilst everything is done to minimize the disruption rehearsals can have on the business, the following should also be considered:
a) Is it possible to time this rehearsal to cause the least disruption to business functions?
b) How much will the rehearsal cost? Is this appropriate for the additional confidence gained over other forms of rehearsing, including a tabletop or scenario exercise?
c) Does the rehearsal scope continue to progress against the agreed rehearsing strategy and associated annual plans?
d) How can staff be trained to cope with the situation if they do not experience it in rehearsal-mode?
e) Once the BCMP is in operation, how will you return to normal business operations? Are there specific issues here that warrant rehearsing in their own right?
f) How different are the circumstances of an actual invocation likely to be relative to those of a rehearsal?
NOTE For example it may be advisable to use copies of live systems and data in a rehearsal, the emotional environment of a rehearsal is likely to be more relaxed than in a real incident, etc.
9.4 Business user rehearsing
Whilst the organization’s technical staff performs the service continuity rehearsal, the business users should validate the recovered applications and services. Therefore, they should understand their role to allow them to prepare appropriately. All business users who take part in service continuity rehearsals should be aware of the artificial environment, benefits of rehearsing, preparing and using rehearsal scripts and data input for validation.
The environment used for an exercise might not be identical to the live environment in an actual invocation therefore participants should be aware of and understand the differences. For example, they might not have access to current data or logons might be different.
Business users should rehearse to validate the recovery and feel confident that their applications and services can be recovered. Rehearsing should also provide valuable feedback to the organization, ensure the recovery is achieved as expected and offer opportunities for improvement.
Business users should develop rehearsal scripts which can be followed during a rehearsal to ensure that the appropriate elements for a particular rehearsal are tested. Rehearsal scripts also provide valid input into the audit process. As the rehearsals become more complex, they should be as real as possible to be able to track data through the various recovered systems from front office to back office. The input data should be validated and the results, when running the rehearsal scripts, transactions and batch jobs, should be checked against pre-defined expectations.
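Checking rehearsal results against pre-defined expectations can be as simple as comparing two sets of named checks and listing the differences. The sketch below is illustrative only: the function name, the check names and the values are hypothetical and would in practice come from the business users' own rehearsal scripts.

```python
def validate_rehearsal(expected: dict, actual: dict) -> list:
    """Compare actual rehearsal results with pre-defined expectations.

    Returns a list of (check, expected value, actual value) tuples for every
    check that is missing or does not match; an empty list means all passed.
    """
    discrepancies = []
    for check, wanted in expected.items():
        got = actual.get(check, "<missing>")
        if got != wanted:
            discrepancies.append((check, wanted, got))
    return discrepancies


# Hypothetical expectations written into a rehearsal script before the exercise.
expected_results = {
    "customer record count": 1250,
    "last posted transaction id": "TX-0042",
    "overnight batch status": "completed",
}
# Hypothetical values observed by business users on the recovered system.
observed_results = {
    "customer record count": 1250,
    "last posted transaction id": "TX-0040",
}

for check, wanted, got in validate_rehearsal(expected_results, observed_results):
    print(f"{check}: expected {wanted!r}, got {got!r}")
```

The resulting list of discrepancies is the kind of documented evidence that can be passed to the audit process and used to update the plans after the rehearsal.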
9.5 Strategy To achieve the o rganization’s ITSC ob jectives, a combination of the following recommendations should be
© BSI 11 August 2006
considered. The f requency of e xercises will depend on the individual circumstances of your organization but accepted best practice is to exercise plans at least once a year. a) ‘Callout’ rehearsals should be conducted regularly, in addition a surprise callout rehearsal should be conducted involving all departments and the IMT. b) Walk through reviews of recovery plans, emergency management plans and departmental plans. c) Scenario-based walkthrough exercises for IMT, support teams and individual departments.
d) Component rehearsing (e.g. individual departments, business processes, IT systems, voice and data network links, etc.), for instance when new systems are implemented, when there have been previous rehearsal failures, when changes occur or for previously unrehearsed components. Component testing should also be considered during periods when a more comprehensive test cannot be completed, e.g. test that network traffic can be redirected to the fail-over site, that users can connect to the fail-over site and that live data can be restored at the fail-over site.
e) Integration rehearsals (e.g. multiple systems and / or business processes). Where IT services rely upon combinations of information systems working together, the organization should reassure itself that it is capable not only of recovering the individual systems but also of recovering them in such a way that they interact as expected and provide the required services.
f) Relocation rehearsals (technical and business recovery), whereby key parts of the business relocate to, and operate from, the recovery site, including the loss of the main facility, an IT switch or critical business processes.
g) Fail-over rehearsals of the live IT environment to the recovery site (including verification by users) and business relocation rehearsals.
h) Major incident simulations, which should include scenario-based role playing exercises, IT fail-over, business relocation and full fail-back rehearsals.

In all cases, results should be documented and updates to the appropriate continuity plans completed within four weeks of each rehearsal. All rehearsing should be carefully managed and coordinated to ensure low risk to the business with maximum return on the effort put in.
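The two timing recommendations in this subclause (exercising plans at least once a year, and completing plan updates within four weeks of a rehearsal) could be tracked with a sketch such as the one below. The function names are hypothetical, and treating "a year" as 365 days is a simplifying assumption.

```python
from datetime import date, timedelta

EXERCISE_INTERVAL = timedelta(days=365)  # assumption: "once a year" taken as 365 days
UPDATE_WINDOW = timedelta(weeks=4)       # plan updates due within four weeks


def exercise_overdue(last_exercised, today):
    """True if more than a year has passed since the plan was exercised."""
    return today - last_exercised > EXERCISE_INTERVAL


def update_deadline(rehearsal_date):
    """Latest date by which continuity plan updates should be completed."""
    return rehearsal_date + UPDATE_WINDOW


# Illustrative dates only.
overdue = exercise_overdue(date(2005, 1, 1), date(2006, 8, 11))
deadline = update_deadline(date(2006, 8, 11))
print(f"Exercise overdue: {overdue}; plan updates due by {deadline.isoformat()}")
```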
9.6 Rehearsal programme management

To support the rehearsal programme, an adequate management framework should be in place, as illustrated in Figure 8.
Figure 8 – Suggested Programme Management Organization

[Figure: organization chart with boxes labelled Compliance / Audit Team, Business Continuity Steering Group, Business Continuity Coordinator, IT Rehearsal Working Group and Business Continuity Rehearsal Group.]
The suggested roles are as follows:

a) Business Continuity Coordinator: the key facilitator of the Business Continuity function.
b) Compliance / Audit: oversees recovery rehearsals and exercises and ensures they meet the regulatory requirements and satisfy external auditors.
c) Business Continuity Steering Group (BCSG): oversight committee for the entirety of the business continuity function, consisting of senior representation from all business areas to reflect the business-wide impact of business continuity planning and management.

NOTE As part of the organization’s rehearsal strategy, the Business Continuity function should maintain a rolling rehearsal schedule. The Business Continuity Steering Group should sign off the rehearsal programme as part of this document being issued.

d) IT Rehearsal Working Group: responsible for planning technical IT aspects of recovery rehearsals.
e) Business Continuity Rehearsal Group: chaired by the Business Continuity Coordinator and including representatives from the IT Support Groups and Compliance / Audit. The Business Continuity Rehearsal Group reports to the BCSG and is responsible for:
1) planning and executing all ad hoc infrastructure rehearsing and regular full-scale service continuity simulation rehearsals;
2) agreeing the rehearsal scope and objectives with the business, via the BCSG;
3) pre-rehearsal planning and preparation;
4) production of the rehearsal plan document;
5) coordination of activities during the rehearsal;
6) post-rehearsal reporting;
7) follow-up of actions arising.
The Business Continuity Rehearsal Group should be a business-led, rather than an IT-led, group. It should meet regularly, as required to meet the above responsibilities. Typically this will be monthly, increasing in frequency in the weeks before a rehearsal.
9.7 Rehearsal planning process

9.7.1 Rehearsal plan contents

An effective rehearsal should contain:

a) a body responsible for control and coordination;
b) objectives and success criteria;
c) a rehearsal plan and schedule;
d) a reversion plan allowing restoration back to live service at certain key points;
e) briefing of participants;
f) management and coordination;
g) event logs and rehearsal feedback forms;
h) independent observers;
i) post-rehearsal reporting, follow-up and action plan.

Post-rehearsal reporting should draw on a variety of sources, e.g. the number of helpdesk calls received for the duration of the test compared with the normal number of calls for the day and time at which the test was carried out, to see whether an increased number of incidents was recorded.
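The helpdesk comparison mentioned above amounts to a simple relative change, sketched below. The function name and the figures are illustrative assumptions; in practice the baseline would come from the organization's own helpdesk records for the same day and time.

```python
def call_volume_change(calls_during_test, baseline_calls):
    """Relative change in helpdesk calls against the normal level.

    Both arguments are call counts for the same length of time; the
    baseline is the normal count for that day and time of day.
    """
    if baseline_calls <= 0:
        raise ValueError("baseline_calls must be positive")
    return (calls_during_test - baseline_calls) / baseline_calls


# Illustrative figures only: 46 calls during the test window against a
# normal level of 40 calls for the same window.
change = call_volume_change(46, 40)
print(f"Helpdesk calls changed by {change:+.0%} during the rehearsal")
```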
9.7.2 Rehearsal planning principles

The rehearsal process includes a number of principles, which should be applied throughout the planning process:

a) Document an overall rehearsal strategy with a desired objective to be reached within a clearly defined timeframe, which should include the move to rehearsing invocation.
b) Involve the customers in the service continuity rehearsing process.
c) Document and agree a detailed annual plan and rehearsal programme which relates to the overall rehearsing strategy.
d) Set real and achievable objectives with realistic dates.
e) Ensure that all critical daily tasks and housekeeping routines are included.
f) Include Business Continuity aspects and Business Recovery rehearsing in the plans.
g) Include scenario planning / rehearsing with a generic priority list.
h) Promote continuous improvement by following up actions, suggestions and ideas from previous rehearsals.
i) Include the Service Continuity Management team in rehearsals and test their abilities.

9.7.3 The importance of rehearsing

Rehearsing is a vital part of the long term BCM lifecycle, which will prove the viability of recovery plans and highlight areas for further improvement. It also provides an ideal training opportunity for those involved in the key activities. Rehearsals are so called so that areas of weakness can be identified and new processes implemented to improve resilience. It is crucial that rehearsals are seen as positive tasks and that any internal political influences are eliminated, so that the focus on business resilience and continuity is maintained. The overall aims of the rehearsing strategy are to ensure effective crisis management and to enable live processing to be moved to the recovery site(s) on a regular basis and become part of business as usual.

NOTE Even the most comprehensive rehearsal does not cover everything. For example, in a service disruption where there has been injury or even death to colleagues, the reaction of staff to a crisis cannot be rehearsed and the plans should make allowance for this.

Rehearsals should have clearly defined objectives and critical success factors, which will be used to determine the success or otherwise of the exercise as well as of the BCP itself. A full rehearsal should replicate the invocation of all standby arrangements, including the recovery of business processes and the involvement of external parties. This should test the completeness of the plans and confirm:

a) time objectives, e.g. to recover the key business processes within a certain time period;
b) staff preparedness and awareness;
c) staff duplication and potential over-commitment of key resources during invocation of the BCP;
d) the responsiveness, effectiveness and awareness of external parties.

Rehearsals may be announced or unannounced. However, in the latter case senior management should give approval in advance; otherwise it may be difficult to achieve commitment.
9.7.4 Rehearsal objectives

The rehearsal strategy should meet the objectives to:

a) validate emergency callout procedures and contact details contained in the recovery plans;
b) ensure key staff are familiar with their Incident Management, Business Recovery and Technical Recovery plans;
c) prove the ability to recover the technical IT and communications infrastructure;
d) prove the ability of critical staff to relocate to and work from the nominated recovery site(s);
e) validate the effectiveness and accuracy of the documented IT and Business Recovery plans.
9.7.5 Planning a rehearsal

All parts of each rehearsal should be planned in advance. Without planning and preparation the following could occur:

a) objectives will not be met and live systems could be adversely affected;
b) the rehearsal could fail, which will cause the staff involved to disassociate themselves from Business Continuity and service continuity rehearsal;
c) the identified resources (staff and other) may not be available when required or may not be appropriate, e.g. in terms of skill sets, communications links or server specification;
d) there is nothing to measure progress against and therefore no opportunity to improve the rehearsing process;
e) expectations of the organization’s staff and customers may not be met or may remain unknown.

NOTE In many ways each rehearsal can be viewed as a ‘project’ in that it has defined start and end points and should have agreed objectives and desired outcomes. For guidance on best practice in project management and planning the reader should refer to PRINCE2 [2] and / or the Project Management Institute’s ‘Project Management Body of Knowledge’ [3].
10 Solutions architecture and design considerations

10.1 General

Service continuity may be achieved in many ways, ranging from replicating every single IT component to removing all known single points of failure from those components. There are many available models to choose from, as illustrated in Figure 9. An organization may favour one particular model but also use components of several others to complete the IT architecture.
Figure 9 – Infrastructure Architecture Models for Business/Service Continuity

Site:         Site recovery; Site/data centre failover
Application:  Application failover/load balancing; Redundant systems
Data:         SAN, NAS & DAS; Backup and restore
Platform:     Rapid equipment replacement; High availability system features
If the IT architecture is changed to support ITSC, then the change should be checked to ensure it does not compromise continuity or security. A review of the complete environment should therefore be undertaken to ensure security is maintained at the same level. This should include a thorough examination of alternative / back-up sites and the network links between them. The following should be considered:

a) Is the replication of data exposing client data?
b) Are the service continuity rehearsal plans secure, or could they be used to identify weaknesses in the IT architecture?
c) Are there unused service continuity rehearsal Internet Protocol (IP) addresses which, during normal operation, a hacker could use to gain access to the network?

The classic approach to ITSC is to use a two-site model, in which a back-up site can continue to provide a service when the main site is disabled or destroyed by an incident. There are a number of ways in which this remote site model may be implemented (see Annex D), depending upon the organization’s requirements.
10.2 System resilience

Typically, any system running mission critical applications should be locally resilient. This means that the central system has no known single points of failure in components such as power supplies, CPUs and I/O processors. In addition, paths to multiple peripherals are duplicated or duplexed, and disk devices are mirrored or part of a Redundant Array of Independent Disks (RAID) configuration. Loss of any single component should not cause an interruption to service. Further information can be found in Annex E.
10.3 Application resilience

Application software may also play a part in system resilience by creating cluster systems, which are viewed as a single system by the outside world but are implemented physically as multiple independent systems with automated fail-over between hosts. There could be issues relating to the sharing of databases (see D.2). Clustering and database sharing should be implemented if there are concerns around hardware or even software stability.

Any application resiliency mechanisms should ensure recovery of data to consistent points. For example, if a database has data on one volume and the indices on another, then the application should ensure that updates to the disks are either all applied or none applied – i.e. the update is ‘atomic’. Databases that are resilient in this way are said to adopt Atomic, Consistent, Isolated and Durable (ACID) properties.

A stateless server is one that provides a service but retains no transaction state information between interactions from the client. Each transaction is atomic, i.e. self-contained, with no relation to preceding or following interactions. An example of this type of server is a web server; web applications are typically stateless. Stateless servers are naturally good candidates for the creation of server farms: large groups of servers that all offer the same level of service. When optimum load is exceeded, another server running the same stateless server software should be added.
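The ‘all applied or none applied’ behaviour described above can be illustrated with a small sketch using Python’s built-in `sqlite3` module. SQLite is used here only as a convenient stand-in for any ACID database; the table, the account names and the `transfer` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("front_office", 100), ("back_office", 0)],
)
conn.commit()


def transfer(conn, source, target, amount):
    """Apply two related updates as a single atomic unit."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed; the first update has been undone too.
        return False


def balances(conn):
    """Return the current balances as a dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "front_office", "back_office", 60)         # both updates applied
rejected = transfer(conn, "front_office", "back_office", 500)  # neither update applied
print(ok, rejected, balances(conn))
```

In the second call the first update succeeds and the second violates the constraint, so the whole transaction is rolled back and the data returns to its last consistent point.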
10.4 Network resilience

The network should be resilient and capable of handling the fail-over approach. There should be adequate communication bandwidth between sites to allow production to switch from one site to another and for performance to remain acceptable for business needs. Where appropriate, networks should use dual paths between critical systems, both within a site and between sites, with all components replicated (switches, network cards, etc.). Single points of failure should be identified and a risk analysis performed to determine whether the risk is acceptable.

Alternative network providers should be considered for inter-site links. This includes the last mile from any major trunks to the site, with cabling routed independently, following separate routes into the building and terminating at physically separate communication equipment.
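One simple way to identify single points of failure between two sites is to remove each intermediate component in turn and check whether a path still exists. The sketch below does this with a breadth-first search; the topology and component names are invented for illustration, and a real network model would be considerably more detailed.

```python
from collections import deque


def undirected(edges):
    """Build an adjacency map from a list of (a, b) links."""
    links = {}
    for a, b in edges:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


def reachable(links, start, goal, removed=None):
    """Breadth-first search from start to goal, ignoring `removed`."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in links.get(node, ()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(links, start, goal):
    """Intermediate components whose loss disconnects start from goal."""
    candidates = set(links) - {start, goal}
    return sorted(
        n for n in candidates if not reachable(links, start, goal, removed=n)
    )


# Hypothetical topology: two switches at the main site, one shared carrier.
single_carrier = undirected([
    ("main_site", "switch_a"), ("main_site", "switch_b"),
    ("switch_a", "carrier_1"), ("switch_b", "carrier_1"),
    ("carrier_1", "recovery_site"),
])
# The same topology with a second, independent carrier for switch_b.
dual_carrier = undirected([
    ("main_site", "switch_a"), ("main_site", "switch_b"),
    ("switch_a", "carrier_1"), ("switch_b", "carrier_2"),
    ("carrier_1", "recovery_site"), ("carrier_2", "recovery_site"),
])
spof_single = single_points_of_failure(single_carrier, "main_site", "recovery_site")
spof_dual = single_points_of_failure(dual_carrier, "main_site", "recovery_site")
print(spof_single, spof_dual)
```

Each component reported in this way would then be a candidate for the risk analysis recommended above.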
10.5 Data resilience

Typically, computer systems are reliant on the resilience of their disk-based data storage. There are many different models that can be adopted to ensure data resilience, some of which are described in Annex F, which discusses various approaches to resilience. Organizations should select the most appropriate model or models.
11 Buying Continuity Services

11.1 General

Buying continuity services is not a simple process. Any organization that chooses to minimize its risks by outsourcing to a third party should assess the viability and sustainability of the service it is buying. This is especially the case for continuity services, which may never be used and are hard to rehearse outside of a controlled and pre-planned environment. Paradoxically, it is quite possible that buying continuity services from an external supplier could compromise an ITSC plan if the due diligence on that supplier and its services has not been thorough.

An organization should understand how a continuity services organization (supplier) makes money. For example, the supplier invests in resources (buildings, infrastructure, IT equipment etc.) that may be required by a client if an incident or failure occurs. To ensure that the continuity services are economically viable and thereby affordable to a client, and also to ensure the supplier is profitable, the supplier syndicates those resources across as many clients as possible. The supplier then manages the chance (risk) of more than one client invoking the service and thereby demanding access to the same resources simultaneously. The implication is that, if the supplier does not manage the risk of multiple, simultaneous invocations both professionally and reasonably, then the buyer could, in the event of a major incident, be denied access to the very resources it has subscribed to and could thereby struggle to recover IT and resume business.

There is a range of questions to which satisfactory answers should be required when buying any service or product from an organization, irrespective of industry.
This section is focused on the specialist due diligence required when buying continuity services and assumes the reader is already versed in standard purchasing practices such as financial due diligence and validating the accreditations of a supplier. Further information on best practice in this area can be found at The Chartered Institute of Purchasing and Supply 5).
11.2 Syndication management

There is a high chance that companies based in close proximity could be affected by the same incident or event that can disrupt IT and Business Continuity. There are many examples of this, notably the terrorist attacks on New York in 2001 and on London in 2005, accidents such as the Buncefield oil terminal explosion and natural disasters such as the Asian Tsunami and Hurricane Katrina.

The supplier should be able to demonstrate its risk management system and the methods it uses to ensure the risks of multiple, simultaneous invocations (which major incidents and natural incidents imply) are as low as possible. It is also highly advisable to assess the method of syndication used by the supplier and match it against the levels of risk that your organization will find acceptable. For example, the supplier may offer lower prices if the buyer is prepared to accept a higher syndication rate (risk).
11.3 Syndication ratios

The supplier could quote a ratio of clients that it will allow to subscribe concurrently to a particular resource, e.g. 25 clients share one computer. However, this ratio is just one aspect of the risk level that a buyer should be aware of and should not be accepted on its own as a satisfactory indication of the chances of gaining access to the resource you have subscribed to should an incident occur. The supplier should be able to produce automatically a risk listing of:

a) its clients;
b) their industry;
c) their location;
d) the resources under cover;
e) the number of times it has sold those same resources;
f) the speed with which the resources are to be delivered and / or made available;
g) the length of time the resources may be required after an incident.

This report should be made freely available to the buyer, who can then determine whether the risk of buying from the supplier is acceptable. Risk management is a dynamic process. The buyer of continuity services should periodically request and see the syndication report from the supplier and thereby be able to assess its own risk position continually.
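A buyer who receives such a report in machine-readable form could summarize it along the following lines. The `Subscription` record, its field names and the acceptable ratio are hypothetical; the PAS does not prescribe a report format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    """One line of a hypothetical supplier syndication report."""
    client: str
    industry: str
    location: str
    resource: str
    delivery_hours: int     # speed with which the resource is made available
    max_days_required: int  # length of time the resource may be required


def syndication_counts(report):
    """Number of times each resource has been sold."""
    return Counter(entry.resource for entry in report)


def over_syndicated(report, acceptable_ratio):
    """Resources sold more often than the buyer is prepared to accept."""
    counts = syndication_counts(report)
    return sorted(r for r, n in counts.items() if n > acceptable_ratio)


# Invented entries for illustration.
report = [
    Subscription("client_a", "banking", "London", "computer_suite_1", 4, 30),
    Subscription("client_b", "retail", "Leeds", "computer_suite_1", 8, 14),
    Subscription("client_c", "insurance", "London", "computer_suite_1", 4, 60),
    Subscription("client_a", "banking", "London", "work_area_seats", 2, 30),
]
counts = syndication_counts(report)
flagged = over_syndicated(report, acceptable_ratio=2)
print(dict(counts), flagged)
```

The count alone is not a sufficient indicator, as noted above; industry, location and delivery commitments in the same records would need to be reviewed alongside it.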
11.4 Location of clients

It is important, when buying Continuity Services, to understand not only the number of clients sharing a resource but also their location. As an example, an organization is unlikely to find it acceptable to share the same resource with another client of the supplier in the same building, street or immediate area.

5) http://www.cips.org
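If the locations of other clients sharing a resource are known, their proximity to the buyer’s premises can be estimated with the haversine great-circle formula, as sketched below. The coordinates, client names and the one-kilometre radius are purely illustrative assumptions; the acceptable distance is for the buyer to decide.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(a, b):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def clients_within(buyer_site, other_clients, radius_km):
    """Names of clients whose premises lie within radius_km of the buyer."""
    return sorted(
        name for name, site in other_clients.items()
        if distance_km(buyer_site, site) <= radius_km
    )


# Invented coordinates: one client a few streets away, one in another city.
buyer = (51.5155, -0.0922)
others = {
    "client_a": (51.5142, -0.0931),
    "client_b": (53.4808, -2.2426),
}
nearby = clients_within(buyer, others, radius_km=1.0)
print(nearby)
```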
11.5 Risk presented by other clients

In addition to the location of other clients, it is also crucial to understand who those clients are and the industry they are in. By doing so the buyer is able to determine the likely threat those other clients could pose to its own ITSC plans, e.g. whether their very presence could constitute a threat or whether they could be a target for extremists, which could have a knock-on effect on your own ITSC. This is a dynamic equation and will often provide a range of risk positions dependent upon the current political climate. As an example, it would be appropriate to know if you are subscribing to syndicated resources that are shared with an organization that could be classed as a welfare, social or political risk, e.g. an organization that could be the target of an animal rights group, a forestry business that could be threatened by an environmental pressure group or an organization that could be known to sympathize with a particular side in an area of political unrest.

It is often difficult to know exactly what risks other clients may actually place on you. Clearly there are limits to what you can do; however, it is often worth imagining (if not actually doing) a helicopter scan over your premises and the surrounding areas of other clients. You may, for example, not be aware that your neighbour is storing gas cylinders in their work yard or is charging fuel tanks next to your building. You may be closer to a flood plain than you had originally thought, or there could be building works going on that could, by accident, cut your telecoms lines. The Buncefield oil terminal explosion proved to be a classic example of a single incident causing direct and consequential issues for many companies.
11.6 Location and Physical Security

When there is an incident that has to be managed by the police and other emergency / security services, an area could be cordoned off for safety and on-going incident management purposes. This could mean that access to the premises is denied and buildings could be evacuated. When buying a continuity service, the buyer should expect that its chosen supplier does not sell the same resources to any other companies within the same geographic location. The buyer should, in advance of subscribing to the service, understand the typical size of an exclusion zone enforced by the security services and determine what is a reasonable and satisfactory area within which to demand exclusive access to the syndicated resources. The buyer can then ask the supplier to prove that it has allocated the requisite exclusion zone.

Another consideration is that of the physical and environmental security measures which are in force in the recovery site. These should be equivalent to those for the primary location and should be regularly audited against specific and detailed requirements.

11.7 Rehearsing

An ITSC plan should always be rehearsed to ensure it is current and appropriate to meet the required ITSC service levels. It is crucial when procuring continuity services from a supplier that the services are rehearsed.

A supplier will have a finite amount of resource (both equipment and people). It is important when buying a service that the supplier's resources are known to the buyer, to help it gauge the chances of service provision when an incident or interruption occurs. This information should be made readily available by the supplier; however, one way to gauge the amount of available resource is to request scheduled and unscheduled rehearsals. If a supplier is under-resourced to meet its contractual obligations, it is unlikely to be able to honour short-timescale scheduled rehearsals. Should this happen, alarm bells should ring, as a lack of resource in a rehearsal, when all is relatively calm and quiet, is likely to mean over-stretched resources and over-syndicated services. If there is any doubt, the buyer should ask deeper questions to ensure its own risk management levels have not been compromised.
Annex A (informative)
Conducting business criticality and risk assessments

A.1 General

The approach described here is a variant of Failure Modes and Effects Analysis (FMEA). A variant is suggested because the standard FMEA approach assumes application to a business process and concentrates on the causes and effects of disruption or failure of steps in the process.

NOTE Such an approach would not be directly applicable to a departmental management process or a project, though sufficient common ground exists for the approach to be adapted for those circumstances. The variant is also required because, since the development of FMEA, the risk management industry has widely accepted that the concept of risk includes both threats and opportunities.

Figure A.1 indicates the steps in the risk assessment process, which results in the development of an ITSC plan (see Clause 9).
Figure A.1 – Risk assessment process

[Figure: diagram with boxes labelled Process and Risk Identification, Response Selection, Response Planning, Assign Responsibility and Implementation, Rehearse and Learn Lessons, and IT Service Continuity Plan.]
NOTE Where systems and / or IT services are involved in safety critical environments, such as on oil rigs, nuclear power plants etc., more sophisticated approaches to risk management such as Monte Carlo Analysis may be more appropriate.
A.2 Process and risk identification

The heart of any process for assessing risk should have a ‘types of risks’ set that can be easily understood by those conducting the assessment. In the case of a physical risk assessment, this should involve identifying the hierarchy of IT services that will be the subject of the assessment, the owners of each service and the dependencies between them. In the case of an organizational risk assessment, it involves identifying the organization’s structure and the processes for which each node in the structure is responsible, the owners of each process and the dependencies between them.
NOTE This does not mean that all risk assessments should start with a business modelling exercise, since in many cases this information will already be available. Where the information exists, common sense suggests that it would be prudent to review it to ensure continued accuracy, but under no circumstances should effort be expended in reproducing work that already exists in an acceptable form.

The object of the risk assessment should therefore be to define the possible changes, understand how likely they are and how each change would impact IT service provision. The types of risk that can be identified include changes to:

a) business process or activity, including risks ranging from catastrophic failure through minor disruption to positive improvement in productivity;
b) dependencies, including risks ranging in effect from the collapse of a critical supplier of goods or services to the temporary failure of an information flow from another business process;
c) plant or equipment;
d) buildings and environment;
e) information technology or systems;
f) information security, including confidentiality, integrity and availability;
g) projects, including risks associated with not delivering the specified solution, risks associated with the solution and risks associated with its delivery.

In assessing the types of risk to which a physical or organizational component of the business could be subject, the assessment should be well informed and based on verifiable evidence. Where possible and appropriate, the views of acknowledged experts should be called upon to ensure that the assessment of the nature and likelihood of a particular risk is as realistic as possible.

All risks identified during this activity should be described in the ITSC plan. At this stage it is only necessary to record summary details for each risk, including a name, which should convey something of the nature of the risk, and a one- or two-sentence description of the nature of the risk. The probability of a risk occurring should be determined according to Table A.1.
Table A.1 – Probability of risk occurring

Probability   Definition
Low           The risk is not expected to occur more than once per year.
Medium        The risk is not expected to occur more than once per quarter.
High          The risk is expected to occur at least once per month.
Very High     The risk is known to exist or is expected to occur frequently and / or regularly.
In performing a risk assessment one should identify not only the immediate effects of the risk occurring but also the impact on the business of those effects. For example, the effect of a hard disk problem could be the corruption of some data stored on that disk, whilst the business impact of corrupt data relating to customer accounts could be significant cash flow problems and could also adversely affect the organization’s reputation for excellence. In general, the assessment of each risk should consider the impact on:

a) environment;
b) financial performance of the organization;
c) health and safety of employees and the public;
d) morale of employees;
e) productivity and process efficiency;
f) product quality;
g) business controls;
h) regulatory or legislative compliance;
i) reputation of the organization with its customers, investors, staff and suppliers;
j) political impact at local, regional, national and international level.

When assessing the impact of a risk one should ensure that the assessment is well informed and based upon verifiable evidence; hence, expert opinion should be called upon where possible and appropriate. Table A.2 categorizes the impact of a risk.
Table A.2 – Impact of risk

Impact        Definition
Low           Expected to have a minor negative impact. The damage would not be expected to have a long term detrimental effect.
              Example: very short-term (less than five minutes) power failure
Medium        Expected to have a moderate negative impact. The impact could be expected to have short to medium term detrimental effects.
              Example: short-term (less than one hour) failure of email system
High          Expected to have a significant negative impact. The impact could be expected to have significant medium to long term effects.
              Example: unexpected failure of online banking system resulting from unknown cause.
Very High     Expected to have an immediate and very significant negative impact. The impact could be expected to have significant long term effects and potentially catastrophic short term effects.
              Example: data centre destroyed by fire or flood
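The categories in Tables A.1 and A.2 lend themselves to a simple risk record, sketched below. The record layout follows the summary details recommended in A.2 (a name and a short description); ordering risks by impact and then by probability is an assumption made for this example and is not a ranking method given in this PAS. The probabilities assigned are invented.

```python
from dataclasses import dataclass
from enum import IntEnum


class Probability(IntEnum):
    """Categories from Table A.1."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    VERY_HIGH = 4


class Impact(IntEnum):
    """Categories from Table A.2."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    VERY_HIGH = 4


@dataclass
class Risk:
    name: str
    description: str
    probability: Probability
    impact: Impact


def prioritise(risks):
    """Order risks by impact, then probability, highest first (assumed ordering)."""
    return sorted(risks, key=lambda r: (r.impact, r.probability), reverse=True)


# Impact categories follow the examples in Table A.2; probabilities are invented.
register = [
    Risk("email outage", "Email system fails for less than one hour.",
         Probability.MEDIUM, Impact.MEDIUM),
    Risk("data centre loss", "Data centre destroyed by fire or flood.",
         Probability.LOW, Impact.VERY_HIGH),
    Risk("power blip", "Power fails for less than five minutes.",
         Probability.HIGH, Impact.LOW),
]
ordered = [risk.name for risk in prioritise(register)]
print(ordered)
```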
A.3 Response selection

Implementing a risk response should only be done if the tangible and intangible benefits of doing so outweigh the tangible and intangible costs. In addition, the tangible and intangible costs of preparing the response and ultimately of deploying it should not outweigh the costs of taking no action. Since success in business involves a degree of risk taking, there will be risks that the business is happy to accept in the expectation that doing so will result in improved profitability, market share or other tangible benefits. The body responsible for deciding which responses should be implemented should consider the questions listed in Table A.3.
Table A.3 – Questions

Question: Is the risk likely to result in a positive outcome?
Options: If so, a response should be devised which causes the risk to occur and maximises the benefit derived from it. If not, consideration should be given to a response which would avoid, eliminate or mitigate the risk or its impact.

Question: Is the risk sufficiently likely or its impact sufficiently significant to justify implementing the response?
Options: If so, some form of response would appear appropriate.

Question: Would a decision not to develop a response leave the organization or its officers open to civil or criminal litigation?
Options: If so, prudence would suggest that some form of appropriate response should be developed.

Question: Would the benefits (both in terms of risk mitigation and other consequential improvements) to the business from implementing the response outweigh both the costs of taking no action and the costs associated with the implementation?
Options: If not, consideration should be given to alternative approaches which cost less to implement or, in some cases, to whether the organization is prepared to accept the risk.
In order to ensure that Risk Management represents a viable and positive investment for the future of the business, a cost-benefit analysis for each possible risk response should be conducted. The objective of this exercise is to determine whether the benefits of taking action will outweigh the costs of taking no action. This analysis is then fed into the decision making process for selecting the responses to be implemented. From the risk profiles (see A.2), documented in the ITSC plan, obtain details of the financial costs of:
a) the estimated cost of taking no action in the event that the risk occurs, i.e. the impact cost;
b) the estimated development and implementation costs of existing and new counter-measures;
c) the estimated costs that would be prevented or averted by implementing the proposed counter-measures.
In addition to these financial costs, other factors should be taken into account, such as the organization’s reputation, employee health, safety and morale, environmental protection, security and the confidence of investors, customers and regulators. In each case, an estimate of the impact on the intangible factors should be made for taking no action, for preventing the risk and for implementing the proposed counter-measures. By examining the intangible costs in conjunction with the financial costs a broader picture is seen. This can be fed into the process of deciding whether a response should be implemented for the risk(s) in question.
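The financial side of this comparison can be sketched in a few lines of Python. This is an illustration only: the class name `ResponseOption`, its field names and the sample figures are assumptions made for the example and are not defined by this PAS, and the intangible factors still have to be weighed separately.

```python
from dataclasses import dataclass


@dataclass
class ResponseOption:
    """Financial figures for one proposed risk response (hypothetical structure)."""
    name: str
    impact_cost: float          # a) cost of taking no action if the risk occurs
    implementation_cost: float  # b) cost of developing and implementing counter-measures
    averted_cost: float         # c) cost prevented or averted by the counter-measures

    def net_benefit(self) -> float:
        # Benefit of acting: the averted cost minus the cost of the response itself.
        return self.averted_cost - self.implementation_cost

    def is_financially_justified(self) -> bool:
        # Worth considering when the net benefit is positive and the response
        # does not cost more than simply accepting the impact.
        return self.net_benefit() > 0 and self.implementation_cost <= self.impact_cost


# Sample figures are invented for illustration.
option = ResponseOption("second power feed", 500_000, 120_000, 450_000)
print(option.name, option.net_benefit(), option.is_financially_justified())
```

A response that fails this financial test may still be justified on intangible grounds such as reputation or safety, which is why the result is an input to the decision rather than the decision itself.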
A.4 Response planning
A.4.1 General
Once decisions have been made regarding the risk responses that are appropriate for the circumstances, the implementation of each element of the response should be carefully planned. Risk response planning is concerned with ensuring that resources are deployed effectively and efficiently, paying particular attention to maximizing the benefit to the business from implementing the response.
For each response the most appropriate people should be involved in its development and implementation. As such, the participants may work in areas of the business other than that affected by the risk, or may indeed work for other stakeholders. The plan for implementing each risk response should identify:
a) the scope and objectives of the response, such as RTO, RPO etc;
b) planning assumptions;
c) pre-requisites;
d) summary of resources required (people, facilities, equipment, money);
e) work breakdown, identifying the sequence of activities required to implement the response, including the resources required for each step and estimates of the effort and elapsed time required.
Based upon the work breakdown for all required risk responses, a schedule of work should be created in which the timing of each activity should be determined by the availability of the time, effort and people required to complete it.
A.4.2 Assign risk category
Based on Table A.2, the definitions of risk categories, as deduced from predictions of likelihood and business impact, have been slightly modified, as shown in Figure A.2.
Figure A.2 – Risk categories
[Figure: matrix of Business Impact (Low, Medium, High, Very high) against Risk Likelihood (Low, Medium, High, Very high), with regions labelled Category One, Category Two and Category Three.]
Having assigned the risk category, details of the likelihood, impact and risk category should be added to the risk description in the ITSC plan. At this stage the ITSC plan should contain the risks in category grouping, with Category One risks listed first. To interpret these categories further:
a) A Category One risk is one to which the organization should certainly respond;
b) A Category Two risk is one to which the organization should consider responding;
c) A Category Three risk is one which the organization should consider accepting.
No two organizations are the same and thus no firm guidance on interpreting these categories can be given without it being inappropriate to a significant percentage of the audience. Hence, though the guidance above is intentionally vague, it helps to frame the questions the organization should be asking itself at this stage of the process.
A.4.3 Develop risk profile
For each risk identified as falling into Categories One and Two, a risk profile should be developed, which defines:
a) the nature of the risk and the events likely to trigger it;
b) the probability of the risk occurring, including details of any circumstances where the likelihood of the risk could change;
c) details of the potential impact of the risk on the business, including estimates of the cost to the business of taking no action to prevent or mitigate its impact;
d) details of the symptoms likely to be displayed in the event that the risk occurs and the ways in which these symptoms could be detected;
e) an assessment of the likelihood of detecting the risk and measures that could be taken to increase that probability;
f) details of existing counter-measures designed to monitor the risk, prevent it from occurring or to mitigate its impact, including estimates of the costs of implementing and maintaining these counter-measures;
g) proposals for additional counter-measures, or changes to those in place, to prevent the risk from occurring and to mitigate its impact, including details of the facilities, equipment and personnel required, and estimates of the time, effort and cost required to implement and maintain these new counter-measures;
h) estimated savings accruing from implementing the proposed counter-measures in the event that the risk occurs;
i) estimated consequential savings likely to accrue from implementing the proposed counter-measures in the event that the risk does not occur.
This information provides the basis for a cost-benefit analysis, which should support decision making on how each risk should be addressed by risk monitoring, risk mitigation, risk communication and business continuity planning activities. Details of the risk profile are added to the ITSC plan.
A.4.4 Assess probability of detection
The probability of symptoms of the risk being detected should be determined according to Table A.4.
Table A.4 – Probability of risk detection

Probability: Low
Definition: The symptoms expected to be displayed when the risk occurs will not be obvious or easy to detect without specialised monitoring processes.
Example: disk hardware error causing infrequent and random errors when writing information to disk.

Probability: Medium
Definition: The symptoms expected to be displayed when the risk occurs will be detectable with basic or standard monitoring processes.
Example: malicious intrusion onto corporate network, failure of online or batch process to complete successfully, etc.

Probability: High
Definition: The symptoms displayed when the risk occurs will be immediately apparent.
Example: failure of email system, power failure, natural disaster etc.
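One way to hold the category and the detection probability together in a risk register is sketched below. The names `Category`, `Detection`, `RiskEntry`, `plan_order` and `needs_profile` are hypothetical, the sample risks and the categories assigned to them are invented, and ordering by detection within a category is an assumption rather than a requirement of this PAS.

```python
from dataclasses import dataclass
from enum import IntEnum


class Category(IntEnum):
    ONE = 1    # the organization should certainly respond
    TWO = 2    # the organization should consider responding
    THREE = 3  # the organization should consider accepting


class Detection(IntEnum):
    LOW = 1     # needs specialised monitoring to detect
    MEDIUM = 2  # detectable with basic or standard monitoring
    HIGH = 3    # immediately apparent


@dataclass
class RiskEntry:
    description: str
    category: Category
    detection: Detection


def plan_order(risks):
    """List Category One risks first; within a category, list the
    hardest-to-detect risks first (the second key is an assumption)."""
    return sorted(risks, key=lambda r: (r.category, r.detection))


def needs_profile(risk):
    """Risk profiles are developed for Category One and Two risks."""
    return risk.category in (Category.ONE, Category.TWO)


# Invented sample entries; the categories are illustrative only.
risks = [
    RiskEntry("Failure of email system", Category.TWO, Detection.HIGH),
    RiskEntry("Intermittent disk write errors", Category.ONE, Detection.LOW),
    RiskEntry("Minor printer outage", Category.THREE, Detection.HIGH),
]
for r in plan_order(risks):
    print(r.category.name, r.detection.name, r.description, needs_profile(r))
```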
A.4.5 Response selection
A basic model for determining appropriate responses is based upon risk categorization and likelihood of detection. Having categorized the identified risks and having decided whether a response ought to be implemented, the nature of that response should be influenced not only by the potential impact or likelihood of the risk occurring but also by the organization’s ability to detect that it has occurred.
For example, in planning a response to a risk such as the example given for a ‘low’ probability of detection, the organization might be well advised to consider implementing specialized monitoring processes and / or equipment to make detection of the risk more likely. In the case of the medium probability example, the organization can implement one of a number of common firewall and intrusion detection tools to both identify and prevent such intrusions.
A.4.6 Assign responsibility and implement
Having determined the appropriate response to the risk, the actions implied should be planned such that resource utilization and cost information is available for cost-benefit analysis. The cost-benefit analysis is an important part of the decision making process for determining which of the potential response actions will be justified and therefore implemented. It is also important information to retain when a decision is taken not to take action in response to a risk, as it demonstrates that a formal and rigorous thought process was followed in arriving at that decision.
Details of the decisions taken on the proposed response actions should be added to the ITSC plan and summarized in an action plan and work schedule.
A.5 Rehearse and learn lessons
An ITSC plan is only likely to be effective if it is regularly rehearsed and the lessons from these rehearsals are fed back into updated plans. Clause 9 provides guidance on how to plan and conduct such rehearsals for maximum effectiveness.
j) Redundant routing of communications: The ability to communicate in a period of disruption is fundamental to the successful management of an incident. Whilst there may be multiple redundant phone lines into and out of sites, check that the telephony provider is not routing all these lines through one common exchange, where a single incident at that exchange could affect them all. In addition, since email systems can be impacted by an incident, it may be prudent to maintain a number of independent email accounts on external Internet Service Providers (ISP) for use in case of emergency. Consideration should be given to providing multiple forms of communication, such as SMS, pagers, external (non-corporate) email systems, pre-agreed brief coded messages (to avoid overloading the networks and to speed communications) and so on.
k) Third party connectivity and external links: If the organization depends on the services of a third party provider (for example, in the financial world many companies use third party credit reference agencies), those services should be accessible from the remote site. The contract with the third party should provide a guaranteed level of service in the event of an incident.
Annex C (informative) Virtualization
C.1 General
Virtualization, although considered a new technology by many people, has actually been with us since the early mainframe days, when administrators were able to partition memory, processing and disk resources to create a virtual machine. This same technology has now been widely adopted in three key areas: storage, server and network virtualization. Although these are technically very different, the concept of virtualization remains the same: take a physical resource and partition it into multiple virtual resources, or consolidate multiple resources into a single virtual resource. Virtualization allows you to maximize utilization of the physical resource while simplifying management through fewer physical devices.
C.2 Network virtualization
Network virtualization allows you to take the components in your network infrastructure and either consolidate them into fewer networks or divide an existing network into smaller segments. For example, you could take a single 48 port network switch and partition it into four segments, each with 12 ports. This allows you to create four isolated networks and utilize all ports on the network switch. It also makes managing the network easier as there is only one physical switch.
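The 48 port example can be expressed as a simple port-to-segment mapping. The sketch below only illustrates the arithmetic of the partition; the function name `partition_ports` is hypothetical and does not correspond to any switch vendor's configuration interface.

```python
from typing import Dict, List


def partition_ports(total_ports: int, segments: int) -> Dict[int, List[int]]:
    """Split ports 1..total_ports into equally sized, non-overlapping segments."""
    if segments <= 0 or total_ports % segments != 0:
        raise ValueError("ports must divide evenly into the requested segments")
    size = total_ports // segments
    return {
        seg + 1: list(range(seg * size + 1, (seg + 1) * size + 1))
        for seg in range(segments)
    }


layout = partition_ports(48, 4)
for segment, ports in layout.items():
    print(f"segment {segment}: ports {ports[0]}-{ports[-1]}")
```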
C.3 Storage virtualization
Storage virtualization provides a means to hide the complexity of a storage infrastructure behind a virtual layer. The main advantage of doing this is simplified management. There are three ways to implement storage virtualization:
a) use an appliance;
b) in the network fabric;
c) locally in the storage array.
There are pros and cons to each of these methods. There are quite a few virtualization appliances on the market today that all more or less do the same thing. The appliance will usually sit between the storage arrays and fabric switches. This is called an in-band appliance. All data passing between the host and storage arrays also passes through the appliance. One concern about this approach is that the appliance might become a bottleneck. The second option is an appliance that sits on the edge of the SAN fabric. This is known as an out-of-band appliance. An advantage of this model is that only a small amount of metadata needs to be passed to the appliance, thus eliminating the bottleneck problem. Most appliances also support clustering so that the appliance does not become a point of failure. A disadvantage of the appliance approach is that adding an additional device increases the complexity and management overhead of the SAN.
Fabric based virtualization places the virtualization technology inside the SAN fabric switches. This increases the processing and memory requirements of the switch but has the added advantage of reducing overall complexity. This technology is still at a relatively early stage but there are already a number of competing products on the market. There is, however, some caution around how much intelligence should be implemented at the fabric level. There also needs to be some standardization at the fabric level so that fabrics with multi-vendor switches are fully interoperable.
C.4 Server virtualization
Deploying virtualization software on a server allows you to partition the server into multiple virtual servers and then host an independent OS and applications on each of these virtual machines. Server virtualization abstracts the OS and applications from the underlying hardware. This helps protect applications from hardware peculiarities. It also makes it much easier to migrate applications onto new hardware platforms. The management console allows you to configure how much memory and processing resources each virtual machine can have. It also allows you to monitor how many resources on the physical server each virtual machine is consuming. Replication technologies built into the virtualization software allow you to quickly clone and deploy virtual machines. By integrating with some of the major software deployment tools, it is also possible to rapidly deploy applications onto virtual machines. One version of virtualization software also allows for the relocation of virtual machines between separate physical servers. This can be policy driven so in the event of a server failure the virtual machines can be moved to a new physical server.
Annex D (informative) Types of site models
D.1 General
There are a number of basic site models that can be adopted to provide resilience. The requirements from the ITSC strategy will have a major influence on which model is selected, and this may have significant implications for the IT architecture. Thus the decision will require input and careful consideration by many areas of the organization and the final selection is likely to be an iterative process as the costs and implications are more thoroughly understood.
D.2 Active/Contingency
This model introduces a remote or back-up site for recovery only at the time of incident. It is often referred to as a cold back-up site since at the point of incident it usually consists of either an empty computer room, or a computer room populated with inactive computers in an un-initialized state. An alternative to this static computer room is a mobile computer suite provided with generators etc. that may be set up in the parking lot of an incident-stricken company. Similarly, hotel rooms and other rented office space may be turned into incident back-up sites to temporarily house new computer equipment. Specialist companies exist that can help ship equipment quickly to help minimize the costs and increase the viability of cold back-up sites. These companies are skilled in the rapid deployment and delivery of pre-configured systems and resources from servers and PCs through to telephone switches, structured cabling and furniture. An alternative is the potential for sharing machine room space with a supplier or business partner, providing reciprocal arrangements for computer room space. Care should be exercised here and no such arrangement should be undertaken until all the risks of co-hosting another company’s equipment are fully understood. The advantages and disadvantages associated with this model are listed in Table D.1.
Table D.1 – Advantages/Disadvantages associated with Active/Contingency model

Advantages
• Typically lower cost than Active / Active.
• If buying access to the contingency site from a supplier, the service will typically be treated as revenue / operational expenditure rather than capital, which can have advantages for some organizations.
• Limited investment in unused infrastructure and removes need to upgrade continuity equipment when upgrading production.
• Additional support skills may be available if using a third party to provide the service.
• May be possible to utilize space across other sites within the organization, reducing or removing the need for a specific cold site.

Disadvantages
• Typically a slower fail-over than other approaches.
• As systems are built at point of recovery, very rigorous change and configuration management is required to ensure fail-over procedures are up to date.
• Process is likely to require a high level of technical skill to deal with complex recovery issues.
• If using a shared recovery site, then there is an additional risk that another organization may also require or be using the site.
D.3 Active/Active
At the other end of the spectrum from the Active / Contingency model is the Active / Active model. As this name implies, in normal operation both sites are up and running, accepting work at both centres and balancing the load across all computers at both sites. In the event of an incident or system failure at one site, all work is routed to the second site, which has been sized to be able to accept the workload increase with little or no reduction in throughput. The advantages and disadvantages associated with this model are listed in Table D.2.
Table D.2 – Advantages/Disadvantages associated with Active/Active model

Advantages
• Fast recovery from an incident.
• Improved confidence in ability to fail-over as much of the resilience equipment is being actively used at each site.
• Recovery procedures can be simplified and / or automated, as much of the infrastructure will be up and running.
• May improve utilization of the infrastructure over other models.
• Less overhead on change and configuration management as sites are being continually exercised and so issues are likely to be identified more quickly than where equipment is not being used.
• Makes live fail-over rehearsals easier to implement.

Disadvantages
• Can be more difficult to implement and manage than other models.
• May require additional load balancing technology to allow services to be split across sites, for example to route Internet traffic to two separate sites.
• Complex database issues. If a database is to be active at multiple sites then a mechanism is required to externalize and manage updates so that data at the sites is kept synchronised. Some organizations approach this by running a cluster with the database only active at one of the sites at any one time.
• Limited separation between sites. To achieve the desired level of performance the parts of the Active / Active pair are often close together.
D.4 Active/Alternate (Active/Passive)
In the Active / Alternate model, production runs at one site with a warm standby mirror copy of the production system maintained at a second site. In the event of a failure, production work moves from the main site to the warm standby site with little or no interruption to service. This requires either synchronous (Zero Data Loss) or asynchronous (Point in Time) replication of data. The advantages and disadvantages associated with this model are listed in Table D.3.
Table D.3 – Advantages/Disadvantages associated with the Active/Alternate model

Advantages
• Either site can be nominated as production site on a scheduled basis, providing confidence in the solution.
• Makes live fail-over rehearsals easier to implement.
• Updates and maintenance can be scheduled at either site by switching service to the other site.

Disadvantages
• The fail-over to the Alternate site can have more impact on service than in the Active / Active model, though still typically better than other models.
• Systems at the alternate site must be kept in step with the Active site and as with the other models this will be a greater overhead than for the Active / Active model.
• Limited separation between sites. To achieve the desired level of performance the parts of the Active / Active pair are often close together.
D.5 Active/Back-up
In the Active / Back-up model two separate computer suites are maintained, but production runs at only one site; the back-up systems hosted at the remote site are enabled only when an incident strikes. One way of addressing the software licence issue is to utilize the back-up systems as development, test or training platforms. Many IT companies will reduce the cost of software licences if a system is only used for development work, and will allow production licences to be transferred to the back-up site when an incident strikes, although this can incur additional cost. The advantages and disadvantages associated with this model are listed in Table D.4.
Table D.4 – Advantages/Disadvantages associated with the Active/Back-up model

Advantages
• Can reduce the number of software licences required, as a warm standby system doing no productive work may still incur the cost of operating system, database, and communications software licences.

Disadvantages
• If using back-up site systems for development, test and / or training, then in the event of a major incident the facility is no longer available. If the incident is the result of, say, a faulty software release, there may not be access to the development resources, such as source files and documentation, required to provide a resolution.
• Security implications when running production from the back-up site.
• Slower to activate as existing services may have to be stopped first, before fail-over can be initiated.
D.6 Multi-site Models/Hybrids
The two-site model is fine for most companies, but some companies, especially multi-nationals, by necessity adopt a three or four site model where one or more sites can take over the work of the other sites if required. The advantages and disadvantages associated with this model are listed in Table D.5.
Table D.5 – Advantages/Disadvantages associated with Multi-site models

Advantages
• Reduced impact following a major incident at one site as production is spread across multiple sites.
• Requires less spare capacity for resilience as load is spread across multiple sites.

Disadvantages
• The approach can have complex implications, and give rise to issues such as visibility of data and scalability of systems.
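The spare-capacity point can be made concrete with a small calculation. The sketch below assumes identical sites, an evenly balanced workload and the loss of whole sites; these simplifications and the function name `capacity_factor` are assumptions for illustration, not part of this PAS.

```python
def capacity_factor(sites: int, tolerated_failures: int = 1) -> float:
    """Total installed capacity, as a multiple of the normal workload, needed
    so that the surviving sites can still carry the full load."""
    survivors = sites - tolerated_failures
    if sites < 2 or tolerated_failures < 1 or survivors < 1:
        raise ValueError("need at least two sites and at least one survivor")
    return sites / survivors


for n in (2, 3, 4):
    spare = capacity_factor(n) - 1
    print(f"{n} sites: {spare:.0%} spare capacity overall")
```

Under these assumptions a two-site Active / Active pair has to carry as much spare capacity as its whole workload, while three or four sites need proportionally less.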
Annex E (informative) High availability
E.1 General
High availability refers to the ability of a computer system and its hosted resources to withstand failures. These failures can range from component level hardware failures to complete site failures. Availability is commonly measured in 9’s with five 9’s being the highest level.
NOTE For instance, five 9’s availability allows for a system with five minutes downtime per year. While five 9’s or near continuous business operations is often desired, solutions guaranteeing zero-downtime are often too cost-prohibitive to implement, especially after weighing all risks of failure and determining what kind of downtime is acceptable for your needs, as shown in Figure E.1.
Figure E.1 – Downtime vs. cost
[Figure: cost (from $ to $$$$) plotted against System Availability (99.0%, 99.9%, 99.99%, 99.999%) and the corresponding Downtime, with bands labelled Commercial Availability, High Availability, Fault Resilient, Fault Tolerant and Continuous Processing.]
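The relationship between an availability percentage and the downtime it permits is simple arithmetic, sketched below. The function name is hypothetical and a 365 day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum downtime per year permitted by a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} minutes per year")
```

Five 9’s works out at a little over five minutes per year, whereas 99.0% permits more than three and a half days.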
RAID levels that utilise the most disks provide the highest level of redundancy and performance. Each RAID level has its own advantages and disadvantages which are summarized in Table E.1.
Table E.1 – Advantages/Disadvantages of RAID levels

RAID Level: 0
Description: Data striped across one or more disks.
Advantages: Easy to implement. Very good read / write performance. No parity overhead.
Disadvantages: No fault tolerance / redundancy. Single disk failure causes data loss. Not suitable for high availability.

RAID Level: 1
Description: Data mirrored between two disks. Requires a minimum of two disks.
Advantages: 100% disk redundancy. Improved read performance over RAID 0. Simple design.
Disadvantages: Only 50% disk utilization. Limited redundancy – single disk failure. RAID function requires additional processing.

RAID Level: 5
Description: Data and parity striped across multiple disks. Requires a minimum of three disks.
Advantages: Very good read performance. Parity is distributed across all disks. Maximum utilization of disk resources.
Disadvantages: Write penalty due to parity calculation. Slow rebuild after drive failure. Limited redundancy – single disk failure.

RAID Level: 0+1
Description: Mirrored RAID 0 segments. Requires a minimum of four disks.
Advantages: High I / O throughput – read and write. Same overhead as RAID 1.
Disadvantages: Very expensive to implement. Only 50% disk utilization. Limited redundancy – single disk failure.

RAID Level: 10
Description: Striped RAID 1 segments. Requires a minimum of four disks.
Advantages: Very good read / write performance. Same overhead as RAID 1. Can withstand single drive failures across RAID 1 segments.
Disadvantages: Very expensive to implement. Only 50% disk utilization.
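The disk utilization figures in Table E.1 can be reproduced with a short calculation. The sketch assumes equal-sized disks; the function name `utilization` is hypothetical, and the rule that RAID 5 gives up one disk's worth of capacity to parity is the conventional definition rather than something stated in the table.

```python
MIN_DISKS = {"0": 1, "1": 2, "5": 3, "0+1": 4, "10": 4}  # minimums from Table E.1


def utilization(level: str, disks: int) -> float:
    """Fraction of raw capacity available for data, assuming equal-sized disks."""
    if level not in MIN_DISKS:
        raise ValueError(f"unknown RAID level: {level}")
    if disks < MIN_DISKS[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DISKS[level]} disks")
    if level == "0":
        return 1.0                  # striping only, no redundancy
    if level == "5":
        return (disks - 1) / disks  # one disk's worth of capacity holds parity
    if disks % 2:
        raise ValueError("mirrored levels need an even number of disks")
    return 0.5                      # RAID 1, 0+1 and 10 mirror every block


for level, disks in [("0", 4), ("1", 2), ("5", 4), ("0+1", 4), ("10", 4)]:
    print(f"RAID {level} with {disks} disks: {utilization(level, disks):.0%} usable")
```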
Annex F (informative) Types of resilience
F.1 General
This annex provides a brief overview of some of the replication approaches that should be used to protect and recover data. As technology progresses other options will become viable and any selection should include a review of products available. As with the choice of a site model described in Annex D, the ITSC Strategy will be a major factor in selecting the appropriate replication mechanism(s) and the resultant choice will have an effect on the IT Architecture.
F.2 Media back-up/restore
Creating data back-ups on an alternate media (usually tape) is still the de facto method of ensuring there is a secure Point In Time (PIT) copy of vital data but for many companies the tape back-up has become a second level back-up which is just insurance against disk or application based replication failures. However some legacy applications still rely on tape as the primary back-up.
NOTE 1 In this context the term ‘tape back-up’ will apply equally well to any other physical media used to create a back-up of live data which is then physically removed offsite.
NOTE 2 If tape back-up is used either as a primary or secondary back-up it is normal to create a schedule of back-ups which reflects the working pattern of the system being backed up.
F.2.1 Example 1
An order entry system is open for orders from 09:00 to 17:00 every day, Monday through Saturday, the online day. After 17:00 batch processes extract all new orders from the orders file and process them, distributing shipment orders to the warehouse and build orders to the factory floor. The order history file is then updated. On Sunday the system is unavailable.
The back-up cycle for this system may be:
a) full back-up of the entire database once per week on Sunday;
b) back-up of all new orders, daily after 17:00 but before batch processing commences;
c) back-up of all shipment orders and build orders after batch processing;
d) back-up of order history file before start of online day.
This is highly tailored to a known and fixed file based processing cycle.
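The cycle in Example 1 can be written down as a small schedule function. This is a sketch only: the function name `backups_for` is hypothetical, and it assumes that items b) to d) run on each online day (Monday to Saturday) and that only the full back-up runs on Sunday.

```python
import datetime
from typing import List

SUNDAY = 6  # datetime.date.weekday(): Monday is 0, Sunday is 6


def backups_for(day: datetime.date) -> List[str]:
    """Back-ups due on a given date, in the order they would run."""
    if day.weekday() == SUNDAY:
        return ["full back-up of the entire database"]
    return [
        "order history file (before the 09:00 start of the online day)",
        "new orders (after 17:00, before batch processing)",
        "shipment orders and build orders (after batch processing)",
    ]


for d in (datetime.date(2006, 8, 13), datetime.date(2006, 8, 14)):
    print(d.strftime("%A"), backups_for(d))
```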
F.2.2 Example 2
Modern tape back-up systems adopt a more holistic approach to backing up the entire system, as illustrated in the following example. A software development system is available 24 x 7 for use by a group of 100+ developers writing and testing software. Source files and executables are created and updated on an ad hoc basis throughout the day and at various times throughout the night. Developers work typical eight hour shifts during the core hours of 07:00 to 19:00 but in order to meet deadlines will sometimes work through the night. The least busy period is Sunday nights through early morning Monday.
The back-up cycle for this system adopts the following pattern:
a) full system back-up of all files on Sunday night;
b) full system back-up of all files changed since Sunday at midnight every day;
c) incremental back-up of all files changed since midnight twice per day, once at midday and once at 21:00.
In this example only those files that are unopened are backed up. If a file is left open to an application it will not be backed up, simply because its contents cannot be guaranteed to be consistent. It is possible to override this restriction and create ‘dirty’ back-ups but this should only be done with an understanding of the applications in use as data could be in an uncertain state. Conversely, if open files are ignored, certain files may never be backed up, leaving the organization exposed to the risk of data loss.
F.2.3 Media Issues
Many of the issues related to tape back-up of open databases are dealt with by application or system based back-ups. Faced with the issues related to creating Zero Data Loss (ZDL) back-ups of live, permanently open databases, systems and database vendors have created their own back-up software that works in concert with the database system to enable online production databases to be saved and restored with no data loss, and with the ability to fail-back updates performed by failed transactions. These back-up mechanisms rely on the creation of log files, often referred to as audit trails, which reflect every update to the database and allow databases to be restored in a consistent fashion should systems fail.
F.2.4 Media Storage
If tape-based back-up is the primary back-up mechanism, then copies of the back-ups should be taken offsite at the earliest opportunity. Many companies exist to provide secure data storage and can be contracted to collect data
load on the host system. Handling replication at the control unit level also allows the control unit to create multiple copies or snapshots of the volumes. These snapshots can be used to drive separate applications such as overnight batch processing and their careful use can vastly improve overall system throughput. However it should be noted that, depending on how it is used, storage array based replication can introduce I / O latency and potentially delay the I / O completion.
F.7 Storage Area Network-based replication
In a Storage Area Network (SAN), storage devices are attached to a fabric of fibre channels which operate at very high speeds. An approach to replication that operates at the SAN level is to introduce a replication appliance into the SAN as both a disk volume and host at the same time. Through the use of special device drivers embedded in the host systems, write requests are replicated to the replication appliance. These devices can then ship the write across a Wide Area Network (WAN) to a remote location where another set of replication appliances can distribute the write requests to a duplicate set of disk devices. The advantage of this mechanism is that standard communications lines can be used to carry the replication traffic since replication appliances may also compress the data being sent.
F.8 Disk replication modes
Almost all of the mechanisms described for disk replication can operate in one of two modes and it is worth considering which of these modes best fits the RTO, RPO and cost requirements.
a) Synchronous (Zero Data Loss) – Each write request is replicated to a remote system and the issuing system effectively waits until it receives an I / O complete status back from the remote site. This ensures that all I / O write requests are securely completed on the remote site and a back-up is guaranteed to be synchronized with the original copy. It should be noted that in order to achieve acceptable throughput dedicated links are required between the two sites.
b) Asynchronous (Point In Time) – The write request is issued to the remote system but there is no delay or wait for an acknowledgement of I / O completion; rather, the local application continues without any delay. While this improves performance it does create a window where data loss can occur if the local disk subsystem is destroyed with writes pending. This asynchronous replication is also called a Point In Time back-up because it reflects the consistent state of the disk at a specified point in time.
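The difference between the two modes can be illustrated with a toy model. The class `ReplicatedVolume` and its methods are hypothetical and model only the ordering of acknowledgements, not real disk or network behaviour.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a local volume replicated to a remote site."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.local = []
        self.remote = []
        self.pending = deque()  # writes issued but not yet applied remotely

    def write(self, block):
        self.local.append(block)
        if self.synchronous:
            # The issuer waits until the remote site confirms I/O completion.
            self.remote.append(block)
        else:
            # The issuer continues immediately; the remote copy lags behind.
            self.pending.append(block)

    def drain(self, count=None):
        """Apply queued writes at the remote site (asynchronous mode only)."""
        n = len(self.pending) if count is None else min(count, len(self.pending))
        for _ in range(n):
            self.remote.append(self.pending.popleft())

    def lost_if_local_destroyed(self):
        """Writes that exist only locally and would be lost with the local site."""
        return list(self.pending)


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for block in ("a", "b", "c"):
    sync_vol.write(block)
    async_vol.write(block)
async_vol.drain(1)  # only the first write has reached the remote site so far
print(sync_vol.lost_if_local_destroyed(), async_vol.lost_if_local_destroyed())
```

The pending queue is the data-loss window described in b): its contents at the moment of failure determine the RPO actually achieved.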
Bibliography
Standards publications
PAS 56:2003, Guide to Business Continuity Management
BS ISO/IEC 20000, Information Technology – Service Management
ISO/IEC 17799:2005, Code of Practice for Information Security Management
ISO Guide 73:2002, Risk management – Vocabulary – Guidelines for use in standards
Other publications
[1] IT Infrastructure Library (ITIL). Office of Government Commerce: The Stationery Office.
[2] PRINCE2 Maturity Model (P2MM). Office of Government Commerce (OGC).
[3] Project Management Body of Knowledge. Project Management Institute (PMI).
Further Reading
TR 19:2005, Technical Reference for Business Continuity Management (BCM). Spring Singapore.
Emergency Preparedness: Guidance on Part 1 of the Civil Contingencies Act 2004, its associated Regulations and non-statutory arrangements. Home Office: The Stationery Office.
Generally Accepted Practices for Business Continuity Practitioners. Disaster Recovery Journal and DRI International, 2005.
“Business Continuity”. CBI with Computacenter, 2002.
“A Risk Management Standard”. The Institute of Risk Management, The Association of Insurance and Risk Managers and The National Forum for Risk Management in the Public Sector, 2002.
Microsoft Operations Framework, a pocket guide. Van Haren Publishing, ISBN 9077212108.
Management of Risk: Guidance for Practitioners. Office of Government Commerce: The Stationery Office.
“A Guide to Business Continuity Planning” by James C. Barnes, ISBN 0-471-53015-8.
BSI – British Standards Institution
BSI is the independent national body responsible for preparing British Standards. It presents the UK view on standards in Europe and at the international level. It is incorporated by Royal Charter.

Revisions
British Standards are updated by amendment or revision. Users of British Standards should make sure that they possess the latest amendments or editions.
We would be grateful if anyone finding an inaccuracy or ambiguity while using this Publicly Available Specification would inform Customer Services.
Tel: +44 (0)20 8996 9001 Fax: +44 (0)20 8996 7001
BSI offers members an individual updating service called PLUS which ensures that subscribers automatically receive the latest editions of standards.

Buying standards
Orders for all BSI, international and foreign standards publications should be addressed to Customer Services.
Tel: +44 (0)20 8996 9001 Fax: +44 (0)20 8996 7001
Standards are also available from the BSI website at http://www.bsi-global.com
In response to orders for international standards, it is BSI policy to supply the BSI implementation of those that have been published as British Standards, unless otherwise requested.

Information on standards
BSI provides a wide range of information on national, European and international standards through its Library and its Technical Help to Exporters Service. Various BSI electronic information services are also available which give details on all its products and services.
Contact the Information Centre
Tel: +44 (0) 20 8996 7111 Fax: +44 (0) 20 8996 7048
Subscribing members of BSI are kept up to date with standards developments and receive substantial discounts on the purchase price of standards. For details of these and other benefits contact Membership Administration.
Tel: +44 (0) 20 8996 7002 Fax: +44 (0) 20 8996 7001
Information regarding online access to British Standards via British Standards Online can be found at http://www.bsi-global.com/bsonline
Further information about BSI is available on the BSI website at http://www.bsi-global.com

Copyright
Copyright subsists in all BSI publications. BSI also holds the copyright, in the UK, of the publications of the international standardization bodies.
Except as permitted under the Copyright, Designs and Patents Act 1988 no extract may be reproduced, stored in a retrieval system or transmitted in any form or by any means – electronic, photocopying, recording or otherwise – without prior written permission from BSI. This does not preclude the free use, in the course of implementing the standard, of necessary details such as symbols, and size, type or grade designations. If these details are to be used for any other purpose than implementation then the prior written permission of BSI must be obtained. Details and advice can be obtained from the Copyright & Licensing Manager.
Tel: +44 (0) 20 8996 7070 Fax: +44 (0) 20 8996 7553

BSI, 389 Chiswick High Road London W4 4AL.