Technical Report
Best Practices for AFF Business-Processing SAN Workloads
ONTAP 9
Michael Peppers, NetApp
August 2016 | TR-4515
Abstract
This technical report details implementation and best practices recommendations for AFF business-processing SAN configurations. This includes platform configuration requirements, lifecycle, and the use of NetApp® monitoring tools to validate and report on its continuing operation. This configuration is optimized for consistent low-latency, high performance. This version of the technical report corresponds to ONTAP® 9.
Data Classification: Public
Version History

Version | Date | Document Version History
Version 1.0 | August 2016 | Initial version
TABLE OF CONTENTS

Version History
1 AFF Business-Processing Overview
2 AFBP Compatibility Guidelines
  2.1 AFBP Commitments and Service Level Objectives
3 AFBP Configuration Requirements
  3.1 Required Hardware and Software Components for AFBP Configurations
  3.2 SAN Environmental Requirements
  3.3 Hardware Configuration
    3.3.1 Storage Controllers
    3.3.2 Storage Media: All Flash FAS
    3.3.3 Steady-State Storage Utilization
  3.4 Software Configuration
    3.4.1 SupportEdge Premium Support Entitlements
    3.4.2 Differentiated Services Flag
    3.4.3 Protocol Support
    3.4.4 Enforcing Protocols
    3.4.5 Storage Efficiency: Compression, Compaction, and Deduplication
      3.4.5.1 Enforcing Storage Efficiency Policies
    3.4.6 Snapshot Scheduling and Policy
      3.4.6.1 Enforcing Snapshot Scheduling Policy
    3.4.7 Thin Provisioning
    3.4.8 LUN Space Allocation
    3.4.9 Storage Object Tested Maximums
  3.5 AFF SAN-Optimized Nodes and the Baseline Configuration
  3.6 Validating the AFBP Baseline Configuration
4 Performance Capacity, CPU Utilization, Storage Utilization, and Performance Capacity Planning
  4.1 Performance Capacity Terms
    4.1.1 Optimal Point
    4.1.2 Performance Capacity Used
    4.1.3 Performance Capacity Available
    4.1.4 Operating Point
    4.1.5 Unsafe Zone
    4.1.6 Safe Zone
  4.2 How to Use OPM 7 to Determine Performance Capacity: Working Example
5 AFBP Service Offering Lifecycle
  5.1 Sizing an AFBP Cluster
  5.2 Initial Setup and Prevalidation
    5.2.1 Initial Hardware Setup Checklist
    5.2.2 Initial Hardware Setup Validation
    5.2.3 Monitoring Setup Checklist
    5.2.4 Configuration Tool Setup Checklist
      5.2.4.1 OnCommand Insight Report Checklist
  5.3 Ongoing Maintenance and Operations
  5.4 Managing and Scheduling Operations That Will Increase System Utilization
  5.5 Heavy Cluster Interconnect Traffic Impacts
  5.6 Support Considerations
Appendix
  Using OnCommand Unified Manager for AFBP Application Monitoring and Alerting
  Using OnCommand Performance Manager for Complex AFBP Monitoring Thresholds
References
Contact Us

LIST OF FIGURES
Figure 1) OnCommand System Manager SVM protocols.
Figure 2) Config Advisor with Managed ONTAP SAN plug-in.
Figure 3) Verifying configuration with Configuration Advisor with Managed ONTAP SAN plug-in.
Figure 4) Performance capacity.
Figure 5) OPM 7 cluster dashboard.
Figure 6) OPM 7 node summary page.
Figure 7) OPM 7 single-node drill-down.
Figure 8) OPM 7 performance summarization graphs showing both nodes in an HA pair.
Figure 9) OPM 7 failover planning tab.
Figure 10) OPM 7 failover planning graphs showing both nodes' performance capacity and estimated takeover performance capacity.
Figure 11) OPM 7 failover planning graphs showing overprovisioned workloads with very high latencies in both steady state and increasing latency in takeover.

LIST OF TABLES
Table 1) AFBP takeover and giveback timing service level objectives (SLOs).
Table 2) AFBP storage platform cluster interconnect requirements.
Table 3) Storage media configuration details for AFF storage systems.
Table 4) Storage efficiency features on baseline configuration storage nodes.
Table 5) Storage object limits.
Table 6) Configuration checks performed by the Managed ONTAP SAN plug-in.
Table 7) Host and switch configuration used in the OPM planning example.
Table 8) NetApp storage array configuration used in the OPM planning example.
Table 9) Hardware setup checklist.
Table 10) Hardware checklist validation methods.
Table 11) AFBP environment-monitoring tools.
Table 12) AFBP configuration tools.
Table 13) OnCommand Insight per-application reports.
Table 14) OnCommand Insight storage environment reports.
Table 15) Predeployment validation tasks.
Table 16) Application validation test items.
1 AFF Business-Processing Overview
This document is a detailed guide for storage architects who intend to run All Flash FAS (AFF) business-processing applications on the NetApp ONTAP operating system. The guide describes in detail a storage baseline configuration: the prescribed SAN configuration that has been tested by NetApp to validate its ability to provide consistent low-latency performance, high throughput, maximum availability, and resiliency. It also discusses the process of configuring, installing, validating, deploying into production, and monitoring an AFF business-processing (AFBP) storage environment according to best practices.

This document and its prescriptions are the product of extensive performance testing to identify and qualify a baseline configuration for consistent performance. It describes this configuration and makes conservative recommendations designed to deliver the best possible end-user experience. The goal of the AFBP configuration is to eliminate variability in an ONTAP configuration that could cause unexpected deviations in storage latency and performance. This guide's recommendations do not imply that the features being limited are problematic for production deployment in any way. The requirements and recommendations made are those necessary to match the validated and tested configurations created to maximize performance while maintaining consistent low-latency operation, even in the face of storage disruptions such as controller takeovers and givebacks.

This configuration is optimized for consistent low-latency, high performance and the ability to leverage advanced storage features, not for a theoretical maximum performance level that disregards all other factors. For information about NetApp AFF top-end performance, review the Storage Performance Council's SPC-1 results.

This document describes AFBP guidelines and requirements consistent with ONTAP 9. The guidelines, requirements, and sample results enumerated in this document are products of extensive and continuous testing by NetApp's workload performance characterization team.
2 AFBP Compatibility Guidelines
A cluster provisioned to serve AFBP applications can grow as more applications are hosted on it, but its initial size and configuration should be determined in accordance with NetApp's and the application publisher's best practice recommendations. Applications and storage requirements that adhere to the following guidelines are an excellent fit for an AFBP storage cluster. These guidelines are subject to change with subsequent releases of ONTAP:
- Consistent low-latency performance is more important for the given workload than getting the maximum possible steady-state throughput. Review section 4 of this document for a discussion about performance optimization and maintaining consistent low latency.
- The application should be able to tolerate Fibre Channel, Fibre Channel over Ethernet (FCoE), or iSCSI planned and unplanned path changes. In particular, the status of paths may change from optimized to nonoptimized during nondisruptive data mobility operations.
- The application environment in question is covered by the NetApp Interoperability Matrix Tool.

The window for node takeover and giveback operations is shown in Table 1, which includes other non-AFBP configurations as points of comparison.
Table 1) AFBP takeover and giveback timing service level objectives (SLOs).

Platform | Planned Takeover | Unplanned Takeover
AFF running prescriptive SAN configurations | 2–10 seconds | 2–15 seconds
AFF | 15 seconds | 30 seconds
FAS with Flash Pool™ or SSD aggregates | 30 seconds | 60 seconds
Note: All measurements listed in Table 1 are measured from the high-availability (HA) partner.
Variations in host I/O resume times can frequently be minimized with host stack tuning. Consult the appropriate host attach kit for recommended host stack optimizations. All kits can be found in the downloads section of support.netapp.com under Host Utilities – SAN.
Note: Excessive cluster interconnect traffic can cause congestion, which can negatively affect takeover and giveback timings and could cause those timings to exceed those shown in Table 1. For more information, review section 5.5, Heavy Cluster Interconnect Traffic Impacts.
Note: All takeover and giveback timings are based on NetApp performance testing. The timings are taken from the EMS logs on the partner controller. The EMS log entries look similar to these:

event log show -message-name cf.transition.info -severity NOTICE
Time                Node    Severity  Event
------------------- ------- --------- ---------------------------
8/16/2016 13:38:04  aff-01  NOTICE    cf.transition.info: SFO Giveback (aff_01_aggr1) Protocol Transition Time(msec):NFS=250[166|84], CIFS=250[166|84], FCP=251[167|84], ISCSI=251[167|84].
8/16/2016 13:25:04  aff-02  NOTICE    cf.transition.info: Takeover Protocol Transition Time(msec):NFS=2994[224|2442], CIFS=2994[224|2442], FCP=2994[224|2442], ISCSI=2994[224|2442].
8/16/2016 13:24:33  aff-02  NOTICE    cf.transition.info: SFO Phase of Takeover (aff_01_aggr1) Protocol Transition Time(msec):NFS=243[164|79], CIFS=243[164|79], FCP=244[165|79], ISCSI=244[165|79].

AFF/FAS systems using FlexArray® do not have an associated takeover and giveback timing guideline.
2.1 AFBP Commitments and Service Level Objectives
Table 1 illustrates AFBP's storage failover (SFO) service level objective (SLO). The SLO states that a planned failover is expected to complete within 10 seconds and an unplanned failover within 15 seconds. Because this is a service level objective, the goal is that any takeover or giveback on an AFBP system completes within those times, as measured from the partner controller. NetApp expects SFOs to complete within those thresholds at least 95% of the time. In testing, a very small number of SFOs did not complete within the thresholds noted above; subsequent root cause analysis (RCA) investigations found that transient conditions were the likely cause of, or contributors to, the longer SFO completion durations.

As noted above, all takeover and giveback completion times are measured from the HA partner controller. NetApp has also done testing to collect I/O resume times from the host's point of view and has found some variation among host OS I/O stacks. NetApp is investigating individual stacks in order to identify and test host stack optimizations that reduce I/O resume times as measured from the host. Preliminary testing shows some variability between host stacks, with Red Hat Enterprise Linux (RHEL) 7.x among the best tuned. In testing, I/O resumed on RHEL 7.2 within 3-6 seconds in both planned and unplanned SFOs. Similar results were observed on Windows, with I/O resuming within 4-7 seconds. In fact, even as measured from the HA partner, 70-80% of SFOs complete in less than 5 seconds.
3 AFBP Configuration Requirements
This section details the requirements for implementing and maintaining the prescribed SAN configuration, along with the other requirements necessary to establish and remain in an AFBP configuration. You need to fulfill these requirements during provisioning of storage for applications to be run on an AFBP cluster. The Managed ONTAP SAN plug-in for Config Advisor should be run both after initial setup and provisioning and regularly during operation in order to confirm that the storage system continues to conform to the baseline configuration. In order to maintain consistent performance and meet storage service-level objectives (SLOs), any inconsistencies with the baseline configuration that the plug-in discovers should be remediated.
3.1 Required Hardware and Software Components for AFBP Configurations
All AFBP configurations have the following mandatory components:
- An AFF HA pair. No non-AFBP nodes can exist in the same cluster with AFBP nodes. Their presence causes the Managed ONTAP SAN plug-in to fail to initialize and throw a configuration error message.
- ONTAP 8.3.1 or later; ONTAP 9 recommended
- OnCommand® Performance Manager (OPM) version 7 or later
- OnCommand Unified Manager (OCUM) version 7 or later
- SupportEdge Premium Support entitlements
- Config Advisor with Managed ONTAP SAN plug-in
Note: While not required, NetApp strongly recommends OnCommand Insight (OCI) version 7 or later. OCI proves invaluable to storage administrators for visualizing the entire storage environment, including but not limited to:
- All of their storage arrays, regardless of make, model, or manufacturer
- All hosts, with data granular enough to report on HBA/UTA information
- All switches, including fabric information, in addition to more specific switch-centric data
Additionally, OCI provides a plethora of reports on topics as diverse as utilization, efficiencies, and stranded storage, to name just a few. For more information about OCI, follow the links in the References section of this technical report.
3.2 SAN Environmental Requirements
All AFBP SAN environments are assumed to have been architected to follow general SAN best practices, such as redundant fabrics and the use of dedicated 10G storage networks that are segregated from general Ethernet communication networks.
3.3 Hardware Configuration
AFF HA pairs form the basic hardware building blocks for creating AFBP configurations.
3.3.1 Storage Controllers
Table 2 lists the storage controllers eligible to be members of an AFBP cluster, along with their required number of cluster interconnects.
Table 2) AFBP storage platform cluster interconnect requirements.

Platform | Cluster Interconnect Count
AFF8080EX | 4
AFF8080CC | 4
AFF8060 | 4
AFF8040 | 4
Note: Review section 5.5, Heavy Cluster Interconnect Traffic Impacts, for more information about cluster interconnect requirements.
3.3.2 Storage Media: All Flash FAS
The baseline configuration has been tested and qualified with a particular storage layout when running an AFF storage system. AFF nodes in a business-processing cluster must meet the storage subsystem hardware requirements described in Table 3.

Table 3) Storage media configuration details for AFF storage systems.

Configuration Detail | Requirement
Drives per storage controller | No more than 240
Shelves per storage controller | No more than 10
Data aggregates per storage controller | No more than 10
Shelves per aggregate | No more than 10
Aggregate type | SSD only
RAID group size | 11–28
HA partner configuration | Both nodes in an HA pair must have the same configuration

Notes: Shelves are considered as half-shelves (node 1 owning drives in bays 0–11, node 2 owning drives in bays 12–23); drives are considered as advanced drive partitioning (ADP) data partitions.
3.3.3 Steady-State Storage Utilization
Previously, NetApp recommended that storage administrators target no more than 50% CPU and storage utilization in order to maintain consistent performance in the face of a takeover. This paper changes that recommendation. While the metrics proposed, CPU and storage utilization, are easy to measure and understand, they are not sufficiently nuanced and are likely to leave potential capacity unused. Going forward, we recommend using performance capacity to optimize performance while maintaining consistent low latency. Section 4 of this paper discusses utilization, capacity planning, and how performance capacity calculations work and then presents an example using OPM 7 for capacity planning and workload placement.
3.4 Software Configuration
The software configuration specific to a storage cluster running within the baseline configuration is expected to change over time as workloads and applications are added or removed. This section outlines the range of configuration values and settings that are included in the baseline configuration.
Validating them can be accomplished automatically using the Config Advisor tool. For more information about this tool and using it to validate a storage cluster's settings, see section 3.6.
3.4.1 SupportEdge Premium Support Entitlements
SupportEdge Premium support entitlements are required on all controllers that are part of an AFBP configuration. This requirement ensures that all required support is received and performed based on the specification of the SupportEdge Premium entitlement level, including expectations for support responses, field personnel entitlements, and service-level objectives for problem triage, resolution, and escalation. For more information about NetApp support options and the service-level objectives of each support entitlement, review the information on the NetApp Support offerings page.
3.4.2 Differentiated Services Flag
In Data ONTAP® 8.3 and later, nodes should be designated as participating in an AFBP baseline configuration with the differentiated services flag. This flag should be set to true in clusters running the baseline configuration detailed in this document and running an AFBP workload. The flag can be set using the node modify [node] -is-diff-svcs true command.

mcbp01::> node modify [node] -is-diff-svcs true

You can check whether differentiated services have been set using the node show -fields is-diff-svcs command.

mcbp01::> node show -fields is-diff-svcs
node       is-diff-svcs
---------- ------------
mcbp01-01  true
mcbp01-02  true
mcbp01-03  true
mcbp01-04  true

Note: The differentiated services flag must be set before the Config Advisor Managed ONTAP SAN plug-in can successfully scan a configuration.
3.4.3 Protocol Support
NetApp has tested and qualified only block protocols for the baseline configuration serving AFBP applications. Other protocols might be qualified in future releases of the baseline configuration.
3.4.4 Enforcing Protocols
At the command line, a storage virtual machine (SVM) can be set to allow some protocols and disallow others: mcbp01::> vserver modify -vserver [svm] -allowed-protocols fcp,iscsi -disallowed-protocols nfs,cifs
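After setting the allowed protocols, you can cross-check them (a minimal sketch using standard vserver show fields; SVM names will differ in your environment): mcbp01::> vserver show -fields allowed-protocols,disallowed-protocols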
Furthermore, the output of nfs show and cifs show should result in an empty table. Using fcp show and iscsi show should result in a list of SVMs with associated worldwide node names (WWNNs) or iSCSI qualified names (IQNs), respectively.

aff::> fcp show
                                                 Status
Vserver             Target Name                  Admin
------------------- ---------------------------- ------
AFF_SAN_DEFAULT_SVM 20:00:00:a0:98:9e:f8:55      up
If you use NetApp OnCommand System Manager, navigate to the Storage Virtual Machines section of the cluster, then to the SVM hosting the AFBP application, then to SVM Settings, and then to Protocols (on the left menu). From there, you can check the status of any individual protocols licensed for this cluster.

Figure 1) OnCommand System Manager SVM protocols.
3.4.5 Storage Efficiency: Compression, Compaction, and Deduplication
In order to maintain predictable performance, a storage system within the baseline configuration should operate under the guidelines detailed in Table 4.

Table 4) Storage efficiency features on baseline configuration storage nodes.

Storage Efficiency Feature | AFF Setting
Inline compression | Yes (permanently enabled by default)
Scheduled compression | No
Inline deduplication | Yes (on by default on AFF)
Scheduled deduplication (see note) | Yes (1 thread)
Compaction | Yes (on by default on AFF)
Note: Performance testing has shown that inline compression has no negative impact on performance, and in some cases the resulting decrease in internal I/O may result in a performance improvement. When Data ONTAP is running on an AFF storage node, inline compression is enabled by default for all volumes and cannot be disabled.
Note: The number of threads available for scheduled deduplication runs can be set on a per-node basis using the node run -node [node] -command options sis.max_active_ops 1 command. Note that this command requires a node reboot to take effect.
Note: Some workloads, for example databases, won't receive any benefit from deduplication. In cases where the envisioned workloads aren't likely to benefit from deduplication, it should be disabled. Use the command listed in the next section to turn off efficiency (deduplication) for the volumes that won't benefit.
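As a quick check of the current per-node thread setting before and after the reboot (a minimal sketch; the option name is taken from the note above, and [node] is a placeholder): mcbp01::> node run -node [node] -command options sis.max_active_ops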
3.4.5.1 Enforcing Storage Efficiency Policies
You can use the CLI to enforce storage efficiency policies. To turn on efficiency for a volume in an SVM: mcbp01::> volume efficiency on -vserver [svm] -volume [vol]
To turn on efficiency for all volumes in an SVM: mcbp01::> volume efficiency on -vserver [svm] -volume *
To turn off compression, but leave efficiency on: mcbp01::> volume efficiency modify -vserver [svm] -volume [vol] -compression false
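To turn off efficiency (deduplication) entirely for volumes that will not benefit from it, as referenced in the note in section 3.4.5 (a minimal sketch using the standard volume efficiency command; [svm] and [vol] are placeholders): mcbp01::> volume efficiency off -vserver [svm] -volume [vol]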
To examine the volume efficiency status of all volumes in a storage cluster to make sure none are set to inline only: mcbp01::> volume efficiency show -policy inline-only
If you use System Manager, navigate to the Storage Virtual Machines section of the cluster, then to the SVM hosting the AFBP application, then to Policies, and then to Efficiency Policies. You can also inspect individual volumes. Navigate to the Storage Virtual Machines section of the cluster, then to the SVM in the AFBP cluster, then to Storage, and then to Volumes. By default, the rightmost column is Storage Efficiency. If volumes have been migrated from an ONTAP context outside the AFBP zone to within it, make sure that the aggregate containing these volumes has enough free space to turn compression off. You can see the amount of logical data (that is, the volume's size if storage efficiency were turned off) in the Storage Efficiency tab of the Volume view in System Manager or by running the following command in the CLI: mcbp01::> vol efficiency show -vserver [svm] -volume [vol] -fields logical-data-size
3.4.6 Snapshot Scheduling and Policy
The Snapshot® policy should generally be set to none for volumes that contain AFBP application data. This makes sure that the number of Snapshot copies and the amount of space consumed are managed properly. Overall, Snapshot copies should be managed by a storage management tool, for instance, a member of the SnapCenter® suite of products, or should be application initiated in order to validate that they are application consistent. In some cases, a Snapshot policy other than "none" is appropriate, but only in conjunction with a specific data protection strategy designed for on-box Snapshot copies.
3.4.6.1 Enforcing Snapshot Scheduling Policy
You can use the CLI to enforce the Snapshot scheduling policy. Use this command to change the Snapshot policy for a volume in an SVM to none: mcbp01::> volume modify -vserver [svm] -volume [vol] -snapshot-policy none
Use this command to verify that individual volumes have no scheduled Snapshot policy.

mcbp01::> volume show -fields snapshot-policy
vserver volume       snapshot-policy
------- ------------ ---------------
mcbp01  data01       none
mcbp01  data02       none
mcbp01  mcbp_rootvol default

To specify that new volumes in an SVM have a default Snapshot policy with no scheduled Snapshot copies, use: mcbp01::> vserver modify -vserver [svm] -snapshot-policy none
If you use System Manager, navigate to the Storage Virtual Machines section of the cluster, then to the SVM hosting the AFBP application, then to Policies, and then to Snapshot Policies. The table should be empty.
3.4.7 Thin Provisioning
The NetApp WAFL® (Write Anywhere File Layout) file system used by ONTAP does not preallocate storage on disk before consuming it. This storage allocation policy is known as thin provisioning or dynamic provisioning. You can set space reserves to subtract free space from a volume, aggregate, or LUN and hold it in reserve for future write operations. This is called thick provisioning. When space reserves are turned off and LUNs are created that, when fully written, could consume more space than is immediately available in a volume or aggregate, the policy is known as storage overcommitment. Storage overcommitment requires that free space be continuously monitored to meet the needs of hosted applications. This policy also requires an action plan for increasing the free space available (either through nondisruptive data mobility operations or by expanding aggregate sizes). Therefore, the most conservative option is to fully provision storage, but at the cost of additional storage capacity that may not be required.

The baseline configuration has thin provisioning turned on by default. It is a best practice to leave >25% free space in the hosting aggregate and to adjust free space thresholds for those aggregates. Refer to the section titled "Aggregate Full and Nearly Full Thresholds" in TR-4480: All Flash FAS SAN-Optimized Configuration. If thin provisioning is used, a strategy or action plan must be documented and in place to mitigate low-space scenarios.
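As a minimal sketch of the kind of routine free-space check this implies (standard storage aggregate show fields; the thresholds themselves are your own policy), you can list aggregate usage and confirm that data aggregates stay below the 75% utilization limit in Table 5 and retain more than 25% free space:

mcbp01::> storage aggregate show -fields size,usedsize,availsize,percent-used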
3.4.8 LUN Space Allocation
The space allocation option on LUNs is disabled by default; you should not enable it. The space allocation setting determines whether a LUN supports SCSI unmap/space reclamation. Space reclamation can be very processor intensive and potentially long running and is therefore not supported in an AFBP configuration. If any LUN that has this option enabled is replicated or migrated into the AFBP configuration, you should disable the option before allowing the LUN to be discovered by a host system.
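A minimal sketch for auditing and remediating this setting from the CLI (standard lun show and lun modify parameters; [svm] and [lun_path] are placeholders, and depending on the ONTAP release the LUN may need to be offline before the setting can be changed):

mcbp01::> lun show -fields space-allocation
mcbp01::> lun modify -vserver [svm] -path [lun_path] -space-allocation disabled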
3.4.9 Storage Object Tested Maximums
Testing for AFBP storage performance was carried out for ONTAP 9 with the storage maximums detailed in Table 5.

Table 5) Storage object limits.

Storage Object | Limit
Volumes per node | 200
LUNs per node | 8,000
Snapshot copies per volume | 40
Data aggregate space utilization | <75%
Note: These limits supersede limits detailed in other documentation, deployment, or best practices guides.
3.5 AFF SAN-Optimized Nodes and the Baseline Configuration
Because AFF storage controllers are likely to be used in deployments where consistent performance and low latencies are desired, AFF storage controllers that are ordered in the FC-SAN-optimized configuration are preconfigured to conform to the hardware and software guidelines laid out in section 3.1 and section 3.2, where applicable. These storage systems are not by themselves AFBP without the rest of the service offering lifecycle detailed in section 5, especially the end-to-end management, monitoring, and configuration validation tools.
3.6 Validating the AFBP Baseline Configuration
You can validate the AFBP baseline configuration with the Managed ONTAP SAN plug-in for the Config Advisor tool. This plug-in examines an AFBP cluster's current configuration and compares it with the baseline configuration, as detailed in this document. The resulting list of warnings should be kept for archival purposes and should be used as a list of items to be remediated. When starting Config Advisor, if the Managed ONTAP SAN plug-in is installed, the following should be seen in the output console:

CA 2015-11-02 11:04:01,427 INFO: 'Config Advisor GUI Started'
CA 2015-11-02 11:04:01,431 INFO: 'Plugin "Managed ONTAP SAN" Loaded.'
When running the plug-in against a storage cluster conforming to the baseline configuration, select the Managed ONTAP SAN execution profile. The resulting output details any areas where the storage cluster's current configuration differs from the baseline configuration. Any configuration details that do not conform to the baseline configuration should have remediation actions scheduled in order to reestablish compliance.

Note: The Managed ONTAP SAN plug-in does not start analyzing AFBP configuration compliance until it has verified that all nodes in the cluster have the differentiated services flag set to true and has checked for:
- Any nodes in the cluster that are not AFBP
- Any nodes not in the supported controller list (AFF80xx series as of this writing)
- Any nodes that aren't healthy
If any of these checks fail, the plug-in displays an error and does not run any further checks until the four preceding conditions are satisfied.

Figure 2) Config Advisor with Managed ONTAP SAN plug-in.
Figure 3) Verifying configuration with Configuration Advisor with Managed ONTAP SAN plug-in.
Table 6) Configuration checks performed by the Managed ONTAP SAN plug-in.

Check Name | Description
Node health check | Verifies that nodes are healthy and can be queried for information.
Differentiated services and all-flash optimized bit check | Verifies that all nodes have been marked for differentiated services and are optimized for all flash.
Model check | Verifies that all nodes are of supported models for the Managed ONTAP SAN program.
Network interfaces check | Makes sure that only SAN LIFs exist on the cluster.
Aggregates per node check | Verifies that there are from 1 through 10 data aggregates on each node.
Aggregates at home check | Verifies that all aggregates are currently being serviced by their owning node.
Aggregates utilization check | Verifies that no aggregates exceed 75% utilization.
Aggregate RAID group size check | Verifies that RAID groups do not exceed 16 disks.
SSDs per aggregate check | Verifies that data aggregates have at least 4 SSDs.
Volumes per node check | Verifies that no nodes own more than 200 volumes.
Snapshot copies per volume check | Verifies that no volumes have more than 40 Snapshot copies.
Offline dedupe with one stream maximum (runs only on 8.3.1 or later versions and earlier than 9.0 versions) | Verifies that offline dedupe is restricted to one stream.
Inline compression is enabled | Verifies that inline compression is enabled.
Disk type check | Verifies that all disks are of type SSD.
SFO check | Verifies that all nodes have SFO enabled.
SAN SVM (formerly Vserver) QoS check | Verifies that QoS is enabled on all SVMs.
LUNs per node check | Earlier than 9.0 versions: verifies nodes owning less than 4,096 LUNs. Later than 9.0 versions: verifies nodes owning less than 8,192 LUNs.
LUN space allocation check | Verifies that space allocation is disabled for all LUNs.
4 Performance Capacity, CPU Utilization, Storage Utilization, and Performance Capacity Planning
NetApp or partner system engineers should perform initial sizing using NetApp, OS, and application vendor best practices, typically using NetApp internal tools to determine optimal solution sizing. After initial sizing, NetApp recommends basing all incremental performance sizing, monitoring, capacity planning, and workload placement on available performance capacity. This is a departure from NetApp's previous recommendation, which was to size workloads to use less than 50% CPU utilization. That previous recommendation had the benefit of being easy to make and understand but is far less nuanced and more prone to guesswork than the current recommendation to use performance capacity planning for sizing.

NetApp's best practice for sizing a SAN AFBP environment is to use performance capacity to size each node to less than 50% of performance capacity on each controller in an HA pair. By sizing this way, you can maintain acceptably low latency in the event of a takeover. The cost of this approach is that you sacrifice a little of the steady-state top-level performance.

Before getting started, note that the discussion is based on the assumption of a transactional workload that uses IOPS and latency as principal metrics. With that said, let's define some terms and then finish the discussion with a real-world example of how OPM 7 can be used to determine the total performance capacity and performance capacity available on a given set of controllers.
4.1 Performance Capacity Terms

Figure 4) Performance capacity.
4.1.1 Optimal Point
The optimal point identifies the total performance capacity a resource has before latency increases more quickly than IOPS do. The optimal point can be determined by either:
- Finding the "knee" in the performance curve, where an increase in utilization leads to more rapidly increasing latency. Generally, performance curves are fairly flat at lower utilizations, but latency increases as the number of IOPS increases. There is a point in the curve where the rate of increase in latency starts accelerating more rapidly than the increase in the number of IOPS being served.
- Targeting a specific latency value and drawing a horizontal line at that latency value. The optimal point is the point where the IOPS curve intersects the latency threshold line you have just drawn.
Total performance capacity (or optimal point) = performance capacity used + performance capacity available.
4.1.2 Performance Capacity Used
Performance capacity used can be defined as the amount of the useful capacity of the resource that has been consumed. As noted earlier, the remaining useful capacity is performance capacity available. Performance capacity used = optimal point – performance capacity available.
4.1.3 Performance Capacity Available
Performance capacity available is derived by subtracting the performance capacity used from the total performance capacity (or simply performance capacity) of a node. The performance capacity is identified by the optimal point. From Figure 4, we see that: total performance capacity (or optimal point) = performance capacity used + performance capacity available.
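As a purely hypothetical worked example of these relationships (illustrative numbers, not measured values): if a node's total performance capacity, its optimal point, is 100% and OPM reports 45% performance capacity used, then performance capacity available = 100% - 45% = 55%. Sizing both nodes of an HA pair to roughly 50% performance capacity used, as recommended earlier in section 4, keeps the modeled takeover load near or below 100%.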
4.1.4 Operating Point
The operating point is the location on the curve where your resource is currently operating. This point illustrates the number of IOPS served at a given latency.

4.1.5 Unsafe Zone
The unsafe zone, in a performance context, can be defined as the portion of the performance graph that is above and to the right of the optimal point. Performance in this area has higher latencies, and small increases in IOPS yield larger increases in latency.

4.1.6 Safe Zone
The safe zone is the area of the performance graph that is below and to the left of the optimal point. This is the area of the graph where you see the highest throughput relative to latencies and is the area within which you want to operate. In order to maintain consistent low-latency, high performance, the operating point needs to stay inside the safe zone.
4.2 How to Use OPM 7 to Determine Performance Capacity: Working Example
Now that we have completed a conceptual overview of performance capacity planning and how to use performance capacity to methodically plan workload placement, let's proceed to an example of how to use OnCommand Performance Manager 7.0 (OPM) and an I/O load generator to determine the performance capacity of a NetApp controller pair. After looking at performance capacity to determine whether a controller pair has sufficient unused performance capacity to host a specific workload, we then look at OPM's node failover planning page. For more details about OPM 7.0 and detailed documentation, review the OPM Users Guide. In this example we used the host, switch, and NetApp hardware detailed in Table 7 and Table 8.

Table 7) Host and switch configuration used in the OPM planning example.
Hardware and Software Components | Details
Oracle database servers | 2 x IBM X3850 X5 for Oracle RAC; 1 IBM X3550 for SLOB
Server operating system | RHEL 7.2
Oracle database version | 12cR1 RAC
Processors/server | 64 logical cores: Intel Xeon X7560 at 2.27GHz
Physical memory/server | 128GB
FC network | 16Gb FC with multipathing
FC HBA | QLogic QLE2672 dual-port PCIe FC HBA
Network connections | 2 x Intel 82599ES 10Gbps SFI/SFP+ network connections
16Gb FC switch | Brocade 6510 24-port
10GbE switch | Cisco Nexus 5596
Table 8) NetApp storage array configuration used in the OPM planning example.

Hardware and Software Components | Details
Storage controller | FAS8080 configured as an HA active-active pair
Clustered Data ONTAP | v8.4.0
Number/size of SSDs | 48/800GB
FC target ports | 8 x 16Gb (4 per node)
Ethernet ports | 4 x 10Gb (2 per node)
SVMs | 1 x across both node aggregates
Management LIFs (Ethernet) | 2 x 1GbE data (1 per node connected to separate private VLANs)
FC LIFs | 8 x 16Gb data
Associate OPM 7 with the NetApp storage of interest:
1. Add the NetApp storage controller to OPM.
2. You need to provide the following details:
   - Host name or IP address
   - User name and password
   - Protocol (select HTTPS)
   - Port: 443
3. After OPM is connected to the NetApp controller, it starts collecting performance metrics, including IOPS, throughput, latency, and system utilization. Generally, the longer OPM is able to collect and analyze performance data, the more accurate its analysis and trending are. NetApp recommends that you allow OPM to gather at least 24 hours' worth of data before you rely too heavily on any patterns displayed.
4. After at least one controller is associated with OPM, you see a dashboard page similar to Figure 5 when you log in.
Figure 5) OPM 7 cluster dashboard.
5. The dashboard view separates areas of interest into different views and then adds a summary icon, using the familiar stoplight convention:
   - Green = good
   - Yellow = caution/warning
   - Red = error/problem
   Each of these icons can be drilled into for more information about the underlying objects, with more performance and trending details available. In Figure 5, a caution is indicated by the yellow Nodes icon under Utilization.
6. If you click the Nodes icon in Figure 5, you see more granular detail about the nodes being monitored. Figure 6 shows the two nodes in the cluster that are being monitored, with summary information about utilization, capacity, IOPS, throughput, and latency.

Figure 6) OPM 7 node summary page.
7. At this point you can select an individual node to get more node-specific detail by clicking the checkbox corresponding to each node about which you want detailed information. In Figure 7 we have selected the first node, nst-fas8080-22.
Figure 7) OPM 7 single-node drill-down.
8. On the right side of the node detail window are time ranges by which you can filter any display, and you can select which charts are displayed. In Figure 7 you can see that the utilization warning indicated by the yellow node summary icon on the dashboard corresponds to three latency warnings on the monitored node in the last 72 hours. This is indicated both by the yellow warning "dots" on the events graph and by the latency spikes Monday before noon and a couple of events bracketing 12 a.m. Monday. You can get much more specific times for these events by adjusting the time scale.
9. Among other calculations, OPM uses the metrics mentioned in step 3 to calculate performance information, which it logs in order to preserve historical data that can be used for reporting, analysis, and trending.
10. OPM presents the log data collected in a series of easily consumable graphs. The most common and useful of these for our purposes display latency, IOPS, MBps, and performance capacity used. A sample of these is shown in Figure 8.

Handling the workload from the host side:
1. In this example we use SLOB2 (the Silly Little Oracle Benchmark v2) to generate a synthetic online transaction processing (OLTP) workload.
2. The IOPS generated by SLOB depend on the number of users selected. You can increase the number of IOPS by increasing the number of users.
3. The workload mix selected for this example is 80% read and 20% write.
4. In order to arrive at ~50% performance capacity used, we ratchet up the workload incrementally; we start the load at 4 users and keep incrementing by 4 users at a time.
5. Using SLOB2, we watch the load for an hour, in order to give SLOB2 adequate warm-up time, before adding users. The amount of warm-up time necessary depends on the workload-generating tool.
6. Should the tool display a longer warm-up time and a higher degree of variation in the initial stages, a longer runtime is desirable.
7. Figure 8 shows different performance metrics for both nodes as the workload is ratcheted up:
Figure 8) OPM 7 performance summarization graphs showing both nodes in an HA pair.
8. To start with, the values are low. As the load is increased, the metrics vary accordingly.
9. At a certain point, the performance capacity used reaches ~50%. This is the optimal point for the storage controller. Operating at the optimal point ensures consistent performance and faster takeovers and givebacks in the event of planned and unplanned takeovers.
10. Next, select the Failover Planning tab in OPM in order to model performance in failover. Figure 9 shows the Failover Planning tab that you select to see performance on both controllers and modeling of how failover events affect performance.
Figure 9) OPM 7 failover planning tab.
11. Figure 10 illustrates what storage performance looks like in takeover, where only one node is serving data instead of a pair. It does this by showing both nodes' performance and then modeling estimated performance in takeover.
Figure 10) OPM 7 failover planning graphs showing both nodes' performance capacity and estimated takeover performance capacity.
12. Figure 10 shows the performance utilization of the surviving node to be a little over 100% performance capacity at the optimal point.
13. As performance capacity used increases in steady state (active-active mode), the performance capacity of the surviving node breaches the 100% usage mark and enters the red zone, where, in order to generate the desired number of IOPS required for the combined load, latency increases beyond acceptable limits. In Figure 10, you can see that performance in takeover is estimated to be roughly ~103% of performance capacity. This is also visually represented by the breaches of the yellow (unsafe: caution) and the red (unsafe: performance capacity breaching 100%) zones. In this scenario, you probably want to back your workloads down a little in order to reduce the performance capacity used on each node and get estimated takeover performance down below 100%. Of course, it may be that the breaches are of a short enough duration and their concomitant latency spikes are within acceptable ranges for your organization.
14. Figure 11 looks at two workloads that are already beyond 100% performance capacity on each of the two nodes to show how that translates into estimated takeover performance capacity used. As you can see, both controllers are already being pushed pretty hard and are likely to have fairly high latencies in steady state; in takeover those latencies get much higher.
Figure 11) OPM 7 failover planning graphs showing overprovisioned workloads with very high latencies in steady state and increasing latency in takeover.
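The following is a minimal host-side sketch of the incremental ramp described in steps 4 through 6. It assumes a standard SLOB2 installation in which slob.conf carries the read/write mix (UPDATE_PCT=20 for 80% read/20% write) and a run time of at least 3600 seconds, and in which runit.sh takes the number of users as its argument; the install path and user ceiling are examples, not recommendations.

    #!/bin/bash
    # Hedged sketch: ramp a SLOB2 OLTP load 4 users at a time. Each invocation of
    # runit.sh runs for the RUN_TIME configured in slob.conf, giving OPM a warmed-up
    # load point before the next increment.
    cd /opt/SLOB || exit 1                 # example SLOB2 install path
    for users in $(seq 4 4 64); do         # 4, 8, 12, ... 64 users (example ceiling)
        echo "$(date) starting load point with ${users} users"
        ./runit.sh "${users}"              # SLOB2 driver script
        # Between load points, check OPM's performance capacity used for both
        # nodes and stop ramping once either node reaches roughly 50%.
    done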
5 AFBP Service Offering Lifecycle
5.1 Sizing an AFBP Cluster
The AFBP cluster might need to grow over time, but NetApp or NetApp partner system engineers or architects should determine the cluster's initial number of nodes, disks, and shelves by using NetApp and application vendor sizing tools or the deployment guide associated with the applications that the cluster hosts. For other sizing guides appropriate to particular applications, see the References section and section 4 of this technical report.
5.2 Initial Setup and Prevalidation
Before beginning qualification and acceptance testing of the AFBP application environment, you should perform several steps after basic hardware installation of the cluster nodes. These steps are shown in the following checklists and validation guidelines.
5.2.1 Initial Hardware Setup Checklist
Install all the cluster nodes, including shelves, cluster network switches, and cabling, according to their installation guides. Table 9 shows the checklist items in order.
Table 9) Hardware setup checklist.

Number  Checklist Item
1       All of the AFBP cluster's hardware components are operational.
2.1     The cluster's data center environment falls within the parameters specified in the Hardware Universe.
2.2     No cluster nodes or network switches have fault indicators.
2.3     All power supply units and system fans are operational.
2.4     No shelf modules, disks, or SSDs display faults.
3       All nodes boot to ONTAP 9 or later.
4       The cluster network switches are running the standard configuration provided by NetApp.
5       The FCP and iSCSI licenses are enabled, as appropriate.
6       All nodes are in quorum and participating in the cluster.
7       The cluster's disks, cluster network, and HA failover cabling are correct and have been validated by the Config Advisor tool.
5.2.2 Initial Hardware Setup Validation
To validate the initial hardware setup checklist shown in Table 9, use the validation method for the corresponding checklist item in Table 10.
Table 10) Hardware checklist validation methods.
Number    Checklist Validation Method
1, 2.1    Validate according to data center policies and guidelines.
2.2, 2.3  Visually inspect cluster hardware for fault lights or other indicators. Review storage controller environmental sensor readouts. Review the cluster dashboard by using System Manager. Get storage controller and switch environmental sensor readouts:
          Storage controllers: system environment sensors show
          Cluster interconnect network switches: show environment
2.4       Get storage controller disk and shelf status values.
          Storage controller disks: storage disk show -fields nodelist,aggregate,state,errors
          All disks should show as PRESENT or SPARE, with no errors.
          Storage controller shelves: node run -node [nodename] -command storage show shelf
          All shelves should indicate shelf state as ONLINE and all port states as OK.
3         Get the currently running ONTAP version: system image show -iscurrent true -fields node,version
          ONTAP 9 should be the currently running version.
4         Review Config Advisor output.
5         Review the licenses currently installed: system license show
6         Review the currently running cluster information:
          set -priv diag
          storage failover show
          cluster kernel-service show
          cluster ring show
          debug smdb table bcomd_info show
5.2.3 Monitoring Setup Checklist
Install the tools shown in Table 11; they provide monitoring and alerting services for the AFBP cluster.
Table 11) AFBP environment-monitoring tools.
Monitoring Tool                 Version   Functionality
OnCommand Unified Manager       7         ONTAP internal monitoring and alerting
OnCommand Performance Manager   7         Steady-state performance modeling
OnCommand Insight               7.2.2     End-to-end performance and availability monitoring
OnCommand System Manager        9         ONTAP setup and system administration
AutoSupport®                    9         Call-home monitoring and alerting tool
Note: The versions listed in Table 11 are the latest available at the time of this writing.
Note: The NetApp AutoSupport tool is required for monitoring AFBP configurations. However, this requirement does not necessitate NetApp receiving AutoSupport output. If site security policy precludes sending AutoSupport notifications to NetApp, use a site-internal destination instead.
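As a hedged illustration of the site-internal option, the following clustershell commands direct AutoSupport messages to an internal SMTP relay and internal recipients while disabling transmission to NetApp; the host names and addresses are examples and should be replaced with site-specific values.

    # Sketch only: route AutoSupport to a site-internal destination (example values)
    system node autosupport modify -node * -state enable -transport smtp
    system node autosupport modify -node * -mail-hosts mailhost.internal.example.com
    system node autosupport modify -node * -to storage-ops@example.com -support disable
    system node autosupport show -fields state,transport,mail-hosts,to,support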
5.2.4 Configuration Tool Setup Checklist
A storage environment running the AFBP baseline configuration should also contain a system to run the Managed ONTAP SAN plug-in tool and the nSANity tool to validate that the AFBP cluster continues to operate within the baseline configuration parameters. See Table 12 for the list of configuration tools that are part of an AFBP environment.
Table 12) AFBP configuration tools.
nSANity (version 1.2.14)
  Schedule: when configuration changes
  Functionality: checks and preserves end-to-end configuration details

Managed ONTAP SAN plug-in (version 2.0)
  Schedule: daily (a scheduling sketch follows this table)
  Functionality: checks that the current storage configuration falls within the parameters specified in section 3

Config Advisor (version 4.5)
  Schedule: when cluster configuration changes
  Functionality: checks cabling and HA properties of storage systems
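Sites often automate the daily check from Table 12. The following crontab entry is purely illustrative: run_afbp_daily_check.sh is a hypothetical site-written wrapper (not a NetApp-provided script) that would invoke the Managed ONTAP SAN plug-in collection and archive its output for review.

    # Hypothetical crontab entry on an administration host (illustrative only)
    0 2 * * * /opt/site-tools/run_afbp_daily_check.sh >> /var/log/afbp_daily_check.log 2>&1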
5.2.4.1 OnCommand Insight Report Checklist
While OCI is an optional component of a NetApp AFBP SAN configuration, this section showcases just how valuable OCI can be as a monitoring tool. Storage administration and application stakeholders negotiate which storage performance, availability, and utilization reports to deliver, along with the report format and schedule. The reports take the form of dashboard views of the AFBP cluster storage from both application and total storage utilization viewpoints, as shown in Table 13 and Table 14.
Table 13) OnCommand Insight per-application reports.
End-to-end latency
  Description: ~Five-minute average latency of all objects associated with a given application, including storage volumes, fabric switches and ports, hosts, and VMs
  Suggested schedules: Daily/weekly/monthly

End-to-end throughput
  Description: ~Five-minute average throughput of all objects associated with a given application, as noted earlier
  Suggested schedules: Daily/weekly/monthly

Fabric redundancy/path count violations
  Description: Times at which violations occurred and were resolved, correlated with latency/throughput reports
  Suggested schedules: Daily/weekly/monthly

Storage growth delta
  Description: Growth of storage required by the application over time, along with chargeback value (if any)
  Suggested schedules: Weekly/monthly
Table 14) OnCommand Insight storage environment reports.
Storage volume latency
  Description: ~Five-minute average latency of all storage volumes on a per-node basis, along with "top volumes"
  Suggested schedules: Daily/weekly/monthly

Storage volume throughput
  Description: ~Five-minute average throughput of all storage volumes on a per-node basis, along with "top volumes"
  Suggested schedules: Daily/weekly/monthly

Overall aggregate capacity
  Description: Time at which violations occurred and were resolved, correlated with latency/throughput reports
  Suggested schedules: Daily/weekly/monthly

Storage growth delta
  Description: Graph of used vs. total capacity for the entire storage environment, along with ROI calculations
  Suggested schedules: Weekly/monthly
Note: You must add any existing AFBP applications to NetApp OnCommand Insight as a point of comparison and validation that the AFBP cluster is meeting application latency and availability requirements.
5.2.4.1.1 Prevalidation Tasks
Before beginning customer validation testing, you must examine the customer's AFBP application environment. If the environment does not fall within the guidelines of the IMT for the SAN FAS host solution, create action items and change requests to remediate the gaps before configuring hosts to access storage provided by the AFBP cluster. Table 15 lists common predeployment validation tasks that need to be performed as part of any AFBP implementation and provisioning process.
Table 15) Predeployment validation tasks.
1. Identify the hosts, fabrics, and networks that are to connect to the AFBP cluster, including hosts to be used during validation phases and when the AFBP cluster is serving applications in a production role.
   Desired result: customer-validated list of equipment in the AFBP application environment, including hosts, networks, and fabrics
2. Gather configuration details by using the nSANity tool and review them by using the SANView tool.
   Desired result: SANView output of the AFBP application environment, including relevant configuration details
3. Cross-check configuration details collected by SANView by using the NetApp Interoperability Matrix Tool.
   Desired result: for all the items in the list created during step 1, one of the following:
   - Verification that an item is validated and tested by NetApp according to the Interoperability Matrix
   - An action plan to change the storage environment to bring it in line with the Interoperability Matrix
   - A PVR filed for any AFBP application environment equipment or configurations that do not fall within the guidelines of the Interoperability Matrix
4. Connect hosts to the AFBP cluster by using the iSCSI or FCP protocol (a provisioning sketch follows the notes after this table).
   Desired result: LUNs provided by the AFBP cluster that are suitable for testing are mounted on hosts in the AFBP application environment
Note: In step 3, you do not have to cross-check Ethernet-only switches that do not include data center bridging (DCB) or FCoE functionality, because they are not part of the NetApp Interoperability Matrix for SAN FAS hosts.
Note: See the Data ONTAP SAN Configuration Guide for a description of SAN topologies and host setup details.
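Task 4 in Table 15 requires test LUNs from the AFBP cluster to be mounted on hosts. The following clustershell sketch provisions and maps a small test LUN over iSCSI; the SVM, volume, LUN, igroup, and initiator names are examples only, and host-side discovery and multipath configuration follow the SAN Configuration Guide.

    # Sketch only: provision a test LUN and map it to a test host (example names)
    # Assumes the volume afbp_test_vol already exists on SVM svm1.
    lun create -vserver svm1 -path /vol/afbp_test_vol/test_lun0 -size 100g -ostype linux
    lun igroup create -vserver svm1 -igroup afbp_test_hosts -protocol iscsi -ostype linux -initiator iqn.1994-05.com.redhat:testhost1
    lun map -vserver svm1 -path /vol/afbp_test_vol/test_lun0 -igroup afbp_test_hosts
    lun mapping show -vserver svm1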
5.2.4.1.2 Validation Testing
OnCommand Insight monitoring and reporting capabilities help keep the AFBP cluster serving data with consistent performance during the testing scenarios listed in Table 16. If OCI is not being used, customers need to develop other procedures for monitoring and testing their AFBP configuration.
Table 16) Application validation test items.
1. Cable pull and/or port shutdown to cause path failure: (1) from storage controller to fabric or Ethernet switch; (2) from host to fabric or Ethernet switch
   Desired result: path faults are detected by OnCommand Insight or OnCommand Unified Manager; storage volume performance is still within AFBP parameters.
2. Planned takeover/giveback of storage controllers (see the CLI sketch after this table)
   Desired result: storage I/O resumes within AFBP takeover/giveback limits; alerts are sent out by using OnCommand Unified Manager and AutoSupport.
3. Unplanned takeover/giveback of storage controllers
   Desired result: storage I/O resumes within AFBP takeover/giveback limits; alerts are sent out by using OnCommand Unified Manager and AutoSupport.
4. Application performance and availability testing as defined by customer requirements
   Desired result: as defined and configured by customer requirements and System Performance Modeler or other sizing input, the cluster operates within the AFBP performance guidelines for consistent performance.
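For tests 1 and 2, the cluster-side actions can be driven from the clustershell. The following sketch assumes a two-node HA pair with nodes named nodeA and nodeB and an Ethernet/iSCSI data port named e0g; all names are examples, and FC-attached environments would instead pull cables or disable fabric switch ports.

    # Sketch only: test 2, planned takeover/giveback of nodeB's resources by nodeA
    storage failover show                       # confirm HA is healthy before testing
    storage failover takeover -ofnode nodeB     # planned takeover
    storage failover show                       # wait for takeover to complete
    storage failover giveback -ofnode nodeB     # return resources to nodeB

    # Sketch only: test 1 (iSCSI variant), administratively down an example data port
    network port modify -node nodeA -port e0g -up-admin false
    network port modify -node nodeA -port e0g -up-admin true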
5.3 Ongoing Maintenance and Operations
Ongoing maintenance of the AFBP cluster includes:
- Monitoring the storage and reacting to alerts from OnCommand Insight and OnCommand Unified Manager
- Implementing configuration changes
- Keeping a record of change requests that apply to the AFBP cluster
- Updating configuration tool output regularly to validate the AFBP cluster's adherence to the baseline configuration guidelines
- Providing reports, according to a negotiated schedule and upon request, to AFBP application administrators or other stakeholders
- Troubleshooting issues that arise within the AFBP cluster
5.4 Managing and Scheduling Operations That Will Increase System Utilization
A number of operations that a storage administrator can run will temporarily increase processor and/or disk utilization while they execute. These include DataMotion™ operations such as vol move or LUN move, large Snapshot deletes, SnapMirror® initializations or rebaselines, and so on. As general guidance, we recommend scheduling these operations during nonpeak or lower-utilization periods where possible. We also recommend reducing the number of concurrent operations; for instance, run one vol move at a time to reduce the performance impact of that operation. Following these guidelines yields higher performance and allows operations such as vol moves to complete more rapidly, which reduces the amount of time your controllers are subject to the utilization costs of these operations.
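As an illustrative example of the one-at-a-time guidance, a single DataMotion volume move can be started and tracked with the following clustershell commands; the SVM, volume, and aggregate names are examples.

    # Sketch only: run and monitor a single volume move (example names)
    volume move start -vserver svm1 -volume db_vol01 -destination-aggregate node2_ssd_aggr1
    volume move show -vserver svm1 -volume db_vol01      # monitor state and percent complete
    # Start the next move only after this one reports a completed state.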
5.5 Heavy Cluster Interconnect Traffic Impacts
Excessive cluster interconnect traffic can affect the AFBP cluster's ability to consistently provide low-latency, high performance. Heavy traffic/congestion is often a symptom of nonoptimal paths, where host LUN access is through a LIF on a controller that doesn't host the LUN directly. This means that LUN I/O must traverse the back-end cluster interconnect network, which increases cluster traffic and can contend with other normal cluster traffic such as DataMotion activities (vol/LUN moves/copies) and other ONTAP internal administrative traffic. Contention can cause I/O delays, which may show up as latency spikes. Appropriately sizing the cluster interconnects, as called out in Table 2, minimizes but cannot completely eliminate the possibility of excessive traffic and congestion.
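One way to check for nonoptimal paths from the cluster side is to compare where a LUN's volume lives with the nodes reporting paths for it; the SVM and volume names below are examples. On the host side, a multipath listing (for example, multipath -ll on Linux) should show active/optimized paths only through LIFs on the node that owns the volume.

    # Sketch only: confirm which node owns the volume and which nodes report paths
    volume show -vserver svm1 -volume db_vol01 -fields node
    lun mapping show -vserver svm1 -fields reporting-nodes
    network interface show -vserver svm1 -data-protocol iscsi -fields curr-node,curr-port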
5.6 Support Considerations
To make sure that AFBP customers receive appropriate support, customers running AFBP configurations must purchase and maintain SupportEdge Premium support entitlements on all AFBP cluster nodes. Additionally, whenever an AFBP customer speaks with NetApp support personnel, they should make sure the support engineer is aware that they are an AFBP customer and that the clusters being supported are in an AFBP configuration.
Appendix: Using OnCommand Unified Manager for AFBP Application Monitoring and Alerting
NetApp OnCommand Unified Manager (OCUM) is part of the suite of monitoring tools to be deployed with an AFBP cluster. It should be deployed in tandem with OnCommand Performance Manager, with performance alerts arriving in OCUM. See the section "Monitoring Performance" in OCUM's online help for detailed instructions about linking them.
1. Add the AFBP cluster to OCUM. From the menu bar, select Storage > Clusters.
2. Add the AFBP cluster's details and then click Add. For Host Name or IP Address, use the host name or IP address of the cluster's management interface.
3. OCUM acknowledges that the AFBP cluster is added as a data source.
4. After the AFBP cluster has been added to OCUM, from its entry on the Storage > Clusters detail page, use Actions > Add to create an alert that sends event notifications from the AFBP cluster.
5. More granular alerting scenarios are outside the scope of this technical report. In this case, a catch-all alert is created for the AFBP cluster. Add an alert name and description.
6. The AFBP cluster should already populate the Selected Resources field. Move on to Events.
7. Select all of the event severities; then highlight them in Matching Events and add them to Selected Events.
8. Finally, add recipients for the alert. Then click Save.
9. OCUM shows a notification that the alert is now active.
10. Following the baseline configuration guideline, set thresholds for the aggregates in the AFBP cluster to no more than 75% full. From the Storage menu, select Aggregates.
11. Select the data aggregates from the AFBP cluster and click Edit Thresholds.
12. From here, the thresholds for the data aggregates can be edited collectively. The nearly full threshold should be set to permit enough turnaround time to add or balance capacity inside the AFBP cluster’s aggregates. Click the Save and Close button when the thresholds have been edited.
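If you want to spot-check the 75% guideline from the clustershell in addition to OCUM, a command along the following lines can be used; treat it as a hedged example rather than a required procedure.

    # Sketch only: list data aggregate fullness (should stay at or below ~75%)
    storage aggregate show -fields percent-used,size,availsize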
Using OnCommand Performance Manager for Complex AFBP Monitoring Thresholds
OnCommand Performance Manager 7 can create threshold policies for many storage objects, including ports and LIFs as well as LUNs, volumes, aggregates, and nodes, and is especially useful for close monitoring of AFBP storage resources along multiple axes. For example, a threshold could be set to send an alert at a higher level of urgency when a LUN's latency counter exceeds a limit and the node's utilization limit is also exceeded.
1. Select the LUNs item from the Storage menu after logging in.
2. Select the LUNs where the policy is to be applied.
3. If no threshold policies currently exist, you are given the opportunity to create a new one. The new policy is object-type specific and can include a secondary counter condition. This can be useful for creating an alert that is triggered if more than one of the AFBP thresholds is crossed during the same monitoring period.
References
The following references were used in this technical report:
- TR-4080: Best Practices for Scalable SAN in ONTAP 9
  http://www.netapp.com/us/media/tr-4080.pdf
- OnCommand Performance Manager for Clustered Data ONTAP 1.0
  http://mysupport.netapp.com/documentation/docweb/index.html?productID=61810
- Config Advisor
  http://mysupport.netapp.com/NOW/download/tools/config_advisor/
- Managed ONTAP SAN Plug-In 2.0 (Config Advisor Plug-Ins section)
  http://mysupport.netapp.com/NOW/download/tools/config_advisor/
- nSANity Diagnostic and Configuration Data Collector
  http://mysupport.netapp.com/NOW/download/tools/nsanity/
- NetApp Hardware Universe
  https://hwu.netapp.com/
- SAN Migration Using Foreign LUN Import
  http://www.netapp.com/us/media/tr-4380.pdf
- NetApp Support Offerings
  http://www.netapp.com/us/services-support/services/operations/services-descriptions.aspx
Contact Us
Let us know how we can improve this technical report. Contact us at [email protected]. Include TECHNICAL REPORT 4515 in the subject line.