X07_SVI-V3_Monitoring Workloads and System Activity

Symmetrix VMAX3 Internals: Essentials Monitoring Workloads and System Activity

This module reviews workload characteristics and how they impact performance on the Symmetrix. It also introduces the principles of performance management and several tools for viewing metrics that show how the workload characteristics affect the performance of the storage system.


Copyright © 2015 EMC Corporation. All Rights Reserved.


The objective of this module is not to make you an expert on performance, but rather to introduce some key concepts and provide several tools that can be used to understand the relationship between workload characteristics and the performance of the array.




As with many questions in systems design, the answer to many questions about performance expectations is “It depends.” This module explores some of the dependencies that impact performance. With the VMAX3, one of the objectives is to minimize the number of “knobs and levers” that can be tweaked and to focus on standard layouts based on proven best practices, including wide workload distribution using virtual provisioning, pooling of CPU resources with dynamic allocation, and FAST automation to ensure compliance with agreed-upon Service Level Objectives.




The high-level architecture of the VMAX has not changed with the VMAX3, but its implementation has. Hosts still connect through the frontend, physical disks are connected through the backend, and IOs go through cache. The key to achieving optimum system performance is maintaining a balanced workload across system components. The challenge is how to achieve this when the array typically handles many different workloads that are constantly changing. On the VMAX3, the system is 100% virtually provisioned, and the pools are preconfigured at the factory following proven best practices for size, protection, and layout. The value of virtual provisioning is well known; when configured following best practices, it achieves the best possible workload distribution. FAST (Fully Automated Storage Tiering) is always on to ensure that, as workloads change, data is placed in the correct pools to get the best overall performance and maintain compliance with Service Level Objectives. Global memory sizes have increased on the VMAX3, and the intelligent cache management algorithms ensure that more IOs are serviced from cache, providing the best possible response time. Workloads tend to come in bursts. Although the system may show low utilization on average, it can become stressed when bursts of activity arrive, which is typical of many environments. The VMAX3 addresses this by pooling CPU resources. Rather than dedicating CPU cores to ports, CPU cores are pooled and work together to handle the workload across ports.



Historically, performance has been treated as a break-fix issue: “My application is slow, it’s your array!” This puts EMC in the defensive position of doing the analysis to prove that it isn’t the array. Although it is possible that it is the array, it’s more likely that the requirements were not understood and the system was not sized properly, or the workload has changed, or the system was not laid out following best practices. Over the years EMC has developed customer tools for monitoring storage and has attempted to instill capacity management, including performance management, as an ongoing discipline in the data center. This has helped by providing a better understanding of what is “normal,” so that when something changes, the analysis can be focused. Each generation of Symmetrix introduces new capabilities and configuration options. Sometimes there are too many choices, and these may not be fully understood, which sometimes leads to systems not being configured optimally. With the VMAX3, not only have there been significant architectural implementation changes, but there has also been a move to service level based provisioning, using easily understood response time objectives. FAST and ongoing compliance monitoring have shifted the focus to understanding the workload requirements upfront, modeling the requirements to develop the optimum design with the right set of resources, and then building the system in a manner that follows known best practices. The result is a system that can handle a known workload at a specified SLO as measured by easily understood response time objectives. This allows for better prediction of when additional resources are required.



Performance tuning is all about identifying bottlenecks in the system so they can be eliminated. Bottlenecks occur when one or more components are overutilized and new work queues up and waits before being processed. A physical disk drive is a good example of a potential bottleneck. Because of mechanical and bandwidth limitations, a physical disk drive is capable of processing only a certain number of IOs in a given timeframe. If there is more work than can be processed, some work will queue up. The goal of performance analysis is to identify these bottlenecks or hotspots. Once a bottleneck is identified, it can be removed by rebalancing the workload. Quite often, after a bottleneck is removed, another will appear. Therefore, performance analysis is an iterative process of monitor, identify, resolve, and monitor again. While our focus is on the Symmetrix, there are many other factors that can be the root cause of perceived performance problems. Oftentimes, we monitor to confirm that the problem is not with the Symmetrix. At other times, real bottlenecks can be identified through monitoring, and tuning is necessary to resolve them. Although performance can be defined by raw numbers, it is usually the user experience in real-world environments that prompts the investigation of performance.




An IO is a unit of work, or a complete transfer, between two end points. There are many end points in an enterprise storage environment. From the time the operation is initiated by the application until it is processed by the storage system, different protocols may be involved and the data may be transformed. A single IO may be broken into multiple IOs and reassembled. Keep in mind that more than one complete transfer might be taking place at different points in the data path. The illustration shows how a single file might be broken up as it passes from the application to the array. The file write may be multiple IO operations at the file system, operating system, and Host Bus Adapter. The Fibre Channel HBA will transfer each IO it receives as a series of frames of up to 2 KB to the next connectivity device. The frames are routed through the fabric to the fibre adapter (FA) on a Symmetrix, where they are reassembled into the original IO as sent by the HBA. At the VMAX, the frontend director allocates one or more cache slots depending on the logical block address range. When the backend director persistently stores the data, smaller writes may be combined into larger writes, and the backend director applies the RAID protection, so a single host write to cache may result in multiple IOs to disk. Note: For this discussion, we are looking at the IO from the array perspective only.
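As a rough illustration of the fragmentation described above, the following sketch counts how many frames one IO would occupy on the link. It assumes the 2 KB maximum frame payload quoted in the text; the function name and constant are illustrative only and are not part of any EMC tool.

```python
import math

# Assumption: 2 KB of data per frame, the figure quoted in the text above.
FRAME_PAYLOAD_BYTES = 2 * 1024


def frames_per_io(io_size_bytes, payload_bytes=FRAME_PAYLOAD_BYTES):
    """Return how many frames are needed to carry one IO of the given size."""
    if io_size_bytes <= 0:
        raise ValueError("io_size_bytes must be positive")
    # A partially filled final frame still counts as a whole frame.
    return math.ceil(io_size_bytes / payload_bytes)


for size_kb in (4, 8, 64):
    print(f"{size_kb} KB IO -> {frames_per_io(size_kb * 1024)} frames")
```

Under this assumption an 8 KB IO travels as four frames and is reassembled into a single IO at the receiving port.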




Transferring a single IO between points involves more than just transferring the data. In every data transfer protocol, several steps are involved and there is additional link overhead that includes negotiation, header, and acknowledgement components.

• Negotiation and Acknowledgement – Both endpoints must agree to the transfer and manage the operation. This includes any “handshaking” tasks that processors use to schedule and organize the activity. Most protocols require some “setup” negotiation to start the IO and a final “finish” message to terminate it.
• Header – Includes identifier or address information and may include other information the endpoints need to understand the data. This may include CRC or parity bits used to error check the data, which adds overhead on the channel.
• Data – The actual data. This does not include the overhead of the header, negotiation, and acknowledgement.

The Negotiation, Header, and Acknowledgement components are largely fixed in size regardless of IO size. An 8 KB IO has the same negotiation and header information as a 32 KB IO; a larger-block IO differs only in having a larger data component. The illustration graphically shows the parts of an IO on a time graph. As you can see, the overhead components take time and consume bandwidth.
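To make the effect of a fixed overhead concrete, the sketch below computes the fraction of a transfer that is useful data for several IO sizes. The 512-byte overhead is a purely hypothetical figure chosen for illustration; the real overhead depends on the protocol.

```python
# Hypothetical fixed per-IO overhead (negotiation + header + acknowledgement).
OVERHEAD_BYTES = 512


def payload_efficiency(data_bytes, overhead_bytes=OVERHEAD_BYTES):
    """Fraction of the total transfer that is useful data."""
    return data_bytes / (data_bytes + overhead_bytes)


for size_kb in (8, 32, 128):
    efficiency = payload_efficiency(size_kb * 1024)
    print(f"{size_kb:>3} KB IO: {efficiency:.1%} of the transfer is data")
```

Because the overhead stays the same while the data component grows, the efficiency rises with IO size.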




IOs per second (IOPS) is a measure of the number of transfers per second between end points, typically between the host and the array but also between cache and the backend and between the backend and disk. IOPS is not a standalone metric and should be considered in the context of other metrics. For example, IOPS may be much higher in an OLTP (Online Transaction Processing) environment, which typically issues smaller-block IO, than in a DSS (Decision Support System) environment, which typically issues larger-block IO.




“Throughput” measures the volume of data transferred per second through an IO channel. From the array perspective, throughput measures the “useful” part of the IO and does not include the header and negotiation “overhead.” Throughput will be highest with larger IOs because there is less overhead on the link, and larger-block IOs make more efficient use of cache, require less processing, and can be read from or written to disk more efficiently. A “throughput” value is often used to rate the speed of a channel. This is more accurately termed “bandwidth.” This number is a theoretical maximum for all traffic on the channel, including headers and negotiation. Since performance measurement tools report the data volume and ignore header and negotiation, measured throughput will not reach the rated maximums in practice.




IOs per second, throughput, and IO size are closely related. When small IOs are transferred, less time is taken on each IO, increasing the number that can be moved in a given time period. The IOPS measure will be greater for small IOs than for larger IOs; however, the data payload will be smaller because there is more overhead for header, link negotiation, and acknowledgement. While more IOs may be sent with a small block size, the total data throughput will be less. The bottom graph illustrates this relationship: as the IO size increases, the IOPS measure decreases while throughput increases. Larger IOs are more efficient for the array to process; however, the size of the IO is outside the control of the storage administrator or architect. Understanding the IO size does help explain the behavior, however.
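The relationship between IO size, IOPS, and throughput can be sketched with a deliberately simple model: each IO costs a fixed overhead time plus a transfer time proportional to its size, and IOs are issued one after another. The overhead of 0.1 ms and the link rate of 800 MB/s are assumed values for illustration, not measured VMAX3 figures.

```python
def iops_and_throughput(io_size_kb, overhead_ms=0.1, link_mb_per_s=800.0):
    """Return (IOPS, MB/s) for a single serial stream of IOs of one size.

    Assumed model: time per IO = fixed overhead + size / link rate.
    """
    transfer_ms = (io_size_kb / 1024) / link_mb_per_s * 1000
    iops = 1000 / (overhead_ms + transfer_ms)
    throughput_mb_s = iops * io_size_kb / 1024
    return iops, throughput_mb_s


for size_kb in (4, 8, 32, 128):
    iops, mb_s = iops_and_throughput(size_kb)
    print(f"{size_kb:>3} KB: {iops:8.0f} IOPS  {mb_s:7.1f} MB/s")
```

Even in this toy model, IOPS falls and throughput rises as the IO size grows, which is the shape of the curves described above.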




There are considerations other than IO size that influence IOPS and throughput. Workload skew and locality of reference are very important. Skew can be defined as the percentage of the workload that is performed on a given percentage of the data. A common skew might be 85-15, where 85% of the work is performed on 15% of the data. When the workload is skewed in this manner, which is typical of most environments, tiered storage and FAST can be very efficient at positioning the most active data on the fastest storage. Locality of reference is the sequentiality of the IO. If a host reads consecutive logical block addresses, prefetching can stage data in cache and increase the cache hit rate. Writes to consecutive blocks allow more efficient writes to the backend. For example, sequential small block writes can be folded into more efficient larger writes on the backend. An example of this is a full stripe write with RAID5 vs. reading the old data and parity before writing new data and new parity for small block writes. Utilization is another consideration. While ideally you would want to see utilization of 100%, as utilization increases, queueing also increases, and this affects the observed response time. For example, if it takes 10 ms to service an IO from disk and there are 10 IOs ahead of it in the queue, the IO waits about 100 ms before it is serviced, so the response time observed by the application will be 100 ms or more. This may not be acceptable.
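Skew as defined above can be computed from per-extent IO counts. The sketch below is one way to do that; the extent counts are made-up sample data constructed to show an 85-15 skew, and the function name is hypothetical.

```python
def workload_share(io_counts, data_fraction):
    """Share of all IO that lands on the busiest `data_fraction` of extents."""
    if not io_counts or not 0 < data_fraction <= 1:
        raise ValueError("need IO counts and a data fraction in (0, 1]")
    ranked = sorted(io_counts, reverse=True)
    top_n = max(1, round(len(ranked) * data_fraction))
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0


# Made-up sample: 20 equally sized extents, 3 of them (15%) are hot.
extent_ios = [40, 30, 15] + [1] * 15 + [0] * 2

share = workload_share(extent_ios, 0.15)
print(f"{share:.0%} of the IO goes to the busiest 15% of the data")
```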




Every component in the system has limits on the amount of work it can perform while maintaining a reasonable response time. Again, while it is ideal to get 100% utilization on system components, the thing to remember is that as utilization increases, queueing occurs, and queueing affects response time. As a general rule, we like to plan for utilization in the 40% to 60% range. If on average there is 50% utilization, there are very likely bursts of activity that peak at 100%, which can lead to response time issues. Also take into account degraded mode: the impact if a director should fail and the surviving director in the engine must take on the additional workload, or if a drive fails and the data needs to be rebuilt. This puts an additional burden on the system that will likely increase utilization and response times.
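One common way to see why response time climbs as utilization rises is the textbook single-server queueing approximation R = S / (1 - U), where S is the service time and U the utilization. This M/M/1 model is an assumption introduced here for illustration; it is not taken from the course material and real components will not follow it exactly.

```python
def response_time_ms(service_ms, utilization):
    """Approximate response time using the M/M/1 formula R = S / (1 - U)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1 - utilization)


# Assumed 10 ms service time at several utilization levels.
for utilization in (0.4, 0.5, 0.6, 0.8, 0.9):
    print(f"{utilization:.0%} busy -> {response_time_ms(10, utilization):6.1f} ms")
```

Under this approximation the response time doubles at 50% utilization and grows steeply beyond that, which is consistent with planning for moderate average utilization.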




Response time is often the most important measure of performance to the end user and is used together with IOs per second and throughput. A system may be able to do more IOPS at the expense of response time. Response time measures the total time taken to process an IO from the frontend perspective, from the time the request is received until the time it is complete. The response time measured by the Symmetrix does not include SAN connectivity devices, HBA delays, or host queueing. Host response times are also measured by the host operating system. The host response time runs from the start of the application’s request to the operating system’s final acknowledgement to the application. This measure includes queueing, HBA and connectivity delays, as well as the time it takes the Symmetrix to process the request. With VMAX3, response time is a key component in SLO based provisioning; the system monitors compliance with the SLO to ensure that response time expectations are met.




Real time analysis concerns what is going on right now in the Symmetrix. The Solutions Enabler symstat command provides an interface for looking at system activity in real time. Unisphere for VMAX, using the storstpd daemon, can also report on real time activity. Real time monitoring may help detect bursts of activity where the peaks are often lost due to averaging over longer term samples. Because of the frequency of data collection with real time monitoring, typically fewer metrics are available for these short-term samples.
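The way averaging hides a burst can be shown with a few lines of arithmetic. The sketch below uses an invented workload (a steady 1,000 IOPS with a 10-second burst of 20,000 IOPS) and compares short averaging windows with one long window; the numbers are illustrative only.

```python
def windowed_average(samples, window):
    """Average consecutive groups of `window` samples (last group may be shorter)."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Invented workload: 300 one-second samples with a 10-second burst.
samples = [1000] * 300
samples[100:110] = [20000] * 10

short_term = windowed_average(samples, 5)     # 5-second averages
long_term = windowed_average(samples, 300)    # a single 5-minute average

print(f"peak of 5-second averages: {max(short_term):.0f} IOPS")
print(f"5-minute average:          {long_term[0]:.0f} IOPS")
```

The short windows still show the full burst, while the single long window reports a figure only modestly above the baseline.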




The statistics command (symstat) performs the following:

• Queries Symmetrix devices to capture raw performance counts and store them in memory.
• Retrieves the performance counts for the Symmetrix array as a whole.
• Retrieves the performance counts for a director or director port.
• Retrieves the performance counts for one or more Symmetrix devices.
• Retrieves the performance counts for one or more Symmetrix device groups, composite groups, or RDF groups.
• Retrieves the performance counts for a selection of, or all, Symmetrix disks.
• Retrieves the timestamp of the performance count sample.
• Retrieves and displays replication session statistics for SRDF/A.
• Retrieves GigE iSCSI network statistics.
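Raw cumulative counts only become rates such as IO/sec once two samples are compared over a known interval. The sketch below shows that general technique; it is not a description of how symstat is implemented, and the counter names and values are hypothetical.

```python
def rates_per_second(previous, current, interval_s):
    """Convert two snapshots of cumulative counters into per-second rates."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return {name: (current[name] - previous[name]) / interval_s
            for name in current}


# Hypothetical counter snapshots taken 60 seconds apart.
earlier = {"reads": 1_000, "writes": 400}
later = {"reads": 7_000, "writes": 1_000}

print(rates_per_second(earlier, later, 60))
```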




Oftentimes when doing capacity planning and system monitoring, a longer term perspective is required. However, for troubleshooting an immediate issue or when performing testing and benchmarking, an understanding of what is happening right now is needed. This is where symstat proves its value as a tool. When using symstat, you need to specify a collection interval using the -i flag, and optionally a count using -c. In earlier versions of symstat it was possible to set a collection interval of a few seconds. Today, if an interval of less than 60 seconds is specified, symstat will use 60 as the minimum. The count flag is optional; if it is not specified, symstat runs continuously until the command is terminated using Ctrl-C. The default is to report activity against all devices on the system. On a busy array with potentially thousands of active devices, it is more likely you are interested in a subset of devices. To focus the output, specify a device group, specific devices, or directors. Remember that, like other Solutions Enabler commands, the request and the response are sent in-band using the same path as used for normal IO. Running symstat on the same host and ports that are being used to generate IO may skew the results; if a front-end port is already stressed, running symstat may give unpredictable results.




symstat presents different types of information, giving different perspectives on the workload. Requests is the most common set of statistics and reports on activity coming into the Symmetrix. Other metrics report on activity to and from cache and activity to disk. There is also a set of statistics for SRDF/A replication.




The default type of statistics for the symstat command is requests; therefore, if a type is not specified with the -type option, requests are presented. The -c argument defines the number of samples. The default for this argument is continuous sampling: if you specify an -i value but no -c value, the command produces continuous statistical output until it is cancelled with Ctrl-C. To display only information of interest, you can use -g and specify a device group. For example:

    symstat -i 60 -c 3 -g test_dg

Here is an example of how to create a device group:

    C:/> symdg create test_dg
    C:/> symdg -g test_dg addall -devs 154:15d




The display here shows an example of the output of the symstat command. If a type is not specified, the default type of statistics for the symstat command is requests. In this example, we specified a device group using the -g option. This limits the output to only the devices in the device group. If a device group is not specified, symstat reports against all active devices. Using the -i option, the output is generated with an update interval of 60 seconds, and -c specifies a count of 5 intervals before the command exits. IO/Sec and throughput are self-explanatory. The %hit of approximately 46% would be reasonable for a semi-random workload on a lightly used Symmetrix, as is the 100% cache hit for writes. This indicates that 100% of writes are to tracks that are already in cache and marked as write pending. As cache utilization increases with other workloads, the system is unlikely to sustain 100% hits for writes. Notice the Device Write Pending values; while these numbers may seem high, they are normal because neither the total write pending nor the individual device write pending is approaching the ceiling.




The VMAX3 can have up to 16 TB of global memory, but cache is still not an infinite resource. On the Symmetrix every IO must go through cache; thus cache is the heart of the system. From a user’s data perspective, cache serves two primary functions:

• Maintains recently accessed data, making it readily available. Statistically, users will access data that was recently used. The longer data resides untouched, the less likely it is to be accessed again. The system uses Tag-based caching, a Least Recently Used (LRU) algorithm, to determine which cache slots to overwrite.
• Buffers writes until they can be de-staged to a persistent location on disk. When a host writes data to the VMAX, it is written to cache, and the host is notified that the write is complete. Data is asynchronously written to disk by the backend directors. The priority of this is dependent on the number of tracks that are write pending.
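The least recently used behavior described above can be illustrated with a minimal sketch. This is a generic LRU cache, not the array's actual tag-based implementation; the class name, slot count, and track labels are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Generic LRU cache of track identifiers with a fixed number of slots."""

    def __init__(self, slots):
        if slots <= 0:
            raise ValueError("slots must be positive")
        self.slots = slots
        self._tracks = OrderedDict()  # oldest entry first, newest last

    def read(self, track):
        """Return True on a cache hit; on a miss, stage the track and return False."""
        if track in self._tracks:
            self._tracks.move_to_end(track)    # mark as most recently used
            return True
        if len(self._tracks) >= self.slots:
            self._tracks.popitem(last=False)   # overwrite the least recently used slot
        self._tracks[track] = True
        return False


cache = LRUCache(slots=2)
for track in ("a", "b", "a", "c", "a", "b"):
    print(track, "hit" if cache.read(track) else "miss")
```

In the usage loop, track "b" is the one evicted when "c" arrives, because "a" was touched more recently.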

Through the Virtual Matrix, the memory on each director board forms a single global memory address space. In addition to being used for caching of reads and writes, global memory also contains metadata such as track ID tables for each device configured. This metadata is used by the system to determine if data is in cache and the status of the data. Cache is dynamic and is the global memory left after the metadata. Creating new devices and/or adding snapshots consumes global memory and reduces the cache slots available. To ensure fairness and to prevent a device from consuming too much global memory, there are two ceilings imposed: System Write Pending and Device Write Pending limits.



There are two Write Pending ceilings on the VMAX:

• System Write Pending: 75% of usable cache. When the Symmetrix is at the System Write Pending ceiling, all new writes are delayed until the number of write pending tracks is below this ceiling.
• Device Write Pending: Will not allow a single device to have more than 5% of the available System Write Pending limit. If the limit is reached, a Device Write Pending event occurs, and new writes to the device are delayed until the Device Write Pending is below the ceiling.

When new write requests come into the frontend faster than they can be de-staged on the backend, the number of write pending tracks increases. As long as the number of write pendings is less than 50% of the available cache slots, new requests are serviced on the frontend at a higher priority than de-staging write pendings on the backend. When the number of write pendings increases to over 50%, de-stage activity is given a higher priority. When either the Device or System Write Pending ceiling is reached, de-staging write pendings is given the highest priority. Effectively, hitting either Write Pending limit makes the system respond to writes at disk speeds. Typically this happens when the backend disks are oversubscribed; the solution is either to change the disk type or to add more disks to spread the workload. These ceilings are dynamic and based on the amount of global memory and device configuration in the system. The symcfg list -v command displays these limits.
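The ceilings and priority changes described above can be expressed as a small decision function. This is a sketch that uses only the percentages quoted in the text; the function names and priority labels are hypothetical, and on a real system the limits are dynamic and should be read from the array rather than computed this way.

```python
SYSTEM_WP_PERCENT = 75       # System Write Pending: 75% of usable cache
DEVICE_WP_PERCENT = 5        # Device Write Pending: 5% of the system limit
DESTAGE_BOOST_PERCENT = 50   # above this, de-staging gets a higher priority


def write_pending_limits(usable_cache_slots):
    """Return (system_limit, device_limit) in cache slots."""
    system_limit = usable_cache_slots * SYSTEM_WP_PERCENT // 100
    device_limit = system_limit * DEVICE_WP_PERCENT // 100
    return system_limit, device_limit


def destage_priority(system_wp, device_wp, usable_cache_slots):
    """Classify de-stage priority from the current write pending counts."""
    system_limit, device_limit = write_pending_limits(usable_cache_slots)
    if system_wp >= system_limit or device_wp >= device_limit:
        return "highest"      # a ceiling was reached; new writes are delayed
    if system_wp * 100 > usable_cache_slots * DESTAGE_BOOST_PERCENT:
        return "elevated"     # de-staging now outranks new frontend requests
    return "normal"           # frontend requests are serviced first


print(write_pending_limits(1_000_000))
print(destage_priority(600_000, 1_000, 1_000_000))
```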



The -type memio option shows cache statistics, including write pendings and prefetch activity. In this example, taking into account the previous output of the symcfg command, we see that we are not close to either the system or the device write pending ceiling.




This diagram compares the VMAX3 with the previous VMAX architecture, in which there were four frontend emulations and four backend emulations with ports and CPU cores dedicated to each instance. With VMAX3 each director has one instance of frontend and one instance of backend that are serviced by a pool of CPU cores; each port can leverage the full processing power of the pool of CPU resources. This enables a VMAX3 to better respond to bursts of activity. Activity on a VMAX3 can be measured at both the director level and the port level; the director level shows all IO for all ports on a director. Frontend directors, also called channel directors, are configured in pairs, providing redundancy and continuous availability in the event of repair or replacement.




Frontend activity on a VMAX3 can be measured at both the director and the port level, with the director level showing all IO activity to all ports for a director. There are queues for each device on a frontend director, and the CPU resources process IOs from these queues. For best performance, a host should be connected through multiple frontend directors. When configured using PowerPath or other multipathing software, IO operations can be spread across multiple queues serviced by multiple cores and the multithreaded frontend emulation code. It is recommended that a host be connected to a minimum of two frontend directors, which also minimizes the impact of a failure in any component of the IO path.




In this example we are looking at the IO activity for a director. This includes all IO for all ports. Note: The generic director type -SA was specified. This includes all open systems frontend directors, including Fibre Channel, iSCSI, and FCoE.




Using -type Port shows the activity on individual ports. The key point is to balance workload across available ports.
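One simple way to judge balance from per-port figures is to compare the busiest port with the average. The sketch below does this for a hypothetical set of port IO rates; the port labels, numbers, and the ratio itself are illustrative and are not a metric reported by symstat.

```python
def port_shares(port_iops):
    """Return each port's share of the total IO rate."""
    total = sum(port_iops.values())
    return {port: (iops / total if total else 0.0)
            for port, iops in port_iops.items()}


def imbalance_ratio(port_iops):
    """Busiest port's IO rate divided by the mean rate (1.0 means perfectly even)."""
    mean = sum(port_iops.values()) / len(port_iops)
    return max(port_iops.values()) / mean if mean else 1.0


# Hypothetical IO/sec observed on four frontend ports.
observed = {"port_a": 4200, "port_b": 3900, "port_c": 400, "port_d": 300}

for port, share in port_shares(observed).items():
    print(f"{port}: {share:.0%} of frontend IO")
print(f"imbalance ratio: {imbalance_ratio(observed):.2f}")
```

A ratio well above 1.0, as in this made-up sample, suggests that a few ports carry most of the work and the host paths may need rebalancing.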




There are three components involved with processing an IO from the backend perspective:

• Logical Device: IOs between a host and a Logical Device that result in backend activity. For example, read misses and new writes to tracks that are not currently in cache and marked as write pending.
• Disk Director: IO activity to/from the physical disks controlled by the director. This includes host IO and the protection writes to mirrored and RAID devices that are the result of host IOs. On the VMAX3, the EDS emulation supports the backend director for other IO activity that is not directly the result of an IO request. This would include FAST movement and local replication.
• Physical Disk: All reads and writes to physical disk that are the result of host IO, RAID protection, rebuilds, and local and remote replication.
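How host IO turns into backend IO can be estimated with a simple calculation: read misses go to disk once, and each host write is multiplied by a RAID write penalty. The penalties used below (2 for RAID 1, 4 for RAID 5, 6 for RAID 6) are the widely used rules of thumb for small random writes, not figures from this course, and the sketch ignores write folding, full stripe writes, prefetch, FAST movement, and replication.

```python
# Rule-of-thumb backend IOs per small random host write (an assumption).
WRITE_PENALTY = {"RAID1": 2, "RAID5": 4, "RAID6": 6}


def backend_iops(host_iops, read_fraction, read_hit_fraction, raid):
    """Estimate disk IO/sec generated by a host workload."""
    reads = host_iops * read_fraction
    writes = host_iops - reads
    read_misses = reads * (1 - read_hit_fraction)   # only misses reach disk
    return read_misses + writes * WRITE_PENALTY[raid]


# Hypothetical workload: 1,000 host IOPS, 50% reads, 50% read hit rate.
for raid in WRITE_PENALTY:
    print(f"{raid}: {backend_iops(1000, 0.5, 0.5, raid):.0f} backend IOPS")
```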




In this example, we are looking at the total IOs from a DA director to disk. Again, the key to best performance is to balance workload across all directors.




The frontend director detects sequentiality and dispatches a task to the backend director to begin prefetching.




The actual prefetching is performed by the DA. On a system with low activity, you may see more prefetching because it is done as a lower priority task.




Physical disks, especially mechanical disks, are often the slowest link in the IO chain. As we have seen from prior discussions, ideally most IOs are serviced from cache. However, read misses require IO to disk, which means the IO is performed at the speed of the mechanical disk, including positional and rotational latency. The impact grows rapidly when disks are oversubscribed and there is queueing at the drive level. While a mechanical drive may be able to service an IO in 10 ms, if there are 10 IOs ahead of it in the queue, the IO waits about 100 ms before it is serviced, so the response time will be 100 ms or more, which is likely to cause unacceptable performance. The best way to minimize the impact is to use the fastest storage for the most active data and to spread the workload wide across as many drives as possible.
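The queueing arithmetic above can be sketched as follows. The sketch assumes a strict first-in, first-out queue with a constant service time, and it separates the time spent waiting behind other IOs from the IO's own service time; the function name is hypothetical.

```python
def fifo_times_ms(service_ms, ios_ahead):
    """Return (wait_ms, response_ms) for an IO joining a FIFO drive queue."""
    if service_ms < 0 or ios_ahead < 0:
        raise ValueError("inputs must be non-negative")
    wait_ms = service_ms * ios_ahead        # time until the drive is free
    return wait_ms, wait_ms + service_ms    # response adds the IO's own service


for depth in (0, 1, 5, 10):
    wait, response = fifo_times_ms(10, depth)
    print(f"{depth:>2} IOs ahead: wait {wait:>3} ms, response {response:>3} ms")
```

With 10 IOs ahead and a 10 ms service time, the wait alone is 100 ms, which is why spreading the workload across more drives matters.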




Using the -type DISK option with the symstat command displays IO requests and throughput on a physical disk. The drive is identified by the DA director number and spindle ID. We generally use the rule of thumb that a 15K RPM drive can perform approximately 150 IOPS while maintaining reasonable response time. For a 7.2K RPM drive, the rule of thumb is closer to 50 IOPS. In the example, we are seeing ~1300 IOPS to some drives. These are EFD drives, which can easily do 10X the IOs of a 15K RPM mechanical disk.
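The rules of thumb above translate directly into a rough drive count for a given backend load. In the sketch below the 15K and 7.2K figures come from the text, and the EFD figure is derived from the "10X" statement; the function name and the sample load are hypothetical, and a real sizing exercise would also consider capacity, RAID type, and response time targets.

```python
import math

# IOPS per drive at reasonable response time (rules of thumb from the text;
# EFD taken as 10x the 15K RPM figure).
RULE_OF_THUMB_IOPS = {"15K": 150, "7.2K": 50, "EFD": 1500}


def drives_needed(backend_iops, drive_type):
    """Minimum number of drives to serve a backend IO rate within the rule of thumb."""
    return math.ceil(backend_iops / RULE_OF_THUMB_IOPS[drive_type])


# Hypothetical backend load of 3,000 IOPS.
for drive_type in RULE_OF_THUMB_IOPS:
    print(f"{drive_type}: {drives_needed(3000, drive_type)} drives")
```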




While real time tools allow you to see what is happening right now, diagnosing performance problems and capacity planning require longer term monitoring. Solutions Enabler includes the storstpd (Symmetrix Trends and Performance) daemon, which collects both real time and diagnostic information and makes it available to performance management tools such as Unisphere for VMAX. This daemon, like all Solutions Enabler daemons, is managed by the stordaemon command.




Before Unisphere for VMAX (U4V) can be used to display SLO compliance information, or any performance information, it must register with the local storstpd daemon to collect the necessary performance information. There are two levels of data collection: Root Cause Analysis and Real Time. Real Time collects high-level KPIs at 5 second intervals and is useful for displaying bursts of activity that may be lost in averaging in the Root Cause Analysis collection, which by default uses a 5 minute interval but covers a much more extensive set of performance metrics. Generally you want to collect diagnostic information at all times, but you may enable and disable real time collection as needed because it increases syscall traffic.




New with VMAX3 and Unisphere for VMAX 8.0 is compliance with Service Level Objectives. This is typically observed at the Storage Group level because SLOs are applied to a Storage Group. In this example, we see that there are nine Storage Groups defined. Five of them have SLOs defined and all are in compliance with the response time requirements. The other four Storage Groups use the default SLO of “Optimize.” This view also shows the general approach with Unisphere for VMAX of starting with a high-level view and drilling down as appropriate.




The performance views are organized into three levels:

• Monitor – High level information that can help determine if the system is working optimally. Within this view there is heatmap, summary, and dashboard level information.
• Analysis – This provides greater detail than the Monitor views and allows viewing of more Key Performance Indicators (KPIs).
• Charts – This view allows a user to create custom charts of KPIs.

The illustration here is a view of the heatmap. This chart is based on component utilization, with a color coded representation of the major subsystems in which the color represents the utilization; red indicates 100% utilization. This view is a good starting point when evaluating a system to see the distribution of workload and whether any component is oversubscribed. The hover capability shows the details of the components; as we can see in this example, an FA director is red and the hover shows that it is over 85% utilized.




This is the Monitor Summary view; it provides more detail about the specific workload. While this specific view is for Storage Groups, you can see from the dropdown that there is similar summary level information for other components. If there is information of interest, from this view you can drill down to the Analysis level for more details.




Dashboards present information similar to the summary view, but in a graphical manner. Similarly, if there is information of interest, from this view you can drill down to the Analysis level for more details by clicking the Navigate to Analyze button.




Under Analysis, there are three levels of detail:

• Real Time is updated at 5 second intervals and reflects the last hour of activity. The default behavior is to overwrite older data. If the information is of interest, it can be saved as a real time trace for later analysis. It is also possible to schedule the capture of a real time trace for some time in the future.
• Root Cause Analysis data is averaged and updated at 5 minute intervals by default. It allows up to 24 hours to be displayed in one view.
• Trending and Planning data is averaged and reflects a minimum of 24 hours of activity but can display up to 12 months of activity. This is ideal for understanding trends for long term planning.

The example here is of Root Cause Analysis data and as you can see from the dropdown, it can be displayed for different system components.




Charts allow a user to create their own charts using real time, diagnostic, or historical information. Simply select the object to be analyzed, select one or more metrics, and click the chart button to build a chart that can be displayed and/or saved as a dashboard and exported for use elsewhere.




There are a number of tools available for generating workloads. For this class we will be using Iometer, an open source tool that is widely used in Windows environments. It allows a user to specify the characteristics of a workload and the target devices to send it to, and graphically displays the results.




While Iometer can be run in a cluster environment with multiple dynamos on different servers, for our lab exercises we will run a local dynamo and multiple workers. The first step is to select the targets. For most tests, we want to use the raw device. If the device has a partition or has been initialized as a dynamic disk, it will not show in the list.




Next we create an Access Specification. This defines the block size, the level of sequentiality, and the read/write mix. An Access Specification can be assigned a name and reused.
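To show how these three parameters shape a workload, the sketch below models an access specification as a small data structure and generates a stream of IO requests from it. This is not Iometer's code or file format; the class, field names, and generator are hypothetical and only mirror the three settings named above.

```python
import random
from dataclasses import dataclass


@dataclass
class AccessSpec:
    """Hypothetical stand-in for a named access specification."""
    name: str
    block_size: int          # bytes per IO
    percent_sequential: int  # 0 = fully random, 100 = fully sequential
    percent_read: int        # 0 = all writes, 100 = all reads


def generate_ios(spec, count, device_blocks, seed=0):
    """Return a list of (operation, byte_offset) tuples following the spec."""
    rng = random.Random(seed)
    next_block = 0
    ios = []
    for _ in range(count):
        # Jump to a random block unless this IO is meant to be sequential.
        if rng.randrange(100) >= spec.percent_sequential:
            next_block = rng.randrange(device_blocks)
        operation = "read" if rng.randrange(100) < spec.percent_read else "write"
        ios.append((operation, next_block * spec.block_size))
        next_block = (next_block + 1) % device_blocks
    return ios


oltp_like = AccessSpec("8K random 70/30", 8192, 0, 70)
for operation, offset in generate_ios(oltp_like, 5, device_blocks=1000):
    print(operation, offset)
```

Changing only the three numeric fields is enough to move between, for example, a small-block random mix and a large-block sequential read stream.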




The last setup step is to add an Access Specification to the workers.




To start Iometer, click the green start flag. When prompted, specify a location to save the results file. The results can be displayed as an average since the start of the test or since the last update. The update interval can be set as low as 1 second.




SPEED refers to a tool for both maintaining and distributing performance information. There are separate SPEED groups and qualification processes for different platforms. Some of the performance information is highly EMC confidential and only available to SPEED “Gurus”. A Guru is someone who has demonstrated basic knowledge and competencies in performance by passing a qualification exam and has agreed to the covenants of the program. This includes respecting the confidentiality of the information and tools and only sharing as appropriate. It is also understood as part of the program that you will contribute to the community and help others with performance issues whenever possible. The three-day Engineering Education Symmetrix Internals: Performance class, along with self-study materials, will prepare you to successfully complete the qualification exam. Cindy O’Toole manages the program.




In this class, we introduced some key concepts and provided insight into several tools that can be used to understand the relationship between workload characteristics and the performance of the VMAX3 array.


