Solving Intermittent SAN Performance Problems
Slowdowns in a SAN infrastructure are usually the result of a combination of circumstances. A SAN is a complex system, with many moving parts, all interacting with each other. Troubleshooting requires reproducing the problem, something that can seem nearly impossible given the number of subsystems and the number of variables (including time) that must be controlled.
Recreating the problem requires recreating the same workload, the same conditions, the same time base and the same sequence of events. For IT admins and network teams, trying to document the entire situation, with multiple events occurring within multiple subsystems at the same time, can be a real challenge. Like trying to record a live event with a still camera, you’re left wondering where to position the camera to capture the right ‘scene’ and when to take the picture. It’s a multi-dimensional problem that requires more than a one-dimensional solution.
Performance problems are caused by a combination of many factors. Trying to monitor all of them with the tools most companies have at their disposal can be next to impossible, and determining which storage systems or network elements to monitor can be a frustrating experience. Like the carnival game “Whack-a-Mole”, where participants try to hit figures as they pop up through a matrix of holes, administrators can’t figure out which ports or systems to watch and are left trying to react fast enough to catch the problem.
Troubleshooting
When a performance issue occurs, most companies invoke a troubleshooting process that looks something like this:
An intermittent slowdown is reported with an application. After some basic investigation, the server or application team discovers that a metric, such as IOPS or throughput, has degraded. The first assumption is often that the problem lies with the I/O subsystem, the SAN. So the SAN team turns on some monitoring and waits for the problem to recur (this assumes they’re watching the right I/O path, with the right tools). But when the problem does recur, it’s usually picked up by the same tools the first group used, so it only reconfirms what’s already known: IOPS or MB/s are down.
Next, the component vendors (disk arrays, fabric switches, HBAs) are called and immediately ask for a detailed description of the problem and when it occurred, plus diagrams of the infrastructure. They also ask for a lot of information, with everything time-stamped, including (but not limited to): log dumps from each monitoring tool in use, a GetConfig from the affected servers, iostat and sar output for UNIX hosts, and PerfMon data for all physical disk objects on Windows hosts.
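To make the “everything time-stamped” requirement concrete, here is a minimal sketch of a host-side collection on a UNIX server. It assumes sysstat’s iostat and sar are installed; the output directory, sampling interval and sample count are placeholders rather than values any vendor mandates, and the Windows/PerfMon side is omitted.

```python
#!/usr/bin/env python3
"""Sketch of a time-stamped host I/O capture for a vendor escalation.

Assumes a UNIX host with sysstat installed (iostat/sar); the interval,
sample count and output directory are illustrative placeholders.
"""
import datetime
import pathlib
import subprocess

OUT_DIR = pathlib.Path("/var/tmp/san_capture")  # hypothetical location
INTERVAL, SAMPLES = 5, 60                        # 5-second samples for 5 minutes

def capture(name, cmd):
    """Run one collector and tag its output with a UTC start timestamp."""
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    with (OUT_DIR / f"{name}.log").open("a") as f:
        f.write(f"=== {name} started {stamp} ===\n{out}\n")

if __name__ == "__main__":
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    capture("iostat", ["iostat", "-x", str(INTERVAL), str(SAMPLES)])
    capture("sar_disk", ["sar", "-d", str(INTERVAL), str(SAMPLES)])
```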
If the problem occurs frequently or is predictable (a big “if”), it may be relatively easy to generate these data. When the logs capture the event with enough granularity, and with an accurate time base, the vendor(s) may be able to provide some insight into the root cause.
Troubleshooting complex systems like this is a difficult task even if you’re on-site full time and are familiar with the environment. It can be nearly impossible to do part time and at a distance, by technicians who have never seen the data center and are also working on several other customers’ issues.
More than likely, they’ll ask you for more metrics, with more care taken to record the time differences between the various component clocks, because the time variable is critical for separating cause from effect. As an example, one major storage vendor’s best practices document for troubleshooting includes the following: “Detail the exact timeline of all events before, during and after the event - be as specific as possible”.
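The clock-offset point is worth illustrating. The sketch below shows one way log timestamps from different components could be shifted onto a common time base once each clock’s skew has been measured (for example with ntpq or chronyc); the component names, offsets and timestamp are hypothetical, not output from any particular tool.

```python
"""Sketch: normalizing log timestamps from components with skewed clocks."""
from datetime import datetime, timedelta

# Measured skew of each component's clock relative to a reference, in seconds
# (positive means the component's clock runs ahead). Invented values.
CLOCK_OFFSET = {"host_a": +2.4, "array_1": -11.0, "switch_3": +0.3}

def to_reference_time(source, local_ts):
    """Shift a component's local timestamp onto the common time base."""
    return local_ts - timedelta(seconds=CLOCK_OFFSET[source])

event = datetime(2011, 6, 14, 9, 30, 12)    # illustrative log entry
print(to_reference_time("array_1", event))  # 2011-06-14 09:30:23
```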
After a time, without resolution, pressure builds to ‘throw hardware at the problem’ (something the component vendors don’t object to). If this course is taken, with luck and often a lot of money spent, some degree of resolution is achieved, usually temporary, and the problem goes away for a while. But to really fix the problem, it has to be recreated, along with detailed information about what was going on with the other systems in the environment. You have to capture everything that’s pertinent to the problem, and you have to control the time dimension. Otherwise you’re just waiting for the mole to pop back up.
Monitoring all elements, all the time
Comprehensive monitoring requires a persistent, network-based solution, not a typical SRM product or a collection of tools from different switch or storage systems. The network is the common denominator, so a monitoring system should be connected through physical-layer monitoring products, such as VirtualWisdom from Virtual Instruments. In this way, administrators can capture network activity continuously, in real time, instead of simply polling different SAN components every few minutes. This produces a recording of all SAN activity that won’t miss an important but intermittent event, like a latency metric that doubles or triples for a short period of time. When measured and averaged over a period of several minutes, as many SRM tools do, such an event may be completely hidden.
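A small worked example shows why interval averaging can hide a short event. The numbers below are invented for illustration and do not come from VirtualWisdom or any SRM tool.

```python
"""Sketch: why a several-minute average can mask a short latency spike."""

# One latency sample per second over a 5-minute window, 2 ms baseline,
# with a 15-second burst where latency quadruples. Invented data.
samples_ms = [2.0] * 300
samples_ms[120:135] = [8.0] * 15

average = sum(samples_ms) / len(samples_ms)
peak = max(samples_ms)

print(f"5-minute average: {average:.2f} ms")    # ~2.30 ms -- looks normal
print(f"worst 1-second sample: {peak:.1f} ms")  # 8.0 ms -- the real event
```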
From this vantage point the system can transparently monitor the SAN and provide an ‘end-to-end’ view, detecting things like code violations, loss of sync and frame errors, metrics that can identify connectivity problems with cables or optical transceivers. These systems can also monitor storage access times, port congestion and exchange completion times, the time required to complete the transfer of all SCSI blocks included in a transaction.
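As an illustration of the exchange completion time metric, the sketch below derives it from per-frame timestamps, measuring from the first command frame of a SCSI exchange to its final status frame. The frame records and exchange ID are hypothetical; in practice a physical-layer monitor supplies the real per-frame timestamps.

```python
"""Sketch: computing exchange completion time (ECT) from frame timestamps."""

# (exchange_id, frame_type, timestamp in seconds) -- invented sample data
frames = [
    ("0x1a2b", "SCSI_CMD",    10.000120),
    ("0x1a2b", "DATA",        10.000480),
    ("0x1a2b", "DATA",        10.000910),
    ("0x1a2b", "SCSI_STATUS", 10.002350),
]

def exchange_completion_time(frames, exchange_id):
    """ECT: first command frame to final status frame of one exchange."""
    times = [t for xid, _, t in frames if xid == exchange_id]
    return max(times) - min(times)

ect = exchange_completion_time(frames, "0x1a2b")
print(f"ECT for exchange 0x1a2b: {ect * 1000:.3f} ms")  # ~2.230 ms
```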
A number of different factors can contribute to a performance problem and trigger an alert by the monitoring system. When this occurs, a dashboard can provide an overview of the environment from which the network team can view the triggering event, such as high port congestion, link errors or SCSI errors. Since the data gathered for this comprehensive display are captured simultaneously throughout the environment by physical-layer probes, they’re completely time-synchronized. With all the elements on the dashboard synched to that specific time, administrators can view the status of each Fibre Channel segment, storage system and network component in real time or at any point in the past.
Like a DVR, solutions such as VirtualWisdom can 'rewind' back to the point in time when a problem occurred and examine what was going on with every network segment, switch port and storage device in the environment. This recording controls the time dimension and enables IT staff to clearly see the details of a performance-related event to identify probable causes. Most importantly, besides speeding up the process of problem resolution, replaying a recording of the original event means that the infrastructure doesn’t have to endure another failure before the problem can be addressed.
This real-time recording enables problem events to be recreated on demand, to isolate and resolve performance problems, even intermittent ones. Being able to replay the sequence of events before and after the 'symptom' as many times as needed, at any speed, can control the time dimension and separate cause from effect, the problem from the symptom. The result is no log file parsing, no calls to vendors, no finger-pointing and no big purchase orders for unneeded equipment to solve the intermittent performance problem.