Monday, March 5, 2018

Fail-slow at scale: When the cloud stops working



If you've ever had a system fail-slow, you know how maddening it is. The lights are on, the fans are running, but nobody is home. Is it software? A background process run amok?

So you reboot and hope for the best - and maybe that works. If not, more hair pulling.

Now imagine you have a 100-node cluster that suddenly slows to a crawl. Reboot 100 nodes?

No, you've got management software to look at. And, nada. There's nothing that the software sees that could be the problem cluster-wide. So you start looking at each server. Then the storage. Then the network.

And, yes, there's a slow 1Gb NIC - a $50 board - that's created a cascading failure and brought your half-million-dollar cluster to its knees.
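That kind of culprit is exactly what a simple peer comparison can surface. Here's a minimal sketch (mine, not the paper's) of the idea: sample one throughput figure per node and flag anything far below the cluster median. The node names, numbers, and 50 percent threshold are all invented for illustration.

```python
# Minimal sketch: flag fail-slow nodes by comparing each node's observed
# NIC throughput against its peers. All figures here are made up.
from statistics import median

def find_slow_nodes(throughput_mbps, threshold=0.5):
    """Return nodes whose throughput falls below threshold x the cluster median."""
    cluster_median = median(throughput_mbps.values())
    return {
        node: mbps
        for node, mbps in throughput_mbps.items()
        if mbps < threshold * cluster_median
    }

# Hypothetical samples: 98 healthy 10GbE nodes and one stuck on a slow link.
samples = {f"node{i:03d}": 9400.0 for i in range(99)}
samples["node042"] = 940.0  # the $50 NIC dragging everyone down

print(find_slow_nodes(samples))  # {'node042': 940.0}
```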

FAIL-SLOW AT SCALE

At last month's USENIX Conference on File and Storage Technologies (FAST '18) in Oakland, California, a team of 23 authors presented Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems. They describe fail-slow failures at five companies, four universities, and three national labs, with node counts ranging from over 100 to over 10,000.

They found that some faults were permanent until fixed, some caused only partial slowdowns, and others were transient - the hardest kind to diagnose.

The paper has some cautionary anecdotes that are amusing, if only in retrospect:

". . . one operator put an office chair adjacent to a storage cluster. The operator liked to rock in the chair, repeatedly popping hotplug drives out of the chassis (a hard correlation to diagnose)."

But many of the failures were more subtle:

  • ". . . a vendor's buggy firmware made a batch of SSDs stop for seconds, disabling the flash cache layer and making the entire storage stack slow."
  • ". . . a machine was deemed nonfunctional due to heavy ECC correction of many DRAM bit-flips."
  • ". . . bad chips in SSDs reduce the size of over-provisioned space, triggering more frequent garbage collection."
  • ". . . applications that create a massive load can cause the rack power control to deliver insufficient power to other machines (degrading their performance), but only until the power-hungry applications finish."
  • "A fan in a compute node stopped working, making other fans compensate the dead fan by operating at maximal speeds, which then caused a lot of noise and vibration that subsequently degraded the disk performance."


Naturally, finding these problems took a minimum of hours and often days, weeks, or even months. In one case an entire team of engineers was pulled off a project to diagnose a bug, at a cost of tens of thousands of dollars.

ROOT CAUSES

The paper summarizes the causes of the 101 fail-slow incidents they analyzed. Network problems were the #1 cause, followed by CPU, disk, SSD, and memory issues. Most of the network failures were permanent, while SSDs and CPUs had the most transient errors.

Nor does the root cause necessarily rest with the slow hardware, as in the case above where a power-hungry application on some servers caused other servers to slow down. In another case, the vendor couldn't reproduce the user's high-altitude failure mode at their sea-level facility.

THE STORAGE BITS TAKE

Any sysadmin plagued by slowdowns should read this paper. The researchers' taxonomy and examples are sure to be helpful in expanding one's vision of what could be happening.

For (one more) example:

"In one condition, a fan firmware would not react quickly enough when CPU-intensive jobs were running, and as a result, the CPUs entered thermal throttle (reduced speed) before the fans had the chance to cool down the CPUs."
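On Linux you can at least check for this kind of throttling yourself by comparing each core's current frequency to its maximum. The paths below are the standard cpufreq sysfs files; the script itself is my own illustrative sketch, not anything from the paper, and the 80 percent threshold is arbitrary.

```python
# Illustrative sketch: report CPU cores running well below their maximum
# frequency (one symptom of thermal throttling), using Linux cpufreq sysfs.
from pathlib import Path

def throttled_cpus(threshold=0.8):
    """Yield (cpu, current_khz, max_khz) for cores below threshold x max frequency."""
    for cpufreq in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq")):
        try:
            cur = int((cpufreq / "scaling_cur_freq").read_text())
            top = int((cpufreq / "cpuinfo_max_freq").read_text())
        except (OSError, ValueError):
            continue  # core offline or files unreadable; skip it
        if cur < threshold * top:
            yield cpufreq.parent.name, cur, top

if __name__ == "__main__":
    for cpu, cur, top in throttled_cpus():
        print(f"{cpu}: {cur / 1000:.0f} MHz of {top / 1000:.0f} MHz max")
```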

All in all, a fascinating compendium of failure statistics and types. And for those of us who don't manage large clusters, a welcome sense of many bullets dodged. Whew!

