
Wednesday, October 29, 2014

Why your old SAN doesn't scale

The speed of flash storage devices has changed how applications should access data and heralded the end of traditional storage architectures.

Thanks to virtualization, the efficiency and flexibility of the server side of computing have improved by leaps and bounds. However, the storage side has remained largely stagnant. In fact, the storage world hasn’t changed much since the days when tape ruled the data center. As a result, we now find ourselves in a situation where one part of the data center stack is significantly more efficient than the other. Worse, it’s all organized in a way that can’t take advantage of recent innovations in storage media, namely flash.

In response, tech giants like Google and Facebook have built their own scalable, cost-efficient storage systems, but this kind of innovation hasn’t yet made its way to the enterprise data center. Meanwhile, the storage market is filled with companies selling containers stuffed with disks. These short-term solutions promise a lot, but they can’t solve the problem at hand.

In this article, I’ll describe some of the insights the Coho Data engineering team and I have had in building a high-performance, Web-scale storage system. Specifically, I'll focus on the challenge of exposing the full capabilities of emerging flash hardware in a modern scale-out storage system.

When trying to understand and improve the performance of any software system, a common first step is to identify the most significant performance bottleneck. We all intuitively know how this works: If you drive the system as fast as it can go, the bottleneck is the part that prevents it from going faster. To make the system go faster, the focus must be on identifying and fixing that bottleneck. However, the interesting thing about bottlenecks is that they never go away. They simply move around.
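To make that intuition concrete, here is a minimal sketch that models a request path as a chain of stages; the stage names and throughput numbers are purely illustrative, not measurements from any real system:

```python
# A minimal sketch of the bottleneck intuition: end-to-end throughput is
# capped by the slowest stage. All numbers are illustrative assumptions.

stages_mbps = {
    "application": 4000,   # how fast the app can issue and consume data
    "network":     1250,   # a 10Gbps link expressed in MB/s
    "controller":  900,    # storage head / CPU dispatch
    "media":       100,    # a single spinning disk, sequential access
}

bottleneck = min(stages_mbps, key=stages_mbps.get)
print(f"End-to-end throughput: {stages_mbps[bottleneck]} MB/s, "
      f"limited by the {bottleneck}")

# Speeding up any other stage changes nothing; fixing the bottleneck
# just moves it somewhere else.
stages_mbps["media"] = 2500  # swap the disk for a flash device
bottleneck = min(stages_mbps, key=stages_mbps.get)
print(f"Now limited by the {bottleneck}: {stages_mbps[bottleneck]} MB/s")
```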

In storage systems, the bottleneck has always been the media -- originally because of the mechanical limitations of tape, then of spinning disks. A single spinning disk can stream data sequentially, for reads or writes, at about 100MBps. When that disk has to access data randomly, however, that number drops to 10MBps or less, often a lot less. No other aspect of performance has really mattered, because the mechanical cost of physically moving the disk heads around to reach your data dominates everything else. Because disks are so much slower than every other part of the system, the fastest storage systems in the world have focused on aggregating lots and lots of disks. Even then, the disk was still the bottleneck.
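A quick back-of-envelope calculation shows why disk-era systems had to aggregate so many spindles. Using the rough per-disk figures above and a target of about 1,250MBps (the speed of the 10Gbps links discussed below), the spindle counts grow quickly:

```python
# Back-of-envelope spindle counts, using the rough per-disk numbers from
# the text. These are illustrative figures, not benchmarks.
import math

seq_disk_mbps = 100       # one spinning disk, sequential access
rand_disk_mbps = 10       # the same disk under random access
target_mbps = 10_000 / 8  # ~1,250 MB/s, roughly one 10Gbps link

print("Disks needed for the target with sequential I/O:",
      math.ceil(target_mbps / seq_disk_mbps))   # 13 disks
print("Disks needed for the target with random I/O:",
      math.ceil(target_mbps / rand_disk_mbps))  # 125 disks
```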

With enterprise-class, PCIe-attached solid-state storage devices, this situation has completely reversed. Even a single PCIe SSD is faster than literally hundreds of spinning disks. Not only that, it doesn't have the mechanical limitation that makes random access slow. It is now possible to buy a single storage device that can saturate a 10Gbps network link. Think about that for a second: A single device is fast enough to saturate a high-speed physical network connection!

The result of this change in the components used to build storage systems is that the bottleneck has moved entirely. The slowest part of the system is suddenly the absolute fastest. If I put additional flash devices alongside that first device, in the same way I might add disks to a conventional array, the network itself becomes the bottleneck. I am wasting performance, because my applications can't take full advantage of what the devices are capable of.
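To see how quickly that waste piles up, here is a small sketch of several flash devices sharing a single fixed link, as they would in a conventional array. The device and link speeds are assumptions for illustration, not vendor specs:

```python
# Why the network becomes the bottleneck once flash arrives: the link is
# fixed while device bandwidth keeps stacking up. Assumed figures only.

flash_device_mbps = 1500        # one PCIe SSD (assumed)
link_mbps = 10_000 / 8          # a single fixed 10Gbps link, ~1,250 MB/s

for devices in (1, 2, 4, 8):
    possible = devices * flash_device_mbps      # what the flash could deliver
    delivered = min(possible, link_mbps)        # what the fixed link allows
    wasted = 100 * (1 - delivered / possible)
    print(f"{devices} device(s): {possible:>6.0f} MB/s possible, "
          f"{delivered:.0f} MB/s delivered, {wasted:.0f}% wasted")
```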

The network isn’t the only component that becomes a bottleneck. These flash devices are so fast that processing I/O requests quickly enough to take full advantage of them consumes an enormous amount of CPU. In fact, request processing consumes so much CPU that PCIe flash devices effectively need dedicated processors just to issue requests fast enough to saturate that 10Gbps connection.
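The arithmetic behind that claim is easy to sketch. Assuming 4KB requests and a purely illustrative per-request software cost, filling a 10Gbps link with small I/O keeps several cores busy doing nothing but dispatch:

```python
# Rough CPU math for small-block I/O. The per-request software cost is an
# assumption for illustration, not a measured number.

link_bytes_per_sec = 10_000_000_000 / 8   # 10Gbps expressed in bytes/s
io_size_bytes = 4096                      # assume 4KB random requests
cpu_us_per_request = 20                   # assumed per-request dispatch cost

iops_needed = link_bytes_per_sec / io_size_bytes
cores_busy = iops_needed * cpu_us_per_request / 1_000_000

print(f"IOPS required to fill the link: {iops_needed:,.0f}")       # ~305,000
print(f"CPU cores consumed on dispatch alone: {cores_busy:.1f}")   # ~6.1
```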

To understand the performance implications of this aspect of new storage systems, I like to think about the idea of "data aperture." In photography, the aperture of a lens is the width of its opening, which determines how much light can pass through. You can think about access to your data the same way: Data aperture is the width of the path from all of your applications to all of the data they need to access.

Storage systems traditionally haven't had to worry about aperture because it wasn't a bottleneck, but now it absolutely is. This was one of the first challenges that our engineering team faced two years ago, as we started to wrap our heads around what it would mean to build scalable storage using these emerging high-performance devices. After a lot of benchmarking and analysis, we realized that the only way to build a scalable system without imposing significant bottlenecks was to balance all of the physical resources used in the design of the storage system.
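One simple way to reason about that balance is to treat data aperture as the narrowest aggregate resource on the path from applications to data. The sketch below, with entirely assumed numbers, shows how an unbalanced design leaves its largest resource mostly idle:

```python
# A simple model of data aperture: the narrowest aggregate resource on the
# path from applications to data. All figures are assumptions.

def data_aperture_mbps(flash_mbps, cpu_dispatch_mbps, network_mbps):
    """Aggregate bandwidth the applications can actually use."""
    return min(flash_mbps, cpu_dispatch_mbps, network_mbps)

# Unbalanced: lots of flash behind a single controller and link.
print(data_aperture_mbps(flash_mbps=8 * 1500,
                         cpu_dispatch_mbps=1200,
                         network_mbps=1250))   # 1200 -- most of the flash sits idle

# Balanced: each resource sized to keep up with the others.
print(data_aperture_mbps(flash_mbps=1500,
                         cpu_dispatch_mbps=1500,
                         network_mbps=1250))   # 1250 -- very little waste
```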

Traditional storage systems have relied on a fixed amount of network connectivity and a static storage controller (or “head”), then added disks in order to scale up performance and capacity. Modern storage systems must take a different approach. Namely, CPU and network resources must scale out in proportion to the available high-performance flash.

A result of balanced resources is that a storage system can be designed around matched pieces: a PCIe flash device is paired with a CPU fast enough to handle I/O dispatch between it and the network, and the pair is attached to a 10Gbps network interface that the available flash can keep well utilized. Thus, the aperture of access to data increases linearly as the storage system scales out.
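As a rough sketch, with per-building-block figures that are assumptions rather than product specs, the difference between this balanced scale-out approach and a traditional fixed-head array looks like this as units are added:

```python
# Contrast scale-out (balanced building blocks) with scale-up (a fixed
# controller and link). Per-block numbers are assumptions for illustration.

node_flash_mbps = 1500     # one PCIe flash device per building block
node_cpu_mbps = 1500       # CPU sized to keep up with that device
node_nic_mbps = 1250       # one 10Gbps interface per building block

scale_up_head_mbps = 1250  # fixed controller and link in a traditional array

for n in (1, 2, 4, 8, 16):
    scale_out = n * min(node_flash_mbps, node_cpu_mbps, node_nic_mbps)
    scale_up = min(n * node_flash_mbps, scale_up_head_mbps)
    print(f"{n:>2} building block(s): scale-out aperture {scale_out:>6} MB/s, "
          f"scale-up aperture {scale_up:>5} MB/s")
```

The scale-out aperture grows linearly with each block added, while the scale-up system stays pinned at whatever its controller and link can handle.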

This data aperture challenge is only one reason why the days of traditional scale-up arrays are numbered. Web-scale approaches that pair flexible, commodity hardware with scale-out architectures will be much better suited to incorporating quickly evolving flash hardware in more performant and cost-effective ways.
