Some thoughts about RAID in 2024
I made a strange decision, one I didn't think I'd make, when building two new servers. Instead of arranging the SATA SSDs I bought as a RAID set, I just set them up individually, with different VMs on different drives. There's a reason for it, and I'm not entirely sure it was the right decision, but I didn't get any feedback suggesting I was fundamentally wrong.
When are you supposed to use RAID?
“RAID is not a back-up”
It's one of those glib phrases that's used in discussions about RAID when people are asking whether they should use it. It's also, very obviously, wrong (except in the case of RAID 0, of course.) RAID is primarily about preventing data loss, which it does by duplicating data. That's literally a back-up. And for many people, that's the same improvement they'd get from copying all their data to another drive on a regular basis, except that RAID results in less data loss than that strategy because it's automatic and immediate.
There are situations RAID doesn't handle. But there are also situations “copying your disk” doesn't handle either. Both react badly to the entire building catching fire, unless you're in the habit of mailing your back-up disks to Siberia each time you make a back-up, or are ploughing through your monthly 1TB Comcast quota shoving the data, expensively, into the cloud. On the other hand, RAID is pretty bad at recovering from someone typing 'rm -rf /', though that happens less frequently than people think it does. And arguably, a good file system lets you recover from that too. And a good back-up system should somehow recognize when you intentionally remove a file.
“RAID is about uptime and high availability”
...which RAID does by automatically backing everything up in real time. And yes, that's 100% true of RAID on an enterprise-grade server. But if you shoved five disks into your whitebox PC (keyword: into) and configured them as a RAID 6 set (because someone told you RAID 5 is bad, we'll get to that in a moment) then it's more about predictable uptime, in the sense that you'll be able to say “Well, I'm going to need to turn off the computer to fix this bad disk, but I can wait until things are quiet before doing it.”
So yes, it's kind of about uptime and, more importantly, high availability. And high availability is probably what you're actually after. But RAID, on its own, only has a limited impact on high availability.
“RAID makes some things faster, and other things slower”
This bit's awkward, because while it's been technically true in the past that RAID's spreading of data across multiple disks has helped with speed, assuming a fast controller that can talk to multiple disks at once, it's always been a six-of-one, half-a-dozen-of-the-other kind of situation. You probably read files more often than you write them, and most RAID configurations speed up the former at the expense of the latter, but overall it's unclear what kind of advantage you'd gain from using it. RAID 1 or “RAID 10” were once probably the best options if you were just looking for ways to speed up a slow server. But technologies change. We have SSDs now. It is very unlikely that RAIDing SSDs makes any difference to read speeds whatsoever.
Conclusion: RAID is ultimately a way to make one component of your system, albeit a fairly important one, more reliable and less prone to data loss. But even when it works it isn't sufficient on its own, and it needs augmenting with other systems to prevent data loss.
How are technology changes affecting RAID?
Disk sizes, error rates, and matrixing
Until the early 2010s, the major technology change concerning storage was the exponential increase in the capacity of regular hard disks. Around 2010 or so, experts started to warn computer users that continuing to use RAID 5 might be a problem, because RAID controllers would mark an entire disk as bad if they saw as few as one unrecoverable read error. Disk capacities had grown faster than their reliability per bit, so the chance of there being a bad sector on your disk grew with its capacity.
The scenario RAID 5 opponents worried over was essentially this: one disk fails, and then, while rebuilding its data onto a replacement disk, the RAID controller notices a problem with one of the other drives. This can make the data in question unrecoverable, assuming the original failed drive is completely offline, which it might be if the entire disk failed, or in some silly circumstances where the RAID controller wants to be snotty about disks it found errors on, or if there are only three hotswap bays. Regardless, most (all?) RAID controllers will refuse to continue at that point, because there's no way to recover without losing data, and possibly a lot of data because of the matrixing algorithms used in RAID 5.
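To get a feel for why people started worrying, here's a rough back-of-envelope sketch in Python. The one-error-per-10^14-bits figure is the unrecoverable read error rate commonly quoted on consumer drive datasheets; the drive sizes and the four-disk array are illustrative assumptions, not anything rigorous.

```python
import math

URE_RATE = 1 / 1e14  # commonly quoted consumer-drive spec: one error per 1e14 bits read

def rebuild_failure_probability(drive_tb: float, surviving_drives: int) -> float:
    """Chance of at least one unrecoverable read error during a full rebuild.

    Naive model: every bit on every surviving drive must be read back once;
    uses a Poisson approximation for the chance of at least one error.
    """
    bits_to_read = surviving_drives * drive_tb * 1e12 * 8
    expected_errors = bits_to_read * URE_RATE
    return 1 - math.exp(-expected_errors)

for size_tb in (1, 4, 12):
    p = rebuild_failure_probability(drive_tb=size_tb, surviving_drives=3)
    print(f"4 x {size_tb} TB RAID 5, one dead disk: ~{p:.0%} chance the rebuild hits a URE")
```

With 1TB drives the rebuild will probably succeed; by 12TB this naive model says it will almost certainly trip over at least one bad sector somewhere.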
RAID 6 requires two redundant drives instead of one, and so, in 2010, was considered better than RAID 5. But recently we've been hearing similar concerns expressed about RAID 6. At the time of writing there is no standard RAID 7, with three redundant drives, but several unofficial implementations. A potential option is just to use RAID 1 with more than two drives, with a controller that can tolerate occasional, non-overlapping, errors.
These concerns are rarely expressed for RAID 1/10, but they have the same potential issue: a disk fails, you replace it, and the controller finds a bad sector while replicating the surviving drive. In theory, though, less data would be lost in that scenario because there's no matrixing involved.
SMR/Shingled drives
Post-2010, the major innovation in hard disk technology has been “shingling” (SMR). As disk capacities increase, the width of the magnetic tracks on those disks inevitably decreases, until it becomes difficult to make robust read/write heads capable of writing tracks that small. Shingling is the great hope: instead of making the heads smaller, you keep them the same size, but you write and rewrite several tracks in succession as a single operation – writing each track so it overlaps the track next to it. This means your big-ass head might be wide enough to write three tracks at once, but you can still fit multiple tracks in the same space.
For example, suppose you group seven tracks together. You can write them like this:
First pass:
111??????
Second pass:
1222?????
Third (and so on) passes:
12333????
123444???
1234555??
12345666?
123456777
Now if you hadn't used shingling, then in that same space you'd have only fit three tracks:
111222333
But, due to magic, you have seven! Hooray!
Let's not kid ourselves, this is a really good, smart idea. It almost certainly improves reliability (you don't want to shrink disk heads too much or massively increase the number of heads and platters – the latter adds cost as well as increasing the number of things that can fail.) But it does come at the expense of write speeds. Every time you write to a drive like this, the drive has to (in the above case) read seven tracks into memory, patch your changes into that image, and then rewrite the entire thing. It can't just locate the track and sector and write that sector alone, because doing so would overwrite parts of the neighbouring tracks.
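To make the write amplification concrete, here's a toy sketch in Python of a shingled zone. Real SMR firmware is far more sophisticated (zones, persistent caches, translation layers), so treat the class and the numbers as illustrative assumptions only.

```python
# Minimal sketch (not real drive firmware) of why shingled (SMR) writes are
# slow: changing one sector means rewriting the whole shingled zone, because
# each track physically overlaps its neighbours.

class ShingledZone:
    def __init__(self, tracks: int, sectors_per_track: int):
        # The zone is the unit of rewrite: 'tracks' overlapping tracks.
        self.data = [[b"\x00"] * sectors_per_track for _ in range(tracks)]

    def write_sector(self, track: int, sector: int, payload: bytes) -> None:
        # Step 1: read the entire zone into memory (every track, every sector).
        zone_image = [row[:] for row in self.data]
        # Step 2: patch the single sector we actually wanted to change.
        zone_image[track][sector] = payload
        # Step 3: rewrite every track in order, since each write clobbers
        # part of the track next to it.
        self.data = zone_image

zone = ShingledZone(tracks=7, sectors_per_track=1000)
zone.write_sector(track=3, sector=42, payload=b"\x01")
# One logical sector written; 7,000 sectors read and rewritten behind the scenes.
```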
A less controversial technology that has a similar, albeit nowhere near as bad, effect is Advanced Format, often known as 4K. This increases the drive's native sector size from 512 bytes to 4k, essentially eliminating seven inter-sector gaps for every 4k of data, space which can then be used to store data instead. Because operating systems have assumed 512 byte sectors since the mid-1980s, most drives emulate 512 byte sectors by... reading the entire 4k sector into memory, patching in the 512 bytes just written, and writing it back out as needed. This is less of a problem than shingling because (1) operating systems can just natively use 4k sectors if they're available and (2) consecutive sectors usually contain consecutive data, so normally, during a copy, the drive can wait a while before updating a sector and will see an entire 4k sector's worth of data handed to it, meaning it doesn't have to read and patch anything.
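As a small illustration of point (2), here's a hedged sketch of the decision a 512-byte-emulating (“512e”) drive effectively makes on each write: a whole, aligned 4k write can go straight down, anything smaller or misaligned forces a read-modify-write. The function and constants are mine, purely for illustration.

```python
PHYSICAL = 4096   # Advanced Format physical sector size
LOGICAL = 512     # sector size the OS thinks it is talking to

def needs_read_modify_write(start_lba: int, sector_count: int) -> bool:
    """True if a logical write doesn't line up with whole physical sectors."""
    start_byte = start_lba * LOGICAL
    length = sector_count * LOGICAL
    return start_byte % PHYSICAL != 0 or length % PHYSICAL != 0

print(needs_read_modify_write(start_lba=8, sector_count=8))   # False: aligned 4k write
print(needs_read_modify_write(start_lba=3, sector_count=1))   # True: 512-byte write mid-sector
```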
Shingled drives, however, don't have anything like that working in their favour. They end up being very slow on writes, and during a RAID recovery many controllers will time out waiting for a shingled drive and simply mark it as faulty.
Shingled drives are rapidly becoming the default for hard disks. Despite needing more on-board memory, they end up being cheaper per terabyte, for obvious reasons.
Solid State Drives (SSD)
Probably the single biggest innovation of the last 20 years has been the transition to SSDs as the primary storage medium. SSDs have very few disadvantages compared to HDDs. They're faster, so fast that the best way to use an SSD is to wire it directly to the computer's PCIe bus (that's essentially what an “M.2” drive is), they use less power, and they're far more reliable. Their negatives are that they're more expensive per terabyte (though costs are coming down, and are comparable with 2.5” HDDs at the time of writing) and they have a “write limit”. The write limits are becoming less and less of a problem, though much of that is because operating system vendors have become more careful about how SSDs are written to.
SSDs are RAIDable, but there's a price. RAID significantly increases the number of writes each drive sees. Naively you'd expect at least a doubling, purely because of the need for redundancy. That's not an issue with RAID 1, because you also have double the disks to absorb the writes, but for RAID 5 you have at most 1.5x as many disks, and RAID 6 isn't much better, so each individual drive wears faster.
But, aside from RAID 1, it isn't just “double” the writes. Each small write in a RAID 5 or RAID 6 set also requires updating parity: the controller reads the old data and old parity, recomputes, and then writes both the new data and the new parity (RAID 6 has a second parity block to update too). Some of that can be buffered and coalesced, but it still substantially increases the number of physical writes per underlying write.
For HDDs this isn't much of a problem: HDD life is mostly affected by the general environment, and a well-cared-for drive will literally live for decades no matter how much you write to it. But for SSDs, every write takes a chunk out of the drive's lifetime. And given most RAID users would initially start with a similar set of SSDs, that also increases the chance of more than one SSD failing in a short space of time, reducing the chance of recovering a RAID set if one drive fails.
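To put a rough shape on this, here's a hedged sketch of how parity writes eat into an SSD's endurance rating. The 600TBW rating, the 200GB/day workload, and the per-RAID-level write multipliers are illustrative assumptions (they ignore controller caching and the drive's own internal write amplification), not figures for any particular drive.

```python
TBW_RATING = 600          # terabytes-written endurance for a hypothetical SSD
DAILY_WRITES_TB = 0.2     # 200 GB/day of logical writes from the workload

def years_of_life(write_amplification: float) -> float:
    """How long until the endurance rating is used up, at a given amplification."""
    physical_writes_per_day_tb = DAILY_WRITES_TB * write_amplification
    return TBW_RATING / physical_writes_per_day_tb / 365

print(f"Single drive, no RAID:      {years_of_life(1.0):.0f} years")
print(f"RAID 5 (data + parity):     {years_of_life(2.0):.0f} years")
print(f"RAID 6 (data + two parity): {years_of_life(3.0):.0f} years")
```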
Ironically, SSDs should be easier to recover from than HDDs in this set of circumstances because a typical SSD, upon noticing its write count is almost up, will fall back to a read-only mode. So your data will be there, it's just your RAID controller may or may not be able to recover it because it was built for an earlier time.
Preempting a predicted reply to the above: many argue SSDs don't have write-limit problems in practice, because manufacturers keep increasing write limits and they personally haven't had a problem in the five years they've had one in their PC. Leaving aside the issues with anecdotal data, remember that most operating systems have been modified to be more efficient and avoid unnecessary writes, and remember that RAID is inherently not efficient and makes many redundant writes (that's the 'R' in RAID). An SSD in a RAID 5 or RAID 6 set will not behave like the SSD in your desktop PC.
Alternative approaches to high availability
A common misunderstanding about high availability is that it's a component-level concern, rather than an application-level concern. But ultimately, as a user, you're only interested in the following:
- The application needs to be up when I want to use it.
- I don't want to lose my work when using it.
RAID only secures one part of your application stack, in that it makes it less likely your application will fail due to a disk problem. But in reality, any part of the computer the application runs on could fail. It might lose power because of a UPS failure, a prolonged outage, or a PSU fault. The CPU might overheat and burn out, perhaps because of a fan failure. Enterprise-grade servers contain some mitigations to reduce the chance of these failures, but the reality is that there are many failure modes, RAID is designed to deal with just one of them, and it's increasingly bad even at that.
But let's look at the structure of an average application. It serves web pages. All those web pages are generated from content stored in a database. Images and other semi-static assets go in a Minio bucket. And nginx sits in front of all of this as a reverse proxy.
For the application and nginx, you're looking at virtually no file system modifications except when deploying updates. So you could easily create two nginx servers and two application servers (assuming the application tolerates running as multiple instances; most will, and many are designed that way to begin with). You don't actually need RAID for either: if one suffers a disk crash, already unlikely, you just switch over to the back-up server and clone a new one while you wait.
For the database, both MySQL/MariaDB and PostgreSQL, the two major open source databases, support replication. This means you can stand up a second server of the same type and, with a suitable configuration, the two will automatically sync with one another. Should one server suffer a horrific disk crash, you switch over to the back-up and stand up a new server to replace it.
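As a flavour of what keeping an eye on the back-up involves, here's a hedged Python sketch of a health check against a PostgreSQL hot standby, using psycopg2 and Postgres's built-in replication functions. The host name and lag threshold are invented for the example; whether you fail over automatically or by hand is a separate decision.

```python
import psycopg2

REPLICA_DSN = "host=replica.example.internal dbname=postgres user=monitor"
MAX_LAG_BYTES = 16 * 1024 * 1024   # complain if the standby is >16 MB behind

def replica_is_healthy() -> bool:
    with psycopg2.connect(REPLICA_DSN) as conn, conn.cursor() as cur:
        # The standby should report that it is in recovery (i.e. replicating).
        cur.execute("SELECT pg_is_in_recovery()")
        if not cur.fetchone()[0]:
            return False
        # How far does the WAL we've replayed trail the WAL we've received?
        cur.execute("""
            SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                                   pg_last_wal_replay_lsn())
        """)
        lag = cur.fetchone()[0]
        if lag is None:   # no streaming connection at all
            return False
        return lag <= MAX_LAG_BYTES

if __name__ == "__main__":
    print("standby OK" if replica_is_healthy() else "standby needs attention")
```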
For your assets server, you can also do replication just as you can with the database.
One huge advantage of this approach over trying to make your server hardware bulletproof is that you don't have to have everything in the same place. Your Minio server can be a $1/month VPS somewhere. Your PostgreSQL primary and replica can be on different PCs in your house. Your clones of your nginx and application servers can be on Azure and Amazon respectively. You could even have your primary servers running from unRAIDed SSDs, and your secondary servers on a big slow box running classic HDDs in a RAIDed configuration.
This is how things should work. But there's a problem: some applications just don't play well in this environment. There are very few email servers, for example, that are happy storing emails in databases or blob storage. Wordpress needs a lot of work to get it to support blob storage for media assets, though it's not impossible to configure it to do so if you know what you're doing.
But if you do this properly, RAID isn't just unnecessary; it actively complicates your system and makes it more expensive.
Final thoughts
This is a summary of the above. You can Ctrl-F for the justifications if you skimmed and don't understand why I've come to the conclusions I have, but I came to them, so suck it up:
- RAID is viable right now, but within the next 5-10 years it seems likely to become mostly obsolete, and a danger to anyone who uses it blindly rather than limiting it to the specific instances where it's necessary, favoring large numbers of small drives in a RAID 1 configuration for those instances.
- RAID is still useful if you're using it to create redundant, highly available storage out of relatively small disks. I've seen numbers as low as 2TB quoted as the ceiling for RAID 5, but I'm sure you can go a little bigger than that.
- Other than RAID 1, RAID with SSDs is probably an unwise approach.
- Going forward, you should be choosing applications that rarely access the file system, preferring instead to use storage servers, like databases and blob storage APIs, that in turn support replication. If you have to run some legacy application that doesn't understand the need to do this, then use RAID, but limit what you use it for. (If it helps, a Raspberry Pi 5 with 4GB of RAM and a suitably large amount of storage can probably manage MariaDB, PostgreSQL, and Minio replication servers. None of these is heavy on RAM.)
- Moving away from RAID, forced or not, means you must, again, look at your back-up strategy. From a home user's point of view: consider adding a hotswap drive slot to your server and an automated process that mounts that drive, copies stuff to it, and unmounts it every night, something like the sketch below. You probably have a ton of unused SATA HDDs anyway, right?
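For the curious, a minimal sketch of that nightly job in Python, assuming it runs from cron with root privileges. The device label, mount point, and source directories are placeholders; swap in whatever your setup actually uses.

```python
import subprocess
import sys

DEVICE = "/dev/disk/by-label/nightly-backup"   # the hotswap drive
MOUNTPOINT = "/mnt/nightly-backup"
SOURCES = ["/home", "/var/lib/postgresql"]      # whatever you care about

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def main() -> None:
    run(["mount", DEVICE, MOUNTPOINT])
    try:
        for src in SOURCES:
            # --archive keeps permissions/timestamps; --delete mirrors removals,
            # which is exactly why you should add rotation or snapshots on top
            # if you care about file history.
            run(["rsync", "--archive", "--delete", src, MOUNTPOINT])
    finally:
        run(["umount", MOUNTPOINT])

if __name__ == "__main__":
    try:
        main()
    except subprocess.CalledProcessError as err:
        sys.exit(f"backup failed: {err}")
```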