A 12x drive raid60 is overkill, not efficient...
Your opinion.
In my case, the cost of downtime or data loss is very high. Buying 72TB raw to get a usable, reliable 48TB is not inefficient - it's smart. It might be overkill if your array has 128 GB SSDs, but with 6 TB spinners the bit error rate issues are not theoretical.
A twelve drive RAID-5 set would be ludicrous under any circumstances - although you'd get 66TB usable.
Twelve drive RAID-50 would give 60TB, but a second failure in the same group during rebuild leaves you with 0TB.
Twelve drive RAID-60 gives you 48TB, and the ability to withstand any two drive failures (up to four, if they fall two per parity group).
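The capacity arithmetic above is easy to check with a short sketch (`usable_tb` is a hypothetical helper; `groups` is the number of striped parity sub-arrays, so RAID-50/60 on twelve drives is two groups of six):

```python
def usable_tb(drives, drive_tb, level, groups=1):
    """Usable capacity for common parity RAID levels.

    level: 5 or 6 (one or two parity drives per group).
    groups: number of striped sub-arrays (RAID-50/60 stripe
    across multiple groups; groups=1 is plain RAID-5/6).
    """
    per_group = drives // groups
    parity = 1 if level == 5 else 2
    return (per_group - parity) * drive_tb * groups

# 12 x 6 TB drives:
print(usable_tb(12, 6, 5))            # RAID-5  -> 66
print(usable_tb(12, 6, 5, groups=2))  # RAID-50 -> 60
print(usable_tb(12, 6, 6, groups=2))  # RAID-60 -> 48
```

So the 48TB figure is simply 72TB raw minus four parity drives' worth of capacity.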
...and with a SAS system, there is no way you can expand the existing RAID.
That's funny, because yesterday I added two drives to a RAID-60 set - it not only expanded the array, but it did it online while the system was in use. You should look into better SAS controllers.
This afternoon (multi-TB online reconfiguration operations often take a while) my logged-in users noticed that there was a lot more free space on the work drive. Completely online - no reboots, no dismounts, no logouts.
One minute, the drive shows 100GB free. A few minutes later, the same command shows 2.5TB free. It just works.
So whether RAID5 or 6, check the volume - it would be a lot less headache if you do.
We're definitely on the same page here.
A huge percentage of "RAID failure" events are "the humans didn't notice that one drive had failed, and lost the array when the second drive failed".
My controllers perform a background full volume scan (read and verify every data and parity chunk) every 72 to 168 hours. Not only that, but they also continuously monitor S.M.A.R.T. data and if a drive enters "failure predicted" mode the controller will:
- copy the "soon to fail" drive to a hot spare
- Note that this is not a "rebuild" - it's a simple sequential copy of the suspect drive to the hot spare. If a chunk can't be read, that one chunk is rebuilt from the other drives, and the sequential copy proceeds.
- put the hot spare into the array as a full member, removing the suspect drive
- put the suspect drive on the "unusable" list
- send an email requesting service, with serial numbers, failure codes, slot numbers, etc.
I forward the email to HP, and a new drive arrives in a day or two. I have the tech remove the drive with the yellow light and insert the new drive. Back to fully redundant and spared service (actually, at no point in the process was redundancy lost).
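The "copy, don't rebuild" step can be sketched as follows (purely illustrative: the suspect drive is modeled as a list of chunks with `None` marking a read error, and `rebuild_chunk` is a hypothetical stand-in for parity reconstruction from the other members):

```python
def copy_with_repair(suspect, rebuild_chunk):
    """Sequential copy of a suspect drive to a hot spare.

    Any chunk that fails to read (None) is rebuilt from the
    remaining array members via rebuild_chunk(i); every other
    chunk is copied straight across, so redundancy is never lost.
    """
    spare = []
    for i, chunk in enumerate(suspect):
        spare.append(chunk if chunk is not None else rebuild_chunk(i))
    return spare

# A drive with one unreadable chunk (index 2):
print(copy_with_repair(["a", "b", None, "d"], lambda i: f"rebuilt{i}"))
# -> ['a', 'b', 'rebuilt2', 'd']
```

The point of the sequential copy is speed: only the chunks that actually fail pay the cost of a parity rebuild.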
Depending on the array, either the event is over - or the controller will copy the data from the "hot spare" to the newly inserted drive.
- If the array is homogeneous, then it's over.
- A global hot spare can be used for any failing drive - so a 1.2 TB spinner might be used to replace a failing 200GB SSD. You'd want that array to revert to the new 200GB SSD and put the 1.2 TB spinner back as hot spare.
- Some setups have different bandwidth subscription models, in which case you might want to restore the original topology. (I have a number of controllers where the first 24 drives have direct SAS channels (6/12 Gbps per disk, 144/288 Gbps aggregate), and the next 175 drives share a 4-lane daisy chain with 24/48 Gbps. If you've put an array in the first group for performance, you probably don't want it using a disk on the daisy chain.)
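The copy-back decision in the three cases above can be summarized like this (all names are hypothetical; drives are modeled as `(media_type, capacity_tb, fast_zone)` tuples, where `fast_zone` means a direct SAS channel rather than the daisy chain):

```python
def copy_back_needed(spare, failed, fast_zone_array=False):
    """Decide whether data should migrate from the global hot spare
    to the freshly inserted replacement drive (illustrative model).

    spare / failed: (media_type, capacity_tb, fast_zone) tuples.
    """
    if spare == failed:
        return False          # homogeneous array: the spare simply stays in
    if spare[:2] != failed[:2]:
        return True           # wrong media type or size: restore topology
    if fast_zone_array and not spare[2]:
        return True           # array needs direct channels, spare is on the chain
    return False

# A 1.2 TB spinner standing in for a failed 200 GB SSD: copy back.
print(copy_back_needed(("hdd", 1.2, True), ("ssd", 0.2, True)))  # -> True
```

Either way the event requires no urgency from the humans - the array is fully redundant throughout.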