Although I suspected my storage server's power supply wasn't up to the task, I went ahead and loaded the last three drives anyway, for a total of thirteen live drives, two hot spares, and one cold spare.
The installation went smoothly, so I started the migration process and was just about to stop worrying about it. During the migration, however, one drive developed power problems and ended up disconnecting both itself and a second drive from the array.
After several hundred error messages I finally pulled the bad drive, but the damage was done - I now had a RAID6 volume in the middle of a migration with two drives missing. The 3ware manual is silent on what happens in this case; it only says that a migration cannot be aborted.
Kudos to the 3ware engineers, though. The 9650SE not only completed the migration on the degraded array, it then immediately picked up the two hot spares and began a full rebuild. Hopefully no more drives will go down in the next 36 hours, which should be enough time to finish rebuilding. (Hey, it's a nine-terabyte array; it takes a while.)
I have no doubt that if a third drive had failed during the migration, or fails during the rebuild, the array will be gone. That's a limitation of RAID, though, not of the controller.
This isn't the first time the RAID6 controller has saved my ass. Thanks to Seagate's ineptitude and Intel's inadequate understanding, I built my entire original array from ST31000340AS 1 TB 7200.11 drives that were all potential bricks. In the process of updating them all, I booted the server with a new 'fixed' drive, only to have a second drive brick on boot. That left me two drives down, which would have taken my old RAID5 with it. Fortunately, the RAID6 kept working and got it all sorted out.
A word to the wise, though - when rebuilding, set the controller to do it as fast as possible. If I had given background tasks a higher priority, the rebuild would have finished before the drive errors caused any trouble.
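For what it's worth, the background task rate on these controllers can be changed from the command line with tw_cli rather than through the 3DM2 web interface. Here's a minimal sketch of how I'd script it - it assumes the controller shows up as /c0, and that my recollection of the rebuild= setting (1 = fastest rebuild, 5 = slowest) is right, so check the tw_cli manual for your firmware before trusting it:

import subprocess

# Assumptions: tw_cli is installed and the controller is /c0 (check `tw_cli show`).
# My understanding is that the 9000-series rebuild rate runs from 1 (rebuild gets
# priority over host I/O) to 5 (host I/O gets priority) - verify against the manual.
CONTROLLER = "/c0"  # assumed controller ID

def set_rebuild_rate(rate: int) -> None:
    """Ask tw_cli to re-prioritise background rebuilds (1 = fastest .. 5 = slowest)."""
    if not 1 <= rate <= 5:
        raise ValueError("rebuild rate must be between 1 and 5")
    subprocess.run(["tw_cli", CONTROLLER, "set", f"rebuild={rate}"], check=True)

if __name__ == "__main__":
    set_rebuild_rate(1)  # crank rebuild priority up before starting a risky rebuild

Something like this could be dropped into a cron job or run by hand before swapping a drive, then set back to a politer value once the rebuild finishes.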
So far very few of the original Seagate drives are left in my array - three of the eight, to be exact. The failures all came in under five years, meaning that nearly two-thirds of them didn't even survive their warranty period. Hardly an impressive statistic, even leaving aside the firmware problems that caused so much grief.
Returning them to Seagate is uneconomic and incredibly burdensome, so I just end up replacing them and buying the over-the-counter warranty at Memory Express to avoid future hassle. I'm also mixing different brands and models into the array, to avoid unnecessary exposure to any one future bug.
Despite several attempts, fate has yet to permanently disable my server, thanks in no small part to the features and engineering of the 9650SE. Let's hope that 3ware is able to continue this tradition of fine products under their new ownership.
I do still think that my PSU is not quite up to the load, though. At the time, getting a redundant unit seemed more important than overall wattage, and I think MemX did their best, but if they had any inkling that the box wasn't going to support my planned 16 drives, they kept it to themselves. This puts me between a rock and a hard place on my next build - do I go for a single high-output supply, and get rock-solid hot-swap with headroom for failed drives, but risk the entire box going down the day that one supply dies? Or do I stay redundant and take the risk of power-line sags affecting the array?
A PSU failure on the server could probably be fixed in four hours, and with adequate backups the entire array could be brought back within a day or two. But that really undermines the original vision of a bulletproof, always-on server. Then again, not being able to hot-swap drives reliably, and constantly worrying about dropouts from marginal power, isn't comforting either.
I should probably go rack-mount with high-wattage redundant supplies, but I don't want to make that kind of investment yet. I guess I'll stick with what I have and see where that leads.