RAID Disconnect/HD Problem

Discussion in 'Mac Pro' started by kimberlysd, Aug 8, 2010.

  1. kimberlysd macrumors newbie

    Joined:
    May 24, 2010
    #1
    I have a 2007 Mac Pro with all 4 hard drive bays in use, 2 of them set up together as a mirrored RAID. I was using it as normal when I suddenly got that "A device has been disconnected..." dialog box. When I looked at my list of drives, the mirrored RAID drives were no longer displayed. They weren't showing up in Disk Utility either, so I tried restarting. The computer wouldn't boot at all; it just froze on the initial white screen. Listening to the computer, I heard a clicking sound, so I thought maybe one of the drives in the RAID had just bitten the dust.

    I opened up the computer, took one drive out and left the other in, and the computer still would not boot. I put that one back in and took out the other, and still the computer wouldn't boot. With either drive installed I heard the same clicking sound. I was only able to get the computer to boot again by leaving both drives out.

    So now I'm not sure where to go from here. I find it difficult to believe that both hard drives died at the exact same moment. Are there any suggestions as to how I can determine if the drives are actually dead or if it's some sort of problem with the computer or the RAID?
     
  2. Ryft macrumors newbie

    Joined:
    Aug 18, 2008
    Location:
    Chicago, IL
    #2
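    Are you using new Western Digital drives? I ask because WD recently removed the TLER feature from their non-RE/Enterprise edition HDDs. TLER (Time-Limited Error Recovery), to put it simply, caps how long a drive spends retrying a bad sector, so a RAID controller doesn't give up on the drive and drop it from the array.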

    or maybe that's not your problem at all :rolleyes:
     
  3. kimberlysd thread starter macrumors newbie

    Joined:
    May 24, 2010
    #3
    I am using WD, but I've had this RAID and been using it for about 2 years or so, so the drives and the RAID aren't new.
     
  4. nanofrog macrumors G4

    Joined:
    May 6, 2008
    #4
    A bad disk is the likely culprit, particularly given the clicking sound.

    That said, the inability to boot with one or both disks still in the system is really weird.

    So I've a few questions:
    • What version of OS X?
    • Which model MP?
    • What are the disks used in the mirror (WD's P/N)?
    • Where is the boot disk located (HDD bay or off of an ODD_SATA port / optical bay)?

    I ask, as 10.6.3 caused problems with RAID (both Disk Utility based arrays and true hardware controllers). Most got around it by rolling their systems back to 10.6.2, and then upgraded to 10.6.4. But AFAIK, they only went as far as making sure the array/s were visible to the system, not testing out fault conditions (some issue may have carried over).

    Please understand that 2x disks can fail at the same time if they were bought at the same time from the same vendor. The reason is, you end up with disks from the same batch (sequential or nearly sequential serial numbers). So if one has a defect and fails, the other is likely to soon follow. This also tends to happen with sufficient age. Two years is a little short IMO, but consumer models are really only meant to last ~3 years from most vendors these days anyway.

    I also hope you've a backup in place, as RAID, no matter the level (0/1/10/5/6/50/60; there's others too, but they're less common) or how it's implemented (software or hardware), isn't a replacement for a backup system (data can still be lost).

    But I do expect the fault itself is the result of dead disk/s. The boot issue isn't so clear, as the OS isn't on the array from what you've posted (I'm assuming there aren't 2x OS X entries in the Boot Loader).
     
  5. kimberlysd thread starter macrumors newbie

    Joined:
    May 24, 2010
    #5
    My Mac Pro is from spring 2007, 2.66 GHz, running 10.6.3. I'm going to try upgrading and see if that helps, but like you say, I suspect the disks must be dead.

    My OS is not on the affected drives; it's on an internal drive (the RAID disks were internal as well). The two disks not in the RAID are my startup disk and a clone of that disk. I tried booting off the clone in case something got corrupted, but it still would not boot with the RAID disks installed.

    The drives are WD7500AAKS-00RBA0 and I bought them together, same vendor. It looks like I underestimated their age, as the label says Dec 2007, so I likely got them around then and it seems like their natural lifespan may be up.

    I was under the impression that two disks failing at the exact same second was unlikely, barring any outside destructive influence, even if they were close together in serial number. Do you think it's possible that one of the drives may have been dead for a few days or so, but I didn't notice because the other disk was compensating and masking the problem? Would a software RAID set up in Disk Utility send an alert message if one of the disks failed, or would it just ignore it?

    I do have a somewhat complete backup of the RAID (made a month ago). The data that would be missing isn't very important, but it'd be nice to have, so I'd like to eliminate all possibility that one drive might be OK. Would the best way to do this be to get an external enclosure and see if the computer will recognize the drive, or say anything about it, so that I don't have to boot with it installed internally?

    I'm really puzzled by the boot issue myself. Despite the ominous clicking, it's the only thing leaving me somewhat hopeful the problem might lie in the RAID setup, because otherwise I can't understand why the computer would get so stuck on booting with those disks installed that it won't even get as far as an option to boot into safe mode or choose a different startup disk. (And I did manually select the second bootable clone disk and start up with the drives installed, and that system, which is a day behind my usual startup disk, would also get stuck on the white screen with the Apple logo.) If one normally tried to boot a computer with a bad drive in it, as long as it wasn't the startup disk, wouldn't the computer just ignore it as nonessential? That would seem the smart thing to do.
     
  6. nanofrog macrumors G4

    Joined:
    May 6, 2008
    #6
    Hopefully, this is all that's wrong with the ability to boot. Past that, new disk/s would solve the problem of a working mirror (just hope you've a backup if both drives are shot).

    I realized that the OS install wasn't on the mirror, and the only thing I can think of that would cause this, was an Update (10.6.3).

    Either try 10.6.4, or roll it back to 10.6.2 (particularly if the problem persists in 10.6.4). That's presuming the SATA controller will behave properly with 10.6.3 gone (I hope it doesn't leave a remnant that keeps interfering with the system).

    For RAID use, it's best to get enterprise disks. In the case of WD's models, you can run the TLER utility and change the timings to OFF = 0,0.

    Just don't buy them from the same vendor if possible (solves the issue of sequential serial numbers in case there's a bad batch).

    It's quite possible you didn't notice the array was running in a degraded state, particularly if OS X doesn't give any warnings that a fault has occurred.

    For other levels, such as parity based arrays, you notice a performance drop when the array is running in a degraded state. Not so much on a mirror.
    Stripe sets just fail, and it's blatantly obvious.
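
    To make the difference between levels concrete, here's a minimal Python sketch. It assumes a simplified model in which each level tolerates a fixed number of failed members; the `array_state` helper and the level names are hypothetical and not taken from Disk Utility or any controller's software. Nested levels (10/50/60) are left out, since their tolerance depends on which members fail.

```python
# Hypothetical, simplified model: number of failed members each level survives.
FAULT_TOLERANCE = {
    "stripe": 0,  # RAID 0: any single failure loses the set
    "raid5": 1,   # single parity
    "raid6": 2,   # dual parity
}


def array_state(level: str, members: int, failed: int) -> str:
    """Classify an array as 'online', 'degraded', or 'failed'."""
    if members < 1 or not 0 <= failed <= members:
        raise ValueError("failed must be between 0 and members")
    if level == "mirror":
        # A mirror keeps working as long as one full copy remains.
        tolerance = members - 1
    else:
        tolerance = FAULT_TOLERANCE[level]
    if failed == 0:
        return "online"
    if failed <= tolerance:
        return "degraded"
    return "failed"
```

    Under this model a 2-disk mirror with one dead member is merely "degraded" (and easy to miss), while the same loss on a stripe set is "failed" outright.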

    I've never used Disk Utility to create RAID sets, so I don't know its error reporting capabilities (I stick to proper hardware controllers, and mine are set up to send error reports via email). Even with those, there's no screen that pops up and says a disk has failed and the array is running in a degraded state.

    I can also look in the control panel for it (access the card's IP address via the browser for Areca cards), but that's a manual check (if I'm nosey, and look at the error logs/condition of the array/s). So I depend on email notification to let me know immediately (I keep email open). Software implementations don't do email notification.
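
    For a software array, something similar could be scripted by hand. The following is only a sketch using Python's standard library: the array name, the addresses, and the local SMTP host are assumptions, and how the script learns the array's state is left to the caller (nothing here queries Disk Utility).

```python
import smtplib
from email.message import EmailMessage


def build_alert(array_name: str, state: str,
                sender: str, recipient: str) -> EmailMessage:
    """Compose a plain-text alert about an array's state."""
    msg = EmailMessage()
    msg["Subject"] = f"RAID alert: array '{array_name}' is {state}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"The array '{array_name}' reported state: {state}.\n"
        "Check the member disks and verify the backup is current.\n"
    )
    return msg


def send_alert(msg: EmailMessage, smtp_host: str = "localhost") -> None:
    """Send the alert through an SMTP server (host is an assumption)."""
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
```

    Run from a scheduled job, `send_alert(build_alert("Media", "degraded", ...))` would give roughly the kind of immediate notice a hardware controller's email reporting provides.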

    You can do this, but first you'll have to un-create the array as mentioned before, to be sure you can attempt to access the disks (will let you know if one or both are shot).

    You'd lose access to that particular disk, but the others on the controller/s should still be available. If the OS is on the disk in question, No. The entire system will stall out the instant the OS disk is accessed for any reason (recovery method keeps trying indefinitely; no timeout).
     
  7. kimberlysd thread starter macrumors newbie

    Joined:
    May 24, 2010
    #7
    Unfortunately, the upgrade to 10.6.4 did not solve the problem. The computer still will not boot with the RAID disks installed. I'll have to try going back to 10.6.2.

    I'm not sure if you'd know, since you say you're not very familiar with Disk Utility's handling of RAID, but would it be possible to do this if the computer won't recognize/mount the drive? Or is the fact that it doesn't recognize the drive a sign that the drive itself is beyond hope? When I initially got the "devices have been disconnected" dialog box, I opened Disk Utility, and it didn't display either of the two disks, so I couldn't even un-create the array. Would trying it with a computer that never knew the disk was in a RAID help, or does Disk Utility leave some sign on the disk that would tell another Mac it was supposed to be part of a mirrored array?

    This seems to be what's happening, except I have NO operating system installed on those disks. The computer does seem to be trying to access those drives, because I can't think of why the clicking sound would be consistent and persistent unless the computer was continually trying to access them and stalling startup when it can't. The only things on those disks are various media files, no system files whatsoever, nothing that I can think of that my computer would require access to in order to function... So I'm still not sure why those two drives would be causing this boot issue, even if they are both dead.

    Thanks for your help!
     
  8. nanofrog macrumors G4

    Joined:
    May 6, 2008
    #8
    Ouch. :(

    I hope it works, but I've a feeling that there's a remnant (whatever's causing the problem) that's been left over, and continuing to roll it back may not solve it. You may actually have to pull the set, and make a fresh OS X installation to try and clean up the problem (i.e. 10.6.0), and test with the disks. If it works (boots rather than remains stalled out), proceed to 10.6.2, etc. Just pull the set each time you update (could cause a problem with the update installations).

    And this is the best case scenario. :eek: Read on, the realistic outlook isn't great I'm afraid.

    I see what you're getting at. Try it with one disk at a time, and see if that gets you anywhere (worth a shot at this point). But with the set's current status (ignoring the boot issue ATM), you may not be able to do anything with the disks except attach them individually and attempt to reformat (better yet, run a diagnostic, such as what's offered by WD - then run a scan). Hopefully, you may be able to do a low level format (each disk maker's low level formatting is proprietary, and the software is free from WD; but you'd need to run it from another OS/boot environment, such as DOS, Linux, or Windows - just make sure to use the correct version for the boot environment).

    At this point, this is the best you can hope for IMO, given the current state of those disks (data will be lost with reformatting; HFS/+ or low level). So just run the scan first (before any reformatting, to see if any data is intact, as your backup isn't current). Otherwise that un-backed-up data is gone.... :eek:

    Sorry, but I don't think there's too much hope for the data at this point. :(

    I know. You've definitely got a mess here.

    The array should become invisible, but the system should still boot. But as both disks are on the ICH, the array is locking up the system and preventing a boot. What I don't know is if this is the normal behavior under OS X's software implementation, or if it's a fault in the OS. But as the OS isn't even loading, the controller seems to not be able to handle it at all (the disks won't pass during boot, and lock up the controller). Removing them is the only way to boot, and OS X isn't capable of Hot Swapping, and the MP isn't capable of Hot Plugging either (power aspect of removing/adding drives, as it needs an Inrush Current Limiter that Apple didn't include in the MP).

    Ultimately, this is why you need to replace the disks. DATA may be gone. Trying the current disks individually is the only way to even attempt to recover any data (hope one of them is good, but the information you've posted indicates they're both shot = data's unrecoverable).
     
  9. kimberlysd thread starter macrumors newbie

    Joined:
    May 24, 2010
    #9
    Thanks for your continued help. I've searched my console logs. From what you can see, do you think one of the disks might be OK, or am I misreading it? Does disk4 refer to the set as a whole and not an individual disk? I've searched the console for the string of letters and numbers of the member mentioned, but those are the only instances in the log. http://i36.tinypic.com/10nupfo.jpg
     
  10. nanofrog macrumors G4

    Joined:
    May 6, 2008
    #10
    Thanks for the shot of the log file.

    Is disk 4 one of the disks of the array (or what the entire array is marked as)? Or something else?

    I ask, as the first 2x lines are recovery entries, and each has a different ID (member A.....), both part of an array called "Media". Then it's followed by a Restart, and then the disk 4 error pops up (I read disk 4 as the disk # for the set). Assuming this is the case, this indicates both disks are shot.

    With an array, multiple disks are connected into a logical volume (think MBRs strung together as a set). As OS X uses GPT, there's another layer, which is the set you labeled as "Media" (where your data actually is). This is what OS X would call the volume.

    You can check out the GPT wiki if you're interested.
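
    If you want to see whether a disk still carries a readable GPT header once it's attached on its own, here's a rough Python sketch. It assumes 512-byte sectors and the standard 92-byte header layout at LBA 1; the helper names are hypothetical, and reading a raw device typically needs admin rights. It only inspects the partition table header, so it says nothing about whether the RAID metadata or your files are intact.

```python
import struct
import zlib

GPT_SIGNATURE = b"EFI PART"
# Little-endian layout of the 92-byte GPT header.
HEADER_FORMAT = "<8sIIIIQQQQ16sQIII"
FIELD_NAMES = (
    "signature", "revision", "header_size", "header_crc32", "reserved",
    "current_lba", "backup_lba", "first_usable_lba", "last_usable_lba",
    "disk_guid", "partition_entries_lba", "num_partition_entries",
    "partition_entry_size", "partition_array_crc32",
)


def parse_gpt_header(sector: bytes) -> dict:
    """Parse the GPT header from the raw bytes of LBA 1."""
    min_size = struct.calcsize(HEADER_FORMAT)
    if len(sector) < min_size:
        raise ValueError("sector too short for a GPT header")
    info = dict(zip(FIELD_NAMES,
                    struct.unpack(HEADER_FORMAT, sector[:min_size])))
    if info["signature"] != GPT_SIGNATURE:
        raise ValueError("no GPT signature found")
    size = info["header_size"]
    if not min_size <= size <= len(sector):
        raise ValueError("implausible GPT header size")
    # The header CRC is computed with the CRC field itself zeroed.
    zeroed = sector[:16] + b"\x00" * 4 + sector[20:size]
    info["crc_ok"] = (zlib.crc32(zeroed) & 0xFFFFFFFF) == info["header_crc32"]
    return info


def read_lba1(device_path: str, sector_size: int = 512) -> bytes:
    """Read LBA 1 from a raw device or disk image (path is an assumption)."""
    with open(device_path, "rb") as dev:
        dev.seek(sector_size)
        return dev.read(sector_size)
```

    A missing signature or a failed CRC on both members would fit the "both disks are shot" reading; a clean header on one of them would at least suggest that disk is still readable.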
     
