Some of you may have noticed that over the last few months the whole machine would sometimes crash and become unresponsive. After banging my head against the wall, I finally narrowed it down to bad hard drives.
I sent new drives to the colo place to replace the bad ones. Yesterday we attempted the replacement, but that attempt failed.
Following is the transcript of this morning's second attempt. This got the adrenaline pumping. We were 2 inches away from being server-less for at least a week.
Hard Drive replacement part II
Hello, please see ticket 24592 for reference.
I think the issue with replacing the hard drive is that I did not let the machine notice that one of the volumes was gone, and therefore it was not expecting a rebuild.
Anyway, I'd like to try this one more time tomorrow morning @ 6am EST. I will make a slight modification to the instructions:
Hello, here are the instructions for the hard drive replacement.
My machine has 3 drives in a RAID 5 setup. Two of the drives are no good and need to be replaced. The bad drives are model WD1600YS; the good drive is model WD1600YD.
So I have sent two WD1600YD drives to replace the WD1600YS drives.
To log into the machine using RDP, you must use a non-standard port:
xx.xx.xx.xx
First, log into the machine, and do a controlled shut down.
Please ensure there are no currently reported errors with the RAID subsystem before shutting down. If there are, the icon in the lower-right system tray will show a message.
The drives are not hot-swappable, so the machine will need to be opened. You must replace only ONE DRIVE AT A TIME, since there are only three drives in the RAID 5 array. So:
1. Open the machine and disconnect the cable from one of the YS drives so the machine does not see it.
2. Boot into Windows. In the icon tray, the Intel Matrix console should report that a volume is missing and the RAID is degraded.
3. Once that is verified, shut the machine down once more and replace that YS drive with a YD drive.
4. Boot up into Windows. If you cannot boot and get the same error as you did in the other ticket, put the YS drive back and abandon the attempt.
If you boot and can log in, you should see the RAID rebuilding in the icon tray.
5. YOU MUST WAIT FOR THE RAID TO REBUILD BEFORE YOU CAN SWAP OUT THE OTHER DRIVE. I think it will take about an hour.
6. Once the RAID is fully rebuilt and the server is running normally again, do a normal shutdown again.
7. As above, disconnect the cable from the remaining YS drive and boot into Windows. Once again, you should see that a volume is missing and the RAID is degraded.
8. Once that is verified, shut down again and replace the YS drive with a YD drive.
9. Boot up, and you should see the volume rebuilding. At this point, the operation is a success.
At that point, you can close the machine back up, and the array will rebuild itself over time.
Please confirm you have received these instructions and they are acceptable.
Thank you very much for your support.
Rich
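The one-drive-at-a-time rule in these instructions follows from how RAID 5 parity works: each stripe holds data blocks plus one XOR parity block, so any single missing member can be recomputed from the survivors, but two missing members cannot. That two-failure case is exactly what happened later in the transcript when the remaining YS drive dropped out of the already degraded array. The Python below is only a simplified sketch under stated assumptions: parity sits in a fixed position and the "blocks" are tiny byte strings, whereas real controllers (including the Intel one here) rotate parity across drives and work on far larger blocks. The function names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """Build one simplified RAID 5 stripe: the data blocks plus one XOR parity block.

    Assumption for illustration: parity is always the last member.
    Real arrays rotate the parity position across drives."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def rebuild(stripe):
    """Recover a stripe whose missing members are marked as None.

    XOR parity carries only enough information to recompute ONE lost member,
    so a stripe with two or more missing members cannot be rebuilt."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError(f"{len(missing)} members missing; RAID 5 can rebuild only one")
    if not missing:
        return list(stripe)
    survivors = [block for block in stripe if block is not None]
    rebuilt = list(stripe)
    # The XOR of all surviving members equals the missing one (data or parity).
    rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt

# Three hypothetical "drives": two data members and one parity member.
stripe = make_stripe([b"DATA", b"MORE"])

# One drive pulled (degraded array): the rebuild succeeds.
print(rebuild([stripe[0], None, stripe[2]]) == stripe)

# Two drives gone at once: there is not enough information left to rebuild.
try:
    rebuild([stripe[0], None, None])
except ValueError as err:
    print("rebuild failed:", err)
```

Swapping drives one at a time and waiting for each rebuild keeps the array at no more than one missing member, which is the only state this kind of parity can recover from.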
-------------
User/Staff Follow-ups
marc
04/22/2007 02:20 PM
We will be prepared for this task tomorrow morning.
-----------
Justin
04/23/2007 05:01 AM
We are proceeding at this time.
-----------
Justin
04/23/2007 05:29 AM
Steps 1-4 of the operation were successful. The RAID is being rebuilt onto the new YD drive.
-----------
Justin
04/23/2007 07:03 AM
The Remote Desktop session died when rebuilding was approx. 94% complete. I have not been able to re-establish the session and video-output is still a no-go. Unfortunately there is no way to know what state the server is in and if it is safe to reboot. Please advise how you wish to proceed.
-----------
richard.
04/23/2007 07:28 AM
Doesn't sound like we have a choice. Go ahead and reboot.
-----------
Dave
04/23/2007 07:43 AM
The server has been rebooted but doesn't appear to be responding to requests. I am looking into this further.
-----------
richard.
04/23/2007 07:50 AM
You can hook up video and a keyboard for the DOS bootup and see if there were any issues. The RAID console might give you feedback at that point.
If worst comes to worst, you can remove the YD drive you just put in and see if that gets it moving again.
Apparently this procedure was not meant to be.
-----------
Dave
04/23/2007 07:58 AM
Upon further investigation it appears as though the remaining YS drive has failed and has dropped out of the already degraded array. Windows attempts to boot and bluescreens shortly afterwards referencing a bad boot device.
Would you like us to attempt installing the original YS drive we removed and see if it can boot?
-----------
richard.
04/23/2007 08:00 AM
Yes, please reconnect the original YS drive.
-----------
Dave
04/23/2007 08:19 AM
I tried putting the original YS drive in the server, but it failed to boot due to a completely "Failed" array. I then reinstalled the YD drive we put in this morning; it booted and ran chkdsk, then rebooted again and is now at a login prompt and accessible via RDP.
If you haven't already, I strongly suggest you use this opportunity to back up any critical data, as the server is in a very unstable state.
Note, the rebuild has restarted and is at 9%.
-----------
richard.
04/23/2007 08:39 AM
Thank you for your diligence, Dave. I have critical files backed up.
If by some miracle this piece is successful, I would like to continue with installing the final YD drive.
-----------
Dave
04/23/2007 08:40 AM
If this rebuild completes we will continue with the next swap.
Note, it's currently at 51%.
-----------
Dave
04/23/2007 10:17 AM
The array successfully rebuilt. We are performing the final drive swap now.
-----------
Dave
04/23/2007 10:57 AM
I have installed the last drive, and the array is currently being rebuilt and is at 30%.
Once the array is optimal we will take the server offline one more time to properly rack the equipment.
-----------
richard.
04/23/2007 10:58 AM
Thanks for your help, Dave! That one was a nail-biter!
-----------
Dave
04/23/2007 12:28 PM
The array has again been rebuilt correctly. I am powering off your server so we can rerack it for you.
-----------
Dave
04/23/2007 12:45 PM
The server has been racked and is online and responding to requests via RDP. The RAID array also appears to be in an optimal state.
How would you like for us to handle the failed drives?
-------
Tier 1 - 1.25
-----------
WHEW!
Now let's hope this really does fix the issue!
Rich (TW)
What a ride!
Skimming that, I'm glad I don't have your job. You deserve a trophy or... pie... or something.
Quick! Go buy a lottery ticket! Your luck is obviously on a run. :clown:
Snaggle.
::tha-THUD!!!:: <---- the sound of my heart restarting!!!!
Holy cow! I think I got a few more gray hairs just reading that!!
The phoenix has risen and we can SPAM on!! :cheers:
KKat

Karma...a term that comprises the entire cycle of cause & effect...
Kat...a supercilious quadrupedal pile of fur that doesn´t give a flying fig for Karma...
The Remote Desktop session died when rebuilding was approx. 94% complete. I have not been able to re-establish the session and video-output is still a no-go. Unfortunately there is no way to know what state the server is in and if it is safe to reboot. Please advise how you wish to proceed.
I wudda died right there reading that, and then it gets better!!!
Well hope it holds for the time being.
wp
joblow@bellsouth.net game contact email.