When Good Hard Drives Go Bad
Picture the scene: you return to your system and casually attempt to create a new file on one of your disks. Permission denied. What?
- Check you’re logged in as the correct user - check
- Check that the user has write permissions - check
Following this you start to get a sinking feeling: something is not right. In fact, something is very wrong, but you don’t know what just yet, although the situation feels vaguely familiar. Without recalling why, you decide to check whether the partition you’re attempting to write to has been mounted read-only. It has. Your mind races and you check /var/log/messages, fearing the worst.
ide: failed opcode was: unknown
end_request: I/O error, dev hdd, sector 61866003
Buffer I/O error on device hdd1, logical block 30933001
hdd: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdd: dma_intr: error=0x40 { UncorrectableError }, LBAsec= 63963157, high=3, low=1460081, sector=65011733
And now you realise why this felt so familiar: it’s because the same thing has happened to one of your disks at the beginning of summer three years in a row. You really should have made backups by now.
I’m sorry to say that this is a true story: three disks of different brands, in different systems, in different houses, all in different conditions. The only common factor is that these disks all ran 24/7 and were used as a ‘media repository’. Having been in this situation before, I would like to share my experiences of simple disk recovery under Linux.
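For reference, the read-only check from the story above can be scripted. This is only a sketch: the function name is_readonly and its optional second argument are made up for illustration, and it assumes the usual /proc/mounts layout, where the first field is the device and the fourth is the comma-separated list of mount options.

```shell
# Hypothetical helper: succeeds if the given device is mounted with the "ro" option.
# Usage: is_readonly DEVICE [MOUNTS_FILE]
is_readonly() {
    dev=$1
    mounts=${2:-/proc/mounts}
    # Field 1 is the device, field 4 the comma-separated mount options.
    awk -v dev="$dev" '
        $1 == dev {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$mounts"
}

# Example, using the device name from the log excerpt above:
# is_readonly /dev/hdd1 && echo "hdd1 has been remounted read-only"
# grep -E 'I/O error|UncorrectableError' /var/log/messages
```

The options are split on commas rather than searched as one string, because an option such as errors=remount-ro contains "ro" without meaning the file system is currently read-only.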
Step 1.
Unmount the faulty disk (stop reading this, do it now). Forget about reading from it, forget writing to it and whatever happens do not fsck the faulty disk. Running fsck against a disk with physical errors will only make the problem worse. I recommend that you disconnect the disk and store it somewhere safe.
Step 2.
Buy/locate/steal/borrow a replacement disk, as you’re going to need somewhere to copy your data to. Because the copy is made block for block, the new partition needs to be at least as large as the old one. You should then connect both disks, your new disk and the faulty disk, to your system. If possible, try to get them on separate IDE buses: I’ve had one disk fail in a way that caused IDE bus resets whenever it was read from, which had a negative effect on every disk on the same bus (including the good disk I was copying to). Format your new disk and check that it’s working, but do not store anything important on it just yet, as it will be overwritten.
It should also be noted that if you have a disk or array which is already in use but has enough free space available, you can write an image of your damaged disk to that space instead. The disk image can then be fscked and mounted via a loopback mount. The exact details of this method are not covered here, but it should be easy to adapt the steps described below.
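As a rough sketch of that image-file variation, and no more than that: the function names run and rescue_to_image, the DRY_RUN switch and all paths are placeholders of mine, and the use of e2fsck assumes the damaged partition holds an ext2/ext3 file system. With DRY_RUN=1 the commands are only printed, which is a safe way to see what would happen.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: rescue a partition into an image file, check it, mount it.
# Usage: rescue_to_image SOURCE IMAGE LOGFILE MOUNTPOINT
rescue_to_image() {
    src=$1   # damaged partition, e.g. /dev/hdd1
    img=$2   # image file on the disk that has free space
    log=$3   # ddrescue log file, needed to resume an interrupted run
    mnt=$4   # existing empty directory to mount the image on

    run ddrescue "$src" "$img" "$log" || return 1
    # e2fsck accepts an image file in place of a device (ext2/ext3 assumed).
    # Its exit status is non-zero whenever it found or fixed errors, so it is
    # deliberately not treated as fatal here.
    run e2fsck -f -C 0 "$img"
    # Mount the checked image read-only through a loop device.
    run mount -o loop,ro "$img" "$mnt"
}

# Example (placeholder paths, run as root):
# DRY_RUN=1; rescue_to_image /dev/hdd1 /srv/rescue/hdd1.img /srv/rescue/hdd1.log /mnt/rescue
```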
Step 3.
Install GNU ddrescue ( http://www.gnu.org/software/ddrescue/ddrescue.html ) onto your system; the package name under Debian / Ubuntu is gddrescue. This is possibly one of the most useful data recovery tools I’ve used. It is dd designed with data recovery in mind: it reads directly from the disk, below the file system, and carries on past read errors instead of giving up. The website lists its most important features, but most importantly for us it’s very easy to use and fully automatic. Once you’ve read the manual page for ddrescue, it’s time to start recovering your data.
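Since there is also an unrelated tool with the very similar name dd_rescue, it can be worth confirming which one you ended up with. The helper below is a hypothetical sketch (the name have_gnu_ddrescue is mine); it assumes the binary is called ddrescue and that its --version banner starts with "GNU ddrescue".

```shell
# Hypothetical helper: succeeds only if the named command exists and
# identifies itself as GNU ddrescue in its version banner.
# Usage: have_gnu_ddrescue [COMMAND]   (COMMAND defaults to ddrescue)
have_gnu_ddrescue() {
    cmd=${1:-ddrescue}
    command -v "$cmd" >/dev/null 2>&1 || return 1
    "$cmd" --version 2>/dev/null | head -n 1 | grep -q 'GNU ddrescue'
}

# Example (Debian / Ubuntu, as root):
# have_gnu_ddrescue || apt-get install gddrescue
```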
Step 4.
Make sure that both your old faulty disk and your new disk are unmounted. Once that is done, invoke ddrescue.
# ddrescue /dev/hdd1 /dev/hdc1 /logfile
In this example, /dev/hdd1 is my old faulty disk (the source) and /dev/hdc1 is my new disk (the target). The third argument, /logfile, is a log file which ddrescue uses to resume its recovery should you need to stop it for any reason. This will take some time, so sit back and wait. The output from ddrescue is very clear: it keeps you informed as to how much data it has copied, how much is unreadable and how many errors have occurred.
For my most recent disk failure, on a 250 gigabyte drive, ddrescue was able to read all but ~2 megabytes of data, which is really quite fantastic when you consider that with the faulty disk mounted read-only I was unable to copy any files without producing hundreds of errors.
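A common refinement, which goes beyond the single command shown above, is to run ddrescue twice against the same log file: a quick first pass that collects everything that reads easily, then a second pass that retries the bad areas. The sketch below follows the pattern given in the ddrescue manual as far as I know it; the function names and the DRY_RUN switch are my own, and the -f flag is there because recent versions refuse to overwrite a device without it.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: two-pass rescue from SOURCE to TARGET using LOGFILE.
two_pass_rescue() {
    src=$1
    dst=$2
    log=$3
    # Pass 1: copy the readable areas quickly; -n skips the slowest
    # phase of work on damaged areas.
    run ddrescue -f -n "$src" "$dst" "$log" || return 1
    # Pass 2: revisit only what the log file marks as unrecovered, using
    # direct disc access (-d) and up to three retry passes (-r3).
    run ddrescue -f -d -r3 "$src" "$dst" "$log"
}

# Example with the devices from this article (as root):
# two_pass_rescue /dev/hdd1 /dev/hdc1 /logfile
```

To resume after an interruption, run the same command again with the same log file; ddrescue reads the log and skips the areas it has already rescued.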
Step 5.
The moment of truth: ddrescue has read all it can read, and now it’s time to fsck the new disk.
# fsck -C /dev/hdc1
This will run fsck with a handy progress bar. If you’re very lucky, fsck will complete without complaining and you have successfully rescued all of your data. If you’re not so lucky (as I was this time), you may have suffered some fairly nasty file system corruption, and fsck will go through asking whether you would like to fix each error. You should make a note of which files have gone and let fsck do its thing. There may be better ways to handle this; if there are, I would love to hear them. If you have only suffered damage to the file system structure and your actual data files are okay, then you will end up with a load of files in the lost+found directory. Simply put, these are files which exist on disk but no longer have a file name (and other things) associated with them, so fsck moves them into the lost+found directory.
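Files in lost+found have lost their names, so the quickest way to work out what they were is to look at their contents. The helper below is a small sketch with a made-up name, identify_orphans; it assumes the file utility is installed and GNU find is available, and simply asks file to guess the type of each top-level entry.

```shell
# Hypothetical helper: guess the type of every entry in a lost+found directory.
# Usage: identify_orphans DIRECTORY
identify_orphans() {
    dir=$1
    # e2fsck names recovered entries after their inode number, e.g. "#12345".
    # -mindepth/-maxdepth limit the listing to the top level of the directory.
    find "$dir" -mindepth 1 -maxdepth 1 -exec file {} +
}

# Example, assuming the repaired partition is mounted on /mnt/new (placeholder):
# identify_orphans /mnt/new/lost+found
#
# To preview the damage before letting fsck change anything, a read-only run
# that answers "no" to every question can be saved as a record of what is wrong:
# fsck -n /dev/hdc1 | tee fsck-report.txt
```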
Step 6.
That’s it really. You should have most or all of your data back, unless you’ve been really unlucky. Now is the time to create a backup plan; do it now while the fear is still in you. Otherwise, six months down the line you’ll be thinking to yourself, “nah, disk failure won’t happen to me (again)”.
It should be noted that there are variations on this method. If anyone knows of any better ways to handle a failed disk, please let me know; I’d love to hear about any steps or tools I’ve missed.