hardware

When good hard drives go bad - 2008 remix.

It’s that time of year again; I had the same issue in 2007. This time I’m pleased(?) to say that only my laptop hard disk has died, and that this time I had a backup. Fingers crossed this is the only disk (this year).

hardware


Redundant bonding of wireless and wired interfaces in Ubuntu.

Recently, while attending a networking course that covered redundant multipathing for high availability systems, I got to thinking: could this be applied to my wired / wireless network at home? The end result would be that unplugging the network cable from my laptop instantly fails over to my wireless connection. I’m sure many people are thinking “but GNOME Network Manager already does that”. Well, not quite.

With redundant bonding or multipathing, both interfaces are connected to the same network and put into a special group. These interfaces are constantly connected to the network and ready to work, but only one of them is active at any one time. Should the active interface fail (for example, because a network cable is removed), the MAC address, IP address and all other configuration are almost instantly assigned to one of the other available network interfaces. This allows you to continue working as if nothing had happened: songs playing from network shares keep playing, instant messenger conversations keep working, downloads are not interrupted… the list goes on.

It turns out that under Linux this works incredibly well. Read on for details.

Requirements.

In terms of networking, your wired and wireless networks must be on the same physical network segment. In this post I am also assuming that your wireless network is already set up and that you have a working knowledge of networking.
As for packages, you will need to install the ifenslave package under Ubuntu.

# apt-get install ifenslave-2.6

Step One.

Warning: if you follow the steps I’ve included here, reboot and anything goes wrong, I accept no responsibility - thanks.

Under Linux, high availability and link aggregation are handled by the bonding module. It can be enabled by adding the following line to your /etc/modules configuration file.

bonding mode=1 miimon=100 downdelay=200 updelay=200

This will load the bonding module upon next boot; it can also be loaded at any time using the modprobe command. The options are as follows.

  • mode=1 - This enables active-backup mode, which provides link level redundancy but does not allow for any kind of link aggregation.
  • miimon=100 - This enables link level monitoring of the connection (the default value is 0, which disables link monitoring). The value passed is the interval in milliseconds at which the link is checked. Link level monitoring only takes into account the physical connection, not whether the network is correctly configured.
  • downdelay=200 - This is the delay in milliseconds before a failed link is marked as down. It must be a multiple of miimon.
  • updelay=200 - This is the delay in milliseconds before a recovered link is marked as active again. It must be a multiple of miimon.
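As a quick sanity check on those values, here is a minimal sketch that tests whether downdelay and updelay are whole multiples of miimon before the line goes into /etc/modules. The helper name check_bonding_delays is made up for this illustration; it is not part of the bonding driver or of ifenslave.

```shell
#!/bin/sh
# Hypothetical helper (not part of the bonding driver): succeeds only if
# downdelay and updelay are both whole multiples of a non-zero miimon.
check_bonding_delays() {
    miimon=$1; downdelay=$2; updelay=$3
    # miimon=0 disables link monitoring, so there is nothing to be a multiple of.
    [ "$miimon" -gt 0 ] || return 1
    [ $((downdelay % miimon)) -eq 0 ] && [ $((updelay % miimon)) -eq 0 ]
}

# The values used in this post: 200 is a multiple of 100, so the line is printed.
if check_bonding_delays 100 200 200; then
    echo "bonding mode=1 miimon=100 downdelay=200 updelay=200"
fi
```

To try the options without waiting for a reboot, the same option string can be passed straight to modprobe as root: modprobe bonding mode=1 miimon=100 downdelay=200 updelay=200.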

Step Two.

Next the bonding interface (bond0) must be configured. This is done in the /etc/network/interfaces file, where an entry such as this needs to be added:

auto bond0
iface bond0 inet static
address 192.168.1.34
netmask 255.255.255.0
gateway 192.168.1.24
broadcast 192.168.1.255
post-up ifenslave bond0 eth0 eth1
post-up echo "eth1" > /sys/class/net/bond0/bonding/primary
pre-down ifenslave -d bond0 eth0 eth1

The address, netmask, gateway and broadcast should be configured for your network. This is the single address that your machine will be known by, as long as at least one of the connections to your laptop is available. You should be able to use DHCP to configure this interface; however, I have chosen a static address.

  • ‘post-up ifenslave bond0 eth0 eth1’ - This line assigns my wired (eth1) and wireless (eth0) interfaces to my failover group bond0. You should replace these with the interfaces you want to use.
  • ‘post-up echo "eth1" > /sys/class/net/bond0/bonding/primary’ - This line specifies a primary slave: if this interface is available it will always be used in preference to the others. In this case eth1 is my wired interface, which ensures that when I return to my desk and plug my network cable back in I will be using the faster ethernet network and not my slower wireless.
  • ‘pre-down ifenslave -d bond0 eth0 eth1’ - This line removes the eth0 and eth1 interfaces from the bond group when networking is stopped.

Step Three.

Now reboot for the settings to take effect. It is of course possible to do all of this without rebooting by performing the following steps:

  1. modprobe the bonding driver
  2. ifconfig the bonding interface
  3. add the default route for the bonding interface
  4. ifenslave the network devices
  5. set the primary bonding interface
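The five steps above might look something like the following sketch. It reuses the addresses and interface names from my example configuration, which will differ on your network. The run wrapper and the DRY_RUN variable are my own additions so that the script only prints the commands unless told otherwise; they are not part of any of the tools involved.

```shell
#!/bin/sh
# Sketch of the manual steps. By default the commands are only printed;
# run as root with DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

bring_up_bond() {
    # 1. load the bonding driver with the same options as in /etc/modules
    run modprobe bonding mode=1 miimon=100 downdelay=200 updelay=200
    # 2. configure the bonding interface (example address from this post)
    run ifconfig bond0 192.168.1.34 netmask 255.255.255.0 broadcast 192.168.1.255 up
    # 3. add the default route through the bonding interface
    run route add default gw 192.168.1.24 bond0
    # 4. enslave the wireless (eth0) and wired (eth1) devices
    run ifenslave bond0 eth0 eth1
    # 5. prefer the wired interface whenever it is available
    run sh -c 'echo eth1 > /sys/class/net/bond0/bonding/primary'
}

bring_up_bond
```

Printing first and executing second is deliberate: a typo in an address here can cut off the very connection you are logged in over.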

To test that everything is working, ping a host on your local network and, once it’s going, unplug the wired interface. It should fail over to the wireless interface without dropping any packets. You can then re-insert the network cable and instantly switch back to the wired interface.
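Alongside the ping, it can help to see which interface the bond is actually using. The bonding driver reports this in /proc/net/bonding/bond0 on a line starting with "Currently Active Slave:". The sketch below simply pulls that line out; the function names active_slave and watch_failover are my own, and bond0 is the bond name assumed throughout this post.

```shell
#!/bin/sh
# Print the active slave recorded in a bonding status file.
# Defaults to /proc/net/bonding/bond0, the bond used in this post.
active_slave() {
    sed -n 's/^Currently Active Slave: //p' "${1:-/proc/net/bonding/bond0}"
}

# Report the active slave once a second until interrupted with Ctrl-C.
watch_failover() {
    while sleep 1; do
        echo "$(date +%T) active: $(active_slave)"
    done
}

# Only query the bond if it actually exists on this machine.
if [ -r /proc/net/bonding/bond0 ]; then
    echo "bond0 is currently using: $(active_slave)"
fi
```

Calling watch_failover in a second terminal while the ping runs should show the reported interface change when the cable is pulled, and change back once it is plugged in again.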

Help, it’s not working.

I had a few problems getting this working. Firstly, ifenslave would assign the bonding interface the MAC address of my wired interface. As part of my wireless network security I have MAC filtering enabled, but only the MAC address of the wireless card was allowed. Adding the MAC address of the bonding interface to my MAC filter list cleared up that problem.

My other problem involved the GNOME network applet; I believe that this process is called nm-monitor. It would attempt to reconfigure my wireless or wired network whenever I removed the network cable from my laptop, which appeared to prevent the failover from working as quickly as it should have. Killing off nm-monitor fixed this little issue.

I’ve now been using this for roughly two weeks without any issues. Being able to unplug my laptop and take it to another room of the house without having to even consider what’s going to happen to my network has been a real step forward.

Linux
hardware
networking


When Good Hard Drives Go Bad

Picture the scene: you return to your system and casually attempt to create a new file on one of your disks. Permission Denied. What?

  1. Check you’re logged in as the correct user - check
  2. Check that the user has write permissions - check

Following this you start to get a sinking feeling: something is not right. In fact, something is very wrong, but you don’t know what that is just yet, although the situation feels vaguely familiar. Without recalling why, you decide to check whether the partition you’re attempting to write to has been mounted read only. It has. Your mind races and you check /var/log/messages for the worst.

ide: failed opcode was: unknown
end_request: I/O error, dev hdd, sector 61866003
Buffer I/O error on device hdd1, logical block 30933001
hdd: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdd: dma_intr: error=0x40 { UncorrectableError }, LBAsec= 63963157, high=3, low=1460081, sector=65011733

And now you realise why this felt so familiar: it’s because the same thing has happened to one of your disks at the beginning of summer three years in a row. You really should have made backups by now.

I’m sorry to say that this is a true story: three different branded disks in different systems in different houses, all in different conditions. The only common factor is that these disks all ran 24/7 and were used as a ‘media repository’. Having been in this situation before, I would like to share my experiences of simple disk recovery under Linux.

Step One.

Unmount the faulty disk (stop reading this, do it now). Forget about reading from it, forget writing to it and whatever happens do not fsck the faulty disk. Running fsck against a disk with physical errors will only make the problem worse. I recommend that you disconnect the disk and store it somewhere safe.
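If you are not sure whether the disk is still mounted, or whether the kernel has already remounted it read only, the mount table will tell you. This is a minimal sketch that reads /proc/mounts; the function name mount_state is my own, and /dev/hdd1 is just the failing partition from my example.

```shell
#!/bin/sh
# Print "mountpoint options" for every entry of a device in a mount table.
# The table defaults to /proc/mounts (fields: device mountpoint type options).
mount_state() {
    table=${2:-/proc/mounts}
    # No readable mount table (not Linux, for instance): report nothing.
    [ -r "$table" ] || return 0
    awk -v dev="$1" '$1 == dev { print $2, $4 }' "$table"
}

# An options field starting with "ro" means the partition is mounted read only.
state=$(mount_state /dev/hdd1)
if [ -n "$state" ]; then
    echo "/dev/hdd1 is mounted: $state"
    echo "unmount it with: umount /dev/hdd1"
else
    echo "/dev/hdd1 is not mounted"
fi
```

Note that the device may appear in the mount table under a different name than the one you expect, so an empty result is worth double checking with plain mount.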

Step Two.

Buy/locate/steal/borrow a replacement disk, as you’re going to need somewhere to copy your data to. You should then connect both disks, your new disk and the faulty disk, to your system. If possible try to get them on separate IDE buses; I’ve had one disk fail in a way that caused IDE bus resets when reading from it, which had a negative effect on all disks on the same IDE bus (including the good disk I was copying to). Format your new disk and check that it’s working, but do not store anything important on it just yet as it will be overwritten.

It should also be noted that if you have a disk or array which is already in use but has enough free space available, you can write an image of your damaged disk to this space. The disk image can then be fscked and mounted via a loopback mount. The exact details of this method are not covered here, but it should be easy to adapt the steps described here.
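For what it’s worth, the image variant might look roughly like the sketch below. It is untested against a real failing disk: the paths /mnt/spare and /mnt/recovered are hypothetical, an ext2/ext3 file system is assumed (hence e2fsck), and the run wrapper with its DRY_RUN variable is my own addition so that nothing is executed unless you ask for it.

```shell
#!/bin/sh
# Sketch: rescue the failing partition into an image file instead of onto a
# new disk. Commands are only printed by default; run as root with DRY_RUN=0
# to execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rescue_to_image() {
    src=$1; img=$2; log=$3; mnt=$4
    # copy everything readable from the failing partition into the image
    run ddrescue "$src" "$img" "$log"
    # check the file system inside the image, never on the failing disk
    run e2fsck -f -C 0 "$img"
    # mount the checked image read only through a loop device
    run mount -o loop,ro "$img" "$mnt"
}

# Hypothetical paths: free space under /mnt/spare, mount point /mnt/recovered.
rescue_to_image /dev/hdd1 /mnt/spare/hdd1.img /mnt/spare/hdd1.log /mnt/recovered
```

The image needs as much free space as the whole damaged partition, not just the space used by the files on it.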

Step Three.

Install GNU ddrescue onto your system ( http://www.gnu.org/software/ddrescue/ddrescue.html ); the package name under Debian / ‘bunty (Ubuntu) is gddrescue. This is possibly one of the most useful data recovery tools I’ve used. It is dd designed with data recovery in mind: it will allow you to read directly from the disk, ignoring any errors which may exist in the file system. The website lists its most important features, but what matters most to us is that it’s very easy to use and fully automatic. Once you’ve read the manual page for ddrescue it’s time to start recovering your data.

Step Four.

Make sure that both your old faulty disk and your new disk are unmounted. Once that is done, invoke ddrescue.

# ddrescue /dev/hdd1 /dev/hdc1 /logfile

In this example, /dev/hdd1 is my old faulty disk (the source) and /dev/hdc1 is my new disk (the target). The /logfile argument names a log file which ddrescue can use to resume its recovery should you need to stop it for any reason. This will take some time, so sit back and wait; the output from ddrescue is very clear. It will keep you informed as to how much data it has copied, how much is unreadable and how many errors have occurred.
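Because the log file records which areas have already been read, the same command can be run again to pick up where it left off. One variation, which is not what I ran above and is only a sketch, is to split the work into a quick first pass and a more stubborn second pass using ddrescue’s -n and -r options. The -f flag is included because, as far as I know, recent versions of ddrescue refuse to write to a device or partition without it. The run wrapper and DRY_RUN variable are my own additions so that the commands are only printed by default.

```shell
#!/bin/sh
# Sketch of a two-pass rescue that relies on the log file to resume.
# Commands are only printed by default; run as root with DRY_RUN=0 to execute.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rescue_passes() {
    src=$1; dst=$2; log=$3
    # pass 1: copy the readable areas first without lingering on the bad ones
    run ddrescue -f -n "$src" "$dst" "$log"
    # pass 2: same log file, so only the bad areas are retried, up to 3 times
    run ddrescue -f -r3 "$src" "$dst" "$log"
}

# Device names from this post: /dev/hdd1 is the source, /dev/hdc1 the target.
rescue_passes /dev/hdd1 /dev/hdc1 /logfile
```

Check the source and target twice before running this for real; swapping them overwrites the disk you are trying to save.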

For my most recent disk failure, on a 250 gigabyte drive, ddrescue was able to read all but ~2 megabytes of data. That is really quite fantastic when you consider that with the faulty disk mounted read only I was unable to copy any files without producing hundreds of errors.

Step Five.

The moment of truth: ddrescue has read all it can read and now it’s time to fsck the new disk.

# fsck -C /dev/hdc1

This will run fsck with a handy progress bar. If you’re very lucky, fsck will complete without complaining and you will have successfully rescued all of your data. If you’re not so lucky (as I have been this time) you may have suffered some fairly nasty file system corruption, and fsck will go through asking whether you would like to fix each error. You should make a note of which files have gone and let fsck do its thing; there may be better ways to handle this, and if there are I would love to hear them. If you have only suffered damage to the file system structure and your actual data files are okay, then you will end up with a load of files in the lost+found directory. Simply put, these are files which exist on disk but no longer have a file name (and other things) associated with them, so fsck moves them into the lost+found directory.

Step Six.

That’s it really. You should have most or all of your data back, unless you’ve been really unlucky. Now is the time to create a backup plan; do it now while the fear is still in you. Otherwise, six months down the line you’ll be thinking to yourself, “nahh, disk failure won’t happen to me (again)”.
It should be noted that there are variations on this method. If anyone knows of any better ways to handle a failed disk, please let me know; I’d love to hear about any steps or tools I’ve missed.

Linux
hardware
