In line with my SysRq Post comes another bit of assumed knowledge, SMART. Let’s begin at the beginning (and stick to the PC world).
Magnetic storage is fault-prone. In the old days, when you formatted a drive, part of the formatting process was to check that each block seemed to be able to store data. All the bad blocks would be listed in a “bad block list” and the file-system would never try to use them. File-system checks would also be able to mark blocks as being bad.
As disks got bigger, this meant that formatting could take hours. Drives had also become fancy enough that they could manage bad blocks themselves, and so a shift occured. Disks were shipped with some extra spare space that the computer can’t see. Should the drive controller detect that a block went bad (they had parity checks), it could re-allocate a block from the extra space to stand in for the bad block. If it was able to recover the data from the bad block, this could be totally transparent to the file-system, if not the file-system would see a read-error and have to handle it.
This is where we are today. File-systems still support the concept of bad blocks, but in practice they only occur when a disk runs out of spare blocks.
This came with a problem, how would you know if a disk was doing ok or not? Well a standard was created called SMART. This allows you to talk to the drive controller and (amongst other things) find out the state of the disk. On Linux, we do this via the package smartmontools.
Why is this useful? Well you can ask the disk to run a variety of tests (including a full bad block scan), these are useful for RMAing a bad drive with minimum hassle. You can also get the drive’s error-log which can give you some indication of it’s reliability. You can see it’s temperature, age, and Serial Number (useful when you have to know which drive to unplug). But, most importantly, you can find out the state of bad sectors. How many sectors does the drive think are bad, and how many has it reallocated.
Why is that useful?
In the event of a bad block, you can manually force a re-allocation. This way it happens under your terms, and you’ll know exactly what got corrupted.
Next, Google published a paper linking non-zero bad sector values to drive failure. Do you really want be trusting known-non-trustworthy drives with critical data?
Finally, there is a nasty RAID situation. If you have a RAID-5 array with say 6 drives in it and one fails either the RAID system will automatically select a spare drive (if it has one), or you’ll have to replace it. The system will then re-build on the new disk, reading every sector on all the other disks, to calculate the sector contents for the new disk. If one of those reads fails (bad sector) you’ll now be up shit-creek without a paddle. The RAID system will kick out the disk with the read failure, and you’ll have a RAID-5 array with two bad disks in it — one more than RAID-5 can handle. There are tricks to get such a RAID-5 array back online, and I’ve done it, but you will have corruption, and it’s risky as hell.
So, before you go replacing RAID-5 member-disks, check the SMART status of all the other disks.
Personally, I get twitchy when any of my drives have bad sectors. I have smartd monitoring them, and I’ll attempt to RMA them as soon as a sector goes bad.
I’m constantly surprised when I come across long-time Linux users who don’t know about SysRq. The Linux Magic System Request Key Hacks are a magic set of commands that you can get the Linux kernel to follow no matter what’s going on (unless it has panicked or totally deadlocked).
Why is this useful? Well, there are many situations where you can’t shut a system down properly, but you need to reboot. Examples:
lp0 possibly?)In any of those situations, grab a console keyboard, and type Alt+SysRq+s (sync), Alt+SysRq+u (unmount), wait for it to have synced, and finally Alt+SysRq+b (reboot NOW!). If you don’t have a handy keyboard attached to said machine, or are on another continent, you can
In my books, the useful SysRq commands are:
In fact, read the rest of the SysRq documentation, print it out, and tape it above your bed. Next time you reach for the reset switch on a Linux box, stop your self, type the S,U,B sequence, and watch your system come up as if nothing untoward has happened.
Update: I previously recommended U,S,B but after a bit of digging, I think S,U,B may be correct.
I’ve written about this before, but Ctrl-Alt-workspace switching key-presses nail me routinely.
Let’s go through some history:
We have Ctrl-Alt-Delete, the “three-fingered-salute”, meaning reboot, right? That combination was designed to NEVER be pressed by accident. And it never used to be.
The X guys needed a kill-X key-press, as things can sometimes get broken in X. So they chose Ctrl-Alt-Backspace, which is also a pretty sensible combination. It’s very similar to Ctrl-Alt-Delete, so we remember it, and backspace has milder connotations than delete, so we understand it to mean that it’ll only kill a part of the system.
X also has some other Ctl-Alt- shortcuts. Some of these are also suitably obscure, i.e. NumPad+ and NumPad-. Others like Ctrl-Alt-F1 mean change to virtual console 1. That one might do by accident, if you are an old WordPerfect user, but should be safe enough otherwise. They were designed to look like big brothers of (and even work as) standard VT-changing behaviour.
For changing workspaces, Alt-F1 style key-presses were used, mimicing VT-changing key-presses. This is great for *nix users, but people coming from Windows expect Alt-F4 to close a program, not take them to workspace 4.
So GNOME came along, and decided that instead, they’d use Ctrl-Alt-Arrow key-presses to change workspace. That’s fine, but it’s a pretty common action, so I’m often holding down Ctrl-Alt without even thinking about it. If I start editing something and press delete/backspace, before I’ve released Ctrl-Alt, boom! And I run screaming and write a blog post.
Now, I know that Ctl-Alt-{Delete,Backspace} can be disabled (even if the latter is a little tricky to do), but I’d really like to change them. I like to be able to kill X without using another machine and ssh, I just don’t like this to happen by accident. And no, the solution isn’t for me to change my workspace-changing keys, because this problem must affect every GNOME user, not just me.
Dangerous key-presses should be really unlikely key-presses. Alt-SysRq- key-presses are good in this regard, they’ll always be unlikely. (Oh, and they are insanely useful.)
My post on split-routing on OpenWRT has been incredibly popular, and led to many people implementing split-routing, whether or not they had OpenWRT. While it's fun to have an exercise as a reader, it led to me having to help lots of newbies through porting that setup to a Debian / Ubuntu environment. To save myself some time, here's how I do it on Debian:
Background, especially for non-South Africa readers: Bandwidth in South Africa is ridiculously expensive, especially International bandwidth. The point of this exercise is that we can buy "local-only" DSL accounts which only connect to South African networks. E.g. I have an account that gives me 30GB of local traffic / month, for the same cost as 2.5GB of International traffic account. Normally you'd change your username and password on your router to switch account when you wanted to do something like an Debian apt-upgrade, but that's irritating. There's no reason why you can't have a Linux-based router concurrently connected to both accounts via the same ADSL line.
Firstly, we have a DSL modem. Doesn't matter what it is, it just has to support bridged mode. If it won't work without a DSL account, you can use the Telkom guest account. My recommendation for a modem is to buy a Telkom-branded Billion modem (because Telkom sells everything with really big chunky, well-surge-protected power supplies).
For the sake of this example, we have the modem (IP 10.0.0.2/24) plugged into eth0 on our server, which is running Debian or Ubuntu, doesn't really matter much - personal preference. The modem has DHCP turned off, and we have our PCs on the same ethernet segment as the modem. Obviously this is all trivial to change.
You need these packages installed:
You need ppp interfaces for your providers. I created /etc/ppp/peers/intl-dsl:
/etc/ppp/peer/local-dsl:
unit 1 makes a connection always bind to "ppp1". Everything else is pretty standard. Note that only the international connection forces a default route.
To /etc/ppp/pap-secrets I added my username and password combinations:
You need custom iproute2 routing tables for each interface, for the source routing. This will ensure that incoming connections get responded to out of the correct interface. As your provider only lets you send packets from your assigned IP address, you can't send packets with the international address out of the local interface. We get around that with multiple routing tables. Add these lines to /etc/iproute2/rt_tables:
Now for some magic. I create /etc/ppp/ip-up.d/20routing to set up routes when a connection comes up:
That script loads routes from /etc/network/routes-intl-dsl and /etc/network/routes-local-dsl. It also sets up source routing so that incoming connections work as expected.
Now, we need those route files to exist and contain something useful. Create the script /etc/cron.daily/za-routes (and make it executable):
It downloads the routes file from cocooncrash's site (he gets them from local-route-server.is.co.za, aggregates them, and publishes every 6 hours). Run it now to seed that file.
Now some International-only routes. I use IS local DSL, so SAIX DNS queries should go through the SAIX connection even though the servers are local to ZA.
My /etc/network/routes-intl-dsl contains SAIX DNS servers and proxies:
Now we can tell /etc/network/interfaces about our connections so that they can get brought up automatically on bootup:
For DNS, I use dnsmasq, hardcoded to point to IS & SAIX upstreams. My machine's /etc/resolv.conf just points to this dnsmasq.
So something like /etc/resolv.conf:
/etc/dnsmasq.conf:
If you haven't already, you'll need to turn on ip_forward. Add the following to /etc/sysctl.conf and then run sudo sysctl -p:
Finally, you'll need masquerading set up in your firewall. Here is a trivial example firewall, put it in /etc/network/if-up.d/firewall and make it executable. You should probably change it to suit your needs or use something else, but this should work:
I had an interesting discussion with “bonnyrsa” in #ubuntu-za today. He’d re-arranged his partitions with gparted, and copied and pasted his / partition, so that he could move it to the end of the disk.
However this meant that he now had two partitions with the same UUID. While you can imagine that this is the correct result of a copy & paste operation, it now means that your universally unique ID is totally non-unique. Not in your PC, and no even on it’s home drive.
Ubuntu mounts by UUID, so now how do we know which partition is being mounted?
However neither were correct.
Mounting /dev/sda4 (ro) produced “/dev/sda4 already mounted or /mnt busy”.
Aha, so we must be running from /dev/sda4.
/dev/sda2 mounted fine, but then wouldn’t unmount: “it seems /dev/sda2 is mounted multiple times”.
Aaaargh!
I got him to reboot, change /dev/sda2s UUID, and reboot again (sucks). Then everything was better.
This shouldn’t have happened. Non-unique UUIDs is a really crap situation to be in. It brings out bugs in all sorts of unexpected places. I think parted should (by default) change the UUID of a copied partition (although if you are copying an entire disk, it shouldn’t).
I’ve filed a bug on Launchpad, let’s see if anyone bites.
PS: All UUIDs in this post have been changed to protect the identity of innocent Ubuntu systems (who aren’t expecting a sudden attack of non-uniqueness).