Archives: October 2008

smartmontools

In line with my SysRq Post comes another bit of assumed knowledge, SMART. Let’s begin at the beginning (and stick to the PC world).

Magnetic storage is fault-prone. In the old days, when you formatted a drive, part of the formatting process was to check that each block seemed to be able to store data. All the bad blocks would be listed in a “bad block list” and the file-system would never try to use them. File-system checks would also be able to mark blocks as being bad.

As disks got bigger, this meant that formatting could take hours. Drives had also become fancy enough that they could manage bad blocks themselves, and so a shift occurred. Disks were shipped with some extra spare space that the computer can’t see. Should the drive controller detect that a block had gone bad (each block carries error-checking data), it could re-allocate a block from the extra space to stand in for the bad block. If it was able to recover the data from the bad block, this could be totally transparent to the file-system; if not, the file-system would see a read-error and have to handle it.

This is where we are today. File-systems still support the concept of bad blocks, but in practice they only occur when a disk runs out of spare blocks.

This came with a problem: how would you know if a disk was doing OK or not? Well, a standard called SMART was created. It allows you to talk to the drive controller and (amongst other things) find out the state of the disk. On Linux, we do this via the package smartmontools.

Why is this useful? Well, you can ask the disk to run a variety of tests (including a full bad block scan), which are useful for RMAing a bad drive with minimum hassle. You can also get the drive’s error-log, which can give you some indication of its reliability. You can see its temperature, age, and serial number (useful when you have to know which drive to unplug). But, most importantly, you can find out the state of bad sectors: how many sectors the drive thinks are bad, and how many it has reallocated.
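As a rough sketch of what that looks like with smartmontools: the commands in the comments are standard smartctl invocations, while /dev/sda is only a placeholder and bad_sector_counts is a helper name of my own invention. It assumes the drive reports the common ATA attribute IDs 5, 197 and 198, which most do, although attribute tables are vendor-specific.

```shell
# Handy smartctl invocations (run as root; /dev/sda is a placeholder):
#   smartctl -i /dev/sda            # model and serial number
#   smartctl -H /dev/sda            # overall health verdict
#   smartctl -l error /dev/sda      # the drive's error log
#   smartctl -t long /dev/sda       # start an extended self-test
#   smartctl -l selftest /dev/sda   # results of past self-tests

# bad_sector_counts: hypothetical helper. Reads `smartctl -A` output on stdin
# and prints "<attribute name> <raw value>" for the sector-health attributes.
# Assumes the usual attribute table layout: ID in column 1, name in column 2,
# raw value in column 10.
#   5   = reallocated sectors
#   197 = sectors pending reallocation
#   198 = sectors found uncorrectable during offline scans
bad_sector_counts() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2, $10 }'
}

# Usage: smartctl -A /dev/sda | bad_sector_counts
```

Anything other than zero in the second column is the "bad sector" state discussed here.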

Why is that useful?

In the event of a bad block, you can manually force a re-allocation. This way it happens under your terms, and you’ll know exactly what got corrupted.

Next, Google published a paper linking non-zero bad sector values to drive failure. Do you really want to be trusting known-non-trustworthy drives with critical data?

Finally, there is a nasty RAID situation. If you have a RAID-5 array with, say, 6 drives in it and one fails, either the RAID system will automatically select a spare drive (if it has one), or you’ll have to replace it. The system will then re-build on the new disk, reading every sector on all the other disks to calculate the sector contents for the new disk. If one of those reads fails (bad sector), you’ll now be up shit-creek without a paddle. The RAID system will kick out the disk with the read failure, and you’ll have a RAID-5 array with two bad disks in it, one more than RAID-5 can handle. There are tricks to get such a RAID-5 array back online, and I’ve done it, but you will have corruption, and it’s risky as hell.

So, before you go replacing RAID-5 member-disks, check the SMART status of all the other disks.
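A minimal sketch of such a pre-rebuild check, assuming the same attribute IDs (5, 197, 198) are the ones to watch. The function name suspect_disks, the SMARTCTL variable and the device names are all illustrative, not part of smartmontools.

```shell
# SMARTCTL can be overridden, e.g. to rehearse with a stub command.
SMARTCTL=${SMARTCTL:-smartctl}

# suspect_disks: hypothetical helper. For each device given, print its name
# if any of attributes 5, 197 or 198 has a non-zero raw value (column 10 of
# the `smartctl -A` attribute table).
suspect_disks() {
    for dev in "$@"; do
        if "$SMARTCTL" -A "$dev" |
            awk '($1 == 5 || $1 == 197 || $1 == 198) && $10 + 0 > 0 { bad = 1 }
                 END { exit !bad }'
        then
            printf '%s\n' "$dev"
        fi
    done
}

# Usage (placeholders for the surviving array members, run as root):
#   suspect_disks /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
```

No output means none of the listed disks admits to bad sectors. It does not prove the rebuild will succeed, only that SMART has nothing to report yet.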

Personally, I get twitchy when any of my drives have bad sectors. I have smartd monitoring them, and I’ll attempt to RMA them as soon as a sector goes bad.

The joy that is SysRq

I’m constantly surprised when I come across long-time Linux users who don’t know about SysRq. The Linux Magic System Request Key Hacks are a magic set of commands that you can get the Linux kernel to follow no matter what’s going on (unless it has panicked or totally deadlocked).

Why is this useful? Well, there are many situations where you can’t shut a system down properly, but you need to reboot. Examples:

  • You’ve had a kernel OOPS, which is not quite a panic but there could be memory corruption in the kernel, things are getting pretty weird, and quite honestly you don’t want to be running in that condition for any longer than necessary.
  • You have reason to believe it won’t be able to shut down properly.
  • Your system is almost-locked-up (i.e. the above point)
  • Your UPS has about 10 seconds worth of power left
  • Something is on fire (lp0 possibly?)
  • …Insert other esoteric failure modes here…

In any of those situations, grab a console keyboard, and type Alt+SysRq+s (sync), Alt+SysRq+u (remount read-only), wait for it to have synced, and finally Alt+SysRq+b (reboot NOW!). If you don’t have a handy keyboard attached to said machine, or are on another continent, you can send the same commands from a root shell, one letter at a time:

# echo u > /proc/sysrq-trigger
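The full S, U, B sequence can be scripted the same way. This is only a sketch: the function names, the SYSRQ_TRIGGER and SYSRQ_WAIT variables, and the five-second pause are my own assumptions rather than anything the kernel prescribes.

```shell
# Overridable so the sequence can be rehearsed against a scratch file.
SYSRQ_TRIGGER=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}

# sysrq: send a single SysRq command letter to the kernel (needs root).
sysrq() {
    printf '%s\n' "$1" > "$SYSRQ_TRIGGER"
}

# emergency_reboot: hypothetical wrapper for the S, U, B sequence.
emergency_reboot() {
    sysrq s                      # sync all filesystems
    sysrq u                      # remount all filesystems read-only
    sleep "${SYSRQ_WAIT:-5}"     # assumed pause to let the sync finish
    sysrq b                      # reboot immediately, no further cleanup
}

# Usage, as root, only when you really mean it:
#   emergency_reboot
```

Check the kernel log for the sync and remount messages if you want confirmation before the final step. The fixed pause is a guess, not a guarantee.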

In my books, the useful SysRq commands are:

b
Reboot
f
Call the oom_killer
h
Display SysRq help
l
Print a kernel stacktrace
o
Power Off
r
Take your keyboard out of RAW mode (required after some X breakages)
s
Sync all filesystems
u
Remount all filesystems read-only
0-9
Change console logging level

In fact, read the rest of the SysRq documentation, print it out, and tape it above your bed. Next time you reach for the reset switch on a Linux box, stop yourself, type the S,U,B sequence, and watch your system come up as if nothing untoward has happened.

Update: I previously recommended U,S,B but after a bit of digging, I think S,U,B may be correct.

Videos up

Administrivia: Just uploaded a pile of videos:

Also, I changed a few feeds on CLUG Park from RSS to Atom, so sorry about any RSS-reader-spamming.

The word "Cruft"

According to the Media course I did back in first year, there’s a fine line between Jargon and Technical Terminology, jargon being more informal. Amongst geeks, technical terminology makes communication possible, and jargon makes our lives easier. To anyone else, it makes us sound like “geek”s, but we are generally ok with that.

When we say “webserver daemon” or “apache httpd” to each other, we mean “the piece of software running on the computer that gives web-pages to anyone who asks for them”. We don’t want to say things like that to each other (it’d take years to say anything), but we often need to break terms down like that when talking to people from other worlds (fields).

Technical computer-jargon isn’t the only type of jargon, either: there’s legalese jargon, mathematical jargon, medical jargon, literary jargon, etc.

Just like normal English, we all know a large enough subset of our jargon that we can understand each other even if we don’t know the odd specific term, but every now and then we have to refer to a dictionary.

Of course, to be able to speak amongst ourselves and to the world at large, we need to know where the lines are between technical terminology, jargon, and normal English. Sometimes, we get this wrong. A housemate called me on it today, asking what I meant when I was describing something as “crufty”. I was incredulous - was cruft not a standard, everyday word? Is it just me, or do other geeks regard “cruft” as being a normal English word? It hardly sounds technical; it sounds like something you’d find in the corner of a carpet, that we abused as a metaphor for unmaintained messy bits of computing.

A quick google (note “to google” is no longer jargon) took me to the Jargon file entry stating a very geeky etymology:

  1. Poorly built, possibly over-complex. The canonical example is “This is standard old crufty DEC software”. In fact, one fanciful theory of the origin of crufty holds that was originally a mutation of ‘crusty’ applied to DEC software so old that the ‘s’ characters were tall and skinny, looking more like ‘f’ characters.

This term is one of the oldest in the jargon and no one is sure of its etymology, but it is suggestive that there is a Cruft Hall at Harvard University which is part of the old physics building; it’s said to have been the physics department’s radar lab during WWII. To this day (early 1993) the windows appear to be full of random techno-junk. MIT or Lincoln Labs people may well have coined the term as a knock on the competition.

On a similar note, how many other people only discovered the word “segue” after hearing about the Segway Personal Transporter? (Hint: “segue” is pronounced “seg-way”) It certainly seems to have become a much more popular word to use since the launch of the Segway.