Is SMART Really Useful?

Having worked in technology for a long time, I have seen my fair share of disk failures. However, I have never seen a single instance where SMART issued enough warning to back up the data on a failing disk. The following is an example of this in action.

Toshiba MQ01ABD050

Here is a 2.5″ Toshiba MQ01ABD050 500GB disk drive. This unit was made in 2014, but has a very low power-on time of ~8 months, with the heads loaded onto the platters for only ~5 months of that, since it has been used to store offline files. This disk was working perfectly the last time it was plugged in a few weeks ago, but today, within seconds of starting to transfer data, it began slowing down, then stopped entirely. A quick look at the SMART stats showed over 4000 reallocated sectors, so a full scan was initiated.

SMART Test Failure

After the couple of hours an extended test takes, the firmware managed to find a total of 16,376 bad sectors, of which 10K+ were still pending reallocation. Just after the test finished, the disk began making the usual clicking sound of the head actuator losing lock on the servo tracks. Yet SMART was still insisting that the disk was OK! In total, about 3 hours passed between first power up & the disk failing entirely. This is possibly the most sudden failure of a disk I’ve seen so far, but SMART didn’t even twig from the huge number of sector reallocations that something was amiss. I don’t believe the platters are at fault here; it’s most likely to be either a head fault or preamp failure, as I don’t think platters can catastrophically fail this quickly. I expected SMART to at least flag that the drive was in a bad state once its self-test completed, but nope.
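Since the overall health verdict stayed at OK here, one option is to watch the raw sector counters directly instead. The following is only a minimal sketch: it assumes the report is the JSON that recent smartmontools releases can emit (something like `smartctl --json -A /dev/sdX`), that the attributes sit under `ata_smart_attributes.table`, and that the raw value is a plain count, which is not true of every vendor. The function name, the watched attribute list, and the sample numbers are all made up for illustration; they are not the readings from this disk.

```python
import json

# ATA attribute IDs commonly associated with sector health.
# Which attributes a given drive reports is vendor-specific (assumption).
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def sector_warnings(report, limit=0):
    """Return (name, raw_value) pairs for watched attributes above `limit`.

    `report` is a dict parsed from smartctl's JSON output. The overall
    pass/fail status is deliberately ignored.
    """
    table = report.get("ata_smart_attributes", {}).get("table", [])
    hits = []
    for attr in table:
        if attr.get("id") in WATCHED:
            raw = attr.get("raw", {}).get("value", 0)
            if raw > limit:
                hits.append((WATCHED[attr["id"]], raw))
    return hits


# Hypothetical report: the drive claims to be healthy while the counters say otherwise.
sample = json.loads("""
{
  "smart_status": {"passed": true},
  "ata_smart_attributes": {"table": [
    {"id": 5,   "name": "Reallocated_Sector_Ct",  "raw": {"value": 4000}},
    {"id": 9,   "name": "Power_On_Hours",         "raw": {"value": 5800}},
    {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 10000}}
  ]}
}
""")

for name, value in sector_warnings(sample):
    print(f"WARNING: {name} = {value}")
```

Run from a scheduled job, something along these lines would complain as soon as any sector is reallocated or pending, regardless of what the drive's own pass/fail flag says. It would not have bought much time with a failure as sudden as this one, though.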

Internals

After pulling the lid on this disk to see if there’s any evidence of a head crashing into a platter, there’s nothing – at least on a macroscopic scale, the single platter is pristine. I’ve seen disks crash to the point where the coating has been scrubbed from the platters so thoroughly that they’ve been returned to the glass discs they started off as, with the enclosure packed full of fine black powder that used to be the data layer, but there’s no indication of mechanical failure here. Electronic failure is looking very likely.

Clearly, relying on SMART to alert when a disk is about to take a dive is unwise; replacing drives after a set period is much better insurance if they are used for critical applications. Of course, current backups are always a good idea, no matter the age of the drive.
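A replace-after-a-set-period rule is easy to put into numbers. This is just a rough sketch of the arithmetic: it assumes the drive's power-on hours are available (typically SMART attribute 9), and the 36-month limit is a placeholder I've picked for illustration, not a recommendation.

```python
HOURS_PER_MONTH = 730  # 8760 hours in a year / 12


def powered_on_months(power_on_hours):
    """Convert a power-on hour count into months of continuous running."""
    return power_on_hours / HOURS_PER_MONTH


def due_for_replacement(power_on_hours, limit_months=36):
    """True once the drive has run for at least `limit_months` (hypothetical limit)."""
    return powered_on_months(power_on_hours) >= limit_months


# ~8 months of running, about the figure quoted for the Toshiba above.
print(powered_on_months(5840))      # 8.0
print(due_for_replacement(5840))    # False
```

Going by hours alone, the disk above would have been nowhere near any sensible replacement age, which is why the backups matter more than the policy.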

nb Tanya Louise – Gas Locker Corrosion Part 1 – Removing The Old Locker & Replacing the Deck Plate

Severe Corrosion

This is a part of the boat that hasn’t really had much TLC since we moved aboard, and finally it’s completely succumbed to corrosion, opening a rusty hole into the engine space below. I’ve already used a grinder to remove the rest of the locker – and even this had corroded to the point of failure all around the bottom just above the welds. The bulkhead forming the rear of the locker has also corroded fairly severely, so this will be getting cut out & replaced with a new piece of steel.
This was originally a 1/8″ plate, but now it’s as thin as foil in some places, with just the paint hiding the holes.

Replacement Steel

I’ve cut out as much of the corroded deck plate as possible – it’s supported underneath by many struts made of angle iron – and got the new 3mm replacement tacked in place with the MIG. I’ve not yet cut out the rotten section on the bulkhead; this will come after we’ve got the steel cut to replace it, as electrical distribution is behind this plate – I’d rather not have weather exposure to the electrical systems for long! Unfortunately more corrosion has shown itself around the edges of the old locker:

Thin Steel

Around the corner the steel has pretty much totally failed from corrosion coming from underneath – applying welding heat here has simply blown large holes in the steel as there’s nothing more than foil thickness to support anything.

Some more extensive deck replacement is going to be needed to fix this issue. More to come when the steel comes in!

16-Port SATA PCIe Card – Cooling Recap

It’s been 4 months since I did a rejig of my storage server, installing a new 16-port SATA HBA to support the disk drives. I mentioned the factory fan the card came with in my previous post, and I didn’t have many hopes of it surviving long.

Heatsink

The card’s heatsink has barely had enough time to accumulate any grime from the air & the fan has already failed!

There’s no temperature sensing or fan speed sensing on this card, so a failure here could go unnoticed, and under load without a fan the heatsink becomes hot enough to cause burns (there are a total of 5 large ICs underneath it). This would probably cause the HBA to overheat & fail rather quickly, especially when under a high I/O load, with no warning. In my case, the bearings in the fan failed, so the familiar noise of a knackered sleeve bearing fan alerted me to problems.

Replacement Fan

A replacement 80mm Delta fan has been attached to the heatsink in place of the dead fan, and this is plugged into a motherboard fan header, allowing sensing of the fan speed. The much greater airflow over the heatsink has dramatically reduced running temperatures. The original fan probably had its bearings cooked by the heat from the card, as its airflow capability was minimal.
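With the fan on a motherboard header, the speed can be polled so that a stalled fan doesn't go unnoticed. Here is a minimal sketch, assuming a Linux host where the kernel's hwmon driver exposes the header as `fan*_input` files under `/sys/class/hwmon`. Which file corresponds to which header depends on the board, and the 500 RPM threshold and function names are placeholders of my own.

```python
import glob
import os


def read_fan_speeds(root="/sys/class/hwmon"):
    """Map every fan*_input file found under `root` to its reported RPM."""
    speeds = {}
    pattern = os.path.join(root, "hwmon*", "fan*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as fh:
                speeds[path] = int(fh.read().strip())
        except (OSError, ValueError):
            continue  # Sensor unreadable or returned junk; skip it
    return speeds


def stalled_fans(speeds, min_rpm=500):
    """Return the sensor paths reporting fewer than `min_rpm` (hypothetical threshold)."""
    return sorted(path for path, rpm in speeds.items() if rpm < min_rpm)


if __name__ == "__main__":
    for path in stalled_fans(read_fan_speeds()):
        print("Fan below threshold:", path)
```

Hooked up to cron or a monitoring agent, this would at least flag a dead fan before the heatsink gets hot enough to cook the card.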

Fan Rear

Here’s the old fan removed from the heatsink. The back label, usually the place where I’d expect to find some specifications, has nothing but a red circle. This really is the cheapest crap that the manufacturer could have fitted, and considering this HBA isn’t exactly cheap, I’d expect better.

Bearings

Peeling off the back label reveals the back of the bearing housing, with the plastic retaining clip. There’s some sign of heat damage here: the oil has turned into gum, all the lighter fractions having evaporated off.

Rotor

The shaft doesn’t show any significant damage, but since the phosphor bronze bearing is softer, there is some dirt in here which is probably a mix of degraded oil & bearing material.

Stator & Bearing

There’s more gunge around the other end of the bearing & it’s been worn enough that side play can be felt with the shaft. After ~3000 hours of running, this fan is totally useless.