If you are in charge of Epicor hardware and have HP SSDs, you may want to have a quick read of this and forward it to your hardware / IT infrastructure guy/gal.
Long story short, some HP SSD drives will BRICK themselves after a certain number of running hours, and this is not RECOVERABLE (unless you pre-patch the firmware).
What a wonderful feature. It's sort of like having a car airbag that explodes in your face instead of saving you. It doesn't just lose your data; it completely bricks itself. The only way to make it better would be to give you the option of ransoming your device back for a nominal fee.
It was found by a hard drive company that does backup / restore / hard drive recovery. They have labs set up with a bunch of this stuff, hammering it and running it like crazy.
The only reason I know any of this is because I was listening to a podcast that mentioned it.
I assume most hard drives in servers and farms are turned on and then never turned off (save a few cumulative minutes during reboots), so this problem should have been appearing en masse among users.
And as their tech brief says, “In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously.”
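For a sense of scale: assuming the widely reported trigger point of 32,768 power-on hours (that's 2^15, which fits the rollover theory discussed further down), every drive in a batch hits the wall after the exact same amount of uptime. A quick back-of-the-envelope in C:

```c
#include <stdio.h>

int main(void) {
    /* 2^15 power-on hours: the widely reported trigger point. */
    const long hours = 32768L;
    long days  = hours / 24;   /* 1365 days */
    long rem_h = hours % 24;   /* 8 hours   */
    long years = days / 365;   /* 3 years   */
    long rem_d = days % 365;   /* 270 days  */
    printf("%ld hours = %ld days + %ld h (~%ld years, %ld days)\n",
           hours, days, rem_h, years, rem_d);
    return 0;
}
```

That works out to roughly 3 years and 270 days of continuous uptime, so a rack of drives installed on the same day really would drop dead within hours of each other.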
If it wasn't so blatantly obvious, I would almost think this was planned obsolescence (planned self-destruction). The shocking thing isn't even that there is an error in the firmware; the shocking thing is how terribly it handles the error state. Even if it runs for a trillion years, it shouldn't brick itself just because it encountered an unexpected error. These things should be able to handle being randomly powered off without being destroyed. Sure, your data might get some corruption, but the device should still work.
Heck, I can even see the firmware getting permanently stuck in the error state, but you should still be able to do a factory reset or a firmware update to fix it. I'm surprised the overall architecture of the firmware wasn't more robust.
(Don’t listen to me, I don’t know anything about embedded or low level programming)
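To illustrate the kind of robustness being argued for above, here's a toy C sketch (not HP's actual firmware, just an assumed pattern): a diagnostic counter that saturates at its ceiling instead of wrapping into a nonsense value, so an overflow can never cascade into a fatal state.

```c
#include <stdint.h>
#include <stdio.h>

/* Saturating increment: the counter pins at INT16_MAX forever
 * instead of wrapping around to a negative value. */
static int16_t increment_hours_saturating(int16_t hours) {
    return (hours < INT16_MAX) ? (int16_t)(hours + 1) : INT16_MAX;
}

int main(void) {
    int16_t hours = 32765;
    for (int i = 0; i < 5; i++) {
        hours = increment_hours_saturating(hours);
        printf("hours = %d\n", hours);  /* climbs to 32767 and stays */
    }
    return 0;
}
```

You lose accuracy past the cap, but the drive keeps serving data, which is the trade anyone would take.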
The fact that it is a parameter not typically used in the actual data (FAT, data sectors, etc.) probably means that it's part of some sort of real-time diagnostics that runs in the background and monitors things like the number of reads, writes, run-time, etc. And when that value (2^15, i.e. a 16-bit signed counter: 15 value bits plus one sign bit) rolled over, something went WAY wrong, perhaps marking all FLASH areas as having reached their write limit. And since the FAT is in FLASH too, it cannot be updated.
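To make that rollover concrete, here's a minimal C sketch of what a signed 16-bit hours counter does at the 32,768 mark (purely speculative; nobody outside HP knows what the real firmware does):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical power-on-hours counter, per the theory above. */
    int16_t power_on_hours = INT16_MAX;      /* 32,767 */

    /* One more hour ticks over. The addition happens in 'int' and
     * the result is converted back to int16_t; on ordinary
     * two's-complement targets that wraps to -32768. */
    power_on_hours = (int16_t)(power_on_hours + 1);

    printf("hours = %d\n", power_on_hours);  /* prints -32768 */

    /* Any health check treating a negative count as an impossible,
     * fatal fault would now take the drive offline for good. */
    if (power_on_hours < 0) {
        puts("diagnostic state corrupt -> enter permanent failure mode");
    }
    return 0;
}
```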
While it might brick the drive, I'd be surprised if the individual FLASH chips couldn't be removed, read, and the drive (somewhat) reconstructed, unless the "bricking" includes erasing the flash first.
The show I heard said that the drive, the memory itself, and the board become unusable.
Not sure to what extent, but apparently it is not recoverable at all.