Friday, August 26, 2016

Data loss postmortem

Flash Fail? by E V Peters from flickr (CC-NC)

I just lost a lot of data, and I'm extremely annoyed, to put it mildly.

Here's my backup setup:
  • OSX laptop as primary
  • Gaming Windows 7 box as secondary, with cygwin installed
  • (in the past I also had a few more boxes to which this system was extended)
  • status script automatically checks all boxes - every file or folder is inspected according to some set of rules:
    • system files are considered safe
    • all git repos are considered safe if they're pushed to master with no extra files
    • everything that's in Dropbox folder is treated as safe
    • for things too big for Dropbox there's a pair of backup drives - everything on them is considered safe as long as both drives contain the same files (for obvious performance reasons I'm only checking directory listings, not TBs of content)
    • symlinks pointing to safe locations are safe
    • there's a whitelist of locations to ignore, for various low value data, applications' folders with nothing I care about etc.
    • everything else is automatically flagged as TODO
  • to prevent data loss in shell, rm command is aliased away (safe trash is used), mv and cp are aliased to -i to prevent accidental overwriting, and I'm very strict about always using >> and never under any circumstances > in shell redirects
  • Dropbox offers 30 day undelete, so anything deleted locally can still be recovered
  • and just to be super extra sure, various cloud contents are snapshotted every now and then to backup drives; list of installed software is snapshotted etc.
  • phones, tablets etc. all sync everything with the cloud, and contain nothing valuable locally
  • MP3 player and Kindle are mirrored on Dropbox, and synchronized automatically by scripts whenever they're connected
This system is really good at dealing with hardware failures, system reinstalls, and random human error. All files are protected from single failure, and in some cases from multiple failures.
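
To make the rules above a bit more concrete, here's a minimal Python sketch of that kind of check. It is not the actual status script: the names `classify`, `listing` and `is_under`, the three verdicts, and the way locations are passed in are all assumptions made for illustration, and the system-file and git rules are left out to keep it short.

```python
from pathlib import Path


def listing(root: Path) -> set[str]:
    """Relative paths of everything under root (names only, no content)."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*")}


def is_under(path: Path, root: Path) -> bool:
    """True if path is root itself or lives somewhere inside it."""
    return path == root or root in path.parents


def classify(path: Path, dropbox: Path, mirrors: tuple[Path, Path],
             whitelist: list[Path]) -> str:
    """Return 'ignored', 'safe' or 'TODO' for one file or folder."""
    # Whitelisted locations are skipped entirely (this is the risky rule).
    if any(is_under(path, w) for w in whitelist):
        return "ignored"
    # A symlink is only as safe as the location it points to.
    if path.is_symlink():
        verdict = classify(path.resolve(), dropbox, mirrors, whitelist)
        return "safe" if verdict == "safe" else "TODO"
    # Anything inside the Dropbox folder is treated as safe.
    if is_under(path, dropbox):
        return "safe"
    # Backup drives count only if both hold the same directory listing.
    if any(is_under(path, m) for m in mirrors):
        first, second = mirrors
        return "safe" if listing(first) == listing(second) else "TODO"
    # Everything else needs a human decision.
    return "TODO"
```

The order of checks matters here: because the whitelist comes first, anything under a whitelisted folder is never looked at again, no matter how valuable it becomes later.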

Unfortunately there are two huge holes in the system:
  • configuration which doesn't live in user-accessible files - like /etc on OSX, Windows registry etc. This is less of an issue nowadays than it used to be.
  • the manually created whitelist of locations to ignore. You can guess where this leads.
It also offers limited protection from any kind of hacking or ransomware attack, but in any realistic threat model they're not terribly important.

Video Games

For casual gamers it's enough to just install games with Steam or whatever, and enjoy.

This unfortunately is absolutely unacceptable if you're into any kind of serious gaming. Steam autoupdates both game and all its mods, with no way to roll back, so if you had any kind of long running campaign, it will get wrecked.

As far as I can tell, that's what caused the death of my Let's Play Civilization V as Germany series - I probably mindlessly pressed buttons to update 3UC/4UC mods, and that resulted in unfixable save game corruption.

So to protect against this, if possible I'm not playing using Steam - instead I install every version to separate folder. All versions of same game unfortunately share same user data folders, so if I ever want to go back I need to do some folder reshuffling, but as long as I don't run that game in Steam, mods won't get overwritten by newer versions, so I can safely play even campaign that takes months.

And I'm perfectly aware that for Paradox games it's possible to revert to previous versions as betas, but that does absolutely nothing whatsoever to deal with mods irreversibly autoupdating without my consent. In HOI4 (and apparently Stellaris, but I never played that) it's even worse, as mods are saved deep in Steam user data, so I had to write a script just to have a mod folder I can safely back up.

Now here's where first part of the problem begins - I added all folders with save games to the whitelist. This is mostly reasonable, as I don't need long term backups of them, and if I lose saves from campaigns I already finished, it's no big deal.

Unfortunately whitelist has no good way to tell them apart from saves (and mod folder) for any ongoing campaigns, so here's failure number one.

Uninstallers

I've noticed that I had way too many old versions of various games installed, so I decided to clean them up - there's zero risk in deleting installed applications, so it was a routine thoughtless operation.

While I was uninstalling some old version of Crusader Kings 2, yet another confirmation popup appeared, which I automatically answered with a yes, and then it deleted my whole user directory with all my saves and everything else.

This is unacceptable UX on so many levels:
  • surprise popups should never ask to delete user data - deletion should either never happen, or be a checkbox the user must explicitly tick
  • if you ever actually delete user data, use system trash. It is completely unacceptable to use hard delete like it's 1980s and we learned nothing in last 30 years of computing.
If your software does it, just stop writing software until you learn better, because you're causing more harm than good.

So we had 3 failures in a row (one my fault, the other two the fault of whoever wrote that uninstaller), but that was still sort of recoverable with the undelete process which has existed since the days of DOS.

I downloaded some software for it - the first one was bait and switch bullshit which would display files it found, but wouldn't actually recover anything. If you write that kind of software, please just kill yourself, there's no hope for you.

The second one I found was legitimate recovery software. It recovered the files to a second drive, so I thought the 4th level of protection worked... and unfortunately they were all filled with zeroes. That confused me, but then I noticed that it was all on an SSD and the TRIM command was indeed enabled, which completes the explanation.

Next actions

Historical saves from my past campaigns were nice to have for testing some tools, but I don't care about them terribly much. Recreating settings and the mod folder from scratch will take maybe an hour, as the folder contained a mix of mods from Steam Workshop, mods downloaded separately, and my own. Annoying, but not a big deal.

What I lost were mostly saves for my ongoing Let's Play CK2 as Islamic State [Modern Times mod] campaign I've been playing on twitch. It got up to the point where the first caliph died, and his underage son inherited Islamic State. It was still quite fun, and I have all the video saved, so I'm going to upload that to youtube soon enough - and in the meantime all 3 sessions are available on twitch.

Even after this loss, I still have 22GB of save files in my folder. If this was OSX, I could just move them to Dropbox and symlink back (size to value ratio is not great, but often doing this brute force is good enough), but that's not terribly reliable on Windows, so I'll probably just delete old ones manually, remove save folders from the whitelist and instead tell the script to copy them all over to Dropbox.
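
The "copy them all over to Dropbox" part could look roughly like the Python sketch below. It's only an illustration under assumptions: `backup_saves`, the label-to-folder mapping and any paths passed to it are hypothetical names, not part of my existing script.

```python
import shutil
from pathlib import Path


def backup_saves(save_dirs: dict[str, Path], backup_root: Path) -> list[Path]:
    """Copy each save folder into backup_root/<label>; return the copies made."""
    copied = []
    for label, src in save_dirs.items():
        if not src.is_dir():
            continue  # game not installed on this box, nothing to copy
        dest = backup_root / label
        # dirs_exist_ok (Python 3.8+) lets repeated runs refresh an existing
        # copy; files deleted from src are deliberately left alone in dest.
        shutil.copytree(src, dest, dirs_exist_ok=True)
        copied.append(dest)
    return copied
```

Since this only ever adds or overwrites files in the destination, a save wiped locally (say, by an overeager uninstaller) still survives in the copy until I clean it up by hand.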

The upside is that this is the biggest data loss I've had in something like 10 years. The only other incident was losing about two days' worth of git commits to one repository I apparently forgot to push before formatting an old laptop, which also annoyed me greatly.

Two incidents in a decade is pretty much nothing compared to the massive data losses I suffered twice before that (to hardware failure), which made me as anal about backups as you can see.
