Monday, August 27, 2012

Script to convert Google+ Takeout into a single easy-to-use document

Google+ did many things wrong, like their wrongheaded and discriminatory real-name policy, but one surprising thing they did right, and that almost everybody else gets wrong, was making it easy to export all your data using Google Takeout.

Unfortunately Google+ posts from Takeout (and pretty much everything else from Takeout) are pretty hard to use directly, but we're all hackers, so it's not a big deal to reformat them, and at least this one time it doesn't involve breaking any Terms of Service or working around any rate limiters, captchas, and other such nonsense just to get your own data.

I wrote a script to process a Takeout archive into a single easy-to-search HTML document. Since it's pretty short, I put it in my unix-utilities repository on GitHub (the one I wrote about earlier) instead of making a new repository for it.

It's very easy to use (Stream/ is the directory the posts are packed in inside the Takeout .zip):
process_gplus_takeout Stream/ output.html
It removes everything except actual content and attachments, and sorts entries by date. If you want to include different things or filter them, it should be pretty easy to modify the script.

It's even a reasonable example of how to use Hpricot to mass-process a lot of HTML documents if that's a new thing to you.

About the only hard part is arranging the computation so that it doesn't load the DOM of every single HTML file into memory simultaneously, but instead extracts the data from the files one by one and frees each DOM in between. It probably doesn't even matter in this case, since it's just a few MB of HTML, so even all the DOMs would fit in memory together, but it's good practice in general.
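
To make the one-file-at-a-time idea concrete, here is a minimal sketch of the pattern, not the actual script. To stay dependency-free it uses REXML from Ruby's standard distribution instead of Hpricot, so it only copes with well-formed markup. The Entry struct, the method names, and the span.date / div.content selectors are all made up for illustration; the real Takeout markup may look quite different.

```ruby
require "rexml/document"
require "time"

# Hypothetical record: only small, plain values survive once a file is done.
Entry = Struct.new(:date, :html)

# Parse one file and return plain data. The DOM is a local variable, so it
# becomes eligible for garbage collection as soon as this method returns.
def extract_entry(path)
  doc = REXML::Document.new(File.read(path))
  # Assumed markup (not taken from Takeout): <span class="date"> and <div class="content">.
  date = REXML::XPath.first(doc, "//span[@class='date']")
  body = REXML::XPath.first(doc, "//div[@class='content']")
  return nil unless date && body
  Entry.new(Time.parse(date.text), body.to_s)
end

# Walk the directory one file at a time, keep only the extracted entries,
# and sort them by date at the end.
def collect_entries(dir)
  Dir.glob(File.join(dir, "*.html")).sort
     .map { |path| extract_entry(path) }
     .compact
     .sort_by(&:date)
end
```

The point of the design is that collect_entries never holds more than one parsed document at a time; what accumulates is just the list of small Entry values, which can then be written out into a single HTML file.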

