The best kittens, technology, and video games blog in the world.

Saturday, October 31, 2015

Music scrobbling and regular expression bias

May 22, 2012 - Music of the Spheres by guidedbycthulhu from flickr (CC-NC)
Once upon a time I used an iPod, and I even wrote a last.fm scrobbler for it. Of course no hardware lasts forever, and these days I much prefer the Sansa Clip, which unfortunately doesn't record enough data to write a similar scrobbler.

Anyway, physical devices are really mostly for outdoor use, and I got most of my music from Streamus before it shared the fate of most music services going as far back as AudioGalaxy: being outright destroyed because of copyright fascism. The only good site that hasn't been outright shut down so far is Pandora Radio, which was instead locked to US IP addresses and requires far too much messing with VPNs. From London, that fate is not much different.

So now I'm mostly using YouTube as a music site. It has this mix feature which was dreadful initially, but after some training it got fairly decent. And apparently there's even a Chrome extension which records what I listen to on last.fm. Except of course YouTube doesn't have ID3 tags, so it just uses a regular expression, apparently a very simple one.

I'm mildly annoyed by this. It's fun to record this kind of personal data, and it's fine if it's a random sample with some of it missing, but this has a massive bias. The most common title format for YouTube music videos is "artist - title", and those get recorded, but some use "artist: title" or "title - artist" or "title by artist" or some other format, and currently all of those either don't get recorded or get misrecorded.
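To make the formats above concrete, here's a minimal sketch of what a slightly less simple parser could look like. This is not the extension's actual code; the patterns and the `parse_title` name are my own assumptions. Note that it can't tell "title - artist" from "artist - title", since both look identical to a regexp.

```python
import re

# Hypothetical patterns, tried in order. Each one captures "artist" and "title".
# The dash pattern goes first so "Artist - Stand by Me" isn't split on "by".
PATTERNS = [
    # "artist - title" (hyphen, en dash, or em dash surrounded by spaces)
    re.compile(r'^(?P<artist>.+?)\s+[-\u2013\u2014]\s+(?P<title>.+)$'),
    # "artist: title"
    re.compile(r'^(?P<artist>[^:]+?):\s+(?P<title>.+)$'),
    # "title by artist"
    re.compile(r'^(?P<title>.+?)\s+by\s+(?P<artist>.+)$', re.IGNORECASE),
]

def parse_title(video_title):
    """Return (artist, title), or None when no pattern matches."""
    text = video_title.strip()
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("artist").strip(), match.group("title").strip()
    return None
```

With this, "Taylor Davis: Let It Go" and "Let It Go by Taylor Davis" both come out as artist "Taylor Davis", title "Let It Go", while a title with none of the separators returns None.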

For example, during the last hour I apparently listened to a song titled "Taylor Davis" by such diverse artists as:
  • Circle of Life on Violin (The Lion King)
  • Let It Go (Disney's Frozen)
  • Doctor Who Theme (Violins)
  • Bolero of Fire (From Zelda OoT) — Violin
  • Duel Of The Fates (From Star Wars) Violin
This unfortunately can't really be solved by a better regexp, but last.fm has a crazy big database of artists and songs, so maybe it could just detect that and flip them back to the right order? It really just needs an API for unstructured titles, in addition to its existing artist/title API, as the client can't really do that on its own.
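The flipping idea could be sketched roughly like this. The `KNOWN_ARTISTS` set is a tiny stand-in I made up for last.fm's real database, and `orient` is a hypothetical helper, not an existing last.fm API.

```python
# Stand-in for a real artist database (assumption: lookups are by lowercased name).
KNOWN_ARTISTS = {"taylor davis"}

def orient(left, right, known_artists=KNOWN_ARTISTS):
    """Given the two halves of "left - right", guess (artist, title)."""
    left, right = left.strip(), right.strip()
    left_known = left.lower() in known_artists
    right_known = right.lower() in known_artists
    if right_known and not left_known:
        # Looks like "title - artist", so flip it.
        return right, left
    # Default: assume the common "artist - title" order.
    return left, right
```

So `orient("Let It Go (Disney's Frozen)", "Taylor Davis")` returns the pair with "Taylor Davis" as the artist, and anything the database doesn't recognize is left in its original order.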

There's also a function to enter artist/album when the scrobbler can't guess, but it should seriously at least remember what I entered if I listen to the same music again (or even better, have some database so that if some other person corrected a music title, it just uses that). And seriously, it needs to support Cmd-V to paste the title. It's totally silly that the mini-dialog closes instead of pasting if I press Cmd-V.

And of course there's a lot of other music which doesn't get detected as music at all. So far I haven't seen any cases of non-music getting registered as music - which is probably the wrong direction to err, as it's much easier to correct extra data with the delete button than to add data manually.

Well, it seems they released it relatively recently, so they might fix it at some point. For now I'm going to be mildly annoyed that the data it records about me is biased (probably in a pro-mainstream way, as happens with most bias).

On the off chance this interests anybody else, here's the list.

By the way, should something like "Taking the Hobbits to Isengard - 10 HOURS" count as one song in the data? I'd think it should count as a lot of separate songs, based on how much one listened to it. This is somewhat niche, but it's just another example of how the data recording process has a pro-mainstream bias.
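One way to count it, as a rough sketch: divide the time actually listened by the length of the underlying loop. The scrobbler would need to know the loop length somehow, so `loop_seconds` here is purely an assumed input, and `equivalent_plays` is a name I made up.

```python
def equivalent_plays(seconds_listened, loop_seconds):
    """Count how many full repetitions of the loop fit in the listening time."""
    if loop_seconds <= 0:
        raise ValueError("loop_seconds must be positive")
    # Only complete repetitions count; a partial one is dropped.
    return int(seconds_listened // loop_seconds)
```

For instance, half an hour of listening to a hypothetical 30-second loop would count as 60 plays instead of one.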
