The best kittens, technology, and video games blog in the world.

Monday, October 10, 2016

Architecture of search engine

There's nothing revolutionary here - it's a reasonably straightforward design. All the code discussed can be seen in its repository.

What are we searching?

We're searching "cards", but that's not specific enough, as there are two big complications:
  • Some cards have multiple printings
  • Some cards have multiple parts
So we need to decide if our searchable object is:
  • one part, one printing
  • one part, all printings
  • all parts, one printing
  • all parts, all printings
Far // Away is an example of a multipart card

The vast majority of cards have only one part, and a lot of the time a card's properties are the same between printings, so we could get away with other choices, or even inconsistent semantics, but I've chosen to operate on the most granular option: "a single part of a single printing of a card".


Indexer

Raw data comes mostly from mtgjson, but the first step is converting it into a more usable form - which is actually just more json. I keep both raw and processed json files in the repository, which mildly annoys github, but in this case it's best practice - the tests can't run without them, so the code is incomplete without those jsons.

The indexer does some transformations and validations, and groups data into properties which should be the same between printings of a card (like name, text, cost, color) and those which can vary between printings (like artist, rarity, flavor text).
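As a sketch of the kind of grouping the indexer does - the property names here are illustrative, not the actual schema:

```ruby
require "set"

# Hypothetical sketch: split raw printing records into card-level
# properties (shared by all printings) and printing-level ones.
CARD_PROPERTIES     = %i[name text cost color]
PRINTING_PROPERTIES = %i[artist rarity flavor set_code]

def group_printings(raw_printings)
  raw_printings.group_by { |p| p[:name] }.map do |name, printings|
    card = printings.first.slice(*CARD_PROPERTIES)
    # Validate that card-level properties really are identical everywhere
    printings.each do |p|
      raise "Inconsistent data for #{name}" unless p.slice(*CARD_PROPERTIES) == card
    end
    card.merge(printings: printings.map { |p| p.slice(*PRINTING_PROPERTIES) })
  end
end
```

The validation step is where deteriorating upstream data would get caught.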

When I started the project mtgjson data was of high quality, but it got a lot worse after it got new maintainers, so now I need to do a lot more fixing (and reporting those bugs upstream achieved nothing).

Every now and then I feel like I should just fork mtgjson and fix all those issues, but it's a pile of javascript and duct tape, so I mostly hope they get their act together again.

Because the indexer started as a very simple script and accumulated hacks over time (mostly due to deteriorating mtgjson quality), it's the ugliest part of the codebase. It has no dedicated tests, but as the whole test suite operates on real data, any bug in the indexer or mtgjson is likely to be picked up by tests down the line.

Loading data

The library loads data from the processed json files and transforms it into an in-memory representation as plain Ruby objects. There's no dedicated database or anything like that - we're doing everything directly in memory.
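A minimal sketch of what that loading might look like, assuming a hypothetical json layout (the actual files in the repository are structured differently):

```ruby
require "json"

# Hypothetical in-memory representation - plain Ruby objects, no database
Card     = Struct.new(:name, :printings)
Printing = Struct.new(:card, :set_code, :artist, :rarity)

def load_database(path)
  JSON.parse(File.read(path)).map do |card_data|
    card = Card.new(card_data["name"], [])
    card_data["printings"].each do |p|
      # Printings keep a back-reference to their card
      card.printings << Printing.new(card, p["set_code"], p["artist"], p["rarity"])
    end
    card
  end
end
```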

There are currently 16686 cards and 31648 printings, so it's not a tiny dataset, but it's also not so huge that this fairly naive approach can't work.

Query Parser

Queries are parsed with StringScanner - one of the least known parts of the Ruby standard library - generating a Query object.
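A minimal sketch of StringScanner-based tokenization - the real parser builds a tree of Condition objects rather than a flat token list, and supports far more syntax:

```ruby
require "strscan"

# Illustrative tokenizer: operators like e:m10, quoted phrases, bare words
def tokenize(query)
  scanner = StringScanner.new(query)
  tokens = []
  until scanner.eos?
    if scanner.scan(/\s+/)
      # skip whitespace between tokens
    elsif scanner.scan(/(\w+):(\S+)/)
      tokens << [:op, scanner[1], scanner[2]]   # e.g. e:m10, sort:name
    elsif scanner.scan(/"([^"]*)"/)
      tokens << [:word, scanner[1]]             # quoted phrase
    elsif scanner.scan(/\S+/)
      tokens << [:word, scanner.matched]        # bare word
    end
  end
  tokens
end
```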

The Query object contains a tree of Condition objects corresponding to the query. In addition to searching, the Query object is also responsible for sorting results (by name, most recent, etc., with the sort: operator) and for time travel.

Yes, time travel - I occasionally find it useful to search for results as if I were searching on a particular date in the past - cards not printed by that date are excluded from the results, and format legalities are adjusted accordingly. This isn't literally going into the past, as we're not trying to do anything like reverting Oracle text to its old wording (which wouldn't be all that helpful), but it's how you can search for what to play when you somehow get to play flashback Innistrad Standard.

The time: operator works on the query level, so you can't do cute things like asking for cards legal in both Ravnica Standard and Return to Ravnica Standard.
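The time-travel filtering can be sketched like this, assuming printings carry a release date (hypothetical structure, not the actual implementation):

```ruby
require "date"

# Hypothetical printing record with a release date
Printing = Struct.new(:name, :release_date)

# time: restricts results to printings released on or before the given date
def printings_as_of(printings, time)
  printings.select { |p| p.release_date <= time }
end
```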


Condition objects must implement a #search(db) method which takes the database and returns a Set of matching card printings.

This probably seems like a weird interface. Most conditions (those returning true when asked #simple?) also implement a secondary interface, #match?(card_printing), which returns true or false and avoids allocating intermediate Sets.

ConditionAnd, ConditionOr, and ConditionNot use one or the other API depending on the type of their subconditions.

The reason for this design is that some subqueries ask about another part of the card, or another printing. For example, you can check for cards printed in both M10 (Magic 2010) and M11 (Magic 2011) with e:m10 alt:e:m11. Doing a recursive search and checking all subconditions would be exponentially slow in the worst case, while this design is always linear in query size.
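The two interfaces and the dispatch between them might be sketched like this (class and attribute names are illustrative, not the actual ones from the repository):

```ruby
require "set"

Printing = Struct.new(:name, :color)

# A simple condition: supports the allocation-free #match? fast path
class ConditionColor
  def initialize(color)
    @color = color
  end

  def simple?
    true
  end

  def match?(printing)
    printing.color == @color
  end

  def search(db)
    db.select { |p| match?(p) }.to_set
  end
end

class ConditionAnd
  def initialize(*conditions)
    @conditions = conditions
  end

  def simple?
    @conditions.all?(&:simple?)
  end

  def match?(printing)
    @conditions.all? { |c| c.match?(printing) }
  end

  def search(db)
    # Complex subconditions go through #search and their Sets get intersected;
    # simple ones just filter with #match?, avoiding intermediate Sets
    complex, simple = @conditions.partition { |c| !c.simple? }
    base = complex.empty? ? db.to_set : complex.map { |c| c.search(db) }.reduce(:&)
    base.select { |p| simple.all? { |c| c.match?(p) } }.to_set
  end
end
```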

Searches for edition (e:) and block (b:) always return a small number of results, and we already have to maintain set data anyway, so we reuse it as an index and return small result Sets from it. Other queries don't use any indexing and just check cards one by one.
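A sketch of reusing per-set data as an index, with a hypothetical Database class:

```ruby
require "set"

# Illustrative database: e: lookups hit a Hash index, everything else scans
class Database
  def initialize(printings)
    @printings = printings
    @by_set = Hash.new { |h, k| h[k] = Set.new }
    printings.each { |p| @by_set[p[:set_code]] << p }
  end

  # e: lookup - cheap Hash access returning a small Set
  def printings_in_set(code)
    @by_set[code]
  end

  # other conditions check printings one by one
  def scan
    @printings.select { |p| yield p }.to_set
  end
end
```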

For most conditions either all printings of a card match a query or none do, so a special case for conditions which don't deal with anything printing-specific could potentially speed up a lot of queries. On the other hand, while some cards have a crazy number of printings, the average is just 1.9 printings/card, so it's hard to tell if such savings would be worth the added complexity.

Of course more indexes and a more sophisticated query optimizer would be possible - or even using some dedicated search engine like Solr. That didn't seem necessary due to the relatively small amount of data. Fortunately the search engine has very extensive tests, so if I ever feel like developing one, it would be reasonably easy to do so safely.

Spelling correction

Many Magic cards have names which are very easy to misspell.
How do I spell this exactly?
The QueryParser deals with it - if a search returns zero results, it sets the fuzzy flag on Conditions and retries the whole search. A few conditions will autocorrect your query, and any such autocorrections are logged and displayed with the search results.

This two-step process (as well as diacritics stripping and a very limited amount of stemming) deals with most misspelled searches with good usability, while keeping false positives to a very low level.
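A sketch of the two-step fallback, with a hypothetical normalization helper (stripping diacritics and punctuation) standing in for the real autocorrection logic:

```ruby
# Strip diacritics, case, and punctuation for fuzzy comparison
def normalize(name)
  name.unicode_normalize(:nfd)
      .gsub(/\p{Mn}/, "")       # drop combining marks left by NFD
      .downcase
      .gsub(/[^a-z0-9 ]/, "")   # drop punctuation like ' and -
end

# First pass exact, second pass fuzzy; returns [results, fuzzy_used?]
def find_cards(names, query)
  exact = names.select { |n| n == query }
  return [exact, false] unless exact.empty?
  # fuzzy pass - corrections would be logged and shown with the results
  [names.select { |n| normalize(n) == normalize(query) }, true]
end
```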

Results coalescence

At this point we have a sorted list of printings, which is not quite what we want to display - all printings of the same card need to be merged (which is just .group_by(&:name)) - and since we'll be displaying the card picture from a specific printing, we also want to choose the most appropriate one.

The algorithm for that:
  • choose only from matching printings
  • prefer those with a card picture (a small number of rare promos don't have pictures) - that's a frontend consideration the database has no idea about
  • otherwise use order returned by search engine, which is:
    • whichever order was explicitly requested if any
    • name in alphabetical order
    • Standard-legal sets first
    • most recent first
After that we paginate the results.
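The coalescing step might be sketched like this (attributes hypothetical; the real preference order also accounts for the requested sort, Standard legality, and recency as described above):

```ruby
Printing = Struct.new(:name, :set_code, :has_picture)

# Merge printings of the same card, picking the best one for display:
# input is already in search-engine order, so the first printing with
# a picture wins, falling back to the first printing overall
def coalesce(sorted_printings)
  sorted_printings.group_by(&:name).map do |name, printings|
    chosen = printings.find(&:has_picture) || printings.first
    [name, chosen]
  end
end
```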

Future directions

The most common requests are:
  • improving website performance
  • adding card pricing information via some third party API
  • advanced search form
The advanced search form is a UI problem for which I don't have a good solution yet, as all advanced search forms are atrocious, leave most possible options out, or both.

The third party API for pricing ended up being more complex than I expected, but I'll probably get there at some point.

Performance complaints probably mean that the current architecture could use some improvements after all. It seems fast enough to me, but possibly that's not good enough.
