There's nothing revolutionary here; it's a reasonably straightforward design. All the code discussed can be seen in its repository.
What are we searching?

We're searching "cards", but that's not specific enough, as there are two big complications:
- Some cards have multiple printings
- Some cards have multiple parts

This gives four possible granularities for a search result:

- one part, one printing
- one part, all printings
- all parts, one printing
- all parts, all printings
The vast majority of cards have only one part, and a card's properties are often the same between printings, so we could get away with other choices, or even inconsistent semantics, but I've chosen to operate on the most granular unit: a single part of a single printing of a card.
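To make that choice concrete, here's a minimal sketch of the data model; the class names, attributes, and sample values are hypothetical stand-ins, not the actual mtg.wtf code:

```ruby
# Hypothetical sketch: one Card holds properties shared by all printings,
# and each CardPrinting is the granular unit that searches operate on.
Card = Struct.new(:name, :text, :cost)
CardPrinting = Struct.new(:card, :set_code, :artist, :rarity) do
  def name
    card.name
  end
end

bolt = Card.new("Lightning Bolt", "Deal 3 damage.", "{R}")
printings = [
  CardPrinting.new(bolt, "m10", "Some Artist", "common"),
  CardPrinting.new(bolt, "m11", "Some Artist", "common"),
]

# A search result is a list of printings, not of cards.
puts printings.map { |p| "#{p.name} [#{p.set_code}]" }
```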
Indexer

Raw data comes mostly from mtgjson, but the first step is converting it into a more usable form, which is actually just more JSON. I keep both raw and processed JSON files in the repository, which mildly annoys GitHub, but in this case it's best practice: the tests can't run without that data, so the code is incomplete without those JSON files.
The indexer does some transformations and validations, and groups data into properties which should be the same between printings of a card (like name, text, cost, and color) and those which can vary (like artist, rarity, and flavor text).
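The grouping step might look roughly like this; the field names, input shape, and output layout are assumptions for illustration, not the actual processed format:

```ruby
require "json"

# Hypothetical split between card-level and printing-level properties.
CARD_FIELDS     = %w[name text cost color]
PRINTING_FIELDS = %w[artist rarity set]

raw = [
  { "name" => "Shock", "text" => "Deal 2 damage.", "cost" => "{R}",
    "color" => "red", "artist" => "A", "rarity" => "common", "set" => "m10" },
  { "name" => "Shock", "text" => "Deal 2 damage.", "cost" => "{R}",
    "color" => "red", "artist" => "B", "rarity" => "common", "set" => "m11" },
]

# Rows with identical card-level values collapse into one card entry
# carrying a list of per-printing variations.
cards = raw.group_by { |row| row.values_at(*CARD_FIELDS) }.map do |shared, rows|
  {
    "card"      => CARD_FIELDS.zip(shared).to_h,
    "printings" => rows.map { |r| r.slice(*PRINTING_FIELDS) },
  }
end

puts JSON.pretty_generate(cards)
```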
When I started the project, mtgjson data was of high quality, but it has gotten a lot worse since it changed maintainers, so now I need to do a lot more fixing (and reporting those bugs upstream accomplished nothing).
Because the indexer started as a very simple script and accumulated hacks over time (mostly due to deteriorating mtgjson quality), it's the ugliest part of the codebase. It has no dedicated tests, but since the whole test suite operates on real data, any bug in the indexer or in mtgjson is likely to be picked up by tests down the line.
Loading data

The library loads data from the processed JSON files and transforms it into an in-memory representation of Ruby objects. There's no dedicated database or anything like that; we're doing everything directly.
There are currently 16686 cards and 31648 printings, so it's not a tiny dataset, but it's also not unreasonably huge, so this fairly naive approach works.
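A sketch of that loading step, assuming a simple JSON array of printings; the real file layout and class names will differ:

```ruby
require "json"
require "tempfile"

# Assumed shape of one processed record; the real format differs.
CardPrinting = Struct.new(:name, :set_code, :rarity, keyword_init: true)

# Stand-in for the processed JSON file shipped in the repository.
file = Tempfile.new(["cards", ".json"])
file.write(JSON.generate([
  { "name" => "Shock", "set_code" => "m10", "rarity" => "common" },
]))
file.close

# Load everything into plain Ruby objects; no database in between.
printings = JSON.parse(File.read(file.path)).map do |row|
  CardPrinting.new(name: row["name"], set_code: row["set_code"],
                   rarity: row["rarity"])
end
puts printings.first.name
```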
Query Parser

We parse queries with StringScanner, one of the least known parts of the Ruby standard library, generating a Query object.
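For readers unfamiliar with StringScanner, here's a tiny tokenizer in its style; the token shapes are made up for illustration, and the real parser builds a full Query object rather than a flat token list:

```ruby
require "strscan"

# Minimal sketch of query tokenization with StringScanner.
def tokenize(query)
  s = StringScanner.new(query)
  tokens = []
  until s.eos?
    s.skip(/\s+/)              # ignore whitespace between tokens
    break if s.eos?
    if s.scan(/(\w+):/)        # operator prefix like e: or t:
      tokens << [:op, s[1]]
    elsif s.scan(/"([^"]*)"/)  # quoted phrase
      tokens << [:phrase, s[1]]
    else
      tokens << [:word, s.scan(/\S+/)]
    end
  end
  tokens
end

p tokenize(%[e:m10 "lightning bolt"])
# => [[:op, "e"], [:word, "m10"], [:phrase, "lightning bolt"]]
```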
Yes, time travel: I occasionally find it useful to search for results as if I were searching at a particular date in the past, so cards not printed by that time don't show up in search results, and format legalities change accordingly. This isn't literally going into the past, as we're not trying to do anything like reverting Oracle text to its old wording (which wouldn't be all that helpful), but it's how you can search for what to play when you somehow get to play flashback Innistrad Standard.
The time: operator works at query level, so you can't do cute things like asking for cards legal in both Ravnica Standard and Return to Ravnica Standard.
Condition

Condition objects must implement a #search(db) method which takes the database and returns a Set of matching card printings.
This probably seems like a weird interface. Most conditions (those returning true when asked #simple?) also implement a secondary interface, #match?(card_printing), which returns true or false and avoids allocating intermediate Sets.
ConditionAnd, ConditionOr, and ConditionNot use one API or the other depending on the type of their subconditions.
The reason for this design is that some subqueries ask about another part of the card, or another printing. For example, you can find cards printed in both M10 (Magic 2010) and M11 (Magic 2011) with e:m10 alt:e:m11. Doing a recursive search and checking all subconditions would be exponentially slow in the worst case, while this design is always linear in query size.
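A simplified sketch of this two-level interface; the classes below are stand-ins for illustration, not the real mtg.wtf code:

```ruby
require "set"

Printing = Struct.new(:name, :set_code)
DB = Struct.new(:printings)

# A simple condition: answers #match? per printing, and #search via it.
class ConditionWord
  def initialize(word)
    @word = word.downcase
  end

  def simple?
    true
  end

  def match?(printing)
    printing.name.downcase.include?(@word)
  end

  def search(db)
    db.printings.select { |p| match?(p) }.to_set
  end
end

# AND picks the API based on its subconditions.
class ConditionAnd
  def initialize(*conds)
    @conds = conds
  end

  def search(db)
    if @conds.all?(&:simple?)
      # Fast path: test printings one by one, no intermediate Sets.
      db.printings.select { |p| @conds.all? { |c| c.match?(p) } }.to_set
    else
      # General path: intersect the subconditions' result Sets.
      @conds.map { |c| c.search(db) }.reduce(:&)
    end
  end
end

db = DB.new([Printing.new("Lightning Bolt", "m10"),
             Printing.new("Shock", "m10")])
result = ConditionAnd.new(ConditionWord.new("lightning"),
                          ConditionWord.new("bolt")).search(db)
p result.map(&:name)
# => ["Lightning Bolt"]
```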
Searching by edition (e:) and block (b:) always returns a small number of results, and since we already have to maintain per-set data, we reuse it as an index and return small result Sets from it. Other queries don't use any indexing and just check cards one by one.
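That index reuse can be sketched like this, assuming printings get grouped by set code at load time anyway; the names here are hypothetical:

```ruby
require "set"

Printing = Struct.new(:name, :set_code)

printings = [
  Printing.new("Shock", "m10"),
  Printing.new("Lightning Bolt", "m10"),
  Printing.new("Shock", "m11"),
]

# Built once at load time, since per-set data is needed regardless;
# an e: lookup then returns this Set directly instead of scanning
# every printing.
printings_by_set = printings.group_by(&:set_code).transform_values(&:to_set)

result = printings_by_set["m10"]
p result.map(&:name).sort
# => ["Lightning Bolt", "Shock"]
```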
For most conditions, either all printings of a card match a query or none do, so a special case for conditions which don't deal with anything printing-specific could potentially speed up a lot of queries. On the other hand, while some cards have a crazy number of printings, the average is just 1.9 printings per card, so it's hard to tell whether such savings would be worth the added complexity.
Spelling correction

Many Magic cards have names which are very easy to misspell.
This two-step process (along with diacritics stripping and a very limited amount of stemming) deals with most misspelled searches with good usability while keeping false positives at a very low level.
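The exact steps aren't reproduced here, but the general shape of diacritics stripping plus fuzzy name matching might look like this; the edit-distance threshold and the matching strategy are made-up stand-ins, not the actual mtg.wtf logic:

```ruby
# Strip combining marks after Unicode decomposition, so e.g. "ö" -> "o".
def strip_diacritics(name)
  name.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# Plain Levenshtein distance via dynamic programming.
def edit_distance(a, b)
  prev = (0..b.size).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      curr << [prev[j] + 1,                      # deletion
               curr[j - 1] + 1,                  # insertion
               prev[j - 1] + (ca == cb ? 0 : 1)  # substitution
              ].min
    end
    prev = curr
  end
  prev[b.size]
end

names = ["Jötun Grunt", "Lightning Bolt"]
query = "jotun grunt"
# Accept the first name within a small (made-up) edit distance.
match = names.find do |n|
  edit_distance(strip_diacritics(n).downcase, query.downcase) <= 1
end
p match   # finds "Jötun Grunt" despite the missing umlaut
```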
Results coalescence

At this point we have a sorted list of printings, which is not quite what we want to display: all printings of the same card need to be merged (which is just .group_by(&:name)), and since we'll be displaying the card picture from a specific printing, we also want to choose the most appropriate one.
The algorithm for that:
- choose only from matching printings
- prefer those with a card picture (a small number of rare promos don't have pictures); that's a frontend consideration the database knows nothing about
- otherwise use the order returned by the search engine, which is:
- whichever order was explicitly requested if any
- name in alphabetical order
- Standard-legal sets first
- most recent first
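The steps above can be sketched as follows; the has_picture flag stands in for the frontend knowledge the database doesn't have, and the data is illustrative:

```ruby
Printing = Struct.new(:name, :set_code, :has_picture)

# Already sorted by the search engine's order described above.
sorted = [
  Printing.new("Shock", "promo", false),
  Printing.new("Shock", "m10", true),
  Printing.new("Lightning Bolt", "m10", true),
]

# Merge printings of the same card, then pick the display printing:
# first matching printing with a picture, falling back to engine order.
results = sorted.group_by(&:name).map do |name, printings|
  display = printings.find(&:has_picture) || printings.first
  [name, display.set_code]
end

p results
# => [["Shock", "m10"], ["Lightning Bolt", "m10"]]
```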
Future direction for mtg.wtf

The most common requests are:
- improving website performance
- adding card pricing information via some third party API
- advanced search form
Third-party pricing APIs ended up more complex than I thought, but I'll probably get there at some point.
Performance complaints probably mean the current architecture could use some improvements after all. It seems fast enough to me, but apparently that's not good enough.