Sunday, August 21, 2016

Test coverage in ruby

IMG_5547 by rautiocination from flickr (CC-NC)

In highly dynamic language like ruby it ought to be common sense to aim at 100% test coverage. Any line without coverage can contain any kind of typo or other silliness and crash in production.

Statically typed languages have a minor advantage here, as compile-time type check is sort of a test. Unfortunately it has such a massive false positive rate and incurs such a massive penalty to expressiveness as to make the whole idea of static typing worthless. In early days of computing people seriously expected rich static type systems to be mathematical proofs of various properties of languages, but that went nowhere. Anyway, enough history.

In ruby 100% test coverage - every part of the program being executed at least once by some kind of rudimentary test - really ought to be the minimum goal. Just running the code and verifying that it returns something vaguely reasonable without crashing really isn't that much to ask for.

Unfortunately it seems to be far too common for people to see 100% coverage as some highly ambitious goal, and be satisfied with "good enough" numbers like let's say 80%.

Let's have a talk about all that.

Rails autoloader

A lot of ruby code uses Rails, which uses code autoloading in development for a good reason, but unfortunately it also uses it during testing when it really seriously shouldn't, and someone please submit that as a bug report on an off-change it won't get EWONTFIXed.

This leads to a very nasty situation where simplecov reports that you have 100% test coverage, even though you completely forgot about a bunch of models and controllers. Not loaded means not included in coverage report at all, instead of being big red 0%.

Rails generators will automatically create empty test classes for you, so if you use generators you're going to be somewhat less affected by this problem, but most people end up creating a lot of files manually as well.

Anyway, if you use any kind of Rails project for which you want to get reasonable coverage report, you ought to put something like this in your spec/spec_helper.rb or equivalent, possibly guarded by if ENV["COVERAGE"].present? if you'd prefer to only generate coverage report when explicitly requested:

SimpleCov.at_exit do
  Rails.application.config.eager_load_paths |= Rails.application.config.autoload_paths
  Rails.application.eager_load!
  SimpleCov.result.format!
end


If you do that on existing project, you might be for a big surprise.

0 tests = 50% test coverage

Next time you think 80% coverage is good, try to run coverage report with all tests disabled. You'll probably get a figure around 50% coverage, from zero tests. Let's take a look at this example:

class HelloWorld
  def say_hello
    putts "Hello, world!"
  end
end


In meaningful sense it has only one line of code, third, which also happens to have nasty typo. As far as ruby and simplecov are concerned however, there are three lines, as that code is more or less just syntactic sugar for:

(HelloWorld = Class.new).instance_eval do
  define_method :say_hello do
    putts "Hello, world!"
  end
end


And lines 1-3 are all perfectly executable. Depending on how long your methods are, you'll get something close to 50% test coverage with no test- just from the way ruby works.

That means that "80% coverage" might very well be 60% of actual code, and that sounds a lot less appealing, doesn't it?

ruby -w

I tend to dislike linters, as they mostly tend to enforce authors' misguided ideas on what is proper way to code, and have extremely high false positive rate for finding genuine problems.

For some reason every one of them has some ideas which are not just pointless, they result in actually horrible code. For example rubocop will whine if you use consistent double quotes and will demand awful mix of single and double quotes; and then it will whine even more if you consistently use trailing comma in multiline Array/Hash literals, instead forcing you to write ugly code with even uglier git diffs and pointless merge conflicts. The project should die in a fire, and its authors should feel ashamed of ever publishing such pile of crap.

One simple alternative is to that is checking your program with ruby -w, which has fairly low rate of false positives (whining about system *%W[some --command] is the only one I keep running into a lot), but can still find most simple typos, misindentation, and such.

This kind of analysis is usually not counted as part of coverage report, but is useful supplement.

Error lines

Even in reasonably tested software it is very common to see code which raises exception not covered.

There are two cases which look the same syntactically:
  • errors because something bad happened
  • errors because something supposed to be impossible happened
For example z3 gem had this code in a private method:

  def check_sat_results(r)
    case r
    when 1
      :sat
    when 0
      :unknown
    when -1
      :unsat
    else
      raise Z3::Exception, "Wrong SAT result #{r}"
    end
  end

z3 library is supposed to return -1, 0, or 1, so the whole raise line is just debugging aid and it's rather pointless to test it - but on an off chance gem author messed it up or z3 quietly changed its API, it's better to have this kind of line than just return nil quietly.

On the other hand if the same code was in some user-exposed API, then we should probably be checking that exception happens when we pass unexpected value.

These two situations are semantically different, but syntactically equivalent, so coverage reporter can't tell the difference between the two.

This only gets worse because...

Coverage is by line

So it's very easy to rewrite such one lines of "not covered" into code which doesn't execute but appears green in coverage report. Like this:

  def check_sat_results(r)
    raise Z3::Exception, "Wrong SAT result #{r}" if r > 1 or r < -1
    case r
    when 1
      :sat
    when 0
      :unknown
    when -1
      :unsat
    end
  end

or this:

  def check_sat_results(r)
    {
      1 => :sat,
      0 => :unknown,
     -1 => :unsat,
    }[0] or raise Z3::Exception, "Wrong SAT result #{r}"
  end

Sometimes such code which doesn't get own line looks perfectly natural, other times it feels a bit convoluted.

Unfortunately there's no reason to expect that we'll consistently use this form for "can't happen" errors, but never for "legitimate" exceptions, so we'll have both false positives and false negatives here.

Scripts

Unix scripts, rake files, deploy scripts, and such are left untested so often they're usually not even included in coverage report. I wrote previously about some patterns for testing them, but even then they're not really the low hanging fruit of testing.

Even worse when you actually test your scripts in realistic way by actually executing them, simplecov requires quite a few hoops to jump before it will show code executed via such scripts as covered.

What is code anyway?

Your codebase might contain files which are sort of code, but not completely so, like .haml/.erb files. It's not generally included in coverage reports, even though view logic is as much part of program logic as anything else, and complex views are fairly common, even if people frown at them.

You can move complex logic out of views into models, controllers, and helpers, but even when all that's left is relatively straightforward it would be useful to know it was ran at least once during testing.

Automatically generated code

Metaprogramming is a big aspect of ruby, so it's fairly common for ruby programs to contain a lot of generated code.

It still deserves some kind of at least rudimentary testing, even if tests are generated as well.

Count all kinds of tests together for coverage report

Test coverage scale doesn't go from "no tests" to "great tests", it goes from "no tests" to "probably good enough tests", and it can't distinguish between "probably good enough" (at 100% or close to it) and tests which are actually great.

There's a temptation to check unit test coverage - counting only code tested directly, not "covered" accidentally by some high level integration tests.

That's a clever idea, but small number of high level integration tests can provide a lot of value for relatively little effort, and a lot of code probably doesn't deserves much more testing than that.

Excessive number of low-level tests would not only be slow to write, but they'd need constant updates every times something gets refactored.

It's a red flag if code you expect to run on production doesn't get run even once during your tests, but you should leave yourself flexibility on what kind of testing is most appropriate.

Merging coverage reports from multiple runs

This is currently far more complex than it has any reason to be. Let's say your program has a bunch of scripts, and you want to run them as scripts, and still include in coverage report.

The easiest way to deal with it I found is injecting a small bit of code into them. So in your test you'd run your scripts as IO.popen("ruby -r./spec/coverage_helper #{executable_path}").read and spec/coverage_helper.rb would setup per-script coverage report:

if ENV["COVERAGE"]
  require "simplecov"
  SimpleCov.command_name "#{File.basename($0)}"
  SimpleCov.start do
    add_filter "/spec/"
  end
  SimpleCov.at_exit do
    # Remove verbosity
    $stderr = open("/dev/null", "w")
  end
end


Those partial coverage reports are then automatically added together when you finish.

1 comment:

  1. Ruby has very poor tools for coverage testing, as you point out. The main Ruby coverage tool is SimpleCov, which only provides C0 code coverage; that is, it tests that a line has been executed. (That's why you get "50% test coverage" on some files without executing a single test — because having Ruby parse the file and read the method declarations counts as "coverage" per SimpleCov.

    Your best legitimate confidence-builder then becomes mutation testing (as nicely described here and here). Mutation testing isn't a replacement for your "regular" tests; it exists to point out where changing the logic paths executed in your code does not change the outcome of your tests. This can be a real eye-opener, especially if you routinely spend lots of time, say, debugging NoMethodErrors raised by calling some method on `nil` when you expected to be calling that method on another object.

    ReplyDelete