Friday, Jun 3, 2011

What's Wrong With Ruby's Test Doubles

Prologue

First things first: let’s square up terminology. For the sake of facilitating sane discussion on this topic, I’ve adopted the terms used in Gerard Meszaros’ book, xUnit Test Patterns. He drew a complex table for this, but I’ll quickly summarize here:

Test Double: a generic term to describe an artificial stand-in for code (usually an object) upon which the subject code you’re specifying depends. Mocks, spies, stubs, fakes, etc. are all specific subtypes of test doubles.
Stub: a test double that can be configured to respond to certain invocations (e.g. `when(panda.poke()).thenReturn("chuckle")`) in order to facilitate downstream behavior within your subject code. However, a stub can’t do anything to verify that certain invocations take place.
Mock: a test double that can be configured to expect certain invocations in advance, raising exceptions if those interactions never occur. Mocks add the bizarre wrinkle that they also raise an exception the moment they receive any unexpected invocation. For convenience, the mock objects generated by most (all?) modern mock libraries can do double-duty as stubs, despite Martin Fowler’s best effort to explain the difference.
Spy: a test double that records all of the invocations made against it, exposing some way to interrogate how it was interacted with after the fact (e.g. `verify(panda).eat(bamboo)`). Spies respond quietly when interacted with by your subject code, usually returning the bare minimum the language supports (`undefined` in JavaScript, `null` in Java, `nil` in Ruby). Of course, they respond less silently when they’ve been set up to stub an interaction, because most spies can stub too!
Partial Mock / Proxy: a real object for which only particular method interactions have been cherry-picked to be stubbed or expected. Partial mocks break unit test isolation (because your subject code is now interacting with a quasi-real dependency) and their controversial use has been known to incite nerdy fisticuffs.
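
To make the first few definitions concrete, here is a minimal sketch of hand-rolled doubles in plain Ruby. The `PandaStub`, `PandaSpy`, and `PandaMock` classes (and their `received?`, `expect`, and `verify!` methods) are names I made up for illustration; they are not the API of any library mentioned in this post.

```ruby
# Stub: canned answers only; it cannot tell you whether it was ever called.
class PandaStub
  def poke
    "chuckle"
  end
end

# Spy: quietly records every invocation and answers nil,
# so the spec can interrogate it after the fact.
class PandaSpy
  attr_reader :calls

  def initialize
    @calls = []
  end

  def method_missing(name, *args)
    @calls << [name, args]
    nil
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end

  def received?(name, *args)
    @calls.include?([name, args])
  end
end

# Mock: expectations are declared up front, and anything else raises immediately.
class PandaMock
  def initialize
    @expected = []
  end

  def expect(name, *args)
    @expected << [name, args]
    self
  end

  def method_missing(name, *args)
    index = @expected.index([name, args])
    raise "unexpected invocation: #{name}" if index.nil?
    @expected.delete_at(index)
    nil
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end

  # Raises if any expected invocation never happened.
  def verify!
    raise "missing invocations: #{@expected.inspect}" unless @expected.empty?
    true
  end
end

spy = PandaSpy.new
spy.eat(:bamboo)
spy.received?(:eat, :bamboo)  # => true
```

Even for a dependency this small, the hand-rolled versions are already bulky, which is the appeal of libraries that generate doubles for you.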

When it comes to test doubles available to Ruby developers, something has puzzled me for a while. Many of the brightest minds in testing left Javaland to join the Ruby community. Because of this, I was shocked to find that Mockito—a test spy framework for Java—is a more expressive tool for working with generated test doubles than any of the numerous libraries available for Ruby.

Mock Objects aren’t Fantastic

One root cause of the problem seems to be buried in (relatively) ancient agile history. The popular concept of “mock objects” appears to have its roots in a paper submitted at XP 2000. Mock objects were introduced as a sort of configurable booby trap: you can spring-load them to expect a discrete set of invocations; later when you execute your code, the mock will explode in your face if it was interacted with in any way other than what was explicitly expected in advance.

Booby trap or not, the pattern quickly earned popularity, because it lowered the barrier to entry for writing tests that isolated subject code from the implementation of its dependencies and that verified interactions lacking observable side effects. And when I say “lowered the barrier to entry,” I mean to say, “hand-rolling test doubles in a language like Java produces mountains of cruft to maintain.”

As someone in the business of specifying code using automated specs or unit tests, I have always found mock objects to have a couple of glaring flaws:

Because expectations need to be declared up front, the developer is forced to violate the arrange-act-assert pattern. My struggles as a student of Japanese, where the verb is usually the last word in a sentence, confirm that it’s cumbersome to read every sentence to an “arrange-assert-act” rhythm.
Mock objects raise errors whenever they receive messages that weren’t explicitly expected. When a test double explodes for being interacted with in a way that may be irrelevant to the behavior being specified, it presents an unnecessary obstacle to any author intending to specify behavior over implementation.
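
Both flaws can be sketched in a few lines of plain Ruby. Everything here is hypothetical: `Zookeeper` is an invented subject, and `StrictMock` and `Spy` are bare-bones stand-ins I wrote for this illustration rather than any real library’s classes.

```ruby
# Hypothetical subject: feeding the panda also writes an incidental log line.
class Zookeeper
  def initialize(panda, log)
    @panda = panda
    @log = log
  end

  def feed
    @log.info("feeding time") # incidental detail, not the behavior being specified
    @panda.eat(:bamboo)
  end
end

# Bare-bones strict mock: raises on any invocation not expected in advance.
class StrictMock
  def initialize(*expected)
    @expected = expected
  end

  def method_missing(name, *args)
    raise "unexpected invocation: #{name}" unless @expected.delete([name, args])
    nil
  end

  def respond_to_missing?(*)
    true
  end

  def satisfied?
    @expected.empty?
  end
end

# Bare-bones spy: records every invocation and returns nil.
class Spy
  attr_reader :calls

  def initialize
    @calls = []
  end

  def method_missing(name, *args)
    @calls << [name, args]
    nil
  end

  def respond_to_missing?(*)
    true
  end
end

# Mock style reads arrange, assert, act. The log call must be expected too,
# or the mock explodes over a detail this spec does not care about.
panda = StrictMock.new([:eat, [:bamboo]])
log = StrictMock.new([:info, ["feeding time"]])
Zookeeper.new(panda, log).feed
raise "panda was not fed" unless panda.satisfied?

# Spy style reads arrange, act, assert, and the log call is simply ignored.
panda = Spy.new
log = Spy.new
Zookeeper.new(panda, log).feed
raise "panda was not fed" unless panda.calls.include?([:eat, [:bamboo]])
```

If the log message ever changes, the mock-style spec fails even though the panda is still fed, while the spy-style spec keeps passing.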

The introduction of the Spy pattern alleviated these concerns. In fact, when Dan North asked Mockito’s author how he rose above the constraints imposed by traditional endotesting, Szczepan Faber replied, “what’s endotesting?”

Ruby Test Doubles aren’t Fantastic

Last summer, I wrapped up a project that benefited greatly from the low-maintenance semantics of spies in both Mockito and Jasmine. Following that experience, I was caught off guard as I worked through the otherwise fabulous RSpec Book, because I failed to find the same enjoyable workflow in any of Ruby’s popular test double libraries. (For what it’s worth, I played with Mocha, RSpec Mocks, FlexMock, NotAMock, and rr).

No single library exhibited every one of the following problems, but each seemed to suffer from at least one of them:

In RSpec, performing the “arrange” and “act” steps in an example group’s `before` block can free up each `it` to be a one-line verification. This not only enables DRYer specs, but when each `it` is just one line long, it can serve as an obvious English-to-code translation. Mock objects, which require expectations be set in advance of the “act” step, completely break this approach and force you back into larger `it` blocks that look more like traditional xUnit test methods.
As a result of the above, mock expectations also limit your ability to nest example groups—in which the “act” phase is performed in each `before` block and the behavior cascades into deeper example groups. It’s a shame, because nested example groups are a tremendously expressive way to specify a subject as its state changes or when characterizing complex legacy code.
Some libraries rely on codifying methods as strings or symbols (e.g. `User.should_receive(:find).with(42)` instead of `verify(User).find(42)`), which makes refactoring method names harder. When a developer renames a method, she must remember to update not only every place the method is called but also every place its name appears as a symbol or string argument.
None of the Ruby libraries seem to do anything to discourage “fantasy tests” (coined by Jim Weirich in this talk), which are unit tests that will stay green even after the depended-on method’s name has changed. This problem doesn’t occur in Mockito (which has Java’s compile-time safety) and has never bitten me while using Jasmine (which has a simple runtime check that the function exists before creating each spy).
Since the primary reason I use test doubles is to achieve isolation from my subject code’s dependencies, I was disappointed by how eagerly some of the tools sought to break unit test isolation. For instance, rr’s proxy pattern calls through to the real method.
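
The kind of runtime check that guards against fantasy tests is not hard to sketch in Ruby. The following is only an assumption about how such a guard might look; `CheckedSpy` and `Panda` are hypothetical names, not part of any library discussed here.

```ruby
# Hypothetical real dependency.
class Panda
  def eat(food); end
end

# A spy that refuses to stand in for methods the real class does not define.
class CheckedSpy
  attr_reader :calls

  def initialize(klass)
    @klass = klass
    @calls = []
  end

  def method_missing(name, *args)
    unless @klass.method_defined?(name)
      raise NoMethodError, "#{@klass} does not define ##{name}"
    end
    @calls << [name, args]
    nil
  end

  def respond_to_missing?(name, include_private = false)
    @klass.method_defined?(name) || super
  end
end

panda = CheckedSpy.new(Panda)
panda.eat(:bamboo) # recorded, because Panda#eat exists
# panda.munch(:bamboo) would raise NoMethodError. Had Panda#eat been renamed,
# a spec still using the old name would go red instead of staying green.
```

One caveat with this approach: `method_defined?` only sees methods that are actually defined on the class, so dependencies that answer messages dynamically through `method_missing` would need a different check.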

The Way Forward

Reviewing that list of foibles, I have to say I’m no longer astonished that it’s my rubyist friends who seem the most annoyed with, frustrated by, and often vehemently opposed to using test doubles. Instead, many rubyists are content to opt for (and Rails encourages) varying degrees of integrated tests when specifying code. However, insisting on only writing “most-stack” tests has consequences of its own, as David Chelimsky argued well.

On the topic of “when to use a test double” (which my friend Zach Dennis wrote about just this week), many developers I speak with seem to spend a lot of time throughout the day agonizing over when and when not to employ test doubles when writing a code specification. Their concern is rooted in the astute observation that if they write only a single test to specify some code, a test that realistically integrates the code with its collaborators will provide richer feedback than an isolated test will. This is absolutely true.

But these concerns seem to evaporate when one practices outside-in development, in which a failing full-stack spec (written in Cucumber or Steak, perhaps) demands a failing unit specification, which in turn demands real code be written. It’s in this BDD cadence (advocated, again, by the RSpec Book) where the decision of “when to mock” ceases to be so contentious. First, the code is already exercised in a fully-integrated setting, so if anything breaks, you’ll know about it. Second, the code can be specified safely and exhaustively under whatever degree of isolation makes for the cleanest, most readable spec. Finally, isolated specs are portable and can easily travel with the code, while the full-stack specs remain coupled to the broader application for which the code was first created.

Gimme

In response to all this, I spent two weeks over my Christmas holiday last year holed up, writing a test spy library of my own called “gimme”. It was an exciting exercise in learning a little bit about introspection in Ruby and in specifying an API (without a user interface) using Cucumber.

I started gimme in an effort to escape each of the annoyances listed above. It has many of the features a test double library needs, such as stubbing, verifying, argument matchers, and argument captors. However, gimme currently can’t stub or verify class methods, which more or less annihilates its usefulness for Rails apps.

Are you perhaps interested in helping poor gimme cross the finish line? Kevin Baribeau and I would love to meet more people interested in working on it.

It seems like it would be ready for public consumption if it only had RSpec support and class method stubbing & verification. (Update: Gimme now supports both RSpec and class method stubbing and verification.) Tweet at me or just fork it yourself if you’re interested in helping!

The fine print:

I normally defend myself vigorously, but this is one case where I’d love to be exactly wrong, called out on it, and put to shame throughout the community. I’d happily suffer life as a disgrace if it meant there was a better test double library in Ruby and I was merely ignorant.
This post started out as a talk submission to Great Lakes Ruby Bash. They didn’t accept it, and perhaps for great reasons (read: I’m not awesome at Ruby).