Apache Camel Part II: TDDing CSV Parsing

Apache Camel Series: This post is part of a series of blog posts about writing a “Real World Application” with Apache Camel, covering topics like Persistence, Testing, IoC, API Integration and “Big Data”, using the example of processing the NYC Trip Data. You can follow my progress here

TDD is often viewed as a cult! And with good reason: a lot of its practices seem counterproductive and outright absurd. Its devotees can come across as mumbling eccentrics extolling the benefits of the practice… I hope to avoid that. Without being dogmatic, I want to highlight some of the benefits of the practice in a practical manner.

I’m going to assume you are familiar with unit testing and its benefits.

Why Test Driven Development?

Unit testing is important! It protects the system against unintentional changes in behaviour when we change our code. Code will change, and it is often not changed by the original author. It is really nice to have some code automatically tell me: “Hey! Are you sure you meant to change this?” That’s all a unit test is: some code that says “Hey! This code is supposed to work like this”.

The problem is that unit testing is often an afterthought in the development cycle. This leads to unit tests being retrofitted onto code that was written days or weeks before, long after the context in which the code was written has passed. That context is vital to writing good tests. The point of unit tests is to assert that aspects of the code work as intended. Assertions that seem obvious in the context the code was written in are often forgotten when tests are written later. All TDD is trying to do is get you to write your tests while that context is still in your head.

Test driven dogma states: red, green, refactor! Red: write a failing test with some assertions about the code you want to write. Green: write just enough code to pass that test. Refactor: refactor the code to improve design and remove duplication. Rinse and repeat until a fully functional system appears. Test driven dogma is all YAGNI, anti-BDUF, let the tests drive the design.
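
To make the cycle concrete, here is a minimal sketch of one pass through it in plain Java, with no test framework. The FieldSplitter class, its split method and the check are all invented for illustration; they are not part of the application in this series. The check in firstTest is written first and fails (red) until split exists and returns the right fields (green). The refactor step would then tidy the implementation while the check keeps passing.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical unit under test: just enough code to make the check pass (green).
final class FieldSplitter {
    static List<String> split(String line) {
        // The -1 limit keeps trailing empty fields, so "a,b," has three fields.
        return Arrays.asList(line.split(",", -1));
    }
}

// The "test", written before FieldSplitter.split existed (red).
final class FieldSplitterCheck {
    static void firstTest() {
        List<String> fields = FieldSplitter.split("2016-01-01,2,7.5");
        if (!fields.equals(Arrays.asList("2016-01-01", "2", "7.5"))) {
            throw new AssertionError("unexpected fields: " + fields);
        }
    }

    public static void main(String[] args) {
        firstTest();
        System.out.println("green");
    }
}
```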

Test Driven What?

WTF? This makes no sense! The idea that unit tests can design your code for you? The benefits of TDD are often misunderstood, usually because what unit testing actually is gets misunderstood. It is an invaluable tool for keeping the design of your code loosely coupled and ready for the inevitable changes it will undergo in its lifetime.

Getting Started: Writing the first test

Now, I wrote a bunch of code without writing any tests at all, which is totally not TDD but is totally practical. I was figuring out how to get two libraries to play nice. I was not writing any sort of business logic; I was just figuring out how I would use Guice with Camel. I view TDD as a tool to validate that my code works as I expect. If I’m not sure how I expect my code to work, there is not much for me to test.

Once I’m happy with the direction I’m taking, the first test I always write is a simple integration test to validate that my application starts, does what I want and shuts down cleanly.

package me.martinrichards.apache.camel.address;

import ...

public class ApplicationIntegrationTest {
    @Test
    @Ignore
    public void testApplicationStartupUrl() throws Exception {
       Application.main("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv");
    }
}

This is not a unit test. The point of this test is to give me a quick and easy way to run my application and get a debugger on code doing actual work: actually hit a database, actually download the CSV file, etc. Most of the time I find it difficult to know what I need to write until I can look at the arguments I’ve got with actual values. What does a CSVParser look like with data in it? How are the headers in my CSV file reflected in the CSVRecord?

This is my poor man’s substitute for a REPL in Java. I need a way to interact and experiment with my code, and this test gives me that. But it is not something I want to run with my unit tests every time, so the test usually remains annotated with @Ignore.

The same thing can be achieved, if you’re working from an IDE like IntelliJ, NetBeans or Eclipse, by just running the main application. But I must be some sort of TDD fanatic, as that didn’t occur to me at first.

So how is this going to work?

Apache Camel works with messages. The first message into the system will be the URL of the CSV that will be downloaded and processed. On startup the application takes the list of URLs passed as arguments and pushes them into Camel.

// Application.java
public void start(String... args) throws Exception {
    camel.start();
    for (String arg : args) {
        camel.getManagedCamelContext().sendBody(AddressRoute.FROM, arg);
    }
}

The Route takes each message and hands it off to the TaxiDataProcessor. This is where the business logic starts and this is where we start to test. From a functional point of view, all the processor needs to do is extract the CSVRecords and pass them on for further processing.

public class TaxiDataProcessor implements Processor {
    @Override
    public void process(Exchange exchange) throws Exception {
        String file = (String) exchange.getIn().getBody();
        URL url = new URL(file);
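        // CSVParser.parse needs a charset and a format as well as the URL
        CSVParser parser = CSVParser.parse(url, Charset.defaultCharset(),
                CSVFormat.RFC4180.withHeader());
        for (CSVRecord record : parser) {
            // Message has no sendBody(); setBody() is the call the test verifies
            exchange.getOut().setBody(record);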
        }
    }
}

That’s a simple implementation that does what I need. Let’s write a test to verify the behaviour of the code.

package me.martinrichards.apache.camel.address.processor;

import ...

@RunWith(MockitoJUnitRunner.class)
public class TaxiDataProcessorTest extends BaseTestCase {
    @Mock
    Exchange exchange;

    @Mock
    Message message;

    @InjectMocks
    TaxiDataProcessor processor;

    @Before
    public void setup() {
        when(message.getBody()).thenReturn("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv");
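        when(exchange.getIn()).thenReturn(message);
        // the processor writes to getOut(); an unstubbed mock would return null
        when(exchange.getOut()).thenReturn(message);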
    }

    @Test
    public void testProcess() throws Exception {
        processor.process(exchange);
        Mockito.verify(message, times(3))
               .setBody(any(TaxiDataCSVRecord.class));
    }
}

There is quite a lot going on in this first test and quite a lot to digest, especially if you’re new to testing.

So what are all these `@Mock` annotations?

For me, mocking was a key breakthrough when it came to testing. Before mocks, testing felt difficult, cumbersome and complicated. Having to build complex object hierarchies just to test a single aspect always felt too hard for what, at the end of the day, is very simple. Mocks are simply objects that pretend to be other objects. They allow you as the tester to define their behavior without having to build the full object hierarchy required to get the desired behavior from the system under test. For a detailed discussion on mocking and how it fits into testing I’d highly recommend Martin Fowler’s article Mocks Aren’t Stubs.
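
To take some of the mystery out of the idea, here is a rough sketch of what a mocking library does for you, using only java.lang.reflect.Proxy from the JDK. This is not how Mockito is implemented, and Inbox and TinyMock are names made up for this example; Inbox merely stands in for a collaborator such as a Camel Message.

```java
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical collaborator interface, standing in for something like a Camel Message.
interface Inbox {
    Object getBody();
    void setBody(Object body);
}

// A toy mock factory: canned answers go in, a record of the calls comes out.
final class TinyMock {
    final List<String> calls = new ArrayList<>();

    <T> T of(Class<T> type, Map<String, ?> cannedAnswers) {
        Object proxy = Proxy.newProxyInstance(
                type.getClassLoader(),
                new Class<?>[] {type},
                (self, method, args) -> {
                    // remember which method was called
                    calls.add(method.getName());
                    // hand back the canned answer, or null when nothing is stubbed
                    return cannedAnswers.get(method.getName());
                });
        return type.cast(proxy);
    }
}

final class TinyMockDemo {
    public static void main(String[] args) {
        TinyMock mock = new TinyMock();
        Inbox inbox = mock.of(Inbox.class, Map.of("getBody", "https://example.test/trips.csv"));

        System.out.println(inbox.getBody()); // the canned answer
        inbox.setBody("a record");           // recorded, otherwise ignored
        System.out.println(mock.calls);      // the calls made so far, in order
    }
}
```

In the tests in this post, @Mock and when(...).thenReturn(...) play the role of the canned answers, and Mockito.verify(...) plays the role of inspecting the recorded calls.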

The above test simply mocks out a Camel message and calls the processor directly, as if the message had been routed by Camel. This will create an instance of the CSVParser and start downloading and processing the CSV, which is about 1GB: obviously something that should be avoided in a unit test.

The TaxiDataProcessorTest should only validate the behavior of the TaxiDataProcessor. What exactly is that behavior? Not downloading the CSV.

From a high level, the processor takes a Camel Message, extracts the useful data from it and passes it on to something that turns it into a list of records, which are then put on the Camel output. The Processor’s responsibility is to mediate between the Apache Camel framework and the business logic of the application.

Testing one unit at a time

The goal of a unit test, as implied by the name, is to test a single unit of the code at a time, making assertions about the unit’s expected behavior in isolation. To achieve this you ideally want to mock behavior that is external to the class. This is where creational design patterns come into play: since a new CSVParser needs to be made for each message (URL), we need a way to intercept the creation of the parser. In order to achieve that, let’s create a TaxiDataCSVFactory and pass it in via the TaxiDataProcessor’s constructor.

public class TaxiDataProcessor implements Processor {
    private final TaxiDataCSVFactory factory;

    public TaxiDataProcessor(final TaxiDataCSVFactory factory) {
        this.factory = factory;
    }

    @Override
    public void process(Exchange exchange) throws Exception {
        String file = (String) exchange.getIn().getBody();
        URL url = new URL(file);
        final TaxiDataCSV parser = factory.build(url);
        for (TaxiDataCSVRecord record : parser) {
            // setBody() is the Message call the test verifies
            exchange.getOut().setBody(record);
        }
    }
}

Unfortunately the CSVParser in Apache Commons is a final class, so it cannot easily be mocked. In order to mock it, it needs to be wrapped by our own class. From a pure design perspective this means our business logic is completely separated from any direct reliance on external libraries. If we want to change CSV parsers in the future, all that is required is changing the type of parser that our wrapper uses. All the business logic of our application should remain untouched by the change of an underlying library.

public class TaxiDataCSVFactory {
    public TaxiDataCSV build(URL url) throws IOException {
        return new TaxiDataCSV(CSVParser.parse(url, Charset.defaultCharset(),
                CSVFormat.RFC4180.withHeader()));
    }
}

This does however require a little extra work up front in distilling the behavior of the system under test and how best to approach testing it.

public class TaxiDataCSV implements Iterable<TaxiDataCSVRecord> {
    private final TaxiDataCSVIterator iterator;

    public TaxiDataCSV(Iterable<CSVRecord> parser){
        this.iterator = new TaxiDataCSVIterator(parser);
    }

    @Override
    public Iterator<TaxiDataCSVRecord> iterator() {
        return iterator;
    }
}

The essence of the TaxiDataProcessor is taking a URL and looping over the records behind that URL. What you end up with is a TaxiDataCSVFactory that returns an Iterable. Since the Apache Commons CSVParser is itself an Iterable of CSVRecord, we simply need to wrap it in an Iterable of our own to allow the processor to loop through all the records returned by the parser.

Even though the actual CSVParser is an Iterable, it is still a final class and cannot be mocked, so its iterator still needs to be wrapped.

public class TaxiDataCSVIterator implements Iterator<TaxiDataCSVRecord> {
    // Fetch the underlying iterator once: calling iterator() on every
    // hasNext()/next() may hand back a fresh iterator each time.
    private final Iterator<CSVRecord> records;

    public TaxiDataCSVIterator(Iterable<CSVRecord> parser) {
        this.records = parser.iterator();
    }

    @Override
    public boolean hasNext() {
        return records.hasNext();
    }

    @Override
    public TaxiDataCSVRecord next() {
        return new TaxiDataCSVRecord(records.next());
    }
}
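
One general point about wrapping an Iterable: most collections hand back a fresh iterator on every call to iterator(), so a wrapping iterator should ask for the underlying iterator once and keep it. The sketch below shows the idea with plain JDK types. MappingIterator is a name invented for this example, and strings stand in for CSVRecord and TaxiDataCSVRecord.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical generic version of a record-wrapping iterator.
final class MappingIterator<S, T> implements Iterator<T> {
    private final Iterator<S> source;   // obtained once, then reused
    private final Function<S, T> wrap;  // turns a raw record into our own type

    MappingIterator(Iterable<S> source, Function<S, T> wrap) {
        this.source = source.iterator();
        this.wrap = wrap;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public T next() {
        return wrap.apply(source.next());
    }
}

final class MappingIteratorDemo {
    // Drains the wrapper into a list so the result is easy to inspect.
    static List<String> wrapAll(List<String> rows) {
        List<String> out = new ArrayList<>();
        Iterator<String> it =
                new MappingIterator<String, String>(rows, row -> "record(" + row + ")");
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wrapAll(List.of("a", "b", "c")));
    }
}
```

Had hasNext() and next() each called rows.iterator() afresh, the loop over a plain List would never advance past the first element.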

The test can now simply mock out the expected behavior of the dependent objects and assert that the processor responds as expected.

@RunWith(MockitoJUnitRunner.class)
public class TaxiDataProcessorTest extends BaseTestCase {
    @Mock
    Exchange exchange;

    @Mock
    Message message;

    @Mock
    TaxiDataCSV taxiDataCSV;

    @Mock
    TaxiDataCSVIterator iterator;

    @Mock
    TaxiDataCSVRecord record;

    @Mock
    TaxiDataCSVFactory factory;

    @InjectMocks
    TaxiDataProcessor processor;

    @Before
    public void setup() {
        when(message.getBody()).thenReturn("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv");
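        when(exchange.getIn()).thenReturn(message);
        // the processor writes to getOut(); an unstubbed mock would return null
        when(exchange.getOut()).thenReturn(message);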
    }

    @Test
    public void testProcess() throws Exception {
        when(iterator.hasNext()).thenReturn(true, true, true, false);
        when(iterator.next()).thenReturn(record)
                .thenReturn(record).thenReturn(record);
        when(taxiDataCSV.iterator()).thenReturn(iterator);

        when(factory.build(any(URL.class))).thenReturn(taxiDataCSV);
        processor.process(exchange);
        Mockito.verify(message, times(3))
                .setBody(any(TaxiDataCSVRecord.class));
    }
}

An explosion of Classes

Now that escalated quickly! From simply trying to mock a single dependency to 5 additional classes isolating the parsing of the data from the routing of the data (keeping Apache Camel ignorant of Apache Commons CSV). This whole design was born out of the need to test the processor in an isolated manner. It quite neatly illustrates the advantages and disadvantages of using a test driven approach to the design and development of software.

Simple is not Easy, and this makes it quite clear. There is a clean separation of concerns, with each class remaining simple by itself, but as a whole it seems complex.

Balance

At the end of the day, testing is another tool for creating reliable software. Finding the right place to use it is as important as knowing how to use it. Testing the processor as a single unit drove out quite an elegant design, demonstrating the power of taking a test driven approach. But it seems like overkill for such a simple task, providing little benefit over testing it as a more complete unit using a fixture.
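
For comparison, here is a rough sketch of the fixture approach: instead of mocking every collaborator, feed a few lines of in-memory CSV through the real parsing code and assert on what comes out. To stay self-contained it uses a naive comma split from the JDK rather than Apache Commons CSV, and the FixtureCsv class, the column names and the rows are all made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the parsing code under test. It reads from a
// Reader, so a test can hand it a String instead of a large download.
final class FixtureCsv {
    static List<Map<String, String>> read(Reader in) {
        List<Map<String, String>> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String headerLine = reader.readLine();
            if (headerLine == null) {
                return rows; // empty input: no header, no rows
            }
            String[] headers = headerLine.split(",", -1);
            String line;
            while ((line = reader.readLine()) != null) {
                String[] values = line.split(",", -1); // naive: no quoting support
                Map<String, String> row = new LinkedHashMap<>();
                for (int i = 0; i < headers.length && i < values.length; i++) {
                    row.put(headers[i], values[i]);
                }
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}

final class FixtureCsvDemo {
    // A tiny invented fixture: one header line and three rows.
    static final String FIXTURE =
            "pickup,passengers,fare\n"
          + "2016-01-01 00:00:00,2,7.5\n"
          + "2016-01-01 00:05:00,1,12.0\n"
          + "2016-01-01 00:09:00,3,5.5\n";

    public static void main(String[] args) {
        List<Map<String, String>> rows = FixtureCsv.read(new StringReader(FIXTURE));
        System.out.println(rows.size() + " rows, first fare " + rows.get(0).get("fare"));
    }
}
```

The trade-off is the one described above: far fewer moving parts in the test, at the cost of exercising the parser and the code that consumes it together rather than one unit at a time.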

Writing good tests is hard! Finding the right balance between writing simple, maintainable code and simply getting things done is a true art form.