Agile Data Science

August 30, 2016

As a Software Engineer, I use a different lens than someone who comes from a Statistics background. Coding software lends itself to finding shortcuts that benefit development time and execution speed, whereas many Data Scientists concentrate on computational efficiency over improved development processes.

In hopes of helping Data Scientists increase productivity, reduce errors, and improve overall computational quality, here are my shortcuts for analytics:

Design Your Analysis Before Building It:

My mentor was a PhD candidate in Chemistry before leaving to develop software as a profession. Because of that background, he insisted we keep a lab notebook on our desks and regularly make notes, perform calculations, and record any discoveries.

This notebook works out well as a place to sketch out even simple programs before starting the coding process. These sketches usually take the form of a diagram of boxes and lines representing the various components of the soon-to-be-created software. From there, the boxes become pseudocode, and the pseudocode becomes the comments for the various functions, methods, and classes.
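As a minimal sketch, here is how the boxes from such a diagram might turn into commented stubs in Python (the pipeline stages and names are hypothetical examples, not any particular project):

```python
# A minimal sketch of turning a box-and-line design into code stubs.
# The stage names below are hypothetical.

def load_raw_data(path):
    """Box 1: read the raw input file into a list of records."""
    raise NotImplementedError

def clean_records(records):
    """Box 2: drop or repair records that fail validation."""
    raise NotImplementedError

def summarize(records):
    """Box 3: compute the summary statistics the analysis needs."""
    raise NotImplementedError

def run_analysis(path):
    """The boxes wired together exactly as drawn in the notebook."""
    return summarize(clean_records(load_raw_data(path)))
```

The stubs are then filled in one at a time, and the diagram's labels survive as the documentation.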

The point is to think through the solution before building it. This makes it easy to run sanity checks, discover would-be pitfalls, and make improvements before any code is written. The end result is code that is less prone to issues, a system that is pre-documented, and a result that is easily reproducible.

Define Data Requirements:

While chatting with a new developer, the subject of Reverse Polish Notation (RPN) came up and I shared my RPN calculator with him. Despite being well versed in higher mathematics, he was unable to use it without further instruction, because RPN requires a different, stack-based way of thinking than traditional algebraic reasoning.
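For readers who have not seen the notation, a tiny stack-based evaluator shows the idea (this is only an illustrative sketch, not the calculator mentioned above):

```python
# A minimal sketch of stack-based RPN evaluation.
def eval_rpn(tokens):
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
    }
    stack = []
    for token in tokens:
        if token in ops:
            b = stack.pop()          # operands come off the stack...
            a = stack.pop()
            stack.append(ops[token](a, b))  # ...and the result goes back on
        else:
            stack.append(float(token))
    return stack.pop()

# "3 4 + 2 *" means (3 + 4) * 2 in algebraic notation.
print(eval_rpn("3 4 + 2 *".split()))  # 14.0
```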

When working to find answers in data, it is imperative to list the data sources that will be accessed. Even if there is just a single source, the act of writing it out forces you to think through the problem in terms of data and code rather than purely statistical thought.

At other times, the data sources required for a project vary in scope. For example, they can include comma-separated values in text files, records in a structured database table, and third-party RESTful APIs. Here you need to ensure that your code can accurately read from, convert, and analyze the various sources as required.
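As a rough sketch, here is how those three kinds of sources might be pulled into one common structure in Python; the file names, table name, and API URL are hypothetical:

```python
# A minimal sketch of reading three source types into one list of records.
# File names, the table name, and the API URL are hypothetical.
import csv
import json
import sqlite3
from urllib.request import urlopen

def from_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def from_table(db_path, table):
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    conn.close()
    return [dict(r) for r in rows]

def from_api(url):
    with urlopen(url) as resp:
        return json.load(resp)

records = (
    from_csv("measurements.csv")
    + from_table("warehouse.db", "measurements")
    + from_api("https://example.com/api/measurements")
)
```

Once everything lands in the same structure, the analysis code no longer cares where a record came from.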

Pre-planning the work helps ensure that the code being created will give the desired result, because the complexities are solved before the building occurs.

Test for Reproducibility:

Having built financial systems, I am well aware that the most important factor in calculations is reproducibility. Trading software algorithms are tested to ensure that each run produces the same result for the same input; if it does not, there is a problem with the math and/or logic that must be corrected.

Consider a simple discount calculator. To reduce a price by 50%, we can multiply it by 0.50 and use the result directly as the discounted price. For a 25% discount, however, multiplying by 0.25 gives the discount amount, which must be subtracted from the price. If we treat the 25% result the same way as the 50% result, we end up giving a 75% discount.

While this example is very simple, it illustrates the need to ensure all calculations are correct and reproducible over all valid inputs.
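Here is a minimal sketch of the discount example with a quick check that the calculation is both correct and repeatable over a range of valid inputs:

```python
# A minimal sketch of the discount calculator with a reproducibility check.
def discounted_price(price, discount_rate):
    """Return the price after applying the discount rate (e.g. 0.25 for 25%)."""
    return price - price * discount_rate

# The same input must always give the same, correct output.
assert discounted_price(100.0, 0.50) == 50.0
assert discounted_price(100.0, 0.25) == 75.0   # not 25.0
for price in (0.0, 9.99, 100.0, 1234.56):
    assert discounted_price(price, 0.25) == discounted_price(price, 0.25)
```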

Test with Invalid Data:

New developers often test for the desired result and forget to ensure proper handling of invalid values. Young Data Scientists are often guilty of the same. It is actually much easier for a Data Scientist to make this mistake, since they most often work with a subset of the actual data that may not have every value type represented.

For example, as a young Software Engineer, I had to use hand-entered data from Excel spreadsheets to map data into a GIS application. The data was recorded by people not trained to do so, and a null value could be an empty cell, the letters "na", or something else entirely instead of the expected numeric value.

Due to the large amount of data in multiple files being fed into a SQL database, it was not easy to discover all of the various ways of describing null. Needless to say, my code found the invalid values before I did, leading to invalid calculations and problems with the GIS mappings.

While embarrassing, a lesson was learned and the code was fixed. However, with the very large datasets that are the norm today, such an issue could go weeks or months before being discovered, leading organizations to make decisions based on faulty calculations. Therefore, always ensure your code handles invalid values appropriately.
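As a sketch, a small defensive parser can map every recognizable spelling of "missing" to a proper null and fail loudly on anything else; the extra spellings beyond the ones I actually encountered ("n/a", "none") are assumptions:

```python
# A minimal sketch of defensive numeric parsing. The extra null spellings
# ("n/a", "none") are assumptions beyond the empty cell and "na" above.
NULL_TOKENS = {"", "na", "n/a", "none"}

def parse_measurement(cell):
    """Return a float, or None for any recognizable form of 'missing'."""
    text = str(cell).strip().lower()
    if text in NULL_TOKENS:
        return None
    try:
        return float(text)
    except ValueError:
        # Unexpected junk: fail loudly instead of silently producing bad numbers.
        raise ValueError(f"Unrecognized value: {cell!r}")

print(parse_measurement(" 3.7 "))  # 3.7
print(parse_measurement("na"))     # None
```

Failing loudly at the parsing stage means the bad value is found during loading, not weeks later in a report.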

Utilize Code Reuse Strategies:

Code reuse, one of the mantras of Object-Oriented Programming, is a staple of modern development teams. However, it is not always practiced in Data Science. I have witnessed analytics coders copying and pasting code from previous projects instead of creating reusable source. This not only wastes time but increases the likelihood of error.

In short, create a reusable collection of libraries from proven, well-tested algorithms that can be used whenever needed. This makes it much faster to accomplish common tasks while ensuring that the code handles invalid data and generates reproducible results.
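As a sketch, such a library can be nothing more than a small in-house package of proven functions; the package and module names here are hypothetical:

```python
# analytics_toolkit/stats.py -- a hypothetical shared module (sketch only).
def weighted_mean(values, weights):
    """One well-tested implementation, reused by every project."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and equal length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight
```

Each project then imports the tested version, for example `from analytics_toolkit.stats import weighted_mean`, instead of pasting its own copy.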

A shared version-control repository such as Git works well for this and can easily be integrated into most development stacks.

Divide and Conquer:

Modern development teams use Agile methods that include dividing work into smaller parts that can be completed within a given timeframe. While doing so requires more upfront planning, the end result is much faster development with greatly improved overall quality.

When working on large projects, it is always best to divide them into smaller, more manageable pieces of work. If you design the work first, dividing it into multiple parts should be straightforward. Each part must be tested both during development and when it is complete; only once all tests pass should you continue on to the next component.
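As a sketch, each component can carry its own small test and is considered done only when that test passes; unittest is used here, though any test framework will do, and the names are hypothetical:

```python
# A minimal sketch: one component, one test, finished only when it passes.
import unittest

def clean(records):
    """Component 1: drop records missing the 'value' field."""
    return [r for r in records if r.get("value") is not None]

class CleanTests(unittest.TestCase):
    def test_drops_missing_values(self):
        records = [{"value": 1.0}, {"value": None}, {}]
        self.assertEqual(clean(records), [{"value": 1.0}])

if __name__ == "__main__":
    unittest.main()
```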

Once all of the parts are complete, put the pieces together and the final result should function as required with little additional effort. Now you have reusable source code, tested algorithms, and a proven solution delivered within a pre-planned timeframe.

There are many good books and online resources on Agile development, and I highly recommend that any analytics team learn about and adopt its principles. In so doing, your team will be able to better predict development times while delivering higher-quality code more consistently.

The above suggestions are all designed to help ensure goals are reached. While each could be expanded in more detail, this article is a good starting point for any analytics developer working toward vastly improved results.