Blog Posts Business Management

Taking R from Prototype to Production

Blog: Capgemini CTO Blog

The Problem: anticipating bugs

R is a wildly popular language in the Data Science community. The language’s success has come from the provision of a deep and broad statistical tool kit in a free, open-source language with a large community behind it. Some indexes of programming language popularity even place the R language as high as fifth, below C++, Python, Java and C[1].

Getting data cleaned, into a model, and producing results can be achieved quickly in R. It is one of the language’s greatest assets, but this comes at a cost; making an R program that works every time is an arduous task. This is a critical feature of a code base that is to be put into production.

Say I wanted to forecast a time series which I have in an Excel (.csv) format, all I would need to do is this:

library(forecast)

test.data <- ts(read.csv("time_series.csv"), frequency = findfrequency(test.data))

forecast.result <- forecast(auto.arima(test.data))

plot(forecast.result)

 

In just four lines of code I have created a simple program that can forecast a time series with no parameter specification. As a reward for my short labour I get a fancy plot with an accurate forecast:

Forecast of Air Passenger data in the US

Say this code gets put into production, with a new data set provided each day. What would happen to the program should a value go missing and the data provided contains the word “missing” for the first value in the data with all other values remaining the same? This is enough to break our program, as the following result is produced:

The forecast model now uses completely different values due to a single missing value in the data

Not only is the forecast completely wrong, the values of the time series values are totally incorrect. Worse still, the program hasn’t given an error message or even a warning, it has simply assumed all is well and provided a totally invalid result.

Why has this happened? The character value “missing” means that R has read the entire data set as a text values rather than numeric values. By default, R converts text to a data type known as factor, which assigns a number to each unique data point. The model has used this index, not the actual time series, for the forecast.

This shows how a prototype in R can fail when brought into production. While these are not problems unique to R, they are problems that are harder to solve in R.

Problems such as these are hard to avoid and fix in R due the nature of the language. There is no static typing outside of the S4, most CRAN packages lack unit testing, the unit tests CRAN packages must pass have significant problems[2], parallel implementations only work on some platforms and the S3 and S4 object orientated functionality is clunky. Purdue’s study of the language said it best:

“For robust code, one would like to have less ambiguity and would probably be willing to pay for that by more verbose specifications, perhaps going as far as full-fledged type declarations. So, R is not the ideal language for developing robust packages.”[3]

Serious questions follow form that statement. Can R provide business value in the long run? What are the best use cases of R? Why is R so popular? Should you even use R?

To the last question, I would say yes, emphatically. A language where you can get a result quickly is highly valuable. I’ll address how you can future proof that result below.

The Solution: adopt best practices early

The previous section talked about some of the weaknesses of the R language, showing the ease at which you can build prototypes which are feature rich but break easily in production. Here, I’ll show how by using good practices from the beginning solves the problems that R presents and make it an invaluable tool when bringing a product to market.

In my experience, it is possible to take R beyond the prototyping stage by using good practices from the beginning. These are practices that are found across all computing languages but are all too often left behind in R code.

References

[1] http://spectrum.ieee.org/computing/software/the-2016-top-programming-languages

[2] http://stackoverflow.com/questions/9439256/how-can-i-handle-r-cmd-check-no-visible-binding-for-global-variable-notes-when

[3] Morandat, Floréal, et al. “Evaluating the design of the R language.” European Conference on Object-Oriented Programming. Springer, Berlin, Heidelberg, 2012.

Leave a Comment

Get the BPI Web Feed

Using the HTML code below, you can display this Business Process Incubator page content with the current filter and sorting inside your web site for FREE.

Copy/Paste this code in your website html code:

<iframe src="https://www.businessprocessincubator.com/content/taking-r-from-prototype-to-production/?feed=html" frameborder="0" scrolling="auto" width="100%" height="700">

Customizing your BPI Web Feed

You can click on the Get the BPI Web Feed link on any of our page to create the best possible feed for your site. Here are a few tips to customize your BPI Web Feed.

Customizing the Content Filter
On any page, you can add filter criteria using the MORE FILTERS interface:

Customizing the Content Filter

Customizing the Content Sorting
Clicking on the sorting options will also change the way your BPI Web Feed will be ordered on your site:

Get the BPI Web Feed

Some integration examples

BPMN.org

XPDL.org

×