The equivalence of the ellipsis argument and an infinite set of closures

This post is about a practical application of a topic I discuss in my book. In my book, I prove mathematically that the ellipsis in a function signature is equivalent to passing a closure with the same arguments bound to the closure.

I’m currently modeling consumer spending behavior by modeling their credit and debit card transactions. I model each type of transaction (for example coffee shops or utilities) as a transaction stream. One such function produces a sequence of daily transactions as f(\mu, \sigma, p) = \begin{cases}0, & U(0,1) < p \\ N(\mu,\sigma), & otherwise \end{cases} .

x.daily(mu=0, sigma=1, p=0.1) %as%
{
  function(t)
    ifelse(runif(length(t)) < p, rnorm(length(t), mu,sigma), 0)
}

I had a situation where I needed to generate a transaction stream using a uniform random number instead. Since the above function exposes parameters specific to a normal distribution, it’s not possible to just use a function reference for the random number generator. Instead the parameters to the RNG must be passed along. This is a good use of the ellipsis argument since it will capture any unreferenced parameters.

x.daily(p=0.1, rng=rnorm, ...) %as%
{
  function(t)
    ifelse(runif(length(t)) < p, rng(length(t), ...), 0)
}

Now this can be called as f <- x.daily(.4, rng=rnorm, mean=11, sd=2); f(1:20), which will generate a sequence of events where each non-zero event is N(11,2)

Lunch purchases - normal

On the other hand to generate a uniform distribution between [10, 20] is accomplished with f <- x.daily(.4, rng=runif, min=10, max=20); f(1:20).

Lunch purchases - uniform

What is the intuition around the use of the ellipsis? This is the core of the discussion and stems from how one would solve this problem if the ellipsis argument were not present. In this scenario the function would be defined

x.daily.1(p, rng) %::% numeric : Function : Function
x.daily.1(p=0.1, rng) %as%
{
  function(t)
    ifelse(runif(length(t)) < p, rng(length(t)), 0)
}

Calling the function is then done by passing a closure with the arguments predefined, such as x.daily.1(.4, function(x) runif(x, min=10, max=20)) or x.daily.1(.4, function(x) rnorm(x, mean=11, sd=2)). What’s remarkable is that both of these calls are equivalent to calling the original version that uses the ellipsis argument instead. Hence,

x.daily(.4, runif, min=10, max=20) ~ x.daily.1(.4, function(x) runif(x, min=10, max=20))

x.daily(.4, rnorm, mean=11, sd=2) ~ x.daily.1(.4, function(x) rnorm(x, mean=11, sd=2))

These relationships show that the ellipsis argument maps to the free variables of runif and rnorm. Clearly the function reference is arbitrary so this relationship maps to any specified function.

Armed with this equivalence relationship we can then symbolically transform a function from one representation to another. This becomes another tool used to reason about our programs and ultimately about the behavior of our models.

Futile.logger 1.3.3 RC available

This is a preview release of futile.logger 1.3.3 so some people with feature requests can try out the code before I push to CRAN.

A futile try/catch

It is now possible to capture non-futile warnings and errors by wrapping a block in ftry, which is essentially a wrapper around tryCatch. The return value is the value of the expression or the message invoked by a warning or

> ftry(log(-1))
WARN [2013-05-29 15:49:45] simpleWarning in log(-1): NaNs produced

When using ftry, the warning and error handlers are defined to call flog.warn or flog.error, respectively. A finally statement can also be passed to ftry, just like in tryCatch.

> o <- ftry(log(-1), finally=flog.info("In finally block"))
INFO [2013-05-29 15:54:26] In finally block
WARN [2013-05-29 15:54:26] simpleWarning in log(-1): NaNs produced

Custom layouts

The default layout for the futile.logger is designed to be minimally function. For people who want control over their log output, it is now possible to call layout.format to define a format string for a logger. The available variables are:

~l The log level
~t The timestamp
~n The namespace of the calling function
~f The name of the calling function
~m The message string

None of these variables are required, although at a minimum ~m should be used so you get useful strings out. To set a logger’s layout amounts to passing the closure generated by layout.format to flog.layout.

> flog.layout(layout.format('[~l] [~t] [~n.~f] ~m'))
> dummy <- function() { flog.info("Check out my shiny log message") }
> dummy()
[INFO] [2013-05-29 16:26:19] [ROOT.dummy] Check out my shiny log message

For those that are truly particular, the timestamp format can also be set using the datetime.fmt option.

Carping log messages

With a nod to PERL, I’ve also introduced (by request) the ability to return every single log message directed to futile.logger. This is a per logger setting and is returned invisibly so as to not clutter the console.

> flog.carp(TRUE)
NULL
> flog.threshold(WARN)
NULL
> o <- flog.info("This won't print")
> o
[1] "[INFO] [2013-05-29 16:29:11] [futile.logger.flog.info] This won't print\n"

If you are testing this package, please let me know. The download link will be available for about a week, after which I will push to CRAN.

Better logging in R (aka futile.logger 1.3.0 released)

In many languages logging is now part of the batteries included with a language. This isn’t yet the case in R so most people make do with cat commands laced through their code. Then after development, when the system is being run for real, a lot of those cat statements are commented out so it doesn’t fill up log files and hide real error messages. When there is a bug, the cat commands are pulled down from the attic and so on. It would be far easier to have a simple logging utility that managed the various debug statements in your code such that they could all be turned on and off as a set or by severity.

This dream is no longer futile. The latest release of my package futile.logger has been rewritten with usability in mind (and is now built on lambda.r). It itself comes with batteries included and requires zero configuration to work.

Basics

You can use futile.logger right out of the box as a replacement for cat. Here are some examples. Notice that futile.logger decorates log statements with generally useful information, like the logging level and a timestamp.

library(futile.logger)
flog.info("My first log statement with futile.logger")
flog.warn("This statement has higher severity")
flog.fatal("This one is really scary")

Format strings are built into the logging system, so log statements can be dynamic based on the data.

flog.info("This is my %s log statement, %s", 'second', 'ever')
x <- rnorm(400)
flog.info("The variance of the sample is %s", var(x))

Use the flog.threshold function to choose which messages should be displayed. If you no longer want to see INFO-level messages, then change the threshold to WARN.

flog.threshold(WARN)
flog.info("Log statement %s is hidden!", 3)
flog.warn("However warning messages will still appear")

Concepts

To take full advantage of futile.logger, it is important to understand the handful of concepts that the package introduces. For anyone familiar with log4j or its myriad clones this should be familiar.

LOGGERS

A logger is simply a namespace bound to a threshold, an appender, and a layout. When futile.logger is loaded, a single logger is created, which is known as the ROOT logger. All other loggers inherit from the ROOT logger in terms of namespace and also configuration. The ROOT logger has a threshold of INFO, logs to standard out, and has a simple layout for the messages.

Loggers are configured automatically whenever they are referenced (for example when changing the threshold, or logging to a logger) inheriting the settings of the ROOT logger. Suppose you want a separate namespace for data i/o. You can make log messages that reference a logger data.io simply by specifying a different logger name.

> flog.threshold(INFO)
NULL
> flog.info("Loading data.frame", name='data.io')

Even though no logger was explicitly defined, futile.logger knows how to print messages since the logger data.io inherits from ROOT. Having a new logger is nice, but it’s not particularly useful if it behaves identically to the default logger. What if we want to see TRACE level messages for our data i/o but not clutter up the rest of our logging?

> flog.threshold(TRACE, name='data.io')
NULL
> 
> flog.trace("Connecting to data store")

Whoops, this is gong to the ROOT logger, so obviously it won’t print (since it’s still at INFO). Let’s route the message to the data.io logger we referenced earlier.

> flog.trace(“Connecting to data store”, name=’data.io’)
TRACE [2013-03-10 16:03:40] Connecting to data store

That looks better. This is how the hierarchy works since futile.logger will use values specified higher in the hierarchy if they don’t exist for the current logger. We can even change the appender for data.io to write to a file instead.

> flog.appender(appender.file("data.io.log"), name="data.io")
NULL
> flog.trace("Connecting to data store", name='data.io')

What about intermediate points in the hierarchy? Suppose we have other data operations logging to the logger data. We want these to be at TRACE, but we want them to continue writing to the console.

> flog.threshold(TRACE, name='data')
NULL
> flog.trace("Merging data.frames", name='data')
TRACE [2013-03-10 16:10:14] Merging data.frames
> flog.trace("Connecting to data store", name='data.io')

So now the logger data will print all log messages to console, while data.io will write to a file. This should give you a good idea of how to manipulate loggers in futile.logger.

Logging Levels and Thresholds

It’s probably clear that there are multiple log levels available. Each log message has an associated log level that ranges from TRACE to FATAL. The levels form a hierarchy that not only indicate the severity of the message but also when it should be displayed. This is controlled by the threshold, which only allows messages with a severity greater than or equal to the threshold to be displayed. When futile.logger is loaded, the default log level is INFO. To change the logging threshold use the futile.threshold function. The function has two arguments: the new threshold and the name of the logger. The available thresholds are the same as the log levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL. Choose the threshold that you want to see and all messages of that severity and higher will display.

This means that the default INFO threshold will display all log messages with a log level of INFO, WARN, ERROR, FATAL. If you only want to see errors, then set the threshold to ERROR.

flog.threshold(ERROR)
flog.info("Will print: %s", FALSE)
flog.warn("Will print: %s", FALSE)
flog.error("Will print: %s", TRUE)
flog.fatal("Will print: %s", TRUE)

With this simple mechanism you can add log statements in your code and control the display of the messages by simply changing the logging threshold. Hence for development keep the threshold on INFO, for debugging change to TRACE, and for production set to WARN.

The name argument determines the logger itself, as discussed above.

Appenders

An appender defines where a log message goes. The default is to standard out, but it is also possible to write log messages to files, URLs, message queues, etc. Writing your own appender is as simple as creating a function that has the following signature.

appender.fn <- function(line) { }

Assigning it to a given logger follows the same convention as for layouts.

flog.appender(appender.fn)
flog.appender(appender.fn, name='test.logger')

Similarly calling the function without an explicit logger will bind this appender to the ROOT logger.

The following two appenders are provided out of the box in futile.logger. Each function returns a function, so when binding an appender to a logger, be sure to execute the top level function.

  • appender.console()
  • appender.file(filename)

LAYOUTS

Suppose you want to display log messages in a different format? To do this you would create a custom layout function, which is responsible for returning a string that represents the log message. The function interface is

layout.fn <- function(level, message, ...) { }

where the ellipsis argument represents represent any arguments passed via the logging functions described above. You set the layout by assigning the layout function to the given logger.

flog.layout(layout.fn)
flog.layout(layout.fn, name='test.logger')

If you don’t provide a logger, futile.logger will use the ROOT logger. This means all loggers will be impacted by the change in layout (except loggers that explicitly defined a different layout).

Package Support

If you happen to use a package that uses futile.logger, you can control logging of messages interactively on a per package basis. That means you, as the package user, decides how much logging information you wish to see for a given package.

> library(tawny)
> library(fractalrock)
> prices <- getPortfolioPrices(LETTERS[1:10], 100)
> returns <- apply(prices,2,Delt)[-1,]
> s <- cov.shrink(as.xts(returns,order.by=index(prices)[-1]))
<em id="__mceDel">INFO [2013-03-08 09:52:07] Got intensity k = 140.006097751888 and coefficient d = 1</em>
> flog.threshold(WARN, name='tawny')
NULL
> s <- cov.shrink(as.xts(returns,order.by=dates[-1]))
> flog.info("Got covariance matrix")
INFO [2013-03-08 09:55:25] Got covariance matrix

Hence by using futile.logger, package authors can balance their debugging needs with a user’s usage needs. It also means that for debugging issues, a user can change the log threshold for the package and give a more detailed output to the package maintainer.

Conclusion

Futile.logger is now more robust and easier to use. Install from CRAN or read more documentation and the source on the github page.

Lambda.r 1.1.1 released (and introducing the EMPTY keyword)

I’m pleased to announce that lambda.r 1.1.1 is now available on CRAN. This release is mostly a bug fix release, although a few important enhancements were included.  

  • [bug] Support Function in every type position (only supported for return type)
  • [bug] Auto-replacing a function with 0 arguments fails
  • [bug] Fix type inheritance
  • [new] Functions that return non-scalar values work as default values on arguments
  • [new] Support pattern matching on NULL and NA
  • [new] Support pattern matching on special symbol EMPTY

Pattern Matching for NA and NULL

Most significantly are improvements to the pattern matching semantics. Pattern matching now supports NA and NULL directly. This is particularly useful for programmatic control when a specific function signature is required but the argument value is non-deterministic. This can happen when accessing non-existent list elements as well. Suppose you want to forecast a time series. You want to choose the forecasting method based on whether the data is seasonal or not. A classification technique is used for this purpose and sets the period or NULL if it is not seasonal.

Traditional imperative code would run the classification, check its output and then use a conditional to execute the seasonal or non-seasonal forecasting routine. A functional approach would use function clauses to control the flow.

forecast_ts(x, NULL) %as% {
  # non-seasonal forecast
}

forecast_ts(x, period) %as% {
  # seasonal forecast
}

period <- classify_ts(x) # NULL or numeric
forecast_ts(x, period)

Obviously the same thing can be accomplished using an explicit guard statement, but pattern matching has an elegant simplicity to it that efficiently communicates the intent of the logic.

Behind the scenes these are additional parse transforms that take into consideration the special nature of these constants (and how you test for them). At some point I want to generalize the parse transform machinery so anyone can develop their own set of transforms (just like in erlang).

Introducing the EMPTY Pattern

I’ve also introduced a new constant called EMPTY, which allows you to pattern match on empty lists and vectors (or anything with 0 length). This means recursive definitions and other iterative methods against vectors and lists work as expected.

fold(f, EMPTY, acc) %as% acc
fold(f, x, acc) %as% { fold(f, x[-1], f(x[[1]], acc)) }

plus(x,y) %as% { x + y }

x <- 2
n <- 0:10
fold(plus, x^n/factorial(n), 0)

You can also capture situations where empty sets are the result of set operations using EMPTY. The clean declarative aspect of this notation makes your analytical code easier to understand by removing the overhead of data management and manipulation.

Full details and source are available at: https://github.com/muxspace/lambda.r

Lambda.r 1.1.0 released

Tags

This is a quick post to announce lambda.r version 1.1.0 is released and available on CRAN.1 This release has a handful of important new features and bug fixes:

  • [new] Type variables in type constraints
  • [new] Auto-replacement of function clauses
  • [bug] Function types break in type constraints
  • [bug] Zero argument functions don’t dispatch properly

Type Variables

I’ll discuss type variables in full in a separate post, but the basic idea is that you can retain polymorphism of functions by using type variables instead of concrete types. In other words, type variables define the relationship between arguments but not the actual type. Take for instance the Heaviside step function. This function will evaluate equally well for a float, integer, or double. (In R, these are all represented by numeric, so this is somewhat contrived). The output of the function is 0, 0.5 (if x == 0), or 1. Essentially the return type should match the input type.


heaviside(n) %::% a : a
heaviside(n) %when% { n < 0 } %as% 0
heaviside(0) %as% 0.5
heaviside(n) %as% 1

Suppose instead the output of the function is 0 or 1. We can represent the return value as a logical. We still don’t care about the input type, so we can define the type constraint as


heaviside(n) %::% a : logical
heaviside(n) %when% { n <= 0 } %as% FALSE
heaviside(n) %as% TRUE

In short, type variables are another tool to manage how functions are dispatched. Used in conjunction with concrete types, you can achieve generality while preserving granularity.

Auto-Replacement of Function Clauses

Before this release, overwriting a specific function clause required either sealing the definition or deleting it from the environment. If you skipped this step then new function clauses would continue to be appended to the function. Not only was this annoying it also prevented you from interactively modifying function clauses. Lambda.r is now smart enough to recognize existing function clauses and replace the specific clause.

> fib(0) %as% 5
> fib(1) %as% 2
> fib(n) %as% { fib(n-1) + fib(n-2) }
>
> fib(0) %as% 1
> fib(1) %as% 1
> 
> fib(5)
[1] 8

The one exception is when there are two function signatures only differentiated by type. In this situation, lambda.r has no way of knowing which clause to replace. The solution is that there is always one type constraint in scope. Hence any ties will be resolved by the type constraint that is in scope. To set the type constraint that is in scope for a function, simply redeclare the type constraint. Let’s define a simple function generator that multiplies an input by some number.

times.n(n) %::% numeric : Function
times.n(n=1) %as% { function(x) x + n }

times.n(n) %::% character : Function
times.n(n) %as% { times.n(as.numeric(n)) }

We source this and try it out for the default case.

> f <- times.n()
> f(4)
[1] 4

All good, so let’s check it out with a different multiplier.

> f <- times.n(2)
> f(4)
[1] 6

Whoops, it looks like we have a bug. We need to update the first function clause to use * instead of +. Since the two clauses have the same function signature we need to tell lambda.r which type constraint is in scope.

> times.n(n) %::% numeric:Function
> times.n(n) %as% { function(x) x * n }
> f <- times.n(2)
> f(4)
[1] 8

Hence functions can be interactively modified as well as re-sourced with the same behavior. It does imply that if you use type constraints in a function definition, then you need to use them consistently in that function definition.

In some ways auto-replace should merely produce a yawn as a reaction. This is because the behavior is what you expect anyway. While the implementation is non-trivial, my hope is that it is an obvious, almost trivial feature.

1 There are actually two versions 1.1.0-2, which supports the 2.15.x R series and 1.1.0-3, which is compatible with the 3.0.x R series. Selection of versions should be automatic. Once R 3 is released in the spring, I plan on supporting the 2.x series for the remainder of the year.

The myth of the missing Data Scientist

Tags

,

Much has been said about the dire shortage of Data Scientists looming on the horizon. With the spectre of Big Data casting shadows over every domain, it would seem we need nothing short of a caped wonder to help us see the light. Heralded as superheroes, Data Scientists will swoop into an organization and free the Lois Lane of latent knowledge from the cold clutches of Big Data. In the end the enterprise bystanders will marvel at the amazing powers these superhumans possess. Everyone will be happy and the Data Scientist will get the girl.

It’s a great story and a great time to be a nerd. As much as I want to believe in this story, I just don’t buy it. True there is more data being produced now than ever before. The rate of data production is growing exponentially and people need to be able to analyze this data. Yet this dire need feels manufactured. The promoters of Data Science point to the McKinsey study that cites a “shortage of 140,000 – 190,000 people with deep analytical skills” by 2018. That’s a lot of Data Scientists! Some people claim that every organization will eventually need at least one Data Scientist and perhaps even have their own department. This all sounds fantastic (who wouldn’t want a legion of super-nerds be a force in culture?) except there are some serious problems with this analysis. There are three significant problems with the hyperbole surrounding Data Science: selection bias, assimilation blindness, and automation blindness. What we’ll see is that the need for Data Scientists is likely smaller than advertised with a startlingly short half-life.

Selection Bias

The first problem is that people assume that all 150k “people with deep analytical skills” are all Data Scientists. First let’s look at the math. Suppose every organization does need at least one Data Scientist. We start with the number of public companies listed on major exchanges in the US as a proxy for “every organization”, which is about 5000. Why is this a reasonable proxy? Because smaller companies probably don’t have the budget to support full time data scientists. Adding businesses listed in OTC markets, we can roughly double that number. Fine, so let’s say 10000 companies. Then on average that would mean each organization has a team of 15 Data Scientists. Wow, I see a lot of dollar signs piling up alongside the map-reduce queries.

Clearly there must be other professions that require analytical skills that aren’t Data Scientists. Look at the cross section of people that use R and you’ll see people in Psychology, Economics, Biology, Finance, etc. The biggest population by far is the traditional group you think of when you think analysis: engineering. McKinsey hints at this when they list the Internet of Things as being one of the sources for the exponential growth in data. This version of the future, popularized by GE, points to Computational Engineers as filling most of this population. When GE alone is hiring 400 people to fill one development center, it’s plausible that the net shortage could reach hundreds of thousands.

Assimilation Blindness

The next problem is what I call assimilation blindness. Even if a shortage of this scale did exist for Data Scientists, it wouldn’t be sustained. As understanding of Big Data and analytical methods becomes more widespread, the need for specialists will often diminish. A good example is how web developers used to be a prized resource but are now commoditized since even High School students can build web sites (or iPhone apps for that matter). Data Scientists will find that their role will be assimilated quickly since their role only differs from traditional roles by having a big data component. What is the role of a Data Scientist? It is still up for debate, but here are some of the most popular themes I’ve seen:

  • Telling stories with data (including visualization) – This is what marketers do. As tools become easier to use and analytical methods more pervasive, presumably many people in the marketing department will know how to take advantage of these tools directly rather than relying on a Data Scientist
  • Finding insights in data – This is what business analysts do. They’ve been trained to use analytical tools for years and know how to spot interesting phenomena in data. The tool set is different as is the scale of the data, but given most business analysts know a little SQL and basic statistics, it isn’t a stretch to conclude that they would assimilate many of the functions a Data Scientist fills
  • Creating products from data – This is what product managers do. In finance there are plenty of data products, and they aren’t managed or invented by Data Scientists. As data products become more mainstream, more people in the product management arena will know how to ask questions of data directly because they will have learned these skills themselves

Hence while there may be a shortage in the short term, over time the Data Scientist will lose his cape and disappear into the crowd.

Automation Blindness

The functions of a Data Scientist that aren’t assimilated will likely be automated away. Not recognizing this phenomenon is what I call automation blindness. Numerous startups and big players such as IBM are developing tools to simplify big data analysis. Currently a big portion of a Data Scientist’s role is bringing together data from disparate sources to make an analysis possible. Once this is automated, the need for specialists will again decline.

In short the shortage of Data Scientists is shrouded in the myths of storytellers. There is definitely a need for people with analytical skills, and we will see this separate into skills that are generally assimilated and advanced skills used by engineers to design tools and systems that rely on data for their proper function.

Infinite generators in R

Tags

, , ,

This is first in a series of posts about creating simulations in R. As a foundational discussion, I first look at generators and how to create them in R. Note: If you are following along, all the examples rely on lambda.r, so be sure to have that installed (from CRAN) first. If you are not familiar with lambda.r, you can read the introduction.

Put simply a generator returns a function that can produce a sequence. They are common in Python and are defined as functions that “allow you to declare a function that behaves like an iterator, i.e. it can be used in a for loop.” [1]. In R for loops are generally avoided in favor of functional approaches that use the *apply suite of higher order functions. Generators are still useful under this paradigm. Haskell has a similar concept for producing infinite data structures based on lazy evaluation of functions [2]. In short generators are useful because they allow you to programmatically construct a sequence.

Construction

Using closures it is easy to emulate this behavior in R. There are two key ingredients to make this work: a generator function and a new apply function that applies over an iterator (the return function of the generator).

The generator

At its most basic a generator is simply a function that returns a closure: a function bound to an environment that references non-local variables. Closures are a fundamental building block in functional programming and can eliminate the need for (dangerous) global variables. Our iterator is the closure and will reference a number of non-local variables defined in the scope of the generator.

seq.gen(start) %as%
{
  value <- start - 1
  function() {
    value <<- value + 1
    return(value)
  }
}

For simplicity the first generator is infinite: it will continue producing values for as long as it is called. Let’s think about that for a moment. We can produce an infinitely long sequence so long as we keep calling this function. In R we typically consider sequences as being finite. We also think about them being produced as a batch i.e. in a single function call. For example, creating a sequence from 1 to 10 is simply seq(1,10) or 1:10. The sequence is then passed to some other function as a vector. If passed to apply, it will then be iterated over element by element. This is fine for data analysis or batch-oriented back testing. However, what if instead of a batch we want to run a simulation as though the model or system were acting for real? This is where an iterator is useful because it can produce inputs that behave like real inputs.

Introducing iapply

Since the standard suite of apply functions expect a complete sequence, this technique cannot be used out of the box. Instead we need to create our own apply function, which we’ll call iapply (i for iterator). It acts like the other apply functions with the exception that it understands iterators.

iapply(iterator, fn, simplify=TRUE, formatter=function(x) format(x,"%Y-%m-%d")) %as%
{
  out <- list()
  while (! is.null(input <- iterator()))
  {
    df <- data.frame(fn(input))
    if (ncol(df) > 1)
      out[formatter(input)][[1]] <- df
    else
      out[formatter(input)] <- df
  }
  if (simplify) out <- do.call(rbind,out)
  out
}

There is no magic in iapply. As shown it’s really just looping over the iterator, calling the function fn with the result of the iterator, doing some formatting, and finally collecting the result. The format function I use is for dates because my primary use case is to create a sequence of dates. I then simulate data over a sequence of dates and pass that into my system as though it were real data. The advantage is that I only write the model once, and I also don’t have to worry about accidentally using data in the past since model testing behaves exactly the same as real world operation.

Embedding control into an iterator

Since the current generator is infinite, the sequence will never stop. This means that iapply will never return. To resolve this minor detail, we need to add some control logic into the iterator. Remember that the iterator is a closure, so it is easy to add some more variables to the non-local scope and use that for control. The updated function provides a stop value, a step interval, and a way to reset the iterator back to the original starting value. There’s also an additional clause to handle character data and convert them to Date objects for convenience.

seq.gen(start, stop, step=1) %when% {
  is.character(start)
  is.character(stop)
} %as% {
  seq.gen(as.Date(start), as.Date(stop), step)
}

seq.gen(start, stop=Inf, step=1) %as%
{
  first <- value <- start - step
  function(reset=FALSE) {
    if (reset) { value <<- first; return(invisible()) }
    if (value >= stop) return(NULL)

    value <<- value + step
    return(value)
  }
}

As an example, we can now call the generator to provide a Date iterator. You can pass reset=TRUE at any time to reset the iterator to the first value.

> date.fn <- seq.gen('2013-01-01', '2013-02-01')
> date.fn()
[1] "2013-01-01"
> date.fn()
[1] "2013-01-02"
> date.fn()
[1] "2013-01-03"
> date.fn(reset=TRUE)
> date.fn()
[1] "2013-01-01"

Keep in mind that once you hit the end of the iterator, it will return NULL unless you tell it to reset.

A complete example

The iterator is now ready to use within iapply. To use the technique, simply call the generator to create an instance and pass it along with a function to iapply.

> date.fn <- seq.gen('2013-01-01', '2013-02-01')
> iapply(date.fn, function(x) rnorm(1))
 [,1]
2013-01-01 -1.311821e+00
2013-01-02 4.112014e-01
2013-01-03 6.985409e-02
2013-01-04 3.905463e-01
...

Admittedly using all this structure to generate a sequence of random numbers is counterproductive. In the next post, I’ll describe how to simulate stock price data using this approach and how to plug it into a model. It will then be clear what advantages this technique provides.

Conclusion

Generators are a powerful technique for programming a sequence. Armed with an iterable function it is possible to create simulations that operate realistically through the system. The advantage to this approach is that model development, system development, and testing can all share the same code base, meaning less work for you.

References

[1] Generators
[2] Functions

Confident package releases in R with crant

I recently released the new lambda.r package on CRAN for functional programming. This was my first new package in quite some time, and I forgot how onerous package releases to CRAN can be.

What I didn’t realize is that packages are tested against three distinct versions of R:

  • R – The latest official point release (e.g. 2.15)
  • R-patched – The latest patch release (e.g. 2.15.2)
  • R-devel – The latest source in the development branch

It just so happened that a dependency of lambda.r failed in the R-devel branch. The update then broke my package but only for one of the builds.

Most of the time worrying about these slightly different versions isn’t a big deal, but the more low-level a package is the more you have to worry about it. Lambda.R is one such package, and I was stuck in a curious situation where the CRAN maintainers were telling me the package was inconsistently failing. Joyriding aside, I’m not one for driving blind. Instead I created some tools to provide the coverage I needed, so I can release packages with confidence and not waste anybody’s time. The collection of these tools is called crant.

Features

Crant does just three things. Starting from scratch you can:

  • Create the three standalone R builds mentioned above. Each version is downloaded from source and can be updated at any time
  • Install 3rd party libraries for each build (e.g. RUnit, testthat). Basically any package that is “Suggested” but not “Required” will need to be installed prior to building your own package
  • Build and install your package. The rant script will also set the package version and perform the CRAN checks as necessary.

The key to crant is a consistent build chain for any version of R. This flexibility means that you can test your package against any set of R sources (not just the three tags above), depending on how much backwards compatibility you require.

Usage

Here is a quick guide to crant. More documentation is available on the source page.

Building the Environment

The buildenv.sh script can setup a clean OS with all the tools, such as make, gcc, gfortran, java, etc, necessary to build R from source. Note that only debian-flavored OSes are compatible. Include the -d option to install these dependencies (only do this once).

export PATH=$PATH:path/to/crant
buildenv.sh -u # Add -d if you have a clean OS

The -u option tells the script to get the R source and update it to the latest version. It will also build the R source and set up the package directory to something portable and safe from the other builds.

The end result is that you will have 3 installations of R built from source that correspond to the latest minor release (e.g. 2.15), the latest patch release (e.g. 2.15.2), and the current development source (R-devel).

Install Package Dependencies

Now armed with three R builds, you can install packages into each one. For fine-grained control, you need to do this with each version of R. I may wrap this up into a single command if there is interest.

setuplib.sh -R ~/devel/bin/R RUnit testthat
setuplib.sh -R ~/patch/bin/R RUnit testthat
setuplib.sh -R ~/release/bin/R RUnit testthat

Note that you only need to install packages that are Suggested. Required dependencies should be installed automatically during the package build process.

Build Your Package

At this point your environment is fully configured. You can now test your package against these R builds. The rant script will build, run tests, and check for CRAN compatibility. It will also optionally install the package to the R build running.

rant -v 1.0.0 -R path/to/R your.package

Note that rant will automatically set the version and date in the DESCRIPTION and package.Rd files for you. This works if you use the placeholders ‘{version}’ and ‘{date}’, respectively, in these files.

During package development, you can run the rant script whenever you want to test the integrity of the package. You will find the source packages in a directory called ‘export’. These can be uploaded to CRAN.

To summarize, anyone writing packages should take a look at crant. Having a consistent and easily reproducible build chain can greatly improve the success ratio of a CRAN upload in addition to making the process more efficient.

Functional programming with lambda.r

Tags

, , ,

After a four month simmer on various back burners and package conflicts, I’m pleased to announce that the successor to futile.paradigm is officially available on CRAN. The new package is lambda.r (source on github), which hopefully conveys the purpose of the library better than its whimsical predecessor. In some ways this new version deserves a more serious name as the package has matured quite a bit not to mention is part and parcel of a book I’m writing on computational systems and functional programming.

So what exactly is lambda.r? Put simply, lambda.r gives you functional programming in the R language. While R has many functional features built into the language, application development in R is a decidedly object-oriented affair. I won’t go into all the reasons why it’s better to write computational systems in a functional paradigm since that is covered in depth in my forthcoming book “Computational Finance and the Lambda Calculus”. However, here are the salient points:

  • Conceptual consistency with mathematics resulting in less translation error between model and system (see my slides from R/Finance 2011)
  • Modular and encapsulated architecture that makes growth of a system easier to manage (not to mention easier to accommodate disparate computing needs — think parallel alternatives of the same processing pipeline)
  • Efficiency in application development since wiring is trivial

Features

The fundamental goal of lambda.r is to provide a solid architectural foundation that remains intact through the prototyping and development phases of a model or application. One half is accomplished with a functional syntax that builds in modularity and encapsulation into every function. The other half is through the incremental adoption of constraints in the system. This article will focus primarily on the features, and in a separate article I will outline how to best leverage these features as a system matures.

Pattern Matching

Functional languages often use pattern matching to define functions in multiple parts. This syntax is reminiscent of sequences or functions with initial values in addition to multi-part definitions. Removing control flow from function definitions makes functions easier to understand and reduces the translation error from math to code.

Fibonacci Sequence

For example, the ubiquitous Fibonacci sequence is defined as

F_{n} = F_{n-1} + F_{n-2} , where F_{1} = F_{2} = 1

In standard R, one way to define this is with an if/else control block or function [1].

fib <- function(n)
{
  ifelse(n < 2, 1, fib(n - 1) + fib(n - 2))
}

Using lambda.r, pattern matching defines the function in three clauses. The behavior of the function is free of clutter as each clause is self-contained and self-explanatory.

fib(0) %as% 1
fib(1) %as% 1
fib(n) %as% { fib(n-1) + fib(n-2) }

Heaviside Step Function

When represented as a piecewise constant function, the Heaviside step function is defined in three parts. [2]

H(x) = \begin{cases}    0 & x < 0 \\    1/2 & x = 0 \\    1 & x > 0    \end{cases}

Using pattern matching in lambda-r, the function can be defined almost verbatim.

h.step(n) %when% { n < 0 } %as% 0
h.step(0) %as% 0.5
h.step(n) %as% 1

In languages that don’t support pattern matching, again if/else control structures are used to implement these sorts of functions, which can get complicated as more cases are added to a function. A good example of this is the ‘optim’ function in R, which embeds a number of cases within the function definition.

Guard Statements

The last example sneaks in a guard statement along with pattern matching. Guards provide a rich vocabulary to control when a specific function clause is executed. Each guard statement is a logical expression. Multiple expressions can be present in a guard block, so that the function clause only executes when all the expressions evaluate to TRUE. Using the Fibonacci example above, we can add an argument check to only allow integers.

fib(0) %as% 1
fib(1) %as% 1
fib(n) %when% {
  is.integer(n)
  n > 1
} %as% { fib(n-1) + fib(n-2) }

If none of the clauses are satisfied, lambda.r will complain telling you that it couldn’t find a matching function clause.

> fib(2)
Error in UseFunction("fib", ...) : No valid function for 'fib(2)'
> fib(as.integer(2))
[1] 2

Note: If you are running the examples as you are reading along, then you need to either seal() the functions or rm() the current definition prior to redefining the function. The reason is that function clauses are additive. You can add as many clauses as you want, and they will be evaluated in the order they were declared. Since lambda.r has no way of knowing when you are done defining your function you must explicitly tell it via the seal() function.

Types

Custom types can be defined in lambda.r. These types can be used in type constraints to provide type safety and distinguish one function clause from another. All types must be defined using PascalCase.

Type Constructors

A type constructor is simply a function that creates a type. The name of the function is the name of the type. The return value will automatically be typed while also preserving existing type information. This means that you can create type hierarchies as needed.

Point(x,y) %as% list(x=x,y=y)
Polar(r,theta) %as% list(r=r,theta=theta)

In this example we use a list as the underlying data structure. To create an instance of this type simply call the constructor.

point.1 <- Point(2,3)
point.2 <- Point(5,7)

Under the hood lambda.r leverages the S3 class mechanism, which means that lambda.r types are compatible with S3 classes.

Type Constraints

Types by themselves aren’t all that interesting. Once we define the types, they can be used as constraints on a function.

distance(a,b) %::% Point : Point : numeric
distance(a,b) %as% { ((a$x - b$x)^2 + (a$y - b$y)^2)^.5 }

distance(a,b) %::% Polar : Polar : numeric
distance(a,b) %as%
{
  (a$r^2 + b$r^2 - 2 * a$r * b$r * cos(a$theta - b$theta))^.5
}

As shown above each function clause can have its own constraint. Since type constraints are greedy, a declared constraint will apply to every successive function clause until a new type constraint is encountered.

> distance(point.1, point.2)
[1] 5

Attributes

Types are great for adding structure and safety to an application. However types can have diminishing returns as more types are introduced. In general lambda.r advocates using existing data structures where possible to minimize type clutter. Of course if data.frames and matrices are used for most operations, how do you differentiate function clauses? The answer of course are attributes, which come standard with R. Attributes can be considered meta-data that is orthogonal to the core data structure. They are preserved during operations, so can be carried through a process. Lambda.r makes working with attributes so easy that they should become second nature fairly quickly.

With lambda.r you can access attributes via the ‘@’ symbol. Define them in a type constructor as shown below.

Temperature(x, system='metric', units='celsius') %as%
{
  x@system <- system
  x@units <- units
  x
}

Function clauses can now be defined based on the value of an attribute.

freezing(x) %::% Temperature : logical
freezing(x) %when% {
  x@system == 'metric'
  x@units == 'celsius'
} %as% {
  if (x < 0) { TRUE }
  else { FALSE }
}

freezing(x) %when% {
  x@system == 'metric'
  x@units == 'kelvin'
} %as% {
  if (x < 273) { TRUE }
  else { FALSE }
}

It is trivial then to check whether a given temperature is freezing, based on the units. This approach can be extended to objects like covariance matrices to preserve information that is normally lost in the creation of the matrix (e.g. number of observations).

ctemp <- Temperature(20)
freezing(ctemp)

Note that the Temperature type extends the type of ‘x’, so it is also a numeric value. This means that you can add a scalar to a Temperature object, and everything behaves as you would expect.

> ctemp1 <- ctemp - 21
> freezing(ctemp1)
[1] TRUE

The combination of types and attributes are two essential tools in the lambda.r toolkit. In this section I’ve also illustrated how S3 classes can naturally be mixed and matched with lambda.r classes.

Introspection

The goal of lambda.r is to provide rich functionality with a simple and intuitive syntax. To accomplish this there is a lot of wiring behind the scenes. While most of the implementation can safely be ignored, there are times when it is necessary to look under the hood for troubleshooting purposes. Lambda.r provides a number of tools to make debugging and introspection as simple as possible.

The default output of a lambda.r function or type gives a summary view of the function clauses associated with this function. This is an abridged view to prevent long code listings from obscuring the high level summary. Any type constraints and guard statements are included in this display as well as default values.

> freezing
<function>
[[1]]
freezing(x) %::% Temperature:logical 
freezing(x) %when% {
  x@system == "metric"
  x@units == "celsius"
} %as% ...
[[2]]
freezing(x) %::% Temperature:logical 
freezing(x) %when% {
  x@system == "metric"
  x@units == "kelvin"
} %as% ...

Index values prefix each function clause. Use this key when looking up the definition of an explicit function clause with the ‘describe’ function.

> describe(freezing,2)
function(x) { if ( x < 273 ) {
TRUE
}
else {
FALSE
} }
<environment: 0x7f8cfcd7de60>

Debugging

Since lambda.r implements its own dispatching function (UseFunction), you cannot use the standard ‘debug’ function to debug a function clause. Instead use the supplied ‘debug.lr’ and ‘undebug.lr’. These functions will allow you to step through any of the function clauses within a lambda.r function.

Examples

All examples are in the source package as unit tests. Below are some highlights to give you an idea of how to use the package.

Prices and Returns

This example shows how to use attributes to limit the scope of a function for specific types of data. Note that the definition of Prices makes no restriction on series, so this definition is compatible with a vector or data.frame as the underlying data structure.

Prices(series, asset.class='equity', periodicity='daily') %as%
{
  series@asset.class <- asset.class
  series@periodicity <- periodicity
  series
}

returns(x) %when% {
  x@asset.class == "equity"
  x@periodicity == "daily"
} %as% {
  x[2:length(x)] / x[1:(length(x) - 1)] - 1
}

Taylor Approximation

This is a simple numerical implementation of a Taylor approximation.

fac(1) %as% 1
fac(n) %when% { n > 0 } %as% { n * fac(n - 1) }

d(f, 1, h=10^-9) %as% function(x) { (f(x + h) - f(x - h)) / (2*h) }
d(f, 2, h=10^-9) %as% function(x) { (f(x + h) - 2*f(x) + f(x - h)) / h^2 }

taylor(f, a, step=2) %as% taylor(f, a, step, 1, function(x) f(a))
taylor(f, a, 0, k, g) %as% g
taylor(f, a, step, k, g) %as% {
  df <- d(f,k)
  g1 <- function(x) { g(x) + df(a) * (x - a)^k / fac(k) }
  taylor(f, a, step-1, k+1, g1)
}

Use the following definitions like so:

> f <- taylor(sin, pi)
> v <- f(3.1)

References

[1] Julia, I Love You

[2] Heaviside Step Function

Datacentric product development and the rebirth of engineering

An old irony in New York is the ubiquity of the ‘gourmet deli’. It is hard to find a deli that doesn’t proclaim to be gourmet. It is so commonplace that the word gourmet has lost all of its original meaning and perhaps taken on the opposite meaning. A similar phenomenon happened with the word engineer resulting in a dilution of its meaning. I recently heard someone describe themselves as a ‘domestic engineer’. While presumably tongue-in-cheek, the overuse and perhaps abuse of the term engineer has precipitated its loss of meaning and value. Other pithy examples include ‘sanitation engineer’, ‘sales engineer’, or ‘quality assurance engineer’.

Many will argue that title doesn’t really mean all that much and for the most part I agree. However being an engineer used to be more than a title: it also defined your role, your education, and your liability. I’m biased because I have a degree in Electrical Engineering and have gone through the rigors of the discipline culminating in the 8-hour long marathon Engineer In Training exam. One of my professors would state this disclaimer at the start of his Communications (where you design a wireless transmission network) class: “There are two things you won’t be able to do with a degree in Electrical Engineering: wire a house and fix a TV”. We all laughed at the time but it took time for the real message to sink in. His point was that engineering is a professional discipline that focuses on theory and design. We learned why things work and how to predict the behavior of systems we design based on first principles. It is not a technical trade that teaches you how to do perform specific functions. So we may not know how to wire a house, but it didn’t matter because under the hood we knew all the underlying theory.

The curious thing about engineering is that we often get confused as people who build things. The world of software has reinforced this stereotype, which I think needs to be addressed. At our core engineers are problem solvers: people who use mathematics, physics, and other scientific fields along with analytical skills to solve practical design problems. A civil engineer designs bridges and transforms building architectures into safe, viable structures. Electrical engineers design microchips, computers, wireless networks, power stations. Aeronautical engineers design aircraft. The list goes on and on. Note that in all these domains an engineer does not build these things as that is the domain of manufacturing or construction. Engineering is about design and prototyping that culminates in an explicit and detailed plan that describes how to build or manufacture the end product. The epitome of this philosophy comes from Qualcomm who revolutionized the integrated circuit (IC) business by focusing only on the IP surrounding the design of wireless ICs and eschewing manufacturing altogether. Other companies like ARM Holdings have since followed suit.

So where does that leave software? The challenge with software is that it is so widely adopted and the tools so easy that almost anyone can program these days (and in fact is now being taught to primary school students [1]). Knowing how to program is different from being a programmer, which is different from being an engineer. For example mathematics is taught throughout primary and secondary education but no one in their right mind would call themselves a mathematician unless they have a degree in mathematics. Most people I know with a math degree (including myself) do not refer to themselves as a mathematician. The same is true of physics: we learn a fair amount of physics but we don’t call ourselves physicists. The same really should apply to software: just because you know how to write software doesn’t mean you are a software engineer (a term I don’t particularly like). To be an engineer your understanding needs to be more than how to write software. You also need to understand data structures and algorithms, system design, resiliency and redundancy, analysis, numerical methods, etc. Without this foundation there is little guarantee that in production a system will operate at a level of reliability and accuracy that is expected of the system. This isn’t so different from a lay person building a house. While the house may stand and be livable, there is little analysis surrounding its structural stability, how much load it can bear, etc. I’m close to throwing out the baby with the bath water, so where does that leave us?

We are now truly in an information age, where even the most mundane products have become datacentric. This means that a product or service cannot exist without data to drive it. I’m not talking about Salesforce.com where you enter in data and retrieve it later. I’m talking search engines, mapping tools (Google Maps), recommendation systems (Amazon, Netflix), digital personal assistants (Siri), etc. I call this datacentric product development. Finance has been doing this for years under various guises. It is particularly prevalent in portfolio management, trading, and risk management, not to mention applications like creditworthiness or fraud detection. The term Financial Engineer embodies the role, which involves equal parts analysis, modeling, and development. These people are typically part of business groups and not part of the greater IT organization, which is usually designated a cost center. What is remarkable is that the role of Financial Engineer embodies this notion that being an engineer is about applying analytical methods to solve problems. The solutions just happen to be written in software. As datacentric product development becomes more widespread, being just a programmer will not be sufficient.

Organizationally I think this model is what technology product companies need to move towards. Product development needs more engineers and fewer programmers. What this means is that a single role must be responsible for understanding the business domain, assessing the problem, designing a solution, and illustrating that it works. The toolset of the engineer is more than just a Java IDE. It also includes working in languages like R or Matlab to analyze data and test hypotheses. In many cases this will also mean building a production system, while in other cases it may mean handing off to a vendor that specializes in building software. In some ways I’m trying to dial back the clock to the “good ‘ol days” but not for reasons of nostalgia. It’s really about recognizing what it means to be an engineer and what datacentric product development needs. (In a separate article I’ll share my thoughts on what I think an optimal organizational structure looks like when you adopt this model.)

The last question is what do we call this legion of skilled engineers? I’ve already professed my distaste for Software Engineer as I think this term confines the discipline to just software, which I’ve argued is only a small portion of the overall capability a person in this role needs. I could cheat since the organization I’m building is in the financial industry and stick with Financial Engineer but that still leaves an unnamed army to fend for themselves. Instead the emerging field of Computational Engineering [2] embodies the characteristics that I think datacentric product development needs. The reason is that the Computational Sciences are known for solving domain-specific problems using analytical and numerical methods, which of course includes writing software. People in these fields face the same challenges of those in product organizations: scaling, performance, reliability, accuracy of the model, etc. The term has not been widely adopted for Internet and business technology, but it is not difficult to see how it applies here as well.

If you’ve bought this argument and are interested in being a part of a new breed of engineer, don’t hesitate to contact me.

References

[1] Code Club: After-school group teaches children how to become programming whizz kids

[2] Graduate Education for Computational Science and Engineering

 

Follow

Get every new post delivered to your Inbox.

Join 38 other followers