Das Auto meets Car Talk

I recently entered the dark world of automotive diagnostics. A bad diagnostic trouble code and a jaw dropping repair quote led me down this path. What started off as an unsteady and reluctant journey has yielded some hidden treasures in the form of a standard microcontroller and an endless source of data.

Since the mid nineties (1996) all vehicles in the US by law must support a standard set of diagnostic codes called OBD-II. These are monitored and reported by the engine control unit (sometimes also known as the ECM). These codes can be read from scan tools that range from simple to quite sophisticated. Under the hood many of the aftermarket scan tools use a PIC microcontroller loaded with Elm Software’s firmware known by its model number, ELM327. The ELM327 normalizes the various protocols that ECUs speak providing a single API to program against.

In short to talk to your car you need the following items:

For example, I used the BAFX bluetooth reader (which actually looks like an ELM knock-off — oops), which easily establishes a bluetooth connection with my laptop creating a new serial port. I then connect using the free miniterm that is linked from pyserial’s website. I also tried pyobd, which has potential but didn’t support my car’s handshake quite right. After modifying some of the source, I found it easier to use miniterm directly until I knew what I was looking at.

Notes About elm and Serial Connections

Serial connections can seem like a bygone era populated with such quaint devices as phonographs and ham radios. In reality they are still a staple to any device manufacturer that needs to directly communicate with an embedded device. These protocols are easily defined and implemented, and many protocols are so widespread that drivers are readily available. Although fairly simple, serial interfaces are not very forgiving if the basic communication parameters are incorrect. For ISO9141-2 (what I’m connected with) you need to ensure that parity is off, there are 8 data bits, 1 stop bit, and a baud rate of 10.4k. The underlying protocol used to communicate to the ECM the link to your computer seems to require a compatible baud rate. I haven’t verified this, but I think the reason is that many of the commands are passed through to the ECU. I’ve read that for older cars ELM327 is too fast, but it appears that by setting the baud rate properly fixes this issue.

Anyone serious about interfacing with their car should really read the elm data sheet. However I’ll admit that data sheets aren’t for everyone (one of my “talents” is to enjoy “boring stuff”), so here are some salient points. To start, any commands that begin with AT are for the microcontroller itself and not for the ECM. ELM327 is case insensitive and also ignores whitespace. This means that ATZ and AT Z are equivalent. A command that is unrecognized returns with a ‘?’ while the ECM will respond with NO DATA (which happens to be ambiguous at times — more on that later). One final note is that ELM327 by default only sends carriage returns. Make sure you handle this correctly on linux machines otherwise the output will be difficult to use.

Handshake and Initialization

When connecting to an elm device, it typically responds with the version of the firmware it is running. This is equivalent to resetting the system, which is done with AT Z. After this step you need to choose a protocol to speak to the ECM. With my device this seems to be automatic, but if you plan to write software to connect it is better to make this explicit via AT SP Ax (x = protocol). Note that this will write to the EEPROM and set this as the default (with fallback as a search). To just try a protocol, use AT TP x.

Once connected you can verify the protocol used (should be the default) with AT DP. Note that when connecting via software you will want to turn off echo (AT 0) so that the data received is strictly the response and not the echo of your input. The final step in this process is to issue the 0100 command directly to the ECM, which requests all supported PIDs between 00 and 10 in hex. A PID is actually two things: the mode and the parameter ID. The modes act as a high-level grouping. The response is a sequence of bytes where each bit acts as a flag indicating whether a p-code is supported starting at 0101 and ending at 0120 in hex. The mapping is taking each byte in sequence, so expand the hex values to binary and move along in that order. For example if your first byte is 8A, this translates to

hex:       8       A
     ------- -------
     0 1 0 0 1 0 1 0
PID: 1 2 3 4 5 6 7 8

OBD-II responses are variable length but start with a predictable sequence. Valid responses begin with 4x, where x is the mode and the following byte contains the PID. The remainder is the output from the ECM. These values are also in hex and need to be converted. If you get a NO DATA response then nothing was returned from the ECM. From what i gather this could result from one of two things: an unsupported code or a faulty sensor that is not returning data. To know which it is requires ensuring that you interpret the output of the 0100 command properly.

Next Steps

So where to from here? For a data junkie like me the next step is reading real time data and plotting the results. It could get interesting integrating the car data with GPS. Feel free to send me ideas.

Multipart anonymous functions

I didn’t think this was possible, but as I was perusing some code (and consequently making a fork), I discovered there is syntax to support this. Check out these two examples.

Pattern Matching

1> P = fun(undefined) -> null; (X) -> X end.
#Fun<erl_eval.6.13229925>
2> P(undefined).
null
3> P(42). 
42

Guard Expressions

4> G = fun(X) when X==undefined -> null; (X) -> X end.
#Fun<erl_eval.6.13229925>
5> G(undefined).
null
6> G(42).
42

 

Quantitative finance and computational systems

I’m writing a book proposal based on the lecture notes for my R for Quants workshop I conducted at the Baruch MFE program. I’ve discovered that a book proposal is remarkably similar to a business plan: you need to identify your target audience, the size of that audience, what differentiates your from other books in the market, and why you’re qualified to write that book. As an exercise to focus my own thoughts on this process, this post touches on the motivation for such a book.

The purpose of this book is to illustrate the value of bringing together quantitative analysis and modeling with computer science and systems development into a holistic discipline. The emphasis is on the use of functional programming as the conceptual force that binds the quantitative model with the systems model. Joining two fields that at times seem mutually repellent yields a program structure that readily supports a sequential development process that easily transitions from analysis to modeling to systems development. While this approach expects more from practitioners, we argue that the gains are well worth the effort. Regardless of our enthusiasm, quants face ever larger data sets, more computationally demanding problems, and real-time demands (not to mention shrinking budgets). It is thus more challenging to solve contemporary problems without having a solid foundation in both quantitative methods and computer science. Perhaps the term Financial Engineer has gained currency precisely because academics and practitioners recognize the need to emphasize both aspects of the discipline. The world of high frequency trading has already embraced this model where the superstars are experts in both domains. While not everybody needs to fit this mold, at a minimum practitioners must appreciate both sides of the coin to maximize the effectiveness of their analytical systems.

Convincing the reader of such a claim requires a stroll through computer science history, touching on the origins of mathematical thinking, statistical programming, and programming paradigms. On the surface such an exposition may seem superfluous, when in fact it provides a conceptual and contextual foundation for the design principles discussed later in the book. For example, discussing the merits of a functional programming paradigm requires one to understand other programming paradigms and how they impact systems design. These discussions are viewed through the lens of quantitative finance with contrasting examples touching on modern portfolio theory, asset pricing, signal generation, portfolio optimization, etc. In most cases the principles highlighted in these examples are not exclusive to quantitative finance and can be applied to other computational fields. As an example, the price change of an instrument can be modeled various ways. Certain situations may require balancing precision of the model with performance of the calculation. Identifying which implementation provides the optimal balance means swapping out one model for another. A poor system design makes this process tedious and error prone. The lesson can be applied equally well irrespective of whether the model is for modeling asset prices or sodium channels in the brain.

Once the fundamentals are covered, the core of the book will examine various functional approaches and how these methods facilitate the modeling process. Functional programming concepts like first-class functions, higher-order functions, side effects, and pattern matching will be presented alongside numerical optimization and linear algebra to illustrate this interplay. We rely on two parallel reference implementations (Reference Classes and our own futile.paradigm) to drive our main arguments despite the plethora of programming models available in R. At times we also discuss the S3 and S4 object systems since they are both integral parts of R and are unavoidable in daily use. In order to avoid reams of source listings, at times transformations between implementations will seem like magic, similar to jumps in a proof. We make heavy use of diagrams to show the conceptual connections between the quantitative model and the software model as a way to fill in these gaps.

Next we examine more advanced topics such as developing practical strategies to cope with big data, on-demand analytics, and high performance computing and the impact these scenarios have on the modeling process. A common theme is how the superior modularity of functional programming enables the quant to easily tune and replace models as conditions and constraints change over time. An equally important benefit is how functional programs with limited side effects are easier to parallelize, which means that not only is the model pluggable but so is the wiring. Packages like ‘foreach’ and ‘snow’ can be drop-in implementations that leverage parallelism behind the scenes if done correctly. Similar strategies can be applied for GPU-based computations. When done incorrectly, these enhancements can act as shackles that prevent alternative analytics as the logical flow is stuck in the one path. Proper systems design and functional programming techniques simplifies the process of adding these features without disrupting the modeling process.

An interdisciplinary book of this nature runs the risk of alienating all interested parties. To even pick up the book requires more than a passing knowledge and interest of both quantitative fields and computer science. Many introductory concepts are glossed over in order to maintain focus on the core discussion. Knowledge of basic statistical and numerical methods, machine learning, and programming concepts are all assumed, though we do provide copious references to literature as well as refresher notes in the appendix. When it makes sense, we do spend extra time establishing core concepts. For example we expect most quants to know the basics of object oriented programming but little if anything about its dual, functional programming. Coming from the other side, software engineers transitioning into quantitative development should be comfortable with basic statistics and linear algebra. The non-finance audience will find that many of the analytical methods are industry-independent. For those in the financial industry, many of the techniques and systems discussed can be used as is.

While assuming a fair amount of knowledge, at the same time this book is not a research book. Experts looking for new approaches to quantitative analysis or computer science will not find it here. The contribution of this book is really the marriage of these fields and the approaches one must take to satisfy the demands of modern computational systems. Our hope is that the diverse background of quants is fertile ground for such a book, and the desire to optimize various processes and systems applies equally to financial engineering itself.

R for Quants, Part III (A)

This is the third part in a three part series on teaching R to MFE students at CUNY Baruch. The focus of this lesson is on programming methods and application development in R.

Contents

PART I: PRELIMINARIES

PART II: STATISTICS

PART III: STRUCTURING CODE

Object-Oriented Programming

Depending on with whom you speak, you may hear that R is object-oriented. Others will say it’s functional. In fact it’s both and neither simultaneously. In R, object-oriented programming centers around how functions are dispatched and less about how code is structured. S3 introduces a class attribute and a polymorphic dispatching system, which resembles functional programming. In S4 certain embellishments give the illusion of a class-based programming model. The two systems are mostly compatible, but there are instances where there can be conflicts. Large projects like RMetrics and BioConductor heavily use the S4 style, but many smaller projects do not really benefit from the added complexity of S4.

S3 Classes and Dispatching

The simplest dispatching system is object-oriented in the sense that a function is called based on the ‘class’ of the first argument. A variable’s class is simply an attribute attached to the variable.

> class(h)
 [1] "xts" "zoo" "returns"
> attr(h, 'class')
 [1] "xts" "zoo" "returns"

When calling a function the actual implementation depends on whether the generic function is S3 or not. If it is, the definition will typically defer to a separate function called UseMethod. This function will dispatch to a concrete implementation based on the class of the first argument. The matching function will be named

dispatched function := base function "." class

If no such function is found, then the default function is called. As an example, let’s look at the function mean:

> mean
function (x, ...)
UseMethod("mean")
<bytecode: 0x1051616a8>
<environment: namespace:base>

This function has a number of implementations including a default function mean.default. Try (methods(mean) to see what’s available). Hence, to get the mean of the returns our portfolio, mean(h) will dispatch to mean.default since there are no declared functions for any of the classes associated with h.

> mean(h)
[1] 0.001991222

Unfortunately, this isn’t the behavior we want. Rather, we want to see the mean for each asset. We can accomplish this by implementing a new function mean.zoo (which would then apply to any zoo objects).

> mean.zoo <- function(x, ...) apply(x, 2, mean, ...)
> mean(h)
        AAPL          XOM           KO            F           GS
0.0023180089 0.0021953922 0.0002628634 0.0028525868 0.0023272563

This technique can be used to create new functions as well as add implementations to existing S3 methods.

S4 Classes and Dispatching

While S3 is simple yet powerful, it doesn’t offer much in the way of programmer safety. Since the class attribute can be changed at will, it’s easy to break the convention and consequently other people’s code. The S4 system attempts to formalize object-oriented programming. It introduces constructors, type safety, inheritance and other features typically associated with object-oriented programming languages.

Classes are defined using the setClass and setClassUnion functions.

setClassUnion('XtsNull', c('xts','NULL'))
setClass('Equity',
  representation(ticker='character', returns='XtsNull'),
  prototype=list(ticker='', returns=NULL))

Methods are then attached to the class using the setGeneric and setMethod functions.

setGeneric('beta', function(equity, market, ...) standardGeneric('beta'))
setMethod('beta', c('Equity','Equity'),
  function(equity, market) {
    cov(equity@returns, market@returns) / var(market@returns)
  })

Instances are created with the new function.

> xom <- new('Equity', ticker='XOM', returns=h$XOM)
> mkt <- new('Equity', ticker='^GSPC', returns=h[,'^GSPC'])
> beta(xom,mkt)
        ^GSPC
XOM 0.8693016

Clearly one cost of the S4 system is the added overhead in programming. It is not so easy to transition from exploratory programming to formal applications because S4 demands a lot of structure from the beginning.

There are also now ReferenceClasses, which is like S4 but objects are mutable, creating an even stronger object-oriented paradigm within R.

Functional Dispatching

While much emphasis has been on object-oriented programming in R, other programming paradigms are equally valid. Functional programming has become popular once again, and R is particularly suited for this programming style.

Lambda.R

(Note this section is updated to reflect the latest generation of my functional programming package. Lambda.r replaces the older futile.paradigm.)

R has its roots in both S and Scheme. Many of the improvements to S (e.g. lexical scoping) is directly attributed to Scheme, which is a functional language derived from LISP. The lambda.r approach borrows additional concepts from the functional world so programs can be structured functionally*. This package attempts to return R to an environment that is conducive to iterative development that leads to structured programs. In fact, this is one of John Chambers’ original goals for the S language [1]. Lambda.r introduces syntax to write multipart functions reminiscent of Erlang or Haskell.

Functions in lambda.r are defined as multipart definitions. The advantage of this approach is that data manipulation is kept separate from application logic. The drawback is that it can be more verbose. Multipart functions are defined as separate clauses each with corresponding guard statements or type constraints. Guards define the conditions for executing a particular implementation while type constraints restrict the function to specific input types. Here is the beta implementation again using a type constraint

beta(equity, market) %::% Equity : Equity : numeric
beta(equity, market) %as%
{
  cov(equity$returns, market$returns) / var(market$returns)
}

Each function clause is started with a %when% operator, which supports multiple conditional expressions. Each expression must evaluate to true for the function to execute. The actual function definition is then specified by the %as% operator. Supporting additional signatures is as easy as adding another function clause. In this example we only use a guard to show the different ways of defining functions.

beta(portfolio, market) %when% {
  portfolio %hasa% returns
  market %isa% Equity
} %as% {
  cov(portfolio$returns, market) / var(market)
}

* Note that I’m the author of lambda.r and futile.paradigm, so the discussion is slightly biased.

A Type System

Lambda.r offers its own type system. Types are simply data structures tagged as a particular data type. We avoid using the word ‘class’ to avoid confusion with the legacy OOP programming models. Types are instantiated by calling its type constructor, which is defined like any other function in lambda.r,

Bond(T, coupon=0.02, tenor=10) %as% {
  list(coupon=coupon, tenor=tenor)
}

Creating an instance of the type is the same as before,

> bond <- Bond()
> bond$coupon
[1] 0.02

In the above example, the astute reader will wonder how Bond can be represented as raw types. S3 and S4 classes are not native to the language, so they must be wrapped in quotes and represented as strings. Lambda.r provides syntactic sugar to allow the use of raw types. To use this feature, the types must be specified as PascalCase, otherwise they, too, must be enclosed in quotes.

Attributes

When working with data it is convenient to attach meta-data to an object. For example with a matrix that represents equity return correlations, it is useful to add attributes that indicate the number of observations or periodicity of the input set. This can be done easily with attributes in lambda.r. In the beta example above we might want to retain the date of the calculation. The ‘@’ symbol is used to access attributes and can be used in any lambda.r definition (both type constructors and functions).

beta(portfolio, market) %when% {
  portfolio %hasa% returns
  market %isa% Equity
} %as% {
  b <- cov(portfolio$returns, market) / var(market)
  b@date <- portfolio$date
  b
}

Ultimately the choice of programming style depends on the author of the software and the domain in use. In finance many concepts are directly related to mathematics, which itself is functional, so translating these ideas to code is much simpler than in object-oriented contexts [2].

References

[1] J. Chambers. Evolution of the s language. In Proceedings of the 20th Symposium on the Interface. The Interface Foundation of North America, 1996.
[2] B. Rowe. A Beautiful Paradigm: Functional Programming in Finance. R/Finance 2011, 2011.

R for Quants, Part II (A)

This is the second part in a three part series on teaching R to MFE students at CUNY Baruch. The focus of this lesson is on basic statistical methods in R.

Contents

PART I: PRELIMINARIES

PART II: STATISTICS

PART III: STRUCTURING CODE

Distributions

As a statistical programming language, R provides many functions for describing distributions. Standard distributions include normal, binomial, uniform, chi-squared, etc. Others are available from add-on packages.

Distribution functions follow a standard pattern in R, which makes it easy to use any of the available variants. The stats package, which is automatically loaded, provides functions for many common distributions. To see what is available, try ?Distributions.

Prefix Description Example
d* density function (PDF) dnorm
p* distribution function (CDF) pnorm
q* quantile function (inverse CDF) qnorm
r* random generation rnorm

Let’s look at a plot for each of these functions. Notice the usage of par and plot to generate each chart. When adding multiple charts to a plot, special parameters must be set to configure the layout. When you are done, it’s good practice to reset the configuration by calling par again with the old par object (as returned by the function previously).

opar <- par(mfrow=c(2,2), mar=c(3,5,1,1)+0.1)
x <- seq(-5,5, by=0.1)
y <- seq(0,1, by=0.02)

plot(x, dnorm(x), t='l')
plot(x, pnorm(x), t='l')
plot(y, qnorm(y), t='l')
plot(x, rnorm(x))
par(opar)

The graphics capabilities in R are quite extensive [1] but can be quite cumbersome to work with. There are numerous packages that attempt to improve the way graphics are handled in R including ggplot, lattice [3].

Exercise: Write a function plot_dists(name) that generates the above plot for the given distribution.

Analyzing Distributions

It is easy to spot check whether a given set of data fits a distribution by using a QQ plot. The qqnorm and qqline functions use a normal distribution as a default. For other distributions, use qqplot to explicitly specify the x data series.

h <- getPortfolioReturns(c('AAPL','XOM','KO','F','GS'), 100)
qqplot(h$XOM)
qqline(h$XOM)


As expected, the tails do not fit a normal distribution. To align the theoretical quantiles with the sample quantiles, you have two approaches: normalize the sample data or use a different set of theoretical quantiles. [2]

z <- (h$XOM - mean(h$XOM)) / sd(as.vector(h$XOM))
qqnorm(z)
qqline(z)

Exercise: Generate a QQ plot using a better fit for the theoretical distribution. Hint: Try using rnorm.

Parameter Estimation

To fit data to a particular distribution, you can use any of the numerous optimization functions (optim, nlm, optimize, etc) to minimize your log likelihood function. Here is a contrived example of estimating the mean and variance of one of our stocks.

get_estimator <- function(x)
{
  n <- length(x)
  function(theta)
  {
    n/2*log(2*pi) + n/2*log(theta[2]) + 1/(2*theta[2]) * sum((x-theta[1])^2)
  }
}

Let’s execute the function and compare with the built-in functions.

> f <- get_estimator(h$XOM)
> optim(c(0,1), f)$par
[1] 0.0020516656 0.0002079123
> mean(h$XOM)
[1] 0.002052150
> sd(h$XOM)^2
         XOM
0.0002100090

Note the use of a closure in this example. Once we have the closure, it is easy to pass the closure to other optimization functions for experimentation. We will discuss optimization in more detail in the next section.

Sampling Data

R provides many methods for sampling data from an existing data set, depending on what you want to accomplish. The simplest is sample, which selects entries from the specified vector. By default, the probability of selecting a specific element is equally-weighted over the elements. For example, this is appropriate when drawing cards from a deck.

> cards <- sapply(c('H','S','C','D'), function(x) paste(x, 1:13, sep=''))
> sample(cards, 5)
[1] "C12" "S6" "H10" "S12" "D7"

To bootstrap a data set, set replace=TRUE, which will reuse all elements for each sample selection. This option is appropriate when flipping a coin multiple times.

> sample(c('H','T'), 10, replace=TRUE)
 [1] "T" "H" "H" "H" "H" "T" "T" "T" "H" "H"

For more sophisticated bootstrapping, see the boot package [4].

Covariance and Correlation

Not surprisingly, there are built-in functions for calculating the covariance (cov) and correlation (cor) between data. You can even convert between them with cov2cor.

> cov(h$XOM, h$F)

Exercise: For portfolio h, write a function to calculate the covariance matrix without using cov.

Regression

Linear regression is performed by using the lm function, which fits data to linear models. We can use it to estimate the parameters of the CAPM for our toy portfolio. Note we assume the risk free rate is 0 to simplify the example.

> ws <- t(sapply(1:100, function(x) rdirichlet(rep(1,5)) ))
> p <- apply(h * ws, 1, sum)
> m <- getPortfolioReturns('^GSPC',100)
> m$portfolio <- p
> colnames(m) <- c('market','portfolio')
> out <- lm(portfolio ~ market, m)
> summary(out)

In this example we are generating random weights for each day to calculate our portfolio returns. Afterward, we download our returns for our market and run the regression.

> plot(as.vector(m$market), as.vector(m$portfolio))
> abline(out)

A good discussion of doing regression analysis in R is in [5]. Other forms of regression analysis are supported as well but is out of scope for this document.

References

[1] An Introduction to R Graphics
[2] Fitting Distributions With R
[3] Lattice - Multivariate Data Visualization with R - Figures and Code
[4] Bootstrapping Regression Models
[5] Practical Regression and Anova using R

R for Quants, Part I (B)

This is a continuation of the R workshop I’m teaching at the Baruch MFE program. This section discusses the programming model of R in a slightly biased way. The full contents are below.

Contents

PART I: PRELIMINARIES

PART II: STATISTICS

PART III: STRUCTURING CODE

Function Definition and Evaluation

Defining a function is done via assignment, similar to any other variable.

f <- function(x) 3 * x + 2

This function can be executed as f(5). Since R is vector-based, a vector can be passed into the function as is giving results for each element in the vector.

> f(-5:5)
 [1] -13 -10 -7 -4 -1 2 5 8 11 14 17

R is a dynamically typed language so any compatible argument will be evaluated. We’ll see in the third part how this is combined with the dispatching systems to implement polymorphism (as well as class systems).

Note that in lambda.r, it is possible to define multipart functions that eschews direct assignment with a declarative notation. This is covered in my introduction to lambda.r.

Named Arguments

In the example above, the argument was passed into the function as a positional argument. Named arguments are also supported, which allows you to specify arguments in any order you choose, assuming that the names are valid.

f <- function(x,y) (x-5)^2 + (y+2)^2
f(y=3,x=4)

Optional Arguments

Not all arguments need to be specified in a function call. When defining a function, any argument can provide a default value. When calling a function, any argument not explicitly passed into the function will then be populated by the default value.

f <- function(x,y=3) (x-5)^2 + (y+2)^2
f(4)

The Ellipsis Argument

Sometimes a collection of arguments need to be passed onto another function. This happens frequently with functions that call plot functions though it’s useful in numerous situations. Essentially any unmatched arguments will populate the ellipsis arguments, which can be passed along to another function as is.


f <- function(x, ...) plot(x, ...)

f(rnorm(100), main="100 Random Values")

The ellipsis argument can be manipulated in other ways, but that is out of scope for this discussion.

First Class Functions

All functions are first class in R, which means you can pass them around like any other variable. This property is used throughout R, where numerous higher order functions are used to operate on data.

Like many languages, a single statement does not require braces, although multi-line definitions must be formally blocked. Unlike many C variants, R does not require an explicit return statement at the end of the function definition. Whatever is the result of the last statement is returned by the function.

Higher Order Functions

When working with data structures it is useful to perform actions against each column (or row) of data. In other languages this would be accomplished using a loop while in R a higher order function is employed. The most basic of these is apply. This function is similar to the common higher order function map but operates on an array or matrix.

> h <- getPortfolioReturns(c('AAPL','XOM','KO','F','GS'), 100)
> apply(h, 2, sd)
      AAPL        XOM         KO          F         GS
0.01719355 0.01528400 0.01050431 0.02480413 0.03114184

When working with lists, lapply is typically the variant to use. Other variants include sapply (simplify result), mapply (multivariate sapply), and tapply (table data). In certain cases, do.call can be used to execute a function passing arguments to the function as a list.

Common higher order functions like fold or reduce are not built-in, although there are packages that provide these functions.

Exercise: Use a higher order function to separate up days from down days for each asset in h above.

Lambda Expressions

In the example above, sd was used as a function reference. If the default behavior of sd is not desired, we can construct an anonymous function that defines custom behavior.

> apply(h, 2, function(x) sd(x, na.rm=TRUE))

Anonymous functions like these are used throughout R. Note that lambda expressions are best when they are short and concise. Using this approach with longer definitions can make code hard to read.

Closures

A closure is a function with an associated environment. They can be constructed easily in R. An important consideration is that all referenced variables in the closure are by default read-only. To change their value a special assignment operator must be used, which will search through parent environments until a matching variable is found.

counter <- function(start=0)
{
  x <- start
  function() { x <<- x + 1; x }
}

The above function can be evaluated as,

> f <- counter(5)
> f()
[1] 6
> f()
[1] 7

Under what situations are closures useful? Any time a function reference is passed as an argument to a function, a function signature is implicitly defined. If your function does not have the same signature, it is necessary to wrap your function in another function that matches said signature. While a direct call can be made, sometimes it’s better to delay evaluation, in which case return a function reference is the best choice.

Exercise: Use a closure to generate a parameterless function that caps returns to some threshold x.

R for Quants, Part I (A)

I’m teaching an R workshop for the Baruch MFE program. This is the first installment of the workshop and focuses on some basics, although we assume you already know how to program. Below are the contents for the full workshop.

Contents

PART I: PRELIMINARIES

PART II: STATISTICS

PART III: STRUCTURING CODE

Getting Started

To get the most out of the workshop, you need to have basic programming knowledge. At a minimum you should understand what control structures are and how variable scopes work.

Somewhere you will need to have a working copy of R. As R is open source and popular, it’s available on all major operating systems. To write R code you can use a standard text editor, like vim, or obtain an IDE (e.g. Eclipse or RStudio) if you prefer a visual editor. For the workshop, we will stick with vim.

While R is a language that comes with “batteries included”, there are additional packages that you will need for the workshop. These include:

Note that installing tawny will get you all its dependencies including futile.paradigm and PerformanceAnalytics.

Getting Help

There are a number of ways to get help. The most direct way is to use the R shell. Most functions provide a documentation page that is retrieved by prefixing the function name with a question mark. e.g. ?lm opens a help page on the function lm. If you don’t know the specific function, try the help.search() command. At this point, you should know how to get help for this function!

Search engines are always an option, but with R it can at times be problematic due to the genericness of the letter. Many people have developed solutions to this problem, with the most popular being rseek or a filtered Google search.

The R community has a number of mailing lists for getting help in addition to special interest groups (e.g. R-SIG-Finance), while the younger generation seems to have opted for online Q&A sites as a primary resource.

The R Shell

There are a number of useful functions for interacting with the R shell. In many ways you can think of it as being a lightweight version of bash. Common operations like ls() and rm() exist to list the objects you’ve created as well as remove them. You can also view your search() path and see which packages are loaded.

To install a new package from the shell, use the install.packages() command. R will download and build the package while you wait. Don’t forget to load the package after you build it with the library() function.

install.packages('tawny')
library(tawny)

Examining Objects

Any object can be examined directly in the shell by typing its name. Note that some objects, like matrices, will flood your shell, so be careful.
Since R is open source, functions written in pure R can also be viewed directly in the shell. This is useful for learning R as well as debugging code.

Vector Primitives

In R, everything is a vector. This means that even primitives have a length():

> length(4)
 [1] 1

This seemingly strange idea makes translating mathematical notation into code very easy since vector notation is built-in. That means no loops just to add two vectors together.

> 1:5 + c(1,2,3,4,5)
 [1] 2 4 6 8 10

It also means mathematical properties are honored by default so operators behave as you expect. As we’ll discuss later, this behavior also extends to matrices.

> c(2,3,4) + 2
 [1] 4 5 6
 > c(2,3,4) * 2
 [1] 4 6 8

From the examples, you can find two ways to create vectors. Other methods include seq(), which creates a sequence of numbers based on a variety of rules.

Subsetting Notation

To access elements within a vector, R provides many handy built-in constructs. The simplest is an indexing notation. More complicated expressions can be applied as well.

> a <- 1:10
> a[4]
 [1] 4
> a[a>6]
 [1] 7 8 9 10

This works because R is evaluating the expression across the vector so any function that returns booleans properly indexed to the vector will yield deterministic results. This property is used to apply sorting over a vector.

> b <- sample(1:10)
> b[order(b)]
 [1] 1 2 3 4 5 6 7 8 9 10

What do you suppose is the output of the order() function?

Elements in a vector can also be named. Once defined, these names can be used to access elements.

> names(a) <- strsplit('abcdefghij','', fixed=TRUE)[[1]]
> a
 a b c d e f g h i j
 1 2 3 4 5 6 7 8 9 10
> a['c']
 c
 3

Operators and functions

By default vectors are defined as column vectors. This is true of the internal data structure as well. To create a row vector, use the transpose function,

> t(1:4)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4

Other common functions like inner and outer product are defined as operators,

> c(1,2,3) %*% c(3,2,1)
     [,1]
[1,]   10
> c(1,2,3) %o% c(3,2,1)
     [,1] [,2] [,3]
[1,]    3    2    1
[2,]    6    4    2
[3,]    9    6    3

When working with vectors, R tries its best to protect you from any obvious mistakes, like incompatible lengths between the operator. In general, R attempts to do the right thing while issuing errors for any glaring problems.

Arrays and Matrices

Arrays are vectors that have a dim(ension) attribute. Matrices are simply two-dimensional arrays. Each of these types have a constructor: array() and matrix(), respectively. When creating a matrix, note that it is constructed along columns. You can override this behavior but be aware that the performance may degrade since the internal representation is based on columns.

> matrix(1:6, nrow=2)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Since matrices have two dimensions, the names() function will not work to access a matrices column or row names. Instead there are colnames() and rownames(). Similarly, length() is not appropriate for matrices; use dim().

Subsetting Notation

Accessing specific elements of a matrix is accomplished using similar subsetting notation. Since there are two dimensions, an index can be applied to either dimension, or a full column or row can be accessed. Notice that the printed output of the matrix actually shows you the notation.

Exercise: Use rnorm() to generate a 20 x 6 matrix. Add column names to the matrix: C, F, T, A, D, K. Extract column A. How do you extract more than one column?

Another common technique for creating matrices is to use either rbind() or cbind() with existing vectors or arrays.

Important Types

Lists

A list is a general purpose data structure or object that stores named elements. Objects can be stored within a list at multiple levels. Once a list is created, elements can be accessed by name or index. When using an index, typically the special double bracket notation is used unless you want another list back [1].

> li <- list(a=rnorm(5), b=1:8, c='label')
> li$b
[1] 1 2 3 4 5 6 7 8
> li[[2]]
[1] 1 2 3 4 5 6 7 8
> li[2]
$b
[1] 1 2 3 4 5 6 7 8

Data.frames

While matrices require data to be of a consistent type, a data.frame allows arbitrary types for each column.

Factors

A factor is essentially an enum. Performance benefits exist when using factors for grouping or filtering since the comparison is faster than with a string. Be careful, though, as R will attempt to convert string data to factors by default, which can result in unexpected behavior.

Coercion

Sometimes the data you get needs to be converted to a different format. Most type constructors have corresponding as.* functions to coerce data into the given type. A typical usage is converting a string to a date via as.Date().

Exercise: Given the following data.frame, get the average of the values for label b.

> l <- sample(strsplit('abcdefg','',fixed=TRUE)[[1]],10,replace=TRUE)
> d <- data.frame(cbind(value=rnorm(10), label=l))

Hint: Use anytypes() to see the type for each column in the data.frame.

Reading Data

read.csv

The most common method for getting data into R is by reading a file. Typically the family of read and write functions are used for general purpose reading and writing of data.frames, while scan is sometimes used directly when reading in all numeric matrices (as an example).

> df <- read.csv(textConnection("a,b,c
+ 1,2,3
+ 4,5,6
+ 7,8,9"))
> write.csv(df, file='dummy.csv')
> dd <- read.csv('dummy.csv')

Notice anything strange when reading this back in?

Other Methods

With the help of additional packages, you can load data from a database (DBI, RODBC, RMySQL), or read financial data from publicly available sites (quantmod, RMetrics).

Additional Resources

R for Beginners

Econometrics in R

Time Series Analysis and Its Applications: With R Examples

References

Zero rates with futile.paradigm

Here’s a short example of calculating zero rates and discount factors from cash rates using futile.paradigm. Of note is how type conversions are handled and how the core function implementation focuses strictly on the calculation while overloaded versions focus on the type conversions.

discount_factor %when% (day.count %isa% Act360)
discount_factor %also% is.character(spot.date)
discount_factor %also% is.character(maturity.date)
discount_factor %as% function(deposit.rate, spot.date, maturity.date, day.count)
{
  s <- as.Date(spot.date)
  m <- as.Date(maturity.date)
  discount_factor(deposit.rate, s, m, day.count)
}

discount_factor %when% (day.count %isa% Act360)
discount_factor %also% (spot.date %isa% Date)
discount_factor %also% (maturity.date %isa% Date)
discount_factor %as% function(deposit.rate, spot.date, maturity.date, day.count)
{
  days <- as.numeric(maturity.date - spot.date, units="days")
  discount_factor(deposit.rate, days, day.count)
}

discount_factor %when% (day.count %isa% Act360)
discount_factor %also% is.numeric(days)
discount_factor %as% function(deposit.rate, days, day.count)
{
  1 / (1 + deposit.rate * days / 360)
}

zero_rate %when% is.character(spot.date)
zero_rate %also% is.character(maturity.date)
zero_rate %as% function(discount.factor, spot.date, maturity.date)
{
  s <- as.Date(spot.date)
  m <- as.Date(maturity.date)
  zero_rate(discount.factor, s,m)
}

zero_rate %when% (spot.date %isa% Date)
zero_rate %also% (maturity.date %isa% Date)
zero_rate %as% function(discount.factor, spot.date, maturity.date)
{
  days <- as.numeric(maturity.date - spot.date, units="days")
  zero_rate(discount.factor, days)
}

zero_rate %when% is.numeric(days)
zero_rate %as% function(discount.factor, days)
{
  1 / df ^ (1/(days/365)) - 1
}

Also of note is how types can be used for polymorphism. These types can be created on the fly without a formal definition.

To call the functions,

df <- discount_factor(0.0123375,
  '2009-02-05', '2009-05-05', create(Act360))
zr <- zero_rate(df, '2009-02-05', '2009-05-05')

What’s new in futile.matrix 1.1.2

Tags

This is an exciting release of futile.matrix, which in some ways the package grows up and finds its purpose. It now includes a suite of functions for generating random matrices and can be used in conjunction with RMTstat. In short, futile.matrix has two purposes:

  1. Investigating the properties of random matrices
  2. Fitting distributions and finding the upper cutoff on the noise part of the eigenvalue spectrum

This post will focus on the first purpose, that of exploring random matrices. It is meant to be an overview of functions available in the package and not a rigorous treatment of the subject matter.

Random Matrices

The package provides a number of types for describing random matrices. These types are strictly model parameters that are used to generate random matrices. Supported matrix classes include Wigner matrices, Wishart matrices, and Jacobian matrices.

Wigner Matrix

A Wigner matrix is a symmetric matrix with random Gaussian entries. To generate a matrix of this class we can create a model and then call the matrix generator. Once we create a matrix we can plot a histogram of the eigenvalues, which conforms to the Wigner semicircle law.

model <- create(WignerModel, 1000)
mat <- rmatrix(model)
hist(eigenvalues(m), breaks=20)

Additional arguments can be passed into the model to specify, for example, whether the entries are real or complex. To visually verify that the entries look correct, use the peek function.

> model <- create(WignerModel, 1000, real=FALSE)
> m <- rmatrix(model)
> peek(m)
                          [,1]                    [,2]
[1,]  0.024964162+0.000000000i  0.02515425-0.00411280i
[2,]  0.025154249+0.004112805i -0.03926107+0.00000000i
[3,] -0.001786425+0.002332125i -0.02215823-0.00532748i
[4,]  0.003182095+0.012343427i -0.05085979-0.02496951i
[5,]  0.022302859-0.006057670i -0.00372532-0.01257203i
                          [,3]                    [,4]                    [,5]
[1,] -0.001786425-0.002332125i  0.00318209-0.01234343i  0.02230286+0.00605767i
[2,] -0.022158230+0.005327478i -0.05085979+0.02496951i -0.00372532+0.01257203i
[3,] -0.041233230+0.000000000i -0.03274875-0.01665794i -0.01389319+0.00570893i
[4,] -0.032748746+0.016657935i  0.01681741+0.00000000i -0.03047206+0.00500749i
[5,] -0.013893195-0.005708930i -0.03047206-0.00500749i  0.01964392+0.00000000i

Note that when the matrix is generated, the function automatically asserts that the transpose is equal to the matrix itself via futile.paradigm.

rmatrix %must% (result == ct(result))

As expected, the histogram for this variant also conforms to the semi-circle law.

Wishart Matrix

In finance, a more useful matrix is a Wishart matrix, which is a sample covariance matrix. These matrices have an eigenvalue spectrum that is described by the Marcenko-Pastur distribution.

As an example, we’ll create a random Wishart matrix that is similar to two years of daily data on the S&P 500. We plot the actual distribution of eigenvalues alongside the theoretical distribution (generated from the RMTstat package).

model <- create(WishartModel, 500, 500)
mat <- rmatrix(model)
x <- seq(0,4, by=0.1)
par(mfrow=c(1,2))
hist(eigenvalues(m), breaks=20)
plot(x, dmp(x, svr=1), main="Theoretical distribution")

Fitting the spectrum to a theoretical curve enables you to isolate the noise part of the spectrum. Once this is established, it’s possible to clean the source correlation matrix and remove the estimation error.

Matrix Ensembles

Additional properties of random matrices can be discerned by analyzing ensembles of said matrices. We can generate the ensemble based on an existing model along with a count. Since an ensemble contains many random matrices, there is no practical way of displaying its contents. Hence, only some basic information is printed to reduce screen clutter.

</pre>
> model <- create(WishartModel, 100, 400)
> e <- create(Ensemble, 200, model)
> hist(max_eigen(e), freq=FALSE, breaks=20)
<pre>

Conclusion

This post covered the basic functions of futile.matrix and how one can analyze random matrices and their ensembles using the package. In a subsequent post, we’ll cover using the package to fit distributions and identifying the cutoff point for the noise portion of the eigenvalue spectrum.

What’s new in futile.paradigm 2.0.4

Well this certainly took a while but the latest installment of my functional dispatching library for R is finally released and pushed to CRAN. As a major point release, it has some breaking changes, although the older style of 1.x is supported via an option (more later).

The 2.x series of futile.paradigm introduces a cleaner syntax that doesn’t clutter the environment. All function variants are attached to a single function reference, which means multi-part definitions are much cleaner than in 1.x.

fact %when% (length(x) == 1)
fact %as% function(x) { fact(1:x) }

fact %when% (length(x) > 1)
fact %as% function(x) { prod(x) }

Adding additional guards is done by using the %also% operator.

fact %when% is.numeric(x)
fact %also% (length(x) == 1)
fact %as% function(x) { fact(1:x) }

Default function variants can now be defined in case all others fail. This only works when the argument counts match, although you’re safe if you use the ellipsis argument here.

fact %default% function(x) { fact(as.numeric(x) }

Now calling the function is exactly how you would expect. Functional polymorphism works as expected and all function variants have clear code paths, making each definition remarkably concise.

> fact(5)
[1] 120
> fact("5")
[1] 120
> fact(1:4)
[1] 24

What about optional arguments? I’ve teetered back and forth on whether this is a good idea (always deciding on no), but there are circumstances where variables should be passed along for compatibility with other dispatching systems. When this need arises, it’s easiest to simply pass in a list as an extra parameter and pass it along as necessary. Sometimes I’ll just attach() the list to the environment inside the function definition, which makes it easy to access any variables specified.

This post is a brief introduction to futile.paradigm and how to structure code in a functional way in R. There are other fun features in version 2.0 of futile.paradigm that I will discuss in later posts, but this covers the bare essentials.

Code Listing

Here’s a complete code listing for convenience.
fact %when% is.numeric(x)
fact %also% (length(x) == 1)
fact %as% function(x) { fact(1:x) }

fact %when% (length(x) > 1)
fact %as% function(x) { prod(x) }

fact %default% function(x) { fact(as.numeric(x)) }
Follow

Get every new post delivered to your Inbox.

Join 36 other followers