PELTing a Competing Changepoint Algorithm

This post will demonstrate the PELT algorithm from the changepoint package–a competing algorithm to the twitter package’s breakout detection algorithm. While neither of these algorithms produce satisfactory results, one change point location approximation algorithm that makes no distributional assumptions shows potentially promising results.

I received some feedback regarding my first foray into change point analysis from Brian Peterson. While some of it was good, a fair bit more was how I can add more to my analysis, more boxes I could check off, and so on. One of those boxes was the PELT algorithm, devised by one Rebecca Killick of Lancaster University, which I’ll give a quick run-through of below.

In the twitter paper, PELT was the competing algorithm that the paper compared itself to, and while I didn’t think that replicating the competing algorithm would be necessary at first go, it turns out, that, well, it was necessary. So, going forward, I’m going to have more points demonstrating some more aspects of these change point detection algorithms. Thus far, the most impressive one has been David Matteson’s e.divisive algorithm in his ecp package. However, its one caveat for me is its massively long running time.

Anyhow, without further ado, let’s replicate a diagram found on page 7 of the original paper in the lower right hand corner. Turns out, that is the Scribe data set that comes with the BreakoutDetection package, so we have all we need.

require(BreakoutDetection)
require(changepoint)
data(Scribe)
bd <- breakout(Scribe)
pelt <- cpt.meanvar(Scribe)
plot(Scribe, type="l")
abline(v=bd$loc, col="red")
abline(v=pelt@cpts[1], col="blue")
legend(legend = c("Breakout", "PELT"), x = 2, y = 1000, fill = c("red", "blue"))

This gives us the following diagram.

In short, the paper’s claim is corroborated. PELT underperforms even in a simple example, using both packages’ only-one-changepoint methodology. Furthermore, PELT is actually an S4-class type of object (so for those wondering what the @ character is doing, it’s the equivalent of the $ elsewhere in R).

Let’s move onto the GSPC data.

suppressMessages(require(quantmod))
suppressMessages(require(PerformanceAnalytics))

suppressMessages(getSymbols("^GSPC", from = "1984-12-25", to = "2013-05-31"))
#these two lines are only needed if Yahoo gives you holidays such as New Year's -- EG 1985-01-01
require(timeDate)
GSPC <- GSPC[!as.character(index(GSPC)) %in% as.character(holidayNYSE(1985:2013)),]

dailySqRets <- Return.calculate(Cl(GSPC))*Return.calculate(Cl(GSPC))
dailySqRets <- dailySqRets["1985::"]

Now we’ll look at a histogram of the daily squared returns and see if it looks like some sort of famous distribution.

plot(hist(dailySqRets, breaks=1000), xlim=c(0, 0.001))

Which results in the following image

So, an exponential distribution (give or take)? Well, let’s try and do some PELT changepoint analysis using the exponential distribution.

PELTcps <- cpt.meanvar(as.numeric(dailySqRets), 
                       method="PELT", test.stat="Exponential")

Did that work? No. Here’s the error message.

Error in PELT.meanvar.exp(coredata(data), pen.value) : 
  Exponential test statistic requires positive data

Now, squared returns can’t possibly be negative, because that’s just nonsensical. So what does that mean? Let’s take a look.

dailySqRets[dailySqRets==0]

And the output:

> dailySqRets[dailySqRets==0]
           GSPC.Close
1985-03-28          0
1985-10-08          0
1988-02-04          0
1988-11-02          0
1992-09-03          0
1997-01-28          0
2003-01-10          0
2008-01-03          0

So, this essentially alleges that there were some days on which the close to close didn’t move at all. Let’s take a look.

GSPC["1985-03-27::1985-04-01"]

This gives us the following output:

> GSPC["1985-03-27::1985-04-01"]
           GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
1985-03-27    178.43    179.80   178.43     179.54   101000000        179.54
1985-03-28    179.54    180.60   179.43     179.54    99780000        179.54
1985-03-29    179.54    180.66   179.54     180.66   101400000        180.66
1985-04-01    180.66    181.27   180.43     181.27    89900000        181.27

Notice that the close price for the 27th and 28th day are identical, creating a return of zero, which breaks the PELT algorithm. So let’s fix that.

#the method will throw an error with zero returns, so this deals with that
dailySqRets[dailySqRets == 0] <- dailySqRets[dailySqRets==0] + 1e-100 

Essentially, this is just to get past the error messages within the changepoint package. So now, let’s try applying the algorithm once again.

peltCps <- cpt.meanvar(as.numeric(dailySqRets), 
                       method = "PELT", test.stat = "Exponential")

This time, success–it ran! Let’s check the amount of changepoints.

length(peltCps@cpts)

Which gives:

[1] 374

…which is vastly different from the e.divisive algorithm’s output from the previous investigation, or Dr. Robert Frey’s post (20, give or take). The upside is that this algorithm works quickly, but that’s not much solace if the answers are unreliable. To further dive down into the nature of the changepoints, we will remove the last change point (which is simply the length of the data) and do some summary statistics on how long some of these complete “regimes” are (that is, drop the first and last).

newCpts <- peltCps@cpts[-length(peltCps@cpts)]
regimeTime <- diff(newCpts)
summary(regimeTime)
hist(regimeTime, breaks = 100)

…which provides the following output:

> summary(regimeTime)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    3.00    8.00   19.14   26.00  149.00 

And the following histogram:

In short, most of these “regimes” last less than a month, and half of them don’t even make it out to two weeks. These results are not corroborated by the previously investigated methods. As more academic literature uses differences of log returns, and the point is to search for changes in the variance regime, that is the procedure that will be employed, and as the data is continuous and contains negative values, only the Normal distribution is available to choose from when using the PELT method.

logDiffs <- as.numeric(diff(log(Cl(GSPC)))["1985::"])
peltGaussCps <- cpt.var(logDiffs, method="PELT", test.stat="Normal")
length(peltGaussCps@cpts)
fullGaussRegimes <- diff(peltGaussCps@cpts[-length(peltGaussCps@cpts)])
summary(fullGaussRegimes)
hist(fullGaussRegimes, breaks = 100)

Which gives the following output:

> length(peltGaussCps@cpts)
[1] 93
> fullGaussRegimes <- diff(peltGaussCps@cpts[-length(peltGaussCps@cpts)])
> summary(fullGaussRegimes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00   16.00   48.00   73.76  106.50  498.00 

And the following histogram:

In short, even though the result is improved, it’s still far from reliable, with most regime switches taking place within 2 months, and many of those lasting barely a few days.

Lastly, I’d like to correct an issue from the previous changepoint post in which I used the “at most one changepoint” method from the BreakoutDetection package. Now, I’ll use the multiple change point detection method.

bdcp <- breakout(dailySqRets, min.size = 20, method = "multi")

With the following results:

> length(bdcp$loc)
[1] 0

In short, the breakout algorithm found no change points whatsoever on default settings, which, once again, does not line up with the results from both the ecp package, along with Dr. Frey’s post. Even if the beta (penalty parameter) gets decreased to .0001 (from .008, its default), it still fails to find any changes in the squared returns data, which is disappointing.

However, over the course of exploring the changepoint package, I have found that there is a method that is directly analogous to Dr. Frey’s cumulative sum of squares method (that is, if you check the help file for cpt.var, one of the test.stat distributions is “CSS”, aka cumulative sum of squares). There are two methods that employ this, neither of which are PELT, but both of which predate PELT (the binary segmentation and segment neighborhood methods), and are explained in Dr. Killick’s original paper.

First off, both algorithms contain a penalty term, and the penalty term that is the default setting is the Bayesian Information Criterion (or BIC aka SIC), which is a double log of the number of data points. In contrast, the AIC is simply the log of 2, so the BIC will be greater than AIC at 8 data points or higher. The cpt.var algorithm function is mostly a wrapper for further nested wrappers, and essentially, the “cut to the chase” is that we eventually nest down to the not-exported binary segmentation variance cumulative sum of squares algorithm, and the segmentation neighborhood sum of squares algorithm.

So here’s the code for the former:

changepoint:::binseg.var.css <- function (data, Q = 5, pen = 0) 
{
  n = length(data)
  if (n < 4) {
    stop("Data must have atleast 4 observations to fit a changepoint model.")
  }
  if (Q > ((n/2) + 1)) {
    stop(paste("Q is larger than the maximum number of segments", 
               (n/2) + 1))
  }
  y2 = c(0, cumsum(data^2))
  tau = c(0, n)
  cpt = matrix(0, nrow = 2, ncol = Q)
  oldmax = Inf
  for (q in 1:Q) {
    lambda = rep(0, n - 1)
    i = 1
    st = tau[1] + 1
    end = tau[2]
    for (j in 1:(n - 1)) {
      if (j == end) {
        st = end + 1
        i = i + 1
        end = tau[i + 1]
      }
      else {
        lambda[j] = sqrt((end - st + 1)/2) * 
                         ((y2[j + 1] - y2[st])/(y2[end + 1] - y2[st]) - 
                          (j - st + 1)/(end - st + 1))
      }
    }
    k = which.max(abs(lambda))
    cpt[1, q] = k
    cpt[2, q] = min(oldmax, max(abs(lambda), na.rm = T))
    oldmax = min(oldmax, max(abs(lambda), na.rm = T))
    tau = sort(c(tau, k))
  }
  op.cps = NULL
  p = 1:(Q - 1)
  for (i in 1:length(pen)) {
    criterion = (cpt[2, ]) >= pen[i]
    if (sum(criterion) == 0) {
      op.cps = 0
    }
    else {
      op.cps = c(op.cps, max(which((criterion) == TRUE)))
    }
  }
  if (op.cps == Q) {
    warning("The number of changepoints identified is Q, 
             it is advised to increase Q to make sure 
             changepoints have not been missed.")
  }
  return(list(cps = cpt, op.cpts = op.cps, pen = pen))
}

Essentially, the algorithm loops through all the data points (that is, it defines the start as the beginning of the data, and end as the end of the data), and appends them by the quantity as defined by the lambda[j] line, which is, again:

lambda[j] = sqrt((end - st + 1)/2) * 
                 ((y2[j + 1] - y2[st])/(y2[end + 1] - y2[st]) - 
                  (j - st + 1)/(end - st + 1))

Which, to put it in financial terms, multiplies the square root of half the range by the difference of the percent B (think Bollinger Bands) of the *values* of the data and the percent B of the *indices* of the data for a given start and end location (which are consecutive change point locations). Then, the candidate change point is defined by the maximum of the absolute value of lambda between a starting and ending point, as defined by this code:

    k = which.max(abs(lambda))
    cpt[1, q] = k
    cpt[2, q] = min(oldmax, max(abs(lambda), na.rm = T))
    oldmax = min(oldmax, max(abs(lambda), na.rm = T))
    tau = sort(c(tau, k))

The algorithm then updates tau (the collection of change point locations), which updates the start and end point segment locations, and the algorithm restarts again.

Lastly, at the end, the algorithm loops through the different penalties (in this case, the BIC is simply one constant value, so there may be a special case that allows for some dynamic sort of penalty), and all the change points that have a higher lambda value than the penalty are returned as candidate change points. Again, there is no test of significance in the binary segmentation variance cumulative sum of squares algorithm — so if the penalty is manually specified to be zero, and the user specifies the maximum number of change points (n/2, where n is the length of data), then the algorithm will indeed spit back that many change points. In short, while the default settings may do a half-decent job with finding change points, it’s possible to deliberately force this algorithm to produce nonsensical output. In other words, to be glib, this algorithm isn’t attempting to win the battle against the universe in the never-ending battle to sufficiently idiot-proof something vs. the universe’s capability to create a better idiot. However, using the out-of-the-box settings should sufficiently protect oneself from having absurd results.

Here’s an illustrative example of a few iterations of a few iterations of this algorithm.

In this case, our data will be the daily returns of the GSPC from 1985 onward.

I’ll set Q (the maximum number of change points) at 30.

Let’s start this off.

data <- as.numeric(Return.calculate(Cl(GSPC))["1985::"])
Q <- 30
n <- length(data)
y2 = c(0, cumsum(data^2))
tau = c(0, n)
cpt = matrix(0, nrow = 2, ncol = Q)
oldmax = Inf

Which gives us:

> tau
[1]    0 7328
> cpt
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] 
[1,]    0    0    0    0    0    0    0    0    0    
[2,]    0    0    0    0    0    0    0    0    0  
      [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19]   
[1,]      0     0     0     0     0     0     0     0     0     0
[2,]      0     0     0     0     0     0     0     0     0     0
     [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30]
[1,]     0     0     0     0     0     0     0     0     0     0     0
[2,]     0     0     0     0     0     0     0     0     0     0     0

In short, tau is the location for the changepoints, and and the cpt matrix will be demonstrated shortly.

Here’s the first iteration through the main loop.

q <- 1
lambda = rep(0, n - 1)
    i = 1
    st = tau[1] + 1
    end = tau[2]
    for (j in 1:(n - 1)) {
      if (j == end) {
        st = end + 1
        i = i + 1
        end = tau[i + 1]
      }
      else {
        lambda[j] = sqrt((end - st + 1)/2) * 
                         ((y2[j + 1] - y2[st])/(y2[end + 1] - y2[st]) - 
                          (j - st + 1)/(end - st + 1))
      }
    }
    k = which.max(abs(lambda))
    cpt[1, q] = k
    cpt[2, q] = min(oldmax, max(abs(lambda), na.rm = T))
    oldmax = min(oldmax, max(abs(lambda), na.rm = T))
    tau = sort(c(tau, k))

And after this first iteration of the loop completes, here are the updated values of tau and cpt:

> tau
[1]    0 5850 7328
> cpt
           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] 
[1,] 5850.00000    0    0    0    0    0    0    0    0    
[2,]   10.51362    0    0    0    0    0    0    0    0  
      [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19]   
[1,]      0     0     0     0     0     0     0     0     0     0
[2,]      0     0     0     0     0     0     0     0     0     0
     [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30]
[1,]     0     0     0     0     0     0     0     0     0     0     0
[2,]     0     0     0     0     0     0     0     0     0     0     0

In short, tau is the location of the new change point (also the first row of the cpt matrix), and the second row is the absolute value of lambda. After this, the start and end vectors get appended with a new change point to allow for the binary segmentation that forms the basis of the algorithm. However, a picture’s worth a thousand words, so here are some illustrations. The blue lines denote previous change points, and the red point the new change point.

Here is the code I added in the bottom of the loop for plotting purposes:

    k = which.max(abs(lambda))
    plot(lambda, type = "l", main = paste("Changepoint", q))
    abline(v = tau, col="blue")
    cpt[1, q] = k
    cpt[2, q] = min(oldmax, max(abs(lambda), na.rm = T))
    oldmax = min(oldmax, max(abs(lambda), na.rm = T))
    tau = sort(c(tau, k))
    points(x = k, y = lambda[k], col="red", pch = 19)

And here is the result for the first change point:

The second change point:

The 10th change point:

And the 19th (of 30) change points.

Notice that A) lambda is constantly recomputed after every iteration of the main loop, as it’s updated with every new change point and that B) the values of lambda generally decrease as more change point candidates are found, such that the 19th change point is already on the border of the penalty boundary formed by the BIC. Unlike the math in the paper, this particular algorithm in R does not actually stop when lambda of k (that is, lambda[k]) is smaller than the penalty parameter, so if someone plugged in say, Q = 300, the algorithm would take around 30 seconds to run.

So, what’s the punch line to this approximate algorithm?

This:

 t1 <- Sys.time()
 binSegCss <- cpt.var(as.numeric(Return.calculate(Cl(GSPC))["1985::"]), 
                      method="BinSeg", test.stat="CSS", Q = 30)
 t2 <- Sys.time()
 print(t2 - t1)

The algorithm ran in a few seconds for me. Here is the output:

> binSegCss@cpts
 [1]  310  720  739  858 1825 2875 3192 4524 4762 5044 5850 6139 6199 6318 6548 6641 6868 6967 7328
> length(binSegCss@cpts)
[1] 19

Since the last change point is the length of the data, we’ll disregard that. In short, 18 change points (so that last picture of the four change points? That one and all subsequent ones fell in the realm of noise), which falls right into the ballpark of the post from Keplerian Finance.

So, as before, let’s create the cumulative sum of squares plot, the time series plot, and the volatility terrain map.

require(xtsExtra)
returns <- Return.calculate(Cl(GSPC))["1985::"]
dailySqReturns <- returns*returns
cumSqReturns <- cumsum(dailySqReturns)
plot(cumSqReturns)
xtsExtra::addLines(index(dailySqReturns)[binSegCss@cpts[-length(binSegCss@cpts)]], on = 1, col="blue", lwd = 2)

plot(Cl(GSPC)["1985::"])
xtsExtra::addLines(index(dailySqReturns)[binSegCss@cpts[-length(binSegCss@cpts)]], on = 1, col="blue", lwd = 2)


returns$regime <- NA
for(i in 1:(length(binSegCss@cpts)-1)) {
  returns$regime[binSegCss@cpts[i]] <- i
}
returns$regime <- na.locf(returns$regime)
returns$regime[is.na(returns$regime)] <- 0
returns$regime <- returns$regime + 1
returns$annVol <- NA
for(i in 1:max(unique(returns$regime))) {
  regime <- returns[returns$regime==i,]
  annVol <- StdDev.annualized(regime[,1])
  returns$annVol[returns$regime==i,] <- annVol
}

plot(returns$annVol)

Which gives us the following three images:



This last volatility map is even closer to the one found in Keplerian Finance, thus lending credibility to the technique.

In conclusion, it seems that the twitter breakout detection algorithm (e-divisive with medians) is “too robust” against the type of events in financial data, and thus does not detect enough change points. On the other hand, the PELT algorithm suffers from the opposite issues — by forcing the assumption of a popular distribution from probability theory (EG Normal, Exponential, Poisson, Gamma), PELT outputs far too many candidate change points, making its results highly suspect. However, a special case of the binary segmentation algorithm — the binary segmentation algorithm for variance using cumulative sum of squares — presents interesting results. Although the algorithm is a heuristic approximation, its fast running time and lack of distributional assumptions lend it usefulness for doing a “quick and dirty” analysis to potentially cut down on the running time of a more computationally-intensive changepoint detecting algorithm.

When this topic will be revisited, the segment neighborhood method will be examined.

Thanks for reading.

NOTE: I am a freelance consultant in quantitative analysis on topics related to this blog. If you have contract or full time roles available for proprietary research that could benefit from my skills, please contact me through my LinkedIn here.

Advertisements

3 thoughts on “PELTing a Competing Changepoint Algorithm

  1. Pingback: Distilled News | Data Analytics & R

  2. I think your conclusions are overly strong. First, for cpt.meanvar, you just accepted the defaults rather than seeing what models might be appropriate for cpt.meanvar. It offers several. Secondly, you are expecting the function to act as a black box and do everything for you, in your complaints about the Exponential. While it is not clear the Exponential is the best choice, the Gamma may well be. It wants positive values, too. A basic technique in applied statistics is to transform datasets into something more suitable, and then reverse transform the results back to the original basis.

    In the case of the Gamma, it’s easy. If X is the vector of data, do the analysis on 1+X-min(X) instead of on X, then, when changepoints or whatever are found, map them back by adding min(X)-1 to the result.

    Really need to think about what packages are doing.

    I will be putting up a blog post on using PELT in lieu of Bayesian blocks at my place in the next day or two.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s