14.7 Backtesting Strategy
Specifying a backtesting program for a trading organization can be an unsettling experience, plagued by data limitations and philosophical quandaries. Here we shall address issues and present practical advice on how to proceed.
14.7.1 Backtesting as Hypothesis Testing
Backtesting, as it is commonly practiced, is hypothesis testing. It poses all the familiar challenges of hypothesis testing. Let’s focus on two:
- Philosophically, hypothesis testing treats the null hypothesis as “valid” or “invalid” whereas in many applications the question is more one of the null hypothesis being either an imperfect but useful assumption or an imperfect and not useful assumption. Stated another way, hypothesis testing is often applied to situations that are “gray” to determine if they are “black” or “white.”
- If we accept the null hypothesis as either “valid” or “invalid” there remains an uncomfortable tradeoff between the risk of Type I error and that of Type II error—reducing one increases the other, making it difficult—or controversial—to find a balance.
The problems are related. Value-at-risk measures aren’t “valid” or “invalid,” just as the approximation 3.142 for π is not “valid” or “invalid.” Value-at-risk measures and approximations are either “useful” or “not useful,” and usefulness depends on context. For a carpenter, 3.142 may be a useful approximation of π, but it might not be for an astronomer. A particular value-at-risk measure may be useful for assessing the market risk of futures portfolios but not of portfolios containing options on those futures. While we generally speak of “backtesting a value-at-risk measure,” in fact we backtest a value-at-risk measure as applied to a particular portfolio.
With backtesting, we distinguish between those value-at-risk measures we will reject and those we will continue to use for a particular trading portfolio. Where we draw the line is a compromise to balance the risk of rejecting a “valid” value-at-risk measure against that of failing to reject an “invalid” value-at-risk measure. Never mind that this is a compromise over a contrived issue. It really isn’t a compromise at all. Researchers in the social sciences long ago adopted the convention of testing at the .05 or .01 significance level. Use of the .05 significance level predominates, but a researcher whose data is particularly strong may report results at the .01 significance level to emphasize the fact. Accordingly, there is no real debate about what significance level to use. In backtesting, we use the .05 significance level based solely on the established convention and the fact that backtest data is rarely good enough to warrant the .01 significance level.
Bluntly stated, we accept or reject value-at-risk measures based on a convention for how to compromise over a contrived issue. The convention is use of the .05 significance level. The compromise is about balancing the risks of Type I vs. Type II errors. The contrived issue is that of a particular value-at-risk measure being somehow “valid” or “invalid”.
These problems exist for hypothesis testing in fields other than finance. Social scientists embrace the hypothesis testing approach because there aren’t really good alternatives. In backtesting, we are fortunate to have two or three years of data on the performance of a value-at-risk measure. If historical data weren’t so limited, we could go beyond the contrived issue of value-at-risk measures being “valid” or “invalid” and truly assess the usefulness of individual value-at-risk measures. It is limited data, more than anything else, that drives us to accept the hypothesis testing approach to backtesting. Formal hypothesis testing largely substitutes convention for meaningful test design. This may be a weakness, but it is also a strength. Without extensive data, careful test design is impossible. Convention-driven hypothesis testing allows us to make decisions with limited data in a manner that, despite only loosely conforming to our needs, is consistent. Arguably, it represents the best option available to us for interpreting limited data.
The Basel Committee’s traffic light backtest doesn’t employ hypothesis testing. It is just a rule specified by regulators based on their intuitive sense of what seemed reasonable. Its graded response of increasing capital charges within the yellow zone avoids the stark “valid” or “invalid” distinction of hypothesis testing at the expense of creating an illusion of precision. With just α + 1 = 250 data points, it is difficult to draw any conclusion whatsoever about a value-at-risk measure, especially a value-at-risk measure that is supposed to experience just one exceedance every 100 days.
For banks, having their value-at-risk measure perform poorly on the traffic light test would cost more than elevated capital charges. Regulators might force them to go through an expensive and time consuming process of implementing a new value-at-risk measure. At a minimum, poor performance on the traffic light test would attract scrutiny, which banks generally want to avoid.
Rather than entrust such matters to luck, banks have tended to implement conservative value-at-risk measures whose coverage q* well exceeds the 0.99 quantile of loss they purport to measure. Some such measures are so conservative they practically never experience an exceedance. This all but guarantees the value-at-risk measures perform well on the traffic light test.
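The limited power of 250 observations, and the payoff to a bank from a conservative measure, can be illustrated with a little binomial arithmetic. The sketch below assumes exceedances are independent with a constant daily probability, and it assumes the Basel Committee's published zone boundaries for 250 observations (green for 0 to 4 exceedances, yellow for 5 to 9, red for 10 or more). The function names and the three example coverage levels are hypothetical choices made for illustration.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k exceedances in n independent days."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def zone_probabilities(p_exceed: float, n_obs: int = 250) -> dict:
    """Probability of landing in each traffic light zone.

    Assumes i.i.d. exceedances and the zone boundaries for 250 observations:
    green 0-4, yellow 5-9, red 10 or more.
    """
    green = sum(binom_pmf(k, n_obs, p_exceed) for k in range(0, 5))
    yellow = sum(binom_pmf(k, n_obs, p_exceed) for k in range(5, 10))
    red = max(0.0, 1.0 - green - yellow)
    return {"green": green, "yellow": yellow, "red": red}


# Hypothetical cases: true coverage q* equal to, below, and above 0.99.
cases = [
    ("accurate (q* = 0.99)", 0.01),
    ("understated (q* = 0.98)", 0.02),
    ("conservative (q* = 0.995)", 0.005),
]
for label, p_exceed in cases:
    zones = zone_probabilities(p_exceed)
    print(label, {name: round(prob, 4) for name, prob in zones.items()})
```

Under these assumptions, a measure whose true exceedance probability is double the intended 1% still lands in the green zone a substantial fraction of the time, while a conservative measure is almost never pushed out of it.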
Lopez (1999) builds on the traffic light approach of more finely grading backtest results. Drawing on decision theory, he suggests that the accuracy of value-at-risk measures be gauged by how well they minimize a “loss function” reflective of the evaluator’s priorities, which might include avoiding extraordinary one-day losses or avoiding increased regulatory capital charges. While this is consistent with the goal of accepting or rejecting value-at-risk measures based on an assessment of their usefulness, it risks drawing conclusions that the limited data available for backtesting do not warrant.
Lopez’s approach compares
- the value of the loss function achieved by a value-at-risk measure over a period of α + 1 observations, and
- some benchmark value an accurate value-at-risk measure might have achieved over the same period.
Depending on how the loss function is defined, this can be straightforward, or it can entail assumptions. For example, if the loss function is set equal to the number of exceedances experienced over the α + 1 observations, Lopez’s methodology reduces to a simple coverage test. For a more interesting (and more problematic) loss function, define the magnitude of an exceedance as the maximum of 1) a portfolio’s actual loss minus the value-at-risk for that period, and 2) zero. A loss function based on the magnitude of exceedances addresses a concern of many managers: how bad can a loss be on days it exceeds reported value-at-risk? But evaluating a benchmark for such a loss function requires some assumptions as to how an accurate value-at-risk measure might have performed. Should a value-at-risk measure fail a backtest based on such a loss function, the question arises as to whether the problem resides with the value-at-risk measure or with the assumptions used to model the benchmark.
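The two loss functions just described can be sketched as follows. This is a minimal illustration, not Lopez's own specification: it assumes losses and value-at-risk are both reported as positive numbers in the same currency units, and it assumes the magnitude-based loss function simply sums the exceedance magnitudes, which is one possible aggregation among several. The function names and the sample data are hypothetical.

```python
from typing import List, Sequence


def exceedance_count(losses: Sequence[float], var_values: Sequence[float]) -> int:
    """Loss function equal to the number of days on which loss exceeded value-at-risk."""
    return sum(1 for loss, var in zip(losses, var_values) if loss > var)


def exceedance_magnitudes(losses: Sequence[float], var_values: Sequence[float]) -> List[float]:
    """Magnitude of each day's exceedance: max(loss - value-at-risk, 0)."""
    return [max(loss - var, 0.0) for loss, var in zip(losses, var_values)]


def magnitude_loss(losses: Sequence[float], var_values: Sequence[float]) -> float:
    """Assumed aggregation: total magnitude of exceedances over the period."""
    return sum(exceedance_magnitudes(losses, var_values))


# Hypothetical five-day history (negative loss means a gain).
sample_losses = [1.2, -0.5, 3.1, 0.4, 2.6]
sample_var = [2.0, 2.0, 2.0, 2.0, 2.0]

print(exceedance_count(sample_losses, sample_var))
print(magnitude_loss(sample_losses, sample_var))
```

Computing the achieved value is the easy half. The benchmark for `magnitude_loss` is where assumptions enter, since it requires a model of how far losses would exceed an accurate value-at-risk measure.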
14.7.3 Joint Tests
Joint tests are backtests that simultaneously assess two or more criteria for a value-at-risk measure—say coverage and exceedance independence. Such tests have been proposed by Christoffersen (1998) and Christoffersen and Pelletier (2004). Campbell (2005) recommends against their use:
While joint tests have the property that they will eventually detect a value-at-risk measure which violates either of these properties, this comes at the expense of a decreased ability to detect a value-at-risk measure which only violates one of the two properties. If, for example, a value-at-risk measure exhibits appropriate unconditional coverage but violates the independence property, then an independence test has a greater likelihood of detecting this inaccurate value-at-risk measure than a joint test.
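Campbell's point can be illustrated with the likelihood-ratio statistics commonly associated with Christoffersen (1998). The sketch below is written from the standard textbook formulation (a first-order Markov alternative for independence, with the joint statistic taken as the sum of the coverage and independence statistics), not from anything reproduced in this section. The function names and the exceedance history are hypothetical, and the chi-squared p-values rely on asymptotic approximations that may be poor with so few exceedances.

```python
from math import erfc, exp, log, sqrt
from typing import Sequence


def _xlogy(x: float, y: float) -> float:
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def lr_coverage(hits: Sequence[int], p: float) -> float:
    """Likelihood-ratio statistic for unconditional coverage (1 degree of freedom)."""
    n1 = sum(hits)
    n0 = len(hits) - n1
    pi_hat = n1 / len(hits)
    ll_null = _xlogy(n0, 1 - p) + _xlogy(n1, p)
    ll_alt = _xlogy(n0, 1 - pi_hat) + _xlogy(n1, pi_hat)
    return max(0.0, -2.0 * (ll_null - ll_alt))


def lr_independence(hits: Sequence[int]) -> float:
    """Likelihood-ratio statistic against first-order Markov dependence (1 d.f.)."""
    n = [[0, 0], [0, 0]]  # n[i][j] counts transitions from state i to state j
    for prev, curr in zip(hits, hits[1:]):
        n[prev][curr] += 1
    (n00, n01), (n10, n11) = n
    pi = (n01 + n11) / (n00 + n01 + n10 + n11)
    pi01 = n01 / (n00 + n01) if (n00 + n01) else 0.0
    pi11 = n11 / (n10 + n11) if (n10 + n11) else 0.0
    ll_null = _xlogy(n00 + n10, 1 - pi) + _xlogy(n01 + n11, pi)
    ll_alt = (_xlogy(n00, 1 - pi01) + _xlogy(n01, pi01)
              + _xlogy(n10, 1 - pi11) + _xlogy(n11, pi11))
    return max(0.0, -2.0 * (ll_null - ll_alt))


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-squared variable for df = 1 or 2."""
    if df == 1:
        return erfc(sqrt(x / 2.0))
    if df == 2:
        return exp(-x / 2.0)
    raise ValueError("only df = 1 or df = 2 is supported in this sketch")


# Hypothetical history: 250 days, five consecutive exceedances, target rate 2%.
hits = [0] * 250
for day in range(100, 105):
    hits[day] = 1

lr_uc = lr_coverage(hits, p=0.02)
lr_ind = lr_independence(hits)
lr_joint = lr_uc + lr_ind

print("independence p-value:", chi2_sf(lr_ind, 1))
print("joint p-value:       ", chi2_sf(lr_joint, 2))
```

In this example the coverage is exactly on target, so the coverage statistic contributes nothing to the joint statistic, yet the joint test is referred to a distribution with an extra degree of freedom. Its p-value is therefore larger than that of the independence test alone, which is the loss of power Campbell describes.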
14.7.4 Designing a Backtesting Program
When a value-at-risk measure is first implemented its performance will be closely monitored. Data will be insufficient for meaningful statistical analyses, but a graph such as Exhibit 14.1 can be updated monthly and monitored for signs of irregular performance. Parallel testing against a legacy value-at-risk measure is also appropriate. At this stage, the goal is primarily to address Type B model implementation risk. Coding or implementation errors can produce noticeable distortions in a value-at-risk measure’s performance, even over short periods of time.
At six months, coding or other implementation issues should have been identified and resolved. If any of these motivated substantive changes in the value-at-risk measure or its output, you will want to wait until six months after the last substantive change before performing any statistical backtests. Results from our recommended standard distribution test are likely to be the most meaningful at this point, as six months of data really isn’t enough for coverage or independence tests.
Perform another backtest at one year. Now include our recommended standard independence test. If you calculate value-at-risk at the 90% or 95% level, also include our recommended standard coverage test. Otherwise, wait two years before performing all three of our recommended standard tests. Continue to backtest annually using those three tests. Use all available data generated since the last substantive change to the value-at-risk system, up to a maximum of five years.
I recommend institutions use the three recommended standard tests described in this chapter. They are as good as any you will find in the literature, and better than most. Some widely cited backtests are flawed or ineffective. Banks will also need to perform the traffic light backtest, as required by their regulators. Backtests should be performed with both clean and dirty data.
14.7.5 Failing a Backtest
Because they are performed at the .05 significance level, failure of any one of our recommended standard backtests is a strong indication of a material shortcoming in a value-at-risk measure’s performance. Your response will depend on the particular test failed, whether it was failed with clean or dirty data, and your assessment of the circumstances that caused the failure. A graph similar to Exhibit 14.10 is useful for diagnosing problems identified by coverage or distribution tests.
Failure of a clean test—or both a clean test and the corresponding dirty test—is indicative of a Type A (model design) or Type B (implementation) problem with the value-at-risk measure. Focus your analysis first on eliminating the possibility of an implementation or coding error. Only then address the possibility of Type A design shortcomings.
A design shortcoming may not necessarily dictate a fundamental change in the design of your value-at-risk measure. If your value-at-risk measure already incorporates sophisticated analytics suitable for your portfolio, modifying those analytics may not be productive. A review of your backtesting data may indicate that an ad hoc solution, such as multiplying output by a scalar, may fix the problem.
For example, if your value-at-risk measure failed a clean recommended standard distribution test, and you are comfortable the model design is appropriate for your portfolio, you can go back and redo the distribution test using the same past value-at-risk measurements, but multiply each by a scalar w. Through trial and error, or some search routine, you can solve for the value w that optimizes performance on the test (i.e., maximizes the sample correlation on which that test is based). Going forward, scale value-at-risk measurements by that value w.
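A search routine for w might look something like the sketch below. The search itself is generic: it takes any scoring function that is larger for better backtest performance. Because the distribution test's statistic is not reproduced here, the example swaps in a much cruder stand-in score (how close the exceedance frequency of the scaled measurements comes to a target rate), so it illustrates the search and not the recommended test. All names and data are hypothetical.

```python
from typing import Callable, Iterable, List, Sequence

Score = Callable[[Sequence[float], Sequence[float]], float]


def best_scale(losses: Sequence[float], var_values: Sequence[float],
               score: Score, candidates: Iterable[float]) -> float:
    """Grid search for the scalar w whose scaled value-at-risk scores highest."""
    best_w, best_score = None, float("-inf")
    for w in candidates:
        scaled: List[float] = [w * v for v in var_values]
        s = score(losses, scaled)
        if s > best_score:
            best_w, best_score = w, s
    if best_w is None:
        raise ValueError("no candidate scalars supplied")
    return best_w


def coverage_gap_score(target_rate: float) -> Score:
    """Stand-in objective (NOT the distribution test): negative gap between
    the realized exceedance frequency and the target frequency."""
    def score(losses: Sequence[float], scaled_var: Sequence[float]) -> float:
        hits = sum(1 for loss, var in zip(losses, scaled_var) if loss > var)
        return -abs(hits / len(losses) - target_rate)
    return score


# Hypothetical data: 20 days of losses against a constant value-at-risk of 10.
past_losses = [i + 0.5 for i in range(20)]       # 0.5, 1.5, ..., 19.5
past_var = [10.0] * 20
grid = [k / 10 for k in range(5, 26)]            # w = 0.5, 0.6, ..., 2.5

w_star = best_scale(past_losses, past_var, coverage_gap_score(0.10), grid)
print("selected w:", w_star)
```

To apply this with the distribution test, one would pass a `score` function that recomputes that test's statistic from the scaled measurements. Keep in mind that a w fitted to past data is being judged on the same data used to choose it.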
Some may feel uncomfortable with an ad hoc solution like this. Keep in mind that a value-at-risk measure is a practical tool. Our goal is not to develop some theoretically beautiful model for the complex dynamics of markets. All we require is a reasonable indication of market risk. The philosophy of science tells us to judge a model based on the usefulness of its predictions and not on the nature of its assumptions. If we can fix a value-at-risk measure by simply scaling its output, then there is every reason to do so.
Of course, this solution only applies if a value-at-risk measure is already sophisticated enough to capture relevant market dynamics. If a portfolio is exposed to vega risk or basis risk, and the value-at-risk measure isn’t designed to capture these, no amount of scaling of that value-at-risk measure’s output is going to solve the problem. If a Monte Carlo value-at-risk measure is so computationally intensive that there is only time for a sample of size 250 for each overnight value-at-risk analysis, the standard error will be enormous. Scaling the output will not solve this problem. The computations need to be streamlined, perhaps with a holdings remapping and/or variance reduction, and the sample size increased.
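To get a rough sense of how large that standard error is, the sketch below uses the textbook asymptotic formula for the standard error of a sample quantile, sqrt(q(1 − q)/n) / f(x_q), where f is the density at the true quantile. It assumes, purely for illustration, that the simulated portfolio loss is standard normal, that the draws are independent, and that value-at-risk is the 0.99 quantile of loss. None of those assumptions comes from the text, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def quantile_standard_error(q: float, n: int) -> float:
    """Asymptotic standard error of the sample q-quantile of n i.i.d.
    standard normal draws: sqrt(q(1 - q) / n) / pdf(true quantile)."""
    std_normal = NormalDist()
    x_q = std_normal.inv_cdf(q)
    return sqrt(q * (1 - q) / n) / std_normal.pdf(x_q)


true_quantile = NormalDist().inv_cdf(0.99)
for n in (250, 1000, 10000):
    se = quantile_standard_error(0.99, n)
    print(f"n = {n:>5}: standard error is {se / true_quantile:.1%} of value-at-risk")
```

Under these assumptions the standard error shrinks only with the square root of the sample size, so moving from 250 to 10,000 scenarios reduces it by a factor of about 6.3.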
Tweaking a poorly designed value-at-risk measure is only going to produce another poorly designed value-at-risk measure. If a value-at-risk measure is fundamentally unsuited for the portfolio it is applied to, it needs to be fundamentally redesigned.
Some shortcomings of value-at-risk measures must be lived with. The standard UWMA and EWMA techniques for modeling covariance matrices do not address market heteroskedasticity well. As we indicated in Chapter 7, there are currently no good solutions to this problem. Today’s value-at-risk measures are slow in responding to rising market volatility. During such periods, they tend to experience clustered exceedances. Similarly, when volatilities decline, they again lag, and may experience few or no exceedances. These phenomena may cause a value-at-risk measure to fail an independence test. There is little that can be done about the problem.
Failure of a dirty test and not the corresponding clean test is an indication of a Type C model application problem.
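The diagnostic logic of this section, for a clean test and its corresponding dirty test, reduces to a short decision rule. The function name and return strings below are hypothetical labels, and the rule says only where to start looking, not what the cause is.

```python
def diagnose(clean_failed: bool, dirty_failed: bool) -> str:
    """Map backtest outcomes for a clean/dirty test pair to the type of
    model risk to investigate first."""
    if clean_failed:
        # Applies whether or not the dirty test also failed.
        return "Type B (implementation) first, then Type A (design)"
    if dirty_failed:
        return "Type C (application)"
    return "no failure indicated"


print(diagnose(clean_failed=True, dirty_failed=True))
print(diagnose(clean_failed=False, dirty_failed=True))
```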
14.7.6 Backtesting Other PMMRs
This chapter, like the literature, has focused on backtesting of value-at-risk measures. If you employ some other PMMR, coverage and exceedance independence tests will not apply, but it may be possible to develop tests analogous to those tests for your particular PMMR. Our recommended standard distribution and independence tests are not limited to value-at-risk. They can be applied with most PMMRs.