# 14.7 Backtesting Strategy

Specifying a backtesting program for a trading organization can be an unsettling experience, plagued by data limitations and philosophical quandaries. Here we shall address issues and present practical advice on how to proceed.

###### 14.7.1 Backtesting as Hypothesis Testing

Backtesting, as it is commonly practiced, is hypothesis testing. It poses all the familiar challenges of hypothesis testing. Let’s focus on two:

- Philosophically, hypothesis testing treats the null hypothesis as “valid” or “invalid” whereas in many applications the question is more one of the null hypothesis being either an imperfect but useful assumption or an imperfect and not useful assumption. Stated another way, hypothesis testing is often applied to situations that are “gray” to determine if they are “black” or “white.”

- If we accept the null hypothesis as either “valid” or “invalid” there remains an uncomfortable tradeoff between the risk of Type I error and that of Type II error—reducing one increases the other, making it difficult—or controversial—to find a balance.

The problems are related. VaR measures aren’t “valid” or “invalid,” just as the approximation 3.142 for π is not “valid” or “invalid.” VaR measures and approximations are either “useful” or “not useful,” and usefulness depends on context. For a carpenter, 3.142 may be a useful approximation of π, but it might not be for an astronomer. A particular VaR measure may be useful for assessing the market risk of futures portfolios but not of portfolios containing options on those futures. While we generally speak of “backtesting a VaR measure,” in fact we backtest a VaR measure as applied to a particular portfolio.

With backtesting, we distinguish between those VaR measures we will reject and those we will continue to use for a particular trading portfolio. Where we draw the line is a compromise to balance the risk of rejecting a “valid” VaR measure against that of failing to reject an “invalid” VaR measure. Never mind that this is a compromise over a contrived issue. It really isn’t a compromise at all. Researchers in the social sciences long ago adopted the convention of testing at the .05 or .01 significance level. Use of the .05 significance level predominates, but a researcher whose data is particularly strong may report results at the .01 significance level to emphasize the fact. Accordingly, there is no real debate about what significance level to use. In backtesting, we use the .05 significance level based solely on the established convention and the fact that backtest data is rarely good enough to warrant the .01 significance level.

Bluntly stated, we accept or reject VaR measures based on a convention for how to compromise over a contrived issue. The convention is use of the .05 significance level. The compromise is about balancing the risks of Type I vs. Type II errors. The contrived issue is that of a particular VaR measure being somehow “valid” or “invalid”.

These problems exist for hypothesis testing in fields other than finance. Social scientists embrace the hypothesis testing approach because there aren’t really good alternatives. In backtesting, we are fortunate to have two or three years of data on the performance of a VaR measure. If historical data weren’t so limited, we could go beyond the contrived issue of VaR measures being “valid” or “invalid” and truly assess the usefulness of individual VaR measures. It is limited data, more than anything else, that drives us to accept the hypothesis testing approach to backtesting. Formal hypothesis testing largely substitutes convention for meaningful test design. This may be a weakness, but it is also a strength. Without extensive data, careful test design is impossible. Convention-driven hypothesis testing allows us to make decisions with limited data in a manner that, despite only loosely conforming to our needs, is consistent. Arguably, it represents the best option available to us for interpreting limited data.

###### 14.7.2 Alternatives

The Basel Committee’s traffic light backtest doesn’t employ hypothesis testing. It is just a rule specified by regulators based on their intuitive sense of what seemed reasonable. Its graded response of increasing capital charges within the yellow zone avoids the stark “valid” or “invalid” distinction of hypothesis testing at the expense of creating an illusion of precision. With just α + 1 = 250 data points, it is difficult to draw any conclusion whatsoever about a VaR measure, especially a VaR measure that is supposed to experience just one exceedance every 100 days.
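To get a feel for how little 250 observations reveal, the following Python sketch computes exceedance-count probabilities. It assumes exceedances are independent with a constant probability, so the count is binomial. The zone boundaries (green: 0 to 4, yellow: 5 to 9, red: 10 or more exceedances) are those of the Basel framework for 250 observations; the function names and the 0.98 alternative coverage are illustrative choices, not part of any regulation.

```python
from math import comb


def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def zone(exceedances):
    """Traffic light zone for an exceedance count over 250 observations."""
    if exceedances <= 4:
        return "green"
    if exceedances <= 9:
        return "yellow"
    return "red"


def zone_probabilities(n, coverage):
    """Probability of landing in each zone, given the measure's true coverage."""
    p = 1 - coverage  # probability of an exceedance on any one day
    green = binom_cdf(4, n, p)
    up_to_yellow = binom_cdf(9, n, p)
    return {"green": green, "yellow": up_to_yellow - green, "red": 1 - up_to_yellow}


if __name__ == "__main__":
    for coverage in (0.99, 0.98):
        probs = zone_probabilities(250, coverage)
        print(coverage, {name: round(value, 4) for name, value in probs.items()})
```

Under these assumptions, a VaR measure with true coverage of only 0.98, which experiences exceedances twice as often as it should, still lands in the green zone more than 40% of the time.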

For banks, having their VaR measure perform poorly on the traffic light test would cost more than elevated capital charges. Regulators might force them to go through an expensive and time consuming process of implementing a new VaR measure. At a minimum, poor performance on the traffic light test would attract scrutiny, which banks generally want to avoid.

Rather than entrust such matters to luck, banks have tended to implement conservative VaR measures whose coverage *q*\* well exceeds the 0.99 level of the loss quantile they purport to measure. Some such measures are so conservative they practically never experience an exceedance. This all but guarantees the VaR measures perform well on the traffic light test.

Lopez (1999) extends the traffic light test’s approach of grading backtest results more finely. Drawing on decision theory, he suggests that the accuracy of VaR measures be gauged by how well they minimize a “loss function” reflective of the evaluator’s priorities, which might include avoiding extraordinary one-day losses or avoiding increased regulatory capital charges. While this is consistent with the goal of accepting or rejecting VaR measures based on an assessment of their usefulness, it poses a risk of drawing conclusions not warranted by the limited data available for backtesting.

Lopez’s approach compares

- the value of the loss function achieved by a VaR measure over a period of α + 1 observations, and

- some benchmark value an accurate VaR measure might have achieved over the same period.

Depending on how the loss function is defined, this can be straightforward, or it can entail assumptions. For example, if the loss function is set equal to the number of exceedances experienced over the α + 1 observations, Lopez’s methodology reduces to a simple coverage test. For a more interesting, and more problematic, loss function, define the **magnitude of an exceedance** as the maximum of 1) a portfolio’s actual loss minus the value-at-risk for that period, and 2) zero. A loss function based on the magnitude of exceedances addresses a concern of many managers: how bad can a loss be on days it exceeds reported value-at-risk? But evaluating a benchmark for such a loss function requires some assumptions as to how an accurate VaR measure might have performed. Should a VaR measure fail a backtest based on such a loss function, the question arises as to whether the problem resides with the VaR measure or with the assumptions used to model the benchmark.
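The two loss functions just described can be sketched in a few lines of Python. This is a minimal illustration, assuming losses and value-at-risk are both reported as positive numbers in the same units. Summing the magnitudes is just one way to aggregate them and is not necessarily the form Lopez proposes; all function names are hypothetical.

```python
def exceedance_magnitudes(losses, var_values):
    """Per-period magnitude of exceedance: max(loss - VaR, 0)."""
    return [max(loss - var, 0.0) for loss, var in zip(losses, var_values)]


def count_loss(losses, var_values):
    """Loss function equal to the number of exceedances."""
    return sum(1 for m in exceedance_magnitudes(losses, var_values) if m > 0)


def magnitude_loss(losses, var_values):
    """Loss function equal to the summed magnitudes of exceedances."""
    return sum(exceedance_magnitudes(losses, var_values))


def expected_exceedances(n_obs, q):
    """Benchmark for count_loss: expected exceedances of an accurate q-quantile VaR."""
    return n_obs * (1 - q)


if __name__ == "__main__":
    # Made-up data, purely for illustration.
    losses = [1.0, 3.5, 0.2, 5.0]
    var_values = [3.0, 3.0, 3.0, 3.0]
    print(count_loss(losses, var_values))      # 2
    print(magnitude_loss(losses, var_values))  # 2.5
```

The benchmark for `count_loss` follows directly from the quantile *q*. No comparable benchmark function is given for `magnitude_loss` because, as noted above, it cannot be written down without assuming a distribution for losses beyond the value-at-risk.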

###### 14.7.3 Joint Tests

Joint tests are backtests that simultaneously assess two or more criteria for a VaR measure—say coverage and exceedance independence. Such tests have been proposed by Christoffersen (1998) and Christoffersen and Pelletier (2004). Campbell (2005) recommends against their use:

> While joint tests have the property that they will eventually detect a VaR measure which violates either of these properties, this comes at the expense of a decreased ability to detect a VaR measure which only violates one of the two properties. If, for example, a VaR measure exhibits appropriate unconditional coverage but violates the independence property, then an independence test has a greater likelihood of detecting this inaccurate VaR measure than a joint test.
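Campbell’s point can be made concrete with likelihood-ratio tests of the kind Christoffersen (1998) proposed. The Python sketch below is an illustration only; it does not implement the recommended standard tests of this chapter. It takes a series of 0/1 exceedance indicators, computes a coverage statistic and a first-order Markov independence statistic, and forms the joint statistic as their sum. Each individual statistic is compared with the .05 critical value of a chi-squared distribution with one degree of freedom (3.841), while the joint statistic must clear the higher critical value for two degrees of freedom (5.991). Function and variable names are assumptions of this sketch.

```python
from math import log

CHI2_CRIT_05 = {1: 3.841, 2: 5.991}  # .05 critical values by degrees of freedom


def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def lr_coverage(hits, p):
    """LR statistic for H0: exceedance probability equals p (1 d.o.f.)."""
    n1 = sum(hits)
    n0 = len(hits) - n1
    pi = n1 / len(hits)  # observed exceedance frequency
    ll_null = _xlogy(n0, 1 - p) + _xlogy(n1, p)
    ll_alt = _xlogy(n0, 1 - pi) + _xlogy(n1, pi)
    return -2 * (ll_null - ll_alt)


def lr_independence(hits):
    """LR statistic for H0: today's exceedance is independent of yesterday's (1 d.o.f.)."""
    n = [[0, 0], [0, 0]]  # n[i][j] counts transitions from state i to state j
    for prev, cur in zip(hits, hits[1:]):
        n[prev][cur] += 1
    (n00, n01), (n10, n11) = n
    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0
    pi = (n01 + n11) / (n00 + n01 + n10 + n11)
    ll_null = _xlogy(n00 + n10, 1 - pi) + _xlogy(n01 + n11, pi)
    ll_alt = (_xlogy(n00, 1 - pi01) + _xlogy(n01, pi01)
              + _xlogy(n10, 1 - pi11) + _xlogy(n11, pi11))
    return -2 * (ll_null - ll_alt)


def lr_joint(hits, p):
    """Joint statistic, taken as the sum of the two (2 d.o.f.)."""
    return lr_coverage(hits, p) + lr_independence(hits)


if __name__ == "__main__":
    clustered = [0] * 50 + [1, 1, 1] + [0] * 47  # made-up indicator series
    print("coverage    ", lr_coverage(clustered, 0.01), "vs", CHI2_CRIT_05[1])
    print("independence", lr_independence(clustered), "vs", CHI2_CRIT_05[1])
    print("joint       ", lr_joint(clustered, 0.01), "vs", CHI2_CRIT_05[2])
```

When coverage is roughly correct, the coverage statistic contributes little to the sum, yet the joint test still has to exceed the larger two-degree-of-freedom threshold. That is the loss of power Campbell describes.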

###### 14.7.4 Designing a Backtesting Program

When a VaR measure is first implemented its performance will be closely monitored. Data will be insufficient for meaningful statistical analyses, but a graph such as Exhibit 14.1 can be updated monthly and monitored for signs of irregular performance. Parallel testing against a legacy VaR measure is also appropriate. At this stage, the goal is primarily to address Type B model implementation risk. Coding or implementation errors can produce noticeable distortions in a VaR measure’s performance, even over short periods of time.

At six months, coding or other implementation issues should have been identified and resolved. If any of these motivated substantive changes in the VaR measure or its output, you will want to wait until six months after the last substantive change before performing any statistical backtests. Results from our recommended standard distribution test are likely to be the most meaningful at this point, as six months of data really isn’t enough for coverage or independence tests.

Perform another backtest at one year. Now include our recommended standard independence test. If you calculate value-at-risk at the 90% or 95% level, also include our recommended standard coverage test. Otherwise, wait two years before performing all three of our recommended standard tests. Continue to backtest annually using those three tests. Use all available data generated since the last substantive change to the VaR system, up to a maximum of five years.

I recommend institutions use the three recommended standard tests described in this chapter. They are as good as any you will find in the literature, and better than most. Some widely cited backtests are flawed or ineffective. Banks will also need to perform the traffic light backtest, as required by their regulators. Backtests should be performed with both clean and dirty data.

###### 14.7.5 Failing a Backtest

Because they are performed at the .05 significance level, failure of any one of our recommended standard backtests is a strong indication of a material shortcoming in a VaR measure’s performance. Your response will depend on the particular test failed, whether it was failed with clean or dirty data, and your assessment of the circumstances that caused the failure. A graph similar to Exhibit 14.10 is useful for diagnosing problems identified by coverage or distribution tests.

Failure of a clean test—or both a clean test and the corresponding dirty test—is indicative of a Type A (model design) or Type B (implementation) problem with the VaR measure. Focus your analysis first on eliminating the possibility of an implementation or coding error. Only then address the possibility of Type A design shortcomings.

A design shortcoming may not necessarily dictate a fundamental change in the design of your VaR measure. If your VaR measure already incorporates sophisticated analytics suitable for your portfolio, modifying those analytics may not be productive. A review of your backtesting data may indicate that an ad hoc solution, such as multiplying output by a scalar, may fix the problem.

For example, if your VaR measure failed a clean recommended standard distribution test, and you are comfortable the model design is appropriate for your portfolio, you can go back and redo the distribution test using the same past VaR measurements, but multiply each by a scalar *w*. Through trial and error, or some search routine, you can solve for the value of *w* that optimizes performance on the test (i.e., maximizes the sample correlation that the test computes from the *n<sub>j</sub>*). Going forward, scale VaR measurements by that value *w*.

Some may feel uncomfortable with an ad hoc solution like this. Keep in mind that a VaR measure is a practical tool. Our goal is not to develop some theoretically beautiful model for the complex dynamics of markets. All we require is a reasonable indication of market risk. The philosophy of science tells us to judge a model based on the usefulness of its predictions and not on the nature of its assumptions. If we can fix a VaR measure by simply scaling its output, then there is every reason to do so.

Of course, this solution only applies if a VaR measure is already sophisticated enough to capture relevant market dynamics. If a portfolio is exposed to vega risk or basis risk, and the VaR measure isn’t designed to capture these, no amount of scaling of that VaR measure’s output is going to solve the problem. If a Monte Carlo VaR measure is so computationally intensive that there is only time for a sample of size 250 for each overnight VaR analysis, the standard error will be enormous. Scaling the output will not solve this problem. The computations need to be streamlined—perhaps with a holdings remapping and/or variance reduction—and the sample size increased.
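A rough calculation shows why a Monte Carlo sample of size 250 is inadequate. The Python sketch below uses the standard large-sample approximation for the standard error of a sample quantile, sqrt(q(1 − q)/n) divided by the density at the quantile. It assumes, purely for illustration, that portfolio losses are standard normal; a real portfolio’s loss distribution will differ, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def quantile_standard_error(q, n, dist=NormalDist()):
    """Approximate standard error of the sample q-quantile from n independent draws."""
    x_q = dist.inv_cdf(q)  # true quantile under the assumed distribution
    return sqrt(q * (1 - q) / n) / dist.pdf(x_q)


def relative_standard_error(q, n, dist=NormalDist()):
    """Standard error expressed as a fraction of the quantile itself."""
    return quantile_standard_error(q, n, dist) / dist.inv_cdf(q)


if __name__ == "__main__":
    for n in (250, 1000, 10000):
        print(n, round(relative_standard_error(0.99, n), 4))
```

Under these assumptions, the standard error of a 99% value-at-risk estimate based on 250 draws is roughly 10% of the value-at-risk itself. Because the error shrinks only with the square root of the sample size, a 40-fold increase in sample size reduces it by a factor of only about six.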

Tweaking a poorly designed VaR measure is only going to produce another poorly designed VaR measure. If a VaR measure is fundamentally unsuited for the portfolio it is applied to, it needs to be fundamentally redesigned.

Some shortcomings of VaR measures must be lived with. The standard UWMA and EWMA techniques for modeling covariance matrices do not address market heteroskedasticity well. As we indicated in Chapter 7, there are currently no good solutions to this problem. Today’s VaR measures are slow in responding to rising market volatility. During such periods, they tend to experience clustered exceedances. Similarly, when volatilities decline, they again lag, and may experience few or no exceedances. These phenomena may cause a VaR measure to fail an independence test. There is little that can be done about the problem.

Failure of a dirty test and not the corresponding clean test is an indication of a Type C model application problem.
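The diagnostic logic of this section, for one test run on both clean and dirty data, can be condensed into a short Python sketch. The function name and return strings are hypothetical; the mapping from outcomes to problem types follows the text.

```python
def diagnose(clean_failed, dirty_failed):
    """Map the outcomes of corresponding clean and dirty backtests to a likely problem type."""
    if clean_failed:
        # A clean failure, with or without a dirty failure, points at the measure itself.
        # Rule out implementation errors (Type B) before revisiting the design (Type A).
        return "Type A (model design) or Type B (implementation)"
    if dirty_failed:
        # Only the dirty test failed, so the issue lies in how the measure is applied.
        return "Type C (model application)"
    return "no failure indicated"


if __name__ == "__main__":
    print(diagnose(clean_failed=False, dirty_failed=True))
```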

###### 14.7.6 Backtesting Other PMMRs

This chapter, like the literature, has focused on backtesting of VaR measures. If you employ some other PMMR, coverage and exceedance independence tests will not apply, but it may be possible to develop tests analogous to those tests for your particular PMMR. Our recommended standard distribution and independence tests are not limited to value-at-risk. They can be applied with most PMMRs.