14.3 Backtesting With Coverage Tests
Even before J.P. Morgan’s RiskMetrics Technical Document described a graphical backtest, the concept of backtesting was familiar, at least within institutions then using value-at-risk. Two years earlier, the Group of 30 (1993) had recommended, and one month earlier the Basel Committee (1995) had also recommended, that institutions apply some form of backtesting to their value-at-risk results. Neither specified a methodology. In September 1995, Crnkovic and Drachman circulated to clients of J.P. Morgan a draft paper describing a distribution test and an independence test, which they published the next year. The first published statistical backtests were the coverage tests of Kupiec (1995). In 1996, the Basel Committee published its “traffic light” backtest.
14.3.1 A Recommended Standard Coverage Test
Consider a q quantile-of-loss value-at-risk measure and define a univariate exceedance process I with terms

[14.1]    I_t = 1 if the portfolio’s loss for period t exceeds its value-at-risk for that period, and I_t = 0 otherwise.
To conduct a coverage test, we gather historical exceedance data i_–α, i_–α+1, … , i_0. We assume the I_t are IID, which allows us to treat our data as a realization of a sample I_–α, I_–α+1, … , I_0.
We define the coverage q* of the value-at-risk measure as the actual frequency with which its value-at-risk is not exceeded (i.e. instances of i_t = 0). This can be expressed as an unconditional expectation:

[14.2]    q* = Pr(I_t = 0) = 1 – E(I_t)
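To make these definitions concrete, here is a minimal Python sketch, assuming numpy is available; the losses and var arrays are hypothetical placeholders, not data from the text. It forms the exceedance indicators of [14.1] and a sample estimate of the coverage q* of [14.2].

    import numpy as np

    # Hypothetical one-day losses and 95% value-at-risk figures
    # (placeholders; positive values denote losses).
    losses = np.array([1.2, -0.3, 2.7, 0.4, 3.1])
    var = np.array([2.5, 2.4, 2.6, 2.5, 2.8])

    # Exceedance indicators per [14.1]: 1 when loss exceeds value-at-risk.
    i = (losses > var).astype(int)

    # Sample estimate of the coverage q* per [14.2]:
    # the frequency of non-exceedances.
    q_star_hat = 1.0 - i.mean()
    print(i, q_star_hat)    # [0 0 1 0 1] 0.6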
Coverage tests are hypothesis tests with the null hypothesis H0 that q = q*. Let x denote the number of exceedances observed in the data:
[14.3]    x = i_–α + i_–α+1 + … + i_0
We treat x as a realization of a binomial random variable X. Our null hypothesis is then simply that X ~ B(α + 1, 1 – q). To test at some significance level ε, we must determine values x1 and x2 such that
[14.4]    Pr(x1 ≤ X ≤ x2) ≥ 1 – ε
Multiple intervals [x1, x2] will satisfy this criterion, so we seek a solution that is generally symmetric in the sense that Pr(X < x1) ≈ Pr(x2 < X) ≈ ε/2.
Formally, define a as the maximum integer such that Pr(X < a) ≤ ε/2 and b as the minimum integer such that Pr(b < X) ≤ ε/2. Consider all intervals of the form [a + n, b] or [a, b – n] where n is a non-negative integer. Set [x1, x2] equal to whichever of these maximizes Pr(X ∉ [x1, x2]) subject to the constraint that Pr(X ∉ [x1, x2]) ≤ ε. Our backtest procedure is then to observe the value-at-risk measure’s performance for α + 1 periods and record the number of exceedances X. If X ∉ [x1, x2], we reject the value-at-risk measure at the ε significance level.
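This construction is easy to automate. Below is a sketch in Python, assuming scipy is available; standard_coverage_interval is our own name for the routine, not a library function.

    from scipy.stats import binom

    def standard_coverage_interval(q, n, eps):
        """Non-rejection interval [x1, x2] for the recommended standard
        coverage test, where n = alpha + 1 observations and, under the
        null hypothesis, X ~ B(n, 1 - q)."""
        dist = binom(n, 1 - q)

        # a: maximum integer with Pr(X < a) <= eps/2.
        a = 0
        while dist.cdf(a) <= eps / 2:        # cdf(a) = Pr(X < a + 1)
            a += 1

        # b: minimum integer with Pr(b < X) <= eps/2.
        b = n
        while dist.sf(b - 1) <= eps / 2:     # sf(b - 1) = Pr(X > b - 1)
            b -= 1

        def p_outside(lo, hi):               # Pr(X outside [lo, hi])
            return dist.cdf(lo - 1) + dist.sf(hi)

        # Among intervals [a + n, b] and [a, b - n], take whichever
        # maximizes Pr(X outside) subject to Pr(X outside) <= eps.
        candidates = [(a + k, b) for k in range(b - a + 1)]
        candidates += [(a, b - k) for k in range(b - a + 1)]
        feasible = [c for c in candidates if p_outside(*c) <= eps]
        return max(feasible, key=lambda c: p_outside(*c))

    print(standard_coverage_interval(q=0.95, n=500, eps=0.05))   # (16, 35)

The same routine can be used to regenerate intervals such as those tabulated in Exhibit 14.3 for other values of q and α + 1.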
Suppose we implement a one-day 95% value-at-risk measure and plan to backtest it at the .05 significance level after 500 trading days (about two years). Then q = 0.95 and α + 1 = 500. Assuming H0, we know X ~ B(500, .05). We use this distribution to determine x1 = 16 and x2 = 35. Calculations are summarized in Exhibit 14.2. We will reject the value-at-risk measure if X ∉ [16, 35].

Exhibit 14.3 indicates similar .05 significance level non-rejection intervals [x1, x2] for other values of q and α + 1.

14.3.2 Kupiec’s PF Coverage Test
Kupiec’s “proportion of failures” (PF) coverage test takes a circuitous—and approximate—route to an answer, offering no particular advantage over our recommended standard coverage test. Comparing the two tests can be informative, illustrating the various respects in which test designs may differ. As the first published backtesting methodology, the PF test has been widely cited.
As with the recommended standard test, a value-at-risk measure is observed for α + 1 periods, experiencing X exceedances. We adopt the same null hypothesis H0 that q = q*. Rather than directly calculate probabilities from the B(α + 1, 1 – q) distribution of X under H0, the PF test uses that distribution to construct a likelihood ratio:

[14.5]    Λ = [(1 – q)^X q^(α + 1 – X)] / [(X/(α + 1))^X (1 – X/(α + 1))^(α + 1 – X)]
It is difficult to infer probabilities directly from this ratio. As described in Section 4.5.4, a standard technique is to consider –2 log(Λ):
[14.6]    –2 log(Λ) = –2 log[(1 – q)^X q^(α + 1 – X)] + 2 log[(X/(α + 1))^X (1 – X/(α + 1))^(α + 1 – X)]

[14.7]    = –2[X log(1 – q) + (α + 1 – X) log(q)] + 2[X log(X/(α + 1)) + (α + 1 – X) log(1 – X/(α + 1))]
which is—see Lehmann and Romano (2005)—approximately centrally chi-squared with one degree of freedom. That is, –2 log(Λ) ~ χ2(1,0), assuming H0. Kupiec found this approximation to be reasonable based on a Monte Carlo analysis, but Lopez (1999) claims to have found “meaningful” discrepancies using his own Monte Carlo analysis.
For a given significance level ε, we construct a non-rejection interval [x1, x2] such that
[14.8]    Pr(x1 ≤ X ≤ x2) ≥ 1 – ε
under H0. To do so, calculate the 1 – ε quantile of the χ2(1,0) distribution. Setting [14.7] equal to this quantile, solve for X. There will be two solutions. Rounding the lower one down and the higher one up yields x1 and x2.
Consider the example we looked at with the recommended standard coverage test. We implement a one-day 95% value-at-risk measure and plan to backtest it at the .05 significance level after 500 trading days, so q = 0.95 and α + 1 = 500. We calculate the 1 – ε = .95 quantile of the χ2(1,0) distribution as 3.841. Setting [14.7] equal to this, we solve for X. There are two solutions: 16.05 and 35.11. Rounding down and up, respectively, we set x1 = 16 and x2 = 36. We will reject the value-at-risk measure if X ∉ [16, 36].
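This calculation is straightforward to reproduce. Below is a sketch in Python, assuming scipy is available; kupiec_statistic and kupiec_interval are our own names, and [14.7] is evaluated with X treated as continuous so that the two crossings can be located numerically.

    import numpy as np
    from scipy.optimize import brentq
    from scipy.stats import chi2

    def kupiec_statistic(x, q, n):
        """Evaluate -2 log(Lambda) of [14.7] with x treated as continuous."""
        p_hat = x / n                      # maximum-likelihood exceedance rate
        return (-2 * (x * np.log(1 - q) + (n - x) * np.log(q))
                + 2 * (x * np.log(p_hat) + (n - x) * np.log(1 - p_hat)))

    def kupiec_interval(q, n, eps):
        cutoff = chi2.ppf(1 - eps, df=1)   # 3.841 for eps = .05
        f = lambda x: kupiec_statistic(x, q, n) - cutoff
        mean = n * (1 - q)                 # the statistic is zero at x = mean
        lo = brentq(f, 1e-9, mean)         # lower crossing, about 16.05
        hi = brentq(f, mean, n - 1e-9)     # upper crossing, about 35.11
        return int(np.floor(lo)), int(np.ceil(hi))

    print(kupiec_interval(q=0.95, n=500, eps=0.05))   # (16, 36)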
Exhibit 14.4 indicates similar .05 significance level non-rejection intervals [x1, x2] for other values of q and α + 1.

14.3.3 The Basel Committee’s Traffic Light Coverage Test
The 1996 Amendment to the Basel Accord imposed a capital charge on banks for market risk. It allowed banks to calculate the charge using their own proprietary value-at-risk measures, subject to regulatory approval. A bank had to have an independent risk management function and satisfy regulators that it was following acceptable risk management practices. Regulators also needed to be satisfied that the proprietary value-at-risk measure was sound.
Proprietary measures had to support a 10-day 99% value-at-risk metric, but as a practical matter, banks were allowed to calculate one-day 99% value-at-risk and scale the result by the square root of 10.
The Basel Committee (1996b) specified a methodology for backtesting proprietary value-at-risk measures. Banks were to backtest their one-day 99% value-at-risk results (i.e. value-at-risk before scaling by the square root of 10) against daily P&Ls. It was left to national regulators whether backtesting was based on clean or dirty P&Ls. Backtests were to be performed quarterly using the most recent 250 days of data. Based on the number of exceedances experienced during that period, the value-at-risk measure would be categorized as falling into one of three colored zones, as indicated in Exhibit 14.5:
Exhibit 14.5: The Basel Committee’s three zones, the multipliers it recommended for banks’ market-risk capital charges, and cumulative probabilities Pr(X ≤ x) calculated with the B(250, 0.01) distribution of the number of exceedances X under H0:

Zone      Exceedances x   Multiplier   Pr(X ≤ x)
Green     0 – 4           3.00         89.22% (at x = 4)
Yellow    5               3.40         95.88%
          6               3.50         98.63%
          7               3.65         99.60%
          8               3.75         99.89%
          9               3.85         99.97%
Red       10 or more      4.00         99.99% or more
Value-at-risk measures falling in the green zone raised no particular concerns. Those falling in the yellow zone required monitoring. The Basel Committee recommended that, at national regulators’ discretion, value-at-risk results from yellow-zone value-at-risk measures be weighted more heavily in calculating banks’ capital charges for market risk—the recommended multipliers are indicated in Exhibit 14.5. Value-at-risk measures falling in the red zone had to be weighted more heavily and were presumed flawed—national regulators would investigate what caused so many exceedances and require that the value-at-risk measure be improved.
The Basel Committee’s procedure is based on no statistical theory for hypothesis testing. The three zones were justified as reasonable in light of the probabilities indicated in Exhibit 14.5 (and probabilities assuming q* = 0.98, q* = 0.97, etc., which the committee also considered). Due to its ad hoc nature, the backtesting methodology is not theoretically interesting. It is important because of its wide use by banks.
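Those probabilities are easy to tabulate. The sketch below, assuming scipy is available, prints the cumulative distribution of exceedances under H0 for the Basel parameters (a one-day 99% measure observed for 250 days, so X ~ B(250, 0.01)); the zone boundaries in Exhibit 14.5 correspond to points in this table.

    from scipy.stats import binom

    # Under H0, exceedances of a one-day 99% value-at-risk measure
    # observed for 250 trading days are distributed X ~ B(250, 0.01).
    dist = binom(250, 0.01)
    for k in range(11):
        print(f"Pr(X <= {k:2d}) = {dist.cdf(k):.4f}")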
Exercises
Suppose we implement a one-day 90% value-at-risk measure and plan to backtest it with our recommended standard coverage test at the .05 significance level after 375 trading days (about eighteen months). Then q = 0.90 and α + 1 = 375. Calculate the non-rejection interval.
Suppose we want to apply Kupiec’s PF backtest to the same one-day 90% value-at-risk measure as in the previous exercise. Again, the significance level is .05, q = 0.90 and α + 1 = 375. Calculate the non-rejection interval. Compare your result with that of the previous exercise.