14.8  Further Reading – Backtesting

For an alternative discussion of backtesting, see Campbell (2005).

For some notable backtesting methodologies not discussed in this chapter, see Haas (2001), Engle and Manganelli (2004), and Ziggel et al. (2014). See also Christoffersen and Pelletier (2004), Haas (2005), and Berkowitz et al. (2011), who discuss duration-based backtesting methodologies. These are a form of exceedance independence test that assesses whether intervals between exceedances appear random.

Da Silva et al. (2006), Berkowitz et al. (2011) and Røynstrand et al. (2012) assess the performance of various backtesting methodologies using actual and/or simulated P&L data.

 

14.7  Backtesting Strategy

Specifying a backtesting program for a trading organization can be an unsettling experience, plagued by data limitations and philosophical quandaries. Here we shall address issues and present practical advice on how to proceed.

14.7.1 Backtesting as Hypothesis Testing

Backtesting, as it is commonly practiced, is hypothesis testing. It poses all the familiar challenges of hypothesis testing. Let’s focus on two:

  • Philosophically, hypothesis testing treats the null hypothesis as “valid” or “invalid” whereas in many applications the question is more one of the null hypothesis being either an imperfect but useful assumption or an imperfect and not useful assumption. Stated another way, hypothesis testing is often applied to situations that are “gray” to determine if they are “black” or “white.”
  • If we accept the null hypothesis as either “valid” or “invalid” there remains an uncomfortable tradeoff between the risk of Type I error and that of Type II error—reducing one increases the other, making it difficult—or controversial—to find a balance.

The problems are related. Value-at-risk measures aren’t “valid” or “invalid,” just as the approximation 3.142 for π is not “valid” or “invalid.” Value-at-risk measures and approximations are either “useful” or “not useful,” and usefulness depends on context. For a carpenter, 3.142 may be a useful approximation of π, but it might not be for an astronomer. A particular value-at-risk measure may be useful for assessing the market risk of futures portfolios but not of portfolios containing options on those futures. While we generally speak of “backtesting a value-at-risk measure,” in fact we backtest a value-at-risk measure as applied to a particular portfolio.

With backtesting, we distinguish between those value-at-risk measures we will reject and those we will continue to use for a particular trading portfolio. Where we draw the line is a compromise to balance the risk of rejecting a “valid” value-at-risk measure against that of failing to reject an “invalid” value-at-risk measure. Never mind that this is a compromise over a contrived issue. It really isn’t a compromise at all. Researchers in the social sciences long ago adopted the convention of testing at the .05 or .01 significance level. Use of the .05 significance level predominates, but a researcher whose data is particularly strong may report results at the .01 significance level to emphasize the fact. Accordingly, there is no real debate about what significance level to use. In backtesting, we use the .05 significance level based solely on the established convention and the fact that backtest data is rarely good enough to warrant the .01 significance level.

Bluntly stated, we accept or reject value-at-risk measures based on a convention for how to compromise over a contrived issue. The convention is use of the .05 significance level. The compromise is about balancing the risks of Type I vs. Type II errors. The contrived issue is that of a particular value-at-risk measure being somehow “valid” or “invalid”.

These problems exist for hypothesis testing in fields other than finance. Social scientists embrace the hypothesis testing approach because there aren’t really good alternatives. In backtesting, we are fortunate to have two or three years of data on the performance of a value-at-risk measure. If historical data weren’t so limited, we could go beyond the contrived issue of value-at-risk measures being “valid” or “invalid” and truly assess the usefulness of individual value-at-risk measures. It is limited data, more than anything else, that drives us to accept the hypothesis testing approach to backtesting. Formal hypothesis testing largely substitutes convention for meaningful test design. This may be a weakness, but it is also a strength. Without extensive data, careful test design is impossible. Convention-driven hypothesis testing allows us to make decisions with limited data in a manner that, despite only loosely conforming to our needs, is consistent. Arguably, it represents the best option available to us for interpreting limited data.

14.7.2 Alternatives

The Basel Committee’s traffic light backtest doesn’t employ hypothesis testing. It is just a rule specified by regulators based on their intuitive sense of what seemed reasonable. Its graded response of increasing capital charges within the yellow zone avoids the stark “valid” or “invalid” distinction of hypothesis testing at the expense of creating an illusion of precision. With just α + 1 = 250 data points, it is difficult to draw any conclusion whatsoever about a value-at-risk measure, especially a value-at-risk measure that is supposed to experience just one exceedance every 100 days.

For banks, having their value-at-risk measure perform poorly on the traffic light test would cost more than elevated capital charges. Regulators might force them to go through an expensive and time consuming process of implementing a new value-at-risk measure. At a minimum, poor performance on the traffic light test would attract scrutiny, which banks generally want to avoid.

Rather than entrust such matters to luck, banks have tended to implement conservative value-at-risk measures whose coverage q* well exceeds the 0.99 quantile of loss they purport to measure. Some such measures are so conservative they practically never experience an exceedance.5 This all but guarantees the value-at-risk measures perform well on the traffic light test.

Lopez (1999) builds on the traffic light approach by more finely grading backtest results. Drawing on decision theory, he suggests that the accuracy of value-at-risk measures be gauged by how well they minimize a “loss function” reflective of the evaluator’s priorities, which might include avoiding extraordinary one-day losses or avoiding increased regulatory capital charges. While this is consistent with the goal of accepting or rejecting value-at-risk measures based on an assessment of their usefulness, it poses a risk of drawing conclusions not warranted by the limited data available for backtesting.

Lopez’s approach compares

  • the value of the loss function achieved by a value-at-risk measure over a period of α + 1 observations, and
  • some benchmark value an accurate value-at-risk measure might have achieved over the same period.

Depending on how the loss function is defined, this can be straightforward, or it can entail assumptions. For example, if the loss function is set equal to the number of exceedances experienced over the α + 1 observations, Lopez’s methodology reduces to a simple coverage test. For a more interesting—and problematic—loss function, define the magnitude of an exceedance as the maximum of 1) a portfolio’s actual loss minus the value-at-risk for that period, and 2) zero. A loss function based on the magnitude of exceedances addresses a concern of many managers: how bad can a loss be on days it exceeds reported value-at-risk? But evaluating a benchmark for such a loss function requires some assumptions as to how an accurate value-at-risk measure might have performed. Should a value-at-risk measure fail a backtest based on such a loss function, the question arises as to whether the problem resides with the value-at-risk measure or with the assumptions used to model the benchmark.
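As a sketch of such a magnitude-based loss function (the function names and the simple sum aggregation are illustrative assumptions on our part, not Lopez’s exact specification):

```python
def exceedance_magnitude(loss, var):
    """Magnitude of an exceedance: actual loss minus value-at-risk, floored at zero."""
    return max(loss - var, 0.0)

def magnitude_loss_function(losses, var_forecasts):
    """A magnitude-based loss function: total exceedance magnitude over the period."""
    return sum(exceedance_magnitude(l, v) for l, v in zip(losses, var_forecasts))
```

A backtest in Lopez’s framework would then compare this total against a benchmark value modeled for an accurate value-at-risk measure.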

14.7.3 Joint Tests

Joint tests are backtests that simultaneously assess two or more criteria for a value-at-risk measure—say coverage and exceedance independence. Such tests have been proposed by Christoffersen (1998) and Christoffersen and Pelletier (2004). Campbell (2005) recommends against their use:

While joint tests have the property that they will eventually detect a value-at-risk measure which violates either of these properties, this comes at the expense of a decreased ability to detect a value-at-risk measure which only violates one of the two properties. If, for example, a value-at-risk measure exhibits appropriate unconditional coverage but violates the independence property, then an independence test has a greater likelihood of detecting this inaccurate value-at-risk measure than a joint test.

14.7.4 Designing a Backtesting Program

When a value-at-risk measure is first implemented its performance will be closely monitored. Data will be insufficient for meaningful statistical analyses, but a graph such as Exhibit 14.1 can be updated monthly and monitored for signs of irregular performance. Parallel testing against a legacy value-at-risk measure is also appropriate. At this stage, the goal is primarily to address Type B model implementation risk. Coding or implementation errors can produce noticeable distortions in a value-at-risk measure’s performance, even over short periods of time.

At six months, coding or other implementation issues should have been identified and resolved. If any of these motivated substantive changes in the value-at-risk measure or its output, you will want to wait until six months after the last substantive change before performing any statistical backtests. Results from our recommended standard distribution test are likely to be the most meaningful at this point, as six months of data really isn’t enough for coverage or independence tests.

Perform another backtest at one year. Now include our recommended standard independence test. If you calculate value-at-risk at the 90% or 95% level, also include our recommended standard coverage test. Otherwise, wait two years before performing all three of our recommended standard tests. Continue to backtest annually using those three tests. Use all available data generated since the last substantive change to the value-at-risk system, up to a maximum of five years.

I recommend institutions use the three recommended standard tests described in this chapter. They are as good as any you will find in the literature, and better than most. Some widely cited backtests are flawed or ineffective. Banks will also need to perform the traffic light backtest, as required by their regulators. Backtests should be performed with both clean and dirty data.

14.7.5 Failing a Backtest

Because they are performed at the .05 significance level, failure of any one of our recommended standard backtests is a strong indication of a material shortcoming in a value-at-risk measure’s performance. Your response will depend on the particular test failed, whether it was failed with clean or dirty data, and your assessment of the circumstances that caused the failure. A graph similar to Exhibit 14.10 is useful for diagnosing problems identified by coverage or distribution tests.

Failure of a clean test—or both a clean test and the corresponding dirty test—is indicative of a Type A (model design) or Type B (implementation) problem with the value-at-risk measure. Focus your analysis first on eliminating the possibility of an implementation or coding error. Only then address the possibility of Type A design shortcomings.

A design shortcoming may not necessarily dictate a fundamental change in the design of your value-at-risk measure. If your value-at-risk measure already incorporates sophisticated analytics suitable for your portfolio, modifying those analytics may not be productive. A review of your backtesting data may indicate that an ad hoc solution, such as multiplying output by a scalar, may fix the problem.

For example, if your value-at-risk measure failed a clean recommended standard distribution test, and you are comfortable the model design is appropriate for your portfolio, you can go back and redo the distribution test using the same past value-at-risk measurements, but multiply each by a scalar w. Through trial and error, or some search routine, you can solve for the value w that optimizes performance on the test (i.e. maximizes the sample correlation between the nj and the corresponding quantiles of [14.11]). Going forward, scale value-at-risk measurements by that value w.

Some may feel uncomfortable with an ad hoc solution like this. Keep in mind that a value-at-risk measure is a practical tool. Our goal is not to develop some theoretically beautiful model for the complex dynamics of markets. All we require is a reasonable indication of market risk. The philosophy of science tells us to judge a model based on the usefulness of its predictions and not on the nature of its assumptions. If we can fix a value-at-risk measure by simply scaling its output, then there is every reason to do so.

Of course, this solution only applies if a value-at-risk measure is already sophisticated enough to capture relevant market dynamics. If a portfolio is exposed to vega risk or basis risk, and the value-at-risk measure isn’t designed to capture these, no amount of scaling of that value-at-risk measure’s output is going to solve the problem. If a Monte Carlo value-at-risk measure is so computationally intensive that there is only time for a sample of size 250 for each overnight value-at-risk analysis, the standard error will be enormous. Scaling the output will not solve this problem. The computations need to be streamlined—perhaps with a holdings remapping and/or variance reduction—and the sample size increased.

Tweaking a poorly designed value-at-risk measure is only going to produce another poorly designed value-at-risk measure. If a value-at-risk measure is fundamentally unsuited for the portfolio it is applied to, it needs to be fundamentally redesigned.

Some shortcomings of value-at-risk measures must be lived with. The standard UWMA and EWMA techniques for modeling covariance matrices do not address market heteroskedasticity well. As we indicated in Chapter 7, there are currently no good solutions to this problem. Today’s value-at-risk measures are slow in responding to rising market volatility. During such periods, they tend to experience clustered exceedances. Similarly, when volatilities decline, they again lag, and may experience few or no exceedances. These phenomena may cause a value-at-risk measure to fail an independence test. There is little that can be done about the problem.

Failure of a dirty test and not the corresponding clean test is an indication of a Type C model application problem.

14.7.6 Backtesting Other PMMRs

This chapter, like the literature, has focused on backtesting of value-at-risk measures. If you employ some other PMMR, coverage and exceedance independence tests will not apply, but it may be possible to develop tests analogous to those tests for your particular PMMR. Our recommended standard distribution and independence tests are not limited to value-at-risk. They can be applied with most PMMRs.

 

14.6  Example: Backtesting a One-Day 95% EUR Value-at-Risk Measure

Assume a one-day 95% EUR value-at-risk measure was used for a period of 125 trading days. Data gathered for backtesting is presented in Exhibit 14.8. We have already used the data from the second and third columns to construct Exhibit 14.1. We will now use the data to apply coverage, distribution and independence backtests.

14.6.1 Example: Applying Coverage Tests

To apply a coverage test, we need

  • the quantile of loss the value-at-risk measure is intended to measure: q = 0.95,
  • the number of observations: α + 1 = 125, and
  • the number of exceedances: x = 10.

The last value is obtained by summing the 0’s and 1’s in the fourth column of Exhibit 14.8.

Exhibit 14.8: Backtesting data for a one-day 95% EUR value-at-risk measure compiled over 125 trading days. Value-at-risk (VaR) and P&L values in the second and third columns are expressed in millions of euros. The exceedance column has a value of 1 if the portfolio realized a loss exceeding the 0.95 quantile of loss, as determined by the value-at-risk measure. Otherwise it has a value of 0. The last column indicates the specific quantile of loss for each P&L result, again, as determined by the value-at-risk measure.

It can also be obtained by visual inspection of Exhibit 14.1.

In Exhibit 14.3, we find that our recommended standard coverage test’s non-rejection interval for q = 0.95 and α + 1 = 125 is [2, 11]. Since our number of exceedances falls in this interval, we do not reject the value-at-risk measure.

In Exhibit 14.4, we find that the PF test’s non-rejection interval for q = 0.95 and α + 1 = 125 is [2, 12]. Since our number of exceedances falls in this interval, we do not reject the value-at-risk measure.
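The likelihood ratio underlying Kupiec’s PF test can also be evaluated directly, as a sketch (the statistic is Kupiec’s standard proportion-of-failures ratio; comparing it with a chi-squared quantile is an alternative to consulting the tabulated interval):

```python
import math

def kupiec_pf(q, n, x):
    """Kupiec's proportion-of-failures likelihood ratio statistic.
    q: targeted quantile; n: observations; x: exceedances (0 < x < n assumed)."""
    p = 1.0 - q            # expected exceedance probability
    phat = x / n           # observed exceedance frequency
    log_null = (n - x) * math.log(1.0 - p) + x * math.log(p)
    log_alt = (n - x) * math.log(1.0 - phat) + x * math.log(phat)
    return -2.0 * (log_null - log_alt)

stat = kupiec_pf(0.95, 125, 10)   # the example above
reject = stat >= 3.841            # 0.95 quantile of chi-squared(1)
```

With 10 exceedances in 125 observations the statistic is about 2.02, below 3.841, consistent with the non-rejection result above.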

We cannot use the Basel Committee’s traffic light coverage test because it applies only to 99% value-at-risk measures.

14.6.2 Example: Applying Distribution Tests

For distribution testing, we apply [14.10] to the loss quantiles tu and arrange the results in ascending order to obtain the nj. The corresponding standard normal quantiles are obtained from [14.11], with α + 1 = 125. Both sets of values are presented in Exhibit 14.9.

Exhibit 14.9: Values nj and corresponding standard normal quantiles calculated for the data of Exhibit 14.8.

These are plotted in Exhibit 14.10.

Exhibit 14.10: A plot of the point pairs from Exhibit 14.9.

The graphical results are inconclusive. The points do fall near a line of slope one passing through the origin, but the fit isn’t particularly good. Is this due to the small sample size, or does it reflect shortcomings in the value-at-risk measure? For another perspective, we calculate the sample correlation between the nj and the corresponding quantiles as .997. Consulting Exhibit 14.6, we do not reject the value-at-risk measure at either the .05 or the .01 significance level.

14.6.3 Example: Applying Independence Tests

Starting with Christoffersen’s test for independence of the exceedances tI, we use the data of Exhibit 14.8 to calculate α00 = 105, α01 = α10 = 9 and α11 = 1. From these, [14.16], [14.17] and [14.18] give the estimates 0.9211, 0.9000 and 0.9194 for q*0, q*1 and q*, respectively. Our likelihood ratio is

[14.23]

so –2log(Λ) = 0.0517. This does not exceed 3.841, so we do not reject the value-at-risk measure.
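The computation can be reproduced as a sketch based on [14.16] through [14.20] as described in Section 14.5.1:

```python
import math

def christoffersen_lr(a00, a01, a10, a11):
    """-2 log(likelihood ratio) for Christoffersen's exceedance independence test."""
    q0 = a00 / (a00 + a01)   # estimated coverage after a period with no exceedance
    q1 = a10 / (a10 + a11)   # estimated coverage after a period with an exceedance
    q = (a00 + a10) / (a00 + a01 + a10 + a11)   # pooled coverage estimate
    log_null = (a00 + a10) * math.log(q) + (a01 + a11) * math.log(1.0 - q)
    log_alt = (a00 * math.log(q0) + a01 * math.log(1.0 - q0)
               + a10 * math.log(q1) + a11 * math.log(1.0 - q1))
    return -2.0 * (log_null - log_alt)

stat = christoffersen_lr(105, 9, 9, 1)   # the example values; ≈ 0.0517
```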

Next, applying our recommended standard independence test, we use [14.21] to calculate values tn from the loss quantiles tu. Results are indicated in Exhibit 14.11. 

Exhibit 14.11: Example values tn for our basic independence test. They are identical to those in the first column of Exhibit 14.9, only they are ordered by time while those in Exhibit 14.9 are ordered by magnitude.

We calculate the sample autocorrelations of the tn for lags 1 through 5 as indicated in Exhibit 14.12. 

Exhibit 14.12: Example sample autocorrelations for use with our basic independence test.

Our test statistic—the largest absolute value of the autocorrelations—is 0.132. This is less than the non-rejection value 0.274 obtained from Exhibit 14.7, so we do not reject the value-at-risk measure at the .05 significance level.
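The test statistic can be sketched as follows (the autocorrelation estimator shown is a common one; the exact estimator behind the chapter’s exhibits is an assumption here):

```python
def sample_autocorr(x, k):
    """Sample autocorrelation of series x at lag k."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
    return cov / var

def independence_statistic(x, max_lag=5):
    """Test statistic: largest absolute sample autocorrelation over lags 1..max_lag."""
    return max(abs(sample_autocorr(x, k)) for k in range(1, max_lag + 1))
```

The resulting statistic is then compared with the non-rejection value for the given sample size.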

Exercises
Exhibit 14.13 below presents 125 days of performance data for a one-day 99% USD value-at-risk measure for use in Exercises 14.8 through 14.11. For convenience, you should be able to cut and paste it from this webpage into a spreadsheet.
time 99% VaR P&L time 99% VaR P&L time 99% VaR P&L
t at t – 1 at t t at t – 1 at t t at t – 1 at t
-124 3.468 -2.107 -82 4.693 0.252 -40 1.401 0.683
-123 3.095 -0.143 -81 3.789 -0.074 -39 1.282 0.241
-122 3.245 0.894 -80 4.897 -0.153 -38 1.524 0.118
-121 2.969 0.990 -79 4.256 0.267 -37 1.834 -0.810
-120 3.472 -0.060 -78 4.537 1.804 -36 1.534 -0.455
-119 4.513 -1.123 -77 4.508 -0.196 -35 1.839 -0.612
-118 3.418 1.090 -76 5.010 0.887 -34 1.585 -0.108
-117 3.641 -0.948 -75 4.308 0.385 -33 1.178 0.197
-116 3.226 0.230 -74 5.361 0.030 -32 0.801 0.136
-115 3.282 0.887 -73 3.940 -0.356 -31 1.021 0.078
-114 3.047 0.352 -72 2.890 -0.279 -30 0.848 -0.041
-113 2.765 -1.060 -71 3.625 -1.376 -29 0.937 0.517
-112 2.437 0.113 -70 3.332 1.031 -28 1.194 0.053
-111 3.093 0.475 -69 3.655 -0.721 -27 1.283 -0.709
-110 2.407 -1.587 -68 3.857 -0.465 -26 1.362 0.189
-109 2.687 -0.537 -67 3.646 1.189 -25 1.455 0.681
-108 2.326 -0.854 -66 3.611 1.787 -24 1.280 0.079
-107 2.722 0.021 -65 5.304 -1.618 -23 1.619 -0.809
-106 2.699 -0.762 -64 4.849 -1.711 -22 1.901 -0.018
-105 2.887 0.619 -63 5.160 2.407 -21 1.920 -0.041
-104 2.168 -0.414 -62 4.643 1.974 -20 2.114 -0.714
-103 1.989 -1.242 -61 4.784 2.092 -19 2.042 0.052
-102 1.987 -0.375 -60 3.804 -0.861 -18 1.852 -2.103
-101 1.714 -0.198 -59 4.492 2.870 -17 1.662 1.062
-100 2.315 -0.231 -58 4.701 -2.246 -16 2.310 -1.014
-99 2.788 0.528 -57 4.721 1.669 -15 2.078 -0.988
-98 2.855 -1.024 -56 4.446 1.352 -14 2.460 2.662
-97 3.726 0.796 -55 3.793 -1.976 -13 2.594 -1.405
-96 2.734 -0.057 -54 3.833 0.022 -12 1.609 2.165
-95 3.482 -3.851 -53 3.707 0.340 -11 1.970 -0.034
-94 3.342 0.914 -52 3.805 -5.143 -10 1.776 1.260
-93 2.486 -3.966 -51 3.507 0.202 -9 2.341 2.799
-92 3.455 -1.853 -50 3.158 -0.411 -8 2.335 1.797
-91 3.602 3.909 -49 2.688 0.606 -7 2.868 2.224
-90 4.021 -3.818 -48 2.308 0.169 -6 2.866 2.663
-89 3.927 -3.043 -47 2.404 1.254 -5 2.843 -2.600
-88 3.929 0.624 -46 2.079 0.010 -4 2.380 0.403
-87 4.805 -2.384 -45 2.000 0.030 -3 2.195 -1.043
-86 3.857 -1.463 -44 1.446 0.399 -2 2.107 -2.325
-85 3.701 -0.355 -43 1.533 0.034 -1 1.789 -0.238
-84 3.481 -5.738 -42 1.412 -0.498 0 2.107 -1.145
-83 4.617 -3.076 -41 1.229 -0.092
Exhibit 14.13: Daily performance data (in USD millions) for a one-day 99% USD value-at-risk measure for use in Exercises 14.8 through 14.11.
14.8

In this exercise you will perform several coverage backtests.

  1. Use the data of Exhibit 14.13 to calculate exceedance data ti. Save your results, as you will need them again in Exercise 14.11.
  2. Use the data of Exhibit 14.13 to construct a graphical backtest similar to Exhibit 14.1.
  3. Apply our recommended standard coverage test at the .05 significance level using your results from part 1.
  4. Apply Kupiec’s PF coverage test at the .05 significance level using your results from part 1.
  5. Apply the Basel Committee’s traffic light coverage test using your results from part 1.

Solution

14.9

In this exercise, you will perform the graphical and recommended standard distribution tests of Section 14.4 using the data of Exhibit 14.13.

  1. Our value-at-risk measure is a linear value-at-risk measure that assumes tL is conditionally normal with t–1E(tL) = 0. Use this information and the data of Exhibit 14.13 to calculate loss quantile data tu.
  2. Apply the inverse standard normal CDF to your loss quantile data tu to obtain values tn. Save your results, as you will need them again in Exercise 14.11.
  3. Order your values tn by magnitude. Denote the ordered values nj.
  4. Calculate the corresponding standard normal quantiles as described in Section 14.4.2.
  5. Plot the points in a Cartesian plane. Interpret the result.
  6. Calculate the sample correlation between the nj and the corresponding quantiles. Does the value-at-risk measure pass or fail our recommended standard distribution test at the .05 significance level?

Solution

14.10

In this exercise, you will perform Christoffersen’s exceedance independence test using the data of Exhibit 14.13.

  1. Retrieve the exceedance ti data you calculated for Exercise 14.8, and use it to calculate values α00, α01, α10 and α11.
  2. Use your results from part 1 and formulas [14.16], [14.17] and [14.18] to calculate estimates of q*0, q*1 and q*.
  3. Use your results from parts 1 and 2 and [14.20] to calculate the log likelihood ratio –2log(Λ). What conclusion do you draw?

Solution

14.11

In this exercise, you will perform our recommended standard loss quantile independence test using the data of Exhibit 14.13.

  1. Retrieve the tn data you calculated for Exercise 14.9, and calculate its sample autocorrelations for lags 1 through 5.
  2. Take the absolute value of each sample autocorrelation, and then take the maximum of the five results. Based on this, does the value-at-risk measure pass or fail the recommended standard independence test at the .05 significance level?

Solution

 

14.5  Backtesting With Independence Tests

Independence tests are backtests that assess some form of independence in a value-at-risk measure’s performance from one period to the next. Independence of exceedances tI and independence of loss quantiles tU are separate forms of independence that might be tested for. We have already seen that coverage tests assume the former and most distribution tests assume the latter. If a value-at-risk measure fails an independence test, that can cast doubt on coverage or distribution backtest results obtained for that value-at-risk measure.

There is no way to directly test for independence, so null hypotheses address specific properties of independence—say exceedances not clustering or loss quantiles not being autocorrelated. Accordingly, backtests for independence can be judged, among other things, based on how broad their null hypotheses are.

14.5.1 Christoffersen’s 1998 Exceedance Independence Test

Christoffersen’s (1998) independence test is a likelihood ratio test that looks for unusually frequent consecutive exceedances—i.e. instances when both t–1i = 1 and ti = 1 for some t. The test is well known, having first been proposed in an often-cited endorsement of testing for independence of exceedances.

Extending our earlier notation q* for the coverage of a value-at-risk measure, we define

[14.12]

[14.13]

These are the value-at-risk measure’s conditional coverages—its actual probabilities of not experiencing an exceedance given that it did not (in the case of q*0) or did (in the case of q*1) experience an exceedance in the previous period. Our null hypothesis ℋ0 is that q*0 = q*1 = q*.

If a value-at-risk measure is observed for α + 1 periods, there will be α pairs of consecutive observations (t–1i, ti). Disaggregate these as

[14.14]

where α00 is the number of pairs (t–1i, ti) of the form (0, 0); α01 is the number of the form (0, 1); etc. We want to test if

[14.15]

which would support our null hypothesis. We apply a likelihood ratio test as follows. Assuming ℋ0 doesn’t hold, we estimate q*0 and q*1 with

[14.16]

[14.17]

Assuming ℋ0 does hold, we estimate q* with

[14.18]

Our likelihood ratio is

[14.19]

[14.20]

and –2log(Λ) is approximately centrally chi-squared with one degree of freedom—that is –2log(Λ) ~ χ2(1,0)—assuming ℋ0. The 0.95 quantile of the χ2(1,0) distribution is 3.841, so we reject ℋ0 at the .05 significance level if –2log(Λ) ≥ 3.841. Similarly, we reject it at the .01 significance level if –2log(Λ) ≥ 6.635.

The test largely depends on the frequency with which consecutive exceedances are experienced. As these are inherently rare events, the test has limited power. Also, the test isn’t defined when there are no consecutive exceedances at all, which is common. Christoffersen doesn’t address this situation. In some cases it may be reasonable to simply accept the null hypothesis when there are no consecutive exceedances, but not always. For example, if you backtest a one-day 90% value-at-risk measure with 1,000 days of data, there should be about 10 instances of consecutive exceedances. If there are none, it might be inappropriate to accept the null hypothesis.
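The arithmetic behind that last figure is quickly checked: under independence, each of the 999 consecutive pairs is an exceedance pair with probability 0.1 × 0.1, so about 10 are expected. A minimal sketch:

```python
def expected_consecutive_exceedances(q, days):
    """Expected number of consecutive-exceedance pairs under exceedance independence."""
    p = 1.0 - q                      # per-day exceedance probability
    return (days - 1) * p * p        # (days - 1) pairs, each (1, 1) with probability p^2

approx = expected_consecutive_exceedances(0.90, 1000)   # about 10
```

The same arithmetic shows why consecutive exceedances are so rare for high-coverage measures: a 99% measure observed for 250 days expects well under one such pair.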

14.5.2 A Recommended Standard Loss-Quantile Independence Test

For a recommended standard test, we assess the independence of the values tN obtained by applying the inverse standard normal CDF to the loss quantiles tU:

[14.21]

Note that this is the same transformation we made with [14.10]. As before, given loss quantile data –αu, –α+1u, … , 0u, we apply [14.21] to obtain values –αn, –α+1n, … , 0n.

We adopt the null hypothesis that the autocorrelations

[14.22]

are all 0 for lags k = 1, 2, 3, 4 and 5. We test this hypothesis by calculating the sample autocorrelations of our data –αn, –α+1n, … , 0n for those same five lags. We take the maximum of the absolute values of the five sample autocorrelations. That is our test statistic. We reject the null hypothesis at the .05 significance level if the test statistic exceeds the non-rejection value indicated for sample size α + 1 in Exhibit 14.7.

Exhibit 14.7: Non-rejection values for the recommended standard independence test at the .05 and .01 significance levels. If the test statistic exceeds the non-rejection value, the null hypothesis is rejected at the indicated significance level.

Non-rejection values were calculated for each sample size α + 1 with a Monte Carlo analysis that found the 0.95 (for the .05 significance level) or 0.99 (for the .01 significance level) quantile for the test statistic assuming the null hypothesis.
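Such a Monte Carlo analysis can be sketched as follows (the trial count, seed, and autocorrelation estimator are illustrative assumptions; the exhibit’s values were produced with the chapter’s own, unspecified settings):

```python
import random

def max_abs_autocorr(x, max_lag=5):
    """Largest absolute sample autocorrelation of x over lags 1..max_lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    def r(k):
        return sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / var
    return max(abs(r(k)) for k in range(1, max_lag + 1))

def non_rejection_value(sample_size, significance=0.05, trials=2000, seed=0):
    """Monte Carlo estimate of the non-rejection value: the (1 - significance)
    quantile of the test statistic under the null of IID N(0,1) data."""
    rng = random.Random(seed)
    stats = sorted(
        max_abs_autocorr([rng.gauss(0.0, 1.0) for _ in range(sample_size)])
        for _ in range(trials)
    )
    return stats[int((1.0 - significance) * trials)]
```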

Exercises
14.5

In Christoffersen’s 1998 independence test, α01 routinely equals α10. Why is this, and what would cause them to differ?
Solution

14.6

A value-at-risk measure is to be backtested using Christoffersen’s 1998 independence test. Based on 250 days of exceedance data, α00 = 237, α01 = α10 = 5, and α11 = 2. Do we reject the value-at-risk measure at the .10 significance level?
Solution

14.7

A value-at-risk measure is to be backtested using our recommended standard independence test and 500 days of data. Values tn are calculated, and their sample autocorrelations are determined to be 0.034, –0.078, –0.124, 0.107 and 0.029 for lags 1 through 5, respectively. Do we reject the value-at-risk measure at the .05 significance level?
Solution

 

14.4  Backtesting Value-at-Risk With Distribution Tests

As part of the process of calculating a portfolio’s value-at-risk, value-at-risk measures—explicitly or implicitly—characterize a distribution for 1P or 1L. That characterization takes various forms. A linear value-at-risk measure might specify the distribution for 1P with a mean, standard deviation and an assumption that the distribution is normal. A Monte Carlo value-at-risk measure simulates a large number of values for 1P. Any histogram of those values can be treated as a discrete approximation to the distribution of 1P.

Distribution tests are goodness-of-fit tests that go beyond the specific quantile-of-loss a value-at-risk measure purports to calculate and more fully assess the quality of the 1P or 1L distributions the value-at-risk measure characterizes.

For example, a crude distribution test can be implemented by performing multiple coverage tests for different quantiles of 1L. Suppose a one-day 95% value-at-risk measure is to be backtested. Our basic coverage test is applied to assess how well the value-at-risk measure estimates the 0.95 quantile of 1L, but we don’t stop there. We apply the same coverage test to also assess how well the value-at-risk measure estimates the 0.99, 0.975, 0.90, 0.80, 0.70, 0.50 and 0.25 quantiles of 1L. Collectively, these analyses provide a rudimentary goodness-of-fit test for how well the value-at-risk measure characterized the overall distribution of 1L.
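This crude multi-quantile procedure might be sketched as follows; the two-sided binomial tail rule is our illustrative assumption (the chapter’s own coverage test uses tabulated non-rejection intervals instead):

```python
import math

def binom_cdf(k, n, p):
    """Cumulative binomial probability P(X <= k)."""
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k + 1))

def coverage_ok(q, quantile_estimates, losses, significance=0.05):
    """Check that the count of losses above the estimated q-quantile is
    plausible under Binomial(n, 1 - q)."""
    n = len(losses)
    x = sum(1 for l, est in zip(losses, quantile_estimates) if l > est)
    p = 1.0 - q
    lower_tail = binom_cdf(x, n, p)
    upper_tail = 1.0 - binom_cdf(x - 1, n, p) if x > 0 else 1.0
    return min(lower_tail, upper_tail) >= significance / 2.0

def crude_distribution_test(quantiles, estimates_by_q, losses):
    """Apply the coverage check at each quantile of interest."""
    return {q: coverage_ok(q, estimates_by_q[q], losses) for q in quantiles}
```

Running the check across several quantiles at once gives the rudimentary goodness-of-fit assessment described above.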

Various distribution tests have been proposed in the literature. Most employ the framework we describe below.

14.4.1 Framework for Distribution Tests

While coverage tests assess a value-at-risk measure’s exceedance data –αi, –α+1i, … , 0i, which is a series of 0’s and 1’s, most distribution tests consider loss data –αl, –α+1l, … , 0l. Although it is convenient to assume exceedance random variables tI are IID, that assumption is unreasonable for losses tL.

A value-at-risk measure characterizes a CDF for each tL. Treating probabilities as objective for pedagogical purposes, the characterized CDF is a forecast we use to model the “true” CDF for each tL. Our null hypothesis ℋ0 is then that the forecast CDF equals the “true” CDF for all t.

Testing this hypothesis poses a problem: We are not dealing with a single forecast distribution modeling some single “true” distribution. The distribution changes from one day to the next, so each data point l_t is drawn from a different probability distribution. This renders statistical analysis futile. We circumvent this problem by introducing a random variable U_t for the quantile at which L_t occurs:

[14.9]  U_t = Φ̃_t(L_t), where Φ̃_t is the forecast CDF for L_t

Assuming our null hypothesis ℋ0, the U_t are all uniformly distributed, U_t ~ U(0,1). We assume the U_t are independent. Applying [14.9], we transform our loss data l_–α, l_–α+1, … , l_0 into loss quantile data u_–α, u_–α+1, … , u_0, which we treat as a realization u[0], … , u[α – 1], u[α] of a sample. This we can test for consistency with a U(0,1) distribution. Crnkovic and Drachman’s (1996) distribution test applied Kuiper’s statistic4 for this purpose.
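A minimal sketch of the transform and of Kuiper’s statistic, assuming each day’s forecast CDF is available as a callable. The standard normal CDFs and toy losses below are purely illustrative:

```python
# Transform losses into loss quantiles u_t = CDF_t(l_t), then compute
# Kuiper's statistic V = D+ + D- against the U(0,1) CDF.
from statistics import NormalDist

def kuiper_statistic(u):
    """Kuiper's V for a sample u, tested against U(0,1)."""
    u = sorted(u)
    n = len(u)
    d_plus = max((j + 1) / n - u[j] for j in range(n))
    d_minus = max(u[j] - j / n for j in range(n))
    return d_plus + d_minus

# Illustrative forecast CDFs: here every day's forecast is standard normal.
cdfs = [NormalDist().cdf] * 5
losses = [-1.0, 0.3, 0.0, 1.5, -0.2]
u = [cdf(l) for cdf, l in zip(cdfs, losses)]
v = kuiper_statistic(u)
```

Under the null hypothesis the u should look U(0,1); large values of V signal a poor fit.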

Some distribution tests—see Berkowitz (2001)—further transform the data u_–α, u_–α+1, … , u_0 by applying the inverse standard normal CDF Φ–1:

[14.10]  N_t = Φ–1(U_t)

Assuming our null hypothesis ℋ0, the N_t are identically standard normal, N_t ~ N(0,1), so the transformed data n_–α, n_–α+1, … , n_0 can be tested for consistency with a standard normal distribution.

Below, we introduce a simple graphical test of normality. This will motivate a recommended standard test based on Filliben’s (1975) correlation test for normality, one of the most powerful tests of normality available.

14.4.2 Graphical Distribution Test

Construct the n_t as described above, and arrange them in ascending order. We adjust our notation, denoting n1 the lowest and nα+1 the highest, so n1 ≤ n2 ≤ … ≤ nα+1. Next, define

[14.11]  mj = Φ–1( (j – 0.5)/(α + 1) )

for j = 1, 2, … , α + 1, where Φ is the standard normal CDF. The mj are quantiles of the standard normal distribution, with a fixed 1/(α + 1) probability between consecutive quantiles. If our null hypothesis holds, and the nj are drawn from a standard normal distribution, each nj should fall near the corresponding mj. We can test this by plotting all points (nj, mj) in a Cartesian plane. If the points tend to fall near a line with slope one passing through the origin, this provides visual evidence for our null hypothesis.
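The plotting pairs can be computed as follows. The quantile formula used here, Φ–1((j – 0.5)/(α + 1)), is one common plotting-position choice with a fixed 1/(α + 1) probability between consecutive quantiles; treat it as a sketch, as the exact formula in the original exhibit may differ:

```python
# Build (nj, mj) pairs for the graphical normality test. The mj are
# standard normal quantiles with 1/(alpha + 1) probability between
# consecutive values.
from statistics import NormalDist

def qq_pairs(n_data):
    n_sorted = sorted(n_data)
    k = len(n_sorted)                  # k = alpha + 1
    m = [NormalDist().inv_cdf((j - 0.5) / k) for j in range(1, k + 1)]
    return list(zip(n_sorted, m))

pairs = qq_pairs([0.1, -1.3, 0.4, 2.0, -0.6])
```

Plotting these pairs yields the familiar normal Q-Q plot; points near the 45-degree line through the origin support the null hypothesis.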

14.4.3 A Recommended Standard Distribution Test

We now introduce a recommended standard distribution test based on Filliben’s correlation test for normality. Construct pairs (nj, mj) as described above, and take the sample correlation of the nj and mj. Sample correlation values close to one tend to support the null hypothesis.

Using the Monte Carlo method, we can determine non-rejection values for the sample correlation at various levels of significance. If the sample correlation falls below a non-rejection value, we reject the null hypothesis at the indicated level of significance. Non-rejection values for the .05 and .01 significance levels are indicated in Exhibit 14.6.

Exhibit 14.6: Non-rejection values for the sample correlation calculated for our recommended standard distribution test. If the sample correlation is less than the non-rejection value, the value-at-risk measure is rejected at the indicated significance level.

Suppose we are backtesting a one-day 99% value-at-risk measure based on α + 1 = 250 days of data. We calculate the nj and mj and find their sample correlation to be 0.993. Based on the values in Exhibit 14.6, we reject the value-at-risk measure at the .05 significance level but do not reject it at the .01 significance level.

Exercises
14.3

Why is it unreasonable to assume losses L_–α, L_–α+1, … , L_0 are IID?
Solution

14.4

In applying our recommended standard distribution test with 750 days of data, the sample correlation of the nj and mj is found to be 0.995. Do we reject the value-at-risk measure at the .05 significance level?
Solution

 


14.3  Backtesting With Coverage Tests

Even before J.P. Morgan’s RiskMetrics Technical Document described a graphical backtest, the concept of backtesting was familiar, at least within institutions then using value-at-risk. Two years earlier, the Group of 30 (1993) had recommended, and one month earlier the Basel Committee (1995) had also recommended, that institutions apply some form of backtesting to their value-at-risk results. Neither specified a methodology. In September 1995, Crnkovic and Drachman circulated to clients of J.P. Morgan a draft paper describing a distribution test and an independence test, which they published the next year. The first published statistical backtests were coverage tests of Paul Kupiec (1995). In 1996, the Basel Committee published its “traffic light” backtest.

14.3.1 A Recommended Standard Coverage Test

Consider a q quantile-of-loss value-at-risk measure and define a univariate exceedance process I with terms

[14.1]  I_t = 1 if the period-t loss L_t exceeds the period-t value-at-risk; I_t = 0 otherwise

To conduct a coverage test, we gather historical exceedance data i_–α, i_–α+1, … , i_0. We assume the I_t are IID, which allows us to treat our data as a realization i[0], … , i[α – 1], i[α] of a sample I[0], … , I[α – 1], I[α].

We define the coverage q* of the value-at-risk measure by the actual frequency with which it experiences exceedances (i.e. instances of i_t = 1). This can be expressed as an unconditional expectation:

[14.2]  E(I_t) = Pr(I_t = 1) = 1 – q*

Coverage tests are hypothesis tests with the null hypothesis ℋ0 that q = q*. Let x denote the number of exceedances observed in the data:

[14.3]  x = i_–α + i_–α+1 + … + i_0

We treat x as a realization of a binomial random variable X. Our null hypothesis is then simply that X ~ B(α + 1, 1 – q). To test ℋ0 at some significance level ε, we must determine values x1 and x2 such that

[14.4]  Pr(x1 ≤ X ≤ x2) ≥ 1 – ε

Multiple intervals [x1, x2] will satisfy this criterion, so we seek a solution that is generally symmetric in the sense that Pr(X < x1) ≈ Pr(x2 < X) ≈ ε/2.

Formally, define a as the maximum integer such that Pr(X < a) ≤ ε/2 and b as the minimum integer such that Pr(b < X) ≤ ε/2. Consider all intervals of the form [a + n, b] or [a, bn] where n is a non-negative integer. Set [x1, x2] equal to whichever of these maximizes Pr(X ∉ [x1, x2]) subject to the constraint that Pr(X ∉ [x1, x2]) ≤ ε.2 Our backtest procedure is then to observe the value-at-risk measure’s performance for α + 1 periods and record the number of exceedances X. If X ∉ [x1, x2], we reject the value-at-risk measure at the ε significance level.
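The construction described above can be implemented directly with exact binomial probabilities. This is a sketch of the procedure as stated, not production code:

```python
# Non-rejection interval [x1, x2] for the recommended standard
# coverage test, following the construction in the text.
from math import comb

def binom_cdf(k, n, p):
    """Pr(X <= k) for X ~ B(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def coverage_interval(n, q, eps):
    p = 1 - q                                  # exceedance probability under H0
    a = 0                                      # max integer with Pr(X < a) <= eps/2
    while binom_cdf(a, n, p) <= eps / 2:       # Pr(X < a+1) = Pr(X <= a)
        a += 1
    b = n                                      # min integer with Pr(b < X) <= eps/2
    while b > 0 and 1 - binom_cdf(b - 1, n, p) <= eps / 2:
        b -= 1

    def out_prob(x1, x2):
        below = binom_cdf(x1 - 1, n, p) if x1 > 0 else 0.0
        return below + 1 - binom_cdf(x2, n, p)

    # Among intervals [a+k, b] and [a, b-k], pick the one maximizing
    # Pr(X outside) subject to Pr(X outside) <= eps.
    candidates = [(a + k, b) for k in range(b - a + 1)]
    candidates += [(a, b - k) for k in range(b - a + 1)]
    feasible = [c for c in candidates if out_prob(*c) <= eps]
    return max(feasible, key=lambda c: out_prob(*c))

interval = coverage_interval(250, 0.99, 0.05)
```

For example, for a one-day 99% measure backtested over α + 1 = 250 days at ε = .05, this construction yields the non-rejection interval [0, 5]: the measure is rejected only if more than five exceedances occur.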

Suppose we implement a one-day 95% value-at-risk measure and plan to backtest it at the .05 significance level after 500 trading days (about two years). Then q = 0.95 and α + 1 = 500. Assuming ℋ0, we know X ~ B(500, .05). We use this distribution to determine x1 = 16 and x2 = 35. Calculations are summarized in Exhibit 14.2. We will reject the value-at-risk measure if X ∉ [16, 35].

Exhibit 14.2: Calculations to determine the non-rejection interval for our recommended standard coverage test when ε = .05, α + 1 = 500 and q = 0.95.

Exhibit 14.3 indicates similar .05 significance level non-rejection intervals [x1, x2] for other values of q and α + 1.

Exhibit 14.3: Recommended standard coverage test non-rejection intervals [x1, x2] for various values of q and α+1. The value-at-risk measure is rejected at the .05 significance level if the number of exceedances X is less than x1 or greater than x2.
14.3.2 Kupiec’s PF Coverage Test

Kupiec’s “proportion of failures” (PF) coverage test takes a circuitous—and approximate—route to an answer, offering no particular advantage over our recommended standard coverage test. Comparing the two tests can be informative, illustrating the various respects in which test designs may differ. As the first published backtesting methodology, the PF test has been widely cited.

As with the recommended standard test, a value-at-risk measure is observed for α + 1 periods, experiencing X exceedances. We adopt the same null hypothesis ℋ0 that q = q*. Rather than directly calculate probabilities from the B(α + 1, 1 – q) distribution of X under ℋ0, the PF test uses that distribution to construct a likelihood ratio:

[14.5]  Λ = q^(α+1–x) (1–q)^x / [ (1 – x/(α+1))^(α+1–x) (x/(α+1))^x ]

It is difficult to infer probabilities directly from this. As described in Section 4.5.4, a standard technique is to consider –2 log(Λ):

[14.6]  –2log(Λ) = –2log( q^(α+1–x) (1–q)^x ) + 2log( (1 – x/(α+1))^(α+1–x) (x/(α+1))^x )

[14.7]           = 2[ (α+1–x) log( (1 – x/(α+1)) / q ) + x log( (x/(α+1)) / (1–q) ) ]

which is—see Lehmann and Romano (2005)—approximately centrally chi-squared with one degree of freedom. That is –2log(Λ) ~ χ2(1,0), assuming ℋ0. Kupiec found this approximation to be reasonable based on a Monte Carlo analysis, but Lopez (1999) claims to have found “meaningful” discrepancies using his own Monte Carlo analysis.

For a given significance level ε, we construct a non-rejection interval [x1, x2] such that

[14.8]  Pr(x1 ≤ X ≤ x2) ≈ 1 – ε

under ℋ0. To do so, calculate the 1 – ε quantile of the χ2(1,0) distribution. Setting this equal to [14.7], solve for X. There will be two solutions. Rounding the lower one down and the higher one up yields x1 and x2.3

Consider the example we looked at with the recommended standard coverage test. We implement a one-day 95% value-at-risk measure and plan to backtest it at the .05 significance level after 500 trading days, so q = 0.95 and α + 1 = 500. We calculate the 1 – ε = .95 quantile of the χ2(1,0) distribution as 3.841. Setting this equal to [14.7], we solve for X. There are two solutions: 16.05 and 35.11. Rounding down and up, respectively, we set x1 = 16 and x2 = 36. We will reject the value-at-risk measure if X ∉ [16, 36].
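The same answer can be computed numerically. In this sketch the χ2(1,0) critical value is obtained as the square of a standard normal quantile, and the two roots of –2log(Λ) = critical value are found by bisection:

```python
# Kupiec PF coverage test: solve -2 log(Lambda) = chi-squared critical
# value for x, rounding the lower root down and the upper root up.
from math import ceil, floor, log
from statistics import NormalDist

def neg2_log_lambda(x, n, q):
    """-2 log likelihood ratio, treated as a function of real x in (0, n)."""
    xhat = x / n
    return 2 * ((n - x) * log((1 - xhat) / q) + x * log(xhat / (1 - q)))

def pf_interval(n, q, eps):
    # 1 - eps quantile of chi-squared(1,0), via the normal quantile
    crit = NormalDist().inv_cdf(1 - eps / 2) ** 2
    mean = n * (1 - q)                 # where the statistic equals zero

    def bisect(lo, hi):
        # sign-change bisection for neg2_log_lambda(x) - crit
        above_at_lo = neg2_log_lambda(lo, n, q) > crit
        for _ in range(200):
            mid = (lo + hi) / 2
            if (neg2_log_lambda(mid, n, q) > crit) == above_at_lo:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    return floor(bisect(1e-9, mean)), ceil(bisect(mean, n - 1e-9))

x1, x2 = pf_interval(500, 0.95, 0.05)
```

For q = 0.95, α + 1 = 500 and ε = .05 this reproduces x1 = 16 and x2 = 36 from the worked example.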

Exhibit 14.4 indicates similar .05 significance level non-rejection intervals [x1, x2] for other values of q and α + 1.

Exhibit 14.4: PF coverage test non-rejection intervals [x1, x2] for various values of q and α+1. The value-at-risk measure is rejected at the .05 significance level if the number of exceedances X is less than x1 or greater than x2.
14.3.3 The Basel Committee’s Traffic Light Coverage Test

The 1996 Amendment to the Basel Accord imposed a capital charge on banks for market risk. It allowed banks to use their own proprietary value-at-risk measures to calculate the amount. Use of a proprietary measure required approval of regulators. A bank would have to have an independent risk management function and satisfy regulators that it was following acceptable risk management practices. Regulators would also need to be satisfied that the proprietary value-at-risk measure was sound.

Proprietary measures had to support a 10-day 99% value-at-risk metric, but as a practical matter, banks were allowed to calculate one-day 99% value-at-risk and scale the result by the square root of 10.

The Basel Committee (1996b) specified a methodology for backtesting proprietary value-at-risk measures. Banks were to backtest their one-day 99% value-at-risk results (i.e. value-at-risk before scaling by the square root of 10) against daily P&L’s. It was left to national regulators whether backtesting was based on clean or dirty P&L’s. Backtests were to be performed quarterly using the most recent 250 days of data. Based on the number of exceedances experienced during that period, the value-at-risk measure would be categorized as falling into one of three colored zones:

Exhibit 14.5: Basel Committee defined green, yellow and red zones for backtesting proprietary one-day 99% value-at-risk measures, assuming α + 1 = 250 daily observations. For banks whose value-at-risk measures fell in the yellow zone, the Basel Committee recommended that, at national regulators’ discretion, the multiplier k used to calculate market risk capital charges be increased above the base level 3, as indicated in the table. The committee required that the multiplier be increased to 4 if a value-at-risk measure fell in the red zone. Cumulative probabilities indicate the probability of achieving the indicated number of exceedances or less. They were calculated with a binomial distribution, assuming the null hypothesis q* = 0.99.

Value-at-risk measures falling in the green zone raised no particular concerns. Those falling in the yellow zone required monitoring. The Basel Committee recommended that, at national regulators’ discretion, value-at-risk results from yellow-zone value-at-risk measures be weighted more heavily in calculating banks’ capital charges for market risk—the recommended multipliers are indicated in Exhibit 14.5. Value-at-risk measures falling in the red zone had to be weighted more heavily and were presumed flawed—national regulators would investigate what caused so many exceedances and require that the value-at-risk measure be improved.

The Basel Committee’s procedure is not based on any statistical theory for hypothesis testing. The three zones were justified as reasonable in light of the probabilities indicated in Exhibit 14.5 (and probabilities assuming q* = 0.98, q* = 0.97, etc., which the committee also considered). Due to its ad hoc nature, the backtesting methodology is not theoretically interesting. It is important because of its wide use by banks.
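The cumulative probabilities underlying the zones are straightforward to reproduce from the binomial distribution. The zone cutoffs below—green through 4 exceedances, yellow from 5 through 9, red at 10 or more—are the Basel Committee’s published boundaries for 250 observations:

```python
# Basel traffic light for a one-day 99% value-at-risk measure
# backtested over 250 days: zone cutoffs and cumulative binomial
# probabilities under the null hypothesis q* = 0.99.
from math import comb

def cum_prob(k, n=250, p=0.01):
    """Pr(X <= k exceedances) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def zone(exceedances):
    if exceedances <= 4:
        return "green"
    if exceedances <= 9:
        return "yellow"
    return "red"
```

For example, cum_prob(4) is approximately 0.892, the 89.22% cumulative probability at the green-zone boundary.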

Exercises
14.1

Suppose we implement a one-day 90% value-at-risk measure and plan to backtest it with our recommended standard coverage test at the .05 significance level after 375 trading days (about eighteen months). Then q = 0.90 and α + 1 = 375. Calculate the non-rejection interval.
Solution

14.2

Suppose we want to apply Kupiec’s PF backtest to the same one-day 90% value-at-risk measure as in the previous exercise. Again, the significance level is .05, q = 0.90 and α + 1 = 375. Calculate the non-rejection interval. Compare your result with that of the previous exercise.
Solution

 


14.2  Backtesting

J.P. Morgan’s RiskMetrics Technical Document was released in four editions between 1994 and 1996. The first had limited circulation, being distributed at the firm’s 1994 annual research conference, which was in Budapest.1 It was the second edition, released in November of that year, that accompanied the public rollout of RiskMetrics. Six months later, a dramatically expanded third edition was released, reflecting extensive comments J.P. Morgan received on their methodology. While the second edition described a simple linear value-at-risk measure similar to J.P. Morgan’s internal system, the third edition reflected a diversity of practices employed at other firms. That edition described linear, Monte Carlo and historical transformation procedures. It also, perhaps for the first time in print, illustrated a crude method of backtesting.

Exhibit 14.1 is similar to a graph that appeared in that third edition. It depicts daily profits and losses (P&L’s) against (negative) value-at-risk for an actual trading portfolio. The chart not only summarizes the portfolio’s daily performance and the evolution of its market risk; it also provides a simple graphical analysis of how well the firm’s value-at-risk measure performed.

Exhibit 14.1: Chart of a portfolio’s daily P&L’s. The jagged line running across the bottom of the chart indicates the portfolio’s (negative) one-day 95% EUR value-at-risk. Any instance of a P&L falling below that line is called an exceedance. We would expect a 95% value-at-risk measure to experience approximately six exceedances in six months. In the chart, we count ten.

With a one-day 95% value-at-risk metric, we expect daily losses to exceed value-at-risk approximately 5% of the time—or six times in a six-month period. We define an exceedance as an instance of a portfolio’s single-period loss exceeding its value-at-risk for that single period. In Exhibit 14.1, we can count ten exceedances over the six months shown.

Is this result reasonable? If it is, what would we consider unreasonable? If we experienced two exceedances—or fourteen—would we question our value-at-risk measure? Would we continue to use it? Would we want to replace it or modify it somehow to improve performance?

Questions such as these have spawned a literature on techniques for statistically testing value-at-risk measures ex post. Research to date has focused on value-at-risk measures used by banks. Published backtesting methodologies mostly fall into three categories:

  • Coverage tests assess whether the frequency of exceedances is consistent with the quantile of loss a value-at-risk measure is intended to reflect.
  • Distribution tests are goodness-of-fit tests applied to the overall loss distributions forecast by complete value-at-risk measures.
  • Independence tests assess whether results appear to be independent from one period to the next.

Later in this chapter, we cover several backtesting procedures that are prominent in the literature. Because all have shortcomings, we also introduce three basic tests—a coverage test, a distribution test and an independence test—that we recommend as minimum standards for backtesting in practice.

The question arises as to which P&L’s to use in backtesting a value-at-risk measure. We distinguish between dirty P&L’s and clean P&L’s. Dirty P&L’s are the actual P&L’s reported for a portfolio by the accounting system. They can be impacted by trades that take place during the value-at-risk horizon—trades the value-at-risk measure cannot anticipate. Dirty P&L’s also reflect fee income earned during the value-at-risk horizon, which value-at-risk measures also don’t anticipate. Clean P&L’s are hypothetical P&L’s that would have been realized if no trading took place and no fee income were earned during the value-at-risk horizon.

The Basel Committee (1996) recommends that banks backtest their value-at-risk measures against both clean and dirty P&L’s. The former is essential for addressing Type A and Type B model risk. The latter can be used to assess Type C model risk.

Suppose a firm calculates its portfolio value-at-risk at the end of each trading day. In a backtest against clean P&L’s, the value-at-risk measure performs well. Against dirty P&L’s, it does not. This might indicate that the value-at-risk measure is sound but that end-of-day value-at-risk does not reasonably indicate the firm’s market risk. Perhaps the firm engages in an active day trading program, reducing exposures at end-of-day.

Financial institutions don’t calculate clean P&L’s in the regular course of business, so provisions must be made for calculating and storing them for later use in backtesting. Other data to maintain are

  • the value-at-risk measurements;
  • the quantiles of the loss distribution at which clean and dirty P&L’s occurred (the loss quantiles), as determined by the value-at-risk measure. This information is important if distribution tests are to be employed in backtesting;
  • inputs, including the portfolio composition, historical values for key factors and current values for key factors;
  • intermediate results calculated by the value-at-risk measure, such as a covariance matrix or a quadratic remapping; and
  • a history of modifications to the system.

The last three items will not be used in backtesting, but they could be useful if backtesting raises concerns that people want to investigate. The third and fourth items could be regenerated at the time of such an investigation, but doing so for a large number of trading days might be a significant undertaking. Data storage is so inexpensive that there is every reason to err on the side of storing too much information.

A history of modifications to the value-at-risk measure is important because the system’s performance is likely to change with any substantive modification. Essentially, you are dealing with a new model each time you make a modification. We are primarily interested in backtesting the measure since its last substantive modification.

 


Chapter 14

Backtesting

14.1  Motivation

The empiricist tradition in the philosophy of science tells us that a model should not be assessed based on the reasonableness of its assumptions or the sophistication of its analytics. It should be assessed based on the usefulness of its predictions. Backtesting is a process of assessing the usefulness of a value-at-risk measure’s predictions when applied to a particular portfolio over time. The value-at-risk measurements obtained for the portfolio from the value-at-risk measure are recorded, as are the realized P&L’s for the portfolio. Once sufficient data has been collected, statistical or other tests are applied to assess how well the value-at-risk measurements reflect the riskiness of the portfolio.

 


13.4  Further Reading – Model Risk

Most discussions of model risk in the financial literature focus on model risk in asset pricing models rather than model risk in risk models. Office of the Comptroller of the Currency (2000) is a standard resource on model validation.

A number of papers assess the performance of various value-at-risk measures. See, for example, Marshall and Siegel (1997) and Berkowitz and O’Brien (2002).

 


13.3  Managing Model Risk

Many front- and middle-office systems entail some degree of model risk, so when you implement a value-at-risk system, model risk should not be a new issue. Your firm should already have considerable infrastructure in place for addressing it. Below, we discuss a variety of strategies, primarily from the perspective of a practitioner.

13.3.1 Personnel

I was once involved in a credit risk model implementation. The credit department lacked people with the necessary quantitative skills, so design work was done by another department. For political reasons, the head of the credit department insisted on retaining control. She inserted herself into the model design process, going so far as to sketch out analytics with her people. These made no sense in the way 1 + 1 = 3 makes no sense. The design team sat through a presentation of her design, thanked her, and then quietly proceeded with their own design.

Perhaps it goes without saying, but an important step in implementing any quantitative model is making sure it is designed by competent individuals. Modeling is a specialized skill that bridges mathematics and real-world applications. Someone with strong math skills may not be a good modeler. There is an old joke about organized crime figures hiring a mathematician to find a way for them to make money at horse betting. After spending a year on this task, the mathematician calls the mobsters together and opens his presentation with “Consider a spherical race horse … ”

Non-quantitative professionals are not qualified to assess an individual’s modeling skills. When hiring modelers, involve individuals with proven modeling skills in the process. Have a policy that quantitative professionals must report to other quantitative professionals with proven modeling skills and the ability to assess work based on its technical merit.

13.3.2 Standard Assumptions and Modeling Procedures

While it is true that any model should be assessed based on the usefulness of its predictions and not on the reasonableness of its analytics, there is a flip side to this. Designing a model to conform to established practices—using proven assumptions and modeling techniques—will decrease model risk. Novice modelers sometimes employ techniques they invent or read about in theoretical or speculative literature. Such innovation is how theory progresses, but it is best left for academia or personal research conducted in one’s free time. When a bank or other trading institution implements value-at-risk, considerable resources are brought to bear, and time is limited. A failed implementation could severely set back the institution, hobbling its risk management—and hence its ability to take risk—for years.

13.3.3 Design Review

Large financial institutions employ teams of financial engineers or other quantitative staff whose full-time job is to review other people’s models. The focus tends to be on asset pricing models, but they can also review risk management models, and especially value-at-risk measures. Banks can expect their regulators to ask specifically how a value-at-risk measure’s design was independently reviewed and to see documentation of that review.

The review should be based on the design document (discussed in Section 12.6.1) that describes the value-at-risk measure. This needs to be a stand-alone document, operationally describing all inputs, outputs and calculations in sufficient detail that a value-at-risk measure can be implemented based on it alone. Do not attempt a review of a design document that is imprecise or incomplete. A value-at-risk measure has not been designed—it cannot be reviewed—until the design document is complete. Review should result in either recommendations for improving the model or approval for the model to be implemented. Keep in mind that the system requirements and design document are likely to evolve during implementation, so additional reviews will be necessary.

13.3.4 Testing

Complex systems such as value-at-risk measures can be difficult to test, so it is critical to define the testing environment and strategies early on in the implementation process. In some cases, creating the test environment may require days or weeks of development and setup, so it isn’t something to leave to the end.

Different forms of testing are performed throughout the implementation of a system:

  • unit testing is done by developers as they code. Its purpose is to ensure that individual components function as they are supposed to.
  • integration testing is done by developers as they finalize the software. Its purpose is to ensure that the components integrate correctly and the system works as a whole. This is the development team’s opportunity to ensure that everything will work properly during the system/regression testing phase.
  • system/regression testing is done by a separate quality assurance (QA) team to confirm that the system meets the functional requirements and works as a whole.
  • stress/performance/load testing is done by developers and sometimes QA to ensure that the system can handle expected volumes of work and meets performance requirements. A system could pass the system test (i.e. meet all the functional requirements) but be slow. This testing ensures the system performs correctly under load.
  • user acceptance testing is done by business units, often with assistance from QA, to confirm the system meets all functional requirements. This is typically an abbreviated form of system/regression testing.

For value-at-risk applications, a common technique for system/regression testing and user acceptance testing is to build simulators to model aspects of the system that are expensive to replicate or not available in a non-production environment. This allows you to test systems in isolation prior to your final integration efforts. For example, instead of using actual real-time market data feeds, simulators can be developed that simulate sequences of real-time data. In addition to avoiding the use of expensive feeds, this approach gives you the ability to define and repeat certain sequences of data or transactions that test specific conditions in the system. While simulators are not a substitute for integration testing with actual live systems, they can be essential for developing the sorts of thorough testing processes required in value-at-risk implementations.
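As an illustration of the idea, a seeded feed simulator makes tick sequences exactly repeatable across test runs. Everything here, including the symbol names and the return model, is hypothetical:

```python
# Repeatable market-data simulator for system testing: a fixed seed
# yields the identical tick sequence on every run, so tests that
# depend on particular data conditions can be replayed exactly.
import random

def simulated_feed(symbols, n_ticks, seed=42):
    rng = random.Random(seed)                  # fixed seed => repeatable
    prices = {s: 100.0 for s in symbols}
    for _ in range(n_ticks):
        s = rng.choice(symbols)
        prices[s] *= 1 + rng.gauss(0, 0.001)   # small random return
        yield s, round(prices[s], 4)

ticks = list(simulated_feed(["EURUSD", "USDJPY"], 5))
```

Because the sequence is deterministic, a test that exposed a defect can be rerun with exactly the same data until the defect is fixed.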

The system’s value-at-risk analytics need to be tested to ensure they reflect the formulas specified in the design document. The recommended approach is to implement a stripped down value-at-risk measure with the same analytics as in the design document. It may be possible to do this in Excel, Matlab or some similar environment that will facilitate participation by business professionals on the implementation team. When identical inputs are run through it and the production system, outputs need to match. If they don’t, that indicates there is a problem, either with the test value-at-risk measure or the production value-at-risk measure. Usually, it is with the stripped down test value-at-risk measure, but each discrepancy needs to be investigated. Even small discrepancies must be addressed. A bug that causes a discrepancy at the eighth decimal place for one set of inputs may cause one at the first decimal place with another.

It is critical that you take the time at the outset of implementing a value-at-risk measure to define and budget for the testing processes. Plan for a generous user acceptance testing period. There can be considerable push-back from business units to limit their involvement in this phase, which can be mitigated by combining user acceptance testing with user training.

13.3.5 Parallel Testing

If a new value-at-risk system is replacing an old system, the two systems should be run in parallel for a few months to compare their performance. Output from the two systems should not be expected to match, since their analytics are different. Even wide discrepancies in output may not be cause for alarm, but they should be investigated. Users need to understand in what ways the new system performs differently from the system it is replacing. If output from the new system ever doesn’t make economic sense during this period, that should prompt a more exhaustive review.

Parallel testing can work well with agile software development. Business units can continue to rely on the legacy system while testing and familiarizing themselves with components of the new system as they are brought on-line.

13.3.6 Backtesting

Backtesting is performed on an ongoing basis once a value-at-risk measure is in production for a given portfolio. Portfolio value-at-risk measurements and corresponding P&L’s are recorded over time, and statistical tests are applied to the data to assess how well the value-at-risk measure reflects the portfolio’s market risk. This is an important topic, which we cover fully in Chapter 14.

13.3.7 Ongoing Validation

Validation is the process of confirming that a model is implemented as designed and produces outputs that are meaningful or otherwise useful. For value-at-risk measures, this encompasses design review, software testing at both the system/regression and user acceptance stages, parallel testing and backtesting, all discussed above.

Validation also needs to be an ongoing process. Value-at-risk measures will be modified from time to time, perhaps to add traded instruments to the model library, improve performance, or reflect new modeling techniques. Proposed modifications need to be documented in an updated design document, which must be reviewed and approved. Once the modifications are coded, they need to be fully tested. For some modifications, it will make sense to parallel test the modified system against the unmodified system. The modified system then needs to be backtested on an ongoing basis.

Even if a value-at-risk system isn’t modified, it needs to be reviewed periodically to check whether developments in the environment in which it is used have rendered it obsolete or less useful than it once was. These scheduled reviews should be based on the design document, read in light of what may have changed (within the firm, the markets, data sources, etc.) since the design document was first written. The review should also include interviews with users to determine how they are currently using the system, whether those uses are consistent with its design, and whether modifications to the value-at-risk measure might be called for.

13.3.8 Model Inventory

Internal auditors should maintain an inventory of all models used within a trading environment, and the value-at-risk measure should be included in that list. A model inventory facilitates periodic validation of all models.

13.3.9 Vendor Software

Software vendors will generally test their own code to ensure it is bug-free but otherwise rely on clients to report problems with the software. Also, due to each user’s choice of settings and interfaces, each value-at-risk implementation tends to be unique. For these reasons, vendor software needs to be validated on an ongoing basis, much like internally developed software.

13.3.10 Communication and Training

A critical step for addressing Type C, model application risk, is employee training. This is especially true for value-at-risk measures, which often relate only tangentially to end-users’ primary job functions. Training should cover more than basic functionality. It should communicate the purpose of the value-at-risk measure and help end-users understand how the value-at-risk measure can help them in their work. As mentioned earlier, it may be advantageous to integrate training with user acceptance testing.