14.4  Backtesting With Distribution Tests

As part of the process of calculating a portfolio’s value-at-risk, a value-at-risk measure—explicitly or implicitly—characterizes a distribution for the portfolio’s value P_1 or loss L_1. That characterization takes various forms. A linear value-at-risk measure might specify the distribution for P_1 with a mean, a standard deviation and an assumption that the distribution is normal. A Monte Carlo value-at-risk measure simulates a large number of values for P_1. Any histogram of those values can be treated as a discrete approximation to the distribution of P_1.

Distribution tests are goodness-of-fit tests that go beyond the specific quantile-of-loss a value-at-risk measure purports to calculate and more fully assess the quality of the P_1 or L_1 distributions the value-at-risk measure characterizes.

For example, a crude distribution test can be implemented by performing multiple coverage tests for different quantiles of L_1. Suppose a one-day 95% value-at-risk measure is to be backtested. Our basic coverage test is applied to assess how well the value-at-risk measure estimates the 0.95 quantile of L_1, but we don’t stop there. We apply the same coverage test to also assess how well the value-at-risk measure estimates the 0.99, 0.975, 0.90, 0.80, 0.70, 0.50 and 0.25 quantiles of L_1. Collectively, these analyses provide a rudimentary goodness-of-fit test of how well the value-at-risk measure characterizes the overall distribution of L_1.
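As an illustration, a crude version of this multi-quantile test can be sketched in Python. The function below is our own illustration, not from the text: it assumes the value-at-risk measure reports each day’s loss quantile u_t, counts exceedances of a given quantile, and computes a two-sided binomial p-value for the count.

```python
import numpy as np
from scipy.stats import binom

def coverage_pvalue(u, q):
    """Two-sided binomial p-value for exceedances of the q quantile of loss.
    u: array of daily loss quantiles reported by the value-at-risk measure."""
    u = np.asarray(u)
    n = len(u)
    x = int(np.sum(u > q))                 # number of exceedances observed
    lower = binom.cdf(x, n, 1 - q)         # Pr(X <= x) under the null
    upper = binom.sf(x - 1, n, 1 - q)      # Pr(X >= x) under the null
    return x, min(1.0, 2 * min(lower, upper))

# Apply the same coverage test at several quantiles of loss.
u_data = np.linspace(0.001, 0.999, 500)    # placeholder loss-quantile data
for q in (0.99, 0.975, 0.95, 0.90, 0.80, 0.70, 0.50, 0.25):
    x, p = coverage_pvalue(u_data, q)
    print(f"quantile {q:5.3f}: {x:3d} exceedances, p-value {p:.3f}")
```

The loss-quantile data here are a placeholder; in practice they would come from the value-at-risk measure’s forecast distributions.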

Various distribution tests have been proposed in the literature. Most employ the framework we describe below.

14.4.1 Framework for Distribution Tests

While coverage tests assess a value-at-risk measure’s exceedance data i_{−α}, i_{−α+1}, … , i_0, which is a series of 0’s and 1’s, most distribution tests consider loss data l_{−α}, l_{−α+1}, … , l_0. Although it is convenient to assume the exceedance random variables I_t are IID, that assumption is unreasonable for the losses L_t.

A value-at-risk measure characterizes a CDF Φ_t for each loss L_t. Treating probabilities as objective for pedagogical purposes, Φ_t is a forecast distribution we use to model the “true” CDF for each L_t, which we denote F_t. Our null hypothesis H₀ is then Φ_t = F_t for all t.

Testing this hypothesis poses a problem: we are not dealing with a single forecast distribution modeling some single “true” distribution. The distribution changes from one day to the next, so each data point l_t is drawn from a different probability distribution. This frustrates direct statistical analysis. We circumvent the problem by introducing a random variable U_t for the quantile at which L_t occurs:

U_t = Φ_t(L_t)    [14.9]
Assuming our null hypothesis H₀, the U_t are all uniformly distributed, U_t ~ U(0,1). We assume the U_t are independent. Applying [14.9], we transform our loss data l_{−α}, l_{−α+1}, … , l_0 into loss quantile data u_{−α}, u_{−α+1}, … , u_0, which we treat as a realization u[0], … , u[α − 1], u[α] of a sample. This we can test for consistency with a U(0,1) distribution. Crnkovic and Drachman’s (1996) distribution test applied Kuiper’s statistic for this purpose.
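As a sketch of this transformation, suppose, purely for illustration, that each day’s forecast CDF is normal with its own mean and standard deviation. The code below computes the loss quantiles u_t = Φ_t(l_t) and Kuiper’s statistic; the daily losses and forecast parameters are invented, and the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def kuiper_statistic(u):
    """Kuiper's statistic V = D+ + D- for testing u ~ U(0,1)."""
    u = np.sort(np.asarray(u))
    n = len(u)
    j = np.arange(1, n + 1)
    d_plus = np.max(j / n - u)             # greatest gap above the ECDF
    d_minus = np.max(u - (j - 1) / n)      # greatest gap below the ECDF
    return d_plus + d_minus

# Hypothetical loss data and daily forecast parameters.
losses = np.array([1.2, -0.3, 2.1, 0.4, -1.0])
mu = np.array([0.0, 0.1, 0.0, 0.2, -0.1])        # forecast means
sigma = np.array([1.0, 0.9, 1.1, 1.0, 0.8])      # forecast standard deviations

u = norm.cdf(losses, loc=mu, scale=sigma)        # u_t = Phi_t(l_t)
v = kuiper_statistic(u)
```

Large values of V are evidence against uniformity; critical values for Kuiper’s statistic are tabulated in the statistics literature.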

Some distribution tests—see Berkowitz (2001)—further transform the data u_{−α}, u_{−α+1}, … , u_0 by applying the inverse standard normal CDF Φ⁻¹:

N_t = Φ⁻¹(U_t)
Assuming our null hypothesis H₀, the N_t are identically standard normal, N_t ~ N(0,1), so the transformed data n_{−α}, n_{−α+1}, … , n_0 can be tested for consistency with a standard normal distribution.

Below, we introduce a simple graphical test of normality. It will motivate a recommended standard test based on Filliben’s (1975) correlation test for normality, one of the most powerful tests of normality available.

14.4.2 Graphical Distribution Test

Construct the n_t as described above, and arrange them in ascending order. We adjust our notation, denoting the lowest n_1 and the highest n_{α+1}, so n_1 ≤ n_2 ≤ … ≤ n_{α+1}. Next, define

m_j = Φ⁻¹((j − 0.5)/(α + 1))

for j = 1, 2, … , α + 1, where Φ is the standard normal CDF. The m_j are quantiles of the standard normal distribution, with a fixed 1/(α + 1) probability between consecutive quantiles. If our null hypothesis holds, and the n_j are drawn from a standard normal distribution, each n_j should fall near the corresponding m_j. We can test this by plotting all points (n_j, m_j) in a Cartesian plane. If the points tend to fall near a line with slope one, passing through the origin, this provides visual evidence for our null hypothesis.
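This construction can be sketched as follows, assuming the transformed data are available as an array. The plotting positions (j − 0.5)/(α + 1) are one standard convention that places probability 1/(α + 1) between consecutive quantiles, matching the description above; the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def qq_pairs(n_data):
    """Return sorted sample values n_j and matching standard normal
    quantiles m_j = Phi^{-1}((j - 0.5)/(alpha + 1)), j = 1..alpha+1."""
    n_sorted = np.sort(np.asarray(n_data))
    size = len(n_sorted)                   # alpha + 1 observations
    m = norm.ppf((np.arange(1, size + 1) - 0.5) / size)
    return n_sorted, m

# Under the null hypothesis the points (n_j, m_j) hug the 45-degree line.
rng = np.random.default_rng(1)
n_sorted, m = qq_pairs(rng.standard_normal(250))
# e.g. plot with matplotlib: plt.scatter(n_sorted, m)
```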

14.4.3 A Recommended Standard Distribution Test

We now introduce a recommended standard distribution test based on Filliben’s correlation test for normality. Construct the pairs (n_j, m_j) as described above, and take the sample correlation of the n_j and m_j. Sample correlation values close to one tend to support the null hypothesis.

Using the Monte Carlo method, we can determine non-rejection values for the sample correlation at various levels of significance. If the sample correlation falls below a non-rejection value, we reject the null hypothesis at the indicated level of significance. Non-rejection values for the .05 and .01 significance levels are indicated in Exhibit 14.6.

Exhibit 14.6: Non-rejection values for the sample correlation calculated for our recommended standard distribution test. If the sample correlation is less than the non-rejection value, the value-at-risk measure is rejected at the indicated significance level.
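Non-rejection values of this kind can be approximated with a simple Monte Carlo sketch: simulate many standard normal samples, compute each sample’s correlation between sorted values and normal quantiles, and take the appropriate quantile of the simulated correlations. This uses the (j − 0.5)/(α + 1) quantile convention described above rather than Filliben’s exact order-statistic medians, and the function names and trial count are ours.

```python
import numpy as np
from scipy.stats import norm

def filliben_corr(n_data):
    """Sample correlation of sorted values against standard normal quantiles."""
    n_sorted = np.sort(np.asarray(n_data))
    size = len(n_sorted)
    m = norm.ppf((np.arange(1, size + 1) - 0.5) / size)
    return np.corrcoef(n_sorted, m)[0, 1]

def non_rejection_value(sample_size, significance, trials=20000, seed=0):
    """Monte Carlo estimate of the non-rejection value: reject the
    value-at-risk measure if the sample correlation falls below it."""
    rng = np.random.default_rng(seed)
    corrs = [filliben_corr(rng.standard_normal(sample_size))
             for _ in range(trials)]
    return float(np.quantile(corrs, significance))
```

For a 250-day backtest, non_rejection_value(250, 0.05) approximates the .05-level threshold.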

Suppose we are backtesting a one-day 99% value-at-risk measure based on α + 1 = 250 days of data. We calculate the n_j and m_j and find their sample correlation to be 0.993. Based on the values in Exhibit 14.6, we reject the value-at-risk measure at the .05 significance level but do not reject it at the .01 significance level.


Why is it unreasonable to assume the losses L_{−α}, L_{−α+1}, … , L_{−1} are IID?


In applying our recommended standard distribution test with 750 days of data, the sample correlation of the n_j and m_j is found to be 0.995. Do we reject the value-at-risk measure at the .05 significance level?


14.3  Backtesting With Coverage Tests

Even before J.P. Morgan’s RiskMetrics Technical Document described a graphical backtest, the concept of backtesting was familiar, at least within institutions then using value-at-risk. Two years earlier, the Group of 30 (1993) had recommended, and one month earlier the Basel Committee (1995) had also recommended, that institutions apply some form of backtesting to their value-at-risk results. Neither specified a methodology. In September 1995, Crnkovic and Drachman circulated to clients of J.P. Morgan a draft paper describing a distribution test and an independence test, which they published the next year. The first published statistical backtests were coverage tests of Paul Kupiec (1995). In 1996, the Basel Committee published its “traffic light” backtest.

14.3.1 A Recommended Standard Coverage Test

Consider a q quantile-of-loss value-at-risk measure and define a univariate exceedance process I with terms

I_t = 1 if the period-t loss exceeds the period-t value-at-risk; I_t = 0 otherwise.
To conduct a coverage test, we gather historical exceedance data i_{−α}, i_{−α+1}, … , i_0. We assume the I_t are IID, which allows us to treat our data as a realization i[0], … , i[α − 1], i[α] of a sample I[0], … , I[α − 1], I[α].

We define the coverage q* of the value-at-risk measure as the actual frequency with which it does not experience exceedances (i.e. instances of i_t = 0). This can be expressed as an unconditional expectation:

q* = 1 − E(I_t)
Coverage tests are hypothesis tests with the null hypothesis H₀ that q* = q. Let x denote the number of exceedances observed in the data:

x = i_{−α} + i_{−α+1} + ⋯ + i_0
We treat x as a realization of a binomial random variable X. Our null hypothesis is then simply that X ~ B(α + 1, 1 − q). To test H₀ at some significance level ε, we must determine values x1 and x2 such that

Pr(x1 ≤ X ≤ x2) ≥ 1 − ε
Multiple intervals [x1, x2] will satisfy this criterion, so we seek a solution that is approximately symmetric in the sense that Pr(X < x1) ≈ Pr(x2 < X) ≈ ε/2.

Formally, define a as the maximum integer such that Pr(X < a) ≤ ε/2 and b as the minimum integer such that Pr(b < X) ≤ ε/2. Consider all intervals of the form [a + n, b] or [a, b − n], where n is a non-negative integer. Set [x1, x2] equal to whichever of these maximizes Pr(X ∉ [x1, x2]) subject to the constraint that Pr(X ∉ [x1, x2]) ≤ ε. Our backtest procedure is then to observe the value-at-risk measure’s performance for α + 1 periods and record the number of exceedances X. If X ∉ [x1, x2], we reject the value-at-risk measure at the ε significance level.
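The procedure just described translates directly into code. The sketch below uses exact binomial probabilities; the function name is ours.

```python
from scipy.stats import binom

def coverage_interval(q, n, eps):
    """Non-rejection interval [x1, x2] for the number of exceedances X,
    where X ~ B(n, 1 - q) under the null hypothesis."""
    p = 1 - q
    # a: maximum integer with Pr(X < a) <= eps/2.
    a = 0
    while binom.cdf(a, n, p) <= eps / 2:       # Pr(X < a + 1)
        a += 1
    # b: minimum integer with Pr(X > b) <= eps/2.
    b = n
    while binom.sf(b - 1, n, p) <= eps / 2:    # Pr(X > b - 1)
        b -= 1

    def outside(x1, x2):                       # Pr(X outside [x1, x2])
        return binom.cdf(x1 - 1, n, p) + binom.sf(x2, n, p)

    # Among intervals [a + k, b] and [a, b - k], take the one that
    # maximizes the probability outside, subject to it not exceeding eps.
    best = (a, b)
    for k in range(1, b - a + 1):
        for cand in ((a + k, b), (a, b - k)):
            if outside(*cand) <= eps and outside(*cand) > outside(*best):
                best = cand
    return best
```

For q = 0.95, n = 500 and ε = .05 this reproduces the non-rejection interval [16, 35] of the example that follows.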

Suppose we implement a one-day 95% value-at-risk measure and plan to backtest it at the .05 significance level after 500 trading days (about two years). Then q = 0.95 and α + 1 = 500. Assuming H₀, we know X ~ B(500, .05). We use this distribution to determine x1 = 16 and x2 = 35. Calculations are summarized in Exhibit 14.2. We will reject the value-at-risk measure if X ∉ [16, 35].

Exhibit 14.2: Calculations to determine the non-rejection interval for our recommended standard coverage test when ε = .05, α + 1 = 500 and q = 0.95.

Exhibit 14.3 indicates similar .05 significance level non-rejection intervals [x1, x2] for other values of q and α + 1.

Exhibit 14.3: Recommended standard coverage test non-rejection intervals [x1, x2] for various values of q and α+1. The value-at-risk measure is rejected at the .05 significance level if the number of exceedances X is less than x1 or greater than x2.

14.3.2 Kupiec’s PF Coverage Test

Kupiec’s “proportion of failures” (PF) coverage test takes a circuitous—and approximate—route to an answer, offering no particular advantage over our recommended standard coverage test. Comparing the two tests can be informative, illustrating the various respects in which test designs may differ. As the first published backtesting methodology, the PF test has been widely cited.

As with the recommended standard test, a value-at-risk measure is observed for α + 1 periods, experiencing X exceedances. We adopt the same null hypothesis H₀ that q* = q. Rather than directly calculate probabilities from the B(α + 1, 1 − q) distribution of X under H₀, the PF test uses that distribution to construct a likelihood ratio:

Λ = [(1 − q)^X q^(α + 1 − X)] / [(X/(α + 1))^X (1 − X/(α + 1))^(α + 1 − X)]
It is difficult to infer probabilities with this directly. As described in Section 4.5.4, a standard technique is to consider −2 log(Λ):

−2 log(Λ) = −2[X log(1 − q) + (α + 1 − X) log(q)] + 2[X log(X/(α + 1)) + (α + 1 − X) log(1 − X/(α + 1))]    [14.7]
which is—see Lehmann and Romano (2005)—approximately centrally chi-squared with one degree of freedom. That is, −2 log(Λ) ~ χ2(1,0), assuming H₀. Kupiec found this approximation to be reasonable based on a Monte Carlo analysis, but Lopez (1999) claims to have found “meaningful” discrepancies using his own Monte Carlo analysis.

For a given significance level ε, we construct a non-rejection interval [x1, x2] such that

Pr(x1 ≤ X ≤ x2) ≥ 1 − ε

under H₀. To do so, calculate the 1 − ε quantile of the χ2(1,0) distribution. Setting this equal to [14.7], solve for X. There will be two solutions. Rounding the lower one down and the higher one up yields x1 and x2.

Consider the example we looked at with the recommended standard coverage test. We implement a one-day 95% value-at-risk measure and plan to backtest it at the .05 significance level after 500 trading days, so q = 0.95 and α + 1 = 500. We calculate the 1 − ε = .95 quantile of the χ2(1,0) distribution as 3.841. Setting this equal to [14.7], we solve for X. There are two solutions: 16.05 and 35.11. Rounding down and up, respectively, we set x1 = 16 and x2 = 36. We will reject the value-at-risk measure if X ∉ [16, 36].
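The PF calculation can be sketched numerically by treating X as continuous and solving −2 log(Λ) equal to the chi-squared critical value; the function name and root brackets are ours.

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import brentq

def pf_interval(q, n, eps):
    """Kupiec PF non-rejection interval [x1, x2]: round the lower root
    of -2 log(Lambda) = critical value down and the upper root up."""
    p = 1 - q

    def neg2_log_lr(x):
        ll_null = x * np.log(p) + (n - x) * np.log(1 - p)
        ll_mle = x * np.log(x / n) + (n - x) * np.log(1 - x / n)
        return -2.0 * (ll_null - ll_mle)

    crit = chi2.ppf(1 - eps, df=1)             # 3.841 for eps = .05
    mean = n * p                               # expected exceedances
    lo = brentq(lambda x: neg2_log_lr(x) - crit, 1e-9, mean)
    hi = brentq(lambda x: neg2_log_lr(x) - crit, mean, n - 1e-9)
    return int(np.floor(lo)), int(np.ceil(hi))
```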

Exhibit 14.4 indicates similar .05 significance level non-rejection intervals [x1, x2] for other values of q and α + 1.

Exhibit 14.4: PF coverage test non-rejection intervals [x1, x2] for various values of q and α+1. The value-at-risk measure is rejected at the .05 significance level if the number of exceedances X is less than x1 or greater than x2.

14.3.3 The Basel Committee’s Traffic Light Coverage Test

The 1996 Amendment to the Basel Accord imposed a capital charge on banks for market risk. It allowed banks to use their own proprietary value-at-risk measures to calculate the amount. Use of a proprietary measure required approval of regulators. A bank would have to have an independent risk management function and satisfy regulators that it was following acceptable risk management practices. Regulators would also need to be satisfied that the proprietary value-at-risk measure was sound.

Proprietary measures had to support a 10-day 99% value-at-risk metric, but as a practical matter, banks were allowed to calculate one-day 99% value-at-risk and scale the result by the square root of 10.

The Basel Committee (1996b) specified a methodology for backtesting proprietary value-at-risk measures. Banks were to backtest their one-day 99% value-at-risk results (i.e. value-at-risk before scaling by the square root of 10) against daily P&L’s. It was left to national regulators whether backtesting was based on clean or dirty P&L’s. Backtests were to be performed quarterly using the most recent 250 days of data. Based on the number of exceedances experienced during that period, the value-at-risk measure would be categorized as falling into one of three colored zones:

Exhibit 14.5: Basel Committee defined green, yellow and red zones for backtesting proprietary one-day 99% value-at-risk measures, assuming α + 1 = 250 daily observations. For banks whose value-at-risk measures fell in the yellow zone, the Basel Committee recommended that, at national regulators’ discretion, the multiplier k used to calculate market risk capital charges be increased above the base level 3, as indicated in the table. The committee required that the multiplier be increased to 4 if a value-at-risk measure fell in the red zone. Cumulative probabilities indicate the probability of achieving the indicated number of exceedances or less. They were calculated with a binomial distribution, assuming the null hypothesis q* = 0.99.

Value-at-risk measures falling in the green zone raised no particular concerns. Those falling in the yellow zone required monitoring. The Basel Committee recommended that, at national regulators’ discretion, value-at-risk results from yellow-zone value-at-risk measures be weighted more heavily in calculating banks’ capital charges for market risk—the recommended multipliers are indicated in Exhibit 14.5. Value-at-risk measures falling in the red zone had to be weighted more heavily and were presumed flawed—national regulators would investigate what caused so many exceedances and require that the value-at-risk measure be improved.

The Basel Committee’s procedure is not based on any statistical theory for hypothesis testing. The three zones were justified as reasonable in light of the probabilities indicated in Exhibit 14.5 (and probabilities assuming q* = 0.98, q* = 0.97, etc., which the committee also considered). Due to its ad hoc nature, the backtesting methodology is not theoretically interesting. It is important because of its wide use by banks.


Suppose we implement a one-day 90% value-at-risk measure and plan to backtest it with our recommended standard coverage test at the .05 significance level after 375 trading days (about eighteen months). Then q = 0.90 and α + 1 = 375. Calculate the non-rejection interval.


Suppose we want to apply Kupiec’s PF backtest to the same one-day 90% value-at-risk measure as in the previous exercise. Again, the significance level is .05, q = 0.90 and α + 1 = 375. Calculate the non-rejection interval. Compare your result with that of the previous exercise.


14.2  Backtesting

JP Morgan’s RiskMetrics Technical Document was released in four editions between 1994 and 1996. The first had limited circulation, being distributed at the firm’s 1994 annual research conference, which was in Budapest. It was the second edition, released in November of that year, that accompanied the public rollout of RiskMetrics. Six months later, a dramatically expanded third edition was released, reflecting extensive comments JP Morgan received on their methodology. While the second edition described a simple linear value-at-risk measure similar to JP Morgan’s internal system, the third edition reflected a diversity of practices employed at other firms. That edition described linear, Monte Carlo and historical transformation procedures. It also, perhaps for the first time in print, illustrated a crude method of backtesting.

Exhibit 14.1 is similar to a graph that appeared in that third edition. It depicts daily profits and losses (P&L’s) against (negative) value-at-risk for an actual trading portfolio. The chart not only summarizes the portfolio’s daily performance and the evolution of its market risk; it also provides a simple graphical analysis of how well the firm’s value-at-risk measure performed.

Exhibit 14.1: Chart of a portfolio’s daily P&L’s. The jagged line running across the bottom of the chart indicates the portfolio’s (negative) one-day 95% EUR value-at-risk. Any instance of a P&L falling below that line is called an exceedance. We would expect a 95% value-at-risk measure to experience approximately six exceedances in six months. In the chart, we count ten.

With a one-day 95% value-at-risk metric, we expect daily losses to exceed value-at-risk approximately 5% of the time, or six times in a six-month period. We define an exceedance as an instance of a portfolio’s single-period loss exceeding its value-at-risk for that single period. In Exhibit 14.1, we can count ten exceedances over the six months shown.

Is this result reasonable? If it is, what would we consider unreasonable? If we experienced two exceedances—or fourteen—would we question our value-at-risk measure? Would we continue to use it? Would we want to replace it or modify it somehow to improve performance?

Questions such as these have spawned a literature on techniques for statistically testing value-at-risk measures ex post. Research to date has focused on value-at-risk measures used by banks. Published backtesting methodologies mostly fall into three categories:

  • Coverage tests assess whether the frequency of exceedances is consistent with the quantile of loss a value-at-risk measure is intended to reflect.
  • Distribution tests are goodness-of-fit tests applied to the overall loss distributions forecast by complete value-at-risk measures.
  • Independence tests assess whether results appear to be independent from one period to the next.

Later in this chapter, we cover several backtesting procedures that are prominent in the literature. Because all have shortcomings, we also introduce three basic tests—a coverage test, a distribution test and an independence test—that we recommend as minimum standards for backtesting in practice.

The question arises as to which P&L’s to use in backtesting a value-at-risk measure. We distinguish between dirty P&L’s and clean P&L’s. Dirty P&L’s are the actual P&L’s reported for a portfolio by the accounting system. They can be impacted by trades that take place during the value-at-risk horizon—trades the value-at-risk measure cannot anticipate. Dirty P&L’s also reflect fee income earned during the value-at-risk horizon, which value-at-risk measures also don’t anticipate. Clean P&L’s are hypothetical P&L’s that would have been realized if no trading took place and no fee income were earned during the value-at-risk horizon.
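The distinction can be illustrated with a toy calculation; the positions, prices, intraday trading P&L and fee income below are all invented for the example.

```python
# Start-of-day holdings and prices (hypothetical).
positions = {"ABC": 100, "XYZ": -50}
prices_start = {"ABC": 170.0, "XYZ": 410.0}
prices_end = {"ABC": 172.0, "XYZ": 405.0}

def portfolio_value(pos, prices):
    return sum(qty * prices[sym] for sym, qty in pos.items())

# Clean P&L: revalue the frozen start-of-day portfolio at end-of-day prices.
clean_pnl = (portfolio_value(positions, prices_end)
             - portfolio_value(positions, prices_start))

# Dirty P&L additionally reflects intraday trading and fee income,
# neither of which the value-at-risk measure anticipates.
intraday_trading_pnl = -80.0
fee_income = 25.0
dirty_pnl = clean_pnl + intraday_trading_pnl + fee_income
```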

The Basel Committee (1996) recommends that banks backtest their value-at-risk measures against both clean and dirty P&L’s. The former is essential for addressing Type A and Type B model risk. The latter can be used to assess Type C model risk.

Suppose a firm calculates its portfolio value-at-risk at the end of each trading day. In a backtest against clean P&L’s, the value-at-risk measure performs well. Against dirty P&L’s, it does not. This might indicate that the value-at-risk measure is sound but that end-of-day value-at-risk does not reasonably indicate the firm’s market risk. Perhaps the firm engages in an active day trading program, reducing exposures at end-of-day.

Financial institutions don’t calculate clean P&L’s in the regular course of business, so provisions must be made for calculating and storing them for later use in backtesting. Other data to maintain are

  • the value-at-risk measurements;
  • the quantiles of the loss distribution at which clean and dirty P&L’s occurred (the loss quantiles), as determined by the value-at-risk measure. This information is important if distribution tests are to be employed in backtesting;
  • inputs, including the portfolio composition, historical values for key factors and current values for key factors;
  • intermediate results calculated by the value-at-risk measure, such as a covariance matrix or a quadratic remapping; and
  • a history of modifications to the system.

The last three items will not be used in backtesting, but they could be useful if backtesting raises concerns about the value-at-risk measure that people then want to investigate. The third and fourth items could be regenerated at the time of such an investigation, but doing so for a large number of trading days might be a significant undertaking. Data storage is so inexpensive that there is every reason to err on the side of storing too much information.

A history of modifications to the value-at-risk measure is important because the system’s performance is likely to change with any substantive modification. Essentially, you are dealing with a new model each time you make a modification. We are primarily interested in backtesting the measure since its last substantive modification.


Chapter 14


14.1  Motivation

The empiricist tradition in the philosophy of science tells us that a model should not be assessed based on the reasonableness of its assumptions or the sophistication of its analytics. It should be assessed based on the usefulness of its predictions. Backtesting is a process of assessing the usefulness of a value-at-risk measure’s predictions when applied to a particular portfolio over time. The value-at-risk measurements obtained for the portfolio from the value-at-risk measure are recorded, as are the realized P&L’s for the portfolio. Once sufficient data has been collected, statistical or other tests are applied to assess how well the value-at-risk measurements reflect the riskiness of the portfolio.


13.4  Further Reading – Model Risk

Most discussions of model risk in the financial literature focus on model risk in asset pricing models rather than model risk in risk models. Office of the Comptroller of the Currency (2000) is a standard resource on model validation.

A number of papers assess the performance of various value-at-risk measures. See, for example, Marshall and Siegel (1997) and Berkowitz and O’Brien (2002).


13.3  Managing Model Risk

Many front- and middle-office systems entail some degree of model risk, so when you implement a value-at-risk system, model risk should not be a new issue. Your firm should already have considerable infrastructure in place for addressing it. Below, we discuss a variety of strategies, primarily from the perspective of a practitioner.

13.3.1 Personnel

I was once involved in a credit risk model implementation. The credit department lacked people with the necessary quantitative skills, so design work was done by another department. For political reasons, the head of the credit department insisted on retaining control. She inserted herself into the model design process, going so far as to sketch out analytics with her people. These made no sense in the way 1 + 1 = 3 makes no sense. The design team sat through a presentation of her design, thanked her, and then quietly proceeded with their own design.

Perhaps it goes without saying, but an important step in implementing any quantitative model is making sure it is designed by competent individuals. Modeling is a specialized skill that bridges mathematics and real-world applications. Someone with strong math skills may not be a good modeler. There is an old joke about organized crime figures hiring a mathematician to find a way for them to make money at horse betting. After spending a year on this task, the mathematician calls the mobsters together and opens his presentation with “Consider a spherical race horse … ”

Non-quantitative professionals are not qualified to assess an individual’s modeling skills. When hiring modelers, involve individuals with proven modeling skills in the process. Have a policy that quantitative professionals must report to other quantitative professionals with proven modeling skills and the ability to assess work based on its technical merit.

13.3.2 Standard Assumptions and Modeling Procedures

While it is true that any model should be assessed based on the usefulness of its predictions and not on the reasonableness of its analytics, there is a flip side to this. Designing a model to conform to established practices—using proven assumptions and modeling techniques—will decrease model risk. Novice modelers sometimes employ techniques they invent or read about in theoretical or speculative literature. Such innovation is how theory progresses, but it is best left for academia or personal research conducted in one’s free time. When a bank or other trading institution implements value-at-risk, considerable resources are brought to bear, and time is limited. A failed implementation could severely set back the institution, hobbling its risk management—and hence its ability to take risk—for years.

13.3.3 Design Review

Large financial institutions employ teams of financial engineers or other quantitative staff whose full-time job is to review other people’s models. The focus tends to be on asset pricing models, but they can also review risk management models, especially value-at-risk measures. Banks can expect their regulators to ask specifically how a value-at-risk measure’s design was independently reviewed and to see documentation of that review.

The review should be based on the design document (discussed in Section 12.6.1) that describes the value-at-risk measure. This needs to be a stand-alone document, operationally describing all inputs, outputs and calculations in sufficient detail that a value-at-risk measure can be implemented based on it alone. Do not attempt a review of a design document that is imprecise or incomplete. A value-at-risk measure has not been designed—it cannot be reviewed—until the design document is complete. Review should result in either recommendations for improving the model or approval for the model to be implemented. Keep in mind that the system requirements and design document are likely to evolve during implementation, so additional reviews will be necessary.

13.3.4 Testing

Complex systems such as value-at-risk measures can be difficult to test, so it is critical to define the testing environment and strategies early on in the implementation process. In some cases, creating the test environment may require days or weeks of development and setup, so it isn’t something to leave to the end.

Different forms of testing are performed throughout the implementation of a system:

  • Unit testing is done by developers as they code. Its purpose is to ensure that individual components function as they are supposed to.
  • Integration testing is done by developers as they finalize the software. Its purpose is to ensure that the components integrate correctly and the system works as a whole. This is the development team’s opportunity to ensure that everything will work properly during the system/regression testing phase.
  • System/regression testing is done by a separate quality assurance (QA) team to confirm that the system meets the functional requirements and works as a whole.
  • Stress/performance/load testing is done by developers and sometimes QA to ensure that the system can handle expected volumes of work and meets performance requirements. A system could pass the system test (i.e. meet all the functional requirements) but be slow. This testing ensures the system performs correctly under load.
  • User acceptance testing is done by business units, often with assistance from QA, to confirm the system meets all functional requirements. This is typically an abbreviated form of system/regression testing.

For value-at-risk applications, a common technique for system/regression testing and user acceptance testing is to build simulators to model aspects of the system that are expensive to replicate or not available in a non-production environment. This allows you to test systems in isolation prior to your final integration efforts. For example, instead of using actual real-time market data feeds, simulators can be developed that simulate sequences of real-time data. In addition to avoiding the use of expensive feeds, this approach gives you the ability to define and repeat certain sequences of data or transactions that test specific conditions in the system. While simulators are not a substitute for integration testing with actual live systems, they can be essential for developing the sorts of thorough testing processes required in value-at-risk implementations.

The system’s value-at-risk analytics need to be tested to ensure they reflect the formulas specified in the design document. The recommended approach is to implement a stripped down value-at-risk measure with the same analytics as in the design document. It may be possible to do this in Excel, Matlab or some similar environment that will facilitate participation by business professionals on the implementation team. When identical inputs are run through it and the production system, outputs need to match. If they don’t, that indicates there is a problem, either with the test value-at-risk measure or the production value-at-risk measure. Usually, it is with the stripped down test value-at-risk measure, but each discrepancy needs to be investigated. Even small discrepancies must be addressed. A bug that causes a discrepancy at the eighth decimal place for one set of inputs may cause one at the first decimal place with another.

It is critical that you take the time at the outset of implementing a value-at-risk measure to define and budget for the testing processes. Plan for a generous user acceptance testing period. There can be considerable pushback from business units seeking to limit their involvement in this phase; this can be mitigated by combining user acceptance testing with user training.

13.3.5 Parallel Testing

If a new value-at-risk system is replacing an old system, the two systems should be run in parallel for a few months to compare their performance. Output from the two systems should not be expected to match, since their analytics are different. Even wide discrepancies in output may not be cause for alarm, but they should be investigated. Users need to understand in what ways the new system performs differently from the system it is replacing. If output from the new system ever doesn’t make economic sense during this period, that should prompt a more exhaustive review.

Parallel testing can work well with agile software development. Business units can continue to rely on the legacy system while testing and familiarizing themselves with components of the new system as they are brought on-line.

13.3.6 Backtesting

Backtesting is performed on an ongoing basis once a value-at-risk measure is in production for a given portfolio. Portfolio value-at-risk measurements and corresponding P&L’s are recorded over time, and statistical tests are applied to the data to assess how well the value-at-risk measure reflects the portfolio’s market risk. This is an important topic, which we cover fully in Chapter 14.

13.3.7 Ongoing Validation

Validation is the process of confirming that a model is implemented as designed and produces outputs that are meaningful or otherwise useful. For value-at-risk measures, this encompasses design review, software testing at both the system/regression and user acceptance stages, parallel testing and backtesting, all discussed above.

Validation also needs to be an ongoing process. Value-at-risk measures are modified from time to time, perhaps to add traded instruments to the model library, improve performance, or reflect new modeling techniques. Proposed modifications need to be documented in an updated design document, which must be reviewed and approved. Once the modifications are coded, they need to be fully tested. For some modifications, it will make sense to parallel test the modified system against the unmodified system. The modified system then needs to be backtested on an ongoing basis.

Even if a value-at-risk system isn’t modified, it needs to be periodically reviewed to check whether developments in the environment in which it is used have rendered it obsolete or less useful than it once was. These scheduled reviews should be based on the design document, read in light of what may have changed (within the firm, the markets, data sources, etc.) since the design document was first written. The review should also include interviews with users to determine how they are currently using the system, whether those uses are consistent with its design, and whether modifications to the value-at-risk measure might be called for.

13.3.8 Model Inventory

Internal auditors should maintain an inventory of all models used within a trading environment, and the value-at-risk measure should be included in that list. A model inventory facilitates periodic validation of all models.

13.3.9 Vendor Software

Software vendors will generally test their own code to ensure it is bug-free but otherwise rely on clients to report problems with the software. Also, due to each user’s choice of settings and interfaces, each value-at-risk implementation tends to be unique. For these reasons, vendor software needs to be validated on an ongoing basis, much like internally developed software.

13.3.10 Communication and Training

A critical step for addressing Type C, model application risk, is employee training. This is especially true for value-at-risk measures, which often relate only tangentially to end-users’ primary job functions. Training should cover more than basic functionality. It should communicate the purpose of the value-at-risk measure and help end-users understand how the value-at-risk measure can help them in their work. As mentioned earlier, it may be advantageous to integrate training with user acceptance testing.

13.2  Model Risk

We define model risk as the risk of a model being poorly specified, incorrectly implemented or used in a manner for which it is inappropriate.

Consider the task of pricing swaptions. A financial engineer might employ finance theory to develop a model for that purpose. A programmer might implement the model as a computer program. A trader might use that implementation to price a swaptions trade. In this example there is risk that the financial engineer might poorly specify the model, or that the programmer might incorrectly implement it, or that the trader might use it in a manner for which it is not intended—perhaps pricing some non-standard swaption for which the model is not suited. Here we have three types of model risk:

  • Type A: model specification risk,
  • Type B: model implementation risk, and
  • Type C: model application risk.

Every quantitative model has three components:

  • inputs,
  • analytics, and
  • outputs.

To assess model risk, we should assess the potential for each type of model risk to arise in each model component.

13.2.1 Type A: Model Specification Risk

Model specification risk is the risk that a model will be poorly specified. The question is not so much one of a model being “right” as of the model being “useful.” As Box and Draper (1987, p. 424) observed:

Essentially, all models are wrong, but some are useful.

Any meaningful model makes predictions. A well specified model will—if correctly implemented and appropriately used—make generally useful predictions. This is true whether a model is used for predicting earthquakes, forecasting weather or assessing market risk.

Model specification risk arises primarily when a model is designed—when inputs are operationally defined, mathematical computations are specified, and outputs are indicated, preferably all in a formal design document. While careless errors such as misstating a formula are always a problem, a more general issue is the fact that a particular combination of inputs and computations may yield outputs that are misleading or otherwise not useful. While an experienced modeler may review a design document and make some assessment of how well the model might perform, the ultimate test is to implement the model and assess its performance empirically.

Model specification error also exists after a model is designed. The model itself may be modified from time to time, perhaps to address new traded instruments, fix a perceived problem with its performance, or to remain current with industry practices. It can also arise without any obvious changes to a value-at-risk measure. For example, historical values for a specific key factor may be drawn from a particular time series maintained by a data vendor. If the data vendor changes how it calculates values of that time series, or loosens quality controls, this might impact the value-at-risk measure’s performance.

13.2.2 Type B: Model Implementation Risk

Model implementation risk is the risk that a model, as it is implemented, strays from what is specified in the design document. Inputs may differ: perhaps historical data for a key factor is obtained from a source other than that specified in the design document. Mathematical formulas may differ: this can arise from simple typing errors, or a programmer may implement a formula in a manner that inadvertently alters it. Outputs can be misrepresented: perhaps two numbers are juxtaposed in a risk report, or outputs are presented in a manner that is confusing or misleading.

I recall once attending a bank’s risk committee meeting. A printout of the risk report was distributed, and I noticed that the exposure indicated for a particular category of risks was zero. I asked about this, as I knew the bank took such risks. No one knew why the number was zero. I went back and looked at past risk reports and found that the number was always zero. The system was calculating that exposure, but the result never made its way into the risk report. Somehow, zero had been hard-coded into that particular field of the risk report.

Implementation error can cause a good model to be implemented as a bad model, but that is just part of the problem. If the implemented model strays from its design—whether it performs well or not—it is a different model from what the designer and users believe it to be. They think they have one model, when they actually have another. Results may be unpredictable.

Implementation error generally results from human error. With large software projects, some coding or logic errors are all but inevitable. Implementation risk also arises if dishonest programmers decide to cut corners. Unfortunately, such dishonesty occurs, but it is not common on large value-at-risk implementations subject to a rigorous testing and validation process.

13.2.3 Type C: Model Application Risk

Model application risk is the risk that a model will be misinterpreted or used in a manner for which it is unsuited. Misinterpretation can produce a false sense of security or prompt actions that are inadvisable. There are many ways a model can be misused.

A risk manager once brought me in to consult for a day. He wanted me to explain to his CFO how the firm suffered a market loss of USD 22MM when their 95% value-at-risk had been reported as USD 5MM. The loss had wiped out an entire year’s profits, and the CFO was trying to shift blame to a “flawed” value-at-risk system. The risk manager had installed the system a few years earlier, so he was taking some heat.

I sat with the CFO—Machiavelli reincarnated, if I am not mistaken—and explained that the USD 22MM loss was for the entire month of March. The value-at-risk system reported one-day value-at-risk. He looked at me blankly. I asked him if it was reasonable to expect monthly P&L’s to be of a greater magnitude than daily P&L’s, since markets typically move more in a month than in a day. He accepted this, so I had him take out his calculator. I explained that, if we assumed 21 trading days in a month, we could approximate the one-month value-at-risk as the one-day value-at-risk multiplied by the square root of 21. He typed the numbers into his calculator, and the result was USD 23MM. Whatever the man’s blame-shifting motives may have been, he was impressed with the math. He cracked a smile.

I was fortunate the numbers worked out so nicely. Politics aside, the CFO learned a lesson that day. He really had misunderstood what the value-at-risk measure was telling him.
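The calculation the CFO performed can be reproduced in a few lines. Note that square-root-of-time scaling assumes IID daily returns, so it is only a rough approximation:

```python
import math

one_day_var = 5.0   # one-day 95% value-at-risk, in USD millions
trading_days = 21   # trading days assumed in the month

# Under the (strong) assumption of IID daily returns, volatility
# scales with the square root of time:
one_month_var = one_day_var * math.sqrt(trading_days)
print(round(one_month_var))  # roughly USD 23MM, the magnitude of the loss
```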

Modelers tend to have different backgrounds from traders or senior managers. We think in different ways. While designers of a model understand the meaning of its outputs on a technical level, users may have a less precise, more intuitive understanding. Modelers generally have a better sense of how reliable a model’s outputs are than do users, who may have unrealistic expectations. Also, if they don’t like what a model tells them, users are more likely to let emotions cloud their assessment of the model’s worth.

I once had a trader confide to me that a real-time value-at-risk measure I had implemented was wrong. He had checked his reported value-at-risk, placed a trade, and then noticed that the value-at-risk went down. I asked him if the new trade had been a hedge. When he said “yes,” I explained that the value-at-risk measure didn’t quantify the risk of each trade and then sum the results. It analyzed correlations to capture hedging and diversification effects. In this way, the value-at-risk measure understood that his last trade had been a hedge, which is why the value-at-risk had gone down. The trader stared at me uneasily, as if I were some used car salesman.
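The effect the trader observed can be illustrated with a simple parametric calculation; the numbers and function name below are purely illustrative:

```python
import math

Z_95 = 1.644854  # standard normal 0.95 quantile

def portfolio_var(sigmas, corr):
    """Parametric 95% value-at-risk from position standard deviations
    and a correlation matrix. Because VaR reflects how positions
    co-move, it is not the sum of stand-alone position VaRs."""
    n = len(sigmas)
    variance = sum(corr[i][j] * sigmas[i] * sigmas[j]
                   for i in range(n) for j in range(n))
    return Z_95 * math.sqrt(variance)

# A single position, then the same position plus a strongly
# negatively correlated hedge (illustrative numbers):
before = portfolio_var([10.0], [[1.0]])
after = portfolio_var([10.0, 6.0], [[1.0, -0.9], [-0.9, 1.0]])
assert after < before  # adding the hedge lowers reported value-at-risk
```

The hedge adds a position, yet total value-at-risk falls, because the correlation term subtracts more variance than the new position contributes.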

A common misperception with value-at-risk is a belief that it represents a maximum possible loss. It does not: it is a quantile of loss, which will be exceeded with some regularity. Still, in the scheme of things, a quantile-of-loss is easier to grasp than other PMMRs, such as expected tail loss or standard deviation of return.

A common form of model application risk is use of an otherwise good value-at-risk measure with a portfolio for which it is unsuited. This might happen if an organization’s trading activities evolve over time, but their value-at-risk measure is not updated to reflect new instruments or trading strategies.

Alternatively, the environment in which a model is used may evolve, rendering its predictions less useful—the model was well specified for the old environment but now it is being misapplied in a new environment. For example, a value-at-risk measure should employ key factors related to liquid, actively traded instruments. But markets evolve; liquidity can dry up. A choice of key factors that is suitable today may not be suitable next year. Consider how the forward power market dried up in the wake of the 2001 Enron bankruptcy, or how the CDO market collapsed following the 2008 subprime mortgage debacle.


Categorize the following situations as relating to Type A, Type B or Type C model risk. Some relate to more than one category. Discuss each situation.

  1. A trading organization implements a value-at-risk measure for an options portfolio. It constructs a remapping based on the portfolio’s deltas and then applies a linear transformation.
  2. A firm calculates value-at-risk each day and includes it in a risk report, which is available to traders and management through the firm’s intranet. Most have never figured out how to access the risk report.
  3. A value-at-risk measure employs many key factors. Some of them are prices of certain novel instruments. A few years after the value-at-risk measure is implemented, turmoil in the markets causes trading in those instruments to cease. Without data, the value-at-risk measure is inoperable.
  4. An equity portfolio manager asks his staff to calculate his portfolio’s one-year 99.9% value-at-risk. This relates to a loss that will occur once every thousand years, but most of the stocks in his portfolio haven’t existed for thirty years. With inadequate data, his staff cobbles together a number. The portfolio manager then asks his staff to calculate what the portfolio’s value-at-risk would be if he implemented various put options hedges. The staff has no idea how to extend their crude analysis to handle put options, but they don’t tell him this. Instead, they simply make up some value-at-risk numbers. The portfolio manager reviews these and implements a put options strategy whose cost he feels is warranted by the indicated reduction in value-at-risk.
  5. A firm implements its own one-day 97.5% USD value-at-risk measure, which employs a Monte Carlo transformation. A programmer coding the measure inadvertently hits an incorrect key, which causes the measure to always calculate the 94.5% quantile of loss instead of the intended 97.5% quantile of loss.
  6. A firm bases traders’ compensation on their risk-adjusted P&L’s. The risk adjustment is calculated from the traders’ value-at-risk, which is measured daily based on the composition of their portfolios at 4:30PM. The head of risk management eventually notices that traders hedge all their significant exposures with futures just before 4:30PM each day—and then lift the hedges, either later in the evening or early the next morning.



Chapter 13

Model Risk, Testing and Validation

13.1  Motivation

Between 2000 and 2001, National Australia Bank took write-downs totaling USD 1.2 billion on its US mortgage subsidiary HomeSide Lending. The losses were attributed to a series of errors in how the firm modeled its portfolio of mortgage servicing rights. This is not an isolated incident. In finance, models are widely used. Usually, they are formulas or mathematical algorithms. Examples include portfolio optimization models, option pricing formulas and value-at-risk measures. Flaws in such models cause firms to misprice assets, mishedge risks or enter into disadvantageous trades. If traders or other personnel become aware that a model is flawed, they may exploit the situation fraudulently. Flawed risk management models can cause firms to act imprudently—either too aggressively or too cautiously. For banks, they can result in regulators imposing additional capital requirements.

Berkowitz and O’Brien (2002) used daily P&L and value-at-risk data from six major US banks between 1997 and 2000 to assess (backtest) the performance of the banks’ value-at-risk measures. They concluded that

Banks’ 99th percentile value-at-risk forecasts tend to be conservative, and, for some banks, are highly inaccurate.


12.7  Further Reading – Implementing Value-at-Risk

Value-at-risk implementations differ significantly from one market to the next. Examples in this book illustrate techniques applicable to various markets. For some other markets, see

  • agriculturals – Wilson, Nganje and Hawes (2007)
  • credit default swaps – O’Neil (2010)
  • energies – Dahlgren, Liu and Lawarrée (2003) and Vasey and Bruce (2010)
  • mortgage-backed securities – Han, Park and Kang (2007)
  • shipping – Abouarghoub (2013)

More generally, Levine (2007) provides a high-level discussion of implementing risk management systems, primarily for the capital markets. Vasey and Bruce (2010) is a wonderful book on sourcing vendor trading and risk management software. While it targets the energy trading industry exclusively, professionals in other industries will benefit from many of its practical insights.

Leffingwell and Widrig (1999) is a general book for IT professionals on discovering and documenting systems requirements. Use cases are discussed by Cockburn (2000) and Schneider and Winters (2001). Larman (2003) offers an overview of agile software development methods. The specific methods of Scrum, XP, and Test-Driven Development are discussed in Schwaber and Beedle (2001), Beck and Andres (2004), and Beck (2002), respectively.

12.6  Implementation

Actual implementation of the value-at-risk system will be a different process depending on whether you install vendor software or develop your own system internally. However, since most implementations combine some vendor software and some internally developed software, your own implementation is likely to blend the two processes.

12.6.1 Internally Developed Software

The first step in internally developing value-at-risk software is for quantitative finance professionals to work with IT professionals to develop a design document. This will be the primary point of communication between business and IT professionals on the design team. The business professionals will provide the detailed know-how on the algorithms and processes required, while the IT professionals will provide expertise on how to best partition the technical solution.

The purpose of the document is to detail all inputs, mathematical formulas and outputs for the value-at-risk system. The document should be expressed in non-technical language to the extent possible, so as to be accessible to both finance and IT professionals. Authors should draw on both the requirements document, which should already exist, and this book, which provides detailed information on the analytics one might incorporate into a value-at-risk system. If authors adopt the terminology and notation of this book, then the book can serve as a reference to support the design document.

If use cases are employed for the requirements process, the design documentation will be driven from the use cases. The design process will consist of expanding each use case with design details.

After this design document is finished, the authors will need to be available on an ongoing basis to answer the inevitable questions that arise during the implementation. You can expect that the requirements and design of the system will undergo revisions throughout the implementation process. Strive to make the requirements document complete, but plan for the “80-20 rule”—80% of the requirements will be defined in the requirements phase of the project, while 20% of the requirements will be discovered during the design and implementation process. For this reason, it is critical that the business experts participate in frequent reviews throughout all phases of development.

12.6.2 Agile Software Development Methods

There are a number of agile software development methods (e.g., Scrum, Extreme Programming (XP), Test-Driven Development) that have become popular for developing complex systems such as value-at-risk solutions. These methods can be effective for value-at-risk development because they employ frequent, focused software releases to deliver the solution incrementally.

The primary benefit to an agile approach is the immediate feedback and business value of each release. By delivering smaller software releases sooner, the firm realizes business value more quickly and also is able to adjust for missing/incorrect software requirements earlier rather than later in the development process.

There are only certain contexts where agile methods can be used effectively, so development teams pick and choose various aspects of these methodologies for development. An experienced team is needed for any of these methods, so you will need to consult your IT team to determine if agile methods make sense for your implementation.

12.6.3 Vendor Software

Vendors will generally take responsibility for implementing their own software, but your implementation team should be an active participant in the process. Vendor software is often configurable in different ways, and team members should be active in deciding how their system should be configured. They should receive training in the vendor software before it is implemented, so they are fully versed in its features and configuration options.