6.4 Data Errors
Data errors are data values that are, in some sense, erroneous. Filtering is any procedure for identifying data errors. Cleaning is any procedure that corrects data errors or replaces them in the data series with some missing data indicator.
6.4.1 Data Errors
Data errors frequently arise when data is first placed in electronic form, either through manual keying or through a process of scanning and optical character recognition. With manual keying, forward prices may be entered for the wrong maturity. Bid and offer prices may be transposed. Exchange rates may be inverted or entered for the wrong currency pair. Decimal places can be omitted or shifted. Digits can be transposed. Incorrect digits may be entered. Digits may be dropped or extra digits entered.
Optical character recognition software introduces different types of errors. Characters may be mistaken for other characters, with particular mistakes depending upon the font in which scanned data is printed as well as the quality of the printing. The letter O may be read as the numeral 0. In certain fonts, the numerals 6, 8, and 9 may be mistaken for one another, as may 5 and 6. Stray marks may be interpreted as text or decimal points. Legitimate decimal points may be overlooked.
Even after data is in electronic form, software or hardware may introduce data errors. If a real-time data feed fails, a user may misinterpret bid and offer prices at the time of the failure as being effective throughout the period when the system is down. A system might transmit only the third and fourth decimal places of a price except when the second decimal place changes. If the second decimal place changes during a period when the system is down, a user may be unaware of the change after the system is restored. Some systems may transmit dummy data when they are turned on or being tested. In some markets, automated quotation systems are designed to indicate “reasonable” prices as if they were actual prices during periods of trading inactivity.
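A feed failure of the kind described above often leaves a trace in the recorded series: the last quote received is repeated unchanged for the duration of the outage. The following is a minimal sketch of a check for such runs. The function name `find_stale_runs` and the threshold `min_length` are hypothetical choices made for illustration. A run of unchanged quotes may also be legitimate in an inactive market, so flagged runs are candidates for review, not confirmed errors.

```python
def find_stale_runs(quotes, min_length=4):
    """Return (start, end) index pairs of runs in which the quote never changes.

    `end` is exclusive. `min_length` is an illustrative threshold, not a
    recommended setting; a suitable value depends on the market and on
    the sampling frequency.
    """
    runs = []
    start = 0
    for i in range(1, len(quotes) + 1):
        # A run ends at the end of the series or when the value changes.
        if i == len(quotes) or quotes[i] != quotes[start]:
            if i - start >= min_length:
                runs.append((start, i))
            start = i
    return runs


quotes = [101.2, 101.3, 101.3, 101.3, 101.3, 101.3, 101.6, 101.5]
print(find_stale_runs(quotes))  # prints [(1, 6)]
```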
Off-market prices are actual prices that are not, in our subjective opinion, reflective of markets at the time they are quoted. A broker may carelessly quote an indicative price. An unsophisticated market participant may accept an unreasonable price. A trader may mistakenly trade at an off-market price. Occasionally, a trader may transact at off-market prices to manipulate markets or distort his portfolio’s mark-to-market value. Wash trades are identical offsetting trades between two counterparties, which can be performed at off-market prices. Ramping is the performance of very small transactions at off-market prices.
The pricing of one transaction may influence the pricing of other transactions with the same counterparty. If a trade accounting system cannot handle swaps, a trader may transact a swap as a strip of individual forwards, each at the same price. Recorded in a time series, the identical prices will appear to indicate a flat forward curve.
6.4.2 Data Filtering
Data filtering may entail one or more of:
- computer algorithms,
- human review, and
- data comparisons.
Many data errors—especially keying errors—are blatant. If the decimal point is accidentally dropped from 1.72, it becomes 172.00. If we mistakenly transpose the leading digits in 19.02, it becomes 91.02. Such blatant data errors are easily identified in the context of a time series. Computer algorithms can sift through large volumes of data to locate such outliers. For more subtle errors, algorithms may employ statistical inference, pattern-recognition techniques or arbitrage relationships—such as put-call parity or interest-rate parity—to identify suspect data values. Spreads or other price relationships can be checked to see if they conform to historical patterns.
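To illustrate how an algorithm might locate blatant outliers such as the dropped decimal point above, here is a minimal sketch that compares each value with the median of its neighbors and measures the gap in units of the median absolute deviation (MAD). This is one possible technique among many, not a method prescribed by the text. The function name `flag_outliers` and the parameters `window` and `threshold` are assumptions chosen for the example.

```python
from statistics import median


def flag_outliers(series, window=5, threshold=5.0):
    """Return indices whose value lies far from the median of nearby values.

    Distance is measured in units of the median absolute deviation (MAD)
    of the neighbors. `window` and `threshold` are illustrative tuning
    parameters, not recommended settings.
    """
    suspects = []
    for i, value in enumerate(series):
        lo = max(0, i - window)
        hi = min(len(series), i + window + 1)
        # Exclude the value under test so it cannot mask itself.
        neighbors = [series[j] for j in range(lo, hi) if j != i]
        if len(neighbors) < 3:
            continue
        center = median(neighbors)
        mad = median([abs(x - center) for x in neighbors])
        # Guard against a zero MAD when the neighbors are all identical.
        scale = max(mad, 1e-8 * max(abs(center), 1.0))
        if abs(value - center) > threshold * scale:
            suspects.append(i)
    return suspects


# The decimal point has been dropped from 1.72 in the fifth observation.
prices = [1.71, 1.72, 1.70, 1.73, 172.00, 1.72, 1.74, 1.73, 1.71]
print(flag_outliers(prices))  # prints [4]
```

The median and MAD are used here in place of the mean and standard deviation because a single gross error can inflate the latter enough to hide itself.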
Simple filtering algorithms are easy to construct, but more sophisticated algorithms require careful design as well as some fine-tuning over time. Seasonality and heteroskedasticity complicate designs. Tests relating to specific price relationships or patterns must be customized.
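One way a filter might accommodate heteroskedasticity is to judge each return against a volatility estimate that adapts over time, so that a fixed threshold does not flag too much in turbulent periods or too little in quiet ones. The sketch below uses an exponentially weighted moving average of squared log returns. It assumes strictly positive prices, and the names `flag_volatility_scaled`, `decay`, `threshold`, `warmup` and `min_vol` are hypothetical, with default values chosen only for illustration. It does not address seasonality.

```python
import math


def flag_volatility_scaled(prices, decay=0.94, threshold=6.0,
                           warmup=5, min_vol=1e-4):
    """Flag prices whose log return is large relative to recent volatility.

    Each price is compared with the last price that passed the filter,
    so a single bad value is flagged once and does not contaminate the
    return computed for the next observation. Assumes positive prices.
    """
    suspects = []
    variance = 0.0
    accepted = 0              # number of returns in the variance estimate
    last_good = prices[0]
    for i in range(1, len(prices)):
        r = math.log(prices[i] / last_good)
        scale = max(math.sqrt(variance), min_vol)
        if accepted >= warmup and abs(r) > threshold * scale:
            suspects.append(i)
            continue          # keep suspect returns out of the estimate
        if accepted == 0:
            variance = r * r
        else:
            variance = decay * variance + (1 - decay) * r * r
        accepted += 1
        last_good = prices[i]
    return suspects


# The eighth observation has its decimal point shifted (100.9 -> 1009.0).
prices = [100.0, 101.0, 100.5, 101.2, 100.8, 101.1, 100.9, 1009.0, 101.0, 101.3]
print(flag_volatility_scaled(prices))  # prints [7]
```

A design of this kind has a known weakness: after a genuine jump in price level, every subsequent value would be flagged until a reviewer intervenes, which is one reason such filters need fine-tuning and human oversight.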
Subtle data errors can be identified manually by traders or other professionals who follow market developments. Such reviews should be performed soon after data is recorded, while a reviewer’s memory is fresh. Also, if a computer algorithm identifies data values as suspect, humans may perform a final review to determine which of these are actual data errors.
Finally, if data for certain risk factors is available from several independent sources, the data from these sources can be compared for consistency.
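A comparison between two sources can be sketched as follows. The example assumes each source is held as a dictionary mapping dates to values, and that a relative `tolerance` of 0.5% separates ordinary quoting differences from errors. Both the data layout and the tolerance are assumptions made for illustration, as is the function name `compare_sources`.

```python
def compare_sources(source_a, source_b, tolerance=0.005):
    """Return the dates on which two sources disagree by more than a
    relative tolerance. Dates present in only one source are ignored.
    """
    disagreements = []
    for date in sorted(source_a.keys() & source_b.keys()):
        a, b = source_a[date], source_b[date]
        reference = max(abs(a), abs(b))
        if reference > 0 and abs(a - b) / reference > tolerance:
            disagreements.append(date)
    return disagreements


# Hypothetical exchange rates; the second source has one rate inverted.
vendor_a = {"2024-01-02": 1.0950, "2024-01-03": 1.0932, "2024-01-04": 1.0941}
vendor_b = {"2024-01-02": 1.0951, "2024-01-03": 0.9147, "2024-01-04": 1.0940}
print(compare_sources(vendor_a, vendor_b))  # prints ['2024-01-03']
```

A disagreement shows only that at least one source is wrong. Deciding which one requires a third source or human review.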
6.4.3 Data Cleaning
Once data errors have been identified, they must be cleaned: either set equal to values we believe to be correct, or deleted and replaced with a missing data indicator. Corrected values should be obtained from the same source from which the data originated. For keying errors, refer to the documents from which the data was keyed. If erroneous data is obtained from an exchange, the exchange should be able to correct it. If transaction prices are erroneous, contact the counterparties to the trade. If data is obtained from a data vendor, you won’t have access to original sources, but you can request that the vendor obtain corrected values.
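The cleaning step itself can be sketched in a few lines. The example assumes that suspect observations have already been identified by index, that any corrected values obtained from the original source are supplied in a dictionary, and that NaN serves as the missing data indicator. These are illustrative choices, and the function name `clean` is hypothetical.

```python
import math


def clean(series, suspects, corrections=None):
    """Return a copy of `series` in which each suspect value is replaced
    by its corrected value if one is available, or by NaN otherwise.

    `suspects` is an iterable of indices; `corrections` maps an index to
    the corrected value obtained from the original source.
    """
    corrections = corrections or {}
    cleaned = list(series)    # leave the raw data untouched
    for i in suspects:
        cleaned[i] = corrections.get(i, math.nan)
    return cleaned


raw = [1.71, 1.72, 1.70, 1.73, 172.00, 1.72]
print(clean(raw, [4], {4: 1.72}))  # the source document confirms 1.72
print(clean(raw, [4]))             # no correction available
```

Working on a copy preserves the raw series, so a cleaning decision can be revisited if the source later supplies a different correction.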