Trust but Verify in Data, by Cesar Alvarez
Is your data in good shape? Would you know it if it was not?
At the end of last year, I was working with a client and we were having problems with code I had written. We would get different results depending on who ran the code. After comparing trade lists and doing some debugging, we discovered that their database was missing several symbols. These symbols existed in my database but not in theirs. We were both using Norgate Data, which I highly recommend; you can read my review here.
They contacted Norgate, who walked them through fixing their database. This worried me. My theory was that user error caused the problem. But more importantly, it could easily have been my database that was missing the symbols. Had we not discovered this discrepancy, the issue would still be hidden away.
Before you conclude that your data is in good shape because you use a big-name data provider, let me give you a couple of examples. Over the last decade, I have used data from many providers, including CSI Data, Norgate Data, Worden, Reuters and CBOE. With every one of them, I eventually discovered a data issue. Most of the time it was small. Some of these have been smaller data providers, but a couple of them are very big names.
Data Issues
These are the problems discovered in the past:
- Symbols being dropped
- Stock splits being missed
- Data holes
- Sudden bad prices on historical data
- Applying splits wrongly
- Bad delisting
When I worked for Connors Research, we would update our data each month, and I would then spend a day running tests on it to make sure there were no major issues. I was looking for changes in individual returns, changes in volume, missed stock splits, missing symbols and a lot more.
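The monthly checks described above can be sketched in code. This is a minimal illustration, not the author's actual tooling: the data layout (a dict of symbol to `(close, volume)` bars), the symbols, and the 1.9x ratio threshold are all assumptions chosen for the example.

```python
# Hypothetical layout: each database snapshot is a dict of
# symbol -> list of (close, volume) bars, oldest first.

def missing_symbols(old_db, new_db):
    """Symbols present in the previous snapshot but absent now --
    possible silently dropped symbols."""
    return sorted(set(old_db) - set(new_db))

def suspect_splits(db, ratio=1.9):
    """Flag symbols whose day-over-day close ratio is large enough to
    suggest a missed (or wrongly applied) split rather than a normal move.
    The 1.9 threshold is an illustrative choice."""
    flagged = []
    for sym, bars in db.items():
        closes = [c for c, _ in bars]
        for prev, cur in zip(closes, closes[1:]):
            if prev > 0 and cur > 0 and (cur / prev >= ratio or prev / cur >= ratio):
                flagged.append(sym)
                break
    return sorted(flagged)

def data_holes(db):
    """Symbols with zero-volume bars -- possible holes in the data."""
    return sorted(sym for sym, bars in db.items()
                  if any(v == 0 for _, v in bars))

old = {"AAA": [(10.0, 500), (10.2, 600)], "BBB": [(20.0, 300), (20.1, 250)]}
new = {"AAA": [(10.0, 500), (10.2, 600), (5.1, 0)]}  # BBB gone, odd last bar

print(missing_symbols(old, new))  # ['BBB']
print(suspect_splits(new))        # ['AAA']  (10.2 -> 5.1 looks like a 2:1 split)
print(data_holes(new))            # ['AAA']  (zero-volume bar)
```

Each check maps to one of the issue types listed earlier: dropped symbols, missed or misapplied splits, and data holes.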
A common fallacy is the belief that past data should stay unchanged. I can tell you this is not the case. If the data provider is adjusting for splits, dividends or anything else, they can and will make mistakes that change past data.
You also run into precision rounding when doing adjustments, which can change price data slightly. I actually find this one useful. Some people expect that if they run a backtest today and then run it again sometime in the future, the results should be exactly the same. But given these slight data changes, you may get slight (or even large) differences in your backtest results. This is good. Why? Because it tells you how sensitive your strategy is to the data.
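Here is a toy illustration of the rounding point: the same split adjustment stored at different precisions yields slightly different historical prices. The pre-split close and the 7-for-1 ratio are hypothetical values chosen for the example, not taken from any provider.

```python
# One hypothetical pre-split close, adjusted for a 7-for-1 split and
# stored at two different precisions.
raw_close = 645.57

adj_2dp = round(raw_close / 7, 2)  # provider stores 2 decimals -> 92.22
adj_4dp = round(raw_close / 7, 4)  # provider stores 4 decimals -> 92.2243

# A fraction-of-a-cent difference, but enough to flip an entry or exit
# signal that sits right at a price threshold.
print(adj_2dp, adj_4dp, round(adj_4dp - adj_2dp, 4))
```

Multiply a sub-cent difference like this across thousands of bars and symbols, and two otherwise identical backtests can diverge.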
Trust but verify
I have been lazy (OK, really lazy) about my data verification, but since the beginning of the year I have been doing it again. Every two weeks, I have two other research friends run an AmiBroker exploration and send me the results. I run them through Excel, looking for differences between our databases. After a few weeks of doing this, I discovered an issue, which I brought up with Norgate. They were quick to acknowledge the problem and fix it. If you are an Alpha Norgate Data user, you may have seen the bulletin about it when doing your update about a month ago. This in no way makes me worried about my data from Norgate. Now, if this started happening frequently, I would be worried. What impressed me is how good and quick they were about fixing the issue.
A second test I do is to take a snapshot of my database and compare it with the snapshot from two weeks earlier, looking for changes in prices, volume, index constituent data and anything else.
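A snapshot comparison like this can be sketched as a simple diff with a tolerance, so that tiny precision wobbles from re-adjustment do not drown out real problems. The data layout, symbols, dates and the 0.01 tolerance are assumptions for illustration, not the author's actual setup.

```python
# Hypothetical layout: each snapshot maps (symbol, date) -> close.

def snapshot_diff(prev, curr, tol=0.01):
    """Return (key, old, new) for every shared bar whose close changed
    by more than `tol`; smaller differences are treated as precision
    noise and ignored."""
    changes = []
    for key in prev.keys() & curr.keys():
        if abs(prev[key] - curr[key]) > tol:
            changes.append((key, prev[key], curr[key]))
    return sorted(changes)

prev = {("AAA", "2024-01-02"): 10.00, ("BBB", "2024-01-02"): 20.00}
curr = {("AAA", "2024-01-02"): 10.001,  # precision wobble: ignored
        ("BBB", "2024-01-02"): 18.50}   # material change: flagged

print(snapshot_diff(prev, curr))  # [(('BBB', '2024-01-02'), 20.0, 18.5)]
```

The same pattern extends to volume and index-constituent fields; only the comparison and tolerance change.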
This process takes work because there are often small discrepancies that must be investigated by hand to decide whether there is a real problem. These are typically due to the precision of the data. Do I report tiny differences to Norgate? No. I am looking for larger issues.
Final Thoughts
If you are not checking your data, you don't know if you have problems. Do you have a method of making sure your data is good? Or are you simply taking your provider's word that all is fine? I was lazy and doing just that. But no more. I am back to a 'trust but verify' relationship with my data.
Backtesting platform used: AmiBroker. Data provider: Norgate Data (referral link)