Introduction
As individuals and as a group we have been collecting malware for many years. The Shadowserver Foundation repository dates back to 2005 and we collected our first million shortly after we actually started counting. I still remember my excitement when we hit our first million. Within a month I was astounded as we hit our second million, and then worried when the third million rolled in a week later.
As the counts only continued to grow, my fascination turned to horror, because not only did I have to count all these objects coming in, I had to store and analyze them all as well.
Counting
Realizing that we had a problem that would only keep growing, we had to decide how to count and how to differentiate between the different files we were tracking. We quickly settled on SHA512 as the base index, to ensure that there should never be a collision of hashes, but also added in SHA1 and MD5 so that we could at the same time test for collisions in those smaller hash values. We have yet to see a collision with any file we have collected, but we continue to test each file.
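A minimal sketch of that idea in Python, with in-memory dictionaries standing in for our real storage (the function names and structure here are purely illustrative, not our actual pipeline):

```python
import hashlib

def hash_file(path, chunk_size=1 << 20):
    """Compute SHA512, SHA1 and MD5 in a single pass over the file."""
    sha512, sha1, md5 = hashlib.sha512(), hashlib.sha1(), hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha512.update(chunk)
            sha1.update(chunk)
            md5.update(chunk)
    return sha512.hexdigest(), sha1.hexdigest(), md5.hexdigest()

# Illustrative in-memory indexes; the real system uses database tables.
by_sha512, by_sha1, by_md5 = {}, {}, {}

def index_file(path):
    sha512, sha1, md5 = hash_file(path)
    if sha512 in by_sha512:
        return False  # already known; SHA512 is the canonical index
    # A new SHA512 that matches an existing SHA1 or MD5 would be a real
    # collision in the weaker hash, which is what we test for on every file.
    if sha1 in by_sha1 or md5 in by_md5:
        raise RuntimeError(f"hash collision detected for {path}")
    by_sha512[sha512] = path
    by_sha1[sha1] = path
    by_md5[md5] = path
    return True  # a genuinely new file
```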
In the early days we collected all the files ourselves via a variety of publicly available technologies. As we grew, so did our feeds of files. We now have many partners that we receive malicious files from, and we collect additional binaries simply from running the malware itself. Unfortunately, as we added in the partner feeds we also increased the parallelism of how we imported all the data.
This was a great thing for our backend systems, but it introduced an accounting issue for the binaries themselves. Because we were counting the uniqueness of files within specific time periods, the parallel imports caused us to over-count the files we brought in. The system also had several relief valves in case it got overloaded. Over time, the combination of these two factors inflated our total count by a fair amount.
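One way an over-count like this can creep in is the classic check-then-act race: with imports running in parallel, two workers can both decide that the same hash is "new" for the current period before either has recorded it. The snippet below is a deliberately simplified illustration of that pattern, not our actual importer code:

```python
import threading
from datetime import datetime, timezone

seen_in_period = set()   # (period, sha512) pairs already counted
unique_count = 0         # the statistic that gets published

def import_file(sha512):
    """Racy check-then-act counting: two parallel workers importing the same
    hash can both pass the membership test before either records the key,
    so the same file can be counted twice for the period."""
    global unique_count
    period = datetime.now(timezone.utc).strftime("%Y-%m")
    key = (period, sha512)
    if key not in seen_in_period:
        # ...storing and analyzing the file happens here and takes time...
        seen_in_period.add(key)
        unique_count += 1

# Two workers racing on the same file.
workers = [threading.Thread(target=import_file, args=("deadbeef" * 16,))
           for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```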
In fact, we over-counted by 30 million over the last eight years. We discovered this when we began digging into some new feeds and were particularly interested in the actual unique counts because of how the data was being collected. We really wanted to see the files that were not being seen by any other source, and that is what brought our counting issue to light. The backend system had already reconciled the differences, but our statistics showed a completely different number.
This is because, where possible, we generate as many of our statistics as we can on the fly during import and processing; post-processing anything is a much more painful endeavor. Knowing now where the issues were, we forced a recount of everything we had from the different sources and reconciled the statistics system. We have also changed the counting methodology so that the miscounting cannot occur again.
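In spirit, both the fixed methodology and the recount come down to treating the canonical store as the single source of truth: only bump a statistic when the canonical insert actually adds a new row, and rebuild the statistics from that store when in doubt. The sketch below uses SQLite and an invented two-table schema purely for illustration; it is not our production setup:

```python
import sqlite3

# Illustrative schema only; the production store is not SQLite.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE samples (sha512 TEXT PRIMARY KEY, first_seen TEXT NOT NULL);
    CREATE TABLE stats (period TEXT PRIMARY KEY, unique_count INTEGER NOT NULL);
""")

def record_sample(sha512, period):
    """Count a file only when the canonical insert actually adds a new row."""
    cur = db.execute(
        "INSERT OR IGNORE INTO samples (sha512, first_seen) VALUES (?, ?)",
        (sha512, period),
    )
    if cur.rowcount == 1:  # the insert succeeded, so this hash is genuinely new
        db.execute("INSERT OR IGNORE INTO stats (period, unique_count) VALUES (?, 0)",
                   (period,))
        db.execute("UPDATE stats SET unique_count = unique_count + 1 WHERE period = ?",
                   (period,))

def recount():
    """Rebuild the statistics from the canonical store in one pass."""
    db.execute("DELETE FROM stats")
    db.execute(
        "INSERT INTO stats (period, unique_count) "
        "SELECT first_seen, COUNT(*) FROM samples GROUP BY first_seen"
    )

# The primary key on samples makes the insert the single deciding moment:
# a duplicate, whether from a parallel feed or a replay, changes nothing.
record_sample("deadbeef" * 16, "2014-03")
record_sample("deadbeef" * 16, "2014-03")   # second call has no effect
recount()                                   # or rebuild the stats wholesale
print(db.execute("SELECT * FROM stats").fetchall())
```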
Still Counting Up
So, if you are an avid watcher of our statistics you will have seen a recent drop in our total malware count from 220 million unique files to 190 million. Although if you were not looking closely you might have missed it, since we also regenerated all of our charts to reflect the new counting.
Conclusion
The lesson for us was that we tend to deploy systems and let them run for years while expanding their capabilities as needed. There can always be unintended consequences when so many systems are tied together for multiple purposes, and for us that became evident with what we thought was just simple counting.