I went to an absolutely fascinating talk today at CU. Sunghun Kim gave a talk titled “Predicting Bugs by Analyzing Software History”. The basic premise is you can look at historical unstructured information, including emails, bug reports, check in comments, and if you can identify bugs that were related to that unstructured information, you can use that to find other bugs.
He talked about two different methods to ‘find other bugs’. The first is change classification. Based on a large number of factors, including attributes of the program text, complexity metrics and source control management meta data like time of checkin (don’t check code in on Friday!) and committing developer, he was able to identify whether or not a bug was introduced for a given checkin. (A question was asked about looking at changes at the token level, and he said that would be an interesting place for further research.) This was 94% precise (if the system said a bug was introduced, there was a 94% chance it was) and had 70% recall (it missed 30% of real bugs introduced, but got 70% of them).
They [he collaborates a lot] were able to judge future changes probabilities of introducing a bug by feeding all the attributes I mention above on known bugs to a machine learning program. Kim said there were ways of automating some of the collection of this data, but I can imagine that the quality of bug prediction depends massively on the ability to find which bugs were introduced when, and tie known bugs to those attributes. The numbers I quote above were based on research from a number of open source projects like Apache and Mozilla, and varied quite a bit. Someone asked about the difference of numbers, and he said that the habit of commit activity was a large cause of that variation. Basically, the more targeted commits were (one file instead of five, etc), the higher precision could be attained. Kim also mentioned that this type of change classification is similar to using a GPS for directions around a city–the more unfamiliar you are with the code, the more useful it would be. He mentioned that Apple and Yahoo! were using this system in their software development.
The other interesting concept he talked about was a bug cache. If you’ve developed for any length of time on a given project, you know there are places developers fear to tread. Whether it is complicated logic, fragile interfaces with legacy systems or just plain fugly code, there are sections of any codebase where change is a bit scary. Kim talked about the Windows Server 2003 team maintaining a list of such modules, so that anytime anyone changed something on that list, more review than normal would take place. This list is what he’s trying to repeat in an automated fashion.
If you place files in a cache when they are identified as having a bug, and also place other files that are close in checkin time to that file, you can build a cache of files to closely review. After about 50-100 files for the 200 file Apache project, that cache of 20 files (10%) contained a significant portion of future bugs. Across several open source projects, the range of bugs contained in the cache was 73-95%. He also talked about using this on the method level as opposed to the file level.
In both these cases, machine learning that happens on one project is not very useful for others. (When they did an analysis of the Mozilla codebase and then turned it on the Eclipse codebase, it wasn’t good at finding bugs). Kim speculated that this was due to project and personal coding styles (some people are from Mars, others write buffer overflow bugs), as the Apache 1.3 trained machine was OK at finding bugs in the Apache 2.0 codebase.
Kim talked about several other interesting projects that he has been part of, including the ‘Memory of Similar Bugs’, which found that up to 40% of bugs are recurring patterns, and ReCrash, a probe that monitors an application for crash conditions, and, when it finds one, automatically writes a unit test that can reproduce the crash situation. How cool is that? The cost, of course, is ReCrash imposing high overhead (13-64% increase) as a cost of monitoring.
This was so fascinating to me because anything we can do to automate the bug finding process lets us build better software. There are always data input problems (GIGO still rules), but it seemed like Kim had found some ways around that, especially when the developers were good about comments on checkin. I’m all for spending more time building cool features and better business logic. All in all, a great talk about an interesting topic.