
Finding bugs from software history: a talk at CU I attended

I went to an absolutely fascinating talk today at CU. Sunghun Kim gave a talk titled “Predicting Bugs by Analyzing Software History”. The basic premise is that you can mine historical unstructured information (emails, bug reports, checkin comments), and if you can identify the bugs related to that information, you can use it to find other bugs.

He talked about two different methods to ‘find other bugs’. The first is change classification. Based on a large number of factors, including attributes of the program text, complexity metrics and source control management metadata like time of checkin (don’t check code in on Friday!) and committing developer, he was able to identify whether or not a given checkin introduced a bug. (A question was asked about looking at changes at the token level, and he said that would be an interesting place for further research.) The classification was 94% precise (if the system said a bug was introduced, there was a 94% chance it was) and had 70% recall (it caught 70% of the bugs actually introduced and missed the other 30%).

They [he collaborates a lot] were able to judge the probability that a future change would introduce a bug by feeding all the attributes I mention above, for known bugs, into a machine learning program. Kim said there were ways of automating some of the collection of this data, but I can imagine that the quality of bug prediction depends massively on the ability to find out which bugs were introduced when, and to tie known bugs to those attributes. The numbers I quote above were based on research from a number of open source projects like Apache and Mozilla, and varied quite a bit. Someone asked about the variation, and he said that commit habits were a large cause of it. Basically, the more targeted commits were (one file instead of five, etc.), the higher the precision that could be attained. Kim also mentioned that this type of change classification is similar to using a GPS for directions around a city–the more unfamiliar you are with the code, the more useful it is. He mentioned that Apple and Yahoo! were using this system in their software development.

The other interesting concept he talked about was a bug cache. If you’ve developed for any length of time on a given project, you know there are places developers fear to tread. Whether it is complicated logic, fragile interfaces with legacy systems or just plain fugly code, there are sections of any codebase where change is a bit scary. Kim talked about the Windows Server 2003 team maintaining a list of such modules, so that anytime anyone changed something on that list, more review than normal would take place. This list is what he’s trying to replicate in an automated fashion.

If you place files in a cache when they are identified as having a bug, and also place files that are close in checkin time to the buggy file, you build a cache of files to review closely. After about 50-100 files had been added for the roughly 200-file Apache project, a cache of 20 files (10% of the project) contained a significant portion of future bugs. Across several open source projects, the proportion of future bugs contained in the cache ranged from 73% to 95%. He also talked about using this at the method level as opposed to the file level.
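To make the idea concrete, here’s a rough, hypothetical sketch of a file-level bug cache. The class and method names are mine, the 10% sizing and least-recently-added eviction are simplifications, and it only approximates the ‘close in checkin time’ rule described above, not Kim’s actual algorithm:

import java.util.LinkedHashSet;
import java.util.List;

public class BugCache {
    private final int maxSize;
    private final LinkedHashSet<String> files = new LinkedHashSet<String>();

    public BugCache(int projectFileCount) {
        this.maxSize = Math.max(1, projectFileCount / 10); // cache ~10% of the project
    }

    /** Called when a commit is identified as fixing a bug in buggyFile. */
    public void recordBugFix(String buggyFile, List<String> filesCommittedNearby) {
        add(buggyFile);
        for (String nearby : filesCommittedNearby) {
            add(nearby); // temporal locality: files committed close in time
        }
    }

    /** Files in the cache deserve extra review when they are changed. */
    public boolean needsExtraReview(String file) {
        return files.contains(file);
    }

    private void add(String file) {
        files.remove(file);          // re-adding moves the file to the newest end
        files.add(file);
        if (files.size() > maxSize) {
            String oldest = files.iterator().next(); // evict the oldest entry
            files.remove(oldest);
        }
    }
}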

In both these cases, machine learning done on one project is not very useful for others. (When they trained on the Mozilla codebase and then turned the result on the Eclipse codebase, it wasn’t good at finding bugs.) Kim speculated that this was due to project and personal coding styles (some people are from Mars, others write buffer overflow bugs), since a machine trained on Apache 1.3 was OK at finding bugs in the Apache 2.0 codebase.

Kim talked about several other interesting projects that he has been part of, including the ‘Memory of Similar Bugs’, which found that up to 40% of bugs are recurring patterns, and ReCrash, a probe that monitors an application for crash conditions and, when it finds one, automatically writes a unit test that can reproduce the crash. How cool is that? The cost, of course, is that ReCrash imposes significant monitoring overhead (a 13-64% increase).

This was so fascinating to me because anything we can do to automate the bug finding process lets us build better software. There are always data input problems (GIGO still rules), but it seemed like Kim had found some ways around that, especially when the developers were good about comments on checkin. I’m all for spending more time building cool features and better business logic. All in all, a great talk about an interesting topic.

[tags]the bug in the machine, commit early and often[/tags]

Squid Notes: A fine web accelerator

I recently placed squid in front of an Apache/Tomcat based web application to serve as a web accelerator. We could have used Apache’s mod_proxy, but squid has the ability to federate, and that was considered valuable for future growth. (Plus, Wikipedia uses squid, and it has worked out pretty well for them so far.) I didn’t find a whole lot of other options–Varnish looks good, but wasn’t quite rich enough in documentation and features.

However, when the application generates a page for a user who is logged in, the content can differ from what a robot or a user who is not signed in sees at the exact same URL. It’s easy to tell whether a user is signed in, because they send cookies. What was not intuitive was how to tell Squid that pages for logged in users (matching a certain header, or a certain URL pattern) should always be passed through to Tomcat. In fact, I asked about this on the mailing list, and it doesn’t seem to be possible at all. Squid caches objects at the page level, and can’t cache just pieces of a page (as I believe OSCache, among others, can).

I compromised by deleting the cached object (a page, for example) whenever a logged in user visits it. This forces squid to go back to the origin server, guaranteeing that the logged in user gets the correct version. Then, if a non logged in user requests the page, squid again goes back to the origin server (since it doesn’t have anything in its cache). If a second non logged in user requests the same page, squid serves it out of cache. It’s not the best solution, but it works. And non logged in users are such a high proportion of the traffic that squid is able to serve a fair number of pages without touching the application.
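One straightforward way to do the deletion (not necessarily how it was wired up here) is to have the application send Squid an HTTP PURGE request whenever it serves a page to a logged in user; Squid has to be configured with an acl allowing the PURGE method from the application host. A rough Java sketch:

import java.io.OutputStream;
import java.net.Socket;

public class SquidPurge {
    /** Asks Squid (assumed to be listening on localhost:80) to evict a cached page. */
    public static void purge(String host, String path) throws Exception {
        // HttpURLConnection rejects non-standard methods like PURGE,
        // so write the request over a plain socket.
        Socket socket = new Socket("localhost", 80);
        try {
            OutputStream out = socket.getOutputStream();
            String request = "PURGE " + path + " HTTP/1.0\r\n"
                    + "Host: " + host + "\r\n"
                    + "\r\n";
            out.write(request.getBytes("US-ASCII"));
            out.flush();
            // Squid replies 200 if it removed the object, 404 if it wasn't cached;
            // either is fine for our purposes, so the response is ignored.
        } finally {
            socket.close();
        }
    }
}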

Overall I found Squid to be pretty good–even with the above workaround it was able to take a substantial amount of traffic off the main application. Note that Squid was designed to be a forward proxy (for example, a proxy at an ISP caching commonly requested pages for that ISP’s users), so if you want to use it as a web accelerator (in front of your website, to speed up the pages you serve), you have to ignore a lot of the documentation. The FAQ is a must read, especially the section on reverse proxying and the logs section.

[tags]proxy, squid, increasing web application performance[/tags]

GIS Geeks Unite

Or at least come to Boulder and have a beer.  The FRUGOS (Front Range Users Group of GIS Open Source?) folks are meeting at the Boulder Beer Company tomorrow night.  More info.  I’m thinking of attending, but the Boulder Denver New Tech Meetup is happening at the same time in a different place, so I’m torn.

My experience at the MySQL Performance Coding Webinar

Last Tuesday, I attended a MySQL webinar. I registered on the MySQL website (with a site-specific email address, of course) and periodically am invited to these webinars. I’d tried to attend in the past, but something else (usually billable) always interfered. Not this time!

The talk was titled “Performance Coding for MySQL” and the presenter, Jay Pipes, did a fantastic job. (He is also the co-author of Pro MySQL.) The slides from the presentation are up, and he also answered questions sent to him during the presentation in some detail. His presentation, about an hour in length, covered both basics (normalize first (slide 4); think in sets rather than iterators (slides 20-23)–basic, but not intuitive) and under-the-hood intricacies (think about the size of your primary keys and consider record size (slide 6); avoid deletes with MyISAM (slide 27); use vertical partitioning to take advantage of the query cache (slides 9-11)). He also pointed to a script that he wrote to find useless indices.

Well worth my time. Thanks MySQL and Jay, for making a resource like this available and free! There’s a whole lot more, so I’d recommend downloading the slides and giving them a run through, if you interact with MySQL as a DBA or a developer (or, as is often the case, both).

(On that note, I’d like to recommend the MySQL DBA blog for your perusal–apparently recently renamed the ‘Senior MySQL DBA’ blog, heh.)

[tags]mysql dba, think in sets, webinar[/tags]

Author of ‘Ask the Headhunter’ now blogging

The author of my favorite jobs/careers newsletter, Ask The Headhunter, is now blogging. In his first post, Nick says he’ll consider some of the following questions:

Why does HR dump jobs into the Monster pit? How can managers recruit without resumes? Do all headhunters really suck? How can you get past the Top 10 Stupid Interview Questions? Is there a way to get a raise without begging?

And he starts right off with an explanation of why job hunters shouldn’t focus on the macro economy job stats.

If you haven’t checked out his site, it’s well worth a read as well, especially the articles. Of course, he writes with a bias (he is a headhunter, after all), but I find his tone humorous and his advice accurate (where I’ve applied it).

IP Crash Course For Entrepreneurs

This past Wednesday, I went to an interesting talk sponsored by Silicon Flatirons (an organization worth knowing about). Jason Haislmaier gave the talk, and the subject was intellectual property (IP); it was titled ‘Intellectual Property “Crash Course” for Entrepreneurs’ and was packed! I got there 10 minutes late (parking on the CU campus is no fun at all) and sat in the back on a heater. Good thing the fire department didn’t come by, as I’m sure we were over capacity. (Incidentally, I heard about this via the Boulder Denver New Tech Meetup mailing list but it was also on the Colorado Startups Events calendar.)

Jason said the presentation and possibly a recording of it would be available, but I was unable to find it by looking around his blog or the Silicon Flatirons site. I took some notes, but his presentation, if and when it becomes available, will be a great introduction to what entrepreneurs need to know about IP. (Note that all mistakes herein are mine, and I am most definitely not a lawyer. Consult your friendly attorney for serious advice. I marked things I thought I remembered with a ‘?’.)

There are 4 kinds of IP: patents, which cover ideas or inventions; trademarks, which are about branding; copyright, which deals with creative expression; and trade secrets, which cover know-how. The overall emphasis of his talk was that you may not need protection under any of these forms of property, but that you, as an entrepreneur, should be aware of all of them and make a conscious choice to pursue or not pursue them. Which makes a lot of sense to me! (Incidentally, he repeatedly mentioned that the US differs from the rest of the world on IP in a lot of ways, so if you plan to do business internationally, you should definitely think about that sooner rather than later.)

Trade secrets are pretty much anything–data, methods, software, etc. Their protection depends on keeping them secret. Jason was working with a $10-20 million company that had only one patent; its valuation was almost entirely based on trade secrets. NDAs and employment contracts are the front line of trade secret protection. He emphasized that you need to read NDAs and think about how they affect you and your relationship with the NDA signer. In particular, you can’t expect a signer to forget everything they’ve learned after a relationship ends, but you can expect them to return all tangible forms of the information. NDAs should have remedies (injunctions). If the other side won’t sign an NDA, that’s fine; just don’t tell them anything you wouldn’t want to see posted on the Internet.

Copyright protects an original work of authorship in a tangible form from which the work can be perceived; it does not protect ideas. Apparently, there was a famous case (Feist) which basically outlined the limits of copyright–anything more creative than the White Pages qualifies for copyright protection. There are five rights, which I didn’t note because I thought the presentation would be up. Copyright can be unregistered (just about anything–these notes and this blog post are unregistered copyright) or registered. Registering costs something, but means you can sue folks. Under the DMCA, the copyright owner no longer has to show infringement–the possibility of infringement is enough (?). There are safe harbors though, one of which is the service provider harbor (?). You have to register with the Library of Congress and take things down when notified, but if you are providing any service with user generated content, you should pursue this safe harbor.

Trademarks (or service marks) are about branding. They’re easier to file for than patents. Use in commerce generates rights. He had a great slide showing the protection levels of trademarks, from the fanciful (Kodak, Exxon) to the arbitrary (Apple) to the suggestive to the descriptive (World Poker Tour) to the generic (aspirin, escalator). The more a trademark describes what it represents, the less protectable it is, and trademarks can be lost (as escalator was).

Patents–the big one! A patent is the right to exclude others from making, using and selling a new, useful and non-obvious invention. There are a number of reasons to patent: defensive, offensive, ego, and as a source of revenue (a secondary market is developing for patents). Offensive patents are getting riskier recently (courts are narrowing the scope of patent infringement). But investors are starting to ask why patents weren’t filed, and “we didn’t think to do so” is a poor answer. The answer to the question “Is it patentable?” for almost any value of “it” is yes, but you need to think about why–the better question is “How relevant and valuable will a patent be for the business?”. Lack of knowledge or independent development is not a defense against patent infringement.

All in all, it was a lot of ground to cover. Jason did a good job making things very applicable to the audience he was talking to. It kinda sucks that you have to think about such things, when all you want to do is develop killer software. (Brian made an offhand comment about patents and long running servlets almost 4 years ago, incidentally.) As Jason said in closing, if you don’t have an intellectual property strategy, your competitors will give you one (and, I inferred, you probably won’t like that one very much).

GWT Talk at the Boulder Denver New Tech Meetup tonight

I presented on the Google Web Toolkit at the Boulder Denver New Tech Meetup tonight (presentation and useful links). It was a rush, as presentations always are. However, the adrenaline was compounded by two factors: the length of the presentation and the composition of the audience.

Presenting something as large in scope as GWT in 5 minutes was difficult. Though I’d been to 3 previous meetups, I didn’t have a good feel for the technical knowledge of the audience, so I aimed to keep the presentation high level. (The audience, on this particular night, was about a 50/50 split between coders and non-coders, as determined by a show of hands. However, almost everyone knew the acronym AJAX and what it meant.) That mixed background compounded the difficulty, but I still feel I got across some of the benefits of GWT.

I’ll be writing more about what I learned about GWT in preparing for this, but I wanted to answer 3 questions posed to me that I didn’t have off the cuff answers for tonight.

1. Who is using GWT?

I looked and couldn’t find a good list. This list is the best I could do, along with this GWT Groups post. I find it rather astonishing that there’s not a better list out there, as the above list was missing some big ones (Timepedia’s Chronoscope, the Lombardi Blueprint system) as well as my own client: Colorado HomeFinder.

2. How much time does the compilation process add?

I guessed on this tonight, but guessed too high. I said it was on the order of 30 seconds to a minute. On my laptop (a 2 CPU, 2 GHz, 2 GB RAM box), GWT compilation takes ~7 seconds to build incrementally (from ant, which appears to add ~2 seconds to all of these numbers) and ~21 seconds to build after all classes and artifacts have been deleted. This is for 7400 lines of code.

3. How does GWT compare to other frameworks like Dojo and YUI?

I punted on this one when perhaps I should not have. From what I can tell, GWT attacks adding dynamic behavior to web pages in a fundamentally different way. Dojo and YUI (from what I know of them) are about adding behavior to existing widgets on a page. GWT is about adding objects to a page, which may or may not be attached to existing widgets. I’ll not say more, as I don’t have the experience with other toolkits to speak authoritatively.
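To illustrate that ‘adding objects to a page’ style, here’s a tiny GWT 1.4-era sketch (the class name and the refresh-slot element id are made up; the host page is assumed to contain an element with that id):

import com.google.gwt.core.client.EntryPoint;
import com.google.gwt.user.client.Window;
import com.google.gwt.user.client.ui.Button;
import com.google.gwt.user.client.ui.ClickListener;
import com.google.gwt.user.client.ui.RootPanel;
import com.google.gwt.user.client.ui.Widget;

public class RefreshEntryPoint implements EntryPoint {
    public void onModuleLoad() {
        // The Button is a Java object, created and wired up entirely in Java...
        Button refresh = new Button("Refresh");
        refresh.addClickListener(new ClickListener() {
            public void onClick(Widget sender) {
                Window.alert("refreshing...");
            }
        });
        // ...and then attached to an existing element in the host HTML page.
        RootPanel.get("refresh-slot").add(refresh);
    }
}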

Also, here’s an AJAX toolkit comparison that I found.

[tags]gwt presentations, unanswered questions[/tags]

BatchUpdateException using Hibernate and MySQL5

I ran into a crazy error last week. One of my clients was upgrading from MySQL4 to MySQL5. The application in question was using Hibernate 3.2. Here’s the table structure, and the hibernate bean definition. See if you can spot the issue:

mysql> desc stat;
+--------------+-------------+------+-----+---------------------+-------+
| Field        | Type        | Null | Key | Default             | Extra |
+--------------+-------------+------+-----+---------------------+-------+
| stat_date    | date        |      | PRI | 0000-00-00          |       |
| stat_type    | varchar(50) |      | PRI |                     |       |
| stat_count   | int(11)     | YES  |     | NULL                |       |
| last_updated | datetime    |      |     | 0000-00-00 00:00:00 |       |
+--------------+-------------+------+-----+---------------------+-------+
4 rows in set (0.00 sec)

<class name="com.foo.common.data.Statistic" table="stat" lazy="false">
<cache usage="read-write"/>
<composite-id name="statisticId" class="com.foo.common.data.StatisticId">
<key-property name="date" type="java.util.Date" column="stat_date"/>
<key-property name="type" column="stat_type"/>
</composite-id>
<property name="count" column="stat_count"/>
<property name="lastUpdated" type="java.util.Date" column="last_updated" />
</class>

The exception stack trace I was seeing was something like this:

2007-12-31 14:15:09,888 ERROR [Thread-14] def.AbstractFlushingEventListener (AbstractFlushingEventListener.java:301)

- Could not synchronize database state with session
org.hibernate.exception.ConstraintViolationException: Could not execute JDBC batch update
at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:71)
at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)
at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:249)
at org.hibernate.engine.ActionQueue.executeActions(ActionQueue.java:235)
at org.hibernate.engine.ActionQueue.executeActions(ActionQueue.java:139)
at org.hibernate.event.def.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:298)
at org.hibernate.event.def.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:27)
....
Caused by: java.sql.BatchUpdateException: Duplicate key or integrity constraint
violation message from server: "Duplicate entry '2007-12-31-stattype' for key 1"
at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1492)
at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeBatch(NewProxyPreparedStatement.java:1723)
at org.hibernate.jdbc.BatchingBatcher.doExecuteBatch(BatchingBatcher.java:48)
at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:242)
... 57 more

I ended up turning on MySQL logging (the log setting in the my.ini file, which logs every SQL statement the server executes) to see what was happening.

Basically, I was looking to see if an entry in the stat table existed; if it did, increment and update it, and if it did not, insert it. The insert was always happening, which meant the lookup never found the entry--even though the entry clearly existed, since MySQL threw the 'integrity constraint' exception on the insert.

The cause of the issue was the DATE type of the stat_date column and the fact that I had incorrectly mapped it to java.util.Date. It really should have been mapped to java.sql.Date. How this worked in MySQL4 is beyond me, but it did. Changing the hibernate dialect to mysql5 had no impact.
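For illustration, here’s a rough sketch of the find-then-increment-or-insert logic with the corrected java.sql.Date key. The post doesn’t show the actual bean classes, so the constructors and accessors below are assumed, and session/transaction handling is omitted:

import java.sql.Date;

import org.hibernate.Session;

import com.foo.common.data.Statistic;
import com.foo.common.data.StatisticId;

public class StatUpdater {
    public void incrementStat(Session session, String type) {
        // java.sql.Date is persisted as a SQL DATE, so only the date portion
        // is compared against the stat_date column (no stray time component)
        Date today = new Date(System.currentTimeMillis());
        StatisticId id = new StatisticId(today, type);          // assumed constructor
        Statistic stat = (Statistic) session.get(Statistic.class, id);
        if (stat == null) {
            session.save(new Statistic(id, 1));                 // assumed constructor
        } else {
            stat.setCount(stat.getCount() + 1);                 // assumed accessors
        }
    }
}

The corresponding change in the mapping above is the type attribute on the stat_date key-property: java.sql.Date instead of java.util.Date.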

[tags]mysql upgrade,hibernate[/tags]

GWT Mini Pattern: Cache ‘static’ data

Using GWT RPC’s Java object marshalling makes remote calls from GWT to a Java server quick and easy. However, each of those calls introduces performance and complexity issues. Performance, because browsers tend to limit the number of simultaneous network connections per server. From the HTTP 1.1 RFC: “A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy.” This means that if you have more than two components making network calls, the calls will queue up (and that ignores image downloads and other network connections). The complexity arises from the non-linear nature of asynchronous network calls. This can lead to a program structure that is not easily understood without careful reading (‘first, we find the user, then we branch, and if we have a valid user, we find their saved bookmarks from the server…’).

If you have a fair amount of content that doesn’t change often and doesn’t vary from user to user, use a RequestBuilder to retrieve the data from the server. Whatever software generates the data should set headers so that the page is cached for a fair while (either via the Cache-Control header or the Expires header). The first time the request is made, the browser makes a real network request–all the way to the server. After that, the browser will return the resource from its cache, using fewer network resources and returning the data more quickly. Using this approach also makes the data available across components and pages. I almost always use JSON to represent such rarely changing data, because GWT can parse it easily, though you could use some other data format as well.
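Here’s a minimal sketch of the pattern. The /data/categories.json URL and the class name are made up, the server is assumed to send a far-future Expires or Cache-Control header with that resource, and the module is assumed to inherit com.google.gwt.http.HTTP and com.google.gwt.json.JSON:

import com.google.gwt.http.client.Request;
import com.google.gwt.http.client.RequestBuilder;
import com.google.gwt.http.client.RequestCallback;
import com.google.gwt.http.client.RequestException;
import com.google.gwt.http.client.Response;
import com.google.gwt.json.client.JSONArray;
import com.google.gwt.json.client.JSONParser;

public class StaticDataLoader {
    public void loadCategories() {
        RequestBuilder builder =
                new RequestBuilder(RequestBuilder.GET, "/data/categories.json");
        try {
            builder.sendRequest(null, new RequestCallback() {
                public void onResponseReceived(Request request, Response response) {
                    // After the first call, the browser typically answers this
                    // request from its cache, so no network round trip is made.
                    JSONArray categories = JSONParser.parse(response.getText()).isArray();
                    // ... hand categories to whatever components need the data ...
                }

                public void onError(Request request, Throwable exception) {
                    // handle the failed request
                }
            });
        } catch (RequestException e) {
            // handle the failure to even create the request
        }
    }
}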

Using the browser as your caching system is not free: you pay for creating the network call (XMLHttpRequest, etc.) and for re-parsing the JSON. But you aren’t going over the network, and it’s relatively transparent to all the clients. The alternative, creating your own caching system, is often more daunting–especially if you want to share data across page requests. I’m not aware of any caching systems (such as ehcache) that are GWT compatible at the moment (and they’d have their own costs in terms of javascript download size). Google Gears does provide something compatible, but it is not transparent to the user (Gears requires installation).

[tags]caching, gwt, leverage the browser[/tags]