On writing a book

I saw a great post on the BJUG mailing list about the nuts and bolts of writing a book. Well worth a read.

I sympathize in particular with his ‘writing schedule’ comment. I started to put together a book with some friends, and it was hard to keep things moving. We ended up shelving the book and placing the content on a blog (which has proven hard to keep updated as well). The book was about software contracting.

[tags]authorship,inertia[/tags]

Finding bugs from software history: a talk at CU I attended

I went to an absolutely fascinating talk today at CU. Sunghun Kim gave a talk titled “Predicting Bugs by Analyzing Software History”. The basic premise is that you can mine historical unstructured information–emails, bug reports, check-in comments–and, if you can identify the bugs related to that information, use it to find other bugs.

He talked about two different methods to ‘find other bugs’. The first is change classification. Based on a large number of factors–attributes of the program text, complexity metrics, and source control metadata like time of checkin (don’t check code in on Friday!) and committing developer–he was able to identify whether a given checkin introduced a bug. (A question was asked about looking at changes at the token level, and he said that would be an interesting place for further research.) The system achieved 94% precision (if it said a bug was introduced, there was a 94% chance it was) and 70% recall (it caught 70% of real bugs introduced and missed the other 30%).
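
For the record, precision and recall are simple ratios over the classifier’s hits and misses; here’s a quick sketch in Java with illustrative counts (not Kim’s actual data):

    public class ClassifierMetrics {
        public static void main(String[] args) {
            // Illustrative counts only, not data from the talk:
            int truePositives = 70;  // buggy checkins correctly flagged
            int falsePositives = 4;  // clean checkins incorrectly flagged
            int falseNegatives = 30; // buggy checkins the system missed

            // Precision: of the checkins flagged, how many really introduced a bug?
            double precision = (double) truePositives / (truePositives + falsePositives);
            // Recall: of the checkins that introduced a bug, how many were flagged?
            double recall = (double) truePositives / (truePositives + falseNegatives);

            // Prints precision=0.95 recall=0.70, roughly the numbers quoted above.
            System.out.printf("precision=%.2f recall=%.2f%n", precision, recall);
        }
    }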

They [he collaborates a lot] were able to estimate the probability that a future change would introduce a bug by feeding all the attributes mentioned above, for known bugs, to a machine learning program. Kim said there were ways of automating some of the data collection, but I can imagine that the quality of bug prediction depends massively on the ability to determine which bugs were introduced when, and to tie known bugs to those attributes. The numbers I quote above were based on research from a number of open source projects like Apache and Mozilla, and varied quite a bit. Someone asked about the variation, and he said that commit habits were a large cause of it: basically, the more targeted the commits (one file instead of five, etc.), the higher the precision that could be attained. Kim also compared this type of change classification to using a GPS for directions around a city–the more unfamiliar you are with the code, the more useful it is. He mentioned that Apple and Yahoo! were using this system in their software development.

The other interesting concept he talked about was a bug cache. If you’ve developed for any length of time on a given project, you know there are places developers fear to tread. Whether it is complicated logic, fragile interfaces with legacy systems or just plain fugly code, there are sections of any codebase where change is a bit scary. Kim talked about the Windows Server 2003 team maintaining a list of such modules, so that any time anyone changed something on the list, more review than normal would take place. This list is what he’s trying to replicate in an automated fashion.

If you place files in a cache when they are identified as having a bug, and also cache other files that are close in checkin time to those files, you build a cache of files to review closely. For the 200-file Apache project, after about 50-100 files had been seen, the 20-file cache (10% of the project) contained a significant portion of future bugs. Across several open source projects, the share of future bugs contained in the cache ranged from 73% to 95%. He also talked about applying this at the method level rather than the file level.
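
A minimal sketch of how such a cache might work (my own guess at the mechanics; Kim didn’t spell out the eviction policy): when a bug fix touches a file, cache that file plus the files committed near it in time, and evict the least recently touched entries once the cache exceeds its size limit.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    /** Sketch of a fixed-size bug cache keyed by file path, evicting
     *  the least recently touched files first. */
    public class BugCache {
        private final Map<String, Boolean> cache;

        public BugCache(final int capacity) { // e.g. 10% of the project's files
            // accessOrder=true gives least-recently-used iteration order;
            // removeEldestEntry evicts once we exceed capacity.
            this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > capacity;
                }
            };
        }

        /** Called when a bug fix touches fixedFile; coCommitted holds files
         *  checked in close in time to the fix. */
        public void recordBugFix(String fixedFile, List<String> coCommitted) {
            cache.put(fixedFile, Boolean.TRUE);
            for (String file : coCommitted) {
                cache.put(file, Boolean.TRUE);
            }
        }

        /** True if the file is currently predicted to deserve extra review. */
        public boolean needsExtraReview(String file) {
            return cache.containsKey(file);
        }
    }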

In both these cases, machine learning done on one project is not very useful for others. (When they trained on the Mozilla codebase and then turned the system loose on the Eclipse codebase, it wasn’t good at finding bugs.) Kim speculated that this was due to project and personal coding styles (some people are from Mars, others write buffer overflow bugs), since the machine trained on Apache 1.3 was reasonably good at finding bugs in the Apache 2.0 codebase.

Kim talked about several other interesting projects that he has been part of, including the ‘Memory of Similar Bugs’, which found that up to 40% of bugs are recurring patterns, and ReCrash, a probe that monitors an application for crash conditions and, when it finds one, automatically writes a unit test that can reproduce the crash. How cool is that? The cost, of course, is that ReCrash imposes significant monitoring overhead (a 13-64% increase).
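
Just to make the ReCrash idea concrete, here’s a toy sketch of the flavor of thing it does; the real tool instruments methods to capture receiver and argument state, which this stub doesn’t even attempt:

    import java.io.PrintWriter;
    import java.io.StringWriter;

    /** Toy illustration of the ReCrash idea: on an uncaught exception,
     *  emit the skeleton of a unit test for the crashing method. */
    public class CrashToTestStub {
        public static void install() {
            Thread.setDefaultUncaughtExceptionHandler((thread, ex) -> {
                StackTraceElement top = ex.getStackTrace()[0];
                StringWriter trace = new StringWriter();
                ex.printStackTrace(new PrintWriter(trace));
                System.out.println("// Generated skeleton; fill in receiver/arguments:");
                System.out.println("public void reproducesCrashIn_"
                    + top.getMethodName() + "() {");
                for (String line : trace.toString().split("\n")) {
                    System.out.println("    // " + line.trim());
                }
                System.out.println("}");
            });
        }
    }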

This was so fascinating to me because anything we can do to automate the bug-finding process lets us build better software. There are always data input problems (GIGO still rules), but it seemed like Kim had found some ways around that, especially when the developers were good about checkin comments. I’m all for spending more time building cool features and better business logic. All in all, a great talk about an interesting topic.

[tags]the bug in the machine, commit early and often[/tags]

My experience at the MySQL Performance Coding Webinar

Last Tuesday, I attended a MySQL webinar. I registered on the MySQL website (with a site-specific email address, of course) and am periodically invited to these webinars. I’d tried to attend in the past, but something else (usually billable) always interfered. Not this time!

The talk was titled “Performance Coding for MySQL” and the presenter, Jay Pipes, did a fantastic job. (He is also the co-author of Pro MySQL.) The slides from the presentation are up, and he also answered questions sent to him during the presentation in some detail. His presentation, about an hour long, covered both basics–normalize first (slide 4); think in sets rather than iterators (slides 20-23); basic, but not intuitive–and under-the-hood intricacies, like thinking about the size of your primary keys and overall record size (slide 6), avoiding deletes with MyISAM (slide 27), and vertical partitioning to take advantage of the query cache (slides 9-11). He also pointed to a script that he wrote to find useless indices.
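
To make the ‘sets rather than iterators’ point concrete, here’s a sketch in Java against a hypothetical orders table (my example, not Jay’s); the two blocks below are alternatives, and the set-based one replaces N+1 round trips with one:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class SetsNotIterators {
        public static void main(String[] args) throws SQLException {
            // Assumes the MySQL JDBC driver is on the classpath.
            Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "password");

            // Iterative approach: one SELECT plus one UPDATE per row.
            PreparedStatement select = conn.prepareStatement(
                "SELECT id FROM orders WHERE status = 'PENDING'");
            PreparedStatement update = conn.prepareStatement(
                "UPDATE orders SET status = 'STALE' WHERE id = ?");
            ResultSet rs = select.executeQuery();
            while (rs.next()) {
                update.setLong(1, rs.getLong("id"));
                update.executeUpdate(); // a round trip for every row
            }

            // Set-based alternative: the same work in a single statement,
            // one round trip, and the optimizer can use an index on status.
            PreparedStatement setBased = conn.prepareStatement(
                "UPDATE orders SET status = 'STALE' WHERE status = 'PENDING'");
            setBased.executeUpdate();

            conn.close();
        }
    }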

Well worth my time. Thanks MySQL and Jay, for making a resource like this available and free! There’s a whole lot more, so I’d recommend downloading the slides and giving them a run through, if you interact with MySQL as a DBA or a developer (or, as is often the case, both).

(On that note, I’d like to recommend the MySQL DBA blog for your perusal–apparently recently renamed the ‘Senior MySQL DBA’ blog, heh.)

[tags]mysql dba, think in sets, webinar[/tags]

Using ‘tasks’ in Eclipse

Update, 11/13/2009:  If you are looking for help with tasks using Mylyn (integrated into later versions of Eclipse), you don’t want this post.  Instead, you’ll want to read and watch the resources here.  This post is all about simple text-based code markers, not Mylyn’s implementation.

In Eclipse (version 3.2–one revision behind the current latest), I use a feature called ‘Tasks’. Using this feature, I can put a tag like ‘XXX’ anywhere in a file managed by Eclipse and write a note to myself. It’s very handy: when I’m developing, I’ll often think of a problem or situation that I’d like my code to handle, but not have time to deal with it just then. I could add it to a bug tracker, or an Excel spreadsheet, or write it down, but I find adding it to the source code works just as well. Then I can use the ‘Tasks’ view in Eclipse to gather all of these notes at a later time and deal with them one by one. I add a task using this type of comment (for Java; in an XML file, I use an XML comment):

//XXX need to revisit this class.
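
In an XML file, the equivalent comment (same tag, XML comment syntax) looks like this:

<!-- XXX need to revisit this element. -->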

To show the Tasks view in the Java perspective, go to the Java perspective, choose ‘Window’ from the menubar, then ‘Show View’, then ‘Tasks’. According to the help, Tasks are usually only shown in the ‘Resources’ perspective.

My one gripe with tasks is that they seem easy to create, but darn hard to delete. You can add new task tags via ‘Window’ / ‘Preferences’ / ‘Web and XML’ / ‘Task Tags’, and use the ‘clean and redetect task tags’ button. This does appear to pick up new tasks (or tasks marked with tags that you’ve added), but doesn’t seem to remove tasks that are no longer marked in the source code (whether because you removed them from the source code, or because you removed that task tag [TODO, say] from the list of task tags).

If you add a task via the mouse, you can remove it by right-clicking on its checkbox. However, that doesn’t remove the task comment from the source file. And if you add a task via a comment, you cannot mark it done in the ‘Tasks’ view.

What I’d like is some synchronization between the source file and the view. If I add a task in the source file, it should show up in the ‘Tasks’ view. If I then delete that line from the source file, the view should reflect that. If I mark a task as completed in the view and then choose ‘Delete Completed tasks’ from the context menu, the corresponding line in the source file should be removed as well.

Am I missing something here? Am I using tasks incorrectly?

I looked through the Eclipse help, on Google, and in the Eclipse newsgroups, but did not find anything that helped. There’s no mention of this issue in the release notes for version 3.3, and a quick scan of the bug list didn’t turn up anything that applied to what I want to do.

I’ll probably file a bug sometime soon, but should really review all the entered bugs to see if someone else has my issue. In the meantime, I’ll just bleat a bit here.

[tags]eclipse,tasks view[/tags]

The Ant jar task and duplicate files can cause bizarre behavior and missing/incorrect files when unzipping

I just ran into some bizarre behavior. I’m building a web application on Windows XP SP2, using Ant 1.6.1. The application worked fine. Then I added one more copy instruction, to move a GWT component I’d just added to the appropriate place in the war file. Suddenly, the application ceased to work.

After a few hours of tracking down the issue, I found that it didn’t matter whether it was the new GWT component or an old one; it didn’t matter whether the copy went to the same directory or a new one–simply adding the new set of files caused the issue. Then I noticed that the unzipped version of the application differed; in particular, the configuration files differed. That explained why the application wasn’t working–it wasn’t configured correctly.

But, why were the configuration files different?

I examined the generated jar file. When I unjarred it with the jar command, the configuration files were correct; the Ant script, however, was using the unzip task. I made sure the jar file was copied correctly. I made sure the old directory was deleted, and that the Ant unzip task would overwrite existing files. Still no fix–I was seeing the incorrect configuration files.

Then, this part of the jar task documentation jumped out at me:

Please note that the zip format allows multiple files of the same fully-qualified name to exist within a single archive. This has been documented as causing various problems for unsuspecting users. If you wish to avoid this behavior you must set the duplicate attribute to a value other than its default, “add”.

The other possible values listed for the jar task’s duplicate attribute are “fail” and “preserve”, though the documentation doesn’t explain what they actually do. “fail” causes the jar task to fail when duplicate files are encountered; this seems like sane default behavior, and I’m not sure why it isn’t the default. “preserve” keeps the first file added and doesn’t add duplicates, but doesn’t tell you that duplicates exist.

Update, 2:09:   “preserve” does tell you that duplicates exist, in this form: WEB-INF/web.xml already added, skipping

For a variety of reasons, I had a jar task that was adding two sets of the configuration files, with the same names and paths, to the war file. Something about adding a few extra files seemed to flip a switch and change existing behavior: whereas before the unzip task had picked the correct configuration file, now it picked the incorrect one. I don’t know more than that, because I didn’t dig down into the source.

The answer was to move the correct files to the top of the jar task and set the “duplicate” attribute to “preserve”.
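
For illustration, the fixed jar task might look something like this (directory names invented for the example); with duplicate set to “preserve”, the first fileset listed wins any name collision:

    <jar destfile="dist/myapp.war" duplicate="preserve">
        <!-- Listed first, so these correct configuration files win
             any collision on fully-qualified names. -->
        <fileset dir="build/config"/>
        <!-- Files here with the same names are skipped, with an
             "already added, skipping" warning. -->
        <fileset dir="build/webapp"/>
    </jar>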

I hope this post saves someone a few hours of banging their head.

[tags]ant, jar files, duplicate files, headbanging[/tags]

FogBugz world tour, Boulder edition

I went and saw Joel Spolsky’s talk about FogBugz6 tonight. It seems to be quite a powerful software development tool. But I’m afraid it suffers like every tool–it forces you into certain methods of development. For example, there’s no way to ensure that every bug entered is viewed by QA. That isn’t a problem for the teams I currently work on, but I can see it being a problem for teams I have worked on. Joel mentioned very valid reasons for this design, but they only seem valid for the subset of development teams that FogBugz targets.

In fact, as I left, almost every conversation I heard was about the product, and how people could fit it into their process, rather than use the process it gives you. Because FogBugz really is more than a bug tracking system–it now goes from documentation/requirements gathering all the way to estimation to bug tracking to customer support. FogBugz appears to be a tool that is used in almost the entire software development life cycle–hey look, it’s RUP lite.

But I’ve never used version 6, and I’m sure there are significant wins. My other concerns are that the software estimation parts sound like they’re 1.0 features (just from the words he used–at best 1.1, since they used FogBugz6 to develop FogBugz6); I’d rather wait until those features are more settled. I’m sure you could use just the bug tracking system, and they’ve certainly taken the ‘Web 2.0’/instant response/make-it-feel-like-a-desktop-application ideas to heart. The cost is another concern; while minimal, it is greater than $0. On many projects I’m on, just using any bug tracker, let alone an entire software development tool, is difficult, and you can’t beat stealth bug tracker installs. (I’m on record as saying “I have to say that I think the open source solutions (Bugzilla and PHPBT) are going to eat the commercial solutions’ lunch for small projects, because they are a cheaper substitute with all the required attributes”, just as an FYI.)

One thing that really surprised me at the talk was how many folks were there to evaluate FogBugz, as opposed to seeing Joel speak. Around one third of the audience had used or were using FogBugz. Joel opened up the floor to questions, and every single one except one (mine) was about features or flaws of FogBugz. I mean, this is the guy who wrote the Joel Test, and no one took the opportunity to ask him general development questions, even though he said he’d field them. I don’t know what the deal was.

Will I give FogBugz a try? Not right now. But I’ll keep an eye on what they’re doing.

[tags]software development tools, bug tracking, fogbugz, joelonsoftware[/tags]

Choosing new technology, or tail chasing

Robert Hanson, who built the very useful GWT Widget Library, has an interesting post where he asks:

Let’s say that you are a developer, and you have been spending the past year or so really getting to know a given technology. Now you are being told that the technology you are using is inferior to this “other” technology. You take a look and realize that it might be best to switch. A year later you finally have a good understanding of the tool, and use it with great skill. Then someone tells you about this “other” technology.

How many of us built our own MVC frameworks only to move to Struts, then maybe on to Spring MVC. Sure, there are some improvements made in each technological step, but since you are spending most of your time really getting to know a product you often spend little time getting the most out of it. This is compounded by the fact that you often use several of these products at the same time, adding to what you need to learn.

So what is a dog to do? Although you are moving forward, you never quite catch the tail. Should you just stop moving forward, or run faster or slower?

Personally, I think there is a middle ground. As a developer, you need to keep up on broad trends and tools, because they can make you so much more productive. The problem is that you often don’t know how much more productive a technology or tool will make you until you’ve used it for a while.

However, just because there is a new tool around doesn’t mean you have to use it. In fact, if you have an existing technology that does the job, you should not abandon it just to move to the new technology. There’s always a cost analysis to do, because learning a new technology is not free. Your time is worth something.

This cost analysis is something developers should learn to do and appreciate, because it is exactly what most companies need to do before deciding to implement or build new software. Like a developer, most companies believe a new technology or system will help them, but are unsure how much it will help and how much it will cost. And just as for a company, a developer’s decision to learn and use a new technology is not solely a technology decision.

There are many ways to minimize the risk of learning a new technology: prototype, read documentation, be conservative, or consult someone who’s an expert in the new technology (which means they’ve already made some of the mistakes). Each of these has benefits and detriments. Prototyping takes more time than the others. Reading documentation is great if documentation exists and is accurate, but might not teach you as many lessons as using the technology. Being conservative means that you’ll probably miss out on some productivity improvements, just as you’ll miss out on some time sinks. Consulting an expert is great, if you have access to one and know what questions to ask.

I think the answer to Robert’s final question is intensely context-sensitive. It depends on the following five considerations, among others:

  • how crucial a new technology is to your productivity (i.e., if you are a Java business developer, learning GWT might be lower on the list than learning Spring)
  • how easy you think it will be to learn
  • whether you can be paid to learn it
  • how much spare time you have
  • whether you have a project to use the new technology on

[tags]tail chasing,technology[/tags]

PHP form generation

I just wanted to say: if you are building an application in PHP and need to edit or search data in a relational database, HTML_QuickForm, DB_DataObject and, occasionally, DB_DataObject_FormBuilder can be very useful for prototyping and, depending on your client’s needs, for building.  These tools are well worth a look if you’re planning to write any custom PHP database manipulation code.

Announcement: FRUGOS GeoSummit 2007

One of my clients is helping out with this unconference. If you’re into GIS, it seems like it’d be worth going. I certainly had fun at the last unconference I went to.  I am planning to attend; hope I see you there.
———————–

FRUGOS (Front Range Users of Geospatial Open Source) is holding its
first GeoSummit on Saturday, June 16th at Churchill Navigation–100
Arapahoe–in Boulder.

This will be a unique gathering of a variety of folks interested in
Place–geo-types, hackers, academics, artists, amateur enthusiasts,
etc. While there certainly will be representation from the GIS and open
source worlds, we encourage all who are fascinated about the
intersections of technology and engagement with the world around us to
participate.

Also, we’ll be structuring the day around the “un-conference” model (see
http://www.barcamp.org), so, for starters, you
can expect:
No Pitches
No PowerPoint
No Passivity (unless you’re a little sleepy after lunch)

Bring your laptop (we’ll have wireless), and a project or enthusiasm
you’d like to talk about with the group, get feedback, and collaborate
on fresh solutions: the agenda of the day will be structured during
the morning registration/sign-up/socializing period.

If interested–
1) RSVP by joining the Google Groups set up for this event–

http://groups.google.com/group/geosummit

2) Bring a laptop (and cellphone/GPS if your enthusiasms tilt that
way), your idea/project, and willingness to collaborate

3) Spread the word

Tentative Schedule

9:30-10:30AM Registration, refreshment, socializing
10:30-12ish Sessions
12ish-2 Lunch (there’s a grill, beverages, and hiking trails)
2-? Sessions

This promises to be a great combination of creativity, intellectual
engagement, eating and drinking, and socializing.

————————

[tags]barcamp,gis,unconference[/tags]

My software internationalization article is now up at Ccaps

A few months back, I was approached to write an article for the web newsletter that the folks at Ccaps publish. With the help of Mike and Jeff from Zia Consulting, and Brian Pontarelli, I wrote what I think is a fairly decent introduction to internationalization from a software engineering standpoint, based on the project I’ve presented about in the past. I hope you’ll enjoy it. (I’ll be archiving it permanently on my website soon.)

Update 4/19/07: I corrected the format of the name Ccaps, and linked to their website.
[tags]i18n[/tags]