Recently, we’ve run into some stability issues with our main web application. It’s a small company, so even though I’m definitely not the ops guy, everyone is pitching in with ideas and suggestions. One thing the ops guy installed that has been super useful is Munin. This graphing software lets you monitor, over time, many different aspects of your web application and/or servers. It has been invaluable in showing us the effects of the various changes we’ve made.
The questions that Munin helps you answer can be pretty impressive. For example, does doubling the amount of memory available to your webapp container help stability? How can you know unless you’re measuring stability? What happens if you prohibit certain bots from visiting your website? When during the day or week is your server running hottest?
We have a watchdog that monitors our main application server and restarts it if it is not responsive. The watchdog also records when each restart occurred. I decided, as a fun project, to write a Munin plugin that would graph the number of restarts per day, as a high-level ‘are we more stable yet’ graph.
Writing a plugin was trivial: it’s a shell script that prints its output in a particular format. All I really needed was this HOWTO and this explanation of the types of data sources, though the FAQ is, as per usual, worth a scan.
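To give a feel for how little is involved, here is a minimal sketch of such a plugin, not the exact one I wrote. Munin runs the script with the argument config to learn how to draw the graph, and with no argument to fetch the current value. The log path, the log format (one line per restart, starting with a YYYY-MM-DD date), the field name restarts, and the function name restarts_plugin are all assumptions made up for illustration; adjust them to whatever your watchdog actually writes.

```shell
#!/bin/sh
# Hypothetical Munin plugin: graphs the watchdog restarts recorded so far today.
# ASSUMPTION: the watchdog appends one line per restart to RESTART_LOG and each
# line begins with the date as YYYY-MM-DD. The path below is made up.
RESTART_LOG="${RESTART_LOG:-/var/log/watchdog-restarts.log}"

restarts_plugin() {
    if [ "$1" = "config" ]; then
        # Munin calls the plugin with "config" to learn how to draw the graph.
        echo "graph_title App server restarts today"
        echo "graph_vlabel restarts"
        echo "graph_category appserver"
        echo "restarts.label restarts"
        echo "restarts.type GAUGE"
        echo "restarts.min 0"
        return 0
    fi

    # With no argument, Munin expects the current value of each field.
    today=$(date +%Y-%m-%d)
    count=0
    if [ -r "$RESTART_LOG" ]; then
        # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
        count=$(grep -c "^$today" "$RESTART_LOG" || true)
    fi
    echo "restarts.value $count"
}

restarts_plugin "$@"
```

Because the value is a running count for the current day, the graph climbs through the day and drops back to zero at midnight, so the daily peak is the number to watch.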
Munin is by no means perfect (my dream feature would be the ability to annotate graphs at a certain moment in time; ‘this is when we released version 2.1’), but it is a huge hammer in the IT toolbox for understanding the current and historical behavior of your application.