
MySQL performance and doing calculations on varchar columns

MySQL, along with other features designed to make it easy to use, tries to do the right thing regarding strings. When you perform a math calculation on a column or columns of type varchar, MySQL automatically converts the string to a number (empty strings are treated as zero):

To cast a string to a numeric value in numeric context, you normally do not have to do anything other than to use the string value as though it were a number[.]

However, this conversion, convenient as it may be, is not free. A client of mine had a query that was running calculations against two such columns. After indexing and trying to simplify the query, we were still seeing query execution times of 2+ seconds (all times are quoted for MySQL 4.1, on my relatively slow personal laptop).

The solution appears to be changing the type of the columns to double, using the alter table syntax. After doing so and running analyze table mytable, I was seeing query execution times of 0.2 seconds for the same query on the same box. Fantastic.
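
For reference, the change amounted to something like this (mytable is the table name from above; the column names are hypothetical):

-- hypothetical columns: price and qty were varchar, which forces a
-- string-to-number conversion on every row the calculation touches
SELECT SUM(price * qty) FROM mytable;

-- convert the columns to a numeric type, then refresh the statistics
ALTER TABLE mytable MODIFY price DOUBLE, MODIFY qty DOUBLE;
ANALYZE TABLE mytable;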

I am not sure if this result was due to not having to do several string conversions for each row returned by the query, or the fact that:

In some cases, a query can be optimized to retrieve values without consulting the data rows. If a query uses only columns from a table that are numeric and that form a leftmost prefix for some key, the selected values may be retrieved from the index tree for greater speed[.]

Regardless of the cause, if you’re doing some complicated calculations on columns, consider making them numbers.

mod_alias vs mod_proxy for XMLHttpRequest proxying

If you’re going to use an Apache proxy to fix some of the issues with XMLHttpRequest, be aware that mod_alias sends a redirect to the browser. That is, it sends one of the 3XX HTTP status codes to the XMLHttpRequest object. The XMLHttpRequest object then issues the GET itself (well, some do; check out these XMLHttpRequest tests for more). That’s fine if you’re doing a GET, but if you’re doing a POST, some redirects will require user interaction. I found that mod_alias, which was sending a 301 (redirect permanent), simply turned the POST request into a GET. (mod_alias doesn’t always come to the rescue, apparently.)

The solution? Well, since you are really proxying the XMLHttpRequest’s request, use mod_proxy.
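
For illustration, the two configurations look roughly like this (the path and backend host are hypothetical):

# mod_alias: answers the client with a 301, so the XMLHttpRequest object
# re-issues the request itself, and a POST may come back as a GET
Redirect permanent /service http://backend.example.com/service

# mod_proxy: Apache fetches the response server side, so the browser
# never sees a redirect at all
ProxyPass /service http://backend.example.com/service
ProxyPassReverse /service http://backend.example.com/service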

Google Maps Gotchas

I’ve done some recent work with Google Maps for a client and thought I’d share some lessons learned. (It seems I’ve been posting a lot about Google lately–I don’t know why.)

First off, like many other folks on the web, I think Google Maps is great. I’ve been a long-time MapQuest user, and the fact is that Google’s image panes just produce a better, slicker user experience than MapQuest’s dynamic image generation. Not to mention that Google’s map API is free (as in beer, not in speech). Now, I haven’t had a chance to compare Yahoo’s map offering (as Michael Yuan did), though I have played around with Yahoo! MapMaker for Excel; more mapping options can only be better. On to the issues I had with Google Maps.

* Geocoding is not provided with Google Maps, which means that you need to set up your own geocoding engine. However, the Tiger/Line dataset has some holes in it. Especially for rural regions, I was seeing many addresses that could not be geocoded. Even for an urban area (like Boulder, CO), around ten percent of the addresses were not geocodable. As for the accuracy of the geocoding itself, I don’t even know how to test it on a large scale, but my client said he saw at least some instances where an address was geocoded incorrectly (a small number, but still significant). Well, what can you say? If you want precision, much less accuracy, pay for it. I investigated using Yahoo’s geocoding service, which is free and based on higher quality commercial data. Since my client is a commercial site (even though the maps are available for free), Yahoo said that they would require Yahoo maps on the site if it were to use their geocoding service. Fair enough. (As an aside, this was an interesting podcast of a speech by a Navteq executive outlining some of the issues surrounding procuring good geodata.)

* PNGs are the default image type for the map pinpoints on Google Maps. These images let you mark certain locations. However, if you display them in a list alongside the map, you’ll quickly find that transparent PNGs don’t work with Internet Explorer. They show up, but are surrounded by a black box (Update Feb 8: transparent PNGs are outlined by a box in Internet Explorer. I’ve seen black boxes and blue boxes; regardless of the color, it looks bad). Luckily, the transparent PNG/Internet Explorer problem has already been solved.

* Each pinpoint/marker is added using an overlay. If you have significant numbers of overlays, map rendering becomes quite slow. Now, for Firefox users, it’s not as big an issue, because Firefox renders the rest of the page before the map. But for IE users, the table containing both the list and the map is not rendered until the map is displayable. On a reasonably fast box, I was seeing times of 80 seconds until the page rendered. Clearly, this was unacceptable. Even on Firefox, some of the rendering times were just too slow. I searched the Google Maps Discussion Group, and most everyone was saying that if you have a large number of markers, you should cluster them into a few representative markers until the user has zoomed sufficiently. (If I recall correctly, the maximum number of markers most folks recommended was around 20.) Such clustering can happen on the server side or on the client side (a bare-bones client-side sketch appears after this list).

* Data retrieval architecture affects performance. For the first revision of the mapping application, I sent the data for each pinpoint at the same time as the map and the listing. This was the wrong way to do it–this method makes perceived rendering time much longer. The correct way is documented in the ‘Hacking Maps with the Google Maps API’ XML.com article (linked below): use an XMLHttpRequest to pull in the pinpoint data asynchronously (a skeleton of that approach follows the clustering sketch below). I am in the midst of developing revision two of the interface and have noticed an appreciable speedup in rendering, so I’d recommend heading down that path from the start.
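
Here is the clustering idea as a bare-bones client-side sketch. It is deliberately independent of the Maps API itself; the pins array (objects with lat and lng properties) and the cellSize parameter are hypothetical. It buckets pinpoints into grid cells and returns one centroid per occupied cell, which you would then render as representative markers until the user zooms in:

// Bucket pins into a lat/lng grid; return one centroid per occupied cell.
// pins: array of {lat: ..., lng: ...}; cellSize: cell width in degrees.
function clusterPins(pins, cellSize) {
  var cells = {};
  for (var i = 0; i < pins.length; i++) {
    var key = Math.floor(pins[i].lat / cellSize) + ':' +
              Math.floor(pins[i].lng / cellSize);
    if (!cells[key]) {
      cells[key] = { lat: 0, lng: 0, count: 0 };
    }
    cells[key].lat += pins[i].lat;
    cells[key].lng += pins[i].lng;
    cells[key].count++;
  }
  var clustered = [];
  for (var k in cells) {
    clustered.push({
      lat: cells[k].lat / cells[k].count,   // centroid of the cell
      lng: cells[k].lng / cells[k].count,
      count: cells[k].count                 // how many pins it stands for
    });
  }
  return clustered;
}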
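
And here is a skeleton of the asynchronous data retrieval (the /pinpoints.xml URL and the addPinsFromXml callback are hypothetical; the rest is the standard XMLHttpRequest dance of the day):

// Fetch the pinpoint data after the map has rendered.
function loadPins() {
  var req;
  if (window.XMLHttpRequest) {
    req = new XMLHttpRequest();
  } else {
    req = new ActiveXObject("Microsoft.XMLHTTP");  // older IE
  }
  req.onreadystatechange = function() {
    if (req.readyState == 4 && req.status == 200) {
      addPinsFromXml(req.responseXML);  // hypothetical marker-adding function
    }
  };
  req.open("GET", "/pinpoints.xml", true);  // true = asynchronous
  req.send(null);
}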

Finally, some resources. The first article you should read is Hacking Maps with the Google Maps API. This tutorial steps you through everything you need to know. The API documentation is adequate. I found some interesting discussions happening on the API discussion group. And, finally, the GoogleMapki was useful in looking at performance concerns. I haven’t read Scott Davis’ Google Maps API article, but keep meaning to spend the $8.50 and do so.

All in all, I enjoyed learning this new technology. The API is simple and easy to use, even for someone who’s no JavaScript expert. I look forward to adding maps to some of my other pages; my cross country skiing resources page in particular would benefit. Google has kickstarted a whole new area of web development, and I think there’s plenty more to do.

Perl to the rescue

I am using Apache JMeter to load test a web application. JMeter has an XML file format for storing load test configuration information. I wanted to system test as well, and needed to generate a large number of unique URL hits. Rather than using the clunky UI to add them, and getting carpal tunnel in the process, I analyzed the XML file format and split it up. Then I put tokens (XXXTIMETOWAITXXX) in the appropriate places, and used an Excel-generated CSV file to drive Perl to assemble the pieces of text into a valid JMeter config file.

Well, what happened next? I needed to generate a larger number of URLs than Excel could comfortably handle. Again, Perl came to the rescue, making it easy to generate umpteen lines of correctly formatted CSV.
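
The gist of the token-substitution step looks something like this (a rough sketch; template.xml, urls.csv, and the XXXURLXXX token are hypothetical, while XXXTIMETOWAITXXX is the token from the actual config):

#!/usr/bin/perl
use strict;
use warnings;

# Slurp the JMeter XML fragment that contains the tokens.
open my $tmpl_fh, '<', 'template.xml' or die "can't open template: $!";
my $template = do { local $/; <$tmpl_fh> };
close $tmpl_fh;

# Emit one filled-in copy of the fragment per CSV row.
open my $csv_fh, '<', 'urls.csv' or die "can't open csv: $!";
while (my $line = <$csv_fh>) {
    chomp $line;
    my ($url, $wait) = split /,/, $line;
    my $chunk = $template;
    $chunk =~ s/XXXURLXXX/$url/g;
    $chunk =~ s/XXXTIMETOWAITXXX/$wait/g;
    print $chunk;
}
close $csv_fh;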

Performance testing, complexity of

Performance testing is a bit like visiting your girlfriend’s father. You’re never quite sure what you’re accomplishing, it can be alternately frustrating and satisfying, and you have to do it. Right now I’m in the midst of performance testing a web based application for my new company. I’ve been in such testing tangles before, though always as a consultant on a fixed bid project. I’d have to say that performance testing as an employee is less stressful than that.

The reasons why performance testing, especially of web applications, is such a rat’s nest are many:

complexity of platforms
Most modern web applications are built on a lot of code. In our case, it’s a servlet and logging framework on top of tomcat on top of the JVM on top of the operating system. Four levels on the web server, not counting the back end or the load balancer or any interaction with the browser! And this is a relatively simple system. I’ve seen portal applications that had 6 or more levels in the web server. Each level of the software stack interacts in (sometimes unforeseen) ways with the others, which means that changing parameters can have unpredictable effects. You simply must test every change you make.

realistic hardware
Unless you’re working with Scrooge McDuck or an application that has yet to be deployed, you’re probably not going to be able to test on production hardware. Very few companies I’ve dealt with are willing to buy a duplicate of their production hardware for testing purposes, so you’ll probably be testing on a scaled down version of the production system. That means that you’ll have to make assumptions about what the smaller system will tell you about the bigger system. One usually safe conclusion is that the smaller system sets a performance minimum for the larger system.

amount of time required
Each performance test takes a significant amount of time: minutes at least, unlike unit tests, which you want to run in seconds. Such slow turnarounds mean that performance testing just can’t be done quickly.

difficulty of understanding real user behavior
The more complicated your application is, the harder it is to understand how people are going to use it. Will they move quickly through the application? Will they leave sessions open for a long time? How many states will they go through? Anyone can come up with a reasonable guess as to the answers for these questions, but the only way to know for sure is to a) user experience test it, or b) unleash the application.

ambiguous or arbitrary goals
Unless you really understand how your userbase is going to use the application, it’s hard to come up with reasonable goals. ‘Make it run faster’ doesn’t cut it. Nor does picking an arbitrary number: ‘we want to service 10,000 hits a second’ may seem like a good goal, but if that number was just plucked from the air, a lot of misery can result. Especially if you’re on a fixed bid project, and every hour you spend is eating into your margin. (It’s OK for performance testing to make a tech person miserable, as long as there’s business benefit—and an arbitrary number is likely to under- or overshoot the optimum for business.)

difficulty of reproducing real user behavior
I’ve not had a lot of experience with for-pay tools, but I have used a variety of free (as in beer) tools. I’ve written before about my experiences with The Grinder, and am currently using JMeter. I’ve also used apachebench. All of these tools are great at hitting URLs repeatedly and rapidly (see the sketch after this list), but it’s hard to really reproduce user (and browser) behavior with them, because they’re simple programs. An example is that some versions of IE can call a servlet multiple times. You can’t possibly hope to replicate all the quirks of browsers when testing, but sometimes those quirks can have performance impacts.
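
To be concrete about how simple these tools’ view of the world is: a typical apachebench run boils down to one URL, hammered over and over (the host and path here are hypothetical):

# 1,000 requests total, 10 at a time, all against a single URL
ab -n 1000 -c 10 http://testhost.example.com/app/login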

These dimensions of complexity feed on each other. Because it takes so long to performance test an application, you are tempted to change more than one level of the application at once. Because you think you understand user behavior, you come up with an erroneous performance target.

Is it hopeless? Nope, and it can be a very good exercise—it can turn up areas of real weakness in your application. Just remember to document your assumptions, make the tradeoffs abundantly clear to non-technical folks, and realize that you’re going to miss something important. Your results won’t be worth as much as you think they will be. Oh yeah, and don’t sign any fixed bid performance testing contracts unless you know what you’re doing.

On ‘The Perils of Java Schools’

Joel has another interesting article, The Perils of Java Schools, where he laments the fact that many CS degrees are focusing on Java. His main points are that if you don’t focus on the harder parts of CS (recursion and pointers), you don’t weed out inadequate programmers, and that Java doesn’t allow for adequate examination of those harder parts. Weeding is needed, even though the harder parts aren’t–for most jobs. (His example of the need for pointers is working on operating systems–how many programmers really need to do that? Darn few. [Of course, for those who are interested in working on operating systems, I’d recommend avoiding a Java-based CS degree.]) In addition to weeding out those who don’t have a talent for programming, recursion and pointers make great interview topics (see his Guerrilla Guide to Interviewing) for finding smart people to hire.

When I read his article, I had two related responses to his lament. The first is that coding isn’t the most important thing for many ‘programming jobs’ anymore. For a large number of them, the ability to relate to business problems and solve business needs is much more important. See this article for a related discussion on how to avoid being outsourced. A pure coder is more likely to be outsourced than a coder who also knows the business. I’d argue that at many organizations, a brilliant pure coder who can’t relate to the business folks is less effective than a decent coder who can extract requirements.

I don’t have a CS degree. It has definitely hurt me at times: I’m not as comfortable with some of the lower level constructs (parse trees, pointers) as other colleagues with a traditional CS degree. However, my liberal arts education has benefited me, because the writing and oral communication skills that I honed at college help me pinpoint what non-technical folks want to build. In fact, while building a system is fun, the real challenge and reward of software engineering is finding out what needs to be built, and figuring out how to build it. Both types of skills are necessary.

But the other point of his lament remains. How do you find intelligent software engineers, and how do you distinguish those who talk a good game from those who can actually play it? I was sitting around the poker table a few days ago, and some friends were discussing MVC and n-tier architectures. It’s easy to toss around those highfalutin words; it’s another thing entirely to understand the nitty-gritty of building such systems–I don’t. But I don’t think anyone who hasn’t worked on those large scale systems really does–CS degree or not.

I don’t know any one way to distinguish good workers. The closest thing to a methodology I have is to ask them questions about real world situations that stonkered me, and see if their answers make sense. Joel still makes sense in the Guerrilla Guide when he says you want people who are “smart and get things done”. But I believe that the focus of development has changed enough that the lack of C knowledge is not a loss. Just as the lack of punch card skills is not a loss.

(Note that I’ve used software engineers, developers, and programmers synonymously above, which may or may not be a justifiable abuse of the English language.)

On contracting

I recently (within the last couple of months) took a full time job as a software developer. After two years of contract software development, I have found it to be quite a change. I think that both contracting and full time employment have good things to offer; if you can manage it economically, contracting is well worth doing for a while. Why? A number of reasons:

When you contract, you’re responsible for finding your own work and making sure you get paid. This alone is worthwhile, as it gives you a fantastic appreciation for sales and accounting. The networking to find your next gig is great experience for moving up the food chain in a company.

You also get exposed to different technologies and businesses. In my two years, I did work in 6 different programming languages, a number of different frameworks (both free and expensive) and build environments, and 3 or 4 operating systems. That was great–I grew technically and learned how to tell if another techie was competent (and what they were competent at!) relatively quickly.

But more important was the variety of business situations I worked in. I did work alone, with one other technical person, on a team with no dedicated QA, and on a team with dedicated QA and build staff. I did work with small companies, medium sized companies, and enormous companies. I worked with startups and with established firms. Some of the organizations had excellent management structures–in others, the inmates were running the asylum. And I paid much more attention than I would have as an employee, because my paycheck rested on making sure that whoever was in charge was happy. This breadth of business experience is something that you cannot come by as a full time employee.

Beyond that experience, there’s also the time and money factors. If you want to control your time, contracting is great–I regularly took multiple week vacations, because I was willing to sacrifice earnings for them. On the flip side, if money is more important than time, you can certainly earn significant amounts of money when contracting, as you get paid by the hour.

Contracting also carries a lot less stress than an employee-employer relationship. In my opinion, in a proper employee-employer relationship, the employee is loyal to the goals of the company (which implies understanding them–itself a difficult matter) and helps achieve those goals. In addition, if an employee makes a technology decision, they are living with that decision for the foreseeable future. These factors combine to make work as a full time employee more stressful; more rewarding, but more stressful. In the contractor-client relationship, you still want to do a great job. But the permanence of technology decisions isn’t there–either you’re doing work that fits into an existing technology stack, or you’re making technology decisions but will be moving on in some finite period of time. Your stress is limited to whether your invoices will get paid!

There are downsides to contracting, too. For one, you’re not part of a team. You may work in a team, and they may make an effort to include you, but you’re not really part of that team–you’re just a hired gun who will do the work and then head off. When the inevitable bug comes up 5 months down the road, they can call you, but doing so incurs a higher transaction cost than if they just had to grab a fellow employee for a minute or two. The flexibility of money and time that you have can cause resentment too.

While the breadth of technologies and business methods can be great to experience, it can also be difficult to process. To hop in and be productive on the first day or two of a new contract can be hard, because you want to make sure you’re fitting into the existing processes.

All of the above comments are based solely on my experience. But I’d say that if you’re considering contracting, do it! It’s a great experience. Make sure you have a buffer of 6 months of pay, and then jump in.

unescaping a string with PL/SQL

I’ve written about PL/SQL before, but I’ve recently started working on a project that uses it heavily. Given the amount of code written for Oracle databases, I’m rather surprised that there’s no PL/SQL Cookbook along the lines of the Perl Cookbook and the Java Cookbook (more cookbooks from O’Reilly are listed here). There is an Oracle Cookbook, but based on a quick scan of Amazon, it is focused, as you’d expect, more on database design than on PL/SQL programming. (Interestingly, there is an Oracle+PHP cookbook, and a PL/SQL sample code page, but neither of those is quite what I’m looking for.)

The reason that I’d like a PL/SQL cookbook is that there are large sets of problems that routinely need to be solved in PL/SQL, but the language is so low level (though they just added some regex support in 10g; bravo!) that doing these routine tasks, and making sure they’re correctly implemented, can be difficult and tedious. This is especially true for a programmer coming from a different language, one who’s used to higher levels of abstraction (like, for example, those the good folks who author CPAN modules provide)–it’d be well worth my $70 to make sure that I never had to deal with a problem like, say, unescaping a string.

For that’s the problem I recently had to solve. Essentially, we have a string that looks like this: yellow,apple. This string represents two values, which need to be put in different places by splitting the string into ‘yellow’ and ‘apple’. All well and good, until the possibility of embedded commas arises, for it’s possible that the desired end values were ‘yellow,blue’ and ‘apple,banana’. The answer, of course, is to escape the commas on the way in (turning the second input into something like this: yellow:,blue,apple:,banana), and to unescape those special characters when processing (both the comma and the escape character, which in the example is the colon). That’s what these three functions do. They take a string like the above examples and parse it into a table, to be iterated over at your leisure.

/* ------------------- function splitit ------------------*/
FUNCTION splitit(p_str VARCHAR2, p_del VARCHAR2 := ',', p_idx PLS_INTEGER, p_esc VARCHAR2 := ':')
RETURN INTEGER
IS
    l_idx                PLS_INTEGER;
    l_escape_char        VARCHAR2(1) := p_esc;
    l_chars_before_count PLS_INTEGER := 0;
BEGIN
    -- find the next delimiter at or after position p_idx
    l_idx := instr(p_str, p_del, p_idx);
    IF l_idx > 0 THEN
        -- count the escape characters immediately preceding the delimiter
        WHILE substr(p_str, l_idx - l_chars_before_count - 1, 1) = l_escape_char LOOP
            l_chars_before_count := l_chars_before_count + 1;
        END LOOP;

        IF mod(l_chars_before_count, 2) = 0 THEN
            -- an even number of escape characters means the delimiter itself
            -- is unescaped: a real segment boundary
            RETURN l_idx;
        ELSE
            -- an odd number means this delimiter is escaped; skip past it
            RETURN splitit(p_str, p_del, l_idx + 1, p_esc);
        END IF;
    ELSE
        -- no more delimiters (instr returned 0)
        RETURN l_idx;
    END IF;
END splitit;
/* ------------------- function splitit ------------------*/

/* ------------------- function unescape ------------------*/

FUNCTION unescape(p_str VARCHAR2, p_del VARCHAR2 := ',', p_esc VARCHAR2 := ':')
RETURN VARCHAR2
IS
    l_str VARCHAR2(32767);
BEGIN
    -- unescape the delimiter first, then collapse doubled escape characters
    l_str := replace(p_str, p_esc||p_del, p_del);
    l_str := replace(l_str, p_esc||p_esc, p_esc);
    RETURN l_str;
END unescape;
/* ------------------- function unescape ------------------*/

/* ------------------- function split ------------------*/

FUNCTION split(p_list VARCHAR2, p_del VARCHAR2 := ',')
RETURN split_tbl
IS
    l_idx         PLS_INTEGER;
    split_idx     PLS_INTEGER     := 0;
    l_list        VARCHAR2(32767) := p_list;
    l_escape_char VARCHAR2(1)     := ':';
    -- ten pre-initialized slots, so this version handles at most ten segments
    l_array       split_tbl       := split_tbl('','','','','','','','','','');
BEGIN
    LOOP
        split_idx := split_idx + 1;
        IF split_idx > 10 THEN
            EXIT;
        END IF;

        l_idx := splitit(l_list, p_del, 1, l_escape_char);
        IF l_idx > 0 THEN
            -- peel off the segment before the delimiter and unescape it
            l_array(split_idx) := unescape(substr(l_list, 1, l_idx - 1), p_del, l_escape_char);
            l_list := substr(l_list, l_idx + length(p_del));
        ELSE
            -- last segment: no delimiter left, but it still needs unescaping
            l_array(split_idx) := unescape(l_list, p_del, l_escape_char);
            EXIT;
        END IF;
    END LOOP;
    RETURN l_array;
END split;
/* ------------------- function split ------------------*/

/* in the header file (package specification), split_tbl is defined */
TYPE split_tbl IS TABLE OF VARCHAR2(32767);
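
To show the functions in action, here's a quick smoke test (a hypothetical anonymous block; it assumes the type and functions above are installed and dbms_output is enabled in your session):

DECLARE
    l_vals split_tbl;
BEGIN
    -- two values, each containing an escaped comma
    l_vals := split('yellow:,blue,apple:,banana');
    dbms_output.put_line(l_vals(1));  -- prints: yellow,blue
    dbms_output.put_line(l_vals(2));  -- prints: apple,banana
END;
/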

Not all of this code is mine–I built on a solution from a colleague. But I hope this saves one other person from the afternoon I just endured. And if you are a PL/SQL expert and care to critique this solution, please feel free.