Mon, 28 Dec 09

Hibernate Boolean/Integer ClassCastException

Posted in Java, Databases at 11:25 am by moore

I ran into an issue the other day with Hibernate configuration.

I have a bean that maps to a table in the database.  It has a column, featured, that only has values of 0 or 1.  For legacy reasons, we map that to an integer (turning it into a boolean is on the List Of Things To Do).

I ran into an issue, and was getting this error in the logs:

Caused by: java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.lang.Integer
at org.hibernate.type.IntegerType.set(IntegerType.java:41)
at org.hibernate.type.NullableType.nullSafeSet(NullableType.java:136)
at org.hibernate.type.NullableType.nullSafeSet(NullableType.java:116)
at org.hibernate.loader.Loader.bindPositionalParameters(Loader.java:1698)
at org.hibernate.loader.Loader.bindParameterValues(Loader.java:1669)
at org.hibernate.loader.Loader.prepareQueryStatement(Loader.java:1554)
at org.hibernate.loader.Loader.doQuery(Loader.java:661)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:224)
at org.hibernate.loader.Loader.doList(Loader.java:2211)
at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2095)
at org.hibernate.loader.Loader.list(Loader.java:2090)
at org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94)
at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569)
at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283)

Here’s the relevant section of the mapping file:

<property name=”featured” type=”integer” columnn=”featured” />

I tried explicitly laying out the type of the SQL column:

<property name=”featured” type=”integer”>
<column sql-type=”INTEGER” name=”featured” />
</property>

But neither of these worked; I received the same error.

I took a step back and looked at the bean.

Integer featured;

public Integer getFeatured() {
return featured;
}
public void setFeatured(Integer featured) {
this.featured = featured;
}

...

and realized that I had added a convenience method that was confusing the Hibernate code:

public boolean isFeatured() {
return ...
}

When I changed the signature of the convenience method to:

public boolean isFeaturedListing() {
return ...
}

the exception went away.  Thought I’d share for anyone searching for this stacktrace.

Technorati Tags:

Thu, 29 Oct 09

Tips: Deploying a web application to the cloud

Posted in Databases, Web Applications, Useful Tools, Drupal, Cloud Computing at 2:20 pm by moore

I am wrapping up helping a client with a build out of a drupal site to ec2. The site itself is a pretty standard CMS implementation–custom content types, etc. The site is an extension to an existing brand, and exists to collect email addresses and send out email newsletters. It was a team of three technical people (there were some designers and other folks involved, but I was pretty much insulated from them by my client) and I was lucky enough to do a lot of the infrastructure work, which is where a lot of the challenge, exploration and experimentation was.

The biggest attraction of the cloud was the ability to spin up and spin down extra servers as the expected traffic on the site increased or decreased. We choose Amazon’s EC2 for hosting. They seem a bit like the IBM of the cloud–no one ever got fired, etc. They have a rich set of offerings and great documentation.

Below are some lessons I learned from this project about EC2. While it was a drupal project, I believe many of these lessons are applicable to anyone who is building a similar system in the cloud. If you are building an video processing super computer, maybe not so much.

Fork your AMI

Amazon EC2 running instances are instantiations of a machine image (AMI). Anyone can create a machine image and make it available for others to use. If you start an instance off an image, and then the owner of the image deletes the image (or otherwise removes it), your instance continues to run happily, but, if you ever need to spin up a second instance off the same AMI, you can’t. In this case, we were leveraging some of the work done by Chapter Three called Project Mercury. This was an evolving project that released several times while we were developing with it. Each time, there was a bit of suspense to see if what we’d done on top of it worked with the new release.

This was suboptimal, of course, but the solution is easy. Once you find an AMI that works, you can start up an instance, and then create your own AMI from the running instance. Then, you use that AMI as a foundation for all your instances. You can control your upgrade cycle. Unless you are running against a very generic AMI that is unlikely to go away, forking is highly recommended.

Use Capistrano

For remote deployment, I haven’t seen or heard of anything that compares to Capistrano. Even if you do have to learn a new scripting language (Ruby), the power you get from ‘cap’ is fantastic. There’s pretty good EC2 integration, though you’ll want to have the EC2 response XML documentation close by when you’re trying to parse responses. There’s also some hassle involved in getting cap to run on EC2. Mostly it involves making sure the right set of ssh keys is in the correct place. But once you’ve got it up and running, you’ll be happy. Trust me.

There’s also a direct capistrano/EC2 integration project, but I didn’t use that. It might be worth a look too.

Use EBS

If you are doing any kind of database driven website, there’s really no substitute for persistent storage. Amazon’s Elastic Block Storage (EBS) is relatively cheap. Here’s an article explaining setting up MySQL on EBS. I do have a friend who is using EC2 in a different manner that is very write intensive, that is having some performance issues with his database on EBS, but for a write seldom, read often website, like this one, EBS seems plenty fast.

EC2 Persistence

Some of the reasons to use Capistrano are that it forces you to script everything, and makes it easy to keep everything in version control. The primary reason to do that is that EC2 instances aren’t guaranteed to be persistent. While there is an SLA around overall EC2 availability, individual instances don’t have any such assurances. That’s why you should use EBS. But, surprisingly, the EC2 instances that we are using for the website haven’t bounced at all. I’m not sure what I was expecting, but they (between three and eight instances) have been up and running for over 30 days, and we haven’t seen a single failure.

Use ElasticFox

This is a FireFox extension that lets you do every workaday task, and almost every conceivable operation, to your EC2 instances. Don’t delay, use this today.

Consider CloudFront

For distributed images, CloudFront is a natural fit. Each instance can then reference the image, without you needing to sync files across instances. You could use this for other files as well.

Use Internal Network Addressing where possible

When you start an EC2 instance, Amazon assigns it two IP addresses–an external name that can be used to access it from the internet, and an internal name. For most contexts, the external name is more useful, but when you are communicating within the cloud (pushing files around, or a database connection), prefer the internal DNS. It looks like there are some performance benefits, but there are definitely pricing benefits. “Always use the internal address when you are communicating between Amazon EC2 instances. This ensures that your network traffic follows the highest bandwidth, lowest cost, and lowest latency path through our network.” We actually used the internal DNS, but it makes more sense to use the IP address, as you don’t get any abstraction benefits from the internal DNS, which you don’t control–that takes a bit of mental adjustment for me.

Consider reserved instances

If you are planning to use Amazon for hosting, make sure you explore reserved instance pricing. For an upfront cost, you get significant savings on your runtime costs.

On Flexibility

You have a lot of flexibility with EC2–AMIs are essentially yours to customize as you want, starting up another node takes about 5 minutes, you control your own DNS, etc. However, there are some things that are set at startup time. Make sure you spend some time thinking about security groups (built in firewall rules)–they fall into this category. Switching between AMIs requires starting up a new instance. Right now we’re using DNS round robin to distribute load across multiple nodes, but we are planning to use elastic IPs which allow you to remap a routable ip address to a new instance without waiting for DNS timeouts. EBS volumes and instances they attach to must be in the same availability zone. None of these are groundbreaking news, it’s really just a matter of reading all the documentation, especially the FAQs.

Documentation

Be aware that there are a ton of documentation, one set for each API release, for EC2 and the other web services that Amazon provides. Rather than starting with Google, which often leads you to an outdated version of documentation, you should probably start at the AWS documentation center. This is especially true if you’re working with any of the systems that are newer with perhaps not as stable an API.

In the end

Remember that, apart from new tools and a few catches, using EC2 is not that different than using a managed server where you don’t have access to the hardware. The best document I found on deploying drupal to EC2 doesn’t talk about EC2 at all–it focuses on the architecture of drupal (drupal 5 at that) and how to best scale that with additional servers.

Technorati Tags: , ,

Fri, 16 Oct 09

Optimizing a distance calculation in a mysql query

Posted in Databases, MySQL at 12:15 pm by moore

If you have a query that sorts by a derived field, and then takes a limited number of the results, it can be a real dog.  Here’s how I optimized a situation like this.  Imagine this table.

create table office_building (
id int primary key,
latitude float not null,
longitude float not null,
rent int,
address varchar(20),
picture_url varchar(255)
);

If you want to find the nearest 100 office buildings to a point on a map, you run a query something like this (plug your lat/lng into the question marks):

explain select *, round( sqrt( ( ( (latitude - ?) * (latitude - ?) ) *  69.1 * 69.1) +
((longitude - ?) * (longitude - ?) * 53 * 53 ) ) ) as distance
from office_building order by distance limit 100

(See here for an explanation of the 69.1 and 53 constants–basically they convert roughly from lat/lng to miles.) Unfortunately, you are ordering by a derived field, and mysql can no longer do order by optimization.

This means that you’ll be doing a filesort (which does not actually have anything to do with the filesystem, but is just a sort not on an index).  And this, in turn means that your performance will suck if you have any large number of rows returned.

You can help things out a bit by limiting your office building query to a box of a certain size around the point.  Here’s the query with a 5 mile box:

select *, round( sqrt( ( ( (latitude - ?) * (latitude - ?) ) *  69.1 * 69.1 ) +
( (longitude - ?) * (longitude - ?) * 53 * 53 ) ) ) as distance
from office_building
where latitude < ?  + (1/69.1)*5 and latitude > ? - (1/69.1)*5 and longitude < ? + (1/53)*5
and longitude > ? - (1/53)*5
order by distance limit 100

But if you still have too many results, the sorting on distance will be slow.  Also, even if you have an index on latitude and longitude, (such as create index idx_nearby on office_building (latitude,longitude)) because you are not using equality, only the first column will be used.

This is worth repeating, because it took me a while to understand.  If you have an index: create index idx on tbl (col1,col2,col3,col4,col5) and you run a query like select count(*) from tbl where col1 = 1 and col2 > 2 and col3 < 3 and col4 > 4 only col1 and col2 will be used from the index.  Mysql goes to the table data files for col3 and beyond (assuming no other indices on the table).  This makes sense when you think about how indices are created and stored, but I didn’t really understand it until I’d been beaten over the head with it.

As stated here: “[mysql] will use the fields [in the index], from left to right, as long as the WHERE clause has “=”. Once it hits a ‘range’ (>, IN, BETWEEN, …), it stops with that field.”  I don’t know why it is not in the mysql index documentation–maybe it is obvious?

The solution I found was to separate what I wanted to find in the select clause from how I find it, in the where and order by clause:

select select_clause.*,
round( sqrt( ( ( where_clause.latitude - ?) * (where_clause.latitude - ? ) *  69.1 * 69.1 ) +
( (where_clause.longitude - ? ) *(where_clause.longitude - ? ) * 53 * 53 ) ) ) as distance
from office_building where_clause, office_building select_clause
where where_clause.latitude < ? + (1/69.1)*5
and where_clause.latitude > ? - (1/69.1)*5
and where_clause.longitude < ? + (1/53)*5 and
where_clause.longitude > ? - (1/53)*5
and where_clause.id = select_clause.id
order by distance
limit 100

You also need to add an index:

create index idx_nearby on office_building (latitude,longitude,id);

Then, when you run the query, you still have the filesort, but you also see the magic ‘Using index’ in your explain plan.  You never have to go to the table to do the sort!  You also have a join now, but it’s on the primary key, and you only need to go to the table for the 100 rows that you know you want.

Using this query had an effect on one live system of one to two orders of magnitude increase in query speed, depending on the query.  This not only works for distance queries, but anytime you want to order by a calculated value.

More useful links: geo search suggestions, index explanation

Technorati Tags: , ,

Thu, 30 Jul 09

Article about using hibernate with GWT

Posted in Java, Databases, GWT at 3:29 pm by moore

I just read this article about the Google Web Toolkit and hibernate, and I’m thrilled that someone wrote this. A few years ago, when I was just starting to use GWT and hibernate, the ORM tool, I thought about writing something similar myself. I could never get over the hump of writing about setting up all the infrastructure necessary, something which the author does quite nicely.

I think this article gives a great overview of some of the complexities of using hibernate with the GWT client. The author essentially talks about three possible solutions to the primary problem when using hibernate objects in a GWT system: hibernate enhances your POJO code, and thus you cannot send objects returned from hibernate queries down the wire to the JavaScript client.  The JRE emulation simply can’t handle it.

I especially enjoyed the explanations of how to use some of the tools, to make mapping between GWT capable objects and hibernate objects easier. I’d heard of hibernate4gwt, now Gilead, but never used it. For most of my RPC calls, I end up using the first approach the author explores, custom DTO creation. Often times, I won’t create a special DTO object, but rather reuse the POJO that represents the domain object. This way, you can scrub subsidiary objects (though you lose lazy loading when you do this) and send those down as well.  As long as the POJO doesn’t have too many extraneous members, this seems to work fine, and removes the need for an extra class.

I was a bit frustrated, however, that the author ignored the delete case. This seems like a situation where tools like Gilead might really shine. I have often run into issues where I have to add a ‘deleted’ boolean flag to the hibernate object.  I do this because when an object gets deleted from a collection on the GWT side, my server-side code has no way of knowing this, without some additional complexity (rerunning the query and doing a comparison of results). Adding such a ‘deleted’ boolean flag, solves one set of problems, but raises additional complexity, because you end up having to check to see whether or not an object exists before you try to insert it in the database.

For example, imagine you have a user with set of CDs, which you display in a grid.  If you want to allow a user to correct the name of one of the CDs, and send it back, the server side has the modified record, hopefully with and ID, and can simply save it.  But if you delete one of the CDs from the collection, the server side does not have the modified object, and so has to figure out which one to delete.  Gilead, with its knowledge of the object graph, seems at first glance like it could solve this problem elegantly (a quick search on the Gilead site shows nothing that I could see).

Also note that, using RPC is fantastic for GWT applications, but if you think about using GWT for widgets, I would suggest using something that gives you a bit more flexibility like JSONP. Because GWT RPC depends on XMLHTTPRequest, it is fundamentally limited to sites where the JavaScript and RPC services are on the same host.  Obviously, since using JSONP serializes hibernate objects to strings, none of these tools are appropriate.  (See my survey of Google Web Toolkit client-server communication strategies for more.)

All that said, if you’re thinking about using hibernate and GWT in the same project, reading this paper and running through the examples will be a worthwhile use of your time.

Technorati Tags: , ,

Sat, 23 May 09

Data.gov–freely accessible data in standard formats from the USA federal government

Posted in Databases, GIS, Useful Tools at 1:05 pm by moore

Have you ever wondered where the world’s copper smelters are?  Or pondered reservoir storage data for the Colorado river?  Had questions about the residential energy use of US households?

Now you can find the answers to these questions, using data.gov, the stated purpose of which is “to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government.”  I’ve been writing for a while about the publishing power of the internet, but data.gov takes this to a whole new level.

It’s definitely a starting point, not an end, as there are only 47 raw datasets that you can access.  They cover a wide range of data and agencies, and were apparently chosen to kick things off because they “already enjoy a high degree of consensus around definitions, are in formats that are readily usable, include the availability of metadata, and provide support for machine-to-machine data transfer.”  The four main formats for data provided by data.gov are XML, CSV, KML and ESRI.  (There are also a number of widgets, and tools you can use, including the census factfinder.)

More datasets can be requested, and I’m hoping that they will be rolled out soon.  What a playground! Go take a look!

Update, 4:55: Here’s a great article on the whole process and problem.

Technorati Tags: ,

Fri, 04 Jan 08

BatchUpdateException using Hibernate and MySQL5

Posted in Java, Databases, MySQL at 7:59 pm by moore

I ran into a crazy error last week. One of my clients was upgrading from MySQL4 to MySQL5. The application in question was using Hibernate 3.2. Here’s the table structure, and the hibernate bean definition. See if you can spot the issue:

mysql> desc stat;
+--------------+-------------+------+-----+---------------------+-------+
| Field        | Type        | Null | Key | Default             | Extra |
+--------------+-------------+------+-----+---------------------+-------+
| stat_date    | date        |      | PRI | 0000-00-00          |       |
| stat_type    | varchar(50) |      | PRI |                     |       |
| stat_count   | int(11)     | YES  |     | NULL                |       |
| last_updated | datetime    |      |     | 0000-00-00 00:00:00 |       |
+--------------+-------------+------+-----+---------------------+-------+
4 rows in set (0.00 sec)

<class name="com.foo.common.data.Statistic" table="stat" lazy="false">
<cache usage="read-write"/>
<composite-id name="statisticId" class="com.foo.common.data.StatisticId">
<key-property name="date" type="java.util.Date" column="stat_date"/>
<key-property name="type" column="stat_type"/>
</composite-id>
<property name="count" column="stat_count"/>
<property name="lastUpdated" type="java.util.Date" column="last_updated" />
</class>

The exception stack trace I was seeing was something like this:

2007-12-31 14:15:09,888 ERROR [Thread-14] def.AbstractFlushingEventListener (AbstractFlushingEventListener.java:301)

- Could not synchronize database state with session
org.hibernate.exception.ConstraintViolationException: Could not execute JDBC batch update
at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:71)
at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)
at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:249)
at org.hibernate.engine.ActionQueue.executeActions(ActionQueue.java:235)
at org.hibernate.engine.ActionQueue.executeActions(ActionQueue.java:139)
at org.hibernate.event.def.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:298)
at org.hibernate.event.def.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:27)
....
Caused by: java.sql.BatchUpdateException: Duplicate key or integrity constraint
violation message from server: "Duplicate entry '2007-12-31-stattype' for key 1"
at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1492)
at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeBatch(NewProxyPreparedStatement.java:1723)
at org.hibernate.jdbc.BatchingBatcher.doExecuteBatch(BatchingBatcher.java:48)
at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:242)
... 57 more

I ended up turning on the mysql logging (the log setting in the my.ini file, which logs all sql statements mysql makes) to see what was happening.

Basically, I was looking to see if an entry in the stat table existed; if it did, increment and update, if it did not, insert. And the insert was always happening, so the entry was not found–it did exist because mysql threw the ‘integrity constraint’ exception.

The cause of the issue was the date type of stat_date and the fact that I incorrectly mapped it to java.util.Date. It really should have been mapped to java.sql.Date. How this worked in mysql4 is beyond me, but it did. Changing the hibernate dialect to mysql5 had no impact.

Technorati Tags: ,

Wed, 11 Apr 07

MySQL tuning

Posted in Databases, MySQL at 4:18 pm by moore

If you’re trying to tune MySQL, make sure you measure before and after. mysqlreport is a great way to measure a number of aspects of mysql (make sure you check out the guide). If you want graphing and long term tracking for mysqlreport and just about any other measure you care to track, I recommend cacti, which works with rrdtool.

Technorati Tags: ,

Tue, 20 Mar 07

Out of the box complete open source WAMP stack

Posted in Databases, Web Applications at 8:12 pm by moore

If you’re looking for an out of the box WAMP stack, I had good luck with Apache2Triad. A big download, but you’re getting Apache, PostGreSQL, MySQL, a mail server and an FTP server, plus various admin tools. I had things running in about 15 minutes–very cool, given that I’d spent a fair bit of time just trying to get PostGreSQL installed alone. The only hiccup was that my spyware tool thought SlimFTPd was a virus.

Beware, though, if you have an existing MySQL installation. Apache2Triad won’t blow away the data, but it will usurp the Services entry. And then, if you remove Apache2Triad, your existing MySQL instance is orphaned. I was able to get mine up and running again, from the command line. But to get it running again as a service required a complete uninstall/reinstall. No fun.

So, if you have no dev environment and need a quick start, take a look at this package. If you, on the other hand, have existing development tools installed, be more careful than I was.

Technorati Tags:

Thu, 11 Jan 07

Large varchar columns can lead to huge ESRI exports when using ogr2ogr

Posted in Programming, Databases at 10:02 am by moore

I was recently using ogr2ogr to convert, on the fly, some data in a PostGIS to other standard formats (ESRI and MapInfo). The ESRI export in particular had some problems–it took about 4 minutes for the export of an table with 11K rows and 37 columns, and it generated a 700M dbf file. This file was then zipped (with the other config files), and in around 6 minutes was compressed to a 7M zip file, that was sent to the browser. Now, you can imagine how thrilled a user would be to wait 10 minutes for an export. Apache was timing out (the default timeout is 5 min) and I was at a loss as to how to address the performance issue.

I mentioned this to a colleague who has significantly more experience with GIS tools, and he pointed out that in the source table there were several varchar(4000) fields. Now, in PostgreSQL, [i]f the string to be stored is shorter than the declared length … values of type character varying [varchar] will simply store the shorter string. But the ESRI export does not do that–each varchar(4000) field was padded to a length of 4000, even though none of the fields approached that length.

The solution? A few simple select max(length(colname)) from table and alter table statements, and the varchar(4000)columns were decreased in size. The dbf file decreased to a 50M file, uncompressed, and the entire zip file decreased to 5M. As you can guess, the download time was slashed.

Update 2/16: The kind members of the GDAL mailing list pointed me to a document listing all the limitations of the ESRI driver for ogr2ogr.  Check out the “Creation Issues Section”.

Technorati Tags: , ,

Wed, 27 Sep 06

Installing the median user defined function on MySQL

Posted in Programming, Databases, MySQL at 11:57 am by moore

I just re-read “How To Lie With Statistics”, which is so good I think it should be required reading in every middle school. In it, the author makes the point that there are three kinds of ‘averages’: arithmetic mean, median and mode (here I am, contributing to Wikipedia’s dominance, due to my laziness in looking up alternative definitions of statistical concepts). In general, the median is the most informative average, because it’s not skewed by a small number of outliers.But mysql (and other databases I’ve worked on) don’t natively supprt the medan, whereas I believe most support average (by which they mean ‘arithmetic mean’). Sure, you can use a stored procedure (as suggested here for PostgreSQL. However, I’m working with MySQL 4, which does not support stored procs. However, there is another solution: user defined functions. These seem like stored procedures, except you have to write them in C (or C++).

Now, I’m not a C programmer. Luckily, someone has written and released a set of mysql user defined functions that include median (as well as many other statistical manipulations). The bad news is that it hasn’t been updated for years. The good news is that with a bit of luck and many downloads, I was able to get the median function working on mysql, both on windows as a dll, and on linux as a shared library. To repeat, I am not a C programmer, so if you see any head thumping errors below, please let me know and I’ll update this document.

First off, I was working with these versions of mysql: c:\Program Files\MySQL\MySQL Server 4.1\bin\mysql.exe Ver 14.7 Distrib 4.1.10a, for Win95/Win98 (i32) and mysql Ver 14.7 Distrib 4.1.7, for pc-linux (i686)

To get median working on windows, you need to:

  1. Download the mysql-udf tarball.
  2. patch the files if you’re running a version of mysql greater than 4.1.1. patch available here, or the patched tarball is here.
  3. Download and install Visual C++ Express. (If you have a C compiler on Windows, you can skip this step and the next. Oh, and the ones following that will probably be different. (Here’s a blog post about creating a UDF using Visual Studio C++ 2003.)
  4. Download and install the platform SDK; I only followed through step 3. If you don’t, you’ll get ‘windows.h’ errors when you try to compile the UDF.
  5. Untar the myslq-udf tarball. Patch if needed.
  6. Install the mysql header files. I was able to do this via the Windows Installer, which let me modify my existing mysql installation; I had to add the ‘C Include Files / Lib Files’ feature.
  7. Create a new directory. Copy udf_median.cc from the untarred directory to this new directory.
  8. Create a new file in that directory called udf_median.def. This file contains all the methods the UDF is exporting. Or you can just download the file I used here.
  9. Open Visual C++ Express
  10. Choose File / New / Project From Existing Code. Hit Next. Browse to the directory you just created. Create a name for the project. Hit Finish
  11. Edit the udf_median.cc file and comment out the #ifdef HAVE_DLOPEN line as well as the corresponding #endif. If I didn’t do this, I kept getting link errors, as I guess everything between those preprocessor directives was not being compiled.
  12. Add the mysql include files: right click on the project and choose properties. Expand ‘Configuration Properties’ then ‘C/C++’ and click ‘General’. On the right, add an include directory. Navigate to the Mysql include directory and add that.
  13. Add the module definition file: right click on the project and choose properties. Expand ‘Configuration Properties’ then ‘Linker’ and click ‘Input’. Add ‘udf_median.def’ to the key ‘Module Definition File’.
  14. Make sure VC knows this is a DLL: right click on the project and choose properties. Expand ‘Configuration Properties’ and click ‘General’. Choose ‘Dynamic Library (.dll)’ for Configuration Type. If you don’t do this, you’ll get errors like: error LNK2019: unresolved external symbol _WinMain because the compiler thinks you’re trying to build an application.
  15. Right click on the project and choose ‘Build’. This gives you a DLL in the Debug directory.
  16. Copy the DLL to the bin directory of your mysql installation.
  17. Create the function by logging in to mysql and running this command: CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.dll';. (The user you log in as will need to have the ability to insert rows into the mysql tables.)
  18. Test and enjoy.

Deploying the UDF to linux is much simpler, mostly because you don’t have to install a compiler, linker, etc. I used ‘gcc (GCC) 3.3.4′.

  1. Download the mysql-udf tarball.
  2. patch the files if you’re running a version of mysql greater than 4.1.1. patch available here, or the patched tarball is here.
  3. Untar the myslq-udf tarball. Patch if needed.
  4. Edit the udf_median.cc file and comment out the #ifdef HAVE_DLOPEN line as well as the corresponding #endif.
  5. Compile and link the code. Do not use the instructions on the mysql-udf homepage. If you compile with those flags, you’ll get this error when you try to add the function: mysql> CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.so';
    ERROR 1126 (HY000): Can't open shared library 'udf_median.so' (errno: 22 /usr/lib/udf_median.so: undefined symbol: _Znwj)
    . Rather, use the instructions in this bug report: gcc -shared -lstdc++ -I /usr/include -I /usr/local/include -I /usr/local/mysql/include/ -o udf_median.so udf_median.cc'
  6. Copy the shared library to a directory where mysql will see it. I put it in /usr/lib.
  7. Create the function by logging in to mysql and running this command: CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.so';. (The user you log in as will need to have the ability to insert rows into the mysql tables.)
  8. Test and enjoy.

We have pushed this UDF to production with replicated servers and haven’t seen any issues with it yet.

I want to extend my thanks to:

Technorati Tags: , ,

« Previous entries ·


© Moore Consulting, 2003-2010