Wed, 27 Sep 06

Installing the median user defined function on MySQL

Posted in Programming, Databases, MySQL at 11:57 am by moore | Permalink

I just re-read “How To Lie With Statistics”, which is so good I think it should be required reading in every middle school. In it, the author makes the point that there are three kinds of ‘averages’: arithmetic mean, median and mode (here I am, contributing to Wikipedia’s dominance, due to my laziness in looking up alternative definitions of statistical concepts). In general, the median is the most informative average, because it’s not skewed by a small number of outliers.But mysql (and other databases I’ve worked on) don’t natively supprt the medan, whereas I believe most support average (by which they mean ‘arithmetic mean’). Sure, you can use a stored procedure (as suggested here for PostgreSQL. However, I’m working with MySQL 4, which does not support stored procs. However, there is another solution: user defined functions. These seem like stored procedures, except you have to write them in C (or C++).

Now, I’m not a C programmer. Luckily, someone has written and released a set of mysql user defined functions that include median (as well as many other statistical manipulations). The bad news is that it hasn’t been updated for years. The good news is that with a bit of luck and many downloads, I was able to get the median function working on mysql, both on windows as a dll, and on linux as a shared library. To repeat, I am not a C programmer, so if you see any head thumping errors below, please let me know and I’ll update this document.

First off, I was working with these versions of mysql: c:\Program Files\MySQL\MySQL Server 4.1\bin\mysql.exe Ver 14.7 Distrib 4.1.10a, for Win95/Win98 (i32) and mysql Ver 14.7 Distrib 4.1.7, for pc-linux (i686)

To get median working on windows, you need to:

  1. Download the mysql-udf tarball.
  2. patch the files if you’re running a version of mysql greater than 4.1.1. patch available here, or the patched tarball is here.
  3. Download and install Visual C++ Express. (If you have a C compiler on Windows, you can skip this step and the next. Oh, and the ones following that will probably be different. (Here’s a blog post about creating a UDF using Visual Studio C++ 2003.)
  4. Download and install the platform SDK; I only followed through step 3. If you don’t, you’ll get ‘windows.h’ errors when you try to compile the UDF.
  5. Untar the myslq-udf tarball. Patch if needed.
  6. Install the mysql header files. I was able to do this via the Windows Installer, which let me modify my existing mysql installation; I had to add the ‘C Include Files / Lib Files’ feature.
  7. Create a new directory. Copy udf_median.cc from the untarred directory to this new directory.
  8. Create a new file in that directory called udf_median.def. This file contains all the methods the UDF is exporting. Or you can just download the file I used here.
  9. Open Visual C++ Express
  10. Choose File / New / Project From Existing Code. Hit Next. Browse to the directory you just created. Create a name for the project. Hit Finish
  11. Edit the udf_median.cc file and comment out the #ifdef HAVE_DLOPEN line as well as the corresponding #endif. If I didn’t do this, I kept getting link errors, as I guess everything between those preprocessor directives was not being compiled.
  12. Add the mysql include files: right click on the project and choose properties. Expand ‘Configuration Properties’ then ‘C/C++’ and click ‘General’. On the right, add an include directory. Navigate to the Mysql include directory and add that.
  13. Add the module definition file: right click on the project and choose properties. Expand ‘Configuration Properties’ then ‘Linker’ and click ‘Input’. Add ‘udf_median.def’ to the key ‘Module Definition File’.
  14. Make sure VC knows this is a DLL: right click on the project and choose properties. Expand ‘Configuration Properties’ and click ‘General’. Choose ‘Dynamic Library (.dll)’ for Configuration Type. If you don’t do this, you’ll get errors like: error LNK2019: unresolved external symbol _WinMain because the compiler thinks you’re trying to build an application.
  15. Right click on the project and choose ‘Build’. This gives you a DLL in the Debug directory.
  16. Copy the DLL to the bin directory of your mysql installation.
  17. Create the function by logging in to mysql and running this command: CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.dll';. (The user you log in as will need to have the ability to insert rows into the mysql tables.)
  18. Test and enjoy.

Deploying the UDF to linux is much simpler, mostly because you don’t have to install a compiler, linker, etc. I used ‘gcc (GCC) 3.3.4′.

  1. Download the mysql-udf tarball.
  2. patch the files if you’re running a version of mysql greater than 4.1.1. patch available here, or the patched tarball is here.
  3. Untar the myslq-udf tarball. Patch if needed.
  4. Edit the udf_median.cc file and comment out the #ifdef HAVE_DLOPEN line as well as the corresponding #endif.
  5. Compile and link the code. Do not use the instructions on the mysql-udf homepage. If you compile with those flags, you’ll get this error when you try to add the function: mysql> CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.so';
    ERROR 1126 (HY000): Can't open shared library 'udf_median.so' (errno: 22 /usr/lib/udf_median.so: undefined symbol: _Znwj)
    . Rather, use the instructions in this bug report: gcc -shared -lstdc++ -I /usr/include -I /usr/local/include -I /usr/local/mysql/include/ -o udf_median.so udf_median.cc'
  6. Copy the shared library to a directory where mysql will see it. I put it in /usr/lib.
  7. Create the function by logging in to mysql and running this command: CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.so';. (The user you log in as will need to have the ability to insert rows into the mysql tables.)
  8. Test and enjoy.

We have pushed this UDF to production with replicated servers and haven’t seen any issues with it yet.

I want to extend my thanks to:

Technorati Tags: , ,

20 Comments »

  1. Wouter said,

    October 5, 2006 at 7:25 am

    There is one more kind of ‘Average’: Mathematical average.

    The mathematical average between n numbers is calculated as:
    The ‘nth root’ of all n numbers multiplied.

    Between 2 numbers that would be: sqrt(axb).
    The mathematical average of 8 and 12 = sqrt(8*12) = sqrt(96) = 9.79796.

  2. moore said,

    October 5, 2006 at 7:35 am

    It sounds to me like you’re talking about the geometric mean: http://en.wikipedia.org/wiki/Geometric_mean

    The mysql-udf package actually provides the geometric mean (a function called ‘geomean’), andif I ever did something like average price rise, it seems like I’d want to use that function. Not sure what else it is useful for–do you hae any further examples?

  3. Mark Z said,

    October 10, 2006 at 12:20 pm

    Dan,
    I followed all of the instructions and installed the function in mysql, but when I call the function, I get no results and no errors for that matter. Any ideas? Any help would be appreciated.

    Thanks
    Mark

  4. moore said,

    October 10, 2006 at 4:43 pm

    Hi Mark,

    What version of mysql are you running? Windows or unix? What happens when
    you run the create function sql call? Is there any error messages? Is
    there an entry in the mysql.func table?

  5. Dan Moore said,

    October 12, 2006 at 5:28 pm

    Ended up corresponding with Mark Z a few times. Turns out he was trying to run the median function on a column with type bigint. The median function only works on columns with type float or double, because the median_init function requires the column to have a type of REAL_RESULT. Otherwise, you get this message: “ERROR: median() requires a real as parameter 1.”Changing the column to a double fixed his problem.

    Again, I’m not a C programmer, but I don’t know why you couldn’t find the median of an int column. One thing you can do is coerce the column from an int to a double, like so: select median(inventory_left+0.0,0) from product.

    Incidentally, looking at the median.cc code, it appears you can give the function two arguments, a column of type double or float, and a number of decimal places.

  6. Ulrich Kadolsky said,

    November 27, 2006 at 5:13 am

    This was so helpful, thanks! I’ve just upgraded to mysql v5.0.26 (gentoo), and discovered median() no longer works (not even with the “+0.0″ trick).

    Fortunately there’s a solution - cast the first parameter to REAL_RESULT (see http://dev.mysql.com/doc/refman/5.0/en/udf-arguments.html), instead of returning an error.

    To do it yourself: comment out the body of the if statement at line 72 (”if (args->arg_type[0]!=REAL_RESULT)”), and add the following line instead:
    args->arg_type[0] = REAL_RESULT;

    I also got an error when I try to compile, so I also comment out line 32 (”#include “). Seems to work, and it no longer needs the “+0.0″! Once I’ve tested it fully I’ll try to get some code out.

  7. Erel Segal said,

    May 20, 2007 at 6:52 am

    Trying to compile the file according to instruction #5, I got the following erros:
    udf_median.cc:31:23: error: my_global.h: No such file or directory
    udf_median.cc:32:20: error: my_sys.h: No such file or directory
    udf_median.cc:34:19: error: mysql.h: No such file or directory
    udf_median.cc:35:21: error: m_ctype.h: No such file or directory
    udf_median.cc:36:24: error: m_string.h: No such file or directory

    They were gone after I changed to “I” directive, to point to the real location of the MySQL header files on my system:
    > gcc -shared -lstdc++ -I /usr/include/mysql -I /usr/local/include -I /usr/local/mysql/include/ -o udf_median.so udf_median.cc

    Other than that, everything worked fine. Thank you very much!

  8. Roland Bouman said,

    June 5, 2007 at 8:08 am

    Hi Dan,

    I’m glad my udf article helped you getting started with UDF’s on windows.

    In fact, I would like to invite you to take a look at the MySQL UDF Repository. This is a community driven project that tries to compile a comprehensive and complete repository of open source production-ready UDF’s for various platforms, including windows, linux and Mac OS X.

    You can find the Google Group here:
    http://groups.google.com/group/mysql-udf-repository
    The repository itself is located here:
    http://www.xcdsql.org/MySQL/UDF/

    kind regards,

    Roland Bouman

  9. moore said,

    June 6, 2007 at 9:46 am

    Hi Roland,

    I’ll take a look at the UDF repository. I don’t do a lot of UDF develpoment, other than when I have a specific need (as above), but I appreciate you building a community and a resource around UDFs.

  10. Yuan Pen said,

    July 20, 2007 at 2:16 pm

    Hi Dan,

    Thanks a lot for the information. I compiled the median function and installed on linux mysql 5.01. However, mysql complained that query batch completed with error(s) when I called the function. Do you have experience of port the function to mysql 5.X? Thanks.

    Yuan

  11. Yuan Pen said,

    July 20, 2007 at 4:01 pm

    Hi Dan,

    I would like to take my posted question back because after scan through the comments on this site, I was able to install median function to mysql 5.01. Actually, I even created a 95 percentile function based on the median function. I would like to post it if anyone is interested. Thanks a lot.

    Yuan

  12. arsin said,

    September 26, 2007 at 6:23 pm

    I would love to see the 95 percentile function ..
    Any chance of getting that?

  13. Spade Jac said,

    February 28, 2008 at 10:31 pm

    Hi Dan,

    I have followed all of your instructions but I now encounter the following error.mysql> CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME ‘udf_median.so’;
    ERROR 1126 (HY000): Can’t open shared library ‘udf_median.so’ (errno: 22 udf_median.so: cannot open shared object file: No such file or d)

    I made sure that the .so is found in /lib as well as /usr/lib. I had to go through recompilation using -fPIC for the .so. Do you know how to fix this?

    Thanks,
    Spade

  14. moore said,

    March 3, 2008 at 6:00 pm

    FYI, Jeff Beard has written a beautiful explanation (with screen shots) of building a UDF with Visual Studio 2008: http://www.cyberxape.com/blog/jeff/entry/udf_median_on_windows

  15. mike said,

    March 30, 2008 at 3:16 pm

    Hmm i loaded it on Windows on mysql 5.0.51a and when executing it returns invariably null regardless of what columns select. I tried it both under x86 and x64 versions of myql 5.0.51a.

    No errors in the mysql log or Windows log.

    Any ideas?

    Thanks!

  16. rt said,

    June 14, 2008 at 9:38 pm

    Guys - this was tremendously helpful!! I got up and running with this with mySQL 5.1.25-rc. Here are some notes I took that may be helpful for others when installing on Fedora core 8.

    I had some weird error when trying the gcc command - turns out I didn’t have gcc installed. Instead it with:
    yum install gcc-c++

    Also the instructions above do not mention it but you need to install the mySQL headers. The version I used was this:
    wget http://dev.mysql.com/get/Downloads/MySQL-5.1/MySQL-devel-5.1.25-0.glibc23.i386.rpm/from/http://mysql.mirrors.hoobly.com/

    Then I installed it with this:
    rpm -i MySQL-devel-5.1.25-0.glibc23.i386.rpm

    Then I got the patched souce for this udf:
    wget http://www.mooreds.com/files/mysql-udf-patched.tar.gz

    and extracted it:
    tar zxvf mysql-udf-patched.tar.gz
    cd mysql-udf

    Then some edits per the above notes:
    vi udf_median.cc
    # two things to edit:
    # 1) remove #ifdef HAVE_DLOPEN line and corresponding #endif at end of file
    # 2) comment out the body of the if statement at line 72 (”if (args->arg_type[0]!=REAL_RESULT)”),
    # and add the following line instead: args->arg_type[0] = REAL_RESULT;

    Then I had to change the include path to /usr/include/mysql (instead of /usr/mysql/include).

    gcc -shared -lstdc++ -I /usr/include -I /usr/local/include -I /usr/include/mysql -o udf_median.so udf_median.cc

    Then apparently under this version of mysql it looks for udfs in a plugin dir - at least, that was the err message I got when I first tried the create aggregate function command. So I created the dir it complained about and put the .so in there:

    mkdir -p /usr/lib/mysql/plugin/
    cp udf_median.so /usr/lib/mysql/plugin/

    CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME ‘udf_median.dll’;

    That worked. Then I do this:
    select median(myIntVal) from mydb;

    and BINGO it works like a charm!! I hope these notes are helpful to someone else like the rest of the above notes were so very helpful for me!

    The only down side is that this function is sluggish compared to simply doing select myIntVal from mydb order by myIntval limit halfTableSize,1. That query finishes in less than 0.5sec with 600k rows, whereas the median() query takes 15 seconds :(

    That being said the benefit of being able to get median as an aggreate makes it worth it (since I can now do select x,avg(myIntVal), median(myIntVal), std(myIntVal) like that…

    That being said, does anyone have any tips for how to speed up this median function? I did: create index xyz on mytable (myIntVal); but other than that what if anything can I do to speed this up? Has anyone looked at speeding up the .c code used in this UDF? Thanks again!!

  17. moore said,

    June 16, 2008 at 8:43 am

    Hi RT,

    Thanks for the feedback. I’m really glad this post helped you out, and appreciate the clarifying comments.

    I’m afraid I’m using the median function in a batch manner, so haven’t really spent any time looking at the performance. You might want to browse around the mysql-udf-repository google group: http://groups.google.com/group/mysql-udf-repository, or post this question there.

  18. Derek said,

    August 14, 2008 at 12:45 am

    Thank you all for the info on this page.

    It took me a day pulling my hair our, but have finaly got it to work on Windows / MySQL 5.0.41 / Visual C++ 2008 Express.

  19. moore said,

    August 14, 2008 at 11:09 am

    Derek,

    Glad to help. Any clues to help other folks using that combination of software?

  20. Pedro said,

    August 28, 2008 at 8:01 am

    Hello,

    I managd to compile and follow all the steps until the very end. When I type

    create aggregate function udf_median returns real soname ‘udf_median.dll’

    I get the following error message:
    ERROR 1127 (HY000): Can’t find function ‘’udf_median” in library.

    Can anyone help me with this error message?

    Best regards and thank you in advance,
    Pedro

Leave a Comment

*
To prove you're a person (not a spam script), type the security word shown in the picture.
Anti-Spam Image

© Moore Consulting, 2003-2008