I just re-read “How To Lie With Statistics”, which is so good I think it should be required reading in every middle school. In it, the author makes the point that there are three kinds of ‘averages’: arithmetic mean, median and mode (here I am, contributing to Wikipedia’s dominance, due to my laziness in looking up alternative definitions of statistical concepts). In general, the median is the most informative average, because it’s not skewed by a small number of outliers.But mysql (and other databases I’ve worked on) don’t natively supprt the medan, whereas I believe most support average (by which they mean ‘arithmetic mean’). Sure, you can use a stored procedure (as suggested here for PostgreSQL. However, I’m working with MySQL 4, which does not support stored procs. However, there is another solution: user defined functions. These seem like stored procedures, except you have to write them in C (or C++).

Now, I’m not a C programmer. Luckily, someone has written and released a set of mysql user defined functions that include median (as well as many other statistical manipulations). The bad news is that it hasn’t been updated for years. The good news is that with a bit of luck and many downloads, I was able to get the median function working on mysql, both on windows as a dll, and on linux as a shared library. To repeat, I am not a C programmer, so if you see any head thumping errors below, please let me know and I’ll update this document.

First off, I was working with these versions of mysql: c:\Program Files\MySQL\MySQL Server 4.1\bin\mysql.exe Ver 14.7 Distrib 4.1.10a, for Win95/Win98 (i32) and mysql Ver 14.7 Distrib 4.1.7, for pc-linux (i686)

To get median working on windows, you need to:

  1. Download the mysql-udf tarball.
  2. patch the files if you’re running a version of mysql greater than 4.1.1. patch available here, or the patched tarball is here.
  3. Download and install Visual C++ Express. (If you have a C compiler on Windows, you can skip this step and the next. Oh, and the ones following that will probably be different. (Here’s a blog post about creating a UDF using Visual Studio C++ 2003.)
  4. Download and install the platform SDK; I only followed through step 3. If you don’t, you’ll get ‘windows.h’ errors when you try to compile the UDF.
  5. Untar the myslq-udf tarball. Patch if needed.
  6. Install the mysql header files. I was able to do this via the Windows Installer, which let me modify my existing mysql installation; I had to add the ‘C Include Files / Lib Files’ feature.
  7. Create a new directory. Copy udf_median.cc from the untarred directory to this new directory.
  8. Create a new file in that directory called udf_median.def. This file contains all the methods the UDF is exporting. Or you can just download the file I used here.
  9. Open Visual C++ Express
  10. Choose File / New / Project From Existing Code. Hit Next. Browse to the directory you just created. Create a name for the project. Hit Finish
  11. Edit the udf_median.cc file and comment out the #ifdef HAVE_DLOPEN line as well as the corresponding #endif. If I didn’t do this, I kept getting link errors, as I guess everything between those preprocessor directives was not being compiled.
  12. Add the mysql include files: right click on the project and choose properties. Expand ‘Configuration Properties’ then ‘C/C++’ and click ‘General’. On the right, add an include directory. Navigate to the Mysql include directory and add that.
  13. Add the module definition file: right click on the project and choose properties. Expand ‘Configuration Properties’ then ‘Linker’ and click ‘Input’. Add ‘udf_median.def’ to the key ‘Module Definition File’.
  14. Make sure VC knows this is a DLL: right click on the project and choose properties. Expand ‘Configuration Properties’ and click ‘General’. Choose ‘Dynamic Library (.dll)’ for Configuration Type. If you don’t do this, you’ll get errors like: error LNK2019: unresolved external symbol _WinMain because the compiler thinks you’re trying to build an application.
  15. Right click on the project and choose ‘Build’. This gives you a DLL in the Debug directory.
  16. Copy the DLL to the bin directory of your mysql installation.
  17. Create the function by logging in to mysql and running this command: CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.dll';. (The user you log in as will need to have the ability to insert rows into the mysql tables.)
  18. Test and enjoy.

Deploying the UDF to linux is much simpler, mostly because you don’t have to install a compiler, linker, etc. I used ‘gcc (GCC) 3.3.4’.

  1. Download the mysql-udf tarball.
  2. patch the files if you’re running a version of mysql greater than 4.1.1. patch available here, or the patched tarball is here.
  3. Untar the myslq-udf tarball. Patch if needed.
  4. Edit the udf_median.cc file and comment out the #ifdef HAVE_DLOPEN line as well as the corresponding #endif.
  5. Compile and link the code. Do not use the instructions on the mysql-udf homepage. If you compile with those flags, you’ll get this error when you try to add the function: mysql> CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.so';
    ERROR 1126 (HY000): Can't open shared library 'udf_median.so' (errno: 22 /usr/lib/udf_median.so: undefined symbol: _Znwj)
    . Rather, use the instructions in this bug report: gcc -shared -lstdc++ -I /usr/include -I /usr/local/include -I /usr/local/mysql/include/ -o udf_median.so udf_median.cc'
  6. Copy the shared library to a directory where mysql will see it. I put it in /usr/lib.
  7. Create the function by logging in to mysql and running this command: CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.so';. (The user you log in as will need to have the ability to insert rows into the mysql tables.)
  8. Test and enjoy.

We have pushed this UDF to production with replicated servers and haven’t seen any issues with it yet.

I want to extend my thanks to:

Technorati Tags: , ,

46 thoughts on “Installing the median user defined function on MySQL

  1. Wouter says:

    There is one more kind of ‘Average’: Mathematical average.

    The mathematical average between n numbers is calculated as:
    The ‘nth root’ of all n numbers multiplied.

    Between 2 numbers that would be: sqrt(axb).
    The mathematical average of 8 and 12 = sqrt(8*12) = sqrt(96) = 9.79796.

  2. moore says:

    It sounds to me like you’re talking about the geometric mean: http://en.wikipedia.org/wiki/Geometric_mean

    The mysql-udf package actually provides the geometric mean (a function called ‘geomean’), andif I ever did something like average price rise, it seems like I’d want to use that function. Not sure what else it is useful for–do you hae any further examples?

  3. Mark Z says:

    Dan,
    I followed all of the instructions and installed the function in mysql, but when I call the function, I get no results and no errors for that matter. Any ideas? Any help would be appreciated.

    Thanks
    Mark

  4. moore says:

    Hi Mark,

    What version of mysql are you running? Windows or unix? What happens when
    you run the create function sql call? Is there any error messages? Is
    there an entry in the mysql.func table?

  5. Dan Moore says:

    Ended up corresponding with Mark Z a few times. Turns out he was trying to run the median function on a column with type bigint. The median function only works on columns with type float or double, because the median_init function requires the column to have a type of REAL_RESULT. Otherwise, you get this message: “ERROR: median() requires a real as parameter 1.”Changing the column to a double fixed his problem.

    Again, I’m not a C programmer, but I don’t know why you couldn’t find the median of an int column. One thing you can do is coerce the column from an int to a double, like so: select median(inventory_left+0.0,0) from product.

    Incidentally, looking at the median.cc code, it appears you can give the function two arguments, a column of type double or float, and a number of decimal places.

  6. Ulrich Kadolsky says:

    This was so helpful, thanks! I’ve just upgraded to mysql v5.0.26 (gentoo), and discovered median() no longer works (not even with the “+0.0” trick).

    Fortunately there’s a solution – cast the first parameter to REAL_RESULT (see http://dev.mysql.com/doc/refman/5.0/en/udf-arguments.html), instead of returning an error.

    To do it yourself: comment out the body of the if statement at line 72 (“if (args->arg_type[0]!=REAL_RESULT)”), and add the following line instead:
    args->arg_type[0] = REAL_RESULT;

    I also got an error when I try to compile, so I also comment out line 32 (“#include “). Seems to work, and it no longer needs the “+0.0”! Once I’ve tested it fully I’ll try to get some code out.

  7. Erel Segal says:

    Trying to compile the file according to instruction #5, I got the following erros:
    udf_median.cc:31:23: error: my_global.h: No such file or directory
    udf_median.cc:32:20: error: my_sys.h: No such file or directory
    udf_median.cc:34:19: error: mysql.h: No such file or directory
    udf_median.cc:35:21: error: m_ctype.h: No such file or directory
    udf_median.cc:36:24: error: m_string.h: No such file or directory

    They were gone after I changed to “I” directive, to point to the real location of the MySQL header files on my system:
    > gcc -shared -lstdc++ -I /usr/include/mysql -I /usr/local/include -I /usr/local/mysql/include/ -o udf_median.so udf_median.cc

    Other than that, everything worked fine. Thank you very much!

  8. Roland Bouman says:

    Hi Dan,

    I’m glad my udf article helped you getting started with UDF’s on windows.

    In fact, I would like to invite you to take a look at the MySQL UDF Repository. This is a community driven project that tries to compile a comprehensive and complete repository of open source production-ready UDF’s for various platforms, including windows, linux and Mac OS X.

    You can find the Google Group here:
    http://groups.google.com/group/mysql-udf-repository
    The repository itself is located here:
    http://www.xcdsql.org/MySQL/UDF/

    kind regards,

    Roland Bouman

  9. moore says:

    Hi Roland,

    I’ll take a look at the UDF repository. I don’t do a lot of UDF develpoment, other than when I have a specific need (as above), but I appreciate you building a community and a resource around UDFs.

  10. Yuan Pen says:

    Hi Dan,

    Thanks a lot for the information. I compiled the median function and installed on linux mysql 5.01. However, mysql complained that query batch completed with error(s) when I called the function. Do you have experience of port the function to mysql 5.X? Thanks.

    Yuan

  11. Yuan Pen says:

    Hi Dan,

    I would like to take my posted question back because after scan through the comments on this site, I was able to install median function to mysql 5.01. Actually, I even created a 95 percentile function based on the median function. I would like to post it if anyone is interested. Thanks a lot.

    Yuan

  12. arsin says:

    I would love to see the 95 percentile function ..
    Any chance of getting that?

  13. Spade Jac says:

    Hi Dan,

    I have followed all of your instructions but I now encounter the following error.mysql> CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME ‘udf_median.so’;
    ERROR 1126 (HY000): Can’t open shared library ‘udf_median.so’ (errno: 22 udf_median.so: cannot open shared object file: No such file or d)

    I made sure that the .so is found in /lib as well as /usr/lib. I had to go through recompilation using -fPIC for the .so. Do you know how to fix this?

    Thanks,
    Spade

  14. mike says:

    Hmm i loaded it on Windows on mysql 5.0.51a and when executing it returns invariably null regardless of what columns select. I tried it both under x86 and x64 versions of myql 5.0.51a.

    No errors in the mysql log or Windows log.

    Any ideas?

    Thanks!

  15. rt says:

    Guys – this was tremendously helpful!! I got up and running with this with mySQL 5.1.25-rc. Here are some notes I took that may be helpful for others when installing on Fedora core 8.

    I had some weird error when trying the gcc command – turns out I didn’t have gcc installed. Instead it with:
    yum install gcc-c++

    Also the instructions above do not mention it but you need to install the mySQL headers. The version I used was this:
    wget http://dev.mysql.com/get/Downloads/MySQL-5.1/MySQL-devel-5.1.25-0.glibc23.i386.rpm/from/http://mysql.mirrors.hoobly.com/

    Then I installed it with this:
    rpm -i MySQL-devel-5.1.25-0.glibc23.i386.rpm

    Then I got the patched souce for this udf:
    wget http://www.mooreds.com/files/mysql-udf-patched.tar.gz

    and extracted it:
    tar zxvf mysql-udf-patched.tar.gz
    cd mysql-udf

    Then some edits per the above notes:
    vi udf_median.cc
    # two things to edit:
    # 1) remove #ifdef HAVE_DLOPEN line and corresponding #endif at end of file
    # 2) comment out the body of the if statement at line 72 (”if (args->arg_type[0]!=REAL_RESULT)”),
    # and add the following line instead: args->arg_type[0] = REAL_RESULT;

    Then I had to change the include path to /usr/include/mysql (instead of /usr/mysql/include).

    gcc -shared -lstdc++ -I /usr/include -I /usr/local/include -I /usr/include/mysql -o udf_median.so udf_median.cc

    Then apparently under this version of mysql it looks for udfs in a plugin dir – at least, that was the err message I got when I first tried the create aggregate function command. So I created the dir it complained about and put the .so in there:

    mkdir -p /usr/lib/mysql/plugin/
    cp udf_median.so /usr/lib/mysql/plugin/

    CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME ‘udf_median.dll’;

    That worked. Then I do this:
    select median(myIntVal) from mydb;

    and BINGO it works like a charm!! I hope these notes are helpful to someone else like the rest of the above notes were so very helpful for me!

    The only down side is that this function is sluggish compared to simply doing select myIntVal from mydb order by myIntval limit halfTableSize,1. That query finishes in less than 0.5sec with 600k rows, whereas the median() query takes 15 seconds 🙁

    That being said the benefit of being able to get median as an aggreate makes it worth it (since I can now do select x,avg(myIntVal), median(myIntVal), std(myIntVal) like that…

    That being said, does anyone have any tips for how to speed up this median function? I did: create index xyz on mytable (myIntVal); but other than that what if anything can I do to speed this up? Has anyone looked at speeding up the .c code used in this UDF? Thanks again!!

  16. moore says:

    Hi RT,

    Thanks for the feedback. I’m really glad this post helped you out, and appreciate the clarifying comments.

    I’m afraid I’m using the median function in a batch manner, so haven’t really spent any time looking at the performance. You might want to browse around the mysql-udf-repository google group: http://groups.google.com/group/mysql-udf-repository, or post this question there.

  17. Derek says:

    Thank you all for the info on this page.

    It took me a day pulling my hair our, but have finaly got it to work on Windows / MySQL 5.0.41 / Visual C++ 2008 Express.

  18. moore says:

    Derek,

    Glad to help. Any clues to help other folks using that combination of software?

  19. Pedro says:

    Hello,

    I managd to compile and follow all the steps until the very end. When I type

    create aggregate function udf_median returns real soname ‘udf_median.dll’

    I get the following error message:
    ERROR 1127 (HY000): Can’t find function ”udf_median” in library.

    Can anyone help me with this error message?

    Best regards and thank you in advance,
    Pedro

  20. Tom S. says:

    If you would like to supply INT arguments without explicit coersion, udf_median.cc can be modified like so…
    REPLACE THIS:
    if (args->arg_type[0]!=REAL_RESULT)
    {
    strcpy(message,”median() requires a real as parameter 1″);
    return 1;
    }

    WITH THIS:
    args->arg_type[0] = REAL_RESULT;

    Which forces mysql to do any needed coersion.

    NOTE 1: Result is DOUBLE regardless of argument type supplied
    NOTE 2: I tested successfully, but use at your own risk.

    Why did I bother with this if result is DOUBLE anyway?
    Explicit coersion of the input arguments did not work for me. Any attempt I made was producing DECIMAL fields, which were not accepted by udf_median. Don’t know why, is it related to mysql version?? (5.0.51a)

  21. Vlad Zelenko says:

    Dan, great post!!!

    I have followed your instructions but ran into the following issue.

    Environment:
    ———–
    uname -a
    Linux 2.4.21-32.0.1.ELsmp #1 SMP Tue May 17 17:52:23 EDT 2005 i686 i686 i386 GNU/Linux

    cat /etc/issue
    Red Hat Enterprise Linux AS release 3 (Taroon Update 5)
    Kernel \r on an \m

    mysql -V
    mysql Ver 14.7 Distrib 4.1.15, for pc-linux-gnu (i686) using readline 4.3

    gcc -v
    Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/3.2.3/specs
    Configured with: ../configure –prefix=/usr –mandir=/usr/share/man –infodir=/usr/share/info –enable-shared –enable-threads=posix –disable-checking –with-system-zlib –enable-__cxa_atexit –host=i386-redhat-linux
    Thread model: posix
    gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-49)

    The SO library was compiled as specified and placed in LD_LIBRARY_PATH visible directory, mysqld bounced.

    Problem:

    mysql> CREATE FUNCTION median RETURNS REAL SONAME ‘udf_median.so’;
    ERROR 2006 (HY000): MySQL server has gone away
    No connection. Trying to reconnect…
    Connection id: 1
    Current database: ws_data

    ERROR 2013 (HY000): Lost connection to MySQL server during query
    mysql>
    Number of processes running now: 0
    080926 09:26:10 mysqld restarted

    I am not a linux, MySQL, or c++ expert, so I need some help. 🙂

    Thanks,

    – vlad

  22. grk says:

    Dan, rt,

    I updated the median-udf patch with an optional parameter to compute percentiles.

    median(variable) returns the standard median
    median(variable, i) returns the standard median to i decimals
    median(variable, i, percentile) returns the percentile to i decimals

    percentile can be a value between 0 and 100. Default value is 50 for the standard median.

    The patch is posted here on sourceforge:
    http://sourceforge.net/tracker/?func=detail&aid=2766201&group_id=114041&atid=667000

    A few notes:
    1. The patch has been tested only on Linux. It includes the changes recommended by rt in the post on June 14, 2008 above.

    2. gcc required an additional -fPIC parameter in order to go through

    gcc -shared -lstdc++ -fPIC -I /usr/include -I /usr/local/include -I /usr/include/mysql -o udf_median.so udf_median.cc

    I found that with 64-bit mysql, the median-udf.so file has to be copied to /usr/lib64

    3. Remaining instructions in rt’s post remain the same

    Hope this helps!

  23. Oliver says:

    Hi Folks –

    I am running Ubuntu 9.04 with MySQL 5.0.75 and have successfully compiled a few different flavors of the udf_median UDF on my machine (including the variant above).

    Everything works just fine when I calculate the median of a set of numbers greater than 1. However, when I try to calculate the median of a set of values which are floating point numbers around the number 1 (e.g. 0.9873, 1.0023, 1.23, 0.6753), I only ever get a value of 1 returned from the function. I can never obtain a value with precision to the right of the decimal point.

    Any thoughts on what might be the cause of this?

    Thanks!

  24. Oliver says:

    Scratch that last comment – I figured it out – user error on my end.

  25. Sumedh says:

    1. I had to do this on Ubuntu…so had to install header files as –
    sudo apt-get install libmysql++-dev

    2. I had to install g++

    3. I had to change compiler options to -fPIC. Honestly, I don’t know why…compiler said – relocation R_X86_64_32 against `compare_doubles(void const*, void const*)’ can not be used when making a shared object; recompile with -fPIC
    /tmp/ccS3AYK9.o: could not read symbols: Bad value

    Putting the .so file to /usr/lib seems to be working fine…I have just tested it with simple integer data.

    Thanks for a very informative post 🙂

  26. moore says:

    Sumedh,

    Thanks for adding instructions for Ubuntu.

  27. Jo Yo says:

    I’m working on JsReduce which is going to be a special aggregate function. I have JsMap working and you can check it out now: http://bit.ly/cKA8np

    Ideally, this will be ultra flexible for prototyping reports and getting things done.

  28. moore says:

    Hi Jo Yo,

    JsMap looks really cool!

    Thanks.

  29. Dave says:

    Thank you!

    For what it’s worth, here are the instructions I used on Kubuntu Linux, after installing MySQL and its development libraries:

    for i in *.cc; do
    gcc -Wall -fPIC -DHAVE_DLOPEN -I/usr/include/mysql -c $i -o $(basename $i .cc).o;
    ld -shared -o $(basename $i .cc).so $(basename $i .cc).o;
    done

    This way you install everything and don’t have to comment out any flags. Use the -D parameter to set the flag while compiling. Then you can do:

    CREATE AGGREGATE FUNCTION slope RETURNS REAL SONAME ‘udf_slope.so’;

    And it worked for me!

  30. moore says:

    Thanks for the additional instructions, Dave!

  31. Thomas says:

    This was really helpful.

    BTW:
    If anyone encounters the following error(ubuntu) while trying to insert the UDF in mysql:
    “failed to map segment from shared object: Permission denied”

    you can fix it by stop apparmor
    cd /etc/init.d/
    ./apparmor stop

    then insert the udf and
    ./apparmor start

    Thanks!

  32. moore says:

    Thanks Thomas!

  33. moore says:

    Has anyone managed to get this UDF to work with Windows 7? I tried, but couldn’t get it working, and would love any tips.

  34. Ven says:

    I am trying to compile using the command: gcc -shared -lstdc++ -I /usr/include -I /usr/local/include -I /usr/local/mysql/include/ -o udf_median.so udf_median.cc on Mac OS X 10.4.11. I get the following error:

    i686-apple-darwin8-gcc-4.0.1: unrecognized option ‘-shared’
    /usr/bin/ld: Undefined symbols:
    _main
    collect2: ld returned 1 exit status

    The Darwin kernel version is 8.11.1. Any thoughts?

  35. moore says:

    Thanks Jeff!

  36. hans says:

    question – is this code finding the true median when having a set of even number of items? it seems to select the lower of the 2 middle items. instead, should it not return the average of the 2 middle items?

  37. moore says:

    Hi Hans,

    I don’t think so. Here’s the relevant section of the code:
    qsort(buffer->values,buffer->abscount,sizeof(double),compare_doubles);

    if (buffer->abscount&1==1)
    {
    return buffer->values[(buffer->abscount-1)/2];
    } else
    {
    return (buffer->values[(buffer->abscount-2)/2]+buffer->values[buffer->abscount/2])/2.0;
    }

    You can see if that if there are an odd number of elements, we return the true median. If there are an even number,
    For example, if abscount is 10, then (abscount – 2)/2 is 4, and abscount/2 is 5, then we add the numbers at the 4th and 5th positions together.

    Here’s an example:

    mysql> insert into foo values (1);
    Query OK, 1 row affected (0.00 sec)

    mysql> insert into foo values (2);
    Query OK, 1 row affected (0.00 sec)

    mysql> insert into foo values (3);
    Query OK, 1 row affected (0.00 sec)

    mysql> insert into foo values (4);
    Query OK, 1 row affected (0.00 sec)

    mysql> select median(bar) from foo;
    +-------------+
    | median(bar) |
    +-------------+
    | 2.50 |
    +-------------+
    1 row in set (0.00 sec)

  38. Kai says:

    Hi,
    I compiled with ‘gcc -shared -fPIC -lstdc++ -I /usr/include -I /usr/local/include -I /usr/local/mysql/include/ -o /usr/lib/mysql/plugin/udf_intercept.so udf_intercept.cc’ and I got (after long painfull time) an udf_intercept.so. succesfully, but I cant create the function , i get the ‘xxx_clear’ error: #1127 – Can’t find symbol ‘intercept_clear’ in library.

    I found something here but not yet helpfull for me:
    “it seems that the non-aggregate functions will still work in MySQL 4.1.1+, but the aggregate functions need some additional hook-in to work (the xxx_clear procedure).”
    http://phpforum.de/forum/showthread.php?t=194073
    Any idea?
    Kai
    ps: I have ubuntu 10.04 and Mysql 5.1.41.

  39. trist says:

    I would have liked writing my own median (from an algorithm I found a research paper about) using the files in this post as a starting point. I am however not able to add the function to mysql. I am using Fedora 18 and mySQL 5.5.29. I get the following error message:

    ERROR 1126 (HY000): Can’t open shared library ‘my_udf_example.so’ (errno: 0 /usr/lib64/mysql/plugin/my_udf_example.so: cannot open shared object file: Permission denied)

    I have ensured that the my_udf_example.so file has “r-x” permission for everybody (and so have the mysql plugin directory where I put it). I know mysql is finding it because if I remove it I get another error message saying it cant be found.

    All sugestions are apreciated!

  40. moore says:

    Hi Trist,

    Did you ever solve this? I’d probably try to install another mysql plugin and see if I got the same error message.

  41. Yashwanth says:

    Hi I was able to create the dll file and when i try to create the function in mYSQL i am getting this error

    Error Code: 1126. Can’t open shared library ‘mymedian.dll’ (errno: 126 The specified module could not be found.)

    has anyone faced this issue, if yes please share the remedies or suggestions.

     

    Thanks

    Yashwanth

  42. moore says:

    Hi Yashwanth. Where did you place the DLL file? Is it readable by the mysql process user? Did you restart mysql?

Comments are closed.


© Moore Consulting, 2003-2017 +