Wed, 27 Sep 06
Installing the median user defined function on MySQL
I just re-read “How To Lie With Statistics”, which is so good I think it should be required reading in every middle school. In it, the author makes the point that there are three kinds of ‘averages’: arithmetic mean, median and mode (here I am, contributing to Wikipedia’s dominance, due to my laziness in looking up alternative definitions of statistical concepts). In general, the median is the most informative average, because it’s not skewed by a small number of outliers.But mysql (and other databases I’ve worked on) don’t natively supprt the medan, whereas I believe most support average (by which they mean ‘arithmetic mean’). Sure, you can use a stored procedure (as suggested here for PostgreSQL. However, I’m working with MySQL 4, which does not support stored procs. However, there is another solution: user defined functions. These seem like stored procedures, except you have to write them in C (or C++).
Now, I’m not a C programmer. Luckily, someone has written and released a set of mysql user defined functions that include median (as well as many other statistical manipulations). The bad news is that it hasn’t been updated for years. The good news is that with a bit of luck and many downloads, I was able to get the median function working on mysql, both on windows as a dll, and on linux as a shared library. To repeat, I am not a C programmer, so if you see any head thumping errors below, please let me know and I’ll update this document.
First off, I was working with these versions of mysql: c:\Program Files\MySQL\MySQL Server 4.1\bin\mysql.exe Ver 14.7 Distrib 4.1.10a, for Win95/Win98 (i32) and mysql Ver 14.7 Distrib 4.1.7, for pc-linux (i686)
To get median working on windows, you need to:
- Download the mysql-udf tarball.
- patch the files if you’re running a version of mysql greater than 4.1.1. patch available here, or the patched tarball is here.
- Download and install Visual C++ Express. (If you have a C compiler on Windows, you can skip this step and the next. Oh, and the ones following that will probably be different. (Here’s a blog post about creating a UDF using Visual Studio C++ 2003.)
- Download and install the platform SDK; I only followed through step 3. If you don’t, you’ll get ‘windows.h’ errors when you try to compile the UDF.
- Untar the myslq-udf tarball. Patch if needed.
- Install the mysql header files. I was able to do this via the Windows Installer, which let me modify my existing mysql installation; I had to add the ‘C Include Files / Lib Files’ feature.
- Create a new directory. Copy
udf_median.ccfrom the untarred directory to this new directory. - Create a new file in that directory called
udf_median.def. This file contains all the methods the UDF is exporting. Or you can just download the file I used here. - Open Visual C++ Express
- Choose File / New / Project From Existing Code. Hit Next. Browse to the directory you just created. Create a name for the project. Hit Finish
- Edit the
udf_median.ccfile and comment out the#ifdef HAVE_DLOPENline as well as the corresponding#endif. If I didn’t do this, I kept getting link errors, as I guess everything between those preprocessor directives was not being compiled. - Add the mysql include files: right click on the project and choose properties. Expand ‘Configuration Properties’ then ‘C/C++’ and click ‘General’. On the right, add an include directory. Navigate to the Mysql include directory and add that.
- Add the module definition file: right click on the project and choose properties. Expand ‘Configuration Properties’ then ‘Linker’ and click ‘Input’. Add ‘udf_median.def’ to the key ‘Module Definition File’.
- Make sure VC knows this is a DLL: right click on the project and choose properties. Expand ‘Configuration Properties’ and click ‘General’. Choose ‘Dynamic Library (.dll)’ for Configuration Type. If you don’t do this, you’ll get errors like:
error LNK2019: unresolved external symbol _WinMainbecause the compiler thinks you’re trying to build an application. - Right click on the project and choose ‘Build’. This gives you a DLL in the Debug directory.
- Copy the DLL to the bin directory of your mysql installation.
- Create the function by logging in to mysql and running this command:
CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.dll';. (The user you log in as will need to have the ability to insert rows into the mysql tables.) - Test and enjoy.
Deploying the UDF to linux is much simpler, mostly because you don’t have to install a compiler, linker, etc. I used ‘gcc (GCC) 3.3.4′.
- Download the mysql-udf tarball.
- patch the files if you’re running a version of mysql greater than 4.1.1. patch available here, or the patched tarball is here.
- Untar the myslq-udf tarball. Patch if needed.
- Edit the
udf_median.ccfile and comment out the#ifdef HAVE_DLOPENline as well as the corresponding#endif. - Compile and link the code. Do not use the instructions on the mysql-udf homepage. If you compile with those flags, you’ll get this error when you try to add the function:
mysql> CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.so';. Rather, use the instructions in this bug report:
ERROR 1126 (HY000): Can't open shared library 'udf_median.so' (errno: 22 /usr/lib/udf_median.so: undefined symbol: _Znwj)gcc -shared -lstdc++ -I /usr/include -I /usr/local/include -I /usr/local/mysql/include/ -o udf_median.so udf_median.cc' - Copy the shared library to a directory where mysql will see it. I put it in
/usr/lib. - Create the function by logging in to mysql and running this command:
CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.so';. (The user you log in as will need to have the ability to insert rows into the mysql tables.) - Test and enjoy.
We have pushed this UDF to production with replicated servers and haven’t seen any issues with it yet.
I want to extend my thanks to:
- Jan Steemann, the creator of the mysql-udf project
- Redy Rodriguez and sergivs for the patch and the compiler flag help, respectively
- Roland Bouman, for a great tutorial on building UDFs for Windows
[tags]median, mysql, user defined functions[/tags]
« Monday Consulting Question series | Permalink | A serious Google Maps bug »
Wouter said,
October 5, 2006 at 7:25 am
There is one more kind of ‘Average’: Mathematical average.
The mathematical average between n numbers is calculated as:
The ‘nth root’ of all n numbers multiplied.
Between 2 numbers that would be: sqrt(axb).
The mathematical average of 8 and 12 = sqrt(8*12) = sqrt(96) = 9.79796.
moore said,
October 5, 2006 at 7:35 am
It sounds to me like you’re talking about the geometric mean: http://en.wikipedia.org/wiki/Geometric_mean
The mysql-udf package actually provides the geometric mean (a function called ‘geomean’), andif I ever did something like average price rise, it seems like I’d want to use that function. Not sure what else it is useful for–do you hae any further examples?
Mark Z said,
October 10, 2006 at 12:20 pm
Dan,
I followed all of the instructions and installed the function in mysql, but when I call the function, I get no results and no errors for that matter. Any ideas? Any help would be appreciated.
Thanks
Mark
moore said,
October 10, 2006 at 4:43 pm
Hi Mark,
What version of mysql are you running? Windows or unix? What happens when
you run the create function sql call? Is there any error messages? Is
there an entry in the mysql.func table?
Dan Moore said,
October 12, 2006 at 5:28 pm
Ended up corresponding with Mark Z a few times. Turns out he was trying to run the median function on a column with type bigint. The median function only works on columns with type float or double, because the median_init function requires the column to have a type of REAL_RESULT. Otherwise, you get this message: “ERROR: median() requires a real as parameter 1.”Changing the column to a double fixed his problem.
Again, I’m not a C programmer, but I don’t know why you couldn’t find the median of an int column. One thing you can do is coerce the column from an int to a double, like so: select median(inventory_left+0.0,0) from product.
Incidentally, looking at the median.cc code, it appears you can give the function two arguments, a column of type double or float, and a number of decimal places.
Ulrich Kadolsky said,
November 27, 2006 at 5:13 am
This was so helpful, thanks! I’ve just upgraded to mysql v5.0.26 (gentoo), and discovered median() no longer works (not even with the “+0.0″ trick).
Fortunately there’s a solution – cast the first parameter to REAL_RESULT (see http://dev.mysql.com/doc/refman/5.0/en/udf-arguments.html), instead of returning an error.
To do it yourself: comment out the body of the if statement at line 72 (“if (args->arg_type[0]!=REAL_RESULT)”), and add the following line instead:
args->arg_type[0] = REAL_RESULT;
I also got an error when I try to compile, so I also comment out line 32 (“#include “). Seems to work, and it no longer needs the “+0.0″! Once I’ve tested it fully I’ll try to get some code out.
Erel Segal said,
May 20, 2007 at 6:52 am
Trying to compile the file according to instruction #5, I got the following erros:
udf_median.cc:31:23: error: my_global.h: No such file or directory
udf_median.cc:32:20: error: my_sys.h: No such file or directory
udf_median.cc:34:19: error: mysql.h: No such file or directory
udf_median.cc:35:21: error: m_ctype.h: No such file or directory
udf_median.cc:36:24: error: m_string.h: No such file or directory
They were gone after I changed to “I” directive, to point to the real location of the MySQL header files on my system:
> gcc -shared -lstdc++ -I /usr/include/mysql -I /usr/local/include -I /usr/local/mysql/include/ -o udf_median.so udf_median.cc
Other than that, everything worked fine. Thank you very much!
Roland Bouman said,
June 5, 2007 at 8:08 am
Hi Dan,
I’m glad my udf article helped you getting started with UDF’s on windows.
In fact, I would like to invite you to take a look at the MySQL UDF Repository. This is a community driven project that tries to compile a comprehensive and complete repository of open source production-ready UDF’s for various platforms, including windows, linux and Mac OS X.
You can find the Google Group here:
http://groups.google.com/group/mysql-udf-repository
The repository itself is located here:
http://www.xcdsql.org/MySQL/UDF/
kind regards,
Roland Bouman
moore said,
June 6, 2007 at 9:46 am
Hi Roland,
I’ll take a look at the UDF repository. I don’t do a lot of UDF develpoment, other than when I have a specific need (as above), but I appreciate you building a community and a resource around UDFs.
Yuan Pen said,
July 20, 2007 at 2:16 pm
Hi Dan,
Thanks a lot for the information. I compiled the median function and installed on linux mysql 5.01. However, mysql complained that query batch completed with error(s) when I called the function. Do you have experience of port the function to mysql 5.X? Thanks.
Yuan
Yuan Pen said,
July 20, 2007 at 4:01 pm
Hi Dan,
I would like to take my posted question back because after scan through the comments on this site, I was able to install median function to mysql 5.01. Actually, I even created a 95 percentile function based on the median function. I would like to post it if anyone is interested. Thanks a lot.
Yuan
arsin said,
September 26, 2007 at 6:23 pm
I would love to see the 95 percentile function ..
Any chance of getting that?
Spade Jac said,
February 28, 2008 at 10:31 pm
Hi Dan,
I have followed all of your instructions but I now encounter the following error.mysql> CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME ‘udf_median.so’;
ERROR 1126 (HY000): Can’t open shared library ‘udf_median.so’ (errno: 22 udf_median.so: cannot open shared object file: No such file or d)
I made sure that the .so is found in /lib as well as /usr/lib. I had to go through recompilation using -fPIC for the .so. Do you know how to fix this?
Thanks,
Spade
moore said,
March 3, 2008 at 6:00 pm
FYI, Jeff Beard has written a beautiful explanation (with screen shots) of building a UDF with Visual Studio 2008: http://www.cyberxape.com/blog/jeff/entry/udf_median_on_windows
mike said,
March 30, 2008 at 3:16 pm
Hmm i loaded it on Windows on mysql 5.0.51a and when executing it returns invariably null regardless of what columns select. I tried it both under x86 and x64 versions of myql 5.0.51a.
No errors in the mysql log or Windows log.
Any ideas?
Thanks!
rt said,
June 14, 2008 at 9:38 pm
Guys – this was tremendously helpful!! I got up and running with this with mySQL 5.1.25-rc. Here are some notes I took that may be helpful for others when installing on Fedora core 8.
I had some weird error when trying the gcc command – turns out I didn’t have gcc installed. Instead it with:
yum install gcc-c++
Also the instructions above do not mention it but you need to install the mySQL headers. The version I used was this:
wget http://dev.mysql.com/get/Downloads/MySQL-5.1/MySQL-devel-5.1.25-0.glibc23.i386.rpm/from/http://mysql.mirrors.hoobly.com/
Then I installed it with this:
rpm -i MySQL-devel-5.1.25-0.glibc23.i386.rpm
Then I got the patched souce for this udf:
wget http://www.mooreds.com/files/mysql-udf-patched.tar.gz
and extracted it:
tar zxvf mysql-udf-patched.tar.gz
cd mysql-udf
Then some edits per the above notes:
vi udf_median.cc
# two things to edit:
# 1) remove #ifdef HAVE_DLOPEN line and corresponding #endif at end of file
# 2) comment out the body of the if statement at line 72 (”if (args->arg_type[0]!=REAL_RESULT)”),
# and add the following line instead: args->arg_type[0] = REAL_RESULT;
Then I had to change the include path to /usr/include/mysql (instead of /usr/mysql/include).
gcc -shared -lstdc++ -I /usr/include -I /usr/local/include -I /usr/include/mysql -o udf_median.so udf_median.cc
Then apparently under this version of mysql it looks for udfs in a plugin dir – at least, that was the err message I got when I first tried the create aggregate function command. So I created the dir it complained about and put the .so in there:
mkdir -p /usr/lib/mysql/plugin/
cp udf_median.so /usr/lib/mysql/plugin/
CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME ‘udf_median.dll’;
That worked. Then I do this:
select median(myIntVal) from mydb;
and BINGO it works like a charm!! I hope these notes are helpful to someone else like the rest of the above notes were so very helpful for me!
The only down side is that this function is sluggish compared to simply doing select myIntVal from mydb order by myIntval limit halfTableSize,1. That query finishes in less than 0.5sec with 600k rows, whereas the median() query takes 15 seconds
That being said the benefit of being able to get median as an aggreate makes it worth it (since I can now do select x,avg(myIntVal), median(myIntVal), std(myIntVal) like that…
That being said, does anyone have any tips for how to speed up this median function? I did: create index xyz on mytable (myIntVal); but other than that what if anything can I do to speed this up? Has anyone looked at speeding up the .c code used in this UDF? Thanks again!!
moore said,
June 16, 2008 at 8:43 am
Hi RT,
Thanks for the feedback. I’m really glad this post helped you out, and appreciate the clarifying comments.
I’m afraid I’m using the median function in a batch manner, so haven’t really spent any time looking at the performance. You might want to browse around the mysql-udf-repository google group: http://groups.google.com/group/mysql-udf-repository, or post this question there.
Derek said,
August 14, 2008 at 12:45 am
Thank you all for the info on this page.
It took me a day pulling my hair our, but have finaly got it to work on Windows / MySQL 5.0.41 / Visual C++ 2008 Express.
moore said,
August 14, 2008 at 11:09 am
Derek,
Glad to help. Any clues to help other folks using that combination of software?
Pedro said,
August 28, 2008 at 8:01 am
Hello,
I managd to compile and follow all the steps until the very end. When I type
create aggregate function udf_median returns real soname ‘udf_median.dll’
I get the following error message:
ERROR 1127 (HY000): Can’t find function ”udf_median” in library.
Can anyone help me with this error message?
Best regards and thank you in advance,
Pedro
Tom S. said,
September 12, 2008 at 11:44 am
If you would like to supply INT arguments without explicit coersion, udf_median.cc can be modified like so…
REPLACE THIS:
if (args->arg_type[0]!=REAL_RESULT)
{
strcpy(message,”median() requires a real as parameter 1″);
return 1;
}
WITH THIS:
args->arg_type[0] = REAL_RESULT;
Which forces mysql to do any needed coersion.
NOTE 1: Result is DOUBLE regardless of argument type supplied
NOTE 2: I tested successfully, but use at your own risk.
Why did I bother with this if result is DOUBLE anyway?
Explicit coersion of the input arguments did not work for me. Any attempt I made was producing DECIMAL fields, which were not accepted by udf_median. Don’t know why, is it related to mysql version?? (5.0.51a)
Vlad Zelenko said,
September 26, 2008 at 7:38 am
Dan, great post!!!
I have followed your instructions but ran into the following issue.
Environment:
———–
uname -a
Linux 2.4.21-32.0.1.ELsmp #1 SMP Tue May 17 17:52:23 EDT 2005 i686 i686 i386 GNU/Linux
cat /etc/issue
Red Hat Enterprise Linux AS release 3 (Taroon Update 5)
Kernel \r on an \m
mysql -V
mysql Ver 14.7 Distrib 4.1.15, for pc-linux-gnu (i686) using readline 4.3
gcc -v
Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/3.2.3/specs
Configured with: ../configure –prefix=/usr –mandir=/usr/share/man –infodir=/usr/share/info –enable-shared –enable-threads=posix –disable-checking –with-system-zlib –enable-__cxa_atexit –host=i386-redhat-linux
Thread model: posix
gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-49)
The SO library was compiled as specified and placed in LD_LIBRARY_PATH visible directory, mysqld bounced.
Problem:
mysql> CREATE FUNCTION median RETURNS REAL SONAME ‘udf_median.so’;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect…
Connection id: 1
Current database: ws_data
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql>
Number of processes running now: 0
080926 09:26:10 mysqld restarted
I am not a linux, MySQL, or c++ expert, so I need some help.
Thanks,
- vlad
moore said,
September 26, 2008 at 8:57 am
Hi Vlad,
It might be a core dump. See http://bytes.com/forum/thread504683.html
Mysql 4.1.15 is pretty old, you might want to try upgrading to the latest 4.1 version (4.1.22) http://dev.mysql.com/downloads/mysql/4.1.html
grk said,
April 15, 2009 at 1:24 pm
Dan, rt,
I updated the median-udf patch with an optional parameter to compute percentiles.
median(variable) returns the standard median
median(variable, i) returns the standard median to i decimals
median(variable, i, percentile) returns the percentile to i decimals
percentile can be a value between 0 and 100. Default value is 50 for the standard median.
The patch is posted here on sourceforge:
http://sourceforge.net/tracker/?func=detail&aid=2766201&group_id=114041&atid=667000
A few notes:
1. The patch has been tested only on Linux. It includes the changes recommended by rt in the post on June 14, 2008 above.
2. gcc required an additional -fPIC parameter in order to go through
gcc -shared -lstdc++ -fPIC -I /usr/include -I /usr/local/include -I /usr/include/mysql -o udf_median.so udf_median.cc
I found that with 64-bit mysql, the median-udf.so file has to be copied to /usr/lib64
3. Remaining instructions in rt’s post remain the same
Hope this helps!
Oliver said,
August 22, 2009 at 4:02 pm
Hi Folks -
I am running Ubuntu 9.04 with MySQL 5.0.75 and have successfully compiled a few different flavors of the udf_median UDF on my machine (including the variant above).
Everything works just fine when I calculate the median of a set of numbers greater than 1. However, when I try to calculate the median of a set of values which are floating point numbers around the number 1 (e.g. 0.9873, 1.0023, 1.23, 0.6753), I only ever get a value of 1 returned from the function. I can never obtain a value with precision to the right of the decimal point.
Any thoughts on what might be the cause of this?
Thanks!
Oliver said,
August 22, 2009 at 6:06 pm
Scratch that last comment – I figured it out – user error on my end.
Sumedh said,
September 24, 2009 at 5:34 am
1. I had to do this on Ubuntu…so had to install header files as -
sudo apt-get install libmysql++-dev
2. I had to install g++
3. I had to change compiler options to -fPIC. Honestly, I don’t know why…compiler said – relocation R_X86_64_32 against `compare_doubles(void const*, void const*)’ can not be used when making a shared object; recompile with -fPIC
/tmp/ccS3AYK9.o: could not read symbols: Bad value
Putting the .so file to /usr/lib seems to be working fine…I have just tested it with simple integer data.
Thanks for a very informative post
moore said,
September 24, 2009 at 7:51 am
Sumedh,
Thanks for adding instructions for Ubuntu.
Jo Yo said,
February 12, 2010 at 2:57 pm
I’m working on JsReduce which is going to be a special aggregate function. I have JsMap working and you can check it out now: http://bit.ly/cKA8np
Ideally, this will be ultra flexible for prototyping reports and getting things done.
moore said,
February 12, 2010 at 3:11 pm
Hi Jo Yo,
JsMap looks really cool!
Thanks.
Dave said,
May 8, 2010 at 6:31 pm
Thank you!
For what it’s worth, here are the instructions I used on Kubuntu Linux, after installing MySQL and its development libraries:
for i in *.cc; do
gcc -Wall -fPIC -DHAVE_DLOPEN -I/usr/include/mysql -c $i -o $(basename $i .cc).o;
ld -shared -o $(basename $i .cc).so $(basename $i .cc).o;
done
This way you install everything and don’t have to comment out any flags. Use the -D parameter to set the flag while compiling. Then you can do:
CREATE AGGREGATE FUNCTION slope RETURNS REAL SONAME ‘udf_slope.so’;
And it worked for me!
moore said,
May 16, 2010 at 10:51 am
Thanks for the additional instructions, Dave!
Thomas said,
October 12, 2010 at 9:55 am
This was really helpful.
BTW:
If anyone encounters the following error(ubuntu) while trying to insert the UDF in mysql:
“failed to map segment from shared object: Permission denied”
you can fix it by stop apparmor
cd /etc/init.d/
./apparmor stop
then insert the udf and
./apparmor start
Thanks!
moore said,
October 25, 2010 at 9:02 am
Thanks Thomas!
moore said,
December 6, 2010 at 5:25 pm
Has anyone managed to get this UDF to work with Windows 7? I tried, but couldn’t get it working, and would love any tips.
Ven said,
February 22, 2011 at 1:38 am
I am trying to compile using the command: gcc -shared -lstdc++ -I /usr/include -I /usr/local/include -I /usr/local/mysql/include/ -o udf_median.so udf_median.cc on Mac OS X 10.4.11. I get the following error:
i686-apple-darwin8-gcc-4.0.1: unrecognized option ‘-shared’
/usr/bin/ld: Undefined symbols:
_main
collect2: ld returned 1 exit status
The Darwin kernel version is 8.11.1. Any thoughts?
Jeff Beard said,
May 21, 2011 at 3:22 pm
Dan,
Here’s my blog post about udf_median on 64bit Windows 7 for you and the folks that refer to this post:
http://jeffbeard.org/2011/05/mysql-udf_median-on-windows-7-64bit/
moore said,
May 22, 2011 at 9:59 pm
Thanks Jeff!