I just re-read “How To Lie With Statistics”, which is so good I think it should be required reading in every middle school. In it, the author makes the point that there are three kinds of ‘averages’: arithmetic mean, median and mode (here I am, contributing to Wikipedia’s dominance, due to my laziness in looking up alternative definitions of statistical concepts). In general, the median is the most informative average, because it’s not skewed by a small number of outliers.But mysql (and other databases I’ve worked on) don’t natively supprt the medan, whereas I believe most support average (by which they mean ‘arithmetic mean’). Sure, you can use a stored procedure (as suggested here for PostgreSQL. However, I’m working with MySQL 4, which does not support stored procs. However, there is another solution: user defined functions. These seem like stored procedures, except you have to write them in C (or C++).
Now, I’m not a C programmer. Luckily, someone has written and released a set of mysql user defined functions that include median (as well as many other statistical manipulations). The bad news is that it hasn’t been updated for years. The good news is that with a bit of luck and many downloads, I was able to get the median function working on mysql, both on windows as a dll, and on linux as a shared library. To repeat, I am not a C programmer, so if you see any head thumping errors below, please let me know and I’ll update this document.
First off, I was working with these versions of mysql: c:\Program Files\MySQL\MySQL Server 4.1\bin\mysql.exe Ver 14.7 Distrib 4.1.10a, for Win95/Win98 (i32)
and mysql Ver 14.7 Distrib 4.1.7, for pc-linux (i686)
To get median working on windows, you need to:
- Download the mysql-udf tarball.
- patch the files if you’re running a version of mysql greater than 4.1.1. patch available here, or the patched tarball is here.
- Download and install Visual C++ Express. (If you have a C compiler on Windows, you can skip this step and the next. Oh, and the ones following that will probably be different. (Here’s a blog post about creating a UDF using Visual Studio C++ 2003.)
- Download and install the platform SDK; I only followed through step 3. If you don’t, you’ll get ‘windows.h’ errors when you try to compile the UDF.
- Untar the myslq-udf tarball. Patch if needed.
- Install the mysql header files. I was able to do this via the Windows Installer, which let me modify my existing mysql installation; I had to add the ‘C Include Files / Lib Files’ feature.
- Create a new directory. Copy
udf_median.cc
from the untarred directory to this new directory. - Create a new file in that directory called
udf_median.def
. This file contains all the methods the UDF is exporting. Or you can just download the file I used here. - Open Visual C++ Express
- Choose File / New / Project From Existing Code. Hit Next. Browse to the directory you just created. Create a name for the project. Hit Finish
- Edit the
udf_median.cc
file and comment out the#ifdef HAVE_DLOPEN
line as well as the corresponding#endif
. If I didn’t do this, I kept getting link errors, as I guess everything between those preprocessor directives was not being compiled. - Add the mysql include files: right click on the project and choose properties. Expand ‘Configuration Properties’ then ‘C/C++’ and click ‘General’. On the right, add an include directory. Navigate to the Mysql include directory and add that.
- Add the module definition file: right click on the project and choose properties. Expand ‘Configuration Properties’ then ‘Linker’ and click ‘Input’. Add ‘udf_median.def’ to the key ‘Module Definition File’.
- Make sure VC knows this is a DLL: right click on the project and choose properties. Expand ‘Configuration Properties’ and click ‘General’. Choose ‘Dynamic Library (.dll)’ for Configuration Type. If you don’t do this, you’ll get errors like:
error LNK2019: unresolved external symbol _WinMain
because the compiler thinks you’re trying to build an application. - Right click on the project and choose ‘Build’. This gives you a DLL in the Debug directory.
- Copy the DLL to the bin directory of your mysql installation.
- Create the function by logging in to mysql and running this command:
CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.dll';
. (The user you log in as will need to have the ability to insert rows into the mysql tables.) - Test and enjoy.
Deploying the UDF to linux is much simpler, mostly because you don’t have to install a compiler, linker, etc. I used ‘gcc (GCC) 3.3.4’.
- Download the mysql-udf tarball.
- patch the files if you’re running a version of mysql greater than 4.1.1. patch available here, or the patched tarball is here.
- Untar the myslq-udf tarball. Patch if needed.
- Edit the
udf_median.cc
file and comment out the#ifdef HAVE_DLOPEN
line as well as the corresponding#endif
. - Compile and link the code. Do not use the instructions on the mysql-udf homepage. If you compile with those flags, you’ll get this error when you try to add the function:
mysql> CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.so';
. Rather, use the instructions in this bug report:
ERROR 1126 (HY000): Can't open shared library 'udf_median.so' (errno: 22 /usr/lib/udf_median.so: undefined symbol: _Znwj)gcc -shared -lstdc++ -I /usr/include -I /usr/local/include -I /usr/local/mysql/include/ -o udf_median.so udf_median.cc'
- Copy the shared library to a directory where mysql will see it. I put it in
/usr/lib
. - Create the function by logging in to mysql and running this command:
CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.so';
. (The user you log in as will need to have the ability to insert rows into the mysql tables.) - Test and enjoy.
We have pushed this UDF to production with replicated servers and haven’t seen any issues with it yet.
I want to extend my thanks to:
- Jan Steemann, the creator of the mysql-udf project
- Redy Rodriguez and sergivs for the patch and the compiler flag help, respectively
- Roland Bouman, for a great tutorial on building UDFs for Windows
[tags]median, mysql, user defined functions[/tags]
There is one more kind of ‘Average’: Mathematical average.
The mathematical average between n numbers is calculated as:
The ‘nth root’ of all n numbers multiplied.
Between 2 numbers that would be: sqrt(axb).
The mathematical average of 8 and 12 = sqrt(8*12) = sqrt(96) = 9.79796.
It sounds to me like you’re talking about the geometric mean: http://en.wikipedia.org/wiki/Geometric_mean
The mysql-udf package actually provides the geometric mean (a function called ‘geomean’), andif I ever did something like average price rise, it seems like I’d want to use that function. Not sure what else it is useful for–do you hae any further examples?
Dan,
I followed all of the instructions and installed the function in mysql, but when I call the function, I get no results and no errors for that matter. Any ideas? Any help would be appreciated.
Thanks
Mark
Hi Mark,
What version of mysql are you running? Windows or unix? What happens when
you run the create function sql call? Is there any error messages? Is
there an entry in the mysql.func table?
Ended up corresponding with Mark Z a few times. Turns out he was trying to run the median function on a column with type bigint. The median function only works on columns with type float or double, because the median_init function requires the column to have a type of REAL_RESULT. Otherwise, you get this message: “ERROR: median() requires a real as parameter 1.”Changing the column to a double fixed his problem.
Again, I’m not a C programmer, but I don’t know why you couldn’t find the median of an int column. One thing you can do is coerce the column from an int to a double, like so: select median(inventory_left+0.0,0) from product.
Incidentally, looking at the median.cc code, it appears you can give the function two arguments, a column of type double or float, and a number of decimal places.
This was so helpful, thanks! I’ve just upgraded to mysql v5.0.26 (gentoo), and discovered median() no longer works (not even with the “+0.0” trick).
Fortunately there’s a solution – cast the first parameter to REAL_RESULT (see http://dev.mysql.com/doc/refman/5.0/en/udf-arguments.html), instead of returning an error.
To do it yourself: comment out the body of the if statement at line 72 (“if (args->arg_type[0]!=REAL_RESULT)”), and add the following line instead:
args->arg_type[0] = REAL_RESULT;
I also got an error when I try to compile, so I also comment out line 32 (“#include “). Seems to work, and it no longer needs the “+0.0”! Once I’ve tested it fully I’ll try to get some code out.
Trying to compile the file according to instruction #5, I got the following erros:
udf_median.cc:31:23: error: my_global.h: No such file or directory
udf_median.cc:32:20: error: my_sys.h: No such file or directory
udf_median.cc:34:19: error: mysql.h: No such file or directory
udf_median.cc:35:21: error: m_ctype.h: No such file or directory
udf_median.cc:36:24: error: m_string.h: No such file or directory
They were gone after I changed to “I” directive, to point to the real location of the MySQL header files on my system:
> gcc -shared -lstdc++ -I /usr/include/mysql -I /usr/local/include -I /usr/local/mysql/include/ -o udf_median.so udf_median.cc
Other than that, everything worked fine. Thank you very much!
Hi Dan,
I’m glad my udf article helped you getting started with UDF’s on windows.
In fact, I would like to invite you to take a look at the MySQL UDF Repository. This is a community driven project that tries to compile a comprehensive and complete repository of open source production-ready UDF’s for various platforms, including windows, linux and Mac OS X.
You can find the Google Group here:
http://groups.google.com/group/mysql-udf-repository
The repository itself is located here:
http://www.xcdsql.org/MySQL/UDF/
kind regards,
Roland Bouman
Hi Roland,
I’ll take a look at the UDF repository. I don’t do a lot of UDF develpoment, other than when I have a specific need (as above), but I appreciate you building a community and a resource around UDFs.
Hi Dan,
Thanks a lot for the information. I compiled the median function and installed on linux mysql 5.01. However, mysql complained that query batch completed with error(s) when I called the function. Do you have experience of port the function to mysql 5.X? Thanks.
Yuan
Hi Dan,
I would like to take my posted question back because after scan through the comments on this site, I was able to install median function to mysql 5.01. Actually, I even created a 95 percentile function based on the median function. I would like to post it if anyone is interested. Thanks a lot.
Yuan
I would love to see the 95 percentile function ..
Any chance of getting that?
Hi Dan,
I have followed all of your instructions but I now encounter the following error.mysql> CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME ‘udf_median.so’;
ERROR 1126 (HY000): Can’t open shared library ‘udf_median.so’ (errno: 22 udf_median.so: cannot open shared object file: No such file or d)
I made sure that the .so is found in /lib as well as /usr/lib. I had to go through recompilation using -fPIC for the .so. Do you know how to fix this?
Thanks,
Spade
FYI, Jeff Beard has written a beautiful explanation (with screen shots) of building a UDF with Visual Studio 2008: http://www.cyberxape.com/blog/jeff/entry/udf_median_on_windows
Hmm i loaded it on Windows on mysql 5.0.51a and when executing it returns invariably null regardless of what columns select. I tried it both under x86 and x64 versions of myql 5.0.51a.
No errors in the mysql log or Windows log.
Any ideas?
Thanks!
Guys – this was tremendously helpful!! I got up and running with this with mySQL 5.1.25-rc. Here are some notes I took that may be helpful for others when installing on Fedora core 8.
I had some weird error when trying the gcc command – turns out I didn’t have gcc installed. Instead it with:
yum install gcc-c++
Also the instructions above do not mention it but you need to install the mySQL headers. The version I used was this:
wget http://dev.mysql.com/get/Downloads/MySQL-5.1/MySQL-devel-5.1.25-0.glibc23.i386.rpm/from/http://mysql.mirrors.hoobly.com/
Then I installed it with this:
rpm -i MySQL-devel-5.1.25-0.glibc23.i386.rpm
Then I got the patched souce for this udf:
wget http://www.mooreds.com/files/mysql-udf-patched.tar.gz
and extracted it:
tar zxvf mysql-udf-patched.tar.gz
cd mysql-udf
Then some edits per the above notes:
vi udf_median.cc
# two things to edit:
# 1) remove #ifdef HAVE_DLOPEN line and corresponding #endif at end of file
# 2) comment out the body of the if statement at line 72 (”if (args->arg_type[0]!=REAL_RESULT)”),
# and add the following line instead: args->arg_type[0] = REAL_RESULT;
Then I had to change the include path to /usr/include/mysql (instead of /usr/mysql/include).
gcc -shared -lstdc++ -I /usr/include -I /usr/local/include -I /usr/include/mysql -o udf_median.so udf_median.cc
Then apparently under this version of mysql it looks for udfs in a plugin dir – at least, that was the err message I got when I first tried the create aggregate function command. So I created the dir it complained about and put the .so in there:
mkdir -p /usr/lib/mysql/plugin/
cp udf_median.so /usr/lib/mysql/plugin/
CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME ‘udf_median.dll’;
That worked. Then I do this:
select median(myIntVal) from mydb;
and BINGO it works like a charm!! I hope these notes are helpful to someone else like the rest of the above notes were so very helpful for me!
The only down side is that this function is sluggish compared to simply doing select myIntVal from mydb order by myIntval limit halfTableSize,1. That query finishes in less than 0.5sec with 600k rows, whereas the median() query takes 15 seconds 🙁
That being said the benefit of being able to get median as an aggreate makes it worth it (since I can now do select x,avg(myIntVal), median(myIntVal), std(myIntVal) like that…
That being said, does anyone have any tips for how to speed up this median function? I did: create index xyz on mytable (myIntVal); but other than that what if anything can I do to speed this up? Has anyone looked at speeding up the .c code used in this UDF? Thanks again!!
Hi RT,
Thanks for the feedback. I’m really glad this post helped you out, and appreciate the clarifying comments.
I’m afraid I’m using the median function in a batch manner, so haven’t really spent any time looking at the performance. You might want to browse around the mysql-udf-repository google group: http://groups.google.com/group/mysql-udf-repository, or post this question there.
Thank you all for the info on this page.
It took me a day pulling my hair our, but have finaly got it to work on Windows / MySQL 5.0.41 / Visual C++ 2008 Express.
Derek,
Glad to help. Any clues to help other folks using that combination of software?
Hello,
I managd to compile and follow all the steps until the very end. When I type
create aggregate function udf_median returns real soname ‘udf_median.dll’
I get the following error message:
ERROR 1127 (HY000): Can’t find function ”udf_median” in library.
Can anyone help me with this error message?
Best regards and thank you in advance,
Pedro
If you would like to supply INT arguments without explicit coersion, udf_median.cc can be modified like so…
REPLACE THIS:
if (args->arg_type[0]!=REAL_RESULT)
{
strcpy(message,”median() requires a real as parameter 1″);
return 1;
}
WITH THIS:
args->arg_type[0] = REAL_RESULT;
Which forces mysql to do any needed coersion.
NOTE 1: Result is DOUBLE regardless of argument type supplied
NOTE 2: I tested successfully, but use at your own risk.
Why did I bother with this if result is DOUBLE anyway?
Explicit coersion of the input arguments did not work for me. Any attempt I made was producing DECIMAL fields, which were not accepted by udf_median. Don’t know why, is it related to mysql version?? (5.0.51a)
Dan, great post!!!
I have followed your instructions but ran into the following issue.
Environment:
———–
uname -a
Linux 2.4.21-32.0.1.ELsmp #1 SMP Tue May 17 17:52:23 EDT 2005 i686 i686 i386 GNU/Linux
cat /etc/issue
Red Hat Enterprise Linux AS release 3 (Taroon Update 5)
Kernel \r on an \m
mysql -V
mysql Ver 14.7 Distrib 4.1.15, for pc-linux-gnu (i686) using readline 4.3
gcc -v
Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/3.2.3/specs
Configured with: ../configure –prefix=/usr –mandir=/usr/share/man –infodir=/usr/share/info –enable-shared –enable-threads=posix –disable-checking –with-system-zlib –enable-__cxa_atexit –host=i386-redhat-linux
Thread model: posix
gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-49)
The SO library was compiled as specified and placed in LD_LIBRARY_PATH visible directory, mysqld bounced.
Problem:
mysql> CREATE FUNCTION median RETURNS REAL SONAME ‘udf_median.so’;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect…
Connection id: 1
Current database: ws_data
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql>
Number of processes running now: 0
080926 09:26:10 mysqld restarted
I am not a linux, MySQL, or c++ expert, so I need some help. 🙂
Thanks,
– vlad
Hi Vlad,
It might be a core dump. See http://bytes.com/forum/thread504683.html
Mysql 4.1.15 is pretty old, you might want to try upgrading to the latest 4.1 version (4.1.22) http://dev.mysql.com/downloads/mysql/4.1.html
Dan, rt,
I updated the median-udf patch with an optional parameter to compute percentiles.
median(variable) returns the standard median
median(variable, i) returns the standard median to i decimals
median(variable, i, percentile) returns the percentile to i decimals
percentile can be a value between 0 and 100. Default value is 50 for the standard median.
The patch is posted here on sourceforge:
http://sourceforge.net/tracker/?func=detail&aid=2766201&group_id=114041&atid=667000
A few notes:
1. The patch has been tested only on Linux. It includes the changes recommended by rt in the post on June 14, 2008 above.
2. gcc required an additional -fPIC parameter in order to go through
gcc -shared -lstdc++ -fPIC -I /usr/include -I /usr/local/include -I /usr/include/mysql -o udf_median.so udf_median.cc
I found that with 64-bit mysql, the median-udf.so file has to be copied to /usr/lib64
3. Remaining instructions in rt’s post remain the same
Hope this helps!
Hi Folks –
I am running Ubuntu 9.04 with MySQL 5.0.75 and have successfully compiled a few different flavors of the udf_median UDF on my machine (including the variant above).
Everything works just fine when I calculate the median of a set of numbers greater than 1. However, when I try to calculate the median of a set of values which are floating point numbers around the number 1 (e.g. 0.9873, 1.0023, 1.23, 0.6753), I only ever get a value of 1 returned from the function. I can never obtain a value with precision to the right of the decimal point.
Any thoughts on what might be the cause of this?
Thanks!
Scratch that last comment – I figured it out – user error on my end.
1. I had to do this on Ubuntu…so had to install header files as –
sudo apt-get install libmysql++-dev
2. I had to install g++
3. I had to change compiler options to -fPIC. Honestly, I don’t know why…compiler said – relocation R_X86_64_32 against `compare_doubles(void const*, void const*)’ can not be used when making a shared object; recompile with -fPIC
/tmp/ccS3AYK9.o: could not read symbols: Bad value
Putting the .so file to /usr/lib seems to be working fine…I have just tested it with simple integer data.
Thanks for a very informative post 🙂
Sumedh,
Thanks for adding instructions for Ubuntu.
I’m working on JsReduce which is going to be a special aggregate function. I have JsMap working and you can check it out now: http://bit.ly/cKA8np
Ideally, this will be ultra flexible for prototyping reports and getting things done.
Hi Jo Yo,
JsMap looks really cool!
Thanks.
Thank you!
For what it’s worth, here are the instructions I used on Kubuntu Linux, after installing MySQL and its development libraries:
for i in *.cc; do
gcc -Wall -fPIC -DHAVE_DLOPEN -I/usr/include/mysql -c $i -o $(basename $i .cc).o;
ld -shared -o $(basename $i .cc).so $(basename $i .cc).o;
done
This way you install everything and don’t have to comment out any flags. Use the -D parameter to set the flag while compiling. Then you can do:
CREATE AGGREGATE FUNCTION slope RETURNS REAL SONAME ‘udf_slope.so’;
And it worked for me!
Thanks for the additional instructions, Dave!
This was really helpful.
BTW:
If anyone encounters the following error(ubuntu) while trying to insert the UDF in mysql:
“failed to map segment from shared object: Permission denied”
you can fix it by stop apparmor
cd /etc/init.d/
./apparmor stop
then insert the udf and
./apparmor start
Thanks!
Thanks Thomas!
Has anyone managed to get this UDF to work with Windows 7? I tried, but couldn’t get it working, and would love any tips.
I am trying to compile using the command: gcc -shared -lstdc++ -I /usr/include -I /usr/local/include -I /usr/local/mysql/include/ -o udf_median.so udf_median.cc on Mac OS X 10.4.11. I get the following error:
i686-apple-darwin8-gcc-4.0.1: unrecognized option ‘-shared’
/usr/bin/ld: Undefined symbols:
_main
collect2: ld returned 1 exit status
The Darwin kernel version is 8.11.1. Any thoughts?
Dan,
Here’s my blog post about udf_median on 64bit Windows 7 for you and the folks that refer to this post:
http://jeffbeard.org/2011/05/mysql-udf_median-on-windows-7-64bit/
Thanks Jeff!
question – is this code finding the true median when having a set of even number of items? it seems to select the lower of the 2 middle items. instead, should it not return the average of the 2 middle items?
Hi Hans,
I don’t think so. Here’s the relevant section of the code:
qsort(buffer->values,buffer->abscount,sizeof(double),compare_doubles);
if (buffer->abscount&1==1)
{
return buffer->values[(buffer->abscount-1)/2];
} else
{
return (buffer->values[(buffer->abscount-2)/2]+buffer->values[buffer->abscount/2])/2.0;
}
You can see if that if there are an odd number of elements, we return the true median. If there are an even number,
For example, if abscount is 10, then (abscount – 2)/2 is 4, and abscount/2 is 5, then we add the numbers at the 4th and 5th positions together.
Here’s an example:
mysql> insert into foo values (1);
Query OK, 1 row affected (0.00 sec)
mysql> insert into foo values (2);
Query OK, 1 row affected (0.00 sec)
mysql> insert into foo values (3);
Query OK, 1 row affected (0.00 sec)
mysql> insert into foo values (4);
Query OK, 1 row affected (0.00 sec)
mysql> select median(bar) from foo;
+-------------+
| median(bar) |
+-------------+
| 2.50 |
+-------------+
1 row in set (0.00 sec)
Hi,
I compiled with ‘gcc -shared -fPIC -lstdc++ -I /usr/include -I /usr/local/include -I /usr/local/mysql/include/ -o /usr/lib/mysql/plugin/udf_intercept.so udf_intercept.cc’ and I got (after long painfull time) an udf_intercept.so. succesfully, but I cant create the function , i get the ‘xxx_clear’ error: #1127 – Can’t find symbol ‘intercept_clear’ in library.
I found something here but not yet helpfull for me:
“it seems that the non-aggregate functions will still work in MySQL 4.1.1+, but the aggregate functions need some additional hook-in to work (the xxx_clear procedure).”
http://phpforum.de/forum/showthread.php?t=194073
Any idea?
Kai
ps: I have ubuntu 10.04 and Mysql 5.1.41.
Here I found some newer files: http://code.google.com/p/mysql-udf-moments/source/browse/
Some functions are fixed, like median, colwidth, kurtosis, moment, skewness, but not the others.
I would have liked writing my own median (from an algorithm I found a research paper about) using the files in this post as a starting point. I am however not able to add the function to mysql. I am using Fedora 18 and mySQL 5.5.29. I get the following error message:
ERROR 1126 (HY000): Can’t open shared library ‘my_udf_example.so’ (errno: 0 /usr/lib64/mysql/plugin/my_udf_example.so: cannot open shared object file: Permission denied)
I have ensured that the my_udf_example.so file has “r-x” permission for everybody (and so have the mysql plugin directory where I put it). I know mysql is finding it because if I remove it I get another error message saying it cant be found.
All sugestions are apreciated!
Hi Trist,
Did you ever solve this? I’d probably try to install another mysql plugin and see if I got the same error message.
Hi I was able to create the dll file and when i try to create the function in mYSQL i am getting this error
Error Code: 1126. Can’t open shared library ‘mymedian.dll’ (errno: 126 The specified module could not be found.)
has anyone faced this issue, if yes please share the remedies or suggestions.
Thanks
Yashwanth
Hi Yashwanth. Where did you place the DLL file? Is it readable by the mysql process user? Did you restart mysql?