Sat, 13 Mar 10

Moving data from MS-SQL to mysql

Posted in Databases, MySQL at 10:36 am by moore

I recently worked on a project where I needed to port data from a MS-SQL database to a mysql database.  There are programs, both payware and freeware that will help with this process, but I didn’t have ODBC access to the MS-SQL database, which these programs require.  All I had were a bunch of insert statements that looked like this:

INSERT INTO
[sample].[dbo].[users_info]([ID],[registration_day],[registration_month],
[registration_year],[registration_time],[first_name],[last_name],
[username],[userpwd],[repeat_pwd],[birth_date],[birth_day],[birth_year],
[street],[state],[city],[zip_code],[email_address],[membership_startdate],
[online],[gender]) VALUES(463,N'15',N'7',N'2009',N'9:13:52 AM',N'Homer',
N'Simpson',N'homerrocks',N'beerbeer',N'beerbeer',N'January',N'1',N'1999',
N'234 Main',N'Alaska',N'Springfield',NULL,N'homer@thesimpsons.com'),
CAST(0x9AE90000 AS SmallDateTime),
CAST(0x00009CD700000000 AS DateTime),N'Man');

I wrote a perl script to turn that dialect of SQL into a mysql friendly dialect (feel free to download it).

The most interesting parts were those CAST statements.  This forum post and this blog post helped me turn those casts into real dates.  (After loading the inserts into mysql, I did some post processing, using the helpful case statement and str_to_date function to rationalize some of the data.)

[tags]mssql, mysql, data migration[/tags]

Wed, 02 Dec 09

Hadoop logging case study

Posted in MySQL, Technology, Useful Tools at 9:53 am by moore

Here’s a fascinating case study of the evolution of a (large!) log analysis toolset. It follows Rackspace as they strive to allow searching logfiles from an ever growing number of servers, starting from manual ssh/grep, through various incarnations of a MySQL database, and ending up using Hadoop/Lucene/Solr.  There’s (some) more info on the rackspace blog.  Cool stuff.

Via George Fairbanks.

[tags]lucene, solr,case study,hadoop[/tags]

Fri, 16 Oct 09

Optimizing a distance calculation in a mysql query

Posted in Databases, MySQL at 12:15 pm by moore

If you have a query that sorts by a derived field, and then takes a limited number of the results, it can be a real dog.  Here’s how I optimized a situation like this.  Imagine this table.

create table office_building (
id int primary key,
latitude float not null,
longitude float not null,
rent int,
address varchar(20),
picture_url varchar(255)
);

If you want to find the nearest 100 office buildings to a point on a map, you run a query something like this (plug your lat/lng into the question marks):

explain select *, round( sqrt( ( ( (latitude - ?) * (latitude - ?) ) *  69.1 * 69.1) +
((longitude - ?) * (longitude - ?) * 53 * 53 ) ) ) as distance
from office_building order by distance limit 100

(See here for an explanation of the 69.1 and 53 constants--basically they convert roughly from lat/lng to miles.) Unfortunately, you are ordering by a derived field, and mysql can no longer do order by optimization.

This means that you'll be doing a filesort (which does not actually have anything to do with the filesystem, but is just a sort not on an index).  And this, in turn means that your performance will suck if you have any large number of rows returned.

You can help things out a bit by limiting your office building query to a box of a certain size around the point.  Here's the query with a 5 mile box:

select *, round( sqrt( ( ( (latitude - ?) * (latitude - ?) ) *  69.1 * 69.1 ) +
( (longitude - ?) * (longitude - ?) * 53 * 53 ) ) ) as distance
from office_building
where latitude < ?  + (1/69.1)*5 and latitude > ? - (1/69.1)*5 and longitude < ? + (1/53)*5
and longitude > ? - (1/53)*5
order by distance limit 100

But if you still have too many results, the sorting on distance will be slow.  Also, even if you have an index on latitude and longitude, (such as create index idx_nearby on office_building (latitude,longitude)) because you are not using equality, only the first column will be used.

This is worth repeating, because it took me a while to understand.  If you have an index: create index idx on tbl (col1,col2,col3,col4,col5) and you run a query like select count(*) from tbl where col1 = 1 and col2 > 2 and col3 < 3 and col4 > 4 only col1 and col2 will be used from the index.  Mysql goes to the table data files for col3 and beyond (assuming no other indices on the table).  This makes sense when you think about how indices are created and stored, but I didn't really understand it until I'd been beaten over the head with it.

As stated here: "[mysql] will use the fields [in the index], from left to right, as long as the WHERE clause has "=". Once it hits a 'range' (>, IN, BETWEEN, ...), it stops with that field."  I don't know why it is not in the mysql index documentation--maybe it is obvious?

The solution I found was to separate what I wanted to find in the select clause from how I find it, in the where and order by clause:

select select_clause.*,
round( sqrt( ( ( where_clause.latitude - ?) * (where_clause.latitude - ? ) *  69.1 * 69.1 ) +
( (where_clause.longitude - ? ) *(where_clause.longitude - ? ) * 53 * 53 ) ) ) as distance
from office_building where_clause, office_building select_clause
where where_clause.latitude < ? + (1/69.1)*5
and where_clause.latitude > ? - (1/69.1)*5
and where_clause.longitude < ? + (1/53)*5 and
where_clause.longitude > ? - (1/53)*5
and where_clause.id = select_clause.id
order by distance
limit 100

You also need to add an index:

create index idx_nearby on office_building (latitude,longitude,id);

Then, when you run the query, you still have the filesort, but you also see the magic 'Using index' in your explain plan.  You never have to go to the table to do the sort!  You also have a join now, but it's on the primary key, and you only need to go to the table for the 100 rows that you know you want.

Using this query had an effect on one live system of one to two orders of magnitude increase in query speed, depending on the query.  This not only works for distance queries, but anytime you want to order by a calculated value.

More useful links: geo search suggestions, index explanation

[tags]mysql, performance, query optimization[/tags]

Fri, 07 Aug 09

Using phpMyAdmin without the “Show Databases” privilege

Posted in MySQL, PHP, Useful Tools at 9:05 am by moore

phpMyAdmin is a pretty cool piece of software, and a very useful tool. If you haven’t used it before, it’s a full featured web-based interface to MySQL databases. You can use it to add data, update data, delete data, export data, and basically any other management of the database you might need.

I ran into an issue the other day. I was trying to allow phpMyAdmin to access a database on a remote host. The user that phpMyAdmin was connecting as didn’t have the “show databases” privilege (I imagine this is common in shared hosting environments, which is what this was). This, apparently, is what phpMyAdmin uses to populate the drop-down of databases on the left-hand side after you login. Since it didn’t display that drop-down, there is no way of selecting the database to which this user did have access.

I searched for a while, but couldn’t find anyone else with this problem. So I thought I would post the solution I found.

The solution is to hard code authentication data for the remote server in my config.inc.php file.  Then, you append the server and the database that you want to connect to the phpMyAdmin url.

In the config.inc.php file:
$cfg['Servers'][$i]['host'] = 'xxx.xxx.xx.xx';
$cfg['Servers'][$i]['user'] = 'user';
$cfg['Servers'][$i]['password'] = 'pass';

In the url that normal users use to access the database:
http://phpMyAdmin.example.com/phpMyAdmin/index.php?db=databasename&server=xxx.xxx.xxx.xxx

The left hand side still isn’t populated with any databases. But, this allows you to go directly to the database to which you do have access, and perform whatever tasks you want (and have permissions for). I tried adding the “db” parameter to the config.inc.php file, but that didn’t seem to work. I didn’t try using any other form of authentication than hardcoded, as I wanted to make this as simple as possible for the phpMyAdmin users.

I was running an older version of phpMyAdmin (2.11.9.5); I didn’t test this on newer versions.  If you do, please post a comment and let me know if this still works.

[tags]phpmyadmin,mysql privileges,remote database access[/tags]

Thu, 14 Aug 08

MySQL and memcached: notes from a webinar

Posted in MySQL at 12:27 pm by moore

Just listened to another mysql webinar. This time it was about mysql and memcached. I got in a little late, but here are my notes.

Memcached is a distributed caching mechanism. Not as robust as Coherence, for example, but free, open source, and widely used and supported.

There is a project that hooks mysql UDFs to memecached.

The main speaker worked at Grazr, a feed manipulation company. He presented some lessons and stated that some seemed obvious, but we all know how it is, you learn mistakes the hard way.

  1. Beware architecture momentum–early decisions influence later ones.
  2. Your scaling plan is wrong in some way!–grazr overbuilt.
  3. Testing is limited: they had issues with replication speed. Users are better at testing site than any perl script.
  4. Replication is fast, until it isn’t. It’s asynchronous nature can make it hard to program to.
  5. Memcached is your friend
  6. Sphinx, an open source SQL full text search engine (like Lucene), is your friend. It also has a mysql storage engine–virtual table that you can join with other data sources. This sounds cool.
  7. Do bulk insert and lazy writes–if you don’t need to immediately read it, do the write later.
  8. Emphasis on scaling hurt user experience. Fast, data correct transaction are available with two options: one server, or a sync buffer like memcached.
  9. Use instrumentation–he mentioned tools like nagios and cacti (cacti has mysql plugin [which I couldn't find]) to help scaling–close the loop: when you make a change, find out what that change did.
  10. Try new things–for example, the MySQL blackhole storage engine for data that is only read on slaves, not on the master.
  11. Everyone has same problems–avoid Not Invented Here syndrome. Not just the obvious stuff (who writes a web server these days–oh wait), for example, there’s a batch job processor like gearman. Note that he also said they had to Invent a few things before they learned this lesson.
  12. Accept change–design for now+1, not now+100
  13. Listen–even folks not intimately familiar with your technology may have good ideas, and more brains are better.

End notes

Looks like a couple of interesting whitepapers are available (you’ll have to register and get pinged via email a few times by sales folks): “Designing and Implementing Scalable Applications with Memcached and MySQL”. And there are future webinars on the MySQL/memcached topic, one on SAAS applications and one on Scalable web architecture.

In a meta note, I enjoy the webinars because:

  1. Easy to get to
  2. Free
  3. High quality
  4. One can bail if you want (which I did a few weeks ago when MySQL proxy http://forge.mysql.com/wiki/MySQL_Proxy didn’t turn out to be all that interesting to me).
  5. The give me something to blog about :)

[tags]webinar, memcached, users are the best testers[/tags]

Mon, 11 Feb 08

My experience at the MySQL Performance Coding Webinar

Posted in MySQL, Programming at 11:37 am by moore

Last Tuesday, I attended a MySQL webinar. I registered on the MySQL website (with a site-specific email address, of course) and periodically am invited to these webinars. I’d tried to attend in the past, but something else (usually billable) always interfered. Not this time!

The talk was titled “Performance Coding for MySQL” and the author, Jay Pipes, did a fantastic job. (He is also the co-author of Pro MySQL.) The slides from the presentation are up, and he also answered questions sent to him during the presentation in some detail as well. His presentation, about an hour in length, covered both basics (like, normalize first (slide 4), think in sets rather than iterators (slides 20-23)–basic, but not intuitive), and under the hood intricacies (like, think about the size of your primary keys and consider record size (slide 6), avoiding deletes with MyISAM (slide 27) and vertical partitioning to take advantage of the query cache (slides 9-11) ) . He also pointed to a script that he wrote to find useless indices.

Well worth my time. Thanks MySQL and Jay, for making a resource like this available and free! There’s a whole lot more, so I’d recommend downloading the slides and giving them a run through, if you interact with MySQL as a DBA or a developer (or, as is often the case, both).

(On that note, I’d like to recommend the MySQL DBA blog for your perusal–apparently recently renamed the ‘Senior MySQL DBA’ blog, heh.)

[tags]mysql dba, think in sets, webinar[/tags]

Fri, 04 Jan 08

BatchUpdateException using Hibernate and MySQL5

Posted in Databases, Java, MySQL at 7:59 pm by moore

I ran into a crazy error last week. One of my clients was upgrading from MySQL4 to MySQL5. The application in question was using Hibernate 3.2. Here’s the table structure, and the hibernate bean definition. See if you can spot the issue:

mysql> desc stat;
+--------------+-------------+------+-----+---------------------+-------+
| Field        | Type        | Null | Key | Default             | Extra |
+--------------+-------------+------+-----+---------------------+-------+
| stat_date    | date        |      | PRI | 0000-00-00          |       |
| stat_type    | varchar(50) |      | PRI |                     |       |
| stat_count   | int(11)     | YES  |     | NULL                |       |
| last_updated | datetime    |      |     | 0000-00-00 00:00:00 |       |
+--------------+-------------+------+-----+---------------------+-------+
4 rows in set (0.00 sec)

<class name="com.foo.common.data.Statistic" table="stat" lazy="false">
<cache usage="read-write"/>
<composite-id name="statisticId" class="com.foo.common.data.StatisticId">
<key-property name="date" type="java.util.Date" column="stat_date"/>
<key-property name="type" column="stat_type"/>
</composite-id>
<property name="count" column="stat_count"/>
<property name="lastUpdated" type="java.util.Date" column="last_updated" />
</class>

The exception stack trace I was seeing was something like this:

2007-12-31 14:15:09,888 ERROR [Thread-14] def.AbstractFlushingEventListener (AbstractFlushingEventListener.java:301)

- Could not synchronize database state with session
org.hibernate.exception.ConstraintViolationException: Could not execute JDBC batch update
at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:71)
at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)
at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:249)
at org.hibernate.engine.ActionQueue.executeActions(ActionQueue.java:235)
at org.hibernate.engine.ActionQueue.executeActions(ActionQueue.java:139)
at org.hibernate.event.def.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:298)
at org.hibernate.event.def.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:27)
....
Caused by: java.sql.BatchUpdateException: Duplicate key or integrity constraint
violation message from server: "Duplicate entry '2007-12-31-stattype' for key 1"
at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1492)
at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeBatch(NewProxyPreparedStatement.java:1723)
at org.hibernate.jdbc.BatchingBatcher.doExecuteBatch(BatchingBatcher.java:48)
at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:242)
... 57 more

I ended up turning on the mysql logging (the log setting in the my.ini file, which logs all sql statements mysql makes) to see what was happening.

Basically, I was looking to see if an entry in the stat table existed; if it did, increment and update, if it did not, insert. And the insert was always happening, so the entry was not found--it did exist because mysql threw the 'integrity constraint' exception.

The cause of the issue was the date type of stat_date and the fact that I incorrectly mapped it to java.util.Date. It really should have been mapped to java.sql.Date. How this worked in mysql4 is beyond me, but it did. Changing the hibernate dialect to mysql5 had no impact.

[tags]mysql upgrade,hibernate[/tags]

Wed, 11 Apr 07

MySQL tuning

Posted in Databases, MySQL at 4:18 pm by moore

If you’re trying to tune MySQL, make sure you measure before and after. mysqlreport is a great way to measure a number of aspects of mysql (make sure you check out the guide). If you want graphing and long term tracking for mysqlreport and just about any other measure you care to track, I recommend cacti, which works with rrdtool.

[tags]mysql performance, monitoring[/tags]

Wed, 27 Sep 06

Installing the median user defined function on MySQL

Posted in Databases, MySQL, Programming at 11:57 am by moore

I just re-read “How To Lie With Statistics”, which is so good I think it should be required reading in every middle school. In it, the author makes the point that there are three kinds of ‘averages’: arithmetic mean, median and mode (here I am, contributing to Wikipedia’s dominance, due to my laziness in looking up alternative definitions of statistical concepts). In general, the median is the most informative average, because it’s not skewed by a small number of outliers.But mysql (and other databases I’ve worked on) don’t natively supprt the medan, whereas I believe most support average (by which they mean ‘arithmetic mean’). Sure, you can use a stored procedure (as suggested here for PostgreSQL. However, I’m working with MySQL 4, which does not support stored procs. However, there is another solution: user defined functions. These seem like stored procedures, except you have to write them in C (or C++).

Now, I’m not a C programmer. Luckily, someone has written and released a set of mysql user defined functions that include median (as well as many other statistical manipulations). The bad news is that it hasn’t been updated for years. The good news is that with a bit of luck and many downloads, I was able to get the median function working on mysql, both on windows as a dll, and on linux as a shared library. To repeat, I am not a C programmer, so if you see any head thumping errors below, please let me know and I’ll update this document.

First off, I was working with these versions of mysql: c:\Program Files\MySQL\MySQL Server 4.1\bin\mysql.exe Ver 14.7 Distrib 4.1.10a, for Win95/Win98 (i32) and mysql Ver 14.7 Distrib 4.1.7, for pc-linux (i686)

To get median working on windows, you need to:

  1. Download the mysql-udf tarball.
  2. patch the files if you’re running a version of mysql greater than 4.1.1. patch available here, or the patched tarball is here.
  3. Download and install Visual C++ Express. (If you have a C compiler on Windows, you can skip this step and the next. Oh, and the ones following that will probably be different. (Here’s a blog post about creating a UDF using Visual Studio C++ 2003.)
  4. Download and install the platform SDK; I only followed through step 3. If you don’t, you’ll get ‘windows.h’ errors when you try to compile the UDF.
  5. Untar the myslq-udf tarball. Patch if needed.
  6. Install the mysql header files. I was able to do this via the Windows Installer, which let me modify my existing mysql installation; I had to add the ‘C Include Files / Lib Files’ feature.
  7. Create a new directory. Copy udf_median.cc from the untarred directory to this new directory.
  8. Create a new file in that directory called udf_median.def. This file contains all the methods the UDF is exporting. Or you can just download the file I used here.
  9. Open Visual C++ Express
  10. Choose File / New / Project From Existing Code. Hit Next. Browse to the directory you just created. Create a name for the project. Hit Finish
  11. Edit the udf_median.cc file and comment out the #ifdef HAVE_DLOPEN line as well as the corresponding #endif. If I didn’t do this, I kept getting link errors, as I guess everything between those preprocessor directives was not being compiled.
  12. Add the mysql include files: right click on the project and choose properties. Expand ‘Configuration Properties’ then ‘C/C++’ and click ‘General’. On the right, add an include directory. Navigate to the Mysql include directory and add that.
  13. Add the module definition file: right click on the project and choose properties. Expand ‘Configuration Properties’ then ‘Linker’ and click ‘Input’. Add ‘udf_median.def’ to the key ‘Module Definition File’.
  14. Make sure VC knows this is a DLL: right click on the project and choose properties. Expand ‘Configuration Properties’ and click ‘General’. Choose ‘Dynamic Library (.dll)’ for Configuration Type. If you don’t do this, you’ll get errors like: error LNK2019: unresolved external symbol _WinMain because the compiler thinks you’re trying to build an application.
  15. Right click on the project and choose ‘Build’. This gives you a DLL in the Debug directory.
  16. Copy the DLL to the bin directory of your mysql installation.
  17. Create the function by logging in to mysql and running this command: CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.dll';. (The user you log in as will need to have the ability to insert rows into the mysql tables.)
  18. Test and enjoy.

Deploying the UDF to linux is much simpler, mostly because you don’t have to install a compiler, linker, etc. I used ‘gcc (GCC) 3.3.4′.

  1. Download the mysql-udf tarball.
  2. patch the files if you’re running a version of mysql greater than 4.1.1. patch available here, or the patched tarball is here.
  3. Untar the myslq-udf tarball. Patch if needed.
  4. Edit the udf_median.cc file and comment out the #ifdef HAVE_DLOPEN line as well as the corresponding #endif.
  5. Compile and link the code. Do not use the instructions on the mysql-udf homepage. If you compile with those flags, you’ll get this error when you try to add the function: mysql> CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.so';
    ERROR 1126 (HY000): Can't open shared library 'udf_median.so' (errno: 22 /usr/lib/udf_median.so: undefined symbol: _Znwj)
    . Rather, use the instructions in this bug report: gcc -shared -lstdc++ -I /usr/include -I /usr/local/include -I /usr/local/mysql/include/ -o udf_median.so udf_median.cc'
  6. Copy the shared library to a directory where mysql will see it. I put it in /usr/lib.
  7. Create the function by logging in to mysql and running this command: CREATE AGGREGATE FUNCTION median RETURNS REAL SONAME 'udf_median.so';. (The user you log in as will need to have the ability to insert rows into the mysql tables.)
  8. Test and enjoy.

We have pushed this UDF to production with replicated servers and haven’t seen any issues with it yet.

I want to extend my thanks to:

[tags]median, mysql, user defined functions[/tags]


© Moore Consulting, 2003-2010