Observations on Writing a Custom Report with Java, QuickBooks, Jasper Reports, Google Spreadsheets and Google Drive

A recently released project uses Java and Spring to pull data from QuickBooks, a MySQL database and Google Spreadsheets, munge the data in various ways, and use Jasper Reports and JFreeChart to generate a good-looking report and a CSV of transactions that will give our brokers weekly updates on how they are doing compared to their goals for the year. I then upload the files to Google Drive and send an email notifying each realtor that they have a new file. It’s always nice to release useful software–feels like a new day dawns.

A few observations from this project:

  • The tech was interesting, but it was actually more interesting to see how the needs of tech drove the business to ‘tighten up’ their processes. Whether that was making sure there was one place to look for user type data, or defining exactly what constituted achieving a goal, or making sure that any new realtors who joined created business goals (and committed them to writing), the binary nature of software forced the business (or, more accurately, people in the business) to make decisions. Just another example of business process crystallization. This also meant deferring some software development. Where the business couldn’t answer a question definitively, rather than force them to do so, we chose to defer development.
  • I’m glad that Jasper Reports makes it so easy to generate PDFs–you basically create an XML template (I was unable to find a spec, but there were plentiful examples) with tokens for dynamic content. Then you compile the XML file, give it a map of said tokens and values (which can be text, numbers, dates or images), and export the result to PDF (see the first sketch after this list). Note that I was not using Jasper in a typical way–reporting from large amounts of similar data via a data connection–because I wanted a different PDF for each user. Perhaps there was a way to do this elegantly, but I was just trying to get stuff done. Creating a grid in Jasper was interesting, though.
  • JFreeChart had a very weird issue where the graph labels were bolded and italicized on stage, but not on production. Since we make every effort to keep these two environments in sync and they were running exactly the same code, this was a mystery. We finally solved it when we discovered the JVMs were different (same version, different vendors: OpenJDK vs Sun Java). It had been running this way for years. Oops.
  • Interacting with Google Spreadsheets is great for the business but a pain in the butt for developers. It’s great for the business because it is extremely easy for someone who is not a programmer to create a ‘database’ that is versioned, backed up, almost always accessible and helps them ‘get stuff done’ in a measured way. It’s a pain for developers because it is a ‘database’ and not a database–no referential integrity or data typing. Also, Google provides cell-based access and row-based access, forcing you to choose. And the Java libraries are old. It beats Excel, though–at least it is accessible from a server. We ended up writing a library to wrap Google’s Java SDK to make some common operations easier.
  • Pushing to Google Drive is interesting–I alluded to this in my last post, but you have to be ready for failure. I ended up using a BlockingQueue for this–I throw files (actually, a data structure defining each file) to be uploaded onto the queue, then consumers executing in a different thread each take one off, try to upload it, and if the upload fails, put it back on (see the second sketch after this list). I considered using a third-party durable queue like IronMQ, but thought it was overkill.
  • Using the QuickBooks SDK, with all the accounting data exposed, lets you build some pretty powerful graphs that are useful to the business. But the docs are jumbled, with a lot of them aimed at developers who are building integrations to sell to QuickBooks users. Support is OK for standard operations, but for things like renewing your token, you have to drop down to the REST API (see my SO question). This article does a good job of outlining the various projects, but as a dev you’ll have to ignore certain sets of information–never fun when getting up to speed.
  • We do a lot of backend processing, and Spring, Maven and a custom assembly that generates a tarball on ‘mvn install’ have been great. I also finally figured out how to work with Maven to use ‘release:prepare’ and ‘release:perform’ for releasing libraries, as opposed to going my own way, and that has made things much, much easier. Learn your tools, folks!
  • I’m once again astounded by the richness of the Java library ecosystem. There doesn’t seem to be much I can think of doing that doesn’t have at least one, and probably three, Java implementations.
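To make the Jasper workflow in the PDF bullet above concrete, here is a minimal sketch of the compile/fill/export cycle. The template name, parameter names and values are made up for illustration; JasperCompileManager, JasperFillManager and JasperExportManager are the standard JasperReports entry points, and JREmptyDataSource stands in for a real data connection since everything arrives via parameters.

```java
import java.util.HashMap;
import java.util.Map;

import net.sf.jasperreports.engine.JREmptyDataSource;
import net.sf.jasperreports.engine.JasperCompileManager;
import net.sf.jasperreports.engine.JasperExportManager;
import net.sf.jasperreports.engine.JasperFillManager;
import net.sf.jasperreports.engine.JasperPrint;
import net.sf.jasperreports.engine.JasperReport;

public class ReportSketch {
    public static void main(String[] args) throws Exception {
        // Compile the JRXML template (the XML file with $P{...} tokens).
        JasperReport report = JasperCompileManager.compileReport("broker-report.jrxml");

        // Map of token names to values; values can be text, numbers, dates or images.
        Map<String, Object> params = new HashMap<String, Object>();
        params.put("brokerName", "Jane Doe");          // hypothetical parameter
        params.put("goalProgress", Double.valueOf(0.72)); // hypothetical parameter

        // No data connection here--everything comes in through parameters,
        // so an empty data source is enough for a one-report-per-user setup.
        JasperPrint filled = JasperFillManager.fillReport(report, params, new JREmptyDataSource());

        // Export the filled report to a PDF file.
        JasperExportManager.exportReportToPdfFile(filled, "broker-report.pdf");
    }
}
```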
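And here is a rough sketch of the BlockingQueue hand-off described in the Google Drive bullet. PendingUpload and pushToDrive are placeholders rather than the real code; the point is the take/requeue loop running in a consumer thread.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class UploadQueueSketch {

    // Hypothetical data structure describing a file to push to Google Drive.
    static class PendingUpload {
        final String localPath;
        final String driveFolderId;
        PendingUpload(String localPath, String driveFolderId) {
            this.localPath = localPath;
            this.driveFolderId = driveFolderId;
        }
    }

    private final BlockingQueue<PendingUpload> queue = new LinkedBlockingQueue<PendingUpload>();

    // Producer side: hand a file off for upload and return immediately.
    public void enqueue(PendingUpload upload) throws InterruptedException {
        queue.put(upload);
    }

    // Consumer side: runs in its own thread, takes uploads off the queue,
    // and puts failed uploads back on for another attempt.
    public Runnable consumer() {
        return new Runnable() {
            public void run() {
                while (!Thread.currentThread().isInterrupted()) {
                    PendingUpload upload = null;
                    try {
                        upload = queue.take();
                        pushToDrive(upload); // placeholder for the real Drive call with backoff
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } catch (Exception e) {
                        // Upload failed; requeue so it gets retried later.
                        if (upload != null) {
                            queue.offer(upload);
                        }
                    }
                }
            }
        };
    }

    private void pushToDrive(PendingUpload upload) throws Exception {
        // Real implementation would call the Google Drive API here.
    }
}
```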

Multi-threaded push to Google Drive with Java Executors

I’ve never really written threaded code in Java. Sure, I took a Java certification class long ago, reviewed threading, and probably wrote some toy code.

But it had always been one of the areas I was aware of but avoided. And it was easy to avoid–after all, I was often writing for a servlet container, and in that case you are supposed to rely on the platform (Tomcat, etc.) to provide the multithreading magic–you just have to stay in your servlet box. Other times, I was writing glue code where performance wasn’t an issue.

Recently, I was writing integration code that read from Google Spreadsheets and QuickBooks Online (using their APIs), did some calculations on the gathered data, and created PDFs (using JFreeChart and Jasper Reports). These PDFs then get pushed to certain folders on Google Drive.

I had run through some testing in my development and stage environments, and was ready to do a run in production. I started the run and, after a few minutes, noticed that performance was not very good: about one report every 80 seconds.

Now, I’d run this process before I added the Google Drive integration, and it had completed in a normal amount of time, so I suspected the Google Drive calls were the issue. When pushing the PDFs to Google Drive in my dev environment, I’d run into some connectivity issues, so, as recommended, I’d added an exponential backoff to my requests.

This meant that a lot of clock time was spent just waiting for the Google services to recover.

So, an easily parallelizable task that spent most of its time waiting? It seemed like a great place for multiple threads. I googled multithreading in Java and found the Executor framework, which made running such tasks in parallel trivial. It even had futures and everything, and has been a standard part of the library since Java 1.5.

I set up a thread pool of 15 threads, spent an hour or so rewriting and testing my file-push code to fit the framework, and performance is now much better. The push to Google Drive still takes about 20 seconds per report, but that’s a fourfold increase in speed.
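The rewrite amounted to wrapping each per-report push in a Callable and handing the batch to a fixed-size pool. Here’s a minimal sketch of that shape–the uploadReport call and the file list are stand-ins, not the actual code:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPushSketch {

    public void pushAll(List<File> reports) throws Exception {
        // Fixed pool; most of each task's time is spent waiting on Google,
        // so more threads than cores is fine here.
        ExecutorService pool = Executors.newFixedThreadPool(15);
        try {
            List<Future<Boolean>> results = new ArrayList<Future<Boolean>>();
            for (final File report : reports) {
                results.add(pool.submit(new Callable<Boolean>() {
                    public Boolean call() throws Exception {
                        return uploadReport(report); // stand-in for the Drive push with backoff
                    }
                }));
            }
            // Future.get() blocks until each upload finishes (or rethrows its failure).
            for (Future<Boolean> result : results) {
                result.get();
            }
        } finally {
            pool.shutdown();
        }
    }

    private boolean uploadReport(File report) {
        // Real implementation calls the Google Drive API.
        return true;
    }
}
```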

Next steps–use the Spring abstraction layer to make the code more maintainable and clear.


Google Spreadsheets as REST API sources

I recently built out a read only JSON API with a Google spreadsheet as the back end data source.

Why do such a thing? I didn’t need to modify the back end, but wanted to make the data available to other software. The people who maintain this data are very comfortable using Google spreadsheets. While I could have written a custom CRUD app, this didn’t seem like a good use of time when Google spreadsheets had served us well in the past.

How did I do this? First of all, I created a sanitized spreadsheet using importrange and regexreplace. This level of indirection means that if the source data changes–say, the user managing the spreadsheet rearranges columns–I can adjust the sanitized spreadsheet fairly easily.

Then I created a Google Apps Script, using doGet to respond to requests and the spreadsheet service to retrieve the data. The content service lets you serve JSON with the appropriate MIME type. I used QUnit for Google Apps Script (invaluable when you are working in the loosely typed JavaScript world and relying on cloud resources). I also used the request parameters to build the querying needed for our application.

Then, in order to make it look like a normal API call, I fronted the script with Varnish and did some regsub magic in my VCL file (as well as some light authentication).

This approach has the benefit of keeping everything in Google’s cloud, and allowing you to access the spreadsheet data easily.

This approach has significant limitations, however.

  • Google Apps Script calls are slow, especially when accessing spreadsheet data. Using the cache service can help.
  • You cannot return anything other than a 200 response code. None of the other response codes are available.
  • The actual content is served after a redirect, so caching it at the Varnish level is difficult (though possible), and clients must be able to follow redirects.
  • Google changes the IP address of the server running the script. This is not such a big problem, unless your version of Varnish only accepts IP addresses in the VCL file, not hostnames. Like ours.

Was this a good idea? Well, it let me build out the API relatively quickly without affecting the users managing the data or finding any other place to put it. But we’ll probably move away from this due to the limitations listed above. One we’ve found particularly painful is the IP address switching, which usually only shows up in our automated testing.

We’ll probably start pushing the data daily (it doesn’t change all that often) to a local database using the JDBC service and use either RestSQL or DropWizard to generate an API for it. (RestSQL is quicker, but DropWizard lets us maintain format compatibility.)


Easyrec: a recommendation engine worth looking at

I love recommendation engines.  This is the software that Amazon uses everywhere to show “users who bought this also bought” recommendations.

I love them because they are an easy way to leverage the wisdom of the crowd to help users.  They also get better the more data you feed into them, so once you set one up, it just makes your site better and better.

For a while, I’ve wanted to explore Mahout as a recommendation engine solution, but felt intimidated by how much work integration would be.  Luckily, I did a bit of searching and turned up this StackOverflow question about Java recommendation engines.

Looking at some of the alternatives, I dug up easyrec, an open source recommendation engine.  Rather than solving a couple of different machine learning problems like Mahout does, easyrec focuses on recommendations.

It also has a javascript API (for both sending information and displaying recommendations) and a demo installation you can use on your site, so it is trivial to integrate into a website to see if it works for you.  I did run into an issue with the demo server, but a post to the forums got it resolved in a few days.

Easyrec has support for generating recommendations for more than one kind of item (so if you want to display different recommendations within specific categories of an ecommerce site, that is possible) and is self-hostable in any Java container (which is recommended if you are going to use it in any commercial capacity).  You can also build the recommendations off of the following actions: views, ratings, or purchases.

You can also customize easyrec with Java plugins, though Mahout definitely offers far more options for configuration.

I haven’t noticed any speed changes to my site with the JavaScript installed, though I’m sure adding some more remote JavaScript code didn’t speed up page rendering.  I noticed an uptick in time on site after I installed it (small, on the order of 5%).

If you have a set of items that are viewed together, easyrec can leverage the wisdom of the crowds with not much effort on your part.  It’s not as powerful or configurable as the alternatives, but it is drop-dead simple to get started with.  It’s worth a look.


JFreeChart for the win

[Chart image: scatter plot with fitted regression line]

I wanted to make some charts showing linear regressions, but they needed to be images (so libraries like d3.js were out).  My first inclination was to use Google’s Image Charts API, but it has been deprecated.  I looked around and couldn’t find anything like it.

So, it looked like I’d need to code something up.  Some searching around turned up some great open source projects.  Using JFreeChart and Apache Commons Math, plus some blog posts (this one about regression, this one about drawing dashed lines) and a post on StackOverflow about scatter plot basics, I was able to put together the above regression line chart in about half a day.
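For anyone wanting to do the same, here is roughly what that half day of work boils down to. The data points are made up; the calls shown are the commons-math3 SimpleRegression and JFreeChart 1.0.x APIs (in JFreeChart 1.5 ChartUtilities was renamed ChartUtils).

```java
import java.io.File;

import org.apache.commons.math3.stat.regression.SimpleRegression;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.annotations.XYLineAnnotation;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.xy.XYSeries;
import org.jfree.data.xy.XYSeriesCollection;

public class RegressionChartSketch {
    public static void main(String[] args) throws Exception {
        // Made-up observations for illustration.
        double[][] points = { {1, 2.1}, {2, 3.9}, {3, 6.2}, {4, 7.8}, {5, 10.1} };

        // Scatter data for JFreeChart and the regression fit from Commons Math.
        XYSeries series = new XYSeries("observations");
        SimpleRegression regression = new SimpleRegression();
        for (double[] p : points) {
            series.add(p[0], p[1]);
            regression.addData(p[0], p[1]);
        }

        JFreeChart chart = ChartFactory.createScatterPlot(
                "Example regression", "x", "y",
                new XYSeriesCollection(series),
                PlotOrientation.VERTICAL, false, false, false);

        // Draw the fitted line across the x range as an annotation on the plot.
        double minX = 1, maxX = 5;
        chart.getXYPlot().addAnnotation(new XYLineAnnotation(
                minX, regression.predict(minX), maxX, regression.predict(maxX)));

        ChartUtilities.saveChartAsPNG(new File("regression.png"), chart, 600, 400);
    }
}
```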


Java address parsing gets an upgrade

I wrote last year about address parsing solutions in Java, and how an open source project called JGeocoder had worked out well for us.  I wanted to announce that my company, 8z, has significantly improved the address parsing capability of JGeocoder, based on data from a number of property listings.

My colleague, Karamjeet Khalsa, added this functionality, as well as more than fifty unit tests.  An address that previously would have failed but is now parsed correctly: 25266 Road 38.1 Dolores CO. Or try this one: 10 Black Bear Gypsum CO.  This code focuses on not just parsing address, city and state, but also breaking the address apart into components like street number, unit number, etc. (for US-style addresses).

Working with the current maintainer of the project, Karamjeet uploaded the new jar yesterday, and it is now ready to download.  This is the first release in four years, so if you need address parsing, go take a look!


Pentaho Data Integration is damn cool

I have worked on two small projects with Pentaho Data Integration.  If you’re looking for a business intelligence tool that lets you manipulate large amounts of data in a performant way, you definitely want to take a look at this.  The version I’m working with is a couple of revisions back, but the online support is pretty good.  It’s way more developer-efficient than writing Java, though debugging is more difficult.

Why is it so cool?  It lets you focus on your problem–validating and transforming your data–rather than the mechanics of it (where do the CSV files live?  what fields did I just add?  how do I parse this fixed width file?).  You can also call out to Java if you need to.

There is a bit of a learning curve, especially around the difference between transformations and jobs.  I bought my first tech book of 2011, Pentaho Kettle Solutions.  These projects weren’t even using Pentaho for its sweet spot, ETLing to a data warehouse, but I have found this to be an invaluable tool for moving data from text files to databases while cleaning up and processing it.


Using Cargo to deploy to multiple containers

I am new to Maven, but recently used it on a project that needed to deploy to two Tomcat containers.  Cargo seems to be the plugin of choice, but I had a devil of a time trying to get it to deploy to two existing containers with one command.

I ended up using the maven-antrun-plugin to just manually copy the war files where they needed to be.  This plugin usage also archives old war files and expands the new one, as well as deploying to the failover container.

Attached, you can see the relevant section of the pom.xml: cargo-deploy-two-containers

Hope this helps.


Parsing street addresses in a java application

I recently had to find a way to parse a street address into its component parts, and thought I’d share my adventure.

The idea is to take a string like “123 S Main Street” and break it apart into the street number (123), the street direction (S), the street name (Main) and the street type (Street).

At first, I thought that regular expressions would work, but the sheer variety of legal postal street addresses quickly dissuaded me, as did my boss’s misgivings.

StackOverflow has a nice discussion of the problem, which gave me some additional pointers.  There’s a commercial solution, available as a COM component or a web service–I didn’t try this.  There is also a free (but application/attribution required) web service provided by a university that did a great job (thanks, California taxpayers).  This solution is also available in a for-pay variant.

Neither of these was desirable, because we needed to parse a lot of addresses quickly, and calling out over the web can be slow.  Some more digging turned up this StackOverflow question and JGeocoder, which has a fairly robust address parser.  It’s not perfect, but it was free and open source.  I am not sure if it is still in development (the author didn’t respond to my email), but it does what we need it to do.
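If memory serves, the core call looks something like the sketch below. The package and enum names are from my recollection of the JGeocoder library rather than copied from our code, so treat them as assumptions and check the library source for the authoritative API.

```java
import java.util.Map;

import net.sourceforge.jgeocoder.AddressComponent;
import net.sourceforge.jgeocoder.us.AddressParser;

public class AddressParseSketch {
    public static void main(String[] args) {
        // Parse a raw address string into its component parts.
        Map<AddressComponent, String> parsed =
                AddressParser.parseAddress("123 S Main Street Boulder CO 80301");

        // Components include street number, predirection, street name, type,
        // city, state, zip, etc.--see the AddressComponent enum for the full list.
        for (Map.Entry<AddressComponent, String> entry : parsed.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```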

As an added bonus, we’re using Pentaho for the data processing, and you can call Java classes directly from your data processing steps, so I didn’t even have to wrap the Java call in a shell script or anything.


Turning on hibernate query caching

Sometimes I write a blog post because I want to say something to others, other times because I want to help others, and sometimes I write one for myself.  This post is of the last type.

I find that nothing drives home a lesson I learned like writing out the solution on my blog, and more than one time I’ve searched my blog because I remember I had written about a topic of current interest in the past.

If you want Hibernate, a Java ORM tool, to cache your query results, there’s a section in the Hibernate manual. Like most of the Hibernate documentation, it’s quite good.  One thing that tripped me up, though, is that I thought all queries were cached as soon as you had set up the correct configuration in your Hibernate XML files and marked your objects as cacheable.

However, this is not the case, as I learned when I turned on the MySQL query log and looked at the calls the web application in question was making.  I did quite a bit of searching, but the answer was in the same section that told me how to set up the configuration (as well as in this forum post).

…most queries do not benefit from caching of their results. So by default, individual queries are not cached even after enabling query caching. To enable results caching for a particular query, call org.hibernate.Query.setCacheable(true).

Doh!  Following these instructions decreased the page load time for some heavy query pages from 8 seconds to 0.5 seconds (the second time the page was loaded).  Impressive indeed, though the production environment might not see such drastic improvements.

One additional note–you can use query caching for SQLQueries (like group operations) by telling Hibernate what the type of each returned value is.  If you don’t use addScalar but you do mark a SQLQuery as cacheable, you’ll get a java.lang.ArrayIndexOutOfBoundsException when that code executes.  More here.
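To put both notes in one place, here is a rough sketch of the two calls. The entity, table and column names are made up, and this is the Hibernate 3.6-era API; the setCacheable and addScalar calls are the relevant parts (query caching also has to be enabled via hibernate.cache.use_query_cache=true in the configuration).

```java
import java.util.List;

import org.hibernate.Query;
import org.hibernate.SQLQuery;
import org.hibernate.Session;
import org.hibernate.type.StandardBasicTypes;

public class QueryCacheSketch {

    // HQL query: must be marked cacheable explicitly, even with the
    // query cache enabled in the configuration.
    public List<?> cachedHql(Session session) {
        Query query = session.createQuery("from Listing where status = :status");
        query.setParameter("status", "ACTIVE");
        query.setCacheable(true);
        return query.list();
    }

    // Native SQL aggregate: each returned column needs an addScalar call,
    // otherwise a cacheable SQLQuery blows up with ArrayIndexOutOfBoundsException.
    public List<?> cachedSql(Session session) {
        SQLQuery query = session.createSQLQuery(
                "select status, count(*) as cnt from listing group by status");
        query.addScalar("status", StandardBasicTypes.STRING);
        query.addScalar("cnt", StandardBasicTypes.LONG);
        query.setCacheable(true);
        return query.list();
    }
}
```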

These articles were helpful to me as I navigated Hibernate caching, above and beyond the reference documentation:



© Moore Consulting, 2003-2017