Guide to Reindexing ElasticSearch data input with Logstash

I ran into an issue where I set up logstash to load data that was numeric as a string. Then, later on when we wanted to do visualizations with it, they were off. So, I needed to re-index all the data.

Total pain, hope this guide helps.  (Here’s some additional elastic search documentation: here and here.)

If you don’t care about your old data, just:

  • shut down logstash
  • deploy the new logstash filter (with mutates)
  • close all old indices
  • turn on logstash
  • send some data through to logstash
  • refresh fields in kibana–you’ll lose popularity

Now, if you do care about your old data, well, that’s a different story. Here are the steps I took:

First, modify the new logstash filter file, using mutate and deploy it. This takes care of the logstash indexes going forward, but will cause some kibana pain until you convert all the past indexes (because some indexes will have fields as strings and others as numbers).

Install jq: https://stedolan.github.io/jq/manual/v1.4/ which will help you transform your data (jq is magic, I tell you).

Then, for each day/index you care about (logstash-2015.09.22in this example ), you want to follow these steps.


# get the current mapping
curl -XGET 'http://localhost:9200/logstash-2015.09.22/_mapping?pretty=1' > mapping

#back it up
cp mapping mapping.old

# edit mapping, change the types of the fields that are strings to long, float, or boolean.  I used vi

# create a new index with the new mapping 
curl -XPUT 'http://localhost:9200/logstash-2015.09.22-new/' -d @mapping

# find out how many rows there are.  If there are too many, you may want to use the scrolled search.  
# I handled indexes as big as 500k documents with the below approach
curl -XGET 'localhost:9200/logstash-2015.09.22/_count'

# if you are modifying an old index, no need to stop logstash, but if you are modifying an index with data currently going to it, you need to stop logstash at this step.

# change size below to be bigger than the count.
curl -XGET 'localhost:9200/logstash-2015.09.22/_search?size=250000'> logstash-2015.09.22.data.orig

# edit data, just get the array of docs without the metadata
sed 's/^[^[]*\[/[/' logstash-2015.09.22.data.orig |sed 's/..$//' > logstash-2015.09.22.data

# run jq to build a bulk insert compatible json file ( https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html )

# make sure to correct the _index value. in the line below
jq -f jq.file logstash-2015.09.22.data |jq -c '\
{ index: { _index: "logstash-2015.09.22-new", _type: "logs" } },\
.' > toinsert

# where jq.file is the file below

# post the toinsert file to the new index
curl -s -XPOST localhost:9200/_bulk --data-binary "@toinsert"; echo

# NOTE: depending on the size of the toinsert file, you may need to split it up into multiple files using head and tail.  
# Make sure you don't split the metadata and data line (that is, each file should have an even number of lines), 
# and that files are all less than 1GB in size.

# delete the old index
curl -XDELETE 'http://localhost:9200/logstash-2015.09.22'

# add a new alias with the old index's name and pointing to the new index
curl -XPOST localhost:9200/_aliases -d '
{
   "actions": [
       { "add": {
           "alias": "logstash-2015.09.22",
           "index": "logstash-2015.09.22-new"
       }}
   ]
}'

# restart logstash if you stopped it above.
sudo service logstash restart

# refresh fields in kibana--you'll lose popularity

Here’s the jq file which converts specified string fields to numeric and boolean fields.


#
# this is run with the jq tool for parsing and modifying json

# from https://github.com/stedolan/jq/issues/670
def translate_key(from;to):
  if type == "object" then . as $in
     | reduce keys[] as $key
         ( {};
       . + { (if $key == from then to else $key end)
             : $in[$key] | translate_key(from;to) } )
  elif type == "array" then map( translate_key(from;to) )
  else .
  end;

def turn_to_number(from):
  if type == "object" then . as $in
     | reduce keys[] as $key
         ( {};
       . + { ($key )
             : ( if $key == from then ($in[$key] | tonumber) else $in[$key] end ) } )
  else .
  end;

def turn_to_boolean(from):
  if type == "object" then . as $in
     | reduce keys[] as $key
         ( {};
       . + { ($key )
             : ( if $key == from then (if $in[$key] == "true" then true else false end ) else $in[$key] end ) } )
  else .
  end;

# for example, this converts any values with this field to numbers, and outputs the rest of the object unchanged
# run with jq -c -f  
.[]|._source| turn_to_number("numfield")

Rinse, wash, repeat.


Avoiding “Module ‘XXX’ is not available” Errors in AngularJS

headbang photo

Photo by Nirazilla

As I continue to build applications with AngularJS, I see how strong the ecosystem is.  While there aren’t quite as many plugins for Angular as there are for JQuery, many of the JQuery plugins have been wrapped up as Angular directives.

One of the issues I banged my head against a couple of times when using a new directive was how many places I had to modify code, and the totally non intuitive error messages that were displayed.

Here are the places you need to modify if you want to use a directive:

  • a module definition (your app or a controller).  You need to add the directive to the list of dependencies: var controllers = angular.module("controllers", [ "localytics.directives",'ngModal' ]);.
  • the index.html file. You need to add links to the javascript files and css that the directive uses. If you are using bower to manage these components, you can find the files under the directory managed by bower.
  • the karma.conf.js file. This file sets up the environment for your unit tests. You want to set up the files attribute to point to the javascript files you added to the index.html page above.

If you don’t add the correct module name to your dependency list, misspell it, or don’t add the javascript to the index.html or your karma configuration file, you will see this error message in your console:

Uncaught Error: [$injector:modulerr] Failed to instantiate module app due to:
Error: [$injector:modulerr] Failed to instantiate module controllers due to:
Error: [$injector:modulerr] Failed to instantiate module ngModall due to:
Error: [$injector:nomod] Module ‘ngModall’ is not available! You either misspelled the module name or forgot to load it. If registering a module ensure that you specify the dependencies as the second argument.
http://errors.angularjs.org/1.2.25/$injector/nomod?p0=ngModall
at http://192.168.0.200:8000/app/bower_components/angular/angular.js:78:12

And the app won’t start.

Hope this helps someone else avoid some head banging.


Preparing Your AngularJS App for Deployment

angle photo

Photo by kevin dooley

I have recently been working on an AngularJS CRUD front end to a REST API (built with DropWizard).  I have been working off the angular-phonecat example app (from the tutorial).

After making a few changes, I wanted to deploy the app to a standalone web server (Apache2).  I naively checked out the codebase on the web server, and visited index.html.

I saw a blank screen.  Looking in the console, I saw this error message: ReferenceError: angular is not defined

Whoops.

“Looks like there’s more to deploying this application than I thought.”  Some searching around doesn’t reveal a lot, maybe because deployment is just taken for granted in the AngularJS community?  Or I don’t know what questions to ask?

Regardless, the fundamental fact of building AngularJS apps for deployment is that, at least with the angular-phonecat base, your dependencies are managed by bower and/or npm, and you need to make sure you bundle them up when you are running on a web server that isn’t the npm started web server.

In practice, this means writing a Gruntfile (or, actually, modify this Gruntfile), which is similar to an ant build.xml file–you write targets and then gather them together.

For my initial Gruntfile, I wanted to make things as simple as possible, so I stripped out some of the fanciness of the g00glen00b file.  In the end, I had two tasks:

  1. bowercopy to  copy my bower dependencies to a temp directory.  I tried to use the bower grunt task, but wasn’t able to make it work.
  2. compress to gather the files and build the zip file

These were bundled together in a ‘package’ task that looked like this: grunt.registerTask('package', [ 'bowercopy', 'compress:dist' ]);

The compress task is complicated and took some figuring out (this post was helpful, as was a close reading of the task’s readme page and the page detailing how file objects can be dynamically generated). Here’s an example of the dist task:

 compress: {
          dist: {
            options: {
              archive: 'dist/<%= pkg.name %>-<%= pkg.version %>.zip'
            },
            files: [{
              src: [ 'app/index.html' ],
              dest: '/',
              expand: true,
              flatten: true
            }, {
              cwd: 'dist/libs',
              src: [ '**' ],
              dest: 'bower_components',
	      expand: true,
            },
              // ... more files definitions 
            ]
          }
  }

I want to unpack this a bit. First, the options object specifies an archive file and the name of the file. You can see that it is dynamically created based on values from the package.json (which is read in earlier in the grunt config file).

Then, the set of files to be added to this archive is specified. The src attribute outlines the list of files to include in the zip file. src handles wildcards (* for all in the directory, ** for all in the directory and subdirectories). The dest attribute, in contrast, indicates the directory where the file is to land in the zip archive. The expand attribute lets grunt do dynamic file matching (more here). The flatten attribute removes all the leading paths from the source files–if you didn’t have flatten:true in the index.html entry, it would be placed at /app/index.html in the zip file. The dist/libs entry handles all the dependencies that were copied to that tmp directory by the bowercopy task. What the cwd attribute tells grunt is to act like it is in that directory for the src attribute. So, a file at dist/lib/foo/bar will be treated like it was foo/bar and, in the task above, copied to bower_components/foo/bar in the zipfile. This allows one to maintain the same directory structure in my index.html file in both dev and production.

Finally, you need to install grunt and run grunt package to get your zipfile with all dependencies, for deployment.

There are a lot of other beneficial changes grunt can make to your app before deployment, like concatenating css and minifying the javascript, and I encourage you to read this post for more information, but this was enough to get the app running in an environment without npm or any of the other angularJS toolchain.


Installing the AngularJS Docs on Your Computer

unplugged photo

Photo by jenny downing

If you want access to the AngularJS API docs offline, download the zipfile for the version you are using (latest as of writing is 1.2.23), unzip it it to a web server directory, and visit the URL in your browser.

The docs need to be served via http or https to work, so you can’t just download the zip file, unzip, and open up the index.html file in your browser.

I use the python simple web server.  If you are also using this simple web server, make sure to start the python web server in the top level directory of the unzipped files, not the docs directory.  The docs reach up a level and pull in javascript and css.

Obviously, you don’t get any links to outside blogs, articles or videos, but this does give you the API, the tutorial and the guide.  When I’m without internet access, having this on my computer gives me at least a fighting chance of solving any issues I run into.


Building an automated postcard mailing system with Lob and Zapier

Courtesy of smoothfluid

Courtesy of smoothfluid

I was looking at automated paper mailing systems recently (and listed what I found), and was especially impressed with Lob, especially the ease of its API.

Among other printing services, Lob will let you mail a postcard with a custom PDF on both sides, or a custom PDF on one side and a text message on the other, anywhere in the USA for $0.94.  (Sorry, not sure about international postcards) The company for which I work sends out tens of thousands of postcards every quarter. The vendor which we use charges them a similar fee (less, but in the same ballpark) but there’s a manual process to deliver the collateral and no API. So an on-demand, one by one post card sending system is very interesting to me.

Note that I haven’t received the Lob postcard which I sent myself, so I can’t speak to quality. Yet.

The Lob API is a bit weird, because the request is form encoded rather than a JSON payload.  It also uses basic auth, but only the username, not the password. But the API seems to have all the pieces you’d need to generate all kinds of postcards–reminder postcards, direct mail postcards, photo postcards, etc.

After testing out the service via the web interface and cURL examples, I thought that it’d be fun to build a Zapier zap. In particular, being able to send a postcard for an entry in a Google spreadsheet seemed like a useful use case. Plus, Zapier is awesome, and I’d wanted to test out their integration environment for myself.

So, I built a Zapier integration for Lob, using the Zapier developer docs in combination with the Lob developer docs. It was actually easy. The most complicated step was translating the Zapier action data, which is a one or two dimensional array of typed data, into the Lob data format, which wanted a couple of text fields and two address arrays. Zapier has a scripting environment that let me modify data from APIs pre and post send, and even had an example about form encoded APIs. Zapier’s JavaScript scripting development environment was full featured, including syntax and error highlighting. It had no real debugging available, but I could use the venerable debug-by-log-statement method fairly easily.

Where could I take this next? Everywhere people use postcards in real life. The postcards depend on PDF files (see a sample), so if you are generating a custom postcard for each interaction things become more complex, but there are a few APIs (based on a 30 second google search, here and here) available for dynamic PDF generation. There are also limits on API call throughput, if I stuck to the Zapier integration–I could send at most 300 postcards a day, unless I managed multiple spreadsheets.

I see reminders of high value events (dentist, house maintenance, etc), contests and marketing as key opportunities for this type of service. And if I had a product where direct mail was a key component, using Lob directly would be worth serious consideration.

Regarding the Zap, I believe I cannot make this Zap available to anyone else. Since I’m not a representative of Lob, I couldn’t commit to maintaining this Zap, and Zapier doesn’t want to have any of their customers depending on an integration that could disappear or be unsupported at any time–a fair position.

If the Zapier or Lob folks want to take this integration and run with it, I’d be happy to share my code–just leave a comment. If anyone else is interested in being able to generate Lob postcards from a Google spreadsheet (or any other compatible API) via Zapier integration, let me know and I’ll see what I can do.


Google Spreadsheet Custom Functions With Spreadsheet Based Configuration

The business for which I work, 8z Real Estate, runs on Google spreadsheets–they are everywhere, and are especially powerful when combined with Google Forms. (Except in the tech department–we use more specialized tools like wikis and bug trackers.)

Recently I was cleaning up one of these spreadsheets and making it more efficient. This spreadsheet was an ongoing list of items that should be charged to various real estate agents. There was a fairly clear set of rules (person A should be charged this for item B, but not for item C, and person D should not be charged for items E, but should for item F). The rules were also fairly constant–not a lot of change. While they were clear, there were some intricacies that had tripped up some folks. And since there was real money involved, there was often a lot of time expended to make sure the charges were correct.

For all these reasons, it made sense to automate the charge calculations. Given the data was already in a Google spreadsheet, I decided a custom function was the best way to do this. The custom function could read from a configuration tab in the same spreadsheet with a list of people and items which would represent the charge rules. In this way, if there was a new item, a new row could be added to the configuration spreadsheet without requiring any help from a developer.

I was able to write the functions fairly quickly, using QUnit for Google Apps Script. I can’t recommend using QUnit highly enough–developing in Google Apps Script combines the joys of javascript (with its … intrinsic difficulties) and a remote execution environment that can be tough to debug. So, unit test your Apps Script code!

The initial implementation pulled data from the configuration tab with each custom function call. This naive implementation worked fine up to a couple of hundred rows. However, eventually this caused Exceeded maximum execution time errors. I believe this is because when the spreadsheet was calculating all the charge values, it was accessing the configuration spreadsheet range hundreds of times a second.

My next step was to try to cache the configuration data. I used stringified JSON stored in the cache service. Unfortunately, this caused a different issue: Service invoked too many times in a short time: cacheService rateMax. Try Utilities.sleep(1000) between calls.

Third time is the charm: have the function return multiple values. The configuration data is only read once, as is the list of items and names. These are iterated and the charges for all of them are calculated and returned in a double array to fill entire columns. Using this technique avoided the above issues, but created another one. Adding new rows with new items wouldn’t update the charges columns. Apparently there is additional, ill documented caching of custom function values. Luckily StackOverflow had an explanation: spreadsheets “evaluate your [custom] functions only when a parameter changes.” The way I worked around this was to pass a parameter to the custom function of the number of non blank rows of the spreadsheet: =calculateCosts(counta(A:A)). When a new row is added, the custom function is re-evaluated.

In the end I have written unit tested code that works in the way the business wants to work (in Google Spreadsheets), that runs on someone else’s infrastructure (Google’s), that can be configured by non technical employees and that will increase the accuracy of the charge calculations. Wahoo!


Google Spreadsheets as REST API sources

I recently built out a read only JSON API with a Google spreadsheet as the back end data source.

Why do such a thing? I didn’t need to modify the back end, but wanted to make the data available to other software. The people who maintain this data are very comfortable using Google spreadsheets. While I could have written a custom CRUD app, this didn’t seem like a good use of time when Google spreadsheets had served us well in the past.

How did I do this? I first of all created a sanitized spreadsheet, using importrange and regexreplace. This level of indirection assures me that if the source data changes, I can adjust fairly easily. If the user managing the spreadsheet wants to rearrange columns, I can adjust my sanitized spreadsheet easily.

Then I created a Google apps script and used doGet to respond to requests and the spreadsheet service to retrieve the data. The content service lets you serve JSON with the appropriate mime type. I used qunit for Google apps script (invaluable when you are working in the loosely typed javascript world and relying on cloud resources). I also worked with the parameters to build the querying needed for our application.

Then, in order to make it look like a normal API call, I fronted the script with Varnish and did some regsub magic in my VCL file (as well as some light authentication).

This approach has the benefit of keeping everything in Google’s cloud, and allowing you to access the spreadsheet data easily.

This approach has significant limitations, however.

  • Google apps scripts calls are slow, especially when accessing spreadsheet data. Using the cache service can help.
  • You cannot return anything other than a 200 response code. None of the other response codes are available.
  • The actual content is served after a redirect, so caching it at the Varnish level is difficult (though possible), and clients must be able to follow redirects.
  • Google changes the ip of the server running the script. This is not such a big problem, unless your version of Varnish only takes IP addresses in the VCL file, not hostnames. Like ours.

Was this a good idea? Well, it let me build out the API relatively quickly without affecting the users managing the data or finding any other place to put it. But we’ll probably move away from this due to the limitations listed above. One we’ve found particularly painful is the IP address switching, which usually only shows up in our automated testing.

We’ll probably start pushing the data daily (it doesn’t change all that often) to a local JDBC database using the JDBC service and use either RestSQL or DropWizard to generate an API for it. (RestSQL is quicker, but DropWizard lets us maintain format compatibility.)


Older versions of Sinon.js don’t work with jquery 2.0

This is a quick hit, hopefully to help someone avoid spending the half day I just did.

The older versions of sinon.js, a helpful javascript testing tool which lets you mock up and stub out objects, do not work with jquery 2.0.  Even though 2.0 is API compatible with the 1.x series, apparently some different stuff happens under the covers.  This is an issue for me because a few months ago, I followed these instructions to set up our testing infrastructure, and used sinon.js version 1.4.2.  That worked fine with jquery 1.8.2, but when I upgraded everything, tests where I mocked up server calls failed–the backbone model’s parse method was never called.

The answer?  Use at least version 1.7.1 of sinon.js.


TimelineJS installation tips

TimelineJS is a JS library for making fantastic looking timelines.  The data that drives these timelines can be in JSON or in google spreadsheets.

I wanted to install this on a site I manage, but ran into a couple of issues.

First, there’s not great installation instructions.  Basically, you need to download the source from github and then copy the ‘compiled’ directory wherever you want the files to go.

I got the files installed, then cribbed from the google spreadsheet example.

Then, I kept seeing this error in the javascript console:  ReferenceError: VMM is not defined.  (That is the Firefox error message–Chrome had one that was slightly different.)  Looking in the chrome javascript console revealed that some js and css files were not being downloaded–they were 404ing.  Adding the js and css config options with full paths to the timeline-min.js and timeline.css files fixed this error.

The last issue I ran into was that the div into which I was putting the timeline was not 100% of the height of the page.  I wanted other content around it.  But the timeline wasn’t showing up at all.  I fixed this issue by putting another div around the timeline containing div, and setting the outer div’s height explicitly.  Then I made the containing div’s height 100%.

Hope these tips help someone out.


Running async tests with qunit and sinon

I ran into an issue running async tests with qunit and sinon.  For a primer on doing this, see this great article.

Basically, the asyncTest never returned.  This manifested itself in the html view of the tests like this (left is success, right is where the test never returns–note the white):

This gallery contains 2 photos.

Turns out that sinon-qunit adapter has sinon fake the browser timer, so setTimeout doesn’t work as expected.

To fix, just turn off the timer faking for a single test: this.clock.restore() or sinon.config.useFakeTimers = false to disable this for all tests.



© Moore Consulting, 2003-2017 +