Guide to Reindexing Elasticsearch data input with Logstash

I ran into an issue where I had set up logstash to load numeric data as strings. Later, when we wanted to build visualizations with that data, they were off, so I needed to re-index all of it.

Total pain, hope this guide helps.  (Here’s some additional Elasticsearch documentation: here and here.)

If you don’t care about your old data, just:

  • shut down logstash
  • deploy the new logstash filter (with mutates)
  • close all old indices
  • turn on logstash
  • send some data through to logstash
  • refresh fields in kibana--you'll lose field popularity counts

Now, if you do care about your old data, well, that’s a different story. Here are the steps I took:

First, modify the new logstash filter file to use mutate, and deploy it. This takes care of the logstash indexes going forward, but it will cause some kibana pain until you convert all the past indexes (because some indexes will have fields as strings and others as numbers).
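
Here's a minimal sketch of the kind of mutate block I mean, assuming hypothetical fields response_time and cache_hit that had been indexed as strings:

filter {
  mutate {
    # convert string fields to their proper types going forward
    convert => {
      "response_time" => "integer"
      "cache_hit"     => "boolean"
    }
  }
}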

Install jq, which will help you transform your data (jq is magic, I tell you).
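
Depending on your distro, one of these should get it (or grab a binary from the jq download page):

sudo yum install -y jq     # RHEL/Amazon Linux (jq is in EPEL)
sudo apt-get install -y jq # Debian/Ubuntu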

Then, for each day/index you care about (logstash-2015.09.22 in this example), you want to follow these steps.

# get the current mapping
curl -XGET 'http://localhost:9200/logstash-2015.09.22/_mapping?pretty=1' > mapping

#back it up
cp mapping mapping.old

# edit mapping, change the types of the fields that are strings to long, float, or boolean.  I used vi

# create a new index with the new mapping 
curl -XPUT 'http://localhost:9200/logstash-2015.09.22-new/' -d @mapping

# find out how many rows there are.  If there are too many, you may want to use the scrolled search.  
# I handled indexes as big as 500k documents with the below approach
curl -XGET 'localhost:9200/logstash-2015.09.22/_count'
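
# for reference, the scrolled-search alternative looks roughly like this (a sketch using the
# ES 1.x-era scan/scroll syntax; size is per shard per round trip)
curl -XGET 'localhost:9200/logstash-2015.09.22/_search?search_type=scan&scroll=1m&size=1000' -d '{ "query": { "match_all": {} } }'

# the response contains a _scroll_id; keep feeding it back until no more hits are returned
curl -XGET 'localhost:9200/_search/scroll?scroll=1m' -d '<_scroll_id from the previous response>'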

# if you are modifying an old index, no need to stop logstash, but if you are modifying an index with data currently going to it, you need to stop logstash at this step.

# change size below to be bigger than the count.
# (the file names "searchresults" and "docs" below are just placeholders--use whatever you like)
curl -XGET 'localhost:9200/logstash-2015.09.22/_search?size=250000' > searchresults

# edit data, just get the array of docs without the metadata
sed 's/^[^[]*\[/[/' searchresults | sed 's/..$//' > docs

# run jq to build a bulk insert compatible json file
# (make sure to correct the _index value in the line below)
jq -f jq.file docs | jq -c '
{ index: { _index: "logstash-2015.09.22-new", _type: "logs" } },
.' > toinsert

# where jq.file is the file below

# post the toinsert file to the new index
curl -s -XPOST localhost:9200/_bulk --data-binary "@toinsert"; echo

# NOTE: depending on the size of the toinsert file, you may need to split it up into multiple files using head and tail.  
# Make sure you don't split the metadata and data line (that is, each file should have an even number of lines), 
# and that files are all less than 1GB in size.
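
# for example, something like this (a sketch; 20000 is an arbitrary even chunk size):
split -l 20000 toinsert toinsert_part_
for f in toinsert_part_*; do
  curl -s -XPOST localhost:9200/_bulk --data-binary "@$f"; echo
done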

# delete the old index
curl -XDELETE 'http://localhost:9200/logstash-2015.09.22'

# add a new alias with the old index's name and pointing to the new index
curl -XPOST localhost:9200/_aliases -d '
{
   "actions": [
       { "add": {
           "alias": "logstash-2015.09.22",
           "index": "logstash-2015.09.22-new"
       } }
   ]
}'

# restart logstash if you stopped it above.
sudo service logstash restart

# refresh fields in kibana--you'll lose field popularity counts

Here’s the jq file which converts specified string fields to numeric and boolean fields.

# this is run with the jq tool for parsing and modifying json

def translate_key(from;to):
  if type == "object" then . as $in
     | reduce keys[] as $key
         ( {};
       . + { (if $key == from then to else $key end)
             : $in[$key] | translate_key(from;to) } )
  elif type == "array" then map( translate_key(from;to) )
  else .
  end;

def turn_to_number(from):
  if type == "object" then . as $in
     | reduce keys[] as $key
         ( {};
       . + { ($key)
             : ( if $key == from then ($in[$key] | tonumber) else $in[$key] end ) } )
  else .
  end;

def turn_to_boolean(from):
  if type == "object" then . as $in
     | reduce keys[] as $key
         ( {};
       . + { ($key)
             : ( if $key == from then (if $in[$key] == "true" then true else false end) else $in[$key] end ) } )
  else .
  end;

# for example, this converts the values of the named field to numbers and outputs the rest of the object unchanged
# run with: jq -c -f jq.file
.[]|._source| turn_to_number("numfield")
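
# or, instead of the line above, chain several of the filters (field names here are hypothetical)
# .[]|._source| turn_to_number("numfield") | turn_to_boolean("boolfield") | translate_key("oldname";"newname")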

Rinse, wash, repeat.

Kibana Visualizations that Change With Browser Reload

I ran into a weird problem with Kibana recently.  We are using the ELK stack to ingest some logs and do some analysis, and when the Kibana webapp was reloaded, it showed different results for certain visualizations, especially averages.  Not all of them, and the results were always close to the actual value, but when you see 4.6 one time and 4.35 two seconds later on a system under light load and for the exact same metric, it doesn’t inspire confidence in your analytics system.

I dove into the issue.  Using Chrome DevTools, I noticed that the visualizations that were most squirrely were loaded last.  That made me suspicious that there was some failure causing missing data, which caused the average to change.  However, the browser API calls weren't failing; they were succeeding.

I first looked in the Elasticsearch and Kibana configuration files to see if there were any easy timeout configuration values that I was missing, but I didn't see any.

I then tried to narrow down the issue.  When it was originally noted, we had about 15 visualizations working on about a month's worth of data.  After a fair bit of URL manipulation, I determined that the discrepancies appeared regularly when there were about 10 visualizations, or when I cut the data down to four hours' worth.  This gave me more confidence in my theory that some kind of timeout or other resource constraint was the issue.  But where was the issue?

I then looked in the ElasticSearch logs.  We have a mapping issue, related to a scripted field and outlined here, which caused a lot of white noise, but I did end up seeing an exception:

org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on$23@3c26b1f5
        at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(
        at java.util.concurrent.ThreadPoolExecutor.reject(
        at java.util.concurrent.ThreadPoolExecutor.execute(
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(
        at org.elasticsearch.client.node.NodeClient.execute(
        at org.elasticsearch.client.FilterClient.execute(
        at org.elasticsearch.http.HttpServer.internalDispatchRequest(
        at org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(
        at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(
        at org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(
        at org.elasticsearch.http.netty.pipelining.HttpPipeliningHandler.messageReceived(
        at org.elasticsearch.common.netty.handler.codec.http.HttpChunkAggregator.messageReceived(
        at org.elasticsearch.common.netty.handler.codec.http.HttpContentDecoder.messageReceived(
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.callDecode(
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.messageReceived(
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(

That led me to this StackOverflow post, which in turn led me to run this command on my ES instance:

$ curl -XGET localhost:9200/_cat/thread_pool?v
host            ip bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
ip-10-253-44-49              0          0             0            0           0              0             0            0               0
ip-10-253-44-49              0          0             0            0           0              0             0            0           31589

And as I ran that command repeatedly, I saw the search.rejected number getting larger and larger. Clearly I had a misconfiguration/limit around my search thread pool. After looking at the CPU and memory and i/o on the box, I could tell it wasn’t stressed, so I decided to increase the queue size for this pool. (I thought briefly about modifying the search thread pool size, but this article warned me off.)

This GH issue helped me understand how to modify the threadpool briefly so I could test the theory.
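
What that dynamic change looks like is something along these lines (a sketch; the exact setting name and a sensible queue size depend on your ES version and workload):

curl -XPUT localhost:9200/_cluster/settings -d '
{
    "transient": { "threadpool.search.queue_size": 2000 }
}'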

After making this configuration change, search.rejected went to zero, and the visualization aberrations disappeared. I will modify the elasticsearch.yml file to make this persist across server restarts and re-provisions, but for now, the issue seems to be addressed.
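
The persistent version would be a line along these lines in elasticsearch.yml (again, a sketch):

threadpool.search.queue_size: 2000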

The Deployment Age

If you haven’t read The Deployment Age (and its follow on post), you should go read it right now.

The premise is that we're entering the deployment phase of a technology super-cycle built on the Internet and the PC, where the technology becomes far more integrated and invisible and the chief means of financing is internal company resources.  The focus will be on existing markets, not creating new ones, and on refinement rather than innovation.

If you work in technology and are interested in the big picture, it is worth a read:

Some things we’ve learned over the past 30 years–that novelty is more important than quality; that if you’re not disrupting yourself someone else will disrupt you; that entering new markets is more important than expanding existing markets; that technology has to be evangelized, not asked for by your customers–may no longer be true. Almost every company will continue to be managed as if these things were true, probably right up until they manage themselves out of business. There’s an old saying that generals are always fighting the last war, it’s not just generals, it’s everyone’s natural inclination.

Go read it: The Deployment Age.


Yesterday, I launched a partial rewrite of a long running side project connecting people to farm shares.  Over the past few months I’d fixed a few pressing bugs and overhauled the software so that a site and data model that previously supported only cities and zips as geographic features now supported states as well.

For the past week I’d been hemming and hawing about when to release–fixing “just one more thing” or tweaking one last bit.

But yesterday I finally bit the bullet and released.  Of course, there were a couple of issues that I hadn’t addressed that only showed up in production.  But after squashing those bugs, I completed a few more tasks: moving over SEO (as best as I can) and social accounts, making sure that key features and content hadn’t been thrown overboard, and checking the logs to make sure that users are finding what they need.

Why was I so reluctant to release?  It’s a big step, shipping something, but in some ways it’s the only step that matters.  I was afraid that I’d forget something, screw things up, make a mistake.  And of course, I didn’t want to take a site that helped a couple thousand people a month find local food options that work for them and ruin it.  I wasn’t sure I’d have time to support the new site.  I wanted the new site to have as close to feature parity as possible.  I worked on the old site for five years, and a lot of features crept in (as well as external dependencies, corners of the site that google and users have found, but I had forgotten).

All good reasons to think before I leapt.

But you can only plan for so much.  At some point you just have to ship.

Heroku drains

So, I’ve learned a lot more than I wanted to about heroku drains. These are log sinks to which heroku forwards your application’s log output.  Once the logs are out of heroku, you can analyze them just as you would for any application living outside of a PaaS.  Logs are very useful for seeing long term trends, debugging, etc.  (I’ve worked on both a rails3 app and a java spring/camel app that deploy to heroku.)

Here are some things I’ve learned:

  • Heroku drains are well documented.
  • You definitely want them for any production application, because only 1500 lines of heroku logs are retained at any one time.
  • They can go to either syslog (great for applications with a lot of other infrastructure) or https (great for applications without as much infrastructure support).  See the sketch after this list for adding each kind.
  • They can’t do any kind of authorization.
  • You can’t know what ip address the logs are coming from, so you can’t limit access by IP.
  • There are third party extensions you can pay for to avoid dealing with drains at all (I’ve heard good things about papertrail.)
  • You can use logstash to pull heroku logs from a syslog drain into elastic search.
  • There are numerous github projects that can drain to databases, etc.  There’s even one that, with echos of Ouroboros, drains to another heroku app.
  • Drains have intelligent behavior if your listener (or listeners) fails.  From heroku support: “The short answer is yes, the drain will drop logs when the sink is not responsive, but this isn’t really the full story. There are a number of undocumented limits and backoff retries that happen when a drain connection is lost.”  And then they go on to explain how the backoff behaviour happens.  I’m not going to cut and paste their entire answer because I assume it is undocumented for a reason (maybe it changes, maybe they don’t want to commit to supporting this behavior).  Ask them yourself :)
  • A simple drain can be as easy as <?php error_log(file_get_contents('php://input'), 3, "/var/log/logfile.log"); ?>, but make sure you rotate that log file.
  • You can use puppet to manage drains if you are bringing servers up and down, using the heroku toolbelt and CLI authentication.
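
Attaching a drain is a one-liner with the heroku CLI; a sketch (hostnames and the app name are placeholders):

heroku drains:add syslog://logs.example.com:514 --app my-app      # syslog drain
heroku drains:add https://listener.example.com/logs --app my-app  # https drain
heroku drains --app my-app                                        # list what's attached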

If you are deploying anything beyond a toy app on heroku, don’t forget the ops folks and make sure you set up your drain!

“Wave a magic wand”

That was what a previous boss said when I would ask him about some particularly knotty, unwieldy issue. “What would the end solution look like if you could wave a magic wand and have it happen?”

For instance, when choosing a vendor to revamp the flagship website, don’t think about all the million details that need to be done to ensure this is successful. Don’t think about who has the best process. Certainly don’t think about the technical details of redirects, APIs and integrations. Instead, “wave a magic wand” and envision the end state–what does that look like? Who is using it? What do they do with it? What do you want to be able to do with the site? What do you want it to look like?

Or if an employee is unhappy in their role, ask them to “wave the magic wand” and talk about what role they’d rather be in. With no constraints you find out what really matters to them (or what they think really matters to them, to be more precise).

When you think about issues through this lens, you focus on the ends, not the means.  It lets you think about the goal and not the obstacles.

Of course, then you have to hunker down, determine if the goal is reachable, and if so, plan how to reach it. I like to think of this as projecting the vector of the ideal solution into the geometric plane of solutions that are possible to you or your organization–the vector may not lie in the plane, but you can get as close as possible.

“Waving a magic wand” elevates your thinking. It is a great way to think about how to solve a problem not using known methods and processes, but rather determining the ideal end goal and working backwards from there to the “hows”.

Masterless puppet and CloudFormation

I’ve had some experience with CloudFormation in the past, and recently gained some puppet expertise.  I thought it’d be great to combine the two, working on a new project to set up the ELK stack for a client.

Basically, we are creating an ec2 instance (or a number of them) from a vanilla image using a CloudFormation template, doing a small amount of initialization via the UserData section and then using puppet to configure them further.  However, puppet is used in a masterless context, where the intelligence (of knowing which machine should be configured which way) isn’t in the manifest file, but rather in the code that checks out the modules and manifests. Here’s a great example of a project set up to use masterless puppet.

Before I dive into more details, other solutions I looked at included:

  • doing all the machine setup in UserData
    • This is a bad idea because it forces you to set up and tear down machines each time you want to make a configuration change.  Leads to a longer development cycle, especially at first.  Plus bash is great for small configurations, but when you have dependencies and other complexities, the scripts can get hairy.
  • pulling a bash script from s3/github in UserData
    • puppet is made for configuration management and handles more complexity than a bash script.  I’ll admit, I used puppet with an eye to the future when we had more machines and more types of machines.  I suppose you could do the same with bash, but puppet handles more of typical CM tasks, including setting up cron jobs, making sure services run, and deriving dependencies between services, files and artifacts.
  • using a different CM tool, like ansible or chef
    • I was familiar with puppet.  I imagine the same solution would work with other CM tools.
  • using a puppet master
    • This presentation convinced me to avoid setting up a puppet master.  Cattle not pets.
  • using cloud-init instead of UserData for initial setup
    • I tried.  I couldn’t figure out cloud-init, even with this great post.  It’s been a few months, so I’m afraid I don’t even remember what the issue was, but I remember this solution not working for me.
  • creating an instance/AMI with all software installed
    • puppet allows for more flexibility, is quicker to setup, and allows you to manage your configuration in a VCS rather than a pile of different AMIs.
  • using a container instead of AMIs
    • isn’t docker the answer to everything? I didn’t choose this because I was entirely new to containerization and didn’t want to take the risk.

Since I’ve already outlined how the solution works, let’s dive into details.

Here’s the UserData section of the CloudFormation template:

          "Fn::Base64": {
            "Fn::Join": [
              "",
              [
                "#!/bin/bash \n",
                "exec > /tmp/part-001.log 2>&1 \n",
                "date >> /etc/ \n",
                "yum install puppet -y \n",
                "yum install git -y \n",
                "aws --region us-west-2 s3 cp s3://s3bucket/auth-files/id_rsa /root/.ssh/id_rsa && chmod 600 /root/.ssh/id_rsa \n",
                "# connect once to github, so we know the host \n",
                "ssh -T -oStrictHostKeyChecking=no git@github.com \n",
                "git clone \n",
                "puppet apply --modulepath repo/infra/puppet/modules pure-spider/infra/puppet/manifests/",
                { "Ref" : "Environment" },
                "/logstash.pp \n",
                "date >> /etc/\n"
              ]
            ]
          }

So, we are using a bash script, but only for a little bit.  The second line (starting with exec) stores output into a logfile for debugging purposes.  We then store off the date and install puppet and git.  The aws command pulls down a private key stored in s3.  This instance has access to s3 because of an IAM setup elsewhere in the CloudFormation template–the access we have is read-only and the private key has already been added to our github repository.  Then we connect to github via ssh to ‘get to know the host’.  Then we clone the repository containing the infrastructure code.  Finally, we apply the manifest, which is partially determined by a parameter to the CloudFormation template.

This bash script will run on creation of the EC2 instance.  Once this script is solid, if you are testing adding additional puppet modules, you only have to do a git pull and puppet apply to add more functionality to the modules.  (Of course, at the end you should stand up and tear down via CloudFormation just to test end to end.)  You can also see how it’d be easy to have the logstash.conf file be a parameter to the CloudFormation template, which would let you store your configuration for web servers, database servers, etc, in puppet as well.
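
Once the instance is up, the iteration loop is short; a sketch (the checkout path is a placeholder, and "production" stands in for whatever the Environment parameter was):

cd /path/to/checkout && git pull
puppet apply --modulepath infra/puppet/modules infra/puppet/manifests/production/logstash.pp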

I’m happy with how flexible this solution is.  CloudFormation manages the machine creation as well as any other resources, puppet manages the software installed in those machines, and git allows you to maintain all that configuration in one place.

Making my Twitter feed richer with Zapier and hnrss


I read Hacker News, a site for startups and technologies, and occasionally post as well.  A few months back, I realized that the items that I post to HN, I want to tweet as well.  While I could have whipped something up with the HN RSS feed and the Twitter API (would probably be easier than Twitversation), I decided to try to use Zapier (which I’ve loved for a while).  It was dead simple to set up a Zap reading from my HN RSS feed and posting to my Twitter feed.  Probably about 10 minutes of time, and now I doubled my posts to Twitter.

Of course, this misses out on one of the huge benefits of Twitter–the conversational nature of the app.  When my auto posts happen, I don’t have a chance to follow up, or to cc: the authors, etc.

However, the perfect is the enemy of the good, and I figured it was better to engage in Twitter haphazardly and imperfectly than not at all.

Good time had by all at the HN Meetup

Where can you talk about super-capacitors vs batteries, whether you should rewrite your app (hint, you shouldn’t), penetration testing of well known organizations, cost of living, and native vs cross platform mobile apps?  All while enjoying a cold drink and the best fried food the Dark Horse can offer?

At the Boulder Hacker News Meetup, that’s where.  We had our inaugural meetup today and had a good showing.  Developers, startup owners, FTEs, contractors, backend folks, front end devs and penetration testers all showed up, and, as the Meetup page suggests, ate, drank and chatted.

Hope to see you at the next one.
