GoFCCYourself — Setting up your workbench (elasticsearch/kibana)

Great, we’ve got some data thanks to the previous post. Now how do we dive in?

Well, it seems these filings were originally in a document database, and given their JSON content, let’s pick one at random from the wonderful list on Wikipedia.

I know the names of some of these, and I’m surprised others weren’t on the list (Cassandra and CouchDB; I guess Apache projects got left off the short-list somehow…).

Anywho, I’ve heard good things about Elasticsearch, and a DuckDuckGo search leads me to believe it should do at least some of what I want; there seem to be reasonable docs, and Kibana looks cool. I’m sold (mostly because it’s free/open)! I haven’t touched Java in years, so let’s hope I don’t need to get into the weeds at all.

I’m cheap, so let’s do this locally and see if we have a shot. After all, 1.2GB should fit in RAM no problem.

I’m not one for reading docs, so I’m gonna use Docker, and the pretty verbose documentation looks like what I need. I could deploy it in the cloud if I wanted… but localhost is more fun/free.

# Assuming recent ubuntu
$ sudo snap install docker
$ sudo docker pull docker.elastic.co/elasticsearch/elasticsearch:5.4.0
$ sudo docker pull docker.elastic.co/kibana/kibana:5.4.0
$ sudo docker network create elastic --driver=bridge
$ sudo sysctl -w vm.max_map_count=262144
$ sudo docker run -p 9200:9200 -p 9300:9300 -e "bootstrap.memory_lock=true" --ulimit memlock=-1:-1 --ulimit nofile=262144:262144 --network elastic --name elasticsearch docker.elastic.co/elasticsearch/elasticsearch:5.4.0 &
$ sudo docker run -p 5601:5601 --network elastic docker.elastic.co/kibana/kibana:5.4.0

And time to let that baby spin up (about 5 minutes). Disclaimer: I don’t use Docker on a daily basis; there is a Docker Compose tool which can do this better, roughly sketched below.
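
For the curious, a minimal docker-compose.yml for the same two-container setup might look something like the sketch below (Compose v2 syntax, untested here). Compose puts both services on a shared network automatically, and naming the first service elasticsearch should keep Kibana’s default Elasticsearch URL working, the same trick as --name above. The vm.max_map_count sysctl still has to be set on the host.

version: '2'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:5.4.0
    environment:
      - bootstrap.memory_lock=true
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 262144
        hard: 262144
    ports:
      - "9200:9200"
      - "9300:9300"
  kibana:
    image: docker.elastic.co/kibana/kibana:5.4.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

With that in place, a single sudo docker-compose up would replace the two run commands above.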

[Screenshot from 2017-05-15 23-12-49]


While that is going on, let’s see if we can whip up a script to ingest our data into our Elasticsearch node…

Bulk commands and some jq should get us what we want…

I wish I could either include the _id key in the data itself or leave it out on a create. Instead I had to make a Python script just for the bulk commands… Silly. That script is available in the git repo that goes along with this series; a rough sketch of the idea is below.
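
To give a sense of what such a script does (this is not the actual bulkify.py from the repo): it reads the jq output, one JSON filing per line, and prints bulk-format pairs of action line plus document. The id_submission field used as the _id here is an assumption; swap in whatever unique field the filings actually carry.

#!/usr/bin/env python3
# Rough sketch of a bulk-ifier: stdin is one JSON document per line (the jq
# output); stdout is Elasticsearch bulk format (action line, then document).
import json
import sys

# The loop below passes the source filename as argv[1]; this sketch just
# accepts it and doesn't otherwise use it.
source_file = sys.argv[1] if len(sys.argv) > 1 else None

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    doc = json.loads(line)
    # Assumption: each filing has a unique id_submission field to use as the
    # _id on the action line, since it can't live in the document body itself.
    action = {"create": {"_id": doc.get("id_submission")}}
    print(json.dumps(action))
    print(json.dumps(doc))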

$ sudo apt install -y jq curl
$ # yes, I hope you are in the right directory...
$ for i in $(ls -1 [0-9]*); do
    jq -cM '.filings[] | del(._index)' $i | ./bulkify.py $i > current &&
    curl -u elastic:changeme -s -H "Content-Type: application/x-ndjson" \
      -XPOST localhost:9200/fcc/create/_bulk --data-binary "@current"
  done

And the waiting game again… This is actually pretty quick; it finished in about 20 minutes for me. I could have wired this into the harvesting code from the previous post to pipeline it and avoid the wait, but oh well.
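
If you’d rather watch progress than just wait, the cat APIs will show the document count creeping up (same default X-Pack credentials as before):

$ curl -u elastic:changeme 'localhost:9200/_cat/indices?v'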


Now to the Kibana dashboard already! A whole night gone just for some data crunching… After logging in, it wants us to set up our index patterns. Ok. We’ve got a time series in the fcc dataset.

[Screenshot from 2017-05-15 23-20-03]

Finally, in the next post, we can get to the bottom of this debacle, I promise.
