Great, we’ve got some data thanks to the previous post. Now how do we dive in?
Well, it seems like these filings were originally in a document database, and given their JSON content, let’s pick one at random from the wonderful list on Wikipedia.
I know the names of some of these, and I’m surprised others weren’t on the list (Cassandra and CouchDB; I guess Apache projects got left off the short-list somehow…).
Anywho, I’ve heard good things about Elasticsearch, and a DuckDuckGo search leads me to believe it should do at least some of what I want; there seem to be reasonable docs, and Kibana looks cool. I’m sold (mostly because it’s free/open)! I haven’t touched Java in years, so let’s hope I don’t need to get into the weeds at all.
I’m cheap, so let’s do this locally and see if we have a shot. After all, 1.2GB should fit in RAM no problem.
I’m not one for reading docs, so I’m going to use Docker, and the pretty verbose documentation looks like what I need. I could deploy it in the cloud if I wanted… but localhost is more fun/free.
```shell
# Assuming recent ubuntu
$ snap install docker
$ sudo docker pull docker.elastic.co/elasticsearch/elasticsearch:5.4.0
$ sudo docker pull docker.elastic.co/kibana/kibana:5.4.0
$ sudo docker network create elastic --driver=bridge
$ sudo sysctl -w vm.max_map_count=262144
$ sudo docker run -p 9200:9200 -p 9300:9300 -e "bootstrap.memory_lock=true" \
    --ulimit memlock=-1:-1 --ulimit nofile=262144:262144 \
    --network elastic --name elasticsearch \
    docker.elastic.co/elasticsearch/elasticsearch:5.4.0 &
$ sudo docker run -p 5601:5601 --network elastic docker.elastic.co/kibana/kibana:5.4.0
```
And time to let that baby spin up (about 5 mins). Disclaimer: I don’t use Docker on a daily basis; there is a docker-compose tool which can do this better.
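For the record, a `docker-compose.yml` roughly equivalent to the two `docker run` commands above might look like the sketch below. This is untested and the service names are my own choices; note that `vm.max_map_count` still has to be set on the host via `sysctl`, since compose can’t do that for you.

```yaml
version: '2'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:5.4.0
    environment:
      - bootstrap.memory_lock=true
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 262144
        hard: 262144
    ports:
      - "9200:9200"
      - "9300:9300"
  kibana:
    image: docker.elastic.co/kibana/kibana:5.4.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
```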
While that is going on, let’s see if we can whip up a script to ingest our data into our elasticsearch node…
Bulk commands and some jq should get us what we want…
I wish that I could either include the `_id` key in the data itself or leave it out on a create. I had to make a Python script just for the bulk commands… silly. That script is available in the git repo that goes along with this series.
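The real script lives in the repo, but the gist of what such a “bulkify” helper has to do is simple: the bulk API wants an action line (`{"create": {...}}`) interleaved before every document line of the NDJSON. A minimal sketch (my own, not the repo version; the `_id` scheme here, filename plus line number, is just one reasonable choice):

```python
import json
import sys


def bulkify(lines, filename):
    """Interleave a bulk-API action line before each NDJSON document line."""
    out = []
    for n, line in enumerate(lines):
        if not line.strip():
            continue
        doc = json.loads(line)
        # Derive a deterministic _id from the source file and line number,
        # so re-running the loop doesn't create duplicate documents.
        action = {"create": {"_id": "{}-{}".format(filename, n)}}
        out.append(json.dumps(action))
        out.append(json.dumps(doc))
    return "\n".join(out) + "\n"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Read NDJSON docs on stdin, write bulk-ready NDJSON on stdout.
    sys.stdout.write(bulkify(sys.stdin, sys.argv[1]))
```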
```shell
$ sudo apt install -y jq curl
$ for i in $(ls -1 [0-9]*); do  # yes, I hope you are in the right directory...
    jq -cM '.filings | del(._index)' "$i" | ./bulkify.py "$i" > current &&
    curl -u elastic:changeme -s -H "Content-Type: application/x-ndjson" \
      -XPOST localhost:9200/fcc/create/_bulk --data-binary "@current"
  done
```
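One gotcha with that loop: `curl -s` happily swallows per-document failures, because the bulk API returns HTTP 200 even when individual items fail and only flags them inside the response body. A small helper (my own, not part of the repo) to scan a bulk response for failed items:

```python
import json


def bulk_errors(response_text):
    """Return the items in a bulk-API response whose status indicates failure.

    The bulk response has the shape {"took": ..., "errors": bool,
    "items": [{"create": {"status": ..., ...}}, ...]}; anything with a
    status >= 300 failed.
    """
    resp = json.loads(response_text)
    return [
        item for item in resp.get("items", [])
        if item.get("create", {}).get("status", 200) >= 300
    ]
```

Piping each curl response through something like this (or just eyeballing the top-level `"errors"` flag with jq) saves you from silently dropping half your filings.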
And the waiting game again… This is actually pretty quick, finishing in about 20 minutes for me. I could have used the harvesting code from the previous post to pipeline it and avoid the wait, but oh well.
Now to the Kibana dashboard already! A whole night gone just for some data crunching… After logging in, it wants us to set up our indexes. OK. We’ve got a time series in the fcc dataset.
Now, finally, in the next post we can get to the bottom of this debacle, I promise.