GoFCCYourself: strange data, let’s capture it!

So this is interesting… yadda-yadda-yadda net-neutrality is gonna be taken away again. That is completely silly (imho) for various reasons:

  • I don’t want a fragmented/silo’d web
  • Information is the most valuable resource (it is literally what the internet is), and putting a tollway on it only widens the gap for the less advantaged

So I was never really interested in the politics of it all. People have varying viewpoints, or do they? Let’s dive into the data.

I first noticed something strange when I went to the site www.gofccyourself.com (which is clearly just a redirect to the FCC’s Electronic Comment Filing System (ECFS)).

Screenshot from 2017-05-09 19-45-32

Look at those names… Quite a few of them are in lexical order, which is strange because I’m sorting by date posted. Not all of them, mind you, but many come with the clipped phrase ...The unprecedented regulatory power the Obama.... The full text is here:

The unprecedented regulatory power the Obama Administration imposed on the internet is smothering innovation, damaging the American economy and obstructing job creation. I urge the Federal Communications Commission to end the bureaucratic regulatory overreach of the internet known as Title II and restore the bipartisan light-touch regulatory consensus that enabled the internet to flourish for more than 20 years. The plan currently under consideration at the FCC to repeal Obama’s Title II power grab is a positive step forward and will help to promote a truly free and open internet for everyone.

Hmm, so many of the submissions in favor of dissolving net neutrality were filed in lexical order of names. That is quite strange. The text isn’t half badly written, either. Let’s dig a little deeper. DuckDuckGo leads me to some reddit links that are on the right track.

Let’s try to get the FCC data for some offline analysis and to see who’s fighting a fair fight… Oh sweet, you can download a CSV. Click.


"=""17-108""","Brittany Mccain","5/9/2017","COMMENT",0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"=""17-108""","Brittany Proctor","5/9/2017","COMMENT",0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"=""17-108""","Brittney Moody","5/9/2017","COMMENT",0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"=""17-108""","Brittney Sharp","5/9/2017","COMMENT",0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"=""17-108""","Bryan White","5/9/2017","COMMENT",0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"=""17-108""","Bryan Wirth","5/9/2017","COMMENT",0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"=""17-108""","Kevin MccLain","5/9/2017","COMMENT",0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"=""17-108""","Chi Pham","5/9/2017","COMMENT",0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"=""17-108""","Caroline Schumacher ","5/9/2017","COMMENT",0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"=""17-108""","Tim Lynch","5/9/2017","COMMENT",0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"=""17-108""","Paul J Bednarski","5/9/2017","COMMENT",0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"=""17-108""","Bryan Ward","5/9/2017","COMMENT",0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"=""17-108""","Brittany Matthews","5/9/2017","COMMENT",0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


The export has a few problems, though:

    • Only the 10,000 most recent filings, out of a total of 607,046 at the time (thanks, sidebar info).
    • No comment text with the entries
    • No longform timestamps
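
Even this limited export is enough to put a number on the "lexical order" oddity. Here is a minimal sketch, under a few assumptions of mine: the export is saved as filings.csv (a made-up filename), no name contains a comma or a quote, and "in order" means each name sorts at or after the previous one byte-wise. The helper names csv_names and longest_sorted_run are hypothetical too.

```shell
# csv_names [FILE...]: print the second field (the filer's name) of each row.
# Assumes fields are double-quoted and names contain no commas or quotes.
csv_names() {
    awk -F'","' 'NF >= 2 { print $2 }' "$@"
}

# longest_sorted_run: read one name per line on stdin and print the length of
# the longest run of consecutive lines in non-decreasing byte-wise order.
longest_sorted_run() {
    LC_ALL=C awk '
        NR == 1 || ($0 "") < (prev "") { run = 0 }   # first line, or order broken
        { run++; if (run > best) best = run; prev = $0 }
        END { print best + 0 }
    '
}

# Usage (not run here, since filings.csv is an assumed filename):
#   csv_names filings.csv | longest_sorted_run
```

Names posted by unrelated people at random times should rarely produce long ascending runs, so a large number here is a hint (not proof) that a batch was submitted from a sorted list.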

What’s a guy to do… Ctrl+Shift+I (the debugging console in Firefox/Chromium). Neato, let’s go over to the network requests and reload… Ooh, yay, the FCC isn’t a bunch of troglodytes after all. Yummy yummy JSON, with a reasonable RESTish API to boot.

Screenshot from 2017-05-09 19-54-08

Let’s see what we’ve got here: https://ecfsapi.fcc.gov/filings?limit=25&proceedings.name=17-108&sort=date_disseminated,DESC looks promising… The response is just what we are looking for: 25 of the filings with all that wonderful metadata 🙂 And offsets work too! Wonderful.
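
Paging through the filings is then just a matter of walking the offset up in steps of limit. A small sketch of how the page URLs could be generated; page_url and page_offsets are hypothetical helper names, and I am assuming offset=0 is accepted as the first page.

```shell
# page_url LIMIT OFFSET [DIRECTION]: print the API URL for one page of docket
# 17-108. DIRECTION defaults to DESC, as in the URL above.
page_url() {
    printf 'https://ecfsapi.fcc.gov/filings?limit=%s&offset=%s&proceedings.name=17-108&sort=date_disseminated,%s\n' \
        "$1" "$2" "${3:-DESC}"
}

# page_offsets TOTAL LIMIT: print the offset of every page needed to cover
# TOTAL filings, one per line.
page_offsets() {
    seq 0 "$2" "$(( $1 - 1 ))"
}

page_url 25 0     # first page
page_url 25 25    # second page
```

This only prints URLs; nothing is fetched.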

So given that they have complained about DDoS attacks, let’s be careful scraping their site. Let’s review their robots.txt. They don’t have one for that domain… argh, OK, back to their top level: https://www.fcc.gov/robots.txt. Crawl-delay: 10 looks relevant, and don’t post to their URLs. Sounds good to me. Let’s see how quickly we get a reasonably sized response back… (oh, and let’s sort ascending so we don’t lose our place if they upload new ones)
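
If you would rather not eyeball robots.txt, the delay can be pulled out mechanically. A rough sketch, with the caveat that it ignores User-agent groups entirely and simply takes the largest Crawl-delay in the file, which errs on the polite side; crawl_delay is a made-up helper name.

```shell
# crawl_delay: read a robots.txt on stdin and print the largest Crawl-delay
# value found (field name matched case-insensitively), or 0 if there is none.
crawl_delay() {
    awk -F: '
        tolower($1) ~ /^[ \t]*crawl-delay[ \t]*$/ {
            value = $2 + 0                  # numeric prefix of the value
            if (value > max) max = value
        }
        END { print max + 0 }
    '
}

# Usage (not run here, to avoid hitting the site):
#   wget -q -O - https://www.fcc.gov/robots.txt | crawl_delay
```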

$ wget "https://ecfsapi.fcc.gov/filings?limit=1000&proceedings.name=17-108&sort=date_disseminated,ASC"

Takes 1.5s to respond. I don’t think that will collapse their servers. Now iterate the lazy man’s way. Unix tools!

$ mkdir gofccyourself
$ cd gofccyourself
$ for i in $(seq 0 1000 3000); do  # 607000 will get you the full list, but I'll upload a torrent somewhere...
>   wget "https://ecfsapi.fcc.gov/filings?limit=1000&offset=$i&proceedings.name=17-108&sort=date_disseminated,ASC" -O "$i"
>   sleep 12  # a little more than the Crawl-delay of 10
> done
$ cd ..
$ tar -cf net-neutrality-filings.tar gofccyourself
$ xz net-neutrality-filings.tar
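
A run this long will probably get interrupted at some point, so here is a sketch of a variant that can be rerun and picks up where it left off. It is not what I actually ran. FETCH and DELAY are knobs I made up for this example (FETCH must write the response body to stdout), and fetch_all is a hypothetical name.

```shell
FETCH=${FETCH:-"wget -q -O -"}   # assumed default: wget writing to stdout
DELAY=${DELAY:-12}               # seconds between requests

# fetch_all LAST_OFFSET: fetch pages 0, 1000, ... LAST_OFFSET into files named
# after their offset, skipping non-empty files left by an earlier run.
fetch_all() {
    for offset in $(seq 0 1000 "$1"); do
        [ -s "$offset" ] && continue        # already downloaded
        url="https://ecfsapi.fcc.gov/filings?limit=1000&offset=$offset&proceedings.name=17-108&sort=date_disseminated,ASC"
        if $FETCH "$url" > "$offset.part"; then
            mv "$offset.part" "$offset"     # only complete downloads get the final name
        else
            rm -f "$offset.part"
            echo "offset $offset failed; rerun to retry" >&2
        fi
        sleep "$DELAY"
    done
}

# Usage (not run here): cd gofccyourself && fetch_all 607000
```

Writing to a .part file first means a half-finished download never gets mistaken for a complete page on the next run.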

By a back-of-the-envelope calculation, 607 requests at ~15 sec each (including the backoff) comes to ~2.5 hrs and ~1GB. OK, let’s let ‘er run a while.
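
For the record, the arithmetic behind that estimate can be sketched like this. The function name estimate is made up, and the ~15 sec per request is my own rough figure from above. Strictly speaking it comes out at 608 requests, since the last partial page needs one too.

```shell
# estimate TOTAL LIMIT SECONDS_PER_REQUEST: print the number of requests and
# the wall-clock estimate in hours (one decimal place).
estimate() {
    total=$1 limit=$2 per_request=$3
    requests=$(( (total + limit - 1) / limit ))   # ceiling division
    seconds=$(( requests * per_request ))
    LC_ALL=C awk -v r="$requests" -v s="$seconds" \
        'BEGIN { printf "%d requests, %.1f hours\n", r, s / 3600 }'
}

estimate 607046 1000 15   # prints "608 requests, 2.5 hours"
```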

So, apparently only 555,972 items are retrievable (those posted in the last 30 days). That is still much better than what we had before. The 1.2GB compresses down to 57MB; get it here.

Next time I’ll set up a workbench so that we can dive into the data and see what it shows.
