blog blog blog, blog blog blog


December 04, 2014 — sam

(I am a terrible writer, but I'll try to keep these short-ish to the best of my meager abilities.)

So a major part of my thesis is data-gathering. As such, I needed to develop a way to scrape Twitter's stream of publicly-available tweets, process them, and then store them. I've tried this a few different ways, but what I'm currently doing is:

  • Scrape + process tweets using Clojure + Twitter's house-made Hosebird client.
  • Store tweets in a CouchDB instance on one of my local machines.

Why Clojure?

The honest answer is, because I started learning Clojure about 6 months ago, and I REALLY like it. Books and tutorials and such are fine and all, but anyone who's ever tried to learn a programming language on their own knows you really need something to apply your newly-developed skills to in order to actually get anything to sink in. I had already implemented some Twitter processing stuff in Python, and I thought having a basic framework, or at least a semi-concrete set of goals, would be helpful.

The longer answer is that, in many ways, Clojure is a great tool for jobs like this. It's a LISP, which is short for LISt Processor, and it's great at data-processing and handling streams and such. It also has amazing interoperation with Java. I've generally stayed away from Java--it strikes me as extremely cumbersome and fusty. One thing Java does have, however, is a huge market/mind share, and a metric butt-ton of packages and stuff. Using Clojure lets me take advantage of the Java ecosystem without having to write public static void a thousand times. One such library is Hosebird, written by Twitter staffers for consuming the public stream. You set up the client with a streaming API endpoint and a queue, and it pulls tweets from the endpoint and dumps them into the queue. What's nice is that, without getting too technical, Clojure can treat this queue much like any other sequence, which means you can iterate over it and do all kinds of fun things without worrying about Hosebird- or Java-specific implementation details.
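
To make the queue-as-a-list idea a little more concrete, here's a rough sketch in Python rather than Clojure. It is not Hosebird's actual API: the `queue.Queue` is only a stand-in for the Java queue the client fills, and `STOP`, `tweets_from`, and the sample tweets are made-up names for illustration.

```python
import json
import queue

STOP = object()  # hypothetical sentinel marking the end of the stream


def tweets_from(q):
    """Expose a queue as a plain iterator of parsed tweets."""
    # iter(callable, sentinel) keeps calling q.get() until it returns STOP
    for raw in iter(q.get, STOP):
        yield json.loads(raw)


# Stand-in for the client: push a few fake tweets onto the queue.
q = queue.Queue()
for text in ["hello #clojure", "storing things in #couchdb", "no tags here"]:
    q.put(json.dumps({"text": text}))
q.put(STOP)

# From here on it is ordinary iteration; the queue is out of sight.
tagged = [t["text"] for t in tweets_from(q) if "#" in t["text"]]
print(tagged)
```

The point is just that once the queue is wrapped up, the processing code only ever sees "a bunch of tweets to loop over."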

Why CouchDB?

This one is less well thought-out, honestly. The Python version of this stuff interfaced with a MySQL database, because that's what I knew at the time. Unfortunately, MySQL is, in many ways, a bad fit for what I'm doing. For one, it doesn't really 'do' lists of things--each column in a row holds a single value, so if you want to express something like "tweets containing [some hashtag]" you have to create a table that stores the tweets, a second table that stores the hashtags, and a third table that stores the connections between the two. CouchDB, for all intents and purposes, stores things as JSON, which allows documents to have a far more complex internal structure. It's also convenient that Twitter's public stream happens to spit out tweets in JSON format. (Man, I love JSON.) You can thus store lists of tweets keyed by hashtag, with no added overhead if multiple hashtags show up in a single tweet, or if a single hashtag occurs in multiple tweets.
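
The difference between the two layouts can be sketched like this. Everything here is an assumption for illustration: SQLite (from Python's standard library) stands in for MySQL, the table and field names are invented, and the "documents" are plain Python dicts shaped the way a JSON document might be, not real CouchDB documents or real tweet payloads.

```python
import json
import sqlite3
from collections import defaultdict

# Made-up sample data: (tweet id, text, hashtags)
tweets = [
    (1, "learning #clojure for my #thesis", ["clojure", "thesis"]),
    (2, "#clojure and queues", ["clojure"]),
]

# Relational version: three tables, with SQLite standing in for MySQL.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE hashtags (id INTEGER PRIMARY KEY, tag TEXT UNIQUE);
    CREATE TABLE tweet_hashtags (tweet_id INTEGER, hashtag_id INTEGER);
""")
for tweet_id, text, tags in tweets:
    db.execute("INSERT INTO tweets VALUES (?, ?)", (tweet_id, text))
    for tag in tags:
        db.execute("INSERT OR IGNORE INTO hashtags (tag) VALUES (?)", (tag,))
        db.execute(
            "INSERT INTO tweet_hashtags "
            "SELECT ?, id FROM hashtags WHERE tag = ?",
            (tweet_id, tag),
        )

# "Tweets containing #clojure" takes a join across all three tables.
sql_ids = [row[0] for row in db.execute("""
    SELECT t.id FROM tweets t
    JOIN tweet_hashtags th ON th.tweet_id = t.id
    JOIN hashtags h ON h.id = th.hashtag_id
    WHERE h.tag = 'clojure' ORDER BY t.id
""")]

# Document version: one JSON-style document per hashtag, holding a list.
docs = defaultdict(lambda: {"tweets": []})
for tweet_id, text, tags in tweets:
    for tag in tags:
        docs[tag]["tweets"].append({"id": tweet_id, "text": text})

doc_ids = [t["id"] for t in docs["clojure"]["tweets"]]
print(sql_ids, doc_ids)
print(json.dumps(docs["thesis"]))
```

Both versions answer the same question; the document version just does it with a single lookup instead of a join.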

The other nice thing about CouchDB is that it can be thought of as "web-forward"--it's basically built as a web API, so you can issue GET & POST requests to retrieve & upload documents, respectively, and there's a nice built-in web interface called Futon (get it?). I've always anticipated this project having a life of its own on the web somewhere after my thesis is written, and having a web-oriented API is hugely helpful. As a contrast, you'd have to use a MySQL-specific package for your favorite language to even get at your data in any kind of programmatic way, and then you'd have to figure out how to expose it--via a RESTful API, a frontend website, or both (presumably, you could write the frontend to interact with your API, but still). I'm sure there are tons of projects on GitHub and elsewhere that already do this for you, but CouchDB gives it to you "for free," which is nice.
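
As a small illustration of the GET & POST part, here's a sketch that builds those two requests using nothing but Python's standard library. It assumes CouchDB is listening on its default port (5984) on localhost and that there's a database called `tweets`; the database name, the document contents, and the helper names `get_doc` and `post_doc` are all made up. The requests are only constructed here, not sent, so this runs without a CouchDB instance.

```python
import json
import urllib.request

# Assumptions: CouchDB on its default port, and a database named "tweets".
BASE = "http://localhost:5984/tweets"


def get_doc(doc_id):
    """Build a GET request that fetches one document by its id."""
    return urllib.request.Request(f"{BASE}/{doc_id}", method="GET")


def post_doc(doc):
    """Build a POST request that uploads a new document as JSON."""
    return urllib.request.Request(
        BASE,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


upload = post_doc({"_id": "clojure", "tweets": [{"id": 1, "text": "hi"}]})
fetch = get_doc("clojure")
print(upload.get_method(), upload.full_url)
print(fetch.get_method(), fetch.full_url)

# With a CouchDB instance actually running, you would send these with:
#   with urllib.request.urlopen(fetch) as resp:
#       doc = json.load(resp)
```

No database driver is involved anywhere, which is the "for free" part: anything that can speak HTTP and JSON can talk to the database.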

So that's a rundown of what tools I'm using, and some discussion of why I'm using them. Stay tuned for probably a mix of more technical, "how"-oriented posts and more conceptual/explanatory, "why"/"what"-oriented ones. Or I'll just give up on this blog. We'll see.

Tags: thesis, clojure, couchdb
