In this series I will try to show how to use Cassandra for storing data and how to use MapReduce or Spark to analyse it. First I have to get some data into an Apache Cassandra database, and I decided to store tweets. That sounds easy, but with CQL you have to think up front about your data model and how you want to store the data. Unlike with Thrift, which is going away, you can't store data without having any idea of what it will look like.

I used Python to store the tweets in my database and hit some issues when storing data directly with CQL INSERT commands: they produce a lot of lines of code, and you have to get every data format exactly right. Jon Haddad, a Technical Evangelist here at DataStax, pointed me to CQLEngine. CQLEngine is a kind of CQL wrapper that allowed me to define a data model directly in my Python code. So I used cassandra, tweepy, json and cqlengine to build the script (note that you have to create a keyspace in Cassandra beforehand, based on your needs).
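To give an idea of what "getting every data format exactly right" means, here is a minimal, standard-library-only sketch of the mapping step: it flattens one tweet (as JSON) into typed column values and builds a parameterized INSERT statement. This is not the cqlengine-based script itself. The table name `tweets`, the column list and the helper names `tweet_to_row` and `insert_statement` are assumptions made up for illustration, and the sample tweet is invented.

```python
import json
from datetime import datetime

# Hypothetical table and column names, chosen for illustration only.
TABLE = "tweets"
COLUMNS = ("tweet_id", "created_at", "user_name", "lang", "text", "hashtags")


def tweet_to_row(raw_json):
    """Flatten one tweet (a JSON string) into a dict of typed column values."""
    tweet = json.loads(raw_json)
    entities = tweet.get("entities", {})
    return {
        "tweet_id": int(tweet["id"]),
        # Twitter timestamps look like "Wed Aug 27 13:08:45 +0000 2014".
        "created_at": datetime.strptime(
            tweet["created_at"], "%a %b %d %H:%M:%S %z %Y"
        ),
        "user_name": tweet["user"]["screen_name"],
        "lang": tweet.get("lang"),
        "text": tweet["text"],
        # Keep only the hashtag strings, not their position metadata.
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
    }


def insert_statement(table=TABLE, columns=COLUMNS):
    """Build a parameterized CQL INSERT with one %s placeholder per column."""
    placeholders = ", ".join(["%s"] * len(columns))
    return "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), placeholders
    )


# Invented sample tweet, reduced to the fields used above.
sample = json.dumps({
    "id": 1,
    "created_at": "Wed Aug 27 13:08:45 +0000 2014",
    "user": {"screen_name": "example_user"},
    "lang": "en",
    "text": "Storing tweets in #cassandra",
    "entities": {"hashtags": [{"text": "cassandra", "indices": [18, 28]}]},
})

row = tweet_to_row(sample)
statement = insert_statement()
params = tuple(row[column] for column in COLUMNS)
print(statement)
print(params)
```

With the DataStax Python driver, a statement and parameter tuple of this shape could be passed to `session.execute(statement, params)`. Every additional field needs another column, another conversion and another placeholder, which is exactly the bookkeeping that a model class in cqlengine takes off your hands.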



If needed, you can find the description of the tweet data types and entities in the Twitter documentation: https://dev.twitter.com/docs

This script will now pump the tweets into your Cassandra database. You can check the stored values in DevCenter by selecting all rows of your table.


DevCenter Screenshot

With this data in place, we are now able to run some MapReduce or Spark jobs on it. Part II will take a look at MapReduce first…