Spark vs. MapReduce Series – Part II – MapReduce

Yesterday I had the time to run my favourite MapReduce example on my Twitter data. And guess what, it's WordCount 🙂 To be honest, this one is available in Hadoop, Spark and Solr, and that's the main reason for the decision. As you might remember, there is a “text” field in my Twitter data model from the first part. This is the field to use for WordCount.
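As a reminder, the relevant part of the table looks roughly like this (a simplified sketch; the real model from the first part has more columns, and apart from text every column name here is an assumption):

    CREATE TABLE twittercan.tweets (
        id bigint PRIMARY KEY,
        user text,
        text text   -- the tweet body, the field WordCount will run on
    );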

After I changed one node to be a Hadoop node (in /etc/default/dse) and raised the replication factor (so the data gets replicated to that node), I was able to start “dse hive”.
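For reference, the steps look roughly like this (a sketch based on my setup; the keyspace name and replication factor come from my cluster, so adapt them to yours):

    # /etc/default/dse – turn this node into a Hadoop (Analytics) node
    HADOOP_ENABLED=1

    # restart DSE so the workload change takes effect
    sudo service dse restart

    -- in cqlsh: raise the replication factor so the analytics node gets a copy
    ALTER KEYSPACE twittercan
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};

    # stream the existing data to the analytics node, then start the Hive shell
    nodetool repair twittercan
    dse hive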

[Screenshot: starting dse hive]

Hive is the easiest way to access the data via Hadoop, without manually extracting the data or copying it to CFS (the Cassandra File System). After Hive had started, I issued the WordCount query:

SELECT word, COUNT(*)
FROM twittercan.tweets
LATERAL VIEW explode(split(text, ' ')) lTable AS word
GROUP BY word;
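split(text, ' ') turns every tweet into an array of words, LATERAL VIEW explode() flattens that array into one row per word, and GROUP BY counts the rows. If you want nicer output, a slightly more robust variant (a sketch, not the query I actually ran) lower-cases the words, splits on any whitespace and lists only the top 20:

    SELECT word, COUNT(*) AS cnt
    FROM twittercan.tweets
    LATERAL VIEW explode(split(lower(text), '\\s+')) lTable AS word
    WHERE word != ''
    GROUP BY word
    ORDER BY cnt DESC
    LIMIT 20;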

You can follow the progress of the job in OpsCenter as well. As you can see, the WordCount took around 20 minutes on the current dataset.

[Screenshot: job progress in OpsCenter]

This was by far the easiest way to do a WordCount on Twitter tweets 🙂