Wednesday, September 25, 2013

Talk: Introduction to Cassandra, CQL, and the Datastax Java Driver

Presenter: Johnny Miller
Who is he? Datastax Solutions Architect
Where? Skills Matter eXchange in London

I went to an "introductory" talk even though I have a lot of experience with Cassandra for a few reasons:
  • Meet other people in London that are using Cassandra
  • To discover what I don't know about Cassandra
Here are my notes, in roughly the same order as the talk.

What's Cassandra? The headlines

  • Been around for ~5 years - originally developed by Facebook for their inbox search
  • Distributed key-value store - column-oriented data model
  • Tuneable consistency - per request decide how consistent you want the response to be
  • Datacenter aware with asynchronous replication
  • Designed for use as a cluster - not much value in a single node Cassandra deployment

Gossip - how nodes in a cluster learn about other nodes

  • P2P protocol for how nodes discover location and state of other nodes
  • New nodes are given seed nodes for bootstrapping - but these aren't single points of failure as they aren't used again

Data distribution and replication

  • Replication factor: How many nodes each piece of data is stored on
  • Each node is given a range of primary keys to look after

Partitioners - How to decide which node gets what data

  • Row keys are hashed to decide node then a replication strategy defines how to pick the other replicas

Replicas - how to select where else the data lives

  • All replicas are equally important. No difference between the node the key hashed to and the other replicas that were selected
  • Two ways to pick the other replicas:
    • Simple: Only single DC. Specify just a replication factor. Hashes the key and then walks the cluster and picks the replicas. Not very clever - all replicas could end up on the same rack
    • Network: Configure with a RF per DC. Walk the ring for each DC until it reaches a node in another rack
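The hash-then-walk placement above can be sketched in a few lines of Scala (a toy model with made-up tokens and node names, not Cassandra's actual implementation):

```scala
// Toy model of SimpleStrategy replica placement: hash the row key onto
// a ring of tokens, then walk clockwise until RF distinct nodes are picked.
// Tokens and node names are made up for illustration.
val ring: Vector[(Int, String)] =
  Vector((0, "node-a"), (25, "node-b"), (50, "node-c"), (75, "node-d"))

// Stand-in for the real partitioner's hash (e.g. Murmur3).
def tokenFor(rowKey: String): Int = math.abs(rowKey.hashCode) % 100

def replicasFor(token: Int, rf: Int): Seq[String] = {
  // The first node whose token is >= the key's token owns the key
  // (wrapping around the ring if we fall off the end).
  val start = ring.indexWhere { case (t, _) => t >= token } match {
    case -1 => 0
    case i  => i
  }
  // Walk the ring clockwise to collect the remaining replicas.
  (0 until rf).map(i => ring((start + i) % ring.size)._2)
}

println(replicasFor(tokenFor("some-key"), 3))
```

As the notes say, this simple strategy isn't rack-aware: consecutive nodes on the ring could all end up in the same rack.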

Snitch - how to define a data centre and a rack

  • Informs Cassandra about node topology, designates DC and Rack for every node
  • Example: RackInferringSnitch designates DC and Rack based on the IP of the node 
  • Example: PropertyFileSnitch where every node has the DC and Rack of every other node
  • Example: GossipingPropertyFileSnitch: every node knows its own DC and Rack and tells other nodes via Gossip
  • Dynamic snitching: monitors performance of reads, this snitch wraps the other snitches to respond to network latency

Client requests

  • Connect to any node in the cluster - it becomes the coordinator. This node knows which nodes to talk to for the request
  • Multi DC - picks a coordinator in the other data centre to replicate data there or to get data for a read

Consistency

  • Quorum = (Replication Factor / 2) + 1, using integer division - i.e. more than half
  • E.g. RF = 3, Q = 2: you can tolerate 1 replica going down and continue reading and writing at QUORUM
  • Per request consistency - can decide certain writes are more important and require higher consistency than others
  • Example consistency levels: ANY, ONE, TWO, THREE, QUORUM, EACH_QUORUM, LOCAL_QUORUM
  • SERIAL: New in Cassandra 2.0
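The quorum arithmetic is worth pinning down; a quick sketch (note the integer division):

```scala
// Quorum = (replication factor / 2) + 1, using integer division,
// i.e. the smallest number that is more than half the replicas.
def quorum(replicationFactor: Int): Int = replicationFactor / 2 + 1

// How many replicas can be down while still reading/writing at QUORUM.
def tolerableFailures(replicationFactor: Int): Int =
  replicationFactor - quorum(replicationFactor)

println(quorum(3))             // 2
println(tolerableFailures(3))  // 1
println(quorum(6))             // 4
```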

Write requests - what happens?

  • The coordinator (node the client connects to) forwards the write to all the replicas in the local DC and designates a coordinator in the other DCs to do the same there
  • The coordinator may be a replica but does not need to be
  • On a single node, writes first go to the commit log (disk), then to the memtable (memory)
  • When does the write succeed? Depends on consistency e.g. a write consistency of ONE means that the data needs to be in the commit log and memtable of at least one replica

Hinted handoff - how Cassandra deals with nodes being down on write

  • Coordinator node keeps hints if one of the replicas down
  • When the node comes back up the hints are then sent to the node so it can catch up
  • Hints are kept for a finite amount of time - default is three hours
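A toy model of the hint mechanism (illustrative only - real hint storage and replay are far more involved, and the names here are made up):

```scala
import scala.collection.mutable

// Toy model of hinted handoff: the coordinator buffers a hint for each
// write that a down replica missed, and replays the ones still inside
// the hint window when the replica comes back.
case class Hint(replica: String, write: String, createdAtMillis: Long)

class Coordinator(hintWindowMillis: Long = 3L * 60 * 60 * 1000) { // 3h default
  private val hints = mutable.Buffer[Hint]()

  // If the replica is down, keep a hint instead of losing the write.
  def writeTo(replica: String, replicaUp: Boolean, write: String, now: Long): Unit =
    if (!replicaUp) hints += Hint(replica, write, now)

  // On recovery, replay hints for this replica that haven't expired.
  def replayFor(replica: String, now: Long): List[String] = {
    val mine = hints.filter(_.replica == replica)
    hints --= mine
    mine.filter(h => now - h.createdAtMillis <= hintWindowMillis)
        .map(_.write).toList
  }
}
```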

Read requests  - what happens?

  • Coordinator contacts a number of nodes depending on the consistency - once enough have responded the read can be successful 
  • Requests are sent to the nodes that are responding the fastest
  • If not consistent - compare timestamps + do a read repair
  • A read repair may also happen in the background across the other replicas
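The "compare timestamps" step can be sketched: whichever replica returns the newest write timestamp wins, and replicas holding older values get the winning value written back (a simplified model, not Cassandra's cell-level resolution):

```scala
// Toy model of read repair resolution: each replica returns its value
// and the timestamp it was written at; the newest timestamp wins and
// replicas holding older values are repaired.
case class Response(replica: String, value: String, timestamp: Long)

def resolve(responses: Seq[Response]): (String, Seq[String]) = {
  val winner = responses.maxBy(_.timestamp)
  val staleReplicas =
    responses.filter(_.timestamp < winner.timestamp).map(_.replica)
  (winner.value, staleReplicas)
}

val (value, repaired) = resolve(Seq(
  Response("node-a", "old value", 100L),
  Response("node-b", "new value", 200L),
  Response("node-c", "new value", 200L)))
println(value)     // new value
println(repaired)  // List(node-a)
```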

What was missing?

Overall it was a great talk, however here are some possible improvements:
  • A glossary/overview at the start? Perhaps a mapping from relational terminology to Cassandra terminology. For example the term keyspace was used a number of times before describing what it is
  • An overview of consistency when eventual consistency was first mentioned - it did come later, but a few scenarios showing when reads/writes at different consistency levels would fail/succeed would have been very helpful
  • Compaction required for an intro talk? I thought talking about compaction was a bit too much for an introductory talk as you need to understand memtables and sstables before it makes sense
  • The downsides of Cassandra: for example, some forms of schema migration/change are a nightmare when you are using CQL3 and have data you need to migrate

Sunday, September 22, 2013

Scala, MongoDB and Casbah: Dealing with Arrays

Get hold of a collection object using something like this:

scala> val collection = MongoClient()("test")("messages")
collection: com.mongodb.casbah.MongoCollection = messages

Where test is the database and messages is the name of the collection.

Inserting arrays is nice and easy, just build up your MongoDBObject with Lists inside:

scala> collection.insert(MongoDBObject("message" -> "Hello World") ++ ("countries" -> List("England","France","Spain")))
res18: com.mongodb.casbah.Imports.WriteResult = { "serverUsed" : "/" , "n" : 0 , "connectionId" : 234 , "err" :  null  , "ok" : 1.0}

Use your favourite one liner to print all the objects in the collection:

scala> collection.foreach(println(_))
{ "_id" : { "$oid" : "523f145e30041dae32fd04da"} , "message" : "Hello World" , "countries" : [ "England" , "France" , "Spain"]}

Now let's say you want a list of objects; simply create a list of MongoDBObjects:

scala> collection.insert(MongoDBObject("message" -> "A list of objects?") ++ ("objects" -> List(MongoDBObject("name" -> "England"),MongoDBObject("name" -> "France"))))
res20: com.mongodb.casbah.Imports.WriteResult = { "serverUsed" : "/" , "n" : 0 , "connectionId" : 234 , "err" :  null  , "ok" : 1.0}

scala> collection.foreach(println(_))
{ "_id" : { "$oid" : "523f145e30041dae32fd04da"} , "message" : "Hello World" , "countries" : [ "England" , "France" , "Spain"]}
{ "_id" : { "$oid" : "523f14b530041dae32fd04db"} , "message" : "A list of objects?" , "objects" : [ { "name" : "England"} , { "name" : "France"}]}

Now let's read them back out of Mongo and process the array items individually. First we get hold of an object that contains an array:

scala> val anObjectThatContainsAnArrayOfObjects = collection.findOne().get
anObjectThatContainsAnArrayOfObjects: collection.T = { "_id" : { "$oid" : "523f145e30041dae32fd04da"} , "message" : "Hello World" , "countries" : [ "England" , "France" , "Spain"]}

The extra get on the end is because we used the findOne method this time, which returns an Option. Then we can get just the array field:

scala> val mongoListOfObjects = anObjectThatContainsAnArrayOfObjects.getAs[MongoDBList]("countries").get
mongoListOfObjects: com.mongodb.casbah.Imports.MongoDBList = [ "England" , "France" , "Spain"]

Now we have a handle on a MongoDBList which represents our array in Mongo. The MongoDBList is Iterable so we can loop through and print it out:

scala> mongoListOfObjects.foreach( country => println(country) )

Or map it to a sequence of Strings:

scala> val listOfCountries = mongoListOfObjects.map(_.toString)
listOfCountries: scala.collection.mutable.Seq[String] = ArrayBuffer(England, France, Spain)

scala> listOfCountries
res24: scala.collection.mutable.Seq[String] = ArrayBuffer(England, France, Spain)

Friday, September 20, 2013

Scala and MongoDB: Getting started with Casbah

The officially supported Scala driver for Mongo is Casbah. Casbah is a thin wrapper around the Java MongoDB driver that gives it a Scala-like feel. As long as you ignore all the MongoDBObjects it feels much more like being in the Mongo shell or working in Python than working with Java/Mongo.

All the examples are copied from a Scala REPL launched from an SBT project with Casbah added as a dependency.

So let's get started by importing the Casbah package:

scala> import com.mongodb.casbah.Imports._ 
import com.mongodb.casbah.Imports._

Now let's create a connection to a locally running Mongo and use the "test" database:

scala> val mongoClient = MongoClient()
mongoClient: com.mongodb.casbah.MongoClient = com.mongodb.casbah.MongoClient@2acf0276
scala> val database = mongoClient("test")
database: com.mongodb.casbah.MongoDB = test

And now let's get a reference to the messages collection:

scala> val collection = database("messages")
collection: com.mongodb.casbah.MongoCollection = messages

As you can see, Casbah makes heavy use of the apply method to give relatively clean, boilerplate-free connection code. To print all the documents in a collection you can use the find method, which returns an iterator (there are no documents at the moment):

scala> collection.find().foreach(row => println(row) )

Now let's insert some data using the insert method and then find and print it:

scala> collection.insert(MongoDBObject("message" -> "Hello world"))
res2: com.mongodb.casbah.Imports.WriteResult = { "serverUsed" : "/" , "n" : 0 , "connectionId" : 225 , "err" :  null  , "ok" : 1.0}

scala> collection.find().foreach(row => println(row) )
{ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello world"}

And adding another document:

scala> collection.insert(MongoDBObject("message" -> "Hello London"))
res4: com.mongodb.casbah.Imports.WriteResult = { "serverUsed" : "/" , "n" : 0 , "connectionId" : 225 , "err" :  null  , "ok" : 1.0}

scala> collection.find().foreach(row => println(row) )
{ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello world"}
{ "_id" : { "$oid" : "523aa6bf30048ee48f49c334"} , "message" : "Hello London"}

The familiar findOne method is there too. Rather than the Iterable returned from find, findOne returns an Option, so you can use a basic pattern match to handle the document being there or not:

scala> val singleResult = collection.findOne()
singleResult: Option[collection.T] = Some({ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello world"})

scala> singleResult match {
     |   case None => println("No messages found")
     |   case Some(message) => println(message)
     | }
{ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello world"}

Now let's query using the ID of an object we've inserted (querying by any other field is the same). First grab the document, then build a query from its _id:

scala> val helloWorld = collection.findOne().get
helloWorld: collection.T = { "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello world"}

scala> val query = MongoDBObject("_id" -> helloWorld.get("_id"))
query: com.mongodb.casbah.commons.Imports.DBObject = { "_id" : { "$oid" : "523aa69a30048ee48f49c333"}}

scala> collection.findOne(query)
res12: Option[collection.T] = Some({ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello world"})

We can also update the document in the database and then get it again to prove it has changed:

scala> collection.update(query, MongoDBObject("message" -> "Hello Planet"))
res13: com.mongodb.WriteResult = { "serverUsed" : "/" , "updatedExisting" : true , "n" : 1 , "connectionId" : 225 , "err" :  null  , "ok" : 1.0}

scala> collection.findOne(query)
res14: Option[collection.T] = Some({ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello Planet"})

The remove method works in the same way, just pass in a MongoDBObject for the selection criterion.

Doesn't look Scala-like enough for you? You can also insert using the += method:

scala> collection += MongoDBObject("message"->"Hello England")
res15: com.mongodb.WriteResult = { "serverUsed" : "/" , "n" : 0 , "connectionId" : 225 , "err" :  null  , "ok" : 1.0}

scala> collection.find().foreach(row => println(row))
{ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello Planet"}
{ "_id" : { "$oid" : "523aa6bf30048ee48f49c334"} , "message" : "Hello London"}
{ "_id" : { "$oid" : "523c911230048ee48f49c335"} , "message" : "Hello England"}

How do you build more complex documents in Scala? Simply use the MongoDBObject ++ method. For example we can create an object with multiple fields, insert it, then view it by printing all the documents in the collection:

scala> val moreThanOneField = MongoDBObject("message" -> "I'm coming") ++ ("time" -> "today") ++ ("Name" -> "Chris")
moreThanOneField: com.mongodb.casbah.commons.Imports.DBObject = { "message" : "I'm coming" , "time" : "today" , "Name" : "Chris"}

scala> collection.insert(moreThanOneField)
res6: com.mongodb.casbah.Imports.WriteResult = { "serverUsed" : "/" , "n" : 0 , "connectionId" : 234 , "err" :  null  , "ok" : 1.0}

scala> collection.find().foreach(println(_) )
{ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello Planet"}
{ "_id" : { "$oid" : "523aa6bf30048ee48f49c334"} , "message" : "Hello London"}
{ "_id" : { "$oid" : "523c911230048ee48f49c335"} , "message" : "Hello England"}
{ "_id" : { "$oid" : "523c96b530041dae32fd04d6"} , "message" : "I'm coming" , "time" : "today" , "Name" : "Chris"}

Saturday, September 14, 2013

Git + Tig: Viewing Git history from command line

Git is great. The output from git log is not!

One of my favourite tools to work with git is tig. Among other things, tig is a great repository/history viewer for git. If you're on Mac OS X and have Homebrew set up then all you need to do to install tig is:
brew install tig

Then go to a git repo on your computer and run tig and you get great output like:

[screenshot: tig's main tree view]

You can see individual commits and where each remote branch is up to. In the above screenshot I am on the master branch and I can see that my remote branch heroku is four commits behind, as is origin.

You can also see the diff for any commit:

[screenshot: tig's diff view]

Here I've selected an individual commit and tig shows me all the changes.

Tig also has a very nice status view. So if you like the above tree view try tig status.

Scala: What logging library to use?

Having investigated a few options I have decided on SLF4J + Logback + Grizzled. The project I am currently on uses Scalatra - this matches their solution.

Three libraries?? That sounds like overkill but it is in fact very simple.

SLF4J + Logback are commonplace in Java projects. SLF4J is a logging facade - basically a set of interfaces to program against - whereas Logback is an implementation you put on your classpath at runtime. There are other implementations of the SLF4J API you can use, such as Log4j. Grizzled is a Scala wrapper around SLF4J to give it Scala-like usage.

Having worked on many Java projects that use SLF4J + Logback I'm used to seeing lines at the top of files that look like this:
private static final Logger LOGGER = LoggerFactory.getLogger(SomeClass.class)
Fortunately Grizzled-SLF4J + traits help here.

Mixing in the Grizzled Logging trait allows you to write logging code along these lines (assuming the grizzled.slf4j.Logging trait):

package com.batey.examples

import grizzled.slf4j.Logging

class SomeClass extends Logging {
  def doSomething() {
    info("Some information")
    error("Something terrible")
  }
}

This will produce logs like this:
09:57:11.169 [main] INFO com.batey.examples.SomeClass - Some information
09:57:11.172 [main] ERROR com.batey.examples.SomeClass - Something terrible
The Grizzled trait uses your class name as the logger name. Job done, with less boilerplate than Java!

Everything you need is on Maven Central so it can be added to your pom or SBT dependencies:
Grizzled: Grizzled
Logback: Logback

Sunday, September 1, 2013

Scala: SBT OutOfMemoryError: PermGen space

When starting out with Scala/SBT I very quickly ran into perm gen issues followed by SBT crashing:

java.lang.OutOfMemoryError: PermGen space

Especially when running the console from within interactive mode.

To fix this (or rather brush it under the carpet), add the following to your profile (.bashrc / .bash_profile) and source it again (run . ~/.bashrc):

export SBT_OPTS=-XX:MaxPermSize=256m