Negentropy: noSQL (Document-based) or Distributed Cache

As numerous discussions at Shutterfly and on various online posts and this blog by Greg Luck indicates, there is definitely some interest in the market in terms of understandings where Distributed Cache and NoSQL Movement use-cases intersect and where they don't.

Of course, both aim to offload the RDBMS strong-hold in the enterprise by either

Deflection (i.e. minimizing direct access to the RDBMS) - i.e. a proxy to the RDBMS or
Avoidance (i.e. allowing for some types of data to not be stored in the RDBMS to begin with).

PURE-L2 Architecture:

Usage of L1-L2 analogies in distributed-cache literature stems from multi-level cache hierarchies described in CPU architectures - basically the L1 cache is local to your client application process and the L2 is an off-host/off-process manager of cache-data.

See this animated GIF -

As can be seen, whether the off-host process that manages the cache-data is MongoD or MemcacheD or Terracotta-Server, architecturally they all look equivalent - i.e. a pure-L2 with no-L1 - so that all data needs to be retrieved from over the network and then massaged into a POJO for consumption by the application. (FYI - I used these specific technology implementations as proxies for NoSQL and Distributed Cache since I have good familiarity with Mongo (Document Based noSQL), Terracotta-Ehcache and Memcache - given that we use these 3 technologies in our stack at Shutterfly).

Given the architectural equivalence here and our runtime-observations, one could argue that in a pure-L2 architecture, using No-SQL technologies seems like the way to go (especially if there is a requirement to view the data outside of the confines of the application's Domain Model - e.g MongoDB Shell). Or at least, drop them into the same pool of candidate technologies and map features as they relate to your use-case.

For us definitely, with Mongo-DB as L2, in certain use cases
- we observed good response times (see attached image)

- we experienced good query expressability
- we could validate data manipulations expected off the application easily from the MongoDB Shell.

Although there was definitely the domain model/ BSON impedance mismatch to be overcome with Morphia annotations and APIs. For a comparison along some common dimensions, see the table below:

Dimension	MongoD	MemcacheD	Terracotta Ehcache
Retrieval	BSON from MongoD that can be mapped via Morphia into a POJO	Serialized version from MemcacheD - that needs to be de-serialized via Java-Serializaton as a POJO	Object DNA from Terracotta Server - available at the application transparently as an Ehcache-Element.
Latency (as seen by app)	1ms - more	1ms - more	1ms - more
Horizontal Scaling	Mongo Replica Sets (available in open-source)	Multiple MemcacheD (available in open-source)	Terracotta Server Cluster Array (not Available in open-source, in enterprise-kit
High Availability	Mongo Replica Sets (available in open-source)	Not a good story - no HA available for MemcacheD (change to MemcacheDB but that has its own costs)	Terracotta Server Cluster Array (not Available in open-source, in enterprise-kit
Query Expressability	Very good - see Mongo Querying and Advanced Querying	Key-Value lookup only	Key-Value although Ehcache 2.5 and onward now offer some search capability via searchability API

Of course, there are use cases where cache access in the order of milli-seconds is a non-starter - you want cache-access in the order of mirco-seconds and hence if Network-hops and Impedance-mismatches can be avoided, you want to avoid them. A cache is about local data - and assuming your application has good locality characteristics, you want to exploit L1 caching and in such a case a L1-L2 architecture offered by some of the distributed cache implementations (Terracotta-Ehcache, Oracle-Coherence) or a bloated L1 architecture (Terracotta-BigMemory) could well be the way to go.

If you have experiences or comments on MongoDB (or other noSQL) and distributed cache usage comparisons, feel free to share them or comment.
Cheers,

Negentropy

Saturday, March 19, 2011

noSQL (Document-based) or Distributed Cache - a Practitioner's tale.

About Me

Blog Archive