Thursday, January 28, 2010

Terracotta - Comparisons with RDBMS/IMDB/ORDBMS/ORM/OODBMS/NoSQLDB


OMG, with all these acronyms, Borat would have observed that the software industry is run by teenage girls, unlike the great nation of Kazakhstan!


(The attached animated GIF summarizes the text below at a high level. Click on the image to see the animation.)


Anyway, relational databases have been around for 30+ years (their predecessors, the network and hierarchical databases, are all but extinct now). They matured over the years and fit the bill in the early 90s, until Internet scale and data-intensive applications broke their backs. The basic issues have always been the same:


A> IMPEDANCE-MISMATCH: Domain Model to Relational Model Conversion. The impedance mismatch between your object-oriented application domain model and the relational schema that must support it results in a lot of tedious coordination between middleware and database developers, and costs both code and time (time to arrive at a fixed, inflexible schema design; stored procedures and interfaces for I/O against the schema; tuning, data archival, backup, etc.). This manifests as:

o Complexity

o Longer Time-To-Market
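To make the mismatch concrete, here is a minimal sketch of the hand-written translation between an object graph and flat, table-shaped rows. The Order/LineItem model, the OrderMapper class and the ORDERS/LINE_ITEMS table names are all hypothetical, invented for illustration; a real application would go through JDBC or an ORM rather than Object[] rows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical domain model: an Order owns its LineItems directly.
class LineItem {
    final String sku;
    final int quantity;
    LineItem(String sku, int quantity) { this.sku = sku; this.quantity = quantity; }
}

class Order {
    final long id;
    final String customer;
    final List<LineItem> items = new ArrayList<>();
    Order(long id, String customer) { this.id = id; this.customer = customer; }
}

// Hand-written mapping between the object graph and flat rows.
class OrderMapper {
    // One row for the hypothetical ORDERS table: (id, customer).
    static Object[] toOrderRow(Order o) {
        return new Object[] { o.id, o.customer };
    }

    // One row per item for the hypothetical LINE_ITEMS table:
    // (order_id, sku, quantity). Containment becomes a foreign key.
    static List<Object[]> toLineItemRows(Order o) {
        List<Object[]> rows = new ArrayList<>();
        for (LineItem li : o.items) {
            rows.add(new Object[] { o.id, li.sku, li.quantity });
        }
        return rows;
    }

    // Reading back means re-assembling the graph from both row sets.
    static Order fromRows(Object[] orderRow, List<Object[]> itemRows) {
        Order o = new Order((Long) orderRow[0], (String) orderRow[1]);
        for (Object[] r : itemRows) {
            if (r[0].equals(orderRow[0])) {
                o.items.add(new LineItem((String) r[1], (Integer) r[2]));
            }
        }
        return o;
    }
}
```

Every new field or relationship in the domain model means touching the schema and both directions of this mapping, which is where the complexity and the time-to-market cost come from.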


B> RDBMS IMPLEMENTATION: Data managed by the RDBMS lives in memory and/or on disk. Arriving at a QEP (Query Execution Plan) and retrieving or updating data in memory and/or on disk with transactional ACID guarantees involves complex machinery (transaction logs, redo logs, locking, enforcing isolation levels), which costs processing resources and adds latency. Typically the RDBMS, being an "enterprise" resource, also gets abused, since any application that needs persistence tends to use it, leading to bottlenecks on the database or the database machine. One's options are then to scale up the database infrastructure and pay more for database licenses (typically licensed per CPU) and/or to invest in clustered database technology (which is a little immature to boot); both are very expensive options. Or to shard in the application tier. All this manifests as:

o Poor Latency for Query Execution.

o Poor Scalability due to a Database-Tier Bottleneck.

o Additional layer of complexity.

o Excessive Costs.
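For a feel of what sharding in the application tier involves, here is a minimal sketch of hash-based routing. The ShardRouter class and the shard URLs are hypothetical; real deployments usually add consistent hashing, re-balancing and cross-shard query handling on top of this.

```java
import java.util.ArrayList;
import java.util.List;

// Routes each key to one of N database shards by hashing the key.
class ShardRouter {
    private final List<String> shardUrls;

    ShardRouter(List<String> shardUrls) {
        if (shardUrls.isEmpty()) {
            throw new IllegalArgumentException("need at least one shard");
        }
        this.shardUrls = new ArrayList<>(shardUrls);
    }

    // Math.floorMod keeps the index non-negative even for negative hash codes.
    int shardIndexFor(String key) {
        return Math.floorMod(key.hashCode(), shardUrls.size());
    }

    String shardUrlFor(String key) {
        return shardUrls.get(shardIndexFor(key));
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        urls.add("jdbc:hypothetical://db0/app"); // placeholder URLs, not real drivers
        urls.add("jdbc:hypothetical://db1/app");
        ShardRouter router = new ShardRouter(urls);
        System.out.println("customer-17 lives on " + router.shardUrlFor("customer-17"));
    }
}
```

Note that with plain modulo routing, adding a shard remaps most keys, and any query spanning customers now has to fan out across shards. That is the "additional layer of complexity" in the list above.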


C> DATA-REMOTENESS: Data is always one or more hops away from the application, where the business logic is often codified and executed, and where the presentation layer definitely sits. This manifests itself as:

o Increased Latency to the Application.


The business is rife with sharp engineers, greedy businessmen and dedicated workaholics, and the status quo does not remain so for more than an attosecond (10^-18 s)! So what has everyone been up to, to solve these problems?


A> To resolve Impedance-Mismatch, the industry’s efforts have revolved around the following approaches:

o OODBMS – Object Oriented DBMS: (Examples: Db4o, Objectivity/DB, Intersystems-Caché, Versant, Progress)

o ORDBMS – Object Relational DBMS / Universal Servers: (Examples: PostgreSQL, Cubrid, OpenLink Virtuoso)

o ORM – Object Relational Mapping Software: (Examples: Hibernate, Oracle-TopLink, IBatis, EclipseLink, Django).


B> To resolve RDBMS Implementation Issues:

o IMDB - In-Memory Database, to address the latency caused by data being on disk and requiring complex QEPs: (Examples: Oracle Times-Ten, IBM Solid-DB, Sybase-ASE 15.5, MySQL Cluster, eXtremeDB)

o Brand New No-SQL Databases: (Examples: Apache HBase, SimpleDB, Project Voldemort, Facebook Cassandra, Amazon Dynamo (closed), Google BigTable (closed)).


C> To resolve Data-Remoteness, the industry's efforts have, as expected, revolved around solving the problem in the middleware and/or client tier with:

o Caching and Distributed Caching Technologies: (Examples: Terracotta Ehcache, Oracle Coherence)

o IMDG (In Memory Data-Grid): (Examples: Terracotta Platform (with TIM-Messaging), Hadoop, Oracle Coherence)

o Distributed Java Heap: (Examples: Terracotta Platform and special editions of the Terracotta Platform such as Terracotta Sessions, Terracotta Quartz; and a couple of other wanna-bes).
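The core idea behind the caching approaches above can be sketched in a few lines: keep data in the application's own memory and go to the remote store only on a miss. This is a single-JVM illustration using only the JDK; ReadThroughCache is a hypothetical class, not the Ehcache or Coherence API, and it has no eviction, expiry or distribution.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Serves reads from local memory and calls the loader only on a miss.
class ReadThroughCache<K, V> {
    private final Map<K, V> local = new ConcurrentHashMap<>();
    private final Function<K, V> loader;          // stands in for the remote database call
    private final AtomicInteger loads = new AtomicInteger();

    ReadThroughCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        // computeIfAbsent runs the loader at most once per missing key.
        return local.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.apply(k);
        });
    }

    // Drop a stale entry so the next read goes back to the source of truth.
    void invalidate(K key) {
        local.remove(key);
    }

    // Number of times the (expensive) loader was actually invoked.
    int loadCount() {
        return loads.get();
    }
}
```

Usage would look like `new ReadThroughCache<String, String>(id -> loadFromDatabase(id))`, where `loadFromDatabase` is whatever remote lookup you already have. Repeated reads of the same key then cost a map lookup instead of a network hop.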


Solving the problem at the middle tier is what folks have discovered works better and more cost-effectively, in that one can reduce complexity and improve latency and scalability all at once (see Table-A for details), unlike the other approaches, which solve one or more of these problems but not the entire set.


Table-A:

Problem | RDBMS | IMDB | No-SQLDB | OODBMS | ORDBMS | OR-Mapper | Terracotta

Solution to impedance mismatch between Relational and Object Oriented Domain Model

X

X

X

Reachability Persistence.

Mapping Configuration files

Reachability Persistence.

Latency Improvement

X

(Since data is in memory, Query Execution Times are improved, although Data-Remoteness problem remains).

(Fast, since several do not attempt to provide ACID guarantees, do not require fixed table schemas, and avoid join operations).

X

(For some navigational and associative lookups, object graphs can be pulled up as is)

X

X

YES - (if complemented with a Distributed 2nd Level Cache, e.g. Terracotta Ehcache with the H2LC plugin for Hibernate)

(Close to in-memory speeds possible)

Improve Scale and Solve Scalability Issues

X

Yes (if ready to spend $$$$$ on scaling up + DB-cluster software like Oracle RAC)

X

Yes (to a certain extent, but limited by the amount of memory, and it needs either non-volatile RAM or a compromise with save-points).

Yes (assuming eventual consistency good enough)

X

X

X

YES - (with a Distributed 2nd Level Cache, e.g. Terracotta Ehcache with the H2LC plugin for Hibernate)

(since I/O is against HA Memory)

Easy Integration to RDBMS (entrenched source of truth)

n/a

Typically yes (e.g. Times Ten with Oracle)

X

X

n/a

n/a

(Write-Behind, Write-Through (with/without JTA))

Better Support for Fast Changing Domain Models (Agile)

X

X

X

X

X

Heterogeneous Application/ Stack Support.

X

(Typically you store strongly-typed objects in the database, which makes sharing across apps and heterogeneous stacks difficult)

(Data is ultimately in relational form)

(Data is always in relational form – and ORM merely bridges the impedance)

(Java Solution – multiple apps can access the cache. Terracotta Ehcache supports EhcacheRESTful so heterogeneous Stacks can access the same cache service)

Implementation Maturity

?

?

?

X

(Vetted by 1000s of OSS and Commercial Deployments over 6 years)

Common Standard

X

X

X

X

X

X/

(Terracotta Ehcache is JSR 107 Compliant).



So, as the argument above indicates, if you are considering any of the following:

o more expensive RDBMS upgrade,

o ORDBMS,

o IMDB,

o OODBMS,

o No-SQL DB,

look to solve it with middleware instead, whilst still leveraging your existing investments. Of the middle/client-tier solutions, Terracotta is the most compelling: along with agreeable runtime characteristics and visibility, you get the widest range of choices in application expressibility and deployment to fit your needs and budget, and to trade off correctness vs. scale. See http://www.terracotta.org/products

If the data does eventually reside in an RDBMS, look at

o Terracotta Ehcache (with the Hibernate 2nd Level Cache Plugin) if using an ORM (Hibernate)

o Terracotta Ehcache (with the ability to write behind; see http://www.terracotta.org/beta/darwin), and join Greg Luck's upcoming webinar at http://gregluck.com/blog/archives/2010/01/sign-up-for-a-webcast-i-am-giving-3-february-boost-application-performance-and-monitoring-with-terracotta-ehcache/.
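For readers new to the term, write-behind means the application writes to memory and the database is updated later, off the request path. Below is a minimal single-JVM sketch of that idea; WriteBehindCache is a hypothetical class and not the Ehcache write-behind API, and it uses an explicit flush() where a real implementation would typically drain the queue from a background thread.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.BiConsumer;

// Accepts writes in memory and pushes them to the backing store later.
class WriteBehindCache<K, V> {
    private final Map<K, V> memory = new HashMap<>();
    private final Queue<K> dirty = new ArrayDeque<>();   // keys awaiting a store write
    private final BiConsumer<K, V> storeWriter;          // stands in for the RDBMS write

    WriteBehindCache(BiConsumer<K, V> storeWriter) {
        this.storeWriter = storeWriter;
    }

    // Fast path: update memory and remember that the key needs persisting.
    synchronized void put(K key, V value) {
        memory.put(key, value);
        if (!dirty.contains(key)) {
            dirty.add(key);   // repeated writes to one key coalesce into one store write
        }
    }

    synchronized V get(K key) {
        return memory.get(key);
    }

    // Slow path: write the latest value of every dirty key to the store.
    // Returns the number of store writes performed.
    synchronized int flush() {
        int written = 0;
        K key;
        while ((key = dirty.poll()) != null) {
            storeWriter.accept(key, memory.get(key));
            written++;
        }
        return written;
    }
}
```

The trade-off is the one the post keeps coming back to: writes get in-memory latency, but anything still queued is lost if the memory is not highly available, which is why write-behind is usually paired with a replicated or persistent cache tier.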


There is plenty of quantitative evidence to support the above, in the form of white papers, case studies and an impressive client roster, at http://www.terracotta.org.

About Me

I have spent the last 11+ years of my career either working for a vendor that provides infrastructure software or managing platform teams that deal with software-infrastructure concerns at large online presences (e.g. at Walmart.com and currently at Shutterfly). I am interested in building and analyzing systems that speak to cross-cutting concerns across application development in the enterprise (scalability, availability, security, manageability, latency, etc.).