Wednesday, July 11, 2007

Clustered Java - Operational Needs


Businesses are in constant flux, mutating in response to environmental stimuli - so the Java application you deployed in the past has to change as new functionality, capacity, availability and monitorability requirements get tagged on.

Typically, it is the operations team (sysadmins, network engineers, DBAs etc.) who get measured on uptime, scale and the overall smooth functioning of the data-center. But of course, they rely very heavily on the application / application-infrastructure developer to help them get there. What is the expected application behavior:
1. When it goes from being deployed on 1 node to multiple nodes.
2. When 1 of the JVMs crashes.
3. When state gets out of synch across these nodes.
4. Under load etc.

Any streamlining here will simplify the Development <-> Operations handoff - i.e. fewer calls to you, the developer, and lower maintenance costs to the organization as a whole.
So, then:
  • Where does the application developer’s job end and the software-infrastructure engineer’s job begin?
  • Where does the software-infrastructure engineer’s job end and the operational-infrastructure engineer’s job begin?

Now whether you are in an IT culture where

  • Roles are clearly demarcated across silos – as in the attached figure OR
  • You are in a culture/organization, where the above described roles don’t have clean boundaries…

there is no denying that while there is an argument for “silo-ization”, concerns around application functionality, latency, scalability, availability and manageability cut across these boundaries. The characteristics of a successful, highly efficient IT organization arguably are:

  • Core-competencies within each silo.
  • Clearly defined artifacts and interfaces across each boundary, resulting in minimal friction/iterative loops across each boundary AND
  • A good understanding of the up-stream/down-stream consequences of decisions within each silo.

As an example, let’s focus on the big, hard-to-tame beast of Availability and Scalability for Java applications. Of course, as an application developer you are bound by the requirements that flow downstream - but assuming reasonable business requirements/project-management ;-), you impact the final infrastructure/management footprint of your application through decisions you make around:

Task / Decision -> Downstream Impact

  • Modelling of real-world entities as a database schema and as an OO hierarchy in the Java tier, and implementation of these entities in the database and at the Java tier (data structures employed, package/class organization, "garbage" created) -> Latency, Scale
  • Marshalling/unmarshalling across the OO-DB representations (ORM) -> Latency, Scale
  • Computational algorithms, control flow -> Latency, Scale
  • Dependency management -> Maintainability, Manageability
  • Management of state across JVMs, and between JVMs and the DB/external systems -> Availability, Scale

Wrong choices anywhere (e.g. an inappropriate data structure, needless SQL within a tight loop, massive numbers of new objects instantiated per request etc.) can adversely impact “down-stream” concerns such as latency/scale/manageability. In today’s environment, the first 4 items on the list above are clearly better contained within the application developer’s world, and classic application troubleshooting can get rid of reported problems.
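As a concrete instance of the “inappropriate data structure” case, here is a minimal sketch (class and data are hypothetical): a membership check against a List inside a loop is O(n*m), while building a HashSet first makes the same pass roughly O(n) - a latency/scale consequence of a purely developer-side decision.

```java
import java.util.*;

public class MembershipCheck {
    // O(n*m): List.contains scans the whole list for every lookup.
    static int countHitsSlow(List<String> requests, List<String> whitelist) {
        int hits = 0;
        for (String r : requests) {
            if (whitelist.contains(r)) hits++;   // linear scan each time
        }
        return hits;
    }

    // Roughly O(n): a HashSet turns each lookup into amortized constant time.
    static int countHitsFast(List<String> requests, List<String> whitelist) {
        Set<String> lookup = new HashSet<>(whitelist);
        int hits = 0;
        for (String r : requests) {
            if (lookup.contains(r)) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> requests = Arrays.asList("a", "b", "c", "a");
        List<String> whitelist = Arrays.asList("a", "c");
        // Same answer either way; only the cost per request differs.
        System.out.println(countHitsFast(requests, whitelist)); // 3
    }
}
```

Under load, the difference between the two is invisible in a functional test and very visible in production latency - which is exactly why this class of decision belongs to the developer but gets paid for by operations.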
Arguably, however, the last item on the list above - state management across JVMs and external systems - is the fuzziest, and routinely crosses the Development / Operations chasm. The consequence of addressing this “downstream concern” much later is that the current state-of-the-art is intrusive/labor-intensive, expensive, inconsistent and features frequent round-trips across the Dev/Ops interface - leading to inefficiencies and high support-overhead costs. i.e. one is better off taking the long-term view during the SDLC than while maintaining the application (i.e. during the SMLC - Software Maintenance LifeCycle).

Focusing further on an application that ran on a single JVM and now runs on multiple JVMs: the interfaces between Ops and Dev (and at times between application-infrastructure development and application-software development) are weakly defined - resulting in frequent traversals across the chasm and inefficiencies. So, what are some of these operational concerns/needs when running a Java application on a cluster of JVMs?

  1. Load Balancing strategy:
    1. How to ensure that the load gets appropriately partitioned across the JVMs in the cluster?
    2. Does the application need Locality of Reference - i.e. do requests need to be stickily routed to the same JVM?
    3. How to re-route and what is user-experience when a JVM goes down?
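A minimal sketch of sticky routing (node names and session IDs are hypothetical): hashing the session ID to a node index keeps a session pinned to one JVM while cluster membership is stable.

```java
import java.util.*;

public class StickyRouter {
    private final List<String> nodes;

    StickyRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    // The same session ID always maps to the same JVM - Locality of
    // Reference - as long as the node list does not change.
    String route(String sessionId) {
        int idx = Math.floorMod(sessionId.hashCode(), nodes.size());
        return nodes.get(idx);
    }

    public static void main(String[] args) {
        StickyRouter r = new StickyRouter(Arrays.asList("jvm-1", "jvm-2", "jvm-3"));
        // Stickiness: repeated requests for one session land on one node.
        System.out.println(r.route("sess-42").equals(r.route("sess-42"))); // true
    }
}
```

Note that with plain modulo hashing, losing a node remaps most sessions to different JVMs - which is exactly the re-routing/user-experience question above; consistent hashing is the usual refinement.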

  2. Application Correctness: not synching state (e.g. caches), or any optimistic way of sending updates over, implies
    1. Reacting to potentially incoherent application state (caches) across the cluster.
    2. Non-guaranteed delivery of messaging across JVMs can easily lead to a split brain across the participants in the cluster (e.g. if the cache reflects the price of an item in an electronic catalog and is inconsistent on 2 different JVMs, which JVM then has the right cache value - how does operations recover, do they need to restart the cluster?).
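The catalog-price scenario can be sketched in a few lines (the item and prices are hypothetical): two JVM-local caches start in agreement, one update message is lost, and the cluster is left with no authoritative answer.

```java
import java.util.*;

public class SplitBrainDemo {
    // True only if both JVM-local caches agree on the key's value.
    static boolean coherent(Map<String, Double> a, Map<String, Double> b, String key) {
        return Objects.equals(a.get(key), b.get(key));
    }

    public static void main(String[] args) {
        // Each JVM keeps its own local price cache.
        Map<String, Double> jvm1Cache = new HashMap<>();
        Map<String, Double> jvm2Cache = new HashMap<>();
        jvm1Cache.put("item-7", 9.99);
        jvm2Cache.put("item-7", 9.99);

        // A price update reaches JVM-1, but the message to JVM-2 is lost
        // (non-guaranteed delivery).
        jvm1Cache.put("item-7", 12.99);

        // The cluster is now split-brained on item-7: which price is "right"?
        System.out.println("coherent? " + coherent(jvm1Cache, jvm2Cache, "item-7")); // false
    }
}
```

Nothing in either JVM looks broken in isolation - which is why this class of bug surfaces as an operations incident rather than a developer stack trace.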

  3. Capacity Planning:
    1. Operations needs a deterministic way to scale (e.g. add-a-node on demand); today there isn't one.
    2. Few low-clustering-overhead solutions are available in the market (or home-grown), since the majority use expensive Java serialization. i.e. The requirement of more scale per operational dollar in a clustered environment is still largely unmet by today's solutions.
    3. Dynamic Infrastructure Provisioning based on usage characteristics.
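To see why whole-graph Java serialization is the expensive default referred to above, here is a minimal sketch (the Session class is hypothetical): a one-field change ships the entire serialized object graph across the wire.

```java
import java.io.*;
import java.util.*;

public class SerializationCost {
    // A hypothetical session object: one "hot" field plus bulky stable state.
    static class Session implements Serializable {
        int hitCount;                              // the only field that changes
        Map<String, String> profile = new HashMap<>();
    }

    // Serialize the object and report how many bytes would cross the wire.
    static int serializedSize(Serializable o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        Session s = new Session();
        for (int i = 0; i < 1000; i++) s.profile.put("key" + i, "value" + i);

        // Classic session replication re-serializes the whole graph per
        // update, even though only hitCount (a few bytes of payload) changed.
        s.hitCount++;
        System.out.println("bytes shipped per update: " + serializedSize(s));
    }
}
```

Shipping kilobytes to communicate a 4-byte change is the clustering overhead that eats the "scale per operational dollar"; fine-grained replication of only the changed fields is the alternative.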

  4. Breach of Availability SLA (unplanned downtime - i.e. the fewer pages that go off, the better):
    1. Due to an Application Server Failure (Industry Java-app availability ~ 98%)
    2. Cascading Failure (1 JVM goes down and the others go down as well).
    3. Unpalatable Recoverability (long restart, application/cache warming, infrastructure abuse)
    4. Cluster work-load balance/re-balance in response to outage/recovery.
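To put the ~98% figure in perspective, a back-of-the-envelope sketch - note that the independence assumption is the whole point: a cascading failure violates it and collapses the cluster back to (at best) single-node availability.

```java
public class AvailabilityMath {
    // Probability that at least one of n independent nodes is up,
    // given per-node availability a: 1 - (1 - a)^n.
    static double clusterUptime(double a, int n) {
        return 1 - Math.pow(1 - a, n);
    }

    public static void main(String[] args) {
        double nodeUptime = 0.98;   // the industry figure cited above

        // 98% uptime is ~175 hours of downtime per year for one node.
        System.out.printf("downtime/yr, 1 node (hrs): %.1f%n",
                (1 - nodeUptime) * 365 * 24);

        // With 4 independent nodes and any one able to serve, availability
        // climbs to "seven nines" territory...
        System.out.printf("4 nodes, independent failures: %.8f%n",
                clusterUptime(nodeUptime, 4));

        // ...but a cascading failure makes the nodes fail together, so the
        // exponent never applies and you are back near 0.98.
    }
}
```

This is why cascading failure and unpalatable recoverability matter as much as raw node count: the SLA math only holds if failures stay independent and recovery is fast.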

  5. Breach of Availability SLA (planned downtime) - i.e. can I continue to maintain business continuity:
    1. Application Release Processes: rolling upgrades/incompatible patches (e.g. apps that depend on Java serialization).
    2. Content Management (database/schema-related downtime).

  6. Manageability / Monitorability:
    1. Few/No Intrusions into Release/Patch process
    2. No comprehensive cluster visibility - Cluster management today is per node.
    3. Proliferation of Tools/Mgmt Interfaces and complexity for the NOC.
    4. Running Stateless with Stateful applications (arbitrary servers out-of/in-rotation).

  7. Dev <-> Ops Hand-Off:
    1. High Overhead/Friction in Dev<->Ops Hand-Off
    2. Operational Training costs
    3. NOC and Production-Support Complexity/Errors

  8. Others: peculiar to a given IT organization’s culture.
Terracotta (the company behind the open-source clustering technology DSO - Distributed Shared Objects) attempts (successfully, imho) to address all of the above operational concerns (with the exception of Load Balancing and Dynamic Infrastructure Provisioning) via a non-intrusive (simplicity for Dev, by clustering at the JVM), cost-effective (open-source) and consistent mechanism to cluster state across JVMs:
  • The application architect identifies what is stateful and codifies the clustering behavior in an XML file. Application code is cleanly separated from the clustering concern.
  • Operations gets a standard terminology and a standard mechanism (i.e. a clean interface from Dev) to cluster Java applications, efficiently (given the implementation) and cost-effectively (highly scalable and open-source).
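For flavor, a DSO configuration fragment along these lines - the class and field names are hypothetical, and the exact schema may differ across Terracotta versions: a root declares the object graph to be shared across JVMs, and locks describe the cluster-wide concurrency semantics, all without touching application code.

```xml
<tc:tc-config xmlns:tc="http://www.terracotta.org/config">
  <application>
    <dso>
      <!-- Everything reachable from this field is shared across the cluster -->
      <roots>
        <root>
          <field-name>com.example.CatalogCache.prices</field-name>
        </root>
      </roots>
      <!-- synchronized sections in these methods become cluster-wide locks -->
      <locks>
        <autolock>
          <method-expression>* com.example.CatalogCache.*(..)</method-expression>
        </autolock>
      </locks>
    </dso>
  </application>
</tc:tc-config>
```

The point operationally is that this file, not the application source, is the Dev -> Ops artifact describing what is clustered and how.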

You can read more about it on the numerous blogs and the website at http://www.terracotta.org.

About Me

I have spent the last 11+ years of my career either working for vendors that provide infrastructure software or managing Platform Teams that deal with software-infrastructure concerns at large online presences (e.g. at Walmart.com and currently at Shutterfly). I am interested in building and analyzing systems that speak to cross-cutting concerns across application development in the enterprise (scalability, availability, security, manageability, latency etc.)