Friday, December 29, 2006

Clustering GlassBox

Anyone who has maintained an application would agree that measuring and managing its runtime characteristics, and diagnosing performance and scalability issues, is part art, part science. A fair amount of data needs to be collected and then aggregated at the appropriate level to be useful and actionable, especially when you are under the gun to solve a production issue.


GlassBox is a cool open-source tool that promises exactly that for your Java applications: it gives you visibility into what the application is doing at runtime and adds a diagnostic layer on top that summarizes any potential issue succinctly (see figure). Added bonus: it does this via AOP, i.e. it is non-intrusive as far as your application is concerned.


However, all of the measurements that GlassBox provides are for a single JVM. Coming back to the earlier point about aggregating measurements: wouldn't it be cool if there were some way (with minimal code changes) to collect these stats across the entire cluster of JVMs? That way, when a problem occurs, you can correlate data across all JVMs and the cluster as a whole, and easily identify whether the problem is cluster-wide or isolated to one or a few JVMs.
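To make "cluster-wide or isolated" concrete, here is a minimal sketch of the kind of correlation meant here. It is not GlassBox or Terracotta code: the class `ClusterCorrelation`, its methods, the JVM ids, and the 1000 ms threshold are all hypothetical, and it assumes the per-JVM average response times have already been collected into one map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of GlassBox or Terracotta.
class ClusterCorrelation {

    // Returns the ids of the JVMs whose average response time exceeds the threshold.
    static List<String> slowJvms(Map<String, Double> avgResponseMsByJvm, double thresholdMs) {
        List<String> slow = new ArrayList<>();
        for (Map.Entry<String, Double> entry : avgResponseMsByJvm.entrySet()) {
            if (entry.getValue() > thresholdMs) {
                slow.add(entry.getKey());
            }
        }
        return slow;
    }

    // "healthy" if no JVM is slow, "cluster-wide" if every JVM is slow, otherwise "isolated".
    static String classify(Map<String, Double> avgResponseMsByJvm, double thresholdMs) {
        int slowCount = slowJvms(avgResponseMsByJvm, thresholdMs).size();
        if (slowCount == 0) {
            return "healthy";
        }
        return slowCount == avgResponseMsByJvm.size() ? "cluster-wide" : "isolated";
    }

    public static void main(String[] args) {
        // Made-up numbers: one of three JVMs is far slower than the others.
        Map<String, Double> stats = new LinkedHashMap<>();
        stats.put("jvm-1", 120.0);
        stats.put("jvm-2", 2400.0);
        stats.put("jvm-3", 135.0);
        System.out.println(classify(stats, 1000.0) + " " + slowJvms(stats, 1000.0));
    }
}
```

With per-JVM data only, each of these three numbers would sit on a different machine; the comparison becomes trivial once they are visible in one place.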

I sat with Ron Bodkin, CTO of Glassbox, to see if we could cluster GlassBox with Terracotta. Here is what we had to do:

  1. We installed GlassBox and then Terracotta for Spring (since GlassBox uses Spring internally).

  2. We modified the catalina.sh script to pass in 4 additional Java options (so Terracotta could get hooked in). The container was Tomcat 5.5.x.

  3. Ron then identified the Spring Bean that had all of the state (i.e. the per-JVM statistics).

  4. We ran into a few configuration issues that had to be sorted out. We upgraded Spring to 1.2.8 (Terracotta supports 1.2.8 and upwards). Even so, we hit some exceptions and then realized that there was another, older spring.jar on the classpath that needed to be removed.

  5. We filled out the Terracotta config file (tc-config.xml), which is where you state which beans need to be clustered. We added wildcard characters to the application name and the resources entry so Terracotta could find the Spring application context file, which holds the bean definitions. The tc-config.xml entries look like:

    <jee-application name="*glassbox*">
    <paths><path>*beans.xml</path></paths>

  6. The requirement was that cluster-wide stats be updated on a timed basis rather than in real time (i.e. not as soon as a given statistic changed on any JVM in the cluster). So Ron wrote a method that makes an intelligent deep clone of the object graph that needed to be clustered; this method fires every few seconds. We then clustered the bean that wraps this deep clone instead. The config entries were:

    <beans>
    <bean name = "distributedRegistryHolder"></bean>
    </beans>
    <locks><autolock>
    <method-expression> *glassbox.track.api.StatisticsRegistryHolderImpl.copy(..)</method-expression>
    <lock-level>write</lock-level>
    </autolock></locks>


  7. Terracotta for Spring can also cluster HttpSessions with a one-line config entry, so we turned that on by adding a
    <session-support>true</session-support>
    element to tc-config.xml. A GlassBox user can use a Web-based client to monitor an app. Not a whole lot of state lives in the HttpSession, apart from things such as GUI preferences, so those are now preserved if a given app server is lost. Not much, but one less minor inconvenience in the world.

And it took us all of 4 hours to complete all of this. A satisfying 4 hours, I might add. Now we get cluster-wide monitoring just by integrating two open-source products, without modifying a line of application code. And you, as the application architect, can breathe a little easier: should things go south, you have a dashboard that helps you zero in on the issue instead of having to grep through logs across disparate machines and do a ton of correlation work.
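For readers who want a feel for step 6, here is a rough sketch of the periodic-snapshot idea. It is not Ron's actual code: `LiveRegistry`, `SnapshotHolder`, `SnapshotDemo`, and `startPublishing` are hypothetical names, and the statistics are reduced to a count and a total time per operation. Only the method name `copy` echoes the method expression in the config above. The sketch assumes that the shared bean is the holder and that its `copy` method is synchronized, which is the code the write autolock would apply to.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the live per-JVM statistics, updated on every request.
class LiveRegistry {
    // operation name -> {call count, total elapsed ms}
    private final Map<String, long[]> stats = new ConcurrentHashMap<>();

    void record(String operation, long elapsedMs) {
        stats.compute(operation, (op, old) -> old == null
                ? new long[] {1, elapsedMs}
                : new long[] {old[0] + 1, old[1] + elapsedMs});
    }

    Map<String, long[]> view() {
        return stats;
    }
}

// Hypothetical holder: the only object that would be declared as a clustered bean.
class SnapshotHolder {
    private Map<String, long[]> snapshot = new HashMap<>();

    // Replaces the shared snapshot with a deep copy of the live statistics.
    synchronized void copy(LiveRegistry live) {
        Map<String, long[]> fresh = new HashMap<>();
        for (Map.Entry<String, long[]> entry : live.view().entrySet()) {
            fresh.put(entry.getKey(), entry.getValue().clone()); // copy the values, not just the map
        }
        snapshot = fresh;
    }

    synchronized Map<String, long[]> snapshot() {
        return snapshot;
    }
}

class SnapshotDemo {
    // Runs holder.copy(live) immediately and then once per period on a background thread.
    static ScheduledExecutorService startPublishing(LiveRegistry live, SnapshotHolder holder, long periodMs) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> holder.copy(live), 0, periodMs, TimeUnit.MILLISECONDS);
        return timer;
    }

    public static void main(String[] args) throws InterruptedException {
        LiveRegistry live = new LiveRegistry();
        SnapshotHolder holder = new SnapshotHolder();
        live.record("GET /home", 40);
        live.record("GET /home", 60);

        // A short period keeps the demo quick; the post describes a period of a few seconds.
        ScheduledExecutorService timer = startPublishing(live, holder, 100);
        Thread.sleep(350);
        timer.shutdownNow();

        System.out.println("calls seen in snapshot: " + holder.snapshot().get("GET /home")[0]);
    }
}
```

The point of the indirection is that the hot path (`record`) never touches the shared object; only the timer thread does, once per period, so cluster traffic is bounded by the period rather than by the request rate.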

About Me

I have spent the last 11+ years of my career either working for a vendor that provides infrastructure software or managing platform teams that deal with software-infrastructure concerns at large online presences (e.g. at Walmart.com and currently at Shutterfly). I am interested in building and analyzing systems that address cross-cutting concerns in enterprise application development (scalability, availability, security, manageability, latency, etc.).