Wednesday, June 25, 2008

Tuning, Testing, Deploying Terracotta Implementations - how does my application time-to-market change?

I work on the Field Engineering team at Terracotta. What we have noticed over hundreds of implementations is that most technology adopters are quite adept at getting to Dev-Completion with Terracotta DSO on their own. This, coupled with the Terracotta Integration Module strategy (see http://www.terracotta.org/confluence/display/integrations/Home ), makes me believe that Terracotta’s promise of maintaining natural Java semantics, and hence reducing the level of effort a developer needs to cluster his/her app, is by and large very real.

Another observation, though, is that while self-adopters succeed at reaching the Integration-Complete phases of their projects, they have needed, relatively speaking, more hand-holding with tuning, testing and deploying their applications into production. The reasons:

  • For the most part, programming languages specify no behavioral characteristics beyond the contractual interface of inputs and resultant outputs. As an example, take the notion of sizing collections. We all understand that one should pre-size collections to avoid expensive rehashing at run-time: at the default capacity of 16 and load factor of 0.75, as soon as a Hashtable grows past 12 entries we are already paying the cost of a rehash (to a capacity of roughly 32). Yet very few developers actually size their collections, for whatever reason (ignorance, negligence, or a genuine inability to predict the collection size). It becomes a later mop-up tuning task instead of being accounted for during implementation. A Hashtable/HashMap with self-correcting size heuristics, or an event fired on rehash (so one could react to it programmatically), would be useful - but in the absence of one, there is no substitute for competence when implementing.
  • The above argument highlights that even with technologies (such as Terracotta) that modify byte-code to give you desirable characteristics such as add-a-node predictive scale and high availability, there is really no substitute for thought and discipline when engineering, rigor during testing, and scientific methodology when stress-testing and deploying.
  • Of course, Terracotta is a byte-code-enhancing technology and is working on providing the end user more tooling, more visibility, and more rules/inference engines to suggest possible root causes, and hence ease the process across the entire SDLC. You already get the gist of this approach via the Snapshot Visualization Tool, and there is a lot more coming down the pipe.
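To make the pre-sizing point above concrete, here is a minimal sketch (the class and helper names are illustrative, not from any library): to hold n entries without a rehash at the default 0.75 load factor, a HashMap needs an initial capacity greater than n / 0.75.

```java
import java.util.HashMap;
import java.util.Map;

public class PreSizedMap {
    // Smallest initial capacity that lets 'expected' entries fit without
    // a rehash at the default 0.75 load factor (capacity * 0.75 >= expected).
    static int capacityFor(int expected) {
        return (int) Math.ceil(expected / 0.75) + 1;
    }

    public static void main(String[] args) {
        int expected = 1000;
        // A default-constructed HashMap starts at capacity 16 and would
        // rehash several times on its way to 1000 entries; this one will not.
        Map<String, Integer> sized = new HashMap<>(capacityFor(expected));
        for (int i = 0; i < expected; i++) {
            sized.put("key-" + i, i);
        }
        System.out.println(sized.size()); // prints 1000
    }
}
```

Note that HashMap internally rounds the requested capacity up to the next power of two, so the computed value is a safe floor rather than an exact allocation.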

So, assuming you have had to make some delta changes to your application to get to Dev-Completion (e.g. initialize Terracotta transients, modify the coarseness of locks, etc.), what is the typical time spent on tuning, testing and deploying? Perhaps the following comparisons between any Java application and a Terracotta-integrated Java application can highlight the delta involved - that way, you can better understand how your application's time-to-market changes when you take it from POC to production with Terracotta.
(Note that Java performance tuning is a vast topic, so we are not covering all possible ways to enhance the performance/scale of your Java app itself - no NIO or AWT improvements, for example. We are just identifying what one typically does during Terracotta tuning, and how a fair bit of it is similar to what one would do without Terracotta in the picture in the first place.)

Tuning Terracotta: See here for several common practices:



Programmer Consideration: Data Structures

Size
  POJA: Validate the right choice of data structure based on the expected size of the structure.
  TI-POJA: Similar to a POJA. Additional considerations:
    · Whether a collection is “partial” or not (HashMap, ConcurrentHashMap, Hashtable, LinkedHashMap and arrays are partial; others are not, as of 2.6).
    · Does the object graph play well with the Terracotta Virtual Memory Manager?

Pattern of Access
  POJA: If lookup is by key, use an associative structure (e.g. HashMap) instead of a Vector, etc.
  TI-POJA: Identical to a POJA.

Extent of I/O Against the Collection, and Concurrency of Access
  POJA: For example: Hashtable or HashMap, based on correctness needs? HashMap or ConcurrentHashMap, based on how concurrent the access needs to be?
  TI-POJA: Same concept - how concurrent is the access across the cluster?

Data-Structure Implementation
  POJA: Myriad other considerations based on how the data structure is implemented and the domain model under consideration. See here for more discussion.
  TI-POJA: Identical to a POJA.

Programmer Consideration: Synchronization (given a requirement of a certain level of correctness)

Contention on the Monitor Being Synchronized On
  POJA: Be careful to lock on the right monitor. May need to stripe locks.
  TI-POJA: Identical.

Scope of Lock
  POJA: Coarse-grained versus fine-grained (is it method-level, or can you keep it local to a specific block of code?). The more fine-grained, the better.
  TI-POJA: Identical - just note that in many cases there may be value in making the lock more coarse-grained, to benefit from batching.

Type of Lock
  POJA: Pessimistic synchronization, or lower-level synchronization semantics (volatile, Atomic, ReentrantReadWriteLock, etc.).
  TI-POJA: Volatiles are not supported as of 2.6. In addition, one has to look at lock types, e.g. Synchronous-Write, Write, Read, Concurrent, None. You can get more details on distributed locking here.

Programmer Consideration: Memory Management

Garbage Collection (choice of collector; -Xms, -Xmx; ratio of Eden to Old, etc.)
  POJA: CMS or Parallel or both. Appropriate sizing of the heap, young and old generations, etc.
  TI-POJA: Identical (see here). In addition, one needs to consider DGC (Distributed Garbage Collection) on the Terracotta server.

Soft References
  POJA: Soft-reference policy.
  TI-POJA: Tune the Virtual Memory Manager on the L1 and L2.

Virtual Memory Usage
  POJA: Avoid thrash; monitor virtual memory via vmstat.
  TI-POJA: Identical - some thrash between L1 <-> L2 <-> L2 disk is allowable, as long as it stays within the latency SLA. Tune faulting, fault/flush, and the committing of Terracotta transactions.

Programmer Consideration: Traffic Patterns and Distribution; Locality of Reference
  POJA: Based on data size, partition data across JVMs or not. Locality of reference is a good thing.
  TI-POJA: Identical (one can overcome the data-size limitation, assuming a trade-off on the latency SLA is possible). Delta: in several use cases (but not all), the Terracotta VMM can mean you do not need to partition (hence a simpler app). In some cases, given the data volume and performance SLA, there may be no way to achieve locality of reference without partitioning. See here.

Other Considerations
  Instrumentation scope. See here.
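To make the lock-striping note in the synchronization rows concrete, here is a minimal, hypothetical sketch (plain Java, not Terracotta-specific code): contention on a single monitor is reduced by hashing keys across several independent locks. Under Terracotta the same trade-off applies cluster-wide, with the caveat from the table that a coarser lock can sometimes win by batching.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical striped-lock counter: instead of one monitor guarding
// everything, keys are hashed across N independent locks so that
// threads touching different stripes do not contend with each other.
public class StripedCounters {
    private final ReentrantLock[] locks;
    private final long[] counts;

    public StripedCounters(int stripes) {
        locks = new ReentrantLock[stripes];
        counts = new long[stripes];
        for (int i = 0; i < stripes; i++) locks[i] = new ReentrantLock();
    }

    private int stripeFor(Object key) {
        // floorMod keeps the index non-negative for negative hash codes
        return Math.floorMod(key.hashCode(), locks.length);
    }

    public void increment(Object key) {
        int s = stripeFor(key);
        locks[s].lock();
        try { counts[s]++; } finally { locks[s].unlock(); }
    }

    public long total() {
        long sum = 0;
        for (int i = 0; i < locks.length; i++) {
            locks[i].lock();
            try { sum += counts[i]; } finally { locks[i].unlock(); }
        }
        return sum;
    }
}
```

This is the same idea ConcurrentHashMap applied internally in that era: trading one hot monitor for several cooler ones, at the cost of a slightly more involved aggregate read.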




Testing and Deployment: See here for several thoughts on how testing a distributed application differs from testing a non-distributed one.

Consideration: Testing

Functional
  POJA: Code coverage is important to determine the thoroughness of testing.
  TI-POJA: Identical - and also to ensure that no Terracotta runtime exceptions (UnlockedSharedException and NonPortableExceptions) are lurking.

Scale/Performance Testing
  POJA: Need a test script that mirrors production load. Production monitoring should also be on. Need quantitative measures (average, median, max) of latencies for a basket of transactions, plus an overall TPS measure.
  TI-POJA: Testing a distributed app is a different ball game. Ideally, a framework to spin up/tear down JVMs is useful. You will want to consider throughput in the presence of various failover scenarios (e.g. when a client JVM fails, when the primary Terracotta server fails, etc.), the state of distributed components (e.g. a client JVM at 100% CPU utilization, a network switch in the middle of a reboot sequence, etc.), and the manifestation of the object graph across client JVMs, Terracotta server JVMs and disk (given that Terracotta's Virtual Memory feature allows you to exceed the physical limits of RAM on your client JVM). You probably also want to run your monitoring scripts at the same time. See here.

Availability
  POJA: Need to identify single points of failure in the application deployment.
  TI-POJA: Terracotta will provide you an implementation with no single point of failure. Some effort is needed to tune the parameters that determine the failover times/behaviors of the various components, based on the type of failure. See here. The Network Active/Passive link also describes the availability tests you could execute to reconfirm there is no SPoF in your specific environment.

Consideration: Deployment

Sizing of the Hosts on Which the JVMs Reside
  POJA: Based on capacity-planning exercises.
  TI-POJA: Identical. Additionally, you need to size the Terracotta servers and choose a storage strategy for cluster state. You may also need to determine how many client JVMs are required to support the application TPS and meet latency SLAs (i.e. if faulting is not OK).

Monitoring
  POJA: CPU, memory, disk, network, GC, TPS of each client JVM, etc.
  TI-POJA: Identical. In addition, probably 10-20 more Terracotta-specific parameters to monitor - and, of course, the Terracotta servers and the disks they write to.

Run-Book
  POJA: Protocol for recovery and troubleshooting when incidents occur.
  TI-POJA: Identical. Given that the architecture has no single points of failure, the run-book would probably focus more on what to do to restore HA to the environment.
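The "spin up/tear down JVMs" idea from the testing rows can be sketched with plain ProcessBuilder. Everything below (class names, timings, the LoadClient main class) is illustrative, not an actual Terracotta test framework:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical harness: launches N client JVMs running a given main
// class, then kills one mid-run to exercise a failover scenario.
public class JvmHarness {
    public static Process launch(String mainClass, String... args) throws IOException {
        List<String> cmd = new ArrayList<>();
        // Reuse the current JVM's java binary and classpath for the child.
        cmd.add(System.getProperty("java.home") + File.separator + "bin"
                + File.separator + "java");
        cmd.add("-cp");
        cmd.add(System.getProperty("java.class.path"));
        cmd.add(mainClass);
        for (String a : args) cmd.add(a);
        return new ProcessBuilder(cmd).inheritIO().start();
    }

    public static void main(String[] args) throws Exception {
        List<Process> clients = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            // com.example.LoadClient is a placeholder for your own test driver.
            clients.add(launch("com.example.LoadClient", "client-" + i));
        }
        Thread.sleep(5000);
        clients.get(0).destroy(); // simulate a client-JVM failure mid-test
        for (Process p : clients) p.waitFor();
    }
}
```

A real harness would additionally collect per-process exit codes and latency logs, and would script the server-side failures (e.g. killing the primary Terracotta server) the same way.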


As is evident from the above, there is a fair bit of parallelism between tuning and deploying any Java application and a Terracotta-integrated app, although distributing your app necessarily requires a few changes across the entire software-manufacturing process. Hopefully, based on the above list and your own dev exercise, you can better predict the additional time-to-market (I would guesstimate that typical ranges average around 3-12 weeks, start to finish, for mid-to-high-complexity apps, and that this will only improve with more TIMs, visibility tooling and specific documentation). You can read more about the specifics at http://www.terracotta.org/confluence/display/wiki/TechnicalFAQ . We are in the process of cleaning up this documentation with links to supporting product features, etc. - so keep an eye out for new content at http://www.terracotta.org

About Me

I have spent the last 11+ years of my career either working for vendors that provide infrastructure software or managing platform teams that deal with software-infrastructure concerns at large online presences (e.g. at Walmart.com and, currently, at Shutterfly). I am interested in building and analyzing systems that address cross-cutting concerns in enterprise application development (scalability, availability, security, manageability, latency, etc.).