Wednesday, January 21, 2009

Terracotta Usage Amongst Content Portals

TYPICAL REQUIREMENTS:

  1. Content Creation: A combination of
    • User-Generated
    • Feeds from Content Providers (e.g. Movies, Weather from Yahoo etc.)
    • Editorially Generated Content

  2. Content Publishing: On demand, once approved.

  3. Content Portal sites often feature pay-for-content services that imply:
    • User Subscription/Registration.
    • User Authentication and Authorization

  4. Traditional Distributed Caching: caches in front of DB queries for User Profile Data and/or Content-Attributes/Specific Content


TYPICAL IMPLEMENTATION:

  1. Typically Content Creation/Publication has been implemented:

    • Against the database in that the CMS publishes to the database and any concomitant caches are thereafter cleared. This has 2 issues:
      • The database can get overloaded, both by the amount of content streaming in and because the caches are cleared once new content is published, which sends readers back to the database and results in a spike in database queries.
      • To hedge against spikes, often the caches are cleared only at graveyard-hours on the site. So, often there is a delay between content publication and availability for consumption.


    • With Terracotta:
      • You solve both problems. The CMS application would publish to in-memory Java data structures, which Terracotta replicates to the JVMs that serve up the content to users of the Content Portal.
      • One thus obviates the usage of the database and cache clearing and hence content is available as soon as possible.


    • Example:
      • The portal consists of multiple portlets. One portlet displays weather. Content for the portlet streams in through a Yahoo feed. When the weather for 5 zip codes changes, publication is done in memory to Java data structures (e.g. to specific elements in a CHM, i.e. a ConcurrentHashMap, where the key is the ZipCode, the value is a Weather-Forecast object, and the CHM is shared between Publication JVMs and Consumption JVMs).
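
The weather example can be sketched in plain Java as follows. This is a minimal illustration, not the author's code: the class and method names (WeatherPortletSketch, WeatherForecast, publish, forecastFor) are hypothetical, and the map here lives in a single JVM. In a Terracotta deployment the same map would presumably be declared as a shared root in the Terracotta configuration so that Publication and Consumption JVMs see the same entries; that configuration is not shown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class WeatherPortletSketch {

    /** Hypothetical stand-in for the Weather-Forecast object; kept immutable. */
    public static final class WeatherForecast {
        private final String summary;
        private final int highFahrenheit;

        public WeatherForecast(String summary, int highFahrenheit) {
            this.summary = summary;
            this.highFahrenheit = highFahrenheit;
        }

        public String getSummary() { return summary; }
        public int getHighFahrenheit() { return highFahrenheit; }
    }

    // Key: zip code. Value: latest forecast for that zip code.
    // Assumption: under Terracotta this field would be the clustered structure;
    // in this sketch it is an ordinary in-JVM ConcurrentHashMap.
    private static final ConcurrentMap<String, WeatherForecast> FORECASTS_BY_ZIP =
            new ConcurrentHashMap<>();

    /** Publication side: only the zip codes that changed are written. */
    public static void publish(Map<String, WeatherForecast> changedForecasts) {
        FORECASTS_BY_ZIP.putAll(changedForecasts);
    }

    /** Consumption side: the portlet reads straight from memory, no database query. */
    public static WeatherForecast forecastFor(String zipCode) {
        return FORECASTS_BY_ZIP.get(zipCode);
    }

    public static void main(String[] args) {
        Map<String, WeatherForecast> changed = new HashMap<>();
        changed.put("94103", new WeatherForecast("Fog", 58));
        changed.put("10001", new WeatherForecast("Snow", 30));
        publish(changed);

        WeatherForecast forecast = forecastFor("94103");
        System.out.println(forecast.getSummary() + ", high " + forecast.getHighFahrenheit());
    }
}
```

Only the changed keys are touched, so there is no cache to clear and nothing to re-query after publication.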



    • Terracotta Solution Considerations (see figure):
      • Publishing Application and Consumption Application are implemented as two different applications.

      • The applications MUST be factored in such a way that both share common data structures that represent the content.

      • Class Loader issues: Since these are 2 different applications, they feature different class loaders, so a common class loader is needed.

      • EXAMPLE:
        • Tomcat appserver. CMS app deployed as CMSApp Context Root and Consuming app deployed as UserApp ContextRoot.
        • A class loader is created for each web application. The CMSApp ClassLoader loads all unpacked classes and resources in the CMSApp/WEB-INF/classes directory, plus the classes and JARs in the CMSApp/WEB-INF/lib directory, and makes them visible to the CMSApp web application but to no others. Likewise, the UserApp ClassLoader loads all unpacked classes and resources in UserApp/WEB-INF/classes, plus the classes and JARs in UserApp/WEB-INF/lib, and makes them visible only to UserApp.
        • Given this visibility problem, the shared classes need to be loaded by the Shared ClassLoader instead: all unpacked classes and resources in $CATALINA_BASE/shared/classes, as well as classes and resources in JAR files under $CATALINA_BASE/shared/lib, are made visible to all web applications (and thus to both CMSApp and UserApp). This also keeps the class loader name identical across CMSApp and UserApp, which is a Terracotta requirement.



  2. Subscription and User Authentication and Authorization:

  3. Traditional Distributed Caching: See http://www.terracotta.org/web/display/orgsite/Data+Caching and http://javamuse.blogspot.com/2008/11/distributed-cachingdatabase-offload.html for caching JDBC/ORM querying of the database for Specific Content

  4. Other examples of Terracotta being used in Content Portals include Message Board Clustering (e.g. JForum and others) and special-case usages.



Hope those of you in the Portal business find this useful.

Tuesday, January 13, 2009

Terracotta Usage in Ecommerce

As this Computer-world article indicates, even as late as 2008, most E-Commerce sites have not yet solved their scale and availability issues, resulting in serious business losses at a time when they can ill afford them. Most E-Commerce sites are implemented in Java. The causes of outages vary widely, but the usual suspects include (but are not limited to) the following:
  1. Malicious DOS Attacks/ Unintentional but excessive "Spidering"
  2. Poor Capacity Planning at the network tier:
    • Poor or Non-Existent CDN strategy (so no traffic is deflected off Source site)
    • Load Balancer being overwhelmed.
    • Network Bandwidth saturation
  3. Poor Systems infrastructure
  4. Misconfigured HTTP Settings (i.e. the several parameters in httpd.conf, if Apache)
  5. Misconfigured App-Server Settings (i.e. the several parameters in web.xml/server.xml if Tomcat, Connector related settings etc.)
  6. Poor Garbage Collection tuning on JVMs (e.g. a large number of GCs, long Full GCs).
  7. Database/Application Misconfiguration (e.g. sizing of SGA, DB Connection Pool etc.)
  8. Software Development Issues:
    • Database overwhelmed in terms of Reads/Writes; Quality of PL/SQL Algorithms/Code (e.g. DB Locks, Full table scans etc.)
    • Poor Application Architecture/ Design/ Implementation/ Configuration.

The last one, i.e. poor design and implementation of the servlets that constitute the Commerce Site and Backend Systems, is certainly something that consumes the bulk of the Development team's effort/time. A common theme amongst front-end Java Developers is over-dependence on an already heavily used database, given that it is there and provides persistence. This seriously limits scale: during traffic spikes, exactly when the business most needs the systems to be available (given the number of prospective customers), the system blacks out or browns out.

Terracotta can help reduce, and in many cases obviate, RDBMS usage and allow you to operate safely "in-memory" with data that is durable across JVM life-cycles - so that you get scale and HA while maintaining a POJO-based programming environment - see http://www.terracotta.org and especially, http://www.terracotta.org/web/display/orgsite/Kill+Your+Database

LET US SEE HOW Terracotta would add value specifically in E-Commerce Applications. Typically, the E-commerce business involves both FRONT-END systems (Websites, SOAP services etc.) and BACK-END systems (Supply Chain Planning/Fulfillment/Warehousing), i.e.

A> ORDER ACQUISITION SYSTEMS to enable:







SUB SYSTEM: ELECTRONIC CATALOG CREATION/ MODIFICATION
  • PURPOSE: MERCHANTS decide on what to sell. Procure inventory through BUYERS. Product goes through a large workflow (akin to CMS) before it is ready for publication to the catalog (e.g. copy, price, description, photography etc.) as a saleable item.
    • COMMON TERMINOLOGY/ PROBLEMS: ITEM CREATION/CATALOG MODIFICATION: Typically, each of these state changes via this CMS-like Workflow is stored in a RDBMS.
    • COMMON SOLUTIONS/ PROBLEMS with the SOLUTIONS: Employ a home-grown or off-shelf CMS or Document Management System before final publication to the catalog database.

SUB SYSTEM: ELECTRONIC CATALOG PRESENTATION
  • PURPOSE: Users browse through a pre-determined product classification hierarchy - Departments/Categories/Sub-Categories/Shelves
    • COMMON TERMINOLOGY/ PROBLEMS: BROWSE: Not caching Browsing activity leads to RDBMS saturation especially under high volume. Most sites typically report that 90% of all activity on the site is BROWSE/SEARCH.
    • COMMON SOLUTIONS/ PROBLEMS with the SOLUTIONS: CATALOG CACHE - i.e. local Cache on each JVM of Catalog Database queries
  • PURPOSE: Users search for specific products.
    • COMMON TERMINOLOGY/ PROBLEMS: SEARCH: Not caching Search activity leads to Search Engine saturation especially under high volume
    • COMMON SOLUTIONS/ PROBLEMS with the SOLUTIONS: Implement SEARCH CACHE on each local JVM
  • PURPOSE: Inventory position (i.e. Availability Status) being up to date is of utmost importance. If Out-Of-Stock is displayed (while inventory currently exists) then lost sales imply opportunity cost. If In-Stock is displayed and the item is out of stock, there is a fulfillment issue.
    • COMMON TERMINOLOGY/ PROBLEMS: INVENTORY CACHE: If not cached, Database saturated. If cached - typically, Inventory cache not up to date vis-à-vis back-end systems and inconsistent across JVMs
    • COMMON SOLUTIONS/ PROBLEMS with the SOLUTIONS: INVENTORY CACHE is locally cached and when it changes within the database, the change is distributed via JMS. Alternatively, there is no change propagation but one could keep the INVENTORY CACHE TTL very low (e.g. a few minutes). In either case, there is a risk of the CACHE position being incorrectly reflected across JVMs at any point in time.

SUB SYSTEM: USER AUTHENTICATION, INTEREST and PURCHASE
  • PURPOSE: Mechanisms to identify the user, and allowing the user to express Desire for a product and Enabling the retail Transaction.
    • COMMON TERMINOLOGY/ PROBLEMS: AUTHENTICATION/ AUTHORIZATION information - needs to be preserved across requests since HTTP is a stateless protocol.
    • COMMON SOLUTIONS/ PROBLEMS with the SOLUTIONS: Typically state preserved in an HTTP SESSION that is keyed off session-id and session-id is written as a cookie to the user's browser. If HTTP Session is not replicated and persisted elsewhere - losing a JVM would imply a poor user experience, since the user would have to re-authenticate and/or re-establish their position in a workflow.
  • PURPOSE: Allow the user to express interest in a basket of products
    • COMMON TERMINOLOGY/ PROBLEMS: SHOPPING CART/ WISH LISTS
    • COMMON SOLUTIONS/ PROBLEMS with the SOLUTIONS: Stored in HTTP Session or Cache keyed off customer-id (if registered customer) / visitor-id (if unregistered customer). If HTTP Session is not replicated and persisted elsewhere - losing a JVM would imply a poor user experience, since the user would lose his/her cart
  • PURPOSE: Enable execution of the buying transaction
    • COMMON TERMINOLOGY/ PROBLEMS: CHECKOUT - typically implemented as a workflow since it is a multi-stage process across several HTTP Requests involving Credit-Card information, Shipping information etc.
    • COMMON SOLUTIONS/ PROBLEMS with the SOLUTIONS: User's position in the workflow and pre-validated input is typically stored in the HTTP Session. If HTTP Session is not replicated and persisted elsewhere - losing a JVM would imply a poor user experience, since the user would be thrown out of the checkout process and have to restart all over
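
The low-TTL INVENTORY CACHE approach described above, and the staleness risk that comes with it, can be sketched as follows. This is a hedged illustration only: InventoryTtlCacheSketch, its constructor parameters, and the lookup function standing in for the inventory database query are all hypothetical names, and the clock is injected purely so the behaviour is easy to demonstrate.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

public class InventoryTtlCacheSketch {

    /** Cached availability plus the time at which it was loaded. */
    private static final class Entry {
        final int unitsAvailable;
        final long loadedAtMillis;

        Entry(int unitsAvailable, long loadedAtMillis) {
            this.unitsAvailable = unitsAvailable;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final ConcurrentMap<String, Entry> entriesBySku = new ConcurrentHashMap<>();
    private final Function<String, Integer> inventoryLookup; // stand-in for the database query
    private final long ttlMillis;
    private final LongSupplier clock;

    public InventoryTtlCacheSketch(Function<String, Integer> inventoryLookup,
                                   long ttlMillis,
                                   LongSupplier clock) {
        this.inventoryLookup = inventoryLookup;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached position, reloading it once the entry is older than the TTL. */
    public int unitsAvailable(String sku) {
        long now = clock.getAsLong();
        Entry entry = entriesBySku.get(sku);
        if (entry == null || now - entry.loadedAtMillis >= ttlMillis) {
            entry = new Entry(inventoryLookup.apply(sku), now);
            entriesBySku.put(sku, entry);
        }
        return entry.unitsAvailable;
    }

    public boolean inStock(String sku) {
        return unitsAvailable(sku) > 0;
    }

    public static void main(String[] args) {
        final int[] unitsInDatabase = {5};
        final long[] nowMillis = {0L};
        InventoryTtlCacheSketch cache = new InventoryTtlCacheSketch(
                sku -> unitsInDatabase[0], 60_000L, () -> nowMillis[0]);

        System.out.println(cache.inStock("SKU-1")); // true: loaded from the "database"
        unitsInDatabase[0] = 0;                     // item sells out in the back end
        System.out.println(cache.inStock("SKU-1")); // still true: stale until the TTL expires
        nowMillis[0] = 60_000L;
        System.out.println(cache.inStock("SKU-1")); // false: entry reloaded after the TTL
    }
}
```

Because each JVM holds its own copy and reloads on its own schedule, two JVMs can disagree about the same SKU for up to one TTL, which is the inconsistency the INVENTORY CACHE entry above warns about.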


B> ORDER FULFILLMENT SYSTEMS:




SUB SYSTEM: OMS (ORDER MANAGEMENT SYSTEMS)
  • PURPOSE: Decomposition of the order into order line-items (e.g. an order may include flowers and basketballs, each of which is fulfilled by a different distributor/warehouse)
    • COMMON TERMINOLOGY/ PROBLEMS: OMS typically requires a fair bit of complex algorithmic knowledge to arrive at the right decomposition of Orders into Order line-items
    • COMMON SOLUTIONS/ PROBLEMS with the SOLUTIONS: Typically executed as a DBMS job. The database is the right place to do this work given the algorithmic complexity and the amount of data to be referenced.

SUB SYSTEM: FULFILLMENT SYSTEMS
  • PURPOSE: Now that orders have been collected, there is a whole machinery needed to actually deliver the product to the customer.
    • COMMON TERMINOLOGY/ PROBLEMS: FULFILLMENT SYSTEMS: Must handle B2B work with 3rd party distributors and warehouses along with the order line-item lifecycle
    • COMMON SOLUTIONS/ PROBLEMS with the SOLUTIONS: Typically state-changes to the order line-item lifecycle are reflected within a Fulfillment and Order Management database, which then results in backlogs from "Order Confirmation" Emails to updates of the status of the order line-item as it proceeds through the warehouse.

SUB SYSTEM: CUSTOMER SERVICE
  • PURPOSE: Deal with customers who have had issues either during placement of the order or in terms of the order fulfillment i.e. returns, etc.
    • COMMON TERMINOLOGY/ PROBLEMS: CUSTOMER SERVICE call spike could impact the Customer and Order Database in terms of read/write activity.
    • COMMON SOLUTIONS/ PROBLEMS with the SOLUTIONS: A complex set of entities (Customers, Order History, Fulfillment status of specific orders etc.) needs to be cached. The problem with caching this data per customer is that the cache-hit ratio is very low (i.e. the customer profile is typically needed for just the one contact the customer may have with customer service). One could pre-hydrate certain portions of the cache (e.g. if there is a spike of calls around a particular product) and/or Distribute a Reference Cache of Products to cut database latency to a certain extent.

IN SUMMARY: You could use Terracotta for:


  1. HTTP SESSION CLUSTERING:
    • To provide high availability to your HTTP SESSION (which likely contains user-authentication information, shopping cart, position within checkout).
    • Terracotta's Session Clustering implementation is the best in the market in terms of minimizing session replication overhead, transparency and visibility.

  2. CACHE DISTRIBUTION:
    • To distribute CATALOG CACHE, SEARCH CACHE, INVENTORY CACHE and elements of the family of Caches needed for CUSTOMER SERVICE.
    • Terracotta supports a variety of distributed caches (from EHCache to Hibernate to POJOs implemented as HashMaps, CHMs or any arbitrary collection grouping etc.).
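
As an illustration of a POJO cache implemented as a CHM, here is a minimal read-through CATALOG CACHE sketch. The names (CatalogCacheSketch, browse, databaseHits) and the function standing in for the catalog database query are hypothetical. The sketch runs in one JVM; whether and how the map is clustered is a deployment concern that is assumed here and not shown.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class CatalogCacheSketch {

    // Key: category path, e.g. "Departments/Garden/Tools". Value: the rendered listing.
    private final ConcurrentMap<String, String> listingsByCategory = new ConcurrentHashMap<>();
    private final Function<String, String> catalogQuery; // stand-in for the database query
    private final AtomicInteger databaseHits = new AtomicInteger();

    public CatalogCacheSketch(Function<String, String> catalogQuery) {
        this.catalogQuery = catalogQuery;
    }

    /** BROWSE: the database is queried only the first time a category is requested. */
    public String browse(String categoryPath) {
        return listingsByCategory.computeIfAbsent(categoryPath, path -> {
            databaseHits.incrementAndGet();
            return catalogQuery.apply(path);
        });
    }

    /** Number of times the underlying query actually ran. */
    public int databaseHits() {
        return databaseHits.get();
    }

    public static void main(String[] args) {
        CatalogCacheSketch cache = new CatalogCacheSketch(path -> "items under " + path);
        cache.browse("Departments/Garden");
        cache.browse("Departments/Garden");
        cache.browse("Departments/Toys");
        System.out.println("database hits: " + cache.databaseHits());
    }
}
```

If the same map were shared across JVMs, a category loaded by one JVM would not need to be reloaded by the others, which is where the database offload for BROWSE traffic would come from.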

  3. INTER-JVM CO-ORDINATION/BATCH PROCESSING
    • To co-ordinate amongst participating JVMs for Customer notification, Work allocation amongst warehouses, Update of Fulfillment life-cycle changes (asynchronously drained to the database), etc.
    • The Terracotta master-worker framework encapsulates the distribution of work amongst "workers" and the aggregation of the output of individual "workers", and exposes some interfaces for you to implement any application-specific logic.
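
The master-worker pattern itself can be sketched with the standard java.util.concurrent classes. To be clear, this is not the Terracotta master-worker API: it is a single-JVM stand-in with hypothetical names (MasterWorkerSketch, processAll) in which threads play the role that worker JVMs would play in a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class MasterWorkerSketch {

    /**
     * Master: hands one task per work item to a pool of workers, waits for all of them,
     * and aggregates the individual results by summing them.
     */
    public static int processAll(List<String> workItems,
                                 Function<String, Integer> work,
                                 int workerCount) {
        ExecutorService workers = Executors.newFixedThreadPool(workerCount);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String item : workItems) {
                tasks.add(() -> work.apply(item)); // application-specific logic plugs in here
            }
            int total = 0;
            for (Future<Integer> result : workers.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker failed", e.getCause());
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical work: "send" one notification per order line-item and count them.
        List<String> orderLineItems = Arrays.asList("order-1/line-1", "order-1/line-2", "order-2/line-1");
        int sent = processAll(orderLineItems, lineItem -> 1, 2);
        System.out.println("notifications sent: " + sent);
    }
}
```

The split mirrors the description above: distribution and aggregation live in processAll, and the per-item logic is supplied by the caller.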

  4. In addition, there are several other systems that comprise an E-Commerce business:
    • WAREHOUSE MANAGEMENT SOFTWARE: Planning for work for each shift of the warehouse and Enabling the execution of the Pick, Cartonization, Shipment of the packaged product. (e.g. Retek)
    • WAREHOUSE PLANNING: Linear Programming Exercise to minimize shipping costs given demand forecast and Warehouse Locations and Fulfillment portfolio (e.g. Manugistics, ILog).
    • DEMAND FORECASTING: Forecasting demand based on history and new trends. Grouping the demand temporally and geographically (e.g. Manugistics, i2 etc.)
    • BUSINESS METRICS MANAGEMENT: Recording the user click-trail and making marketing/merchandising decisions (e.g. Coremetrics, Omniture etc.) and SEVERAL OTHERS
    • However, these are often procured as shrink-wrapped software from other vendors. So you may have more luck persuading the vendor to partner with Terracotta if scale and HA of the solution are unsatisfactory. Alternatively, if these are home-grown Java-based systems, you could investigate which data structures need to be DSO (Distributed Shared Objects) so that they become durable. Since Terracotta is a general-purpose platform that clusters at the JVM level, it is applicable in any POJO-based app. See http://www.terracotta.org/web/display/orgsite/JVM+Level+Clustering

See http://www.terracotta.org and http://www.terracottatech.com for more detail.

About Me

I have spent the last 11+ years of my career either working for a vendor that provides infrastructure software or managing Platform Teams that deal with software-infrastructure concerns at large online-presences (e.g. at Walmart.com and currently at Shutterfly) - am interested in building and analyzing systems that speak to cross-cutting concerns across application development in the enterprise (scalability, availability, security, manageability, latency etc.)