Introduction to NOSQL and Cassandra, part 2

In part 1 of this talk I presented a few of the theoretical concepts behind NOSQL and Cassandra.

In this talk we take a deep dive into the Cassandra API and implementation. The video is again in Hebrew, but the slides are multilingual ;-)

  • Started with a short recap of some RDBMS and SQL properties, such as ACID, and why SQL is very programmer-friendly but also limited in its support for large-scale systems.
  • Short recap of the CAP theorem
  • Short recap of what N/R/W are
  • Cassandra Data Model: Cassandra is a column-oriented DB which follows a data model similar to Google’s BigTable
  • Do you know SQL? Well, you’d better start forgetting it – Cassandra is a different game.
  • Vocabulary:
    • Keyspace – a logical container for application data. For example – a Billing keyspace, a Statistics keyspace, an appX keyspace etc.
    • ColumnFamily – similar to SQL tables. Aggregates columns and rows
    • Keys (or Rows). Each set of columns is identified by a key. A key is unique per Column Family
    • Columns – the actual values. Columns are represented by triplets – (name, value, timestamp)
    • Super-Columns – Facebook’s addition to the BigTable model. SuperColumns are columns whose value is a list of columns (this is not recursive, though – you can only have one level of super-columns)
  • One way to think of Cassandra is as a key-value store, but with extra functionality:
    • Each key has multiple values. In Cassandra jargon those are Columns
    • When reading or writing data it’s possible to read/write a set of columns for one specific key (row) atomically. This set of columns may be specified either by a list of column names or by a slice predicate, assuming the columns are sorted in some way (that’s a configuration parameter)
    • In addition, a multi-get operation is supported, and so is a row-range-read operation.
    • Row-range-read operations are supported only if a partitioner is defined which supports them (a configuration parameter)
  • Key concept: In SQL you add your data first and then retrieve it in an ad-hoc manner using select queries and where clauses; in Cassandra you can’t do that. Data can only be retrieved by its row key, so you have to think about how you’re going to read your data before you insert it. This is a conceptual difference between SQL and Cassandra.
  • I covered the Cassandra API methods (a short usage sketch follows this list):
    • get
    • get_slice
    • multiget
    • multiget_slice
    • get_count
    • get_range_slice
    • insert
    • batch_insert
    • delete
    • (these are the 0.4 API methods; in 0.5 it’s a little different)
  • Between N/R/W: N is set per keyspace; R is defined per read operation (get/multiget/etc.) and W is defined per write operation (insert/batch_insert/delete)
  • Applications play with their R/W values to get different effects; for example, they use QUORUM to get high consistency, DC_QUORUM for a balance of high consistency and performance, or W=0 to have async writes with reduced consistency.
  • Cassandra defines different sorting orders on its columns. Sort order may be defined at the ColumnFamily level and is used to get a slice of columns, for example, read all columns that start with a… and end with z…
  • There are several out-of-the-box sort types, such as ascii, utf, numeric and date; applications may also add their own sorters. This is, as far as I recall, the only place where Cassandra allows external code to be hooked in.
  • Thrift is a protocol and a library for cross-process communication and is used by Cassandra. You define a thrift interface and then compile it to the language of your choosing – C++, Java, Python, PHP etc. This makes it very easy for cross-language processes to talk to each other.
  • Thrift is also very efficient at serializing and deserializing objects and is also space-efficient (much more than Java serialization is).
  • I did not have enough time to cover the Gossip protocol used by Cassandra internally to learn about the health of its hosts.
  • I also did not have enough time to cover the Repair-on-reads algorithm used by Cassandra to repair data inconsistencies lazily.
  • I did not have time to talk about consistent hashing, which Cassandra implements internally to reduce the overhead when hosts join or leave the ring.
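
To make the API and the per-operation R/W choice concrete, here is roughly what an insert and a get look like over raw Thrift, as in the 0.5-era API. This is a hedged sketch – the keyspace and column family names (Keyspace1, Standard1) are made up, and the exact signatures changed between 0.4 and 0.5 (and again later), so check the thrift interface of your version:

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

// Thrift-generated Cassandra classes; their package name varies between versions.
import org.apache.cassandra.service.Cassandra;
import org.apache.cassandra.service.ColumnOrSuperColumn;
import org.apache.cassandra.service.ColumnPath;
import org.apache.cassandra.service.ConsistencyLevel;

public class CassandraSketch {
  public static void main(String[] args) throws Exception {
    TTransport transport = new TSocket("localhost", 9160);
    Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
    transport.open();

    // A ColumnPath points at a single column: CF "Standard1", no super-column, column "name".
    ColumnPath path = new ColumnPath("Standard1", null, "name".getBytes("UTF-8"));

    // W is chosen here, per write operation:
    client.insert("Keyspace1", "key1", path, "Ran".getBytes("UTF-8"),
        System.currentTimeMillis(), ConsistencyLevel.QUORUM);

    // R is chosen here, per read operation:
    ColumnOrSuperColumn result =
        client.get("Keyspace1", "key1", path, ConsistencyLevel.QUORUM);
    System.out.println(new String(result.column.value, "UTF-8"));

    transport.close();
  }
}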

So, as you can see, this was an overloaded, 1h+ talk with a lot to grasp. Wish me luck implementing Cassandra at outbrain!


Maven Code Quality Dashboard and TeamCity

I’ve recently implemented a code-quality dashboard at outbrain for maven java projects and hooked it into our TeamCity continuous integration server. I was very pleased with the result, but the process had a few hiccups, so I thought I’d mention them here for future generations.

A code quality dashboard includes the following components:

  • Tests status – failed, passed and skipped counts, along with good-looking graphs
  • Code coverage report detailing all covered and uncovered lines and branches, including nice coverage graphs
  • Copy-Paste detection by CPD
  • FindBugs report
  • jDepend report

The process had two phases: phase one is where I add the dashboard report to maven’s site goal in my pom.xml, and phase two is where I make this report available in TeamCity, which is a bit of manual work but not too bad.

To add those nice reports, edit your pom.xml to add:

  <reporting>
    <plugins>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>cobertura-maven-plugin</artifactId>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-pmd-plugin</artifactId>
        <version>2.3</version>
        <configuration>
          <linkXref>true</linkXref>
          <targetJdk>1.5</targetJdk>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-report-plugin</artifactId>
        <version>2.4.2</version>
      </plugin>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>jdepend-maven-plugin</artifactId>
      </plugin>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>findbugs-maven-plugin</artifactId>
        <version>2.0.1</version>
        <configuration>
          <xmlOutput>true</xmlOutput>
          <effort>Max</effort>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>dashboard-maven-plugin</artifactId>
      </plugin>
    </plugins>
  </reporting>

Now, in theory, that would have been all. All you have to do is run mvn site and bang – you have the reports under target/site. That’s why maven is nice.

However, if you’re running a multi-module project then mvn site is buggy… all links to the subprojects are broken. But don’t despair, here’s the solution – configure the site plugin to place its generated content where the site plugin expects it to be… yeah, I know it sounds confusing; the thing is that the site plugin has a bug so its links to the submodule projects are broken, but here’s an easy fix that worked for me (as long as the projects are only one directory deep under the parent pom.xml). In the parent pom.xml add:

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-site-plugin</artifactId>
        <configuration>
          <outputDirectory>../target/site/${project.name}</outputDirectory>
        </configuration>
      </plugin>

Now the reports are fixed.

Next step is to add them into TeamCity.

Step one: Tell teamcity to collect the site artifacts (html, css, js…). Go to the build configuration and under Artifacts Paths add **/target/site/**/*

Step two: Add a Code Quality tab to the build results. You do that by ssh-ing to the teamc host and editing

vi /teamc/TeamCity/.BuildServer/config/main-config.xml

to add

<report-tab title="Code Quality" basePath="target/site/" startPage="dashboard-report-details.html" />


That’s it! Now run your build with mvn site and get those gorgeous-looking reports right in your teamcity build results page.


Moved hosting away from #godaddy #sucks


This blog has moved away from GoDaddy hosting to bluehost. I don’t know how good bluehost is (so far it’s been OK) but I know for sure that GoDaddy is pretty darn awful.

Awful as in
  • So slow that it’s a nightmare to edit posts online. I had to write them offline and then copy-paste (and reformat), what a waste of time.
  • So fragile that my pingdom monitor reports it unavailable for more than 5 minutes at least twice a day.
  • So unresponsive that I’m sometimes ashamed to share permalinks to my blog in fear it would be offline. Why did I even pay them?

I didn’t wait for the one year I paid up front to end, just packed my stuff and moved over to bluehost. At least it’s fun again to edit online, and my pingdom monitor hasn’t told me anything bad yet, so knock on wood, looking good so far.


Introduction to NOSQL and Cassandra, part 1

I recently gave a talk at outbrain, where I work, introducing NOSQL and Cassandra, as we’re looking for alternatives for scaling out our database solution to match our incredible growth rate.

NOSQL is a general name for many non relational databases and Cassandra is one of them.

This was the first session of two, in which I introduced the theoretical background and explained a few of the important concepts of NOSQL. In the second session, due next week, I’ll talk more specifically about Cassandra.

The talk is on youtube, video below, but it’s in Hebrew so I’ll share its outline in English here. Slides are enclosed as well.


  • SQL and relational DBs in general offer a very good general-purpose solution for many applications, such as blogs, banking, my cat’s site etc.
  • RDBMS provide ACID: Atomicity, Consistency, Isolation and Durability
  • RDBMS + SQL (the query language) + ACID provide a very nice and clean programming interface, suitable for banks, online merchants and many other applications, but not all applications actually require full ACID, and one has to realize that ACID and SQL features are not without cost when systems need to scale out. The cost is not only in $$, it’s also in application performance and features.
  • The new generation of internet-scale applications puts very high demands on DB systems when it comes to scale and speed of operation, but they don’t necessarily require all the good that’s in RDBMS, such as full Consistency or Atomicity.
  • So, a new breed of DB systems has grown over the past 5 or so years – nosql, which stands for either No-SQL or Not-Only-SQL.
  • Leading actors in the nosql arena are Google with its BigTable, Amazon with Dynamo, Facebook with Cassandra and there’s more.
  • I presented intermediate solutions before going no-sql, namely RDBMS sharding which is very common and FriendFeed’s particularly interesting solution of application level indexing for using mysql with a schema-less data model.
  • CAP Theorem: At large scale systems you may only choose 2 out of the 3 desired attributes: Consistency, Availability and Partition-Tolerance. All three may not go hand in hand and application designers need to realize that.
  • A Consistent and Available system with no Partition-tolerance is an RDBMS system that comes to a halt if one of its hosts is down. That’s a very commonly used solution and perfect for small systems. This blog, for example, which uses WordPress, also uses a single mysql server which, if it happens to be down, will take the blog down with it. However, for internet-scale systems, where at almost any point in time there’s a good chance that one of the nodes is down or there are network disruptions, the No-Partition-Tolerance approach just isn’t going to cut it, and they will have to choose a different approach to providing their SLAs.
  • Systems that are Available at all times and are capable of handling Partitions must sacrifice their consistency. As it turns out, though, this isn’t as bad as it seems, as there are pretty good alternatives with lower levels of consistency; one such alternative is Eventual Consistency, which actually works pretty nicely for “social applications” such as Google’s, Facebook’s and Outbrain’s.
  • I introduced the concept of N/R/W – N is the number of database replicas data is copied to (one must replicate data in order to withstand partitions); W is the number of replicas a write operation blocks on before returning to its caller as “successful”; and R is the number of replicas a read operation blocks on before returning to its caller.
  • N, R and W are crucial when dealing with Eventual Consistency, as their values usually determine the level of consistency you’re going to have. For example, when N=R=W you have full consistency (which isn’t tolerant to partitions, of course). When W=0 you have async writes, which is the lowest level of consistency (you never know when the write operation actually finishes)
  • I introduced the concept of Quorum, which means R=W=ceil((N+1)/2); see the small worked example right after this list.
  • Introduced a (very partial) list of currently available nosql solutions, such as Cassandra, BigTable, HBase, Dynamo, Voldemort, Riak, CouchDB, MongoDB and more.
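
To make the Quorum arithmetic concrete, here is a tiny worked illustration (plain Java, not Cassandra code):

// With N=3 replicas, quorum = ceil((3+1)/2) = 2, so R = W = 2.
int n = 3;
int quorum = (n + 2) / 2; // integer form of ceil((n+1)/2)
// R + W = 2 + 2 = 4 > N = 3, so every read quorum overlaps every write
// quorum in at least one replica – a read always sees the latest
// successful quorum write.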

Overall this was a very interesting talk, with a lot of (fun and interesting) theory. The next part is going to be specific about Cassandra – how all this theory fits into Cassandra and how one uses Cassandra’s API, so stay tuned.


StringTemplate – a step forward to a better web MVC

MVC, Model View Controller, is a well-known design pattern from the field of UI frameworks. It advocates the separation of a Model – application-specific data and business logic; a Controller – which takes user input, consults the model and determines the correct view to present to the user based on the result; and a View – the actual UI component the user interacts with.

In the context of web applications it’s common to consider the database along with the application business logic as the Model, a web framework, such as Struts, as the Controller, and the rendering engine, such as JSP, as the View. MVC is widely used in web development as well as in desktop application development.

StringTemplate is a rendering engine, a View, that I’ve recently become familiar with and immediately fell in love with. StringTemplate is a step toward a true MVC implementation, but before presenting the solution, let’s see what the problem is.

The MVC design pattern states that in a UI framework there should be 3 components – the Model, the View and the Controller. It also states that the Controller should have direct access to the Model and the View, and that the View should have direct access to the Model when presenting the data, as seen in the diagram below.

MVC diagram

There are many MVC implementations in many languages. In Java alone, more than 17 are mentioned on the wiki page. Many of the frameworks use JSP as their rendering engine and add their specific features to it; for example, struts2 implements a very nice controller and, as a rendering engine, uses JSP with either struts2 tags or another tag library. The problem lies within JSP.

JSP, like PHP, ASP and many other simple and fast-start rendering engines, compromises the MVC model; it allows too much in the View component.

JSP allows code to be executed within the context of the page (using the <% %> notation), which is very nice when you want to hack something fast, but is extremely dangerous from a code maintenance perspective and is a clear violation of the MVC agreement. If code can be executed within a page, that code can easily alter the model or take actions a controller should have taken. A View should be able to READ the model, but not ALTER it.

Even the most disciplined programmers, who adhere to clean JSP code by not allowing actual Java code inside their pages (using the <% %>), actually suffer from the same problem without even knowing it… The alternative to the <% and <%= constructs are tags, such as JSTL tags, struts2 tags etc. These tags commonly access the model by invoking its getters, but the real problem with them is that the order in which they execute is very important – read it again – the order in which they appear on the page is crucial to the correctness of the values returned by their backing model. More about this here. So with JSP, when using tags, the View implementor has to understand how the Model works in order to succeed. That’s both inconsistent with the pure MVC model and inconvenient to the UI designer.

Lastly, there are side effects, which cannot be ignored. Views can create side effects by calling methods (even getters can have side effects!), which, needless to say, is very bad.

StringTemplate comes to the rescue. As a matter of fact, having programmed with django in python, as well as with Cheetah in python, I have to say that both were much stricter than JSP and were therefore a lot more programmer-friendly. But in Java we had Velocity, which is strict but also less powerful, and now we have StringTemplate, which is both strict and powerful (and no – freemarker isn’t any better than plain JSP, sorry).

At outbrain we use struts2 as our MVC driver, so to be able to use StringTemplate as the View engine I’ve implemented a struts2 View Renderer. Full code and details below. This code is “unstable”, which means it’s been developed and tested – so far no bugs or missing features – but hasn’t gone to production yet.

struts.xml:

<struts>
  <include file="webwork-default.xml" />
  <package name="regular" namespace="/" extends="struts-default">
    <result-types>
      <result-type name="stringtemplate" class="org.apache.struts2.dispatcher.StringTemplateResult"/>
    </result-types>
    <action name="st" class="com.mysite.StringTemplateTestAction">
      <result name="success" type="stringtemplate">/WEB-INF/st/page.st</result>
    </action>
  </package>
</struts>
Java source code:
package org.apache.struts2.dispatcher;
 
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
 
import javax.servlet.ServletContext;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;
 
import org.antlr.stringtemplate.NoIndentWriter;
import org.antlr.stringtemplate.StringTemplate;
import org.antlr.stringtemplate.StringTemplateGroup;
import org.antlr.stringtemplate.StringTemplateWriter;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.struts2.ServletActionContext;
import org.apache.struts2.StrutsConstants;
import org.apache.struts2.views.util.ContextUtil;
import org.apache.struts2.views.util.ResourceUtil;
 
import com.opensymphony.xwork2.ActionInvocation;
import com.opensymphony.xwork2.LocaleProvider;
import com.opensymphony.xwork2.inject.Inject;
import com.opensymphony.xwork2.util.ValueStack;
 
import freemarker.template.Configuration;
import freemarker.template.TemplateException;
 
/**
 * Defines the StringTemplate result type for struts2.
 *
 * To use this result type add the following configuration to your struts.xml:
 *
 * <code>
 * &lt;result-types&gt;
 *   &lt;result-type name="stringtemplate"
 *       class="org.apache.struts2.dispatcher.StringTemplateResult"/&gt;
 * &lt;/result-types&gt;
 *</code>
 *
 * Template files should be located relative to /WEB-INF/st/, so for example a common layout
 * would be:
 *
 * <ul>
 *   <li>/WEB-INF/st/pages/page1.st</li>
 *   <li>/WEB-INF/st/layouts/layout1.st</li>
 *   <li>/WEB-INF/st/snippets/header.st</li>
 *   <li>/WEB-INF/st/snippets/footer.st</li>
 * </ul>
*
 * So page1.st would look like:
 * <code>
 * $layouts/layout1(header=snippets/header(), footer=snippets/footer())$
 * </code>
 *
 * And the associated struts action would be:
 *
 * <code>
 * &lt;action name="page1"
 *      class="com.mycompany.Page1Action"&gt;
 *        &lt;result name="success"
 *          type="stringtemplate"&gt;/WEB-INF/st/pages/page1.st&lt;/result&gt;
 * &lt;/action&gt;
 * </code>
 *
 * Localization:
 * All string property files should be packed into the classpath (in one of the jars) at
 * /lang/.
 * For example:
 *
 * <ul>
 *   <li>/lang/en_US.properties</li>
 *   <li>/lang/fr_FR.properties</li>
 * </ul>
*
 * @author Ran Tavory (ran@outbrain.com)
 *
 */
public class StringTemplateResult extends StrutsResultSupport {
 
  /**
   * Base path of StringTemplate template files.
   * This is usually something like /WEB-INF/ or /WEB-INF/st/.
   * Must end with /.
   * All templates should reside in subdirectories of this base path and when referencing
   * each other the reference point should be relative to this path.
   * For example, if you have /WEB-INF/pages/page1.st and /WEB-INF/layouts/layout1.st, and assuming
   * the base path is at /WEB-INF then page1 is pages/page1 and layout1 is layouts/layout1.
   * To reference layout1 from page1: $layouts/layout1()$
   */
  public static final String TEMPLATES_BASE =
      "/WEB-INF/st/";
 
  /**
   * Path to the language resource files within the classpath.
   *
   * Resource should be packed inside a jar under /lang/.
   * For example: /lang/en_US.properties
   */
  public static final String LANG_RESOURCE_BASE = "/lang/";
 
  /**
   * If there was an exception during execution it's accessible via $exception$
   */
  public static final String KEY_EXCEPTION = "exception";
 
  /**
   * Session values are accessible via $session.key$
   */
  public static final String KEY_SESSION = "session";
 
  /**
   * All localized strings are accessible via $strings.string_key$
   */
  public static final String KEY_STRINGS = "strings";
 
  /**
   * Request parameters are accessible via $params.key$
   */
  public static final String KEY_REQUEST_PARAMETERS = "params";
 
  /**
   * Request attributes are accessible via $request.attribute$
   */
  public static final String KEY_REQUEST = "request";
 
  private static final Log log = LogFactory.getLog(StringTemplateResult.class);
 
  private static final long serialVersionUID = -2390940981629097944L;
 
  private static final Locale DEFAULT_LOCALE = Locale.US;
 
  /*
   * Struts results are constructed for each result execution
   *
   * the current context is available to subclasses via these protected fields
   */
  private String contentType = "text/html";
 
  private String defaultEncoding;
 
  public StringTemplateResult() {
    super();
  }
 
  public StringTemplateResult(String location) {
    super(location);
  }
 
  public void setContentType(String aContentType) {
    contentType = aContentType;
  }
 
  /**
   * allow parameterization of the contentType the default being text/html
   */
  public String getContentType() {
    return contentType;
  }
 
  @Inject(StrutsConstants.STRUTS_I18N_ENCODING)
  public void setDefaultEncoding(String val) {
    defaultEncoding = val;
  }
 
  /**
   * Execute this result, using the specified template location.
   *
   * The template location has already been interpolated for any variable
   * substitutions.
   *
   * NOTE: The current implementation is still under development and has several restrictions.
   * <ul>
   *   <li>All template files must end with .st</li>
   * </ul>
   */
  public void doExecute(String location, ActionInvocation invocation) throws IOException,
      TemplateException {
 
    final HttpServletRequest request = ServletActionContext.getRequest();
    final HttpServletResponse response = ServletActionContext.getResponse();
 
    if (!location.startsWith("/")) {
      // Create a fully qualified resource name.
      // final ActionContext ctx = invocation.getInvocationContext();
      final String base = ResourceUtil.getResourceBase(request);
      location = base + "/" + location;
    }
 
    final String encoding = getEncoding(location);
    String contentType = getContentType(location);
 
    if (encoding != null) {
      contentType = contentType + ";charset=" + encoding;
    }
 
    response.setContentType(contentType);
 
    final String basePath = ServletActionContext.getServletContext().getRealPath(TEMPLATES_BASE);
    final StringTemplateGroup group = new StringTemplateGroup("webpages", basePath);
    String fileName = location;
    if (fileName.endsWith(".st")) {
      // If filename ends with .st, remove it.
      fileName = fileName.substring(0, fileName.length() - ".st".length());
    }
    if (fileName.startsWith(TEMPLATES_BASE)) {
      // If filename includes the base dir then remove it.
      fileName = fileName.substring(TEMPLATES_BASE.length());
    }
    final StringTemplate template = group.getInstanceOf(fileName);
 
    final Map model = createModel(invocation);
    template.setAttributes(model);
 
    // Output to client
    final Writer responseWriter = response.getWriter();
    final StringTemplateWriter templateWriter = new NoIndentWriter(responseWriter);
    template.write(templateWriter);
    // Flush'n'close
    responseWriter.flush();
    responseWriter.close();
  }
 
  /**
   * Retrieve the encoding for this template.
   *
   * People can override this method if they want to provide specific encodings
   * for specific templates.
   *
   * @return The encoding associated with this template (defaults to the value
   *         of 'struts.i18n.encoding' property)
   */
  protected String getEncoding(String templateLocation) {
 
    String encoding = defaultEncoding;
    if (encoding == null) {
      encoding = System.getProperty("file.encoding");
    }
    if (encoding == null) {
      encoding = "UTF-8";
    }
    return encoding;
  }
 
  /**
   * Retrieve the content type for this template.
   *
   * People can override this method if they want to provide specific content
   * types for specific templates (eg text/xml).
   *
   * @return The content type associated with this template (default
   *         "text/html")
   */
  protected String getContentType(String templateLocation) {
    return "text/html";
  }
 
  /**
   * Build the instance of the ScopesHashModel, including JspTagLib support
   *
   * Objects added to the model are
   * <ul>
   *   <li>Application - servlet context attributes hash model</li>
   *   <li>JspTaglibs - jsp tag lib factory model</li>
   *   <li>Request - request attributes hash model</li>
   *   <li>Session - session attributes hash model</li>
   *   <li>request - the HttpServletRequest object for direct access</li>
   *   <li>response - the HttpServletResponse object for direct access</li>
   *   <li>stack - the OgnlValueStack instance for direct access</li>
   *   <li>ognl - the instance of the OgnlTool</li>
   *   <li>action - the action itself</li>
   *   <li>exception - optional: the JSP or Servlet exception as per the servlet spec (for JSP Exception pages)</li>
   *   <li>struts - instance of the StrutsUtil class</li>
   * </ul>
   */
  protected Map createModel(ActionInvocation invocation) {
 
    ServletContext servletContext = ServletActionContext.getServletContext();
    HttpServletRequest request = ServletActionContext.getRequest();
    HttpServletResponse response = ServletActionContext.getResponse();
    ValueStack stack = ServletActionContext.getContext().getValueStack();
 
    Object action = null;
    if (invocation != null) {
      action = invocation.getAction();
    }
    return buildTemplateModel(stack, action, servletContext, request, response, invocation);
  }
 
  private Map buildTemplateModel(ValueStack stack, Object action,
                                                 ServletContext servletContext,
                                                 HttpServletRequest request,
                                                 HttpServletResponse response,
                                                 ActionInvocation invocation) {
 
    Map model = buildScopesHashModel(servletContext, request, response, stack);
    populateContext(model, stack, action, request, response);
    populateStrings(model, deduceLocale(invocation));
    return model;
  }
 
  /**
   * Populates the <code>strings</code> attribute of the model according to the current locale
   * value.
   * @param model
   * @param locale
   */
  private void populateStrings(Map model, Locale locale) {
    log.debug("Local: " + locale);
    Properties p;
    try {
      p = getLanguageProperties(locale);
      model.put(KEY_STRINGS, p);
      return;
    } catch (IOException e) {
      if (locale.equals(DEFAULT_LOCALE)) {
        log.error("Unable to load language file for the default locale " + locale);
        return;
      } else {
        log.warn("Unable to load language file for " + locale + ". Will try to load for the " +
            "default locale");
      }
    }
 
    // Try once again with the default locale.
    populateStrings(model, DEFAULT_LOCALE);
  }
 
  /**
   * Tries to load language properties for the given locale.
   *
   * The file is loaded from /LANG_RESOURCE_BASE/locale.properties in the classpath, so for example
   * that may be /lang/en_US.properties
   *
   * @param locale
   * @return
   * @throws IOException When the language file isn't found or can't read it.
   */
  private Properties getLanguageProperties(Locale locale) throws IOException {
    final InputStream in =
      getClass().getClassLoader().getResourceAsStream(LANG_RESOURCE_BASE + locale + ".properties");
    if (in == null) {
      // Avoid an NPE in Properties.load() when the resource doesn't exist.
      throw new IOException("Language file not found for locale " + locale);
    }
    final Properties prop = new Properties();
    prop.load(in);
    return prop;
  }
 
  protected Map buildScopesHashModel(ServletContext servletContext,
                                                     HttpServletRequest request,
                                                     HttpServletResponse response,
                                                     ValueStack stack) {
 
    Map model = new HashMap();
 
    // Add session information
    HttpSession session = request.getSession(false);
    if (session != null) {
      model.put(KEY_SESSION, generateAttributeMapForSession(session));
    }
 
    // Add requests attributes
    model.put(KEY_REQUEST, generateAttributeMapFromRequest(request));
 
    // Add request parameters.
    model.put(KEY_REQUEST_PARAMETERS, request.getParameterMap());
 
    return model;
  }
 
  @SuppressWarnings("unchecked")
  private Map generateAttributeMapForSession(HttpSession session) {
 
    Map attributes = new HashMap();
    for (Enumeration e = session.getAttributeNames(); e.hasMoreElements();) {
      String name = (String) e.nextElement();
      attributes.put(name, session.getAttribute(name));
    }
    return attributes;
  }
 
  @SuppressWarnings("unchecked")
  private Map generateAttributeMapFromRequest(HttpServletRequest request) {
 
    Map attributes = new HashMap();
    for (Enumeration e = request.getAttributeNames(); e.hasMoreElements();) {
      String name = (String) e.nextElement();
      attributes.put(name, request.getAttribute(name));
    }
    return attributes;
  }
 
  @SuppressWarnings("unchecked")
  protected void populateContext(Map model, ValueStack stack, Object action,
                                 HttpServletRequest request, HttpServletResponse response) {
 
    // put the same objects into the context that the velocity result uses
    Map standard = ContextUtil.getStandardContext(stack, request, response);
    model.putAll(standard);
 
    // Support for JSP exception pages, exposing the servlet or JSP exception
    Throwable exception = (Throwable) request.getAttribute("javax.servlet.error.exception");
    if (exception == null) {
      exception = (Throwable) request.getAttribute("javax.servlet.error.JspException");
    }
 
    if (exception != null) {
      model.put(KEY_EXCEPTION, exception);
    }
 
    // Add action model.
    if (action instanceof StringTemplateAction) {
      StringTemplateAction stAction = (StringTemplateAction) action;
      model.putAll(stAction.getModel());
    }
  }
 
  /**
   * Returns the locale used for the
   * {@link Configuration#getTemplate(String, Locale)} call. The base
   * implementation simply returns the locale setting of the action (assuming
   * the action implements {@link LocaleProvider}) or, if the action does not
   * the {@link #DEFAULT_LOCALE}
   */
  protected Locale deduceLocale(ActionInvocation invocation) {
 
    if (invocation.getAction() instanceof LocaleProvider) {
      return ((LocaleProvider) invocation.getAction()).getLocale();
    } else {
      return DEFAULT_LOCALE;
    }
  }
}
 
package org.apache.struts2.dispatcher;
 
import java.util.Map;
 
/**
 * Interface defining the required behavior from a StringTemplate result type.
 *
 * The result should prepare the model for the page to display.
 * The model is a map of attribute name -&gt; Object, where each object may either be a simple
 * string, int, etc., or another map.
 * So, for example, $username$ is accessible from the template if the model contains a String
 * under the key "username", and $strings.hello$ is accessible if the model contains a map under
 * the key "strings" and this map contains an attribute under the key "hello".
 *
 * @author Ran Tavory (ran@outbrain.com)
 *
 */
public interface StringTemplateAction {
 
  /**
   * Get the root of the display model.
   *
   * The display model is a map of attributes interpolated by StringTemplate at runtime by
   * substituting the $values$ in the template source.
   * For example to replace $user$ with Ran add
   * <code>map.put("user", "Ran");</code>
   * @return A map of attributes used by StringTemplate to replace in the template.
   */
  public Map getModel();
}
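
To close the loop, here is a sketch of what the action referenced in the struts.xml above could look like. The getModel() values are made up for illustration:

package com.mysite;

import java.util.HashMap;
import java.util.Map;

import org.apache.struts2.dispatcher.StringTemplateAction;

public class StringTemplateTestAction implements StringTemplateAction {

  // Plain struts2 action method; "success" is mapped to /WEB-INF/st/page.st
  // in the struts.xml above.
  public String execute() {
    return "success";
  }

  public Map getModel() {
    // Everything put here becomes a template attribute, e.g. $user$.
    Map model = new HashMap();
    model.put("user", "Ran");
    return model;
  }
}

And /WEB-INF/st/page.st could then contain something as simple as Hello $user$!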

The joy of deleting code

What could be more fun than writing a new shiny super-functional, super-tested piece of code? Deleting it!
When deleting code you know that

  • You have not introduced new bugs. Perhaps you deleted some potential bugs from the old code but chances are you did not introduce new ones.
  • You don’t have to maintain it. It’s deleted.
  • The code was probably poorly written. Good code is never deleted. In many cases there’s poorly written code that you just don’t have the guts to delete. Now you did, that’s great.
  • You’ve probably found a good way to reuse another piece of code, that’s why you’re deleting this piece of code. Code reuse is good.
  • Or that you’ve taken a feature off your product. Taking off features is good, very good. “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away” (Antoine de Saint-Exupéry).
  • Or that you’ve found a more compact and elegant way to do what you want to do.

Bottom line: Deleting code makes me happy. How about you?



Mavenizing our code base

Maven is a build tool for Java. It’s more than that actually, but let’s just call it a build tool.


In outbrain we decided we want to replace good old ant with maven.

Changing the company-wide build tool is not a decision taken lightly and may have consequences on the product release cycle, but we weighed our options and decided to go for it, so I thought it might be worth mentioning our endeavor.

Everyone familiar with Java programming has probably used ant or at least heard of it. For many years it has been the de-facto standard build tool, with a large and growing audience, numerous plugins, excellent documentation and IDE support (for example, most Java IDEs can automatically generate ant build files). But ant has its shortcomings, which we at outbrain decided we just couldn’t live with. We found maven to fill most of the gaps.

How do ant and maven differ?

Maven is newer and was built from scratch with many of the lessons learned from ant in mind. Both projects are written and maintained by the high-quality, high-standard apache software foundation, home of many other wonderful open source products. Both ant and maven are still actively developed and maintained, so it would not be fair to say that maven replaces ant, though many developers tend to think so (including myself).

Ant and maven differ in many ways, but at least to me these are the winning points that actually made the difference and made us choose maven:

Maven is declarative. Ant is imperative.

Here’s what an ant build file looks like for a simple java project:

<project name="MyProject" default="dist" basedir=".">
  <description>
    simple example build file
  </description>
  <!-- set global properties for this build -->
  <property name="src" location="src"/>
  <property name="build" location="build"/>
  <property name="dist"  location="dist"/>
 
  <target name="init">
    <!-- Create the time stamp -->
    <tstamp/>
    <!-- Create the build directory structure used by compile -->
    <mkdir dir="${build}"/>
  </target>
 
  <target name="compile" depends="init"
    description="compile the source " >
    <!-- Compile the java code from ${src} into ${build} -->
    <javac srcdir="${src}" destdir="${build}"/>
  </target>
 
  <target name="dist" depends="compile"
    description="generate the distribution" >
    <!-- Create the distribution directory -->
    <mkdir dir="${dist}/lib"/>
 
    <!-- Put everything in ${build} into the MyProject-${DSTAMP}.jar file -->
    <jar jarfile="${dist}/lib/MyProject-${DSTAMP}.jar" basedir="${build}"/>
  </target>
 
  <target name="clean"
    description="clean up" >
    <!-- Delete the ${build} and ${dist} directory trees -->
    <delete dir="${build}"/>
    <delete dir="${dist}"/>
  </target>
</project>

And here’s the maven one:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <artifactId>MyProject</artifactId>
  <name>MyProject</name>
  <description>simple example build file</description>
  <packaging>jar</packaging>
</project>

Ant line count: 22 (not including blank lines and comments)

Maven line count: 7 (and they are real easy ones)

Short is good, especially in software (except perl ;-) ). But beyond the fact that mvn has shorter build files, there is something more important hidden here – declarative vs. imperative.

With ant you have to tell it HOW to build. You have to tell it that it first needs to collect all java sources, then run the javac compiler on them, then collect all the classes, then run the jar tool on them to make a jar. That’s tedious, especially if you have to do it 30 times, for each and every project. At outbrain we have many projects, so when we used ant you’d see the exact same ant code patterns again and again (including the mistakes, which survived the copy-paste horror). Ant is hard to maintain and is also hard to write. Did any of the readers ever write an ant build file from scratch? I doubt it. I’ve been using ant for more than 5 years and never have I written a build file from scratch. It has always been either copy-paste or using the IDE support for automatic generation. This is a telling sign of a superfluous language.

Maven is declarative. You don’t have to tell it HOW to build, only WHAT to build, so conceptually it’s a higher-level build tool. With mvn you only have to say “Look, this is what I call my project and I want you to make a jar from it”; that’s all, mvn will figure out the rest. It knows where to find the sources (convention) and it knows what steps it needs to take in order to create a jar. It will compile your source files, package all resources for you in that jar, run tests and create that jar for you. You can intervene in this process, but you don’t have to. You can make jars, wars, ears and more.

Declarative is in many cases preferred over imperative. Think HTML vs. Java. HTML is declarative, Java is imperative. In HTML you say <b>bold</b> which tells the browser you want the text to be bold. You don’t tell it how to make the text bold (e.g. how many pixels, what position etc) only that you want it bold and let the browser figure out how to handle it. In Java you’d have to tell it how to make the text bold, how to space the characters, how to space words around it, how to break lines etc. Declarative is in many cases a lot easier than imperative, you worry less.

Maven is declarative, you only have to say “this is my project, jar it”. With ant you have more control over what the build tool does, so you can go crazy with build scripts and… well… jar before compile, or clean after jar (instead of before it) or package the test code inside production code or whatever, you get the point, you have the freedom to err. 9 times out of 10 you don’t need the level of flexibility provided by ant and you’d be much safer using mvn.

Dependency management.

This is a big thing. That was actually the main reason I wanted to move out of ant. Maven has a wonderful dependency management system built in by default. Ant has nothing built in, although it has ivy as a plugin.

What is dependency management? There are external and internal dependencies. External dependencies are ones you download from the net, usually open source projects such as Lucene, ActiveMQ and others. With maven you only have to declare your dependency on them and they get automatically downloaded. Example:

<dependency>
  <groupId>struts</groupId>
  <artifactId>struts-bean</artifactId>
  <version>1.2.8</version>
</dependency>
With ant you basically have two options. One is to download the library yourself and throw it into some folder, call it 3rdParty and add it to the classpath (good luck keeping track of versioning, who’s using what, and your life); the other is to use ivy, which is pretty decent, but as mentioned before, not part of the default ant installation.
As for internal dependencies, which means one of your projects depending on (using) another one of your projects, mvn supports that as well. AFAIK ant does not. With ant, if you have more than one project in your company (and of course you do), you’d have to manually tweak the build scripts so they run in the correct order and dependencies are compiled before they are used. Although it’s possible, that’s sort of nightmarish as the company grows.
Dependency management is a killer feature for mvn and was actually the main reason that prompted me to pursue it. Although ant can have ivy, that was not zero work, so I decided: heck, if we’re going to put some work into it, let’s rebuild the whole thing and get a much better result. So we did.

Conventions vs. configuration

Conventions are good. They save a lot of time and prevent you from making foolish mistakes (such as packaging test code into production).

By conventions I mean:

  • Where is the source code?
  • Where is the test code?
  • Where are the resources?
  • Where are the web files?
  • etc

With ant you had to create a directory for sources (call it src or Src or source or srce) and tell ant where your sources are. Then you’d have to decide where to put your tests. You can push them into tst, test, tests, or even the same source directory as the production code, maybe in a test package. Next you have to tell ant where to find the test code, how to separate it from production code, and heck – how to run it (it really doesn’t know how to do that by itself).

With mvn that’s much easier. Maven promotes conventions such as the standard directory layout, which means all sources are at a predefined location, all resources are as well, etc. There are several advantages to that: it’s easy to start a new project, you don’t have to think about where everything goes, it’s hard to make mistakes by misplacing items, you don’t have to think about the build script and how to configure it, and perhaps most important of all, all company employees conform to the same layout. That’s a huge gain for the company.
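
For reference, the heart of maven’s standard directory layout is:

  • src/main/java – production sources
  • src/main/resources – production resources (packaged into the artifact)
  • src/test/java – test sources
  • src/test/resources – test resources (never packaged into production artifacts)
  • target – everything the build generates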

Built-in functionality out of the box

OK, there are plenty of other benefits to mvn but the post is already getting too long so let’s have the last one here.

With mvn you get tons of functionality out of the box. By creating a very simple build file with only the project definition in it you can:

  • compile all sources
  • compile tests
  • run unit tests
  • run integration tests
  • package as a jar/war or something else
  • deploy
  • run in tomcat
  • …and much more

With ant, for each of the goals listed above, you’d have to create an ant target and configure ant by telling it how to run it. Some of them are relatively trivial (but still require coding) and some of them aren’t easy at all… which kind of tells you why anters are very good with their CTRL+C and CTRL+V. And with copy-paste comes the pain of silly copy-paste errors and difficult maintainability. Life with mvn is better ;-)

Other great features, not covered here: Eclipse and other IDE integration (automatically generating projects), great testing and debugging tools, excellent build output, excellent versioning and more.

How did it go?

So, how did it go? You’ve converted the build tool of your entire codebase – isn’t that like Netscape’s near-death experience while rewriting their entire code base?

Well… no! The nice thing about mvn is that it’s easy to write and easy to learn. We did spend a couple of weeks on the task and had to resolve some unpredictable situations, but it had only a small impact on our schedule (and, needless to say, I hope that in the long term it will have a most positive impact on the release schedule). Heck, we (at least I) even enjoyed it!

Conclusions

I’d definitely recommend mvn. If you’re starting a new project, choose mvn. If you have an existing codebase using ant and you’re thinking about moving to maven, know that it’s certainly feasible and in my opinion, well worth the effort. Expect some work, it’s not zero effort, but I promise you’ll enjoy it.


Experimenting with Seam Carving

Seam Carving is a technique for smart image resizing developed by Shai Avidan and Ariel Shamir. It’s cool. It’s really cool actually! It lets you resize an image without having to lose important information, so for example, if there’s a face in the photo and a background, the background will shrink while the face maintains its size. There are plenty of examples on the web and some very cool videos, so let’s have a taste of them first.

There are also a large number of implementations, in many different languages, including Java, C++, JavaScript and more.

Yesterday I wanted to integrate this cool technology into one of the outbrain features I’m working on, so here’s what I’ve learned:

First of all, it’s really cool – did I mention this? :)

I used a Java implementation by Mathias Lux from here (thanks, Mathias). In general, this is very nice work and I only had to fix a few bugs to get it going ;)

When I first started using it I noticed two problems:

1. It’s slow. And I mean painfully slow! Nothing that production code can live with. An average photo would get resized in about 30-60 seconds. No go, no good, no no no.

2. It doesn’t always do the right thing… I mean sometimes the result of the resize is sort of lame… Let’s have a look at some examples.

Before:

Before carving

After:

After carving

Hmm, that’s not so good, right?… See how that poor man’s face is distorted?

Here’s another one. Before:

Before Carving

And after carving. Notice the building roof… it seems to have endured a little earthquake…

After carving

So, you see, I had a problem… The algorithm was correct, no problem with that, but perhaps this was not the best solution for what I was trying to achieve… What was my goal, then? My goal was to have all images resized to the same size (in this case 178×100) without having to letterbox or crop them. But I don’t mind so much that a face in the photo gets smaller, that’s perfectly OK by me, as long as the face doesn’t get cut in the middle.

My solution:

Here’s what I did – instead of using the seam carver right from the beginning, I first resized the image using a simple linear resize operation, such that the photo would exactly match either the width or the height of the target photo, and only then did I run the seam carver.

For example, if the original size of the image was 300×300 and I wanted the target size to be 100×50, I first resized it to 100×100 (so the image fills the entire width and overflows the height) and only then used the seam carver to reduce its height.
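
In code, the two-step approach looks roughly like this. This is a sketch: SeamCarver.resize() is a hypothetical stand-in for whatever entry point your seam carving implementation exposes (the actual API of the implementation I used differs):

import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class SmartResizer {
  public static BufferedImage smartResize(BufferedImage src, int targetW, int targetH) {
    // Step 1: linear resize so the image exactly covers the target in one
    // dimension and overflows in the other.
    double scale = Math.max((double) targetW / src.getWidth(),
                            (double) targetH / src.getHeight());
    int w = (int) Math.round(src.getWidth() * scale);
    int h = (int) Math.round(src.getHeight() * scale);
    BufferedImage scaled = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
    Graphics2D g = scaled.createGraphics();
    g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                       RenderingHints.VALUE_INTERPOLATION_BILINEAR);
    g.drawImage(src, 0, 0, w, h, null);
    g.dispose();
    // Step 2: let the seam carver remove only the overflow in the remaining
    // dimension (hypothetical API).
    return SeamCarver.resize(scaled, targetW, targetH);
  }
}

For the 300×300 → 100×50 example above: scale = max(100/300, 50/300) = 1/3, so step 1 produces 100×100 and step 2 carves away only the 50 extra pixels of height.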

The results were amazingly good! Processing time dropped to about 500ms per photo (still, I can do better, but this is already a big improvement and can go to production) and, most importantly, the photos look good now. Compare the following two resized-and-carved photos with the initial results.

Improved resized

Improved resized

That did the trick, nice :) .


Why functional languages rock with multi-core

Cores are cheaper nowadays. Almost all new computers ship with 2 or more cores. Datacenter computers usually ship with more – 4, 8, 32… Again the hardware industry has left the software industry behind; if only Moore’s law worked the same way for software…

But there is hope!

Functional programming languages used to be a niche for a small group of programming-language advocates or emacs fanatics ;-) I, for one, studied functional languages such as Lisp and ML at school but never thought I’d ever use them in the “real life” industry.

Things have started to change and in my opinion this is greatly due to the multi-core / huge datacenter / cloud computing shift in the software industry. The software industry has come to realize a few key points:

  • One core cannot handle the load, no matter how strong and fast that core is. In the old days of IBM’s Deep Blue and the super-computer age there was hope that, with the advance of the hardware industry, cores would get infinitely stronger and every other day there would be another contestant for the fastest/strongest/most capable core of the day. However, in the last 10 or so years we have come to realize that hardware has its limitations, and clock-rate is one of them. Cores, at least as far as we can tell today, cannot grow infinitely stronger and we need a different solution. The solution is multi-core CPUs. The challenge to the software industry is taking proper advantage of the multiple cores. It used to be easy when programming for a single core – you only had to come up with an O(good) algorithm and leave the real-time issues in the hands of the hardware guys. But with multiple cores, programmers (and not the hardware guys) need to take responsibility for utilizing the cores; and that’s hard. Increasing CPU clock-rate is just not going to do it; we’ve decided to go for the multi-core solution.
  • Cloud computing. Cloud computing is here to stay. I won’t talk about the user-facing added value of cloud computing, you can find plenty of that in other media, but what I will talk about is the new challenges it creates for programmers. There are quite a few new challenges, most of them about scale, such as scaling your database, scaling your user sessions etc., but one of the most significant challenges is scaling your algorithm by parallelizing it. When dealing with multiple cores the way to scale your algorithm is to make it run in parallel – and this is hard; this is truly hard. One of the challenges in parallelizing an algorithm is synchronizing effectively over state. Java and other modern programming languages have built-in constructs to assist in program synchronization, such as the synchronized block. This gives you the possibility to take advantage of multiple threads running on multiple cores of the same CPU, but it has two downsides – one, it doesn’t let you take advantage of a multi-CPU datacenter, and two, it’s very hard to program without creating bottlenecks. In many cases what you’d see is over-synchronization, which results in bottlenecks and poor execution performance (not to mention the actual cost of the JVM entering the synchronized block itself); at the end of the day, your program might actually run faster if it were single-threaded. The problem, just to make it clear, is that current imperative languages, such as Java and C++, all keep state. The state is in their variables. The thing is that when you want two or more threads to access the same variable, or the same state (and modify it), you need to synchronize them; getting synchronization right is hard and many times results in bottlenecks.

Just to make that clear, when I say multi-core I actually mean two things: multiple cores on the same CPU, i.e. the same physical machine, as well as multiple CPUs (machines) running in a datacenter.

Here’s where functional languages come to our rescue: they don’t keep state! Pure functional languages only present functions, which are pure computation and never keep state. Even in the not-so-pure functional languages, such as Scala, where the language does keep state, programmers are encouraged not to use it and are given the right constructs to use it less and instead use more pure functions, which do not modify state (and simply return a value). Now, when you don’t keep state, you don’t need to synchronize state (there is always a bit of synchronization needed, but it lets you keep it to a minimum). Functional languages present pure, stateless computation; when computation is stateless it’s easy to run it in parallel on different parts of your data.
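
To see why statelessness matters in practice, here’s a minimal Java sketch (my own illustration, not taken from any of the languages mentioned) of the kind of parallelism that becomes easy when tasks are pure functions of their input – each task reads only its own slice and shares no mutable state, so the only synchronization point is combining the results:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
  public static long sumOfSquares(final int[] data) throws Exception {
    int cores = Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(cores);
    int chunk = (data.length + cores - 1) / cores;  // slice size per core
    List<Future<Long>> partials = new ArrayList<Future<Long>>();
    for (int i = 0; i < data.length; i += chunk) {
      final int from = i;
      final int to = Math.min(i + chunk, data.length);
      // Each task is a pure function of its own slice: no shared mutable
      // state, hence no locks.
      partials.add(pool.submit(new Callable<Long>() {
        public Long call() {
          long sum = 0;
          for (int j = from; j < to; j++) {
            sum += (long) data[j] * data[j];
          }
          return sum;
        }
      }));
    }
    long total = 0;
    for (Future<Long> f : partials) {
      total += f.get();  // the only synchronization point: combining results
    }
    pool.shutdown();
    return total;
  }
}

Compare this with a version where all threads increment a single shared accumulator: that accumulator would need a lock (or an atomic), and that lock is exactly the kind of bottleneck described above.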

Now, I’m not saying it’s impossible to run parallel computations in imperative languages, that’s obviously not true; what I am saying is that it’s hard, very hard. In functional languages it’s easier.

It’s no surprise, therefore, that recently several functional (or semi-functional) languages have risen, such as Scala and Erlang, as well as functional-like programming models for imperative languages, such as Google’s Map-Reduce.

Now let’s put things in a historical context. Imperative programming languages such as C++ and Java are what currently drives the software industry. I think this is about to change. Imperative and functional languages have co-existed for a long time, but the imperative family has had the lead for the past 30 years or so – and why is that? Performance, that’s why! On single cores there’s nothing like good old C to run a fast program. You may agree or disagree on how “pretty” the program is, but you cannot disagree that performance-wise the imperatives win, and performance is what counts; it allows for superb user experience, nice new and complex, computation-intensive features etc.

But history moves fast and things have started to change. Today cloud computing, with its massive amounts of data and computation, is becoming more and more mainstream. It used to be the case that programs were written for client-side installation, and clients used to be single-core with growing clock-rate. Now two things have happened simultaneously – one, increasing CPU clock-rate has hit its limits, so the hardware industry is moving to multiple cores; and two, as networks become faster and more available, cloud computing has risen and presents us with new challenges of massive amounts of data and massive numbers of users. With these two in place, and with the realization that it’s very hard to use imperative languages for writing parallel programs, the software industry has started its shift towards functional languages.

In the future increased parallelism, rather than clock rate, will be the driving force in computing and for this task functional languages are in the best position to take the lead.



Beware of the Singleton

The Singleton design pattern is well known and used among programmers. It is so easy to use that unfortunately it often gets misused.


In Java a singleton usually looks like this:

 public class Singleton {
   private static final Singleton INSTANCE = new Singleton();
   // Private constructor prevents instantiation from other classes
   private Singleton() {}
   public static Singleton getInstance() {
      return INSTANCE;
   }
 }

A singleton is used, as its name implies, to make sure that only one instance of the class exists in the application. For example “the database singleton” or “the universe”. Many applications define their domain such that there are single objects of various kinds, and the singleton design pattern programmatically enforces that. Very cool, very useful.

But there’s another side to that story. Singletons are great at making sure there is only one instance of a class, but as a side effect they also make it very easy to access that object from anywhere. If you need access to the database singleton, simply type Database.getInstance(). It’s so easy that it gets misused!

We’ve all learned C and we all know that C global variables are bad. C global variables are bad because they prevent encapsulation. When programming with C global variables it’s very hard to determine the environment or the context of the currently executing code, because this code depends on several globals that you’re not aware of and that could change its behavior. Suppose you want to call a function and its documentation says something like “before calling this function make sure to set gNum to 5, and depending on the value of gVersion the function will do this and that…” – that is, if you’re lucky enough to have documentation to read… in some cases you have no docs at all and you’re left to either read the code or guess what global vars it uses. If I were to read such code I’d do everything in my power not to use it.

So guess what? Singletons == C Global variables. They are easily accessed from anywhere in the code and so easy to define and use that they get misused just like poor global vars in C do.

But there’s another reason why Singletons are bad. They make unit-testing very hard; in some cases even impossible to do proper unit-testing.

This is a key point. Suppose you have an application that uses a database and you want to unit-test it.

class DataBean {
 public String getValue() {
  return Database.getInstance().getValue(); // Access through the Database singleton
 }
}
 
...
@Test
public void testGetValue() {
 DataBean bean = new DataBean();
 assertEquals("5", bean.getValue());
}

When running a unit-test you don’t want to actually connect to a real database! You really don’t want to do that! There are several reasons why; just to name a few: you want fast execution, you don’t want to test the database, you only want to test the DataBean class, you don’t want to have to prepare and clean up the database with every test you run, you don’t want other people’s executing code to mess up your database, you don’t want failing tests to leave your database in an undefined state, etc.

Using singletons is exceptionally bad for unit-testing. As a side note, when writing integration tests (or system tests) you do want to test all system components, not just single elements, so in that case you do want to use a real database; however, unit-tests are far more important and effective, so you should start with them and test single elements only.

So what’s the alternative, then? Dependency Injection. Make your data bean depend on the database and accept a database in its constructor (or in a setter, for that matter). A related design pattern is Program to Interfaces, not to Implementations.

class DataBean {
 private DatabaseInterface database; // A reference to a DB interface
 public DataBean(DatabaseInterface d) {
  database = d; // Store the DB
 }
 public String getValue() {
  return database.getValue();
 }
}
 
...
@Test
public void testGetValue() {
 // Use a mock DB implementation for testing
 DataBean bean = new DataBean(new MockDatabaseImplementation());
 assertEquals("5", bean.getValue());
}
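
For completeness, the mock referenced above could be as simple as this (a sketch, assuming DatabaseInterface declares a single getValue() method):

// A hand-rolled mock: implements the same interface, returns a canned value,
// and never touches a real database.
class MockDatabaseImplementation implements DatabaseInterface {
 public String getValue() {
  return "5";
 }
}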

Conclusion: Singletons are effective at making sure there’s only one of them in the application. They are hazardous because they make it too easy to fall into the global-variables anti-design-pattern. Beware of them!