The Three Axes of Software Management Complexity

The title sounds bombastic but hopefully the content is going to be trivial… For anyone who’s been doing software provisioning in a non-trivial environment, this post should only be putting names to things they already know.

Yesterday I was chatting with Nati Shalom and we were analyzing the difficulties of maintaining software for non-trivial web services, backend services or anything along these lines. We came to the conclusion that the various difficulties fall into three categories, which we named the Vertical, the Horizontal and the Depth axes (I’m still looking for a better name to replace Depth).

In most cases, installing a single backend service, a frontend service or any random software component really (it doesn’t have to be a service) isn’t too hard. For example, think of installing a single instance of MySQL, Cassandra or Apache Tomcat. Those things are usually not terribly hard to accomplish; there are useful installation scripts and good tutorials. It may get a little more complex when the component being installed depends on some other component, but we’ll get to that later.

Things start to get complicated when you want to go in one of the following directions:

  • Install a full stack, not just a single service, for example a LAMP stack. (LAMP itself is common enough that there’s abundant information about it on the net, but in general, wherever a stack of different technologies needs to speak to each other, things start to get complex.) We named the stack the Vertical Axis.
  • Create a cluster of a component, for example a cluster of MySQL or a cluster of memcached. We called the cluster case the Horizontal Axis.
  • Go in depth into a running process, for example to analyze the performance of MySQL on a single node. This is the Depth Axis.

Let us elaborate on all three axes.

The Vertical Axis – building a stack

In most real-world cases a service or a component is part of a larger stack. Some common software stacks I’ve worked with in the past include LAMP (Linux, Apache, MySQL, PHP), Linux-Apache-Tomcat-PostgreSQL, Linux-Tomcat-MySQL-Memcached, Rails-MySQL-Nginx and Jetty-Grails-HSQLDB. Of course those are the simple examples; it usually gets much more complicated as the service matures.

Installing, configuring or maintaining any one of these components isn’t terribly hard, but when you need to get Tomcat talking to MySQL or memcached, or Apache talking to Tomcat, it gets just a little bit more complex. Of course this isn’t the end of the world, we’ve all been doing this for years; I’m just pointing out the extra level of work that needs to get done in order to build a stack. For example, Apache, when serving in front of Tomcat, needs to know which port Tomcat listens on in order to forward requests to the correct port. An app running inside Tomcat needs to know which port MySQL is using, which port memcached is using and so on. If any component in the stack gets upgraded, you’ll need to verify correctness once again, not only for the upgraded component but for the entire stack.

So to sum up the vertical axis – think of a single host with a stack of services and a graph of dependencies between them, for example Apache depends on Tomcat, which depends on MySQL and memcached.
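The dependency graph above can be written down explicitly, which makes two of the chores mentioned here mechanical: finding a safe start order, and finding what needs re-verifying after an upgrade. This is only a minimal sketch in Python (3.9+ for graphlib); the service names come from the example in the text, while the depends_on mapping and the affected_by helper are names made up for illustration.

```python
from graphlib import TopologicalSorter

# Each service maps to the services it depends on (the example from the text).
depends_on = {
    "apache": {"tomcat"},
    "tomcat": {"mysql", "memcached"},
    "mysql": set(),
    "memcached": set(),
}

# static_order() yields dependencies before the services that need them,
# so this is an order in which the stack can be started.
start_order = list(TopologicalSorter(depends_on).static_order())


def affected_by(upgraded, graph):
    """Hypothetical helper: services that depend, directly or transitively,
    on `upgraded` and therefore need to be re-verified after it changes."""
    affected = set()
    frontier = [upgraded]
    while frontier:
        current = frontier.pop()
        for service, deps in graph.items():
            if current in deps and service not in affected:
                affected.add(service)
                frontier.append(service)
    return affected


print("start order:", start_order)
print("re-verify after a mysql upgrade:", sorted(affected_by("mysql", depends_on)))
```

Upgrading MySQL here flags both Tomcat and Apache, which is the “verify the entire stack” point in miniature.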

The Horizontal Axis – building a cluster

Now think of a single component, for example a MySQL server, which you want to multiply and make a cluster of. For a database, one common way of doing this is sharding. Now things start to get really complicated. Although many shops employ database sharding, it remains difficult to execute, hard to maintain and has a number of limitations. Some databases, such as Cassandra, which I’m a big fan of, make this job much easier, but they have other limitations and a learning curve.

A few more examples of what may be considered horizontal axis expansion, and the challenges related to it:

  • Scaling a memcached cloud from one instance to two, or from two to three, or from any N to N+1. Memcached is a simple cache server with a simple protocol, and part of that protocol is that the client needs to know exactly which server holds the key it is looking for. A client cannot simply connect to any member of the memcached cloud, ask it for a value and hope this member will figure out which other member owns it. Memcached doesn’t work like this; it expects the clients to know which server holds which key. As a result, when a host is added to the cluster (or removed from it), clients usually need to re-hash their keys, which leads either to a) more complexity at the app level in the clients or b) simply clearing the whole cache and starting afresh (which is unacceptable in many cases).
  • Scaling a web frontend such as Tomcat from one instance to many. If your web app uses sessions managed by Tomcat, or any other kind of local state, then you’ll need to get that state synced to all the members of the cluster. Tomcat does support this out of the box, but it’s limited in scale. A different solution, or one that may be combined with it, is to handle session stickiness in a front-facing proxy, which isn’t trivial either and is also limited in scale. What people usually end up doing is giving up on sessions and on any kind of state management in their frontends altogether, just so that when the day comes and they need to scale horizontally, they will be able to do so easily. Heck, even Google’s App Engine has that as a design guideline – no state whatsoever; if you need anything, it’s in the persistence or caching layer.
To sum up the horizontal axis, think of a single service that you want to multiply in order to gain capacity (or failover or what have you).
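The memcached re-hashing problem described above is easy to see in a few lines. The sketch below is an illustration under my own assumptions, not how any particular memcached client works: it compares naive “hash modulo server count” placement with a simple consistent-hash ring when a fourth server is added. The server names, the key names and the HashRing class are all made up for the example.

```python
import bisect
import hashlib


def stable_hash(text: str) -> int:
    """Hash that is stable across runs (unlike the built-in hash())."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


def modulo_owner(key: str, servers: list) -> str:
    """Naive client-side placement: hash the key modulo the server count."""
    return servers[stable_hash(key) % len(servers)]


class HashRing:
    """Illustrative consistent-hash ring; each server gets many ring points."""

    def __init__(self, servers, points_per_server=100):
        self._ring = sorted(
            (stable_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(points_per_server)
        )
        self._hashes = [h for h, _ in self._ring]

    def owner(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._hashes, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(before, after, keys) -> float:
    """Fraction of keys whose owning server differs between two placements."""
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


keys = [f"user:{n}" for n in range(10_000)]  # made-up keys
old_servers = ["cache-a", "cache-b", "cache-c"]
new_servers = old_servers + ["cache-d"]

modulo_moved = moved_fraction(
    lambda k: modulo_owner(k, old_servers),
    lambda k: modulo_owner(k, new_servers),
    keys,
)
old_ring, new_ring = HashRing(old_servers), HashRing(new_servers)
ring_moved = moved_fraction(old_ring.owner, new_ring.owner, keys)

print(f"modulo hashing:     {modulo_moved:.0%} of keys change server")
print(f"consistent hashing: {ring_moved:.0%} of keys change server")
```

With modulo placement most keys land on a different server after the change, which in practice is close to clearing the cache. With the ring, only the keys taken over by the new server move, at the price of the extra client-side complexity mentioned in option a).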

The Depth Axis – really understanding what the service does…

The third axis is the Depth – really understanding what the process does, “how does the process feel”.

Is your database overloaded? Is it close to being overloaded? If it’s slow, then why is it slow? If it’s inconsistently slow, then why does it misbehave under certain conditions?…

Being the service doctor for even a single service on a single node, such as MySQL, Tomcat or Cassandra, can be a challenging task. It’s like watching the shadows on the wall and trying to figure out what creature is casting them. You’ll have to use your detective skills: monitoring the performance data of the service, possibly of other services that run on the same host, of the file system and of the memory, and fiddling with different parameters of the OS and of the service until you’ve finally nailed it.
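As a small taste of that detective work, here is a sketch of telling “slow” apart from “inconsistently slow” by comparing the median of a set of latency samples with the 99th percentile. The numbers are made up purely for illustration and are not measurements of any real service; latency_summary is a hypothetical helper built on Python’s standard statistics module.

```python
import statistics


def latency_summary(samples_ms):
    """Return the mean, median (p50) and 99th percentile of latency samples."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],  # 50th percentile
        "p99": cuts[98],  # 99th percentile
    }


# Made-up samples: most queries are fast, a couple are very slow.
samples_ms = [5.0] * 98 + [900.0] * 2

summary = latency_summary(samples_ms)
print(summary)
```

A mean on its own would suggest a uniformly sluggish service. A low median next to a very high p99 points instead at a few pathological requests, which is a different investigation.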

All three axes – the vertical, the horizontal and the depth – are complex on their own. The reality is that in most real-life scenarios you’d have all three of them at the same time.

The Time Axis – software erosion

And yet, there’s one more – the time axis. Surprising as it may sound, it’s not only rocks that erode; software does too.

If you’ve written a successful program, say a web site, you shouldn’t be surprised if after two years it suddenly stops working. There could be many reasons for that. The log directory filled up because old logs were never deleted. A third-party API you used has changed. Your OS got old, vulnerabilities were found and exploited, and since you never took the time to patch it you got hacked; or you did patch it, and now your program doesn’t work because of the patch. There really are countless reasons why software rots, decays and erodes, so today no one is really surprised to find that a two-year-old service that hasn’t been taken care of no longer functions the way it did when it was first launched.

So the time axis adds yet another layer of complexity. Even if your service is stagnant (you don’t gain more users, don’t care to add features and don’t bother to maintain it), there’s a good chance that it will eventually stop working because of rot.
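One of the erosion causes mentioned above, a log directory quietly filling up, is cheap to watch for. The following is a minimal sketch using Python’s standard shutil.disk_usage; the function names and the 10% threshold are assumptions chosen for illustration, and a real setup would hook something like this into whatever monitoring is already in place.

```python
import shutil


def free_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # guard against odd filesystems reporting no size
        return 0.0
    return usage.free / usage.total


def is_filling_up(path: str, min_free: float = 0.10) -> bool:
    """True when less than `min_free` of the disk is free (threshold is arbitrary)."""
    return free_fraction(path) < min_free


# "." stands in for the log directory of a hypothetical service.
print(f"free: {free_fraction('.'):.0%}, filling up: {is_filling_up('.')}")
```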

That’s it, that was the observation I wanted to share – these are the three plus one axes that make software maintenance, management and development a whole lot more complex in real-life scenarios than in the neon-light lab scenario.
