The joy of deleting code

What could be more fun than writing a shiny new, super-functional, super-tested piece of code? Deleting it!
When you delete code, you know that:

  • You have not introduced new bugs. Perhaps you deleted some potential bugs from the old code but chances are you did not introduce new ones.
  • You don’t have to maintain it. It’s deleted.
  • The code was probably poorly written. Good code is never deleted. In many cases there’s poorly written code that you just don’t have the guts to delete. Now you did, and that’s great.
  • You’ve probably found a good way to reuse another piece of code, which is why you’re deleting this one. Code reuse is good.
  • Or you’ve removed a feature from your product. Removing features is good, very good. “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”
  • Or you’ve found a more compact and elegant way to do what you want to do.

Bottom line: Deleting code makes me happy. How about you?


Mavenizing our code base

Maven is a build tool for Java. It’s more than that actually, but let’s just call it a build tool.

At Outbrain we decided to replace good old ant with maven.

Changing the build tool used across the whole company is not a decision to take lightly, and it may affect the product release cycle. But we weighed our options and decided to go for it, so I thought our endeavor was worth writing up.

Everyone familiar with Java programming has probably used ant or at least heard of it. For many years it has been the de facto standard build tool, with a large and growing audience, numerous plugins, excellent documentation and IDE support (for example, most Java IDEs can automatically generate ant build files). But ant has its shortcomings, which we at Outbrain decided we just couldn’t live with. We found that maven fills most of the gaps.

How do ant and maven differ?

Maven is newer and was built from scratch with many of the lessons learned from ant in mind. Both projects are written and maintained by the high-quality, high-standard Apache Software Foundation, home of many other wonderful open source products. Both ant and maven are still actively developed and maintained, so it would not be fair to say that maven replaces ant, though many developers (myself included) tend to think so.

Ant and maven differ in many ways, but to me these are the winning points that actually make the difference and made me choose maven:

Maven is declarative. Ant is imperative.

Here’s what an ant build file looks like for a simple java project:

<project name="MyProject" default="dist" basedir=".">
  <description>
    simple example build file
  </description>
  <!-- set global properties for this build -->
  <property name="src" location="src"/>
  <property name="build" location="build"/>
  <property name="dist"  location="dist"/>
 
  <target name="init">
    <!-- Create the time stamp -->
    <tstamp/>
    <!-- Create the build directory structure used by compile -->
    <mkdir dir="${build}"/>
  </target>
 
  <target name="compile" depends="init"
    description="compile the source " >
    <!-- Compile the java code from ${src} into ${build} -->
    <javac srcdir="${src}" destdir="${build}"/>
  </target>
 
  <target name="dist" depends="compile"
    description="generate the distribution" >
    <!-- Create the distribution directory -->
    <mkdir dir="${dist}/lib"/>
 
    <!-- Put everything in ${build} into the MyProject-${DSTAMP}.jar file -->
    <jar jarfile="${dist}/lib/MyProject-${DSTAMP}.jar" basedir="${build}"/>
  </target>
 
  <target name="clean"
    description="clean up" >
    <!-- Delete the ${build} and ${dist} directory trees -->
    <delete dir="${build}"/>
    <delete dir="${dist}"/>
  </target>
</project>

And here’s the maven one:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <!-- groupId and version are required by maven; the values here are placeholders -->
  <groupId>com.example</groupId>
  <artifactId>MyProject</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>MyProject</name>
  <description>simple example build file</description>
  <packaging>jar</packaging>
</project>

Ant line count: 22 (not including spaces, comments)

Maven line count: 9 (and they are really easy ones)

Short is good, especially in software (except perl ;-) ). But beyond the fact that mvn has shorter build files, there is something more important hidden here: declarative vs. imperative.

With ant you have to tell it HOW to build. You have to tell it that it first needs to collect all java sources, then run the javac compiler on them, then collect all classes, then run the jar tool on them to make a jar. That’s tedious, especially if you have to do it 30 times, once for each and every project. At Outbrain we have many projects, so when we used ant you’d see the exact same ant code patterns again and again (including the mistakes, which survived the copy-paste horror). Ant build files are hard to maintain and also hard to write. Have any of you ever written an ant build file from scratch? I doubt it. I’ve been using ant for more than 5 years and I have never written one from scratch. It has always been either copy-paste or the IDE’s automatic generation. That is a bad sign of an overly verbose language.

Maven is declarative. You don’t have to tell it HOW to build, only WHAT to build, so conceptually it’s a higher-level build tool. With mvn you only have to say “Look, this is what my project is called and I want you to make a jar from it”. That’s all; mvn will figure out the rest. It knows where to find the sources (convention) and it knows what steps it needs to take in order to create a jar. It will compile your source files, run the tests, and package the classes and resources into the jar for you. You can intervene in this process, but you don’t have to. You can make jars, wars, ears and more.

Declarative is in many cases preferred over imperative. Think HTML vs. Java. HTML is declarative, Java is imperative. In HTML you say <b>bold</b> which tells the browser you want the text to be bold. You don’t tell it how to make the text bold (e.g. how many pixels, what position etc) only that you want it bold and let the browser figure out how to handle it. In Java you’d have to tell it how to make the text bold, how to space the characters, how to space words around it, how to break lines etc. Declarative is in many cases a lot easier than imperative, you worry less.

Maven is declarative, you only have to say “this is my project, jar it”. With ant you have more control over what the build tool does, so you can go crazy with build scripts and… well… jar before compile, or clean after jar (instead of before it) or package the test code inside production code or whatever, you get the point, you have the freedom to err. 9 times out of 10 you don’t need the level of flexibility provided by ant and you’d be much safer using mvn.

Dependency management.

This is a big thing. That was actually the main reason I wanted to move out of ant. Maven has a wonderful dependency management system built in by default. Ant has nothing built in, although it has ivy as a plugin.

What is dependency management? There are external and internal dependencies. External dependencies are ones you download from the net, usually open source projects such as Lucene or ActiveMQ. With maven you only have to declare your dependency on them and they get downloaded automatically. Example:

<dependency>
  <groupId>struts</groupId>
  <artifactId>struts-bean</artifactId>
  <version>1.2.8</version>
</dependency>

With ant you basically have two options. One is to download the library yourself, throw it in some folder, call it 3rdParty and add it to the classpath (good luck keeping track of versions, of who’s using what, and of your life). The other is to use ivy, which is pretty decent but, as mentioned before, not part of the default ant installation.

As for internal dependencies, meaning one of your projects depends on (uses) another one of your projects, mvn supports that as well. AFAIK ant does not. With ant, if you have more than one project in your company (and of course you do), you’d have to manually tweak the build scripts so they run in the correct order and dependencies are compiled before they are used. Although it’s possible, it becomes sort of nightmarish as the company grows.

Dependency management is a killer feature of mvn and was actually the main reason that prompted me to pursue it. Although ant can have ivy, that was not zero-work either, so I decided: heck, if we’re going to put some work into it, let’s rebuild the whole thing and get a much better result. So we did.

Conventions vs. configuration

Conventions are good. They save a lot of time and prevent you from making foolish mistakes (such as packaging test code into production).

By conventions I mean:

  • Where is the source code?
  • Where is the test code?
  • Where are the resources?
  • Where are the web files?
  • etc

With ant you had to create a directory for sources (call it src or Src or source or srce) and tell ant where your sources are. Then you’d have to decide where to put your tests. You could push them into tst, test, tests, or even into the same source directory as the production code, maybe in a test package. Next you’d have to tell ant where to find the test code, how to separate it from production code, and heck – how to run it (it really doesn’t know how to do it).

With mvn that’s much easier. Maven promotes conventions such as the standard directory layout, which means all sources are at a predefined location, all resources are as well, etc. There are several advantages to that: it’s easy to start a new project, you don’t have to think about where everything goes, it’s hard to make mistakes by misplacing items, you don’t have to think about the build script and how to configure it and, perhaps most important of all, you make all company employees conform to the same layout. That’s a huge gain for the company.

Built-in functionality out of the box

OK, there are plenty of other benefits to mvn but the post is already getting too long so let’s have the last one here.

With mvn you get tons of functionality out of the box. By creating a very simple build file with only the project definition in it you can:

  • compile all sources
  • compile tests
  • run unit tests
  • run integration tests
  • package as a jar/war or something else
  • deploy
  • run in tomcat
  • …and much more

With ant, for each one of the goals listed above you had to create an ant target and configure it by telling ant how to run it. Some of them may be relatively trivial (but still require coding) and some of them aren’t easy at all… that kind of tells you why anters are very good with their CTRL+C and CTRL+V. And with copy-paste comes the pain of silly copy-paste errors and difficult maintenance. Life with mvn is better ;-)

Other great and not covered features: Eclipse and other IDE integration (automatically generating projects), great testing and debugging tools, excellent build output, excellent versioning and more.

How did it go?

So, how did it go? You guys have converted your entire codebase build tool. Isn’t that like Netscape’s near death experience while rewriting their entire code base?

Well… no! The nice thing about mvn is that it’s easy to write and easy to learn. We did spend a couple of weeks on the task and had to resolve some unexpected situations, but it had only a small impact on our schedule (and, needless to say, I hope that in the long term it will have a very positive impact on our release schedule). Heck, we (at least I) even enjoyed it!

Conclusions

I’d definitely recommend mvn. If you’re starting a new project, choose mvn. If you have an existing codebase using ant and you’re thinking about moving to maven, know that it’s certainly feasible and in my opinion, well worth the effort. Expect some work, it’s not zero effort, but I promise you’ll enjoy it.


Experimenting with Seam Carving

Seam Carving is a technique for smart image resizing developed by Shai Avidan and Ariel Shamir. It’s cool. It’s really cool actually! It lets you resize an image without losing important information. For example, if there’s a face in the photo and a background, the background will shrink while the face will maintain its size. There are plenty of examples on the web and some very cool videos, so let’s have a taste of them first.

There is also a large number of implementations in many different languages, including Java, C++, JavaScript and more.

Yesterday I wanted to integrate this cool technology into one of the Outbrain features I’m working on, so here’s what I’ve learned:

First of all, it’s really cool, did I mention this? :) .

I used a Java implementation by Mathias Lux (thanks, Mathias). In general, this is very nice work and I only had to fix a few bugs to get it going ;) .

When I first started using it I noticed two problems:

1. It’s slow. And I mean painfully slow! Nothing that production code can live with. An average photo would get resized in about 30-60 seconds. No go, no good, no no no.

2. It doesn’t always do the right thing… I mean sometimes the result of the resize is sort of lame… Let’s have a look at some examples.

Before:

Before carving

After:

After carving

Hmm, that’s not so good, right?… See how that poor man’s face is distorted?

Here’s another one. Before:

Before Carving

And after carving. Notice the building roof… it seems to have endured a little earthquake…

After carving

So, you see, I had a problem… The algorithm was correct, no problem with that, but perhaps this was not the best solution for what I was trying to achieve… What was my goal then? My goal was to have all images resized to the same size (in this case 178×100) without having to letterbox or crop them. I don’t mind so much if a face in the photo gets smaller; that’s perfectly OK by me, as long as the face doesn’t get cut in the middle.

My solution:

Here’s what I did – Instead of using the seam carver right from the beginning, first I resized the image using a simple linear resize operation such that the photo would exactly match one of either the width or the height of the target photo, and only then did I run the seam carver.

For example, if the original size of the image was 300×300 and I wanted the target size to be 100×50, I first resized it to 100×100 (so the image fills the entire width and overflows the height) and only then used the seam carver to reduce its height.
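
The size calculation for that first linear resize can be sketched in a few lines of Java. This is only a sketch under assumptions: the class and method names (`PreResize`, `fitOneDimension`) are hypothetical, and it computes just the intermediate size, leaving the actual scaling and the seam carver call out.

```java
// Hypothetical helper: computes the intermediate size for the linear resize
// that runs before seam carving. Names are illustrative, not from any library.
class PreResize {
    /**
     * Returns {width, height} after scaling the source so that it matches the
     * target exactly in one dimension and overflows (or matches) the other.
     */
    static int[] fitOneDimension(int srcW, int srcH, int dstW, int dstH) {
        // Take the larger of the two scale factors so that neither dimension
        // ends up smaller than the target.
        double scale = Math.max((double) dstW / srcW, (double) dstH / srcH);
        int w = (int) Math.round(srcW * scale);
        int h = (int) Math.round(srcH * scale);
        return new int[] { w, h };
    }

    public static void main(String[] args) {
        int[] size = fitOneDimension(300, 300, 100, 50);
        System.out.println(size[0] + "x" + size[1]); // 100x100
    }
}
```

The seam carver then only has to remove the overflow in a single dimension (here, 50 rows), which is far less work than carving the full-size image in both directions.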

The results were amazingly good! Processing time dropped to about 500ms per photo (I can still do better, but this is already a big improvement and can go to production) and, most importantly, the photos look good now. Compare the following two resized and carved photos with the initial results.

Improved resized

Improved resized

That did the trick, nice :) .


Why functional languages rock with multi-core

Cores are cheaper nowadays. Almost all new computers ship with 2 or more cores. Datacenter computers usually ship with more – 4, 8, 32… Once again the hardware industry has left the software industry behind; if only Moore’s law worked the same way for software…

But there is hope!

Functional programming languages used to be a niche for a small group of programming language advocates or emacs fanatics ;-) I, for one, studied functional languages such as Lisp and ML at school but never thought I’d use them in “real life” industry.

Things have started to change, and in my opinion this is largely due to the multi-core / huge datacenter / cloud computing shift in the software industry. The software industry has come to realize a few key points:

  • One core cannot handle the load, no matter how strong and fast that core is. In the old days of IBM Deep Blue and the supercomputer age there was hope that, as the hardware industry advanced, cores would get infinitely stronger, and every other day there would be another contestant for the fastest/strongest/most capable core of the day. However, in the last 10 or so years we have come to realize that hardware has its limitations and clock rate is one of them. Cores, at least as far as we can tell today, cannot grow infinitely and we need a different solution. The solution is multi-core CPUs. The challenge for the software industry is taking proper advantage of those cores. It used to be easy when programming for a single core: you only had to come up with an O(good) algorithm and leave the real-time issues in the hands of the hardware guys. But with multiple cores, programmers (and not the hardware guys) need to take responsibility for utilizing the cores, and that’s hard. Increasing the CPU clock rate is just not going to cut it, so the industry has gone for the multi-core solution.
  • Cloud computing. Cloud computing is here to stay. I won’t talk about the added value of cloud computing for users; you can find plenty of that in other media. What I will talk about is the new challenges it creates for programmers. There are quite a few, and most of them are about scale, such as scaling your database, scaling your user sessions etc. But one of the most significant challenges is scaling your algorithm by parallelizing it. When dealing with multiple cores, the way to scale your algorithm is to make it run in parallel, and this is hard; this is truly hard. One of the challenges in parallelizing an algorithm is synchronizing effectively over state. Java and other modern programming languages have built-in constructs to assist with synchronization, such as the synchronized block. This lets you take advantage of multiple threads running on multiple cores of the same CPU, but it has two downsides. One is that it doesn’t yet let you take advantage of a multi-CPU datacenter. Two is that it’s very hard to program without creating bottlenecks. In many cases what you’d see is over-synchronization, which results in bottlenecks and poor performance (not to mention the actual cost of the JVM entering the synchronized block itself), and at the end of the day your program might actually run faster if it were single threaded. The problem, just to make it clear, is that current imperative languages, such as Java and C++, all keep state. The state is in their variables. When you want two or more threads to access the same variable, or the same state (and modify it), you need to synchronize them. Creating correct synchronization is hard and many times results in bottlenecks.

Just to make that clear, when I say multi-core I actually mean two things: multiple cores on the same CPU, i.e. the same physical machine, as well as multiple CPUs (machines) running in a datacenter.

Here’s where functional languages come to our rescue: they don’t keep state! Pure functional languages only present functions, which are pure computation and never keep state. Even in the not-so-pure functional languages, such as Scala, where the language does keep state, programmers are encouraged not to use it and are given the right constructs to use it less and to use pure functions, which do not modify state (and simply return a value), instead. Now, when you don’t keep state, you don’t need to synchronize state (there is always a bit of synchronization needed, but functional style lets you keep it to a minimum). Functional languages present pure computation, a stateless computation; when a computation is stateless it’s easy to run it in parallel on different parts of your data.
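
A minimal sketch of that idea, written with Java 8 streams rather than a functional language. The class and method names (`PureParallel`, `square`, `sumOfSquares`) are made up for the example; the point is only that a pure function can be applied to different parts of the data in parallel without a single synchronized block.

```java
import java.util.stream.LongStream;

// Illustrative sketch: a pure function applied in parallel, with no shared
// mutable state and therefore nothing to synchronize in user code.
class PureParallel {
    // Pure: the result depends only on the argument and nothing is modified.
    static long square(long x) {
        return x * x;
    }

    static long sumOfSquares(long n) {
        return LongStream.rangeClosed(1, n)
                .parallel()                 // the runtime may split the range across cores
                .map(PureParallel::square)  // each element is computed independently
                .sum();                     // partial sums are combined at the end
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1000)); // 333833500
    }
}
```

Because `square` touches no shared variable, the result is the same whether the stream runs on one core or many.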

Now, I’m not saying it’s impossible to run parallel computations in imperative languages; that’s obviously not true. What I am saying is that it’s hard, it’s very hard. In functional languages it’s easier.

It’s no surprise, therefore, that several functional (or semi-functional) languages, such as Scala and Erlang, have recently risen, as well as functional-like programming models for imperative languages, such as Google’s MapReduce.

Now let’s put things in a historical context. Imperative programming languages such as C++ and Java are what currently drives the software industry. I think this is about to change. Imperative and functional languages have co-existed for a long time, but the imperative family has had the lead for the past 30 years or so. Why is that? Performance, that’s why! On a single core there’s nothing like good old C for running a fast program. You may agree or disagree on how “pretty” the program is, but you cannot disagree that performance-wise the imperatives win, and performance is what counts: it allows for a superb user experience, nice new complex computation-intensive features etc. But history is moving fast and things have started to change. What’s becoming more and more mainstream today is cloud computing, with its massive amounts of data and computation. It used to be the case that programs were written for client-side installation, and clients used to be single-core with a growing clock rate. Now two things have happened simultaneously. One is that increasing the CPU clock rate has its limits, so the hardware industry is moving towards multiple cores. Two is that, as networks have become faster and more available, cloud computing has risen and presents us with new challenges of massive amounts of data and massive numbers of users. With these two in place, and with the realization that it’s very hard to use imperative languages for writing parallel programs, the software industry has started its shift towards functional languages.

In the future increased parallelism, rather than clock rate, will be the driving force in computing and for this task functional languages are in the best position to take the lead.


Beware of the Singleton

The Singleton design pattern is well known and used among programmers. It is so easy to use that unfortunately it often gets misused.

In Java a singleton usually looks like this:

 public class Singleton {
   private static final Singleton INSTANCE = new Singleton();
   // Private constructor prevents instantiation from other classes
   private Singleton() {}
   public static Singleton getInstance() {
      return INSTANCE;
   }
 }

A singleton is used, as its name implies, to make sure that only one instance of the class exists in the application. For example “the database singleton” or “the universe”. Many applications define their domain such that there is a single object of various kinds, and the singleton design pattern programmatically enforces that. Very cool, very useful.

But there’s another side to that story. Singletons are great at making sure there is only one instance of a class, but as a side effect they also make it very easy to access that object from anywhere. If you need access to the database singleton simply type Database.getInstance(). It’s so easy that it gets misused!

We’ve all learned C and we all know that C global variables are bad. They are bad because they break encapsulation. When programming against C global variables it’s very hard to determine the environment or the context of the currently executing code, because this code depends on several globals that you’re not aware of and that could change its behavior. Suppose you want to call a function and its documentation says something like “before calling this function make sure to set gNum to 5, and depending on the value of gVersion the function will do this and that…” – that is, if you’re lucky you’ll have documentation to read… in some cases you have no docs at all and you’re left to either read the code or guess what global vars it uses. If I were to read such code I’d do everything in my power not to use it.

So guess what? Singletons == C Global variables. They are easily accessed from anywhere in the code and so easy to define and use that they get misused just like poor global vars in C do.

But there’s another reason why Singletons are bad. They make unit-testing very hard; in some cases they even make proper unit-testing impossible.

This is a key point. Suppose you have an application that uses a database and you want to unit-test it.

class DataBean {
 public String getValue() {
  return Database.getInstance().getValue(); // The Database singleton
 }
}
 
...
@Test
public void testGetValue() {
 DataBean bean = new DataBean();
 assertEquals("5", bean.getValue()); // getValue() returns a String
}

When running a unit-test you don’t want to actually connect to a real database! You really don’t want to do that! There are several reasons why, just to name a few, you want fast execution, you don’t want to test the database, you only want to test the DataBean class, you don’t want to have to prepare and clean up the database with every test you run, you don’t want other ppl executing code to mess up your database, you don’t want failing tests to leave your database in an undefined state etc.

Using singletons is exceptionally bad for unit-testing. As a side note, when writing integration tests (or system tests) you do want to test all system components, not just single elements, so in that case you do want to use a real database. However, unit tests are far more important and effective, so you should start with them and test single elements only.

So what’s the alternative then? Dependency Injection. Make your data bean depend on the database and accept a database in its constructor (or in a setter for that matter). A related design pattern is Program to Interfaces, not to implementations.

class DataBean {
 private DatabaseInterface database; // A reference to a DB interface
 public DataBean(DatabaseInterface d) {
  database = d; // Store the DB
 }
 public String getValue() {
  return database.getValue();
 }
}
 
...
@Test
public void testGetValue() {
 // Use a mock DB implementation for testing
 DataBean bean = new DataBean(new MockDatabaseImplementation());
 assertEquals("5", bean.getValue()); // the mock is assumed to return "5"
}
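
The snippet above does not show DatabaseInterface or MockDatabaseImplementation. One way they could look is sketched below, as a self-contained program with no test framework. The single `getValue()` method and the canned value "5" are assumptions made to match the test, and `DataBeanDemo` is a hypothetical driver class.

```java
// Hypothetical driver, so the sketch runs without JUnit.
class DataBeanDemo {
    public static void main(String[] args) {
        // The bean is handed its database; it never reaches for a global.
        DataBean bean = new DataBean(new MockDatabaseImplementation());
        System.out.println(bean.getValue()); // 5
    }
}

// Assumed shape of the interface: only what DataBean needs.
interface DatabaseInterface {
    String getValue();
}

// Test double: returns a canned value and never touches a real database.
class MockDatabaseImplementation implements DatabaseInterface {
    @Override
    public String getValue() {
        return "5";
    }
}

class DataBean {
    private final DatabaseInterface database;

    DataBean(DatabaseInterface database) {
        this.database = database;
    }

    String getValue() {
        return database.getValue();
    }
}
```

A production implementation of the same interface would talk to the real database; DataBean does not need to know which one it was given.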

Conclusion: Singletons are effective at making sure there’s only one instance of a class in the application. They are hazardous because they make it too easy to fall into the global-variables anti-pattern. Beware of them!


Defaults – a convenience or a time bomb?

All programmers in all languages are familiar with the concept of default values.
Many languages allow default parameter values when calling a function; some provide a function overloading mechanism, which is an extension of this idea.

For example in python you can have named parameters with default values:

def multiply(v, mult=2.0):
  return v * mult
 
multiply(5) # returns 10.0
multiply(5, 3.0) # returns 15.0
multiply(5, mult=3.0) # returns 15.0; it's the same as before, only using a named parameter

The concept of default values is found not only in functions but in many other places, such as Java properties files.
Java ships with a nice properties mechanism which lets you quite easily separate program logic from its data. Just create a my-properties.properties file, instantiate a Properties object and read properties from the file.

my-properties.properties

me.prettyprint.my_value=nice!

In Java:

Properties props = new Properties();
URL url = ClassLoader.getSystemResource("my-properties.properties");
props.load(url.openStream());
String myValue = props.getProperty("me.prettyprint.my_value");

The properties mechanism is quite convenient and useful. However, in my opinion they went a bit too far with regard to convenience by adding yet another method, Properties.getProperty(String key, String defaultValue).
Now you can do this:

// If me.prettyprint.my_value doesn't exist, assign "awful" to myValue
String myValue = props.getProperty("me.prettyprint.my_value", "awful");

That’s an example of how default values are more of a time bomb than a convenience. Imagine the following quite typical accidents that happen to programmers daily:

  • Accidentally mistype me.prettyprinl.ny_value in your Java code
  • Accidentally mistype ne.prettyprimt.my_value in the properties file
  • Accidentally mistype the file name ClassLoader.getSystemResource(“my-propetries.properties”)
  • Forget to package the properties file in your jar; or package it in the wrong way.
  • … you get it, right? It’s so easy to make these mistakes that eventually you will; or the next programmer to edit your files will…

The problem is that since there are default values, the defaults are loaded and you have absolutely no clue that something is going wrong here. The compiler won’t help you b/c you’re not making a syntax error. The program may continue to run fine or appear to run fine until…

To make this situation even worse, many programmers (including myself for a long time) set their defaults to the exact same values as the ones in the properties files. So even if you mistype something, everything works well because the default value is loaded. But when you go to your production environment and want to change a property, nothing changes. The property is changed in the file, yes, but it’s not loaded into the program variable because of some silly typo. That is the time bomb!

Defensive programming means: program as if everything could go wrong. Assumption is the mother of all fuck-ups. Assume nothing! While this is somewhat extreme, I tend to agree with that approach. What could go wrong here is programmers’ typos or similar small mistakes. Assume they will happen and protect your code against them. Don’t use Properties.getProperty(key, defaultValue); only use Properties.getProperty(key).
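
One caveat: the single-argument Properties.getProperty(key) returns null for a missing key rather than failing, so a typo can still travel a long way before it surfaces. A small fail-fast wrapper closes that gap. The sketch below is an illustration only: `StrictProperties` and `require` are hypothetical names, not part of java.util.Properties.

```java
import java.util.Properties;

// Illustrative helper (not part of the JDK): fail fast on a missing key
// instead of silently falling back to a default or returning null.
class StrictProperties {
    static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null) {
            throw new IllegalStateException("Missing required property: " + key);
        }
        return value;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("me.prettyprint.my_value", "nice!");

        // Correct key: the configured value is returned.
        System.out.println(require(props, "me.prettyprint.my_value")); // nice!

        // Mistyped key with a default: the typo goes unnoticed.
        System.out.println(props.getProperty("me.prettyprinl.ny_value", "awful")); // awful

        // Mistyped key without a default: the typo surfaces immediately.
        try {
            require(props, "me.prettyprinl.ny_value");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With this in place a mistyped key stops the program at the first lookup, which is exactly when you want to hear about it.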


Yet another visual diff for git

I’m relatively new to git and I’m already in love with it, but there was one thing that bugged me and I couldn’t get a good answer anywhere else, so I wrote my own thing, posting it so maybe you can find it helpful.

The problem: Use a visual diff tool for git; and view all diffs at once.

All other solutions show me how to use a visual diff tool for git but they all have a common weakness – they show only one file at a time, which is a bummer b/c many times what I’d like to do is look at all the changed files and switch between them going back and forth.

So, as mentioned there are already many posted solutions how to set up visual diff in git for various platforms including linux, windows, osx (just Google it), so all I had to do is implement a really small change to one of them.

Step 1: Create a wrapper script in /usr/bin/git-diffmerge-wrapper.sh

#!/bin/sh
# diff is called by git with 7 parameters:
# path old-file old-hex old-mode new-file new-hex new-mode
 
cp "$2" "$2-keepme"
diffmerge "$2-keepme" "$5" &
sleep 5 && rm "$2-keepme" &

Don’t forget to chmod +x /usr/bin/git-diffmerge-wrapper.sh

Step 2: Configure git to use your script

$ git config --global diff.external /usr/bin/git-diffmerge-wrapper.sh

That’s all!

I use OS X, but for linux systems it’s basically the same script. On Windows it might have to change a bit, but I’m not that good at win…

I assume you have diffmerge installed. If not, either use a different visual diff tool or download and install it (free).

Here’s the trick: when git runs a diff it creates temporary files for each index file and deletes them right after the external diff program for that specific file has exited. So what I did is simply copy the file to another file ($2-keepme), run the diff in the background, sleep for 5 seconds to make sure the diff program has read the file, and then delete the copy to clean up.

cp "$2" "$2-keepme"
diffmerge "$2-keepme" "$5" &
sleep 5 && rm "$2-keepme" &


Flash and encryption? No way dude!


I was asked by a fellow worker whether flash can be encrypted. Short answer: no. Long answer below.

But why would you even want to encrypt flash? I asked.

He told me about a product he’s working on, some kind of hook for online games that identifies cheaters and bots in real time as they play, by collecting many signals and looking for some smart patterns. I realized, OK, this guy really needs to hide his top-secret code from hackers: he doesn’t want them to be able to read his code and figure out his secret sauce, plus his code needs to run on the client to be able to collect its signals => he’s in trouble. Flash code just cannot be encrypted, tough luck.

But before getting to that conclusion I researched a bit and found that quite a few companies have already thought about this problem and come out with almost-good-enough products called code obfuscators. It appears that flash developers (me being one) are concerned about their work getting stolen. You work hard making an online game or a video player or mp3 player, put it on your site and baam, someone downloads your swf, runs it through a flash decompiler and has source code access to your hard work; now he can make slight changes, brand it as his own and get your fame. Code obfuscators try to solve this problem by making it difficult for a hacker to reverse engineer, or decompile, the code. Put another way, code obfuscators want to protect your work from being copied. Sadly, they can’t :( . They may do a decent job of making it somewhat harder to reverse engineer the code, but they cannot and never will be able to completely protect your code, not even theoretically, and that is the key point.

It’s important to realize that code on the client simply cannot be encrypted. There are other solutions to the problem, but let’s establish the theory first. When code runs on your computer, and it doesn’t matter whether it’s flash or anything else, the computer needs to understand the code in order to run it; it needs to be able to read it. Now, as long as you have physical access to your computer, and I assume you do, you can hack it no matter how hard it’s encrypted. Let us assume the code is indeed encrypted. At some stage the computer will have to run it, so it will have to decrypt it first. In the flash case it’s the flash bytecode that is run by the flash VM (like Java, flash has its own bytecode and VM). To the best of my knowledge, no CPU can run encrypted code and no flash VM can run encrypted flash bytecode. If the computer is able to decrypt the code, so is a hacker, simply by running the same steps the computer does. If the computer cannot decrypt the code, it won’t run it, so in that sense your code is pretty safe, but at the same time not usable.
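To make the argument concrete, here is a toy sketch in JavaScript rather than flash. Everything in it is made up for illustration: the XOR "cipher" is not real encryption, and `xorText`, `runShipped` and `attackerReads` are hypothetical names. The point is only that whatever the runtime needs in order to decrypt and run the code is also sitting there for the attacker.

```javascript
// Toy "cipher": XOR each character with a repeating key.
// NOT real encryption; a stand-in for any scheme whose key must ship with the code.
function xorText(text, key) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    out += String.fromCharCode(text.charCodeAt(i) ^ key.charCodeAt(i % key.length));
  }
  return out;
}

// What the author wants to hide, and what actually gets shipped to the client.
const secretSource = "return a + b;";
const key = "k3y"; // must be available on the client, or the code can never run
const shipped = xorText(secretSource, key);

// The runtime: decrypt, then execute.
function runShipped(blob, k, a, b) {
  const code = xorText(blob, k); // XOR with the same key undoes the scrambling
  return new Function("a", "b", code)(a, b);
}

// The attacker: perform the very same decryption step, but read instead of run.
function attackerReads(blob, k) {
  return xorText(blob, k);
}
```

Calling `runShipped(shipped, key, 2, 3)` runs the hidden code, and `attackerReads(shipped, key)` hands back the original source text, using nothing the client did not already have.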

What flash obfuscators do is not encryption (although some of them wrongly brand themselves as such); they simply apply various transformations to the code to make it harder to read. They rename variables to unreadable names, they transform for and while loops etc., and it is indeed a bummer to try to read their output. If I were to copy source code from a game I’d go for the one that did not get obfuscated, so in that sense they do a reasonable job, but they are not cryptographically secure; a hacker with enough time on his hands will be able to crack them.
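As a rough illustration of the renaming transformation, here is a hand-written before and after in JavaScript. This is not the output of any real obfuscator, and `computeScore` is a made-up example function; it just shows that the logic survives intact and only the hints to a human reader are gone.

```javascript
// Readable original (hypothetical game code).
function computeScore(hits, misses) {
  const bonus = hits > 10 ? 50 : 0;
  return hits * 100 - misses * 25 + bonus;
}

// The same function after a renaming pass, done by hand for this example.
// Every identifier is meaningless, but the behavior is identical.
function a(b, c) {
  const d = b > 10 ? 50 : 0;
  return b * 100 - c * 25 + d;
}
```

Anyone patient enough can still work out what `a` does, which is why this is a speed bump and not a lock.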

What do you do, then? he asked me. I really need to protect my code but I also need to run it on the client, so what do I do?

There are several ways to go around this. One is to perform only the simple, dumb signal collection on the client, send the signals to a server and let the secret code run there, not on the client. There is no general high-level solution to the problem; it’s all very specific to the application. In this case I suggested he run the analysis on the server side, and in other cases the solution may be different, but the one thing that’s important to understand is that you can’t protect code on the client.
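A minimal sketch of that split might look like the following. The function names, the event shape and especially the "bot" rule are assumptions made up for this example; the real analysis would be whatever secret sauce lives on the server.

```javascript
// Client side: dumb collection only. Nothing here is worth stealing.
// `events` is assumed to be a list of { type, timestamp } objects.
function collectSignals(events) {
  return events.map((e) => ({ type: e.type, t: e.timestamp }));
}

// Server side: the secret analysis never leaves the server.
// Placeholder rule (purely hypothetical): perfectly regular gaps between
// events look automated.
function looksLikeBot(signals) {
  if (signals.length < 3) return false;
  const gaps = [];
  for (let i = 1; i < signals.length; i++) {
    gaps.push(signals[i].t - signals[i - 1].t);
  }
  return gaps.every((g) => g === gaps[0]);
}
```

The client would send the result of `collectSignals(...)` over the wire, and only the server would ever call `looksLikeBot(...)`.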

Keep safe ;)


Required: CSS islands

Hey CSS guys, how about a CSS island tag?!

This is what I’m talking about:

<html>
<style>
h1 {
 font-size: bigger;
}
</style>
<body>
lots of html code....
<cssisland resets="h1;h2;div.img;#id">
<style>
h1 {
 font-size: smaller;
}
</style>
here all page css code is preserved except for h1, h2, div.img and #id
h1 has a smaller font-size only within the scope of this block.
</cssisland>
more html code... Here h1 has font-size bigger again.
</body>
</html>

Here’s the problem: it’s very common for web pages to be constructed from many sources. For example, on my about page I have a stack overflow badge. Most of the content of this page was constructed by me, but this badge is from stackoverflow.com. The badge widget uses some CSS styles that might conflict with my page styles. For example it may use h1 or default fonts etc. The widget author can’t possibly know who’s going to use her widget, in what pages or how those pages are constructed, and so it’s very likely that there will be a CSS conflict. The browser will know what to do with the conflict, after all that’s how CSS is designed, it’ll work out which rule overrides which. The problem is that either the widget is going to look awful b/c its author didn’t think of setting some CSS properties that have odd values on my page, or the page suffers b/c the widget changes some CSS property that messes up the page.

One possible solution to this problem of CSS conflict may be using an iFrame, but this is also a very limited solution b/c sometimes you don’t want to use an iframe and you do want to preserve most of the page styles. See my previous post on the subject.

The common practice today, which pretty much sucks (IMO…), is to set all style attributes inline, for example:

<html>
<style>
h1 {
 font-size: bigger;
}
</style>
<body>
lots of html code....
<h1 style="font-size: smaller">My great widget</h1>
more html code... Here h1 has font-size bigger again.
</body>
</html>

Here we set the h1 font-size as an inline style attribute but this sucks b/c the code is ugly and not very robust.

To do it nicely you’d want to add a <style> block but you can’t. If you do:

<html>
<style>
h1 {
 font-size: bigger;
}
</style>
<body>
lots of html code....
<style>
h1 {
 font-size: smaller;
}
</style>
<h1>My great widget</h1>

more html code... Here h1 has font-size bigger again.
</body>
</html>

… then you’re going to mess up the page display by changing all h1 on it.

So what I’m suggesting is a css-island block which isolates all CSS definitions declared inside it and, as a convenience, may also reset other CSS selectors defined on the page. This is a “scratch” idea, so refinements are in order, but that’s the general thought.

Does that sound like a good idea?
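Since no such tag exists, the closest thing I can sketch today is a different technique: prefixing every selector of the widget’s rules with the widget’s container, so its styles can’t leak out to the rest of the page. The helper below is a hypothetical illustration (the name `scopeRules` and the rule format are my own assumptions), and note it covers only half of the island idea: it keeps widget styles in, but it does not reset page styles that leak into the widget.

```javascript
// Prefix each selector with a scope selector so the rules only apply
// inside the widget's container. `rules` is assumed to be a list of
// { selector, declarations } objects, not raw CSS text.
function scopeRules(rules, scope) {
  return rules
    .map((rule) => {
      const scoped = rule.selector
        .split(",")
        .map((s) => scope + " " + s.trim())
        .join(", ");
      return scoped + " { " + rule.declarations + " }";
    })
    .join("\n");
}
```

For example, `scopeRules([{ selector: "h1", declarations: "font-size: smaller;" }], "#my-widget")` produces `#my-widget h1 { font-size: smaller; }`, which leaves every other h1 on the page alone.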


Widgets – iframe vs. inline

When writing a widget, should you use an iFrame or make the widget inline?

Widgets are small web applications that can easily be added to any web page. They are sometimes called Gadgets and are widely used in a growing number of web pages, blogs, social sites and personalized home pages such as iGoogle, My Yahoo, netvibes etc. In this blog I use several widgets, such as the RSS counter to the right which displays how many users are subscribed to this blog (don’t worry, it’ll grow, that’s a new blog ;-) ). Widgets are great in the sense that they are small reusable pieces of functionality that even non-programmers can use to enrich their site.

I’ve written several such widgets over time, both “raw” widgets that can be embedded in any site and iGoogle gadgets, which are more structured, as well as WordPress*, typepad and blogger widgets, so I’m happy to share my experience.

As a widget author, for widgets that run on the client side (simple embeddable HTML code), you have the choice of putting your widget inside an iframe or simply inlining it in the page and making it part of the DOM of the hosting page. The rest of the post discusses the pros and cons of both methods.

How is it technically done?

How to use an iframe or how to implement an inline widget?

Iframes are somewhat easier to implement. The following example renders a simple iframe widget:
<iframe src='http://my-great-widget.com/widget' width='100' height='100' frameborder='0'> </iframe>

frameborder='0' is used to make sure the iframe doesn’t have a border so it looks more natural on the page. The server at http://my-great-widget.com/widget is responsible for serving the widget content as a complete HTML page.

Inline gadgets might look like this:

function createMyWidgetHtml() {
 return "Hello world of widgets";
}
document.getElementById('myWidget').innerHTML = createMyWidgetHtml();

As you can see, the function createMyWidgetHtml() is responsible for creating the actual widget content and does not necessarily have to talk to a server to do that. In the iframe example there must be a server. In the inline example there doesn’t need to be one, although it’s possible to get data from a server if needed, which is actually a very common case; widgets typically do call server side code. With the inline method, server side code is invoked by means of on-demand javascript.

So, to summarize, in the iframe case we simply place the iframe HTML code and point the source of the iframe at a server location which actually serves the content of the widget. In the inline case we create the content locally using javascript. You may of course combine an iframe with javascript, as well as the inline method with server side calls; you’re not restricted in that, but the starting points differ.
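For readers who haven’t met on-demand javascript, here is a minimal sketch of the usual trick: add a script tag at runtime and let the server’s response run in the page. The helper name `loadOnDemand` is made up, and the document is passed in as a parameter only to keep the example self-contained; in a real page you would pass the global `document`.

```javascript
// Append a <script> element pointing at `url`; the browser fetches and
// executes whatever javascript the server returns.
function loadOnDemand(doc, url) {
  const script = doc.createElement("script");
  script.src = url;
  script.async = true; // don't block the rest of the page while loading
  doc.head.appendChild(script);
  return script;
}
```

In a browser this would be called as something like `loadOnDemand(document, "http://my-great-widget.com/data.js")`, where the data.js path is a hypothetical endpoint and the returned script would typically call a function the widget has already defined on the page.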

So what is the big deal? What’s the difference?

There are several important differences, so here starts the interesting part of the post.

Security.

iFrame widgets are more secure.

What risks do gadgets pose and who’s actually being put at risk? The user of the site and the site’s reputation are at risk.

With inline gadgets the browser thinks that the gadget code comes from the hosting site. Let’s assume you’re browsing your favorite mail application http://my-wonderful-email.com and this mail application has installed a widget that displays a clock from http://great-clock-widgets.com/. If that widget is implemented as an inline widget, the browser thinks that the widget’s code originated at my-wonderful-email.com and not at great-clock-widgets.com, so it’ll let the widget’s code access the cookies owned by my-wonderful-email.com, and the widget’s evil author will steal your email. It’s important to realize that browsers don’t care where the javascript file is hosted; as long as the code runs in the same frame, the browser regards all code as originating at the frame’s domain. So you as a user get hurt by losing control over your email account, and my-wonderful-email gets hurt by losing its reputation.

If the same clock were implemented inside an iframe and the iframe source were different from the page source (which is the common case, e.g. the page source is my-wonderful-email.com and the gadget source is great-clock-widgets.com), then the browser would not allow the clock widget access to the page cookies, nor to any other part of the hosting document, including the host page DOM. That’s way more secure. As a matter of fact, personal home pages such as iGoogle don’t even allow inline gadgets; only iframe gadgets are allowed. (Inline gadgets are allowed only in rare cases, and only after thorough inspection by the iGoogle team to make sure they’re not malicious.)
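To see the difference in code, here is a tiny sketch of what widget code can get at. The helper name `whatWidgetCodeCanRead` is made up, and the document is passed in as a parameter just to keep the example self-contained.

```javascript
// Whatever document the widget's script runs in is the document it can read.
function whatWidgetCodeCanRead(doc) {
  return {
    cookies: doc.cookie,   // cookies visible to script in this document
    pageTitle: doc.title,  // and anything else in the same document
  };
}
```

Called inline as `whatWidgetCodeCanRead(document)`, this returns the hosting page’s own script-visible cookies and content. Called from inside a cross-origin iframe, `document` is only the iframe’s own document, and trying to reach the host through `window.parent.document` is blocked by the browser with a security error.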

To sum up, iframe widgets are way more secure. However, they are also way more limited in functionality. Next we’ll discuss what you lose in functionality.

Look and feel

In the look and feel battle inline gadgets (usually**) win. The nice thing about them is that they can be made to look like part of the page. They can inherit CSS styles from the page, including fonts, colors, text size etc. Iframes, OTOH, must define their CSS from the ground up, so it’s pretty hard for them to blend nicely into the page.

But what’s even more important is that iframes must declare what their size is going to be. When adding an iframe to a page you must include a width and a height property, and if you don’t, the browser will use some default settings. Now, if your widget is a clock widget that’s easy enough b/c you know exactly what size you want it to be, but in many cases you don’t know ahead of time how much space your widget is going to take. For example, you may be authoring a widget that displays a list of some sort, and you don’t know how long this list is going to be or how wide each item is going to be. Usually in HTML this is not a big deal because HTML is a declarative language: all you need to do is tell the browser what you want to display and the browser will figure out a reasonable layout for it. With iframes this is not the case; browsers demand that you tell them exactly what the iframe size is and will not figure it out by themselves. This is a real problem for widget authors who want to use iframes: if you ask for too much space the page will have voids in it, and if you specify too little the page will have scrollbars in it, god forbid.
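Here is a small sketch of what that guessing game looks like for a list widget. Both helpers and the 20 pixel row height are assumptions invented for the example; the point is that the height has to be decided before the content is known.

```javascript
// Build the embed code; width and height must be chosen up front.
function iframeEmbedCode(src, width, height) {
  return (
    '<iframe src="' + src + '" width="' + width + '" height="' + height +
    '" frameborder="0"></iframe>'
  );
}

// A guess at the needed height: expected number of items times an assumed
// row height. Too many items means scrollbars, too few means a void.
function guessListHeight(expectedItems, rowHeightPx) {
  return expectedItems * rowHeightPx;
}
```

With `guessListHeight(5, 20)` the iframe gets 100 pixels; if the list turns out to have 12 items, the host page gets scrollbars, and if it has 2, it gets a gap.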

Look and feel wise, inline wins. But note that this really depends on your widget application. If all you want to do is a clock, you may get along with an iframe just as well.

Server side vs. Client side

Iframes require you to specify a src URL, so when implementing a widget using an iframe you must have server side code. This could be both a limitation and a headache to some (owning a server and a domain name, dealing with load, paying network bills etc.), but to others it’s actually a point in favor of iframes b/c it lets you write your widget almost entirely in your favorite server side technology, whether it be asp.net, django, ror, jsp, struts, perl or other dinosaurs. When implementing an inline gadget you’ll find yourself practicing your javascript Ninja skills more and more.

What’s the decision algorithm then?

Widget authors: If the widget can be implemented as an iframe, prefer an iframe, simply to preserve users’ security and trust. If a widget requires inlining (and the medium allows that, e.g. not iGoogle and friends) use inline, but dare not exploit users’ trust!

Widget installers: When installing a widget in your blog you don’t see a “safe for users” ribbon on the widgets. How can you tell if the widget is safe or not? There are two alternatives I can suggest: 1) trust the vendor 2) read the code. Either you trust the widget provider and install it anyway or you take the time to read its code and determine for yourself whether it’s trustworthy or not. Reality is that most site owners don’t bother reading code or are not even aware of the risk they’re putting their users at, and so widget providers are blindly trusted. In many cases this is not an issue since blogs don’t usually hold personal information about their readers. I suspect things will start changing once there are a few high-profile exploits (and I hope it’ll never get to that).

Users: Users are kept in the dark. Just as there are no “safe for users” ribbons on the widgets site owners install, there are no “safe to use” ribbons on sites, so users have no idea, even if they have the technical skills, whether or not the site they are using contains widgets, whether the widgets are inline or not and whether they are malicious. In theory a trained developer can inspect the code up front, before running it in her browser and losing her email account to a hacker, but this is not practical and there should be no expectation that users en masse will do that. IMO this is an unfortunate condition and I only hope attackers will not find a way of taking advantage of it and doom the wonderful open widget culture on the web.
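The rule of thumb for widget authors above can be written down as a tiny function. The name `chooseEmbedding` and the two flags are made up for this sketch; it only restates the advice, iframe first and inline only when needed and allowed.

```javascript
// Decide how to embed a widget, following the rule of thumb in this post.
// `widget.canWorkInIframe`: the widget's features fit inside an iframe.
// `widget.hostAllowsInline`: the hosting medium permits inline widgets.
function chooseEmbedding(widget) {
  if (widget.canWorkInIframe) return "iframe"; // safest for users
  if (widget.hostAllowsInline) return "inline"; // needed, and permitted
  return "not possible";                        // e.g. needs inline on iGoogle
}
```

A clock would come out as "iframe", while a widget that has to blend into the page’s own styles would come out as "inline" wherever the host permits it.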

Happy widgeting folks!


* Some blog platforms have somewhat different structures for widgets, and they may sometimes have both widgets and plugins that overlap in functionality, but for the sake of the discussion here I’ll loosely use the term widget for the “raw” type, which consists of client side javascript code.

** Although in most cases you’d want widgets to inherit styles from the hosting page to make them look consistent with it, sometimes you actually don’t want the widget to inherit styles from the page, so in this case iFrames let you start your CSS from scratch.