Monday, September 23, 2013

How to Reverse a String

The source code for this project can be accessed here.

Reversing a string is something that will almost definitely never have any kind of real life application. It’s one of those problems you’ll only ever encounter in an introductory programming course or an interview. Despite all of that, there are actually some pretty important lessons to be learned. So lets do it.

We’re going for speed here. The goal is to find the absolute fastest way to reverse a string. Like most projects on //TODO, we’re going to be doing it in Java. In the past the logic was pretty easily transferrable across languages, but in this case using Java and its String class will actually turn out to be a critical factor in our investigation.

The naïve implementation is always a good start, so let’s give it a shot:


That should be pretty straightforward, but we don’t really expect it to be fast. Here’s what my performance-measuring method looks like:


Here, we generate a string by randomly selecting characters from the lower-case alphabet 10000 times. Then, we record the time it takes to perform the naïve string-reverse by storing the system time before and after the call, and then printing the difference. Technically, it cannot be guaranteed that only the algorithm ran in between the two time checks, as various other processes are running concurrently. However, the variance in time is within a small enough margin of error to forgot about that. Adding cases to that method is pretty straightforward, so that’s the last you’ll explicitly see of that code.

We compile, run, and get this:


For reference, that’s about 42 milliseconds. Not bad, but the scaling is awful. Changing the length of the test string reveals that the algorithm appears to be of ~O(n2) complexity, where n is the length of the input string. While the details aren't incredibly important, it comes from the fact that each iteration actually results in the creation of a new result string, where the characters of the old string are copied over the new string before the new character is appended. Super inefficient. 

How can we improve this? Let’s try to get rid of that memory-allocation overhead by building the result as an array of characters. That way we’re just setting values in a fixed segment in memory, which should be pretty fast. Here’s our second attempt:


We update our test method, run, and check the output:


Damn. That’s a lot faster. That simple change improved the run time by two orders of magnitude. Not only that, but further testing confirms that the algorithm is of O(n) complexity, as we would expect. 

Now what? We could try to apply a fancy sorting algorithm like Mergesort, but these algorithms assume the ordering of the array to be sorted is unknown. In this case, we know it’s completely reversed, so sorting algorithms would actually take longer than a direct approach that exploits that fact. You can see an actual implementation of a Mergesort-like algorithm for reversing a string in the source code, but the output confirms that it’s a bad idea:


To get an idea of how much improvement we have to go, let’s run our code against that of the Java gods and see how we stand. The Java StringBuilder class provides a reverse() method, so we’ll run our best algorithm against that. We update the testing code and run it:


Nice! We’re actually pretty close to beating Java’s go-to string reverse algorithm with almost no effort. Let’s take a look at the source code to see how Java does it:


Stay calm. That looks like a lot of gibberish, so here’s the scoop:  Java encodes characters using UTF-16. This simply means that Java uses two bytes to map values between 0x0000 and 0xFFFF to characters. The problem is that once Unicode took over, Java needed a way to convert their UTF-16 encoding to Unicode, which provides a larger mapping range of 0x000000 to 0x10FFFF. They do this by using a pair of code-points called surrogate pairs (don’t worry about the details). The reversal algorithm used in Java swaps the string byte-by-byte, not character-by-character. Therefore, a lot of additional handling has to be provided for those surrogate pairs.

So what brilliant, insightful, esoteric reason does Java have for doing it this way? I have no idea. In fact, we can actually beat Java with two more, simple optimizations. First off, we didn’t successfully minimize the role of the String object in our improved algorithm: we’re still calling String#charAt on the input. The time overhead created by repeatedly calling that method is greater than if we simply converted the input string to a char array and dealt with it instead.

The second improvement is much more subtle. By pulling characters from both ends of the input string simultaneously, we can exploit the caching of values once we begin to pull characters near the middle of the string. Basically, pulling characters from one side of the string will eventually cause characters from the other side to be placed in the cache as well, which means less overall cache misses occur. Note that the magnitude of this improvement is highly dependent on the cache configuration of the machine running the algorithm.  

We implement these two optimizations into a single, final algorithm:


The result speaks for itself:


Congratulations, you’ve beaten Java at its own game. To fair, this difference is incredibly minute and would be completely negligable during practical use. I’m competitive though, so I count it as a victory.

This is by no means as fast as it gets. There is an obvious opportunity for parallel-processing optimization, and probably a few additional tweaks you could make to squeeze a few thousand nanoseconds out of it. Note that the Java StringBuffer class exists entirely to bypass the issues we saw earlier with Java's String class, but I think that hides the mechanics of what's going on. It uses the exact same techniques we saw here, but in practice using it is highly recommended. 

String reversal is an interesting case where the naïve implementation is actually much closer to optimal than you would expect. If you’re anything like me, the discovery that Java’s methods are occasionally imperfect prompted some serious self-reflection. Try not to let your faith in the Java gods waver too much, though. They’re a vengeful sort.

No comments:

Post a Comment