The source code for
this project can be accessed here.
Reversing a string is something that will almost definitely
never have any kind of real life application. It’s one of those problems you’ll
only ever encounter in an introductory programming course or an interview.
Despite all of that, there are actually some pretty important lessons to be
learned. So lets do it.
We’re going for speed here. The goal is to find the absolute
fastest way to reverse a string. Like most projects on //TODO, we’re going to
be doing it in Java. In the past the logic was pretty easily transferrable
across languages, but in this case using Java and its String class will
actually turn out to be a critical factor in our investigation.
The naïve implementation is always a good start, so let’s
give it a shot:
That should be pretty straightforward, but we don’t really
expect it to be fast. Here’s what my performance-measuring method looks like:
Here, we generate a string by randomly selecting characters
from the lower-case alphabet 10000 times. Then, we record the time it takes to
perform the naïve string-reverse by storing the system time before and after
the call, and then printing the difference. Technically, it cannot be
guaranteed that only the algorithm ran in between the two time checks, as
various other processes are running concurrently. However, the variance in time
is within a small enough margin of error to forgot about that. Adding cases to
that method is pretty straightforward, so that’s the last you’ll explicitly see
of that code.
We compile, run, and get this:
For reference, that’s about 42 milliseconds. Not bad, but
the scaling is awful. Changing the length of the test string reveals that the
algorithm appears to be of ~O(n2) complexity, where n is the length
of the input string. While the details aren't incredibly important, it comes from the fact that each iteration actually results in the creation of a new result string, where the characters of the old string are copied over the new string before the new character is appended. Super inefficient.
How can we improve this? Let’s try to get rid of that memory-allocation overhead by building the result as an array of characters. That
way we’re just setting values in a fixed segment in memory, which should be
pretty fast. Here’s our second attempt:
We update our test method, run, and check the output:
Damn. That’s a lot
faster. That simple change improved the run time by two orders of magnitude. Not
only that, but further testing confirms that the algorithm is of O(n)
complexity, as we would expect.
Now what? We could try to apply a fancy sorting algorithm
like Mergesort, but these algorithms assume the ordering of the array to
be sorted is unknown. In this case, we know it’s completely reversed, so
sorting algorithms would actually take longer than a direct approach that
exploits that fact. You can see an actual implementation of a Mergesort-like
algorithm for reversing a string in the source code, but the output confirms that
it’s a bad idea:
To get an idea of how much improvement we have to go, let’s
run our code against that of the Java gods and see how we stand. The Java
StringBuilder class provides a reverse() method, so we’ll run our
best algorithm against that. We update the testing code and run it:
Nice! We’re actually pretty close to beating Java’s go-to string
reverse algorithm with almost no effort. Let’s take a look at the source code to see how Java does it:
Stay calm. That looks like a lot of gibberish, so here’s the
scoop: Java encodes characters using
UTF-16. This simply means that Java uses two bytes to map values between 0x0000
and 0xFFFF to characters. The problem is that once Unicode took over, Java
needed a way to convert their UTF-16 encoding to Unicode, which provides a
larger mapping range of 0x000000 to 0x10FFFF. They do this by using a pair of code-points
called surrogate pairs (don’t worry about the details). The reversal algorithm
used in Java swaps the string byte-by-byte, not character-by-character.
Therefore, a lot of additional handling has to be provided for those surrogate
pairs.
So what brilliant, insightful, esoteric reason does Java
have for doing it this way? I have no idea. In fact, we can actually beat Java with two more, simple
optimizations. First off, we didn’t successfully minimize the role of the
String object in our improved algorithm: we’re still calling String#charAt on
the input. The time overhead created by repeatedly calling that method is
greater than if we simply converted the input string to a char array and dealt
with it instead.
The second improvement is much more subtle. By pulling
characters from both ends of the input string simultaneously, we can exploit
the caching of values once we begin to pull characters near the middle
of the string. Basically, pulling characters from one side of the string will eventually
cause characters from the other side to be placed in the cache as well, which
means less overall cache misses occur. Note that the magnitude of this
improvement is highly dependent on the cache configuration of the machine
running the algorithm.
We implement these two optimizations into a single, final
algorithm:
The result speaks for itself:
Congratulations, you’ve beaten Java at its own game. To
fair, this difference is incredibly minute and would be completely negligable during
practical use. I’m competitive though, so I count it as a victory.
This is by no means as fast as it gets. There is an obvious
opportunity for parallel-processing optimization, and probably a few additional
tweaks you could make to squeeze a few thousand nanoseconds out of it. Note that the Java StringBuffer class exists entirely to bypass the issues we saw earlier with Java's String class, but I think that hides the mechanics of what's going on. It uses the exact same techniques we saw here, but in practice using it is highly recommended.
String reversal is an interesting case where the naïve
implementation is actually much closer to optimal than you would expect. If you’re anything like me, the discovery that
Java’s methods are occasionally imperfect prompted some serious
self-reflection. Try not to let your faith in the Java gods waver too much, though.
They’re a vengeful sort.
No comments:
Post a Comment