Monday, September 30, 2013

How to Install Hadoop

EDIT: As of 10/11/2013 brew is not installing Hadoop correctly. Just ignore the brew parts and install Hadoop manually. 

This will be boring. I planned on making a post today about how to use Hadoop with all kinds of helpful tips and jokes and then spent the entire afternoon trying to get the stupid thing installed. There's a few tutorials floating around already, but they’re all out of date and somewhat incomplete. After tinkering around and taking bits from each one I finally got it working, so here it is. How to freakin’ install Hadoop.

I’ll be using a single-node, pseudo-distributed setup. This is the best configuration if you’re looking to learn how to use Hadoop. My machine is a MacBook Pro running OSX 10.8.5 with Java 1.6. Cluster users: may god have mercy on your soul. At the time of writing this, the latest stable version is 1.2.1, but I’ll just say <version>.

First things first: if you don’t use homebrew by now, this is the time to get it. Everything in this tutorial is applicable to a manual installation, but homebrew is great and you should use it. Go get it, I'll wait. You may run into some problems with permissions because of homebrew, but these are easily solvable with a quick Google search and will only have to be fixed once. Install Hadoop with the following terminal command:


If you don’t want to use homebrew, get the latest stable build here by navigating to the stable/ directory and downloading the hadoop-<version>.tar.gz file. Then, unpack it to any directory you like.

Before getting started with Hadoop itself, we need to make sure ssh is enabled, and that self-login (i.e. ssh-ing into your own machine) can be done without a password prompt. First, navigate to System Preferences -> Sharing and make sure Remote Login is checked. I also suggest changing “Allow access for:” to “Only these users:”, and then adding only yourself to the list. This is optional, but adds some security.

Next, we need to set up an ssh-key so we can authenticate without entering a password. Just type the following into your terminal:


Make sure that worked by entering:


It shouldn’t ask you for a password. 

Now we can start configuring Hadoop. There’s actually quite a bit of tweaking you can do, but we’ll keep it to the bare basics for now. First, navigate to your conf/hadoop-env.sh file and open it in a text editor (TextEdit works fine). If you installed with homebrew, the file is located at /usr/local/Cellar/hadoop/<version>/libexec/conf/hadoop-env.sh. It’s easiest to navigate to and open the file with the following commands:


We need to do two things with this file. First, we need to specify the location of your Java root directory for the JAVA_HOME variable. Finding the root directory is tricky, but after a bit of snooping around you should get it. I personally ended up setting this line to:


Make sure you get rid of the leading '#'. Second, add the following line to the bottom of the file to a fix a known bug:


We will now modify three more files in the same directory. Their content segments should be set as follows:

conf/core-site.xml:


conf/hdfs-site.xml:


conf/mapred-site.xml:


Let’s take a moment to reflect on the fact that the last localhost port is in fact, over 9000.

It’s smooth sailing from here. From the command-line (even you non-homebrew users), navigate to the directory that contains conf and bin to format your distributed-filesystem by typing:


There should be no errors in the output. Now, let’s test it to make sure it works. Start the hadoop daemons by typing:


Copy some input files to the distributed-filesystem:


Then execute an example run:


You should get output similar to the following:


You can examine the output by typing:


Which should output something like:


Finally, stop all running daemon processes by typing:



There you have it. If everything went without a hitch, Hadoop is now installed and configured on your machine. Stay tuned for next time when we look at MapReduce, what it should be used for, what it shouldn’t be used for, and how to use it.

UPDATE: I'm still working on the main Hadoop post, but I'm having a lot of issues getting everything ready. It seems like there isn't much, if any, support for the Eclipse plugin for Hadoop 1.2.1. I've spent several hours trying to get it set up, but haven't had any luck. It's technically possible to do MapReduce programming without IDE support, but much harder. I'll keep working on it when I have the time but I may have to just come back to it later. 

Monday, September 23, 2013

How to Reverse a String

The source code for this project can be accessed here.

Reversing a string is something that will almost definitely never have any kind of real life application. It’s one of those problems you’ll only ever encounter in an introductory programming course or an interview. Despite all of that, there are actually some pretty important lessons to be learned. So lets do it.

We’re going for speed here. The goal is to find the absolute fastest way to reverse a string. Like most projects on //TODO, we’re going to be doing it in Java. In the past the logic was pretty easily transferrable across languages, but in this case using Java and its String class will actually turn out to be a critical factor in our investigation.

The naïve implementation is always a good start, so let’s give it a shot:


That should be pretty straightforward, but we don’t really expect it to be fast. Here’s what my performance-measuring method looks like:


Here, we generate a string by randomly selecting characters from the lower-case alphabet 10000 times. Then, we record the time it takes to perform the naïve string-reverse by storing the system time before and after the call, and then printing the difference. Technically, it cannot be guaranteed that only the algorithm ran in between the two time checks, as various other processes are running concurrently. However, the variance in time is within a small enough margin of error to forgot about that. Adding cases to that method is pretty straightforward, so that’s the last you’ll explicitly see of that code.

We compile, run, and get this:


For reference, that’s about 42 milliseconds. Not bad, but the scaling is awful. Changing the length of the test string reveals that the algorithm appears to be of ~O(n2) complexity, where n is the length of the input string. While the details aren't incredibly important, it comes from the fact that each iteration actually results in the creation of a new result string, where the characters of the old string are copied over the new string before the new character is appended. Super inefficient. 

How can we improve this? Let’s try to get rid of that memory-allocation overhead by building the result as an array of characters. That way we’re just setting values in a fixed segment in memory, which should be pretty fast. Here’s our second attempt:


We update our test method, run, and check the output:


Damn. That’s a lot faster. That simple change improved the run time by two orders of magnitude. Not only that, but further testing confirms that the algorithm is of O(n) complexity, as we would expect. 

Now what? We could try to apply a fancy sorting algorithm like Mergesort, but these algorithms assume the ordering of the array to be sorted is unknown. In this case, we know it’s completely reversed, so sorting algorithms would actually take longer than a direct approach that exploits that fact. You can see an actual implementation of a Mergesort-like algorithm for reversing a string in the source code, but the output confirms that it’s a bad idea:


To get an idea of how much improvement we have to go, let’s run our code against that of the Java gods and see how we stand. The Java StringBuilder class provides a reverse() method, so we’ll run our best algorithm against that. We update the testing code and run it:


Nice! We’re actually pretty close to beating Java’s go-to string reverse algorithm with almost no effort. Let’s take a look at the source code to see how Java does it:


Stay calm. That looks like a lot of gibberish, so here’s the scoop:  Java encodes characters using UTF-16. This simply means that Java uses two bytes to map values between 0x0000 and 0xFFFF to characters. The problem is that once Unicode took over, Java needed a way to convert their UTF-16 encoding to Unicode, which provides a larger mapping range of 0x000000 to 0x10FFFF. They do this by using a pair of code-points called surrogate pairs (don’t worry about the details). The reversal algorithm used in Java swaps the string byte-by-byte, not character-by-character. Therefore, a lot of additional handling has to be provided for those surrogate pairs.

So what brilliant, insightful, esoteric reason does Java have for doing it this way? I have no idea. In fact, we can actually beat Java with two more, simple optimizations. First off, we didn’t successfully minimize the role of the String object in our improved algorithm: we’re still calling String#charAt on the input. The time overhead created by repeatedly calling that method is greater than if we simply converted the input string to a char array and dealt with it instead.

The second improvement is much more subtle. By pulling characters from both ends of the input string simultaneously, we can exploit the caching of values once we begin to pull characters near the middle of the string. Basically, pulling characters from one side of the string will eventually cause characters from the other side to be placed in the cache as well, which means less overall cache misses occur. Note that the magnitude of this improvement is highly dependent on the cache configuration of the machine running the algorithm.  

We implement these two optimizations into a single, final algorithm:


The result speaks for itself:


Congratulations, you’ve beaten Java at its own game. To fair, this difference is incredibly minute and would be completely negligable during practical use. I’m competitive though, so I count it as a victory.

This is by no means as fast as it gets. There is an obvious opportunity for parallel-processing optimization, and probably a few additional tweaks you could make to squeeze a few thousand nanoseconds out of it. Note that the Java StringBuffer class exists entirely to bypass the issues we saw earlier with Java's String class, but I think that hides the mechanics of what's going on. It uses the exact same techniques we saw here, but in practice using it is highly recommended. 

String reversal is an interesting case where the naïve implementation is actually much closer to optimal than you would expect. If you’re anything like me, the discovery that Java’s methods are occasionally imperfect prompted some serious self-reflection. Try not to let your faith in the Java gods waver too much, though. They’re a vengeful sort.

Sunday, September 22, 2013

A Variable-Length Integer Class

The source code for this project can be accessed here.

This blog post is a bit of a doozy. If massive walls of text aren’t your thing, check out the source code and refer to the text below if there’s anything you need clarification on.

Context

Recently I was involved in a bit of a debacle that made rounds on hacker news circles about the computer security class I was enrolled in. The course was over-enrolled and under-staffed, and the professor’s response was to enforce a series of hilariously draconian rules until they could boot enough people out. Late to class once? Dropped. Cell phone goes off? Dropped. Miss a discussion section? Dropped. Threats were made about the difficulty of the course, how it would be extremely math-heavy and require 40+ hours of work a week. Within 3 days over 50 students had dropped, with a huge amount of anger coming from the computer science student base.

As it turns out, he was totally lying. As much as I would love to say I called his bluff, I stuck around mainly because I had no choice and as of now, it’s turning out to be one of the easiest classes I’ve taken. It’s funny how now I’m actually giving myself more work for the class I thought I would be consuming my life.

The class is pretty interesting, but so far completely conceptual. To try and get a better understanding of the ciphers and protocols that I’m learning, I decided to try and implement them in code as we went along. I was pretty eager to try out the Vernam Cipher, but I realized that there was no intuitive way to use cryptographic keys in the context of Java. Keys are of arbitrary length, and are best streamed bit-by-bit. Java primitive data types wouldn’t really work.

Java does provide a SecretKey interface, but it’s pretty useless as well considering it does nothing other than provide type safety for whatever you choose to use as your cryptographic key. I decided that if I was going to be implementing these ciphers and dealing with keys, I would need a fluid method for doing so, and that I would have to make it myself.

One course of action would have been to make a specific CrytographicKey class, but the issues that make Java primitives inappropriate actually apply to a lot of things out there. We’re used to bit-strings representing numbers, but very often they represent something completely different. For example, suppose you have some set of a finite number of distinct elements, and a subset of that set. One way to describe that subset uniquely is by assigning to it a bit-string, where each bit corresponds to an element in the superset. If the bit is a 0, that element is not in the subset; if the bit is 1, it is. In this case, the length of the bit-string can be anything, and each bit has meaning.

So, instead of a specific CryptographicKey class, the goal is to make a class that represents an arbitrary-length bit-string, which can be streamed bit by bit. Once that’s finished, it would be pretty easy to use the class as a cryptographic key.

Getting Started
The class needs to be able to represent a bit-string of any length. You should be able to specify every bit of the bit-string, and there should be fluid conversion between the Java numeric primitives. The class should be able to represent both positive and negative numbers, using a 2s complement signing scheme (the standard for Java). The class is meant to simply represent a number, so the complexity of the inner-workings shouldn’t be apparent to an outside user. Plus, this all needs to be done with as little memory usage as possible.

To begin, we have to design the way in which the bits are stored. We know the bits are ordered, and that each bit is critical, so an array of some sort sounds good. It’s tempting to immediately start looking into an ArrayList, LinkedList, or other data structure, but in this case the length and contents of the array will be set at instantiation, so the default Java array works just fine.

What should we put in the array? An array of bits would definitely give us the control we need, but even if that were possible to implement it would be enormously tedious for the user. Bytes are a good option, as they let you to specify batches of 8 bits at a time; plus, it would minimize the amount of overhead that occurs as a result of extra bits that we’ll have if the length of the bit-string is not a multiple of our building block.

Let's expand that last point a bit more. Since our building block is not atomic, what happens when the length of our bit-string is not a multiple of 8 (the size of a byte)? What would 10111 look like? Well, there’s no way to get around using the full 8 bits of a byte to represent this (that's what our array is storing), so we’re forced to pad it to 00010111. But what happens if we now want to represent 010111? You can’t pad it to 00010111, that was the padding of the last number and each representation needs to be distinct. Every bit counts, after all.

The solution is pretty much what you expect; we create a variable that stores the length of the bit-string, and use that to determine where to stop streaming bits.

There’s one more small technical detail to cover before jumping into the specifics of the code. The array indexing is big-endian. This means that the first byte in the array actually corresponds to the most significant bytes of the integer. While this may be unintuitive, it makes streaming the bits slightly easier. It does add complications elsewhere, so in retrospect little-endian ordering would work equally well, if not slightly better. 

Since we'll want the user to specify an arbitrary number of fixed bytes at instantiation, we'll call this class FixedByteInteger.

Constructing the Constructors

So far we’ve established two member variables: one that stores the length (mLength), and one that stores the bytes (mBytes). For good measure, we’ll also include the number of bytes needed by a particular FixedByteInteger (mNumBytes), which is usually given by:


Now, we have to work on designing constructors to set these values. The most obvious one is a constructor that takes an argument of an array of bytes. Then, we can simply set our bytes to be equal to the bytes of the argument. This is where the user is given full control over the contents of the bit-string. A first attempt at the constructor might look something like this:


Unfortunately, it isn’t quite that simple. What if the user wants a bit-string length that isn’t a multiple of 8? A user needs the power of specifying when to cut off the bit-stream. Fortunately, the fix is pretty easy. For now, we trust that whatever is going to stream the bits will do so correctly as long as it knows the length of the bit-string. Thus, we can write the following:


We use abs(length) to avoid dealing with the negative length case. What’s up with the last two lines in the first constructor though? Later, when we calculate the numeric value of the FixedByteInteger based on its bit-string, it will make our lives much easier if we sign-extend the top byte.

Like I said previously, one goal we have is to make the process of converting between Java primitive values and FixedByteIntegers fluid and painless for the user. As such, one of our constructors will take a numeric argument, and construct the bytes of the FixedByteInteger based on that value. We will only have one such constructor that takes a long. Ideally, we would have multiple constructors to avoid casting, but any cast to a long (except for floats and doubles, which are not integers anyway and therefore should not be accepted as input values) will never result in a loss of information, so we won’t worry about it for now.

We want to be efficient with our memory. Just because a long takes up 64 bits, doesn’t mean we have to. Say we have the input 32. How many bytes are needed to represent that? If we preserved the same data structure as a long, it would take 8 bytes, when in fact it only takes 1 byte to represent it: 00100000. Therefore, we need to develop a method to find the absolute minimum number of bytes to represent the input value.

The method for doing this actually depends on whether or not the value is negative. Let’s consider the positive case first. In a 2s complement scheme, it is guaranteed that there will be leading zeroes. By chopping them off, we can usually reduce the number of bytes in our array that encode the value of in the input long. Therefore, we scan (remember: big-endian!) the input until we find the first 1, and set mLength to where that 1 lies.

Be cautious. What about when input = 0x00000080? This value represents 8 in decimal. However, using the scheme described above, that value would be reduced to a single byte: 0x80. This value does not represent 8; it represents -128. We have to add a special case where the leading one is byte-aligned when the value is positive, in order to avoid this bug. You can see this case in the code further below, however we essentially just append a 0 to the left of the leading 1 if that case occurs.

What about when the input long is negative? A different case is needed because if the method above is applied to a negative input, the number of bytes used will always be same as the number of bytes in a long, since a negative long always has a leading 1. For example, how many bytes are needed to represent -1? As a long, -1 would be 111…1, 8 bytes of 1s. In fact, we only need a single byte to represent -1: 11111111. The same applies for any negative number. It is likely that a negative value encoded as a long has unnecessary leading 1s.

In order to deal with this, we treat it the same way we would treat leading 0s. Simply scan through the long until the first 0 is encountered, and shear all bits to the left of the 1 that occurs before it.

In the negative case, special consideration of byte-aligned inputs is not needed. I won’t get into the specifics of the code, but our constructor ends up looking like this:


Converting it Back

Since FixedByteInteger can represent a number, it’s logical to have it extend the Number class, which every numeric primitive wrapper also extends. Therefore, it is required that we implement the abstract methods floatValue(), doubleValue(), intValue(), and longValue(), which do exactly what they sound like they do. First off, it isn’t guaranteed that these values will be accurate. For example, if the bit length of a particular FixedByteInteger is greater than 64 bits, it cannot possibly be represented by a long. Therefore, we cannot guarantee that the value will be accurate. This is reflected in the documentation. Secondly, we’ll only worry about doubleValue() and longValue(). floatValue() and intValue() can be implemented by simply calling their respective promoted value methods and then casting the result down (check the source code to see this in action).

Let’s tackle longValue() first. This method is much more straightforward than doubleValue(), since we can literally build the result bit by bit. Again, I don’t want to get into the details of the code, but this is what we arrive at:


The first step is particularly important to understand. Keep in mind that we sign-extended the first byte earlier, so checking the top bit will always tell us the sign of the bit-string. Also, we got to use our ZeroExtender from the last post! Why did we need it?

Now we have to tackle the doubleValue() method. This is much less straightforward, since doubles are not represented as a literal base-2 bit-string, and representing negative doubles is an entirely different ball game. A fortunate turn of events is that the range of negative values is exactly the same as the range of positive values for a double (this is not the case for byte, short, int or long). This will also come into play.

Like before, we’ll consider the case where our FixedByteInteger is positive first. This case is also pretty straightforward: for every index i that corresponds to a 1 in our bit-string, we add 2^i to the result, and return the result after all 1s in the bit-string are found.

The negative case is far less straightforward. As mentioned before, the range of negatives for a double is the same as the range of positives. Therefore, we can convert our negative FixedByteInteger to a positive, calculate the result the same way as the positive case, and then negate it before returning.

The conversion is the hard part. Since the bytes are represented as a 2s complement scheme, the conversion occurs by flipping all bits, and adding 1. However, our bit-string is split into bytes. Therefore, we have to flip the bytes individually, and carry the additional 1 in between consecutive bytes in the case that the addition results in an overflow.  After a lot of tribulation, we end up with this. If you aren’t completely comfortable with bit-twiddling, I encourage you to try and fully understand the following code:


All that’s left is to create the class responsible for streaming the bits. This is pretty straightforward, so I won’t discuss it here. Review the source code linked at the beginning of this post to see how it works (it's defined as a nested class at the end of FixedByteInteger).

Closing Remarks

When should this class be used? It’s tempting think this class is appropriate for large-value bit-streams. However, this should only be the case when the integer value of the bit stream is not relevant. The float and double primitives offer a fantastic interface for encoding large-value numbers, and this class offers almost no advantage over them unless very high precision is needed. Instead, this class is best used when the number of bits needed is within the general range of those offered by Java primitives, but the length of the bit-string has to be exact. As per our motivating example, this class would be extremely well suited for a cryptographic key, which can take on any length but generally has fewer than 256 bits.

It should also be noted that any arithmetic operation on a FixedByteInteger has to be done bit by bit through the provided bit-stream class. This is an unfortunate result of Java arithmetic operators (+, -, *, /, etc.) being strictly defined. As a result, you cannot define what FixedByteInteger + FixedByteInteger actually means. This heavily limits the usefulness of this class, however the utility would be vastly improved in languages where such operations can be defined on arbitrary data types. None of the concepts used were specific to Java, so translating the source code to another language shouldn’t be too hard.

That’s all I’m going to post about bit-twiddling for now. It’s a powerful concept that is extremely important to understand, and hopefully these last two posts have provided you with some familiarity with how it can be applied. Stay tuned for string reversals and sorting, and thanks for stopping by.