Saturday, January 25, 2014

On College Party Culture

The semester is fresh and Berkeley is slowly waking up from its winter stasis. The streets are packed and loud with students looking for a place to party.

The more time I spend in college the further removed I feel from that mindset. I certainly had it as a freshman. As someone who's always preferred a night relaxing with my buddies (or even on my own) than raging, I really wonder why I felt that way. How many of these freshman trying to get into my fraternity's party really want to be here?

I think nobody will need convincing that this comes from the popular image of college life, which we see in movies, TV shows, songs, liquor commercials, etc. At the end of the day watching a raging college party is way more entertaining than seeing a few friends study in the library, even though the latter is probably a more accurate representation.

As is typical for the pop-culture industry, it's also about selling you something: an image. An image which you have to live up to if you want to "do college right", and which you may have to buy a few brand-name liquors and albums to achieve.

First things first. There is nothing wrong with participating in the party culture. Seriously. Tons of people genuinely enjoy it and that's fantastic. If you're one of those people: party on.

Unfortunately, I'm not. While I had a lot of fun pledging my fraternity, my biggest regret in college is forsaking my studies that semester and permanently dropping my GPA 0.5 points. It's closed a lot of doors. I'd wager that a great deal of college students are also not willing participants of the college party culture and are facing an enormous amount of pressure to conform to it.

The irony is that college is about the complete opposite of what the party culture promotes: individual development. Not just academically, but development of every aspect of your persona.

I've found that there are four critical rules to maximizing your time at college: study hard, socialize, be physically healthy, and enjoy life. Do these however you want, but do them.

This is where party culture becomes damaging. Will partying help you be social? Definitely. But what about the rest? Studying? Definitely not. Physically healthy? Hard to say, but given the calorie intake of liquor we'll go with 'probably not'. Enjoyable? Entirely depends on you. If you're like me: not really.

College party culture promotes a lifestyle that is not only not universally enjoyed by those that are pressured to participate, but also damaging to the college experience in general. When I look back at how I've spent my time in college, I value the nights gaming or watching movie with my few, close friends far higher than partying at my fraternity. I've skipped tons of nights on frat row to spend it in San Francisco with my girlfriend and I've never regretted it once.

College is about becoming your own person. Don't let anybody or any body dictate how you should do it.

Friday, January 10, 2014

How to Interview for an Intership

These past few months I've been interviewing for an internship. For a lot of people, there's nothing special about this. Most of my close friends have done dozens of interviews already, but the experience was brand new to me. In fact, before November I had only ever done 2 interviews in my lifetime: one for Stanford, and one for my part-time job. I found the internship interview process very stressful, likely as a result of having such little experience with it all. You always want to know as much as possible about something you're getting into, but the nature of interviews seems to make this impossible.

This post is for the me of two months ago. Or at least, people like the me of two months ago. I want to share some of the nuggets of wisdom I gained by going through the motions, and ways you can prepare yourself. First off, I could end this post by deferring you to two books: Cracking the Coding Interview and The Google Resume. Both are by the same author, and both are great. If you have the time to make your way through these books, I highly recommend you do. You'll be mega prepared.

Preparation
Perfect your resume

This step seems obvious, but you should be aware that resumes in the tech biz differ a lot from resumes you've probably made before. Things like work experience are generally weighed less heavily than things like coding projects. Academic achievements are fine, but recruiters are going to be more impressed with your interaction and contributions to the "real world".

One of the main concerns I encountered while perusing interview advice was the impression that you haven't "done" anything, or that you haven't learned enough to be able to offer anything in particular. I have several friends who didn't even try because they thought they were so unqualified that they would just be wasting their time; that their resume would be tossed out immediately.

First off, this is almost definitely false. If you know basic algorithms and data structures (the second programming class I ever took), you're ready to interview. A close friend of mine will be interning for Facebook without a single college project on his resume (his GPA is stellar to be fair). At the very least, there's no harm in trying. Companies aren't like colleges. They won't hold it against you if you apply and get rejected. In fact, you're usually encouraged to try again, as they know you have the potential to grow as a coder.

If you really want to beef up your resume, try to work on something outside of the classroom. There's nothing wrong with putting class projects on your resume (as an intern you aren't expected to have tons of experience), but independent projects usually take precedence. One way to beef up your list is to take a project you've done in school and expand on it in your own time. I've had to program several games throughout my time in college, and it would be pretty simple to go back and add my own features, or implement a GUI, etc.

Another thing you can do is get involved with open source projects. You can find tons of these with a quick Google session, and there are almost always opportunities to contribute for all skill levels. Check out VLC Media Player for a good starting point.

The best thing you can do it think of a project you'll be passionate about and just do it. Not only for your resume, but to enjoy the process and learn as much as you can. I spent 3 intense weeks coding a game idea I had (Tetrocity), and it eventually became a main talking point of a lot of interviews. Even when it wasn't, a lot of the concepts and nuances I had learned in the process of making it came up.

This subreddit was also a great resource for me when I was working out the kinks of my resume. There's a ton of examples and experienced people willing to help out.

Prepare your expectations

This step is not so obvious, but is a harsh reality of the process. I once heard a quote along the lines of "apply for 30, interview for 7, get an offer from 1". The numbers will obviously vary based on your abilities, but the main point is that you need to prepare for rejection. In fact, you need to prepare for a lot of it. I was fortunate enough to get an offer from my favorite company, but I was rejected quite a few times elsewhere. If you aren't ready for it, it can be pretty disheartening.

First things first, a rejection does not always mean you "failed". Sometimes it does, and you should see those as opportunities for improvement, but a lot of the time the decision will simply be out of your control. I've interviewed for positions I didn't have the experience for, done well, and then been rejected because I didn't have the right experience. Oh well. I've been rejected because I wrote a code base that the company didn't take the time to read properly. Oh well.

I'm not saying you shouldn't take responsibility for negative outcomes. I could have made that code base more readable. The point is that rejections aren't the end of the world. They're an opportunity to learn and be better prepared for the next time. Say "oh well", learn from your mistake, and move on.

A direct consequence of all of this is that you need to apply to a lot of companies. A LOT. The competition is fierce and unless you're a rock star you don't want to put all of your eggs in one basket. It doesn't even matter if you aren't really qualified. Just apply anyway.

Apply correctly

This is one of those areas that I don't agree with, but it's how things go. The best piece of advice I can give in this regard is to avoid applying online. You're chances are simply much better if you can give your resume directly to a recruiter. If you go to college, attend your info sessions and career fairs. If not, look for events put on by companies and attend them. Try to find connections to the company. If you have a friend who has a friend that works at a company you like, ask to be put in contact. A lot of employees of tech companies are actually rewarded for referring a candidate, so they'll be more than happy to help you out.

If you can't do any of that, it isn't the end of the world. I found and applied to the fantastic company that I will be interning for on an online job listing website. In this case the idea of applying to as many companies as possible is even more important. Your chances are less, so increase your sample size!

Know the company

There really isn't a faster way to disappoint your interviewer than not knowing what their company does. It's in both your and the company's best interest if you are applying not just because you want a job, but because you find their work particularly interesting. Companies usually have tons of information online, so take the time before an interview to learn as much as possible. This will also help you ask intelligent questions and really get a feel for what the company is like. It's important that you'll be happy there!

Practice practice practice

I found that there are three main types of interviews: who-you-are, do-you-know, and can-you-do. You'll usually have many interviews per company (anywhere between 3 and 12), and how many of each type you get completely depends on the company. In any case, try to do your best to prepare for each.

Who-you-are interviews (aka "behavioral") are conversations about you as a human being. You'll be asked about your life, background, hobbies, experience, goals, interests, etc ("what kind of software do you like to write?"). These are meant to form a picture of who you are and if you'll fit into the company. There's no better advice for these interviews than to simply be honest. Acting like someone you're not just so the company will like you is only going to hurt everyone in the long run.

Other than that, simply be prepared to talk about everything on your resume (this shouldn't be too hard). Mock interviews are great for this.

Do-you-know interviews are akin to a game of programming trivia. You'll be asked a good deal of relatively factual questions ("what does the Java finalize() method do?"). While a lot of these questions will be 'you either know it or you don't', there's nothing wrong with making an educated guess if you're stumped. Sometimes deducing the correct answer is even more impressive than simply knowing it.

The subject of these questions is pretty dependent on the job qualifications. I've found that a lot of "general" internship programs will include questions about Object Oriented design, and language-specific questions for Python, Java, and C. To best prepare for these, make sure you're very familiar with your favorite language and don't claim to know something you don't on your resume. Brush up on your data structures and algorithms (including implementation features such as Java's HashSet load factor) and know their complexities. Also, pay attention in class!

Can-you-do interviews are harder to prepare for. They usually follow the do-you-know interview and involve you solving a problem on the spot. These questions can range designing an entire program to developing a single algorithm ("how can you check if a player has won a game of tic-tac-toe?"). The best way to prepare for these is to simply practice with as many of them as possible. Check out the two books at the beginning of this post for some great sources of practice problems. I made it my mission 3 months before my first interview to do 2 practice coding questions a day, and it helped immensely. CareerCup is another great source of questions and active discussion. Remember to try and do these with pen and paper or a bare-bones text editor. You won't have Google or Eclipse auto-complete in an interview.

The Interview
Deal with nervousness

My friends seem to vary wildly on how nervous they get for an interview. Some freak out, some don't seem phased at all. I'm certainly closer the freak out side, so I have extensive experience in dealing with nerves.

The first thing you need to do is understand that even if the interview goes terribly, you will still be proud of yourself for trying. It's like getting rejected by your crush. Hey, at least you tried. You can't get an internship if you don't interview, so overcoming your nerves and going through with it is doing yourself a huge favor.

My second piece of advice is go into it with zero expectations. When I was younger I used to try and prepare for every possible outcome of a stressful situation so I could be prepared for anything. This is bad. Don't plan what you're going to say or how you're going to act. This will make it incredibly hard to act naturally and 'lose yourself' in the moment. Just be yourself. You're great!

Something that helped me immensely is to try and connect with your interviewer on a casual level before the technical portion begins. They're people too, and usually incredibly kind and fun to talk to. Spend the first few minutes making small talk if you can (some interviewers jump right into it, so don't force it). It'll help you be more comfortable with your interviewer and the process will seem less daunting. This also has the added benefit of giving you insight into the company's culture, and can entirely change how you perceive a them. Do you really want to spend your internship working with people who aren't friendly?

Above all else, the nervousness will diminish with experience. It's like taking an exam in college. The first time is mortifying, but loses virtually all stress over time.

Think out loud

When you're asked a do-you-know kind of question, it's almost guaranteed that you won't know how to do it immediately. Don't panic. The point is to see how you think through the problem. Keep your interviewer informed with what you're doing. Saying things like "my first thought is this...", "I'm not sure if this will work out, but I'm going to try this...", or "I don't see a solution right now, but I'm going to work through this example to figuring something out..." are a fantastic way to do this. At the very least, your interviewer can't help you if they don't know what you're doing!

I should say that there is a fine line here. Don't babble and talk for the sake of talking. Long periods of silence are absolutely fine. Do what you need to do to think clearly and figure it out.

Don't panic

I can't tell you how many times I've heard "I thought I did terribly, but I got the offer!". It's really easy to walk away from an interview thinking you failed if you didn't answer everything. This is totally false. If the question is hard, then the chances are most other people didn't it either. The point is to see how far you got, or what your process was before you got stuck. A hint isn't the end of the world. If you can't figure something out, don't worry. Just stay calm and tell your interviewer that you're stuck. They'll want to help and will usually guide you in the right direction.

You're being evaluated relative to everyone else, so think of it like a curved test. A 60% may still be an A+.

Ask questions

Once the technical portion is over, you'll usually have the opportunity to ask the interviewer anything you want. I usually have a few questions I ask every company ("what is the coolest thing you've worked on?", "what's something an intern has made that is still being used today?"), and then a few specific to that company. This is your opportunity to interview them and get to know the company better. It may also help to assert your value. After all, you are incredibly desired in the industry and it's their job to convince you just as much as it is yours to convince them. However, make sure you're respectful. Adobe won't be impressed by you asking about their security leak.

---

At the end of the day remember to enjoy the experience. Even if you don't land an internship, which a lot of people don't, it's great experience for when you apply to a full-time position and will give you a huge edge over people who didn't try. I won't say that internships aren't important, but it isn't the end of the world if you don't get one.

Good luck and don't worry! After awhile you'll have interviews mastered and they'll simply be part of your daily routine.

Friday, December 13, 2013

Tetrocity

The source code for this project can be accessed here. This post is largely a collection of loosely related thoughts I have about the development process. Go play the game first! Do it!

Not too long ago I was trudging through one of my online classes when the instructor mentioned something interesting. He brought up the idea of an expansion of Tetris: instead of game pieces (Tetriminoes) with 4 blocks… why not 5!? I was more intrigued that I had never thought of this before, considering the dozens of hours I wasted becoming mediocre at Tetris With Friends. I had made small command-line games before, but nothing really substantial and, you know, actually fun.

So, I set off. Like every OOP project anyone has ever embarked on, I was set on making it as clean, organized, encapsulated, modular, blah blah blah as possible. Surprisingly, I actually managed to succeed, for the most part. I spent a huge amount of time with Microsoft Word, planning my objects and their roles, game features and mechanics, before writing a single line of code.

It was a pretty huge revelation in my development as a programmer to see just how much this helped. It’s easy to get so excited about your ideas that you jump straight in to the code and make up everything as you go along. Almost every time I ever did that, though, it would turn messy pretty fast. Didn’t see that problem coming? Just add a method to this class… I’ll mark it private so nobody will see how badly it doesn’t belong there. That kind of thing adds up fast. I would at this point like to give a shout-out to any recruiter that may be reading this blog post. Please ignore this paragraph.

This necessity is one of the biggest disadvantages and advantages of OOP in general. In functional programming, it’s much easier to implement the functions you need as you go along. Keeping them organized is still important, but not nearly to the degree of OOP. Developing a system in a language like Java requires a lot more forethought. If you put in that initial effort, though, the payoff is massive.

Anyway, after a few weeks of playing around with the idea on and off, getting it sorted out in my head and on paper, I was ready to embark. First things first: more administration. I began by creating every single class that I had anticipated needing, and writing full documentation for each one. This is really just a continuation of the idea from before, but a hell of a lot more boring. I felt it was a lot like eating a salad. Immensely painful and soul rendering, but you walk away feeling better about yourself.

One of the biggest challenges I faced was designing the game in such a way that it would be easy to implement new features down the line. Despite the planning, I really didn’t know what would set my version of Tetris apart from the hordes of Tetris clones already out there.

I began by developing a small set of “fundamental truths” about my game’s mechanics. Really basic stuff that you would expect from any Tetris adaptation: pieces always fall south, the game is over when the board fills, etc. Everything else was completely variable. Tetrimino size? Needs to be potentially infinite. The number of Tetriminoes falling at once? Needs to be potentially infinite. Board size? Needs to be potentially infinite. You see where I’m going with this.

This is where I see a lot of code fail. It’s absolutely possible to write Tetris in a few hundred lines of code, with full GUI support and everything. However, this approach is too static in nature. The code fulfills a very specific purpose: nothing more, nothing less. Avoiding this is a critical aspect of good code. Good code is a brick in a building. You shouldn’t have to demolish the entire thing to add a new floor.

***

Enough preaching! I won’t explain the details of the code or anything. That would be tedious and you can see all of the code on the GitHub repo (it's pretty well documented. Start with the model package if you're interested in learning the game's structure). Instead, I’ll talk about some important things I learned, and then present some interesting problems I ran into, as well as how I solved them.

My first vision for Tetrocity was that every game would be highly customizable. I would begin by asking the player for information such as the range of Tetrimino lengths (i.e. the number of blocks that make up the piece), and the game would begin with those parameters. This was going to be a real selling point. “Play with huge pieces! Infinite possibilities!” I was going to make my game capable of producing infinite-length Tetrimino pieces anyway, so may as well give the option, right?

The first harsh realization I had was there was a huge difference between novelty and genuine entertainment. This is what it looks like when you ask for length 20 Tetriminoes. 


It’s definitely interesting to look at, but you certainly wouldn’t want to play a game like that. It would be impossible. In fact, even length 7 pieces are way too hard to use. Here’s my friend’s best attempt at playing with them (he could barely clear a single line).


If these lengths are impossible to play with, why would anybody want to? After the novelty of seeing them for the first time, there really isn’t much appeal. 

The more I thought about it, the more I began to realize that giving the player all of that control was not a good idea. How would they know what a realistic range of lengths are? Who wants to spend all of that time figuring things out before being able to enjoy the game? On top of that, my own experience with videogames lead me to realize that players would rather overcome a challenge set out for them than overcome one they set out for themselves. By taking control of the course of the game, I could much more easily control things like the difficulty curve, and high scores would have actual meaning. Tetrocity went from fully customizable to “Press start to play”.

I ended up limiting the range of Tetriminoes ever produced in a game to a minimum of 3 and a maximum of 6. It was somewhat disappointing that I wouldn’t be taking full advantage of my infinite length Tetrimino-producing object, but it was a necessary sacrifice of ego.

***

Tetriminoes are randomly constructed in a pretty straightforward way. You begin with an empty grid, place a single block in the center, and then randomly choose grid coordinates that share an edge with a block until you run out of blocks to place. This seems straightforward enough, but ended up posing a lot of problems. First off, this method does not produce every shape with equal probability.

To see this, let’s consider the original set of length-4 Tetriminoes being randomly constructed with this method. Is a line piece just a likely to be constructed as a square piece? Well, there’s only one way to construct a square piece. It looks the exact same regardless of how it’s rotated. A line piece, however, can be constructed two ways: vertically, or horizontally. These two orientations are produced via an entirely different set of coordinates, but any player would consider them equivalent. The more rotational symmetry a shape has, the less likely it is to be produced.

The first question we ask is: is this really a problem? Is the effect noticeable? As it turns out, not only is it noticeable, but it’s also somewhat desirable. In a very hand-wavy sense, the more rotational symmetry a piece has, the less likely we’ll be able to find a place for it, since you can’t rotate it to fit various openings. This is a trivial concern for length-4 pieces, but would be a huge issue for length-5 pieces. I would argue that the game would be way too hard to play with length-5 pieces were they all produced with equal probability.

In true hacker fashion, we’ve turned a bug into a feature! Except for one issue: line pieces have relatively low rotational symmetry. When playing the game, I found that I would go ridiculously long periods of time without ever seeing a line piece. As any player of Tetris knows, they’re critical. Since they have such special meaning, it was necessary to implement some way to increase the probability they would appear.

The way Tetris decides what pieces to give you is actually pretty cool. It begins with a set of all 7 length-4 pieces, and shuffles them randomly. It then feeds you pieces from that set. As such, we expect a line piece to appear once out of every 7.

To mimic this behavior in Tetrocity, I had to implement a probabilistic model, since piece shapes are not stored in software. I made the Tetrimino-production class return random shapes with a certain probability, and straight-line pieces otherwise. This way, we can control the expected number of straight-line pieces fairly easily.

***

An interesting bug I ran into with my Tetrimino-production class TetriminoFactory had to do with Java’s Random class. We’re all guilty of mentally blanking the “pseudo” in “pseudorandom” from time to time, but after the several hours of grueling debugging that resulted from that mindset, I probably won’t make that mistake again.

The bug presented itself as pretty easy to fix. I noticed it when testing TetriminoFacotry’s random shape generation. Regardless of what range of lengths I would give it, it would always give me length-4 pieces. Okay, so the method probably isn’t receiving the correct range information, right? Nope, that was fine. I managed to narrow it down to single call to my Random object’s nextInt() method. Absolutely everything up to that point was working as intended, and everything after it was not. The nextInt() method would return the same number every time. Definitely not random, although still technically pseudorandom I guess.

I isolated the relevant code and began testing, and could not replicate the problem. It seemed like magic. Every relevant parameter was recreated, yet the bug only occurred when the code existed in the Tetrimino-production class. I even tracked down the source code and learned the underlying algorithm in an attempt to figure out why it was occurring. No luck.

Eventually desperation set in and I began to just randomly tweak parameters in some vain hope of something happening. Surprisingly enough, it worked! Apparently, the bug only occurs when the requested range is a power of 2. That was not something I replicated in my isolated code. In my original test code, the range of Tetrimino lengths was 3 to 5. 5 – 3 = 2 = 21.

Why is there such an arbitrary shortcoming in Java’s Random class? I have no idea!
To this day I have no clue why this happens. Regardless of the seed you give it, Java’s random object will return the same value if the range is a power of 2 (it actually begins exhibiting random behavior again after the 15th or so call). Weird.

***

Another interesting challenge was making the program user friendly. Anybody who makes software in industry is well aware of the importance of this, but Tetrocity was the first time I've ever really had to take that into account. 

Tetrocity isn't a particularly complex or deep game. It isn't World of Warcraft, where things like complicated UIs are a justified necessity. Tetrocity is a relatively simple, 2D game. Probably even a mini-game by today's standards. As such, it was very important to convey the new features of the game in as simple a fashion as possible. 

This was pretty easy for the most part, thanks to the success of Tetris. Most people seemed to be aware of the rules and such, so I had to do very little work in explaining that aspect of it. Controls were also borrowed from Tetris With Friends, which makes it intuitive for any Facebook-savvy players. 

The biggest challenge was introducing the ability system. It was certainly new to most players, yet it needed to remain somewhat mysterious in order to motivate people to continue playing. Apart from a single line on the "Controls" welcoming screen, there wasn't even a good way to introduce the concept of abilities without shoving it in the user's face.

Pictures proved to be the best way to get around this. I found a set of free vector icons available here, and sifted through them to find the ones that best described an ability I wanted to implement. By sticking huge icons right off the side of the main grid, I wanted the player to be able to quickly glance at the sidebar, and understand what the ability did. 

I'm not sure I entirely succeeded in that regard. I've found that new players generally don't even notice when a new ability is unlocked, let alone what it does. The main game is simply too intense; your eyes are completely locked on the grid. After a couple of tries people begin to notice the abilities, but I wanted people to know it was there immediately. 

This was partly what motivated me to add sound effects, the majority of which I found here. That way, I could add a little sound that would play every time an ability became available, so that the player wouldn't have to take their eyes off the screen. Again, this was less successful than I had hoped for the reason. Players are just too focused on the game to notice the sound. 

This seemed to work fairly well, but at the end of the day I think a "How to Play" screen is really the best solution. I haven't implemented it yet, nor am I sure if I will, but it's definitely something to keep in mind when your game is too fast paced for the player to really "figure things out". 

*** 

Another important factor in making the game 'easy to use' was the fluidity of the controls. This was actually pretty interesting, because it was the first time that the behavior of my methods became fairly "hazy". I began by developing a very strict definition of what it means to move and rotate a piece. If it was found that these transformations could not occur, they wouldn't. Period. 

From a "contract" point of view everything was fine, but what I found is that the controls became extremely rigid. For example, if a line piece were flush with the right edge of the board and you tried to rotate it, nothing would happen. Logically, that rotation would cause the line piece to protrude off of the board. In fact, the line piece didn't even have to be flush. There could be several spaces in between the line piece and the edge of the board and the rotation would still fail. 

It simply didn't feel right. Again, the game is intense. It simply isn't realistic to expect the player to be that meticulous about every single rotation. When a rotation like that fails, it's incredibly frustrating and confusing. This is was motivated me to begin coding my controls to be "error tolerant". In this example, I made it so that when a failed rotation is detected, the game will automatically try to shift the piece over to the left, and check again. This way, when we encounter that same rotation failure from before, the line piece will be "pushed" left, and rotation will succeed.

Of course, you have to be incredibly careful about just how error tolerant the controls are. How far should you try shifting if left before giving up? Does it have to be the edge of the board; what about another piece? What if it gets shifted into another piece? Once you work out all of the kinks, it feels much better.

***


These are some of the things I learned while making Tetrocity. At the end of the day, I’m extremely proud of how it turned out. It’s the most substantial amount of code I’ve written personally from scratch. The game runs smoothly and is fun as hell. I should go back to studying for finals now, but I’ll probably edit this post as more things come to mind. In the mean time feel free to give the game a go. My high score is about 300000; see if you can beat it!

Monday, September 30, 2013

How to Install Hadoop

EDIT: As of 10/11/2013 brew is not installing Hadoop correctly. Just ignore the brew parts and install Hadoop manually. 

This will be boring. I planned on making a post today about how to use Hadoop with all kinds of helpful tips and jokes and then spent the entire afternoon trying to get the stupid thing installed. There's a few tutorials floating around already, but they’re all out of date and somewhat incomplete. After tinkering around and taking bits from each one I finally got it working, so here it is. How to freakin’ install Hadoop.

I’ll be using a single-node, pseudo-distributed setup. This is the best configuration if you’re looking to learn how to use Hadoop. My machine is a MacBook Pro running OSX 10.8.5 with Java 1.6. Cluster users: may god have mercy on your soul. At the time of writing this, the latest stable version is 1.2.1, but I’ll just say <version>.

First things first: if you don’t use homebrew by now, this is the time to get it. Everything in this tutorial is applicable to a manual installation, but homebrew is great and you should use it. Go get it, I'll wait. You may run into some problems with permissions because of homebrew, but these are easily solvable with a quick Google search and will only have to be fixed once. Install Hadoop with the following terminal command:


If you don’t want to use homebrew, get the latest stable build here by navigating to the stable/ directory and downloading the hadoop-<version>.tar.gz file. Then, unpack it to any directory you like.

Before getting started with Hadoop itself, we need to make sure ssh is enabled, and that self-login (i.e. ssh-ing into your own machine) can be done without a password prompt. First, navigate to System Preferences -> Sharing and make sure Remote Login is checked. I also suggest changing “Allow access for:” to “Only these users:”, and then adding only yourself to the list. This is optional, but adds some security.

Next, we need to set up an ssh-key so we can authenticate without entering a password. Just type the following into your terminal:


Make sure that worked by entering:


It shouldn’t ask you for a password. 

Now we can start configuring Hadoop. There’s actually quite a bit of tweaking you can do, but we’ll keep it to the bare basics for now. First, navigate to your conf/hadoop-env.sh file and open it in a text editor (TextEdit works fine). If you installed with homebrew, the file is located at /usr/local/Cellar/hadoop/<version>/libexec/conf/hadoop-env.sh. It’s easiest to navigate to and open the file with the following commands:


We need to do two things with this file. First, we need to specify the location of your Java root directory for the JAVA_HOME variable. Finding the root directory is tricky, but after a bit of snooping around you should get it. I personally ended up setting this line to:


Make sure you get rid of the leading '#'. Second, add the following line to the bottom of the file to a fix a known bug:


We will now modify three more files in the same directory. Their content segments should be set as follows:

conf/core-site.xml:


conf/hdfs-site.xml:


conf/mapred-site.xml:


Let’s take a moment to reflect on the fact that the last localhost port is in fact, over 9000.

It’s smooth sailing from here. From the command-line (even you non-homebrew users), navigate to the directory that contains conf and bin to format your distributed-filesystem by typing:


There should be no errors in the output. Now, let’s test it to make sure it works. Start the hadoop daemons by typing:


Copy some input files to the distributed-filesystem:


Then execute an example run:


You should get output similar to the following:


You can examine the output by typing:


Which should output something like:


Finally, stop all running daemon processes by typing:



There you have it. If everything went without a hitch, Hadoop is now installed and configured on your machine. Stay tuned for next time when we look at MapReduce, what it should be used for, what it shouldn’t be used for, and how to use it.

UPDATE: I'm still working on the main Hadoop post, but I'm having a lot of issues getting everything ready. It seems like there isn't much, if any, support for the Eclipse plugin for Hadoop 1.2.1. I've spent several hours trying to get it set up, but haven't had any luck. It's technically possible to do MapReduce programming without IDE support, but much harder. I'll keep working on it when I have the time but I may have to just come back to it later. 

Monday, September 23, 2013

How to Reverse a String

The source code for this project can be accessed here.

Reversing a string is something that will almost definitely never have any kind of real life application. It’s one of those problems you’ll only ever encounter in an introductory programming course or an interview. Despite all of that, there are actually some pretty important lessons to be learned. So lets do it.

We’re going for speed here. The goal is to find the absolute fastest way to reverse a string. Like most projects on //TODO, we’re going to be doing it in Java. In the past the logic was pretty easily transferrable across languages, but in this case using Java and its String class will actually turn out to be a critical factor in our investigation.

The naïve implementation is always a good start, so let’s give it a shot:


That should be pretty straightforward, but we don’t really expect it to be fast. Here’s what my performance-measuring method looks like:


Here, we generate a string by randomly selecting characters from the lower-case alphabet 10000 times. Then, we record the time it takes to perform the naïve string-reverse by storing the system time before and after the call, and then printing the difference. Technically, it cannot be guaranteed that only the algorithm ran in between the two time checks, as various other processes are running concurrently. However, the variance in time is within a small enough margin of error to forgot about that. Adding cases to that method is pretty straightforward, so that’s the last you’ll explicitly see of that code.

We compile, run, and get this:


For reference, that’s about 42 milliseconds. Not bad, but the scaling is awful. Changing the length of the test string reveals that the algorithm appears to be of ~O(n2) complexity, where n is the length of the input string. While the details aren't incredibly important, it comes from the fact that each iteration actually results in the creation of a new result string, where the characters of the old string are copied over the new string before the new character is appended. Super inefficient. 

How can we improve this? Let’s try to get rid of that memory-allocation overhead by building the result as an array of characters. That way we’re just setting values in a fixed segment in memory, which should be pretty fast. Here’s our second attempt:


We update our test method, run, and check the output:


Damn. That’s a lot faster. That simple change improved the run time by two orders of magnitude. Not only that, but further testing confirms that the algorithm is of O(n) complexity, as we would expect. 

Now what? We could try to apply a fancy sorting algorithm like Mergesort, but these algorithms assume the ordering of the array to be sorted is unknown. In this case, we know it’s completely reversed, so sorting algorithms would actually take longer than a direct approach that exploits that fact. You can see an actual implementation of a Mergesort-like algorithm for reversing a string in the source code, but the output confirms that it’s a bad idea:


To get an idea of how much improvement we have to go, let’s run our code against that of the Java gods and see how we stand. The Java StringBuilder class provides a reverse() method, so we’ll run our best algorithm against that. We update the testing code and run it:


Nice! We’re actually pretty close to beating Java’s go-to string reverse algorithm with almost no effort. Let’s take a look at the source code to see how Java does it:


Stay calm. That looks like a lot of gibberish, so here’s the scoop:  Java encodes characters using UTF-16. This simply means that Java uses two bytes to map values between 0x0000 and 0xFFFF to characters. The problem is that once Unicode took over, Java needed a way to convert their UTF-16 encoding to Unicode, which provides a larger mapping range of 0x000000 to 0x10FFFF. They do this by using a pair of code-points called surrogate pairs (don’t worry about the details). The reversal algorithm used in Java swaps the string byte-by-byte, not character-by-character. Therefore, a lot of additional handling has to be provided for those surrogate pairs.

So what brilliant, insightful, esoteric reason does Java have for doing it this way? I have no idea. In fact, we can actually beat Java with two more, simple optimizations. First off, we didn’t successfully minimize the role of the String object in our improved algorithm: we’re still calling String#charAt on the input. The time overhead created by repeatedly calling that method is greater than if we simply converted the input string to a char array and dealt with it instead.

The second improvement is much more subtle. By pulling characters from both ends of the input string simultaneously, we can exploit the caching of values once we begin to pull characters near the middle of the string. Basically, pulling characters from one side of the string will eventually cause characters from the other side to be placed in the cache as well, which means less overall cache misses occur. Note that the magnitude of this improvement is highly dependent on the cache configuration of the machine running the algorithm.  

We implement these two optimizations into a single, final algorithm:


The result speaks for itself:


Congratulations, you’ve beaten Java at its own game. To fair, this difference is incredibly minute and would be completely negligable during practical use. I’m competitive though, so I count it as a victory.

This is by no means as fast as it gets. There is an obvious opportunity for parallel-processing optimization, and probably a few additional tweaks you could make to squeeze a few thousand nanoseconds out of it. Note that the Java StringBuffer class exists entirely to bypass the issues we saw earlier with Java's String class, but I think that hides the mechanics of what's going on. It uses the exact same techniques we saw here, but in practice using it is highly recommended. 

String reversal is an interesting case where the naïve implementation is actually much closer to optimal than you would expect. If you’re anything like me, the discovery that Java’s methods are occasionally imperfect prompted some serious self-reflection. Try not to let your faith in the Java gods waver too much, though. They’re a vengeful sort.