programming

Hash tables

I have found a number of potentially unconventional uses for hash tables (aka “associative arrays”). I suppose the first thing that comes to mind when thinking of hash tables is as a way to map a given value to another value. As a very simple example, say you have a list of items and want to keep track of how many of those items you have.

my %items = ();
$items{shoes} = 2;
$items{pants} = 1;
$items{dogs} = 5;
$items{cats} = 50;

We often refer to this arrangement as a “key/value” pair. Now, if you want to know how many shoes you have you can do so by referencing $items{shoes}. If you want to know just how crazy the cat person is, look at $items{cats}.

If you want to print out the item and count, loop through the hash as follows:

for my $key ( keys %items ) {
  print "I have $items{$key} $key\n";
}

On my machine, I see this output:

I have 50 cats
I have 1 pants
I have 2 shoes
I have 5 dogs

One thing to note is that there is no guarantee that the hash table will store the keys in a particular order. That is, there is nothing that says that every time that loop gets run the output will appear in the same order. In practice, for a list that small it probably will print out in the same order. However, you cannot bet on it, and once the list gets larger and items are added and removed the likelyhood of an unpredictable order will increase. Part of it has to do with how perl determines the most efficient way to store/retrieve the values.

If the order is important, it is useful to sort the keys before using them.

A nice thing about the for loop is that a sort routine can be inserted in the parentheses. As long as a list of elements appears in the parentheses after all processing is done perl doesn’t care. In the above example, if I wanted to sort the hash based on the keys:

for my $key ( sort { $a cmp $b } keys %items ) {
  print "I have $items{$key} $key\n";
}

This produces the output:

I have 50 cats
I have 5 dogs
I have 1 pants
I have 2 shoes

Since there is a sort done you can be certain that this is how the ouptut will appear every time.

If, on the other hand you would like to sort by the value instead of the key:

for my $key ( sort { $items{$a} <=> $items{$b} } keys %items ) {
  print "I have $items{$key} $key\n";
}

This produces:

I have 1 pants
I have 2 shoes
I have 5 dogs
I have 50 cats

You may have noticed that I used cmp and <=> We use cmp to compare alpha entities. The <=> is used when comparing numbers.

If you want the order reversed, simply revers the $a and $b (or $item{$a} and $item{$b}). The $a and $b variables are special variables in perl. When sorting, 2 elements are compared to see which one should come before the other. Two elements are tested. Then two more elements and so on until the list of elements is exhausted. For now, just know that in sort $a represents the first element and $b represents the second element. If you are curious, you can print out the elements as they are being sorted. Remember when I said that as long as a list of elements appear within the parentheses perl doesn’t care? Printing doesn’t affect the final list so this is valid:

for my $key ( sort { print "$a cmp $b\n"; $a cmp $b } keys %items ) {
  print "I have $items{$key} $key\n";
}

I see this output:

cats cmp pants
shoes cmp dogs
cats cmp dogs
dogs cmp pants
pants cmp shoes
I have 50 cats
I have 5 dogs
I have 1 pants
I have 2 shoes

Now, back to some of the less common uses of hash tables. The keys of a hash are unique. That means that in the following code, there will only be 2 keys: cats and dogs. The second time cats apppears the existing entry for cats gets overwritten with the new value.

my %items = ();
$items{cats} = 10;
$items{dogs} = 1;
$items{cats} = 50;
for my $key ( sort { $a cmp $b } keys %items ) {
  print "I have $items{$key} $key\n";
}

This is the output I see:

I have 50 cats
I have 1 dogs

One way we can take advantage of this feature is to ensure the uniqueness of a list.

Say we have this list:

my @list = qw(apple banana pear apple peach apple cherry apple berry);

By passing the values of this list to a hash table, we can remove dupliate values:

my %fruit = ();
foreach my $value ( @list ) {
  $fruit{$value}++;
}
my @unique_list = keys %fruit;
print "unique list: ", join( ', ', sort {$a cmp $b} keys %fruit ), "\n";

I get this as the output:

unique list: apple, banana, berry, cherry, peach, pear

Again you can stick a sort in there to get a sorted list.

Not only did you get a unique list of fruit, you also got a count of each fruit in the original list. That’s what the $fruit{$value}++ did. When not explicitly initialized the value of an entry in a hash table will be set to 0. In case you are unfamiliar, $fruit{$value}++ is a short-hand way of doing:

foreach my $value ( @list ) {
  if( ! exists $fruit{$value} ) {
    $fruit{$value} = 0;
  }
  $fruit{$value} = $fruit{$value} + 1;
}

In fact, if you do not use the ++ notation and use warnings are in effect (along with use strict, use warnings is a good habit to get into), perl will give a warning stating something like: Use of uninitialized value in addition (+) This means that $fruit{$value} has never been set to a value so perl is warning you that you might be attempting to add a value to an uninitialized value.

Back to the counts. You can access the counts, the value part of the key/value pair, in the same way as we accessed the number of shoes at the top of this article.

print "unique list: ", join( ', ', sort {$a cmp $b} keys %fruit ), "\n\n";
foreach my $key ( sort { $a cmp $b } keys %fruit ) {
  print "$key appears $fruit{$key} time(s) in the original list\n";
}

I get this output:

unique list: apple, banana, berry, cherry, peach, pear

apple appears 4 time(s) in the original list
banana appears 1 time(s) in the original list
berry appears 1 time(s) in the original list
cherry appears 1 time(s) in the original list
peach appears 1 time(s) in the original list
pear appears 1 time(s) in the original list

There you have some ideas about how to use a hash table in ways that may not have crossed you mind.

maven

Converting from Apache Ant to Apache Maven

At work (http://www.monsanto.com/) we’re converting all projects from Apache ant to Apache maven. The architecture team came up with the POM which all our applications would extend and set up the company-wide repository. I spent about a week and a half converting our project. Everyone thought that conversions would go smoothly. However, I haven’t heard of any project taking less than a week. A colleague wrote an application called “mavenizer”. The mavenizer helps analyze your project and generates a POM for each artifact. We had to determine how to split our application into artifacts but using the mavenizer app helped us figure out the dependencies and got us started with an initial POM.

I can’t speak for other projects but probably the most time consuming part of the process was weeding out the dependencies. We had some production code that had interdependencies but the majority was located in the test code. For instance, we have a TestHelper class that provides methods setting up the data environment prior to running a test and for tearing it down once the test completes. This class was used in a couple of leaf artifacts (domainpos and batchprocess).

When working with maven each, for lack of a better word, installable module is packaged in what’s known as an artifact. An “installable” would be something like a jar or war or ear or some other unit. We have a “common” artifact that’s a jar and can be used by all other artifacts. Then there is an “appdao” jar which contains all database-related code. Finally there are 3 leaf artifacts which can depend on the common and/or appdao but should not depend on each other: domainpos (war that is our point-of-service); plugin (jar that is a plugin for the company-wide GUI framework); batchprocess (jar that runs on a schedule for batch processing).

I’ll post here more about the conversion process and about things I learn about maven.

%d bloggers like this: