programming

Comparing Hash Tables

I was recently asked in a comment how to compare 2 hash tables in Perl. Furthermore, the commenter mention that this would be use in a subroutine.

There is a module Data::Compare http://search.cpan.org/~dcantrell/Data-Compare-1.19/lib/Data/Compare.pm. I’ve never used this in any way other than to learn what it can do. From what I can tell it will not provide details. It will just tell you yes, the data structures are the same or no, the data structures are not the same.

If you want to get any detailed information you can always roll your own.

use strict;
use warnings;

my %hash1 = ('a' => 'b', 'b' => 'c', 'c' => 'd');
my %hash2 = ('a' => 'b1', 'b1' => 'c', 'c' => 'd');

compare(\%hash1, \%hash2 );

sub compare {
  my $hash1 = shift;
  my $hash2 = shift;

  if( keys %{$hash1} != keys %{$hash2} ) {
    print "Hash1 has ", scalar(keys %{$hash1}),
          " keys but hash2 has ", scalar(keys %{$hash2}), "\n";
  }

  # Compare hash1 to hash2
  # Sorted for clear output
  # Case insensitive comparison for sorting
  # Assumes all keys and values are strings
  #
  for my $key_hash1 (sort {lc($a) cmp lc($b) } keys %{$hash1} ) {
    if( ! exists $hash2->{$key_hash1} ) {
      print "Hash1 contains key '$key_hash1' but hash2 has no such key\n";
      next;
    }
    if( $hash1->{$key_hash1} ne $hash2->{$key_hash1} ) {
      print "Both hashes have '$key_hash1' as a key ",
            "but hash1's value is '$hash1->{$key_hash1}' ",
            "and hash2's value is '$hash2->{$key_hash1}'\n";
    }
  }
}

To complete the process you’d also loop through the keys of $hash2 and compare its keys and values to those of $hash1 in a manner similar to the above loop.

References

The second thing to comment on is that hash tables are just a variant of array. In an array the idex is an integer. In a hash the index can be an integer or also be a string (any SCALAR really). The only thing that the => symbol does in the above example is make it more readable by humans. I could have replaced all => with a comma and it would work.

The problem with passing 2 hashes/arrays to a routine is that they would get flattened into one list of values and the routine would see one hash/array instead of two. In order to maintain their individuality the hashes get passed as a reference. To then use the hashes you have to dereference the hashes. In case you didn’t know what the extra symbols mean, they are there because we’re working with references.

Another Option

The commenter mentioned that he is working with 2 files. In the file there are key value pairs. I got to thinking that he might not have to use Perl. He might try sorting the lines in each file (unix/linux/cygwin has a command called sort that has a number of options for sorting). Then use a tool such as diff to show the differences. Depending on how large the files are this might be a more efficient solution. So if the files are sorted in the same way, the keys, being the first entry in the file, would appear in the same order. Example: file one contains

abc: def
abd: fed
efg: hello
xyz: pdq

abc: def
efg: good bye
xyz: pdq

So diff would show something like

2,3c2
< abd: fed
< efg: hello
---
> efg: good bye

You can see from this that file 1, noted by the lines beginning with <, contains abc: fed and efg: hello and that these are not in file 2. It also tells you that file 2, noted by the lines beginning with >, contains efg: good bye.

If you’re not comfortable with this kind of output you could use something like TkDiff to view the differences in a more visual manner (http://sourceforge.net/projects/tkdiff/).

Additional Reading

This article might be useful for understaning references in Perl: http://www.perlmeme.org/howtos/using_perl/dereferencing.html.

You might also find Tom Christiansen’s Object Oriented Tutorial found here: http://www.perl.com/doc/manual/html/pod/perltoot.html.

programming

Hash tables

I have found a number of potentially unconventional uses for hash tables (aka “associative arrays”). I suppose the first thing that comes to mind when thinking of hash tables is as a way to map a given value to another value. As a very simple example, say you have a list of items and want to keep track of how many of those items you have.

my %items = ();
$items{shoes} = 2;
$items{pants} = 1;
$items{dogs} = 5;
$items{cats} = 50;

We often refer to this arrangement as a “key/value” pair. Now, if you want to know how many shoes you have you can do so by referencing $items{shoes}. If you want to know just how crazy the cat person is, look at $items{cats}.

If you want to print out the item and count, loop through the hash as follows:

for my $key ( keys %items ) {
  print "I have $items{$key} $key\n";
}

On my machine, I see this output:

I have 50 cats
I have 1 pants
I have 2 shoes
I have 5 dogs

One thing to note is that there is no guarantee that the hash table will store the keys in a particular order. That is, there is nothing that says that every time that loop gets run the output will appear in the same order. In practice, for a list that small it probably will print out in the same order. However, you cannot bet on it, and once the list gets larger and items are added and removed the likelyhood of an unpredictable order will increase. Part of it has to do with how perl determines the most efficient way to store/retrieve the values.

If the order is important, it is useful to sort the keys before using them.

A nice thing about the for loop is that a sort routine can be inserted in the parentheses. As long as a list of elements appears in the parentheses after all processing is done perl doesn’t care. In the above example, if I wanted to sort the hash based on the keys:

for my $key ( sort { $a cmp $b } keys %items ) {
  print "I have $items{$key} $key\n";
}

This produces the output:

I have 50 cats
I have 5 dogs
I have 1 pants
I have 2 shoes

Since there is a sort done you can be certain that this is how the ouptut will appear every time.

If, on the other hand you would like to sort by the value instead of the key:

for my $key ( sort { $items{$a} <=> $items{$b} } keys %items ) {
  print "I have $items{$key} $key\n";
}

This produces:

I have 1 pants
I have 2 shoes
I have 5 dogs
I have 50 cats

You may have noticed that I used cmp and <=> We use cmp to compare alpha entities. The <=> is used when comparing numbers.

If you want the order reversed, simply revers the $a and $b (or $item{$a} and $item{$b}). The $a and $b variables are special variables in perl. When sorting, 2 elements are compared to see which one should come before the other. Two elements are tested. Then two more elements and so on until the list of elements is exhausted. For now, just know that in sort $a represents the first element and $b represents the second element. If you are curious, you can print out the elements as they are being sorted. Remember when I said that as long as a list of elements appear within the parentheses perl doesn’t care? Printing doesn’t affect the final list so this is valid:

for my $key ( sort { print "$a cmp $b\n"; $a cmp $b } keys %items ) {
  print "I have $items{$key} $key\n";
}

I see this output:

cats cmp pants
shoes cmp dogs
cats cmp dogs
dogs cmp pants
pants cmp shoes
I have 50 cats
I have 5 dogs
I have 1 pants
I have 2 shoes

Now, back to some of the less common uses of hash tables. The keys of a hash are unique. That means that in the following code, there will only be 2 keys: cats and dogs. The second time cats apppears the existing entry for cats gets overwritten with the new value.

my %items = ();
$items{cats} = 10;
$items{dogs} = 1;
$items{cats} = 50;
for my $key ( sort { $a cmp $b } keys %items ) {
  print "I have $items{$key} $key\n";
}

This is the output I see:

I have 50 cats
I have 1 dogs

One way we can take advantage of this feature is to ensure the uniqueness of a list.

Say we have this list:

my @list = qw(apple banana pear apple peach apple cherry apple berry);

By passing the values of this list to a hash table, we can remove dupliate values:

my %fruit = ();
foreach my $value ( @list ) {
  $fruit{$value}++;
}
my @unique_list = keys %fruit;
print "unique list: ", join( ', ', sort {$a cmp $b} keys %fruit ), "\n";

I get this as the output:

unique list: apple, banana, berry, cherry, peach, pear

Again you can stick a sort in there to get a sorted list.

Not only did you get a unique list of fruit, you also got a count of each fruit in the original list. That’s what the $fruit{$value}++ did. When not explicitly initialized the value of an entry in a hash table will be set to 0. In case you are unfamiliar, $fruit{$value}++ is a short-hand way of doing:

foreach my $value ( @list ) {
  if( ! exists $fruit{$value} ) {
    $fruit{$value} = 0;
  }
  $fruit{$value} = $fruit{$value} + 1;
}

In fact, if you do not use the ++ notation and use warnings are in effect (along with use strict, use warnings is a good habit to get into), perl will give a warning stating something like: Use of uninitialized value in addition (+) This means that $fruit{$value} has never been set to a value so perl is warning you that you might be attempting to add a value to an uninitialized value.

Back to the counts. You can access the counts, the value part of the key/value pair, in the same way as we accessed the number of shoes at the top of this article.

print "unique list: ", join( ', ', sort {$a cmp $b} keys %fruit ), "\n\n";
foreach my $key ( sort { $a cmp $b } keys %fruit ) {
  print "$key appears $fruit{$key} time(s) in the original list\n";
}

I get this output:

unique list: apple, banana, berry, cherry, peach, pear

apple appears 4 time(s) in the original list
banana appears 1 time(s) in the original list
berry appears 1 time(s) in the original list
cherry appears 1 time(s) in the original list
peach appears 1 time(s) in the original list
pear appears 1 time(s) in the original list

There you have some ideas about how to use a hash table in ways that may not have crossed you mind.

%d bloggers like this: