Hash Tables and Hash Functions

Hash Tables and Hash Functions
Prev		Next

A lot of what the program does for both tangling and weaving is search for particular keys in a table, and one way to do it quickly is by means of a hash table. An excellent hash algorithm has been presented in [JENK97], and it is this algorithm that we use in our program. This particular module of xml-lit is as general as I have been able to make it, and can be copied piecemeal out of here and incorporated into other programs. It may later become part of a library of common functions I would use.

First, we have the hash.h header which contains the definitions used by the hash table module.

      
--Code fragment from file: hash.h--


#ifndef _HASH_H

#define _HASH_H

typedef void *hashtable;

The hashtable object should be completely opaque except within the hash module, so it is given a type of void *. The internals are described later in the actual source.

What comes next are the prototypes for the hash module API.

hashtable hash_init(bits); int bits;

This function initializes the hash table. The bits parameter gives the number of bits to be used for the hashes, i.e. two raised to the number of bits gives the total size of the hash table in buckets. It returns a pointer to a hash object which should be used in future functions that modify the hash. It returns NULL if for some reason the hash table could not be allocated.

      
--Code fragment from file: hash.h--


extern hashtable hash_init(int bits);

void * hash_lookup(ht, s); hashtable ht; const char * s;

This function looks up a key s in the hash table ht and returns a generic pointer containing the user-specific data associated with the key if it exists. If it does not, NULL is returned.

      
--Code fragment from file: hash.h--


extern void *hash_lookup(hashtable ht, const char *s);

int hash_install(ht, s, udata); hashtable ht; const char *s; void *udata;

This function adds a key to the hash table ht, with key string s, and user data udata. If the hash table is full, 1 is returned, otherwise, the function returns 0.

      
--Code fragment from file: hash.h--


extern int hash_install(hashtable ht, char *s, void *udata);

#endif

      
--Code fragment from file: hash.c--


#include <stdlib.h>
#include "hash.h"

There are a number of internal data structures that are used by the program to support the hash. The first of these is the hash_bucket. It is really a rather simple structure, containing just two fields:

char * key: This gives the key name of this hash bucket.
void * udata: This gives the user data associated with key.

      
--Code fragment from file: hash.c--


typedef struct {
  char *key;
  void *udata;
} hash_bucket;

This next data structure is the hash_table_int, the hash table itself. The stub hashtable data type is internally cast to this data type when the internal functions use it. An explanation of the fields of the internal structure follows:

unsigned long bits: These hash tables are always a power of two in size, so this field gives the number of bits, i.e. the base-2 logarithm of the size of the table.
unsigned long size: This gives the actual size of the hash table in total number of buckets, basically this is what you get when 2 is raised to the bits power.
unsigned long mask: Since the mixing function for the hash generates a 32-bit number, we use the value of this field to truncate the 32-bit number so it becomes a proper index into the hash table. It's just the size field minus one, so when we do a bitwise AND of this with the result of the mixing function we have a result that is exactly the right size.
unsigned long num_entries: This simply gives us the total number of used hash buckets in the table. It's used for determining if the hash table is actually full already, as well as determining the maximum number of collisions that can occur when looking up a hash value.
unsigned long prevval: The result of the previous hash.
hash_bucket * table: This is the actual hash table. It is an array of buckets.

        
--Code fragment from file: hash.c--


typedef struct {
  unsigned long bits, size, mask;
  unsigned long num_entries, prevval;
  hash_bucket *table;
} hash_table_int;

This first function hash_init() initializes the hash table given the number of bits intended for it to use.

      
--Code fragment from file: hash.c--


hashtable
hash_init(int bits)
{
  hash_table_int *ht;

  ht = (hash_table_int *)calloc(1, sizeof(hash_table_int));
  if (ht == NULL)
    return(NULL);
  ht->bits = bits;
  ht->size = (1L << bits);
  ht->mask = ht->size - 1;
  ht->num_entries = 0;
  ht->prevval = 0xdeadbeef; /* an arbitrary initial value :) */
  ht->table = (hash_bucket *)calloc(ht->size, sizeof(hash_bucket));
  if (ht->table == NULL)
   return(NULL);
  return(ht);
}

The internal function hash() converts the key string into a 32-bit hash. See [JENK97] for more information on how this works.

      
--Code fragment from file: hash.c--


/* mixing function for Jenkins' Hash */
#define MIX(a, b, c)\
{\
  a -= b; a -= c; a ^= (c>>13); \
  b -= c; b -= a; b ^= (a<<8); \
  c -= a; c -= b; c ^= (b>>13); \
  a -= b; a -= c; a ^= (c>>12); \
  b -= c; b -= a; b ^= (a<<16); \
  c -= a; c -= b; c ^= (b<<5); \
  a -= b; a -= c; a ^= (c>>3); \
  b -= c; b -= a; b ^= (a<<10); \
  c -= a; c -= b; c ^= (b>>15); \
}

static unsigned long
hash(hash_table_int *ht, const char *p)
{
  unsigned long a, b, c, len, length;

  length = len = strlen(p);
  a = b = 0x9e3779b9;        /* golden ratio suggested by Jenkins */
  c = ht->prevval;           /* previous value */
  while (len >= 12)
  {
    a += (p[0] + (((unsigned long)p[1])<<8) + (((unsigned long)p[2])<<16) +
	  (((unsigned long)p[3])<<24));
    b += (p[4] + (((unsigned long)p[5])<<8) + (((unsigned long)p[6])<<16) +
	  (((unsigned long)p[7])<<24));
    c += (p[8] + (((unsigned long)p[9])<<8) + (((unsigned long)p[10])<<16) +
	  (((unsigned long)p[11])<<24));
    MIX(a, b, c);
    p += 12;
    len -= 12;
  }
  c += length;
  switch(len) {
  case 11:
    c+=((unsigned int)p[10]<<24);
  case 10:
    c+=((unsigned int)p[9]<<16);
  case 9:
    c+=((unsigned int)p[8]<<8);
  case 8:
    b+=((unsigned int)p[7]<<24);
  case 7:
    b+=((unsigned int)p[6]<<16);
  case 6:
    b+=((unsigned int)p[5]<<8);
  case 5:
    b+=((unsigned int)p[4]);
  case 4:
    a+=((unsigned int)p[3]<<24);
  case 3:
    a+=((unsigned int)p[2]<<16);
  case 2:
    a+=((unsigned int)p[1]<<8);
  case 1:
    a+=((unsigned int)p[0]);
  }
  MIX(a, b, c);
  return(c);
}

This next function hash_lookup(), will find the associated bucket given the hash table and the name. First thing it does is hash the name:

      
--Code fragment from file: hash.c--


void *
hash_lookup(hashtable ht2, const char *s)
{
  unsigned long hashval;
  int i;
  hash_table_int *ht;

  ht = (hash_table_int *)ht2;

  hashval = (ht->mask) & hash(ht, s);

We then need to handle the possible case when we have a collision. Obviously, the greatest number of collisions that can occur is equal to the total number of used buckets, so we loop up to that many times. If we find a bucket whose user data is NULL, that means the string has not yet been entered. If it's not NULL, we check to see if the name of the bucket matches the string we are looking for, in which case, we have found the right one.

      
--Code fragment from file: hash.c--


  for (i=0; i<ht->num_entries; i++) {
    if (ht->table[hashval].udata == NULL)
      return(NULL);
    if (strcmp(ht->table[hashval].key, s) == 0)
      return(ht->table[hashval].udata);
    hashval = (hashval+1)&(ht->mask);
  }
  return(NULL);
}

This function, hash_install() installs an object with a given key into the hash table. As before, first thing we do is hash the string

      
--Code fragment from file: hash.c--


int
hash_install(hashtable ht2, char *s, void *udata)
{
  unsigned long hashval, indx;
  int i;
  hash_table_int *ht;

  ht = (hash_table_int *)ht2;
  hashval = (ht->mask)&hash(ht, s);

We then go handle collisions. In this case we attempt to look for an empty bucket close to the hashed entry. If no empty bucket can be found, the hash table is full, and we return false.

      
--Code fragment from file: hash.c--


  for (i=0; i<ht->size; i++) {
    indx = (hashval+i) & (ht->mask);
    if (ht->table[indx].udata == NULL) {  /* unused entry found */
      ht->table[indx].udata = udata;
      ht->table[indx].key = s;
      ht->num_entries++;
      return(0);
    }
  }
  return(1);
}

This code is used by both the tangler and the weaver.

Prev	Home	Next
The main program	Up	The tangler