Tuesday | 23 APR 2024
[ previous ]
[ next ]

Implementing a Hashmap in UniVerse

Title:
Date: 2023-05-13

In the future I want to write a html renderer in BASIC and I realized during a previous attempt that I'm going to need a map of sorts. I need to be able to store variable names and values in a table and look them up. I could use a LOCATE statement but a LOCATE is a linear search of a string. It's a relatively expensive operation for something that should be O(1).

Unfortunately BASIC doesn't have maps so I'll need to implement in the programming language itself. I envision 3 specific functions:

MAP.SET(MAT MAP,KEY,VALUE)
MAP.DELETE(MAT MAP,KEY)
MAP.GET(MAT MAP,KEY)

These 3 functions should let me manipulate a dimensioned multivalue array and give me the basics of a hashmap. The fact that these will be subroutines may actually add more overhead and ultimately end up slower than the linear search. At least I'll have fun doing it!

The first part of implementing a hashmap is the actual hashing function and modding it.

The logic goes like this, if you have an array available to you then you can access each index of that array in O(1) time. This means that it is constant time to access the 1st element or the nth element. This means that we can write an abstraction on top of the array that will take a key, convert that key to some number and then that number serves as the index of that array.

The hashing function is what converts a key to an integer. We want this function to randomly hash and it should be that changing a single letter will result in a wildly different number. This way we will spread the keys around the array well. Now we can't just have the hashing function as the number it spits out might be too big. This is why we take the hash that we generate and divide by how big the array is. The remainder will be the index that we will use.

A good hashing function that a few different people have said to use is the djb2 hashing function:

unsigned long djb2_hash(const char* str) {
    unsigned long hash = 5381;
    int c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    }
    return hash;
}

The above function will seed the hash with a number and then it will loop through the characters in the key. It will add the character and the hash multiplied by 33 together and continually build a larger and larger number.

The numbers are magic and just work. I'm not sure if there's an explanation so it is cool that there is some unknowns even in something so small.

The multiplication by 33 has been made faster by converting it to right shifts. The hash is right shifted 5 places which means it is multiplied by 32. The hash is added one more time bringing the total number of multiplications to 33.

Now I'm going to take this function and convert it to BASIC. The issue is that BASIC is limited in the way that we have access to integers and long integers. I tried using the straight datatype and let the system typecast everything for me. This resulted in overflow issues and strangeness. I gave up that method and switched to the string numeric functions.

These are the SADD,SSUB,SSMUL,SSDIV functions. These functions work on intgers that are strings and they can be of any length. I'm really curious what is going on under the hood, what are these numbers getting converted to. Are they being manipulated directly as strings and there is a calculator engine or is it converting them into some C datatype.

Using these functions I can implement the hashing function though it will be quite a few function calls. This means that the overhead is probably increasing once more. It's going to be fun to benchmark this against a LOCATE. Maybe I have it wrong and the overhead to call subroutines and functions isn't that bad.

    SUBROUTINE MAP.SET(MAT MAP,SIZE,KEY,VALUE)
*
    KEY.LEN = LEN(KEY)
*
    HASH = 5381
*
    FOR I = 1 TO KEY.LEN
        C = KEY[I,1]
*
* HASH = HASH * 33 + C
*
        HASH = SADD(SMUL(HASH,33),SEQ(C))
    NEXT I
*
* POS = HASH - (INT(HASH / SIZE) * HASH)
*
    D = SDIV(HASH,SIZE)
    E = INDEX(D,'.',1)
*
    IF E = 0 THEN E = LEN(D)
*
    DIVISION = D[1,E]
    MULTIPLE = SMUL(DIVISION,SIZE)
    POS = SSUB(HASH,MULTIPLE) + 1
*
    PRINT KEY 'L#25' : HASH 'L#25' : POS 'R#10'
*
    IF MAP(POS) = '0' THEN
        MAP(POS)<1,1> = KEY
        MAP(POS)<1,2> = VALUE
*
    END ELSE
        LOCATE(KEY,MAP(POS)<1>,1;ANYPOS) THEN
            MAP(POS)<1,2,ANYPOS> = VALUE
*
        END ELSE
            MAP(POS)<1,1,-1> = KEY
            MAP(POS)<1,2,-1> = VALUE
        END
    END
*
*
    RETURN
*
* END OF PROGRAM
*
    END
*

The core of getting this function implemented in BASIC is using the string numeric functions but besides that, it's straighforward translation. There wasn't a SMOD function or SREM function to get the remainder properly and so I had to code that up as well. Luckily the function was defined in the manual for a regular MOD so I wrote it using that as a guide.

The bit shifting functions also don't exist in BASIC and so I used regular multiplication. This will limit how fast this hashing function can run.

I could move this hashing function to a seperate file and call it from there but I haven't decided yet if that is a good idea.

Compared to the C version, this version of the hashing program is going to give different hashes. It looks like in the C version the hashes overflow after they hit the limit of the datatype. In BASIC because I'm using the string versions, there is no overflow and so the hashes end up different for larger keys. I don't think this will matter but it is something to note if you want to test the function against C version to validate it.

Below are 10 hashes for 10 random words I generated:

Serendipity    272099048547214449781     82
Whimsical      249867127444925670        71
Ephemeral      249842151425547096        97
Effervescent   8978561987401624087685    86
Quixotic       7571492226034011          12
Bucolic        229419778902534           35
Labyrinthine   8978907966718093115054    55
Ethereal       7570979451673647          48
Scintillating  296315749425851252881018  19

The last number is the postion that these keys will map to in the array.

The other thing to note is that for collisions, I push the key and value into the map. I need to do a LOCATE inside the bucket to figure out if the key is already in the bucket, if it is then we can update the value. If it isn't in the bucket then we can add it to the bucket.

With that we have a working function that will hash a key and then add the key and value to our hashmap. This function will need some lowering so as not to screw things up. The delimiters are definitely going to be a problem because I might have highly nested data that I want to save.

This is going to require some rules about what letters values can contain and I will be stealing the stuff I did for my json parser. There I used 50 greek characters as delimiters and so could raise and lower quite deeply. I needed that there because json can be arbitrarily nested but I probably don't need as much here.

The rest of the functions like MAP.DELETE and MAP.GET will use the same logic as MAP.ADD so they should be straightforward to write.

The last thing to note is that the hashmap needs to be initialized with 0s so that we can get the size at run time. I haven't figured out if there is a way to get the size of a dimensioned array so for now I need to build the array into a dynamic array and then count the number of elements in it. I also can't use the attribute marks in a dimensioned array because they will get added in when the MATBUILD is run. This means that I don't get the full benefits of using a dimensioned array because I can't get the size.

I changed the routine to pass the size in. I did use the MATBUILD and DCOUNT to get the size but that is way too expensive, it makes things 100x slower on my machine. 40ms to 4000ms.

I'll need to investigate because it will simplify things if there was a way to get the size of a dimensioned array automatically. I could pass it in through the subroutine calls but that looks uglier. The signature should be simple.

There are many hacky things going on and hopefully I can get rid of them over time.

  • Using string numbers to do the math
  • Can't do bit shifting in BASIC
  • Can't get a size of a dimensioned array.
  • Can't use attribute marks inside a dimensioned array for that reason.

I may need to switch the underlying dimensioned array to a dynmaic array but I remember it being that array access in the form of TEST.ITEM<15> being a linear time search to find the 15th delimiter. There is no pointer to the nth element as part of the dynamic array.I wonder if an execute to linux might be faster than all this jazz.