Thinking about some designs...

2023-11-04

(I am going to stop creating multiple parts for the analytics project series ...)

What's been floating in my head as a next step for fortalt is to do a few things:

  1. Make a lexicon of strings and hashes
  2. Add multiple dimensions
  3. Store the dimensions as unsigned ints ...

Lexicon as hash table

A lexicon would match a hash value with a string, so the lexicon can be a hash table indexed by the hash. The reasons for this are:

  1. The hash would be a fixed size, like a 32-bit (unsigned) integer
  2. The string itself only appears once in the system. It could be an O(1) lookup away if I were given just a hash value.

For example, if I asked for every column in a row, I would need to translate the results into strings instead of just showing integers. Also, since we are talking about small strings, we can cheat a little and give the strings a fixed size, like 512 bytes (so fewer than 512 characters once multi-byte Unicode is involved). Of course, there are a bunch of things to consider, like proper hash function selection and collision handling, but the system is dumb, so I'll make a naive first choice. As a help, I am referring to the chapter on hash tables in Crafting Interpreters: https://craftinginterpreters.com/hash-tables.html and to the section on hash tables in Introduction to Algorithms.
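
To make this concrete, here is a minimal sketch of what the lexicon could look like in C. Everything in it - the FNV-1a hash, the linear probing, the sizes - is my naive first guess, not a settled design:

    /* Minimal lexicon sketch: 32-bit hash -> fixed-size string.
       Naive choices throughout: collisions between different strings
       with the same hash are not handled, hash 0 is abused as the
       "empty slot" sentinel, and the table is assumed never to fill. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define STR_SIZE   512   /* fixed-size string slot, as above */
    #define TABLE_SIZE 4096  /* toy capacity */

    typedef struct {
        uint32_t hash;           /* 0 = empty slot */
        char     str[STR_SIZE];
    } Entry;

    static Entry table[TABLE_SIZE];

    /* FNV-1a: a common, simple 32-bit string hash */
    static uint32_t fnv1a(const char *s) {
        uint32_t h = 2166136261u;
        while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
        return h;
    }

    /* Intern a string, returning the hash the data tables will store */
    static uint32_t lexicon_put(const char *s) {
        uint32_t h = fnv1a(s);
        for (uint32_t i = h % TABLE_SIZE; ; i = (i + 1) % TABLE_SIZE) {
            if (table[i].hash == 0) {
                table[i].hash = h;
                strncpy(table[i].str, s, STR_SIZE - 1);
                return h;
            }
            if (table[i].hash == h) return h;  /* already interned */
        }
    }

    /* The O(1)-ish lookup: hash -> string, for showing query results */
    static const char *lexicon_get(uint32_t h) {
        for (uint32_t i = h % TABLE_SIZE; table[i].hash != 0;
             i = (i + 1) % TABLE_SIZE)
            if (table[i].hash == h) return table[i].str;
        return NULL;
    }

    int main(void) {
        uint32_t h = lexicon_put("/foobar");
        printf("%08X -> %s\n", (unsigned)h, lexicon_get(h));
        return 0;
    }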

We are making a few assumptions:

  1. That the strings themselves are not time-dependent (maybe we will end up with multiple hash tables if we pair aggregate data with lexicons)
  2. That there are no LRU semantics involved - or at this point, none at all. We can assume all strings are valuable, but the value of data might decay over time (an entry might reference a single event that happens only once, which makes LRU semantics more complex.)

The actual data

Given the fixed size of the hash, my primary tables can be just fixed-size integers. Secondly, we can use bitmaps with these values to see if, for example, "at least one event" has the desired characteristic. The bitmap approach could take different directions - I was thinking of implementing a basic roaring bitmap, a nice compressed bitmap format that stores bitstrings compactly, roughly by saying the most significant 16 bits of each value once per chunk and keeping the low bits either in sparse sorted arrays or in fixed-size bitstrings. I like this primer on the algorithm: https://vikramoberoi.com/a-primer-on-roaring-bitmaps-what-they-are-and-how-they-work/
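
To show the shape of the idea (and only the shape - this is nowhere near a real roaring implementation), here is what a single container and its membership test could look like in C, with names of my own invention:

    /* One roaring-style container: the high 16 bits of every member are
       identical and stored once as `key`; the low 16 bits live either in
       a sorted array (sparse chunk) or a 65536-bit bitmap (dense chunk).
       Real roaring flips a container from array to bitmap once it passes
       4096 elements; that bookkeeping is omitted here. */
    #include <stdbool.h>
    #include <stdint.h>

    enum { SPARSE, DENSE };

    typedef struct {
        uint16_t  key;    /* high 16 bits, said once for the chunk  */
        int       kind;   /* SPARSE or DENSE                        */
        uint32_t  count;  /* members in this container              */
        uint16_t *array;  /* SPARSE: sorted low-16 values (<= 4096) */
        uint64_t *bits;   /* DENSE: 1024 words = 65536 bits         */
    } Container;

    static bool container_has(const Container *c, uint32_t value) {
        if ((uint16_t)(value >> 16) != c->key) return false;
        uint16_t lo = (uint16_t)value;
        if (c->kind == DENSE)
            return (c->bits[lo >> 6] >> (lo & 63)) & 1;
        /* SPARSE: binary search the sorted low-16 array */
        uint32_t a = 0, b = c->count;
        while (a < b) {
            uint32_t m = a + (b - a) / 2;
            if (c->array[m] < lo) a = m + 1; else b = m;
        }
        return a < c->count && c->array[a] == lo;
    }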

For example, if I had four columns - timestamp, path, user selection type and user selection - I would have a hash table that maps integers to strings:

1: /test
5: liked (a "selection type")
20: /foobar
30: y
40: n

1 is 0001 in hex, 5 is 0005 in hex, 20 is 0014, 30 is 001E and 40 is 0028

Then, if I have a table with the aforementioned dimensions, it can look like:

timestamp,path,selection_type,selection
65422C77,00000001,00000005,00000028
65422C78,00000014,00000005,0000001E
65422C79,00000014,00000005,00000028

(The table would instead be organized so that each column is a contiguous array, letting fast searches happen on particular columns. Each column would have a pointer to a key column entry to tie data together across columns; see the sketch below.) (Note to self: take a peek at https://parquet.apache.org/docs/)
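
A sketch of that layout, with made-up names, and with rows tied together by array index instead of key-column pointers (a simplification):

    /* Column-per-array layout: row i of every array is the same logical
       row. A filter on `path` only touches the path array, which is the
       point of organizing the table this way. */
    #include <stdint.h>

    typedef struct {
        uint32_t *timestamp;       /* numeric column, no lexicon needed */
        uint32_t *path;            /* lexicon hashes                    */
        uint32_t *selection_type;  /* lexicon hashes                    */
        uint32_t *selection;       /* lexicon hashes                    */
        uint32_t  nrows;
    } Table;

    /* Example scan: count rows with a given path hash */
    static uint32_t count_path(const Table *t, uint32_t path_hash) {
        uint32_t n = 0;
        for (uint32_t i = 0; i < t->nrows; i++)
            if (t->path[i] == path_hash) n++;
        return n;
    }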

Then, this makes it easy to do set operations. For example: path = /foobar AND selection_type = liked AND selection = n.

Here is a naive approach. Hash each query value (no lexicon lookup needed for that). If the path column is sorted, we can jump to the matching 'row' with some arithmetic, and then do the same with the other columns. Sorting presumes the column is a key, though. The set operations could actually be done with compressed bitmaps, but I am still thinking about this.

The approach is not optimal, but by just using integers, even if the search is slow, comparisons aren't - comparing two words in two CPU registers is (or could be) a single CPU instruction.
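
As a stepping stone before compressed bitmaps, the set operations can be sketched with plain uncompressed bitmaps - one bit per row, packed into 64-bit words. The helper names below are mine:

    /* Build a bitmap of rows where column[i] == value, then AND the
       per-predicate bitmaps together. The word-wise AND is where "one
       instruction per comparison" becomes one instruction per 64 rows. */
    #include <stdint.h>
    #include <string.h>

    static void bitmap_eq(const uint32_t *col, uint32_t nrows,
                          uint32_t value, uint64_t *out) {
        memset(out, 0, ((nrows + 63) / 64) * sizeof(uint64_t));
        for (uint32_t i = 0; i < nrows; i++)
            if (col[i] == value) out[i / 64] |= (uint64_t)1 << (i % 64);
    }

    static void bitmap_and(uint64_t *a, const uint64_t *b, uint32_t nwords) {
        for (uint32_t i = 0; i < nwords; i++) a[i] &= b[i];
    }

    /* Usage for: path = /foobar AND selection_type = liked AND selection = n,
       with `acc` and `tmp` pre-allocated to nwords = (nrows + 63) / 64 words,
       and fnv1a() from the lexicon sketch earlier:

         bitmap_eq(t->path,           t->nrows, fnv1a("/foobar"), acc);
         bitmap_eq(t->selection_type, t->nrows, fnv1a("liked"),   tmp);
         bitmap_and(acc, tmp, nwords);
         bitmap_eq(t->selection,      t->nrows, fnv1a("n"),       tmp);
         bitmap_and(acc, tmp, nwords);

       The set bits left in acc are the matching rows. */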

What is not immediately optimal is the space requirement here - look at all those zeros! Having fixed-size fields at least makes that usage easy to optimize. So this needs a little more thinking ...

Some misc. things

Just off the top of my head:

  1. We might want to store column data in 4k pages in main memory. There is a certain affinity to addressing data in pages. But I am not going to worry about this yet.
  2. The file format needs a bit more explanation, but we are here to just hack something. If anything, a column can have a size, a type (number or string) and a name (see the sketch after this list).
  3. We are not dealing with anything other than string data unless we specifically say in the column header that it is a number, which then doesn't require a hash lookup. This matters for timestamps, an assumed column in this system. Numeric columns like timestamps are also what make range queries easy to describe.
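
For item 2, a guess at what the column header could hold - exactly the three fields named above, nothing settled:

    /* Per-column header in the file: name, type, size. Numeric columns
       (like the assumed timestamp column) skip the lexicon entirely. */
    #include <stdint.h>

    enum col_type { COL_STRING = 0, COL_NUMBER = 1 };

    typedef struct {
        char     name[32];   /* e.g. "timestamp"                      */
        uint8_t  type;       /* COL_STRING or COL_NUMBER              */
        uint8_t  elem_size;  /* bytes per value; 4 for the u32 hashes */
        uint32_t nrows;      /* values that follow in the file        */
    } ColumnHeader;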

I'm going to see how this goes along, but I don't want to think too much about design. I need to see things work to validate my ideas.

In: c analytics