(I am going to stop creating multiple parts for the analytics project series ...)
What's been floating in my head as a next step for fortalt
is to do a couple of things:
A lexicon would match a hash value with a string. So, the lexicon can be a hash table index. The reason for this is:
For example, if I asked for every column in a row, I need to translate the results into strings instead of just showing integers. Also, since we are talking about small strings, we can cheat a little and give a fixed size to the strings, like 512 bytes (so less than 512 for Unicode). Of course, there are a bunch of things to consider, like proper hash function selection, collision handling, but the system is dumb, so I'll make a naive first choice. As a help, I am referring to the chapter on Hash tables in Crafting Interpreters: https://craftinginterpreters.com/hash-tables.html and from the Introduction of Algorithms section on hash tables.
We are making a few assumptions:
Given the fixed size of the hash, my primary tables can be just fixed-size integers. Secondly, we can use bitmaps with these values to see if, for example, "at least one event" has the desired characteristic. The type of bitmap approach could take different directions - I was thinking of implementing a basic roaring bitmap, which is a nice compressed bitmap algorithm which involves compactly storing bitstrings, sort of by saying the most significant bits once and the specifics either in sparse arrays or fixed size bitstrings. I like this primer on the algorithm: https://vikramoberoi.com/a-primer-on-roaring-bitmaps-what-they-are-and-how-they-work/
For example, if I had 3 rows, timestamp, path, user selection type and user selection, I would have a hash table that maps integers to strings;
1: /test
5: liked (a "selection type")
20: /foobar
30: y
40: n
1 is 0001
in hex, 5 is 0005
in hex, 20 is 0014
, 30 is 001E
and 40 is 0028
Then, if I have a table with the aforementioned dimensions, I can have a table like:
timestamp,path,selection_type,selection
65422C77,00000001,00000005,00000040
65422C78,00000020,00000005,00000030
65422C79,00000020,00000005,00000040
(The table would be instead organized where each column is a contiguous array, so fast search on particular columns can happen. Each column would have a pointer to a key column entry to tie data together in different columns.) (Note to self: take a peek at: https://parquet.apache.org/docs/)
Then, this makes it easy to do set operations. For example: path = /foobar AND selection_type = liked AND selection = n
.
Here is a naive approach. Hash each field (don't need the lexicon for that) If the path column is sorted, we can, with some method, jump to that 'row' by using math. Then with similar maths, do the same thing with the other columns. With sorting, it presumes that the column is a key. The set operations can be actually done with compressed bitmaps, but I am still thinking about this.
The approach is not optimal, but by just using integers, even if the search is slow, comparisons aren't - comparing two words in two CPU registers is (or could be) a single CPU instruction.
What is not immediately optimal is the space requirement here - look at all those zeros! Having fixed size fields at least makes it easy to optimize that usage. So this needs a little more thinking ...
Just off the top of my head:
I'm going to see how this goes along, but I don't want to think too much about design. I need to see things work to validate my ideas.
Made with Bootstrap and my site generator script.