Making a dumb analytics system, part 1

2023-10-20

I am reviving a project idea I had before starting my current job.

https://gitlab.com/bolsen80/fortalt

Fortalt is a play on a Norwegian word that means "told" (really in the form "has told", "har fortalt"), but the "talt" part of the word also means counted, and the present tense "teller" means counting. :D Sounds analytic-y to me.

The first version is a few bash scripts that keep counts of paths. A put writes the path along with a Unix epoch timestamp to a log file:

fortalt put "/path/to/url" $(date +'%s')

The log entry will look like:

/path/to/url|1697790717

That's what it would look like if I ran it at the time of writing.
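
For the curious, the put side is not much more than an append. A minimal sketch of what such a script could look like (the log location and the FORTALT_LOG variable are made up for illustration, not necessarily what the repo does):

#!/usr/bin/env bash
# Sketch of "put": append "path|epoch" to an append-only log.
# Usage: put <path> [epoch]; defaults to the current time if the epoch is left off.
LOG="${FORTALT_LOG:-$HOME/.fortalt/events.log}"
mkdir -p "$(dirname "$LOG")"
printf '%s|%s\n' "$1" "${2:-$(date +'%s')}" >> "$LOG"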

To aggregate data in a time frame:

fortalt aggr "/path/to/url" 1697790000 1697800000

It will give me the total count of events between the two timestamps (between 8:20 AM and 11:06 AM UTC on the day I am writing this).
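
A brute-force sketch of the aggr side could look like the following (simpler than the actual script, which sorts and exits early; the FORTALT_LOG variable is the same made-up one as above):

#!/usr/bin/env bash
# Sketch of "aggr": count events for a path between two epoch timestamps.
# Usage: aggr <path> <from-epoch> <to-epoch>
LOG="${FORTALT_LOG:-$HOME/.fortalt/events.log}"
grep -F "$1|" "$LOG" \
  | awk -F'|' -v p="$1" -v from="$2" -v to="$3" \
      '$1 == p && $2 >= from && $2 <= to { n++ } END { print n+0 }'

The grep is just a cheap prefilter; the awk does the exact path match and the range check.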

As something dead simple that can also be log-rotated, this was easy to put together. There are other approaches, like parsing existing server logs, but I just wanted to do quick counts.

The next step is to install this for my blog. I need to get Nginx to trigger the script in some fashion, which I thought could involve socat, smooshing in some HTTP parsing, and Nginx's mirror module. Let's see.

My plan, however, is to write this in C, keeping the simple append-only log. Maybe I'll write a really bad version of a log-structured database (or even the merge-tree variant), which roughly could involve a sorted log structure that gets flushed out to files when it grows too big. I was thinking this could also be smashed into a hash table, and that I could collect more data points to work with. But first things first - after getting Nginx to work with my scripts, I'll have sufficiently real output to write the C programs against. :)

The other improvement that can be made immediately is some form of bucketing. The search algorithm right now is basically O(n) over a sorted list: it greps the log for a path, sorts the matches by time, then scans the list, stopping at the first timestamp that falls outside the range. For a larger dataset, a simple binary search could be employed instead, with each path segregated into its own pre-sorted log. Further, the search dataset can be pre-aggregated from the original log files into buckets, like 5-minute ranges.
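
To give an idea of what that pre-aggregation could look like, here is a rough awk sketch that rolls the raw log up into 5-minute buckets per path (the file names are made up):

# Each output line is "path|bucket_start_epoch|count".
awk -F'|' '{ bucket = $2 - ($2 % 300); count[$1 "|" bucket]++ }
           END { for (k in count) print k "|" count[k] }' events.log \
  | sort -t'|' -k1,1 -k2,2n > events.5min

Aggregating over a time range then means summing a handful of bucket counts instead of scanning every raw event, and the sorted bucket file is also the kind of thing a binary search could work against.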

One thing I would want to do in the C implementation is some form of simple mmap'ing (mmap is considered evil on large datasets though :D) so binary searches can run fast against data sitting in main memory. Bash only does serial passes over the file, which makes it inefficient for large datasets. FWIW, I don't think I will ever reach 'web scale', but it's just a fun thing to tinker with. :)

In the past I have tried to use other tools for things that are stupid simple, but again, it's better to explore the (non-)problem a bit and see what happens. For what it's worth, I have a pretty good idea of how analytical systems work, since this was what I worked on in my last job. ;)

Next in series ...

In: c web analytics