Making a dumb analytics system, part 2

2023-10-23

In the next step, I wanted to go ahead and add a bucketing script. Basically, if I hand it a log, a start time, and a time interval, it will count events into time buckets.

For example, if I want to put events into 10-minute (600-second) buckets:

fortalt bucket log test.bucket "/test" 1697736400 600

This says: pull from log, write results to test.bucket, counting the events on /test starting from 19 October 2023 at 17:26:40 UTC, and proceeding in 10-minute buckets thereafter.

So I have this log:

/test|1697736412
/test|1697736419
/test|1697736420
/test|1697736422
/foobar|1697736427
/foobar|1697736428
/test|1697736432
/test|1697807418

and this is the test.bucket output:

1697736400|5
1697807200|1
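The bucket key is just each event's timestamp snapped down to its interval: start + floor((ts - start) / width) * width. Here's a minimal awk stand-in for that arithmetic (my illustration, not fortalt itself), using the same log and parameters:

```shell
# Snap each /test timestamp down to its bucket and count per bucket.
cat > demo.log <<'EOF'
/test|1697736412
/test|1697736419
/test|1697736420
/test|1697736422
/foobar|1697736427
/foobar|1697736428
/test|1697736432
/test|1697807418
EOF

result=$(awk -F'|' -v start=1697736400 -v width=600 '
  $1 == "/test" {
    bucket = start + int(($2 - start) / width) * width
    count[bucket]++
  }
  END { for (k in count) print k "|" count[k] }
' demo.log | sort)
printf '%s\n' "$result"
# 1697736400|5
# 1697807200|1
```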

You can imagine how this could be used: a cron script or daemon that continually aggregates events starting from a given line:

line_count=$(wc -l < log)   # total lines in the log right now
curr_pointer=$1             # lines already processed on the previous run
mkfifo fortalt.pipe
tail -n $((line_count-curr_pointer)) log > fortalt.pipe &   # only the new lines
fortalt bucket fortalt.pipe test.bucket "/test" first 600 # didn't implement saying "first" as the time ... something to add ...
rm fortalt.pipe
echo $line_count # the new $curr_pointer for the next run

This would probably need flock(1) to put an exclusive lock around this script so that the current-line pointer is read and updated atomically.
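A sketch of what that could look like, assuming the util-linux flock and an arbitrary lock-file path:

```shell
# Hold an exclusive lock for the duration of the run; a second copy of the
# script started concurrently waits here instead of racing the pointer.
exec 9>/tmp/fortalt.lock   # arbitrary lock-file path, kept open on fd 9
flock -x 9                 # blocks until the exclusive lock is ours
status="locked"            # ... the tail/fortalt pipeline would run here ...
flock -u 9                 # release (also dropped when the script exits)
echo "$status"
```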

As this starts to come together, here are the ideas that have been floating around in my mind:

  1. Unit tests!
  2. Convert "paths" to hashes (I thought of using murmur32 and seeing if I can print the hashes as ASCII chars). The purpose is to have fixed-size entries: essentially a 32-bit value for each row representing an event.
  3. Make a lexicon for variable-length strings: the hash values would map back to the original strings (maybe via a hash table), and the string-to-hash lookup can be done using a trie. This lookup table would be used by the event lookup programs to relate variable-length strings to hash values and vice versa.
    • I'll be writing this in C for no other purpose than the fact that hash tables, tries and hash functions are probably a nice thing to write in C.
  4. Write a really simple Nginx log parser (something I could do first) to pull up a backlog of events
  5. Performance testing: for X types of events and Y events, how fast can it search the raw log? How fast can it bucket? (the first is dependent on the O() of grep and the O() of sort at the moment)
  6. Allow defining schemas and separate the individual columns in some way without sacrificing ease of sort+search (https://en.wikipedia.org/wiki/Column-oriented_DBMS)
  7. With more than one column, a simple language for queries (like time between (A,B) && path is '/test')
  8. I was awake at night wondering if I can use protobuf as a format ...
  9. An event collector using libmicrohttpd. I had a project idea to use libmicrohttpd with QuickJS, maybe this will be it?
  10. I saw this in my earlier post: https://vikramoberoi.com/a-primer-on-roaring-bitmaps-what-they-are-and-how-they-work/ ... ooo
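Idea 4 is small enough to sketch in shell. Assuming the default Nginx "combined" log format and GNU date, one access-log line can be rewritten into the path|epoch format of my log (the sample line and IP here are made up):

```shell
# Pull the request path out of the quoted request, the timestamp out of the
# [...] brackets, then convert the timestamp to epoch seconds.
line='127.0.0.1 - - [19/Oct/2023:17:26:52 +0000] "GET /test HTTP/1.1" 200 612 "-" "curl/8.1.2"'

path=$(printf '%s\n' "$line" | awk -F'"' '{ split($2, req, " "); print req[2] }')
stamp=$(printf '%s\n' "$line" | awk -F'[][]' '{ print $2 }')
# "19/Oct/2023:17:26:52 +0000" -> "19 Oct 2023 17:26:52 +0000" for date(1)
epoch=$(date -d "$(printf '%s\n' "$stamp" | sed 's|/| |g; s|:| |')" +%s)

printf '%s|%s\n' "$path" "$epoch"
# /test|1697736412
```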

Last in series ...

In: c web analytics