Introduction

So you've found my blag? It's not like I've deliberately hidden so yeah, figures.

At some point I wanted to write a few articles about how I write code, and I actually started writing those. You can find them in the Legacy section.

In case you read this in a far away future in which I actually did follow through with my plan; chances are you'll also see a few entries in a journal-like fashion. I've been pondering about it for a while and it seems feasible to actually jot down notes while doing some work and then just sort those notes when I'm done doing whatever I was doing and using that as both a journal and a blog that I can refer to at the same time. As said, far away future, so if you're looking for that and it's not here yet, chances are it doesn't even exist.

Ramble

This is the mandatory ramble section I guess. Here I'll explain some of my background and what this is all about.

I'm currently typing this on my semi-freshly acquired aarch64 tablet convertible. The device is nice, it's basically a portable Linux device which I can use for typing stuff on the go, or with some tinkering maybe even play games on the go. The crux is however is that computers are mind-bogglingly stupid1.

Contrary to the popular opinion of "the computer deleted my file" or "the computer keeps downloading files to the wrong directory" we did not actually make sand think, not yet anyway. All a computer ever does is what someone told it to do, specifically, in a structured format that the computer can follow; this applies to all computers regardless of whether it's a physical device, some emulator, or even the JVM – or any byte code interpreter for that matter. They are incredibly single-minded if you can even call it that. The problems that people have with computers are usually one of three things (IMHO, YMMV): hardware failure, interoperability, and "someone told this computer to do something I don't like". The first one is unavoidable to some extent. The second one is usually some vendor lock-in or some hardware standards thing, often caused by NIH link, or some other enshittification thing, avoidable to some extent, but also historical baggage more often than not. The last one.… well, this includes everything from hardware, via firmware, to software, up to and including settings, presets, etc., and it's one of the primary reasons why FOSS came into existence. So when someone says "the computer keeps downloading files to the wrong directory" what the underlying message is "someone configured the software on this computer to download files to a different directory than I want them in". I am not blaming the person who is using the computer. Sure, they may have explicitly set the wrong directory in their browsers preferences, but with the amount of software we shove into these boxes it may also be the complex interaction of three pieces of software all of which have different conventions and configurations. Someone configured those, either explicitly as a user, or implicitly through defaults shipped by your upstream vendor, or the defaults set in the software, or straight out "defaults" in terms of the way the code is written in. If the code of a software never checks the configuration setting you are setting despite it being the way to set things as per documentation then yes, someone forgot to tell the computer to ask you for your opinion (and you're just giving your opinion to the void). Right here and right now it is important to say that yes, there's always a person to point at, but no, you shouldn't actually do that. All of these "configuration choices" happen for a reason, and rarely is it just a simple mistake.

  • often because software follows a standard (like the XDG specs for your desktop)
  • sometimes because people didn't know the standard (when someone uses public IPv4 addresses in their private network because they didn't know about RFC 1918)
  • time or other resource constraints (like how my infrastructure is largely not available via Legacy IP2 because it's just not worth it for me)
  • unfortunately also because companies don't give a flying duck about making something configurable or even usable for a user (go ask Nintendo what that big warning notice on the 3DS update history is all about)

And in this case – a Chromebook – someone at Google thought of a really fancy way to boot this device with some custom (largely?) Open Source firmware and whatnot to enable a very fast and verified boot, with signatures and such, to prevent intrusions. I don't know why it was necessary to optimize the boot times of the bootloader of all things, but hey, at least the source code is available and the documentation is actually usable. The intentions here aren't really relevant right now, suffice to say getting mainline Linux to boot on these devices requires packing of the kernel and initrd in a special format (Depthcharge) to get the device to recognize this (not to mention persisting dev mode and tinkering with firmware write protection to disable verified boot). Okay, so far so good, this isn't the most complex computer thing I've done before. Why am I telling you all this? Well, first of all, this is the ramble section, what did you expect? Second, and much more importantly, because in trying to change something about this I had to learn a hard lesson — again. What I was trying to do is deploy Das U-Boot as a kpart image (like the kernel) so I could teach this device to just boot from a partition, which would have made testing and updating kernels much easier. There is even a documentation section on this exact thing, except it's for the device codenamed snow with jerry being listed as an alternative. I am writing this on krane however.

Now comes the relevant part that I've been meaning to get to; I'm a network engineer. I love routing packets from one device to the next. That's what I do at home. I also tinker with my software a bit, but usually nothing that nobody's done before. That thing however? No dice. The Linux kernel has DTBs for my device (and the entire family), heck mobile-nixos has this entire device family covered including the kpart generation using depthcharge. U-Boot has none of that. No DTBs – okay, I can copy those off of the Linux build. No defconfig – okay, I used to compile my own kernels shouldn't be that hard. Nothing on the internet on what I need for this device type and the closest in the U-Boot tree with a defconfig was a single device with an mt8183 that didn't quite match my hardware. This is where I get stuck.

I – as a sysadmin-ish hybrid developer networking person – require documentation to base my stuff off of, and whenever something diverges from my – admittedly very broad – experience I get stuck. This entire blog is about writing down whatever I know so that I can reference it and so that other people can try and build stuff off of whatever I do. That is if I can actually keep up the writing, something I wouldn't bet on, but you never know what life throws at you, eh?

1

I can't promise that this was the last H₂G₂ reference.

2

When I say Legacy IP I refer to IPv4. I don't like IPv4, and I'd rather we fully commit to IPv6 already.

Legacy

This is the legacy section of this blag. As the same suggests all of these posts have been carried over from the previous version, so they may not be complete, up-to-date, or anything at all. Better old than lost I guess.?

Getting Started With Programming

Hi there, welcome to a fast (depending on how fast you read) and quick (actually really in depth if I manage to keep going).

What I will do:

  • explain computers
    • a bit of hardware maybe
    • a whole lot of software
    • sprinkled with operating system stuff
  • explain programming
    • concepts
    • ideas
    • best practices
  • explain languages in detail
    • some, I don't really know which yet, I just started to write something so this might be a little weird

Theory

You can happily skip this whole chapter. I earlier said this will be in depth so here you go:

Computers are stupid.

To understand why, let me just summarise:

  1. Computers do exactly what you tell them
    • if you tell them to kill themselves they'll do
    • if you tell them something they don't understand they'll try to make themselves understand (often by just doing random things)
      • that got better in recent years, we now have smarter interpreters, compilers, OSes and whatnot. Stuff usually tells you right away that it doesn't know what to do with your instructions
  2. Computers have not a lot of instructions you can give them
    • this however doesn't apply to other software, you can and will instruct a lot of foreign software to do things for you
      • foreign code can contain bugs of their own
  3. Computers as whole, starting at the hardware and going all the way up, are just so incredibly complex that you will never understand everything about them. relevant xkcd

What is a Computer?

Basically a computer consists of CPU and RAM. In most computers you'll find other things like disks or peripherals.

Peripherals include:

  • output devices
    • monitors
  • input devices
    • keyboards
    • mice
  • both
    • parallel/serial ports
    • network

*[CPU]: Central Processing Unit *[RAM]: Random Access Memory

Imagine a computer that just computes using CPU and RAM.… You had no way of getting data into the computer and no way of getting it out, that would be pretty pointless.

Now think of the Standards problem. There is a whole lot of entirely different hardware out there. You can't just be compatible to all hardware. Period. That is why I am introducing you to the concept of the Stack.

The Stack

A stack (English: stack) is what happens when you want to go to bed and are too tired to put your clothes somewhere and just throw them in the same corner every night and take fresh ones in the morning. This is about the same with computers just a little broader and occasionally less high.

Bottom

So at the bottom of the stack there is our hardware. CPU, RAM, disks, peripherals and all the other nice things.

There are chips on your mainboard (the part that connects all of the other things) that handle voltage of the other parts, make it possible to keep your finger long on the power button to kill your computer and much more things.

There is also a chip containing a BIOS or UEFI. Either of them is the bottom of the software stack. I'll just assume a BIOS.

*[BIOS]: Basic Input and Output System *[UEFI]: newer than BIOS, but not necessarily better.

The BIOS initialises devices and provides some way of configuring certain very low-level (very deep down the stack) configuration options. This can for example be "where to boot from". If you just boot from the internal disk it will read the first 512 bytes from the disk and interpret those.

Binary

Oh right.… Some long time ago computer people decided they need to store data in some way. The most hardwarish way to do that, as early computers were the size of a house and did mostly consist of hard-wired logic, was to build some electronic part that could store a simple signal:

StateNumber
Off0
On1

This is what we call a "bit". A bit is not very much. So we just take eight of them and call it a "byte". If you have a byte written out in numbers it would be between (inclusive) 00000000 (lowest) and 11111111 (highest). Now, if we count the number of different patterns we can store inside that we will count exactly 256.

Memory, so both disks, RAM and everything else that stores data is just a long array of bytes of which each one can be individually read and set.

More to be read on Wikipedia.

But back to the BIOS

So, the BIOS reads 512 bytes which it interprets. These bytes are called the MBR. The MBR can then again instruct the BIOS to do things like loading data from a harddisk, store it in RAM. It can even instruct the BIOS to jump to the location in RAM where it just loaded data to and the data is then executed by the CPU.

*[MBR]: Master Boot Record

Okay, so we managed to get from the hardware all the way up to executing custom code on the CPU. This means this code now has full control over the entire computer. What it does? Well, the process of starting a computer is called "booting". There are several parts involved in booting, including a "bootloader".

Bootloader

The bootloader consists of at least two parts, the first part being the MBR. As 512 bytes don't really give you a lot of opportunity to do something other than loading stuff to memory and jumping to it the second part naturally is the code being loaded to memory and executed. We call it the second stage bootloader.

This bootloader has a lot more space at hand and can therefore do things such as presenting a menu to you on screen and reading input from your keyboard to assist in doing things.

Examples of bootloaders include:

  • Linux
    • LiLo
    • Grub
    • syslinux
  • Windows
    • ntldr
    • bootmgr

This second bootloader often is large enough to know how to properly read other data than just "the first 512 byte" meaning it can already read data from the disk that determines it's behaviour. This usually means that you can already use configuration files or an equivalent to instruct this part of the bootloader.

The bootloader then continues to load the first bit of your Operating System, called the "kernel".

Kernel

From here on things are quite different depending on your OS so I'll save it for a later blog post.

Plain Code

In the first blog post I've covered some theory about computers in general so how about we start programming now?

This post might seem a bit boring as we will neither talk about how to actually run the code nor will the code actually output something or interact with anything.

Programming Languages

Syntax describes the structure of code. A programming language is just a way of telling a computer what to do in human readable form. The only thing that programming languages differ in is syntax. That last sentence is just plain wrong.

Every programming language does have some features that set it apart from others, unless somebody felt the need to reinvent the wheel, wasn't aware of the other language(s) or developed it at the same time. Features might be different programming paradigms such as declarative versus imperative programming.

For the sake of simplicity I will just choose the language C, not because it is particularly modern or has a lot of features but because it is lacking magic and is therefore easier to understand because nothing strange is going on behind the scenes and because C is nowadays used as a whole category of syntax. C-like syntax is common; Java, Rust, PHP, JavaScript, it's just everywhere.

First Steps

Let's get into it.

Every program needs a starting point, in case of C it is called main.

Let's do something simple: a program that does nothing.

// this is a comment
void main(void)
{
}

So, what can we see here?

First thing: a comment is something you can only see in the source code. It is just ignored by literally everything.

Second: we see main right? But what's the stuff around it you ask? Well, void means nothing.

C is typed, meaning everything has a type. The notation used for declaring things in C is one of the below:

NameDeclaration
Variabletype name
Arraytype name[length]
Functionreturn_type name(variables)

That means we are declaring a function named main (our entry point) that "computes" nothing and needs nothing to do that. Seems about right.

Furthermore we have curly braces that denote the function body: the start and end of the function.

Functions

Functions aka subroutines, are pieces of code which you can use from other places. This could for example be used for establishing a connection to a server.

A function does have input and output, both of which can be nothing aka void.

In the example of the connection this function could look like:

connection connect(address server,number port)
{
	// do something
	return some_variable_of_type_connection;
}

Variables

Variables are easy. A variable is a piece of memory with a fixed size. As soon as you have a variable you can do with it's memory whatever you like.

C does have certain types of variables defined for you (assuming x86_64 platform):

NameSizeTypeValues
Integer4 byteint-2147483648 to 2147483647
Short2 byteshort int-32768 to 32767
Character1 bytechar-128 to 127
Pointer8 bytetype *reference to memory
Double8 bytedoublevery low to very high floating point
Float4 bytefloatlow to high floating point

Integral types including characters can be made unsigned by prefixing them with unsigned, making them hold only positive numbers, extending their upper bounds by the capped lower bounds.

Integer

An integer is just an integral number, meaning one or two, but not 1.5. You can multiply it's value, divide it and do weird things with it.

Character

Characters are used for text. Meaning you can easily store 'a' or 'b' and using pointers or arrays whole words sentences or whatever you want, we call that a string.

If you are storing an actual character in a char as opposed to a number, then you will find yourself using ASCII, which is just a standard assigning specific numbers to letters. 'a' for example equals the number 97. You usually don't use the numbers directly but character notation:

*[ASCII]: American Standard Code for Information Interchange

char difficult = 97;
char easy = 'a';

ASCII table and other things about ASCII to be found here.

Floating Point

There is a nice IEEE standard, just read that. No, I'm joking, just deal with it that you can store things like 0.5 inside them but that it is not guaranteed that they equal or be exact. If you can, use an integer instead. Especially when dealing with currency. If you deal with currency, just store the smallest unit as integers.

Array

You need fifty integers? Just use int [50].

You can then access it using name[index].

Warning: the first index is 0, not 1.

Pointers

No, I haven't yet explained them. Most people consider them annoying because they always make your programs crash (when you use them wrong). Pointers are easy, just think of them as an integer that holds an address in memory. I've already told you, a variable isn't any different from a few bytes in memory. A simple example:

// get four bytes and write 42 into them
int data = 42;

// store the data's address
int *ptr = &data;

// get more bytes and store the data the ptr points at
int newdata = *ptr;

Pointers and Functions

// sets the memory at some address to the answer
void get_answer(int *data)
{
	*data = 42;
}

// get some bytes
int data = 4;

// set that memory address' memory to the answer
get_answer(&data);

Why would you want this? You'll know when, but if it's possible to do without pointers, try that, unless of course it has serious disadvantages.

When Pointers Go Bad

// just point at some address in memory where we don't have data
int *data = 0;

// access the data
*data;

// program crashes above

Why is 0 special? Well, some people thought, why don't take 0 as a special value for "pointer points at nothing". Since then this is convention. All modern operating systems have the address 0 always "unmapped" meaning if you try to access it, whether it be read or write, the program instantly crashes.

Pointers and Arrays

The name of an array without any index given is a pointer to the memory of the array, meaning you can use it to pass arrays to functions (which is otherwise not permitted). This is also nice if you got your memory from somewhere else than an array declaration.

Furthermore it means that not only is your array a pointer but also that any pointer is an array.

Therefore you can do things like that:

// returns the second element
int second(int *arr)
{
	// remember that the first index is 0
	return arr[1];
}

// create an array with two elements (0 and 1)
// and fill it with the values 1337 and 42
int arr[2] = {1337,42};

// store the result (42) of `second` into `snd`
int snd = second(arr);

Pointer Arithmetic

You can add some numbers to integers and characters of course, but what about pointers? If you add or subtract some number from/to a pointer then the result is a pointer shifted by that amount. Amount in this case means that it is shifted by a number of units of the type of the pointer.

An example:

// pointer at the start of the ram at address 0
int *ptr = 0;

// pointer to the second integer in memory
// this means it points at the address 4
int *further = ptr+1;

I wrote "the second integer in memory" which is kind of a contrary statement to what I said before. Memory is just a bunch of bytes. On RAM level there is no such thing as an integer. But if you write C code that declares some memory address to be handled as an integer then it will be handled as that, even if it's not.

We can see this easily in an example like this:

// regular integer (4 byte)
int i = 1337;

// use the four bytes of `i` as an array
char *arr = &i;

// we can now access the individual parts of the array
arr[0]; // 57
arr[1]; // 5
arr[2]; // 0
arr[3]; // 0

Something worth noting here is that the first element (0) contains the lowest byte of the integer. This is because the platform I am assuming (x86_64) is "little endian". More information on Wikipedia.

Strings

Strings as mentioned above are a special kind of array, a character array. They are very important because in an application you often deal with character arrays, whether it be user input, network connections or just simple output, you always work with byte arrays.

Strings do have some special value at the end, a zero. This is used for termination. Why you ask? Well, imagine an array which holds the text a user entered; variable length and definitely no NUL value inside. Now it is your job to do things with it. How do you know where it ends? You see that having a 0 at the end is perfect for telling where a string ends.

And because we use strings that often there is a convenient way to write them:

// we can leave the number out if we directly initialize it
// because the length is known at compile-time because we set the values
char array[] = {97,104,97,0}; // ASCII for "aha" and a zero at the end

// means the same (including the zero at the end).
char same[] = "aha";

To Be Continued

My Way of Programming

Herein I will describe my programming style as well as my experience with Haskell, as an example for purely functional programming.

I sometimes call my style of programming a-to-b programming. To make this whole thing more practical I will just add a few examples written in Haskell.

a-to-b

What do I mean by a-to-b? Every single application I know is basically just performing data transformation. You have some data a of which you make some data b.

Application

This does not hold true for IO, which includes:

  • storage applications
    • databases (PostgreSQL)
    • caches (varnish)
    • key-value stores (memcached)
  • one-way communication

It does however hold true for applications which you might not expect to be designed that way:

  • generic web applications: ([DataBaseRow],Request) -> ([DataBaseModification],Response)
  • command line applications: ([CliParameter],StdIn) -> (ExitCode,StdOut)

Functional Programming

So how does all of that relate to functional programming (and Haskell)? Imagine a complete web application which is structured into smaller parts each doing one single thing (adhering to the UNIX philosophy). One of these modules could be something along the lines:

-- called with a "map" from path to image and the request
-- returns the response to be sent
handleRequest :: [(Path,Image)] -> Request -> Response
handleRequest images request =
	makeResponse
		(
			resizeImage 
				(getDimension request) 
				(lookup (getPath request) images)
		)

-- creates a response from an image (add headers and so on)
makeResponse :: Image -> Response
-- resize the image somehow
resizeImage :: Dimension -> Image -> Image
-- get the requested dimension from the request
getDimension :: Request -> Dimension

Dear Haskellers, please excuse me for leaving out the Maybe part of the lookup function. It's for readability.

Dear Non-Haskellers, believe me if I tell you that Haskell's type system kind of pushes you into the right direction.

Benefits

These benefits are applicable to both Functional Programming and the a-to-b model:

State

No state-keeping within your logic. Have a look at the handleRequest function above. It does not keep state at all. If anything was to keep state it might just get one state as an argument and return the new state. The state-keeping would then be abstracted from your application. If talking about web applications this could be your session, stored in memcached transparently. Combined with Haskell's laziness you'd probably end up not even requesting unneeded state.

Parallelisable

There will be no race conditions in this code. You will be able to quickly scale out with more threads or even more hosts if needed. You get all this "for free".

Reproducible

Oh noes, you found a bug. Good luck reproducing this without full step-by-step instructions. With functional programming you simply need to log your state and request as soon as an error occurs and you have everything you need to reproduce it. Debugging will be ridiculously easy.

Side-Effect Free

If you have a high unit-test coverage then you have actually tested everything that could go wrong because there are no side-effects in pure functions.

Maintenance

You can easily replace a function that does not perform as it should and, once it works, it probably keeps on working forever.

Profiling

If you profile your application you see how long each function takes and how often it's called. This makes it easy to find a bad performing function and simply rewrite it, if possible, or reduce it's usage or the way it's used.

Shipping Features

Adding a feature is as easy as adding the new functions and putting a call to them somewhere. You cannot break any of the existing parts of the program if you do not touch them due to the complete lack of side-effects.

How to build a RegExp?

I see people abusing regexes just about every day. If you're really good at regexes then you will certainly feel some sort of pain as soon as you see .* or similar constructs in inappropriate places.

So here is my guide to doing it the right way.

Notes

I'm going to use PCRE all over the place. My preferred syntax is m{something} and s{a}{b}g so I'll stick with those.

You can try all examples using:

# for searches, just start typing, quit using ^D
perl -ne 'm{a(.)c} && print "$1\n"'
# for replacements
perl -pe 's{a.c}{abc}g'

Why not use .*?

You're looking for a needle in a haystack. A practical example: you look for your favourite plushie in your room.

A nice regex for that would be m{\bplushie\b}. It looks for your plushie as one word, meaning that it will match only if on each side the word ends. See also word boundaries.

What I see people do, which is completely ridiculous, is m{^.*\bplushie\b.*$}.

Let me explain:

  • They match the beginning of the line even though they don't need it.
  • They match the end of the line.
  • They let their parser go through all of the characters, even though they already found what they were looking for.

If you look for an occurrence somewhere, that does not need to be anywhere specific then why are you looking at everything else? You don't walk into your room and start looking from one side sequentially to the other. What you do is look in your room, see the plushie sitting on the bed and take it.

Complex Examples

Let's choose fail2ban as an example. We want to block every IP sending more than 1000 HTTP requests per five seconds. I'll ignore how to configure fail2ban as it's not relevant to the regex thing.

First try, without even looking at the format of the logs: m{^.*<HOST>.*$}

You probably messed up very hard here. To explain why you've messed up so hard, let's look at one line of logs:

10.0.0.1 - - [16/Jul/2017:15:38:54 +0200] "GET /robots.txt HTTP/1.0" 404 319 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)" "-"

I took that line out of my server, the 10.0.0.1 is the part that we want (please ignore that it's an internal IP).

m{.*} does greedy matching, so the above regex will be very easy to break. Just put an IP in the User Agent. The User Agent contains spaces as you might have noticed, which is fine as the string is quoted (and nginx does some escaping on the contained string). Now what'll happen if I put an IP address in the User Agent? Right, due to greedy matching the rightmost <HOST> will match. This will of course result in some serious problems.

As a malicious hacker I could:

  • Circumvent fail2ban altogether by putting 127.0.0.1 into my User Agent. That will effectively turn off fail2ban for my requests as long as 127.0.0.1 is whitelisted. If that IP is not whitelisted, you've got problems a lot worse (assuming fail2ban uses iptables, this will break at least half of your server's software, think of accessing MySQL on localhost).
  • Block arbitrary IPs from accessing your website in much the same way.

So how are we going to construct a Regex that will match only what we're looking for here?

Well, the IP is right at the start, so we'll take m{^<HOST>.*$} right?

Technically right, but I wouldn't write it that way. That regex would again match everything after the IP, but we really don't care about that.

What we should do is a simple m{^<HOST>}. This works as intended and it does only look at the first few characters. If you want to make sure that it's followed by a space, go ahead, please.

So we end up with a fool-proof regex: m{^<HOST>\s}. This will for sure fulfill all our needs.

To be honest, this example is kind of easy as the IP is right at the start. Let's assume some other format so we can work out a more general way for this to work.

Copy, Paste, Replace

We are going to perform CPR on a line of text. As said above, this time we want to get something that is not right at the start. We'll look at the following line:

[16/Jul/2017:16:27:41 +0200] openbsd.cloud.bsocat.net - - 66.133.109.36 "GET /.well-known/acme-challenge/tTRnUGY9gZEVz2llGWqn1m3mHznMDOFH3zCXsgelh7w HTTP/1.1" 200 87

This is a slightly modified OpenBSD httpd log-format. By slightly I mean:

  • date and time moved to the front
  • host moved a bit backwards

Now let's do this:

# copy the line, verbatim
[16/Jul/2017:16:27:41 +0200] openbsd.cloud.bsocat.net - - 66.133.109.36 "GET /.well-known/acme-challenge/tTRnUGY9gZEVz2llGWqn1m3mHznMDOFH3zCXsgelh7w HTTP/1.1" 200 87

# remove everything after the needle (except for the delimiter), we don't need it
[16/Jul/2017:16:27:41 +0200] openbsd.cloud.bsocat.net - - 66.133.109.36\s

# add needed meta-characters (start of line)
^[16/Jul/2017:16:27:41 +0200] openbsd.cloud.bsocat.net - - 66.133.109.36\s

# escape all the characters that need escaping
^\[16/Jul/2017:16:27:41 \+0200\] openbsd\.cloud\.bsocat\.net - - 66\.133\.109\.36\s

# replace the host
^\[16/Jul/2017:16:27:41 \+0200\] openbsd\.cloud\.bsocat\.net - - <HOST>\s

# replace everything that is not static by their possible values
# this requires a lot of in depth knowledge about the log format
# let's do this the easy way and just replace using dots for the date
^\[../.../....:..:..:.. .....\] openbsd\.cloud\.bsocat\.net - - <HOST>\s

# for the other fields we just specify them to "not contain spaces" as these
# are used for delimiting, so they will not occur in the fields
^\[../.../....:..:..:.. .....\] [^\s]+ [^\s]+ [^\s]+ <HOST>\s

# fail2ban needs spaces escaped I think
^\[../.../....:..:..:..\s.....\]\s[^\s]+\s[^\s]+\s[^\s]+\s<HOST>\s

How do we check?

If it's fail2ban, just run that. For everything else:

perl -ne 'm{^\[../.../....:..:..:..\s.....\]\s[^\s]+\s[^\s]+\s[^\s]+\s([^\s]+)\s} && print "$1\n"'

This should output only the IP we were looking for.

Assignments

So, every once in a while a friend of mine gets into computer science (or something else entirely). I get curious. Like really curious. I know they're a beginner when coding and I'm throwing way too many fancy concepts at them, but I wanna do things properly so friggin' badly.

Sometimes I do some assignments. Not for them. Not for me. Just for the sake of it.

This is a collection.

Finbean Linked List

Finbean

If you don't know Finbean then this meta-paragraph probably doesn't concern you. Basically Finbean studies computer science and writes code. Rookie code I'd probably say, but we've all been rookies at some point, and there's only so far you can make someone go on their assignments, so sometimes you code along.

The Assignment

I don't know exactly what the assignment was, it was something about storing cuisines in a list, and you had to craft the list yourself I assume (otherwise, what the heck Finbean, that assignment would've taken like two minutes). Anyway, language is C++ and the "IDE" was vim, which is powerful enough as it is. Since I use vim daily my configuration is elaborate, and my toolchain consists of a long list of compiler flags and some additional tools (like valgrind or gdb). For this assignment however I used onlinegdb.com as it was easiest, despite running Gentoo and therefore having C++ compilers at hand on every single machine I have.

Code

Back on track, this is the code (ISC licensed):

/*
 * Copyright (c) 2022 benaryorg <binary@benary.org>
 *
 * Permission to use, copy, modify, and distribute this software for any
 * purpose with or without fee is hereby granted, provided that the above
 * copyright notice and this permission notice appear in all copies.
 *
 * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
 * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
 * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
 * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
 * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
 * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
 * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 */

// compile using:
//  g++ \
//    -pedantic -Wall -Wextra -Wcast-align -Wcast-qual -Wctor-dtor-privacy -Wdisabled-optimization
//    -Wformat=2 -Winit-self -Wlogical-op -Wmissing-declarations -Wmissing-include-dirs -Wnoexcept
//    -Wold-style-cast -Woverloaded-virtual -Wredundant-decls -Wsign-conversion -Wsign-promo
//    -Wstrict-null-sentinel -Wstrict-overflow=5 -Wswitch-default -Wundef -Wno-unused -Werror

#include <iostream>
#include <memory>

using namespace std;

typedef void (*iter_func)(const string *data);

void print_node(const string *);

class MyNode
{
	string data;
	unique_ptr<MyNode> next { nullptr };

public:
	MyNode(string data)
	{
		this->data = data;
	}

	bool eq(string data)
	{
		return this->data == data;
	}

	void iterate(iter_func func)
	{
		func(&this->data);
		if(this->next)
		{
			this->next->iterate(func);
		}
	}

	void insert(string data)
	{
		if(data < this->data)
		{
			unique_ptr<MyNode> old = move(this->next);
			this->next = unique_ptr<MyNode>(new MyNode(this->data));
			this->data = data;
			this->next->next = move(old);
		}
		else
		{
			if(this->next)
			{
				this->next->insert(data);
			}
			else
			{
				this->next = unique_ptr<MyNode>(new MyNode(data));
			}
		}
	}

	bool remove(string data)
	{
		if(!this->next)
		{
			return false;
		}
		if(this->next->eq(data))
		{
			this->next = move(this->next->next);
			return true;
		}
		return this->next->remove(data);
	}

	unique_ptr<MyNode> split_after()
	{
		return move(this->next);
	}
};

class MyList
{
	unique_ptr<MyNode> head { nullptr };

public:
	MyList()
	{
	}

	void iterate(iter_func func)
	{
		if(this->head)
		{
			this->head->iterate(func);
		}
	}

	void insert(string data)
	{
		if(this->head)
		{
			this->head->insert(data);
		}
		else
		{
			this->head = unique_ptr<MyNode>(new MyNode(data));
		}
	}

	bool remove(string data)
	{
		if(!this->head)
		{
			return false;
		}
		return false;
		if(this->head->eq(data))
		{
			this->head = this->head->split_after();
			return true;
		}
		return this->head->remove(data);
	}
};

void print_node(const string *data)
{
	cout << "data: " << *data << endl;
}

int main()
{
	unique_ptr<MyList> list(new MyList());

	cout << "first run (empty):" << endl;
	list->iterate(print_node);

	list->insert("tomato salad");
	cout << "second run (1):" << endl;
	list->iterate(print_node);

	list->insert("french fries");
	cout << "third run (2):" << endl;
	list->iterate(print_node);

	list->insert("potato salad");
	cout << "fourth run (3):" << endl;
	list->iterate(print_node);

	list->insert("marinated manatee");
	cout << "fifth run (4):" << endl;
	list->iterate(print_node);

	list->remove("marinated manatee");
	cout << "sixth run (3):" << endl;
	list->iterate(print_node);

	return 0;
}

Questionare

Besides writing this and answering questions if people had some (also teaching some peers how to use the basics of gdb, though that's not quite related to this) there was also this questionaire that came up in the end. This was also relayed by Finbean, and of course it may very well be subject to intellectual property, so I reproduce these questions here hoping that neither me nor Finbean will be reprimanded for it. If you want it taken down hit me up via the media of your choice, my website1 should list something you can use.

1

Note the website is IPv6-only, file a complaint with your ISP if you can't access it, though I can't help but think that my infrastructure may be at fault.

Used Structure

a. How well did the data structure selected perform for the assigned application?

I chose a list with most methods implemented via tail recursive calls on the nodes. This performed reasonably well since the ergonomics are at focus for this assignment. Had performance in terms of memory, cpu, or scalability constraints been an issue the lack of short-circuiting in the iterative approach of traversing the list would have been a limiting factor.

Alternative Structures

b. Would a different data structure work better? If so, which one and why...

My list was singly linked, therefore implementation complexity was low which was the important thing here. It also was more of a set-like approach (a PriorityQueue in Java or BinaryHeap in Rust), keeping the list sorted internally at all times, this would've allowed some performance improvements, though nothing big compared to the alternatives. However for performance reasons a hashset or hashmap would have been more suitable. The constant-time of hashing adds some overhead but allows O(log(n)) indexing which for large sets of data outperforms any iterative approach.

Efficiency Considerations of Used Structure

c. What was efficient about your design and use of the data structure?

The tail recursion did make up for constant memory usage while keeping implementation complexity to a minimum. The separation of list and node functions, the latter carrying the brunt of logic, did allow for a separation of internals and interface while still keeping internals structured. Keeping the list sorted internally was also a benefit when it comes to search, but that was barely necessary. If insertion speed was key, then this would indeed be a serious flaw.

Ineffeciencies of Used Structure

d. What was not efficient?

Insertion speed mostly, but as said, everything compared to a hash-approach.

Critical Approach

e. What would you do differently if you had more time to solve the problem?

Throw math at the problem, in a nutshell. Many existing algorithms are faster than mine mathematically. I'd leverage them where possible.

For e I'd like to add that if I had more time I'd like to have a talk with the product owners or stakeholders (or whatever internal position is responsible for clarifying requirements) as to what use-case should be optimised for. Are we focusing write speed? Memory compactness? Search time? The solution strongly depends on these, for example Perfect Hash Functions (PHFs) can be used to achieve constant time access in a map, which would be useful if the list was either constant/static or rarely ever changed. Adding new elements would be terribly expensive then though. Also, a database might be much better suited if filtering was important.

If complexity would rise above basic types of sets or lists then I'd very likely opt for using an in-memory sqlite database, which is both trivial in terms of license (public domain) as well as embeddable (it's in virtually every smartphone) and highly performant [citation needed] and scalable (unless you require parallel writable access). It also serialises nicely, and can leverage edge cases such as expensive flash storage by using append-mechanisms and whatnot. Basically for storing structured data, sqlite is the way to go if you need querying.

Words to the Professor

I know I didn't fully finish the assignment, heck I didn't even have access to it to begin with, however I'd like your honest opinion on this if possible. To add a few things about my code: yes, my class naming sounds more like a tutorial than anything. Also, I would've gone the full route and either use templating or generics to get myself a generic list, or map even, that would allow me to re-use it for several parts of the assignment, again the full scope of which I didn't have access to, hence I only ever implemented the basic list type and the bare minimum of methods on it. Also, I'm "new" to C++, usually sticking to Rust, C, and various scripting language since I'm a systems administrator by profession (though both developer and devops suite my skillset equally well), hence not making use of the very powerful features of templating that C++ offers, a double edged sword, but one that when wielded properly can be a game-changer. In practice I'd stay away from templating though, since maintainability decreases drastically in such cases, and in my opinion maintainability is more important than performance often enough. Also note that modern compilers optimize better than humans can do, so the best we can do is choose an approach and algorithm and hope for the best, implementation specifics rarely matter, and making the code concise on a high-level often increases speed due to abstractions applied by the optimisation layer, such as tail-call optimisation.

Feel free to send me an email if you can be bothered to go through this.

Further Info

By the way, examples of such optimisations are the rigorous use of call-by-value even for large structs. The compiler will figure out how to pass it indirectly or how to make use of the stack. Generally I prefer using the stack wherever possible for that reason exactly.