Getting started with iocaine & containers

Overview

In this short guide, we will set up iocaine from scratch, using its built-in handler, in a container (using docker compose). Despite its simplicity, the built-in handler is still powerful, and in this author’s experience, will route the vast majority of unwanted visitors into iocaine’s maze.

We will not be writing a custom request handler; that will be a separate guide.

  Important

On these pages, we’ll document the process of setting up iocaine in a containerized environment. If you want to run iocaine on the host itself, see the main Getting Started guide. If you’re using NixOS, there’s a guide for that, too.

Requirements

The requirements below are for this guide, not necessarily for iocaine itself. We will not cover setting up a reverse proxy here, that part is covered by the main guide - it is exactly the same whether iocaine is running on the host, or in a container.

As such, this guide assumes we have:

Throughout this guide, we’ll be working from a single directory, the home of our compose.yaml; all the auxiliary files we need will be put in subdirectories. To make things easier, the following layout is what this guide will assume going forward:

# tree -d .
.
└── data
    ├── corpus
    └── etc
        └── config.d

Let’s create those directories first!

# mkdir -p data/corpus data/etc/config.d

Getting familiar with iocaine

Before we begin our journey of configuring iocaine, let us take a moment to see what it can do out of the box, without any configuration whatsoever. We’ll download the example compose.yaml, and take it for a spin.

# curl -sL \
    https://git.madhouse-project.org/iocaine/iocaine/raw/branch/iocaine-3.x/data/compose.yaml \
    -o compose.yaml

Let’s start it up! It will print a warning message, and keep running. We’ll talk more about that warning a little later.

# docker compose run --rm -P iocaine
2025-11-02T13:56:20.697583Z  WARN iocaine::user: No ai-robots-txt-path configured, using default
2025-11-02T13:56:20.739065Z  INFO iocaine::morgue: starting iocaine
2025-11-02T13:56:20.739724Z  INFO iocaine::morgue: iocaine ready

We can get more logs out of it by setting the RUST_LOG environment variable to iocaine=trace (the supported log levels are, in order of decreasing verbosity: trace, debug, info, warn, and error):

# docker compose run --rm -P -e RUST_LOG=iocaine=trace iocaine
2025-11-02T13:56:47.981203Z DEBUG iocaine::config: loading configuration config_file="/etc/iocaine/config.d/00-default-bind.kdl"
2025-11-02T13:56:47.981859Z DEBUG iocaine::sex_dungeon::means_of_production: using the embedded handler
2025-11-02T13:56:47.983102Z TRACE iocaine::sex_dungeon::means_of_production: compiling init
2025-11-02T13:56:48.020451Z TRACE iocaine::sex_dungeon::means_of_production: compilation finished
2025-11-02T13:56:48.020913Z TRACE iocaine::sex_dungeon::means_of_production: running init
2025-11-02T13:56:48.021241Z DEBUG iocaine::user: Registering metrics
2025-11-02T13:56:48.021464Z  WARN iocaine::user: No ai-robots-txt-path configured, using default
2025-11-02T13:56:48.040562Z DEBUG iocaine::user: Loading embedded HTML template
2025-11-02T13:56:48.041059Z DEBUG iocaine::user: Initializing template engine
2025-11-02T13:56:48.041495Z TRACE iocaine::sex_dungeon::means_of_production: init finished
2025-11-02T13:56:48.043989Z TRACE iocaine::sex_dungeon::means_of_production: compiling the main script
2025-11-02T13:56:48.066084Z TRACE iocaine::sex_dungeon::means_of_production: compilation finished
2025-11-02T13:56:48.066911Z  INFO iocaine::morgue: starting iocaine
2025-11-02T13:56:48.067335Z  INFO iocaine::morgue: iocaine ready

We can do all kinds of fancy stuff with logging, but this was enough of a side quest already! Let’s see: does it work? We can curl it (in another terminal)!

# curl -is http://127.0.0.1:42069/
HTTP/1.1 421 Misdirected Request
content-length: 0
date: Sun, 19 Oct 2025 09:12:56 GMT

Why that port? Because that’s where iocaine binds by default! Let’s check its config:

# docker compose run --rm -P iocaine show config
initial-seed ""
http-server default {
    bind "0.0.0.0:42069"
    use handler-from=default
}
declare-handler default language=roto

The “421 Misdirected Request” response is iocaine’s signal that the real content should be served instead. We can also test what happens if we send it a request it deems garbage:

# curl -Is http://127.0.0.1:42069/ -A Perplexity
HTTP/1.1 200 OK
content-type: text/html
content-length: 2467
date: Sun, 19 Oct 2025 09:14:08 GMT

We sent a HEAD request this time, with curl -I, but only because this author did not want to paste a lot of garbage into this guide. Replace the -I with -i, or drop it entirely, and marvel at the unintelligible junk iocaine generates out of its own source code!

Adjusting the configuration

This is all fine and great, and will stop a lot of crawlers, but there’s still that warning about ai-robots-txt-path. You see, iocaine ships with a copy of ai.robots.txt’s robots.json, to ward off crawlers that identify themselves. But iocaine’s copy is only updated when a new iocaine release is cut - we may wish to update it more often than that.

To do so, let’s grab the most recent copy of it, directly from ai.robots.txt’s main branch:

# curl -L https://github.com/ai-robots-txt/ai.robots.txt/raw/refs/heads/main/robots.json \
       -o data/ai.robots.txt-robots.json
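It can be instructive to peek at what we just downloaded. Assuming jq is available, and that robots.json is a JSON object keyed by user-agent name (its layout at the time of writing), we can list which crawlers the file covers:

```shell
# List the user agents covered by the downloaded robots.json.
# Assumes the file is a JSON object keyed by user-agent name.
jq -r 'keys[]' data/ai.robots.txt-robots.json | head
```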

Previously, we’ve seen a warning about ai-robots-txt-path not being configured. Now that we have a copy of this file, let’s tell iocaine about it. The way to do that is through partial configuration snippets: we can place files with partial configuration into a directory, tell iocaine about said directory, and it will merge them all. If we do not give iocaine a configuration file to load, it will use its embedded default. We can use this to our advantage, and extend the default configuration!

Let’s have a look at that default again!

# docker compose run --rm -P iocaine show config
initial-seed ""
http-server default {
    bind "0.0.0.0:42069"
    use handler-from=default
}
declare-handler default language=roto

It’s the “handler” we need to apply configuration to. Doing so is simple:

declare-handler default {
  ai-robots-txt-path "/data/ai.robots.txt-robots.json"
}

The service in the container is configured by default to load configuration parts from /data/etc/config.d (part of the reason why we created that in the first place!). Let’s place the snippet above into data/etc/config.d/00-ai.robots.txt.kdl, and see how the merged configuration looks:

# docker compose run --rm -P iocaine --config-path /data/etc/config.d show config
initial-seed ""
http-server default {
    bind "0.0.0.0:42069"
    use handler-from=default
}
declare-handler default language=roto {
    ai-robots-txt-path "/data/ai.robots.txt-robots.json"
}

It picked up our configuration! But why do we need the --config-path /data/etc/config.d? Didn’t the compose.yaml set that as the command to run anyway? Great question! One that had this author confused for longer than he’s willing to admit, but the answer is simple: any argument after the service name on the docker compose run command line overrides the command in the compose file. Thus, we must repeat --config-path /data/etc/config.d in there for iocaine to pick the directory up for our show config.
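To make the gotcha concrete, the relevant part of the example compose.yaml looks roughly like this (an assumed fragment; consult the downloaded file for the exact contents):

```yaml
services:
  iocaine:
    # This command applies when the container starts normally
    # (docker compose up), but it is replaced wholesale by any
    # arguments given after the service name to docker compose run.
    command: --config-path /data/etc/config.d
```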

But enough distractions; let’s see if it worked:

# docker compose run --rm -P -e RUST_LOG=iocaine=debug iocaine
2025-11-02T13:58:54.332448Z DEBUG iocaine::config: loading configuration config_file="/etc/iocaine/config.d/00-default-bind.kdl"
2025-11-02T13:58:54.333086Z DEBUG iocaine::config: loading configuration config_file="/data/etc/config.d/00-ai.robots.txt.kdl"
2025-11-02T13:58:54.333794Z DEBUG iocaine::sex_dungeon::means_of_production: using the embedded handler
2025-11-02T13:58:54.373332Z DEBUG iocaine::user: Registering metrics
2025-11-02T13:58:54.373591Z DEBUG iocaine::user: Loading ai-robots-txt from /data/ai.robots.txt-robots.json
2025-11-02T13:58:54.391235Z DEBUG iocaine::user: Loading embedded HTML template
2025-11-02T13:58:54.391649Z DEBUG iocaine::user: Initializing template engine
2025-11-02T13:58:54.414062Z  INFO iocaine::morgue: starting iocaine
2025-11-02T13:58:54.414684Z  INFO iocaine::morgue: iocaine ready

It certainly picked the config up, and loaded it according to the logs, yay! With this set up, we can send any of the bots listed in ai.robots.txt into our infinite maze of Rusty garbage!

# curl -Is http://127.0.0.1:42069/ -A ClaudeBot
HTTP/1.1 200 OK
content-type: text/html
content-length: 2467
date: Sun, 19 Oct 2025 10:09:03 GMT

There are a whole lot of options the built-in handler supports - we’re not going to repeat all of them here, only some of the more important ones, like the ai-robots-txt-path option we’ve just played with.

A better corpus

Out of the box, iocaine will use its own source code as its corpus. While that does end up generating completely nonsensical garbage, it is neither a particularly big nor a particularly varied corpus. We could be using something better. How about Orwell’s 1984, combined with Huxley’s Brave New World?

Let’s grab a copy of these from archive.org!

# curl -L https://archive.org/download/GeorgeOrwells1984/1984_djvu.txt \
       -o data/corpus/1984.txt
# curl -L https://archive.org/download/ost-english-brave_new_world_aldous_huxley/Brave_New_World_Aldous_Huxley_djvu.txt \
       -o data/corpus/brave-new-world.txt

We could use the above books as our wordlist too, but we’ll get a larger set of words out of a wordlist collection. We’ll grab one from miscfiles:

# curl -L https://git.savannah.gnu.org/cgit/miscfiles.git/plain/web2 \
       -o data/corpus/words.txt
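Before wiring these up, a quick sanity check that the downloads are not error pages is worthwhile; even just checking that they are non-trivially sized will do (file names as created above):

```shell
# Rough sanity check: each corpus file should be well over a few
# kilobytes; an HTML error page typically is not.
wc -c data/corpus/1984.txt data/corpus/brave-new-world.txt data/corpus/words.txt
```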

We’ll need to tell iocaine to use these, too, so let’s drop another configuration snippet into, say, data/etc/config.d/01-sources.kdl:

declare-handler default {
  sources {
    training-corpus "/data/corpus/1984.txt" \
                    "/data/corpus/brave-new-world.txt"
    wordlists "/data/corpus/words.txt"
  }
}

If we restart iocaine now, the generated garbage will be far less Rusty.

Observing the Crawlers

The built-in handler supports metrics too, and if a prometheus-server is configured, it will make a number of metrics available (along with the usual process metrics):

qmk_requests{host}

The number of requests served, keyed by host.

qmk_ruleset_hits{ruleset, outcome}

Number of times a particular rule was hit, and its outcome. The outcome is either garbage or default, and the rulesets are ai.robots.txt, major-browsers, unwanted-visitors, or default.

qmk_garbage_generated{host}

Amount of garbage generated, in bytes, keyed by host.

We like big numbers and pretty graphs, so while there is no example dashboard (yet), we can still enable a Prometheus server, and start collecting! We’ll also tell iocaine to persist these metrics, so that we don’t start from zero every time iocaine is restarted.

Let’s drop the following configuration snippet into data/etc/config.d/02-metrics.kdl:

prometheus-server metrics {
  bind "0.0.0.0:42042"
  persist-path "/run/iocaine/qmk-metrics.json"
  persist-interval "1h"
}

http-server default {
  use metrics=metrics
}

As this is running on a different port, we need to adjust our compose.yaml to expose it: let’s add 127.0.0.1:42042:42042 to the ports list:

--- compose.yaml.orig
+++ compose.yaml
@@ -9,6 +9,7 @@
     restart: unless-stopped
     ports:
       - '127.0.0.1:42069:42069'
+      - '127.0.0.1:42042:42042'
     volumes:
       - ./data:/data
       - iocaine-state:/run/iocaine

Once we restart iocaine, the metrics will be available immediately at http://127.0.0.1:42042/metrics:

# curl -s http://127.0.0.1:42042/metrics | grep '^iocaine_version'
iocaine_version{version="3.0.0-snapshot"} 1

The metrics mentioned above will appear in this listing as soon as iocaine has seen some traffic. Let’s give it some, and check the metrics!

# curl -s http://127.0.0.1:42069/ >/dev/null
# curl -s http://127.0.0.1:42069/ -A ClaudeBot >/dev/null
# curl -s http://127.0.0.1:42069/ -A Perplexity >/dev/null
# curl -s http://127.0.0.1:42069/ -A "Mozilla/5.0 Firefox/0" >/dev/null
# curl -s http://127.0.0.1:42042/metrics | grep '^qmk_'
qmk_garbage_generated{host="127.0.0.1:42069"} 5370
qmk_requests{host="127.0.0.1:42069"} 4
qmk_ruleset_hits{outcome="default",ruleset="default"} 1
qmk_ruleset_hits{outcome="garbage",ruleset="ai.robots.txt"} 1
qmk_ruleset_hits{outcome="garbage",ruleset="major-browsers"} 1
qmk_ruleset_hits{outcome="garbage",ruleset="unwanted-visitors"} 1
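Since the exposition format is plain text, quick summaries do not need a full Prometheus deployment. As a sketch, here is one way to tally qmk_ruleset_hits by outcome with awk, fed with the sample lines from above (in practice, pipe curl -s http://127.0.0.1:42042/metrics into the same awk):

```shell
# Tally qmk_ruleset_hits by outcome. Splitting on braces, quotes,
# commas, '=' and spaces turns each line into alternating label
# names and label values, with the sample count as the last field.
awk -F'[{}",= ]+' '/^qmk_ruleset_hits/ {
  for (i = 1; i <= NF; i++)
    if ($i == "outcome") outcome = $(i + 1)
  hits[outcome] += $NF
}
END { for (o in hits) print o, hits[o] }' <<'EOF'
qmk_ruleset_hits{outcome="default",ruleset="default"} 1
qmk_ruleset_hits{outcome="garbage",ruleset="ai.robots.txt"} 1
qmk_ruleset_hits{outcome="garbage",ruleset="major-browsers"} 1
qmk_ruleset_hits{outcome="garbage",ruleset="unwanted-visitors"} 1
EOF
```

This prints one line per outcome (in unspecified order), summing the hit counts across rulesets.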

Final remarks

Now that we have iocaine up and running in a container, we can integrate it with our reverse proxy of choice. Doing so is exactly the same as if iocaine was running on the host, so we will not repeat the steps shown in the main guide.

Now go, tweak it further, watch the metrics, and see the Crawlers get trapped in the maze.