Starting with iocaine 2.2.0, it is possible to configure a custom request handler. The handler decides how to proceed with each request: accept it and serve it garbage or a challenge, or reject it and let the reverse proxy serve the upstream source. This makes it possible to move bot detection into iocaine itself.
There are multiple scripting languages available: Roto (the default, and most efficient), Lua, and Fennel (the latter two since iocaine 2.4.0). They all provide the same set of tools and possibilities, in different ways, and with different compromises. For detailed information about how they each work, see the language-specific guides linked above.
The difference between Fennel and Lua is that for Lua, you just write the script, configure iocaine, and you’re good to go. For Fennel, you either have to pre-compile your code to Lua, or tell iocaine the location of the fennel compiler, so it can compile the script itself. Apart from this, the two languages are equivalent in functionality and performance.
The difference between Roto and Lua/Fennel is much more substantial, because Roto is a very different language. It’s a statically typed, compiled language with a lot of limitations: it has no support for loops, lists, or arbitrary key-value maps. This makes writing scripts in Roto more involved. On the other hand, Roto is much faster.
On top of this, the Roto engine was deployed in production before it even hit a stable release, with the author’s Nam-Shub of Enki project as the driver.
A note on performance
The simplest benchmark - rejecting all requests - puts Roto at two orders of magnitude faster than Lua. The difference diminishes slightly with more complex scripts, where Lua/Fennel allow writing things in a more performant manner, but it will remain in favour of Roto.
Nevertheless, even when using Lua/Fennel, the request handler will not be the bottleneck: a typical Lua/Fennel request handler is still 4-5x faster than generating a page of garbage, and garbage generation is also very efficient. Add the overhead of HTTP and a reverse proxy on top, and your choice of language will be largely irrelevant as far as performance goes.
Organizing request handler scripts
Regardless of the language used, the general organization of a script is largely the same: we prepare a number of patterns, regexes, regex sets, and network prefixes ahead of time, during iocaine’s startup. These will be used later by the decision function to figure out the fate of each request.
The reason for this split is simple: performance. The patterns, regexes, and other things we set up at init time are static: they don’t change between requests, and they can all be compiled into a form more suitable for matching than the textual representation we write. This compilation step is comparatively slow, so we’d prefer not to repeat it for each incoming request.
All languages available for scripting provide mechanisms to do this, in their own, language-specific manner - but the core functionality is always there.
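The split between startup and decision time can be sketched in a few lines. The following is a minimal illustration in Python, not iocaine’s actual scripting API: the Request class, the Handler class, and the "garbage" and "reject" outcome names are all hypothetical, chosen only to mirror the description above.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Request:
    """Hypothetical read-only view of an incoming request."""
    method: str
    path: str
    headers: dict = field(default_factory=dict)


class Handler:
    def __init__(self):
        # Init time: runs once at startup, so the comparatively slow
        # compilation step happens here and nowhere else.
        self.bad_agents = re.compile(r"GPTBot|CCBot|ClaudeBot", re.IGNORECASE)

    def decide(self, request):
        # Decision time: runs for every request and only consults
        # the state prepared in __init__.
        agent = request.headers.get("user-agent", "")
        if self.bad_agents.search(agent):
            return "garbage"  # accept the request and serve it garbage
        return "reject"       # let the reverse proxy serve upstream


handler = Handler()
crawler = Request("GET", "/", {"user-agent": "Mozilla/5.0 (compatible; GPTBot/1.0)"})
print(handler.decide(crawler))
```

Constructing a Request by hand, as done for the crawler example, is also how such a handler could be exercised without real traffic.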
Core features
Each language provides access to a number of core things, one way or another, implemented in a way that makes sense for each particular language. Consult the reference guides (Roto, Lua, Fennel) for specifics for each language.
The incoming request
For each request iocaine receives, the request handler’s decision function is invoked with a request parameter, which gives the script easy access to the request method and path, all headers, all query parameters, and all cookies. All of these are read-only.
Each engine also gives the script the ability to construct mock requests on its own, for testing purposes.
Efficient matching of multiple substrings
A common pattern in request handlers is matching the User-Agent header against a large list of static substrings, such as crawler identifiers sourced from ai.robots.txt. Checking these one by one, or as a regexp, or even as a set of regexps, is far less efficient than using the Aho-Corasick algorithm to find occurrences of many patterns at once.
As such, each engine provides a way to construct a matcher at initialization time, and then use that matcher to see if a given string matches any of the patterns we set up previously.
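To show why this approach scales, here is a compact Aho-Corasick automaton written in Python. It is an illustrative sketch only: the SubstringMatcher class and its is_match method are hypothetical names, and iocaine’s engines expose their own matcher constructors. The point is that construction does all the expensive work, and matching walks the input exactly once regardless of how many patterns were loaded.

```python
from collections import deque


class SubstringMatcher:
    """Minimal Aho-Corasick automaton: does the text contain any pattern?"""

    def __init__(self, patterns):
        # The trie is stored as parallel lists; node 0 is the root.
        self.goto = [{}]
        self.fail = [0]
        self.terminal = [False]
        for pattern in patterns:
            if not pattern:
                continue  # an empty pattern would match everything
            node = 0
            for ch in pattern:
                if ch not in self.goto[node]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.terminal.append(False)
                    self.goto[node][ch] = len(self.goto) - 1
                node = self.goto[node][ch]
            self.terminal[node] = True
        # Breadth-first pass computing the failure links.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                fallback = self.fail[node]
                while fallback and ch not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                self.fail[child] = self.goto[fallback].get(ch, 0)
                # A node also counts as a hit if one of its suffixes is a pattern.
                if self.terminal[self.fail[child]]:
                    self.terminal[child] = True
                queue.append(child)

    def is_match(self, text):
        node = 0
        for ch in text:
            # Follow failure links until a transition exists (or we hit the root).
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            if self.terminal[node]:
                return True
        return False


# Built once at initialization time, reused for every request.
crawlers = SubstringMatcher(["GPTBot", "CCBot", "Bytespider"])
print(crawlers.is_match("Mozilla/5.0 (compatible; GPTBot/1.1)"))
```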
Regular expressions
When matching a static string is not enough, for example when we want to extract part of the input for further inspection, each engine lets you compile regular expressions at initialization time, and then match against them, or extract named groups from them, during the decision making process.
This is less efficient than pattern matching or matching an entire set of regexes, but it allows extracting named capture groups. When there’s only a single regular expression to match, the performance is the same as if you used a regexp set, but the setup is - usually - simpler.
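As an illustration of the named-group use case, the sketch below uses Python’s standard re module rather than any iocaine engine. The CHROME_VERSION pattern and the chrome_major helper are hypothetical; they stand in for a regex compiled at initialization time and consulted at decision time.

```python
import re

# Compiled once, at initialization time.
CHROME_VERSION = re.compile(r"Chrome/(?P<major>\d+)\.")


def chrome_major(user_agent):
    """Return the claimed Chrome major version, or None if there is none."""
    match = CHROME_VERSION.search(user_agent)
    return int(match.group("major")) if match else None


agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/126.0.0.0 Safari/537.36"
print(chrome_major(agent))
```

A decision function could then, for instance, treat implausibly old major versions with more suspicion.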
Matching a string against a set of regular expressions
In case we want to match a string against a set of regular expressions, iocaine’s scripting engines have us covered too. Why match against a set of regexes? Because that allows overlapping matches, and in general, allows for simpler regexes and better organization. The matching is still done in a single pass.
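The semantics of a regex set can be sketched as follows. Note the caveat: Python’s re module has no single-pass regex set, so this hypothetical RegexSet class simply tries each pattern in turn. It shows what a set reports (which of several simple, possibly overlapping patterns matched), not how the engines achieve it in one pass.

```python
import re


class RegexSet:
    """Illustrative stand-in for a regex set; loops instead of a single pass."""

    def __init__(self, patterns):
        # Compiled once, at initialization time.
        self._compiled = [re.compile(p) for p in patterns]

    def matches(self, text):
        """Indices of every pattern that matches; overlaps are allowed."""
        return [i for i, rx in enumerate(self._compiled) if rx.search(text)]

    def is_match(self, text):
        return any(rx.search(text) for rx in self._compiled)


# Three small patterns instead of one large alternation.
suspicious_paths = RegexSet([r"^/wp-(admin|login)", r"\.php$", r"^/\.git/"])
print(suspicious_paths.matches("/wp-login.php"))
```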
Matching an IP address against a set of network prefixes
If one wants to check whether the requesting IP address is part of a larger network, the engines provide a way to easily do so: network prefixes in CIDR notation can be collected into network sets, and the value of the X-Forwarded-For header can be checked against these sets, efficiently. The network prefixes given in CIDR notation are compiled into an efficient representation behind the scenes, so at decision time, the task at hand will not be much more complicated than comparing a handful of numbers.
Why network sets, not just singular networks? Because that lets us collect entire ASNs into a single set. Yes, you can pre-compile entire ASN->Prefix lists within the request handler, and while the startup time might be considerably slower, once that’s done, the decision making is barely affected. With this, you do not need an external service to provide ASN and GeoIP lookups - you can encode both into a request handler.
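A network set can be approximated with Python’s standard ipaddress module. This is a sketch under stated assumptions: the NetworkSet class and its contains method are hypothetical names, the prefixes are documentation ranges rather than a real ASN, and the lookup is a plain linear scan, whereas the engines compile the prefixes into a more efficient representation.

```python
import ipaddress


class NetworkSet:
    """Hypothetical set of CIDR prefixes, parsed once at initialization time."""

    def __init__(self, prefixes):
        self._networks = [ipaddress.ip_network(p) for p in prefixes]

    def contains(self, address):
        try:
            ip = ipaddress.ip_address(address.strip())
        except ValueError:
            return False  # not a valid IP address, so it cannot be a member
        # Membership is a mask-and-compare on integers; an IPv4 address
        # is never a member of an IPv6 network, and vice versa.
        return any(ip in net for net in self._networks)


# All prefixes of one (made-up) network operator collected into a single set.
example_asn = NetworkSet(["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"])
print(example_asn.contains("198.51.100.42"))
```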
Working with Sec-CH-UA headers
Chrome and Chrome-derived browsers (with few exceptions) have implemented User-Agent Client Hints for a number of years now. Of particular interest to us is the Sec-CH-UA header, which - when sent - contains identifiers about the browser. This can be very useful in deciding whether a browser that claims to be a Chrome derivative really is one. If it doesn’t send an appropriate Sec-CH-UA header, chances are it isn’t really a human-controlled browser.
To make this simpler, and more reliable, each engine provides a way to check whether a particular named item is listed in the Sec-CH-UA header. There is no built-in mechanism to extract the associated value yet, though.
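The kind of check described here can be sketched in Python. Everything named below (brands, has_brand, looks_inconsistent) is hypothetical and not part of iocaine; the sketch also assumes that a genuine Chrome derivative lists a "Chromium" item, which holds for common browsers but is a heuristic, not a guarantee. The regex pulls out each quoted item name that is followed by a version parameter, which keeps it robust against the odd punctuation browsers deliberately put into placeholder brands.

```python
import re

# A quoted item name followed by its ;v="..." parameter.
_BRAND = re.compile(r'"((?:[^"\\]|\\.)*)"\s*;\s*v="')


def brands(sec_ch_ua):
    """List the item names found in a Sec-CH-UA header value."""
    return _BRAND.findall(sec_ch_ua)


def has_brand(sec_ch_ua, name):
    return name in brands(sec_ch_ua)


def looks_inconsistent(headers):
    """True if the User-Agent claims Chrome but Sec-CH-UA does not back it up."""
    claims_chrome = "Chrome/" in headers.get("user-agent", "")
    return claims_chrome and not has_brand(headers.get("sec-ch-ua", ""), "Chromium")


header = '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"'
print(brands(header))
```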
Metrics
Each engine grants access to a metric (iocaine_request_handler_hits{id="..."}), which the script can increment, with a value of its choosing for the id label. This makes it possible to, for example, increment the metric each time a particular ruleset in the request handler is hit, using a different id for each ruleset.
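Conceptually, this is a counter keyed by the id label. The Python sketch below models that idea only: the HandlerMetrics class, its inc and render methods, and the rule names are hypothetical, and the real metric is maintained and exported by iocaine, not by the script.

```python
from collections import Counter


class HandlerMetrics:
    """Hypothetical model of a hit counter labelled by rule id."""

    def __init__(self):
        self._hits = Counter()

    def inc(self, rule_id):
        # One increment per rule hit; the label value is up to the script.
        self._hits[rule_id] += 1

    def render(self):
        """Render the counters in a Prometheus-style text form."""
        return [
            f'iocaine_request_handler_hits{{id="{rule_id}"}} {count}'
            for rule_id, count in sorted(self._hits.items())
        ]


metrics = HandlerMetrics()
metrics.inc("ai.robots.txt")
metrics.inc("ai.robots.txt")
metrics.inc("bad-network")
print("\n".join(metrics.render()))
```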
Logging
All engines provide helpers to construct log messages in JSON format, written to iocaine’s standard output, in a format compatible with VictoriaLogs. Languages that have a built-in print or similar function are free to output in whatever other format they’d like.
JSON parsing
Every engine provides a way to load JSON files, and process their contents in a limited way. The main use-case here is loading the robots.json file of the ai.robots.txt project, and extracting the keys (the robot identifiers).
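The main use-case amounts to very little code. The Python sketch below uses an inline sample instead of the real robots.json file; the robot_identifiers helper is hypothetical, and the per-robot values are left as empty objects because only the keys matter here.

```python
import json

# Stand-in for the contents of robots.json: an object keyed by robot identifier.
SAMPLE = '{"GPTBot": {}, "CCBot": {}, "Bytespider": {}}'


def robot_identifiers(text):
    """Return the top-level keys of the JSON document, sorted for stability."""
    data = json.loads(text)
    return sorted(data.keys())


# Done once at initialization time; the result can feed a substring matcher.
print(robot_identifiers(SAMPLE))
```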
Testing
Every engine has built-in support for running tests embedded in the request handler script. This can be used to verify that the ruleset works as expected, prior to deploying it. Use the iocaine test command to run them.
How tests are implemented is - like everything else - language specific. But each engine supports them fully.