OCaml-CI : A Zero-Configuration CI

Hi! I'm Thomas Leonard and today I'd like to talk to you about OCaml-CI, which is a new CI system we've built at
OCaml Labs specifically for testing OCaml projects. I'll start by giving a brief
introduction to the CI and a short demo, and then we'll look at some of the
unusual features that the CI provides. Then, in the second half of this talk, I'd
like to go over some of the technologies that we've used to build it:
the Cap'n Proto RPC system, a new opam dependency solver,
the OCurrent pipeline language, and the OCluster build scheduler.
So, OCaml-CI itself is a CI for OCaml projects that does not require any
specific configuration. Once you've told it to provide CI
for your GitHub repository, it will look at your project's existing
opam and dune files to work out what to build
and which platforms to build it on.

And it uses caching
to make this fast. In particular, if you don't change your opam files,
it will not need to reinstall the dependencies.
It's currently in beta, and it's deployed on around 100 GitHub projects
at the moment. So, just to show an example of the CI
being used, I'm going to create a test pull request now… And as we can see the CI is now building
it. And when we go to the web interface we
can see a list of all the builds that it's doing at the moment.
You have to press refresh to get updates,
but as you can see we're building on a variety of platforms
and we're doing them fairly quickly. That took about
10 seconds to build on Alpine, Debian, Fedora, openSUSE, and Ubuntu, with a range
of compiler versions from 4.08 here to 4.11,
mostly on x86_64, but also on 32-bit ARM, ARM64,
PowerPC 64, and x86_32. So to look in a little bit more detail
at what happened here we can look at the analysis phase,
which is the first step it does, and here the CI goes to your project and it looks
at all the opam files and it performs an opam solve
against each of the platforms the CI supports.
And some of these will succeed (here, for example, we found a solution
using OCaml 4.11 on Debian) and others will fail: for example, OCaml
4.04 cannot be used with this
repository.

And if you want more information about
exactly which versions it chose, or why it
couldn't choose them, you can scroll down further.
This is showing the packages selected for 4.10,
and if we scroll down further, until we find one that doesn't work,
here, for example, is when we try to find a solution
using OCaml 4.07. And you can see that many of the packages
here simply require OCaml 4.08 or later. Once the analysis phase is complete,
it performs the individual builds. There are various
lint steps. For example checking the documentation,
checking your code using ocamlformat, if you've enabled that for your project,
and linting the opam files.

But most of the work of the CI of course
is actually just building and testing the software against each of
the supported platforms. And when you look at one of them you'll
see that one of the first things it displays
is a Dockerfile which it is going to use to do this build. And you can
take this Dockerfile and you can run it on your local machine
if you want to reproduce the build exactly as the CI would do it.
And this Dockerfile includes the exact hash of the base platform it's
using.

It includes the exact commit it's using
from the opam repository to get the dependencies. And it includes the exact versions of every
package it's going to test against. And, in fact, it actually includes a
complete shell command that you can just paste into your shell
to do this build. Another interesting thing worth
pointing out is that when this fails it doesn't just say "build failed": it can
usually give you a useful error message, and that's
because we have a large collection of regular expressions
that match the various ways that OCaml tooling produces errors: errors from
the OCaml compiler, errors from dune, errors from opam,
and so on. And in this particular case we can see
that the change I made does not work on OCaml 4.08 or 4.09 because
I'm using a function that didn't exist in those versions.
So I should either stop using this or I should update my opam files to specify
4.10 as the minimum version.
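As a rough illustration of the log-analysis idea described above, the sketch below scans a build log and returns a short summary for the first line that matches a known error pattern. The rules and the sample log are hypothetical, and plain substring matching stands in for the regular expressions the CI actually uses, to keep the example free of dependencies.

```ocaml
(* Hypothetical rules: (text to look for, summary to show the user).
   The CI's real rules are regular expressions and are not reproduced here. *)
let rules = [
  "Unbound value", "uses a value missing from this compiler or library version";
  "Unbound module", "uses a module that is not available";
  "No solution found", "opam could not satisfy the dependencies";
]

(* [contains ~sub s] is true if [sub] occurs somewhere in [s]. *)
let contains ~sub s =
  let n = String.length sub and m = String.length s in
  let rec at i = i + n <= m && (String.sub s i n = sub || at (i + 1)) in
  at 0

(* Return (summary, offending line) for the first log line matching a rule. *)
let summarise log =
  List.find_map (fun line ->
    List.find_map (fun (pattern, summary) ->
      if contains ~sub:pattern line then Some (summary, line) else None)
      rules)
    log
```

With a log containing `Error: Unbound value List.concat_map`, `summarise` returns the first rule's summary together with that line; a log with no matching line gives `None`, which corresponds to the plain "build failed" case.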

And you probably notice that this build
was performed very quickly, and that's because the build machines
already had all of the dependencies cached
and they were therefore able to just reuse them
from the previous build. This is not always the case:
for example, if you have changed your opam file then it will have to
reinstall all the dependencies, or it may be that they've just
expired from the build cache. But, generally speaking, if you are
iterating on a PR (submitting a PR,
finding a problem, and modifying it again)
then after the first time you should find that the cache is
hot, and all the builds should go very
quickly.
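One way to picture why the cache stays hot while you iterate on a PR: if the cache key is derived only from the things that determine the dependencies, then changing your source code leaves the key unchanged. The sketch below is a toy model under that assumption. The record fields, the use of `Digest` for hashing, and the in-memory table are all illustrative and are not taken from the CI's implementation.

```ocaml
(* Everything that determines the dependency layer of a build.
   Field names are illustrative, not the CI's actual data model. *)
type deps = {
  base_image : string;        (* hash of the base platform image *)
  opam_repo_commit : string;  (* commit of opam-repository used *)
  packages : string list;     (* exact package versions selected *)
}

(* The key depends only on the dependencies, never on the project source. *)
let cache_key d =
  String.concat "\n"
    (d.base_image :: d.opam_repo_commit :: List.sort compare d.packages)
  |> Digest.string
  |> Digest.to_hex

let cache : (string, string) Hashtbl.t = Hashtbl.create 16

(* Return the cached dependency layer, or run [install] once and remember it. *)
let with_deps d ~install =
  let key = cache_key d in
  match Hashtbl.find_opt cache key with
  | Some layer -> `Hit, layer
  | None ->
    let layer = install d in
    Hashtbl.add cache key layer;
    `Miss, layer
```

Calling `with_deps` twice with the same record gives a miss followed by a hit; changing any package version produces a different key and therefore a fresh install.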

Now this web interface is actually
implemented as a separate service that communicates with the main CI service
over Cap'n Proto RPC. Cap'n Proto is a really nice
RPC system: it's very fast, provides great security,
and we have an implementation of it from the Mirage project, under mirage/capnp-rpc. And in OCaml-CI we're using this not
just to provide the web interface you've just seen
but we also have a command line interface that can be used for scripting,
and we have an interactive terminal UI.

One unusual feature of the CI
is that because it's reading your opam files and it knows what your project's
dependencies are, it can automatically rebuild the project
when they become out of date. The CI monitors the opam repository and
whenever a new PR is merged there, it will rerun all of the analysis jobs
and look to see if any of the selections have changed as a result.
If they have, it will redo those
builds. And this graph here shows a fairly
typical example of opam repository being updated. A PR was merged
to master at 19:02 and the CI kicked off
750 new analysis jobs, one for each of the branches and PRs that it was
monitoring. For each of those PRs it performed
19 separate opam solves, one for each of the platforms that it tests against.
And, as you can see, it took about 18 minutes
to find all the changes, all the jobs, that needed to be rebuilt.
That's doing about 12 opam solves per second
on a single computer. And the way we've managed to do that so quickly
is by using a new dependency solver. So what we did, was we took the package
dependency solver from the 0install project
(which is nicely functorized and doesn't actually depend on the rest of
0install at all) and we applied that functor to the
opam package module. And we found this solver is able to find
solutions much more quickly than the standard opam solvers.
In fact, when we tested it on every single package in the opam repository,
not a single one of them took more than a second to produce a solution.
And we've been using it for a few months now.
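The "functorized solver" idea mentioned above can be sketched in a few lines: the solver is written against an abstract package signature and then applied to whatever package model you have. Everything below is a toy. The `PACKAGE` signature, the naive backtracking search, and the `Toy` universe are assumptions made for illustration, and they are much simpler than the real 0install solver or the opam package module.

```ocaml
module type PACKAGE = sig
  type t
  val name : t -> string
  val candidates : string -> t list               (* preferred candidate first *)
  val depends : t -> (string * (t -> bool)) list  (* dependency name and filter *)
end

module Solver (P : PACKAGE) = struct
  (* Extend [chosen] until every requirement in [todo] is met, backtracking
     to the next candidate whenever a choice leads to a dead end. *)
  let rec solve chosen todo =
    match todo with
    | [] -> Some chosen
    | (name, ok) :: rest ->
      match List.find_opt (fun p -> P.name p = name) chosen with
      | Some p -> if ok p then solve chosen rest else None
      | None ->
        P.candidates name
        |> List.filter ok
        |> List.find_map (fun p -> solve (p :: chosen) (P.depends p @ rest))

  let solve_for name = solve [] [ (name, fun _ -> true) ]
end

(* A made-up package universe: each dependency is (name, minimum version). *)
module Toy = struct
  type t = { name : string; version : int; deps : (string * int) list }

  let universe = [
    { name = "app"; version = 2; deps = ["lib", 2; "ocaml", 410] };
    { name = "app"; version = 1; deps = ["lib", 1] };
    { name = "lib"; version = 2; deps = ["ocaml", 408] };
    { name = "lib"; version = 1; deps = ["ocaml", 408] };
    { name = "ocaml"; version = 408; deps = [] };
  ]

  let name p = p.name

  let candidates n =
    List.filter (fun p -> p.name = n) universe
    |> List.sort (fun a b -> compare b.version a.version)

  let depends p =
    List.map (fun (n, lowest) -> (n, fun q -> q.version >= lowest)) p.deps
end

module Toy_solver = Solver (Toy)
```

Here `Toy_solver.solve_for "app"` first tries `app` version 2, finds that no available `ocaml` is new enough, and falls back to version 1. Swapping in a different package module only requires another functor application; the solver code is untouched.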

It's pretty well tested. We're using
it in OCaml-CI. We're using it in the
duniverse tool, which is a really useful little tool for
pulling in all the dependencies of your project
into your source directory, so that you can do a `dune build` across the whole
thing all in one go, and we're also using it in opam-health-check
which is a bulk build that we do every few days, testing
every package in the opam repository. That was previously using the Z3
solver, but with this solver it was actually
faster. And if you want you can also run it from the command line. It's available
in opam. Here I've asked it to find a solution
for utop and OCaml 4.10.0, and as you can see here, the solve took
0.22 seconds. In fact it actually found the solution in less time than it took
to initialize the opam library.

One thing you'd often like to do with
CIs is to reproduce the build locally, and I've already shown that every CI
build includes a Dockerfile that you can copy and paste
to reproduce that particular build.

But another thing that you might like to do
is to actually run the CI itself on your local machine. You can do that by
cloning its repository and running the `ocaml-ci-service` command, although
if you want to do that you will have to register your own GitHub app.
So we also provide an easier way, which is the `ocaml-ci-local` command.
You just give it the path of a local git clone,
and it will run the CI just on the head commit of that repository.
And when you do that, it will provide you with a little web interface — and here is
a screenshot I took from that — showing the pipeline that it runs. Over
here, on the left, it's pulling in some Docker base images.
You can configure which platforms you want: you don't necessarily want to pull
in all 19 platforms. I've configured it here
just to pull in three.

For each one it queries opam to find out details of
the distribution name, and so on, which it needs for the
opam solve. Below that, it will clone the opam repository
to get the metadata about the dependencies, to find out which versions
are available. And then finally, here, it's actually
monitoring the local repository. This is querying to find out which
branch your repository is currently on.

And this is querying to find the head
commit of that branch. And all of this is input into the
analysis phase that we saw earlier. In this case, the analysis phase
determined that OCaml 4.02 wasn't supported, but it has produced
builds for 4.09 and 4.10, as well as some lint jobs. For example,
this build here has failed, as shown in red.
You could click on that and actually see the log and interact
with it. And this is really useful if you want to
extend the CI: if you'd like to contribute
some improvements to it, you can just use this to test your improvements locally.

Now, that diagram that we just saw is an
example of an OCurrent pipeline. OCurrent is an embedded domain specific
language for writing pipelines for keeping things
current. It's a form of incremental computation,
like Jane Street's Incremental library or Daniel Bünzli's React library. But it also
makes it very easy to do static analysis, and that is how we're
able to generate these diagrams that show what it's going
to do, even if it hasn't actually completed all the steps yet.

And here's
an example of a really simple OCurrent pipeline: we get the head of a
local repository, and call that "source"; we use docker to build the source code
to get a Docker image; and then we run a `make test` command
inside that image. And that pipeline will automatically
generate the diagram shown here. And by the way — the colors here: green
indicates a step that's been successful, orange indicates a step that is
currently in the process of being built, and grey indicates a step that cannot
run yet because its inputs are not ready.
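To make the three colours concrete, here is a small stand-alone model of a pipeline whose steps are finished, running, or waiting on their inputs. It deliberately does not use the OCurrent library: the `state` type, the `after` combinator, and the step labels are invented for this sketch, and OCurrent's real API and its static analysis are considerably richer.

```ocaml
type 'a state =
  | Done of 'a   (* green: finished successfully *)
  | Running      (* orange: currently being built *)
  | Waiting      (* grey: inputs are not ready yet *)

type 'a step = { label : string; state : 'a state }

let colour step =
  match step.state with
  | Done _ -> "green"
  | Running -> "orange"
  | Waiting -> "grey"

(* Run [f] on the input's value once it is ready; until then the new step
   still exists in the pipeline, but it is waiting. *)
let after input label f =
  match input.state with
  | Done x -> { label; state = f x }
  | Running | Waiting -> { label; state = Waiting }

(* Head commit, then a Docker build, then the tests, reported box by box. *)
let pipeline ~head ~build ~test =
  let source = { label = "head commit"; state = head } in
  let image = after source "docker build" build in
  let result = after image "make test" test in
  [ source.label, colour source;
    image.label, colour image;
    result.label, colour result ]
```

If the head commit is known but the build is still in progress, the three boxes come out green, orange, and grey. Note that the "make test" box is reported even though it cannot run yet, which is the property that lets a diagram be drawn before the pipeline has finished.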

So, OCurrent provides a language for
connecting these boxes together. To actually get these primitive
operations, we have packages available for interacting with Git, GitHub, Docker,
Slack, and OCluster, and you can also define
your own primitive operations very easily. There
is a wiki in the OCurrent repository with
tutorials, and there is a skeleton project which you can use as a starting
point for your own pipelines.

So here's the
actual pipeline used by the real CI service. We're
pulling in a lot more images, on the left, than we were in that smaller pipeline we
saw. And we're also querying GitHub for a
list of accounts that have installed the GitHub app.
For each of the accounts, we get a list of the repositories.
In this case, I've expanded the Mirage organization.
And then for each of these repositories we get a list of the
PRs and branches that need to be tested. I've expanded the capnp-rpc
repository.

And then for each PR and branch we fetch the source code and
then we run that same pipeline that I showed you before where
we do an analysis and then we do the builds. And then,
finally, we push the status back to GitHub. And there's an admin UI that you can add
to any OCurrent pipeline that will automatically give you a web
interface that will show you your pipeline graph, allow you to view
the logs, do queries over previous results, and
configure the regular expressions used in the log analyzer.
Now, writing CI pipelines is not the only possible use for OCurrent and I wanted
to just show you a couple of other uses we found for it.
The first is a pipeline that we use for building the Docker base images
that are consumed by the CI. This pipeline
builds a wide variety of different combinations
of platform (Debian, Ubuntu, Fedora, and so on), architecture (x86, ARM, PowerPC),
OCaml version (from 4.02 to 4.12), and compiler options.
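The combinations described above form a Cartesian product, which is easy to enumerate. The sketch below generates one image tag per combination of distribution, OCaml version, and compiler variant. The tag layout (`<distro>-ocaml-<version>[-<variant>]`) is an assumption for illustration, and architecture is left out on the further assumption that it is not encoded in the tag.

```ocaml
(* One tag per (distro, OCaml version, variant); an empty variant means
   the default compiler options. The naming scheme is assumed. *)
let image_tags ~distros ~versions ~variants =
  List.concat_map (fun distro ->
    List.concat_map (fun version ->
      List.map (fun variant ->
        let suffix = if variant = "" then [] else [variant] in
        String.concat "-" ([distro; "ocaml"; version] @ suffix))
        variants)
      versions)
    distros
```

The number of images grows as the product of the three list lengths, which is why a dedicated pipeline for building them is worthwhile.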

And every week
it builds these images and it pushes them to Docker Hub.
And if you'd like to use them yourself, you can run a command like this:
`docker run ocurrent/opam:debian-10-ocaml-4.08-afl`.
That will give you a Docker container that's based on Debian 10
that has OCaml 4.08 installed, compiled
for fuzzing with AFL, for example.

Another use we found for OCurrent is for
deploying services. We have a deployer pipeline that watches
the live branch of various repositories, and when it sees
a new commit pushed to the live branch, it will build it, push it to Docker Hub,
pull it from Docker Hub onto the build machine, and update the service there.
And while it's doing this it will send updates of its status
to Slack, so we can see what it's doing.

In fact, for the CI we're actually
watching three different branches. There's a "live-engine" branch, which we
use to deploy a new version of the engine.
There's a "live-www" branch, which we use
for updating the web user interface. And there's also a "staging-www" branch,
which allows us to experiment with changes to the web UI
before we're ready to push them live. And finally, the deployer also watches all
the other branches and PRs in the repository,
and just builds them, just to make sure that they can be built,
and pushes the status of that back to GitHub.
And the final component of the system is the OCluster build scheduler. This is a
little OCaml service that runs on one machine. It takes job
submissions from clients, such as OCaml-CI and the
base image builder, and it distributes them to various
worker nodes. This just makes it really easy for us to
add new machines to the build pools, when they become available.
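The scheduler's role can be pictured as a table of named pools, each holding the workers that have registered with it. The following is a minimal in-memory sketch under that assumption. It is not OCluster's implementation: the pool and worker names are made up, and the real service talks to clients and workers over the network.

```ocaml
(* pool name -> queue of registered workers (all names are hypothetical) *)
type scheduler = (string, string Queue.t) Hashtbl.t

let create pools : scheduler =
  let t = Hashtbl.create 8 in
  List.iter (fun pool -> Hashtbl.replace t pool (Queue.create ())) pools;
  t

(* Adding a machine to a pool is just a registration. *)
let register (t : scheduler) ~pool ~worker =
  match Hashtbl.find_opt t pool with
  | Some q -> Queue.add worker q
  | None -> invalid_arg ("unknown pool: " ^ pool)

(* Hand the job to the next worker in the pool, then move that worker to
   the back of the queue so that work is spread round-robin. *)
let submit (t : scheduler) ~pool ~job =
  match Hashtbl.find_opt t pool with
  | None -> Error ("unknown pool: " ^ pool)
  | Some q ->
    match Queue.take_opt q with
    | None -> Error "no worker available"
    | Some worker -> Queue.add worker q; Ok (worker, job)
```

Clients only ever name a pool, never a machine, so a newly registered worker starts receiving jobs without any change on the client side.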

Okay, so now I'm just going to give a
quick demo of the OCluster system. First of all, you need to run the
scheduler, and it takes a few arguments. You need to tell it whereabouts you want
it to create its private key. You need to tell it
what address to listen on for incoming network connections.
(The scheduler is the only component in the system that needs to accept incoming
connections.) You need to tell it its public address,
which is where it suggests other services should connect
to it. Here I'm just using localhost, but obviously in a real system
you'd use some kind of public address here. It needs a directory in which it will
store the database that records which jobs it has previously assigned to
which workers, allowing it to assign similar jobs to
the same worker in future and benefit from caching.
And then, finally, you can specify whatever set of pools you want
in your cluster.

These names can be whatever you like. Now we can see it's generated a set of
four capability files, and each of these grants access to some
service within the scheduler. For example the
submission capability gives its holder the ability to submit
things to the cluster, the admin capability gives the ability
to run administration commands, and the pool
capabilities give the ability to be a worker and to
register to a pool to accept jobs.

So let's just take a look at one of
these capability files, just to see what's
inside it. So it's a URL and it has several parts:
This part is the address which the holder of the capability should connect to, to
access the service. This part is the fingerprint of the
service key, which allows the client to check that it
is connected to the right service without the need to set up any kind of
PKI, or create certificates and sign
CAs and things like that.
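A capability URL of this kind can be taken apart with a few string operations. The sketch below assumes a layout of `capnp://<fingerprint>@<address>/<secret>`, where the final path component is the secret token identifying the service. That layout and the example values in the usage note are assumptions for illustration, and a real client would use the capnp-rpc library's own parser rather than this.

```ocaml
type cap = {
  fingerprint : string;  (* digest of the service's public key *)
  address : string;      (* where to connect *)
  secret : string;       (* identifies the service and grants access to it *)
}

(* Split "capnp://FINGERPRINT@ADDRESS/SECRET" into its three parts. *)
let parse url =
  let prefix = "capnp://" in
  let n = String.length prefix in
  if String.length url < n || String.sub url 0 n <> prefix then None
  else
    let rest = String.sub url n (String.length url - n) in
    match String.index_opt rest '@' with
    | None -> None
    | Some at ->
      let after_at = String.sub rest (at + 1) (String.length rest - at - 1) in
      match String.index_opt after_at '/' with
      | None -> None
      | Some slash ->
        Some {
          fingerprint = String.sub rest 0 at;
          address = String.sub after_at 0 slash;
          secret =
            String.sub after_at (slash + 1) (String.length after_at - slash - 1);
        }
```

For a made-up URL such as `capnp://sha-256:abc@127.0.0.1:9000/s3cr3t`, `parse` yields the fingerprint `sha-256:abc`, the address `127.0.0.1:9000`, and the secret `s3cr3t`. Because the secret is part of the URL, anyone holding the file holds the permission, so these files need to be kept private.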

And the final part is this Swiss
number, which identifies which service to use and
is also a secret which grants you
permission to use it. And the other cap files will be the same,
except that the Swiss number at the end will be
different.

So, let's try adding a worker to our cluster.
We'll run `ocluster-worker`. We need to give it permission
to connect (I'll connect it to the x86_64 pool) and we need to give the worker
some unique name. This is mainly used for monitoring,
so we can tell when a machine is running out of disk space or
whatever. So that is now connected to the
scheduler.

Okay, so now let's try using our new
cluster. We'll use the `ocluster-client` command. We'll submit a job.
We'll need to give the client permission to use the cluster, by giving it access
to the submission capability.

We'll need to say
which pool we want to use (I'll use the linux x86_64 pool) and we need to give it a description of
the job we want to perform. In my case I'll ask it to build a local
Dockerfile. And here we have a client connecting to
the scheduler. The scheduler hands the job over to the
worker. The worker is executing the build, in
this case using Docker, and streaming the logs back to the
client.

So, in summary, OCaml-CI provides a fast
build and test pipeline for OCaml projects. But you may find that
the components it's made of are also useful to you
individually. The diagram on this page shows the ones we've looked at — the web
user interface, the command interface, the CI engine, the new opam solver,
and the OCluster scheduler — and the arrows represent the Cap'n
Proto connections that tie everything together.
All of this software is open source, and I've included links
throughout this talk.

And if you would like to get involved,
please get in touch!
