OCaml Workshop 2020 – Irmin v2

Hello and welcome to the Irmin v2 presentation. In this talk I'm going to present what Irmin is, some of the new features we developed over the last year, and how you might use them in your own projects. Then we will dive into some of the internals of Irmin through a real use case running in production, and some of the challenges that we faced. Before we start: Irmin is developed at Tarides and is part of the MirageOS project.
MirageOS is a set of operating-system libraries that allow you to build unikernels, that is, an application and an operating system bundled into one small standalone image that can run on basically any device. That part is a bit out of scope for this talk, so if you're interested in that aspect you are very welcome to visit the MirageOS website.
So first off, what is Irmin? Irmin is a library written in OCaml that enables you to create versioned key-value stores, and it follows the same design principles as Git. As such it also has some of Git's features: for instance, you can create branches and merge them. It's also distributed, which means you can have multiple replicas of the same Irmin store on different machines, perhaps with different states, and you can sync them using pull and push operations, just like Git. And it's developed for MirageOS, which means it builds into small unikernels that don't depend on Unix syscalls, for instance; and since the requirements for unikernels are quite close to the ones for web-based applications, that also means Irmin compiles to JavaScript.

So we have several use cases that use Irmin. First off, internet services: for instance, the DNS server and the ARP service of Mirage are built using Irmin as their storage, and Robur also built a CalDAV server using Irmin as its storage. You can also build browser-based applications: for instance, CueKeeper is a nice to-do app that runs entirely in the browser and embeds Irmin as its storage. It's also very well suited for data-intensive applications: for instance, the Tezos cryptocurrency stores its ledger in an Irmin store. And if you are familiar with OPAM packaging and opam-repository, the CI of opam-repository stores its results in Irmin.

So Irmin follows the same principles as Git, but what are they? The data is organized into snapshots, which are called commits. A commit is a record of the state, or the version, of a store; it contains metadata such as a commit message, a date and an author, and it also contains links to the previous versions of the store and a link to the tree.

The tree object is what contains the key-to-value bindings. As in Git, the keys are tree-shaped: you can have sub-trees inside a tree, just like you would have sub-directories inside a Git directory. So a tree is basically a list of links to either contents or another subtree. All these components are hashed and referenced through their hash, which means you get natural hash-consing in Irmin.
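To make the model concrete, here is a simplified sketch of that object graph in OCaml. This is purely illustrative; the type names are hypothetical, not Irmin's actual internal representation.

```ocaml
(* A simplified, hypothetical view of Irmin's object graph. *)
type hash = string

type obj =
  | Contents of string                 (* a raw value *)
  | Tree of (string * hash) list       (* step name -> hash of child *)
  | Commit of {
      parents : hash list;             (* links to previous versions *)
      tree : hash;                     (* link to the root tree *)
      author : string;
      date : int64;
      message : string;                (* the commit message *)
    }
```

Because every object is referenced by its hash, two trees pointing at identical contents automatically share a single stored object.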

For instance, if two trees point to the same contents, then the contents are not duplicated.

So, three nice features of Irmin. It's compatible with Git: you can open and change the state of a Git repository using Irmin, or the opposite, you can open an Irmin store and use the Git command line to interact with it. All components are customizable: that goes from the contents stored in the store (unlike Git, which only handles blobs and doesn't know anything about their structure), to the hash, the metadata, the branches, basically any part. It's also storage-agnostic, which is a great feature: the part of Irmin that handles trees, commits and contents doesn't need to know how that data will be stored on disk, and you can plug basically any storage layer into it. Those storage layers are what we call Irmin backends, and we provide a few of them. The huge benefit is that, depending on your needs and your application, you can pick a different storage layer: for instance, if you don't need the data to be persisted to disk, you can use a simple hash table, which is what Irmin-mem implements.
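For instance, a minimal in-memory store could look like this; this is a sketch assuming the Irmin 2.x API (names such as `Store.master` and `Irmin.Info.v` changed in later releases):

```ocaml
(* A toy in-memory store: nothing is persisted to disk. *)
module Store = Irmin_mem.KV (Irmin.Contents.String)

let demo () =
  let open Lwt.Infix in
  Store.Repo.v (Irmin_mem.config ()) >>= fun repo ->
  Store.master repo >>= fun t ->
  (* Every write creates a commit, so we supply commit metadata. *)
  let info () = Irmin.Info.v ~date:0L ~author:"demo" "first commit" in
  Store.set_exn t ~info [ "hello" ] "world" >>= fun () ->
  Store.get t [ "hello" ]

let () = print_endline (Lwt_main.run (demo ()))
```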

If you want to be compatible with Git you can use the Irmin-git backend, and if you have very specific needs you can write your own backend, optimized for your use case.

There has been a lot going on for the v2 release, but here I'm going to focus on three aspects that ease the use of the library. The first one is a simple PPX that avoids a lot of boilerplate when you write an Irmin store and makes it really easy to use. The second one is a CLI interface: you can now interact with your store without writing a single line of OCaml, just using the CLI. And we now also provide a GraphQL interface and server, so on top of any Irmin install you can set up a GraphQL server that serves a GraphQL interface, which you can use through the web, maybe from another application, perhaps written in another language.

So we will have a short demo of that: we are going to build a very basic GraphQL server that serves an Irmin store with structured data, and then we are going to interact with it using GraphQL requests and the CLI tool. Here is the piece of code that we are using for the demo. As I said, Irmin is highly customizable, which means you can provide custom components for different parts such as hashes, branches or metadata, but here we are only going to provide a custom contents type and use the defaults for the others. To provide custom contents you have to provide a type t (here a record with two fields) and the OCaml value that represents that type, so that Irmin knows how to serialize it and hash it. Before the v2 release we had to write that value by hand and it was quite tedious, but now you can just use ppx_irmin to derive it for you. You also have to provide a merge function that will be used when you merge two branches, but I will not dive into the details here; you can check out the documentation if you're interested.
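Put together, the demo's custom contents might look roughly like this (a sketch assuming Irmin 2.x and ppx_irmin; the merge simply uses Irmin's default combinator rather than a real three-way merge):

```ocaml
(* Custom structured contents for the store. [@@deriving irmin] derives
   the runtime representation `t : t Irmin.Type.t`, which before v2 had
   to be written by hand with the Irmin.Type combinators. *)
module User = struct
  type t = { name : string; age : int } [@@deriving irmin]

  (* Used when two branches are merged; here, the default combinator. *)
  let merge = Irmin.Merge.(option (default t))
end
```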
Then I can create an Irmin store from the Git backend on the file system, as a KV (key-value) store using our custom contents. I will not use any of the remote features, so I just set the remote to none, and I can create a server using the functor provided by Irmin-graphql.
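Wiring it up might look roughly as follows. This is a sketch based on the irmin-graphql examples, and the exact functor and function signatures vary between versions; `User` stands for the demo's custom contents module.

```ocaml
(* Serve an Irmin store over GraphQL using Cohttp. The Remote argument
   disables the remote/sync features, as in the talk. *)
module Store = Irmin_unix.Git.FS.KV (User)

module Remote = struct
  let remote = None
end

module Server = Irmin_unix.Graphql.Server.Make (Store) (Remote)

let () =
  let open Lwt.Infix in
  Lwt_main.run
    ( Store.Repo.v (Irmin_git.config "/tmp/irmin") >>= fun repo ->
      let server = Server.v repo in
      (* The GraphQL interface is now reachable from a browser. *)
      Cohttp_lwt_unix.Server.create ~mode:(`TCP (`Port 8080)) server )
```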

Then I can create an Irmin repository under /tmp/irmin and the GraphQL layer on top of it, which I will serve using Cohttp. So now I can run the example. The server is running, and I can open it in a browser: I get a nice interface where I can type requests on the left and see the results on the right.

So here I'm going to create a new key with the value John, with age 10, and I will get the hash back. There we go. I can also read what's in the store: if I read the key under admins/john, I get the value I just added. And I can also ask for a key that doesn't exist, in which case I just get nil.

I can also browse the database using the command-line interface. If I go to the directory where the repository is, I can ask Irmin to give me the value of admins/john, and I will get the value I just added. I can also set a new value, for jane for instance, and read it back. As I said, Irmin is compatible with Git, which means I can also use the Git command line to see what I just did: if I check out master and print what's in the directory, I will get the two keys that I created. And of course Irmin is based on commits, so each change I made created a new commit, and I can go back in time. I can check that with the Git command line too: git log shows two commits, one for adding John and one for adding Jane.

Now let's dive into some of the internals of Irmin and talk about Irmin-pack. Irmin-pack is an Irmin backend that we developed as part of the v2 release, specifically for the Tezos use case, which is storing the Tezos ledger. As a blockchain, its data maps very well onto commits for storing and organizing the data. At the time, Tezos was using an LMDB backend which used 250 gigabytes on disk, so our main goal was to reduce the size of the Tezos storage.

An Irmin backend must provide a pretty simple interface, which can of course hide an arbitrarily complicated implementation, but the basic operations are adding and looking up values. All the data is content-addressable, which means data is retrieved based on its hash. We also have other constraints. Because a Tezos node performs multiple operations at once, such as validating blocks, answering user requests or getting data from the network, we cannot block, and the latency of each operation must be minimal. We have to be disk-optimized: we have to go way under the 250 gigabytes, and grow by less than 20 gigabytes per month, which was the growth rate at the time. We also have a hard memory bound, because a Tezos node can run on very small devices such as Raspberry Pis or small VPSs. And finally, a Tezos node is composed of three processes that should be able to read the storage independently of each other, so we have to provide a single-writer, multiple-readers access pattern.

So an Irmin-pack store is made of three components. The first component is the pack: its role is to store the serialized data it receives in a single file. It's an append-only file, which means you can only write data at the end of the file and you cannot alter any existing data.

The file contains a header for versioning and integrity reasons, and then each object is appended along with its length, so we know how much to read back, and its hash, for integrity checks. As we saw, nodes and commits contain links to other objects, and a first optimization that saved us a lot of space was to encode those links as offsets into the pack file instead of hashes. Offsets are much smaller than hashes, and there are a lot of them, so we saved a lot of space that way. This is safe because the data is immutable: an offset cannot change in the pack file, so we cannot break the links.
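As an illustration, an append into the pack file could be sketched like this in plain OCaml; the real on-disk format differs, and the names here are hypothetical:

```ocaml
(* Each entry carries its hash (for integrity checks on read) and a
   length prefix (so we know how much to read back); links inside the
   payload are encoded as offsets into the same file, not hashes. *)
type entry = {
  hash : string;
  payload : string;  (* serialized node, commit or contents *)
}

(* Append-only: we only ever write at the end of the file, so the
   returned offset stays valid forever. *)
let append oc { hash; payload } : int64 =
  let off = LargeFile.out_channel_length oc in
  output_string oc hash;
  output_binary_int oc (String.length payload);
  output_string oc payload;
  off
```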

A second optimization consists in deduplicating the path names. As we saw, a tree contains path-to-value bindings, so whenever a new tree is written, all its paths have to be written with it. That can lead to a lot of duplicated data: typically, when using a Git repository for source control, the paths do not change that often, because you mostly change the contents of your files instead. Yet you would have to rewrite the path names for every commit. To solve this, we deduplicate the path names by referencing them in another component, the dict. The dict is basically a bidirectional binding table from path names to identifiers, which are shorter. Whenever we write a node to the pack, we first check whether the path was already used in the past; if it was, we use the corresponding id, and if not, we use a fresh id and add it to the dict.
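A toy version of the dict fits in a few lines of OCaml (the real component also persists itself to disk; this in-memory sketch only shows the lookup logic):

```ocaml
(* Bidirectional table: path name <-> short integer identifier. *)
type dict = {
  mutable next : int;
  ids : (string, int) Hashtbl.t;    (* path name -> id *)
  paths : (int, string) Hashtbl.t;  (* id -> path name *)
}

let create () =
  { next = 0; ids = Hashtbl.create 16; paths = Hashtbl.create 16 }

let id_of_path d path =
  match Hashtbl.find_opt d.ids path with
  | Some id -> id  (* path already seen: reuse the short id *)
  | None ->
      let id = d.next in  (* fresh id, added to the dict *)
      d.next <- id + 1;
      Hashtbl.add d.ids path id;
      Hashtbl.add d.paths id path;
      id

(* The opposite operation, used when dereferencing on reads. *)
let path_of_id d id = Hashtbl.find_opt d.paths id
```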

We perform the opposite operation, dereferencing, when we need to read a node. In the Tezos use case the dict is quite small, about 10 megabytes, so it is also kept entirely in memory for speed.

Now, having that big append-only pack file is convenient when writing, because we just append at the end of the file, but searching in it is obviously not optimized, since we would basically have to read the whole file linearly to find our data. So the third component is responsible for indexing the data: it contains bindings from hash to offset and length in the pack.
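Conceptually, the index exposes something like the following interface (illustrative names, not the index library's actual API):

```ocaml
(* Where an object lives in the pack file. *)
type location = { offset : int64; length : int }

module type INDEX = sig
  type t

  (* Lookup: hash -> position in the pack, if present. *)
  val find : t -> string -> location option

  (* Record where a freshly appended object was written. *)
  val add : t -> string -> location -> unit
end
```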

We extracted that component as a separate library, index. On the surface, index is a simple key-value table: you can find and replace values just as in a hash table. But remember we also have the other constraints, such as multiple readers and bounded memory, and Tezos has more than 300 million values stored, so we needed a specific solution. The library is virtually independent from Irmin and available through OPAM, although it was developed specifically to meet these constraints.

In the context of data-intensive applications, the largest source of performance drop is access to the disk, that is, reads and writes. So our goal here is to provide single-read lookups to ensure good performance, and we also need the memory to be bounded, no matter the number of bindings. For that reason the index is divided into two parts. The log is a write-only file that contains the most recent bindings, duplicated in memory with a hash table: when we need to read, we read from memory, and when we need to write, we write both to the disk, to be able to recover from a crash or a reboot, and to memory. When adding too many bindings we have to clean the log somehow, to ensure that the memory remains bounded, so the log file is bounded with a hard limit that is customisable. Now we can find and add values in the log, but what happens when we reach that size limit? When that happens, we write the log into a new file, which we call the data file, and then we clean the log file: we empty it.
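The write path just described can be sketched as follows (a toy model: the real library performs the merge in the background and keeps the data file sorted):

```ocaml
type location = { offset : int64; length : int }

type index = {
  mem : (string, location) Hashtbl.t;  (* in-memory copy of the log *)
  log : out_channel;                   (* append-only log file on disk *)
  log_limit : int;                     (* customisable hard bound *)
}

(* [merge] folds the log into the sorted data file; sketched separately. *)
let add ~merge t hash ({ offset; length } as loc) =
  (* Write to disk first, so the binding survives a crash or reboot... *)
  Printf.fprintf t.log "%s %Ld %d\n" hash offset length;
  (* ...and to memory, so reads never touch the log file. *)
  Hashtbl.replace t.mem hash loc;
  if Hashtbl.length t.mem >= t.log_limit then begin
    merge t;
    Hashtbl.reset t.mem  (* the log is emptied once merged *)
  end

let find t hash = Hashtbl.find_opt t.mem hash
```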

Each time the log gets full, we merge the log file with the older data file and produce a new data file. As you can guess, this operation is costly, so it all happens in the background, and we use a simple and atomic rename to ensure safe concurrent access. The data file contains all the values; it lives only on disk, we don't keep it in memory, and it's a read-only file. Most importantly, it's sorted by hash: that's an invariant that the merge operation has to maintain, and it is what allows fast lookups in the data file.
So whenever we want to find some data, we first look it up in the log, in memory, and if we don't find it we look in the data file by performing an interpolation search. It's similar to binary search, except that it can find the data quicker, provided the hashes are well distributed, which is the case here. On top of that, we have a fanout that tells us on which page of the disk a specific key can be found, so in the end we only run the interpolation search on one page of the file, which is read using a single read syscall.

Another tricky aspect is how index integrates into the bigger picture of Irmin-pack. Because flushing is costly, we do not flush for each write to the disk; that is okay in the context of Tezos, because they can replay and recover the data by other means, by computation.
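The interpolation search described above can be sketched in pure OCaml. Here float keys stand in for well-distributed hashes, and the array stands for one sorted page of the data file:

```ocaml
(* Search a sorted array of (key, value) pairs. Like binary search, but
   the probe position is guessed from the key's relative place in the
   [klo, khi] range, which converges faster on uniform keys. *)
let interpolation_search arr key =
  let rec go lo hi =
    if lo > hi then None
    else
      let klo, _ = arr.(lo) and khi, _ = arr.(hi) in
      if key < klo || key > khi then None
      else if klo = khi then
        if klo = key then Some (snd arr.(lo)) else None
      else
        let pos =
          lo
          + int_of_float
              (float_of_int (hi - lo) *. ((key -. klo) /. (khi -. klo)))
        in
        let k, v = arr.(pos) in
        if k = key then Some v
        else if k < key then go (pos + 1) hi
        else go lo (pos - 1)
  in
  go 0 (Array.length arr - 1)
```

For example, `interpolation_search [| (1., "a"); (2., "b"); (4., "c") |] 2.` returns `Some "b"`.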

So instead we have bounded memory buffers which get flushed when they are full. But obviously concurrent readers cannot read those buffers, because the memory is not shared, so we still have to flush them at some point for the concurrent readers to see them. And as you know, the index contains references from hashes to offsets in the pack, and each read has to go through the index first, so we must ensure that anything written in the index points to a valid offset in the pack. To do that, we enforce a specific order when flushing to the disk: whenever index has to flush because its buffers are full, pack also needs to flush its own buffers before that. That way we ensure that anything written to the index was previously flushed to the pack, so the pointers are always valid. In index we provide flush callbacks that we use to keep that invariant.
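The ordering can be pictured with a tiny sketch (hypothetical function names; in the real libraries this is wired through a callback that index invokes just before flushing its own buffers):

```ocaml
(* Invariant: any offset visible in the index must already be valid in
   the pack file on disk. So the pack flushes first, then the index. *)
let flush_index ~flush_pack ~write_index_buffers () =
  flush_pack ();           (* 1. pack buffers reach the disk *)
  write_index_buffers ()   (* 2. only then are the offsets published *)
```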

Now, we said that performance was our main goal while developing this backend, so what are the results? The Tezos ledger contains about 350 million objects. We managed to reduce the disk space from 250 gigabytes using LMDB down to 25 gigabytes with Irmin-pack, and we reduced the growth rate from 20 gigabytes per month to 2 gigabytes per month. We ensured that the memory is bounded: we need less than one gigabyte, and this limit is customisable by changing the log size, so if you have a more powerful machine you can get faster lookups by raising that limit. We also ensured that the reads we provide are low-latency, thanks to the single-read indexing: our worst case is under half a second, and a typical data retrieval is under 100 milliseconds. That includes finding the entry in the index, reading in the pack, dereferencing the path names with the dict, deserialising the data and performing the integrity checks.

So what's next for Irmin? We are actively working on many interesting features. The layered store will allow you to combine different stores: for instance, you could use an in-memory backend for the most-accessed values and a persistent backend for older values, for performance reasons. We also have some improvements to Irmin.Type, our dynamic type library, which is responsible for serialisation, hashing and printing; there are a lot of nice performance improvements there that will speed up Irmin overall. As we saw, we provide a GraphQL server, and we also have an HTTP server on top of Irmin stores, but we also want to provide RPC servers, using for instance Cap'n Proto, for performance-focused applications.

We're also working on garbage collection, similar to git gc, to help clean up unreferenced commits, trees and contents, and we also want to provide some verified components as part of Irmin, to improve confidence in the software. Thank you for watching this presentation. If you're interested in MirageOS you can check the website; it contains a ton of resources that you can use to learn what it is and how you can use it. We also have an Irmin website that contains the API documentation, tutorials and examples, and of course all the projects that I mentioned are open source and available on GitHub. We are always happy to welcome new contributors and users, and feedback is always appreciated. Now I guess we will have a bit of time for questions, if you have some.
