In the Tezos protocol, we define the serialization of values using the data-encoding library. We write these definitions by hand so that we can audit the code and fine-tune the serialization.

The serialization and deserialization functions must be inverses of each other; missing this property can lead to bugs or exploits. We present a solution to formally verify the definitions of data-encodings in Coq. As a result, we increase our confidence in the correctness of our encodings. We even found a mistake in development code and fixed it before release; this mistake was spotted by TriliTech.

The example of encoding verification that we will follow is `Nonce.info_encoding`:

```ocaml
type info = Revealed of Nonce.t | Missing of Nonce_hash.t | Forgotten

let info_encoding =
  let open Data_encoding in
  union
    [
      case
        (Tag 0)
        ~title:"Revealed"
        (obj1 (req "nonce" Nonce.encoding))
        (function Revealed nonce -> Some nonce | _ -> None)
        (fun nonce -> Revealed nonce);
      case
        (Tag 1)
        ~title:"Missing"
        (obj1 (req "hash" Nonce_hash.encoding))
        (function Missing nonce -> Some nonce | _ -> None)
        (fun nonce -> Missing nonce);
      case
        (Tag 2)
        ~title:"Forgotten"
        empty
        (function Forgotten -> Some () | _ -> None)
        (fun () -> Forgotten);
    ]
```

together with its correctness proof in Coq:

```coq
Lemma encoding_is_valid :
  Data_encoding.Valid_on.t
    Valid.t Alpha_services.Nonce.info_encoding.
Proof.
  eapply Data_encoding.Valid_on.implies.
  eapply Data_encoding.Valid_on.union.
  (repeat econstructor; simpl; try intuition congruence);
    try (intro x; destruct x; reflexivity);
    try apply Data_encoding.Valid_on.obj1.
  apply Data_encoding.Valid.empty.
  apply Nonce_hash.encoding_is_valid.
  apply Seed_repr.Nonce.encoding_is_valid.
  intros x H_x.
  destruct x; tauto.
Qed.
```

## The data-encoding library

This library defines several combinators from which we can build the encoding of a type. From an encoding, we get serialization and deserialization functions to the binary or JSON format.

### Basic combinators

Some examples of basic combinators are:

```ocaml
type 'a encoding

val bool : bool encoding
val string : string encoding
val option : 'a encoding -> 'a option encoding
val list : ?max_length:int -> 'a encoding -> 'a list encoding
```

The encodings `bool` and `string` directly encode their corresponding types. The combinators `option` and `list` create encodings for parametrized types, starting from the encoding of the base type.
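To give an idea of how these combinators are used, here is a small sketch (assuming the `data-encoding` OCaml library is installed) of building an encoding and round-tripping a value through the binary format:

```ocaml
(* Sketch: binary round-trip with a combinator-built encoding.
   Assumes the data-encoding library is available. *)
let () =
  let open Data_encoding in
  (* Encoding for values of type [bool list]. *)
  let encoding = list bool in
  let value = [ true; false; true ] in
  match Binary.to_bytes_opt encoding value with
  | None -> assert false
  | Some bytes ->
    (* Deserializing the bytes gives back the original value. *)
    assert (Binary.of_bytes_opt encoding bytes = Some value)
```

This round-trip property is exactly what we will later specify and verify in Coq.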

### Conversion

To encode a type for which no combinator directly works, we can use `conv`:

```ocaml
val conv :
  ('a -> 'b) ->
  ('b -> 'a) ->
  ?schema:Json_schema.schema ->
  'b encoding ->
  'a encoding
```

This function expects two parameters which must form a bijection between the target type `'a` and the base type `'b`. The user of the `data-encoding` library is responsible for checking the correctness of this bijection.
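As an illustration, here is a sketch of `conv` applied to a hypothetical record type `point`, encoded through the pair of its fields:

```ocaml
(* Sketch: encoding a record through a pair it is isomorphic to.
   The [point] type is hypothetical; [tup2] and [int31] are
   combinators from the data-encoding library. *)
type point = { x : int; y : int }

let point_encoding =
  let open Data_encoding in
  conv
    (fun { x; y } -> (x, y)) (* projection to the base type *)
    (fun (x, y) -> { x; y }) (* injection back to the target type *)
    (tup2 int31 int31)
```

Here the two functions are clearly inverses of each other, but nothing in the API enforces it: this is precisely the kind of property we want to verify.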

### Algebraic data types

For algebraic data types, we use several combinators: `case` and `union` for sum types (detailed below), and `obj` and `merge_objs` for products (which we do not use in any example in this blog post, so we do not detail them here).

```ocaml
type 't case

val case :
  title:string ->
  ?description:string ->
  case_tag ->
  'a encoding ->
  ('t -> 'a option) ->
  ('a -> 't) ->
  't case

val union : ?tag_size:[`Uint8 | `Uint16] -> 't case list -> 't encoding
```

With the `union` primitive, we combine several `case` constructs. Each `case` construct covers one constructor of the sum type `'t`. The two functions given as parameters to `case` are the projection and injection corresponding to the constructor. We need to make sure that we cover all the constructors exactly once and that the projection/injection pairs are compatible.
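As a minimal sketch of this pattern, here is a `union` encoding for a small hypothetical sum type (the trailing underscores only avoid clashing with OCaml's built-in `Ok` and `Error` constructors):

```ocaml
(* Sketch: a two-case union encoding for a hypothetical type. *)
type status = Ok_ | Error_ of string

let status_encoding =
  let open Data_encoding in
  union
    [
      case
        (Tag 0)
        ~title:"Ok"
        empty
        (function Ok_ -> Some () | _ -> None)
        (fun () -> Ok_);
      case
        (Tag 1)
        ~title:"Error"
        (obj1 (req "message" string))
        (function Error_ msg -> Some msg | _ -> None)
        (fun msg -> Error_ msg);
    ]
```

Note that each constructor appears in exactly one `case`, each tag is distinct, and each projection returns `None` on the other constructors.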

We will now see how we verify the definition of encodings with Coq.

## Specification

We translate the Tezos protocol using coq-of-ocaml. During this translation, we generate a list of axioms for the `.mli` files of the environment. The environment is the set of primitives used by the protocol and includes the `data-encoding` library. Thus we do not have access to the definition of the `data-encoding` library, but we can use its interface from Data_encoding.v.

We manually specify the behavior of `data-encoding` and suppose that its code is correct. To do so, we define the predicate `Valid_on.t`, which we will apply to each primitive:

```coq
Module Valid_on.
  Definition t {a : Set} (domain : a → Prop)
    (encoding : Data_encoding.t a) : Prop :=
    ∀ x, domain x →
    match Data_encoding.Binary.to_bytes_opt None encoding x with
    | Some bytes ⇒
      Data_encoding.Binary.of_bytes_opt encoding bytes = Some x
    | None ⇒ False
    end.
End Valid_on.
```

This statement says that for some type `a`, on a specific domain `domain` defined as a predicate, the encoding `encoding` is valid. For now, we only state that serializing and then deserializing to the binary format is the identity. We should also state that the reverse operation is valid (deserializing and then serializing is the identity), and that serialization to the JSON format is valid. For the typical case of encodings valid on their whole domain, we define the predicate `Valid.t`:

```coq
Module Valid.
  Definition t {a : Set} (encoding : Data_encoding.t a) : Prop :=
    Valid_on.t (fun _ ⇒ True) encoding.
End Valid.
```
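For reference, the reverse direction mentioned above could be stated with a similar predicate; this is only a sketch, and the module name `Valid_on_bytes` is hypothetical:

```coq
(* Hypothetical sketch: deserializing and then serializing
   is the identity on the bytes. *)
Module Valid_on_bytes.
  Definition t {a : Set} (encoding : Data_encoding.t a) : Prop :=
    ∀ bytes x,
    Data_encoding.Binary.of_bytes_opt encoding bytes = Some x →
    Data_encoding.Binary.to_bytes_opt None encoding x = Some bytes.
End Valid_on_bytes.
```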

### Basic combinators

For each of the data-encoding combinators, we give a validity axiom with corresponding conditions and domain:

```coq
Axiom bool : Valid.t Data_encoding.bool.

Axiom string : Valid.t Data_encoding.string.

Axiom option : ∀ {a : Set} {domain : a → Prop}
  {encoding : Data_encoding.t a},
  Valid_on.t domain encoding →
  let option_domain x :=
    match x with
    | None ⇒ True
    | Some x ⇒ domain x
    end in
  Valid_on.t option_domain (Data_encoding.option encoding).

Axiom list : ∀ {a : Set} {domain_a}
  {encoding_a : Data_encoding.t a},
  Valid_on.t domain_a encoding_a →
  Valid_on.t (List.Forall domain_a)
    (Data_encoding.list None encoding_a).
```

The primitives `bool` and `string` are valid on their whole domain. The primitive `list` yields an encoding that is valid on lists in which each element is in the domain `domain_a`. We define the list domain as `List.Forall domain_a`, knowing that the `List.Forall` operator has type:

```coq
∀ {a : Set}, (a → Prop) → list a → Prop
```

### Conversion

For the conversion operator, we rely on the correctness of the bijection given as a parameter:

```coq
Axiom conv : ∀ {a b : Set} {domain_b : b → Prop}
  {a_to_b : a → b} {b_to_a} {encoding_b : Data_encoding.t b},
  Valid_on.t domain_b encoding_b →
  Valid_on.t
    (fun v_a ⇒ domain_b (a_to_b v_a) ∧ b_to_a (a_to_b v_a) = v_a)
    (Data_encoding.conv a_to_b b_to_a None encoding_b).
```

### Algebraic data types

The specification of the `union` combinator for algebraic data types is a little more involved. We have to define the validity of a `case` as an inductive predicate in Coq, due to universe constraints on impredicative sets. We let the interested reader look at the full definition.

## Proof scheme

We now verify the following type encoding from alpha_services.ml:

```ocaml
type info = Revealed of Nonce.t | Missing of Nonce_hash.t | Forgotten

let info_encoding =
  let open Data_encoding in
  union
    [
      case
        (Tag 0)
        ~title:"Revealed"
        (obj1 (req "nonce" Nonce.encoding))
        (function Revealed nonce -> Some nonce | _ -> None)
        (fun nonce -> Revealed nonce);
      case
        (Tag 1)
        ~title:"Missing"
        (obj1 (req "hash" Nonce_hash.encoding))
        (function Missing nonce -> Some nonce | _ -> None)
        (fun nonce -> Missing nonce);
      case
        (Tag 2)
        ~title:"Forgotten"
        empty
        (function Forgotten -> Some () | _ -> None)
        (fun () -> Forgotten);
    ]
```

The type `info` is an algebraic data type with three constructors: `Revealed`, `Missing`, and `Forgotten`. Its encoding `info_encoding` relies on the encodings of `Nonce.t` and `Nonce_hash.t`.

The Coq translation of this OCaml code is in Alpha_services.v. The encoding of `Nonce.t` is only valid on a specific domain `Seed_repr.Nonce.Valid.t`. Thus we define the domain for `info_encoding` as:

```coq
Module Valid.
  Definition t (info : Alpha_services.Nonce.info) : Prop :=
    match info with
    | Alpha_services.Nonce.Revealed nonce ⇒
      Seed_repr.Nonce.Valid.t nonce
    | _ ⇒ True
    end.
End Valid.
```

We express the correctness lemma for `info_encoding` in Proofs/Alpha_services.v as:

```coq
Lemma encoding_is_valid :
  Data_encoding.Valid_on.t
    Valid.t Alpha_services.Nonce.info_encoding.
```

We verify that this encoding is valid on the domain `Valid.t`. Like most proofs on data-encodings, we start with:

```coq
eapply Data_encoding.Valid_on.implies.
```

The lemma `Data_encoding.Valid_on.implies` has type:

```coq
∀ {a : Set} {domain domain' : a → Prop}
  {encoding : Data_encoding.t a},
  Valid_on.t domain encoding →
  (∀ x, domain' x → domain x) →
  Valid_on.t domain' encoding.
```

Using `eapply` on `implies`, we introduce an existential Coq variable for the domain: we say that the domain is undefined for now, and we will build it as a side-effect of applying the axioms about the data-encoding primitives. At the end of the proof, we verify that the domain we built includes the domain of the lemma statement, using the `tauto` tactic.

Then, we use the axiom on `union`:

```coq
eapply Data_encoding.Valid_on.union.
```

and eliminate bureaucratic goals, checking for example that we do not use the same tag twice:

```coq
(repeat econstructor; simpl; try intuition congruence);
  try (intro x; destruct x; reflexivity);
  try apply Data_encoding.Valid_on.obj1.
```

Finally, we use previously verified lemmas for the payload of each constructor:

```coq
apply Data_encoding.Valid.empty.
apply Nonce_hash.encoding_is_valid.
apply Seed_repr.Nonce.encoding_is_valid.
```

## Conclusion

We described how we verify the data-encodings in the Tezos protocol. Our goal is to verify as many encodings as possible in order to avoid mistakes in this part of the code. Ultimately, we would also like to verify the `data-encoding` library itself, to make sure that it implements our specification.