2014_04_01_the_generic_syntax_extension.md 8.12 KB
Newer Older
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
title=The Generic Syntax Extension
authors=Çagdas Bozman
date=2014-04-01
category=OCaml
tag=tooling,ocaml,generic,syntax,extension

OCaml 4.01 with its new feature to disambiguate constructors allows to do a nice trick: a simple and generic syntax extension that allows to define your own syntax without having to write complicated parsetree transformers. We propose an implementation in the form of a ppx rewriter.

it does only a simple transformation: replace strings prefixed by an operator starting with ! by a series of constructor applications

for instance:

```ocaml
!! "hello 3"
```

is rewriten to

```ocaml
!! (Start (H (E (L (L (O (Space (N3 (End))))))))
```

How is that generic ? We will present you a few examples.

#### Base 3 Numbers
26
For instance, if you want to declare base 3 arbitrary big numbers, let's define a syntax for it. We first start by declaring some types.
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157

```ocaml
type start = Start of p

and p =
  | N0 of stop
  | N1 of q
  | N2 of q

and q =
  | N0 of q
  | N1 of q
  | N2 of q
  | Underscore of q
  | End

and stop = End
```

This type will only allow to write strings matching the regexp 0 | (1|2)(0|1|2|_)*. Notice that some constructors appear in multiple types like N0. This is not a problem since constructor desambiguation will choose for us the right one at the right place. Let's now define a few functions to use it:

```ocaml
open Num

let rec convert_p = function
  | N0 (End) -> Int 0
  | N1 t -> convert_q (Int 1) t
  | N2 t -> convert_q (Int 2) t

and convert_q acc = function
  | N0 t -> convert_q (acc */ Int 3) t
  | N1 t -> convert_q (Int 1 +/ acc */ Int 3) t
  | N2 t -> convert_q (Int 2 +/ acc */ Int 3) t
  | Underscore t -> convert_q acc t
  | End -> acc

let convert (Start p) = convert_p p
```

```ocaml
# val convert : start -> Num.num = <fun>
```

And we can now try it:

```ocaml
let n1 = convert (Start (N0 End))
# val n1 : Num.num = <num 0>
let n2 = convert (Start (N1 (Underscore (N0 End))))
# val n2 : Num.num = <num 3>
let n3 = convert (Start (N1 (N2 (N0 End))))
# val n3 : Num.num = <num 15>
```

And the generic syntax extension allows us to write:

```ocaml
let ( !! ) = convert

let n4 = !! "120_121_000"
val n4 : Num.num = <num 11367>
```

#### Specialised Format Strings
We can implement specialised format strings for a particular usage. Here, for concision we will restrict to a very small subset of the classical format: the characters %, i, c and space

Let's define the constructors.

```ocaml
type 'a start = Start of 'a a

and 'a a =
  | Percent : 'a f -> 'a a
  | I : 'a a -> 'a a
  | C : 'a a -> 'a a
  | Space : 'a a -> 'a a
  | End : unit a

and 'a f =
  | I : 'a a -> (int -> 'a) f
  | C : 'a a -> (char -> 'a) f
  | Percent : 'a a -> 'a f
```

Let's look at the inferred type for some examples:

```ocaml
let (!*) x = x

let v = !* "%i %c";;
# val v : (int -> char -> unit) start = Start (Percent (I (Space (Percent (C End)))))
let v = !* "ici";;
# val v : unit start = Start (I (C (I End)))
```

This is effectively the types we would like for a format string looking like that. To use it we can define a simple printer:

```ocaml
let rec print (Start cons) =
  main cons

and main : type t. t a -> t = function
  | I r ->
    print_string "i";
    main r
  | C r ->
    print_string "c";
    main r
  | Space r ->
    print_string " ";
    main r
  | End -> ()
  | Percent f ->
    format f

and format : type t. t f -> t = function
  | I r ->
    fun i ->
      print_int i;
      main r
  | C r ->
    fun c ->
      print_char c;
      main r
  | Percent r ->
    print_string "%";
    main r

let (!!) cons = print cons
```

158
And voila!
159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323

```ocaml
let s = !! "%i %c" 1 'c';;
# 1 c
```

### How generic is it really ?
It may not look like it, but we can do almost any syntax we might want this way. For instance we can do any regular language. To explain how we transform a regular language to a type definition, we will use as an example the language a(a|)b

```ocaml
type start = Start of a

and a =
  | A of a';

and a' =
  | A of b
  | B of stop

and b = B of stop

and stop = End
```

We can try a few things on it:

```ocaml
let v = Start (A (A (B End)))
# val v : start = Start (A (A (B End)))

let v = Start (A (B End))
# val v : start = Start (A (B End))

let v = Start (B End)
# Characters 15-16:
#   let v = Start (B End);;
#                  ^
# Error: The variant type a has no constructor B

let v = Start (A (A (A (B End))))
# Characters 21-22:
#  let v = Start (A (A (A (B End))));;
#                        ^
# Error: The variant type b has no constructor A
```

Assumes the language is given as an automaton that:

- has 4 states, a, a', b and stop
- with initial state a
- with final state stop
- with transitions: a - A -> a' a' - A -> b a' - B -> stop b - B -> stop
let's write {c} for the constructor corresponding to the character c and

[c][/c]

for the type corresponding to a state of the automaton.

- For each state q we have a type declaration [q]
- For each letter a of the alphabet we have a constructor {a}
- For each transition p - l -> q we have a constructor {l} with parameter [q] in type [p]:

```ocaml
type [p] = {l} of [q]
```

- The End constructor without any parameter must be present in any final state
- The initial state e is declared by

```ocaml
type start = Start of [e]
```

### Yet more generic
In fact we can encode deterministic context free languages (DCFL) also. To do that we encode pushdown automatons. Here we will only give a small example: the language of well parenthesized words

```ocaml
type empty
type 'a r = Dummy

type _ q =
  | End : empty q
  | Rparen : 'a q -> 'a r q
  | Lparen : 'a r q -> 'a q

type start = Start of empty q

let !! x = x

let m = ! ""
let m = ! "()"
let m = ! "((())())()"
```

To encode the stack, we use the type parameters: Lparen pushes an r to the stack, Rparen consumes it and End checks that the stack is effectively empty.

There are a few more tricks needed to encode tests on the top value in the stack, and a conversion of a grammar to Greibach normal form to allow this encoding.

### We can go even further
#### a^n b^n c^n
In fact we don't need to restrict to DCFL, we can for instance encode the a^n.b^n.c^n language which is not context free:

```ocaml
type zero
type 'a s = Succ

type (_,_) p =
  | End : (zero,zero) p
  | A : ('b s, 'c s) p -> ('b, 'c) p
  | B : ('b, 'c s) q -> ('b s, 'c s) p

and (_,_) q =
  | B : ('b, 'c) q -> ('b s, 'c) q
  | C : 'c r -> (zero, 'c s) q

and _ r =
  | End : zero r
  | C : 'c r -> 'c s r

type start = Start of (zero,zero) p

let v = Start (A (B (C End)))
let v = Start (A (A (B (B (C (C End))))))
```

#### Non recursive languages
We can also encode solutions of Post Correspondance Problems (PCP), which are not recursive languages:

Suppose we have two alphabets A = { X, Y, Z } et O = { a, b } and two morphisms m1 and m2 from A to O* defined as

- m1(X) = a, m1(Y) = ab, m1(Z) = bba
- m2(X) = baa, m2(Y) = aa, m2(Z) = bb

Solutions of this instance of PCP are words such that their images by m1 and m2 are equal. for instance ZYZX is a solution: both images are bbaabbbaa. The language of solution can be represented by this type declaration:

```ocaml
type empty
type 'a a = Dummy
type 'a b = Dummy

type (_,_) z =
  | X : ('t1, 't2) s -> ('t1 a, 't2 b a a) z
  | Y : ('t1, 't2) s -> ('t1 a b, 't2 a a) z
  | Z : ('t1, 't2) s -> ('t1 b b a, 't2 b b) z

and (_,_) s =
  | End : (empty,empty) s
  | X : ('t1, 't2) s -> ('t1 a, 't2 b a a) s
  | Y : ('t1, 't2) s -> ('t1 a b, 't2 a a) s
  | Z : ('t1, 't2) s -> ('t1 b b a, 't2 b b) s

type start = Start : ('a, 'a) z -> start

let v = X (Z (Y (Z End)))
let r = Start (X (Z (Y (Z End))))
```

### Open question
Can every context free language (not deterministic) be represented like that ? Notice that the classical example of the palindrome can be represented (proof let to the reader).

### Conclusion
So we have a nice extension available that allows you to define a new syntax by merely declaring a type. The code is available on [github](https://github.com/chambart/generic_ppx). We are waiting for the nice syntax you will invent !

PS: Their may remain a small problem... If inadvertently you mistype something you may find some quite complicated type errors attacking you like a pyranha instead of a syntax error.