title=The Generic Syntax Extension authors=Çagdas Bozman date=2014-04-01 category=OCaml tag=tooling,ocaml,generic,syntax,extension OCaml 4.01 with its new feature to disambiguate constructors allows to do a nice trick: a simple and generic syntax extension that allows to define your own syntax without having to write complicated parsetree transformers. We propose an implementation in the form of a ppx rewriter. it does only a simple transformation: replace strings prefixed by an operator starting with ! by a series of constructor applications for instance: ```ocaml !! "hello 3" ``` is rewriten to ```ocaml !! (Start (H (E (L (L (O (Space (N3 (End)))))))) ``` How is that generic ? We will present you a few examples. #### Base 3 Numbers For instance, if you want to declare base 3 arbitrary big numbers, let's define a syntax for it. We first start by declaring some types. ```ocaml type start = Start of p and p = | N0 of stop | N1 of q | N2 of q and q = | N0 of q | N1 of q | N2 of q | Underscore of q | End and stop = End ``` This type will only allow to write strings matching the regexp 0 | (1|2)(0|1|2|_)*. Notice that some constructors appear in multiple types like N0. This is not a problem since constructor desambiguation will choose for us the right one at the right place. Let's now define a few functions to use it: ```ocaml open Num let rec convert_p = function | N0 (End) -> Int 0 | N1 t -> convert_q (Int 1) t | N2 t -> convert_q (Int 2) t and convert_q acc = function | N0 t -> convert_q (acc */ Int 3) t | N1 t -> convert_q (Int 1 +/ acc */ Int 3) t | N2 t -> convert_q (Int 2 +/ acc */ Int 3) t | Underscore t -> convert_q acc t | End -> acc let convert (Start p) = convert_p p ``` ```ocaml # val convert : start -> Num.num = ``` And we can now try it: ```ocaml let n1 = convert (Start (N0 End)) # val n1 : Num.num = let n2 = convert (Start (N1 (Underscore (N0 End)))) # val n2 : Num.num = let n3 = convert (Start (N1 (N2 (N0 End)))) # val n3 : Num.num = ``` And the generic syntax extension allows us to write: ```ocaml let ( !! ) = convert let n4 = !! "120_121_000" val n4 : Num.num = ``` #### Specialised Format Strings We can implement specialised format strings for a particular usage. Here, for concision we will restrict to a very small subset of the classical format: the characters %, i, c and space Let's define the constructors. ```ocaml type 'a start = Start of 'a a and 'a a = | Percent : 'a f -> 'a a | I : 'a a -> 'a a | C : 'a a -> 'a a | Space : 'a a -> 'a a | End : unit a and 'a f = | I : 'a a -> (int -> 'a) f | C : 'a a -> (char -> 'a) f | Percent : 'a a -> 'a f ``` Let's look at the inferred type for some examples: ```ocaml let (!*) x = x let v = !* "%i %c";; # val v : (int -> char -> unit) start = Start (Percent (I (Space (Percent (C End))))) let v = !* "ici";; # val v : unit start = Start (I (C (I End))) ``` This is effectively the types we would like for a format string looking like that. To use it we can define a simple printer: ```ocaml let rec print (Start cons) = main cons and main : type t. t a -> t = function | I r -> print_string "i"; main r | C r -> print_string "c"; main r | Space r -> print_string " "; main r | End -> () | Percent f -> format f and format : type t. t f -> t = function | I r -> fun i -> print_int i; main r | C r -> fun c -> print_char c; main r | Percent r -> print_string "%"; main r let (!!) cons = print cons ``` And voila! ```ocaml let s = !! "%i %c" 1 'c';; # 1 c ``` ### How generic is it really ? It may not look like it, but we can do almost any syntax we might want this way. For instance we can do any regular language. To explain how we transform a regular language to a type definition, we will use as an example the language a(a|)b ```ocaml type start = Start of a and a = | A of a'; and a' = | A of b | B of stop and b = B of stop and stop = End ``` We can try a few things on it: ```ocaml let v = Start (A (A (B End))) # val v : start = Start (A (A (B End))) let v = Start (A (B End)) # val v : start = Start (A (B End)) let v = Start (B End) # Characters 15-16: # let v = Start (B End);; # ^ # Error: The variant type a has no constructor B let v = Start (A (A (A (B End)))) # Characters 21-22: # let v = Start (A (A (A (B End))));; # ^ # Error: The variant type b has no constructor A ``` Assumes the language is given as an automaton that: - has 4 states, a, a', b and stop - with initial state a - with final state stop - with transitions: a - A -> a' a' - A -> b a' - B -> stop b - B -> stop let's write {c} for the constructor corresponding to the character c and [c][/c] for the type corresponding to a state of the automaton. - For each state q we have a type declaration [q] - For each letter a of the alphabet we have a constructor {a} - For each transition p - l -> q we have a constructor {l} with parameter [q] in type [p]: ```ocaml type [p] = {l} of [q] ``` - The End constructor without any parameter must be present in any final state - The initial state e is declared by ```ocaml type start = Start of [e] ``` ### Yet more generic In fact we can encode deterministic context free languages (DCFL) also. To do that we encode pushdown automatons. Here we will only give a small example: the language of well parenthesized words ```ocaml type empty type 'a r = Dummy type _ q = | End : empty q | Rparen : 'a q -> 'a r q | Lparen : 'a r q -> 'a q type start = Start of empty q let !! x = x let m = ! "" let m = ! "()" let m = ! "((())())()" ``` To encode the stack, we use the type parameters: Lparen pushes an r to the stack, Rparen consumes it and End checks that the stack is effectively empty. There are a few more tricks needed to encode tests on the top value in the stack, and a conversion of a grammar to Greibach normal form to allow this encoding. ### We can go even further #### a^n b^n c^n In fact we don't need to restrict to DCFL, we can for instance encode the a^n.b^n.c^n language which is not context free: ```ocaml type zero type 'a s = Succ type (_,_) p = | End : (zero,zero) p | A : ('b s, 'c s) p -> ('b, 'c) p | B : ('b, 'c s) q -> ('b s, 'c s) p and (_,_) q = | B : ('b, 'c) q -> ('b s, 'c) q | C : 'c r -> (zero, 'c s) q and _ r = | End : zero r | C : 'c r -> 'c s r type start = Start of (zero,zero) p let v = Start (A (B (C End))) let v = Start (A (A (B (B (C (C End)))))) ``` #### Non recursive languages We can also encode solutions of Post Correspondance Problems (PCP), which are not recursive languages: Suppose we have two alphabets A = { X, Y, Z } et O = { a, b } and two morphisms m1 and m2 from A to O* defined as - m1(X) = a, m1(Y) = ab, m1(Z) = bba - m2(X) = baa, m2(Y) = aa, m2(Z) = bb Solutions of this instance of PCP are words such that their images by m1 and m2 are equal. for instance ZYZX is a solution: both images are bbaabbbaa. The language of solution can be represented by this type declaration: ```ocaml type empty type 'a a = Dummy type 'a b = Dummy type (_,_) z = | X : ('t1, 't2) s -> ('t1 a, 't2 b a a) z | Y : ('t1, 't2) s -> ('t1 a b, 't2 a a) z | Z : ('t1, 't2) s -> ('t1 b b a, 't2 b b) z and (_,_) s = | End : (empty,empty) s | X : ('t1, 't2) s -> ('t1 a, 't2 b a a) s | Y : ('t1, 't2) s -> ('t1 a b, 't2 a a) s | Z : ('t1, 't2) s -> ('t1 b b a, 't2 b b) s type start = Start : ('a, 'a) z -> start let v = X (Z (Y (Z End))) let r = Start (X (Z (Y (Z End)))) ``` ### Open question Can every context free language (not deterministic) be represented like that ? Notice that the classical example of the palindrome can be represented (proof let to the reader). ### Conclusion So we have a nice extension available that allows you to define a new syntax by merely declaring a type. The code is available on [github](https://github.com/chambart/generic_ppx). We are waiting for the nice syntax you will invent ! PS: Their may remain a small problem... If inadvertently you mistype something you may find some quite complicated type errors attacking you like a pyranha instead of a syntax error.