Writing a simple parser
This article explains how to write a simple parser, using symbolic expressions (s-expression, known from Lisp and Scheme) as an example. At the same time, it's a brief introduction to Lean.
A simple s-expression is, for example, (1 2 3), which is a list with three elements. However, if the input is invalid, (1 2 3 say, we want to return an error. The following data type will represent parse errors:
inductive ParseError : Type where
| mk : String -> Nat -> ParseError
instance : ToString ParseError where
toString : ParseError -> String
| .mk message line => s!"{message} (line {line})"
The String field contains the error message and the Nat field (which is short for natural number, a non-negative integer) contains the line number where the error occurred. We also implement the ToString type class for this data type so we can pretty-print its contents.
When the parsing succeeds, the result is a value (for example, the input's abstract syntax tree, AST). Otherwise we return an instance of ParseError:
def ParseResult : Type -> Type := fun a => Result a ParseError
instance [ToString a] [ToString b] : ToString (ParseResult a) where
toString : ParseResult a -> String
| .value val => toString val
| .error err => toString err
instance : Pure ParseResult where
pure x := .value x
instance : Bind ParseResult where
bind r f :=
match r with
| .value val => f val
| .error err => .error err
instance [Inhabited a] : Inhabited (ParseResult a) where
default := .value default
Again, we implement the ToString type class, and also Pure, Bind and Inhabited, highly technical type classes we won't go into here.
The value of an s-expression can be an identifier, an integer, a float, a string, a list of s-expressions or an empty list, represented by the nil value:
inductive Value : Type where
| ident : String -> Nat -> Value
| int : Nat -> Nat -> Value
| float : Float -> Nat -> Value
| string : String -> Nat -> Value
| list : List Value -> Nat -> Value
| nil : Nat -> Value
mutual
def listAsString : List Value -> String
| [] => ""
| x :: y :: xs => x.toString ++ " " ++ listAsString (y :: xs)
| x :: _ => x.toString
def Value.toString : Value -> String
| .ident t _ => t
| .int n _ => s!"{n}"
| .float x _ => s!"{x}"
| .string s _ => escape s
| .list l _ => s!"({listAsString l})"
| .nil _ => "()"
end
instance : ToString Value where
toString := Value.toString
instance : Inhabited Value where
default := .nil 0
We finally get to the code that does the actual parsing. The input of the function is a list of tokens (the input code processed by a lexer). The output is either a parse error or a parsed value and a list of the remaining tokens.
mutual
partial def parseValue (tokens : List Token) : ParseResult (Value × List Token) :=
match tokens with
| .ident text line :: tail
=> .value (.ident text line, tail)
| .str text line :: tail
=> .value (.string text line, tail)
| .num text1 line :: .symbol "." _ :: .num text2 _ :: tail
=> .value (.float (parseFloat $ text1 ++ "." ++ text2) line, tail)
| .num text line :: tail => .value (.int text.toNat! line, tail)
| .symbol "(" line :: tail
=> parseList tail >>= fun (lst, tail) => .value (if lst.isEmpty then .nil line else .list lst line, tail)
| .symbol text line :: tail
=> .value (.ident text line, tail)
| .eof line :: _
=> .error $ ParseError.mk "unexpected end of file" line
| _
=> .error $ ParseError.mk "no tokens" 0
partial def parseList (tokens : List Token) : ParseResult (List Value × List Token) :=
match tokens with
| .symbol ")" _ :: tail
=> .value ([], tail)
| _
=> parseValue tokens >>= fun (val, tail) => parseList tail >>= fun (lst, tail) => .value (val :: lst, tail)
end
Note that the parser 1) is recursive (there can be many nested lists) and 2) works by pattern matching. For example, if the parser sees .symbol "(" line at the beginning of the token list, it recursively calls the parseList function.
The main function reads the input file into a string and calls the parseValue function:
def main (args : List String): IO UInt32 := do
if args.length == 0 then
IO.eprintln "no input file specified"
return 1
let filename := args[0]!
if not (<- FilePath.pathExists filename) then
IO.eprintln "input file doesn't exist"
return 1
let input <- IO.FS.readFile filename
let tokens := tokenise input
-- IO.println s!"{tokens}"
match parseValue tokens with
| .value (val, _) => IO.println s!"parsed: {val}"
| .error err => IO.eprintln s!"parse error: {err}"
return 0
For (1 2 3), the parser succeeds. For an ill-formed input, (1 2 3 say, the output is parse error: unexpected end of file (line 2).
This post was rather technical but it illustrates how pattern matching makes writing parsers for formal languages easier.