Writing a simple parser

This article explains how to write a simple parser, using symbolic expressions (s-expression, known from Lisp and Scheme) as an example. At the same time, it's a brief introduction to Lean.

A simple s-expression is, for example, (1 2 3), which is a list with three elements. However, if the input is invalid, (1 2 3 say, we want to return an error. The following data type will represent parse errors:

inductive ParseError : Type where
  | mk : String -> Nat -> ParseError

instance : ToString ParseError where
  toString : ParseError -> String
  | .mk message line => s!"{message} (line {line})"

The String field contains the error message and the Nat field (which is short for natural number, a non-negative integer) contains the line number where the error occurred. We also implement the ToString type class for this data type so we can pretty-print its contents.

When the parsing succeeds, the result is a value (for example, the input's abstract syntax tree, AST). Otherwise we return an instance of ParseError:

def ParseResult : Type -> Type := fun a => Result a ParseError

instance [ToString a] [ToString b] : ToString (ParseResult a) where
  toString : ParseResult a -> String
  | .value val => toString val
  | .error err => toString err

instance : Pure ParseResult where
  pure x := .value x

instance : Bind ParseResult where
  bind r f :=
    match r with
    | .value val => f val
    | .error err => .error err

instance [Inhabited a] : Inhabited (ParseResult a) where
  default := .value default

Again, we implement the ToString type class, and also Pure, Bind and Inhabited, highly technical type classes we won't go into here.

The value of an s-expression can be an identifier, an integer, a float, a string, a list of s-expressions or an empty list, represented by the nil value:

inductive Value : Type where
  | ident : String -> Nat -> Value
  | int : Nat -> Nat -> Value
  | float : Float -> Nat -> Value
  | string : String -> Nat -> Value
  | list : List Value -> Nat -> Value
  | nil : Nat -> Value

mutual
  def listAsString : List Value -> String
  | [] => ""
  | x :: y :: xs => x.toString ++ " " ++ listAsString (y :: xs)
  | x :: _ => x.toString

  def Value.toString : Value -> String
    | .ident t _ => t
    | .int n _ => s!"{n}"
    | .float x _ => s!"{x}"
    | .string s _ => escape s
    | .list l _ => s!"({listAsString l})"
    | .nil _ => "()"
end

instance : ToString Value where
  toString := Value.toString

instance : Inhabited Value where
  default := .nil 0

We finally get to the code that does the actual parsing. The input of the function is a list of tokens (the input code processed by a lexer). The output is either a parse error or a parsed value and a list of the remaining tokens.

mutual
  partial def parseValue (tokens : List Token) : ParseResult (Value × List Token) :=
    match tokens with
    | .ident text line :: tail
      => .value (.ident text line, tail)
    | .str text line :: tail
      => .value (.string text line, tail)
    | .num text1 line :: .symbol "." _ :: .num text2 _ :: tail
      => .value (.float (parseFloat $ text1 ++ "." ++ text2) line, tail)
    | .num text line :: tail => .value (.int text.toNat! line, tail)
    | .symbol "(" line :: tail
      => parseList tail >>= fun (lst, tail) => .value (if lst.isEmpty then .nil line else .list lst line, tail)
    | .symbol text line :: tail
      => .value (.ident text line, tail)
    | .eof line :: _
      => .error $ ParseError.mk "unexpected end of file" line
    | _
      => .error $ ParseError.mk "no tokens" 0
  
  partial def parseList (tokens : List Token) : ParseResult (List Value × List Token) :=
    match tokens with
    | .symbol ")" _ :: tail
      => .value ([], tail)
    | _
      => parseValue tokens >>= fun (val, tail) => parseList tail >>= fun (lst, tail) => .value (val :: lst, tail)
end

Note that the parser 1) is recursive (there can be many nested lists) and 2) works by pattern matching. For example, if the parser sees .symbol "(" line at the beginning of the token list, it recursively calls the parseList function.

The main function reads the input file into a string and calls the parseValue function:

def main (args : List String): IO UInt32 := do
  if args.length == 0 then
    IO.eprintln "no input file specified"
    return 1
  let filename := args[0]!
  if not (<- FilePath.pathExists filename) then
    IO.eprintln "input file doesn't exist"
    return 1
  let input <- IO.FS.readFile filename
  let tokens := tokenise input
  -- IO.println s!"{tokens}"
  match parseValue tokens with
  | .value (val, _) => IO.println s!"parsed: {val}"
  | .error err => IO.eprintln s!"parse error: {err}"
  return 0

For (1 2 3), the parser succeeds. For an ill-formed input, (1 2 3 say, the output is parse error: unexpected end of file (line 2).

This post was rather technical but it illustrates how pattern matching makes writing parsers for formal languages easier.

LeanParsing
Avatar for Petr Homola

Written by Petr Homola

Studied physics & CS, PhD in NLP, interested in AI, HPC & PLT

Loading

Fetching comments

Hey! 👋

Got something to say?

or to leave a comment.