regexp.expr: Regular expressions



Overview

We follow the .t convention for type naming here.
  • t A: The regular expressions over atoms of type A. We have the following operations
    • ε
    • r1 || r2 : Either r1 or r2 written r1 + r2 in literature
    • r1 ++ r2: The concatenation
    • many r: The Kleene closure r
    • many1 r: One or more of r often r⁺ = r r in literature
  • map: The natural map function. Makes expr.t a functor.
  • language r : The language associated with the regular expression.
  • eval: When the atom set itself is lang B then every regular expression defines a language over B. Both language and matcher.language are defined in terms of this eval function.

Matcher variant.

The submodule matcher is the regular expressions that we using in designing our regexp matcher. It is generalized variant of what is typically seen in automata text books and helps us support features like character classes succinctly.
  • matcher.t B: regular expressions used for matching. It is what we call generelized regular expression (i.e. t (B bool)
  • matcher.language The language over that is matched by the given regular expression

Inductive t {A : Type} :=
| eps : t
| single : A t
| alt : t t t
| concat : t t t
| many1 : t t.

Arguments t : clear implicits.

Fixpoint map {A B}(f : A B)(re : t A) : t B :=
  match re with
  | epseps
  | single xsingle (f x)
  | alt re1 re2alt (map f re1) (map f re2)
  | concat re1 re2concat (map f re1) (map f re2)
  | many1 remany1 (map f re)
  end.
Definition many {A}(r : t A) := alt (many1 r) eps.
Notation "'ε'" := eps : regexp_scope.
Infix "||" := alt : regexp_scope.
Infix "++" := concat : regexp_scope.

Fixpoint unnest {A}(re : t (t A)) : t A :=
  match re with
  | epseps
  | single xx
  | alt re1 re2alt (unnest re1) (unnest re2)
  | concat re1 re2concat (unnest re1) (unnest re2)
  | many1 remany1 (unnest re)
  end.

Require Import regexp.lang.

Regular expression over languages.

Any regular expression where the atoms are themselves languages over A can be evaluated to a language over A. Using this evaluation, in one shot we can give semantics mea to not just regexp over A but also generalized regular expressions (see below).

Fixpoint eval {A}(re : t (lang.lang A)) : lang.lang A :=
  match re with
  | ε%reε%lan
  | single LL
  | (re1 || re2)%re ⇒ (eval re1 || eval re2)%lan
  | (re1 ++ re2)%re ⇒ (eval re1 ++ eval re2)%lan
  | expr.many1 relang.many1 (eval re)
  end%lan.

The usual language definition given for a regular expression is just an application of the above evaluation function.

Definition language {A} : t A lang.lang A := eval \o map lang.singleton.

Generalized regexp matcher.

To support features like character classes etc efficiently we need define the notion of matching where the letters are boolean tests over the characters. We capture this in the module below.

Module matcher.

  Definition t A := expr.t (A bool).
  Definition language {A} : matcher.t A lang.lang A := eval \o map satisfy.

End matcher.