mirror of git://git.sv.gnu.org/emacs.git synced 2026-01-30 04:10:54 -08:00

Fix the new PEG library

* doc/lispref/peg.texi (Parsing Expression Grammars)
(PEX Definitions, Parsing Actions, Writing PEG Rules): Fix markup,
indexing, and wording.

* etc/NEWS: Fix wording of PEG entry.

* test/lisp/progmodes/peg-tests.el: Move from test/lisp/, to match
the directory of peg.el.
Eli Zaretskii 2024-03-31 10:29:34 +03:00
parent 914b00f207
commit 994bcc125b
3 changed files with 117 additions and 83 deletions

doc/lispref/peg.texi

@@ -7,29 +7,34 @@
@chapter Parsing Expression Grammars
@cindex text parsing
@cindex parsing expression grammar
@cindex PEG
Emacs Lisp provides several tools for parsing and matching text,
from regular expressions (@pxref{Regular Expressions}) to full
@acronym{LL} grammar parsers (@pxref{Top,, Bovine parser
development,bovine}). @dfn{Parsing Expression Grammars}
left-to-right (a.k.a.@: @acronym{LL}) grammar parsers (@pxref{Top,,
Bovine parser development,bovine}). @dfn{Parsing Expression Grammars}
(@acronym{PEG}) are another approach to text parsing that offer more
structure and composability than regular expressions, but less
complexity than context-free grammars.
A @acronym{PEG} parser is defined as a list of named rules, each of
which matches text patterns, and/or contains references to other
A Parsing Expression Grammar (@acronym{PEG}) describes a formal language
in terms of a set of rules for recognizing strings in the language. In
Emacs, a @acronym{PEG} parser is defined as a list of named rules, each
of which matches text patterns and/or contains references to other
rules. Parsing is initiated with the function @code{peg-run} or the
macro @code{peg-parse} (see below), and parses text after point in the
current buffer, using a given set of rules.
@cindex parsing expression
The definition of each rule is referred to as a @dfn{parsing
expression} (@acronym{PEX}), and can consist of a literal string, a
regexp-like character range or set, a peg-specific construct
resembling an elisp function call, a reference to another rule, or a
combination of any of these. A grammar is expressed as a tree of
rules in which one rule is typically treated as a ``root'' or
``entry-point'' rule. For instance:
@cindex root, of parsing expression grammar
@cindex entry-point, of parsing expression grammar
Each rule in a @acronym{PEG} is referred to as a @dfn{parsing
expression} (@acronym{PEX}), and can be specified as a literal string, a
regexp-like character range or set, a peg-specific construct resembling
an Emacs Lisp function call, a reference to another rule, or a
combination of any of these. A grammar is expressed as a tree of rules
in which one rule is typically treated as a ``root'' or ``entry-point''
rule. For instance:
@example
@group
@@ -56,14 +61,17 @@ first rule is considered the ``entry-point'':
@end group
@end example
This macro represents the simplest use of the @acronym{PEG} library,
but also the least flexible, as the rules must be written directly
into the source code. A more flexible approach involves use of three
macros in conjunction: @code{with-peg-rules}, a @code{let}-like
construct that makes a set of rules available within the macro body;
@code{peg-run}, which initiates parsing given a single rule; and
@code{peg}, which is used to wrap the entry-point rule name. In fact,
a call to @code{peg-parse} expands to just this set of calls. The
@c FIXME: These two should be formally defined using @defmac and @defun.
@findex with-peg-rules
@findex peg-run
The @code{peg-parse} macro represents the simplest use of the
@acronym{PEG} library, but also the least flexible, as the rules must be
written directly into the source code. A more flexible approach
involves use of three macros in conjunction: @code{with-peg-rules}, a
@code{let}-like construct that makes a set of rules available within the
macro body; @code{peg-run}, which initiates parsing given a single rule;
and @code{peg}, which is used to wrap the entry-point rule name. In
fact, a call to @code{peg-parse} expands to just this set of calls. The
above example could be written as:
@example
@@ -79,33 +87,43 @@ above example could be written as:
This allows more explicit control over the ``entry-point'' of parsing,
and allows the combination of rules from different sources.
@c FIXME: Use @defmac.
@findex define-peg-rule
Individual rules can also be defined using a more @code{defun}-like
syntax, using the macro @code{define-peg-rule}:
@example
@group
(define-peg-rule digit ()
[0-9])
@end group
@end example
This also allows for rules that accept an argument (supplied by the
@code{funcall} PEG rule).
@code{funcall} PEG rule, @pxref{PEX Definitions}).
@c FIXME: Use @defmac.
@findex define-peg-ruleset
Another possibility is to define a named set of rules with
@code{define-peg-ruleset}:
@example
@group
(define-peg-ruleset number-grammar
'((number sign digit (* digit))
digit ;; A reference to the definition above.
(sign (or "+" "-" ""))))
@end group
@end example
Rules and rulesets defined this way can be referred to by name in
later calls to @code{peg-run} or @code{with-peg-rules}:
@example
@group
(with-peg-rules number-grammar
(peg-run (peg number)))
@end group
@end example
By default, calls to @code{peg-run} or @code{peg-parse} produce no
@@ -125,11 +143,11 @@ act upon parsed strings, rules can include @dfn{actions}, see
Parsing expressions can be defined using the following syntax:
@table @code
@item (and E1 E2 ...)
A sequence of @acronym{PEX}s that must all be matched. The @code{and} form is
optional and implicit.
@item (and @var{e1} @var{e2}@dots{})
A sequence of @acronym{PEX}s that must all be matched. The @code{and}
form is optional and implicit.
@item (or E1 E2 ...)
@item (or @var{e1} @var{e2}@dots{})
Prioritized choices, meaning that, as in Elisp, the choices are tried
in order, and the first successful match is used. Note that this is
distinct from context-free grammars, in which selection between
@@ -141,43 +159,43 @@ Matches any single character, as the regexp ``.''.
@item @var{string}
A literal string.
@item (char @var{C})
A single character @var{C}, as an Elisp character literal.
@item (char @var{c})
A single character @var{c}, as an Elisp character literal.
@item (* @var{E})
Zero or more instances of expression @var{E}, as the regexp @samp{*}.
@item (* @var{e})
Zero or more instances of expression @var{e}, as the regexp @samp{*}.
Matching is always ``greedy''.
@item (+ @var{E})
One or more instances of expression @var{E}, as the regexp @samp{+}.
@item (+ @var{e})
One or more instances of expression @var{e}, as the regexp @samp{+}.
Matching is always ``greedy''.
@item (opt @var{E})
Zero or one instance of expression @var{E}, as the regexp @samp{?}.
@item (opt @var{e})
Zero or one instance of expression @var{e}, as the regexp @samp{?}.
@item SYMBOL
@item @var{symbol}
A symbol representing a previously-defined PEG rule.
@item (range CH1 CH2)
The character range between CH1 and CH2, as the regexp @samp{[CH1-CH2]}.
@item (range @var{ch1} @var{ch2})
The character range between @var{ch1} and @var{ch2}, as the regexp
@samp{[@var{ch1}-@var{ch2}]}.
@item [CH1-CH2 "+*" ?x]
@item [@var{ch1}-@var{ch2} "+*" ?x]
A character set, which can include ranges, character literals, or
strings of characters.
@item [ascii cntrl]
A list of named character classes.
@item (syntax-class @var{NAME})
@item (syntax-class @var{name})
A single syntax class.
@item (funcall E ARGS...)
Call @acronym{PEX} E (previously defined with @code{define-peg-rule})
with arguments @var{ARGS}.
@item (funcall @var{e} @var{args}@dots{})
Call @acronym{PEX} @var{e} (previously defined with
@code{define-peg-rule}) with arguments @var{args}.
@item (null)
The empty string.
@end table
The following expressions are used as anchors or tests -- they do not
@@ -210,19 +228,19 @@ Beginning of symbol.
@item (eos)
End of symbol.
@item (if E)
Returns non-@code{nil} if parsing @acronym{PEX} E from point succeeds (point
is not moved).
@item (if @var{e})
Returns non-@code{nil} if parsing @acronym{PEX} @var{e} from point
succeeds (point is not moved).
@item (not E)
Returns non-@code{nil} if parsing @acronym{PEX} E from point fails (point
is not moved).
@item (guard EXP)
Treats the value of the Lisp expression EXP as a boolean.
@item (not @var{e})
Returns non-@code{nil} if parsing @acronym{PEX} @var{e} from point fails
(point is not moved).
@item (guard @var{exp})
Treats the value of the Lisp expression @var{exp} as a boolean.
@end table
@c FIXME: peg-char-classes should be mentioned in the text below.
@vindex peg-char-classes
Character class matching can use the same named character classes as
in regular expressions (@pxref{Top,, Character Classes,elisp})
@@ -234,12 +252,13 @@ in regular expressions (@pxref{Top,, Character Classes,elisp})
@cindex parsing stack
By default the process of parsing simply moves point in the current
buffer, ultimately returning @code{t} if the parsing succeeds, and
@code{nil} if it doesn't. It's also possible to define ``actions''
that can run arbitrary Elisp at certain points in the parsed text.
These actions can optionally affect something called the @dfn{parsing
stack}, which is a list of values returned by the parsing process.
These actions only run (and only return values) if the parsing process
ultimately succeeds; if it fails the action code is not run at all.
@code{nil} if it doesn't. It's also possible to define @dfn{parsing
actions} that can run arbitrary Elisp at certain points in the parsed
text. These actions can optionally affect something called the
@dfn{parsing stack}, which is a list of values returned by the parsing
process. These actions only run (and only return values) if the parsing
process ultimately succeeds; if it fails the action code is not run at
all.
Actions can be added anywhere in the definition of a rule. They are
distinguished from parsing expressions by an initial backquote
@@ -247,12 +266,13 @@ distinguished from parsing expressions by an initial backquote
of hyphens (@samp{--}) somewhere within it. Symbols to the left of
the hyphens are bound to values popped from the stack (they are
somewhat analogous to the argument list of a lambda form). Values
produced by code to the right are pushed to the stack (analogous to
the return value of the lambda). For instance, the previous grammar
can be augmented with actions to return the parsed number as an actual
integer:
produced by code to the right of the hyphens are pushed onto the stack
(analogous to the return value of the lambda). For instance, the
previous grammar can be augmented with actions to return the parsed
number as an actual integer:
@example
@group
(with-peg-rules ((number sign digit (* digit
`(a b -- (+ (* a 10) b)))
`(sign val -- (* sign val)))
@@ -261,6 +281,7 @@ integer:
(and "" `(-- 1))))
(digit [0-9] `(-- (- (char-before) ?0))))
(peg-run (peg number)))
@end group
@end example
There must be values on the stack before they can be popped and
@@ -271,43 +292,53 @@ only left-hand terms will consume (and discard) values from the stack.
At the end of parsing, stack values are returned as a flat list.
To return the string matched by a @acronym{PEX} (instead of simply
moving point over it), a rule like this can be used:
moving point over it), a grammar can use a rule like this:
@example
@group
(one-word
`(-- (point))
(+ [word])
`(start -- (buffer-substring start (point))))
@end group
@end example
The first action pushes the initial value of point to the stack. The
intervening @acronym{PEX} moves point over the next word. The second
action pops the previous value from the stack (binding it to the
variable @code{start}), and uses that value to extract a substring
from the buffer and push it to the stack. This pattern is so common
that @acronym{PEG} provides a shorthand function that does exactly the
above, along with a few other shorthands for common scenarios:
@noindent
The first action above pushes the initial value of point to the stack.
The intervening @acronym{PEX} moves point over the next word. The
second action pops the previous value from the stack (binding it to the
variable @code{start}), then uses that value to extract a substring from
the buffer and push it to the stack. This pattern is so common that
@acronym{PEG} provides a shorthand function that does exactly the above,
along with a few other shorthands for common scenarios:
@table @code
@item (substring @var{E})
Match @acronym{PEX} @var{E} and push the matched string to the stack.
@findex substring (a PEG shorthand)
@item (substring @var{e})
Match @acronym{PEX} @var{e} and push the matched string onto the stack.
@item (region @var{E})
Match @var{E} and push the start and end positions of the matched
region to the stack.
@findex region (a PEG shorthand)
@item (region @var{e})
Match @var{e} and push the start and end positions of the matched
region onto the stack.
@item (replace @var{E} @var{replacement})
Match @var{E} and replaced the matched region with the string @var{replacement}.
@findex replace (a PEG shorthand)
@item (replace @var{e} @var{replacement})
Match @var{e} and replace the matched region with the string
@var{replacement}.
@item (list @var{E})
Match @var{E}, collect all values produced by @var{E} (and its
sub-expressions) into a list, and push that list to the stack. Stack
@findex list (a PEG shorthand)
@item (list @var{e})
Match @var{e}, collect all values produced by @var{e} (and its
sub-expressions) into a list, and push that list onto the stack. Stack
values are typically returned as a flat list; this is a way of
``grouping'' values together.
@end table
@node Writing PEG Rules
@section Writing PEG Rules
@cindex PEG rules, pitfalls
@cindex Parsing Expression Grammar, pitfalls in rules
Something to be aware of when writing PEG rules is that they are
greedy. Rules which can consume a variable amount of text will always
@@ -319,9 +350,10 @@ backtracking. For instance, this rule will never succeed:
(forest (+ "tree" (* [blank])) "tree" (eol))
@end example
The @acronym{PEX} @code{(+ "tree" (* [blank]))} will consume all
repetitions of the word ``tree'', leaving none to match the final
@code{"tree"}.
@noindent
The @acronym{PEX} @w{@code{(+ "tree" (* [blank]))}} will consume all
the repetitions of the word @samp{tree}, leaving none to match the final
@samp{tree}.
In these situations, the desired result can be obtained by using
predicates and guards -- namely the @code{not}, @code{if} and
@@ -331,6 +363,7 @@ predicates and guards -- namely the @code{not}, @code{if} and
(forest (+ "tree" (* [blank])) (not (eol)) "tree" (eol))
@end example
@noindent
The @code{if} and @code{not} operators accept a parsing expression and
interpret it as a boolean, without moving point. The contents of a
@code{guard} operator are evaluated as regular Lisp (not a
@@ -345,6 +378,7 @@ rule:
(end-game "game" (eob))
@end example
@noindent
when run in a buffer containing the text ``game over'' after point,
will move point to just after ``game'' then halt parsing, returning
@code{nil}. Successful parsing will always return @code{t}, or the

etc/NEWS

@@ -1587,8 +1587,8 @@ preventing the installation of Compat if unnecessary.
+++
** New package PEG.
Emacs now includes a library for writing (P)arsing (E)xpression
(G)rammars, an approach to text parsing that provides more structure
Emacs now includes a library for writing Parsing Expression
Grammars (PEG), an approach to text parsing that provides more structure
than regular expressions, but less complexity than context-free
grammars. The Info manual "(elisp) Parsing Expression Grammars" has
documentation and examples.