mirror of git://git.sv.gnu.org/emacs.git (synced 2026-01-30 04:10:54 -08:00)
Fix the new PEG library

* doc/lispref/peg.texi (Parsing Expression Grammars)
(PEX Definitions, Parsing Actions, Writing PEG Rules): Fix markup,
indexing, and wording.
* etc/NEWS: Fix wording of PEG entry.
* test/lisp/progmodes/peg-tests.el: Move from test/lisp/, to match
the directory of peg.el.

parent 914b00f207
commit 994bcc125b
3 changed files with 117 additions and 83 deletions

doc/lispref/peg.texi
@@ -7,29 +7,34 @@
 @chapter Parsing Expression Grammars
 @cindex text parsing
 @cindex parsing expression grammar
+@cindex PEG
 
 Emacs Lisp provides several tools for parsing and matching text,
 from regular expressions (@pxref{Regular Expressions}) to full
-@acronym{LL} grammar parsers (@pxref{Top,, Bovine parser
-development,bovine}). @dfn{Parsing Expression Grammars}
+left-to-right (a.k.a.@: @acronym{LL}) grammar parsers (@pxref{Top,,
+Bovine parser development,bovine}). @dfn{Parsing Expression Grammars}
 (@acronym{PEG}) are another approach to text parsing that offer more
 structure and composibility than regular expressions, but less
 complexity than context-free grammars.
 
-A @acronym{PEG} parser is defined as a list of named rules, each of
-which matches text patterns, and/or contains references to other
+A Parsing Expression Grammar (@acronym{PEG}) describes a formal language
+in terms of a set of rules for recognizing strings in the language. In
+Emacs, a @acronym{PEG} parser is defined as a list of named rules, each
+of which matches text patterns and/or contains references to other
 rules. Parsing is initiated with the function @code{peg-run} or the
 macro @code{peg-parse} (see below), and parses text after point in the
 current buffer, using a given set of rules.
 
 @cindex parsing expression
-The definition of each rule is referred to as a @dfn{parsing
-expression} (@acronym{PEX}), and can consist of a literal string, a
-regexp-like character range or set, a peg-specific construct
-resembling an elisp function call, a reference to another rule, or a
-combination of any of these. A grammar is expressed as a tree of
-rules in which one rule is typically treated as a ``root'' or
-``entry-point'' rule. For instance:
+@cindex root, of parsing expression grammar
+@cindex entry-point, of parsing expression grammar
+Each rule in a @acronym{PEG} is referred to as a @dfn{parsing
+expression} (@acronym{PEX}), and can be specified as a literal string, a
+regexp-like character range or set, a peg-specific construct resembling
+an Emacs Lisp function call, a reference to another rule, or a
+combination of any of these. A grammar is expressed as a tree of rules
+in which one rule is typically treated as a ``root'' or ``entry-point''
+rule. For instance:
 
 @example
 @group
@@ -56,14 +61,17 @@ first rule is considered the ``entry-point'':
 @end group
 @end example
 
-This macro represents the simplest use of the @acronym{PEG} library,
-but also the least flexible, as the rules must be written directly
-into the source code. A more flexible approach involves use of three
-macros in conjunction: @code{with-peg-rules}, a @code{let}-like
-construct that makes a set of rules available within the macro body;
-@code{peg-run}, which initiates parsing given a single rule; and
-@code{peg}, which is used to wrap the entry-point rule name. In fact,
-a call to @code{peg-parse} expands to just this set of calls. The
+@c FIXME: These two should be formally defined using @defmac and @defun.
+@findex with-peg-rules
+@findex peg-run
+The @code{peg-parse} macro represents the simplest use of the
+@acronym{PEG} library, but also the least flexible, as the rules must be
+written directly into the source code. A more flexible approach
+involves use of three macros in conjunction: @code{with-peg-rules}, a
+@code{let}-like construct that makes a set of rules available within the
+macro body; @code{peg-run}, which initiates parsing given a single rule;
+and @code{peg}, which is used to wrap the entry-point rule name. In
+fact, a call to @code{peg-parse} expands to just this set of calls. The
 above example could be written as:
 
 @example
@@ -79,33 +87,43 @@ above example could be written as:
 This allows more explicit control over the ``entry-point'' of parsing,
 and allows the combination of rules from different sources.
 
+@c FIXME: Use @defmac.
+@findex define-peg-rule
 Individual rules can also be defined using a more @code{defun}-like
 syntax, using the macro @code{define-peg-rule}:
 
 @example
+@group
 (define-peg-rule digit ()
   [0-9])
+@end group
 @end example
 
 This also allows for rules that accept an argument (supplied by the
-@code{funcall} PEG rule).
+@code{funcall} PEG rule, @pxref{PEX Definitions}).
 
+@c FIXME: Use @defmac.
+@findex define-peg-ruleset
 Another possibility is to define a named set of rules with
 @code{define-peg-ruleset}:
 
 @example
+@group
 (define-peg-ruleset number-grammar
   '((number sign digit (* digit))
     digit ;; A reference to the definition above.
     (sign (or "+" "-" ""))))
+@end group
 @end example
 
 Rules and rulesets defined this way can be referred to by name in
 later calls to @code{peg-run} or @code{with-peg-rules}:
 
 @example
+@group
 (with-peg-rules number-grammar
   (peg-run (peg number)))
+@end group
 @end example
 
 By default, calls to @code{peg-run} or @code{peg-parse} produce no
@@ -125,11 +143,11 @@ act upon parsed strings, rules can include @dfn{actions}, see
 Parsing expressions can be defined using the following syntax:
 
 @table @code
-@item (and E1 E2 ...)
-A sequence of @acronym{PEX}s that must all be matched. The @code{and} form is
-optional and implicit.
+@item (and @var{e1} @var{e2}@dots{})
+A sequence of @acronym{PEX}s that must all be matched. The @code{and}
+form is optional and implicit.
 
-@item (or E1 E2 ...)
+@item (or @var{e1} @var{e2}@dots{})
 Prioritized choices, meaning that, as in Elisp, the choices are tried
 in order, and the first successful match is used. Note that this is
 distinct from context-free grammars, in which selection between
@@ -141,43 +159,43 @@ Matches any single character, as the regexp ``.''.
 @item @var{string}
 A literal string.
 
-@item (char @var{C})
-A single character @var{C}, as an Elisp character literal.
+@item (char @var{c})
+A single character @var{c}, as an Elisp character literal.
 
-@item (* @var{E})
-Zero or more instances of expression @var{E}, as the regexp @samp{*}.
+@item (* @var{e})
+Zero or more instances of expression @var{e}, as the regexp @samp{*}.
 Matching is always ``greedy''.
 
-@item (+ @var{E})
-One or more instances of expression @var{E}, as the regexp @samp{+}.
+@item (+ @var{e})
+One or more instances of expression @var{e}, as the regexp @samp{+}.
 Matching is always ``greedy''.
 
-@item (opt @var{E})
-Zero or one instance of expression @var{E}, as the regexp @samp{?}.
+@item (opt @var{e})
+Zero or one instance of expression @var{e}, as the regexp @samp{?}.
 
-@item SYMBOL
+@item @var{symbol}
 A symbol representing a previously-defined PEG rule.
 
-@item (range CH1 CH2)
-The character range between CH1 and CH2, as the regexp @samp{[CH1-CH2]}.
+@item (range @var{ch1} @var{ch2})
+The character range between @var{ch1} and @var{ch2}, as the regexp
+@samp{[@var{ch1}-@var{ch2}]}.
 
-@item [CH1-CH2 "+*" ?x]
+@item [@var{ch1}-@var{ch2} "+*" ?x]
 A character set, which can include ranges, character literals, or
 strings of characters.
 
 @item [ascii cntrl]
 A list of named character classes.
 
-@item (syntax-class @var{NAME})
+@item (syntax-class @var{name})
 A single syntax class.
 
-@item (funcall E ARGS...)
-Call @acronym{PEX} E (previously defined with @code{define-peg-rule})
-with arguments @var{ARGS}.
+@item (funcall @var{e} @var{args}@dots{})
+Call @acronym{PEX} @var{e} (previously defined with
+@code{define-peg-rule}) with arguments @var{args}.
 
 @item (null)
 The empty string.
-
 @end table
 
 The following expressions are used as anchors or tests -- they do not
@@ -210,19 +228,19 @@ Beginning of symbol.
 @item (eos)
 End of symbol.
 
-@item (if E)
-Returns non-@code{nil} if parsing @acronym{PEX} E from point succeeds (point
-is not moved).
+@item (if @var{e})
+Returns non-@code{nil} if parsing @acronym{PEX} @var{e} from point
+succeeds (point is not moved).
 
-@item (not E)
-Returns non-@code{nil} if parsing @acronym{PEX} E from point fails (point
-is not moved).
-
-@item (guard EXP)
-Treats the value of the Lisp expression EXP as a boolean.
+@item (not @var{e})
+Returns non-@code{nil} if parsing @acronym{PEX} @var{e} from point fails
+(point is not moved).
+
+@item (guard @var{exp})
+Treats the value of the Lisp expression @var{exp} as a boolean.
 @end table
 
 @c FIXME: peg-char-classes should be mentioned in the text below.
 @vindex peg-char-classes
 Character class matching can use the same named character classes as
 in regular expressions (@pxref{Top,, Character Classes,elisp})
@@ -234,12 +252,13 @@ in regular expressions (@pxref{Top,, Character Classes,elisp})
 @cindex parsing stack
 By default the process of parsing simply moves point in the current
 buffer, ultimately returning @code{t} if the parsing succeeds, and
-@code{nil} if it doesn't. It's also possible to define ``actions''
-that can run arbitrary Elisp at certain points in the parsed text.
-These actions can optionally affect something called the @dfn{parsing
-stack}, which is a list of values returned by the parsing process.
-These actions only run (and only return values) if the parsing process
-ultimately succeeds; if it fails the action code is not run at all.
+@code{nil} if it doesn't. It's also possible to define @dfn{parsing
+actions} that can run arbitrary Elisp at certain points in the parsed
+text. These actions can optionally affect something called the
+@dfn{parsing stack}, which is a list of values returned by the parsing
+process. These actions only run (and only return values) if the parsing
+process ultimately succeeds; if it fails the action code is not run at
+all.
 
 Actions can be added anywhere in the definition of a rule. They are
 distinguished from parsing expressions by an initial backquote
@@ -247,12 +266,13 @@ distinguished from parsing expressions by an initial backquote
 of hyphens (@samp{--}) somewhere within it. Symbols to the left of
 the hyphens are bound to values popped from the stack (they are
 somewhat analogous to the argument list of a lambda form). Values
-produced by code to the right are pushed to the stack (analogous to
-the return value of the lambda). For instance, the previous grammar
-can be augmented with actions to return the parsed number as an actual
-integer:
+produced by code to the right of the hyphens are pushed onto the stack
+(analogous to the return value of the lambda). For instance, the
+previous grammar can be augmented with actions to return the parsed
+number as an actual integer:
 
 @example
+@group
 (with-peg-rules ((number sign digit (* digit
                                        `(a b -- (+ (* a 10) b)))
                          `(sign val -- (* sign val)))
@@ -261,6 +281,7 @@ integer:
                            (and "" `(-- 1))))
                  (digit [0-9] `(-- (- (char-before) ?0))))
   (peg-run (peg number)))
+@end group
 @end example
 
 There must be values on the stack before they can be popped and
@@ -271,43 +292,53 @@ only left-hand terms will consume (and discard) values from the stack.
 At the end of parsing, stack values are returned as a flat list.
 
 To return the string matched by a @acronym{PEX} (instead of simply
-moving point over it), a rule like this can be used:
+moving point over it), a grammar can use a rule like this:
 
 @example
+@group
 (one-word
   `(-- (point))
   (+ [word])
   `(start -- (buffer-substring start (point))))
+@end group
 @end example
 
-The first action pushes the initial value of point to the stack. The
-intervening @acronym{PEX} moves point over the next word. The second
-action pops the previous value from the stack (binding it to the
-variable @code{start}), and uses that value to extract a substring
-from the buffer and push it to the stack. This pattern is so common
-that @acronym{PEG} provides a shorthand function that does exactly the
-above, along with a few other shorthands for common scenarios:
+@noindent
+The first action above pushes the initial value of point to the stack.
+The intervening @acronym{PEX} moves point over the next word. The
+second action pops the previous value from the stack (binding it to the
+variable @code{start}), then uses that value to extract a substring from
+the buffer and push it to the stack. This pattern is so common that
+@acronym{PEG} provides a shorthand function that does exactly the above,
+along with a few other shorthands for common scenarios:
 
 @table @code
-@item (substring @var{E})
-Match @acronym{PEX} @var{E} and push the matched string to the stack.
+@findex substring (a PEG shorthand)
+@item (substring @var{e})
+Match @acronym{PEX} @var{e} and push the matched string onto the stack.
 
-@item (region @var{E})
-Match @var{E} and push the start and end positions of the matched
-region to the stack.
+@findex region (a PEG shorthand)
+@item (region @var{e})
+Match @var{e} and push the start and end positions of the matched
+region onto the stack.
 
-@item (replace @var{E} @var{replacement})
-Match @var{E} and replaced the matched region with the string @var{replacement}.
+@findex replace (a PEG shorthand)
+@item (replace @var{e} @var{replacement})
+Match @var{e} and replace the matched region with the string
+@var{replacement}.
 
-@item (list @var{E})
-Match @var{E}, collect all values produced by @var{E} (and its
-sub-expressions) into a list, and push that list to the stack. Stack
+@findex list (a PEG shorthand)
+@item (list @var{e})
+Match @var{e}, collect all values produced by @var{e} (and its
+sub-expressions) into a list, and push that list onto the stack. Stack
 values are typically returned as a flat list; this is a way of
 ``grouping'' values together.
 @end table
 
 @node Writing PEG Rules
 @section Writing PEG Rules
+@cindex PEG rules, pitfalls
+@cindex Parsing Expression Grammar, pitfalls in rules
 
 Something to be aware of when writing PEG rules is that they are
 greedy. Rules which can consume a variable amount of text will always
@@ -319,9 +350,10 @@ backtracking. For instance, this rule will never succeed:
 (forest (+ "tree" (* [blank])) "tree" (eol))
 @end example
 
-The @acronym{PEX} @code{(+ "tree" (* [blank]))} will consume all
-repetitions of the word ``tree'', leaving none to match the final
-@code{"tree"}.
+@noindent
+The @acronym{PEX} @w{@code{(+ "tree" (* [blank]))}} will consume all
+the repetitions of the word @samp{tree}, leaving none to match the final
+@samp{tree}.
 
 In these situations, the desired result can be obtained by using
 predicates and guards -- namely the @code{not}, @code{if} and
@@ -331,6 +363,7 @@ predicates and guards -- namely the @code{not}, @code{if} and
 (forest (+ "tree" (* [blank])) (not (eol)) "tree" (eol))
 @end example
 
+@noindent
 The @code{if} and @code{not} operators accept a parsing expression and
 interpret it as a boolean, without moving point. The contents of a
 @code{guard} operator are evaluated as regular Lisp (not a
@@ -345,6 +378,7 @@ rule:
 (end-game "game" (eob))
 @end example
 
+@noindent
 when run in a buffer containing the text ``game over'' after point,
 will move point to just after ``game'' then halt parsing, returning
 @code{nil}. Successful parsing will always return @code{t}, or the

4 etc/NEWS
@@ -1587,8 +1587,8 @@ preventing the installation of Compat if unnecessary.
 
 +++
 ** New package PEG.
-Emacs now includes a library for writing (P)arsing (E)xpression
-(G)rammars, an approach to text parsing that provides more structure
+Emacs now includes a library for writing Parsing Expression
+Grammars (PEG), an approach to text parsing that provides more structure
 than regular expressions, but less complexity than context-free
 grammars. The Info manual "(elisp) Parsing Expression Grammars" has
 documentation and examples.