emacs/admin/notes/tree-sitter/html-manual/Language-Definitions.html

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Created by GNU Texinfo 6.8, https://www.gnu.org/software/texinfo/ -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<!-- This is the GNU Emacs Lisp Reference Manual
corresponding to Emacs version 29.0.50.

Copyright © 1990-1996, 1998-2023 Free Software Foundation, Inc.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with the
Invariant Sections being "GNU General Public License," with the
Front-Cover Texts being "A GNU Manual," and with the Back-Cover
Texts as in (a) below.  A copy of the license is included in the
section entitled "GNU Free Documentation License."

(a) The FSF's Back-Cover Text is: "You have the freedom to copy and
modify this GNU manual.  Buying copies from the FSF supports it in
developing GNU and promoting software freedom." -->
<title>Language Definitions (GNU Emacs Lisp Reference Manual)</title>

<meta name="description" content="Language Definitions (GNU Emacs Lisp Reference Manual)">
<meta name="keywords" content="Language Definitions (GNU Emacs Lisp Reference Manual)">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<meta name="viewport" content="width=device-width,initial-scale=1">

<link href="index.html" rel="start" title="Top">
<link href="Index.html" rel="index" title="Index">
<link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="Parsing-Program-Source.html" rel="up" title="Parsing Program Source">
<link href="Using-Parser.html" rel="next" title="Using Parser">
<style type="text/css">
<!--
a.copiable-anchor {visibility: hidden; text-decoration: none; line-height: 0em}
a.summary-letter {text-decoration: none}
blockquote.indentedblock {margin-right: 0em}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
kbd {font-style: oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
span.nolinebreak {white-space: nowrap}
span.roman {font-family: initial; font-weight: normal}
span.sansserif {font-family: sans-serif; font-weight: normal}
span:hover a.copiable-anchor {visibility: visible}
ul.no-bullet {list-style: none}
-->
</style>
<link rel="stylesheet" type="text/css" href="./manual.css">


</head>

<body lang="en">
<div class="section" id="Language-Definitions">
<div class="header">
<p>
Next: <a href="Using-Parser.html" accesskey="n" rel="next">Using Tree-sitter Parser</a>, Up: <a href="Parsing-Program-Source.html" accesskey="u" rel="up">Parsing Program Source</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index.html" title="Index" rel="index">Index</a>]</p>
</div>
<hr>
<span id="Tree_002dsitter-Language-Definitions"></span><h3 class="section">37.1 Tree-sitter Language Definitions</h3>
<span id="index-language-definitions_002c-for-tree_002dsitter"></span>

<span id="Loading-a-language-definition"></span><h3 class="heading">Loading a language definition</h3>
<span id="index-loading-language-definition-for-tree_002dsitter"></span>

<span id="index-language-argument_002c-for-tree_002dsitter"></span>
<p>Tree-sitter relies on language definitions to parse text in that
language.  In Emacs, a language definition is represented by a symbol.
For example, the C language definition is represented as the symbol
<code>c</code>, and <code>c</code> can be passed to tree-sitter functions as the
<var>language</var> argument.
</p>
<span id="index-treesit_002dextra_002dload_002dpath"></span>
<span id="index-treesit_002dload_002dlanguage_002derror"></span>
<span id="index-treesit_002dload_002dsuffixes"></span>
<p>Tree-sitter language definitions are distributed as dynamic libraries.
In order to use a language definition in Emacs, you need to make sure
that the dynamic library is installed on the system.  Emacs looks for
language definitions in several places, in the following order:
</p>
<ul>
<li> first, in the list of directories specified by the variable
<code>treesit-extra-load-path</code>;
</li><li> then, in the <samp>tree-sitter</samp> subdirectory of the directory
specified by <code>user-emacs-directory</code> (see <a href="Init-File.html">The Init File</a>);
</li><li> and finally, in the system&rsquo;s default locations for dynamic libraries.
</li></ul>

<p>In each of these directories, Emacs looks for a file with file-name
extensions specified by the variable <code>dynamic-library-suffixes</code>.
</p>
<p>If Emacs cannot find the library or has problems loading it, Emacs
signals the <code>treesit-load-language-error</code> error.  The data of
that signal could be one of the following:
</p>
<dl compact="compact">
<dt><span><code>(not-found <var>error-msg</var> &hellip;)</code></span></dt>
<dd><p>This means that Emacs could not find the language definition library.
</p></dd>
<dt><span><code>(symbol-error <var>error-msg</var>)</code></span></dt>
<dd><p>This means that Emacs could not find in the library the expected function
that every language definition library should export.
</p></dd>
<dt><span><code>(version-mismatch <var>error-msg</var>)</code></span></dt>
<dd><p>This means that the version of language definition library is incompatible
with that of the tree-sitter library.
</p></dd>
</dl>

<p>In all of these cases, <var>error-msg</var> might provide additional
details about the failure.
</p>
<dl class="def">
<dt id="index-treesit_002dlanguage_002davailable_002dp"><span class="category">Function: </span><span><strong>treesit-language-available-p</strong> <em>language &amp;optional detail</em><a href='#index-treesit_002dlanguage_002davailable_002dp' class='copiable-anchor'> &para;</a></span></dt>
<dd><p>This function returns non-<code>nil</code> if the language definitions for
<var>language</var> exist and can be loaded.
</p>
<p>If <var>detail</var> is non-<code>nil</code>, return <code>(t . nil)</code> when
<var>language</var> is available, and <code>(nil . <var>data</var>)</code> when it&rsquo;s
unavailable.  <var>data</var> is the signal data of
<code>treesit-load-language-error</code>.
</p></dd></dl>

<span id="index-treesit_002dload_002dname_002doverride_002dlist"></span>
<p>By convention, the file name of the dynamic library for <var>language</var> is
<samp>libtree-sitter-<var>language</var>.<var>ext</var></samp>, where <var>ext</var> is the
system-specific extension for dynamic libraries.  Also by convention,
the function provided by that library is named
<code>tree_sitter_<var>language</var></code>.  If a language definition library
doesn&rsquo;t follow this convention, you should add an entry
</p>
<div class="example">
<pre class="example">(<var>language</var> <var>library-base-name</var> <var>function-name</var>)
</pre></div>

<p>to the list in the variable <code>treesit-load-name-override-list</code>, where
<var>library-base-name</var> is the basename of the dynamic library&rsquo;s file name,
(usually, <samp>libtree-sitter-<var>language</var></samp>), and
<var>function-name</var> is the function provided by the library
(usually, <code>tree_sitter_<var>language</var></code>).  For example,
</p>
<div class="example">
<pre class="example">(cool-lang &quot;libtree-sitter-coool&quot; &quot;tree_sitter_cooool&quot;)
</pre></div>

<p>for a language that considers itself too &ldquo;cool&rdquo; to abide by
conventions.
</p>
<span id="index-language_002ddefinition-version_002c-compatibility"></span>
<dl class="def">
<dt id="index-treesit_002dlanguage_002dversion"><span class="category">Function: </span><span><strong>treesit-language-version</strong> <em>&amp;optional min-compatible</em><a href='#index-treesit_002dlanguage_002dversion' class='copiable-anchor'> &para;</a></span></dt>
<dd><p>This function returns the version of the language-definition
Application Binary Interface (<acronym>ABI</acronym>) supported by the
tree-sitter library.  By default, it returns the latest ABI version
supported by the library, but if <var>min-compatible</var> is
non-<code>nil</code>, it returns the oldest ABI version which the library
still can support.  Language definition libraries must be built for
ABI versions between the oldest and the latest versions supported by
the tree-sitter library, otherwise the library will be unable to load
them.
</p></dd></dl>

<span id="Concrete-syntax-tree"></span><h3 class="heading">Concrete syntax tree</h3>
<span id="index-syntax-tree_002c-concrete"></span>

<p>A syntax tree is what a parser generates.  In a syntax tree, each node
represents a piece of text, and is connected to each other by a
parent-child relationship.  For example, if the source text is
</p>
<div class="example">
<pre class="example">1 + 2
</pre></div>

<p>its syntax tree could be
</p>
<div class="example">
<pre class="example">                  +--------------+
                  | root &quot;1 + 2&quot; |
                  +--------------+
                         |
        +--------------------------------+
        |       expression &quot;1 + 2&quot;       |
        +--------------------------------+
           |             |            |
+------------+   +--------------+   +------------+
| number &quot;1&quot; |   | operator &quot;+&quot; |   | number &quot;2&quot; |
+------------+   +--------------+   +------------+
</pre></div>

<p>We can also represent it as an s-expression:
</p>
<div class="example">
<pre class="example">(root (expression (number) (operator) (number)))
</pre></div>

<span id="Node-types"></span><h4 class="subheading">Node types</h4>
<span id="index-node-types_002c-in-a-syntax-tree"></span>

<span id="index-type-of-node_002c-tree_002dsitter"></span>
<span id="tree_002dsitter-node-type"></span><span id="index-named-node_002c-tree_002dsitter"></span>
<span id="tree_002dsitter-named-node"></span><span id="index-anonymous-node_002c-tree_002dsitter"></span>
<p>Names like <code>root</code>, <code>expression</code>, <code>number</code>, and
<code>operator</code> specify the <em>type</em> of the nodes.  However, not all
nodes in a syntax tree have a type.  Nodes that don&rsquo;t have a type are
known as <em>anonymous nodes</em>, and nodes with a type are <em>named
nodes</em>.  Anonymous nodes are tokens with fixed spellings, including
punctuation characters like bracket &lsquo;<samp>]</samp>&rsquo;, and keywords like
<code>return</code>.
</p>
<span id="Field-names"></span><h4 class="subheading">Field names</h4>

<span id="index-field-name_002c-tree_002dsitter"></span>
<span id="index-tree_002dsitter-node-field-name"></span>
<span id="tree_002dsitter-node-field-name"></span><p>To make the syntax tree easier to analyze, many language definitions
assign <em>field names</em> to child nodes.  For example, a
<code>function_definition</code> node could have a <code>declarator</code> and a
<code>body</code>:
</p>
<div class="example">
<pre class="example">(function_definition
 declarator: (declaration)
 body: (compound_statement))
</pre></div>

<span id="Exploring-the-syntax-tree"></span><h3 class="heading">Exploring the syntax tree</h3>
<span id="index-explore-tree_002dsitter-syntax-tree"></span>
<span id="index-inspection-of-tree_002dsitter-parse-tree-nodes"></span>

<p>To aid in understanding the syntax of a language and in debugging of
Lisp program that use the syntax tree, Emacs provides an &ldquo;explore&rdquo;
mode, which displays the syntax tree of the source in the current
buffer in real time.  Emacs also comes with an &ldquo;inspect mode&rdquo;, which
displays information of the nodes at point in the mode-line.
</p>
<dl class="def">
<dt id="index-treesit_002dexplore_002dmode"><span class="category">Command: </span><span><strong>treesit-explore-mode</strong><a href='#index-treesit_002dexplore_002dmode' class='copiable-anchor'> &para;</a></span></dt>
<dd><p>This mode pops up a window displaying the syntax tree of the source in
the current buffer.  Selecting text in the source buffer highlights
the corresponding nodes in the syntax tree display.  Clicking
on nodes in the syntax tree highlights the corresponding text in the
source buffer.
</p></dd></dl>

<dl class="def">
<dt id="index-treesit_002dinspect_002dmode"><span class="category">Command: </span><span><strong>treesit-inspect-mode</strong><a href='#index-treesit_002dinspect_002dmode' class='copiable-anchor'> &para;</a></span></dt>
<dd><p>This minor mode displays on the mode-line the node that <em>starts</em>
at point.  For example, the mode-line can display
</p>
<div class="example">
<pre class="example"><var>parent</var> <var>field</var>: (<var>node</var> (<var>child</var> (&hellip;)))
</pre></div>

<p>where <var>node</var>, <var>child</var>, etc., are nodes which begin at point.
<var>parent</var> is the parent of <var>node</var>.  <var>node</var> is displayed in
a bold typeface.  <var>field-name</var>s are field names of <var>node</var> and
of <var>child</var>, etc.
</p>
<p>If no node starts at point, i.e., point is in the middle of a node,
then the mode line displays the earliest node that spans point, and
its immediate parent.
</p>
<p>This minor mode doesn&rsquo;t create parsers on its own.  It uses the first
parser in <code>(treesit-parser-list)</code> (see <a href="Using-Parser.html">Using Tree-sitter Parser</a>).
</p></dd></dl>

<span id="Reading-the-grammar-definition"></span><h3 class="heading">Reading the grammar definition</h3>
<span id="index-reading-grammar-definition_002c-tree_002dsitter"></span>

<p>Authors of language definitions define the <em>grammar</em> of a
programming language, which determines how a parser constructs a
concrete syntax tree out of the program text.  In order to use the
syntax tree effectively, you need to consult the <em>grammar file</em>.
</p>
<p>The grammar file is usually <samp>grammar.js</samp> in a language
definition&rsquo;s project repository.  The link to a language definition&rsquo;s
home page can be found on
<a href="https://tree-sitter.github.io/tree-sitter">tree-sitter&rsquo;s
homepage</a>.
</p>
<p>The grammar definition is written in JavaScript.  For example, the
rule matching a <code>function_definition</code> node looks like
</p>
<div class="example">
<pre class="example">function_definition: $ =&gt; seq(
  $.declaration_specifiers,
  field('declarator', $.declaration),
  field('body', $.compound_statement)
)
</pre></div>

<p>The rules are represented by functions that take a single argument
<var>$</var>, representing the whole grammar.  The function itself is
constructed by other functions: the <code>seq</code> function puts together
a sequence of children; the <code>field</code> function annotates a child
with a field name.  If we write the above definition in the so-called
<em>Backus-Naur Form</em> (<acronym>BNF</acronym>) syntax, it would look like
</p>
<div class="example">
<pre class="example">function_definition :=
  &lt;declaration_specifiers&gt; &lt;declaration&gt; &lt;compound_statement&gt;
</pre></div>

<p>and the node returned by the parser would look like
</p>
<div class="example">
<pre class="example">(function_definition
  (declaration_specifier)
  declarator: (declaration)
  body: (compound_statement))
</pre></div>

<p>Below is a list of functions that one can see in a grammar definition.
Each function takes other rules as arguments and returns a new rule.
</p>
<dl compact="compact">
<dt><span><code>seq(<var>rule1</var>, <var>rule2</var>, &hellip;)</code></span></dt>
<dd><p>matches each rule one after another.
</p></dd>
<dt><span><code>choice(<var>rule1</var>, <var>rule2</var>, &hellip;)</code></span></dt>
<dd><p>matches one of the rules in its arguments.
</p></dd>
<dt><span><code>repeat(<var>rule</var>)</code></span></dt>
<dd><p>matches <var>rule</var> for <em>zero or more</em> times.
This is like the &lsquo;<samp>*</samp>&rsquo; operator in regular expressions.
</p></dd>
<dt><span><code>repeat1(<var>rule</var>)</code></span></dt>
<dd><p>matches <var>rule</var> for <em>one or more</em> times.
This is like the &lsquo;<samp>+</samp>&rsquo; operator in regular expressions.
</p></dd>
<dt><span><code>optional(<var>rule</var>)</code></span></dt>
<dd><p>matches <var>rule</var> for <em>zero or one</em> time.
This is like the &lsquo;<samp>?</samp>&rsquo; operator in regular expressions.
</p></dd>
<dt><span><code>field(<var>name</var>, <var>rule</var>)</code></span></dt>
<dd><p>assigns field name <var>name</var> to the child node matched by <var>rule</var>.
</p></dd>
<dt><span><code>alias(<var>rule</var>, <var>alias</var>)</code></span></dt>
<dd><p>makes nodes matched by <var>rule</var> appear as <var>alias</var> in the syntax
tree generated by the parser.  For example,
</p>
<div class="example">
<pre class="example">alias(preprocessor_call_exp, call_expression)
</pre></div>

<p>makes any node matched by <code>preprocessor_call_exp</code> appear as
<code>call_expression</code>.
</p></dd>
</dl>

<p>Below are grammar functions of lesser importance for reading a
language definition.
</p>
<dl compact="compact">
<dt><span><code>token(<var>rule</var>)</code></span></dt>
<dd><p>marks <var>rule</var> to produce a single leaf node.  That is, instead of
generating a parent node with individual child nodes under it,
everything is combined into a single leaf node.  See <a href="Retrieving-Nodes.html">Retrieving Nodes</a>.
</p></dd>
<dt><span><code>token.immediate(<var>rule</var>)</code></span></dt>
<dd><p>Normally, grammar rules ignore preceding whitespace; this
changes <var>rule</var> to match only when there is no preceding
whitespaces.
</p></dd>
<dt><span><code>prec(<var>n</var>, <var>rule</var>)</code></span></dt>
<dd><p>gives <var>rule</var> the level-<var>n</var> precedence.
</p></dd>
<dt><span><code>prec.left([<var>n</var>,] <var>rule</var>)</code></span></dt>
<dd><p>marks <var>rule</var> as left-associative, optionally with level <var>n</var>.
</p></dd>
<dt><span><code>prec.right([<var>n</var>,] <var>rule</var>)</code></span></dt>
<dd><p>marks <var>rule</var> as right-associative, optionally with level <var>n</var>.
</p></dd>
<dt><span><code>prec.dynamic(<var>n</var>, <var>rule</var>)</code></span></dt>
<dd><p>this is like <code>prec</code>, but the precedence is applied at runtime
instead.
</p></dd>
</dl>

<p>The documentation of the tree-sitter project has
<a href="https://tree-sitter.github.io/tree-sitter/creating-parsers">more
about writing a grammar</a>.  Read especially &ldquo;The Grammar DSL&rdquo;
section.
</p>
</div>
<hr>
<div class="header">
<p>
Next: <a href="Using-Parser.html">Using Tree-sitter Parser</a>, Up: <a href="Parsing-Program-Source.html">Parsing Program Source</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index.html" title="Index" rel="index">Index</a>]</p>
</div>


</body>
</html>