Modern Tree-sitter, part 5: injections
One annoying thing that software developers do is insist on writing in more than one language at once. Web developers are especially obnoxious about this: routinely, for instance, putting CSS inside their HTML, or HTML inside their JavaScript, or CSS inside their HTML inside their JavaScript.
Code editors like Pulsar need to roll with this, so today we'll talk about how the modern Tree-sitter system handles what we call injections.
The TextMate grammar system understands injections. In any context, a TextMate grammar can include a subset of its own rules… or an entirely separate grammar.
But Tree-sitter needs something a bit more elaborate. If I've got CSS inside a `style` tag in my HTML file, now I've got two different parsers, each responsible for a different range of code. If I make some changes inside that `style` block, both parsers need to react to it.
Injections cover a wide range of use cases, from the examples above to fenced code blocks in Markdown files to special-purpose injections that recognize things like URLs. Injections allow us to do some powerful and useful things that would be hard to do otherwise, including some things that TextMate injections can't do at all.
A mental model for injections
Let's pretend we have a simple HTML file that looks like this:
```html
<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8" />
    <title>Sample</title>
    <style>
      body {
        padding: 0;
      }
    </style>
    <script type="text/javascript">
      window.dataLayer = window.dataLayer || [];
      function gtag() {
        dataLayer.push(arguments);
      }
      gtag("js", new Date());
      gtag("config", "G-ABCDEFGHIJ");
    </script>
  </head>
  <body></body>
</html>
```
We haven't gotten very close to the machinery so far in this series, but I've been content to have you model this as a single document with a single Tree-sitter HTML parser responsible for all syntax highlighting. That works fine until we get to the contents of the `style` and `script` elements.
To the Tree-sitter HTML parser, a `style` element looks like this:
You can see that it performs the usual parsing on the start and end tags, but punts on parsing the CSS itself, instead marking it as `raw_text`. This is what it should do! It's not a CSS parser, after all. It treats the inline `script` element similarly, marking its contents as `raw_text` because it doesn't understand JavaScript.
To apply syntax highlighting to these areas, we need to bring in parsers that understand these languages.
So our mental model needs to evolve. Instead of one buffer with one parser, we have one buffer with three parsers. We need a name for "a region of the buffer that uses a specific grammar to be understood," so let's call it a language layer, because that's what Pulsar calls it under the hood.
Language layers
Imagine a simpler HTML file that doesn't have any inline `style` or `script` tags:
```html
<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8" />
    <title>Sample</title>
  </head>
  <body></body>
</html>
```
Here we're looking at a buffer with a single language layer at the root. When I type a new keystroke in this buffer, only one Tree-sitter parser has to do any re-parsing work, and only one layer needs to be consulted when re-applying syntax highlighting.
Once I add a `style` block…
```html
<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8" />
    <title>Sample</title>
    <style>
      body {
        padding: 0;
      }
    </style>
  </head>
  <body></body>
</html>
```
…I trigger the creation of a second language layer. This new layer is a child of the root HTML layer, because the HTML layer is its reason for being, and the CSS layer might go away in the future if the HTML layer changes. The new language layer uses a Tree-sitter CSS parser.
When I add back the `script` block…
```html
<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8" />
    <title>Sample</title>
    <style>
      body {
        padding: 0;
      }
    </style>
    <script type="text/javascript">
      window.dataLayer = window.dataLayer || [];
      function gtag() {
        dataLayer.push(arguments);
      }
      gtag("js", new Date());
      gtag("config", "G-ABCDEFGHIJ");
    </script>
  </head>
  <body></body>
</html>
```
…I trigger the creation of another language layer: a JavaScript layer that is also a child of the root HTML layer. Now there are three layers that might need to be consulted for syntax highlighting and other tasks.
And it doesn't stop here! Certain constructs inside of the JavaScript, like regular expressions or tagged template literals, might carry their own injections, in which case new language layers would be created as children of the JavaScript layer. The result is a tree of language layers which cooperate to apply syntax highlighting to our buffer.
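Nothing in this sketch is Pulsar's actual data structure, but the shape is easy to picture: each layer holds a grammar plus the child layers its injections created. (The nested regex layer and its scope name are illustrative, standing in for "a construct inside the JavaScript that carries its own injection.")

```javascript
// A toy picture of the layer tree for the example buffer, not Pulsar's real
// internals. Scope names for the nested regex layer are made up.
const layerTree = {
  grammar: "text.html.basic",
  children: [
    { grammar: "source.css", children: [] }, // the <style> block
    {
      grammar: "source.js", // the <script> block
      children: [{ grammar: "source.regexp", children: [] }], // e.g. a regex literal
    },
  ],
};

// Any whole-buffer task means walking this tree:
function countLayers(layer) {
  return 1 + layer.children.reduce((sum, child) => sum + countLayers(child), 0);
}

console.log(countLayers(layerTree)); // → 4
```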
This might feel impossibly complex, but it isn't. It's just a different approach to what was already being done with TextMate grammars. In a minute I'll explain how Pulsar manages this complexity.
How Tree-sitter envisions injections
The Tree-sitter CLI tool performs its own code highlighting, so it needs its own solution for injections. It envisions a query file called `injections.scm` that maps certain tree nodes to certain languages. For instance, here's the HTML parser's `injections.scm`:
```scheme
((script_element
  (raw_text) @injection.content)
  (#set! injection.language "javascript"))

((style_element
  (raw_text) @injection.content)
  (#set! injection.language "css"))
```
These are simple examples, but the website documentation covers more advanced scenarios. For instance, the name of the language might not be hard-coded in the query file; it might be something that you'll need to determine from the tree itself, like by inspecting a heredoc string's tag:
```ruby
list_items.each do |item|
  puts <<~HTML
    <li>#{item.name}</li>
  HTML
end
```
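As a sketch of what "determine it from the tree itself" means in practice, the logic boils down to reading the heredoc's tag and mapping it to a language name. (The opener pattern and the set of known names here are made up for illustration; they aren't taken from any real grammar.)

```javascript
// Toy version of dynamic language detection: pull the tag out of a heredoc
// opener and map it to a language name. Returning undefined means
// "don't create an injection at all."
const KNOWN_LANGUAGES = new Set(["html", "css", "javascript", "sql", "yaml"]);

function languageForHeredoc(openerText) {
  const match = /<<[-~]?(\w+)/.exec(openerText);
  if (!match) return undefined;
  const name = match[1].toLowerCase();
  return KNOWN_LANGUAGES.has(name) ? name : undefined;
}

console.log(languageForHeredoc("<<~HTML")); // → "html"
console.log(languageForHeredoc("<<~EOF")); // → undefined
```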
This is a thorough system, and we've already shown how query files can solve problems like syntax highlighting and indentation hinting. So it can solve our language injection problem, right?
How Pulsar implements injections
The stuff I just described makes a lot of sense, and it's possible we'll do it someday. But here's how Pulsar does it now:
```javascript
atom.grammars.addInjectionPoint("text.html.basic", {
  type: "script_element",
  language() {
    return "javascript";
  },
  content(node) {
    return node.child(1);
  },
});

atom.grammars.addInjectionPoint("text.html.basic", {
  type: "style_element",
  language() {
    return "css";
  },
  content(node) {
    return node.child(1);
  },
});
```
This code block is the equivalent of the above query file for HTML injections. It's got a similar amount of flexibility for selecting a target (e.g., "the second child of a `style_element` node") and the ability to determine the language name either dynamically or statically. But we've done everything else via queries; why do we do injections this way instead?
The `addInjectionPoint` API was added by the legacy Tree-sitter implementation. For reasons of continuity, it makes a lot of sense to keep using this API rather than switch to something that's functionally the same.

In fact, there's one thing that the `addInjectionPoint` API does that a hypothetical `injections.scm` file can't: it can be used to add injections to Language X even by someone who doesn't control the `language-x` bundle. This makes it far more useful to Pulsar! It means that someone can write their own parser that injects itself into another language, whether in the form of a community package or a few lines in a user's `init.js`.

To me, it doesn't make sense to deprecate the `addInjectionPoint` approach when it can do things that a query-based approach can't. Still, lots of Tree-sitter parsers include that query file, so I imagine that Pulsar will eventually support it in addition to the `addInjectionPoint` API.
How does it know which language to use?
You might've noticed that both the Tree-sitter query file example and our `addInjectionPoint` example refer to the to-be-injected language rather casually, as `javascript` and `css`. Internally, grammars tend to refer to one another via their root scope, as in the `text.html.basic` case above. So why not just use the root scope?
Two reasons:
- In our example, the HTML grammar shouldn't necessarily hard-code references to the injectable grammars it wants. It makes more abstract sense to describe the language it wants as `javascript` instead of `source.js`, because more than one grammar could theoretically identify as a JavaScript grammar, and because a grammar might want to respond to more than one name. (`js` and `javascript`, `c++` and `cpp`, and so forth.)
- There are multiple use cases for associating a shorthand token in a buffer file with a language name. I mentioned above how heredoc strings often hint at the language of the string content via the tag. (And as I write this document, I glance at a dozen other examples: fenced code blocks in Markdown.) So it's useful for us to be able to determine the to-be-injected language dynamically by inspecting the content of the buffer. We need to meet those use cases where they are.
So when an injection wants `javascript`, we need to be able to match it to our `source.js` grammar. That happens via a property in the grammar definition file; the grammar itself describes its "short" name.
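Here's a rough sketch of what that matching might look like. The `injectionRegex` property name mirrors the one the legacy system used in grammar definition files; treat the details here as assumptions rather than Pulsar's exact implementation.

```javascript
// A sketch of short-name matching: each grammar advertises the names it
// answers to via a regex in its definition file, and an injection's requested
// language name is tested against each grammar's regex.
const grammars = [
  { rootScope: "source.js", injectionRegex: /^(js|javascript)$/i },
  { rootScope: "source.css", injectionRegex: /^css$/i },
  { rootScope: "source.cpp", injectionRegex: /^(c\+\+|cpp)$/i },
];

function grammarForLanguageName(name, pool = grammars) {
  return pool.find((grammar) => grammar.injectionRegex.test(name)) ?? null;
}

console.log(grammarForLanguageName("javascript").rootScope); // → "source.js"
console.log(grammarForLanguageName("cpp").rootScope); // → "source.cpp"
```

A regex (rather than a list of literal names) also covers case differences like `HTML` versus `html` for free.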
Architecture
No matter which approach we use for describing injections, the job of processing injections is roughly the same. All of Pulsar's Tree-sitter support is written with the understanding that there could be an arbitrary number of grammars that need to be consulted when an edit happens. So the job of parsing a document is divided up into some number of `LanguageLayer` instances arranged hierarchically.
To visualize what we described above, you can once again use tree-sitter-tools. Open your favorite HTML document, then run the Tree Sitter Tools: Open Inspector For Editor command. You'll be able to see all of a document's trees in a drop-down list:
The first item in the list will always be the "root" tree. Other items, if present, represent injections. And because injected languages can have their own injections, this list can grow to arbitrary length.
When the user edits the buffer, even with a single keystroke, we re-parse the document from the root down as follows:
- The root layer re-parses.
- When that's done, Pulsar looks for possible injections by querying for nodes that have been specified in calls to `addInjectionPoint`.
- If those nodes match the criteria of `addInjectionPoint` (does the injection describe a language whose name we can match to a grammar? does the node specified by the `content` callback exist?), then we try to match them up to the layers that already exist. Layers that can't be matched to current injection nodes are disposed of, and nodes that can't be matched to existing layers get new layers created for them.
- The process starts over for each layer at the next level of depth in the tree until all injections are current and all parsers have re-parsed.
Any keystroke can create a brand new injection or invalidate one that used to exist. If I put the cursor inside of `<style>` and insert an `x`, changing it to `<styxle>`, then the CSS injection would no longer exist, and its corresponding `LanguageLayer` instance would need to be destroyed. If I then undo my change, the parse tree restores the `style_element` node, and a new language layer is created.
Does this feel overwhelming? That's fair. After all, I'm writing this blog post in a buffer with 32 different language layers across the various code examples, and you'd think that would add up to one hell of a performance penalty on each keystroke. But it doesn't.
Here are a few reasons why:
We don't revisit every single language layer on every single keystroke, because we can determine when a given buffer change cannot possibly affect a given injection. For instance, if I'm editing a section of my HTML file outside of the `style` block, then Pulsar knows it doesn't have to re-parse the CSS inside of that `style` block yet. It knows that the layer's parse tree, though technically stale, is not invalid, and will defer re-parsing until an edit happens within its extent of the buffer. (This is true even if those edits cause the `style` block to shift its position in the document!) As a result, lots of buffer changes can short-circuit the exhaustive process I just described.

Syntax highlighting in particular is designed for performance, even in injection scenarios. After a buffer's initial highlighting pass, a given section of code will retain its highlighting indefinitely, even if its position in the buffer shifts as the result of other edits. Syntax re-highlighting only happens when a buffer range is specifically invalidated. When an edit happens, Tree-sitter tells us how that edit affects the syntax tree, which in turn tells us which parts of the buffer need to be re-highlighted and, just as importantly, which parts don't.

Tree-sitter is faster than you think it is. The smaller the edit, the more the parser can reuse its old work, and the faster the re-parse happens.

Hardly anything in this process happens synchronously, so buffer operations will feel fast even in the rare case where Tree-sitter needs time to catch up.
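That first reason, deciding whether an edit can possibly affect a layer, is essentially a range check. Here's a toy version; Pulsar's real bookkeeping is more involved, but the idea is the same:

```javascript
// Toy model of the short-circuit: an edit only invalidates a layer's tree if
// it overlaps the layer's extent. An edit entirely before the extent just
// shifts the extent by the edit's net size change; an edit after it is a no-op.
function applyEdit(layer, edit) {
  const { extent } = layer;
  if (edit.start < extent.end && edit.end > extent.start) {
    return { ...layer, needsReparse: true }; // overlap: stale AND invalid
  }
  if (edit.end <= extent.start) {
    const delta = edit.insertedLength - (edit.end - edit.start);
    return {
      ...layer,
      extent: { start: extent.start + delta, end: extent.end + delta },
      needsReparse: false, // shifted, but still valid
    };
  }
  return { ...layer, needsReparse: false }; // edit is after the extent
}
```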
Challenges
The systems we described in the last two installments (syntax highlighting, code folding, and indentation hinting) are much easier to explain when we don't have to think about injections. How do we make them work in a multi-language buffer?
Annoyingly, the answer is different for each system. For instance:
- To support code folding in an environment with multiple injected languages, we'd want to ask each layer for its code folds, then combine the results.
- If the user presses Return and we want to know whether to indent the next line, we should ask one specific layer: the one most qualified to answer that question at the given cursor position.
So sometimes we need to aggregate across layers, but other times we need to pick a winner.
Picking a winner is the obvious approach for indentation when you think it through. If I hit Return when the cursor is in a `script` block, then I'm writing JavaScript, and the JavaScript layer should be the one making indentation decisions. More generally, this means that if more than one layer is active at a given buffer position, we should pick the deepest layer and ask it to decide. (Sometimes this means the deepest layer that fulfills a certain criterion; in this case, the deepest layer that actually defines an indentation query.)
But aggregating is the obvious approach for other scenarios. Tree-sitter grammars get to support Pulsar's Editor: Select Smaller Syntax Node and Editor: Select Larger Syntax Node commands (you don't know you need them in your life until you give them a try!) and those commands should work properly across injection boundaries. So when either command is invoked with the cursor at a given position, we should figure out which nodes contain that point regardless of which parse tree owns them. Then we can arrange those nodes from smallest to largest.
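The aggregation can be sketched the same way: gather the containing nodes from every tree, then sort by size. This is a toy model, with plain objects standing in for Tree-sitter nodes:

```javascript
// Toy model of aggregation across layers: collect every node, from any parse
// tree, that contains the cursor position, then order them smallest-to-largest
// so the selection commands can walk outward across injection boundaries.
function nodesContaining(position, trees) {
  const hits = [];
  for (const tree of trees) {
    for (const node of tree.nodes) {
      if (node.start <= position && position < node.end) hits.push(node);
    }
  }
  return hits.sort((a, b) => (a.end - a.start) - (b.end - b.start));
}

const htmlTree = {
  nodes: [
    { type: "document", start: 0, end: 100 },
    { type: "script_element", start: 10, end: 50 },
  ],
};
const jsTree = {
  nodes: [
    { type: "program", start: 12, end: 48 },
    { type: "call_expression", start: 20, end: 30 },
  ],
};
console.log(nodesContaining(25, [htmlTree, jsTree]).map((node) => node.type));
// → ["call_expression", "program", "script_element", "document"]
```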
You can see the results here. As I expand the selection by invoking Select Larger Syntax Node over and over, the selection starts with nodes in the CSS injection, jumps to nodes in the parent JavaScript injection, then jumps again to nodes in the root HTML injection.
Strange injection tricks
Mixing languages in a single buffer is messy, so injections need some unusual features in order to deal with that messiness. These features can be used in surprising and powerful ways.
"Redaction" in injections
One thing that makes Tree-sitter injections more powerful than their TextMate equivalents is their ability to ignore content that isn't relevant to their jobs. The injection engine "redacts" all content except what it wants a given layer to see.
Redacting children
Suppose you had an `html` tagged template literal:
```javascript
let markup = html` <p>Copyright 2020–2023 John Doe</p> `;
```
Since the tag hints at the language name, Pulsar will give you HTML syntax highlighting for free inside the literal. But that literal is still JavaScript, so what happens if we do this?
```javascript
let now = new Date();
let markup = html` <p>Copyright 2020–${now.getFullYear()} John Doe</p> `;
```
That `${now.getFullYear()}` part isn't actually HTML. This example won't confuse an HTML parser, so it's not a big deal; but there does exist valid content inside of a template interpolation that definitely would flummox the injection:
```javascript
let evil = html` <p>this might not ${"</p>"} get parsed correctly</p> `;
```
Ideally, the HTML injection wouldn't see that interpolation at all. So what if we could hide it?
We can. In fact, we do! Here's what that template string looks like in `tree-sitter-tools`:
Our injection is defined such that we specify the `template_string` node as the content. That means Pulsar will use the buffer range of that node, but will subtract the ranges of any of the node's children!
We can visualize this with the "Show injection ranges" option in `tree-sitter-tools`:
You can see that the HTML injection layer has two disjoint content ranges on either side of the interpolation. The Tree-sitter HTML parser won't even know the interpolation is there.
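The range arithmetic behind those two disjoint ranges is simple to sketch: take the content node's range and subtract each child's range. (The offsets below are made up for illustration.)

```javascript
// Toy version of injection "redaction": the ranges a parser actually sees are
// the parent node's range minus the ranges of its children (the interpolations).
function contentRanges(parent, children) {
  const ranges = [];
  let start = parent.start;
  for (const child of children) {
    if (child.start > start) ranges.push({ start, end: child.start });
    start = child.end;
  }
  if (start < parent.end) ranges.push({ start, end: parent.end });
  return ranges;
}

// A template string at offsets 0–60 with one interpolation at 25–45 leaves the
// HTML parser two disjoint ranges on either side of it:
console.log(contentRanges({ start: 0, end: 60 }, [{ start: 25, end: 45 }]));
// → [ { start: 0, end: 25 }, { start: 45, end: 60 } ]
```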
This behavior makes sense as a default, but it can be opted out of with `includeChildren: true` in `addInjectionPoint` if it gets in your way.
Redacting via the `content` callback
A grammar author has another tool to control what gets redacted: the `content` callback. It's not limited to returning a single node! It can return an array of nodes, each with its own range; there's no obligation for those ranges to be adjacent.
Our first HTML injection example earlier applied its own subtle redaction. We specified a `type` of `script_element`, but a `content` callback that returns that element's second child (the `raw_text` node). So the `type` property tells Pulsar which node to query for (create an injection for each `script` element), but `content` selects the node(s) that will be meaningful to the parser (omit the `<script>` and `</script>` because those aren't JavaScript).
This flexibility means that it's possible for all your injections of a certain type to share one language layer. Instead of creating one layer for each of N different buffer ranges, you could create one layer with N disjoint content ranges. The trade-off is that an injection that covers more of the buffer will need to be re-parsed more often in response to buffer changes, but that trade-off might make sense in certain scenarios.
Macros in Rust/C/C++
C, C++, and Rust allow you to define macros via a preprocessor. Macros are weird for a parser: they can't be parsed as though they're valid source code, because they might be fragments of code that aren't syntactically valid until after preprocessing.
Hence they're a situation where a language might want to inject itself into itself. Consider this code example:
```c
#define circleArea(r) (3.1415*(r)*(r))
```
The `#define` keyword and the `circleArea(r)` preprocessor function signature have to be well-formed, but everything that follows is an anything-goes nightmare for a syntax highlighter. The preprocessor won't try to parse it or make it make sense; it'll just make the appropriate substitution throughout the source file and enforce validity later.
For the same reason, the `tree-sitter-c` parser doesn't attempt to do any parsing of the preprocessor argument, the `(3.1415*(r)*(r))` in our above example. But that argument will often be valid C, so there's no reason why we shouldn't take a stab at it:
```javascript
atom.grammars.addInjectionPoint(`source.c`, {
  type: "preproc_function_def",
  language(node) {
    return "c";
  },
  content(node) {
    return node.lastNamedChild; // the `preproc_arg` node
  },
});
```
This is a low-stakes gambit for us. If the content of the macro is syntactically strange, the parser might get a bit flummoxed, and the resulting highlighting might look a bit weird. But that's OK! It won't affect the highlighting of anything outside of the macro content.
Injecting highlighting for URLs and TODOs
Two built-in packages called `language-todo` and `language-hyperlink` define specialized TextMate grammars. Their purpose is to provide rules that match `TODO:` remarks (in comments) and URLs (in comments and strings), and to inject those rules into certain contexts regardless of grammar. This is a lovely feature of TextMate that the Atom developers got for free when implementing TextMate-style grammars back in the day.
The effect is that Pulsar can help you locate TODOs in comments by coloring them differently from the rest of the comment. It can also draw underlines under URLs and even follow a URL when you place your cursor inside of it and invoke the Link: Open command.
This works because a TextMate grammar can "push" its injections into any scope inside any other grammar, whether that other grammar asks for it or not. For instance, the `language-hyperlink` grammar injects itself into strings, so any language that defines a `string.*` scope will have those rules injected into it.
The legacy Tree-sitter system never had an equivalent feature. I missed it terribly, so I decided to create equivalent Tree-sitter parsers and grammars for these rules. These parsers, when given arbitrary text, can create nodes for things that look like URLs or TODO comments. Once those parsers existed, I could inject them into whichever grammars I wanted:
```javascript
for (let type of ["template_string", "string_fragment", "comment"]) {
  atom.grammars.addInjectionPoint("source.js", {
    type,
    language: () => {
      return "hyperlink";
    },
    content: (node) => node,
    languageScope: null,
  });
}
```
There's one new thing here: the `languageScope` option. Typically, you'll want a grammar's base scope name to be present inside of an injection; for instance, you'd want a `source.js` scope name to exist inside of an HTML `script` block. But that behavior doesn't make sense for our use case. We want to add a scope name around a URL when it's present, but otherwise we want to operate stealthily. Passing `null` as the `languageScope` option lets us bypass the default behavior.
There's one other thing to address, though. Most comments won't have URLs in them. Most strings won't have URLs in them. If I use this code as-is, I'll be creating one new injection for every string, every line comment, and every block comment in my JavaScript file, whether a URL is present or not. (This unnecessary work, believe it or not, doesn't create any sluggishness during routine editing, but we should still try to avoid it.)
What should I do? One option would be to do what I described above: create one large injection for the entire document and have it be in charge of all comments and strings in the document. That was my first experiment, but I decided against it because the trade-off wasn't worth it: incremental re-parses were slower because every buffer change meant that my URL parser had to re-scan the whole buffer.
I'm willing to chalk part of that up to my lack of experience writing Tree-sitter parsers! I'd bet there are things I can do to make those parses less costly. But in the meantime, I applied a Stupid Human Trick™ to get the best of both worlds:
```javascript
const HYPERLINK_PATTERN = /\bhttps?:/;

for (let type of ["template_string", "string_fragment", "comment"]) {
  atom.grammars.addInjectionPoint("source.js", {
    type,
    language: (node) => {
      return HYPERLINK_PATTERN.test(node.text) ? "hyperlink" : undefined;
    },
    content: (node) => node,
    languageScope: null,
  });
}
```
I can assure you that this feels incredibly silly to do, but it works: we're pre-screening the content of the node and ignoring those that definitely don't contain a URL. Returning `undefined` from the `language` callback prevents a layer from being needlessly created. We employ a similar strategy for the `TODO` highlighting.
There's another thing that feels awkward about this: it's not as automatic as the previous TextMate solution. Instead of being able to "push" these injections into other grammars, we're asking those grammars to "pull" our injections into themselves.
In an ideal world, I'd be able to create a generalized injection that applied to all files as easily as in the TextMate grammars. But to create a Tree-sitter injection I've got to describe the name of the node I want to inject into. And there aren't many safe assumptions you can make about Tree-sitter node naming conventions.
The saving grace here is what I mentioned above: the injection API lets you inject things into someone else's language grammar. So if your favorite community language package doesn't highlight TODOs and URLs, you can fix that with about six lines of JavaScript in your `init.js`.
Markdown and front matter
Markdown is how I write most of my prose, including this blog post. And for years it's been quite popular inside static site generators like Jekyll and its successors, but with a wrinkle: those tools support the addition of YAML metadata via a "front matter" block at the start of a Markdown file.
There are two major Markdown parsers for Tree-sitter, both of which are written by third parties rather than by the Tree-sitter organization. One of them is being actively developed, and boasts built-in support for front matter, but has a number of bugs that are show-stoppers for Pulsar at the moment. The other one is older, doesn't support front matter, and doesn't seem to be actively maintained… but is otherwise bulletproof. I'd love to use the newer one for Pulsar, but I can't justify it until it's more stable.
So how do we get around the older parser's lack of support for front matter? By writing our own Tree-sitter parser and using injections:
1. Write a front matter parser whose only purpose is to divide a Markdown document into two nodes: (a) front matter and (b) Markdown text.
2. Inject the YAML grammar into the front matter node.
3. Inject the Markdown grammar into the Markdown text node.
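Those two injections might look something like this. The scope name and the second node type are my guesses, not the package's real identifiers, and the `atom.grammars` stub below just records calls so the sketch is self-contained and runnable.

```javascript
// Sketch of the two injections as addInjectionPoint calls. A stub stands in
// for Pulsar's grammar registry; scope and node names are illustrative.
const registered = [];
const atom = {
  grammars: {
    addInjectionPoint(scope, options) {
      registered.push({ scope, ...options });
    },
  },
};

atom.grammars.addInjectionPoint("text.md.frontmatter", {
  type: "front_matter",
  language: () => "yaml",
  content: (node) => node,
});

atom.grammars.addInjectionPoint("text.md.frontmatter", {
  type: "markdown_content", // hypothetical node name for the rest of the file
  language: () => "markdown",
  content: (node) => node,
});

console.log(registered.map((r) => r.type)); // → ["front_matter", "markdown_content"]
```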
In an ideal world, the parser in step 1 would be just an ordinary Tree-sitter parser for Markdown, and we'd need only the single injection for the YAML block. But this'll tide us over just fine. Documents that don't have front matter still get parsed by `tree-sitter-frontmatter` and will simply omit the `front_matter` node.
Next time
I could keep talking about injections, but I can't afford to test your patience while we still have other topics to visit. Next time we'll look at what Tree-sitter calls code navigation systems: how to use Tree-sitter to identify functions, classes, and other important parts of your code.