Modern Tree-sitter, part 3: syntax highlighting via queries
Last time I laid out the case for why we chose to embrace TextMate-style scope names, even in newer Tree-sitter grammars. I set a difficult challenge for Pulsar: make it so that a Tree-sitter grammar can do anything a TextMate grammar can do.
Today I'd like to show you the specific problems that we had to solve in order to pull that off.
If you'd like to follow along with these code examples in your own Pulsar installation, I'd suggest installing the tree-sitter-tools package. It includes a language grammar for the Tree-sitter query files we'll be spending most of this article writing, and it lets us visualize the syntax trees that Tree-sitter produces.
If you just want to get a sense of what a Tree-sitter tree looks like, you can use the Playground on the Tree-sitter web site.
The problems
The legacy Tree-sitter integration into Atom used its own system for mapping tree nodes to scope names, but it had two major limitations that prevented it from matching the scope names produced by a TextMate grammar:
- It couldn't query the right nodes: it used a CSS-like syntax that was limited in how expressively it could describe tree nodes.
- It couldn't describe the right ranges: it could only add scopes to ranges that corresponded to individual tree nodes.
The need to write an Atom-specific bridge between node names and scope names served as evidence that Tree-sitter needed its own system for more easily working with syntax trees, one that would prevent every Tree-sitter consumer from having to reinvent the wheel.
So Tree-sitter added a powerful query system, with a Lisp-like syntax most directly influenced by Scheme. This system wasn't around for the legacy implementation to use, but it's here for us now, and it's going to make our job much easier.
Tree-sitter query syntax does the same thing for a Tree-sitter syntax tree that CSS selector syntax does for HTML: it gives us a terse way to describe a set of nodes in a tree. And just as people eventually realized that CSS selector syntax was useful for more than just styling (see document.querySelector), you'll soon see that Tree-sitter queries are useful for more than just making strings green.
Right now, though, let's just focus on syntax highlighting. And remember those two limitations that I described above, because we'll have to solve them both before this article is done.
The first challenge: robust querying
I'll remind you of our example from part 2: the scope names applied to a double-quoted string in JavaScript.
Our first goal is to make it so that our JavaScript Tree-sitter grammar can apply these same scopes to the same buffer ranges. But Tree-sitter works very differently to a TextMate grammar, so it's not immediately obvious how we can pull this off. Let's reason through it.
How captures work
Tree-sitter itself has demonstrated how query files can be used to highlight code. If a parser has a highlights.scm file defined in its repository, the CLI will allow you to run tree-sitter highlight on arbitrary input. It'll parse the input, figure out which parser should do the job, use that parser's highlights.scm to map certain nodes to query capture names, and then emit highlighted output in your terminal.
Last time I showed you this excerpt from tree-sitter-javascript's highlights.scm file:
[
  (string)
  (template_string)
] @string
Once you're familiar with query syntax, the outcome of this query is clear: all JavaScript strings will be captured and given a name of @string. Somewhere within the tree-sitter highlight code path, those capture names are mapped to various colors, and are applied to the captured buffer ranges. Anything that gets captured as @string will have one color in the output; anything that gets captured as @keyword will have a different color; and so on.
Let's imagine that Pulsar has a similar system. In order to keep this article from putting you to sleep, I won't get into the details of exactly how we do it, but the machinery is in place. Instead of a capture name like @string, we'll be choosing more verbose names like @string.quoted.double.js, but the principles are the same.
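To make that principle concrete, here is a minimal sketch of how a highlighter might turn capture names into styles. The THEME table and the colorFor helper are hypothetical names invented for illustration; neither tree-sitter highlight nor Pulsar is claimed to work exactly this way. The idea is that a verbose name like @string.quoted.double.js can fall back to a shorter prefix when no exact entry exists.

```javascript
// Hypothetical theme table: capture-name prefixes mapped to colors.
const THEME = {
  string: "green",
  "string.quoted.double": "lime",
  keyword: "magenta",
};

// Find the most specific theme entry for a capture name by dropping
// trailing segments until something matches.
function colorFor(captureName, theme = THEME) {
  const parts = captureName.replace(/^@/, "").split(".");
  while (parts.length > 0) {
    const key = parts.join(".");
    if (Object.prototype.hasOwnProperty.call(theme, key)) return theme[key];
    parts.pop();
  }
  return null; // no style for this capture
}

console.log(colorFor("@string.quoted.double.js")); // "lime"
console.log(colorFor("@string.quoted.single.js")); // "green"
```

The fallback loop is one possible design; a real theme system could just as well match names with CSS-style selectors.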
We also won't be talking about how Pulsar knows which areas of the buffer to re-highlight as the user makes changes, nor how Pulsar combines the results of multiple parsers (for example, JavaScript embedded in HTML). These topics may be visited in future articles, but today we're just talking about how to use Tree-sitter queries to identify arbitrary ranges and give them the scope names that we want.
Highlights
What does a string look like in a Tree-sitter tree? Let's create a new document with nothing but the following contents:
"like this";
Using the tree-sitter-tools package, I can open an inspector pane and look at the raw tree for this string:
So we can see that a string node consists of three parts: a delimiter, a string_content node, and another delimiter. This structure maps elegantly to the things that we want to scope.
Most parsers build abstract syntax trees. In the process of making the tree, they discard lots of information that isn't important to the parser's goal. But Tree-sitter builds concrete syntax trees! Every character in our JavaScript file will end up being represented by at least one node, and every node remembers the exact buffer range it covers. If it didn't, we wouldn't be able to use it for syntax highlighting.
In Tree-sitter parsers, nodes that matter to semantics (like string_content) tend to have names, whereas other nodes (like the delimiters) are "anonymous" nodes. Anonymous nodes can still be queried against like named nodes, so it feels like it'll be pretty easy to apply the scopes that we want.
So let's open our grammar's highlights.scm file and give it a try. (In Pulsar's "dev mode", which you can trigger with the --dev flag on the command line, grammar authors can even make changes to their query files and see them take effect instantly when they save!)
(string) @string.quoted.double.js
That's not bad, but the scope name is too specific. In tree-sitter-javascript, the string node applies for both single-quoted and double-quoted strings, with the difference being reflected only in the anonymous nodes. (Template strings, as we saw above, have their own node type.)
How do we distinguish single-quoted strings from double-quoted strings? Here's one thing we could do:
((string) @string.quoted.single.js
  (#match? @string.quoted.single.js "^'"))
((string) @string.quoted.double.js
  (#match? @string.quoted.double.js "^\""))
The built-in #match? predicate allows us to reject possible matches when their contents don't match a given regular expression. Here we're telling the query engine to distinguish between string nodes whose text starts with ' and those whose text starts with ".
We'll be using the #match? predicate a lot. Unlike some of the other predicates we'll see shortly, it's implemented by the web-tree-sitter bindings, so Tree-sitter on its own is able to reject would-be captures that don't pass it. By the time we see the list of captures, those that have failed a #match? test have already been filtered out.
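As a rough mental model of what a #match? filter does, the sketch below applies a regular expression to the text of each candidate capture and drops the ones that fail. The capture objects and the applyMatch helper are simplified stand-ins made up for this example, not the web-tree-sitter API.

```javascript
// Simplified stand-ins for captures: a capture name plus the node's text.
const candidates = [
  { name: "string.quoted.single.js", text: "'like this'" },
  { name: "string.quoted.double.js", text: "'like this'" },
  { name: "string.quoted.single.js", text: '"like this"' },
  { name: "string.quoted.double.js", text: '"like this"' },
];

// Keep a capture unless it is the one named by the predicate and its
// text fails the regular expression.
function applyMatch(captures, captureName, pattern) {
  const regex = new RegExp(pattern);
  return captures.filter(
    (capture) => capture.name !== captureName || regex.test(capture.text)
  );
}

let survivors = applyMatch(candidates, "string.quoted.single.js", "^'");
survivors = applyMatch(survivors, "string.quoted.double.js", '^"');
console.log(survivors.map((c) => `${c.name}: ${c.text}`));
```

Each string ends up with exactly one scope name: the one whose pattern agrees with its opening delimiter.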
In this case, though, there's an even easier way to tell these strings apart:
(string "'") @string.quoted.single.js
(string "\"") @string.quoted.double.js
As I mentioned, we can query for the presence of anonymous nodes. So the first line will match any string that contains at least one anonymous node child named ', and the second line will match any string that contains at least one anonymous node child named ".
Since the capture name is on the outside of the closing parenthesis, the capture name applies to the whole string.
We don't have to be more specific; if that anonymous node exists at all, then it's used as a delimiter on both sides of the string. And this query won't match any potential false positives (like a double-quoted string that happens to have a ' somewhere inside of it) because the parser is too smart to get tripped up by that sort of thing.
This is one reason why it's very easy to make a new Tree-sitter grammar if someone has done the work of writing a Tree-sitter parser for the given language. If we were writing a TextMate grammar, we'd have to care about a lot more of these edge cases, but a Tree-sitter parser will have handled them for us already.
Scope tests
We can already tell that the expressiveness of Tree-sitter's query system will go a long way toward solving the first of the two problems we described above. Last time around, Atom developers had to invent a system for querying the tree, but we get a much more powerful system for free.
Tools like anonymous nodes and #match? predicates can get us quite far on their own, but they can't solve all of our problems. We still have to scope the quotation marks themselves, and we may think we know how to do it:
(string "'" @punctuation.definition.string.begin.js)
By putting the capture name immediately after the "'", we can target that anonymous node and give it a name. But remember that there are two delimiters! We want to give one scope name to the beginning delimiter and a different scope name to the ending delimiter. As we've written it, this rule would match both delimiters.
Surprisingly, we can't solve this problem without some external help. Tree-sitter queries have a concept of "anchors" that can enforce positioning of children (for instance, targeting only the first node child of a parent), but they can be used only for named nodes, not anonymous nodes. We need a way to introduce our own filtering criteria into Tree-sitter queries.
Luckily, Tree-sitter gives us the tools to write our own predicates. Instead of trying to make it aware of our application-specific concerns, we can use the generic predicates #is? and #is-not? to mark certain query captures with data, then use that data to filter the results however we like.
The downside is that Tree-sitter can't approve or reject captures with these predicates on its own like it can with #match?. Instead, we'll have to "process" these captures after the fact and filter them manually. But it's worth the effort because it lets us use whatever logic we want, even Pulsar-specific logic that means nothing to Tree-sitter.
Let's illustrate.
(string
  "'" @punctuation.definition.string.begin.js
  (#is? test.first "true"))
The two parameters after #is? are arbitrary values of my own invention. Tree-sitter simply treats them as a key and value and applies some metadata to this capture. Any nodes that get captured by this query will contain some data under assertedProperties:
capture.assertedProperties["test.first"]; //-> "true"
In fact, I can omit that second argument; for boolean tests like this one, the presence of the property is all we need:
(string
  "'" @punctuation.definition.string.begin.js
  (#is? test.first))
(string
  "'" @punctuation.definition.string.end.js
  (#is? test.last))
Using (#is? test.first) is like having Tree-sitter attach a Post-It note to a capture object with the text "remember to assert test.first later" written on it. Tree-sitter doesn't know or care what that means, but it assumes we will.
And in this case, test.first corresponds to a function we've written that looks like this:
function first(node) {
  if (!node.parent) return true;
  return node?.parent?.firstChild?.id === node.id;
}
This function first ensures that root nodes (which have no parent) will always pass the test. Then it compares our node to its parent's first child. If they're equal, the test passes. If they're not, then the captured node isn't the first child of its parent, and we can ignore it.
The logic for our last function is identical, except that it compares our node to node.parent.lastChild.
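Following that description, a last test might look like the sketch below. The parent, lastChild, and id properties mirror the ones used by first; the makeParent helper is a hypothetical stand-in for real tree nodes so that the example runs on its own.

```javascript
// Mirror image of `first`: passes when the node is its parent's last child.
function last(node) {
  if (!node.parent) return true; // root nodes always pass
  return node.parent.lastChild?.id === node.id;
}

// Hypothetical stand-in for a parsed string node: a parent with linked children.
function makeParent(childTypes) {
  const parent = { id: 0, type: "string", parent: null };
  const children = childTypes.map((type, index) => ({
    id: index + 1,
    type,
    parent,
  }));
  parent.firstChild = children[0] ?? null;
  parent.lastChild = children[children.length - 1] ?? null;
  parent.children = children;
  return parent;
}

const stringNode = makeParent(["'", "string_content", "'"]);
const [opening, , closing] = stringNode.children;
console.log(last(opening)); // false
console.log(last(closing)); // true
```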
Even better: the existence of #is-not? means that we get negation practically for free. Suppose I did this:
(string
  "'" @punctuation.definition.string.end.js
  (#is-not? test.first))
Then the metadata would exist in a different place...
"test.first" in capture.refutedProperties; //-> true
...and I'd know to ignore this capture unless my first function fails for this node.
So now we've got a way to filter capture names by any criteria we can think of. If we can test for it in JavaScript, it can be used as a predicate in Tree-sitter queries.
We call these custom predicates scope tests, and we've grouped them under a "test." namespace for reasons that may make more sense later. Scope tests are a crucial tool for solving the first of those two problems we described earlier: they let us query for tree nodes in arbitrary ways that the legacy system simply couldn't.
And because scope tests are just JavaScript, we're able to use some oddball criteria for accepting or rejecting captures. Consider these examples:
((program) @source.js.embedded
  (#is? test.injection))
(variable_declarator
  name: (identifier) @constant.other.foo
  (#match? @constant.other.foo "^[_A-Z]+$")
  (#is? test.config "language-foo.highlightAllCapsIdentifiersAsConstants"))
Both of these are scope tests that grammar query files can use. The first one applies a scope only if we're in an injection layer: for instance, if this is JavaScript inside of a SCRIPT tag in an HTML file. The second one applies a scope only if the user has enabled a certain configuration option. Neither one has anything to do with Tree-sitter itself, but we can use them in Tree-sitter query files all the same.
The logic for applying tests and winnowing a list of raw captures exists in Pulsar in a class called ScopeResolver. That first function defined above is present at ScopeResolver.TESTS.first, and there's logic in ScopeResolver that matches up a test.first property to that function.
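The matching logic can be pictured roughly as follows. This is a sketch, not Pulsar's actual ScopeResolver code: the TESTS map, the passesScopeTests function, and the shape of the capture objects are assumptions chosen to line up with the assertedProperties and refutedProperties buckets described earlier.

```javascript
// Assumed table of scope tests, keyed by the name that follows "test.".
const TESTS = new Map([
  ["first", (node) => !node.parent || node.parent.firstChild?.id === node.id],
  ["last", (node) => !node.parent || node.parent.lastChild?.id === node.id],
]);

// Return the known test functions named in a property bucket.
// Keys outside the "test." namespace and unknown tests are ignored here.
function testsIn(bucket = {}, tests = TESTS) {
  return Object.keys(bucket)
    .filter((key) => key.startsWith("test."))
    .map((key) => tests.get(key.slice("test.".length)))
    .filter((fn) => typeof fn === "function");
}

// A capture survives if every asserted test passes and every refuted test fails.
function passesScopeTests(capture, tests = TESTS) {
  const { node, assertedProperties, refutedProperties } = capture;
  return (
    testsIn(assertedProperties, tests).every((fn) => fn(node)) &&
    testsIn(refutedProperties, tests).every((fn) => !fn(node))
  );
}

// Two delimiter nodes that share a parent, as in a quoted string.
const quotedString = { id: 0, parent: null };
const openQuote = { id: 1, parent: quotedString };
const closeQuote = { id: 2, parent: quotedString };
quotedString.firstChild = openQuote;
quotedString.lastChild = closeQuote;

const rawCaptures = [
  { name: "punctuation.definition.string.begin.js", node: openQuote, assertedProperties: { "test.first": null } },
  { name: "punctuation.definition.string.begin.js", node: closeQuote, assertedProperties: { "test.first": null } },
  { name: "punctuation.definition.string.end.js", node: closeQuote, refutedProperties: { "test.first": null } },
];
const kept = rawCaptures.filter((capture) => passesScopeTests(capture));
console.log(kept.map((capture) => `${capture.name} on node ${capture.node.id}`));
```

Only the begin scope on the opening quote and the end scope on the closing quote survive the filter.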
There are other scope tests that we've found to be quite useful, tests which any grammar author can use in their own query files:
- Whether a node has a certain kind of node as an ancestor
- Whether a node has a certain kind of node as a descendant
- Whether a node is the first/last non-whitespace content on a row
- Whether a node has arbitrary metadata that has been attached with Tree-sitter's #set! predicate
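As one illustration, an ancestor test could be written along these lines. The function name hasAncestorOfType and the way the node type is passed in are hypothetical; the sketch only assumes that nodes expose type and parent, as Tree-sitter nodes do.

```javascript
// Walk up the parent chain looking for a node of the given type.
function hasAncestorOfType(node, type) {
  let current = node.parent;
  while (current) {
    if (current.type === type) return true;
    current = current.parent;
  }
  return false;
}

// Hand-built stand-ins for a method name nested inside a class body.
const programNode = { type: "program", parent: null };
const classBodyNode = { type: "class_body", parent: programNode };
const methodNode = { type: "method_definition", parent: classBodyNode };
const methodNameNode = { type: "property_identifier", parent: methodNode };

console.log(hasAncestorOfType(methodNameNode, "class_body")); // true
console.log(hasAncestorOfType(methodNameNode, "object")); // false
```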
But for our current goal (applying punctuation scopes to string delimiters), implementing test.first and test.last is enough to get us the outcome we want.
You may remember from earlier that the legacy Tree-sitter integration used a CSS-like syntax to describe nodes. It supported combinators like > and even pseudoclasses like :nth-child, but not much else. Tree-sitter's own query system can do much more than that, and it's extensible, so we can add our own logic wherever we need it.
So we've done it! We now have the ability to scope a JavaScript string identically between our two different grammar systems. And we've moved past the legacy system's first drawback: a brittle query system. Our first challenge has been vanquished.
The second challenge: scoping arbitrary ranges
To solve the second drawback (inability to scope the correct ranges), let's create another challenge.
TextMate grammars will scope comments differently based on whether they're line comments or block comments. There's also a convention to annotate comment.line scopes with their delimiter type:
// this is a comment
A typical TextMate grammar for JavaScript would scope this comment as comment.line.double-slash.js, and would further scope the // as punctuation.definition.comment.js.
Can we do that in a Tree-sitter grammar with the tools we've already got? Let's inspect what our example comment looks like in a Tree-sitter tree:
Hmm. No anonymous nodes or anything. Just one node called comment.
It's a strong Tree-sitter convention that various sorts of code comments are all represented by nodes called comment. So this query would do the right thing in our example...
(comment) @comment.line.double-slash.js
...but would be incorrect in other scenarios because it would match block comments as well as line comments.
But we can use #match? to distinguish between the two kinds of comments:
((comment) @comment.line.double-slash.js
  (#match? @comment.line.double-slash.js "^//"))
((comment) @comment.block.js
  (#match? @comment.block.js "^/\\*"))
In the latter case, we don't have to test for the presence of */ at the end. If the contents of the comment begin with /*, that's all the information we need; we know it must end with */, or else the parser wouldn't have classified it as a comment.
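The same two checks can be tried out in plain JavaScript. The commentScope helper below is a hypothetical illustration of the regular expressions involved, not part of any grammar. Note that the asterisk has to be escaped: an unescaped * would mean "zero or more slashes" and would match every comment.

```javascript
// Classify a comment by its opening delimiter only.
function commentScope(text) {
  if (/^\/\//.test(text)) return "comment.line.double-slash.js";
  if (/^\/\*/.test(text)) return "comment.block.js";
  return null;
}

console.log(commentScope("// this is a comment")); // "comment.line.double-slash.js"
console.log(commentScope("/* so is this */")); // "comment.block.js"
```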
But what about our punctuation.definition.comment.js scope? Sadly, tree-sitter-javascript (like most other parsers) doesn't make it easy to target the comment delimiters themselves. Comment delimiters, unlike string delimiters, usually aren't available as anonymous nodes.
Hence the legacy Tree-sitter system has never been able to annotate that // with the punctuation scope it needed. We'll need to solve this ourselves.
Scope adjustments
In this case, it'd be more convenient if there were a node for the // we want to scope. But we control the internals of the editor, and we can tell it to apply scope names to whatever buffer ranges we want. When the tree doesn't do all of our work for us, it just means we have to try a bit harder.
In this case, the comment node tells us where the comment starts in the buffer. And once it passes the #match? predicate, we know that it starts with //, so the delimiter must end two characters later. Not the hardest problem to solve!
The trickiest part will be figuring out how to describe these buffer positions in a query.
Adjusting by pattern
What if we did this?
((comment) @comment.line.double-slash.js
  (#match? @comment.line.double-slash.js "^//"))
((comment) @punctuation.definition.comment.js
  (#match? @punctuation.definition.comment.js "^//")
  (#set! adjust.endAfterFirstMatchOf "^//"))
Here we're capturing the same thing twice under different names. In the second case, we're using #set! (a predicate very similar to #is? and #is-not?) to attach a qualifier: instead of stopping at the end of the line comment, stop right after the // near the beginning.
Just like #is? predicates are represented on captures under assertedProperties and #is-not? predicates are represented under refutedProperties, #set! predicates are represented in their own bucket simply called properties.
Note the semantic difference between a predicate that ends in ! and one that ends in ?. In this case we're not setting up a test for the capture to pass or fail; we're attaching a side effect to the capture. Imagine a Post-It note attached to the capture that says "remember to adjust the range for this capture to end at X."
I mentioned earlier that all nodes remember their corresponding buffer range: starting at row W and column X, ending at row Y and column Z. By default, this is the range against which a scope is applied. But in this case, the adjust.endAfterFirstMatchOf predicate reminds us to execute a regular expression match on the node's contents and move the ending position to the end of that match, instead of the node's natural ending point.
So in the absence of any node boundary at the position we want, we've found a rather simple alternative way to express that position. We still have to write our own code to make it happen, but that won't be a problem. Since we already have to loop through our captures to apply scope tests, we might as well use that opportunity to tweak the range of a capture if we need to.
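Here is a minimal sketch of what such an adjustment could compute. The capture shape (text, startPosition, endPosition) and the endAfterFirstMatchOf function are assumptions for illustration rather than Pulsar's implementation, and the sketch assumes the match falls on the first line of the node so that only the column needs to move.

```javascript
// Shrink a capture's range so that it ends right after the first regex match.
// Assumes the match sits on the node's first line (true for "^//").
function endAfterFirstMatchOf(capture, pattern) {
  const match = new RegExp(pattern).exec(capture.text);
  if (!match) return null; // nothing to adjust; the caller can drop the capture
  const { startPosition } = capture;
  return {
    start: startPosition,
    end: {
      row: startPosition.row,
      column: startPosition.column + match.index + match[0].length,
    },
  };
}

// A line comment that begins at row 0, column 4 of the buffer.
const lineComment = {
  text: "// this is a comment",
  startPosition: { row: 0, column: 4 },
  endPosition: { row: 0, column: 24 },
};
console.log(endAfterFirstMatchOf(lineComment, "^//"));
// { start: { row: 0, column: 4 }, end: { row: 0, column: 6 } }
```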
And how would we handle the delimiters of a block comment?
((comment) @comment.block.js
  (#match? @comment.block.js "^/\\*"))
((comment) @punctuation.definition.comment.begin.js
  (#match? @punctuation.definition.comment.begin.js "^/\\*")
  (#set! adjust.endAfterFirstMatchOf "^/\\*"))
((comment) @punctuation.definition.comment.end.js
  (#match? @punctuation.definition.comment.end.js "\\*/$")
  (#set! adjust.startBeforeFirstMatchOf "\\*/$"))
We can scope the opening delimiter the same way. To scope the ending delimiter, we move the head of the range to the position at the beginning of a regex match. (And for situations where you want to move both the head and the tail at once, there's adjust.startAndEndAroundFirstMatchOf.)
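Moving the head of a range is slightly more involved when the node spans several rows, because the match offset has to be translated back into a row and column. The sketch below shows one way to do that; positionAtOffset and startBeforeFirstMatchOf are hypothetical helpers, and columns are counted in JavaScript string units as a simplification.

```javascript
// Convert a character offset within a node's text into a buffer position.
function positionAtOffset(start, text, offset) {
  const lines = text.slice(0, offset).split("\n");
  const rowDelta = lines.length - 1;
  const lastLineLength = lines[lines.length - 1].length;
  return {
    row: start.row + rowDelta,
    // Only the first row is offset by the node's starting column.
    column: rowDelta === 0 ? start.column + lastLineLength : lastLineLength,
  };
}

// Shrink a capture's range so that it starts where the first regex match begins.
function startBeforeFirstMatchOf(capture, pattern) {
  const match = new RegExp(pattern).exec(capture.text);
  if (!match) return null;
  return {
    start: positionAtOffset(capture.startPosition, capture.text, match.index),
    end: capture.endPosition,
  };
}

// A two-line block comment that begins at row 2, column 4.
const blockComment = {
  text: "/* one\n   two */",
  startPosition: { row: 2, column: 4 },
  endPosition: { row: 3, column: 9 },
};
console.log(startBeforeFirstMatchOf(blockComment, "\\*/$"));
// { start: { row: 3, column: 7 }, end: { row: 3, column: 9 } }
```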
Adjusting by node position descriptor
Let's look at another example.
function SomeComponent(props) {
  return <SomeOtherComponent {...props} />;
}
Let's say we want to scope the /> at the end of the self-closing tag. Tree-sitter represents that as two separate anonymous nodes: / and >.
So we've got a different problem here: there's no single node that includes both boundaries of the range we want to scope. How can we make the scope span two adjacent nodes?
We could probably use a pattern-based solution here like we did above. But we could also leverage a useful feature of how nodes are represented in the tree.
You might be familiar with how DOM nodes in the browser are traversable via properties like parentNode, nextSibling, and so on. If you've got a reference to a particular DOM node, you can use those properties to jump from that node to any other node in the tree, as long as you know how the two nodes are related.
A similar system exists for Tree-sitter nodes, and it gives us a simple way to describe relationships between nodes:
((jsx_self_closing_element
  ; The "/>" in `<Foo />`, extended to cover both anonymous nodes at once.
  "/") @punctuation.definition.tag.end.js
  (#set! adjust.startAt lastChild.previousSibling.startPosition)
  (#set! adjust.endAt lastChild.endPosition))
Here we're capturing the entire self-closing element, then moving the ends of the range to the specific boundaries that we need: starting at the beginning of the second-to-last node child, and ending at the end of the last node child. The / and > will always be the last two children of the node we captured, so this is a simple and repeatable way to describe the boundaries we want.
This syntax for describing a position relative to a given node is something we call a node position descriptor. If you've ever used a function like Lodash's _.get, you might feel at home with this syntax: it resolves a chain of property lookups all at once, and fails gracefully if any of them aren't present. It'll come in handy later for other purposes.
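Resolving a descriptor can be sketched in a few lines. The resolvePosition function and the makeElement helper are hypothetical; the mock nodes only imitate the firstChild, lastChild, previousSibling, nextSibling, startPosition, and endPosition properties that the descriptors in the query refer to.

```javascript
// Follow a dotted chain of property lookups, returning null as soon as
// any link in the chain is missing.
function resolvePosition(node, descriptor) {
  let current = node;
  for (const key of descriptor.split(".")) {
    if (current == null) return null;
    current = current[key];
  }
  return current ?? null;
}

// Build a mock one-line element from [type, startColumn, endColumn] tokens.
function makeElement(tokens) {
  const element = { type: "jsx_self_closing_element" };
  const children = tokens.map(([type, startColumn, endColumn]) => ({
    type,
    parent: element,
    startPosition: { row: 0, column: startColumn },
    endPosition: { row: 0, column: endColumn },
  }));
  children.forEach((child, index) => {
    child.previousSibling = children[index - 1] ?? null;
    child.nextSibling = children[index + 1] ?? null;
  });
  element.firstChild = children[0] ?? null;
  element.lastChild = children[children.length - 1] ?? null;
  return element;
}

// The tokens of `<Foo />`.
const element = makeElement([
  ["<", 0, 1],
  ["identifier", 1, 4],
  ["/", 5, 6],
  [">", 6, 7],
]);
console.log(resolvePosition(element, "lastChild.previousSibling.startPosition")); // { row: 0, column: 5 }
console.log(resolvePosition(element, "lastChild.endPosition")); // { row: 0, column: 7 }
console.log(resolvePosition(element, "firstChild.previousSibling.startPosition")); // null
```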
So in order to pass our "JavaScript line comment" challenge, we've had to invent another feature called scope adjustments. With the infrastructure we've already got, scope adjustments are an easy thing to add: a scope adjustment is just a function that accepts a capture object and returns a range. We process them in the post-capture phase immediately before we apply scope tests.
Scope adjustments are our answer to the legacy system's second limitation. They allow us to embrace the syntax tree when it works in our favor, but still break free of it whenever we need to.
What does this get us?
Some of you, having gotten this far, might wonder to yourselves: if it takes this much effort just to get feature parity with a TextMate grammar, why not just stick with TextMate grammars?
One answer is simple: despite how complex this may seem, and how many implementation details I've hidden from you, the effort it's taken to integrate Tree-sitter grammars into Pulsar is much, much less than the effort it originally took to integrate TextMate grammars into Atom.
But the message here can't just be "we implemented a cool new thing that you won't notice at all!" The whole point is that this is better than (not just equivalent to) what we had before, and not just for the few of us who write grammars.
So let me give you an example of something that was very hard with a TextMate grammar, but is now quite easy.
When a scope name might be useful for semantic reasons, but not for syntax highlighting reasons, TextMate grammars add scope names that use the meta namespace. You might see some meta scope names when you run the "Log Cursor Scope" command, but most of them don't have an effect on how your code looks. Yet that information is visible to snippets, settings, and commands, and it can be very useful.
In the language-javascript package, Pulsar defines a fun snippet that expands as follows, with the string functionName highlighted:
function functionName() {}
That's great, but there's more than one syntax for defining functions. Imagine if you could use fun inside of a class body and have it expand to the correct syntax for defining an instance method:
class Foo {
  functionName() {}
}
Or inside of an object literal (much like the class body syntax, but with a trailing comma so as to prevent syntax errors):
const Foo = {
  functionName() {},
};
TextMate grammars weren't built with this sort of thing in mind. With much effort, you could probably pull it off, but only by making the grammar so complex that it might as well be a parser. But Tree-sitter is already a parser. It understands your code well enough to make these things easy.
With our rich syntax tree, we can now apply meta scope names rather liberally, marking sections of the buffer with useful metadata that can be used by commands and snippets. We can scope the inside of a class body and the inside of an object literal:
; The interior of a class body.
((class_body) @meta.block.class.js
  ; Start after `{`...
  (#set! adjust.startAt firstChild.endPosition)
  ; ...and end before `}`.
  (#set! adjust.endAt lastChild.startPosition))
; The inside of an object literal.
((object) @meta.object.js
  ; Start after `{`...
  (#set! adjust.startAt firstChild.endPosition)
  ; ...and end before `}`.
  (#set! adjust.endAt lastChild.startPosition))
By defining these two scope names, we've exposed these concepts to all the systems that consume scope names, including snippets.
The language-javascript package already defines fun one way, but we can redefine it for more specific scopes:
'.source.js .meta.block.class.js':
  'Function':
    'prefix': 'fun'
    'body': '${1:functionName}($2) {\n\t$0\n}'
'.source.js .meta.object.js':
  'Function':
    'prefix': 'fun'
    'body': '${1:functionName}($2) {\n\t$0\n},'
Now we've made the fun snippet much more useful. When we type fun and press Tab, the snippets package will pick the version that matches the context of the cursor most closely.
This will work identically whether you use tab triggers or choose your snippets from an autocomplete menu.
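How the snippets package ranks competing selectors is not covered here, but the general idea of "most specific match wins" can be sketched as follows. Everything in this block (selectorMatches, pickSnippet, and the rule of counting selector segments as specificity) is a simplified assumption for illustration, not the package's real algorithm.

```javascript
// True if every segment of the selector matches a scope in the chain, in order.
// A segment matches a scope that is equal to it or that extends it with a dot.
function selectorMatches(selector, scopeChain) {
  const parts = selector.trim().split(/\s+/).map((part) => part.replace(/^\./, ""));
  let index = 0;
  for (const scope of scopeChain) {
    if (index === parts.length) break;
    if (scope === parts[index] || scope.startsWith(parts[index] + ".")) index++;
  }
  return index === parts.length;
}

// Among the matching selectors, prefer the one with the most segments.
function pickSnippet(snippets, scopeChain) {
  let best = null;
  for (const [selector, body] of Object.entries(snippets)) {
    if (!selectorMatches(selector, scopeChain)) continue;
    const specificity = selector.trim().split(/\s+/).length;
    if (!best || specificity > best.specificity) best = { selector, body, specificity };
  }
  return best;
}

const funSnippets = {
  ".source.js": "function ${1:functionName}($2) {\n\t$0\n}",
  ".source.js .meta.block.class.js": "${1:functionName}($2) {\n\t$0\n}",
  ".source.js .meta.object.js": "${1:functionName}($2) {\n\t$0\n},",
};

// Cursor inside a class body: the class-specific version wins.
console.log(pickSnippet(funSnippets, ["source.js", "meta.block.class.js"]).selector);
// Cursor at the top level of the file: only the general version matches.
console.log(pickSnippet(funSnippets, ["source.js"]).selector);
```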
This change isn't just theoretical; it's been implemented in the language-javascript grammar package, and it shipped with Pulsar 1.109.
Next time
Syntax highlighting isn't the only way that Tree-sitter's query system can make our lives easier. In the next installment we'll tackle two tasks that the legacy Tree-sitter integration never addressed: indentation and code folding.