Modern Tree-sitter, part 4: indentation and code folding
Last time we looked at Tree-sitterās query system and showed how it can be used to make a syntax highlighting engine in Pulsar. But syntax highlighting is simply the most visible of the various tasks that a language package performs.
Today weāll look at two other systems ā indentation hinting and code folding ā and Iāll explain how queries can be used to support each one.
Indentation
Our programmer predecessors might scoff at how much our editors coddle us, but Iāll say it anyway: I hate it when my editor doesnāt know when to indent. In particular, C-style languages like JavaScript have predictable and simple rules about indentation, so when an editor guesses wrong, itās like a burst of static in the middle of a song.
If I type
if {
and press Return, all major code editors will indent the next line for me by one level. Thereās no reason for an editor to force me to press Tab myself ā itās obvious from context.
And once Iām done with the conditional and type }
on its own line, my editor should recognize that this line isnāt part of the indented block, and should decrease the indentation by one level automatically.
How does Pulsar do this now? And how can we swap in our own system for doing it with Tree-sitter?
How TextMate grammars do it
Pulsar, like Atom before it, uses an indentation hinting system based on the system from TextMate grammars. It works a bit like this:
- Use the previous lineās content to decide whether a new line is indented. When a user presses Return, a TextMate grammar will test the line the cursor was just on against a regular expression called
increaseIndentPattern
. If thereās a match, it concludes that the next line should start with an extra level of indentation. - Use a lineās own content to decide when it should be dedented. Since only certain kinds of content āĀ like
}
orend
āĀ signify the end of a block, a TextMate grammar will test the current line against a regular expression calleddecreaseIndentPattern
over and over as the user types. If that pattern ever matches, the current line will be dedented one level.
For JavaScript, imagine that increaseIndentPattern
can be described as āan opening brace without a matching closing brace,ā and decreaseIndentPattern
can be described as āa closing brace that is not preceded by an opening brace.ā Each pattern would probably need to match more than just those situations, but thatās the most important pattern to recognize by far.
Letās look at that screencast again to see these rules in action:
The first rule comes into play when we press Return at the end of line 2. A newline is inserted, and the cursor is correctly indented by one additional level.
The second rule reacts after we type a }
on line 4: now the lineās content matches decreaseIndentPattern
, and the line is dedented automatically.
How legacy Tree-sitter grammars do it
The same way. No, seriously.
It was obviously meant to be temporary, but legacy Tree-sitter grammars rely on the same increaseIndentPattern
and decreaseIndentPattern
that TextMate grammars use. We should come up with a Tree-sitterābased system instead.
TextMateās deep embrace of regular expressions is a double-edged sword: it makes simple things easy, but it makes complex things nearly impossible. Instead of testing lines against a single regular expression ā which can quickly get unwieldy as it expands to handle all possible indentation hinting scenarios ā we can use Tree-sitterās query system to identify code features that would affect indentation.
@mauricioszabo had noticed Neovimās prior art here, and had started to implement a similar system, so it was easy to pick up where he left off.
Working within the system
We couldāve just adopted Neovimās system wholesale. But I wanted a system that kept the same decision points as the TextMate system for conceptual simplicity. So weāve built a system that will abide by the contract described aboveā¦
- To figure out whether to indent a line, look at the content of the line above it.
- To figure out whether to dedent a line, look at the content of that line itself.
ā¦except using its own logic:
- When a user presses Return, instead of checking the previous line against
increaseIndentPattern
, weāll instead run a query capture against the previous line and look at the results to figure out whether to indent the new line. - When a user types on the current line, instead of checking it against
decreaseIndentPattern
, weāll run a query capture against that line with each keystroke and look for results that imply we should dedent that line relative to the line above.
Remember that Tree-sitter produces concrete syntax trees; every node represents an actual buffer range. The things that are typically stripped from abstract syntax trees, like punctuation, are still present in a Tree-sitter tree, and can be queried. Thatās good news for us: looking at punctuation is a great way to predict how lines should be indented.
So letās start here. Imagine this content exists in a file called indents.scm
:
"{" @indent
"}" @dedent
This is maybe the simplest possible system for describing indentation in a C-style language. For some languages, like CSS, this gets us 99% of the way to a perfect indentation system. But how does it work?
If youāre following along with tree-sitter-tools
, you can visualize it:
- First, make sure youāve enabled modern Tree-sitter.
- Open a new document, change the grammar to CSS, and type some sample CSS.
- Run the Tree Sitter Tools: Open Inspector For Editor command.
- Toggle the āAnonymous nodesā option to Show.
- Paste that code block into the appropriate text box and click on Run Query.
You can see the query matches illustrated in the text editor in real time, and you can match up the colors to the @indent
and @dedent
captures. You can even type new content (as in the examples below) and see the captures update in real time!
Letās say the user is writing a CSS file, and the cursor is represented by the |
character:
body {|
If the user were to press Return, weād run a query capture on a specific range of the buffer: from the start of row 1 to wherever the cursor was. The opening brace on row 1 would produce a single capture called @indent
. Based on that information, weād know that row 2 should be indented by one level.
But what if the file looked like this instead?
body { font-family: 'Helvetica', sans-serif; }|
If the user were to press Return, weād run the same query capture, and it would match twice: one @indent
capture and one @dedent
capture. Those captures would cancel each other out. Weād know that the opening brace we saw had already been balanced by a later closing brace, and weād know that row 2 should not increase its indentation level.
Now letās look at one more example:
body {
font-family: 'Helvetica', sans-serif;
|
After typing one property-value pair and pressing Return, weāre on row 3. Should this line be dedented? It depends on what weāre about to type! If weāre about to type }
, then the answer is yes ā but if weāre typing anything else, like another property-value pair, then the answer is no. Thatās why Pulsar wonāt decide whether to dedent until we start typing. If we were to type }
, our Tree-sitter grammar would run a query capture on the current line, spot the @dedent
match, and respond by dedenting the current line one level from the line above.
For more complex C-style languages like JavaScript, hereās a better starting point for indents.scm
:
; Any of these characters should trigger indentā¦
[ "{" "(" "[" ] @indent
; ā¦and any of these should trigger dedent.
[ "}" ")" "]" ] @dedent
There are major simplicity benefits to targeting these anonymous nodes instead of more abstract nodes. Most folksā indentation styles tend to align with delimiter usage, so the more tightly we can bind to the delimiters themselves, the more accurate the hinting will be. And anonymous nodes are safer to rely upon as the user types and the syntax tree is in flux. Sometimes we have to āfilter outā false positives ā for instance, curly braces in JavaScript that signify template string interpolations instead of statement blocks ā but thatās a small price to pay.
Iām hiding some of the complexity from you, but less than youād think. This is a much friendlier way to describe indentation hinting than making a grammar author maintain an ever-more-complex set of regular expressions. It allows the author to describe each kind of indentation hint as its own rule.
Getting creative
And it allows us to do some more complex things that werenāt possible before.
TextMateās system will let us indent or dedent one level at a time. But consider a switch
statement:
switch (foo) {
case "bar":
// one thing
break;
case "baz":
// another thing
break;
default:
// a third thing
}
Between the second-to-last line and the last line, thereās a two-level change in indentation. How can we express that?
Hereās where our syntax tree comes in handy. Instead of describing our desired indentation level relative to the previous line, we can describe it relative to a line of our choosing:
; The closing brace of a switch statement's body should match the indentation
; of the line where the switch statement starts.
(switch_statement
body: (switch_body "}" @match)
(#set! indent.matchIndentOf parent.startPosition))
; 'case' and 'default' need to be indented one level more than their containing
; `switch`.
(["case" "default"] @match
(#set! indent.matchIndentOf parent.parent.startPosition)
(#set! indent.offsetIndent 1))
["case" "default"] @indent
Here weāre using a new capture called @match
. It can do exactly what we just described by using a node position descriptor (an idea we introduced in the last installment) to refer to an earlier line.
In the first rule, weāre matching the closing }
of a switch
statement. Weāre using a #set!
predicate to describe the starting position of its switch_body
parent. The switch_body
starts on row 1, so @match
understands this to mean āadjust the current line to match the indentation of row 1.ā This will happen automatically when the user types the closing brace.
In the second rule, weāre using similar logic. If we were typing the above switch
statement, weād find ourselves over-indented as we started typing on line 5. Weād want our editor to dedent that line once it saw that we were typing a new branch of the switch
statement. So we can write another @match
rule ā still targeting the indentation level of the starting row of switch_body
āĀ but with an extra rule to offset the indent by one level. In other words, we want to be indented one level more than the indent level of row 1.
The third rule is simpler: itās how we ensure that the editor indents by one level after case
and default
statements.
You mightāve had your hackles raised by this example. After all, thereās another school of thought on how to indent switch
statements:
switch (foo) {
case "bar":
// one thing
break;
case "baz":
// another thing
break;
default:
// a third thing
}
This faction thinks that case
and default
statements should be indented to the same level as the original switch
statement. How can we please both camps?
One way might be to write two different versions of our second rule, and decide which one to use based on configuration:
; Some say 'case' and 'default' need to be indented one level more than their
; containing `switch`.
(["case" "default"] @match
(#is? test.config "language-javascript.doubleIndentSwitchStatements")
(#set! indent.matchIndentOf parent.parent.startPosition)
(#set! indent.offsetIndent 1))
; Others say 'case' and 'default' should match their containing `switch`.
(["case" "default"] @match
(#is-not? test.config "language-javascript.doubleIndentSwitchStatements")
(#set! indent.matchIndentOf parent.parent.startPosition))
Here weāre using a test.config
scope test. I told you about scope tests last time, but I havenāt yet mentioned that they donāt just apply to syntax highlighting queries; they apply to indentation queries, too!
The test.config
scope test gives us a way to approve or reject a capture based on the userās chosen configuration. If theyāve enabled the doubleIndentSwitchStatements
config option, we can indent their code one way; if theyāve disabled it, we can indent their code another way.
This particular example isnāt yet implemented, but it could be. This would be another advantage that the new Tree-sitter system has over TextMate-style indentation hinting: more room for configurability.
Hereās another edge case of indentation: a braceless if
statement.
How did we pull this off? Havenāt we been targeting nodes like {
and }
?
Yes, but we can also write rules to handle one-line conditionals like these:
(if_statement
condition: (parenthesized_expression ")" @indent
(#is? test.lastTextOnRow true)))
Here weāve written a query that captures the closing )
of a braceless if
statement and uses that as the indentation hint. Weāre also using a scope test to ensure the capture is ignored when it isnāt the last text on the row; thatās how we can avoid a false positive in this scenario:
if (notificationsAreDisabled) return;
A braceless if
or else
applies only to the next statement. The real feat is knowing to dedent immediately after that statement:
(if_statement
condition: (_) @indent
consequence: [
(expression_statement)
(return_statement)
(continue_statement)
(break_statement)
(throw_statement)
(debugger_statement)
] @dedent.next)
An if
clause with braces will have a node named statement_block
in its consequence
slot. An if
clause without braces will have its consequence
slot filled with one of these six kinds of nodes instead.
The @dedent.next
capture is only rarely needed, but this is a textbook case: it signals when we should dedent the next line without waiting to see the content of the line. Because we know that the next line should always be dedented in this scenario.
How well does this work? Amazingly well:
Tree-sitter isnāt confused by the line comment! It wonāt dedent until after the user types an actual statement.
Does this matter a great deal? Is it worth creating detailed rules to cover a breadth of unusual indentation scenarios? Probably not. But this is just one of a handful of low-hanging fruit that the new system has made possible. Even the slightly verbose query above is much easier to write (and for other people to reason about) than an inscrutable regular expression.
Code folding
Iām not someone who uses code folding very much, but I want it to be there when I need it. Collapsing entire code branches helps me see the big picture more easily.
How TextMate grammars do it
Much like indentation, code folding markers in TextMate grammars are determined with regular expressions. Any line that matches foldingStartMarker
is treated as the start of a fold, and any line that matches foldingEndMarker
is treated as the end of a fold. This offers similar tradeoffs to the indentation patterns described above: simple for simple cases, but nearly impossible for complex cases.
How legacy Tree-sitter grammars do it
Legacy Tree-sitter grammars introduced a system for defining folds within the grammar definition file itself āĀ one that hooks into named nodes from the tree. Itās a good start, but we can make something more expressive and flexible with queries.
Using queries to define fold ranges
Again, credit goes to the nvim-treesitter
developers and to @mauricioszabo for envisioning how Tree-sitter queries can describe folds more simply. Hereās how simple it can be in a language like CSS:
(block) @fold
Thatās the entirety of the contents of the folds.scm
file inside the language-css
package. This works because a starting position and an ending position are all you need to describe a fold, and thatās what Tree-sitter gives us. Every node in a tree reports its buffer positions, so every node can be the target of a fold.
Letās go a bit deeper and figure out what this does.
When it opens a file, Pulsar needs to figure out where possible fold ranges are so that it can show a small chevron in the gutter on each line where a fold can start. So itāll run a query capture against each line, testing it to see if any @fold
captures result.
Any results will have their buffer ranges inspected. If the range starts and ends on the same line, the candidate fold is ignored. (This saves grammar authors from having to manually exclude things like one-line conditionals.) But if the range starts on one line and ends on another, the fold is valid, and Pulsar knows to place a chevron in the gutter where the fold would start.
The beginning of a code fold is the very last column on its starting line. In most cases, the range in question will have delimiters on each end ā the backticks of a template string, the curly braces of an if
or else
condition, et cetera. Thatās why, by default, the end of a code fold is the starting position of its very last node child. This works as intended in the vast majority of cases, as in our CSS example above:
But because this isnāt always true ā especially for languages like Python that donāt use delimiters for blocks ā we provide ways to tweak a foldās ending position.
For instance, letās look at a JavaScript block comment:
Since comment nodes donāt have children, we should set a custom ending position for the fold with fold.endAt
so that it doesnāt try to look up a child node that doesnāt exist. Then we can use fold.offsetEnd
to move the ending point of the fold two columns to the left so that the fold ends before the commentās ending delimiter:
((comment) @fold
(#set! fold.endAt endPosition)
(#set! fold.offsetEnd -2))
Now we can fold up the block comment the way we want:
Folding in JavaScript is still pretty simple, but not as simple as CSS. Weāve got to account for some edge cases. For example, when an if
statement is followed by an else
, we should adjust the fold so that it ends on the line before the else
, so that each fold can be toggled independently without interfering with one another:
; Target `if` consequence blocks that are followed by `else`s.
((if_statement
consequence: (statement_block) @fold
alternative: (else_clause)
(#set! fold.adjustToEndOfPreviousRow true)
))
; Other `if` consequence blocks will get caught here.
(statement_block) @fold
You can see how this works in the screencast below ā the else
blockās closing delimiter folds up to be on the same line as the starting delimiter, but the if
blockās fold stops before the newline.
āEnd the fold at the end of the previous lineā is a common enough case to have its own shorthand predicate. Weāve put this special-case query above the simpler one because Pulsar will use the first capture that matches for a given line.
Why is this so complicated?
Iāll say it again: this tour through the machinery of Pulsar is aimed at Tree-sitter aficionados and at those who might want to write their own language packages for Pulsar. If that doesnāt describe you, donāt let yourself get overwhelmed by this information dump ā just make note of the new features that this system makes possible.
There are pieces of the indentation and folding systems that I didnāt even try to explain in this post. But all this complexity has a purpose, and users reap the benefits in small increments ā for instance, every time they donāt have to go back and reformat code they paste into an editor.
These systems only work with the assistance of language grammars, so we owe it to the authors of those grammars to hide as much of that complexity as we can. If we can make these systems seem simple on the surface, theyāll work better, and users will be happier.
Next time
Cue up your DVD of Inception! Next time weāre delving into language injections ā the feature that lets you write CSS inside of HTML inside of JavaScript inside of HTML inside of PHP.