Modern Tree-sitter, part 2: why scopes matter
In the last post, I tried to explain why the new Tree-sitter integration was worth writing about in the first place: because we needed to integrate it into a system defined by TextMate grammars, and we had to solve some challenging problems along the way.
Today Iâll try to illustrate what that system looks like and why itâs important.
When I first started getting involved with Pulsar back in January, @mauricioszabo had just begun the process of migrating us to the web-tree-sitter
bindings, and seemed to be dreading the enormity of the task.
He noticed that most Tree-sitter parsers had their own built-in query files for syntax highlighting. These highlights.scm
files act like a sort of stylesheet for a Tree-sitter tree: theyâre used when you run tree-sitter highlight
from the command line, and theyâre what GitHub uses when it highlights code in a web browser.
Couldnât we just use these files and save ourselves a lot of hassle?, he asked.
I said no. And I realized that, in saying no, I was volunteering myself to do some hard work.
How TextMate grammars work
You may not have ever used TextMate, but youâve probably used an editor that benefited from its existence. Several widely used editors â first Sublime Text, then Atom, and now Visual Studio Code â all based their grammar systems on TextMateâs.
A TextMate grammar doesnât aim to understand your code entirely. It uses regular expressions â pattern-matching â to describe rules that identify the important parts of your code. It knows that some rules apply broadly and others are valid only in certain contexts, so it allows you to nest rules inside other rules.
Hereâs a very simple example of what a TextMate grammar looks like:
{
"name": "JavaScript",
"scopeName": "source.js",
"fileTypes": ["js", "jsx", "mjs", "cjs"],
"patterns": [
{
"name": "string.quoted.double.js",
"match": "\\\"(.*?)\\\""
}
]
}
Itâll feel overwhelming to consider all of the things a grammar does, so for this example Iâve removed all of its rules except for one: the one that highlights double-quoted strings "like this"
. You can see that it looks for a particular pattern and, when it matches, assigns the name string.quoted.double.js
to that range. Thatâs called a scope name.
When you load a JavaScript file in Pulsar â or VSCode, or Sublime Text â this grammar will find all your strings and mark them with scope names. In turn, your editorâs syntax theme will hook into that scope name to make that string a special color. (It will most likely target string
, rather than something more specific, but thatâs why scope names are divided into segments; targeting string
will also match string.quoted.double.js
.)
Letâs make our grammar a bit more complex:
{
"name": "JavaScript",
"scopeName": "source.js",
"fileTypes": ["js", "jsx", "mjs", "cjs"],
"patterns": [
{
"name": "string.quoted.double.js",
"begin": "\"",
"beginCaptures": {
"0": { "name": "punctuation.definition.string.begin.js" }
},
"end": "\"",
"endCaptures": {
"0": { "name": "punctuation.definition.string.end.js" }
},
"patterns": [
{
"include": "#string_escapes"
}
]
}
]
}
This is practically the exact rule for double-quoted strings from Pulsarâs built-in TextMate grammar for JavaScript. It introduces some more advanced concepts:
- Instead of a single pattern, weâve now got
begin
andend
patterns. If it can identify a buffer range that starts with abegin
match and ends with anend
match, it can scope the entire range that way, and can create a new context in which only certain patterns are applied â in this case, escape sequences. - The entire string will still be scoped
string.quoted.double.js
, but now weâre also applyingpunctuation
scope names to the quote characters themselves.
Hereâs a better way to visualize this:
If you were to put your cursor anywhere within a buffer and run the Editor: Log Cursor Scope command, youâd see a list of scope names that are active at that buffer position, moving from broadest to narrowest:
Why is it choosing those specific names? Because of naming conventions established by TextMate many years ago. When different language grammars try to abide by the same conventions for naming their scopes, it makes it easier to write syntax themes without constantly checking how they look in every possible language.
Still, youâd be forgiven if you felt like this was overkill. All weâre doing here is applying some syntax highlighting, right? Do you need this much information just to make a string green?
How TextMate uses scope names
TextMate has a good reason for wanting such verbose scope names, and for sprinkling them so liberally throughout your source code: they arenât used just for syntax highlighting.
Itâs better to think of scope names as intelligent annotations for your source code. The idea isnât that youâll want to make single-quoted strings a different color from double-quoted strings â though you could! â but rather that other editor features could benefit from knowing whether you were inside a string, and what kind of string you were inside.
Because in TextMate, most editor actions â commands, snippets, and macros â can be made contextually aware.
Letâs use commands as an example. Here are some commands you can write in TextMate:
- If the cursor is within a string, pressing Ctrl-Shift-â should convert the string between single-quoted and double-quoted delimiters.
- If the cursor is in the middle of
{}
, pressing Return should insert two newlines, put the cursor between the two newlines, and indent the cursorâs line by one level. - If the cursor is within a URL, pressing Enter on the number pad should open that URL in a web browser.
In fact, all of these are real examples that are enabled by default in TextMate. And because they target generic scope names that are shared across a number of grammars, theyâll work in nearly any language.
Did you notice how our grammar file defined a scopeName
of source.js
? Thatâs a root scope name, and it applies to the entire file. And it means that commands can restrict themselves to certain languages just like any other context.
For instance: to beautify your JavaScript, you might want to run it through prettier
. To beautify your HTML, you might want to run it through HTML Tidy. And to beautify your JSON, you might want to run it through jq
.
In TextMate, file beautification commands canonically use the hotkey Ctrl-Shift-H. By applying the right scope selector for each of these three commands, you can assign them all to Ctrl-Shift-H, and they wonât get in each otherâs way.
How Pulsar uses scope names
There are a number of reasons why TextMate isnât widely used anymore: itâs a macOS-only editor, its releases are frustratingly sporadic, and itâs never had support for crucial features like split editor panes. But TextMate was a major influence on how Atom was built.
Itâs not widely known, but Atom and Pulsar offer nearly as much flexibility with scope names as TextMate does:
Any snippet can be defined to run in an arbitrary scope context. For instance, you could write a snippet that will only expand if youâre inside of a JavaScript block comment; if you try to invoke it elsewhere, itâll act as though that snippet doesnât exist.
Built-in packages use this system. For instance, the
language-html
package defines snippets for most common tag names, but then redefines each one to do nothing if the cursor is inside a context where HTML would be invalid, like astyle
orscript
element.Pulsarâs configuration system allows for scope-specific overrides. For instance, a user can set
editor.softWrap
tofalse
globally, but set it totrue
for Markdown files.Built-in packages use this system. Scopes can be used to set config values that arenât exposed in the UI â like what to use for comment delimiters in a given language, or which non-letter characters should be considered to be part of a âwordâ for the purposes of cursor navigation.
In Pulsar, commands can be invoked anywhere, whether the user is in a text editor or not; hence thereâs no strong system for enforcing that commands be invoked only in certain scope contexts.
But itâs possible! Iâve got several commands in my
init.js
whose only purpose is to inspect the scope list of the active text editorâs cursor, then delegate to one of several other commands depending on where the cursor is. Itâs not as easy as it could be, but the bones are there.
But thatâs not all. If you use Pulsar, chances are very good that you use a package â built-in or third-party â that inspects scope names for contextual information. Here are just a few of many examples:
autocomplete-plus
inspects scopes to decide whether it should offer suggestions at the cursor; it declines to do so whenever any of the current scopes matches a user-configurable list. (This is how it decides not to offer suggestions if youâre typing within a comment, for instance.)autocomplete-css
decides which suggestions to offer based on the scope name of the closest token.bracket-matcher
uses scope names to locate and highlight paired tokens like brackets and matching HTML tags.emmet
, the eighth-most-downloaded community package, relies on scope descriptors to decide whether it should try to expand an abbreviation into HTML.
What would break?
With this in mind, letâs go back to those built-in Tree-sitter query files and consider our string example. In tree-sitter-javascript/queries/highlights.scm
we can see how strings are treated:
[
(string)
(template_string)
] @string
This rule treats three different kinds of strings â single-quoted, double-quoted, and backtick-quoted â identically. If we used this rule as written, weâd be applying a scope name of string
to all JavaScript strings. Not string.quoted.double.js
and the like; just string
. And thereâs no rule in that file that applies scope names to string delimiters, either.
If we embraced a system that used names like string
and comment
instead of string.quoted.double.js
and comment.block.documentation.js
, the editing experience would be worse in a number of ways ways â some tiny, some large.
To drive it home for you, letâs think of all the things that would break from this change alone:
- Any community syntax theme that chose to style double-quoted strings differently from single quoted strings would suddenly behave differently. (Far-fetched? Consider a language like PHP or Ruby in which single-quoted strings have different semantics from double-quoted strings.)
- Any community syntax theme that chose to style string delimiters in a certain way â with a bolder weight, perhaps â would suddenly behave differently, because the
punctuation
scopes would no longer be present on the delimiters. - Any user-defined snippet that is scoped specifically to
string.quoted.double
â or evenstring.quoted
â would break. - Any user-defined commands that looked for the presence of a
punctuation
scope to detect whether the cursor is adjacent to a string delimiter would break.
I think Atom learned this lesson rather well during the original rollout of Tree-sitter grammars. After updating, users were defaulted to the new grammars. Sometimes things looked different, or behaved differently. When users reported these differences, they learned that they could work around those bugs in the short term by unchecking the âUse Tree-Sitter Grammarsâ setting and restoring the original behavior. I wonder how many did, and how many of them ever re-enabled Tree-sitter grammars in the future.
A lot of code has been written that presumes the existence of TextMate-style scopes. I think this is fine. The fact that we have a new system doesnât mean that we should leave scopes behind. For one thing, itâd be a jarring experience for users â weâd be bragging about a new feature that was likely to break customizations that users had been relying upon for years.
But I also think that TextMate-style scopes function as an elegant middleware. They allow us to change an underlying implementation without the user needing to know or care â except, perhaps, to notice downstream benefits, like faster syntax highlighting and new editor features.
The challenge
So thatâs why I volunteered to do this. I knew that we couldnât just reuse some syntax highlighting query files that were written with different assumptions in place. And I knew that I couldnât reasonably ask someone else to do a bunch of work just because I felt like it was important.
My goal was to create a set of grammars that had the performance and accuracy upsides of the Tree-sitter experience and the rich scope name annotations that TextMate grammars provided.
For that to be possible, Tree-sitter grammars should apply scope names as similarly as possible to TextMate grammars. There may be reasons that we choose not to do something the way that the TextMate grammar does it, but we donât want to be unable to do something the way that the TextMate grammar does it.
Syntax highlighting was the first, and hardest, task. Next time Iâll talk about how we solved it.