Modern Tree-sitter, part 1: the new old feature
The last few releases of Pulsar have been bragging about a feature that arguably isn't even new: our experimental "modern" Tree-sitter implementation. You might've read that phrase a few times now without fully understanding what it means, and an explanation is long overdue.
This is the first of a series of articles about Pulsar's ongoing project to migrate its Tree-sitter implementation to a more modern version, the culmination of hundreds of hours of development work that started back in February of this year. It first shipped in Pulsar version 1.106 back in June as an opt-in feature, and is being improved on an ongoing basis with each new monthly release.
This is a big feature, perhaps the biggest since Pulsar was forked from Atom, and yet it's a feature that, if we've done our jobs right, won't even seem like much of a change to most users. Before we dive into the deep end, I'll try to explain why this is a topic worthy of multiple blog posts.
What is Tree-sitter?
Tree-sitter is a code parsing system. It's the brainchild of Max Brunsfeld, current Zed contributor and former contributor to Atom.
It represents your code as a tree of nodes. It's very fast on first parse, and even faster at re-parsing code after you've made changes, because it can reuse the output from the last parse and reprocess only the parts that have changed.
You can use its output to underpin lots of features that you'd need in a code editor:
- syntax highlighting
- code folding
- contextual awareness (for example: is the cursor currently within a string?)
- indentation hinting (for example: if I press Return here, should the next line be indented by one level?)
- buffer navigation (for example: select the entire string that my cursor is in, or move the cursor to the nearest opening HTML tag)
- symbol navigation (viewing an outline of your current file, or jumping to a symbol with a specific name)
A Tree-sitter parser is designed to parse code quickly, but not necessarily with 100% accuracy; the goal is to be accurate enough for the purposes listed above.
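To make "contextual awareness" a bit more concrete, here is a minimal sketch of how a tree of nodes can answer the question "is the cursor within a string?". Everything in it is an assumption made for illustration: the `Node` class, the node type names, and the hand-built tree are a toy model, not Tree-sitter's actual API or the output of a real parser.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A toy syntax-tree node covering SOURCE[start:end] (hypothetical type)."""
    type: str
    start: int
    end: int
    children: List["Node"] = field(default_factory=list)


def node_path_at(root: Node, offset: int) -> List[Node]:
    """Return the chain of nodes, outermost first, whose ranges contain offset."""
    path: List[Node] = []
    node = root
    while node is not None and node.start <= offset < node.end:
        path.append(node)
        # Descend into the first child that also contains the offset, if any.
        node = next((c for c in node.children if c.start <= offset < c.end), None)
    return path


def cursor_in_string(root: Node, offset: int) -> bool:
    """True if any node on the path to the cursor is a 'string' node."""
    return any(node.type == "string" for node in node_path_at(root, offset))


# Hand-built tree for a one-line buffer; a real parser would produce this.
SOURCE = 'x = "a\\n" + y'
root = Node("assignment", 0, 13, [
    Node("identifier", 0, 1),
    Node("binary_expression", 4, 13, [
        Node("string", 4, 9, [Node("escape_sequence", 6, 8)]),
        Node("identifier", 12, 13),
    ]),
])

print(cursor_in_string(root, 7))   # cursor inside the escape sequence
print(cursor_in_string(root, 12))  # cursor on the trailing identifier
```

The check looks at every ancestor rather than only the innermost node, because the cursor may sit inside something nested within the string, such as an escape sequence.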
What is the new Tree-sitter integration replacing?
The new Tree-sitter integration (which I'll be calling modern Tree-sitter throughout this series) won't replace anything except for the previous Tree-sitter integration, which I'll be calling legacy Tree-sitter.
Once we decide modern Tree-sitter is stable, we'll drop support for legacy Tree-sitter so that Pulsar can update to a newer version of Electron.
Tree-sitter will continue to exist alongside Atom's original system for syntax highlighting: TextMate grammars. This grammar system is based on the one invented by TextMate many years ago, and it's still being used by editors like Visual Studio Code and Sublime Text.
If Tree-sitter is already in Pulsar, why write a new implementation?
Good question! Atom was, after all, the first code editor to ship with support for Tree-sitter. It was introduced in late 2017, and was made the preferred system for syntax highlighting starting with Atom 1.32 nearly a year later.
There are two major reasons why the legacy implementation needs to be replaced:
Tree-sitter now has powerful features that the legacy implementation doesn't leverage. As is often the case, being the first to implement it meant that Atom found all of Tree-sitter's early pain points. It was a stated goal to use TextMate-style scope names in the new Tree-sitter grammars, so as to make migration easier, but Atom had to invent its own system for mapping Tree-sitter output to scope names, and that system didn't have the flexibility it needed to match TextMate grammars' syntax highlighting in all cases. This revealed a need for a more robust system of describing tree nodes, and for highlighting ranges that didn't correspond to the exact ranges of tree nodes.
Tree-sitter eventually introduced a powerful query language that could make the job of syntax highlighting easier. But by that point, Microsoft had bought GitHub, and Atom seemed not to be a major priority anymore, so the legacy implementation was never updated to adopt this query language.
That's a task worth doing, but it will change how Tree-sitter grammars are written, so there's no way to avoid the fact that backward compatibility will be broken. But this is a perfect time to make the leap, because…
We need to switch to the web-tree-sitter bindings. One of the goals of Pulsar is to be able to run the editor on the latest version of Electron. Unfortunately, newer Electron versions make it difficult for Pulsar to use Node modules that are not context-aware. The legacy Tree-sitter implementation uses the node-tree-sitter bindings, and it appears to be a tall task to adapt these bindings so that they can be used in newer Electron versions. Right now, Pulsar's reliance on node-tree-sitter is preventing us from upgrading Electron to anything past our current version, 12.2.3, which is nearly two years old.
So we decided to migrate to the web-tree-sitter bindings. They use WebAssembly and can run safely inside a browser or an Electron application. Using WebAssembly instead of a native C++ module like node-tree-sitter involves a performance penalty, but we've found that penalty to be very small in practice. The web-tree-sitter bindings are robust and can do nearly everything that node-tree-sitter can do.
If, someday, the node-tree-sitter bindings were updated to be easier to use in an Electron context, we'd be able to migrate back without any further loss of backward compatibility. But for now, web-tree-sitter is the way forward, and we're pleasantly surprised at how well it does the job.
Nobody likes to break backward compatibility, but needing to switch to web-tree-sitter presents us with an opportunity. Tree-sitter is more stable and more robust than it was in 2017, so we're able to replace legacy Tree-sitter with something better rather than something that's merely equivalent.
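The scope-mapping problem from the first reason is easier to see with an example. The sketch below is a toy model under stated assumptions: the `Node` and `Rule` types and the `RULES` table are hypothetical, and this is neither Atom's legacy mapping system nor Tree-sitter's query language. It only illustrates why a mapping needs to do more than pair one node with one scope: a single string node wants a scope for its whole range plus separate scopes for just its opening and closing quotes.

```python
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple

Range = Tuple[int, int]


class Node(NamedTuple):
    """A toy tree node: a type name and a half-open character range."""
    type: str
    start: int
    end: int


class Rule(NamedTuple):
    """A hypothetical mapping rule: a scope name plus an optional range tweak."""
    scope: str
    adjust: Optional[Callable[[Node], Range]] = None  # None means the whole node


# Illustrative table only; the scope names follow TextMate naming conventions.
RULES: Dict[str, List[Rule]] = {
    "string": [
        Rule("string.quoted.double"),
        Rule("punctuation.definition.string.begin", lambda n: (n.start, n.start + 1)),
        Rule("punctuation.definition.string.end", lambda n: (n.end - 1, n.end)),
    ],
    "comment": [Rule("comment.line")],
}


def scopes_for(nodes: List[Node]) -> List[Tuple[str, int, int]]:
    """Turn tree nodes into (scope, start, end) triples using RULES."""
    result = []
    for node in nodes:
        for rule in RULES.get(node.type, []):
            start, end = rule.adjust(node) if rule.adjust else (node.start, node.end)
            result.append((rule.scope, start, end))
    return result


for scope, start, end in scopes_for([Node("string", 4, 9)]):
    print(f"{scope}: {start}-{end}")
```

Two of the three scopes produced for the string cover ranges that are not the range of any node in the tree, which is the kind of flexibility the paragraph above says the legacy system lacked.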
Why is Tree-sitter better in general?
Here are a few reasons why Pulsar is using Tree-sitter at all, and why Pulsar is configured to prefer a Tree-sitter grammar over a TextMate grammar when both are present:
- Tree-sitter can offer far more accurate and specific syntax highlighting.
- It can give you better understanding of context. For example: it makes it much easier to write snippets that can behave differently based on the context of the cursor.
- It makes it much easier for grammar authors to describe features like code folding and indentation hinting, making Pulsar smarter and easier to work with.
- It allows for smarter code navigation, meaning a more modern and flexible way to view the important symbols in your current file.
- It offers package authors a richer system for working with source code files. The syntax tree generated by Tree-sitter can be consumed by packages and leveraged in a number of ways.
The specific ways in which Tree-sitter will make your life easier will vary based on which languages you use most often, but this post series will explore a handful of examples.
Why is this new implementation better than the old one?
An under-the-hood change like this isn't necessarily something you'd notice. But Pulsar users may notice some of the downstream effects:
- Most notably, modern Tree-sitter is better at understanding and syntax highlighting your code than legacy Tree-sitter.
- You may notice that Pulsar is better at indenting and dedenting your code as you type, or suggesting ways to fold code blocks that weren't possible before.
- You may notice new features being added to existing language support (for example, snippets that do different things based on context) that weren't possible under the legacy system.
The benefits are much more direct to grammar authors:
- It gives authors a more intuitive system for describing syntax highlighting, and one which can finally match a TextMate grammar's flexibility in how it applies scopes.
- It gives authors brand new systems for describing code folding and indentation hinting.
- Modern Tree-sitter grammars are easier to iterate on: they allow someone to make changes to a grammar in progress and see them applied instantly.
I disabled Tree-sitter grammars at some point, and I don't feel like I've missed anything. Why should I turn them back on?
TextMate grammars are still the main style of grammar in Visual Studio Code, Sublime Text, and other editors. They can't do all the things that Tree-sitter parsers can do, and most new editors on the market have chosen to use Tree-sitter instead; but even just VSCode's example tells us that TextMate grammars are no impediment to having a popular and feature-filled editor.
So I'll be clear: we have no plans to deprecate TextMate-style grammars. They still have their place in Pulsar, and the only thing we'd achieve by deprecating them is to disrupt the editor experience of many of our users.
In the future, it will still be possible (as it is today) to turn off Tree-sitter grammars, either altogether or selectively for certain kinds of files, and fall back to a TextMate grammar for a given language (if it exists).
But our hope is that you'll give this new Tree-sitter system a chance, even if you'd disabled Tree-sitter grammars in the past for any reason. We think it's got all the upsides of the legacy Tree-sitter integration without any of the downsides.
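The fallback behavior described above (prefer Tree-sitter when it's enabled, otherwise fall back to TextMate if a TextMate grammar exists) can be sketched in a few lines. This is only a toy model: the `Settings` class, its field names, and `choose_grammar` are hypothetical and don't correspond to Pulsar's real configuration keys or internals. In particular, falling straight back to TextMate when the wanted Tree-sitter flavor is missing is my simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Settings:
    """Hypothetical settings; the names are illustrative, not Pulsar's."""
    use_tree_sitter: bool = True   # global on/off switch
    use_modern: bool = True        # modern vs. legacy implementation
    per_language: Dict[str, bool] = field(default_factory=dict)  # selective overrides


def choose_grammar(language: str, available: Set[str], settings: Settings) -> Optional[str]:
    """Pick which kind of grammar to use for a language, or None if there is none."""
    # A per-language override wins over the global switch.
    enabled = settings.per_language.get(language, settings.use_tree_sitter)
    if enabled:
        wanted = "modern-tree-sitter" if settings.use_modern else "legacy-tree-sitter"
        if wanted in available:
            return wanted
    # Fall back to a TextMate grammar, if one exists for this language.
    return "textmate" if "textmate" in available else None


both = {"modern-tree-sitter", "textmate"}
print(choose_grammar("javascript", both, Settings()))
print(choose_grammar("javascript", both, Settings(use_tree_sitter=False)))
print(choose_grammar("python", both, Settings(per_language={"python": False})))
```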
Can I use this new implementation now?
Yes, you can, as long as you're on Pulsar 1.106 or greater. Open your Pulsar settings and focus the "Core" pane. Find the setting named Use Modern Tree-Sitter Implementation and make sure it's checked, then make sure that the nearby setting named Use Tree-Sitter Parsers is also checked. Then restart Pulsar or reload your window.
If you routinely use the grammar selector and want to be able to switch between Tree-sitter grammars and TextMate grammars at will, locate the grammar-selector package in the "Packages" pane, then click on its Settings button. Uncheck the setting named Hide Duplicate TextMate Grammars. This will give you the ability to choose between modern Tree-sitter, legacy Tree-sitter, and TextMate grammars.
Which Tree-sitter grammars come with Pulsar?
Currently, these grammars are built in:
- C and C++
- Clojure
- CSS
- EJS and ERB (HTML with embedded JavaScript/Ruby)
- Go
- HTML
- Java
- JavaScript
- JSON
- Markdown
- Python
- Ruby
- Rust
- Shell
- TOML
- TypeScript (and TSX)
- YAML
In addition, Pulsar ships with several specialty Tree-sitter parsers that can be injected into other grammars:
- A parser to detect URLs in text (for identifying and highlighting URLs in strings and comments)
- A parser to detect TODO-style remarks in comments so that they can be highlighted
- A parser to highlight regular expressions in various languages
- A parser for separating YAML front matter from Markdown
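The idea behind these injected parsers is that a second, specialized parser only looks at certain ranges that the main grammar has identified. Here's a minimal sketch of that idea for TODO-style remarks. It is a toy under stated assumptions: the `Node` type and the hand-written node list are hypothetical, and a regular expression stands in for the injected parser, which is not how Pulsar's real TODO parser is implemented.

```python
import re
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    """A toy node from the host grammar: a type name and a character range."""
    type: str
    start: int
    end: int


# A regex stands in for the injected parser in this sketch.
TODO_PATTERN = re.compile(r"\b(TODO|FIXME)\b")


def find_todos(source: str, nodes: List[Node]) -> List[Tuple[str, int, int]]:
    """Search for TODO-style remarks, but only inside comment nodes."""
    hits = []
    for node in nodes:
        if node.type != "comment":
            continue
        # Restrict the search to the comment's range within the buffer.
        for match in TODO_PATTERN.finditer(source, node.start, node.end):
            hits.append((match.group(1), match.start(), match.end()))
    return hits


source = 's = "TODO" # TODO: fix'
nodes = [Node("string", 4, 10), Node("comment", 11, 22)]  # as a host parser might report

for word, start, end in find_todos(source, nodes):
    print(word, start, end)
```

The "TODO" inside the string literal is ignored, because the search never runs over the string node's range; only the one in the comment is reported.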
If you use a language that isn't on the list above and you're curious about what it would take to give that language a Tree-sitter grammar, you'll get extra value out of this post series.
The old grammar highlighted my code in a way that I liked. Now things are colored differently and it's driving me nuts. Should I turn off Tree-sitter?
Please don't! It'd be like amputating your finger to get rid of a hangnail.
Instead, you can use your user stylesheet to apply a few lines of overrides to your syntax theme and restore the look you're used to. Open a topic in our discussion forums and someone can tell you exactly how to do it.
Why should I write a Tree-sitter grammar for Pulsar?
Because it's a much friendlier experience than writing your own TextMate grammar, provided that a Tree-sitter parser exists for the language in question.
Pulsar already has built-in Tree-sitter grammars for most common programming languages. But if you're a consumer of something more obscure, you might find that someone's already written a parser for it. The nvim-treesitter project, arguably the largest extant consumer of Tree-sitter, is responsible for the creation of dozens of Tree-sitter parsers for niche languages.
In my experience, turning a Tree-sitter parser into a full-fledged Pulsar grammar takes less than two hours.
Why is this interesting enough to write about?
This Tree-sitter overhaul is the biggest feature to be introduced to Pulsar since it was forked from Atom, and it's a feature that covers a lot of the surface area of the core editing experience.
Other Tree-sitter-integrated editors like Zed, Nova, and Lapce are, to the best of my knowledge, greenfield projects. They are free to invent entirely new conventions.
But we've got a harder job. Atom embraced most of the concepts inherent to TextMate grammars and built major editor features around them. It wouldn't be very user-friendly if we introduced a parallel system with a different set of concepts: it would force users to be aware of which kind of language grammar they're using, and to juggle their mental model accordingly.
But also: most Pulsar users rely on at least a few community packages that were written for Atom and aren't actively maintained. We have to be very careful to break backward compatibility as little as possible, and only when it's absolutely necessary.
For these reasons, we shouldn't just introduce brand new systems for code highlighting, contextual awareness, and the rest. Instead, we'll do whatever we can to make the new Tree-sitter system work within (or identically to) systems that Atom originally shipped with. The Tree-sitter integration can offer enhancements beyond what TextMate grammars do (and it will!), but it's still got to live in the world that TextMate grammars created.
So in order to pull this off, to make modern Tree-sitter grammars work within existing systems, we had to create a brand new set of conventions for writing Tree-sitter grammars. In some places, there was prior art from implementations like neovim's; in others we were flying blind and had to invent things from scratch.
If you're at all interested in how we did it, stay tuned for the rest of this series.