Modern Tree-sitter, part 7: the pain points and the promise
I've spent a number of posts talking about Tree-sitter as though it makes sliced bread look mundane. But it also drives me nuts on a regular basis. Let's wrap up the series by talking about what makes Tree-sitter hard to work with, along with an optimistic look toward the near future.
Tree-sitter is a typical open-source project: the sky's the limit, but there's never quite enough time and attention to go around.
I think it's absolutely amazing what we on the Pulsar team have been able to do in such a short amount of time, especially when using the web-tree-sitter bindings, which appear to be nobody's first choice. Still, there have been bumps in the road. Maybe I can point out a few of them and help out the next Tree-sitter user.
Since I do not want to save any of these gripes for an eighth blog post (and since you probably don't want to read an eighth blog post), this article may be a bit longer than the rest.
Don't get me wrong: nearly all of these challenges have gotten easier in the year that I've been immersed in Tree-sitter. There is cause for optimism, and I'll be sure to point out the upsides as I go.
Tree-sitter challenges
I've tried to group these gripes based on how generally applicable they are. Some of them would be things you'd trip over if you started using Tree-sitter tomorrow, and some might just be Pulsar-specific dilemmas. In order to keep you engaged, let's start with the broader gripes.
It's nobody's day job
Max Brunsfeld worked on Tree-sitter for years before it wound up in Atom. It's no longer just his pet project (there are a small handful of folks with commit rights to the repository) but, of course, they're volunteers, and development happens on a "when-it's-done" schedule. There's a rough roadmap of desired improvements, but not much of a timetable.
Lots of major improvements happen in between major releases. For this reason, most of Pulsar's own Tree-sitter grammars are built from the master branch of their parsers, rather than from tagged releases on NPM.
Yet, despite this uncertainty, lots of projects have embraced Tree-sitter. One of them is a commercial text editor. As we've mentioned, GitHub uses Tree-sitter for code navigation and highlighting on the web. Other companies are building code analysis tools around it.
So it's not going anywhere, and enough people know how it works that it'd have a future even if Max suddenly decided to eschew modern technology and go live simply in the woods.
Still, there's a mismatch here. It's becoming crucial infrastructure for major projects, yet it's nobody's full-time job. It could be somebody's full-time job, but the creator already has that other full-time job, and that doesn't seem like it'll change any time soon.
And yet...
There are six official members of the tree-sitter GitHub organization, but I'd like to mention @amaanq specifically as being an increasingly helpful and prolific contributor in the eighteen months that I've spent in the Tree-sitter ecosystem. Others, like @clason and @sogaiu, idle on Tree-sitter's Discord and Matrix channels and have written enormously helpful documentation.
The presence of other diligent contributors is how Tree-sitter has gotten much better over that span of time despite only sporadic participation from Max.
There's even a roadmap these days, which would've been a shocking display of transparency even six months ago! Things are looking up.
It's hard... easy... hard to write parsers
I could argue both sides of this one depending on how I chose to look at it.
It's easy to start writing a Tree-sitter parser. At the beginning, you're staring at a largely empty JavaScript file. Each individual rule seems easy to add. You keep adding new rules, then testing them, and everything works. You think this is going to be a cake-walk.
Ten minutes later, you've hit a wall. You're staring at some debugging output, trying to figure out how the last rule you added somehow broke everything.
This is quite common! The learning curve of Tree-sitter is... shaped much more strangely than a curve.
Why is this true? And what can be done about it?
First, the design: Tree-sitter is a style of parser that many people will be unfamiliar with. It doesn't backtrack and it doesn't try to group simple nodes into higher-level constructs until after it's decided what kinds of nodes they are. This violates a lot of folks' intuitions about how something should parse. To a certain extent, this is unavoidable, though there are a few loopholes that we'll discuss in a moment.
There are two important tools to help you understand the parsing process.
- tree-sitter --debug will give you exhaustive logging of each step of the parse process; and
- tree-sitter -D will go further and actually build graphs to visualize chains of tokens, opening the results in a web page when the parse is done.
These are great tools, but they output the kind of debug information that you'd want if you already knew exactly how Tree-sitter worked and didn't need any of its decisions explained. There are improvements to be made here.
When Tree-sitter can't solve a problem on its own, you can employ the mother of all loopholes: you can write an external scanner in C. An external scanner can do lots of things that otherwise couldn't be done, because it can both keep its own state and look ahead as much as it wants.
The power that this gives a parser author is certainly appreciated, but it tempts parser authors to define more rules externally instead of inside grammar.js. When I resort to external scanner logic to get me out of a jam, I always suspect that there's a simpler way to solve the problem that I just don't understand.
And if you're like me and you have no real experience in systems programming, you might be intimidated by a system that expects you to dip into C to solve some of your problems.
And yet...
I've had a draft of this blog post half-written for six months, but when I first wrote it, I would never have dreamed that I'd be such an active contributor to the Tree-sitter ecosystem by now.
Despite not feeling like I've got my brain totally wrapped around this whole parser thing, I've now written three small parsers from scratch, and am maintaining a fork of tree-sitter-scss. I've contributed bug fixes and enhancements to tree-sitter-jsdoc, tree-sitter-html, and tree-sitter-css.
Tree-sitter got easier once I understood that most of the crucial decisions are made during lexing, rather than during parsing. Lexing happens first; it's the process Tree-sitter uses to decide what the next token will be. And since there's no backtracking, it's often the root cause of a parsing problem. Headaches around precedence and ambiguity can usually be smoothed out in the parsing phase, but if the lexing is going wrong, none of the parsing-related tools that Tree-sitter gives you will help much.
It's true that the prospect of having to write C steepens the learning curve a bit. But the C that you'd write for an external scanner really isn't that challenging! The vast majority of it involves while loops and switch statements and comparing characters to other characters. The most challenging problem I've tackled in an external scanner is keeping track of state, and there are plenty of examples of scanners that keep state that you can crib from.
Reading the source code of other Tree-sitter parsers is probably the best way to understand Tree-sitter better, and the second-best way is to idle in the Tree-sitter Discord. The Discord isn't incredibly active, but nearly all queries get answered eventually.
With help, and over time, even the most intimidating tools slowly reveal their inner workings. If I can do it, so can you.
The web-tree-sitter experience is second-class
I consider web-tree-sitter to be one of the main success stories of WebAssembly. Before WebAssembly, compiling a C-based library like Tree-sitter to run entirely in a browser would've involved something closer to a ten-fold performance penalty. With WebAssembly, that penalty is small enough that most users won't notice the difference between it and the node-tree-sitter bindings.
In fact, most of the challenges we've faced with web-tree-sitter don't involve performance at all: they involve constraints on the web platform that don't affect other bindings.
stdlib exports
We've talked about how a Tree-sitter grammar can use an external C scanner to do its job. This isn't a problem for web-tree-sitter; Emscripten can compile those scanners to WebAssembly. But the scanner can draw on any functions it wants from the C standard library to help it with this task. This makes sense; why write a custom C function to detect if the next character is whitespace when you can just #include <wctype.h> and use iswspace?
But web-tree-sitter, not an individual parser, is in charge of bundling and exporting these builtins at compile time. That's a dilemma: parsers can choose any functions they want from the C standard library, but web-tree-sitter has to guess which functions they'll pick. Out of the box, it makes good guesses; but if a parser uses a function that's not on that list, that parser will fail at runtime when it tries to use that function and finds that it's just not available.
If you do a thorough audit of a parser, you can make note of which builtins it'll need. But some parsers have quite complex external C scanners. And if you overlook one, you won't find out until runtime.
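Part of that audit can be automated once a parser has been compiled. The following is a minimal sketch, not how Pulsar or web-tree-sitter does it: it relies on the standard WebAssembly.Module.imports() API to list the functions a .wasm file expects its host to supply, and compares them to a list of exports. EXPORTED_BY_HOST and findMissingImports are hypothetical names, and the three functions in the set are placeholders for whatever your build of web-tree-sitter really exports.

```javascript
// Hypothetical list: replace it with the functions that your build of
// web-tree-sitter actually exports.
const EXPORTED_BY_HOST = new Set(['iswspace', 'iswalpha', 'iswdigit']);

// Returns the names of imported functions that the host does not provide.
// `wasmBytes` is a Uint8Array (or Node Buffer) holding a compiled parser.
function findMissingImports(wasmBytes, exported = EXPORTED_BY_HOST) {
  const compiled = new WebAssembly.Module(wasmBytes);
  return WebAssembly.Module.imports(compiled)
    .filter((entry) => entry.kind === 'function') // skip memory, tables, globals
    .map((entry) => entry.name)
    .filter((name) => !exported.has(name));
}
```

In Node, you could feed it the result of reading a parser's .wasm file from disk; an empty array would mean that every function import is covered by the list you supplied.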
Architecturally, Tree-sitter is fixing this by setting firmer rules about which functions you can and can't use in an external scanner; that list I linked above is now a mandate. But I expect that it'll be a while before parsers in the wild are updated to reflect this.
Hence, for now, Pulsar has got to keep track of all the external functions that are used by popular Tree-sitter parsers, and build a custom web-tree-sitter that includes them all. This doesn't make it bloated or slow, from what I can tell, but it is a chore.
And it means that, if a community package wants to contribute a Tree-sitter grammar, its author might run up against this problem and have to open a ticket with Pulsar to get us to add new functions to the exports list. We'd be happy to do so, but then the author would have to wait around for the next stable Pulsar release, and all Pulsar users would have to update to that release before they could install that package.
Emscripten versioning
I'm also not wild about how particular the toolchain is. Turning a parser into a WASM file involves using a precise version of Emscripten that varies based on the version of Tree-sitter that the parser uses. It's not a huge problem, but it certainly steepens the learning curve for contributors.
It also means that we have to keep our build of web-tree-sitter pretty diligently up to date. My experience is that newer versions of web-tree-sitter can consume parsers built against earlier versions, but not vice versa. I'm glad to have backward compatibility, but it does make it rather urgent for us to ensure our web-tree-sitter is as new as possible so that community packages don't have to build their .wasm files with older versions of the tree-sitter CLI.
Lack of custom predicates
Other Tree-sitter bindings have the ability to define their own query predicates. This would be quite helpful for Pulsar; it'd allow us to define predicates with more than a single argument and predicates that assert things about a specific capture instead of the whole query. Yet web-tree-sitter doesn't support this yet. I've opened an issue for this and it's possible I might be the one to contribute this when I get the time.
Until then, our only way of defining custom query logic is by attaching properties to the #is? and #is-not? predicates and doing our own "post-processing" step on the matches that Tree-sitter gives us.
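To make that post-processing step concrete, here is a minimal sketch; it is not Pulsar's actual implementation. It assumes that each capture returned by web-tree-sitter carries assertedProperties and refutedProperties objects filled in from #is? and #is-not? (that is my understanding of the binding, so check it against your version). The test names in TESTS and the function names are invented for illustration.

```javascript
// Hypothetical registry of tests. Each one receives the captured node and
// the optional value given in the query, and returns true or false.
const TESTS = new Map([
  ['test.first', (node) => node.previousSibling === null],
  ['test.type', (node, value) => node.type === value],
]);

// A capture passes when every asserted test succeeds and every refuted
// test fails. Property names with no registered test are ignored.
function passes(capture, tests = TESTS) {
  const asserted = Object.entries(capture.assertedProperties ?? {});
  const refuted = Object.entries(capture.refutedProperties ?? {});
  const run = (name, value) => tests.get(name)(capture.node, value);
  return (
    asserted.every(([name, value]) => !tests.has(name) || run(name, value)) &&
    refuted.every(([name, value]) => !tests.has(name) || !run(name, value))
  );
}

// Keep only the captures that survive their custom tests.
function filterCaptures(captures, tests = TESTS) {
  return captures.filter((capture) => passes(capture, tests));
}
```

With this shape, a query would opt in with something like (#is? test.first) on a pattern, and the editor would call filterCaptures on the query results before using them.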
And yet...
Whatever problems we've had with web-tree-sitter haven't prevented us from shipping a bunch of solid parsers that have held up to stress-testing. (Our biggest problem so far has been with our Markdown parser, and we fixed that by migrating to another Tree-sitter Markdown parser that's more actively maintained.)
Also, this recent change is exciting: compared to native parsers, .wasm files are self-contained and easy to use, so now even the native Tree-sitter bindings can consume them! The prospect of being able to use .wasm files in this manner means that web-tree-sitter concerns will also be relevant to Tree-sitter's Rust and C bindings.
Tree-sitter has also been cracking the whip on external scanners recently; C++ external scanners have been deprecated, and (as mentioned earlier) scanner authors no longer have freedom to use anything they want from the C stdlib. Practically speaking, this isn't a hardship; if anything, a lack of prescription from Tree-sitter was leading to sloppy choices. (For instance, some parsers wrongly use isalpha instead of iswalpha, the latter being more proper for Unicode text.)
And the feedback loop is getting shorter: you will now be warned at compile time when you try to make a .wasm file from a parser that consumes functions that are not in the default list of exports.
Error recovery is a black box
Tree-sitter is not a fail-on-first-error sort of parser. That strategy would make it completely unsuitable for the task of syntax highlighting in an editor. Imagine adding a new style rule in the middle of a CSS file... and having everything after the cursor flicker as you type, because it's reacting to the fact that the CSS file is invalid until you're mostly done typing.
Invalidity in Tree-sitter is indicated with ERROR and/or MISSING nodes in the tree. Tree-sitter will parse some tokens, notice that they don't add up to any valid rule, and decide on the least costly way to get itself back to a valid state. That could mean skipping over the token that put it into an error state (producing an ERROR node). Or it could mean assuming the presence of a node that isn't there (producing a MISSING node).
The fact that this process exists is what makes Tree-sitter good for a code editor scenario in which a document frequently flips between "valid" and "invalid." When it works well, your syntax highlighting is only minimally affected by the invalidity.
But the fact that it's largely a black box is frustrating, especially in rare scenarios in which Tree-sitter makes a catastrophically bad choice in how to recover. The "costs" of various error recovery strategies are determined by Tree-sitter in ways that may not make much sense to an outside observer.
Here's a very simple example from the tree-sitter-css parser:
div {
  justif
}
If your cursor is at the end of justif, you're probably about to type something like "y-content: space-between;". Because of recent enhancements to CSS, the justif token could turn out to be one of several things... but it's probably just a property name! And the parser should assume it's a property name until it's certain that it isn't.
Yet tree-sitter-css doesn't parse it this way:
(stylesheet [0, 0] - [3, 0]
  (rule_set [0, 0] - [2, 1]
    (selectors [0, 0] - [0, 3]
      (tag_name [0, 0] - [0, 3]))
    (block [0, 4] - [2, 1]
      (ERROR [1, 2] - [1, 6]
        (attribute_name [1, 2] - [1, 8])))))
Instead, it's chosen to interpret justif as an attribute name. Faced with the theoretical ambiguity of whether justif will end up as a property name or a tag name (remember that you can nest CSS selectors now!)... Tree-sitter has chosen a third option that isn't valid there.
There are various annoyances like this in the tree-sitter-css parser. Maybe they can be fixed without new Tree-sitter features, but to me it doesn't seem like tree-sitter-css is poorly written; it's just that CSS itself may be designed in a way that makes it especially susceptible to the downsides of Tree-sitter's design decisions.
This is especially painful because the autocomplete-css package tries to inspect scope information to determine which completions to suggest at the position of the cursor. The legacy TextMate-style grammar is much better at interpreting incomplete lines than the modern Tree-sitter CSS grammar, so this isn't just a cosmetic issue; it means that Pulsar is now worse at suggesting CSS completions. (We have tentative plans to address this by no longer using scope information as the primary driver of suggestion information, but it hasn't gotten the necessary attention yet.)
This is something that must get better if Tree-sitter wants to realize its potential. Hence there are two different requests I'd like to make as an occasional parser author:
- Give authors the ability to influence the error recovery process by hinting at the "cost" of a missing or skipped token.
- Allow authors to anticipate certain common invalidities and handle them by returning explicit ERROR and MISSING tokens.
Of these two requests, the first feels much more reasonable. I haven't fleshed out how the second idea would work, or even whether it's a good idea; it could just be a footgun.
But I know exactly how I want Tree-sitter to parse the incomplete CSS above, and it's frustrating that I can't just say so.
And yet...
This is a known issue and it's on Max's long-term roadmap. The optimism is more remote here because the problems with solutions that exist only in Max's mind have, well, more bottlenecks than other problems.
For selfish reasons, I'm highly interested in this getting done, but if I wanted it done on my timetable I probably should have majored in computer science instead of journalism.
You're at the mercy of the parser author
Suppose TypeScript introduces a new syntactic construct in a minor version, and suppose you're a diligent TypeScript user who starts using the new version on day one. Will your syntax highlighting be ready for the new feature?
In this example, you're in luck: since the TypeScript release process is so gradual, chances are good that the tree-sitter-typescript parser will have been updated to support the new syntax on day one. Pulsar releases monthly, which should mean plenty of lead time for us to see the changes to tree-sitter-typescript and generate a new parser for the next release.
Now imagine you're using a more obscure parser, one written by a third party rather than maintained by the tree-sitter GitHub organization. SCSS is a good example: you can see that it's got a decent parser, but there are lots of valid SCSS constructs that that parser doesn't support, and the last commit to the repo is thirty months old.
Do you know enough about Tree-sitter to fix it yourself? And if so, what do you do if the repo is dead? Do you fork it and increase the bounds of your reluctant code ownership? (This is how I "solved" the SCSS example; my fork fixes most of what the original repo didn't support, but I am not an enthusiastic maintainer.)
Now suppose it isn't obscure, just especially technically challenging. Would you be up for maintaining it? (Writing a tree-sitter-bash parser strikes me as a startling act of hubris. I think even the precious few among us who consider themselves good writers of shell scripts would balk at writing a tool that tried to understand shell scripts.)
All of these headaches could just as easily apply to a TextMate grammar, of course. But because TextMate grammars are less ambitious, they tend to recover better from constructs they don't understand, and they tend to be easier to fix than Tree-sitter grammars because they don't require such deep understanding of their internals.
And yet...
Everything's obscure and inscrutable until it isn't. History teaches us that the way out of this dilemma is to get more people using and depending on Tree-sitter, something that's already happening on its own.
Another good sign is the emergence of the tree-sitter-grammars organization. Until recently, there were two kinds of grammars: first-party grammars that lived in the tree-sitter organization, and grammars that were maintained by third parties. The presence of the latter kind of grammar is promising (other people are using Tree-sitter!) but also a bit anxiety-inducing (the author could get bored or distracted at any point!), and I've made pull requests for several third-party grammars that languished for months. (It's also why I maintain a tree-sitter-scss fork.)
Many of the most high-profile third-party Tree-sitter grammars have migrated to the tree-sitter-grammars organization in the last few months. It's a small gesture, but it helps reassure me that those grammars won't just languish if their original authors stop contributing.
Pulsar-specific challenges
Now we're getting into challenges that might only be laid bare by Tree-sitter. A less flattering heading might have read "Corners we've painted ourselves into."
Memory needs managing
When you work with WebAssembly for the first time, it may surprise you to learn that you'll probably need to manage your own memory.
JavaScript developers have it pretty easy when trying to prevent memory leaks: they can rely on garbage collection once an object has no more strong references to it. But WebAssembly operates outside of the engine's standard GC process and requires explicit freeing of resources. A WASM module will be allocated a certain amount of memory when loaded; the author can choose the size of that allocation and whether it can grow over time. But once the module hits its maximum allowable memory usage, that's it. Any further attempts to allocate will trigger exceptions.
In web-tree-sitter, if I call parse on a buffer and get back a tree object, I am now in charge of the lifecycle of that object. The memory that has been allocated to build that object cannot be reused until I destroy that object. I can't just null out a variable reference and rely on the engine to take care of it.
Tree-sitter gives us the tools we need here; every tree holds a delete method that will dispose of it and free the associated memory. Since trees are created and thrown away with each keystroke, this is incredibly important for Pulsar to keep track of. When a tree is stale, we must explicitly dispose of it. Indentation hinting sometimes forces us to do an extra middle-of-transaction tree parse; if that happens, we have to save a reference to it until the next parsing cycle and clean it up along with the other stale trees.
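The bookkeeping described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than Pulsar's code: TreeTracker and its method names are hypothetical, and the only thing it expects from a tree is the delete method mentioned above.

```javascript
// Tracks parsed trees and frees the stale ones explicitly, since the
// garbage collector will not reclaim their WebAssembly memory for us.
class TreeTracker {
  constructor() {
    this.current = null;
    this.stale = new Set();
  }

  // Call after each parse. The previous tree becomes stale, but it is kept
  // until the next cleanup in case something still needs to read from it.
  adopt(tree) {
    if (this.current && this.current !== tree) this.stale.add(this.current);
    this.current = tree;
  }

  // Register an extra tree (for example, one from a mid-transaction parse)
  // so that it is cleaned up with the others.
  retire(tree) {
    if (tree !== this.current) this.stale.add(tree);
  }

  // Free every stale tree and report how many were freed.
  disposeStale() {
    let freed = 0;
    for (const tree of this.stale) {
      tree.delete();
      freed++;
    }
    this.stale.clear();
    return freed;
  }
}
```

The design choice worth noting is that disposal is deferred to one well-defined point in the cycle, so nothing reads from a tree that was freed halfway through a transaction.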
And if a community package wants to hook into the Tree-sitter lifecycle to do something cool, it can't just keep a reference to a parsed tree and assume it'll hang around forever. If it tries, it will discover later that the tree is a useless stub of its former self because we called delete on it. Instead, it must first await until a fresh parsed tree is available; often one will be available immediately, but not always. Then it must choose between (a) processing the tree synchronously before it has a chance to go stale, or (b) explicitly copying the tree via a copy method, in which case it's in charge of managing the life-cycle of that copy.
Tree-sitter is a major opportunity for community packages; they can query the tree just as easily as Pulsar core. But it's also a dilemma for the Pulsar team, because it's not clear how to expose this to package authors. They deserve to have access to parsing information, but we also don't want a community package to break because of a breaking API change in web-tree-sitter.
Ultimately, whether we offer community packages direct access to the tree or mediate it via some sort of wrapper, it's not possible to conceal these quirks from package authors. The Pulsar API has taught them to be dutiful about resource management with Disposable objects and onDidDestroy callbacks... but this is another level entirely. If you want to do anything clever with Tree-sitter in a package, please bear it in mind.
And yet...
Everything I just complained about is, as of very recently, not necessarily how WebAssembly works anymore. More implicit strategies for garbage collection are suddenly possible.
As far as I understand, we wouldn't get this for free; we'd need web-tree-sitter to support it, and we'd only reap the benefits once we can move to a modern Electron version. But it's still refreshing to envision a future in which dealing with a WASM library feels less like trying to solve a Rubik's cube while wearing rubber dishwashing gloves.
(One reason we might not want to offer direct access to web-tree-sitter APIs is to preserve the option of migrating back to node-tree-sitter in the future. But there were recent efforts to harmonize the API differences between the two bindings, so that might make life easier for our future selves.)
Parsing performance needs managing
For the vast majority of files that a user will edit, I believe that our modern Tree-sitter system will highlight code with at least comparable performance to the equivalent TextMate grammar system. It can handle files that are thousands of lines long; I know because the file that implements most of the systems I've been describing is 4200 lines long, and it worked swimmingly even before I applied optimizations. In my experience, it applies syntax highlighting much more quickly than a TextMate grammar does when loading a large file.
But a text editor has a way of finding all of your edge cases. Under a worst-case scenario (a gigantic file with very long lines) both systems will fall down. But Tree-sitter will fall down sooner. That's because it's still a system that, by design, must parse the entire file at least once. That approach won't scale to certain kinds of files, like log files, that a user could plausibly try to open within Pulsar.
To ensure that we don't lock up the editor when faced with a monumental parsing task, Tree-sitter parses can go asynchronous. Right now, a parse is only allowed to run for three milliseconds at a time before we pause it so that other tasks can run.
If a parse takes more than three milliseconds, it's almost certainly the initial parse that happens when a file is opened. Parsing will continue for as many three-millisecond blocks as are necessary, with long enough pauses in between to ensure the UI is still responsive. And edits (even edits to gigantic files) are lightning-fast because most of the earlier work can be reused.
I think there's still some work to be done at determining the best compromise between job time (how long we let a parse run before pausing it) and idle time (how long we wait in between jobs to allow other code to run). Right now, idle time is very low, and I could entertain an argument for it to be increased; but if we did that, we'd probably want to increase job time to compensate for it. Ultimately, the goal is that we allow the editor's display layer enough time to do the work it needs to hit its usual 60 frames per second, which means we're talking about how best to spend the 16.67 milliseconds we get to paint each frame.
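The job-time and idle-time trade-off can be sketched generically. This is not how Pulsar or web-tree-sitter implements it; it only illustrates the idea with a job expressed as an iterator of small steps. The three-millisecond job time and the frame budget come from the text; IDLE_TIME_MS is a made-up value (the text only says idle time is very low), and the function names are hypothetical.

```javascript
const JOB_TIME_MS = 3;             // from the text: how long a parse may run
const IDLE_TIME_MS = 1;            // hypothetical value for the pause between jobs
const FRAME_BUDGET_MS = 1000 / 60; // about 16.67 ms per frame at 60 fps

// Runs steps from `job` (any iterator) until it finishes or the time budget
// is spent. Returns true when the job is complete.
function runSlice(job, budgetMs = JOB_TIME_MS, now = () => performance.now()) {
  const deadline = now() + budgetMs;
  do {
    if (job.next().done) return true;
  } while (now() < deadline);
  return false;
}

// Alternates slices of work with pauses so that other tasks can run.
async function runInSlices(job, jobMs = JOB_TIME_MS, idleMs = IDLE_TIME_MS) {
  while (!runSlice(job, jobMs)) {
    await new Promise((resolve) => setTimeout(resolve, idleMs));
  }
}
```

Under these assumed numbers, one job plus one pause takes about 4 ms, so roughly four cycles fit into a single 16.67 ms frame. Raising idle time leaves more room for painting but stretches out the total parse, which is the compromise described above.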
Indentation hinting might be the biggest threat to that time budget. TextMate-style indentation hinting is cheap because it can just execute a regular expression against one line of content, but Tree-sitter indentation hinting requires a fresh syntax tree. Sometimes we have to reparse earlier than we otherwise would've just to deliver accurate hinting. In the most common cases, this isn't a big deal; but some scenarios would require us to do lots of reparses in quick succession, and we've had to develop strategies to handle those cases. (I won't bore you further with this; you can read the source code if this topic fascinates you.)
And yet...
I believe that raw parsing speed in web-tree-sitter is pretty well optimized. But I also think that, over time, the performance penalty of WebAssembly in Chromium can only decrease as more and more attention is paid to it. This is one reason why I'm eager to upgrade our version of Electron and enjoy any optimizations to WASM and V8 that have landed in Chromium in the last couple of years.
And we've got something in our back pocket: the node-tree-sitter bindings have (theoretically) been updated so that they can be used in Electron apps. Migrating back to node-tree-sitter would surely improve parsing and querying speed. But for all its headaches, web-tree-sitter's .wasm files really do make distribution easier. They don't have to be built for the user's architecture, nor rebuilt when the version of Electron changes. If we made this change, it'd have to have large upside to justify the whiplash for grammar authors.
Highlighting performance needs managing
Pulsar uses an internal library called text-buffer. That's the library that handles buffer rendering. It uses pure DOM manipulation without a library and it's fast. It's one of the few things we haven't touched at all since the fork.
But one of the biggest bottlenecks in Pulsar has to do with the fact that text-buffer can only apply highlighting in increments of buffer lines. In the worst-case scenario (a large file without any newlines) that means that every single change made to the buffer forces the entire file to be re-highlighted.
In one sense, this is a small problem; 99% of files you'll encounter in an editor have a reasonable number of characters per line. But in another sense, it's a big problem, since Pulsar is typically thoughtful enough not to waste time trying to highlight regions of the screen you won't even see. If a large file has no newlines, we've got no choice but to try to highlight the entire thing.
This bottleneck affects all grammars alike, so it's not new information. And it's something that we could try to fix! Speaking personally, though: I think it'd be a high-risk change. I'd want to understand text-buffer much better than I currently do before taking a stab at it.
In the meantime, there are already checks in place that limit syntax highlighting on very large files. But right now, the systems we have to detect those kinds of files aren't robust enough. We disable syntax highlighting if a file exceeds a certain size, but this does nothing to address files that have extremely long lines even if the files themselves aren't enormous.
When we know a long buffer line will give us trouble, we should default the user to a plain-text grammar and offer them the choice of opting into syntax highlighting if they're willing to risk it, just like we do for large files today.
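A check along those lines could be as simple as the following sketch. It is a hypothetical illustration, not an existing Pulsar function: the names and both thresholds are invented, and lengths are measured in UTF-16 code units (what String.prototype.length reports) rather than bytes.

```javascript
// Hypothetical thresholds; a real editor would make these configurable.
const MAX_FILE_LENGTH = 2 * 1024 * 1024;
const MAX_LINE_LENGTH = 10000;

// Length of the longest line, scanning once without splitting the string.
function longestLineLength(text) {
  let longest = 0;
  let current = 0;
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) === 10) { // newline
      if (current > longest) longest = current;
      current = 0;
    } else {
      current++;
    }
  }
  return Math.max(longest, current);
}

// True when the buffer should open as plain text until the user opts in.
function shouldDefaultToPlainText(text, options = {}) {
  const { maxFileLength = MAX_FILE_LENGTH, maxLineLength = MAX_LINE_LENGTH } = options;
  return text.length > maxFileLength || longestLineLength(text) > maxLineLength;
}
```

The point of the second condition is that a modest file consisting of one enormous line is caught even though it passes the size check.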
And yet...
This one is wholly within Pulsar's control to fix; it's just a matter of finding time, or finding more contributors.
Query predicates and footguns
Itâs a strange gap in Tree-sitter query syntax that thereâs no way to test the position of an anonymous node. Because weâd need that fairly often, I had to come up with a way of applying that constraint on our own. I invented conventions for applying additional tests on captures using the #is?
and #is-not?
predicates.
Those first few custom predicates were simple utilities that tested straightforward things about a node and how it related to the rest of the tree.
But because custom predicates are implemented in JavaScript, theyâre a powerful bridge into the rest of the Pulsar environment. And once I saw that they were quite fast in practice, I got bolder and wrote other sorts of predicates that integrated with other systems.
For instance, test.config
â passing or failing a predicate based on the state of a Pulsar configuration value â feels weird because it talks to a different Pulsar system that has nothing to do with code parsing; yet the case for its existence is too strong to be denied. Weâve already used it in language-javascript
and language-typescript
to deliver indentation hinting thatâs much more helpful than what we had before, but also more opinionated. So itâs crucial that they be configurable â we wouldnât dare to make some of these decisions for users if they had no way to disable them.
Right now, Tree-sitter grammars can only use the predicates that we make available to them. Should we allow packages to write their own? Or would that be like giving them a loaded gun and pointing it at the userâs foot?
Here's an example: the CSS and PHP grammars both have `highlights.scm` files where some scope names need to execute a big regular expression against the contents of a node. There are lots of CSS properties and values, and there are even more functions in PHP's large and chaotic standard library. And CSS especially is a moving target, with new properties and values added every year.
Ideally, this sort of information wouldn't live in a `highlights.scm` file. To me, if individual grammars had the ability to define their own query predicates, this would be the killer application:
- Define a JSON file with nothing but names of functions (or properties or whatever else), optionally grouped by some criterion.
- Have a language grammar parse that file on activation and put those functions into one or more `Set`s.
- Allow the language grammar to define a `language-foo.isKnownFunction` predicate that does nothing but test for presence of a function name in any of these sets.
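The three steps above could be sketched roughly as follows. Pulsar has no public API for package-defined predicates today, so everything here is an assumption made for illustration: the inline JSON stands in for a file shipped with the package, and `buildFunctionSets` and `makeIsKnownFunction` are hypothetical helpers. The only thing the predicate relies on is a `text` property on the node it is given.

```javascript
// Stand-in for the contents of a JSON file shipped with the language
// package. The groups and names are illustrative only.
const KNOWN_FUNCTIONS_JSON = `{
  "string": ["strlen", "str_replace", "substr"],
  "array": ["array_map", "array_filter", "in_array"]
}`;

// On activation: parse the file once and build one Set per group.
function buildFunctionSets(json) {
  const groups = JSON.parse(json);
  const sets = new Map();
  for (const [group, names] of Object.entries(groups)) {
    sets.set(group, new Set(names));
  }
  return sets;
}

// Hypothetical predicate factory. The returned function does nothing
// but test whether the node's text is present in any of the sets.
function makeIsKnownFunction(sets) {
  return function isKnownFunction(node) {
    for (const set of sets.values()) {
      if (set.has(node.text)) return true;
    }
    return false;
  };
}

const isKnownFunction = makeIsKnownFunction(
  buildFunctionSets(KNOWN_FUNCTIONS_JSON)
);
```

Because the JSON is parsed once at activation, each predicate call costs only a handful of `Set` lookups, and the same data could feed other features without touching the query file.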
That'd probably be even quicker than a regex test. It'd keep the `highlights.scm` much sleeker. And it'd put the source of truth into a much more maintainable structure (a JSON file) which could be used for other purposes, like autocompletion suggestions.
So what's the problem, you might think. Well, to give a language package the ability to do a very smart thing like this, I'd also have to give it the ability to do lots of potentially stupid things. A poorly designed custom predicate could instantly become a major drag on editor performance without it even being obvious what the culprit is. When editor performance suffers, people get annoyed! They might complain to us instead of the package author, which is fair, because how are they supposed to know the real source of the problem?
And yet…
Ultimately, I think this is too powerful of a tool to justify not using it for our own language packages. And if our own language packages can use it, so can community packages; to do otherwise would be a violation of Pulsar's ethos.
So I think the best way forward would be to define ground rules for what custom predicates can do and how long they can take to do it, even to the extent of setting time budgets and warning when they're too slow.
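To make the time-budget idea concrete, here is one way it might look: a wrapper that times each call to a predicate and warns when the call runs over budget. This is only a sketch under my own assumptions. `withTimeBudget`, the 1 ms default, and the injectable `now` and `warn` options are all hypothetical, not anything Pulsar provides.

```javascript
// Wrap a predicate so that calls exceeding the budget get reported.
// All names and defaults here are hypothetical.
function withTimeBudget(name, predicate, options = {}) {
  const {
    budgetMs = 1,
    // The clock and the reporter are injectable so they can be swapped.
    now = () => performance.now(),
    warn = (message) => console.warn(message)
  } = options;

  return function budgetedPredicate(...args) {
    const start = now();
    const result = predicate(...args);
    const elapsed = now() - start;
    if (elapsed > budgetMs) {
      warn(
        `Predicate ${name} took ${elapsed.toFixed(2)}ms ` +
        `(budget: ${budgetMs}ms)`
      );
    }
    // The result is passed through unchanged; this only observes.
    return result;
  };
}
```

Including the predicate's name in the warning is the important part: it points the user at the package responsible rather than leaving them to guess why the editor feels slow.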
Why are you telling me this?
Other than the fact that the internet is largely for complaining, what's the point of this airing of grievances?
I don't want you to misunderstand. I'm now a regular Tree-sitter contributor and have just written six blog posts about how cool it is, so I'm not trying to be a buzzkill here.
Highlighting source code is hard! Here's how I know:
- The most popular code editors of the past decade have highlighted your source code with a system created 20 years ago by a macOS-only editor that most people had stopped using by 2010.
- That system can only work alongside a specific regular expression engine with ornery syntax.
- Microsoft (a huge tech company that can afford to have a team of people work on a code editor that they distribute for free) decided the best way to deliver a consistent highlighting experience across that editor's desktop and web versions was to port that obscure regular expression engine to WebAssembly.
That's how much editor authors don't want to write their own syntax highlighting system. If it were easy, someone would've done it by now!
If you're wondering why Tree-sitter is gathering so much momentum despite being around for years and still not having a 1.0 release… it's because it's clearly better than what we had before, warts and all.
You still haven't sold me
As this series illustrates, I'm excited about the handful of ways that Tree-sitter makes Pulsar a better editor. But if you feel like you're being swept up in these grammars that you don't care much about, I want to make sure you know that there are solutions! Tree-sitter makes grammars customizable enough to accommodate even the tiniest and most arbitrary of gripes.
Even if your nitpick is as small as "I want `this` in JavaScript to be a different color than it is"… hop into Discord and ask for help. We'll be able to give you the right snippet to put into your user stylesheet.
But if it's more fundamental than that, and you just want the highlighting you were used to, then I'll state it again for safety: the original TextMate-style grammars aren't going anywhere. We're not even tempted to drop support for that system. It would break backward compatibility and make lots of our users unhappy… without delivering any meaningful progress in performance, complexity, or bundle size.
Nearly all built-in languages have a TextMate-style grammar you can fall back on. If a particular Tree-sitter grammar causes you pain for whatever reason, you can selectively revert to the TextMate grammar with a scope-specific setting in your `config.cson`:

    ".source.css": # (or any other grammar)
      core:
        useTreeSitterParsers: false

You may be relying on a legacy Tree-sitter grammar; and if so, I regret to tell you that that grammar will be going away soon. But since legacy Tree-sitter offered a very similar set of trade-offs as modern Tree-sitter does, I'm much more confident that the modern version of that grammar will be able to meet your needs.
You've reached the end of the series
If you've read them all, give yourself a round of applause. And if you find this subject fascinating enough to have read the whole thing, and you're not already a Pulsar contributor, then I bet we could use your efforts somewhere. Visit our Discord or any of the other communities listed in the menu above if you'd like to have technical discussions about stuff like this.