Hacker News
How to Get Started with Tree-Sitter (masteringemacs.org)
243 points by robenkleene on May 28, 2023 | 67 comments



Does anyone know why GitHub code navigation, which is based on tree-sitter [1], supports only a subset of the languages for which tree-sitter has parsers [2]? For example, Common Lisp is not supported. I asked GitHub [3], but they have been silent.

1 - the "search based" navigation is based on tree sitter https://docs.github.com/en/repositories/working-with-files/u...

2 - the list of parsers in the official docs: https://tree-sitter.github.io/tree-sitter/

3 - https://github.com/orgs/community/discussions/55704


While we're talking about this, I've been quite disappointed by the new GitHub file browser: the new features look neat, but the core file browsing experience is significantly worse for me because the new UI is much more complicated and every action seems to do way more work.

I couldn't immediately find a way to opt out; I wouldn't mind the new file browser so much if I could drop into it on demand instead of having it forced down my throat.


I’ve had few (some, but minimal) problems with the new file tree browser, but the actual individual file view has become a minefield. It hijacks find in page, presumably to enhance it with semantic relations, but it breaks frequently when navigating. Previous searches (cmd-G) will fall back to the browser’s native search. If you’re lucky, you searched the same term in both, and if you’re luckier still the search term is in the top N lines of code (because the code itself is lazy loaded I guess?). Most of the time I’m not lucky and I have to start over each time I view a different file. Even though it feels more like a “single page app” than it previously did, I lose all of that state every time I do anything.

I get what they’re trying to do, and in some ways it’s clearly beta growing pains, but in other ways it’s clear they’re fighting the browser environment in ways that don’t and probably shouldn’t work.


I ended up blocking keyboard events from a userscript to deal with the keyboard hijacking. This combined with opting out of all the "feature previews" has made it more or less usable again for now. But the writing is on the wall, I am not using GH for any new projects.


It was an optional opt-in labs feature that I tried for an hour then turned off, but now they force it.

The new UI is more busy for no good reason. Sidebar makes sense in a local IDE, but over web every time you expand a subtree it takes ages anyway. And because it merely exists it is distracting and makes pages load slower...


Click on your profile in the top-right, and then in the resulting menu, click on 'Feature preview'. You might be able to disable it from there.


It’s not there anymore, presumably because it graduated from beta earlier this month: <https://github.blog/changelog/2023-05-08-the-new-code-search...>


I believe they use semantic (Haskell program that uses tree-sitter) for navigation: https://github.com/github/semantic

So the answer may be that semantic does not yet have support for the language in question.


correct. adding support for a new language is a matter of opening up a pull request with the appropriate glue code. https://github.com/github/semantic/blob/793a876ae45d38a6bd17...


It would be more realistic to think of pull requests if GitHub allowed per-repo configs with custom parsing scripts, similar to GitLab's LSIF support - https://docs.gitlab.com/ee/user/project/code_intelligence.ht.... Otherwise, how can a contributor be sure that his pull request really produces convenient code navigation in the GitHub UI? There is no way to test it.


Semantic no longer seems to be used:

> During this time, the team migrated from a Semantic-based tagging service to one that operated entirely with (newly developed) Tree-sitter queries. Though Semantic performed well, using the Tree-sitter query language allowed faster iteration and avoided the operational overhead of a program-analysis framework.

https://queue.acm.org/detail.cfm?id=3487022



I believe this paper has an answer to your question - https://queue.acm.org/detail.cfm?id=3487022


Thank you very much for the link.

I don't see an answer in there, though.

They use tagging (as also described here: https://tree-sitter.github.io/tree-sitter/code-navigation-sy...). From both docs I assume `tree-sitter tags` works out of the box for any language that has a parser. (Since neither doc says to supply a custom config for the tag extraction query, I assume every language plugin provides tagging queries.)


I worked with tree sitter a bit for a small project. It seems to be the case that, while it covers a large set of languages, the actual implementation of each subset doesn't follow the exact same underlying API. And it follows that there is no universal AST parser as far as I can tell. Languages have many types of atomic parts to such an extent that unifying them is a tremendously complex task.


Have you used the `tree-sitter tags` command?


Unfortunately not. My stint of use was focused on dynamically loading any given tree-sitter-languageName package via Docker and then parsing any given file type in a runtime environment, in order to coalesce differing languages' parts into a unified metadata structure.

So, I wasn't trying to identify named things, like the tags command seems to do, but more generic parts of the languages like function, var, import/export statements, etc. I ultimately made my own walk functionality, but pulling out those atomic parts per language is still a very heavy task (you basically need a mini lexer for each lang), even though TS seems to provide all the info to do the task.
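
For illustration only, here is a minimal Python sketch of that kind of walk: a per-language table maps grammar-specific node types onto a few unified kinds, and a generic traversal emits one record per recognised node. The `Node` class is a hypothetical stand-in for a parsed syntax node, not the Tree-sitter API, and the node type names in the table are assumptions that should be checked against each grammar's node-types.json.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumed node type names per grammar; verify against each grammar's
# node-types.json before relying on them.
UNIFIED_KINDS = {
    "python": {
        "function_definition": "function",
        "import_statement": "import",
        "import_from_statement": "import",
    },
    "javascript": {
        "function_declaration": "function",
        "import_statement": "import",
        "export_statement": "export",
        "variable_declaration": "var",
    },
}


@dataclass
class Node:
    """Hypothetical stand-in for a parsed syntax node."""
    type: str
    start_line: int
    children: list[Node] = field(default_factory=list)


def collect(language: str, root: Node) -> list[dict]:
    """Walk the tree depth-first and emit one record per recognised node."""
    kinds = UNIFIED_KINDS[language]
    records = []
    stack = [root]
    while stack:
        node = stack.pop()
        kind = kinds.get(node.type)
        if kind is not None:
            records.append({"kind": kind, "type": node.type, "line": node.start_line})
        # Push children reversed so they are popped in source order.
        stack.extend(reversed(node.children))
    return records


# A tiny hand-built tree standing in for a parsed JavaScript file.
js_tree = Node("program", 1, [
    Node("import_statement", 1),
    Node("function_declaration", 3, [Node("variable_declaration", 4)]),
])
records = collect("javascript", js_tree)
```

The traversal itself is language-agnostic. The heavy part described above is the table, which has to be written and maintained separately for every grammar.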


ah, easy. it's because support has not been added into https://github.com/github/semantic which is the tech that powers the GitHub UI. Adding support is pretty easy/mainly glue code [1] that imports the tree sitter API.

[1] https://github.com/github/semantic/blob/793a876ae45d38a6bd17...


In my case, Tree-Sitter grammar for C has many shortcomings (ignore preprocessor bugs, they can't be solved in the current TS model anyway): https://github.com/tree-sitter/tree-sitter-c/issues


title should include "in Emacs"


Given how powerful Emacs is and how important it has been for my computing over the past four decades, I think it would be more useful to me for people to label all non-emacs articles [Not Emacs]


Emacs is relatively obscure now. It is most practical to assume that all editor-related articles are about Visual Studio Code unless indicated otherwise, since a majority of developers use that. Vim if the article concerns in-terminal editor use.


sar·casm

noun: sarcasm; plural noun: sarcasms

the use of irony to mock or convey contempt.

"his voice, hardened by sarcasm, could not hide his resentment"


his voice hardened... I didn't hear that in the comment. Maybe a /s at the end of the comment would have made up for the missing body language or verbal cues.

/s


I think adding "in Emacs" to the title of an article published on Mastering Emacs Dot Org might be a bit redundant.


Scrolled over the title (thought it was some programming algorithm concept) and then by chance I've read the site name saying it's about Emacs which prompted me to open the discussion.

Maybe on some setups the site name is not as easily visible as the title?


What’s visible on the HN page is “masteringemacs”, which is just nonobvious enough to parse (“master in <whatever>”) that it doesn’t register.


sometimes a bit of redundancy doesn't hurt


Is everyone supposed to comprehend the domain name of the post?


There is a very cool gif demo at the bottom showing contextual editing with multiple cursors using an extension

Some pictures/video or gif would have been really nice at the top of the article to get a quick idea of what I’ll be able to get by following the guide


Good shout. I've added another example image near the top, instead of burying the lede for people wondering what it's about.


Amazing, thank you


Mickey Petersen's writing is like that of an explorer charting wildernesses that only a few have yet had the capacities to visit.


I used Tree-sitter to navigate and extract syntactic information from source code files in multiple programming languages. It was great that I didn't have to use multiple parser (generator) libraries.


Focusing on the Tree-Sitter side, it's something I've been meaning to look into. We have a lot of Oracle SQL files that make up our ETL pipeline at work, and I would love to be able to visualize the flow of the data. I know there are some products out there but our ETL was written from scratch without any tooling besides Oracle SQL and bash scripts (and cron).

I feel like a well written Tree-Sitter grammar should allow me to parse the files and follow the data from source to dashboard.


I believe there is a grammar for SQL, but it probably doesn't support a lot of the vendor-specific language extensions you might be using. You could end up writing your own grammar or heavily extending any grammar that already exists. There might already be libraries specific to parsing Oracle SQL into a syntax tree, at which point Tree Sitter isn't really adding value.

That said, the ability to query for things like "all SELECT statements where table 'xyz' is referenced in any table identifier" is very powerful.


That's a good idea, that would save me starting from scratch.

We do use a fair bit of vendor-specific language, would be cool to make a good Oracle version and then commit back to the project.


I take it that many people like the idea of tree-sitter but don't really know how to use it. Tree-sitter does have documentation but it's not that useful.

It would be interesting to know how many sales you get from a niche book like this though.


There's not much end-user documentation the Tree-sitter project can really provide. It's a programming interface. The stuff that an Emacs, NeoVim, ... user is actually interacting with depends on the integration in the editor. So it's up to those docs to explain what features they expose based on Tree-sitter.

For example, the multiple cursor thing at the end is enabled by having the concrete syntax tree from Tree-sitter's parse, but Tree-sitter has nothing whatsoever to do with the cursors: it just provides locations in the text based on particular queries (like "all `identifier` nodes named 'foo' that are sub-nodes of this other node").

Even the headliner feature, syntax highlighting, isn't provided directly by Tree-sitter. It's up to the client system to inspect the syntax tree and apply attributes to its rendered text -- however it does that rendering.
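
To make that division of labour concrete, here is a toy Python model. It is a sketch under stated assumptions, not the Tree-sitter API: `SyntaxNode`, `descendants`, and `find_identifiers` are hypothetical names, and the tree is built by hand instead of by a parser. The "query" only returns text ranges. Putting cursors or colours on those ranges is left to the editor.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class SyntaxNode:
    """Hypothetical, simplified syntax node (not the Tree-sitter API)."""
    type: str
    text: str
    start: tuple[int, int]  # (row, column)
    end: tuple[int, int]    # (row, column), exclusive
    children: list[SyntaxNode] = field(default_factory=list)


def descendants(node: SyntaxNode) -> Iterator[SyntaxNode]:
    """Yield every node below `node`, depth-first, in source order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def find_identifiers(scope: SyntaxNode, name: str) -> list[tuple]:
    """Return (start, end) ranges of `identifier` nodes named `name` under `scope`."""
    return [
        (n.start, n.end)
        for n in descendants(scope)
        if n.type == "identifier" and n.text == name
    ]


# Hand-built tree for the indented line "    foo = foo + bar" on row 1.
body = SyntaxNode("block", "foo = foo + bar", (1, 4), (1, 19), [
    SyntaxNode("identifier", "foo", (1, 4), (1, 7)),
    SyntaxNode("identifier", "foo", (1, 10), (1, 13)),
    SyntaxNode("identifier", "bar", (1, 16), (1, 19)),
])
ranges = find_identifiers(body, "foo")
```

An editor integration would take `ranges` and place one cursor (or apply one face) per range. Nothing in the lookup knows about cursors at all.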


The query language and CLI tools are absolutely user-facing features on their own. But even if they aren't, developers still need to be able to learn from the docs. Currently that's very difficult to do. My experience is that you have to rely heavily on copying other people's examples in order to make progress.


Point taken about the developer documentation. But the query language is only user-facing if the tool you're using exposes it. And the CLI only if you need to build your own grammars; I expect that to be less and less common as time goes on.


Maybe this is a "blind spot" of sorts to the developers of Tree Sitter, but I find that the CLI is directly useful for end users to perform queries on their code. Maybe it would be better if we had dedicated tools that were more oriented at end users rather than developers of TS grammars, but it seems silly not to use a perfectly good tool that already exists.

Somewhat of an aside about building grammars, I have found that the grammars are relatively hard to make small modifications or extensions to. Whereas it's relatively easy with e.g. traditional Vim syntax highlighting. Making grammars easier to extend would be valuable for users of languages like SQL that have a zillion custom dialects.


> I find that the CLI is directly useful for end users to perform queries on their code

That's interesting; what do they use the results for?

> I have found that the grammars are relatively hard to make small modifications or extensions to

Definitely true. I think this may be an inherent problem of the system and the parser generation, though. I am not sure it's solvable by the grammar authors.


I use it for code search. I can search for the definition of a given function, or all places that a particular function is called by name.


Helix editor has it built-in so it's a good place to start looking into tree-sitter. For me the important parts are:

- Much better (and faster) syntax highlighting

- Shrink and expand selection (select variable, all variables inside parentheses, the whole block, the whole function)

- Select next or previous sibling node

- Goto matching bracket from where I am right now

- Jump between functions

- Jump between type definitions

- Jump between parameters

- Jump between comments

- Jump between tests

I think Helix must be the easiest way to start experimenting with tree-sitter. You need no plugins, just install the editor and start experimenting:

https://helix-editor.com/


pfft I can do that with 105 plugins in neovim


In what way is its documentation not useful? I did a project based on TreeSitter a year ago and the docs were perfectly clear and adequate, and I was up and running within half an hour or so


For example the Query Syntax: I would like an example of how to find all function declarations and where in the code each function starts and ends, as well as example data that such a query is supposed to return. And how to do it in the JavaScript/Web Tree-sitter binding =)
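
As a rough idea of the result shape being asked for, here is a stand-in that does not use Tree-sitter at all: it uses Python's built-in `ast` module on Python source to list every function with its start and end position. The Tree-sitter analogue would presumably be a query along the lines of `(function_declaration name: (identifier) @name)` plus the start and end positions of each captured node, but the exact node and field names depend on the grammar, so treat that pattern as an assumption.

```python
from __future__ import annotations

import ast

SOURCE = '''\
def add(a, b):
    return a + b


def greet(name):
    message = "hi " + name
    return message
'''


def function_spans(source: str) -> list[dict]:
    """Return name, start and end (line, column) for every function definition."""
    tree = ast.parse(source)
    spans = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            spans.append({
                "name": node.name,
                # ast lines are 1-based and columns are 0-based.
                "start": (node.lineno, node.col_offset),
                "end": (node.end_lineno, node.end_col_offset),
            })
    # ast.walk does not guarantee source order, so sort by start position.
    return sorted(spans, key=lambda span: span["start"])


for span in function_spans(SOURCE):
    print(span)
```

For the sample source this yields two records: `add` spanning (1, 0) to (2, 16) and `greet` spanning (5, 0) to (7, 18).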


i am in awe of how emacs is able to reinvent itself again and again (along with vim/neovim)..


This is so hopelessly complicated that I'm glad I use VSCode and not Emacs anymore. Sheesh.


I use both VS Code and Emacs. One of the things I love about Emacs is that if you put enough work into it, it will be like nothing else. There are times when I’m writing code, a text file, or even managing git commits with magit that I can sit back and say “wow, this is genuinely a pleasure to use.”

But, it takes a lot of work to get there. I still don’t have everything working super well, debugging is way easier for me in VS Code. But, I’m still learning (after 25+ years as an Emacs user), and that brings me joy.

It’s like when I’m working on electronics. There’s a genuine joy I get from using my Hakko soldering station, Mitutoyo calipers, or my Engineer hand tools. Using something that is supremely well designed for a purpose brings me joy.

And org-mode. Seriously, org-mode.


OK? It's a feature in development. Eventually emacs 29 with tree sitter will be part of distro packaging, and it will just be an "apt install emacs" away.


That's not the impression I got from TFA. It sounds like, once you have Emacs compiled with tree-sitter, you also need to have a language binding in a shared library in a known location (that you have to compile yourself or depend on the kindness of strangers). And once you have that, you actually need someone to write a major mode that utilizes it at all. And if you want to make that major mode the default for that file type, there are a few other things you need to do. Etc.


> once you have Emacs compiled with tree-sitter

Once Emacs 29 releases, your distro will package Emacs compiled with tree-sitter.

> you also need to have a language binding in a shared library in a known location

Emacs already ships with a command that clones the grammar repo, compiles it, and installs the shared library to that known location - this was explained in the article. The only manual thing you need to do is to associate a language with a git repo in your configuration.

> And once you have that, you actually need someone to write a major mode that utilizes it at all.

Emacs developers have also been working on covering major languages to provide tree-sitter based major modes. I count 23 major modes already being maintained as part of Emacs that will be shipped soon as part of 29.1, not to mention there are a lot more in Melpa (centralised community package repository).

In any case, the whole point of tree-sitter is that it makes writing major modes easy. It's all declarative now, including indentation rules that had traditionally been tricky to get right.


I use emacs 29 beta right now and that is just not how it works. For example for elixir you just install elixir-ts-mode like any other package, answer yes to the prompt to install the elixir ts library, and you’re done.

Even that setup step will be unnecessary in emacs 30, when all this stuff will be shipped by default.


Okay, that's a good thing! From the article:

> Neither tree-sitter nor Emacs come installed with language grammars

> ...it’ll only work if you don’t have an exceptional setup (so it won’t work well unless you have GCC and run some flavor of Linux.)

> Determining if a grammar is available is not intuitive nor obvious unless you use elisp

> Note that, just because you have installed a grammar, does not mean Emacs supports it. Someone still has to write the – admittedly, way easier – syntax and indentation logic and all that good stuff.

> Annoyingly, there’s no easy way to see if you’re using the normal or the TS-powered major mode

> If you use Customize, then you don’t have to do anything, but if you normally use setq, you’ll have to use customize-set-variable instead to ensure the setter is called properly.

> That sounds like a great idea until you realize that it is not possible to make one-size-fits all commands that do this. Believe me: I’ve tried.


Does VS code even integrate tree sitter? I'd imagine leveraging tree sitter is even more hairy there.


Not yet. If anyone manages to read through all of https://github.com/Microsoft/vscode/issues/50140 they might find out if it's planned or not; Mickey's article is shorter =P


I guess the implication was that VS Code works well and I don't need tree-sitter?


Imagine not even desiring to understand the tools that enable you to do your job.


It feels like people use these editors for the novelty of it and not because it's truly the best for productivity.


I will treat your comment as serious and explain why I personally use emacs.

The thing I love most about emacs is that it is joyously consistent. The same key combination to jump ahead by a word works everywhere, such as when you're opening files or navigating directories. This consistency means that I am often very efficient trying new packages that I have never used.

Maybe you've never seen someone use emacs in anger? If so, check out this video of Steve Yegge doing some stuff in Emacs: https://youtu.be/lkIicfzPBys?t=142


please point me towards literally _any_ text editor that dynamically remaps shortcuts on you.


Ah yes, the novelty of using emacs, an editor that started in 1976.


It's definitely just a fad that will die out any day now!


Well, the nature of emacs is that you can always add a line to your config file and have it radically change everything.


Aside from the fact that I don't think I've ever heard emacs associated with novelty (the thing is as old as Stonehenge), I wish this productivity cult would die. Programming is fun and play. Discovery and experimentation are great things.



