Closed
Labels
feature-request (Request for new features or functionality), languages-guessing (Language guessing issues), on-testplan, workbench-untitled-editors (Managing of untitled editors in workbench window)
Description
Continuing from #118455 ... this will span multiple milestones.
Where we are
Now that this PR, which brings in vscode-languagedetection, is merged, we have automatic language detection. What's it all about?
- No code ever leaves your VS Code instance. The model is queried locally.
- opt-in feature
"workbench.editor.untitled.languageDetection": true
- supports language specific enablement:
"[plaintext]": { "workbench.editor.untitled.languageDetection": true }
- Powered by @yoeo's guesslang model (the latest release of it) which supports 30 languages
- Very basic additional heuristics to help the model with accuracy (adding the confidence of JS and TS, and of C and C++)
- Uncompressed (~4MB package)
This provides an "ok" experience, but the model has to be very sure of its guess before the untitled file's language changes. If you enable the feature and paste in a fairly large sample of code, it should work.
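To make the "has to be very sure" behavior concrete, here is a minimal sketch of a strict acceptance rule. The `Guess` shape, the function name, and the `0.9` cutoff are all assumptions made for illustration; the issue does not state the actual threshold or how the check is implemented.

```typescript
// One guess from the model: a language id plus a confidence in [0, 1].
// The field names are an assumption for this sketch.
interface Guess {
  languageId: string;
  confidence: number;
}

// Hypothetical cutoff. The issue only says the model must be "very sure".
const STRICT_THRESHOLD = 0.9;

// Returns the top language only when its confidence clears the cutoff,
// otherwise undefined (meaning: leave the untitled file's language alone).
function acceptIfVerySure(
  guesses: Guess[],
  threshold: number = STRICT_THRESHOLD
): string | undefined {
  if (guesses.length === 0) {
    return undefined;
  }
  const best = guesses.reduce((a, b) => (b.confidence > a.confidence ? b : a));
  return best.confidence >= threshold ? best.languageId : undefined;
}
```

With a rule like this, a short snippet that the model scores at, say, 30% for one language is ignored even if every other language scores far lower, which is why a large pasted sample tends to be needed.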
Where we wanna be
- Support as many languages as possible (JSON, YAML, and XML, for example, are not supported today)
- You should be able to open an untitled file and start typing and the language detection flips on as fast as possible
- A nice experience to handle tie breakers, "almost confident but not confident enough" situations and when we are wrong
- Compressed as much as possible (preliminary tests seem to say we can get down to 2-3MB)
How we'll get there
- Improved guesslang model
- Support 24 more languages, including JSON, Kotlin, XML, YAML, etc. See yoeo/guesslang#33, which adds support for 14 more languages (JSON, YAML, XML included)
- Possibly help guesslang with more files to train on
- Improved heuristics
- Variable confidence acceptance (e.g. if the model is 30% confident it's Java, but <1% confident it's anything else, then it's probably Java and we should set that)
- We should use VS Code language information to fine tune our language detection #129596
- Look at the user's workspace and influence weight based on what's open (sounds too costly of an operation)
- Improved UX
- In the event of a tie breaker, show a notification or similar to say "I'm tied between these X languages. Which one is it?"
- Make sure the user's decision doesn't get overwritten
- If we were wrong, promote the language picker (maybe show a badge on the language status bar entry when we change the lang)
- Add detected languages to the top of the language picker
- Improved Perf
- Compress the model as much as possible
- Handle large files (e.g. someone pasting a huge JSON payload)
- Debounce event for untitled files
- Improved feedback
- It's important to understand how the model is doing for users to make sure it's actually useful. To do that, we opened "Automatic language detection telemetry" (#129576)
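The "variable confidence acceptance" and tie-breaker ideas in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Guess` and `Decision` types, the `decide` function, and the three constants are hypothetical and are not taken from VS Code or guesslang.

```typescript
// One guess from the model: a language id plus a confidence in [0, 1].
// The field names are an assumption for this sketch.
interface Guess {
  languageId: string;
  confidence: number;
}

type Decision =
  | { kind: "accept"; languageId: string } // set the language automatically
  | { kind: "tie"; candidates: string[] }  // ask the user which one it is
  | { kind: "none" };                      // not confident enough, do nothing

// All three values are hypothetical tuning knobs.
const MIN_CONFIDENCE = 0.2; // ignore a top guess weaker than this
const MIN_RATIO = 10;       // top guess must beat the runner-up by this factor
const TIE_MARGIN = 0.05;    // guesses this close to the top count as tied

function decide(guesses: Guess[]): Decision {
  // Sort a copy so the caller's array is left untouched.
  const sorted = [...guesses].sort((a, b) => b.confidence - a.confidence);
  const best = sorted[0];
  if (!best || best.confidence < MIN_CONFIDENCE) {
    return { kind: "none" };
  }

  // Relative check: 30% for Java against <1% for everything else passes.
  const runnerUp = sorted[1]?.confidence ?? 0;
  if (best.confidence >= runnerUp * MIN_RATIO) {
    return { kind: "accept", languageId: best.languageId };
  }

  // Otherwise, collect everything within TIE_MARGIN of the top guess.
  const candidates = sorted
    .filter((g) => best.confidence - g.confidence <= TIE_MARGIN)
    .map((g) => g.languageId);
  return candidates.length > 1 ? { kind: "tie", candidates } : { kind: "none" };
}
```

A caller would set the language on `accept`, show the "which one is it?" prompt on `tie`, and leave the file alone on `none`. Once the user has picked a language, later detections would need to be skipped so that the user's decision is not overwritten.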
Additional possible investigations
- Have `code -` open untitled files and detect the language
- Kernel guessing of a Jupyter Notebook
- Provide feedback to the model so that it can learn from users? (this would be totally local)
- Use a different model than guesslang