Automatic language detection plan #129004

@TylerLeonhardt

Continuing from #118455 ... this will span multiple milestones.

Where we are

Now that this PR, which brings in vscode-languagedetection, is merged, we have automatic language detection. What's it all about?

  • No code ever leaves your VS Code instance. The model is queried locally.
  • Opt-in feature
    • "workbench.editor.untitled.languageDetection": true
    • Supports language-specific enablement: "[plaintext]": { "workbench.editor.untitled.languageDetection": true }
  • Powered by @yoeo's guesslang model (the latest release of it) which supports 30 languages
  • Very basic additional heuristics to help the model with accuracy (adding confidence of JS and TS, and C and C++)
  • Uncompressed (~4MB package)

This provides an "ok" experience, but the model has to be very, very sure of its guess before the untitled file's language changes. If you enable the feature and paste in a fairly large sample of code, it should work.
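The "very sure" behavior can be pictured as a simple confidence threshold. The sketch below is illustrative only: the `Guess` shape, the `pickLanguage` name, and the 0.8 threshold are assumptions made for this example, not the actual vscode-languagedetection API or its real values.

```typescript
// Hypothetical shape of one model guess; not the real library types.
interface Guess {
  languageId: string;
  confidence: number; // assumed to be in the range 0..1
}

// Assumed threshold, chosen only for illustration.
const CONFIDENCE_THRESHOLD = 0.8;

// Returns the language to switch to, or undefined when the model
// is not confident enough to change the untitled file.
function pickLanguage(
  guesses: Guess[],
  threshold: number = CONFIDENCE_THRESHOLD
): string | undefined {
  if (guesses.length === 0) {
    return undefined;
  }
  // Find the highest-confidence guess without mutating the input.
  const best = guesses.reduce((a, b) => (b.confidence > a.confidence ? b : a));
  return best.confidence >= threshold ? best.languageId : undefined;
}
```

With a threshold this high, a short snippet that scores, say, 0.5 for its top language leaves the file untouched, which matches the need to paste a fairly large sample today.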

Where we wanna be

  • Support as many languages as possible (JSON, YAML, and XML, for example, are not supported today)
  • You should be able to open an untitled file and start typing, and language detection kicks in as fast as possible
  • A nice experience for handling tie breakers, "almost confident but not confident enough" situations, and cases where we are wrong
  • Compressed as much as possible (preliminary tests seem to say we can get down to 2-3MB)

How we'll get there

  • Improved guesslang model
  • Improved heuristics
  • Improved UX
    • In the event of a tie, show a notification or similar to say "I'm tied between these X languages. Which one is it?"
    • Make sure the user's decision doesn't get overwritten
    • If we were wrong, promote the language picker (maybe show a badge on the language status bar entry when we change the lang)
    • Add detected languages to the top of the language picker
  • Improved Perf
    • Compress the model as much as possible
    • Handle large files (i.e. someone pasting a huge JSON payload)
    • Debounce the detection event for untitled files
  • Improved feedback
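
As a rough sketch of how the UX and perf items above might fit together: decide whether to apply a language, ask the user to break a tie, or do nothing, while never overriding an explicit user choice and capping how much text is sent to the model. Every name and number here (`Guess`, `decide`, `sampleForDetection`, the thresholds, the 10000-character cap) is a hypothetical chosen for illustration, not part of VS Code or guesslang.

```typescript
// Hypothetical shape of one model guess; not the real library types.
interface Guess {
  languageId: string;
  confidence: number; // assumed to be in the range 0..1
}

type Decision =
  | { kind: "apply"; languageId: string }
  | { kind: "ask"; candidates: string[] }
  | { kind: "skip" };

// All numbers below are illustrative assumptions.
const APPLY_THRESHOLD = 0.8; // confident enough to switch silently
const ASK_THRESHOLD = 0.3;   // below this, do not bother the user at all
const TIE_MARGIN = 0.05;     // guesses this close to the best count as tied
const MAX_CHARS = 10000;     // cap on text handed to the model

// Cap the text handed to the model so a huge paste stays cheap.
function sampleForDetection(text: string, maxChars: number = MAX_CHARS): string {
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}

function decide(guesses: Guess[], userPickedLanguage: boolean): Decision {
  // Never overwrite an explicit choice made by the user.
  if (userPickedLanguage || guesses.length === 0) {
    return { kind: "skip" };
  }
  // Sort a copy, highest confidence first.
  const sorted = [...guesses].sort((a, b) => b.confidence - a.confidence);
  const best = sorted[0];
  if (best.confidence < ASK_THRESHOLD) {
    return { kind: "skip" };
  }
  // Everything within TIE_MARGIN of the best guess counts as tied with it.
  const tied = sorted.filter(g => best.confidence - g.confidence <= TIE_MARGIN);
  if (tied.length > 1) {
    return { kind: "ask", candidates: tied.map(g => g.languageId) };
  }
  return best.confidence >= APPLY_THRESHOLD
    ? { kind: "apply", languageId: best.languageId }
    : { kind: "skip" };
}
```

An "ask" result would map to the tie-breaker notification, and the candidates list could also be what gets promoted to the top of the language picker.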

Additional possible investigations

  • Have code - (reading from stdin) open untitled files and detect the language
  • Guess the kernel of a Jupyter Notebook
  • Provide feedback to the model so that it can learn from users? (this would be totally local)
  • Use a different model than guesslang
