
Proposal: Language Packs #39178

@dbaeumer


Language Packs (under construction)

A while ago we opened VS Code's language set to contributions from the community by moving the translation database to Transifex. Since then quite a few languages have been added. However, there is currently no vehicle to install these languages with a stable version of VS Code. The stable version still only ships with the 9 core VS Code languages. Two extra languages (pt_br and hu) have been added to the insider build.

Instead of pre-bundling all new languages with VS Code, we should move to a model where these languages can be installed later on, the same way users install additional features via extensions.

How is VS Code localized

I will first outline how VS Code is localized today. The localization consists of the following parts:

  • the developer tagging strings to be translated.
  • automatic extraction of the strings to be translated.
  • pushing the strings to be translated to Transifex.
  • pulling translation from Transifex.
  • building translation bundles.

Tagging strings to be translated

VS Code uses a tagging approach to mark strings to be translated directly in the source code. For this it provides a translation function, nls.localize. Strings passed to that function as an argument are tagged for translation. Strings in single quotes are in general treated as 'technical' strings which don't require any translation. Strings in double quotes outside a localize function call are treated as strings that need translation but don't have one yet, and a linter rule flags them as untranslated. A typical nls.localize call looks like this:

nls.localize('TaskService.ignoredFolder', 'The following workspace folders are ignored since they use task version 0.1.0: ');

We also maintain an npm module that allows for the same approach in extension code. The npm module is called vscode-nls.

During normal compile time the strings inside the nls.localize call stay as they are. This ensures quick turnaround cycles during development. Furthermore, it is important to note that the source of truth for the strings is the source code (TypeScript and JavaScript). VS Code doesn't maintain resource bundles or property files in other formats.

Extracting strings

Strings to be translated are automatically extracted from the source code during build time. This extraction process does the following things:

  • extracts the key and the value and puts them into a special metadata file (JSON format).
  • replaces the key with an index.
  • removes the value from the call.

The metadata file contains the key / value pairs of all strings that are used inside VS Code. It is named nls.metadata.json, is produced during the build process and ships with VS Code.

The above example looks like this in a version we ship:

nls.localize(17, null);
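The extraction step can be sketched roughly as follows. This is a minimal illustration, not the actual build code: the names FileMetadata, LOCALIZE_CALL and extractStrings are hypothetical, the metadata layout is assumed, and a regular expression stands in for proper source analysis, so it only handles calls whose key and message are plain single-quoted literals.

```typescript
// Assumed per-file metadata shape; the real nls.metadata.json layout may differ.
interface FileMetadata {
  keys: string[];     // keys[i] is the key of the i-th nls.localize call
  messages: string[]; // messages[i] is the default message of that call
}

// Matches the start of nls.localize('key', 'message' with single-quoted literals.
// The closing parenthesis is not matched, so trailing arguments are left untouched.
const LOCALIZE_CALL = /nls\.localize\(\s*'([^']*)'\s*,\s*'([^']*)'/g;

function extractStrings(source: string): { output: string; metadata: FileMetadata } {
  const metadata: FileMetadata = { keys: [], messages: [] };
  const output = source.replace(LOCALIZE_CALL, (_match: string, key: string, message: string) => {
    // The index is a sequence number of the call within this source file.
    const index = metadata.keys.length;
    metadata.keys.push(key);
    metadata.messages.push(message);
    // Key replaced by the index, value removed from the call.
    return 'nls.localize(' + index + ', null';
  });
  return { output, metadata };
}

const extractionExample = extractStrings("show(nls.localize('greeting', 'Hello'));");
// extractionExample.output is "show(nls.localize(0, null));"
// extractionExample.metadata.keys is ['greeting']
```

Keeping the index local to each source file means a rewritten call only makes sense together with the metadata produced in the same build.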

Pushing to Transifex

The content of nls.metadata.json is then used to upload the strings to be translated to Transifex. Since VS Code has thousands of strings, the translations are grouped into smaller projects to make them easier to handle in Transifex. The following source file describes how these strings are grouped into projects: https://github.com/Microsoft/vscode/blob/master/build/lib/i18n.resources.json#L1

Pulling translations from Transifex

Translations are pulled from Transifex and stored alongside the source in the VS Code GitHub repository. Storing the translations together with the source code is necessary to be able to version source code and translations together. Otherwise it would, for example, be very hard to do a recovery build of an older version with exactly the same translations. The translated strings are stored under the i18n folder, where the first sub folder is the translated language. The structure underneath the language folder is isomorphic to the source code folder structure under the src folder. However, ts/js source files which don't contain any translatable strings will not have a corresponding i18n.json file. The files in the i18n folder are all machine generated and should never be edited by a developer.

Building translation Bundles

During build time (when we build a shippable version of VS Code) the build process also generates translation bundles per supported language (currently the 9 core languages). These translation bundles have the same granularity as the source bundles. For example, there is a workbench.main.js (which bundles most of our workbench code), so there are corresponding workbench.main.nls.${lang}.js files which contain the translated strings.

These translation bundles are optimized for memory footprint and low CPU consumption when looking up strings. This is achieved using the following two techniques:

  • all key values from the source code are replaced with index lookups in arrays (see above). The index used is a sequence number for the nls.localize call per TS/JS source file.
  • unknown translations are replaced with their default value using the normal language fallback lookup rule (de_ch -> de -> en). As a result the translation bundles are always complete at runtime, only one set needs to be loaded, and no dynamic lookup happens.
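The two techniques above could be combined at build time along these lines. This is a sketch under assumptions: the Translations shape and the function names fallbackChain and buildBundle are made up for illustration, and the chain is derived by cutting the language id at the underscore and always ending in English.

```typescript
// language id -> key -> translated message (assumed in-memory shape)
type Translations = { [language: string]: { [key: string]: string } };

// 'de_ch' -> ['de_ch', 'de', 'en'], 'de' -> ['de', 'en'], 'en' -> ['en']
function fallbackChain(language: string): string[] {
  const chain = [language];
  const separator = language.indexOf('_');
  if (separator > 0) {
    chain.push(language.substring(0, separator));
  }
  if (chain.indexOf('en') === -1) {
    chain.push('en');
  }
  return chain;
}

// Produces a complete array for one language: position i holds the message for
// keys[i], already resolved through the fallback chain at build time.
function buildBundle(keys: string[], language: string, translations: Translations): string[] {
  const chain = fallbackChain(language);
  return keys.map(key => {
    for (const candidate of chain) {
      const messages = translations[candidate];
      if (messages !== undefined && Object.prototype.hasOwnProperty.call(messages, key)) {
        return messages[key];
      }
    }
    throw new Error('No default message for key ' + key);
  });
}
```

At runtime a lookup then reduces to indexing into the array, which is why a bundle built this way only fits the exact set of calls it was generated from.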

A translation bundle is statically linked to a VS Code version and will very likely not function correctly with a different VS Code version.

This optimization happens for all VS Code core code and our built in extensions. The mechanism and the necessary build tools are also available for outside extensions via the vscode-nls-dev npm module.

Language Packs

It is desirable that language packs come as extensions and are managed by the marketplace. We don't want to add another channel to host language packs, nor do we want to ship all languages in the box (size, language deprecation, ...). To provide such language pack extensions we need to explore two things:

  • how would we build such language pack extension
  • do we have special extension related requirements for such a language pack extension

Building Language Packs

The optimizations we do when building translation bundles (key -> index replacement and default language lookup at build time) in particular make it harder for third parties to produce language pack extensions. In general we have four choices:

  1. we implement a second translation bundle solution for non core languages which doesn't make use of the optimizations described above.
  2. we publish all our nls related build tools as standalone npm packages so that third parties can publish translations using the same optimizations as our own extensions.
  3. we publish all language extensions.
  4. we ship all languages in the box, either in the normal VS Code build or in a second build.

Option 1

Basically we would still replace the key by an index during build time. However during runtime we would do the following for non core languages:

  • load the nls.metadata.json file into memory
  • reverse translate index to key (this information is part of the nls.metadata.json)
  • load non core languages bundle files into memory and build up hash tables (these bundles would be key/value maps)
  • look up the message using the key from a message bundle installed as an extension.
  • if no message is found we follow the language lookup rules (de_ch -> de -> en) and load other language bundles until we find a valid value for a given key (which we do at least for English)
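A rough sketch of what this dynamic lookup could look like is given below. All names (MessageBundle, lookupMessage, createDynamicLocalize) are hypothetical, the bundles are assumed to be already loaded into memory, and the metadata is reduced to a plain index-to-key array for a single source file.

```typescript
// A key/value map as it might come from a language pack extension (assumed shape).
type MessageBundle = { [key: string]: string };

function lookupMessage(bundle: MessageBundle | undefined, key: string): string | undefined {
  if (bundle !== undefined && Object.prototype.hasOwnProperty.call(bundle, key)) {
    return bundle[key];
  }
  return undefined;
}

// keysByIndex: reverse mapping index -> key, taken from the metadata.
// bundles:     language id -> key/value map.
// chain:       lookup order, for example ['de_ch', 'de', 'en'].
function createDynamicLocalize(
  keysByIndex: string[],
  bundles: { [language: string]: MessageBundle },
  chain: string[]
): (index: number) => string {
  return (index: number): string => {
    const key = keysByIndex[index];
    if (key === undefined) {
      throw new Error('Unknown message index ' + index);
    }
    // Walk the fallback chain until a bundle knows the key.
    for (const language of chain) {
      const message = lookupMessage(bundles[language], key);
      if (message !== undefined) {
        return message;
      }
    }
    throw new Error('No message found for key ' + key);
  };
}
```

Compared with the statically linked bundles, every call here pays for a reverse translation and up to one hash lookup per language in the chain, which is the runtime cost this option trades for flexibility.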

We could think about giving up on the current solution later on and only use a dynamic runtime solution instead of the optimized statically linked build time solution. This would avoid loading the nls.metadata.json file but would leave the key in the generated JS file.

Option 2

We publish all build tools that are currently inside the core VS Code repository as standalone npm packages so that third parties can run the same scripts to bundle translation files.

Option 3

We publish all language extensions to the marketplace during build time. The translations themselves would still come from Transifex. However, the translations would also go into the i18n folder like our core languages do, and our build scripts would generate language extensions and publish them to the marketplace.

Option 4

We either include all available languages in the normal VS Code build or we have two different VS Code builds: a first which includes the nine core languages and a second called VS Code International which includes all languages currently available in Transifex. The advantage would be no changes to build scripts, loader / start up code or extension installation. However, we would need to create and manage these additional builds.

Proposal

I am favoring Option 3. Pros are:

  • we can treat core languages the same way and basically ship VS Code only with English out of the box.
  • only minor changes to how we read and manage translation bundles during runtime (we only need additional lookup locations).

Cons are:

  • the language pack must match the VS Code version number. So a language pack produced for version 1.18 will not work for version 1.19.
  • we will add more translation files to our GitHub repository (under the i18n folder). However that folder can fully be ignored during development time. The average size of the files for a language is currently 500KB.

I don't like Option 1 since it will add a second translation bundle runtime story with its own set of bugs (at least for a while; I am convinced that the optimizations we do are worth it, especially during startup). If we stick with two different solutions (core / contributed languages), core languages and contributed languages will look different and we will always ship all core languages in the box. The only advantage I can see with option 1 is that it would allow starting VS Code with an outdated language pack, since missing strings would dynamically fall back to English at runtime.

Option 2 would be doable and would allow us to treat core and contributed languages the same. However, given our experience with maintaining a separate repository and npm module, it might not be worth the effort compared to option 3. In addition we would end up with more outdated language packs when we ship a new version of VS Code.

Language Pack Extensions

Since with option 3 (as with option 2) language packs are statically 'linked' against a VS Code version, we would need the following features for extensions (if not already present):

  • exact version matching
  • auto updating. This means when a user installs a new version of VS Code we should automatically update all language packs to the matching VS Code version. If no language pack version is available we would disable the language pack.
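The version matching and the disable rule could be sketched as follows. This is an illustration only: the LanguagePack shape and the function names are invented, and matching on major.minor is an assumption derived from the 1.18 / 1.19 example above, not a decided rule.

```typescript
// Hypothetical description of one published language pack version.
interface LanguagePack {
  language: string;      // for example 'hu'
  targetVersion: string; // VS Code version the pack was built against
}

// '1.19.2' -> '1.19'
function majorMinor(version: string): string {
  const [major, minor] = version.split('.');
  return major + '.' + minor;
}

// Returns the pack built for exactly this VS Code version (major.minor),
// or undefined, in which case the caller would disable the language pack.
function selectLanguagePack(
  vscodeVersion: string,
  language: string,
  published: LanguagePack[]
): LanguagePack | undefined {
  const wanted = majorMinor(vscodeVersion);
  for (const pack of published) {
    if (pack.language === language && majorMinor(pack.targetVersion) === wanted) {
      return pack;
    }
  }
  return undefined;
}
```

Running this selection after every VS Code update would cover the auto updating case: a matching pack is installed if one exists, otherwise the pack stays disabled until a matching version is published.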

Labels

l10n-platform: Localization platform issues (not wrong translations)
on-testplan
under-discussion: Issue is under discussion for relevance, priority, approach
