Languages in Matroska and MKVToolNix
Moritz Bunkus edited this page 2023-04-02 09:31:49 +00:00

Background

For a long time the Matroska file format only supported track/chapter/tag languages in the form of ISO 639-2 codes (e.g. por for Portuguese) with an optional country code following (e.g. por-BR for Portuguese in Brazil). MKVToolNix itself never supported those optional country codes, though.

In 2019 the IETF's CELLAR working group (which the Matroska project is a part of) extended the format specifications with three new elements that allow storing a much more descriptive language tag for tracks, chapters and tags. These elements are called "LanguageIETF" (for track headers), "ChapLanguageIETF" (for chapter languages) and "TagLanguageIETF" (for tag languages). These elements must follow the syntax laid out in the IETF's Best Current Practice 47 (referred to as "BCP 47" in this article), whose language tag syntax is specified in "RFC 5646".

The advantage of using an existing standard, especially a Best Current Practice, is that we can build on the work of a lot of very smart and knowledgeable people and that the same standard is widely used in other projects, protocols and products.

Examples for BCP 47 language tags

So how might those elements look? Here are a couple of examples:

  • de — the simplest form consists solely of an ISO 639-1 or 639-2 language code (in this case: German)
  • pt-BR — an ISO 3166 country code might be used, too, for specifying the language spoken in a specific region or country (in this case: Portuguese as spoken in Brazil)
  • sr-Cyrl-RS — an ISO 15924 script code might be used as well for specifying that a language is written in a specific script (in this case: Serbian as spoken in Serbia, written in Cyrillic instead of Latin)

There are several more possible additions that describe rarer variants of languages. One can even use custom private extensions that aren't standardized and only have meaning to a select number of people.
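To make the structure of the three examples above more tangible, here is a minimal Python sketch that splits a tag into its language, script and region components. The function name split_tag and the regular expression are my own illustration, not anything MKVToolNix uses; the pattern deliberately covers only the simple language[-script][-region] shape and ignores variants, extensions and private-use subtags.

```python
import re

# Simplified pattern: language, optional script, optional region.
# Variants, extensions and private-use subtags are deliberately not handled.
_SIMPLE_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def split_tag(tag):
    """Return a dict with 'language', 'script' and 'region' (None if absent)."""
    match = _SIMPLE_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a simple language[-script][-region] tag: {tag!r}")
    return match.groupdict()


for example in ("de", "pt-BR", "sr-Cyrl-RS"):
    print(example, split_tag(example))
```

A full BCP 47 parser needs considerably more than this, which is one reason to rely on an existing implementation where possible.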

Semantics & interaction

The rules for using those new elements in Matroska are simple: if both a new "…LanguageIETF" element and the corresponding old plain "…Language" element exist at the same level, the new "…LanguageIETF" element must be used. Otherwise whichever of the two elements exists is used.
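Expressed as code, the precedence rule might look like the following Python sketch. The function effective_language and its parameters are hypothetical names for illustration; None stands for "element not present", and what a reader should assume when neither element is present is outside the scope of this sketch.

```python
def effective_language(language_ietf=None, language=None):
    """Pick the value a reader should use for one track, chapter or tag.

    None means "element not present in the file".
    """
    if language_ietf is not None:
        # The new element always wins when it exists.
        return language_ietf
    # Otherwise fall back to the old element (which may be absent, too).
    return language


print(effective_language("sr-Cyrl-RS", "srp"))
print(effective_language(None, "srp"))
```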

It is expected that existing programs & devices will take quite some time before they support the new elements. That is nothing MKVToolNix can help with.

Support for BCP 47 language tags

Starting with version 50 MKVToolNix has almost full support for BCP 47 language tags. The BCP 47 language tag parser is lenient in what it accepts, including but not limited to:

  1. It is case-insensitive (e.g. both en and EN are accepted to mean English).
  2. You can specify either an ISO 639-2 or an ISO 639-1 code if both exist for the same language (e.g. both eng and en are accepted).
  3. It accepts both ISO 3166 country codes and numeric UN M.49 country codes (e.g. both UG and 800 mean Uganda).

However, the parser always outputs a normalized version of the language tag as laid out in BCP 47, again including but not limited to:

  1. The language code is the ISO 639-1 code if it exists, otherwise 639-2, and it is lower-case.
  2. The script code, if given, is capitalized (first letter upper-case, the rest lower-case, e.g. Cyrl).
  3. The country code is always the alphabetical ISO 3166 code, even if a numeric UN M.49 code was given initially, and it is upper-case.
  4. The other components are always lower-case.
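The four normalization rules above can be sketched in a few lines of Python. This is not MKVToolNix's parser: the function normalize_tag and the two tiny lookup tables are assumptions made for illustration, and a real implementation would need the complete ISO 639 and UN M.49 lists as well as proper handling of extended language sub-tags.

```python
# Tiny illustrative lookup tables; real code needs the complete registries.
ISO_639_2_TO_1 = {"eng": "en", "por": "pt", "srp": "sr"}
UN_M49_TO_ISO_3166 = {"076": "BR", "688": "RS", "800": "UG"}


def normalize_tag(tag):
    """Normalize a language[-script][-region][-other...] tag (simplified)."""
    subtags = tag.split("-")

    # Rule 1: lower-case language, ISO 639-1 code preferred if known.
    language = subtags[0].lower()
    result = [ISO_639_2_TO_1.get(language, language)]
    rest = subtags[1:]

    # Rule 2: a four-letter script code is capitalized.
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        result.append(rest.pop(0).title())

    # Rule 3: two letters or three digits form the region; numeric codes
    # are mapped to the alphabetical code, which is upper-case.
    if rest and (
        (len(rest[0]) == 2 and rest[0].isalpha())
        or (len(rest[0]) == 3 and rest[0].isdigit())
    ):
        region = rest.pop(0)
        result.append(UN_M49_TO_ISO_3166.get(region, region.upper()))

    # Rule 4: everything else is lower-case.
    result.extend(subtag.lower() for subtag in rest)
    return "-".join(result)


print(normalize_tag("SRP-cyrl-rs"))
print(normalize_tag("en-800"))
```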

As there are two methods to normalize language tags with respect to their extended language sub-tags form, both mkvmerge & mkvpropedit let the user choose a method with a parameter called --normalize-language-ietf (documentation). The GUI also contains corresponding options.

mkvmerge

All of mkvmerge's options that accept a language accept a BCP 47 language tag.

When identifying a file in JSON mode, existing "LanguageIETF" track header elements will be output as the language_ietf track property.
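As an illustration of how a script might read that property, here is a small Python sketch. It assumes the identification JSON has a top-level "tracks" list whose entries carry an "id" and a "properties" object; the sample document below is hand-written and heavily shortened, not real mkvmerge output. With a real file the text would come from mkvmerge's JSON identification mode instead.

```python
import json


def track_languages(identification_json):
    """Map track ID to its language_ietf property (None if not present)."""
    data = json.loads(identification_json)
    result = {}
    for track in data.get("tracks", []):
        properties = track.get("properties", {})
        result[track["id"]] = properties.get("language_ietf")
    return result


# Hand-written, shortened sample; real output contains many more fields.
SAMPLE = """
{"tracks": [
  {"id": 0, "type": "video", "properties": {"language": "und"}},
  {"id": 1, "type": "audio",
   "properties": {"language": "srp", "language_ietf": "sr-Cyrl-RS"}}
]}
"""

print(track_languages(SAMPLE))
```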

When writing a file mkvmerge will always write the "LanguageIETF", "ChapLanguageIETF" and "TagLanguageIETF" elements (the latter two only if chapters or tags are written in general, of course). In addition to those elements the corresponding old elements will be written; they'll be set to the ISO 639-2 code portion of the BCP 47 language tag. For example, when the track language is set to sr-Cyrl-RS, "LanguageIETF" will be set to sr-Cyrl-RS and the old "Language" element will be set to "srp".
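The derivation of the old element's value from the BCP 47 tag could be sketched as follows. The function legacy_language and its three-entry table are illustrative assumptions, not mkvmerge's actual code; the table only contains mappings mentioned in this article.

```python
# Illustrative subset only; real code needs the full ISO 639 tables.
ISO_639_1_TO_2 = {"pt": "por", "sr": "srp", "zh": "chi"}


def legacy_language(bcp47_tag):
    """Return the ISO 639-2 code to store in the old "Language" element."""
    language = bcp47_tag.split("-")[0].lower()
    if len(language) == 3:
        # Already a three-letter code; assumed to be ISO 639-2 here.
        return language
    # Raises KeyError for codes missing from the illustrative table.
    return ISO_639_1_TO_2[language]


print(legacy_language("sr-Cyrl-RS"))
```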

When reading existing files (Matroska files, XML chapter or tag files etc.) that already contain those "…LanguageIETF" elements the existing elements will be kept. Otherwise "…LanguageIETF" elements will be added based on command-line options and other existing "…Language" elements.

The creation of the new elements can be disabled completely with the command-line option --disable-language-ietf which operates on all three new elements.

You can choose the normalization method applied to extended language sub-tags with the parameter --normalize-language-ietf (documentation).

mkvpropedit

mkvpropedit mostly tries to do the right thing by default. Therefore changes to the language track header property will cause mkvpropedit to apply the same change to both the new "LanguageIETF" element and the old "Language" element, similar to how mkvmerge applies the language to both elements. For example, when using mkvpropedit movie.mkv --edit track:2 --set language=zh-TW the "LanguageIETF" element will be set to zh-TW and the old "Language" element to chi.

Additionally there's a new track header property named language-ietf that can be set or removed. Changes to this property only apply to the new "LanguageIETF" track header element. A caveat is that the old "Language" track header element is a mandatory element in Matroska — meaning even if it isn't present in the file it still has an implied value. When you only set the language-ietf property but not the language property, mkvpropedit will therefore add the old "Language" element in certain cases & set it to "und" = "undetermined" as the user didn't specify a value for it.
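For scripted use, the two variants described above (changing both elements via language, or only the new element via language-ietf) differ solely in the property name. The following Python sketch builds the argument list without running anything; the helper set_language_args and its ietf_only parameter are hypothetical names chosen for this example.

```python
import shlex


def set_language_args(file_name, track, tag, ietf_only=False):
    """Build an mkvpropedit argument list for changing a track's language.

    ietf_only=False uses the "language" property (both elements change);
    ietf_only=True uses "language-ietf" (only "LanguageIETF" changes).
    """
    prop = "language-ietf" if ietf_only else "language"
    return [
        "mkvpropedit", file_name,
        "--edit", f"track:{track}",
        "--set", f"{prop}={tag}",
    ]


print(shlex.join(set_language_args("movie.mkv", 2, "zh-TW")))
print(shlex.join(set_language_args("movie.mkv", 2, "zh-TW", ietf_only=True)))
```

The resulting list can be handed to subprocess.run() as-is, which avoids shell quoting issues with unusual file names.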

When reading XML chapter or tag files mkvpropedit works like mkvmerge does (see above).

The creation of the new elements can be disabled completely with the command-line option --disable-language-ietf which operates on all three new elements.

You can choose the normalization method applied to extended language sub-tags with the parameter --normalize-language-ietf (documentation).

MKVToolNix GUI

Multiplexer & chapter editor

In MKVToolNix's multiplexer and chapter editor all controls taking a single language have been changed to use a language selection dialog. That dialog offers the user the choice between a free-form input or selecting each component of the language with the help of drop-down boxes. Changes to any of the input methods cause the respective other input method to be updated immediately if the resulting input is valid. The validity of the input is shown on the bottom of the dialog including the parser's error message if the input is invalid.

The default editing mode is selecting individual components. The default mode can be changed in the preferences → "GUI" → "Default IETF BCP 47 language editing mode".

Header editor

The header editor shows both elements as entries in its tree. The old "Language" element uses the old language drop-down box just like in earlier versions of MKVToolNix. The "LanguageIETF" element uses the same language selection dialog described above.

Changes to the "LanguageIETF" element have no effect on the old "Language" elements and vice versa, which differs from how mkvpropedit works.

Disabling the "…LanguageIETF" elements

You might find yourself in situations where you have to disable those new elements, e.g. because your hardware device fails to play a Matroska file that contains them. Here's how to do that:

  • For mkvmerge add the command-line option --disable-language-ietf. Not only does it prevent mkvmerge from adding those elements, they'll also be removed if they exist in the source file.
  • For mkvpropedit add the command-line option --disable-language-ietf. It'll prevent mkvpropedit from writing the track header "LanguageIETF" element when working on the language property and remove the "ChapLanguageIETF" & "TagLanguageIETF" elements when working on chapters or tags respectively. For removing existing track header "LanguageIETF" elements, use --edit track:… --delete language-ietf.
  • For MKVToolNix GUI's multiplexer you can add --disable-language-ietf to the default list of additional command-line options in the preferences → "Multiplexer" → "Default values" → "Default additional command-line options".
  • For MKVToolNix GUI's chapter editor you currently cannot disable the creation of "ChapLanguageIETF" elements. Such functionality might be added later.
  • For MKVToolNix GUI's header editor you can simply select each "Language (IETF BCP 47)" element and check the "remove element" checkbox if the element is currently present in the file.
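For batch processing, the two command-line variants from the list above could be generated with a sketch like the following. The helper names, file names and track numbers are made up for illustration, and placing --disable-language-ietf before the output option is simply the order chosen here. The functions only build argument lists and do not run anything.

```python
import shlex


def remux_without_ietf_args(source, destination):
    """mkvmerge call that neither writes nor keeps the new elements."""
    return ["mkvmerge", "--disable-language-ietf", "-o", destination, source]


def delete_track_ietf_args(file_name, tracks):
    """mkvpropedit call removing "LanguageIETF" from the given tracks."""
    args = ["mkvpropedit", file_name]
    for track in tracks:
        args += ["--edit", f"track:{track}", "--delete", "language-ietf"]
    return args


print(shlex.join(remux_without_ietf_args("in.mkv", "out.mkv")))
print(shlex.join(delete_track_ietf_args("movie.mkv", [1, 2])))
```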

The old user interface is not coming back

Compared to the old user interface the new one requires two more mouse clicks to change the language tag (opening the dialog & clicking "OK"). A small number of users object to this change. While I truly understand that only two clicks can amount to a lot of extra work when handling a large number of files, the old interface is not coming back. Other wishes such as "simply show the old language combo box with the new edit button" don't have much merit considering what the new interface achieves. Here are the requirements I had before I implemented the new UI:

  1. The user should be able to use the full feature-set that BCP 47 language tags offer.
  2. For users not familiar with BCP 47: the language tag should be easy to construct with a lot of help from the program.
  3. Users intimately familiar with BCP 47 should be able to quickly input a valid BCP 47 language tag without having to hunt through multiple combo boxes.
  4. The displayed language tag should be human-readable (e.g. "English" instead of "en").
  5. The displayed language tag should still offer the full information about all of its components.
  6. The new controls should ideally fit into the same space the existing controls were occupying, if at all possible without making the dialog any wider.
  7. After swapping out the old for the new controls the interface should not be much more confusing than it was before.
  8. The number of mouse clicks required for the most-often executed operation (changing solely the language code) should ideally not increase, or if it has to increase, as little as possible.
  9. The new controls should be usable solely with the keyboard.
  10. The user interface should not accept invalid language tags.

Of course several of those requirements conflict with each other.

Having the combo box outside would be much more confusing in the presence of other components of a BCP 47 language tag. It would also pose real problems wrt. handling invalid language tags. Overall that change wouldn't be an improvement due to the number of drawbacks it would come with.

The current UI is the result of finding a compromise that implements as many of the requirements as possible. It's not perfect & I'm definitely willing to improve upon it.