BCP 47's wording is fairly lax regarding variants & their prefixes. It
states[1]:
> Variant subtag records in the Language Subtag Registry MAY include
> one or more 'Prefix' (Section 3.1.8) fields. Each 'Prefix'
> indicates a suitable sequence of subtags for forming (with other
> subtags, as appropriate) a language tag when using the variant.
Therefore, a hard check that a variant is only used with one of its
listed prefixes is inappropriate.
Furthermore, there are other semi-normative sources stating the
same. For example, the W3C[2] says:
> Check the context and ordering for variant subtags. Most variant
> subtag records in the registry have one or more Prefix fields. The
> prefixes indicate with which subtags it is usually appropriate to
> use this variant.
…
> If you have a good reason, you could use a variant subtag with
> different subtags, eg. cmn-Latn-pinyin would be a perfectly legal
> way to say Mandarin Chinese written with pinyin.
And `pinyin` lists neither `cmn` nor `cmn-Latn` as a prefix.
BCP 47 goes on to state that "Most variants that share a prefix are
mutually exclusive", but there's no way to automatically identify the
variants for which this holds true. Therefore this property isn't
enforced either.
Lastly, BCP 47 does have one hard requirement on variants in [1]:
> 5. The same variant subtag MUST NOT be used more than once within a
> language tag.
This is now enforced.
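The duplicate check can be sketched as follows. This is a minimal illustration and not the actual implementation: the function name `find_duplicate_variants` is hypothetical, and it assumes the variant subtags have already been split out of the tag by a parser. Comparison is case-insensitive, as BCP 47 subtags are.

```python
def find_duplicate_variants(variants):
    """Return the variant subtags that occur more than once.

    `variants` is the list of variant subtags of a single language tag,
    in the order in which they appear. Hypothetical helper.
    """
    seen = set()
    duplicates = []
    for variant in variants:
        key = variant.lower()  # subtags are compared case-insensitively
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates
```

With such a check a tag like `sl-rozaj-rozaj` would be rejected, while `sl-rozaj-biske` would still be accepted.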
Part of the implementation/fix of #3307.
[1] https://www.rfc-editor.org/rfc/rfc5646.html#section-2.2.5
[2] https://www.w3.org/International/questions/qa-choosing-language-tags#variants
There are several languages that aren't part of ISO 639-2 but are part
of ISO 639-3 or ISO 639-5. For those languages the legacy Matroska
language elements cannot be set to the ISO 639 alpha 3 code of the
BCP 47 language tag.
However, there are a lot of such languages whose ISO 639 alpha 3 code
is a valid extlang subtag of a BCP 47 tag. For example: the language
"Yue Chinese" has an ISO 639 alpha 3 code of `yue` but isn't part of
ISO 639-2. However, `yue` is also a valid extlang.
As each extlang must have a prefix for which it is valid (in the case
of `yue` it's `zh`) and as that prefix must in turn be an ISO 639 code
itself, that prefix language's ISO 639-2 code is the closest
representation.
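The fallback described above can be sketched like this. The tables are tiny illustrative excerpts and the names are hypothetical; real data would come from the ISO 639 lists and the IANA Language Subtag Registry. Using the bibliographic code `chi` for Chinese is an assumption of this sketch.

```python
# Illustrative excerpts only, not the complete lists.
ISO_639_2_CODES = {"chi", "ara", "ger", "spa"}      # alpha 3 codes that are part of ISO 639-2
PREFIX_TO_ISO_639_2 = {"zh": "chi", "ar": "ara"}    # extlang prefix -> ISO 639-2 code
EXTLANG_PREFIXES = {"yue": "zh", "cmn": "zh", "arb": "ar"}  # extlang -> its prefix


def closest_iso_639_2_code(alpha_3_code):
    """Return the ISO 639-2 code to use in the legacy language element.

    Codes that are part of ISO 639-2 are returned unchanged. Codes that are
    valid extlangs fall back to the ISO 639-2 code of their prefix language.
    Returns None if neither applies.
    """
    if alpha_3_code in ISO_639_2_CODES:
        return alpha_3_code
    prefix = EXTLANG_PREFIXES.get(alpha_3_code)
    if prefix is None:
        return None
    return PREFIX_TO_ISO_639_2.get(prefix)
```

For `yue` the lookup goes through the prefix `zh` and ends at `chi`.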
Part of the implementation of #3307.
Even though AVC & HEVC are often mis-detected in the middle of other
container formats, it is pretty unambiguous if the file starts with
the typical NALU marker. So try to detect AVC & HEVC before trying
audio types if the file starts with a NALU marker as audio types are
often mis-detected as well.
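A rough sketch of the idea, assuming the "typical NALU marker" is the Annex B start code (`00 00 00 01` or `00 00 01`). The function names and the probe labels are hypothetical, and the order used when no marker is present is an assumption made for illustration.

```python
# Annex B start codes; which lengths mkvmerge actually accepts is an assumption.
START_CODES = (b"\x00\x00\x00\x01", b"\x00\x00\x01")


def starts_with_nalu_start_code(data: bytes) -> bool:
    """Return True if the buffer begins with a 4-byte or 3-byte start code."""
    return data.startswith(START_CODES)


def probe_order(data: bytes) -> list:
    """Return the order in which to try the (hypothetical) probes."""
    video = ["avc", "hevc"]
    audio = ["audio types"]
    if starts_with_nalu_start_code(data):
        # The marker at the very start is fairly unambiguous: try video first.
        return video + audio
    return audio + video
```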
Fixes #3201.
With this change both legacy language/country elements and IETF BCP 47
language tags will be normalized when chapters are read or
written. This fixes a couple of corner cases in all programs dealing
with chapters:
1. IETF BCP 47 elements will now always be created before writing
chapters unless IETF BCP 47 elements are disabled. This wasn't
always the case when chapters were read from Matroska files.
2. When a chapter display element contains legacy language & country
elements but no IETF BCP 47 elements and IETF BCP 47 elements
aren't disabled, the IETF BCP 47 elements created will contain the
region from the legacy element. Before the change the elements
created didn't contain a country, leading to a change in semantics
as IETF BCP 47 elements take precedence over all legacy elements
when they're present.
3. Legacy country elements are now created when IETF BCP 47 elements
are present & contain a region code allowed in legacy country
elements.
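The three cases can be sketched roughly as follows. This is not the actual code: a chapter display is modelled as a plain dict, all names are hypothetical, the language table is a tiny excerpt, and only simple `language[-Script][-REGION]` tags are handled. The sketch also assumes that the two-letter alphabetic regions are the ones allowed in legacy country elements and ignores special cases where the country and region codes differ.

```python
# Illustrative excerpt; a real implementation would use the full ISO 639 tables.
LEGACY_TO_BCP47_LANGUAGE = {"ger": "de", "eng": "en", "spa": "es"}


def region_of(tag):
    """Return the two-letter region of a simple tag in upper case, or None."""
    for subtag in tag.split("-")[1:]:
        if len(subtag) == 1:  # singleton: extensions or private use follow
            break
        if len(subtag) == 2 and subtag.isascii() and subtag.isalpha():
            return subtag.upper()
    return None


def normalize_display(display, ietf_enabled=True):
    """Normalize a display dict with the optional keys 'language',
    'country' (both legacy) and 'ietf'."""
    result = dict(display)

    # Cases 1 and 2: create the IETF tag from the legacy elements and
    # carry the legacy country over as the region.
    if ietf_enabled and "ietf" not in result and "language" in result:
        tag = LEGACY_TO_BCP47_LANGUAGE.get(result["language"], result["language"])
        if "country" in result:
            tag += "-" + result["country"].upper()
        result["ietf"] = tag

    # Case 3: create the legacy country from the region of the IETF tag.
    if "ietf" in result:
        region = region_of(result["ietf"])
        if region is not None:
            result["country"] = region.lower()

    return result
```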
Part of the fix of #3193.
When using language tags for selecting which tracks to keep or
discard, mkvmerge has so far compared the given language tag verbatim
with the ones in the file (after normalizing each). This meant that in
order to always discard all Spanish tracks but keep the others,
`--stracks !es` would not work reliably, as a track in the file might
be specified as `es-ES`, and verbatim comparison simply didn't treat
`es` and `es-ES` as the same.
For users this is somewhat counterintuitive. The idea behind allowing
languages for track selection has always been to give human beings an
easy-to-remember, easy-to-use way to select tracks without having to
look through the file identification first. Verbatim comparison worked
fine until support for IETF BCP 47 language tags came along, as until
that point languages in Matroska files only ever contained a language
component but not e.g. a region or a variant.
This commit changes the selection to use a matching algorithm similar
to how IETF BCP 47 describes language tag matching. Basically it takes
a track's existing language, normalizes it & splits it into its
components. Then the same is done with all the languages mentioned
with the track selection option currently evaluated.
For each language listed in the track selection all components that
are actually set are compared with the track's language's
corresponding components. If all of them are equal, the track is
considered to be matched. Components set in the track's language but
not in the selection's language are simply ignored.
This means that specifying `--stracks !es` in the example above will
now match all tracks whose language is some kind of Spanish, no matter
whether the track's language tag contains a region, variants or any
other component (e.g. it would drop tracks marked as `es`, `es-MX`,
`es-Latn-ES` etc.).
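A minimal sketch of the matching described above, not the actual implementation. The function names are hypothetical, lower-casing stands in for real normalization, and the splitter only knows about language, script, region and variants (no extlangs, extensions or private use).

```python
def split_tag(tag):
    """Split a simple BCP 47 tag into its components."""
    subtags = tag.lower().split("-")
    components = {"language": subtags[0], "script": "", "region": "", "variants": []}
    for subtag in subtags[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            components["script"] = subtag
        elif (len(subtag) == 2 and subtag.isalpha()) or (len(subtag) == 3 and subtag.isdigit()):
            components["region"] = subtag
        else:
            components["variants"].append(subtag)
    return components


def matches(selection_tag, track_tag):
    """Return True if every component set in the selection's language
    equals the corresponding component of the track's language."""
    selection = split_tag(selection_tag)
    track = split_tag(track_tag)
    # Components that are empty in the selection are skipped, so anything
    # the track sets beyond the selection is ignored.
    return all(track[name] == value for name, value in selection.items() if value)
```

Here `matches("es", "es-MX")` is true while `matches("es-ES", "es")` is not, as the selection's region has no counterpart in the track's language.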
The frame summary requires the number of references to be known in
order to be able to determine the frame type. That number is only known
once the whole block group has been parsed as the reference block
elements are usually located behind the block elements.
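The deferral can be sketched like this. The element names follow Matroska, but the function names, the tuple representation of the elements and the mapping from reference count to frame type (0 = I, 1 = P, more = B) are assumptions made for illustration only.

```python
def frame_type(num_references):
    """Map the number of reference blocks to a frame type (assumed mapping)."""
    if num_references == 0:
        return "I"
    if num_references == 1:
        return "P"
    return "B"


def summarize_block_group(elements):
    """Build the summary lines for one block group.

    `elements` is a list of (name, payload) tuples in file order. Blocks
    are only collected during the pass; the summaries are created after
    the whole group has been seen, when the reference count is final.
    """
    pending_blocks = []
    num_references = 0
    for name, payload in elements:
        if name == "Block":
            pending_blocks.append(payload)
        elif name == "ReferenceBlock":
            num_references += 1
    kind = frame_type(num_references)
    return [f"{kind} frame, {payload}" for payload in pending_blocks]
```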