PR #1250 Replaced 'no' language code with 'nb' and use full Norwegian language names

petterreinholdtsen 2 years ago

The 'no' language code is an obsolete language code that is the union of language codes 'nb' (Norwegian Bokmål) and 'nn' (Norwegian Nynorsk), the two written variants in use in Norway. As 'no' is misleading, and seems to be used by Whisper to mean Norwegian Bokmål, I recommend replacing it with 'nb' and using the full names for both of the Norwegian written forms.

Useful references:

Replaced 'no' language code with 'nb' and use full Norwegian language…

fcf33008

ryanheise 2 years ago

From my understanding, this may be similar to Chinese which also has multiple written variants, although Chinese also has multiple spoken dialects. The way Whisper was trained, it was trained on all of these written and spoken variants under the umbrella of "Chinese", and so ultimately a single language code has to describe all of these. Since there is only one language code for Chinese, the way you get it to transcribe for a particular variant is therefore not by specifying a different language but by using the same language code and then using a prompt to get it off on the right foot with the variant. It may not be a strictly correct use of language codes, but the training is already done that way.

I haven't tried transcribing Norwegian, but is it similar in that Whisper contains training data for both writing systems under a single umbrella of "no" which can be accessed by using a prompt to set it off in one of those two directions? If so, I'm not sure if changing it to "nb" would make sense.

petterreinholdtsen 2 years ago

[ryanheise]

> From my understanding, this may be similar to Chinese which also has multiple written variants, although Chinese also has multiple spoken dialects.

I believed China had several distinct languages with their own written language as well as the written Mandarin standardized across the country, in addition to a lot of dialects of the different languages.

> I haven't tried transcribing Norwegian, but is it similar in that Whisper contains training data for both writing systems under a single umbrella of "no" which can be accessed by using a prompt to set it off in one of those two directions? If so, I'm not sure if changing it to "nb" would make sense.

The transcriptions I have tested so far gave me Norwegian Bokmål. I am not sure how to prompt it to switch to Norwegian Nynorsk.

-- Happy hacking Petter Reinholdtsen

ryanheise 2 years ago

You can try using --initial_prompt "Some introductory pre-sentence written in the Norwegian Nynorsk script."

So you just write a made-up sentence in the script you want, and you might get better results if it is a sentence that you could plausibly imagine as having been spoken before the first actual sentence in the audio you're transcribing.
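As a rough illustration of the suggestion above, the following sketch assembles a `whisper` command line that seeds the transcription with a full Nynorsk pre-sentence. The helper name `build_whisper_command`, the file name, and the prompt sentence are made up for this example; only the `--model`, `--language`, and `--initial_prompt` options come from the thread.

```python
import shlex


def build_whisper_command(audio_path, initial_prompt, language="no", model="medium"):
    """Return the argv list for a whisper CLI call seeded with a prompt.

    Hypothetical helper: it only assembles the arguments, it does not run whisper.
    """
    return [
        "whisper", audio_path,
        "--model", model,
        "--language", language,
        "--initial_prompt", initial_prompt,
    ]


# Illustrative values: a complete sentence, with a full stop, written in Nynorsk.
cmd = build_whisper_command(
    "talk.webm",
    "Eg heiter Kari, og i dag skal eg snakke om nynorsk.",
)
print(shlex.join(cmd))  # shell-quoted command, ready to paste into a terminal
```

The list form can also be passed directly to `subprocess.run(cmd)` if Whisper is installed.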

KjeldsenDK 2 years ago

The rest of the world uses no, and making this change would result in numerous posts all over the world asking why -no doesn't work. Not to mention the hundreds of guides and copies of documentation that use -no.

As a minimum, -no should be kept for legacy support, and nn and nb should only be implemented if Whisper actually knows the difference.

petterreinholdtsen 2 years ago

There are for sure quite a few people who use the Norwegian language codes incorrectly, or confuse the country code with the language code. Because of this, it is a good idea to keep the incorrect 'no' language code working as an alias, probably for the W3C-recommended 'nb' code.

In any case, if Whisper is unable to know the difference between Nynorsk and Bokmål, I guess the entire question is moot.
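A minimal sketch of what such an alias could look like, assuming a simple lookup table in front of the language list. The names `LANGUAGE_ALIASES`, `LANGUAGE_NAMES`, and `resolve_language` are hypothetical and do not reflect how Whisper stores its languages.

```python
# Hypothetical tables for illustration only; not Whisper's actual data.
LANGUAGE_NAMES = {
    "nb": "Norwegian Bokmål",
    "nn": "Norwegian Nynorsk",
}
LANGUAGE_ALIASES = {
    "no": "nb",  # legacy code kept working as an alias
}


def resolve_language(code):
    """Map a user-supplied code to a canonical one, honouring legacy aliases."""
    normalized = code.strip().lower()
    normalized = LANGUAGE_ALIASES.get(normalized, normalized)
    if normalized not in LANGUAGE_NAMES:
        raise ValueError(f"unsupported language code: {code!r}")
    return normalized


print(resolve_language("no"), LANGUAGE_NAMES[resolve_language("no")])
```

With this arrangement existing guides that pass 'no' keep working, while the canonical code shown to users is 'nb'.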

ryanheise 2 years ago

> You can try using --initial_prompt "Some introductory pre-sentence written in the Norwegian Nynorsk script."

By the way, did you get around to trying this?

petterreinholdtsen 2 years ago

[ryanheise]

> By the way, did you get around to trying this?

I did, running this on a random Nynorsk piece from YouTube, <URL: https://yewtu.be/watch?v=s7olTWEIwAI >:

    whisper --model medium Are\ Kalvø\ -\ Kåseri\ om\ nynorsk\ \[s7olTWEIwAI\].webm --language Nynorsk --initial_prompt "eg heitar"

Sadly, the resulting transcription is of very low quality:

    Den største fordelen me å bruke ny norsk er at det gjær det lett å framstå som langt meir intresang enn du faktisk er. For oss som koserer er det for eksempel helt opplagt enn fordel å bruke ny norsk.

There are several typos and inaccuracies here at the start of the recording. :)

ryanheise 2 years ago

Normally I would say that "eg heitar" isn't actually a pre-sentence, it's only two words, no full stop at the end, etc. and probably not a great prompt to teach Whisper what style to continue in. Although that being said, I don't hold out any hopes that it will work well if you're getting typos. That feels like there's limited training data for that script.

jongwook 2 years ago

We originally collected the language tags from the [VoxLingua107](https://bark.phon.ioc.ee/voxlingua107/) dataset, but 100% of the transcription data had `no`, and no `nn` label. We had some `nn` labels in the translation data, but I guess that's less relevant when the input is spoken Norwegian and output is English.

So the labeling was a bit haphazard, but I think it still makes sense to keep the macrolanguage label `no`, considering that those labels would've contained a mixture of Bokmål and Nynorsk. (It appears that most were in Bokmål though, as the Nynorsk prompting example above didn't work very well unfortunately.)

petterreinholdtsen 2 years ago

[Jong Wook Kim]

> We originally collected the language tags from the [VoxLingua107](https://bark.phon.ioc.ee/voxlingua107/) dataset, but 100% of the transcription data had `no`, and no `nn` label. We had some `nn` labels in the translation data, but I guess that's less relevant when the input is spoken Norwegian and output is English.

It is not really surprising that VoxLingua107 used 'no', given that it states "VoxLingua107 is a speech dataset for training spoken language identification models." There is a difference between language codes for spoken language and written language in Norway. Norwegian Bokmål and Nynorsk are written languages, while the spoken language is a dialect of Norwegian. So all written Norwegian should use either 'nb' or 'nn', and spoken language could use 'no'. If you only found 'no' in transcription data, the transcriptions are misclassified, and most likely should have been classified as 'nb', the written variant used by most people in Norway.

> So the labeling was a bit haphazard, but I think it still makes sense to keep the macrolanguage label `no`, considering that those labels would've contained a mixture of Bokmål and Nynorsk. (It appears that most were in Bokmål though, as the Nynorsk prompting example above didn't work very well unfortunately.)

Yeah. I hope someone can do a better job of training a system to write Norwegian Bokmål and Nynorsk with the correct classification in the future. At least the issue is better known in the Whisper community now. Note that there are ways to fairly accurately detect whether written text is Nynorsk or Bokmål by looking for marker words like 'jeg' (nb) vs 'eg' (nn) and 'en' (nb) vs 'ein' (nn).
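The marker-word idea can be sketched in a few lines. This is a toy heuristic, not a tested classifier: the pairs 'jeg'/'eg' and 'en'/'ein' come from the comment above, while 'ikke'/'ikkje' and 'hva'/'kva' are extra pairs added here as an assumption, and the function name `guess_written_norwegian` is made up.

```python
import re
from collections import Counter

# 'jeg'/'eg' and 'en'/'ein' are from the thread; the other pairs are assumptions.
BOKMAL_MARKERS = {"jeg", "en", "ikke", "hva"}
NYNORSK_MARKERS = {"eg", "ein", "ikkje", "kva"}


def guess_written_norwegian(text):
    """Return 'nb', 'nn', or None when the marker counts are tied."""
    words = Counter(re.findall(r"[a-zæøå]+", text.lower()))
    nb_hits = sum(words[w] for w in BOKMAL_MARKERS)
    nn_hits = sum(words[w] for w in NYNORSK_MARKERS)
    if nb_hits == nn_hits:
        return None  # no evidence either way
    return "nb" if nb_hits > nn_hits else "nn"


print(guess_written_norwegian("Jeg har en bil."))
print(guess_written_norwegian("Eg har ein bil."))
```

A check like this could be used to relabel transcription data that is currently tagged only as 'no', although short texts will often contain no marker words at all.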

jongwook closed this 2 years ago

ryanheise 290 days ago

I was looking into this recently, and one approach to work around Whisper's inclination toward Bokmål (based on its training data for no) would be to just use a separate tool to convert Whisper's Bokmål output into Nynorsk in post-processing.

For example, Apertium can be used, and can be installed locally (making it also helpful for automated pipelines).

The specific Bokmål/Nynorsk module for Apertium is here, including some examples of its output.

For timestamps, the two scripts are at least textually similar enough to match them up (e.g. with Diff Match Patch). An easier way might be to just embed timestamps of the form 00:24:07 into the source text and see if Apertium preserves them in place without disrupting the translation process. It seems to work, but I haven't fully tested that.
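The embedded-timestamp idea could look roughly like the sketch below. All names here are hypothetical, and `toy_convert` is a stand-in with three hard-coded word replacements, not a call to Apertium; whether a real converter leaves the markers untouched is exactly the untested assumption mentioned above.

```python
import re

# Capture group so that re.split keeps the markers in its output.
TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2})")


def format_timestamp(seconds):
    """Format whole seconds as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"


def embed_timestamps(segments):
    """Join (start_seconds, text) pairs into one string with inline markers."""
    return " ".join(f"{format_timestamp(start)} {text}" for start, text in segments)


def split_on_timestamps(text):
    """Recover (timestamp, text) pairs from a string with inline markers."""
    pieces = TIMESTAMP.split(text)  # ['', ts1, text1, ts2, text2, ...]
    return [(pieces[i], pieces[i + 1].strip()) for i in range(1, len(pieces) - 1, 2)]


def toy_convert(text):
    """Stand-in for a Bokmål-to-Nynorsk converter; assumes markers pass through."""
    for nb_word, nn_word in (("Jeg", "Eg"), ("heter", "heiter"), ("fra", "frå")):
        text = text.replace(nb_word, nn_word)
    return text


segments = [(1447, "Jeg heter Kari."), (1450, "Jeg er fra Oslo.")]
converted = split_on_timestamps(toy_convert(embed_timestamps(segments)))
print(converted)
```

In a real pipeline `toy_convert` would be replaced by a function that sends the text through the locally installed converter and returns its output.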

petterreinholdtsen 290 days ago

[ryanheise]

> I was looking into this recently and one approach to work around Whisper's inclination toward Bokmål (based on its training data for `no`) would be to just use a separate tool to convert Whisper's Bokmål output into Nynorsk in post processing.

This is a good point, and Apertium is doing a very good job at converting Bokmål to Nynorsk. But sadly it also requires serious proofreading, as it is far from perfect, according to the creator of the nb->nn Apertium transformer. :) In any case, the main takeaway here is that the 'no' language code is for a spoken language, while the 'nb' and 'nn' language codes are for written languages.
