From my understanding, this may be similar to Chinese, which also has multiple written variants, although Chinese also has multiple spoken dialects. Whisper was trained on all of these written and spoken variants under the umbrella of "Chinese", so a single language code has to describe all of them. Since there is only one language code for Chinese, you get it to transcribe a particular variant not by specifying a different language but by using the same language code and a prompt that starts it off in that variant. It may not be a strictly correct use of language codes, but the training has already been done that way.
I haven't tried transcribing Norwegian, but is it similar in that Whisper contains training data for both writing systems under a single umbrella of "no" which can be accessed by using a prompt to set it off in one of those two directions? If so, I'm not sure if changing it to "nb" would make sense.
You can try using --initial_prompt "Some introductory pre-sentence written in the Norwegian Nynorsk script."
So you just write a made-up sentence in the script you want, and you might get better results if it is a sentence that you could plausibly imagine as having been spoken before the first actual sentence in the audio you're transcribing.
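For anyone who prefers the Python API over the command line, here is a minimal sketch of the same idea. It assumes the openai-whisper package is installed; the Nynorsk sentence, the file name in the usage note, and the `is_plausible_prompt` helper are made up for illustration, and the helper is only a rough heuristic for "looks like a full sentence", not anything Whisper itself checks.

```python
# Illustrative Nynorsk sentence (made up); ideally something that could
# plausibly have been said just before the audio starts.
NYNORSK_PROMPT = "Eg heiter Kari, og eg kjem frå Noreg."


def is_plausible_prompt(text):
    """Rough heuristic: a few words, ending with sentence punctuation."""
    stripped = text.strip()
    return len(stripped.split()) >= 3 and stripped.endswith((".", "!", "?"))


def transcribe_with_prompt(audio_path, prompt=NYNORSK_PROMPT, model_name="base"):
    """Transcribe with the 'no' language code and a variant-setting prompt."""
    import whisper  # openai-whisper, imported lazily so the helpers work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language="no", initial_prompt=prompt)
    return result["text"]
```

Calling `transcribe_with_prompt("interview.mp3")` would then be roughly equivalent to passing `--language no --initial_prompt "..."` on the command line. Whether the output actually comes out in Nynorsk depends on how much Nynorsk the model saw in training.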
The rest of the world uses no, and making this change would result in numerous posts from all over the world asking why -no doesn't work, not to mention the hundreds of guides and copies of documentation that use -no.
As a minimum -no should be kept for legacy support and nn and nb should only be implemented if whisper actually knows the difference.
There are certainly quite a few people who use the Norwegian language codes incorrectly or confuse the country code with the language code. Because of this, it is a good idea to keep the incorrect 'no' language code working as an alias, probably for the W3C-recommended 'nb' code.
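To make the alias idea concrete, here is a small sketch of what it could look like. This is not how Whisper handles language codes today; the `LEGACY_ALIASES` table and the `normalize_language_code` function are hypothetical names used only for illustration.

```python
# Hypothetical alias table: legacy code -> preferred code.
LEGACY_ALIASES = {"no": "nb"}


def normalize_language_code(code):
    """Map a legacy code such as 'no' to its preferred form; pass others through."""
    cleaned = code.strip().lower()
    return LEGACY_ALIASES.get(cleaned, cleaned)
```

With something like this in front of the language lookup, existing guides that pass 'no' would keep working, while 'nb' and 'nn' could be accepted as distinct codes.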
In any case, if Whisper is unable to know the difference between Nynorsk and Bokmål, I guess the entire question is moot.
You can try using --initial_prompt "Some introductory pre-sentence written in the Norwegian Nynorsk script."
By the way, did you get around to trying this?
Normally I would say that "eg heitar" isn't actually a pre-sentence: it's only two words, with no full stop at the end, and probably not a great prompt to teach Whisper what style to continue in. That said, I don't hold out much hope that it will work well if you're getting typos. That suggests there's limited training data for that script.
We originally collected the language tags from the VoxLingua107 dataset, but 100% of the transcription data had the 'no' label, and none had the 'nn' label. We had some 'nn' labels in the translation data, but I guess that's less relevant when the input is spoken Norwegian and the output is English.
So the labeling was a bit haphazard, but I think it still makes sense to keep the macrolanguage label 'no', considering that those labels would have contained a mixture of Bokmål and Nynorsk. (It appears that most were in Bokmål, though, as the Nynorsk prompting example above unfortunately didn't work very well.)
I was looking into this recently, and one approach to work around Whisper's inclination toward Bokmål (based on its training data for 'no') would be to use a separate tool to convert Whisper's Bokmål output into Nynorsk in post-processing.
For example, Apertium can be used, and can be installed locally (making it also helpful for automated pipelines).
The specific Bokmål/Nynorsk module for Apertium is here, including some examples of its output.
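As a rough sketch of how that post-processing step could be wired into a pipeline, the snippet below pipes text through the `apertium` command-line tool. It assumes Apertium and the Bokmål/Nynorsk language pair are installed locally, and that the translation mode is named `nob-nno`; that mode name is my assumption, so check the list of installed modes on your system. The function names are made up for illustration.

```python
import subprocess

# Assumed mode name for Bokmål -> Nynorsk; verify against your installation.
APERTIUM_MODE = "nob-nno"


def apertium_command(mode=APERTIUM_MODE):
    """Build the argument list for an Apertium run that reads from stdin."""
    return ["apertium", mode]


def bokmaal_to_nynorsk(text, mode=APERTIUM_MODE):
    """Send Whisper's Bokmål output through Apertium and return the result."""
    completed = subprocess.run(
        apertium_command(mode),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if Apertium exits with an error
    )
    return completed.stdout.strip()
```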
For timestamps, the two scripts are at least textually similar enough to match them up (e.g. Diff Match Patch). An easier way might be to just embed timestamps of the form 00:24:07 into the source text and see if Apertium preserves them in place without disrupting the translation process. It seems to work, but I haven't fully tested that.
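Here is a minimal sketch of the "match them up" idea. It uses Python's standard `difflib.SequenceMatcher` as a stand-in for Diff Match Patch, and it assumes you already have one start time per Bokmål word (for example from word-level timestamps). The function name and the fallback rule for blocks of unequal length are my own choices, not something taken from either library.

```python
from difflib import SequenceMatcher


def transfer_timestamps(source_words, source_times, target_words):
    """Carry per-word start times from the source words to the target words.

    Words in blocks of equal length are mapped one to one. Where the block
    lengths differ, every target word in the block gets the start time of
    the source word at the beginning of that block.
    """
    target_times = [None] * len(target_words)
    if not source_words:
        return target_times

    matcher = SequenceMatcher(a=source_words, b=target_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            continue  # source words with no counterpart in the target
        same_length = (i2 - i1) == (j2 - j1)
        for offset in range(j2 - j1):
            if same_length:
                src = i1 + offset
            else:
                src = min(i1, len(source_times) - 1)
            target_times[j1 + offset] = source_times[src]
    return target_times
```

For example, with the Bokmål words "Jeg heter Kari og kommer fra Norge" and the Nynorsk words "Eg heiter Kari og kjem frå Noreg", every word lines up one to one, so the Nynorsk words simply inherit the Bokmål start times.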
The 'no' language code is an obsolete language code that is the union of the language codes 'nb' (Norwegian Bokmål) and 'nn' (Norwegian Nynorsk), the two written variants in use in Norway. As 'no' is misleading, and seems to be used by Whisper to mean Norwegian Bokmål, I recommend replacing it with 'nb' and using the full names for both of the Norwegian written forms.
Useful references: