Hi, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).
Hey @FaresBadrCA
Thanks a lot for your PR!
Can you please provide a reproducer of what's not working with #35750 (it works fine in my tests)? That PR takes into account the internals and specificities of Whisper's tricky heuristics, so I'd rather work from it. If it's indeed not doing what's expected, I'd be glad to review your PR.
Hi @eustlb, below is a snippet I used for testing, using the LinusTech dataset.
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("Whispering-GPT/linustechtips-transcript-audio", split="train", streaming=True)
sample = list(dataset.take(1))[0]

result = pipe(
    sample["audio"],
    return_timestamps=True,
    generate_kwargs={"language": "english", "condition_on_prev_tokens": True},
)
print(result["chunks"][:5])
```
Running the code twice, once for this PR (#36612) and once for the other PR (#35750), I get the results below.
Output with this PR (#36612), which forces `return_segments=True`:

```python
[{'text': " So guys today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620. So this",
  'timestamp': (0.0, 13.72)},
 {'text': ' is the little brother to the Cooler H2O 920. Now the key difference between this one and the 920,',
  'timestamp': (13.72, 20.400000000000002)},
 {'text': " they're both fairly similar in terms of the design, is the thickness of the radiator. So",
  'timestamp': (20.400000000000002, 27.36)},
 {'text': ' So while the 620 uses a thinner style radiator that offers the advantage of better compatibility',
  'timestamp': (27.36, 33.6)},
 {'text': ' with cases on the market, the 920 is going to offer better performance due to the larger',
  'timestamp': (33.6, 40.0)}]
```
Output with PR #35750:

```python
[{'timestamp': (0.0, 13.72),
  'text': " So guys today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620. So this"},
 {'timestamp': (13.72, 20.4),
  'text': ' is the little brother to the Cooler H2O 920. Now the key difference between this one and the 920,'},
 {'timestamp': (20.4, 27.36),
  'text': " they're both fairly similar in terms of the design, is the thickness of the radiator. So"},
 {'timestamp': (30.0, 36.54),
  'text': ' while the 620 uses a thinner style radiator that offers the advantage of better compatibility with'},
 {'timestamp': (36.54, 43.5),
  'text': ' cases on the market, the 920 is going to offer better performance due to the larger surface area.'}]
```
Note the fourth segment: it should go from 27.3 to 33.6, but instead it goes from 30.0 to 36.5. It is delayed by about 3 seconds, and that delay carries over to the subsequent segments.
I noticed this issue only happens when `condition_on_prev_tokens=True`.
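A minimal sketch for double-checking this, reusing the `pipe` and `sample` objects from the snippet above with only the flag flipped:

```python
# Same call as above, with conditioning disabled, to check whether the ~3 s drift disappears.
result_no_cond = pipe(
    sample["audio"],
    return_timestamps=True,
    generate_kwargs={"language": "english", "condition_on_prev_tokens": False},
)
print(result_no_cond["chunks"][:5])
```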
For reference, below are the "correct" segments provided in the dataset.
```python
sample['segments'][:5]

{'start': 0.0, 'end': 13.48, 'text': " So guys, today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620."}
{'start': 13.48, 'end': 17.94, 'text': ' So this is the little brother to the Cooler H2O 920.'}
{'start': 17.94, 'end': 22.76, 'text': " Now the key difference between this one and the 920, they're both fairly similar in terms"}
{'start': 22.76, 'end': 27.3, 'text': ' of the design, is the thickness of the radiator.'}
{'start': 27.3, 'end': 33.56, 'text': ' So while the 620 uses a thinner style radiator that offers the advantage of better compatibility'}
```
I took a look at it, and what you've spotted is actually an issue. Thanks a lot for that!
That is exactly why we want to go with #35750: the output should be equivalent to what you get by looking directly at the segments (which is what you're doing in this PR). It is also why this PR won't get merged: we do not want to bypass decoding directly from the output tokens.
Anyway, thanks a lot again for spotting this issue. I added a fix for it in #35750 and will also add a test for it.
Closing this now for the above-mentioned reasons.
What does this PR do?
Fixes #34210 and #31942.
This is an alternative to PR #35750.
It resolves the issue of timestamps rolling over every 30 seconds in the Whisper model's long-form transcription. It does this by forcing `return_segments` to be `True` when `return_timestamps` is `True`.
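As an illustration of what that exposes (a sketch only, not the actual diff; it assumes `model.generate(..., return_segments=True)` returns per-segment `start`/`end` values as in recent long-form Whisper generation, and reuses the objects from the reproducer above):

```python
# Sketch only (not the actual diff): read timestamps straight from Whisper's
# long-form segments, which is what forcing return_segments=True surfaces.
# Reuses `model`, `processor`, `sample`, `device`, `torch_dtype` from the reproducer.
inputs = processor(
    sample["audio"]["array"],
    sampling_rate=sample["audio"]["sampling_rate"],
    return_tensors="pt",
    truncation=False,          # keep the full long-form audio
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, torch_dtype)

out = model.generate(
    **inputs,
    return_segments=True,      # segment-level output with start/end times
    return_timestamps=True,
    condition_on_prev_tokens=True,
    language="english",
)

for seg in out["segments"][0][:5]:
    text = processor.decode(seg["tokens"], skip_special_tokens=True)
    print(float(seg["start"]), float(seg["end"]), text)
```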
Before submitting
Who can review?
@eustlb, @Rocketknight1, @gante, @ylacombe