transformers
Fixed 30s timestamp resets in Whisper long-form transcription
#36612
Closed


FaresBadrCA commented 118 days ago

What does this PR do?

Fixes #34210 and #31942.
This is an alternative to PR #35750.

It resolves the issue of timestamps resetting every 30 seconds in the Whisper model's long-form transcription. It does this by forcing return_segments to be True whenever return_timestamps is True, so the chunk timestamps are built from the segment-level start/end times rather than from the per-window timestamp tokens.
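
For illustration, here is a rough sketch of what relying on return_segments=True means at the model level (this is not the PR diff; the input handling and the segment field names "start"/"end"/"tokens" follow my reading of Whisper's long-form generation API and should be treated as assumptions):

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "openai/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# long_audio is a placeholder for a mono 16 kHz waveform longer than 30 s.
inputs = processor(
    long_audio,
    sampling_rate=16_000,
    return_tensors="pt",
    truncation=False,            # keep the whole audio so long-form generation kicks in
    padding="longest",
    return_attention_mask=True,
)

out = model.generate(
    **inputs,
    return_timestamps=True,
    return_segments=True,        # the flag this PR forces on whenever timestamps are requested
    condition_on_prev_tokens=True,
)

# With return_segments=True, generate returns a dict whose "segments" entry carries
# absolute start/end times that keep increasing past each 30 s window.
for seg in out["segments"][0]:
    print(float(seg["start"]), float(seg["end"]),
          processor.decode(seg["tokens"], skip_special_tokens=True))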

Before submitting

Who can review?

@eustlb, @Rocketknight1, @gante, @ylacombe

Fixed 30s timestamp resets in Whisper long-form transcription by enfo…
f36e7e46
github-actions commented 118 days ago

Hi πŸ‘‹, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

github-actions marked this pull request as draft 118 days ago
FaresBadrCA marked this pull request as ready for review 118 days ago
github-actions requested a review from ArthurZucker 118 days ago
github-actions requested a review from Rocketknight1 118 days ago
eustlb commented 114 days ago

Hey @FaresBadrCA

Thanks a lot for your PR! πŸ€—
Could you please provide a reproducer of what's not working with #35750 (it works fine in my tests)? That PR takes into account the internals and specificities of Whisper's tricky heuristics, and I'd rather work from it. If it's indeed not doing what's expected, I'd be glad to review your PR.

FaresBadrCA commented 114 days ago (edited 113 days ago)

Hi @eustlb, below is a snippet I used for testing, using the LinusTech dataset.

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Stream a single long-form sample from the LinusTechTips dataset.
dataset = load_dataset("Whispering-GPT/linustechtips-transcript-audio", split="train", streaming=True)
sample = next(iter(dataset.take(1)))

# Long-form transcription with timestamps, conditioning on previous tokens.
result = pipe(
    sample["audio"],
    return_timestamps=True,
    generate_kwargs={"language": "english", "condition_on_prev_tokens": True},
)
print(result["chunks"][:5])

Running the code twice, once with this PR (#36612) and once with the other PR (#35750), I get the results below.

PR #36612: using return_segments=True

[{'text': " So guys today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620. So this",
  'timestamp': (0.0, 13.72)},
 {'text': ' is the little brother to the Cooler H2O 920. Now the key difference between this one and the 920,',
  'timestamp': (13.72, 20.400000000000002)},
 {'text': " they're both fairly similar in terms of the design, is the thickness of the radiator. So",
  'timestamp': (20.400000000000002, 27.36)},
 {'text': ' So while the 620 uses a thinner style radiator that offers the advantage of better compatibility',
  'timestamp': (27.36, 33.6)},
 {'text': ' with cases on the market, the 920 is going to offer better performance due to the larger',
  'timestamp': (33.6, 40.0)}]

PR #35750: using timestamp tokens

[{'timestamp': (0.0, 13.72),
  'text': " So guys today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620. So this"},
 {'timestamp': (13.72, 20.4),
  'text': ' is the little brother to the Cooler H2O 920. Now the key difference between this one and the 920,'},
 {'timestamp': (20.4, 27.36),
  'text': " they're both fairly similar in terms of the design, is the thickness of the radiator. So"},
 {'timestamp': (30.0, 36.54),
  'text': ' while the 620 uses a thinner style radiator that offers the advantage of better compatibility with'},
 {'timestamp': (36.54, 43.5),
  'text': ' cases on the market, the 920 is going to offer better performance due to the larger surface area.'}]

Note the fourth segment: it should go from 27.36 to 33.6 but instead goes from 30.0 to 36.54. It is delayed by about 3 seconds, and that delay carries over to all subsequent segments.
I noticed this issue only happens when condition_on_prev_tokens=True.
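
To make the drift easier to spot, I used a small helper like the one below (my own check, not part of either PR) that flags any chunk whose start does not line up with the previous chunk's end:

def find_timestamp_gaps(chunks, tol=0.01):
    """Return (index, previous_end, current_start) for chunks that do not start where the previous one ended."""
    gaps = []
    for i in range(1, len(chunks)):
        prev_end = chunks[i - 1]["timestamp"][1]
        cur_start = chunks[i]["timestamp"][0]
        if abs(cur_start - prev_end) > tol:
            gaps.append((i, prev_end, cur_start))
    return gaps

# On the #35750 output above this reports a jump of about 2.6 s at chunk index 3
# (27.36 -> 30.0); on the #36612 output it reports nothing.
print(find_timestamp_gaps(result["chunks"]))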

For reference, below are the "correct" segments provided with the dataset.

Provided segments (sample['segments'][:5])

{'start': 0.0, 'end': 13.48, 'text': " So guys, today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620."}
{'start': 13.48, 'end': 17.94, 'text': ' So this is the little brother to the Cooler H2O 920.'}
{'start': 17.94, 'end': 22.76, 'text': " Now the key difference between this one and the 920, they're both fairly similar in terms"}
{'start': 22.76, 'end': 27.3, 'text': ' of the design, is the thickness of the radiator.'}
{'start': 27.3, 'end': 33.56, 'text': ' So while the 620 uses a thinner style radiator that offers the advantage of better compatibility'}
eustlb commented 113 days ago

I took a look at it, and what you've spotted is actually an issue, thanks a lot for that πŸ™

That is exactly why we want to go with #35750: the output should be equivalent to what you get by looking directly at the segments (which is what you're doing in this PR). That is also why this PR won't get merged: we do not want to bypass decoding the timestamps directly from the output tokens.

Anyway, thanks a lot again for spotting this issue. I added a fix for it in #35750 and will also add a test for it 😊

ArthurZucker removed review request from Rocketknight1 105 days ago
ArthurZucker removed review request from ArthurZucker 105 days ago
eustlb commented 7 days ago

Closing this now for the above-mentioned reasons.

eustlb closed this 7 days ago
