Hi, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).
Hey @FaresBadrCA
Thanks a lot for your PR!
Can you please provide a reproducer of what's not working with #35750 (it works fine in my tests)? That PR takes into account the internals and specificities of Whisper's tricky heuristics, so I'd rather work from it. If it's indeed not doing what's expected, I'd be glad to review your PR.
Hi @eustlb, below is a snippet I used for testing, using the LinusTech dataset.
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("Whispering-GPT/linustechtips-transcript-audio", split="train", streaming=True)
sample = list(dataset.take(1))[0]

result = pipe(
    sample["audio"],
    return_timestamps=True,
    generate_kwargs={"language": "english", "condition_on_prev_tokens": True},
)
print(result["chunks"][:5])
```
Running the code twice, once for this PR (#36612) and once for the other PR (#35750), I get the results below.
Output with this PR (#36612), which forces `return_segments=True`:

```python
[{'text': " So guys today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620. So this",
  'timestamp': (0.0, 13.72)},
 {'text': ' is the little brother to the Cooler H2O 920. Now the key difference between this one and the 920,',
  'timestamp': (13.72, 20.400000000000002)},
 {'text': " they're both fairly similar in terms of the design, is the thickness of the radiator. So",
  'timestamp': (20.400000000000002, 27.36)},
 {'text': ' So while the 620 uses a thinner style radiator that offers the advantage of better compatibility',
  'timestamp': (27.36, 33.6)},
 {'text': ' with cases on the market, the 920 is going to offer better performance due to the larger',
  'timestamp': (33.6, 40.0)}]
```
Output with PR #35750:

```python
[{'timestamp': (0.0, 13.72),
  'text': " So guys today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620. So this"},
 {'timestamp': (13.72, 20.4),
  'text': ' is the little brother to the Cooler H2O 920. Now the key difference between this one and the 920,'},
 {'timestamp': (20.4, 27.36),
  'text': " they're both fairly similar in terms of the design, is the thickness of the radiator. So"},
 {'timestamp': (30.0, 36.54),
  'text': ' while the 620 uses a thinner style radiator that offers the advantage of better compatibility with'},
 {'timestamp': (36.54, 43.5),
  'text': ' cases on the market, the 920 is going to offer better performance due to the larger surface area.'}]
```
Note the fourth segment: it should go from 27.3 to 33.6, but instead it goes from 30.0 to 36.5. It is delayed by about 3 seconds, and that delay carries over to the subsequent segments.
I noticed this issue only happens when `condition_on_prev_tokens=True`.
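A minimal sketch for double-checking this, reusing the `pipe` and `sample` objects from the snippet above with only the flag flipped:

```python
# Same call as above, with conditioning disabled, to check whether the ~3 s drift disappears.
result_no_cond = pipe(
    sample["audio"],
    return_timestamps=True,
    generate_kwargs={"language": "english", "condition_on_prev_tokens": False},
)
print(result_no_cond["chunks"][:5])
```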
For reference, below are the "correct" segments provided in the dataset.
```python
sample['segments'][:5]

{'start': 0.0, 'end': 13.48, 'text': " So guys, today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620."}
{'start': 13.48, 'end': 17.94, 'text': ' So this is the little brother to the Cooler H2O 920.'}
{'start': 17.94, 'end': 22.76, 'text': " Now the key difference between this one and the 920, they're both fairly similar in terms"}
{'start': 22.76, 'end': 27.3, 'text': ' of the design, is the thickness of the radiator.'}
{'start': 27.3, 'end': 33.56, 'text': ' So while the 620 uses a thinner style radiator that offers the advantage of better compatibility'}
```
I took a look at it, and what you've spotted is actually an issue. Thanks a lot for that!
That is exactly why we want to go with #35750: the output should be equivalent to what you get by looking directly at the segments (which is what you're doing in this PR). It is also why this PR won't get merged: we do not want to bypass decoding directly from the output tokens.
Anyway, thanks a lot again for spotting this issue. I added a fix for it in #35750 and will also add a test for it.
Closing this now for the above-mentioned reasons.
What does this PR do?
Fixes #34210 and #31942.
This is an alternative to PR #35750.
It resolves the issue of timestamps rolling over every 30 seconds in the Whisper model's long-form transcription. It does this by forcing `return_segments` to be `True` when `return_timestamps` is `True`.
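As an illustration of what that exposes (a sketch only, not the actual diff; it assumes `model.generate(..., return_segments=True)` returns per-segment `start`/`end` values as in recent long-form Whisper generation, and reuses the objects from the reproducer above):

```python
# Sketch only (not the actual diff): read timestamps straight from Whisper's
# long-form segments, which is what forcing return_segments=True surfaces.
# Reuses `model`, `processor`, `sample`, `device`, `torch_dtype` from the reproducer.
inputs = processor(
    sample["audio"]["array"],
    sampling_rate=sample["audio"]["sampling_rate"],
    return_tensors="pt",
    truncation=False,          # keep the full long-form audio
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, torch_dtype)

out = model.generate(
    **inputs,
    return_segments=True,      # segment-level output with start/end times
    return_timestamps=True,
    condition_on_prev_tokens=True,
    language="english",
)

for seg in out["segments"][0][:5]:
    text = processor.decode(seg["tokens"], skip_special_tokens=True)
    print(float(seg["start"]), float(seg["end"]), text)
```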
Before submitting
Who can review?
@eustlb, @Rocketknight1, @gante, @ylacombe