transformers
Feature Update [added `initial_prompt` support for automatic-speech-recognition whisper pipeline]
#28556
Open


Biswajit2902 wants to merge 37 commits into huggingface:main from Biswajit2902:main
Biswajit2902 commented 1 year ago (edited) 🎉 8

What does this PR do?

Fixes #27317 (feature)

  • initial_prompt support for the Whisper pipeline (automatic-speech-recognition)

Before submitting

  • Added initial_prompt as an option for the Whisper model
  • To handle the initial prompt, the processor is taken as an optional pipeline parameter
  • The current implementation supports only the PyTorch decoding path.
  • How to use the initial prompt:
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-small"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
    processor=processor
)


dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
audio = dataset[0]["audio"]["array"]
sampling_rate = dataset[0]["audio"]["sampling_rate"]

# including timestamps
print(pipe(audio, initial_prompt="Biswajit, Whisper", return_timestamps=True))

# without timestamps
print(pipe(audio, initial_prompt="Biswajit, Whisper"))
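For reference, the pipeline returns the usual automatic-speech-recognition output shape: a dict with a "text" key, plus a "chunks" list when return_timestamps=True. Illustrative only (placeholder values, not real output):

# {"text": " ...", "chunks": [{"timestamp": (0.0, 5.0), "text": " ..."}, ...]}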

Who can review?

Anyone in the community is free to review the PR once the tests have passed. @sanchit-gandhi, @Narsil, could either of you help take this PR forward? Let me know if anything is needed.

Fixes #27317

Biswajit2902 test
befe862e
Biswajit2902 revert back
9eb6bf51
Biswajit2902 added option for initial_prompt for whisper API
caecfd23
Biswajit2902 added option for initial_prompt for whisper API
8eb31790
Biswajit2902 added option for initial_prompt for whisper API
74c505e6
Biswajit2902 added option for initial_prompt for whisper API
47e770f1
Biswajit2902 added option for initial_prompt for whisper API
de2f27ad
Biswajit2902 added option for initial_prompt for whisper API
cf15d96c
Biswajit2902 Merge pull request #1 from Biswajit2902/dev
a461d273
Biswajit2902 Update automatic_speech_recognition.py
62cf72b7
Biswajit2902 Update automatic_speech_recognition.py
8e50b7f7
Biswajit2902 Update automatic_speech_recognition.py
f16bd910
Biswajit2902 changed the title from Feature Update [added support for `initial_prompt` for automatic-speech-recognition whisper pipeline] to Feature Update [added `initial_prompt` support for automatic-speech-recognition whisper pipeline] 1 year ago
Biswajit2902 fixed formatting
448c6f1b
Biswajit2902 reformatted src/transformers/pipelines/automatic_speech_recognition.py
e58e861b
Biswajit2902 marked this pull request as ready for review 1 year ago
Biswajit2902 Merge branch 'main' into main
3169e2d1
Biswajit2902 Merge branch 'main' into main
1d4b9ed8
Biswajit2902 Merge branch 'main' into main
7894a72d
Biswajit2902 Merge branch 'main' into main
2180226a
Biswajit2902 Merge branch 'main' into main
10d802d3
kaminwong commented 1 year ago

Hi, thank you, your code saved my day! I think line 535 needs to be modified a bit so the prompt tensor lands on the GPU when one is available:

prompt_tensor = (
    torch.tensor(generate_kwargs["prompt_ids"], dtype=out["tokens"].dtype).cuda()
    if is_torch_cuda_available
    else torch.tensor(generate_kwargs["prompt_ids"], dtype=out["tokens"].dtype)
)

and is_torch_cuda_available needs to be added to the imports on line 22. Without CUDA it will run on the CPU, which is a lot slower.

Biswajit2902 commented 1 year ago

@kaminwong, this part only modifies the output sequence to avoid showing the initial_prompt in the transcription.

The actual generation handles device placement in the line below:

tokens = self.model.generate(
    attention_mask=attention_mask,
    **generate_kwargs,
)

Apart from this, the token-decoding part is a serial implementation, so moving it to the GPU would bring no benefit and would just waste GPU resources.

kaminwong commented 1 year ago (edited)

Thanks for the reply! But if I don't make that change I get the following error, so I assume prompt_tensor needs to be on CUDA when the model is on CUDA? Or is there another way to fix the error? Thank you for your time.

File "/.../python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 538, in _forward
    if (tmp_tokens[0:nprompt_token] == prompt_tensor).sum() == nprompt_token:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

I followed the code you posted:


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-small"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
    processor=processor
)

Biswajit2902 commented 1 year ago

@kaminwong, thank you for flagging this. I understand the issue now; let me verify and resolve it.

Biswajit2902 added device handle for whisper decoding (with initial_prompt) in src…
3af7f9aa
Biswajit2902 added device handle for whisper decoding (with initial_prompt) in src…
3fa18aba
Biswajit2902 commented 1 year ago

@kaminwong, you can pull the latest commit and install; it should work now. It's fixed.
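The essence of the fix is to create the prompt tensor on the same device (and with the same dtype) as the generated tokens, along these lines (a minimal sketch, not necessarily the exact committed code):

prompt_tensor = torch.tensor(
    generate_kwargs["prompt_ids"],
    dtype=out["tokens"].dtype,
    device=out["tokens"].device,  # keep the comparison on a single device
)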

kaminwong commented 1 year ago 👍 1

Thank you for the elegant solution. It works now!

amyeroberts commented 1 year ago

Gentle ping @sanchit-gandhi for review

Biswajit2902 commented 1 year ago

@amyeroberts, is there any plan to merge this in the near future, or will it take more time?

amyeroberts commented 1 year ago 👍 2

@Biswajit2902 Once @sanchit-gandhi has reviewed and approved, the PR will need a final review from a maintainer. Once that's approved, the PR can be merged in.

Biswajit2902 Merge branch 'huggingface:main' into main
9c9e4239
Biswajit2902 Merge branch 'huggingface:main' into main
dd3ae32e
Biswajit2902 Merge branch 'huggingface:main' into main
d1668426
amyeroberts requested a review from sanchit-gandhi 1 year ago
Biswajit2902 Merge branch 'huggingface:main' into main
964eeda5
Biswajit2902 Merge branch 'huggingface:main' into main
acd15663
sanchit-gandhi commented on 2024-03-28 ❤ 1

Hey @Biswajit2902 - thanks for working on this welcome feature! Super sorry for the late review here. Left some comments regarding the pipeline design and how we can simplify.

src/transformers/pipelines/automatic_speech_recognition.py
    processor: Optional[AutoProcessor] = None,
    **kwargs,
):

    self.processor = processor
sanchit-gandhi commented 1 year ago

Unfortunately, we can't accept the processor as an attribute of the pipeline, for reasons mentioned here.

src/transformers/pipelines/automatic_speech_recognition.py
    # Added initial prompt for whisper
    if "initial_prompt" in generate_kwargs:
        initial_prompt = generate_kwargs.pop("initial_prompt")
        generate_kwargs["prompt_ids"] = self.processor.get_prompt_ids(initial_prompt)
sanchit-gandhi commented 1 year ago 👍 1

By design, we can get the same behaviour using the tokenizer method get_prompt_ids:

generate_kwargs["prompt_ids"] = self.tokenizer.get_prompt_ids(initial_prompt)

Let's simplify this logic to only ever have the feature extractor + tokenizer, and always rely on tokenizer.get_prompt_ids to convert the prompt.
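For example, the conversion could be as simple as this (assuming a Whisper checkpoint; get_prompt_ids is a WhisperTokenizer method):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
prompt_ids = tokenizer.get_prompt_ids("Biswajit, Whisper")  # numpy array of token ids by default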

src/transformers/pipelines/automatic_speech_recognition.py
    else:
        out = {"tokens": tokens}

    if "prompt_ids" in generate_kwargs:
sanchit-gandhi commented 1 year ago

This should be a post-processing step, rather than in the _forward. Could you move it to the method postprocess please?
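A hypothetical helper for that post-processing step (strip_prompt is an illustrative name, not code from this PR):

import numpy as np

def strip_prompt(tokens, prompt_ids):
    # drop the prompt ids from the front of a generated 1-D token sequence, if present
    n = len(prompt_ids)
    if len(tokens) >= n and np.array_equal(tokens[:n], prompt_ids):
        return tokens[n:]
    return tokens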

Biswajit2902 commented 1 year ago

@sanchit-gandhi, thank you so much for the review. I will work on your comments.

github-actions commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

thomasmol commented 1 year ago

@Biswajit2902 any new updates? Let me know if you need help.

Biswajit2902 commented 1 year ago ❤ 2

@thomasmol I will update this soon; I have been busy for the past two weeks. Thank you for the reminder.

amyeroberts added Core: Pipeline
amyeroberts added Audio
Biswajit2902 commented 1 year ago (edited)

@thomasmol @sanchit-gandhi, I see the conflict below in AutomaticSpeechRecognitionPipeline._sanitize_parameters:

<<<<<<< main
            forward_params["generate_kwargs"]["max_new_tokens"] = max_new_tokens
        if initial_prompt is not None:
            forward_params["generate_kwargs"]["initial_prompt"] = initial_prompt
=======
            forward_params["max_new_tokens"] = max_new_tokens
>>>>>>> main

I want to understand why generate_kwargs (and with it initial_prompt) was removed from forward_params.

My changes were working fine before, but after this there seems to be a bug. I am working on resolving it, so I need your input on this.

sanchit-gandhi commented 355 days ago (edited) 👍 1

Hey @Biswajit2902 - you can read the motivation for this change here. Essentially, we're unifying the forward_params and generate_kwargs in _sanitize_parameters. However, for the purposes of your feature, you should strive to put the initial_prompt under preprocess_params:

preprocess_params["initial_prompt"] = initial_prompt

And then convert the text prompt to token ids in the preprocess method, which will then be passed to _forward.
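Concretely, the two methods could look roughly like this (a heavily abbreviated, hypothetical sketch, not the final implementation):

def _sanitize_parameters(self, initial_prompt=None, **kwargs):
    preprocess_params = {}
    if initial_prompt is not None:
        preprocess_params["initial_prompt"] = initial_prompt
    # ... handle the remaining kwargs as on main ...
    return preprocess_params, {}, {}

def preprocess(self, inputs, initial_prompt=None, **kwargs):
    if initial_prompt is not None:
        # convert the text prompt to token ids once, up front;
        # _forward can then pass them to generate() as prompt_ids
        prompt_ids = self.tokenizer.get_prompt_ids(initial_prompt, return_tensors="pt")
    # ... existing chunking / feature-extraction logic ...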

Biswajit2902 commented 352 days ago

@sanchit-gandhi, thanks for the pointer. Sorry, I got super busy and couldn't get back to this. I will do it soon and close it out.

Biswajit2902 updated initial prompt implementation
49d44ecd
Biswajit2902 Merge branch 'main' into latest-2605
536ac701
Biswajit2902 Merge pull request #2 from Biswajit2902/latest-2605
b2e84276
Biswajit2902 Update automatic_speech_recognition.py
bf9068ac
Biswajit2902 clean up
4aaa93c3
Biswajit2902 Merge pull request #3 from Biswajit2902/latest-2605
92cb5069
Biswajit2902 clean up
cd2e2648
Biswajit2902 Merge pull request #4 from Biswajit2902/latest-2605
9cc34f42
Biswajit2902 commented 348 days ago

@sanchit-gandhi, just an update. I have made the changes as suggested, but I have found that the output is not correct like before; generate seems to have an issue where it adds the initial prompt to every chunk. I will check and update on this. Also, let me know if there is an existing issue tracking this, to your knowledge.

Biswajit2902 Merge branch 'huggingface:main' into main
877fda40
huggingface deleted a comment from github-actions on 2024-07-16
basicblueberrry136 commented 277 days ago

Are there any updates on this? Or do you know of other ways to push the model to more easily detect certain words using this pipeline?

Biswajit2902 Merge branch 'huggingface:main' into main
e35154d2
Biswajit2902 Merge branch 'huggingface:main' into main
f171d8a6
ylacombe commented 250 days ago

Hey @basicblueberrry136, thanks for your comment!
@sanchit-gandhi's review still has to be addressed before the next steps. Once it's done, I'll make another review! Hopefully it'll move fast!

JacobLinCool commented 201 days ago

I believe this is very helpful when used with the serverless inference API.

It seems that the serverless inference API uses the Transformers library to run models, and we cannot pass any parameter that has a type of tensor, as shown below:

const fs = require('fs');

const data = fs.readFileSync(filename);  // filename: path to the audio file
const b64 = data.toString('base64');

const body = JSON.stringify({
    inputs: b64,
    parameters: {
        return_timestamps: true,
        generate_kwargs: {
            num_beams: 1,
            prompt_ids: [50362, 27338, 3763, 48022, 2257, 48022, 6784, 118, 25157, 1546, 15789, 23987, 5975, 17174, 28472, 25750, 6062, 1543],
        }
    }
});

It results in the following error:

{
  "error": "unknown error",
  "warnings": [
    "There was an inference error: unknown error: list indices must be integers or slices, not NoneType"
  ]
}

If initial_prompt is added, we can pass the prompt as a string to the serverless inference API.

jollyfish-cjy commented 83 days ago

Hi, thanks for your work! Are there any updates on this?
