Hi, thank you, your code saved my day! I think line 535 needs to be modified a bit:

```python
prompt_tensor = (
    torch.tensor(generate_kwargs["prompt_ids"], dtype=out["tokens"].dtype).cuda()
    if is_torch_cuda_available()
    else torch.tensor(generate_kwargs["prompt_ids"], dtype=out["tokens"].dtype)
)
```

and the `is_torch_cuda_available` import needs to be added to line 22. Without CUDA it runs on the CPU, which is a lot slower.
@kaminwong, this is just to modify the output sequence to avoid showing the `initial_prompt` in the transcription.
The actual generation handles the device in the lines below:
```python
tokens = self.model.generate(
    attention_mask=attention_mask,
    **generate_kwargs,
)
```
Apart from this, the token decoding part is a serialized implementation, so moving it to the GPU has no effect and would just be a misuse of the GPU.
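For reference, a device-agnostic way to build that comparison tensor (a minimal sketch, not the exact code in the PR) is to create it directly on the device of the generated tokens:

```python
import torch

# Hypothetical sketch: build the prompt tensor on whatever device the generated
# tokens already live on, so the comparison works identically on CPU and GPU.
prompt_tensor = torch.tensor(
    generate_kwargs["prompt_ids"],
    dtype=out["tokens"].dtype,
    device=out["tokens"].device,
)
```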
Thanks for the reply! But if I don't make those changes I get the following error, so I assume `prompt_tensor` needs to be on CUDA if the model is also on CUDA? Or is there another way to correct the error? Thank you for your time.
File "/.../python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 538, in _forward if (tmp_tokens[0:nprompt_token] == prompt_tensor).sum() == nprompt_token: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
I followed the code you posted:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-small"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
chunk_length_s=15,
batch_size=16,
torch_dtype=torch_dtype,
device=device,
processor=processor
)
@kaminwong, thank you for flagging this. I understood the issue; let me verify and resolve it.
@kaminwong, you can pull the latest commit and install it; it should work now. It's fixed.
Thank you for the elegant solution. It works now!
Gentle ping @sanchit-gandhi for review
@amyeroberts is there any plan to close this in the near future, or will it take time?
@Biswajit2902 Once @sanchit-gandhi has reviewed and approved, the PR will need a final review from a maintainer. Once approved, then the PR can be merged in.
Hey @Biswajit2902 - thanks for working on this welcome feature! Super sorry for the late review here. Left some comments regarding the pipeline design and how we can simplify.
```python
        processor: Optional[AutoProcessor] = None,
        **kwargs,
    ):
        self.processor = processor
```
Unfortunately, we can't accept the `processor` as an attribute of the pipeline, for reasons mentioned here.
```python
        # Added initial prompt for whisper
        if "initial_prompt" in generate_kwargs:
            initial_prompt = generate_kwargs.pop("initial_prompt")
            generate_kwargs["prompt_ids"] = self.processor.get_prompt_ids(initial_prompt)
```
By design, we can get the same behaviour using the tokenizer method `get_prompt_ids`:

```python
generate_kwargs["prompt_ids"] = self.tokenizer.get_prompt_ids(initial_prompt)
```

Let's simplify this logic to only ever have the feature extractor + tokenizer, and always rely on `tokenizer.get_prompt_ids` to convert the prompt.
```python
        else:
            out = {"tokens": tokens}

        if "prompt_ids" in generate_kwargs:
```
This should be a post-processing step, rather than in `_forward`. Could you move it to the `postprocess` method please?
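For illustration, the stripping done as a post-processing step could look roughly like the helper below (a hypothetical sketch, not the PR's code; the name `strip_prompt_tokens` and the `model_outputs` structure are assumptions):

```python
from typing import Any, Dict, List

def strip_prompt_tokens(model_outputs: List[Dict[str, Any]], prompt_ids: List[int]) -> List[Dict[str, Any]]:
    """Drop leading prompt tokens from each chunk's token sequence (hypothetical helper)."""
    n = len(prompt_ids)
    for output in model_outputs:
        tokens = list(output["tokens"])
        # Only strip if the sequence actually starts with the prompt ids.
        if tokens[:n] == list(prompt_ids):
            output["tokens"] = tokens[n:]
    return model_outputs
```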
@sanchit-gandhi, thank you so much for the review. I will work on your comments.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@Biswajit2902 any new updates? let me know if you need help
@thomasmol I will update on this soon. I have been busy for the last two weeks. Thank you for the reminder.
@thomasmol @sanchit-gandhi, I see the conflict below in `AutomaticSpeechRecognitionPipeline._sanitize_parameters`:
```
<<<<<<< main
            forward_params["generate_kwargs"]["max_new_tokens"] = max_new_tokens
            if initial_prompt is not None:
                forward_params["generate_kwargs"]["initial_prompt"] = initial_prompt
=======
            forward_params["max_new_tokens"] = max_new_tokens
>>>>>>> main
```
I want to understand why `generate_kwargs` was removed from `forward_params`, and also `initial_prompt`. My changes were working fine before, but after this there seems to be a bug. I am working on resolving it, so I need your input on this.
Hey @Biswajit2902 - you can read the motivation for this change here. Essentially, we're unifying the `forward_params` and `generate_kwargs` in `_sanitize_parameters`. However, for the purposes of your feature, you should strive to put the `initial_prompt` under `preprocess_params`:

```python
preprocess_params["initial_prompt"] = initial_prompt
```
And then convert the text prompt to token ids in the `preprocess` method, which will then be passed to `_forward`.
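A rough sketch of what that flow could look like (hypothetical; the `...` placeholders stand for the pipeline's existing logic, and the exact plumbing is up to the PR):

```python
# Hypothetical sketch of the suggested flow, not the PR's final code.

def _sanitize_parameters(self, initial_prompt=None, **kwargs):
    preprocess_params = {}
    if initial_prompt is not None:
        # Keep the raw text prompt with the preprocess params.
        preprocess_params["initial_prompt"] = initial_prompt
    ...  # existing handling of chunk_length_s, max_new_tokens, etc.
    return preprocess_params, {}, {}

def preprocess(self, inputs, initial_prompt=None, **kwargs):
    if initial_prompt is not None:
        # Convert the text prompt to token ids once, using the tokenizer.
        prompt_ids = self.tokenizer.get_prompt_ids(initial_prompt)
        ...  # attach prompt_ids to the processed inputs so _forward can pass them to generate
    ...  # existing feature extraction
```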
@sanchit-gandhi, thanks for the pointer. Sorry, I got super busy and couldn't get back to the review. Will do it soon and close it.
@sanchit-gandhi, just an update. I have made the changes for this issue as suggested, but I have identified that the output is not correct like before. It seems that generate has an issue: it's adding the initial prompt to all the chunks. I will check and update on this. Also, let me know if there is any existing issue on this that you know of.
cc @kamilakesbi as @sanchit-gandhi is off
Are there any updates on this? Or are there other ways you know of to push the model to more easily detect certain words using this pipeline?
Hey @basicblueberrry136, thanks for your comment!
@sanchit-gandhi's review still has to be addressed before the next steps. Once it's done, I'll make another review! Hopefully it'll move fast!
I believe this is very helpful when used with the serverless inference API.
It seems that the serverless inference API uses the Transformers library to run models, and we cannot pass any parameter that has a type of tensor, as shown below:
```js
const data = fs.readFileSync(filename);
const b64 = data.toString('base64');

const body = JSON.stringify({
  inputs: b64,
  parameters: {
    return_timestamps: true,
    generate_kwargs: {
      num_beams: 1,
      prompt_ids: [50362, 27338, 3763, 48022, 2257, 48022, 6784, 118, 25157, 1546, 15789, 23987, 5975, 17174, 28472, 25750, 6062, 1543],
    },
  },
});
```
It results in the following error:
```json
{
  "error": "unknown error",
  "warnings": [
    "There was an inference error: unknown error: list indices must be integers or slices, not NoneType"
  ]
}
```
If `initial_prompt` is added, we can pass the prompt as a string to the serverless inference API.
Hi, thanks for your work! Are there any updates on this?
What does this PR do?

Fixes # (feature)

- `initial_prompt` support for the whisper Pipeline (automatic-speech-recognition)

Before submitting

- `processor` considered as an optional parameter

Who can review?

Anyone in the community is free to review the PR once the tests have passed. @sanchit-gandhi, @Narsil, can anyone help take this PR forward please? Let me know if anything is needed.

fixes #27317