The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hey! This is great. Is this already in alpha?
Team, is there a tentative timeline for releasing the v3 alpha?
I can't wait :) Please update me when it's released!
@xenova Can I test v3-alpha by using NPM? When I try to run, I get this issue.
Use this commit to resolve the issue: https://github.com/kishorekaruppusamy/transformers.js/commit/7af8ef1e5c37f3052ed3a8e38938595702836f09
Thanks for your reply @kishorekaruppusamy. I tried with your branch and ran into other issues.
Please advise!
https://github.com/kishorekaruppusamy/transformers.js/blob/V3_BRANCH_WEBGPU_BUG_FIX/src/backends/onnx.js#L144
Change this URL to point to your local dist directory inside the build.
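In released versions of Transformers.js, the usual way to point onnxruntime-web at locally hosted binaries is the env.backends.onnx.wasm.wasmPaths setting; here is a minimal sketch, assuming that setting is still honored on the v3 branch (the exact runtime files shipped by onnxruntime-web may differ between versions):

import { env, pipeline } from '@xenova/transformers';

// Serve the onnxruntime-web runtime files yourself (e.g. copy them from
// node_modules/onnxruntime-web/dist into /ort/) instead of fetching them from a CDN.
env.backends.onnx.wasm.wasmPaths = '/ort/';

// Models loaded afterwards should resolve their runtime files from that path.
const classifier = await pipeline('sentiment-analysis');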
Thanks @kishorekaruppusamy
I downloaded the latest wasm from onnxruntime and added it to a local directory, but I got the same issue.
I then realized transformers.js v3 uses onnxruntime 1.16.3, so I built the wasm with onnxruntime 1.16.3 and tested again, but I still got the same issue.
Please advise. Thanks
Hi everyone! Today we released our first WebGPU x Transformers.js demo: The WebGPU Embedding Benchmark (online demo). If you'd like to help with testing, please run the benchmark and share your results! Thanks!
@xenova can this benchmark pick GPU 1 instead of GPU 0? For laptops with a dGPU.
Not currently, but this is being worked on here: microsoft/onnxruntime#19857. We will add support here once ready.
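For context, plain WebGPU only lets a page hint at which adapter it wants; whether and how onnxruntime-web will expose that choice is what the linked issue tracks. A sketch of the hint itself (standard WebGPU, outside of Transformers.js, and assuming the browser honors it on a dual-GPU laptop):

// Ask the browser for the discrete / high-performance adapter when one is available.
const adapter = await navigator.gpu.requestAdapter({ powerPreference: 'high-performance' });
const device = await adapter.requestDevice();
// There is currently no supported way to hand this adapter/device to onnxruntime-web's WebGPU backend.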
@beaufortfrancois - I've added the source code for the video background removal demo. On my device, I get ~20fps w/ WebGPU support (w/ fp32 since fp16 is broken). Here's a screen recording (which drops my fps to ~14):
You rock. Thanks! It's a cool demo!
I've been wondering how we could improve it: output[0].mul(255).to('uint8') takes some non-negligible time to run. Is there a faster path? Passing a GPUExternalTexture to the model as an input could also come in handy.
/**
 * @typedef {'cpu'|'gpu'|'wasm'|'webgpu'|null} DeviceType
 */
Out of curiosity, what is 'gpu'?
It's meant to be a "catch-all" for the different ways that the library can be used with GPU support (not just in the browser with WebGPU). The idea is that it will simplify documentation, as transformers.js will select the best execution provider depending on the environment. For example, DML/CUDA support in onnxruntime-node (see microsoft/onnxruntime#16050 (comment))
Of course, this is still a work in progress, so it can definitely change!
device: 'webgpu',
For some environments it would be better for this to accept a list, because not all execution providers support all operators.
For my use case, I pass a list of EPs ordered by priority and let onnxruntime fall back automatically.
For example: ['nnapi', 'xnnpack', 'cpu'] for Android, or ['qnn', 'dml', 'xnnpack', 'cpu'] for Windows ARM64 (custom build).
UPDATE: Looks like some kernels are not supported for quantized operations :/
I tested the WebGPU version on https://huggingface.co/Xenova/wav2vec2-bert-CV16-en with the changes from v3. The (quantized) model loads without errors, but running transcription throws an error with this message:
An error occurred during model execution: "Error: [WebGPU] Kernel "[Split] /wav2vec2_bert/encoder/layers.0/conv_module/glu/Split" failed. Error: no GPU data for output: 0".
[E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running Split node. Name:'/wav2vec2_bert/encoder/layers.0/conv_module/glu/Split' Status Message: Failed to run JSEP kernel
Is this a quantization error or an onnxruntime error?
Logs localhost-1710758687772.log
Env: Windows, Chrome 122, Nvidia Geforce 3090
@young-developer Thanks for the report. I will cc @guschmue for this unsupported operator. It may already be fixed in the dev branch of onnxruntime-web.
@hans00 For more advanced use-cases, you can update the session options directly with session_options: {...} in the model options.
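For example, an execution-provider priority list could be passed along these lines (a sketch, assuming session_options is forwarded as-is to onnxruntime's InferenceSession.create, so the valid EP names depend on the runtime and build you are using):

import { AutoModel } from '@xenova/transformers';

// Let onnxruntime fall back through the listed execution providers in order.
const model = await AutoModel.from_pretrained('Xenova/all-MiniLM-L6-v2', {
  session_options: {
    executionProviders: ['webgpu', 'wasm'], // e.g. ['qnn', 'dml', 'xnnpack', 'cpu'] on a custom Windows ARM64 build
  },
});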
FYI @xenova I was able to load the model in fp32 and got the same error. I also tried loading it in fp16, but it throws an error saying the input is float instead of float16, so I assume the inputs need to be converted to fp16 too.
Exciting news 🥳 We've got Musicgen working! Example usage:
import { AutoTokenizer, MusicgenForConditionalGeneration } from '@xenova/transformers';
// Load tokenizer and model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/musicgen-small');
const model = await MusicgenForConditionalGeneration.from_pretrained(
'Xenova/musicgen-small', { dtype: 'fp32' }
);
// Prepare text input
const prompt = '80s pop track with bassy drums and synth';
const inputs = tokenizer(prompt);
// Generate audio
const audio_values = await model.generate({
...inputs,
max_new_tokens: 512,
do_sample: true,
guidance_scale: 3,
});
// (Optional) Write the output to a WAV file
import wavefile from 'wavefile';
import fs from 'fs';
const wav = new wavefile.WaveFile();
wav.fromScratch(1, model.config.audio_encoder.sampling_rate, '32f', audio_values.data);
fs.writeFileSync('musicgen_out.wav', wav.toBuffer());
Samples:
Would it be helpful if I created an example for MusicGen? (based on your example code, but as a small stand-alone HTML page)
@xenova There is a new version of onnxruntime-web, 1.17.3. I tested with wav2vec and there is a new error, so it looks like progress!
Segment Anything Encoder now works with WebGPU: up to 8x faster! (online demo)
Phi-3 WebGPU support is now working! Demo: https://huggingface.co/spaces/Xenova/experimental-phi3-webgpu
Does anyone have a guide for how to get this bundled into a script, akin to a JSDelivr URL? Here's what I tried:
// index.js
export * from 'transformers.js'; // Adjust if the import path differs
npm install xenova/transformers.js#v3
npm install rollup @rollup/plugin-node-resolve rollup-plugin-terser --save-dev
// rollup.config.js
import resolve from '@rollup/plugin-node-resolve';
import { terser } from 'rollup-plugin-terser';
export default {
input: 'index.js',
output: {
file: 'bundle.js',
format: 'esm',
sourcemap: true
},
plugins: [
resolve({
browser: true,
}),
terser()
]
};
And in package.json:
"scripts": {
"build": "rollup -c"
}
And then:
npm run build
And that produced a bundle.js, but it was looking for webgpu.proxy.min.js on jsDelivr, which doesn't exist where it was looking. I tried manually adjusting the URL in the bundle to point to the ort.webgpu.min.js file, but no luck (I also tried esm/ort.webgpu.min.js). I'm guessing there are some tricky things due to the dynamic nature of backend loading that bundlers struggle to pick up automatically.
@xenova Alternatively, I wonder if you'd be able to do some v3 alpha/pre-alpha releases via GitHub tags so that jsDelivr picks them up? Since there's no way (IIUC) to simply reference a branch via jsDelivr (due to the immutability requirement, I assume).
The latest commits add support for Moondream2, a small vision language model by @vikhyat designed to run efficiently on edge devices.
Try it out yourself with the live demo: https://huggingface.co/spaces/Xenova/experimental-moondream-webgpu
Usage:
import { AutoProcessor, AutoTokenizer, Moondream1ForConditionalGeneration, RawImage } from '@xenova/transformers';
// Load processor, tokenizer and model
const model_id = 'Xenova/moondream2';
const processor = await AutoProcessor.from_pretrained(model_id);
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await Moondream1ForConditionalGeneration.from_pretrained(model_id, {
dtype: {
embed_tokens: 'fp16', // or 'fp32'
vision_encoder: 'fp16', // or 'q8'
decoder_model_merged: 'q4', // or 'q4f16' or 'q8'
},
device: 'webgpu',
});
// Prepare text inputs
const prompt = 'Describe this image.';
const text = `<image>\n\nQuestion: ${prompt}\n\nAnswer:`;
const text_inputs = tokenizer(text);
// Prepare vision inputs
const url = 'https://huggingface.co/vikhyatk/moondream1/resolve/main/assets/demo-1.jpg';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);
// Generate response
const output = await model.generate({
...text_inputs,
...vision_inputs,
do_sample: false,
max_new_tokens: 64,
});
const decoded = tokenizer.batch_decode(output, { skip_special_tokens: false });
console.log(decoded);
// [
// '<|endoftext|><image>\n\n' +
// 'Question: Describe this image.\n\n' +
// 'Answer: A hand is holding a white book titled "The Little Book of Deep Learning" against a backdrop of a balcony with a railing and a view of a building and trees.<|endoftext|>'
// ]
VLMs now support PKV caching. Demo: https://huggingface.co/spaces/Xenova/experimental-nanollava-webgpu
import { AutoProcessor, AutoTokenizer, LlavaForConditionalGeneration, RawImage } from '@xenova/transformers';
// Load tokenizer, processor and model
const model_id = 'Xenova/nanoLLaVA';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await LlavaForConditionalGeneration.from_pretrained(model_id, {
dtype: {
embed_tokens: 'fp16', // or 'fp32' or 'q8'
vision_encoder: 'fp16', // or 'fp32' or 'q8'
decoder_model_merged: 'q4', // or 'q8'
},
// device: 'webgpu',
});
// Prepare text inputs
const prompt = 'What does the text say?';
const messages = [
{ role: 'system', content: 'Answer the question.' },
{ role: 'user', content: `<image>\n${prompt}` }
]
const text = tokenizer.apply_chat_template(messages, { tokenize: false, add_generation_prompt: true });
const text_inputs = tokenizer(text);
// Prepare vision inputs
const url = 'https://huggingface.co/qnguyen3/nanoLLaVA/resolve/main/example_1.png';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);
// Generate response
const { past_key_values, sequences } = await model.generate({
...text_inputs,
...vision_inputs,
do_sample: false,
max_new_tokens: 64,
return_dict_in_generate: true,
});
// Decode output
const answer = tokenizer.decode(
sequences.slice(0, [text_inputs.input_ids.dims[1], null]),
{ skip_special_tokens: true },
);
console.log(answer);
// The text reads "Small but mighty".
const new_messages = [
...messages,
{ role: 'assistant', content: answer },
{ role: 'user', content: 'How does the text correlate to the context of the image?' }
]
const new_text = tokenizer.apply_chat_template(new_messages, { tokenize: false, add_generation_prompt: true });
const new_text_inputs = tokenizer(new_text);
// Generate another response
const output = await model.generate({
...new_text_inputs,
past_key_values,
do_sample: false,
max_new_tokens: 256,
});
const new_answer = tokenizer.decode(
output.slice(0, [new_text_inputs.input_ids.dims[1], null]),
{ skip_special_tokens: true },
);
console.log(new_answer);
// The context of the image is that of a playful and humorous illustration of a mouse holding a weightlifting bar. The text "Small but mighty" is a playful reference to the mouse's size and strength.
@xenova For some models, the performance may be a blocker. Since model downloads can be quite large, I wonder if there should be a way for web developers to know their machine performance class for running a model without downloading it completely first.
I believe this would involve running the model code with zeroed-out weights, which would still require buffer allocations but would allow the web app to catch out-of-memory errors and the like. The model architecture would still be needed to generate shaders, but it would be much smaller than the model weights.
Essentially, knowing the model architecture and testing with empty weights would allow for assessing performance capability without downloading the full model.
I thought I could use from_config for that, but I now wonder whether this should be a built-in v3 feature. What are your thoughts?
@beaufortfrancois That would be amazing to have! Although, it's probably best suited as a feature request for onnxruntime-web. The way one could do it is to use the external data format to save models into two parts: graph-only (<1MB usually) and weights, and then initialize an empty session from the graph without loading the weights. @guschmue might have additional insights.
Thank you @xenova for your support ❤️
@guschmue What are your thoughts on #545 (comment)?
I'm happy to file a feature request in https://github.com/microsoft/onnxruntime
@beaufortfrancois, yes, a utility class that helps applications decide what hardware capabilities are available before a model is loaded has been on my wish list for some time.
We have not gotten to it yet, but I hope we'll find time soon.
It would need to tell you how mighty your GPU is, whether there is an NPU (and, in the future, whether there is WebNN), and whether it is feasible to run the model on wasm.
It's not trivial to get this right on the first try, so I'd expect a few iterations on it.
It would also need a lot of feedback and help from application developers.
Filing a feature request would be good; then we have a place to track it.
@guschmue I've filed microsoft/onnxruntime#20998 to track this feature request. How would we be able to help out there?
We'd need to come up with a nice API. The info one can get from WebGPU is very sparse and, IMO, not good enough on its own to make this work.
The way I see this working:
- We define a couple of model classes, e.g. llm, vision, speech.
- Based on the selected class, we'd briefly run some shaders to measure the relevant FLOPS.
- The result would be a raw FLOPS number, or, based on some heuristics, a class like 'good enough for 500M parameters', plus some hints from the WebGPU info...
- Applications could cache this so the detection only needs to run the first time.
- Maybe there would be an offline tool you can run your model through to capture data about what the model needs.
We would need help defining this (i.e. which classes) and then a lot of feedback to tune it to practical values.
But this is just how I think it would work; I'm very open to other suggestions.
@guschmue Sounds great, +1 on the need to integrate feedback from application developers. One early comment/question on the potential API, w.r.t. your "good enough for 500M parameters" example: are you referring to "fast enough"? If so, it may be convenient for application developers to not only get a bucketized speed estimate as output (e.g. for a given 500M-params model: x-slow / slow / medium / fast), but also to be able to access the raw timings. How long end-users are willing to wait for an output may be use-case specific.
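To make the discussion concrete, here is a purely hypothetical sketch of what such an API could look like from an application's point of view. estimateDeviceCapability, the model-class name, and the result shape are all made up for illustration; nothing like this currently exists in onnxruntime-web or Transformers.js:

// Hypothetical API: probe the device once for a given model class and cache the result.
const cached = localStorage.getItem('device-capability-llm');
const capability = cached
  ? JSON.parse(cached)
  : await estimateDeviceCapability({ modelClass: 'llm' }); // made-up function
if (!cached) localStorage.setItem('device-capability-llm', JSON.stringify(capability));

// Hypothetical result: a coarse tier plus the raw measurements behind it, e.g.
// { tier: 'fast', estimatedFlops: 8.2e12, timings: { matmulMs: 3.1, attentionMs: 5.4 } }
if (capability.tier === 'x-slow') {
  console.warn('This model will likely be too slow on this device; consider a smaller one.');
}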
Experimental Florence2 support has been added! 🥳 (closes #815)
Example code:
import {
Florence2ForConditionalGeneration,
AutoProcessor,
AutoTokenizer,
RawImage,
} from '@xenova/transformers';
// Load model, processor, and tokenizer
const model_id = 'onnx-community/Florence-2-base-ft';
const model = await Florence2ForConditionalGeneration.from_pretrained(model_id, {
dtype: 'fp32',
});
const processor = await AutoProcessor.from_pretrained(model_id);
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
// Load image
const url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true";
const image = await RawImage.fromURL(url);
// Process inputs
const prompts = "Describe with a paragraph what is shown in the image.";
const text_inputs = tokenizer(prompts);
const vision_inputs = await processor(image);
// Generate text
const generated_ids = await model.generate({
...text_inputs,
...vision_inputs,
max_new_tokens: 100,
});
// Decode generated text
const generated_text = tokenizer.batch_decode(generated_ids, { skip_special_tokens: true });
console.log(generated_text);
generates
'A green car is parked in front of a tan building. There is a brown door on the building behind the car. There are two windows on the front of the building. '
I'm still working on adding support for other tasks and improving processing methods, but this is a good start. Another issue is that the vision encoder doesn't work on WebGPU (but other submodules do). cc @guschmue for this.
@xenova does the option to use a quantized model not exist anymore?
I'm trying to use https://huggingface.co/Xenova/trocr-base-handwritten/blob/main/onnx/encoder_model_quantized.onnx
Let's gooo! Awesome work!!!
I nearly thought it would never happen! An amazing achievement, and thank you for your persistence!
WOOHOO!!! Congrats!! WebGPU all the things!
This is a huge milestone! Thank you for all the fantastic work in this great project!
In preparation for Transformers.js v3, I'm compiling a list of issues/features which will be fixed/included in the release.
- Update onnxruntime-web (→ 1.17.0). Closes: …
- topk -> top_k parameter
- transpose -> permute
Useful commands:
npm version prerelease --preid=alpha -m "[version] Update to %s"
How to use WebGPU
First, install the development branch: npm install xenova/transformers.js#v3
Then specify the device parameter when loading the model. Here's example code to get started. Please note that this is still a WORK IN PROGRESS, so the following usage may change before release.
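A minimal sketch, assuming the development branch is installed from GitHub (npm install xenova/transformers.js#v3, as above) and that the pipeline API accepts the device option in the same way as from_pretrained in the examples earlier in this thread:

import { pipeline } from '@xenova/transformers';

// Create a feature-extraction pipeline that runs on WebGPU.
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  device: 'webgpu',
});

// Compute a sentence embedding.
const embeddings = await extractor('WebGPU is cool!', { pooling: 'mean', normalize: true });
console.log(embeddings);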