Add SD-Turbo and refine diffusion demo (#18694)
[SD-Turbo](https://huggingface.co/stabilityai/sd-turbo) is a fast
generative text-to-image model distilled from [Stable Diffusion
2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1). It
targets 512x512 resolution.
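For reference, generation with SD-Turbo typically uses a single denoising step and no classifier-free guidance, as in this minimal sketch with the Hugging Face diffusers API; it illustrates the model's usage pattern rather than this demo's pipeline.

```python
import torch
from diffusers import AutoPipelineForText2Image

# Load SD-Turbo in fp16; one denoising step with guidance scale 0.0
# is the intended setting for this distilled model.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sd-turbo", torch_dtype=torch.float16, variant="fp16"
)
pipe = pipe.to("cuda")

image = pipe(
    prompt="a photo of an astronaut riding a horse on mars",
    num_inference_steps=1,
    guidance_scale=0.0,
    height=512,
    width=512,
).images[0]
image.save("astronaut.png")
```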
1. Support the sd-turbo model.
1. Refine ControlNet in the demo:
+ Cache the ControlNet model so that it is downloaded only once.
+ Do not download default images in the script. Instead, update the
documentation to use wget to download an example image.
+ Fix a control image preprocessing issue that caused a shape mismatch
during inference (see the preprocessing sketch after this list).
1. Refine arguments:
+ Change the --disable-refiner argument to --enable-refiner, since the
refiner is not used in most cases.
+ Rename --refiner-steps to --refiner_denoising_steps.
+ Add abbreviations for the most frequently used arguments.
+ Add logic to set default arguments for different models (see the
defaults sketch after this list).
1. Refine torch model cache:
+ Share the cached torch model among different engines to save disk space.
+ Only download the fp16 model (previously, ORT_CUDA downloaded the fp32 model).
1. Do not use VAE slicing when the image size is small.
1. For the LCM scheduler, allow a guidance scale of 1.0 to 2.0.
1. Allow sdxl-turbo to use the refiner.
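For context on the control image fix, the sketch below shows the kind of preprocessing that keeps the control image aligned with the inference resolution; the function name and normalization details are illustrative assumptions rather than the demo's actual code.

```python
import numpy as np
from PIL import Image

def preprocess_control_image(image: Image.Image, height: int, width: int) -> np.ndarray:
    # Resize the control image to the inference resolution so its spatial
    # dimensions match what ControlNet expects; a mismatch here is what
    # surfaces as a shape error during inference.
    image = image.convert("RGB").resize((width, height), Image.LANCZOS)
    array = np.asarray(image, dtype=np.float32) / 255.0  # HWC, values in [0, 1]
    return np.transpose(array, (2, 0, 1))[np.newaxis]    # NCHW for the model
```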
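To make the per-model default logic concrete, here is a minimal sketch; the argument names (version, denoising_steps, guidance, scheduler) and the specific default values are assumptions for illustration and may not match the demo's parser exactly.

```python
import argparse

def apply_model_defaults(args: argparse.Namespace) -> argparse.Namespace:
    """Fill unset arguments with per-model defaults (illustrative only)."""
    is_turbo = args.version in ("sd-turbo", "sdxl-turbo")
    if args.denoising_steps is None:
        # Turbo models are distilled for single-step sampling.
        args.denoising_steps = 1 if is_turbo else 30
    if args.guidance is None:
        # Turbo models are trained without classifier-free guidance.
        args.guidance = 0.0 if is_turbo else 7.5
    if args.scheduler == "LCM" and not 1.0 <= args.guidance <= 2.0:
        # This PR accepts a guidance scale of 1.0 to 2.0 for the LCM scheduler.
        raise ValueError("LCM scheduler expects a guidance scale in [1.0, 2.0]")
    return args
```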
### Performance Test Results
Average latency in ms for SD-Turbo (FP16, EulerA, 512x512) on
A100-SXM4-80GB.
Batch | Steps | TRT 8.6 static | ORT_TRT static | ORT_CUDA static | TRT 8.6 dynamic | ORT_TRT dynamic | ORT_CUDA dynamic
-- | -- | -- | -- | -- | -- | -- | --
1 | 1 | 32.07 | 30.55 | 32.89 | 36.41 | 38.30 | 34.83
4 | 1 | 125.36 | 97.40 | 97.49 | 118.24 | 114.95 | 99.10
1 | 4 | 62.29 | 60.24 | 62.50 | 72.49 | 77.82 | 67.66
4 | 4 | 203.51 | 173.11 | 168.32 | 217.14 | 215.71 | 172.53
* The dynamic engine is built for batch sizes 1 to 8 and image sizes 512x512 to
768x768, and is optimized for batch size 1 and image size 512x512.