Model:
cvssp/audioldm
AudioLDM is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input. It is available in the 🧨 Diffusers library from v0.15.0 onwards.
AudioLDM was proposed in the paper AudioLDM: Text-to-Audio Generation with Latent Diffusion Models by Haohe Liu et al.
Inspired by Stable Diffusion, AudioLDM is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from CLAP latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.
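In the 🧨 Diffusers implementation (installation instructions follow below), these pieces map onto the pipeline's submodules: a CLAP text encoder producing the conditioning latents, a UNet that denoises in the continuous latent space, a VAE that decodes latents to mel-spectrograms, and a vocoder that renders the spectrogram as a waveform. A minimal inspection sketch, assuming diffusers v0.15.0+; the attribute names reflect the current AudioLDMPipeline implementation:

from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm")

# Print the class of each main submodule of the pipeline
print(type(pipe.text_encoder))  # CLAP text encoder (produces the text conditioning)
print(type(pipe.unet))          # UNet operating in the continuous latent space
print(type(pipe.vae))           # VAE decoding latents into mel-spectrograms
print(type(pipe.vocoder))       # vocoder converting spectrograms to waveforms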
This is the original, small version of the AudioLDM model, also referred to as audioldm-s-full. The four AudioLDM checkpoints are summarised in the table below:
Table 1: Summary of the AudioLDM checkpoints.
Checkpoint | Training Steps | Audio conditioning | CLAP audio dim | UNet dim | Params |
---|---|---|---|---|---|
audioldm-s-full | 1.5M | No | 768 | 128 | 421M |
audioldm-s-full-v2 | > 1.5M | No | 768 | 128 | 421M |
audioldm-m-full | 1.5M | Yes | 1024 | 192 | 652M |
audioldm-l-full | 1.5M | No | 768 | 256 | 975M |
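Each checkpoint in Table 1 is hosted in its own repository on the Hugging Face Hub, so a different variant can be loaded by swapping the repository id in the snippets below. A sketch, assuming the checkpoints follow the naming scheme above under the cvssp organisation:

from diffusers import AudioLDMPipeline
import torch

# e.g. load the large 975M-parameter variant instead of the small one
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-l-full", torch_dtype=torch.float16)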
First, install the required packages (scipy is only needed later, for saving the generated audio to disk):
pip install --upgrade diffusers transformers scipy
For text-to-audio generation, the AudioLDMPipeline can be used to load pre-trained weights and generate text-conditional audio outputs:
from diffusers import AudioLDMPipeline
import torch

repo_id = "cvssp/audioldm"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
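For reproducible outputs, a seeded generator can be passed to the pipeline call, and a negative prompt can be used to steer generation away from low-quality audio. A sketch reusing the pipe and prompt objects from the snippet above (the negative prompt text is an illustrative choice, not a tuned value):

import torch

# Seed the random number generator for deterministic results
generator = torch.Generator("cuda").manual_seed(0)

audio = pipe(
    prompt,
    negative_prompt="low quality, average quality",  # discourage noisy outputs
    num_inference_steps=10,
    audio_length_in_s=5.0,
    generator=generator,
).audios[0]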
The resulting audio output can be saved as a .wav file (AudioLDM generates audio at a sampling rate of 16 kHz):
import scipy

scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
Or displayed in a Jupyter Notebook / Google Colab:
from IPython.display import Audio

Audio(audio, rate=16000)
Prompts:
* Descriptive prompt inputs work best: use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g. "water stream in a forest" instead of "stream").
* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects that the model may not be familiar with.
Inference:
* The quality of the predicted audio sample can be controlled by the num_inference_steps argument: higher steps give higher quality audio at the expense of slower inference.
* The length of the predicted audio sample can be controlled by varying the audio_length_in_s argument, as demonstrated in the sketch below.
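The following sketch illustrates both arguments, reusing the pipe and prompt objects defined earlier (the step counts and clip lengths are arbitrary illustrative values):

# Fewer denoising steps: faster inference, lower quality
audio_fast = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

# More denoising steps: slower inference, higher quality
audio_hq = pipe(prompt, num_inference_steps=200, audio_length_in_s=5.0).audios[0]

# Longer clip: increase audio_length_in_s
audio_long = pipe(prompt, num_inference_steps=10, audio_length_in_s=10.0).audios[0]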
BibTeX:
@article{liu2023audioldm,
  title={AudioLDM: Text-to-Audio Generation with Latent Diffusion Models},
  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
  journal={arXiv preprint arXiv:2301.12503},
  year={2023}
}