Digging into Stable Diffusion and Other Generative Models

Before I dive in, I should mention you'll need a high-end GPU with a lot of memory to train the model on your own data or to generate images larger than 512x512. On my machine, training takes the full 24GB of VRAM. Some projects claim to work with as little as 4GB of VRAM, and some even support running on Apple Silicon.
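If you're not sure what your machine has, here's a minimal PyTorch sketch for checking which backend is available and how much VRAM you have (it assumes PyTorch 1.12+ for the Apple Silicon check):

```python
# Quick check of local hardware before attempting to run SD.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
elif torch.backends.mps.is_available():  # Apple Silicon (PyTorch 1.12+)
    print("Apple Silicon (MPS) backend available")
else:
    print("No GPU backend found; CPU-only generation will be very slow")
```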

If you don't have a machine that can handle this, you will find links to various free and paid tools for AI art below.

The visuals above were made with Stable Diffusion (SD), an open-source AI text-to-image (txt2img) model created by a company called Stability AI (SAI). I'm using a project referred to in r/stablediffusion as AUTOMATIC1111's repo. It supports third-party Python scripts; Deforum created one for AUTOMATIC1111 that builds animations from image-to-image (img2img) iterations, manipulating the image between frames with keyframes and math functions.
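To make the img2img feedback loop concrete, here's a rough sketch of the idea using the Hugging Face diffusers library rather than AUTOMATIC1111's code (so this is not Deforum itself). It assumes a recent diffusers version; the model ID, prompt, frame count, zoom amount, and strength value are all illustrative:

```python
# Sketch of a Deforum-style animation loop: txt2img for the first frame,
# then repeatedly transform the last frame and feed it back through img2img.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # any SD checkpoint should work

txt2img = StableDiffusionPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

prompt = "a surreal forest, highly detailed"
frame = txt2img(prompt).images[0]  # first frame from pure text

for i in range(24):  # 24 animation frames
    # Manipulate the previous frame before feeding it back in.
    # Deforum supports keyframed math functions; a fixed zoom stands in here.
    w, h = frame.size
    zoomed = frame.crop((w // 50, h // 50, w - w // 50, h - h // 50)).resize((w, h))
    # strength < 1.0 keeps each new frame coherent with the previous one
    frame = img2img(prompt=prompt, image=zoomed, strength=0.45).images[0]
    frame.save(f"frame_{i:04d}.png")
```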

Another SD project, Dreambooth, lets you extend the model by fine-tuning it on your own images.
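Dreambooth binds your subject to a rare identifier token during fine-tuning, and the output is a full set of model weights you can load like any other SD checkpoint. A hedged sketch of what using the result looks like with diffusers; the output directory and the "sks" token are placeholders following the common Dreambooth convention:

```python
# Generating with a Dreambooth-fine-tuned model via diffusers.
# "./my-dreambooth-model" and the "sks" identifier are placeholders;
# substitute whatever output directory and token you trained with.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "./my-dreambooth-model", torch_dtype=torch.float16
).to("cuda")

# The rare token ("sks") was bound to your training images during fine-tuning.
image = pipe("a portrait of sks person as an oil painting").images[0]
image.save("dreambooth_sample.png")
```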

The audio in the video was generated with Mubert Text to Music. This one works a little differently: it uses a text classifier to convert a prompt into tags (genres and music styles), then makes a request to the Mubert API to generate the music.
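Roughly, that pipeline looks like the sketch below. It's a sketch only: the tag list is abbreviated, and the API endpoint and payload are hypothetical stand-ins for Mubert's real API; the prompt-to-tags step uses sentence-transformers embedding similarity, which is the general technique, not necessarily Mubert's exact classifier.

```python
# Sketch of the prompt -> tags -> music-API pipeline described above.
import requests
from sentence_transformers import SentenceTransformer, util

TAGS = ["ambient", "synthwave", "lo-fi hip hop", "orchestral", "techno"]  # abbreviated

encoder = SentenceTransformer("all-MiniLM-L6-v2")
tag_embeddings = encoder.encode(TAGS, convert_to_tensor=True)

def prompt_to_tags(prompt: str, top_k: int = 2) -> list[str]:
    """Rank music tags by cosine similarity to the prompt embedding."""
    prompt_embedding = encoder.encode(prompt, convert_to_tensor=True)
    scores = util.cos_sim(prompt_embedding, tag_embeddings)[0]
    best = scores.argsort(descending=True)[:top_k]
    return [TAGS[i] for i in best]

tags = prompt_to_tags("a calm walk through a neon-lit city at night")

# Hypothetical request shape; consult Mubert's docs for the real endpoint/payload.
response = requests.post(
    "https://api.example.com/generate-track",  # placeholder URL
    json={"tags": tags, "duration": 60},
)
print(tags, response.status_code)
```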

Image Generation:

  • DALL-E 2 by OpenAI
  • Midjourney
  • DreamStudio by Stability AI (their web interface for SD)
    • Recently announced DreamStudio Pro, which will combine audio/video/image generation models with a web interface for 3D animation

Audio Generation:

  • Mubert music generation
  • FakeYou TTS - lots of TV/character voice models
  • Uberduck TTS - lots of TV/character voice models
  • Coqui TTS - clone voices

AI/ML Model Communities: find and try models online

Animation Tools

These are harder to explain with short descriptions.

  • TPSMM (Thin-Plate Spline Motion Model) - animates a single image with a face/body driver video
  • EbSynth - styles video using hand-painted input keyframes

There’s a lot more out there, and things are moving fast.