
Upcoming Model Additions

Soon-to-be-Available AI Models

  • Video-to-Anime: Mainly based on StyleGAN2 by rosinality and partly on U-GAT-IT.
  • Video-Object-Replacement: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation.
  • Video-Colorization: Colorization using a generative color prior for natural images; an implementation of the corresponding ECCV 2022 paper.
  • Make-Any-Image-Talk: Based on the pix2pix-pytorch framework, together with MakeItTalk, ATVG, RhythmicHead, and Speech-Driven Animation.
  • Image-to-Text: The CLIP Interrogator is a prompt engineering tool that combines OpenAI's CLIP and Salesforce's BLIP to optimize text prompts to match a given image. Use the resulting prompts with text-to-image models like Stable Diffusion to create cool art! (A usage sketch follows this list.)
  • AI-Logo-Designer: Erlich is the text2image latent diffusion model from CompVis (with additions from glid-3-xl), fine-tuned on the Large Logo Dataset, a collection from LAION-5B of roughly 1000K logo images with captions generated via BLIP using aggressive re-ranking.
  • Image-Style-Transfer: Image Style Transfer with a Single Text Condition.
  • Image-Segment: Image segmentation based on Segment Anything Model (SAM).
  • Image-Restoration: Blind Face Restoration with Vector-Quantized Dictionary and Parallel Decoder.
  • Object-Removal: Combines semantic segmentation and EdgeConnect architectures, with minor changes, to remove specified objects from photos.
  • Speech-to-Text: A general-purpose speech transcription model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech transcription as well as speech translation and language identification.
  • Text-to-Video: Based on a multi-stage text-to-video generation diffusion model, which inputs a description text and returns a video that matches the text description.
  • Image-Super-Resolution: Image super-resolution with Stable Diffusion 2.0.
  • Text-to-Music: A Stable Diffusion 2.0 model fine-tuned on images of spectrograms paired with text. Audio processing happens downstream of the model.
  • Text-Recognition: Based on the PaddleOCR ch_ppocr_server_v2.0_xx model.
  • Generate-Detailed-Images-from-Scribbled-Drawings: A ControlNet model that adapts Stable Diffusion 2.0 to use a line drawing (or "scribble") in addition to a text prompt when generating an output image.
  • AI-Face-Swap: Based on GHOST (Generative High-fidelity One Shot Transfer).
  • Voice-Changing: This model adopts the end-to-end framework of VITS for high-quality waveform reconstruction and proposes strategies for clean content information extraction without text annotation.
  • AI-Clothes-Changer: A Stable Diffusion model fine-tuned for changing clothes in images.
  • AI-Interior-Design: Generates interior designs based on Stable Diffusion.
  • Age-Prediction: Estimates the age of a person in an input image by computing CLIP similarity against a set of age prompts (see the sketch after this list).
  • Style-Your-Hair: Latent Optimization for Pose-Invariant Hairstyle Transfer via Local-Style-Aware Hair Alignment based on Barbershop.
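
For the Image-to-Text entry above, the sketch below shows how the CLIP Interrogator is commonly used through the open-source clip-interrogator package. The package, CLIP checkpoint, and file path are assumptions for illustration; the hosted model may wrap this differently.

```python
# Minimal sketch: generate a text prompt for an image with the CLIP Interrogator.
# Assumes `pip install clip-interrogator` and a local image file named "photo.jpg".
from PIL import Image
from clip_interrogator import Config, Interrogator

image = Image.open("photo.jpg").convert("RGB")

# ViT-L-14/openai is the CLIP variant commonly paired with Stable Diffusion 1.x;
# the checkpoint choice here is an assumption, not this service's actual config.
ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))

prompt = ci.interrogate(image)  # BLIP caption extended with CLIP-ranked modifiers
print(prompt)  # paste the result into a text-to-image model such as Stable Diffusion
```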
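
Similarly, for the Age-Prediction entry, the following is a minimal sketch of zero-shot age estimation by CLIP similarity, assuming the openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers and a local face.jpg. The actual model's prompt set and post-processing may differ.

```python
# Minimal sketch: estimate age by comparing an image against age prompts with CLIP.
# Assumes `pip install transformers torch pillow` and a local image file "face.jpg".
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("face.jpg")
ages = list(range(1, 101))
prompts = [f"a photo of a {age} year old person" for age in ages]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]  # similarity over age prompts

# Report the most similar age; a probability-weighted average is a common alternative.
predicted_age = ages[probs.argmax().item()]
print(f"estimated age: {predicted_age}")
```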
