Dia-1.6B TTS: Best Text-to-Dialogue Generation Model


Looking for the right text-to-speech model? The 1.6-billion-parameter model Dia might be the one for you. You’d also be surprised to hear that it was created by two undergraduates with zero funding! In this article, you’ll learn what the model is, how to access and use it, and see some results to understand what it is really capable of. Before using the model, let’s get acquainted with it.

What is Dia-1.6B?

Models trained to take text as input and produce natural speech as output are called text-to-speech (TTS) models. Dia-1.6B, developed by Nari Labs, belongs to this family. It is an interesting model capable of generating realistic dialogue from a transcript. It’s also worth noting that the model can produce non-verbal cues such as laughing, sneezing, and whistling. Exciting, isn’t it?
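To give a feel for the input format we’ll use later in this article, a transcript marks each speaker with a tag like [S1] or [S2] and places non-verbal cues in parentheses. A made-up snippet might look like this:

[S1] Hey, have you heard about Dia? [S2] Not yet, tell me more. (laughs) [S1] It turns scripts like this one into spoken dialogue.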

How to Access the Dia-1.6B?

There are two ways in which we can access the Dia-1.6B model:

  1. Using the Hugging Face API with Google Colab
  2. Using Hugging Face Spaces

The first requires getting an API key and then using it in Google Colab with code. The latter is a no-code option that lets us use Dia-1.6B interactively.

1. Using Hugging Face and Colab

The model is available on Hugging Face and needs about 10 GB of VRAM to run, which the T4 GPU in a Google Colab notebook provides. We’ll demonstrate it with a mini conversation.

Before we begin, let’s get our Hugging Face access token, which will be required to run the code. Go to https://huggingface.co/settings/tokens and generate a key if you don’t have one already.

Make sure to enable the following permissions:

Enabling Permissions

Open a new notebook in Google Colab and add this key in the secrets (the name should be HF_Token):

Adding Secret Key

Note: Switch to the T4 GPU runtime to run this notebook. Only then will you have access to the 10 GB of VRAM required to run this model.
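If you want the notebook to authenticate with Hugging Face using that secret, here is a minimal sketch; it assumes you named the secret HF_Token as above, and uses the Colab-specific userdata helper together with huggingface_hub:

from google.colab import userdata   # Colab-only helper for reading notebook secrets
from huggingface_hub import login

# Read the token stored under the name used above and log in to Hugging Face
hf_token = userdata.get("HF_Token")
login(token=hf_token)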

Let’s now get our hands on the model:

  1. First, clone Dia’s Git repository:
!git clone https://github.com/nari-labs/dia.git
  2. Install the local package:
!pip install ./dia
  3. Install the soundfile audio library:
!pip install soundfile

After running the previous commands, restart the session before proceeding.
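You can restart from the Runtime menu (Runtime > Restart session). As a convenience sketch only, and not part of the Dia setup, one common Colab trick is to kill the runtime process from code, which Colab then restarts automatically:

import os

# Killing the current runtime process forces Colab to restart it automatically;
# equivalent to Runtime > Restart session in the menu.
os.kill(os.getpid(), 9)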

  4. After the installations, let’s do the necessary imports and initialize the model:
import soundfile as sf         # For writing the generated audio to disk

from dia.model import Dia      # Dia model class from the cloned repository

import IPython.display as ipd  # For playing the audio inside the notebook

model = Dia.from_pretrained("nari-labs/Dia-1.6B")  # Download and load the pretrained weights
  5. Initialize the text for the text-to-speech conversion:
text = "[S1] This is how Dia sounds. (laugh) [S2] Don't laugh too much. [S1] (clears throat) Do share your thoughts on the model."
  6. Run inference on the model:
output = model.generate(text)  # Generate the audio waveform from the transcript

sampling_rate = 44100  # Dia uses a 44.1 kHz sampling rate

output_file = "dia_sample.mp3"

sf.write(output_file, output, sampling_rate)  # Save the audio to disk

ipd.Audio(output_file)  # Play the audio in the notebook

Output:

The speech is very human-like, and the model handles non-verbal cues well. It’s worth noting that the results aren’t reproducible, as there are no fixed templates for the voices.

Note: You can try fixing the seed of the model to reproduce the results.
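As a rough sketch, one way to try pinning down the randomness is to set the usual Python, NumPy, and PyTorch seeds before calling generate(). This assumes Dia draws its randomness from these libraries, and it is not guaranteed to make the output fully deterministic:

import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Seed the common sources of randomness used by PyTorch-based models
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)
output = model.generate(text)  # Generation should now be more repeatable across runs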

2. Using Hugging Face Spaces

Let’s try to clone a voice using the model via Hugging Face Spaces. Here we have the option to use the model directly through the online interface: https://huggingface.co/spaces/nari-labs/Dia-1.6B

Here you can pass the input text and, additionally, use the ‘Audio Prompt’ field to replicate a voice. I passed the audio we generated in the previous section.

The following text was passed as an input:

[S1] Dia is an open weights text to dialogue model. 
[S2] You get full control over scripts and voices. 
[S1] Wow. Amazing. (laughs) 
[S2] Try it now on Git hub or Hugging Face.

I’ll let you be the judge: do you feel that the model has successfully captured and replicated the earlier voices?

Note: I got multiple errors while generating speech using Hugging Face Spaces; try changing the input text or audio prompt to get the model to work.

Things to remember while using Dia-1.6B

Here are a few things you should keep in mind while using Dia-1.6B:

  • The model is not fine-tuned on a specific voice, so you’ll get a different voice on every run. You can try fixing the seed to reproduce the results.
  • Dia uses a 44.1 kHz sampling rate.
  • After installing the libraries, make sure to restart the Colab notebook.
  • I got multiple errors while generating speech using Hugging Face Spaces; try changing the input text or audio prompt to get the model to work.

Conclusion

The model’s results are very promising, especially considering what it can do compared to the competition. Its biggest strength is its support for a wide range of non-verbal cues. The model has a distinct tone and the speech feels natural; on the other hand, since it’s not fine-tuned on specific voices, reproducing a particular voice may not be easy. Like any other generative AI tool, this model should be used responsibly.

Frequently Asked Questions

Q1. Can we use only two speakers in the conversation?

A. No, you can use multiple speakers, but you need to tag them in the prompt as [S1], [S2], [S3], and so on.
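For illustration, a hypothetical three-speaker prompt (reusing the generate() call from the Colab section) could look like this:

text = "[S1] Welcome to the show. [S2] Thanks for having me. (laughs) [S3] Happy to join you both."
output = model.generate(text)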

Q2. Is Dia 1.6B a paid model?

A. No, it’s completely free to use and available on Hugging Face.

Mounish V

Passionate about technology and innovation, a graduate of Vellore Institute of Technology. Currently working as a Data Science Trainee, focusing on Data Science. Deeply interested in Deep Learning and Generative AI, eager to explore cutting-edge techniques to solve complex problems and create impactful solutions.

