The recent improvements in the field of speech synthesis have led to many innovative technologies, offering a wide range of useful applications such as automatic speech recognition, natural speech synthesis, voice cloning, digital dictation, and so forth.
Deep learning has played an immensely important role in improving existing speech synthesis approaches by replacing the entire pipeline with neural networks trained on data alone.
Following that perspective, Tacotron2 has been a game-changer, achieving a mean opinion score (MOS) of 4.53, almost comparable to professionally recorded speech.
Yet its architecture is quite intuitive: it is based on a recurrent sequence-to-sequence network that maps character embeddings to mel spectrograms. The overall structure consists of two main stages:
- A recurrent sequence-to-sequence, attention-based feature predictor that processes character embeddings and yields mel spectrogram frames, which feed the second stage.
- A modified WaveNet vocoder that generates time-domain waveform samples conditioned on the mel spectrograms.

If you want to dig deeper into the field and grasp the intuition behind spectrograms and Fourier transforms, I recommend checking out 3Blue1Brown’s excellent video on the topic.
For this article, we will wrap Tacotron2 in a Django Rest API for serving purposes. We’ll also increase the complexity a bit by adding Nginx as a proxy server for efficient static file serving, and Gunicorn will replace Django’s internal server. The whole backend will be packaged with Docker to simulate a production-like environment.
Overview
- Create a model handler for Tacotron2
- Build the Django Rest API
- Generate the Dockerfile for your Django app
- Configure a Nginx proxy
- Dockerize the whole application
The entire code for this project can be found in my GitHub repo:
Create a model handler for Tacotron2
If you take a look at the Tacotron2 implementation in the NVIDIA GitHub repo, you’ll find among all the files a model definition for Tacotron2 that is entirely PyTorch based. They also provide a PyTorch implementation of WaveGlow, the vocoder required to turn the predicted spectrograms into wave audio files that are audible to humans.
However, to use those two models in conjunction, we still need a class, or handler, to manage the required intermediate steps: data preprocessing, inference, and post-processing. We’ll organize our work by separating the concerns of each part so that the code remains easy to maintain.
# Imports needed by the handler. Tacotron2, WaveGlow, Hparams, text_to_sequence,
# the tacotron_hparams/waveglow_params/WN_config objects, the logger and the
# _WORK_DIR, _MODEL_DIR and _AUDIO_DIR constants come from the project itself
# (see the GitHub repo); their exact import paths depend on your layout.
import os
import time
import uuid

import numpy as np
import torch
from scipy.io.wavfile import write
from torch import nn


class TacotronHandler(nn.Module):
    def __init__(self):
        super().__init__()
        self.tacotron_model = None
        self.waveglow = None
        self.device = None
        self.initialized = None

    def _load_tacotron2(self, checkpoint_file, hparams_config: Hparams):
        tacotron2_checkpoint = torch.load(os.path.join(_WORK_DIR, _MODEL_DIR, checkpoint_file))
        self.tacotron_model = Tacotron2(hparams=hparams_config)
        self.tacotron_model.load_state_dict(tacotron2_checkpoint['state_dict'])
        self.tacotron_model.to(self.device)
        self.tacotron_model.eval()

    def _load_waveglow(self, checkpoint_file, is_fp16: bool):
        waveglow_checkpoint = torch.load(os.path.join(_WORK_DIR, _MODEL_DIR, checkpoint_file))
        waveglow_model = WaveGlow(
            n_mel_channels=waveglow_params.n_mel_channels,
            n_flows=waveglow_params.n_flows,
            n_group=waveglow_params.n_group,
            n_early_every=waveglow_params.n_early_every,
            n_early_size=waveglow_params.n_early_size,
            WN_config=WN_config)
        self.waveglow = waveglow_model
        self.waveglow.load_state_dict(waveglow_checkpoint)
        self.waveglow = waveglow_model.remove_weightnorm(waveglow_model)
        self.waveglow.to(self.device)
        self.waveglow.eval()
        if is_fp16:
            from apex import amp
            # Note: apex expects opt_level "O3" (letter O), not "3"
            self.waveglow, _ = amp.initialize(waveglow_model, [], opt_level="O3")

    def initialize(self):
        if not torch.cuda.is_available():
            raise RuntimeError("This model is not supported on CPU machines.")
        self.device = torch.device('cuda')
        self._load_tacotron2(
            checkpoint_file='tacotron2.pt',
            hparams_config=tacotron_hparams)
        self._load_waveglow(
            is_fp16=False,
            checkpoint_file='waveglow_weights.pt')
        self.initialized = True
        logger.debug('Tacotron and Waveglow models successfully loaded!')

    def preprocess(self, text_seq):
        # Append a final punctuation mark so the attention knows where to stop
        text = text_seq
        if text_seq[-1].isalpha() or text_seq[-1].isspace():
            text = text_seq + '.'
        sequence = np.array(text_to_sequence(text, ['english_cleaners']))[None, :]
        sequence = torch.from_numpy(sequence).to(device=self.device, dtype=torch.int64)
        return sequence

    def inference(self, data):
        start_inference_time = time.time()
        _, mel_output_postnet, _, _ = self.tacotron_model.inference(data)
        with torch.no_grad():
            audio = self.waveglow.infer(mel_output_postnet, sigma=0.666)
        return audio, time.time() - start_inference_time

    def postprocess(self, inference_output):
        audio_numpy = inference_output[0].data.cpu().numpy()
        output_name = 'tts_output_{}.wav'.format(uuid.uuid1())
        path = os.path.join(_AUDIO_DIR, output_name)
        write(path, tacotron_hparams.sampling_rate, audio_numpy)
        return 'API/audio/' + output_name
- initialize(): Load Tacotron2 and WaveGlow with their respective checkpoints.
- preprocess(text_seq): Transform raw text into a suitable input for the model by converting it into a sequence of character IDs.
- inference(data): Run inference on the previously processed input and return the synthesized audio matching the input text.
- postprocess(inference_output): Save the resulting WAV audio file to a directory inside the container file system.
The details of the code can be checked in the GitHub repo.
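To see how the pieces fit together before wiring them into Django, here is a minimal, hypothetical usage sketch of the handler; it assumes a CUDA GPU and the checkpoint files referenced in initialize(), and the actual calling code in the repo may differ.
handler = TacotronHandler()
handler.initialize()

# Text in, sequence of character IDs out
sequence = handler.preprocess('The quick brown fox jumps over the lazy dog')

# Mel spectrogram prediction followed by WaveGlow synthesis
audio, inference_time = handler.inference(sequence)

# Write the WAV file and get back the relative path served by the API
audio_path = handler.postprocess(audio)
print(audio_path, inference_time)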
Build the Django Rest API
Set up your Django project: django-admin startproject tacotron_tts, then django-admin startapp API.
If you need a thorough tour of how to begin with a Django Rest project, feel free to check out my previous article.
As per the project requirements, we’ll be relying on a storage backend to save and retrieve the generated speech data through our endpoints. Therefore, Django ORM helpers and serializers will come in handy. As the documentation puts it, the Django ORM is “a Pythonical way to create SQL to query and manipulate your database and get results in a Pythonic fashion.” Now:
- Create your ORM model for the TTS output.
- Create the corresponding serializer.
- Build your views (POST, DELETE) and your routing (a sketch of the POST view follows the code below).
# models.py
import uuid
from django.db import models
# get_tts_media (defined in the project) builds the upload path for the audio files

# Model class for TTS outputs
class TTSound(models.Model):
    uuid = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    text_content = models.TextField(null=True, blank=True)
    audio_join = models.FileField(upload_to=get_tts_media, null=True, blank=True)
    inference_time = models.CharField(max_length=255, null=True, blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

# serializers.py
from rest_framework import serializers

# Serializer for the TTSound model
class TTSOutputSerializer(serializers.ModelSerializer):
    class Meta:
        model = TTSound
        fields = ('uuid', 'text_content', 'audio_join', 'inference_time', 'created_at')
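The model and serializer above handle storage. For the POST view mentioned in the list, here is a minimal sketch of what it could look like; the module-level tts_handler instance and the error handling are illustrative assumptions, and the actual views in the repo may be organized differently.
# views.py (sketch): a POST endpoint that synthesizes speech and stores the result
from rest_framework import status
from rest_framework.response import Response
from rest_framework.views import APIView

from .models import TTSound
from .serializers import TTSOutputSerializer

# Illustrative: load the models once at startup (import path depends on your layout)
tts_handler = TacotronHandler()
tts_handler.initialize()


class SpeechSynthesisView(APIView):
    def post(self, request):
        text = request.data.get('text_content', '')
        if not text:
            return Response({'error': 'text_content is required'},
                            status=status.HTTP_400_BAD_REQUEST)

        sequence = tts_handler.preprocess(text)
        audio, inference_time = tts_handler.inference(sequence)
        audio_path = tts_handler.postprocess(audio)

        tts_output = TTSound.objects.create(
            text_content=text,
            audio_join=audio_path,
            inference_time=str(inference_time))
        serializer = TTSOutputSerializer(tts_output)
        return Response(serializer.data, status=status.HTTP_201_CREATED)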
Generate the Dockerfile for the Django app
To package the whole API as a Docker container, we need a base image that complies with the project requirements. Since the version of Tacotron2 we’re using is entirely GPU based, we need to pull a Docker image built with CUDA support. An image bundling CUDA 10.2 with PyTorch 1.5.0 is available on Docker Hub and perfectly matches our needs.
Copy the local folders to the image file system, install the requirements inside a virtual environment, give the required permissions, and you’re ready to go.
Create two new directories where the static and media files will be stored.
Once the image for the Django app is fully operational, we’ll configure the Dockerfile for the Nginx proxy; there is nothing special about it except for the static and media folders that will be shared between the two containers. Here is the Dockerfile for the Django app:
FROM anibali/pytorch:1.5.0-cuda10.2
USER root
MAINTAINER Aymane Hachcham <[email protected]>

ENV DEBIAN_FRONTEND noninteractive
ENV PATH="/scripts:${PATH}"

RUN apt-get update

# Copy the Django project into the image
RUN mkdir /tacotron_app
COPY ./Tacotron_TTS /tacotron_app
WORKDIR /tacotron_app

# Install the Python requirements (the article mentions a virtualenv;
# a plain pip install is shown here for simplicity)
RUN pip install --upgrade pip
RUN pip install virtualenv
COPY ./requirements.txt /requirements.txt
RUN pip install -r /requirements.txt

# Entrypoint scripts must be executable
COPY ./scripts /scripts
RUN chmod +x /scripts/*

# Directories for the static and media files
RUN mkdir -p /vol/web/media
RUN mkdir -p /vol/web/static

CMD ["entrypoint.sh"]
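The CMD above expects an entrypoint.sh inside the scripts folder. As a rough idea of what such a script could contain (the project name tacotron_tts and the port 8000 bind are assumptions; the repo’s actual script may differ):
#!/bin/sh
# entrypoint.sh (sketch): prepare static files, then hand off to Gunicorn,
# which replaces Django's development server in production
set -e

python manage.py collectstatic --noinput
python manage.py migrate --noinput

# A single worker, since every worker loads the GPU models into memory
gunicorn tacotron_tts.wsgi:application --bind 0.0.0.0:8000 --workers 1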
Configure your Nginx Proxy
Nginx is a lightweight, high-performance web server that is particularly well suited for dockerized backend environments. The purpose here is to use it as a reverse proxy that routes requests and serves static files and media. Rather than serving those files through Django’s internal server, a best practice for production environments is to delegate that job to a dedicated proxy server. As with any microservice setup, each service works in a detached way, focusing on its own part of the overall infrastructure.

We’ll build our Nginx service by pulling the standard Nginx Docker image from Docker Hub: nginx-unprivileged.
Basic configuration for our needs (a sketch follows the list):
- Define an upstream service
- Prepare your server
- URLs starting with /: forward to Gunicorn
- URLs starting with /static/: serve directly from our static and media folders, which live inside the Docker file system.
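Putting those four points together, a default.conf along these lines would do the job; the upstream name, the api:8000 address, and the /vol/static path are assumptions that must match your Gunicorn bind address and the docker-compose volumes below.
# default.conf (sketch)
upstream django_api {
    server api:8000;
}

server {
    listen 8080;

    # Everything under / is proxied to Gunicorn
    location / {
        proxy_pass http://django_api;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    # Static and media files are served straight from the shared volume
    location /static/ {
        alias /vol/static/;
    }
}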
Orchestrate your Architecture with Docker Compose
As previously discussed, we need to structure things so that the containers can communicate and work closely together to run the whole service. The way to tackle this is to define two services, one for the API and one for the proxy, with a shared volume (static_data) through which both components can access the static and media files. And that’s it: you can now bring the whole service up with docker-compose up --build.
version: '3.7'

services:
  api:
    build:
      context: .
      dockerfile: Dockerfile
    volumes:
      - static_data:/vol/web
      - ./Tacotron_TTS:/tacotron_app
    environment:
      - ALLOWED_HOSTS=127.0.0.1,localhost
  nginx:
    build: ./nginx
    volumes:
      - static_data:/vol/static
    ports:
      - "8080:8080"
    depends_on:
      - api

volumes:
  static_data:
Run your application
1. There is one more small step to sort out before actually running the API, regarding the static URL paths. In your settings.py, add the following locations, matching the static volumes previously defined in the Django Dockerfile.
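A sketch of what those settings could look like, matching the /vol/web directories created in the Dockerfile (adjust to the repo’s actual values):
# settings.py (sketch)
STATIC_URL = '/static/'
STATIC_ROOT = '/vol/web/static'

MEDIA_URL = '/media/'
MEDIA_ROOT = '/vol/web/media'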
2. Download Postman and start testing your API locally:
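If you’d rather test from a script instead of Postman, here is a quick sketch using the requests library; the /API/tts/ route is an assumption, so check the project’s urls.py for the actual endpoint.
# Quick local test of the API (sketch); the endpoint path is hypothetical
import requests

payload = {'text_content': 'Every man must decide whether he will walk in the light '
                           'of creative altruism or in the darkness of destructive selfishness'}
response = requests.post('http://localhost:8080/API/tts/', data=payload)
print(response.status_code)
print(response.json())  # should include the audio_join path and the inference_time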


3. Listen to your output on port 8080:
Input: “Every man must decide whether he will walk in the light of creative altruism or in the darkness of destructive selfishness” — Martin Luther King.
Conclusion
You’ve had a quick overview of the whole project in this article. I strongly recommend checking the GitHub repo for a more in-depth look.
As you can see, the field of natural speech synthesis is very promising, and it will keep improving until it reaches stunning results. Conversational AI is getting closer to the point where we can talk with intelligent systems seamlessly, without noticing any substantial difference from human speech.
Here are some additional resources you may want to check out.
- Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- Nginx-Django production-like development, YouTube video by London App Developer
- Interesting and useful NVIDIA Github repos for TTS and STT
If you have any questions regarding the code, please get in touch with me and don’t hesitate to e-mail me at [email protected]