The recent improvements in the field of speech synthesis have led to many innovative technologies, offering a wide range of useful applications such as automatic speech recognition, natural speech synthesis, voice cloning, digital dictation, and so forth.
Deep learning has played an immensely important role in improving existing speech synthesis approaches by replacing the entire pipeline with neural networks trained on data alone.
Following that perspective, Tacotron2 has been a game-changer, achieving a mean opinion score (MOS) of 4.53, almost comparable to professionally recorded speech.
Yet its architecture is quite intuitive: it is based on a recurrent sequence-to-sequence network that maps character embeddings to mel spectrograms. The overall structure consists of two main stages:
- A recurrent sequence-to-sequence, attention-based feature predictor that processes character embeddings and yields mel spectrogram frames, which feed the second stage.
- A modified WaveNet vocoder that generates time-domain waveform samples conditioned on the mel spectrograms.

If you want to dig deeper into the field and grasp the intuition behind spectrograms and Fourier transforms, I recommend checking out 3Blue1Brown’s excellent video on the topic.
For this article, we will wrap Tacotron2 in a Django Rest API for serving purposes. We’ll also increase the complexity a bit by adding Nginx as a proxy server for efficient static file serving, and Gunicorn will replace Django’s internal server. The whole backend will be packaged with Docker to simulate a production-like environment.
Overview
- Create a model handler for Tacotron2
- Build the Django Rest API
- Generate the Dockerfile for your Django app
- Configure a Nginx proxy
- Dockerize the whole application
The entire code for this project can be found in my GitHub repo:
Create a model handler for Tacotron2
If you take a look at the Tacotron2 implementation in the NVIDIA GitHub repo, you’ll find among all the files a model definition for Tacotron2 that is entirely PyTorch based. They also provide a PyTorch implementation of WaveGlow, the vocoder required to turn the predicted spectrograms into wave audio files that are audible to humans.
However, to use those two models in conjunction, we still need a class, or handler, to manage the required intermediate steps: data preprocessing, inference, and post-processing. We’ll organize our work by separating the concerns of each part so that the code remains easy to maintain.
# Imports needed by the handler. Tacotron2, WaveGlow, Hparams, text_to_sequence,
# the tacotron_hparams/waveglow_params/WN_config objects, the logger and the
# _WORK_DIR, _MODEL_DIR and _AUDIO_DIR constants come from the project itself
# (see the GitHub repo); their exact import paths depend on your layout.
import os
import time
import uuid

import numpy as np
import torch
from scipy.io.wavfile import write
from torch import nn


class TacotronHandler(nn.Module):
    def __init__(self):
        super().__init__()
        self.tacotron_model = None
        self.waveglow = None
        self.device = None
        self.initialized = None

    def _load_tacotron2(self, checkpoint_file, hparams_config: Hparams):
        tacotron2_checkpoint = torch.load(os.path.join(_WORK_DIR, _MODEL_DIR, checkpoint_file))
        self.tacotron_model = Tacotron2(hparams=hparams_config)
        self.tacotron_model.load_state_dict(tacotron2_checkpoint['state_dict'])
        self.tacotron_model.to(self.device)
        self.tacotron_model.eval()

    def _load_waveglow(self, checkpoint_file, is_fp16: bool):
        waveglow_checkpoint = torch.load(os.path.join(_WORK_DIR, _MODEL_DIR, checkpoint_file))
        waveglow_model = WaveGlow(
            n_mel_channels=waveglow_params.n_mel_channels,
            n_flows=waveglow_params.n_flows,
            n_group=waveglow_params.n_group,
            n_early_every=waveglow_params.n_early_every,
            n_early_size=waveglow_params.n_early_size,
            WN_config=WN_config)
        self.waveglow = waveglow_model
        self.waveglow.load_state_dict(waveglow_checkpoint)
        self.waveglow = waveglow_model.remove_weightnorm(waveglow_model)
        self.waveglow.to(self.device)
        self.waveglow.eval()
        if is_fp16:
            from apex import amp
            # Note: apex expects opt_level "O3" (letter O), not "3"
            self.waveglow, _ = amp.initialize(waveglow_model, [], opt_level="O3")

    def initialize(self):
        if not torch.cuda.is_available():
            raise RuntimeError("This model is not supported on CPU machines.")
        self.device = torch.device('cuda')
        self._load_tacotron2(
            checkpoint_file='tacotron2.pt',
            hparams_config=tacotron_hparams)
        self._load_waveglow(
            is_fp16=False,
            checkpoint_file='waveglow_weights.pt')
        self.initialized = True
        logger.debug('Tacotron and Waveglow models successfully loaded!')

    def preprocess(self, text_seq):
        # Append a final punctuation mark so the attention knows where to stop
        text = text_seq
        if text_seq[-1].isalpha() or text_seq[-1].isspace():
            text = text_seq + '.'
        sequence = np.array(text_to_sequence(text, ['english_cleaners']))[None, :]
        sequence = torch.from_numpy(sequence).to(device=self.device, dtype=torch.int64)
        return sequence

    def inference(self, data):
        start_inference_time = time.time()
        _, mel_output_postnet, _, _ = self.tacotron_model.inference(data)
        with torch.no_grad():
            audio = self.waveglow.infer(mel_output_postnet, sigma=0.666)
        return audio, time.time() - start_inference_time

    def postprocess(self, inference_output):
        audio_numpy = inference_output[0].data.cpu().numpy()
        output_name = 'tts_output_{}.wav'.format(uuid.uuid1())
        path = os.path.join(_AUDIO_DIR, output_name)
        write(path, tacotron_hparams.sampling_rate, audio_numpy)
        return 'API/audio/' + output_name
- initialize(): Load Tacotron2 and WaveGlow with their respective checkpoints.
- preprocess(text_seq): Transform raw text into a suitable input for the model by converting it into a sequence of character IDs.
- inference(data): Run inference on the previously processed input and return the synthesized audio matching the input text.
- postprocess(inference_output): Save the resulting WAV audio file to a directory inside the container file system.
The details of the code can be checked in the GitHub repo.
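To see how the pieces fit together before wiring them into Django, here is a minimal, hypothetical usage sketch of the handler; it assumes a CUDA GPU and the checkpoint files referenced in initialize(), and the actual calling code in the repo may differ.
handler = TacotronHandler()
handler.initialize()

# Text in, sequence of character IDs out
sequence = handler.preprocess('The quick brown fox jumps over the lazy dog')

# Mel spectrogram prediction followed by WaveGlow synthesis
audio, inference_time = handler.inference(sequence)

# Write the WAV file and get back the relative path served by the API
audio_path = handler.postprocess(audio)
print(audio_path, inference_time)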
Build the Django Rest API
Set up your Django project: django-admin startproject tacotron_tts, then django-admin startapp API.
If you need a thorough tour of how to begin with a Django Rest project, feel free to check out my previous article.
As per the project requirements, we’ll be relying on a storage backend to save and retrieve the generated speech data through our endpoints. Therefore, Django ORM helpers and serializers will come in handy. As the documentation puts it, the Django ORM is “a Pythonical way to create SQL to query and manipulate your database and get results in a Pythonic fashion.” Now:
- Create your ORM model for the TTS output.
- Create the corresponding serializer.
- Build your views (POST, DELETE) and your routing (a sketch of the POST view follows the code below).
# models.py
import uuid
from django.db import models
# get_tts_media (defined in the project) builds the upload path for the audio files

# Model class for TTS outputs
class TTSound(models.Model):
    uuid = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    text_content = models.TextField(null=True, blank=True)
    audio_join = models.FileField(upload_to=get_tts_media, null=True, blank=True)
    inference_time = models.CharField(max_length=255, null=True, blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

# serializers.py
from rest_framework import serializers

# Serializer for the TTSound model
class TTSOutputSerializer(serializers.ModelSerializer):
    class Meta:
        model = TTSound
        fields = ('uuid', 'text_content', 'audio_join', 'inference_time', 'created_at')
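The model and serializer above handle storage. For the POST view mentioned in the list, here is a minimal sketch of what it could look like; the module-level tts_handler instance and the error handling are illustrative assumptions, and the actual views in the repo may be organized differently.
# views.py (sketch): a POST endpoint that synthesizes speech and stores the result
from rest_framework import status
from rest_framework.response import Response
from rest_framework.views import APIView

from .models import TTSound
from .serializers import TTSOutputSerializer

# Illustrative: load the models once at startup (import path depends on your layout)
tts_handler = TacotronHandler()
tts_handler.initialize()


class SpeechSynthesisView(APIView):
    def post(self, request):
        text = request.data.get('text_content', '')
        if not text:
            return Response({'error': 'text_content is required'},
                            status=status.HTTP_400_BAD_REQUEST)

        sequence = tts_handler.preprocess(text)
        audio, inference_time = tts_handler.inference(sequence)
        audio_path = tts_handler.postprocess(audio)

        tts_output = TTSound.objects.create(
            text_content=text,
            audio_join=audio_path,
            inference_time=str(inference_time))
        serializer = TTSOutputSerializer(tts_output)
        return Response(serializer.data, status=status.HTTP_201_CREATED)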
Generate the Dockerfile for the Django app
To package the whole API as a Docker container, we need a base image that complies with the project requirements. Since the version of Tacotron2 we’re using is entirely GPU based, we need to pull a Docker image built with CUDA support. An image bundling CUDA 10.2 with PyTorch 1.5.0 is available on Docker Hub and perfectly matches our needs.
Copy the local folders to the image file system, install the requirements inside a virtual environment, give the required permissions, and you’re ready to go.
Create two new directories where the static and media files will be stored.
Once the image for the Django app is fully operational, we’ll configure the Dockerfile for the Nginx proxy; there is nothing special about it except for the static and media folders that will be shared between the two containers. Here is the Dockerfile for the Django app:
FROM anibali/pytorch:1.5.0-cuda10.2
USER root
MAINTAINER Aymane Hachcham <[email protected]>

ENV DEBIAN_FRONTEND noninteractive
ENV PATH="/scripts:${PATH}"

RUN apt-get update

# Copy the Django project into the image
RUN mkdir /tacotron_app
COPY ./Tacotron_TTS /tacotron_app
WORKDIR /tacotron_app

# Install the Python requirements (the article mentions a virtualenv;
# a plain pip install is shown here for simplicity)
RUN pip install --upgrade pip
RUN pip install virtualenv
COPY ./requirements.txt /requirements.txt
RUN pip install -r /requirements.txt

# Entrypoint scripts must be executable
COPY ./scripts /scripts
RUN chmod +x /scripts/*

# Directories for the static and media files
RUN mkdir -p /vol/web/media
RUN mkdir -p /vol/web/static

CMD ["entrypoint.sh"]
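The CMD above expects an entrypoint.sh inside the scripts folder. As a rough idea of what such a script could contain (the project name tacotron_tts and the port 8000 bind are assumptions; the repo’s actual script may differ):
#!/bin/sh
# entrypoint.sh (sketch): prepare static files, then hand off to Gunicorn,
# which replaces Django's development server in production
set -e

python manage.py collectstatic --noinput
python manage.py migrate --noinput

# A single worker, since every worker loads the GPU models into memory
gunicorn tacotron_tts.wsgi:application --bind 0.0.0.0:8000 --workers 1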
Configure your Nginx Proxy
Nginx is a lightweight, high-performance web server that is particularly well suited for dockerized backend environments. The purpose here is to use it as a reverse proxy that routes requests and serves static files and media. Rather than serving those files through Django’s internal server, a best practice for production environments is to delegate that job to a dedicated proxy server. As with any microservice setup, each service works in a detached way, focusing on its own part of the overall infrastructure.

We’ll build our Nginx service by pulling the standard Nginx Docker image from Docker Hub: nginx-unprivileged.
Basic configuration for our needs (a sketch follows the list):
- Define an upstream service
- Prepare your server
- URLs starting with /: forward to Gunicorn
- URLs starting with /static/: serve directly from our static and media folders, which live inside the Docker file system.
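Putting those four points together, a default.conf along these lines would do the job; the upstream name, the api:8000 address, and the /vol/static path are assumptions that must match your Gunicorn bind address and the docker-compose volumes below.
# default.conf (sketch)
upstream django_api {
    server api:8000;
}

server {
    listen 8080;

    # Everything under / is proxied to Gunicorn
    location / {
        proxy_pass http://django_api;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    # Static and media files are served straight from the shared volume
    location /static/ {
        alias /vol/static/;
    }
}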
Orchestrate your Architecture with Docker Compose
As previously discussed, we need to structure things so that the containers can communicate and work closely together to run the whole service. The way to tackle this is to define two services, one for the API and one for the proxy, with a shared volume (static_data) through which both components can access the static and media files. And that’s it: you can now bring the whole service up with docker-compose up --build.
version: '3.7'

services:
  api:
    build:
      context: .
      dockerfile: Dockerfile
    volumes:
      - static_data:/vol/web
      - ./Tacotron_TTS:/tacotron_app
    environment:
      - ALLOWED_HOSTS=127.0.0.1,localhost
  nginx:
    build: ./nginx
    volumes:
      - static_data:/vol/static
    ports:
      - "8080:8080"
    depends_on:
      - api

volumes:
  static_data:
Run your application
1. There is one more small step to sort out before actually running the API, regarding the static URL paths. In your settings.py, add the following locations, matching the static volumes previously defined in the Django Dockerfile.
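A sketch of what those settings could look like, matching the /vol/web directories created in the Dockerfile (adjust to the repo’s actual values):
# settings.py (sketch)
STATIC_URL = '/static/'
STATIC_ROOT = '/vol/web/static'

MEDIA_URL = '/media/'
MEDIA_ROOT = '/vol/web/media'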
2. Download Postman and start testing your API locally:
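If you’d rather test from a script instead of Postman, here is a quick sketch using the requests library; the /API/tts/ route is an assumption, so check the project’s urls.py for the actual endpoint.
# Quick local test of the API (sketch); the endpoint path is hypothetical
import requests

payload = {'text_content': 'Every man must decide whether he will walk in the light '
                           'of creative altruism or in the darkness of destructive selfishness'}
response = requests.post('http://localhost:8080/API/tts/', data=payload)
print(response.status_code)
print(response.json())  # should include the audio_join path and the inference_time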


3. Listen to your output on port 8080:
Input: “Every man must decide whether he will walk in the light of creative altruism or in the darkness of destructive selfishness” — Martin Luther King.
Conclusion
You’ve had a quick overview of the whole project in this article. I strongly recommend checking the GitHub repo for a more in-depth look.
As you can see, the field of natural speech synthesis is very promising, and it will keep improving until it reaches stunning results. Conversational AI is getting closer to the point where we can talk with intelligent systems seamlessly, without noticing any substantial difference from human speech.
Here are some additional resources you may want to check out.
- Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- Nginx-Django production-like development, YouTube video by London App Developer
- Interesting and useful NVIDIA Github repos for TTS and STT
If you have any questions regarding the code, please get in touch with me and don’t hesitate to e-mail me at [email protected]