Speech recognition and speech synthesis on iOS with Swift

Everyone knows Siri, and many people use it every day. Why? Because Siri provides a very fast and user-friendly way of interacting with an iOS device.

Convenience is not the only motivation for this type of interaction, though. The combination of speech recognition and speech synthesis feels more personal than using a touch screen. On top of that, the option for verbal communication enables visually impaired people to interact with your app.

As you probably already know, Siri’s communication mechanism can be split into two main components: speaking and listening. Speaking is formally known as “speech synthesis,” whereas listening is often referred to as “speech recognition.” Although the tasks look very different in code, they have one thing in common: both are powered by machine learning.

Luckily, Apple’s speech synthesis and speech recognition APIs aren’t private — everyone has access to their cutting-edge technology. In this tutorial, you’ll build an app that uses those APIs to speak and listen to you.

Speech synthesis

The easier of the two tasks we’ll explore here is speech synthesis — making the app speak — which can be done in just two lines of code. 2!

The framework we’ll use for speech synthesis is AVFoundation, which, generally speaking, is a very low-level framework, but it also has some very nice speech synthesis APIs.

Start by importing it at the top of ViewController.swift:

Then, in viewDidLoad, create an instance of AVSpeechUtterance. You can compare this to a string, but in sound:

Then we “speak” the utterance, which is printing, but on a speaker:
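Put together, the three steps might look like the following minimal sketch. The greeting text and the synthesizer property are placeholders of my own; I keep the synthesizer in a property on the assumption that one created as a local variable can be deallocated before it finishes speaking.

```swift
import AVFoundation
import UIKit

class ViewController: UIViewController {
    // Assumption: stored as a property so it stays alive while speaking.
    private let synthesizer = AVSpeechSynthesizer()

    override func viewDidLoad() {
        super.viewDidLoad()

        // The utterance holds the text that should be spoken (placeholder text).
        let utterance = AVSpeechUtterance(string: "Hello! Nice to meet you.")

        // Hand the utterance to the synthesizer to play it on the speaker.
        synthesizer.speak(utterance)
    }
}
```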

That’s it!

The speed of the speech can be changed by modifying rate on utterance:

I found 0.5 to be a really good value, but I’d encourage you to play with it yourself.
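As a rough sketch, setting the rate could look like this. AVFoundation exposes constants for the allowed range, so clamping a custom value against them is one way to stay within bounds. The desiredRate name and the sample sentence are only illustrative.

```swift
import AVFoundation

let utterance = AVSpeechUtterance(string: "This sentence is spoken at a custom speed.")

// Illustrative value; feel free to experiment with it.
let desiredRate: Float = 0.5

// Clamp to the range AVFoundation accepts before assigning.
utterance.rate = min(max(desiredRate, AVSpeechUtteranceMinimumSpeechRate),
                     AVSpeechUtteranceMaximumSpeechRate)
```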

Speech recognition — Asking for permission

Now that our app can speak, let’s move on to the next part of our equation: making it listen. Recall that this task is called “speech recognition.” Speech recognition will turn raw audio recorded from the microphone into a string we can use.

Before we can do anything with the user’s microphone though, or use the speech recognition API, we have to ask the user for permission to use these APIs.

As with many of iOS’s privacy features, we need to add an entry to our Info.plist file. The key we need to access speech recognition is NSSpeechRecognitionUsageDescription.

Enter something like “We need access to the microphone to hear what you have to say.” You’ll also need to add NSMicrophoneUsageDescription along with a similar description.

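Next, we need to ask the user for permission. This is done through the SFSpeechRecognizer.requestAuthorization function, which passes an authStatus to its completion handler, possibly on a background thread. That’s why the code below hops back to the main queue before doing anything else.

authStatus is an enum (SFSpeechRecognizerAuthorizationStatus) with the following cases: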

authorized
denied
restricted
notDetermined

We’ll implement a handler for each case next when we’re done building the UI. For now, add some TODOs.

// MARK: - Privacy
private func checkPermissions() {
    SFSpeechRecognizer.requestAuthorization { authStatus in
        DispatchQueue.main.async {
            switch authStatus {
            case .authorized:
                // TODO: Implement.
                break
            case .denied:
                // TODO: Implement.
                break
            case .restricted:
                // TODO: Implement.
                break
            case .notDetermined:
                // TODO: Implement.
                break
            @unknown default:
                fatalError()
            }
        }
    }
}

Building a user interface

Below is a screenshot of the UI I used. For the purposes of this tutorial, this is a really simple UI. Feel free to fork the repo and create a more attractive layout. Share it on Twitter (+1 if you use SwiftUI)!

Steps to recreate it:

Drag a button onto the view controller,
Type “Start recording,”
Change text style to “Headline,”
Add center x and center y constraints.

The user can start the speech recognition functionality by tapping the button, and when they tap it again, speech recognition will stop.

Next, create an outlet titled recordButton and an action titled recordButtonTapped. Make sure sender is of type UIButton — not Any.

Now that we have access to the on-screen button, we’re ready to implement the privacy features. If the user doesn’t grant permission, we’ll show an alert suggesting that they open Settings, and we’ll disable the record button. Here’s a little helper function for that:

private func handlePermissionFailed() {
    // Present an alert asking the user to change their settings.
    let ac = UIAlertController(title: "This app must have access to speech recognition to work.",
                               message: "Please consider updating your settings.", preferredStyle: .alert)
    ac.addAction(UIAlertAction(title: "Open settings", style: .default) { _ in
        let url = URL(string: UIApplication.openSettingsURLString)!
        UIApplication.shared.open(url)
    })
    ac.addAction(UIAlertAction(title: "Close", style: .cancel))
    present(ac, animated: true)

    // Disable the record button.
    recordButton.isEnabled = false
    recordButton.setTitle("Speech recognition not available.", for: .normal)
}

We can replace the switch statement with the following:

switch authStatus {
case .authorized: break
default: self.handlePermissionFailed()
}

If we get authorization from the user, we can continue; otherwise, we ask them to open settings.

The last thing left to do is check for permission every time the user enters this screen:

override public func viewDidAppear(_ animated: Bool) {
    super.viewDidAppear(animated)
    checkPermissions()
}

While we’re at it, let’s also create a small utility function to handle errors. This function will present an alert telling the user an error has occurred and will also disable the record button.

private func handleError(withMessage message: String) {
    // Present an alert.
    let ac = UIAlertController(title: "An error occurred", message: message, preferredStyle: .alert)
    ac.addAction(UIAlertAction(title: "OK", style: .default))
    present(ac, animated: true)

    // Disable record button.
    recordButton.setTitle("Not available.", for: .normal)
    recordButton.isEnabled = false
}

Adding a little more structure

Before we can recognize speech, we need to fix a few things. I promise, it won’t take long.

First off, add a property isRecording to the top of your view controller. I set its access controls to public private(set) because other components in the app are allowed to know whether we’re recording or not, but they must not change it for us.
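The declaration could be as short as this sketch. The initial value of false is my assumption, since nothing is recording when the screen first loads.

```swift
import UIKit

class ViewController: UIViewController {
    // Readable from other components, writable only inside this class.
    public private(set) var isRecording = false
}
```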

Second, we want the button to say “Stop recording” while we’re recording, and “Start recording” when we aren’t. In recordButtonTapped(_ sender: UIButton):

if isRecording { stopRecording() } else { startRecording() }
isRecording.toggle()
sender.setTitle((isRecording ? "Stop" : "Start") + " recording", for: .normal)

This snippet uses the ternary operator, which is basically an inline if statement. It works like this: condition ? valueIfTrue : valueIfFalse. The reason we set sender to UIButton is so we can change values on the sender, which is Swifty-er, in my opinion, than accessing the outlet.
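If the ternary operator is new to you, this small standalone sketch shows it in isolation. The buttonTitle function is a hypothetical helper that is not part of the app; it picks the label the button should show for a given recording state.

```swift
// Hypothetical helper: returns the label to show for a given recording state.
func buttonTitle(isRecording: Bool) -> String {
    // condition ? valueIfTrue : valueIfFalse
    return (isRecording ? "Stop" : "Start") + " recording"
}

print(buttonTitle(isRecording: true))   // prints "Stop recording"
print(buttonTitle(isRecording: false))  // prints "Start recording"
```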

Xcode will probably complain that stopRecording and startRecording don’t exist yet. You can fix that by adding placeholders for these functions (both set to private).

Recognizing speech

First things first, import Speech at the top of ViewController.swift.

The task of speech recognition is more complex than synthesis (is that true for humans too?) and requires some setup.

Let’s break it down into five steps:

Create a recognizer
Create a speech recognition request
Create a recording pipeline
Start recognizing speech
Stop recognizing speech

To recap, the user can start speech recognition by pressing the button. If we aren’t already recording, we start recording by calling startRecording() on self. If we are recording, we stop it by calling stopRecording().

Steps 1 to 4 are in startRecording, 5 is in stopRecording.

1. Creating a recognizer

We start by creating an SFSpeechRecognizer. Its initializer might return nil, or it might not be available for another reason, so we need to carefully validate whether we can use it or not before moving on.

guard let recognizer = SFSpeechRecognizer(), recognizer.isAvailable else {
    handleError(withMessage: "Speech recognizer not available.")
    return
}

2. Creating a speech recognition request

Next, we can create a request that goes along with the recognizer.

Our particular request is of type SFSpeechAudioBufferRecognitionRequest, a subclass of SFSpeechRecognitionRequest. An SFSpeechAudioBufferRecognitionRequest is made for recognizing speech in AVFoundation audio pipelines (see step 3 for more details).

The other subclass is SFSpeechURLRecognitionRequest, for recognizing speech in audio files, if you’re interested in that.
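For the curious, a file-based request might be used roughly like this. The transcribeFile function and its completion signature are my own invention for illustration; only SFSpeechRecognizer, SFSpeechURLRecognitionRequest, and recognitionTask(with:resultHandler:) come from the Speech framework.

```swift
import Foundation
import Speech

/// Hypothetical helper: transcribes an audio file and reports the final text, or nil on failure.
func transcribeFile(at url: URL, completion: @escaping (String?) -> Void) {
    guard let recognizer = SFSpeechRecognizer(), recognizer.isAvailable else {
        completion(nil)
        return
    }

    let request = SFSpeechURLRecognitionRequest(url: url)
    // Only the final transcription is needed here.
    request.shouldReportPartialResults = false

    // The returned task is ignored in this sketch; keep it around if you want to cancel.
    _ = recognizer.recognitionTask(with: request) { result, error in
        guard error == nil, let result = result else {
            completion(nil)
            return
        }
        if result.isFinal {
            completion(result.bestTranscription.formattedString)
        }
    }
}
```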

I set shouldReportPartialResults to true to demonstrate how we get intermediate results when iOS is decoding speech from the audio. If you’re only planning on using the final result, it should be set to false to save compute resources.

recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
recognitionRequest?.shouldReportPartialResults = true

Just like Vision, we also have to tell Speech up front what we’ll be doing with the results — this cannot be changed after we add the task to the recognizer. We’ll print out the results, including partial results, and if the result is final, we’ll update the UI. It’s also possible to update the UI directly after new results come in, like Siri does.

recognizer.recognitionTask(with: recognitionRequest!) { (result, error) in
    guard error == nil else { self.handleError(withMessage: error!.localizedDescription); return }
    guard let result = result else { return }

    print("got a new result: \(result.bestTranscription.formattedString), final: \(result.isFinal)")

    if result.isFinal {
        DispatchQueue.main.async {
            self.updateUI(withResult: result)
        }
    }
}

The updateUI function is a very small function that solely shows an alert to the user telling them what they just said:

private func updateUI(withResult result: SFSpeechRecognitionResult) {
    // Update the UI: Present an alert.
    let ac = UIAlertController(title: "You said:",
                               message: result.bestTranscription.formattedString,
                               preferredStyle: .alert)
    ac.addAction(UIAlertAction(title: "OK", style: .default))
    self.present(ac, animated: true)
}

3. Creating a recording pipeline

While our code is ready to transcribe, there is nothing to transcribe yet. Let’s fix that.

AVFoundation allows you to build complicated graphs of audio pipelines. Each item in this graph is called a node. Nodes come in several types, including input nodes, output nodes, and mixer nodes. We only use one input node in this app, but it’s still good to understand what happens under the hood.

The very first thing we need to do is get an audio engine. This is the object that controls the entire pipeline. Because we need it later on to stop recording, we add it as a property on self.

Creating an audio engine is only a matter of calling its initializer:

After that, we’ll get the input node of the audio engine. We need this object later on as well, so create another property:

The input node is available as a property of the audio engine:
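These steps could be sketched as follows. The property declarations are my guess at what fits the rest of the code in this article (implicitly unwrapped optionals that get assigned inside startRecording); only AVAudioEngine() and its inputNode property come from AVFoundation itself.

```swift
import AVFoundation
import UIKit

class ViewController: UIViewController {
    // Assumed declarations: both are assigned when recording starts.
    private var audioEngine: AVAudioEngine!
    private var inputNode: AVAudioInputNode!

    private func startRecording() {
        // Creating the engine only requires calling its initializer.
        audioEngine = AVAudioEngine()

        // The input node (the microphone) is a property of the engine.
        inputNode = audioEngine.inputNode
    }
}
```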

There is one other thing to review. Audio recordings can be of any duration, meaning we can’t simply assign a block of memory to put the recording into. Luckily, engineers have found a solution for this issue: we cut the recording up into many pieces of length bufferSize that we can store in a fixed block.

You don’t need to worry about audio getting cut off as a result of this — if we stop recording in the middle of a block (it always happens) the rest of the block is filled with silence.
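To make the idea concrete, here is a toy sketch in plain Swift. It is not how AVAudioEngine is implemented; the chunk function and the use of 0 for silence are assumptions made purely for illustration.

```swift
/// Splits samples into fixed-size blocks, padding the last block with silence (0).
func chunk(_ samples: [Float], bufferSize: Int) -> [[Float]] {
    precondition(bufferSize > 0, "bufferSize must be positive")
    var blocks: [[Float]] = []
    var start = 0
    while start < samples.count {
        let end = min(start + bufferSize, samples.count)
        var block = Array(samples[start..<end])
        // Fill the remainder of the final block with silence.
        block += Array(repeating: 0, count: bufferSize - block.count)
        blocks.append(block)
        start = end
    }
    return blocks
}

// Five samples with a buffer size of 2 give three blocks; the last one is padded.
let blocks = chunk([0.1, 0.2, 0.3, 0.4, 0.5], bufferSize: 2)
print(blocks.count)  // prints 3
```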

To get these chunks of audio, we need to install a tap (i.e. add a node to the graph) on the input node. The bus is like a channel we’re using. We also tell AVFoundation what our next step is: adding the extracted buffers to the recognition request ready to be transcribed (transcription is done with recurrent neural networks, also in buffers!).

let recordingFormat = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer, _) in
    self.recognitionRequest?.append(buffer)
}

With that, the entire graph is finished. AVAudioEngine can now build the graph for us if we call .prepare().

audioEngine.prepare()

4. Start recognizing speech

There are a lot of things that could potentially go wrong when we put all our above code into action. The user might be on a phone call, their microphone may be missing or broken, the graph might not be complete, etc.

Luckily, iOS checks these things for us and simply gives us an error if anything is not OK. All we have to do is wrap the following code in a do-catch block:

do {
    // TODO: Start recognizing speech.
} catch {
    handleError(withMessage: error.localizedDescription)
}

iOS manages audio input and output through AVAudioSession, which is shared between third-party apps and iOS itself.

Through the session, an app can request to use the microphone, the speaker, or both simultaneously. By specifying a category, a mode, and options, we let iOS prepare the low-level system functions automatically. So for example, if a user is listening to music, the music will be paused and the microphone will be enabled.

It won’t come as a surprise to learn that AVAudioSession is a singleton — there is only one iPhone to manage, after all.

// Activate the session.
audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(.record, mode: .spokenAudio, options: .duckOthers)
try audioSession.setActive(true, options: .notifyOthersOnDeactivation)

We set the category of our session to .record because we’re recording audio. Other categories are also available; see the AVAudioSession.Category documentation for the full list.

Finally, we can fire up the entire pipeline, from getting raw input to transcribing the audio, by calling .start().

// Start the processing pipeline.
try audioEngine.start()

5. Stop recognizing speech

The very last thing to do is implement stopRecording. This function should be self-explanatory.

private func stopRecording() {
    // End the recognition request.
    recognitionRequest?.endAudio()
    recognitionRequest = nil
  
    // Stop recording.
    audioEngine.stop()
    inputNode.removeTap(onBus: 0) // Call after audio engine is stopped as it modifies the graph.
  
    // Stop our session.
    try? audioSession.setActive(false)
    audioSession = nil
}

One thing I’d like to point out is that the audio pipeline cannot be changed while it’s being used, so we have to stop the audio engine before we can modify the graph by removing our tap.

Running the app

You can now hit build and run to load the final result onto your device. Tap start recording, and once you start talking, results will be printed in Xcode. If you tap stop, an alert will be presented.

Spraakherkenning

SFSpeechRecognizer isn’t just built for English-speaking users: you can use many different languages by passing a locale to its initializer. For example, you can use nl_NL for Dutch, my native language. (Spraak-herkenning means speech recognition in Dutch!)

Or, you can get the user’s default locale by accessing the current property:
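Both options might look like this sketch. The variable names are mine, and the last lines are an optional extra that prints which locales the Speech framework reports as supported.

```swift
import Foundation
import Speech

// A recognizer for Dutch; the initializer returns nil if the locale isn't supported.
let dutchRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "nl_NL"))

// A recognizer for the user's current locale.
let currentRecognizer = SFSpeechRecognizer(locale: Locale.current)

// Optional check: list the identifiers of all supported locales.
let identifiers = SFSpeechRecognizer.supportedLocales().map { $0.identifier }.sorted()
print(identifiers)
```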

You can also change the language of speech synthesis by changing the voice property of an AVSpeechUtterance instance:
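A minimal sketch of switching the synthesis voice, assuming a Dutch voice is installed on the device. The sentence is placeholder text.

```swift
import AVFoundation

let utterance = AVSpeechUtterance(string: "Hallo, leuk je te ontmoeten!")

// AVSpeechSynthesisVoice(language:) takes a BCP 47 code and returns nil if no voice matches.
utterance.voice = AVSpeechSynthesisVoice(language: "nl-NL")
```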

What’s next?

It would be interesting to combine this project with a previous project of mine to predict the emotion of the user based on what they tell the app. The combined project could be used in healthcare as a therapy bot.

You could also combine synthesis and recognition with a natural language model to generate a conversation bot, which could entertain lonely people.

The complete code is available on my GitHub. Don’t forget to leave a ⭐️!

If you have any questions or comments, feel free to reach out on Twitter or by email.