Every one of us will occasionally grasp for the right word, pausing mid-sentence to try to remember it. For example, forgetting the name of a place caused this hesitation:
We can all relate to this, but as cognition declines, these pauses become more common and more pronounced.
As fellow humans, we’ll adapt and be patient, giving someone time to think—maybe even suggesting the completion as shown here:
Voice assistants (Amazon Alexa, Google Assistant, etc.) need to become more naturally interactive in order to do this. They often mistake these pauses for the end of a sentence and frustratingly reply with something like “I’m sorry, I didn’t quite catch that”.
The accessibility of voice assistants is now more crucial than ever, so a team of students has been exploring this challenge with the Interaction Lab at Heriot-Watt University.
This project was planned, designed, and built over a 12-week period as part of a course called “Conversational Agents and Spoken Language Processing”. If this class or area interests you, take a look at the MSc in Conversational AI at Heriot-Watt.
What We Achieved: Developing our System
Voice assistants in the home are often used to interact with smart-home devices. These devices can help people live in their own homes longer and more independently. For this reason, they’re even recommended by many related charities (on a case-by-case basis, of course).
To analyze which devices people interact with the most, we investigated the Fluent Speech Commands dataset, which contains over 30,000 annotated smart-home commands uttered by 97 speakers. From the data, we concluded that the living room was a good starting point and chose four devices: the lights, heating, music, and TV. With this scope, only 19.8% of the commands in the dataset were out-of-scope.
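To make this concrete, here is a toy sketch of how such an out-of-scope count can be computed. The device list matches ours, but the keyword check and the sample commands are illustrative assumptions, not the real dataset or our exact analysis code:

```python
# Hypothetical sketch: count how many commands fall outside our four
# chosen living-room devices. The real dataset has 30,000+ utterances.
IN_SCOPE_DEVICES = {"lights", "heating", "music", "tv"}

def is_in_scope(transcription: str) -> bool:
    """An utterance is in scope if it mentions one of our four devices."""
    words = transcription.lower().split()
    return any(device in words for device in IN_SCOPE_DEVICES)

# A tiny made-up sample in the style of Fluent Speech Commands:
commands = [
    "turn up the music",
    "switch off the lights",
    "turn down the heating",
    "bring me my shoes",   # not a smart-home device command
    "turn on the tv",
]

out_of_scope = [c for c in commands if not is_in_scope(c)]
share = len(out_of_scope) / len(commands)
print(f"{share:.1%} of commands are out of scope")  # 20.0% on this toy sample
```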
To illustrate our target interactions, we’ve provided a couple of mock dialogues. You can watch more examples in action at 1:58 in the video below.
In the above example, the system detects the long pause and suggests two predictions. The user then selects the correct prediction by naming the device. Finally, the system combines the “music” selection with the previous “turn up” request to “turn up the music”.
In this example, three predictions were made and the user numerically selected the item. The intent to “turn on” and the selection of “heating” were then combined to take the action “turn on the heating”.
You can see three more examples like these working in this video (at 1:58).
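The selection-and-combination step in these dialogues can be sketched as follows. This is an illustrative simplification: the function names, the ordinal handling, and the prediction format are assumptions for this sketch, not our exact implementation:

```python
# Sketch of mapping a user's reply to one of the offered predictions,
# then combining it with the earlier partial intent.
def resolve_selection(predictions, reply):
    """Return the device the user selected, or None if they rejected all."""
    # "last" must come before "one" so "the last one" resolves correctly.
    ordinals = {"last": -1, "first": 0, "one": 0,
                "second": 1, "two": 1, "third": 2, "three": 2}
    reply = reply.lower()
    for device in predictions:          # selection by naming the device
        if device in reply:
            return device
    for word, index in ordinals.items():  # numeric/ordinal selection
        if word in reply.split():
            return predictions[index]
    return None  # predictions were wrong; the dialogue state resets

def combine(partial_intent, device):
    """Merge the earlier incomplete request with the selected device."""
    return f"{partial_intent} the {device}"

predictions = ["music", "tv"]
device = resolve_selection(predictions, "the music please")
print(combine("turn up", device))  # -> turn up the music
```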
People travel great distances to have meetings in person instead of just phoning. There is a similar difference between talking to Siri and talking to a voice assistant embodied in a robot. While talking on the phone, you miss many of the signals (nods, gaze, brow raising, etc.) that guide our communication every day.
For these reasons, we integrated our system with Furhat to make the interactions more natural and more engaging. We planned to use a physical Furhat robot but switched to the virtual Furhat when the university closed.
Getting More Technical
In the video above, the system architecture is summarized at 0:59. It looks like this:
Our system contains the following components:
- Speech Recognition (we used Furhat) – converts the user’s voice to text.
- Incomplete Utterance Detector (LSTM) – as the name suggests, this processes the text to detect whether an utterance is complete or not. The Fluent Speech Commands dataset contained many similar utterances so we identified key split points and trained an LSTM on the split utterances.
- Alana – handles any complete utterances. Alana is an open-domain conversational agent so can chat about almost anything.
- Utterance Completion (Rasa) – receives a user’s incomplete utterance and predicts what the user wants to say. These predictions are filtered by likelihood and converted into rule-based natural language responses.
- Text-to-Speech (Furhat) – converts this natural language prediction to audio and asks the user. The user’s response is converted to text.
- Dialogue Manager (partly Rasa) – receives the user’s response and manages the interaction. First, the user either selects a device (by naming it or picking it from the list, e.g. “the last one”) or indicates that the predictions were incorrect. If incorrect, the state resets and our system says something friendly to the user. If a selection is made, the selected device is combined with the intent from the utterance completion model. The fully-resolved user intent, such as “turn off TV”, is the output.
- Interface (rule-based) – receives the user’s intent and displays the action taken in the virtual living room. A confirmation of the action taken is also generated, converted to audio, and spoken out loud by Furhat.
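The control flow between these components can be sketched roughly as below. Every function here is a toy stand-in introduced for illustration; in the real system, completeness detection is the LSTM, complete utterances go to Alana, and completions come from the Rasa model:

```python
# Illustrative control flow: route complete utterances to the chat
# agent, and incomplete ones to the utterance-completion model.
def handle_utterance(text, is_complete, complete_handler, completion_model):
    if is_complete(text):
        return complete_handler(text)       # stand-in for Alana
    predictions = completion_model(text)    # stand-in for the Rasa model
    return f"Did you mean the {' or the '.join(predictions)}?"

# Toy stand-ins for the real components:
is_complete = lambda t: not t.rstrip().endswith("the")
complete_handler = lambda t: "OK!"
completion_model = lambda t: ["music", "tv"]

print(handle_utterance("turn up the", is_complete,
                       complete_handler, completion_model))
# -> Did you mean the music or the tv?
```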
As mentioned at the beginning of this article, we planned a deeper evaluation, which was unfortunately postponed due to coronavirus. Despite this, we’re aware of many ways that our system could be improved:
- Additional data. The Fluent Speech Commands dataset contains many utterances, but they aren’t particularly diverse. There are many repeated utterances and several commands that are not directly related to smart devices, such as “bring me my shoes”. Including a larger and more diverse range of commands could help improve several aspects of the system.
- Extending device platforms and domains. We only handle a limited number of devices in the living room. With additional data, we could extend the capabilities of our system.
- Improve incomplete utterance detection. The LSTM we trained had a 99.5% accuracy on the Fluent Speech Commands dataset. As mentioned above, however, this dataset is not very diverse. Training with additional data would make this model more robust to real user utterances.
- Integration with a more sophisticated end-of-turn (EOT) prediction. Our system relies on external end-of-turn prediction and is therefore still turn-based. To significantly improve the fluidity of the conversation, our system needs to integrate fully with an advanced EOT prediction model.
- Implement a re-ranker. Due to time constraints, we didn’t manage to complete our plan for a re-ranker, which would filter and re-order the predictions based on contextual information: for example, not predicting the activation of the lights when all of the lights are already on. Using computer vision and Furhat’s inbuilt camera, we could even prioritize the object that the user is looking at.
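As a rough sketch of what that re-ranker could look like (the state format, intent strings, and gaze input are all hypothetical; a real implementation would query the actual device states and Furhat’s camera):

```python
# Hypothetical re-ranker: drop predictions that would be no-ops given
# the current device state, and prioritize the device the user is
# looking at (here, gaze is just a passed-in value).
def rerank(predictions, intent, device_state, gazed_at=None):
    # Filter out no-ops, e.g. "turn on the lights" when they're already on.
    useful = [d for d in predictions
              if not (intent == "turn on" and device_state.get(d) == "on")
              and not (intent == "turn off" and device_state.get(d) == "off")]
    # Move the gazed-at device to the front of the list.
    if gazed_at in useful:
        useful.remove(gazed_at)
        useful.insert(0, gazed_at)
    return useful

state = {"lights": "on", "tv": "off", "music": "off"}
print(rerank(["lights", "tv", "music"], "turn on", state, gazed_at="music"))
# -> ['music', 'tv']  ("lights" is dropped because they are already on)
```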