Conversation Design

Paralanguage

Words gain new meaning when said aloud – this can strengthen the message or undermine it. With this article I shine a light on how paralanguage can impact voice assistants.

Paralanguage needs to be experienced to be understood – please listen to the included audio files 🙂

Words on the screen.

Words said aloud.

They are not the same.

Why is that? When a text is read aloud it gains an extra layer of meaning; each speaker puts their own spin on the words and this can either reinforce or change the message. This phenomenon is often called paralanguage and it’s present in every spoken utterance. Paralanguage also occurs with synthetic voices, as they are designed to emulate human voices. Paralanguage includes both vocal intonation and body language, but we only need to consider body language when creating a multi-modal assistant with visuals.

“There are no utterances or speech signals that lack paralinguistic properties”

https://en.wikipedia.org/wiki/Paralanguage

My experience with paralanguage comes from many years recording, directing and editing dialogue for film, TV, audiobooks, commercials and other media. When editing dialogue my primary role is to take all the recordings of the actors and turn them into cohesive conversations. A dialogue editor must listen to the tone and inflections of each actor’s utterances and edit them into a believable conversation with the emphasis, emotion and pacing that the actor, scriptwriter and director intended. The dialogue edit is always a composite of many performances but must feel like it occurred at a single point in time.

In my experience four things must be considered with every utterance:

Who is saying it? Are those words appropriate for that person or does it sound unintentionally ironic? Is it true to their character?
What are they saying? This is usually the aspect that conversation designers focus on most – the words that are said
How are they saying it? Their words always get transformed by paralanguage – is it an improvement or not?
Context is the fourth consideration. You need to know the context in order to have some concept about how the words should be interpreted.

A good voice actor always considers how to say their lines. They might need feedback but they will always consider their intonation while they are recording. The paralanguage is finalized during the recording phase.

With synthesized speech such as TTS or cloned voices we must also consider paralanguage but we don’t have the benefit of letting an experienced actor interpret the words for us – it must be considered by the team who are designing the experience. I feel that conversational design teams are either completely ignoring this aspect of communication or at least undervaluing it. In my opinion we either design paralanguage to enhance our text or we might get the wrong result by accident. Even if we intend to have a very subtle persona for our voice assistant we must design a persona and paralanguage that supports that vision.

The power of paralanguage; does my fridge have a fetish?

Let’s look at some examples! We hear paralanguage every time an utterance is said aloud by a synthetic voice or a live one. Don’t believe me? Consider how this sounds when read aloud:

“Can I smell something bad?”

Imagine this utterance coming from a smart home device such as a refrigerator when it detects food that has gone past its ‘best before’ date. Here is a male voice artist (myself) reading it with a slightly disgusted intonation:

Male Voice Talent

And now let’s hear the same utterance with a male TTS voice:

Matthew - Male (Amazon Polly EN-US, no SSML)

Let’s see if the TTS voice ‘Joanna’ gets a better intonation:

Joanna - Female (Amazon Polly EN-US, no SSML)

As you can hear there is no disgust in their voices – to me it sounds more like “I’d really love to smell something bad!”. Joanna seems to be even more enthusiastic about it than Matthew. Now we’re more focused on the assistant’s apparent personality quirks than the condition of the food. That was never the intention; remember we wanted to alert the user to bad food and not draw attention away from it. This is a great example of unintentional persona design – we wanted to have an assistant who is helpful but instead they’re the opposite. This is very important to remember; if we don’t carefully design the persona and how they should talk then there is a risk they will mutate into something we don’t want. We can’t ignore persona – the user will always imagine the person they think they are speaking to:

“There is no such thing as a voice user interface with no personality”

Voice User Interface Design – Cohen, Giangola and Balogh, 2004

How would I try to fix this issue? If possible I would change the wording to a statement like “I can smell something bad” and test whether this is said with a suitable intonation. If it’s not possible to change the wording, or it doesn’t help to do so, then I would try to alter the intonation with SSML. Instead of being able to synthesize emotions with SSML we must try to force the pitch, emphasis and timing to improve the result. It can work well sometimes, but not always. A full description of using SSML would be enough information for another article, which I intend to write soon.
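To give a rough idea of what forcing pitch, emphasis and timing could look like, here is a minimal Python sketch that wraps the reworded utterance in SSML. The `<prosody>`, `<break>` and `<emphasis>` elements come from the W3C SSML specification; the function name `build_ssml`, the split of the sentence and the specific values (-10% pitch, 90% rate, 200 ms pause) are my own illustrative assumptions, not tested settings. Some TTS voices ignore or reject certain tags, so the result always needs to be checked by ear.

```python
from xml.sax.saxutils import escape


def build_ssml(lead: str, stressed: str, pitch: str = "-10%",
               rate: str = "90%", pause_ms: int = 200) -> str:
    """Lower and slow the whole utterance, pause, then stress the final words.

    All default values are illustrative guesses, not recommended settings.
    """
    return (
        "<speak>"
        f'<prosody pitch="{pitch}" rate="{rate}">'
        f"{escape(lead)}"                      # escape &, < and > in the text
        f'<break time="{pause_ms}ms"/>'        # short pause before the stressed words
        f'<emphasis level="moderate">{escape(stressed)}</emphasis>'
        "</prosody>"
        "</speak>"
    )


# Hypothetical split of the reworded utterance from the fridge example.
ssml = build_ssml("I can smell ", "something bad")
print(ssml)
```

The resulting string would then be sent to a TTS engine that accepts SSML input; whether a lower, slower delivery actually reads as mild disgust is something only listening can confirm.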

Do you appreciate how powerful paralanguage can be? It can even exist when there are no words, for example as laughter. We can laugh with or without words. Laughter colours the meaning of our words in various ways; for example it can make them joyful or sarcastic. We could participate in a conversation with laughter alone, by simply laughing approval at what our conversation partner says, or even doing ‘forced’ laughter as a sarcastic riposte to their utterances.

Who is saying it?

Let’s move on. As I said, we should consider who says it, what they say and how they say it. Context is the glue that binds all these elements together.

Let’s start with who: personas and how they influence conversation design. Long before we create sample dialogs and voice recordings we must define the type of voice we will use.

Let’s consider a medical voice assistant that can identify a user’s illness by asking relevant questions. In order to create an assistant that appeals to users we must define how to ask diagnostic questions. Should we sound positive and happy so that the user feels we care about their needs? Should we be serious and quickly push them for information because good health care is a matter of urgently diagnosing and curing illness? Or should we be calm, show them that we’re listening, and try not to show any obvious emotions? With each of these styles of delivery we give the user a feeling about who they’re speaking to – we’re defining the persona of our skill with each utterance.

Imagine that the user has already told the assistant they have a few different symptoms. The assistant needs more clarity about the user’s illness and asks:

“What is your main symptom?”

Here are some examples:

Enthusiastic

Calm

Bossy

Strict

Rude

Personally I find both the ‘calm’ and ‘bossy’ versions to be the best – I feel like I can confidently speak with those voices about personal matters and trust them. ‘Enthusiastic’ seems a little fake to me, as if I’m being sold to rather than treated for an illness. ‘Strict’ seems overtly patronising, and ‘rude’ is of course a very risky option because it might scare away users! You might feel differently when listening to these? I’d love to know.

The important thing is that we need to know what we’re aiming for so that the voice is consistent with our design.

As you can see tone of voice can do so much to foster empathy or destroy it – in a medical skill this could be catastrophic as you need the user to feel comfortable sharing personal information! The tone of voice is totally relevant to the context. Consider how teachers use their voice to inspire and encourage students or how judges use their voice to assert authority. The way they speak is totally relevant to their profession.

How are they saying it?

Let’s look closer at how paralanguage affects the meaning of the utterance. How does the emphasis of certain words in an utterance affect the way it is heard? Here is an example utterance:

“Sorry I didn’t get that”

This utterance can have different meanings depending on which words we emphasize:

Emphasis on "I"

To me, the above example sounds like the assistant is saying this:

“I, the important voice assistant, am unhappy because you, the user, didn’t make that clear enough”

or

“someone else understood what you said, but I didn’t”

Emphasis on "didn't"

I heard this:

“I’m sorry, please try again. I’m listening and trying to help”

Emphasis on "get"

This sounds like:

“I’m a bit flustered and got confused. Please try again”

Emphasis on "that"

To me, this sounds like:

“That is very important – you must tell me again so I can understand!”

Those examples show how I interpret the utterances when read aloud to me, but you may have come to different conclusions. In order to make sure we have the right results, we need to test the assistant.

In practice there are many subtle variations possible with these utterances, such as emphasizing ‘that’ but trying to put a positive spin on it. I think that in the end we usually want the most natural read that fits the context. We don’t want to make the user think about what we mean, so the paralanguage should support the utterance. Did the assistant ever say it right? Of those four, the second version with a subtle emphasis on ‘didn’t’ seems clearest to me, though it would have to be tested in the context of a skill to see if it is consistent with the other utterances and the original design.
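If the voice is synthetic, the four emphasis variants above could be generated for a listening test along these lines. This is only a sketch: `<emphasis>` is a standard SSML element, but the helper `emphasize`, the "moderate" level and the simple whitespace split are my assumptions, and not every TTS voice honours the tag.

```python
from xml.sax.saxutils import escape


def emphasize(utterance: str, target: str, level: str = "moderate") -> str:
    """Return SSML for `utterance` with every occurrence of `target` emphasized."""
    words = utterance.split()
    if target not in words:
        raise ValueError(f"{target!r} is not a word in {utterance!r}")
    parts = [
        f'<emphasis level="{level}">{escape(word)}</emphasis>'
        if word == target else escape(word)
        for word in words
    ]
    return "<speak>" + " ".join(parts) + "</speak>"


UTTERANCE = "Sorry I didn't get that"

# One SSML string per emphasized word, ready to be synthesized and compared.
variants = {word: emphasize(UTTERANCE, word)
            for word in ("I", "didn't", "get", "that")}

for word, ssml in variants.items():
    print(f"{word}: {ssml}")
```

Playing the synthesized variants to a few test users, without telling them which word was stressed, is a cheap way to find out whether they hear the same meanings I described.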

Should I stay or should I go?

Here is the final example. Intonation can do so much to affect how we view the person talking to us. It’s unavoidable. Just think back to a phone call with someone you’d never met. You couldn’t see them, but from their voice (and possibly background noises from their side of the call) you started to build a picture of who they were, where they were and what they were doing.

Questions tend to have a rising pitch and statements a falling pitch – a pattern reported across many languages (John J. Ohala – ‘Cross Language Use Of Pitch – An Ethological View’, Phonetica, 40, 1983). So what happens when we purposefully mess with this concept and use a descending intonation with a question? Does it give a meaningful improvement or not? Does it make the utterance more memorable because we’re surprised by the way it has been said?

This example is for a hotel booking skill. The utterance is:

“How many nights will you be staying?”

Descending Pitch 1

Descending Pitch 2

Ascending Pitch 1

Ascending Pitch 2

In my opinion the last example is the most pleasant to listen to and the most clear to understand. The versions with ascending pitch feel natural to me and lead me to respond straight away. The first two examples with descending pitch sound condescending and unfriendly as if the assistant doesn’t want the user to stay. I would say they are memorable but for the wrong reasons – to me it seems like the assistant is rude.
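With a TTS voice, a crude way to approximate these two directions is to raise or lower the pitch of the final word only. The sketch below does that with the standard SSML `<prosody pitch="...">` attribute. The helper name `ending_pitch`, the place where the sentence is split and the 15% shift are assumptions for illustration; a real pitch contour is far more gradual than a single step, and some voices do not support pitch changes at all.

```python
from xml.sax.saxutils import escape


def ending_pitch(head: str, tail: str, direction: str, amount: int = 15) -> str:
    """Return SSML that shifts the pitch of `tail` up or down by `amount` percent."""
    if direction not in ("ascending", "descending"):
        raise ValueError("direction must be 'ascending' or 'descending'")
    sign = "+" if direction == "ascending" else "-"
    return (
        "<speak>"
        f"{escape(head)} "
        f'<prosody pitch="{sign}{amount}%">{escape(tail)}</prosody>'
        "</speak>"
    )


# Hypothetical split: only the last word of the hotel question is shifted.
for direction in ("descending", "ascending"):
    print(ending_pitch("How many nights will you be", "staying?", direction))
```

Synthesizing both strings with the same voice gives a like-for-like pair to test with users, which is more reliable than my opinion alone.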

Does your voice assistant say things in a way that is consistent with your design? Did you consider paralanguage in your design at all?

Most branding agencies use style guides which often include details about how voice artists should read their branding materials in commercials, training videos etc. Perhaps we should all be paying as much attention to the way we say things so that it enhances our message and doesn’t undermine it – not only to build a strong brand but also to create a more satisfying user experience.

In conclusion

Regardless of what voice you use – whether synthetic or a live voice actor – the way you approach paralanguage is vital to your success. If you don’t design it then there is a very real danger that you will unintentionally undermine your good work. We need to design the persona and how they talk at an early stage in the design process.

You should start to consider how your voice assistant will talk as early as possible in the design process and intonation should be considered while you write the utterances. As I’ve shown, utterances and paralanguage are inseparable when the written text becomes vocalized. The paralanguage should support the utterance. Perhaps try recording your utterances as you write them so that you have a guide for recording the voice talent or synthesizing TTS?

If you want to play with paralanguage to create a surprising result then make sure to test it – what seems like an improvement to you might be unappealing to a user.

Always consider these four points:

Who is saying it?
What are they saying?
How do they say it?
How do the first three fit within the wider context of your design?

Say Hello!

Email: ben [at] conch [dot] design

Twitter