Analyzing Conversations

Grading a call

In commercial telephony applications, from customer support to sales, a common need is evaluating the quality of a call. Commonly, quality metrics are collected either by asking callers to complete a satisfaction survey or by having expert humans listen to a sample of calls. However, only a fraction of callers typically agree to take a survey, and human evaluations of call quality can be inconsistent and biased.

The call grading functionality is a pre-trained classification model, aimed at extracting common measures of quality from a call:

  • Successful Outcome - Whether the parties of the call appear to have reached a positive result in the conversation.
  • Quality of Service - A measure of whether the call was cordial and professional.
  • Experience - Whether the parties analyzed seemed confident and capable given the topics discussed.
  • Proactivity - The extent to which problems were addressed before they escalated.
  • Trust - A measure of perceived honesty and trust given the tone and speech content.
  • Empathy - A measure of the extent to which parties reflexively react to the emotions of each other.

These models look at both the voice (signal) and speech content (words) in a conversation to reach a conclusion about the above metrics. Sometimes the tone of voice may provide the best clues to how a call went, while other times the choice of words and phrasing may dominate. The models are pre-built to evaluate these metrics; however, they may return less accurate results on calls from unusual domains or calls in which one of the above metrics is not relevant.

Call grading is available both on live calls through telephony interfaces and through the file upload interface demonstrated below.

Uploading a set of files

In this example, we have three recordings that we wish to upload so that we can then process them with the call grading model. In our environment the files are named “test_call1.wav”, “test_call2.wav”, “test_call3.wav”. In our case, the files are all 8,000 Hz, 16-bit PCM and were recorded through a call center phone system; however, other sample rates and compression codecs are supported.

import requests

# REPLACE WITH YOUR CREDENTIALS
account_id = 'd47260621aa148ce'
auth_token = '2Y4Z5RU9R85J5U6OZ61RPFNFB43LCL26'

file_list = ['test_call1.wav', 'test_call2.wav', 'test_call3.wav']

upload_url = 'https://api.gridspace.com/v0/media/upload'

# Map each file name to the audio_id returned by the upload endpoint.
audio_ids = {}

for file_name in file_list:
    with open(file_name, 'rb') as audio_file:
        response = requests.post(upload_url,
            auth=(account_id, auth_token),
            files={'audio': audio_file}).json()
    audio_ids[file_name] = response['audio_id']

Note

HTTP basic access authentication is required for all calls to our REST API. Refer to API Authentication for more information.

The JSON content of each response includes an audio_id field, which allows us to create a Conversation for each audio file.

Processing the files

Using the stored audio_ids, we can request that each one form the basis of a Conversation object.

conv_url = 'https://api.gridspace.com/v0/conversations'
callback_url = 'https://myserver.com/doneprocessing'

conversations = {}

for file_name, audio_id in audio_ids.items():
    payload = {'audio_id': audio_id,
               'name': file_name,
               'processors': ['grade'],
               'event_callback': callback_url}
    response = requests.post(conv_url,
        auth=(account_id, auth_token),
        json=payload).json()
    conversations[file_name] = response

Each request returns a new Conversation object. Here we store the Conversation objects in a dictionary for our own reference. If you need to correlate a conversation with its callback later, you can also store it in your database. However, because we used the filename as the name of each conversation, that alone may be enough to connect our call grades to the audio file that was graded.
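If you only need a lightweight record rather than a full database, one minimal approach, sketched below, is to persist the filename-to-conversation mapping to disk alongside your audio files.

import json

# Persist the filename-to-Conversation mapping so later callbacks can
# be correlated with the original audio files. This is only a local
# convenience; a production system would likely use its own database.
with open('conversations.json', 'w') as f:
    json.dump(conversations, f, indent=2)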

Receiving call grades

Each conversation (and its media file) is now queued for processing. As each file is processed and graded, the done_processing event callback is triggered. The callback URL you provide must be able to receive HTTP POST requests and handle the done_processing event.

A Call Grades field will be present in the returned Conversation object. This contains all the metrics listed above.
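If you don’t already have a callback endpoint, a minimal sketch using Flask is shown below. The exact layout of the event payload (the 'event', 'conversation', and 'call_grades' names here) is assumed for illustration; inspect the POST bodies your endpoint actually receives.

from flask import Flask, request

app = Flask(__name__)

@app.route('/doneprocessing', methods=['POST'])
def done_processing():
    event = request.get_json(force=True)
    # Field names below are illustrative assumptions about the payload,
    # not a documented schema.
    if event.get('event') == 'done_processing':
        conversation = event.get('conversation', {})
        print(conversation.get('name'), conversation.get('call_grades'))
    return '', 200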

Interpreting call grades

In our case, after processing, we receive three done_processing event callbacks. Each returns a conversation with the Call Grades field populated. The call grade objects for our three test files are listed below:

test_call1.wav:

{
    "outcome": 0.95,
    "quality": 0.80,
    "experience": 0.82,
    "proactivity": 0.63,
    "trust": 0.71,
    "empathy": 0.52,
}

test_call2.wav:

{
    "outcome": 0.43,
    "quality": 0.15,
    "experience": 0.41,
    "proactivity": 0.62,
    "trust": 0.39,
    "empathy": 0.26,
}

test_call3.wav:

{
    "outcome": 0.60,
    "quality": 0.94,
    "experience": 0.87,
    "proactivity": 0.81,
    "trust": 0.89,
    "empathy": 0.89,
}

In the above cases, we can quickly learn about the three calls without listening to them.

The first call has the highest “outcome” by far. This indicates the model determined that whatever goals the callers had were achieved. Speech and tone at the end of a conversation often strongly drive this metric. The other quality metrics are generally above 0.5, but they suggest the call was effective more than it was cordial.

Verify Models for Relevancy

Even though 0.5 may seem low for many of these metrics, depending on the domain of speech it may be hard to raise them much further. In some call domains, a metric such as empathy might not be a good fit for the call set; for example, calls that are overtly formal might register to the model as showing limited empathy. Before relying on a metric, make sure it’s a good fit for your dataset.

The second call indicates poor results on most metrics. The outcome of 0.43 suggests the goals were only partially or ambiguously resolved; a value in this midrange may indicate the model saw conflicting signals about the result. Many of the metrics tied to a positive interpersonal interaction are also poor. In particular, quality of service and empathy are exceptionally low. In most use cases, this conversation would be evaluated as an obvious red flag.
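As a rough sketch, an application could use the grade object to flag calls like this one for human review automatically. The thresholds below are arbitrary illustrations, not recommended values.

# Arbitrary example thresholds; tune them against your own labeled calls.
RED_FLAG_THRESHOLDS = {'quality': 0.3, 'empathy': 0.3, 'trust': 0.4}

def is_red_flag(grades):
    # Flag the call if any watched metric falls below its threshold.
    return any(grades.get(metric, 1.0) < cutoff
               for metric, cutoff in RED_FLAG_THRESHOLDS.items())

grades_call2 = {'outcome': 0.43, 'quality': 0.15, 'experience': 0.41,
                'proactivity': 0.62, 'trust': 0.39, 'empathy': 0.26}
print(is_red_flag(grades_call2))  # True - route to a supervisor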

The third call is a bit more interesting. Metrics across the board indicate the parties on the call were cordial and had a generally positive interaction; likely there were no strong verbal conflicts and the parties were generally cooperative. What’s notable is that the outcome is well below 1.0. Likely due to external factors, the call parties failed to fully achieve their goals, or at the very least had outstanding actions remaining for one or more parties.

All of these models were trained on calls labelled by humans, either through satisfaction surveys or by professional call service managers. However, depending on your domain and the types of calls evaluated, some metrics may perform better than others on your data. We recommend running a test set of your data through this processor and examining the results manually or against your own human labels. There’s nothing wrong with cherry-picking the subset of metrics that fits your data best.
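A simple spot check might compare one model metric against your own human scores for a handful of calls. The human scores below are hypothetical placeholders for illustration.

# Hypothetical human labels on a 0-1 scale; replace with your own.
human_quality = {'test_call1.wav': 0.75,
                 'test_call2.wav': 0.20,
                 'test_call3.wav': 0.90}

# "quality" grades returned for the three test calls above.
model_quality = {'test_call1.wav': 0.80,
                 'test_call2.wav': 0.15,
                 'test_call3.wav': 0.94}

errors = [abs(model_quality[name] - human_quality[name])
          for name in human_quality]
print('mean absolute difference:', sum(errors) / len(errors))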

If these call grading fields don’t perfectly match your own metrics, your organization can always train custom models using the classify processor. However, it’s often easiest to start with pre-trained models for common speech tasks.

Building a custom classifier

Commonly, an application will want to take some action based on how a spoken conversation is categorized or classified. For example, if you have a business where in-bound callers may call to make a reservation or to complain, it may be useful to take an action whenever a customer calls to complain, but not otherwise.

The task of classifying conversations first requires a model to be taught by example. To do so, we must choose several conversations and tell our model the name of the class to which they belong. In the above example, we may choose to call our model “reason_for_calling” and, to train, we could call one class “reservation” and the other “complaint”. Each example is a conversation, and the more examples the model receives, the more accurate it will get.
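For that hypothetical “reason_for_calling” model, a single training request might look like the sketch below, using the train_classifier endpoint shown in the next example; the conversation ID is a placeholder.

# Placeholder conversation ID; use an ID from one of your own conversations.
payload = {'conversation_id': 'YOUR_CONVERSATION_ID',
           'model_name': 'reason_for_calling',
           'class_name': 'complaint'}
requests.post('https://api.gridspace.com/v0/train_classifier',
              auth=(account_id, auth_token),
              json=payload)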

In the following example, we’re going to build a simple model that tells apart calls to a technical support line for a broadband provider. The two types of calls we see are callers who can’t connect to the internet (“cannot_connect”) and callers who have a slow connection (“slow_connection”).

Making sense of call history

Suppose we have previously stored a list of conversation IDs and the class to which each call belongs in CSV format. We start by parsing the CSV and submitting a training request for each row. There’s no need to manually create a classifier; it will be created the first time we train with its model name.
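For reference, a hypothetical support_calls.csv might look like the following; the conversation IDs are placeholders.

CONVERSATION_ID_1,cannot_connect
CONVERSATION_ID_2,slow_connection
CONVERSATION_ID_3,cannot_connect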

import csv

train_url = 'https://api.gridspace.com/v0/train_classifier'
model_name = 'support_call_type'

# Each CSV row contains: conversation_id, label
with open('support_calls.csv', newline='') as csvfile:
    support_calls = list(csv.reader(csvfile, delimiter=','))

for support_call in support_calls:
    payload = {'conversation_id': support_call[0],
               'model_name': model_name,
               'class_name': support_call[1]}
    response = requests.post(train_url,
        auth=(account_id, auth_token),
        json=payload).json()
    # Each iteration replaces the classifier object for our records
    classifier = response

Classifying a call

Now that our model has been built on the conversation list above, we can use that model to identify whether a support call is more likely to be in the “cannot_connect” class or the “slow_connection” class.

Sufficiently Train your Models

Even after you have provided examples to your classifier, the machine models used to tell apart human speech classes often rely on a wide range of subtle speech behaviors. Models perform best when trained on hundreds to thousands of conversations.

The three calls explored earlier in Uploading a set of files can now be run through the classifier we just created to determine which class each call belongs to. We reuse the audio_ids dictionary created earlier.

conv_url = 'https://api.gridspace.com/v0/conversations'
callback_url = 'https://myserver.com/doneprocessing'

conversations = {}

for file_name, audio_id in audio_ids.items():
    payload = {'audio_id': audio_id,
               'name': file_name,
               'processors': ['classify:support_call_type'],
               'event_callback': callback_url}
    response = requests.post(conv_url,
        auth=(account_id, auth_token),
        json=payload).json()
    conversations[file_name] = response

As each conversation finishes processing, we receive a POST request to our callback URL indicating that the calls belong to the following classes.

test_call1.wav:

{
    "support_call_type": "cannot_connect"
}

test_call2.wav:

{
    "support_call_type": "slow_connection"
}

test_call3.wav:

{
    "support_call_type": "cannot_connect"
}

Unsurprisingly, the call with the most negative grades is also the one assigned a different class. In the calls where the caller could not connect to the internet, the grades in the earlier guide showed good resolution, possibly indicating the callers were able to restore their internet connection. By contrast, the caller with a slow internet connection had a much more negative, unresolved call, suggesting they were not so fortunate in their resolution.
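Within your callback handler, one way to act on these results is sketched below; where the classification result sits inside the payload is an assumption for illustration.

def handle_classification(conversation):
    # Assumes the classifier result is exposed under the model name;
    # check the actual callback payload your endpoint receives.
    call_type = conversation.get('support_call_type')
    if call_type == 'slow_connection':
        # e.g., queue the account for a line-quality follow-up
        print('Follow up needed for', conversation.get('name'))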

Transcribing completed conversations

In the previous two examples, we’ve done quite a bit of analysis on the calls above. That is often the most efficient and useful path for an API user. Sometimes, however, for your records or your own analysis, a machine transcript is sufficient.

Running the same calls through transcription is analogous to the other processors run earlier.

conv_url = 'https://api.gridspace.com/v0/conversations'
callback_url = 'https://myserver.com/doneprocessing'

conversations = {}

for file_name, audio_id in audio_ids.items():
    payload = {'audio_id': audio_id,
               'name': file_name,
               'processors': ['transcribe'],
               'event_callback': callback_url}
    response = requests.post(conv_url,
        auth=(account_id, auth_token),
        json=payload).json()
    conversations[file_name] = response

Transcription when Necessary

Using transcription to solve your tasks will often appear to be the simplest or most direct solution. However, for common tasks like classification, grading, or topic extraction, the Gridspace Sift analysis methods are designed to jump directly to high-quality results. By contrast, methods you train on machine transcripts inherit additional noise and will likely underperform the Sift models.

What we get back is a machine transcript from the Sift Automatic Speech Recognition (ASR) system. While our ASR models are state of the art, the raw quality of the transcription result depends quite a bit on the quality and domain of the speech transcribed. Accuracy is improved by closer microphone distance, reduced background noise, and slow or read speech.
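If you just want the transcript on file, one minimal approach is to persist each processed conversation payload as-is when the done_processing callback arrives, rather than assuming a particular transcript field layout.

import json

def save_conversation(conversation, path):
    # Persist the whole Conversation object, including the machine
    # transcript, for your records or later analysis.
    with open(path, 'w') as f:
        json.dump(conversation, f, indent=2)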