Writing Hosted Scripts

The Sift API provides a wide array of speech and telephony capabilities that make it possible to build sophisticated voice applications. However, many applications require running a server that is always ready to handle messages from the Sift API servers.

For large, permanent applications this may be a minor concern; however, for prototypes, experiments, and smaller applications, it is a large overhead. You may also simply want to build your application quickly without setting up and maintaining a server.

Hosted scripts are small JavaScript programs that are uploaded to the Gridspace servers and control the behavior of your communications applications. They offer all the power of the Sift REST API with less complexity and boilerplate code.

Hosted scripts may be stored in files on your local system and uploaded via a REST call, or composed in the browser using the online script editor. We recommend using the online editor when iterating on a new application, since it allows you to quickly update your script and view debug output and call logs.

Hello world

Our first hosted script application will simply answer an inbound call and use text-to-speech software to say “hello world” to the caller. The code for this script is:

gs.onIncomingCall = function(connection) {
    connection.say("Hello world");
    connection.hangUp();
};

You can try running this script at https://api.gridspace.com/scripts/try#helloworld.

Each hosted script must define at least one of the two special entry point functions, gs.onIncomingCall and gs.onStart. Since we want our script to run whenever someone calls our phone number, we define only gs.onIncomingCall.

The first and only parameter to gs.onIncomingCall is a Connection object, which represents the audio connection between the caller and your application. The connection object defines a number of methods for sending audio, listening for spoken language, and bridging multiple connections together.

We call the Connection.say function, which takes a string argument and speaks the string to the other side of the connection. The function will wait for the machine to finish speaking before returning.

Finally, we call the Connection.hangUp function to terminate the connection. Calling any more functions on the connection after calling hangUp will throw an error.
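Because gs only exists on the Gridspace servers, one way to check a script's control flow before uploading it is to run it locally against a stand-in connection. The sketch below is not part of the hosted scripts API: the local gs object and makeMockConnection are hypothetical, and the mock only records calls and imitates the rule that using a connection after hangUp throws an error.

```javascript
// Hypothetical local stand-in for the hosted `gs` global (not the real API).
const gs = {};

// The script under test, unchanged from the hello world example.
gs.onIncomingCall = function (connection) {
  connection.say("Hello world");
  connection.hangUp();
};

// Hypothetical mock connection: records each call so the script's flow can be
// inspected, and throws if the script uses the connection after hanging up.
function makeMockConnection() {
  const log = [];
  let hungUp = false;
  function record(name, args) {
    if (hungUp) {
      throw new Error(name + " called after hangUp");
    }
    log.push([name, ...args]);
  }
  return {
    log,
    say(text) { record("say", [text]); },
    hangUp() { record("hangUp", []); hungUp = true; },
  };
}

const conn = makeMockConnection();
gs.onIncomingCall(conn);
console.log(conn.log);
```

The mock is deliberately tiny; add methods to it only as your script starts calling them.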

Note

If you are experienced with programming JavaScript in the browser, you may be concerned that the call to say above can block for a few seconds while the computer speaks. In a web application, this would block the UI thread and make the page unresponsive. However, this is not so in the Gridspace hosted scripts environment. Each instance of a script runs in its own independent JavaScript context, so a blocking script will not prevent your application from running multiple script instances at once. Consequently, most hosted script API calls block until their task is complete, which lets you write scripts in a simple, sequential style without excessive callback functions.

Starting a conversation

Let’s try connecting multiple participants together so they can talk to one another. We can accomplish this either by joining a connection to a conference or by calling another phone number directly.

Conferences

To join a conference, we use the Connection.joinConference function:

gs.onIncomingCall = function(connection) {
    connection.joinConference('Conference');
};

The only required parameter is a string that uniquely identifies the conference room to join. Any other connection that calls joinConference with the same string is connected to the same conference room and can speak with all other active connections in it. You can also join as a ghost, in which case the connection can only hear the conversation and does not produce any sound. This is useful for applications where, for example, a QA agent monitors calls silently.

connection.joinConference('Conference', { ghost: true });

The hosted scripts API uses the convention that all required parameters are provided as the first arguments to the function. Optional parameters are provided by passing a plain object of key-value pairs as the final argument.

By default, all conversations are recorded. You can disable recording by setting the doRecord parameter to false.

connection.joinConference('Conference', { doRecord: false });

Keep in mind, however, that if you disable recording, you will not be able to transcribe or process the conversation after it ends.
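The optional-parameter convention can be illustrated with a plain function that takes its required argument first and a trailing options object, filling in defaults for any keys the caller omits. describeJoin is a hypothetical helper for illustration, not part of the hosted scripts API; the doRecord: true default mirrors the recording behavior described above, while ghost: false as a default is an assumption.

```javascript
// Hypothetical defaults: ghost=false is assumed; doRecord=true matches the
// documented behavior that conversations are recorded by default.
const CONFERENCE_DEFAULTS = { ghost: false, doRecord: true };

// Required parameters come first; optional ones arrive in a trailing plain object.
function describeJoin(conferenceName, options) {
  if (typeof conferenceName !== "string" || conferenceName === "") {
    throw new TypeError("conferenceName is required");
  }
  // Caller-supplied keys override the defaults; Object.assign skips an
  // undefined `options`, so omitting it keeps every default.
  const settings = Object.assign({}, CONFERENCE_DEFAULTS, options);
  return { conferenceName, ...settings };
}

console.log(describeJoin("Conference"));
console.log(describeJoin("Conference", { ghost: true, doRecord: false }));
```

Several options can be combined in the same object, as in the second call.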

Phone calls

You can also initiate a phone call directly from a connection:

gs.onIncomingCall = function(connection) {
    connection.callPhone('+15552231452');
};

When the Connection.callPhone function is called, the party on this connection hears a ringback tone until the other end picks up. The function blocks until either the call fails or either party ends the call.

You can specify how long to wait for the outbound call to be answered before giving up using the maxRings parameter. In the following example, we wait for three rings and if nobody answers, we say sorry to the caller.

gs.onIncomingCall = function(connection) {
    var conversation = connection.callPhone('+15552231452', {maxRings: 3});
    if (conversation.status == 'failed') {
        connection.say("Sorry I am not available");
    }
};

We check whether or not the call was successful by checking the status property on the object returned from callPhone. If maxRings is exceeded without the call being answered, the status property will be “failed”.
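Since callPhone reports failure through the status property rather than by throwing, the fallback branch can be checked locally with a mock whose callPhone returns a canned result. The mock below is hypothetical; only the "failed" status comes from this guide, and using "finished" for an answered call is an assumption borrowed from the Conversation states described later.

```javascript
// The script from above, wrapped so it can be called with any connection.
function handleIncomingCall(connection) {
  const conversation = connection.callPhone("+15552231452", { maxRings: 3 });
  if (conversation.status === "failed") {
    connection.say("Sorry I am not available");
  }
}

// Hypothetical mock: callPhone returns whatever status we choose and
// remembers the options it was given.
function makeMockConnection(callStatus) {
  const spoken = [];
  const dialed = [];
  return {
    spoken,
    dialed,
    callPhone(number, options) {
      dialed.push({ number, options });
      return { status: callStatus };
    },
    say(text) { spoken.push(text); },
  };
}

const unanswered = makeMockConnection("failed");
handleIncomingCall(unanswered);
console.log(unanswered.spoken); // the apology was spoken

const answered = makeMockConnection("finished"); // assumed status for a completed call
handleIncomingCall(answered);
console.log(answered.spoken); // nothing was spoken
```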

Playing sounds

You can play a sound to the user by calling the Connection.play function:

gs.onIncomingCall = function(connection) {
    connection.play("http://apicdn.gridspace.com/examples/assets/alert.wav");
};

The parameter to this function is an HTTP or HTTPS URL pointing to a sound file that will be played over the connection. The file must be a valid .wav or .mp3 file.

We also recommend that you use mono audio files, since the connection to the user is also mono. This reduces file size and also makes sure the file sounds the same to the user as when you play the file on your desktop computer.
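Before uploading a .wav prompt, you can check locally whether it is mono. The sketch below is a standalone Node.js utility, not part of the hosted scripts API: it walks the RIFF chunks of a WAV file held in a Buffer and reads the channel count from the fmt chunk. It does not handle .mp3 files.

```javascript
// Returns the number of audio channels in a RIFF/WAVE buffer, or throws if the
// buffer is not a WAV file or has no fmt chunk.
function wavChannelCount(buf) {
  if (buf.length < 12 || buf.toString("ascii", 0, 4) !== "RIFF" ||
      buf.toString("ascii", 8, 12) !== "WAVE") {
    throw new Error("not a RIFF/WAVE file");
  }
  let offset = 12; // the first chunk follows the 12-byte RIFF header
  while (offset + 8 <= buf.length) {
    const id = buf.toString("ascii", offset, offset + 4);
    const size = buf.readUInt32LE(offset + 4);
    if (id === "fmt ") {
      // fmt chunk body: audioFormat (uint16) followed by numChannels (uint16)
      return buf.readUInt16LE(offset + 8 + 2);
    }
    // Chunk bodies are padded to an even number of bytes.
    offset += 8 + size + (size % 2);
  }
  throw new Error("no fmt chunk found");
}

// Usage with Node's fs module:
// const fs = require("fs");
// const channels = wavChannelCount(fs.readFileSync("alert.wav"));
// console.log(channels === 1 ? "mono" : channels + " channels");
```

Walking the chunks, rather than reading a fixed byte offset, keeps the check working for files that carry metadata chunks such as LIST before the fmt chunk.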

Getting input from the user

Many applications need to interact directly with the end user. The hosted scripts API can respond to voice input as well as phone key presses.

Key presses

Traditionally, phone systems use dual-tone multi-frequency (DTMF) signaling as the main user input mechanism. When the user presses a key on their phone’s dial pad, a signal is sent and detected by the phone system on the other end of the call.

You can collect a sequence of key presses from the user with the Connection.getDigits command.

gs.onIncomingCall = function(connection) {
    var digits = connection.getDigits(3, {promptText: 'Press three keys'});
    connection.say('You pressed ' + digits);
}

In this example, we ask the user to press three keys on their dial pad and say the digits back to them. getDigits will return a string representing the keys that were pressed.

Normally, you will also want to provide some kind of prompt to the user telling them to press a key. You can use either the promptText or the promptUrl parameter to play a sound to the user while waiting for them to press the expected number of keys.

The promptText parameter is a string that will be said to the user, similar to the say command we encountered earlier. The promptUrl is a sound file that will be played to the user, similar to the play function.
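A common pattern is to re-prompt when the caller does not enter enough digits. This sketch assumes, without confirmation from this guide, that getDigits returns a shorter (possibly empty) string when the caller stops pressing keys. getDigitsWithRetry and the mock connection are hypothetical.

```javascript
// Hypothetical helper: ask for `count` digits, re-prompting up to `maxAttempts`
// times. Assumes getDigits returns fewer than `count` digits on timeout.
function getDigitsWithRetry(connection, count, promptText, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const digits = connection.getDigits(count, { promptText: promptText });
    if (typeof digits === "string" && digits.length === count) {
      return digits;
    }
    connection.say("Sorry, I did not get that.");
  }
  return null; // the caller decides what to do after repeated failures
}

// Hypothetical mock that replays a scripted sequence of key-press results.
function makeMockConnection(results) {
  const spoken = [];
  return {
    spoken,
    getDigits(count, options) { return results.shift(); },
    say(text) { spoken.push(text); },
  };
}

const conn = makeMockConnection(["", "12", "123"]);
console.log(getDigitsWithRetry(conn, 3, "Press three keys", 3)); // prints 123
```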

Voice responses

Although touch tone interfaces are simple and familiar, they are very limited in the type of information they can collect and cumbersome to use. Luckily, you can make use of all the powerful voice recognition tools in the Sift API from within hosted scripts.

The Connection.getFreeResponse function gets a spoken voice response from the user and then returns their transcribed words as a string as soon as they stop speaking.

gs.onIncomingCall = function(connection) {
    var response = connection.getFreeResponse({
        promptText: "What is your favorite ice cream?"
    });
    console.log(response);
}

As with the Connection.getDigits command, we can provide a prompt sound to the user as either text or an audio file URL.

You can help the speech recognition system give more accurate results if you know something about the type of responses that you will get. For example, in the script above, we know that the user is probably going to say something about ice cream. We can give a hint to Gridspace’s transcription software by passing a list of words to the hintWords parameter.

gs.onIncomingCall = function(connection) {
    var response = connection.getFreeResponse({
        promptText: "What is your favorite ice cream?",
        hintWords: ["sherbet", "rocky", "road", "neapolitan", "vanilla", "fudge"]
    });
    console.log(response);
}

This is especially useful for words that are unusual or sound similar to other words. If you were running this application repeatedly and noticed that “cherry” was often getting mistaken for “ferry”, you could add “cherry” to the hint words to get better results.
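The example above lists multi-word flavors as separate words ("rocky", "road"). If you keep your vocabulary as phrases, a small helper can split them into a de-duplicated word list for hintWords. buildHintWords is a hypothetical convenience; whether hintWords also accepts multi-word phrases directly is not covered in this guide.

```javascript
// Split phrases into lowercase words and drop duplicates, preserving
// first-seen order (Set iterates in insertion order).
function buildHintWords(phrases) {
  const seen = new Set();
  for (const phrase of phrases) {
    for (const word of phrase.toLowerCase().split(/\s+/)) {
      if (word !== "") {
        seen.add(word);
      }
    }
  }
  return Array.from(seen);
}

const flavors = ["Sherbet", "Rocky Road", "Neapolitan", "vanilla", "fudge", "cherry"];
console.log(buildHintWords(flavors).join(", "));
// sherbet, rocky, road, neapolitan, vanilla, fudge, cherry
```

The result can then be passed as hintWords: buildHintWords(flavors).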

Formatted speech

Oftentimes, you’ll be interested in gathering specific information from the user rather than transcribing their exact words. The API offers many functions that make it easier to extract relevant information from speech and return it in an easy-to-parse format.

For instance, if the user is selecting one choice from a fixed set of options, you can use the Connection.getMultipleChoice function instead of the more general getFreeResponse.

gs.onIncomingCall = function(connection) {
    var response = connection.getMultipleChoice(
        ['chocolate', 'vanilla', 'strawberry'],
        {
            promptText: "Would you prefer chocolate, vanilla, or strawberry ice cream?"
        }
    );
    if (response == 'chocolate') {
        connection.say('You selected chocolate');
    }
};

Using getMultipleChoice has two benefits. First, it tells the API which words to listen for, reducing the chance that it will mishear the user. Second, it will return exactly one of the options you provide so you can use direct string comparison to check which option was said, rather than having to search for each option within the transcript. If the user doesn’t say anything, or they say something that does not match any of the provided options, getMultipleChoice will return null.
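Because getMultipleChoice returns either exactly one of your options or null, a lookup table keyed by option keeps the branching compact and gives you a single place to handle the no-match case. The reply table and mock connection below are hypothetical; the null return is the behavior described above.

```javascript
// Hypothetical replies, keyed by the exact option strings.
const FLAVOR_REPLIES = {
  chocolate: "You selected chocolate",
  vanilla: "You selected vanilla",
  strawberry: "You selected strawberry",
};

function handleIncomingCall(connection) {
  const response = connection.getMultipleChoice(Object.keys(FLAVOR_REPLIES), {
    promptText: "Would you prefer chocolate, vanilla, or strawberry ice cream?",
  });
  if (response === null) {
    // No answer, or an answer that matched none of the options.
    connection.say("Sorry, I didn't catch that.");
    return;
  }
  connection.say(FLAVOR_REPLIES[response]);
}

// Hypothetical mock that imitates the documented return behavior.
function makeMockConnection(answer) {
  const spoken = [];
  return {
    spoken,
    getMultipleChoice(choices, options) {
      return choices.includes(answer) ? answer : null;
    },
    say(text) { spoken.push(text); },
  };
}

const conn = makeMockConnection("vanilla");
handleIncomingCall(conn);
console.log(conn.spoken);
```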

To retrieve a numeric value like a quantity or account number, use the Connection.getNumber function:

gs.onIncomingCall = function(connection) {
    var digits = connection.getNumber({promptText: "What is your account number?"});
    // digits now contains a string of digits like "01648".
}

Like the getDigits function above, this function returns a string of digits, collected this time from speech rather than the keypad. The getNumber function works equally well whether the user says each digit one at a time, like “six oh five one”, or a single number, like “six thousand fifty one”. Mixing the two is also fine, as in “sixty fifty one”, as long as the result is unambiguous.
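Because getNumber returns a string, leading zeros such as the one in "01648" are preserved; converting it with Number() or parseInt would drop them. Validation should therefore work on the string directly. isValidAccountNumber is a hypothetical helper, and the five-digit length is only an example, not a requirement of the API.

```javascript
// Hypothetical check: exactly `length` ASCII digits, compared as a string.
function isValidAccountNumber(digits, length) {
  return typeof digits === "string" &&
    new RegExp("^[0-9]{" + length + "}$").test(digits);
}

console.log(Number("01648"));                  // 1648: the leading zero is lost
console.log(isValidAccountNumber("01648", 5)); // true
console.log(isValidAccountNumber("1648", 5));  // false
```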

See the table below for a full list of the formatted voice response functions available on the Connection object.

Function            Return type            Extracts
getFreeResponse     String                 General speech
getNumber           String of digits       Numeric values
getMultipleChoice   String                 One of a fixed set of choices
getDate             ISO 8601 date string   Dates
getName             String                 A person’s name or names
getYesOrNo          Boolean                A negative or affirmative response
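getDate returns an ISO 8601 date string. Assuming it is a plain calendar date of the form YYYY-MM-DD (this guide does not show an example), note that new Date("2024-03-05") interprets a date-only string as UTC midnight, which can display as the previous day in time zones west of UTC. Splitting the string avoids that conversion; parseIsoDate and toLocalDate are hypothetical helpers.

```javascript
// Parse a "YYYY-MM-DD" string into numeric parts without any time-zone conversion.
function parseIsoDate(iso) {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(iso);
  if (!match) {
    return null;
  }
  return { year: Number(match[1]), month: Number(match[2]), day: Number(match[3]) };
}

// Build a Date at local midnight; the Date constructor's month is zero-based.
function toLocalDate(parts) {
  return new Date(parts.year, parts.month - 1, parts.day);
}

const parts = parseIsoDate("2024-03-05");
console.log(parts);                        // { year: 2024, month: 3, day: 5 }
console.log(toLocalDate(parts).getDate()); // 5
```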

Key presses or voice responses

You can also let users make a selection with either their voice or the keypad. This is especially useful for situations when the caller is not an English speaker or is in a high-noise environment. In this example, we use the Connection.getMultipleChoiceOrDigits function to let the user navigate a menu system by voice or keypad:

const CHOICES = ["store hours", "locations"];

gs.onIncomingCall = function(connection) {
    var result = connection.getMultipleChoiceOrDigits(
        CHOICES, 1, {promptUrl: 'http://mysite.com/choices.mp3'}
    );
    if (result.digits == '0' || result.choice == 'store hours') {
        connection.play('http://mysite.com/storehours.mp3');
    } else if (result.digits == '1' || result.choice == 'locations') {
        connection.play('http://mysite.com/locations.mp3');
    }
};

Instead of returning a string like the getMultipleChoice function, getMultipleChoiceOrDigits returns a result object with both a digits and a choice attribute.
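When the menu options are listed in a fixed order, the key press and the spoken choice can be normalized to a single index so each branch is written only once. This sketch assumes that key 0 selects the first option, as in the example above, and that the attribute the caller did not use is null or undefined. selectedIndex is a hypothetical helper.

```javascript
const CHOICES = ["store hours", "locations"];

// Map a { digits, choice } result onto an index into `choices`, or -1 if
// neither the spoken choice nor the key press selects a valid option.
function selectedIndex(result, choices) {
  if (result == null) {
    return -1;
  }
  if (typeof result.choice === "string" && choices.includes(result.choice)) {
    return choices.indexOf(result.choice);
  }
  if (typeof result.digits === "string" && /^[0-9]+$/.test(result.digits)) {
    const index = Number(result.digits);
    if (index < choices.length) {
      return index;
    }
  }
  return -1;
}

console.log(selectedIndex({ digits: "1", choice: null }, CHOICES));          // 1
console.log(selectedIndex({ digits: null, choice: "store hours" }, CHOICES)); // 0
console.log(selectedIndex({ digits: "7", choice: null }, CHOICES));          // -1
```

With the index in hand, a parallel array of audio URLs lets one connection.play call serve every menu entry.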

Advanced inputs with Scanners

For more sophisticated speech parsing, the API provides a tool called a Scanner. Scanners use a powerful query language that allows you to capture words and phrases while allowing for variation in phrasing or word order. Read the Parsing Language With Scanners guide for full documentation on scanners.

You can use scanners in your hosted scripts with Connection.getScannedResponse, which works just like other input functions:

gs.onIncomingCall = function(connection) {
    var result = connection.getScannedResponse("'{number}' then ('minutes' or 'hours')", {
        promptText: "When should we call you back?"
    });
}

The return value is a Scanner result object, or null if the user’s response does not match the scanner query.

Monitoring live calls

Any call that you facilitate using hosted scripts can be monitored in real-time. This enables you to create intelligent agents that react to what is being said on the call. Some things you might do with live calls are:

  • Send an alert when a caller mentions a specific product.
  • Connect someone to the call via voice command.
  • Create a live stream of transcripts from all active calls.

Conversation objects

Some functions of the Connection object create an API resource called a Conversation object. Conversations track the state of a communication between one or more parties and store metadata about the interaction. During a call, the corresponding Conversation object will be in the “in-progress” state. As soon as the conversation ends, the Conversation object moves to the “finished” state.

Any Connection function that generates a Conversation accepts a set of optional parameters that allow you to transcribe and analyze the Conversation in real time or after the fact. We use the joinConference function in the following examples, but the same parameters can be passed to any other Connection function that creates a Conversation, such as callPhone.

Live transcripts

To receive live transcripts, you must provide an onTranscript callback to the function that is initiating the conversation. If you are creating a conference call, you would add the onTranscript callback like this:

gs.onIncomingCall = function(connection) {
    connection.joinConference('Conference', {
        onTranscript: function(transcript, conversation) {
            console.log(transcript);
        }
    });
};

Now any time that a participant in the conversation finishes a statement, the provided function will be called with two arguments: a transcript and a Conversation object.

The transcript object is a string-like object with some additional attributes, including the speaker and the timestamp. The transcript also exposes the hasPhrase helper function, which can be used to search for a particular word or phrase.
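The real hasPhrase helper only exists inside the hosted environment. If you want to exercise onTranscript logic locally, a plain-string stand-in can approximate it. The matcher below is an assumption about reasonable behavior (case-insensitive, whole words only, punctuation ignored) and is not a description of how transcript.hasPhrase is actually implemented.

```javascript
// Case-insensitive, whole-word phrase search over a plain string.
function hasPhrase(text, phrase) {
  const words = (s) => s.toLowerCase().match(/[a-z0-9']+/g) || [];
  const haystack = words(text);
  const needle = words(phrase);
  if (needle.length === 0) {
    return false;
  }
  // Slide the phrase across the transcript one word at a time.
  for (let i = 0; i + needle.length <= haystack.length; i++) {
    if (needle.every((w, j) => haystack[i + j] === w)) {
      return true;
    }
  }
  return false;
}

console.log(hasPhrase("Can you reset my Password?", "password"));            // true
console.log(hasPhrase("All passwords expire", "password"));                  // false
console.log(hasPhrase("Please reset my password now", "reset my password")); // true
```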

gs.onIncomingCall = function(connection) {
    connection.joinConference('Conference', {
        onTranscript: function(transcript, conversation) {
            if (transcript.hasPhrase("password")) {
                conversation.playToAll("http://apicdn.gridspace.com/examples/assets/alert.wav");
            }
        }
    });
};

The above example will play an alert sound to everyone in the conversation if someone says the word “password”.
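Since onTranscript fires after every statement, the example above replays the alert each time “password” is said. One way to alert only once per conversation is to remember which Conversation objects have already been alerted. The transcript and conversation stand-ins below are hypothetical; only hasPhrase and playToAll come from this guide.

```javascript
const ALERT_URL = "http://apicdn.gridspace.com/examples/assets/alert.wav";
// Conversations that have already received the alert.
const alerted = new Set();

function onTranscript(transcript, conversation) {
  if (transcript.hasPhrase("password") && !alerted.has(conversation)) {
    alerted.add(conversation);
    conversation.playToAll(ALERT_URL);
  }
}

// In a hosted script:
// connection.joinConference('Conference', { onTranscript: onTranscript });

// Hypothetical stand-ins for local testing.
function makeTranscript(text) {
  return { text, hasPhrase: (p) => text.toLowerCase().includes(p.toLowerCase()) };
}
function makeConversation() {
  const played = [];
  return { played, playToAll(url) { played.push(url); } };
}

const conv = makeConversation();
onTranscript(makeTranscript("my password is hunter2"), conv);
onTranscript(makeTranscript("I said password again"), conv);
console.log(conv.played.length); // 1
```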

Live topics

You can request a periodic update with the current topics of a conversation by supplying the onTopics parameter.

gs.onIncomingCall = function(connection) {
    connection.joinConference('Conference', {
        onTopics: function(topics, conversation) {
            console.log(topics);
        }
    });
};

The provided callback function will be invoked periodically with a list of topic strings. Some examples of topics are: “cars”, “Star Wars”, and “sales team”.
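Because the callback receives the current topic list on every update, logging it directly repeats topics you have already seen. A small tracker can report only the topics that are new since earlier updates. makeTopicTracker is a hypothetical helper; the topic strings in the usage are the examples from the text.

```javascript
// Returns a function that, given the latest topic list, returns only the
// topics not seen in any earlier call (duplicates within one list count once).
function makeTopicTracker() {
  const seen = new Set();
  return function (topics) {
    const fresh = [];
    for (const topic of topics) {
      if (!seen.has(topic)) {
        seen.add(topic);
        fresh.push(topic);
      }
    }
    return fresh;
  };
}

const newTopics = makeTopicTracker();
console.log(newTopics(["cars", "Star Wars"]).join(", "));       // cars, Star Wars
console.log(newTopics(["Star Wars", "sales team"]).join(", ")); // sales team

// In a hosted script:
// onTopics: function(topics, conversation) {
//     var fresh = newTopics(topics);
//     if (fresh.length > 0) console.log(fresh);
// }
```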