|
|
| VoiceXML Tutorial
Plum Voice Platform v. 2.6
© 2007 Plum Group, Inc. All rights reserved.
More on VoiceXML at Plumvoice.com
VoiceXML 2.0 is the World Wide Web Consortium (W3C) standard for scripting voice applications. In this tutorial, we construct a VoiceXML interactive voice response (IVR) application for a customer service center. Starting from a simple "Hello World" application, we build a telephony application which includes:
Outline
1. Getting started: building a minimal VoiceXML script
Tags introduced: <?xml?> <vxml> <form> <block> <prompt> <audio>
1.1 Hello World: A minimal VoiceXML script
We begin with nearly the simplest complete VoiceXML application. The application here is analogous to an answering machine set to play an announcement only.
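A minimal sketch of such a script follows. It assumes the short `<vxml version="2.0">` header form; a strictly conforming VoiceXML 2.0 document would also declare the VoiceXML namespace on the root element.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- One form containing one block of executable content -->
  <form>
    <block>
      <!-- Plain text inside a prompt is rendered by the TTS engine -->
      <prompt>Welcome to Plum Voice.</prompt>
    </block>
  </form>
</vxml>
```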
In this example, the user would hear a synthesized voice say, "Welcome to Plum Voice." Then the system would simply hang up. The <form> defines the basic unit of interaction in VoiceXML. This form includes only a single <block> of executable content, which in turn includes a single <prompt> to the user. By default, any plain text within a prompt is passed to the system's text-to-speech (TTS) synthesis engine to be generated as audio.
For static prompts such as this welcome message, we'll probably want to use a human announcer instead of TTS. TTS has come a long way, but there's still no substitute for the real thing. For recorded prompts, we use the <audio> tag.
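A sketch of a recorded prompt is shown here. The file name welcome.wav is a hypothetical placeholder, not a file supplied with the platform.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <block>
      <prompt>
        <!-- welcome.wav is an assumed file name, resolved relative to this document -->
        <audio src="welcome.wav">Welcome to Plum Voice.</audio>
      </prompt>
    </block>
  </form>
</vxml>
```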
As we explain in the next section, the <audio> tag
is analogous to the <IMG> image tag for graphics in
HTML. The "src" attribute
provides a URL to the WAV file
which should be played for this prompt. In this case, the "src" reference
is relative to the VXML document URL in which it appears.
The text within the audio tag is not required. We could have included no content, written either as an empty-element tag or as an opening and closing tag pair with nothing between them; the two forms are equivalent. The text included within the audio tag in the example above is something like the ALT text for images in HTML. If the VoiceXML platform is unable to open or play the "src" file in the audio tag, it falls back on generating TTS from the included text. However, if no text is specified and there is a problem playing the audio file, this will trigger a fatal error.
2. User interaction with touch tone input
Tags introduced: <goto> <menu> <choice> <form> <field> <filled> <if> <else> <elseif>
Tag attributes introduced: id for <form>; type for <field>
2.1 Menus
The simplest way to create an interactive VoiceXML application is by using the <menu> tag.
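A sketch of a touch tone menu is shown here. The dialog ids (mainmenu, sales, support, directory) and the stub prompt wording are assumptions chosen for illustration.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- The first form plays the greeting, then jumps to the menu explicitly -->
  <form>
    <block>
      <prompt>Welcome to Plum Voice.</prompt>
      <goto next="#mainmenu"/>
    </block>
  </form>

  <menu id="mainmenu">
    <prompt>
      For sales, press 1. For tech support, press 2.
      For company directory, press 3.
    </prompt>
    <choice dtmf="1" next="#sales">sales</choice>
    <choice dtmf="2" next="#support">tech support</choice>
    <choice dtmf="3" next="#directory">company directory</choice>
  </menu>

  <!-- Stub dialogs used as placeholders -->
  <form id="sales">
    <block><prompt>Please hold for sales.</prompt></block>
  </form>
  <form id="support">
    <block><prompt>Please hold for tech support.</prompt></block>
  </form>
  <form id="directory">
    <block><prompt>Please hold for the company directory.</prompt></block>
  </form>
</vxml>
```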
An interaction with this VoiceXML application might sound like this. Following the convention used in the VoiceXML specification, the VoiceXML computer system output is labeled "C:" and the human user input is labeled "H:".
C: Welcome to Plum Voice. For sales, press 1. For tech support, press 2. For company directory, press 3.
Each <choice> specifies a mapping of user input to an action, branching to a different dialog in the application. For now, we have simply added stub dialogs as placeholders in our application.
Note that in the first form, we must explicitly <goto> the main menu after the initial prompt. The VoiceXML interpreter begins the interaction at the first form or menu found in the document. It does not then automatically fall through and execute each form in sequence. Each form or menu after the first must be reached through an explicit transition such as <goto>.
2.2 Gathering user input using built-in grammars
The menu construct is useful when we simply need to use user input to control program branching. For more sophisticated control of user input, we can explicitly create input fields and specify allowable grammars for user input. In this example, we expand the support dialog to prompt the user for a customer identification number using <field>.
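A sketch of such a support dialog is shown here. The field names hasid and id and the dialog ids basicsupport and premiumsupport are assumptions, as is the prompt wording.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="support">
    <!-- boolean built-in grammar: 1 for yes, 2 for no -->
    <field name="hasid" type="boolean">
      <prompt>
        If you have a customer identification number, press 1.
        Otherwise, press 2.
      </prompt>
      <filled>
        <if cond="hasid">
          <goto nextitem="id"/>
        <else/>
          <goto next="#basicsupport"/>
        </if>
      </filled>
    </field>

    <!-- digits built-in grammar: any sequence of digits -->
    <field name="id" type="digits">
      <prompt>Enter your customer identification number.</prompt>
      <filled>
        <prompt>You entered <value expr="id"/>.</prompt>
        <goto next="#premiumsupport"/>
      </filled>
    </field>
  </form>

  <form id="basicsupport">
    <block><prompt>Please hold for support.</prompt></block>
  </form>
  <form id="premiumsupport">
    <block><prompt>Please hold for premium support.</prompt></block>
  </form>
</vxml>
```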
In this example, if the user has a customer identification
number,
they are eligible for premium support, so we send them to a different
part of the call script.
For each field, we specify a "name", which implicitly defines a VoiceXML variable, and a "type", which defines a built-in grammar for allowable inputs. For the type "boolean", the user can enter 1 for yes (true) or 2 for no (false). The boolean field variable is filled with the ECMAScript literal true or false. For the type "digits", the user can enter any sequence of digits. These inputs are available to our call script in any context which allows variables. Here we use the variable as a conditional in the <if> tag and we use the value within a prompt using the <value> tag. In Section 3, we'll explore the use of VoiceXML variables in more detail.
For fields within a form, in contrast to forms within a document, the VoiceXML interpreter walks through each field in sequence as it is filled unless explicitly instructed to do otherwise. The conditional <if> expression in the example above could be rewritten as:
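One possible reading is sketched here, assuming a boolean field named hasid followed by a digits field named id (both hypothetical names). Because the interpreter moves on to the next unfilled field by itself, only the branch that leaves the form needs to be written out.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="support">
    <field name="hasid" type="boolean">
      <prompt>If you have a customer identification number, press 1. Otherwise, press 2.</prompt>
      <filled>
        <!-- Only the "no" case needs a branch; "yes" falls through to the next field -->
        <if cond="!hasid">
          <goto next="#basicsupport"/>
        </if>
      </filled>
    </field>
    <field name="id" type="digits">
      <prompt>Enter your customer identification number.</prompt>
    </field>
  </form>
  <form id="basicsupport">
    <block><prompt>Please hold for support.</prompt></block>
  </form>
</vxml>
```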
If we know that the customer identification number must be seven digits, we can specify that length in the field tag.
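A sketch using the length parameter of the built-in digits grammar is shown here; the field name id is an assumption.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <!-- "digits?length=7" accepts exactly seven digits -->
    <field name="id" type="digits?length=7">
      <prompt>Enter your seven digit customer identification number.</prompt>
      <filled>
        <prompt>You entered <value expr="id"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```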
What happens if the user doesn't enter seven digits? VoiceXML has rules for signalling errors and reprompting the user for error conditions. The default behavior is that the system will say "Sorry, I didn't understand you" and reprompt the user. VoiceXML allows us to override this default behavior and add our own handlers for noinput (no input before the timeout period after a prompt expires) and nomatch (out of grammar input) events. We'll talk about how to do this in Section 4. We can also tune the responsiveness of the system by setting the timeout parameters globally or for each prompt.
2.3 DTMF grammars for numeric input
We can also specify allowable inputs for each form field explicitly using the <grammar> tag. We could rewrite the menu in Section 2.1 using grammars and fields.
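A sketch of the menu rewritten as a field with an inline DTMF grammar is shown here. The grammar type value "application/x-jsgf", the field name choice, and the dialog ids are assumptions; check the platform reference for the exact JSGF type string.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="mainmenu">
    <field name="choice">
      <!-- Inline JSGF grammar accepting the keys 1, 2, or 3 -->
      <grammar type="application/x-jsgf" mode="dtmf">( 1 | 2 | 3 )</grammar>
      <prompt>
        For sales, press 1. For tech support, press 2.
        For company directory, press 3.
      </prompt>
      <filled>
        <if cond="choice == '1'">
          <goto next="#sales"/>
        <elseif cond="choice == '2'"/>
          <goto next="#support"/>
        <else/>
          <goto next="#directory"/>
        </if>
      </filled>
    </field>
  </form>

  <form id="sales">
    <block><prompt>Please hold for sales.</prompt></block>
  </form>
  <form id="support">
    <block><prompt>Please hold for tech support.</prompt></block>
  </form>
  <form id="directory">
    <block><prompt>Please hold for the company directory.</prompt></block>
  </form>
</vxml>
```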
Here we specify a grammar for the field using JSGF (Java Speech Grammar Format) syntax, which is the default syntax for the Plum Voice Platform. For most simple examples, JSGF syntax is identical to the ABNF grammar format. The key difference between this and the menu implementation that we saw before is that here, we can include any executable code that we want within each clause of the <if> statement. We are not restricted to branching using <goto>. We'll look at what else we can do in Section 3.
The Plum Voice Platform also supports SRGS+XML grammars. For numeric input, JSGF is often a shorter alternative. For example, suppose that we wanted to listen for one of the following phrases: sales, support, accounts payable, operator or help. In JSGF, this would look something like:
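A sketch of such an inline JSGF grammar is shown here; the type value "application/x-jsgf" is an assumption about how the platform identifies JSGF.

```xml
<grammar type="application/x-jsgf" mode="voice">
  ( sales | support | accounts payable | operator | help )
</grammar>
```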
Rewriting the grammar in SRGS would look like:
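A sketch of the same phrase list as an inline SRGS XML grammar is shown here; the rule id department is a hypothetical name.

```xml
<grammar type="application/srgs+xml" mode="voice" version="1.0"
         xml:lang="en-US" root="department">
  <rule id="department" scope="public">
    <one-of>
      <item>sales</item>
      <item>support</item>
      <item>accounts payable</item>
      <item>operator</item>
      <item>help</item>
    </one-of>
  </rule>
</grammar>
```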
For more information on SRGS+XML grammars, refer to the W3C SRGS Grammar Specification and the Plum Voice Programmer's Reference Guide.
2.4 Controlling interruptions using bargein
If you run these examples, you will notice that you can begin entering input before the prompt has finished playing. In some cases you may want to disallow interruptions of prompts. This can be controlled using the bargein attribute of each prompt.
The default for prompts within fields is bargein="true".
The default for prompts within blocks is bargein="false".
We can override the default on a per prompt basis.
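A sketch of a per prompt override is shown here; the field name and prompt wording are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="id" type="digits">
      <!-- Override the field default so this prompt must play to the end -->
      <prompt bargein="false">
        Please listen carefully, then enter your customer identification number.
      </prompt>
    </field>
  </form>
</vxml>
```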
We can also override the behavior within a form or globally for a document using the <property> tag. We discuss this and other properties for tuning the behavior of the application in Section 6.
3. Acting on user input
Tags introduced: <var> <assign> <script>
3.1 VoiceXML variables
In Section 2.2, we saw how named fields implicitly create new VoiceXML variables. We can explicitly create new variables using the <var> tag, and assign values to them using <assign>.
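A sketch of explicit variable declaration and assignment is shown here. The variable names customerid and id follow the discussion in this section; the form ids are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- Document level variable, visible in every dialog of this document -->
  <var name="customerid"/>

  <form id="getid">
    <!-- The field variable "id" is scoped to this form -->
    <field name="id" type="digits">
      <prompt>Enter your customer identification number.</prompt>
      <filled>
        <!-- Copy the form scoped value into the document level variable -->
        <assign name="customerid" expr="id"/>
        <goto next="#confirm"/>
      </filled>
    </field>
  </form>

  <form id="confirm">
    <block>
      <prompt>Your customer number is <value expr="customerid"/>.</prompt>
    </block>
  </form>
</vxml>
```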
The scope of a VoiceXML variable is the tag in which it is declared. Each variable is available to the sub contexts within the context in which it is declared. The document level variable customerid can be used within the field of the form.
The scope of a field variable is the form in which it occurs. To use the value of the "id" variable outside this form, we must save the value to the document level variable customerid. In the event of name conflicts, the more local variable takes precedence.
To use variables that span VoiceXML documents, use application level variables by creating a root document. Refer to the VoiceXML 2.0 Specification for more information.
3.2 Client side scripting using ECMAScript
We can use ECMAScript (the standardized form of JavaScript) to implement dynamic behavior within the VoiceXML application. These scripts are executed within the Plum VoiceXML Interpreter as part of the Voice Platform described in Section 1. As with ECMAScript executed in web browsers for HTML, execution of ECMAScript in VoiceXML does not require a new request to the application server.
VoiceXML variables and ECMAScript variables are interchangeable. Each variable declared within <var> tags and each variable declared implicitly by a named field is available as an ECMAScript variable within the script tags. Conversely, ECMAScript variables declared within the script tag can be accessed by <if> or <value> tags in the VoiceXML.
<script> tags can be included at the top level of the VXML document or in any place where executable code is allowed: <block>, or event handlers such as <filled>, <noinput>, or <nomatch>.
We can use ECMAScript expressions within any "expr" attribute. In the following example, we construct a link dynamically based on the user's input using the ECMAScript "+" operator to concatenate strings.
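A sketch of a dynamically built link is shown here. The field name extension and the target naming scheme (ext plus the digits plus .vxml) are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="extension" type="digits">
      <prompt>Enter the extension you would like to reach.</prompt>
      <filled>
        <!-- For input 204 this evaluates to the URL "ext204.vxml" -->
        <goto expr="'ext' + extension + '.vxml'"/>
      </filled>
    </field>
  </form>
</vxml>
```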
In the next example, we create an ECMAScript function to help generate a time string to be used in TTS.
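A sketch of such a helper is shown here. The function name timeString is hypothetical, and the `interpret-as` attribute follows SSML 1.0; some platforms use a different attribute name or value set for <say-as>, so treat that line as an assumption.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <script>
    // Hypothetical helper: build an HH:MM string from hours and minutes.
    // "&lt;" replaces the less-than sign so the document remains valid XML.
    function timeString(h, m) {
      var hh = (h &lt; 10) ? "0" + h : "" + h;
      var mm = (m &lt; 10) ? "0" + m : "" + m;
      return hh + ":" + mm;
    }
  </script>

  <form>
    <block>
      <prompt>
        Our office opens at
        <say-as interpret-as="time"><value expr="timeString(9, 5)"/></say-as>
      </prompt>
    </block>
  </form>
</vxml>
```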
We use the speech markup tag <say-as> to direct the TTS engine to say the string, in HH:MM format, as a time. Also, notice that we have made the ECMAScript "xml safe" by escaping special characters.
3.3 Application server side dynamics using submit
Once input has been collected from a user, it is usually necessary to compare that input to the information in your database. If you are familiar with standard server side scripting, doing submits should seem very familiar. We first begin by requesting the user's identification number. Upon a successful entry by the user, we perform a standard HTTP POST to the processid.php script:
getid.vxml
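A sketch of what getid.vxml might contain is shown here. The field name id is an assumption; the server script would receive it as a POST parameter of the same name and is expected to reply with another VoiceXML document.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="getid">
    <field name="id" type="digits">
      <prompt>Enter your customer identification number.</prompt>
      <filled>
        <!-- POST the field value to the server side script -->
        <submit next="processid.php" method="post" namelist="id"/>
      </filled>
    </field>
  </form>
</vxml>
```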
processid.php
4. Input validation and error handling
Tags introduced: <noinput> <nomatch> <catch> <if> <else> <elseif>
Users often don't do what they're told. How do we build our application to catch and handle errors and exceptions?
4.1 Catching errors: catching noinput and nomatch events
If we create a field to collect user input, the VoiceXML Form Interpretation Algorithm already takes care of trapping and handling some exception conditions. In the next example, we want to collect a seven digit identification number.
Suppose the user of this system enters nothing. Then, after the timeout interval of the prompt has passed, the system responds with "Sorry, I didn't hear you," then reprompts the user with "Enter your customer identification number." The synthesized "Sorry, I didn't hear you" message is the default response of VoiceXML to no input. This default behavior can be overridden by adding event handlers to the field. The following mimics the default behavior of the system.
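A sketch of explicit handlers that reproduce the default messages is shown here; the field name id is an assumption.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="id" type="digits?length=7">
      <prompt>Enter your customer identification number.</prompt>
      <!-- No input before the timeout expired -->
      <noinput>
        <prompt>Sorry, I didn't hear you.</prompt>
        <reprompt/>
      </noinput>
      <!-- Input did not match the grammar (for example, too few digits) -->
      <nomatch>
        <prompt>Sorry, I didn't understand you.</prompt>
        <reprompt/>
      </nomatch>
    </field>
  </form>
</vxml>
```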
We could choose to offer a more helpful error message, to not play the original prompt again by omitting the <reprompt/> tag, to execute script code, or to abandon the effort altogether by using <goto>.
These tags are shorthand for the generic <catch> event handler. We could write one <catch> clause to handle both events.
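A sketch of a single handler covering both events is shown here; the event attribute of <catch> takes a space separated list of event names.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="id" type="digits?length=7">
      <prompt>Enter your customer identification number.</prompt>
      <catch event="noinput nomatch">
        <prompt>Sorry, I need a seven digit number.</prompt>
        <reprompt/>
      </catch>
    </field>
  </form>
</vxml>
```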
See the Plum Voice Programmer's Reference Guide for other standard VoiceXML events. Using <throw> we can also define and throw our own events and then use <catch> to handle them.
4.2 Validating input using conditionals
Suppose we have defined a checksum function isvalid_id() (defined as just a stub function in the example below). If the check fails, we can clear the filled variable and try again.
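A sketch of this validation loop is shown here. The body of isvalid_id() is a stub that accepts everything, and the field name id is an assumption.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <script>
    // Stub: a real implementation would verify a checksum digit.
    function isvalid_id(id) {
      return true;
    }
  </script>

  <form>
    <field name="id" type="digits?length=7">
      <prompt>Enter your customer identification number.</prompt>
      <filled>
        <if cond="isvalid_id(id)">
          <prompt>Thank you.</prompt>
        <else/>
          <prompt>That number is not valid.</prompt>
          <!-- Clearing the field makes the interpreter collect it again -->
          <clear namelist="id"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```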
4.3 Offering help using tiered prompts
Rather than simply repeating the same prompts to the user, we can offer increasingly detailed prompt messages by using the prompt counter, which is selected with the count attribute of <prompt>.
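A sketch of tiered prompts is shown here. The wording of the second and third prompts is invented for illustration; the default noinput and nomatch handlers are relied on to trigger each reprompt.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="id" type="digits?length=7">
      <!-- Played on the first attempt -->
      <prompt count="1">Enter your customer identification number.</prompt>
      <!-- Played on the second attempt -->
      <prompt count="2">
        Please enter your seven digit customer identification number
        using your telephone keypad.
      </prompt>
      <!-- Played on the third and later attempts -->
      <prompt count="3">
        Your customer identification number is the seven digit number
        printed on your invoice. Please enter it now.
      </prompt>
    </field>
  </form>
</vxml>
```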
The user interaction might sound like this:
C: Enter your customer identification number. <prompt counter = 1>
5. Responding to Speech Input
Up to this point, we've restricted our discussion to the use of touch tone (DTMF) input. One of the most compelling reasons to use VoiceXML is the ability to integrate advanced speech recognition technologies simply and portably.
5.1 Speech input using menus
To enable speech input, we can use the same menu as before in Section 2. We have modified the prompts to use <enumerate> to read the choices available to the user.
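A sketch of the menu with <enumerate> is shown here; the dialog ids are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <menu id="mainmenu">
    <prompt>
      Welcome to Plum Voice. Please choose a department:
      <!-- Reads out the text of each choice in document order -->
      <enumerate/>
    </prompt>
    <choice dtmf="1" next="#sales">sales</choice>
    <choice dtmf="2" next="#support">tech support</choice>
    <choice dtmf="3" next="#directory">company directory</choice>
  </menu>

  <form id="sales">
    <block><prompt>Please hold for sales.</prompt></block>
  </form>
  <form id="support">
    <block><prompt>Please hold for tech support.</prompt></block>
  </form>
  <form id="directory">
    <block><prompt>Please hold for the company directory.</prompt></block>
  </form>
</vxml>
```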
The user interaction might sound like this:
C: Welcome to Plum Voice. Please choose a department: sales, tech support, company directory.
Note that the menu choices really are unchanged from the example in Section 2. In other words, we could have spoken "sales" in response to that example as well or pressed 1 for this example, regardless of what we were instructed to do by the prompt. To restrict input to use only speech, we can set the inputmodes property using the <property> tag.
5.2 Speech Grammars
For common input types, we can use built-in grammars such as digits, numbers, and currency.
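A sketch of a field using the built-in digits grammar for spoken input is shown here; the field name id is an assumption.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <!-- The built-in grammar accepts spoken digits as well as key presses -->
    <field name="id" type="digits">
      <prompt>What's your ID number?</prompt>
      <filled>
        <prompt>You said <value expr="id"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```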
C: What's your ID number?
For other applications, we can define our own grammars using the <grammar> tag.
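A sketch of a field with a custom speech grammar is shown here. The grammar type value "application/x-jsgf", the field name department, and the phrase list are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="department">
      <grammar type="application/x-jsgf" mode="voice">
        ( sales | support | accounts payable | operator | help )
      </grammar>
      <prompt>What department are you trying to reach?</prompt>
      <filled>
        <prompt>Connecting you to <value expr="department"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```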
C: What department are you trying to reach?
5.3 Designing Speech Prompts
In general, prompts for speech applications should list the options available. In other words, the user should be prompted to say what they hear. In cases where it is cumbersome to list all possible options, the system can provide an example of a well formed input. For more information about designing speech applications and speech prompts, consult a reference on VUI (Voice User Interface) design. References are listed in Section 8.
5.4 Deciding when to use speech
For some applications, the use of speech recognition will not significantly improve the quality of the user experience. When the application only requires a few inputs with only a few possible choices, DTMF input may be as easy to use as a speech enabled application. DTMF applications can often be faster and easier for repeat callers to use if key-ahead of inputs is allowed. Also, if users of the system are expected to call from very noisy environments or from wireless telephones, it may be difficult to create a speech enabled system that works reliably.
Where speech recognition can be very useful is in instances where touch tone input is unwieldy or impossible. One example is an auto attendant application based on a company directory. Speech recognition can be used as an alternative to "dial by name" using the telephone keypad. Using VoiceXML, grammars for such applications can be generated dynamically from a company database.
6. Tuning Application Behavior
To tune the behavior of the application, you can use the <property> tag. The <property> element sets a property value. Properties are used to set values that affect platform behavior, such as the recognition process, timeouts, caching policy, etc.
Properties may be defined for the whole application, for the whole document at the <vxml> level, for a particular dialog at the <form> or <menu> level, or for a particular form item. Properties apply to their parent element and all the descendants of the parent. A property at a lower level overrides a property at a higher level. When different values for a property are specified at the same level, the last one in document order applies. Properties specified in the application root document provide default values for properties in every document in the application; properties specified in an individual document override property values specified in the application root document.
6.1 Using the Generic Speech Recognizer Properties
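A sketch of the "sensitivity" property set at the document level is shown here. The value 0.3 and the field are illustrative assumptions; the property takes a value between 0.0 (least sensitive) and 1.0 (most sensitive).

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- Lower sensitivity for callers in noisy surroundings -->
  <property name="sensitivity" value="0.3"/>
  <form>
    <field name="answer" type="boolean">
      <prompt>Would you like to hear our opening hours?</prompt>
    </field>
  </form>
</vxml>
```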
Setting the "sensitivity" level to a low setting helps in environments where there is a lot of background noise. However, with a low "sensitivity" setting, there is also a small risk of the application missing some of the incoming speech. In a quiet environment, you might want to raise the "sensitivity" level so that the application recognizes incoming speech better. For example, you would want to set the "sensitivity" to a high setting such as 0.8 for soft-spoken users of your application. Another property that can be used to tune the recognition of incoming speech is the "confidencelevel" property. This property adjusts the confidence needed for a recognition. For example:
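A sketch of a yes or no question with a raised confidence threshold is shown here; the field name and prompt wording are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <!-- Results scoring below 0.75 are treated as nomatch -->
    <property name="confidencelevel" value="0.75"/>
    <field name="answer" type="boolean">
      <prompt>Do you want to continue? Please say yes or no.</prompt>
      <filled>
        <prompt>You said <value expr="answer"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```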
In this example, the threshold of the confidence level is raised to 0.75, requiring a clear "yes" or "no" answer. Using a high confidence level setting is useful when you are expecting a precise match to your grammar. However, for grammars with many possible matches, such as a database of first and last names, you would want to lower the confidence level so that the user can be offered more candidate matches.
6.2 Using the Generic DTMF Recognizer Properties
The <property> tag also allows us to tune properties for recognizing DTMF. For example, we can use the "interdigittimeout" property to adjust the time allowed between key presses when the user enters numbers on a telephone keypad:
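A sketch of the "interdigittimeout" property is shown here; the field name is an assumption.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <!-- Allow up to 3 seconds between key presses -->
    <property name="interdigittimeout" value="3s"/>
    <field name="id" type="digits">
      <prompt>Enter your customer identification number.</prompt>
      <filled>
        <prompt>You entered <value expr="id"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```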
In this example, the user has 3 seconds between digits on the keypad once the first digit is entered. If nothing is entered at all, a "timeout" occurs, resulting in a <noinput> being generated. The "timeout" property is further explained below.
6.3 Using the Prompting and Collecting Properties
The <property> tag also allows us to tune properties for prompting and collecting. As shown earlier in Section 2.4 of the tutorial, the "bargein" property can be used to prevent users from interrupting speech. Here is an example that does not allow the user to interrupt the first prompt, but does allow the user to interrupt the second prompt:
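A sketch of the "bargein" property set at two levels is shown here. The grammar type value "application/x-jsgf", the field name, and the exact prompt wording are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <!-- Form level default: prompts cannot be interrupted -->
    <property name="bargein" value="false"/>
    <block>
      <prompt>You must listen to this message.</prompt>
    </block>
    <field name="numbers">
      <!-- Field level override: this prompt can be interrupted -->
      <property name="bargein" value="true"/>
      <grammar type="application/x-jsgf" mode="voice">( one | two )+</grammar>
      <prompt>Say any number of ones or twos.</prompt>
      <filled>
        <prompt>You said <value expr="numbers"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```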
So, a possible user interaction might be:
C: You must...
H: One two.
C: ...listen to this message.
C: Say any number of...
H: One two.
C: You said one two.
Since the original value of "bargein" is set to false, the user is not allowed to interrupt the first message. When "bargein" is set to true, the user is allowed to interrupt the message by saying 1 or 2. Another property that can be used to adjust prompting and collecting is the "timeout" property. This value can be adjusted to allow for more time if a user does not input anything. For example:
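A sketch of the "timeout" property is shown here. The grammar type value "application/x-jsgf" and the field name are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <!-- Wait up to 7 seconds for the caller to start speaking -->
    <property name="timeout" value="7s"/>
    <field name="numbers">
      <grammar type="application/x-jsgf" mode="voice">( one | two )+</grammar>
      <prompt>Say any number of ones or twos.</prompt>
      <filled>
        <prompt>You said <value expr="numbers"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```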
In this example, the user has 7 seconds to say a string of ones or twos. If the user does not say anything after 7 seconds, a noinput event is generated. The default "timeout" value is 3 seconds and the maximum value is 60 seconds. Another property that can be used to adjust prompting and collecting is the "termmaxdigits" property. When this property is set to false, the timeout after a user has begun to enter digits is set to the "interdigittimeout". For example:
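A sketch with "termmaxdigits" turned off is shown here. The grammar type value "application/x-jsgf" and the field name are assumptions; "termmaxdigits" is a platform property, so its exact behavior should be checked against the platform reference.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <property name="termmaxdigits" value="false"/>
    <property name="interdigittimeout" value="3s"/>
    <field name="numbers">
      <grammar type="application/x-jsgf" mode="dtmf">( 1 | 2 )+</grammar>
      <prompt>Enter any number of the digits one or two.</prompt>
      <filled>
        <prompt>You entered <value expr="numbers"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```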
A possible user interaction might be:
C: Enter any number of the digits one or two.
H: (enters DTMF-1 DTMF-2)
C: (waits 3 seconds because of interdigittimeout) You entered one two.
However, if we set "termmaxdigits" to true, recognition ends without waiting for a timeout as soon as the user has entered the maximum number of digits. For example:
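A sketch with "termmaxdigits" turned on is shown here. It assumes a built-in digits grammar fixed at five digits, which is one way to give the recognizer a maximum length; the field name is hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <property name="termmaxdigits" value="true"/>
    <property name="interdigittimeout" value="3s"/>
    <!-- Recognition ends as soon as the fifth digit arrives -->
    <field name="numbers" type="digits?length=5">
      <prompt>Enter up to five ones or twos.</prompt>
      <filled>
        <prompt>You entered <value expr="numbers"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```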
A possible user interaction might be:
C: Enter up to five ones or twos.
H: (enters DTMF-1 DTMF-2 DTMF-2 DTMF-1 DTMF-1)
C: You entered twelve thousand two hundred eleven.
Since the user's input of 5 digits matched the maximum length, the application responds to the user immediately. If the user entered fewer than 5 ones or twos, the application waits for the 3 seconds set by the "interdigittimeout" property and returns a <nomatch> once 3 seconds have passed with no further input. If the user entered more than 5 digits, the application returns just the first 5 digits that were entered.
6.4 Using the Fetching Properties
The <property> tag also allows us to tune fetching properties for audio, documents, grammars, and scripts. To adjust these properties for audio, you would use "audiofetchhint", "audiomaxage", and "audiomaxstale". For document reference tags such as <subdialog>, <goto>, <submit>, <link>, and <choice>, you would use "documentfetchhint", "documentmaxage", and "documentmaxstale". For grammars, you would use "grammarfetchhint", "grammarmaxage", and "grammarmaxstale". For scripts, you would use "scriptfetchhint", "scriptmaxage", and "scriptmaxstale". For example:
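A sketch of document caching properties is shown here. The file myfile.vxml is a placeholder; the values are plain numbers of seconds.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- Apply to every document fetch made from this document -->
  <property name="documentmaxage" value="150"/>
  <property name="documentmaxstale" value="25"/>
  <form>
    <block>
      <goto next="myfile.vxml"/>
    </block>
  </form>
</vxml>
```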
From this example, the "documentmaxage" value is set to 150 seconds and the "documentmaxstale" value is set to 25 seconds. This sets a global property that all document tags (<goto>, <submit>, etc.) have a maxage value of 150 seconds and a maxstale value of 25 seconds. So, since the file "myfile.vxml" is inside of a <goto> tag, it would have a maxage value of 150 seconds and a maxstale value of 25 seconds because of the "documentmaxage" and "documentmaxstale" properties. Also, the fetchtimeout property can be used to set the timeout for fetching a file from a web server. For example:
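A sketch of the "fetchtimeout" property is shown here; myfile.vxml is a placeholder.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- Give up on any fetch that takes longer than 20 seconds -->
  <property name="fetchtimeout" value="20s"/>
  <form>
    <block>
      <goto next="myfile.vxml"/>
    </block>
  </form>
</vxml>
```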
In this example, if the file "myfile.vxml" cannot be fetched within 20 seconds from the web server, then a timeout occurs and an error is thrown. The "fetchtimeout" property can also be set globally so that it applies to all files that are to be fetched. Another way to tune fetching properties is to use "fetchaudio", "fetchaudiodelay", and "fetchaudiominimum". These properties can be used to control the audio that is played for a user when the user is put on hold while a document is being fetched. For example:
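A sketch of the hold audio properties is shown here. The files holdmusic.wav and myfile.vxml are placeholders.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- Audio to play while a slow fetch is in progress -->
  <property name="fetchaudio" value="holdmusic.wav"/>
  <!-- Wait 2 seconds before starting the hold audio -->
  <property name="fetchaudiodelay" value="2s"/>
  <!-- Once started, play the hold audio for at least 5 seconds -->
  <property name="fetchaudiominimum" value="5s"/>
  <form>
    <block>
      <prompt>Please wait while we look up your account.</prompt>
      <goto next="myfile.vxml"/>
    </block>
  </form>
</vxml>
```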
In this example, the "fetchaudio" property sets holdmusic.wav to play whenever there is a delay in fetching a file. The "fetchaudiodelay" property causes a 2 second delay before the "fetchaudio" source is played. The "fetchaudiominimum" property causes the "fetchaudio" source to play for a minimum of 5 seconds, even if the fetched document arrives sooner.
6.5 Using the Miscellaneous Properties
The <property> tag allows us to tune miscellaneous properties, such as accepting only DTMF input instead of speech input. We use the "inputmodes" property to show this:
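A sketch of the "inputmodes" property is shown here; the field name is an assumption.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <!-- Accept key presses only; spoken input is ignored -->
    <property name="inputmodes" value="dtmf"/>
    <block>
      <prompt>You will only be able to enter digits.</prompt>
    </block>
    <field name="amount" type="number">
      <prompt>Enter a number on your keypad.</prompt>
      <filled>
        <prompt>You entered <value expr="amount"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```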
A possible user interaction might be:
C: You will only be able to enter digits.
H: Twelve.
C: (ignores spoken input) Enter a number on your keypad.
H: (enters DTMF-1 DTMF-2)
C: You entered twelve.
6.6 Using the RecordCall Properties
The <property> tag allows us to set up a call recording using the "recordcall" property. The type of this property is Boolean; when it is set to true, call recording is enabled within its scope. For example:
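A sketch of scoped call recording is shown here. The script name saverecording.php is hypothetical, and the multipart encoding on <submit> is an assumption about how the recording is uploaded; "recordcall" and "callrecording" are platform specific names taken from the surrounding text.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="form1">
    <block>
      <prompt>This part of the call is not recorded.</prompt>
      <goto next="#form2"/>
    </block>
  </form>

  <form id="form2">
    <!-- Recording is enabled only within this form -->
    <property name="recordcall" value="true"/>
    <block>
      <prompt>This part of the call is recorded.</prompt>
      <!-- saverecording.php is a hypothetical script name -->
      <submit next="saverecording.php" method="post"
              enctype="multipart/form-data" namelist="callrecording"/>
    </block>
  </form>
</vxml>
```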
In this example, the only part of the call that would be recorded is the prompt in form2, since the "recordcall" property is set within that scope. Also, note that the "callrecording" variable must be listed in the "namelist" attribute of <submit>, since it is predefined to work with "recordcall". Make sure that you refer to "callrecording" in your .php script. By default, if the "recordcall" property is used in 2 forms, the recorded parts in the first form would be overwritten by the recorded parts in the second form. To avoid this, we have a "recordcallappend" property that concatenates the recorded parts from each of the forms. This property is a global property, so it affects the entire VXML document. For example:
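A sketch of appended recordings is shown here. The script name saverecording.php and the multipart encoding are assumptions; "recordcallappend" is a platform specific property, so its accepted values should be checked against the platform reference.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- Document level: append each recorded segment instead of overwriting -->
  <property name="recordcallappend" value="true"/>

  <form id="form1">
    <property name="recordcall" value="true"/>
    <block>
      <prompt>This is the first recorded segment.</prompt>
      <goto next="#form2"/>
    </block>
  </form>

  <form id="form2">
    <property name="recordcall" value="true"/>
    <block>
      <prompt>This is the second recorded segment.</prompt>
      <!-- saverecording.php is a hypothetical script name -->
      <submit next="saverecording.php" method="post"
              enctype="multipart/form-data" namelist="callrecording"/>
    </block>
  </form>
</vxml>
```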
In this example, "recordcallappend" allows the recordings from form1 and form2 to be concatenated so that the recording from form1 is not overwritten by the recording from form2.
7. Auto Attendant Example
From this tutorial, you should now be able to build your own application. Let's try to build an automated attendant application. First, begin with Plum's standard template with the <vxml> tags and add in 2 global properties: a "sensitivity" property of 0.3 and a "recordcall" property for recording the entire call.
Next, set up a <form> block to be your introduction to the user. However, make your prompt such that the user cannot interrupt the introduction (hint: use "bargein"). Also, set up a <goto> tag to go to a <menu> block for the next section of the application.
Next, set up a <menu> block using the <choice> tag and allow the user to make a choice by either DTMF or speech input (hint: this is shown earlier in one of our examples).
Next, set up a <form> block for the first choice made by the user from your menu. Create your <form> block such that it allows the user to make a choice by using a speech <grammar> tag. Also, try using <if> tags (<if>, <elseif>, <else>) to acknowledge the choice made by the user.
Next, set up a <form> block for the second choice in your menu for the user. Again, you can use the <grammar> tag along with the <if>, <elseif>, and <else> tags for this <form> block.
Next, set up a <form> block for the third choice in your menu for the user. Try making this <form> block DTMF input only and have the user enter a multi-digit number that matches with the number of digits that you want the user to enter (hint: use "termmaxdigits"). Make sure you give the user an ample amount of time when entering these digits (hint: use "interdigittimeout"). Use the <nomatch> and <noinput> tags to help the user correctly enter the input.
Next, set up a <form> block for the last choice in your menu for the user. Here, try using the <transfer> tag to transfer the user to a telephone number. To use the <transfer> tag, you would type something similar to this:
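A sketch of a transfer inside a form is shown here. The telephone number is a fictional placeholder and the name xfer is hypothetical; bridge="true" keeps the platform on the call so the outcome can be inspected afterwards.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="operator">
    <!-- tel:+18005550100 is a placeholder number -->
    <transfer name="xfer" dest="tel:+18005550100" bridge="true">
      <prompt>Please hold while we transfer your call.</prompt>
      <filled>
        <if cond="xfer == 'busy' || xfer == 'noanswer'">
          <prompt>Sorry, no one is available to take your call.</prompt>
        </if>
      </filled>
    </transfer>
  </form>
</vxml>
```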
Now, try using it within your <form> block:
Once you complete this application, you have mastered many of the tags and techniques that are used within the Plum IVR platform. For further information, see the References section.
8. References
For more information about Plum VoiceXML products and services, contact us: web: www.plumgroup.com e-mail: sales@plumgroup.com phone: 800-995-PLUM |