VoiceXML Tutorial

Plum Voice Platform v. 2.6

© 2007 Plum Group, Inc. All rights reserved.

VoiceXML 2.0 is the World Wide Web Consortium (W3C) standard for scripting voice applications. In this tutorial, we construct a VoiceXML interactive voice response (IVR) application for a customer service center. Starting from a simple "Hello World" application, we build a telephony application which includes:
  • dynamic response driven by touch tone or speech input
  • advanced text-to-speech (TTS) speech synthesis and automatic speech recognition (ASR)
  • system integration with enterprise databases
Notes on standards, technology, or implementation are set aside from the primary discussion in yellow boxes.

VoiceXML 2.0
All of the code samples in this tutorial conform to the VoiceXML 2.0 Specification and have been tested to run on the Plum VoiceXML Voice Platform version 2.6.  They should also run on any VoiceXML 2.0 platform which supports either JSGF or ABNF grammars.

Outline

  1. Getting started: building a minimal VoiceXML application
    1. Hello World: a minimal VoiceXML script
  2. User interaction with touch tone input
    1. Menus 
    2. Gathering user input using builtin grammars
    3. DTMF grammars for numeric input
    4. Controlling interruptions using bargein
  3. Acting on user input
    1. VoiceXML variables
    2. Client side scripting using ECMAScript
    3. Application server side dynamics using <submit>
  4. Input validation and error handling
    1. Catching noinput and nomatch events
    2. Validating input using conditionals
    3. Offering help using tiered prompts
  5. Using speech input
    1. Speech input using menus
    2. Speech grammars
    3. Designing speech prompts
    4. Deciding when to use speech
  6. Tuning application behavior
    1. Using the generic speech recognizer properties
    2. Using the generic DTMF recognizer properties
    3. Using the prompting and collecting properties
    4. Using the fetching properties
    5. Using the miscellaneous properties
    6. Using the recordcall properties
  7. Auto Attendant Example
  8. References
 
1. Getting started: building a minimal VoiceXML script

Tags introduced:  <?xml?> <vxml> <form> <block> <prompt> <audio>


1.1 Hello World: A minimal VoiceXML script
We begin with nearly the simplest complete VoiceXML application. The application here is analogous to an answering machine set to play an announcement only.


<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <block>
      <prompt>
         Welcome to Plum Voice.
      </prompt>
    </block>
  </form>
</vxml>

In this example, the user would hear a synthesized voice say, "Welcome to Plum Voice." Then the system would simply hang up. The <form> defines the basic unit of interaction in VoiceXML. This form includes only a single <block> of executable content which in turn includes a single <prompt> to the user. By default, any plain text within a prompt is passed to the system's text-to-speech (TTS) synthesis engine to be generated as audio.

For static prompts such as this welcome message, we'll probably want to use a human announcer instead of TTS.  TTS has come a long way, but there's still no substitute for the real thing. For recorded prompts, we use the <audio> tag.

    <prompt>
       <audio src="wav/welcome.wav">
          Welcome to Plum Voice.
       </audio>
    </prompt>

As we explain in the next section, the <audio> tag is analogous to the <IMG> image tag for graphics in HTML. The "src" attribute provides a URL to the WAV file which should be played for this prompt. In this case, the "src" reference is relative to the VXML document URL in which it appears.  

Audio formats
WAV files are a generic container type.  WAV files include a header which indicates the actual audio sample size, encoding, and rate used. Supported formats vary by VoiceXML implementation and not all possible WAV file formats are supported.  The Plum Voice Platform supports 8 kHz audio files in 16 bit linear, 8 bit µ-law (u-law), or 8 bit A-law encoding in WAV files or headerless files.

The text within the audio tag is not required. We could have included no content:  
   <audio src="wav/welcome.wav"/>

which is equivalent to
   <audio src="wav/welcome.wav"></audio>

The text included within the audio tag in the example above is something like the ALT text for images in HTML. If the VoiceXML platform is unable to open or play the "src" file in the audio tag, it falls back on generating TTS from the included text.  However, if no text is specified and there is a problem playing the audio file, this will trigger a fatal error.

VoiceXML and XML
As the <?xml?> tag declares, every VoiceXML document is an XML document.  The basic structure of the VoiceXML should be familiar to anyone who has looked at HTML web documents.  Tags are set off by brackets <form>  and are closed with a forward slash </form>. VoiceXML documents must adhere strictly to the XML standard.   The document must begin with the <?xml?> tag.  Then the rest of the document is enclosed within the <vxml></vxml> tags.  Unlike HTML, all tags must be closed and certain special characters must be escaped with a safe alternative.  For example, the less than sign <, when it is not used to open a tag, must be escaped with a safe alternative (e.g. &lt;).
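
As a minimal sketch of the escaping rule, the form below writes an ampersand and a less than sign as entity references. The company name and the variable name attempts are made up for illustration.

```xml
<form>
  <var name="attempts" expr="1"/>
  <block>
    <!-- "&lt;" is read by the XML parser as a literal less than sign -->
    <if cond="attempts &lt; 3">
      <!-- "&amp;" is read as a literal ampersand -->
      <prompt>Thank you for calling Smith &amp; Jones.</prompt>
    </if>
  </block>
</form>
```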


2. User interaction with touch tone input

Tags introduced:  <goto> <menu> <choice> <form> <field> <filled> <if> <else> <elseif>
Tag attributes introduced: id for <form>; type for <field>

2.1 Menus
The simplest way to create an interactive VoiceXML application is by using the <menu> tag.


<?xml version="1.0"?>
<vxml version="2.0">

  <form>
    <block>
      <prompt>
         Welcome to Plum Voice.
      </prompt>
      <goto next="#mainmenu"/>
    </block>
  </form>

  <menu id="mainmenu">
    <prompt>
      For sales, press 1.
      For tech support, press 2.
      For company directory, press 3.
    </prompt>
    <choice dtmf="1" next="#sales">
       Sales</choice>
    <choice dtmf="2" next="#support">
       Tech Support</choice>
    <choice dtmf="3" next="#directory">
       Company Directory</choice>
 
  </menu>

  <form id="sales">
    <block>
      Please hold for the next available sales
      representative.
      <!-- transfer to sales -->
    </block>

  </form>

  <form id="support">
    <block>
      <!-- transfer to tech support -->
    </block>
 
  </form>


  <form id="directory">
    <block>
      <!-- transfer to company directory -->
    </block>
  </form>


</vxml>

An interaction with this VoiceXML application might sound like this. Following the convention used in the VoiceXML specification, the VoiceXML computer system output is labeled "C:" and the human user input is labeled "H:".
C: Welcome to Plum Voice.  For sales, press 1. For tech support, press 2. For company directory, press 3.

H: <presses 1>

C: Please hold for the next available sales representative.

Each <choice> specifies a mapping of user input to an action, branching to a different dialog in the application.  For now, we have simply added stub dialogs as placeholders in our application.

Note that in the first form, we must explicitly <goto> the main menu after the initial prompt. The VoiceXML interpreter begins the interaction at the first form or menu found in the document. It does not then automatically fall through and execute each form in sequence.  Each form or menu after the first must be explicitly specified.

VoiceXML and HTML
Many of the elements here will be familiar to authors of HTML. We've labeled each of the <form> and <menu> dialogs with an id attribute and use the <goto> tag and the <choice> next attribute to move between dialogs. As in HTML, prepending the pound sign # to a reference as in <goto next="#sales"/> marks the reference as internal to this document. Comments can be included using the syntax  <!-- comment -->. Unlike in HTML, we must use a forward slash to close the <goto> tag in order to maintain XML compliance.

2.2 Gathering user input using builtin grammars
The menu construct is useful when we simply need to use user input to control program branching. For more sophisticated control of user input, we can explicitly create input fields and specify allowable grammars for user input.

In this example, we expand the support dialog to prompt the user for a customer identification number using <field>.  Users who have a customer identification number are eligible for premium support, so we send them to a different part of the call script.

  <form id="support">
    <field name="hasid" type="boolean">
      <prompt>
        If you know your customer identification number,
        press 1.  Otherwise, press 2.
      </prompt>
     
      <filled>
        <if cond="hasid==false">
          <goto next="#unknowncustomer"/>
        </if>
      </filled>

    </field>

    <field name="id" type="digits">
      <prompt>
        Enter your customer identification number.
      </prompt>
      <filled>
        You entered <value expr="id"/>.
        <!-- transfer to premium support -->
      </filled>
    </field>
  </form>


  <form id="unknowncustomer">
    <block>
      Please hold for the next available representative.
      <!-- transfer to general support -->
    </block>
  </form>



For each field, we specify a "name", which implicitly defines a VoiceXML variable, and a type, which defines a built-in grammar for allowable inputs.

For the type "boolean", the user can enter 1 for true or 2 for false. The boolean field variable is filled with the ECMAScript literals true or false. For the type "digits", the user can enter any sequence of digits.  These inputs are available to our call script in any context which allows variables. Here we use the variable as a conditional in the <if> tag and we use the value within a prompt using the <value> tag. In Section 3, we'll explore the use of VoiceXML variables in more detail.

For fields within a form, in contrast to forms within a document, the VoiceXML interpreter walks through each field in sequence as it is filled unless explicitly instructed to do otherwise.  The conditional <if> expression in the example above could be rewritten as:

      <if cond="hasid==false">
        <goto next="#unknowncustomer"/>
      <else/>
        <goto nextitem="id"/>
      </if>


FIA: Form Interpretation Algorithm
At the heart of the VoiceXML interpreter is the Form Interpretation Algorithm, which is the specified procedure for walking through the fields of a form.  In addition to simply prompting for each field in turn, the Form Interpretation Algorithm specifies what will happen in the case of no user input or input which does not match any of the allowable grammars.  We look at this in more detail in Section 4.


If we know that the customer identification number must be seven digits, we can specify that length in the field tag.

    <field name="id" type="digits?length=7">
      <prompt>
        Enter your seven digit customer
        identification number.
      </prompt>
      <filled>
        You entered <value expr="id"/>.
        <!-- transfer to premium support -->
      </filled>
    </field>

What happens if the user doesn't enter seven digits? VoiceXML has rules for signalling errors and reprompting the user for error conditions.  The default behavior is that the system will say "Sorry, I didn't understand you" and reprompt the user.  VoiceXML allows us to override this default behavior and add our own handlers for noinput (no input before timeout of the waiting period after a prompt) and nomatch (out of grammar input) events. We'll talk about how to do this in Section 4. We can also tune the responsiveness of the system by setting the timeout parameters globally or for each prompt.
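
As a rough sketch of per-prompt tuning, the timeout attribute on a prompt sets how long the system waits for input after that prompt before a noinput event occurs. The ten second value is an arbitrary choice for illustration.

```xml
<field name="id" type="digits?length=7">
  <!-- wait up to 10 seconds after this prompt before throwing noinput -->
  <prompt timeout="10s">
    Enter your seven digit customer
    identification number.
  </prompt>
  <filled>
    You entered <value expr="id"/>.
  </filled>
</field>
```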

Built-in grammars
Built-in grammar types and their return values vary from one VoiceXML implementation to another. These built-in grammars may support both DTMF and speech input.  For more information, refer to the Plum Voice Programmer's Reference Guide.

2.3 DTMF grammars for numeric input
We can also specify allowable inputs for each form field explicitly using the <grammar> tag.  We could rewrite the menu in Section 2.1 using grammars and fields.

<form id="mainmenu">
  <field name="menuchoice">
    <grammar type="application/x-jsgf">1|2|3</grammar>
      <prompt>
        For sales, press 1.
        For tech support, press 2.
        For company directory, press 3.
      </prompt>
      <filled>
        <if cond="menuchoice==1">
          <goto next="#sales"/>
        <elseif cond="menuchoice==2"/>
          <goto next="#support"/>
        <elseif cond="menuchoice==3"/>
          <goto next="#directory"/>
        </if>
      </filled>
  </field>
</form>


Here we specify a grammar for the field using JSGF (Java Speech Grammar Format) grammar syntax which is the default syntax for the Plum Voice Platform.  For most simple examples, JSGF syntax is identical to ABNF grammar format.

The key difference between this and the menu implementation that we saw before was that here, we can include any execution code that we want within each clause of the <if> statement.  We are not restricted to branching using <goto>. We'll look at what else we can do in Section 3.

JSGF
Here are some more examples of JSGF syntax:
  • (1|2)[4|5|6]   1 or 2, followed optionally by 4, 5, or 6.  Example matches include 14, 2, 26.
  • 1(12|34)*  1 followed by either 12 or 34 repeated 0 or more times.  Example matches include 1121234, 13412, 13434.
  • 1+2  one or more 1's followed by 2.  Example matches include 12, 111112.
  • 1{one}|2{two}  Matches 1 or 2.  The curly braces {} are tags.  If the user enters 1, the string "one" is bound to the field variable.  Note that whitespace within the braces is significant; adding spaces around the tag token { one } will return " one " with surrounding spaces.

The Plum Voice Platform also supports SRGS+XML grammars.  JSGF is often the shorter alternative.  For example, suppose that we wanted to listen for one of the following phrases: sales, support, accounts payable, operator or help.  In JSGF, this would look something like:

  <grammar type="application/x-jsgf" mode="voice">
    (sales|support|accounts payable|operator|help)
  </grammar>

Rewriting the grammar in SRGS would look like:

  <grammar type="application/srgs+xml" root="ROOT" mode="voice">
    <rule id="ROOT">
      <one-of>
        <item>sales</item>
        <item>support</item>
        <item>accounts payable</item>
        <item>operator</item>
        <item>help</item>
      </one-of>
    </rule>
  </grammar>


For more information on SRGS+XML grammars, refer to the W3C SRGS Grammar Specification and the Plum Voice Programmer's Reference Guide.

<dtmf> vs. <grammar>
In VoiceXML 2.0, the <dtmf> tag which was part of VoiceXML 1.0 was eliminated.  The <grammar> tag is now used for both DTMF and speech input grammars.  To specify DTMF input digits, use the Arabic numerals 1, 2, 3, etc.  To specify spoken digits, spell out the numbers one, two, three, etc.
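
A field can carry one grammar per input mode. The sketch below assumes the JSGF type string and tag syntax used earlier in this section. It accepts either a key press or a spoken department name and binds the same value to the field in both cases. The field name dept is made up for the example.

```xml
<field name="dept">
  <!-- DTMF: Arabic digits; the tags map each key to a department name -->
  <grammar type="application/x-jsgf" mode="dtmf">
    1{sales}|2{support}
  </grammar>
  <!-- Speech: spelled-out words -->
  <grammar type="application/x-jsgf" mode="voice">
    sales|support
  </grammar>
  <prompt>
    For sales, press 1 or say sales.
    For tech support, press 2 or say support.
  </prompt>
  <filled>
    You chose <value expr="dept"/>.
  </filled>
</field>
```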

2.4 Controlling interruptions using bargein
If you run these examples, you will notice that you can begin entering input before the prompt has finished playing.  In some cases you may want to disallow interruptions of prompts. This can be controlled using the bargein attribute of each prompt.  The default for prompts within fields is bargein="true". The default for prompts within blocks is bargein="false".

We can override the default on a per prompt basis.

      <prompt bargein="false">
        This is a very important message which you
        must hear.  Now enter your identification
        number.
      </prompt>

We can also override the behavior within a form or globally for a document using the <property> tag.  We discuss this and other properties for tuning the behavior of the application in Section 6.
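
As a brief preview, assuming the standard VoiceXML 2.0 bargein property, a form-level setting might look like the sketch below. The form id notice is made up. Prompts in the form play to completion unless a prompt sets its own bargein attribute.

```xml
<form id="notice">
  <!-- applies to every prompt in this form -->
  <property name="bargein" value="false"/>
  <field name="id" type="digits">
    <prompt>
      This is a very important message which you
      must hear.  Now enter your identification
      number.
    </prompt>
    <!-- the attribute on a prompt overrides the property -->
    <prompt count="2" bargein="true">
      Enter your identification number.
    </prompt>
  </field>
</form>
```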


3. Acting on user input

Tags introduced: <var> <assign> <script>

3.1 VoiceXML variables
In Section 2.2, we saw how named fields implicitly create new VoiceXML variables.  We can explicitly create new variables using the <var> tag, and assign values to them using <assign>.

  <var name="customerid" expr="0"/>

  <form id="support">
    <field name="hasid" type="boolean">
      <prompt>
        If you know your customer identification number,
        press 1.  Otherwise, press 2.
      </prompt>
      <filled>
        <if cond="hasid==false">
          <goto next="#unknowncustomer"/>
        </if>
      </filled>
    </field>
    <field name="id" type="digits">
      <prompt>
        Enter your customer identification number.
      </prompt>
      <filled>
        <assign name="customerid" expr="id"/>
        You entered <value expr="id"/>.
        <!-- transfer to premium support -->
      </filled>
    </field>
  </form>



The scope of a VoiceXML variable is the tag in which it is declared. Each variable is available to the sub contexts within the context in which it is declared. The document level variable customerid can be used within the field of the form.

The scope of a field variable is the form in which it occurs. To use the value of the "id" variable outside this form, we must save the value to the document level variable customerid.  In the event of name conflicts, the more local variable takes precedence.

To use variables that span VoiceXML documents, use application level variables by creating a root document.  Refer to the VoiceXML 2.0 Specification for more information.
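
A minimal sketch of the root document approach follows. The file names root.vxml and support.vxml are hypothetical. The leaf document names its root in the application attribute of <vxml>, and variables declared in the root are then reachable through the application scope.

root.vxml
```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- declared once, visible to every document that names this root -->
  <var name="customerid" expr="0"/>
</vxml>
```

support.vxml
```xml
<?xml version="1.0"?>
<vxml version="2.0" application="root.vxml">
  <form id="support">
    <field name="id" type="digits">
      <prompt>
        Enter your customer identification number.
      </prompt>
      <filled>
        <!-- store the value in the application scope variable -->
        <assign name="application.customerid" expr="id"/>
        You entered <value expr="application.customerid"/>.
      </filled>
    </field>
  </form>
</vxml>
```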

3.2 Client side scripting using ECMAScript
We can use ECMAScript (the standardized form of JavaScript) to implement dynamic behavior within the VoiceXML application.  These scripts are executed within the Plum VoiceXML Interpreter as part of the Voice Platform described in Section 1.  As with ECMAScript executed in web browsers for HTML, execution of ECMAScript in VoiceXML does not require a new request to the application server.

VoiceXML variables and ECMAScript variables are interchangeable.  Each variable declared within <var> tags and each variable declared implicitly by a named field are available as ECMAScript variables within the script tags.  Conversely, ECMAScript variables declared within the script tag can be accessed by <if> or <value> tags in the VoiceXML.

<var name="firstname" expr="'John'"/>
<script>
  var lastname = 'Doe';
  var nickname = firstname;
  var fullname;
</script>
<assign name="fullname" expr="firstname +' '+ lastname"/>


<script> tags can be included at the top level of the VXML document or in any place where executable code is allowed:  <block>, or event handlers such as <filled>, <noinput>, or <nomatch>.

We can use ECMAScript expressions within any "expr" attribute.  In the following example, we construct a link dynamically based on the user's input using the ECMAScript "+" operator to concatenate strings.

<form>
  <field name="formid">
  <grammar>1|2</grammar>
     <prompt>
        Enter the form ID.
     </prompt>
     <filled>
       <goto expr="'#form'+formid"/>
     </filled>
  </field>
</form>

<form id="form1">
  <block>
     This is form 1.
  </block>
</form>

<form id="form2">
  <block>
     This is form 2.
  </block>
</form>


In the next example, we create an ECMAScript function to help generate a time string to be used in TTS.
<script>
  function get12hourtime(){

     var now = new Date();
     var hours = now.getHours();
     var minutes = now.getMinutes();

     // speak time in 12 hour format

     var ampm = (hours &gt;= 12) ? 'PM' : 'AM';
     var hours12 = (hours &gt; 12) ?
          hours - 12 : hours;
     if (hours == 0) hours12 = 12;

     minutes = (minutes &lt; 10) ?
          '0' + minutes : minutes;

     var timestr = hours12 + ':' + minutes + ampm;
     return timestr;
  }
</script>

<form>
  <field name="id" type="digits">
    <prompt>
       Please enter your id number.
    </prompt>
    <filled>
       <prompt>
          logged in at
          <say-as type="time:hm">
             <value expr="get12hourtime()"/>
          </say-as>
       </prompt>
    </filled>
  </field>
</form>

We use the speech markup tag <say-as> to direct the TTS engine to say the string, in HH:MM format, as a time.  Also, notice that we have made the ECMAScript "xml safe" by escaping special characters.

XML safe scripting
All content within a VoiceXML document must be valid XML. ECMAScript code within a script tag is no exception.  A few common characters often used as comparison operators in ECMAScript must be escaped.
ECMAScript operator    XML safe escape sequence    Meaning
<                      &lt;                        less than
>                      &gt;                        greater than
&                      &amp;                       binary AND
&&                     &amp;&amp;                  boolean AND
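
An alternative that standard XML provides is to wrap the script body in a CDATA section, inside which these characters need no escaping. The sketch below uses a made-up helper function, in_range(), to show the idea.

```xml
<script>
<![CDATA[
  // Inside CDATA, operators can be written as in ordinary ECMAScript.
  // in_range() is a hypothetical helper, not part of any platform API.
  function in_range(n) {
    return (n > 0 && n < 10);
  }
]]>
</script>
```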


3.3 Application server side dynamics using submit
Once input has been collected from a user, it is usually necessary to check that input against the information in your database. If you are familiar with standard server side scripting, submitting data from VoiceXML should feel familiar. We begin by requesting the user's identification number. When the user completes the entry, we perform a standard HTTP POST to the processid.php script:

getid.vxml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="id" type="digits">
      <prompt>
        Please enter your id number.
      </prompt>

      <filled>
        <!-- Here we perform the HTTP POST -->
        <submit next="processid.php" namelist="id" method="post"/>
      </filled>
    </field>
  </form>

</vxml>


processid.php
<?php
include_once("lib/dbaccess.php");

echo("<?xml version=\"1.0\"?>\n");
echo("<vxml version=\"2.0\">\n");
?>

  <form>
    <block>
<?php
// Here we call the check_id() function from dbaccess.php
if (isset($_POST["id"]) && check_id($_POST["id"])) {
  echo("<prompt>Your user id number is valid.\n");
  echo("Starting session.</prompt>\n");
  echo("<goto next=\"begin_session.vxml\"/>\n");
} else {
  echo("<prompt>Your user id is not valid.</prompt>\n");
  echo("<goto next=\"getid.vxml\"/>\n");
}
?>

    </block>
  </form>
</vxml>



4. Input validation and error handling

Tags introduced: <noinput> <nomatch> <catch> <if> <else> <elseif>
Users often don't do what they're told.  How do we build our application to catch and handle errors and exceptions?

4.1 Catching errors: catching noinput and nomatch events.
If we create a field to collect user input, the VoiceXML Form Interpretation Algorithm already takes care of trapping and handling some exception conditions.  In the next example, we want to collect a seven digit identification number.

    <field name="id" type="digits?length=7">
      <prompt>
        Enter your customer identification number.
      </prompt>
      <filled>
        <assign name="customerid" expr="id"/>
        You entered <value expr="id"/>.
        <!-- transfer to premium support -->
      </filled>
    </field>

Suppose the user of this system enters nothing.  Then, after the timeout interval of the prompt has passed, the system responds with "Sorry, I didn't hear you," then reprompts the user with "Enter your customer identification number."

The synthesized "Sorry, I didn't hear you" message is the default response of VoiceXML to no input. This default behavior can be overridden by adding event handlers to the field.  The following mimics the default behavior of the system.

    <field name="id" type="digits?length=7">
      <prompt>
        Enter your customer identification number.
      </prompt>
      <filled>
        <assign name="customerid" expr="id"/>
        You entered <value expr="id"/>.
        <!-- transfer to premium support -->
      </filled>
      <noinput>
        Sorry, I didn't hear you.
        <reprompt/>
      </noinput>

      <nomatch>
        Sorry, I didn't understand.
        <reprompt/>
      </nomatch>

    </field>

We could choose to offer a more helpful error message, to not play the original prompt again by omitting the <reprompt/> tag, to execute script code, or to abandon the effort altogether by using <goto>.

      <noinput>
        Your identification number is the seven
        digit number on the front of your membership
        card.
        <reprompt/>
      </noinput>

      <nomatch>
        <assign name="badid" expr="id"/>
        Your identification number must be seven digits.
        Try again.
      </nomatch>


These tags are shorthand for the generic <catch> event handler.  We could write one <catch> clause to handle both events.

      <catch event="nomatch noinput">
        Try again.
        <reprompt/>
      </catch>


See the Plum Voice Programmer's Reference Guide for other standard VoiceXML events.  Using <throw> we can also define and throw our own events and then use catch to handle them.
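
As a sketch of a user-defined event, the field below throws a made-up event, com.example.badid, when the caller enters a placeholder number, and handles it with a <catch> in the same field. The event name and the all-zeros test are assumptions chosen for illustration.

```xml
<field name="id" type="digits?length=7">
  <prompt>
    Enter your customer identification number.
  </prompt>
  <filled>
    <!-- treat an all-zeros entry as a placeholder, not a real id -->
    <if cond="id == '0000000'">
      <throw event="com.example.badid"/>
    </if>
  </filled>
  <catch event="com.example.badid">
    That number is not in use.
    <!-- clear the field so that it is collected again -->
    <clear namelist="id"/>
    <reprompt/>
  </catch>
</field>
```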

4.2 Validating input using conditionals
Suppose we have defined a checksum function isvalid_id() (defined as just a stub function in the example below).  If the check fails, we can clear the field variable and prompt the user to try again.

      <filled>
        <script>
           function isvalid_id(id) {
             // check digits
             // for now, simply do a redundant check
             // on length
             return (id.length == 7);
           }
        </script>
        <if cond="isvalid_id(id)">
           <assign name="customerid" expr="id"/>
           You entered <value expr="id"/>.
           <!-- transfer to premium support -->
        <else/>
           Invalid ID number.  Please check the number
           and try again.
           <clear namelist="id"/>
           <reprompt/>
        </if>
      </filled>

4.3 Offering help using tiered prompts
Rather than simply repeating the same prompts to the user, we can offer increasingly detailed prompt messages by using the prompt counter.

    <field name="id" type="digits?length=7">
      <prompt count="1">
        Enter your customer identification number.
      </prompt>
      <prompt count="2">
        Enter your seven digit customer identification
        number.
      </prompt>
      <prompt count="3">
        Your customer identification number can be
        found on the front of your membership card.
        Enter your seven digit customer identification
        number.
      </prompt>
      <filled>
        <assign name="customerid" expr="id"/>
        You entered <value expr="id"/>.
        <!-- transfer to premium support -->
      </filled>
      <nomatch>
         You entered an invalid number.
         <reprompt/>
      </nomatch>
    </field>

The user interaction might sound like this:
C: Enter your customer identification number. <prompt counter = 1>

H: <enters 1 2 3>

C: You entered an invalid number. Enter your seven digit customer identification number. <prompt counter = 2>

H: <silence: no input>

C: Sorry, I didn't hear you. Your customer identification number can be found on the front of your membership card. Enter your seven digit customer identification number. <prompt counter = 3>

H: <enters 1 2 3 4 5 6>

C: You entered an invalid number. Your customer identification number can be found on the front of your membership card. Enter your seven digit customer identification number. <prompt counter = 4>
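
Event handlers accept a count attribute as well, so error messages can escalate in the same way as prompts. In the sketch below, the third silence sends the caller to a hypothetical #operator dialog. That dialog id is an assumption and is not defined in this tutorial.

```xml
<field name="id" type="digits?length=7">
  <prompt>
    Enter your customer identification number.
  </prompt>
  <noinput count="1">
    Sorry, I didn't hear you.
    <reprompt/>
  </noinput>
  <noinput count="2">
    Your identification number is the seven
    digit number on the front of your membership
    card.
    <reprompt/>
  </noinput>
  <noinput count="3">
    <!-- give up after the third silence -->
    <goto next="#operator"/>
  </noinput>
</field>
```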
 

5. Responding to Speech Input

Up to this point, we've restricted our discussion to the use of touch tone (DTMF) input.  One of the most compelling reasons to use VoiceXML is the ability to integrate advanced speech recognition technologies simply and portably.

5.1 Speech input using menus
To enable speech input, we can use the same menu as before in Section 2. We have modified the prompts to use <enumerate> to read the choices available to the user.

<?xml version="1.0"?>
<vxml version="2.0">

  <form>
    <block>
      <prompt>
         Welcome to Plum Voice.
      </prompt>
      <goto next="#mainmenu"/>
    </block>
  </form>

  <menu id="mainmenu">
    <prompt>
      Please choose a department:
      <enumerate/>
    </prompt>
    <choice dtmf="1" next="#sales">
       Sales</choice>
    <choice dtmf="2" next="#support">
       Tech Support</choice>
    <choice dtmf="3" next="#directory">
       Company Directory</choice>
 
  </menu>

  <form id="sales">
    <block>
      Please hold for the next available sales
      representative.
      <!-- transfer to sales -->
    </block>

  </form>

  <form id="support">
    <block>
      <!-- transfer to tech support -->
    </block>
 
  </form>


  <form id="directory">
    <block>
      <!-- transfer to company directory -->
    </block>
  </form>


</vxml>

The user interaction might sound like this:
C: Welcome to Plum Voice.  Please choose a department: sales, tech support, company directory.

H: sales

C: Please hold for the next available sales representative.

Note that the menu choices are unchanged from the example in Section 2.  In other words, we could have spoken "sales" in response to that example as well, or pressed 1 in this one, regardless of what the prompt instructed us to do.

To restrict input to use only speech, we can set the inputmodes property using the property tag.  
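
A minimal sketch, assuming the standard inputmodes property with the value voice, applied to the menu above:

```xml
<menu id="mainmenu">
  <!-- accept speech only; DTMF input is disabled for this menu -->
  <property name="inputmodes" value="voice"/>
  <prompt>
    Please choose a department:
    <enumerate/>
  </prompt>
  <choice next="#sales">Sales</choice>
  <choice next="#support">Tech Support</choice>
  <choice next="#directory">Company Directory</choice>
</menu>
```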

5.2 Speech Grammars
For common input types, we can use built-in grammars such as digits, number, and currency.

  <field name="id" type="digits">
    <prompt> What's your ID number? </prompt>
    <filled>
       I heard <value expr="id"/>.
    </filled>
  </field>

C: What's your ID number?

H: one two three four <spoken>

C: I heard one two three four.

For other applications, we can define our own grammars using the <grammar> tag.

  <field name="target">
    <grammar type="application/x-jsgf">
        sales |
        ([company] directory){directory} |
        ([tech|technical] support){support} |
        (H R | human resources | recruiting) {hr} |
        operator
    </grammar>
    <prompt>
      What department are
      you trying to reach?
    </prompt>
    <filled>
       Transferring to <value expr="target"/>.
       <if cond="target=='support'">
          <goto next="#support"/>
       <else/>
          <!-- etc. -->
       </if>
    </filled>
  </field>

C: What department are you trying to reach?

H: tech support <spoken>

C: Transferring to support.

5.3 Designing Speech Prompts
In general, prompts for speech applications should list the options available. In other words, the user should be prompted to say what they hear. In cases where it is cumbersome to list all possible options, the system can provide an example of a well formed input.

For more information about designing speech applications and speech prompts, consult a reference on VUI (Voice User Interface) design. References are listed in Section 8.

5.4 Deciding when to use speech
For some applications, the use of speech recognition will not significantly improve the quality of the user experience. When the application only requires a few inputs with only a few possible choices, DTMF input may be as easy to use as speech. DTMF applications can often be faster and easier for repeat callers if keying ahead of the prompts is allowed. Also, if users of the system are expected to call from very noisy environments or from wireless telephones, it may be difficult to create a speech enabled system that works reliably.

Where speech recognition can be very useful is in instances where touch tone input is unwieldy or impossible.  One example is an auto attendant application based on a company directory.  Speech recognition can be used as an alternative to "dial by name" using the telephone keypad.  Using VoiceXML, grammars for such applications can be generated dynamically from a company database.

6. Tuning Application Behavior

To tune the behavior of the application, you can use the <property> tag. The <property> element sets a property value. Properties are used to set values that affect platform behavior, such as the recognition process, timeouts, caching policy, etc.

Properties may be defined for the whole application, for the whole document at the <vxml> level, for a particular dialog at the <form> or <menu> level, or for a particular form item. Properties apply to their parent element and all the descendants of the parent. A property at a lower level overrides a property at a higher level. When different values for a property are specified at the same level, the last one in document order applies. Properties specified in the application root document provide default values for properties in every document in the application; properties specified in an individual document override property values specified in the application root document.
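
The override rule can be sketched with the standard timeout property set at three levels. The values, the form id login, and the field names are arbitrary; what matters is which setting each field ends up with.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- document level: default for every dialog below -->
  <property name="timeout" value="5s"/>

  <form id="login">
    <!-- dialog level: overrides the document value inside this form -->
    <property name="timeout" value="8s"/>

    <field name="id" type="digits">
      <!-- form item level: overrides both values for this field only -->
      <property name="timeout" value="15s"/>
      <prompt>Enter your customer identification number.</prompt>
    </field>

    <field name="pin" type="digits">
      <!-- no property here, so the form's 8s applies -->
      <prompt>Enter your PIN.</prompt>
    </field>
  </form>
</vxml>
```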

6.1 Using the Generic Speech Recognizer Properties
The <property> tag allows us to tune properties for recognizing incoming speech.

For example, we can use the "sensitivity" property to reduce interference from any background noise during a call:

<?xml version="1.0"?>
<vxml version="2.0">

<property name="sensitivity" value="0.3"/>

<form>
  <field type="boolean">
    <prompt>
      Please say yes or no.
    </prompt>
  </field> 
</form>

</vxml>

Setting the "sensitivity" level low helps in environments where there is a lot of background noise. However, a low "sensitivity" setting also carries a small risk that the application will miss some of the incoming speech.

In a quiet environment, you might want to raise the "sensitivity" level so that the application picks up incoming speech more readily. For example, if you expect soft-spoken callers, you could set "sensitivity" to a high value such as 0.8.

Another property that can be used to tune properties for recognizing incoming speech is the "confidencelevel" property. This property adjusts the confidence needed for a recognition. For example:

<?xml version="1.0"?>
<vxml version="2.0">

<property name="confidencelevel" value="0.75"/>

<form>
  <field type="boolean">
    <prompt>
      Please say yes or no.
    </prompt>
  </field> 
</form>

</vxml>

In this example, the confidence threshold is raised to 0.75, so the caller must give a clear "yes" or "no" answer.  A high confidence level is useful when you expect a precise match to your grammar.

However, for grammars with many possible matches, such as a database of first and last names, you may want to lower the confidence level so that more candidate matches are accepted rather than rejected.
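
A minimal sketch of the lower setting follows. The names in the grammar and the value 0.3 are placeholders for illustration, not recommended values:

<?xml version="1.0"?>
<vxml version="2.0">

    <form>
        <field name="person">
            <!-- Lower threshold for this field only, because the names
                 in the grammar sound alike -->
            <property name="confidencelevel" value="0.3"/>
            <grammar type="application/x-jsgf" mode="voice">
                ( John Smith | Jane Smith | Joan Smythe )
            </grammar>
            <prompt>
                Who would you like to speak to?
            </prompt>
            <filled>
                I heard <value expr="person"/>.
            </filled>
            <nomatch>
                Sorry, I did not recognize that name.
                <reprompt/>
            </nomatch>
        </field>
    </form>

</vxml>

With a lower threshold, it is usually worth reading the result back to the caller, as the <filled> block does here, since a less certain match is more likely to be wrong.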

6.2 Using the Generic DTMF Recognizer Properties
The <property> tag also allows us to tune properties for recognizing DTMF.

For example, we can use the "interdigittimeout" property to adjust the in-between time for the user to input numbers on a telephone keypad:

<?xml version="1.0"?>
<vxml version="2.0">

<property name="interdigittimeout" value="3s"/>

    <form>
        <field name="myfield">
            <grammar type="application/x-jsgf" mode="dtmf">
                ( 1 | 2 )+
            </grammar>
            <prompt>
                Enter any number of the digits one or two.
            </prompt>
            <filled>
                You entered <value expr="myfield"/>.
            </filled>
            <nomatch>
                You did not enter any ones or twos.
                <reprompt/>
            </nomatch>
            <noinput>
                You did not enter anything.
                <reprompt/>
            </noinput>
        </field>
    </form>

</vxml>

In this example, once the first digit is entered, the user has 3 seconds to enter each following digit. If the user does not enter anything at all, the "timeout" property applies instead, and a <noinput> is generated when it expires. The "timeout" property is explained further below.

6.3 Using the Prompting and Collecting Properties
The <property> tag also allows us to tune properties for prompting and collecting.

As shown earlier in Section 2.4 of the tutorial, the "bargein" property can be used to prevent users from interrupting speech. Here is an example that does not allow the user to interrupt for the first prompt, but does allow the user to interrupt for the second prompt:

<?xml version="1.0"?>
<vxml version="2.0">

    <form>
        <property name="bargein" value="false"/>
        <field name="myfield">
            <grammar type="application/x-jsgf" mode="voice">
                ( one | two )+
            </grammar>
            <prompt bargein="false">
                You must listen to this message.
            </prompt>
            <prompt bargein="true">
                Say any number of the digits one or two.
            </prompt>
            <filled>
                You said <value expr="myfield"/>.
            </filled>
            <nomatch>
                You did not say any ones or twos.
                <reprompt/>
            </nomatch>
            <noinput>
                You did not say anything.
                <reprompt/>
            </noinput>
        </field>
    </form>

</vxml>

So, a possible user interaction might be:

C: You must...

H: One two.

C: ...listen to this message.

C: Say any number of...

H: One two.

C: You said one two.

Because the form-level "bargein" property and the first prompt's "bargein" attribute are both set to false, the user cannot interrupt the first message. The second prompt sets "bargein" to true, which overrides the property, so the user can interrupt that message by saying "one" or "two".

Another property that can be used to adjust prompting and collecting is the "timeout" property. This value can be adjusted to allow for more time if a user does not input anything. For example:

<?xml version="1.0"?>
<vxml version="2.0">

    <form>
        <property name="timeout" value="7s"/>
        <field name="myfield">
            <grammar type="application/x-jsgf" mode="voice">
                ( one | two )+
            </grammar>
            <prompt>
                Say any number of the digits one or two.
            </prompt>
            <filled>
                You said <value expr="myfield"/>.
            </filled>
            <nomatch>
                You did not say any ones or twos.
                <reprompt/>
            </nomatch>
            <noinput>
                You did not say anything.
                <reprompt/>
            </noinput>
        </field>
    </form>

</vxml>

In this example, the user has 7 seconds to begin saying a string of ones or twos. If the user says nothing for 7 seconds, a noinput event is generated. The default "timeout" value is 3 seconds, and the maximum is 60 seconds.
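
VoiceXML 2.0 also lets a single prompt carry its own "timeout" attribute, which applies to the input that follows that prompt. As a sketch, the following field gives the caller more time on the second attempt. The field name and the timeout values are chosen only for illustration:

<?xml version="1.0"?>
<vxml version="2.0">

    <form>
        <field name="answer" type="boolean">
            <!-- First attempt: wait 3 seconds for input -->
            <prompt count="1" timeout="3s">
                Please say yes or no.
            </prompt>
            <!-- Second and later attempts: give the caller more time -->
            <prompt count="2" timeout="7s">
                Take your time. Please say yes or no.
            </prompt>
            <noinput>
                <reprompt/>
            </noinput>
        </field>
    </form>

</vxml>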

Another property that can be used to adjust prompting and collecting is the "termmaxdigits" property. When this property is set to false, the timeout after a user has begun to enter digits is set to the "interdigittimeout". For example:

<?xml version="1.0"?>
<vxml version="2.0">

<property name="termmaxdigits" value="false"/>
<property name="interdigittimeout" value="3s"/>

    <form>
        <field name="myfield">
            <grammar type="application/x-jsgf" mode="dtmf">
                ( 1 | 2 )+
            </grammar>
            <prompt>
                Enter any number of the digits one or two.
            </prompt>
            <filled>
                You entered <value expr="myfield"/>.
            </filled>
            <nomatch>
                You did not enter any ones or twos.
                <reprompt/>
            </nomatch>
            <noinput>
                You did not enter anything.
                <reprompt/>
            </noinput>
        </field>
    </form>

</vxml>


A possible user interaction might be:

C: Enter any number of the digits one or two.

H: (enters DTMF-1 DTMF-2)

C: (waits 3 seconds because of interdigittimeout) You entered one two.

However, if we set "termmaxdigits" to true, the application stops waiting as soon as the user has entered the maximum number of digits. For example:

<?xml version="1.0"?>
<vxml version="2.0">

<property name="termmaxdigits" value="true"/>
<property name="interdigittimeout" value="3s"/>

    <form>
        <field name="myfield" type="digits?length=5">
            <prompt>
                Enter up to five ones or twos.
            </prompt>
            <filled>
                <prompt bargein="false">
                    You entered <value expr="myfield"/>.
                </prompt>
            </filled>
            <nomatch>
                You did not enter enough ones or twos.
                <reprompt/>
            </nomatch>
            <noinput>
                You did not enter anything.
                <reprompt/>
            </noinput>
        </field>
    </form>

</vxml>


A possible user interaction might be:

C: Enter up to five ones or twos.

H: (enters DTMF-1 DTMF-2 DTMF-2 DTMF-1 DTMF-1)

C: You entered twelve thousand two hundred eleven.

Because the user's five digits matched the required length, the application responds immediately. If the user enters fewer than five digits, the application waits the 3 seconds set by the "interdigittimeout" property and then returns a <nomatch>. If the user enters more than five digits, the application uses only the first five. Note that the builtin type "digits?length=5" requires exactly five digits and accepts any digit, not only ones and twos.

6.4 Using the Fetching Properties
The <property> tag also allows us to tune fetching properties for audio, documents, grammars, and scripts. To adjust these properties for audio, you would use "audiofetchhint", "audiomaxage", and "audiomaxstale". For document reference tags such as <subdialog>, <goto>, <submit>, <link>, and <choice>, you would use "documentfetchhint", "documentmaxage", and "documentmaxstale". For grammars, you would use "grammarfetchhint", "grammarmaxage", and "grammarmaxstale". For scripts, you would use "scriptfetchhint", "scriptmaxage", and "scriptmaxstale". For example:

<?xml version="1.0"?>
<vxml version="2.0">

<property name="documentmaxage" value="150s"/>
<property name="documentmaxstale" value="25s"/>
    <form>
         <block>
              <goto next="myfile.vxml"/>
         </block>
    </form>

</vxml>

In this example, "documentmaxage" is set to 150 seconds and "documentmaxstale" is set to 25 seconds. Because these properties are set at the document level, they apply to every document fetch in the file (<goto>, <submit>, etc.). The fetch of "myfile.vxml" by the <goto> tag therefore uses a maxage value of 150 seconds and a maxstale value of 25 seconds.
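
The "fetchhint" properties work the same way. In VoiceXML 2.0 they take the value "safe", meaning the resource is fetched only when it is needed, or "prefetch", meaning the platform may fetch it ahead of time. A minimal sketch, in which the file name welcome.wav is a placeholder:

<?xml version="1.0"?>
<vxml version="2.0">

<!-- Fetch audio only when it is about to be played -->
<property name="audiofetchhint" value="safe"/>
<!-- Allow, but do not require, the platform to fetch grammars early -->
<property name="grammarfetchhint" value="prefetch"/>

    <form>
        <block>
            <!-- The text inside <audio> is spoken if the file cannot be played -->
            <audio src="welcome.wav">Welcome.</audio>
        </block>
    </form>

</vxml>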

Also, the fetchtimeout property can be used to set the timeout for fetching a file from a web server. For example:

<?xml version="1.0"?>
<vxml version="2.0">

<property name="documentmaxage" value="150s"/>
<property name="documentmaxstale" value="25s"/>
<property name="fetchtimeout" value="20s"/>
    <form>
         <block>
              <goto next="myfile.vxml"/>
         </block>
    </form>

</vxml>

In this example, if the file "myfile.vxml" cannot be fetched from the web server within 20 seconds, a timeout occurs and an error is thrown. Like the caching properties above, "fetchtimeout" can be set globally so that it applies to every file that is fetched.
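
In VoiceXML 2.0, a failed fetch, including one that times out, is reported as an error.badfetch event. As a sketch, a document-level <catch> can give the caller a message instead of ending the call abruptly. The wording of the prompt is only an example:

<?xml version="1.0"?>
<vxml version="2.0">

<property name="fetchtimeout" value="20s"/>

    <!-- Runs when a fetch fails or times out anywhere in this document -->
    <catch event="error.badfetch">
        <prompt>
            Sorry, we are unable to continue at this time.
            Please try again later.
        </prompt>
        <exit/>
    </catch>

    <form>
        <block>
            <goto next="myfile.vxml"/>
        </block>
    </form>

</vxml>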

Another way to tune fetching properties is to use "fetchaudio", "fetchaudiodelay", and "fetchaudiominimum". These properties can be used to control the audio that is played for a user when the user is put on hold while a document is being fetched. For example:

<?xml version="1.0"?>
<vxml version="2.0">

<property name="fetchaudio" value="holdmusic.wav"/>
<property name="fetchaudiodelay" value="2s"/>
<property name="fetchaudiominimum" value="5s"/>
    <form>
         <block>
              <goto next="myfile.vxml" fetchaudio="holdmusic.wav"/>
         </block>
    </form>

</vxml>

In this example, the "fetchaudio" property sets holdmusic.wav to play whenever a fetch is delayed. The "fetchaudiodelay" property makes the platform wait 2 seconds after the fetch begins before it starts playing the "fetchaudio" source. The "fetchaudiominimum" property ensures that, once started, the audio plays for at least 5 seconds, even if the fetch completes sooner.

6.5 Using the Miscellaneous Properties
The <property> tag also allows us to tune miscellaneous properties, such as restricting a dialog to DTMF input instead of speech input.

The "inputmodes" property does this:

<?xml version="1.0"?>
<vxml version="2.0">

<property name="inputmodes" value="dtmf"/>
<property name="interdigittimeout" value="3s"/>

    <form>
        <field name="myfield" type="digits">
            <prompt>
                You will only be able to enter digits.
                Enter a number on your keypad.
            </prompt>
            <filled>
                You entered <value expr="myfield"/>.
            </filled>
            <nomatch>
                You did not enter a number properly.
                <reprompt/>
            </nomatch>
            <noinput>
                You did not enter anything.
                <reprompt/>
            </noinput>
        </field>
    </form>

</vxml>

A possible user interaction might be:

C: You will only be able to enter digits.

H: Twelve.

C: (ignores spoken input) Enter a number on your keypad.

H: (enters DTMF-1 DTMF-2)

C: You entered twelve.
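
Because properties follow the scoping rules described at the start of Section 6, "inputmodes" can also be limited to a single field. The sketch below, with field names and prompts invented for illustration, accepts a PIN from the keypad only and then returns to the default input modes for the confirmation:

<?xml version="1.0"?>
<vxml version="2.0">

    <form>
        <field name="pin" type="digits?length=4">
            <!-- Keypad only for this field, so the PIN is never spoken aloud -->
            <property name="inputmodes" value="dtmf"/>
            <prompt>
                Using your keypad, enter your four digit PIN.
            </prompt>
        </field>

        <field name="confirm" type="boolean">
            <!-- No property here, so the default input modes apply -->
            <prompt>
                Are you ready to continue? Say yes or no.
            </prompt>
            <filled>
                Thank you.
            </filled>
        </field>
    </form>

</vxml>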

6.6 Using the RecordCall Properties
The <property> tag allows us to set up call recording using the "recordcall" property. This property is Boolean: when it is set to true, call recording is enabled within its scope. For example:

<?xml version="1.0"?>
<vxml version="2.0">

     <form id="form1">
          <block>
               <prompt>
                    You will not hear this message in your recording.
               </prompt>
               <goto next="#form2"/>               
          </block>
    </form>

     <form id="form2">
          <property name="recordcall" value="true"/>
          <block>
               <prompt>
                    You will hear this message in your recording.
               </prompt>
               <submit next="callrecord.php" namelist="callrecording" 
               method="post"/>              
          </block>
    </form>

</vxml>

In this example, the only part of the call that is recorded is the prompt in form2, since the "recordcall" property is set within that scope. Also, note that the "callrecording" variable must be listed in the "namelist" attribute of <submit>, since it is predefined to work with "recordcall". Make sure that your .php script reads "callrecording" from the submitted data.

By default, if the "recordcall" property is used in two forms, the recording from the first form is overwritten by the recording from the second form. To prevent this, the "recordcallappend" property concatenates the recorded parts from each of the forms. This is a global property, so it affects the entire VoiceXML document. For example:

<?xml version="1.0"?>
<vxml version="2.0">

<property name="recordcallappend" value="true"/>

     <form id="form1">
          <property name="recordcall" value="true"/>
          <block>
               <prompt>
                    You will hear this message in your recording.
               </prompt>
               <goto next="#form2"/>             
          </block>
    </form>

     <form id="form2">
          <property name="recordcall" value="true"/>
          <block>
               <prompt>
                    You will hear this message in your recording.
               </prompt> 
               <submit next="callrecord.php" namelist="callrecording" 
               method="post"/> 
          
          </block>
    </form>

</vxml>

In this example, "recordcallappend" causes the recordings from form1 and form2 to be concatenated, so the recording from form1 is not overwritten by the recording from form2.

7. Auto Attendant Example

After working through this tutorial, you should be able to build your own application. Let's build an automated attendant application. Begin with Plum's standard template with the <vxml> tags and add two global properties: a "sensitivity" property of 0.3 and a "recordcall" property for recording the entire call.

<?xml version="1.0"?>
<vxml version="2.0">

<property name="sensitivity" value="0.3"/>
<property name="recordcall" value="true"/>

</vxml>

Next, set up a <form> block to be your introduction to the user. However, make your prompt such that the user cannot interrupt the introduction (hint: use "bargein"). Also, set up a <goto> tag to go to a <menu> block for the next section of the application.

<form id="intro">
     <block>
          <prompt bargein="false">
               Hello! Welcome to The Electronic Store, the leader of all electronic
               stores! This call may be recorded for research purposes.
          </prompt>
          <goto next="#mainmenu"/>
     </block>
</form>

Next, set up a <menu> block using the <choice> tag and allow the user to make a choice by either DTMF or speech input (hint: this is shown earlier in one of our examples).

<menu id="mainmenu">
     <prompt>
          Please choose a department:
          <enumerate/>
     </prompt>
     <choice dtmf="1" next="#desktop">
          Desktop</choice>
     <choice dtmf="2" next="#accessories">
          Accessories</choice>
     <choice dtmf="3" next="#support">
          Support</choice>
     <choice dtmf="4" next="#operator">
          Operator</choice>
</menu>

Next, set up a <form> block for the first choice made by the user from your menu. Create your <form> block such that it allows the user to make a choice by using a speech <grammar> tag. Also, try using <if> tags (<if>, <elseif>, <else>) to acknowledge the choice made by the user.

<form id="desktop">
     <field name="brand1">
          <grammar type="application/x-jsgf">
               (Gateroad | Kell)
          </grammar>
          <prompt>
               Which desktop would you like? Gateroad or Kell?
          </prompt>
          <filled>
               You chose a <value expr="brand1"/>.
               <if cond="brand1=='Gateroad'">
                    <prompt>
                         Congratulations on buying a Gateroad! Remember our motto:
                         Let Gateroad take you to a better path.
                    </prompt>
               <else/>
                    <prompt>
                         Congratulations on buying a Kell! Remember our motto:
                         Do it the Kell way.
                    </prompt>
               </if>
          </filled> 
     </field>            
</form>

Next, set up a <form> block for the second choice in your menu for the user. Again, you can use the <grammar> tag along with the <if>, <elseif>, and <else> tags for this <form> block.

<form id="accessories">
     <field name="brand2">
          <grammar type="application/x-jsgf">
               (Keyboard | Mouse | Monitor)
          </grammar>
          <prompt>
               Which accessories would you like? Keyboard, mouse, or monitor?
          </prompt>
          <filled>
               You chose <value expr="brand2"/>.
               <if cond="brand2=='Keyboard'">
                    <prompt>
                         Congratulations on buying a keyboard! 
                         This keyboard will help you type faster than ever before.
                    </prompt>
               <elseif cond="brand2=='Mouse'"/>
                    <prompt>
                         Congratulations on buying a mouse!
                         This mouse will help you click faster than ever before.
                    </prompt>
               <else/>
                    <prompt>
                         Congratulations on buying a monitor!
                         This monitor will display graphics better than you have
                         ever seen.
                    </prompt>
               </if>
          </filled> 
     </field>            
</form>

Next, set up a <form> block for the third choice in your menu for the user. Try making this <form> block DTMF input only and have the user enter a multi-digit number that matches with the number of digits that you want the user to enter (hint: use "termmaxdigits"). Make sure you give the user an ample amount of time when entering these digits (hint: use "interdigittimeout"). Use the <nomatch> and <noinput> tags to help the user correctly enter the input.

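<form id="support">
     <!-- Accept keypad input only in this form -->
     <property name="inputmodes" value="dtmf"/>
     <property name="termmaxdigits" value="true"/>
     <property name="interdigittimeout" value="3s"/>
     <!-- Declare customerid before it is assigned in <filled> -->
     <var name="customerid"/>
     <field name="id" type="digits?length=7">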
          <prompt>
               Enter your customer identification number.
          </prompt>
          <filled>
               <assign name="customerid" expr="id"/>
               You entered <value expr="id"/>.
               Welcome customer number <value expr="id"/>.
               We value helping our customers greatly.
               Please wait for the next available assistant.
          </filled>
          <noinput>
               Your identification number is the seven digit number on the back of
               your computer. It can also be located on the front of your user's
               manual.
               <reprompt/>
          </noinput>
          <nomatch>
               Your identification number must be seven digits.
               Please try again.
               <reprompt/>
          </nomatch>
     </field>         
</form>

Next, set up a <form> block for the last choice in your menu for the user. Here, try using the <transfer> tag to transfer the user to a telephone number. To use the <transfer> tag, you would type something similar to this:
 
     <transfer name="mycall" dest="tel:+1-617-712-3000" connecttimeout="20s"
     bridge="true">
     </transfer>  

Now, try using it within your <form> block:

<form id="operator">
     <block>
          <prompt>
               Hello. You've reached the operator's line. Now transferring.
          </prompt>
     </block>
     <transfer name="mycall" dest="tel:+1-617-712-3000" connecttimeout="20s"
     bridge="true">
     </transfer>   
</form>
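
As one possible extension of the operator form, the result of a bridged transfer can be checked once the transfer ends. In VoiceXML 2.0, the transfer's form item variable (here "mycall") is set to a status such as 'busy' or 'noanswer' when the call cannot be connected. The wording of the prompts below is only an example:

<form id="operator">
     <transfer name="mycall" dest="tel:+1-617-712-3000" connecttimeout="20s"
     bridge="true">
          <prompt>
               Hello. You've reached the operator's line. Now transferring.
          </prompt>
          <filled>
               <!-- mycall holds the outcome of the transfer attempt -->
               <if cond="mycall == 'busy'">
                    <prompt>
                         The operator's line is busy. Please try again later.
                    </prompt>
               <elseif cond="mycall == 'noanswer'"/>
                    <prompt>
                         The operator is not answering. Please try again later.
                    </prompt>
               </if>
          </filled>
     </transfer>
</form>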

Once you complete this application, you will have used many of the tags and techniques that are available on the Plum IVR platform. For further information, see the References section.

8. References

For more information about Plum VoiceXML products and services, contact us:
web: www.plumgroup.com
e-mail: sales@plumgroup.com
phone: 800-995-PLUM