Programmer's Reference Manual

Plum Voice Platform v. 2.6

© 2007 Plum Group, Inc. All rights reserved.
More on VoiceXML at Plumvoice.com

Introduction
The Programmer's Reference Manual provides complete information on the VoiceXML 2.0/2.1 tags, attributes, and features supported by the Plum Voice Platform. It is the desk-side resource for Plum Voice Platform VoiceXML developers. Section 1 describes all of the supported VoiceXML tags and attributes. Section 2 provides information on default grammars and the default speech recognition engine shipped with the Plum platform. Section 3 provides information on prompts and speech synthesis. Section 4 provides information on logging. Section 5 provides information on caching behavior. Section 6 provides information on the TTS Speech Engine Characteristics for AT&T Natural Voices, Cepstral Engine, and RealSpeak Engine. Section 7 provides information on data exchange using the <submit>, <subdialog>, and <data> tags. Section 8 provides information on how to generate a call report. Section 9 provides information on how to debug a VoiceXML script if there is an error. Section 10 provides information on how to use root documents to pass information from one VoiceXML page to another VoiceXML page.

This document assumes the reader has prior experience with developing IVR applications with VoiceXML. In order to take full advantage of the features of the Plum Voice Platform, the reader should be familiar with the following documents:


Outline
1. VoiceXML Reference
1.1 Tag Descriptions
1.2 Global Properties
1.3 Session Variables
1.4 Unsupported Tags and Attributes
1.5 Standard Application Variables
2. Grammars and Speech Recognition
2.1 ABNF+XML
2.2 JSGF
2.3 ABNF+XML Tag Format
2.4 JSGF Tag Format
2.5 Form Interpretation Algorithm
2.6 Mixed Initiative Forms
2.7 Built-in Grammars
2.8 Extended Built-in Grammars
3. Prompts and Speech Synthesis
3.1 Supported Audio Formats
3.2 Prompt Queuing and Barge-in Behavior
4. Logging
4.1 Last Call Logs
4.2 Session Logs
5. Caching
5.1 Plum Caching Approach
5.2 Typical Web Server Behavior
5.3 Examples of Maxage/Maxstale Usage
6. TTS Speech Engine Characteristics
6.1 Voice Tag Attributes
6.2 Voice Child Tags
7. Data Exchange
7.1 Using the Submit Tag
7.2 Using the Subdialog Tag
7.3 Using the Data Tag
7.4 Submit vs. Subdialog vs. Data
8. Call Report
8.1 Generating a Report
9. Debugging
9.1 Script Validation
9.2 Common Errors
10. Root Documents
A. Appendix
A.1 <say-as> Tag Types
A.2 Phoneme Set
A.3 Record and Recognition Termination Chart
A.4 Tag Hierarchy


1. VoiceXML Reference
Information for each VoiceXML 2.0/2.1 tag on the Plum Voice Platform is provided in Section 1.1. The tags are presented in two formats: alphabetically and by functional grouping. Much of the text in Section 1.1 is drawn directly from the VoiceXML 2.0 Last Call and VoiceXML 2.1 Working Draft specifications. All tag attributes are shown, and exceptions to the specified behavior are noted for individual attributes. The global properties that can be set with the <property> tag are described in Section 1.2. Unsupported tags and attributes are listed in Section 1.4.


1.1 Tag Descriptions
Each tag description is broken into five sections:
  • summary: a brief description of the tag
  • attributes: descriptions of the tag's attributes
  • child tags: child tags and/or special text specific to the tag
  • notes: additional information related to the tag
  • example: an example of the tag in use
Much of the descriptive text is taken from the VoiceXML 2.0 Last Call and VoiceXML 2.1 Working Draft specifications. Plum Voice Platform exceptions to the specifications are noted accordingly.

The following is the list of all supported VoiceXML elements in alphabetical order:
Tag Purpose
<assign> Assigns a value to a variable
<audio> Retrieves and plays the specified audio file
<block> Contains executable content
<break> Inserts a pause in the synthesized text
<catch> An XML Event handler containing executable content
<choice> Defines a menu item
<clear> Resets one or more form items
<data> Allows a VoiceXML application to fetch arbitrary XML data from a document server without transitioning to a new VoiceXML document.
<desc> Provides a description of a non-speech audio source in <audio>
<disconnect> Disconnect from the telephone session
<else> Used within an <if> element
<elseif> Used within an <if> element
<emphasis> Specifies text that should be spoken with emphasis
<enumerate> Automatically generate description of available choices
<error> Catches all events of type error
<example> Provided as the initial content within a "rule" element
<exit> Disconnects the session without throwing any events
<field> Specifies an input item to be gathered from the user
<filled> Catches events in which input items are filled by input
<foreach> Iterates through an ECMAScript array and executes the content contained within the <foreach> element for each item in the array
<form> An executable content container
<goto> Transitions to another form, dialog or document
<grammar> Provide a speech (or DTMF) grammar
<help> An abbreviation for <catch event="help">
<if> Provides conditional control logic
<initial> Prompts user for initial form information
<item> Contains each alternative expansion of a <one-of> block
<lexicon> Specifies a pronunciation lexicon for the prompt; this tag is not supported (see Section 1.4)
<link> Allows grammar declarations with global scope
<log> Generate a logging or debug message
<mark> Places a marker in the synthesized text; this tag is not supported (see Section 1.4)
<menu> Shorthand for a form containing a single anonymous field
<meta> Specifies meta information, as in HTML
<metadata> A container to store information about the document in
<noinput> An abbreviation for <catch event="noinput">
<nomatch> An abbreviation for <catch event="nomatch">
<one-of> A set of alternative legal rule expansions, each contained in an <item> element
<option> Specifies one of the permitted values of a <field> element
<paragraph> Changes prosody of text to reflect the end of a paragraph
<param> Specify values that are passed to subdialogs
<phoneme> Specify pronunciations explicitly in synthesized speech
<prompt> Controls output of synthesized speech
<property> Sets property values that affect platform behavior
<prosody> Changes sound characteristics of synthesized speech
<record> Collects recorded audio from the telephone session
<reprompt> Re-visit a field
<return> Returns control and data to a calling dialog
<rule> Contains an SRGS+XML rule definition
<ruleref> References another defined rule within the same local grammar
<say-as> Provides pronunciation hints for synthesized speech
<script> Specifies client-side scripting language code
<sentence> Changes prosody of text to reflect the end of a sentence
<speak> Synthesize the specified text
<sub> Suggest substitute synthesized speech
<subdialog> Container to facilitate modular design and re-use
<submit> Submit transition to another document with form data
<tag> An arbitrary string that may be included inline within any legal rule expansion
<throw> Throws an application or predefined event
<token> Defines words or other entities that may be spoken
<transfer> Telephony control that transfers to a phone number
<value> Insert the value of an expression into a prompt
<var> Declares a variable that can occur in executable content
<voice> Change the voice of the TTS speaker
<vxml> The outermost tag within a VoiceXML document
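
As an illustration of how several of the tags listed above fit together, the following is a minimal sketch of a complete document. The form id, variable names, prompt wording, and the reserved code '0000' are assumptions made for the example and do not come from the platform; the built-in digits grammar is selected with the standard VoiceXML type attribute.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="main">
    <!-- <var> declares a form-scoped variable -->
    <var name="attempts" expr="0"/>

    <!-- <block> holds executable content that runs once when visited -->
    <block>
      <prompt>Welcome to the demonstration line.</prompt>
    </block>

    <!-- <field> gathers input; type selects a built-in digits grammar -->
    <field name="pin" type="digits?length=4">
      <prompt>Please enter or say your four digit code.</prompt>

      <!-- <filled> runs once the field has a value -->
      <filled>
        <assign name="attempts" expr="attempts + 1"/>
        <prompt>You entered <value expr="pin"/>.</prompt>
        <if cond="pin == '0000'">
          <!-- Hypothetical rule: reject this code and ask again -->
          <prompt>That code is reserved.</prompt>
          <clear namelist="pin"/>
        <else/>
          <log>pin accepted after <value expr="attempts"/> attempt(s)</log>
          <exit/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

Clearing the field with <clear> causes the form to visit it again, so the caller is re-prompted until a different code is entered.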


The following is the list of all supported VoiceXML elements grouped by purpose:
Function Tags
Document definition <meta>, <metadata>, <vxml>
Dialog definition <form>, <menu>
Form input definitions <field>, <record>, <subdialog>
Telephony control <transfer>
Form control <block>, <initial>
Menus <menu>, <choice>, <grammar>, <enumerate>
Fields <field>, <enumerate>, <grammar>, <option>, <filled>, <help>, <noinput>, <nomatch>
Subdialogs <subdialog>, <param>, <return>
Controlling Dialog Transitions <goto>, <submit>, <link>, <choice>
Throwing events <throw>
Handling events <catch>, <error>, <help>, <noinput>, <nomatch>
Controlling Text-To-Speech <break>, <voice>, <emphasis>, <prosody>, <phoneme>, <say-as>, <sub>, <mark>, <sentence> (or <s>), <paragraph> (or <p>)
Specifying grammars <grammar>, <say-as>
Changing platform properties <property>
Content containers <block>, <filled>, <if>, <catch>, <error>, <help>, <noinput>, <nomatch>
Variable usage <var>, <assign>, <clear>
ECMAScript usage <script>
Conditional control logic <if>, <else>, <elseif>, <foreach>
Audio Output <audio>, <prompt>, <reprompt>, <value>, <enumerate>
Session Termination <disconnect>, <exit>
Submitting and receiving data <data>, <submit>
Debugging <log>
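
The functional groups above can be combined in a single document. The following sketch, offered as one possible arrangement rather than a recommended design, uses tags from the Menus, Handling events, Throwing events, Controlling Dialog Transitions, and Session Termination groups. The dialog ids, prompt wording, and the event name com.example.toomanyerrors are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- Document-scoped handlers apply to every dialog below -->
  <error>
    <prompt>Sorry, something went wrong. Goodbye.</prompt>
    <disconnect/>
  </error>
  <catch event="com.example.toomanyerrors">
    <prompt>We are having trouble understanding you. Goodbye.</prompt>
    <disconnect/>
  </catch>

  <!-- <menu> with <choice> items; <enumerate> reads the choices aloud -->
  <menu id="main" dtmf="true">
    <prompt>Main menu. Please choose one of the following. <enumerate/></prompt>
    <choice next="#sales">sales</choice>
    <choice next="#support">support</choice>

    <noinput>
      <prompt>I did not hear anything.</prompt>
      <reprompt/>
    </noinput>

    <!-- On the third nomatch, raise an application-defined event -->
    <nomatch count="3">
      <throw event="com.example.toomanyerrors"/>
    </nomatch>
  </menu>

  <form id="sales">
    <block>
      <prompt>You chose sales.</prompt>
      <exit/>
    </block>
  </form>

  <form id="support">
    <block>
      <prompt>You chose support.</prompt>
      <exit/>
    </block>
  </form>
</vxml>
```

The first and second nomatch events are not caught by the menu in this sketch, so they fall through to whatever default handling is in effect.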


1.2 Global Properties
Property Default Description
completetimeout 1 second The specification-optional completetimeout property is not supported as an independent timeout. If it is set to a value greater than incompletetimeout, however, that value overrides incompletetimeout.
incompletetimeout 1 second The required length of silence following user speech after which a recognizer finalizes a result. The minimum value for incompletetimeout is 500ms.
maxspeechtimeout 60 seconds The maximum duration of user speech input. Maximum recording time (as opposed to recognition time) is set with the maxtime attribute of the <record> tag. The maxspeechtimeout value cannot be less than 2 seconds.
interdigittimeout 1 second The interdigit timeout value to use when recognizing DTMF input.
termtimeout 0 seconds The terminating timeout to use when recognizing DTMF input. The value is a Time Designation.
termchar "#" The "termchar" global property is not supported.
bargein true The bargein attribute to use for prompts. Setting this to true allows barge-in by default. Setting it to false disallows barge-in.
bargeintype no default value Sets the type of barge-in to speech or hotword. The default is platform-specific. This property is not supported.
timeout 3 seconds The time after which a noinput event is thrown by the platform. The timeout property cannot be set higher than 60 seconds.
audiofetchhint "safe" Tells the platform whether it may optimize dialog interpretation by pre-fetching audio. The value is either safe, meaning audio is fetched only when it is needed and never before, or prefetch, which permits but does not require the platform to pre-fetch the audio.
audiomaxage NULL Tells the platform the maximum acceptable age, in seconds, of cached audio resources. Setting "audiomaxage" to NULL allows the platform to honor expiration time set by the web server.
audiomaxstale 0 seconds Tells the platform the maximum acceptable staleness, in seconds, of expired cached audio resources.
documentfetchhint "safe" Tells the platform whether or not documents may be pre-fetched. The value is either safe (the default) or prefetch.
documentmaxage NULL Tells the platform the maximum acceptable age, in seconds, of cached documents. Setting "documentmaxage" to NULL allows the platform to honor expiration time set by the web server.
documentmaxstale 0 seconds Tells the platform the maximum acceptable staleness, in seconds, of expired cached documents.
grammarfetchhint "safe" Tells the platform whether or not grammars may be pre-fetched. The value is either prefetch or safe.
grammarmaxage NULL Tells the platform the maximum acceptable age, in seconds, of cached grammars. Setting "grammarmaxage" to NULL allows the platform to honor expiration time set by the web server.
grammarmaxstale 0 seconds Tells the platform the maximum acceptable staleness, in seconds, of expired cached grammars.
scriptfetchhint "safe" Tells the platform whether or not scripts may be pre-fetched. The value is either prefetch or safe.
scriptmaxage NULL Tells the platform the maximum acceptable age, in seconds, of cached scripts. Setting "scriptmaxage" to NULL allows the platform to honor expiration time set by the web server.
scriptmaxstale 0 seconds Tells the platform the maximum acceptable staleness, in seconds, of expired cached scripts.
fetchaudio no default value The URI of the audio to play while waiting for a document to be fetched. By default, no audio is played during fetch delays. There are no fetchaudio properties for audio, grammars, objects, and scripts. Fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. Playing of the audio clip is governed by the fetchaudiodelay and fetchaudiominimum properties in effect at the time of the fetch.
fetchaudiodelay 0 seconds The time interval to wait at the start of a fetch delay before playing the fetchaudio source. The value is a Time Designation. The specification leaves the default platform-dependent (e.g., "2s"). The idea is that when a fetch delay is short, it may be better to have a few seconds of silence than a bit of fetchaudio that is immediately cut off.
fetchaudiominimum 0 seconds The minimum time interval to play a fetchaudio source, once started, even if the fetch result arrives in the meantime. The value is a Time Designation. The specification leaves the default platform-dependent (e.g., "5s"). The idea is that once the user does begin to hear fetchaudio, it should not be stopped too quickly.
fetchtimeout 30 seconds The timeout for all fetches.
inputmodes "dtmf voice" Determines which input modalities are enabled: dtmf, voice, or both. To disable speech recognition, set inputmodes to "dtmf". To disable DTMF, set it to "voice". One use for this would be to turn off speech recognition in noisy environments. Another would be to conserve speech recognition resources by turning them off where the input is always expected to be DTMF. This property does not control the activation of grammars. For instance, voice-only grammars may be active when inputmodes is restricted to "dtmf". Those grammars would not be matched, however, because the voice input modality is not active.
universals no default value Platforms may optionally provide platform-specific universal command grammars, such as "help", "cancel", or "exit" grammars, that are always active (except in the case of modal input items) and which generate specific events. Production-grade applications often need to define their own universal command grammars, e.g., to increase application portability or to provide a distinctive interface. They specify new universal command grammars with <link> elements. They turn off the default grammars with this property. Default catch handlers are not affected by this property. The value "none" is the default, and means that all platform default universal command grammars are disabled. The value "all" turns them all on. Individual grammars are enabled by listing their names separated by spaces; for example, "cancel exit help". This property is not supported.
maxnbest 5 Controls the maximum size of the application.lastresult$ array; the array is constrained to be no larger than the value of maxnbest. The minimum value is 1. The default in the VoiceXML specification is 1. This property is not supported.
sensitivity 0.5 Sets the sensitivity level. A value of 1.0 means that it is highly sensitive to quiet input. A value of 0.0 means it is least sensitive to noise. The value is a Real Number Designation.
speedvsaccuracy 0.5 A hint specifying the desired balance between speed and accuracy. A value of 0.0 means fastest recognition. A value of 1.0 means best accuracy. The value is a Real Number Designation. The default value is 0.5.
confidencelevel 0.5 The speech recognition confidence level, a float value in the range of 0.0 to 1.0. Results are rejected (a nomatch event is thrown) when application.lastresult$.confidence is below this threshold. The platform has been finely tuned to minimize the number of false positives above 0.5, and as such lowering this value is not recommended.
termmaxdigits false Enables the behavior "termchar Empty When Grammar Must Terminate" described in Appendix D of the VoiceXML 2.0 Specification. This produces an immediate timeout on DTMF collection for builtin:digits grammars (with length or maxlength set) and builtin:boolean grammars. If any other grammars are active during recognition, this behavior is disabled.
recordutterance false Enables recording during recognition. If the recordutterance property is set to true, three shadow variables are set on the appropriate form item variable: recording, recordingsize, and recordingduration. See Section 1.5 - Standard Application Variables for more information.
recordutterancetype "audio/basic" Specifies the media format of utterance recordings.
recordcall false Enables call recording while the property is in scope. Call recordings are stored in the variable session.callrecording and can be uploaded using the <submit>, <subdialog>, or <data> tags.
recordcallappend false If this property is set to true when call recording transitions from disabled to enabled, new audio is appended to any previously recorded call audio instead of overwriting it.
voicename depends on system Globally sets the default voice for a VoiceXML script.
voicegender depends on system Globally sets the default voice gender for a VoiceXML script.
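
The properties in the table above are set with the <property> tag, and in standard VoiceXML a property set on an inner element overrides the same property set on an enclosing one. The sketch below shows one way this might look, assuming standard scoping applies on the platform; the form id, field names, prompt wording, and the specific values chosen are illustrative only.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- Document scope: inherited by every dialog unless overridden -->
  <property name="timeout" value="5s"/>
  <property name="bargein" value="true"/>

  <form id="account">
    <field name="acct" type="digits?length=6">
      <!-- Field scope: keypad entry only for the account number -->
      <property name="inputmodes" value="dtmf"/>
      <property name="interdigittimeout" value="2s"/>
      <!-- Intended to end collection without waiting once six digits arrive -->
      <property name="termmaxdigits" value="true"/>
      <prompt>Please key in your six digit account number.</prompt>
    </field>

    <field name="confirm" type="boolean">
      <!-- Keep a recording of what the caller said for this field -->
      <property name="recordutterance" value="true"/>
      <prompt>You entered <value expr="acct"/>. Is that correct?</prompt>
      <filled>
        <!-- Shadow variable set on the form item when audio was collected -->
        <log>confirm utterance lasted <value expr="confirm$.recordingduration"/> ms</log>
        <if cond="confirm">
          <prompt>Thank you.</prompt>
          <exit/>
        <else/>
          <clear namelist="acct confirm"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

If the caller answers the confirmation with a key press instead of speech, no audio is collected and the recording shadow variables may be undefined.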


1.3 Session Variables
There are three session variables set by the Plum Voice Platform:
Variable Description
session.telephone.ani The caller's phone number.
session.telephone.dnis The phone number the caller dialed.
session.id A unique ID for each session. The ID consists of a six character server identifier (defaults to "000000"), followed by a semicolon, followed by a three digit channel identifier, followed by a timestamp of when the channel was ready to receive the next call.
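
A short sketch of reading these session variables follows. It assumes session.id can be treated as an ECMAScript string; the local variable names and the form id are hypothetical. Only the portion of session.id before the semicolon is extracted, since that is the only delimiter described above.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="whoami">
    <block>
      <!-- Copy the session variables into local variables -->
      <var name="caller" expr="session.telephone.ani"/>
      <var name="dialed" expr="session.telephone.dnis"/>

      <!-- The server identifier is the part of session.id before the semicolon -->
      <var name="server" expr="String(session.id).split(';')[0]"/>

      <log>call from <value expr="caller"/> to <value expr="dialed"/> on server <value expr="server"/></log>
      <prompt>Thank you for calling.</prompt>
      <exit/>
    </block>
  </form>
</vxml>
```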


1.4 Unsupported Tags and Attributes
Tag VXML Specification Deviation
<return> Only returns a statically specified event in its "event" attribute. The "eventexpr", "message", and "messageexpr" attributes are not supported.
<vxml> The xmlns, xsi:schemaLocation, and xmlns:xsi attributes are disregarded by the platform.
<grammar> The tag-format, version, and mode attributes are not supported.
<prompt> The bargeintype attribute is not supported.
<lexicon> There is no support for this tag.
<mark> There is no support for this tag.


1.5 Standard Application Variables
application.lastresult$ holds information about the last recognition to occur within this application. It is an array of elements where each element, application.lastresult$[i], represents a possible result. The array always contains at least one and at most five elements.
Application Variable Description
application.lastresult$[i].confidence Confidence level for this utterance. A value of 0.0 indicates minimum confidence, and a value of 1.0 indicates maximum confidence.
application.lastresult$[i].utterance The raw string of words that were recognized for this interpretation. In the case of a DTMF grammar, this variable will contain the matched digit string.
application.lastresult$[i].inputmode The mode in which user input was provided: dtmf or voice.
application.lastresult$[i].interpretation Contains the interpretation, or the exact match to the active grammar.
application.lastresult$[i].recording Contains the recording of the last utterance if the global property "recordutterance" is set and audio was collected.
application.lastresult$[i].recordingsize Contains the size of the last utterance recording in bytes.
application.lastresult$[i].recordingduration Contains the duration of the last utterance recording in milliseconds.
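
The following sketch shows one way the variables above might be inspected after a recognition. The field name, the option values, and the 0.7 re-ask threshold are assumptions for the example; the threshold is an application-level choice layered on top of the confidencelevel property and is not a platform setting.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="pick">
    <field name="city">
      <prompt>Say Boston, Chicago, or Denver.</prompt>
      <option dtmf="1">boston</option>
      <option dtmf="2">chicago</option>
      <option dtmf="3">denver</option>

      <filled>
        <!-- Element 0 is the first (best) result of the last recognition -->
        <var name="best" expr="application.lastresult$[0]"/>
        <log>heard <value expr="best.utterance"/> via <value expr="best.inputmode"/> with confidence <value expr="best.confidence"/></log>

        <!-- Hypothetical policy: re-ask when a spoken answer scored below 0.7 -->
        <if cond="best.inputmode == 'voice' &amp;&amp; best.confidence &lt; 0.7">
          <prompt>Sorry, I am not sure I understood.</prompt>
          <clear namelist="city"/>
        <else/>
          <prompt>You chose <value expr="city"/>.</prompt>
          <exit/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

Because results below the confidencelevel threshold are already rejected with a nomatch event, a check like this only affects results that scored between that threshold and the application's own cutoff.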