
I know that some skills can capture spoken text, such as when adding to to-do lists and shopping lists, and third-party skills can also do this, e.g. SMS with Molly.

So, how do they do this? Is there an API call that captures the recognized text and stores it somewhere?

Graham Chiu

1 Answer


Custom skills can capture text and send it to your Skill's API.

If you're not completely familiar with how Alexa Skills work, here's a brief summary:

  • First, you register your Skill with Amazon, providing an intent schema and sample utterances. The intent schema defines which actions can be performed, and the slots for custom data to be sent to your API. The sample utterances provide examples of how a user can trigger each intent.

  • When the user activates your Skill, Alexa will try to match what they said to one of your skill's sample utterances. If it does match, it will send an HTTPS request to your server to ask for a response.

  • Your server provides a response (if all goes well) and then Alexa will give feedback to the user who triggered your skill.

The AMAZON.LITERAL slot allows you to accept virtually any input. Note that currently it is only supported in the English (US) region; English (UK) and German skills cannot use AMAZON.LITERAL.

Your intent schema might look like this:

{
  "intents": [
    {
      "intent": "SaveTodo",
      "slots": [
        {
          "name": "Todo",
          "type": "AMAZON.LITERAL"
        }
      ]
    }
  ]
}

And your sample utterances might be like this:

SaveTodo remind me to {fetch the shopping|Todo}
SaveTodo remind me to {write my English essay|Todo}
SaveTodo remind me to {buy some dog food tomorrow|Todo}
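On the server side, the recognized text arrives as the value of the Todo slot inside the JSON body of the HTTPS request. A minimal sketch of a handler in Python, assuming the body has already been parsed into a dict; the names handle_request, speak and SAVED_TODOS are made up for illustration, and the in-memory list stands in for whatever storage you actually use:

```python
# Hypothetical in-memory store; a real skill would use a database.
SAVED_TODOS = []


def speak(text):
    """Wrap plain text in the JSON structure Alexa expects as a response."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }


def handle_request(event):
    """Handle a parsed Alexa request body (a dict) and return a response dict."""
    request = event.get("request") or {}
    if request.get("type") == "IntentRequest":
        intent = request.get("intent") or {}
        if intent.get("name") == "SaveTodo":
            slots = intent.get("slots") or {}
            # The captured speech is the "value" of the slot named in the schema.
            todo = ((slots.get("Todo") or {}).get("value") or "").strip()
            if todo:
                SAVED_TODOS.append(todo)
                return speak(f"Okay, I saved: {todo}")
            return speak("Sorry, I didn't catch what to save.")
    return speak("Sorry, I can't help with that.")
```

The slot value may be missing or empty if Alexa matched the intent but could not fill the slot, which is why the sketch checks for that before saving.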

When using AMAZON.LITERAL, you need to provide lots of sample utterances: at least one sample for each possible length of input, but ideally more. The Amazon documentation suggests that you should be aiming for hundreds of samples for slots where you could accept various types of inputs.

It does seem a little tedious, but if you don't do this, it's unlikely that your skill will recognise text well. You could perhaps generate sample utterances from customer data (so long as personal information is removed beforehand!) so that the most common utterances are in your samples. I suspect Alexa will be slightly biased towards recognising utterances similar to the samples.
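If you do generate the samples, a small script can do the formatting. This is only a sketch: build_sample_utterances and its parameters are hypothetical, it assumes the phrases have already been stripped of personal information, and it simply keeps a few phrases for each word count so that every input length seen in the data is represented:

```python
from collections import defaultdict


def build_sample_utterances(phrases, intent="SaveTodo", slot="Todo",
                            carrier="remind me to", per_length=2):
    """Format phrases as sample utterances, keeping a few per word count."""
    by_length = defaultdict(list)
    for phrase in phrases:
        cleaned = " ".join(phrase.split())  # collapse stray whitespace
        if not cleaned:
            continue
        bucket = by_length[len(cleaned.split())]
        if cleaned not in bucket:  # skip exact duplicates
            bucket.append(cleaned)

    lines = []
    for length in sorted(by_length):
        for phrase in by_length[length][:per_length]:
            # Produces e.g.: SaveTodo remind me to {buy milk|Todo}
            lines.append(f"{intent} {carrier} {{{phrase}|{slot}}}")
    return lines
```

Calling print("\n".join(build_sample_utterances(phrases))) gives text in the same format as the sample utterances shown earlier, ready to paste into the skill configuration.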

Amazon discourage AMAZON.LITERAL slots though, and would prefer you to use custom slot types, which require you to list the possible inputs. It's important to remember that:

A custom slot type is not the equivalent of an enumeration. Values outside the list may still be returned if recognized by the spoken language understanding system. Although input to a custom slot type is weighted towards the values in the list, it is not constrained to just the items on the list. Your code still needs to include validation and error checking when using slot values.
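In practice that means checking the slot value on your server before acting on it. A minimal sketch, assuming a hypothetical custom slot type whose listed values are the names in KNOWN_LISTS (both the set and resolve_list_name are illustrative, not part of any Amazon API):

```python
# Hypothetical values listed for a custom slot type.
KNOWN_LISTS = {"shopping", "to-do", "groceries"}


def resolve_list_name(slot_value):
    """Return the normalized list name, or None if the value is not one we accept."""
    if not slot_value:
        return None  # slot missing or empty
    normalized = " ".join(slot_value.lower().split())
    return normalized if normalized in KNOWN_LISTS else None
```

When this returns None, the skill can ask the user to repeat or rephrase instead of silently storing an unexpected value.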

Helmar
Aurora0001