‘Alexa, build me a skill’ – developing for Amazon Echo
Amazon Echo, the ecommerce giant’s personal assistant in a tennis-ball can, hit the UK towards the end of 2016. Available in the US since mid-2015, the Amazon Echo (and the smaller, but functionally identical, Echo Dot) represents Amazon’s first step into the world of automated voice assistants. Conversational interfaces like these are not new: Apple’s Siri and Google Assistant have existed in various iterations on smartphones for years. Where Echo differs is in its location and form factor. It also presents an easier process for third-party developers.
Welcome to the Smart Home
The first and most obvious difference between Echo and services such as Siri and Google Assistant is that Echo lives in a stand-alone device that finds a place in your home (although Google has recently made steps in a similar direction). This immediately changes how you interact with the device, as it becomes a functional part of your home instead of something that travels with you in your smartphone or tablet.
One of the main capabilities of the Echo is its ability to talk to your smart home, from controlling smart lights to playing music through your speakers to turning up the thermostat. This placement as a feature of your home eases users into a different type of interaction with the device, allowing for a more conversational model instead of just straight commands.
At the Consumer Electronics Show in Las Vegas early in 2017, Echo was everywhere, from dishwashers and ovens to robots. What struck observers was that the Alexa assistant service was being built natively into devices, as well as utilising the stand-alone Echo devices. This highlights that Alexa is more than the Echo speaker she currently lives in; she is a gateway to the power of Amazon Web Services Lambda functions.
Alexa, what can you do for me?
Beyond the smart home and other built-in features, such as asking about the weather or football scores, the power of Alexa comes from Skills. Skills are Amazon’s version of apps, much like those you might find on your smartphone or tablet. The Alexa app has its own storefront where you can find and install new Skills for your Echo device. At the current time it is not possible to charge for Skills, so there’s a bevy of free experiences available for owners.
There are two main parts to a Skill: the Alexa Skills Kit, which serves as the voice interface; and the AWS Lambda function, which does the processing and the bulk of the heavy lifting. Packaged together, these two parts form an Alexa Skill. The two aspects live on different parts of the Amazon estate: the voice services sit at developer.amazon.com, whilst the Lambda functions sit within the normal Amazon Web Services bubble.
What did you say to me?
When building skills, the voice interface is a crucial part of the process. It can be difficult to get your head around how to design something that sounds and feels natural and efficient.
The first thing to consider is how a user will invoke the Skill. Amazon’s invocation guidelines list a set of best-practice considerations for this, but during our prototyping and discussions with clients we’ve found it necessary to consider which type of interaction you expect your user to have with the device.
Skills can either be launched, which allows call-and-response style interaction, or they can be called directly. For example, ‘Alexa, start Test Skill’ vs ‘Alexa, ask Test Skill what it can do’.
The first phrase will launch the skill. Once the skill is launched, users can interact with Alexa without needing to say the Alexa keyword, and information can be retained between messages in the Skill session. With the direct call invocation, the skill ends as soon as the response is given. Consider the type of interaction you want your skill to support, as the interaction model will be quite different if you want it to sound natural.
The next step is to create the Interaction Model. This comes in three parts: the Intent Schema, Custom Slots and Sample Utterances. The intent schema is a JSON file that maps out a list of intents (think of these as methods), each of which can have any number of ‘slots’ used to draw information from the user’s request.
Figure 1 - an example of an Intent Schema
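As an illustrative sketch of what such a schema looks like, here is a minimal intent schema for a hypothetical colour-themed skill (the intent and slot names are invented for this example). In the developer console this is entered as plain JSON; it is shown here as a JavaScript object so the structure is easy to inspect.

```javascript
// A minimal Intent Schema: a list of intents, each optionally carrying slots.
// All names here are illustrative, not taken from a real skill.
const intentSchema = {
  intents: [
    {
      intent: "MyColorIsIntent",
      slots: [
        { name: "Color", type: "LIST_OF_COLORS" } // a custom slot type
      ]
    },
    { intent: "AMAZON.HelpIntent" } // a built-in intent, no slots needed
  ]
};

console.log(JSON.stringify(intentSchema, null, 2));
```

Each intent named here must have a matching handler in the Lambda function, and each slot type must be defined either by Amazon or as a custom slot.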
Next up are Custom Slots. These are data types that allow Alexa to home in on a specific list of potential variables in a request. Whilst these do not limit the data that can be passed into a request, they do give Alexa guidance as to what to expect. Amazon also has a series of built-in slot types you can use, for example first names and UK cities. Custom slots are optional, but are a powerful way to design your interaction model.
Figure 2 - an example of Custom Slot Types
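As a hedged sketch, a custom slot type is essentially a name plus a list of the values Alexa should expect to hear, entered one per line in the developer console (the type name and values below are invented for illustration):

```text
Type: LIST_OF_COLORS
Values:
green
red
blue
orange
purple
```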
Finally, there are the Sample Utterances. These are the phrases users will say to trigger the intents we will outline in our Lambda function. Amazon has a great set of guidelines on how to define these, but the key thing is to workshop them with another person, saying the variants out loud to really nail down the many ways a user might phrase a particular request. Having other people with you for this step is essential – if you haven’t tried it, you’d be surprised how difficult it is to work out how people actually talk!
Figure 3 - an example of Sample Utterances
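As an illustrative sketch, each sample utterance maps a spoken phrase to an intent, one per line, with slots in curly braces (the intent and slot names here are hypothetical):

```text
MyColorIsIntent my favourite colour is {Color}
MyColorIsIntent my favourite colour is the colour {Color}
MyColorIsIntent i like the colour {Color}
MyColorIsIntent {Color} is my favourite
```

Note how the same intent is listed repeatedly, once for each phrasing you expect users to try.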
These three elements tie together to form the Alexa voice services layer of your Skill – this is the blueprint for how a user will talk to and interact with your Skill through Alexa. Do not be afraid to refine this, especially after you have a chance to test it on a wider audience – Alexa skills are definitely an area where the more people you witness interacting with your Skill, the better.
I’m the brains around here!
Now we come to the power at the heart of Alexa Skills – AWS Lambda functions. Lambda functions are one part of Amazon’s wider cloud computing offering. They allow for event-driven processing in a serverless environment, and are primarily written in Node.js, Python and Java. Event-driven processing means that a ‘trigger’ from a variety of sources can call a Lambda function. In this case, we will be triggering our code via the Alexa voice services layer detailed above.
A lot of the boilerplate code for the Lambda function and its interaction with what is received from the voice services trigger is provided in some of the sample applications from Amazon, so take a look over those for an overview of how to structure your Lambda function.
Alexa voice services provide the information retrieved from the user in a set format, which allows the Lambda function to pick out the particular elements it needs for processing.
Figure 4 - an example of pulling Intent information
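As a hedged sketch of that set format, the event below mirrors the general shape of an IntentRequest as delivered to Lambda; the intent and slot names are invented for this example.

```javascript
// A cut-down example of the request JSON that the Alexa voice service
// delivers to the Lambda function (intent and slot names are illustrative).
const event = {
  request: {
    type: "IntentRequest",
    intent: {
      name: "MyColorIsIntent",
      slots: { Color: { name: "Color", value: "blue" } }
    }
  }
};

// Pull a single slot value out of the incoming request, if it is present.
function getSlotValue(event, slotName) {
  const intent = event.request.intent;
  const slot = intent.slots && intent.slots[slotName];
  return slot ? slot.value : undefined;
}

console.log(getSlotValue(event, "Color")); // "blue"
```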
The basic structure of an individual function consists of a few different aspects, including building up the ‘card’ that is sent to the Alexa app for reference, and building the speech outputs that Alexa will respond with.
Figure 5 - an example Function
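As an illustrative sketch in the style of Amazon’s Node.js samples (the handler name, card title and wording are assumptions, not taken from a real skill), a handler assembles the speech output and the card together into one response object:

```javascript
// Build the response envelope: spoken output plus a "card" for the Alexa app.
function buildResponse(speechText, cardTitle, shouldEndSession) {
  return {
    version: "1.0",
    response: {
      outputSpeech: { type: "PlainText", text: speechText },
      card: { type: "Simple", title: cardTitle, content: speechText },
      shouldEndSession: shouldEndSession
    }
  };
}

// A hypothetical intent handler: read the slot, build a reply.
function myColorIsHandler(slots) {
  const colour = slots.Color && slots.Color.value;
  const speech = colour
    ? "I now know your favourite colour is " + colour + "."
    : "I'm not sure what your favourite colour is. Please try again.";
  return buildResponse(speech, "Favourite Colour", false);
}

const response = myColorIsHandler({ Color: { value: "blue" } });
console.log(response.response.outputSpeech.text);
```

Keeping the envelope-building in one helper means every handler only has to worry about the words, not the JSON shape.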
Skills also maintain a session for the duration they are active, so information can be set, retrieved and passed into speech responses, allowing information to build up over a series of questions.
Figure 6 - an example of using Session within a skill to retrieve stored information
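As a hedged sketch of the mechanism (attribute names here are invented): values stored as session attributes in one response come back on the next request, so answers can accumulate across a conversation.

```javascript
// Store a value in the session attributes returned with a response.
function storeInSession(sessionAttributes, key, value) {
  const updated = Object.assign({}, sessionAttributes);
  updated[key] = value;
  return updated;
}

// Read a previously stored value off the next incoming request.
function recallFromSession(event, key) {
  const attrs = (event.session && event.session.attributes) || {};
  return attrs[key];
}

// First turn: remember the colour.
const attrs = storeInSession({}, "favouriteColour", "blue");
// Next turn: the stored attributes arrive back on the incoming event.
const nextEvent = { session: { attributes: attrs } };
console.log(recallFromSession(nextEvent, "favouriteColour")); // "blue"
```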
Lambda functions can also hook into a variety of external services, such as APIs or web endpoints, to post or pull information from outside your individual Skill, or to trigger processes. This essentially makes Alexa a routing engine, opening up a myriad of functionality to voice command.
Alexa, seems like you’re pretty smart
Alexa voice services are a powerful and intuitive way to build a voice framework, one which grants access to the full power of the Lambda toolkit. As Alexa grows and integrates with more devices around the smart home beyond the tower speakers, her reach will only expand. If you are interested in how Alexa and Echo can be utilised and integrated with your new or existing services, get in touch and we can work with you to build your custom Skills.