Monday, 16 July 2018

Apple makes the case that even its most banal features require a proficiency in machine learning

Amidst narratives of machine learning complacency, Apple is coming to terms with the fact that not talking about innovation means innovation never happened.

A detailed blog posting in the company’s machine learning journal makes public the technical effort that went into its “Hey Siri” feature — a capability so banal that I’d almost believe Apple was trying to make a point with highbrow mockery.

Even so, it’s worth taking the opportunity to explore exactly how much effort goes into the features that do, for one reason or another, go unnoticed. Here are five things that make the “Hey Siri” functionality (and competing offerings from other companies) harder to implement than you’d imagine, and commentary on how Apple managed to overcome the obstacles.

It couldn’t drain your battery and processor all day

At its core, the “Hey Siri” functionality is really just a detector. The detector listens for the phrase, ideally using far fewer resources than the entirety of server-based Siri. Still, it wouldn’t make a lot of sense for even this detector to tie up a device’s main processor all day.

Fortunately, the iPhone has a smaller “Always On Processor” that can be used to run detectors. For now, it isn’t feasible to smash a full-size deep neural network (DNN) onto such a small processor. So instead, Apple runs a tiny version of its DNN for recognizing “Hey Siri.”

When that model is confident it has heard something resembling the phrase, it calls in backup and has the captured signal analyzed by a full-size neural network. All of this happens in a split second, so quickly that you wouldn’t even notice it.
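To make the two-stage idea concrete, here is a minimal sketch in Python. It is not Apple’s implementation: the function name, the two thresholds and the stand-in scoring callables are all hypothetical, chosen only to show how a cheap first pass can gate an expensive second one.

```python
from typing import Callable, Sequence

# Hypothetical thresholds; the article does not give real values.
SMALL_THRESHOLD = 0.5  # permissive: stage 1 only decides whether to wake stage 2
LARGE_THRESHOLD = 0.9  # strict: the full-size model makes the final call

Scorer = Callable[[Sequence[float]], float]


def detect_wake_phrase(
    audio: Sequence[float],
    small_model: Scorer,
    large_model: Scorer,
) -> bool:
    """Return True only if both stages score the audio above their thresholds."""
    # Stage 1: cheap check, standing in for the tiny always-on detector.
    if small_model(audio) < SMALL_THRESHOLD:
        return False  # the expensive model is never invoked
    # Stage 2: the same captured signal is re-scored by the larger model.
    return large_model(audio) >= LARGE_THRESHOLD


if __name__ == "__main__":
    # Stand-in scorers (assumptions); real ones would be neural networks.
    print(detect_wake_phrase([0.0], lambda a: 0.8, lambda a: 0.95))
```

The point of the design is that the second scorer runs only when the first one fires, so most of the listening cost stays on the cheap path.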

All languages and ways of pronouncing “Hey Siri” had to be accommodated

Deep learning models are data-hungry and suffer from what’s called the cold start problem: the period when a model just hasn’t been trained on enough edge cases to be effective. To overcome this, Apple got crafty and pulled audio of users saying “Hey Siri” naturally and without prompting, before the Siri wake feature even existed. Yeah, I’m with you: it’s weird that people would attempt to have real conversations with Siri, but it’s crafty nonetheless.

These utterances were transcribed, spot-checked by Apple employees and combined with general speech data. The aim was to create a model robust enough to handle the wide range of ways in which people say “Hey Siri” around the world.

Apple had to account for the pause people place between “Hey” and “Siri” to ensure that the model would still recognize the phrase. At this point, it became necessary to bring other languages into the mix, adding examples to accommodate everything from French’s “Dis Siri” to Korean’s “Siri 야.”

It couldn’t get triggered by “Hey Seriously” and other similar but irrelevant terms

It’s obnoxious when you are using an Apple device and Siri activates without intentional prompting, pausing everything else — including music. The horror! To fix this, Apple had to get intimate with the voices of individual users.

When users set up “Hey Siri,” they say five phrases that each begin with “Hey Siri.” These examples are stored and mapped into a vector space by another specialized neural network. This space allows phrases said by different speakers to be compared. Phrases said by the same user tend to cluster together, and this can be used to minimize the likelihood that one person saying “Hey Siri” in your office will trigger everyone’s iPhone.
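Here is a rough sketch of how a comparison in that vector space could work. The article doesn’t say how Apple scores similarity, so the cosine measure, the averaging over stored examples, the 0.8 threshold and the function names below are all assumptions for illustration. The vectors themselves would come from the speaker network, which isn’t shown.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_enrolled_speaker(
    enrolled: Sequence[Sequence[float]],
    candidate: Sequence[float],
    threshold: float = 0.8,  # hypothetical cut-off, not a published value
) -> bool:
    """Accept the candidate if it is, on average, close to the stored examples."""
    if not enrolled:
        return False
    mean_score = sum(cosine_similarity(e, candidate) for e in enrolled) / len(enrolled)
    return mean_score >= threshold


if __name__ == "__main__":
    # Toy 2-D vectors standing in for stored enrollment phrases (assumption).
    owner = [[1.0, 0.0], [0.9, 0.1]]
    print(is_enrolled_speaker(owner, [1.0, 0.05]))  # a nearby vector
    print(is_enrolled_speaker(owner, [0.0, 1.0]))   # a distant vector
```

A vector that lands near the owner’s cluster passes, while one from a different region of the space is rejected, which is the behavior the office scenario calls for.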
