The diagram below summarizes the discussion at the AGL Santa Clara F2F (Sept 2018) about wakeword handling. The feasibility of the proposed flow has not yet been established.
Some of the open questions:
- How do we arbitrate control of the audio buffer between voice agents so that voice agent X can access it only when it is supposed to? (Currently: the startListening API sent by VSHL.)
- Different voice agents may have different requirements for the duration of silence before speech that they need for ASR calibration. We need a configuration mechanism for that.
- Wakeword detection, caching, and voice agent ASR recognition happen in three separate processes. How do we make sure all three processes agree on the buffer position? For example, ahl-softmixer needs to know the exact wakeword position so that the wakeword is not included when ASR recognition begins.
- How do we accommodate voice barge-in in this scenario?
- Do we need to accommodate the scenario in which a voice agent also needs access to the uttered wakeword as part of the cached buffer?
- The event subscription flow and other definitions need to be formalized.
- We need to decide whether it is safe for the wakeword engine to close the audio buffer on wakeword detection without the risk of ahl-softmixer dropping audio packets.
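
To make the buffer-position and access-control questions more concrete, here is a minimal Python sketch of one possible approach. It was not agreed at the F2F. It assumes that every process refers to audio by an absolute sample index (the count of samples written since capture started), that the wakeword event carries the wakeword's start and end indices, and that a single granted agent may read the cache. The names `WakewordEvent`, `CachedAudioBuffer`, `grant`, and `read_from` are hypothetical and are not part of VSHL or ahl-softmixer.

```python
from dataclasses import dataclass


@dataclass
class WakewordEvent:
    """Hypothetical event payload; positions are absolute sample indices."""
    agent_id: str
    start: int  # index of the first wakeword sample
    end: int    # index one past the last wakeword sample


class CachedAudioBuffer:
    """Ring buffer addressed by absolute sample index, not by ring offset."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = [0] * capacity
        self.write_pos = 0         # total number of samples written so far
        self.granted_agent = None  # the only agent currently allowed to read

    def write(self, samples):
        for sample in samples:
            self.data[self.write_pos % self.capacity] = sample
            self.write_pos += 1

    def grant(self, agent_id):
        """Stand-in for the startListening step (assumed, not the real API)."""
        self.granted_agent = agent_id

    def read_from(self, agent_id, position):
        """Return all cached samples from `position` up to the write position."""
        if agent_id != self.granted_agent:
            raise PermissionError(f"{agent_id} has not been granted the buffer")
        oldest = max(0, self.write_pos - self.capacity)
        if position < oldest:
            raise ValueError("requested audio has already been overwritten")
        return [self.data[i % self.capacity]
                for i in range(position, self.write_pos)]


# Usage: each sample value equals its own index, so results are easy to check.
buf = CachedAudioBuffer(capacity=16)
buf.write(range(10))   # samples 0..9; pretend samples 3..6 are the wakeword
event = WakewordEvent(agent_id="agent-x", start=3, end=7)
buf.grant(event.agent_id)
buf.write(range(10, 14))  # speech keeps arriving after detection

# Reading from event.end skips the wakeword; event.start would include it.
speech = buf.read_from("agent-x", event.end)
```

Because the position is absolute, the three processes do not need to share ring offsets. The agent can read from `event.start` instead if it needs the wakeword itself. The overwrite check is a reminder that the cache must be large enough to cover the delay between detection and the start of ASR.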