Silent speech interface

(see also my survey article on this subject)

Silent speech interfaces attempt to discern speech without any (or with very little) audible utterance. They’re a form of Voice-based interface.

I’m excited about these systems as a possible poor man’s Brain-computer interface. In the ideal case, they’d allow pervasive, unobtrusive command and textual input, and even a slow form of synthetic telepathy. More practically, they’re perhaps a way to resolve Reading texts on computers is unpleasant by bringing elements of the Dynamic medium to physical books.

A huge variety of sensing modalities is possible here, each with different trade-offs. Broadly, they target:

  • cortical and nervous system signals (EEG, iEEG)
  • motor neuron signals (surface EMG)
  • motion of lips, face, jaw, vocal tract (ultrasound, video, radar, Doppler, strain gauges, magnetic implants, etc)
  • acoustics (special microphones for whispers / murmurs)

Many of these approaches are invasive or highly obtrusive with current and foreseeable sensors, and I’m interested in consumer scenarios, so I’ll focus on non-invasive and relatively unobtrusive approaches.

The best routes currently seem to be (updated 2022-07-05):

  • Visual speech recognition (i.e. from video of the lips or face). Requires only cheap, widely-available equipment, though some systems use RGB-D cameras; see the first sketch after this list.
  • Sub-audible acoustic recognition
  • Electropalatography: capacitive sensing within the mouth
    • a dental retainer-like device—somewhat obtrusive: Kimura, N., Gemicioglu, T., Womack, J., Li, R., Zhao, Y., Bedri, A., Su, Z., Olwal, A., Rekimoto, J., & Starner, T. (2022). SilentSpeller: Towards mobile, hands-free, silent speech text entry using electropalatography. CHI Conference on Human Factors in Computing Systems.
    • “Live text entry speeds for seven participants averaged 37 words per minute at 87% accuracy.”
  • Surface electromyography, i.e. detection of motor neuron signals involved in articulation
    • These approaches all involve sensing apparatus which extends onto the jaw or face—i.e. obtrusive!
    • The most promising example I’ve seen here is AlterEgo.
  • Motion sensors (accelerometers etc., worn on the jaw):
    • This class has seen less progress so far, though the deep learning revolution may yet make the modality viable; the published state of the art still relies on SVMs and the like (for a toy version of that classic pipeline, see the second sketch after this list):
    • Khanna, P., Srivastava, T., Pan, S., Jain, S., & Nguyen, P. (2021). JawSense: Recognizing Unvoiced Sound using a Low-cost Ear-worn System. Proceedings of the 22nd International Workshop on Mobile Computing Systems and Applications, 44–49. https://doi.org/10.1145/3446382.3448363
    • Claims a 92% accuracy rate on the 9 most common phonemes
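
To make the “cheap, widely-available equipment” point concrete, here’s a minimal sketch of a visual speech recognition front end: extracting per-frame lip shapes from an ordinary webcam. It assumes OpenCV and Google’s MediaPipe Face Mesh (my choices for illustration, not anything prescribed by the systems above), and it stops where the interesting part begins—sorry, where the interesting part begins: the recognizer itself would be a sequence model over these lip-shape vectors.

```python
import cv2
import mediapipe as mp

# Lip landmark indices, derived from MediaPipe's own lip-connection set.
LIP_IDXS = sorted({i for edge in mp.solutions.face_mesh.FACEMESH_LIPS for i in edge})

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)
cap = cv2.VideoCapture(0)  # any cheap webcam

sequence = []  # one lip-shape feature vector per frame
for _ in range(90):  # ~3 seconds at 30 fps
    ok, frame = cap.read()
    if not ok:
        break
    result = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.multi_face_landmarks:
        lm = result.multi_face_landmarks[0].landmark
        # Flatten the normalized (x, y) coordinates of the lip landmarks.
        sequence.append([c for i in LIP_IDXS for c in (lm[i].x, lm[i].y)])

cap.release()
face_mesh.close()
# `sequence` is now a T x (2 * len(LIP_IDXS)) series of lip shapes: the kind
# of input a lip-reading sequence model (RNN, transformer, ...) would consume.
```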
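
In the same spirit, a toy version of the classic motion-sensor pipeline: window a jaw-worn accelerometer signal, compute hand-crafted time-domain features, and classify with an SVM. The feature set, window size, and random placeholder data here are illustrative assumptions of mine, not details taken from the JawSense paper; it assumes NumPy and scikit-learn.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def features(window):
    """Time-domain features for one (samples, 3-axis) accelerometer window."""
    return np.concatenate([
        window.mean(axis=0),                           # per-axis mean
        window.std(axis=0),                            # per-axis variability
        np.abs(np.diff(window, axis=0)).mean(axis=0),  # mean absolute jerk
    ])

# Placeholder data: 200 half-second windows at 100 Hz, 9 phoneme classes.
# Real data would come from a jaw-worn IMU, segmented around articulations.
rng = np.random.default_rng(0)
windows = rng.normal(size=(200, 50, 3))
labels = rng.integers(0, 9, size=200)

X = np.stack([features(w) for w in windows])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
scores = cross_val_score(clf, X, labels, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")  # ~chance on random data
```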

References

Apart from the specific systems above, this book offers a helpful survey:
Freitas, J., Teixeira, A., Dias, M. S., & Silva, S. (2017). An Introduction to Silent Speech Interfaces. Springer International Publishing.