Homer Dudley made the first electronic voice synthesizer… in the 1930s!
I accidentally came across a Flash extension called Smartmouth that analyses an audio track, tries to match it to the mouth shapes of the animation, and arranges them automatically on the timeline for you. You can edit the result later and clean it up as well. This again got me thinking that it must be possible, it must be possible! I realised that the keywords I needed to search for might look more like: real-time phoneme recognition, vowel detection, temporal patterns.
This unearthed other blogs by people who had run into the same problem, including one on creating Software Voice Vowel Detection in ActionScript 3.0 with the help of SoundMixer.computeSpectrum.
At first I thought the solution might be something like how the above blog tackles it: getting the .readFloat value via the SoundMixer, finding out if there is a way to get hold of that number, and just matching it to the sound.
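To make the "just match the number to the sound" idea a bit more concrete, here is a minimal sketch in Python rather than ActionScript. It is not how the blog above does it and it does not use SoundMixer at all. It takes one short frame of samples, finds the two strongest frequency peaks, and picks the nearest entry in a small lookup table. Everything named here is an assumption for illustration: the sample rate, the frame size, the function names, and especially the `VOWEL_FORMANTS` table, whose numbers are only rough ballpark figures. The "voice" is also faked with two pure sine tones, which a real voice is not.

```python
import math

SAMPLE_RATE = 8000  # Hz (assumed for this sketch)
FRAME_SIZE = 256    # samples in one frame (assumed)

# Hypothetical lookup table: very rough (F1, F2) peak frequencies in Hz.
VOWEL_FORMANTS = {
    "ee": (270, 2290),
    "ah": (730, 1090),
    "oo": (300, 870),
}

def make_frame(freqs, n=FRAME_SIZE, rate=SAMPLE_RATE):
    """Fake one frame of 'voice' as a mix of pure sine tones."""
    return [
        sum(math.sin(2 * math.pi * f * t / rate) for f in freqs) / len(freqs)
        for t in range(n)
    ]

def magnitude_spectrum(frame):
    """Naive DFT: how strongly each frequency bin is present in the frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def two_strongest_peaks(mags, n=FRAME_SIZE, rate=SAMPLE_RATE):
    """Frequencies (Hz, ascending) of the two largest local maxima."""
    peaks = [
        k for k in range(1, len(mags) - 1)
        if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]
    ]
    top = sorted(peaks, key=lambda k: mags[k], reverse=True)[:2]
    return sorted(k * rate / n for k in top)

def guess_vowel(frame):
    """Nearest table entry to the frame's two peaks, or None if unclear."""
    found = two_strongest_peaks(magnitude_spectrum(frame))
    if len(found) < 2:
        return None
    f1, f2 = found
    return min(
        VOWEL_FORMANTS,
        key=lambda v: (VOWEL_FORMANTS[v][0] - f1) ** 2
        + (VOWEL_FORMANTS[v][1] - f2) ** 2,
    )

print(guess_vowel(make_frame([730, 1090])))
```

On clean synthetic tones like these the lookup works, but it only ever looks at one frozen frame, which is exactly the limitation of treating the sound as a single value.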
Perhaps that could work with just vowel sounds. But I guess a more complete approach would be to think of this as a temporal pattern. Speech sounds are signals that recur throughout the entire temporal sequence. These sounds can be summarised into patterns, and our goal is not just to pick at the “value” of the sound coming out at one instant, but to actually detect the patterns of phonemes, words and sounds all put together. Conceptually, the difference between the two approaches is that reading the readFloat value is akin to taking an eyedropper to sample the colour of one pixel, when actually it is not just one pixel but an entire image full of many, many pixels!
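The "pattern over time" idea can be sketched in a few lines of Python as well. This is only a toy under stated assumptions, not real phoneme recognition: the signal is chopped into short frames, each frame gets a crude label from its loudness (short-time energy) and how often it crosses zero (zero-crossing rate), and consecutive identical labels are then collapsed into a sequence of (label, number of frames). The labels, thresholds, frame size, function names and the synthetic demo signal are all made up for illustration.

```python
import math
import random
from itertools import groupby

SAMPLE_RATE = 8000  # Hz (assumed)
FRAME_SIZE = 200    # 25 ms per frame at 8 kHz (assumed)

def split_frames(signal, size=FRAME_SIZE):
    """Chop the signal into back-to-back frames, dropping any leftover tail."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def energy(frame):
    """Average squared amplitude: roughly how loud the frame is."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of neighbouring sample pairs that change sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

def label_frame(frame, energy_floor=1e-4, zcr_split=0.25):
    """Crude per-frame label; both thresholds are arbitrary assumptions."""
    if energy(frame) < energy_floor:
        return "silence"
    return "hiss" if zero_crossing_rate(frame) > zcr_split else "tone"

def temporal_pattern(signal):
    """Collapse per-frame labels into (label, run length in frames)."""
    labels = [label_frame(f) for f in split_frames(signal)]
    return [(label, len(list(run))) for label, run in groupby(labels)]

def make_demo_signal():
    """Synthetic stand-in for speech: silence, then a hum, then noise."""
    rng = random.Random(0)
    silence = [0.0] * 800
    hum = [0.5 * math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(1600)]
    noise = [rng.uniform(-0.5, 0.5) for _ in range(800)]
    return silence + hum + noise

print(temporal_pattern(make_demo_signal()))
# [('silence', 4), ('tone', 8), ('hiss', 4)]
```

The output is no longer a single number but a little timeline, which is much closer to the "whole image" than the "one pixel". Going from a timeline like this to actual phonemes is the genuinely hard part, and this sketch does not attempt it.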
(The visual metaphor is not fully appropriate either, since I know that images are quite different from sound, in that colour channels aren’t the same as sound channels, but the image came naturally to mind! I suppose this is another concept I have yet to wrap my head around: the way sound and colour channels aren’t to be added up, multiplied, or thought of in the same way!)
Practically speaking, though, with my limited programming ability, picking out the readFloat value on its own might not be the most meaningful approach, but coming up with something that can compute all the patterns is even more complicated!