Not that Sphinx; naming really is an important thing. Anyway, I tried to play with CMU Sphinx. Its pocketsphinx provides a Python binding, though with no real documentation. There are two modes of recognition: on the fly, or by block of data. With the default models, on-the-fly recognition gives useless results; I don't know if it can do better after a bit of training, but I have no idea how to do that either. Decoding a block of data gives acceptable results.
What actually caught my attention was gnome-voice-control. It does work, but it also crashes. I checked out the repository (I couldn't compile version 0.3) and installed sphinxbase 0.4.1 and pocketsphinx 0.5.1.
Since it crashes every time, I wanted to write a simple, similar tool in Python. Unfortunately, the result isn't good. Reducing the word bank, doing the word slicing ourselves, and decoding by block might improve the accuracy, but that is a lot of effort and I don't have much knowledge of speech recognition, so I stopped here.
I still put together some simple code, which uses pyalsaaudio 0.4 to capture audio. It records until you press Ctrl+C, then does the recognition, as in the sketch below.
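Here is a minimal sketch of that record-then-decode-by-block approach (Python 2, to match the pocketsphinx 0.5 era). It assumes pyalsaaudio 0.4's PCM capture API and the old pocketsphinx Python binding's Decoder with start_utt/process_raw/end_utt/get_hyp; the model paths are placeholders you would point at your own installation.

    # Sketch: capture from ALSA until Ctrl+C, then decode the whole
    # buffer as one block.
    import alsaaudio
    import pocketsphinx

    # The default acoustic models expect 16 kHz, 16-bit mono audio.
    pcm = alsaaudio.PCM(alsaaudio.PCM_CAPTURE)
    pcm.setchannels(1)
    pcm.setrate(16000)
    pcm.setformat(alsaaudio.PCM_FORMAT_S16_LE)
    pcm.setperiodsize(1024)

    chunks = []
    print 'Recording... press Ctrl+C to stop.'
    try:
        while True:
            length, data = pcm.read()
            if length > 0:
                chunks.append(data)
    except KeyboardInterrupt:
        pass

    # Placeholder model files; adjust to wherever your models live.
    decoder = pocketsphinx.Decoder(
        hmm='/usr/share/pocketsphinx/model/hmm/wsj1',
        lm='/usr/share/pocketsphinx/model/lm/wsj/wlist5o.3e-7.vp.tg.lm.DMP',
        dict='/usr/share/pocketsphinx/model/lm/wsj/wlist5o.dic')

    decoder.start_utt()
    decoder.process_raw(''.join(chunks), False, True)  # full_utt=True: one block
    decoder.end_utt()
    hyp, uttid, score = decoder.get_hyp()
    print 'Recognized:', hyp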
You can also try this Python script [1], which has a GUI and uses Sphinx's GStreamer plugin.
[1] http://www.speech.cs.cmu.edu/cmusphinx/moinmoin/GStreamer (the page is gone now).
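Since that page is gone, here is a rough sketch of what the GStreamer route looked like, assuming GStreamer 0.10's pygst bindings and the vader and pocketsphinx elements that the plugin provided; the element names, the result signal, and the configured property are from my memory of the old tutorial, so treat them as assumptions.

    # Sketch: on-the-fly recognition through the pocketsphinx GStreamer
    # element (GStreamer 0.10 era). 'vader' does voice activity
    # detection, so the decoder only sees speech segments.
    import gobject
    import pygst
    pygst.require('0.10')
    import gst

    def on_result(asr, text, uttid):
        # Fired once per utterance with the final hypothesis.
        print 'Recognized:', text

    pipeline = gst.parse_launch(
        'alsasrc ! audioconvert ! audioresample '
        '! vader name=vad auto-threshold=true '
        '! pocketsphinx name=asr ! fakesink')
    asr = pipeline.get_by_name('asr')
    asr.connect('result', on_result)
    asr.set_property('configured', True)  # start decoding

    pipeline.set_state(gst.STATE_PLAYING)
    gobject.MainLoop().run()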
The problem people usually encounter with on-the-fly recognition is that the first few things you speak to it are frequently poorly recognized. This is because it needs to collect some data in order to normalize the audio input for your particular microphone.
This is actually a bit different from training or speaker adaptation, something we hope to make easier soon...
@dhd Does that mean that if I keep the decoder running and don't call end_utt, it will do better?
How does it collect data, and what does it collect? While it is collecting, should we remain silent and only let background noise through the microphone?
If I call end_utt and then start_utt again, will the data collected in the previous session apply to the new session?
Could you describe what conditions or procedures would make on-the-fly recognition more accurate?
The easiest thing you can do is to use the -cmninit configuration parameter. This accepts a vector of numbers which is used to normalize the acoustic features. In batch mode this is estimated based on the entire utterance, which gives a pretty good estimate. But when recognizing on the fly it's based on past samples of audio, which means that it can start out quite inaccurate, and then converges to a good estimate over time.
In C, what you can do (and I think gnome-voice-control does this) is to save the estimated normalization vector from previous sessions and use it to initialize the decoder. This isn't available in the Python interface since it's buried in a few layers of API, unfortunately.
However if you look at the logging information that PocketSphinx prints out you will see something like:
CMN: 48.92 4.3 -0.4 ...
You can use these numbers (comma-separated) as the -cmninit argument. It will only be good for your specific microphone and sound card though.
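To make that concrete, here is a hedged sketch: grab the CMN vector from one run's log output and feed it back on the next run. The -cmninit parameter itself is real sphinxbase configuration; the log file name is hypothetical, and passing cmninit as a keyword to the Python Decoder is my assumption about how the 0.5 binding maps keyword arguments to parameters.

    # Sketch: reuse the CMN estimate from a previous run's log output.
    # A line like "CMN: 48.92 4.3 -0.4 ..." is parsed into the
    # comma-separated vector that -cmninit expects.
    import re
    import pocketsphinx

    def cmn_from_log(path):
        """Return the last logged CMN vector as a comma-separated string."""
        vec = None
        for line in open(path):
            m = re.search(r'CMN: ([-\d. ]+)', line)
            if m:
                vec = ','.join(m.group(1).split())
        return vec

    cmninit = cmn_from_log('pocketsphinx.log')  # hypothetical log file name
    decoder = pocketsphinx.Decoder(
        hmm='/usr/share/pocketsphinx/model/hmm/wsj1',  # placeholder paths
        lm='/usr/share/pocketsphinx/model/lm/wsj/wlist5o.3e-7.vp.tg.lm.DMP',
        dict='/usr/share/pocketsphinx/model/lm/wsj/wlist5o.dic',
        cmninit=cmninit)  # assumption: kwargs map to -parameters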
Hi, I am Harsha.
I am working on PocketSphinx. I have installed an4, SphinxTrain, sphinxbase, and pocketsphinx on my Linux machine,
but I don't know what to do next or how to get speech recognition going. Do I need to write drivers for the mic (audio)? Please help me.
Thanks in advance :)
Have you checked out Sphinx's sample code? Do what it does.
Thank you very much. I am working on that... will get back to you soon :)
We are trying to use PocketSphinx to place text into a website text box on a mobile phone when a person speaks a word. We're getting nowhere. Can PocketSphinx do this, and if so, how?