# '''''Port an existing speech engine to less powerful computers like the XO.''''' (This has been part of the work I have been doing so far. I chose Julius as the speech engine because it is lightweight and written in C. I have been able to compile Julius on the XO and am continuing to optimize it to make it work faster.)
# '''''Writing a system service that will take speech as input, generate the corresponding keystrokes, and then proceed as if the input had been given through the keyboard.''''' (This method was suggested by Benjamin M. Schwartz as a simpler approach than writing a speech library in Python, which would use DBUS to connect the engine to the activities; in that case the existing activities would have to be changed to use the library.)
# '''''Starting with recognition of the alphabet of a language rather than full-blown speech recognition.''''' This gives an achievable target for the initial stages: since the alphabet of most languages is limited to a small number of characters, the target is feasible in terms of both computational power requirements and attainable accuracy.
# '''''Introduce a command-and-control mode.''''' This would be based on the system service mentioned in step 2 but would differ in its interpretation of speech: it will handle speech as commands instead of as a stream of characters.
# '''''Demonstrating its use by applying it to activities like Listen and Spell, which can benefit immediately from this feature.''''' (See the benefits section below.)
# '''''Create acoustic models where the corpus is recorded by children and the dictionary maps to the vocabulary of children, to improve recognition.''''' (I have been working on creating acoustic models for Indian English and Hindi. This part needs active community participation to bring in support for more languages. A speech-collection activity can come in handy for anyone who is interested in contributing.)
# '''''Use the model in activities like Speak and implement a dictation activity.'''''
=====Proposal for GSoC 09=====
The above-mentioned goals are very long-term goals, and some of them will need active participation from the community. I have already made progress with steps 1 and 6 (these are continuous background tasks that help improve accuracy).
My deliverables for this Summer of Code are:
    
# Writing a system service that has support for recognition of characters and simple words
# Making a recording tool/activity so users can create their own language models and improve them for their own needs
       
'''I. The Speech Service:'''
The speech service will be a daemon running in the background that can be activated to provide input to the Sugar interface using speech. The daemon can be started and stopped by the user via a hotkey. It will transfer the audio to the Julius speech engine and process the engine's output to generate a stream of keystrokes, which are passed as input to other activities. Also, the generated text can be any Unicode character or string and will not be restricted to the XKeyEvent data of X11 (this helps with foreign languages).
    
So our flow is:

 speech input → Julius speech engine → output parsing → key-event generation (X11) → focused activity
Sending one character to the window takes only a few lines of Xlib code, as in the sketch below; this can be looped to generate a continuous stream (an even nicer way to do this would be to set a timer delay to make it look like a typed stream).
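A minimal sketch of one way to do this, using the XTest extension (an assumption; link with -lX11 -lXtst). It handles lowercase keys only, and the string and 50 ms delay are illustrative:

 /* Send each character of a string as a faked key press/release,
    with a small delay so it looks like a typed stream. */
 #include <unistd.h>
 #include <X11/Xlib.h>
 #include <X11/extensions/XTest.h>
 
 static void send_string(Display *dpy, const char *text)
 {
     for (const char *p = text; *p; p++) {
         char name[2] = { *p, '\0' };
         KeySym sym = XStringToKeysym(name);      /* e.g. "a" -> XK_a */
         KeyCode code = XKeysymToKeycode(dpy, sym);
         if (code == 0)
             continue;                            /* no keycode; skip */
         XTestFakeKeyEvent(dpy, code, True, 0);   /* press */
         XTestFakeKeyEvent(dpy, code, False, 0);  /* release */
         XFlush(dpy);
         usleep(50000);                           /* ~50 ms between keys */
     }
 }
 
 int main(void)
 {
     Display *dpy = XOpenDisplay(NULL);
     if (!dpy)
         return 1;
     send_string(dpy, "hello");
     XCloseDisplay(dpy);
     return 0;
 }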
Similarly, a whole host of events can be catered for using the X11 Input Server. Words like "Close" need not be parsed and broken into letters; they can directly trigger calls like XCloseDisplay().
    
All of this basically needs to be wrapped in a single service that can run in the background. That service can be implemented as a Sugar feature that the user can start and stop.
Flow:
# The user activates the speech service, which causes a toggle button to appear in the interface.
# The service listens for audio when a particular key combination is pressed (a small popup/notification can be shown to indicate that the daemon is listening).
# The service passes the spoken audio to the speech engine after the key combination is pressed again to stop the listening.
# The output of the speech engine is grabbed and analyzed to determine what the input is.
# If the input is a recognized 'Word Command', the service performs that command; otherwise it generates key events as mentioned above, which are sent to the currently focused window (a sketch of this dispatch follows the list).
# This continues until the user deactivates the speech service.
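A rough sketch of the dispatch in step 5, reusing the send_string() helper from the earlier sketch; the command table and do_close() are hypothetical placeholders:

 /* Treat known words as commands; type everything else as keystrokes. */
 #include <stddef.h>
 #include <strings.h>
 #include <X11/Xlib.h>
 
 void send_string(Display *dpy, const char *text);   /* from the earlier sketch */
 
 static void do_close(Display *dpy)
 {
     (void)dpy;   /* placeholder: close the focused window/activity */
 }
 
 static const struct command {
     const char *word;
     void (*run)(Display *dpy);
 } commands[] = {
     { "close", do_close },
     /* more word commands ... */
 };
 
 static void dispatch(Display *dpy, const char *heard)
 {
     for (size_t i = 0; i < sizeof commands / sizeof commands[0]; i++) {
         if (strcasecmp(heard, commands[i].word) == 0) {
             commands[i].run(dpy);   /* recognized 'Word Command' */
             return;
         }
     }
     send_string(dpy, heard);        /* otherwise, fall back to key events */
 }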
This approach will simplify quite a few aspects and will be efficient.
    
Firstly, speech recognition is a very CPU-consuming process. In the above approach the speech engine need not run all the time; it is initiated only when required. Julius can perform real-time recognition with up to a 60,000-word vocabulary, so that will not be a problem.
Secondly, the need for DBUS is eliminated, as all of this can be done by generating X11 events, and communication with Julius can be done simply by executing the process within the program itself and reading off its output.
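For instance, a minimal sketch of that pattern, assuming Julius is launched with a jconf configuration file and prints final hypotheses on "sentence1:" lines (the file name here is a placeholder):

 /* Run Julius as a child process and read recognition results
    from its stdout; no DBUS involved. */
 #include <stdio.h>
 #include <string.h>
 
 int main(void)
 {
     FILE *out = popen("julius -C julius.jconf -quiet", "r");
     if (!out)
         return 1;
 
     char line[512];
     while (fgets(line, sizeof line, out)) {
         /* final hypotheses look like: sentence1: <s> WORD ... </s> */
         if (strncmp(line, "sentence1:", 10) == 0)
             printf("heard:%s", line + 10);   /* hand off to the dispatcher */
     }
     pclose(out);
     return 0;
 }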
    
Thirdly, since this is just a service, any activity can use it without changing its code or importing a library. The speech daemon becomes just another keyboard (a virtual one).
'''II. Make a recording tool/activity so users can make their own language models and improve them for their own needs:'''
    
This tool will help users create new dictionary-based language models. They can use it to create language models in their own language and further extend the abilities of the service by training the speech recognition engine.
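For illustration, a Julius-style word dictionary maps each word to an output string and a phoneme sequence; the entries below are made-up examples:

 HELLO   [hello]   hh ah l ow
 WORLD   [world]   w er l d
 CLOSE   [close]   k l ow s

A recording activity would pair prompts like these with audio recorded by children, providing both the dictionary and the acoustic training data mentioned in the goals above.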