# '''''Writing a system service that will take speech as input, generate the corresponding keystrokes, and then proceed as if the input had been given through the keyboard.''''' This method was suggested by Benjamin M. Schwartz as a simpler approach than writing a speech library in Python (which would use DBUS to connect the engine to the activities), in which case changes would have to be made to the existing activities to use the library.
# '''''Starting with recognition of the alphabet of a language rather than full-blown speech recognition.''''' This gives an achievable target for the initial stages. As the alphabet is limited to a small set for most languages, this target is feasible in terms of both computational power requirements and attainable accuracy.
# '''''Introducing a command-and-control mode.''''' This would be based on the system service mentioned above, but would differ in its interpretation of speech: it would handle speech as commands instead of a stream of characters.
# '''''Demonstrating its use by applying it to activities like Listen and Spell, which can benefit immediately from this feature.''''' (See the benefits section below.)
# '''''Creating acoustic models where the corpus is recorded by children and the dictionary maps to the vocabulary of children, to improve recognition.''''' (I have been working on creating acoustic models for Indian English and Hindi. This part needs active community participation to bring in support for more languages. A speech collection activity can come in handy for anyone who is interested in contributing.)
# '''''Using the model in activities like Speak and implementing a dictation activity.'''''

My deliverables for this Summer of Code are:

# Writing a system service that supports recognition of characters, and demonstrating that it works by running it with Listen and Spell.
# Introducing modes in the system service: dictation mode will process input as a stream of characters as described in deliverable 1, and a new mode called command mode will process the audio input to recognize a known set of commands.
# Making a recording tool/activity so that users can create their own models and improve them for their own needs.
       
'''I. The Speech Service:'''

The speech service will be a daemon running in the background that enables speech recognition. The daemon can be activated by the user; when activated, a toggle switch appears on the Sugar frame, and speech recognition can be started and stopped via a hotkey or the toggle button. When on, in dictation mode the daemon transfers the audio to the Julius speech engine and processes its output to generate a stream of keystrokes ("hello" becomes "h" "e" "l" "l" "o"), which is passed as input to other activities. The generated text can be any Unicode character or text and is not restricted to the XKeyEvent data of X11 (this helps with foreign languages). In command mode, the input is matched against a set of pre-decided commands and the corresponding event is generated.

So our flow is:

                          Audio
                            |
                            V
                     [Speech Engine]
                            |
                            V
                     [System Service]
                            |
       (command mode) ______|______ (dictation mode)
             |                            |
             V                            V
     [Recognize Command]        [Input Method Server]
             |                            |
             V                            V
      [Execute Command]           [Focused Window]

Dictation Mode:
    
This can be done via simple calls to the X11 Server. Here is a snippet of how that can be done.
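
A minimal sketch of one way to do this from Python, assuming the python-xlib bindings and the XTest extension are available:

<pre>
from Xlib import X, XK, display
from Xlib.ext import xtest

d = display.Display()

# Look up the keycode for the character to be "typed".
keysym = XK.string_to_keysym("h")
keycode = d.keysym_to_keycode(keysym)

# Fake a key press and release, as if the user had typed the key;
# XTest delivers the event to the currently focused window.
xtest.fake_input(d, X.KeyPress, keycode)
xtest.fake_input(d, X.KeyRelease, keycode)
d.sync()
</pre>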

The above code will send one character to the window. This can be looped to generate a continuous stream (an even nicer way would be to set a timer delay between keystrokes so the output looks like a typed stream).

Command Mode:

Similarly, a whole host of events can be catered to using the X11 Input Server. Words like "Close" (which will be defined in the list of commands that the engine will recognize) need not be parsed and broken into letters; they can simply be sent as events such as XCloseDisplay().
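
As a sketch of how command mode might dispatch recognized words, again assuming python-xlib (the command list and key bindings here are illustrative assumptions, not a fixed design):

<pre>
from Xlib import X, XK, display
from Xlib.ext import xtest

d = display.Display()

def press(key_name, modifier=None):
    """Fake a (modifier+)key press on the focused window."""
    names = ([modifier] if modifier else []) + [key_name]
    codes = [d.keysym_to_keycode(XK.string_to_keysym(n)) for n in names]
    for code in codes:
        xtest.fake_input(d, X.KeyPress, code)
    for code in reversed(codes):
        xtest.fake_input(d, X.KeyRelease, code)
    d.sync()

# Map each recognized command word to an action.
COMMANDS = {
    "close": lambda: press("q", modifier="Control_L"),  # hypothetical binding
    "escape": lambda: press("Escape"),
}

def handle_command(word):
    action = COMMANDS.get(word.lower())
    if action is not None:
        action()
</pre>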
 
    
All of this basically needs to be wrapped in a single service that can run in the background. That service can be implemented as a Sugar Feature that enables starting and stopping the service.

Flow:

# The user activates the Speech Service, which makes the system capable of handling speech input. A mode menu and a toggle button appear in the interface.
# The mode menu selects whether the system service should treat the input audio as commands or as a stream of characters.
# The speech engine starts and takes audio input when the toggle button is pressed or a hotkey combo is pressed.
# Speech input processing starts, and the user's speech is converted to keystrokes and sent to the focused application if in dictation mode, or executed as a command if in command mode (see the sketch after this list).
# The speech engine stops listening when the toggle button or the key combo is pressed again.
# This continues until the user deactivates the Speech Service.
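
A minimal skeleton of that loop, with placeholder functions standing in for the engine and the two handlers sketched above (all names here are assumptions for illustration):

<pre>
import time

listening = False          # flipped by the toggle button / hotkey
mode = "dictation"         # or "command", chosen from the mode menu

def on_toggle():
    global listening
    listening = not listening

def recognize():
    """Placeholder for a blocking call into the speech engine."""
    return "hello"

def type_char(ch):
    """Placeholder for the XTest keystroke code shown earlier."""
    print("typing", ch)

def handle_command(word):
    """Placeholder for the command dispatch shown earlier."""
    print("command:", word)

def service_loop():
    while True:
        if listening:
            text = recognize()
            if mode == "dictation":
                for ch in text:
                    type_char(ch)      # one keystroke per character
            else:
                handle_command(text)
        else:
            time.sleep(0.1)            # idle; the engine is not running
</pre>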
    
This approach will simplify quite a few aspects and will be efficient.

Firstly, speech recognition is a very CPU-intensive process, but with this approach the speech engine need not run all the time; it is initiated only when required. The Julius speech engine can perform real-time recognition with up to a 60,000-word vocabulary, so that will not be a problem.
    
Secondly, the need for DBUS is eliminated, as all of this can be done by generating X11 events; communication with Julius can be handled simply by executing the process within the program itself and reading off its output.
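
For example, a minimal sketch of running Julius as a child process and reading its output, assuming a configuration file named julius.jconf (the exact output format may vary between Julius versions):

<pre>
import subprocess

proc = subprocess.Popen(
    ["julius", "-C", "julius.jconf", "-quiet"],
    stdout=subprocess.PIPE,
    text=True,
)

for line in proc.stdout:
    # Recognition results are printed on lines such as:
    #   sentence1: <s> HELLO </s>
    if line.startswith("sentence1:"):
        words = line.split()[2:-1]   # drop "sentence1:", "<s>" and "</s>"
        print("recognized:", " ".join(words))
</pre>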

Thirdly, since this is just a service, any activity can use it without changing its code or importing a library. The speech daemon becomes just another keyboard, albeit a virtual one.
    
Once this is done, it can be tested on any activity (say, Listen and Spell) to demonstrate its use.

Major Components:

# A language model browser which shows all the current samples and the dictionary, and can create new ones or delete existing ones.
# The ability to edit/record new samples, input new dictionary entries, and save changes.
    
The recording will be done via <code>arecord</code> and <code>aplay</code>, which are good enough for recording speech samples.
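
A minimal sketch of how the tool might invoke them, assuming 16-bit mono audio at 16 kHz (a common choice for speech corpora, not a requirement):

<pre>
import subprocess

def record_sample(path, seconds=5):
    subprocess.run([
        "arecord",
        "-f", "S16_LE",       # 16-bit little-endian samples
        "-r", "16000",        # 16 kHz sampling rate
        "-c", "1",            # mono
        "-d", str(seconds),   # stop after this many seconds
        path,
    ], check=True)

def play_sample(path):
    subprocess.run(["aplay", path], check=True)

record_sample("sample01.wav")
play_sample("sample01.wav")
</pre>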

'''Second Week:'''

 
* Complete writing the wrapper.
* Implement a Sugar UI feature for enabling/disabling the Speech Service.
    
'''Third Week:'''
 
* Hook up the UI, the service, and the speech engine.
* Wrap up for mid-term evaluations and test the model for accuracy on letters and spoken commands.
* Test this tool on Listen and Spell and tweak out any problems.
* Get feedback from the community.
    
'''Fourth Week:'''

* Implement the mode menu.
* Add command mode.
      Line 200: Line 211:     
'''Fifth Week:'''

* Complete the interface.
* Start writing code for the model browser and recorder.
    
'''Sixth Week:'''