Ans:   
 
====Long Term Vision====

This project aims to introduce speech as a system-wide mode of input, as an alternative to typing.

I have been working towards this goal for the past six months. The task can be accomplished by breaking the problem into the following smaller subsets and tackling them one by one:

# '''''Port an existing speech engine to less powerful computers like the XO.''''' (This has been part of my work so far. I chose Julius as the speech engine because it is lighter, written in C, and better suited to dictation-based activities. I have been able to compile Julius on the XO and am continuing to optimize it to make it work faster.)
# '''''Write a system service that takes speech as input, generates the corresponding keystrokes, and then proceeds as if the input had been given through the keyboard.''''' (Benjamin M. Schwartz suggested this as a simpler approach than writing a speech library in Python, which would use D-Bus to connect the engine to the activities and would require changing the existing activities to use the library.)
# '''''Start with recognition of the letters of a language rather than full-blown speech recognition.''''' This gives an achievable target for the summer of code. As the alphabet is limited to a small number of letters for most languages, this target is feasible in terms of both computational power requirements and attainable accuracy.
# '''''Demonstrate its use by applying it to activities like Listen and Spell, which can benefit immediately from this feature.''''' (See the benefits section below.)
# '''''Create acoustic models whose corpus is recorded by children and whose dictionary maps to the vocabulary of children, to improve recognition.''''' (I have been working on creating acoustic models for Indian English and Hindi. This part needs active community participation to bring in support for more languages; the Qt application can come in handy for anyone interested in contributing.)
# '''''Use the model in activities like Speak, and implement a dictation activity.'''''
# '''''Introduce a command mode.''''' This would be based on the system service mentioned in step 2 but would differ in its interpretation of speech, handling it as commands instead of a stream of characters.

====Proposal for GSoC 09====

The goals above are long-term, and some will need active participation from the community. I have already made progress on steps 1 and 3 (which can go on concurrently).

'''I propose to implement steps 2, 3 and 4 in GSoC. As the basic speech engine is working, these steps can be treated as independent of the later steps and will have immediate benefits:'''

# Writing a system service.
# Enabling recognition of characters.
# Demonstrating its use with activities like Listen and Spell.

'''The rest of this wiki page will refer to the steps proposed for GSoC 09 as "the project".'''

I. The Speech Service:

The speech service will be a daemon running in the background that can be activated to provide speech input to the Sugar interface. The user starts the daemon and triggers listening via a hotkey. The daemon transfers the audio to the Julius speech engine, processes the engine's output into a stream of keystrokes, and passes them as an input method to other activities. The generated text can be any Unicode character or string and is not restricted to X11 XKeyEvent data (which helps with non-English languages).

So the flow is:

                                        [Speech Engine]
                                              |
                                              V
                                   characters/words/phrases
                                              |
                                              V
                                     [Input Method Server]
                                              |
                                              V
                                       [Focused Window]

This can be done via simple calls to the X11 server. Here is a snippet of how that can be done (createKeyEvent is a small helper that fills in an XKeyEvent structure for a key press or release):
      <code>
      // display, root and the desired keycode are assumed to be in scope.
      // Find the window that currently has input focus.
      Window focused;
      int revert;
      XGetInputFocus(display, &focused, &revert);

      // Create the key-press event; keycodes are obtained from keysyms,
      // which can be named using the XK_ constants.
      XKeyEvent event = createKeyEvent(display, focused, root, True, keycode, 0);

      // Send the key press to the focused window.
      XSendEvent(display, focused, True, KeyPressMask, (XEvent *)&event);

      // Resend the event to emulate the key release.
      event = createKeyEvent(display, focused, root, False, keycode, 0);
      XSendEvent(display, focused, True, KeyReleaseMask, (XEvent *)&event);
      </code>

The above code sends one character to the window. This can be looped over the recognized text to generate a continuous stream (an even nicer way is to add a short delay between characters so that it looks like a typed stream), as in the sketch below.
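
A minimal sketch of that loop, reusing the window variables and the hypothetical createKeyEvent() helper from the snippet above (the 50 ms pause is an arbitrary choice):

      <code>
      #include <unistd.h>
      #include <X11/Xlib.h>

      // Send recognized text one character at a time, pausing briefly
      // between characters so the output resembles typing. Relies on
      // the hypothetical createKeyEvent() helper shown above.
      void type_text(Display *display, Window focused, Window root,
                     const char *text)
      {
          for (const char *p = text; *p != '\0'; p++) {
              // For Latin-1 characters the keysym equals the character code.
              KeyCode code = XKeysymToKeycode(display, (KeySym)(unsigned char)*p);
              XKeyEvent event = createKeyEvent(display, focused, root, True, code, 0);
              XSendEvent(display, focused, True, KeyPressMask, (XEvent *)&event);
              event = createKeyEvent(display, focused, root, False, code, 0);
              XSendEvent(display, focused, True, KeyReleaseMask, (XEvent *)&event);
              XFlush(display);
              usleep(50000);  // ~50 ms between characters
          }
      }
      </code>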

Similarly, a whole host of events can be catered to using X11. Command words like "Close" need not be passed through as text at all; the service can instead trigger the corresponding window-management action, such as asking the focused window to close (see the sketch below).
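
A minimal sketch of one such command: the conventional way to ask an X11 window to close is to send it a WM_DELETE_WINDOW client message.

      <code>
      #include <string.h>
      #include <X11/Xlib.h>

      // Ask the window manager to close the focused window by sending
      // it the standard WM_DELETE_WINDOW client message.
      void close_focused_window(Display *display, Window focused)
      {
          XClientMessageEvent ev;
          memset(&ev, 0, sizeof ev);
          ev.type = ClientMessage;
          ev.window = focused;
          ev.message_type = XInternAtom(display, "WM_PROTOCOLS", False);
          ev.format = 32;
          ev.data.l[0] = XInternAtom(display, "WM_DELETE_WINDOW", False);
          ev.data.l[1] = CurrentTime;
          XSendEvent(display, focused, False, NoEventMask, (XEvent *)&ev);
          XFlush(display);
      }
      </code>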

All of this needs to be wrapped in a single service that runs in the background, implemented as a Sugar feature so that the service can be started and stopped.

So the architecture will be:

                      Speech Engine ---> Service ---> Activity

Flow:

1. The user activates the speech service.
2. The service listens for audio while a particular key combination is pressed (a small popup/notification can be shown to indicate that the daemon is listening).
3. The service passes the spoken audio to the speech engine.
4. The output of the speech engine is grabbed and analyzed to determine what the input is.
5. If the input is a recognized 'word command', the service performs that command; otherwise it generates key events as mentioned above and sends them to the currently focused window.
6. This continues until the user deactivates the speech service.
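
A minimal sketch of the activation side (steps 1 and 2), assuming hypothetical start_listening()/stop_listening() hooks that connect the microphone to the engine; the choice of F9 as the hotkey is only an example:

      <code>
      #include <X11/Xlib.h>

      void start_listening(void);  // hypothetical: begin capturing audio
      void stop_listening(void);   // hypothetical: stop capture, idle the engine

      int main(void)
      {
          Display *display = XOpenDisplay(NULL);
          Window root = DefaultRootWindow(display);
          KeyCode hotkey = XKeysymToKeycode(display, XStringToKeysym("F9"));
          int listening = 0;

          // Grab the hotkey globally so the service sees it no matter
          // which activity has focus.
          XGrabKey(display, hotkey, AnyModifier, root, True,
                   GrabModeAsync, GrabModeAsync);
          XSelectInput(display, root, KeyPressMask);

          for (;;) {
              XEvent ev;
              XNextEvent(display, &ev);
              if (ev.type == KeyPress && ev.xkey.keycode == hotkey) {
                  listening = !listening;  // toggle on each press
                  if (listening)
                      start_listening();
                  else
                      stop_listening();
              }
          }
      }
      </code>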

This approach simplifies quite a few aspects and is efficient.

Firstly, speech recognition is a very CPU-intensive process. With this approach the speech engine need not run all the time; it is started only when required.

Secondly, the need for D-Bus is eliminated, as all of this can be done by generating X11 events; communication with Julius can be handled simply by running the engine as a child process and reading off its output, as sketched below.
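
A minimal sketch of that, assuming a placeholder configuration file dictation.jconf and a hypothetical handle_recognized_text() hook; in its normal mode Julius prints its best hypothesis on lines beginning with "sentence1:":

      <code>
      #include <stdio.h>
      #include <string.h>

      void handle_recognized_text(const char *text);  // hypothetical: feeds the key-event generator

      void read_julius_output(void)
      {
          // Run Julius as a child process and read its output line by line.
          FILE *julius = popen("julius -C dictation.jconf -quiet", "r");
          if (julius == NULL)
              return;

          char line[1024];
          while (fgets(line, sizeof line, julius) != NULL) {
              // The best hypothesis appears as: sentence1: <s> words </s>
              if (strncmp(line, "sentence1:", 10) == 0)
                  handle_recognized_text(line + 10);
          }
          pclose(julius);
      }
      </code>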

Thirdly, since this is just a service, any activity can use it without changing its code or importing a library. The speech daemon becomes just another keyboard (a virtual one).

II. Demonstrate its utility using Listen and Spell:

Beyond this, I would like to implement an activity that demonstrates the use of this service well. I plan to implement Speak Spell, a spelling activity where children can spell out the words shown to them. Single-character recognition can have very high recognition rates.

----

Q4: '''Convince us, in 5-15 sentences, that you will be able to successfully complete your project in the timeline you have described.'''

Ans:

====You and the community====