USpeak

From Sugar Labs
Revision as of 08:57, 28 March 2009 by Mavu (talk | contribs)

This page is under construction


About you

Q1: What is your name?

Ans: Komaragiri Satya


Q2: What is your email address?

Ans: satya[DOT]komaragiri[AT]gmail[DOT]com


Q3: What is your Sugar Labs wiki username?

Ans: Mavu


Q4: What is your IRC nickname?

Ans: mavu


Q5: What is your primary language? (We have mentors who speak multiple languages and can match you with one of them if you'd prefer.)

Ans: Primary language for e-mails, IRC, Blogs: English. Also fluent in Hindi, Tamil, Telugu


Q6: Where are you located, and what hours do you tend to work? (We also try to match mentors by general time zone if possible.)

Ans: I am located in New Delhi, India (UTC+5:30). When I am not in college, I work from late morning (around 10-11 AM IST) to evening (around 4:30-5 PM IST), and again from late evening (7-7:30 PM IST) until late at night (1-2 AM IST, and later if need be), so time zones will not be a problem.


Q7: Have you participated in an open-source project before? If so, please send us URLs to your profile pages for those projects, or some other demonstration of the work that you have done in open-source. If not, why do you want to work on an open-source project this summer?

Ans: Yes, I was introduced to Open Source through GSoC last year where I worked on Bootlimn: Extending Bootchart to use Systemtap for The Fedora Project. ( http://code.google.com/p/bootlimn/ ) or ( http://code.google.com/p/google-summer-of-code-2008-fedora/ ). I am currently working on the following projects.

  1. Introducing Speech Recognition in OLPC and making a dictation activity. ( http://wiki.laptop.org/go/Speech_to_Text )
  2. Introducing Java Profiling in Systemtap. (A work-from-home internship for Red Hat Inc.) I have been working on this project for the past 4 months, most of which went into extensive research. Coding has just begun.
  3. A sentiment analysis project for Indian financial markets. (My B. Tech major project that I plan to release under GPLv2.) I can put up the source code on https://blogs-n-stocks.dev.java.net/ after mid-April when I am done with my final evaluations in my college.



About your project

Q1: What is the name of your project?

Ans: USpeak


Q2: Describe your project in 10-20 sentences. What are you making? Who are you making it for, and why do they need it? What technologies (programming languages, etc.) will you be using?

Ans:

Long Term Vision:

This project aims at introducing speech as an alternative to typing as a system-wide mode of input.

I have been working towards achieving this goal for the past 6 months. The task can be accomplished by breaking the problem into the following smaller subsets and tackling them one by one:

  1. Port an existing speech engine to less powerful computers like the XO. (This has been part of the work I have done so far. I chose Julius as the speech engine because it is lighter, written in C, and better suited to dictation-based activities. I have been able to compile Julius on the XO and am continuing to optimize it to make it run faster.)
  2. Write a system service that takes speech as input, generates the corresponding keystrokes, and then proceeds as if the input had come from the keyboard. (Benjamin M. Schwartz suggested this method as a simpler approach than writing a speech library in Python that uses DBUS to connect the engine to the activities, in which case existing activities would have to be changed to use the library.)
  3. Start with recognition of the letters of a language's alphabet rather than full-blown speech recognition. This gives an achievable target for the Summer of Code: because the alphabet of most languages is small, the target is feasible in terms of both computational power requirements and attainable efficiency.
  4. Introduce a command mode. This would be based on the system service mentioned in step 2 but would differ in how it interprets speech: it will handle speech as commands instead of as a stream of characters.
  5. Demonstrate its use by applying it to activities like Listen and Spell, which can benefit immediately from this feature. (See the benefits section below.)
  6. Create acoustic models where the corpus is recorded by children and where the dictionary maps to the vocabulary of children to improve recognition. (I have been working on creating acoustic models for Indian English and Hindi. This part needs active community participation to bring in support for more languages. The Qt application can come in handy for anyone who is interested in contributing.)
  7. Use the model in activities like Speak and implement a dictation activity.


Proposal for GSoC 09

The above mentioned goals are very long term goals and some of those will need active participation from the community. I have already made progress with Steps 1 and 6 (and these are continuous tasks in the background to help improve the accuracy).

I propose to implement steps 2, 3, 4 and 5 in GSoC. As the basic speech engine is working, these steps can be treated as independent of the other tasks and will have immediate benefits. In short, the work consists of:

  1. Writing a system service that has support for recognition of characters and simple words
  2. Demonstrating its use with activities like Listen and Spell.

I. The Speech Service:

The speech service will be a daemon running in the background that provides speech input to the Sugar interface. The user activates the daemon, and listening can then be started and stopped via a hotkey. The daemon will pass the audio to the Julius speech engine and process the engine's output into a stream of keystrokes, which are delivered to other activities through an input method. The generated text can be any Unicode character or string and is not restricted to the XKeyEvent data of X11, which helps with languages other than English.

So our flow is:

                                       [Speech Engine]
                                              |
                                              |
                                              V
                                  Characters/Words/Phrases
                                              |
                                              |  
                                              V 
                                       [System Service]
                                              |
                                              |  
                                              V 
                                    [Input Method Server]
                                              |
                                              |
                                              V
                                      [Focused Window]

This can be done via simple calls to the X11 Server. Here is a snippet of how that can be done.

// Get the currently focused window.
XGetInputFocus(...); 
// Build a KeyPress event (createKeyEvent is a helper to be written, not an Xlib call).
XKeyEvent event = createKeyEvent(...);
// Send the key press. The keycode can be looked up from an XK_ keysym constant with XKeysymToKeycode().
XSendEvent(...);
// Build and send the matching KeyRelease event to emulate the key release.
event = createKeyEvent(...);
XSendEvent(...);
                      

The above code will send one character to the window. It can be run in a loop to generate a continuous stream. (An even nicer way to do this would be to set a timer delay between characters so that it looks like a typed stream.)
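The loop with a delay could be sketched as follows. This is only an illustration in Python, not the service's actual code: `type_text` and the `send_key` callback are hypothetical names, and `send_key` stands in for whatever finally wraps the XSendEvent() call.

```python
import time
from typing import Callable

# Hypothetical callback type: (character, pressed) -> None.
# In the real service this is where the XSendEvent() call would go.
KeySender = Callable[[str, bool], None]


def type_text(text: str, send_key: KeySender, delay: float = 0.05,
              sleep: Callable[[float], None] = time.sleep) -> None:
    """Send a press/release pair for each character, pausing between keys."""
    for index, char in enumerate(text):
        send_key(char, True)    # key press
        send_key(char, False)   # key release
        if index < len(text) - 1:
            sleep(delay)        # pause so the stream looks typed


if __name__ == "__main__":
    type_text("cat",
              lambda c, pressed: print(c, "press" if pressed else "release"),
              delay=0.01)
```

Passing `sleep` as a parameter keeps the pacing logic testable without real waiting.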

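Similarly, a whole host of events can be handled through the X11 server. Words like "Close" need not be broken into letters; the service can act on them directly, for example by asking the focused window to close with a WM_DELETE_WINDOW client message. (XCloseDisplay() itself only closes the service's own connection to the X server, so it is not the right call for this.)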

All of this basically needs to be wrapped in a single service that can run in the background. That service can be implemented as a Sugar Feature that enables starting and stopping of this service.

Flow:

  1. User activates the Speech Service
  2. The service listens for audio when a particular key combo is pressed (a small popup/notification can be shown to show that the daemon is listening).
  3. The service passes the spoken audio to the speech engine after the key combo is pressed again to stop listening.
  4. The output of the speech engine is captured and analyzed to determine what the input is.
  5. If the input is a recognized 'Word Command', the service performs that command; otherwise it generates key events as described above and sends them to the currently focused window.
  6. This continues until the user deactivates the Speech Service.
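The flow above can be modelled as a small state machine. The sketch below is a toy model under stated assumptions: the class `SpeechService` and the `recognize`, `commands` and `send_text` parameters are hypothetical stand-ins for the speech engine, the command table and the key-event generator.

```python
from typing import Callable, Dict, List


class SpeechService:
    """Toy model of the activate / listen / dispatch cycle (names hypothetical)."""

    def __init__(self, recognize: Callable[[bytes], str],
                 commands: Dict[str, Callable[[], None]],
                 send_text: Callable[[str], None]) -> None:
        self.recognize = recognize    # stands in for the speech engine
        self.commands = commands      # recognized 'Word Commands'
        self.send_text = send_text    # stands in for key-event generation
        self.active = False
        self.listening = False
        self._audio: List[bytes] = []

    def activate(self) -> None:
        self.active = True

    def deactivate(self) -> None:
        self.active = False
        self.listening = False
        self._audio.clear()

    def feed_audio(self, chunk: bytes) -> None:
        # Audio is only buffered while the service is listening.
        if self.listening:
            self._audio.append(chunk)

    def hotkey(self) -> None:
        """First press starts listening; second press stops and dispatches."""
        if not self.active:
            return
        if not self.listening:
            self.listening = True
            self._audio.clear()
            return
        self.listening = False
        text = self.recognize(b"".join(self._audio)).strip()
        action = self.commands.get(text.lower())
        if action is not None:
            action()                  # step 5: a recognized word command
        elif text:
            self.send_text(text)      # step 5: otherwise, key events
```

Keeping the engine and the key-event sender behind plain callables means the dispatch logic can be exercised without Julius or an X server.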

This approach will simplify quite a few aspects and will be efficient.

Firstly, speech recognition is very CPU-intensive. In the above approach the speech engine need not run all the time; it is started only when required. Julius can perform real-time recognition with a vocabulary of up to 60,000 words, so vocabulary size will not be a problem.

Secondly, the need for DBUS is eliminated, as all of this can be done by generating X11 events, and communication with Julius can be done simply by running the Julius process from within the service and reading its output.

Thirdly, since this is just a service, any activity can use it without changing its code or importing a library. The speech daemon becomes just another keyboard (a virtual one).
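The second point, running Julius as a child process and reading its output, might look roughly like the sketch below. It assumes that the final hypothesis appears on lines starting with `sentence1:` and that `<s>` and `</s>` mark silence; both should be checked against the Julius version in use. The function names are illustrative only.

```python
import subprocess
from typing import Iterable, Iterator, List

SENTENCE_PREFIX = "sentence1:"    # assumed marker for the final hypothesis
SILENCE_TOKENS = {"<s>", "</s>"}  # assumed silence words to drop


def parse_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the word list of each final hypothesis found in the output."""
    for line in lines:
        line = line.strip()
        if line.startswith(SENTENCE_PREFIX):
            words = line[len(SENTENCE_PREFIX):].split()
            yield [w for w in words if w not in SILENCE_TOKENS]


def recognize(command: List[str]) -> Iterator[List[str]]:
    """Run the engine as a child process and parse its stdout as it arrives."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        assert proc.stdout is not None
        yield from parse_sentences(proc.stdout)
```

The command line is left to the caller, for example something like `["julius", "-C", "alphabet.jconf", "-input", "mic"]`, where the configuration file name is a placeholder.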


II. Demonstrate its utility using Listen and Spell:

Beyond this, I would like to implement an activity that demonstrates the use of this service well. I plan to implement Speak Spell, a spelling activity in which children spell out the words shown to them. Single-character recognition can achieve very high recognition rates.
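The core check of such a spelling activity could be as simple as the following sketch. It assumes the speech service delivers one recognized letter at a time; `check_spelling` and `is_correct` are hypothetical helper names.

```python
from typing import List, Sequence


def check_spelling(target: str, letters: Sequence[str]) -> List[bool]:
    """Return one flag per target letter: True where the spoken letter matches."""
    spoken = [letter.strip().lower() for letter in letters]
    expected = list(target.lower())
    return [i < len(spoken) and spoken[i] == ch
            for i, ch in enumerate(expected)]


def is_correct(target: str, letters: Sequence[str]) -> bool:
    """True only if every letter matches and no extra letters were spoken."""
    return len(letters) == len(target) and all(check_spelling(target, letters))
```

The per-letter flags let the activity highlight exactly which letter was wrong instead of only reporting success or failure.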


III. Technologies used:

I will be using (and have been using) Julius as the speech recognition tool. Julius is suited to both dictation (continuous speech recognition) and command and control. A grammar-based recognition parser named "Julian", a version of Julius modified to use a hand-designed DFA grammar as its language model, is integrated into Julius. This makes it well suited to small-vocabulary voice command systems and to various spoken dialog tasks.

The coding will be done in C, shell scripts and Python. Recording will be done on an external computer, and the compiled model will be stored on the XO. I own an XO because of my previous efforts, so I plan to work natively on it and test the performance in real time.
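For a single-letter task, the hand-designed grammar is tiny. The sketch below generates the text of a grammar file and a vocabulary file in the layout I understand the Julius grammar tools to expect (category rules in one file, `% CATEGORY` sections with word and phoneme entries in the other). The phoneme labels are placeholders and must match the acoustic model actually used; `build_grammar` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Placeholder pronunciations: real entries must use the phoneme labels
# of whichever acoustic model is in use.
LETTER_PHONES: Dict[str, List[str]] = {
    "a": ["ey"],
    "b": ["b", "iy"],
    "c": ["s", "iy"],
}


def build_grammar(letters: Dict[str, List[str]],
                  silence: str = "sil") -> Tuple[str, str]:
    """Return (grammar text, vocabulary text) for a one-letter-per-utterance task."""
    # One sentence pattern: leading silence, a single letter, trailing silence.
    grammar = "S : NS_B LETTER NS_E\n"
    voca_lines = ["% NS_B", f"<s> {silence}",
                  "% NS_E", f"</s> {silence}",
                  "% LETTER"]
    for word, phones in sorted(letters.items()):
        voca_lines.append(f"{word} {' '.join(phones)}")
    return grammar, "\n".join(voca_lines) + "\n"


if __name__ == "__main__":
    grammar_text, voca_text = build_grammar(LETTER_PHONES)
    print(grammar_text)
    print(voca_text)
```

Generating the files from a table keeps it easy to add letters, or a second language, without editing the grammar by hand.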



Q3: What is the timeline for development of your project? The Summer of Code work period is 7 weeks long, May 23 - August 10; tell us what you will be working on each week.

Ans:


Q4: Convince us, in 5-15 sentences, that you will be able to successfully complete your project in the timeline you have described.

Ans: I have been working on speech recognition for the XO since November last year. My research has helped me understand the requirements of this project. I have made some progress, as shown at http://wiki.laptop.org/go/Speech_to_Text , which will help me in this project. I am also familiar with the development environment.

Apart from this, I have worked on a few real-life projects (some open source, as mentioned above, and an internship at HCL Infosystems), including one GSoC project for Fedora, which taught me how to work within a stipulated timeframe and accomplish the task.



You and the community

Q1: If your project is successfully completed, what will its impact be on the Sugar Labs community? Give 3 answers, each 1-3 paragraphs in length. The first one should be yours. The other two should be answers from members of the Sugar Labs community, at least one of whom should be a Sugar Labs GSoC mentor. Provide email contact information for non-GSoC mentors.

Ans:


Q2: Sugar Labs will be working to set up a small (5-30 unit) Sugar pilot near each student project that is accepted to GSoC so that you can immediately see how your work affects children in a deployment. We will make arrangements to either supply or find all the equipment needed. Do you have any ideas on where you would like your deployment to be, who you would like to be involved, and how we can help you and the community in your area begin it?

Ans: A community school in my neighborhood. That would be my ideal choice as they would really benefit from it apart from helping me test out my project. I am sure they'll be delighted to be a part of this program. I was fortunate enough to go to a private school and I realize that children who are already playing at their PCs at home and computer labs at school might not be very appreciative of what we are trying to achieve.


Q3: What will you do if you get stuck on your project and your mentor isn't around?

Ans: Even with my mentor around, I will first try to find a solution myself without expecting any spoon-feeding, though I will let him/her know where I am stuck and what I am doing to find a solution. If I am not able to solve the problem, then I will ask my mentor for help.

If my mentor is not around,

  1. The first thing I will do is search on Google.
  2. If I cannot find a solution, I will go through the mailing list archives, wikis and forums of Sugar Labs, Julius or Xorg, depending on where I am stuck.
  3. If I still cannot find a solution, I will ask on the respective IRC channels and mailing lists.

Q4: How do you propose you will be keeping the community informed of your progress and any problems or questions you might have over the course of the project?

Ans: I will maintain this page and keep it updated with the project status. I will also send weekly updates to Sugar's Summer of Code mailing list.



Miscellaneous

Screenshot

Q1. We want to make sure that you can set up a development environment before the summer starts. Please send us a link to a screenshot of your Sugar development environment with the following modification: when you hover over the XO-person icon in the middle of Home view, the drop-down text should have your email in place of "Restart."

Ans: Screenshot on right.


Q2. What is your t-shirt size? (Yes, we know Google asks for this already; humor us.)

Ans: M (Female)


Q3. Describe a great learning experience you had as a child.

Ans:


Q4. Is there anything else we should have asked you or anything else that we should know that might make us like you or your project more?

Ans: No.