Copyright 2001-2014 (c) Synthetic Reality Co.
This is something I have been meaning to do for a long time. I don't expect it to be as pleasurable for anyone else as it was for me. But after a marathon 12 hour coding session, I am pleased to submit 'speechLab'.
This program is FREE, not shareware. Just enjoy it.
What Is speechLab?
You know those 'text to speech' programs that take a written sentence and then speak it aloud, sounding a bit like a drunk Norwegian on crack? Well, I wanted to throw my own hat into that ring, but mainly focussing on the ability to edit the actual phonemes used, so that you can make custom voice sets for yourself, your friends, Darth Vader, etc. My supposed useful purpose is to add a feature to my chat rooms where the synthetic voice has a bit more personality than your usual TTS system.
TTS (Text to speech) has evolved a lot in recent years. As personal computers have grown more powerful and memory has been less of a constraint, developers have been unleashed to use complex mathematics to dynamically synthesize ever more drunken Norwegians. (By the way, no offense intended to Norge.. I'm 1/4 Norwegian myself!) Anyway, speechLab isn't like that. It uses technology which is 40 years old already!
Running the Program
Well, just unzip the executable and run it. It comes with a couple sub-folders you will want to keep inside the program's folder. You've done all this before. Please Note: speechLab REQUIRES DirectX, as well as a full-duplex sound card. Cheaper sound cards may not be compatible with the program. Microphone setup is between you and Windows.
It is a dialog-based MFC app written in VC++ 4.2b. It starts up with a PHON table pre-loaded, so you can just enter an English sentence into the ENGLISH text box, and then press TTS to hear it spoken with that PHON table. The phonetic description of the sentence will be left in the PHONETIC box, and pressing the SPEAK! button will re-speak the current phonetic string. (You can type in the PHONs manually as well, if you like.)
In the upper right is the PHON table. Each PHON is described with a one or two letter name, a 'hint' as to what part of the word it represents, and how large the current sample is for that PHON. Click on a PHON on the list to hear what it sounds like. Your GOAL is to make a new PHON list which sounds good or funny (but can still be understood). Or Klingon...
In the upper left is the WORK BUFFER. You'll need a microphone to have fun with this program. When you press the RECord button, up to 4 seconds of speech are recorded into the WORK BUFFER. You then use the left and right mouse buttons to select a portion of that recording. You're trying to find just the little bit with the sound you want.
For example, say you are trying to make a new PHON for "AA." The hint is "fAther." So, you push the RECord button and say "My FATHER is cool!" When the recording finishes, you see the sound samples like a little oscilloscope screen. LEFT click in the scope to set the START of the selection, then RIGHT click somewhere to set the END of the selection. Press the PLAY button and hear just the selected (bright green) data.
First you might try to isolate the whole word "Father." Once you do that you can push the CLIP button to get rid of everything else. This has the advantage of letting the scope 'zoom in on' a smaller number of speech samples, so you can more accurately find the start and end of the "AHHH" sound in the middle of "Father" with as little "FFF" or "THHHH" as possible.
So, this is the part where you are having fun, right? Once you have isolated the AA sound, press the [>>] button to save it in the PHON Table. (Be sure to click on AA in the PHON table first. [>>] copies the selected samples from the work buffer to the currently selected PHON entry, so don't accidentally blow away your hard-fought TH PHON!)
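If it helps to picture what select, CLIP and [>>] are doing to the data, here is a minimal sketch. It assumes the work buffer is just a list of samples and the PHON table is a dictionary keyed by PHON name. All of the function and variable names are made up for illustration; none of them come from speechLab itself.

```python
# Hypothetical model of the work buffer and PHON table.
# Names are illustrative only, not taken from speechLab's source.

def select(buffer, start, end):
    """Return the samples between the START and END markers (end exclusive)."""
    if start > end:
        start, end = end, start  # assumption: tolerate markers set in either order
    return buffer[start:end]

def clip(buffer, start, end):
    """CLIP: keep only the selection and discard everything else."""
    return select(buffer, start, end)

def store_phon(phon_table, name, buffer, start, end):
    """[>>]: copy the selected samples into the named PHON entry, replacing it."""
    phon_table[name] = list(select(buffer, start, end))

work_buffer = list(range(100))              # stand-in for recorded samples
word = clip(work_buffer, 20, 80)            # isolate the whole word "Father"
phon_table = {"TH": [1, 2, 3]}              # an existing, hard-fought PHON
store_phon(phon_table, "AA", word, 10, 30)  # the "AHHH" inside the clipped word
```

After the clip, the start and end positions are relative to the shorter buffer, which is the same reason the scope can 'zoom in' on fewer samples. Only the entry named in `store_phon` is overwritten, so TH survives as long as AA is the one selected.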
Anyway, repeat until you have all 42 PHONs defined, and you're ready to start speaking in a new voice! Share your PHON tables with friends (be sure to SAVE your work now and then, of course). PHON tables are stored as *.voc files in speechLab's VOICES folder.
Here is a recap of all the stuff you see on the screen:
Again, this is just for fun. No wagering please.
I tend to have a math section in all these demo pages, but really this one isn't much about math. Sure I do some vaguely interesting scaling in time and amplitude. And I might add some FFT stuff just to get some more math into it. But really this is mostly about:
So, even though it's the recording and playback which is the most challenging part of the program to code, I guess it's that last bit which is the most interesting (and also why this is doomed to not be a state-of-the-art TTS system.) I should point to some URLs here as a study guide, but just do a web search (www.google.com) on "Speech Synthesis", "Text to Speech", "English to Phoneme", that sort of thing. You'll see the Naval Research Lab mentioned a lot. I first read about it in Byte magazine a million lifetimes ago, but lost track of the article. A more recent web search located the rules again (with special thanks to John A. Wasser for a very neatly documented pile of rules.)
So, what's a rule? Well, first, what's a phoneme? (or a PHON, as I call it here).
Now, say it again, as slowly as you can. What you actually said was "EH UH UL OH." Four separate little sounds about 100 milliseconds long each. We'll call those 'phonemes.' They are like atoms. They are a teensy little unit that language utterances can be broken into. (I just mean the SOUNDS of spoken language, not the MEANINGS). The actual sounds used vary with the language, and there are generally thousands of unique sounds, but lots of them are really very similar. In speechLab, English is broken down into about 40 of these sounds, which I call PHONs.
The Naval Research Lab rules give us a way to turn an arbitrary English phrase into a sequence of these PHONs. Unfortunately, the result is not really very easy to understand, so it doesn't sound like a normal person. The point is that it CAN be understood.
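Once you have a sequence of PHONs, speaking it is mostly a matter of gluing the recorded samples together end to end. Here is a rough sketch of that idea. It assumes the phonetic string is a space-separated list of PHON names and that the PHON table is a dictionary of sample lists; both are guesses for the sake of illustration, and the PHON names and sample values are placeholders.

```python
# Hypothetical sketch: build an utterance by concatenating PHON samples.
# The string format and table layout are assumptions, not speechLab's own.

def speak(phonetic, phon_table):
    """Concatenate the stored samples for each PHON named in the string."""
    samples = []
    for name in phonetic.split():
        samples.extend(phon_table[name])  # raises KeyError if a PHON is undefined
    return samples

phon_table = {"K": [1, 2], "AE": [3, 4, 5], "T": [6]}  # placeholder samples
cat = speak("K AE T", phon_table)
```

Plain concatenation like this is also a fair hint as to why the result doesn't sound like a normal person: nothing smooths the joins between one PHON and the next.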
But just to give an example, consider the words "Cat" and "Ceiling." In one case the letter 'c' is pronounced as a 'k,' while in the other as an 's.' The Navy rules look at the letters to the left and the right of the 'c' and determine which PHON might make more sense. But as anyone learning English for the first time can tell you, English pronunciation rules suck! We have incorporated so many words (that's another problem, too many words!) from so many different languages, that the pronunciation is often unpredictable... The classic example: cough, rough, bough, through, though, slough, etc.
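To make the 'c' example concrete, here is a toy version of a context-sensitive rule. It is NOT the actual Navy rule table. It is a simplified stand-in that looks only at the letter to the right of the 'c' and assumes that 'e', 'i' or 'y' means the soft sound. The function name and the PHON names "S" and "K" are illustrative.

```python
# Toy context rule for the letter 'c' (an assumption-laden simplification,
# not the Naval Research Lab rules themselves).

SOFT_FOLLOWERS = set("eiy")  # assumed: 'c' before these letters sounds like 's'

def phon_for_c(word, i):
    """Pick a PHON for the 'c' at position i from the letter to its right."""
    if word[i].lower() != "c":
        raise ValueError("position i must point at a 'c'")
    following = word[i + 1].lower() if i + 1 < len(word) else ""
    return "S" if following in SOFT_FOLLOWERS else "K"

print(phon_for_c("Cat", 0))      # K
print(phon_for_c("Ceiling", 0))  # S
```

A real rule set needs many more cases than this (the "ch" in "church" is an obvious one), which is exactly where English starts to misbehave.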
So, while the Navy rules don't do a perfect job, they do a darn GOOD job for having so few rules to begin with.
speechLab is the property of Synthetic Reality and all rights are reserved. If we can figure out a way to convince people to pay for this, you can be sure we'll give it a try. But for now it is expressly intended to provide a moment of joy for the speech geeks in the audience!
Thanks for your feedback in advance!