This text is intended to give a general outline of how EARS is
implemented.  However, the source, as always, is the final reference.

File Conventions
----------------
C++ source has *.cc and *.h endings, where the header file usually
contains the class interfaces.  C source goes into *.c files.
The source is contained in several subdirs, plus one .cc file for
each program in the main dir.  After compilation, each subdir yields
a library with the name of the subdir.

EARS requires the existence of the $(HOME)/.earsrc file, which
specifies where the data resides (default $(HOME)/.ears/).
If the .earsrc file does not exist, a default one will be made.
The .ears directory contains
- word lists (ending *.words)
- recognizer files (ending in the recognizer name, e.g. *.NDTW)
- 'raw' directory for WAV files 
- directories for pattern files, depending on method and pattern type
  (e.g. RASTA-V/* is a variable sized Rasta feature file).
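
For illustration, a populated data directory might look like this
(all file names here are hypothetical):

    $(HOME)/.ears/
        default.words       the active word list
        default.NDTW        a recognizer file
        raw/                WAV files, if KEEP_SAMPLES is set
        RASTA-V/            variable sized Rasta feature files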

EARS and C++
------------
In EARS, I have tried to use some OOP techniques (classes, abstraction)
to make the source more modular and understandable.  I would be
glad to have some feedback about the correctness of the result.
In the following, I'll use some terms from the excellent Gamma et al. 
'Design Patterns' book.  Please refer to it for the ideas.

There are several different layers in the program:

Top layer: the Protocol class called by main() plus some other subprotocols
called by the top protocol.  These create the objects that form the second 
layer.  They can thus be considered to be mediators.  
- Files: train_ears.cc and listen.cc, for the first call;
         ears/pr_train_ears.cc and ears/pr_listen.cc, for the top protocol;
         the other ears/pr_*.cc, for the subprotocols

Second layer:  this consists of several classes, most of which are
instantiated only once at a time, so I implemented them as singletons;
most notably 'screen' and 'Sound', which are the I/O objects.  'screen'
also handles keyboard input, which should be separated in the future.
'messages' was an early attempt at internationalization that will be
completed at some time.  'config' handles the app configuration.
'words' holds the active word list.
- Files (*.cc and *.h):  
  - objects: config, words, sound, screen, messages
  - GUI bridge is in ui/*, the others are in modules/*
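
For illustration, the singleton technique looks roughly like this (a
minimal sketch; the real 'screen' and 'Sound' classes differ):

    class Screen
    {
    public:
        static Screen& instance()
        {
            static Screen the_screen;   // constructed on first use
            return the_screen;
        }
    private:
        Screen() { }                    // only instance() may construct
        Screen(const Screen&);          // copying is forbidden
    };

Client code always goes through Screen::instance(), so a second screen
object can never exist.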

Bottom layer:  Here are the most elementary objects that are passed
between the mediators.  The first processing phase yields sndblocks,
which are assembled into samples, which in turn become word_patterns
that are saved and later recognized.  The pattern objects come from the
RecordingProtocol and are handed to a 'recognizer'; both 'pattern' and
'recognizer' are bridges, and there will be even more in the future.
- Files (*.cc and *.h): sndblock, sample, pattern
  - protocols: speechstream

Exceptions
----------
As exception class hierarchies require RTTI, and gcc will not have
RTTI enabled by default under Linux until 2.8.x, there is no real
exception handling in 'ears'.  But I have tried to keep the coming
changes as small as possible.  For now, a 'throw' invokes the function
Throw() in exception.cc, which shuts down the screen and sound as
gracefully as possible.  There is no catching involved inside 'ears'.
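
The idea behind Throw() looks roughly like this (a sketch; the cleanup
helpers here are illustrative stand-ins, not the actual EARS code):

    #include <cstdio>
    #include <cstdlib>

    // stand-ins for the real cleanup code in 'screen' and 'Sound'
    static void shutdown_screen() { /* restore the terminal state */ }
    static void shutdown_sound()  { /* close the sound device */ }

    void Throw(const char* msg)
    {
        shutdown_screen();
        shutdown_sound();
        std::fprintf(stderr, "ears: fatal: %s\n", msg);
        std::exit(1);
    }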

Libraries
---------
The following libraries are used:
- libstdc++: for streams, strings and containers.  Probably more.
  Compiling with -fno-implicit-templates and providing templates.cc
  reduces code size considerably but gives away the inlines (see the
  sketch after this list).  This will get better once gcc includes
  the template repository mechanism.
- libncurses,libpanel: fancy graphics
- libmrasta: feature extraction, source is provided
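
The templates.cc mentioned above would contain explicit instantiations
of every template the program needs; a sketch (the types here are
examples only, not the actual list used by EARS):

    // templates.cc -- with -fno-implicit-templates, the compiler
    // emits template code only for explicit instantiations, so all
    // needed specializations must be listed here.
    #include <vector>

    template class std::vector<int>;
    template class std::vector<double>;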

--------------------------------------------------------------------
Data
----
All files except the RIFF WAV files are written in ASCII.
Config and word files consist of entries, one entry per line.
All I/O except setting up and reading the sound device should be done
with streams.

Options
-------
All options have defaults set on startup.  These can be changed 
inside the .earsrc file or by giving a command line option.  The 
'listen' program has a reduced set of options.  Available are:

.earsrc       |  command line  |   description
------------------------------------------------------------------
EARS_PATH     |      -p        | the directory where all data goes
BASENAME      |      -b        | file base for .words and .net file [default]
MIN_NUM_WORDS |      -m        | number of times to speak a word [1]
KEEP_SAMPLES  |      -k        | write WAV files? [no]
FEATURE       |      -f        | the feature extractor [MRASTA]
RECOGNIZER    |      -r        | the recognizer [NDTW]
SOUND_SPEED   |      -S        | sampling rate in Hz [8000]
SOUND_BITS    |      -B        | 8- or 16-bit sampling [8]
DEBUG         |      -d        | output recognizer debug info [no]
NEWLINE       |      -n        | output newline after words [no](listen only)
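
For illustration, a .earsrc might look like this (one entry per line,
as described in the Data section; the exact 'NAME value' syntax and
the values here are assumptions):

    EARS_PATH /home/user/.ears
    BASENAME default
    MIN_NUM_WORDS 3
    KEEP_SAMPLES no
    FEATURE MRASTA
    RECOGNIZER NDTW
    SOUND_SPEED 8000
    SOUND_BITS 8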

New methods
-----------
For adding new methods/algorithms, especially a new recognition
module, an understanding of how bridges work is needed.  E.g. with the
recognizer, the training class accesses the recognizer via an abstract
base class that has a defined interface.

So all you need to do is write a subclass of that ABC and implement
the respective interface.  In the case of DTW, most functions are
empty since all computation is done after hearing an unknown word;
also, there is no training at all for DTW.  A more complicated use
of the interface can be seen in the BP recognizers.
The provided interfaces should suffice for many purposes, but of course
you can improve those too.
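
As an illustration, a new recognizer could look roughly like this
(RecognizerImplementation is the ABC from the diagrams below; the
method names and signatures are invented for this sketch, not the
actual EARS interface):

    class MyRecognizer : public RecognizerImplementation
    {
    public:
        // called once per training pattern; DTW-style methods may
        // leave this empty and do all work at recognition time
        virtual void train(const pattern& p, int word_index) { }

        // compare an unknown pattern against the known words and
        // return the index of the best match
        virtual int recognize(const pattern& unknown)
        {
            return 0;   // placeholder
        }
    };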

The bridges are:

              screen ----------------->UserInterface
                ^                           ^
                |                           |
             tscreen                     UIraw
             lscreen                     UIncurses


              Sound ------------------->SoundInterface
                                            ^
                                            |
                                         AFSound
                                         VoxwareSound
                                         OssSound
                                         SunSound


              recognizer-------------->RecognizerImplementation
                                            ^
                                            |
                                           DTW
                                           NDTW
                                           BP
                                           BPMT
                                           ELMAN1


              pattern----------------->PatternImplementation
                                            ^
                                            |
                                         var_pattern
                                         fix_pattern
                                         bit_pattern

It is planned to build more abstractions: feature and endpointer.

--------------------------------------------------------------------
Efficiency of the methods
=========================
Comparisons of speech recognition methods can doubtless be found in
the literature.  I'm no expert --- I can only compare from the
engineering point of view.  Feedback is greatly appreciated.

As a first and best start, try:  Rabiner, "Fundamentals of Speech
Recognition".

Endpoint detection
------------------
Cutting a word out of a sound stream seems tricky.  I tried several
sources but was not satisfied; maybe I overlooked something.  Then
I decided to write it from scratch, and it works surprisingly well,
at least so far.

Let me describe what the program does:  At the lowest level, when
reading raw data from the sound device and copying it into a
'sndblock', we already compute a value we call 'energy' (though
it is better described as the average of the derivative) of the
sound data.  This energy is high when there is a sound, and low
when there is none.
When we measure the noise level, we take the maximum of all energies
seen during the measurement and call it 'e_limit': the energy limit
below which a sndblock counts as noise, i.e. no sound.
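
The 'energy' measure might be computed like this (a sketch; the
function name and buffer type are illustrative):

    #include <cmath>

    // average absolute sample-to-sample difference over one block
    double block_energy(const short* buf, int n)
    {
        double sum = 0.0;
        for (int i = 1; i < n; ++i)
            sum += std::fabs(double(buf[i]) - double(buf[i - 1]));
        return n > 1 ? sum / (n - 1) : 0.0;
    }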

Now, when listening for words, we first let a delay pass to discard
sound that might end a previous word, then we discard all sndblocks
that are noise (energy < e_limit) until a sufficiently long series of
non-noise is encountered.  From there on we save all sndblocks until
a sufficiently long sequence of noise is seen.  The recorded array of
sndblocks is then processed further.  That's it.
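
In code, that loop might look roughly like this (a sketch; names,
thresholds and helpers are illustrative, and sndblock is assumed to
provide an energy() accessor):

    #include <vector>

    extern double e_limit;           // from the noise measurement
    extern sndblock read_block();    // one sndblock from the device

    std::vector<sndblock> record_word()
    {
        const int MIN_SOUND_RUN = 3;   // non-noise blocks to start
        const int MIN_NOISE_RUN = 10;  // noise blocks to stop
        std::vector<sndblock> word;
        int run = 0;

        // discard noise until a long enough run of non-noise appears
        while (run < MIN_SOUND_RUN) {
            sndblock b = read_block();
            if (b.energy() > e_limit) { ++run; word.push_back(b); }
            else                      { run = 0; word.clear(); }
        }

        // then save everything until a long enough run of noise appears
        run = 0;
        while (run < MIN_NOISE_RUN) {
            sndblock b = read_block();
            run = (b.energy() <= e_limit) ? run + 1 : 0;
            word.push_back(b);
        }
        return word;
    }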

Feature extraction
------------------
Another surprising finding is how superior feature extraction with
Rasta-PLP seems to be.  I have the impression that many researchers
use pure LPC, but I can't say that it works for me.  There were no
differences between the OGI Rasta function and the Mrasta library,
as far as I can tell.

Recognition
-----------
Five words: the surface is only scratched.
I have plans for at least five more recognizers that are substantially
different from the existing ones and from each other.  The next one
will surely be a recurrent neural net.

DTW is robust and doesn't require training, but recognition time 
increases at least linearly with dictionary size.  I simply copied
routines from Dr. Robinson's cookbook.

NDTW is the patched version of DTW with several speed improvements:
matrix allocation is now done only once (with a sufficient size),
and we now search only the possible paths inside the parallelogram.
This was inspired by reading Rabiner and leads to a 2x speedup.
Additionally, after computing a row of global distances, we check
whether the smallest global distance is already bigger than the best
result so far; if so, we abort the current comparison.
Also, if the lengths of the two patterns differ too much, we do not
compare them at all.
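
A sketch of the row pruning (the parallelogram constraint is left out
for brevity, and all names here are illustrative):

    #include <algorithm>
    #include <cfloat>

    // Standard DTW recurrence over a precomputed local-distance
    // matrix 'locald', writing global distances into 'globd'; as
    // soon as a whole row is already worse than 'best_so_far', the
    // comparison is abandoned.
    double ndtw_distance(double** locald, double** globd,
                         int rows, int cols, double best_so_far)
    {
        globd[0][0] = locald[0][0];
        for (int j = 1; j < cols; ++j)
            globd[0][j] = globd[0][j - 1] + locald[0][j];

        for (int i = 1; i < rows; ++i) {
            double row_min = DBL_MAX;
            for (int j = 0; j < cols; ++j) {
                double prev = globd[i - 1][j];
                if (j > 0)
                    prev = std::min(prev, std::min(globd[i][j - 1],
                                                   globd[i - 1][j - 1]));
                globd[i][j] = prev + locald[i][j];
                row_min = std::min(row_min, globd[i][j]);
            }
            if (row_min > best_so_far)
                return DBL_MAX;        // cannot beat the best match
        }
        return globd[rows - 1][cols - 1];
    }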

BP/BPMT shows that backpropagation isn't the answer to all problems.
Although the code is fast, training time is a major problem.  Even
worse, the error rate is high.  You would need more instances of each
word, and that slows training even more.  Finally, the method doesn't
account for the time-variability in the data, and generalization is
poor.

ELMAN1 doesn't work yet.

--------------------------------------------------------------------
The 'ears' program
------------------
Nothing here yet.  But look into the contrib directory; there are some
nice tools that can make the 'ears' program obsolete.  And for a start,
I've begun to write down requirements for the program in
doc/ears-requirements.txt.

