v0.3, January 2017


  Fast Channelizer, Filterbank, Speech Recognition Front End, Software
  Defined Radio, Polyphase Filter, Frequency Multiplexing, Audio Denoising,
  High Performance Computing, HPC, SDR, FFT, FFTW, SIMD, AVX, SSE, NEON.

  10.01.2017: EFCL-v0.3    minor compatibility bug fix, documentation update
  31.12.2016: EFCL-v0.2    2 bug fixes
  30.12.2016: EFCL-v0.1    initial release


  This work has been supported by the LINDAT/CLARIN project of the Ministry
  of Education, Youth and Sports of the Czech Republic (project LM2015071).


  This is a library and a stand-alone program that can split an input
  signal stream into a number of equally spaced channels. The samples are
  complex-valued single precision floats. If Fs is the sampling frequency,
  the channels' center frequencies will be 0, Fs/C, 2*Fs/C, ... (C-1)*Fs/C,
  where C is the number of channels (i.e. the zeroth channel spans the
  (-Fs/2C, Fs/2C) interval, the first channel (Fs/2C, 3Fs/2C), and the last
  one ((C-1.5)*Fs/C, (C-0.5)*Fs/C), which --- due to spectrum periodicity ---
  can be regarded as the channel "-1" spanning from -3Fs/2C to -Fs/2C).

  The output of the channelizer can be optionally decimated by an integer
  factor D (typically D=C).

  The library is designed for high throughput, using on-the-fly compiled
  assembly codelets that get compiled and tuned for a particular machine
  and for the C and D parameters when the program starts. The compiled
  codelet is then dynamically linked with the main program so that the
  channelizer can use it. The assembly relies heavily on SIMD instructions,
  the flavor of which (SSE, AVX or NEON) is detected by the makefile when
  the library gets compiled. So, if one wants to distribute
  platform-independent binaries, it is necessary to compile libefcl.so for
  the different SIMD flavors and distribute several different libefcl.so
  files from which the installer (or the main program) would select the
  proper one (in fact this would be more complicated as there are also
  tuning constants and options which are CPU model specific).

  For CPUs that lack SIMD instructions, there is still a generic C version
  of the code, which is about 3 times slower.

  The library can also take advantage of multi-core CPUs to further
  increase its speed by running the computation in multiple threads. It is
  runtime configurable how many threads can be used and the library
  adaptively balances the load (even turning itself off on cores overloaded
  by other tasks) so as to gain maximal throughput.


  To select a channel out of a signal one needs a filter. The efcl
  channelizer expects to be given its impulse response in a file containing
  binary float numbers. The response itself may or may not be symmetric ---
  it does not matter, as the filtering is fully general. If you need to use
  a minimum-phase filter for some reason, it is possible. It is enough to
  specify this filter for channel 0, i.e. the low-pass filter --- filters
  for the other channels are derived from it automatically by efcl.

  Although there is a utility "./simple_filter" which produces an example
  filter (a sigma-factor weighted truncated sin(x)/x), it is meant only for
  debugging and performance testing, not for meaningful signal processing.

  Designing a proper filter is much more involved. The crucial thing to
  consider is that the stop band of such a filter must fall off as 1/f, or
  better as 1/f^2, so that the stop band sums to a small constant when
  folded by aliasing (which happens due to decimation). Otherwise the noise
  and spectral leakage would be much higher than expected.

  Designing such a filter is however beyond the scope of this library. Use
  GnuRadio or write your own design tool (a weighted Remez algorithm might
  be the method of choice). Keep in mind that the filter is zero-padded to
  the nearest higher multiple of C (the number of channels). This new
  length is displayed as L in efcl's verbose mode. Note that those zeroes
  are processed as if they were general numbers, thereby taking processing
  time. It is therefore advisable to design filters whose length is
  already divisible by C, so that the extra taps do useful work.


  The speed generally depends on the number of channels C, the filter
  length L and the decimation factor D. To get an idea, I did some testing
  using C=72, D=50 and L=1152. The performance is measured in complex
  megasamples per second (MS/s) emitted by the program. Note that for C!=D
  the output speed differs from the input one. Moreover, the performance
  (in MS/s) is almost insensitive to the value of D; it is therefore
  better to specify the speed in output MS/s than in input MS/s (which can
  be computed from the output speed by multiplying it by D/C).

  The plots show how the performance depends on the buffer size (which
  also affects the latency of the filterbank; smaller is better). Note that
  it is a fairly flat function, so we don't lose much when the buffer size
  is not optimal. Anyway, for a known machine the makefile sets it
  optimally, and the user can easily measure it for his machine and then
  recompile with his optimal size. In any case, the buffer size can always
  be overridden from the command line using the -B= option.

  The following plots show the single core performance of various CPUs in
  MS/s. On each CPU the program was compiled with SIMD enabled, i.e. Intel
  and AMD used the SSEx86 implementation, whereas the ARM used the NEON
  implementation.

Single core performance of Pentium M, Athlon XP, and ARM A53

  Note that this is a dry performance only; no real I/O was performed and
  the program just worked from memory to memory. In reality, I/O may become
  a non-negligible CPU load due to increased cache pressure. Nevertheless,
  at least 50% of what was measured should be attainable in a practical
  setup.

  Note that there is a certain minimal size of the buffer (namely L+D)
  below which the algorithm cannot work and which defines the minimum
  latency. If less than this is specified, it is automatically replaced by
  L+D. This explains the flat region at the beginning of all the plots.
  The noise that can be seen there is caused by natural variations in
  speed. In fact these are slightly greater than what can be seen in the
  picture, because each point was measured 10 times and the best speed was
  plotted. This was to hide sudden drops in performance which I attribute
  to interference with system processes (as the testing time for each
  buffer size was as short as possible, even a small amount of time stolen
  by the system (or another task) shows up as a large drop in performance).

  Note that the noise increases to the right, because the data no longer
  fit into the cache.

  The next plots show the same thing in a relative way, counting how many
  complex samples (on average) leave the channelizer per CPU clock. This
  way is best for comparing CPU architectures of different clock speeds. It
  also shows that in the filter setup used for the test it takes about 100
  CPU cycles to produce a single pair of output floats.

Single core performance expressed in average complex samples per CPU clock.

  Especially note that the per-clock performance of the ARM A53 is not at
  all bad, considering it is an in-order machine competing against
  out-of-order Intel and AMD. In fact, the assembly part ran as fast as the
  respective one on Intel; only the FFTW library seemed to be somewhat
  lazier on ARM (which might change by recompiling it with specific
  optimization for the A53 turned on, but I have not tried that).

  Also note that the A53 is a 64-bit machine, but here it was run in
  AArch32 mode, in which it emulates the older 32-bit ARM instruction set.
  If programmed in native instructions it would surely be faster (its NEON
  unit has twice as many registers there, for example). Unfortunately, the
  NEON64 implementation is not ready yet.

  The A53 CPU (as used in the Raspberry Pi 3 Model B) is in fact a
  quad-core CPU, and the next picture shows its performance when more than
  one thread is running. The number of worker threads used is denoted by
  the t=<number>_ prefix there. Note that it does not make much sense to
  run more than 3 computation threads, as there must always be a manager
  thread that supplies the data, which also consumes some processing power.

A53 multicore performance

  Finally, the last plots show absolute and relative performance of modern multi-core
  CPUs where 3 cores were utilized for computation:

EFCL speed on modern multicore CPUs

  Note that using more cores at the minimal buffer size brings only a
  small improvement. But when the buffer grows above a certain threshold,
  the performance of the multicore run rapidly increases (this happens when
  the buffers are big enough for all three threads to be given their own
  chunk, taken from the current buffer).

  The next plot shows the same thing per CPU clock.

EFCL performance on modern CPUs expressed in MS/CPU_clock

  Notice that although the Phenom running on 3 cores was more powerful
  than a single Core i7 core in MS/s, it would actually be slower if it ran
  at the same clock speed. The IPC performance of the new CPUs has
  increased dramatically. Now it only takes 10 CPU clocks to produce a
  single sample of the result, not 100 as in the case of the AMD Athlon XP
  and its competitors shown above.


  Only the Linux operating system is currently supported. The library has
  also been tested to work in Cygwin, but only in its generic C/C++
  implementation. To compile the efcl library one needs the following
  software packages.

     PACKAGE                    VERSIONS KNOWN TO WORK WITH (*)
     gcc/g++                    3.3.6   4.7.3   4.9.2   5.4.0
     C library (getconf)        2.2.5   2.19    2.23
     pthread library            0.9     2.19
     FFTW library               3.3.3   3.3.4
     nasm                       2.11.06 2.10.07 2.11.05
     ld                         2.14    2.23.2  2.25

  extra programs needed for NEON_IMPLEMENTATION (on ARM CPU):

     as                         2.25

     grep                       2.20
     dash                       ? (does anyone know how to determine it?)
  (*) There might be other versions that work too; the listed ones were
  actually tested. Note that nasm and ld are also required at runtime
  if the library has been configured to use the SIMD code. Likewise,
  grep, as and dash are needed for ARM/NEON at runtime.



  Run

    make bench

  to benchmark the performance of your machine and to test that the
  compilation works.



  To get help on what the makefile provides, run

  To install the library, just type
    make install

  and it will compile and install:
       efcl and gosese binaries         to $DESTDIR/usr/local/bin
       libefcl.a                        to $DESTDIR/usr/local/lib
       efcl.h header file               to $DESTDIR/usr/local/include

  If you want a different configuration than the one discovered by the
  SIMD_probe script, type:
    make clean
    SIMD=<implementation_name> make install
  where <implementation_name> is one of SSEx86, SSE, AVX, AVX512, NEON,
  NEON64, C1, C2, C3, A1, A2, A4. Note that the A1, A2 and A4
  implementations only compile with an older gcc like gcc-3.3.6, but this
  is OK as they were only prototypes of the SSEx86 implementation, which
  is now faster than any of them.


  On latest Intel processors (Skylake, Kaby Lake), the TSC counters of
  individual cores can become out of sync somehow. This causes the load
  balancer to mispredict optimal workload for those cores which effectively
  degrades the overall performance as some cores have to wait for the others
  to finish their jobs. Consequently the suspend/resume mechanism detects
  this and usually suspends all the threads but two. These two still perform
  worse than they would on a system with synchronous TSC counters.

  The problem is already known to Linux developers as can be seen there:

  So there is some hope they will eventually fix it. Until then, the users
  of the most recent Intel CPUs would have to get by with single threaded
  performance which is unaffected by the bug.


  See pgmdoc.txt, usrdoc.txt and math.pdf files distributed with the source code
  for more detailed information. The easiest way to understand the internal
  working of the code is to start with efcl0.cc, which is the algorithm in its
  pure form, without complications stemming from multithreading and the use of
  assembly language.


  This code was created at the Faculty of Mathematics and Physics of
  Charles University, in the Institute of Formal and Applied Linguistics.
  A student of mine wanted to implement a channelizer in his bachelor
  thesis [1] as described in the book [2]. He wanted to be faster than a
  similar channelizer in GnuRadio, in which he succeeded, outperforming it
  by a factor of 3. Nevertheless, he did not have time to try all of my
  advice in his implementation. To fulfill my role as a tutor I decided to
  implement it from scratch to test whether my ideas were right, hoping
  that my student might appreciate some of them at last.

  The resulting program is however quite far from the neat style of
  polished software engineers. It is more akin to the wild style of "real
  programmers" [3], though it is not spaghetti code. It is nevertheless
  quite complex, and the complexity is not hidden in
  functions/objects/whatever --- partly because of performance and partly
  because efcl can be configured in so many ways. The user can turn on or
  off the different means by which it achieves high speed in order to
  study their impact. The price for this was a heavy use of conditional
  compilation that somewhat obfuscated the otherwise clean basic structure.

  Anyway, when efcl is used as a library, all this mess is hidden inside,
  exposing only the interface, which is pretty straightforward.


  David Klusacek


  The permanent link to this work is http://hdl.handle.net/11234/1-2380.


 [1] Jan Hrach: Frequency Spectrum Monitoring System, Bachelor Thesis, MFF UK, 2016.
 [2] Frederic J. Harris: Multirate Signal Processing for Communication Systems,
     Prentice Hall PTR, Upper Saddle River, NJ, USA, 2004, ISBN 0131465112.
 [3] Ed Nather: The Story of Mel, USENET, May 21, 1983.