This is a new version of the repository. Do let us know (lindat-help at ufal.mff.cuni.cz) if you encounter any issues.

EFCL Channelizer

Please use the following text to cite this item or export to a predefined format:
Klusáček, David, 2017, EFCL Channelizer, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-2380.
Date issued
2017-01-10
Description
Extremely fast digital audio channelizer implementation, usable as a building block for experimental ASR front-ends or signal denoising applications. Also applicable in software defined radios, due to its high throughput. It comes in a form of a C/C++ library and an executable example program which reads input stream, splitting it into equidistant frequency channels, emitting their data to the output. Features: (1) Hand tuned SIMD-aware assembly for x86 (SSE) and IA64 (AVX) as well as for ARM (NEON) processors. (2) Generic non-SIMD C++ implementation for other architectures. (3) Capable of taking advantage of multicore CPUs. (4) Fully configurable number of channels and the output decimation rate. (5) User supplied FIR of the channel separation filter, which allows to specify the width of the channels, whether they should overlap or be separated. (6) Input and output signal samples are treated as complex numbers. (7) Speed over 750 complex MS/s achieved on Core i7 4710HQ @ 2.5GHz, when channelizing into 72 output channels with a FIR length of 1152 samples, using 3 computing threads. (8) Runs under Linux OS.
Acknowledgement
This item isPublicly Available
and licensed under:
 Files in this item
Name
efcl-0.3.tbz
Size
1.02 MB
Format
application/octet-stream
Description
EFCL source code and documentation
MD5
6aaa44332c31a36f4cc05760f666c8da
Preview
  File Preview
Name
EFCL-v03.html
Size
33.21 KB
Format
text/html
Description
Short introduction
MD5
bc8e806778b73206e164a3157cd282c1
Preview
  File Preview
    EFCL
                         EFCL  CHANNELIZER

                                 v0.3, January 2017




    KEYWORDS

      Fast Channelizer, Filterbank, Speech Recognition Front End, Software
      Defined Radio, Polyphase Filter, Frequency Multiplexing, Audio Denoising,
      High Performance Computing, HPC, SDR, FFT, FFTW, SIMD, AVX, SSE, NEON.


    DOWNLOAD
     
      10.01.2017: EFCL-v0.3    minor compatibility bug fix, documentation update
      31.12.2016: EFCL-v0.2    2 bug fixes
      30.12.2016: EFCL-v0.1    initial release


    ACKNOWLEDGMENTS

      This work has been supported by the LINDAT/CLARIN project of the Ministry
      of Education, Youth and Sports of the Czech Republic (project LM2015071).


    WHAT IT IS

      This is a library and a stand-alone program that can split input signal
      stream into a number of equally spaced channels. The samples are complex
      valued single precision floats. If Fs is a sampling frequency, the
      channels' center frequencies will be 0, Fs/C, 2*Fs/C, ... (C-1)*Fs/C,
      where C is the number of channels (i.e. the zeroth channel spans (-Fs/2C,
      Fs/2C) interval, the first channel (Fs/2C,3Fs/2C) and the last one
      ((C-1.5)*Fs/C,(C-0.5)*Fs/C) which --- due to spectrum periodicity --- can
      be regarded as the channel "-1" spanning from -3Fs/2C to -Fs/2C).

      The output of the channelizer can be optionally decimated by an integer
      factor D (typically D=C).

      The library is designed for high throughput, using on-the-fly compiled
      assembly codelets that get compiled and tuned for a particular machine and
      C and D parameters when the program starts. The compiled codelet
      is then dynamically linked with the main program so that the channelizer
      could use it. The assembly heavily relies on using SIMD instructions, the
      flavor of which (SSE, AVX or NEON) is detected by the makefile when the
      library gets compiled. So, if one wants to distribute platform independent
      binaries, it is necessary to compile libefcl.so for different SIMD flavors
      and distribute several different libefcl.so files from which the installer
      (or the main program) would select the proper one (in fact this would be
      more complicated as there are also tuning constants and options which are
      CPU model specific).

      For CPUs that lack SIMD instructions, there is still a generic C version
      of the code, which is about 3 times slower.

      The library can also take an advantage of multi core CPUs to further
      increase its speed by running the computation in multiple threads. It is
      runtime configurable how many threads can be used and the library
      adaptively balances the load (even turning itself off on cores overloaded
      by other tasks) so as to gain maximal throughput.


    WHAT IT ISN'T

      To select a channel out of a signal one needs a filter. The efcl
      channelizer expects to be given its impulse response in a file containing
      binary float numbers. The response itself may or may not be symmetric ---
      it does not matter as the filtering is fully general. If you need to use
      minimum phase filter for some reason, it is possible. It is enough to 
      specify this filter for channel 0, i.e. the low pass filter --- filters
      for other channels are derived automatically from this one by efcl.

      Although there is a utility "./simple_filter" which produces example filter
      (sigma factor weighted truncated sin(x)/x), it was meant only for debugging
      and performance testing, not for meaningful signal processing.

      Designing a proper filter is much more involved, the crucial thing to  
      consider is that the stop band of such a filter must be falling with 1/f
      or better with 1/f^2 so that the stop band would sum to a small constant
      when folded due to aliasing (which happens due to decimation). Otherwise
      the noise and spectral leakage would be much higher than expected.

      Designing such a filter is however beyond the scope of this library. Use
      GnuRadio or write your own design tool (weighted Remez algorithm might be
      a method of choice). Keep in mind that the filter length is zero-padded 
      towards the nearest higher multiple of C (the number of channels). This 
      new length is displayed as L in efcl's verbose mode. Note that those    
      zeroes are processed as if they were general numbers, thereby taking a  
      processing time. It is therefore advised to use only lengths divisible by
      C when designing the filter so as to take advantage of the additional taps.


    HOW FAST IS IT

      The speed generally depends on the number of channels C, filter length L and the
      decimation factor D. To get a notion I did some testing using C=72, D=50 and L=1152.
      The performance is measured in complex megasamples per second that are being emitted
      by the program. Note that for C!=D the output speed differs from the input one.
      Moreover the performance (in MS/s) is almost insensitive to the value of D, therefore
      it is better to specify the speed in output MS/s than in input MS/s (which can be
      computed from the output speed multiplying it by C/D).

      The plots show how the performance depends on the buffer size (which also
      affects the latency of the filterbank (the less the better)). Note that it is a fairly
      flat function, so we don't lose much when the buffer size is not optimal. Anyway,
      for a known machine the makefile sets it optimal and the user can easily measure it
      for his machine and then issue recompilation with his optimal size. In any case,
      the buffer size can always be overridden from the command line using -B= option.

      The following plots show a single core performance of various CPUs
    in MS/s. On each CPU
      the program was compiled with SIMD enabled, i.e. Intel and AMD used SSEx86 implementation,
      whereas the ARM used NEON implementation.


    Single core performance of Pentium M, Athlon XP, and ARM A53


      Note that this is a dry performance only, no real I/O was performed and the program just
      worked from memory to memory. In reality, I/O may become a non-negligible CPU load due to
      increased cache pressure. Nevertheless, at least 50% of what was measured should be
      attainable in a practial setup.

      Note that there is certain minimal size of the buffer (namely L+D), below which the
      algorithm cannot work and which defines the minimum latency. If less than this is
      specified, it is automatically replaced by L+D. This explains the flat region at
      the beginning of all the plots. The noise that can be seen there is caused by natural
      variations in speed. In fact these are sligtly greater than what can be seen in the picture
      because each point was measured 10 times and the best speed was plotted. This was to
      hide sudden drops in performance which I attribute to interference with system processes
      (as the testing time for each buffer size was as short as possible, even a small time
      stolen by the system (or other task) exhibits as a large drop in performace).

      Note that the noise increases to the right, because the data don't fit into the cache
      anymore.

      The next plots show the same thing in relative way, counting how many complex
      samples (on average) leave the channelizer at every CPU clock. This way is best for
      comparing CPU architectures of different clock speeds. It also shows that in the
      filter setup used for the test it takes about 100 CPU cycles to produce a single pair
      of output floats.

    Single core performance expressed in average complex samples per CPU clock.


      Especially note that the per-clock performance of ARM-A53 is not at all bad, considering
      it is a in-order machine competing against out-of-order Intel and AMD. In fact, the
      assembly part ran as fast as the respective one on Intel, only the FFTW library seemed
      to be somewhat lazier on ARM (which might change by recompiling it with specific
      optimization for A53 turned on, but I have not tried that).

      Also note that A53 is a 64 bit machine but here it was run in Aarch32 mode, in which it
      emulates older 32 bit ARM instruction set. If programmed in native instructions it would
      surely be faster (its NEON unit has twice as much registers then, for example). Unfortunately,
      NEON64 implementation is not ready yet.

      The A53 CPU (as used in Raspbery-Pi 3 model B) is in fact a quad-core CPU, and the next
      picture shows its performance when more than one thread is running. The number of worker
      threads used is denoted by the t=<number>_ prefix there. Note that it does not have much sense
      to run more than 3 computation threads as there must always be a manager thread that supplies
      the data, which also consumes some power.

    A53 multicore performance

      Finally, the last plots show absolute and relative performance of modern multi-core
      CPUs where 3 cores were utilized for computation:

    EFCL speed on modern multicore CPUs


      Note that using more cores for minimal buffer size does only a little improvement. But
      when the buffer gets above a ceartain threshold the performance of multicore run rapidly
      increases (this happens when the buffers are big enough so that all three threads can
      be given their own chunk, taken from the current buffer).

      The next plot shows the same thing pre CPU clock.

    EFCL performance on modern CPUs expressed in MS/CPU_clock


      Notice that although Phenom running at 3 cores was more powerful than a Core i7
      single core in MS/s, it would be actually slower if it ran at the same clock speed.
      The IPC performace of the new CPUs increased dramatically. Now it only takes
      10 CPU clocks to produce a single sample of the result, not 100 as in the case
      of AMD Athlon XP and its competitors shown above.



    DEPENDENCIES


      Only the Linux operating system is currently supported. It was tested that the library
      works in Cygwin too, but only its generic C/C++ implementation. To compile the efcl
      library one needs the following software packages.


         PACKAGE                    VERSIONS KNOWN TO WORK WITH (*)
        
         gcc/g++                    3.3.6   4.7.3   4.9.2   5.4.0
         C library (getconf)        2.2.5   2.19    2.23
         pthread library            0.9     2.19
         FFTW library               3.3.3   3.3.4
         nasm                       2.11.06 2.10.07 2.11.05
         ld                         2.14    2.23.2  2.25

      extra programs needed for NEON_IMPLEMENTATION (on ARM CPU):


         as                         2.25

         grep                       2.20
         dash                       ? (does anyone know how to determine it?)
        
      (*) There might be other versions that work too, the listed ones were 
      actually tested. Note that nasm and ld are also required at the runtime
      if the library has been configured to use the SIMD code. Likewise,
      grep, as and dash are needed for ARM/NEON at runtime.


    QUICK TEST

      Run

        make bench


      to benchmark performance of your machine and to test that compilation works.

     

    HOW TO COMPILE AND INSTALL

      To get help on what the makefile provides, run
       
        make

      To install the library, just type
       
        make install

      and it will compile and install:
          
           efcl and gosese binary           to $DESTDIR/usr/local/bin
           libefcl.a                        to $DESTDIR/usr/local/lib
           efcl.h header file               to $DESTDIR/usr/local/include

      If you wanted a different configuration than the one which was discovered
      by the SIMD_probe script, type:
       
        make clean
        SIMD=<implementation_name> make install                          
                                                                         
      where the <implementation_name> is one of SSEx86, SSE, AVX, AVX512, NEON,
      NEON64, C1, C2, C3, A1, A2, A4. Note that A1, A2, A4 implementations only
      compile using older gcc like gcc-3.3.6, but this is OK as it was only a
      prototype of SSEx86 implementation, which is now faster than any of Ax
      implementations.


    PROBLEMS

      On latest Intel processors (Skylake, Kaby Lake), the TSC counters of
      individual cores can become out of sync somehow. This causes the load
      balancer to mispredict optimal workload for those cores which effectively
      degrades the overall performance as some cores have to wait for the others
      to finish their jobs. Consequently the suspend/resume mechanism detects
      this and usually suspends all the threads but two. These two still perform
      worse than they would on a system with synchronous TSC counters.

      The problem is already known to Linux developers as can be seen there:
        
         http://lkml.org/lkml/2016/12/16/215

      So there is some hope they will eventually fix it. Until then, the users
      of the most recent Intel CPUs would have to get by with single threaded
      performance which is unaffected by the bug.


    FURTHER READING

      See pgmdoc.txt, usrdoc.txt and math.pdf files distributed with the source code
      for more detailed information. The easiest way to understand the internal
      working of the code is to start with efcl0.cc, which is the algorithm in its
      pure form, without complications stemming from multithreading and the use of
      assembly language.


    ORIGIN

      This code was created at the Facutly of Math and Physics at Charles University,
      in the Institute of Formal and Applied Linguistics. A student of mine
      wanted to implement a channelizer in his bachelor thesis [1] as described in
      the book [2]. He wanted to be faster than a similar channelizer in GnuRadio, in
      which he succeded, outperforming it by a factor of 3. Nevertheless, he haven't
      had a time to try all my advice in his implementation. To fulfil my tutor role
      I decided to implement it from scratch to test if my ideas were right, hoping
      that my student might appreciate some of them at last.

      The resulting program is however quite far from a neat style of polished software
      engineers. It is more aking to a wild style of "real programmers" [3], though it
      is not a spegetti code. It is nevertheless quite complex and the complexity is not
      hidden in functions/objects/whatever. Partly because of performance and partly
      because the efcl can be configured in so many ways. The user can turn on or off
      different ways by which it achieves high speed to study their impact. The price
      for this was a heavy use of conditinal compilation that somewhat obfuscated
      otherwise clean basic structure.

      Anyway, when efcl is used as a library, all this mess is hidden inside, exposing
      only the interface which is pretty straightforward.


    AUTHOR

      David Klusacek


    HOW TO CITE

      The permanent link to this work is http://hdl.handle.net/11234/1-2380.


    REFERENCES

     [1] Jan Hrach: Frequency Spectrum Monitoring System, Bachelor Thesis, MFF UK, 2016.
     [2] Frederic J. Harris: Multirate Signal Processing for Communication Systems,
         Prentice Hall PTR, Upper Saddle River, NJ, USA, 2004, ISBN 0131465112.
     [3] Ed Nather: The Story of Mel, USENET, May 21, 1983.