v0.3, January 2017
KEYWORDS: Fast Channelizer, Filterbank, Speech Recognition Front End,
Software Defined Radio, Polyphase Filter, Frequency Multiplexing, Audio,
High Performance Computing, HPC, SDR, FFT, FFTW, SIMD, AVX, SSE
10.01.2017: EFCL-v0.3 minor compatibility bug fix, documentation update
31.12.2016: EFCL-v0.2 2 bug fixes
30.12.2016: EFCL-v0.1 initial release
This work has been supported by the LINDAT/CLARIN project of the
Ministry of Education, Youth and Sports of the Czech Republic
(project LM2015071).
WHAT IT IS
This is a library and a stand-alone program that can split an input
signal stream into a number of equally spaced channels. The samples
are complex-valued single precision floats. If Fs is the sampling
frequency, the channels' center frequencies are 0, Fs/C, 2*Fs/C, ...,
where C is the number of channels. That is, the zeroth channel spans
the (-Fs/2C, Fs/2C) interval, the first channel (Fs/2C, 3Fs/2C), and
the last ((C-1.5)*Fs/C, (C-0.5)*Fs/C), which --- due to spectrum
periodicity --- can be regarded as channel "-1" spanning from -3Fs/2C
to -Fs/2C. The output of the channelizer can optionally be decimated
by an integer factor D (typically D=C).
The library is designed for high throughput, using on-the-fly compiled
assembly codelets that get compiled and tuned for a particular machine
and the C and D parameters when the program starts. The compiled
codelet is then dynamically linked with the main program so that the
channelizer can use it. The assembly heavily relies on SIMD
instructions, the flavor of which (SSE, AVX or NEON) is detected by
the makefile when the library gets compiled. So, if one wants to
distribute platform independent binaries, it is necessary to compile
libefcl.so for the different SIMD flavors and distribute several
different libefcl.so files from which the installer (or the main
program) would select the proper one (in fact this would be more
complicated, as there are also tuning constants and options that are
CPU model specific).
For CPUs that lack SIMD instructions, there is still a generic C
version of the code, which is about 3 times slower.

The library can also take advantage of multi-core CPUs to further
increase its speed by running the computation in multiple threads. How
many threads may be used is runtime configurable, and the library
adaptively balances the load (even turning itself off on cores
overloaded by other tasks) so as to achieve maximal throughput.
WHAT IT ISN'T
To select a channel out of a signal one needs a filter. The efcl
channelizer expects to be given its impulse response in a file
containing binary float numbers. The response itself may or may not be
symmetric --- it does not matter, as the filtering is fully general.
If you need to use a minimum phase filter for some reason, it is
possible. It is enough to specify this filter for channel 0, i.e. the
low-pass filter --- filters for the other channels are derived from it
automatically by efcl.
Although there is a utility "./simple_filter" which produces an
example filter (a weighted truncated sin(x)/x), it was meant only for
demonstration and performance testing, not for meaningful signal
processing.
Designing a proper filter is much more involved; the crucial thing to
consider is that the stop band of such a filter must fall off with
1/f, or better with 1/f^2, so that the stop band sums to a small
constant when folded by aliasing (which happens due to decimation).
Otherwise the noise and spectral leakage would be much higher than
expected. Designing such a filter is however beyond the scope of this
library. Use GnuRadio or write your own design tool (a weighted Remez
algorithm might be the method of choice). Keep in mind that the filter
is zero-padded towards the nearest higher multiple of C (the number of
channels). This new length is displayed as L in efcl's verbose mode.
Note that those zeroes are processed as if they were general numbers,
thereby taking processing time. It is therefore advised to use only
lengths divisible by C when designing the filter, so as to take
advantage of the additional taps.
HOW FAST IS IT
The speed generally depends on the number of channels C, the filter
length L and the decimation factor D. To get a notion, I did some
testing using C=72, D=50 and L=1152. The performance is measured in
complex megasamples per second (MS/s) emitted by the program. Note
that for C != D the output speed differs from the input one. Moreover,
the performance in MS/s is almost insensitive to the value of D;
therefore it is better to specify the speed in output MS/s than in
input MS/s (which can be computed from the output speed by multiplying
it by C/D).
The plots show how the performance depends on the buffer size (which
also affects the latency of the filterbank: the smaller, the better).
Note that it is a fairly flat function, so we don't lose much when the
buffer size is not exactly optimal. For a known machine the makefile
sets it to the optimum, and the user can easily measure it for his own
machine and then recompile with his optimal size. In any case, the
buffer size can always be overridden from the command line using the
-B= option.
The following plots show the single core performance of various CPUs
in MS/s. On each CPU the program was compiled with SIMD enabled, i.e.
Intel and AMD used the SSEx86 implementation, whereas the ARM used the
NEON implementation. Note that this is a dry performance only: no real
I/O was performed and the program just worked from memory to memory.
In reality, I/O may become a non-negligible CPU load due to increased
cache pressure. Nevertheless, at least 50% of what was measured should
be attainable in a practical setup.
Note that there is a certain minimal size of the buffer (namely L+D)
below which the algorithm cannot work and which defines the minimum
latency. If less than this is specified, it is automatically replaced
by L+D. This explains the flat region at the beginning of all the
plots. The noise that can be seen there is caused by natural
variations in speed. In fact these are slightly greater than what can
be seen in the picture, because each point was measured 10 times and
the best speed was plotted. This was done to hide sudden drops in
performance, which I attribute to interference with system processes
(as the testing time for each buffer size was kept as short as
possible, even a small amount of time stolen by the system or another
task shows up as a large drop). Note that the noise increases to the
right, because there the data no longer fit into the cache.
The next plots show the same thing in a relative way, counting how
many complex samples (on average) leave the channelizer at every CPU
clock. This way is best for comparing CPU architectures of different
clock speeds. It also shows that in the filter setup used for the test
it takes about 100 CPU cycles to produce a single pair of output
floats.
Especially note that the per-clock performance of the ARM A53 is not
at all bad, considering it is an in-order machine competing against
out-of-order Intel and AMD. In fact, the assembly part ran as fast as
the respective one on Intel; only the FFTW library seemed to be
somewhat lazier on ARM (which might change by recompiling it with
specific optimization for the A53 turned on, but I have not tried
that). Also note that the A53 is a 64-bit machine, but here it was run
in Aarch32 mode, in which it executes the older 32-bit ARM instruction
set. If programmed in native instructions it would surely be faster
(its NEON unit has twice as many registers then, for example).
Unfortunately, the NEON64 implementation is not ready yet.
The A53 CPU (as used in the Raspberry Pi 3 Model B) is in fact a
quad-core CPU, and the next picture shows its performance when more
than one thread is used. The number of worker threads is denoted by
the t=<number>_ prefix there. It does not make much sense to run more
than 3 computation threads, as there must always be a manager thread
that supplies the data, which also consumes some power.
Finally, the last plots show the absolute and relative performance of
modern multi-core CPUs where 3 cores were utilized for the
computation. Note that using more cores brings only a little
improvement at the minimal buffer size. But when the buffer gets above
a certain threshold, the performance of the multicore run rapidly
increases (this happens when the buffers are big enough that all three
threads can be given their own chunk, taken from the current buffer).
The next plot shows the same thing per CPU clock. Notice that although
the Phenom running on 3 cores was more powerful in MS/s than a single
Core i7 core, it would actually be slower if it ran at the same clock
speed. The IPC performance of the new CPUs has increased dramatically:
now it only takes 10 CPU clocks to produce a single sample of the
result, not 100 as in the case of the AMD Athlon XP and its
competitors shown above.
REQUIREMENTS
Only the Linux operating system is currently supported. It was tested
that the library works in Cygwin too, but only with its generic C/C++
implementation. To compile, efcl needs the following software
packages.
VERSIONS KNOWN TO WORK WITH (*)
3.3.6 4.7.3 4.9.2 5.4.0
2.2.5 2.19 2.23
2.11.06 2.10.07 2.11.05
2.14 2.23.2 2.25
extra programs needed for NEON_IMPLEMENTATION (on ARM CPU):
? (does anyone know how to determine it?)
There might be other versions that work too; the listed ones were
tested. Note that nasm and ld are also required at runtime when the
library has been configured to use the SIMD code. Likewise, grep, as
and dash are needed for ARM/NEON at runtime. There are also make
targets to benchmark the performance of your machine and to test that
everything works.
HOW TO COMPILE
To get help on what the makefile provides, run make. To install the
library, use the install target; it will compile and install both the
library and the binary.
If you want a different configuration than the one discovered by the
SIMD_probe script, you can select an implementation manually;
<implementation_name> is one of SSEx86, SSE, AVX, AVX512, NEON,
NEON64, C1, C2, C3, A1, A2, A4. Note that the A1, A2, A4
implementations only compile using an older gcc like gcc-3.3.6, but
this is OK, as they were prototypes of the SSEx86 implementation,
which is now faster than any of the Ax ones.
On the latest Intel processors (Skylake, Kaby Lake), the TSC counters
of individual cores can somehow become out of sync. This causes the
load balancer to mispredict the optimal workload for those cores,
which degrades the overall performance, as some cores have to wait for
the others to finish their jobs. Consequently, the suspend/resume
mechanism detects this and usually suspends all the threads but two.
These two then perform worse than they would on a system with
synchronous TSC counters. The problem is already known to the Linux
developers, so there is some hope they will eventually fix it. Until
then, users of the most recent Intel CPUs will have to get by with the
single-core performance, which is unaffected by the bug.
See the pgmdoc.txt, usrdoc.txt and math.pdf files distributed with the
source code for more detailed information. The easiest way to
understand the internal working of the code is to start with efcl0.cc,
which is the algorithm in its pure form, without the complications
stemming from multithreading and the use of the assembly codelets.
This code was created at the Faculty of Mathematics and Physics of
Charles University, in the Institute of Formal and Applied
Linguistics. A student of mine wanted to implement a channelizer in
his bachelor thesis, following the book by Harris. He wanted to be
faster than a similar channelizer in GnuRadio, in which he succeeded,
outperforming it by a factor of 3. Nevertheless, he did not have time
to try all my advice in his implementation. To fulfil my tutor role, I
decided to implement it from scratch to test whether my ideas were
right, hoping that my student might appreciate some of them at last.
The resulting program is, however, quite far from the neat style of
software engineers. It is more akin to the wild style of "real
programmers", though it is not spaghetti code. It is nevertheless
quite complex, and the complexity is not hidden in
functions/objects/whatever. This is partly because of performance, and
partly because efcl can be configured in so many ways: the user can
turn on or off the different means by which it achieves high speed, to
study their impact. The price for this was a heavy use of conditional
compilation that somewhat obscures the otherwise clean basic
structure. Anyway, when efcl is used as a library, all this mess is
hidden behind the interface, which is pretty straightforward.
The permanent link to this work is
Spectrum Monitoring System, Bachelor Thesis, MFF UK, 2016.
Frederic J. Harris: Multirate Signal Processing for Communication
Systems, Prentice Hall PTR, Upper Saddle River, NJ, USA, 2004,
ISBN 0131465112.
Ed Nather: The Story of Mel, USENET, May 21, 1983.