Radio Oceanography Laboratory Support | KnowHow / Decimation filters

Discussion in reverse temporal order

1/26/2020 pierre

after our discussion, it becomes clear that the best would be to have a 8^2 fixed filter, and if final sampling frequency need to be adjusted, do it by changing the TI1278 clock frequency, not touch the decimate filter.

There is enough adjustment latitude between 12 kHz and 48 kHz to cover all our potential needs.

Guam and Biaritz radars will run at 4.5 MHz with only 25 kHz chirp radio bandwidth, that would be sampled at 185 Hz with 12 kHz/64, and Xavier's lagoon radars at 35 Mhz with 750 kHz chirp would be sampled at 740 Hz with 48 kHz/64.

11/12/19 john.mclean

Some key elements of how the DSP design affects the resources in the FPGA

The core FIR Filter Arithmetic is based on the xilinx DSP48 logic, these are blocks of resources available in the FPGA for fixed point arithmetic shown below

Attach:Screenshot-2020-01-27-03-50-23.png Δ

The key steps for a Filter is the MAC (Multiply Accumulate) operation, the resolution of this is up to a 25x18 multiply, and a 48 bit accumulation. An FIR filter also requires storage for the interim calculations as the filter works its way through the coefficients, this requires Block RAM (BRAM) . It is actually the BRAM resource that is often the limiting factor in fitting logic into an FPGA It is possible to widen the calculation and to store more accumulator precision but that requires both extra DSP slices and extra storage. I've run several examples through the filter logic to give you an idea of how it grows.

Decimating filters are significantly more efficient rather than decimating as a secondary stage.

First Stage Filter

Inputs = 20 bits, Coefficients = 18 bits, no truncation, reloadable symmetrical coefficients, input sample rate up to 48 kHz on a 100 MHz system, decimate by 8

512 tap 16 channel filter will require 1 DSP slice and 7 BRAMs

1024 tap 16 channel filter will require 1 DSP slices and 12 BRAMs

Second Stage Filter

(increase the input bit width to accommodate additional resolution generated by the first stage)

Inputs = 24 bits, Coefficients = 18 bits, no truncation, reloadable symmetrical coefficients, input sample rate up to 6 kHz on a 100 MHz system, decimate by 8

512 tap 16 channel filter will require 3 DSP slices and 14 BRAMs

1024 tap 16 channel filter will require 3 DSP slices and 26 BRAMs

Resource utilisation is actually very non-linear eg trying storage variation by changing bit resolution on the second stage 24 bit filter – if we were to redefine it as

Inputs = 23 bits, Coefficients = 16 bits, no truncation, reloadable symmetrical coefficients, input sample rate up to 6 kHz on a 100 MHz system, decimate by 8

 512 tap 16 channel filter will require 1 DSP slices and 8 BRAMs

whilst

 1024 tap 16 channel filter will require 3 DSP slices and 48 BRAMs

The problem here is both a characteristic of the width required to calculate the data and a geometry of the RAMs that are storing the data. So in this case dropping 2 bits from the coefficient allows us to use 3 bits more significance in the inputs and actually saves RAM whereas the 1024 tap version of the same filter is significantly more expensive in RAM

A quick look at a Single Filter

(Yes much harder to get a sharp roll off in a single stage but possibly worth looking at)

Inputs = 20 bits, Coefficients = 18 bits, no truncation, reloadable symmetrical coefficients, input sample rate up to 48 kHz on a 100 MHz system, decimate by 64

 512 tap 16 channel filter will require 1 DSP slice and 2 BRAMs

 1024 tap 16 channel filter will require 1 DSP slices and 2 BRAMs

 2048 tap 16 channel filter will require 4 DSP slices and 8 BRAMs

As you can see there are "many ways to skin the cat" and some of the growth resource requirements is non-intuitive. This also will require careful slicing/ shifting of the output data to get the full scale resolution correct but that should drop out of the coefficient gain logic and knowledge about signal amplitudes

DSP count will not be an issue in your FPGA however I would recommend that you do not utilise any more than 80 BRAMs for the complete solution.

I think it will be very useful to discuss the implications (possibly when you visit !!!) of this as I'm not sure I understand why you would what to constrain the frequency obtaining very clean low frequency data whilst truncating the final calculation to 16 bits. Is there some signal significant signal amplitude gain adjustment that removes a significant amount of the dynamic range of the input ie truncating MSBs rather than LSBs ?

I hope this gives you some numbers in terms of FPGA resource requirements for the filtering operation

11/22/19 peter.milne

there are two types of "tap" in this email "filter-tap" : each element of the filter delay line "data-tap" : point to siphon the data off from inside the data chain for external examination.

We should state at the outset what we think we promised you.. a/ Peter offered final stage filtering in the past (I think probably a simple sum-to-N) b/ John may have offered you a final stage FIR It's no problem to provide a final stage FIR. Our architecture currently supports a single FIR, we think 512 taps is very doable, and decimate by /64 should be doable.

"lera_stream.pdf" calls for 2 cascaded /8 filters, for total decimation /64. Now that we eliminated the "WERA Tap", this isn't needed for the data-tap point (thank you), but maybe there is a technical reason why two cascaded filters might be better than one big filter? I'm going to guess that technically it's not a problem to cascade two filters, with a total 512 filter-taps rather than provide one, if that's what you prefer.

The main technical issue is that our architecture currently doesn't support data-tap points at all. (I personally think it could and it should, but let's stay for now with what we have, at least until it is shown to be unusable). So our question for you is: you want to see full rate data, you want to see decimated data. But did you really need to see them both at the same time?. Because it's trivial for us to set up a choice of two FPGA personalities, first straight through for full rate (ie the one we already have), and second the full filter personality. We suggest that you would select the full rate personality for testing, then switch to filter when the system is left to run. "Switch" is literally a script that sets a file system soft link and reboots the box.

A second technical issue is that the Sample Count / GPS time data needs to be inserted at the time of capture (ie at the top of the data chain, and it has to pass unchanged through the filter chain).

Our product has the capability to insert a number of columns of meta data (column [0]=sample count, [1]=a clock count, nominal usecs, [2], [3] can be latch count at GPS ONEPPS instant, GPS NMEA string - yes, we already do this on AUV applications). The extra columns have to pass through the filter untouched (well, untouched, but decimated). We've done something similar before, and it can be done again, but has to be agreed in advance, not bolted on afterwards. I think it's simple enough to agree:

 4 RADIO, 8 ADC + 4 meta : 12 columns x uint32
 8 RADIO,16 ADC + 4 meta : 20 columns x uint32
 16 RADIO,32 ADC + 4 meta : 36 columns x uint32

.. and we can work out an encoding for GPS ONEPPS, GPS TIME that can survive decimation later. Basically, the GPS latch time only changes once per second, so it doesn't matter if it gets decimated, and you can make a direct compare to the 1-usec free run time to determine exact position - that may in fact be a time that's lost by decimation, but you could interpolate if necessary - and if we slowly feed the ASCII NMEA string into [3] with > 64 samples per update, then the NMEA string can easily be extracted. So [2]-[1] gives you the exact instant of the GPS second, [3] tells you which second of the day it was, all embedded in the data, no need to re-align multiple files afterwards. [0] gives you reassurance that you did not lose samples. This technique is already in use and it works well, we simply have to agree how many columns must pass through the chain.

Comment by John: it's not necessary to have any programmable shifting inside the FIR's, scaling is done by normalizing the FIR coefficients. And, if you know the signal is weak, you could boost the signal by adjusting the coefficients before load. Coefficients can be loaded any time before the start of the run. (they could probably be loaded on the fly, but it's simpler and more precise to organize a stop/reload/start so you know exactly when the transition happened).

Comment by Peter: the data at the end of the decimate chain is gathered by DMA, and after that it's a software function to handle the 3-way fanout: - raw /64 data to a monitor port - int24 data to SSD - shift and output 16 bit data to a monitor port (for remote real time monitoring .. tsunami detection?

Easily doable in software, since the data rate is now very low and the calculations are not challenging.

Conclusion: provided you don't need the concurrent full rate tap, it's completely doable, then our main concern is schedule.

Can we assume that the 4RADIO/8ADC (ACQ430) case is the primary case, and the larger (ACQ435) cases could follow later?

11/13/19 pflament

t this stage, my group needs to understand better the limits of the Zynq's FPGA, so my previous emails may have erroneous statements, please excuse my ignorance.

The five important constrains from our point of view are to:

(1) recover enough ENOBs to reach full 24 bits, which is about +5 bits, thus a minimum decimation factor of about 2^5 or 32

(2) compress the data by slowing down the final sample rate to not much more than twice the Nyquist of 180 Hz.

(3) have digital anti-aliasing filters with perfect near-perfect phase stability and unit gain in the pass band, and reasonable amplitude rejection in the stop band

(4) use internal integer arithmetic to avoid generating spurious histogram/pdf or spectral anomalies.

(5) be able to load FIR filter weights and scaling factors from the ARM and be able to modify them.

How you achieve the function internally is not important as long as these five constrains are met.

You could decimate by successive identical decimate_block doing 8^2, or 3^4, or 4^3, or 2^6, or 5^3, or 9^2, etc... the HF frequency fed to the TI1278 being each time adjusted so that the final frequency is about 400 Hz. All these decimation factors recover more ENOBs than actually needed.

Because there is no geophysical signal above 180Hz, any energy in the FIR stop bands is due to internally generated noise, such as power supply or DC fan switching; a stop band with sidelobes below -30-40 dB is probably sufficient, which may ease constrains on FIR filter design

What you (D-Tacq) need to assess first is how much FPGA resources each decimate_block takes, and the tradeoff between many decimate_blocks each with very short FIR filters and coarse integer representation of the weights, or few decimate_blocks with long FIR filters. We want to maximize the performance within the resources available, knowing that it will be a compromise.

I think that I have now outlined all what has crossed my mind regarding the design of this function.

Ideally, this function should be implemented by the end of the winter, when we will ship a couple MK-IV radars to Peralex in Capetown for testing and possible adoption in ZA. I'll be myself in Europe mid december to mid march, and will visit Glasgow at least once in January when we can review and refine the concept.

11/13/19 pflament

Yesterday I was transposing what we do now on the NUC i5, into a possible flowchart for implementing in the FPGA, with little thoughts for simplifications.

However note:

- the 6 kHz tap is legacy, to feed into WERA software. We do not do this anymore, so complicating the implementation to keep this feature is a moot point;

- finessing by shifting initially by 4 bit is an unnecessary complication, since at the end we have an excess of ENOB to fit in a 24-bit word;

- since there is an excess of ENOB at the end, the adjustment between 187, 375, 750 Hz final sample frequency based on the number of radar range cells needed, should be done by adjusting the HF clock fed to the TI1278, i.e. 24 MHz in high-res mode yield F°=48 KHz that can be decimated by *64 to yield 720 Hz. If lower final sample frequencies are needed, the division should happen in the HF clock, i.e. feeding 12 or 6 MHz to the TI1278. That way, the entire DSP machinery is frozen.

There seems to be quite a lot of matlab tools for designing integer FIR filters, see https://www.mathworks.com/help/dsp/ug/create-an-fir-filter-using-integer-coefficients.html

The simplified implementation would work as follows:

1. build a "decimate_block", which takes as input a 24-bit stream at F°, as parameters NFILT=512 filter weights, NDEC=8 the decimation factor, and NSHIFT the bit-shifting rescaling factor, applies the FIR anti-aliasing filter, and outputs a 24-bit stream at F°/8 with unit gain.

2. configure the actual pipeline, by feeding the output of the ADC, to two "decimate_block" in sequence dividing frequency by 8 each, and a final conversion to int16 if required.

Discussion should focus on experimenting in matlab with integer FIR filters and integer convolution. It may well be that instead of doing two successive *8 decimations, it might be better to do three successive *4 decimations, with simpler internal artithmetics in the "decimate_block".

My group understands well the signal processing aspects of the radar data, but has zero experience in FPGA and DSP programming. We need to interact at every step with the experts at D-Tacq.

More than anything else, I do not want to divert anyone's manpower to iterate uselessly and implement never-to-be-used bells and whistles.

Comments? what can be done in the FPGA for the "decimate_block"?

Attach:Screenshot-2020-01-27-03-41-45.png Δ

Attach:Screenshot-2020-01-27-03-43-57.png Δ

11/12/19 pflament

The attachment shows the desired implementation, which would yield a final archival data stream requiring no further processing. I have tried to keep things ultra-simple with minimum bells and whistles. This processing structure should have appeal to your other customers beyond the narrow radar need.

Assume that the following programmable configuration parameter are available on the ARM side of the Zynq (like in /mnt/local/etc): 1st FIR length (NFIR1), 1st FIR weights (int_array[1:NFIR1]), 1st FIR norm (NORMFIR1), 1st decimation factor (NDEC1), #bit shift for intermediate casting to int16 (NSHIFT1), 2nd FIR length (NFIR2), 2nd FIR weights (int_array[1:NFIR2]), 2nd FIR norm (NORMFIR2), 2nd decimation factor (NDEC2), #bit shift for final casting to int16 (NSHIFT2).

I believe that it is desirable to implement the entire processing in integer arithmetic, and do occasional bit-shifts to prevent overflow. I am open to reconsider this on advise from experienced DSP VHDL programmers: do the DSP support double arithmetic? at what cost in speed? at what cost in term of spectral spurious due to rounding? I prefer integer. Spectrally and statistically cleaner.

(this is reminiscent of my first graduate job in 1980, which was to write a fast Fourier transform in hexadecimal code and integer arithmetic for the then-new MC68000 because my adviser was too poor to purchase the assembler, and co-processors did not exist :-})

I also listen to those experienced about implementing filtering schemes: perhaps twice as many successive decimation by a factor 4 would be more amenable to the programming structure of the FPGA/DSP? I have myself zero experience.

So the process would be, in streaming mode:

1. following Peter's idea, sample at the maximum that the TI1278 will allow in high-res mode: 24 MHz HF, 48 kHz LF, x512 oversampling

1->2. perform calibration, load int24 into int32, make stream available on port 4201 as usual

1->3. given that the TI1278 is only ENOB of 19.5, discard least significant 4 bits of int24 through right-bit shift to free bits on the left, keep only 20 bits i.e. int20

3->4. load int20 samples into int64 words, that should leave 44 zero-padded bits in each sample; it is important to figure a way to use int64 in the FIR convolution.

4->5. apply 1st FIR filter; given that we have 44 blank bits, multiplying by int32 weights should use 31 bits, leaving 13 bits for averaging, thus we can potentially have a filter length up to 8192 while doing strict integer arithmetic, without risk of overflow; of course actual limit will be the handling capacity of FPGA DSP cores; it may well be that in practice, shorter FIR filters (512) are sufficient, possibly using int16 weights instead of int32 - we need to experiment.

5->6. decimate by NDEC1 (default 8)

6->7. scale by NORMFIR1 to ensure a gain of 1 (its tricky, even matlab get it wrong sometime)

7->8. cast into int32 to restore same scaling as original input, make stream available on port 4202

8->9. cast into int16 using the scaling NSHIFT1, make stream available on port 4203

Each time you decimate by a factor 2, you gain one bit, so at this stage we should have an ENOB of 22.5 bits.

Now repeat process:

8->10. load the 3 most significant bytes of int32 into int64 words, that should leave 40 zero-padded bits in each sample;

10->11. apply 2nd FIR filter; given that we have 40 blank bits, multiplying by int32 weights should use 31 bits, leaving 9 bits for averaging, thus we can potentially have a filter length up to 512 without risk of overflow;

11->12. decimate by NDEC2 (default 16)

12->13. scale by NORMFIR2 to ensure a gain of 1

13->14. cast into int32 to restore same scaling as original input, make stream available on port 4204

At this stage we should have an ENOB of 26.5 bits.

14->15. cast into int16 using the scaling NSHIFT2, make stream available on port 4205

We know that our current processing using double, works, which, in IEEE, is 11 bit for exponent and 52 bits for mantissa. Since the radar post-processing works well with just 16 bits, there should be considerable leeway in sacrificing weight precision and filter length - most likely 16 bit weight and 512 length twice will be adequate, or 256 length four times.

The filter length and weight array should be dynamically loadable from the ARM side, as we do not want to go back to D-Tacq each time we need to experiment with a different FIR filter!

Please read the above, and get back to me with simplifying ideas.

One more item: we need to move on with time-tagging the samples to UTC so we can work on multi-static and radar tomography. We will want each frame of 8, 16 or 32 channels of int32, to be preceded by a single int32 frame of UTC. There will be a third email about the GPS ntp/ptp, the development of which I do not expect D-Tacq to provide pro-bono.

Attach:Screenshot-2020-01-27-03-38-49.png Δ

11/12/19 peter.milne

Summary:

 ADC output : 24 bit resolution, 19.5bit effective, 12kSPS stream.

 DSP, currently implemented in Matlab:
 DSP0: convert to doubles.
 DSP1 : Chebyshev IIR filter decimate 2 for compatibility with WERA, 6kSPS
 DSP2 : Chebyshev IIR filter decimate 16, 375SPS
 DSP3 :  shift to int16.

Given that this is all "inside the black box" of the FPGA, there's no possibility to tap off DSP1, we suggest DSP1 and DSP2 could be combined.

We're not comfortable with IIR filters, we do know how to do FIR filters. What we suggest is: - the FPGA won't be comfortable working in doubles, assuming internally it's int32 throughout. - we can easily make an FIR filter that decimates by 32. - how many taps does the filter need to have?. Unfortunately, this we don't know.

What we'd like to recommend is, that you or one of your students set up a fixed-point FIR function in Matlab/Simulink (educational license are good), and design a filter that gives you the equivalent output to your existing processing. You have plenty of source data to test with.

Then: give us a specification for a decimate/32 FIR filter - number of taps, number of coefficients, and, if it fits int the device, we'll do it. At the rates we're talking about, very likely ~501 taps will be possible.

11/12/19 pflament

First two attachments show the block diagram and short technical description of the radar.

Attach:Screenshot-2020-01-27-03-30-43.png Δ

The key point is that we demodulate the return signal with a copy of the transmitted signal. Because we chirp, the demodulated audio signal that we feed to the ADC has extremely limited bandwidth, as shown below:

Attach:Screenshot-2020-01-27-03-26-48.png Δ

These typical parameters bring ocean backscatter near the middle of the Doppler spectrum; there is no useful ocean information above those audio frequencies. All operating parameters are satisfied with a final sampling frequency of 370 Hz.

This yields a compressed data rate of 54 Mb/hour or 0.9Mb/min or 15 kbytes/s. This processing scheme allows 8 years of raw archive on a single 4 Tbyte 2.5" drive, and transmission over LTE at an affordable 40 Gbyte/mo.

The next attachment skeetches the present situation under which over 40 radars are operating, all with D-Tacq gear.

Presently, the TI-1278 is driven at 6 MHz in the high-res mode with x512 oversampling, and an output of 12 kHz; the effective number of bits is about 19.5 packed into int24 3-byte frames. Present processing in the FPGA is minimum, limited to padding to standard int32 4-byte frames, which are then streamed through port 4201, to an i5 NUC.

The decimation is done in matlab: (i) conversion to 64-bit double, (ii) a decimation by a factor 2 using Chebyshev IIR filter to yield an intermediate data stream compatible with WERA software, (iii) a decimation by a factor 16 using Chebyshev IIR filter to yield the final sampling frequency of 375 Hz, (iv) a programmable n-bit-shift to provide scaling and casting into an int16 for output and archiving.

All this processing could be done directly in the Zynq-7000, alleviating the need for the i5.

Attach:Screenshot-2020-01-27-03-33-19.png Δ