Compare View

switch
from
...
to
 
Commits (2)

Diff

Showing 6 changed files Inline Diff

43.4 KB

148 KB

31.8 KB

46.6 KB

161 KB

ifcs2018_journal.tex
\documentclass[a4paper,journal]{IEEEtran/IEEEtran} 1
\usepackage{graphicx,color,hyperref} 2
\usepackage{amsfonts} 3
\usepackage{amsthm} 4
\usepackage{amssymb} 5
\usepackage{amsmath} 6
\usepackage{algorithm2e} 7
\usepackage{url,balance} 8
\usepackage[normalem]{ulem} 9
\usepackage{tikz} 10
\usetikzlibrary{positioning,fit} 11
\usepackage{multirow} 12
\usepackage{scalefnt} 13 1 \documentclass[a4paper,journal]{IEEEtran/IEEEtran}
\usepackage{caption} 14 2 \usepackage{graphicx,color,hyperref}
\usepackage{subcaption} 15 3 \usepackage{amsfonts}
16 4 \usepackage{amsthm}
17 5 \usepackage{amssymb}
\hyphenation{op-tical net-works semi-conduc-tor} 18 6 \usepackage{amsmath}
\textheight=26cm 19 7 \usepackage{algorithm2e}
\setlength{\footskip}{30pt} 20 8 \usepackage{url,balance}
\pagenumbering{gobble} 21 9 \usepackage[normalem]{ulem}
\begin{document} 22 10 \usepackage{tikz}
\title{Filter optimization for real time digital processing of radiofrequency signals: application 23 11 \usetikzlibrary{positioning,fit}
to oscillator metrology} 24 12 \usepackage{multirow}
25 13 \usepackage{scalefnt}
\author{\IEEEauthorblockN{A. Hugeat\IEEEauthorrefmark{1}\IEEEauthorrefmark{2}, J. Bernard\IEEEauthorrefmark{2}, 26 14 \usepackage{caption}
G. Goavec-M\'erou\IEEEauthorrefmark{1}, 27 15 \usepackage{subcaption}
P.-Y. Bourgeois\IEEEauthorrefmark{1}, J.-M. Friedt\IEEEauthorrefmark{1}}\\ 28 16
\IEEEauthorblockA{\IEEEauthorrefmark{1}FEMTO-ST, Time \& Frequency department, Besan\c con, France }\\ 29 17
\IEEEauthorblockA{\IEEEauthorrefmark{2}FEMTO-ST, Computer Science department DISC, Besan\c con, France \\ 30 18 \hyphenation{op-tical net-works semi-conduc-tor}
Email: \{pyb2,jmfriedt\}@femto-st.fr} 31 19 \textheight=26cm
} 32 20 \setlength{\footskip}{30pt}
\maketitle 33 21 \pagenumbering{gobble}
\thispagestyle{plain} 34 22 \begin{document}
\pagestyle{plain} 35 23 \title{Filter optimization for real time digital processing of radiofrequency signals: application
\newtheorem{definition}{Definition} 36 24 to oscillator metrology}
37 25
\begin{abstract} 38 26 \author{\IEEEauthorblockN{A. Hugeat\IEEEauthorrefmark{1}\IEEEauthorrefmark{2}, J. Bernard\IEEEauthorrefmark{2},
Software Defined Radio (SDR) provides stability, flexibility and reconfigurability to 39 27 G. Goavec-M\'erou\IEEEauthorrefmark{1},
radiofrequency signal processing. Applied to oscillator characterization in the context 40 28 P.-Y. Bourgeois\IEEEauthorrefmark{1}, J.-M. Friedt\IEEEauthorrefmark{1}}\\
of ultrastable clocks, stringent filtering requirements are defined by spurious signal or 41 29 \IEEEauthorblockA{\IEEEauthorrefmark{1}FEMTO-ST, Time \& Frequency department, Besan\c con, France }\\
noise rejection needs. Since real time radiofrequency processing must be performed in a 42 30 \IEEEauthorblockA{\IEEEauthorrefmark{2}FEMTO-ST, Computer Science department DISC, Besan\c con, France \\
Field Programmable Array to meet timing constraints, we investigate optimization strategies 43 31 Email: \{pyb2,jmfriedt\}@femto-st.fr}
to design filters meeting rejection characteristics while limiting the hardware resources 44 32 }
required and keeping timing constraints within the targeted measurement bandwidths. The 45 33 \maketitle
presented technique is applicable to scheduling any sequence of processing blocks characterized 46 34 \thispagestyle{plain}
by a throughput, resource occupation and performance tabulated as a function of configuration 47 35 \pagestyle{plain}
characateristics, as is the case for filters with their coefficients and resolution yielding 48 36 \newtheorem{definition}{Definition}
rejection and number of multipliers. 49 37
\end{abstract} 50 38 \begin{abstract}
51 39 Software Defined Radio (SDR) provides stability, flexibility and reconfigurability to
\begin{IEEEkeywords} 52 40 radiofrequency signal processing. Applied to oscillator characterization in the context
Software Defined Radio, Mixed-Integer Linear Programming, Finite Impulse Response filter 53 41 of ultrastable clocks, stringent filtering requirements are defined by spurious signal or
\end{IEEEkeywords} 54 42 noise rejection needs. Since real time radiofrequency processing must be performed in a
55 43 Field Programmable Array to meet timing constraints, we investigate optimization strategies
\section{Digital signal processing of ultrastable clock signals} 56 44 to design filters meeting rejection characteristics while limiting the hardware resources
57 45 required and keeping timing constraints within the targeted measurement bandwidths. The
Analog oscillator phase noise characteristics are classically performed by downconverting 58 46 presented technique is applicable to scheduling any sequence of processing blocks characterized
the radiofrequency signal using a saturated mixer to bring the radiofrequency signal to baseband, 59 47 by a throughput, resource occupation and performance tabulated as a function of configuration
followed by a Fourier analysis of the beat signal to analyze phase fluctuations close to carrier. In 60 48 characateristics, as is the case for filters with their coefficients and resolution yielding
a fully digital approach, the radiofrequency signal is digitized and numerically downconverted by 61 49 rejection and number of multipliers.
multiplying the samples with a local numerically controlled oscillator (Fig. \ref{schema}) \cite{rsi}. 62 50 \end{abstract}
63 51
\begin{figure}[h!tb] 64 52 \begin{IEEEkeywords}
\begin{center} 65 53 Software Defined Radio, Mixed-Integer Linear Programming, Finite Impulse Response filter
\includegraphics[width=.8\linewidth]{images/schema} 66 54 \end{IEEEkeywords}
\end{center} 67 55
\caption{Fully digital oscillator phase noise characterization: the Device Under Test 68 56 \section{Digital signal processing of ultrastable clock signals}
(DUT) signal is sampled by the radiofrequency grade Analog to Digital Converter (ADC) and 69 57
downconverted by mixing with a Numerically Controlled Oscillator (NCO). Unwanted signals 70 58 Analog oscillator phase noise characteristics are classically performed by downconverting
and noise aliases are rejected by a Low Pass Filter (LPF) implemented as a cascade of Finite 71 59 the radiofrequency signal using a saturated mixer to bring the radiofrequency signal to baseband,
Impulse Response (FIR) filters. The signal is then decimated before a Fourier analysis displays 72 60 followed by a Fourier analysis of the beat signal to analyze phase fluctuations close to carrier. In
the spectral characteristics of the phase fluctuations.} 73 61 a fully digital approach, the radiofrequency signal is digitized and numerically downconverted by
\label{schema} 74 62 multiplying the samples with a local numerically controlled oscillator (Fig. \ref{schema}) \cite{rsi}.
\end{figure} 75 63
76 64 \begin{figure}[h!tb]
As with the analog mixer, 77 65 \begin{center}
the non-linear behavior of the downconverter introduces noise or spurious signal aliasing as 78 66 \includegraphics[width=.8\linewidth]{images/schema}
well as the generation of the frequency sum signal in addition to the frequency difference. 79 67 \end{center}
These unwanted spectral characteristics must be rejected before decimating the data stream 80 68 \caption{Fully digital oscillator phase noise characterization: the Device Under Test
for the phase noise spectral characterization \cite{andrich2018high}. The characteristics introduced between the 81 69 (DUT) signal is sampled by the radiofrequency grade Analog to Digital Converter (ADC) and
downconverter 82 70 downconverted by mixing with a Numerically Controlled Oscillator (NCO). Unwanted signals
and the decimation processing blocks are core characteristics of an oscillator characterization 83 71 and noise aliases are rejected by a Low Pass Filter (LPF) implemented as a cascade of Finite
system, and must reject out-of-band signals below the targeted phase noise -- typically in the 84 72 Impulse Response (FIR) filters. The signal is then decimated before a Fourier analysis displays
sub -170~dBc/Hz for ultrastable oscillator we aim at characterizing. The filter blocks will 85 73 the spectral characteristics of the phase fluctuations.}
use most resources of the Field Programmable Gate Array (FPGA) used to process the radiofrequency 86 74 \label{schema}
datastream: optimizing the performance of the filter while reducing the needed resources is 87 75 \end{figure}
hence tackled in a systematic approach using optimization techniques. Most significantly, we 88 76
tackle the issue by attempting to cascade multiple Finite Impulse Response (FIR) filters with 89 77 As with the analog mixer,
tunable number of coefficients and tunable number of bits representing the coefficients and the 90 78 the non-linear behavior of the downconverter introduces noise or spurious signal aliasing as
data being processed. 91 79 well as the generation of the frequency sum signal in addition to the frequency difference.
92 80 These unwanted spectral characteristics must be rejected before decimating the data stream
\section{Finite impulse response filter} 93 81 for the phase noise spectral characterization \cite{andrich2018high}. The characteristics introduced between the
94 82 downconverter
We select FIR filters for their unconditional stability and ease of design. A FIR filter is defined 95 83 and the decimation processing blocks are core characteristics of an oscillator characterization
by a set of weights $b_k$ applied to the inputs $x_k$ through a convolution to generate the 96 84 system, and must reject out-of-band signals below the targeted phase noise -- typically in the
outputs $y_k$ 97 85 sub -170~dBc/Hz for ultrastable oscillator we aim at characterizing. The filter blocks will
\begin{align} 98 86 use most resources of the Field Programmable Gate Array (FPGA) used to process the radiofrequency
y_n=\sum_{k=0}^N b_k x_{n-k} 99 87 datastream: optimizing the performance of the filter while reducing the needed resources is
\label{eq:fir_equation} 100 88 hence tackled in a systematic approach using optimization techniques. Most significantly, we
\end{align} 101 89 tackle the issue by attempting to cascade multiple Finite Impulse Response (FIR) filters with
102 90 tunable number of coefficients and tunable number of bits representing the coefficients and the
As opposed to an implementation on a general purpose processor in which word size is defined by the 103 91 data being processed.
processor architecture, implementing such a filter on an FPGA offers more degrees of freedom since 104 92
not only the coefficient values and number of taps must be defined, but also the number of bits 105 93 \section{Finite impulse response filter}
defining the coefficients and the sample size. For this reason, and because we consider pipeline 106 94
processing (as opposed to First-In, First-Out FIFO memory batch processing) of radiofrequency 107 95 We select FIR filters for their unconditional stability and ease of design. A FIR filter is defined
signals, High Level Synthesis (HLS) languages \cite{kasbah2008multigrid} are not considered but 108 96 by a set of weights $b_k$ applied to the inputs $x_k$ through a convolution to generate the
the problem is tackled at the Very-high-speed-integrated-circuit Hardware Description Language 109 97 outputs $y_k$
(VHDL) level. 110 98 \begin{align}
Since latency is not an issue in a openloop phase noise characterization instrument, 111 99 y_n=\sum_{k=0}^N b_k x_{n-k}
the large 112 100 \label{eq:fir_equation}
numbre of taps in the FIR, as opposed to the shorter Infinite Impulse Response (IIR) filter, 113 101 \end{align}
is not considered as an issue as would be in a closed loop system. 114 102
115 103 As opposed to an implementation on a general purpose processor in which word size is defined by the
The coefficients are classically expressed as floating point values. However, this binary 116 104 processor architecture, implementing such a filter on an FPGA offers more degrees of freedom since
number representation is not efficient for fast arithmetic computation by an FPGA. Instead, 117 105 not only the coefficient values and number of taps must be defined, but also the number of bits
we select to quantify these floating point values into integer values. This quantization 118 106 defining the coefficients and the sample size. For this reason, and because we consider pipeline
will result in some precision loss. 119 107 processing (as opposed to First-In, First-Out FIFO memory batch processing) of radiofrequency
120 108 signals, High Level Synthesis (HLS) languages \cite{kasbah2008multigrid} are not considered but
\begin{figure}[h!tb] 121 109 the problem is tackled at the Very-high-speed-integrated-circuit Hardware Description Language
\includegraphics[width=\linewidth]{images/zero_values} 122 110 (VHDL) level.
\caption{Impact of the quantization resolution of the coefficients: the quantization is 123 111 Since latency is not an issue in a openloop phase noise characterization instrument,
set to 6~bits -- with the horizontal black lines indicating $\pm$1 least significant bit -- setting 124 112 the large
the 30~first and 30~last coefficients out of the initial 128~band-pass 125 113 numbre of taps in the FIR, as opposed to the shorter Infinite Impulse Response (IIR) filter,
filter coefficients to 0 (red dots).} 126 114 is not considered as an issue as would be in a closed loop system.
\label{float_vs_int} 127 115
\end{figure} 128 116 The coefficients are classically expressed as floating point values. However, this binary
129 117 number representation is not efficient for fast arithmetic computation by an FPGA. Instead,
The tradeoff between quantization resolution and number of coefficients when considering 130 118 we select to quantify these floating point values into integer values. This quantization
integer operations is not trivial. As an illustration of the issue related to the 131 119 will result in some precision loss.
relation between number of fiter taps and quantization, Fig. \ref{float_vs_int} exhibits 132 120
a 128-coefficient FIR bandpass filter designed using floating point numbers (blue). Upon 133 121 \begin{figure}[h!tb]
quantization on 6~bit integers, 60 of the 128~coefficients in the beginning and end of the 134 122 \includegraphics[width=\linewidth]{images/zero_values}
taps become null, making the large number of coefficients irrelevant: processing 135 123 \caption{Impact of the quantization resolution of the coefficients: the quantization is
resources 136 124 set to 6~bits -- with the horizontal black lines indicating $\pm$1 least significant bit -- setting
are hence saved by shrinking the filter length. This tradeoff aimed at minimizing resources 137 125 the 30~first and 30~last coefficients out of the initial 128~band-pass
to reach a given rejection level, or maximizing out of band rejection for a given computational 138 126 filter coefficients to 0 (red dots).}
resource, will drive the investigation on cascading filters designed with varying tap resolution 139 127 \label{float_vs_int}
and tap length, as will be shown in the next section. Indeed, our development strategy closely 140 128 \end{figure}
follows the skeleton approach \cite{crookes1998environment, crookes2000design, benkrid2002towards} 141 129
in which basic blocks are defined and characterized before being assembled \cite{hide} 142 130 The tradeoff between quantization resolution and number of coefficients when considering
in a complete processing chain. In our case, assembling the filter blocks is a simpler block 143 131 integer operations is not trivial. As an illustration of the issue related to the
combination process since we assume a single value to be processed and a single value to be 144 132 relation between number of fiter taps and quantization, Fig. \ref{float_vs_int} exhibits
generated at each clock cycle. The FIR filters will not be considered to decimate in the 145 133 a 128-coefficient FIR bandpass filter designed using floating point numbers (blue). Upon
current implementation: the decimation is assumed to be located after the FIR cascade at the 146 134 quantization on 6~bit integers, 60 of the 128~coefficients in the beginning and end of the
moment. 147 135 taps become null, making the large number of coefficients irrelevant: processing
148 136 resources
\section{Methodology description} 149 137 are hence saved by shrinking the filter length. This tradeoff aimed at minimizing resources
150 138 to reach a given rejection level, or maximizing out of band rejection for a given computational
Our objective is to develop a new methodology applicable to any Digital Signal Processing (DSP) 151 139 resource, will drive the investigation on cascading filters designed with varying tap resolution
chain obtained by assembling basic processing blocks, with hardware and manufacturer independence. 152 140 and tap length, as will be shown in the next section. Indeed, our development strategy closely
Achieving such a target requires defining an abstract model to represent some basic properties 153 141 follows the skeleton approach \cite{crookes1998environment, crookes2000design, benkrid2002towards}
of DSP blocks such as performance (i.e. rejection or ripples in the bandpass for filters) and 154 142 in which basic blocks are defined and characterized before being assembled \cite{hide}
resource occupation. These abstract properties, not necessarily related to the detailed hardware 155 143 in a complete processing chain. In our case, assembling the filter blocks is a simpler block
implementation of a given platform, will feed a scheduler solver aimed at assembling the optimum 156 144 combination process since we assume a single value to be processed and a single value to be
target, whether in terms of maximizing performance for a given arbitrary resource occupation, or 157 145 generated at each clock cycle. The FIR filters will not be considered to decimate in the
minimizing resource occupation for a given performance. In our approach, the solution of the 158 146 current implementation: the decimation is assumed to be located after the FIR cascade at the
solver is then synthesized using the dedicated tool provided by each platform manufacturer 159 147 moment.
to assess the validity of our abstract resource occupation indicator, and the result of running 160 148
the DSP chain on the FPGA allows for assessing the performance of the scheduler. We emphasize 161 149 \section{Methodology description}
that all solutions found by the solver are synthesized and executed on hardware at the end 162 150
of the analysis. 163 151 Our objective is to develop a new methodology applicable to any Digital Signal Processing (DSP)
164 152 chain obtained by assembling basic processing blocks, with hardware and manufacturer independence.
In this demonstration, we focus on only two operations: filtering and shifting the number of 165 153 Achieving such a target requires defining an abstract model to represent some basic properties
bits needed to represent the data along the processing chain. 166 154 of DSP blocks such as performance (i.e. rejection or ripples in the bandpass for filters) and
We have chosen these basic operations because shifting and the filtering have already been studied 167 155 resource occupation. These abstract properties, not necessarily related to the detailed hardware
in the literature \cite{lim_1996, lim_1988, young_1992, smith_1998} providing a framework for 168 156 implementation of a given platform, will feed a scheduler solver aimed at assembling the optimum
assessing our results. Furthermore, filtering is a core step in any radiofrequency frontend 169 157 target, whether in terms of maximizing performance for a given arbitrary resource occupation, or
requiring pipelined processing at full bandwidth for the earliest steps, including for 170 158 minimizing resource occupation for a given performance. In our approach, the solution of the
time and frequency transfer or characterization \cite{carolina1,carolina2,rsi}. 171 159 solver is then synthesized using the dedicated tool provided by each platform manufacturer
172 160 to assess the validity of our abstract resource occupation indicator, and the result of running
Addressing only two operations allows for demonstrating the methodology but should not be 173 161 the DSP chain on the FPGA allows for assessing the performance of the scheduler. We emphasize
considered as a limitation of the framework which can be extended to assembling any number 174 162 that all solutions found by the solver are synthesized and executed on hardware at the end
of skeleton blocks as long as performance and resource occupation can be determined. 175 163 of the analysis.
Hence, 176 164
in this paper we will apply our methodology on simple DSP chains: a white noise input signal 177 165 In this demonstration, we focus on only two operations: filtering and shifting the number of
is generated using a Pseudo-Random Number (PRN) generator or by sampling a wideband (125~MS/s) 178 166 bits needed to represent the data along the processing chain.
14-bit Analog to Digital Converter (ADC) loaded by a 50~$\Omega$ resistor. Once samples have been 179 167 We have chosen these basic operations because shifting and the filtering have already been studied
digitized at a rate of 125~MS/s, filtering is applied to qualify the processing block performance -- 180 168 in the literature \cite{lim_1996, lim_1988, young_1992, smith_1998} providing a framework for
practically meeting the radiofrequency frontend requirement of noise and bandwidth reduction 181 169 assessing our results. Furthermore, filtering is a core step in any radiofrequency frontend
by filtering and decimating. Finally, bursts of filtered samples are stored for post-processing, 182 170 requiring pipelined processing at full bandwidth for the earliest steps, including for
allowing to assess either filter rejection for a given resource usage, or validating the rejection 183 171 time and frequency transfer or characterization \cite{carolina1,carolina2,rsi}.
when implementing a solution minimizing resource occupation. 184 172
185 173 Addressing only two operations allows for demonstrating the methodology but should not be
The first step of our approach is to model the DSP chain. Since we aim at only optimizing 186 174 considered as a limitation of the framework which can be extended to assembling any number
the filtering part of the signal processing chain, we have not included the PRN generator or the 187 175 of skeleton blocks as long as performance and resource occupation can be determined.
ADC in the model: the input data size and rate are considered fixed and defined by the hardware. 188 176 Hence,
The filtering can be done in two ways, either by considering a single monolithic FIR filter 189 177 in this paper we will apply our methodology on simple DSP chains: a white noise input signal
requiring many coefficients to reach the targeted noise rejection ratio, or by 190 178 is generated using a Pseudo-Random Number (PRN) generator or by sampling a wideband (125~MS/s)
cascading multiple FIR filters, each with fewer coefficients than found in the monolithic filter. 191 179 14-bit Analog to Digital Converter (ADC) loaded by a 50~$\Omega$ resistor. Once samples have been
192 180 digitized at a rate of 125~MS/s, filtering is applied to qualify the processing block performance --
After each filter we leave the possibility of shifting the filtered data to consume 193 181 practically meeting the radiofrequency frontend requirement of noise and bandwidth reduction
less resources. Hence in the case of cascaded filter, we define a stage as a filter 194 182 by filtering and decimating. Finally, bursts of filtered samples are stored for post-processing,
and a shifter (the shift could be omitted if we do not need to divide the filtered data). 195 183 allowing to assess either filter rejection for a given resource usage, or validating the rejection
196 184 when implementing a solution minimizing resource occupation.
\subsection{Model of a FIR filter} 197 185
198 186 The first step of our approach is to model the DSP chain. Since we aim at only optimizing
A cascade of filters is composed of $n$ FIR stages. In stage $i$ ($1 \leq i \leq n$) 199 187 the filtering part of the signal processing chain, we have not included the PRN generator or the
the FIR has $C_i$ coefficients and each coefficient is an integer value with $\pi^C_i$ 200 188 ADC in the model: the input data size and rate are considered fixed and defined by the hardware.
bits while the filtered data are shifted by $\pi^S_i$ bits. We define also $\pi^-_i$ as 201 189 The filtering can be done in two ways, either by considering a single monolithic FIR filter
the size of input data and $\pi^+_i$ as the size of output data. The figure~\ref{fig:fir_stage} 202 190 requiring many coefficients to reach the targeted noise rejection ratio, or by
shows a filtering stage. 203 191 cascading multiple FIR filters, each with fewer coefficients than found in the monolithic filter.
204 192
\begin{figure} 205 193 After each filter we leave the possibility of shifting the filtered data to consume
\centering 206 194 less resources. Hence in the case of cascaded filter, we define a stage as a filter
\begin{tikzpicture}[node distance=2cm] 207 195 and a shifter (the shift could be omitted if we do not need to divide the filtered data).
\node[draw,minimum size=1.3cm] (FIR) { $C_i, \pi_i^C$ } ; 208 196
\node[draw,minimum size=1.3cm] (Shift) [right of=FIR, ] { $\pi_i^S$ } ; 209 197 \subsection{Model of a FIR filter}
\node (Start) [left of=FIR] { } ; 210 198
\node (End) [right of=Shift] { } ; 211 199 A cascade of filters is composed of $n$ FIR stages. In stage $i$ ($1 \leq i \leq n$)
212 200 the FIR has $C_i$ coefficients and each coefficient is an integer value with $\pi^C_i$
\node[draw,fit=(FIR) (Shift)] (Filter) { } ; 213 201 bits while the filtered data are shifted by $\pi^S_i$ bits. We define also $\pi^-_i$ as
214 202 the size of input data and $\pi^+_i$ as the size of output data. The figure~\ref{fig:fir_stage}
\draw[->] (Start) edge node [above] { $\pi_i^-$ } (FIR) ; 215 203 shows a filtering stage.
\draw[->] (FIR) -- (Shift) ; 216 204
\draw[->] (Shift) edge node [above] { $\pi_i^+$ } (End) ; 217 205 \begin{figure}
\end{tikzpicture} 218 206 \centering
\caption{A single filter is composed of a FIR (on the left) and a Shifter (on the right)} 219 207 \begin{tikzpicture}[node distance=2cm]
\label{fig:fir_stage} 220 208 \node[draw,minimum size=1.3cm] (FIR) { $C_i, \pi_i^C$ } ;
\end{figure} 221 209 \node[draw,minimum size=1.3cm] (Shift) [right of=FIR, ] { $\pi_i^S$ } ;
222 210 \node (Start) [left of=FIR] { } ;
FIR $i$ has been characterized through numerical simulation as able to reject $F(C_i, \pi_i^C)$ dB. 223 211 \node (End) [right of=Shift] { } ;
This rejection has been computed using GNU Octave software FIR coefficient design functions 224 212
(\texttt{firls} and \texttt{fir1}). 225 213 \node[draw,fit=(FIR) (Shift)] (Filter) { } ;
For each configuration $(C_i, \pi_i^C)$, we first create a FIR with floating point coefficients and a given $C_i$ number of coefficients. 226 214
Then, the floating point coefficients are discretized into integers. In order to ensure that the coefficients are coded on $\pi_i^C$~bits effectively, 227 215 \draw[->] (Start) edge node [above] { $\pi_i^-$ } (FIR) ;
the coefficients are normalized by their absolute maximum before being scaled to integer coefficients. 228 216 \draw[->] (FIR) -- (Shift) ;
At least one coefficient is coded on $\pi_i^C$~bits, and in practice only $b_{C_i/2}$ is coded on $\pi_i^C$~bits while the others are coded on much fewer bits. 229 217 \draw[->] (Shift) edge node [above] { $\pi_i^+$ } (End) ;
230 218 \end{tikzpicture}
With these coefficients, the \texttt{freqz} function is used to estimate the magnitude of the filter 231 219 \caption{A single filter is composed of a FIR (on the left) and a Shifter (on the right)}
transfer function. 232 220 \label{fig:fir_stage}
Comparing the performance between FIRs requires however defining a unique criterion. As shown in figure~\ref{fig:fir_mag}, 233 221 \end{figure}
the FIR magnitude exhibits two parts: we focus here on the transitions width and the rejection rather than on the 234 222
bandpass ripples as emphasized in \cite{lim_1988,lim_1996}. Throughout this demonstration, 235 223 FIR $i$ has been characterized through numerical simulation as able to reject $F(C_i, \pi_i^C)$ dB.
we arbitrarily set a bandpass of 40\% of the Nyquist frequency and a bandstop from 60\% 236 224 This rejection has been computed using GNU Octave software FIR coefficient design functions
of the Nyquist frequency to the end of the band, as would be typically selected to prevent 237 225 (\texttt{firls} and \texttt{fir1}).
aliasing before decimating the dataflow by 2. The method is however generalized to any filter 238 226 For each configuration $(C_i, \pi_i^C)$, we first create a FIR with floating point coefficients and a given $C_i$ number of coefficients.
shape as long as it is defined from the initial modeling steps: Fig. \ref{fig:rejection_pyramid} 239 227 Then, the floating point coefficients are discretized into integers. In order to ensure that the coefficients are coded on $\pi_i^C$~bits effectively,
as described below is indeed unique for each filter shape. 240 228 the coefficients are normalized by their absolute maximum before being scaled to integer coefficients.
241 229 At least one coefficient is coded on $\pi_i^C$~bits, and in practice only $b_{C_i/2}$ is coded on $\pi_i^C$~bits while the others are coded on much fewer bits.
\begin{figure} 242 230
\begin{center} 243 231 With these coefficients, the \texttt{freqz} function is used to estimate the magnitude of the filter
\scalebox{0.8}{ 244 232 transfer function.
\centering 245 233 Comparing the performance between FIRs requires however defining a unique criterion. As shown in figure~\ref{fig:fir_mag},
\begin{tikzpicture}[scale=0.3] 246 234 the FIR magnitude exhibits two parts: we focus here on the transitions width and the rejection rather than on the
\draw[<->] (0,15) -- (0,0) -- (21,0) ; 247 235 bandpass ripples as emphasized in \cite{lim_1988,lim_1996}. Throughout this demonstration,
\draw[thick] (0,12) -- (8,12) -- (20,0) ; 248 236 we arbitrarily set a bandpass of 40\% of the Nyquist frequency and a bandstop from 60\%
249 237 of the Nyquist frequency to the end of the band, as would be typically selected to prevent
\draw (0,14) node [left] { $P$ } ; 250 238 aliasing before decimating the dataflow by 2. The method is however generalized to any filter
\draw (20,0) node [below] { $f$ } ; 251 239 shape as long as it is defined from the initial modeling steps: Fig. \ref{fig:rejection_pyramid}
252 240 as described below is indeed unique for each filter shape.
\draw[>=latex,<->] (0,14) -- (8,14) ; 253 241
\draw (4,14) node [above] { passband } node [below] { $40\%$ } ; 254 242 \begin{figure}
255 243 \begin{center}
\draw[>=latex,<->] (8,14) -- (12,14) ; 256 244 \scalebox{0.8}{
\draw (10,14) node [above] { transition } node [below] { $20\%$ } ; 257 245 \centering
258 246 \begin{tikzpicture}[scale=0.3]
\draw[>=latex,<->] (12,14) -- (20,14) ; 259 247 \draw[<->] (0,15) -- (0,0) -- (21,0) ;
\draw (16,14) node [above] { stopband } node [below] { $40\%$ } ; 260 248 \draw[thick] (0,12) -- (8,12) -- (20,0) ;
261 249
\draw[>=latex,<->] (16,12) -- (16,8) ; 262 250 \draw (0,14) node [left] { $P$ } ;
\draw (16,10) node [right] { rejection } ; 263 251 \draw (20,0) node [below] { $f$ } ;
264 252
\draw[dashed] (8,-1) -- (8,14) ; 265 253 \draw[>=latex,<->] (0,14) -- (8,14) ;
\draw[dashed] (12,-1) -- (12,14) ; 266 254 \draw (4,14) node [above] { passband } node [below] { $40\%$ } ;
267 255
\draw[dashed] (8,12) -- (16,12) ; 268 256 \draw[>=latex,<->] (8,14) -- (12,14) ;
\draw[dashed] (12,8) -- (16,8) ; 269 257 \draw (10,14) node [above] { transition } node [below] { $20\%$ } ;
270 258
\end{tikzpicture} 271 259 \draw[>=latex,<->] (12,14) -- (20,14) ;
} 272 260 \draw (16,14) node [above] { stopband } node [below] { $40\%$ } ;
\end{center} 273 261
\caption{Shape of the filter transmitted power $P$ as a function of frequency $f$: 274 262 \draw[>=latex,<->] (16,12) -- (16,8) ;
the passband is considered to occupy the initial 40\% of the Nyquist frequency range, 275 263 \draw (16,10) node [right] { rejection } ;
the stopband the last 40\%, allowing 20\% transition width.} 276 264
\label{fig:fir_mag} 277 265 \draw[dashed] (8,-1) -- (8,14) ;
\end{figure} 278 266 \draw[dashed] (12,-1) -- (12,14) ;
279 267
In the transition band, the behavior of the filter is left free, we only define the passband and the stopband characteristics. 280 268 \draw[dashed] (8,12) -- (16,12) ;
Initial considered criteria include the mean value of the stopband rejection which yields unacceptable results since notches 281 269 \draw[dashed] (12,8) -- (16,8) ;
overestimate the rejection capability of the filter. 282 270
An intermediate criterion considered the maximal rejection within the stopband, to which the sum of the absolute values 283 271 \end{tikzpicture}
within the passband is subtracted to avoid filters with excessive ripples, normalized to the 284 272 }
bin width to remain consistent with the passband criterion (dBc/Hz units in all cases). 285 273 \end{center}
In this case, cascading too many filters with individual excessive ($>$ 1~dB) passband ripples 286 274 \caption{Shape of the filter transmitted power $P$ as a function of frequency $f$:
led to unacceptable ($>$ 10~dB) final ripple levels, especially close to the transition band. 287 275 the passband is considered to occupy the initial 40\% of the Nyquist frequency range,
Hence, the final criterion considers the minimal rejection in the stopband to which the 288 276 the stopband the last 40\%, allowing 20\% transition width.}
the maximal amplitude in the passband (maximum value minus the minimum value) is substracted, with 289 277 \label{fig:fir_mag}
a 1~dB threshold on the latter quantity over which the filter is discarded. 290 278 \end{figure}
With this 291 279
criterion, we meet the expected rejection capability of low pass filters as shown in figure~\ref{fig:custom_criterion}. 292 280 In the transition band, the behavior of the filter is left free, we only define the passband and the stopband characteristics.
The best filter has a correct rejection estimation and the worst filter 293
is discarded based on the excessive passband ripple criterion. 294 281 Initial considered criteria include the mean value of the stopband rejection which yields unacceptable results since notches
295 282 overestimate the rejection capability of the filter.
\begin{figure} 296 283 An intermediate criterion considered the maximal rejection within the stopband, to which the sum of the absolute values
\centering 297
\includegraphics[width=\linewidth]{images/custom_criterion} 298
\caption{Selected filter qualification criterion computed as the maximum rejection in the stopband 299
minus the maximal ripple amplitude in the passband with a $>$ 1~dB threshold above which the filter is discarded: 300
comparison between monolithic filter (blue, rejected in this case) and cascaded filters (red).} 301 284 within the passband is subtracted to avoid filters with excessive ripples, normalized to the
\label{fig:custom_criterion} 302 285 bin width to remain consistent with the passband criterion (dBc/Hz units in all cases).
\end{figure} 303 286 In this case, cascading too many filters with individual excessive ($>$ 1~dB) passband ripples
304 287 led to unacceptable ($>$ 10~dB) final ripple levels, especially close to the transition band.
Thanks to the latter criterion which will be used in the remainder of this paper, we are able to automatically generate multiple FIR taps 305 288 Hence, the final criterion considers the minimal rejection in the stopband to which the
and estimate their rejection. Figure~\ref{fig:rejection_pyramid} exhibits the 306 289 the maximal amplitude in the passband (maximum value minus the minimum value) is substracted, with
rejection as a function of the number of coefficients and the number of bits representing these coefficients. 307 290 a 1~dB threshold on the latter quantity over which the filter is discarded.
The curve shaped as a pyramid exhibits optimum configurations sets at the vertex where both edges meet. 308
Indeed for a given number of coefficients, increasing the number of bits over the edge will not improve the rejection. 309
Conversely when setting the a given number of bits, increasing the number of coefficients will not improve 310
the rejection. Hence the best coefficient set are on the vertex of the pyramid. Notice that the word length 311
and number of coefficients do not start at 1: filters with too few coefficients or too little tap word size are rejected 312
by the excessive ripple constraint of the criterion. Hence, the size of the pyramid is significantly reduced by discarding 313 291 With this
these filters and so is the solution search space. 314 292 criterion, we meet the expected rejection capability of low pass filters as shown in figure~\ref{fig:custom_criterion}.
315 293 The best filter has a correct rejection estimation and the worst filter
\begin{figure} 316 294 is discarded based on the excessive passband ripple criterion.
\centering 317
\includegraphics[width=\linewidth]{images/rejection_pyramid} 318
\caption{Filter rejection as a function of number of coefficients and number of bits 319
: this lookup table will be used to identify which filter parameters -- number of bits 320
representing coefficients and number of coefficients -- best match the targeted transfer function. Filters 321
with fewer than 10~taps or with coefficients coded on fewer than 5~bits are discarded due to excessive 322
ripples in the passband.} 323
\label{fig:rejection_pyramid} 324 295
\end{figure} 325 296 \begin{figure}
326 297 \centering
Although we have an efficient criterion to estimate the rejection of one set of coefficients (taps), 327 298 \includegraphics[width=\linewidth]{images/custom_criterion}
we have a problem when we cascade filters and estimate the criterion as a sum two or more individual criteria. 328 299 \caption{Selected filter qualification criterion computed as the maximum rejection in the stopband
If the FIR filter coefficients are the same between the stages, we have: 329 300 minus the maximal ripple amplitude in the passband with a $>$ 1~dB threshold above which the filter is discarded:
$$F_{total} = F_1 + F_2$$ 330 301 comparison between monolithic filter (blue, rejected in this case) and cascaded filters (red).}
But selecting two different sets of coefficient will yield a more complex situation in which 331 302 \label{fig:custom_criterion}
the previous relation is no longer valid as illustrated on figure~\ref{fig:sum_rejection}. The red and blue curves 332 303 \end{figure}
are two different filters with maximums and notches not located at the same frequency offsets. 333 304
Hence when summing the transfer functions, the resulting rejection shown as the dashed yellow line is improved 334 305 Thanks to the latter criterion which will be used in the remainder of this paper, we are able to automatically generate multiple FIR taps
with respect to a basic sum of the rejection criteria shown as a the dotted yellow line. 335 306 and estimate their rejection. Figure~\ref{fig:rejection_pyramid} exhibits the
Thus, estimating the rejection of filter cascades is more complex than taking the sum of all the rejection 336 307 rejection as a function of the number of coefficients and the number of bits representing these coefficients.
criteria of each filter. However since the individual filter rejection sum underestimates the rejection capability of the cascade, 337 308 The curve shaped as a pyramid exhibits optimum configurations sets at the vertex where both edges meet.
this upper bound is considered as a conservative and acceptable criterion for deciding on the suitability 338 309 Indeed for a given number of coefficients, increasing the number of bits over the edge will not improve the rejection.
of the filter cascade to meet design criteria. 339 310 Conversely when setting the a given number of bits, increasing the number of coefficients will not improve
340 311 the rejection. Hence the best coefficient set are on the vertex of the pyramid. Notice that the word length
\begin{figure} 341 312 and number of coefficients do not start at 1: filters with too few coefficients or too little tap word size are rejected
\centering 342 313 by the excessive ripple constraint of the criterion. Hence, the size of the pyramid is significantly reduced by discarding
\includegraphics[width=\linewidth]{images/cascaded_criterion} 343 314 these filters and so is the solution search space.
\caption{Transfer function of individual filters and after cascading the two filters, 344 315
demonstrating that the selected criterion of maximum rejection in the bandstop (horizontal 345 316 \begin{figure}
lines) is met. Notice that the cascaded filter has better rejection than summing the bandstop 346 317 \centering
maximum of each individual filter. 347 318 \includegraphics[width=\linewidth]{images/rejection_pyramid}
} 348 319 \caption{Filter rejection as a function of number of coefficients and number of bits
\label{fig:sum_rejection} 349 320 : this lookup table will be used to identify which filter parameters -- number of bits
\end{figure} 350 321 representing coefficients and number of coefficients -- best match the targeted transfer function. Filters
351 322 with fewer than 10~taps or with coefficients coded on fewer than 5~bits are discarded due to excessive
Finally in our case, we consider that the input signal are fully known. The 352 323 ripples in the passband.}
resolution of the input data stream are fixed and still the same for all experiments 353 324 \label{fig:rejection_pyramid}
in this paper. 354 325 \end{figure}
355 326
Based on this analysis, we address the estimate of resource consumption (called 356 327 Although we have an efficient criterion to estimate the rejection of one set of coefficients (taps),
silicon area -- in the case of FPGAs this means processing cells) as a function of 357 328 we have a problem when we cascade filters and estimate the criterion as a sum two or more individual criteria.
filter characteristics. As a reminder, we do not aim at matching actual hardware 358 329 If the FIR filter coefficients are the same between the stages, we have:
configuration but consider an arbitrary silicon area occupied by each processing function, 359 330 $$F_{total} = F_1 + F_2$$
and will assess after synthesis the adequation of this arbitrary unit with actual 360 331 But selecting two different sets of coefficient will yield a more complex situation in which
hardware resources provided by FPGA manufacturers. The sum of individual processing 361 332 the previous relation is no longer valid as illustrated on figure~\ref{fig:sum_rejection}. The red and blue curves
unit areas is constrained by a total silicon area representative of FPGA global resources. 362 333 are two different filters with maximums and notches not located at the same frequency offsets.
Formally, variable $a_i$ is the area taken by filter~$i$ 363 334 Hence when summing the transfer functions, the resulting rejection shown as the dashed yellow line is improved
(in arbitrary unit). Variable $r_i$ is the rejection of filter~$i$ (in dB). 364 335 with respect to a basic sum of the rejection criteria shown as a the dotted yellow line.
Constant $\mathcal{A}$ is the total available area. We model our problem as follows: 365
366 336 Thus, estimating the rejection of filter cascades is more complex than taking the sum of all the rejection
\begin{align} 367 337 criteria of each filter. However since the individual filter rejection sum underestimates the rejection capability of the cascade,
\text{Maximize } & \sum_{i=1}^n r_i \notag \\ 368
\sum_{i=1}^n a_i & \leq \mathcal{A} & \label{eq:area} \\ 369 338 this upper bound is considered as a conservative and acceptable criterion for deciding on the suitability
a_i & = C_i \times (\pi_i^C + \pi_i^-), & \forall i \in [1, n] \label{eq:areadef} \\ 370 339 of the filter cascade to meet design criteria.
r_i & = F(C_i, \pi_i^C), & \forall i \in [1, n] \label{eq:rejectiondef} \\ 371 340
\pi_i^+ & = \pi_i^- + \pi_i^C - \pi_i^S, & \forall i \in [1, n] \label{eq:bits} \\ 372 341 \begin{figure}
\pi_{i - 1}^+ & = \pi_i^-, & \forall i \in [2, n] \label{eq:inout} \\ 373 342 \centering
\pi_i^+ & \geq 1 + \sum_{k=1}^{i} \left(1 + \frac{r_j}{6}\right), & \forall i \in [1, n] \label{eq:maxshift} \\ 374 343 \includegraphics[width=\linewidth]{images/cascaded_criterion}
\pi_1^- &= \Pi^I \label{eq:init} 375 344 \caption{Transfer function of individual filters and after cascading the two filters,
\end{align} 376 345 demonstrating that the selected criterion of maximum rejection in the bandstop (horizontal
377 346 lines) is met. Notice that the cascaded filter has better rejection than summing the bandstop
Equation~\ref{eq:area} states that the total area taken by the filters must be 378 347 maximum of each individual filter.
less than the available area. Equation~\ref{eq:areadef} gives the definition of 379 348 }
the area used by a filter, considered as the area of the FIR since the Shifter is 380 349 \label{fig:sum_rejection}
assumed not to require significant resources. We consider that the FIR needs $C_i$ registers of size 381 350 \end{figure}
$\pi_i^C + \pi_i^-$~bits to store the results of the multiplications of the 382 351
input data with the coefficients. Equation~\ref{eq:rejectiondef} gives the 383 352 Finally in our case, we consider that the input signal are fully known. The
definition of the rejection of the filter thanks to the tabulated function~$F$ that we defined 384 353 resolution of the input data stream are fixed and still the same for all experiments
previously. The Shifter does not introduce negative rejection as we will explain later, 385 354 in this paper.
so the rejection only comes from the FIR. Equation~\ref{eq:bits} states the 386 355
relation between $\pi_i^+$ and $\pi_i^-$. The multiplications in the FIR add 387 356 Based on this analysis, we address the estimate of resource consumption (called
$\pi_i^C$ bits as most coefficients are close to zero, and the Shifter removes 388
$\pi_i^S$ bits. Equation~\ref{eq:inout} states that the output number of bits of 389 357 silicon area -- in the case of FPGAs this means processing cells) as a function of
a filter is the same as the input number of bits of the next filter. 390 358 filter characteristics. As a reminder, we do not aim at matching actual hardware
Equation~\ref{eq:maxshift} ensures that the Shifter does not introduce negative 391 359 configuration but consider an arbitrary silicon area occupied by each processing function,
rejection. Indeed, the results of the FIR can be right shifted without compromising 392 360 and will assess after synthesis the adequation of this arbitrary unit with actual
the quality of the rejection until a threshold. Each bit of the output data 393 361 hardware resources provided by FPGA manufacturers. The sum of individual processing
increases the maximum rejection level by 6~dB. We add one to take the sign bit 394 362 unit areas is constrained by a total silicon area representative of FPGA global resources.
into account. If equation~\ref{eq:maxshift} was not present, the Shifter could 395 363 Formally, variable $a_i$ is the area taken by filter~$i$
shift too much and introduce some noise in the output data. Each supplementary 396 364 (in arbitrary unit). Variable $r_i$ is the rejection of filter~$i$ (in dB).
shift bit would cause an additional 6~dB rejection rise. A totally equivalent equation is: 397 365 Constant $\mathcal{A}$ is the total available area. We model our problem as follows:
$\pi_i^S \leq \pi_i^- + \pi_i^C - 1 - \sum_{k=1}^{i} \left(1 + \frac{r_j}{6}\right)$. 398 366
Finally, equation~\ref{eq:init} gives the number of bits of the global input. 399 367 \begin{align}
400 368 \text{Maximize } & \sum_{i=1}^n r_i \notag \\
This model is non-linear since we multiply some variable with another variable 401 369 \sum_{i=1}^n a_i & \leq \mathcal{A} & \label{eq:area} \\
and it is even non-quadratic, as the cost function $F$ does not have a known 402 370 a_i & = C_i \times (\pi_i^C + \pi_i^-), & \forall i \in [1, n] \label{eq:areadef} \\
linear or quadratic expression. To linearize this problem, we introduce $p$ FIR configurations. 403 371 r_i & = F(C_i, \pi_i^C), & \forall i \in [1, n] \label{eq:rejectiondef} \\
This variable $p$ is defined by the user, and represents the number of different 404 372 \pi_i^+ & = \pi_i^- + \pi_i^C - \pi_i^S, & \forall i \in [1, n] \label{eq:bits} \\
set of coefficients generated (remember, we use \texttt{firls} and \texttt{fir1} 405 373 \pi_{i - 1}^+ & = \pi_i^-, & \forall i \in [2, n] \label{eq:inout} \\
functions from GNU Octave) based on the targeted filter characteristics and implementation 406 374 \pi_i^+ & \geq 1 + \sum_{k=1}^{i} \left(1 + \frac{r_j}{6}\right), & \forall i \in [1, n] \label{eq:maxshift} \\
assumptions (estimated number of bits defining the coefficients). Hence, $C_{ij}$ and 407 375 \pi_1^- &= \Pi^I \label{eq:init}
$\pi_{ij}^C$ become constants and 408 376 \end{align}
we define $1 \leq j \leq p$ so that the function $F$ can be estimated (Look Up Table) 409 377
for each configurations thanks to the rejection criterion. We also define the binary 410 378 Equation~\ref{eq:area} states that the total area taken by the filters must be
variable $\delta_{ij}$ that has value 1 if stage~$i$ is in configuration~$j$ 411 379 less than the available area. Equation~\ref{eq:areadef} gives the definition of
and 0 otherwise. The new equations are as follows: 412 380 the area used by a filter, considered as the area of the FIR since the Shifter is
413 381 assumed not to require significant resources. We consider that the FIR needs $C_i$ registers of size
\begin{align} 414 382 $\pi_i^C + \pi_i^-$~bits to store the results of the multiplications of the
a_i & = \sum_{j=1}^p \delta_{ij} \times C_{ij} \times (\pi_{ij}^C + \pi_i^-), & \forall i \in [1, n] \label{eq:areadef2} \\ 415 383 input data with the coefficients. Equation~\ref{eq:rejectiondef} gives the
r_i & = \sum_{j=1}^p \delta_{ij} \times F(C_{ij}, \pi_{ij}^C), & \forall i \in [1, n] \label{eq:rejectiondef2} \\ 416 384 definition of the rejection of the filter thanks to the tabulated function~$F$ that we defined
\pi_i^+ & = \pi_i^- + \left(\sum_{j=1}^p \delta_{ij} \pi_{ij}^C\right) - \pi_i^S, & \forall i \in [1, n] \label{eq:bits2} \\ 417 385 previously. The Shifter does not introduce negative rejection as we will explain later,
\sum_{j=1}^p \delta_{ij} & \leq 1, & \forall i \in [1, n] \label{eq:config} 418 386 so the rejection only comes from the FIR. Equation~\ref{eq:bits} states the
\end{align} 419 387 relation between $\pi_i^+$ and $\pi_i^-$. The multiplications in the FIR add
420 388 $\pi_i^C$ bits as most coefficients are close to zero, and the Shifter removes
Equations \ref{eq:areadef2}, \ref{eq:rejectiondef2} and \ref{eq:bits2} replace 421 389 $\pi_i^S$ bits. Equation~\ref{eq:inout} states that the output number of bits of
respectively equations \ref{eq:areadef}, \ref{eq:rejectiondef} and \ref{eq:bits}. 422 390 a filter is the same as the input number of bits of the next filter.
Equation~\ref{eq:config} states that for each stage, a single configuration is chosen at most. 423 391 Equation~\ref{eq:maxshift} ensures that the Shifter does not introduce negative
424 392 rejection. Indeed, the results of the FIR can be right shifted without compromising
The problem remains quadratic at this stage since in the constraint~\ref{eq:areadef2} 425 393 the quality of the rejection until a threshold. Each bit of the output data
we multiply 426 394 increases the maximum rejection level by 6~dB. We add one to take the sign bit
$\delta_{ij}$ and $\pi_i^-$. However, since $\delta_{ij}$ is a binary variable we can 427 395 into account. If equation~\ref{eq:maxshift} was not present, the Shifter could
linearize this multiplication. The following formula shows how to linearize 428 396 shift too much and introduce some noise in the output data. Each supplementary
this situation in general case with $y$ a binary variable and $x$ a real variable ($0 \leq x \leq X^{max}$): 429 397 shift bit would cause an additional 6~dB rejection rise. A totally equivalent equation is:
\begin{equation*} 430 398 $\pi_i^S \leq \pi_i^- + \pi_i^C - 1 - \sum_{k=1}^{i} \left(1 + \frac{r_j}{6}\right)$.
m = x \times y \implies 431 399 Finally, equation~\ref{eq:init} gives the number of bits of the global input.
\left \{ 432 400
\begin{split} 433 401 This model is non-linear since we multiply some variable with another variable
m & \geq 0 \\ 434 402 and it is even non-quadratic, as the cost function $F$ does not have a known
m & \leq y \times X^{max} \\ 435 403 linear or quadratic expression. To linearize this problem, we introduce $p$ FIR configurations.
m & \leq x \\ 436
m & \geq x - (1 - y) \times X^{max} \\ 437
\end{split} 438
\right . 439
\end{equation*} 440
So if we bound up $\pi_i^-$ by 128~bits which is the maximum data size whose estimation is 441
assumed on hardware characteristics, 442
the Gurobi (\url{www.gurobi.com}) optimization software will be able to linearize 443
for us the quadratic problem so the model is left as is. This model 444
has $O(np)$ variables and $O(n)$ constraints. 445 404 This variable $p$ is defined by the user, and represents the number of different
446 405 set of coefficients generated (remember, we use \texttt{firls} and \texttt{fir1}
Two problems will be addressed using the workflow described in the next section: on the one 447 406 functions from GNU Octave) based on the targeted filter characteristics and implementation
hand maximizing the rejection capability of a set of cascaded filters occupying a fixed arbitrary 448 407 assumptions (estimated number of bits defining the coefficients). Hence, $C_{ij}$ and
silicon area (section~\ref{sec:fixed_area}) and on the second hand the dual problem of minimizing the silicon area 449 408 $\pi_{ij}^C$ become constants and
for a fixed rejection criterion (section~\ref{sec:fixed_rej}). In the latter case, the 450 409 we define $1 \leq j \leq p$ so that the function $F$ can be estimated (Look Up Table)
objective function is replaced with: 451 410 for each configurations thanks to the rejection criterion. We also define the binary
\begin{align} 452 411 variable $\delta_{ij}$ that has value 1 if stage~$i$ is in configuration~$j$
\text{Minimize } & \sum_{i=1}^n a_i \notag 453 412 and 0 otherwise. The new equations are as follows:
\end{align} 454 413
We adapt our constraints of quadratic program to replace equation \ref{eq:area} 455 414 \begin{align}
with equation \ref{eq:rejection_min} where $\mathcal{R}$ is the minimal 456 415 a_i & = \sum_{j=1}^p \delta_{ij} \times C_{ij} \times (\pi_{ij}^C + \pi_i^-), & \forall i \in [1, n] \label{eq:areadef2} \\
rejection required. 457 416 r_i & = \sum_{j=1}^p \delta_{ij} \times F(C_{ij}, \pi_{ij}^C), & \forall i \in [1, n] \label{eq:rejectiondef2} \\
458 417 \pi_i^+ & = \pi_i^- + \left(\sum_{j=1}^p \delta_{ij} \pi_{ij}^C\right) - \pi_i^S, & \forall i \in [1, n] \label{eq:bits2} \\
\begin{align} 459 418 \sum_{j=1}^p \delta_{ij} & \leq 1, & \forall i \in [1, n] \label{eq:config}
\sum_{i=1}^n r_i & \geq \mathcal{R} & \label{eq:rejection_min} 460 419 \end{align}
\end{align} 461 420
462 421 Equations \ref{eq:areadef2}, \ref{eq:rejectiondef2} and \ref{eq:bits2} replace
\section{Design workflow} 463 422 respectively equations \ref{eq:areadef}, \ref{eq:rejectiondef} and \ref{eq:bits}.
\label{sec:workflow} 464 423 Equation~\ref{eq:config} states that for each stage, a single configuration is chosen at most.
465 424
In this section, we describe the workflow to compute all the results presented in sections~\ref{sec:fixed_area} 466
and \ref{sec:fixed_rej}. Figure~\ref{fig:workflow} shows the global workflow and the different steps involved 467
in the computation of the results. 468
469
\begin{figure} 470
\centering 471
\begin{tikzpicture}[node distance=0.75cm and 2cm] 472
\node[draw,minimum size=1cm] (Solver) { Filter Solver } ; 473
\node (Start) [left= 3cm of Solver] { } ; 474
\node[draw,minimum size=1cm] (TCL) [right= of Solver] { TCL Script } ; 475
\node (Input) [above= of TCL] { } ; 476 425 The problem remains quadratic at this stage since in the constraint~\ref{eq:areadef2}
\node[draw,minimum size=1cm] (Deploy) [below= of Solver] { Deploy Script } ; 477 426 we multiply
\node[draw,minimum size=1cm] (Bitstream) [below= of TCL] { Bitstream } ; 478 427 $\delta_{ij}$ and $\pi_i^-$. However, since $\delta_{ij}$ is a binary variable we can
\node[draw,minimum size=1cm,rounded corners] (Board) [below right= of Deploy] { Board } ; 479 428 linearize this multiplication. The following formula shows how to linearize
\node[draw,minimum size=1cm] (Postproc) [below= of Deploy] { Post-Processing } ; 480 429 this situation in general case with $y$ a binary variable and $x$ a real variable ($0 \leq x \leq X^{max}$):
\node (Results) [left= of Postproc] { } ; 481 430 \begin{equation*}
482 431 m = x \times y \implies
\draw[->] (Start) edge node [above] { $\mathcal{A}, n, \Pi^I$ } node [below] { $(C_{ij}, \pi_{ij}^C), F$ } (Solver) ; 483 432 \left \{
\draw[->] (Input) edge node [left] { ADC or PRN } (TCL) ; 484 433 \begin{split}
\draw[->] (Solver) edge node [below] { (1a) } (TCL) ; 485 434 m & \geq 0 \\
\draw[->] (Solver) edge node [right] { (1b) } (Deploy) ; 486 435 m & \leq y \times X^{max} \\
\draw[->] (TCL) edge node [left] { (2) } (Bitstream) ; 487 436 m & \leq x \\
\draw[->,dashed] (Bitstream) -- (Deploy) ; 488 437 m & \geq x - (1 - y) \times X^{max} \\
\draw[->] (Deploy) to[out=-30,in=120] node [above] { (3) } (Board) ; 489 438 \end{split}
\draw[->] (Board) to[out=150,in=-60] node [below] { (4) } (Deploy) ; 490 439 \right .
\draw[->] (Deploy) edge node [left] { (5) } (Postproc) ; 491 440 \end{equation*}
\draw[->] (Postproc) -- (Results) ; 492 441 So if we bound up $\pi_i^-$ by 128~bits which is the maximum data size whose estimation is
\end{tikzpicture} 493 442 assumed on hardware characteristics,
\caption{Design workflow from the input parameters to the results allowing for 494 443 the Gurobi (\url{www.gurobi.com}) optimization software will be able to linearize
a fully automated optimal solution search.} 495 444 for us the quadratic problem so the model is left as is. This model
\label{fig:workflow} 496 445 has $O(np)$ variables and $O(n)$ constraints.
\end{figure} 497 446
498
The filter solver is a C++ program that takes as input the maximum area 499
$\mathcal{A}$, the number of stages $n$, the size of the input signal $\Pi^I$, 500
the FIR configurations $(C_{ij}, \pi_{ij}^C)$ and the function $F$. It creates 501
the quadratic programs and uses the Gurobi solver to estimate the optimal results. 502
Then it produces two scripts: a TCL script ((1a) on figure~\ref{fig:workflow}) 503
and a deploy script ((1b) on figure~\ref{fig:workflow}). 504
505
The TCL script describes the whole digital processing chain from the beginning 506
(the raw signal data) to the end (the filtered data) in a language compatible 507
with proprietary synthesis software, namely Vivado for Xilinx and Quartus for 508
Intel/Altera. The raw input data generated from a 20-bit Pseudo Random Number (PRN) 509
generator inside the FPGA and $\Pi^I$ is fixed at 16~bits. 510
Then the script builds each stage of the chain with a generic FIR task that 511
comes from a skeleton library. The generic FIR is highly configurable 512
with the number of coefficients and the size of the coefficients. The coefficients 513
themselves are not stored in the script. 514
As the signal is processed in real-time, the output signal is stored as 515
consecutive bursts of data for post-processing, mainly assessing the consistency of the 516
implemented FIR cascade transfer function with the design criteria and the expected 517
transfer function. 518
519
The TCL script is used by Vivado to produce the FPGA bitstream ((2) on figure~\ref{fig:workflow}). 520
We use the 2018.2 version of Xilinx Vivado and we execute the synthesized 521
bitstream on a Redpitaya board fitted with a Xilinx Zynq-7010 series 522
FPGA (xc7z010clg400-1) and two LTC2145 14-bit 125~MS/s ADC, loaded with 50~$\Omega$ resistors to 523
provide a broadband noise source. 524
The board runs the Linux kernel and surrounding environment produced from the 525
Buildroot framework available at \url{https://github.com/trabucayre/redpitaya/}: configuring 526
the Zynq FPGA, feeding the FIR with the set of coefficients, executing the simulation and 527
fetching the results is automated. 528 447 Two problems will be addressed using the workflow described in the next section: on the one
529 448 hand maximizing the rejection capability of a set of cascaded filters occupying a fixed arbitrary
The deploy script uploads the bitstream to the board ((3) on 530 449 silicon area (section~\ref{sec:fixed_area}) and on the second hand the dual problem of minimizing the silicon area
figure~\ref{fig:workflow}), flashes the FPGA, loads the different drivers, 531 450 for a fixed rejection criterion (section~\ref{sec:fixed_rej}). In the latter case, the
configures the coefficients of the FIR filters. It then waits for the results 532 451 objective function is replaced with:
and retrieves the data to the main computer ((4) on figure~\ref{fig:workflow}). 533 452 \begin{align}
534 453 \text{Minimize } & \sum_{i=1}^n a_i \notag
Finally, an Octave post-processing script computes the final results thanks to 535 454 \end{align}
the output data ((5) on figure~\ref{fig:workflow}). 536 455 We adapt our constraints of quadratic program to replace equation \ref{eq:area}
The results are normalized so that the Power Spectrum Density (PSD) starts at zero 537 456 with equation \ref{eq:rejection_min} where $\mathcal{R}$ is the minimal
and the different configurations can be compared. 538 457 rejection required.
539 458
\section{Maximizing the rejection at fixed silicon area} 540 459 \begin{align}
\label{sec:fixed_area} 541 460 \sum_{i=1}^n r_i & \geq \mathcal{R} & \label{eq:rejection_min}
This section presents the output of the filter solver {\em i.e.} the computed 542 461 \end{align}
configurations for each stage, the computed rejection and the computed silicon area. 543 462
Such results allow for understanding the choices made by the solver to compute its solutions. 544 463 \section{Design workflow}
545 464 \label{sec:workflow}
The experimental setup is composed of three cases. The raw input is generated 546 465
by a Pseudo Random Number (PRN) generator, which fixes the input data size $\Pi^I$. 547 466 In this section, we describe the workflow to compute all the results presented in sections~\ref{sec:fixed_area}
Then the total silicon area $\mathcal{A}$ has been fixed to either 500, 1000 or 1500 548 467 and \ref{sec:fixed_rej}. Figure~\ref{fig:workflow} shows the global workflow and the different steps involved
arbitrary units. Hence, the three cases have been named: MAX/500, MAX/1000, MAX/1500. 549 468 in the computation of the results.
The number of configurations $p$ is 1133, with $C_i$ ranging from 3 to 60 and $\pi^C$ 550 469
ranging from 2 to 22. In each case, the quadratic program has been able to give a 551 470 \begin{figure}
result up to five stages ($n = 5$) in the cascaded filter. 552 471 \centering
553 472 \begin{tikzpicture}[node distance=0.75cm and 2cm]
Table~\ref{tbl:gurobi_max_500} shows the results obtained by the filter solver for MAX/500. 554 473 \node[draw,minimum size=1cm] (Solver) { Filter Solver } ;
Table~\ref{tbl:gurobi_max_1000} shows the results obtained by the filter solver for MAX/1000. 555 474 \node (Start) [left= 3cm of Solver] { } ;
Table~\ref{tbl:gurobi_max_1500} shows the results obtained by the filter solver for MAX/1500. 556 475 \node[draw,minimum size=1cm] (TCL) [right= of Solver] { TCL Script } ;
557 476 \node (Input) [above= of TCL] { } ;
\renewcommand{\arraystretch}{1.4} 558 477 \node[draw,minimum size=1cm] (Deploy) [below= of Solver] { Deploy Script } ;
559 478 \node[draw,minimum size=1cm] (Bitstream) [below= of TCL] { Bitstream } ;
\begin{table} 560 479 \node[draw,minimum size=1cm,rounded corners] (Board) [below right= of Deploy] { Board } ;
\caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MAX/500} 561 480 \node[draw,minimum size=1cm] (Postproc) [below= of Deploy] { Post-Processing } ;
\label{tbl:gurobi_max_500} 562 481 \node (Results) [left= of Postproc] { } ;
\centering 563 482
{\scalefont{0.77} 564 483 \draw[->] (Start) edge node [above] { $\mathcal{A}, n, \Pi^I$ } node [below] { $(C_{ij}, \pi_{ij}^C), F$ } (Solver) ;
\begin{tabular}{|c|ccccc|c|c|} 565 484 \draw[->] (Input) edge node [left] { ADC or PRN } (TCL) ;
\hline 566 485 \draw[->] (Solver) edge node [below] { (1a) } (TCL) ;
$n$ & $i = 1$ & $i = 2$ & $i = 3$ & $i = 4$ & $i = 5$ & Rejection & Area \\ 567 486 \draw[->] (Solver) edge node [right] { (1b) } (Deploy) ;
\hline 568 487 \draw[->] (TCL) edge node [left] { (2) } (Bitstream) ;
1 & (21, 7, 0) & - & - & - & - & 32~dB & 483 \\ 569 488 \draw[->,dashed] (Bitstream) -- (Deploy) ;
2 & (3, 5, 18) & (33, 10, 0) & - & - & - & 48~dB & 492 \\ 570 489 \draw[->] (Deploy) to[out=-30,in=120] node [above] { (3) } (Board) ;
3 & (3, 5, 18) & (19, 7, 1) & (15, 7, 0) & - & - & 56~dB & 493 \\ 571 490 \draw[->] (Board) to[out=150,in=-60] node [below] { (4) } (Deploy) ;
4 & (3, 5, 18) & (19, 7, 1) & (15, 7, 0) & - & - & 56~dB & 493 \\ 572 491 \draw[->] (Deploy) edge node [left] { (5) } (Postproc) ;
5 & (3, 5, 18) & (19, 7, 1) & (15, 7, 0) & - & - & 56~dB & 493 \\ 573 492 \draw[->] (Postproc) -- (Results) ;
\hline 574 493 \end{tikzpicture}
\end{tabular} 575 494 \caption{Design workflow from the input parameters to the results allowing for
} 576 495 a fully automated optimal solution search.}
\end{table} 577 496 \label{fig:workflow}
578 497 \end{figure}
\begin{table} 579 498
\caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MAX/1000} 580 499 The filter solver is a C++ program that takes as input the maximum area
\label{tbl:gurobi_max_1000} 581 500 $\mathcal{A}$, the number of stages $n$, the size of the input signal $\Pi^I$,
\centering 582 501 the FIR configurations $(C_{ij}, \pi_{ij}^C)$ and the function $F$. It creates
{\scalefont{0.77} 583 502 the quadratic programs and uses the Gurobi solver to estimate the optimal results.
\begin{tabular}{|c|ccccc|c|c|} 584 503 Then it produces two scripts: a TCL script ((1a) on figure~\ref{fig:workflow})
\hline 585 504 and a deploy script ((1b) on figure~\ref{fig:workflow}).
$n$ & $i = 1$ & $i = 2$ & $i = 3$ & $i = 4$ & $i = 5$ & Rejection & Area \\ 586 505
\hline 587 506 The TCL script describes the whole digital processing chain from the beginning
1 & (37, 11, 0) & - & - & - & - & 56~dB & 999 \\ 588 507 (the raw signal data) to the end (the filtered data) in a language compatible
2 & (15, 8, 17) & (35, 11, 0) & - & - & - & 80~dB & 990 \\ 589 508 with proprietary synthesis software, namely Vivado for Xilinx and Quartus for
3 & (3, 13, 26) & (31, 9, 1) & (27, 9, 0) & - & - & 92~dB & 999 \\ 590 509 Intel/Altera. The raw input data generated from a 20-bit Pseudo Random Number (PRN)
4 & (3, 5, 18) & (19, 7, 1) & (19, 7, 0) & (19, 7, 0) & - & 98~dB & 994 \\ 591 510 generator inside the FPGA and $\Pi^I$ is fixed at 16~bits.
5 & (3, 5, 18) & (19, 7, 1) & (19, 7, 0) & (19, 7, 0) & - & 98~dB & 994 \\ 592 511 Then the script builds each stage of the chain with a generic FIR task that
\hline 593 512 comes from a skeleton library. The generic FIR is highly configurable
\end{tabular} 594 513 with the number of coefficients and the size of the coefficients. The coefficients
} 595 514 themselves are not stored in the script.
\end{table} 596 515 As the signal is processed in real-time, the output signal is stored as
597 516 consecutive bursts of data for post-processing, mainly assessing the consistency of the
\begin{table} 598 517 implemented FIR cascade transfer function with the design criteria and the expected
\caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MAX/1500} 599 518 transfer function.
\label{tbl:gurobi_max_1500} 600 519
\centering 601 520 The TCL script is used by Vivado to produce the FPGA bitstream ((2) on figure~\ref{fig:workflow}).
{\scalefont{0.77} 602 521 We use the 2018.2 version of Xilinx Vivado and we execute the synthesized
\begin{tabular}{|c|ccccc|c|c|} 603 522 bitstream on a Redpitaya board fitted with a Xilinx Zynq-7010 series
\hline 604 523 FPGA (xc7z010clg400-1) and two LTC2145 14-bit 125~MS/s ADC, loaded with 50~$\Omega$ resistors to
$n$ & $i = 1$ & $i = 2$ & $i = 3$ & $i = 4$ & $i = 5$ & Rejection & Area \\ 605 524 provide a broadband noise source.
\hline 606 525 The board runs the Linux kernel and surrounding environment produced from the
1 & (47, 15, 0) & - & - & - & - & 71~dB & 1457 \\ 607 526 Buildroot framework available at \url{https://github.com/trabucayre/redpitaya/}: configuring
2 & (19, 6, 15) & (51, 14, 0) & - & - & - & 102~dB & 1489 \\ 608 527 the Zynq FPGA, feeding the FIR with the set of coefficients, executing the simulation and
3 & (15, 9, 18) & (31, 8, 0) & (27, 9, 0) & - & - & 116~dB & 1488 \\ 609 528 fetching the results is automated.
4 & (3, 9, 22) & (31, 9, 1) & (27, 9, 0) & (19, 7, 0) & - & 125~dB & 1500 \\ 610 529
5 & (3, 9, 22) & (31, 9, 1) & (27, 9, 0) & (19, 7, 0) & - & 125~dB & 1500 \\ 611 530 The deploy script uploads the bitstream to the board ((3) on
\hline 612 531 figure~\ref{fig:workflow}), flashes the FPGA, loads the different drivers,
\end{tabular} 613 532 configures the coefficients of the FIR filters. It then waits for the results
} 614 533 and retrieves the data to the main computer ((4) on figure~\ref{fig:workflow}).
\end{table} 615 534
616 535 Finally, an Octave post-processing script computes the final results thanks to
\renewcommand{\arraystretch}{1} 617 536 the output data ((5) on figure~\ref{fig:workflow}).
618 537 The results are normalized so that the Power Spectrum Density (PSD) starts at zero
By analyzing these tables, we can first state that we reach an optimal solution 619 538 and the different configurations can be compared.
for each case : $n = 3$ for MAX/500, and $n = 4$ for MAX/1000 and MAX/1500. Moreover 620 539
the cascaded filters always exhibit better performance than the monolithic solution. 621 540 \section{Maximizing the rejection at fixed silicon area}
It was an expected result as it has 622 541 \label{sec:fixed_area}
been previously observed that many small filters are better than 623 542 This section presents the output of the filter solver {\em i.e.} the computed
a single large filter \cite{lim_1988, lim_1996, young_1992}, despite such conclusions 624 543 configurations for each stage, the computed rejection and the computed silicon area.
being hardly used in practice due to the lack of tools for identifying individual filter 625 544 Such results allow for understanding the choices made by the solver to compute its solutions.
coefficients in the cascaded approach. 626 545
627 546 The experimental setup is composed of three cases. The raw input is generated
Second, the larger the silicon area, the better the rejection. This was also an 628 547 by a Pseudo Random Number (PRN) generator, which fixes the input data size $\Pi^I$.
expected result as more area means a filter of better quality with more coefficients 629 548 Then the total silicon area $\mathcal{A}$ has been fixed to either 500, 1000 or 1500
or more bits per coefficient. 630 549 arbitrary units. Hence, the three cases have been named: MAX/500, MAX/1000, MAX/1500.
631 550 The number of configurations $p$ is 1133, with $C_i$ ranging from 3 to 60 and $\pi^C$
Then, we also observe that the first stage can have a larger shift than the other 632 551 ranging from 2 to 22. In each case, the quadratic program has been able to give a
stages. This is explained by the fact that the solver tries to use just enough 633 552 result up to five stages ($n = 5$) in the cascaded filter.
bits for the computed rejection after each stage. In the first stage, a 634 553
balance between a strong rejection with a low number of bits is targeted. Equation~\ref{eq:maxshift} 635 554 Table~\ref{tbl:gurobi_max_500} shows the results obtained by the filter solver for MAX/500.
gives the relation between both values. 636 555 Table~\ref{tbl:gurobi_max_1000} shows the results obtained by the filter solver for MAX/1000.
637 556 Table~\ref{tbl:gurobi_max_1500} shows the results obtained by the filter solver for MAX/1500.
Finally, we note that the solver consumes all the given silicon area. 638 557
639 558 \renewcommand{\arraystretch}{1.4}
The following graphs present the rejection for real data on the FPGA. In all the following 640 559
figures, the solid line represents the actual rejection of the filtered 641 560 \begin{table}
data on the FPGA as measured experimentally and the dashed line are the noise levels 642 561 \caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MAX/500}
given by the quadratic solver. The configurations are those computed in the previous section. 643 562 \label{tbl:gurobi_max_500}
644 563 \centering
Figure~\ref{fig:max_500_result} shows the rejection of the different configurations in the case of MAX/500. 645 564 {\scalefont{0.77}
Figure~\ref{fig:max_1000_result} shows the rejection of the different configurations in the case of MAX/1000. 646
Figure~\ref{fig:max_1500_result} shows the rejection of the different configurations in the case of MAX/1500. 647 565 \begin{tabular}{|c|ccccc|c|c|}
648 566 \hline
\begin{figure} 649 567 $n$ & $i = 1$ & $i = 2$ & $i = 3$ & $i = 4$ & $i = 5$ & Rejection & Area \\
\centering 650 568 \hline
\begin{subfigure}{\linewidth} 651 569 1 & (21, 7, 0) & - & - & - & - & 32~dB & 483 \\
\includegraphics[width=\linewidth]{images/max_500} 652 570 2 & (3, 5, 18) & (33, 10, 0) & - & - & - & 48~dB & 492 \\
\caption{Filter transfer functions for varying number of cascaded filters solving 653 571 3 & (3, 5, 18) & (19, 7, 1) & (15, 7, 0) & - & - & 56~dB & 493 \\
the MAX/500 problem of maximizing rejection for a given resource allocation (500~arbitrary units).} 654 572 4 & (3, 5, 18) & (19, 7, 1) & (15, 7, 0) & - & - & 56~dB & 493 \\
\label{fig:max_500_result} 655 573 5 & (3, 5, 18) & (19, 7, 1) & (15, 7, 0) & - & - & 56~dB & 493 \\
\end{subfigure} 656 574 \hline
657 575 \end{tabular}
\begin{subfigure}{\linewidth} 658 576 }
\includegraphics[width=\linewidth]{images/max_1000} 659 577 \end{table}
\caption{Filter transfer functions for varying number of cascaded filters solving 660 578
the MAX/1000 problem of maximizing rejection for a given resource allocation (1000~arbitrary units).} 661 579 \begin{table}
\label{fig:max_1000_result} 662 580 \caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MAX/1000}
\end{subfigure} 663 581 \label{tbl:gurobi_max_1000}
664 582 \centering
\begin{subfigure}{\linewidth} 665 583 {\scalefont{0.77}
\includegraphics[width=\linewidth]{images/max_1500} 666 584 \begin{tabular}{|c|ccccc|c|c|}
\caption{Filter transfer functions for varying number of cascaded filters solving 667 585 \hline
the MAX/1500 problem of maximizing rejection for a given resource allocation (1500~arbitrary units).} 668 586 $n$ & $i = 1$ & $i = 2$ & $i = 3$ & $i = 4$ & $i = 5$ & Rejection & Area \\
\label{fig:max_1500_result} 669 587 \hline
\end{subfigure} 670 588 1 & (37, 11, 0) & - & - & - & - & 56~dB & 999 \\
\caption{Solutions for the MAX/500, MAX/1000 and MAX/1500 problems of maximizing 671 589 2 & (15, 8, 17) & (35, 11, 0) & - & - & - & 80~dB & 990 \\
rejection for a given resource allocation. 672 590 3 & (3, 13, 26) & (31, 9, 1) & (27, 9, 0) & - & - & 92~dB & 999 \\
The filter shape constraint (bandpass and bandstop) is shown as thick 673 591 4 & (3, 5, 18) & (19, 7, 1) & (19, 7, 0) & (19, 7, 0) & - & 98~dB & 994 \\
horizontal lines on each chart.} 674 592 5 & (3, 5, 18) & (19, 7, 1) & (19, 7, 0) & (19, 7, 0) & - & 98~dB & 994 \\
\end{figure} 675 593 \hline
676 594 \end{tabular}
In all cases, we observe that the actual rejection is close to the rejection computed by the solver. 677 595 }
678 596 \end{table}
We compare the actual silicon resources given by Vivado to the 679 597
resources in arbitrary units. 680 598 \begin{table}
The goal is to check that our arbitrary units of silicon area models well enough 681 599 \caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MAX/1500}
the real resources on the FPGA. Especially we want to verify that, for a given 682 600 \label{tbl:gurobi_max_1500}
number of arbitrary units, the actual silicon resources do not depend on the 683 601 \centering
number of stages $n$. Most significantly, our approach aims 684 602 {\scalefont{0.77}
at remaining far enough from the practical logic gate implementation used by 685 603 \begin{tabular}{|c|ccccc|c|c|}
various vendors to remain platform independent and be portable from one 686 604 \hline
architecture to another. 687 605 $n$ & $i = 1$ & $i = 2$ & $i = 3$ & $i = 4$ & $i = 5$ & Rejection & Area \\
688 606 \hline
Table~\ref{tbl:resources_usage} shows the resources usage in the case of MAX/500, MAX/1000 and 689 607 1 & (47, 15, 0) & - & - & - & - & 71~dB & 1457 \\
MAX/1500 \emph{i.e.} when the maximum allowed silicon area is fixed to 500, 1000 690 608 2 & (19, 6, 15) & (51, 14, 0) & - & - & - & 102~dB & 1489 \\
and 1500 arbitrary units. We have taken care to extract solely the resources used by 691 609 3 & (15, 9, 18) & (31, 8, 0) & (27, 9, 0) & - & - & 116~dB & 1488 \\
the FIR filters and remove additional processing blocks including FIFO and Programmable 692 610 4 & (3, 9, 22) & (31, 9, 1) & (27, 9, 0) & (19, 7, 0) & - & 125~dB & 1500 \\
Logic (PL -- FPGA) to Processing System (PS -- general purpose processor) communication. 693 611 5 & (3, 9, 22) & (31, 9, 1) & (27, 9, 0) & (19, 7, 0) & - & 125~dB & 1500 \\
694 612 \hline
\begin{table}[h!tb] 695 613 \end{tabular}
\caption{Resource occupation following synthesis of the solutions found for 696 614 }
the problem of maximizing rejection for a given resource allocation. The last column refers to available resources on a Zynq-7010 as found on the Redpitaya.} 697 615 \end{table}
\label{tbl:resources_usage} 698 616
\centering 699 617 \renewcommand{\arraystretch}{1}
\begin{tabular}{|c|c|ccc|c|} 700 618
\hline 701 619 By analyzing these tables, we can first state that we reach an optimal solution
$n$ & & MAX/500 & MAX/1000 & MAX/1500 & \emph{Zynq 7010} \\ \hline\hline 702
& LUT & 249 & 453 & 627 & \emph{17600} \\ 703
1 & BRAM & 1 & 1 & 1 & \emph{120} \\ 704 620 for each case : $n = 3$ for MAX/500, and $n = 4$ for MAX/1000 and MAX/1500. Moreover
& DSP & 21 & 37 & 47 & \emph{80} \\ \hline 705 621 the cascaded filters always exhibit better performance than the monolithic solution.
& LUT & 2253 & 474 & 691 & \emph{17600} \\ 706 622 It was an expected result as it has
2 & BRAM & 2 & 2 & 2 & \emph{120} \\ 707 623 been previously observed that many small filters are better than
& DSP & 0 & 50 & 70 & \emph{80} \\ \hline 708 624 a single large filter \cite{lim_1988, lim_1996, young_1992}, despite such conclusions
& LUT & 1329 & 2006 & 3158 & \emph{17600} \\ 709 625 being hardly used in practice due to the lack of tools for identifying individual filter
3 & BRAM & 3 & 3 & 3 & \emph{120} \\ 710 626 coefficients in the cascaded approach.
& DSP & 15 & 30 & 42 & \emph{80} \\ \hline 711 627
& LUT & 1329 & 1600 & 2260 & \emph{17600} \\ 712 628 Second, the larger the silicon area, the better the rejection. This was also an
4 & BRAM & 3 & 4 & 4 & \emph{120} \\ 713 629 expected result as more area means a filter of better quality with more coefficients
& DPS & 15 & 38 & 49 & \emph{80} \\ \hline 714 630 or more bits per coefficient.
& LUT & 1329 & 1600 & 2260 & \emph{17600} \\ 715 631
5 & BRAM & 3 & 4 & 4 & \emph{120} \\ 716 632 Then, we also observe that the first stage can have a larger shift than the other
& DPS & 15 & 38 & 49 & \emph{80} \\ \hline 717 633 stages. This is explained by the fact that the solver tries to use just enough
\end{tabular} 718 634 bits for the computed rejection after each stage. In the first stage, a
\end{table} 719 635 balance between a strong rejection with a low number of bits is targeted. Equation~\ref{eq:maxshift}
720 636 gives the relation between both values.
In case $n = 2$ for MAX/500, Vivado replaces the DSPs by Look Up Tables (LUTs). We assume that, 721 637
when the filter coefficients are small enough, or when the input size is small 722 638 Finally, we note that the solver consumes all the given silicon area.
enough, Vivado optimizes resource consumption by selecting multiplexers to 723 639
implement the multiplications instead of a DSP. In this case, it is quite difficult 724 640 The following graphs present the rejection for real data on the FPGA. In all the following
to compare the whole silicon budget. 725 641 figures, the solid line represents the actual rejection of the filtered
726 642 data on the FPGA as measured experimentally and the dashed line are the noise levels
However, a rough estimation can be made with a simple equivalence: looking at 727 643 given by the quadratic solver. The configurations are those computed in the previous section.
the first column (MAX/500), where the number of LUTs is quite stable for $n \geq 2$, 728 644
we can deduce that a DSP is roughly equivalent to 100~LUTs in terms of silicon 729 645 Figure~\ref{fig:max_500_result} shows the rejection of the different configurations in the case of MAX/500.
area use. With this equivalence, our 500 arbitrary units correspond to 2500 LUTs, 730 646 Figure~\ref{fig:max_1000_result} shows the rejection of the different configurations in the case of MAX/1000.
1000 arbitrary units correspond to 5000 LUTs and 1500 arbitrary units correspond 731 647 Figure~\ref{fig:max_1500_result} shows the rejection of the different configurations in the case of MAX/1500.
to 7300 LUTs. The conclusion is that the orders of magnitude of our arbitrary 732 648
unit map well to actual hardware resources. The relatively small differences can probably be explained 733
by the optimizations done by Vivado based on the detailed map of available processing resources. 734
735
We now present the computation time needed to solve the quadratic problem. 736
For each case, the filter solver software is executed on a Intel(R) Xeon(R) CPU E5606 737
clocked at 2.13~GHz. The CPU has 8 cores that are used by Gurobi to solve 738
the quadratic problem. Table~\ref{tbl:area_time} shows the time needed to solve the quadratic 739
problem when the maximal area is fixed to 500, 1000 and 1500 arbitrary units. 740
741
\begin{table}[h!tb] 742
\caption{Time needed to solve the quadratic program with Gurobi} 743
\label{tbl:area_time} 744
\centering 745
\begin{tabular}{|c|c|c|c|}\hline 746
$n$ & Time (MAX/500) & Time (MAX/1000) & Time (MAX/1500) \\\hline\hline 747
1 & 0.01~s & 0.02~s & 0.03~s \\ 748
2 & 0.1~s & 1~s & 2~s \\ 749
3 & 5~s & 27~s & 351~s ($\approx$ 6~min) \\ 750
4 & 4~s & 141~s ($\approx$ 3~min) & 1134~s ($\approx$ 18~min) \\ 751
5 & 6~s & 630~s ($\approx$ 10~min) & 49400~s ($\approx$ 13~h) \\\hline 752
\end{tabular} 753
\end{table} 754
755 649 \begin{figure}
As expected, the computation time seems to rise exponentially with the number of stages. 756 650 \centering
When the area is limited, the design exploration space is more limited and the solver is able to 757 651 \begin{subfigure}{\linewidth}
find an optimal solution faster. 758 652 \includegraphics[width=\linewidth]{images/max_500}
We also notice that the solution with $n$ greater than the optimal value 759 653 \caption{Filter transfer functions for varying number of cascaded filters solving
takes more time to be found than the optimal one. This can be explained since the search space is 760 654 the MAX/500 problem of maximizing rejection for a given resource allocation (500~arbitrary units).}
larger and we need more time to ensure that the previous solution (from the 761 655 \label{fig:max_500_result}
smaller value of $n$) still remains the optimal solution. 762 656 \end{subfigure}
763 657
\subsection{Minimizing resource occupation at fixed rejection} 764 658 \begin{subfigure}{\linewidth}
\label{sec:fixed_rej} 765 659 \includegraphics[width=\linewidth]{images/max_1000}
766 660 \caption{Filter transfer functions for varying number of cascaded filters solving
This section presents the results of the complementary quadratic program aimed at 767 661 the MAX/1000 problem of maximizing rejection for a given resource allocation (1000~arbitrary units).}
minimizing the area occupation for a targeted rejection level. 768 662 \label{fig:max_1000_result}
769 663 \end{subfigure}
The experimental setup is composed of four cases. The raw input is the same 770 664
as in the previous section, from a PRN generator, which fixes the input data size $\Pi^I$. 771 665 \begin{subfigure}{\linewidth}
Then the targeted rejection $\mathcal{R}$ has been fixed to either 40, 60, 80 or 100~dB. 772 666 \includegraphics[width=\linewidth]{images/max_1500}
Hence, the three cases have been named: MIN/40, MIN/60, MIN/80 and MIN/100. 773 667 \caption{Filter transfer functions for varying number of cascaded filters solving
The number of configurations $p$ is the same as previous section. 774 668 the MAX/1500 problem of maximizing rejection for a given resource allocation (1500~arbitrary units).}
775 669 \label{fig:max_1500_result}
Table~\ref{tbl:gurobi_min_40} shows the results obtained by the filter solver for MIN/40. 776 670 \end{subfigure}
Table~\ref{tbl:gurobi_min_60} shows the results obtained by the filter solver for MIN/60. 777 671 \caption{Solutions for the MAX/500, MAX/1000 and MAX/1500 problems of maximizing
Table~\ref{tbl:gurobi_min_80} shows the results obtained by the filter solver for MIN/80. 778 672 rejection for a given resource allocation.
Table~\ref{tbl:gurobi_min_100} shows the results obtained by the filter solver for MIN/100. 779 673 The filter shape constraint (bandpass and bandstop) is shown as thick
780 674 horizontal lines on each chart.}
\renewcommand{\arraystretch}{1.4} 781 675 \end{figure}
782 676
\begin{table}[h!tb] 783 677 In all cases, we observe that the actual rejection is close to the rejection computed by the solver.
\caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MIN/40} 784 678
\label{tbl:gurobi_min_40} 785 679 We compare the actual silicon resources given by Vivado to the
\centering 786 680 resources in arbitrary units.
{\scalefont{0.77} 787 681 The goal is to check that our arbitrary units of silicon area models well enough
\begin{tabular}{|c|ccccc|c|c|} 788 682 the real resources on the FPGA. Especially we want to verify that, for a given
\hline 789 683 number of arbitrary units, the actual silicon resources do not depend on the
$n$ & $i = 1$ & $i = 2$ & $i = 3$ & $i = 4$ & $i = 5$ & Rejection & Area \\ 790 684 number of stages $n$. Most significantly, our approach aims
\hline 791 685 at remaining far enough from the practical logic gate implementation used by
1 & (27, 8, 0) & - & - & - & - & 41~dB & 648 \\ 792 686 various vendors to remain platform independent and be portable from one
2 & (3, 5, 18) & (27, 8, 0) & - & - & - & 42~dB & 360 \\ 793 687 architecture to another.
3 & (3, 5, 18) & (27, 8, 0) & - & - & - & 42~dB & 360 \\ 794 688
4 & (3, 5, 18) & (27, 8, 0) & - & - & - & 42~dB & 360 \\ 795 689 Table~\ref{tbl:resources_usage} shows the resources usage in the case of MAX/500, MAX/1000 and
5 & (3, 5, 18) & (27, 8, 0) & - & - & - & 42~dB & 360 \\ 796 690 MAX/1500 \emph{i.e.} when the maximum allowed silicon area is fixed to 500, 1000
\hline 797 691 and 1500 arbitrary units. We have taken care to extract solely the resources used by
\end{tabular} 798 692 the FIR filters and remove additional processing blocks including FIFO and Programmable
} 799 693 Logic (PL -- FPGA) to Processing System (PS -- general purpose processor) communication.
\end{table} 800 694
801 695 \begin{table}[h!tb]
\begin{table}[h!tb] 802 696 \caption{Resource occupation following synthesis of the solutions found for
\caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MIN/60} 803 697 the problem of maximizing rejection for a given resource allocation. The last column refers to available resources on a Zynq-7010 as found on the Redpitaya.}
\label{tbl:gurobi_min_60} 804 698 \label{tbl:resources_usage}
\centering 805
{\scalefont{0.77} 806 699 \centering
\begin{tabular}{|c|ccccc|c|c|} 807 700 \begin{tabular}{|c|c|ccc|c|}
\hline 808 701 \hline
$n$ & $i = 1$ & $i = 2$ & $i = 3$ & $i = 4$ & $i = 5$ & Rejection & Area \\ 809 702 $n$ & & MAX/500 & MAX/1000 & MAX/1500 & \emph{Zynq 7010} \\ \hline\hline
\hline 810 703 & LUT & 249 & 453 & 627 & \emph{17600} \\
1 & (39, 13, 0) & - & - & - & - & 60~dB & 1131 \\ 811 704 1 & BRAM & 1 & 1 & 1 & \emph{120} \\
2 & (15, 6, 16) & (23, 9, 0) & - & - & - & 60~dB & 675 \\ 812 705 & DSP & 21 & 37 & 47 & \emph{80} \\ \hline
3 & (3, 5, 18) & (15, 6, 2) & (23, 8, 0) & - & - & 60~dB & 543 \\ 813 706 & LUT & 2253 & 474 & 691 & \emph{17600} \\
4 & (3, 5, 18) & (15, 6, 2) & (23, 8, 0) & - & - & 60~dB & 543 \\ 814 707 2 & BRAM & 2 & 2 & 2 & \emph{120} \\
5 & (3, 5, 18) & (15, 6, 2) & (23, 8, 0) & - & - & 60~dB & 543 \\ 815 708 & DSP & 0 & 50 & 70 & \emph{80} \\ \hline
\hline 816 709 & LUT & 1329 & 2006 & 3158 & \emph{17600} \\
\end{tabular} 817 710 3 & BRAM & 3 & 3 & 3 & \emph{120} \\
} 818 711 & DSP & 15 & 30 & 42 & \emph{80} \\ \hline
\end{table} 819 712 & LUT & 1329 & 1600 & 2260 & \emph{17600} \\
820 713 4 & BRAM & 3 & 4 & 4 & \emph{120} \\
\begin{table}[h!tb] 821 714 & DPS & 15 & 38 & 49 & \emph{80} \\ \hline
\caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MIN/80} 822 715 & LUT & 1329 & 1600 & 2260 & \emph{17600} \\
\label{tbl:gurobi_min_80} 823 716 5 & BRAM & 3 & 4 & 4 & \emph{120} \\
\centering 824 717 & DPS & 15 & 38 & 49 & \emph{80} \\ \hline
{\scalefont{0.77} 825 718 \end{tabular}
\begin{tabular}{|c|ccccc|c|c|} 826 719 \end{table}
\hline 827 720
$n$ & $i = 1$ & $i = 2$ & $i = 3$ & $i = 4$ & $i = 5$ & Rejection & Area \\ 828 721 In case $n = 2$ for MAX/500, Vivado replaces the DSPs by Look Up Tables (LUTs). We assume that,
\hline 829 722 when the filter coefficients are small enough, or when the input size is small
1 & (55, 16, 0) & - & - & - & - & 81~dB & 1760 \\ 830 723 enough, Vivado optimizes resource consumption by selecting multiplexers to
2 & (15, 8, 17) & (35, 11, 0) & - & - & - & 80~dB & 990 \\ 831 724 implement the multiplications instead of a DSP. In this case, it is quite difficult
3 & (3, 7, 20) & (31, 9, 1) & (19, 7, 0) & - & - & 80~dB & 783 \\ 832 725 to compare the whole silicon budget.
4 & (3, 7, 20) & (31, 9, 1) & (19, 7, 0) & - & - & 80~dB & 783 \\ 833 726
5 & (3, 7, 20) & (31, 9, 1) & (19, 7, 0) & - & - & 80~dB & 783 \\ 834 727 However, a rough estimation can be made with a simple equivalence: looking at
\hline 835 728 the first column (MAX/500), where the number of LUTs is quite stable for $n \geq 2$,
\end{tabular} 836 729 we can deduce that a DSP is roughly equivalent to 100~LUTs in terms of silicon
} 837 730 area use. With this equivalence, our 500 arbitrary units correspond to 2500 LUTs,
\end{table} 838 731 1000 arbitrary units correspond to 5000 LUTs and 1500 arbitrary units correspond
839 732 to 7300 LUTs. The conclusion is that the orders of magnitude of our arbitrary
\begin{table}[h!tb] 840 733 unit map well to actual hardware resources. The relatively small differences can probably be explained
\caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MIN/100} 841 734 by the optimizations done by Vivado based on the detailed map of available processing resources.
\label{tbl:gurobi_min_100} 842 735
\centering 843 736 We now present the computation time needed to solve the quadratic problem.
{\scalefont{0.77} 844 737 For each case, the filter solver software is executed on a Intel(R) Xeon(R) CPU E5606
\begin{tabular}{|c|ccccc|c|c|} 845 738 clocked at 2.13~GHz. The CPU has 8 cores that are used by Gurobi to solve
\hline 846 739 the quadratic problem. Table~\ref{tbl:area_time} shows the time needed to solve the quadratic
$n$ & $i = 1$ & $i = 2$ & $i = 3$ & $i = 4$ & $i = 5$ & Rejection & Area \\ 847 740 problem when the maximal area is fixed to 500, 1000 and 1500 arbitrary units.
\hline 848 741
1 & - & - & - & - & - & - & - \\ 849 742 \begin{table}[h!tb]
2 & (27, 9, 15) & (35, 11, 0) & - & - & - & 100~dB & 1410 \\ 850 743 \caption{Time needed to solve the quadratic program with Gurobi}
3 & (3, 5, 18) & (35, 11, 1) & (27, 9, 0) & - & - & 100~dB & 1147 \\ 851 744 \label{tbl:area_time}
4 & (3, 5, 18) & (15, 6, 2) & (27, 9, 0) & (19, 7, 0) & - & 100~dB & 1067 \\ 852 745 \centering
5 & (3, 5, 18) & (15, 6, 2) & (27, 9, 0) & (19, 7, 0) & - & 100~dB & 1067 \\ 853
\hline 854 746 \begin{tabular}{|c|c|c|c|}\hline
\end{tabular} 855 747 $n$ & Time (MAX/500) & Time (MAX/1000) & Time (MAX/1500) \\\hline\hline
} 856 748 1 & 0.01~s & 0.02~s & 0.03~s \\
\end{table} 857 749 2 & 0.1~s & 1~s & 2~s \\
\renewcommand{\arraystretch}{1} 858 750 3 & 5~s & 27~s & 351~s ($\approx$ 6~min) \\
859 751 4 & 4~s & 141~s ($\approx$ 3~min) & 1134~s ($\approx$ 18~min) \\
From these tables, we can first state that almost all configurations reach the targeted rejection 860 752 5 & 6~s & 630~s ($\approx$ 10~min) & 49400~s ($\approx$ 13~h) \\\hline
level or even better thanks to our underestimate of the cascade rejection as the sum of the 861 753 \end{tabular}
individual filter rejection. The only exception is for the monolithic case ($n = 1$) in 862 754 \end{table}
MIN/100: no solution is found for a single monolithic filter reach a 100~dB rejection. 863 755
Furthermore, the area of the monolithic filter is twice as big as the two cascaded filters 864 756 As expected, the computation time seems to rise exponentially with the number of stages.
(675 and 1131 arbitrary units v.s 990 and 1760 arbitrary units for 60 and 80~dB rejection 865 757 When the area is limited, the design exploration space is more limited and the solver is able to
respectively). More generally, the more filters are cascaded, the lower the occupied area. 866 758 find an optimal solution faster.
867 759 We also notice that the solution with $n$ greater than the optimal value
Like in previous section, the solver chooses always a little filter as first 868 760 takes more time to be found than the optimal one. This can be explained since the search space is
filter stage and the second one is often the biggest filter. This choice can be explained 869 761 larger and we need more time to ensure that the previous solution (from the
as in the previous section, with the solver using just enough bits not to degrade the input 870 762 smaller value of $n$) still remains the optimal solution.
signal and in the second filter selecting a better filter to improve rejection without 871 763
having too many bits in the output data. 872 764 \subsection{Minimizing resource occupation at fixed rejection}
765 \label{sec:fixed_rej}
873 766
For each case, we found an optimal solution with $n < 5$: for MIN/40 $n=2$, 874 767 This section presents the results of the complementary quadratic program aimed at
for MIN/60 and MIN/80 $n = 3$ and for MIN/100 $n = 4$. In all cases, the solutions 875 768 minimizing the area occupation for a targeted rejection level.
when $n$ is greater than this optimal $n$ remain identical to the optimal one. 876 769
877 770 The experimental setup is composed of four cases. The raw input is the same
The following graphs present the rejection for real data on the FPGA. In all the following 878 771 as in the previous section, from a PRN generator, which fixes the input data size $\Pi^I$.
figures, the solid line represents the actual rejection of the filtered 879 772 Then the targeted rejection $\mathcal{R}$ has been fixed to either 40, 60, 80 or 100~dB.
data on the FPGA as measured experimentally and the dashed line is the noise level 880 773 Hence, the three cases have been named: MIN/40, MIN/60, MIN/80 and MIN/100.
given by the quadratic solver. 881 774 The number of configurations $p$ is the same as previous section.
882 775
Figure~\ref{fig:min_40} shows the rejection of the different configurations in the case of MIN/40. 883 776 Table~\ref{tbl:gurobi_min_40} shows the results obtained by the filter solver for MIN/40.
Figure~\ref{fig:min_60} shows the rejection of the different configurations in the case of MIN/60. 884 777 Table~\ref{tbl:gurobi_min_60} shows the results obtained by the filter solver for MIN/60.
Figure~\ref{fig:min_80} shows the rejection of the different configurations in the case of MIN/80. 885 778 Table~\ref{tbl:gurobi_min_80} shows the results obtained by the filter solver for MIN/80.
Figure~\ref{fig:min_100} shows the rejection of the different configurations in the case of MIN/100. 886 779 Table~\ref{tbl:gurobi_min_100} shows the results obtained by the filter solver for MIN/100.
887 780
\begin{figure} 888 781 \renewcommand{\arraystretch}{1.4}
\centering 889 782
\begin{subfigure}{\linewidth} 890 783 \begin{table}[h!tb]
\includegraphics[width=.91\linewidth]{images/min_40} 891 784 \caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MIN/40}
\caption{Filter transfer functions for varying number of cascaded filters solving 892 785 \label{tbl:gurobi_min_40}
the MIN/40 problem of minimizing resource allocation for reaching a 40~dB rejection.} 893 786 \centering
\label{fig:min_40} 894 787 {\scalefont{0.77}
\end{subfigure} 895 788 \begin{tabular}{|c|ccccc|c|c|}
896 789 \hline
\begin{subfigure}{\linewidth} 897 790 $n$ & $i = 1$ & $i = 2$ & $i = 3$ & $i = 4$ & $i = 5$ & Rejection & Area \\
\includegraphics[width=.91\linewidth]{images/min_60} 898 791 \hline
\caption{Filter transfer functions for varying number of cascaded filters solving 899 792 1 & (27, 8, 0) & - & - & - & - & 41~dB & 648 \\
the MIN/60 problem of minimizing resource allocation for reaching a 60~dB rejection.} 900 793 2 & (3, 5, 18) & (27, 8, 0) & - & - & - & 42~dB & 360 \\
\label{fig:min_60} 901 794 3 & (3, 5, 18) & (27, 8, 0) & - & - & - & 42~dB & 360 \\
\end{subfigure} 902 795 4 & (3, 5, 18) & (27, 8, 0) & - & - & - & 42~dB & 360 \\
903 796 5 & (3, 5, 18) & (27, 8, 0) & - & - & - & 42~dB & 360 \\
\begin{subfigure}{\linewidth} 904 797 \hline
\includegraphics[width=.91\linewidth]{images/min_80} 905 798 \end{tabular}
\caption{Filter transfer functions for varying number of cascaded filters solving 906 799 }
the MIN/80 problem of minimizing resource allocation for reaching a 80~dB rejection.} 907 800 \end{table}
\label{fig:min_80} 908 801
\end{subfigure} 909 802 \begin{table}[h!tb]
910 803 \caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MIN/60}
\begin{subfigure}{\linewidth} 911 804 \label{tbl:gurobi_min_60}
\includegraphics[width=.91\linewidth]{images/min_100} 912 805 \centering
\caption{Filter transfer functions for varying number of cascaded filters solving 913 806 {\scalefont{0.77}
the MIN/100 problem of minimizing resource allocation for reaching a 100~dB rejection.} 914 807 \begin{tabular}{|c|ccccc|c|c|}
\label{fig:min_100} 915 808 \hline
\end{subfigure} 916 809 $n$ & $i = 1$ & $i = 2$ & $i = 3$ & $i = 4$ & $i = 5$ & Rejection & Area \\
\caption{Solutions for the MIN/40, MIN/60, MIN/80 and MIN/100 problems of reaching a 917 810 \hline
given rejection while minimizing resource allocation. The filter shape constraint (bandpass and 918 811 1 & (39, 13, 0) & - & - & - & - & 60~dB & 1131 \\
bandstop) is shown as thick 919 812 2 & (15, 6, 16) & (23, 9, 0) & - & - & - & 60~dB & 675 \\
horizontal lines on each chart.} 920 813 3 & (3, 5, 18) & (15, 6, 2) & (23, 8, 0) & - & - & 60~dB & 543 \\
\end{figure} 921 814 4 & (3, 5, 18) & (15, 6, 2) & (23, 8, 0) & - & - & 60~dB & 543 \\
922 815 5 & (3, 5, 18) & (15, 6, 2) & (23, 8, 0) & - & - & 60~dB & 543 \\
We observe that all rejections given by the quadratic solver are close to the experimentally 923 816 \hline
measured rejection. All curves prove that the constraint to reach the target rejection is 924 817 \end{tabular}
respected with both monolithic (except in MIN/100 which has no monolithic solution) or cascaded filters. 925 818 }
926 819 \end{table}
Table~\ref{tbl:resources_usage} shows the resource usage in the case of MIN/40, MIN/60; 927 820
MIN/80 and MIN/100 \emph{i.e.} when the target rejection is fixed to 40, 60, 80 and 100~dB. We 928 821 \begin{table}[h!tb]
have taken care to extract solely the resources used by 929 822 \caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MIN/80}
the FIR filters and remove additional processing blocks including FIFO and PL to 930 823 \label{tbl:gurobi_min_80}
PS communication. 931 824 \centering
932 825 {\scalefont{0.77}
\renewcommand{\arraystretch}{1.2} 933 826 \begin{tabular}{|c|ccccc|c|c|}
\begin{table} 934 827 \hline
\caption{Resource occupation. The last column refers to available resources on a Zynq-7010 as found on the Redpitaya.} 935 828 $n$ & $i = 1$ & $i = 2$ & $i = 3$ & $i = 4$ & $i = 5$ & Rejection & Area \\
\label{tbl:resources_usage_comp} 936 829 \hline
\centering 937 830 1 & (55, 16, 0) & - & - & - & - & 81~dB & 1760 \\
{\scalefont{0.90} 938 831 2 & (15, 8, 17) & (35, 11, 0) & - & - & - & 80~dB & 990 \\
\begin{tabular}{|c|c|cccc|c|} 939 832 3 & (3, 7, 20) & (31, 9, 1) & (19, 7, 0) & - & - & 80~dB & 783 \\
\hline 940 833 4 & (3, 7, 20) & (31, 9, 1) & (19, 7, 0) & - & - & 80~dB & 783 \\
$n$ & & MIN/40 & MIN/60 & MIN/80 & MIN/100 & \emph{Zynq 7010} \\ \hline\hline 941 834 5 & (3, 7, 20) & (31, 9, 1) & (19, 7, 0) & - & - & 80~dB & 783 \\
& LUT & 343 & 334 & 772 & - & \emph{17600} \\ 942 835 \hline
1 & BRAM & 1 & 1 & 1 & - & \emph{120} \\ 943 836 \end{tabular}
& DSP & 27 & 39 & 55 & - & \emph{80} \\ \hline 944 837 }
& LUT & 1664 & 2329 & 474 & 620 & \emph{17600} \\ 945 838 \end{table}
2 & BRAM & 2 & 2 & 2 & 2 & \emph{120} \\ 946 839
& DSP & 0 & 15 & 50 & 62 & \emph{80} \\ \hline 947 840 \begin{table}[h!tb]
& LUT & 1664 & 3114 & 1884 & 2873 & \emph{17600} \\ 948 841 \caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MIN/100}
3 & BRAM & 2 & 3 & 3 & 3 & \emph{120} \\ 949 842 \label{tbl:gurobi_min_100}
& DSP & 0 & 0 & 22 & 27 & \emph{80} \\ \hline 950 843 \centering
& LUT & 1664 & 3114 & 2570 & 4318 & \emph{17600} \\ 951 844 {\scalefont{0.77}
4 & BRAM & 2 & 3 & 4 & 4 & \emph{120} \\ 952 845 \begin{tabular}{|c|ccccc|c|c|}
& DPS & 0 & 15 & 19 & 19 & \emph{80} \\ \hline 953 846 \hline
& LUT & 1664 & 3114 & 2570 & 4318 & \emph{17600} \\ 954 847 $n$ & $i = 1$ & $i = 2$ & $i = 3$ & $i = 4$ & $i = 5$ & Rejection & Area \\
5 & BRAM & 2 & 3 & 4 & 4 & \emph{120} \\ 955 848 \hline
& DPS & 0 & 0 & 19 & 19 & \emph{80} \\ \hline 956 849 1 & - & - & - & - & - & - & - \\
\end{tabular} 957 850 2 & (27, 9, 15) & (35, 11, 0) & - & - & - & 100~dB & 1410 \\
} 958 851 3 & (3, 5, 18) & (35, 11, 1) & (27, 9, 0) & - & - & 100~dB & 1147 \\
\end{table} 959 852 4 & (3, 5, 18) & (15, 6, 2) & (27, 9, 0) & (19, 7, 0) & - & 100~dB & 1067 \\
\renewcommand{\arraystretch}{1} 960 853 5 & (3, 5, 18) & (15, 6, 2) & (27, 9, 0) & (19, 7, 0) & - & 100~dB & 1067 \\
961 854 \hline
If we keep the previous estimation of cost of one DSP in terms of LUT (1 DSP $\approx$ 100 LUT) 962 855 \end{tabular}
the real resource consumption decreases as a function of the number of stages in the cascaded 963 856 }
filter according 964 857 \end{table}
to the solution given by the quadratic solver. Indeed, we have always a decreasing 965 858 \renewcommand{\arraystretch}{1}
consumption even if the difference between the monolithic and the two cascaded 966 859
filters is less than expected. 967 860 From these tables, we can first state that almost all configurations reach the targeted rejection
968 861 level or even better thanks to our underestimate of the cascade rejection as the sum of the
Finally, table~\ref{tbl:area_time_comp} shows the computation time to solve 969 862 individual filter rejection. The only exception is for the monolithic case ($n = 1$) in
the quadratic program. 970 863 MIN/100: no solution is found for a single monolithic filter reach a 100~dB rejection.
971 864 Furthermore, the area of the monolithic filter is twice as big as the two cascaded filters
\renewcommand{\arraystretch}{1.2} 972 865 (675 and 1131 arbitrary units v.s 990 and 1760 arbitrary units for 60 and 80~dB rejection
\begin{table}[h!tb] 973 866 respectively). More generally, the more filters are cascaded, the lower the occupied area.
\caption{Time to solve the quadratic program with Gurobi} 974 867
\label{tbl:area_time_comp} 975 868 Like in previous section, the solver chooses always a little filter as first
\centering 976 869 filter stage and the second one is often the biggest filter. This choice can be explained
{\scalefont{0.90} 977 870 as in the previous section, with the solver using just enough bits not to degrade the input
\begin{tabular}{|c|c|c|c|c|}\hline 978 871 signal and in the second filter selecting a better filter to improve rejection without
$n$ & Time (MIN/40) & Time (MIN/60) & Time (MIN/80) & Time (MIN/100) \\\hline\hline 979 872 having too many bits in the output data.
1 & 0.04~s & 0.01~s & 0.01~s & - \\ 980 873
2 & 2.7~s & 2.4~s & 2.4~s & 0.8~s \\ 981 874 For each case, we found an optimal solution with $n < 5$: for MIN/40 $n=2$,
3 & 4.6~s & 7~s & 7~s & 18~s \\ 982 875 for MIN/60 and MIN/80 $n = 3$ and for MIN/100 $n = 4$. In all cases, the solutions
4 & 3~s & 22~s & 70~s & 220~s ($\approx$ 3~min) \\ 983 876 when $n$ is greater than this optimal $n$ remain identical to the optimal one.
5 & 5~s & 122~s & 200~s & 384~s ($\approx$ 5~min) \\\hline 984
\end{tabular} 985
} 986
\end{table} 987 877
\renewcommand{\arraystretch}{1} 988 878 The following graphs present the rejection for real data on the FPGA. In all the following
989 879 figures, the solid line represents the actual rejection of the filtered
The time needed to solve this configuration is significantly shorter than the time 990 880 data on the FPGA as measured experimentally and the dashed line is the noise level
needed in the previous section. Indeed the worst time in this case is only 5~minutes, 991 881 given by the quadratic solver.
compared to 13~hours in the previous section: this problem is more easily solved than the 992 882
previous one. 993 883 Figure~\ref{fig:min_40} shows the rejection of the different configurations in the case of MIN/40.
994 884 Figure~\ref{fig:min_60} shows the rejection of the different configurations in the case of MIN/60.
To conclude, we compare our monolithic filters with the FIR Compiler provided by 995 885 Figure~\ref{fig:min_80} shows the rejection of the different configurations in the case of MIN/80.
Xilinx in the Vivado software suite (v.2018.2). For each experiment we use the 996 886 Figure~\ref{fig:min_100} shows the rejection of the different configurations in the case of MIN/100.
same coefficient set and we compare the resource consumption, having checked that 997 887
the transfer functions are indeed the same with both implementations. 998
Table~\ref{tbl:xilinx_resources} exhibits the results. 999
The FIR Compiler never uses BRAM while our filter implementation uses one block. This difference 1000
is explained be our wish to have a dynamically reconfigurable FIR filter whose 1001
coefficients can be updated from the processing system without having to update the FPGA design. 1002
With the FIR compiler, the coefficients are defined during the FPGA design so that 1003
changing coefficients required generating a new design. The difference with the LUT consumption 1004
is also attributed to the reconfigurability logic. However the DSP consumption, the scarcest 1005
resource, is the same between the Xilinx FIR Compiler end 1006
our FIR block: we hence conclude that our solutions are as good as the Xilinx implementation. 1007
1008
\renewcommand{\arraystretch}{1.2} 1009
\begin{table} 1010
\centering 1011
\caption{Resource consumption compared between the FIR Compiler from Xilinx and our FIR block} 1012
\label{tbl:xilinx_resources} 1013
\begin{tabular}{|c|c|c|c|c|c|c|} 1014
\hline 1015
\multirow{2}{*}{} & \multicolumn{3}{c|}{Xilinx} & \multicolumn{3}{c|}{Our FIR block} \\ \cline{2-7} 1016
& LUT & BRAM & DSP & LUT & BRAM & DSP \\ \hline 1017
MAX/500 & 177 & 0 & 21 & 249 & 1 & 21 \\ \hline 1018
MAX/1000 & 306 & 0 & 37 & 453 & 1 & 37 \\ \hline 1019
MAX/1500 & 418 & 0 & 47 & 627 & 1 & 47 \\ \hline 1020