Commit 0642fff00e7ced96f5ac91a61743a7134e7e5a42

Authored by jfriedt
1 parent a5c9e7b942
Exists in master

relecture journal

Showing 2 changed files with 238 additions and 165 deletions Side-by-side Diff

... ... @@ -197,4 +197,35 @@
197 197 number=5,
198 198 month={May}
199 199 }
  200 +
  201 +@inproceedings{carolina1,
  202 + title={Digital electronics based on red pitaya platform for coherent fiber links},
  203 + author={Olaya, AC Cardenas and Micalizio, S and Ortolano, M and Calosso, CE and Rubiola, E and Friedt, JM},
  204 + booktitle={2016 European Frequency and Time Forum (EFTF)},
  205 + pages={1--4},
  206 + year={2016},
  207 + organization={IEEE}
  208 +}
  209 +
  210 +@article{carolina2,
  211 + title={Phase Noise and Frequency Stability of the Red-Pitaya Internal PLL},
  212 + author={Olaya, Andrea Carolina C{\'a}rdenas and Calosso, Claudio Eligio and Friedt, Jean-Michel and Micalizio, Salvatore and Rubiola, Enrico},
  213 + journal={IEEE transactions on ultrasonics, ferroelectrics, and frequency control},
  214 + volume={66},
  215 + number={2},
  216 + pages={412--416},
  217 + year={2019},
  218 + publisher={IEEE}
  219 +}
  220 +
  221 +@article{sherman,
  222 + title={Oscillator metrology with software defined radio},
  223 + author={Sherman, Jeff A and J{\"o}rdens, Robert},
  224 + journal={Review of Scientific Instruments},
  225 + volume={87},
  226 + number={5},
  227 + pages={054711},
  228 + year={2016},
  229 + publisher={AIP Publishing}
  230 +}
ifcs2018_journal.tex
... ... @@ -8,7 +8,6 @@
8 8 % Gwen : peut-on faire un vrai banc de bruit de phase avec ce FIR, ie ajouter ADC, NCO et mixer
9 9 % (zedboard ou redpit)
10 10  
11   -% ajouter pyramide "juste"
12 11 % label schema : verifier que "argumenter de la cascade de FIR" est fait
13 12  
14 13 \documentclass[a4paper,conference]{IEEEtran/IEEEtran}
... ... @@ -53,7 +52,11 @@
53 52 noise rejection needs. Since real time radiofrequency processing must be performed in a
54 53 Field Programmable Array to meet timing constraints, we investigate optimization strategies
55 54 to design filters meeting rejection characteristics while limiting the hardware resources
56   -required and keeping timing constraints within the targeted measurement bandwidths.
  55 +required and keeping timing constraints within the targeted measurement bandwidths. The
  56 +presented technique is applicable to scheduling any sequence of processing blocks characterized
  57 +by a throughput, resource occupation and performance tabulated as a function of configuration
  58 +characateristics, as is the case for filters with their coefficients and resolution yielding
  59 +rejection and number of multipliers.
57 60 \end{abstract}
58 61  
59 62 \begin{IEEEkeywords}
... ... @@ -99,7 +102,7 @@
99 102  
100 103 \section{Finite impulse response filter}
101 104  
102   -We select FIR filter for their unconditional stability and ease of design. A FIR filter is defined
  105 +We select FIR filters for their unconditional stability and ease of design. A FIR filter is defined
103 106 by a set of weights $b_k$ applied to the inputs $x_k$ through a convolution to generate the
104 107 outputs $y_k$
105 108 \begin{align}
106 109  
... ... @@ -108,12 +111,13 @@
108 111 \end{align}
109 112  
110 113 As opposed to an implementation on a general purpose processor in which word size is defined by the
111   -processor architecture, implementing such a filter on an FPGA offer more degrees of freedom since
  114 +processor architecture, implementing such a filter on an FPGA offers more degrees of freedom since
112 115 not only the coefficient values and number of taps must be defined, but also the number of bits
113 116 defining the coefficients and the sample size. For this reason, and because we consider pipeline
114 117 processing (as opposed to First-In, First-Out FIFO memory batch processing) of radiofrequency
115 118 signals, High Level Synthesis (HLS) languages \cite{kasbah2008multigrid} are not considered but
116   -the problem is tackled at the Very-high-speed-integrated-circuit Hardware Description Language (VHDL) level.
  119 +the problem is tackled at the Very-high-speed-integrated-circuit Hardware Description Language
  120 +(VHDL) level.
117 121 Since latency is not an issue in a openloop phase noise characterization instrument, the large
118 122 numbre of taps in the FIR, as opposed to the shorter Infinite Impulse Response (IIR) filter,
119 123 is not considered as an issue as would be in a closed loop system.
120 124  
121 125  
122 126  
... ... @@ -151,27 +155,40 @@
151 155 moment.
152 156  
153 157 \section{Methodology description}
154   -We want create a new methodology to develop any Digital Signal Processing (DSP) chain
155   -and for any hardware platform (Altera, Xilinx...). To do this we have defined an
156   -abstract model to represent some basic operations of DSP.
157 158  
158   -For the moment, we are focused on only two operations: the filtering and the shifting of data.
159   -We have chosen this basic operation because the shifting and the filtering have already be studied in
160   -lot of works \cite{lim_1996, lim_1988, young_1992, smith_1998} hence it will be easier
161   -to check and validate our results.
  159 +Our objective is to develop a new methodology applicable to any Digital Signal Processing (DSP)
  160 +chain obtained by assembling basic processing blocks, with hardware and manufacturer independence.
  161 +Achieving such a target requires defining an abstract model to represent some basic properties
  162 +of DSP blocks such as perfomance (i.e. rejection or ripples in the bandpass for filters) and
  163 +resource occupation. These abstract properties, not necessarily related to the detailed hardware
  164 +implementation of a given platform, will feed a scheduler solver aimed at assembling the optimum
  165 +target, whether in terms of maximizing performance for a given arbitrary resource occupation, or
  166 +minimizing resource occupation for a given perfomance. In our approach, the solution of the
  167 +solver is then synthesized using the dedicated tool provided by each platform manufacturer
  168 +to assess the validity of our abstract resource occupation indicator, and the result of running
  169 +the DSP chain on the FPGA allows for assessing the performance of the scheduler. We emphasize
  170 +that all solutions found by the solver are synthesized and executed on hardware at the end
  171 +of the analysis.
162 172  
163   -However having only two operations is insufficient to work with complex DSP but
164   -in this paper we only want demonstrate the relevance and the efficiency of our approach.
165   -In future work it will be possible to add more operations and we are able to
166   -model any DSP chain.
  173 +In this demonstration , we focus on only two operations: filtering and shifting the number of
  174 +bits needed to represent the data along the processing chain.
  175 +We have chosen these basic operations because shifting and the filtering have already been studied
  176 +in the literature \cite{lim_1996, lim_1988, young_1992, smith_1998} providing a framework for
  177 +assessing our results. Furthermore, filtering is a core step in any radiofrequency frontend
  178 +requiring pipelined processing at full bandwidth for the earliest steps, including for
  179 +time and frequency transfer or characterization \cite{carolina1,carolina2,rsi}.
167 180  
168   -We will apply our methodology on very simple DSP chain. We generate a digital signal
169   -thanks at generator of Pseudo-Random Number (PRN) or thanks at an Analog to Digital
170   -Converter (ADC). Once we have a digital signal, we filter it to decrease the noise level.
171   -Finally we stored some burst of filtered samples before post-processing it.
172   -In this particular case, we want optimize the filtering step to have the best noise
173   -rejection for constrain number of resource or to have the minimal resources
174   -consumption for a given rejection objective.
  181 +Addressing only two operations allows for demonstrating the methodology but should not be
  182 +considered as a limitation of the framework which can be extended to assembling any number
  183 +of skeleton blocks as long as perfomance and resource occupation can be determined. Hence,
  184 +in this paper we will apply our methodology on simple DSP chains: a white noise input signal
  185 +is generated using a Pseudo-Random Number (PRN) generator or thanks at a radiofrequency-grade
  186 +Analog to Digital Converter (ADC) loaded by a 50~$\Omega$ resistor. Once samples have been
  187 +digitized at a rate of 125~MS/s, filtering is applied to qualify the processing block performance --
  188 +practically meeting the radiofrequency frontend requirement of noise and bandwidth reduction
  189 +by filtering and decimating. Finally, bursts of filtered samples are stored for post-processing,
  190 +allowing to assess either filter rejection for a given resource usage, or validating the rejection
  191 +when implementing a solution minimizing resource occupation.
175 192  
176 193 The first step of our approach is to model the DSP chain and since we just optimize
177 194 the filtering, we have not modeling the PRN generator or the ADC. The filtering can be
... ... @@ -185,9 +202,10 @@
185 202 and a shifter (the shift could be omitted if we do not need to divide the filtered data).
186 203  
187 204 \subsection{Model of a FIR filter}
188   -A cascade of filter are composed of $n$ stage. In stage $i$ ($1 \leq i \leq n$)
189   -the FIR has $C_i$ coefficients and each coefficients are integer values with $\pi^C_i$
190   -bits and the filtered data are shifted of $\pi^S_i$ bits. We define also $\pi^-_i$ as
  205 +
  206 +A cascade of filters is composed of $n$ FIR stages. In stage $i$ ($1 \leq i \leq n$)
  207 +the FIR has $C_i$ coefficients and each coefficient is an integer value with $\pi^C_i$
  208 +bits while the filtered data are shifted by $\pi^S_i$ bits. We define also $\pi^-_i$ as
191 209 the size of input data and $\pi^+_i$ as the size of output data. The figure~\ref{fig:fir_stage}
192 210 shows a filtering stage.
193 211  
194 212  
195 213  
196 214  
... ... @@ -209,19 +227,23 @@
209 227 \label{fig:fir_stage}
210 228 \end{figure}
211 229  
212   -FIR $i$ can reject $F(C_i, \pi_i^C)$ dB. $F$ is determined numerically.
213   -To measure this rejection, we use GNU Octave software to design FIR filter coefficients thanks to two
214   -algorithms (\texttt{firls} and \texttt{fir1}).
  230 +FIR $i$ has been characterized through numerical simulation as able to reject $F(C_i, \pi_i^C)$ dB.
  231 +This rejection has been computed using GNU Octave software FIR coefficient design functions
  232 +(\texttt{firls} and \texttt{fir1}).
215 233 For each configuration $(C_i, \pi_i^C)$, we first create a FIR with floating point coefficients and a given $C_i$ number of coefficients.
216 234 Then, the floating point coefficients are discretized into integers. In order to ensure that the coefficients are coded on $\pi_i^C$~bits effectively,
217 235 the coefficients are normalized by their absolute maximum before being scaled to integer coefficients.
218   -At least one coefficient is coded on $\pi_i^C$~bits, and in practice only $b_{C_i/2}$ is coded on $\pi_i^C$~bits while the other are coded on very fewer bits.
  236 +At least one coefficient is coded on $\pi_i^C$~bits, and in practice only $b_{C_i/2}$ is coded on $\pi_i^C$~bits while the others are coded on much fewer bits.
219 237  
220   -With these coefficients, the \texttt{freqz} function is used to estimate the magnitude of the filter.
221   -Comparing the performance between FIRs requires however a unique criterion. As shown in figure~\ref{fig:fir_mag},
222   -the FIR magnitude exhibits two parts.
  238 +With these coefficients, the \texttt{freqz} function is used to estimate the magnitude of the filter
  239 +transfer function.
  240 +Comparing the performance between FIRs requires however defining a unique criterion. As shown in figure~\ref{fig:fir_mag},
  241 +the FIR magnitude exhibits two parts: we focus here on the transitions width and the rejection rather than on the
  242 +bandpass ripples as emphasized in \cite{lim_1988,lim_1996}.
223 243  
224 244 \begin{figure}
  245 +\begin{center}
  246 +\scalebox{0.8}{
225 247 \centering
226 248 \begin{tikzpicture}[scale=0.3]
227 249 \draw[<->] (0,15) -- (0,0) -- (21,0) ;
228 250  
229 251  
230 252  
231 253  
... ... @@ -249,38 +271,42 @@
249 271 \draw[dashed] (12,8) -- (16,8) ;
250 272  
251 273 \end{tikzpicture}
  274 +}
  275 +\end{center}
252 276 \caption{Shape of the filter transmitted power $P$ as a function of frequency $f$:
253 277 the passband is considered to occupy the initial 40\% of the Nyquist frequency range,
254 278 the stopband the last 40\%, allowing 20\% transition width.}
255 279 \label{fig:fir_mag}
256 280 \end{figure}
257 281  
258   -In the transition band, the behavior of the filter is left free, we only care about the passband and the stopband.
259   -Our first criterion considers the mean value of the stopband rejection, as shown in figure~\ref{fig:mean_criterion}. This criterion does not work because we do not consider the shape of the passband.
260   -A second criterion considers the maximum rejection within the stopband minus the mean of the absolute value of passband rejection. With this criterion, the results are significantly improved as shown in figure~\ref{fig:custom_criterion}.
  282 +In the transition band, the behavior of the filter is left free, we only care about the passband and the stopband characteristics.
  283 +Our initial criterion considered the mean value of the stopband rejection, as shown in figure~\ref{fig:mean_criterion}. This criterion
  284 +yields unacceptable results since notches overestimate the rejection capability of the filter. Furthermore, the losses within
  285 +the passband are not considered and might be excessive for excessively wide transitions widths introduced for filters with few coefficients.
  286 +Such biases are compensated for by the second considered criterion which is based on computing the maximum rejection within the stopband minus the mean of the absolute value of passband rejection. With this criterion, the results are significantly improved as shown in figure~\ref{fig:custom_criterion} and meet the expected rejection capability of low pass filters.
261 287  
262 288 \begin{figure}
263 289 \centering
264 290 \includegraphics[width=\linewidth]{images/colored_mean_criterion}
265   -\caption{Mean criterion comparison between monolithic filter and cascade filters}
  291 +\caption{Mean stopband rejection criterion comparison between monolithic filter and cascaded filters}
266 292 \label{fig:mean_criterion}
267 293 \end{figure}
268 294  
269 295 \begin{figure}
270 296 \centering
271 297 \includegraphics[width=\linewidth]{images/colored_custom_criterion}
272   -\caption{Custom criterion comparison between monolithic filter and cascade filters}
  298 +\caption{Custom criterion (maximum rejection in the stopband minus the mean of the absolute value of the passband rejection)
  299 +comparison between monolithic filter and cascaded filters}
273 300 \label{fig:custom_criterion}
274 301 \end{figure}
275 302  
276   -Thanks to this criterion we are able to automatically generate lot of fir coefficients
277   -and estimate their rejection. The figure~\ref{fig:rejection_pyramid} exhibits the
278   -rejection in function of the number of coefficients and their number of bits.
279   -We can observe it looks like a pyramid so the edge represents the best
280   -coefficient set. Indeed if we choose a number of coefficients, increasing the number
281   -of bits over the edge will not improve the rejection. Conversely when we choose
282   -a number of bits, too much increase the number of coefficients will not improve
283   -the rejection. Hence the best coefficient set are on the edge of pyramid.
  303 +Thanks to the latter criterion which will be used in the remainder of this paper, we are able to automatically generate multiple FIR taps
  304 +and estimate their rejection. Figure~\ref{fig:rejection_pyramid} exhibits the
  305 +rejection as a function of the number of coefficients and the number of bits representing these coefficients.
  306 +The curve shaped as a pyramid exhibits optimum configurations sets at the vertex where both edges meet.
  307 +Indeed for a given number of coefficients, increasing the number of bits over the edge will not improve the rejection.
  308 +Conversely when setting the a given number of bits, increasing the number of coefficients will not improve
  309 +the rejection. Hence the best coefficient set are on the vertex of the pyramid.
284 310  
285 311 \begin{figure}
286 312 \centering
287 313  
... ... @@ -289,18 +315,19 @@
289 315 \label{fig:rejection_pyramid}
290 316 \end{figure}
291 317  
292   -Although we have a efficient criterion to estimate the rejection of one set of coefficient
293   -we have a problem when we sum two or more criterion. If the FIR filter coefficients are the same
294   -between the stage, we have:
  318 +Although we have an efficient criterion to estimate the rejection of one set of coefficients (taps),
  319 +we have a problem when we cascade filters and estimate the criterion as a sum two or more individual criteria.
  320 +If the FIR filter coefficients are the same between the stages, we have:
295 321 $$F_{total} = F_1 + F_2$$
296   -But when we choose two different set of coefficient, the previous equality are not
297   -true. The figure~\ref{fig:sum_rejection} illustrates the problem. The red and blue curves
298   -are two different filter coefficient and we can see that their maximum on the stopband
299   -are not at the same frequency. So when we sum the rejection criteria (the dotted yellow line)
300   -we do not meet the dashed yellow line. Define the rejection of cascaded filters
301   -is more difficult than just take the summation between all the rejection criteria of each filter.
302   -However this summation gives us an upper bound for rejection although in fact we obtain
303   -better rejection than expected.
  322 +But selecting two different sets of coefficient will yield a more complex situation in which
  323 +the previous relation is no longer valid as illustrated on figure~\ref{fig:sum_rejection}. The red and blue curves
  324 +are two different filters with maximums and notches not located at the same frequency offsets.
  325 +Hence when summing the transfer functions, the resulting rejection shown as the dashed yellow line is improved
  326 +with respect to a basic sum of the rejection criteria shown as a the dotted yellow line.
  327 +Thus, estimating the rejection of filter cascades is more complex than takin the sum of all the rejection
  328 +criteria of each filter. However since the this sum underestimates the rejection capability of the cascade,
  329 +this upper bound is considered as a pessimistic and acceptable criterion for deciding on the suitability
  330 +of the filter cascade to meet design criteria.
304 331  
305 332 \begin{figure}
306 333 \centering
307 334  
... ... @@ -309,12 +336,17 @@
309 336 \label{fig:sum_rejection}
310 337 \end{figure}
311 338  
312   -The first problem we address is to maximize the rejection under bounded silicon area
313   -and feasibility constraints. Variable $a_i$ is the area taken by filter~$i$
  339 +Based on this analysis, we address the estimate of resource consumption (called
  340 +silicon area -- in the case of FPGAs meaning processing cells) as a function of
  341 +filter characteristics. As a reminder, we do not aim at matching actual hardware
  342 +configuration but consider an arbitrary silicon area occupied by each processing function,
  343 +and will assess after synthesis the adequation of this arbitrary unit with actual
  344 +hardware resources provided by FPGA manufacturers. The sum of individual processing
  345 +unit areas is constrained by a total silicon area representative of FPGA global resources.
  346 +Formally, variable $a_i$ is the area taken by filter~$i$
314 347 (in arbitrary unit). Variable $r_i$ is the rejection of filter~$i$ (in dB).
315 348 Constant $\mathcal{A}$ is the total available area. We model our problem as follows:
316 349  
317   -Finally we can describe our abstract model with following expressions :
318 350 \begin{align}
319 351 \text{Maximize } & \sum_{i=1}^n r_i \notag \\
320 352 \sum_{i=1}^n a_i & \leq \mathcal{A} & \label{eq:area} \\
321 353  
... ... @@ -328,12 +360,12 @@
328 360  
329 361 Equation~\ref{eq:area} states that the total area taken by the filters must be
330 362 less than the available area. Equation~\ref{eq:areadef} gives the definition of
331   -the area for a filter. More precisely, it is the area of the FIR as the Shifter
332   -does not need any circuitry. We consider that the FIR needs $C_i$ registers of size
  363 +the area used by a filter, considered as the area of the FIR since the Shifter is
  364 +assumed not to require significant resources. We consider that the FIR needs $C_i$ registers of size
333 365 $\pi_i^C + \pi_i^-$~bits to store the results of the multiplications of the
334   -input data and the coefficients. Equation~\ref{eq:rejectiondef} gives the
335   -definition of the rejection of the filter thanks to function~$F$ that we defined
336   -previously. The Shifter does not introduce negative rejection as we explain later,
  366 +input data with the coefficients. Equation~\ref{eq:rejectiondef} gives the
  367 +definition of the rejection of the filter thanks to the tabulated function~$F$ that we defined
  368 +previously. The Shifter does not introduce negative rejection as we will explain later,
337 369 so the rejection only comes from the FIR. Equation~\ref{eq:bits} states the
338 370 relation between $\pi_i^+$ and $\pi_i^-$. The multiplications in the FIR add
339 371 $\pi_i^C$ bits as most coefficients are close to zero, and the Shifter removes
340 372  
... ... @@ -342,12 +374,12 @@
342 374 Equation~\ref{eq:maxshift} ensures that the Shifter does not introduce negative
343 375 rejection. Indeed, the results of the FIR can be right shifted without compromising
344 376 the quality of the rejection until a threshold. Each bit of the output data
345   -increases the maximum rejection level of 6~dB. We add one to take the sign bit
  377 +increases the maximum rejection level by 6~dB. We add one to take the sign bit
346 378 into account. If equation~\ref{eq:maxshift} was not present, the Shifter could
347 379 shift too much and introduce some noise in the output data. Each supplementary
348   -shift bit would cause 6~dB of noise. A totally equivalent equation is:
349   -$\pi_i^S \leq \pi_i^- + \pi_i^C - 1 - \sum_{k=1}^{i} \left(1 + \frac{r_j}{6}\right) $.
350   -Finally, equation~\ref{eq:init} gives the global input's number of bits.
  380 +shift bit would cause an additional 6~dB rejection rise. A totally equivalent equation is:
  381 +$\pi_i^S \leq \pi_i^- + \pi_i^C - 1 - \sum_{k=1}^{i} \left(1 + \frac{r_j}{6}\right)$.
  382 +Finally, equation~\ref{eq:init} gives the number of bits of the global input.
351 383  
352 384 This model is non-linear and even non-quadratic, as $F$ does not have a known
353 385 linear or quadratic expression. We introduce $p$ FIR configurations
354 386  
... ... @@ -371,15 +403,16 @@
371 403 model, and since Gurobi is able to linearize, the model is left as is. This model
372 404 has $O(np)$ variables and $O(n)$ constraints.
373 405  
374   -The section~\ref{sec:fixed_area} shows the results for the first version of quadratic program but the section~\ref{sec:fixed_rej}
375   -presents the results for the complementary problem. In this case we want
376   -minimize the occupied area for a targeted rejection level. Hence we have replace
377   -the objective function with:
  406 +Two problems will be addressed using the workflow described in the next section: on the one
  407 +hand maximizing the rejection capability of a set of cascaded filters occupying a fixed arbitrary
  408 +silcon area (section~\ref{sec:fixed_area}) and on the second hand the dual problem of minimizing the silicon area
  409 +for a fixed rejection criterion (section~\ref{sec:fixed_rej}). In the latter case, the
  410 +objective function is replaced with:
378 411 \begin{align}
379 412 \text{Minimize } & \sum_{i=1}^n a_i \notag
380 413 \end{align}
381   -We adapt our constraints of quadratic program to replace the equation \ref{eq:area}
382   -by the equation \ref{eq:rejection_min} where $\mathcal{R}$ is the minimal
  414 +We adapt our constraints of quadratic program to replace equation \ref{eq:area}
  415 +with equation \ref{eq:rejection_min} where $\mathcal{R}$ is the minimal
383 416 rejection required.
384 417  
385 418 \begin{align}
... ... @@ -389,8 +422,9 @@
389 422 \section{Design workflow}
390 423 \label{sec:workflow}
391 424  
392   -In this section, we describe the workflow to compute all the results presented in section~\ref{sec:fixed_area}.
393   -Figure~\ref{fig:workflow} shows the global workflow and the different steps involved in the computations of the results.
  425 +In this section, we describe the workflow to compute all the results presented in sections~\ref{sec:fixed_area}
  426 +and \ref{sec:fixed_rej}. Figure~\ref{fig:workflow} shows the global workflow and the different steps involved
  427 +in the computation of the results.
394 428  
395 429 \begin{figure}
396 430 \centering
397 431  
398 432  
399 433  
... ... @@ -423,29 +457,33 @@
423 457 The filter solver is a C++ program that takes as input the maximum area
424 458 $\mathcal{A}$, the number of stages $n$, the size of the input signal $\Pi^I$,
425 459 the FIR configurations $(C_{ij}, \pi_{ij}^C)$ and the function $F$. It creates
426   -the quadratic programs and uses the Gurobi solver to get the optimal results.
  460 +the quadratic programs and uses the Gurobi solver to estimate the optimal results.
427 461 Then it produces two scripts: a TCL script ((1a) on figure~\ref{fig:workflow})
428 462 and a deploy script ((1b) on figure~\ref{fig:workflow}).
429 463  
430 464 The TCL script describes the whole digital processing chain from the beginning
431   -(the raw signal data) to the end (the filtered data).
432   -The raw input data generated from a Pseudo Random Number (PRN)
  465 +(the raw signal data) to the end (the filtered data) in a language compatible
  466 +with proprietary synthesis software, namely Vivado for Xilinx and Quartus for
  467 +Intel/Altera. The raw input data generated from a 20-bit Pseudo Random Number (PRN)
433 468 generator inside the FPGA and $\Pi^I$ is fixed at 16~bits.
434 469 Then the script builds each stage of the chain with a generic FIR task that
435 470 comes from a skeleton library. The generic FIR is highly configurable
436 471 with the number of coefficients and the size of the coefficients. The coefficients
437 472 themselves are not stored in the script.
438   -Whereas the signal is processed in real-time, the output signal is stored as
439   -consecutive bursts of data.
  473 +As the signal is processed in real-time, the output signal is stored as
  474 +consecutive bursts of data for post-processing, mainly assessing the consistency of the
  475 +implemented FIR cascade transfer function with the design criteria and the expected
  476 +transfer function.
440 477  
441 478 The TCL script is used by Vivado to produce the FPGA bitstream ((2) on figure~\ref{fig:workflow}).
442 479 We use the 2018.2 version of Xilinx Vivado and we execute the synthesized
443 480 bitstream on a Redpitaya board fitted with a Xilinx Zynq-7010 series
444   -FPGA (xc7z010clg400-1) and two 125~MS/s ADC.
445   -The board works with a Buildroot Linux image. We have developed some tools and
446   -drivers to flash and communicate with the FPGA. They are used to automatize all
447   -the workflow inside the board: load the filter coefficients and retrieve the
448   -computed data.
  481 +FPGA (xc7z010clg400-1) and two LTC2145 14-bit 125~MS/s ADC, loaded with 50~$\Omega$ resistors to
  482 +provide a broadband noise source.
  483 +The board runs the Linux kernel and surrounding environment produced from the
  484 +Buildroot framework available at \url{https://github.com/trabucayre/redpitaya/}: configuring
  485 +the Zynq FPGA, feeding the FIR with the set of coefficients, executing the simulation and
  486 +fetching the results is automated.
449 487  
450 488 The deploy script uploads the bitstream to the board ((3) on
451 489 figure~\ref{fig:workflow}), flashes the FPGA, loads the different drivers,
452 490  
... ... @@ -457,14 +495,11 @@
457 495 The results are normalized so that the Power Spectrum Density (PSD) starts at zero
458 496 and the different configurations can be compared.
459 497  
460   -The workflow used to compute the results in section~\ref{sec:fixed_rej}, we
461   -have just adapted the quadratic program but the rest of the workflow is unchanged.
462   -
463   -\section{Experiments with fixed area space}
  498 +\section{Maximizing the rejection at fixed silicon area}
464 499 \label{sec:fixed_area}
465 500 This section presents the output of the filter solver {\em i.e.} the computed
466 501 configurations for each stage, the computed rejection and the computed silicon area.
467   -This is interesting to understand the choices made by the solver to compute its solutions.
  502 +Such results allow for understanding the choices made by the solver to compute its solutions.
468 503  
469 504 The experimental setup is composed of three cases. The raw input is generated
470 505 by a Pseudo Random Number (PRN) generator, which fixes the input data size $\Pi^I$.
471 506  
... ... @@ -542,13 +577,13 @@
542 577 From these tables, we can first state that the more stages are used to define
543 578 the cascaded FIR filters, the better the rejection. It was an expected result as it has
544 579 been previously observed that many small filters are better than
545   -a single large filter \cite{lim_1988, lim_1996, young_1992}, despite such conclusion
  580 +a single large filter \cite{lim_1988, lim_1996, young_1992}, despite such conclusions
546 581 being hardly used in practice due to the lack of tools for identifying individual filter
547 582 coefficients in the cascaded approach.
548 583  
549 584 Second, the larger the silicon area, the better the rejection. This was also an
550   -expected result as more area means a filter of better quality (more coefficients
551   -or more bits per coefficient).
  585 +expected result as more area means a filter of better quality with more coefficients
  586 +or more bits per coefficient.
552 587  
553 588 Then, we also observe that the first stage can have a larger shift than the other
554 589 stages. This is explained by the fact that the solver tries to use just enough
555 590  
... ... @@ -558,9 +593,9 @@
558 593  
559 594 Finally, we note that the solver consumes all the given silicon area.
560 595  
561   -The following graphs present the rejection for real data on the FPGA. In all following
  596 +The following graphs present the rejection for real data on the FPGA. In all the following
562 597 figures, the solid line represents the actual rejection of the filtered
563   -data on the FPGA as measured experimentally and the dashed line are the noise level
  598 +data on the FPGA as measured experimentally and the dashed line are the noise levels
564 599 given by the quadratic solver. The configurations are those computed in the previous section.
565 600  
566 601 Figure~\ref{fig:max_500_result} shows the rejection of the different configurations in the case of MAX/500.
567 602  
... ... @@ -603,10 +638,10 @@
603 638 Table~\ref{tbl:resources_usage} shows the resources usage in the case of MAX/500, MAX/1000 and
604 639 MAX/1500 \emph{i.e.} when the maximum allowed silicon area is fixed to 500, 1000
605 640 and 1500 arbitrary units. We have taken care to extract solely the resources used by
606   -the FIR filters and remove additional processing blocks including FIFO and PL to
607   -PS communication.
  641 +the FIR filters and remove additional processing blocks including FIFO and Programmable
  642 +Logic (PL -- FPGA) to Processing System (PS -- general purpose processor) communication.
608 643  
609   -\begin{table}
  644 +\begin{table}[h!tb]
610 645 \caption{Resource occupation. The last column refers to available resources on a Zynq-7010 as found on the Redpitaya.}
611 646 \label{tbl:resources_usage}
612 647 \centering
613 648  
614 649  
615 650  
616 651  
617 652  
... ... @@ -632,30 +667,28 @@
632 667 \end{table}
633 668  
634 669 In some cases, Vivado replaces the DSPs by Look Up Tables (LUTs). We assume that,
635   -when the filters coefficients are small enough, or when the input size is small
636   -enough, Vivado optimized resource consumption by selecting multiplexers to
  670 +when the filter coefficients are small enough, or when the input size is small
  671 +enough, Vivado optimizes resource consumption by selecting multiplexers to
637 672 implement the multiplications instead of a DSP. In this case, it is quite difficult
638 673 to compare the whole silicon budget.
639 674  
640   -However, a rough estimation can be made with a simple equivalence. Looking at
  675 +However, a rough estimation can be made with a simple equivalence: looking at
641 676 the first column (MAX/500), where the number of LUTs is quite stable for $n \geq 2$,
642 677 we can deduce that a DSP is roughly equivalent to 100~LUTs in terms of silicon
643   -area use. With this equivalence, our 500 arbitraty units corresponds to 2500 LUTs,
644   -1000 arbitrary units corresponds to 5000 LUTs and 1500 arbitrary units corresponds
  678 +area use. With this equivalence, our 500 arbitraty units correspond to 2500 LUTs,
  679 +1000 arbitrary units correspond to 5000 LUTs and 1500 arbitrary units correspond
645 680 to 7300 LUTs. The conclusion is that the orders of magnitude of our arbitrary
646   -unit are quite good. The relatively small differences can probably be explained
  681 +unit map well to actual hardware resources. The relatively small differences can probably be explained
647 682 by the optimizations done by Vivado based on the detailed map of available processing resources.
648 683  
649   -We present the computation time to solve the quadratic problem.
650   -For each case, the filter solver software are executed with a Intel(R) Xeon(R) CPU E5606
651   -cadenced at 2.13~GHz. The CPU has 8 cores that are used by Gurobi to solve
652   -the quadratic problem.
653   -
654   -Table~\ref{tbl:area_time} shows the time needed to solve the quadratic
  684 +We now present the computation time needed to solve the quadratic problem.
  685 +For each case, the filter solver software is executed on a Intel(R) Xeon(R) CPU E5606
  686 +clocked at 2.13~GHz. The CPU has 8 cores that are used by Gurobi to solve
  687 +the quadratic problem. Table~\ref{tbl:area_time} shows the time needed to solve the quadratic
655 688 problem when the maximal area is fixed to 500, 1000 and 1500 arbitrary units.
656 689  
657   -\begin{table}
658   -\caption{Time to solve the quadratic program with Gurobi}
  690 +\begin{table}[h!tb]
  691 +\caption{Time needed to solve the quadratic program with Gurobi}
659 692 \label{tbl:area_time}
660 693 \centering
661 694 \begin{tabular}{|c|c|c|c|}\hline
662 695  
663 696  
664 697  
... ... @@ -671,15 +704,16 @@
671 704 As expected, the computation time seems to rise exponentially with the number of stages. % TODO: exponentiel ?
672 705 When the area is limited, the design exploration space is more limited and the solver is able to
673 706 find an optimal solution faster. On the contrary, in the case of MAX/1500 with
674   -5~stages, we were not able to obtain a result after 40~hours of computation so we decided to stop.
  707 +5~stages, we were not able to obtain a result after 40~hours of computation when the program was
  708 +manually stopped.
675 709  
676   -\section{Experiments with fixed rejection target}
677   -\label{sec:fixed_rej}
678   -This section presents the results of complementary quadratic program which we
679   -minimize the area occupation for a targeted noise level.
  710 +\subsection{Minimizing resource occupation at fixed rejection}\label{sec:fixed_rej}
680 711  
  712 +This section presents the results of the complementary quadratic program aimed at
  713 +minimizing the area occupation for a targeted rejection level.
  714 +
681 715 The experimental setup is also composed of three cases. The raw input is the same
682   -as previous section, a PRN generator, which fixes the input data size $\Pi^I$.
  716 +as in the previous section, from a PRN generator, which fixes the input data size $\Pi^I$.
683 717 Then the targeted rejection $\mathcal{R}$ has been fixed to either 40, 60 or 80~dB.
684 718 Hence, the three cases have been named: MIN/40, MIN/60, MIN/80.
685 719 The number of configurations $p$ is the same as previous section.
... ... @@ -690,7 +724,7 @@
690 724  
691 725 \renewcommand{\arraystretch}{1.4}
692 726  
693   -\begin{table}
  727 +\begin{table}[h!tb]
694 728 \caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MIN/40}
695 729 \label{tbl:gurobi_min_40}
696 730 \centering
... ... @@ -708,7 +742,7 @@
708 742 }
709 743 \end{table}
710 744  
711   -\begin{table}
  745 +\begin{table}[h!tb]
712 746 \caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MIN/60}
713 747 \label{tbl:gurobi_min_60}
714 748 \centering
... ... @@ -727,7 +761,7 @@
727 761 }
728 762 \end{table}
729 763  
730   -\begin{table}
  764 +\begin{table}[h!tb]
731 765 \caption{Configurations $(C_i, \pi_i^C, \pi_i^S)$, rejections and areas (in arbitrary units) for MIN/80}
732 766 \label{tbl:gurobi_min_80}
733 767 \centering
734 768  
735 769  
736 770  
737 771  
... ... @@ -747,24 +781,28 @@
747 781 \end{table}
748 782 \renewcommand{\arraystretch}{1}
749 783  
750   -From these tables, we can first state that all configuration reach the target rejection
751   -level and more we have stages lesser is the area occupied in arbitrary unit.
752   -Futhermore, the area of the monolithic filter is twice bigger than the two cascaded.
753   -More generally, more there is filters lower is the occupied area.
  784 +% JMF : je croyais que dans un cas le monolithique n'y arrivait juste pas : tu as retire' ce cas ?
  785 +From these tables, we can first state that all configurations reach the targeted rejection
  786 +level or even better thanks to our underestimate of the cascade rejection as the sum of the
  787 +individual filter rejection
  788 +% we have stages lesser is the area occupied in arbitrary unit. JMF : je ne comprends pas cette phrase
  789 +Futhermore, the area of the monolithic filter is twice as big as the two cascaded filters
  790 +(1131 and 1760 arbitrary units v.s 547 and 903 arbitrary units for 60 and 80~dB rejection
  791 +respectively). More generally, the more filters are cascaded, the lower the occupied area.
754 792  
755   -Like in previous section, the solver choose always a little filter as first
756   -filter stage and the second one is often the biggest filter. this choice can be explain
757   -as the previous section. The solver uses just enough bits to not degrade the input
758   -signal and in second filter it can choose a better filter to improve rejection without
759   -have too bits in the output data.
  793 +Like in previous section, the solver chooses always a little filter as first
  794 +filter stage and the second one is often the biggest filter. This choice can be explained
  795 +as in the previous section, with the solver using just enough bits not to degrade the input
  796 +signal and in the second filter selecting a better filter to improve rejection without
  797 +having too many bits in the output data.
760 798  
761   -For the specific case in MIN/40 for $n = 5$ the solver has determined that the optimal
762   -number of filter is 4 so it not chose any configuration in last filter. Hence this
  799 +For the specific case of MIN/40 for $n = 5$ the solver has determined that the optimal
  800 +number of filters is 4 so it did not chose any configuration for the last filter. Hence this
763 801 solution is equivalent to the result for $n = 4$.
764 802  
765   -The following graphs present the rejection for real data on the FPGA. In all following
  803 +The following graphs present the rejection for real data on the FPGA. In all the following
766 804 figures, the solid line represents the actual rejection of the filtered
767   -data on the FPGA as measured experimentally and the dashed line are the noise level
  805 +data on the FPGA as measured experimentally and the dashed line is the noise level
768 806 given by the quadratic solver.
769 807  
770 808 Figure~\ref{fig:min_40} shows the rejection of the different configurations in the case of MIN/40.
771 809  
... ... @@ -792,11 +830,11 @@
792 830 \label{fig:min_80}
793 831 \end{figure}
794 832  
795   -We observe that all rejections given by the quadratic solver are close to the real
796   -rejection. All curves prove that the constraint to reach the target rejection is
797   -respected both monolithic filter or cascaded filters.
  833 +We observe that all rejections given by the quadratic solver are close to the experimentally
  834 +measured rejection. All curves prove that the constraint to reach the target rejection is
  835 +respected with both monolithic or cascaded filters.
798 836  
799   -Table~\ref{tbl:resources_usage} shows the resources usage in the case of MIN/40, MIN/60 and
  837 +Table~\ref{tbl:resources_usage} shows the resource usage in the case of MIN/40, MIN/60 and
800 838 MIN/80 \emph{i.e.} when the target rejection is fixed to 40, 60 and 80~dB. We
801 839 have taken care to extract solely the resources used by
802 840 the FIR filters and remove additional processing blocks including FIFO and PL to
803 841  
804 842  
805 843  
... ... @@ -827,16 +865,17 @@
827 865 \end{tabular}
828 866 \end{table}
829 867  
830   -If we keep the previous estimation of cost of one DSP in term of LUT (1 DSP $\approx$ 100 LUT)
831   -the real resource consumption decrease in function of number of stage filter according
  868 +If we keep the previous estimation of cost of one DSP in terms of LUT (1 DSP $\approx$ 100 LUT)
  869 +the real resource consumption decreases as a function of the number of stages in the cascaded
  870 +filter according
832 871 to the solution given by the quadratic solver. Indeed, we have always a decreasing
833 872 consumption even if the difference between the monolithic and the two cascaded
834   -filters is lesser than expected.
  873 +filters is less than expected.
835 874  
836   -Finally, the table~\ref{tbl:area_time_comp} shows the computation time to solve
  875 +Finally, table~\ref{tbl:area_time_comp} shows the computation time to solve
837 876 the quadratic program.
838 877  
839   -\begin{table}
  878 +\begin{table}[h!tb]
840 879 \caption{Time to solve the quadratic program with Gurobi}
841 880 \label{tbl:area_time_comp}
842 881 \centering
843 882  
844 883  
845 884  
... ... @@ -850,29 +889,31 @@
850 889 \end{tabular}
851 890 \end{table}
852 891  
853   -The time needed to solve this configuration are substantially faster than time
854   -needed in the previous section. Indeed the worst time in this case is only 3~minutes
855   -in balance of 3~days on previous section. We are able to solve more easily this
856   -problem than the previous one.
  892 +The time needed to solve this configuration is significantly shorter than the time
  893 +needed in the previous section. Indeed the worst time in this case is only 3~minutes,
  894 +compared to 3~days in the previous section: this problem is more easily solved than the
  895 +previous one.
857 896  
858 897 \section{Conclusion}
859 898  
860   -In this paper, we have proposed a new approach to work with a cascade of FIR filter inside a FPGA.
861   -This method aims to be hardware independent and focus an high-level of abstraction.
862   -We have modeled the FIR filter operation and the data shift impact. With this model
863   -we have created a quadratic program to select the optimal FIR coefficient set to reject a
864   -maximum of noise. In our experiments we have chosen deliberately some common tools
865   -to design the filter coefficients but we can use any other method.
  899 +We have proposed a new approach to schedule a set of signal processing blocks whose performances
  900 +and resource consumption has been tabulated, and applied this methodology to the practical
  901 +case of implementing cascaded FIR filters inside a FPGA.
  902 +This method aims to be hardware independent and focuses an a high-level of abstraction.
  903 +We have modeled the FIR filter operation and the impact of data shift. Thanks to this model,
  904 +we have created a quadratic program to select the optimal FIR taps to reach a targeted
  905 +rejection. Individual filter taps have been identified using commonly available tools and the
  906 +emphasis is on FIR assembly rather than individual FIR coefficient identification.
866 907  
867 908 Our experimental results are very promising in providing a rational approach to selecting
868 909 the coefficients of each FIR filter in the context of a performance target for a chain of
869   -such filters. The FPGA design that is produced automatically by our
870   -workflow is able to filter an input signal as expected which validates our model and our approach.
871   -We can easily change the quadratic program to adapt it to an other problem.
  910 +such filters. The FPGA design that is produced automatically by the proposed
  911 +workflow is able to filter an input signal as expected, validating experimentally our model and our approach.
  912 +The quadratic program can be adapted it to an other problem based on assembling skeleton blocks.
872 913  
873 914 A perspective is to model and add the decimators to the processing chain to have a classical
874   -FIR filter and decimator. The impact of the decimator is not so trivial, especially in terms of silicon
875   -area for the subsequent stages since some hardware optimization can be applied in
  915 +FIR filter and decimator. The impact of the decimator is not trivial, especially in terms of silicon
  916 +area usage for subsequent stages since some hardware optimization can be applied in
876 917 this case.
877 918  
878 919 The software used to demonstrate the concepts developed in this paper is based on the