Article

VLSI-Friendly Filtering Algorithms for Deep Neural Networks

Faculty of Computer Science and Information Technology, West Pomeranian University of Technology in Szczecin, Żołnierska 49, 71-210 Szczecin, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(15), 9004; https://doi.org/10.3390/app13159004
Submission received: 5 July 2023 / Revised: 31 July 2023 / Accepted: 4 August 2023 / Published: 6 August 2023
(This article belongs to the Special Issue Recent Developments in Algorithms and Computational Complexity)

Abstract

The paper introduces a range of efficient algorithmic solutions for implementing the fundamental filtering operation in convolutional layers of convolutional neural networks on fully parallel hardware. Specifically, these operations involve computing M inner products between neighbouring vectors generated by a sliding time window from the input data stream and an M-tap finite impulse response filter. By leveraging the factorisation of the Hankel matrix, we have successfully reduced the multiplicative complexity of the matrix-vector product calculation. This approach has been applied to develop fully parallel and resource-efficient algorithms for M values of 3, 5, 7, and 9. The fully parallel hardware implementation of our proposed algorithms achieves approximately a 30% reduction in embedded multipliers compared to the naive calculation methods.

1. Introduction

The need for high-speed processing of large amounts of information stimulates the development and use of highly effective data processing systems. In such systems, the primary requirement for implementing computing methods is to minimise the time of data processing, ensuring the ability to fulfil the planned task within the time allocated for the application. This requirement is especially relevant in the implementation of algorithms for processing digital information in deep neural networks (DNNs) [1,2,3,4,5]. As is known, in deep neural networks, the primary and time-consuming operation is digital convolution. The need to quickly calculate digital convolution arises in both convolutional and capsule neural networks. Digital convolution calculations can be accelerated by algorithmic and hardware methods. In general, algorithmic methods primarily focus on minimising the number of arithmetic operations involved. One widely employed strategy for reducing the computational complexity of the digital convolution operation is utilising the Fast Fourier Transform (FFT) algorithm. This approach has found application in some deep neural networks [6,7,8,9,10,11]. However, modern convolutional and capsule neural networks use small filters more often than the traditionally used large filters computed using the FFT approach. Winograd's minimal filtering algorithm [1,12,13,14,15], which has recently gained significant popularity, is widely regarded as well-suited for such scenarios. This approach exhibits enhanced efficiency, specifically when employing small filters and tile sizes; in such cases, it performs linear convolution with minimal computational complexity. Indeed, this method calculates the dot products of adjacent vectors obtained from a sliding time window in the current data stream, employing a third-order finite impulse response (FIR) filter for this purpose.

2. State of the Art

Since we are talking mainly about convolutional and capsule neural networks, it is clear that linear convolution is the main operation in their implementation. In general, convolutional layers tend to be the most time-intensive component, often accounting for over half of the total computation time in a typical implementation [16,17]. The convolution itself is also a time-consuming operation. For this reason, deep neural network builders are looking for and creating efficient ways to minimise the computational complexity of convolution [11].
Another opportunity to speed up calculations in deep neural networks is by utilising high-performance field-programmable gate arrays (FPGAs) [18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33], graphics processing units (GPUs) [33,34,35,36], and specialised application-specific integrated circuits (ASICs) [33,37,38,39].
While modern stationary data processing systems possess ample computing power, the design of battery-powered mobile on-board systems encounters various conflicting factors that hinder peak performance. The conventional approach of parallelising computations to enhance data processing speed increases the data processing unit's size, weight, and power consumption. Consequently, there is a need for solutions that effectively leverage computational parallelisation while concurrently optimising hardware costs.
Extensive research is being conducted on algorithms and structures for high-performance computing devices intended to process digital signals and images with practical applications in embedded systems. Developing micro-miniature processing units tailored for image processing and recognition in on-board mobile neural networks, known as Tiny ML or Edge AI, is of particular interest. These cutting-edge applications are at the forefront of modern technology.
For these reasons, the Tiny ML Summit has been held since 2019, bringing together experts from major companies and universities. The primary focus of this conference is to discuss the potential of transitioning machine learning from high-performance mainframes to small battery-powered signal microprocessors. The Tiny ML concept is continually evolving, driven by the development of dedicated chips designed for these applications. Notably, digital signal processing algorithms are pivotal in the systems under discussion. Over time, numerous algorithms and processor structures have been developed to address the challenges posed by these systems. With a focus on flexibility and versatility, the prevailing approach involves using universal signal microprocessors and FPGAs.
However, the high flexibility of such processors is at odds with a highly efficient implementation. For example, a programmable signal processing unit is flexible, scalable and upgradable but highly inefficient in terms of performance, die area, weight, and power consumption.
Therefore, developing ASIC-centric solutions is best suited for portable applications as minimising power consumption, weight, and size of the processing unit in battery-powered systems has become an essential aspect of on-board processing.
At the algorithmic level, methods for reducing the above parameters usually focus on minimising the number of arithmetic operations, especially multiplications. In this regard, developing algorithms for performing the main filtering operations with minimal multiplicative complexity is an urgent task. We therefore emphasise once more that convolution is an essential mathematical macro-operation in DNNs, and it is usually (though not always) computed using Winograd's minimal filtering algorithm [1,12,14,18,20,21,22,40,41,42]. However, since, as already noted [43], this algorithm can only calculate two adjacent dot products, it is not suitable for all situations that can arise in neural networks. For example, Winograd's minimal filtering algorithm is redundant for M = 3 and tile size (5 × 5) or for M = 5 and tile size (9 × 9); many other examples could be given. In this article, we present algorithmic solutions for FIR filters with short impulse responses, which can be more efficient in some cases than Winograd's minimal filtering algorithm.

3. Preliminary Remarks

The primary step in computing a 2D convolution involves taking the dot product between the vectors created by the sliding time window from the present data stream and the impulse response of an M-order finite impulse response (FIR) filter.
The procedure for computing convolution elements can be represented as follows in the most general case:
$$y_j = \sum_{i=0}^{M-1} x_{i+j} w_i, \qquad j = 0, 1, \ldots, N-M, \qquad (1)$$
where N represents the length of the current data stream, with { x i + j } denoting the elements of the data stream, and { w i } represents the constant coefficients of the FIR filter’s impulse response.
In a more detailed form, expression (1) can be represented as follows:
$$\begin{aligned}
y_0 &= w_0 x_0 + w_1 x_1 + \cdots + w_{M-1} x_{M-1} \\
y_1 &= w_0 x_1 + w_1 x_2 + \cdots + w_{M-1} x_M \\
&\;\;\vdots \\
y_{N-M} &= w_0 x_{N-M} + w_1 x_{N-M+1} + \cdots + w_{M-1} x_{N-1}
\end{aligned} \qquad (2)$$
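For readers who prefer code to summation notation, the short Python sketch below (our own illustration; the function name and the NumPy dependency are not part of the algorithms discussed) evaluates (1) and (2) directly and makes the naive cost of M multiplications per output sample explicit.

```python
import numpy as np

def naive_sliding_dot_product(x, w):
    """Naive evaluation of (1): y_j = sum_i x[i + j] * w[i], j = 0..N-M.

    Each of the N - M + 1 output samples costs M multiplications and
    M - 1 additions, i.e. the system (2) computed verbatim.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    N, M = len(x), len(w)
    return np.array([np.dot(x[j:j + M], w) for j in range(N - M + 1)])

# Example: a 3-tap filter sliding over a length-5 data block yields 3 outputs.
y = naive_sliding_dot_product([1, 2, 3, 4, 5], [0.5, 0.25, 0.25])
```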
Figure 1 illustrates the sequence of steps in calculating the moving dot product.
The equations above comprehensively describe all the mathematical operations required for the calculations. However, strictly speaking, they do not constitute an algorithm since they do not reveal the specific sequence of calculations. In some instances, expressing the sliding dot product operation as a matrix-vector product is more convenient:
$$\mathbf{Y}_{(N-M+1)\times 1} = \mathbf{W}_{(N-M+1)\times N}\, \mathbf{X}_{N\times 1}, \qquad (3)$$
where:
$\mathbf{Y}_{(N-M+1)\times 1} = [y_0, y_1, \ldots, y_{N-M}]^{\mathrm{T}}$,
$\mathbf{X}_{N\times 1} = [x_0, x_1, \ldots, x_{N-1}]^{\mathrm{T}}$,
$$\mathbf{W}_{(N-M+1)\times N} = \begin{bmatrix}
w_0 & w_1 & \cdots & w_{M-1} & & & \\
 & w_0 & w_1 & \cdots & w_{M-1} & & \\
 & & \ddots & & & \ddots & \\
 & & & w_0 & w_1 & \cdots & w_{M-1}
\end{bmatrix}.$$
However, regarding the research task at hand, such a representation is unhelpful, as it does not facilitate identifying opportunities for reducing the computational complexity of the procedure for determining the sliding inner product when the sequence consists of input signal samples.
Let us rewrite expressions (2) in the following form:
$$\mathbf{Y}_{(N-M+1)\times 1} = \mathbf{X}_{(N-M+1)\times M}\, \mathbf{W}_{M\times 1}, \qquad (4)$$
where:
$\mathbf{W}_{M\times 1} = [w_0, w_1, \ldots, w_{M-1}]^{\mathrm{T}}$,
$$\mathbf{X}_{(N-M+1)\times M} = \begin{bmatrix}
x_0 & x_1 & \cdots & x_{M-1} \\
x_1 & x_2 & \cdots & x_M \\
\vdots & \vdots & & \vdots \\
x_{N-M} & x_{N-M+1} & \cdots & x_{N-1}
\end{bmatrix}.$$
This form of notation is much more useful and legible, as will become apparent in the next steps. It turns out that exploiting the structural properties of the matrix $\mathbf{X}_{(N-M+1)\times M}$ in expression (4) allows a fairly significant reduction in the number of arithmetic operations. Let us consider this problem in more detail and impose certain conditions on the sizes of the input sequences. Suppose $N = M(K+1) - 1$, where $K = 1, 2, 3, \ldots$ is a positive integer. Obviously, if this requirement on $N$ is not satisfied, the sequence $\{x_n\}$, $n = 0, 1, \ldots, N-1$, can be padded with zeros without losing computation precision.
Then the expression (4) takes the following form:
$$\mathbf{Y}_{KM\times 1} = \mathbf{X}_{KM\times M}\, \mathbf{W}_{M\times 1}, \qquad (5)$$
where:
$\mathbf{Y}_{KM\times 1} = [y_0, y_1, \ldots, y_{KM-1}]^{\mathrm{T}}$,
$\mathbf{X}_{KM\times M} = [\mathbf{X}^{(0)}_{1\times M}, \mathbf{X}^{(1)}_{1\times M}, \ldots, \mathbf{X}^{(KM-1)}_{1\times M}]^{\mathrm{T}}$,
and
$\mathbf{X}^{(i)}_{1\times M} = [x_i, x_{i+1}, \ldots, x_{M-1+i}]$, $i = 0, 1, \ldots, KM-1$.
To see the structure of the matrix X K M × M , we present expression (5) in a more detailed form:
$$\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{KM-1} \end{bmatrix} =
\begin{bmatrix}
x_0 & x_1 & \cdots & x_{M-1} \\
x_1 & x_2 & \cdots & x_M \\
\vdots & \vdots & & \vdots \\
x_{M-1} & x_M & \cdots & x_{2M-2} \\
x_M & x_{M+1} & \cdots & x_{2M-1} \\
x_{M+1} & x_{M+2} & \cdots & x_{2M} \\
\vdots & \vdots & & \vdots \\
x_{2M-1} & x_{2M} & \cdots & x_{3M-2} \\
\vdots & \vdots & & \vdots \\
x_{(K-1)M} & x_{(K-1)M+1} & \cdots & x_{KM-1} \\
x_{(K-1)M+1} & x_{(K-1)M+2} & \cdots & x_{KM} \\
\vdots & \vdots & & \vdots \\
x_{KM-1} & x_{KM} & \cdots & x_{(K+1)M-2}
\end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_{M-1} \end{bmatrix}. \qquad (6)$$
By examining this description, it becomes evident that the matrix $\mathbf{X}_{KM\times M}$ possesses a block structure and comprises $K$ submatrices of Hankel type. Thus, the calculation of the sliding dot product, taking into account the imposed conditions, comes down to multiplying the sequence of $K$ sub-matrices (i.e., the Hankel matrices) by the vector $\mathbf{W}_{M\times 1}$ and then combining the individual calculation results.
Hence, we establish the fundamental filtering operation in DNNs as multiplying the Hankel sub-matrix (formed from the current input data sequence using a sliding window of size M) by a vector whose elements are the impulse response coefficients of an M-order FIR filter. There are efficient algorithms for multiplying Hankel matrices by a vector. However, they are mainly focused on large matrices, the order of which is a power of two [44,45]. But in most cases of image processing in neural networks, the impulse responses of FIR filters are short and contain an odd number of coefficients. Under these conditions, we are dealing with Hankel matrices of small orders. When multiplying small-size Hankel matrices by small-length vectors, known algorithms are inefficient or even counterproductive. Therefore, we have developed our own algorithms explicitly focused on calculating matrix-vector products with Hankel matrices of small orders.
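To make the splitting described above concrete, the following Python sketch (our illustration; the function names are ours) zero-pads the input so that N = M(K+1) − 1, forms the K Hankel sub-matrices appearing in (6), and multiplies each of them by the coefficient vector; concatenating the partial results reproduces the sliding dot product.

```python
import numpy as np

def hankel_block(x, start, M):
    """M x M Hankel sub-matrix of (6) whose first row begins at x[start]."""
    return np.array([x[start + r:start + r + M] for r in range(M)])

def blockwise_filtering(x, w):
    """Evaluate (5)/(6): pad x so that N = M*(K+1) - 1, then apply the basic
    macrooperation (Hankel sub-matrix times w) K times and stack the results."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    M, N = len(w), len(x)
    K = -(-(N - M + 1) // M)                   # ceil((N - M + 1) / M)
    x = np.pad(x, (0, M * (K + 1) - 1 - N))    # zero-padding keeps the results exact
    return np.concatenate([hankel_block(x, k * M, M) @ w for k in range(K)])
```

The first N − M + 1 entries of the result coincide with the naive computation of (1); any additional entries stem solely from the zero padding.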
So, we define the basic filtering macrooperation as:
$$\mathbf{Y}_{M\times 1} = \mathbf{X}_M \mathbf{W}_{M\times 1}, \qquad (7)$$
where:
$$\mathbf{X}_M = \begin{bmatrix}
x_0 & x_1 & \cdots & x_{M-1} \\
x_1 & x_2 & \cdots & x_M \\
\vdots & \vdots & & \vdots \\
x_{M-1} & x_M & \cdots & x_{2M-2}
\end{bmatrix},$$
$\mathbf{Y}_{M\times 1} = [y_0^{(M)}, y_1^{(M)}, \ldots, y_{M-1}^{(M)}]^{\mathrm{T}}$,
$\mathbf{W}_{M\times 1} = [w_0^{(M)}, w_1^{(M)}, \ldots, w_{M-1}^{(M)}]^{\mathrm{T}}$.
(Kindly take note that from this point forward, the superscript M will signify quantities associated with the basic filtering macrooperation employing an M-order filter).
We emphasise once again that, as a rule, small-order filters are used in deep neural networks, i.e., the impulse response vectors contain a small number of elements. And almost always, we are dealing with an odd number of coefficients. Based on the information provided above, this article aims to create and describe resource-efficient filtering algorithms for FIR filters with widely used orders: M = 3, 5, 7, and 9.

4. Minimal Filtering Algorithms

Let us show, using specific examples, how this approach works.

4.1. Algorithm 1, M = 3

Let $\mathbf{X}_{5\times 1} = [x_0, x_1, x_2, x_3, x_4]^{\mathrm{T}}$ be a vector that represents the input data set, $\mathbf{W}_{3\times 1} = [w_0^{(3)}, w_1^{(3)}, w_2^{(3)}]^{\mathrm{T}}$ be a vector that contains the coefficients of the impulse response of a 3-tap FIR filter, and $\mathbf{Y}_{3\times 1} = [y_0^{(3)}, y_1^{(3)}, y_2^{(3)}]^{\mathrm{T}}$ be a vector describing the results of using a 3-tap FIR filter:
$$\mathbf{Y}_{3\times 1} = \begin{bmatrix} x_0 & x_1 & x_2 \\ x_1 & x_2 & x_3 \\ x_2 & x_3 & x_4 \end{bmatrix}
\begin{bmatrix} w_0^{(3)} \\ w_1^{(3)} \\ w_2^{(3)} \end{bmatrix}. \qquad (8)$$
As can be seen, calculating the product (8) requires 9 multiplications and 6 additions.
We can formulate a streamlined algorithm for computing Y 3 × 1 by utilising the following matrix-vector calculation procedure:
$$\mathbf{Y}_{3\times 1} = \mathbf{T}^{(3)}_{3\times 6}\, \mathbf{D}^{(3)}_6\, \mathbf{T}^{(3)}_{6\times 5}\, \mathbf{X}_{5\times 1}, \qquad (9)$$
where
$$\mathbf{T}^{(3)}_{3\times 6} = \begin{bmatrix}
1 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1
\end{bmatrix},$$
$$\mathbf{T}^{(3)}_{6\times 5} = \begin{bmatrix}
1 & -1 & -1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & -1 & 1 & -1 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & -1 & -1 & 1
\end{bmatrix},$$
and
$$\mathbf{D}^{(3)}_6 = \mathrm{diag}\left(s_0^{(3)}, s_1^{(3)}, \ldots, s_5^{(3)}\right),$$
$s_0^{(3)} = w_0^{(3)}$, $s_1^{(3)} = w_0^{(3)} + w_1^{(3)}$, $s_2^{(3)} = w_1^{(3)}$,
$s_3^{(3)} = w_0^{(3)} + w_2^{(3)}$, $s_4^{(3)} = w_1^{(3)} + w_2^{(3)}$, $s_5^{(3)} = w_2^{(3)}$.
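The factorisation (9) is easy to check numerically. The Python sketch below encodes $\mathbf{T}^{(3)}_{3\times 6}$, $\mathbf{D}^{(3)}_6$ and $\mathbf{T}^{(3)}_{6\times 5}$ as transcribed above (our transcription; the row order follows $s_0^{(3)}, \ldots, s_5^{(3)}$) and compares the result with the naive product (8) on random data, using 6 multiplications per output triple instead of 9.

```python
import numpy as np

# Post-addition matrix T(3)_{3x6}: each output is a sum of three products.
T3_post = np.array([[1, 1, 0, 1, 0, 0],
                    [0, 1, 1, 0, 1, 0],
                    [0, 0, 0, 1, 1, 1]])

# Pre-addition matrix T(3)_{6x5}: rows ordered as s0, ..., s5.
T3_pre = np.array([[1, -1, -1,  0, 0],   # x0 - x1 - x2   (multiplied by w0)
                   [0,  1,  0,  0, 0],   # x1             (multiplied by w0 + w1)
                   [0, -1,  1, -1, 0],   # x2 - x1 - x3   (multiplied by w1)
                   [0,  0,  1,  0, 0],   # x2             (multiplied by w0 + w2)
                   [0,  0,  0,  1, 0],   # x3             (multiplied by w1 + w2)
                   [0,  0, -1, -1, 1]])  # x4 - x2 - x3   (multiplied by w2)

def filter3_fast(x, w):
    """Compute (9): six multiplications (the elements of D(3)_6) instead of nine."""
    s = np.array([w[0], w[0] + w[1], w[1], w[0] + w[2], w[1] + w[2], w[2]])
    return T3_post @ (s * (T3_pre @ x))

rng = np.random.default_rng(0)
x, w = rng.standard_normal(5), rng.standard_normal(3)
naive = np.array([x[j:j + 3] @ w for j in range(3)])   # the product (8)
assert np.allclose(filter3_fast(x, w), naive)
```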
Figure 2 depicts a data flow graph of the proposed algorithm for implementing the basic filtering operation for a 3-tap FIR filter. In this paper, the data flow diagrams are arranged from left to right, and straight lines in the figures represent data transfer operations. The circles in these figures indicate multiplication operations, with the corresponding numbers written inside the circles. The convergence points of the lines indicate summation, while dashed lines represent data transfer operations with a simultaneous change of sign. To maintain clarity, the figures utilise simple lines without arrows. Furthermore, to simplify the presentation, the superscripts of variables have been omitted in all figures, as the vector sizes involved in each case can be inferred from the figures themselves.

4.2. Algorithm 2, M = 5

Let $\mathbf{X}_{9\times 1} = [x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8]^{\mathrm{T}}$ be a vector that represents the input data set, $\mathbf{W}_{5\times 1} = [w_0^{(5)}, w_1^{(5)}, w_2^{(5)}, w_3^{(5)}, w_4^{(5)}]^{\mathrm{T}}$ be a vector that contains the coefficients of the impulse response of a 5-tap FIR filter, and $\mathbf{Y}_{5\times 1} = [y_0^{(5)}, y_1^{(5)}, y_2^{(5)}, y_3^{(5)}, y_4^{(5)}]^{\mathrm{T}}$ be a vector describing the results of using a 5-tap FIR filter:
$$\mathbf{Y}_{5\times 1} = \begin{bmatrix}
x_0 & x_1 & x_2 & x_3 & x_4 \\
x_1 & x_2 & x_3 & x_4 & x_5 \\
x_2 & x_3 & x_4 & x_5 & x_6 \\
x_3 & x_4 & x_5 & x_6 & x_7 \\
x_4 & x_5 & x_6 & x_7 & x_8
\end{bmatrix}
\begin{bmatrix} w_0^{(5)} \\ w_1^{(5)} \\ w_2^{(5)} \\ w_3^{(5)} \\ w_4^{(5)} \end{bmatrix}. \qquad (10)$$
As can be seen, calculating the product (10) requires 25 multiplications and 20 additions.
We can devise a streamlined algorithm to compute Y 5 × 1 by employing the following matrix-vector calculation procedure:
$$\mathbf{Y}_{5\times 1} = \mathbf{T}^{(5)}_{5\times 14}\, \mathbf{D}^{(5)}_{14}\, \mathbf{T}^{(5)}_{14\times 9}\, \mathbf{X}_{9\times 1}, \qquad (11)$$
where
T 5 × 14 ( 5 ) = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ,
T 14 × 9 ( 5 ) = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ,
and
$$\mathbf{D}^{(5)}_{14} = \mathrm{diag}\left(s_0^{(5)}, s_1^{(5)}, \ldots, s_{13}^{(5)}\right),$$
$s_0^{(5)} = w_0^{(5)}$, $s_1^{(5)} = w_0^{(5)} + w_1^{(5)}$, $s_2^{(5)} = w_0^{(5)} + w_2^{(5)}$, $s_3^{(5)} = w_0^{(5)} + w_1^{(5)} + w_2^{(5)} + w_3^{(5)}$,
$s_4^{(5)} = w_1^{(5)}$, $s_5^{(5)} = w_0^{(5)} + w_4^{(5)}$, $s_6^{(5)} = w_1^{(5)} + w_3^{(5)}$, $s_7^{(5)} = w_2^{(5)} + w_3^{(5)}$,
$s_8^{(5)} = w_2^{(5)}$, $s_9^{(5)} = w_1^{(5)} + w_4^{(5)}$, $s_{10}^{(5)} = w_3^{(5)}$, $s_{11}^{(5)} = w_2^{(5)} + w_4^{(5)}$,
$s_{12}^{(5)} = w_3^{(5)} + w_4^{(5)}$, $s_{13}^{(5)} = w_4^{(5)}$.
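Note that every diagonal element of $\mathbf{D}^{(5)}_{14}$ depends only on the filter coefficients, so in a practical realisation the vector $(s_0^{(5)}, \ldots, s_{13}^{(5)})$ can be formed once, offline, and only the 14 multiplications remain data-dependent. A small Python sketch of that precomputation (ours; it simply transcribes the definitions above):

```python
import numpy as np

def precompute_d5(w):
    """Diagonal of D(5)_14, built once from the constant 5-tap impulse response w."""
    w0, w1, w2, w3, w4 = w
    return np.array([w0, w0 + w1, w0 + w2, w0 + w1 + w2 + w3,
                     w1, w0 + w4, w1 + w3, w2 + w3,
                     w2, w1 + w4, w3, w2 + w4,
                     w3 + w4, w4])
```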
Figure 3 illustrates a data flow graph of the proposed algorithm for implementing the basic filtering operation for a 5-tap FIR filter.

4.3. Algorithm 3, M = 7

Let $\mathbf{X}_{13\times 1} = [x_0, x_1, \ldots, x_{12}]^{\mathrm{T}}$ be a vector that represents the input data set, $\mathbf{W}_{7\times 1} = [w_0^{(7)}, w_1^{(7)}, \ldots, w_6^{(7)}]^{\mathrm{T}}$ be a vector that contains the coefficients of the impulse response of a 7-tap FIR filter, and $\mathbf{Y}_{7\times 1} = [y_0^{(7)}, y_1^{(7)}, \ldots, y_6^{(7)}]^{\mathrm{T}}$ be a vector describing the results of using a 7-tap FIR filter:
$$\mathbf{Y}_{7\times 1} = \begin{bmatrix}
x_0 & x_1 & x_2 & x_3 & x_4 & x_5 & x_6 \\
x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & x_7 \\
x_2 & x_3 & x_4 & x_5 & x_6 & x_7 & x_8 \\
x_3 & x_4 & x_5 & x_6 & x_7 & x_8 & x_9 \\
x_4 & x_5 & x_6 & x_7 & x_8 & x_9 & x_{10} \\
x_5 & x_6 & x_7 & x_8 & x_9 & x_{10} & x_{11} \\
x_6 & x_7 & x_8 & x_9 & x_{10} & x_{11} & x_{12}
\end{bmatrix}
\begin{bmatrix} w_0^{(7)} \\ w_1^{(7)} \\ w_2^{(7)} \\ w_3^{(7)} \\ w_4^{(7)} \\ w_5^{(7)} \\ w_6^{(7)} \end{bmatrix}. \qquad (12)$$
As can be seen, calculating the product (12) requires 49 multiplications and 42 additions.
We can formulate a streamlined algorithm for computing Y 7 × 1 by utilising the following matrix-vector calculation procedure:
$$\mathbf{Y}_{7\times 1} = \mathbf{T}^{(7)}_{7\times 15}\, \mathbf{T}^{(7)}_{15\times 25}\, \mathbf{D}^{(7)}_{25}\, \mathbf{T}^{(7)}_{25\times 18}\, \mathbf{T}^{(7)}_{18\times 13}\, \mathbf{X}_{13\times 1}, \qquad (13)$$
where
T 7 × 15 ( 7 ) = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ,
T 9 × 18 ( 7 ) = T 3 × 6 ( 3 ) 0 3 × 6 0 3 × 6 0 3 × 6 T 3 × 6 ( 3 ) 0 3 × 6 0 3 × 6 0 3 × 6 T 3 × 6 ( 3 ) , T 9 × 7 ( 7 ) = 1 0 1 × 6 0 8 × 1 0 8 × 6 ,
T 6 × 7 ( 7 ) = 1 1 1 1 1 1 1 1 1 1 1 1 , T 15 × 25 ( 7 ) = T 9 × 18 ( 7 ) T 9 × 7 ( 7 ) 0 6 × 18 T 6 × 7 ( 7 ) ,
T 6 × 8 ( 7 ) = T 6 × 5 ( 3 ) 0 6 × 3 , T ˜ 6 × 8 ( 7 ) = 0 6 × 3 T 6 × 5 ( 3 ) , T 12 × 8 ( 7 ) = T 6 × 8 ( 7 ) T ˜ 6 × 8 ( 7 ) ,
T 10 × 5 ( 7 ) = 1 1 1 1 1 , T 10 ( 7 ) = 0 10 × 4 T 10 × 5 ( 7 ) 0 10 × 1 ,
T 12 × 10 ( 7 ) = 0 2 × 10 T 10 ( 7 ) , T 6 × 10 ( 7 ) = T 6 × 5 ( 3 ) 0 6 × 5 , T 6 × 7 ( 7 ) = I 6 0 6 × 1 ,
T 1 × 7 ( 7 ) = 1 1 1 1 1 1 1 ,
T 7 ( 7 ) = T 6 × 7 ( 7 ) T 1 × 7 ( 7 ) , T 7 × 10 ( 7 ) = 0 7 × 3 T 7 ( 7 ) , T 13 × 10 ( 7 ) = T 6 × 10 ( 7 ) T 7 × 10 ( 7 ) ,
T 25 × 18 = T 12 × 8 ( 7 ) T 12 × 10 0 13 × 8 T 13 × 10 ,
T 8 × 11 ( 7 ) = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ,
T 8 × 13 ( 7 ) = T 8 × 11 ( 7 ) 0 8 × 2 , T 10 × 13 ( 7 ) = 0 10 × 3 I 10 , T 18 × 13 = T 8 × 13 ( 7 ) T 10 × 13 ( 7 ) ,
and
$$\mathbf{D}^{(7)}_{25} = \mathrm{diag}\left(s_0^{(7)}, s_1^{(7)}, \ldots, s_{24}^{(7)}\right),$$
$s_0^{(7)} = w_0^{(7)}$, $s_1^{(7)} = w_0^{(7)} + w_1^{(7)}$, $s_2^{(7)} = w_1^{(7)}$, $s_3^{(7)} = w_0^{(7)} + w_2^{(7)}$, $s_4^{(7)} = w_1^{(7)} + w_2^{(7)}$,
$s_5^{(7)} = w_2^{(7)}$, $s_6^{(7)} = w_3^{(7)}$, $s_7^{(7)} = w_3^{(7)} + w_4^{(7)}$, $s_8^{(7)} = w_4^{(7)}$, $s_9^{(7)} = w_3^{(7)} + w_5^{(7)}$,
$s_{10}^{(7)} = w_4^{(7)} + w_5^{(7)}$, $s_{11}^{(7)} = w_5^{(7)}$, $s_{12}^{(7)} = w_3^{(7)} - w_0^{(7)}$, $s_{13}^{(7)} = w_3^{(7)} + w_4^{(7)} - w_0^{(7)} - w_1^{(7)}$,
$s_{14}^{(7)} = w_4^{(7)} - w_1^{(7)}$, $s_{15}^{(7)} = w_3^{(7)} + w_5^{(7)} - w_0^{(7)} - w_2^{(7)}$, $s_{16}^{(7)} = w_4^{(7)} + w_5^{(7)} - w_1^{(7)} - w_2^{(7)}$,
$s_{17}^{(7)} = w_5^{(7)} - w_2^{(7)}$, $s_{18}^{(7)} = w_0^{(7)} + w_6^{(7)}$, $s_{19}^{(7)} = w_1^{(7)} + w_6^{(7)}$, $s_{20}^{(7)} = w_2^{(7)} + w_6^{(7)}$,
$s_{21}^{(7)} = w_3^{(7)} + w_6^{(7)}$, $s_{22}^{(7)} = w_4^{(7)} + w_6^{(7)}$, $s_{23}^{(7)} = w_5^{(7)} + w_6^{(7)}$, $s_{24}^{(7)} = w_6^{(7)}$.
Figure 4 illustrates a data flow graph of the proposed algorithm for implementing the basic filtering operation for a 7-tap FIR filter.
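As the definitions above show, the larger algorithms reuse the M = 3 matrices as diagonal blocks. In code, such block-diagonal factors are conveniently assembled with a Kronecker product; the sketch below (our illustration, using the M = 3 post-addition matrix as transcribed in Section 4.1) builds a block-diagonal factor of the kind appearing in the M = 7 and M = 9 factorisations.

```python
import numpy as np

# Post-addition matrix of the M = 3 algorithm (Section 4.1).
T3_post = np.array([[1, 1, 0, 1, 0, 0],
                    [0, 1, 1, 0, 1, 0],
                    [0, 0, 0, 1, 1, 1]])

# Three copies of T(3)_{3x6} on the block diagonal: a 9 x 18 factor of the kind
# reused inside the M = 7 and M = 9 algorithms.
T9x18 = np.kron(np.eye(3, dtype=int), T3_post)
assert T9x18.shape == (9, 18)
```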

4.4. Algorithm 4, M = 9

Let $\mathbf{X}_{17\times 1} = [x_0, x_1, \ldots, x_{16}]^{\mathrm{T}}$ be a vector that represents the input data set, $\mathbf{W}_{9\times 1} = [w_0^{(9)}, w_1^{(9)}, \ldots, w_8^{(9)}]^{\mathrm{T}}$ be a vector that contains the coefficients of the impulse response of a 9-tap FIR filter, and $\mathbf{Y}_{9\times 1} = [y_0^{(9)}, y_1^{(9)}, \ldots, y_8^{(9)}]^{\mathrm{T}}$ be a vector describing the results of using a 9-tap FIR filter:
$$\mathbf{Y}_{9\times 1} = \begin{bmatrix}
x_0 & x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & x_7 & x_8 \\
x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & x_7 & x_8 & x_9 \\
x_2 & x_3 & x_4 & x_5 & x_6 & x_7 & x_8 & x_9 & x_{10} \\
x_3 & x_4 & x_5 & x_6 & x_7 & x_8 & x_9 & x_{10} & x_{11} \\
x_4 & x_5 & x_6 & x_7 & x_8 & x_9 & x_{10} & x_{11} & x_{12} \\
x_5 & x_6 & x_7 & x_8 & x_9 & x_{10} & x_{11} & x_{12} & x_{13} \\
x_6 & x_7 & x_8 & x_9 & x_{10} & x_{11} & x_{12} & x_{13} & x_{14} \\
x_7 & x_8 & x_9 & x_{10} & x_{11} & x_{12} & x_{13} & x_{14} & x_{15} \\
x_8 & x_9 & x_{10} & x_{11} & x_{12} & x_{13} & x_{14} & x_{15} & x_{16}
\end{bmatrix}
\begin{bmatrix} w_0^{(9)} \\ w_1^{(9)} \\ w_2^{(9)} \\ w_3^{(9)} \\ w_4^{(9)} \\ w_5^{(9)} \\ w_6^{(9)} \\ w_7^{(9)} \\ w_8^{(9)} \end{bmatrix}. \qquad (14)$$
As can be seen, calculating the product (14) requires 81 multiplications and 72 additions.
We can formulate a streamlined algorithm for computing Y 9 × 1 by utilising the following matrix-vector calculation procedure:
$$\mathbf{Y}_{9\times 1} = \mathbf{T}^{(9)}_{9\times 18}\, \mathbf{T}^{(9)}_{18\times 36}\, \mathbf{D}^{(9)}_{36}\, \mathbf{T}^{(9)}_{36\times 30}\, \mathbf{T}^{(9)}_{30\times 17}\, \mathbf{X}_{17\times 1}, \qquad (15)$$
where
T 9 × 18 ( 9 ) = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ,
T ˜ 9 × 18 ( 9 ) = T 3 × 6 ( 3 ) 0 3 × 6 0 3 × 6 0 3 × 6 T 3 × 6 ( 3 ) 0 3 × 6 0 3 × 6 0 3 × 6 T 3 × 6 ( 3 ) , T 18 × 36 ( 9 ) = T ˜ 9 × 18 ( 9 ) 0 9 × 18 0 9 × 18 T ˜ 9 × 18 ( 9 ) ,
T 6 × 10 ( 9 ) = T 6 × 5 ( 3 ) 0 6 × 5 , T ˜ 6 × 10 ( 9 ) = 0 6 × 5 T 6 × 5 ( 3 ) , T 12 × 10 ( 9 ) = T 6 × 10 ( 9 ) T ˜ 6 × 10 ( 9 ) ,
T 36 × 30 ( 9 ) = T 12 × 10 ( 9 ) 0 12 × 10 0 12 × 10 0 12 × 10 T 12 × 10 ( 9 ) 0 12 × 10 0 12 × 10 0 12 × 10 T 12 × 10 ( 9 ) ,
T 5 × 11 ( 9 a ) = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ,
T 5 × 11 ( 9 b ) = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ,
T 5 × 11 ( 9 c ) = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ,
T 5 × 17 ( 9 a ) = T 5 × 11 ( 9 a ) 0 5 × 6 , T 5 × 17 ( 9 b ) = 0 5 × 3 I 5 0 5 × 9 ,
T 5 × 17 ( 9 c ) = 0 5 × 3 T 5 × 11 ( 9 b ) 0 5 × 3 , T 5 × 17 ( 9 d ) = 0 5 × 6 I 5 0 5 × 6 ,
T 5 × 17 ( 9 e ) = 0 5 × 9 I 5 0 5 × 3 , T 5 × 17 ( 9 f ) = 0 5 × 6 T 5 × 11 ( 9 c ) ,
T 30 × 17 ( 9 ) = T 5 × 17 ( 9 a ) T 5 × 17 ( 9 b ) T 5 × 17 ( 9 c ) T 5 × 17 ( 9 d ) T 5 × 17 ( 9 e ) T 5 × 17 ( 9 f ) ,
$$\mathbf{D}^{(9)}_{36} = \mathrm{diag}\left(s_0^{(9)}, s_1^{(9)}, \ldots, s_{35}^{(9)}\right),$$
$s_0^{(9)} = w_0^{(9)}$, $s_1^{(9)} = w_0^{(9)} + w_1^{(9)}$, $s_2^{(9)} = w_1^{(9)}$, $s_3^{(9)} = w_0^{(9)} + w_2^{(9)}$, $s_4^{(9)} = w_1^{(9)} + w_2^{(9)}$,
$s_5^{(9)} = w_2^{(9)}$, $s_6^{(9)} = w_0^{(9)} + w_3^{(9)}$, $s_7^{(9)} = w_0^{(9)} + w_1^{(9)} + w_3^{(9)} + w_4^{(9)}$, $s_8^{(9)} = w_1^{(9)} + w_4^{(9)}$,
$s_9^{(9)} = w_0^{(9)} + w_2^{(9)} + w_3^{(9)} + w_5^{(9)}$, $s_{10}^{(9)} = w_1^{(9)} + w_2^{(9)} + w_4^{(9)} + w_5^{(9)}$, $s_{11}^{(9)} = w_2^{(9)} + w_5^{(9)}$,
$s_{12}^{(9)} = w_3^{(9)}$, $s_{13}^{(9)} = w_3^{(9)} + w_4^{(9)}$, $s_{14}^{(9)} = w_4^{(9)}$, $s_{15}^{(9)} = w_3^{(9)} + w_5^{(9)}$, $s_{16}^{(9)} = w_4^{(9)} + w_5^{(9)}$,
$s_{17}^{(9)} = w_5^{(9)}$, $s_{18}^{(9)} = w_0^{(9)} + w_6^{(9)}$, $s_{19}^{(9)} = w_0^{(9)} + w_1^{(9)} + w_6^{(9)} + w_7^{(9)}$, $s_{20}^{(9)} = w_1^{(9)} + w_7^{(9)}$,
$s_{21}^{(9)} = w_0^{(9)} + w_2^{(9)} + w_6^{(9)} + w_8^{(9)}$, $s_{22}^{(9)} = w_1^{(9)} + w_2^{(9)} + w_7^{(9)} + w_8^{(9)}$, $s_{23}^{(9)} = w_2^{(9)} + w_8^{(9)}$,
$s_{24}^{(9)} = w_3^{(9)} + w_6^{(9)}$, $s_{25}^{(9)} = w_3^{(9)} + w_4^{(9)} + w_6^{(9)} + w_7^{(9)}$, $s_{26}^{(9)} = w_4^{(9)} + w_7^{(9)}$,
$s_{27}^{(9)} = w_3^{(9)} + w_5^{(9)} + w_6^{(9)} + w_8^{(9)}$, $s_{28}^{(9)} = w_4^{(9)} + w_5^{(9)} + w_7^{(9)} + w_8^{(9)}$, $s_{29}^{(9)} = w_5^{(9)} + w_8^{(9)}$,
$s_{30}^{(9)} = w_6^{(9)}$, $s_{31}^{(9)} = w_6^{(9)} + w_7^{(9)}$, $s_{32}^{(9)} = w_7^{(9)}$, $s_{33}^{(9)} = w_6^{(9)} + w_8^{(9)}$, $s_{34}^{(9)} = w_7^{(9)} + w_8^{(9)}$,
$s_{35}^{(9)} = w_8^{(9)}$.
Figure 5 illustrates a data flow graph of the proposed algorithm for implementing the basic filtering operation for a 9-tap FIR filter.

5. Implementation Complexity

Due to the relatively small lengths of the input sequences and the straightforward nature of the data flow diagrams depicting the computation process, it is easy to assess the implementation complexity of the proposed solutions. Table 1 estimates the number of arithmetic blocks required for the fully parallel implementation of the filtering algorithms designed for short lengths. The values presented in the table can be regarded as an approximate measure of the implementation cost on an ASIC.
As we can see, using the proposed algorithmic solutions to construct digital filtering cores results in fewer multipliers being needed than with naive approaches to their design. In the context of designing specialised fully parallel VLSI processors, minimising the number of multipliers is of paramount importance. This approach significantly reduces the cost of implementing the entire system and mitigates power dissipation, because a hardware multiplier is considerably more complex and occupies a larger chip area than an adder. It has been demonstrated that the hardware cost of a binary adder rises linearly with the operand size, whereas the implementation cost of a hardwired multiplier escalates quadratically with the operand size [46]. Hence, reducing the number of multipliers, even if it results in a slight increase in the number of adders, significantly influences the hardware implementation of digital filtering cores.
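To make the quoted trade-off concrete, the toy cost model below (our illustration; the unit costs are arbitrary, only the linear-versus-quadratic growth reported in [46] matters, and adders are ignored) compares the multiplier budgets of the naive and proposed variants from Table 1.

```python
# Multiplier counts from Table 1; a b-bit hardwired multiplier is modelled as
# roughly b^2 unit cells, following the quadratic growth reported in [46].
naive_mults    = {3: 9,  5: 25, 7: 49, 9: 81}
proposed_mults = {3: 6,  5: 14, 7: 25, 9: 36}
BITS = 8

for M in (3, 5, 7, 9):
    saved = naive_mults[M] - proposed_mults[M]
    print(f"M = {M}: {naive_mults[M]} -> {proposed_mults[M]} multipliers "
          f"({saved / naive_mults[M]:.0%} fewer, ~{saved * BITS**2} unit cells saved)")
```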
As an example, the proposed algorithms have been implemented in FPGAs on the simplest possible devices of Xilinx's Spartan 3 series. The criterion for selecting a model from the Spartan 3 family was to provide a sufficient number of inputs and outputs. The 8-bit inputs $\mathbf{X}_{(2M-1)\times 1}$, 16-bit outputs $\mathbf{Y}_{M\times 1}$, and fixed 8-bit coefficients of the impulse response of the FIR filter $\mathbf{W}_{M\times 1}$ were assumed. Table 2 shows the number of slices used in the Spartan 3 FPGA implementation. The number of MULT 18 × 18 multipliers used is also shown in this table; in most cases, both algorithms used all the hardware-accessible multipliers. Only for the size M = 5 did the device provide more multipliers than the proposed algorithm required, and only in this case did the algorithm not use all of them. Table 3 shows the number of four-input LUTs used in the Spartan 3 FPGA implementation. For each size M, the proposed algorithms reduced the number of logic blocks used. The smallest reduction was for the size M = 3, where it was only about 1% for the slices and 2.4% for the four-input LUTs. The biggest reduction was for the size M = 5, which achieves nearly a 40% decrease in logic blocks.

6. Conclusions

This study explores methods to reduce the multiplicative complexity of conducting basic filtering operations for M-tap FIR filters with short impulse responses, commonly used in convolutional neural networks. New algorithms for resource-efficient implementations of these operations have been devised for M values of 3, 5, 7, and 9. By utilising these algorithms, the computational complexity of the basic filtering operations is reduced, which also lessens the difficulty of their hardware implementation. Reducing the number of multiplications in the algorithms comes at the expense of some increase in the number of additions. However, this is not significant due to the much higher implementation cost of the hardware multiplier relative to the adder. A limitation of the proposed algorithms is the increased complexity of data manipulation. For this reason, it seems particularly advantageous to implement the proposed solutions in ASICs. The distinctive feature of all the proposed algorithms is their evident parallel and modular structures. The modularity allows unifying the implementation of the algorithms in FPGAs and makes it easier to map them onto ASIC structures. In turn, the parallelisation of computing processes enables accelerated computations during the execution of these algorithms. The implementation of the proposed algorithms in DNNs will be a target for further research.

Author Contributions

Conceptualization, A.C.; methodology, A.C., J.P.P. and M.M.; formal analysis, A.C., J.P.P. and M.M.; writing—original draft preparation, A.C.; writing—review and editing, A.C. and J.P.P.; visualization, A.C., M.M. and J.P.P.; supervision, A.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
VLSI     very large-scale integration
DNN      deep neural network
FFT      fast Fourier transform
FIR      finite impulse response
FPGA     field-programmable gate array
GPU      graphics processing unit
ASIC     application-specific integrated circuit
Tiny ML  tiny machine learning
Edge AI  edge artificial intelligence
MULT     multiplier
LUT      look-up table

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef] [Green Version]
  2. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  3. Adhikari, S.P.; Kim, H.; Yang, C.; Chua, L.O. Building cellular neural network templates with a hardware friendly learning algorithm. Neurocomputing 2018, 312, 276–284. [Google Scholar] [CrossRef]
  4. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 1–74. [Google Scholar]
  5. Habib, G.; Qureshi, S. Optimization and acceleration of convolutional neural networks: A survey. J. King Saud-Univ.-Comput. Inf. Sci. 2022, 34, 4244–4268. [Google Scholar] [CrossRef]
  6. Lin, S.; Liu, N.; Nazemi, M.; Li, H.; Ding, C.; Wang, Y.; Pedram, M. FFT-based deep learning deployment in embedded systems. In Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, Germany, 19–23 March 2018; pp. 1045–1050. [Google Scholar] [CrossRef] [Green Version]
  7. Mathieu, M.; Henaff, M.; LeCun, Y. Fast Training of Convolutional Networks through FFTs. arXiv 2014, arXiv:1312.5851. [Google Scholar]
  8. Abtahi, T.; Kulkarni, A.; Mohsenin, T. Accelerating convolutional neural network with FFT on tiny cores. In Proceedings of the 2017 IEEE International Symposium on Circuits and Systems (ISCAS), Baltimore, MD, USA, 28–31 May 2017; pp. 1–4. [Google Scholar] [CrossRef]
  9. Abtahi, T.; Shea, C.; Kulkarni, A.; Mohsenin, T. Accelerating Convolutional Neural Network with FFT on Embedded Hardware. IEEE Trans. Very Large Scale Integr. (Vlsi) Syst. 2018, 26, 1737–1749. [Google Scholar] [CrossRef]
  10. Lin, J.; Yao, Y. A Fast Algorithm for Convolutional Neural Networks Using Tile-based Fast Fourier Transforms. Neural Process. Lett. 2019, 50, 1951–1967. [Google Scholar] [CrossRef]
  11. Wu, Y. Review on FPGA-Based Accelerators in Deep Learning. In Proceedings of the 2023 IEEE 6th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 23–26 February 2023; Volume 6, pp. 452–456. [Google Scholar] [CrossRef]
  12. Lavin, A.; Gray, S. Fast Algorithms for Convolutional Neural Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4013–4021. [Google Scholar] [CrossRef] [Green Version]
  13. Zhao, Y.; Wang, D.; Wang, L. Convolution accelerator designs using fast algorithms. Algorithms 2019, 12, 112. [Google Scholar] [CrossRef] [Green Version]
  14. Yang, D.S.; Xu, C.H.; Ruan, S.J.; Huang, C.M. Unified energy-efficient reconfigurable MAC for dynamic Convolutional Neural Network based on Winograd algorithm. Microprocess. Microsyst. 2022, 93, 104624. [Google Scholar] [CrossRef]
  15. Dolz, M.F.; Barrachina, S.; Martínez, H.; Castelló, A.; Maciá, A.; Fabregat, G.; Tomás, A.E. Performance–energy trade-offs of deep learning convolution algorithms on ARM processors. J. Supercomput. 2023, 79, 1–18. [Google Scholar] [CrossRef]
  16. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  17. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  18. Wang, X.; Wang, C.; Zhou, X. Work-in-Progress: WinoNN: Optimising FPGA-based Neural Network Accelerators using Fast Winograd Algorithm. In Proceedings of the 2018 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Turin, Italy, 30 September–3 October 2018; pp. 1–2. [Google Scholar] [CrossRef]
  19. Farabet, C.; Poulet, C.; Han, J.Y.; LeCun, Y. CNP: An FPGA-based processor for convolutional networks. In Proceedings of the FPL 2009, IEEE, Prague, Czech Republic, 31 August–2 September 2009; pp. 32–37. [Google Scholar]
  20. Lu, L.; Liang, Y. SpWA: An Efficient Sparse Winograd Convolutional Neural Networks Accelerator on FPGAs. In Proceedings of the 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 24–28 June 2018; pp. 1–6. [Google Scholar] [CrossRef]
  21. Yu, J.; Hu, Y.; Ning, X.; Qiu, J.; Guo, K.; Wang, Y.; Yang, H. Instruction driven cross-layer CNN accelerator with winograd transformation on FPGA. In Proceedings of the 2017 International Conference on Field Programmable Technology (ICFPT), Melbourne, VI, Australia, 11–13 December 2017; pp. 227–230. [Google Scholar] [CrossRef]
  22. Liang, Y.; Lu, L.; Xiao, Q.; Yan, S. Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs. IEEE Trans. -Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 857–870. [Google Scholar] [CrossRef]
  23. Shawahna, A.; Sait, S.M.; El-Maleh, A. FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review. IEEE Access 2019, 7, 7823–7859. [Google Scholar] [CrossRef]
  24. Guo, K.; Zeng, S.; Yu, J.; Wang, Y.; Yang, H. A Survey of FPGA-Based Neural Network Accelerator. arXiv 2018, arXiv:1712.08934. [Google Scholar]
  25. Hoffmann, J.; Navarro, O.; Kästner, F.; Janßen, B.; Hübner, M. A Survey on CNN and RNN Implementations. In Proceedings of the PESARO 2017: The Seventh International Conference on Performance, Safety and Robustness in Complex Systems and Applications, Pesaro, Italy, 23–27 April 2017; pp. 33–39. [Google Scholar]
  26. Liu, Z.; Chow, P.; Xu, J.; Jiang, J.; Dou, Y.; Zhou, J. A Uniform Architecture Design for Accelerating 2D and 3D CNNs on FPGAs. Electronics 2019, 8, 65. [Google Scholar] [CrossRef] [Green Version]
  27. Zhao, R.; Song, W.; Zhang, W.; Xing, T.; Lin, J.H.; Srivastava, M.; Gupta, R.; Zhang, Z. Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 15–24. [Google Scholar] [CrossRef] [Green Version]
  28. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170. [Google Scholar] [CrossRef]
  29. Li, Y.; Liu, Z.; Xu, K.; Yu, H.; Ren, F. A GPU-Outperforming FPGA Accelerator Architecture for Binary Convolutional Neural Networks. arXiv 2017, arXiv:1702.06392. [Google Scholar] [CrossRef] [Green Version]
  30. Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S.; et al. Going deeper with embedded fpga platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2016; pp. 26–35. [Google Scholar]
  31. Li, H.; Fan, X.; Jiao, L.; Cao, W.; Zhou, X.; Wang, L. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 26th International Conference on Field Programmable Logic and Applications (FPL), IEEE, Lausanne, Switzerland, 29 August–2 September 2016; pp. 1–9. [Google Scholar]
  32. Hardieck, M.; Kumm, M.; Möller, K.; Zipf, P. Reconfigurable Convolutional Kernels for Neural Networks on FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 22–24 February 2019; pp. 43–52. [Google Scholar] [CrossRef]
  33. Ghimire, D.; Kil, D.; Kim, S.H. A survey on efficient convolutional neural networks and hardware acceleration. Electronics 2022, 11, 945. [Google Scholar] [CrossRef]
  34. Strigl, D.; Kofler, K.; Podlipnig, S. Performance and Scalability of GPU-Based Convolutional Neural Networks. In Proceedings of the 18th Euromicro Conference on Parallel, Distributed and Network-Based Processing, Pisa, Italy, 17–19 February 2010. [Google Scholar]
  35. Li, X.; Zhang, G.; Huang, H.H.; Wang, Z.; Zheng, W. Performance Analysis of GPU-Based Convolutional Neural Networks. In Proceedings of the 2016 45th International Conference on Parallel Processing (ICPP), Philadelphia, PA, USA, 16–19 August 2016; pp. 67–76. [Google Scholar] [CrossRef]
  36. Cengil, E.; Cinar, A.; Guler, Z. A GPU-based convolutional neural network approach for image classification. In Proceedings of the 2017 International Artificial Intelligence and Data Processing Symposium (IDAP), Malatya, Turkey, 16–17 September 2017; pp. 1–6. [Google Scholar] [CrossRef]
  37. Chen, Y.H.; Krishna, T.; Emer, J.; Sze, V. 14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In Proceedings of the 2016 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 31 January 2016; pp. 262–263. [Google Scholar] [CrossRef] [Green Version]
  38. Ovtcharov, K.; Ruwase, O.; Kim, J.Y.; Fowers, J.; Strauss, K.; Chung, E.S. Accelerating deep convolutional neural networks using specialized hardware. Microsoft Res. 2015, 2, 1–14. [Google Scholar]
  39. Tu, F.; Yin, S.; Ouyang, P.; Tang, S.; Liu, L.; Wei, S. Deep convolutional neural network architecture with reconfigurable computation patterns. IEEE Trans. Very Large Scale Integr. (Vlsi) Syst. 2017, 25, 2220–2233. [Google Scholar] [CrossRef]
  40. Zhao, Y.; Wang, D.; Wang, L.; Liu, P. A Faster Algorithm for Reducing the Computational Complexity of Convolutional Neural Networks. Algorithms 2018, 11, 159. [Google Scholar] [CrossRef] [Green Version]
  41. Kala, S.; Jose, B.R.; Mathew, J.; Nalesh, S. High-performance CNN accelerator on FPGA using unified winograd-GEMM architecture. IEEE Trans. Very Large Scale Integr. (Vlsi) Syst. 2019, 27, 2816–2828. [Google Scholar] [CrossRef]
  42. An, Y.; Li, B.; Bu, J.; Gao, Y. Optimizing Winograd convolution on GPUs via multithreaded communication. In Proceedings of the Second International Conference on Algorithms, Microchips, and Network Applications (AMNA 2023), SPIE, Zhengzhou, China, 13–15 January 2023; Volume 12635, pp. 204–212. [Google Scholar]
  43. Cariow, A.; Cariowa, G. Minimal filtering algorithms for convolutional neural networks. In Reliability Engineering and Computational Intelligence; Springer: Cham, Switzerland, 2021; pp. 73–88. [Google Scholar]
  44. Cariow, A.; Gliszczyński, M. Fast algorithms to compute matrix-vector products for Toeplitz and Hankel matrices. Electr. Rev. 2012, 88, 166–171. [Google Scholar]
  45. Beliakov, G. On fast matrix-vector multiplication with a Hankel matrix in multiprecision arithmetics. arXiv 2014, arXiv:1402.5287. [Google Scholar]
  46. Oudjida, A.K.; Chaillet, N.; Berrandjia, M.L.; Liacha, A. A New High Radix-2r (r ≥ 8) Multibit Recoding Algorithm for Large Operand Size (N ≥ 32) Multipliers. J. Low Power Electron. 2013, 9, 50–62. [Google Scholar] [CrossRef]
Figure 1. Illustration of the sequence of steps in calculating the moving dot product.
Figure 2. Data flow graph of the algorithm for implementing the basic filtering operation for the case M = 3.
Figure 3. Data flow graph of the algorithm for implementing the basic filtering operation for the case M = 5.
Figure 4. Data flow graph of the algorithm for implementing the basic filtering operation for the case M = 7.
Figure 5. Data flow graph of the algorithm for implementing the basic filtering operation for the case M = 9.
Table 1. The complexities of implementing the naive and proposed solutions.

Size M | Naive: multipliers | Naive: M-input adders | Proposed: multipliers | Proposed: 2-input adders | Proposed: 3-input adders | Proposed: 4-input adders | Proposed: M-input adders
3      | 9                  | 3                     | 6                     | –                        | 6                        | –                        | –
5      | 25                 | 5                     | 14                    | 4                        | –                        | –                        | 10
7      | 49                 | 7                     | 25                    | 7                        | 2                        | 4                        | 11
9      | 81                 | 9                     | 36                    | –                        | 60                       | –                        | –
Table 2. The number of the multipliers MULT 18 × 18 and the slices used in the Spartan 3 FPGA implementations.

Size M | Device          | MULT 18 × 18: School | MULT 18 × 18: Proposed | Slices: School | Slices: Proposed | Reduction
3      | xc3s50-4pq208   | 4                    | 4                      | 111            | 110              | 0.9%
5      | xc3s400-4fg456  | 16                   | 14                     | 563            | 342              | 39.3%
7      | xc3s400-4fg456  | 16                   | 16                     | 776            | 684              | 11.9%
9      | xc3s1000-4fg676 | 24                   | 24                     | 1303           | 1066             | 18.2%
Table 3. The number of the 4-input LUTs used in the Spartan 3 FPGA implementations.

Size M | Device          | 4-input LUTs: School | 4-input LUTs: Proposed | Reduction
3      | xc3s50-4pq208   | 207                  | 202                    | 2.4%
5      | xc3s400-4fg456  | 1040                 | 636                    | 38.8%
7      | xc3s400-4fg456  | 1441                 | 1284                   | 10.9%
9      | xc3s1000-4fg676 | 2456                 | 1994                   | 18.8%


