A Survey on FPGA Hardware Implementation for Image Processing
March 25, 2015
Abstract
Although FPGAs are not new, their logic density has grown considerably over the years, and they have become a useful parallel platform for image processing. FPGAs are also built from configurable logic circuits, which gives them reconfiguration flexibility. Novice FPGA users often assume that the advantage of FPGAs lies in a very high operating frequency; however, FPGAs are limited to several hundred MHz, whereas CPUs run at frequencies on the order of GHz. The reason an FPGA can outperform other processors is that it is a truly parallel processor: for example, a CPU may take 5 operation cycles to finish an addition, while an FPGA takes only one operation cycle to finish the same operation. This paper is divided into X sections.
Image Processing algorithms implemented in FPGA
1 Introduction
Image processing algorithms are used to solve many problems these days. From medical to military applications, image processing has become an interesting area of study for developing faster and better algorithms. Many of these applications require real-time processing; as image sizes get larger, software solutions respond more slowly, and that is where hardware implementation comes in. The CPU is the most popular hardware for image processing, but real-time applications are harder to realize on it because of image size, data width or user interruptions. To enhance the performance of a hardware processor, two issues need to be considered: parallel operation and increased operating frequency. DSPs and GPUs are examples of circuits designed specifically to enhance parallel processing, but they have been developed to provide a predefined set of operations and cannot be tailored to a specifically designed image processing application [1].
1.1 Introduction of FPGAs
1.2 Advantages and limitations of implementing Image processing algorithms in hardware
In order to implement a design in hardware, several methods can be chosen. FPGAs, for example, have hundreds of thousands of logic gates embedded in a single chip. Besides, a user can program an FPGA design in considerably less time than is required to produce other highly integrated systems (such as ASICs). Furthermore, FPGAs can be fully tested after design and manufacture. Image processing speed has become a bottleneck to further improvement of real-time processing systems. FPGAs are programmed in low-level hardware description languages, such as VHDL or Verilog. Most software developers are not familiar with circuit design, nor with hardware languages whose functionality relies on synchronization and timing [REFERENCE TO Accelerated Image Processing on FPGAs, 2003]. Moreover, simulations are still slow, even when only a fraction of the operating time is emulated.
1.3 Image processing algorithms
2 Methods
2.1 How Fast is an FPGA in Image Processing? (2008)
The FPGA's high performance comes from:
• High parallelism in image processing applications
• A high ratio of 8-bit operations
• A large number of internal memory banks in FPGAs, which can be accessed in parallel
The objective of this work was to reach the best performance by reducing the number of operations and memory accesses. Three applications were implemented in this paper: two-dimensional filters, stereo vision and k-means clustering. The performance gain obtained with the FPGAs was modest (5 to 15 times). The efficiency of the FPGAs is limited by their size and by the memory bandwidth (the image data is too large to be stored in an FPGA).
2.2 Implementing Image Processing Algorithms in FPGA Hardware (2013)
In this work, the author describes different image processing algorithms, including filtering and enhancement, implemented on a Xilinx Spartan 6 FPGA on a Nexys3 board. Several image processing algorithms need to perform dozens of operations on every pixel, which results in a heavy load for a single serial processor. The algorithms used in this work can be grouped into a category called windowing operators. These FPGA-based techniques take a group of neighbouring pixels, called a window, and, depending on the algorithm, calculate a new value for the pixel in the center. Using 3x3 and 5x5 windows, the author focuses on developing hardware implementations of popular image processing algorithms such as:
• Median Filter
• Smoothing Filter
• Sobel Edge Detection
• Motion Blur
• Emboss filter
The images used in this article were 585x450 pixels, but the author claims that images of any size can be processed with the proper hardware, and that many other algorithms can easily be added using the window generator described.
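The windowing operators described in this section can be illustrated in software. The following is a minimal Python sketch, not the author's hardware design: it slides a 3x3 window over an 8-bit grayscale image stored as a list of rows and computes a new value for each center pixel. The function names, the choice to copy border pixels unchanged, and the |Gx| + |Gy| magnitude approximation for Sobel are assumptions made for illustration.

```python
# Sobel kernels in row-major order, matching the window layout below.
SOBEL_X = [-1, 0, 1, -2, 0, 2, -1, 0, 1]
SOBEL_Y = [-1, -2, -1, 0, 0, 0, 1, 2, 1]


def windows3x3(image):
    """Yield (row, col, window) for every interior pixel.

    The window is a list of the 9 neighbouring values in row-major order.
    """
    height, width = len(image), len(image[0])
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            window = [image[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            yield r, c, window


def apply_window_operator(image, operator):
    """Apply `operator` to every 3x3 window of `image`.

    Border pixels are copied unchanged (an assumption of this sketch).
    """
    out = [row[:] for row in image]
    for r, c, window in windows3x3(image):
        out[r][c] = operator(window)
    return out


def median_operator(window):
    """Median of the 9 window values: the 5th value after sorting."""
    return sorted(window)[4]


def sobel_operator(window):
    """Gradient magnitude approximated as |Gx| + |Gy|, clamped to 8 bits."""
    gx = sum(w * p for w, p in zip(SOBEL_X, window))
    gy = sum(w * p for w, p in zip(SOBEL_Y, window))
    return min(255, abs(gx) + abs(gy))


if __name__ == "__main__":
    noisy = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]
    print(apply_window_operator(noisy, median_operator))
```

In hardware the nine window pixels would typically be presented together by a window generator, so the loop here only models the data flow, not the timing.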
2.3 A Programmable Image Processing System Using FPGA
In this work, a flexible programmable image processing system is proposed. The system integrates a DSP and an FPGA to deal with the bit-level operations and arithmetic operations found in image processing algorithms. The authors describe a systolic system (a pipelined array architecture, synchronised by a clock signal, that calculates operations). These characteristics can be achieved with an FPGA such as a Xilinx FPGA (in this case, 2 Xilinx 2090-100 devices were used). The system needs an IBM PC AT computer working as a host that gathers the data in a memory unit (FIFO). A 1-D median filter with a window size of 5 was implemented for the removal of impulsive noise from signals. In the results shown in this article, the input is an image corrupted by salt-and-pepper noise, and the output is an image whose peak signal-to-noise ratio is improved by 10 dB.
2.4 Ceramic Tiles Failure Detection Based on FPGA Image Processing (2009)
This article takes an industrial approach to image processing algorithms, in which computer visual diagnosis is used to classify tiles according to surface and edge defects, implemented in an FPGA-based embedded hardware digital design. The whole system consists of acquiring an image from a camera aligned with the failure detection line and marking the faulty tiles for a final inspection. Normally, the visual inspection is performed by humans, but a system that fully automates the manufacturing process avoids human errors. The visual inspection of ceramic tiles is based on the presence of chromatic differences in the tonality of the surface, faulty edges, or cracks.
Line capturing starts when the tile sensor activates the scan camera. Pixels are sent as 3.3 V signals, working with CMOS technology. Then, the scanned image data of the ceramic tile is transferred to the FPGA's SRAM memory as 1024x8 bits per scanned line (gray pixels are stored as 8-bit data). The data bus is also 8 bits wide; it delivers the 8-bit pixel data to the SRAM controller, from which the data is transferred to an XGA block used for image display.
The ceramic tile surface defects can be determined by detecting anomalies in the output pixel intensity levels. The thresholds for these levels are defined in advance for light and for dark intensity. A simple edge defect detection algorithm is also considered for images of white tile surfaces (comparing the white color of the tile with the dark background).
A Xilinx Spartan 3 FPGA development board was used for the digital design implementation. The main part of this digital design is the definition of a finite state machine (FSM). The author compares the time needed to verify a ceramic tile with the same algorithm written in C++; that experiment was run on a PC with a T7300 processor under Windows XP. The FPGA's expected performance was calculated for a frequency of 75 MHz. The resulting FPGA tile defect detection is 6 times faster than the standard PC-based algorithm implemented in C++.
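The threshold-based checks summarised for the ceramic tile system can be sketched in software as follows. This is an illustrative Python model under stated assumptions, not the article's FSM: the threshold values, the background level, the tolerance and all function names are hypothetical; only the 1024-pixel, 8-bit scan line comes from the description above.

```python
LINE_WIDTH = 1024        # pixels per scanned line, 8 bits each
BACKGROUND_LEVEL = 40    # hypothetical: at or below this is dark background
DARK_THRESHOLD = 90      # hypothetical lower limit for a sound tile pixel
LIGHT_THRESHOLD = 230    # hypothetical upper limit for a sound tile pixel


def tile_edges(line, background_level=BACKGROUND_LEVEL):
    """Return (first, last) index of pixels brighter than the background.

    Returns None when the line contains no tile pixels at all.
    """
    bright = [i for i, pixel in enumerate(line) if pixel > background_level]
    if not bright:
        return None
    return bright[0], bright[-1]


def surface_defects(line, dark=DARK_THRESHOLD, light=LIGHT_THRESHOLD):
    """Return indices inside the tile whose intensity is out of range."""
    edges = tile_edges(line)
    if edges is None:
        return []
    first, last = edges
    return [i for i in range(first, last + 1)
            if line[i] < dark or line[i] > light]


def has_edge_defect(line, expected_first, expected_last, tolerance=2):
    """Flag a line whose tile edges deviate from the expected positions."""
    edges = tile_edges(line)
    if edges is None:
        return True
    first, last = edges
    return (abs(first - expected_first) > tolerance
            or abs(last - expected_last) > tolerance)


if __name__ == "__main__":
    line = [10] * 100 + [200] * 824 + [10] * 100
    line[500] = 20  # simulated dark crack
    print(tile_edges(line), surface_defects(line))
```

Each check is a comparison per pixel against a constant, which is the kind of operation that maps naturally onto a small state machine processing one pixel per clock.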
2.5 The Platform of Image Acquisition and Processing System Based on DSP and FPGA (2008)
In this paper, a hybrid system using an Altera FPGA and a Texas Instruments digital signal processor is presented. The scope of the applications proposed by the author goes from image enhancement to image segmentation. A large FIFO is designed in the FPGA, using the RAM of the Altera board. An image is captured by a high-performance charge-coupled device (CCD) sensor; the analog data is then converted into digital data after pre-processing, transferred into the FIFO in the FPGA, and then passed to the DSP. The DSP is used to perform the image processing algorithms in parallel. In this work, Altera Quartus II was used to design, simulate and synthesize the VHDL models.
2.6 Design and Implementation of a Pipelined Datapath for High-Speed Face Detection Using FPGA (2012)
In this work, a face detection algorithm using cascades of boosted classifiers is described, implemented as a pipelined datapath in an FPGA. A 16-level image pyramid is generated from the input image to identify faces of varied sizes simultaneously. The image is downscaled and then transferred to the first stage of the cascade classifiers. With this method, the FPGA's resource utilisation is reduced to one-eighth of that of the fully parallel algorithm, without accuracy loss. The hardware used for this implementation was the Xilinx Virtex-5 LX330 FPGA. The performance of this method is 307 frames per second, regardless of the number of faces in the image, for standard VGA (640x480), with an operating frequency of 125.59 MHz. A face detection result is guaranteed to be generated every clock cycle once the first pipeline pass is completed. The author compares his work with some others (ADD THE REFERENCES), and claims that his algorithm is faster.
2.7 FPGA Implementation of the LRU Algorithm for Video Compression (1994)
Image and video compression are typical applications in HDTV, teleconferencing, multimedia communications, etc. The purpose of video and image compression is to decrease the number of bits used to represent an image while keeping the quality acceptable. In this work, the author presents an FPGA implementation of the Least Recently Used (LRU) algorithm in Cache-based Vector Quantization (CVQ) for constant-quality and fixed-bit-rate video transmission applications. The operating frequency of the chip was 16 MHz, and it is stated that this frequency is enough for real-time execution of the CVQ algorithm.
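The LRU replacement policy named in this section can be sketched in software. The following Python sketch is a generic LRU cache, not the cited chip's design; how the cache is coupled to the vector quantizer is not described here, so the idea of caching codebook indices, the class name LRUCache and the helper hit_ratio are assumptions made for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries = OrderedDict()  # ordered from least to most recent

    def access(self, key):
        """Touch `key`; return True on a cache hit, False on a miss."""
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            return True
        if len(self._entries) >= self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        self._entries[key] = None
        return False

    def contents(self):
        """Keys ordered from least to most recently used."""
        return list(self._entries)


def hit_ratio(indices, capacity):
    """Fraction of accesses in `indices` that hit an LRU cache."""
    cache = LRUCache(capacity)
    hits = sum(cache.access(index) for index in indices)
    return hits / len(indices) if indices else 0.0


if __name__ == "__main__":
    # Hypothetical stream of codebook indices for consecutive image blocks.
    print(hit_ratio([1, 2, 1, 3, 2], capacity=2))
```

A higher hit ratio would mean that more blocks can be referenced by a short cache position instead of a full codebook index, which is the usual motivation for placing a cache in front of a quantizer.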
2.8 A Board System for High-Speed Analysis and Neural Networks (1996)
In this paper, the author implements neural networks of diverse sizes and architectures in an FPGA controller, for applications that involve text location, character recognition, and noise removal from an image that contains text. The system requires an external controller to generate the addresses for the code memory and to perform the calculations for transferring data to and from the state memory. This interface controller consists of four Xilinx 4005PG156 field programmable gate arrays. In the results, the optical character recognition algorithm reaches a speed of approximately 1000 characters per second; this is 10 to 100 times faster than an implementation on a microprocessor (SPARC Station 10).
2.9 A Real-Time Matching System for Large Fingerprint Databases (1996)
Databases of fingerprints are characterised by their large size and poor-quality query images. This work presents a method for indexing large fingerprint databases, implemented on Splash 2, a field programmable gate array processor, to nearly match the speed of an ASIC. Index-based object recognition has become popular within the computer vision community; in it, specific characteristics of an image are compared with features in the model of database objects. Using Splash 2, "the pattern matching under the conditions described earlier can be executed at the rate of 110,000 matches per second".
2.10 FPGA Implementations of Fast Fourier Transforms for Real-Time and Signal Processing (2005)
Programming FPGAs requires skilled users with detailed knowledge of the architecture of the device used, and is done at a very low level. In this paper, 1-D and 2-D FFTs are implemented using Handel-C (a parametrisable structural language similar to VHDL) code on a Celoxica RC1000 PCI-based FPGA development board. According to the mathematical model, the algorithm has been implemented for the parallel 2-D FFT. The performance of these implementations was compared with existing solutions, and "high speedup and efficiency have been attained for the parallel implementation".
2.11 Design and implementation of a high level programming environment for FPGA-based image processing (2000)
Another implementation of image processing algorithms using a high-level programming environment and an FPGA is described in this paper. On one side, the programming model of the system is a PC programmed in C++. On the other, the FPGA acts as a coprocessor for the algebra of the image processing algorithms, carrying out some basic operations (convolution, neighbouring, etc.). The basic instructions of the coprocessor can be described by a static window with preset weights. Some of these instructions include multiplication, accumulation, maximum and minimum, and several neighbouring operations can be done. The parameters needed to generate a new image with this system include the dimensions of the image (256x256), a 3x3 window size, a 16-bit pixel size and the weights of the neighbourhood window.
2.12 Applying an XC6200 to Real-Time Image Processing (1998)
Some FPGAs have an embedded microprocessor and can be partially reconfigured during operation. Although a two-dimensional discrete cosine transform (2D DCT) is implemented in this work, the system is able to perform real-time image processing applications. The design was implemented in a 78x64 block within the XC6200
FPGA series from Xilinx, using 30% of the total chip area (128x128 cells), with a performance of 2 billion operations per second.
2.13 Combined Line-Based Architecture for the 5-3 and 9-7 Wavelet Transform of JPEG2000 (2003)
Another work that deals with image compression is [REFERENCE]. The author describes a hardware implementation of a discrete wavelet transform for image compression using the JPEG2000 standard. The goal is to implement a fast wavelet transform by processing two lines at a time. This architecture allows fast calculation with minimum memory requirements. Using a Virtex E1000-8 at 110 MHz, 2 pixels per clock cycle can be decoded. The authors claim that the main advantages of their system are:
• Minimum memory requirement
• Pipelined datapath
• Minimum area: using one third of the classical design
• Genericity: the coefficients used for this transform can be replaced by others to implement new filters
2.14 Colour histogram content-based image retrieval and hardware implementation (2003)
A pipelined hardware structure was developed to improve the operation of composite colour image histogram processing using 4 units: histogram generator, normalisation, FIR filter and intersection units. Filters and shifters work efficiently in hardware, which helps to achieve real-time applications. In the author's FPGA implementation, 50 comparisons per second were made, working with a device clocked at 35 MHz. FPGA implementation makes this application convenient for industry. The processing unit that consumes the most time is the histogram generator, since the image must be fully read. This issue is solved by using external RAM.
2.15 Accelerated Image Processing on FPGAs (2003)
In this work a high-level language is used for the design of hardware. SA-C is a variant of the C programming language designed to achieve parallelism. There are some differences between standard C and SA-C:
• It finds a fixed-point representation for floating-point operations, taking advantage of the FPGA to form more precise circuits.
• It includes some extensions to standard C to provide the FPGA with data-parallel mechanisms and "true multi-dimensional arrays".
• It restricts variables to single assignment.
SA-C makes FPGAs available to programmers with no experience in hardware description languages. In the results, the author states that, when an edge detector (Canny and Prewitt) is implemented in an FPGA using SA-C, the hardware implementations outperform the software implemented on a Pentium processor.
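To illustrate the floating-point to fixed-point conversion listed among the SA-C features, the following Python sketch shows one common way to represent a real value as a scaled integer with a fixed number of fractional bits. It is not SA-C code and does not reproduce how the SA-C compiler chooses bit widths; the function names and the choice of 8 fractional bits are assumptions made for illustration.

```python
def to_fixed(value, frac_bits):
    """Convert a real number to a scaled integer with `frac_bits` fraction bits."""
    return round(value * (1 << frac_bits))


def from_fixed(raw, frac_bits):
    """Convert a scaled integer back to a real number."""
    return raw / (1 << frac_bits)


def fixed_mul(a, b, frac_bits):
    """Multiply two fixed-point values that share the same format.

    The raw product carries 2 * frac_bits fraction bits, so it is shifted
    right to return to the original format (the shift truncates).
    """
    return (a * b) >> frac_bits


if __name__ == "__main__":
    FRAC_BITS = 8  # hypothetical format: 8 fractional bits
    half = to_fixed(0.5, FRAC_BITS)
    quarter = to_fixed(0.25, FRAC_BITS)
    product = fixed_mul(half, quarter, FRAC_BITS)
    print(half, quarter, product, from_fixed(product, FRAC_BITS))
```

With n fractional bits the rounding error of a single conversion is at most half of 2^-n, so the bit width sets a direct trade-off between precision and the amount of logic used.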
2.16 Design of image acquisition and processing based on FPGA (2003)
In this work, white balance processing and image denoising methods are implemented in an FPGA. The CMOS sensor data is converted into RGB format and stored in SDRAM and, after processing, is shown on the VGA display. The system includes an analog-to-digital interface, a FIFO, the sensor controller and other modules. One of the challenges of implementing this algorithm is synchronizing the clock frequency of the FIFO with the image capture.
2.17 FPGA-Based Real-Time Image Segmentation for Medical Systems and Data Processing
A hardware platform is proposed in this work to implement a 3-D image segmentation algorithm for medical systems. An issue encountered in this kind of algorithm, and in other highly demanding image processing algorithms, is the large amount of memory needed and the synchronization of all the parallel processes to make the system more efficient. DDR SDRAM modules of up to 1 GB were needed to work at 266 MSamples/s.
2.18 Image processing oriented to industrial applications
(Information taken from [2], paraphrase is in development) Industrial automation requires innovative solutions from machine vision systems. Usually, quality control and visual inspection are carried out by human experts; while they might perform better than a machine vision system, they will be slower at the task. The development of a machine vision system begins with understanding the application's requirements and constraints, and proceeds with selecting appropriate machine vision software and hardware to solve the task. Also, industrial vision systems must be fast enough to meet the speed requirements of their application environment. The author of this work proposes the following types of inspection used in industrial applications:
• Inspection of dimensional quality: correct tolerances, correct shape. Inspection and classification of solder joints of PCBs.
• Inspection of surface quality: inspecting objects for scratches, cracks, wear or checking for proper finish, roughness and textures.
• Inspection of correct assembling
• Inspection of structural quality: checking for missing components, or for the presence of foreign or extra objects.
• Inspection of accurate or correct operation: verification of correct or accurate operation of the inspected products according to the manufacturing standards.
The authors say that all decisions reduce to confirming that quality standards are satisfied, which in most cases is a binary (yes/no) decision.
This work states that the structure of typical industrial vision systems is:
Figure 1: Typical industrial inspection system
3 Conclusions
3.1 Possible information in the conclusions
• What kind of image processing algorithms could be used for industry application
• Why FPGAs are a good choice for implementing IPAs
• Limitations using FPGAs for IPAs
• What has been done already on implementing IPAs on FPGAs
• Opportunity areas in the field
References
[1] JianXiong, Q.M. Jonathan Wu (2010). An Investigation of FPGA Implementation for Image Processing.
[2] Elias N. Malamas, et al. (2003). A survey on industrial vision systems, applications and tools. Image and Vision Computing, 21:171–188.