Title to be selected Ravi Kishore Kodali and Satya Kesav Department of Electronics and Communication Engineering National Institute of Technology Warangal, India. Abstract—Floating Abstract—Floating point arithmetic is widely used in several applic applicati ations ons of signal signal proce processi ssing ng and in most most scient scientific ific comcomputations. putations. Especiall Especially, y, the floating floating point multiplicatio multiplication n is more frequently used among all arithmetic operations. The IEEE-754 format format of single single and double double prec precisi ision on floatin floating g point point number numberss multiplication requires the 24 x 24 and 53 x 53 mantissa multiplications respectively. Hence, there is a constraint on the hardware utilizatio utilization n of the mantissa mantissa multiplie multiplierr. This paper paper presents presents the implementa implementation tion of Floating Floating point multiplier multiplier using using two differen differentt multiplic multiplication ation algorithms algorithms namely namely Booth’s Booth’s and Karatsuba Karatsuba (Nor(Normal and Recursive) Algorithms on FPGA devices. Xilinx Virtex-6 device has been used during the implementation and comparison of the hardware hardware resourc resources es and their execution execution speeds is made. The performance results have been summarized, compared and conclusion has been presented. Keywords— Keywords— Floating Point Multiplication, FPGA, Single precision, Double-precision, Karatsuba Multiplication, Booth’s Multiplication.
I.
I NTRODUCTION
Embedded systems are designed for different functionality which finally requires requires the manipulat manipulation ion of real-va real-valued lued data. These data are stored as floating point numbers in the memory which is limited. So, these floating point computations are to be approximated to avoid memory limitations which are known as the floating point arithmetic [? [ ?]. In floating point arithmetic, multiplication operation occurs frequently when compared to others. Thus floating point multiplication plays a major role in the design and implementation aspect aspectss of floatin floating g point point proces processor sor [ ?].Computational speed and hardware hardware utilizat utilization ion [ ?] are the two criteri criteriaa that that decide decide the choice of selection of an algorithm in the implementation of floating point multipliers [ ?].
II.
L ITERATURE ITERATURE RE VIEW
Optimizing the operational speed of the multiplier is the main main aspe aspect ct in the the desi design gn of a float floatin ing g poin pointt arit arithm hmet etic ic processor. 24 x 24 and 53 x 53 mantissa multiplication when performe performed d using using traditio traditional nal multipli multiplicati cation on approach, approach, utilizes utilizes large large amount amount of hardwa hardware re resour resources ces and takes takes more more delay delay for the computation. So, the hardware utilization and timing delays delays can be reduced reduced by the Booth’s Booth’s algorithm algorithm when used for mantissa multiplication [ ?] which is detailed in [? [ ?] and [? [?] The timing timing delay delay and power dissipat dissipation ion are further further reduced by using a carry save adder scheme, high-speed CMOS full adder and modified carry select adder which is given in [ ?]. However, several algorithms are in existence which serves the purpose of optimization of floating point multipliers. Karats Karatsuba uba algori algorithm thm defined defined for multi multipli plicat cation ion of long long integers is one of the fastest and best algorithms. The survey of strengths and weaknesses of booth’s and karatsuba algorithm is presented in [? [ ?] which concludes that karatsuba multiplication takes takes lesser lesser signal signal propog propogati ation on time time and also also long long number number multiplication can be suitably implemented using karatsuba’s algorithm when compared to booth’s [ ?]. The imple implemen mentat tation ion of Floati Floating ng point point multip multiplie lierr using using karatsuba algorithm is very efficient as presented in [ ?]. When this algorithm algorithm is performe performed d recursi recursivel vely y [ ?] till it encounters the multi multipli plicat cation ion of 2-bit 2-bit or 3-bit 3-bit number numbers, s, use of higher higher order order logica logicall multip multiplie lierr blocks blocks is avoid avoided ed and hence hence the implemen implementati tation on becomes becomes very simple and efficien efficientt in terms terms of area requirements [? [ ?]. Hence recursive karatsuba algorithm is chosen for the implementa implementation tion of the floating point multiplie tiplierr on FPGA FPGA platfo platform rm and compar compariso ison n is done done with with the aforementioned two algorithms. III.
OVERVIEW OF FLOATING FLOATING POINT MULTIPLICATION MULTIPLICATION AND THE USED ALGORITHMS ALGORITHMS
The format of a floating point number is as follows: FPGAs offer high performa performance nce and very high operating operating speeds speeds with with limit limited ed amoun amountt of logic logic devic devices es and IP cores cores available on the system. Their applications are more commonly observed observed in the field of digital digital signal processi processing, ng, communicommunications cations engineering, engineering, and also in very very high speed computing computing systems such as super computers. This work involves efficient implementation of floating point arithmetic operation, namely single and double precision floating point multiplication, using two different algorithms on an FPGA. The rest of the paper is organized as follows: Section II provides literature survey, section section III presents presents an overvie overview w of the floating floating point multimultiplication and the algorithms used, section IV gives hardware implementation, section V presents simulation and experimental results and section VI concludes the work
For single precision:
Sign Exponent Mantissa
1-bit
8-bits
23-bits
For Double precision:
Sign Exponent Mantissa
1-bit
11-bits
52-bits
In general, general, floating floating point point arithmet arithmetic ic implemen implementati tation on involves computing the sign, exponent and mantissa part of the operands separately, and then combining the three of them after rounding and normalization. A basic overview of the flow of floating point multiplication operations is given below.
Q
n
1) 2) 3) 4) 5)
Q
Recoded Bits
+1
n
Consider an example which has the 8 bit multiplicand as 11011001 and multiplier as 011100010.
Operation Performed
0
0
0
shift
0
1
+1
Add M
1
0
−
1
1
0
1
Make group of two bits in the overlapped way.
→
TABLE I: Booth’s multiplier recoding rules
Subtract M Shift M
Multiplicand
11011001
Multiplier
0 1 1 1 0 0 0 10
Recoded multiplier
+1 0 0 -10 0+1-1
XOR sign bits of both numbers to get the sign bit of the product. Add both operands exponent, adjust the bias. Perform the multiplication of both mantissa. Perform normalization of mantissa product Finally, rounding of mantissa and bias adjustment of exponent.
000100111 111011001 000000000 000000000 000100111 000000000
The first three steps can be easily stated by the following expression,
000000000 111011001
y = Operand1 × Operand2 = (−1)sign1 ∗ 2exp1 ∗ 1.mant1 × (−1)sign2 = −1
sign1 xorsign2
∗
exp11 +exp2 −bias
2
∗
Product ∗
2
exp2
∗
1.mant2
1.mant1 × 1.mant2
where sign1 ,exp1 , mant1 are the sign, exponent and mantissa of first operand, and sign2 ,exp2 , mant2 are the sign, exponent and mantissa of second operand A. Booth’s Multiplication Algorithm
Booth algorithm provides a procedure for multiplying binary integers in signed-2’s complement representation [13]. According to the multiplication procedure, strings of 0’s in the multiplier require no addition but just shifting and a string of 1’s in the multiplier from bit weight 2 k to weight 2m can be treated as 2k+1 − 2m . Booth algorithm involves recoding the multiplier first. In the recoded format, each bit in the multiplier can take any of the three values: 0, 1 and -1.Suppose we want to multiply a number by 01110 (in decimal 14). This number can be considered as the difference between 10000 (in decimal 16) and 00010 (in decimal 2). The multiplication by 01110 can be achieved by summing up the following products: → →
→
24 times the multiplicand (2 4 = 16) 2’s complement of 21 times the multiplicand (21 = 2). In a standard multiplication, three additions are required due to the string of three 1s. This can be replaced by one addition and one subtraction. The above requirement is identified by recoding of the multiplier 01110 using the following rules summarized in table 1.
Recode the number using the above table. To generate recoded multiplier for radix-2, following steps are to be performed: →
Append the given multiplier with a zero to the LSB side
0000001001001001
B. Non-Recursive Karatsuba Multiplication Algorithm
Let p = (u1u2...un)b , q = (v1v2...vn)b , where n = 2k , then we can write p and q as the following form:
p = p1 bn/2 + p0 , q = q 1 bn/2 + q 0 So we have
p × q = p 1 q 1 bn + ( p1 q 0 + p0 q 1 )bn/2 + p0 q 0
(1)
In 1963, Karatsuba transformed formula (1) to the following formula (2)
p × q = r 1 bn + (r2 − r1 − r0 )bn/2 + r0
(2)
where r0 = p 0 q 0 , r1 = p 1 q 1 r2 = ( p0 + q 0 ) × (q 0 + q 1 ) if n = 2, equation (2) needs to execute three multiplication and four addition and subtraction base operations. If n > 2, the same equation reduces the problem of multiplying two length n(n = 2k ) integers to three multiplication of length n/2 numbers, namely p 0 q 0 , p1 q 1 , ( p1 + p0 )(q 1 + q 0 ), plus two addition of length n/2 numbers, two additions of length n numbers, two subtractions of length n numbers. C. Recursive Karatsuba Multiplication Algorithm
We can obtain the same product as that in above by using divide and conquer method recursively. Let T (n) be the computation time of the multiplication p × q , we can get the recursion easily:
T (n) =
7 if n = 2, 3T (n/2) + 5n if n > 2 .
So we get T (n) = 9nlog 3 − 10n = O(nlog 3 ) Hence recursive karatsuba is more efficient than normal karatsuba which is proved by hardware implementation on FPGA.
IV.
A LGORITHM I MPLEMENTATION
In this section the details of the floating point multiplier design using Booth’s, Karatsuba and Recurive karatsuba algorithms has been discussed. The implementation of design has been focused on xilinx virtex-6 FPGA. Algorithm 1 Booth’s Multiplication INPUTS : Two Operands Multiplier and Multiplicand(Mc) OUTPUT: Product(P ) with double the size of the Operand 1. Consider a Product register ( P r ) with length equal to twice that of the operand’s length (L) plus one. 2. Append a zero to the right of Multiplier.
Mq (L : 0) ← Multiplier(L − 1 : 0) & 0
3. Initialize the lower L bits of the Product register with Multiplicand.
P r(2L : 0) ← (others <= 0 ) & M c(L − 1 : 0)
4. Find the 2’s compliment of multiplicand and store it in another register M cc . for i = 0 to L − 1 do if Mq (1 : 0) = ”01” then
P r(2L : L) ← P r(2L : L) + M c(L − 1 : 0) else if Mq (1 : 0) = ”10” then P r(2L : L) ← P r(2L : L) + M cc (L − 1 : 0) else do nothing end if end for 6. P = P r(2L − 1 : 0) is the Product .
Algorithm 2 Karatsuba Multiplication INPUTS : Two Operands X and Y of length N . OUTPUT: Product P with length 2N . 1. If the length of the operands is odd then prepend a zero to the two operands in order to make their length an even number. L ← N + 1 if N is odd L ← N if N is even 2. Break each operand into two sequences of length L/2 each.
X 1 ← X (L/2 − 1 : 0) X 2 ← X (L − 1 : L/2) Y 1 ← Y (L/2 − 1 : 0) Y 2 ← Y (L − 1 : L/2) 3. Calculate the product of first half length sequences of X and Y and also product of second half length sequences of X and Y
P 1 ← X 1 × Y 1 P 2 ← X 2 × Y 2 4. Calculate P 3 as P 3 ← (X 1 + X 2) × (Y 1 + Y 2) − P 1 − P 2 5. Then calculate Pr as
P r(2L − 1 : L) = P 1(L − 1 : 0) P r(L − 1 : 0) = P 2(L − 1 : 0) P r(3L/2 − 1 : L/2) = P r(3L/2 − 1 : L/2)+P 3(L − 1 : 0) 6. Final product is
P = P r(2L − 3 : 0) if N is odd P = P r if N is even
Algorithm 3 Recursive Karatsuba Multiplication INPUTS : Two Operands X and Y of length L . OUTPUT: Product P with length 2L. 1. Recursive Karatsuba Multiplication requires the length of the operand to be a power of 2. 2. So, prepend zeros to the two operands such that their length becomes a power of 2. Let the length be N after adding the 0’s and X’, Y’ are the sequences. 3. Product P is obtained by P ← Recur(X’,Y’) where Recur is the recursive function used for the calculation of the product. 4. The recursive function for the multiplication is as follows: function Recur (Op1, Op2:Bit Vector) Begin l ← length(Op1) if l = 2 then
P r(2l − 1 : 0) ← Op1 × Op2 else
P 1 ← Recur(Op1(l − 1 : l/2), Op2(l − 1 : l/2) P 2 ← Recur(Op1(l/2 − 1 : 0), Op2(l/2 − 1 : 0) P 3 ← Recur(Op1(l − 1 : l/2) + Op1(l/2 − 1 : 0), Op2(l/2 − 1 : 0) + Op2(l/2 − 1 : 0)) P r(2l − 1 : 0) ← P 1(l − 1 : 0)&P 2(l − 1 : 0) P r(3l/2 − 1 : l/2) ← P 3(l − 1 : 0)+ P r(3l/2 − 1 : l/2)) end if return P r End Recur
TABLE II: Comparison between algorithms for Single precision format Device Utilisation Summary
Booth
Karatsuba
Recursive Karatsuba
Number of slices
1282
1156
679
Number of 4-input LUTs
2351
2165
1206
Number of Bonded INPUTS
64
64
64
Number of Bonded OUTPUT
32
32
32
Macro statistics Adders/ Subtractor
50
81
45
Multiplexers
528
462
-
Maximum Combinational path delay
98.837 ns
54.899 ns
31.123 ns
Maximum Combinational path delay
91.679 ns
58.878 ns
21.892 ns
Maximum output required time after clock
4.114 ns
4.114 ns
4.114 ns
Timing Summary
V.
R ESULTS AND S IMULATION
We have implemented the aforementioned algorithms for both single and double precision using VHDL. The design has been synthesized and routed on Virtex-6 FPGA target using Xilinx ISE. Simulation results has been analyzed in ModelSim-SE. Hardware utilization and performance for all proposed implementation and for the Xilinx core are shown in Table-II and Table-III respectively, on FPGA. All the hardware resource estimates were obtained after place and route process of FPGA synthesis. The simulation results for single and double precision are shown in figure-1 and figure-2.
TABLE III: Comparison between algorithms for double precision format Device Utilisation Summary
Booth
Karatsuba
Recursive Karatsuba
Number of slices
6115
5285
1784
Number of 4-input LUTs
11346
9662
3280
Number of Bonded INPUTS
128
128
128
Number of Bonded OUTPUT
64
64
64
Macro statistics Adders/ Subtractor
108
81
45
Multiplexers
2703
462
1
Maximum Combinational path delay
196.198 ns
54.899 ns
37.348 ns
Maximum Combinational path delay
189.029 ns
58.878 ns
29.660 ns
M axi mum o ut put r equ ir ed t im e a ft er c loc k
4. 114 n s
4 .11 4 ns
4 .11 4 n s
Timing Summary
VI.
C ONCLUSION
After implementation and comparison of three multiplication algorithms, Booth’s, Normal Karatsuba and Recursive Karatsuba, a brief performance result is reported in this paper. Two main criteria, FPGA resources and processing speed are adopted when evaluating the performance. We can see from the results that Recurisve Karatsuba Algorithm performs better than Normal Karatsuba and Booth’s algorithm. Recursive Karatsuba algorithm has the least FPGA resources utilised and the speed is relatively high. Hence Recursive Karatsuba is the best algorithm among the three algorithms.