# Design of low power fixed-width multiplier with row bypassing 

S. Balamurugan, Sneha Ghosh, Atul, S. Balakumaran, R. Marimuthu, and P.S. Mallick ${ }^{\text {a) }}$<br>School of Electrical Engineering, VIT University, Vellore, Vellore-632014, Tamil Nadu, India<br>a) psmallick@vit.ac.in


#### Abstract

This paper presents a low-power fixed-width multiplier with row bypassing (FWM-RB) for multimedia applications. When a bit of the multiplier operand is zero, the corresponding row of adders can be disabled, which yields a significant power reduction. The bypassing is implemented with multiplexers incorporated in the Modified Full Adder (MFA). The design is described in Verilog HDL and implemented using Cadence typical libraries of TSMC 90 nm technology with a supply voltage of 1.2 V. This work evaluates the power, area and delay of fixed-width multipliers and shows that the proposed architecture consumes less power than the conventional fixed-width multiplier.


Keywords: low power, fixed-width multiplier, row bypassing, parallel architecture
Classification: Integrated circuits

## References

[1] L.-D. Van and C.-C. Yang, "Generalized Low-Error Area-Efficient Fixed-Width Multipliers," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 8, pp. 1608-1619, Aug. 2005.
[2] N. Petra, D. De Caro, V. Garofalo, E. Napoli, and A. G. M. Strollo, "Design of fixed width multipliers with linear compensation function," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 5, pp. 947-960, May 2011.
[3] J. M. Jou, S. R. Kuang, and R. Der Chen, "Design of low-error fixed-width multipliers for DSP applications," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 46, no. 6, pp. 836-842, June 1999.
[4] C.-H. Chang and R. Kumar Satzoda, "A low error and high performance multiplexer-based truncated multiplier," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 12, pp. 1767-1771, Dec. 2010.
[5] K.-C. Kuo and C.-W. Chou, "Low power and high speed multiplier design with row bypassing and parallel architecture," Microelectronics Journal, vol. 41, no. 10, Oct. 2010.
[6] J. Ohban, V. G. Moshnyaga, and K. Inoue, "Multiplier energy reduction through bypassing of partial products," Asia-Pacific Conference on Circuits and Systems (APCCAS'02), vol. 2, pp. 13-17, Oct. 2002.

## 1 Introduction

Devices that combine a high operating frequency with low power consumption and small size are attracting considerable attention. Multipliers are basic building blocks of any multimedia device and are a bottleneck in terms of power consumption and area in high-end portable devices. Designing a low-power, area-efficient multiplier architecture is one of the most challenging parts of any datapath circuit. In battery-operated devices, low-power computation circuitry is needed to extend the operating hours of the battery.

Fixed-width multipliers (FWM) are used in several multimedia applications where some degree of error in the final product is acceptable without sacrificing performance. In such applications, full-precision multipliers [5] are not the best choice, because they occupy more area and consume more power than fixed-width multipliers. Many fixed-width multipliers have already been proposed for these applications [1, 2, 3, 4].

Analysis shows that the average proportion of zero inputs in the multiplier operand is $73.8 \%$ in conventional DSP applications [5]. Researchers have worked on optimizing the power of full-precision multipliers (FPM) using bypassing techniques [5, 6], but to the best of our knowledge no one has applied bypassing to fixed-width multiplier design. Dynamic power dominates the total power consumption and therefore needs to be reduced. We have thus developed a new design that achieves low-power fixed-width multiplication using the row bypassing technique. Recently, row bypassing has been used in a full-precision multiplier built from ripple carry adders (RCA) [5]. The RCA is used instead of the carry save adder (CSA) in order to obtain a parallel architecture: CSA-based multipliers cannot be decomposed into two parallel blocks as proposed here, since the inputs of each row of CSA adders come from the row above. In addition, row bypassing in a CSA structure [6] requires two multiplexers and three tri-state buffers, which increases power and area significantly. Bypassing achieves a significant power reduction when the multiplier operand contains many zeros.

In this paper, we propose a low-power fixed-width parallel multiplier with row bypassing based on the Modified Full Adder (MFA), which requires only one multiplexer and two tri-state buffers. This architecture is compared with the FWM of [3], and the results show a reduction in power consumption.

## 2 Design of fixed-width multiplier

In a fixed-width multiplier, two $n$-bit inputs produce an $n$-bit product; the $n$ least significant columns of the partial-product array, which form the least significant product (LP), are removed.

Multiplying two unsigned $n$-bit numbers yields an unsigned $2 n$-bit output, which can be represented as

$$
\begin{equation*}
P=a \cdot b=\sum_{k=0}^{2 n-1} P_{k} 2^{k}=\left(\sum_{i=0}^{n-1} a_{i} 2^{i}\right)\left(\sum_{j=0}^{n-1} b_{j} 2^{j}\right) \tag{1}
\end{equation*}
$$

where $a_{i}$ is the $i^{\text{th}}$ bit of $a$, $b_{j}$ is the $j^{\text{th}}$ bit of $b$, and $P_{k}$ is the $k^{\text{th}}$ bit of the product $P$. However, several applications require an $n$-bit output with low power consumption and small error. The product $P$ can be separated into two parts: the most significant product (MP) and the LP. The contribution of the LP part to the final output is very small, so removing it leaves an $n$-bit output. This can be represented through the following equation

$$
\begin{equation*}
P=M P+L P=\left(\sum_{k=n}^{2 n-1} P_{k} 2^{k}\right)+\left(\sum_{k=0}^{n-1} P_{k} 2^{k}\right) \tag{2}
\end{equation*}
$$

The separation of partial products is shown in Fig. 1.
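As a small illustration of Eq. (2), the following Python sketch splits a full product into its MP and LP parts and forms the $n$-bit output by simply dropping LP. The function names `split_product` and `truncate` are hypothetical and not taken from the paper, and no error compensation is applied here.

```python
def split_product(a: int, b: int, n: int) -> tuple[int, int]:
    """Split P = a * b into (MP, LP) so that P = MP + LP, as in Eq. (2)."""
    assert 0 <= a < (1 << n) and 0 <= b < (1 << n), "operands must be n-bit"
    p = a * b
    lp = p & ((1 << n) - 1)   # weighted bits P_0 .. P_{n-1}
    mp = p - lp               # weighted bits P_n .. P_{2n-1}
    return mp, lp


def truncate(a: int, b: int, n: int) -> int:
    """n-bit result obtained by discarding LP (no compensation)."""
    mp, _ = split_product(a, b, n)
    return mp >> n


if __name__ == "__main__":
    mp, lp = split_product(200, 100, 8)
    print(mp, lp, truncate(200, 100, 8))   # 19968 32 78
```

For $200 \times 100 = 20000$ with $n = 8$, the discarded LP is 32, which is small relative to the product; this is the error that the compensation circuit of Section 2.1 aims to reduce.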


Fig. 1. $8 \times 8$ Separation of partial products into MP and LP


Fig. 2. Separation of partial product of LP into LPminor and LPmajor

### 2.1 Error compensation biased circuit for fixed width multiplier

Removing the LP part introduces an error in the result. The LP can be divided into two parts based on significance, $L P_{\text {minor }}$ and $L P_{\text {major }}$, as shown in Fig. 2. The less significant part of LP is called $L P_{\text {minor }}$ and the more significant part is called $L P_{\text {major }}$. The partial products in the $L P_{\text {major }}$ column are used for error compensation. Eq. (2) can then be rewritten as $P=M P+L P=M P+L P_{\text {major }}+L P_{\text {minor }}$. The error compensation column in [3] and in our proposed fixed-width multiplier use the same partial products to generate the error compensation bias value, and hence the proposed multiplier has the same accuracy as [3].
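The decomposition $P=M P+L P_{\text {major }}+L P_{\text {minor }}$ can be checked numerically on the partial-product bits $a_{i} b_{j}$. The sketch below groups them by column weight $i+j$; it assumes that $L P_{\text {major }}$ is the single column $n-1$ and $L P_{\text {minor }}$ the columns below it, which the text implies but does not state explicitly. The function name is hypothetical. Note that MP here is a sum of partial-product bits, so carries coming from the lower columns are not included in it.

```python
def partial_product_groups(a: int, b: int, n: int) -> tuple[int, int, int]:
    """Return the weighted sums (MP, LP_major, LP_minor) of the bits a_i & b_j.

    Assumption: column n-1 is LP_major, columns 0..n-2 are LP_minor,
    and columns n..2n-2 belong to MP.
    """
    mp = lp_major = lp_minor = 0
    for i in range(n):
        for j in range(n):
            term = (((a >> i) & 1) & ((b >> j) & 1)) << (i + j)
            if i + j >= n:
                mp += term
            elif i + j == n - 1:
                lp_major += term
            else:
                lp_minor += term
    return mp, lp_major, lp_minor


if __name__ == "__main__":
    n = 8
    mp, major, minor = partial_product_groups(0xB7, 0x5D, n)
    # The three groups together always reproduce the full product.
    assert mp + major + minor == 0xB7 * 0x5D
    print(mp, major, minor)
```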

In the proposed architecture, the error compensation is divided into two parts. Because we use a parallel architecture, the upper part of $L P_{\text {major }}$ compensates the error of BLOCK 1 and the lower part compensates the error of BLOCK 2. An AND-OR structure is used for this purpose, as shown in Fig. 3 (a).

## 3 Row bypassing multiplier

A conventional full adder has three inputs and two outputs. Array multipliers built from such adders suffer from disadvantages such as low operating speed and unwanted switching activity when an operand bit of the multiplier is zero. Speed and power are the significant factors to consider for better performance. When zero partial products are added, a large number of signal transitions are generated that are unnecessary and have no effect on the final product. With the row bypassing multiplier [5], these zero partial products are bypassed so that the unnecessary transitions are avoided.

### 3.1 Modified Full Adder (MFA) design

Fig. 3 (b) shows the MFA [5], the basic unit of the fixed-width multiplier with row bypassing and parallel architecture (FWM-RB-PA). Compared with [6], it requires less hardware to modify the full adder for bypassing the partial products, which results in lower area and power consumption. Two tri-state buffers are placed at the inputs of the modified adder to disable the full adder when the control line $b_{j}$ is at logic 0. Depending on the value of $b_{j}$, the sum output of the MFA is chosen from either the bypassed value or the sum output of the full adder.
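The behaviour described above can be modelled at the bit level. The following Python sketch is a behavioural illustration only, not the authors' Verilog: `mfa` selects between the bypassed sum bit and the full-adder sum according to $b_{j}$, and `row_bypass_multiply` chains such cells into a full-precision ripple-carry array. Treating the carry of a bypassed cell as 0 is a modelling assumption, and all function names are hypothetical.

```python
def full_adder(x: int, y: int, cin: int) -> tuple[int, int]:
    """Ordinary one-bit full adder: returns (sum, carry_out)."""
    s = x ^ y ^ cin
    cout = (x & y) | (cin & (x ^ y))
    return s, cout


def mfa(pp: int, s_in: int, c_in: int, b_j: int) -> tuple[int, int]:
    """Behavioural model of a modified full adder.

    b_j = 0: the adder is disabled and the incoming sum bit is bypassed.
    b_j = 1: ordinary full-adder behaviour.
    Assumption: the carry of a bypassed cell is treated as 0.
    """
    if b_j == 0:
        return s_in, 0
    return full_adder(pp, s_in, c_in)


def row_bypass_multiply(a: int, b: int, n: int) -> tuple[int, int]:
    """Unsigned n x n multiply; returns (product, number of active rows)."""
    acc = [0] * n            # running partial sum, LSB first
    low_bits = 0             # product bits P_0 .. P_{n-1}
    active_rows = 0
    for j in range(n):
        b_j = (b >> j) & 1
        active_rows += b_j
        carry = 0
        row = []
        for i in range(n):
            pp = ((a >> i) & 1) & b_j          # partial product a_i * b_j
            s, carry = mfa(pp, acc[i], carry, b_j)
            row.append(s)
        low_bits |= row[0] << j                # LSB of this row is final
        acc = row[1:] + [carry]                # shift right by one column
    high = sum(bit << k for k, bit in enumerate(acc))
    return (high << n) | low_bits, active_rows


if __name__ == "__main__":
    print(row_bypass_multiply(13, 0b0100, 4))  # (52, 1)
```

In the usage line, only one of the four rows performs additions because the multiplier operand `0b0100` has a single nonzero bit; the other three rows simply pass their inputs through.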

## 4 Proposed low-power fixed-width multiplier

Fig. 6 shows the final proposed FWM-RB-PA architecture, which adopts the carry propagate adder structure. The row bypassing structure introduced here is separated into two parts. The final output products from $P_{n}$ to $P_{2 n-1}$ form the first part. Hence, in the $8 \times 8$ multiplier, the first part consists of $P_{8}$ to $P_{15}$ (MP), and $P_{0}$ to $P_{7}$ form the second part (LP). When truncation is introduced, the $L P_{\text {minor }}$ part is removed because it contributes little to the output.

This truncated multiplier using the row bypassing technique without decomposition reduces the power consumption. However, the delay of this architecture is very high. To reduce this delay, we introduce parallelism into the architecture, as shown in Fig. 6.


Fig. 3. (a) AND-OR structure (b) Modified Full Adder (MFA) for row bypassing


Fig. 4. Fixed-width multiplier with row bypassing (BLOCK 1)


Fig. 5. Fixed-width multiplier with row bypassing (BLOCK 2)

The parallelism is implemented by decomposing the MP part of the $8 \times 8$ multiplier into two dissimilar blocks. The upper part of MP, together with the upper part of $L P_{\text {major }}$, forms BLOCK 1, as shown in Fig. 4. The lower part of MP, together with the lower part of $L P_{\text {major }}$, forms BLOCK 2, as shown in Fig. 5. The MFA is designed with two tri-state buffers and one multiplexer.


Fig. 6. An $8 \times 8$ Fixed-width multiplier with row bypassing and parallel architecture (FWM-RB-PA)

At the end of each row, an AND gate generates the carry out for correct operation. The partial sums and carry outputs of the two blocks are thus computed simultaneously. The outputs of the two blocks are fed to a set of full adders, as shown in Fig. 6. This set includes three carry save adders (indicated by broken lines), which make the parallel architecture possible and thereby reduce the delay. The remaining adders are simple ripple carry adders. This forms the final proposed FWM-RB-PA structure.
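The figures are not reproduced here, so the exact composition of BLOCK 1 and BLOCK 2 cannot be recovered from the text alone. Purely as an illustration of the general idea of two-block parallelism, the Python sketch below assumes that the rows of the partial-product array are split into a lower and an upper group, each summed independently and then merged by a final adder. It is full precision (no truncation or compensation), skips rows whose $b_{j}$ is zero, and uses hypothetical names throughout.

```python
def two_block_multiply(a: int, b: int, n: int) -> int:
    """Full-precision product formed from two independently summed row groups.

    Assumption: the two groups are rows 0..n/2-1 and rows n/2..n-1 of the
    partial-product array; the paper's actual block boundaries may differ.
    """
    h = n // 2
    rows_low = [a << j for j in range(h) if (b >> j) & 1]       # rows 0..h-1
    rows_high = [a << j for j in range(h, n) if (b >> j) & 1]   # rows h..n-1
    block_low = sum(rows_low)      # this sum and the next one have no
    block_high = sum(rows_high)    # data dependency on each other
    return block_low + block_high  # final merging adder


if __name__ == "__main__":
    assert two_block_multiply(0xB7, 0x5D, 8) == 0xB7 * 0x5D
```

Because neither group depends on the other, the longest chain of row additions is roughly halved compared with a single undivided array, which is the motivation given in the text for the parallel structure.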

## 5 Experimental results \& comparison

We designed $6 \times 6$, $8 \times 8$ and $12 \times 12$ FWM-RB-PA multipliers using the modified full adder and compared their performance with FWM [3] and FPM [5]. The proposed FWM-RB-PA, FWM [3] and FPM [5] were described in structural Verilog HDL and implemented using typical libraries (Process: 1, Voltage: 1.2 V, Temperature: $25^{\circ} \mathrm{C}$) of TSMC 90 nm technology. The FWM-RB-PA and FWM [3] architectures give the same outputs for all input combinations. A Value Change Dump (VCD) file was generated for the proposed multipliers ($6 \times 6$, $8 \times 8$, and $12 \times 12$) as well as FWM [3] for 15 test vector input combinations at a clock frequency of 200 MHz. This file was then imported into Cadence RTL Compiler for power, area and delay calculation. Table I shows the area, delay and power comparison of the proposed FWM-RB-PA and FWM [3]. Table II shows the area, delay and power comparison of the proposed FWM-RB-PA and the conventional FPM [5].

Table I. Comparison of Area, Delay and Power with FWM [3]

| Bit size | Architecture | Area | Area ratio (%) | Delay | Delay ratio (%) | Power | Power ratio (%) | PDP | PDP ratio (%) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 6-bit | FWM [3] | 390.2 | 100 | 2351 | 100 | 2311.1 | 100 | 5.4 | 100 |
|  | FWM-RB-PA | 544.7 | 139.6 | 2582 | 109.8 | 168.5 | 7.3 | 0.4 | 8.1 |
| 8-bit | FWM [3] | 714 | 100 | 3176 | 100 | 3653 | 100 | 11.6 | 100 |
|  | FWM-RB-PA | 1093 | 153 | 3324 | 104.7 | 196.1 | 5.4 | 0.7 | 5.6 |
| 12-bit | FWM [3] | 1633 | 100 | 4827 | 100 | 4093 | 100 | 19.7 | 100 |
|  | FWM-RB-PA | 2714 | 166 | 4808 | 99.6 | 274.6 | 6.7 | 1.3 | 6.6 |

Table II. Comparison of Area, Delay and Power with FPM [5]

| Bit size | Architecture | Area | Area ratio (%) | Delay | Delay ratio (%) | Power | Power ratio (%) | PDP | PDP ratio (%) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 6-bit | FPM [5] | 1088 | 100 | 2918 | 100 | 1777 | 100 | 5.2 | 100 |
|  | FWM-RB-PA | 544.7 | 50 | 2582 | 88 | 168.5 | 9.5 | 0.4 | 7.7 |
| 8-bit | FPM [5] | 2139 | 100 | 4064 | 100 | 2875 | 100 | 11.6 | 100 |
|  | FWM-RB-PA | 1093 | 51 | 3324 | 82 | 196.1 | 6.8 | 0.7 | 5.6 |
| 12-bit | FPM [5] | 5123 | 100 | 6113 | 100 | 4980 | 100 | 30.4 | 100 |
|  | FWM-RB-PA | 2714 | 53 | 4808 | 78 | 274.6 | 5.5 | 1.3 | 4.3 |

Table I shows that the proposed FWM-RB-PA achieves power reductions of $92.7 \%$, $94.6 \%$ and $93.3 \%$ for the $6 \times 6$, $8 \times 8$ and $12 \times 12$ structures, respectively, compared with FWM [3]. The area overhead of our design is $39.6 \%$, $53 \%$ and $66.2 \%$ for the same structures. Although the delay of the proposed architecture is slightly higher for the smaller bit sizes, the difference shrinks as the bit size increases. Even where the delay of the FWM-RB-PA is higher, its Power Delay Product (PDP) is lower.

Table II compares the FWM-RB-PA with the conventional FPM [5] in terms of area, delay, power consumption and PDP. The proposed architecture achieves power reductions of $90.5 \%$, $93.2 \%$ and $94.5 \%$ for the $6 \times 6$, $8 \times 8$ and $12 \times 12$ structures, respectively, compared with FPM [5]. It also achieves area reductions of $50 \%$, $49 \%$ and $47 \%$ and delay reductions of $12 \%$, $18 \%$ and $22 \%$, which together give a larger reduction in PDP relative to FPM [5]. These results show that, for $12 \times 12$, the proposed design outperforms FWM [3] in power, delay and PDP at the cost of area, and that it reduces power, area, delay and PDP relative to the conventional FPM [5].
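The percentages quoted for Table I can be re-derived from its raw area, delay and power entries. The short Python sketch below does this arithmetic; the values are copied from Table I, units are not needed because only ratios are formed, and the helper name `percent_of_reference` is hypothetical.

```python
# (area, delay, power) exactly as listed in Table I.
TABLE_I = {
    6:  {"FWM": (390.2, 2351, 2311.1), "FWM-RB-PA": (544.7, 2582, 168.5)},
    8:  {"FWM": (714, 3176, 3653),     "FWM-RB-PA": (1093, 3324, 196.1)},
    12: {"FWM": (1633, 4827, 4093),    "FWM-RB-PA": (2714, 4808, 274.6)},
}


def percent_of_reference(ref, new):
    """Return area, delay, power and PDP of `new` as a percentage of `ref`."""
    area = 100 * new[0] / ref[0]
    delay = 100 * new[1] / ref[1]
    power = 100 * new[2] / ref[2]
    pdp = 100 * (new[1] * new[2]) / (ref[1] * ref[2])   # PDP = power x delay
    return area, delay, power, pdp


if __name__ == "__main__":
    for bits, row in TABLE_I.items():
        area, delay, power, pdp = percent_of_reference(row["FWM"], row["FWM-RB-PA"])
        print(f"{bits}-bit: area {area:.1f}%, delay {delay:.1f}%, "
              f"power {power:.1f}% (reduction {100 - power:.1f}%), PDP {pdp:.1f}%")
```

The power reduction is simply 100 minus the power ratio. PDP ratios computed this way from the unrounded power and delay entries may differ in the last digit from the tabulated PDP ratios.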

## 6 Conclusion

A low-power fixed-width unsigned multiplier has been designed using the MFA. The proposed fixed-width multiplier applies a bypassing technique to the ripple carry array multiplier with a parallel architecture. Our design consumes less power in all cases, and it can be used for various multimedia applications where the distribution of 0's and 1's is not uniform. The entire architecture is less complex and easier to design than the design in [6]. It would be interesting to evaluate this design for signed multipliers.

## 7 Acknowledgment

The authors would like to thank S. Sivanantham and K. Jaganatha Naidu of VLSI Division, VIT University, Vellore, India, for their valuable suggestions to carry out the research work successfully.

