English abstract
Technologies such as artificial intelligence, machine learning, and natural language processing rely heavily on intensive multiplication operations, for instance through matrix computations and neural networks. Despite the impressive speed of today's processors, the rapid pace of such technological advancement has rendered them too slow. Developers have been exploiting the multiprocessing and task-parallelism features of GPUs and DSPs to speed up execution. Yet the core multiplication logic circuit – whether in GPUs, DSPs, or even CPUs – is still the same. The fundamental dilemma of the multiplication logic circuit is its complexity: it increases exponentially as the number of bits grows or as the intended execution time is reduced. Either way, that complexity ripples through, for example, an entire neural network all over again. So the fact remains that a breakthrough is required, one that improves the conventional multiplication method and develops an alternative that either performs the multiplication in a single cycle or reduces the circuit logic and the execution time at the same time.
This research aims to assess the effect of employing the Japanese multiplication method on the binary multiplier's latency and to provide an enhanced architecture and algorithm, inspired by that method, for the multiplication logic circuit (a.k.a. the binary multiplier). The proposed design incorporates a hybrid approach along with key features of leading market multipliers, such as the Dadda and Wallace multipliers, and a couple of multipliers proposed in the literature.
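To make the connection concrete, the sketch below is a simplified software illustration, not the thesis's actual circuit or algorithm: it maps the Japanese line-crossing method onto binary multiplication, where each crossing corresponds to an AND of two bits and crossings of equal weight are counted per column before the carries are resolved.

```python
# Simplified illustration (not the proposed multiplier): the Japanese
# line-crossing method applied to binary operands. Every intersection is an
# AND of two bits, and intersections of equal weight (i + j) are counted per
# column, mirroring the diagonal counts of the pen-and-paper method.
def japanese_style_binary_multiply(a: int, b: int, n: int = 8) -> int:
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]

    # Count the "crossings" that fall into each weight column.
    columns = [0] * (2 * n)
    for i in range(n):
        for j in range(n):
            columns[i + j] += a_bits[i] & b_bits[j]

    # Resolve the column counts into the final product (carry resolution).
    result, carry = 0, 0
    for k, count in enumerate(columns):
        total = count + carry
        result |= (total & 1) << k
        carry = total >> 1
    return result

assert japanese_style_binary_multiply(13, 11) == 13 * 11
```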
This research introduces a theoretical methodology for evaluating different implementations using big O notation. These evaluations are analyzed over a much wider domain rather than singular cases, and the analysis results have steered the selection of the key features and the architecture of the proposed multiplier. Accordingly, the proposed multiplier is constructed from enhanced designs of a half adder, a full adder, vertical compressor slices (VCSs), and a carry-lookahead adder (CLA). In addition, it applies a new multiplication algorithm that attends to the optimal connections and does not propagate the output carries.
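As a point of reference only, the following is a minimal sketch assuming standard textbook gate-level definitions rather than the enhanced designs developed in this thesis: it shows a half adder, a full adder, and a carry-save reduction step in which carries are saved for the next stage instead of being propagated, which is the general principle behind deferring carry propagation to a final fast adder such as a CLA.

```python
# Minimal textbook sketch (not the enhanced designs of this thesis).
def half_adder(a: int, b: int) -> tuple[int, int]:
    return a ^ b, a & b              # (sum, carry)

def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, c)
    return s2, c1 | c2               # (sum, carry)

# One carry-save reduction step: three n-bit rows are compressed into two
# rows (sums and carries) without rippling any carry within the step; the
# deferred carries are resolved later by a fast adder such as a CLA.
def carry_save_step(x: int, y: int, z: int, n: int) -> tuple[int, int]:
    sums, carries = 0, 0
    for k in range(n):
        s, c = full_adder((x >> k) & 1, (y >> k) & 1, (z >> k) & 1)
        sums |= s << k
        carries |= c << (k + 1)      # carry moves to the next weight but is
                                     # only saved, never propagated sideways
    return sums, carries

# The sum x + y + z is preserved as the sum of the two output rows.
assert sum(carry_save_step(13, 11, 7, 4)) == 13 + 11 + 7
```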
At least theoretically, the proposed multiplier exhibits about 58% less area utilization and a slightly lower latency than the Dadda multiplier – the benchmark multiplier. A functional unit test has been carried out successfully on an Artix-7 development board. In the future, the proposed multiplier needs to be fabricated for a proper practical comparison against the Dadda multiplier. Moreover, the incorporation of signed and floating-point numbers might be considered for production use.