Document
Spark-based solution of large linear systems for big data processing.
Publisher
Sultan Qaboos University.
Gregorian
2020
Language
English
English abstract
Big data refers to collections of datasets so large that they cannot be processed using
conventional data management methods. Parallelism is used to process these large
amounts of data efficiently, and the scale, diversity, and complexity of big data
require new parallel computing architectures and techniques. Among parallel computing
frameworks, Spark enables parallel processing of data on collections of commodity
computing nodes without requiring the programmer to handle the complexity of designing
and implementing parallel programs. Spark is an efficient in-memory cluster computing
technology that offers high scalability and fault tolerance. Furthermore, it handles
iterative tasks better because it reads and writes data in memory (using Spark's
caching feature across iterations) rather than on disk.
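To make the caching point concrete, here is a minimal sketch (in Scala, Spark's native
language) of an iterative job that reuses a cached RDD; the input path and the statistic
being computed are placeholder assumptions for illustration, not details from the thesis:

    import org.apache.spark.sql.SparkSession

    object CacheDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("CacheDemo").getOrCreate()
        val sc = spark.sparkContext

        // Without cache(), every iteration below would re-read the input from
        // disk; with it, iterations after the first read partitions from memory.
        val data = sc.textFile("hdfs:///path/to/input")  // placeholder path
          .map(_.toDouble)
          .cache()

        var estimate = 0.0
        for (_ <- 1 to 10) {
          // Each pass is a new Spark job over the same cached partitions.
          estimate = data.map(x => 0.5 * (estimate + x)).mean()
        }
        println(s"estimate after 10 iterations: $estimate")
        spark.stop()
      }
    }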
Solving very large systems of linear equations is needed in many applications. There
are several well-known methods for solving such systems, including the Jacobi method.
This iterative method has been used to solve systems of linear equations with thousands
of unknowns in areas such as machine learning and climate science. Compared to other
frameworks such as Hadoop, Spark offers the promise of solving large-scale linear
systems efficiently with iterative methods such as Jacobi, thanks to its data
persistence/caching feature.
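For reference, the standard component-wise Jacobi update for a system Ax = b with
nonzero diagonal entries is (textbook form in LaTeX notation, not a detail taken from
the thesis):

    % Jacobi iteration, k = iteration counter, requires a_{ii} \neq 0
    x_i^{(k+1)} = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j \neq i} a_{ij}\, x_j^{(k)} \Bigr),
    \qquad i = 1, \dots, n

Because each component of x^(k+1) depends only on the previous iterate x^(k), all n
updates within an iteration are independent and can be computed in parallel, which is
what makes the method a natural fit for frameworks like Spark.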
In this project we put this hypothesis to the test by implementing the Jacobi method
for solving large systems of linear equations using Spark on a commodity cluster of
computers and evaluating its performance. The aim is to show the effectiveness of
Spark in solving important big data problems in a distributed system environment.
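As a rough illustration of how such an implementation might be structured, here is a
sketch under stated assumptions: the row-wise data layout, the synthetic diagonally
dominant test matrix, and the fixed iteration count are illustrative choices, not the
thesis code.

    import org.apache.spark.sql.SparkSession

    object JacobiSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("JacobiSketch").getOrCreate()
        val sc = spark.sparkContext

        val n = 4000
        // One record per matrix row: (row index i, A(i, :), b(i)). The matrix
        // here is a synthetic diagonally dominant example, so Jacobi converges.
        val rows = sc.parallelize(0 until n).map { i =>
          val a = Array.tabulate(n)(j => if (i == j) 2.0 * n else 1.0)
          (i, a, 1.0)
        }.cache()  // reused in every iteration, so keep it in cluster memory

        var x = Array.fill(n)(0.0)
        for (_ <- 1 to 50) {
          val bx = sc.broadcast(x)  // ship the current iterate to all workers
          // Component-wise Jacobi update:
          //   x_i <- (b_i - sum_{j != i} a_ij * x_j) / a_ii
          val next = rows.map { case (i, a, bi) =>
            var s = 0.0
            var j = 0
            while (j < n) { if (j != i) s += a(j) * bx.value(j); j += 1 }
            (i, (bi - s) / a(i))
          }.collect().sortBy(_._1).map(_._2)
          bx.destroy()
          x = next
        }
        println(s"x(0) after 50 iterations: ${x(0)}")
        spark.stop()
      }
    }

Caching the rows RDD is the step that exploits Spark's persistence: without it, every
one of the 50 iterations would rebuild the matrix partitions from scratch.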
The performance evaluation results show that the Jacobi method on Spark can achieve
super-linear speedup due to the ability of the Spark cluster to cache large amounts of
data across the cluster nodes. The results also reveal that Jacobi on Spark achieves
substantially higher speedup and efficiency for very large matrices (of size 4000×4000
or larger). Our results compare favourably to those obtained in other projects
that implemented iterative methods on Hadoop. The Jacobi method on Spark
achieved much higher efficiency (seven times higher) than the Jacobi method on Hadoop
MapReduce for large matrix sizes, thanks to the caching feature offered by Spark,
which is not available in Hadoop MapReduce.
Category
Theses and Dissertations