Document
Spark-based solution of large linear systems for big data processing.
Publisher
Sultan Qaboos University.
Gregorian
2020
Language
English
English abstract
Big data refers to collections of datasets so large that they cannot be processed using
conventional data management methods. Parallelism is used to process these large
amounts of data efficiently, and the scale, diversity, and complexity of big data
require new parallel computing architectures and techniques. Among parallel computing
frameworks, Spark enables parallel processing of data on collections of commodity
computing nodes without requiring the programmer to handle the complexity of designing
and implementing parallel programs. Spark is an efficient in-memory cluster computing
technology that offers high scalability and fault tolerance. Furthermore, it handles
iterative tasks better because it reads and writes data in memory (using Spark's
caching feature across iterations) rather than on disk.
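To make the caching point concrete, here is a minimal sketch (in Scala, Spark's native
language) of an iterative job that reuses a cached RDD; the input path and the statistic
being computed are placeholder assumptions for illustration, not details from the thesis:

    import org.apache.spark.sql.SparkSession

    object CacheDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("CacheDemo").getOrCreate()
        val sc = spark.sparkContext

        // Without cache(), every iteration below would re-read the input from
        // disk; with it, iterations after the first read partitions from memory.
        val data = sc.textFile("hdfs:///path/to/input")  // placeholder path
          .map(_.toDouble)
          .cache()

        var estimate = 0.0
        for (_ <- 1 to 10) {
          // Each pass is a new Spark job over the same cached partitions.
          estimate = data.map(x => 0.5 * (estimate + x)).mean()
        }
        println(s"estimate after 10 iterations: $estimate")
        spark.stop()
      }
    }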
Solving very large systems of linear equations is needed in many applications. There
are several well-known methods for solving such systems, including the Jacobi method.
This iterative method has been used to solve systems of linear equations with thousands
of unknowns in areas such as machine learning and climate science. Compared to other
frameworks such as Hadoop, Spark offers the promise of solving large-scale linear
systems efficiently with iterative methods such as Jacobi, thanks to its data
persistence/caching feature.
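For reference, the standard component-wise Jacobi update for a system Ax = b with
nonzero diagonal entries is (textbook form in LaTeX notation, not a detail taken from
the thesis):

    % Jacobi iteration, k = iteration counter, requires a_{ii} \neq 0
    x_i^{(k+1)} = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j \neq i} a_{ij}\, x_j^{(k)} \Bigr),
    \qquad i = 1, \dots, n

Because each component of x^(k+1) depends only on the previous iterate x^(k), all n
updates within an iteration are independent and can be computed in parallel, which is
what makes the method a natural fit for frameworks like Spark.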
In this project we put this hypothesis to the test by implementing the Jacobi method
for solving large systems of linear equations using Spark on a commodity cluster of
computers and evaluating its performance. The aim is to show the effectiveness of
Spark in solving important big data problems in a distributed system environment.
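As a rough illustration of how such an implementation might be structured, here is a
sketch under stated assumptions: the row-wise data layout, the synthetic diagonally
dominant test matrix, and the fixed iteration count are illustrative choices, not the
thesis code.

    import org.apache.spark.sql.SparkSession

    object JacobiSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("JacobiSketch").getOrCreate()
        val sc = spark.sparkContext

        val n = 4000
        // One record per matrix row: (row index i, A(i, :), b(i)). The matrix
        // here is a synthetic diagonally dominant example, so Jacobi converges.
        val rows = sc.parallelize(0 until n).map { i =>
          val a = Array.tabulate(n)(j => if (i == j) 2.0 * n else 1.0)
          (i, a, 1.0)
        }.cache()  // reused in every iteration, so keep it in cluster memory

        var x = Array.fill(n)(0.0)
        for (_ <- 1 to 50) {
          val bx = sc.broadcast(x)  // ship the current iterate to all workers
          // Component-wise Jacobi update:
          //   x_i <- (b_i - sum_{j != i} a_ij * x_j) / a_ii
          val next = rows.map { case (i, a, bi) =>
            var s = 0.0
            var j = 0
            while (j < n) { if (j != i) s += a(j) * bx.value(j); j += 1 }
            (i, (bi - s) / a(i))
          }.collect().sortBy(_._1).map(_._2)
          bx.destroy()
          x = next
        }
        println(s"x(0) after 50 iterations: ${x(0)}")
        spark.stop()
      }
    }

Caching the rows RDD is the step that exploits Spark's persistence: without it, every
one of the 50 iterations would rebuild the matrix partitions from scratch.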
The performance evaluation results show that the Jacobi method on Spark can achieve
super-linear speedup due to the ability of the Spark cluster to cache large amounts of
data across the cluster nodes. The results also reveal that Jacobi on Spark achieves
substantially higher speedup and efficiency for very large matrices (of size 4000×4000
or larger). Our results compare favourably to those obtained in other projects
that implemented iterative methods on Hadoop. The Jacobi method on Spark
achieved much higher efficiency (seven times higher) than the Jacobi method on Hadoop
MapReduce for large matrix sizes, thanks to the caching feature offered by Spark,
which is not available in Hadoop MapReduce.
Category
Theses and Dissertations