Logistic Regression on Hadoop Using PySpark

Mahto K.K.; Ranichandra C

doi:10.1007/978-3-030-49339-4_3

Published in Springer

2021

Volume: 1180 AISC

Pages: 19 - 26

Abstract

Training a Machine Learning (ML) model on large datasets is difficult, especially when a high-end hardware configuration is not accessible. Even a relatively good configuration may not produce quick results: depending on the dataset size, training can take anywhere from seconds to several hours. More often than not, the tasks we are interested in involve big datasets and complex models. The purpose of our work was to see how effective Hadoop can be at increasing the efficiency of Machine Learning work on a given problem. Out of the many models available, Logistic Regression was chosen because it is relatively simple to implement. Three Logistic Regression models were implemented and trained on the MNIST Handwritten Digits dataset. The first was implemented in Python using NumPy, without any ML libraries. The second used the LogisticRegression class that comes with the Scikit-learn Python package, and the third was implemented using PySpark MLlib. Towards the end of the paper, we present the observations and results obtained from executing each implementation. © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021.
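To make the first of the three implementations more concrete, the following is a minimal sketch of multiclass (softmax) logistic regression written with NumPy only. It is not the authors' code: the function names, the learning rate, the epoch count, and the full-batch gradient descent optimizer are all assumptions made for illustration, and a small synthetic dataset stands in for MNIST so that the example runs in a few milliseconds.

```python
import numpy as np


def softmax(z):
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_softmax_regression(X, y, n_classes, lr=0.1, epochs=500):
    """Fit multinomial logistic regression with full-batch gradient descent.

    Hypothetical helper: lr and epochs are illustrative defaults, not values
    taken from the paper.
    """
    n_samples, n_features = X.shape
    W = np.zeros((n_features, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot encoded labels
    for _ in range(epochs):
        P = softmax(X @ W + b)        # predicted class probabilities
        grad = (P - Y) / n_samples    # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(X, W, b):
    # The class with the largest logit also has the largest probability.
    return np.argmax(X @ W + b, axis=1)


# Synthetic stand-in for MNIST (an assumption): three separated 2-D Gaussian blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
y = np.repeat(np.arange(3), 100)
X = centers[y] + rng.normal(scale=0.5, size=(300, 2))

W, b = train_softmax_regression(X, y, n_classes=3)
accuracy = float((predict(X, W, b) == y).mean())
print(f"training accuracy: {accuracy:.3f}")
```

For MNIST, `X` would instead hold the flattened 28 × 28 pixel images (784 features, typically scaled to [0, 1]) and `n_classes` would be 10. The library-based variants named in the abstract replace the training loop with a call to an existing estimator, which is where differences in execution time between the three approaches would come from.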