Header menu link for other important links
Logistic regression on hadoop using pyspark
Mahto K.K.,
Published in Springer
Volume: 1180 AISC
Pages: 19 - 26
Training a Machine Learning (ML) model on bigger datasets is a difficult task to accomplish, especially when a high-end configuration is not accessible. A relatively good configuration may also not always produce quick outcomes and depending on the dataset size, the time taken would be anything between seconds to several hours. More often, the tasks we are interested in involve big datasets and complex models. The purpose of our work was to see how effective Hadoop can be in terms of increasing the efficiency of working with Machine Learning for a given problem. Out of many models to choose from, Logistic Regression was chosen, which is relatively simpler to implement. Three Logistic Regression models were implemented and trained on MNIST Handwritten Digits dataset. First one was implemented in Python using NumPy without any ML libraries. The second implementation used LogisticRegression class that comes with the Scikit-learn Python package, and the third implementation was done using PySpark MLlib. Towards the end of the paper, we present the observations and results obtained from the execution of each. © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021.
About the journal
JournalData powered by TypesetAdvances in Intelligent Systems and Computing
PublisherData powered by TypesetSpringer
Open AccessNo