Machine Learning in HPC Environments

Workshop Program

Thursday

Our workshop program is now available on the SC website.

Date: Thursday, November 12, 2020

Time: 10:00am - 5:30pm

Workshop Introduction (10:00 am)

Seung-Hwan Lim, Oak Ridge National Laboratory

Keynote: Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning (10:10 am)

Timnit Gebru, Google

A growing body of work shows that many problems in fairness, accountability, transparency, and ethics in machine learning systems are rooted in decisions surrounding the data collection and annotation process. We argue that a new specialization should be formed within machine learning that is focused on methodologies for data collection and annotation: efforts that require institutional frameworks and procedures. Specifically for sociocultural data, parallels can be drawn from archives and libraries. Archives are the longest-standing communal effort to gather human information, and archive scholars have already developed the language and procedures to address and discuss many challenges pertaining to data collection, such as consent, power, inclusivity, transparency, and ethics and privacy. We discuss these five key approaches in document collection practices in archives that can inform data collection in sociocultural machine learning.

Coffee Break (11:10 am - 11:25 am)

EventGraD: Event-Triggered Communication in Parallel Stochastic Gradient Descent

Soumyadip Ghosh (University of Notre Dame), Vijay Gupta (University of Notre Dame)

Communication in parallel systems consumes a significant amount of time and energy, which often turns out to be a bottleneck in distributed machine learning. In this paper, we present EventGraD, an algorithm with event-triggered communication in parallel stochastic gradient descent. The main idea of this algorithm is to relax the requirement of communicating at every epoch to communicating only in certain epochs when necessary. In particular, the parameters are communicated only when the change in their values exceeds a threshold. The threshold for a parameter is chosen adaptively based on the rate of change of the parameter. The adaptive threshold ensures that the scheme can be applied to different models on different datasets without any change. We focus on data-parallel training of a popular convolutional neural network trained on the MNIST dataset and show that EventGraD can reduce the communication load by up to 70% while retaining the same level of accuracy.

[Presentation]
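
The event-triggered rule described in the EventGraD abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `EventTrigger`, the `horizon` and `min_threshold` parameters, and the specific threshold update (threshold proportional to the observed rate of change since the last send) are assumptions made for illustration, and a single scalar stands in for a model parameter.

```python
class EventTrigger:
    """Decides when one scalar parameter is worth communicating.

    Hypothetical sketch: the threshold rule below is an assumption,
    not the rule used in the EventGraD paper.
    """

    def __init__(self, initial_value, horizon=1.0, min_threshold=1e-3):
        self.last_sent = initial_value    # value the other workers currently hold
        self.last_step = 0                # step at which that value was sent
        self.horizon = horizon            # scales the adaptive threshold
        self.min_threshold = min_threshold
        self.threshold = min_threshold

    def should_send(self, value, step):
        """Return True (and record the send) if the change exceeds the threshold."""
        change = abs(value - self.last_sent)
        if change <= self.threshold:
            return False                  # no event: skip communication this step
        # Adapt the threshold to the rate of change observed since the last send.
        slope = change / max(step - self.last_step, 1)
        self.threshold = max(self.horizon * slope, self.min_threshold)
        self.last_sent = value
        self.last_step = step
        return True


def count_sends(trajectory, initial_value=0.0):
    """Count how many steps of a parameter trajectory trigger communication."""
    trigger = EventTrigger(initial_value)
    return sum(
        trigger.should_send(value, step)
        for step, value in enumerate(trajectory, start=1)
    )


# A toy parameter that converges towards 1.0 over 50 steps.
trajectory = [1 - 0.9 ** step for step in range(1, 51)]
print(count_sends(trajectory), "sends out of", len(trajectory), "steps")
```

In this toy run the parameter is sent on only a subset of the steps; the saving depends entirely on the assumed threshold rule and says nothing about the 70% figure reported in the abstract.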

A Benders Decomposition Approach to Correlation Clustering

Jovita Lukasik (University of Mannheim), Margret Keuper (University of Mannheim), Maneesh Singh (Verisk Analytics, Inc), Julian Yarkony (Verisk Analytics, Inc)

We tackle the problem of graph partitioning for image segmentation using correlation clustering (CC), which we treat as an integer linear program (ILP). We reformulate optimization in the ILP so as to admit efficient optimization via Benders decomposition, a classic technique from operations research. Our Benders decomposition formulation has many subproblems, each associated with a node in the CC instance's graph, which can be solved in parallel. Each Benders subproblem enforces the cycle inequalities corresponding to edges with negative (repulsive) weights attached to its corresponding node in the CC instance. We generate Magnanti-Wong Benders rows in addition to standard Benders rows to accelerate optimization. Our Benders decomposition approach provides a promising new avenue to accelerate optimization for CC, and, in contrast to previous cutting plane approaches, theoretically allows for massive parallelization.

[Presentation]
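
For readers unfamiliar with the problem, correlation clustering is commonly written as the following ILP. This is the standard multicut formulation, stated here as background under assumed notation ($x_e$, $\phi_e$, and $\mathcal{C}$ do not appear in the abstract); it is not the Benders reformulation proposed in the paper.

```latex
% Graph G = (V, E); x_e = 1 means edge e is cut (its endpoints lie in
% different clusters). phi_e > 0 is attractive, phi_e < 0 is repulsive.
\begin{align}
  \min_{x \in \{0,1\}^{|E|}} \quad & \sum_{e \in E} \phi_e \, x_e \\
  \text{s.t.} \quad & x_e \le \sum_{e' \in C \setminus \{e\}} x_{e'}
      \qquad \forall\, C \in \mathcal{C},\ \forall\, e \in C
\end{align}
% \mathcal{C} is the set of cycles of G. Each cycle inequality forbids
% cutting exactly one edge of a cycle, which keeps the cut edges
% consistent with a partition of the nodes.
```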

Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR

Seyedeh Mahdieh Ghazimirsaeed (Ohio State University), Quentin Anthony (Ohio State University), Aamir Shafi (Ohio State University), Hari Subramoni (Ohio State University), Dhabaleswar K. (DK) Panda (Ohio State University)

The growth of big data applications during the last decade has led to a surge in the deployment and popularity of machine learning libraries. On the other hand, the high performance offered by GPUs makes them well-suited for machine learning problems. To take advantage of the performance provided by GPUs for machine learning, NVIDIA has recently developed the cuML library. The cuML library is the GPU counterpart of Scikit-learn and provides similar Pythonic interfaces while hiding the complexities of writing compute kernels for GPUs directly using CUDA. To support execution of machine learning workloads on Multi-Node Multi-GPU (MNMG) systems, the cuML library exploits the NVIDIA Collective Communications Library (NCCL) as a backend for collective communications between the processes. On the other hand, MPI is a de facto standard for communication in HPC systems. Amongst various MPI libraries, MVAPICH2-GDR is the pioneer in optimizing communications for GPUs. This paper explores various aspects and challenges of providing MPI-based communication support for GPU-accelerated cuML applications. More specifically, it proposes a Python API to take advantage of MPI-based communications for cuML applications. It also gives an in-depth analysis, characterization, and benchmarking of the cuML algorithms such as K-Means, Nearest Neighbors, Random Forest, and tSVD. Moreover, it provides a comprehensive performance evaluation and profiling study for MPI-based versus NCCL-based communication for these algorithms. The evaluation results show that the proposed MPI-based communication approach achieves up to 38%, 20%, 20%, and 26% performance gain for K-Means, Nearest Neighbors, Linear Regression, and tSVD, respectively, on up to 32 GPUs.

[Presentation]

Lunch Break (12:55 PM - 2:30 PM)

Keynote: Programming Systems of Data (2:30 pm)

Michael Garland, NVIDIA

Machine learning and data analysis thrive on mass quantities of data. At the same time, the cost of data distribution and movement is among the most critical factors determining the performance of applications at scale. Consequently, scalable high-performance machine learning and data analysis require software environments that support the careful management of data. Whereas modern cloud systems provide data stores and services that help support efficient delivery of data to applications, the tools at hand for developers to efficiently manage distributed data within a running application are considerably more limited. It is particularly challenging to deliver high-performance execution across distributed nodes while maintaining software modularity and composability. In this talk, I will focus on developments in the design of scalable programming systems that help address these challenges through data-centric interfaces that offer a convenient notation to the developer and dynamic information to the runtime system tasked with scheduling the application at peak efficiency.

[Presentation]

Accelerate Distributed Stochastic Descent for Nonconvex Optimization with Momentum

Guojing Cong (IBM Corporation), Tianyi Liu (Georgia Institute of Technology)

The momentum method has been used extensively in optimizers for deep learning. Recent studies show that distributed training through K-step averaging has many nice properties. We propose a momentum method for such model averaging approaches. At each individual learner level, traditional stochastic gradient is applied. At the meta-level (global learner level), one momentum term is applied and we call it block momentum. We analyze the convergence and scaling properties of such momentum methods. Our experimental results show that block momentum not only accelerates training, but also achieves better results.

[Presentation]
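
The two-level scheme in the block momentum abstract (plain stochastic gradient steps at each learner, one momentum term at the global level after K-step averaging) can be sketched as below. The exact update rule of the paper is not given in the abstract, so the heavy-ball style global update, the function names, and the toy quadratic objectives are all assumptions chosen for illustration.

```python
def local_sgd(w, target, lr, k):
    """Run k plain gradient steps on the toy loss 0.5 * (w - target) ** 2."""
    for _ in range(k):
        grad = w - target          # exact gradient of the toy loss
        w -= lr * grad
    return w


def block_momentum_train(targets, rounds=50, k=4, lr=0.1, mu=0.5):
    """K-step averaging with a momentum term at the global (meta) level.

    Assumed update, not taken from the paper:
        velocity <- mu * velocity + (averaged_model - global_model)
        global_model <- global_model + velocity
    """
    w = 0.0                        # global model (a single scalar here)
    velocity = 0.0                 # block momentum buffer
    for _ in range(rounds):
        # Each learner starts from the global model and takes k local steps.
        local_models = [local_sgd(w, t, lr, k) for t in targets]
        averaged = sum(local_models) / len(local_models)
        # Treat the change produced by one averaging round as a "block" update.
        block_update = averaged - w
        velocity = mu * velocity + block_update
        w += velocity
    return w


# Two learners with different local optima; the averaged loss is minimised at 2.0.
print(block_momentum_train([1.0, 3.0]))
```

With `mu=0` the loop reduces to plain K-step averaging, which makes it easy to compare the two variants on the same toy problem.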

Coffee Break (4:00 PM - 4:15 PM)

High-bypass Learning: Automated Detection of Tumor Cells that Significantly Impact Drug Response

Justin Wozniak (Argonne National Laboratory), Hyunseung Yoo (Argonne National Laboratory), Jamaludin Mohd-Yusof (Los Alamos National Laboratory), Bogdan Nicolae (Argonne National Laboratory), Richard Turgeon (Argonne National Laboratory), Nick Collier (Argonne National Laboratory), Jonathan Ozik (Argonne National Laboratory), Thomas Brettin (Argonne National Laboratory), Rick Stevens (Argonne National Laboratory)

Machine learning in biomedicine is reliant on the availability of large, high-quality data sets. These corpora are used for training statistical or deep learning-based models that can be validated against other data sets and ultimately used to guide decisions. The quality of these data sets is an essential component of the quality of the models and their decisions. Thus, identifying and inspecting outlier data is critical for evaluating, curating, and using biomedical data sets. Many techniques are available to look for outlier data, but it is not clear how to evaluate the impact of such data on highly complex deep learning methods. In this paper, we use deep learning ensembles and workflows to construct a system for automatically identifying data subsets that have a large impact on the trained models. These effects can be quantified and presented to the user for further inspection, which could improve data quality overall. We then present results from running this method on the near-exascale Summit supercomputer.

Deep Generative Models that Solve PDEs: Distributed Computing for Training Large Data-Free Models

Sergio Botelho (RocketML Inc), Ameya Joshi (New York University), Biswajit Khara (Iowa State University), Vinay Rao (RocketML Inc), Soumik Sarkar (Iowa State University), Chinmay Hegde (New York University), Santi Adavani (RocketML Inc), Baskar Ganapathysubramanian (Iowa State University)

Recent progress in scientific machine learning (SciML) has opened up the possibility of training novel neural network architectures that solve complex partial differential equations (PDEs). Several (nearly data-free) approaches have been recently reported that successfully solve PDEs, with examples including deep feed-forward networks, generative networks, and deep encoder-decoder networks. However, practical adoption of these approaches is limited by the difficulty in training these models, especially to make predictions at large output resolutions (greater than or equal to 1024 x 1024). Here we report on a software framework for data-parallel distributed deep learning that resolves the twin challenges of training these large SciML models: training in reasonable time and distributing the storage requirements. Our framework provides several out-of-the-box features, including (a) loss integrity independent of the number of processes, (b) synchronized batch normalization, and (c) distributed higher-order optimization methods. We show excellent scalability of this framework on both cloud and HPC clusters, and report on the interplay between bandwidth, network topology, and bare metal vs. cloud. We deploy this approach to train generative models of sizes hitherto not possible, showing that neural PDE solvers can be viably trained for practical applications. We also demonstrate that distributed higher-order optimization methods are 2-3 times faster than stochastic gradient-based methods and provide minimal convergence drift with higher batch sizes.

[Presentation]