Our workshop program is now available on the SC website.
Date: Monday November 18, 2019
Time: 9:00am - 5:00pm
Seung-Hwan Lim, Oak Ridge National Laboratory
Torsten Hoefler, ETH Zürich
Jiali Li (University of Tennessee), Bogdan Nicolae (Argonne National Laboratory), Justin Wozniak (Argonne National Laboratory), George Bosilca (University of Tennessee)
In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. With the increasing complexity of learning models and amounts of training data, data-parallel approaches based on frequent all-reduce synchronization steps are increasingly popular. Despite the fact that high-performance computing (HPC) technologies have been designed to address such patterns efficiently, the behavior of data-parallel approaches on HPC platforms is not well understood. To address this issue, in this paper we study the behavior of Horovod, a popular data-parallel approach that relies on MPI, on Theta, a pre-Exascale machine at Argonne National Laboratory. Using two representative applications, we explore two aspects: (1) how performance and scalability are affected by important parameters such as the number of nodes, number of workers, threads per node, and batch size; (2) how computational phases are interleaved with all-reduce communication phases at fine granularity, and what consequences this interleaving has in terms of potential bottlenecks. Our findings show that pipelining of back-propagation, gradient reduction, and weight updates only partially mitigates the effects of stragglers during all-reduce. Furthermore, there can be significant delays between weight updates, which can be leveraged to mask the overhead of additional background operations that are coupled with the training.
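The pipelining behavior this abstract studies can be sketched with a toy timing model. This is purely illustrative: the per-layer compute and communication costs below are made-up numbers, and the model is not Horovod's actual scheduler.

```python
def total_time(backprop, allreduce, pipelined):
    """Toy step-time model for data-parallel training.

    Without pipelining, all all-reduce communication waits until
    back-propagation finishes. With pipelining, the all-reduce of a
    layer's gradients overlaps the back-propagation of the remaining
    layers, since gradients become available layer by layer.
    """
    if not pipelined:
        return sum(backprop) + sum(allreduce)
    t_compute = 0.0   # time at which each layer's gradient is ready
    t_comm = 0.0      # time at which the network link becomes free
    for b, a in zip(backprop, allreduce):
        t_compute += b                       # this layer's gradient is ready
        t_comm = max(t_comm, t_compute) + a  # its all-reduce starts when possible
    return t_comm

backprop  = [4.0, 3.0, 2.0, 1.0]   # hypothetical back-prop time per layer
allreduce = [1.0, 1.0, 2.0, 3.0]   # hypothetical all-reduce time per layer

sequential = total_time(backprop, allreduce, pipelined=False)  # 17.0
overlapped = total_time(backprop, allreduce, pipelined=True)   # 14.0
print(sequential, overlapped)
```

In this toy model the overlap hides only part of the communication, which mirrors the paper's finding that pipelining mitigates stragglers only partially: a slow all-reduce still delays the step once there is no remaining computation to hide it behind.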
J. Travis Johnston (Oak Ridge National Laboratory), Steven R. Young (Oak Ridge National Laboratory), Catherine D. Schuman (Oak Ridge National Laboratory), Junghoon Chae (Oak Ridge National Laboratory), Don D. March (Oak Ridge National Laboratory), Robert M. Patton (Oak Ridge National Laboratory), Thomas E. Potok (Oak Ridge National Laboratory)
As deep convolutional neural networks (CNNs) have become increasingly popular and successful at an ever-widening range of machine learning tasks, specialized hardware has become increasingly available for training and deploying them. NVIDIA's recent Volta architecture includes tensor cores, which perform a fused operation in reduced and mixed precision (16-bit multiply, 32-bit accumulate). Recent research indicates that, typically, very little is lost (in terms of training accuracy) when half precision is used in place of single precision, and that performance gains can be made by doing arithmetic in reduced precision. In this work we demonstrate that making layer-by-layer choices of arithmetic/data precision can lead to further performance improvements. In our study of 25,200 CNNs we demonstrate an average speedup (over purely half precision) of 1.27x, and speedups as high as 3.64x, by appropriately combining single- and half-precision arithmetic and data types on a layer-by-layer basis.
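A minimal NumPy sketch of why the 16-bit-multiply/32-bit-accumulate pattern matters. The vector length is an arbitrary illustration, and this emulates the precision choice in software rather than exercising tensor-core hardware:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float16)
b = rng.standard_normal(4096).astype(np.float16)

# Pure half precision: products and the running sum both stay in float16,
# so rounding error accumulates at float16 resolution.
sum_fp16 = np.float16(0.0)
for x, y in zip(a, b):
    sum_fp16 = np.float16(sum_fp16 + x * y)

# Mixed precision: float16 inputs, float32 accumulation (the tensor-core
# style fused operation), keeping the running sum at float32 resolution.
sum_mixed = np.float32(0.0)
for x, y in zip(a, b):
    sum_mixed += np.float32(x) * np.float32(y)

# High-precision reference over the same quantized inputs.
reference = np.dot(a.astype(np.float64), b.astype(np.float64))
print(abs(float(sum_fp16) - reference), abs(float(sum_mixed) - reference))
```

The mixed-precision sum tracks the float64 reference far more closely than the all-float16 sum, which is the accuracy headroom that lets per-layer precision choices trade speed against error.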
Greg Heinrich (NVIDIA), Iuri Frosio (NVIDIA)
Training intelligent agents through reinforcement learning (RL) is a notoriously unstable procedure. Massive parallelization on GPUs and distributed systems has been exploited to generate a large amount of training experiences and consequently reduce instabilities, but the success of training remains strongly influenced by the choice of hyperparameters. To overcome this issue, we introduce HyperTrick, a new metaoptimization algorithm, and show its effective application to tuning hyperparameters for deep RL while learning to play different Atari games on a distributed system. Our analysis provides evidence of the interaction between the identification of the optimal hyperparameters and the learned policy, which is peculiar to metaoptimization for deep RL. When compared with state-of-the-art metaoptimization algorithms, HyperTrick is characterized by a simpler implementation and allows learning similar policies while making more effective use of the computational resources in a distributed system.
Gabriel Laberge (Polytechnique Montreal), Shahrzad Shirzad (Louisiana State University), Patrick Diehl (Louisiana State University), Hartmut Kaiser (Louisiana State University), Serge Prudhomme (Polytechnique Montreal), Adrian S. Lemoine (Louisiana State University)
Linear algebra algorithms are used widely in a variety of domains, e.g. machine learning, numerical physics and video game graphics. For all these applications, loop-level parallelism is required to achieve high performance. However, finding the optimal way to schedule the workload between threads is a non-trivial problem, because it depends on the structure of the algorithm being parallelized and the hardware the executable is run on. In the realm of Asynchronous Many Task runtime systems, a key aspect of the scheduling problem is predicting the proper chunk-size, defined as the number of iterations of a for-loop that are assigned to a thread as one task. In this paper, we study the application of supervised learning models to predicting the chunk-size which yields maximum performance on multiple parallel linear algebra operations, using the HPX backend of Blaze's linear algebra library. More precisely, we generate our training and test sets by measuring the performance of the application with different chunk-sizes for multiple linear algebra operations: vector addition, matrix-vector multiplication, matrix-matrix addition and matrix-matrix multiplication. We compare the use of logistic regression, neural networks and decision trees with a newly developed decision-tree-based model in order to predict the optimal value of the chunk-size. Our results show that classical decision trees and our custom decision-tree model are able to forecast a chunk-size which results in good performance for the linear algebra operations.
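The chunk-size trade-off underlying this prediction problem, where small chunks pay per-task scheduling overhead and large chunks leave threads idle, can be illustrated with a toy cost model. The iteration counts, thread count, and costs below are invented for illustration; this is not the Blaze/HPX scheduler.

```python
import math

def step_time(n_iters, chunk, n_threads, t_iter=1.0, t_task=20.0):
    """Estimated parallel-for time for a given chunk size.

    Each task covers `chunk` iterations plus a fixed scheduling overhead
    `t_task`; tasks are spread evenly over `n_threads` threads, so the
    step time is the per-task cost times the tasks on the busiest thread.
    """
    n_tasks = math.ceil(n_iters / chunk)
    tasks_on_busiest_thread = math.ceil(n_tasks / n_threads)
    return tasks_on_busiest_thread * (chunk * t_iter + t_task)

# Exhaustively find the best chunk size under this toy model.
best = min(range(1, 1001), key=lambda c: step_time(10_000, c, n_threads=16))
print(best, step_time(10_000, best, 16))
```

Under these assumed costs the optimum lands at exactly one task per thread (chunk 625 for 10,000 iterations on 16 threads). Real machines are less tidy, which is why the paper learns the mapping from operation and problem size to chunk-size from measured data instead of from a closed-form model.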
Gordon E. Moon (Ohio State University), Denis Newman-Griffis (Ohio State University), Jinsung Kim (University of Utah), Aravind Sukumaran-Rajam (Ohio State University), Eric Fosler-Lussier (Ohio State University), P. Sadayappan (University of Utah)
The Word2Vec model is a neural network-based unsupervised word embedding technique widely used in applications such as natural language processing, bioinformatics and graph mining. As Word2Vec repeatedly performs Stochastic Gradient Descent (SGD) to minimize the objective function, it is very compute-intensive. However, existing methods for parallelizing Word2Vec are not sufficiently optimized for data locality to achieve high performance. In this paper, we develop a parallel, data-locality-enhanced Word2Vec algorithm based on Skip-gram, with a novel negative sampling method that decouples the loss calculation for positive and negative samples; this allows us to efficiently reformulate the computation for the negative samples as matrix-matrix operations over the sentence. Experimental results demonstrate that our parallel implementations on multi-core CPUs and GPUs achieve significant performance improvement over the existing state-of-the-art parallel Word2Vec implementations while maintaining evaluation quality. We also show the utility of our Word2Vec implementation within the Node2Vec algorithm, which accelerates embedding learning for large graphs.
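The key locality idea, sharing one set of negative samples across the words of a sentence so that the per-pair dot products collapse into a single matrix-matrix multiply, can be sketched as follows. The dimensions are hypothetical and this is not the paper's actual kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sent_len, n_neg = 64, 12, 5
W_in  = rng.standard_normal((sent_len, d))  # embeddings of the sentence words
W_neg = rng.standard_normal((n_neg, d))     # one shared set of negative samples

# Baseline: one dot product per (word, negative sample) pair,
# with poor data reuse across pairs.
scores_loop = np.empty((sent_len, n_neg))
for i in range(sent_len):
    for j in range(n_neg):
        scores_loop[i, j] = W_in[i] @ W_neg[j]

# Same scores from a single matrix-matrix multiply over the sentence,
# which a BLAS-like GEMM can execute with far better locality.
scores_gemm = W_in @ W_neg.T
print(np.allclose(scores_loop, scores_gemm))
```

The two computations produce identical scores; the win is purely in memory access pattern, since the GEMM reuses each negative-sample vector across the whole sentence.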
Raju Ram (Fraunhofer Institute for Industrial Mathematics), Sabine Müller (Fraunhofer Institute for Industrial Mathematics), Franz-Josef Pfreundt (Fraunhofer Institute for Industrial Mathematics), Nicolas R. Gauger (University of Kaiserslautern), Janis Keuper (Fraunhofer Institute for Industrial Mathematics)
Most machine learning methods require careful selection of hyper-parameters in order to train a high-performing model with good generalization abilities. Hence, several automatic selection algorithms have been introduced to overcome the tedious manual (trial-and-error) tuning of these parameters. Due to its very high sample efficiency, Bayesian Optimization over a Gaussian Process model of the parameter space has become the method of choice. Unfortunately, this approach suffers from a cubic compute complexity due to the underlying Cholesky factorization, which makes it very hard to scale beyond a small number of sampling steps. In this paper, we present a novel, highly accurate approximation of the underlying Gaussian Process. Reducing its computational complexity from cubic to quadratic allows efficient strong scaling of Bayesian Optimization while outperforming the previous approach in optimization accuracy. First experiments show speedups of a factor of 162 on a single node and a further speedup by a factor of 5 in a parallel environment.
Julie Bernauer, NVIDIA
From climate modelling to drug design, AI models are now fully part of scientific modelling, and they are getting more complex and larger every year. The adoption of challenging workloads like the BERT language model and the popularity of deep learning performance blogs and benchmarks such as MLPerf highlight the importance of being able to quickly train and tune such models. Until recently, system design for HPC and AI was often done in isolation, as the requirements for the platforms were different, making large scientific experimentation difficult. To overcome these gaps, systems are now designed with AI software in mind, and scale is introduced in the software design from the ground up, so that each model running at the edge can be trained in minutes at scale. In this talk we will cover how software leverages the inherent scaling nature of large models, and how HPC infrastructures can be built and leveraged as the ideal platforms for fast experimentation and large problems.
Avraam Chatzimichailidis (Fraunhofer Institute for Industrial Mathematics), Janis Keuper (Fraunhofer Institute for Industrial Mathematics), Franz-Josef Pfreundt (Fraunhofer Institute for Industrial Mathematics), Nicolas R. Gauger (University of Kaiserslautern)
Current training methods for deep neural networks boil down to very high-dimensional and non-convex optimization problems which are usually solved by a wide range of stochastic gradient descent methods. While these approaches tend to work in practice, there are still many gaps in the theoretical understanding of key aspects like convergence and generalization guarantees, which are induced by the properties of the optimization surface (loss landscape). In order to gain deeper insights, a number of recent publications have proposed methods to visualize and analyze these optimization surfaces. However, the computational cost of these methods is very high, making it hardly possible to use them on larger networks. In this paper, we present the GradVis Toolbox, an open source library for efficient and scalable visualization and analysis of deep neural network loss landscapes in TensorFlow and PyTorch. Introducing more efficient mathematical formulations and a novel parallelization scheme, GradVis allows plotting of 2D and 3D projections of optimization surfaces and trajectories, as well as high-resolution second-order gradient information for large networks.
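The basic visualization idea, evaluating the loss over a 2-D plane spanned by two random directions around the current parameters, can be sketched on a toy quadratic loss. This illustrates the projection technique only; it is not GradVis itself, and the quadratic "network" is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100
A = rng.standard_normal((dim, dim))
H = A @ A.T / dim                  # symmetric positive definite toy Hessian

def loss(theta):
    """Convex quadratic stand-in for a network's loss function."""
    return 0.5 * theta @ H @ theta

theta0 = np.zeros(dim)             # "trained" parameters, here the exact minimum

# Two normalized random directions spanning the visualization plane.
d1 = rng.standard_normal(dim); d1 /= np.linalg.norm(d1)
d2 = rng.standard_normal(dim); d2 /= np.linalg.norm(d2)

# Sample the loss on a 21x21 grid in the plane theta0 + a*d1 + b*d2.
grid = np.linspace(-1.0, 1.0, 21)
surface = np.array([[loss(theta0 + a * d1 + b * d2) for b in grid]
                    for a in grid])
print(surface[10, 10], surface.max())  # center of the grid sits at the minimum
```

Each grid point costs a full loss evaluation of the network, which is why the paper's efficiency and parallelization improvements matter as soon as the model is large.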
Adam Rupe (University of California, Davis), Nalini Kumar (Intel), Vladislav Epifanov (Intel), Karthik Kashinath (Lawrence Berkeley National Laboratory), Oleksandr Pavlyk (Intel), Frank Schlimbach (Intel), Mostofa Patwary (Baidu USA), Sergey Maidanov (Intel), Victor Lee (Intel), Prabhat (Lawrence Berkeley National Laboratory), James P. Crutchfield (University of California, Davis)
Extracting actionable insight from complex unlabeled scientific data is an open challenge and key to unlocking data-driven discovery in science. Complementary and alternative to supervised machine learning approaches, unsupervised physics-based methods based on behavior-driven theories hold great promise. Due to computational limitations, practical application on real-world domain science problems has lagged far behind theoretical development. However, powerful modern supercomputers provide the opportunity to narrow the gap between theory and practical application. We present our first step towards bridging this divide: DisCo, a high-performance distributed workflow for the behavior-driven local causal state theory. DisCo provides a scalable unsupervised physics-based representation learning method that decomposes spatiotemporal systems into their structurally relevant components, which are captured by the latent local causal state variables. In several firsts we demonstrate the efficacy of DisCo in capturing physically meaningful coherent structures from observational and simulated scientific data. To the best of our knowledge, DisCo is also the first application software developed entirely in Python to scale to over 1000 machine nodes, providing good performance along with ensuring domain scientists' productivity. Our capstone experiment, using the newly developed and optimized DisCo workflow and libraries, performs unsupervised spacetime segmentation analysis of CAM5.1 climate simulation data, processing an unprecedented 89.5 TB in 6.6 minutes end-to-end using 1024 Intel Haswell nodes on the Cori supercomputer, obtaining 91% weak-scaling and 64% strong-scaling efficiency. This enables us to achieve state-of-the-art unsupervised segmentation of coherent spatiotemporal structures in complex fluid flows.