GPU programming for Machine Learning and Data Processing
This course is a primer on using accelerators, specifically GPUs. In most laptop CPUs these days, the integrated GPU takes up more silicon area than the rest of the chip, and definitely more than the actual CPU cores (excluding caches). For deep learning-style computations, accelerators like GPUs beat CPU-based implementations hands down, allowing faster iteration over possible model concepts.
We will consider several ways to access the power of GPUs from Python and C++, both with ready-made frameworks for deep learning and with code that gives more freedom in expressing the operations you want to perform. Cursory familiarity with these languages is expected, but we will emphasize techniques that let the programmer focus on semantics rather than arcane syntax and hardware details. The course is intended to sample a number of libraries and technologies related to GPU programming, for students with no previous experience of them, or with experience of only a single framework.
Alongside the strong focus on accelerator programming for machine learning algorithms, the course will also cover aspects of large-scale machine learning, in particular the need for continuous analysis of large data volumes. This requires an understanding of the available infrastructures, tools and technologies, and of strategies for efficient data access and management. The course will cover model serving and the many-task computing model for different machine learning algorithms.
The course is divided into three activity types: the preparatory week, the live week (remote option available), and the post-live project. To make the preparatory work relevant, it will consist of a number of prelabs (which can be executed in an HPC environment hosted in Uppsala), with accompanying literature. Thus, all students will be expected to have at least seen some of the technologies used when coming to the first lecture.
The purpose of the project is for students, ideally, to try out some of the technologies covered in the course on an application relevant to their research, and to write a report that explains the problem solved, the approach used, and any efficiency gains, and that discusses possible alternative choices, advantages, and drawbacks.
Course details will be hosted on https://github.com/scicompuu/sesegpu (at the time of writing, the information shown is for the course instance hosted in June 2020).
Main teacher: Carl Nettelblad (firstname.lastname@example.org)
Tentative lecture and lab schedule:
The live part of the course is given during the week of November 8-12, 2021 in Uppsala, with the possibility of remote participation. The prelabs should be completed before the week starts. The project is completed independently in the weeks that follow.
|Prelab 1||Work through prewritten TensorFlow code that trains a deep learning network. Compare performance for various settings on GPU and CPU.|
|Lecture 1||Why do we care about GPUs? What are the differences in bandwidth, parallelism and latency between a modern CPU and a GPU? What is a “good problem for a GPU”?|
|Lecture 2||TensorFlow for deep learning. Computation as a graph, auto-differentiation. Brief explanation of stochastic gradient descent.|
|Lecture 3||TensorFlow and other high-level frameworks. Using TensorFlow to express non-Deep Learning computations (optimization problems and other computational graphs), data generators. Comparison to other Python-based frameworks such as PyTorch, Caffe and MXNet, pros and cons.|
|Lab 1||Using TensorFlow for optimizing two simple models. Note performance differences from “irrelevant” changes. Using TensorFlow to express e.g. a cellular automaton simulation.|
|Prelab 2||Read and explain what a piece of code seems to be doing in CUDA, Thrust, and OpenMP Target. Run the codes and note differences in behavior. Start work on identifying computation-heavy tasks in a numpy-based code (given in the lab, or some other code) and determine whether they are truly computation-bound inside numpy operations.|
|Lecture 4||The early days of “GPGPU”. History of shaders, the introduction of CUDA, OpenCL. Practical concerns in CUDA invocation and compilation models. CUDA contrasted against OpenMP Target. Nvidia Thrust as an example of a higher-level library.|
|Lecture 5||Immediate tensor-based frameworks. Versatility of maintaining the same syntax and approach as numpy. Possible bottlenecks. How to keep computations on the GPU. Expressing arbitrary GPU computations in Python. Comparison of afnumpy, CuPy, MinPy, and Numba.|
|Lab 2||Choose a problem (based on examples or prelab) and a framework and try to implement working code. Time to catch up on Lab 1 as well.|
|Prelab 3||Read and understand the basic concepts of distributed infrastructures (clusters and clouds), the different layers (IaaS, PaaS and SaaS), and different service deployment models.|
|Lecture 6||Introduction to distributed computing infrastructures, services and different deployment models, and the role of dynamic contextualization and orchestration for large-scale deployments.|
|Lecture 7||Introduction to different frameworks and strategies for scalable deployments using Ansible, Kubernetes and similar frameworks.|
|Lab 3||Implement a first cloud service based on dynamic contextualization and orchestration.|
|Lecture 9||Guest lecture on accelerator-based and/or distributed challenges in practice|
|Lab 4||Large-scale model training and serving using Ansible and Kubernetes.|
|Prelab 5||Brainstorm a possible project that would be relevant for the course content.|
|Lecture 10||Debugging and profiling. What can we learn about GPU behavior? What bottlenecks do we know from theory and how can we demonstrate and measure them in practice? What tools are available in the frameworks we have considered so far?|
|Lecture 11||Scope of project, desired structure of report, things to consider, possible pitfalls.|
|Lab 5||Continue with previous labs, ask teachers about possible project ideas, possibly trying out quick ideas to test their feasibility.|
|Lecture 12||Finalization of course. Quick summary. Discussing tentative project ideas in the full group (“I will probably work on X”). Course evaluation.|