GPU programming for Machine Learning and Data Processing
This course is a primer on using accelerators, specifically GPUs. In most laptop CPUs these days, the integrated GPU occupies more silicon area than the rest of the chip, and definitely more than the actual CPU cores (excluding caches). For deep learning-style computations, accelerators like GPUs beat CPU-based implementations hands down, allowing for faster iteration over candidate model designs.
We will consider several ways to access the power of GPUs, both with ready-made frameworks for deep learning and with code, written in Python and C++, that gives more freedom in expressing the operations you want to perform. Cursory familiarity with these languages is expected, but we will specifically try to emphasize techniques that allow the programmer to focus on semantics rather than arcane syntax and hardware details. The course is intended to sample a number of libraries and technologies related to GPU programming, for students with no previous experience of them, or with experience of only a single framework.
Alongside the strong focus on accelerator programming for machine learning algorithms, the course will also cover aspects of large-scale machine learning, in particular the need for continuous analysis of large data volumes. This requires an understanding of the available infrastructures, tools and technologies, as well as strategies for efficient data access and management. The course will cover model serving and the many-task computing model for different machine learning algorithms.
The course is divided into three activity types: the preparatory week, the on-site week, and the project that follows the on-site week. To make the preparatory work relevant, it will consist of a number of prelabs (which can be executed in an HPC environment hosted in Uppsala), with accompanying literature. Thus, all students will be expected to have at least seen some of the technologies used before coming to the first lecture.
The purpose of the project is that students should, ideally, be able to try out some of the technologies covered in the course on an application relevant to their research. The resulting report should explain the problem solved, the approach used, and any efficiency gains, and discuss possible alternative choices, advantages, and drawbacks.
Main teacher: Carl Nettelblad (email@example.com)
Assistant teacher: Salman Toor (Salman.Toor@it.uu.se)
Other PhD students and teachers may be involved during labs, depending on the size of the course. The number of students admitted will depend on the hardware resources available at the time.
Tentative lecture and lab schedule:
|Prelab 1||Work through prewritten TensorFlow code that trains a deep learning network. Compare performance for various settings on GPU and CPU.|
|Lecture 1||Why do we care about GPUs? What are the differences in bandwidth, parallelism and latency between a modern CPU and a GPU? What is a “good problem for a GPU”?|
|Lecture 2||TensorFlow for deep learning. Computation as a graph, auto-differentiation. Brief explanation of stochastic gradient descent.|
|Lecture 3||TensorFlow and other high-level frameworks. Using TensorFlow to express non-deep-learning computations (optimization problems and other computational graphs), data generators. Comparison to other Python-based frameworks such as PyTorch, Caffe and MXNet, pros and cons.|
|Lab 1||Using TensorFlow for optimizing two simple models. Note performance differences from “irrelevant” changes. Using TensorFlow to express e.g. a cellular automaton simulation.|
|Prelab 2||Read code samples in CUDA, Thrust, SYCL, and OpenMP Target and explain what they seem to be doing. Run the codes and note differences in behavior. Start work on identifying computation-heavy tasks in a NumPy-based code (given in lab or some other code) and determine whether they are truly computation-bound inside NumPy operations.|
|Lecture 4||The early days of “GPGPU”. History of shaders, the introduction of CUDA, OpenCL. Practical concerns in CUDA invocation and compilation models. CUDA contrasted against SYCL and OpenMP Target. Nvidia Thrust as an example of a higher-level library.|
|Lecture 5||Immediate tensor-based frameworks. Versatility of maintaining the same syntax and approach as NumPy. Possible bottlenecks. How to keep computations on the GPU. Expressing arbitrary GPU computations in Python. Comparison of afnumpy, CuPy, MinPy, and Numba.|
|Lab 2||Choose a problem (based on examples or prelab) and a framework and try to implement working code. Time to catch up on Lab 1 as well.|
|Prelab 3||Read and understand the basic concepts of distributed infrastructures (clusters and clouds), the different service layers (IaaS, PaaS and SaaS), and different service deployment models.|
|Lecture 6||Introduction to distributed computing infrastructures, services and different deployment models. The role of dynamic contextualization and orchestration for large-scale deployments.|
|Lecture 7||Introduction to different frameworks and strategies for scalable deployments, using Ansible, Kubernetes and similar tools.|
|Lab 3||Implement a first cloud service based on dynamic contextualization and orchestration.|
|Lecture 9||Guest lecture. Possible topic: challenges in large-scale machine learning model deployment and training.|
|Lab 4||Large-scale model training and serving using Ansible and Kubernetes.|
|Prelab 5||Brainstorm a possible project that would be relevant for the course content.|
|Lecture 10||Data flow CPU<=>GPU. Explicit or implicit data movement, the fractal nature of data locality. GPU interconnects. Data parallelism vs. model parallelism. Debugging and profiling. What can we learn about GPU behavior? What tools are available in the frameworks we have considered so far?|
|Lecture 11||Scope of project, desired structure of report, things to consider, possible pitfalls.|
|Lab 5||Continue with previous labs, ask teacher and TA about possible project ideas, possibly trying out quick ideas to test their feasibility.|
|Lecture 12||Finalization of course. Quick summary. Discussing tentative project ideas in the full group (“I will probably work on X”). Course evaluation.|
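To give a flavor of the array-style computations mentioned in the schedule (the cellular automaton in Lab 1, and keeping NumPy syntax in Lecture 5), the following is a minimal sketch of one update step of Conway's Game of Life written with NumPy. The choice of Game of Life, the function name `life_step`, and the wrap-around boundary are assumptions made for illustration; they are not taken from the course material.

```python
import numpy as np

def life_step(grid):
    """Advance a 2D Game of Life grid (0/1 integers) by one generation.

    Assumption: the grid wraps around at the edges (toroidal boundary).
    """
    # Count the eight neighbours of every cell by summing shifted copies.
    neighbours = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell is alive in the next generation if it has exactly 3 neighbours,
    # or if it is alive now and has exactly 2 neighbours.
    alive = (neighbours == 3) | ((grid == 1) & (neighbours == 2))
    return alive.astype(grid.dtype)

# A "blinker": three cells in a row that oscillate with period 2.
grid = np.zeros((5, 5), dtype=np.int8)
grid[2, 1:4] = 1
after_one = life_step(grid)       # the row becomes a column
after_two = life_step(after_one)  # and returns to the original row
```

The step uses only whole-array operations and no per-cell Python loop, which is the style that NumPy-like GPU array libraries aim to accelerate. Whether a particular library supports every call used here should be checked against its documentation.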