Graphs, potentially extensive webs of nodes connected by edges, can be used to express and explore relationships among data, such as social connections, financial transactions, traffic, energy networks, and molecular interactions. As researchers collect more data and build out these graphs, they will need faster and more efficient methods, as well as more computational power, to conduct deep learning on them in the form of graph neural networks (GNNs).
Now, a new method called SALIENT (SAmpling, sLIcing, and data movemeNT), developed by researchers at MIT and IBM Research, improves training and inference performance by addressing three key computational bottlenecks. This dramatically shortens the runtime of GNNs on large datasets, which can contain, for example, on the order of 100 million nodes and 1 billion edges. The team also found that the technique scales well as computing power is added, from one to 16 graphics processing units (GPUs). The work was presented at the Fifth Conference on Machine Learning and Systems.
“We started looking at the challenges current systems face when scaling up advanced machine learning techniques for graphs to really large datasets. As it turned out, there was a lot of work to be done, because many of the existing systems achieved good performance primarily on smaller datasets that fit into GPU memory,” says Tim Kaler, the lead author and a postdoc in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).
By huge datasets, experts mean scales like the entire Bitcoin network, where certain patterns and data relationships can describe trends or foul play. “There are nearly a billion Bitcoin transactions on the blockchain, and if we want to identify illicit activity within such a network, then we are dealing with a graph of that scale,” says co-author Jie Chen, a senior researcher and manager at IBM Research and the MIT-IBM Watson AI Lab. “We want to build a system that can handle those kinds of graphs and make the processing as efficient as possible, because we want to keep up with the pace of the new data being generated every day.”
Kaler and Chen’s co-authors include Jump Trading’s Nickolas Stathas MEng ’21, who developed SALIENT as part of his graduate work; former MIT-IBM Watson AI Lab intern and MIT graduate Anne Ouyang; MIT CSAIL postdoc Alexandros-Stavros Iliopoulos; MIT CSAIL researcher Tao B. Schardl; and Charles E. Leiserson, the Edwin Sibley Webster Professor of Electrical Engineering at MIT and a researcher at the MIT-IBM Watson AI Lab.
In developing SALIENT, the team took a systems-oriented approach, says Kaler. To do this, the researchers implemented what they saw as major, basic optimizations of components that fit into existing machine-learning frameworks, such as PyTorch Geometric and the Deep Graph Library (DGL), which are interfaces for building machine-learning models. Stathas says the process is like swapping out an engine to build a faster car. Their method was designed to fit into existing GNN architectures, so that domain experts could easily apply this work to their fields to accelerate model training and uncover insights faster during inference. The trick, the team determined, was to keep all of the hardware (CPUs, data links, and GPUs) busy at all times: while the CPU samples the graph and prepares mini-batches of data that are then transferred over the data link, the GPU simultaneously works on training the machine-learning model or performing inference, as sketched below.
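To make the idea concrete, here is a minimal sketch of that kind of CPU/GPU pipelining in PyTorch-style Python, using a background thread and a bounded queue to overlap mini-batch preparation with GPU work. The functions `sample_minibatch` and `train_step` are hypothetical placeholders, not SALIENT's actual API.

```python
import queue
import threading
import torch

def pipelined_epoch(sample_minibatch, train_step, num_batches, depth=4):
    """Overlap CPU-side mini-batch preparation with GPU training.

    `sample_minibatch(i)` is assumed to do graph sampling and feature
    slicing on the CPU; `train_step(batch)` runs the GNN forward/backward
    pass on the GPU. Both are hypothetical placeholders.
    """
    batches = queue.Queue(maxsize=depth)  # bounded queue limits in-flight batches

    def producer():
        for i in range(num_batches):
            batches.put(sample_minibatch(i))  # CPU work happens here
        batches.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()

    while True:
        batch = batches.get()
        if batch is None:
            break
        # Copy to the GPU and run the training step; meanwhile the producer
        # thread keeps sampling and preparing the next mini-batches.
        batch = batch.to("cuda", non_blocking=True)
        train_step(batch)
```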
The researchers began by analyzing the performance of a widely used machine-learning library for GNNs (PyTorch Geometric), which showed surprisingly low utilization of available GPU resources. By applying simple optimizations, the researchers improved GPU utilization from 10 to 30 percent, resulting in a performance increase of 1.4 to two times over public benchmark codes. This fast baseline code could complete one full pass through the algorithm over a large training dataset (an epoch) in 50.4 seconds.
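For context, a baseline sampling-based training epoch with PyTorch Geometric typically looks something like the sketch below; the model, fanout, and batch size are illustrative assumptions, and `data` is assumed to be a `torch_geometric.data.Data` object with a `train_mask`.

```python
import time
import torch
import torch.nn.functional as F
from torch_geometric.loader import NeighborLoader
from torch_geometric.nn import SAGEConv

class SAGE(torch.nn.Module):
    """Two-layer GraphSAGE model (illustrative dimensions and depth)."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hid_dim)
        self.conv2 = SAGEConv(hid_dim, out_dim)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

def run_epoch(model, data, optimizer, device="cuda"):
    # Neighborhood sampling with an assumed fanout of 15 and 10 neighbors
    # for the two layers; seed nodes come from the training mask.
    loader = NeighborLoader(data, num_neighbors=[15, 10], batch_size=1024,
                            input_nodes=data.train_mask, num_workers=4)
    model.train()
    start = time.time()
    for batch in loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        # Only the first `batch.batch_size` nodes are the seed nodes whose
        # labels this mini-batch is trained on.
        out = model(batch.x, batch.edge_index)[:batch.batch_size]
        loss = F.cross_entropy(out, batch.y[:batch.batch_size].view(-1))
        loss.backward()
        optimizer.step()
    return time.time() - start  # per-epoch runtime in seconds
```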
Looking for further performance improvements, the researchers set out to identify the bottlenecks at the beginning of the data pipeline: the algorithms for graph sampling and mini-batch preparation. Unlike other neural networks, GNNs perform a neighborhood aggregation operation, which computes information about a node using information from other nearby nodes in the graph (for example, in a social network graph, information from the friends of a user's friends). As the number of layers in the GNN increases, the number of nodes the network must reach for information can explode, pushing a computer's limits. Neighborhood sampling algorithms help by selecting a smaller random subset of nodes to gather; however, the researchers found that current implementations of this were too slow to keep up with the processing speed of modern GPUs. In response, they identified a mix of data structures, algorithmic optimizations, and so on that improved sampling speed, ultimately improving the sampling operation alone by about three times and reducing the per-epoch runtime from 50.4 to 34.6 seconds. They also found that sampling, at an appropriate rate, can be done during inference, improving overall energy efficiency and performance, a point that had been overlooked in the literature, the team notes.
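Conceptually, fanout-based neighborhood sampling restricts each GNN layer to a small random subset of neighbors. The sketch below illustrates the idea on a plain adjacency-list graph; it is a simplified illustration, not SALIENT's optimized sampler.

```python
import random

def sample_neighborhood(adj, seeds, fanouts):
    """Sample a multi-hop neighborhood for a mini-batch of seed nodes.

    `adj[v]` is the neighbor list of node v, `seeds` are the mini-batch
    nodes, and `fanouts` gives the number of neighbors kept per GNN layer
    (e.g., [15, 10] for a two-layer network).
    """
    layers = [list(seeds)]
    frontier = set(seeds)
    for fanout in fanouts:
        next_frontier = set()
        for v in frontier:
            neighbors = adj[v]
            if len(neighbors) > fanout:
                # Keep at most `fanout` randomly chosen neighbors of v.
                neighbors = random.sample(neighbors, fanout)
            next_frontier.update(neighbors)
        layers.append(list(next_frontier))
        frontier = next_frontier
    return layers  # the nodes needed at each hop to compute the seeds' outputs
```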
In previous systems, this sampling step was a multi-process approach, creating extra data copies and unnecessary data movement between processes. The researchers made their SALIENT method more agile by creating a single process with lightweight threads that keep the data on the CPU in shared memory. Furthermore, SALIENT takes advantage of the caches of modern processors, says Stathas, by parallelizing feature slicing, which extracts relevant information from nodes of interest and their surrounding neighbors and edges, within the shared memory of the CPU core cache. This again reduced the overall per-epoch runtime from 34.6 to 27.8 seconds.
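A rough sketch of thread-parallel feature slicing over a shared, CPU-resident feature matrix might look like the following; the chunking scheme and thread count are illustrative assumptions rather than SALIENT's implementation.

```python
from concurrent.futures import ThreadPoolExecutor
import torch

def slice_features(features, node_ids, num_threads=8):
    """Gather the feature rows for `node_ids` from a large shared tensor.

    `features` is the full node-feature matrix, held once in CPU memory;
    `node_ids` is a LongTensor of the sampled nodes. Each lightweight
    thread copies a contiguous chunk of rows into a preallocated output
    buffer (PyTorch generally releases the GIL inside such kernels, so the
    copies can proceed in parallel).
    """
    out = torch.empty((len(node_ids), features.size(1)), dtype=features.dtype)
    chunks = torch.chunk(torch.arange(len(node_ids)), num_threads)

    def copy_chunk(idx):
        out[idx] = features[node_ids[idx]]  # row gather for one chunk

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(copy_chunk, chunks))
    return out
```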
The final bottleneck the researchers addressed was to pipeline mini-batch data transfers between the CPU and GPU using a prefetching step, which prepares data just before it is needed. The team calculated that this would maximize bandwidth usage on the data link and bring the method to perfect utilization; however, they saw only about 90 percent. They identified and fixed a performance bug in a popular PyTorch library that was causing unnecessary round-trip communication between the CPU and GPU. With this bug fixed, the team achieved a runtime of 16.5 seconds per epoch with SALIENT.
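The general prefetching pattern in PyTorch, overlapping the host-to-device copy of the next mini-batch with compute on the current one via pinned memory and a side CUDA stream, can be sketched as follows; `batches` and `train_step` are hypothetical placeholders, and this illustrates the technique rather than SALIENT's exact implementation.

```python
import torch

def prefetching_loop(batches, train_step, device="cuda"):
    """Overlap host-to-device transfers with GPU compute.

    `batches` yields CPU tensors and `train_step` runs the GPU work; both
    are placeholders. Pinned memory allows the asynchronous copy on the
    side stream to proceed while the default stream computes.
    """
    copy_stream = torch.cuda.Stream()
    it = iter(batches)

    def transfer(batch):
        with torch.cuda.stream(copy_stream):
            return batch.pin_memory().to(device, non_blocking=True)

    next_gpu_batch = transfer(next(it))        # prefetch the first batch
    for cpu_batch in it:
        torch.cuda.current_stream().wait_stream(copy_stream)  # batch is ready
        gpu_batch = next_gpu_batch
        next_gpu_batch = transfer(cpu_batch)   # start copying the next batch
        train_step(gpu_batch)                  # compute overlaps that copy
    torch.cuda.current_stream().wait_stream(copy_stream)
    train_step(next_gpu_batch)                 # last batch
```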
“Our work showed, I think, that the devil is in the details,” says Kaler. “If you pay close attention to the details that affect performance when training a graph neural network, you can solve a wide variety of performance problems. With our solutions, we ended up being completely bottlenecked by GPU computation, which is the ideal goal of such a system.”
SALIENT’s speed was evaluated on three standard datasets, ogbn-arxiv, ogbn-products, and ogbn-papers100M, as well as in multi-machine settings, with different levels of fanout (the amount of data the CPU would prepare for the GPU), and across several architectures, including the most recent state-of-the-art one, GraphSAGE-RI. In every setting, SALIENT outperformed PyTorch Geometric, most notably on the large ogbn-papers100M dataset, with 100 million nodes and over a billion edges. Here, running on a single GPU, it was three times faster than the optimized baseline created for this work; with 16 GPUs, SALIENT was an additional eight times faster.
While other systems had slightly different hardware and experimental setups, so it wasn’t always a direct comparison, SALIENT still outperformed them. Among systems that achieved similar accuracy, representative performance figures included 99 seconds with one GPU and 32 CPUs, and 13 seconds with 1,536 CPUs. In contrast, SALIENT’s runtime was 16.5 seconds with one GPU and 20 CPUs, and only two seconds with 16 GPUs and 320 CPUs. “If you look at the bottom-line numbers that previous work reports, our 16-GPU runtime (two seconds) is an order of magnitude faster than other numbers previously reported on this dataset,” says Kaler. The researchers attributed their performance improvements in part to their approach of optimizing their code for a single machine before moving to the distributed setting. Stathas says the lesson here is that, for your money, “it makes more sense to use the hardware you have efficiently, and to its limit, before you start scaling up to multiple computers,” which can yield significant savings on the cost and carbon emissions that can come with model training.
This new capability now allows researchers to tackle ever-larger graphs and dig deeper into them. For example, the Bitcoin network mentioned earlier contained 100,000 nodes; the SALIENT system can handle a graph 1,000 times (or three orders of magnitude) larger.
“In the future, we would like not only to run this graph neural network training system on the existing algorithms we implemented for classifying or predicting the properties of each node, but we also want to perform more in-depth tasks, such as identifying common patterns in a graph (subgraph patterns), [which] could actually be interesting for indicating financial crimes,” says Chen. “We also want to identify nodes in a graph that are similar in the sense that they might correspond to the same culprit in a financial crime. These tasks would require developing additional algorithms, and possibly also neural network architectures.”
This research was supported by the MIT-IBM Watson AI Lab and in part by the US Air Force Research Laboratory and the US Air Force Artificial Intelligence Accelerator.