The paper “Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters” proposes a resource-sensitive scheduler for shared GPU clusters. The scheduler uses offline profiling to measure how sensitive each job is to its CPU and memory allocation. The study reports that workload-aware CPU and memory allocation can improve job completion time by up to 3.4× and increase cluster resource utilization. The authors also present two heuristic algorithms for scheduling tasks, with the goal of making deep learning model training on shared clusters more efficient.
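The idea of sensitivity-aware allocation can be illustrated with a minimal sketch. It is not the paper's scheduler or either of its two heuristics; the job names, throughput numbers, and function names below are hypothetical, chosen only to show how profiled sensitivity could steer CPU allocation away from an even split.

```python
from typing import Dict, List

# Hypothetical profiling results: throughput (samples/s) per CPU count.
# image_job is CPU-sensitive; language_job barely benefits from extra CPUs.
PROFILES: Dict[str, Dict[int, float]] = {
    "image_job": {1: 100.0, 2: 190.0, 3: 270.0, 4: 330.0, 5: 360.0, 6: 370.0},
    "language_job": {1: 200.0, 2: 205.0, 3: 206.0, 4: 206.0, 5: 206.0, 6: 206.0},
}


def even_allocation(jobs: List[str], total_cpus: int) -> Dict[str, int]:
    """Baseline: split CPUs evenly across jobs, ignoring sensitivity."""
    share = total_cpus // len(jobs)
    return {job: share for job in jobs}


def sensitivity_aware_allocation(
    profiles: Dict[str, Dict[int, float]], total_cpus: int
) -> Dict[str, int]:
    """Greedy sketch (an assumption, not the paper's method): give every job
    one CPU, then hand each remaining CPU to the job whose profiled
    throughput improves the most, in relative terms, from one more CPU."""
    alloc = {job: 1 for job in profiles}
    remaining = total_cpus - len(alloc)
    while remaining > 0:
        best_job, best_gain = None, 0.0
        for job, curve in profiles.items():
            nxt = alloc[job] + 1
            if nxt not in curve:
                continue  # no profile data beyond this CPU count
            gain = curve[nxt] / curve[alloc[job]] - 1.0
            if gain > best_gain:
                best_job, best_gain = job, gain
        if best_job is None:
            break  # no job benefits from another CPU
        alloc[best_job] += 1
        remaining -= 1
    return alloc


if __name__ == "__main__":
    print(even_allocation(list(PROFILES), 6))
    print(sensitivity_aware_allocation(PROFILES, 6))
```

With these made-up profiles and six CPUs, the even split gives each job three CPUs, while the greedy sketch moves CPUs toward the job whose throughput depends on them and leaves the insensitive job close to its peak rate.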