Visualizing execution traces with task dependencies
Task-based scheduling has emerged as one method to reduce the complexity of parallel computing. When using task-based schedulers, developers must frame their computation as a series of tasks with various data dependencies. The scheduler can take these tasks, along with their input and output dependencies, and schedule them in parallel across a node or cluster. While these schedulers simplify the process of parallel software development, they can obscure the performance characteristics of an algorithm's execution.
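The model described above can be sketched in a few lines. The following is a hypothetical illustration, not any particular scheduler's implementation: the task names, data items, and the simple "wave" scheduler are assumptions made for the example. Each task declares what it reads and writes; the scheduler derives dependency edges from those declarations and groups independent tasks into waves that may run concurrently.

```python
from collections import defaultdict

# Illustrative tasks (hypothetical): each declares the data it reads/writes.
tasks = {
    "A": {"reads": set(),      "writes": {"x"}},
    "B": {"reads": {"x"},      "writes": {"y"}},
    "C": {"reads": {"x"},      "writes": {"z"}},
    "D": {"reads": {"y", "z"}, "writes": {"w"}},
}

def build_dag(tasks):
    """Derive edges: a task depends on the most recent writer of each
    datum it reads or overwrites (tasks taken in submission order)."""
    last_writer = {}
    edges = defaultdict(set)
    for name, t in tasks.items():  # dicts preserve insertion order
        for datum in t["reads"] | t["writes"]:
            if datum in last_writer and last_writer[datum] != name:
                edges[last_writer[datum]].add(name)
        for datum in t["writes"]:
            last_writer[datum] = name
    return edges

def schedule_levels(tasks, edges):
    """Group tasks into topological levels; tasks within a level
    have no mutual dependencies and can execute in parallel."""
    indeg = {t: 0 for t in tasks}
    for dsts in edges.values():
        for d in dsts:
            indeg[d] += 1
    levels, ready = [], [t for t in tasks if indeg[t] == 0]
    while ready:
        levels.append(ready)
        nxt = []
        for t in ready:
            for d in edges.get(t, ()):
                indeg[d] -= 1
                if indeg[d] == 0:
                    nxt.append(d)
        ready = nxt
    return levels

edges = build_dag(tasks)
print(schedule_levels(tasks, edges))  # [['A'], ['B', 'C'], ['D']]
```

Here B and C both read x and can run concurrently once A has produced it, which is exactly the parallelism a task-based runtime extracts automatically from the declared dependencies.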
The execution trace has been used for many years to give developers a visual representation of how their computations are performed. These methods can be employed to visualize when and where each of the tasks in a task-based algorithm is scheduled. In addition, the task dependencies can be used to create a directed acyclic graph (DAG) that can also be visualized to demonstrate the dependencies of the various tasks that make up a workload. The work presented here aims to combine these two data sets and extend execution trace visualization to better suit task-based workloads.
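Combining the two data sets can be sketched as a simple join. The trace records where and when each task ran; the DAG records who depends on whom. Overlaying the two reveals, for instance, dependency edges whose endpoints ran on different workers, which imply data movement. The trace values, edge list, and worker names below are illustrative assumptions, not data from the paper:

```python
# Hypothetical trace: task -> (worker, start time, end time).
trace = {
    "A": ("worker0", 0.0, 1.0),
    "B": ("worker0", 1.0, 2.5),
    "C": ("worker1", 1.0, 2.0),
    "D": ("worker1", 2.5, 4.0),
}
# Hypothetical DAG edges: (producer, consumer) task dependencies.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]

def cross_worker_edges(trace, edges):
    """Return dependency edges whose producer and consumer ran on
    different workers; drawn over a trace view, these highlight
    the data movement implied by the schedule."""
    return [(src, dst) for src, dst in edges
            if trace[src][0] != trace[dst][0]]

print(cross_worker_edges(trace, edges))  # [('A', 'C'), ('B', 'D')]
```

A visualization that renders these cross-worker edges over the per-worker timeline makes it immediately visible which scheduling decisions incurred communication.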
This paper presents a brief description of task-based schedulers and the performance data they produce. It then describes an interactive extension to current trace visualization methods that combines the trace and DAG data sets. This new tool allows users to gain a greater understanding of how their tasks are scheduled. It also provides a simplified way for developers to evaluate and debug the performance of their scheduler.
task-based scheduling
execution trace
data movement
DAG
Blake
Haugen
University of Tennessee, Knoxville
bhaugen@utk.edu
Stephen
Richmond
University of Tennessee, Knoxville
srichmo1@utk.edu
Jakub
Kurzak
University of Tennessee, Knoxville
kurzak@icl.utk.edu
Chad
A.
Steed
Oak Ridge National Laboratory
csteed@acm.org
Jack
Dongarra
University of Manchester
dongarra@eecs.utk.edu