A snapshot of parallelism in distributed deep learning training

  • Hairol Romero-Sandí Universidad Nacional
  • Gabriel Núñez Universidad Nacional
  • Elvis Rojas Universidad Nacional | National High Technology Center
Keywords: deep learning, parallelism, artificial neural networks

Abstract

The accelerated development of applications related to artificial intelligence has led to the creation of increasingly complex neural network models with enormous numbers of parameters, currently reaching the trillions. As a result, training such models is practically impossible without parallelization. Parallelism, applied through different approaches, is the mechanism used to address the problem of large-scale training. This paper presents a glimpse of the state of the art on parallelism in deep learning training from multiple points of view. The study addresses pipeline parallelism, hybrid parallelism, mixture-of-experts, and auto-parallelism, which currently play a leading role in scientific research in this area. Finally, we carry out a series of experiments with data parallelism and model parallelism, so that the reader can observe the performance of the two types of parallelism and understand the approach of each more clearly.
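As an illustration of the two approaches compared in the experiments, the sketch below shows how data parallelism and model parallelism are commonly expressed in PyTorch. It is not the authors' experimental code; a two-GPU node is assumed, and the names ToyNet, ModelParallelToyNet, and run_data_parallel are hypothetical.

```python
# Minimal, illustrative sketch (not the authors' setup): contrasting data parallelism
# and model parallelism in PyTorch. Assumes a machine with two CUDA devices; the
# names ToyNet, ModelParallelToyNet and run_data_parallel are hypothetical.
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyNet(nn.Module):
    """A small two-stage network used only to illustrate the partitioning."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
        self.stage2 = nn.Linear(1024, 10)

    def forward(self, x):
        return self.stage2(self.stage1(x))


class ModelParallelToyNet(ToyNet):
    """Model parallelism: each stage is placed on a different GPU and the
    activations are copied between devices inside forward()."""
    def __init__(self):
        super().__init__()
        self.stage1.to("cuda:0")
        self.stage2.to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        return self.stage2(x.to("cuda:1"))


def run_data_parallel(local_rank: int):
    """Data parallelism: every process keeps a full replica of the model and
    gradients are averaged across replicas during the backward pass."""
    dist.init_process_group(backend="nccl")          # one process per GPU
    torch.cuda.set_device(local_rank)
    model = DDP(ToyNet().to(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # Each rank would normally load a different shard of the dataset.
    inputs = torch.randn(32, 1024, device=local_rank)
    targets = torch.randint(0, 10, (32,), device=local_rank)
    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()                                  # DDP all-reduces gradients here
    optimizer.step()
```

Under these assumptions, the data-parallel path would be launched with one process per GPU (for example, torchrun --nproc_per_node=2), while the model-parallel variant runs in a single process that owns both devices.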

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., . . . Zheng, X. (2016, March 14). TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv(1603.04467 [cs.DC]). doi:10.48550/arXiv.1603.04467

Agarwal, S., Yan, C., Zhang, Z., & Venkataraman, S. (2023, October). BagPipe: Accelerating Deep Recommendation Model Training. SOSP '23: Proceedings of the 29th Symposium on Operating Systems Principles (pp. 348-363). Koblenz, Germany: Association for Computing Machinery, New York, NY, USA. doi:10.1145/3600006.3613142

Akintoye, S. B., Han, L., Zhang, X., Chen, H., & Zhang, D. (2022). A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning. IEEE Access, 10, 77950-77961. doi:10.1109/ACCESS.2022.3193690

Albawi, S., Mohammed, T. A., & Al-Zawi, S. (2017). Understanding of a convolutional neural network. 2017 International Conference on Engineering and Technology (ICET) (pp. 1-6). Antalya, Turkey: IEEE. doi:10.1109/ICEngTechnol.2017.8308186

Aminabadi, R. Y., Rajbhandari, S., Awan, A. A., Li, C., Li, D., Zheng, E., . . . He, Y. (2022). DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 1-15). Dallas, TX, USA: IEEE. doi:10.1109/SC41404.2022.00051

Batur Dinler, Ö., Şahin, B. C., & Abualigah, L. (2021, November 30). Comparison of Performance of Phishing Web Sites with Different DeepLearning4J Models. European Journal of Science and Technology, (28), 425-431. doi:10.31590/ejosat.1004778

Ben-Nun, T., & Hoefler, T. (2019, August 30). Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis. ACM Computing Surveys (CSUR), 52(4), 1-43, Article No. 65. doi:10.1145/3320060

Cai, Z., Yan, X., Ma, K., Yidi, W., Huang, Y., Cheng, J., . . . Yu, F. (2022, August 1). TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism. IEEE Transactions on Parallel and Distributed Systems, 33(8), 1967-1981. doi:10.1109/TPDS.2021.3132413

Camp, D., Garth, C., Childs, H., Pugmire, D., & Joy, K. (2011, November). Streamline Integration Using MPI-Hybrid Parallelism on a Large Multicore Architecture. IEEE Transactions on Visualization and Computer Graphics, 17(11), 1702-1713. doi:10.1109/TVCG.2010.259

Chen, C.-C., Yang, C.-L., & Cheng, H.-Y. (2019, October 28). Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform. arXiv:1809.02839v4 [cs.DC]. doi:10.48550/arXiv.1809.02839

Chen, M. (2023, March 15). Analysis of Data Parallelism Methods with Deep Neural Network. EITCE '22: Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering (pp. 1857-1861). Xiamen, China: Association for Computing Machinery, New York, NY, USA. doi:10.1145/3573428.3573755

Chen, T., Huang, S., Xie, Y., Jiao, B., Jiang, D., Zhou, H., . . . Wei, F. (2022, June 2). Task-Specific Expert Pruning for Sparse Mixture-of-Experts. arXiv:2206.00277v2 [cs.LG], 1-13. doi:10.48550/arXiv.2206.00277

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., . . . Zhang, Z. (2015, December 3). MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. arXiv:1512.01274v1 [cs.DC], 1-6. doi:10.48550/arXiv.1512.01274

Chen, Z., Deng, Y., Wu, Y., Gu, Q., & Li, Y. (2022). Towards Understanding the Mixture-of-Experts Layer in Deep Learning. In A. H. Oh, A. Agarwal, D. Belgrave, & K. Cho (Ed.), Advances in Neural Information Processing Systems. New Orleans, Louisiana, USA. Retrieved from https://openreview.net/forum?id=MaYzugDmQV

Collobert, R., Bengio, S., & Mariéthoz, J. (2002, October 30). Torch: a modular machine learning software library. Research Report, IDIAP, Martigny, Switzerland. Retrieved from https://publications.idiap.ch/downloads/reports/2002/rr02-46.pdf

Dai, D., Dong, L., Ma, S., Zheng, B., Sui, Z., Chang, B., & Wei, F. (2022, May). StableMoE: Stable Routing Strategy for Mixture of Experts. In S. Muresan, P. Nakov, & A. Villavicencio (Ed.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: Long Papers, pp. 7085–7095. Dublin, Ireland: Association for Computational Linguistics. doi:10.18653/v1/2022.acl-long.489

Duan, Y., Lai, Z., Li, S., Liu, W., Ge, K., Liang, P., & Li, D. (2022). HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training. 2022 IEEE International Conference on Cluster Computing (CLUSTER) (pp. 313-323). Heidelberg, Germany: IEEE. doi:10.1109/CLUSTER51413.2022.00043

Fan, S., Rong, Y., Meng, C., Cao, Z., Wang, S., Zheng, Z., . . . Lin, W. (2021, February). DAPPLE: a pipelined data parallel approach for training large models. PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (pp. 431-445). Virtual Event, Republic of Korea: Association for Computing Machinery, New York, NY, USA. doi:10.1145/3437801.3441593

Fedus, W., Zoph, B., & Shazeer, N. (2022, January 1). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. (A. Clark, Ed.) The Journal of Machine Learning Research, 23(1), Article No. 120, 5232-5270. Retrieved from https://dl.acm.org/doi/abs/10.5555/3586589.3586709

Gholami, A., Azad, A., Jin, P., Keutzer, K., & Buluc, A. (2018). Integrated Model, Batch, and Domain Parallelism in Training Neural Networks. SPAA '18: Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures (pp. 77-86). Vienna, Austria: Association for Computing Machinery, New York, NY, USA. doi:10.1145/3210377.3210394

Guan, L., Yin, W., Li, D., & Lu, X. (2020, November 9). XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training. arXiv:1911.04610v3 [cs.LG]. doi:10.48550/arXiv.1911.04610

Harlap, A., Narayanan, D., Phanishayee, A., Seshadri, V., Devanur, N., Ganger, G., & Gibbons, P. (2018, June 8). PipeDream: Fast and Efficient Pipeline Parallel DNN Training. arXiv:1806.03377v1 [cs.DC], 1-14. doi:10.48550/arXiv.1806.03377

Hazimeh, H., Zhao, Z., Chowdhery, A., Sathiamoorthy, M., Chen, Y., Mazumder, R., . . . Chi, E. H. (2024). DSelect-k: differentiable selection in the mixture of experts with applications to multi-task learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, & J. W. Vaughan (Ed.), NIPS'21: Proceedings of the 35th International Conference on Neural Information Processing Systems. Article No. 2246, pp. 29335-29347. Curran Associates Inc., Red Hook, NY, USA. doi:10.5555/3540261.3542507

He, C., Li, S., Soltanolkotabi, M., & Avestimehr, S. (2021, July). PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models. In M. Meila, & T. Zhang (Ed.), Proceedings of the 38th International Conference on Machine Learning. 139, pp. 4150-4159. PMLR. Retrieved from https://proceedings.mlr.press/v139/he21a.html

He, J., Qiu, J., Zeng, A., Yang, Z., Zhai, J., & Tang, J. (2021, March 24). FastMoE: A Fast Mixture-of-Expert Training System. arXiv:2103.13262v1 [cs.LG], 1-11. doi:10.48550/arXiv.2103.13262

Hey, T. (2020, October 1). Opportunities and Challenges from Artificial Intelligence and Machine Learning for the Advancement of Science, Technology, and the Office of Science Missions. Technical Report, USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR), United States. doi:10.2172/1734848

Hopfield, J. J. (1988, September). Artificial neural networks. IEEE Circuits and Devices Magazine, 4(5), 3-10. doi:10.1109/101.8118

Howison, M., Bethel, E. W., & Childs, H. (2012, January). Hybrid Parallelism for Volume Rendering on Large-, Multi-, and Many-Core Systems. IEEE Transactions on Visualization and Computer Graphics, 18(1), 17-29. doi:10.1109/TVCG.2011.24

Hu, Y., Imes, C., Zhao, X., Kundu, S., Beerel, P. A., Crago, S. P., & Walters, J. P. (2021, October 28). Pipeline Parallelism for Inference on Heterogeneous Edge Computing. arXiv:2110.14895v1 [cs.DC], 1-12. doi:10.48550/arXiv.2110.14895

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, M. X., Chen, D., . . . Chen, Z. (2019, December 8). GPipe: efficient training of giant neural networks using pipeline parallelism. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, & E. B. Fox (Ed.), Proceedings of the 33rd International Conference on Neural Information Processing Systems (NIPS'19). Article No. 10, pp. 103-112. Vancouver, BC, Canada: Curran Associates Inc., Red Hook, NY, USA. doi:10.5555/3454287.3454297

Hwang, C., Cui, W., Xiong, Y., Yang, Z., Liu, Z., Hu, H., . . . Xiong, Y. (2023, June 5). Tutel: Adaptive Mixture-of-Experts at Scale. arXiv:2206.03382v2 [cs.DC], 1-19. doi:10.48550/arXiv.2206.03382

Janbi, N., Katib, I., & Mehmood, R. (2023, May). Distributed artificial intelligence: Taxonomy, review, framework, and reference architecture. Intelligent Systems with Applications, 18, 200231. doi:10.1016/j.iswa.2023.200231

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., . . . Darrell, T. (2014, November 3). Caffe: Convolutional Architecture for Fast Feature Embedding. MM '14: Proceedings of the 22nd ACM international conference on Multimedia (pp. 675-678). Orlando, Florida, USA: Association for Computing Machinery, New York, NY, USA. doi:10.1145/2647868.2654889

Jia, Z., Lin, S., Qi, C. R., & Aiken, A. (2018). Exploring Hidden Dimensions in Accelerating Convolutional Neural Networks. In J. Dy, & A. Krause (Ed.), Proceedings of the 35th International Conference on Machine Learning. 80, pp. 2274-2283. PMLR. Retrieved from https://proceedings.mlr.press/v80/jia18a.html

Jiang, W., Zhang, Y., Liu, P., Peng, J., Yang, L. T., Ye, G., & Jin, H. (2020, January). Exploiting potential of deep neural networks by layer-wise fine-grained parallelism. Future Generation Computer Systems, 102, 210-221. doi:10.1016/j.future.2019.07.054

Kamruzzaman, M., Swanson, S., & Tullsen, D. M. (2013, November 17). Load-balanced pipeline parallelism. SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. Article No. 14, pp. 1-12. Denver, Colorado, USA: Association for Computing Machinery, New York, NY, USA. doi:10.1145/2503210.2503295

Kirby, A. C., Samsi, S., Jones, M., Reuther, A., Kepner, J., & Gadepally, V. (2020, September). Layer-Parallel Training with GPU Concurrency of Deep Residual Neural Networks via Nonlinear Multigrid. arXiv:2007.07336 [cs.LG], 1-7. doi:10.1109/HPEC43674.2020.9286180

Kossmann, F., Jia, Z., & Aiken, A. (2022, August 2). Optimizing Mixture of Experts using Dynamic Recompilations. arXiv:2205.01848v2 [cs.LG], 1-13. doi:10.48550/arXiv.2205.01848

Krizhevsky, A. (2014, April 26). One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997v2 [cs.NE], 1-7. doi:10.48550/arXiv.1404.5997

Kukačka, J., Golkov, V., & Cremers, D. (2017, October 29). Regularization for Deep Learning: A Taxonomy. arXiv:1710.10686v1 [cs.LG], 1-23. doi:10.48550/arXiv.1710.10686

Li, C., Yao, Z., Wu, X., Zhang, M., Holmes, C., Li, C., & He, Y. (2024, January 14). DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. arXiv:2212.03597v3 [cs.LG], 1-19. doi:10.48550/arXiv.2212.03597

Li, J., Jiang, Y., Zhu, Y., Wang, C., & Xu, H. (2023, July). Accelerating Distributed MoE Training and Inference with Lina. 2023 USENIX Annual Technical Conference (USENIX ATC 23) (pp. 945-959). USENIX Association, Boston, MA, USA. Retrieved from https://www.usenix.org/conference/atc23/presentation/li-jiamin

Li, S., & Hoefler, T. (2021, November). Chimera: efficiently training large-scale neural networks with bidirectional pipelines. SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Article No. 27, pp. 1-14. St. Louis, Missouri, USA: Association for Computing Machinery, New York, NY, USA. doi:10.1145/3458817.3476145

Li, S., Liu, H., Bian, Z., Fang, J., Huang, H., Liu, Y., . . . You, Y. (2023, August). Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing (pp. 766-775). Salt Lake City, UT, USA: Association for Computing Machinery, New York, NY, USA. doi:10.1145/3605573.3605613

Li, S., Mangoubi, O., Xu, L., & Guo, T. (2021). Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep Learning. 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS) (pp. 528-538). DC, USA: IEEE. doi:10.1109/ICDCS51616.2021.00057

Li, Y., Huang, J., Li, Z., Zhou, S., Jiang, W., & Wang, J. (2023). HSP: Hybrid Synchronous Parallelism for Fast Distributed Deep Learning. ICPP '22: Proceedings of the 51st International Conference on Parallel Processing (pp. 1-11). Bordeaux, France: Association for Computing Machinery, New York, NY, USA. doi:10.1145/3545008.3545024

Li, Z., Liu, F., Yang, W., Peng, S., & Zhou, J. (2022, December). A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Transactions on Neural Networks and Learning Systems, 33(12), 6999-7019. doi:10.1109/TNNLS.2021.3084827

Li, Z., Zhuang, S., Guo, S., Zhuo, D., Zhang, H., Song, D., & Stoica, I. (2021). TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models. In M. Meila, & T. Zhang (Ed.), Proceedings of the 38th International Conference on Machine Learning. 139, pp. 6543-6552. PMLR. Retrieved from https://proceedings.mlr.press/v139/li21y.html

Liang, P., Tang, Y., Zhang, X., Bai, Y., Su, T., Lai, Z., . . . Li, D. (2023, August). A Survey on Auto-Parallelism of Large-Scale Deep Learning Training. IEEE Transactions on Parallel and Distributed Systems, 34(8), 2377-2390. doi:10.1109/TPDS.2023.3281931

Liu, D., Chen, X., Zhou, Z., & Ling, Q. (2020, May 15). HierTrain: Fast Hierarchical Edge AI Learning With Hybrid Parallelism in Mobile-Edge-Cloud Computing. IEEE Open Journal of the Communications Society, 1, 634-645. doi:10.1109/OJCOMS.2020.2994737

Liu, W., Lai, Z., Li, S., Duan, Y., Ge, K., & Li, D. (2022). AutoPipe: A Fast Pipeline Parallelism Approach with Balanced Partitioning and Micro-batch Slicing. 2022 IEEE International Conference on Cluster Computing (CLUSTER) (pp. 301-312). Heidelberg, Germany: IEEE. doi:10.1109/CLUSTER51413.2022.00042

Ma, J., Zhao, Z., Yi, X., Chen, J., Hong, L., & Chi, E. H. (2018, July). Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts. KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1930-1939). London, United Kingdom: Association for Computing Machinery, New York, NY, USA. doi:10.1145/3219819.3220007

Manaswi, N. K. (2018). Understanding and Working with Keras. In N. K. Manaswi, Deep Learning with Applications Using Python: Chatbots and Face, Object, and Speech Recognition With TensorFlow and Keras (pp. 31–43). Berkeley, CA, USA: Apress. doi:10.1007/978-1-4842-3516-4

Mastoras, A., & Gross, T. R. (2018, February 24). Understanding Parallelization Tradeoffs for Linear Pipelines. In Q. Chen, Z. Huang, & P. Balaji (Ed.), PMAM'18: Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores (pp. 1-10). Vienna, Austria: Association for Computing Machinery, New York, NY, USA. doi:10.1145/3178442.3178443

Miao, X., Wang, Y., Jiang, Y., Shi, C., Nie, X., Zhang, H., & Cui, B. (2022, November 1). Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism. Proceedings of the VLDB Endowment, 16(3), 470-479. doi:10.14778/3570690.3570697

Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., . . . Dean, J. (2017, June 25). Device Placement Optimization with Reinforcement Learning. arXiv:1706.04972v2 [cs.LG], 1-11. doi:10.48550/arXiv.1706.04972

Mittal, S., & Vaishay, S. (2019, October). A survey of techniques for optimizing deep learning on GPUs. Journal of Systems Architecture, 99, 101635. doi:10.1016/j.sysarc.2019.101635

Moreno-Alvarez, S., Haut, J. M., Paoletti, M. E., & Rico-Gallego, J. A. (2021, June 21). Heterogeneous model parallelism for deep neural networks. Neurocomputing, 441, 1-12. doi:10.1016/j.neucom.2021.01.125

Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., . . . Zaharia, M. (2019, October). PipeDream: generalized pipeline parallelism for DNN training. SOSP '19: Proceedings of the 27th ACM Symposium on Operating Systems Principles (pp. 1-15). Huntsville, Ontario, Canada: Association for Computing Machinery, New York, NY, USA. doi:10.1145/3341301.3359646

Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., . . . Zaharia, M. (2021). Efficient large-scale language model training on GPU clusters using megatron-LM. SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Article No. 58, pp. 1-15. St. Louis, Missouri, USA: Association for Computing Machinery, New York, NY, USA. doi:10.1145/3458817.3476209

Nie, X., Miao, X., Cao, S., Ma, L., Liu, Q., Xue, J., . . . Cui, B. (2022, October 9). EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate. arXiv:2112.14397v2 [cs.LG], 1-14. doi:10.48550/arXiv.2112.14397

Oyama, Y., Maruyama, N., Dryden, N., McCarthy, E., Harrington, P., Balewski, J., . . . Van Essen, B. (2021, July 1). The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs With Hybrid Parallelism. IEEE Transactions on Parallel and Distributed Systems, 32(7), 1641-1652. doi:10.1109/TPDS.2020.3047974

Park, J. H., Yun, G., Yi, C. M., Nguyen, N. T., Lee, S., Choi, J., . . . Choi, Y.-r. (2020, July). HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism. 2020 USENIX Annual Technical Conference (USENIX ATC 20) (pp. 307-321). USENIX Association. Retrieved from https://www.usenix.org/conference/atc20/presentation/park

Pouyanfar, S., Sadiq, S., Yan, Y., Tian, H., Tao, Y., Presa, M. R., . . . Iyengar, S. S. (2018, September 18). A Survey on Deep Learning: Algorithms, Techniques, and Applications. ACM Computing Surveys (CSUR), 51(5), 1-36, Article No. 92. doi:10.1145/3234150

Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R. Y., Awan, A. A., . . . He, Y. (2022, July). DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, & S. Sabato (Ed.), Proceedings of the 39th International Conference on Machine Learning. 162, pp. 18332-18346. PMLR. Retrieved from https://proceedings.mlr.press/v162/rajbhandari22a.html

Rasley, J., Rajbhandari, S., Ruwase, O., & He, Y. (2020, August). DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 3505 - 3506). Virtual Event, CA, USA: Association for Computing Machinery, New York, NY, USA. doi:10.1145/3394486.3406703

Ravanelli, M., Parcollet, T., & Bengio, Y. (2019). The Pytorch-kaldi Speech Recognition Toolkit. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6465-6469). Brighton, UK: IEEE. doi:10.1109/ICASSP.2019.8683713

Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Pinto, A. S., . . . Houlsby, N. (2024, December). Scaling vision with sparse mixture of experts. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, & J. W. Vaughan (Ed.), NIPS'21: Proceedings of the 35th International Conference on Neural Information Processing Systems. Article No. 657, pp. 8583-8595. Curran Associates Inc., Red Hook, NY, USA. doi:10.5555/3540261.3540918

Rojas, E., Quirós-Corella, F., Jones, T., & Meneses, E. (2022). Large-Scale Distributed Deep Learning: A Study of Mechanisms and Trade-Offs with PyTorch. In I. Gitler, C. J. Barrios Hernández, & E. Meneses (Ed.), High Performance Computing. CARLA 2021. Communications in Computer and Information Science. 1540, pp. 177-192. Springer, Cham. doi:10.1007/978-3-031-04209-6_13

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. International Conference on Learning Representations (ICLR 2017), (pp. 1-19). Toulon, France. Retrieved from https://openreview.net/forum?id=B1ckMDqlg

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2020, March 13). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053v4 [cs.CL], 1-15. doi:10.48550/arXiv.1909.08053

Song, L., Mao, J., Zhuo, Y., Qian, X., Li, H., & Chen, Y. (2019). HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 56-68). Washington, DC, USA: IEEE. doi:10.1109/HPCA.2019.00027

Stevens, R., Taylor, V., Nichols, J., Maccabe, A. B., Yelick, K., & Brown, D. (2020, February 1). AI for Science: Report on the Department of Energy (DOE) Town Halls on Artificial Intelligence (AI) for Science. Technical Report, USDOE; Lawrence Berkeley National Laboratory (LBNL); Argonne National Laboratory (ANL); Oak Ridge National Laboratory (ORNL), United States. doi:10.2172/1604756

Subhlok, J., Stichnoth, J. M., O'Hallaron, D. O., & Gross, T. (1993, July 1). Exploiting task and data parallelism on a multicomputer. ACM SIGPLAN Notices, 28(7), 13-22. doi:10.1145/173284.155334

Takisawa, N., Yazaki, S., & Ishihata, H. (2020). Distributed Deep Learning of ResNet50 and VGG16 with Pipeline Parallelism. 2020 Eighth International Symposium on Computing and Networking Workshops (CANDARW) (pp. 130-136). Naha, Japan: IEEE. doi:10.1109/CANDARW51189.2020.00036

Tanaka, M., Taura, K., Hanawa, T., & Torisawa, K. (2021). Automatic Graph Partitioning for Very Large-scale Deep Learning. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (pp. 1004-1013). Portland, OR, USA: IEEE. doi:10.1109/IPDPS49936.2021.00109

Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., & Rellermeyer, J. S. (2020). A Survey on Distributed Machine Learning. ACM Computing Surveys (CSUR), 53(2), 1-33, Article No. 30. doi:10.1145/3377454

Wang, H., Imes, C., Kundu, S., Beerel, P. A., Crago, S. P., & Walters, J. P. (2023). Quantpipe: Applying Adaptive Post-Training Quantization For Distributed Transformer Pipelines In Dynamic Edge Environments. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). Rhodes Island, Greece: IEEE. doi:10.1109/ICASSP49357.2023.10096632

Wang, S.-C. (2003). Artificial Neural Network. In S.-C. Wang, Interdisciplinary Computing in Java Programming (1 ed., Vol. 743, pp. 81-100). Boston, MA, USA: Springer. doi:10.1007/978-1-4615-0377-4_5

Wang, Y., Feng, B., Wang, Z., Geng, T., Barker, K., Li, A., & Ding, Y. (2023, July). MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms. 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23) (pp. 779-795). Boston, MA, USA: USENIX Association. Retrieved from https://www.usenix.org/conference/osdi23/presentation/wang-yuke

Wu, J. (2017, May 1). Introduction to Convolutional Neural Networks. Nanjing University, National Key Lab for Novel Software Technology, China. Retrieved from https://cs.nju.edu.cn/wujx/paper/CNN.pdf

Yang, B., Zhang, J., Li, J., Ré, C., Aberger, C. R., & De Sa, C. (2021, March 15). PipeMare: Asynchronous Pipeline Parallel DNN Training. Proceedings of the 4th Machine Learning and Systems Conference (MLSys 2021), 3, pp. 269-296. San Jose, CA, USA. Retrieved from https://proceedings.mlsys.org/paper_files/paper/2021/file/9412531719be7ccf755c4ff98d0969dc-Paper.pdf

Yang, P., Zhang, X., Zhang, W., Yang, M., & Wei, H. (2022). Group-based Interleaved Pipeline Parallelism for Large-scale DNN Training. The Tenth International Conference on Learning Representations (ICLR 2022), (pp. 1-15). Retrieved from https://openreview.net/forum?id=cw-EmNq5zfD

Yoon, J., Byeon, Y., Kim, J., & Lee, H. (2022, July 15). EdgePipe: Tailoring Pipeline Parallelism With Deep Neural Networks for Volatile Wireless Edge Devices. IEEE Internet of Things Journal, 9(14), 11633 - 11647. doi:10.1109/JIOT.2021.3131407

Yuan, L., He, Q., Chen, F., Dou, R., Jin, H., & Yang, Y. (2023, April 30). PipeEdge: A Trusted Pipelining Collaborative Edge Training based on Blockchain. In Y. Ding, J. Tang, J. Sequeda, L. Aroyo, C. Castillo, & G.-J. Houben (Ed.), WWW '23: Proceedings of the ACM Web Conference 2023 (pp. 3033-3043). Austin, TX, USA: Association for Computing Machinery, New York, NY, USA. doi:10.1145/3543507.3583413

Zeng, Z., Liu, C., Tang, Z., Chang, W., & Li, K. (2021). Training Acceleration for Deep Neural Networks: A Hybrid Parallelization Strategy. 2021 58th ACM/IEEE Design Automation Conference (DAC) (pp. 1165-1170). San Francisco, CA, USA: IEEE. doi:10.1109/DAC18074.2021.9586300

Zhang, J., Niu, G., Dai, Q., Li, H., Wu, Z., Dong, F., & Wu, Z. (2023, October 28). PipePar: Enabling fast DNN pipeline parallel training in heterogeneous GPU clusters. Neurocomputing, 555, 126661. doi:10.1016/j.neucom.2023.126661

Zhang, P., Lee, B., & Qiao, Y. (2023, October). Experimental evaluation of the performance of Gpipe parallelism. Future Generation Computer Systems, 147, 107-118. doi:10.1016/j.future.2023.04.033

Zhang, S., Diao, L., Wang, S., Cao, Z., Gu, Y., Si, C., . . . Lin, W. (2023, February 16). Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform. arXiv:2302.08141v1 [cs.DC], 1-16. doi:10.48550/arXiv.2302.08141

Zhao, L., Xu, R., Wang, T., Tian, T., Wang, X., Wu, W., . . . Jin, X. (2021, January 14). BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training. arXiv:2012.12544v2 [cs.DC]. doi:10.48550/arXiv.2012.12544

Zhao, L., Xu, R., Wang, T., Tian, T., Wang, X., Wu, W., . . . Jin, X. (2022). BaPipe: Balanced Pipeline Parallelism for DNN Training. Parallel Processing Letters, 32(03n04), 2250005, 1-17. doi:10.1142/S0129626422500050

Zhao, S., Li, F., Chen, X., Guan, X., Jiang, J., Huang, D., . . . Cui, H. (2022, March 1). vPipe: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training. IEEE Transactions on Parallel and Distributed Systems, 33(3), 489-506. doi:10.1109/TPDS.2021.3094364

Zheng, L., Li, Z., Zhang, H., Zhuang, Y., Chen, Z., Huang, Y., . . . Stoica, I. (2022, July). Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) (pp. 559-578). Carlsbad, CA, USA: USENIX Association. Retrieved from https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin

Zhou, Q., Guo, S., Qu, Z., Li, P., Li, L., Guo, M., & Wang, K. (2021, May 1). Petrel: Heterogeneity-Aware Distributed Deep Learning Via Hybrid Synchronization. IEEE Transactions on Parallel and Distributed Systems, 32(5), 1030-1043. doi:10.1109/TPDS.2020.3040601

Zhu, X. (2023, April 28). Implement deep neuron networks on VPipe parallel system: a ResNet variant implementation. In X. Li (Ed.), Proceedings Third International Conference on Artificial Intelligence and Computer Engineering (ICAICE 2022). 12610, p. 126104I. Wuhan, China: International Society for Optics and Photonics, SPIE. doi:10.1117/12.2671359

How to cite
Romero-Sandí, H., Núñez, G., & Rojas, E. (2024). A snapshot of parallelism in distributed deep learning training. Revista Colombiana De Computación, 25(1), 60–73. Retrieved from https://revistas.unab.edu.co/index.php/rcc/article/view/5054

Published
2024-06-30
Section
Scientific and technological research article
