Distributed Parallelism [1]

Overview

Data Parallelism (DP)

The simplest form of data parallelism (DP) copies the same model weights to multiple workers and assigns each worker a different fraction of the data, so that all workers process their shards at the same time. [1]

Data parallelism can be implemented in a centralized or a decentralized fashion, corresponding to DataParallel and DistributedDataParallel (DDP) in PyTorch, respectively. [4]
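As a rough illustration of the data-parallel idea, the sketch below simulates the workers in a single process using plain Python. Every worker uses the same weight `w`, computes a gradient on its own shard of the data, and the local gradients are then averaged before one shared update. The toy model (`y = w * x` with a mean-squared-error loss), the helper names `grad_mse` and `data_parallel_step`, and the assumption that the data splits evenly across workers are all illustrative choices, not something taken from the cited sources or from PyTorch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over the given samples."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_step(w, xs, ys, num_workers, lr):
    """One simulated data-parallel SGD step (hypothetical helper).

    Assumes len(xs) is divisible by num_workers so all shards have equal size.
    """
    shard = len(xs) // num_workers
    local_grads = []
    for k in range(num_workers):
        lo, hi = k * shard, (k + 1) * shard
        # Each worker sees the same weight w but only its own slice of the data.
        local_grads.append(grad_mse(w, xs[lo:hi], ys[lo:hi]))
    # Averaging the local gradients stands in for the gradient synchronization
    # step (for example an all-reduce) that a real system would perform.
    avg_grad = sum(local_grads) / num_workers
    return w - lr * avg_grad, avg_grad


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_new, avg_grad = data_parallel_step(0.0, xs, ys, num_workers=2, lr=0.01)
full_grad = grad_mse(0.0, xs, ys)  # gradient computed on the whole batch at once
```

With equal-sized shards, the averaged gradient matches the gradient computed on the full batch, which is why every replica stays in sync after the update.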

Model Parallelism (MP) [1]

Model parallelism (MP) addresses the case where the model weights do not fit on a single node. Both the computation and the model parameters are partitioned across multiple machines. Unlike data parallelism, where each worker hosts a full copy of the model, MP places only a fraction of the model parameters on each worker, which reduces the memory usage and the computation per worker.
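One simple way to picture this partitioning is to split the weight matrix of a single linear layer by rows, so that each worker stores and multiplies only its own slice. The sketch below simulates that in one process with plain Python lists. The helper names `matvec` and `model_parallel_matvec`, the row-wise split, and the assumption that the row count divides evenly across workers are illustrative, and real systems may partition by layer or by column instead.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def model_parallel_matvec(W, x, num_workers):
    """Compute W @ x with the rows of W split across simulated workers.

    Assumes len(W) is divisible by num_workers (hypothetical helper).
    """
    rows = len(W) // num_workers
    out = []
    for k in range(num_workers):
        # Worker k holds only this slice of the parameters.
        W_k = W[k * rows:(k + 1) * rows]
        # Each worker produces its own part of the output vector; extending
        # the list stands in for gathering the partial outputs.
        out.extend(matvec(W_k, x))
    return out


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
x = [1.0, 1.0]
y_parallel = model_parallel_matvec(W, x, num_workers=2)
y_single = matvec(W, x)  # reference result with all parameters on one worker
```

Here each simulated worker holds two of the four rows, so it stores half of the parameters and performs half of the multiplications, while the concatenated output matches the single-worker result.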