
The simplest form of data parallelism (DP) copies the same model weights to multiple workers and assigns each worker a fraction of the data, so that all fractions are processed at the same time. [1]
Data parallelism comes in centralized and decentralized variants, which correspond to DataParallel and DistributedDataParallel (DDP) in PyTorch, respectively. [4]
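
To make the idea concrete, here is a minimal sketch of one data-parallel training loop, simulated in plain Python without any framework. The one-parameter model y = w * x, the function names, and the toy data are all assumptions made for illustration; the averaging step stands in for the gradient synchronization that a real system would perform across devices.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, data, num_workers, lr):
    """One simulated data-parallel SGD step (hypothetical helper)."""
    # Split the batch into equal shards, one per simulated worker.
    # Samples left over when the batch is not divisible are dropped here.
    shard_size = len(data) // num_workers
    shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    # Every worker holds the same weight w and computes a gradient on its shard.
    grads = [local_gradient(w, shard) for shard in shards]
    # Averaging the per-worker gradients stands in for the synchronization step.
    avg_grad = sum(grads) / num_workers
    # Every replica applies the same update, so the copies of w stay identical.
    return w - lr * avg_grad


# Toy data generated by y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, data, num_workers=2, lr=0.05)
```

With equal-sized shards, the averaged gradient equals the gradient over the full batch, so the result matches single-worker training on the same data.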

Model parallelism (MP) addresses the case where the model weights cannot fit on a single node. Both the computation and the model parameters are partitioned across multiple machines. Unlike data parallelism, where each worker hosts a full copy of the entire model, MP places only a fraction of the model parameters on each worker, so per-worker memory usage and computation are both reduced.
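
As a rough illustration of partitioning parameters, the sketch below splits the rows of a single weight matrix across simulated workers. This is only one possible partitioning scheme, chosen here as an assumption; the function names are hypothetical, and the list concatenation stands in for the communication a real system would need to gather the partial outputs.

```python
def matvec(rows, x):
    """Multiply a block of weight rows by the input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def model_parallel_matvec(weight, x, num_workers):
    """Compute weight @ x with the rows of weight split across workers."""
    # Each simulated worker stores only its own slice of the rows.
    rows_per_worker = (len(weight) + num_workers - 1) // num_workers
    partitions = [
        weight[i:i + rows_per_worker]
        for i in range(0, len(weight), rows_per_worker)
    ]
    # Each worker computes the output entries that belong to its rows.
    partial_outputs = [matvec(part, x) for part in partitions]
    # Concatenating the partial results stands in for a gather step.
    return [value for part in partial_outputs for value in part]


weight = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
x = [1, 0, -1]
full = matvec(weight, x)
split = model_parallel_matvec(weight, x, num_workers=2)
```

Here each of the two workers holds two of the four rows, so it stores half of the parameters and performs half of the multiplications, while the combined output matches the single-worker result.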