Workgroup:       Network Working Group
Internet-Draft:  draft-liu-multicast-for-computing-storage-00
Published:       July 2023
Intended Status: Informational
Expires:         11 January 2024
Authors:         Y. Liu (China Mobile)
                 X. Geng (Huawei)

Multicast for Computing and Storage

Abstract

This document introduces multicast use cases for computing and storage.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 11 January 2024.

Table of Contents

1. Introduction

There are applications in data centers with point-to-multipoint communication patterns that would benefit from a network multicast service. Without such a service, these applications, when migrating to public clouds, fall back on server-based packet replication techniques. This inflates CPU load and prevents tenants from sustaining high throughput and low latency for multicast workloads.

To better illustrate the requirements for computing and storage, this document describes three typical potential multicast scenarios that rely on point-to-multipoint (P2MP) services: multi-tenant cloud, computing, and storage.

The multicast requirements can be described along the following three dimensions (summarized in the sketch after this list):

- Network Scale: number of switches, number of links, number of hosts

- Multicast Tree Size: number of intermediate nodes; number of receivers

- Multicast Service Number: number of concurrent multicast services
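
As a convenience, the sketch below (Python, purely illustrative and not part of any protocol) gathers these three dimensions into a single data structure and fills it with the figures quoted in the scenarios that follow.

   # Illustrative summary of the requirement dimensions; the values are
   # the ones quoted in Sections 3 and 4 of this document.
   from dataclasses import dataclass

   @dataclass
   class MulticastRequirement:
       network_scale: str    # switches, links, hosts
       tree_size: str        # intermediate nodes, receivers
       service_number: str   # number of concurrent multicast services

   SCENARIOS = {
       "AI training": MulticastRequirement(
           "10-10k GPU", "10-10k receivers", "depends on the scenario"),
       "HPC (MPI)": MulticastRequirement(
           "1000k CPU/GPU", "10k~100k receivers", "1~100"),
       "Storage (Ceph)": MulticastRequirement(
           "3k servers (1 pod)", "3 receivers", "10k"),
   }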

2. Use in Multi-tenant Cloud

As illustrated in the following figure, a data center may contain: a network fabric configured in unicast-only mode; hosts running as virtual machines (VMs) managed by tenants; and central replicators (C-Rs) that provide an IPv6 Multicast Source Routing (MSR6) [I-D.cheng-spring-ipv6-msr-design-consideration] packet delivery service among the hosts of a tenant.

       +----+   +----+   +----+
       |C-R1|   |C-R2|   |C-Rn| ====>Central-Replicator
       +-+--+   +-+--+   +-+--+
         |        |        |
   +-----+--------+--------+-----+
   | Spine +Leaf +vSwitch Fabric |
   |       (Unicast Only)        |
   +--+--+--+--+--+--+--+--+--+--+
      |  |  |  |  |  |  |  |
      h1 |  h3 h4 |  h6 |  h8   ====>Tenant-1
         |        |     |
         h2       h5    h7      ====>Tenant-2

Take Tenant-1 as an example. Host h1 can send a multicast flow using MSR6 packets to C-R1, with one or more of the destination hosts h3/h4/h6/h8 encoded in the MSR6 header. An MSR6 packet may be sent to C-R1, where it is replicated and sent to all of the desired destination hosts. Alternatively, C-R1 may replicate the packet toward only part of the destination hosts and send another copy to C-R2 for replication and delivery to the remaining destination hosts.

A tenant may have a dedicated set of C-Rs for its own use, or may use shared C-Rs for its replication requirements among VMs.
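
The Python sketch below illustrates the replication behavior described above. It is only a functional illustration: the destination list is modeled as a plain Python list, whereas the real encoding is the MSR6 header defined in [I-D.cheng-spring-ipv6-msr-design-consideration]; the host names and the split between C-R1 and C-R2 are taken from the figure and are otherwise arbitrary.

   # Functional sketch of a central replicator (C-R).  Not an MSR6
   # implementation; the destination encoding here is a plain list.
   from dataclasses import dataclass

   @dataclass
   class Packet:
       payload: bytes
       destinations: list    # e.g. ["h3", "h4", "h6", "h8"]

   def replicate(packet, local_hosts, next_cr=None):
       """Deliver a copy to each destination this C-R serves and, if
       destinations remain, forward one copy listing them to another
       C-R."""
       copies = [(h, packet.payload)
                 for h in packet.destinations if h in local_hosts]
       remaining = [h for h in packet.destinations if h not in local_hosts]
       if remaining and next_cr is not None:
           copies.append((next_cr, Packet(packet.payload, remaining)))
       return copies

   # h1's packet lists all Tenant-1 destinations; assume C-R1 serves
   # h3/h4 and hands the rest to C-R2 (the split is illustrative).
   pkt = Packet(b"app-data", ["h3", "h4", "h6", "h8"])
   print(replicate(pkt, local_hosts={"h3", "h4"}, next_cr="C-R2"))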

3. Typical Multicast Scenario in Computing

3.1. AI Training

The following figure shows a typical RDMA AI training scenario.

                 PS(Parameter Server) Nodes
               +-------+          +-------+
               |  CPU  |          |  CPU  |
               | Server|          | Server|
               +-+-+-+-+          +-+-+-+-+
    ^            | | |              | | |          |
    |         +--|-|-|--------------+ | |          |
    |       +----+ | +----------------------+      |
    |       | |    +--------+ +-------+ |   |      V
Gradients   | |             | |         |   | Parameters
        +---+-+-+       +---+-+-+     +-+---+-+
        |  GPU  |       |  GPU  |     |  GPU  |
        | Worker|       | Worker|     | Worker|
        +-------+       +-------+     +-------+

Worker->PS: the gradient computed by each worker is pushed to the PS nodes.

PS->Worker: after aggregation, the updated parameters are sent back from the PS to all workers.

In this process, the second stage distributes the same data content to all workers. With unicast, N separate connections each carry an identical copy, so the sender-side bandwidth efficiency is 1/N: the larger the scale, the lower the efficiency.
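
The back-of-the-envelope calculation below makes the 1/N claim concrete; the parameter size and worker count are made-up example values, not measurements.

   # Sender-side traffic at the PS for one parameter distribution.
   PARAM_BYTES = 10 * 2**30   # 10 GiB of parameters (example value)
   N_WORKERS = 1024           # number of GPU workers (example value)

   unicast_tx = PARAM_BYTES * N_WORKERS   # one full copy per worker
   multicast_tx = PARAM_BYTES             # one copy; the network replicates

   print(f"unicast:   {unicast_tx / 2**40:.1f} TiB leave the PS")
   print(f"multicast: {multicast_tx / 2**30:.1f} GiB leave the PS")
   print(f"sender-side efficiency of unicast: 1/{N_WORKERS}")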

                      +---------------+
                      |     Source    |
                      | +---+   +---+ |
                      | |CPU|   |GPU| |
                      | +-+-+   +-+-+ |
                      |   |       |   |
                      |    \     /    |
                      |   +-V---V-+   |
                      |   |  HCA  |   |
                      |   +-------+   |
                      +--+-+-+-+-+-+--+
                         | | ... | |
                      +--V-V-----V-V--+
                      |     Switch    |
                      +-+-----------+-+
                       /             \
        +-------------V-+           +-V------------+
        |  Destination  |           |  Destination  |
        |   +-------+   |           |   +-------+   |
        |   |  HCA  |   |           |   |  HCA  |   |
        |   +-V---V-+   |           |   +-V---V-+   |
        |    /     \    |           |    /     \    |
        |   |       |   |           |   |       |   |
        | +-+-+   +-+-+ |           | +-+-+   +-+-+ |
        | |CPU|   |GPU| |           | |CPU|   |GPU| |
        | +---+   +---+ |           | +---+   +---+ |
        +---------------+           +---------------+

If the source sends only one copy to the network and the switches replicate the packet toward the different destinations, bandwidth is used more efficiently and training completes faster.
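
At the host side, the basic group model can be expressed with the standard IPv6 sockets API [RFC3493] [RFC3542], as in the Python sketch below: each destination joins a group, the source transmits one copy, and the network performs the replication. The group address and port are hypothetical, and the sketch deliberately ignores the RDMA data path used in practice.

   import socket
   import struct

   GROUP = "ff15::1234"   # hypothetical site-scope multicast group
   PORT = 5000

   def send_once(payload):
       """Source: transmit a single copy toward the group address."""
       with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as s:
           s.sendto(payload, (GROUP, PORT))

   def join_and_receive():
       """Destination: join the group and wait for the replicated copy."""
       s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
       s.bind(("", PORT))
       mreq = socket.inet_pton(socket.AF_INET6, GROUP) + struct.pack("@I", 0)
       s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_JOIN_GROUP, mreq)
       data, _ = s.recvfrom(65535)
       return data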

The large-scale multicast requirements in this scenario are as follows:

- Network Scale: 10-10k GPU

- Multicast Tree Size: 10-10k receivers

- Multicast Service Number: depends on the scenario

3.2. HPC

The following is an example of MPI in an HPC scenario.

      +-------------------------------------------+
      |                Dispatcher                 |
      |                  Master                   |
      +---------------------+---------------------+
                            |
          +-----------------+
          |
      +---+----+  +--------+             +--------+
      |+--V---+|  |+------+|             |+------+|
      ||Dispa-||  ||Dispa-||             ||Dispa-||
      ||Agent ||  ||Agent ||             ||Agent ||
      |+---+--+|  |+---+--+|             |+---+--+|
      |    |   |  |    |   |             |    |   |
      |+---V--+|  |+---V--+|             |+---V--+|
      ||  MPI ||  ||  MPI ||     ...     ||  MPI ||
      ||Proces||  ||Proces||             ||Proces||
      |+---^--+|  |+---^--+|             |+---^--+|
      |    |   |  |    |   |             |    |   |
      |+---V--+|  |+---V--+|             |+---V--+|
      || RoCE |<-->| RoCE |<------------->| RoCE ||
      |+------+|  |+------+|             |+------+|
      +--------+  +--------+             +--------+

Stage 1: The Dispatcher Master senses millions of cores and schedules million-rank MPI jobs on demand. The Dispatcher Master sends the scheduling results to the Dispatcher Agents.

Stage 2: The Dispatcher Agents start the million-rank MPI job on each node. The Dispatcher Agent that receives the message broadcasts it to the other Dispatcher Agents, which perform initialization before starting the MPI application.

Stage 3: The Dispatcher Agents broadcast the message to start the MPI application. During MPI internal initialization, the RoCE endpoints are synchronized in an allgather manner after the MPI application is started.

The last two stages could benefit from multicast, reducing task completion time; the sketch below shows the broadcast and allgather steps involved.
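
The Python sketch below, which assumes the mpi4py binding, shows the two collectives named above: a broadcast of the start message and an allgather of the RoCE endpoint information. The message contents and the endpoint fields are placeholders.

   # Sketch of the Stage 2/3 collectives, assuming the mpi4py binding.
   from mpi4py import MPI

   comm = MPI.COMM_WORLD
   rank = comm.Get_rank()

   # Broadcast of the "start the MPI application" message to all ranks.
   start_msg = {"job": "start-mpi-app"} if rank == 0 else None
   start_msg = comm.bcast(start_msg, root=0)

   # Each rank contributes its RoCE endpoint; after allgather every rank
   # holds the full endpoint table.  The endpoint fields are placeholders.
   my_endpoint = {"rank": rank, "qpn": 0x1000 + rank}
   endpoints = comm.allgather(my_endpoint)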

The large-scale multicast requirements in this scenario are as follows:

- Network Scale: 1000k CPUs/GPUs

- Multicast Tree Size: 10k~100k receivers

- Multicast Service Number: 1~100

4. Typical Multicast Scenario in Storage

Ceph is an open-source distributed storage software platform. It mainly focuses on scale-out file systems, including storage distribution and availability, and is widely used for storage.

Ceph Object Storage Daemons (OSDs) are responsible for storing objects on a local file system on behalf of Ceph clients. Ceph OSDs also use the CPU, memory, and networking of Ceph cluster nodes for data replication, erasure coding, recovery, monitoring, and reporting functions.

The following write process requires a P2MP service:

- The application initiates a "write" operation from a client to a server.

- The client finds the servers to write to, and 3 copies of the data are sent to 3 servers, as shown in the figure and sketched after it.

               +-------+          +-------+
               |Client1|          |Client2|
               +---+---+          +---+---+
                   |                  |
                   +---------+--------+
                             |
                     +-------+-------+
                     |     Switch    |
                     +-------+-------+
                             |
            +----------------+----------------+
            |                |                |
        +---+---+        +---+---+        +---+---+
        | Server|        | Server|        | Server|
        +-------+        +-------+        +-------+
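
A minimal Python sketch of this fan-out follows, adopting the document's description in which the client sends the same object to three servers; the server names and port are hypothetical.

   # Today's unicast fan-out for one replicated write, following the
   # description above.  Server names and port are hypothetical.
   import socket

   REPLICA_SERVERS = [("server1.example", 6800),
                      ("server2.example", 6800),
                      ("server3.example", 6800)]

   def write_object_unicast(data):
       """Send the same object once per replica server (3 copies)."""
       for host, port in REPLICA_SERVERS:
           with socket.create_connection((host, port)) as s:
               s.sendall(data)

   # With a P2MP service, the client would send a single copy toward a
   # group covering the three servers; the network replicates it, so the
   # client's transmit volume per write drops by a factor of three.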

The large-scale multicast requirements in this scenario are as follows:

- Network Scale: 3k servers (1 pod)

- Multicast Tree Size: 3 receivers

- Multicast Service Number: 10k

5. IANA Considerations

This document makes no request of IANA.

6. Security Considerations

TBD

7. Acknowledgements

TBD

8. Normative References

[I-D.cheng-spring-ipv6-msr-design-consideration]
Cheng, W., Mishra, G. S., Li, Z., Wang, A., Qin, Z., and C. Fan, "Design Consideration of IPv6 Multicast Source Routing (MSR6)", Work in Progress, Internet-Draft, draft-cheng-spring-ipv6-msr-design-consideration-01, <https://datatracker.ietf.org/doc/html/draft-cheng-spring-ipv6-msr-design-consideration-01>.
[I-D.ietf-avtcore-rtp-topologies-update]
Westerlund, M. and S. Wenger, "RTP Topologies", Work in Progress, Internet-Draft, draft-ietf-avtcore-rtp-topologies-update-10, <https://datatracker.ietf.org/doc/html/draft-ietf-avtcore-rtp-topologies-update-10>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC3493]
Gilligan, R., Thomson, S., Bound, J., McCann, J., and W. Stevens, "Basic Socket Interface Extensions for IPv6", RFC 3493, DOI 10.17487/RFC3493, February 2003, <https://www.rfc-editor.org/info/rfc3493>.
[RFC3542]
Stevens, W., Thomas, M., Nordmark, E., and T. Jinmei, "Advanced Sockets Application Program Interface (API) for IPv6", RFC 3542, DOI 10.17487/RFC3542, May 2003, <https://www.rfc-editor.org/info/rfc3542>.
[RFC8200]
Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", STD 86, RFC 8200, DOI 10.17487/RFC8200, July 2017, <https://www.rfc-editor.org/info/rfc8200>.

Authors' Addresses

Yisong Liu
China Mobile
Xuesong Geng
Huawei