| Internet-Draft | NFFRR | July 2022 | 
| Kompella & Lin | Expires 9 January 2023 | [Page] | 
There are several cases where, once Fast Reroute has taken place (for MPLS protection), a second fast reroute is undesirable, even detrimental. This memo gives several examples of this, and proposes a mechanism to prevent further fast reroutes.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 9 January 2023.¶
Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
MPLS Fast Reroute (FRR) [RFC4090] [RFC5286] [RFC7490] is a useful and widely deployed tool for minimizing packet loss in the case of a link or node failure. This has not only proven to be very effective, it is often the reason for using MPLS as a data plane. FRR works for a variety of control plane protocols, including LDP, RSVP-TE, and SPRING. Furthermore, FRR is often used to protect MPLS services such as IP VPN and EVPN.¶
Having said this, there are case where, once FRR has taken place, if the packet encounters a second failure, a second FRR is not helpful, perhaps even disruptive. For example, the packet may loop until TTL expires. This can lead to link congestion and further packet loss. Thus, the attempt to prevent a packet from being dropped may instead affect many other packets. Note that the "second" failure may simply be another manifestation of the same failure; see Figure 1.¶
This memo proposes a mechanism for preventing further FRR once in cases where such further protection may be harmful. Several examples where this is the case are demonstrated as motivation. A solution using special-purpose labels (SPLs) is then offered. Some mechanisms for distributing the capability to avoid further fast reroutes are also discussed, although these may be better placed in other documents in other Working Groups.¶
[ALDT] has a more elaborate mechanism for preventing loops due to multiple failures. This involves marking the nodes redirecting traffic in a header (either individually, or as node groups), and dropping the packet at a transit node if its ID is in the header.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
A few cases are given where "further fast reroute" is harmful. Some of the cases are for MPLS services; others for "plain" MPLS forwarding.¶
Consider the following topology for multihoming an Ethernet VPN (EVPN [RFC7432]) Customer Edge (CE) device for protection against the failure of a Provider Edge (PE) device or a PE-CE link. To do so, there is a backup MPLS path between PE2 and PE3 (denoted by the starred line).¶
                   P1 ...         ... P3 --- PE2
                  /                         *   \ link1
                 /                         *     \
      CE1 --- PE1                          *      CE2
                 \                         *     /
                  \                         *   / link2
                   P2 ...         ... P4 --- PE3
Suppose (known unicast) traffic goes from CE1 to CE2. With active-active multihoming, this traffic will be load-balanced between PE2 (to CE2 via link link1) and PE3 (to CE2 via link2). If link1 were to fail, PE2 can still get traffic for CE2 by sending it over the backup path to PE3 (and similarly for PE3 if link2 fails).¶
However, suppose CE2 is down. PE2 will assume link1 is down and send traffic for CE2 to PE3 over the backup path. PE3 (which thinks that link2 is down; note that the single real failure of CE2 being down is manifested as separate failures to PE2 and PE3) will protect this "second" failure by sending traffic for CE2 over the backup path to PE2. Thus, traffic will ping-pong between PE2 and PE3 until TTL expires.¶
Thus, the attempt to protect traffic to CE2 may end up doing more harm than good, by congesting the backup path between PE2 and PE3 and by giving PE2 and PE3 useless work to do.¶
A similar topology can be used in EVPN-Etree [RFC8317], EVPN-VPWS [RFC8214], IP VPN [RFC4364] or VPLS [RFC4761] [RFC4762]. In all these cases, the same looping behavior would occur for unicast traffic if CE2 is down.¶
                  R0 . . . R1
                .             .
             R7                 R2
Anti-     |  .                   .  |
Clockwise |  .        Ring       .  | Clockwise
          v  .                   .  v
             R6                 R3
                .             .
                  R5 . . . R4
In Resilient MPLS Rings (RMR), suppose traffic goes from a node, say R0, to a node, say R4, over a clockwise path. Protection consists of switching this traffic onto the anti-clockwise path to R4. This works well if a node or link between R0 or R4 is down. However, if node R4 itself is down, its adjacent neighbor R3, will send the traffic anti-clockwise to R4; when this traffic reaches R4's other neighbor R5, it will return to N3, and so on, until TTL expires. [I-D.ietf-mpls-rmr] provides more details, and offers some means of mitigation. This memo offers a more elegant solution.¶
Consider the following topology:¶
         N1 --- N2 --- N3 --- N4
                |       |
                |       |
         N5 --- N6 --- N7 --- N8
                |       |
                |       |
                N9 --- N10
Say link protection is configured for links N2-N3 and N6-N7. Link N2-N3 is protected by a bypass tunnel N2-N6-N7-N3, and link N7-N3 is protected by a bypass tunnel N7-N6-N2-N3. (These bypass tunnels may be set up using RSVP-TE [RFC3209] or via SPRING stacks [RFC8660].) Say furthermore that there is an LSP from N1 to N4 with path N1-N2-N3-N4, which asks for link protection. If link N2-N3 fails, traffic will take the path N1-N2-N6-N7-N3-N4.¶
Suppose, however, links N2-N3 and N7-N3 fail simultaneously. This may happen if they share fate (e.g., go over a common fiber conduit); it may also appear to happen if node N3 fails. Either way, first, the bypass protecting link N2-N3 kicks in, and traffic is sent to N3 via N6 and N7. However, when the traffic hits N7, the bypass for N7-N3 kicks in, and traffic is sent back to N2. Thus the traffic will loop between N2 and N7 until TTL expires, in the process congesting links N2-N6 and N6-N7.¶
Now consider an LSP: N5-N6-N7-N8. The link N6-N7 may be protected by the bypass N6-N2-N3-N7 or by N6-N9-N10-N7, or by load-balancing between these two bypasses. If both links N2-N3 and N6-N7 fail, then traffic that is protected via bypass N6-N2-N3-N7 will ping-pong between N6 and N2 until TTL expires; traffic protected via bypass N6-N9-N10-N7 will successfully make it to N8. If link N6-N7 is protected by load-balancing across the two bypass paths, then about half the traffic will loop between N6 and N2, and the rest will make it to N8.¶
While the above description is for protection using a bypass tunnel, the same principle applies to protection using Loop-Free Alternates [RFC5286] [RFC7490] or any of its variants (such as Topology Independent LFA).¶
To address this issue, we suggest the use of a SPL [RFC7274] called NFFRR (value TBD; suggested: 8). An alternate would be to use an extended SPL, whereby a pair of labels indicates that no further fast route is desired. However, in the case of SPRING MPLS bypass tunnels (Section 3.2.1) of depth N, this would triple the label stack size. Using regular SPLs instead would only double the stack size.¶
To illustrate, we'll first take the example of Figure 3, with MPLS paths signaled using RSVP-TE. This method can be used for paths that use SPRING stacks, but this will be detailed in a later version.¶
         N1 --- N2 --- N3 --- N4       LSP N1 to N4:  L1->L2->null
                |       |          Bypass for N2-N3:  L3->L4->null
                |       |          Bypass for N7-N3:  L5->L6->null
         N5 --- N6 --- N7 --- N8       LSP N5 to N8:  L7->L8->null
                |       |          Bypass1 for N6-N7: L9->L10->null
                |       |          Bypass2 for N6-N7: L11->L12->null
                N9 --- N10           (via N9-N10-N7)
| Node | Action | Next | New Pkt | Comment | 
|---|---|---|---|---|
| N1 | push L1 | N2 | [L1] pkt | ingress | 
| N2 | L1 -> L2 | N3 | [L2] pkt | |
| N3 | pop L2 | N4 | pkt | PHP | 
| N4 | fwd pkt | - | - | continue | 
Note 1: "[L1 ...]" denotes the label stack on the packet; pkt is the original packet received at ingress. "L1 -> L2" means swap label L1 with L2. "pop L2" means pop the top label L2. "fwd pkt" means forward the packet as usual.¶
| Node | Action | Next | New Pkt | Comment | 
|---|---|---|---|---|
| N2 | push L3 | N6 | [L3] pkt | ingress | 
| N6 | L3 -> L4 | N7 | [L4] pkt | |
| N7 | pop L4 | N3 | pkt | PHP | 
| Node | Action | Next | New Pkt | Comment | 
|---|---|---|---|---|
| N7 | push L5 | N6 | [L5] pkt | ingress | 
| N6 | L5 -> L6 | N2 | [L6] pkt | |
| N2 | pop L6 | N3 | pkt | PHP | 
| Node | Action | Next | New Pkt | Comment | 
|---|---|---|---|---|
| N1 | push L1 | N2 | [L1] pkt | ingress | 
| N2 | L1 -> L2 | N3 | [L2] pkt | N3 X | 
| N2 | push L3 | N6 | [L3 L2] pkt | PLR | 
| N6 | L3 -> L4 | N7 | [L4 L2] pkt | |
| N7 | pop L4 | N3 | [L2] pkt | merge | 
| N3 | pop L2 | N4 | pkt | PHP | 
| N4 | fwd pkt | - | - | continue | 
Table 4 is obtained by composing Table 1 and Table 2.¶
Note 2: "N3 X" means "next hop N3 unavailable (because link N2-N3 failed)".¶
| Node | Action | Next | New Pkt | Comment | 
|---|---|---|---|---|
| N1 | push L1 | N2 | [L1] pkt | ingress | 
| N2 | L1 -> L2 | N3 | [L2] pkt | N3 X | 
| N2 | push L3 | N6 | [L3 L2] pkt | PLR | 
| N6 | L3 -> L4 | N7 | [L4 L2] pkt | |
| N7 | pop L4 | N3 | [L2] pkt | N3 X' | 
| N7 | push L5 | N6 | [L5 L2] pkt | |
| N6 | L5 -> L6 | N2 | [L6 L2] pkt | PLR | 
| N2 | pop L6 | N3 | [L2] pkt | N3 X | 
| N2 | push L3 | N6 | [L3 L2] | PLR | 
| etc | loop! | 
Table 5 is obtained by composing Table 1, Table 2 and Table 3.¶
Note 3: "N3 X'" means "next hop N3 unavailable because link N7-N3 is down.¶
Note 4: While the impact of a loop is pretty bad, the impact of an ever-growing label stack (not illustrated here) and possible associated fragmentation on transit nodes may be worse.¶
An LSR (typically a PLR) that wishes to prevent further FRRs after the first one can push an SPL, namely NFFRR, onto the label stack as follows:¶
| Node | Action | Next | New Pkt | Comment | 
|---|---|---|---|---|
| N1 | push L1 | N2 | [L1] pkt | ingress | 
| N2 | L1 -> L2 | N3 | [L2] pkt | N3 X | 
| N2 | push L3, NFFRR | N6 | [L3 NFFRR L2] pkt | PLR | 
| N6 | L3 -> L4 | N7 | [L4 NFFRR L2] pkt | |
| N7 | pop L4, NFFRR | N3 | [L2] pkt | merge | 
| N3 | pop L2 | N4 | pkt | PHP | 
| N4 | fwd pkt | - | - | continue | 
Note 5: N2 can insert an NFFRR label only if it knows that all LSRs in the path can process it correctly. See Section 4 for some details on how this capability is communicated.¶
| Node | Action | Next | New Pkt | Comment | 
|---|---|---|---|---|
| N1 | push L1 | N2 | [L1] pkt | ingress | 
| N2 | L1 -> L2 | N3 | [L2] pkt | N3 X | 
| N2 | push L3, NFFRR | N6 | [L3 NFFRR L2] pkt | PLR | 
| N6 | L3 -> L4 | N7 | [L4 NFFRR L2] pkt | |
| N7 | pop L4 | N3 | [NFFRR L2] pkt | N3 X | 
| N7 | check NFFRR | - | - | drop pkt | 
Note 6: "check NFFRR" means that, before N7 applies FRR (because link N7-N3 is down), N7 checks the label below the top label (or in this case, because of PHP, the top label itself). If this is the NFFRR label, N7 drops the packet rather than apply FRR.¶
Suppose that, to protect link N2-N3, a bypass tunnel N2-N6-N7-N3 were instantiated using SPRING MPLS [RFC8660], in particular, using adjacency SIDs. If the corresponding labels for links N6-N7 and N7-N3 were L20 and L21, the bypass would consist of pushing the label stack [L20 L21] onto the packet and sending the packet to N6. To indicate that FRR has already occurred and to drop the packet rather than to try to protect the packet again, N2 would have to push [L20 NFFRR L21 NFFRR] onto the packet before sending it to N6. If the packet came from N1 with label L1, N2 would send a packet with label stack [L20 NFFRR L21 NFFRR L2] to N6.¶
N6 would see L20, pop it, note the NFFRR label and pop it, then attempt to send the packet to N7. If the link N6-N7 is down, N6 drops the packet. Otherwise, N7 gets the packet, sees L21, pops it, sees NFFRR, pops it and tries to send the packet to N3. If link N7-N3 is down, N7 drops the packet. Otherwise, N3 gets the packet with L2, swaps with with L3 and sends it to N4.¶
Note that with SPRING MPLS, the NFFRR label needs to be repeated for each label in the bypass stack. Hence the request for a "regular" SPL rather than an extended SPL.¶
First, we illustrate known unicast EVPN forwarding:¶
| Node | Action | Next | Packet | Comment | 
|---|---|---|---|---|
| PE1 | send to CE2 | PE2 | [T1 S2] pkt | EVPN | 
| PE2 | send to CE2 | link1 | pkt | done! | 
Note: T1/T2/T3 are the transport labels for PE1/PE3/PE2 to reach PE2/PE2/PE3 respectively. S2/S3 are the service labels announced by PE2/PE3 for CE2.¶
Then, we show what happens when CE2 is down without NFFRR:¶
| Node | Action | Next | Packet | Comment | 
|---|---|---|---|---|
| PE1 | send to CE2 | PE2 | [T1 S2] pkt | EVPN | 
| PE2 | send to CE2 | link1 | -- | link1 X | 
| PE2 | send to CE2 | PE3 | [T3 S3] pkt | eFRR | 
| PE3 | send to CE2 | link2 | -- | link2 X | 
| PE3 | send to CE2 | PE2 | [T2 S2] pkt | eFRR | 
| PE2 | send to CE2 | link1 | -- | link1 X | 
| PE2 | send to CE2 | PE3 | [T3 S3] pkt | eFRR | 
| ... | loop! | 
Note: link1/link2 X means link1/link2 is down. eFRR refers to EVPN multihoming FRR.¶
In the case of MPLS services such as EVPN Figure 1, the NFFRR label is inserted below the service label, as shown below:¶
| Node | Action | Next | Packet | Comment | 
|---|---|---|---|---|
| PE1 | send to CE2 | PE2 | [T1 S2] pkt | EVPN | 
| PE2 | send to CE2 | link1 | -- | link1 X | 
| PE2 | send to CE2 | PE3 | [T3 S2 NFFRR] pkt | eFRR | 
| PE3 | send to CE2 | link2 | -- | link2 X | 
| PE3 | drop pkt | -- | -- | check NFFRR | 
Note: "check NFFRR" is as above.¶
As described in Figure 2, packets will loop until TTL expires if the destination node in an RMR ring (here, R4) fails. The solution in this case is that the first node to apply RMR protection (R3) pops the current RMR transport label being used, sees that the next label is not NFFRR (so protection is allowed), pushes an NFFRR label and then the RMR transport label for the reverse direction.¶
When R5 receives the packet, it sees that the next link is down, pops the RMR transport label, sees the NFFRR label and drops the packet. Thus, the loop is avoided.¶
The ideal choice would be an attribute consisting of a bit vector of node capabilities, one bit of which would be the capability of processing the NFFRR SPL below the BGP service label. This would be used by BGP L2VPN, BGP VPLS, EVPN, E-Tree and E-VPWS. An alternative is to use the BGP Capabilities Optional Parameter [I-D.ietf-idr-next-hop-capability]. Details to be worked out.¶
One approach to signaling NFFRR capability for MPLS services signaled with targeted LDP is to introduce a new LDP TLV called the NFFRR Capability TLV as an Optional Parameter in the Label Mapping Message [RFC5036]. This TLV has Type TBD (suggested: 0x0207) and Length 0.¶
Another approach is to use LDP Capabilities [RFC5561]; this approach has the advantage that it deals with capabilities on a node basis rather than on a per label mapping basis. However, there don't appear to be other documents using this approach.¶
The authors suggest signaling a router's ability to process the NFFRR SPL using the Link State Router TE Node Capabilities [RFC5073], which works for both IS-IS and OSPF. A new TE Node Capability bit, the N bit (suggested value 5) indicates that the advertising node is capable of processing the NFFRR SPL.¶
If this draft is deemed useful, a way to signal that No Further Fast-route should be performed on a packet will be needed. There are two approaches: allocate an SPL for NFFRR: if so, we suggest the early allocation of label 8 for this. Alternatively, if [I-D.kompella-mpls-mspl4fa] (or similar) is adopted, allocate a forwarding action bit saying whether or not to do FRR.¶
Furthermore, means of signaling the ability to process the NFFRR SPL/bit should be defined for IS-IS, OSPF, LDP and BGP.¶
The following update is suggested for the Link State Router TE Node Capabilities registry:¶
| Bit | Name | Reference | 
|---|---|---|
| 5 | NFFRR | This document | 
The following update is suggested for the TLV Type Name Space of the Label Distribution Protocol (LDP) Parameters registry:¶
| Type | Name | Reference | 
|---|---|---|
| 0x0207 | NFFRR | This document | 
A malicious or compromised LSR can insert NFFRR into a label stack, preventing FRR from occurring. If so, protection will not kick in for failures that could have been protected, and there will be unnecessary packet loss.¶