Tracking time delays in the RPKI-based Route Origin Validation supply chain 

What is the life cycle of Resource Public Key Infrastructure (RPKI) data used to secure Internet routing? More specifically, how long does a Route Origin Authorization (ROA) take to propagate, and how quickly does it actually affect Internet routing and reachability?

These are questions that network operators would love to have answers to, given that changes on the RPKI management plane can impact how traffic flows to or from their networks. I recently collaborated on a project, RPKI Time-of-Flight: Tracking Delays in the Management, Control, and Data Planes, to answer these questions by dissecting the stages in the life of RPKI data.

Below is a summary of the RPKI lifecycle and our findings.

Key points:
  • Creation times vary significantly across the Regional Internet Registries (RIRs), ranging from a few minutes to over an hour for new ROAs to reach the publication points.
  • High publication delays were initially observed for ARIN and LACNIC due to a time zone issue. The problem has been reported and is now fixed. Observed delays are usually less than 20 minutes.
  • Relying Party (RP) delay represents the most time-consuming step observed in ROA processing.
  • Deleting ROAs takes longer to reflect in BGP as routers explore alternate routes that have not yet been invalidated.

ROV supply chain

Publishing ROAs is complex. The process involves several players, is not instantaneous, and is often dominated by ad hoc administrative decisions.

It starts when a resource holder queries an RIR to create or update RPKI information for its prefixes. The ROAs and other meta files (manifests, CRLs) are then placed in public repositories called publication points.

RPs periodically fetch and validate all the objects from the global RPKI repositories, after which they produce a list of Validated ROA Payloads (VRPs) that routers use to verify incoming BGP announcements. These changes are fetched by operators performing Route Origin Validation (ROV-enabled ASes, green in Figure 1) that use this new information to update their routers. Only then do you start to see changes on the data plane when routing announcements are either accepted or dropped by ROV-enabled ASes.

Figure 1 — Data flow from the creation of a ROA by the prefix holder to the corresponding BGP updates recorded at the route collectors (RIS / RouteViews). The red labels on the left show the points at which time measurements were taken.

The Time to Create or Delete ROAs Varies

Each of the above steps is common to all RIRs and ROV-enabled ASes, but each may perform these steps at different time intervals and frequencies.

In our study, we found that RIRs usually publish new RPKI information within five minutes, except APNIC, which was on average ten minutes slower (Table 1, Publication column). We also observed significant disparities in ISPs’ reaction time to new RPKI information, ranging from a few minutes to one hour.

RIR        Sign (min)   NotBefore (min)   Publication (min)   Relying Party (min)   BGP (min)
AFRINIC    0 (0)        0 (0)             3 (2)               14 (13)               15 (16)
APNIC      10 (13)      14 (16)           10 (13)             34 (38)               26 (28)
ARIN       – (-)        – (-)             69 (97)             81 (109)              95 (143)
LACNIC     0 (0)        – (-)             54 (32)             66 (42)               51 (34)
RIPE       0 (0)        0 (0)             4 (4)               14 (13)               18 (18)
After Fix
ARIN       – (-)        – (-)             8 (9)               21 (22)               28 (23)
Table 1 — ROA creation median delays in minutes (IPv6 in parentheses).

When deleting ROAs, we found the delay to be significantly longer (Table 2, BGP column), except for ARIN and LACNIC (I’ll explain why these differ below).

RIR        Revocation (min)   Relying Party (min)   BGP (min)
AFRINIC    0 (0)              13 (14)               34 (38)
APNIC      10 (12)            31 (36)               51 (56)
ARIN       0 (0)              14 (16)               45 (51)
LACNIC     0 (0)              18 (20)               48 (49)
RIPE       0 (0)              14 (13)               41 (50)
Table 2 — ROA deletion median delays in minutes (IPv6 in parentheses).

For ROA revocation, we observed that the delay between ROA deletion and unreachability varies depending on the topology. Again, BGP delays are significantly higher for ROA deletion than for ROA creation, probably because the prefix stays reachable until every neighbor has withdrawn the route. For example, the median BGP delay for unreachability went up to 51 minutes for IPv4 and 56 minutes for IPv6 (APNIC in Table 2), and we rarely observed short BGP delays.

We proposed two possible causes for this:

  1. Using multiple RP caches (for redundancy) likely results in significantly slower ROA deletion than ROA creation.
  2. BGP path hunting (Figure 3): in some cases, we observed that the AS path between the RIPE Atlas probe and the destination changed before the destination became unreachable.
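
The first cause can be illustrated with a minimal sketch. It assumes a router that merges the VRPs from all of its RP caches, so a VRP is in use as soon as any cache serves it and is dropped only once every cache has removed it. That merge behavior, the function name, and the per-cache delays are illustrative assumptions, not measurements from the study.

```python
def router_view_delays(cache_delays_min):
    """Return (creation_delay, deletion_delay) in minutes as seen by a router.

    Assumption: the router uses the union of the VRPs served by its caches.
    A new VRP takes effect when the FIRST cache has it (min), while a deleted
    VRP disappears only when the LAST cache has dropped it (max).
    """
    if not cache_delays_min:
        raise ValueError("at least one RP cache is required")
    creation_delay = min(cache_delays_min)
    deletion_delay = max(cache_delays_min)
    return creation_delay, deletion_delay


# Hypothetical per-cache delays (minutes) between a ROA change and the
# moment each cache reflects it.
single_cache = router_view_delays([14])
three_caches = router_view_delays([14, 22, 31])

print("one cache:    creation %d min, deletion %d min" % single_cache)
print("three caches: creation %d min, deletion %d min" % three_caches)
```

Under this assumption, adding caches can only shorten (or leave unchanged) the creation delay and can only lengthen (or leave unchanged) the deletion delay, which matches the direction of the asymmetry described above.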

ARIN and LACNIC time zone issues

Before April 2022, the publication delay for ARIN and LACNIC could last several hours due to a time zone conversion problem.

Figure 2 — Difference between ROA creation and deletion between the RIRs.

Both RIRs intended to set ROA NotBefore values to midnight. However, ARIN had been setting this value to 04:00 UTC or 05:00 UTC (corresponding to 00:00 in Eastern Daylight Time and Eastern Standard Time, respectively) and LACNIC to 03:00 UTC (corresponding to 00:00 in Uruguay Standard Time).

For example, a query at 01:00 UTC to create a ROA in LACNIC would create a ROA with a NotBefore value set to 03:00 UTC. Therefore, the ROA would be invalid for the two hours following its creation. Our experiment revealed that the publication point wisely does not publish the ‘not-yet-valid’ ROA to the repository, thereby delaying its availability to RPs. The same holds for ARIN.
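
The LACNIC example can be reproduced with a short sketch. It shows one plausible way such an offset could arise: midnight is built in the registry's local time zone instead of UTC. The function names, the sample date, and the assumption that the intended value was midnight UTC are hypothetical; the RIRs' actual code is not described in this post.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Uruguay Standard Time as a fixed UTC-3 offset (no tz database needed).
UYT = timezone(timedelta(hours=-3))


def intended_not_before(query_utc):
    """Assumed intent: midnight UTC on the day of the query."""
    return query_utc.replace(hour=0, minute=0, second=0, microsecond=0)


def local_midnight_not_before(query_utc, local_tz):
    """Midnight of the same calendar date, but in the local time zone."""
    local_midnight = datetime(
        query_utc.year, query_utc.month, query_utc.day, tzinfo=local_tz
    )
    return local_midnight.astimezone(UTC)


def wait_until_valid(query_utc, not_before):
    """How long the new ROA stays 'not yet valid' after the query."""
    return max(not_before - query_utc, timedelta(0))


query = datetime(2022, 3, 1, 1, 0, tzinfo=UTC)  # 01:00 UTC, arbitrary date
observed = local_midnight_not_before(query, UYT)

print("NotBefore (local midnight):", observed.isoformat())
print("wait with local midnight:  ", wait_until_valid(query, observed))
print("wait with UTC midnight:    ",
      wait_until_valid(query, intended_not_before(query)))
```

With these inputs, the local-midnight variant yields a NotBefore of 03:00 UTC and a two-hour wait, while the UTC-midnight variant yields no wait at all.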

We reported this issue to ARIN and LACNIC, who promptly acknowledged and fixed the problem.

Data Plane Measurements 

As more ROAs are created to protect prefixes from being mis-originated, one wonders how long it takes for the effect of RPKI changes to appear in the data plane.

To measure this, we used a ‘toggling ROAs’ mechanism: an ‘invalidating’ ROA with AS666 kept the RPKI status of our advertised prefixes ‘invalid’ by default. We then toggled the RPKI status of the test prefixes between ‘valid’ and ‘invalid’ by either creating a ‘validating’ ROA with the properly authorized origin AS or deleting that ROA.
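
The toggling mechanism can be sketched with a small route origin validator in the spirit of RFC 6811. This is an illustration, not the tooling used in the study: the test prefix (192.0.2.0/24) and origin AS (64500) are documentation placeholders, and the helper name `rov_state` is hypothetical.

```python
from ipaddress import ip_network


def rov_state(prefix, origin_as, vrps):
    """Classify a route as 'valid', 'invalid', or 'not-found'.

    vrps is a list of (prefix, max_length, asn) tuples.
    """
    route = ip_network(prefix)
    covered = False
    for vrp_prefix, max_length, vrp_as in vrps:
        vrp_net = ip_network(vrp_prefix)
        # A VRP covers the route if the route lies inside the VRP prefix.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        # AS0 never matches a route, so it can only invalidate.
        if route.prefixlen <= max_length and origin_as == vrp_as and vrp_as != 0:
            return "valid"
    return "invalid" if covered else "not-found"


TEST_PREFIX = "192.0.2.0/24"   # placeholder prefix
ORIGIN_AS = 64500              # placeholder origin AS

invalidating = (TEST_PREFIX, 24, 666)       # always present
validating = (TEST_PREFIX, 24, ORIGIN_AS)   # created, then deleted

states = [
    rov_state(TEST_PREFIX, ORIGIN_AS, [invalidating]),
    rov_state(TEST_PREFIX, ORIGIN_AS, [invalidating, validating]),
    rov_state(TEST_PREFIX, ORIGIN_AS, [invalidating]),
]
print(states)
print(rov_state(TEST_PREFIX, ORIGIN_AS, []))
```

The last call shows why the invalidating ROA is kept in place: without any covering ROA the route would be ‘not-found’ rather than ‘invalid’, and ROV-enabled ASes would typically still accept it.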

To test data plane reachability and the delay of prefixes with toggling ROAs, we performed traceroutes every 15 minutes from RIPE Atlas with probes in six different ASes. When creating a ‘validating’ ROA, the delay between the user query and data plane reachability was similar to the BGP delay. We observed a median delay between 23 minutes (RIPE) and 50 minutes (APNIC).

Figure 3 — Effects of ROA creation/deletion on the data plane. We observe BGP path hunting with AS path changes.

RP Delay: Possible Bottleneck?

RPs periodically fetch RPKI data from publication points.

We found that RP delay is the most time-consuming step observed in ROA processing. The delay between ROA creation and the time when an RP validates the new ROA was usually less than 15 minutes for most RIRs. This is 10 minutes more than the publication delay and consists mainly of:

  • Polling interval (5 minutes delay on average) 
  • Downloading time from all Certification Authorities (4 minutes) 
  • ROA processing time (1 minute).
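
As a quick sanity check, the three components above can be summed and compared with the gap between the Relying Party and Publication columns of Table 1. The sketch below uses the IPv4 medians from that table and reads both columns as delays since the user query. Note that a difference of two medians is only an approximation of the median RP delay.

```python
# Components of the RP delay listed above, in minutes.
RP_COMPONENTS_MIN = {
    "polling interval": 5,
    "download from all CAs": 4,
    "ROA processing": 1,
}

# (Publication, Relying Party) IPv4 medians from Table 1, in minutes.
TABLE1_IPV4_MEDIANS_MIN = {
    "AFRINIC": (3, 14),
    "RIPE": (4, 14),
    "ARIN (after fix)": (8, 21),
}

component_total = sum(RP_COMPONENTS_MIN.values())
gaps = {
    rir: rp - publication
    for rir, (publication, rp) in TABLE1_IPV4_MEDIANS_MIN.items()
}

for rir, gap in gaps.items():
    print(f"{rir}: RP - publication = {gap} min "
          f"(components sum to {component_total} min)")
```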

Other factors that can potentially affect RP delays include varying download times due to network conditions, publication point timeouts, or a lack of multi-threading support in some RP software.

Read our paper and check out our GitHub to learn more about our methodology and findings.

This study was partly funded by the MANRS Fellowship program and was a collaboration between IIJ Research Lab, Internet Society, UCLouvain, LAAS-CNRS, and Arrcus Inc.

Contributor: Romain Fontugne.
