Normalization

Normalization allows us to determine whether two URLs refer to the same resource. URL comparisons serve the same purpose: two strings are compared as if they had been normalized.

Without full knowledge or control of the resources, there is no way to determine with certainty whether two URLs refer to the same resource. Equivalence is therefore based on string comparison augmented by additional URL and scheme rules. As a consequence, a comparison cannot prove that two URLs identify different resources, because the same resource can always be served from different addresses.

For this reason, comparison methods are designed to minimize false negatives while strictly avoiding false positives. In other words, if two URLs compare equal, they definitely represent the same resource. If they are considered different, they might still refer to the same resource depending on the application.

Increasingly context-dependent rules can be applied to reduce the number of false negatives. The methods below are listed from cheapest to most expensive, and cheaper methods have a higher chance of producing false negatives:

  • Simple String Comparison

  • Syntax-Based Normalization

  • Scheme-Based Normalization

  • Protocol-Based Normalization

Simple String Comparison

Simple String Comparison can be performed by accessing the underlying buffer of URLs:

url_view u1("https://www.boost.org/index.html");
url_view u2("https://www.boost.org/doc/../index.html");
assert(u1.buffer() != u2.buffer());

Simple String Comparison fails to identify that the URLs above point to the same resource, even though the rules of rfc3986 alone are enough to establish it.

Syntax-Based Normalization

The comparison operators implement Syntax-Based Normalization, which applies the rules defined by rfc3986:

url_view u1("https://www.boost.org/index.html");
url_view u2("https://www.boost.org/doc/../index.html");
assert(u1 == u2);

In mutable URLs, the member function normalize can also be used to apply Syntax-Based Normalization to a URL. A normalized URL is represented by a canonical string where any two strings that would compare equal end up with the same underlying representation. In other words, Simple String Comparison of two normalized URLs is equivalent to Syntax-Based Normalization.

url_view u1("https://www.boost.org/index.html");
url u2("https://www.boost.org/doc/../index.html");
assert(u1.buffer() != u2.buffer());
assert(u1 == u2);
u2.normalize();
assert(u1.buffer() == u2.buffer());
assert(u1 == u2);

Syntax-Based Normalization Procedure

Normalization uses the following definitions of rfc3986 to minimize false negatives:

  • Case Normalization: percent-encoding triplets are normalized to use uppercase letters, and the case-insensitive scheme and host are normalized to lowercase

  • Percent-Encoding Normalization: decode octets that do not require percent-encoding

  • Path Segment Normalization: path segments "." and ".." are resolved
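To make Path Segment Normalization more concrete, the following is a minimal sketch of the remove_dot_segments algorithm from rfc3986 section 5.2.4, written with the standard library only. The function name remove_dot_segments is borrowed from the RFC and is not a Boost.URL function; the library's own implementation may differ in details, such as how leading ".." segments in relative paths are treated.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Sketch of rfc3986 section 5.2.4 (remove_dot_segments).
// Assumption: `in` is an already percent-encoded path component.
std::string remove_dot_segments(std::string_view in)
{
    std::string out;

    auto starts_with = [&in](std::string_view p) {
        return in.substr(0, p.size()) == p;
    };
    // Remove the last segment of the output and its preceding "/", if any.
    auto pop_segment = [&out] {
        std::size_t pos = out.rfind('/');
        out.erase(pos == std::string::npos ? 0 : pos);
    };

    while (!in.empty())
    {
        if (starts_with("../"))       in.remove_prefix(3);   // rule A
        else if (starts_with("./"))   in.remove_prefix(2);   // rule A
        else if (starts_with("/./"))  in.remove_prefix(2);   // rule B: "/./" becomes "/"
        else if (in == "/.")          in = "/";              // rule B
        else if (starts_with("/../")) { in.remove_prefix(3); pop_segment(); } // rule C
        else if (in == "/..")         { in = "/"; pop_segment(); }            // rule C
        else if (in == "." || in == "..") in = {};           // rule D
        else
        {
            // Rule E: move the first segment (with its leading "/", if any)
            // from the input to the output.
            std::size_t n = in.find('/', 1);
            if (n == std::string_view::npos)
                n = in.size();
            out.append(in.substr(0, n));
            in.remove_prefix(n);
        }
    }
    return out;
}
```

Applied to the path "/doc/../index.html", this sketch yields "/index.html", which is why the two URLs used throughout this section compare equal after normalization.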

The following example normalizes the percent-encoding and path segments of a URL:

url u("https://www.boost.org/doc/../%69%6e%64%65%78%20file.html");
u.normalize();
assert(u.buffer() == "https://www.boost.org/index%20file.html");
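The percent-encoding steps seen in the example above can be sketched independently of the library. The helper below is hypothetical (the names is_unreserved, hex_value and normalize_pct_encoding are illustrative, not Boost.URL functions) and assumes its input is a single URL component: it decodes triplets that represent rfc3986 unreserved characters and rewrites the remaining triplets with uppercase hexadecimal digits.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// rfc3986: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
bool is_unreserved(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

// Returns the value of a hexadecimal digit, or -1 if `c` is not one.
int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: normalizes the percent-encoding of one component.
std::string normalize_pct_encoding(std::string_view s)
{
    static constexpr char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        int hi = -1;
        int lo = -1;
        if (s[i] == '%' && i + 2 < s.size())
        {
            hi = hex_value(s[i + 1]);
            lo = hex_value(s[i + 2]);
        }
        if (hi < 0 || lo < 0)
        {
            // Not the start of a valid triplet: copy the character as-is.
            out.push_back(s[i]);
            continue;
        }
        char decoded = static_cast<char>(hi * 16 + lo);
        if (is_unreserved(decoded))
        {
            // Percent-Encoding Normalization: the octet needs no encoding.
            out.push_back(decoded);
        }
        else
        {
            // Case Normalization: keep the triplet, with uppercase digits.
            out.push_back('%');
            out.push_back(hex[hi]);
            out.push_back(hex[lo]);
        }
        i += 2; // skip the two hex digits just consumed
    }
    return out;
}
```

With this sketch, the segment "%69%6e%64%65%78%20file.html" becomes "index%20file.html": the letters are unreserved and get decoded, while the space must stay encoded as "%20".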

Scheme and Protocol-Based Normalization

Syntax-Based Normalization can also be used as a first step for Scheme-Based and Protocol-Based Normalization. Common scheme-specific rules include removing a port that is empty or equal to the default port of the scheme, and replacing an empty path with the absolute path "/" when an authority is present:

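auto normalize_http_url =
    [](url& u)
{
    // Start from the syntax-based canonical form
    u.normalize();
    // An empty port or the default HTTP port (80) is redundant
    if (u.port() == "80" ||
        u.port().empty())
        u.remove_port();
    // With an authority, an empty path is equivalent to "/"
    if (u.has_authority() &&
        u.encoded_path().empty())
        u.set_path_absolute(true);
};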

url u1("http://www.boost.org");
normalize_http_url(u1);
url u2("http://www.boost.org/");
normalize_http_url(u2);
url u3("http://www.boost.org:/");
normalize_http_url(u3);
url u4("http://www.boost.org:80/");
normalize_http_url(u4);

assert(u1.buffer() == "http://www.boost.org/");
assert(u2.buffer() == "http://www.boost.org/");
assert(u3.buffer() == "http://www.boost.org/");
assert(u4.buffer() == "http://www.boost.org/");
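Because the default port depends on the scheme (80 for http, 443 for https), a rule like the one above has to be selected per scheme. A minimal standard-library sketch of that lookup follows. The helper names default_port and is_redundant_port are hypothetical, the table covers only a few well-known schemes, and the scheme is assumed to be already lowercase, as it would be after Syntax-Based Normalization.

```cpp
#include <string_view>

// Hypothetical lookup of well-known default ports.
// Returns an empty view for schemes this sketch does not know about.
std::string_view default_port(std::string_view scheme)
{
    if (scheme == "http" || scheme == "ws")
        return "80";
    if (scheme == "https" || scheme == "wss")
        return "443";
    if (scheme == "ftp")
        return "21";
    return {};
}

// A port carries no information if it is empty or equal to the
// default port of the (lowercase) scheme.
bool is_redundant_port(std::string_view scheme, std::string_view port)
{
    return port.empty() || port == default_port(scheme);
}
```

A scheme-aware normalization function could call is_redundant_port(u.scheme(), u.port()) before removing the port, so that a URL such as "https://www.boost.org:80/" keeps its non-default port.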

Other criteria commonly used to minimize false negatives for specific schemes are:

  • Handling empty authority component as an error or localhost

  • Replacing authority with empty string for the default authority

  • Normalizing extra subcomponents with case-insensitive data

  • Normalizing paths with case-insensitive data
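As an illustration of the last two criteria, the sketch below lowercases a component whose data is known to be case-insensitive for a given scheme. The helper lowercase_component is hypothetical and assumes the component has already gone through Syntax-Based Normalization, so percent-encoded triplets are left untouched to preserve their uppercase hexadecimal digits. Whether a path or subcomponent is actually case-insensitive depends on the scheme or the application, so this rule must not be applied blindly.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: lowercases ASCII letters in a component while
// leaving percent-encoded triplets (such as "%2F") unchanged.
std::string lowercase_component(std::string_view s)
{
    std::string out(s);
    for (std::size_t i = 0; i < out.size(); ++i)
    {
        if (out[i] == '%')
        {
            // Skip the two hex digits of the triplet.
            i += 2;
            continue;
        }
        if (out[i] >= 'A' && out[i] <= 'Z')
            out[i] = static_cast<char>(out[i] - 'A' + 'a');
    }
    return out;
}
```

For example, "/Docs/Index%2FA.HTML" becomes "/docs/index%2Fa.html": the letters are lowercased, while the encoded slash keeps its uppercase form.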