Collapsing // to / inside an HTTP URL path is not normalization.
RFC 3986 defines the path component and the segment grammar in a way that allows for empty segments. A double slash is therefore syntactically meaningful. It represents a zero-length segment between two separators.
3.3. Path
The path component contains data, usually organized in hierarchical form, that, along with data in the non-hierarchical query component (Section 3.4), serves to identify a resource within the scope of the URI’s scheme and naming authority (if any). The path is terminated by the first question mark ("?") or number sign ("#") character, or by the end of the URI.
If a URI contains an authority component, then the path component must either be empty or begin with a slash ("/") character. If a URI does not contain an authority component, then the path cannot begin with two slash characters ("//"). In addition, a URI reference (Section 4.1) may be a relative-path reference, in which case the first path segment cannot contain a colon (":") character. The ABNF requires five separate rules to disambiguate these cases, only one of which will match the path substring within a given URI reference. We use the generic term “path component” to describe the URI substring matched by the parser to one of these rules.
    path          = path-abempty    ; begins with "/" or is empty
                  / path-absolute   ; begins with "/" but not "//"
                  / path-noscheme   ; begins with a non-colon segment
                  / path-rootless   ; begins with a segment
                  / path-empty      ; zero characters
    path-abempty  = *( "/" segment )
    path-absolute = "/" [ segment-nz *( "/" segment ) ]
    path-noscheme = segment-nz-nc *( "/" segment )
    path-rootless = segment-nz *( "/" segment )
    path-empty    = 0<pchar>
    segment       = *pchar
    segment-nz    = 1*pchar
    segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                  ; non-zero-length segment without any colon ":"
    pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
A path consists of a sequence of path segments separated by a slash ("/") character. A path is always defined for a URI, though the defined path may be empty (zero length). Use of the slash character to indicate hierarchy is only required when a URI will be used as the context for relative references. For example, the URI mailto:fred@example.com has a path of “fred@example.com”, whereas the URI foo://info.example.com?fred has an empty path.
The path segments “.” and “..”, also known as dot-segments, are defined for relative reference within the path name hierarchy. They are intended for use at the beginning of a relative-path reference (Section 4.2) to indicate relative position within the hierarchical tree of names. This is similar to their role within some operating systems’ file directory structures to indicate the current directory and parent directory, respectively. However, unlike in a file system, these dot-segments are only interpreted within the URI path hierarchy and are removed as part of the resolution process (Section 5.2).
Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax.
Because segment = *pchar,
the empty string is a valid segment.
Therefore,
path-abempty = *( "/" segment )
allows a slash followed by an empty segment.
Any transformation that collapses // to /
removes a syntactically valid segment
and thus changes the parsed sequence of segments.
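To make the grammar argument concrete, here is a minimal Python sketch that splits a path into segments the way path-abempty = *( "/" segment ) reads. The helper name path_segments is hypothetical and not taken from any library; it only illustrates that every "/" introduces one segment, which may be empty.

```python
def path_segments(path: str) -> list[str]:
    """Split a path-abempty string into segments (RFC 3986, Section 3.3).

    Hypothetical helper for illustration: each "/" starts one segment,
    and that segment is allowed to be the empty string.
    """
    if path == "":
        return []
    if not path.startswith("/"):
        raise ValueError("path-abempty must be empty or begin with '/'")
    # Drop the leading "/" and split on the remaining separators.
    # str.split keeps empty strings, which is exactly what the grammar says.
    return path[1:].split("/")


print(path_segments("/furweb.git/"))   # ['furweb.git', '']
print(path_segments("/furweb.git//"))  # ['furweb.git', '', '']
```

The two paths parse to segment lists of different lengths, so a transformation that maps one to the other is not a no-op on the parsed form.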
HTTP (RFC 9110) uses the RFC 3986 path grammar for request targets.
4.1. URI References
URI references are used to target requests, indicate redirects, and define relationships.
The definitions of “URI-reference”, “absolute-URI”, “relative-part”, “authority”, “port”, “host”, “path-abempty”, “segment”, and “query” are adopted from the URI generic syntax. An “absolute-path” rule is defined for protocol elements that can contain a non-empty path component. (This rule differs slightly from the path-abempty rule of RFC 3986, which allows for an empty path, and path-absolute rule, which does not allow paths that begin with “//”.) A “partial-URI” rule is defined for protocol elements that can contain a relative URI but not a fragment component.
    URI-reference = <URI-reference, see [URI], Section 4.1>
    absolute-URI  = <absolute-URI, see [URI], Section 4.3>
    relative-part = <relative-part, see [URI], Section 4.2>
    authority     = <authority, see [URI], Section 3.2>
    uri-host      = <host, see [URI], Section 3.2.2>
    port          = <port, see [URI], Section 3.2.3>
    path-abempty  = <path-abempty, see [URI], Section 3.3>
    segment       = <segment, see [URI], Section 3.3>
    query         = <query, see [URI], Section 3.4>
    absolute-path = 1*( "/" segment )
    partial-URI   = relative-part [ "?" query ]
4.2.1. http URI Scheme
    http-URI = "http" "://" authority path-abempty [ "?" query ]
The origin server for an “http” URI is identified by the authority component, which includes a host identifier ([URI], Section 3.2.2) and optional port number ([URI], Section 3.2.3). If the port subcomponent is empty or not given, TCP port 80 (the reserved port for WWW services) is the default.
The hierarchical path component and optional query component identify the target resource within that origin server’s namespace.
Collapsing // alters the sequence of segments
and therefore alters the identifier.
Unless the origin explicitly defines those two identifiers as equivalent,
a generic normalizer has no authority to do so. Only the origin can
munge URIs in its own namespace.
RFC 3986 is quite explicit about what syntax-based normalization is: case normalization, percent-encoding normalization, and dot-segment removal. It does not list any rule that removes empty segments or collapses multiple slashes.
6.2.2. Syntax-Based Normalization
Implementations may use logic based on the definitions provided by this specification to reduce the probability of false negatives. This processing is moderately higher in cost than character-for- character string comparison. For example, an application using this approach could reasonably consider the following two URIs equivalent:
    example://a/b/c/%7Bfoo%7D
    eXAMPLE://a/./b/../b/%63/%7bfoo%7d
Web user agents, such as browsers, typically apply this type of URI normalization when determining whether a cached response is available. Syntax-based normalization includes such techniques as case normalization, percent-encoding normalization, and removal of dot-segments.
Path normalization is quite narrowly specified too:
it is about . and .. in relative references, not empty segments.
6.2.2.3. Path Segment Normalization
The complete path segments “.” and “..” are intended only for use within relative references (Section 4.1) and are removed as part of the reference resolution process (Section 5.2). However, some deployed implementations incorrectly assume that reference resolution is not necessary when the reference is already a URI and thus fail to remove dot-segments when they occur in non-relative paths. URI normalizers should remove dot-segments by applying the remove_dot_segments algorithm to the path, as described in Section 5.2.4.
Notice what is not present: there is no rule permitting removal of empty segments, nor any directive to coalesce repeated separators.
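As a check on that reading, the following Python sketch transcribes steps A through E of the remove_dot_segments algorithm from RFC 3986, Section 5.2.4. The function name comes from the RFC; the transcription is my own and is meant as an illustration rather than a reference implementation. None of the steps matches an empty segment, so "//" passes through untouched.

```python
def remove_dot_segments(path: str) -> str:
    """Sketch of RFC 3986, Section 5.2.4 (steps A-E), for illustration."""
    inp = path
    out: list[str] = []
    while inp:
        # A: strip a leading "../" or "./"
        if inp.startswith("../"):
            inp = inp[3:]
        elif inp.startswith("./"):
            inp = inp[2:]
        # B: replace a leading "/./" or a lone "/." with "/"
        elif inp.startswith("/./"):
            inp = inp[2:]
        elif inp == "/.":
            inp = "/"
        # C: replace a leading "/../" or a lone "/.." with "/" and
        #    drop the last segment already moved to the output
        elif inp.startswith("/../"):
            inp = inp[3:]
            if out:
                out.pop()
        elif inp == "/..":
            inp = "/"
            if out:
                out.pop()
        # D: a lone "." or ".." is removed
        elif inp in (".", ".."):
            inp = ""
        # E: move the first segment (with its leading "/", if any) to the output
        else:
            nxt = inp.find("/", 1)
            if nxt == -1:
                out.append(inp)
                inp = ""
            else:
                out.append(inp[:nxt])
                inp = inp[nxt:]
    return "".join(out)


print(remove_dot_segments("/a/./b/../b/%63/%7bfoo%7d"))  # /a/b/%63/%7bfoo%7d
print(remove_dot_segments("/furweb.git//"))              # /furweb.git//
```

Only the complete segments "." and ".." trigger a rewrite. An empty segment falls through to step E and is copied to the output like any other segment.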
HTTP adds a few scheme-based normalization rules, and they are still quite narrow. The only rule that touches the path concerns the empty path component (not empty segments inside a path):
4.2.3. http(s) Normalization and Comparison
URIs with an “http” or “https” scheme are normalized and compared according to the methods defined in Section 6 of [URI], using the defaults described above for each scheme.
HTTP does not require the use of a specific method for determining equivalence. For example, a cache key might be compared as a simple string, after syntax-based normalization, or after scheme-based normalization.
Scheme-based normalization (Section 6.2.3 of [URI]) of “http” and “https” URIs involves the following additional rules:
If the port is equal to the default port for a scheme, the normal form is to omit the port subcomponent.
When not being used as the target of an OPTIONS request, an empty path component is equivalent to an absolute path of “/”, so the normal form is to provide a path of “/” instead.
The scheme and host are case-insensitive and normally provided in lowercase; all other components are compared in a case-sensitive manner.
Characters other than those in the “reserved” set are equivalent to their percent-encoded octets: the normal form is to not encode them (see Sections 2.1 and 2.2 of [URI]).
Again, it does not include collapsing // inside the path.
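The rules quoted above can be sketched in a few lines of Python using the standard library's urllib.parse. The function name normalize_http_uri and the DEFAULT_PORTS table are hypothetical names chosen for this example. The sketch covers the default-port, empty-path, and lowercase scheme/host rules only; it skips percent-encoding normalization and does not model the OPTIONS exception.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports for the two schemes (RFC 9110, Sections 4.2.1 and 4.2.2).
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_http_uri(uri: str) -> str:
    """Hypothetical sketch of http(s) scheme-based normalization.

    Applies: lowercase scheme and host, omit the default port,
    replace an empty path with "/". Nothing else is rewritten.
    """
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        raise ValueError("not an http or https URI")

    host = parts.hostname or ""      # urlsplit already lowercases the host
    if ":" in host:                  # IPv6 literal: restore the brackets
        host = f"[{host}]"

    # Keep any userinfo as-is; only the host and port are normalized.
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = f"{userinfo}{at}{host}"
    if parts.port is not None and parts.port != DEFAULT_PORTS[scheme]:
        netloc += f":{parts.port}"

    # An empty path becomes "/". A non-empty path is passed through
    # unchanged, including any empty segments ("//").
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(normalize_http_uri("HTTP://Example.COM:80"))
# http://example.com/
print(normalize_http_uri("https://git.runxiyu.org:443/furweb.git//"))
# https://git.runxiyu.org/furweb.git//
```

The only line that touches the path is the empty-path substitution; there is no step at which "//" could legitimately become "/".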
The RFC 3986 path grammar explicitly permits empty segments
(segment = *pchar).
Therefore // in a path is syntactically valid
and corresponds to an explicit empty segment.
The generic syntax declares that, aside from dot-segments,
path segments are opaque.
Collapsing // changes the segment sequence
and therefore changes opaque data,
which is outside what normalization is supposed to do.
HTTP uses RFC 3986’s path definitions for HTTP(S) URIs and states that the hierarchical path component identifies the resource within the origin’s namespace. That is, the exact path string, subject only to the very limited normalization rules above, is part of the identifier.
The normalization rules in RFC 3986 and RFC 9110
do not authorize collapsing repeated slashes inside the path.
Apart from percent-encoding normalization, which applies to every component,
the only allowed path-related normalizations are
dot-segment removal (generic URIs) and empty-path-to-/ (HTTP).
Therefore, collapsing // to / in an HTTP URL path is not correct
normalization. It produces a different, non-equivalent identifier unless the
origin explicitly defines those two paths as equivalent.
So, for example,
https://git.runxiyu.org/furweb.git// is a distinct identifier from
https://git.runxiyu.org/furweb.git/ under the standards’ grammar and
normalization rules, and must not be rewritten by a generic normalizer;
indeed, these two specific URLs serve different content.
/tmp $ git clone https://git.runxiyu.org/furweb.git/
Cloning into 'furweb'...
remote: Not Found
remote:
remote: You might be attempting to perform Git operations on
remote: a hierarchical index rather than a Git repository.
remote: Note that repositories URLs always end with a "//"
remote: sentinel. Perhaps try the following URL instead?
remote:
remote: https://git.runxiyu.org/furweb.git//
remote:
fatal: repository 'https://git.runxiyu.org/furweb.git/' not found
128 /tmp $ git clone https://git.runxiyu.org/furweb.git//
Cloning into 'furweb'...
remote: Enumerating objects: 2005, done.
remote: Counting objects: 100% (2005/2005), done.
remote: Compressing objects: 100% (500/500), done.
remote: Total 2005 (delta 1455), reused 2005 (delta 1455), pack-reused 0
Receiving objects: 100% (2005/2005), 372.87 KiB | 606.00 KiB/s, done.
Resolving deltas: 100% (1455/1455), done.
/tmp $