JSON serialization for deriving event.id is not adequately specified #354

@shafemtol

Correct derivation of event.id is critical for correctly referencing, signing and verifying events. NIP-01 currently specifies the derivation of event.id as follows:

To obtain the event.id, we sha256 the serialized event. The serialization is done over the UTF-8 JSON-serialized string (with no white space or line breaks) of the following structure:

[
  0,
  <pubkey, as a (lowercase) hex string>,
  <created_at, as a number>,
  <kind, as a number>,
  <tags, as an array of arrays of non-null strings>,
  <content, as a string>
]

JSON itself does not specify a canonical way to serialize strings and numbers. That means the above specification is ambiguous and can lead to mutually incompatible implementations.

Excluding control characters, \ and ", JSON strings can directly represent any Unicode code point as itself. JSON provides some special escape sequences like \n, \\, \" and \/(!). In addition it provides \u followed by four hex digits, which represents a single UTF-16(!) code unit, so a code point outside the Basic Multilingual Plane needs a surrogate pair. A newline can be represented as either \n or, in theory, \u000a or \u000A. The letter ß can be represented as ß, \u00df, \u00DF or even something as silly as \u00dF. Some JSON implementations escape any non-ASCII character by default, others do not. Some escape the forward slash by default, others do not.
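
To make the ambiguity concrete, here is a minimal Python sketch using only the standard library. The content string is a made-up example, not taken from a real event. It shows three valid JSON spellings of the same string producing three different sha256 digests:

```python
import hashlib
import json

# Hypothetical content string for illustration; not taken from a real event.
content = "Straße/\n"

# Three valid JSON spellings of that one string.
spellings = {
    "ascii_escaped": json.dumps(content),                # ß written as \u00df
    "literal": json.dumps(content, ensure_ascii=False),  # ß written as itself
    "hand_written": '"Stra\\u00DFe\\/\\u000a"',          # upper-case hex, \/ and \u000a
}

digests = {}
for name, text in spellings.items():
    # Every spelling decodes to the identical string...
    assert json.loads(text) == content
    # ...but hashes to a different value, because the bytes differ.
    digests[name] = hashlib.sha256(text.encode("utf-8")).hexdigest()
    print(f"{name:14} {text}  {digests[name][:16]}")
```

An implementation that hashes whichever spelling its JSON library happens to emit will disagree with one that picks another, even though both parse the same event.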

The mention of "UTF-8 JSON-serialized" can be understood to mean that non-ASCII characters MUST NOT be escaped, but this is not clear.

I see two different ways to deal with the incompleteness of the current specification:

  • Encode strings as RFC 3629 UTF-8, do not escape characters in strings unless absolutely needed (\, ", control characters), and use the special, short escape sequences where available (e.g. \n instead of \u000a). For \u escape sequences I would assume lower case simply from two implementations I've tested, but this really has to be specified. Represent integers without any fraction or exponent.
  • Represent the values exactly as they are represented in the event object itself. This would make implementation easier when generating events and should be backwards compatible with all existing events, but it would complicate implementation everywhere else.
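
As a rough sketch of the first option, Python's standard json module happens to produce this kind of output when called with compact separators and ensure_ascii=False: it escapes only \, " and control characters, uses the short escapes where they exist, and writes remaining control characters as lower-case \u00xx. The function names serialize_for_id and event_id are hypothetical, and this reflects one encoder's behaviour, not what NIP-01 mandates:

```python
import hashlib
import json


def serialize_for_id(pubkey, created_at, kind, tags, content):
    """Compact JSON with minimal string escaping, encoded as UTF-8 bytes."""
    # bool is a subclass of int in Python, so reject it explicitly.
    for name, value in (("created_at", created_at), ("kind", kind)):
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an integer")
    payload = [0, pubkey, created_at, kind, tags, content]
    # No whitespace between tokens; non-ASCII characters are emitted as-is.
    text = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
    # Strict UTF-8 encoding raises UnicodeEncodeError on lone surrogates.
    return text.encode("utf-8")


def event_id(pubkey, created_at, kind, tags, content):
    """Lower-case hex sha256 of the serialized event."""
    data = serialize_for_id(pubkey, created_at, kind, tags, content)
    return hashlib.sha256(data).hexdigest()


# Placeholder values for illustration only.
print(serialize_for_id("ab", 1, 1, [], "a\nß/"))
```

The second option cannot be sketched this way, since it needs access to the raw bytes of each string and number as they appeared in the received event rather than the decoded values.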

It seems to me that clients generally set created_at as an integer, although this should also be more clearly specified. If non-integer values are allowed here, that introduces even more issues such as the allowed precision (JSON itself has no limits here, but many implementations will decode to IEEE 754 binary64, silently rounding if needed), fraction and exponent representation.
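
A small Python sketch of why non-integer created_at values are troublesome. The timestamp is an arbitrary example; the point is that several spellings of the same number do not survive a decode and re-encode round trip, and that extra precision is dropped silently:

```python
import json

# Three JSON spellings of the same timestamp (arbitrary example value).
spellings = ["1673347337", "1673347337.0", "1.673347337e9"]

# Decode each one and serialize it again, as an implementation would
# when recomputing the id from a parsed event.
reserialized = [json.dumps(json.loads(raw)) for raw in spellings]
for raw, out in zip(spellings, reserialized):
    print(f"{raw:>15} -> {out}")

# binary64 cannot hold this many digits; the fraction is rounded away.
rounded = json.loads("1673347337.0000000001")
print(rounded == 1673347337.0)
```

Only the plain integer spelling comes back unchanged here, which is one argument for requiring it.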

Another thing that puzzles me is the mention of "non-null strings" here. If "null strings" should not exist in tags, I would expect that to be specified for the event object itself, not in the serialization for id derivation.

I'm leaning towards changing NIP-01 to specify a strict serialization with minimal escaping of strings and to disallow non-integer created_at, unless doing so would make it incompatible with a significant portion of already existing events. If existing events have ids derived using a different serialization, the alternative of reusing the string representation from the event itself becomes necessary. That affects the JSON handling of both relays and receiving clients and makes it rather brittle. Perhaps handling of such events can be made optional, so that relay and client implementations can choose to accept only strictly compliant events.

Some suggestions to address other potential issues regarding Unicode strings:

  • Clarify that Unicode normalization MUST NOT be performed on received events or during serialization for id derivation.
  • Require that UTF-8 strings (including during id derivation) comply with the latest UTF-8 spec, RFC 3629, which specifically disallows the CESU-8 variant.
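
Both suggestions can be illustrated with a short Python sketch using the standard library. The strings are made-up examples. Normalization changes the bytes that get hashed, and a strict UTF-8 codec rejects the surrogate encodings that CESU-8 relies on:

```python
import hashlib
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT, versus its NFC form U+00E9.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# The two render identically but encode to different bytes,
# so normalizing before hashing would change the derived id.
hash_decomposed = hashlib.sha256(decomposed.encode("utf-8")).hexdigest()
hash_composed = hashlib.sha256(composed.encode("utf-8")).hexdigest()
print(decomposed.encode("utf-8"), composed.encode("utf-8"))
print(hash_decomposed == hash_composed)

# U+1F600 as RFC 3629 UTF-8 (4 bytes) versus CESU-8 (two 3-byte surrogates).
utf8_bytes = "\U0001F600".encode("utf-8")
cesu8_bytes = b"\xed\xa0\xbd\xed\xb8\x80"
try:
    cesu8_bytes.decode("utf-8")
    cesu8_rejected = False
except UnicodeDecodeError:
    cesu8_rejected = True
print(utf8_bytes, cesu8_rejected)
```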

I haven't seen this issue mentioned in this repo. It seems @vitorpamplona ran into the issue on nostr here: note1yef4z9ren4lt4xqj95yugzk2edghjvg0k3mxxfj5gm3fzxj4p6yspdtpwl
