-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Improve API/event validation in synapse #8445
Description
Background
We've recently encountered a number of bugs in which malformed (or at least, unexpected) data has caused Synapse or clients to misbehave in some way. These bugs stem from the fact that, faced with a given datastructure, you cannot rely on it having the expected format. For example:
- What happens if the
displayname
in anm.room.member
event is not a string (User directory gets stuck when encountering non-string display name #8220, Sqlite Error: Error binding parameter 1 - probably unsupported type (when joining a room) #8340)? - What happens if an event object is missing its
origin
field (Failing to join / send_join rooms: FrozenEventV3 has no 'origin' property #8319)?
Such bugs are disruptive, and in extreme cases could progress beyond "denial of service" to "security threat", and it might be possible to avoid them by validating data at the point of entry to Synapse. We've recently discussed this in some depth within the core Synapse team; this issue serves to record some of our thoughts on the topic, including promising areas for further development.
Introduction
There are actually multiple reasons why it would be useful to improve validation of data within Synapse. These include:
- As above, avoiding potential bugs
- Simplifying code by removing
isinstance
checks everywhere - Forcing clients to follow the specification rather than using "whatever works in Synapse", thus improving compatibility with other server implementations which do enforce the spec (or conversely: not forcing other implementations to grandfather in Synapse's misfeatures for compatibility with clients)
- Making synapse return 400 errors rather than "Internal Server Errors" if clients pass invalid parameters would:
- Give better feedback to client developers: the 500 error gives the developer little information about their error.)
- Allow server administrators to distinguish between "my server is having problems" and "somebody passed invalid parameters to my server"
- Allow load-balancers to distinguish between "this process is having problems" and "somebody passed invalid parameters". Notably, Cloudflare will stop sending requests to an origin which returns too many 500 errors.
There are multiple related, but different, things we mean when we talk about validation. At a high level, these can be broken down into:
- validating API request parameters (both CS and SS API); and
- validating events (and other such data types, like EDUs etc).
Let's consider these in turn.
API validation
Given that we already have JSON schema specifications for our APIs, this is theoretically relatively straightforward (see, for example c39941c, which is a proof-of-concept applying this to a single endpoint), though it's certain to uncover a large number of places where clients are relying on non-spec-compliant behaviour. We also have to be wary, especially on the SS API, to ensure backwards compatibility.
Doing this would certainly reduce the number of false 500 errors, which as above brings a number of advantages. It might also reduce the occurrences of bugs due to bad events (e.g. non-string display names) since updated Synapses will correctly reject them on the CS API; however, it will absolutely not fix those bugs, since such malformed data could still be received from buggy or malicious servers over federation.
Event validation
Validating events, particularly those received over federation, is quite hard as:
- you need to decide what to do with events that fail validation,
- retaining backwards compatibility means it’s hard to reject/drop/decline to handle events that fail strict validation.
For example, an event with malformed m.relates_to
data can’t just be dropped as, according to the authentication rules of all room versions to date, it is a valid event, even though its payload (that Synapse still interprets) is invalid, since annotations were added after the current room versions were specced.
The main problem we have currently is that events contain a mix of properties which are validated on receipt (auth_events
, prev_events
, etc) alongside a bunch of untrusted data that we cannot assume nor assert the types of. Then, when we come to access event data, we need to remember to add checks for any untrusted data. While room versions allow us to add stricter schemas, relying on that approach will always be a case of playing catchup as we’ll want to use new features in older room versions where possible. (See also MSC2801 on this subject.)
Ideally, therefore, we’d try and add some tooling to make it easier to statically assert (or at least tell easily in the code) whether the fields you’re accessing on events (and other data types) have been validated or not. Hopefully it is possibly to do something with mypy here, however it will require some experimentation here.
Conclusions
In summary there are a few paths going forward:
- validate schemas of APIs;
- validate a basic schema for events (and other data types), potentially using room versions to allow us to enforce stricter schemas while retaining backwards compatibilty; and
- experiment with mypy and other tooling (as well as code changes) to try and highlight the difference between validated and unvalidated data in events, and to enforce correct validating when using untrusted data.
Both 1 and 2 would be good things to do, and may reduce the number of occurrences of bugs, but won’t actually fix the class of bugs we’re seeing due to unexpected formats of various keys in events. However, while paths 1 and 2 are something that we know how to do, path 3 has a lot more unknowns attached to it, but ultimately is the only option that will fully prevent the class of bugs we’re seeing.