-
-
Notifications
You must be signed in to change notification settings - Fork 11k
Description
This spec/issue describes the filter
parameter originally mentioned in issue #5463, however it contains a more in-depth spec than usual, due to the complexity of the feature. This issue serves as a basis for documentation whilst the feature is in active development, as well as giving a basis for testing.
Table of Contents:
Summary
The idea behind filter queries is including additional parameters in Browse
API requests, causing the result to be filtered in some useful way. This provides an additional level of power and flexibility for users interacting with their data in themes, apps, and in the admin UI.
The initial focus is on the ability to do advanced queries against Posts, Tags, and Users. The following document outlines a detailed proposal for a 'query language' exposed via a top-level filter
property available when accessing the API from various tools.
Key Contextual Information
The Ghost JSON API supports full BREAD operations, however filtering only applies to Browse requests.
Resources
There are 3 main resources in the Ghost JSON API which are 'public' resources, and have clear use cases which require advanced filtering:
- Posts
- Tags
- Users
Relations
The following relations need special treatment when querying across resources.
- Posts have many Tags
- Posts have one Author (Users)
- Tags have many Posts
- Users have many Posts
Additionally, the relations & business logic in a blog reveal the following important relational data concepts:
- Each Post has a Next Post and a Previous Post.
- Each Post has a Tag Count
- Each User has a Post Count
- Each Tag has a Post Count
API Access Tools
There are 3 key tools which can be used to request data from the Ghost JSON API which need to be considered:
- RESTful HTTP calls - E.g.
GET /api/posts?limit=5
- Method calls (used internally & in apps) - E.g.
api.posts.browse({limit: 5})
- The (in progress) 'get' handlebars helper - E.g.
{{#get "posts" limit="5"}}
Proposal
Adding advanced filtering to Ghost's JSON API via a top-level filter
parameter which accepts a string representation of an SQL query, highly optimised for the kinds of queries which we expect will be common to Ghost, but flexible enough to grow outside of anticipated use cases.
Top-Level Filter Parameter
A top-level filter
parameter will provide advanced querying capabilities available in all 3 tools in a standard way, such that the parameter's value is always a string:
- HTTP:
GET /api/posts?limit=5&filter=???
- Method Call:
api.posts.browse({limit: 5, filter: "???"})
- Get Helper:
{{#get "posts" limit="5" filter="???"}}
In both the method call and get helper syntaxes, the filter parameter value will be surrounded by double-quotes. Avoiding double-quotes inside the string is therefore a requirement.
In the HTTP version, the string will need to be url safe. Url encoding & decoding will be required.
Filter Parameter Values
The key requirement of the filter
parameter is to accept values which can represent the rough equivalent of the where
clause of an SQL statement (language diagram). The fundamentals of this proposal are therefore the definition of a language which fits Ghost's use cases and can be converted into SQL through Ghost's Bookshelf/Knex model layer.
SQL's capabilities are extensive. To start with, the focus for Ghost is on supporting logical and & or combinations, comparisons using equals, greater than & less than (with support for not). The importance of the relation between Post & Tags also drives the need to support in.
Filter Anatomy
An individual filter is composed of 3 parts, the property that is being filtered, the operator (currently always =
) and the value. Filters need to be combined in groups of or & and rules.
The property needs to support path matching: e.g. tag
currently really means tag.slug
.
The operators need to include equals, not equals, and numerical comparisons.
The values are strings, numbers, nulls etc that are matched/compared against.
Filter Style / Syntax
There are a number of common styles for combining properties, operators and values into expressions. This proposal borrows heavily from the style used in GitHub, Gmail & Slack amongst others: E.g.label:api
or from:-hannah
or stars:<10
.
The syntax of this style is the use of property-operator-value triplets in the form: property:operatorvalue, where the colon is a separator, and operator is optional.
This syntax is short and compact for the common use cases in Ghost, for example: featured:true
, tags:photo
, author:john
, image:-null
, posts.count:>10
This syntax is also flexible enough to be extended to support multiple complex expressions by combining the property-operator-value triplets together with other symbols to represent and & or. Following on from using -
for negation, the proposal is to use +
for and as well as ,
for or e.g: featured:true+tags.count:>10
, tags:photo,tags:video
.
This is then used inside of the filter=""
parameter, E.g:
- HTTP:
GET /api/posts?limit=5&filter=tags:photo,featured:true
- Method Call:
api.posts.browse({limit: 5, filter: "tags:photo,featured:true"})
- Get Helper:
{{#get "posts" limit="5" filter="tags:photo,featured:true"}}
Filter Query Language Spec
Syntax
Filter Expressions
A filter expression is a string which provides the property, operator and value in the form property:operatorvalue:
- property - a path representing the key to filter on
:
- separator between property and an operator-value expression- operator is optional, so
:
on its own is roughly=
- operator is optional, so
Property
Matches: [a-zA-Z_][a-zA-Z0-9_.]
- can contain only alpha-numeric characters and
_
- cannot contain whitespace
- must start with a letter
- supports
.
separated paths, E.g.author.name
orposts.count
- is always lowercase, but accepts and converts uppercase
Filter expressions can be combined using the following:
+
- represents and,
- represents or(
filter expression)
- overrides operator precedence
Operator
The following operators are spec'd for the first version:
-
- not operator>
- greater than operator>=
- greater than or equals operator<
- less than operator<=
- less than or equals operator
Value
Can be one of the following
- null
- true
- false
- a *number *(integer)
- a literal
- Any character string which follows these rules:
- Cannot start with
-
but may contain it - Cannot contain any of these symbols:
'"+,()><=[]
unless they are escaped - Cannot contain whitespace
- a string surrounded by single quotes
- Any character except a single or double quote surrounded by single quotes
- Single or Double quote *_MUST _be escaped
- Can contain whitespace
- A string can contain a date any format that can be understood by
new Date()
Special characters:
:
- filter expression separator+
- *_and *_symbol,
- *_or *_symbol(
- left parenthesis (operator precedence))
- right parenthesis (operator precedence)-
- not operator>
- greater than operator>=
- greater than or equals operator<
- less than operator<=
- less than or equals operator[
- left bracket (*_in *_operator)]
- right bracket (*_in *_operator)'
- single quote (*_string *_syntax)"
- double quote - NOT PERMITTED UNESCAPED
Whitespace rules
Whitespace is not permitted inside of a property, or inside a value unless the value is surrounded with quotes. Everywhere else, whitespace is ignored and therefore becomes optional.
This allows for whitespace to be used to improve readability if necessary, as any of the following are equivalent and will be accepted.
featured:true+posts.count:>10
,tags:photo,tags:video
featured:true + posts.count:>10
,tags:photo, tags:video
featured: true + posts.count: >10
,tags: photo, tags: video
featured: true + posts.count :> 10
,tags: photo, tags: video
Implementation
Hold on to your hats, this is where shit gets real...
Creating a Parser
As far as I can determine, the set of rules above describes a language that can be represented by an unambiguous context-free grammar and therefore understood by an LALR(1) parser.
What this means is that we can represent the set of rules in a standard way, and then use JISON to generate a parser in JavaScript, which can in turn be used in Ghost to understand filter query strings, and convert them into something more useful. Essentially, all that has to be done is write a set of rules & the hard part of this is done. Neat.
The parser should live in its own repository, with a full suite of tests, and become a dependency of Ghost. The repository for it is here: https://github.com/TryGhost/GQL
Parser Output
The next step is to determine what the 'something more useful' that the parser is supposed to output looks like. The essence of what we're doing is representing SQL, so we could output the SQL equivalent of a filter query string and then pass that to knex.whereRaw()
, which will then run the query for us inside the model layer of Ghost.
In fact, this is what I did for a first test that all this works. I wrote a rudimentary version of the rules, output raw SQL, and ran it through knex. And it worked.
The downside to parsing straight to SQL is two-fold: firstly, by translating from one string format to another, there is little opportunity for semantic analysis - that is to validate or operate on the property and value parts of the query to check they exist, or are permitted (filtering by the password
property of the User resource should not be permitted, for example). Secondly, it seems to me that running raw sql queries, even through knex, is likely to result in security problems.
Instead of churning out SQL, the parser needs to output an intermediary format that we can use to get all of the useful and contextual information out of the query. This format should be something we can easily pass to knex's query builder functions, like .where()
and .orWhere()
, (which take a property, and operator and a value) so that we can run the queries through the query builder.
The intermediary format must not lose any information from the query.
JSON Format
The proposal is to use an array. Each filter expression will be converted into an object with key value pairs representing the property, operator, and value.
Arrays are ordered, and so the order of the array will match the order in which the query should be built.
All except the first item of the array should also indicate whether the expression is to be combined with and or or. Something like this:
{statements: [
{prop: 'tags.slug', op: '=', value: 'photo'},
{prop: 'featured', op: '=', value: true, func: 'and'},
{prop: 'tag.count', op: '>', value: 5, func: 'or'}
]}
To maintain precedence overrides, a single object in the array can alternatively represent a group that contains a sub-array of objects. The group object must also indicate whether it is combined using and or or.
{statements: [
{prop: 'author.slug', op: '!=', value: 'joe'},
{
group: [
{prop: 'tags.slug', op: '=', value: 'photo'},
{prop: 'image', op: 'IS NOT', value: null, func: 'or'},
{prop: 'featured', op: '=', value: true, func: 'or'}
], func: 'and'
}
]}
This format lends itself nicely to building queries using the knex query builder functions, as we can cycle through the array (and recursively through groups) calling where
or orWhere
as necessary. Having a JSON format with named key-value pairs for property, operator, and value, makes it easy to use lodash magic to pull out parts of the query (e.g. all properties or all values) and validate them.
Post-Processing
Once we have our filter in a JSON format, and we can pull out properties (table & field names) and values, we can start to do some validation and processing of the information in order to ensure the queries are safe and sensible.
Public vs Private
Certain properties should not be filtered upon, e.g. user.email
, user.password
, user.status
and post.status
as all of these properties are affected by permissions / not intended to be available to public endpoints.
For now, as we're only intending to offer 'official' public-only beta API access via HTTP and the get helper, these properties could be santised out entirely if it's easier, or otherwise they should be santised out if the context is public (see the change coming with public permissions for more context for what I mean here).
Standard validation
All of the properties should be known attributes of the models, and all of the values should be valid values for those attributes, according to our standard validation mechanisms.
Joins
Any reference to a related table needs to be translated into a join.
Dates
Any String value used along with a date property E.g. created_at
, updated_at
, published_at
, last_login
should be converted into a Date that SQL can understand, providing it is something that can be parsed using JavaScript's new Date()
. Some care needs to be taken here to check this will work across SQLite, MySQL and pg as they treat dates differently and Bookshelf/Knex is not good at handling this (yet).
Special Rules
There are a few queries which should have special rules for how they affect other parts of the query. The single example at the moment is filtering posts by tags where tag.slug is IN
a set (or in fact, any use of IN
).
Where using IN
we need to be sure that we still return distinct items in the set (and the correct number). When using IN
the default order should naturally change to be ordered by the number of matches in descending order.
E.g. when requesting posts with tags IN
a group, the query output should be similar to (or have the same affect as) this:
select * from "posts" inner join "posts_tags" on "posts_tags"."post_id" = "posts"."id" inner join "tags" on "posts_tags"."tag_id" = "tags"."id" where "page" = false and "status" = 'published' and "posts"."id" != 11 and "tags"."slug" in ('markdown', 'list') group by "posts"."id" order by count(tags.id) DESC, "posts"."status" ASC, "posts"."published_at" DESC, "posts"."updated_at" DESC, "posts"."id" DESC limit 5
Note the group by
and order by count(tags.id)
parts.
Examples
Here is a separate Gist outlining some example queries in HTTP, method call & get helper form: https://gist.github.com/ErisDS/f516a859355d515aa6ad, feel free to suggest more example use cases in the comments.