Just jotting down some ideas as I come up with them / as I read things on the internet / clarify my own thinking around this. Feel free to add your own, we'd probably need to try & mix and match a few approaches to get to something usable!
Adapting from #8094 , there are 3 problems to solve:
- Do businesses in the same verticals (however you define the verticals), use product analytics the same way? i.e. is a general classification possible?
- If yes, can we map the custom & autocapture events they create to this classification accurately?
- If yes, can we surface useful insights in PostHog that they haven't already thought of?
Some things to consider:
- We don't really care about (1) and (2). The goal is simply (3). It's possible to reach (3) without doing (1) and (2) by, say, using a more fluid approach than a hard taxonomical classification. (Don't know how this would work yet, just something to keep in mind.)
- I think to effectively do (3), not only would we want to map events to a model, but also event properties. For example, a `subscribed` event would likely have a `price` or `amount` property, and showing users they can "track daily revenue" vs. just the number of people who subscribed is where the magic happens. The latter is easy to figure out; the former, not so much!
- ??
Problem 1 solutions
My gut feel here is yes: most companies with the same business model look the same, do the same things, and earn money the same way. Thus, the events they track should be similar.
What's interesting to me here is that these companies can be in different industries: you can have a health subscription service or a SaaS, both of which would have very similar events: `subscription (started | cancelled)` with `amount` props. By contrast, a health insurance company might have things like `bought product` with `product type: A` as a property (spitballing here).
So, I propose we divide verticals by business models instead of industries. (Before going this route, actually check our data if we can confirm this hypothesis or not)
I may be oversimplifying, and there may be other variables that are also important, but I feel figuring these out would make things a lot clearer.
Choosing the right division here is important, because it can make the next problem anywhere from impossibly hard to easy.
Problem 2 solutions
There are two parts to this problem: (2.1) What does our internal model for this vertical look like? And (2.2) How do we map user events to this internal model?
We need both to be distinct, since we use (2.1) as a generator for solving (3).
Generic Word Embeddings
We can represent every word by a 200-300 dimension vector. Lots of generic pretrained models exist. Any two events whose distance in this vector space (by some measure, like Euclidean) is less than some epsilon map to the same thing.
So, given a representation for (2.1) (perhaps manually choosing words), we should be able to solve (2.2) using these word embeddings.
We shouldn't train our own embeddings, as (I think) that's a losing battle, hard to get right, and not worth it for the MVP.
It's easier to find generic word embeddings vs embeddings specific to a field, but I expect results to be better when we use specific embeddings for a specific field: they map domain words better.
We should try testing both kinds, to see what works.
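As a concrete sketch of the (2.2) mapping, something like the following could work. The tiny hand-made vectors here are stand-ins for real pretrained 200-300 dimension embeddings, and all event names are hypothetical:

```python
import math

# Toy stand-in for pretrained word embeddings. Real vectors would come from
# something like GloVe or word2vec and have 200-300 dimensions.
EMBEDDINGS = {
    "subscribed":  [0.9, 0.1, 0.0],
    "purchased":   [0.8, 0.2, 0.1],
    "signed_up":   [0.1, 0.9, 0.2],
    "registered":  [0.2, 0.8, 0.1],
    "page_viewed": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def map_to_taxonomy(custom_event, taxonomy_events, min_similarity=0.8):
    """Map a user's custom event to the closest canonical event, or None
    if nothing is similar enough (the epsilon threshold from above)."""
    best, best_sim = None, -1.0
    for canonical in taxonomy_events:
        sim = cosine(EMBEDDINGS[custom_event], EMBEDDINGS[canonical])
        if sim > best_sim:
            best, best_sim = canonical, sim
    return best if best_sim >= min_similarity else None
```

Swapping cosine for Euclidean distance (or trying field-specific embeddings) is a one-line change, which makes this easy to experiment with.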
Probability I think this will work: Moderately high
Automatic taxonomy creation
There are lots of interesting methods to generate taxonomies. Why not use one of these to generate a model (2.1), and then use that model to predict which custom event or property goes where (2.2)?
This definitely scales better than manually doing (2.1), but runs into a new problem: how do we map this model to smart insights? For example, the generated taxonomy might focus on, say, disease classification, rather than on the events coming into PostHog.
Similar arguments can be made for ontology creation.
However, I think we can take inspiration from these techniques, and figure out something that works for us.
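For flavor, here's a deliberately crude sketch of the idea: group custom event names by a shared token to get candidate taxonomy nodes. Real methods would do hierarchical clustering over embeddings or corpora instead, and all names here are hypothetical:

```python
from collections import defaultdict

def crude_taxonomy(event_names):
    """Group event names by their first token, yielding candidate taxonomy
    nodes. A stand-in for proper automatic taxonomy induction."""
    groups = defaultdict(list)
    for name in event_names:
        head = name.lower().replace("-", "_").split("_")[0]
        groups[head].append(name)
    return dict(groups)
```

Even this toy version shows the mismatch problem: the groups it produces reflect naming quirks, not necessarily the structure our insights need.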
Probability I think this will work: Low
Text matching
There's no reason we have to solve all the hard parts via code. We could manually build a taxonomy of what events should look like for a vertical (assuming we've solved problem (1) well), and encourage companies to adhere to these guidelines: name your events the way we tell you to.
This makes (2.2) very easy: we know a priori what's coming in!
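With adherence to the guidelines, (2.2) really is just a lookup. A minimal sketch, assuming a hypothetical published taxonomy per vertical:

```python
# Hypothetical canonical event lists we'd publish per vertical.
VERTICAL_TAXONOMY = {
    "saas": {"signed_up", "subscription_started", "subscription_cancelled"},
}

def map_event(vertical, event_name):
    """Return the event if it matches the published guidelines, else None."""
    canonical = VERTICAL_TAXONOMY.get(vertical, set())
    return event_name if event_name in canonical else None
```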
(2.1) is hard though. Do we know enough about industries to do this manually?
Further, how do we tell Oura to not go with the health-industry taxonomy, but with the SaaS taxonomy?
And, mucho friction: as industries change / businesses grow / their business models change, this feature goes to trash. Maybe.
But anyway, I think we should definitely attempt this once, just to understand the edge cases better: When/why would businesses not want to track events like so, etc. etc.
Probability I think this will work: Moderate
Text matching without training
It's like the above, but what if we assume that, given we select the verticals properly, most users will name their events similarly anyway?
This removes all the icky bits from the above method, and just keeps the easy bits.
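A sketch of what this could look like with stdlib fuzzy string matching (difflib), assuming hypothetical canonical names and a made-up similarity cutoff:

```python
import difflib

# Hypothetical canonical event names for a vertical.
CANONICAL = ["signed_up", "subscription_started", "subscription_cancelled"]

def fuzzy_map(event_name, cutoff=0.75):
    """Map a user's event name to the closest canonical name, if any,
    without any training — just normalization plus string similarity."""
    normalized = event_name.lower().replace(" ", "_").replace("-", "_")
    matches = difflib.get_close_matches(normalized, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```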
Probability I think this will work: Moderate, if (1) is solved well. Low otherwise.
Problem 3 solutions
Given we have a model (2.1), we should be able to create all important insights manually (and thanks to ideas from companies in the same model vertical).
Not sure about the effort this will take, and whether we'll surface interesting things. But I suspect this will at least level the playing field: here are the basic things every company in this vertical looks at, which can be valuable enough.
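A rough sketch of how hand-curated insight templates per model vertical could work — given mapped canonical events, suggest every insight whose requirements are covered (all names and templates here are hypothetical):

```python
# Hand-curated per-vertical templates: (insight name, required canonical events).
TEMPLATES = {
    "subscription": [
        ("Daily revenue", {"subscribed"}),
        ("Signup -> subscribe funnel", {"signed_up", "subscribed"}),
        ("Churn over time", {"subscription_cancelled"}),
    ],
}

def suggest_insights(vertical, mapped_events):
    """Suggest every template insight whose required events the company
    actually tracks (after the (2.2) mapping)."""
    tracked = set(mapped_events)
    return [name for name, required in TEMPLATES.get(vertical, [])
            if required <= tracked]
```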
Some really out there solutions:
Random Insights
What if, instead of doing the hard work of creating a taxonomy, we randomly suggest insights based on events & properties data coming in? Of course, there needs to be some structure, AND, we can do some pruning based on prelim results, like a chess engine / A* search algorithm (need to define the problem better for search, but you get the idea)
So, you generate random insights, and discard any for which the result is 0. Then we have heuristics to prune certain combinations, like, say, "if conversion rate below 1%, probs not useful". We'll need to play around a lot to figure these out, but idk, might do better.
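A toy sketch of the generate-then-prune loop, with fake precomputed results standing in for real queries against the analytics store (heuristics and numbers are made up):

```python
# Fake results keyed by (insight type, events), standing in for real queries.
FAKE_RESULTS = {
    ("trend", ("subscribed",)): 42,
    ("trend", ("page_viewed",)): 0,                    # empty result -> discard
    ("funnel", ("page_viewed", "subscribed")): 0.004,  # conversion < 1% -> prune
    ("funnel", ("signed_up", "subscribed")): 0.12,
}

def useful(insight_type, result):
    """Heuristic pruning, a bit like a chess engine discarding bad branches."""
    if result == 0:
        return False
    if insight_type == "funnel" and result < 0.01:  # "conversion rate below 1%"
        return False
    return True

def generate_insights(results):
    """Keep only the randomly generated insights that survive pruning."""
    return [(kind, events) for (kind, events), value in results.items()
            if useful(kind, value)]
```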
(I mean, if this does better than solving (1) and (2), we know our models are pretty shitty, a.k.a the problem is very hard 😂 )
Probability I think this will work: Moderate-low
Neural Net all the things!
This is a surprisingly well-defined problem to attack via machine learning: you have a set of events with properties and persons with properties, and the output is a list of tuples: (insight type, events/actions to use).
Actually, we could possibly use GPT-3 here! If it can generate code, it can generate filter objects! We just need to prompt it with several good examples of meaningful filter objects, given events & properties. (Every filter object uniquely maps to an insight.)
I think GPT-3 would definitely work better than training our own neural nets. (because training is hardddd, needs lots of data, etc. etc.)
Hmm, now that I think about it, this might be the most promising approach, barring concerns with using an external API.
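To make the few-shot idea concrete, here's a rough sketch of building such a prompt. The filter-object shape is illustrative, not PostHog's actual schema, and the example is hypothetical:

```python
import json

# One hypothetical few-shot example: (events, properties) -> filter object.
EXAMPLE = {
    "events": ["subscribed"],
    "properties": ["amount"],
    "filter": {
        "insight": "TRENDS",
        "events": [{"id": "subscribed", "math": "sum", "math_property": "amount"}],
    },
}

def build_prompt(examples, new_events, new_properties):
    """Assemble a few-shot completion prompt: solved examples first, then the
    new case, ending at 'Filter:' for the model to complete."""
    lines = []
    for ex in examples:
        lines.append(f"Events: {ex['events']} Properties: {ex['properties']}")
        lines.append(f"Filter: {json.dumps(ex['filter'])}")
    lines.append(f"Events: {new_events} Properties: {new_properties}")
    lines.append("Filter:")
    return "\n".join(lines)
```

The completion the model returns would then be parsed as JSON and rendered directly as a suggested insight.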
Probability I think this will work: High