Twixes
Member

@Twixes Twixes commented Jan 4, 2022

Changes

Follow-up to #7824. Aiming to resolve demo data concerns from PostHog/posthog.com#2661 (comment).

How did you test this code?

Alexa remind me

@macobo
Contributor

macobo commented Jan 5, 2022

If you're doing this, it's also valuable to perhaps set up some groups data :)

@Twixes Twixes reopened this Jan 27, 2022
@posthog-bot posthog-bot removed the stale label Jan 28, 2022
@Twixes Twixes reopened this Feb 14, 2022
@posthog-bot posthog-bot removed the stale label Feb 15, 2022
@posthog-bot posthog-bot closed this Mar 4, 2022
@Twixes Twixes reopened this Mar 4, 2022
@Twixes Twixes changed the title from "Rework demo data generation system" to "feat(demo): Rework demo data generation system" Mar 4, 2022
@posthog-bot posthog-bot removed the stale label Mar 7, 2022
@Twixes Twixes marked this pull request as ready for review May 13, 2022 19:19
@Twixes
Member Author

Twixes commented May 13, 2022

I suppose this is now reviewable. Guide below.

How to test

First run pip install -r requirements.txt because this adds random data package mimesis as a dependency.

  1. Run the new simulate_matrix Django command to run a simulation (e.g. DEBUG=1 ./manage.py simulate_matrix --start 2022-03-02 --end 2022-05-15 --seed xyz --n-clusters 20) and see its output
  2. Run the server with DEMO=1 (e.g. DEMO=1 ./bin/start) and log in for the full demo experience
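The --seed option suggests that a given seed should reproduce the same simulation. A minimal sketch of that idea, using only the standard library rather than mimesis; the helper name simulate_signup_times and its parameters are hypothetical and not taken from this PR:

```python
import random
from datetime import datetime, timedelta
from typing import List


def simulate_signup_times(seed: str, start: datetime, end: datetime, n: int) -> List[datetime]:
    """Hypothetical helper: draw n timestamps in [start, end) from a seeded RNG."""
    # A private Random instance keeps the run independent of global random state.
    # String seeds are hashed deterministically, so "xyz" gives the same stream every run.
    rng = random.Random(seed)
    span_seconds = int((end - start).total_seconds())
    return sorted(start + timedelta(seconds=rng.randrange(span_seconds)) for _ in range(n))


if __name__ == "__main__":
    start, end = datetime(2022, 3, 2), datetime(2022, 5, 15)
    first = simulate_signup_times("xyz", start, end, 3)
    second = simulate_signup_times("xyz", start, end, 3)
    print(first == second)
```

Passing the RNG (or the seed) down explicitly, instead of calling the module-level random functions, is what keeps two runs with the same seed identical.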

Approach

What happens when a user enters the demo environment?

Whenever a user signs up/logs in (they are the same here), they are:

  • either logged into the existing account matching the email – if they've used the demo before
  • or a Hedgebox simulation is first run, its results are saved, and only then the user is plopped into PostHog – if it's their first time with the demo
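The two branches above can be sketched roughly as follows. This is an illustration only: DemoAccounts, enter_demo and run_simulation are hypothetical names, the in-memory dict stands in for the real user table, and normalizing the email to lowercase is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class DemoAccounts:
    """Hypothetical in-memory stand-in for the demo user table."""

    # Callable that runs a simulation for an email and returns the saved result.
    run_simulation: Callable[[str], dict]
    by_email: Dict[str, dict] = field(default_factory=dict)

    def enter_demo(self, email: str) -> dict:
        key = email.strip().lower()  # assumption: emails are matched case-insensitively
        if key not in self.by_email:
            # First visit: simulate and save before letting the user in.
            self.by_email[key] = self.run_simulation(key)
        # Returning visit (or just-created account): log into the existing data.
        return self.by_email[key]
```

With this shape, signing up and logging in are the same call, and the expensive simulation runs at most once per email.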

How does the simulation work?

Each individual simulation is a matrix (abstractly, a Matrix instance, and concretely in our case, a Matrix subclass HedgeboxMatrix instance). A matrix consists of clusters (Clusters), which in our case (HedgeboxCluster) are companies. And a cluster effectively is a grid of people (SimPersons). Those people are simulated session-by-session synchronously in an outwards spiral pattern. This whole architecture allows for pretty efficient modeling of relations inside groups (for Group Analytics), such as people recommending the product to each other, or invitations. Currently only browser clients can be simulated.
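As a rough illustration of the matrix → cluster → person structure and the outward spiral order, here is a simplified sketch. It is not the code from this PR: the class bodies, spiral_outwards, and the one-session-per-person loop are assumptions made for the example, and the real simulation schedules sessions over time rather than once per person.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


def spiral_outwards(radius: int) -> Iterator[Tuple[int, int]]:
    """Yield (x, y) offsets from the center, ring by ring, up to `radius`."""
    yield (0, 0)
    for r in range(1, radius + 1):
        # Walk the perimeter of the square ring at distance r, starting at one corner.
        x, y = -r, -r
        for dx, dy in ((1, 0), (0, 1), (-1, 0), (0, -1)):
            for _ in range(2 * r):
                yield (x, y)
                x, y = x + dx, y + dy


@dataclass
class SimPerson:
    x: int
    y: int
    sessions: int = 0

    def simulate_session(self) -> None:
        self.sessions += 1  # placeholder for emitting this session's events


@dataclass
class Cluster:
    radius: int
    people: List[SimPerson] = field(init=False)

    def __post_init__(self) -> None:
        # One person per grid cell, stored in spiral order (center first).
        self.people = [SimPerson(x, y) for x, y in spiral_outwards(self.radius)]

    def simulate(self) -> None:
        for person in self.people:
            person.simulate_session()


@dataclass
class Matrix:
    clusters: List[Cluster]

    def simulate(self) -> None:
        for cluster in self.clusters:
            cluster.simulate()
```

Because each person has grid coordinates, neighbors are cheap to look up, which is what makes effects such as recommendations or invitations between people in the same company easy to model.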

A Matrix also has a set_project_up() method, which instruments the whole project with insights, dashboards, cohorts, etc.

The results of a Matrix are saved using MatrixManager, which does not know how the simulation works, only how to save it.
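One way to picture this separation is a saver that depends only on a narrow interface. The sketch below is hypothetical (SimulatedWorld, WorldSaver and write_event are made-up names, not MatrixManager's actual API); it shows persistence code that only needs the finished events, not how they were produced.

```python
from typing import Callable, Iterable, Protocol


class SimulatedWorld(Protocol):
    """The only thing the saver needs: a way to list finished events."""

    def events(self) -> Iterable[dict]: ...


class WorldSaver:
    """Hypothetical stand-in for a manager that persists simulation results."""

    def __init__(self, write_event: Callable[[dict], None]) -> None:
        # The sink is injected, so it could be a Kafka producer, a DB insert, or a list.
        self.write_event = write_event

    def save(self, world: SimulatedWorld) -> int:
        count = 0
        for event in world.events():
            self.write_event(event)
            count += 1
        return count
```

Any simulation that can list its events can be saved this way, which keeps new Matrix subclasses from needing changes in the saving code.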

TODOs

This system has MUCH more potential than the previous one, but it's not as robust as it could be yet. Here are some enhancement opportunities:

  • make agents invite each other within a company, with $groupidentify
  • make agents upgrade/downgrade plan for their org
  • add some more insights
  • add an annotation or two
  • fix events showing up as stale
  • add another dashboard
  • add recently viewed insight entries
  • document simulation approach more in code

@Twixes Twixes requested review from mariusandra and tiina303 May 13, 2022 19:20
# TODO: Support persons on events
}
p = ClickhouseProducer()
p.produce(topic=KAFKA_EVENTS_JSON, sql=INSERT_EVENT_SQL(), data=data)
Contributor

👍
The only concern I'd have with this is that it could be slightly more annoying to debug, e.g. if we write a bad event, but I'll take the speed improvements, and it looks like we already use it for persons and other stuff too.
I noticed that in bulk_create_events we still have sync_execute below; that's probably fine and potentially faster that way.

Collaborator

@mariusandra mariusandra left a comment

Cool stuff!

I tried running it, but somehow can't seem to get any events to show up. Not in the ingestion, and not in clickhouse. Am I missing something obvious? 🤔

[Screenshot: 2022-05-17 12 03 33]

@@ -535,6 +535,7 @@ export const keyMapping: KeyMappingInterface = {
},
},
}
keyMapping['$distinct_id'] = keyMapping['distinct_id']
Collaborator

This looks 👁️ 👁️ dodgy. // what's happening?

Member Author

Oh, this is actually wrong, I misread _get_distinct_id and saw both distinct_id and $distinct_id being supported, but that's only at the top level of the object. In props indeed only the former is recognized. (Not the most straightforward situation, but I guess that's backwards compat for you)
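The asymmetry described here can be sketched as below. This is a hypothetical re-creation based only on the comment above, not the actual _get_distinct_id; the function name resolve_distinct_id and the lookup order are assumptions.

```python
from typing import Any, Dict, Optional


def resolve_distinct_id(payload: Dict[str, Any]) -> Optional[str]:
    """Hypothetical lookup: both spellings at the top level, one spelling in properties."""
    # Top level of the object: `distinct_id` and `$distinct_id` are both accepted.
    for key in ("distinct_id", "$distinct_id"):
        if payload.get(key) is not None:
            return str(payload[key])
    # Inside properties: only the un-prefixed `distinct_id` is recognized.
    props = payload.get("properties") or {}
    if props.get("distinct_id") is not None:
        return str(props["distinct_id"])
    return None
```

Under this reading, a `$distinct_id` key inside properties is simply ignored, which is why aliasing it in keyMapping would describe a property that is never actually read.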

@Twixes
Member Author

Twixes commented May 17, 2022

Oh right, I didn't explicitly point that out, but the manage.py command is just a dry run. This isn't hooked up to the setup_dev command yet, so for the data to be ingested you need to use option 2 @mariusandra:

Run the server with DEMO=1 (e.g. DEMO=1 ./bin/start) and log in for the full demo experience

@mariusandra
Collaborator

Oh, apologies, that's on me. I tried that earlier, but it didn't work as I had too much nothing:

image

Now I do see this:

image

.. but no data other than my own local plugins and clicking around. I'm not even sure what project things should appear under.

@Twixes
Member Author

Twixes commented May 17, 2022

So, for the actual demo experience, you need to log out and sign in with a different email. That should set you up with fresh data.

Collaborator

@mariusandra mariusandra left a comment

Right. 🙃

Well, this looks good and works. However, it takes over a minute to generate the demo data for me:

[Screenshot: 2022-05-17 13 32 37]
[Screenshot: 2022-05-17 13 32 48]

The console shows output during the first few seconds, but then pretty much pauses for a while and the app seems to be stuck by all visible indicators. 80 seconds later the next log line appears:

[DEMO] Simulated 1058 people in 6.18 s
[DEMO] Saved (individual part) 1058 people in 80.83 s

Is there a plan to pregenerate this for users? As it is now, we can't really run this for every new user, unless we show a clear "please wait" screen with a game they can play while waiting.

@Twixes
Member Author

Twixes commented May 17, 2022

Hmm, this should take a few seconds since the events are now ingested async via Kafka, but I'll look into it. 👀 There are a couple of approaches where this could be pre-generated, though I think it's more fun if everyone gets a random environment of their own – provided of course this takes less than 10 seconds, where the wait would be OK (we could just show an approximate "Preparing the world" progress bar).

@Twixes Twixes merged commit d035124 into master May 17, 2022
@Twixes Twixes deleted the demo-data-reworked branch May 17, 2022 20:23
alexkim205 pushed a commit that referenced this pull request May 23, 2022
* Rework demo data generation system

* Fix `setup_dev` and `posthog-foss`

* Keep old demo data generators to reduce hassle

* Move to Hoglify concept

* Separate new generator from old version

* Fix issues

* Rework simulation structure

* Restore package.json

* Reformat `requirements`

* Fix signup button margin

* Refactor things

* Remove snapshots

* Strip old stuff

* Rearrange more

* Fix bad imports

* Add simulation scaffolding

* Add `dry_run_matrix` command

* Fix determinism

* Update naming

* Update dry_run_matrix.py

* Model web client, add sessions, enable full-cluster simulation

* Update flake8 config

* Ignore T001 violation

* Fix saving data

* Instrument `set_project_up` more

* Add demo cohorts, feature flag, experiment

* Parametrize `start` and `end` in `simulate_matrix`

* Add neighbor effects

* Add more events

* Allow silencing events in `simulate_matrix`

* Improve effect scheduling and add more activities

* Fix time measurement

* Disallow creating extra orgs for demo users

* Add more useful info to `simulate_matrix` output

* Add super properties, refine world

* Fix first-seen moment

* `create_event` to Kafka if possible for speed

* Alias `$distinct_id` to `distinct_id` in `keyMapping`

* Extend simulation to 120 days

* Fix experiment instrumentation

* Fix some error message

* Fix experiment flag

* Increase number of demo sim clusters

* Fix typing

* Remove unused agent actions

* Support Python 3.8

* Avoid `Union[Team, int]`

* Fix an arg

* Remove dodgy alias