Compliance changes after 2019 user journey redesign

2020-02-28

Retention
- Still not enough data for >= 2-month retention
Level attainment
- P(finishes 5 days | finishes in-text) up substantially: 58±1.4% -> 68±1.7% Query
- P(finishes 2 weeks | finishes 5 days) more clearly up: 74±1.2% -> 79±1.5% Query

2020-02-13

Retention
- Still not enough data for >= 2-month retention
Level attainment
- P(finishes 5 days | finishes in-text) up substantially: 58±1.4% -> 68±1.8% Query
- P(finishes 2 weeks | finishes 5 days) more clearly up: 74±1.2% -> 79±1.6% Query

2020-02-05

Retention
- 1-month retention up substantially: 39±2.0% -> 46±3.9% Query
- (Defined as the percentage of users who do at least one session at least one month after they register, among users who ever did a review session.)
- [not enough data for 2-month and higher]
Level attainment
- P(finishes 5 days | finishes in-text) up substantially: 57±1.4% -> 66±1.9% Query
- P(finishes 2 weeks | finishes 5 days) very likely up slightly: 74±1.2% -> 76±1.7% Query
- But the post-change number is a lower bound: another 7-8% of post-change users who finished 5 days are still active but haven’t reached 2 weeks yet, and a few pp of them will.
- [not enough data for P(finishes 1 month) and higher]

2020-01-16

P(3|2), i.e. P(did session 3 | did session 2), seems fairly conclusively down now:

Old: 84±1.7% (N=1784)
New: 73±5.5% (N=250)

Google Cloud Platform

That’s very interesting! If there’s a real phenomenon behind that, we’ll really need to understand it. I find myself basically not believing it: I suspect there’s some other weird confounding effect, or that we’ve changed something about what “counts” as a notification or a session.
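Setting the confounding question aside, the purely statistical side can be checked with a rough sketch like the one below. It computes an unpooled normal-approximation interval for the difference between the two proportions, using the rounded figures quoted above (the exact counts aren't in these notes, so those inputs are an assumption). `diff_interval` is a name introduced here for illustration, not part of the original analysis.

```python
import math

def diff_interval(p_old, n_old, p_new, n_new, z=1.96):
    """Unpooled normal-approximation interval for (p_old - p_new)."""
    # Standard error of the difference of two independent proportions.
    se = math.sqrt(p_old * (1 - p_old) / n_old + p_new * (1 - p_new) / n_new)
    diff = p_old - p_new
    return diff - z * se, diff + z * se

# Rounded figures from the 2020-01-16 entry (assumed inputs).
lo, hi = diff_interval(0.84, 1784, 0.73, 250)
print(f"drop in P(3|2): between {lo:.3f} and {hi:.3f}")
```

An interval that excludes zero only says the drop is unlikely to be sampling noise. It says nothing about whether the definition of a session or notification changed between cohorts.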

2019-12-30

P(did session 1 | eventually collected 80% of an essay):
- Old: 84±2.1% (N=1194)
- New: 77±5.9% (N=195)
- Query
P(did session 3 | eventually collected 80% of an essay):
- Old: 93±1.6% (N=903)
- New: 83±6.6% (N=123)

I’m having a lot of trouble squaring this with the data we saw earlier, and with the MAU/retention data. I'll need to do a much more careful analysis.

Quick take, after a bunch of exploratory analysis: there’s so much inter-month variation in compliance that a natural experiment may be impossible, and we may need to run an RCT.

Here’s a month-cohort per-session compliance series: Google Cloud Platform

2019-12-16: Readers with the new schedule have about the same per-card demonstrated retention distribution after 3 repetitions as readers using the old schedule had after 5 repetitions

https://console.cloud.google.com/bigquery?utm_source=bqui&utm_medium=link&utm_campaign=classic&project=metabook-qcc&j=bq:US:bquxjob_caf190b_16f10a10cf8&page=queryresults

If that trend continues, we’ll have cut a large amount of review time off for readers.

2019-12-10: Still not enough data to judge session 3

P(3|2) is now at: 84±1.7% (N=1778) -> 82±8.1% (N=84). We’ll get more samples soon: 145 new-experience readers have finished their second session, but only 84 of them (a little more than half) have third-session data so far. Query

We have a little more resolution on 2|1: 79±1.7% (N=2277) -> 86±5.2% (N=169), compared with N=125 and an interval of ±5.5% on 12/05.

Looking again at session 1 compliance:

Among readers who eventually collect 80% of an essay’s cards, the numbers are also still pretty unclear: 84±2.0% (N=1189) -> 81±6.5% (N=139). Query
If we lower that bar to, say, 20%, we see 53±1.4% (N=4585) -> 54±4.3% (N=523). Query
- We’ll have more trouble getting these intervals tight because they’re closer to 50%. Compliance among complete-readers needs fewer samples to get tight binomial intervals, but of course, we get samples much more slowly.
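To put rough numbers on that tradeoff, here is a small sketch of how many samples the normal-approximation interval needs for a given half-width. The variance term p(1-p) peaks at 50%, which is why the 20%-bar numbers need more samples. The ±2 pp target is a hypothetical choice for illustration, and `required_n` is a name introduced here.

```python
import math

def required_n(p, half_width, z=1.96):
    """Samples needed for a normal-approximation interval of +/- half_width at proportion p."""
    return math.ceil(z**2 * p * (1 - p) / half_width**2)

target = 0.02  # hypothetical target: a +/-2 percentage point interval

n_near_half = required_n(0.54, target)  # compliance at the 20%-of-cards bar
n_complete = required_n(0.84, target)   # compliance among 80%-of-essay readers
print(n_near_half, n_complete)
```

Under these assumptions the near-50% case needs about 1.8 times as many samples as the 84% case for the same interval width.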

2019-12-05: Solid session 2 retention increase

To get a leading indicator of the impact on retention, let’s look at P(did session 2 | did session 1) for users who joined before/after introducing the new user journey elements. Still not enough N among readers with 80% of cards in their first session, so I’ll do this analysis across all readers.

New: 89% ± 5.5% (95% CI: 83-94%); N=125
Old: 79% ± 1.7% (95% CI: 77-81%); N=2275

Query

The CI is pretty wide on the new cohort, but the intervals don’t overlap; that looks like a real shift.
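As a sanity check on the non-overlap claim, here is a minimal Python sketch of the same normal-approximation interval, 1.96 * sqrt(p(1-p)/N). It uses the rounded fractions quoted above (0.89 and 0.79), which is an assumption since the raw counts aren't recorded in these notes. `wald_interval` is a name introduced for illustration.

```python
import math

def wald_interval(fraction, n, z=1.96):
    """Normal-approximation (Wald) interval for a binomial proportion."""
    half_width = z * math.sqrt(fraction * (1 - fraction) / n)
    return fraction - half_width, fraction + half_width

# Rounded figures from this entry (assumed inputs; exact counts aren't in the notes).
new_lo, new_hi = wald_interval(0.89, 125)
old_lo, old_hi = wald_interval(0.79, 2275)

# The intervals are disjoint if the new cohort's lower bound clears the old cohort's upper bound.
intervals_disjoint = new_lo > old_hi
print(f"new: [{new_lo:.3f}, {new_hi:.3f}]  old: [{old_lo:.3f}, {old_hi:.3f}]  disjoint: {intervals_disjoint}")
```

Non-overlap of two 95% intervals is a conservative check. The Wald interval also gets less reliable for small N with proportions near 0 or 1, so a Wilson interval would be a reasonable alternative for the new cohort.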

There may be some influence of the teleportation essay here: maybe people who just read that essay are more likely to do a second session because they could review the whole thing in their first one? I could tell the story the other way pretty easily: many people who just read teleportation will have a huge interval between their first and second session, which is ample time to churn.

What about P(does session 3 | did session 2)? Do we have enough samples? Not quite. We went from 84% ± 1.7% (N=1774) to 88% ± 8.7% (N=52).

-- P(did session 2 | did session 1) per bucket, with a 95% normal-approximation interval.
WITH
  -- Users who completed their first session.
  eligible AS (
  SELECT
    userID
  FROM
    `logs.compliance`
  WHERE
    sessionNumber = 1
    AND studyTimestamp IS NOT NULL),
  -- Bucket users by their timestamp in logs.registeredUsers.
  conditions AS (
  SELECT
    userID,
    CASE
      WHEN timestamp >= TIMESTAMP("2019-11-12") THEN "new"
      WHEN timestamp < TIMESTAMP("2019-10-01") THEN "old"
      ELSE NULL -- Throwing out users who would have seen some partial version of the new experience.
    END AS bucket
  FROM
    `logs.registeredUsers`),
  -- Denominator N: session-2 rows that were either completed or are at least a week late.
  means AS (
  SELECT
    bucket,
    COUNTIF(studyTimestamp IS NOT NULL
      OR hoursLate >= 24*7) AS N,
    COUNTIF(studyTimestamp IS NOT NULL) / COUNTIF(studyTimestamp IS NOT NULL
      OR hoursLate >= 24*7) AS fraction
  FROM
    `logs.compliance`
  JOIN
    eligible
  USING
    (userID)
  JOIN
    conditions
  USING
    (userID)
  WHERE
    sessionNumber = 2
    AND bucket IS NOT NULL
  GROUP BY
    bucket),
  -- Half-width of the 95% interval for a binomial proportion.
  cis AS (
  SELECT
    *,
    1.96 * SQRT((fraction * (1 - fraction)) / N) AS CI95
  FROM
    means)
SELECT
  *,
  fraction - CI95 AS lower,
  fraction + CI95 AS upper
FROM
  cis
ORDER BY
  bucket