(Superseded) Lapsed question accuracy remains iffy, despite retry and interval-shortening

==Update, 2021-04-12==
I don’t really believe this analysis anymore. It’s comparing two different user cohorts. For more recent analysis on lapses, see Demonstrated retention reliably bounds future recall attempts on Quantum Country. On retry, see Retry intervention produces substantial increases in early accuracy on Quantum Country.

If a reader forgets a question in a review session, we shorten that question’s interval so that they’re more likely to remember it during the next session in which it appears.

Question 1: Does shortening lapsed question intervals help readers remember? Or do people often just keep forgetting, over and over again?

Question 2: We recently added a related intervention: when a reader forgets a question during a review session, they’ll review it again in that session. Does that make readers more likely to remember the answer during the next session in which it appears?

I can answer both of those questions at once. I’ll focus on forgotten questions at the 2-week level, since that’s the first level at which (in our new schedule) forgetting shortens subsequent review intervals. If a new-schedule reader forgets a 5-day question, it stays at 5 days. (Query 1)

Here are the accuracy rates on “lapsed” 2-week questions, in their first non-retry review after they were forgotten:

New schedule, with retry: 79% ± 4.1% (95% CI: 75-83%); N=375
Old schedule, without retry: 74% ± 1.8% (95% CI: 72-76%); N=2347

==Update, 2020-03-07==

New schedule, with retry: 78% ± 2.1% (95% CI: 76-80%); N=1553
Old schedule, without retry: 74% ± 1.8% (95% CI: 72-76%); N=2373
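
The ± figures above are consistent with a normal-approximation (Wald) interval for a binomial proportion. The note doesn’t say which method was actually used, so treat that as an assumption. Here is a minimal sketch, starting from the rounded accuracies reported above; `wald_interval` is a hypothetical helper name:

```python
import math

def wald_interval(accuracy, n, z=1.96):
    """Normal-approximation (Wald) interval for a binomial proportion.

    Assumes every review is an independent trial.
    Returns (margin, low, high) as fractions.
    """
    margin = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return margin, accuracy - margin, accuracy + margin

# Rounded figures from the 2020-03-07 update: (label, accuracy, N)
reported = [
    ("New schedule, with retry", 0.78, 1553),
    ("Old schedule, without retry", 0.74, 2373),
]

for label, accuracy, n in reported:
    margin, low, high = wald_interval(accuracy, n)
    print(f"{label}: {accuracy:.0%} ± {margin:.1%} (95% CI: {low:.0%}-{high:.0%}); N={n}")
```

Because the inputs are already rounded to whole percentages, the last digit of the margin may differ slightly from a calculation on the raw counts.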

Question 1 analysis: does interval-shortening help?

These accuracy rates are fairly low compared to readers’ overall accuracy rate for questions at the 2-week level, which is 95% (N=71,475; Query 2). We’d expect lapsed-question accuracies to be lower, but that’s a lot lower!

The lapsed-question accuracy rates remain similar at longer intervals (can’t compare with versus without retry—too little data):

1 month: 72% ± 2.4% (N=1369)
2 months: 75% ± 3.7% (N=516)

So, big-picture: if readers forget the answer to a question, they’re fairly likely (roughly 1 in 4) to forget it again next time, roughly irrespective of interval.

This analysis doesn’t dig into whether it’s the same questions being forgotten over and over again for a given reader. That’d be interesting to know.

Question 2 analysis: does retry help?

It maybe helps a bit.

The confidence intervals overlap slightly, but I’d believe the effect is a couple of percentage points. (Also, the binomial confidence interval analysis here is wonky, because performance on individual lapsed questions will be highly correlated among samples from the same reader.)
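
One way to respect that within-reader correlation would be to resample whole readers instead of individual reviews (a cluster bootstrap). This isn’t what the analysis above did. It’s a minimal sketch under assumed inputs: the function name, the data layout, and the example data are all made up for illustration.

```python
import random

def cluster_bootstrap_ci(outcomes_by_reader, n_resamples=2000, seed=0):
    """Percentile 95% interval for pooled accuracy, resampling whole readers.

    outcomes_by_reader maps a reader ID to a non-empty list of
    1 (remembered) / 0 (forgotten) outcomes on lapsed questions.
    """
    rng = random.Random(seed)
    readers = list(outcomes_by_reader)
    estimates = []
    for _ in range(n_resamples):
        # Draw readers with replacement, keeping each reader's reviews together.
        sample = rng.choices(readers, k=len(readers))
        correct = sum(sum(outcomes_by_reader[r]) for r in sample)
        total = sum(len(outcomes_by_reader[r]) for r in sample)
        estimates.append(correct / total)
    estimates.sort()
    low = estimates[int(0.025 * n_resamples)]
    high = estimates[int(0.975 * n_resamples) - 1]
    return low, high

# Made-up data for illustration only; not Quantum Country results.
example = {
    "reader-a": [1, 1, 1, 1, 1, 1],
    "reader-b": [0, 0, 0, 1],
    "reader-c": [1, 1, 0],
    "reader-d": [1, 0, 1, 1, 1],
    "reader-e": [0, 0],
}
pooled = sum(map(sum, example.values())) / sum(map(len, example.values()))
low, high = cluster_bootstrap_ci(example)
print(f"pooled accuracy {pooled:.0%}, cluster-bootstrap 95% interval {low:.0%}-{high:.0%}")
```

When outcomes cluster by reader, an interval built this way will usually be wider than the plain binomial one, because the number of independent units is closer to the number of readers than to the number of reviews.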

Another limitation of this analysis is that new-schedule readers will have reviewed a given question fewer times when they hit the 2-week mark than their old-schedule peers. I’m comparing 2-week lapses to 2-week lapses to control for interval, but I’m not controlling for repetition. That said, there appears to be fairly little variation with repetition count, so it probably doesn’t matter that much.

So what?

~75% is probably too low to produce long-term confidence here. We’ll probably need a stronger intervention to address lapses reliably.

If retry isn’t terribly helpful, should we remove it? It’s making people do more work. My instinct is that retry is emotionally important. It communicates: “hey! we know you forgot that, but don’t fret: you’ll see it again soon, and we’re keeping track.” Of course, we don’t want to waste people’s time or mislead them—it’d be better to replace retry with a stronger mechanic that also yields this emotional response—but the cost seems low now.

Query 1:

WITH
  users AS (
  SELECT
    userID,
    schedule AS category
  FROM
    `logs.registeredUsers`),
  reviewsWithoutRetry AS (
  SELECT
    *
  FROM
    `logs.reviews`
  WHERE
    isRetry IS NOT TRUE
    AND sessionID IS NOT NULL),
  laggingReviewTimestamps AS (
  SELECT
    *,
    LAG(timestamp) OVER (PARTITION BY userID, cardID ORDER BY timestamp ASC) AS previousReviewTimestamp,
    LAG(reviewMarking) OVER (PARTITION BY userID, cardID ORDER BY timestamp ASC) AS previousReviewMarking,
    LAG(beforeInterval) OVER (PARTITION BY userID, cardID ORDER BY timestamp ASC) AS previousInterval
  FROM
    reviewsWithoutRetry),
  withRetryInBetween AS (
  SELECT
    *,
  IF
    ((
      SELECT
        COUNT(*) > 0
      FROM
        `logs.reviews` AS r
      WHERE
        r.userID = l.userID
        AND r.cardID = l.cardID
        AND r.timestamp > l.previousReviewTimestamp
        AND r.timestamp < l.timestamp
        AND r.isRetry IS TRUE),
      TRUE,
      FALSE) AS didInterveningRetry
  FROM
    laggingReviewTimestamps AS l),
  accuracies AS (
  SELECT
    previousInterval,
    category,
    COUNT(*) AS N,
    COUNTIF(reviewMarking="remembered") AS countCorrect,
    COUNTIF(reviewMarking="remembered")/COUNT(*) AS accuracy
  FROM
    withRetryInBetween
  JOIN
    users
  USING
    (userID)
  WHERE
    previousReviewMarking = "forgotten"
    AND previousInterval > 1000*60*60*24*5
    AND ((category = "aggressiveStart"
        AND didInterveningRetry)
      OR (category = "original"
        AND NOT didInterveningRetry))
  GROUP BY
    category,
    previousInterval)
SELECT
  *
FROM
  accuracies
ORDER BY
  category,
  previousInterval

Query 2:

SELECT
  COUNTIF(reviewMarking="remembered")/COUNT(*) AS accuracy,
  COUNT(*) AS N
FROM
  `logs.reviews`
WHERE
  beforeInterval = 1209600000
  AND isRetry IS NOT TRUE
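
The interval literals in both queries appear to be durations in milliseconds: `1000*60*60*24*5` is 5 days, and `1209600000` matches 14 days. A small sketch to check that arithmetic (`MS_PER_DAY` and `days_to_ms` are hypothetical names, not part of the logging schema):

```python
MS_PER_DAY = 1000 * 60 * 60 * 24  # milliseconds in one day

def days_to_ms(days):
    """Convert a whole number of days to milliseconds."""
    return days * MS_PER_DAY

print(days_to_ms(5))   # threshold in Query 1: previousInterval > 5 days
print(days_to_ms(14))  # 2-week level matched in Query 2
```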