2020-01 Quantum Country efficacy experiment

Analysis

Without review, most Quantum Country readers forget at least a third of the material after a month (2020 efficacy study)
Accuracy rates for withheld Quantum Country questions were roughly equal at 5 days and 1 month (2020 efficacy experiment)
Withheld Quantum Country questions have highly variable accuracy rates (2020 efficacy study)
In-essay Quantum Country prompts boost performance on first repetition

Experimental plan

Our data already demonstrates fairly clearly that long-term Quantum Country users develop reliable memory across long intervals, and with decreasing cost over time. So this experiment is mostly about establishing the counter-factual—that comparable readers won’t retain details they don’t review.

In this experiment, we’ll remove one entire review set from an experimental group’s in-essay experience. Then we can re-insert those questions later and compare cross-condition accuracy.

I propose we remove the 9-question review set beginning with “<\psi| is an example of a _.” in “Why are the unitaries the only matrices which preserve length?” It has enough questions to give us decent signal; this review set seems to have one of the smallest overlaps with other sets; and there’s no surrounding prose referring to the questions.

High level design

In this experiment, we’ll use the same manipulation on two different time scales to gather data about two questions:

Q1: What is the marginal impact on retention of embedding review questions directly inside the essay?
Q2: What’s the “natural” long-term memory decay rate for this material?

We can’t directly answer these questions, so instead we’ll answer two proxies:

Q1’: How does first-repetition reader accuracy compare between readers who answered certain review questions while reading the essay and those who read the corresponding material but never saw those questions?
Q2’: 1 month after reading the essay, how does reader accuracy compare between those who have reviewed certain questions several times, and those who read the corresponding material but never saw the questions?

Newly-registered users will be split into three conditions:

control (40%), with no manipulation
5-day delay (20%), which will not see the manipulated review set on their initial read, but will have it included as normal in their review sessions (i.e. for Q1’)
1-month delay (40%), which will not see the manipulated review set in-essay or in their review sessions until 1 month after enrollment (i.e. for Q2’)

We’ll have enough data to answer Q1’ long before Q2’; at that point, we can switch to enrolling 60% of readers in the “1-month delay” condition.
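The split above could be implemented by hashing each user ID into weighted buckets. This is only a sketch under assumptions: `assign_condition`, the two weight tables, and the salt string are hypothetical names, not the production implementation.

```python
import hashlib

# Percentages come from the plan above; the table names are hypothetical.
INITIAL_WEIGHTS = [("control", 40), ("5-day delay", 20), ("1-month delay", 40)]
# Once Q1' has enough data, the 5-day share moves to the 1-month condition.
LATER_WEIGHTS = [("control", 40), ("1-month delay", 60)]


def assign_condition(user_id, weights=INITIAL_WEIGHTS, salt="2020-01-efficacy"):
    """Map a user ID to a condition; the same ID always gets the same answer."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % sum(weight for _, weight in weights)
    threshold = 0
    for name, weight in weights:
        threshold += weight
        if bucket < threshold:
            return name
    raise AssertionError("unreachable: weights cover every bucket")
```

The condition would be computed once at registration and stored on the account, so switching new registrations to `LATER_WEIGHTS` leaves already-assigned users where they are.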

Implementation details

We assign an experimental condition upon new user registration. I’ll call users in the “5-day delay” and “1-month delay” conditions “experimental users.” Previously-registered users won’t participate; they’ll see the same experience they always have.

The UI will calculate progress relative to the reduced card count for experimental users, until the review session in which they interact with those cards for the first time. Likewise, once an experimental user has answered a question in a review session, it’ll appear as normal within the essay.

We’ll define an experimental user as “enrolled” at the time they answer a question located after the experimental review set in the essay. Our prior data suggests that almost all users read completely linearly, so this represents strong evidence that the user has read the material covered by the manipulated review set. We’ll define a control user as “enrolled” when they’ve answered all questions in the manipulated review set.

When “5-day delay” users are enrolled, the manipulated review set’s questions will be marked on their account as due 5 days later. This will cause them to be shuffled into their next review session as if they’d been answered in context of the essay. Because QCVC questions are shuffled across several sessions, the manipulated questions may be split across several sessions, just as for their control-group peers.

When “1-month delay” users are enrolled, the manipulated review set’s questions will be marked on their account as due 1 month later. This will naturally cause a review session to be due at that time. To minimize the sense of manipulation, we’ll make sure that review session is at least 25 questions long by adding other randomly chosen questions if necessary. Correct answers won’t advance those added questions to the next level; they’ll be scheduled to next appear at their current interval. If too many questions are due to fit into a single review session, we’ll prioritize the manipulated review set’s questions.
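A rough sketch of how such a session could be assembled. The 25-question minimum comes from the plan; everything else is assumed for illustration: `build_test_session` is a hypothetical helper, questions are plain string IDs, and `max_length=50` is a made-up cap, since the real maximum session size isn't stated here.

```python
import random

MIN_SESSION_LENGTH = 25  # from the plan above


def build_test_session(manipulated, other_due, not_due, max_length=50, rng=None):
    """Return (session, padding): the shuffled session plus the set of padded IDs.

    max_length is a hypothetical cap on session size.
    """
    if rng is None:
        rng = random.Random()
    # Manipulated questions go in first so they are never crowded out.
    session = list(manipulated)[:max_length]
    room = max_length - len(session)
    session += list(other_due)[:room]
    # Pad short sessions with randomly chosen questions that are not yet due.
    shortfall = max(0, MIN_SESSION_LENGTH - len(session))
    not_due = list(not_due)
    padding = rng.sample(not_due, min(shortfall, len(not_due)))
    session += padding
    rng.shuffle(session)
    return session, set(padding)
```

The scheduler would consult the returned `padding` set: a correct answer to a padded question leaves it at its current interval instead of advancing it.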

1 month after control users are enrolled, we’ll manufacture a test review session including the manipulated question set. As with the 1-month delay users, we’ll pad the review session to 25 questions as necessary. If the user already has that many questions due, of course, we won’t do any special manipulation.

Analysis

For Q1’, we’ll compare the distribution of first-repetition accuracies over the manipulated question set between control users and “5-day delay” users.

For Q2’, we’ll compare the distribution of accuracies over the manipulated question set in the special 1-month-post-enrollment session between control users and “1-month delay” users.

The analysis is a bit tricky in both cases because for each user, the measured sample is the number of questions answered correctly (e.g. out of 9). That’s an ordinal variable, which makes it harder to talk about aggregate effects. We can test and report shifts in the medians, though; e.g.: “the median control reader answered 9/9 correctly, while the median delayed reader answered 4/9 correctly; X^2 = xx, p < 0.01”.
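One test that yields a chi-squared statistic for a shift in medians is Mood's median test; the plan doesn't name a specific test, so treat this choice as an assumption. The sketch below implements it by hand for two groups, without a continuity correction. `median_test` is a hypothetical helper name, and its inputs are per-user counts of correct answers.

```python
import math
from statistics import median


def median_test(control, delayed):
    """Mood's median test for two groups; returns (chi2, p) with 1 degree of freedom."""
    control, delayed = list(control), list(delayed)
    grand = median(control + delayed)
    # 2x2 table: rows are above / not above the pooled median, columns are groups.
    table = [
        [sum(x > grand for x in control), sum(x > grand for x in delayed)],
        [sum(x <= grand for x in control), sum(x <= grand for x in delayed)],
    ]
    row_totals = [sum(row) for row in table]
    col_totals = [len(control), len(delayed)]
    if 0 in row_totals or 0 in col_totals:
        # Happens when ties pile up at the pooled median, e.g. most users at 9/9.
        raise ValueError("degenerate table: no scores fall above the pooled median")
    n = sum(col_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p
```

Ties matter here: with scores out of 9 and many control readers at 9/9, the pooled median can sit at the ceiling, in which case the table degenerates and the helper raises instead of reporting a misleading statistic.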

We can also do the naive analysis we did in the 2019 Q3 RCT: pool all questions into a single accuracy measure for each category; compute confidence intervals assuming they’re binomial; etc. The sample covariance will be much worse in this experiment, though, because the manipulated questions are all drawn from the same review set. The situation will be closer to a single hidden variable for a given user. A binomial estimate will substantially overreport confidence.
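To illustrate the overconfidence, the sketch below compares the pooled binomial standard error against a standard error that treats each user's accuracy as a single observation. Both helper names are hypothetical, and the example scores are invented to mimic the extreme "single hidden variable" case; they are not experimental data.

```python
import math
from statistics import stdev


def pooled_binomial_se(scores, n_questions=9):
    """Standard error treating every answer as an independent Bernoulli trial."""
    total = len(scores) * n_questions
    p = sum(scores) / total
    return math.sqrt(p * (1 - p) / total)


def per_user_se(scores, n_questions=9):
    """Standard error treating each user's accuracy as one observation."""
    accuracies = [s / n_questions for s in scores]
    return stdev(accuracies) / math.sqrt(len(accuracies))


# Made-up scores: each user either remembers the whole set or none of it.
example_scores = [9, 9, 0, 9, 0, 9, 9, 0]
naive = pooled_binomial_se(example_scores)
clustered = per_user_se(example_scores)
```

For these invented scores the per-user standard error comes out about three times the pooled binomial one, which is the direction of the problem described above; real data would presumably fall somewhere between fully independent and fully correlated answers.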

We’ll discard samples from users who:

are more than a week late answering any questions in the review set after the test is due
as control users, repeat the review set questions fewer than two times before the 1 month test
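The two exclusion rules could be expressed as a filter along these lines. The record shape is an assumption: `UserRecord` and its fields are hypothetical, and `prior_repetitions` stands in for however "repeated the review set" ends up being counted per user.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class UserRecord:  # hypothetical record shape
    condition: str
    test_due: datetime
    test_answered: list  # datetimes at which each manipulated question was answered
    prior_repetitions: int  # repetitions of the review set before the 1-month test


MAX_LATENESS = timedelta(weeks=1)
MIN_CONTROL_REPETITIONS = 2


def is_included(user):
    """Apply both discard rules; True means the user's sample is kept."""
    late = any(t - user.test_due > MAX_LATENESS for t in user.test_answered)
    under_reviewed = (
        user.condition == "control"
        and user.prior_repetitions < MIN_CONTROL_REPETITIONS
    )
    return not (late or under_reviewed)
```

The repetition rule applies only to control users, since experimental users have by design not seen the questions before the test.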

In comparison to the 2019 Q3 experiment

This experiment will improve upon our 2019 Q3 efficacy experiment:

We’ll remove a single review set instead of individual cards from many review sets. This should minimize the overlaps between the delayed questions and unmanipulated questions which diminished the effect in our last experiment.
We expect a stronger effect by increasing the test interval from 2 weeks to 1 month.
We expect a stronger effect by removing the in-text questions from the experimental groups as well.
We’ll include a larger pool of users by avoiding the requirement that readers read a large fraction of the essay by their first session.