Our data already demonstrates fairly clearly that long-term Quantum Country users develop reliable memory across long intervals, and with decreasing cost over time. So this experiment is mostly about establishing the counter-factual—that comparable readers won’t retain details they don’t review.
In this experiment, we’ll remove one entire review set from an experimental group’s in-essay experience. Then we can re-insert those questions later and compare cross-condition accuracy.
I propose we remove the 9-question review set beginning with “<\psi| is an example of a _.” in “Why are the unitaries the only matrices which preserve length?” It’s enough questions to give us decent signal; this review set seems to have one of the smallest overlaps with other sets; and there’s no surrounding prose referring to the questions.
In this experiment, we’ll use the same manipulation on two different time scales to gather data about two questions:
We can’t directly answer these questions, so instead we’ll answer two proxies:
Newly-registered users will be split into three conditions:
We’ll have enough data to answer Q1’ long before Q2’; at that point, we can switch to enrolling 60% of readers in the “1-month delay” condition.
We assign an experimental condition upon new user registration. I’ll call users in the “5-day delay” and “1-month delay” conditions “experimental users.” Previously-registered users won’t participate; they’ll see the same experience they always have.
The UI will calculate progress relative to the reduced card count for experimental users, until the review session in which they interact with those cards for the first time. Likewise, once an experimental user has answered a question in a review session, it’ll appear as normal within the essay.
We’ll define an experimental user as “enrolled” at the time they answer a question located after the experimental review set in the essay. Our prior data suggests that almost all users read completely linearly, so this represents strong evidence that the user has read the material covered by the manipulated review set. We’ll define a control user as “enrolled” when they’ve answered all questions in the manipulated review set.
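The two enrollment rules can be sketched as a single predicate. This is a minimal illustration, not the production logic: it assumes each question has an integer position in the essay’s linear order, and the names `is_enrolled`, `set_start`, and `set_end` are hypothetical.

```python
def is_enrolled(condition, answered_positions, set_start, set_end):
    """Return True once a user counts as enrolled in the experiment.

    condition          -- "control" or any experimental condition label
    answered_positions -- set of essay positions the user has answered (assumed representation)
    set_start, set_end -- inclusive positions of the manipulated review set (hypothetical)
    """
    if condition == "control":
        # Control users must have answered every question in the manipulated set.
        return all(pos in answered_positions for pos in range(set_start, set_end + 1))
    # Experimental users never see the set in the essay, so we use the first
    # answer located after it as evidence they have read the covered material.
    return any(pos > set_end for pos in answered_positions)
```

The asymmetry is deliberate: experimental users can’t answer the removed questions in the essay, so the rule leans on the observation that almost all users read linearly.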
When “5-day delay” users are enrolled, the manipulated review set’s questions will be marked on their account as due 5 days later. This will cause them to be shuffled into their next review session as if they’d been answered in the context of the essay. Because QCVC questions are shuffled across several sessions, the manipulated questions may be split across several sessions, just as for their control-group peers.
When “1-month delay” users are enrolled, the manipulated review set’s questions will be marked on their account as due 1 month later. This will naturally cause a review session to be due at that time. To minimize the sense of manipulation, we’ll make sure that review session is at least 25 questions long by adding other randomly chosen questions if necessary. Correct answers to those padding questions won’t advance them to the next level; they’ll be scheduled to next appear at their current interval. If too many questions are due to fit into a single review session, we’ll prioritize the manipulated review set’s questions.
1 month after control users are enrolled, we’ll manufacture a test review session including the manipulated question set. As with the 1-month delay users, we’ll pad the review session to 25 questions as necessary. If the user already has that many questions due, of course, we won’t do any special manipulation.
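As a rough sketch of the scheduling and padding rules, assuming questions are identified by string IDs: the function names, the `max_length` cap, and treating “1 month” as a caller-supplied number of days are all assumptions for illustration, not details of the real scheduler.

```python
import random
from datetime import datetime, timedelta

MIN_SESSION_LENGTH = 25  # minimum session size stated in the text


def manipulated_due_date(enrolled_at, delay_days):
    """Due date for the manipulated set: 5 days, or roughly 30 for the 1-month condition (assumed)."""
    return enrolled_at + timedelta(days=delay_days)


def build_session(manipulated_due, other_due, not_due, max_length, rng):
    """Return (session_ids, padding_ids) for one review session.

    manipulated_due -- IDs from the manipulated review set that are due
    other_due       -- IDs of any other questions that are due
    not_due         -- IDs eligible to be used as padding
    max_length      -- hypothetical session cap, assumed >= MIN_SESSION_LENGTH
    """
    # Manipulated questions go first so they survive truncation at the cap.
    session = (list(manipulated_due) + list(other_due))[:max_length]
    shortfall = MIN_SESSION_LENGTH - len(session)
    padding = []
    if shortfall > 0:
        padding = rng.sample(list(not_due), min(shortfall, len(not_due)))
    session_ids = session + padding
    rng.shuffle(session_ids)  # hide which questions are padding
    # The caller should keep padding questions at their current interval
    # even when they are answered correctly.
    return session_ids, set(padding)
```

Returning the padding IDs separately lets the scheduler apply the “don’t advance on a correct answer” rule without guessing which questions were added artificially.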
For Q1’, we’ll compare the distribution of first-repetition accuracies over the manipulated question set between control users and “5-day delay” users.
For Q2’, we’ll compare the distribution of accuracies over the manipulated question set in the special 1-month-post-enrollment session between control users and “1-month delay” users.
The analysis is a bit tricky in both cases because, for each user, the measured sample is the number of questions answered correctly (e.g. out of 9). That’s an ordinal variable, which makes it harder to talk about aggregate effects. We can test and report shifts in the medians, though; e.g.: “the median control reader answered 9/9 correctly, while the median delayed reader answered 4/9 correctly; χ² = xx, p < 0.01”.
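One way to get a χ² statistic for a shift in medians is Mood’s median test. The sketch below is a hand-rolled version with no continuity correction, and it is only one candidate for the test meant here; the function name is hypothetical and any scores passed in are per-user counts of correct answers.

```python
import math
from statistics import median


def moods_median_test(group_a, group_b):
    """Mood's median test for two groups; returns (chi2, p_value) with 1 degree of freedom."""
    pooled = list(group_a) + list(group_b)
    grand_median = median(pooled)
    # 2x2 table: users strictly above the grand median vs. everyone else.
    a_above = sum(1 for x in group_a if x > grand_median)
    b_above = sum(1 for x in group_b if x > grand_median)
    a_rest = len(group_a) - a_above
    b_rest = len(group_b) - b_above
    n = len(pooled)
    denom = (a_above + a_rest) * (b_above + b_rest) * (a_above + b_above) * (a_rest + b_rest)
    if denom == 0:
        # Degenerate table, e.g. so many 9/9 scores that nobody exceeds the median.
        return 0.0, 1.0
    chi2 = n * (a_above * b_rest - a_rest * b_above) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Because scores are capped at 9, ties at the ceiling are likely; if most pooled scores are 9/9 the table degenerates, and a rank-based alternative may be worth considering.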
We can also do the naive analysis we did in the 2019 Q3 RCT: pool all questions into a single accuracy measure for each condition; compute confidence intervals assuming they’re binomial; etc. The covariance between samples will be a much bigger problem in this experiment, though, because the manipulated questions are all drawn from the same review set: a given user’s answers will behave more like draws from a single hidden variable than like independent trials. A binomial estimate will therefore substantially overstate confidence.
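To see how much a binomial interval might overstate confidence, one option is to compare it against an interval that resamples whole users rather than individual answers. This user-level bootstrap is a technique swapped in for illustration, not part of the plan above; the function names are hypothetical.

```python
import math
import random


def pooled_wald_interval(correct_counts, questions_per_user, z=1.96):
    """Naive 95% interval treating every answer as an independent binomial trial."""
    n = len(correct_counts) * questions_per_user
    p = sum(correct_counts) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def user_bootstrap_interval(correct_counts, questions_per_user, resamples=2000, seed=0):
    """Percentile 95% interval that resamples users, keeping each user's answers together."""
    rng = random.Random(seed)
    k = len(correct_counts)
    means = []
    for _ in range(resamples):
        draw = [rng.choice(correct_counts) for _ in range(k)]
        means.append(sum(draw) / (k * questions_per_user))
    means.sort()
    return means[int(0.025 * resamples)], means[int(0.975 * resamples) - 1]
```

When a user’s answers move together (mostly right or mostly wrong), the user-level interval comes out noticeably wider than the pooled one, which is the overconfidence described above.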
We’ll discard samples from users who:
This experiment will improve upon our 2019 Q3 efficacy experiment: