2021-03-23 Note to Michael on flat forgetting curves

Hello, Michael! One of the mnemonic medium rabbit holes I’ve been exploring over the past few months has become striking enough that I wanted to share it in case you’re interested.

Across two different experiments, I’m measuring flat forgetting curves for Quantum Country readers between five-day and one-month intervals.

[Meta: I’ve mostly stopped sending you updates on my work around the mnemonic medium because I’ve gotten the strong sense that you’re trying to distance yourself from it, and from other discussion around tools for thought. I can understand that, and I want to be respectful! If you’d like me to stop sending you notes like this, say the word; no hard feelings. Likewise if I’ve misread and you’d like to stay more connected. I confess it’s awkward for me! I’m continuing to run experiments with Quantum Country, which is our project, even though you’re no longer involved. I’m pretty sure you don’t want to be in my critical path, but it feels odd not to run experiments by you. I’ve compromised by only running unobtrusive experiments.]

The early 2020 held-out-questions experiment
But wait, you say! What about our 2020 experiment in which we withheld one section’s questions from some users? Didn’t we find that most of those users forgot at least a third of the questions after a month?

Yes, that is what we found. And yes, most users who did review those questions over the month could remember them all. But the difference between these groups isn’t necessarily caused by forgetting over the course of the month. For instance, it could be that many people skim over that material in the essay, but they learn it in the course of a month’s practice.

Happily, this experiment had another condition, which provides a baseline. A third group of users had the same questions withheld while they read the essay, but the questions were re-inserted in their first review session—five days later, instead of one month later. I was surprised to find that these users who had questions withheld for five days performed roughly the same as users who had questions withheld for a month.

The 25th/50th/75th percentiles are 4/5/7 and 4/6/8 questions remembered out of 9 for the five-day and one-month groups, respectively; N=72 and 131. The distributions are visualized below, first for the five-day interval, then for the one-month interval. Note that to control for survivorship effects in the first group, I include only readers who continued with their practice and reviewed all these prompts again after 30+ days.

One interpretation is that this distribution primarily represents reading comprehension and prior knowledge. For a given question, readers either already know the answer, absorbed it from the text, or else didn’t really internalize it. Yes, some forgetting is happening, but not enough to shift these mostly-binary outcomes, especially when washed out by the effects of reviewing other somewhat-related questions. We can see some evidence for those related-review effects: the one-month-withheld group actually performs just slightly better than the five-days-withheld group.

An aside on models: the measures I’m reporting here aren’t like the measures typically reported in cognitive psychology studies on memory. Those studies typically model a per-subject recall probability as the parameter of a single binomial variable, which they can estimate using the fraction of questions recalled. This model depends on questions being independent, equally unfamiliar, of similar complexity, etc. Under these conditions, that per-subject recall probability typically follows a normal distribution across subjects. In fact, such studies often obtain between-subject variances of only a few percentage points! Very different from our data.

By contrast, our questions are highly interdependent, variously amenable to prior knowledge, variable complexity, etc. We’re not sampling from a single distribution here. It’s more like we’re sampling from the sum of nine partially-correlated Bernoulli variables, each of which is correlated with reader-specific characteristics (prior knowledge, level of effort while reading). Still, even under this complex model, if you expect each of those variables to decay with time, the expected value of the sum should still decay.
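A minimal simulation sketch of the "sum of nine partially-correlated Bernoulli variables" picture may make the last point concrete. Everything specific here is an assumption for illustration, not the model behind the data above: the logistic link, the per-reader `reader_skill` trait, the spread of question difficulties, and the exponential decay with a hypothetical `half_life_days`.

```python
import math
import random


def simulate_scores(n_readers, days, half_life_days, n_questions=9, seed=0):
    """Return each simulated reader's count of questions recalled after `days`.

    Assumptions (illustrative only): recall of each question is a Bernoulli
    draw whose probability is (chance the reader learned it) x (retention),
    where retention decays exponentially with the given half-life.
    """
    rng = random.Random(seed)
    retention = 0.5 ** (days / half_life_days)
    scores = []
    for _ in range(n_readers):
        # A shared per-reader trait (prior knowledge, effort while reading)
        # makes that reader's nine outcomes partially correlated.
        reader_skill = rng.gauss(0.0, 1.0)
        score = 0
        for q in range(n_questions):
            # Hypothetical spread of question difficulty around zero.
            difficulty = (q - (n_questions - 1) / 2) * 0.3
            p_learned = 1.0 / (1.0 + math.exp(-(reader_skill - difficulty)))
            if rng.random() < p_learned * retention:
                score += 1
        scores.append(score)
    return scores


five_day = simulate_scores(2000, days=5, half_life_days=60)
one_month = simulate_scores(2000, days=30, half_life_days=60)
print("mean recalled at 5 days: ", sum(five_day) / len(five_day))
print("mean recalled at 30 days:", sum(one_month) / len(one_month))
```

Because every per-question probability is scaled down by the same retention factor, the mean of the sum falls with the interval no matter how the questions are correlated within a reader. Correlation changes the shape and spread of the distribution, not the direction of the expected decay.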

The late 2020 variable-schedule experiment
As one way of exploring “how effortless can memory be made?”, I launched a simple experiment in December which manipulates a small subset of review intervals, on a per reader/question basis. Each time a question is reviewed, it has a 75% chance of being assigned its usual interval, and some smaller chance of being assigned the interval corresponding with preceding or following levels. I’d hoped to build some picture of how memory dynamics varied per-user and per-question across repetitions.
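The assignment rule could be sketched roughly as follows. This is not the production scheduler: the ladder below is a hypothetical one built from the interval values mentioned in this note, and splitting the remaining 25% uniformly across the other levels is an assumption, since the note only says those levels get "some smaller chance."

```python
import random

# Hypothetical ladder, assembled from the interval values named in this note.
INTERVAL_LADDER_DAYS = [3, 5, 14, 31, 62]


def assign_interval(usual_days, rng, p_usual=0.75):
    """Pick the next review interval for one reader/question pair.

    With probability `p_usual` the usual interval is kept; otherwise one of
    the other ladder levels is chosen uniformly (an assumed split).
    """
    if usual_days not in INTERVAL_LADDER_DAYS:
        raise ValueError(f"{usual_days} is not a level in the ladder")
    if rng.random() < p_usual:
        return usual_days
    alternatives = [d for d in INTERVAL_LADDER_DAYS if d != usual_days]
    return rng.choice(alternatives)


rng = random.Random(42)
draws = [assign_interval(5, rng) for _ in range(10_000)]
for level in INTERVAL_LADDER_DAYS:
    print(level, "days:", draws.count(level) / len(draws))
```

Randomizing per reader/question pair, rather than per reader, keeps the manipulation unobtrusive, at the cost that each reader's other questions still follow the usual schedule.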

We now have enough data to look at manipulations of the first repetition interval, which is usually five days, but which was manipulated to 3, 14, 31, or 62 days for small subsets of reader/question pairs.

Taken across all reader/question pairs, we see no real effect of interval size:

(I’ve compensated for survivorship effects by including only those reader/question histories which have lasted at least 31 days. There aren’t enough samples to do this for 62. Note also that questions which were assigned a 3-day interval mostly got reviewed at the same time as the 5-day-interval questions, because of our batching behavior, so that row should probably be ignored.)
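As a sketch of that survivorship filter, assuming a hypothetical record type (`History` and its field names are invented for illustration, and the sample records are made up, not experimental data):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class History:
    """One reader/question pair (hypothetical schema)."""
    assigned_interval_days: int   # manipulated first interval
    history_span_days: int        # days from first to most recent review
    first_review_correct: bool    # outcome at the manipulated review


def accuracy_by_interval(histories, min_span_days=31):
    """Accuracy per assigned interval, keeping only histories that lasted
    at least `min_span_days`, so every row draws on the same kind of
    persistent reader."""
    tallies = defaultdict(lambda: [0, 0])  # interval -> [correct, total]
    for h in histories:
        if h.history_span_days < min_span_days:
            continue  # drop pairs that never reached the cutoff
        tally = tallies[h.assigned_interval_days]
        tally[0] += int(h.first_review_correct)
        tally[1] += 1
    return {
        interval: correct / total
        for interval, (correct, total) in sorted(tallies.items())
    }


# Made-up records, purely to show the shape of the computation.
sample = [
    History(5, 40, True),
    History(5, 10, False),   # excluded: history shorter than 31 days
    History(5, 35, False),
    History(31, 45, True),
]
print(accuracy_by_interval(sample))
```

Without the filter, short intervals would include readers who quit early, while long intervals could only ever include readers who stuck around, which would bias the comparison.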

Such shallow forgetting curves don’t really correspond to my experience of learning difficult technical material. My experience is that if something felt unwieldy or challenging while reading, and I don’t exert some cognitive effort to use it (e.g. in an exercise or a thought experiment), then I might remember the detail the next day, but I’m very unlikely to remember it in a month. On the other hand, when something feels like a natural extension of familiar ideas, I often feel that SRS prompts are arriving too frequently, and I probably could set the initial interval to a month without suffering too much. But I don’t see this play out in the data.

We see similarly flat results if we group according to whether the question was correctly answered in-essay (yes, no, respectively):

Maybe it varies by question “difficulty”? Here I’ve grouped the data by the “easiest” / “hardest” half of questions (according to the number of lapses those questions cause in traces of at least a year):

Nope. Mostly flat across all these subgroups, with even a slight rise at 31 days, suggesting again that forgetting may be washed out by reviewing related material. For the most part we don’t have enough samples to really trust the 62-day measures.

Stepping back for a moment, the “accuracy” measures I’m reporting here have a worse version of the same aggregation problems I was describing for the previous experiment. The implicit model here is that we’re drawing from a single binomial distribution, and we can estimate its parameter with a frequency. That’s obviously not true, especially when cutting simultaneously across both readers and questions. Eventually we’ll have enough data from this experiment to at least slice by question, but we’re far from that point right now.
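A toy illustration of why a pooled frequency can mislead when the pool mixes questions with different base rates. The numbers are invented for the example and do not come from the experiment:

```python
def pooled_accuracy(cells):
    """Pooled accuracy over (n_attempts, per_question_accuracy) cells."""
    total_attempts = sum(n for n, _ in cells)
    total_correct = sum(n * accuracy for n, accuracy in cells)
    return total_correct / total_attempts


# Illustrative only: an "easy" question at 90% and a "hard" one at 50%,
# with identical per-question accuracy at both intervals.
short_interval = [(80, 0.9), (20, 0.5)]   # mostly easy-question samples
long_interval = [(20, 0.9), (80, 0.5)]    # mostly hard-question samples

print("short:", round(pooled_accuracy(short_interval), 2))
print("long: ", round(pooled_accuracy(long_interval), 2))
```

Here neither question's accuracy changes with the interval, yet the pooled figure drops from about 0.82 to about 0.58 purely because the mix of questions differs. The same mechanism could run the other way and hide a real decline, which is one reason slicing by question matters.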

Comparison to the literature
Maybe forgetting curves really are roughly flat over a one-month interval for this kind of material. A month isn’t that long. Does that contradict the literature?

Well, Ebbinghaus forgot his nonsense-syllable sequences completely in one day. That resonates with experience: they have high complexity and no intrinsic meaning. His experiments with stanzas of Byron’s “Don Juan” demonstrated substantial forgetting day-over-day, but he only records effort required for faultless reproduction, which is perhaps not a fair comparison: a stanza contains much more information than a Quantum Country prompt.

Spitzer’s 1939 experiment on 6th grade kids learning from short articles is a bit more similar to Quantum Country, and he found substantial week-over-week forgetting. But these are kids being asked to recall fairly arbitrary facts from articles they don’t care about.

This 2010 review from Custers of more authentic settings reports medical students forgetting 25-35% of their studied material over the span of a year. That’s quite shallow! We wouldn’t expect the forgetting to be uniformly distributed throughout the year, but maybe this would work out to a mere 5% forgotten over the course of a month—small enough that we might not measure it cleanly.
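As a rough check on that arithmetic, here is the monthly loss implied by a yearly figure if forgetting proceeded at a constant proportional rate. That constant-rate assumption is only a baseline for comparison, not something the review claims:

```python
def monthly_forgetting(yearly_forgotten, months=12):
    """Fraction forgotten in one month, assuming the same proportion of
    what remains is lost every month (a simplifying assumption)."""
    retained_per_year = 1.0 - yearly_forgotten
    retained_per_month = retained_per_year ** (1.0 / months)
    return 1.0 - retained_per_month


for yearly in (0.25, 0.35):
    print(f"{yearly:.0%} per year -> {monthly_forgetting(yearly):.1%} per month")
```

Under a constant rate, 25-35% per year works out to roughly 2 to 4 percent in a month. A figure like 5% for the first month is therefore consistent with forgetting that is somewhat front-loaded, and either way the monthly loss is small.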

I wonder how much of the emphasis on forgetting in SuperMemo, Anki, Leitner, etc. comes from their focus on language learning. Sure, you’ll rapidly forget vocabulary words. But when conceptual material is introduced in a coherent narrative, and that material connects to other material you already understand, the initial interval certainly does not need to be 1 day, as it is in Anki. Several weeks may be fine, in fact.

So what? Recasting review sessions
I’ll spare you all the details, but one of the other big themes of my analysis work around Quantum Country is “almost everything’s too easy.” Almost everyone remembers almost everything almost all the time. If we couple that finding with this one, it recasts the role of the review sessions for the mnemonic medium. Right now we present them mostly as a war on forgetting: sure, you knew this material when you read the essay, but you’ll forget, so we have to practice to produce more durable encodings.

That’s probably still true over longer time periods, but it’s probably not the right model for the first few weeks. Our data suggest that if you were able to remember an answer in-essay, you’ll almost certainly still remember it five days or a month later. If you couldn’t remember an answer in-essay, you probably won’t remember it next time we ask you, no matter when that is. The first few sessions are mostly about helping you learn the material in that latter bucket—not “reinforce your memory,” but really to learn it in the first place. If it’s learning-in-the-first-place we’re after, straight retrieval practice is a somewhat blunt tool for the job. This may explain why it often takes people several tries to dig themselves out of an initial lapse. It’s interesting to consider how review sessions might evolve given this framing.

Maybe there’s tons of forgetting going on, and I can’t see it because of related-review effects. To understand those effects, I should probably run an experiment in which new users are assigned different, but internally consistent schedules. This is more invasive, so I’ll need to consider the best way to approach this. I’m concerned that if we defer initial reviews to e.g. one month, we substantially change the vibe. And we may interfere with another useful function of review sessions: the emotional effect of staying connected with a text over time, and identifying as being actively engaged with it.