2021-03-25 Patreon letter - Too easy to be effortless

Private copy; not to be shared publicly; part of Patron letters on memory system experiments

Now that a few Orbit experiments are in flight, I’ve spent much of the last month digging back into data from Quantum Country. I’m struck by a surprising problem: basically everyone remembers basically everything, basically all the time.

Feelings-driven optimization
How effortless can memory be?

At the limit, we can imagine automatically remembering everything we perceive. We might not want that—savants like Shereshevsky often report curse-like symptoms of their perfect memory. Perhaps we’d settle for the ability to remember or forget something as easily as moving a muscle. What would be true of such a world? Certainly schools would not exist as we know them, but what of workplaces and studios? What of relationships? Borges, Chiang, the Wachowskis, and other great science fiction authors have dramatized these implications, but I’m also interested in the mundane: shifts in the give-and-take of workplace collaborations; coincidences and contradictions suddenly more salient.

(Of course, effortlessness is just one of many useful lenses! A contrary lens points out that maybe effortfulness is exactly what you want from your interactions with memory. You want to constantly be questioning things you think you “know”; you want everything to stay molten so that you can form new connections and see things in new ways; etc etc…)

Even with today’s systems, memory is far from effortless. How close can we get? The usual approach is to treat this as an optimization problem, but I find it generative to recognize that effortlessness is a feeling. Powerful technologies feel like an extension of the body. The edges melt away; the space between intention and action closes. Strap a brick to your pencil, though, and it ceases to feel like part of your hand. Likewise, learning can seem effortless in an energetic discussion with friends, but in a boring study hall, the same ideas may demand more effort than you can muster.

This lens gives us a different way to think about how we might “optimize” tools for thought. What kinds of interactions create a sense of separation, of dutifulness, of boredom?

In any kind of computerized learning system (including spaced repetition systems), one reliable source of boredom is material which feels too easy. This material isn’t the good kind of effortless. Flipping through this stuff feels almost like speed-running a license agreement prompt in a software installer. “Yeah, yeah, I know, I know.” I don’t really have to think; I’m not really engaged; I resent being asked. Sometimes the problem is that I don’t actually care about the material, in which case I should really remove it (perhaps fuzzily). Quite often, though, I do really care about the material. I’d engage more seriously if it felt less trivial in that moment.

This observation devolves into a classic problem in learning technology: correctly estimating the state of a student’s knowledge to optimize a study plan. The difference is that if we hold onto our feelings-based lens, we don’t see optimization itself as the problem to be solved. Our central goal is a feeling of effortlessness. Model optimization is an instrumental lever for that feeling. But there are other levers. You can’t play The Witness without memorizing many complex rules, but you’ll do that naturally as you interact with the environment: memorization itself is not the effortful part.

Quantum Country’s over-easy effortfulness
Having paid this lofty penance, let’s turn our attention to the performance of an unusual memory augmentation system: Quantum Country.

Please note: this is an informal discussion of data from Quantum Country. The analysis is preliminary and shouldn’t be cited or excerpted in other work. I’m working with the garage door up here.

On the one hand, Quantum Country delivers on its promise to help people remember what they read. After the fifth repetition, most readers have been able to recall 95%+ of questions across intervals of more than a month. That’s pretty remarkable. In my past experiences reading textbooks, I’d be lucky to remember a fraction of the details after a month.

Another way to look at this is “maintenance cost.” To maintain the first essay’s 112 questions for the first year, the median reader performs 567 reviews, consuming ~1.5 hours. Readers report that the first essay takes 2-4 hours to read, so we can frame the first year’s reviews as a ~50% extra time cost these readers could choose to pay to durably remember all the key details from that essay. I expect the second year to have roughly half the time cost, but we don’t have the data for that yet.
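The back-of-envelope arithmetic behind that framing can be checked directly. All the inputs below are figures quoted above; the per-review time is derived from them, not separately measured, and I use the midpoint of the reported 2-4 hour reading time:

```python
# Sanity check on the "maintenance cost" framing, using figures from the letter.
reviews_first_year = 567   # median reader, first essay, first year
review_hours = 1.5         # total review time over that year

seconds_per_review = review_hours * 3600 / reviews_first_year
print(f"~{seconds_per_review:.0f} s per review")  # ~10 s each

reading_hours = 3.0  # midpoint of the reported 2-4 hour reading time
print(f"{review_hours / reading_hours:.0%} extra time cost")  # 50%
```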

The problem, I suppose, is that Quantum Country works “too well.” Basically everybody remembers basically everything basically all the time.

The trouble we’ll discuss begins at the start of what I call the “maintenance” phase. For a given reader and question, histories generally cluster into two phases: an initial (usually short) “learning” phase, in which readers absorb the material enough to remember it across sessions; followed by a (much longer) “maintenance” phase, in which repetitions mostly serve to combat the erosion of forgetting. You can approximate the delineation pretty well by saying that people transition to the maintenance phase after their first successful repetition.

After the first successful repetition of a given question—once they’re in the “maintenance phase”—the median reader answers 95% of subsequent repetitions correctly. In fact, 82% of all first-year question histories contain zero forgotten answers after that point (which is indeed what you’d expect from the typical first-year repetition count given a binomial variable with p=0.95).
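The parenthetical check works out: 448 maintenance reviews spread across 112 questions is four per question per year, and treating each as an independent recall with p = 0.95 gives roughly the observed zero-lapse rate:

```python
# Zero-lapse probability for one question under a simple binomial model:
# ~4 maintenance reviews per question per year (448 reviews / 112 questions),
# each recalled independently with probability 0.95.
p_recall = 0.95
reviews_per_question = 448 // 112  # = 4

p_zero_lapses = p_recall ** reviews_per_question
print(f"{p_zero_lapses:.0%}")  # ~81%, close to the observed 82% of histories
```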

That’s a bit abstract. To make it more concrete: after their first successful repetition, the median reader forgets just 15 times out of 448 reviews over the following year, across the 112 questions in the first essay.

A whole year of diligent reviewing and just 15 misses! 433 successful recollections! The problem here isn’t exactly one of efficiency. Talking to readers, plenty of them would be (and have been) happy to pay a 50% time cost to thoroughly internalize the material. It’s not that 448 is too many reviews, or that it takes too long. The problem is that it feels tedious, like wasted time, to review material that you already know perfectly well. And that’s mostly what people are doing.

But actually, the forgetting is even more skewed than I’ve let on. If those 15 misses were drawn with equal probability from all the questions, it might not feel so bad: any question might be the one you miss today! As it happens, though, half of all long-term lapses come from just 12% of questions. Emotionally speaking, those are the questions which generate “oh, no, that question again…”. By contrast, the median question produces only one lapse for every ten readers across the entire first year of the “maintenance phase.” For 95% of questions, the median reader never forgets in the first year of the maintenance phase. So most reviews probably feel tedious and unnecessary.

We might worry that perhaps everything’s fine for the median reader, but many less-capable readers are struggling. After all, questions are highly power-law distributed in the lapses they produce. But readers are not nearly so sharply distributed. Our 25th percentile reader forgets 35 times in 483 repetitions over the first year of maintenance. The 10th percentile reader forgets 59 times in 516 repetitions. And again, this forgetting is localized in a relatively small pool of questions. The vast majority of questions produce no forgetting, even for relatively less successful readers.

When forgetting does happen, it’s usually not that bad. One way to look at this is to ask how often questions are forgotten multiple times back to back, so that the reader fails to recall a prompt across an interval they could previously span. This happens almost never: on about 2% of first-year reader/question histories. So our “demonstrated retention” progress metric is a pretty good one. Once you’ve demonstrated a given interval of retention, you’re very unlikely to lose it if you keep reviewing. And if a lapse does occur, it has only a 7% chance of “backsliding” to the point that a reader can no longer span five days. As a reminder, Quantum Country roughly halves the review interval when a question is forgotten. Anki’s default behavior of resetting the interval to zero upon every lapse seems particularly inappropriate in our context given this data.
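To make the contrast between the two lapse policies concrete, here is a minimal sketch; the halving rule is Quantum Country’s (roughly, as described above), the reset is Anki’s default behavior, and the function names are mine:

```python
def quantum_country_lapse(interval_days: float) -> float:
    """Quantum Country (roughly): halve the current interval on a lapse."""
    return interval_days / 2

def anki_default_lapse(interval_days: float) -> float:
    """Anki's default: the interval resets to zero on a lapse."""
    return 0.0

# A reader who has demonstrated ~40 days of retention, then forgets once:
print(quantum_country_lapse(40.0))  # 20.0 -- most demonstrated progress survives
print(anki_default_lapse(40.0))     # 0.0 -- the schedule starts over
```

Given that only 7% of lapses “backslide” past five days, the halving rule discards far less real progress than a full reset would.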

The implication here is that we should probably be much more aggressive with our expanding review schedule. Yes, this would make the experience more efficient; but what I really care about is that it would probably make the experience feel much less tedious.

What should the schedule be, exactly? Many papers suggest dynamic and complex models for these schedules, and perhaps I’ll implement one at some point. An ideal schedule would weigh tedium-avoidance with other important feeling-variables: connectedness to the material, the frustration of forgetting the same thing repeatedly, predictability of session timing. In terms of low-hanging fruit, it’s amazing how far simple heuristics could go. For instance, when readers begin by successfully answering a question both while reading the essay and in their first review session, 96% of those histories include zero lapses in the next year. It’s probably safe to stretch them out a great deal.
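A heuristic like that last one could be very small indeed. Here’s a sketch, assuming the 5-day starting interval Quantum Country uses; the 4x stretch factor and the function name are illustrative guesses, not anything we’ve shipped or validated:

```python
def initial_maintenance_interval(in_essay_correct: bool,
                                 first_session_correct: bool) -> float:
    """Pick a starting maintenance interval from the two earliest signals.

    Readers who answer a question correctly both while reading the essay and
    in their first review session almost never lapse (96% of such histories
    contain zero lapses in the next year), so they can likely be stretched
    much further. The 5-day base is Quantum Country's current starting
    interval; the 4x stretch is an illustrative guess.
    """
    base_days = 5.0
    if in_essay_correct and first_session_correct:
        return base_days * 4  # illustrative: probably safe to stretch a great deal
    return base_days
```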

Just by focusing on too-easy questions, it’s pretty easy to imagine halving the repetitions necessary for the first year of maintenance, or perhaps better. Halving those 448 maintenance reviews would cut the first year’s total from 567 reviews down to 343, a 40% reduction, which would bring the ~50% extra time cost down to roughly 30%.
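The arithmetic behind that estimate, for the skeptical (only the maintenance-phase reviews are halved; the implied 119 learning-phase reviews stay):

```python
total_reviews = 567        # median reader's first-year reviews (first essay)
maintenance_reviews = 448  # reviews after the first successful repetition
learning_reviews = total_reviews - maintenance_reviews  # 119

# Halve only the maintenance-phase reviews:
halved_total = learning_reviews + maintenance_reviews // 2  # 119 + 224 = 343
reduction = 1 - halved_total / total_reviews
print(halved_total, f"{reduction:.0%}")  # 343 40%

# The ~50% extra time cost scales down with the review count:
print(f"{0.5 * halved_total / total_reviews:.0%} extra time cost")  # ~30%
```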

The data I’ve presented don’t have much to say about the counterfactual. If the intervals had been twice what they are, would we see only a bit more forgetting, or would we see bedlam? I’ve been running controlled experiments along these lines, and they’ve been producing very interesting and confusing results… which will have to wait for another time.

Scheduling for the mnemonic medium versus existing SRS modalities
Almost all work around spaced repetition systems—both academic and commercial—has focused on definitions: vocabulary for language learners, terminology for medical students, people and events for history classes, etc. This kind of knowledge tends to be arbitrary and disconnected, and so I suspect it’s forgotten much more rapidly.

Quantum Country’s schedule is pretty aggressive. We start at a five-day interval and grow by 2-3x on each repetition. By default, Anki starts at a one-day interval and grows by 1.8x. And yet we’re still seeing very little forgetting. I don’t think the problem is that Anki’s wildly conservative: I think it’s that conceptual knowledge, introduced in a narrative arc and thoroughly connected to prior knowledge, has very different memory dynamics from vocabulary words. Scheduling for the mnemonic medium should probably look quite different from scheduling for traditional spaced repetition systems.
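For a feel of how differently these two schedules expand, here’s a sketch that generates each one’s successive intervals; I use 2x as the conservative end of Quantum Country’s 2-3x growth range:

```python
def schedule(start_days: float, growth: float, repetitions: int) -> list[float]:
    """Successive intervals for a simple expanding review schedule."""
    intervals, interval = [], start_days
    for _ in range(repetitions):
        intervals.append(round(interval, 1))
        interval *= growth
    return intervals

# Quantum Country: 5-day start, 2-3x growth per repetition (2x shown).
print(schedule(5.0, 2.0, 5))  # [5.0, 10.0, 20.0, 40.0, 80.0]
# Anki's defaults: 1-day start, ~1.8x growth.
print(schedule(1.0, 1.8, 5))  # [1.0, 1.8, 3.2, 5.8, 10.5]
```

Five repetitions in, Quantum Country’s readers are spanning nearly three months while Anki’s users are still inside two weeks, and even so we observe very little forgetting.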

SuperMemo models something like the effect I’m describing with “item complexity,” but because each user makes their own databases, it must estimate each item’s complexity from just a few point samples. The mnemonic medium’s shared questions create an interesting opportunity: item complexities can be estimated by pooling many prior users’ attempts, and a new user’s pre-existing proficiency with the material can be estimated by comparing their in-essay performance to that of prior students. This type of approach has been used in a model for scheduling Spanish vocabulary practice, and I’m interested to explore how it might fare on more conceptual topics. One distinguishing challenge for mnemonic essays (unlike vocabulary lists) is that the questions are highly interdependent. Reviewing one question makes readers more likely to be able to answer various other related questions. So I’ll probably need to mix a model like the one I’ve described with something like deep knowledge tracing, which can account for inter-item interactions.

I’m not yet sure how deep I want to go on such optimization. There are so many opportunities to explore in this space, and my hours are so few! In fact, there are many simple levers for a feeling of effortlessness which don’t involve actually reducing the number of repetitions. For example, Quantum Country readers felt reviews were much less burdensome when we “batched” them so that small review sessions on adjacent days were combined into a single full-length session.

In a future post, I’ll explore how multiple experiments are struggling to measure any appreciable forgetting-over-time at all on Quantum Country. Until next time, thank you as always for your support.

Last updated 2023-07-13.