2023-09-30 Patreon letter - Highlight-driven practice and comprehension support

Private copy; not to be shared publicly; part of Patron letters on memory system experiments

I’ve been wrestling with a new insight this summer: when people struggle to recall and use what they’ve read after a few months, it’s often because they didn’t really understand it in the first place. The lapse feels like forgetting, and readers often can’t tell the difference between forgetting and never having understood. I’ve argued that “books don’t work” because people seem to rapidly forget almost all of what they read. But when those supposed memory failures are actually poor initial comprehension in disguise, memory augmentation probably isn’t the right solution.

So I’ve been exploring how reading environments might directly support comprehension, and learning what’s known about expert reading practice and existing interventions. None of the directions I’ve prototyped have seemed promising. The central problem seems to be obtrusiveness: systems in prior research and in my own experiments intrude too much on the reading experience. These systems all try to offer feedback and support by determining what you’ve comprehended and what you haven’t. It’s tough to do that without demanding a whole lot of burdensome interaction.

Somewhat frustrated a few weeks ago, I stopped to reevaluate. Fine, so people have comprehension gaps; but ongoing practice is still quite helpful, right? How, specifically, have I observed practice fail in the face of comprehension gaps? What if I reframe around those problems, rather than treating reading comprehension as the end itself?

These are the (hypothesized) problems that set me on this path in the first place:

  1. Retrieval practice of poorly-comprehended conceptual material usually doesn’t actually work; you can parrot but can’t use the knowledge.
  2. When retrieval practice feels dogmatic—like “guessing the teacher’s password”—that’s often because of comprehension gaps. This is especially true when someone else writes the prompts.
  3. Retrieval practice (and problem-solving practice) are unpleasant and indirect ways to diagnose and fix comprehension gaps.

Very naively, one way to deal with these problems is to ensure that people only practice material which they comprehend. So: how might we do that? I found myself tying together some ideas I’ve described over the past few years. To my surprise, that path led me to a more promising opportunity for reading comprehension support.

Concept overview

Here’s the high-level design:

  1. As you read a text, you have a magic highlighter. You can use it to mark anything important, anything you want to make sure you understand and remember. Maybe you can jot a few extra words to clarify what specifically interests you.
  2. Future practice sessions will include tasks which reinforce and elaborate the ideas you highlighted.
  3. When you finish reading a section, you can press a button to highlight other important details which you didn’t mark (in a different color, say). These “extra” (“suggested”? “shadow”?) highlights let you quickly check whether you skimmed over something you might value.

The main design insight is that a highlight interaction can serve both as a way for the reader to choose what to practice and also as a (weak) indication of comprehension. That same highlight primitive can then be repurposed to draw attention—in a very lightweight way—to important details which the reader might have unknowingly missed.

Conceptual elements I like about this design

Subverting a natural (but ineffective) interaction. People naturally (and sometimes compulsively) highlight. It’s a favorite study practice. This makes sense: it feels good to point at things you feel are important. It’s an outlet for interest, a mark of your efforts, vibrantly reflected back on the page. And it’s very undemanding. The trouble, of course, is that highlighted material isn’t actually better understood or remembered in controlled studies, although students believe it is. It would be awfully nice if we could somehow “rescue” highlighting—could make it actually have the effect which we wish it had. In my proposed design, the act of highlighting would be as ineffective as it usually is; what’s different is that those highlights would trigger later practice (which we know can be quite effective) and comprehension feedback (efficacy to be determined).

Keeping the locus of control with the reader. This design continues my 2022 efforts to give readers control over what they practice. But it extends that goal to the comprehension support interaction. (I haven’t seen this elsewhere in the research literature; a typical intervention asks students to explain every sentence aloud after they read it.) If the first half of a section is totally familiar, a reader can simply skip it. After they reveal the “extra” highlights, a reader can scroll right by anything in that first half—no interaction necessary. Then, just by scrolling and looking, they can pay attention to details they neglected in the second half, where the material felt unfamiliar.

Non-throwaway comprehension support interactions. Last month I described a reading comprehension support system which works by asking readers to explain the text to themselves as they read. This sort of thing often feels unpleasantly hygienic. I think that’s in large part because these self-explanations have no enduring value. They’re throwaway work. I’m just writing them to make sure that I understand, in this moment. But the interaction feels like too much cost for too little benefit. I feel like I’m understanding just fine without all the ceremony—in part because I, like most people, underestimate how often I have comprehension gaps. Every focused reading comprehension intervention I’ve seen has the same “throwaway work” problem. By contrast, in the proposed design, when you use the magic highlighter, you’re teeing up future practice which will ensure that you understand and remember that detail. The interaction isn’t thrown away; it has enduring meaning and weight.

An idea-centric memory system. In my 2022 mnemonic medium designs, prompts are presented alongside the text; readers can choose which prompts they’d like to add to their collection. I learned from my user research that people don’t naturally think in terms of evaluating prompts; they react to ideas in the text—“Ooh, I’d better make sure I remember this!”—then look at the adjacent prompts. The prompt-saving interaction was an awkwardly indirect way of capturing that reaction. My impression is that what people really wanted was to be able to point at the idea they found important; the prompts are mostly just implementation details. The proposed design moves us towards such an idea-centric practice system, which I believe may have other benefits, like promoting fluid understanding through variation and escalation.

Smooth on-ramps to obligation. In my late 2022 user research, I observed an interesting tension: readers often weren’t initially sure how much they cared about a detail. They could see that it was important. But did they want to sign up for ongoing practice? It wasn’t clear—they had to read a little further, to get a sense of how that detail fit into the whole. Many readers asked for a highlighter; when I dug in, this uncertainty was often behind their request. People wanted to mark details as tentatively important, then to come back and “upgrade” those details by saving the adjacent prompts later, if it seemed appropriate. This makes sense! I often do something similar in my own memory practice. I’ll read through a section, highlighting what seems important. Then I’ll make a second pass, guided by my highlights, to write prompts for whichever details seem to deserve it. My proposed design lends itself naturally to these smoothly escalating interactions. You can have an ordinary yellow highlighter to mark details which seem tentatively important, and a purple “magic” highlighter to mark details you want to make sure get reinforced. Highlights can be “swapped” to the other color with a click. Readers would have a smooth slope from “mark as important” to “mark as to-be-reinforced.”

Conceptual challenges for this design

Highlights don’t encourage deep processing. Effective readers are demanding. They interrogate a text, interpret it, elaborate it, and connect it to prior knowledge. People can—and typically do—highlight without much of that happening. It’s easy to highlight text without even processing what it says. All this means that my system’s “comprehension support” is setting a very low bar. But if the goal is to resolve my three motivating problems, I think it’ll help a great deal. You’ll be much less likely to be given prompts about ideas you completely missed. And the prompts can be constructed to induce the elaboration and interpretation which might not yet have occurred.

Density and ambiguity. Prompt-writing has given me a great appreciation for just how many separate details can be conveyed in a single sentence. If the reader highlights a key sentence, they could be interested in many different details—or all of them. Also, they might have comprehended only half those details. (I had this happen to me in testing.) I’ve found that it helps to make “minimal” highlights—i.e. to highlight a key adjective if that’s what you’re interested in, perhaps alongside other separate small highlights in the same sentence. It also helps in these cases to jot a few words about your specific interest.

Trees over forest. A highlight-centric interaction emphasizes locality and detail. But I usually want practice to include synthesis, too. Often the best questions are about getting to the heart of some idea, finding a one-sentence way of expressing it when you look from the right angle. Sometimes I want my practice to be about summarizing a long exposition.

Novices can’t reliably judge what matters. One advantage of the original mnemonic medium design is that a domain expert tells you exactly what you need to know. In the proposed design, we shift the locus of control quite decisively to the reader; an expert merely provides “hints”. For a reader who really does want to be authoritatively led, this new design has much more friction. The deeper problem is that readers often aren’t in a good position to judge what matters most in a text. Is the “extra highlights” interaction enough to mitigate that problem?

An initial test

I took the concept for a scrappy initial test drive, with Wizard-of-Oz help from my friend Elliott Jin (a computer science instructor at Bradfield). Continuing last month’s studies, I read section One.III.1 of Jim Hefferon’s Linear Algebra, highlighting the details I wanted reinforced as I went. This material was already familiar to Elliott; he separately and carefully marked all the important details in the section. Once I’d finished reading, he manually compared my highlights to his and marked my copy with any ideas I’d skipped, so that I could review those extra highlights as proposed in the design.

First, and most crucially, the interaction helped me notice three important ideas which I had completely ignored. My eyes had simply slid right past them on the page. That’s a promising validation of the notion that this kind of “extra highlights” interaction can surface comprehension gaps.

The exercise also demonstrated that highlighting doesn’t necessarily imply comprehension. In one instance, I had highlighted a definition but had totally ignored a few key words. This turned out to be fine. Subsequent practice quickly revealed that I’d missed those words; and because I’d highlighted the definition, I’d indicated that I wanted to know them—so I was grateful to the practice for revealing the gap.

In another instance, I ignored an “extra” highlight because I thought it was subsumed in something else I highlighted. That judgment turned out to be wrong! Subsequent practice of some downstream ideas revealed the misconception. It turned out to be fairly easy to diagnose in this instance, but that wouldn’t be true in general.

Just as important, the interaction felt great. I already like to highlight as I read; this felt like it was working with my natural behavior and making it more powerful, rather than distorting my reading practices. It feels subtly rewarding to “color in” the text, and even better to make those markings have real meaning, both in terms of the comprehension check and in terms of subsequent practice. Scrolling through the “extra highlights”, I felt interested in checking them out but not inappropriately compelled to engage. Elliott had highlighted some details I’d skipped because they were familiar or didn’t seem interesting; it was easy to scan past those.

Alongside the highlights, Elliott found himself wanting to mark “lowlights”—details which might be worth attention, but which seem more incidental. Perhaps highlights could display some mark of their importance, e.g. with color intensity? If a reader could mark something as lower priority, we could then arrange to show them relatively fewer tasks about that detail. Alternatively, these levels could act as a kind of feedback for readers that they’re mostly highlighting relatively unimportant details, and not the central ideas.

The next section of the book (One.III.2) posed some interesting difficulties. This section revolves around proofs of some important claims made earlier in the chapter. Along the way, some useful new properties and procedural strategies emerge. Those incidental details could be handled using the same highlighting interaction, but it was much less clear what to do with the proof material. I think that’s in part because I don’t have naturalistic highlighting habits when reading proofs, whereas my automatic highlighting behavior in explanatory prose aligns quite well with indicating what should be reinforced. Learning from others’ proofs seems to demand different patterns; unfortunately, I lack a rich theory of knowledge and learning here.

One final hesitation: how important are my comprehension gaps, really? This prototype revealed a few meaningful details I’d skipped. But as it turned out, those holes were diagnosed by the section’s problem set without much fuss. If I hadn’t had this fancy augmented reading environment, I would have been fine. But I worked with a student earlier this year who seemed quite blocked by reading comprehension problems. And in previous sections of Hefferon’s textbook, I had comprehension gaps which left me simply confused during the problems. Even worse, such gaps might not even be noticed: auditing the problem sets for the material they cover, I notice that they’re focused on applied problem solving, and they often fail to reinforce conceptual details discussed in the text—or to reveal gaps in comprehension of those details.

My rough impression is that conceptual gaps are more likely to be ignored or poorly diagnosed by problem-solving practice than factual or procedural gaps. Confusion also seems to arise when knowledge is only covered by problems which involve some transfer. So, in principle, maybe we could just construct problems which ramp up smoothly enough to effectively identify comprehension gaps. But I notice that I don’t like answering such basic questions when my comprehension is actually fine. They feel boring and burdensome. Maybe the proposed design’s lightweight comprehension support is a reasonable compromise.

Evaluating with Quantum Country

Another way we can judge the proposed new design is to ask: what would Quantum Country have been like to read this way?

A first question we might ask: how many highlights would a reader need to make to “collect” every prompt? I mapped the 112 prompts in the first essay (QCVC) to representative highlight ranges and found that 78 highlights would cover all the prompts. For ~25,000 words, that doesn’t seem so unreasonable: it’s about one highlight every 320 words, or roughly one per screenful of text on my display. (Though obviously the density of the text’s ideas varies considerably.)

This exercise revealed that there are plenty of details in QCVC which seemed important and non-obvious to me, but for which we didn’t include questions. That’s a limitation of the original mnemonic medium design: because every user would receive every question (and in fact would receive every question immediately—we didn’t introduce them over time), we had to be somewhat conservative with prompts. We didn’t want to overwhelm people. As a result, a given person is probably asked to practice some details which they didn’t find meaningful (e.g. a “boat” metaphor for computational range) but not given practice for other details which they found important.

Most highlights (57 of the 78) mapped onto a single prompt. 16 mapped onto 2 prompts, 3 onto 3, and 1 each onto 4 and 5. Most of the one-to-many instances are places where we used several prompts to encode an idea from multiple angles, through multiple examples, or with emphasis on different aspects. Auditing all these grouped prompts, I feel that at least 80% would be better off practiced in separate sessions. The prompts are mutually reinforcing; practicing one will generally diminish retrieval demand for another. Also, such theme-and-variation prompts are especially apt to feel boring and dogmatic when presented in rapid succession.

I feel this distribution of prompt counts also illustrates a limitation of the original mnemonic medium: most of those 57 “solo” details would have benefited from reinforcement from multiple angles. But again, we had to be conservative because all prompts were presented en masse to all users.

3 prompts had no direct source in the text; they ask the reader to draw an inference based on one or more details. These are a problem for my highlight interaction! One fix might be to assign these “synthesis / inference” prompts if a user’s highlights include the “inputs” to the expected inference. These sorts of prompts seem especially valuable to me, since they force the reader to go beyond the text. At the same time, because the whole point of these prompts is that you’re not retrieving the answer from memory, you’d probably want them to vary each time, demanding a new inference involving those ideas.
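
To illustrate what I mean by that rule, here’s a tiny sketch. The data structures (highlight IDs, a set of “input” IDs attached to each synthesis prompt) are invented purely for illustration: a synthesis prompt becomes eligible only once the reader’s highlights cover all the details it depends on.

```python
def eligible_synthesis_prompts(user_highlight_ids, synthesis_prompts):
    """Offer a synthesis/inference prompt only once the reader has highlighted
    all of the details ("inputs") which the expected inference draws upon."""
    covered = set(user_highlight_ids)
    return [prompt for prompt in synthesis_prompts
            if set(prompt["input_highlight_ids"]) <= covered]
```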

6 prompts are actually about statements made in the problem text of optional exercises. These are a bit tricky. One way to look at these prompts is that even if you don’t do the exercises which ask you to show why these statements are true, you should still learn that the statements are true. If we take this perspective, the highlighting interaction would probably want the statements to be made “in the main text”, so that all readers encounter them. Another way to look at these prompts is that if you do an exercise proving some result, you probably want to remember that result. From this perspective, the highlighting interaction is probably fine, and in fact better than Quantum Country’s one-size-fits-all model.

This exercise also helped me see the catch quite clearly: as we’ve discussed, readers are not always the best judge of what’s important. The proposed design would result in people missing important details, relative to Quantum Country’s design. That’s a price we’d be paying to permit a more fluid and reader-centric experience. Of course, I don’t yet understand the true cost or the true perceived benefit. That will require more user research.

Interaction cost

More than two years ago, when I was just starting to dig into tensions around reader control in the mnemonic medium, I observed that if QCVC contains 112 prompts, a reader wouldn’t want to make 112 decisions about which prompts to save, or even to click “save this prompt” 112 times in the interface! That motivated the introduction of the “bulk” prompt interaction in last year’s prototypes.

And yet I notice that I don’t feel much concern about requiring a reader to make 78 highlights. 78 still seems like a lot of interactions. Why do I feel so differently?

One factor is that highlighting is a natural behavior for many readers. It feels like part of reading the text, not a separate decision or interaction. Spatially, it’s happening within the text, not on a separate interface surface.

It’s also important that readers wouldn’t be required to evaluate prompts. Choosing which of 112 prompts to save is much more burdensome: you’d have to read and consider all that text. But in the proposed design, you’re not deciding “which prompts to save”; you’re emphasizing a subset of the text you’ve already read. The “extra highlights” view will offer a lightweight way to quickly add anything important that you might have missed, and even this interaction will be less demanding than evaluating prompts, since you’re evaluating the main text, much or all of which you’ve already read.

Some of QCVC’s prompts are much less important than others. In Quantum Country and in last year’s mnemonic medium designs, all prompts had the same status, so a user would have to evaluate all 112 prompts on equal footing. But in the proposed design, a user might naturalistically not highlight some text representing a less important detail, and that’s no big deal. The text doesn’t impose a cost. And the cost of evaluating an “extra highlight”, while low, could be further mitigated by a visual indication of importance.

Implementation details and challenges

So far, I’ve focused on the interaction design and ignored how it would actually work. I think that’s the right emphasis, but I’ll briefly discuss implementation insofar as it bears on my next steps for the design.

We can factor this design’s implementation into three core problems (I’ll sketch them in code just after the list):

  1. Text to curated highlights: Given a text, what are the most important details to understand, and what highlights would draw one’s attention most appropriately to those details?
  2. Highlights to tasks: Given a set of highlights-in-context (and, potentially, emphasis remarks), construct a set of practice tasks.
  3. Semantic highlight diff: Given a set of user highlights-in-context, determine which of the curated highlights’ conceptual matter is not “covered” by them.
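
To make that factoring a bit more concrete, here’s a rough interface sketch in Python. Everything here (the names, the fields, the type shapes) is a hypothetical placeholder of my own, not a description of an actual implementation:

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass
class Highlight:
    text: str                     # the highlighted span itself
    context: str                  # surrounding passage, for interpretation
    remark: Optional[str] = None  # optional "emphasis remark" from the reader

@dataclass
class Task:
    prompt: str        # e.g. a retrieval practice question
    answer: str
    source: Highlight  # the highlight which motivated this task

class HighlightDrivenSupport(Protocol):
    # Problem 1: which highlights would best draw attention to the text's most important details?
    def curated_highlights(self, text: str) -> list[Highlight]: ...

    # Problem 2: construct practice tasks from the reader's highlights.
    def tasks_for(self, highlights: list[Highlight]) -> list[Task]: ...

    # Problem 3: which curated highlights aren't "covered" by the reader's?
    def highlight_diff(self, user: list[Highlight],
                       curated: list[Highlight]) -> list[Highlight]: ...
```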

Of course, there are other ways to factor this problem. We could map highlights onto a knowledge graph, and a knowledge graph onto tasks, to capture connections and dependencies. Rather than expressing and comparing to an “ideal” set of highlights, we could try to find a set of highlights to recommend based on the reader’s apparent interest and level of detail. We could allow readers to express why they’re reading the text—their goals and questions—and steer highlights appropriately. But I’ll set these elaborations aside for now.

Let’s start with a simplified implementation model which involves no cutting-edge machine learning; a rough code sketch follows the list.

  1. Text to curated highlights: Paralleling the original mnemonic medium, an expert constructs an “ideal” set of practice tasks and maps them (many-to-many) to an “ideal” set of highlights.
  2. Highlights to tasks: Given a user’s highlights, we use traditional NLP tools like latent semantic analysis to identify “semantically matching” expert highlights. Readers are given the corresponding tasks from the expert’s map.
  3. Semantic highlight diff: Compute the set difference between the expert’s “curated highlights” and the ones identified in (2).
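
As a concrete illustration of steps 2 and 3 in this simple model, here’s a minimal sketch assuming scikit-learn and plain-string highlights. The similarity threshold and the number of LSA dimensions are arbitrary placeholders, and a real version would operate on highlights-in-context rather than bare strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def match_highlights(user_highlights, expert_highlights, threshold=0.5):
    """For each user highlight, return the index of the most similar expert
    highlight, or None if nothing matches well enough (step 2's matching)."""
    corpus = expert_highlights + user_highlights
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    # Project into a low-dimensional "latent semantic" space.
    n_components = min(100, tfidf.shape[1] - 1)
    vectors = TruncatedSVD(n_components=n_components).fit_transform(tfidf)
    expert_vectors = vectors[:len(expert_highlights)]
    user_vectors = vectors[len(expert_highlights):]
    similarities = cosine_similarity(user_vectors, expert_vectors)
    return [int(row.argmax()) if row.max() >= threshold else None
            for row in similarities]

def uncovered_highlights(expert_highlights, matches):
    """Step 3: the curated highlights with no semantically matching user highlight."""
    covered = {match for match in matches if match is not None}
    return [h for i, h in enumerate(expert_highlights) if i not in covered]
```

Readers would then receive the tasks the expert mapped to each matched highlight, and the uncovered curated highlights would be rendered as the “extra” highlights.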

Apart from the heavy demand on expert labor, the main drawback of this model is that if the reader highlights a detail which the expert didn’t emphasize, there would be no tasks to reinforce it. Likewise, a reader couldn’t expect the system to create practice tasks around an original observation, or to heed any notes they make about their specific interest in a highlight. But a model like this would let me explore and refine the interaction design without extensive generative AI side quests.

Of course, I don’t think I would have come up with this design idea in the first place were it not for the astonishing recent progress in large language models. The freeform nature of the highlight interaction cries out for the open-ended interpretation that is these models’ hallmark. And an idea-centric practice system requires high-quality task generation machinery. By shifting (or expanding) the system’s primitive from prompts to ideas-in-context, we would make it much easier for users to add their own ideas to their practice. The highlighting interaction can apply to your own notes. If you read a text and notice some important limitation in the author’s argument, you can jot a sentence about that and highlight your own words. Likewise, if you’re writing in your journal about a striking comment from a friend, you could simply highlight that remark to ensure you’d grapple with it in future sessions.

Back to reality. I’ve run many experiments using GPT-4 to perform all three of these tasks. My coarse impression so far has been: these systems are amazing, and I’m able to get remarkably far; I’ve not yet managed to make their output quite as good as it needs to be; but I expect they’ll get there with some combination of determined prompt engineering, fine-tuning, or patience for next year’s model. To give a flavor, I’ve included a rough sketch of one of these calls after the list below.

  1. Text to curated highlights: A surprisingly good start. Usually includes 10-20% unimportant details (even when I ask the model to include an importance rating), and omits a handful of important elements. The endpoints of the highlight are often not quite in the best spot.
  2. Highlights to tasks: The most difficult of the tasks. I believe that much of this will come down to articulating a philosophy of instructional design, or a “pattern language” for review. The issue generally isn’t the model’s “intelligence”; it’s that you can’t describe the sorts of tasks you want (and don’t want) clearly enough. Still, for basic retrieval practice tasks, I can get usable results somewhat more than half the time. (more notes)
  3. Semantic highlight diff: Surprisingly difficult for the model. It particularly struggles with user highlights which aren’t in the set of “curated” highlights—it wants to make spurious mappings.
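
As promised, here’s roughly the shape of the calls I’ve been making for the second task, using the openai Python package. The function name is mine, and the prompt text below is a heavily condensed stand-in for my actual (much longer) instructions, so treat it as illustrative rather than as a working recipe:

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def draft_tasks(highlight, context, remark=None):
    """Ask GPT-4 to draft retrieval practice tasks for one highlight-in-context."""
    system = (
        "You write retrieval practice questions for a reader. Each question "
        "should target the highlighted detail, be answerable from memory in a "
        "sentence or two, and require interpretation rather than pattern-matching."
    )
    user = f"Passage:\n{context}\n\nHighlighted detail:\n{highlight}"
    if remark:
        user += f"\n\nReader's note about their interest:\n{remark}"

    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0.7,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response["choices"][0]["message"]["content"]
```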

Next steps

Happily, I don’t need to solve those open-ended technical problems to evaluate and improve upon the core design idea here. I plan to conduct a round of Wizard-of-Oz user testing:

  1. Text to curated highlights: constructed by me, as described in the simple model above.
  2. Highlights to tasks: as described in the simple model above, but I’ll match reader highlights to my curated highlights by hand, rather than using LSA or similar.
  3. Semantic highlight diff: I’ll just do it by hand.

I’ll test initially with some experienced spaced repetition users, so that I can focus on the highlighting interaction design and the concept of idea-centric practice. What I’d like to observe:

  • Whether readers feel they can trust mere highlighting to indicate the tasks they’ll be practicing. Are “emphasis remarks” necessary?
  • Whether the “extra highlight” visualization uncovers comprehension gaps, and how readers feel about that.
  • How the tasks mapped from their highlights feel, emotionally—is there still a sense of guess-the-password, even though they indirectly “signed up for” these tasks?

Ultimately, what excites me about this design is that it’s positioned to attack three distinct problems which have emerged in my experiments over the past few years:

  1. The mnemonic medium feels unpleasantly authoritarian in many contexts; the locus of control should move towards readers.
  2. Comprehension gaps are routine; when they occur in conceptual material, practicing others’ prompts doesn’t work and feels oppressive.
  3. “Mere” retrieval practice of conceptual material often produces brittle understanding which transfers poorly; more fluid practice would likely produce more fluid understanding.

It’s safe to assume that this new design will fail, too, but I’m feeling optimistic that it will fail in interesting and instructive ways.


My thanks to Elliott Jin for facilitating my initial test and for extended discussion of these ideas! Thanks also to Joe Edelman for helpful discussion.

Last updated 2023-10-04.