2023-11-30 Patreon letter - Initial results from highlight-driven prototype

Private copy; not to be shared publicly; part of Patron letters on memory system experiments

I’ve been working on a new augmented reading environment centered around highlighting as the core interaction. The idea is to give readers a magic wand with two unusual “powers”:

  1. You can point the wand at anything and say “make sure I know this.”
  2. You can wave the wand over a section and ask “did I miss anything important?”

Of course, those are aspirational framings. My current design instantiates those powers like this:

  1. You have a special purple highlighter. When you mark text with it, spaced repetition prompts about those ideas will be added to future review sessions.
  2. At the end of each section, you can press a button to take a second pass over “suggested highlights”. This button marks phrases corresponding to all the details which the author thought were important, but which didn’t semantically intersect your own highlights. You can scroll through to quickly check for gaps in your reading comprehension.

It’s easy to imagine more elaborate instantiations! These are intended as a meaningful first step towards the more aspirational powers. For more background, see my introductory letter on the concept.

An initial prototype

This month, I tested a prototype of this concept, adapting a linear algebra primer by Jim Hefferon. You can see the prototype in action in this new demo video (6m17s). I hosted study sessions with 14 people who had some authentic reason to study linear algebra, and who had some experience with spaced repetition memory systems. We met in person, one-on-one—that always helps me form richer impressions of a new prototype.

After a short background interview, and an explanation of the interface, participants read the first section of the book, marking it with both a normal yellow highlighter and the special purple highlighter however they liked. At the end of the section, we used the “suggested highlights” tool to make a quick second pass. I asked readers to comment on each extra highlight: was it something they understood but didn’t feel was worth marking, or was it something they skimmed over? Finally, we reviewed all the prompts corresponding to their purple highlights. I probed readers about how it felt to review these prompts, and whether they felt their highlights were faithfully represented.

Before we dig into what I observed, I should explain that this prototype involved some significant smoke and mirrors. Readers imagined that I’d implemented some elaborate machine learning system. But no—not yet, anyway. Here’s how it worked:

  • Before meeting with any readers, I manually “curated” highlights corresponding to what I thought were all the important details in the section we read.
  • For each of those highlights, I wrote one or more practice prompts.
  • Then, in real time while each participant read, I manually mapped each of their highlights onto the curated highlights (if any) which pointed at the same underlying idea. This sometimes required fluid judgment!
  • The “suggested highlight” feature then displayed all my curated highlights which had no corresponding reader highlights.
  • The review session displayed the prompts associated with all the curated highlights I’d mapped to the reader’s highlights.

That may look fairly baroque in writing, but in practice it created a remarkably seamless experience. Readers didn’t perceive the manual steps; they often innocently asked if they could keep using the tool to read on their own after our session. This “Wizard of Oz”-style testing let me focus on the interaction design concepts, rather than on the potentially unbounded problems of language model pipelines. That’s the right trade to be making for now.
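
For the technically curious, here’s a minimal sketch of that bookkeeping, written in Python with illustrative names (not the prototype’s actual code). The “mapping” input is exactly the judgment call I was performing by hand:

  from dataclasses import dataclass

  @dataclass
  class CuratedHighlight:
      id: str
      passage: str        # the span of text I marked as important
      prompts: list[str]  # the practice prompts I wrote for that idea

  # reader highlight id -> ids of the curated highlights expressing the same idea
  # (in this prototype, I produced this mapping by hand, in real time)
  Mapping = dict[str, set[str]]

  def covered_ids(mapping: Mapping) -> set[str]:
      """All curated highlight ids which some reader highlight mapped onto."""
      return set().union(*mapping.values())

  def suggested_highlights(curated: list[CuratedHighlight], mapping: Mapping) -> list[CuratedHighlight]:
      """Curated highlights which none of the reader's highlights touched."""
      return [c for c in curated if c.id not in covered_ids(mapping)]

  def review_prompts(curated: list[CuratedHighlight], mapping: Mapping) -> list[str]:
      """Prompts for every curated highlight the reader's highlights mapped onto."""
      return [p for c in curated if c.id in covered_ids(mapping) for p in c.prompts]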

What I learned from readers

Pre-screening is always imperfect in user research. For 3 of my 14 participants, the book (intended as a first course for undergraduates) was too difficult to read comfortably. Another 2 readers didn’t actually want to understand the topic in detail; they just wanted “the big picture”. In discussing what I learned, I’ll focus on the 9 readers who aligned with my target audience.

Mapping highlights to prompts seems very promising

Readers broadly loved the concept of the augmented highlighting interaction. Most of them already had a habit of highlighting texts, though all readily admitted that they didn’t think it actually affected how well they learned the material. Instead, readers described highlighting as a fidgeting behavior, a way to stay more engaged, and an ad-hoc bookmarking method. One reader didn’t end up using the special highlighter; he self-described as hypermnesic and felt he didn’t need practice support for the section’s material. The rest used the highlighter extensively.

Most readers were extremely happy with the retrieval practice prompts they were given. One said: “The prompts captured my own intent to the point where it took me some time to realize that they weren’t written by me”. Another said: “These are the kinds of things I wish I could actually have with a highlighting system. … It’s not just throwing things back at me verbatim. … For the concepts I highlighted, it asked me about the important logical relationships.” Most readers spontaneously asked if they could use “the magic highlighter” with other books.

Interestingly, readers had such positive perceptions of my highlight-to-prompt mappings despite the fact that I hadn’t always prepared a corresponding prompt for each of their highlights. Most readers highlighted a couple points which I didn’t feel were important enough to mark. No one noticed the absences, but I worry that this kind of silent omission would erode trust in the tool over time. Fully fixing this problem would require reliable machine-generated prompts; absent that, we could provide a fallback workflow for readers to write their own prompts for these “missing” highlights.

As I hoped, this design also mostly eliminated two failure modes I saw routinely in past mnemonic medium prototypes. Because each prompt (in principle) corresponded to something which the reader “said” they wanted to know, readers were much less likely to experience the review sessions as unpleasantly authoritarian or “school-like”. And for the same reason—in conjunction with the reading comprehension support mechanism—readers were less often outright confused by a question or its answer. Of course, readers could (and did) highlight passages without really understanding them, but when that happened, they didn’t complain that those prompts felt arbitrary and capricious, as they had in previous prototypes. For any prompt, the review interface offers a “View Source” button which shows its connection to the source material the reader had highlighted. I think this generally created a feeling that readers had “asked for” their confusion, rather than that the confusion was “being done to” them.

These sessions weren’t a rigorous experiment; I was only aiming for high-level qualitative evaluation. But my early impression is that the prompts-from-highlights design concept seems fundamentally quite promising, and is well worth pushing further.

Suggested highlights diagnosed some gaps, felt lightweight

This prototype’s second big idea was “suggested highlights” as a lightweight reading comprehension support intervention. Here results were somewhat more equivocal. The readers I worked with varied enormously in their pace and diligence. Some muttered every word under their breath, stopped routinely to ask and answer questions of the text, and re-read passages multiple times to clarify misunderstandings. Others breezed through in less than half the time, skipping passages which seemed repetitive or obvious. The “testing” context created distortions, too: some readers confessed that they were reading much more carefully than they would if I weren’t present—even though I explicitly asked them not to as part of their initial instructions.

Of the 9 readers matching my intended target user, 4 had meaningful reading comprehension gaps. The “suggested highlights” interaction quickly identified places where these readers hadn’t attended to some important point, and gave them a straightforward opportunity to fill that gap. Sometimes readers felt the details they missed weren’t so important, but often readers colored the “suggested highlights” with their special purple highlighter once they’d re-read the passage. I take that as a sign that the interaction identified something meaningful.

These readers were quite enthusiastic about the design. One said: “This is the tool that I want!” Another: “This is insanely cool! Man, I wish I had this everywhere.” One hesitation I have is that if these readers had a few straightforward gaps which my tool could identify, they probably had some other subtle gaps which will remain. Maybe it’s fine. Maybe these are the kinds of details which will get easily ironed out during a problem set. And the interaction at least ensured that readers weren’t being asked to do retrieval practice on material they hadn’t understood—a key goal. I’ll need to run more focused experiments to better understand the effects of my intervention on reading comprehension.

The other 5 “target” users had no overt comprehension gaps; the “suggested highlights” were all false positives. I spontaneously probed these readers’ understandings with extra questions, and they all performed quite well. So these readers didn’t need extra reading comprehension support, at least in the test context. Happily, 3 of these 5 liked the idea of the tool, and said that they didn’t mind the false positives; they found the interaction lightweight enough that they would want to use it anyway: “I still think it’s helpful. It gives me a safety net—guardrails.” The other 2 weren’t sure.

Purple highlights as to-do’s

Several readers used the special purple highlighter in a surprising way: they marked passages which they didn’t yet understand. These readers wanted to move on with the reading, but they also wanted to make sure that they would eventually understand the detail they’d marked. They were effectively leaving themselves a “to-do for understanding.”

This makes a lot of sense! After all, I told them that if they mark a passage with their purple highlighter, the system will make sure that they internalize those ideas. The current mechanism sort of accomplishes this goal. These readers received retrieval practice prompts about their “to-do” markings. They predictably didn’t know the answer, and they used the “View Source” button to return to the original passage for a re-reading. In several cases, the explanation made more sense on a second pass, now that they’d seen how the ideas fit into later parts of the text.

So at least sometimes, the retrieval practice prompts indirectly accomplished these readers’ “to-do” intention, insofar as they provoked a re-reading of the relevant passages. But in some cases, the passage was still confusing, and the reader needed some conversation to make sense of it. In other cases, the confusing passage didn’t correspond to any retrieval practice prompt—for instance, one reader was confused by a particular step in an example problem—and so the to-do was effectively dropped. It would be interesting to consider how one might support the “to-do” workflow more directly.

Transparency in highlight-to-prompt mapping

One of my “target” readers felt that the highlight-to-prompt mapping was uncomfortably “magical”. When I asked to what extent he felt the review prompts represented his highlights, he said that he really didn’t know: he couldn’t easily see the correspondences, so he couldn’t tell how well his intent had been reflected. The whole system felt like a black box.

This makes sense! I’m surprised more readers didn’t feel this way. Technologists like to describe their products as “magical”, but we really want “magic” in the sense of “astounding capacity, ease, expressivity”, not in the sense of “ineffable, inscrutable, mythical, eldritch.” My favorite paper on AI in interface design is Jeffrey Heer’s 2019 “Agency plus automation”. In it, he argues that such interfaces benefit from shared representations. You want the automated system to clearly surface its proposals in forms you yourself can create and manipulate, and you want to clearly see the connection between those proposals and the inputs which influenced them.

All that is missing from my current design. You can’t write your own prompts or modify those which the system provides. During review, you can “View Source” on each prompt to see which highlight it “came from”, but that’s a pretty cumbersome way to get an overview of the connections. And there’s no equivalent available while reading: that is, when you make a purple highlight, you’re given no hint of what the system understands you to mean—what prompts will result. Ideally, those representations should enable an interactive feedback loop. That is, you should be able to say “oh, no, that’s not what I meant; focus on this part.”

One naive solution: whenever a reader makes a purple highlight, we display a preview of the associated prompts and allow readers to intervene if they like. But I want to be careful to avoid re-introducing problems I encountered in my prototypes late last year. In those designs, curated prompts were presented in the margin alongside the associated content. This approach made the interaction very clear: if you click to save a prompt, you’ll get exactly what you see; you can edit it in place to adjust as you like. But it also created quite a distracting reading experience. Those marginal prompts tugged readers’ eyes away from the body of the text. Watching readers, I could see their eyes constantly jumping back and forth, losing their place, finding it again, skipping down the page to the next spot in the text with a marginal prompt. I think most readers ended up spending far too much attention evaluating and making decisions about prompts.

One of my big motivations for this new highlight-centric design was to solve that problem. I wanted to make it easy for readers to remain immersed in the text, while still benefiting from selective augmentation. I think this prototype performed quite well in that regard. But I’m not yet sure how to sustain that success while creating a more transparent and shared representation for the prompts.

Tailorable prompt mapping: emphasis notes and feedback

This prototype’s reading interface treated the highlight-to-prompt mapping as a black box, but as an experiment, it did offer a way to “steer” the prompts. When readers used their purple highlighter, they could optionally write a note to clarify what, specifically, they’d like to emphasize. For example, maybe you’ve highlighted a definition, but it’s the notation which you want to make sure you remember; or you want to make sure you internalize the contrast between this definition and some earlier concept. So you can highlight the definition of (say) linear equations, and jot a note that says “contrast with linear combinations.”

Most users didn’t use this feature heavily, but did use it at least once. And I can imagine that they might tend to use it more over time, as they build a mental model for how the system maps their highlights onto prompts, and for how that mapping sometimes isn’t exactly what they’d want.

In practice, I was only able to honor about a third of these requests with my pre-made curated prompts. I could map another third onto broader prompts which included or indirectly reinforced the detail they mentioned. The rest of the requests were idiosyncratic enough that they probably can’t be satisfied without machine-generated prompts. Only one reader noticed that his requests weren’t exactly being granted, and he didn’t express much concern. But I think this would become more troubling over time, and I wouldn’t want to include an “emphasis note” feature if it isn’t reliable.

The “emphasis note” framing front-loads user guidance. Another way to think about this kind of control is through iterative feedback. For example, one reader highlighted a theorem which says that if a linear system is transformed through one of three listed operations into another system, the second system has the same set of solutions as the first. In his review session, he got a prompt about this theorem’s role in the safety of Gauss’s method. He was confused about this, and once he clicked “View Source” to see where the prompt came from, he said: “Oh, no, I don’t really care about proving correctness here—I wanted to make sure I know the three ‘safe’ operations.” Ideally, he should be able to simply tell the system that during review: “Just make sure I know the safe operations.”

Another reader wished that he could make the prompts less formal: more verbal explanation and examples; tone down the notation and abstraction. This kind of feedback should influence not only the current prompt but probably all the prompts in the book, and maybe all prompts in general.

My brief experiments suggest that tailoring pre-existing prompts is a much more viable task for large language models than asking them to generate prompts anew. For example, consider the prompt: “Q. What is the leading variable of a row in a linear system? A. The first variable with a nonzero coefficient.” One of my readers wanted to see this kind of abstract answer in the context of an example. GPT-4 was able to rewrite the prompt appropriately for that request. Of course, this particular example wouldn’t be hard to rewrite by hand, so the tradeoff may not make sense unless this kind of tailoring can apply to many prompts, or unless the machine generation is extremely reliable.
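
To make that concrete: the tailoring step can be as simple as wrapping the existing prompt and the reader’s request in a rewrite instruction. Here’s a rough sketch using the OpenAI Python client; the instructions and model choice are placeholders I’d expect to iterate on:

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  def tailor_prompt(question: str, answer: str, request: str) -> str:
      """Rewrite an existing prompt to honor a reader's request, preserving the tested idea."""
      response = client.chat.completions.create(
          model="gpt-4",
          messages=[
              {"role": "system",
               "content": "You rewrite spaced-repetition prompts. Keep the underlying idea "
                          "being tested; change only what the reader asks for. "
                          "Reply in the form 'Q. ...' / 'A. ...'."},
              {"role": "user",
               "content": f"Q. {question}\nA. {answer}\n\nReader request: {request}"},
          ],
      )
      return response.choices[0].message.content

  print(tailor_prompt(
      "What is the leading variable of a row in a linear system?",
      "The first variable with a nonzero coefficient.",
      "Show the answer in the context of a small concrete example.",
  ))

The appealing property is that the prompt’s substance is fixed in advance; the model is only asked to re-frame it, which seems to fail much less often than open-ended generation.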

Next steps

Speaking more personally for a moment: this last round of testing was quite exciting for me! The new design seems to have solved many of the problems I’ve observed with my various memory system designs over the years. And—very tentatively—it also appears to help some readers with reading comprehension support. It’s a great sign that I really want to use this tool every day in my own reading.

All that said, this past round of testing was pretty shallow. I wanted to see how a variety of people reacted to the design ideas, so I met with 14 people for around an hour each. Because we were reading the first section of the book, there wasn’t much opportunity for the ideas to really build on each other and put heavy demands on memory and comprehension. And because I observed just a single session, I didn’t have a chance to see how the memory and comprehension support fared over time, as forgetting became more relevant.

So, starting this week, I’ll switch to a depth-first approach. Like I did earlier this year, I’ll be meeting with one student weekly for a few hours. We’ll continue more deeply into the book, where the material will start to compound in more demanding ways. We’ll also work through some problem sets during those sessions, to observe how the augmentation interacts with practical capacity.

Those observations will still be qualitative. Meanwhile, I’d like to start working towards more systematically understanding the impact of my design on reading comprehension. How often does it help readers identify meaningful gaps? Are there kinds of gaps which it tends to ignore? Do readers who use this intervention understand the material appreciably better? Feel appreciably more capable or engaged with the material?

Right now, my prototype requires me to manually map reader highlights to curated highlights. And I have to do that with little delay, so that readers can use the “suggested highlights” feature once they finish reading the section. If I want to run an experiment with a few dozen users, this manual step would consume a huge amount of time. So, in parallel with my depth-first work with one student, I’ll attempt to automate the mapping between readers’ highlights and the curated highlights.

My initial tests suggest that this is a much more tractable task to automate than the two other currently-manual tasks in my design: identifying the most important elements in a text, and writing good prompts for each of them. It’s nice that these tasks are somewhat separable, so that I can make some progress by automating just the highlight-to-prompt mapping.
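
To give a flavor of what automating that mapping might look like (purely as an illustration, not a settled plan): one simple baseline is to embed each reader highlight and each curated highlight, then treat high cosine similarity as a match. A sketch, where the model name and threshold are placeholders:

  import numpy as np
  from openai import OpenAI

  client = OpenAI()

  def embed(texts: list[str]) -> np.ndarray:
      response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
      vectors = np.array([item.embedding for item in response.data])
      # Normalize rows so that dot products become cosine similarities.
      return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

  def map_highlights(reader: list[str], curated: list[str], threshold: float = 0.85) -> dict[int, list[int]]:
      """For each reader highlight, indices of curated highlights it plausibly points at."""
      similarity = embed(reader) @ embed(curated).T
      return {i: [j for j in range(len(curated)) if similarity[i, j] >= threshold]
              for i in range(len(reader))}

A plain similarity threshold certainly won’t capture all the fluid judgments I made by hand, but it would give me a concrete baseline to measure against.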

And if I automate that mapping, I can make this prototype publicly available, albeit constrained to this one book. That’s a nice milestone to work towards, and I’m sure that—as always—public use would surface many unexpected insights. From there, yes, it would be great to automate the curatorial and prompt-writing tasks. But I’m also interested in the prospect of using one large book as a depth-first laboratory to explore other reading augmentation ideas, many of which I’ve discussed in these essays.

One idea that’s animated my work is the claim that “books don’t work”. That is: in order to actually understand, internalize, and remember ideas from an explanatory text, a reader has to employ all kinds of tacit and often unreliable strategies, and the medium of the book does surprisingly little to help. What if the experience of engaging with a text naturally ensured that all this extra work got done? I want to create an alien sense of capacity and ease when I engage with explanatory texts.

Breaking down some of the things which you may need for a book to really “work” the way you might hope:

  • Comprehension. You need to actually process the words on the page, and notice when you’ve failed to do that. This is surprisingly difficult for most readers, much of the time! Most interventions are quite obtrusive; the current prototype is my attempt at a more functional augmentation.
  • Memory. You need to remember what you read. This motivated Quantum Country and the mnemonic medium. But I find that today’s memory systems often produce brittle memory, and I’d like to explore ideas like varying prompts and escalating their complexity to improve that.
  • Elaboration. You need to understand not just what the text says, but what it means, why that matters. You need to connect the text’s ideas to prior knowledge and experience. Discussion is one method I like; thoughtful writing (ideally for some authentic purpose) is another. I’d like to find more, and I’d like to find ways to better connect those activities to the reading experience.
  • Fluency. You need to practice using what you’ve read so that it becomes automatic. Pattern induction; schema acquisition; knowledge compilation. In technical topics, this often means problem-solving practice—see projects like Mathigon and Execute Program. Project-based learning is another common approach, and I’m interested in ideas like “doing-centric explanatory mediums” to that end.
  • Intervention. You need to diagnose and resolve confusions and misconceptions that you may have. Procedurally-focused problems often fail to clearly identify conceptual issues. Teaching others is a classic approach here, inspiring integrated interventions like AutoTutor. Frontier LLMs are often surprisingly good at resolving confusions, when the user can articulate them. I’m interested in integrated, lightweight methods for identifying and acting on confusion.
  • Integration. Much of the time, you read a book not just to acquire knowledge but because you want it to change you somehow—change the way you think, or the way you view the world, or the way you act or feel in a situation. To make a book real in this way, you have to carry its ideas with you into your life. For a few ideas, see salience prompts and timeful texts. Reading clubs are great for this; I’d like to explore more ideas at the intersection of new media and social convening.

That’s enough research agenda for several lifetimes, of course. But I articulate all this here as a way of helping myself resist the cultural forces surrounding me in San Francisco. Those forces demand that if I find an idea—like highlight-driven memory prompts—I should focus aggressively on scaling it to apply to as many places and people as possible. There’s merit in that, of course! I certainly want to use this special highlighter everywhere. But I also need to weigh that impulse against the prospect of uncovering more foundational ideas, of solving the problem I care about more completely.


Thanks to all the students who worked with me on this first round of tests, and thanks to Benjamin Reinhardt for helpful discussion about my next steps.

Last updated 2023-12-03.