The hardware seems faintly unbelievable—a computer as powerful as Apple’s current mid-tier laptops (M2), plus a dizzying sensor/camera array with dedicated co-processor, plus displays with 23M 6µm pixels (my phone: 3M 55µm pixels; the PSVR2 is 32µm) and associated optics, all in roughly a mobile phone envelope.
But that kind of vertical integration is classic Apple. I’m mainly interested in the user interface and the computing paradigm. What does Apple imagine we’ll be doing with these devices, and how will we do it?
Given how ambitious the hardware package is, the software paradigm is surprisingly conservative. visionOS is organized around “apps”, which are conceptually defined just like apps on iOS.
I was surprised to see that the interface paradigm is classic WIMP. At a high level, the pitch is not that this is a new kind of dynamic medium, but rather that Vision Pro gives you a way to use (roughly) 2D iPad app UIs on a very large, spatial display. Those apps are organized around familiar UIKit controls and layouts. We see navigation controllers, split views, buttons, text fields, scroll views, etc, all arranged on a 2D surface (modulo some 3D lighting and eye tracking effects). Windows, icons, menus, and even a pointer (more on that later).
These 2D surfaces are in turn arranged in a “Shared Space”, which is roughly the new window manager. My impression is that the shared space is arranged cylindrically around the user (moving with them?), with per-window depth controls, but I’m not yet sure of that. An app can also transition into “Full Space”, which is roughly like “full screening” an app on today’s OSes.
In either mode, an app can create a “volume” instead of a “window”. We don’t see much of this yet: the Breathe app spreads into the room; panoramas and 3D photography are displayed spatially; a CAD app displays a model in space; an educational app displays a 3D heart. visionOS’s native interface primitives don’t make use of a volumetric paradigm, so anything we see here will be app/domain-specific (for now).
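For concreteness, here is a minimal sketch of how those three containers (2D window, volume, Full Space) are declared in a visionOS SwiftUI app. The app name, view names, and scene identifiers are placeholders of mine, not anything from Apple’s materials.

```swift
import SwiftUI

// Hypothetical app; the structure mirrors visionOS's public scene types.
@main
struct ExampleSpatialApp: App {
    var body: some Scene {
        // A conventional 2D window, placed alongside other apps in the Shared Space.
        WindowGroup(id: "main") {
            MainView()
        }

        // A bounded "volume": 3D content that still coexists with other apps.
        WindowGroup(id: "model") {
            ModelView()
        }
        .windowStyle(.volumetric)
        .defaultSize(width: 0.5, height: 0.5, depth: 0.5, in: .meters)

        // A Full Space: the app takes over the user's surroundings.
        ImmersiveSpace(id: "immersive") {
            ImmersiveView()
        }
    }
}

struct MainView: View { var body: some View { Text("A 2D window") } }
struct ModelView: View { var body: some View { Text("Bounded 3D content") } }
struct ImmersiveView: View { var body: some View { Text("Full Space content") } }
```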
For me, the most interesting part of visionOS is the input part of the interaction model. The core operation is still pointing. On NLS and its descendants, you point by indirect manipulation: moving a cursor by translating a mouse or swiping a trackpad, and clicking. On the iPhone and its descendants, you point by pointing. Direct manipulation became much more direct, though less precise; and we lost “hover” interactions. On Vision Pro and its descendants, you point by looking, then “clicking” your bare fingers, held in your lap.
Sure, I’ve seen this in plenty of academic papers, but it’s quite wild to see it so central to a production device. There are other VR/AR devices which feature eye tracking, but (AFAIK) all still ship handheld controllers or support gestural pointing. Apple’s all in on foveation as the core of their input paradigm, and it allows them to produce a controller-free default experience. It reminds me of Steve’s jab at styluses at the announcement of the iPhone.
My experiences with hand tracking-based VR interfaces have been uniformly unpleasant. Without tactile feedback, the experience feels mushy and unreliable. And it’s uncomfortable after tens of seconds (see also Bret’s comments). The visionOS interaction model dramatically shifts the role of the hands. They’re for basically-discrete gestures now: actuate, flick. Hands no longer position the pointer; eyes do. Hands are the buttons and scroll wheel on the mouse. Based on my experiences with hand-tracking systems, this is a much more plausible vision for the use of hands, at least until we get great haptic gloves or similar.
But it does put an enormous amount of pressure on the eye tracking. As far as I can tell so far, the role of precise 2D control has been shifted to the eyes. The thing which really sold the iPhone as an interface concept was Bas’s and Imran’s ultra-direct, ultra-precise 2D scrolling with inertia. How will scrolling feel with such indirect interaction? More importantly, how will fine control feel—sliders, scrubbers, cursor positioning? One answer is that such designs may rely on “direct touch”, akin to existing VR systems’ hand tracking interactions. Apple suggests that “up close inspection or object manipulation” should be done with this paradigm. Maybe the experience will be better than on other VR headsets I’ve tried because sensor fusion with the eye tracker can produce more accuracy?
By relegating hands to a discrete role in the common case, Apple reinforces the 2D conception of the visionOS interface paradigm. You point with your eyes and “click” with your hands. One nice benefit of this change is that we recover a natural “hover” interaction. But moving incrementally from here to a more ambitious “native 3D” interface paradigm seems like it would be quite difficult.
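As a concrete illustration of that division of labor (gaze selects the target, a pinch actuates it), here is a rough sketch using the public API; the view and the sphere are my own inventions. The system draws the hover highlight when you look at the entity, and the pinch arrives as an ordinary tap gesture.

```swift
import SwiftUI
import RealityKit

// Hypothetical example: a sphere you target by looking at it and actuate by pinching.
struct GazeTargetView: View {
    var body: some View {
        RealityView { content in
            let sphere = ModelEntity(mesh: .generateSphere(radius: 0.1))
            sphere.components.set(InputTargetComponent())      // can receive input
            sphere.components.set(CollisionComponent(shapes: [.generateSphere(radius: 0.1)]))
            sphere.components.set(HoverEffectComponent())      // system-drawn gaze highlight
            content.add(sphere)
        }
        // The system resolves *where* this lands from gaze; the hands supply only the pinch.
        .gesture(
            SpatialTapGesture()
                .targetedToAnyEntity()
                .onEnded { value in
                    print("Pinched while looking at \(value.entity)")
                }
        )
    }
}
```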
For text, Apple imagines that people will use speech for quick input and a Bluetooth keyboard for long input sessions. They’ll also offer a virtual keyboard you can type on with your fingertips. My experience with this kind of virtual keyboard has been uniformly bad—because you don’t have feedback, you have to look at the keyboard while you type; accuracy feels effortful; it’s quickly tiring. I’d be surprised (but very interested) if Apple has solved these problems.
Note how different Apple’s strategy is from the vision in Meta’s and MagicLeap’s pitches. These companies point towards radically different visions of computing, in which interfaces are primarily three-dimensional and intrinsically spatial. Operations have places; the desired paradigm is more object-oriented (“things” in the “meta-verse”) than app-oriented. Likewise, there are decades of UIST/etc papers/demos showing more radical “spatial-native” UI paradigms. All this is very interesting, and there’s lots of reason to find it compelling, but of course it doesn’t exist, and a present-day Quest / HoloLens buyer can’t cash in that vision in any particularly meaningful way. Those buyers will mostly run single-app, “full-screen” experiences; mostly games.
But, per Apple’s marketing, this isn’t a virtual reality device, or an augmented reality device, or a mixed reality device. It’s a “spatial computing” device. What is spatial computing for? Apple’s answer, right now, seems to be that it’s primarily for giving you lots of space. This is a practical device you can use today to do all the things you already do on your iPad, but better in some ways, because you won’t be confined to “a tiny black rectangle”. You’ll use all the apps you already use. You don’t have to wait for developers to adapt them. This is not a someday-maybe tech demo of a future paradigm; it’s (mostly) today’s paradigm, transliterated to new display and input technology. Apple is not (yet) trying to lead the way by demonstrating visionary “killer apps” native to the spatial interface paradigm. But, unlike Meta, they’ll build their device with ultra high-resolution displays and suffer the premium costs, so that you can do mundane-but-central tasks like reading your email and browsing the web comfortably.
On its surface, the iPhone didn’t have totally new killer apps when it launched. It had a mail client, a music player, a web browser, YouTube, etc. The multitouch paradigm didn’t substantively transform what you could do with those apps; it was important because it made those apps possible on the tiny display. The first iPhone was important not because the functionality was novel but because it allowed those familiar tools to be used anywhere. My instinct is that the same story doesn’t quite apply to the Vision Pro, but being generous for a moment, I might suggest its analogous contribution is to allow desktop-class computing in any workspace: on the couch, at the dining table, etc. “The office” as an important, specially-configured space, with “computer desk” and multiple displays, is (ideally) obviated in the same way that the iPhone obviated quick, transactional PC use.
Relatively quickly, the iPhone did acquire many functions which were “native” to that paradigm. A canonical example is the 2008 GPS-powered map, complete with local business data, directions, and live transit information. You could build such a thing on a laptop, but the amazing power of the iPhone map is that I can fly to Tokyo with no plans and have a great time, no stress. Rich chat apps existed on the PC, but the phenomenon of the “group chat” really depended on the ubiquity of the mobile OS paradigm, particularly in conjunction with its integrated camera. Mobile payments. And so on. The story is weaker for the iPad, but Procreate and its analogues are compelling and unique to that form factor. I expect Vision Pro will evolve singular apps, too; I’ll discuss a few of interest to me later in this note. Will its story be more like the iPhone, or more like the iPad and Watch?
It’s worth noting that this developer platform strategy is basically an elaboration of the Catalyst strategy they began a few years ago: develop one app; run it on iOS and macOS. With the Apple Silicon computers, the developer’s participation is not even required: iPad apps can be run directly on macOS. Or, with SwiftUI, you can at least use the same primitives and perhaps much of the same code to make something specialized to each platform. visionOS is running with the same idea, and it seems like a powerful strategy to bootstrap a new platform. The trouble here has been that Catalyst apps (and SwiftUI apps, though somewhat less so) are unpleasant to use on the Mac. This is partially because those frameworks are still glitchy and unfinished, but partially because an application architecture designed for a touch paradigm can’t be trivially transplanted to the information/action-dense Mac interface. Apple makes lots of noises in their documentation about rethinking interfaces for the Mac, but in practice, the result is usually an uncanny iOS app on a Mac display. Will visionOS have the same problem with this strategy? It benefits, at least, from not having decades of “native” apps to compare against.
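A sketch of what the “one codebase, per-platform specialization” idea looks like in practice with SwiftUI. The view is hypothetical, and the per-platform modifiers are simply examples of where specialization would go, not a recommendation.

```swift
import SwiftUI

// Hypothetical shared view, compiled for iOS, macOS, and visionOS from one codebase.
struct NotesList: View {
    let titles: [String]

    var body: some View {
        List(titles, id: \.self) { title in
            Text(title)
        }
        #if os(visionOS)
        // On visionOS, adopt the system glass material so the surface reacts to the room.
        .glassBackgroundEffect()
        #elseif os(macOS)
        // On the Mac, a denser, pointer-oriented presentation.
        .listStyle(.inset)
        #endif
    }
}
```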
If I find the Vision Pro’s launch software suite conceptually conservative, what might I like to see? What sorts of interactions seem native to this paradigm, or could more ambitiously fulfill its unique promise?
Huge, persistent infospaces: I love this photo of Stewart Brand in How Buildings Learn. He’s in a focused workspace, surrounded by hundreds of photos and 3”x5” cards on both horizontal and vertical surfaces. It’s a common trope among writers: both to “pickle” yourself in the base material and to spread printed manuscript drafts across every available surface. I’d love to work like this every day, but my “office” is a tiny corner of my bedroom. I don’t have room for this kind of infospace, and even if I did, I wouldn’t want to leave it up overnight in my bedroom. There’s tremendous potential for the Vision Pro here. And unlike the physical version, a virtual infospace could contend with much more material than could actually fit in my field of view, because the computational medium affords dynamic filtering, searching, and navigation interactions (see Softspace for one attempt). And you could swap between persistent room-scale infospaces for different projects. I suspect that visionOS’s windowing system is not at all up to this task. One could prototype the concept with a huge “volume”, but it would mean one’s writing windows couldn’t sit in the middle of all those notes. (Update: maybe a custom Shared Space would work?)
Ubiquitous computing, spatial computational objects: The Vision Pro is “spatial computing”, insofar as windows are arranged in space around you. But it diverges from the classic visions along these lines (Ubiquitous computing, Dynamicland) in that the computation lives in virtual windows, without more than a loose spatial connection to anything physical in the world. What if programs live in places, live in physical objects in your space? For instance, you might place all kinds of computational objects in your kitchen: timers above your stove; knife work reference overlays above your cutting board; a representation of your fridge’s contents; a catalog of recipes organized by season; etc. Books and notes live not in a virtual 2D window but “out in space”, on my coffee table (solving problems of Peripheral vision). When physical, they’re augmented—with cross-references, commentary from friends, practice activities, etc. Some are purely digital. But both signal their presence clearly from the table while I’m wearing the headset. My memory system is no longer stuck inside an abstract practice session; practice activities appear in context-relevant places, ideally integrating with “real” activities in my environment, as I perform them.
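Pieces of this are at least crudely expressible today: RealityKit’s anchoring API can pin content to physical surfaces it recognizes, though only inside an immersive space. A hedged sketch, with a hypothetical kitchen-timer label anchored to a table-classified plane; the anchoring is real API, but nothing about persistence or “living there” across sessions comes for free.

```swift
import SwiftUI
import RealityKit

// Hypothetical sketch: a timer label that "lives" on a physical table surface.
// Requires an immersive space; the anchor attaches to a table-classified plane.
struct KitchenTimerView: View {
    var body: some View {
        RealityView { content in
            let tableAnchor = AnchorEntity(
                .plane(.horizontal,
                       classification: .table,
                       minimumBounds: SIMD2<Float>(0.3, 0.3))
            )
            let label = ModelEntity(
                mesh: .generateText("12:00",
                                    extrusionDepth: 0.005,
                                    font: .systemFont(ofSize: 0.08)),
                materials: [SimpleMaterial(color: .white, isMetallic: false)]
            )
            label.position.y = 0.02   // float just above the surface
            tableAnchor.addChild(label)
            content.add(tableAnchor)
        }
    }
}
```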
Shared spatial computing: Part of these earlier visions of spatial computing, and particularly of Dynamicland, is that everything I’m describing can be shared. When I’m interacting with the recipe catalog that lives in the kitchen, my wife can walk by, see the “book” open and say “Oh, yeah, artichokes sound great! And what about pairing them with the leftover pork chops?” I’ll reserve judgment about the inherent qualities of the front-facing “eye display” until I see it in person, but no matter how well-executed that is, it doesn’t afford the natural “togetherness” of shared dynamic objects. Particularly exciting will be to create this kind of “togetherness” over distance. I think a “minimum viable killer app” for this platform will be: I can stand at my whiteboard, and draw (with a physical marker!), and I see you next to me, writing on the “same surface”—even though you’re a thousand miles away, drawing on your own whiteboard. FaceTime and Freeform windows floating in my field of view don’t excite me very much as an approximation, particularly since the latter requires “drawing in the air.”
A few elements of visionOS’s design really tickled me because they finally productized some visual interface ideas we tried in 2012 and 2013. It’s been long enough now that I feel comfortable sharing in broad strokes.
The context was that Scott Forstall had just been fired, Jony Ive had taken over, and he wanted to decisively remake iOS’s interface in his image. This meant aggressively removing ornamentation from the interface, to emphasize user content and to give it as much screen real estate as possible. Without borders, drop shadows, and skeuomorphic textures, though, the interface loses cues which communicate depth, hierarchy, and interactivity. How should we make those things clear to users in our new minimal interfaces? With a few other Apple designers and engineers1, I spent much of that year working on possible solutions that never shipped.
You might remember the “parallax effect” from iOS 7’s home screen, the Safari tabs view, alerts, and a few other places. We artificially created a depth effect using the device's motion sensors. Internally, even two months before we revealed the new interface, this effect was system-wide, on every window and control. Knobs on switches and scrubbers floated slightly above the surface. Application windows floated slightly above the wallpaper. Every app had depth-y design specialization: the numbers in the Calculator app floated way above the plane, as if they were a hologram; in Maps, pins, points of interest, and labels floated at different heights by hierarchy; etc. It was eventually deemed too much (“a bit… carnival, don't you think?”) and too battery-intensive. So it's charming to see this concept finally get shipped in visionOS, where UIKit elements seem to get the same depth-y treatments we'd tried in 2012/2013. It's much more natural in the context of a full 3D environment, and the Vision Pro can do a much better job of simulating depth than we'd ever manage with motion sensors.
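For reference, the public form of this that did ship in iOS 7 is UIInterpolatingMotionEffect, which maps device tilt onto view properties. A minimal sketch (the offsets are arbitrary values of mine):

```swift
import UIKit

// Minimal parallax sketch: shift a view's center slightly as the device tilts.
func addParallax(to view: UIView, amount: CGFloat = 10) {
    let horizontal = UIInterpolatingMotionEffect(keyPath: "center.x",
                                                 type: .tiltAlongHorizontalAxis)
    horizontal.minimumRelativeValue = -amount
    horizontal.maximumRelativeValue = amount

    let vertical = UIInterpolatingMotionEffect(keyPath: "center.y",
                                               type: .tiltAlongVerticalAxis)
    vertical.minimumRelativeValue = -amount
    vertical.maximumRelativeValue = amount

    let group = UIMotionEffectGroup()
    group.motionEffects = [horizontal, vertical]
    view.addMotionEffect(group)
}
```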
A second concept rested on the observation that the new interface might be very white, but there are lots of different kinds of white: acrylic, paper, enamel, treated glass, etc. Some of these are “flat”, while others are extremely reactive to the room. If you put certain kinds of acrylic or etched glass in the middle of a table, it picks up color and lighting quality from everything around it. It’s no longer just “white”. So, what if interactive elements were not white but “digital white”—i.e. the material would be somehow dynamic, perhaps interacting visually with their surroundings? For a couple months, in internal builds, we trialled a “shimmer” effect, almost as if the controls were made of a slightly shiny foil with a subtly shifting gloss as you moved the device (again using the motion sensors). We never could really make it live up to the concept: ideally, we wanted the light to interact with your surroundings. visionOS actually does it! They dynamically adapt the control materials to the lighting in your environment and to your relative pose. And interactive elements are conceptually made of a different material which reacts to your gaze with a subtle gloss effect! Timing is everything, I suppose…
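In the shipping API, the closest public handles for this “digital material” idea are the system glass background and the gaze-reactive hover effects. A small sketch; the view is hypothetical.

```swift
import SwiftUI

// Hypothetical card: the system glass picks up lighting from the room; the
// button gets a subtle gloss when the user looks at it.
struct GlassCard: View {
    var body: some View {
        VStack(spacing: 12) {
            Text("Recipes")
            Button("Open") { /* ... */ }
                .hoverEffect(.highlight)
        }
        .padding(24)
        .glassBackgroundEffect()
    }
}
```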
Only some of the WWDC videos about the Vision Pro have been released so far. I imagine my views will evolve as more information becomes available.
Having now watched all the WWDC 2023 talks:
1 Something in the Apple omertà makes me uncomfortable naming my collaborators as I normally would, even as I discuss the project itself. I guess it feels like I’d be implicating them in this “behind-the-scenes” discussion without their consent? Anyway, I want to make clear that I was part of a small team here; these ideas should not be attributed to me.
…avy and Raskin’s Canon Cat “LEAP” interaction want to be. “Go where I’m looking.” It’s not there yet, but in a few years, it’ll feel like a Poor man’s brain-computer interface, I think.
Through DragGesture, though, it appears that you can access 6DOF information about a pinch+drag actuated on any type of scene. AnchorEntity(.head) might not, because it doesn’t actually expose the transform matrix?
…com.apple.ist.ds.appleconnect as enabled in the list of disabled services. It didn’t actually have a corresponding launchd plist anywhere, but maybe something was checking for it. I ran sudo launchctl disable com.apple.ist.ds.appleconnect, then sudo find / -name "AppleConnect" and deleted some stray preference plists I found. After reboot, it was fixed.
Making an entity respond to gestures requires both InputTargetComponent and CollisionComponent (see the sketch at the end of this note). That must have been quite a complicated integration.
HoverEffectComponent is used for out-of-process hover effects… but it appears that there’s no available configuration at all.
With a Model3D inside a .plain SwiftUI window, it seems to clip at z=-250 and +570.
With a Model3D inside a volumetric window, I can translate to ~±200 outside its z extents before it clips. It clips immediately when exceeding the x or y values of the frustum.
RealityKit types are expressed in terms of Float; SwiftUI types are expressed in terms of Double. Also, number literals like 0.5 are inferred to be Double, not Float, when no context is provided. And Swift does not automatically coerce. What a nuisance.
One can add a .previewLayout(.sizeThatFits) modifier to one’s #Preview view, but in my experiments, that only resizes up to relatively small limits. My immersive scenes are clipping in Xcode Preview.
…RealityKit’s animation primitives. Unless I drop into the full physics body simulator, I guess? Frustrating.
…the .sizeThatFits preview layout. Iterating on device with immersive scenes has extra friction because the macOS mirroring window disappears while in the immersive scene.
ARBodyAnchor is unavailable on visionOS, even when in an immersive space and with all available permissions. It’s not clear whether that’s a deliberate decision—to make people-tracking unavailable—or if it’s just part of a broader API migration away from ARAnchor in favor of the new ARKitSession suite of APIs, which includes analogues of many of the ARAnchor subclasses, but not this one.
…ImageAnchor this morning, tracking physical books. With AnchorEntity’s limited API, you don’t need to request any user permissions to use image recognition (but you do need to be in an immersive space, alas).
Want to see your print logs? Did your app crash? Too bad—the macOS mirroring window isn’t visible! And you can’t even squint at your Mac’s screen through the headset because it’s blacked out while mirroring is enabled, even if the mirroring window isn’t visible.
Why is it named originFromAnchorTransform? This seemed backwards to me at first: isn’t this the transform which moves the anchor, from the origin? The name makes sense if you think of it as a transform which takes a point in the anchor’s coordinate system and moves it into the world coordinate system.
Some important new vision features for enterprise customers: BarcodeDetectionProvider will let us track barcodes and QR codes. And… you can request camera data. But only for enterprises. Bluh.
Explore object tracking for visionOS - WWDC24 - Videos - Apple Developer
Create enhanced spatial computing experiences with ARKit - WWDC24 - Videos - Apple Developer
https://developer.apple.com/videos/play/wwdc2024/10103/?time=164
Create custom hover effects in visionOS - WWDC24 - Videos - Apple Developer
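To tie a few of these notes together (the input components, the system hover effect, and the gesture-to-entity coordinate conversion), here is a hedged sketch of a draggable, gaze-highlighted entity. The view name is mine, and it assumes the targeted-gesture value’s convert method behaves as documented for translating SwiftUI’s Double-based 3D locations into RealityKit’s Float-based entity space.

```swift
import SwiftUI
import RealityKit

// Hypothetical example: a box that highlights under gaze and follows a pinch+drag.
struct DraggableBoxView: View {
    var body: some View {
        RealityView { content in
            let box = ModelEntity(mesh: .generateBox(size: 0.2),
                                  materials: [SimpleMaterial(color: .gray, isMetallic: false)])
            box.components.set(InputTargetComponent())
            box.components.set(CollisionComponent(shapes: [.generateBox(size: [0.2, 0.2, 0.2])]))
            box.components.set(HoverEffectComponent())   // gaze highlight; no configuration available
            content.add(box)
        }
        .gesture(
            DragGesture()
                .targetedToAnyEntity()
                .onChanged { value in
                    guard let parent = value.entity.parent else { return }
                    // Convert the SwiftUI-space (Double) location into the parent's
                    // RealityKit (Float) coordinate space, then move the entity there.
                    value.entity.position = value.convert(value.location3D,
                                                          from: .local,
                                                          to: parent)
                }
        )
    }
}
```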