How to snoop on what an Apple Vision Pro user is typing | Kaspersky official blog

In September 2024, a team of researchers from both the University of Florida and Texas Tech University presented a paper detailing a rather sophisticated method for intercepting text entered by users of the Apple Vision Pro mixed reality (MR) headset.

The researchers dubbed this method GAZEploit. In this post, we’ll explore how the attack works, the extent of the threat to owners of Apple VR/AR devices, and how best to protect your passwords and other sensitive information.

How text input works in Apple visionOS

First, a bit about how text is input in visionOS — the operating system powering Apple Vision Pro. One of the most impressive innovations of Apple’s MR headset is its highly effective use of eye tracking.

Gaze direction serves as the primary method of user interaction with the visionOS interface. The tracking is so precise that it works even for the smallest interface elements — including the virtual keyboard.

visionOS uses a virtual keyboard and eye tracking to input text. Source

Although visionOS offers voice control, the virtual keyboard remains the primary text input method. For sensitive information such as passwords, visionOS provides protection against prying eyes: in screen-sharing mode, both the keyboard and the entered password are automatically hidden.

During screen sharing, visionOS automatically hides passwords entered by Vision Pro users. Source

Another key feature of Apple’s MR headset lies in its approach to video calls. Since the device sits directly on the user’s face, the standard front-camera option is no good for transmitting the user’s video image. On the other hand, using a separate external camera for video calls would be very un-Apple-like; plus, video-conference participants wearing headsets would look rather odd.

So Apple came up with a highly original technology that features a so-called virtual camera. Based on a 3D face scan, Vision Pro creates a digital avatar of the user (Apple calls it a Persona), which is what actually takes part in the video call. You can use your Persona in FaceTime and other video-conferencing apps.

https://media.kasperskydaily.com/wp-content/uploads/sites/92/2024/10/03125345/how-to-steal-passwords-apple-vision-pro-3.mp4

By using lots of biometric data, the Persona digital avatar in visionOS looks truly lifelike. Source

The headset’s sensors track the user’s face in real-time, allowing the avatar to mimic head movements, lip movements, facial expressions, and so on.

GAZEploit: How to snoop on Apple Vision Pro user input

For the GAZEploit researchers, the key feature of the Persona digital avatar is that it uses data from the Vision Pro’s highly precise sensors to replicate the user’s eye movements with pinpoint accuracy. And it was here that the team discovered a vulnerability enabling interception of input text.

Here’s how GAZEploit works in principle — allowing an attacker to intercept text entered by an Apple Vision Pro user. Source

The attack’s core concept is quite simple: although the system carefully hides passwords entered during video calls, a threat actor can track the user’s eye movements, mirrored by their digital avatar, and reconstruct the characters entered on the virtual keyboard. Or rather, keyboards, as visionOS has three: a passcode (PIN) keyboard, the default QWERTY keyboard, and a number and special character keyboard. This complicates the recognition process, since an outside observer doesn’t know which keyboard is in use.

visionOS actually has three different virtual keyboards: (a) for passcodes, (b) for letters, and (c) for numbers and special characters. Source

However, neural networks effectively automate the GAZEploit attack. The first stage of the attack uses a neural network to identify text-input sessions. Eye movement patterns during use of the virtual keyboard differ significantly from normal patterns: blink rates decrease, and gaze direction becomes more structured.

First, the neural network identifies when text is being entered on the virtual keyboard. Source

At the second stage, the neural network analyzes gaze stability changes to identify eye-based selection of characters, and uses characteristic patterns to pinpoint virtual key presses. Then, based on gaze direction, the system calculates which key the user was looking at.

Next, the neural network recognizes individual virtual keystrokes and the characters being entered. Source
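To make the idea of "gaze stability" more concrete, here is a minimal sketch of a classic dispersion-threshold fixation detector. It is not the researchers' neural network; it is a much simpler stand-in, and the function names, the normalized coordinates, and the threshold values are all assumptions made for illustration. The point is only to show that a gaze that settles on one spot for a while stands out from the rapid jumps between keys.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) gaze coordinates, hypothetical normalized units


def _dispersion(points: List[Point]) -> float:
    """Spread of a window: horizontal extent plus vertical extent."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(
    gaze: List[Point],
    max_dispersion: float = 0.05,  # assumed threshold, not from the paper
    min_samples: int = 5,          # assumed minimum fixation length
) -> List[Point]:
    """Return the centroid of every run of samples that stays within max_dispersion."""
    fixations: List[Point] = []
    start = 0
    n = len(gaze)
    while start < n:
        end = start + 1
        # Grow the window while the gaze stays stable.
        while end < n and _dispersion(gaze[start:end + 1]) <= max_dispersion:
            end += 1
        window = gaze[start:end]
        if len(window) >= min_samples:
            cx = sum(p[0] for p in window) / len(window)
            cy = sum(p[1] for p in window) / len(window)
            fixations.append((cx, cy))
            start = end      # continue after the fixation
        else:
            start += 1       # too short: treat as eye movement and slide on
    return fixations
```

Each returned centroid is a candidate "key press" location that a later step could map onto the keyboard layout.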

How accurately GAZEploit recognizes input data

In actual fact, it’s all a bit more complicated than the graph above suggests. Calculations based on the avatar’s eye position generate a heatmap of probable points on the virtual keyboard where the user’s gaze might have landed during text entry.

Mapped gaze directions for keystroke inference of the demo attack: (a) adaptive virtual keyboard mapping, (b) predicted first guess keystrokes, (c) actual keystrokes. The accuracy isn’t perfect, but it’s not bad. Source

Then, the researchers’ model converts the collected information into a list of K virtual keys that were most likely “pressed” by the user. The model also provides for various data-entry scenarios (password, email address/link, PIN, arbitrary message), taking into account the specifics of each.
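As a rough illustration of what a list of K most likely keys means, the sketch below ranks keys by their distance from an estimated gaze point. The keyboard geometry (a unit grid with staggered rows) and the function names are hypothetical, and a plain nearest-key ranking is our simplification rather than the researchers' actual model.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical layout: each key is one unit wide, rows are staggered.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
ROW_OFFSETS = [0.0, 0.5, 1.5]  # assumed horizontal stagger per row


def key_centers() -> Dict[str, Tuple[float, float]]:
    """Map each letter to the (x, y) center of its key."""
    centers = {}
    for row_index, (row, offset) in enumerate(zip(ROWS, ROW_OFFSETS)):
        for col_index, char in enumerate(row):
            centers[char] = (col_index + offset, float(row_index))
    return centers


def top_k_keys(gaze_point: Tuple[float, float], k: int = 5) -> List[str]:
    """Return the k keys whose centers lie closest to the gaze point."""
    centers = key_centers()
    ranked = sorted(centers, key=lambda ch: math.dist(gaze_point, centers[ch]))
    return ranked[:k]
```

With this layout, a gaze point that lands exactly on the center of "g" yields "g" as the first guess, followed by its neighbors. A slightly noisy estimate can push a neighboring key to the top, which is why a short list of candidates is more useful than a single guess.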

What’s more, the neural network uses a dictionary and additional techniques to improve interpretation. For example, due to its size, the spacebar is often a top-five candidate — producing many false positives that need filtering. The backspace key requires special attention: if the keystroke guess is correct, it means the previous character was deleted, but if it’s wrong, then two characters may get mistakenly discarded.

GAZEploit suggests the top-five most likely characters. Source

The researchers’ detailed error analysis shows that GAZEploit often confuses adjacent keys. At maximum precision (K=1), roughly one-third of entered characters are identified correctly. However, for groups of five most likely characters (K=5), depending on the specific scenario, the accuracy is already 73–92%.

The accuracy of GAZEploit recognition in various scenarios. Source
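For readers unfamiliar with the K notation, here is a small sketch of how a top-K accuracy figure can be computed. The function name and the toy candidate lists are invented purely for illustration; they are not the researchers' data.

```python
from typing import List, Sequence


def top_k_accuracy(truth: str, candidates: Sequence[List[str]], k: int) -> float:
    """Fraction of positions where the true character is among the first k guesses."""
    if not truth or len(truth) != len(candidates):
        raise ValueError("truth and candidates must be non-empty and equally long")
    hits = sum(1 for ch, guesses in zip(truth, candidates) if ch in guesses[:k])
    return hits / len(truth)


# Toy example: ranked guesses for each character of the typed word "gaze".
typed = "gaze"
guesses = [
    ["g", "f", "h", "v", "t"],
    ["s", "a", "q", "w", "z"],
    ["x", "z", "s", "a", "d"],
    ["e", "w", "r", "d", "s"],
]
accuracy_k1 = top_k_accuracy(typed, guesses, k=1)
accuracy_k5 = top_k_accuracy(typed, guesses, k=5)
```

In this toy case only two of the four first guesses are right, yet every true character appears somewhere in its list of five, which mirrors the gap between the K=1 and K=5 figures.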

How dangerous the GAZEploit attack is in practical terms

In practice, such accuracy means that potential attackers are unlikely to obtain the target password in ready-to-go form; but they can dramatically — by many orders of magnitude, in fact — reduce the number of attempts needed to brute-force it.

The researchers claim that for a six-digit PIN, it’ll only take 32 attempts to cover a quarter of all the most likely combinations. For a random eight-character alphanumeric password, the number of attempts is slashed from hundreds of trillions to hundreds of thousands (from 2.2×10¹⁴ to 3.9×10⁵, to be precise), which makes password cracking feasible even with a prehistoric Pentium CPU.
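The password figures can be reproduced with simple arithmetic, under two assumptions of ours: that "alphanumeric" means 62 possible characters (26 lowercase, 26 uppercase, 10 digits), and that the attacker has five candidates for each of the eight positions, with the right character always among them. The second assumption is optimistic, given the K=5 accuracy quoted earlier.

```python
def search_space(options_per_position: int, length: int) -> int:
    """Number of combinations when each position has the same number of options."""
    return options_per_position ** length


ALPHANUMERIC = 26 + 26 + 10   # assumed alphabet: a-z, A-Z, 0-9
PASSWORD_LENGTH = 8
CANDIDATES_PER_KEY = 5        # assumed K=5 guesses per character

full_space = search_space(ALPHANUMERIC, PASSWORD_LENGTH)           # 62^8
reduced_space = search_space(CANDIDATES_PER_KEY, PASSWORD_LENGTH)  # 5^8

print(f"full: {full_space:.1e}, reduced: {reduced_space:.1e}")
print(f"reduction factor: {full_space / reduced_space:.1e}")
```

Under these assumptions, 62⁸ is about 2.2×10¹⁴ and 5⁸ is 390,625, or about 3.9×10⁵, which matches the figures quoted above.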

In light of this, GAZEploit could pose a serious enough threat and find practical application in high-profile targeted attacks. Fortunately, the vulnerability has already been patched: in the latest versions of visionOS, Persona is suspended when the virtual keyboard is in use.

Apple could conceivably protect users from such attacks in a more elegant way — by sprinkling some random distortions in the precise biometric data driving the digital avatar’s eye movements.
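What such a distortion might look like in its simplest form is sketched below: small Gaussian noise added to every gaze sample before it drives the avatar. This is purely our illustration of the idea, not anything Apple has implemented; the function name, the coordinate units, and the noise level are all assumptions.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def jitter_gaze(samples: List[Point], sigma: float = 0.5,
                seed: Optional[int] = None) -> List[Point]:
    """Return a copy of the gaze samples with Gaussian noise added to each axis.

    sigma is in the same (hypothetical) units as the samples, e.g. key widths.
    """
    rng = random.Random(seed)
    return [
        (x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
        for x, y in samples
    ]
```

The trade-off is presumably between privacy and realism: noise comparable to the width of a key would blur which key was being looked at, while too much noise would make the avatar's eyes look unnatural.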

Regardless, Apple Vision Pro owners should update their devices to the latest version of visionOS and breathe easily. One last thing: we advise them, and everyone else, to exercise caution when entering passwords during video calls. Avoid it if you can, always use the strongest (long and random) character combinations possible, and use a password manager to create and store them.
