Paralyzed by a stroke for 18 years, she speaks again with AI: a brain-computer interface drives a digital avatar that mimics her facial expressions and speaks for her

【Introduction】After a stroke left her paralyzed, Ann was unable to speak for 18 years. Now a brain-computer interface and a digital avatar let her “speak” again, complete with facial expressions.

On the same day, Nature published two groundbreaking studies on brain-computer interfaces that could change all of humanity!

At age 30, Ann, a Canadian woman who is now 47, suffered a devastating stroke that left her almost completely paralyzed. She has been unable to speak in the 18 years since.

Fortunately, a team from the University of California has developed a new brain-computer interface (BCI) that allows Ann to control her “digital avatar” and speak again.

When her avatar spoke the words “I think you are wonderful”, it was a breakthrough Ann had waited more than a decade for.

Notably, the avatar’s facial animation relies on the same technology used in “The Last of Us Part II”.

Specifically, researchers implanted an array of electrodes on the surface of Ann’s cerebral cortex.

When Ann tries to speak, the BCI intercepts the brain signals and converts them into words and speech. Here, AI decodes phonemes rather than the entire word.

The BCI at the University of California enables Ann to speak at a rate of 78 words per minute, far exceeding the previous device she used, which allowed for 14 words per minute.

As the title of the paper suggests, the key to the research is the achievement of “speech decoding” and “digital avatar control”, which is the biggest difference from previous studies.

The new BCI technology uses facial expressions to animate the digital avatar, imitating the details of natural human communication.

Link to the paper:

This groundbreaking research was published in Nature on August 23. It is the first time that speech and facial movements have been synthesized directly from brain signals, marking a major leap forward in brain-computer interfaces.

Another study published in Nature also focuses on converting speech neural activity into text through a brain interface.

The research results show that paralyzed patients can communicate at a rate of 62 words per minute, 3.4 times faster than previous studies.

Link to the paper:

Both studies have significantly improved the speed of converting speech brain signals into text, and even allowed virtual avatars to “speak” for humans.

The advent of brain-computer interfaces brings humanity closer to mechanical transcendence.

When the first sentence was spoken, she smiled happily

At thirty, most people still have many of life’s surprises ahead of them.

Ann was then a high school math teacher in Canada, teaching from the front of the classroom and watching her students succeed.

However, a sudden stroke took away her control over all her muscles, even her ability to breathe.

Since then, she has never spoken another word.

The most direct consequence of her stroke was the loss of control over her facial muscles, which left her face paralyzed and made speech impossible.

In the following five years, Ann often struggled to sleep, fearing that she would die in her sleep.

After years of physical therapy, she has seen some initial progress.

She is able to make limited facial expressions and some movements of her head and neck. However, she still cannot activate the facial muscles for speech.

For this reason, she also underwent a brain-computer interface (BCI) surgery.

However, the previous BCI technology was not advanced enough and only allowed Ann to communicate slowly and with difficulty. It was unable to decode her brain signals into fluent language.

Ann gently moves her head and slowly types on the computer screen using the device, saying, “Everything was taken away from me overnight.”

In 2022, Ann decided to make another attempt and volunteered to be a subject for the research team at the University of California.

Adding a face, a voice

To begin, the researchers recorded Ann’s brain signal patterns as she attempted to recite words, and used them to train an artificial intelligence algorithm to recognize the various speech signals.

Notably, the trained AI decodes phonemes, the basic elements of speech, rather than whole words, which makes it more versatile and about three times faster.

To achieve this, the research team implanted a paper-thin rectangular array of 253 electrodes on the surface of Ann’s brain.

Then, a cable was inserted into a port fixed on Ann’s head, connecting the electrodes to a set of computers.

Now, this system can transcribe Ann’s attempted speech into text at a speed of nearly 80 words per minute, far exceeding the speed of her previous BCI device.

Using a wedding video of Ann from 2005, the research team used artificial intelligence to reconstruct her unique intonation and accent.

Then, using software developed by Speech Graphics, a company dedicated to speech-driven animation technology, they created a personalized digital avatar that can simulate Ann’s facial expressions in real-time.

It can match the signals emitted by Ann’s brain when she attempts to speak and convert these signals into facial movements of her digital avatar.

These movements include opening and closing the jaw, pursing and tightening the lips, and raising and lowering the tongue, as well as facial expressions of happiness, sadness, and surprise.

Now, when Ann tries to speak, the digital avatar seamlessly creates animations and speaks the words she wants to say.

Well-known games such as “The Last of Us Part II” and “Halo Infinite” also use Speech Graphics’ facial animation technology to give their characters a wide range of vivid facial expressions.

Michael Berger, Chief Technology Officer and Co-founder of Speech Graphics, said:

“Creating a digital avatar that can speak, express emotions, and connect directly to the subject’s brain shows that the potential of AI-driven faces goes far beyond video games.

“The ability to speak again is impressive in itself, but facial communication is a fundamental part of being human, and this technology lets patients regain that extraordinary ability.”

This research from the University of California is not only a breakthrough in BCI technology but also a source of hope for many people living with similar conditions.

This technological breakthrough enables individuals to achieve independence and self-expression, bringing unprecedented hope to Ann and countless people who have lost their ability to speak due to paralysis.

For Ann’s daughter, who was 13 months old when the stroke happened, this BCI breakthrough means hearing her mother speak in a voice she has no memory of ever hearing.

It is reported that the next version of the BCI they are developing is wireless, eliminating the hassle of connecting to physical systems.

Edward Chang, the leader of this experiment at the University of California, has been advancing brain-computer interface technology for over a decade.

In 2021, he and his research team developed a “speech neuroprosthesis” that allows a severely paralyzed man to communicate with complete sentences.

That technology captures the brain signals sent to the vocal tract and converts them into text displayed on a screen. It was the first demonstration that speech-related brain signals can be decoded into complete words.

So, how did the University of California specifically enable Ann to “speak”?

Technical Implementation

In this study, a research team led by Dr. Edward Chang, the director of neurosurgery at the University of California, San Francisco, implanted a 253-electrode array over the areas of Ann’s brain that control speech.

These electrodes monitor and capture neural signals and transmit them through a cable port in the skull to a set of processors, where a machine learning model runs as part of the computing stack.

Over several weeks, Ann worked with the team to train the system’s artificial intelligence algorithm to recognize the neural signal patterns for more than 1,000 words.

This meant repeatedly going through different phrases from a 1,024-word conversational vocabulary until the computer could identify the brain activity patterns associated with all the basic speech sounds.

Instead of training the AI to recognize whole words, researchers have created a system that can decode words from smaller components called phonemes. Phonemes form spoken words in the same way that letters form written words. For example, the word “Hello” consists of four phonemes: “HH,” “AH,” “L,” and “OW.”

Using this approach, the computer only needs to learn 39 phonemes to decipher any word in English. This improves the accuracy of the system and increases the speed threefold.
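The idea that a small phoneme inventory can cover many words can be sketched with a toy lookup. This is only an illustration, not the study’s decoder: the lexicon, the function name `segment`, and the longest-match strategy are all assumptions, and only the “hello” entry comes from the text above.

```python
# Hypothetical mini-lexicon: word -> phoneme sequence.
# Only "hello" is taken from the article; the other entries are made up.
LEXICON = {
    "hello": ("HH", "AH", "L", "OW"),
    "hull": ("HH", "AH", "L"),
    "low": ("L", "OW"),
}

# Reverse index so a phoneme sequence can be looked up directly.
PHONES_TO_WORD = {phones: word for word, phones in LEXICON.items()}


def segment(phones):
    """Split a phoneme stream into lexicon words, longest match first.

    Returns a list of words, or None if no full segmentation exists.
    """
    phones = tuple(phones)
    if not phones:
        return []
    for end in range(len(phones), 0, -1):
        word = PHONES_TO_WORD.get(phones[:end])
        if word is not None:
            rest = segment(phones[end:])
            if rest is not None:
                return [word] + rest
    return None


print(segment(["HH", "AH", "L", "OW"]))       # ['hello']
print(segment(["HH", "AH", "L", "L", "OW"]))  # ['hull', 'low']
```

Three words are covered here by just four phonemes, which is the economy the phoneme-based approach relies on. A real system would score many candidate segmentations rather than take the first match.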

But this is just a small prelude to the research, as the real challenge lies in decoding and mapping Ann’s intentions with AI.

The electrodes were placed over brain regions that the research team had found to be crucial for speech.

Using a deep learning model, the research team maps the detected neural signals to speech units and speech features to output text, synthesize speech, and drive virtual characters.

As mentioned earlier, the researchers collaborated with Speech Graphics to create a virtual avatar of the patient.

Based on its analysis of the audio input, Speech Graphics’ technology “reverse engineers” the necessary facial muscle and skeletal movements, which are then fed in real time into a game engine to produce an avatar with almost no lag.

Because the patient’s mental signals can be directly mapped to the avatar, she can express emotions and even engage in non-verbal communication.

Overview of the Multimodal Speech Decoding System

The researchers have designed a speech decoding system to help Ann, who is severely paralyzed and unable to speak, communicate with others.

Ann has worked with the team to train an AI algorithm to recognize brain signals related to phonemes (phonemes are subunits of speech that form spoken language).

The researchers implanted a high-density, 253-channel ECoG array on Ann’s cortical surface, covering brain regions associated with speech, including the sensorimotor cortex (SMC) and the superior temporal gyrus.

In simple terms, these regions are associated with movements of the face, lips, tongue, and jaw (Figure 1a-c).

Through this array, the researchers can detect the electrical signals in these regions when Ann wants to speak.

The researchers observed that the array captures distinct activation patterns when Ann tries to move her lips, tongue, and jaw (Figure 1d).

In order to study how to decode language from brain signals, researchers had Ann silently articulate the sentence she saw on the screen, that is, make the movements of pronunciation.

From the signals captured by the 253 ECoG electrodes on Ann’s cortex, the researchers extracted two main features of brain activity: high-gamma activity (70-150 Hz) and low-frequency signals (0.3-17 Hz).
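To make the two frequency bands concrete, here is a minimal sketch that measures how much of a signal’s power falls in each band. It uses a plain discrete Fourier transform as a stand-in; the article does not describe the study’s actual filtering pipeline, and the sampling rate, the test tones, and the `band_power` helper are assumptions made for illustration.

```python
import cmath
import math

# Band edges in Hz, as given in the article.
HIGH_GAMMA = (70.0, 150.0)
LOW_FREQ = (0.3, 17.0)


def band_power(signal, fs, band):
    """Average squared DFT magnitude over the bins that fall inside `band`."""
    n = len(signal)
    f_lo, f_hi = band
    total, count = 0.0, 0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n  # center frequency of DFT bin k
        if f_lo <= freq <= f_hi:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(signal))
            total += abs(coeff) ** 2
            count += 1
    return total / count if count else 0.0


FS = 1000  # assumed sampling rate in Hz (not stated in the article)
# One second of a synthetic 100 Hz tone and a synthetic 5 Hz tone.
fast_tone = [math.sin(2 * math.pi * 100 * t / FS) for t in range(FS)]
slow_tone = [math.sin(2 * math.pi * 5 * t / FS) for t in range(FS)]

print(band_power(fast_tone, FS, HIGH_GAMMA) > band_power(fast_tone, FS, LOW_FREQ))
print(band_power(slow_tone, FS, LOW_FREQ) > band_power(slow_tone, FS, HIGH_GAMMA))
```

The 100 Hz tone shows up almost entirely in the high-gamma band and the 5 Hz tone in the low-frequency band, so the two features carry different information about the same electrode.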

Then, a deep learning model was used to learn how to predict pronunciation, speech, and oral movements from these brain signals, and ultimately convert these predictions into text, synthesized speech, and virtual avatar movements.

Text Decoding

The research team aims to decode text from the brain, especially in cases where people have difficulty speaking.

However, their early efforts were hindered by slow decoding speed and limited vocabulary.

In this study, they used the method of phone decoding, which allowed them to decode arbitrary phrases from a large vocabulary and achieve near-natural speaking speed.

To evaluate real-time performance, the research team decoded text while Ann attempted to silently say 249 sentences. These sentences were randomly selected from a sentence set with a 1,024-word vocabulary and were not used during model training. To decode, they extracted features from the ECoG signals and processed them using a bidirectional recurrent neural network (RNN).

They used several standard metrics to evaluate decoding performance, including word error rate (WER), phone error rate (PER), character error rate (CER), and words per minute (WPM).
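These metrics are conventionally computed from an edit distance between the reference and the decoded sequence. The sketch below shows that standard definition; the function names and example sentences are illustrative, and the article does not say how the study’s own evaluation code was written.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length (PER when tokens are phones)."""
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def word_error_rate(ref, hyp):
    return error_rate(ref.split(), hyp.split())


def char_error_rate(ref, hyp):
    return error_rate(list(ref), list(hyp))


def words_per_minute(n_words, seconds):
    return 60.0 * n_words / seconds


# One wrong word out of five gives a WER of 0.2.
print(word_error_rate("i think you are wonderful", "i think you were wonderful"))
# One dropped phone out of four gives a PER of 0.25.
print(error_rate(["HH", "AH", "L", "OW"], ["HH", "AH", "OW"]))
# 13 words decoded in 10 seconds is 78 words per minute.
print(words_per_minute(13, 10))
```

Because the error rates are normalized by the reference length, they can exceed 1.0 when the decoder inserts many extra tokens.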

The research team observed a decoding speed of 78.3 WPM, which exceeded Ann’s usual communication speed with her assistive device and approached natural speaking speed.

To evaluate the stability of the signals, they ran a separate task in which Ann silently attempted to say the 26 code words of the NATO phonetic alphabet or attempted four gesture movements. The neural network classifier performed very well, with an average accuracy of 96.8%.

Finally, to evaluate the model’s performance on a predefined set of sentences without any pauses between words, they simulated decoding on two different sets of sentences. The results showed that for these frequently used limited and predefined sentences, the decoding speed was very fast and the accuracy was very high.

Speech Synthesis

An alternative to text decoding is to synthesize speech directly from recorded neural activity, which can give non-speaking individuals a more natural and expressive means of communication.

Previous studies in people with intact speech showed that intelligible speech can be synthesized from neural activity recorded during vocalized or mimed speech, but the method had not been validated in paralyzed individuals.

In this study, the researchers performed real-time speech synthesis by converting neural activity recorded during attempted silent speech in an audiovisual task directly into audible speech (Figure 3a).

To synthesize speech, researchers passed time windows of neural activity to a bidirectional recurrent neural network (RNN).

Before testing, researchers trained the RNN to predict the probability of 100 discrete speech units for each time step.

To create a reference sequence of training speech units, researchers used HuBERT, a self-supervised speech representation learning model that encodes continuous speech waveforms into a time sequence of discrete speech units that capture underlying phoneme and pronunciation representations.

During the training process, researchers used the CTC loss function to allow the RNN to learn the mapping from ECoG features to speech units derived from these reference waveforms without alignment between silent speech attempts by participants and reference waveforms.

After predicting unit probabilities, the most probable unit at each time step is passed into a pretrained unit-to-speech model, which first generates a mel-spectrogram and then synthesizes it into audible speech waveforms in real time.
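Picking the most probable unit at each time step can be sketched as follows. The article only says that the top unit per step is passed on. Collapsing repeats and dropping a blank symbol is the usual greedy decoding rule for CTC-trained models and is added here as an assumption, as are the blank index and the toy probabilities.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def greedy_ctc_decode(prob_rows, blank=BLANK):
    """Take the argmax unit per time step, merge repeats, and drop blanks."""
    best = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    units = []
    prev = None
    for unit in best:
        # Emit a unit only when it differs from the previous step and is not blank.
        if unit != prev and unit != blank:
            units.append(unit)
        prev = unit
    return units


# Toy example with 3 classes (blank plus two speech units) over 5 time steps.
toy_probs = [
    [0.1, 0.8, 0.1],    # unit 1
    [0.1, 0.7, 0.2],    # unit 1 again, merged with the previous step
    [0.9, 0.05, 0.05],  # blank, separates the two occurrences of unit 1
    [0.1, 0.8, 0.1],    # unit 1, emitted again because a blank intervened
    [0.2, 0.1, 0.7],    # unit 2
]
print(greedy_ctc_decode(toy_probs))  # [1, 1, 2]
```

In the study the RNN predicts 100 discrete speech units per step, and the resulting unit sequence would then go to the unit-to-speech model rather than being printed.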

In offline scenarios, researchers used a voice conversion model, trained on a short recording of the participant’s voice from before her injury, to render the decoded speech in her personalized synthetic voice.

Facial Avatar Decoding

Researchers developed a facial-avatar BCI that decodes neural activity into articulatory speech gestures and renders a dynamic virtual face during the audiovisual task (Figure 4a).

To achieve dynamic animation of synthesized facial avatars, researchers adopted an avatar animation system (Speech Graphics) designed to convert speech signals into facial animation movements.

Researchers used two approaches to animate the avatars: a direct approach and an acoustic approach. The direct approach directly infers articulatory movements from neural activity without any speech intermediaries.

The acoustic approach is used for real-time audiovisual synthesis, ensuring low-latency synchronization between the decoded speech audio and the avatar’s movements.

In addition to articulatory movements accompanying synthesized speech, a complete avatar BCI should also be able to display non-speech facial movements and expressive movements related to emotions.

To achieve this, researchers collected neural data while the participant performed two additional tasks: an articulatory movement task and an emotion expression task.

The results showed that the participant was able to control the avatar BCI to display articulatory movements and strong emotional expressions, revealing the potential of a multimodal communication BCI to restore meaningful facial movements.

Articulatory Representations Drive Decoding

In healthy speakers, neural representations in the SMC (including the precentral gyrus and the postcentral gyrus) encode the articulatory movements of the facial muscles.

Because the electrode array was implanted over the participant’s SMC, the researchers hypothesized that these articulatory representations persist after paralysis and drive the performance of speech decoding.

To evaluate this, researchers fitted a linear temporal receptive field encoding model to predict the high-gamma activity (HGA) of each electrode based on the phoneme probabilities computed by a text decoder under the condition of a 1024-word general text task.

For each activated electrode, researchers computed the maximum encoding weight for each phoneme, resulting in a phonotopic tuning space. In this space, each electrode has an associated phoneme encoding weight vector.
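The step from encoding weights to a tuning vector can be sketched in a few lines. This assumes the fitted model gives each electrode one weight per phoneme per time lag, and it skips the model fitting itself. The electrode names, phoneme labels, and weight values are invented for illustration.

```python
def tuning_vectors(trf_weights):
    """Reduce per-lag encoding weights to one maximum weight per phoneme.

    trf_weights[electrode][phoneme] is a list of weights, one per time lag.
    """
    return {
        electrode: {phoneme: max(lags) for phoneme, lags in per_phoneme.items()}
        for electrode, per_phoneme in trf_weights.items()
    }


def preferred_phoneme(vector):
    """Phoneme with the largest encoding weight for a single electrode."""
    return max(vector, key=vector.get)


# Invented weights for two electrodes, two phonemes, and three time lags.
toy_weights = {
    "e1": {"P": [0.1, 0.9, 0.3], "AH": [0.2, 0.1, 0.0]},
    "e2": {"P": [0.0, 0.1, 0.1], "AH": [0.4, 0.6, 0.2]},
}

vectors = tuning_vectors(toy_weights)
print(vectors["e1"])                     # {'P': 0.9, 'AH': 0.2}
print(preferred_phoneme(vectors["e2"]))  # AH
```

Each electrode ends up as one point in a space with one axis per phoneme, which is what allows electrodes to be compared or grouped by the sounds they respond to.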