ID R&D has long established itself as a technological innovator and leader in biometrics, and in recent years has demonstrated a particularly strong focus on liveness detection. Its facial liveness technology, for example, beat over 80 other algorithms to attain top rankings in testing conducted by the National Institute of Standards and Technology (NIST) last year. But it’s the company’s voice biometrics technology that has made news in recent weeks, thanks to ID R&D’s launch of its IDLive Voice liveness solution just as the FTC issued a call for help in fighting voice clones. Naturally, the company’s work in this increasingly important field is a central topic in FindBiometrics’ latest interview with Alexey Khitrov, Founder & President of ID R&D, which also touches on generative AI, user authentication, and the biometrics industry landscape more broadly, as detailed in the Biometric Digital Identity Prism.
Read on to hear from one of the pioneering vendors in biometrics and the fight against deepfake clones:
FindBiometrics: ID R&D just announced a new solution that can detect voice clones. What is a voice clone and how is it a different kind of biometric threat than we generally see in the biometrics industry?
Alexey Khitrov, Founder & President, ID R&D: Voice cloning is an audio deepfake that leverages generative AI and text-to-speech technologies to digitally replicate a person’s voice. Voice cloning collects and analyzes audio data of a person speaking, trains AI algorithms on that data, and then synthesizes new speech audio from arbitrary text.
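Conceptually, the pipeline described here has three stages: collect reference audio, derive a compact model of the speaker, then synthesize arbitrary text in that voice. The following is a minimal illustrative sketch only; every function here is a hypothetical stand-in (a real system would train a neural speaker-embedding and synthesis model, not hash bytes):

```python
import hashlib

def extract_speaker_profile(audio_samples):
    """Stand-in for the training stage: derive a compact 'voiceprint'
    from reference audio. Here we just hash the bytes deterministically."""
    digest = hashlib.sha256(b"".join(audio_samples)).hexdigest()
    return {"voiceprint": digest}

def synthesize(profile, text):
    """Stand-in for text-to-speech conditioned on the cloned voice."""
    return f"<audio voiced-as {profile['voiceprint'][:8]}: '{text}'>"

# 1. Collect audio of the target speaker; 2. "train" a profile;
# 3. synthesize any text the attacker chooses, in the target's voice.
reference_clips = [b"clip-one", b"clip-two"]
profile = extract_speaker_profile(reference_clips)
fake_utterance = synthesize(profile, "Please wire the funds today.")
```

The point of the sketch is the asymmetry it illustrates: once the profile exists, step 3 is fully automated and repeatable for any text.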
It’s a powerful technology with a wide range of compelling applications, but voice cloning has also been cited as the most concerning outcome of AI advancement. To understand why, it’s helpful to distinguish between threats in our real world and in our digital world. AI can be used to create deepfake digital imagery and content, and this certainly poses threats in the form of identity fraud and misinformation. Voice clones take the threat to a different level, in that they can be used to impersonate people in the real world, in real time. This is a new type of threat without precedent.
Consider also the role that audio communications plays in our daily lives. On WhatsApp alone, there are 100 million voice calls every day. Without a facial image to confirm the identity of the person we’re speaking to, we must rely on two things: what they say and who they sound like. Voice clones enable incredibly convincing impersonations, even of people whose voices we hear every day. When combined with conversational AI technology, they can be fully automated to enable scalable vishing attacks for large-scale fraud and extortion.
FindBiometrics: Voice clones, deepfakes, and other types of generative AI-created media have gone from a novelty to a mainstream concern in a matter of months. How can identity technologies stop these AI-impersonator fraud threats, and what is the key to keeping up in the machine learning arms race?
Alexey Khitrov, Founder & President, ID R&D: Security has always been a whack-a-mole arms race that requires a multifaceted approach; security by its nature is determined by its greatest vulnerability. Liveness detection is no different in that it must be able to detect every method among many that a fraudster might attempt to impersonate a live person. So, key #1 to keeping up in the arms race is to have the ability to monitor for new attack vectors and to quickly adapt and address the threat.
Also important is the fact that there are two categories of signals that liveness detection must cover: the content of the attack – what’s in the picture or audio – and the method of the attack – how the attack is performed. In the case of facial images, the content of an attack could be a deepfake on a screen, a copy of a face image printed on paper, or any of several other attack vectors. These attacks can also be conducted using different points of entry: either a presentation attack to the camera, or an injection attack by way of software or hardware hacks directly into a system or platform.
AI plays an essential role in detecting liveness signals in both categories. But as deepfake content becomes more difficult to differentiate from real content, it becomes increasingly important to be able to detect the method of the attack. So key #2 to keeping up with the arms race is to assume that the technology will advance to a point where we can’t rely on obvious artifacts in the content to detect attacks; the method of attack must also be detectable.
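The two signal categories can be made concrete with a toy score-fusion sketch. Both detector functions below are hypothetical stand-ins, not ID R&D's implementation; the point is the fusion logic: a spoof should be flagged if either category fires, so the scores combine with max rather than an average.

```python
def content_score(audio_features):
    # Hypothetical detector: looks for synthesis artifacts
    # in the signal itself (the "what's in the audio" category).
    return 0.9 if audio_features.get("spectral_artifacts") else 0.1

def method_score(channel_metadata):
    # Hypothetical detector: flags an injection channel, e.g. a
    # virtual microphone feeding audio in via software rather than
    # a genuine capture device (the "how it's performed" category).
    return 0.9 if channel_metadata.get("virtual_device") else 0.1

def attack_likelihood(audio_features, channel_metadata):
    # Fuse with max: a clean spoof in one category must still be
    # caught by the other, so neither score can dilute the alarm.
    return max(content_score(audio_features), method_score(channel_metadata))

# A clone with no audible artifacts is still caught by the method check.
print(attack_likelihood({"spectral_artifacts": False}, {"virtual_device": True}))  # 0.9
```

This is why the second key matters: even a perfect deepfake with no content artifacts still has to arrive through some channel, and that channel leaves detectable traces.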
FindBiometrics: We are seeing an increased focus on voice biometrics, both as a standalone authenticator and as a multi-factor option. It’s not a new modality, but this current generation is seeing a stronger level of acceptance than previous iterations. What is behind this new rise of voice biometrics?
Alexey Khitrov, Founder & President, ID R&D: Looking back at the evolution of facial biometrics is instructive. In the early days – before AI – the performance of facial recognition was terrible – worse than fingerprints, sensitive to environmental factors like lighting – and simply wasn’t sufficiently reliable or convenient to protect consumers’ access to bank accounts and other sensitive information. AI matured and facial recognition was an early proof point. The ubiquity of facial image data led to more and more machine learning, to a point now where the algorithms are incredibly accurate compared to ten years ago.
Voice biometrics is following a similar trajectory but it’s somewhat of a different story. Consider that biometrics are like superheroes, in that each modality has different strengths and weaknesses for different applications. Facial biometrics are hard to beat for logging onto a mobile phone; they are essentially frictionless. But what’s happening right now is that thanks to generative AI, advances in speech-to-text and conversational AI are converging to prompt an explosion of use cases that leverage the naturalness of a voice interface.
Our voice is our most natural communications interface: it’s natural for humans to speak, rather than to blink at a camera or press a finger to a scanner. This means that while facial biometrics – sometimes assisted by voice – are optimal for frictionless identity functions, securing digital communications is a natural job for voice biometrics and voice liveness.
We are entering a new paradigm of human-machine interface, where our machines now actually have much more interesting and valuable things to say, and they have new voices to say them with. In some cases these voices will be clones. We will go from asking Alexa for the time and weather to asking AI agents to attend our Zoom calls in the near future. The value of intelligent agent services – put in the trillions of dollars by some accounts – creates a significant need to secure access to these services and also to ensure that people know when they’re communicating with a real person and when they’re not, in the same vein that chat support distinguishes between live agents and AI bots.
FindBiometrics: What is the difference between active and passive liveness detection? And how are AI technologies moving the industry toward the passive option?
Alexey Khitrov, Founder & President, ID R&D: The difference between active and passive liveness is essentially about the level and nature of friction imposed on legitimate users to pass the checks. Active techniques rely on interaction with users, requiring them to blink or move their phones closer and further away. There is plenty of evidence to show that this type of friction causes significant abandonment rates.
There are some techniques that claim to be passive because they don’t require the user to do anything, but they still affect the user experience – for example, with flashing lights and imagery that at best is distracting and at worst can cause harm.
A truly passive approach is performed in a way that has no impact on the user experience, with the additional benefit of not tipping off fraudsters that checks are being performed or providing clues about how they might be defeated. A still more elegant solution uses only the images or audio already captured for biometric matching, minimizing the latency introduced by sending extra audio or video streams back to a server for analysis.
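That last point can be sketched in a few lines: a single verification pass that scores the same audio for both liveness and speaker match, so no extra capture, prompt, or round trip is needed. Both scoring functions below are hypothetical stand-ins for real models, and the byte-matching logic is purely illustrative:

```python
def liveness_score(audio):
    # Hypothetical stand-in for a passive liveness model:
    # here, any audio tagged "replayed" scores as non-live.
    return 0.2 if b"replayed" in audio else 0.8

def match_score(audio, enrolled_template):
    # Hypothetical stand-in for speaker matching against an
    # enrolled voiceprint; here, naive byte containment.
    return 0.9 if enrolled_template in audio else 0.1

def verify(audio, enrolled_template):
    """One pass over the SAME audio used for matching: the user is
    never asked to do anything extra, and no separate liveness
    stream has to travel to a server."""
    live = liveness_score(audio)                    # is this a live voice?
    match = match_score(audio, enrolled_template)   # is it the right voice?
    return live > 0.5 and match > 0.5
```

The design choice worth noting is that the liveness check is invisible to both the legitimate user (no friction) and the fraudster (no signal that a check even exists).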
FindBiometrics: Looking at 2024, what emerging trends do you expect to influence our industry?
Alexey Khitrov, Founder & President, ID R&D: Generative AI has been yet another injection of jet fuel to our digital transformation, accelerating technological advancements to a pace we have never seen before. One outcome is that we will soon be sharing the world with virtual humans in multiple forms: intelligent agents, digital assistants, chatbots, and others. They will be very knowledgeable, and we will communicate with them using our voices. The lines between humans and machines will continue to blur in the digital world, and increasingly in the real world. Consider the movie Her, released ten years ago this past October, about a man who falls in love with his computer and her voice. Science fiction is becoming science a lot quicker than most of us imagined in 2013.
It’s hard to overestimate the impact, but it’s also hard to fully visualize what that impact will be. Will virtual humans have faces? Identities? How will our interactions with them be governed and secured? Whatever the answers to these questions, it’s clear that these intelligent agents will have a massive impact on identity and what it means. Until now, digital identity has been about giving humans their (many) identities in the digital realm. Will digital beings now need identities in the human realm? It’s not far-fetched to expect that people will want to know when a machine they thought they knew has been reprogrammed and has, in effect, taken on a different identity.
FindBiometrics: ID R&D is a Catalyst in the Biometric Digital Identity Prism — a formidable designation. The Prism proposes that the biometrics and identity industries are facing an inflection point. What do you think will be key to seizing this moment and building an identity-safe future?
Alexey Khitrov, Founder & President, ID R&D: Envisioning the future is what makes good technology providers great, no matter the field, and identity is no exception. The arrival of machines with knowledge and communication skills on par with humans is a paradigm shift with no precedent in the identity arena. Where liveness has been a secondary enabler of remote biometric authentication, the blurring of lines between real and virtual has liveness moving to center stage. We will first need to ask, “Are you a person?”, and then use biometrics to verify that you are the person you claim to be. Covert use of clones and deepfakes is a strong and useful signal of attempted fraud and crime. Technology providers who have anticipated this shift will be well prepared to participate in the wild ride that is just beginning.
ID R&D saw very early not only how important and challenging liveness detection is, but also that it must be determined without friction. This is even more true with voice, where the whole point of the technology is to enable a natural human interface. ID R&D has been a provider of best-in-class voice technology for a long time. It’s the most natural form of human communication, and for this reason will play a vital role in the development of our relationships with virtual humans.
For more information on voice clone detection, visit: www.idrnd.ai/voice-clone-landing