Back in August of 2014, a team of MIT researchers released a video that some dubbed the "Mary Had a Little Lamb" video. It described an experiment by MIT researchers to extract audio from nothing but the vibrations of a plant, a potato-chip bag, laptop earbuds, and other everyday objects.
The significance of the nursery-rhyme name was to acknowledge one of the first phrases Thomas Edison spoke into his first phonograph in 1878. But the rhyme also acknowledges (perhaps unknowingly) the differences between visual microphones, which we'll cover here, and optical microphones, to be covered in a future story. Both technologies have a modern connection to Edison.
The "Mary Had a Little Lamb" video used a then-experimental technology known as a visual microphone, which works by collecting vibrational data and converting it back into audio signals. Sound striking an object produces air-pressure fluctuations at its surface, setting it into subtle motion. High-speed cameras captured that motion at frame rates between 1 kHz and 20 kHz, and the recorded video was then processed by a specialized algorithm to recover the original sound.
|Image Source: The Public Domain Review|
In the video, the MIT researchers recovered intelligible speech from the video of a vibrating potato-chip bag. In addition to high-resolution video technology, the crucial element in this technique was a unique algorithm that reconstructed the audio signal by analyzing the minute vibrations of the objects the sound struck. The YouTube video about the visual microphone put it this way: "When sound hits an object, it causes that object to vibrate. The motion of this vibration creates a subtle visual signal that's usually invisible to the naked eye. In our work, we show how, by using only a video of the object and a suitable processing algorithm, we can extract these minute vibrations and partially recover the sounds that produced them, letting us turn everyday visible objects into visual microphones."
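The published method decomposes each frame with a complex steerable pyramid and tracks local phase, which is well beyond a short sketch. As a much simpler illustration of the core idea, the toy function below (entirely my own construction, not the researchers' code) turns a stack of frames into a 1-D motion signal by tracking the sub-pixel drift of an object's intensity-weighted centroid:

```python
import numpy as np

def toy_motion_signal(frames):
    """Extract a crude 1-D motion signal from grayscale frames.

    The real visual microphone uses a complex steerable pyramid and
    local phase; this toy proxy tracks the intensity-weighted
    horizontal centroid of each frame, whose tiny shifts follow the
    sub-pixel motion of a bright vibrating object.
    """
    frames = np.asarray(frames, dtype=float)
    cols = np.arange(frames.shape[2])
    weights = frames.sum(axis=1)                      # collapse rows
    centroids = (weights * cols).sum(axis=1) / weights.sum(axis=1)
    return centroids - centroids.mean()               # remove DC offset

# Synthetic test: a bright Gaussian blob oscillating horizontally by
# half a pixel, one cycle every 20 frames.
t = np.arange(200)
x0 = 32 + 0.5 * np.sin(2 * np.pi * t / 20)            # sub-pixel motion
xs = np.arange(64)
frames = np.exp(-0.5 * ((xs[None, None, :] - x0[:, None, None]) / 4) ** 2)
frames = np.repeat(frames, 16, axis=1)                # 16 rows per frame
sig = toy_motion_signal(frames)                       # ~0.5 * sin(2*pi*t/20)
```

Real footage adds noise, texture, and rigid-body motion, which is exactly why the full algorithm averages motion estimates over many image locations, scales, and orientations.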
As noted in the study, reconstructing audio from video requires that the frequency of the video samples – i.e., the number of frames of video captured per second – be higher than the frequency of the audio signal (strictly, at least twice as high, per the Nyquist criterion of sampling theory). Further, the study explains that the researchers would occasionally use medium-grade, high-speed cameras capturing 2,000 to 6,000 frames per second. As a point of reference, this rate was much faster than the 60 frames per second possible with some high-end smartphone cameras.
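A quick way to see why the frame rate matters: a tone above half the sampling rate doesn't vanish, it aliases to a wrong, lower frequency. The snippet below (my own illustration of the standard aliasing formula, not taken from the study) shows what happens to a 1 kHz vibration at the two frame rates mentioned above:

```python
def aliased_frequency(f_signal, fs):
    """Apparent frequency of a tone of f_signal Hz sampled at fs Hz.

    Tones below fs / 2 are reproduced faithfully; tones above it fold
    back ("alias") into the 0 .. fs/2 band.
    """
    f = f_signal % fs
    return min(f, fs - f)

print(aliased_frequency(1000, 2200))  # 1000 -> captured correctly
print(aliased_frequency(1000, 60))    # 20   -> 1 kHz masquerades as 20 Hz
```

So a 2,200 fps camera hears a 1 kHz tone as 1 kHz, while a 60 fps camera folds it down to a meaningless 20 Hz wobble.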
Here's where things get interesting. The researchers also explored the use of ordinary digital cameras – in other words, commodity consumer hardware. A quirk in the rolling-shutter readout of most CMOS image sensors enabled the inference of higher-frequency vibrational information from video recorded at a standard 60 frames per second. Perhaps a similar compensating technique could be used with the CMOS sensors in smartphone cameras?
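The rolling-shutter trick works because each sensor row is read out at a slightly different instant, so a single 60 fps frame actually contains hundreds of row-samples spread across the frame period. The back-of-the-envelope numbers below use assumed figures (1080 rows, a 12 µs line time), not the specs of any particular camera from the study:

```python
# Each row of a rolling-shutter CMOS sensor is exposed and read out
# sequentially, so rows act as time samples taken far faster than the
# nominal frame rate. Assumed illustrative figures, not measured specs.

frame_rate = 60              # frames per second (standard video)
rows = 1080                  # sensor rows read sequentially
line_time = 12e-6            # assumed seconds to read out one row

row_sample_rate = 1 / line_time              # ~83 kHz during readout
readout_time = rows * line_time              # time spent reading rows
blank_time = 1 / frame_rate - readout_time   # dead time, no samples

print(f"row sample rate: {row_sample_rate:.0f} Hz")
print(f"readout: {readout_time*1e3:.2f} ms, blanking: {blank_time*1e3:.2f} ms")
```

The catch, which the researchers had to work around, is the blanking interval: the burst of row-samples stops between frames, leaving periodic gaps in the recovered signal.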
Naturally, the 60 frames per second data rate did not produce the same quality audio reproductions as the high-speed cameras. But the ordinary digital camera rate was enough to identify the gender of a speaker in a room, the number of speakers, and even the identities of some of the speakers.
It's highly likely that the next generation of mobile smartphones will include higher-resolution digital cameras and even extra processing power to improve the quality of these visual microphones.
|Image Source: MIT / Abe Davis research video|
Since its introduction in 2014, several new uses for this technology have been suggested:
- Biomedical imaging: To extract heartbeats from the tiny movements on a patient’s head.
- Physical measurements: To detect the speed of moving hot air or similar transparent fluids.
- 3D video processing: To detect motion and extract a depth map from binocular images.
- Audio recovery: To extract speech signals by measuring the vibration of a person’s neck in a video.
- Sci-Fi: Perhaps it might even be possible to recover sound across space, since light, unlike sound, can travel through a vacuum.
The audio recovery application suggests the use of motion extraction to recover sounds from silent video. According to the MIT researchers, this could be done by extracting vibrations separately at a number of scales and orientations. The technique would involve aligning the signals in time (to avoid destructive interference) and taking a weighted average across all orientations and scales.
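The align-then-average step can be sketched in a few lines. The function below is a simplified stand-in for what the researchers describe, using my own choices of cross-correlation for the time alignment and a plain weighted mean for the combination:

```python
import numpy as np

def align_and_average(signals, weights):
    """Combine motion signals from different scales/orientations.

    Each signal may lag the first by a few samples; cross-correlate to
    find the shift, align, then take a weighted average so the signals
    reinforce rather than cancel (destructive interference).
    """
    ref = signals[0]
    aligned = []
    for s in signals:
        # Lag that maximizes correlation with the reference signal.
        corr = np.correlate(s, ref, mode="full")
        lag = corr.argmax() - (len(ref) - 1)
        aligned.append(np.roll(s, -lag))
    aligned = np.asarray(aligned)
    w = np.asarray(weights, dtype=float)[:, None]
    return (w * aligned).sum(axis=0) / w.sum()
```

Without the alignment step, a half-cycle lag between two otherwise identical signals would cancel them almost completely; with it, they add constructively.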
If one could recover audio from silent movies, why not recover audio from 8mm and 16mm home movies? If this is possible, it opens up a world of historical possibilities. In the extreme, one might be able to use this visual microphone technology to reveal secrets from the infamous Zapruder home movie of the assassination of US President John F. Kennedy.
Unfortunately, as with any technology, visual microphones could be put to questionable uses, such as transcribing the keystrokes on a PC from visual (not audio) input alone. Corporate, national, and election espionage villains might use this technology to gain a previously unheard-of advantage (no pun intended).
|Image Source: Zapruder Film / David Erickson at Flickr Creative Commons|
John Blyler is a Design News senior editor, covering the electronics and advanced manufacturing spaces. With a BS in Engineering Physics and an MS in Electrical Engineering, he has years of hardware-software-network systems experience as an editor and engineer within the advanced manufacturing, IoT and semiconductor industries. John has co-authored books related to system engineering and electronics for IEEE, Wiley, and Elsevier.