

System for Capturing High-resolution 3D Video Images

Munro Design & Technologies, LLC

Abstract
Presented is a non-scanning non-stereoscopic system shown to be capable of capturing 3D images at video frame rates, with VGA cross-range resolution and millimeter radial accuracy. The camera uses off-the-shelf components, is compact, and is simple to manufacture and use.

The 3D video imager operates by emitting sinusoidally-modulated light from a small bank of LEDs, whose emission envelope illuminates the entire target scene. The back-reflected light is collected by an imaging lens, which focuses the light onto the photocathode of a micro-channel plate image intensifier (MCP-II) through a narrow-pass optical filter. The image intensifier amplifies the relatively weak optical signal by several orders of magnitude. A second function of the MCP-II is to capture a nearly instantaneous sample of the received sinusoidally-modulated light back-reflected from the target scene.

Once an intensified image-sample is captured, it is then momentarily stored on the MCP-II’s fluorescent screen, whose image is re-imaged onto a high-speed CMOS image sensor. Subsequent samples of the intensified back-reflected light at the same equivalent time are collected and additively stored within the wells of the image sensor to improve the SNR. The image sensor is then read out by a digital processor. Four such aggregated image-samples are captured and read out in sequence, each aggregated image-sample being 90° apart along the sinusoid in equivalent time. A four-point DFT (Discrete Fourier Transform) is executed on each pixel of the four image-samples, and the phase and amplitude of the returned signal are computed for each pixel. The radial distance for each pixel is then determined by use of the formula Distance = cφ/(4πf), where c is the speed of light, φ is the phase of the received signal, and f is the frequency of the emission modulation.

I Introduction
Conventional digital cameras capture two-dimensional images in which the horizontal and vertical spatial information of a target scene is encoded in a two-dimensional array of pixel values. Imaging devices that can also capture the third dimension – depth – would be of considerable value and have been the topic of numerous research papers over the past few decades.

The earliest papers were generally concerned with adapting laser ranging methods in which a laser is scanned across the target scene and the 3D image constructed from the intensity and time delay of the back-reflected light [1]. These systems are common in the marketplace, but generally suffer from low resolution (i.e., low pixel density) and low speed (i.e., low frames per second).

More recently, with the advent of low-cost yet high-powered digital processors, stereoscopic 3D imagers are also now common in the marketplace. Stereoscopic 3D imagers work on the principle of triangulation, in which the images from a pair of separated conventional digital cameras are mathematically fused to form a 3D image [2]. Stereoscopic 3D imagers suffer from two important drawbacks: 1) both cameras must be able to see all portions of a target scene, and any nooks or recesses that are hidden from view of either camera will not be a part of the 3D image, and 2) in order to obtain good depth accuracy the two cameras must be spaced widely apart, which increases the size of the imaging system and also exacerbates the hidden-view problem of 1).

To overcome the limitations of stereoscopic and scanning rangefinder 3D imagers, the so-called Flash 3D imager was introduced [3]. These 3D imagers work by emitting an intense pulse of light that illuminates the entire object scene, and then range gating the received signal on a specialized image sensor to extract the distance and intensity information. Flash cameras are fast, and range accuracies on the order of an inch have been reported. However, the specialized image sensors currently limit the spatial resolution to 128 x 128 pixels.

A newer approach to 3D imaging entails emitting temporally modulated light that also illuminates the entire object scene, receiving the back-reflected light on a high-speed image sensor, and then processing the received image to extract the phase shift (i.e., distance) and intensity information for each pixel [4]. These types of 3D imagers are faster and have higher pixel density than the scanning and stereoscopic imagers, and do not suffer the occluded-pixel problem of stereoscopic imagers since the light source can be co-located with the receiver. However, the weak back-reflected light limits the range to only a few meters, video frame rates have not yet been realized, and the distance accuracy is only 10mm at best.

Recently, the introduction of a new ranging algorithm [5] has opened up the possibility of overcoming the speed, resolution, and range limitations of the previous method; its use in a 3D video imager is described in the following sections of this paper.

The present 3D imager’s hardware can be grouped into three sections: a Control Section, a Transmission Section, and a Receiver Section, which are discussed in turn below and illustrated in Figure 1.

II Control Section
The Control Section is responsible for generating two key digital signals that are synchronized with one another. One signal is output to the LED Driver and is the source signal from which the LED output is modulated. The second signal is output to the Micro-Channel Plate Gate Driver, which in turn causes the MCP Image-Intensifier to switch from its normally Off state to On then back to Off.

Both output signals must be coherent with one another, and are generated by a Xilinx XC6SLX16 Spartan-6 FPGA on an off-the-shelf SP601 evaluation board, which utilizes a low-jitter 200MHz non-PLL clock oscillator as the timing source.

Figure 1  Hardware Block Diagram of the distance camera

The FPGA’s signal output to the LED Driver is a 50MHz square wave that is easily generated from the SP601’s 200MHz oscillator. The signal output to the MCP Gate Driver relies upon the 5ns period of the 200MHz clock to generate the narrow gate pulses required for fast MCP switching. Furthermore, as will be described later, the four-point DFT process requires that the received 50MHz signal be sampled at points 90° apart in time, or in equivalent time. This can only be accomplished if a timing signal of at least four times the 50MHz frequency is available, which again is provided by the 200MHz FPGA clock. Lastly, code is provided within the FPGA’s programming to shift the MCP Gate pulse timing by N×90° (N = 0, 1, 2, or 3 for a four-point DFT) so that the four phase samples can be collected.
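As a quick check of these relationships, the Python sketch below (illustrative only; the actual timing logic resides in the Spartan-6 FPGA) shows that one 5ns master-clock cycle equals exactly one quarter of the 20ns modulation period, so delaying the MCP gate by N master-clock cycles shifts the sample point by N×90°:

    # Illustrative timing arithmetic only; the real timing is generated in the FPGA.
    MASTER_CLK_HZ = 200e6      # SP601 master clock
    MOD_FREQ_HZ = 50e6         # LED modulation frequency

    master_period_ns = 1e9 / MASTER_CLK_HZ   # 5 ns
    mod_period_ns = 1e9 / MOD_FREQ_HZ        # 20 ns

    for n in range(4):                                 # N = 0, 1, 2, 3
        shift_ns = n * master_period_ns                # gate delay of N master-clock cycles
        shift_deg = 360.0 * shift_ns / mod_period_ns   # resulting phase step
        print(f"N = {n}: gate delayed {shift_ns:.0f} ns = {shift_deg:.0f} degrees")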

III Transmission Section
The purpose of the transmission section of the camera is to emit sinusoidally modulated light that is monochromatic or substantially monochromatic. The Transmission Section consists of the LED Driver functional block and 40 LEDs; its input is the 50MHz square wave output by the FPGA. The LED Driver consists of an input comparator that removes any imperfections in the input signal, followed by two low-pass filter stages that convert the binary output of the comparator to a high-fidelity sinusoid. The output of the low-pass filter is directed to four OPA483 op amps, each of which drives a MOSFET connected to two Osram SFH4236 LEDs. The SFH4236s emit at 850nm, and are one of the few LEDs on the market that can be modulated at 50MHz.

The SFH4236s have integrated lenses that focus the LED emission into a pattern having a 20° half-power width, and the emission illuminates the entire scene that is to be imaged. The total power emitted by the LEDs is approximately 12 Watts, with a 70% depth of modulation.

IV Receiver Section
The purpose of the Receiver Section is to receive and image as much back-scattered light from the target as possible, intensify and sample the optical signal, and then convert the sampled optical image to an electronic image format that can be digitally processed.

At the front end of the Receiver is a Nikon f/2 50mm lens that serves to collect as much light from the target as possible and focus it onto the photocathode of an MCP-II through a narrow-pass optical filter. The optical filter is needed to prevent ambient light from entering the system, and its passband is centered on the 850nm emission wavelength of the LEDs. It is important to note that the received light that is imaged onto the photocathode retains the 50MHz sinusoidal modulation waveform, although the phase of the waveform varies across the face of the photocathode in accordance with the spatially varying distance of the target object.

The photocathode in turn converts the received optical signal into an electronic signal that varies spatially and temporally in accordance with the received optical signal. If the MCP is gated On, the electrons generated by the photocathode are then collected by the MCP, which then intensifies the relatively weak electronic signal by many orders of magnitude. An equally important purpose of the MCP is to act as an electronic sampler. It is well known that an MCP, by way of the voltage applied to it, can turn from Off to On and back to Off in a matter of nanoseconds. This switching property, in which the MCP is On for only a few nanoseconds, can be used to sample – both temporally and spatially – the varying received electronic signal across the entire image at once. The resulting sampled and amplified electronic signal is then directed onto a fluorescent screen where the amplified electronic image exiting the MCP is converted back into photons. The photon flux varies spatially across the fluorescent screen in accordance with the amplitude of the sampled and amplified signal output by the MCP.
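To illustrate why the gate must be short compared with the 20ns modulation period, the sketch below (an illustration of the classical aperture effect, not code from the paper) models each sample as the average of a unit-amplitude sinusoid taken over the gate width; the shorter the gate, the closer the result is to the true instantaneous value:

    import math

    MOD_PERIOD_NS = 20.0   # period of the 50 MHz modulation

    def aperture_factor(gate_ns):
        """Fraction of the full modulation amplitude recovered when the 'sample'
        is really an average over a finite gate width (classical aperture effect)."""
        x = math.pi * gate_ns / MOD_PERIOD_NS
        return math.sin(x) / x

    for gate_ns in (2.0, 5.0, 10.0):
        print(f"{gate_ns:4.1f} ns gate -> {aperture_factor(gate_ns):.3f} of full modulation depth")

By this estimate a 5ns gate still preserves about 90% of the modulation depth, whereas a 10ns gate preserves only about 64%.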

The MCP-II employed in the distance camera is the XX1450ALS by Photonis of Lancaster, Pennsylvania (USA) which has an active area diameter of 18mm, an S25 photocathode having a sensitivity of 48.2mA/W at 850nm, an XR5 style MCP having a channel diameter of 2µm, and a resolution of 69 lp/mm. The fluorescent screen utilizes a P46 phosphor, which has a decay time of 100ns.

The photons emitted by the fluorescent screen are next imaged onto a high-speed monochrome imager through an optically fast relay lens. The imager is a Phantom v7.3 manufactured by Vision Research of Wayne, New Jersey (USA). It is capable of capturing over 11,000 frames per second at a resolution of 512 x 512 pixels, has a 2µs exposure time, and a bit depth of 14 bits.

The images captured by the Phantom v7.3 are then downloaded to a PC over a gigabit ethernet line, whereupon the captured images are filtered and processed with a DFT algorithm to compute the phase and amplitude for every pixel of every frame of video.

V Image Processing
Refer to Figure 2, below, which is a timing diagram that illustrates the processing needed for determining the distance and amplitude of a representative pixel of a 3D image. The uppermost waveform seen in Figure 2, the Master Clock Waveform, is simply the 200MHz clock of the FPGA. As mentioned earlier, this is the source for all of the timing within the distance camera, and is the source of synchronization between the light emissions and the sampling of the back-reflected received light at the MCP-II. The second waveform shown in Figure 2, the Emitted Light Waveform, is the optical illumination produced by the bank of LEDs, and is light that is amplitude modulated with a high-fidelity 50MHz sinusoid. The phase of the 50MHz modulation is not allowed to vary or drift with respect to the phase of the 200MHz Master Clock Waveform.

Figure 2   Timing diagram of key signals within the 3D video imager

The next waveform shown in Figure 2 is the Received Signal, which is the optical signal incident on a differentially-small area of the photocathode. Note that it is still amplitude modulated with a 50MHz sinusoid, but that its phase has been shifted with respect to the light emitted by the LEDs in the preceding waveform. This phase shift is due solely to the round-trip travel time of the light from the centroid of the LEDs to the object location corresponding to that differentially-small area of the photocathode, and back to the photocathode. The phase shift will be Δφ = 2πfd/c, where f is the 50MHz modulation frequency, d is the round-trip distance, and c is the speed of light. The quantity c/f is the wavelength, λ, of the modulation, which is 6 meters for a 50MHz frequency. As a quick example, if the round-trip distance is 3.0 meters, then the phase shift, Δφ, will be π radians.
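A minimal numerical sketch of this relationship (using the exact value of c, whereas λ is rounded to 6 meters in the text; the function names are illustrative only):

    import math

    C = 299_792_458.0        # speed of light, m/s
    F_MOD = 50e6             # modulation frequency, Hz
    WAVELENGTH = C / F_MOD   # modulation wavelength, roughly 6 m

    def phase_shift(round_trip_m):
        """Phase shift of the received 50 MHz envelope, in radians."""
        return 2 * math.pi * F_MOD * round_trip_m / C

    def round_trip_from_phase(delta_phi):
        """Invert the relationship: round-trip distance from the measured phase."""
        return delta_phi * C / (2 * math.pi * F_MOD)

    print(phase_shift(3.0) / math.pi)       # roughly 1.0, i.e. about pi radians for a 3 m round trip
    print(round_trip_from_phase(math.pi))   # roughly 3.0 m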

The fourth waveform shown in Figure 2 is the MCP Sampling Signal, which is the voltage applied between the photocathode of the MCP-II and the input face of the MCP. For the Photonis XX1450ALS, a voltage of 40V turns Off the MCP-II, while a voltage of -200V will turn On the MCP-II. Since the period of the 50MHz sinusoidal modulation is only 20ns, and we wish to obtain samples of this sinusoid so its phase can be determined, an MCP-II On duration of less than 5ns is highly desirable. The timing of the pulses of the MCP Sampling Signal is of paramount importance. Note that the time between sample pulses is greater than the period of the 50MHz sinusoidal modulation, and that the time between sample pulses varies. This is the essence of the equivalent time sampling mechanism employed in the distance camera for capturing the four phase images of the returned light, from which a DFT can be executed for each pixel. More particularly, the time between MCP sample pulses is 2000 + 5N nanoseconds, where N = 0, 1, 2, or 3. The FPGA can readily produce the timing for this signal, with low jitter, from the 200MHz Master Clock Waveform, with which the sampling pulses are tightly synchronized.
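Because 2000ns is an exact multiple of the 20ns modulation period, only the additional 5·N ns determines where each gate pulse lands on the sinusoid; the small sketch below (illustrative arithmetic only) confirms that the four samples fall 0°, 90°, 180°, and 270° along the modulation cycle:

    MOD_PERIOD_NS = 20.0    # period of the 50 MHz modulation

    for n in range(4):                               # N = 0, 1, 2, 3
        delay_ns = 2000 + 5 * n                      # pulse spacing given in the text
        offset_ns = delay_ns % MOD_PERIOD_NS         # position within one 20 ns cycle
        print(f"N = {n}: sample lands {offset_ns:.0f} ns into the cycle "
              f"= {360.0 * offset_ns / MOD_PERIOD_NS:.0f} degrees")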

When the MCP-II turns On and then Off in accordance with the MCP Sampling Signal, the optical image incident on the photocathode at that instant is intensified and directed on to the fluorescent screen. Then when the MCP-II is turned Off, the image on the fluorescent screen is frozen and then subsequently decays in intensity as a function of the decay time of the phosphors comprising the screen. This effect, for a pixel-sized area, is illustrated by the Fluorescent Screen Intensity waveform of Figure 2. It is important to note that the peak intensity of the exponential pulses comprising the Fluorescent Screen Intensity waveform is not only a function of the reflectivity of the corresponding location of the target scene, but also of the location in time of the equivalent time sample of the received light. That is, for a given pixel, if the equivalent time sample is made near the peak of the sinusoidally-modulated return signal, then the fluorescent screen will be bright at that location, and conversely will be dim if the sample was made near the trough of the signal.

The sampled image available at the fluorescent screen is then captured with the Phantom v7.3 high-speed camera through an optically fast relay lens. The FPGA provides a trigger signal to the camera which causes its electronic shutter to open for 2µs, thereby capturing substantially all of the energy available to it before the fluorescent screen image decays to zero. The charge that a representative pixel of the high-speed camera collects during the exposure period is shown in the Phantom v7.3 Pixel Charge waveform of Figure 2. Note that the magnitude of the pixel charge, and consequently the voltage produced by the v7.3 at that pixel, is proportional to the area under the curve of an exponentially decaying fluorescent pulse of the Fluorescent Screen Intensity waveform. Next, a read strobe within the high-speed camera, the Phantom v7.3 Pixel Readout Strobe waveform of Figure 2, causes the pixel charge to be read and converted to a voltage, resulting in the Sampled Pixel waveform which is then available for subsequent processing.
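As a rough check that the 2µs exposure captures essentially all of the energy of a single fluorescent pulse, the sketch below treats the P46 screen's quoted 100ns decay time as an exponential time constant (an assumption made only for this estimate):

    import math

    TAU_NS = 100.0        # P46 decay, treated here as an exponential time constant (assumption)
    EXPOSURE_NS = 2000.0  # 2 us electronic shutter of the high-speed camera

    captured_fraction = 1.0 - math.exp(-EXPOSURE_NS / TAU_NS)
    print(f"{captured_fraction:.9f}")   # effectively 1.0: the full pulse energy is integrated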

Note that the Sampled Pixel waveform has four voltage levels, denoted as X0, X1, X2, and X3, whose values, along with the corresponding four-level voltages from the other pixels, are uploaded from the high-speed camera to the PC for processing. That is, four images are uploaded to the PC: one having X0 phase data, one having X1 phase data, one having X2 phase data, and finally one having X3 phase data. These four images could be in 16-bit monochrome bitmap (.bmp) format, for example, although the Phantom v7.3 camera lumps them all into one image file (Vision Research’s .cine format) for simplified uploading.

Figure 3   Timing diagram showing how four samples are collected per frame of the high-speed camera

At the PC, a custom Windows program, written in National Instruments LabWindows/CVI, opens the .cine file and extracts the four phase images. Next, a four-point DFT is executed using the data in the four phase images as input to compute the phase for each pixel. The mathematics of a four-point DFT operating on the fundamental frequency are quite simple, and the phase of the received signal at a given pixel is

Equation (1)

where m denotes the m’th pixel of the 262,144 pixels comprising a 512 x 512 pixel image. From this phase, the distance to the target object corresponding to that pixel is:

Equation (2)

where D is the distance, λ is the wavelength of the modulation, and Δφ is the difference in phase of the received signal relative to the phase of the sinusoidally modulated emission at the centroid of the LEDs. Generally the transmitted phase is taken to have a value of 0°, in which case Δφ = φ. Equations (1) and (2) can be combined as:

Equation (3)

While Equation (3) appears to be computationally expensive, especially since it must be performed for every pixel, in reality it can be easily implemented in a look-up-table (LUT) in which the argument of the inverse tangent is input, and the distance D is found.

The amplitude of the received signal at each pixel, corresponding to the reflectance of the object at that pixel location, is also needed. For a four-point DFT the amplitude of the m’th pixel is

Equation (4)

A LUT can also be used for the square root function of Equation 4. When the distance and amplitude for every pixel of the object have been computed, an entire 3D image of the target will have been created.
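Because Equations (1) through (4) are not reproduced above, the sketch below shows one standard formulation of the per-pixel four-point DFT, written here in Python with NumPy rather than in the camera's actual LabWindows/CVI code; the sign convention and any calibration offsets may differ from the paper's equations, and the random input frames merely stand in for real phase images:

    import numpy as np

    C = 299_792_458.0          # speed of light, m/s
    F_MOD = 50e6               # 50 MHz modulation frequency
    WAVELENGTH = C / F_MOD     # roughly 6 m

    def phase_amplitude_distance(x0, x1, x2, x3):
        """Per-pixel four-point DFT at the fundamental frequency.

        x0..x3 are the four phase images (2-D arrays sampled 90 degrees apart).
        Sign conventions vary; this follows the textbook DFT definition and may
        differ from the paper's Equations (1)-(4) by a sign or a constant offset.
        """
        re = x0.astype(np.float64) - x2        # real part of the fundamental bin
        im = x3.astype(np.float64) - x1        # imaginary part of the fundamental bin
        phase = np.arctan2(im, re)             # radians, per pixel
        amplitude = 0.5 * np.hypot(re, im)     # proportional to target reflectance
        # D = lambda * delta_phi / (4*pi), with the phase wrapped into [0, 2*pi)
        distance = WAVELENGTH * np.mod(phase, 2.0 * np.pi) / (4.0 * np.pi)
        return phase, amplitude, distance

    # Random 512 x 512 14-bit frames standing in for the four uploaded phase images
    rng = np.random.default_rng(0)
    frames = [rng.integers(0, 2**14, size=(512, 512), dtype=np.uint16) for _ in range(4)]
    phase, amp, dist = phase_amplitude_distance(*frames)
    print(dist.shape, float(dist.min()), float(dist.max()))

In a real-time implementation, the arctangent and the λ/4π scaling can be folded into the look-up-table approach described above.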

VI Improved Alternate Processing
One weakness of the signal processing method described above is that only one of the four DFT samples is collected per frame of the high-speed camera. Although in principle this would allow for one 3D image to be captured every 364µs (i.e., a frame rate of 11,000/4 frames per second, limited by the frame rate of the v7.3 camera), the resulting distance image will have noisier pixels, with a standard deviation of well over 25mm. Since a goal is to attain 1mm distance performance, the high frame rate can be traded off to reduce the measurement variance and improve accuracy. There are two ways to do this, and both are implemented in the camera.

The first way to reduce measurement variance is to simply collect several frames of phase data and average them together as part of the image processing algorithm. If the goal is to achieve a 3D video frame rate of 30 frames/second, and the high-speed camera can capture images at 11,000 frames/second, then the number of raw frames that can be averaged into each phase image is P = 11,000 raw frames/s ÷ (30 3D frames/s × 4 phase frames per 3D frame) = 91.67. Since a partial frame is not realizable, P, the number of high-speed camera frames averaged together to obtain a low-noise phase frame, is 91.
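The same frame-budget arithmetic, expressed as a short sketch with the values given above:

    RAW_FPS = 11_000     # Phantom v7.3 raw frame rate
    VIDEO_FPS = 30       # target 3D video frame rate
    PHASE_FRAMES = 4     # one phase image per DFT sample

    P = RAW_FPS // (VIDEO_FPS * PHASE_FRAMES)   # whole raw frames averaged per phase image
    print(P)                                    # 91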

The second way to reduce measurement variance is to collect several image samples – of the same phase – from the MCP-II per image frame of the v7.3 high-speed camera. Note that Figure 2 shows the timing associated with capturing one of the four phase images per frame of the high-speed camera; Figure 3 shows the timing of a system in which the electronic shutter of the high-speed camera is left open over four MCP-II samples.

As described in connection with Figure 2, the photons comprising the Fluorescent Screen Intensity waveform are imaged onto the high-speed camera. Now, however, the High-Speed Camera Readout Strobe of Figure 3 occurs after several “sub-samples” have been taken and presented to the high-speed camera by the fluorescent screen. Note that for each sub-sample, the amplitude of the Pixel Charge waveform increases in an additive process, thereby increasing the magnitude of the sampled pixel signal. At the same time, the accumulation of charge over multiple sub-samples also performs an averaging function in which the noise-induced variations of the Fluorescent Screen Intensity pulses are reduced. The reduction of noise in this manner further improves the SNR. In the current implementation, in which the frame rate is 11,000 frames per second, the exposure time of the high-speed camera is 90µs and some 45 MCP-II image samples are collected per phase frame.
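A quick consistency check of these numbers, assuming the MCP gate pulses repeat roughly every 2µs (i.e., every 2000 + 5N nanoseconds):

    EXPOSURE_US = 90.0     # electronic shutter per phase frame in this mode
    SUBSAMPLE_US = 2.0     # approximate MCP gate-pulse spacing (2000 + 5N ns)

    subsamples = int(EXPOSURE_US // SUBSAMPLE_US)
    print(subsamples)      # 45 sub-samples accumulated on the sensor per phase frame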

Figure 4   A view inside the 3D video imager showing, clockwise from the upper left, the v7.3 high-speed camera, the relay lens, the MCP gate driver circuitry, and the off-the-shelf FPGA board.

VII Results
Testing shows that we have achieved the 1mm distance accuracy goal while also meeting all other requirements of the 3D imager, including video frame rates and VGA-level resolution. Figure 4 shows a side view of the inside of the imager, showcasing the v7.3 camera, internal relay lens, MCP gate driver circuitry, and the Xilinx FPGA board. Figure 5 is a frontal oblique view of the imager, whose 8”×18”×15” (200×460×380mm) size can be significantly reduced with further development. Most importantly, Figure 6 presents a series of three 3D images (amplitude and false-color distance) captured at video speed. The subject is Brooklyn, the author’s dog, who was startled and flinched during the imaging process. Note that the co-axial illumination / imaging axis allows for all portions of Brooklyn to be imaged, including those areas in the (blue) background that would be hidden from one of two stereoscopic cameras. Also note a few dark patches in the upper right image around her muzzle where the amplitude was insufficient to generate a reliable depth estimate.

Provisions were also made for shortened shutter speeds. Specifically, in addition to the 32ms shutter times for video frame rates, the 3D camera is also capable of 16ms and 8ms shutter speeds, although the noise present in the images does become more pronounced.

Figure 5   The 3D video imager

VIII Future Considerations
There are several enhancements and adaptations that can be made to the 3D video camera described in this paper. For example, the power of the LED emission can be easily increased to achieve a longer measurement range. The LED beam emission width can also be narrowed, or the LEDs can be replaced with a laser, to increase the range more efficiently. The resolution of the camera is currently limited by the performance of the off-the-shelf Nikon lenses: the objective lens is not designed for imaging 850nm light, and the internal relay lens is being made to operate at non-ideal imaging conjugates. Better lenses with optimal performance will shift the image-quality bottleneck to the MCP-II, from which SVGA image quality can be realized. Alternatively, the frame rate, accuracy, and resolution can all be traded off for a lower-cost embodiment.

The physical size of the camera is largely determined by the size of the off-the-shelf Phantom v7.3 camera; the use of a custom high-speed camera, or an off-the-shelf camera with lower speed, can greatly reduce the package size. Indeed, our technology roadmap concludes with a 3D imager that utilizes an electron-bombardment intensifier / image sensor that is compatible with smart-phone sized packages.

Figure 6   Three successive intensity and distance images of Brooklyn, the author’s dog, in motion.  The images were captured at a rate of 30 frames per second.

References

  1. Lei, et al., “A 3D Scanning Laser Rangefinder and its Application to an Autonomous Guided Vehicle,” Vehicular Technology Conference Proceedings, VTC 2000-Spring Tokyo, IEEE 51st, Vol. 1, 2000.
  2. Rao, et al., “Digital Stereoscopic Imaging,” Stereoscopic Displays and Applications X, IS&T/SPIE, San Jose, CA, January 1999.
  3. Dorrington, et al., “An Evaluation of Time-of-Flight Range Cameras for Close Range Metrology Applications,” International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XXXVIII, Part 5, Commission V Symposium, Newcastle upon Tyne, UK, 2010.
  4. Stettner, et al., “Eye-safe Laser Radar 3D Imaging,” http://advancedscientificconcepts.com/technology/documents/Eye-safepaper.pdf
  5. Munro, “Low-cost Laser Rangefinder with Crystal-Controlled Accuracy,” Optical Engineering 44(2), 023605, February 2005.

About the Author
James F. Munro received a BSEE with High Honors from Michigan Technological University in 1982. He also earned an MS, Optics, from the University of Rochester in 1990, and an MBA from the William E. Simon Graduate School of Business, also at the University of Rochester, in 2000. Past positions include optical, electrical, and software engineering, up to and including the title of VP, Engineering. Mr. Munro has been awarded 21 US patents, and is a co-founder and CTO of Munro Design & Technologies, LLC located in the Rochester, NY area.
