ScNat
In Cam Audio Lag

Subframe audio-video sync measurement

I designed and built an apparatus to measure the synchronisation between the in-camera audio stream and the video stream, as recorded on the camera's memory card, in order to assess the performance of syncing programs.

In any NLE software, one can verify that the sound is in sync with the picture by stepping frame by frame. But how can synchronisation be measured with a precision better than one frame duration? The method described here successfully determined the sync error of some consumer models with a precision of 2 milliseconds:

Experimental Setup

A mechanical slate generates events filmed by the camera. The measurement must be repeated many times to evaluate its reproducibility, so a rotating arm turning at 30 RPM serves as the slate and is recorded for a minute or two, generating some 30-60 clicking events, all processed to build valid statistical distributions.

No sound is actually produced: at each turn, an electrical contact sends a 100 mV pulse directly into the camera's microphone input. Before running the experiment, a static calibration shot records the rotating arm's reference position, i.e. the exact angular position at which the electrical click occurs.
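In an otherwise silent recording, these pulses stand far above the noise floor, so a simple threshold on the rectified signal is enough to time them. As a minimal sketch (the function name, the NumPy dependency, and the 0.5 s merging window are my assumptions, not the article's script):

```python
import numpy as np

def detect_pulse_times(samples, sample_rate, threshold=0.5):
    """Return onset times (in seconds) of the click pulses in a mono track.

    `threshold` is a fraction of the track's peak amplitude; the recording
    is silent apart from the 100 mV pulses, so this separates them cleanly.
    """
    envelope = np.abs(np.asarray(samples, dtype=float))
    level = threshold * envelope.max()
    above = envelope >= level
    # Rising edges: samples that cross the threshold upward.
    onsets = np.flatnonzero(above[1:] & ~above[:-1]) + 1
    if onsets.size == 0:
        return np.array([])
    # Keep one onset per turn: at 30 RPM pulses are 2 s apart, so merge
    # any edges closer than 0.5 s (they belong to the same pulse).
    kept = [onsets[0]]
    for i in onsets[1:]:
        if i - kept[-1] > 0.5 * sample_rate:
            kept.append(i)
    return np.array(kept) / sample_rate
```

Each returned time is an audio time $T_a$ candidate, one per arm revolution.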

Principle of operation

Successive arm positions feed a linear regression that interpolates the precise time at which the arm passes through the reference position. This time (the video time, $T_v$) is compared to the time at which the sound pulse is detected in the audio recording ($T_a$, the audio time). Ideally these times should coincide, and the sync error

$\Delta t = T_a - T_v$

should be zero, or at the very least much smaller than a frame duration. If $T_a < T_v$, the audio click happens before the arm reaches the reference position, so the audio is leading. Hence cameras with lagging audio have a positive $\Delta t$, and leading audio gives $\Delta t < 0$.
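Since the arm turns at constant speed, angle is linear in time and the regression reduces to a first-order fit. A sketch of this step (function names and the constant-speed assumption are mine; the article's script may differ):

```python
import numpy as np

def video_reference_time(frame_times, angles, ref_angle):
    """Fit a least-squares line angle = slope * t + intercept through the
    measured (time, angle) pairs of one pass, then solve for the time at
    which the fitted angle equals the calibration reference angle."""
    slope, intercept = np.polyfit(frame_times, angles, 1)
    return (ref_angle - intercept) / slope

def sync_error(t_audio, t_video):
    """Delta t = T_a - T_v: positive means the audio lags the video."""
    return t_audio - t_video
```

With ~10 frames per pass, the fit averages out the per-frame measurement noise, which is what pushes the precision below one frame duration.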

Automated image and sound processing

About ten frames are used for each pass (five before the click and five after), and for each frame the angular position of the arm must be measured with the best available spatial precision (at the pixel level). Doing this manually would be tedious, so the video frames and the sound track are analysed with a Python script. To simplify the image processing, high-contrast videos are shot in complete darkness with a dimly lit LED fixed to the rotating arm. Motion blur turns this LED into an arc of a circle. In such low light the camera selects its longest exposure, one full frame (almost a 360-degree shutter angle), so successive arcs almost touch each other.
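Because the scene is dark except for the LED, locating it in a frame can be as simple as taking the intensity-weighted centroid of the brightest pixels, which gives a sub-pixel position for the middle of the exposure. A minimal sketch under that assumption (the thresholding choice is mine, not the article's):

```python
import numpy as np

def led_position(frame, frac=0.5):
    """Sub-pixel LED position in a grayscale frame.

    Pixels brighter than `frac` times the frame maximum are assumed to
    belong to the LED arc; their intensity-weighted centroid is returned
    as (x, y). Works because everything else in the frame is near black.
    """
    level = frac * frame.max()
    ys, xs = np.nonzero(frame >= level)
    weights = frame[frame >= level].astype(float)
    x = np.average(xs, weights=weights)
    y = np.average(ys, weights=weights)
    return x, y
```

The arm angle for that frame then follows from the $(x, y)$ centroid relative to the rotation axis, e.g. with `np.arctan2`.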

The steps are listed below (after manually finding the $(x, y)$ reference position of the LED in the 'calibration' static sequence):