Architecture for Encoder

The H.264 hardware encoder should be designed as a modular system of small, efficient components, each performing a well-defined task.

Input:
Video frames fed into the encoder are written to (external) SRAM for temporary buffering. From there they are read into the prediction components (intra prediction or inter prediction) as required.

Output:
The output of the prediction components is passed through the transform/quantize (core-coding) loop and sent to the entropy encoder. The core-coding output is also fed to the de-quantize, inverse-transform and reconstruct loop. The reconstructed pixels are needed to predict the following blocks, and they are also written to (external) SRAM for use with the next frame.
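The closed loop described above can be illustrated with a toy model. This is only a sketch under loud assumptions: a plain uniform quantizer stands in for the whole transform/quantize stage, each block is predicted from the previous reconstructed block, and the names QSTEP, encode_blocks and decode_blocks are hypothetical. It shows why the encoder predicts from reconstructed pixels rather than from the original input: a decoder that sees only the quantized levels then rebuilds exactly the same reference.

```python
QSTEP = 8  # hypothetical quantizer step size (assumption, not from the article)

def quantize(residual, qstep=QSTEP):
    # Stand-in for the transform/quantize (core-coding) stage.
    return [int(round(r / qstep)) for r in residual]

def dequantize(levels, qstep=QSTEP):
    # Stand-in for the de-quantize / inverse-transform stage.
    return [level * qstep for level in levels]

def reconstruct(prediction, recon_residual):
    # Add the decoded residual to the prediction and clip to 8-bit range.
    return [max(0, min(255, p + r)) for p, r in zip(prediction, recon_residual)]

def encode_blocks(blocks):
    """Encode blocks in order, predicting each from the previous reconstruction."""
    reference = [128] * len(blocks[0])        # initial predictor (mid-grey)
    levels_out, recon_frame = [], []
    for block in blocks:
        residual = [s - p for s, p in zip(block, reference)]
        levels = quantize(residual)           # forward path
        levels_out.append(levels)             # would go to the entropy encoder
        recon = reconstruct(reference, dequantize(levels))  # reconstruction path
        recon_frame.append(recon)             # would be written back to SRAM
        reference = recon                     # predictor for the next block
    return levels_out, recon_frame

def decode_blocks(levels_in, width):
    """Decoder-side mirror of the reconstruction path."""
    reference = [128] * width
    decoded = []
    for levels in levels_in:
        reference = reconstruct(reference, dequantize(levels))
        decoded.append(reference)
    return decoded
```

Because both sides derive the reference from the same quantized data, decode_blocks(levels, width) returns the same pixels the encoder stored as its reconstructed frame.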

 

   

1) Intra Prediction

2) Inter Prediction

3) Forward Transform

4) Quantize

5) CAVLC

6) Deblocking

 

 
 

Forward Transform

Concept:

The forward transform module groups three different two-dimensional transforms.

The first transform is an integer approximation of the 2-D DCT, applied over 4x4 input blocks. The other two are 2-D Hadamard transforms, one applied over 4x4 input blocks and the other over 2x2 input blocks.
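The three transforms can be written as small integer matrix products of the form Y = M X M^T. The sketch below uses the matrices commonly given for the H.264 4x4 core transform and for the Hadamard transforms; the function names are hypothetical, and any scaling (such as the halving often applied after the 4x4 Hadamard, or the factors folded into quantization) is deliberately left out.

```python
# Integer approximation of the 4x4 DCT (core transform matrix).
CF = [[1,  1,  1,  1],
      [2,  1, -1, -2],
      [1, -1, -1,  1],
      [1, -2,  2, -1]]

# 4x4 and 2x2 Hadamard matrices.
H4 = [[1,  1,  1,  1],
      [1,  1, -1, -1],
      [1, -1, -1,  1],
      [1, -1,  1, -1]]
H2 = [[1,  1],
      [1, -1]]

def matmul(a, b):
    # Plain integer matrix product of two square or compatible matrices.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def transform_2d(m, x):
    # Separable 2-D transform: Y = M * X * M^T (unscaled).
    return matmul(matmul(m, x), transpose(m))

def forward_dct4x4(x):
    return transform_2d(CF, x)

def hadamard4x4(x):
    return transform_2d(H4, x)

def hadamard2x2(x):
    return transform_2d(H2, x)
```

For example, a flat 4x4 block of value 10 yields a single non-zero coefficient of 160 at position (0, 0), which is the DC term.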

First, the luma macroblock is processed by the 2-D DCT. If the prediction mode was intra 16x16, the DC coefficients of the 2-D DCT results are then processed by the 4x4 2-D Hadamard transform. In this case, the first block sent to the output contains the results of the 4x4 2-D Hadamard transform.
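A minimal sketch of the intra 16x16 luma path follows, assuming the macroblock arrives as a 16x16 list of residual samples. The sixteen 4x4 blocks are visited in simple raster order here, which is an assumption made for readability (a real encoder follows the block order defined by the standard), and the Hadamard result is left unscaled. The function name luma_intra16x16_transform is hypothetical.

```python
CF = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]
H4 = [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, -1, 1], [1, -1, 1, -1]]

def transform_2d(m, x):
    # Y = M * X * M^T for 4x4 integer matrices.
    n = len(m)
    return [[sum(m[r][i] * x[i][k] * m[c][k]
                 for i in range(n) for k in range(n))
             for c in range(n)] for r in range(n)]

def luma_intra16x16_transform(mb):
    """Return (dc_block, ac_blocks) for a 16x16 luma residual macroblock."""
    coeffs = []
    for by in range(4):                      # raster order: an assumption
        for bx in range(4):
            block = [row[4 * bx:4 * bx + 4] for row in mb[4 * by:4 * by + 4]]
            coeffs.append(transform_2d(CF, block))
    # Gather the DC coefficient of every 4x4 block into one 4x4 matrix.
    dc = [[coeffs[4 * by + bx][0][0] for bx in range(4)] for by in range(4)]
    dc_block = transform_2d(H4, dc)          # 4x4 Hadamard, unscaled here
    # The remaining AC coefficients, with the DC position cleared.
    ac_blocks = []
    for c in coeffs:
        ac = [row[:] for row in c]
        ac[0][0] = 0
        ac_blocks.append(ac)
    return dc_block, ac_blocks
```

The returned dc_block corresponds to the first block sent to the output; the ac_blocks follow it.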

Next, the 2-D DCT AC coefficients are sent to the output. If the prediction mode was not intra 16x16, the 4x4 Hadamard transform is not applied.

The chroma blocks are then processed by the 2-D DCT, and their DC results are processed by the 2x2 2-D Hadamard transform. The results of the 2x2 Hadamard for the Cb component are sent to the output first, followed by the results for the Cr component. Finally, the 2-D DCT AC chrominance coefficients are sent to the output, first the Cb coefficients and then the Cr coefficients.
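The chroma path and the overall output order can be sketched as follows. The code assumes 4:2:0 video, so that each chroma component of a macroblock is an 8x8 block containing four 4x4 blocks (this is what makes a 2x2 DC matrix). The function names and the string labels are hypothetical and serve only to make the ordering visible.

```python
CF = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]
H2 = [[1, 1], [1, -1]]

def transform_2d(m, x):
    # Y = M * X * M^T for small square integer matrices.
    n = len(m)
    return [[sum(m[r][i] * x[i][k] * m[c][k]
                 for i in range(n) for k in range(n))
             for c in range(n)] for r in range(n)]

def chroma_dc_path(chroma):
    """chroma: 8x8 residual block (Cb or Cr). Returns (dc_2x2, dct_blocks)."""
    blocks = [transform_2d(CF, [row[4 * bx:4 * bx + 4]
                                for row in chroma[4 * by:4 * by + 4]])
              for by in range(2) for bx in range(2)]
    dc = [[blocks[0][0][0], blocks[1][0][0]],
          [blocks[2][0][0], blocks[3][0][0]]]
    return transform_2d(H2, dc), blocks      # 2x2 Hadamard, unscaled here

def output_order(pred_mode):
    """Order in which coefficient groups leave the module for one macroblock."""
    order = []
    if pred_mode == "intra16x16":
        order.append("luma_dc_hadamard4x4")  # first block sent to the output
        order.append("luma_ac")
    else:
        order.append("luma_dct")             # no 4x4 Hadamard in this case
    order += ["cb_dc_hadamard2x2", "cr_dc_hadamard2x2", "cb_ac", "cr_ac"]
    return order
```

Calling output_order("intra16x16") lists six groups starting with the luma DC Hadamard block, while any other mode yields five groups starting with the luma DCT coefficients.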

Architecture:

The forward transform architectures were designed in a fully parallel fashion: the 4x4 2-D DCT and the 4x4 2-D Hadamard each consume 16 samples per clock cycle, and the 2x2 2-D Hadamard consumes 4 samples per clock cycle.
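One common way to build such a parallel 4x4 datapath is with butterflies that need only additions, subtractions and one-bit shifts. The sketch below is a software model of that idea, not the article's actual RTL: butterfly_1d models one row or column unit, and forward_dct4x4_butterfly applies it to all four rows and then all four columns, which in hardware could be instantiated side by side so that all 16 samples are processed together. A plain matrix version is included only as a reference for comparison.

```python
CF = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]

def butterfly_1d(x0, x1, x2, x3):
    # Stage 1: sums and differences of the outer and inner sample pairs.
    s0, s1 = x0 + x3, x1 + x2
    d0, d1 = x0 - x3, x1 - x2
    # Stage 2: outputs using only add, subtract and a shift by one bit.
    return (s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1))

def forward_dct4x4_butterfly(block):
    rows = [butterfly_1d(*row) for row in block]        # 1-D pass on each row
    cols = [butterfly_1d(*col) for col in zip(*rows)]   # 1-D pass on each column
    return [list(r) for r in zip(*cols)]                # back to row-major order

def forward_dct4x4_matrix(block):
    # Reference form: Y = CF * X * CF^T.
    return [[sum(CF[r][i] * block[i][k] * CF[c][k]
                 for i in range(4) for k in range(4))
             for c in range(4)] for r in range(4)]
```

At 16 samples per clock cycle, the sixteen 4x4 luma blocks of one macroblock (256 samples) would need 16 input cycles, ignoring pipeline latency.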

The algorithms and the architectures designed for the 4x4 Hadamard and 4x4 DCT transforms are similar.

Two buffers are used to keep these transforms synchronized.

The 4x4 FDCT (forward DCT) is the first operation of this module and is common to all input samples. After the DCT calculation there are three possible datapaths, depending on the sample type and on the prediction mode used in the prediction stage.
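The routing decision after the FDCT can be modelled as a small selector. The text does not name the three datapaths, so the mapping below is an assumption inferred from the Concept section: luma DC coefficients of intra 16x16 macroblocks go to the 4x4 Hadamard, chroma DC coefficients go to the 2x2 Hadamard, and everything else goes straight to the output. The function name and the labels are hypothetical.

```python
def select_datapath(component, is_dc, pred_mode):
    """Pick the datapath for a coefficient leaving the 4x4 FDCT.

    component: "luma", "cb" or "cr"
    is_dc:     True for the DC coefficient of a 4x4 block
    pred_mode: prediction mode of the macroblock, e.g. "intra16x16"
    """
    if component == "luma" and is_dc and pred_mode == "intra16x16":
        return "hadamard4x4"     # assumed path 1: luma DC, intra 16x16
    if component in ("cb", "cr") and is_dc:
        return "hadamard2x2"     # assumed path 2: chroma DC
    return "direct"              # assumed path 3: sent on unchanged
```

For instance, select_datapath("luma", True, "intra4x4") selects the direct path, because the 4x4 Hadamard is used only for intra 16x16 macroblocks.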

   

 

Intra prediction hardware

 

 

Inter prediction hardware

 

Bit-stream generation hardware

Deblocking Filter

 

 

READ FURTHER: LINKS

H.264 Video Codec on FPGA
Implementation of Video Codec

Features in H.264 and implementation feasibilities
Identifying and finalizing the expected features is important for an RTL project.

Encoder architecture for H.264 Video encoder in FPGA

Important modules and complexity estimates

Memory Management in the FPGA Codec

H.264 on FPGA IP Vendors