RT Journal Article
T1 CAVLCU: an efficient GPU-based implementation of CAVLC
A1 Fuentes-Alventosa, Antonio
A1 Gómez-Luna, Juan
A1 González-Linares, José María
A1 Guil-Mata, Nicolás
A1 Medina-Carnicer, Rafael
K1 Data compression (Computer science)
K1 Image compression
K1 Image processing - Digital techniques
K1 Video compression
AB In this paper, we present CAVLCU, an efficient implementation of CAVLC on GPU, which is based on four key ideas. First, we use only one kernel to avoid the long-latency global memory accesses required to transmit intermediate results among different kernels, and the costly launches and terminations of additional kernels. Second, we apply an efficient synchronization mechanism for thread-blocks that process adjacent frame regions (in horizontal and vertical dimensions) to share results in global memory space. (To prevent confusion, a block of pixels of a frame is referred to simply as a block, and a GPU thread block as a thread-block.) Third, we fully exploit the available global memory bandwidth by using vectorized loads to move the quantized transform coefficients directly to registers. Fourth, we use register tiling to implement the zigzag sorting, thus obtaining high instruction-level parallelism. An exhaustive experimental evaluation showed that our approach is between 2.5× and 5.4× faster than the only state-of-the-art GPU-based implementation of CAVLC.
PB Springer Nature
YR 2022
FD 2022
LK https://hdl.handle.net/10630/37825
UL https://hdl.handle.net/10630/37825
LA eng
NO Fuentes-Alventosa, A., Gómez-Luna, J., González-Linares, J.M. et al. CAVLCU: an efficient GPU-based implementation of CAVLC. J Supercomput 78, 7556–7590 (2022). https://doi.org/10.1007/s11227-021-04183-8
NO Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
DS RIUMA. Repositorio Institucional de la Universidad de Málaga
RD 19 Jan 2026