VLIC: Vision-Language Models As Perceptual Judges
for Human-Aligned Image Compression

Kyle Sargent1,   Ruiqi Gao3,   Philipp Henzler2,   Charles Herrmann3,   Aleksander Holynski3,  
Li Fei-Fei1,   Jiajun Wu1,   Jason Zhang2  
1Stanford University   2Google Research   3Google DeepMind  

Overview

Can we use VLMs as judges to improve human-aligned image compression? Yes! In VLIC (Vision Language Models for Image Compression), we present a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. Please consult our paper for more details, and check out the visualizations on this page!
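The preference-collection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `judge` interface, the `PreferencePair` container, and the toy stand-in judge are all hypothetical, and a real system would query a VLM for each binary judgment.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PreferencePair:
    """A (preferred, rejected) reconstruction pair for preference post-training."""
    reference: str
    preferred: str
    rejected: str

def collect_preferences(
    references: List[str],
    candidates: List[Tuple[str, str]],
    judge: Callable[[str, str, str], bool],
) -> List[PreferencePair]:
    """Turn binary VLM judgments into preference pairs.

    `judge(reference, a, b)` returns True if reconstruction `a` is judged
    perceptually closer to `reference` than `b` (hypothetical interface).
    The resulting pairs can feed standard diffusion preference post-training.
    """
    pairs = []
    for ref, (a, b) in zip(references, candidates):
        if judge(ref, a, b):
            pairs.append(PreferencePair(ref, preferred=a, rejected=b))
        else:
            pairs.append(PreferencePair(ref, preferred=b, rejected=a))
    return pairs

# Toy stand-in judge for demonstration only: prefers the candidate whose
# characters overlap the reference more. A real judge would be a VLM call.
def toy_judge(ref: str, a: str, b: str) -> bool:
    return len(set(ref) & set(a)) >= len(set(ref) & set(b))

pairs = collect_preferences(["face"], [("face_hq", "blur")], toy_judge)
```

Under this toy judge, the first pair prefers `"face_hq"` over `"blur"`; the point is only the data flow from binary judgments to preference pairs.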


Gallery

Select a Scene from the bottom bar. Select a Method from the sidebar to compare against VLIC.

We observe that VLIC produces high-quality reconstructions, particularly for human-relevant details such as text and faces. We validate this with large-scale user studies and quantitative evaluations; please consult the paper for details. Thank you!