Instant Photorealistic Style Transfer:
A Lightweight and Adaptive Approach
1University of Southern California
2USC Institute for Creative Technologies
Given an underwater video at 3840×2160 resolution and 30 fps with a duration of 11 seconds, the entire transfer process completes in just 1.53 seconds, excluding frame I/O operations.
Abstract
In this paper, we propose an Instant Photorealistic Style Transfer (IPST) approach, designed to achieve instant photorealistic style transfer on super-resolution inputs without pre-training on pairwise datasets or imposing extra constraints. Our method utilizes a lightweight StyleNet to transfer style from a style image to a content image while preserving non-color information. To further enhance the style transfer process, we introduce an instance-adaptive optimization that prioritizes the photorealism of outputs and accelerates the convergence of the style network, completing training within seconds. Moreover, IPST is well-suited for multi-frame style transfer tasks, as it retains the temporal and multi-view consistency of multi-frame inputs such as videos and Neural Radiance Fields (NeRF). Experimental results demonstrate that IPST requires less GPU memory, offers faster multi-frame transfer speed, and generates photorealistic outputs, making it a promising solution for various photorealistic transfer applications.
Results
Image Style Transfer Comparison
(Interactive comparison: input image and style reference.)
Video Style Transfer Comparison
(Interactive comparison: input video and style reference.)
NeRF Style Transfer Comparison
(Interactive comparison: input NeRF scene and style reference.)
Approach
Instant Photorealistic Style Transfer (IPST) Pipeline
Overview of the Instant Photorealistic Style Transfer (IPST) pipeline and instance-adaptive optimization. IPST is built around a single key component, the StyleNet, and operates in two stages: training and inference. The StyleNet takes a single image as input and produces the transferred output. Early in training, the StyleNet's output does not yet capture the desired transformation; as training progresses, it learns to generate outputs that closely match the style reference image. Throughout training, we employ an instance-adaptive optimization method, comprising an instance-adaptive coefficient α and an early-stopping technique, which together accelerate convergence. For single-image transfer, only the training stage is required. When the input is a set of contextually consistent images, the first image goes through the training stage, while the subsequent images undergo the inference stage, where the StyleNet, now equipped with the learned style transformation, is applied directly.
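The two-stage pipeline can be sketched as a short driver loop. The snippet below is only an illustration of the structure described above: `style_loss`, `content_loss`, the schedule for the instance-adaptive coefficient α, and the early-stopping rule are hypothetical placeholders, not the exact formulation used in IPST.

```python
# Hypothetical sketch of the IPST training/inference stages.
# Loss functions, the alpha schedule, and the stopping rule are placeholders.
import torch

def run_ipst(frames, style_image, model, style_loss, content_loss,
             max_steps=200, lr=1e-2, patience=10, tol=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Training stage: optimize the StyleNet on the first frame only.
    first = frames[0]
    best_loss, stale = float("inf"), 0
    for step in range(max_steps):
        optimizer.zero_grad()
        out = model(first)
        # Instance-adaptive coefficient alpha balances stylization and
        # photorealism; here it is simply annealed with the step index.
        alpha = 1.0 / (1.0 + step)
        loss = style_loss(out, style_image) + alpha * content_loss(out, first)
        loss.backward()
        optimizer.step()

        # Early stopping once the loss stops improving.
        if best_loss - loss.item() > tol:
            best_loss, stale = loss.item(), 0
        else:
            stale += 1
            if stale >= patience:
                break

    # Inference stage: apply the trained StyleNet directly to every frame.
    model.eval()
    with torch.no_grad():
        return [model(frame) for frame in frames]
```

For a single image, only the training loop above runs; for video or NeRF inputs, the remaining frames reuse the trained network without further optimization, which is what makes multi-frame transfer fast and consistent.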
Lightweight Photorealistic StyleNet
Architecture of the Lightweight Photorealistic StyleNet. The StyleNet processes each input sequentially. 1) The input is first normalized with the mean and standard deviation of the ImageNet dataset and then split into two branches: a style transfer branch and a content shortcut branch. 2) In the style transfer branch, the input is downsampled to SD (480p) resolution, which keeps the network resolution-agnostic and computationally efficient. A lightweight, compact neural network then applies a color transformation; notably, it operates only on the channel dimension and leaves spatial information untouched. Its output is a color transformation mask, which is upsampled back to the original resolution and serves as the branch's output. Meanwhile, the content shortcut branch performs an identity mapping, helping preserve non-color information. 3) Finally, the outputs of the two branches are added and denormalized to produce the final output.
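A minimal PyTorch sketch of a StyleNet-like module following the steps above is given below. The layer count, hidden width, and the use of 1×1 convolutions for the channel-only color transformation are assumptions for illustration; the original network's exact configuration is not reproduced here.

```python
# Sketch of a StyleNet-style module: normalize, downsample, predict a color
# mask with a channel-only network, upsample, add the content shortcut, denormalize.
import torch
import torch.nn as nn
import torch.nn.functional as F

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

class StyleNetSketch(nn.Module):
    def __init__(self, hidden_channels=16, sd_height=480):
        super().__init__()
        self.sd_height = sd_height
        # 1x1 convolutions act only on channels, leaving spatial structure untouched.
        self.color_net = nn.Sequential(
            nn.Conv2d(3, hidden_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_channels, hidden_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_channels, 3, kernel_size=1),
        )

    def forward(self, x):
        # 1) Normalize with ImageNet statistics.
        mean, std = IMAGENET_MEAN.to(x.device), IMAGENET_STD.to(x.device)
        z = (x - mean) / std

        # 2) Style transfer branch: downsample to SD resolution, predict a
        #    color transformation mask, and upsample back to full resolution.
        h, w = z.shape[-2:]
        z_sd = F.interpolate(z, scale_factor=self.sd_height / h,
                             mode="bilinear", align_corners=False)
        mask = F.interpolate(self.color_net(z_sd), size=(h, w),
                             mode="bilinear", align_corners=False)

        # 3) Add the content shortcut (identity) and denormalize.
        out = (z + mask) * std + mean
        return out.clamp(0.0, 1.0)
```

Because the color network only ever sees a 480p tensor and applies per-channel operations, its cost is independent of the input resolution; the full-resolution work reduces to two interpolations and an addition, which is consistent with the lightweight design described above.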