
Project · Generative Models

Long-Tail Class Generation via Content-Style based Transfer Learning

Generative Models · Long-tailed Learning · Transfer Learning · Content-Style Generative Model · Conditional Image Generation

The Problem

Real-world image datasets are almost never balanced. A handful of head classes carry the bulk of the samples, while a long list of tail classes have only a few. When a class-conditional generative model — e.g. a conditional GAN (cGAN) — is trained on such data, two failure modes appear:

The natural way out is knowledge transfer from head to tail: similar classes (e.g. dog breeds) share most of their generative process, so the few samples available for a tail class should benefit from the abundant samples of nearby head classes. Most prior work (GSR-GAN[2], Transitional GAN[3], Noisy-Twin[4], UTLO[5]) implements this idea with heuristics on architecture, regularization, or training schedules. We instead ground head-to-tail knowledge transfer in a principled content-style generative model that explicitly separates what is shared across all classes (content) from what is class-specific (style).

Setup

We have $N$ classes split into $M$ head classes (abundant data) and $N-M$ tail classes (scarce data). For class $n \in [N]$ we observe a dataset $\{ \bx_i^{(n)} \}_{i=1}^{D^{(n)}}$ with $D^{(n)}$ samples. The imbalance ratio is $\rho = D^{(1)} / D^{(N)}$. Our goal is to learn a single generative model that produces high-fidelity, diverse samples for every class — including those with only a few real images.
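As a concrete illustration of this notation, the sketch below computes $\rho$ and a head/tail split from per-class sample counts. The counts, the value of $M$, and the helper name `imbalance_ratio` are made up for illustration, and the classes are assumed to be sorted from largest to smallest so that $D^{(1)}$ is the largest count and $D^{(N)}$ the smallest.

```python
def imbalance_ratio(counts):
    """Return rho = D^(1) / D^(N) for counts sorted in decreasing order."""
    if any(a < b for a, b in zip(counts, counts[1:])):
        raise ValueError("counts must be sorted from head to tail")
    return counts[0] / counts[-1]


# Hypothetical per-class sample counts D^(1), ..., D^(N) (not from any benchmark).
counts = [500, 320, 200, 40, 10, 5]
M = 3  # assumed number of head classes

head, tail = counts[:M], counts[M:]
rho = imbalance_ratio(counts)
print(f"N={len(counts)}, M={M}, tail classes={len(tail)}, rho={rho:.0f}")
```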

Proposed Formulation

Content–Style Generative Model

We model every image as a composition of two latent factors:

$$ \bx^{(n)} = \bg(\bc, \bs^{(n)}), \quad \bc \sim \bP_{\bc}, \quad \bs^{(n)} \sim \bP_{\bs^{(n)}}, $$

where $\bg : \mathcal{C}\times\mathcal{S}^{(n)} \to \mathcal{X}^{(n)}$ is a (smooth) bijection. To make the two latent distributions trainable, we transport standard Gaussians through learnable encoders:

$$ \bc = \be_{\rm C}(\bz_{\rm C}), \qquad \bs^{(n)} = \be_{\rm S}(\bz_{\rm S}, \bw_n), \qquad \bz_{\rm C}\sim\cN(\boldsymbol{0},\boldsymbol{I}), \quad \bz_{\rm S}\sim\cN(\boldsymbol{0},\boldsymbol{I}). $$

The same style function $\be_{\rm S}$ is reused across all classes; the only thing that distinguishes class $n$ is its embedding $\bw_n$. Crucially, we ask $\be_{\rm S}$ to be smooth in $\bw$:

$$ \big\| \be_{\rm S}(\bz_{\rm S}, \bw_i) - \be_{\rm S}(\bz_{\rm S}, \bw_j) \big\|_2 \;\leq\; L\,\| \bw_i - \bw_j \|_2, \qquad \forall \bw_i, \bw_j \in \widetilde{\cW}. $$

This Lipschitz constraint is the formal expression of "tail classes look similar to nearby head classes": small movements in class embedding space produce small changes in generated style for a given random noise $\bz_{\rm S}$. It is what enables transfer from head to tail.
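The inequality can be checked numerically on a toy style encoder. The sketch below is only an illustration under stated assumptions: `style_encoder` is a hypothetical map $\tanh(A\bz_{\rm S} + B\bw)$ with random weights, not the mapping network used in this project. Because $\tanh$ is 1-Lipschitz, $\|B\|_{\rm F}$ is a valid (if loose) Lipschitz constant in $\bw$, and the observed ratios should never exceed it.

```python
import math
import random

random.seed(0)
DIM_Z, DIM_W, DIM_S = 4, 3, 5

# Toy weights standing in for a trained style encoder (assumption).
A = [[random.gauss(0, 1) for _ in range(DIM_Z)] for _ in range(DIM_S)]
B = [[random.gauss(0, 1) for _ in range(DIM_W)] for _ in range(DIM_S)]


def matvec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def style_encoder(z_s, w):
    """Toy e_S(z_S, w) = tanh(A z_S + B w), applied elementwise."""
    return [math.tanh(a + b) for a, b in zip(matvec(A, z_s), matvec(B, w))]


def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# tanh is 1-Lipschitz, so ||B||_F is a valid (loose) Lipschitz constant in w.
L = math.sqrt(sum(b * b for row in B for b in row))

worst_ratio = 0.0
for _ in range(1000):
    z_s = [random.gauss(0, 1) for _ in range(DIM_Z)]
    w_i = [random.gauss(0, 1) for _ in range(DIM_W)]
    w_j = [random.gauss(0, 1) for _ in range(DIM_W)]
    # Same style noise z_s for both embeddings, as in the constraint.
    ratio = l2(style_encoder(z_s, w_i), style_encoder(z_s, w_j)) / l2(w_i, w_j)
    worst_ratio = max(worst_ratio, ratio)

print(f"largest observed ratio {worst_ratio:.3f} <= L = {L:.3f}")
```

The same style noise is used on both sides of each comparison, which is what makes the bound a statement about class embeddings rather than about sampling noise.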

Why Smoothness, and Why Share the Content and Style Encoders with the Head Classes?

With the shared modeling and the smoothness constraint, the head classes alone determine $\be_{\rm C}(\cdot)$ and $\be_{\rm S}(\cdot, \bw_h)$ for $h\in[M]$, since they have enough data for distribution matching. Smoothness then helps learn the embeddings $\bw_t$ of the tail classes $t \in \{M+1,\dots,N\}$ by pulling them closer to those of the head classes. This also means the style space $\mathcal{S}$ is connected and smooth across all classes.

Conceptual Loss

We instantiate the content-style generative model with a generative adversarial network (GAN). Combining the GAN distribution-matching with the smoothness constraint and an embedding encoder $\bh$ that ties latent embeddings to images, the conceptual loss is:

$$ \begin{aligned} \min_{\bg,\be_{\rm C},\be_{\rm S},\bw_n}\;\max_{\bd}\quad & \sum_{n=1}^{M} \bbE\!\left[ \log \bd(\bx^{(n)}) + \log\!\big(1 - \bd(\bg(\be_{\rm C}(\bz_{\rm C}), \be_{\rm S}(\bz_{\rm S},\bw_n)))\big) \right] \\ \text{s.t.}\quad & \be_{\rm S}(\bz_{\rm S},\,\cdot) \text{ is smooth in } \bw_n. \end{aligned} $$

Practical Implementation

For training we replace the hard smoothness constraint with a soft Jacobian regularizer:

$$ \cR(\be_{\rm S}) \;=\; \sum_{n=1}^{N} \bbE_{\bz_{\rm S}}\Big[\; \big\| J_{\be_{\rm S}}(\bz_{\rm S}, \bw_n) \big\|_{\rm F}\;\Big], $$

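where $J_{\be_{\rm S}}$ is the Jacobian of $\be_{\rm S}$ with respect to $\bw$. Because the spectral norm of a matrix is at most its Frobenius norm, a bound on $\|J_{\be_{\rm S}}\|_{\rm F}$ over a convex region of embedding space also upper-bounds the Lipschitz constant $L$ on that region. To avoid materializing a full Jacobian, we use a cheap finite-difference estimate: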

$$ \big\| J_{\be_{\rm S}}(\bz_{\rm S},\bw_n) \big\|_{\rm F} \;\approx\; \frac{\| \be_{\rm S}(\bz_{\rm S},\bw_n) \;-\; \be_{\rm S}(\bz_{\rm S},\bw_n + \epsilon\boldsymbol{\delta}) \|_2}{\epsilon}, \qquad \boldsymbol{\delta}\sim\cN(\boldsymbol{0},\boldsymbol{I}),\;\epsilon\in(0,1]. $$
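A minimal sketch of this kind of estimator, assuming the difference is normalized by the step size $\epsilon$, is given below. The style encoder here is a hypothetical linear map $A\bz_{\rm S} + B\bw$, chosen only because its Jacobian in $\bw$ is exactly $B$, so the estimate can be compared with the true Frobenius norm. For a standard Gaussian direction $\boldsymbol{\delta}$, $\bbE\|J\boldsymbol{\delta}\|_2^2 = \|J\|_{\rm F}^2$, so the root mean square of many samples recovers $\|J\|_{\rm F}$, while a single sample is a noisy proxy.

```python
import math
import random

random.seed(0)
DIM_Z, DIM_W, DIM_S = 4, 3, 5

# Toy linear style encoder e_S(z_S, w) = A z_S + B w, so its Jacobian in w is B.
A = [[random.gauss(0, 1) for _ in range(DIM_Z)] for _ in range(DIM_S)]
B = [[random.gauss(0, 1) for _ in range(DIM_W)] for _ in range(DIM_S)]


def style_encoder(z_s, w):
    return [
        sum(a * z for a, z in zip(row_a, z_s)) + sum(b * x for b, x in zip(row_b, w))
        for row_a, row_b in zip(A, B)
    ]


def fd_sample(z_s, w, eps=1e-2):
    """One sample of ||e_S(z, w) - e_S(z, w + eps * delta)||_2 / eps."""
    delta = [random.gauss(0, 1) for _ in w]
    w_pert = [x + eps * d for x, d in zip(w, delta)]
    base, pert = style_encoder(z_s, w), style_encoder(z_s, w_pert)
    return math.sqrt(sum((p - b) ** 2 for p, b in zip(pert, base))) / eps


z_s = [random.gauss(0, 1) for _ in range(DIM_Z)]
w_n = [random.gauss(0, 1) for _ in range(DIM_W)]

samples = [fd_sample(z_s, w_n) for _ in range(20000)]
rms = math.sqrt(sum(s * s for s in samples) / len(samples))
exact = math.sqrt(sum(b * b for row in B for b in row))  # ||J||_F = ||B||_F

print(f"finite-difference RMS {rms:.3f} vs exact Frobenius norm {exact:.3f}")
```

Each sample costs two forward passes of the style encoder and no explicit Jacobian, which is the point of the finite-difference form; in a training loop one would typically use one or a few samples per step rather than thousands.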

The full training objective is the cGAN loss plus $\lambda_j\,\cR(\be_{\rm S})$. We use the StyleGAN2-ADA[1] backbone for $\bg,\bd$ and two separate mapping networks for $\be_{\rm C}$ and $\be_{\rm S}$.

Visualizing What the Model Learns — Does the Model Actually Recover Tail Classes?

With content and style representations, we can fix the content noise $\bz_{\rm C}$ and sweep across class embeddings to inspect how style varies. The figures below show qualitative results: the same content (pose, layout) is preserved while only style (texture, color, breed) changes across classes — including tail classes that had only 2–4 real samples.
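The sweep itself is simple to express in code. The sketch below uses hypothetical toy stand-ins for $\be_{\rm C}$, $\be_{\rm S}$, and $\bg$ (the generator is a plain concatenation so that content and style remain readable in the output) and made-up class embeddings; it only illustrates the procedure of fixing $\bz_{\rm C}$ and varying $\bw_n$, not the actual networks.

```python
import random

random.seed(0)
DIM = 3  # toy latent dimension shared by content and style (assumption)


def content_encoder(z_c):
    """Toy e_C: a fixed map of the content noise."""
    return [2.0 * v for v in z_c]


def style_encoder(z_s, w):
    """Toy e_S(z_S, w): style noise shifted by the class embedding."""
    return [z + e for z, e in zip(z_s, w)]


def generator(c, s):
    """Toy g(c, s): concatenation, so content and style stay separable."""
    return c + s


# Hypothetical class embeddings w_n; "tail" sits close to "head_b".
embeddings = {
    "head_a": [0.0, 0.0, 0.0],
    "head_b": [1.0, 0.0, 0.0],
    "tail": [0.9, 0.1, 0.0],
}

z_c = [random.gauss(0, 1) for _ in range(DIM)]  # fixed content noise
z_s = [random.gauss(0, 1) for _ in range(DIM)]  # fixed style noise

# One "row" of the figure: same z_C and z_S, different class embedding.
row = {
    name: generator(content_encoder(z_c), style_encoder(z_s, w))
    for name, w in embeddings.items()
}

for name, x in row.items():
    content, style = x[:DIM], x[DIM:]
    print(name, [round(v, 2) for v in content], [round(v, 2) for v in style])
```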

(a) Flowers-LT content–style sweep. Tail classes (top to bottom) had only 4, 2, and 2 real samples.
(b) AnimalFaces-LT content–style decomposition.

Figure 1. Qualitative results: a fixed $\bz_{\rm C}$ across all images yields consistent content while style varies smoothly with the class embedding $\bw_n$.

Figure 2. Qualitative results on AnimalFaces-LT. Each column uses the same style noise vector but a different class embedding and each row uses the same content noise vector. Our method produces visibly diverse content and style for both head and tail classes, while baselines collapse on the tail.

Results

Datasets

We evaluate on four widely used long-tailed benchmarks. Tail-class size is what matters most — Flowers-LT for instance has tail classes with as few as 2 images.

| Dataset | $N$ | Resolution | $\rho$ | $N-M$ (tail) |
|---|---|---|---|---|
| AnimalFaces-LT | 20 | 64 × 64 | 25 | 10 |
| CIFAR100-LT | 100 | 32 × 32 | 100 | 30 |
| CIFAR10-LT | 10 | 32 × 32 | 100 | 4 |
| Flowers-LT | 102 | 128 × 128 | 100 | 52 |

Table 1. Long-tailed benchmark statistics. $\rho$ is the head-to-tail ratio.

Metrics

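FID and KID computed over a whole long-tailed dataset are dominated by the head classes and can hide tail-class failure. Following UTLO[5], we report each metric twice: over all classes (FID-all, KID-all) and over the tail classes only (FID-few, KID-few).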
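A toy calculation shows how a pooled score can hide a tail failure. The sketch below is not FID itself: real FID fits multivariate Gaussians to Inception features, whereas this example uses made-up scalar "features" and the one-dimensional Fréchet distance $(\mu_r-\mu_f)^2 + (\sigma_r-\sigma_f)^2$. The helper name `frechet_1d` and all data are hypothetical.

```python
import statistics


def frechet_1d(real, fake):
    """Squared Frechet distance between 1-D Gaussians fitted to two samples."""
    mean_gap = statistics.fmean(real) - statistics.fmean(fake)
    std_gap = statistics.pstdev(real) - statistics.pstdev(fake)
    return mean_gap ** 2 + std_gap ** 2


# Made-up scalar "features": 1000 head samples and 4 tail samples.
head_real = [i / 100 for i in range(-500, 500)]
tail_real = [9.0, 10.0, 11.0, 12.0]

head_fake = list(head_real)        # the generator matches the head perfectly
tail_fake = [6.0, 6.0, 6.0, 6.0]   # the generator collapses on the tail

score_all = frechet_1d(head_real + tail_real, head_fake + tail_fake)
score_few = frechet_1d(tail_real, tail_fake)

print(f"pooled score: {score_all:.4f}, tail-only score: {score_few:.2f}")
```

In this toy the pooled score stays close to zero because the four tail samples barely move the pooled mean and spread, while the tail-only score exposes the collapse. That is the motivation for reporting a tail-restricted score alongside the pooled one.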

Table 2 — FID-few and FID-all

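| Method | AnimalFaces-LT FID-few | AnimalFaces-LT FID-all | CIFAR100-LT FID-few | CIFAR100-LT FID-all | CIFAR10-LT FID-few | CIFAR10-LT FID-all | Flowers-LT FID-few | Flowers-LT FID-all |
|---|---|---|---|---|---|---|---|---|
| StyleGAN2-ADA | 123.4 | 79.1 | 28.5 | 12.6 | 23.0 | 9.1 | 19.0 | 12.3 |
| GSR | 128.9 | 87.5 | 30.3 | 15.9 | 20.1 | 8.9 | 25.9 | 16.2 |
| Transitional | 69.6 | 35.5 | 26.9 | 11.2 | 21.2 | 8.7 | 24.7 | 14.1 |
| Noisy-Twin | 55.7 | 34.2 | 26.8 | 10.3 | 18.7 | 8.7 | 20.6 | 11.5 |
| UTLO | 50.7 | 29.4 | 27.5 | 11.7 | 19.2 | 8.6 | 17.3 | 10.1 |
| Proposed | 42.4 | 25.3 | 24.3 | 8.7 | 17.3 | 7.3 | 16.7 | 10.1 |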

Table 2. FID-few and FID-all (lower is better) across four datasets. Bold = best, underline = second.

Table 3 — KID-few and KID-all (× 10³)

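| Method | AnimalFaces-LT KID-few | AnimalFaces-LT KID-all | CIFAR100-LT KID-few | CIFAR100-LT KID-all | CIFAR10-LT KID-few | CIFAR10-LT KID-all | Flowers-LT KID-few | Flowers-LT KID-all |
|---|---|---|---|---|---|---|---|---|
| StyleGAN2-ADA | 54.7 | 30.8 | 12.3 | 5.3 | 9.2 | 3.3 | 3.3 | 2.7 |
| GSR | 32.6 | 23.3 | 17.2 | 7.6 | 7.9 | 4.2 | 7.4 | 5.7 |
| Transitional | 23.0 | 11.8 | 11.9 | 5.3 | 8.5 | 4.1 | 7.2 | 4.3 |
| Noisy-Twin | 21.2 | 13.6 | 11.5 | 4.3 | 6.0 | 2.3 | 5.1 | 3.5 |
| UTLO | 19.1 | 12.1 | 10.5 | 5.8 | 8.2 | 3.4 | 3.7 | 2.7 |
| Proposed | 10.8 | 7.4 | 9.1 | 2.9 | 5.8 | 2.0 | 2.5 | 2.4 |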

Table 3. KID-few and KID-all (lower is better, ×10³).

What the Numbers Say
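One simple way to read Table 2 is the relative FID-few reduction of the proposed method over the strongest baseline on each dataset. The short script below does that arithmetic; the values are transcribed from Table 2, and the helper name `relative_gain` is introduced here only for illustration.

```python
# FID-few values transcribed from Table 2 (lower is better).
fid_few = {
    "AnimalFaces-LT": {"StyleGAN2-ADA": 123.4, "GSR": 128.9, "Transitional": 69.6,
                       "Noisy-Twin": 55.7, "UTLO": 50.7, "Proposed": 42.4},
    "CIFAR100-LT": {"StyleGAN2-ADA": 28.5, "GSR": 30.3, "Transitional": 26.9,
                    "Noisy-Twin": 26.8, "UTLO": 27.5, "Proposed": 24.3},
    "CIFAR10-LT": {"StyleGAN2-ADA": 23.0, "GSR": 20.1, "Transitional": 21.2,
                   "Noisy-Twin": 18.7, "UTLO": 19.2, "Proposed": 17.3},
    "Flowers-LT": {"StyleGAN2-ADA": 19.0, "GSR": 25.9, "Transitional": 24.7,
                   "Noisy-Twin": 20.6, "UTLO": 17.3, "Proposed": 16.7},
}


def relative_gain(scores, ours="Proposed"):
    """Return (best baseline, percent reduction of `ours` relative to it)."""
    baselines = {m: v for m, v in scores.items() if m != ours}
    best_name = min(baselines, key=baselines.get)
    best = baselines[best_name]
    return best_name, 100.0 * (best - scores[ours]) / best


gains = {dataset: relative_gain(scores) for dataset, scores in fid_few.items()}
for dataset, (baseline, pct) in gains.items():
    print(f"{dataset}: {pct:.1f}% lower FID-few than {baseline}")
```

On these numbers the reduction is largest on AnimalFaces-LT and smallest on Flowers-LT.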

Summary

Long-tail class generation is hard because a few tail samples cannot anchor distribution matching by themselves. By modeling images through a content–style decomposition with a smooth, class-conditioned style function, we obtain a single principled mechanism for head-to-tail transfer that is easy to implement. In practice, a simple Jacobian-norm regularizer on the style mapping network, implemented with a finite difference, turns this into a measurable improvement: the best FID and KID among the compared methods on AnimalFaces-LT, CIFAR10-LT, CIFAR100-LT, and Flowers-LT (tied with UTLO on Flowers-LT FID-all).

References

  1. T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila, Training generative adversarial networks with limited data (StyleGAN2-ADA). NeurIPS 2020. arXiv:2006.06676
  2. H. Rangwani, N. Jaswani, T. Karmali, V. Jampani, and R. V. Babu, Improving GANs for long tailed data through group spectral regularization (GSR-GAN). ECCV 2022. arXiv:2208.09932
  3. M. Shahbazi, M. Danelljan, D. P. Paudel, and L. Van Gool, Collapse by conditioning: Training class-conditional GANs with limited data (Transitional GAN). ECCV 2022. arXiv:2201.06578
  4. H. Rangwani, L. Bansal, K. Sharma, T. Karmali, V. Jampani, and R. V. Babu, NoisyTwins: Class-consistent and diverse image generation through StyleGANs. CVPR 2023. arXiv:2304.05866
  5. S. Khorram, M. Jiang, M. Shahbazi, M. H. Danesh, and L. Fuxin, Taming the tail in class-conditional GANs: Knowledge sharing via unconditional training at lower resolutions (UTLO). CVPR 2024. arXiv:2402.17065

Project by: Subash Timilsina, Sagar Shrestha, Xiao Fu.