Project · Generative Models

Long-Tail Class Generation via Content-Style based Transfer Learning

Generative Models · Long-tailed Learning · Transfer Learning · Content-Style Generative Model · Conditional Image Generation

The Problem

Real-world image datasets are almost never balanced. A handful of head classes carry the bulk of the samples, while a long list of tail classes have only a few. When a class-conditional generative model — e.g. a conditional GAN (cGAN) — is trained on such data, two failure modes appear:

Mode collapse on tail classes. The generator produces near-identical outputs for any tail-class condition; diversity vanishes.
Drop in fidelity. Tail samples look distorted or unrealistic, even when head-class samples look great.

The natural way out is knowledge transfer from head to tail: similar classes (e.g. dog breeds) share most of their generative process, so the few samples we have for a tail class should benefit from the abundant samples of nearby head classes. Most prior work (GSR-GAN^[2], Transitional GAN^[3], Noisy-Twin^[4], UTLO^[5]) implements this idea with heuristics on architecture, regularization, or training schedules. We instead ground head-to-tail knowledge transfer in a principled content-style generative model that explicitly separates what is shared across all classes (content) from what is class-specific (style).

Setup

We have $N$ classes split into $M$ head classes (abundant data) and $N-M$ tail classes (scarce data). For class $n \in [N]$ we observe a dataset $\{ \bx_i^{(n)} \}_{i=1}^{D^{(n)}}$ with $D^{(n)}$ samples. The imbalance ratio is $\rho = D^{(1)} / D^{(N)}$. Our goal is to learn a single generative model that produces high-fidelity, diverse samples for every class — including those with only a few real images.
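As a small illustration of the setup, the sketch below computes the head/tail split and the imbalance ratio $\rho$ from per-class sample counts. The counts and the function name `imbalance_summary` are hypothetical, and the sketch assumes classes are indexed from largest to smallest so that $D^{(1)}$ is the largest class and $D^{(N)}$ the smallest.

```python
def imbalance_summary(class_counts, num_head):
    """Sort classes by size and report the head/tail split and imbalance ratio."""
    # Assumption: classes are ordered so that D^(1) >= ... >= D^(N).
    counts = sorted(class_counts, reverse=True)
    rho = counts[0] / counts[-1]  # rho = D^(1) / D^(N)
    return {
        "head": counts[:num_head],   # M head classes
        "tail": counts[num_head:],   # N - M tail classes
        "rho": rho,
    }


# Hypothetical counts for N = 6 classes with M = 3 head classes.
summary = imbalance_summary([500, 400, 250, 20, 10, 5], num_head=3)
print(summary["rho"])  # 100.0
```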

Proposed Formulation

Content–Style Generative Model

We model every image as a composition of two latent factors:

The content $\bc$ — semantics shared across all classes (e.g., the pose of an animal face, the layout of a flower).
The style $\bs^{(n)}$ — class-specific appearance (e.g., fur texture for a particular breed, petal color for a particular flower species).

\bx^{(n)} = \bg(\bc, \bs^{(n)}), \quad \bc \sim \bP_{\bc}, \quad \bs^{(n)} \sim \bP_{\bs^{(n)}},

where $\bg : \mathcal{C}\times\mathcal{S}^{(n)} \to \mathcal{X}^{(n)}$ is a (smooth) bijection. To make the two latent distributions trainable, we transport standard Gaussians through learnable encoders:

\bc = \be_{\rm C}(\bz_{\rm C}), \qquad \bs^{(n)} = \be_{\rm S}(\bz_{\rm S}, \bw_n), \qquad \bz_{\rm C}\sim\cN(\boldsymbol{0},\boldsymbol{I}), \quad \bz_{\rm S}\sim\cN(\boldsymbol{0},\boldsymbol{I}).

The same style function $\be_{\rm S}$ is reused across all classes; the only thing that distinguishes class $n$ is its embedding $\bw_n$. Crucially, we ask $\be_{\rm S}$ to be smooth in $\bw$:

\big\| \be_{\rm S}(\bz_{\rm S}, \bw_i) - \be_{\rm S}(\bz_{\rm S}, \bw_j) \big\|_2 \;\leq\; L\,\| \bw_i - \bw_j \|_2, \qquad \forall \bw_i, \bw_j \in \widetilde{\cW}.

This Lipschitz constraint is the formal expression of "tail classes look similar to nearby head classes": small movements in class embedding space produce small changes in generated style for a given random noise $\bz_{\rm S}$. It is what enables transfer from head to tail.
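To make the constraint concrete, here is a minimal sketch with a toy style encoder $\be_{\rm S}(\bz_{\rm S}, \bw) = \tanh(A\bz_{\rm S} + B\bw)$. The toy encoder, the matrices `A` and `B`, and the two embeddings are assumptions made for illustration; the actual model uses a learned mapping network. Because $\tanh$ is 1-Lipschitz, $\|B\|_{\rm F}$ is a valid (if loose) Lipschitz constant in $\bw$, and the sketch checks that the style gap between a head embedding and a nearby tail embedding never exceeds that bound.

```python
import math
import random


def style_encoder(z_s, w, A, B):
    """Toy e_S(z_S, w) = tanh(A z_S + B w); a stand-in, not the real mapping network."""
    out = []
    for a_row, b_row in zip(A, B):
        pre = sum(a * z for a, z in zip(a_row, z_s)) + sum(b * x for b, x in zip(b_row, w))
        out.append(math.tanh(pre))
    return out


def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


rng = random.Random(0)
A = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]    # style-noise weights
B = [[rng.gauss(0, 0.5) for _ in range(2)] for _ in range(3)]  # class-embedding weights

# tanh is 1-Lipschitz, so ||B||_F is a valid (loose) Lipschitz constant in w.
L = math.sqrt(sum(b * b for row in B for b in row))

w_head = [0.8, -0.3]  # hypothetical head-class embedding
w_tail = [0.7, -0.2]  # hypothetical nearby tail-class embedding

worst_ratio = 0.0
for _ in range(1000):
    z_s = [rng.gauss(0, 1) for _ in range(4)]  # shared style noise for both classes
    gap = l2(style_encoder(z_s, w_head, A, B), style_encoder(z_s, w_tail, A, B))
    worst_ratio = max(worst_ratio, gap / l2(w_head, w_tail))

print(worst_ratio, L)
```

For every sampled $\bz_{\rm S}$, the ratio of style gap to embedding gap stays below `L`, which is the inequality above evaluated on one head/tail pair.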

Why Smoothness, and Why Share the Content and Style Encoders with the Head Classes?

With the shared modeling and the smoothness constraint, the head classes alone determine $\be_{\rm C}(\cdot)$ and $\be_{\rm S}(\cdot, \bw_h)$ for $h\in[M]$, since they have enough data for distribution matching. Smoothness then helps learn the embedding $\bw_t$ of each tail class $t\in\{M+1,\dots,N\}$ by pulling it toward the embeddings of similar head classes, so that the tail class inherits a style close to theirs. It also means the style space $\mathcal{S}$ is connected and smooth across all classes.

Conceptual Loss

We use the Generative Adversarial Network (GAN) to instantiate the content-style generative model. Combining the GAN distribution-matching with the smoothness constraint and an embedding encoder $\bh$ that ties latent embeddings to images, the conceptual loss is:

\begin{aligned} \min_{\bg,\be_{\rm C},\be_{\rm S},\bw_n}\;\max_{\bd}\quad & \sum_{n=1}^{M} \bbE\!\left[ \log \bd(\bx^{(n)}) + \log\!\big(1 - \bd(\bg(\be_{\rm C}(\bz_{\rm C}), \be_{\rm S}(\bz_{\rm S},\bw_n)))\big) \right] \\ \text{s.t.}\quad & \be_{\rm S}(\bz_{\rm S},\,\cdot) \text{ is smooth in } \bw_n. \end{aligned}

Practical Implementation

For training we replace the hard smoothness constraint with a soft Jacobian regularizer:

\cR(\be_{\rm S}) \;=\; \sum_{n=1}^{N} \bbE_{\bz_{\rm S}}\Big[\; \big\| J_{\be_{\rm S}}(\bz_{\rm S}, \bw_n) \big\|_{\rm F}\;\Big],

where $J_{\be_{\rm S}}$ is the Jacobian of $\be_{\rm S}$ with respect to $\bw$. Because the Frobenius norm upper-bounds the spectral norm, penalizing it controls the local Lipschitz constant of $\be_{\rm S}$ in $\bw$. To avoid materializing the full Jacobian, we use a cheap finite-difference estimate:

\big\| J_{\be_{\rm S}}(\bz_{\rm S},\bw_n) \big\|_{\rm F} \;\approx\; \frac{\| \be_{\rm S}(\bz_{\rm S},\bw_n) \;-\; \be_{\rm S}(\bz_{\rm S},\bw_n + \epsilon\boldsymbol{\delta}) \|_2}{\epsilon}, \qquad \boldsymbol{\delta}\sim\cN(\boldsymbol{0},\boldsymbol{I}),\;\epsilon\in(0,1].
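The following is a minimal sketch of such a finite-difference estimate, assuming the output difference is normalized by the step size $\epsilon$. It uses a toy linear encoder $\be_{\rm S}(\bz_{\rm S}, \bw) = A\bz_{\rm S} + B\bw$ (an assumption for illustration, with hypothetical matrices `A` and `B`), so the Jacobian with respect to $\bw$ is exactly $B$ and the estimate can be checked against $\|B\|_{\rm F}$.

```python
import math
import random


def style_encoder(z_s, w, A, B):
    """Toy linear e_S(z_S, w) = A z_S + B w, so the Jacobian w.r.t. w is exactly B."""
    return [
        sum(a * z for a, z in zip(a_row, z_s)) + sum(b * x for b, x in zip(b_row, w))
        for a_row, b_row in zip(A, B)
    ]


def fd_jacobian_estimate(encoder, z_s, w, eps, rng):
    """One-sample estimate ||e_S(z, w) - e_S(z, w + eps * delta)||_2 / eps, delta ~ N(0, I)."""
    delta = [rng.gauss(0, 1) for _ in w]
    w_pert = [x + eps * d for x, d in zip(w, delta)]
    base, pert = encoder(z_s, w), encoder(z_s, w_pert)
    return math.sqrt(sum((p - b) ** 2 for p, b in zip(pert, base))) / eps


# Hypothetical weights: 3-dimensional style, 2-dimensional noise and embedding.
A = [[0.5, -1.0], [0.3, 0.2], [-0.7, 0.4]]
B = [[1.0, 0.0], [0.5, -2.0], [0.0, 1.5]]


def encoder(z_s, w):
    return style_encoder(z_s, w, A, B)


rng = random.Random(0)
z_s, w = [0.1, -0.4], [0.3, 0.9]
samples = [fd_jacobian_estimate(encoder, z_s, w, eps=0.1, rng=rng) for _ in range(20000)]

mean_sq = sum(s * s for s in samples) / len(samples)   # Monte Carlo estimate of ||J||_F^2
mean_abs = sum(samples) / len(samples)                 # what an unsquared penalty averages to
exact_sq = sum(b * b for row in B for b in row)        # ||B||_F^2 = 7.5
print(mean_sq, mean_abs, exact_sq)
```

For Gaussian $\boldsymbol{\delta}$, the squared estimate averages to $\|J\|_{\rm F}^2$, while the unsquared estimate averages to at most $\|J\|_{\rm F}$ by Jensen's inequality, so the penalty acts as a cheap surrogate rather than an exact norm.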

The full training objective is the cGAN loss plus $\lambda_j\,\cR(\be_{\rm S})$. We use the StyleGAN2-ADA^[1] backbone for $\bg,\bd$ and two separate mapping networks for $\be_{\rm C}$ and $\be_{\rm S}$.

Visualizing What the Model Learns — Does the Model Actually Recover Tail Classes?

With content and style representations, we can fix the content noise $\bz_{\rm C}$ and sweep across class embeddings to inspect how style varies. The figures below show qualitative results: the same content (pose, layout) is preserved while only style (texture, color, breed) changes across classes — including tail classes that had only 2–4 real samples.

(a) Flowers-LT content–style sweep. Tail classes (top to bottom) had only 4, 2, and 2 real samples.

(b) AnimalFaces-LT content–style decomposition.

Figure 1. Qualitative results: a fixed $\bz_{\rm C}$ across all images yields consistent content while style varies smoothly with the class embedding $\bw_n$.

Figure 2. Qualitative results on AnimalFaces-LT (generation samples). Each column uses the same style noise vector but a different class embedding, and each row uses the same content noise vector. Our method produces visibly diverse content and style for both head and tail classes, while baselines collapse on the tail.

Results

Datasets

We evaluate on four widely used long-tailed benchmarks. Tail-class size is what matters most — Flowers-LT for instance has tail classes with as few as 2 images.

Dataset	$N$	Resolution	$\rho$	$N-M$ (tail)
AnimalFaces-LT	20	64 × 64	25	10
CIFAR100-LT	100	32 × 32	100	30
CIFAR10-LT	10	32 × 32	100	4
Flowers-LT	102	128 × 128	100	52

Table 1. Long-tailed benchmark statistics. $\rho$ is the head-to-tail ratio.

Metrics

FID and KID applied to a long-tailed dataset are dominated by the head classes and can hide tail-class failure. Following UTLO^[5], we report:

FID-few / KID-few — computed against an equal number of real images per tail class, isolating tail-class quality.
FID-all / KID-all — standard scores over the entire dataset (50k generated images).
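To illustrate why pooled scores can hide tail-class failure, the sketch below computes KID on toy features. KID is commonly computed as an unbiased MMD² estimate with a cubic polynomial kernel on Inception features; here low-dimensional Gaussian vectors stand in for those features, and the class sizes, feature means, and the simulated tail failure are all assumptions. The sketch does not reproduce the exact -few protocol of UTLO^[5].

```python
import random


def poly_kernel(x, y):
    """Cubic polynomial kernel k(x, y) = (x.y / d + 1)^3 used by KID."""
    d = len(x)
    return (sum(a * b for a, b in zip(x, y)) / d + 1.0) ** 3


def mean_kernel(xs, ys, skip_diagonal=False):
    """Average kernel value over pairs; skip i == j for the unbiased within-set terms."""
    total, count = 0.0, 0
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            if skip_diagonal and i == j:
                continue
            total += poly_kernel(x, y)
            count += 1
    return total / count


def kid(real, fake):
    """Unbiased MMD^2 estimate between two feature sets."""
    return (mean_kernel(real, real, True) + mean_kernel(fake, fake, True)
            - 2.0 * mean_kernel(real, fake))


rng = random.Random(0)


def sample(mean, n, dim=4):
    """Toy stand-in for Inception features of n images."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


real_head, real_tail = sample(0.0, 180), sample(3.0, 20)
fake_head = sample(0.0, 180)  # generator matches the head class
fake_tail = sample(0.0, 20)   # simulated failure: tail outputs look like the head class

kid_all = kid(real_head + real_tail, fake_head + fake_tail)
kid_few = kid(real_tail, fake_tail)
print(kid_all, kid_few)
```

In this toy setting the tail class makes up 10% of the data, so its mismatch is heavily diluted in the pooled score, while the tail-only score exposes it directly.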

Table 2 — FID-few and FID-all

Method	AnimalFaces-LT		CIFAR100-LT		CIFAR10-LT		Flowers-LT
Method	FID-few	FID-all	FID-few	FID-all	FID-few	FID-all	FID-few	FID-all
StyleGAN2-ADA	123.4	79.1	28.5	12.6	23.0	9.1	19.0	12.3
GSR	128.9	87.5	30.3	15.9	20.1	8.9	25.9	16.2
Transitional	69.6	35.5	26.9	11.2	21.2	8.7	24.7	14.1
Noisy-Twin	55.7	34.2	26.8	10.3	18.7	8.7	20.6	11.5
UTLO	50.7	29.4	27.5	11.7	19.2	8.6	17.3	10.1
Proposed	42.4	25.3	24.3	8.7	17.3	7.3	16.7	10.1

Table 2. FID-few and FID-all (lower is better) across four datasets. Bold = best, underline = second.

Table 3 — KID-few and KID-all (× 10³)

Method	AnimalFaces-LT		CIFAR100-LT		CIFAR10-LT		Flowers-LT
Method	KID-few	KID-all	KID-few	KID-all	KID-few	KID-all	KID-few	KID-all
StyleGAN2-ADA	54.7	30.8	12.3	5.3	9.2	3.3	3.3	2.7
GSR	32.6	23.3	17.2	7.6	7.9	4.2	7.4	5.7
Transitional	23.0	11.8	11.9	5.3	8.5	4.1	7.2	4.3
Noisy-Twin	21.2	13.6	11.5	4.3	6.0	2.3	5.1	3.5
UTLO	19.1	12.1	10.5	5.8	8.2	3.4	3.7	2.7
Proposed	10.8	7.4	9.1	2.9	5.8	2.0	2.5	2.4

Table 3. KID-few and KID-all (lower is better, ×10³).

What the Numbers Say

Best or tied for best on every dataset and every metric. Proposed leads both the tail-focused (-few) and full-dataset (-all) variants of FID and KID across all four benchmarks; the only tie is Flowers-LT FID-all, where UTLO also reaches 10.1.
Largest gain on tail classes. On AnimalFaces-LT, the dataset with a real (not synthetic) imbalance, KID-few drops from 19.1 (UTLO) to 10.8, a reduction of about 43% relative to the previous best.
Tail improvements don't cost head fidelity. Proposed also has the lowest FID-all and KID-all, meaning the model isn't trading head-class quality for tail-class diversity — the content–style decomposition lets it improve both.
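The tail-class gains can be made explicit with a little arithmetic. The sketch below computes the relative reduction of Proposed over UTLO on the -few metrics, using the values from Tables 2 and 3. UTLO is chosen as a single reference point for simplicity; it is not the strongest baseline in every column.

```python
# (UTLO, Proposed) pairs copied from Table 2 (FID-few) and Table 3 (KID-few, x 10^3).
fid_few = {
    "AnimalFaces-LT": (50.7, 42.4),
    "CIFAR100-LT": (27.5, 24.3),
    "CIFAR10-LT": (19.2, 17.3),
    "Flowers-LT": (17.3, 16.7),
}
kid_few = {
    "AnimalFaces-LT": (19.1, 10.8),
    "CIFAR100-LT": (10.5, 9.1),
    "CIFAR10-LT": (8.2, 5.8),
    "Flowers-LT": (3.7, 2.5),
}


def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


for name, table in (("FID-few", fid_few), ("KID-few", kid_few)):
    for dataset, (utlo, proposed) in table.items():
        print(f"{name} {dataset}: {relative_reduction(utlo, proposed):.1f}% lower than UTLO")
```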

Summary

Long-tail class generation is hard because a few tail samples cannot anchor distribution matching by themselves. By modeling images as a content–style decomposition with a smooth, class-conditioned style function, we obtain a single principled transfer mechanism that is easy to implement. In practice, a simple Jacobian-norm regularizer on the style mapping network, implemented with a finite difference, turns this into a measurable improvement: state-of-the-art FID and KID on AnimalFaces-LT, CIFAR10-LT, CIFAR100-LT, and Flowers-LT.

References

[1] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila, Training generative adversarial networks with limited data (StyleGAN2-ADA). NeurIPS 2020. arXiv:2006.06676
[2] H. Rangwani, N. Jaswani, T. Karmali, V. Jampani, and R. V. Babu, Improving GANs for long-tailed data through group spectral regularization (GSR-GAN). ECCV 2022. arXiv:2208.09932
[3] M. Shahbazi, M. Danelljan, D. P. Paudel, and L. Van Gool, Collapse by conditioning: Training class-conditional GANs with limited data (Transitional GAN). ICLR 2022. arXiv:2201.06578
[4] H. Rangwani, L. Bansal, K. Sharma, T. Karmali, V. Jampani, and R. V. Babu, NoisyTwins: Class-consistent and diverse image generation through StyleGANs (Noisy-Twin). CVPR 2023. arXiv:2304.05866
[5] S. Khorram, M. Jiang, M. Shahbazi, M. H. Danesh, and L. Fuxin, Taming the tail in class-conditional GANs: Knowledge sharing via unconditional training at lower resolutions (UTLO). CVPR 2024. arXiv:2402.17065

Project by: Subash Timilsina, Sagar Shrestha, Xiao Fu.
