ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction

Abstract

ProDiG: Site Reconstruction Using Only Aerial Images

ProDiG addresses aerial-to-ground reconstruction when only aerial imagery is available. Rather than jumping directly from aerial views to ground-level rendering, it progressively synthesizes views at intermediate altitudes and refines the Gaussian scene representation at each stage. The method combines a geometry-aware causal attention module for diffusion-based refinement with a distance-adaptive Gaussian module that adjusts scale and opacity based on camera distance. Together, these yield more stable, coherent, and realistic ground-level reconstructions under extreme viewpoint gaps.

Method Overview

Figure placeholder: use the overview figure from page 2 of the paper.

Progressive Altitude Refinement, Causal Attention Mixing, Epipolar Geometry Conditioning, Distance-Adaptive Gaussian Module

ProDiG progressively lowers the viewpoint altitude, renders noisy novel views, refines them with aeroFix, and adds the fixed views back into training. The diffusion module uses pose-aware conditioning, Plücker ray embeddings, and epipolar-constrained causal attention to preserve structural consistency across large viewpoint changes.
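To make the Plücker ray embedding mentioned above concrete, here is a minimal NumPy sketch of the standard per-pixel construction (ray direction d and moment o × d). The function name, the pinhole intrinsics `K`, and the camera-to-world convention for `c2w` are assumptions for illustration; the paper's exact parameterisation and resolution may differ.

```python
import numpy as np

def plucker_embedding(K, c2w, height, width):
    """Per-pixel 6D Plücker coordinates (d, o x d) for a pinhole camera.

    K: 3x3 intrinsics. c2w: 4x4 camera-to-world matrix (assumed convention).
    Returns an array of shape (height, width, 6).
    """
    # Pixel centres in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)
    # Back-project to camera-space directions, then rotate into world space.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)    # unit directions
    # The camera centre is the ray origin for every pixel.
    origin = np.broadcast_to(c2w[:3, 3], dirs.shape)
    moment = np.cross(origin, dirs)                         # o x d
    return np.concatenate([dirs, moment], axis=-1)
```

The moment term is what makes the embedding pose-aware: two cameras looking in the same direction from different positions produce the same d but different o × d.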

aeroFix: Geometry-Aware Diffusion Refinement

Figure placeholder: use the aeroFix diagram from page 4 of the paper.

aeroFix refines noisy novel renders using a reference view while explicitly constraining cross-view attention. The model masks novel-query/reference-key interactions using epipolar lines, blocks reference-query/novel-key attention to preserve causality, and injects pose difference information into diffusion conditioning. Multi-scale Sobel weighting and DSSIM further preserve edges and perceptual consistency.
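The two masking rules described above can be sketched as a boolean attention mask over a token sequence laid out as [reference tokens, novel tokens]. This is an illustrative NumPy sketch, not the paper's implementation: the token layout, the shared token grid, the pixel threshold `tau`, the world-to-camera pose convention, and all function names are assumptions.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K_ref, K_nov, w2c_ref, w2c_nov):
    """F such that x_ref^T F x_nov = 0 for corresponding pixels."""
    rel = w2c_ref @ np.linalg.inv(w2c_nov)   # novel camera -> reference camera
    R, t = rel[:3, :3], rel[:3, 3]
    return np.linalg.inv(K_ref).T @ skew(t) @ R @ np.linalg.inv(K_nov)

def token_centres(grid_h, grid_w, patch):
    """Homogeneous pixel coordinates of token centres, row-major order."""
    v, u = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    pts = np.stack([(u + 0.5) * patch, (v + 0.5) * patch,
                    np.ones_like(u, dtype=float)], axis=-1)
    return pts.reshape(-1, 3)

def epipolar_causal_mask(F, grid_h, grid_w, patch, tau):
    """Boolean mask (True = attention allowed) over [reference, novel] tokens."""
    pts = token_centres(grid_h, grid_w, patch)   # same grid assumed for both views
    n = len(pts)
    lines = pts @ F.T                            # row i: epipolar line of novel token i
    num = np.abs(lines @ pts.T)                  # (novel query, reference key)
    den = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    near = num <= tau * np.maximum(den, 1e-12)   # point-to-line distance <= tau pixels
    mask = np.zeros((2 * n, 2 * n), dtype=bool)
    mask[:n, :n] = True     # reference queries attend within the reference view
    mask[n:, n:] = True     # novel queries attend within the novel view
    mask[n:, :n] = near     # novel -> reference only near the epipolar line
    # mask[:n, n:] stays False: reference queries never see novel keys (causality).
    return mask
```

For a purely horizontal camera shift with shared intrinsics, the epipolar lines are image rows, so each novel token may attend only to reference tokens in its own row when `tau` is smaller than the patch size.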

Qualitative Results

Figure placeholder: use the refinement comparison from page 6.

Compared with Difix3D+, aeroFix better preserves structure under large viewpoint differences and reduces reference-copying artifacts.

Aerial-to-Ground Reconstruction

Qualitative ProDiG reconstruction results

Figure placeholder: use the main qualitative comparison from page 6.

ProDiG produces more coherent geometry and more realistic ground-level renderings than 3DGS and Difix3D+ in challenging aerial-to-ground settings.

aeroFix Comparison

Method	DreamSim ↓	PSNR ↑	SSIM ↑	LPIPS ↓
Difix3D+	0.15	20.47	0.54	0.42
Difix (LoRA)	0.07	21.45	0.59	0.30
Pose + Plücker	0.06	22.30	0.64	0.27
Pose + Plücker + Causal	0.03	23.35	0.68	0.24
aeroFix (Ours)	0.03	23.68	0.69	0.24

WRIVA Results

Site	DreamSim ↓	PSNR ↑	SSIM ↑	LPIPS ↓
S06	0.50	11.26	0.33	0.67
S01	0.29	13.10	0.45	0.58

On WRIVA, ProDiG improves both structural and perceptual quality over strong Gaussian Splatting and diffusion-guided baselines.

MatrixCity

Method	DreamSim ↓	PSNR ↑	SSIM ↑	LPIPS ↓
3DGS	0.51	10.71	0.40	0.77
2DGS	0.62	9.29	0.28	0.81
3DGS-MCMC	0.49	10.84	0.41	0.77
Scaffold-GS	0.54	10.19	0.35	0.75
Difix3D+	0.48	11.38	0.38	0.63
Ours	0.39	12.39	0.41	0.50

Generalization and Ablations

Figure placeholder: use page 8 figures for varying-altitude generalization and ablations.

The paper reports that the Original Noisy Closer progressive strategy is the most stable overall, and that the distance-adaptive Gaussian module improves PSNR and SSIM, especially when camera distances vary widely.

Key Contributions

1. Causal Attention Mixing

Epipolar-constrained attention makes diffusion refinement more geometrically grounded under large viewpoint changes.

2. Distance-Adaptive Gaussian Module

Gaussian scale and opacity are modulated using per-Gaussian features and camera distance for stable refinement across altitudes.
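One way to picture this modulation is a small network that maps each Gaussian's feature vector and its log distance to the camera to a scale offset and an opacity offset. The sketch below is a forward pass only, and everything about it is hypothetical: the class name, the two-layer MLP, the hidden width, the log-distance input, and the zero-initialised output layer are assumptions chosen for illustration, not details taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DistanceAdaptiveModule:
    """Hypothetical distance-conditioned modulation of scale and opacity."""

    def __init__(self, feat_dim, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, size=(feat_dim + 1, hidden))
        self.b1 = np.zeros(hidden)
        # Zero-initialised output layer: the module starts as an identity,
        # so untrained Gaussians keep their original scale and opacity.
        self.w2 = np.zeros((hidden, 2))
        self.b2 = np.zeros(2)

    def __call__(self, log_scale, opacity_logit, feats, centres, cam_pos):
        # Distance from each Gaussian centre to the camera, shape (N, 1).
        dist = np.linalg.norm(centres - cam_pos, axis=1, keepdims=True)
        x = np.concatenate([feats, np.log(dist + 1e-8)], axis=1)
        h = np.maximum(x @ self.w1 + self.b1, 0.0)           # ReLU
        delta = h @ self.w2 + self.b2                        # (N, 2)
        # One shared log-scale offset for all three axes, one opacity offset.
        scale = np.exp(log_scale + delta[:, :1])             # (N, 3), positive
        opacity = sigmoid(opacity_logit + delta[:, 1])       # (N,), in (0, 1)
        return scale, opacity
```

Working in log-scale and logit space keeps the modulated scales positive and the opacities inside (0, 1) regardless of what the network outputs.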

3. Progressive Altitude Refinement

Intermediate-altitude synthesis gradually bridges the aerial-to-ground distribution gap instead of making a single large leap.
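The progressive loop (lower the altitude, render, refine, add the fixed view back into training) can be sketched with the rendering, refinement, and fitting steps passed in as callables. The linear altitude spacing, the choice of the most recently added view as the reference, and the names `altitude_schedule`, `progressive_refinement`, `render`, `refine`, and `fit` are all assumptions made for illustration; the paper's actual schedule and reference selection may differ.

```python
def altitude_schedule(start_alt, end_alt, num_stages):
    """Evenly spaced altitudes from just below start_alt down to end_alt.

    Linear spacing is an assumption; any monotone schedule fits the same loop.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    step = (start_alt - end_alt) / num_stages
    return [start_alt - step * (i + 1) for i in range(num_stages)]

def progressive_refinement(train_views, start_alt, end_alt, num_stages,
                           render, refine, fit):
    """Alternate between rendering lower views and refitting the scene.

    render(model, altitude) -> noisy view
    refine(noisy_view, reference=view) -> fixed view (the diffusion step)
    fit(views) -> scene model trained on the given views
    """
    views = list(train_views)
    model = fit(views)
    for alt in altitude_schedule(start_alt, end_alt, num_stages):
        noisy = render(model, alt)
        # Assumed reference choice: the most recently added training view.
        fixed = refine(noisy, reference=views[-1])
        views.append(fixed)
        model = fit(views)   # refined views feed back into training
    return model, views
```

Each stage only has to bridge a small viewpoint change, because the reference for the next refinement already sits one step lower than the previous one.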

BibTeX

@inproceedings{mitra2026prodig,
  title={ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction},
  author={Mitra, Sirshapan and Rawat, Yogesh S},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={22--32},
  year={2026}
}