CPPF++ is a novel method for category-level 6D object pose estimation that bridges the sim-to-real gap using only 3D CAD models for training. Building upon the point-pair voting scheme of CPPF, our approach introduces a probabilistic framework that significantly improves robustness and accuracy in real-world scenarios.
- Uncertainty-aware voting: Models probabilistic distribution of point pairs to handle vote collisions
- N-point tuple features: Enhanced contextual information through multi-point feature aggregation
- Sim2Real robustness: Training exclusively on synthetic CAD models, generalizing to real images
- DiversePose 300: New challenging dataset for category-level pose estimation
- No real pose annotations required for training
- State-of-the-art sim2real performance on NOCS REAL275
- Superior generalization to novel datasets (Wild6D, PhoCAL, DiversePose)
- Real-time inference capability for practical applications
- Quick Start
- Evaluation
- Training
- Training with Custom Objects
- Datasets
- Method Overview
- Related Work
- Updates & Changelog
- Citation
Casual video captured on iPhone 15 - no smoothing, no object mesh required
Check out our new object pose benchmark PACE at ECCV 2024.
Also check out our work RPMArt at IROS 2024, which uses CPPF++ to estimate the pose of articulated objects (e.g., microwaves, drawers).
- Python 3.8+
- PyTorch 1.8+
- CUDA 11.0+
- CMake 3.16+
- Clone the repository

  ```bash
  git clone https://github.com/qq456cvb/CPPF2.git
  cd CPPF2
  ```

- Install Python dependencies

  ```bash
  pip install torch torchvision torchaudio numpy opencv-python imageio tqdm omegaconf
  # Additional dependencies for the complete pipeline:
  pip install torch-scatter visdom matplotlib pillow fire
  # For optimization (optional):
  pip install lietorch
  ```

- Build SHOT descriptor extraction

  ```bash
  cd src_shot
  mkdir build && cd build
  cmake .. -A x64 -DCMAKE_BUILD_TYPE=Release
  cmake --build . --config Release
  ```
Try our method on your own videos to generate pose estimation GIFs:
```bash
python demo.py --input_video your_video.mp4 --intrinsics_path camera_intrinsics.npy --object_category mug --output_gif result.gif
```
Required parameters:
- `--input_video`: Path to the input video file (e.g., `.mp4`, `.avi`)
- `--intrinsics_path`: Path to the camera intrinsics numpy file (`.npy`)
- `--object_category`: Target object category (e.g., `mug`, `bottle`, `bowl`)
Optional parameters:
- `--output_gif`: Output GIF filename (default: `output.gif`)
- `--input_depth_dir`: Directory containing depth files (optional, for better accuracy)
- `--frame_skip`: Process every N-th frame (default: 1; use higher values for faster processing)
- `--fps`: GIF frame rate (default: 10)
Camera Intrinsics Format:
The camera intrinsics should be a 3x3 numpy array saved as a `.npy` file:
```python
import numpy as np

# Example camera intrinsics matrix; replace the placeholder values below
# with your camera's focal lengths (fx, fy) and principal point (cx, cy).
fx, fy = 1000.0, 1000.0  # placeholder focal lengths in pixels
cx, cy = 960.0, 540.0    # placeholder principal point in pixels

intrinsics = np.array([
    [fx,  0, cx],
    [ 0, fy, cy],
    [ 0,  0,  1]
])
np.save('camera_intrinsics.npy', intrinsics)
```
Here `fx`, `fy` are the focal lengths and `cx`, `cy` are the principal point coordinates.
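If you do not have calibrated intrinsics (e.g., for a casually captured phone video), a rough pinhole approximation can be built from the image resolution and horizontal field of view. This is a minimal sketch, not a repository utility; the resolution and field-of-view values are placeholders you should replace with your camera's specs.

```python
import numpy as np

# Approximate pinhole intrinsics from image size and horizontal field of view.
width, height = 1920, 1080   # video resolution in pixels (placeholder)
horizontal_fov_deg = 65.0    # horizontal field of view in degrees (placeholder)

fx = (width / 2.0) / np.tan(np.deg2rad(horizontal_fov_deg) / 2.0)
fy = fx                              # assume square pixels
cx, cy = width / 2.0, height / 2.0   # assume principal point at the image center

intrinsics = np.array([
    [fx, 0.0, cx],
    [0.0, fy, cy],
    [0.0, 0.0, 1.0]
])
np.save('camera_intrinsics.npy', intrinsics)
```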
Download our pre-trained checkpoints from the `ckpts/` directory:
- DINO model: visual feature extraction
- SHOT model: geometric feature extraction
```bash
python eval.py
```
Note: You'll need Mask-RCNN masks from SAR-Net for evaluation.
Our method uses an ensemble of DINO (visual) and SHOT (geometric) features for optimal performance.
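As a rough illustration of what such an ensemble means at inference time, the toy snippet below late-fuses the scores that two branches assign to a shared set of pose hypotheses. The array names, the min-max normalization, and the equal weighting are assumptions for illustration, not the repository's actual fusion code.

```python
import numpy as np

# Toy late fusion: each branch scores the same candidate poses,
# and the ensemble averages the normalized scores before picking a winner.
rng = np.random.default_rng(0)
num_hypotheses = 100                      # candidate poses from the voting stage (toy)
scores_dino = rng.random(num_hypotheses)  # stand-in for the visual (DINO) branch
scores_shot = rng.random(num_hypotheses)  # stand-in for the geometric (SHOT) branch

def normalize(s):
    return (s - s.min()) / (s.max() - s.min() + 1e-8)

fused = 0.5 * normalize(scores_dino) + 0.5 * normalize(scores_shot)
print('selected hypothesis:', int(np.argmax(fused)))
```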
```bash
# Extract and dump training features (this may take some time)
python dataset.py
```
You can also create the environment with conda:

```bash
conda env create -f environment.yml
```
```bash
# Train DINO model (visual features)
python train_dino.py

# Train SHOT model (geometric features)
python train_shot.py
```
We provide a comprehensive tutorial for training on your own objects:
- Interactive tutorial: `train_custom.ipynb`

For detailed instructions, see the custom training section below.

Check `train_custom.ipynb` for a complete walkthrough with examples!
- NOCS REAL275: Category-level pose estimation benchmark
- YCB-Video: Instance-level object pose dataset
- Wild6D: In-the-wild pose estimation
- PhoCAL: Photorealistic object dataset
- DiversePose 300: Our new diverse pose dataset
We provide utilities to convert datasets into the REAL275 format:

```bash
python utils/wild6d_convert2real275.py
python utils/phocal_convert2real275.py
```
CPPF++ extends the point-pair voting framework with several key innovations:
- Probabilistic Voting: Instead of deterministic votes, we model the uncertainty of each point pair's contribution (a simplified sketch of this idea follows this list)
- N-point Tuples: Enhanced feature representation using multiple point relationships
- Ensemble Learning: Combining DINO (visual) and SHOT (geometric) features
- Uncertainty Estimation: Explicit modeling of pose estimation confidence
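To make the voting idea concrete, here is a deliberately simplified toy sketch (not the paper's actual formulation): each sampled point pair predicts the object center as an offset from its midpoint together with a scalar uncertainty, and votes are accumulated in a coarse 3D grid with inverse-variance weights so that uncertain pairs contribute less. All shapes, the grid parameters, and the weighting scheme are assumptions for illustration.

```python
import numpy as np

def vote_center(pairs, pred_offsets, pred_sigmas, grid_res=0.02, grid_extent=0.5):
    """Toy uncertainty-aware voting for the object center.

    pairs:        (N, 2, 3) sampled point pairs
    pred_offsets: (N, 3) predicted offset from each pair's midpoint to the center
    pred_sigmas:  (N,) predicted uncertainty (std dev) of each pair's vote
    """
    midpoints = pairs.mean(axis=1)                 # (N, 3)
    votes = midpoints + pred_offsets               # candidate centers, (N, 3)
    weights = 1.0 / (pred_sigmas ** 2 + 1e-6)      # uncertain pairs vote weakly

    # Accumulate weighted votes in a coarse 3D grid and return the strongest cell.
    bins = int(2 * grid_extent / grid_res)
    idx = np.clip(((votes + grid_extent) / grid_res).astype(int), 0, bins - 1)
    accumulator = np.zeros((bins, bins, bins))
    np.add.at(accumulator, (idx[:, 0], idx[:, 1], idx[:, 2]), weights)
    best = np.unravel_index(accumulator.argmax(), accumulator.shape)
    return (np.array(best) + 0.5) * grid_res - grid_extent

# Synthetic example: pairs sampled around an object centered at the origin.
rng = np.random.default_rng(0)
pairs = rng.normal(0.0, 0.1, size=(500, 2, 3))
offsets = -pairs.mean(axis=1) + rng.normal(0.0, 0.01, size=(500, 3))
sigmas = rng.uniform(0.01, 0.05, size=500)
print('estimated center:', vote_center(pairs, offsets, sigmas))
```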
- CPPF: Our previous point-pair voting framework
- RPMArt: Extension to articulated objects (IROS 2024)
- SAR-Net: Source for Mask-RCNN masks used in evaluation
- 2024/06/27 - Added custom object training tutorial (`train_custom.ipynb`)
- 2024/06/16 - Demo code release for reproducing results
- 2024/06/15 - Paper accepted to IEEE TPAMI!
- 2024/04/10 - Data processing scripts for Wild6D and PhoCAL datasets
- 2024/04/01 - Pre-trained models available in `ckpts/`
- 2024/03/28 - Major performance improvements across all datasets
- 2023/09/06 - Significant method updates with enhanced NOCS and YCB-Video performance
We welcome contributions! Please feel free to:
- Report issues or bugs
- Suggest new features or improvements
- Improve documentation
- Submit pull requests
For questions about the paper or code, please:
- Open a GitHub issue for technical problems
- Contact Yang You for research discussions
If you find our work helpful, please consider citing:
```bibtex
@ARTICLE{youcppf2,
  author={You, Yang and He, Wenhao and Liu, Jin and Xiong, Hongkai and Wang, Weiming and Lu, Cewu},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  title={CPPF++: Uncertainty-Aware Sim2Real Object Pose Estimation by Vote Aggregation},
  year={2024},
  volume={46},
  number={12},
  pages={9239-9254},
  keywords={Pose estimation;Training;Solid modeling;Noise measurement;Uncertainty;Training data;Filtering;Object pose estimation;sim-to-real transfer;point-pair features;dataset creation},
  doi={10.1109/TPAMI.2024.3419038}
}
```
This project is released under the MIT License. See LICENSE for details.