# easy_ViTPose


## Accurate 2D human and animal pose estimation

### Easy-to-use SOTA `ViTPose` [Y. Xu et al., 2022] models for fast inference.

We provide all the original VitPose models, converted for inference, with a single dataset output format. In addition, we provide a COCO-25 model trained on the original COCO dataset plus the foot keypoint dataset (https://cmu-perceptual-computing-lab.github.io/foot_keypoint_dataset/).

Finetuning is not currently supported; check de43d54cad87404cf0ad4a7b5da6bacf4240248b and previous commits for a working state of `train.py`.

> [!WARNING]
> Ultralytics `yolov8` has an issue producing wrong bounding boxes when using `mps`; upgrade to the latest version! (Works correctly on 8.2.48)

## Results

![resimg](https://github.com/JunkyByte/easy_ViTPose/assets/24314647/51c0777f-b268-448a-af02-9a3537f288d8)

https://github.com/JunkyByte/easy_ViTPose/assets/24314647/e9a82c17-6e99-4111-8cc8-5257910cb87e

https://github.com/JunkyByte/easy_ViTPose/assets/24314647/63af44b1-7245-4703-8906-3f034a43f9e3

(Credits dance: https://www.youtube.com/watch?v=p-rSdt0aFuw)
(Credits zebras: https://www.youtube.com/watch?v=y-vELRYS8Yk)

## Features

- Image / video / webcam support
- Video support using the SORT algorithm to track bboxes between frames
- Torch / ONNX / TensorRT inference
- Runs the original VitPose checkpoints from [ViTAE-Transformer/ViTPose](https://github.com/ViTAE-Transformer/ViTPose)
- 4 ViTPose architectures with different sizes and performances (s: small, b: base, l: large, h: huge)
- Multiple skeletons and datasets (AIC / MPII / COCO / COCO + FEET / COCO WHOLEBODY / APT36k / AP10k)
- Human / animal pose estimation
- CPU / GPU / Metal support
- Show and save images / videos, and output to JSON

We run YOLOv8 for detection; it does not provide complete animal detection. You can finetune a custom YOLO model to detect the animal you are interested in; if you do, please open an issue, as we might want to integrate other detection models.

### Benchmark:
You can expect realtime >30 fps with modern NVIDIA GPUs and Apple Silicon (using Metal!).

### Skeleton reference
There are multiple skeletons for the different datasets. Check the definitions in [visualization.py](https://github.com/JunkyByte/easy_ViTPose/blob/main/easy_ViTPose/vit_utils/visualization.py).

## Installation and Usage

> [!IMPORTANT]
> Install `torch>2.0` with CUDA / MPS support by yourself.
> Also check `requirements_gpu.txt`.

```bash
git clone git@github.com:JunkyByte/easy_ViTPose.git
cd easy_ViTPose/
pip install -e .
pip install -r requirements.txt
```

### Download models

- Download the models from [Huggingface](https://huggingface.co/JunkyByte/easy_ViTPose)

We provide torch models for every dataset and architecture. If you want to run ONNX / TensorRT inference, download the appropriate torch ckpt and use `export.py` to convert it. You can use the `ultralytics` `yolo export` command to export YOLO to ONNX and TensorRT as well.

#### Export to onnx and tensorrt

```bash
$ python export.py --help
usage: export.py [-h] --model-ckpt MODEL_CKPT --model-name {s,b,l,h} [--output OUTPUT] [--dataset DATASET]

optional arguments:
  -h, --help            show this help message and exit
  --model-ckpt MODEL_CKPT
                        The torch model that shall be used for conversion
  --model-name {s,b,l,h}
                        [s: ViT-S, b: ViT-B, l: ViT-L, h: ViT-H]
  --output OUTPUT       File (without extension) or dir path for checkpoint output
  --dataset DATASET     Name of the dataset. If None it's extracted from the file name.
                        ["coco", "coco_25", "wholebody", "mpii", "ap10k", "apt36k", "aic"]
```
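For example, an export invocation might look like the following sketch; the checkpoint paths are placeholders, use whichever ckpt you downloaded from Huggingface:

```bash
# Convert a ViTPose torch checkpoint for ONNX / TensorRT inference (placeholder paths)
python export.py --model-ckpt ./ckpts/vitpose-s-coco_25.pth --model-name s --output ./ckpts/

# Export the YOLO detector with the ultralytics CLI (format=engine for TensorRT)
yolo export model=yolov8s.pt format=onnx
```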
["coco", "coco_25", "wholebody", "mpii", "ap10k", "apt36k", "aic"] ``` ### Run inference To run inference from command line you can use the `inference.py` script as follows: ```bash $ python inference.py --help usage: inference.py [-h] [--input INPUT] [--output-path OUTPUT_PATH] --model MODEL [--yolo YOLO] [--dataset DATASET] [--det-class DET_CLASS] [--model-name {s,b,l,h}] [--yolo-size YOLO_SIZE] [--conf-threshold CONF_THRESHOLD] [--rotate {0,90,180,270}] [--yolo-step YOLO_STEP] [--single-pose] [--show] [--show-yolo] [--show-raw-yolo] [--save-img] [--save-json] optional arguments: -h, --help show this help message and exit --input INPUT path to image / video or webcam ID (=cv2) --output-path OUTPUT_PATH output path, if the path provided is a directory output files are "input_name +_result{extension}". --model MODEL checkpoint path of the model --yolo YOLO checkpoint path of the yolo model --dataset DATASET Name of the dataset. If None it"s extracted from the file name. ["coco", "coco_25", "wholebody", "mpii", "ap10k", "apt36k", "aic"] --det-class DET_CLASS ["human", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "animals"] --model-name {s,b,l,h} [s: ViT-S, b: ViT-B, l: ViT-L, h: ViT-H] --yolo-size YOLO_SIZE YOLOv8 image size during inference --conf-threshold CONF_THRESHOLD Minimum confidence for keypoints to be drawn. [0, 1] range --rotate {0,90,180,270} Rotate the image of [90, 180, 270] degress counterclockwise --yolo-step YOLO_STEP The tracker can be used to predict the bboxes instead of yolo for performance, this flag specifies how often yolo is applied (e.g. 1 applies yolo every frame). This does not have any effect when is_video is False --single-pose Do not use SORT tracker because single pose is expected in the video --show preview result during inference --show-yolo draw yolo results --show-raw-yolo draw yolo result before that SORT is applied for tracking (only valid during video inference) --save-img save image results --save-json save json results ``` You can run inference from code as follows: ```python import cv2 from easy_ViTPose import VitInference # Image to run inference RGB format img = cv2.imread('./examples/img1.jpg') img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # set is_video=True to enable tracking in video inference # be sure to use VitInference.reset() function to reset the tracker after each video # There are a few flags that allows to customize VitInference, be sure to check the class definition model_path = './ckpts/vitpose-s-coco_25.pth' yolo_path = './yolov8s.pth' # If you want to use MPS (on new macbooks) use the torch checkpoints for both ViTPose and Yolo # If device is None will try to use cuda -> mps -> cpu (otherwise specify 'cpu', 'mps' or 'cuda') # dataset and det_class parameters can be inferred from the ckpt name, but you can specify them. model = VitInference(model_path, yolo_path, model_name='s', yolo_size=320, is_video=False, device=None) # Infer keypoints, output is a dict where keys are person ids and values are keypoints (np.ndarray (25, 3): (y, x, score)) # If is_video=True the IDs will be consistent among the ordered video frames. keypoints = model.inference(img) # call model.reset() after each video img = model.draw(show_yolo=True) # Returns RGB image with drawings cv2.imshow('image', cv2.cvtColor(img, cv2.COLOR_RGB2BGR)); cv2.waitKey(0) ``` > [!NOTE] > If the input file is a video [SORT](https://github.com/abewley/sort) is used to track people IDs and output consistent identifications. 
### OUTPUT json format

The output format of the json files:

```
{
    "keypoints": [  # The list of frames, len(json['keypoints']) == len(video)
        {  # For each frame a dict
            "0": [  # keys are the ids used to track people, values are the keypoints
                [121.19, 458.15, 0.99],  # Each keypoint is (y, x, score)
                [110.02, 469.43, 0.98],
                [110.86, 445.04, 0.99],
            ],
            "1": [
                ...
            ],
        },
        {
            "0": [
                [122.19, 458.15, 0.91],
                [105.02, 469.43, 0.95],
                [122.86, 445.04, 0.99],
            ],
            "1": [
                ...
            ]
        }
    ],
    "skeleton": {  # Skeleton reference, key is the idx, value is the name
        "0": "nose",
        "1": "left_eye",
        "2": "right_eye",
        "3": "left_ear",
        "4": "right_ear",
        "5": "neck",
        ...
    }
}
```
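As a minimal sketch of reading one of these files back (the filename is a placeholder; the actual name depends on your `--output-path` and the input name), assuming the format above:

```python
import json

# Placeholder filename, adjust to your own output path
with open('./video_result.json') as f:
    data = json.load(f)

skeleton = data['skeleton']  # maps keypoint index (as a string) -> joint name

for frame_idx, frame in enumerate(data['keypoints']):
    for person_id, kpts in frame.items():
        for joint_idx, (y, x, score) in enumerate(kpts):
            name = skeleton[str(joint_idx)]
            print(f'frame {frame_idx}, person {person_id}, {name}: ({x:.1f}, {y:.1f}), score {score:.2f}')
```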
## Finetuning

Finetuning is possible but not officially supported right now. If you would like to finetune and need help, open an issue. You can check `train.py`, `datasets/COCO.py` and `config.yaml` for details.

---

## Evaluation on COCO dataset

1. Download the COCO dataset images and labels
   - 2017 Val images [5K/1GB]: http://images.cocodataset.org/zips/val2017.zip

     The extracted directory looks like this:
     ```
     val2017/
     ├── 000000000139.jpg
     ├── 000000000285.jpg
     ├── 000000000632.jpg
     └── ...
     ```
   - 2017 Train/Val annotations [241MB]: http://images.cocodataset.org/annotations/annotations_trainval2017.zip

     The extracted directory looks like this:
     ```
     annotations/
     ├── person_keypoints_val2017.json
     ├── person_keypoints_train2017.json
     └── ...
     ```

2. Run the following command:

   ```bash
   $ python evaluation_on_coco.py

   Command line arguments:
   --model_path: Path to the pretrained ViT Pose model
   --yolo_path: Path to the YOLOv8 model
   --img_folder_path: Path to the directory containing COCO val images (/val2017 extracted in step 1)
   --annFile: Path to the json file of COCO keypoints for the val set (annotations/person_keypoints_val2017.json extracted in step 1)
   ```
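For example, a full invocation might look like the following sketch; the checkpoint paths are placeholders, adjust them to where you extracted the data and downloaded the models:

```bash
# Placeholder paths: point these at your own checkpoints and the extracted COCO folders
python evaluation_on_coco.py \
    --model_path ./ckpts/vitpose-b-coco.pth \
    --yolo_path ./yolov8s.pth \
    --img_folder_path ./val2017 \
    --annFile ./annotations/person_keypoints_val2017.json
```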
---

## TODO:
- refactor finetuning (currently not available)
- benchmark and check bottlenecks of the inference pipeline
- parallel batched inference
- other minor fixes
- yolo version for animal pose, check https://github.com/JunkyByte/easy_ViTPose/pull/18
- solve cuda exceptions on script exit when using tensorrt (no idea how)
- add info about inferred settings during inference, better output of inference status (device etc.)
- check if it is possible to make colab work without a runtime restart

Feel free to open issues, pull requests and contribute on these TODOs.

## Reference

Thanks to the VitPose authors and their official implementation [ViTAE-Transformer/ViTPose](https://github.com/ViTAE-Transformer/ViTPose).
The SORT code is taken from [abewley/sort](https://github.com/abewley/sort).