ROS Paper Scissors

There are many applications for computer vision and hand gesture recognition. I chose to use gesture recognition to play Rock Paper Scissors.

Introduction

There has been a rapid increase in the number of commercially available robots in the past few years. Some of these robots use cameras to help navigate their surroundings. In this project, I explored using computer vision to play Rock Paper Scissors with a Nao robot. Computer vision is a field of study focused on enabling computers to interpret and understand visual data from the world around them. It involves the use of algorithms and mathematical models to analyze and understand images and videos. Computer vision has numerous applications, such as facial recognition, object detection, and hand gesture recognition.

ROS, on the other hand, is a robotics middleware system that provides a collection of software libraries and tools for building and integrating robotics applications. ROS is designed to be modular, flexible, and hardware-independent, making it easy to develop robotics applications that can run on different platforms and hardware configurations. ROS is widely used in robotics research and development because of its powerful features and capabilities, including hardware abstraction, communication protocols, and data management.

A hand gesture recognition system is a technology that enables computers to recognize and interpret hand movements and gestures. The system captures and analyzes hand movements and converts them into commands that can be used to control devices or perform tasks. Vision-based systems use cameras to capture images and recognize hand gestures based on image features. A hand gesture recognition system can be broken down into three main components: hand tracking, feature extraction, and gesture recognition. Hand tracking locates and follows the hand's movement in the image or sensor data. Feature extraction pulls relevant features out of the hand movement data. Gesture recognition identifies a specific gesture or pattern from the feature data. On top of these, a user interface lets users interact with the system, controlling devices or performing tasks based on the recognized gestures.
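
As a rough illustration of the feature extraction step, the snippet below turns 21 (x, y) hand landmarks (in the layout MediaPipe uses, with the wrist at index 0) into a feature vector that does not depend on where the hand sits in the frame or how large it is. The function name and the exact normalization are only one reasonable choice, not necessarily the ones used in this project.

# Illustrative feature extraction: 21 (x, y) landmarks -> flat feature vector.
import numpy as np

def landmarks_to_features(landmarks):
    # landmarks: list of 21 (x, y) tuples in image coordinates
    points = np.array(landmarks, dtype=np.float32)   # shape (21, 2)
    points -= points[0]                              # translate so the wrist is the origin
    scale = np.abs(points).max() or 1.0              # scale so hand size does not matter
    return (points / scale).flatten()                # shape (42,) feature vector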

Methodology

I first researched the ROS[2] ecosystem and learned how to capture video from a camera. I then learned about hand gesture recognition and MediaPipe. MediaPipe[3] is an open source machine learning package developed by Google that can track human hands and fingers in an image. It uses two models: one that detects a person's palm and another that tracks 21 landmarks on the hand (wrist, knuckles, and fingertips). These landmarks are then used to train a third model that classifies the gesture as one of five classes: Rock, Paper, Scissors, Yes, or No. The first three gestures are used to play the game; the remaining two indicate whether the player wishes to play again.
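
To show roughly what the hand tracking step looks like in code, here is a minimal sketch that uses MediaPipe's Hands solution to pull the 21 landmarks out of webcam frames. The camera index and confidence threshold are placeholders, and the gesture classifier itself is omitted.

# Minimal MediaPipe Hands sketch: read webcam frames and print a landmark.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.5)
cap = cv2.VideoCapture(0)          # default webcam; adjust the index as needed

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB, OpenCV delivers BGR
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        # 21 landmarks per detected hand, each with normalized x, y, z
        landmarks = results.multi_hand_landmarks[0].landmark
        print(landmarks[0].x, landmarks[0].y)   # wrist position, for example
cap.release()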


ROS uses a system of publishers and subscribers to share data within the ecosystem. This allows each node in the system to take in data and then provide other nodes with processed data. Nodes can be as large or as small as a task requires, though, as with any other computer program, each node should be kept as simple as possible so the system stays easy to understand and update. Another benefit of using nodes is that a node written by one developer can be reused in any other ROS-compliant system with minimal modification.
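
As a small sketch of this pattern, the pair of functions below uses rospy (the ROS 1 Python client library) to publish and subscribe to a topic. The topic name "chatter" and the message type are only illustrative.

# Minimal publisher/subscriber pair with rospy.
import rospy
from std_msgs.msg import String

def talker():
    pub = rospy.Publisher('chatter', String, queue_size=10)
    rospy.init_node('talker')
    rate = rospy.Rate(10)                      # publish at 10 Hz
    while not rospy.is_shutdown():
        pub.publish(String(data='hello'))
        rate.sleep()

def listener():
    rospy.init_node('listener')
    # The callback runs every time a new message arrives on the topic
    rospy.Subscriber('chatter', String, lambda msg: rospy.loginfo(msg.data))
    rospy.spin()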


Results

I was able to train the model to recognize the three moves a player can choose: Rock, Paper, and Scissors. In addition to these moves, I also included game control gestures for yes and no. Those gestures are used before the first game and after each round to see if the player would like to play a game of “ROS Paper Scissors.”
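
As an illustration of what such a gesture classifier can look like, here is a small Keras model that maps a 42-value landmark feature vector (21 landmarks times x and y) to the five gesture classes. The layer sizes are assumptions for the sketch, not the exact architecture trained for this project.

# Illustrative keypoint classifier for Rock, Paper, Scissors, Yes, No.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(42,)),                    # 21 landmarks * (x, y)
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')  # one output per gesture class
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(features, labels, epochs=50)  # features: (N, 42), labels: 0..4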


The model and video inputs are synchronized by ROS using publishers and subscribers. The first publisher is the image publisher, which takes the camera input and publishes it on an image topic so that images are available to the other nodes. The hand sign recognition node subscribes to the image topic and pulls in the image data. It then passes the image data to the gesture recognition script, which classifies the hand gesture and publishes a gesture topic containing a string describing the hand gesture made in the picture.
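
A sketch of what that hand sign recognition node could look like is below. The topic names ("/camera/image_raw", "/gesture") and the classify_gesture() helper are placeholders rather than the actual names used in this project.

# Sketch of the recognition node: image topic in, gesture string out.
import rospy
from sensor_msgs.msg import Image
from std_msgs.msg import String
from cv_bridge import CvBridge

bridge = CvBridge()
gesture_pub = None

def classify_gesture(frame):
    # Placeholder: run MediaPipe and the trained classifier on the frame
    return 'Rock'

def image_callback(msg):
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding='bgr8')
    gesture_pub.publish(String(data=classify_gesture(frame)))

if __name__ == '__main__':
    rospy.init_node('hand_sign_recognition')
    gesture_pub = rospy.Publisher('/gesture', String, queue_size=1)
    rospy.Subscriber('/camera/image_raw', Image, image_callback)
    rospy.spin()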

Conclusions

There are many applications for hand gesture recognition in everyday life. One is interpreting sign language, so that those who are deaf or mute can be understood more easily by people who don’t know sign language. Another application is in video games; this project is one example of a game that can be played with hand gestures. Finally, hand gestures can be used in virtual reality as the primary way to get user input.

While ROS is a useful system for standardizing packages across multiple platforms and languages, getting ROS set up properly is a fairly complicated task. In this project I had to use four different computers before I successfully set up ROS. One had an issue where a CPU instruction required by TensorFlow was not available. Another had an issue where a package required by ROS had been updated and the functions ROS used in that package were removed. Another issue was switching between ROS and ROS2. While both ROS and ROS2 accomplish the same goal of interoperability across multiple devices and multiple languages, the way ROS packages are built and written is fundamentally different in ROS2.

Acknowledgements

TrinhNC did an amazing job explaining how to set up ROS and the hand gesture recognition package.

Further Reading

TrinhNC. “ROS Hand Gesture Recognition.” GitHub, 4 Mar. 2023, github.com/TrinhNC/ros_hand_gesture_recognition.git. Accessed 15 Mar. 2023.

“ROS/Tutorials - ROS Wiki.” Ros.org, 2019, wiki.ros.org/ROS/Tutorials.

“Hand Landmarks Detection Guide | MediaPipe.” Google Developers, developers.google.com/mediapipe/solutions/vision/hand_landmarker.