Object Detection

Introduction

The object_detection service detects objects in an image. Detectron2 is used as the backend. For now, only humans are detected, but more classes could be added in the future. Detection uses instance segmentation, which means the service returns one binary mask per detected object in the image. Each mask has the same shape as the input image, with 1's where the object is detected and 0's everywhere else.

Docker name: object_detection

Input

  • sensors: Camera (stereo or mono)

  • actuators: None

  • services: X

  • parameters (note that the following parameters are hard-coded at the top of the file object_detection_service.py):

    • Threshold: float, sets the confidence level threshold. Default: 0.7

    • DPI: int, sets the number of Detections Per Image. Default: 100

    • MODEL: str, path to the model file (.pkl). Default: model_final_f10217.pkl

    • MODEL_PATH: str, path to the model configuration file (.yaml). Default: 'COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml'

Service Configuration

For the setup of the service, you can choose from multiple models.

Output

  • sensors: none

  • actuators: none

This service has two outputs. First, a simple string output is published. Second, a Protobuf object is added to a Redis zrange.

String output

The string output consists of "[X-COORDINATE];[Y-COORDINATE]". The coordinates represent the centroid of the instance segmentation mask. For example, if the centroid is [150,220], the output will look like:

"150;220"

This is published to the detected_object topic.
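The centroid string described above could be computed along these lines. This is a minimal sketch, not the service's actual code: the function name mask_centroid_string is hypothetical, and rounding the centroid to whole pixels is an assumption, since the source does not say how fractional coordinates are handled.

```python
import numpy as np


def mask_centroid_string(mask: np.ndarray) -> str:
    """Return the centroid of a binary mask as "x;y".

    Hypothetical helper; rounding to whole pixels is an assumption.
    """
    # np.nonzero returns row (y) indices first, then column (x) indices.
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("mask contains no detected pixels")
    x = int(round(float(xs.mean())))
    y = int(round(float(ys.mean())))
    return f"{x};{y}"


# A 480x640 mask with a rectangle covering rows 200..240 and columns 100..200.
mask = np.zeros((480, 640), dtype=bool)
mask[200:241, 100:201] = True
message = mask_centroid_string(mask)  # "150;220"
```

Note that the x-coordinate comes from the column index and the y-coordinate from the row index, which is easy to swap by accident when working with (H, W) arrays.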

Protobuf output

The Protobuf output carries the segmentation masks. The Protobuf is built as follows:

image_masks = ImageMasks()
image_masks.timestamp_ms  # timestamp of the image in milliseconds
image_masks.mask_width    # width of each mask in pixels
image_masks.mask_height   # height of each mask in pixels
image_masks.mask_count    # number of detected objects
image_masks.masks         # Python array (list) of booleans

Such a Protobuf object can be 'unpacked' to obtain the original masks again:

from numpy import array

original_masks = array(image_masks.masks).reshape((image_masks.mask_count, image_masks.mask_height, image_masks.mask_width))
original_masks = original_masks.astype(bool)

As you can see, the shape of original_masks is (N, H, W), where N is the number of masks, H the height in pixels, and W the width in pixels.
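To make the packing and unpacking concrete, the round trip can be sketched as follows. The generated ImageMasks class is not available here, so a SimpleNamespace with the same field names stands in for it; pack_masks and unpack_masks are hypothetical helper names, and flattening in row-major order is an assumption chosen to match the reshape call shown above.

```python
from types import SimpleNamespace

import numpy as np


def pack_masks(masks: np.ndarray, timestamp_ms: int) -> SimpleNamespace:
    """Flatten an (N, H, W) mask stack into the fields listed above.

    SimpleNamespace is a stand-in for the generated ImageMasks class.
    """
    count, height, width = masks.shape
    return SimpleNamespace(
        timestamp_ms=timestamp_ms,
        mask_width=width,
        mask_height=height,
        mask_count=count,
        # Assumption: row-major flattening, so reshape() can undo it.
        masks=masks.astype(bool).ravel().tolist(),
    )


def unpack_masks(image_masks) -> np.ndarray:
    """Rebuild the (N, H, W) boolean array from the flat list."""
    shape = (image_masks.mask_count, image_masks.mask_height, image_masks.mask_width)
    return np.array(image_masks.masks, dtype=bool).reshape(shape)


# Two tiny 3x4 masks as example input.
masks = np.zeros((2, 3, 4), dtype=bool)
masks[0, 1, 2] = True
masks[1, 0, :] = True

packed = pack_masks(masks, timestamp_ms=1_600_000_000_000)
restored = unpack_masks(packed)
```

The three size fields are what make the flat list recoverable: without mask_count, mask_height, and mask_width, the receiver cannot know where one mask ends and the next begins.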

This Protobuf output is added to the segmentation_stream as a serialized Protobuf object. The segmentation_stream is a Redis sorted set, read with the ZRANGE command, and is loosely comparable to a Python dictionary kept in key order. The timestamp_ms acts as the key (the score in Redis terms), and the serialized Protobuf is the value.

Initialisation

Using the service

In order to use our service for your purposes, an instance of the BasicSICConnector class has to be created. You can find the details of this class here. You may also want to write a callback function for retrieving a recognized object from the detection result.

In order to run this service, the following steps must be taken into consideration:

  1. You have the relevant services and drivers running.

  2. You pass your local IP address, an instance of BasicSICConnector, and an instance of ActionRunner.

  3. A callback function is set up for retrieving a recognized object from the detection result.
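The callback in step 3 could look roughly like the sketch below. It only covers parsing the "x;y" string described in the Output section; the names parse_detected_object and on_object_detected are hypothetical, and how the callback is registered with BasicSICConnector is not shown because the source does not specify it.

```python
from typing import List, Optional, Tuple


def parse_detected_object(message: str) -> Optional[Tuple[int, int]]:
    """Parse an "x;y" centroid string; return None if it is malformed."""
    try:
        # Tolerate surrounding whitespace and literal double quotes.
        x_text, y_text = message.strip().strip('"').split(";")
        return int(x_text), int(y_text)
    except ValueError:
        return None


# Centroids collected by the callback, oldest first.
received: List[Tuple[int, int]] = []


def on_object_detected(message: str) -> None:
    """Hypothetical callback: store every well-formed centroid."""
    centroid = parse_detected_object(message)
    if centroid is not None:
        received.append(centroid)


# Simulate two incoming messages; the malformed one is ignored.
on_object_detected("150;220")
on_object_detected("not a centroid")
```

Returning None for malformed input keeps the callback from raising inside the connector's event loop, which is a design choice of this sketch rather than documented behaviour.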

Example

The following file, https://bitbucket.org/socialroboticshub/connectors/src/master/python/speech_recognition_example.py, is available for the purpose of demonstration. Two questions are dealt with in this example. The first is an entity question where the point of interest is the name of the user. The second is a yes, no, or don't know question.

Setting up the agent

In order to deal with the first question, an intent needs to be set up. An intent is a value recognised from an end-user; in our example, it is the name of the person. The following steps will set an intent of your Dialogflow agent:

  1. Navigate to the agent's page to set the intent, training phrases and parameters.

  2. Create an agent intent.

    • It is recommended that the name suggests the kind of answer you are looking for in the audio. In our example, the name of the user ('answer_name').

    • the intent defined in the agent should correspond to the intent used in the code

  3. Create a context.

    • the number next to the context corresponds to the number of responses expected from the user in that context. In our example, that number is 0

  4. Create training phrases for the intent

    • the training phrases should be input examples that contain the intent. In our example, 'my name is name'

    • Dialogflow learns from these phrases and matches future user inputs based on them

  5. Create parameters for the intent

    • select words from the training phrases as parameters by double-clicking on them, then match them with their corresponding entity. They automatically appear in the 'Action and parameters' section. In our example, we are only interested in the 'name' of the user

Our complete intent example thus looks like this (note: using sys.given-name is usually preferred):

Figure 1

Events

  • onAudioIntent

    • a new intent is detected

  • IntentDetectionDone

    • a new intent has finished being detected

  • onAudioLanguage

    • the audio language has been changed

  • LoadAudioDone

    • if an audio file is used, the event is raised when the file has finished being loaded

Known Issues

  • None
