Introduction
The object_detection service detects objects in an image, using Detectron2 as its backend. For now, only humans are detected, but more classes could be added in the future. Detection uses instance segmentation, which means the service returns one binary mask per detected object in the image. Each mask has the same shape as the input image: pixels where the object is detected are 1, and all other pixels are 0.
Docker name: object_detection
Input
sensors: Camera (stereo or mono)
actuators: None
services: X
parameters (note that the following parameters are hard-coded at the top of the file object_detection_service.py):
Threshold: float, sets the confidence level threshold
DPI: int, sets the number of Detections Per Image
MODEL: str, path to the model file (.pkl). Default: model_final_f10217.pkl
MODEL_PATH: str, path to the model configuration file (.yaml). Default: 'COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml'
Service Configuration
When setting up the service, you can choose from multiple models.
First, choose a model from https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md#coco-instance-segmentation-baselines-with-mask-r-cnn.
Then, download the model through the download link.
Set the MODEL parameter to the path of the downloaded model.
Finally, set MODEL_PATH to 'COCO-InstanceSegmentation/[MODEL_NAME].yaml', where MODEL_NAME is the file name according to the GitHub model zoo page. For example, model “R50-C4” has MODEL_NAME “mask_rcnn_R_50_C4_1x”.
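As a rough illustration, the sketch below shows how the four parameters might map onto a Detectron2 configuration. It is not taken from object_detection_service.py: the values chosen for THRESHOLD and DPI and the helper names config_path_for and build_predictor are assumptions made for this example. Detectron2 is imported only inside build_predictor, so the path helper can be tried without it installed.

```python
# Hypothetical values; the real ones are hard-coded in object_detection_service.py.
THRESHOLD = 0.5  # assumed confidence threshold
DPI = 10         # assumed maximum number of detections per image
MODEL = "model_final_f10217.pkl"
MODEL_NAME = "mask_rcnn_R_50_FPN_3x"


def config_path_for(model_name: str) -> str:
    """Build the MODEL_PATH value from a model zoo file name."""
    return f"COCO-InstanceSegmentation/{model_name}.yaml"


MODEL_PATH = config_path_for(MODEL_NAME)


def build_predictor(model=MODEL, model_path=MODEL_PATH,
                    threshold=THRESHOLD, dpi=DPI):
    """Create a Detectron2 predictor from the four service parameters."""
    # Imported lazily so config_path_for works without Detectron2 installed.
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    # Load the architecture description that matches the downloaded weights.
    cfg.merge_from_file(model_zoo.get_config_file(model_path))
    cfg.MODEL.WEIGHTS = model
    # Discard detections scoring below the confidence threshold.
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = threshold
    # Cap the number of detections returned per image.
    cfg.TEST.DETECTIONS_PER_IMAGE = dpi
    return DefaultPredictor(cfg)
```

Switching to the “R50-C4” model would then mean pointing MODEL at its downloaded .pkl file and using config_path_for("mask_rcnn_R_50_C4_1x") as MODEL_PATH.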
Output
sensors: none
actuators: none
There are two outputs of this service. First, a simple string output is published. Second, a Protobuf object is added to a Redis zrange.
String output
The string output consists of “[X-COORDINATE];[Y-COORDINATE]”. The coordinates represent the centroid of the instance segmentation mask. For example, if the centroid is [150,220] the output will look like:
"150;220"
This is published to the detected_object topic.
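To make the string format concrete, here is a minimal sketch of computing a mask centroid and converting it to and from the “x;y” form. The function names are hypothetical, and rounding the centroid to whole pixels is an assumption; the service itself may compute or round the centroid differently. The mask is treated as a plain 2D sequence of 0s and 1s.

```python
from typing import Optional, Sequence, Tuple


def mask_centroid(mask: Sequence[Sequence[int]]) -> Optional[Tuple[int, int]]:
    """Return the (x, y) centroid of the non-zero pixels, or None if the mask is empty."""
    sum_x = sum_y = count = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    # Assumption: the centroid is rounded to whole pixel coordinates.
    return round(sum_x / count), round(sum_y / count)


def format_centroid(centroid: Tuple[int, int]) -> str:
    """Encode a centroid as "[X-COORDINATE];[Y-COORDINATE]"."""
    x, y = centroid
    return f"{x};{y}"


def parse_centroid(message: str) -> Tuple[int, int]:
    """Decode a message received on the detected_object topic."""
    x_text, y_text = message.split(";")
    return int(x_text), int(y_text)
```

A subscriber to the detected_object topic would only need something like parse_centroid to turn each message back into pixel coordinates.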
Protobuf output
The Protobuf output is used to output the segmentation masks.
{'intent': '[YOUR_INTENT]', 'parameters': {'[YOUR_PARAMETER]': '[PARAMETER_RESPONSE]'}, 'confidence': [CONFIDENCE_VALUE], 'text': '[RESPONSE_TEXT]', 'source': 'audio'}
'intent': str, the intent on which the audio was recognised, corresponding to the intent set on the agent
'parameters': dict, the parameters defined in the agent; each parameter is a str key paired with its response as a str value
'confidence': int, number ranging from 0 to 100 that defines how confident the API is with the intent and text detection
'text': str, speech-to-text response from the API
'source': str, for the SIC framework Dialogflow usage, the source is always 'audio'
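A result with the fields listed above could be read as in the following sketch. The helper name confident_text, the cut-off of 50, and the example values are all assumptions chosen for illustration; they are not part of the framework.

```python
from typing import Any, Dict, Optional


def confident_text(result: Dict[str, Any], minimum: int = 50) -> Optional[str]:
    """Return the recognised text, or None if the confidence is below `minimum`."""
    if result.get("confidence", 0) < minimum:
        return None
    return result.get("text")


# Made-up result that follows the documented field layout.
example = {
    "intent": "answer_name",
    "parameters": {"name": "Alice"},
    "confidence": 87,
    "text": "my name is Alice",
    "source": "audio",
}
```

Because 'confidence' ranges from 0 to 100, the cut-off is expressed on the same scale rather than as a fraction.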
Initialisation
Using the service
To use the service, create an instance of the BasicSICConnector class. You can find the details of this class here. You may also need a class to manage speech_recognition attempts and a callback function for retrieving a recognized entity from the detection result.
Before running this service, make sure that:
The relevant services and drivers are running.
Your local IP address, Dialogflow key file path, and Dialogflow agent ID are passed when creating the BasicSICConnector instance.
A partial function is set up for retrieving a recognized entity from the detection result.
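The partial-function step can be sketched with functools.partial from the Python standard library. The callback name on_entity, the recognised dictionary, and the example result are hypothetical, and how the callback is handed to BasicSICConnector is not shown here; the example file referenced below is the authoritative source for that.

```python
from functools import partial
from typing import Any, Dict


def on_entity(entity_name: str, store: Dict[str, str], result: Dict[str, Any]) -> None:
    """Copy the requested parameter from a detection result into `store`."""
    value = result.get("parameters", {}).get(entity_name)
    if value:
        store[entity_name] = value


recognised: Dict[str, str] = {}

# Bind the entity name and the store up front, so the resulting callback
# takes only the detection result as its argument.
on_name = partial(on_entity, "name", recognised)

# Simulated detection result (made-up values).
on_name({"intent": "answer_name", "parameters": {"name": "Alice"},
         "confidence": 87, "text": "my name is Alice", "source": "audio"})
```

Binding the arguments this way lets one generic function serve several questions, each with its own entity name.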
Example
The following file, https://bitbucket.org/socialroboticshub/connectors/src/master/python/speech_recognition_example.py, is available for the purpose of demonstration. Two questions are dealt with in this example. The first is an entity question where the point of interest is the name of the user. The second is a yes, no, or don’t know question.
Setting up the agent
To handle the first question, an intent needs to be set up. An intent is a value recognised from an end-user; in our example, it is the name of the person. The following steps set up an intent for your Dialogflow agent:
Navigate to the agent’s page to set the intent, training phrases and parameters.
Create an agent intent.
It is recommended that the name suggests the kind of answer you are looking for in the audio. In our example, this is the name of the user ('answer_name').
the intent defined in the agent should correspond to the intent used in the code
Create a context.
the number next to the context corresponds to the number of responses expected from the user in that context. In our example, that number is 0
Create training phrases for the intent
the training phrases should be input examples that contain the intent. In our example, 'my name is name'
Dialogflow learns from these phrases and matches future user inputs based on them
Create parameters for the intent
select words from the training phrases as parameters by double-clicking on them, then match them with their corresponding entity. They automatically appear in the ‘Action and parameters’ section. In our example, we are only interested in the ‘name’ of the user
Our complete intent example thus looks like this (note: using sys.given-name is usually preferred):
Events
onAudioIntent: a new intent is detected
IntentDetectionDone: a new intent has finished being detected
onAudioLanguage: the audio language has been changed
LoadAudioDone: if an audio file is used, the event is raised when the file has finished being loaded
Known Issues
In rare cases, Dialogflow suddenly responds only with 'UNAUTHENTICATED' errors. Restarting Docker and/or your entire machine seems to be the only way to resolve this.