Week 1

In week 1 we met up on the following days:

What did we work on?

Tuesday 11/1 13:00-17:00

  • Trying to get the SIC framework working

  • Brainstorm ideas

  • Connect to Pepper

Thursday 13/1 13:00-17:00

  • Successfully connected to Pepper

  • Basic bowing

  • Research on Japanese Bows

    • Types

    • Meaning

  • Trying to come up with a workflow for performing a social greeting

    • Pose recognition

    • Age recognition

    • Eye fixation

    • Posture recognition

Friday 14/1 9:00-16:15

  • Implement the bows → first via Python → then via Choregraphe

  • Make the plan more concrete → obstacles/planning/feasibility

Tuesday

On Tuesday we met up with Kiran and Mincke in the Social AI Lab. Since Maike had COVID and Siddhant had an exam the following day, we consulted with them via Zoom and brainstormed ideas. In the end we settled on a personalised Japanese greeting that, depending on the person, can be made more or less formal. We plan to use computer vision to determine who is standing in front of Pepper and, based on that, which type of greeting should be used. If this fails, we want to fall back on a decision-tree-based questionnaire in which Pepper asks the person questions (e.g. "How old are you?", "Are we familiar?") and greets accordingly. During the rest of the lab we both tried to install MediaPipe and get it up and running. The last hour of the Tuesday lab was spent trying to connect to Pepper, but due to some errors we did not yet succeed.

On Wednesday we did some literature research and looked into libraries.

Personalised Japanese Greetings:

We based our idea on the following explanation of Japanese bows: http://hanko-seal.com/archives/7423.

Thursday

We met up with Mincke, Siddhant, and Kiran in the Social AI Lab. Siddhant and Mincke worked on connecting to Pepper, which turned out to be complicated since we got a lot of errors ('google module not found', and after that we had no way of selecting the mic, camera, etc.). After three hours, Vincent found out that the problem was the firewall. Kiran worked on the research part. After many attempts, we finally managed to connect to Pepper and perform actions by running the code provided in the repository. After connecting, we played around with recording motions with Pepper, trying to get it to make bows. However, this turned out to be difficult since it seemed that we could only move the arms and not the legs/torso.

As of now, the three questions provided in the assignment are answered as follows:

  1. What types of social greetings do you want to focus on?

    1. A greeting (verbal) and a Japanese bow (non-verbal) - the Japanese bow begins with a greeting.

  2. What does personalisation mean in your case?

    1. Pepper bows differently - with her arms in front, as it is assumed that Pepper is female - based on her familiarity with the individual in front of her.

      1. For example:

        1. Pepper performs “Eshaku” - a basic standing bow - if she knows the individual. Pepper leans her body at an angle of 15°.

        2. Pepper performs “Keirei” - a standing bow - if she is meeting the individual for the first time. Pepper leans her body at an angle of 30°.

  3. How does the teaching happen?

    1. In order to teach, the individual can ask a question or tap Pepper's head.

    2. We hope to make use of the MediaPipe Python module to extract the skeleton structure of the individual and map it to Pepper's joints to perform actions (see the sketch below).
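As a first idea of what that could look like, here is a minimal sketch that uses MediaPipe's Pose solution to pull shoulder landmarks from a webcam frame; the choice of landmarks and the eventual mapping to Pepper's joint angles are assumptions for illustration, not code we have running on Pepper.

```python
# Sketch: extract a human skeleton from a webcam frame with MediaPipe Pose.
# The mapping of these landmarks to Pepper's joints is a placeholder assumption.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def shoulder_landmarks(frame):
    """Return the (x, y, z) of both shoulders, or None if no pose is found."""
    with mp_pose.Pose(static_image_mode=True) as pose:
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks is None:
            return None
        lm = results.pose_landmarks.landmark
        left = lm[mp_pose.PoseLandmark.LEFT_SHOULDER]
        right = lm[mp_pose.PoseLandmark.RIGHT_SHOULDER]
        return (left.x, left.y, left.z), (right.x, right.y, right.z)

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    cap.release()
    if ok:
        print(shoulder_landmarks(frame))  # later: translate these into Pepper joint angles
```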

Friday

Kiran: look into the YOLO object detection library.
Siddhant & Mincke: try to implement the bows (15 & 30 & 45 degrees)
All: make the plan concrete, discuss different elements of the project and try to work them out.

Siddhant brought a Windows laptop, so now we have two laptops that can connect to Pepper. We spent the first 1 to 1.5 hours of the morning discussing and brainstorming about our project with Bülent & Kim. After this, Siddhant and Mincke tried to record a bow using motion_recording_example.py. However, we ran into an issue and could not seem to move the torso/legs/hips; it seemed as though everything but the left and right arm was disabled for movement. This is something we're going to ask Vincent next Tuesday. After this we installed Choregraphe in order to check whether the robot was able to make a bow with that program, which succeeded. We found out that the 45-degree bow we wanted to make probably would not work, since Pepper cannot bow that deep. We decided to implement three bows that Pepper makes with the hands next to the legs (this is the way a male bows → we change Pepper from female to male for this). After talking to Kim we concluded that eye fixation is not something we should give priority to. If Pepper's age guessing is not accurate, Pepper can always ask the subject's age. If Pepper detects a bow → it should skip the intro ("hello") and bow directly. We decided that it is not feasible for this specific project to have Pepper bow lower and lower when the subject does so as well (in Japan it is common that if someone bows lower, the other person bows lower too). Possible obstacle: when Pepper bows down, it is not able to see whether the subject has stopped bowing → we need to determine later on whether this is feasible for us to program.

Outline interaction:

Possible teaching moments:

TODO:

Week 2

What did we work on?

Monday 9:00 - 15:30

  • Finalising hard-coding the 3 bows in Choregraphe & running them in Python

  • Finalising the presentation (all)

Tuesday 9:30 - 16:15

  • Run the Socket_Connection files

  • Get age recognition working for static images

  • Work on YOLO

Thursday 9:30 - 15:30

  • Get age recognition working for webcam (worked & accurate) + write Python file to get age_recog working on Pepper (semi worked)

  • Get object detection working on Pepper (worked)

  • Look into Reinforcement Learning algorithms

Friday 9:30 - 15:00

  • Work on age recognition in Pepper

  • What to put in the presentation (Monday)

  • Create dataset for suit recognition (annotate)

Monday 17/1

TODO for today:

Step 2 - Follow  https://stackoverflow.com/questions/68659865/cannot-pip-install-mediapipe-on-macos-m1

We met up with Kim and Koen in the afternoon to discuss our project. There are still a lot of decisions to be made; we want to ask Vincent some questions tomorrow, and his answers will tell us what is feasible and what is not concerning sensor data and learning movements.
Decisions to be made:

We concluded that we would like to focus on the attire & age features rather than on face recognition. We would like Pepper to deduce the attire and age information (if the age feature is too noisy, we will input the age via speech) → based on this, the subject will stand to the side and show Pepper the exact bow that should be made for a person with this attire and age.

Potential challenges:

Presentation

We gave a short presentation to Kim and Koen during the afternoon regarding our idea, the current status of the project, and the issues we were facing. We also created short demonstration videos in Choregraphe to show the various types of bows we plan to implement.

The presentation can be found here: https://docs.google.com/presentation/d/1PTS3qL7yETuXRI7W1pe-HStI9x2n8gpOg620hhgAj8E/edit?usp=sharing .

Tuesday 18/1

We had a meeting with the whole group at 2 o'clock in the afternoon to discuss the progress we have made, to narrow down our project, and to come up with feasible next steps. The conclusion was that for now we will focus on 3 bows and let Pepper learn how long it should bow for which combination of attributes and age of a person. We will discuss this plan on Thursday with Kim or Koen.

Thursday 20/1

TODO: today we want to spend time trying out what exactly we get when running the socket_connector files, e.g. which xyz-values we get, and whether Pepper can detect when we come up from bowing and our hands start moving up.

Idea: since the age detection on the webcam is working very well → we define Pepper to be in the age category 25-32. Pepper sees someone in front of him and guesses the age:

Friday 21/1

Week 3

We met up on the following days:

What did we do?

Monday 24/1 10:00 - 15:45

  • Testing object recognition

  • Speech recognition & dialogflow

  • Mapping age detection to specific bow

  • Make presentation

Tuesday 25/1 9:30 - 16:30

  • Training YOLO with full dataset

  • Trouble-shooting bow in python

  • Trouble-shooting dialogflow

Thursday 27/1 9:15 - 16:45

  • Finalizing mapping age to a specific bow

  • Research into multi-armed bandit problem

Friday 28/1 9:15 - 15:30

  • Merging the different files (M)

  • Contextual MAB research (K)

  • Dialogflow trouble-shooting (S)

Monday 24.1

All: discuss progress + make presentation.

Discussion:
The presentation for week 3 can be found here.

Kiran manually annotated 600+ images containing formal and informal clothing to be trained with the YOLOv5 model. For training the model, an annotated dataset had to be prepared using http://app.roboflow.com . The images were obtained from https://storage.googleapis.com/openimages/web/visualizer/index.html?set=train&type=segmentation&r=false&c=%2Fm%2F0388q as it contains images with clothes and fashion accessories. The dataset was downloaded using the "fiftyone" Python module. The annotations were also included but could not be uploaded to http://app.roboflow.com . As a result, we had to annotate the images manually, which was time-consuming. Although the images were annotated and the model was trained (https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data ), the performance was really poor. The detection threshold had to be reduced to 0.1 for the model to recognise and locate at least one object in an image, and the model also produced invalid detections. We decided to give YOLOv5 one final try tomorrow with a much better dataset, or otherwise drop the idea of clothing detection. A sketch of the dataset-download and training steps follows below.
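A rough sketch of what the dataset download and training steps could look like; the class name, sample count, and training flags are illustrative assumptions based on the fiftyone and YOLOv5 documentation, not the exact commands we ran.

```python
# Sketch: pull clothing images from Open Images with the fiftyone module.
# Class name, sample count, and paths are illustrative assumptions.
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "open-images-v6",
    split="train",
    label_types=["detections"],
    classes=["Clothing"],   # assumed label of interest
    max_samples=600,
)
print(dataset)              # inspect what was downloaded before annotating/exporting

# After annotation (e.g. via Roboflow), YOLOv5 training itself is run from the shell,
# roughly as described in the linked wiki page (clothing.yaml is a hypothetical config):
#   python train.py --img 640 --batch 16 --epochs 50 --data clothing.yaml --weights yolov5s.pt
```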


The following feedback was given during the discussion on Monday afternoon:

Tuesday 25.1:

We couldn't get into the lab until 13:00, so we spent the morning doing research into options other than reinforcement learning.

Thursday 27.1

Friday 28.1

Week 4

What did we work on?

Monday 10:00 - 16:30

  • Make presentation (Siddhant)

  • Look into training dataset age detection model (Mincke)

  • Research into MAB (Siddhant & Kiran)

  • Trouble-shooting merging the files (Mincke)

  • Update logbook with feedback & start case study (Mincke)

Tuesday 10:00 - 17:00

  • Perform case study with 12 participants (Siddhant & Mincke)

  • Change bows in Choregraphe (S&M)

  • Meeting with Buelent about MAB and SA (Kiran)

Thursday 9:15 - 17:30

  • Dialogflow issues

  • Implementing the learning part

  • Merging all parts together

Friday 9:15 - 15:00

  • Final touches

  • Final presentation + demo

Monday 31.1

Unfortunately we did not have access to the lab today, so we worked together at the VU on the presentation and on things we could do without having the robots available.

Discussion:

We made the following presentation for the discussion of week 4:

Feedback:

After the discussion we had a group meeting where we decided that the rest of the afternoon would be spent on looking into simulated annealing.

Tuesday 1.2

After receiving the feedback at yesterday's discussion, we decided to completely throw away our current approach to the learning part of the project and go for another approach: simulated annealing. The morning was spent researching usable code, brainstorming about how the interaction would work, and discussing our options. Fajjaaz pointed us towards the SciPy simulated annealing function, which was really helpful. We had a meeting with Buelent planned at 13:30 to brainstorm about our learning plan. In the meantime, the goal was to recruit more people for our case study so we can present some numbers about the accuracy of the prediction in Friday's presentation.
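For reference, a minimal sketch of the SciPy routine Fajjaaz pointed us to (scipy.optimize.dual_annealing); the toy objective that penalises distance from a hypothetical ideal bow duration is only meant to show the interface, not our actual learning setup.

```python
# Sketch: SciPy's simulated annealing interface on a toy bow-duration objective.
# The objective and the bounds are illustrative assumptions.
from scipy.optimize import dual_annealing

IDEAL_DURATION = 4.0  # hypothetical duration (seconds) the user actually prefers

def cost(x):
    # penalise the distance between the proposed duration and the ideal one
    duration = x[0]
    return (duration - IDEAL_DURATION) ** 2

result = dual_annealing(cost, bounds=[(1.0, 10.0)], maxiter=200, seed=0)
print(result.x[0])  # should end up close to 4.0
```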

We changed the bows in Choregraphe and split them up. We used to have behaviours that contained the whole bow (bowing down and standing up) → since we want to influence the amount of time that Pepper stays down → we have broken the behaviour down into a bowing-down behaviour and a standing-up behaviour. That way we are able to modify the number of seconds that Pepper bows (see the sketch below).
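A small sketch of how the split behaviours could be driven with a variable hold time; the behaviour names and the run_behavior helper are hypothetical placeholders for whatever call our Pepper connection actually exposes.

```python
# Sketch: drive the split Choregraphe behaviours with a variable hold time.
# `run_behavior` and the behaviour names are hypothetical placeholders.
import time

def perform_bow(session, hold_seconds):
    """Bow down, hold the pose for `hold_seconds`, then stand back up."""
    session.run_behavior("bows/bow_down")   # hypothetical bowing-down behaviour
    time.sleep(hold_seconds)                # this is the duration we can now learn
    session.run_behavior("bows/stand_up")   # hypothetical standing-up behaviour
```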

After lunch, Siddhant & Mincke performed a case study; for the results, see the table below. For the experiment, we asked people to stand in front of Pepper (at approximately 1 to 1.5 meters distance) and let Pepper guess their age. We performed these experiments with and without glasses and with and without a face mask. We report an accuracy of 49% for a study with 12 participants (8 males, 4 females) with ages ranging from 21 to 50 years.

Kiran spent the day doing research into MAB and SA. According to Buelent and Kiran, MAB could be feasible. The plan for now is that tomorrow Kiran will try to implement MAB, while Siddhant & Mincke will look into simulated annealing and try to implement that. At the end of the afternoon we will have a Zoom meeting to discuss progress, and then we will decide which course to take.

Case study: How well can Pepper guess the participant’s age & perform the correct corresponding bow?

Detected age | Specifications (each bullet below pairs one age reading with the conditions for that trial)

Participant 1 (Koen, male, 50) (31.1) - Accuracy: 1/5

  • 48 - 53 years | glasses
  • 8 - 12 years | no accessory
  • no face detected | mask (blue)
  • 25 - 32 years | no accessory
  • no face detected | no accessory
  • 4 - 6 years | glasses

Participant 2 (Fajjaz, male, 26) (01.02) - Accuracy: 4/5

  • 25 - 32 years | glasses
  • 25 - 32 years | glasses
  • 25 - 32 years | no accessory
  • 38 - 43 years | no accessory
  • 25 - 32 years | no accessory
  • 38 - 43 years | mask (blue) + glasses

Participant 3 (Kiran, male, 25, beard) (01.02) - Accuracy: 4/4

  • 25 - 32 years | glasses + beanie
  • 25 - 32 years | beanie
  • 25 - 32 years | no accessory
  • 25 - 32 years | no accessory
  • 25 - 32 years | mask (black)
  • 25 - 32 years | mask (black)

Participant 4 (Mincke, female, 24) (01.02) - Accuracy: 4/5

  • 25 - 32 years | no accessory
  • 25 - 32 years | no accessory
  • 25 - 32 years | no accessory
  • 25 - 32 years | glasses
  • 38 - 43 years | glasses
  • no face detected | mask (black) + glasses
  • no face detected | mask (black)

Participant 5 (Siddhant, male, 25, beard) (01.02) - Accuracy: 3/5

  • 25 - 32 years | glasses
  • no face detected | glasses + mask (black)
  • 38 - 43 years | no accessory (too far)
  • 25 - 32 years | no accessory (arm's length)
  • 25 - 32 years | glasses
  • 38 - 43 years | glasses

Participant 6 (Thomas, male, 21, beard) (01.02) - Accuracy: 0/6

  • 8 - 12 years | no accessory
  • 8 - 12 years | no accessory
  • 8 - 12 years | glasses
  • 25 - 32 years | glasses + closer to Pepper
  • 25 - 32 years | no accessory
  • 25 - 32 years | no accessory
  • no face detected | mask (blue)

Participant 7 (Yi Wen, male, 32) (01.02) - Accuracy: 0

  • 38 - 43 years | mask (blue)
  • 4 - 6 years | mask (blue) + closer
  • no face detected | mask (blue)
  • 38 - 43 years | mask (blue)
  • 38 - 43 years | mask (blue)

Participant 8 (Florian, male, 34, no beard) (01.02) - Accuracy: 2/4

  • 25 - 32 years | no accessory
  • 38 - 43 years | no accessory
  • 38 - 43 years | no accessory
  • no face detected | mask (blue)
  • 38 - 43 years | mask (blue) + glasses
  • 25 - 32 years | glasses

Participant 9 (Nimat Ullah, male, 34, no beard) (01.02) - Accuracy: 1/4

  • 38 - 43 years | no accessory
  • 25 - 32 years | no accessory
  • 38 - 43 years | no accessory
  • 38 - 43 years | glasses
  • 38 - 43 years | mask (blue) + glasses
  • 4 - 6 years | mask (blue)

Participant 10 (Lima, female, 33) (01.02) - Accuracy: 2/4

  • 15 - 20 years | no accessory
  • 15 - 20 years | no accessory
  • 25 - 32 years | no accessory
  • 25 - 32 years | glasses
  • 15 - 20 years | mask (blue) + glasses
  • 15 - 20 years | mask (blue)

Participant 11 (Mojca, female, 50) (01.02) - Accuracy: 3/5

  • 25 - 32 years | no accessory
  • 48 - 53 years | no accessory
  • 25 - 32 years | no accessory + smile
  • 48 - 53 years | no accessory + no smile
  • 48 - 53 years | glasses
  • 25 - 32 years | mask (white) + glasses
  • 15 - 20 years | mask (white)

Participant 12 (Jamie, female, 24) (01.02) - Accuracy: 1/4

  • 15 - 20 years | no accessory
  • 15 - 20 years | no accessory
  • 25 - 32 years | no accessory
  • 48 - 53 years | glasses
  • no face detected | mask (black) + glasses
  • no face detected | mask (black)

Accuracy (no accessory + glasses):

  • Males (8): 15/33 = 45.5%
  • Females (4): 10/18 = 55.6%

Overall accuracy (males + females) (no accessory): 17/33 = 51.5%

Overall accuracy (males + females) (no accessory + glasses): 25/51 = 49.0%

Wednesday 2.2

We worked on developing the learning part for Pepper and had a lot of discussion with Buelent about the algorithm for it. The goal is to find the right bowing duration within a limited number of interactions, using a policy that follows a probability distribution based on the detected age group. The method was taken from some of the resources shared by Buelent. The action space (duration) is treated as continuous. Based on the feedback, the policy is shifted right or left: right if the feedback is 'longer' and left if the feedback is 'shorter'. This is achieved by using the mean of the distribution as a parameter and shifting that mean based on the feedback. The distribution chosen is a Gaussian distribution. The feature considered here is age; since our age prediction model outputs age ranges, we decided to take the mean and standard deviation of the age range as features.

We tried the following:

Thursday 3.2

The goal for today was to integrate speech recognition into the Python file for learning that Kiran made. The second goal was to integrate all the modules.

We had some major setbacks today. While Dialogflow did work on Siddhant's laptop last Friday, it suddenly stopped working today. We spent a lot of time troubleshooting Dialogflow and also researching other options, such as ALSpeechRecognition and using the touch sensors (head, left hand, and right hand). The sensors seemed like a good backup; however, we couldn't get information about when the sensors were touched via the Python script. We then thought we might design a 'detect sensor touched' movement via Choregraphe, but this also failed.

We got help from Yue, a PhD student, with the 'speech_recognition_example.py' code provided by Vincent. Unfortunately, at some point he also couldn't help us any further.

Since we couldn't make use of Dialogflow for speech recognition, we decided to communicate the feedback through our laptop. The speech recognition used is a naive implementation based on the Google Speech-to-Text module: we look for keywords indicating 'longer', 'shorter', or 'correct' in the recognized text, and based on these the duration is optimized using a method from http://incompleteideas.net/book/the-book.html (Section 13.7, page 337). A sketch of the keyword step is shown below.
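A minimal sketch of this keyword step, assuming the common speech_recognition package as the bridge to Google Speech-to-Text; the keyword lists are illustrative and not necessarily the exact words we check for.

```python
# Sketch: naive keyword spotting on Google Speech-to-Text output.
# Assumes the `speech_recognition` package; keyword lists are illustrative.
import speech_recognition as sr

LONGER = {"longer", "more", "deeper"}
SHORTER = {"shorter", "less", "quicker"}
CORRECT = {"correct", "good", "perfect"}

def get_feedback():
    """Listen on the laptop microphone and map the utterance to a feedback label."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    try:
        text = recognizer.recognize_google(audio).lower()
    except sr.UnknownValueError:
        return None  # nothing intelligible was said
    words = set(text.split())
    if words & LONGER:
        return "longer"
    if words & SHORTER:
        return "shorter"
    if words & CORRECT:
        return "correct"
    return None
```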

The method used for the learning part

The method learns the statistics of the probability distribution over the action space. For simplicity, we assume that our action space (duration) follows a Gaussian distribution with a standard deviation of 1 and is dependent on the age, which is treated as the state. As defined in http://incompleteideas.net/book/the-book.html (Section 13.7), the policy is a Gaussian:

pi(a | s, theta) = (1 / (sigma(s, theta) * sqrt(2 * pi))) * exp( -(a - mu(s, theta))^2 / (2 * sigma(s, theta)^2) )

Here,

mu(s, theta) = thetamu * xmu(s)   and   sigma(s, theta) = exp( thetastd * xstd(s) )

The mean is assumed to be a linear function of the mean of the state (xmu(s)) and the parameter (thetamu) defining the action. Since the standard deviation must be positive, it is an exponential of a linear function of the standard deviation of the state (xstd(s)) and the parameter (thetastd).

For example, in our problem, the age detected by Pepper is a range like ‘Age: 25 - 32’. To define the state, we compute the mean and standard deviation of the Age range by considering the lower and upper limits of the age range. For the age range 25-32, the mean is 28.5 and the standard deviation is 4.94.

Therefore, initially, xmu('Age: 25 -32') = 28.5 and xstd('Age: 25-32') = 4.94

For defining the initial parameters of the duration action space, we used priors: an initial duration of 3 seconds for 'Age: 25-32' and 6 seconds for 'Age: 38-43', 'Age: 48-53', and 'Age: 60-100'. Based on these priors, the parameters of the action space were defined as:

For Age: 25-32,

thetamu = 3/ xmu('Age: 25 -32') = 3/28.5 = 0.105

thetastd = 1

Here, the standard deviation is assumed to be 1 for simplicity as we’re only trying to optimize thetamu based on the feedback provided by the user.

Similarly for Age: 38-43 years,

thetamu = 6/ xmu('Age: 38 -43') = 6/40.5 = 0.148

thetastd = 1

The parameter thetamu is updated based on the user's feedback as follows:

Beta = 0.01/n

After 1st interaction with a user, Beta = 0.01/1 = 0.01

After 2nd interaction with the user, Beta = 0.01/2 = 0.005

After 3rd interaction with the user, Beta = 0.01/3 = 0.003

Here, we wanted the learning parameter to decay as we believe that with each interaction, we get closer to the user’s desired duration of the bow.

thetamu ← thetamu + Beta * Gt * ((At - mutheta(St)) / Variance) * xmu(St)

where At is the prior duration action and mutheta(St) is the mean of the Age range. Whether thetamu increases or decreases depends on the sign of the difference At - mutheta(St) and on the feedback: if the difference is negative and the feedback means to increase the duration, Gt is set to a negative value (say -0.1), and if the difference is positive and the feedback means to increase, Gt is set to a positive value (say 0.1).

The above-mentioned method can be explained with an example scenario:

Suppose Pepper detected an age range of 25 - 32. Pepper bows based on the prior i.e., 3 seconds. If the user provides feedback that means to increase the duration, in the first interaction:

xmu('Age: 25 -32') = 28.5

xstd('Age: 25-32') = 4.94

Variance = 4.94^2 = 24.40

thetamu = 0.148

thetastd = 1

Beta = 0.01

At = 3

Since At - xmu('Age: 25 - 32') = 3 - 28.5 < 0 and the feedback means to increase, Gt = -0.1.

Therefore,

thetamu ← 0.148 + 0.01 * (-0.1) * ((3 - 28.5) / 24.40) * 28.5 ≈ 0.177

A new action, larger than the previous one, will be sampled from the Gaussian policy with the updated parameter thetamu ≈ 0.177 and standard deviation 1.

The iteration continues until the user is satisfied with the duration of the bow.

Depending on the feedback, the mean of the normal distribution gets shifted either right or left (see the sketch below).
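To make the procedure concrete, here is a compact sketch of the update described above; the priors and the feedback-to-Gt mapping mirror the numbers in this section, but the code is an illustration of the method rather than the exact script we ran.

```python
# Sketch: Gaussian-policy update for the bow duration, following the method above.
# Priors and the feedback-to-Gt mapping mirror the worked example in this section.
import numpy as np

PRIORS = {"Age: 25-32": 3.0, "Age: 38-43": 6.0, "Age: 48-53": 6.0, "Age: 60-100": 6.0}

class BowDurationPolicy:
    def __init__(self, age_range, low, high):
        self.x_mu = (low + high) / 2.0            # e.g. 28.5 for 25-32
        self.x_std = np.std([low, high], ddof=1)  # e.g. ~4.95 for 25-32
        self.theta_mu = PRIORS[age_range] / self.x_mu
        self.n = 0                                # number of interactions so far

    def mean_duration(self):
        return self.theta_mu * self.x_mu

    def sample_duration(self):
        return float(np.random.normal(self.mean_duration(), 1.0))

    def update(self, last_duration, feedback):
        """Shift theta_mu based on 'longer'/'shorter' feedback."""
        self.n += 1
        beta = 0.01 / self.n                      # decaying learning rate
        diff = last_duration - self.x_mu
        g = 0.1 if diff > 0 else -0.1             # sign chosen so 'longer' raises the mean
        if feedback == "shorter":
            g = -g
        variance = self.x_std ** 2
        self.theta_mu += beta * g * (diff / variance) * self.x_mu


policy = BowDurationPolicy("Age: 25-32", 25, 32)
policy.update(3.0, "longer")
print(round(policy.theta_mu, 3))   # ~0.135, shifted up from the prior 0.105
```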

Friday 4.2

The goal for today was to have a working system that demonstrated our idea for the project. We created and finalised the presentation with input from all team members and made sure that the learning code worked during the presentation.

https://docs.google.com/presentation/d/1tQD5zUsPURcJq6x6HotX9a6lND5a5AlLng8mt5Sm4U0/edit?usp=sharing

Reflection

Though our original idea was to have Pepper bow with a varied duration based on whether it recognised the individual and on their age and attire, showing the appropriate respect and backed by vocal feedback, we realised that this goal was too lofty for the time we had.

Though the original idea could have been achieved with more time devoted to the project, we inevitably had to cut down on the modules required to achieve it; that is why attire detection was dropped. However, for future work it would be very interesting to implement attire detection in combination with age detection.

We experienced some issues with the learning part of the project. None of us really had any experience with reinforcement learning, which proved to be quite a challenge. We also endured some major setbacks throughout the course (Dialogflow, object recognition).