Week 1

In week 1 we met up on the following days:

What did we work on?

Tuesday 11/1 13:00-17:00

  • Trying to get the SIC framework working

  • Brainstorm ideas

  • Connect to Pepper

Thursday 13/1 13:00-17:00

  • Successfully connected to Pepper

  • Basic bowing

  • Research on Japanese Bows

    • Types

    • Meaning

  • Trying to come up with a workflow for performing a social greeting

    • Pose recognition

    • Age recognition

    • Eye fixation

    • Posture recognition

Friday 14/1 9:00-16:15

  • Implement the bows → first via Python → then via Choregraphe

  • Make the plan more concrete → obstacles/planning/feasibility

Tuesday

On Tuesday we met up with Kiran and Mincke in the Social AI Lab. Since Maike had COVID and Siddhant had an exam the following day, we consulted with them via Zoom and brainstormed ideas. In the end we settled on a personalised Japanese greeting that, depending on the person, can be made more or less formal. We plan to use computer vision to determine who is standing in front of Pepper and, based on that, which type of greeting should be used. If this fails, we want to fall back on a decision-tree-based questionnaire in which Pepper asks the person questions (e.g. "How old are you?", "Are we familiar?") and greets accordingly. During the rest of the lab we both tried to install MediaPipe and get it up and running. The last hour of the Tuesday lab was spent trying to connect to Pepper, but due to some errors we did not yet succeed.

On Wednesday we did some literature research and looked into libraries.

Personalised Japanese Greetings:

We based our idea on the following explanation of Japanese bows: http://hanko-seal.com/archives/7423.

Thursday

We met up with Mincke, Siddhant, and Kiran in the Social AI Lab. Siddhant and Mincke worked on connecting to Pepper, which turned out to be complicated since we got a lot of errors ('google module not found', and after that we had no way of selecting the mic, camera, etc.). After three hours, Vincent found out that the problem was the firewall. Kiran worked on the research part. After many attempts, we finally managed to connect to Pepper and perform actions by running the code provided in the repository. After connecting, we played around with recording motions with Pepper, trying to get it to make bows. However, this turned out to be difficult since it seemed that we could only move the arms and not the legs/torso.

As of now, the three questions provided in the assignment are answered as follows:

  1. What types of social greetings do you want to focus on?

    1. A greeting (verbal) and a Japanese bow (non-verbal) - the Japanese bow begins with a greeting.

  2. What does personalisation mean in your case?

    1. Pepper bows differently - with her arms in front, as it is assumed that Pepper is female - based on her familiarity with the individual in front of her.

      1. For example:

        1. Pepper performs “Eshaku” - a basic standing bow - if she knows the individual. Pepper leans her body at an angle of 15°.

        2. Pepper performs “Keirei” - a standing bow - if she is meeting the individual for the first time. Pepper leans her body at an angle of 30°.

  3. How does the teaching happen?

    1. In order to teach, the individual can ask a question or tap Pepper's head.

    2. We hope to make use of the MediaPipe Python module to extract the skeleton structure of the individual and map it to Pepper's joints to perform actions (see the sketch below).
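As a first idea of what that could look like, here is a minimal sketch that uses MediaPipe's Pose solution to pull shoulder landmarks from a webcam frame; the choice of landmarks and the eventual mapping to Pepper's joint angles are assumptions for illustration, not code we have running on Pepper.

```python
# Sketch: extract a human skeleton from a webcam frame with MediaPipe Pose.
# The mapping of these landmarks to Pepper's joints is a placeholder assumption.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def shoulder_landmarks(frame):
    """Return the (x, y, z) of both shoulders, or None if no pose is found."""
    with mp_pose.Pose(static_image_mode=True) as pose:
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks is None:
            return None
        lm = results.pose_landmarks.landmark
        left = lm[mp_pose.PoseLandmark.LEFT_SHOULDER]
        right = lm[mp_pose.PoseLandmark.RIGHT_SHOULDER]
        return (left.x, left.y, left.z), (right.x, right.y, right.z)

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    cap.release()
    if ok:
        print(shoulder_landmarks(frame))  # later: translate these into Pepper joint angles
```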

Friday

Kiran: look into the YOLO object detection library.
Siddhant & Mincke: try to implement the bows (15 & 30 & 45 degrees)
All: make the plan concrete, discuss different elements of the project and try to work them out.

Siddhant brought a Windows laptop, so now we have two laptops that can connect to Pepper. We spent the first 1 to 1.5 hours of the morning discussing and brainstorming about our project with Bülent & Kim. After this, Siddhant and Mincke tried to record a bow using motion_recording_example.py. However, we ran into an issue and could not seem to move the torso/legs/hips; it seemed as though everything but the left and right arm was disabled for movement. This is something we're going to ask Vincent next Tuesday. After this we installed Choregraphe in order to check whether the robot was able to make a bow with that program, which succeeded. We found out that the 45-degree bow we wanted to make probably would not work, since Pepper cannot bow that deep. We decided to implement three bows that Pepper makes with the hands next to the legs (this is the way a male bows → we change Pepper from female to male for this). After talking to Kim we concluded that eye fixation is not something we should give priority to. If Pepper's age guessing is not accurate, Pepper can always ask the subject's age. If Pepper detects a bow → it should skip the intro ("hello") and bow directly. We decided that it is not feasible for this specific project to have Pepper bow lower and lower when the subject does so as well (in Japan it is common that if someone bows lower, the other person bows lower too). Possible obstacle: when Pepper bows down, it is not able to see whether the subject has stopped bowing → we need to determine later on whether this is feasible for us to program.

Outline interaction:

Possible teaching moments:

TODO:

Week 2

What did we work on?

Monday 9:00 - 15:30

  • Finalising hard-coding the 3 bows in Choregraphe & running them in Python

  • Finalising the presentation (all)

Tuesday 9:30 - 16:15

  • Run the Socket_Connection files

  • Get age recognition working for static images

  • Work on YOLO

Thursday 9:30 - 15:30

  • Get age recognition working for webcam (worked & accurate) + write Python file to get age_recog working on Pepper (semi worked)

  • Get object detection working on Pepper (worked)

  • Look into Reinforcement Learning algorithms

Friday 9:30 - 15:00

  • Work on age recognition in Pepper

  • What to put in the presentation (Monday)

  • Create dataset for suit recognition (annotate)

Monday 17/1

TODO for today:

Step 2 - Follow  https://stackoverflow.com/questions/68659865/cannot-pip-install-mediapipe-on-macos-m1

We met up with Kim and Koen in the afternoon to discuss our project. There are still a lot of decisions to be made; we want to ask Vincent some questions tomorrow, and his answers will tell us what is feasible and what is not concerning sensor data and learning movements.
Decisions to be made:

We concluded that we would like to focus on the attire & age features rather than on face recognition. We would like Pepper to deduce the attire and age information (if the age feature is too noisy, we will input the age via speech) → based on this, the subject will stand to the side and show Pepper the exact bow that should be made for a person with this attire and age.

Potential challenges:

Presentation

We gave a short presentation to Kim and Koen during the afternoon regarding our idea, the current status of the project, and the issues we were facing. We also created short demonstration videos in Choregraphe to show the various types of bows we plan to implement.

The presentation can be found here: https://docs.google.com/presentation/d/1PTS3qL7yETuXRI7W1pe-HStI9x2n8gpOg620hhgAj8E/edit?usp=sharing .

Tuesday 18/1

We had a meeting with the whole group at 2 o'clock in the afternoon to discuss the progress we have made, to narrow down our project, and to come up with feasible next steps. The conclusion was that for now we will focus on 3 bows and let Pepper learn how long it should bow for which combination of attributes and age of a person. We will discuss this plan on Thursday with Kim or Koen.

Thursday 20/1

TODO: today we want to spend time trying out what exactly we get when running the socket_connector files, e.g. which xyz-values we get, and whether Pepper can detect when we come up from bowing and our hands start moving up.

Idea: since the age detection on the webcam is working very well → we define Pepper to be in the age category 25-32. Pepper sees someone in front of him and guesses the age:

Friday 21/1

Week 3

We met up on the following days:

What did we do?

Monday 24/1 10:00 - 15:45

  • Testing object recognition

  • Speech recognition & dialogflow

  • Mapping age detection to specific bow

  • Make presentation

Tuesday 25/1 9:30 - 16:30

  • Training YOLO with full dataset

  • Trouble-shooting bow in python

  • Trouble-shooting dialogflow

Thursday 27/1 9:15 - 16:45

  • Finalizing mapping age to a specific bow

  • Research into multi-armed bandit problem

Friday 28/1 9:15 - 15:30

  • Merging the different files (M)

  • Contextual MAB research (K)

  • Dialogflow trouble-shooting (S)

Monday 24.1

All: discuss progress + make presentation.

Discussion:
The presentation for week 3 can be found here.

Kiran manually annotated 600+ images containing formal and informal clothing to be trained with the YOLOv5 model. For training the model, an annotated dataset had to be prepared using http://app.roboflow.com . The images were obtained from https://storage.googleapis.com/openimages/web/visualizer/index.html?set=train&type=segmentation&r=false&c=%2Fm%2F0388q as it contains images with clothes and fashion accessories. The dataset was downloaded using the "fiftyone" Python module. The annotations were also included but could not be uploaded to http://app.roboflow.com . As a result, we had to annotate the images manually, which was time-consuming. Although the images were annotated and the model was trained (https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data ), the performance was really poor. The detection threshold had to be reduced to 0.1 for the model to recognise and locate at least one object in an image, and the model also produced invalid detections. We decided to give YOLOv5 one final try tomorrow with a much better dataset, or otherwise drop the idea of clothing detection. A sketch of the dataset-download and training steps follows below.
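A rough sketch of what the dataset download and training steps could look like; the class name, sample count, and training flags are illustrative assumptions based on the fiftyone and YOLOv5 documentation, not the exact commands we ran.

```python
# Sketch: pull clothing images from Open Images with the fiftyone module.
# Class name, sample count, and paths are illustrative assumptions.
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "open-images-v6",
    split="train",
    label_types=["detections"],
    classes=["Clothing"],   # assumed label of interest
    max_samples=600,
)
print(dataset)              # inspect what was downloaded before annotating/exporting

# After annotation (e.g. via Roboflow), YOLOv5 training itself is run from the shell,
# roughly as described in the linked wiki page (clothing.yaml is a hypothetical config):
#   python train.py --img 640 --batch 16 --epochs 50 --data clothing.yaml --weights yolov5s.pt
```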


The following feedback was given during the discussion on Monday afternoon:

Tuesday 25.1:

We couldn't get into the lab until 13:00, so we spent the morning doing research into options other than reinforcement learning.

Thursday 27.1

Friday 28.1

Week 4

What did we work on?

Monday 10:00 - 16:30

  • Make presentation (Siddhant)

  • Look into training dataset age detection model (Mincke)

  • Research into MAB (Siddhant & Kiran)

  • Trouble-shooting merging the files (Mincke)

  • Update logbook with feedback & start case study (Mincke)

Tuesday 10:00 - 17:00

  • Perform case study with 12 participants (Siddhant & Mincke)

  • Change bows in Choregraphe (S&M)

  • Meeting with Buelent about MAB and SA (Kiran)

Thursday 9:15 - 17:30

  • Dialogflow issues

  • Implementing the learning part

  • Merging all parts together

Friday 9:15 - 15:00

  • Final touches

  • Final presentation + demo

Monday 31.1

Unfortunately we did not have access to the lab today, so we worked together at the VU on the presentation and on things we could do without having the robots available.

Discussion:

We made the following presentation for the discussion of week 4:

Feedback:

After the discussion we had a group meeting where we decided that the rest of the afternoon would be spent on looking into simulated annealing.

Tuesday 1.2

After receiving the feedback at yesterday's discussion, we decided to completely throw away our current approach to the learning part of the project and go for another approach: simulated annealing. The morning was spent researching usable code, brainstorming about how the interaction would work, and discussing our options. Fajjaaz pointed us towards the SciPy simulated annealing function, which was really helpful. We had a meeting with Buelent planned at 13:30 to brainstorm about our learning plan. In the meantime, the goal was to recruit more people for our case study so we can present some numbers about the accuracy of the prediction in Friday's presentation.
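For reference, a minimal sketch of the SciPy routine Fajjaaz pointed us to (scipy.optimize.dual_annealing); the toy objective that penalises distance from a hypothetical ideal bow duration is only meant to show the interface, not our actual learning setup.

```python
# Sketch: SciPy's simulated annealing interface on a toy bow-duration objective.
# The objective and the bounds are illustrative assumptions.
from scipy.optimize import dual_annealing

IDEAL_DURATION = 4.0  # hypothetical duration (seconds) the user actually prefers

def cost(x):
    # penalise the distance between the proposed duration and the ideal one
    duration = x[0]
    return (duration - IDEAL_DURATION) ** 2

result = dual_annealing(cost, bounds=[(1.0, 10.0)], maxiter=200, seed=0)
print(result.x[0])  # should end up close to 4.0
```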

We changed the bows in Choregraphe and split them up. We used to have behaviours that contained the whole bow (bowing down and standing up) → since we want to influence the amount of time that Pepper stays down → we have broken the behaviour down into a bowing-down behaviour and a standing-up behaviour. That way we are able to modify the number of seconds that Pepper bows (see the sketch below).
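A small sketch of how the split behaviours could be driven with a variable hold time; the behaviour names and the run_behavior helper are hypothetical placeholders for whatever call our Pepper connection actually exposes.

```python
# Sketch: drive the split Choregraphe behaviours with a variable hold time.
# `run_behavior` and the behaviour names are hypothetical placeholders.
import time

def perform_bow(session, hold_seconds):
    """Bow down, hold the pose for `hold_seconds`, then stand back up."""
    session.run_behavior("bows/bow_down")   # hypothetical bowing-down behaviour
    time.sleep(hold_seconds)                # this is the duration we can now learn
    session.run_behavior("bows/stand_up")   # hypothetical standing-up behaviour
```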

After lunch, Siddhant & Mincke performed a case study; for the results, see the table below. For the experiment, we asked people to stand in front of Pepper (at approximately 1 to 1.5 meters distance) and let Pepper guess their age. We performed these experiments with and without glasses and with and without a face mask. We report an accuracy of 49% for a study with 12 participants (8 males, 4 females) with ages ranging from 21 to 50 years.

Kiran spent the day doing research into MAB and SA. According to Buelent and Kiran, MAB could be feasible. The plan for now is that tomorrow Kiran will try to implement MAB, while Siddhant & Mincke will look into simulated annealing and try to implement that. At the end of the afternoon we will have a Zoom meeting to discuss progress, and then we will decide which course to take.

Case study: How well can Pepper guess the participant’s age & perform the correct corresponding bow?

Detected age | Specifications (each bullet below pairs one age reading with the conditions for that trial)

Participant 1 (Koen, male, 50) (31.1) - Accuracy: 1/5

  • 48 - 53 years | glasses
  • 8 - 12 years | no accessory
  • no face detected | mask (blue)
  • 25 - 32 years | no accessory
  • no face detected | no accessory
  • 4 - 6 years | glasses

Participant 2 (Fajjaz, male, 26) (01.02) - Accuracy: 4/5

  • 25 - 32 years | glasses
  • 25 - 32 years | glasses
  • 25 - 32 years | no accessory
  • 38 - 43 years | no accessory
  • 25 - 32 years | no accessory
  • 38 - 43 years | mask (blue) + glasses

Participant 3 (Kiran, male, 25, beard) (01.02) - Accuracy: 4/4

  • 25 - 32 years | glasses + beanie
  • 25 - 32 years | beanie
  • 25 - 32 years | no accessory
  • 25 - 32 years | no accessory
  • 25 - 32 years | mask (black)
  • 25 - 32 years | mask (black)

Participant 4 (Mincke, female, 24) (01.02) - Accuracy: 4/5

  • 25 - 32 years | no accessory
  • 25 - 32 years | no accessory
  • 25 - 32 years | no accessory
  • 25 - 32 years | glasses
  • 38 - 43 years | glasses
  • no face detected | mask (black) + glasses
  • no face detected | mask (black)

Participant 5 (Siddhant, male, 25, beard) (01.02) - Accuracy: 3/5

  • 25 - 32 years | glasses
  • no face detected | glasses + mask (black)
  • 38 - 43 years | no accessory (too far)
  • 25 - 32 years | no accessory (arm's length)
  • 25 - 32 years | glasses
  • 38 - 43 years | glasses

Participant 6 (Thomas, male, 21, beard) (01.02) - Accuracy: 0/6

  • 8 - 12 years | no accessory
  • 8 - 12 years | no accessory
  • 8 - 12 years | glasses
  • 25 - 32 years | glasses + closer to Pepper
  • 25 - 32 years | no accessory
  • 25 - 32 years | no accessory
  • no face detected | mask (blue)

Participant 7 (Yi Wen, male, 32) (01.02) - Accuracy: 0

  • 38 - 43 years | mask (blue)
  • 4 - 6 years | mask (blue) + closer
  • no face detected | mask (blue)
  • 38 - 43 years | mask (blue)
  • 38 - 43 years | mask (blue)

Participant 8 (Florian, male, 34, no beard) (01.02) - Accuracy: 2/4

  • 25 - 32 years | no accessory
  • 38 - 43 years | no accessory
  • 38 - 43 years | no accessory
  • no face detected | mask (blue)
  • 38 - 43 years | mask (blue) + glasses
  • 25 - 32 years | glasses

Participant 9 (Nimat Ullah, male, 34, no beard) (01.02) - Accuracy: 1/4

  • 38 - 43 years | no accessory
  • 25 - 32 years | no accessory
  • 38 - 43 years | no accessory
  • 38 - 43 years | glasses
  • 38 - 43 years | mask (blue) + glasses
  • 4 - 6 years | mask (blue)

Participant 10 (Lima, female, 33) (01.02) - Accuracy: 2/4

  • 15 - 20 years | no accessory
  • 15 - 20 years | no accessory
  • 25 - 32 years | no accessory
  • 25 - 32 years | glasses
  • 15 - 20 years | mask (blue) + glasses
  • 15 - 20 years | mask (blue)

Participant 11 (Mojca, female, 50) (01.02) - Accuracy: 3/5

  • 25 - 32 years | no accessory
  • 48 - 53 years | no accessory
  • 25 - 32 years | no accessory + smile
  • 48 - 53 years | no accessory + no smile
  • 48 - 53 years | glasses
  • 25 - 32 years | mask (white) + glasses
  • 15 - 20 years | mask (white)

Participant 12 (Jamie, female, 24) (01.02) - Accuracy: 1/4

  • 15 - 20 years | no accessory
  • 15 - 20 years | no accessory
  • 25 - 32 years | no accessory
  • 48 - 53 years | glasses
  • no face detected | mask (black) + glasses
  • no face detected | mask (black)

Accuracy (no accessory + glasses):

  • Males (8): 15/33 = 45.5%
  • Females (4): 10/18 = 55.6%

Overall accuracy (males + females) (no accessory): 17/33 = 51.5%

Overall accuracy (males + females) (no accessory + glasses): 25/51 = 49.0%

Wednesday 2.2

We worked on developing the learning part for Pepper and had a lot of discussion with Buelent about the algorithm for it. The goal is to find the right bowing duration within a limited number of interactions, using a policy that follows a probability distribution based on the detected age group. The method was taken from some of the resources shared by Buelent. The action space (duration) is treated as continuous. Based on the feedback, the policy is shifted right or left: right if the feedback is 'longer' and left if the feedback is 'shorter'. This is achieved by using the mean of the distribution as a parameter and shifting that mean based on the feedback. The distribution chosen is a Gaussian distribution. The feature considered here is age; since our age prediction model outputs age ranges, we decided to take the mean and standard deviation of the age range as features.

We tried the following:

Thursday 3.2

The goal for today was to integrate speech recognition into the Python file for learning that Kiran made. The second goal was to integrate all the modules.

We had some major setbacks today. While Dialogflow did work on Siddhant's laptop last Friday, it suddenly stopped working today. We spent a lot of time troubleshooting Dialogflow and also researching other options, such as ALSpeechRecognition and using the touch sensors (head, left hand, and right hand). The sensors seemed like a good backup; however, we couldn't get information about when the sensors were touched via the Python script. We then thought we might design a 'detect sensor touched' movement via Choregraphe, but this also failed.

We got help from Yue, a PhD student, with the 'speech_recognition_example.py' code provided by Vincent. Unfortunately, at some point he also couldn't help us any further.

Since we couldn't make use of Dialogflow for speech recognition, we decided to communicate the feedback through our laptop. The speech recognition used is a naive implementation based on the Google Speech-to-Text module: we look for keywords indicating 'longer', 'shorter', or 'correct' in the recognized text, and based on these the duration is optimized using a method from http://incompleteideas.net/book/the-book.html (Section 13.7, page 337). A sketch of the keyword step is shown below.
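A minimal sketch of this keyword step, assuming the common speech_recognition package as the bridge to Google Speech-to-Text; the keyword lists are illustrative and not necessarily the exact words we check for.

```python
# Sketch: naive keyword spotting on Google Speech-to-Text output.
# Assumes the `speech_recognition` package; keyword lists are illustrative.
import speech_recognition as sr

LONGER = {"longer", "more", "deeper"}
SHORTER = {"shorter", "less", "quicker"}
CORRECT = {"correct", "good", "perfect"}

def get_feedback():
    """Listen on the laptop microphone and map the utterance to a feedback label."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    try:
        text = recognizer.recognize_google(audio).lower()
    except sr.UnknownValueError:
        return None  # nothing intelligible was said
    words = set(text.split())
    if words & LONGER:
        return "longer"
    if words & SHORTER:
        return "shorter"
    if words & CORRECT:
        return "correct"
    return None
```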

The method used for the learning part

The method learns the statistics of the probability distribution over the action space. For simplicity, we assume that our action space (duration) follows a Gaussian distribution with a standard deviation of 1 and is dependent on the age, which is treated as the state. As defined in http://incompleteideas.net/book/the-book.html (Section 13.7), the policy is a Gaussian:

pi(a | s, theta) = (1 / (sigma(s, theta) * sqrt(2 * pi))) * exp( -(a - mu(s, theta))^2 / (2 * sigma(s, theta)^2) )

Here,

mu(s, theta) = thetamu * xmu(s)   and   sigma(s, theta) = exp( thetastd * xstd(s) )

The mean is assumed to be a linear function of the mean of the state (xmu(s)) and the parameter (thetamu) defining the action. Since the standard deviation must be positive, it is an exponential of a linear function of the standard deviation of the state (xstd(s)) and the parameter (thetastd).

For example, in our problem, the age detected by Pepper is a range like ‘Age: 25 - 32’. To define the state, we compute the mean and standard deviation of the Age range by considering the lower and upper limits of the age range. For the age range 25-32, the mean is 28.5 and the standard deviation is 4.94.

Therefore, initially, xmu('Age: 25 -32') = 28.5 and xstd('Age: 25-32') = 4.94

For defining the initial parameters of the duration action space, we used priors: an initial duration of 3 seconds for 'Age: 25-32' and 6 seconds for 'Age: 38-43', 'Age: 48-53', and 'Age: 60-100'. Based on these priors, the parameters of the action space were defined as:

For Age: 25-32,

thetamu = 3/ xmu('Age: 25 -32') = 3/28.5 = 0.105

thetastd = 1

Here, the standard deviation is assumed to be 1 for simplicity as we’re only trying to optimize thetamu based on the feedback provided by the user.

Similarly for Age: 38-43 years,

thetamu = 6/ xmu('Age: 38 -43') = 6/40.5 = 0.148

thetastd = 1

The parameter thetamu is updated based on the user's feedback as follows:

Beta = 0.01/n

After 1st interaction with a user, Beta = 0.01/1 = 0.01

After 2nd interaction with the user, Beta = 0.01/2 = 0.005

After 3rd interaction with the user, Beta = 0.01/3 = 0.003

Here, we wanted the learning parameter to decay as we believe that with each interaction, we get closer to the user’s desired duration of the bow.

thetamu ← thetamu + Beta * Gt * ((At - mutheta(St)) / Variance) * xmu(St)

where At is the prior duration action and mutheta(St) is the mean of the Age range. Whether thetamu increases or decreases depends on the sign of the difference At - mutheta(St) and on the feedback: if the difference is negative and the feedback means to increase the duration, Gt is set to a negative value (say -0.1), and if the difference is positive and the feedback means to increase, Gt is set to a positive value (say 0.1).

The above-mentioned method can be explained with an example scenario:

Suppose Pepper detected an age range of 25 - 32. Pepper bows based on the prior i.e., 3 seconds. If the user provides feedback that means to increase the duration, in the first interaction:

xmu('Age: 25 -32') = 28.5

xstd('Age: 25-32') = 4.94

Variance = 4.94^2 = 24.40

thetamu = 0.148

thetastd = 1

Beta = 0.01

At = 3

Since At - xmu('Age: 25 - 32') = 3 - 28.5 < 0 and the feedback means to increase, Gt = -0.1.

Therefore,

thetamu ← 0.148 + 0.01 * (-0.1) * ((3 - 28.5) / 24.40) * 28.5 ≈ 0.177

A new action, larger than the previous one, will be sampled from the Gaussian policy with the updated parameter thetamu ≈ 0.177 and standard deviation 1.

The iteration continues until the user is satisfied with the duration of the bow.

Depending on the feedback, the mean of the normal distribution gets shifted either right or left (see the sketch below).
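To make the procedure concrete, here is a compact sketch of the update described above; the priors and the feedback-to-Gt mapping mirror the numbers in this section, but the code is an illustration of the method rather than the exact script we ran.

```python
# Sketch: Gaussian-policy update for the bow duration, following the method above.
# Priors and the feedback-to-Gt mapping mirror the worked example in this section.
import numpy as np

PRIORS = {"Age: 25-32": 3.0, "Age: 38-43": 6.0, "Age: 48-53": 6.0, "Age: 60-100": 6.0}

class BowDurationPolicy:
    def __init__(self, age_range, low, high):
        self.x_mu = (low + high) / 2.0            # e.g. 28.5 for 25-32
        self.x_std = np.std([low, high], ddof=1)  # e.g. ~4.95 for 25-32
        self.theta_mu = PRIORS[age_range] / self.x_mu
        self.n = 0                                # number of interactions so far

    def mean_duration(self):
        return self.theta_mu * self.x_mu

    def sample_duration(self):
        return float(np.random.normal(self.mean_duration(), 1.0))

    def update(self, last_duration, feedback):
        """Shift theta_mu based on 'longer'/'shorter' feedback."""
        self.n += 1
        beta = 0.01 / self.n                      # decaying learning rate
        diff = last_duration - self.x_mu
        g = 0.1 if diff > 0 else -0.1             # sign chosen so 'longer' raises the mean
        if feedback == "shorter":
            g = -g
        variance = self.x_std ** 2
        self.theta_mu += beta * g * (diff / variance) * self.x_mu


policy = BowDurationPolicy("Age: 25-32", 25, 32)
policy.update(3.0, "longer")
print(round(policy.theta_mu, 3))   # ~0.135, shifted up from the prior 0.105
```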

Friday 4.2

The goal for today was to have a working system that demonstrated our idea for the project. We created and finalised the presentation with input from all team members and made sure that the learning code worked during the presentation.

https://docs.google.com/presentation/d/1tQD5zUsPURcJq6x6HotX9a6lND5a5AlLng8mt5Sm4U0/edit?usp=sharing

Reflection

Though our original idea was to have Pepper bow with a varied duration based on whether it recognised the individual and on their age and attire, showing the appropriate respect and backed by vocal feedback, we realised that this goal was too lofty for the time we had.

Though the original idea could have been achieved with more time devoted to the project, we inevitably had to cut down on the modules required to achieve it; that is why attire detection was dropped. However, for future work it would be very interesting to implement attire detection in combination with age detection.

We experienced some issues with the learning part of the project. None of us really had any experience with reinforcement learning, which proved to be quite a challenge. We also endured some major setbacks throughout the course (Dialogflow, object recognition).