A novel embodied simulator to generate multimodal referring expressions containing both verbal utterances and non-verbal gestures captured from multiple views.
A sample generated using the CAESAR simulator
Verbal utterance: The red apple to the left of the black kettle
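
As a rough illustration of what such a sample combines, the sketch below pairs the verbal utterance with gesture recordings from several views. The class name, field names, view names, and file names are all hypothetical placeholders chosen for this example; they are not the simulator's actual output format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ReferringExpressionSample:
    """Hypothetical container for one multimodal sample (not the CAESAR file format)."""

    verbal_utterance: str
    # Maps an illustrative view name to the path of the recording that shows the gesture.
    gesture_views: Dict[str, str] = field(default_factory=dict)

    def is_multimodal(self) -> bool:
        """True when the sample has a non-empty utterance and at least one gesture view."""
        return bool(self.verbal_utterance.strip()) and len(self.gesture_views) > 0


sample = ReferringExpressionSample(
    verbal_utterance="The red apple to the left of the black kettle",
    # Placeholder view names and paths, assumed for illustration only.
    gesture_views={"view_1": "view_1.png", "view_2": "view_2.png"},
)
print(sample.is_multimodal())
```

Keeping the views in a dictionary keyed by view name is one simple way to let the verbal and non-verbal parts of the same expression travel together, whatever number of views is captured.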