Garden of Intent

Garden of Intent – Machine Learning Powered Projection Mapping

Getting busy in HeavyM2

Hey friends, good to see you. We have a new interactive art piece that’s powered by machine learning. The Garden of Intent responds to its environment by changing what it shows based on what the people around it are doing. Specifically, the Garden of Intent is a projection mapping that’s controlled via Emotion Recognition, an area of machine learning under quite a bit of active study.

Many people ‘out there’ are trying to figure out just how to detect emotions, and there are a couple of major areas of study. Speech Emotion Recognition, or SER for short, tries to determine emotions from human speech (aka emotion from sound). Of course, sound is but one way to detect emotion; you can also use images (your Xbox can tell if you are sad by looking at your face, true story). The Garden of Intent is indeed sound based, but it works off another type of analysis known as sentiment analysis, where we analyze human language for emotional content, much the same way Twitter does.

Emotion Detection in Speech

The Garden of Intent is a projection mapped art piece. The projections are performed by the software HeavyM and are controlled via the Emotion Engine, a special server written in a language called Python. The Emotion Engine receives microphone input, analyzes it for emotion, and then controls HeavyM via a control protocol known as OSC.
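
To make the control side a little more concrete, here’s a minimal sketch of how a Python process can fire OSC messages at HeavyM using the python-osc library. This is not the actual Emotion Engine code – the host, port, and OSC address below are placeholders that depend on how OSC is configured in your HeavyM project.

```python
from pythonosc.udp_client import SimpleUDPClient  # pip install python-osc

# Placeholder connection details: HeavyM listens for OSC over UDP on a
# host/port that you configure in its OSC settings.
HEAVYM_HOST = "127.0.0.1"
HEAVYM_PORT = 7000  # assumed port – use whatever your HeavyM setup listens on

client = SimpleUDPClient(HEAVYM_HOST, HEAVYM_PORT)

def send_emotion(address: str, value: float = 1.0) -> None:
    """Fire a single OSC message at HeavyM.

    The address pattern is hypothetical; HeavyM documents the addresses it
    actually responds to, and they depend on your project layout.
    """
    client.send_message(address, value)

# Example: nudge the projection toward some "joy" look.
send_emotion("/garden/emotion/joy")
```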

Okay, that’s all fine and good, but what you really came here to learn about was how we detect emotions… let’s break it down. To determine emotions from people’s speech we have an emotion detection pipeline (a pipeline is just a series of steps performed in order to achieve an outcome). Here’s a picture of the emotion detection pipeline:

Speech Emotion Detection for The Garden of Intent. Note that half of this pipeline is just trying to determine if someone is even talking.

Emotion detection has three major components:

  • Voice Activity Detection (VAD) – determine if a human is speaking
  • Speech to Text – we use text-based sentiment analysis to determine emotion, so speech is converted to text prior to emotion extraction (the first two stages are sketched in code right after this list)
  • Emotion Extraction – using a form of sentiment analysis (the same stuff that zucks you on Facebook) we extract emotions from the text
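
To give a feel for those first two stages, here’s a rough Python sketch. It uses webrtcvad as a simple stand-in for the VAD layers the Emotion Engine actually uses (see Silero VAD in the reading list below) and Vosk for speech to text; the WAV file, model directory, and frame sizes are purely illustrative.

```python
import json
import wave

import webrtcvad                          # pip install webrtcvad
from vosk import Model, KaldiRecognizer   # pip install vosk

def has_speech(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> bool:
    """VAD gate: True if any frame of 16-bit mono PCM looks like speech."""
    vad = webrtcvad.Vad(2)  # aggressiveness 0 (lenient) .. 3 (strict)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        if vad.is_speech(pcm[start:start + frame_bytes], sample_rate):
            return True
    return False

def transcribe(wav_path: str, model_dir: str = "vosk-model") -> str:
    """Speech to text with Vosk; expects a 16 kHz, 16-bit mono PCM WAV."""
    wf = wave.open(wav_path, "rb")
    rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
    pieces = []
    while True:
        chunk = wf.readframes(4000)
        if not chunk:
            break
        if rec.AcceptWaveform(chunk):
            pieces.append(json.loads(rec.Result())["text"])
    pieces.append(json.loads(rec.FinalResult())["text"])
    return " ".join(p for p in pieces if p)

# Only bother transcribing (and later extracting emotion) when the VAD
# says someone is actually talking.
wf = wave.open("mic_capture.wav", "rb")  # placeholder capture file
if has_speech(wf.readframes(wf.getnframes()), wf.getframerate()):
    print(transcribe("mic_capture.wav"))
```

The real piece works on a live microphone stream rather than a WAV file, but the gating idea is the same: skip everything downstream unless a human is actually speaking.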

Sounds simple enough, though there’s a lot of science and engineering behind this tech. There are in fact four different machine learning based analysis steps: two machine learning layers to detect human speech, one layer to convert speech to text, and a final layer to determine emotions from the text. Phew, it’s a lot – and lucky for us, lots of very smart people did most of the really hard work already.
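
To illustrate that final layer, here’s a minimal sketch that uses NLTK’s VADER sentiment analyzer as a stand-in for the sentiment model inside the Emotion Engine; the thresholds and emotion labels are made up purely for illustration.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer  # pip install nltk

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
_analyzer = SentimentIntensityAnalyzer()

def emotion_from_text(text: str) -> str:
    """Collapse VADER's polarity scores into a coarse emotion label."""
    compound = _analyzer.polarity_scores(text)["compound"]  # -1.0 .. +1.0
    if compound >= 0.4:
        return "joy"
    if compound <= -0.4:
        return "anger"
    return "neutral"

# "beautiful" scores strongly positive, so this lands on the "joy" branch.
print(emotion_from_text("this garden is beautiful"))
```

From there it’s just a matter of turning the label into an OSC message like the one sketched earlier, and HeavyM takes care of the projections.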

Want to learn more?

Well, as always, you can simply contact us and we’ll tell you everything you’d ever want to know. For those self-motivated individuals, here’s some reading:

  • HeavyM – the projection mapping software used
  • Tutorial on Projection Mapping
  • Vosk – API used for Speech Recognition
  • Silero VAD – one of the two VAD layers used to detect humans
  • SER – one of many papers on Speech Emotion Recognition
  • OSC – Open Sound Control, the control protocol used to manipulate the projections