Emotion-Powered Art – Detecting Emotion from Speech

Hey friends, good to see you. Maybe you’ve heard about our new art piece debuting at Lake Merritt’s Autumn Lights 2022 festival. The Garden of Intent responds to its environment, changing what it shows based on what the people around it are doing. Specifically, the Garden of Intent is a projection mapping piece controlled via Speech Emotion Recognition, an area of machine learning that’s under active study.

Many people out there are trying to figure out just how to detect emotions, and there are a couple of major areas of study. Speech Emotion Recognition, or SER for short, tries to determine emotions directly from human speech (i.e., emotion from sounds). Of course, sound is only one way to detect emotion; you can also use images (your Xbox can tell if you’re sad, true story). The Garden of Intent is indeed sound based, but it works off another type of analysis known as sentiment analysis, where we analyze human language for emotional content, much the same way Twitter does.

Here’s a video of The Garden of Intent during its development – watch it if you want to see how all this tech is being used.

Garden of Intent – Machine Learning Powered Projection Mapping

The Garden of Intent is a projection mapped art piece. The projections are performed by the software HeavyM and are controlled via the Emotion Engine, a custom server written in Python. The Emotion Engine receives microphone input, extracts emotions, and then controls HeavyM via a control protocol known as OSC (Open Sound Control).
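For a taste of how that last step works, here’s a minimal sketch of sending OSC messages from Python using the python-osc package. The host, port, and OSC addresses below are placeholders for illustration; the real addresses depend on what HeavyM exposes in its OSC setup.

```python
# A minimal sketch of driving an OSC listener (such as HeavyM) from Python.
# Assumptions: the listener is on localhost:8000, and the /emotion/* addresses
# are hypothetical -- substitute whatever addresses your target app exposes.
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 8000)  # host and port of the OSC listener

def send_emotion(name: str, intensity: float) -> None:
    """Forward a detected emotion and its strength as two OSC messages."""
    client.send_message("/emotion/name", name)            # hypothetical address
    client.send_message("/emotion/intensity", intensity)  # hypothetical address

send_emotion("joy", 0.8)
```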

Emotion Detection in Speech

Okay, that’s all fine and good, but what you really came here to learn is how we detect emotions… let’s break it down. To determine emotions from people’s speech we use an emotion detection pipeline (a pipeline is a series of steps performed in order to achieve an outcome). Here’s a picture of the emotion detection pipeline:

Speech Emotion Detection for The Garden of Intent. Note that half of this pipeline is just trying to determine if someone is even talking.

Emotion detection has three major components:

  • Voice Activity Detection (VAD) – determine if a human is speaking (see the sketch just after this list)
  • Speech to Text – we use text-based sentiment analysis to determine emotion, so speech is converted to text prior to emotion extraction
  • Emotion Extraction – using a form of sentiment analysis (the same stuff that zucks you on Facebook) we extract emotions
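
To make that first step concrete, here’s a minimal sketch of voice activity detection using the open source webrtcvad package as a stand-in (the idea is the same even if the exact models differ):

```python
# A minimal VAD sketch using the webrtcvad package (a stand-in, not the
# Emotion Engine's actual human-detection models).
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness from 0 (least) to 3 (most)

SAMPLE_RATE = 16000     # webrtcvad accepts 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 30           # frames must be exactly 10, 20, or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

def is_someone_talking(pcm_audio: bytes) -> bool:
    """Return True if any 30 ms frame of the raw audio contains speech."""
    frames = (pcm_audio[i:i + FRAME_BYTES]
              for i in range(0, len(pcm_audio) - FRAME_BYTES + 1, FRAME_BYTES))
    return any(vad.is_speech(frame, SAMPLE_RATE) for frame in frames)
```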

Sounds simple enough, though there’s a lot of science and engineering behind this tech. There are in fact four different machine learning based analysis steps that occur: two machine learning layers to detect humans speaking, one machine learning layer to convert speech to text, and a final machine learning layer to determine emotions from text. Phew, it’s a lot – and lucky for us, lots of very smart people did most of the really hard work already.
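
To make the last two layers concrete, here’s a minimal sketch of the speech-to-text and emotion extraction steps, using the SpeechRecognition package and NLTK’s VADER sentiment analyzer as stand-ins (again, the idea carries over even if the exact models differ):

```python
# A minimal sketch of the speech-to-text and emotion-extraction stages,
# using SpeechRecognition and NLTK's VADER analyzer as stand-ins.
import nltk
import speech_recognition as sr                      # Microphone needs PyAudio
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)           # one-time lexicon download

recognizer = sr.Recognizer()
analyzer = SentimentIntensityAnalyzer()

def emotion_from_microphone() -> dict:
    """Listen once, transcribe the speech, and score its emotional content."""
    with sr.Microphone() as source:
        audio = recognizer.listen(source)            # record until silence
    text = recognizer.recognize_google(audio)        # speech to text (online API)
    return analyzer.polarity_scores(text)            # {'neg', 'neu', 'pos', 'compound'}
```

The 'compound' score runs from -1 (strongly negative) to +1 (strongly positive), which is a handy single number for driving visuals at different intensities.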

Garden of Intent – See it at Lake Merritt’s Autumn Lights 2022 Festival

Well folks, that was a lot of science, but what about the cool art videos? They are coming – this project is still being developed, after all. We will, of course, post videos and images as things progress.

Until then, the only way to see this installation is to get your tickets to the Autumn Lights Festival.