This is a guest post by Mehek Gosalia, a high school student from Sammamish, Washington. She plans to study computer science.


Over the past four years, I've worked to develop a disability-accessible rhythm education tool called the Mehek Box. It started as a physical tool: a box representing a measure, with blocks of various lengths standing for rhythmic notes of different durations. Users could move "note blocks" around to fill up the measure in a tangible way and understand the comparative duration of each note. After meeting with a music therapist and a Braille educator, I decided to push the tool's tactile nature further and make it accessible to students with disabilities by adding Braille. During the pandemic, I also created an app and web version of the Mehek Box, which incorporate audio, vibration, and animation as multisensory ways of representing a given rhythm. I'm currently working with four programs to test and refine the web app and physical tool: two low-income schools, one adaptive music program, and the Perkins School for the Blind's music program.

Over the past few months, I've wanted to create an interface between the physical and digital Mehek Box, to explore their potential combined use in classrooms. I decided to add an image recognition feature to the Mehek Box app, so that users can scan the blocks in the physical box using an in-app camera and import those rhythms into the virtual box, where they can hear, see, and feel the rhythm played. To accomplish this, I trained a TensorFlow object detection model to recognize each rhythmic block using a custom dataset.
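
To give a sense of how that scanning step fits together end to end, here is a minimal sketch of turning a detector's output into a rhythm. It assumes a model exported with the TensorFlow Object Detection API; the model path, score threshold, and class-to-duration mapping are placeholders for illustration, not the app's actual code.

```python
# Minimal sketch: run an exported TF object detection model on a photo of the
# physical box and turn its detections into an ordered rhythm.
import numpy as np
import tensorflow as tf

# Hypothetical mapping from detected class id to note duration (in beats).
CLASS_TO_BEATS = {1: 4.0, 2: 2.0, 3: 1.0, 4: 0.5}  # whole, half, quarter, eighth

detect_fn = tf.saved_model.load("exported_model/saved_model")  # placeholder path

def image_to_rhythm(image_path, score_threshold=0.5):
    image = tf.io.decode_image(tf.io.read_file(image_path), channels=3)
    detections = detect_fn(tf.expand_dims(image, 0))

    boxes = detections["detection_boxes"][0].numpy()    # [ymin, xmin, ymax, xmax]
    classes = detections["detection_classes"][0].numpy().astype(int)
    scores = detections["detection_scores"][0].numpy()

    keep = scores >= score_threshold
    # Sort the surviving blocks left to right so they read like a measure.
    order = np.argsort(boxes[keep][:, 1])
    return [CLASS_TO_BEATS.get(c, 0) for c in classes[keep][order]]
```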

To create the initial dataset, I started with 12 types of blocks of varying sizes and colors. I created about 30-40 different configurations of full measures and took videos while rotating the box at different angles to vary lighting and exposure. I selected about 7-8 frames from each video, creating a 280-image dataset. But I was afraid that by training only on that dataset, the model would overfit and fail to recognize rhythms on backgrounds other than my tile floor. Since taking individual pictures and labeling them is very time-consuming, I decided to augment the existing dataset. Originally, I thought I would have to write a script to add Gaussian noise to the images, varying it so that different pictures got different levels of augmentation. However, while researching online, I found Roboflow, and after getting in contact with their team, I had the resources to augment my roughly 300 images into a 10,000-image dataset.
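
For reference, pulling a handful of evenly spaced frames from each video can be done with a short OpenCV script along these lines (the paths and frames-per-video count here are placeholders, not my exact workflow):

```python
# Rough sketch: sample evenly spaced frames from a video of the box with OpenCV.
import cv2
from pathlib import Path

FRAMES_PER_VIDEO = 8  # placeholder; I kept roughly 7-8 frames per video

def extract_frames(video_path, out_dir):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // FRAMES_PER_VIDEO, 1)
    saved = 0
    for i in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)  # jump to the i-th frame
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(str(out_dir / f"{Path(video_path).stem}_{i:05d}.jpg"), frame)
        saved += 1
        if saved >= FRAMES_PER_VIDEO:
            break
    cap.release()
```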

I applied six different augmentations, including Gaussian noise, exposure, and hue, over ranges where the blocks were still identifiable to the human eye. Roboflow randomized the degree to which each augmentation was applied to each image (for example, an image could have anywhere between -25% and 25% exposure added to it, generating a new image). Since the position of each block stayed constant, I didn't have to relabel any of the new images, greatly reducing the amount of brute-force work.
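
If you wanted to reproduce this kind of augmentation locally instead of in Roboflow, a script using the albumentations library (my choice for this sketch, not what Roboflow runs) can randomize noise, exposure, and hue within fixed ranges in much the same way:

```python
# Sketch of equivalent local augmentation. Each call randomizes the amount of
# noise, exposure (brightness), and hue shift applied, so one source image can
# yield many distinct training images. Because these are purely photometric
# changes, the blocks never move and the labels stay valid.
import albumentations as A
import cv2

augment = A.Compose([
    A.GaussNoise(p=0.7),                                   # amount randomized per image
    A.RandomBrightnessContrast(brightness_limit=0.25,      # roughly +/-25% exposure
                               contrast_limit=0.0, p=0.7),
    A.HueSaturationValue(hue_shift_limit=15,               # mild hue shift
                         sat_shift_limit=0, val_shift_limit=0, p=0.7),
])

image = cv2.imread("measure_001.jpg")          # placeholder filename
for i in range(5):                             # five augmented copies per source image
    out = augment(image=image)["image"]
    cv2.imwrite(f"measure_001_aug{i}.jpg", out)
```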

To test the efficacy of my augmentation, I set up an experiment. I created a single configuration file (pipeline.config) to tell each model how to train, with the same number of steps, the same initial model (EfficientDet-D0 from the TensorFlow Object Detection Model Zoo), and the same proportion of training to testing data; a sketch of that setup follows the list below. Then I trained the initial model in Google Colab using three different datasets:

  • the 280-image unaugmented dataset,
  • the 1,400-image switched-background dataset,
  • and the 10,000-image noisy augmented dataset.
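
Here is roughly what keeping the configuration constant across the three runs looks like, using the TF Object Detection API's config utilities; the paths, directory names, and record layout are illustrative rather than my exact setup:

```python
# Sketch: reuse one pipeline.config for all three runs, swapping only the
# TFRecord paths per dataset. The checkpoint, num_steps, batch size, and
# train/test proportion stay identical across runs.
from object_detection.utils import config_util

DATASETS = {  # hypothetical locations of the three datasets' TFRecords
    "unaugmented": "records/unaugmented",
    "switched_background": "records/switched_background",
    "augmented_10k": "records/augmented_10k",
}

for name, record_dir in DATASETS.items():
    configs = config_util.get_configs_from_pipeline_file("pipeline.config")

    configs["train_input_config"].tf_record_input_reader.input_path[:] = [
        f"{record_dir}/train.tfrecord"
    ]
    configs["eval_input_config"].tf_record_input_reader.input_path[:] = [
        f"{record_dir}/test.tfrecord"
    ]

    pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
    config_util.save_pipeline_config(pipeline_proto, f"configs/{name}")
    # Each run is then launched in Colab with the TF Object Detection API's
    # model_main_tf2.py, pointing --pipeline_config_path at configs/<name>.
```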

I tracked the total loss throughout training, and then tested each model on a completely new testing dataset with different backgrounds, measuring accuracy on images the models had never seen (but could potentially see in real use).
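
The accuracy check itself can be as simple as comparing each predicted left-to-right block sequence against the hand-labeled one; the sketch below is a simplified stand-in for the scoring, not my exact evaluation script:

```python
# Simplified accuracy sketch: for each test image, compare the predicted block
# classes (read left to right) with the ground-truth classes, and report the
# fraction of blocks identified correctly across the whole test set.
def sequence_accuracy(predicted_rhythms, truth_rhythms):
    correct = total = 0
    for predicted, truth in zip(predicted_rhythms, truth_rhythms):
        total += len(truth)
        correct += sum(p == t for p, t in zip(predicted, truth))
    return correct / total if total else 0.0

# Hypothetical usage: predictions come from image_to_rhythm() run on each
# held-out image, ground truth from the hand-labeled annotations.
# acc = sequence_accuracy(predicted_rhythms, ground_truth_rhythms)
```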

Surprisingly, all three models achieved over 59% accuracy on the testing set, with the model trained on the unaugmented dataset achieving the highest accuracy (73%). While image augmentation did not yield improved performance in this specific case with my testing data, Roboflow made it easy to test my theory quickly, which makes it easier for me to iterate on additional features moving forward. Most misidentifications occurred in images where the box was photographed at an extreme angle, so I plan to add a frame on screen to help users better align their images.