
I love the feeling of falling into the "zone" when I'm working. An hour zips by like a minute while I destroy my digital checklist. But then I notice that chalky taste in my mouth. My stomach churns. I've forgotten to drink water again.
The funny thing is, my computer has been watching me this whole time through its webcam. What if it could actually do something useful with that information? What if it could tell when I'm drinking and remind me when I'm not?
So I built Hydrovisor, a web app that watches for actual drinking behavior and reminds you to hydrate based on what it sees, not just on a timer.

Automated Drink Detection: The Computer Vision Challenge
Detecting whether someone is drinking is trickier than you'd think. You need to figure out:
- Object detection - Is there a cup, glass, or bottle in the frame? What if there are several?
- Face tracking - Where's the person's mouth exactly?
- Spatial relationships - Is that cup actually near the mouth or just sitting on the desk?
- Temporal patterns - Are they drinking or just moving the cup past their face?
You could try training one model to do everything, but I went with something simpler: use the best tool for each job. Roboflow for finding cups and bottles, MediaPipe for tracking faces. Let each model do what it's good at.
How to Build the Drinking Detection System
For face tracking, Google's MediaPipe Face Mesh was the obvious choice. Google has poured years of engineering into it, and it tracks 468 facial landmarks with remarkable precision. Why reinvent that wheel?
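To give a sense of what that buys you, here's a minimal browser setup using the @mediapipe/face_mesh npm package. This is a sketch, not Hydrovisor's exact wiring, and videoElement is assumed to be your webcam's <video> element:

// Minimal Face Mesh setup (sketch; not Hydrovisor's exact code)
import { FaceMesh } from '@mediapipe/face_mesh';

const faceMesh = new FaceMesh({
  // Load WASM and model assets from the CDN
  locateFile: (file) => `https://cdn.jsdelivr.net/npm/@mediapipe/face_mesh/${file}`,
});
faceMesh.setOptions({ maxNumFaces: 1, refineLandmarks: false });

faceMesh.onResults((results) => {
  // results.multiFaceLandmarks[0] holds the 468 normalized landmarks
  const landmarks = results.multiFaceLandmarks?.[0];
  if (landmarks) {
    // hand them to the lip-tracking code shown later
  }
});

// Call once per frame you want analyzed
const analyzeFrame = async () => {
  await faceMesh.send({ image: videoElement });
};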
Detecting drinks was a different story. I needed to train my own model, which meant finding training data. Roboflow Universe was a goldmine here, with millions of community-uploaded images to get started.

My coworkers were amazing and sent videos of themselves actually drinking water at their desks. This real-world data made all the difference.

The dataset went through several iterations. I started with stock photos of people drinking from glasses: total failure in the real world. Adding coworker videos with their actual coffee mugs and water bottles helped. I threw in static shots of various cups and bottles for good measure. Eventually I landed on 2,043 images that finally gave me a model that worked across different environments.

The model's publicly available if you want to play with it. But having a model is only half the battle. Now I needed to build an app that could run all day without killing someone's laptop or racking up cloud bills.
Why Roboflow Inference.js?
I had some specific requirements. The app needed to run continuously without being annoying, expensive, or complicated to set up. Roboflow's Inference.js checked all the boxes.
1. Runs on Your Machine, Not Mine
Inference.js uses your browser and your computer's processing power. No cloud servers, no monthly bills. I can offer Hydrovisor free forever because I'm not paying for GPUs in some data center.
// Roboflow Inference.js runs entirely in the browser
const inferenceEngine = new InferenceEngine();
const workerId = await inferenceEngine.startWorker(
  "beverage-detection-model",
  2, // model version
  "YOUR_PUBLISHABLE_KEY"
);
2. Works Instantly
Visit the website, allow camera access, start tracking hydration. That's it. No app downloads, no Python environments, no CUDA installations. If you can browse Instagram, you can use Hydrovisor.
// Simply import from npm - no CDN scripts or manual downloads needed
import { InferenceEngine, CVImage } from 'inferencejs';
// That's it! The library handles all the heavy lifting
// No WASM files to manually load, no model downloads to manage
3. Slow is Actually Fine
Here's something counterintuitive: you don't need blazing fast inference for drinking detection. It takes several seconds to lift a cup to your mouth and drink. Running inference at 5 FPS catches that just fine, and it means the app can run all day without turning your laptop into a space heater.
// Configure frame rates for optimal performance without overloading the system
export const FRAME_RATES = {
  faceDetection: 15, // Face tracking needs higher FPS for smooth UX
  objectDetection: 5, // Object detection can run at lower FPS to save resources
} as const;

// Use requestAnimationFrame with frame rate limiting
const OBJECT_FRAME_INTERVAL = 1000 / FRAME_RATES.objectDetection;
let lastFrameTime = 0;

const detectObjects = async (currentTime: number) => {
  // Only process if enough time has passed (200ms for 5 FPS)
  if (currentTime - lastFrameTime < OBJECT_FRAME_INTERVAL) {
    requestAnimationFrame(detectObjects);
    return;
  }
  lastFrameTime = currentTime;

  // Run inference on the current video frame
  const image = new CVImage(videoElement);
  const detections = await inferenceEngine.infer(workerId, image);

  requestAnimationFrame(detectObjects);
};
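Starting the loop is a single call: the browser passes a high-resolution timestamp to the callback, which is what currentTime receives above.

// Kick off the throttled detection loop
requestAnimationFrame(detectObjects);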
The Tech Stack
With Inference.js as the foundation, I kept everything else simple and mainstream:
- React 19 with TypeScript for a robust, type-safe frontend
- Roboflow Inference.js for real-time object detection
- MediaPipe Face Mesh for facial landmark tracking
- Zustand for efficient state management (a minimal store is sketched after this list)
- Framer Motion for smooth animations
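The Zustand piece deserves a quick sketch, since the startDrinking and stopDrinking calls you'll see later have to live somewhere. A minimal, hypothetical version follows; the field names are illustrative, not Hydrovisor's actual store:

// Hypothetical hydration store (sketch; Hydrovisor's real store is richer)
import { create } from 'zustand';

interface HydrationState {
  isDrinking: boolean;
  lastDrinkAt: number | null;
  startDrinking: () => void;
  stopDrinking: () => void;
}

export const useHydrationStore = create<HydrationState>((set) => ({
  isDrinking: false,
  lastDrinkAt: null,
  startDrinking: () => set({ isDrinking: true, lastDrinkAt: Date.now() }),
  stopDrinking: () => set({ isDrinking: false }),
}));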
Making It Work: Object Detection
First up, finding the drinks. Roboflow's Inference.js handles this part:
const detectObjects = async (currentTime: number) => {
  // Frame rate limiting at 5 FPS
  if (currentTime - lastFrameTime < OBJECT_FRAME_INTERVAL) {
    requestAnimationFrame(detectObjects);
    return;
  }
  lastFrameTime = currentTime; // remember when inference last ran

  const image = new CVImage(videoElement);
  const detections = await inferenceEngine.infer(workerId, image);

  // Filter for cups, glasses, bottles
  const drinkingObjects = detections.filter((d: Detection) =>
    DETECTION_CLASSES.includes(d.class.toLowerCase())
  );

  if (drinkingObjects.length > 0) {
    // Keep only the largest cup (by area) for tracking
    const largestCup = drinkingObjects.reduce((prev, current) => {
      const prevArea = prev.bbox.width * prev.bbox.height;
      const currentArea = current.bbox.width * current.bbox.height;
      return currentArea > prevArea ? current : prev;
    });
    setObjectDetected([largestCup]);
  }

  requestAnimationFrame(detectObjects);
};
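One identifier in that snippet, DETECTION_CLASSES, isn't shown. Its real contents come from the trained model's class names, but given the classes discussed earlier it's plausibly something like:

// Hypothetical allowlist; the actual strings depend on the trained model
const DETECTION_CLASSES = ['cup', 'glass', 'bottle'];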
Finding Faces with MediaPipe
While Roboflow finds the cups, MediaPipe tracks where your mouth is:
// Key MediaPipe landmark indices for lips
const LIP_LANDMARKS = {
  upperLipTop: 13,
  lowerLipBottom: 16,
  leftCorner: 61,
  rightCorner: 291,
  ...
};
export function getLipBoundingBox(faceKeypoints: Point[]): BoundingBox {
  // Extract lip points
  const lipPoints = [
    faceKeypoints[13], faceKeypoints[16],
    faceKeypoints[61], faceKeypoints[291]
  ];

  // Calculate a center-anchored bounding box with 20% padding on each side
  const xs = lipPoints.map(p => p.x);
  const ys = lipPoints.map(p => p.y);
  const width = (Math.max(...xs) - Math.min(...xs)) * 1.4;
  const height = (Math.max(...ys) - Math.min(...ys)) * 1.4;

  return {
    x: (Math.min(...xs) + Math.max(...xs)) / 2, // center x
    y: (Math.min(...ys) + Math.max(...ys)) / 2, // center y
    width, height
  };
}
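One coordinate-space detail to keep in mind: MediaPipe returns landmarks normalized to the 0-1 range, while Inference.js reports boxes in pixels. In a sketch like this, the keypoints would need scaling into pixel space first (assuming the landmarks array from the Face Mesh callback earlier):

// Scale normalized Face Mesh landmarks into pixel space so the lip box
// shares a coordinate system with the object detections
const faceKeypoints: Point[] = landmarks.map((lm) => ({
  x: lm.x * videoElement.videoWidth,
  y: lm.y * videoElement.videoHeight,
}));
const lipBox = getLipBoundingBox(faceKeypoints);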
Putting It Together: The Drinking Detection
Now for the fun part. We have cups, we have faces, but are they actually drinking? We use Intersection over Union (IoU) to figure out if the cup is near the mouth:
export function calculateIoU(box1: BoundingBox, box2: BoundingBox): number {
  // Convert to rectangles and calculate intersection
  const rect1 = boundingBoxToRectangle(box1);
  const rect2 = boundingBoxToRectangle(box2);

  const xOverlap = Math.max(0,
    Math.min(rect1.right, rect2.right) - Math.max(rect1.left, rect2.left));
  const yOverlap = Math.max(0,
    Math.min(rect1.bottom, rect2.bottom) - Math.max(rect1.top, rect2.top));

  const intersectionArea = xOverlap * yOverlap;
  const unionArea = box1.width * box1.height + box2.width * box2.height - intersectionArea;
  return intersectionArea / unionArea;
}
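The boundingBoxToRectangle helper isn't shown in the snippet; a minimal version, assuming the center-anchored BoundingBox shape that getLipBoundingBox returns:

// Convert a center-anchored box into edge coordinates for the overlap math
function boundingBoxToRectangle(box: BoundingBox) {
  return {
    left: box.x - box.width / 2,
    right: box.x + box.width / 2,
    top: box.y - box.height / 2,
    bottom: box.y + box.height / 2,
  };
}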
But here's the thing: drinking takes time. You lift the cup, drink, put it down. We need to handle that whole motion without getting confused by brief moments when detection fails:
const OVERLAP_THRESHOLD = 0.02; // 2% IoU threshold
const DEBOUNCE_MS = 500; // Continue for 500ms after last detection
const MIN_DRINKING_FRAMES = 2; // Minimum frames to start drinking

// Track overlap over time with refs; `now` is the current timestamp and
// `isOverlapping` is the result of this frame's IoU check
if (isOverlapping) {
  lastOverlapTimeRef.current = now;
  overlapFrameCountRef.current++;

  // Start drinking after minimum frames
  if (!isDrinking && overlapFrameCountRef.current >= MIN_DRINKING_FRAMES) {
    startDrinking();
  }
} else if (isDrinking) {
  const timeSinceLastOverlap = now - lastOverlapTimeRef.current;

  // Stop only after 500ms without overlap
  if (timeSinceLastOverlap > DEBOUNCE_MS) {
    stopDrinking();
    overlapFrameCountRef.current = 0;
  }
}
This smoothing is crucial. Without it, the app would constantly flip between "drinking" and "not drinking" as you move the cup around.
The Multiple Cups Problem
What if there are multiple drinks on your desk? Simple solution that actually works: track the biggest one. The cup you're drinking from is usually closer to the camera, so it appears larger:
// Find the largest cup by area
const largestCup = drinkingObjects.reduce((prev, current) => {
  const prevArea = prev.bbox.width * prev.bbox.height;
  const currentArea = current.bbox.width * current.bbox.height;
  return currentArea > prevArea ? current : prev;
});
What I Learned: Real-Time Drink Detection with Computer Vision
Building Hydrovisor proved that you don't need massive models or expensive infrastructure to solve real problems with computer vision. Sometimes the best solution is combining tools that already work well.
A few takeaways:
- Let specialized models do what they're good at instead of forcing one model to do everything
- Real-world interactions are messy; temporal smoothing is your friend
- Simple heuristics beat complex logic more often than you'd think
- 5 FPS is plenty when you're detecting actions that take seconds
The whole thing runs in your browser, costs nothing to use, and actually helps people stay hydrated. No data leaves your computer, no monthly bills, just a reminder to drink water when you need it.
Try it yourself at hydrovisor.live and explore the source code at github.com/rvirani1/hydrovisor. Never forget to hydrate again!
Written by Riaz Virani