Posts Written by Leo Ueno

Leo Ueno

ML Growth Associate @ Roboflow | Sharing the magic of computer vision | leoueno.com

Finetuning Moondream2 for Computer Vision Tasks

In this guide, we finetune and improve Moondream2, a small, local, fast multimodal Vision Language Model, for a computer vision task.

PaliGemma: An Open Multimodal Model by Google

PaliGemma is a vision language model (VLM) developed and released by Google that has multimodal capabilities. Learn how to use it.

GPT-4o: The Comprehensive Guide and Explanation

Learn what GPT-4o is, how it differs from previous models, how it performs, and what its use cases are.

Realtime Video Stream Analysis with Computer Vision

In this guide, we use computer vision to process multiple live video streams to perform analysis and gain insights.

What is Handwriting Recognition?

In this guide, we give an overview of handwriting recognition, including its use cases, challenges, and ways of using it, followed by a tutorial.

How to Use OCR on Videos

In this guide, we cover the process of how to use OCR on videos together with computer vision to solve real-world problems.

Best OCR Models for Text Recognition in Images

See how nine different OCR models compare for scene text recognition across industrial domains.

How to Use YOLO-World With Active Learning to Train a Custom Model

In this guide, we demonstrate an approach where we can start using the benefits of YOLO-World now, while simultaneously collecting data to train a faster custom model later.

How to Use Multiple Models to Label Datasets with Autodistill

In this guide, we cover the benefits of combining multiple models, and how to do so, to automatically label a dataset of images.

Occupancy Analytics with Computer Vision

Computer vision can be used to understand videos for real-time analytics and automatically gather information about complex physical environments.

Comparing Specialized Models to AWS Rekognition

In this guide, we cover how to compare specialized models against Amazon Rekognition, a suite of computer vision APIs.

Google's Gemini Multimodal Model: What We Know

In this guide, we are going to discuss what Gemini is, for whom it is available, and what Gemini can do (according to the information available from Google). We will also look ahead to potential applications for Gemini in computer vision tasks.

Comparing Custom Models to Google Cloud Vision API

In this guide, we go over how to evaluate object detection models on Roboflow Universe versus Google Cloud Vision.

Comparing Computer Vision Models On Custom Data

In this guide, we show how to compare the performance of two person detection models on Roboflow Universe using a benchmark dataset and supervision.

Using Computer Vision to Improve Railway Safety

In this guide, we show how to use computer vision to identify hazardous situations on railways for use in building safety systems.

How to Use Kaggle for Computer Vision

In this guide, we show how to use Kaggle Notebooks for computer vision tasks.

How to Use Node-RED with Roboflow

In this guide, we show how to run inference on computer vision models with Roboflow and Node-RED.

Ultimate Guide to Converting Bounding Boxes, Masks and Polygons

In this guide, we show how to convert between bounding boxes (xyxy), masks, and polygons.

A LLaMa 2, Midjourney & Autodistill Computer Vision Pipeline

Combine Midjourney, Autodistill, LLaMa 2, and Roboflow to create an object detection model without data collection or labeling.

Prompting Google Bard with Images & How it Compares to Bing

Google’s large language model (LLM) chatbot Bard recently unveiled a feature to accept image prompts, making it multimodal. It strikes comparisons with a

How Good Is Bing (GPT-4) Multimodality?

In this blog post, we qualitatively analyze how well Bing’s combination of text and image input ability performs at object detection tasks.

Recognizing Math Equations with Computer Vision

In this article, we show a process for recognizing math equations using computer vision.