AI Models in Action

As artificial intelligence (AI) rapidly advances, tools like large language models (LLMs), computer vision, and vision-language models are transforming industries and solving complex problems. Today, AI gives BNSF Railway Company, one of North America's largest freight rail networks covering 32,500 route miles, real-time observability in its intermodal yards, answering critical questions like “What’s in my facility, and where specifically in my facility is it?”

There are many ways to harness the latest AI technology, so this article provides an overview of AI models, highlighting how they are trained, deployed, and used across different domains. Whether you're looking to deploy AI models for manufacturing or logistics, understanding the core concepts and how to use these models is essential for leveraging AI's full potential.

Introduction to Artificial Intelligence and AI Models

Ask your child, dad, and grandmother if they've heard of ChatGPT, and you'll find the answer is probably yes. LLMs such as ChatGPT have taken the world by storm. Popular LLMs belong to a class of models called foundation models and fall within the field of generative AI.

AI is a broad field encompassing various algorithms applied to diverse use cases. Here's a quick overview of AI, computer vision, and machine learning.

Artificial intelligence: AI is the most general term, covering the many ways machine algorithms are applied across a broad range of use cases. It includes computer vision and machine learning, along with disciplines such as natural language processing (NLP) and automated speech recognition (ASR), and more broadly any branch of machine learning where different types of unstructured data are processed. The term tends to come up in discussions about where research is headed and the general trend of machine algorithms becoming more and more powerful, and it breaks down into sub-disciplines such as computer vision and NLP.

Computer vision: While NLP and LLMs deal only with text (sentences, words, and word pieces), computer vision deals with images and video and the applications built on them. Computer vision is the ability of a computer to see and understand the physical world. With computer vision, computers can learn to identify, recognize, and pinpoint the position of objects.

Machine learning: Machine learning refers specifically to the techniques you can use to transform data into predictions. It is typically a supervised setting in which you use labels and iterate through a machine learning pipeline that turns your data and labels into predictions. It can encompass both NLP and computer vision, and the term is used frequently by hands-on practitioners rather than in academic settings.

Different Machine Learning Frameworks

For building and optimizing AI systems, there are training, intermediary, and deployment machine learning frameworks. Training frameworks are where you teach your model how to model a problem; they tend to be much heavier than other frameworks. Deployment frameworks are where you put a model into production so it can run inference quickly. Intermediary frameworks are used to convert between the two (training and deployment).
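
As one hedged illustration of this split, the sketch below defines a tiny model in PyTorch (a training framework), exports it to ONNX (an intermediary format), and runs it with ONNX Runtime (a deployment runtime); the model, file name, and shapes are placeholders rather than a prescribed setup.

```python
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# Training framework (PyTorch): define a tiny placeholder model.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

# Intermediary format (ONNX): convert the model so other runtimes can load it.
torch.onnx.export(model, torch.randn(1, 4), "model.onnx",
                  input_names=["input"], output_names=["output"])

# Deployment runtime (ONNX Runtime): load the converted model and run fast inference.
session = ort.InferenceSession("model.onnx")
outputs = session.run(None, {"input": np.random.randn(1, 4).astype(np.float32)})
print(outputs[0])
```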

What Is an AI Model?

AI models are pushing the boundaries of what's possible. An AI model is a mathematical structure designed to perform tasks like classification, prediction, or decision making based on input data.

AI models are built by gathering and labeling data, then training the model to learn patterns from it. Specialized models, like those for customer service chatbots or predictive maintenance, address specific problems. Foundation models serve as general-purpose base models that, through fine-tuning, can be rapidly adapted into specialized models, accelerating development by building on pre-existing knowledge rather than starting from scratch. You can think of an AI model as a system that "thinks" based on vast amounts of data, enabling machines to solve problems or enhance user experiences.
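
As a minimal, hedged sketch of that gather-label-train loop, the example below fits a small scikit-learn classifier on hand-labeled samples and uses it to predict a new input; the data is invented purely for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy labeled dataset (feature values and labels are placeholders for illustration).
features = [[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.5]]
labels = ["setosa", "setosa", "versicolor", "versicolor"]

# Train: the model learns patterns that map features to labels.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(features, labels)

# Predict: apply the learned patterns to unseen input data.
print(model.predict([[6.0, 2.9]]))
```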

Types of AI Models

AI models can be categorized into several types based on how they learn, the tasks they perform, and their applications. There are computer vision models for tasks such as vision-language, object detection, classification, keypoint detection, instance segmentation, and semantic segmentation.

Models also come in many different sizes: for example, you can train nano models for fast iteration and low-compute deployments, or XL models for the highest level of accuracy. Some large language models have vision capabilities that enable you to ask questions about the contents of images. LLMs are models trained on vast amounts of text data and are designed to process and predict language patterns, enabling them to produce human-like responses across a wide range of topics.

Discover AI Models and Examples

Because AI is rapidly evolving, it can be a challenge to keep on top of the releases of advanced models. Explore a selection of cutting-edge AI models spanning a wide range of capabilities, from natural language processing to vision and multimodal tasks. Learn about each model's unique features, performance improvements, and potential applications.

Claude 3

Announced on March 4th, 2024, Claude 3 is a family of Large Language Models by Anthropic with vision capabilities. At release, Claude 3 was Anthropic's most advanced model family, competing with OpenAI's GPT-4 and Google's Gemini, with improvements in reasoning, coding, and multilingual capabilities. Claude 3 comes in three versions: Opus, Sonnet, and Haiku.

Grok 3

From Elon Musk's artificial intelligence company xAI comes a less censored alternative to other mainstream LLMs. Grok 3 launched on February 17, 2025, and is designed to enhance understanding, problem solving, and contextual awareness. It incorporates advanced reasoning capabilities, allowing users to engage a "Think" mode for complex problem solving. Additionally, xAI introduced Grok 3 mini, a variant offering faster responses with some trade-offs in accuracy.

DeepSeek R1

DeepSeek R1 is an advanced open-source AI model developed by the Chinese startup DeepSeek, released in January 2025 under the MIT License. It's a reasoning model that solves complex problems by breaking them down into steps, and was designed to enhance logical inference, mathematical reasoning, and real-time problem solving.

ChatGPT

ChatGPT is an advanced AI chatbot developed by OpenAI that uses natural language processing to understand and generate human-like text responses. It is powered by OpenAI’s Generative Pretrained Transformer models, which are trained on vast amounts of text data to assist with various tasks, including answering questions, generating content, summarizing information, and even coding.

Mistral 7B

A large language model developed by Mistral AI, known for its open-weight architecture and efficiency. With 7 billion parameters, it is designed to perform various natural language processing tasks, such as text generation, summarization, and question answering, while being more computationally efficient than larger models.

Gemini 2.0 Pro

As of February 5, 2025, Gemini 2.0 Pro is available to everyone. It is Google DeepMind's most advanced AI model, designed to excel in coding tasks and complex prompt handling. The model supports multimodal inputs, including text, images, video, and audio, and provides text-based outputs. Additionally, Gemini 2.0 Pro can utilize tools such as Google Search and execute code.

Llama 3.2

Llama 3.2 Vision is Meta's open-source multimodal LLM that processes both text and images, enabling advanced visual recognition and reasoning capabilities. Released in September 2024, it is available in two sizes: 11 billion parameters and 90 billion parameters. The broader Llama 3.2 release also includes lightweight text-only versions designed to operate efficiently on mobile hardware.

Llama 4

Meta’s Llama 4 family of LLMs is making “great progress in training,” said CEO Mark Zuckerberg on January 29, 2025, with Llama 4 Mini having completed pretraining. It's going to “unlock a lot of new use cases,” he said.

Llama 4 is Meta's forthcoming large language model (LLM), expected to launch in 2025. It is designed as an "omni-model" with multimodal capabilities, enabling it to process and interpret various types of data, such as text and images, simultaneously. Additionally, Llama 4 is anticipated to possess agentic features, allowing it to perform tasks autonomously based on user inputs.

Qwen 2.5 VL

Qwen 2.5 VL is a vision-language model developed by Alibaba Cloud, the latest in its Qwen-VL series of large multimodal models (LMMs). The model accepts images, text, and bounding boxes as inputs, can output text and bounding boxes, and naturally supports English, Chinese, and multilingual conversation. It takes an image and an optional text prompt and generates a text response.

Magma

Magma is a foundation model developed by Microsoft for multimodal AI agents, capable of handling both virtual and real-world environments. It excels in tasks involving image and video understanding, robotic manipulation, and UI navigation, with the ability to generate goal-driven visual plans and actions. Magma's architecture incorporates scalable pretraining from unlabeled videos, enhancing its generalization ability for real-world applications.

OpenAI o3-mini

OpenAI o3-mini is a specialized LLM designed to enhance reasoning capabilities, particularly in STEM applications. Released on January 31, 2025, o3-mini offers improved performance in tasks such as mathematics, coding, and scientific problem-solving, while maintaining cost-effectiveness and reduced latency compared to its predecessor, o1-mini.

YOLOv12

Released on February 18th, 2025, YOLOv12 is a state-of-the-art computer vision model architecture. YOLOv12 was made by researchers Yunjie Tian, Qixiang Ye, and David Doermann and introduced in the paper “YOLOv12: Attention-Centric Real-Time Object Detectors”. YOLOv12 has an accompanying open source implementation that you can use to fine-tune models. The model achieves both lower latency and higher mAP than prior YOLO models when benchmarked on the Microsoft COCO dataset.

Segment Anything Model

Segment Anything (SAM) is an image segmentation model developed by Meta Research, released in April 2023, capable of zero-shot segmentation. Using SAM, you can generate segmentation masks for all of the objects in an image that the model can find, or masks for specific objects given prompts such as points or boxes. SAM has strong zero-shot capabilities, meaning it can segment unfamiliar objects without further training. This gives SAM a large advantage over other segmentation models that may require fine-tuning for different uses. The dataset on which SAM was trained contains over one billion masks across 11 million images. Learn how to use the Segment Anything Model.
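
As a hedged sketch of automatic mask generation with the official segment_anything package, the snippet below loads a SAM checkpoint and generates masks for everything the model can find in an image; the checkpoint path and image file are placeholders.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM checkpoint (model type and checkpoint path are placeholders).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# Generate masks for every object SAM can find in the image.
mask_generator = SamAutomaticMaskGenerator(sam)
image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

print(f"SAM found {len(masks)} masks")
```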

Segment Anything 2 Model

Segment Anything 2 (SAM 2) is a real-time image and video segmentation model. SAM 2 works on both images and videos. The previous version of SAM, on the other hand, was built explicitly for use in images.

YOLOv8 Pose Estimation Model

The YOLOv8 pose estimation model allows you to detect keypoints in an image. Keypoint detection enables you to identify specific points on an image. For example, you can identify the orientation of a part on an assembly line with keypoint detection. This functionality could be used to ensure the orientation of the part is correct before moving to the next step in the assembly process. You could use keypoint detection to identify key points on a robotic arm, for use in measuring the envelope of the device. Finally, a common use case is human pose estimation, useful in exercise applications or factory workstation ergonomics.
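
A minimal sketch of keypoint detection with the Ultralytics Python package, assuming a pretrained YOLOv8 pose checkpoint and a placeholder image file:

```python
from ultralytics import YOLO

# Load a pretrained YOLOv8 pose estimation model (checkpoint name is a placeholder).
model = YOLO("yolov8n-pose.pt")

# Run inference on an image and read back the detected keypoints.
results = model("image.jpg")
for result in results:
    print(result.keypoints.xy)  # per-detection keypoint coordinates in pixels
```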

YOLOv8 Instance Segmentation

YOLOv8 is optimized for speed. The state-of-the-art YOLOv8 model, created by Ultralytics, the developers of YOLOv5, launched on January 10, 2023, and comes with support for instance segmentation tasks. After detecting objects, YOLOv8 generates pixel-level segmentation masks for each detected object, allowing precise object boundaries.
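
A minimal sketch of instance segmentation with the Ultralytics Python package, assuming a pretrained YOLOv8 segmentation checkpoint and a placeholder image file:

```python
from ultralytics import YOLO

# Load a pretrained YOLOv8 instance segmentation model (checkpoint name is a placeholder).
model = YOLO("yolov8n-seg.pt")

# Run inference; each result carries detection boxes and per-object segmentation masks.
results = model("image.jpg")
for result in results:
    if result.masks is not None:
        print(result.boxes.cls)         # class indices of detected objects
        print(result.masks.data.shape)  # one pixel-level mask per detected object
```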

YOLOv9

YOLOv9 is an object detection model architecture released on February 21st, 2024 by Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. Its main contributions are its performance and efficiency, its use of Programmable Gradient Information (PGI), and its use of reversible functions. YOLOv9 introduced two new architectures: YOLOv9 and the Generalized Efficient Layer Aggregation Network (GELAN). Here's how to train YOLOv9 on a custom dataset.

Grounding DINO

Grounding DINO is a zero-shot object detection model made by combining a Transformer-based DINO detector with grounded pretraining. Grounding DINO performs impressively well in zero-shot object detection, achieving strong performance on COCO and LVIS without being trained on those datasets directly. It's a powerful tool for vision-language tasks and is widely used for referring expression comprehension, where users can highlight or describe objects in an image and get precise detections in return.

YOLO-World

YOLO-World, introduced in the research paper “YOLO-World: Real-Time Open-Vocabulary Object Detection”, shows a significant advancement in the field of open-vocabulary object detection by demonstrating that lightweight detectors, such as those from the YOLO series, can achieve strong open-vocabulary performance. This is particularly noteworthy for real-world applications where efficiency and speed are crucial, like edge applications.
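
As a hedged sketch, assuming the Ultralytics implementation of YOLO-World and placeholder file names, you can set an open vocabulary at runtime and detect those classes without retraining:

```python
from ultralytics import YOLOWorld

# Load a pretrained YOLO-World model (checkpoint name is a placeholder).
model = YOLOWorld("yolov8s-world.pt")

# Define an open vocabulary at runtime; no retraining is needed for new classes.
model.set_classes(["forklift", "pallet", "safety vest"])

# Run detection and print the predicted class indices.
results = model("image.jpg")
print(results[0].boxes.cls)
```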

PaliGemma

PaliGemma, released at the 2024 Google I/O event, is a combined multimodal model based on two other models from Google research: SigLIP, a vision model, and Gemma, a large language model, which means the model is a composition of a Transformer decoder and a Vision Transformer image encoder. It takes both image and text as input and generates text as output, supporting multiple languages. Unlike other VLMs which have struggled with object detection and segmentation, PaliGemma has a wide range of abilities, paired with the ability to fine-tune for better performance on specific tasks.

GPT-4o

GPT-4o is OpenAI’s third major iteration of GPT-4, expanding on the capabilities of GPT-4 with Vision. The newly released model is able to talk, see, and interact with the user in an integrated and seamless way, more so than previous versions when using the ChatGPT interface.
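
A minimal sketch of asking a question about an image, assuming the official openai Python package and a placeholder image URL (parameter details may vary by SDK version):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask GPT-4o a question about an image by passing text and an image URL together.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What objects are visible in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```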

Tesseract

Originally developed by Hewlett Packard (HP) between 1984 and 1994, Tesseract is a highly popular OCR engine and project, now primarily developed open-source by Google. It is used for extracting text from images or scanned documents, and is especially effective for recognizing printed text in a variety of languages.
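
A hedged sketch using the pytesseract wrapper, assuming the Tesseract binary is installed and using a placeholder image path:

```python
from PIL import Image
import pytesseract

# Extract printed text from an image using the Tesseract OCR engine.
image = Image.open("scanned_document.png")
text = pytesseract.image_to_string(image, lang="eng")
print(text)
```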

YOLOv11

YOLOv11 is the latest iteration in the YOLO (You Only Look Once) series of real-time object detection models developed by Ultralytics. YOLOv11 supports object detection, segmentation, classification, keypoint detection, and oriented bounding box (OBB) detection. YOLOv11 introduces the C3k2 (Cross Stage Partial with kernel size 2) block, SPPF (Spatial Pyramid Pooling - Fast), and C2PSA (Convolutional block with Parallel Spatial Attention) components. These new techniques advance feature extraction and improve model accuracy, continuing the YOLO lineage of better models for real-time object detection use cases.

Detectron2

Detectron2 is a computer vision model zoo written in PyTorch by FAIR, the Facebook AI Research group. Detectron2 includes all the models that were available in the original Detectron, such as Faster R-CNN, Mask R-CNN, RetinaNet, and DensePose, as well as some newer models including Cascade R-CNN, Panoptic FPN, and TensorMask. You can use Detectron2 for keypoint detection, object detection, and semantic segmentation. Detectron2 registers datasets in COCO JSON format.
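
A hedged sketch of running a pretrained Mask R-CNN from the Detectron2 model zoo and registering a dataset in COCO JSON format; the dataset name, annotation file, and image paths are placeholders:

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from detectron2.data.datasets import register_coco_instances

# Register a custom dataset in COCO JSON format (name and paths are placeholders).
register_coco_instances("my_dataset", {}, "annotations.json", "images/")

# Load a pretrained Mask R-CNN configuration and weights from the model zoo.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")

# Run inference on an image and inspect the predicted classes.
predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("image.jpg"))
print(outputs["instances"].pred_classes)
```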

MediaPipe

MediaPipe is an open-source framework developed by Google for building cross-platform, high-performance applications that process multimedia content such as images, video, and audio; its revamped Solutions API was showcased at the 2023 Google I/O conference. It includes several pretrained models for various tasks such as hand gesture recognition, pose estimation, and image segmentation. Explore even more Google AI models here.
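
A hedged sketch of single-image pose estimation with MediaPipe's legacy Solutions API (the image path is a placeholder, and the newer Tasks API has a different interface):

```python
import cv2
import mediapipe as mp

# Run single-image pose estimation with MediaPipe's Pose solution.
with mp.solutions.pose.Pose(static_image_mode=True) as pose:
    image = cv2.cvtColor(cv2.imread("person.jpg"), cv2.COLOR_BGR2RGB)
    results = pose.process(image)

if results.pose_landmarks:
    # Each landmark carries normalized x, y coordinates and a visibility score.
    print(len(results.pose_landmarks.landmark), "landmarks detected")
```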

Florence-2

Florence-2, an MIT-licensed, multimodal vision model released by Microsoft Research, supports generating image captions of varying degrees of richness. The model demonstrates strong zero-shot and fine-tuning capabilities across tasks such as captioning, object detection, grounding, and segmentation. Despite its small size, it achieves results on par with models many times larger, like Kosmos-2. The model's strength lies not in a complex architecture but in the large-scale FLD-5B dataset, consisting of 126 million images and 5.4 billion comprehensive visual annotations.

4M: Massively Multimodal Masked Modeling

4M: Massively Multimodal Masked Modeling, released by Apple in 2024, addresses critical challenges in vision models which have traditionally been highly specialized and limited to a single modality and task. The 4M architecture introduces a multimodal training scheme that leverages a unified Transformer encoder-decoder with a masked modeling objective across diverse input/output modalities such as text, images, geometric, and semantic data, as well as neural network feature maps. 4M can seamlessly handle a broad spectrum of vision tasks, excel in fine-tuning for unseen tasks or new input modalities, and function as a generative model conditioned on arbitrary modalities.

EasyOCR

EasyOCR is an OCR Python package for detecting and recognizing text in images. Based on PyTorch, it focuses on ease of use and its wide range of languages, supporting 80+ languages including English, with new languages occasionally added. It also features the capability to train and use a custom-trained recognition and detection model.
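
A minimal sketch using the easyocr package; the language list and image path are placeholders:

```python
import easyocr

# Create a reader for English text; additional language codes can be added to the list.
reader = easyocr.Reader(["en"])

# Detect and recognize text; each result is (bounding box, text, confidence).
results = reader.readtext("receipt.jpg")
for box, text, confidence in results:
    print(f"{text} ({confidence:.2f})")
```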

NVIDIA PreTrained Models

NVIDIA pretrained AI models are a collection of 600+ highly accurate models built by NVIDIA researchers and engineers using representative public and proprietary datasets for domain-specific tasks. 

Explore more state-of-the-art computer vision model architectures, immediately usable for training with your custom dataset.

How to Use AI Models

AI models are driving real change across industries. Computer vision software is being used to inspect label compliance, track process execution, monitor traffic, and optimize a warehouse's footprint. As just one example, a Roboflow automotive manufacturing customer saved $8 million by automatically detecting defects on their production line.

To use AI models, you'll need to gather data, label your data, train a model, and deploy it. Roboflow provides a comprehensive platform of visual AI tools to streamline model deployment and enhance performance for specific use cases. Get started by following these steps (a code sketch for running a deployed model follows the list):

  1. Gather relevant images and label them for the task.
  2. Upload your labeled dataset to the platform.
  3. Use data augmentation and preprocessing tools to optimize your dataset.
  4. Choose from available models or use Roboflow’s AutoML to train on your dataset.
  5. Once trained, export the model for deployment on your preferred platform or integrate it into your application.
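
As one hedged illustration of step 5, the sketch below runs inference against a hosted Roboflow model using the roboflow Python package; the API key, workspace, project, and version number are placeholders for your own project.

```python
from roboflow import Roboflow

# Connect to your workspace and load a trained model version (all identifiers are placeholders).
rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("your-workspace").project("your-project")
model = project.version(1).model

# Run inference on a local image and print the predictions as JSON.
predictions = model.predict("example.jpg", confidence=40, overlap=30).json()
print(predictions)
```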

Roboflow is free to get started and easy to scale up. We’re SOC 2 Type II compliant, support custom security rules, and power customers operating at global scale with terabytes of data today. Learn more about how Roboflow Enterprise can bring computer vision solutions to your business. Talk to an AI expert to discuss your unique use cases.