The post below is a guest post by data scientist Joseph Rosenblum. He is using computer vision to make cities more efficient and to decrease bias in traffic-related policing. (He is open to new opportunities!)

Municipalities today are cash-strapped, with $360 billion in losses expected between 2020 and 2022 due to the post-coronavirus recessionary impact. Localities are seeking ways to create efficiencies through technology and automation, with trillions in anticipated spending coming down the pike. At the same time, more will need to be done with less as operating budgets shrink.

One way a municipality can stretch its dollars further is to better leverage existing technology, such as CCTV surveillance cameras. Initially expensive to purchase, install, and maintain, these systems hold great promise: not only can they improve public safety and save lives, but they can do so while reducing policing man-hours and increasing summons-collection revenues.

Most traffic cameras are singularly focused on enforcing red-light or speeding violations. CCTV cameras, by contrast, are not designed to enforce laws, but rather to assist in solving crimes that have already occurred. Even so, CCTV cameras are often installed in places with frequent violations and with clear views of vehicular traffic.

Machine learning can be applied to footage from these CCTV cameras to analyze it for almost any traffic violation, helping to realize more of the technology’s full potential and increasing its lifetime ROI. In practice, this means leveraging advanced object detection techniques to turn CCTV cameras into traffic enforcement cameras, thus broadening their range of capabilities.

Data Background

I used WebCamT, which was originally created as part of the initiative Understanding Traffic Density from Large-Scale Web Camera Data. This data consists of compressed, low-resolution webcam footage taken in various locations around New York City. All images are formatted to 352 x 240 pixels. Data in this set includes custom .XML annotations for each vehicle. (Annotations are the coordinates that tell the model where objects exist within the image.)

This dataset was chosen because it comes close to the perspective and quality this analysis required. The camera perspective nearly matches that of the cameras the City of New York uses to monitor intersections. Likewise, the image quality is similar to what is produced by some of the older cameras currently in use. These images offer a solid starting point because they can be used to detect violations; higher-quality images may offer even more possibilities!

A traffic camera view of cars moving through Manhattan.

Data Preparation

Object detection alone can tell us where an object is within a single frame of video. However, many vehicular violations involve motion and are not necessarily obvious when only a still image of an event is examined. For the initial scope of this project, I chose to use object detection to identify violations that are obvious within the context of a still image.

A still camera view of cars in Manhattan.

I started with cars stopping in the crosswalk against the signal. I needed images containing crosswalks, traffic signals that could be seen, and, of course, vehicles.

I used CVAT to create tiny annotations to denote when a pedestrian crossing signal was red.

An image of a red traffic light with a pink bounding box around the light.

Annotations of vehicles were included in the dataset. They were created in XML, but because they do not conform to Pascal VOC, I wrote a script to convert them to YOLO format. During this process, I identified useful images with missing and corrupted annotations and re-annotated them using CVAT.
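To illustrate the conversion step, here is a minimal sketch, not the script from this project. YOLO label files store one object per line as a class ID followed by the box center, width, and height, all normalized to the image size. The XML tag names used below (vehicle, bndbox, xmin, ymin, xmax, ymax) and the single class ID are assumptions made for illustration; the actual WebCamT schema may differ.

```python
import xml.etree.ElementTree as ET

# Image size stated for this dataset (352 x 240 pixels).
IMG_W, IMG_H = 352, 240


def box_to_yolo(xmin, ymin, xmax, ymax, img_w=IMG_W, img_h=IMG_H):
    """Convert pixel corner coordinates to YOLO's normalized
    (x_center, y_center, width, height)."""
    # Clamp to the image bounds so boxes that spill past the frame stay valid.
    xmin, xmax = max(0.0, xmin), min(float(img_w), xmax)
    ymin, ymax = max(0.0, ymin), min(float(img_h), ymax)
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height


def xml_to_yolo_lines(xml_text, class_id=0):
    """Turn one annotation file into YOLO label lines.

    The tag names here are hypothetical placeholders, not the real schema.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for vehicle in root.iter("vehicle"):
        box = vehicle.find("bndbox")
        if box is None:
            continue  # missing box: a candidate for manual re-annotation
        try:
            coords = [float(box.findtext(tag))
                      for tag in ("xmin", "ymin", "xmax", "ymax")]
        except (TypeError, ValueError):
            continue  # absent or non-numeric coordinate: skip as corrupted
        x, y, w, h = box_to_yolo(*coords)
        if w <= 0 or h <= 0:
            continue  # degenerate box
        lines.append(f"{class_id} {x:.6f} {y:.6f} {w:.6f} {h:.6f}")
    return lines
```

Skipping incomplete entries rather than failing makes it easier to collect the affected images afterwards and fix them by hand in an annotation tool.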

I used images from WebCamT that contained all these elements and divided them up into “train,” “test,” and “validation” components.
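A split like this can be sketched in a few lines. The 70/20/10 proportions, the fixed seed, and the function name below are assumptions for illustration only; the post does not state which ratios were used.

```python
import random


def split_dataset(image_names, train_frac=0.7, val_frac=0.2, seed=42):
    """Shuffle image names reproducibly and cut them into three sets.

    Whatever is left after the train and validation cuts becomes the test set.
    """
    names = sorted(image_names)          # sort first so the seed fully determines the order
    random.Random(seed).shuffle(names)   # local RNG: does not disturb global random state
    n_train = round(len(names) * train_frac)
    n_val = round(len(names) * val_frac)
    return {
        "train": names[:n_train],
        "validation": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],
    }


if __name__ == "__main__":
    # Hypothetical file names, only to show the call.
    frames = [f"frame_{i:03d}.jpg" for i in range(10)]
    for part, items in split_dataset(frames).items():
        print(part, len(items))
```

Because consecutive frames from a fixed camera look very similar, a purely random split may put near-duplicate frames in both train and test; splitting by camera or by time window is a stricter alternative.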


Initially, I created a custom CNN in Keras to identify whether an image contained a violation. After struggling to create a performant model, I used Roboflow’s Model Zoo to explore a few pre-existing models, including YOLOv5. I found YOLOv5 relatively simple to use and quite powerful. I used weights pre-trained on the COCO dataset as the starting point for transfer learning on the custom data used for this project.

An image of cars in Manhattan with predicted labels and bounding boxes.

I tried every model size in the YOLOv5 family, seeing incremental improvements in accuracy with the larger models. I settled on YOLOv5x: it was more than fast enough for the given purposes, running inference on the data at around 50 fps on my hardware.

Of note, I ran the model locally on my computer (as opposed to in the cloud) because I had access to a GPU. This dataset was small enough to be manageable on my own desktop, which let me experiment with various ways to train my model. For example, I tried training with and without mosaic augmentation. This is what the mosaic training built into YOLOv5 looks like:

Source images --> augmented images --> model.

Roboflow was a great resource. It offered a rich set of tools for image augmentation. It also offered robust conceptual overviews, head-to-head analyses of the best open-source models available, and tutorials, including those on object detection considerations.

Impact of Model

I can now rely on my model to deliver output and can begin to shift to building the logic for interpreting the detected objects. One example is using predetermined lines to mark where the crosswalk is.

Crosswalks were (obviously) in the image, but I didn’t use object detection to detect them. Because the images I used were static (i.e., frames of video where the camera position does not change), I could reliably state where the crosswalk was. In the long run, this isn’t the best approach: if the camera were moved, I wouldn’t be able to re-anchor the crosswalk to the new position. To solve this, I could train the model to identify crosswalks as well.
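One way to picture this interpretation logic is the sketch below. It is an illustration under stated assumptions, not the project's code: the crosswalk is approximated as a fixed rectangle in pixel coordinates, the class labels ("vehicle", "red_signal"), the rectangle's position, and the 30% overlap threshold are all hypothetical, and the rule itself (flag vehicles overlapping the crosswalk in frames where a red signal is detected) is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float


# Hypothetical crosswalk location for one fixed camera (352 x 240 frame).
CROSSWALK = Box("crosswalk", 40, 150, 310, 190)


def overlap_fraction(vehicle, region):
    """Fraction of the vehicle box's area that lies inside the region."""
    inter_w = min(vehicle.xmax, region.xmax) - max(vehicle.xmin, region.xmin)
    inter_h = min(vehicle.ymax, region.ymax) - max(vehicle.ymin, region.ymin)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not intersect
    area = (vehicle.xmax - vehicle.xmin) * (vehicle.ymax - vehicle.ymin)
    return inter_w * inter_h / area if area > 0 else 0.0


def find_violations(detections, crosswalk=CROSSWALK, min_overlap=0.3):
    """Return vehicles overlapping the crosswalk while a red signal is detected."""
    if not any(d.label == "red_signal" for d in detections):
        return []  # no red signal in this frame, so nothing to flag
    return [
        d for d in detections
        if d.label == "vehicle" and overlap_fraction(d, crosswalk) >= min_overlap
    ]
```

Keeping the crosswalk as a parameter means that a detected crosswalk box could later replace the hard-coded one without changing the rest of the logic.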


An image of cars in Manhattan with predicted labels and bounding boxes, with a crosswalk.

This is a live and ongoing project built on a model that is still in development. I am continuing to explore and refine it. If you want to get in touch to nerd out over other improvements I can make or have any questions, feel free to ping me on LinkedIn!

Roboflow was an inspirational resource. It provided me with relevant information and perspective (both Joseph Nelson and Matt Brems had a hand in teaching me data science), and it also allowed me to leverage powerful, easy-to-use tools and helped me find the path forward in using them.

Additional references:

  1. My project's GitHub.
  2. YOLOv5 GitHub.
  3. YOLOv5 requirements.
  4. cuDNN install guide.
  5. Roboflow’s Guide to Object Detection.
  6. WebCamT (Link 1) and WebCamT (Link 2).
  7. I labeled using CVAT. (Since tackling this project, labeling has been made available in Roboflow for free!)
  8. Microsoft COCO dataset.