Software Engineering Daily Podcast

Roboflow co-founder Brad Dwyer was a guest on the Software Engineering Daily podcast. Listen on your favorite podcast app (Apple Podcasts, Spotify, Overcast, Stitcher), or see the full transcript below.

‎Software Engineering Daily: Roboflow: Computer Vision Models with Brad Dwyer on Apple Podcasts


Transcript

EPISODE 1146

[INTRODUCTION]

[00:00:00] Jeff Meyerson: Training a computer vision model is not easy. Bottlenecks in the development process make it even harder. Ad-hoc code, inconsistent datasets, and other workflow issues hamper the ability to streamline models.

Roboflow is a company built to simplify and streamline these model building workflows. Brad Dwyer is a founder of Roboflow and he joins the show to talk about model development and his company.

[INTERVIEW]

[00:00:37] JM: Brad, welcome to the show.

[00:00:38] Brad Dwyer: Thanks for having me.

[00:00:40] JM: You work on Roboflow, and one way to look at Roboflow is as extract transform load, or ETL for computer vision. So, many people have heard previous episodes about extract transform load. They know that pattern. Explain how that applies to Roboflow.

[00:00:57] BD: Yeah. So when we got started with Roboflow, it was because we were building our own augmented reality applications that were powered by computer vision. And we felt this really big frustration around that ETL layer of things, where it seemed like there were purpose-built tools for doing things like labeling your images and training your models. But there is this piece in between labeling and training where everybody basically just has to build their own one-off Python scripts to do all these menial tasks that aren’t really important to the problem.

And so when we were asking around all of our friends, “Hey, how do you all do this?” it seemed like everybody was just reinventing the wheel. And so we figured since we were going to have to write these tools internally for ourselves for our own app, we should release those out to the world. And now we’ve pivoted the company to be completely developer tools, focused primarily on that area in between labeling your images and training your model.

[00:01:48] JM: Okay. And so explain what a prototypical workflow would look like for Roboflow.

[00:01:53] BD: Yup. So we’re a universal conversion tool. You can imagine us as an open platform that plays nicely with all sorts of other point solutions. So we're pretty agnostic about where you do your labeling and training. We can accept over 34 different annotation formats ranging from outsourced labeling tools like Scale or Amazon SageMaker. Or if you want to label them yourself, we support all of the major labeling tools.

So you label your images and then you get these annotation files out of them. And one of the biggest pain points that you'll feel is that when you export your data from all these tools, they come in formats specific to the tool, which are not compatible with any of the models that you’re going to want to use. And so kind of one of our hooks is we can import from all those tools and then output to all these other different formats. So that's one of the ways that people find us, is they're looking to convert VOC XML to TFRecords for TensorFlow.
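To make the conversion step concrete, here is a minimal sketch of the kind of one-off script Brad is describing. It is not Roboflow's implementation. It reads a Pascal VOC-style XML annotation and rewrites each box in the COCO-style `[x, y, width, height]` layout. The sample XML, the file name, and the flat output record are assumptions made for illustration; a real COCO file also carries image and category ID tables.

```python
import xml.etree.ElementTree as ET

# A tiny Pascal VOC-style annotation, inlined so the example is self-contained.
# The file name and class label are made up for illustration.
VOC_XML = """
<annotation>
  <filename>board_001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>white-queen</name>
    <bndbox><xmin>100</xmin><ymin>120</ymin><xmax>180</xmax><ymax>260</ymax></bndbox>
  </object>
</annotation>
"""


def voc_to_coco_boxes(xml_text):
    """Return simplified COCO-style records: bbox is [x, y, width, height]."""
    root = ET.fromstring(xml_text.strip())
    filename = root.findtext("filename")
    records = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        xmin = int(box.findtext("xmin"))
        ymin = int(box.findtext("ymin"))
        xmax = int(box.findtext("xmax"))
        ymax = int(box.findtext("ymax"))
        records.append({
            "file_name": filename,
            "category": obj.findtext("name"),
            # VOC stores two corners; COCO stores the top-left corner plus size.
            "bbox": [xmin, ymin, xmax - xmin, ymax - ymin],
        })
    return records


records = voc_to_coco_boxes(VOC_XML)
```

Every pair of formats needs its own version of this arithmetic, which is where the pile of small scripts comes from.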

And so the way that it works is you come into Roboflow, you drop in your images and annotations and we perform a bunch of checks up on them, help you make sure that that data looks good and is ready to go for model training. We’ll help you augment your images, preprocess them and do all those sorts of things that you would kind of have to do one-off Python scripts for and then we’ll export them and then you’re ready to train your model.

[00:03:09] JM: And how does this save time?

[00:03:11] BD: Yeah. So most of our customers are looking at this as kind of a tradeoff between doing it themselves by writing all those Python scripts or using something like Roboflow.

We talked to one of our early customers and asked them, “Hey, how much time are you actually saving by doing this?” And they told us that it had been on their backlog to try out a PyTorch model. They were a TensorFlow shop for quite a while, but it had never gotten to the top of their stack, because it just seemed like such a pain to switch all their image processing pipelines, convert all their annotation formats and all that stuff. And so they estimated that that would've taken a week of development time for their team of three.

But with Roboflow, they were able to do it with a single engineer over the course of an afternoon and they had a PyTorch model trained and actually found that it worked better than their TensorFlow stack that they'd been using for quite some time.

[00:03:57] JM: Can you explain what annotation formats are? Why are there multiple annotation formats?

[00:04:03] BD: An annotation format describes where the examples are in your images. So when you train a machine learning model, you’re kind of training it by example. You're showing it, “Hey, here's the thing that I'm looking for. Try and figure out how to replicate these inputs that I give you.”

And so an annotation format is just a way of encoding that. It could be XML. It could be JSON. It could be a binary format like TFRecord, or a text file, or any number of other different formats.

And the reason that there are so many different formats is that everybody just invents their own thing internally and then releases their stuff as open source. And so they don't play nicely together.

You have all these researchers that are publishing papers and they’re not thinking about, “How are people going to use this to deploy stuff into the wild?” They’re thinking about how do I get state-of-the-art results and release something?

So it’s all this hacky academic code rather than production code. And then people take those research papers and they’re trying to convert them into production, and they end up having to use whatever default format the researcher had in their paper to reproduce that.

And so you end up with this spaghetti-ness of all these file formats that don't play nicely together.

[00:05:12] JM: Do you have a prototypical example of a company that's using Roboflow?

[00:05:17] BD: It spans the gamut. Everything from hobbyists that are building raccoon detectors to train little robots to chase them out of their backyard, to some of the world's largest companies.

We have a Fortune 500 oil and gas company that’s using Roboflow to train models with their security camera footage to look for oil leaks in their pipeline. They have 10,000 miles of pipeline and they have these little leaks that turn into big problems and they don't get addressed until a human notices them. So they’re training models to automatically look at their security camera footage in real-time and alert them so that they can fix the tiny leaks before they become big problems.

[00:05:57] JM: Let’s go through a little bit more on the raccoon example. So let's say I'm training a model to recognize raccoons and chase them out of my backyard with maybe some sort of drone or a Roomba like robot. What would be my workflow with and without Roboflow?

[00:06:14] BD: The raccoon detector is a funny example, because we actually have six distinct users all working on detecting raccoons. One of them is trying to automatically turn on their garden hose to spray the raccoons. One of them has this little spider robot that he is trying to train to chase them. So it’s been kind of a surprising niche that we’ve found.

So the workflow would be: most of these guys are using something like a Nest camera to export a bunch of images. You find some of the ones that have raccoons, some of the ones that don't. So one guy is trying to make sure that his robot doesn't also chase his dog. So he wants to label his dog. He wants to label the raccoon. And then he'll end up training a model to do that in real time.

Without Roboflow, he'd be writing all these Python scripts for after he labels his stuff to convert them, to augment the images so that it works well in the daytime and nighttime, whether it's cloudy or not, if the camera happens to be at a slightly different angle, if the wind has blown stuff around.

Instead of collecting millions of different images, he’ll do something called data augmentation, which helps your model generalize more by giving it more examples that aren’t all exactly the same. So those are some examples of things that you’d be writing Python scripts to do and are very specific to the input format for one specific model.
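As a rough illustration of what such an augmentation script does, the sketch below applies two simple transforms to a tiny grayscale image held as nested lists: a brightness change and a horizontal flip. The flip also has to move the bounding boxes, which is the part that ties these scripts to an annotation format. The function names and the `(xmin, ymin, xmax, ymax)` box layout (pixel edges, `xmax` exclusive) are assumptions for this example, and a real pipeline would work on actual image arrays.

```python
def adjust_brightness(image, factor):
    """Scale every pixel by `factor`, clamping to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def flip_horizontal(image, boxes):
    """Mirror the image left-to-right and move each box along with it."""
    width = len(image[0])
    flipped = [list(reversed(row)) for row in image]
    # A box spanning [xmin, xmax) becomes [width - xmax, width - xmin).
    new_boxes = [
        (width - xmax, ymin, width - xmin, ymax)
        for xmin, ymin, xmax, ymax in boxes
    ]
    return flipped, new_boxes


image = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
]
boxes = [(0, 0, 1, 2)]  # covers the leftmost column

brighter = adjust_brightness(image, 1.5)
flipped, flipped_boxes = flip_horizontal(image, boxes)
```

After the flip, the box that covered the leftmost column covers the rightmost one, so the label still points at the same pixels.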

With Roboflow, that's all done within the platform. And you can try a whole bunch of different experiments much more quickly and then export them and you can try maybe EfficientDet or YOLOv5 or YOLOv4 on PyTorch, TensorFlow, Darknet, and you can try all those things over the course of a week rather than having to spend a week each.

[00:07:53] JM: What have you had to build in Roboflow? Tell me a little about the infrastructure.

[00:07:59] BD: Probably the biggest parts of the app are our annotation parsing system.

We support all these different formats importing and exporting from them.

When you do all the combinations, it's 400 or 500 different combinations of X to Y formats. And previously, if you Googled converting Pascal VOC to COCO JSON, there's a specific Python script that you can find on GitHub to do that one specific thing. We spent a considerable amount of time abstracting that so that it's really easy for us to provide this abstraction layer over the top of all these different formats.
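One common way to build that kind of abstraction layer is a neutral in-memory representation with a reader per input format and a writer per output format, so N inputs and M outputs need N + M small functions rather than N × M converters. The sketch below shows the idea for bounding boxes only. It is an assumption about the general pattern, not a description of Roboflow's internals, and the `Box` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Neutral in-memory form: pixel corners plus a class label."""
    label: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float


# Readers: format-specific input -> Box
def read_corners(label, xmin, ymin, xmax, ymax):
    """Pascal VOC-style: two corners in pixels."""
    return Box(label, xmin, ymin, xmax, ymax)


def read_xywh(label, x, y, w, h):
    """COCO-style: top-left corner plus width and height in pixels."""
    return Box(label, x, y, x + w, y + h)


# Writers: Box -> format-specific output
def write_xywh(box):
    return [box.xmin, box.ymin, box.xmax - box.xmin, box.ymax - box.ymin]


def write_normalized_center(box, img_w, img_h):
    """YOLO-style: box center and size as fractions of the image size."""
    return [
        (box.xmin + box.xmax) / 2 / img_w,
        (box.ymin + box.ymax) / 2 / img_h,
        (box.xmax - box.xmin) / img_w,
        (box.ymax - box.ymin) / img_h,
    ]


box = read_corners("white-queen", 100, 120, 180, 260)
coco_bbox = write_xywh(box)
yolo_bbox = write_normalized_center(box, img_w=640, img_h=480)
```

Adding a new format then means writing one reader or one writer, and every existing format on the other side works with it immediately.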

And then we have a pretty robust image transformation pipeline that spins up servers to do things like rotating images and adjusting their brightness and contrast, resizing them and getting them ready for training.

We also have another pipeline that takes all of those transformed images and compiles them together into one output format. So that might be a TFRecord binary format for TensorFlow, or it might be a zip file containing all your XML and JPEG images for another format.

[00:09:04] JM: Tell me about some of the particularly hard technical problems you’ve had to solve in building Roboflow.

[00:09:09] BD: Some of the things that were surprising to us, I guess not surprising, but we didn't think that we’d have to address them so quickly, was the sheer scale of things.

We started out by using cloud functions to do a lot of these tasks. What we quickly learned was that cloud functions have some technical limitations in terms of the amount of time that they can run and the amount of memory that they can use.

You can imagine if you’re creating a dataset with a thousand images, that might work fine. But then all of a sudden somebody uploads 250,000 images and it just breaks. We had to replicate and abstract away from that. We internally have basically our own version of Google Cloud Functions or AWS Lambda that runs on Docker containers now instead so that we can create images and datasets of arbitrary size. We can add GPUs to those and do things that you're not able to do with the native serverless cloud functions.

[00:10:04] JM: Sorry. So what were the problems with the cloud functions?

[00:10:07] BD: On AWS and Google, 2 GB and 3 GB are the maximum amounts of memory and disk space you can use on those.

You can imagine, if you're trying to create a zip file that's a hundred gigabytes, it's really a tough technical challenge to do that on an instance that can have a maximum of 2 GB of memory and disk space.

And they’re also limited to only running for a certain amount of time. I think on Google that's nine minutes. So if you're trying to download 250,000 images and process them all on one cloud function and output one file, if it takes longer than nine minutes, you're just kind of out of luck.

By building our own version of those sorts of tools, we can eliminate some of those hard barriers that you're not allowed to overcome on serverless cloud functions.

[00:10:56] JM: So if you’re trying to be this sort of middleware stitching together all these different frameworks and toolsets, it seems like there's probably some issues in being that glue between them. Are there any particularly difficult problems that you’ve had to solve in gluing together these different frameworks?

[00:11:17] BD: I don't think there’s necessarily big technical challenges there. I think there are philosophical things that we've had to overcome.

You can imagine some of the big cloud providers are also trying to be this end-to-end machine learning platform. And if you look at AWS’s incentives on their platform, they really want to lock you into using the AWS labeling tool, and the AWS notebooks, and the AWS training, and the AWS deployment stuff.

What we found is that doesn't really work well for a lot of teams. There are all these good point solutions out there that aren’t built by AWS, and that end-to-end platform where like you either take the whole thing or none of it at all just doesn't work for folks.

I think one of our core observations is that if we can help people use the best point solutions for each step of the pipeline, that can be really powerful. And one of the reasons that people choose us is that we’re interoperable with all these other tools. We don't have to control your training flow. We don't have to control your labeling. We can just play nicely with everything and help you use those all together and be the tool that just reduces friction between all these other parts of the pipeline.

[00:12:28] JM: Do you find that people pick a particular cloud provider and just go all-in on it, like on AWS? Or they go all-in on the Google TensorFlow stack? Or do you find that people really want heterogeneity?

[00:12:42] BD: It really depends. We have a couple of different customer types. And certainly there are really advanced companies that have built out a team of PhDs that are working on computer vision. And those more mature teams definitely have their workflow that they prefer.

But there're also these other teams that we find that are just getting started. We think one of our core mission statements is to make computer vision something that all software developers can use. You shouldn't have to hire a team of PhDs and be a company the size of Google to use it. By letting these teams experiment more quickly with their existing engineering resources, you let them find what works best for their problem.

When you're first exploring a problem space, you don't know whether TensorFlow or PyTorch is going to be the right solution for your problem. You probably want to try them both.

We help you navigate that problem space really efficiently and quickly and find what's going to work best for you.

And yeah, we have seen teams switch from TensorFlow to PyTorch, but usually it's in the course of experimentation. Machine learning is one of those things where it's never done. You’re always iterating and trying to find something that works a little bit better.

If a new model comes out and it has research results that work better but the code is in PyTorch and you’re in TensorFlow, you don't want to be stuck in your legacy platform if there's something that's going to work better. And so helping them experiment and use whatever tool is best at a given moment has been really valuable.

[00:14:07] JM: Tell me more about what you need to do to integrate with the labeling providers.

[00:14:12] BD: It's actually been interesting. We assumed that most people who are doing computer vision in production were going to be outsourcing their labeling to tools like Scale or AWS SageMaker Ground Truth.

But what we found is even a lot of big companies are still in this experimentation phase where they’re just doing things in-house. And so they’re having these highly paid, highly skilled engineers that are labeling bounding boxes on their images just because it's kind of a big problem and pain point for them to go out and source a provider. And a lot of these providers have big minimums that they have to spend and there's like a procurement process.

We feel like if we can reduce the amount of friction to that and free up the time of developers to be working on the things that actually need their unique skillsets, that can be really powerful.

Right now we integrate with basically all of the self-serve labeling tools. Whether they've used CVAT or VoTT or another tool to label the images themselves, they can import those all directly into Roboflow.

We also have helped many teams outsource their labeling for the first time. There's a bunch of different providers. They all have their own pros and cons and we feel like if we can help match those users with the labeling provider that’s going to be right for their use case, that can provide a lot of value to them as well.

[00:15:30] JM: And what about the model training tools? What are the integrations with the model training tools like?

[00:15:37] BD: We have a model library that has Colab notebooks that are set up to do most of the modern, state-of-the-art object detection models.

So you can pick those up and play around with those. We also have integrations with all three major cloud providers’ AutoML tools. You can try Google’s, you can try Microsoft’s, and you can try Amazon's and get a baseline for what is the naïve level of performance that you would get off-the-shelf. And that's something that's hard to do right now. You’d have to integrate with each of their individual APIs to try them out. But with Roboflow, you can try all three and see how it works.

One thing that we’ve found is that a lot of these software developers that aren't machine learning experts that are using computer vision via Roboflow, they go to our model library and they have these Jupyter Notebooks, and all they're doing is hitting enter-enter-enter-enter. They're not really doing anything custom.

And they end up with this weights file that they don't know what to do with. When they want to deploy it and use it in their application, they’re then trying to figure out, “How do I spin up servers and host this and like build DevOps infrastructure around that?”

One of the places we’re moving into is providing our own hosted training and deployment environment for some of those users who just want to use computer vision, get something that works well enough built into their application and not really worry about all the details of tuning their model and whatnot.

As we have talked to users, we've found out that while we’re solving a bunch of these problems and eliminating the boilerplate Python code that they have to write, by getting them over that hump, they then hit these other problems like, “Oh crap! Now I have this trained model. What do I actually do with that?”

We’re really trying to make that easy for them to integrate computer vision to their apps and focus exclusively on the things that are unique to their app and not the things that are boilerplate computer vision infrastructure that's reinventing the wheel of things that's already out there and not providing unique value to their domain-specific problems.

[00:17:39] JM: On your website, you have some areas where people can share datasets, and it seems kind of random compared to your other tools. What's the objective with these datasets, the shared datasets you have on your website?

[00:17:54] BD: Beyond providing the tools, if we want to enable any developer to use computer vision, we need to chop down those barriers that make it hard.

One of the things that we found is that a lot of developers have a problem that they have an inkling of an idea that they could use computer vision for, but they don't have a dataset that they've already gone out and collected.

They just hit this hump where they don't have something to try it out on. So we figured one way that we could get them over that hump was to provide a whole bunch of datasets that they could use to play around and try things out.

So we curated and released a bunch of open source datasets. Some of which we collected. Some are from our users that were willing to share those with other researchers. And some that were already open-source that we either improved or converted.

It’s one of those humps that if you don't have any data, it's really hard to get started learning computer vision. And along those lines, education and teaching people how to use computer vision is another big stumbling block.

If you're a software developer, computer vision can be one of these things where it feels inaccessible and like something where you'd have to go back to school to use it. And, in fact, I was a software developer before with no computer vision or machine learning expertise and at one time it seemed like an insurmountable hurdle to me too.

Educating people and putting out tutorials and making sure that we’re doing everything that we can to democratize this technology and make it accessible and a part of every developer’s tool chest is something that we should play a part in.

We have tutorials, we have YouTube videos, we have those public datasets, we have those open source models, and we’re trying to do everything we can to make that hurdle to getting computer vision into your app as low as possible.

[00:19:42] JM: And what are the main hurdles to getting computer vision into my application today? Let’s say I’m building a to-do list app. I'm a brand-new developer. I’m building a to-do list app and I want to have computer vision in my application because I want to – I don't know, take a picture of a blanket and have it recognize that it's a blanket so it can tell me to fold the blanket and put a to-do on my list for folding a blanket. Why is that hard today?

[00:20:15] BD: One of the biggest hurdles actually is for software developers to even realize that this is something that they can do. I mean, for the first 50 years of computing, teaching a computer to understand image data was an intractable problem. It was just something that even with a team of PhDs you couldn’t do. And it's really only been in the past decade that this has become accessible to not only teams of PhDs, but just a single solo software developer off the street.

One of the biggest challenges is just convincing people that it is possible and that it is something that they can do.

Once you get them over that hump, a lot of the other stuff is just normal software engineering stuff of getting a Python script up and running and following a tutorial and getting through things.

I think the last challenge is on the deployment side. It gets kind of complicated when you're looking to actually deploy it into the wild, because whether you're deploying it on a server or on a mobile phone or an embedded device somewhere else, you almost have to start from the end and think about, “Where am I going to put this?” And then that informs a lot of the decisions beforehand.

It's this like forwards pass of I need to figure out that I can do this. And then a backwards pass of, “Okay, so I think this is tractable. Now, I want to do this for real. If I want to put this on a Raspberry Pi, what considerations and decisions do I need to make before that to make sure that the model that I come out with is deployable there?”

It's an iteration process of getting over the hump of training your first model and then creating your first project that you can use in the wild.

[00:21:55] JM: Let's take it from the top again. Let's say I have a bunch of images of a chess board and each image has a configuration of a board situation and I want to generate solutions to those chess problems. What would be my process for using Roboflow to do that?

[00:22:20] BD: This is a great example. My cofounder and I actually built a computer vision powered chess solver for a hackathon at TechCrunch Disrupt last fall.

Maybe it would make sense for me to walk you through how we did that.

We came into this hackathon with basically nothing. We had a chess board. We had an iPhone. We had our laptops and that was it.

The first step was setting up the chess board and setting up a bunch of positions, taking pictures of those with our iPhone. Then you offload those pictures from your phone. And now you have the unlabeled images. So you need to label those images.

We brought those into a tool on the Mac called RectLabel, which creates annotations in a VOC XML format. You go through, you draw a box around each individual piece. You tell it, “This is a white queen. This is a black queen. This is a pawn.” And then you have this serialized format of the state of the chess board.

From there, at the hackathon, because we didn't have Roboflow yet, we wrote a bunch of Python scripts to modify that, resize the things, create some augmentations so that it would work depending on different lighting.

With Roboflow, you would just drop those images and annotations into our software and you get a GUI where you could play around with all those different settings.

At the hackathon, we then trained a model. We used Apple's tool called CreateML, which is a no-code training platform. With Roboflow, you could still do that. You just click, “Hey, I want these annotations in CreateML format.” Hit go. You get a zip file. You drop those into the app and hit train.

You could also with Roboflow say, “Hey, I want to train these on AWS with their Rekognition Custom Labels.” Or I want to use Roboflow Train, which is our competitor to that. You click a button, you get a model.

For us, training the model at the hackathon was something that we did overnight the first night. While that was going on, we were working on scaffolding out the app that was going to consume this model.

It was taking images from the iPhone camera. It was going to feed them through this black box model and get back JSON results (essentially of: is it looking at a chess board? Where are the pieces?)

And then you have a traditional problem to solve, right? You have the location of these pieces that you have serialized to say, “Okay, so this X-Y position represents this position on the board.” And once you have that, then you can feed it to a chess solver app. I think the one that we were using – I can't remember it. I think it was Stockfish at the hackathon.

Basically, you’re treating that computer vision as a black box that converts your image into usable computer data. And so it’s these two concurrent processes of developing the app and then developing the model that then end up working in tandem.
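The "traditional problem" half of that pipeline can be sketched in a few lines. The example below takes detections as piece centers in pixels and builds the piece-placement field of a FEN string, the notation chess engines such as Stockfish accept. The detection keys (`class`, `x`, `y`) and class names are hypothetical, and the sketch assumes a square, top-down image cropped to the board with White at the bottom; a real phone photo would need perspective correction first.

```python
# Hypothetical class names mapped to FEN piece letters.
PIECE_TO_FEN = {
    "white-king": "K", "white-queen": "Q", "white-rook": "R",
    "white-bishop": "B", "white-knight": "N", "white-pawn": "P",
    "black-king": "k", "black-queen": "q", "black-rook": "r",
    "black-bishop": "b", "black-knight": "n", "black-pawn": "p",
}


def detections_to_fen(detections, board_size):
    """Convert piece centers (pixels) into the piece-placement field of FEN."""
    square = board_size / 8
    grid = [[None] * 8 for _ in range(8)]  # grid[0] is rank 8, grid[7] is rank 1
    for det in detections:
        col = min(7, int(det["x"] // square))
        row = min(7, int(det["y"] // square))
        grid[row][col] = PIECE_TO_FEN[det["class"]]

    ranks = []
    for row in grid:
        fen_rank, empty = "", 0
        for cell in row:
            if cell is None:
                empty += 1  # FEN writes runs of empty squares as a digit
            else:
                if empty:
                    fen_rank += str(empty)
                    empty = 0
                fen_rank += cell
        if empty:
            fen_rank += str(empty)
        ranks.append(fen_rank)
    return "/".join(ranks)


detections = [
    {"class": "black-king", "x": 450, "y": 50},
    {"class": "white-queen", "x": 350, "y": 650},
    {"class": "white-king", "x": 450, "y": 750},
]
placement = detections_to_fen(detections, board_size=800)
```

A full FEN string also needs the side to move, castling rights and move counters, which the image alone cannot provide, so the app would have to supply those.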

[00:24:59] JM: The different phases of using Roboflow; analyze, preprocess, augment, convert, export and share. Could you go through each of these in a little more detail?

[00:25:12] BD: Sure. Conversion, I think we've touched on. There're all those different formats. We’re the universal conversion tool where you can import in one format and export in another. And one of the ways that we like to think about that is if you are an author and you were spending a bunch of your time converting .doc to .pdf as part of your process, that would be ridiculous. And that's how we think about engineers and machine learning people spending a bunch of time converting formats and writing Python scripts to do that. It’s just a ridiculous thing that you would have to spend any time writing file conversion tools in 2020. And so that's the piece of the process that we handle with the conversion side.

On the analyze side, we have these tools that once you upload your annotations into Roboflow, we can perform checks and tell you, “Oh, hey. This was a malformed annotation that's going to cause problems with your training script.” We automatically fix a bunch of those and we bring to your attention other potential problems.

As examples of that, some things that you run into when you're training a machine learning model are class imbalance. So let’s say you have your chess board images and it turns out there are 16 pawns on the board for every one queen. You’re going to end up with your model overweighting and seeing way more pawns than it sees queens. And so you'd probably want to rebalance those things so that your model is not able to cheat by just guessing pawn, because that's what optimizes its score, because in the later game, there's going to be fewer pawns and more queens relatively.

So we help you identify class imbalance. We help you identify things like, “Hey, this queen on this chessboard, like in 90% of your images, it was on the exact same portion of the image.” You might want to augment that so that your model doesn't learn the queen is always on the same white square and learn to cheat that way. We can do things like re-cropping the image or translating that bounding box around or going out and taking more photos with more examples of the queen on different squares.
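A class imbalance check of the kind described here can be as simple as counting labels. The sketch below computes each class's share of all annotations and flags the ones that fall below a cutoff. The 10% threshold and the function names are arbitrary choices for illustration, not values Roboflow uses.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of all annotations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def underrepresented(labels, threshold):
    """Classes whose share of all annotations falls below `threshold`."""
    shares = class_balance(labels)
    return sorted(name for name, share in shares.items() if share < threshold)


# One label per annotated box across the whole dataset.
labels = ["pawn"] * 16 + ["queen"] * 1 + ["rook"] * 3

shares = class_balance(labels)
rare = underrepresented(labels, threshold=0.10)
```

With these counts, pawns make up 80% of the annotations and only the queen falls below the cutoff, which is the signal to collect or generate more queen examples.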

Augmentation is what we mentioned earlier of making sure that your model generalizes. Doing things like adjusting the brightness and contrast, rotating it, cropping it.

There are some advanced augmentations that you can use. One of which is called mosaic, where it will take multiple different images from your training set and it will combine them all together to create an image that has four pieces of other images.

The purpose of augmentation is really to help your model generalize. If you feed it the same image over and over again, it just learns to memorize that particular iteration of your problem. By augmenting your images and feeding it a slightly different variation every time it sees an image, you get better results on images that it's never seen before.
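As a simplified picture of the mosaic idea, the sketch below tiles four equally sized images into a 2x2 grid and shifts each image's boxes by the offset of its tile. Real mosaic augmentation typically also picks a random split point and scales or crops each tile, which is left out here. Images are nested lists and boxes are `(xmin, ymin, xmax, ymax)` tuples, both assumptions for this example.

```python
def mosaic(tiles):
    """Tile four equally sized images 2x2 and shift their boxes to match.

    `tiles` is a list of four (image, boxes) pairs in the order
    top-left, top-right, bottom-left, bottom-right.
    """
    (tl, tl_boxes), (tr, tr_boxes), (bl, bl_boxes), (br, br_boxes) = tiles
    h, w = len(tl), len(tl[0])

    # Concatenate rows side by side, then stack the two halves vertically.
    top = [left + right for left, right in zip(tl, tr)]
    bottom = [left + right for left, right in zip(bl, br)]

    def shift(boxes, dx, dy):
        return [(x0 + dx, y0 + dy, x1 + dx, y1 + dy) for x0, y0, x1, y1 in boxes]

    boxes = (
        shift(tl_boxes, 0, 0) + shift(tr_boxes, w, 0)
        + shift(bl_boxes, 0, h) + shift(br_boxes, w, h)
    )
    return top + bottom, boxes


def solid(value, size=2):
    """A size x size image filled with one pixel value."""
    return [[value] * size for _ in range(size)]


combined, combined_boxes = mosaic([
    (solid(1), [(0, 0, 1, 1)]),
    (solid(2), []),
    (solid(3), []),
    (solid(4), [(1, 1, 2, 2)]),
])
```

The box from the bottom-right tile ends up offset by one tile width and one tile height, so it still surrounds the same pixels in the combined image.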

And then on the training side, I mentioned we have all those export formats that go to TensorFlow, PyTorch, the cloud AutoML tools or our one-click training platform. And we’re adding support for more and more as time goes on.

Our hope is to be that connector that connects every labeling tool with every training tool. When customers come to us and they’re like, “Hey, I have this like random annotation format from a Chinese paper that was published in 2012. Do you support that?” The answer is always yes. And we spend an hour adding support for that before they get onboarded. And our hope is to support every single format and every single training platform.

And then on the share side, this is one of the big pain points that we felt when we were building our own apps, is that it feels like the olden days before Dropbox where you would be emailing around these version 2.final.reallyfinal, and you have multiple people working on these datasets. And let's say in the olden days (before Roboflow) I took 20 chess images, I emailed a link on Dropbox to my cofounder. He combined that with his 20 images that he took. Then he found a problem with one of my images and he updated it. Well, now you have three different versions of the dataset and it’s not entirely clear which one you should use or how you should be working together on that.

Roboflow is the single source of truth for your datasets. By combining them into this platform that’s a multiuser sort of thing, you can really keep track of who's done what. What are the different versions? Who trained which models on which versions? And make sure that you’re staying in sync rather than having a bunch of different versions floating around out there that some of them are cropped and some of them are resized. You really just want your original files and then transformed for your models in a non-destructive manner so you can experiment without getting completely lost in all the data.
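One simple way to tell whether two copies of a dataset are really the same version is to fingerprint their contents. The sketch below hashes file names and file bytes into a short ID that does not depend on file order. This is a generic technique shown for illustration, not a description of how Roboflow tracks versions, and the function name and sample data are hypothetical.

```python
import hashlib


def dataset_fingerprint(files):
    """Hash a {filename: bytes} mapping into a short, order-independent ID."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator so names and contents cannot blur together
        digest.update(hashlib.sha256(files[name]).digest())
    return digest.hexdigest()[:12]


mine = {"a.jpg": b"pixels-a", "b.jpg": b"pixels-b"}
same_reordered = {"b.jpg": b"pixels-b", "a.jpg": b"pixels-a"}
edited = {"a.jpg": b"pixels-a-fixed", "b.jpg": b"pixels-b"}

same_version = dataset_fingerprint(mine) == dataset_fingerprint(same_reordered)
changed = dataset_fingerprint(mine) != dataset_fingerprint(edited)
```

Recording an ID like this next to each trained model makes it possible to say later exactly which images that model saw.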

[00:29:49] JM: And again, the process of preparing datasets for training. Let's go a little bit deeper into that. So the different things that Roboflow is going to do are: assess annotation quality, fix unbalanced classes, de-duplicate images, visualize model inputs and version control datasets. And then you can share them with your teammates. Tell me more about the preparation for dataset training.

[00:30:14] BD: As I mentioned, not only are all the labeling tools using different formats, but the training tools are all using different formats as well.

And most of the time they don't match up with any of the labeling tools. So for TensorFlow, you have to create what's called a TFRecord, which is a binary format that has all of your images and all of your annotations compiled into one file that it's going to load at training time to go through and create a data loader and iterate through all of your different images.

That’s something that traditionally you'd have to write your own Python script to take all of your images from disk, pair them with your annotations, encode them in this specific format and then output this TFRecord file that’s going to go through TensorFlow to do training.
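The pairing step of such a script can be sketched with the standard library alone. This is a simplified illustration and deliberately stops short of the TFRecord encoding itself, which would need TensorFlow. The annotation layout and the names `pair_images_with_annotations`, `filename`, and `boxes` are assumptions for the example, since every labeling tool uses its own format.

```python
from pathlib import PurePath


def pair_images_with_annotations(image_paths, annotations):
    """Match each image path to its annotation record by file name.

    `annotations` is assumed to be a list of dicts shaped like
    {"filename": "a.jpg", "boxes": [...]}; this layout is hypothetical.
    Returns (paired_records, unlabeled_image_paths).
    """
    by_name = {record["filename"]: record for record in annotations}
    paired, unlabeled = [], []
    for path in image_paths:
        record = by_name.get(PurePath(path).name)
        if record is None:
            unlabeled.append(path)  # image on disk with no annotation
        else:
            paired.append({"path": path, "boxes": record["boxes"]})
    return paired, unlabeled


images = ["data/board1.jpg", "data/board2.jpg", "data/board3.jpg"]
labels = [
    {"filename": "board1.jpg", "boxes": [{"label": "pawn", "xywh": [4, 8, 20, 30]}]},
    {"filename": "board3.jpg", "boxes": []},
]
paired, unlabeled = pair_images_with_annotations(images, labels)
# Each entry in `paired` is what a serializer would then encode into the
# training format; `unlabeled` flags images that still need annotation.
```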

You can imagine there are countless Stack Overflow questions asking, “How do I convert this format into a TFRecord for training with the TensorFlow Object Detection API?”

With Roboflow, it's just a click of a button. When you click export, you get a drop-down list, and one of the options is “Create a TFRecord”, and then it will compile those all together and it will either let you download that zip file to your computer or give you a link to it hosted in the cloud so that when you spin up your cloud server or boot up your Colab notebook you just drop in that one line of code and it will download it from the cloud, unzip it, and it will be ready for training with your model.

[00:31:38] JM: What problems do you think machine learning is uniquely positioned to solve in the next year, or 5 years, or 20 years?

[00:31:46] BD: That’s actually pretty interesting. One of the like opinionated stances that we take is that computer vision is actually its own unique beast.

It’s certainly a part of machine learning. But we think that the tooling and solutions that are needed for computer vision are actually much different than the ones that are used for, say, natural language processing.

I think when you think about “what is machine learning going to do,” it’s such a broad answer. Focusing in on “what is computer vision going to do” is probably the part that I'm most suited to answer.

And I think, if I think 20 to 30 years down the road, the state that we’re in right now with computer vision is similar to how the web was in the 90s where, certainly, there were e-commerce websites in the 90s, right? But in order to build them, you had to invent your own database and create your own web server. And if you wanted to accept payments, you had to be an expert in cryptography to be able to do that. And as we went forward, all those things were abstracted and made into tools that basically any software developer off the street could pick up and use.

Our mission is to do that for computer vision. And when you do that, you enable all of these new use cases.

If you think about what computer vision has done to the car industry with self-driving cars, it’s just this massive transformation that not only changes how cars operate, but also like how cities are going to be organized. And our core hypothesis is that computer vision isn't just about self-driving cars. It's like the PC or the Internet where it’s going to touch every industry and transform every industry.

If you look at kind of some of the use cases that are coming down the pipeline, I mentioned detecting oil leaks, but that's just the start.

We have a student that is working on detecting wildfires from computer vision. You can imagine having these security cameras on top of weather stations that are looking for smoke. And he wants to deploy a drone at the first sign of smoke to douse the fire before it gets out of control.

Or we have these other students that are doing human rights monitoring. So there’s this tribe in Africa called the Maasai people whose villages the government is burning, and he wanted to track their migration. And so he's using satellite imagery with computer vision to find where the camp sites were, where they are now, and track how this tribe is being displaced.

And we have companies that are building their entire company on top of Roboflow with computer vision. One of those was a Y Combinator company that is building a pill counting app. And they’re replacing this old $15,000 machine with something that runs on commodity hardware. It's going to make this accessible to all these small pharmacies and make their job so much easier.

And so I think when you think into the future, every app or company is going to be able to use computer vision without it having to be a core competency and without having to hire a bunch of PhDs. And that's really exciting.

It's a future that I want to live in. And if I could take a time machine and travel to the future and see how amazing things are going to be once developers have access to all this technology, it's totally something that I’d be intrigued by and interested in doing.

[00:34:58] JM: As a company goes from a test model to a production level model, what are the considerations they must take regarding datasets and dataset pipelines?

[00:35:09] BD: One of the biggest paradigm shifts for developers getting into machine learning and computer vision for the first time is that it's not a binary thing. It's not like your machine learning model works or it doesn't. There's this gradient of how well it works.

Traditionally, in software, you can just write a test and be like, “Yes, my code works. It does exactly what it's supposed to do.” But with machine learning, it's not entirely clear when you're done. And in fact you may never be done.

There’s this iteration cycle where you want to get something that works well enough for your first version, deploy that and then find all of the edge cases where it's failing, and then pull the things that it's not confident about or that a user reports are incorrect back from your production model into the beginning of the flow and put that into your dataset to make it more complete.

You train another model and you go through this iteration process where you deploy it and then you see, “Okay. Well, what's it still messing up?” Bring that back. And, over time, your model gets better and better. But you really have to close that loop.
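One way to picture the "pull it back" step is a filter over production predictions. The sketch below is an assumption about how such a filter might look, not a description of any particular product: the record fields (`image_id`, `confidence`, `user_flagged`) and the 0.6 threshold are made up for the example.

```python
def select_for_relabeling(predictions, confidence_threshold=0.6):
    """Pick production predictions worth sending back into the dataset.

    Each prediction is assumed to be a dict with "image_id", "confidence"
    (0.0-1.0), and "user_flagged" (True if a user reported it as wrong).
    """
    return [
        p["image_id"]
        for p in predictions
        if p["confidence"] < confidence_threshold or p["user_flagged"]
    ]


production_log = [
    {"image_id": "img-001", "confidence": 0.97, "user_flagged": False},
    {"image_id": "img-002", "confidence": 0.41, "user_flagged": False},
    {"image_id": "img-003", "confidence": 0.88, "user_flagged": True},
]
to_label = select_for_relabeling(production_log)
# to_label == ["img-002", "img-003"]
```

The selected images would then be labeled, added to the dataset, and used to train the next version of the model.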

The workflow and the cycle is such that you need to figure out “what is my MVP?” And make sure that you're picking a problem where you can actually deploy a first version of the model and it's not going to cause a car to run people over.

You can then figure out what the edge cases are and go back and keep iterating and making that model better and better over time.

For a lot of software developers, that's a new paradigm where you’re shipping something out there that you know is only going to work 80% of the time and then thinking about how you design your software around that knowing that 20% of the time it's not going to work well.

That's not only a software problem, but also a design problem and a business problem and a strategy problem that needs to be solved. And it’s kind of a new frontier for folks.

It's been interesting to kind of discuss that with people and hear this skepticism, like, “What do you mean my software is not going to be rock solid and work 100% of the time?” It's just like, “Well, it’s a probabilistic thing.” There are definitely use cases where having it be right 80% of the time is better than not trying at all.

It's an exercise in defining the right problem where you’re going to be successful rather than picking something that you’re going to have to spend a decade bulletproofing before you release the first version.

[00:37:35] JM: Are there some other common issues in dataset management that you've seen lead to poor model performance?

[00:37:41] BD: Yeah, there are a few.

One of the most common things that people run into is trying to detect tiny objects in their images.

You can imagine you’re trying to train something on satellite data and you're trying to detect people on a beach. Well, the resolution is such that those people are only a few pixels in your image. And the way that machine learning models commonly work is they have a fixed input size. And so even if your image is, let's say, 20 megapixels, it's going to get shrunk down to like 800x800 or 416x416.

And that small number of pixels in the big image ends up becoming one pixel or less in that shrunk-down image. This is one of those things where it's an implementation detail that, if you're not a machine learning expert and you're just a software developer that's following a tutorial, you might not have a mental model for what's actually going on behind the scenes.

And so if you have small objects inside of your images, it turns out that a lot of times your model just can't detect those things, because when it gets processed by the model, it gets shrunk down so far that there's just not any information for it to detect.
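The arithmetic behind this is easy to check. The numbers below are assumptions picked to match the description: a 5000x4000 image (20 megapixels), a person who occupies 10x10 pixels, and a model with a 416x416 input. The helper name `resized_object_size` is hypothetical.

```python
def resized_object_size(image_size, object_size, model_input_size):
    """Return the (width, height) in pixels of an object after the whole
    image is resized (stretched) to the model's fixed input size."""
    image_w, image_h = image_size
    object_w, object_h = object_size
    input_w, input_h = model_input_size
    return (object_w * input_w / image_w, object_h * input_h / image_h)


# Assumed numbers: a 5000x4000 (20 megapixel) image containing a person
# who is 10x10 pixels, fed to a model with a 416x416 input.
w, h = resized_object_size((5000, 4000), (10, 10), (416, 416))
print(f"{w:.2f} x {h:.2f} pixels")  # prints "0.83 x 1.04 pixels"
```

At roughly one pixel per object, there is essentially nothing left for the model to learn from.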

If your objects are small relative to the size of your input images, you have to do some things to account for that. So one of those might be tiling your images where, instead of shrinking this 20 megapixel image down and running your model on it one time, you cut it into a 10 x 10 grid. And now all of a sudden you have 100 different images that you run through your model, but in each one, the relative size of the object compared to the size of the image is bigger. And so it gives the model more pixels to work with.
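A minimal sketch of tiling, assuming a 5000x4000 image and a plain non-overlapping grid (the function name `tile_boxes` is hypothetical, and real pipelines often add overlap between tiles so objects on a boundary are not cut in half):

```python
def tile_boxes(image_width, image_height, rows, cols):
    """Split an image into a rows x cols grid of non-overlapping tiles.

    Returns (left, top, right, bottom) pixel boxes. Edges are computed with
    integer division so the tiles cover the image exactly even when the
    dimensions do not divide evenly.
    """
    boxes = []
    for r in range(rows):
        top = r * image_height // rows
        bottom = (r + 1) * image_height // rows
        for c in range(cols):
            left = c * image_width // cols
            right = (c + 1) * image_width // cols
            boxes.append((left, top, right, bottom))
    return boxes


# Assumed 5000x4000 image cut into a 10 x 10 grid: 100 tiles of 500x400.
tiles = tile_boxes(5000, 4000, rows=10, cols=10)
tile_w = tiles[0][2] - tiles[0][0]

# A 10-pixel-wide object resized to a 416-pixel-wide model input:
whole_image_px = 10 * 416 / 5000  # about 0.83 pixels without tiling
tiled_px = 10 * 416 / tile_w      # about 8.3 pixels with tiling
```

The trade-off is inference cost: the model now runs 100 times per source image instead of once.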

[00:39:26] JM: What are the other parts of the machine learning or computer vision process that could be automated?

[00:39:32] BD: Training and deployment is an interesting area where it's not clear that you actually need a custom model for each problem.

There are plenty of models that are very good off-the-shelf that you can adapt. It's called fine-tuning them. So you can take existing weights that are trained on a large dataset like COCO, which is a public dataset with hundreds of thousands of images that represent a whole bunch of different things. And the model can represent these generic things. You start from that base and then you train it to learn your individual objects. So if it’s chess pieces, a model trained on COCO doesn't know anything about chess pieces, but you can start from, “Oh, it knows how to identify dogs and cats.” And the features of dogs and cats, like curves and changes in patterns and those sorts of things, are also applicable to isolating chess pieces on top of a chess board.

It turns out that just by fine-tuning this model, you can get pretty good results without actually changing the architecture of the model. It’s just changing the weights. You can think of the model as a meta-program that can learn a whole bunch of different domains.
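To make "same architecture, different weights" concrete, here is a deliberately tiny toy, not a vision model: a one-weight, one-bias linear model that starts from existing weights and is trained further on new data. The starting weights, the data, and the function names are all invented for the illustration.

```python
def fine_tune(weights, examples, learning_rate=0.01, steps=200):
    """Continue training a linear model y = w * x + b from existing weights.

    The "architecture" (one weight, one bias) never changes; gradient
    descent on mean squared error only updates the values of w and b.
    """
    w, b = weights
    n = len(examples)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in examples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in examples) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mean_squared_error(weights, examples):
    """Average squared prediction error of the linear model on the examples."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


pretrained = (2.0, 0.0)  # stand-in for weights learned on a large generic dataset
new_task = [(0.0, 1.0), (1.0, 4.0), (2.0, 7.0), (3.0, 10.0)]  # follows y = 3x + 1

tuned = fine_tune(pretrained, new_task)
```

Comparing `mean_squared_error(pretrained, new_task)` with `mean_squared_error(tuned, new_task)` shows the error on the new task dropping even though the model's structure is untouched.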

Over the last decade, a lot of work has gone into optimizing your model architecture. But we’re getting to the point where these models are good enough for a lot of problems where you don't need to do a whole bunch of core research on the model architecture. You just need to retrain them on something else. And once you do that, you can get to these solutions where it doesn't actually need human intervention. You can just run through the same process. Get new weights, and it works pretty well.

Doing that and then deploying it is one area where – Certainly, for some problems, you're going to need to do some core R&D. But for a lot of problems, you can just kind of automate that process and get something deployed that works pretty well.

[00:41:30] JM: Any other predictions about the future of computer vision?

[00:41:34] BD: I think augmented reality is one that people have written off at this point, because the early example applications that people have come out with so far have been pretty underwhelming.

When you combine augmented reality with computer vision, it enables you to do really interesting things. And so when you hear Tim Cook saying that he thinks that AR is going to be the follow-up to the world's most successful product in the iPhone, I think people roll their eyes. They’re like, “AR is just this gimmicky thing that lets you put Pokémon on the street.”

But, really, when you combine it with computer vision, it allows you to put a software overlay over the top of the real world, which I think is really interesting: taking real-world objects and enhancing them with software for the first time without embedding a computer in the thing to make it smart. You just make it smart by adding a software layer that understands what it's looking at and can add features to it. That is something that is going to take people by surprise.

I would not write off AR just because the first version of it was pretty underwhelming. I think that there's a huge greenfield of opportunity there.

[00:42:44] JM: Okay. Well, thanks for coming on the show. It’s been great talking to you.

[00:42:47] BD: Yeah, likewise. Thanks for having me.

[END]