How to Control a PTZ Camera with Computer Vision and Roboflow Workflows
PTZ (pan, tilt, zoom) cameras that can move to objects of interest have many use cases in computer vision applications, from general security to manufacturing defect inspection. In many cases these cameras offer a simpler and less expensive solution than using a larger number of fixed cameras to cover a wide area.
But controlling a camera to point it to an object of interest has always been a challenge. Until now.
This blog introduces the application of PTZ tracking as a Roboflow workflow block, which allows for the construction of a computer vision application using simple, graphical logic. The post will introduce the general setup of the block as well as provide advanced usage examples for common applications.
A PTZ camera following our raccoon friend
Applications of Computer Vision-Aided PTZ Tracking
Inspection
PTZ tracking is an excellent way to inspect large objects or areas without the need to install large numbers of individual cameras.
If the “zoom if able” feature is selected and a zoom compatible ONVIF camera is used, the block will automatically zoom into an object when possible, attempting to fill the image with the bounding box. The precondition for this, however, is that the object is mostly stationary. Pan/tilt tracking will stop during a zoom operation. Once the zoom is complete and the camera is idle, it can optionally move back to a preset.
Workflow design in this case is critical. Objects that aren’t of interest, or that have already been inspected, need to be filtered out using the appropriate tools so that only the object to be inspected is passed to the PTZ block. Otherwise the PTZ block can perpetually try to seek the same object. Examples of such workflow designs are in the “Workflow Examples” section below.
A common application, for instance, would be label inspection. An event might happen, such as a forklift placing an object on a shelf. The workflow could be developed to recognize this event and then pass the object of interest to the PTZ camera. At this point the camera will locate and zoom into the label. Once the label is read, and the camera is no longer tracking, the workflow can stop sending the object to the PTZ block and will wait for the next forklift to arrive.
Safety
For workplace safety applications, a camera could follow an item of interest such as a worker in a prohibited area. It could also be used to enforce a safety buffer around a moving forklift, or follow an item on the end of a crane.
Security
Security cameras are another common use case for PTZ tracking. Many high end PTZ cameras already have some kind of auto-tracking capability for certain objects, like people. But the use of a security camera in a Roboflow workflow adds significantly improved intelligence and capability to any camera.
For instance, various models of objects can be developed for greater accuracy. Models can be developed to detect suspicious objects or people; for instance a person carrying a gun, a person whose face is obscured, or a person walking in a specific direction.
A workflow could even have two stages - it could detect a person, detect the head on a person, and then move the camera to the person’s head. Note, however, that the current version of the PTZ block does rely on the coordinate system matching the original workflow image. Dynamic crop should not be used unless the cropped coordinates and image size are adjusted to match the original.
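If a dynamic crop must be used, the detections can be shifted back into the original frame’s coordinate system before reaching the PTZ block. A minimal sketch using supervision (the helper and crop offsets here are hypothetical, not part of the block):

```python
import numpy as np
import supervision as sv

def shift_to_original(detections: sv.Detections, crop_x0: int, crop_y0: int) -> sv.Detections:
    # Translate boxes detected on a crop back into full-frame coordinates.
    # The prediction's image size metadata must also match the original frame.
    detections.xyxy = detections.xyxy + np.array([crop_x0, crop_y0, crop_x0, crop_y0])
    return detections
```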
Controlling PTZ Cameras: The ONVIF Protocol
ONVIF (Open Network Video Interface Forum) is a protocol implemented by many manufacturers of PTZ cameras for administration and control. The protocol communicates with the camera using SOAP (Simple Object Access Protocol) over HTTP.
Each camera can optionally implement a number of ONVIF SOAP services. The services used by the PTZ block are:
- GetConfigurationOptions: used to determine the camera’s movement spaces
- ContinuousMove: used to move the camera at relative pan, tilt and zoom speeds
- GotoPreset: used to move the camera to a preset position after idle (if defined)
Wireshark is a great tool for viewing ONVIF payloads and inspecting specific movement commands.
Note that, as shown above, for tracking the block doesn’t just send a single ONVIF command. Each workflow execution potentially changes the movement iteratively using a control loop as the object gets closer to the center. Update rate limits must be set within the block’s parameters to avoid overloading the camera with excessive commands.
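For illustration, the same ContinuousMove service can be exercised by hand with the third-party onvif-zeep package. This is only a sketch of the protocol calls, not what the block runs internally; the address and credentials are placeholders:

```python
from onvif import ONVIFCamera  # pip install onvif-zeep

cam = ONVIFCamera("192.168.1.64", 80, "admin", "password")
media = cam.create_media_service()
ptz = cam.create_ptz_service()
profile = media.GetProfiles()[0]

# Pan right at 30% speed; speeds are normalized to the camera's movement space
ptz.ContinuousMove({
    "ProfileToken": profile.token,
    "Velocity": {"PanTilt": {"x": 0.3, "y": 0.0}, "Zoom": {"x": 0.0}},
})
# ...
ptz.Stop({"ProfileToken": profile.token})
```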
Selecting a PTZ Camera
Various types of PTZ cameras exist, but for the purposes of this document, we can broadly classify them as fitting one of two categories:
- Higher end cameras with variable speeds and coordinate systems
- Lower end cameras with single fixed speeds and no coordinate system
In both cases, the camera must support the ONVIF protocol - specifically the ContinuousMove service.
Support for RTSP isn’t required, but some mechanism to get the camera image into the workflow must be available and RTSP is the most common mechanism.
Not all cameras will work with this block, even if they support both ONVIF and RTSP.
Higher end cameras
Higher end cameras usually have significantly more ONVIF services available. Most importantly, being able to support variable speeds allows the PTZ block to use a PID controller to accurately follow an object. As a result, cameras with variable speed control work significantly better with this block.
Lower end cameras
ONVIF Services
Lower end cameras won’t support as many ONVIF services. To improve compatibility with lower end cameras, the block uses the ContinuousMove service, which most lower and higher end cameras do support, rather than the AbsoluteMove service.
In order to move to a preset after idle (optional), the camera must support the GotoPreset service. Most cameras support this service whether they have presets or not. Many lower end cameras without coordinate systems don’t have the ability to define presets. Others will define presets, but only at extreme positions that don’t require coordinates (e.g. “LeftMost” and “RightMost” on Foscam cameras).
Variable Speed Control
Furthermore, while the block can start and stop a camera without variable speed control, without variable speed it likely won’t be able to compensate for video and processing lag.
For instance, if the RTSP buffering adds 2 seconds of lag, and the camera already moves the full range of motion in that 2 seconds, it’s unlikely that the camera will receive the stop signal in time to actually center on the object.
To account for this, the block has a “Simulate Variable Speed” option that sends a series of stop commands after each short movement command. This leads to jerky movement and hunting, but can work for some lower end cameras. If the camera doesn’t respond quickly to ONVIF commands, it’s still possible that the camera could overshoot its target. Note that decreasing the update rate limit is generally required when using this simulation.
Simulated variable speed control
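Conceptually, the simulation pulses the camera: a short ContinuousMove immediately followed by a Stop, repeated at the update rate. A rough sketch of the idea (reusing the ptz and profile handles from the earlier snippet; the pulse duration is an arbitrary example, not the block’s actual timing):

```python
import time

def pulsed_move(ptz, profile, pan_speed, tilt_speed, pulse_s=0.2):
    # Approximate a lower speed on a fixed-speed camera: move briefly, then stop
    ptz.ContinuousMove({
        "ProfileToken": profile.token,
        "Velocity": {"PanTilt": {"x": pan_speed, "y": tilt_speed}},
    })
    time.sleep(pulse_s)  # a longer pulse approximates a higher average speed
    ptz.Stop({"ProfileToken": profile.token, "PanTilt": True})
```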
Using ONVIF and RTSP
ONVIF and RTSP are not enabled by default on all cameras that support the protocols. Depending on the camera, you might have to go to the camera’s settings to enable both services. This tends to be found in “ports” or “network configuration.”
RTSP
The workflow must have a video stream from the camera in order to adjust movement for it. An RTSP stream is not required, but is usually the easiest way to get a video stream into the workflow. Each camera has its own RTSP URL, which can also contain the username and password for the stream (see the table below for URL configurations for the cameras we’ve tested).
RTSP stream used as a workflow preview source:
Alternatively, some cameras can also be used as webcams, which Inference works with natively as well. Adapters can be built in cases where a camera doesn’t use an Inference supported protocol.
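Before wiring the stream into a workflow, it can help to verify the RTSP URL directly. A quick sanity check with OpenCV (URL and credentials are placeholders):

```python
import cv2

cap = cv2.VideoCapture("rtsp://username:password@192.168.1.64:554/h264")
ok, frame = cap.read()
print("stream ok:", ok, "frame shape:", frame.shape if ok else None)
cap.release()
```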
ONVIF
Within the workflow, the ONVIF setup is separate from the RTSP stream, even though both typically point at the same device. The ONVIF settings are entered into the workflow block, and the block assumes the images passed into it come from the camera being controlled.
Camera Settings Comparison
The example below includes two cameras - one with greater capabilities, and the other with fewer. The more capable camera is an ANPVIZ PTZIP30A60WD and the less capable is a Reolink E1 Zoom. Note this isn’t a buying guide: the two are actually close in price, and the classification is based purely on the features required by the PTZ block. We’re not considering factors like image quality, networking capability or general durability.
The ANPVIZ camera has a coordinate system. It’s capable of arbitrary presets, and variable speed movement. While there is some RTSP lag, it’ll respond to an ONVIF command almost instantly.
The Reolink camera, while highly capable, would be considered lesser-featured in our categorization. It doesn’t support variable speed movement, and does not have ONVIF capable presets. It also does not respond to ONVIF commands instantly, and there appears to be some lag before it honors a stop command.
Both cameras can be controlled by the block. The ANPVIZ camera can be tuned to find its target quickly with minimal hunting (moving back and forth trying to center an object), even if the RTSP stream is highly lagged. The Reolink camera, however, is not capable of variable speed. Due to RTSP lag, its higher speed movement will likely overshoot the target. As a result it must use the “Simulate Variable Speed” option on the block, resulting in jerky movement and hunting. Note that “Simulate Variable Speed” isn’t likely to work with every camera.
The table below has a comparison of the optimal settings used for both cameras during our testing. Note that we tested with lower frame rates, and set the cameras to the lowest main-stream resolution available, in order to help improve the cameras’ ONVIF responsiveness.
| Camera Specs | High End | Low End |
|---|---|---|
| Manufacturer | ANPVIZ | Reolink |
| Model | PTZIP30A60WD-SA-5X | E1 Zoom |
| Coordinate Presets | Yes | No |
| Variable Speed Control | Yes | No |
| Default ONVIF port | 80 | 8000 |
| Default RTSP port | 554 | 554 |
| Has zoom | Yes | Yes |
| RTSP Lag (observed*) | ~2 seconds | ~2 seconds |
| ONVIF Lag (observed*) | <1 second | ~1 second |
| RTSP URL | rtsp://username:password@ip:554/h264 | rtsp://username:password@ip:554/h264Preview_01_main |

| Camera and Workflow Settings Used | High End | Low End |
|---|---|---|
| Simulate Variable Speed | False | True |
| Test Image Resolution | 1920x1080 px | 2304x1296 px |
| Frame Rate | 10 fps | 10 fps |
| Dead Zone | 150 px (14% of height) | 250 px (20% of height) |
| Camera Update Rate Limit | 500 ms (2/second)** | 100 ms (10/second) |
| Flip X Movement*** | True | True |
| Flip Y Movement*** | False | False |
| Minimum Camera Speed | 0.05 (out of 1) | 0.25 (out of 1) |
| PID Kp | 3 | 0.3 |
| PID Ki | 0 | 0 |
| PID Kd | 0.1 | 0.5 |
*These are subjective measurements observed during our own testing.
**This camera’s zoom doesn’t respond if the update rate is faster than 2/second. Faster rates can be used if zoom is not required for the application.
***Flip X and Y movement options exist because the camera images can be reversed. Right now, because of the way the motion control is set up, the Y movement should generally be flipped by default. This might change in future block releases.
Important note regarding update rate limits
In this case, as mentioned above, the Reolink camera does not have variable speed and can be slow to respond to ONVIF commands. The variable speed simulation sends successive move and stop commands in order to start and stop the camera. But the update rate limit can prevent those successive stop commands from getting through, so it’s important to keep the update rate limit as fast as possible. This is why we can get away with 500 ms for the variable speed camera, but require 100 ms for the non-variable-speed camera.
Running the PTZ Camera Control Block
The only input to the PTZ block is an object detection prediction (see https://inference.roboflow.com/workflows/kinds/#kinds-declared-in-roboflow-plugins for more information on workflow kinds). This prediction contains both information about the original image size, and the location of the detected objects. The PTZ block can then calculate the relative movement required to move the camera to the object. An image input is not necessary for the block.
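Conceptually, the relative movement is driven by the offset between the detection’s center and the image center. A simplified sketch of that calculation (not the block’s actual code):

```python
def movement_error(xyxy, image_w, image_h):
    # Normalized offset of the box center from the image center, in [-1, 1].
    # The sign of each error tells the camera which way to pan/tilt.
    cx = (xyxy[0] + xyxy[2]) / 2
    cy = (xyxy[1] + xyxy[3]) / 2
    err_x = (cx - image_w / 2) / (image_w / 2)  # pan error
    err_y = (cy - image_h / 2) / (image_h / 2)  # tilt error
    return err_x, err_y
```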
The block can only move to a single prediction. By default, it will select the highest confidence prediction from the set. If “Follow Tracker” is enabled, then the camera will follow a tracked object until that tracker is no longer available, even if a higher confidence object becomes available. Note that in order to make a tracker available, some tracker such as Byte Track must exist in the workflow prior to the PTZ block.
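The selection rule described above might be sketched like this (an assumed paraphrase of the block’s behavior, not its source):

```python
import numpy as np
import supervision as sv

def select_target(detections: sv.Detections, locked_id=None) -> sv.Detections:
    # With "Follow Tracker" on, stay locked to the tracked id while it exists
    if locked_id is not None and detections.tracker_id is not None:
        match = detections.tracker_id == locked_id
        if match.any():
            return detections[match]
    # Otherwise fall back to the single highest-confidence detection
    return detections[int(np.argmax(detections.confidence))]
```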
Ideally, in order for the PTZ block to work correctly, a workflow designer should try and design the workflow in a way that only passes a minimal number of specific objects of interest into the camera. Using a filter in between a model and the PTZ block is a good way of accomplishing this.
A simple workflow structure capable of basic tracking is shown below:
This workflow can be modified to include filtering and tracking:
The workflow block also creates two outputs that can be used in other blocks. The first is a flag indicating whether the block is currently seeking a tracked object; this is set asynchronously, so it might not correspond to a specific frame execution, but it represents the block’s action at any point in time. The second is the object currently being tracked, which can be used to verify operation; this is set synchronously, but therefore represents the block’s input rather than its output.
Filtering Requirements
If the block is used in an inspection capacity, and the preset moves to a zone where the objects are still visible, some type of filtering needs to be added to the workflow to filter out previously inspected objects by tracker id. If this isn’t done, the camera will still go back to the preset, but immediately attempt to seek the same object.
See the section below “Workflow Examples” for detailed instructions on how to implement a filter.
Deploying the Workflow
The boilerplate code provided by the workflow is the best starting point for running the workflow in Inference. InferencePipeline should be used, as the block is meant for video.
Buffering strategy
Note that because the PTZ block is a real time control application, effort should be made to minimize the lag between the video and control movements. Greater lag requires greater PID compensation and can result in slower movements and hunting.
The buffering strategy used in inference can produce lag in excess of the RTSP stream lag alone. By default, inference uses a lazy buffer consumption strategy, which results in smoother video but requires a larger buffer and increases lag. This isn’t recommended for use with the PTZ block.
While it’s possible to compensate for lag up to a point through PID tuning, using an eager buffer consumption strategy is almost always a better approach. In that case the inference pipeline initialization needs to specify this:
```python
from inference import InferencePipeline
from inference.core.interfaces.camera.video_source import (
    BufferConsumptionStrategy, BufferFillingStrategy)

def my_sink(result, video_frame):
    pass  # handle predictions / render annotated output here

pipeline = InferencePipeline.init_with_workflow(
    api_key="xxxxxxxxx",
    workspace_name="lou-loizides-mgjtt",
    workflow_id="onvif",
    source_buffer_consumption_strategy=BufferConsumptionStrategy.EAGER,
    source_buffer_filling_strategy=BufferFillingStrategy.DROP_OLDEST,
    video_reference="rtsp://admin:default_password@0.0.0.0:554/h264",
    max_fps=10,
    on_prediction=my_sink,
)
pipeline.start()
pipeline.join()
```
Buffering should be changed to EAGER if possible for the motion control to work best.
Eager buffering can, however, have the downside of producing jerky video depending on the processing power required by the application.
Note the significant difference in Eager and Lazy lag in the video below. The first example is just the RTSP lag without running inference (this is mostly camera dependent). The second example is with Eager buffering, which is about equivalent to the native RTSP stream lag. The third example is Lazy buffering, which is currently the default setting. It provides smoother inferencing, but with a significant delay. The several seconds of buffering that Lazy adds can make motion control extremely difficult, and the use of Eager is highly recommended.
Buffer consumption setting comparison
PID (Movement) Tuning
The PTZ block uses a PID loop for motion control. PID scales the control output (in this case the camera speed) using three constants: proportional (Kp), integral (Ki) and derivative (Kd).
Currently the PTZ block doesn’t include any PID auto-tuning capability, and tuning the PID/Movement by adjusting these constants is likely the most critical part of configuring the PTZ block. The default parameters might work, but PID parameters can vary significantly from one application and/or camera to the next.
A suggested tuning method is as follows:
- Set Ki and Kd to 0
- Set Kp to a low number
- Adjust Kp until the camera movement doesn’t significantly overshoot the object it’s tracking. To some degree, Kp is basically the speed control.
- Increase Kd to provide some dampening and prevent the overshoot
Note that if the lag were non-existent, and the ONVIF response were immediate, neither Kd nor Ki would likely be necessary and could be left at zero. And in most applications, even with lag, Ki can usually be ignored and left at zero.
Poor and good PID tuning comparison
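For readers unfamiliar with PID, a minimal controller looks like the sketch below; the block’s internal implementation may differ. The output, clamped to the camera’s speed range, is what gets sent as the ContinuousMove velocity:

```python
class PID:
    def __init__(self, kp, ki=0.0, kd=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        # error: normalized distance from image center; dt: seconds since last update
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

With Ki and Kd at zero this reduces to pure proportional control, which is why Kp acts roughly like a speed control.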
Dead Zone
In addition to movement tuning, the block has a setting for a dead zone. Movement will stop once the object is within this dead zone. A larger dead zone usually means less hunting, but hunting generally is more dependent on control lag and PID settings.
In addition to this, the dead zone is used to define the target margin between a zoomed-in object and the edge of the frame when zoom is enabled.
Note that generally the dead zone should be set to the maximum tolerable amount. But in the case of zooming, an excessively large dead zone can leave the image too far to one side during the zooming operation, causing the need for additional pan and tilt movements.
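In pixel terms, the dead zone check itself is simple; a sketch of the assumed behavior:

```python
def outside_dead_zone(err_x_px, err_y_px, dead_zone_px=150):
    # Keep moving only while the object center is outside the dead zone
    return abs(err_x_px) > dead_zone_px or abs(err_y_px) > dead_zone_px
```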
Workflow Examples
Using Tracker IDs to Forget Objects (By Using a Timeout)
The block has the capability to center on an object, zoom in on it, and go back to a preset. It does not currently have any built in capability to forget objects it’s already inspected, as this type of inspection is application specific. For security applications, where the subject is expected to walk off camera, this behavior can usually be acceptable. But for inspection applications, the workflow usually needs some more advanced logic to perform the inspection correctly.
The most common way to forget about an object is to have the camera move to a preset where the object no longer exists once the inspection is complete (note that inexpensive cameras might not have preset capability).
Note, however, that the block doesn’t know where the camera is within the “go to preset” movement, and many cameras will still lock onto and reattempt to move to an object even during the go to preset movement. See the video below for an example.
Example of a camera re-seeking the same object after being idle
The more robust way to forget an object is through the use of tracker ids, which exist within the object prediction kind (the object prediction is described as the supervision detections object at https://supervision.roboflow.com/0.20.0/detection/core/).
This can be done by including some kind of filtering before the object prediction is passed to the PTZ block. For instance, if the goal of the inspection is to read a barcode, then once the application has a successful barcode read, the workflow should forget about that object and filter it out.
One example of how to perform this using a timeout is below. In this example, we use the “seeking” flag to indicate when the camera is idle (not moving or zooming). Once we haven’t been seeking for some time, we forget about this tracking id. Note that this is the same logic used by the “move to position after idle seconds” setting in the block.
In this case the following blocks are added to the previous example workflows above:
- Cache Get/Set - stores the list of previously inspected ids we’re going to filter out
- Filter Tracked - this is a few lines of custom Python that remove the previously inspected ids from the object detection set
- Tracked IDs - when the PTZ block has been idle for some number of seconds, we add the current id to the cached list
The “Filter Tracked” block’s custom Python:

```python
import numpy as np

# BlockResult is provided by the workflows custom Python block environment
def run(self, predictions, tracked_ids) -> BlockResult:
    # Drop any detections whose tracker id has already been inspected
    if isinstance(tracked_ids, list) and isinstance(predictions.tracker_id, np.ndarray):
        if len(predictions.tracker_id) > 0:
            predictions = predictions[~np.isin(predictions.tracker_id, tracked_ids)]
    return {"predictions": predictions}
```
The “Tracked IDs” block’s custom Python:

```python
from datetime import datetime

IDLE_SECONDS = 5
not_seeking_start_time = None

def run(self, predictions, existing_cache_ids, seeking) -> BlockResult:
    global not_seeking_start_time
    tracker_id = -1
    if predictions.tracker_id is not None and len(predictions.tracker_id) > 0:
        tracker_id = predictions.tracker_id[0]
    if existing_cache_ids is False:  # Cache Get returns False when the key is empty
        existing_cache_ids = []
    if seeking:
        not_seeking_start_time = None
    elif not not_seeking_start_time:
        # Just went idle: start the timer
        not_seeking_start_time = datetime.now()
    elif (datetime.now() - not_seeking_start_time).seconds > IDLE_SECONDS:
        # Idle long enough: remember this id so it gets filtered out
        existing_cache_ids.append(tracker_id)
        not_seeking_start_time = None
    return {"tracked_ids": existing_cache_ids}
```
Note that this example does include a TTL for old ids, which should be the case to avoid perpetual memory growth. Alternatively, since ByteTrack assigns ids sequentially, the example could simply remember the highest id and ignore all lower ids.
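That highest-id alternative could look like this (a hypothetical variant of the “Filter Tracked” block):

```python
import numpy as np

def run(self, predictions, highest_inspected_id) -> BlockResult:
    # ByteTrack ids increase monotonically, so anything at or below the
    # highest inspected id has already been handled and can be dropped
    if isinstance(predictions.tracker_id, np.ndarray) and highest_inspected_id:
        predictions = predictions[predictions.tracker_id > highest_inspected_id]
    return {"predictions": predictions}
```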
Many alternative methods of creating similar workflows exist. The Time in Zone block, for instance, is a convenient way to determine when an object has been centered for long enough, but it won’t account for zoom movements.
The video below shows the impact of this design. In this case the camera zooms in on the object, and after it waits a few seconds it forgets it. The bounding box is removed and the object is no longer recognized, so it doesn’t try to re-track it.
Object is successfully forgotten after the camera is idle and isn't re-seeked
Using Tracker IDs to Forget Objects (By Using a Successful Inference)
Generally timeouts won’t be used in practice, because what we actually care about in most use cases is a successful inference. Once we’ve inferred what we need from the object, we can then ignore it. A common example might be reading a label.
In the example below, instead of forgetting a tracker id on a timeout, we’ve given our raccoon friend a barcode. The camera zooms in until it gets a good barcode read. Once it’s read the barcode successfully, it forgets about the tracker. Several seconds later the camera then moves back to its idle preset. This pattern could be used in inventory applications to read barcodes and labels of items placed on shelves, in a truck, etc. It can also be expanded to include factors such as prediction confidence.
This only requires a slight modification to the tracked id block, and changing one of the inputs from the “seeking” output of the PTZ block to the barcode detection prediction output:
```python
# (The idle-timeout logic from the previous example is no longer used here)
not_seeking_start_time = None

def run(self, predictions, existing_cache_ids, barcode) -> BlockResult:
    global not_seeking_start_time
    tracker_id = -1
    if predictions.tracker_id is not None and len(predictions.tracker_id) > 0:
        tracker_id = predictions.tracker_id[0]
    if existing_cache_ids is False:
        existing_cache_ids = []
    if len(barcode.xyxy) > 0:
        d = barcode.data
        barcode_labels = d.get("data")
        if len(barcode_labels) > 0:
            # Successful read: remember this id so the object is forgotten
            existing_cache_ids.append(tracker_id)
            not_seeking_start_time = None
    return {"tracked_ids": existing_cache_ids}
```
Also note the following block has been added before the image output in order to place the barcode contents on the screen once read:
barcode_text = "" def run(self, barcode, image) -> BlockResult: global barcode_text labeled_image = image.numpy_image if len(barcode.xyxy)>0: d = barcode.data barcode_labels = d.get("data") if len(barcode_labels)>0: print(barcode_labels[0]) barcode_text = barcode_labels[0] if barcode_text: cv2.putText(labeled_image, f"Read barcode: {barcode_text}", (100,150), cv2.FONT_HERSHEY_COMPLEX, 4, (255, 255, 255), 0, cv2.LINE_AA) return {"labeled_image":WorkflowImageData.copy_and_replace( origin_image_data=image, numpy_image=labeled_image, ), "barcode_text":barcode_text}
This video is the result of this block. The camera seeks and zooms into the object. Once it’s read the barcode, regardless of the zoom level, it forgets about the object and then moves back to its preset.
Object is forgotten after the barcode is read successfully
Using GoToPreset
The PTZ block has two modes:
- Follow an object
- Go to preset
The block has a built-in capability to go to a defined preset after being idle. But in many cases the preset we want the camera to go to depends on application logic. If this is the case, go to preset logic can be added to the workflow by adding another PTZ block in go to preset mode.
Additionally, we might have cases where the camera is expected to focus on objects or features in a discrete location after some event. Maybe pallets are loaded onto a shelf, and the goal of the workflow is to read a serial number on each pallet. In this case, presets can be created and executed depending on which pallet changed. Explicitly adding a “go to preset” block into the workflow provides significantly more control over the camera’s behavior.
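Outside of workflows, the equivalent ONVIF call is a single GotoPreset request; for reference, a sketch reusing the onvif-zeep handles from earlier (the preset token is an assumption, and tokens are camera-specific):

```python
# Move directly to a stored preset; "1" is only an example token
ptz.GotoPreset({"ProfileToken": profile.token, "PresetToken": "1"})
```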
Challenges and Limitations
The camera can’t track something the model doesn’t detect
Not all models work perfectly. The model might fail to detect an object in a frame, as shown in many of the video examples here. Generally the camera will stop moving in this case and pick up the object in subsequent frames. This can also cause the loss of a tracker, although adjusting tracker buffering can compensate for it.
In addition to this, models frequently won’t work when objects are partially visible. In the example below, only the ears of the raccoon are visible. The model isn’t trained well enough to recognize this as our object. Since the camera won’t move if it doesn’t detect an object, it won’t track downwards to follow this object.
In this case the solution is generally more training. Vision models pick up on features. One critical feature the model likely picked up on was that the ball was a circle. Including images in the training set with a partially obscured ball would have likely helped.
In the case below, the object is seen, but it’s clipped. The camera doesn’t realize that there’s more above the frame and won’t try to center it. The one case where it could work, is if zoom is enabled - the block will see that the bounding box is at the edge and try to zoom out. But that only works if any negative zoom is available (in the case below, it isn’t).
Impact to trackers
If the camera is using a tracker (provided by byte tracker for example) to lock onto a detected object, and that object isn’t detected in subsequent frames, the tracking id can be lost and the camera might lock onto another object as a result.
This is true even though most trackers can continue tracking even if a detection is briefly lost, because the camera will continually seek out a currently tracked object or the highest confidence prediction.
Bounding boxes that fill the camera frame
The PTZ tracker attempts to center a bounding box within the image. If zoom is enabled, and the bounding box touches the edge of an image, it will try to zoom out. But this isn’t always possible.
In the case where a bounding box touches two opposite edges of the image, it’s already centered on one axis and won’t move along that axis. This can cause things like heads to get cut off in the case that the camera is using a full body model.
If tracking a certain feature like a head is critical, and the person could get too close to the camera, then a model should be trained on the head alone.
Cameras that already have auto-tracking
Some cameras already have auto-tracking capability. This can conflict with the ONVIF commands and should be disabled when the PTZ block is used.
Coordinate System Limitations
Cameras operate on a polar coordinate system, where both pan and tilt represent an angle. The image, however, is represented with an XY coordinate system. For maximum accuracy, we’d want to control the camera using the XY image coordinates converted to polar coordinates.
This conversion, however, isn’t possible for the following reasons:
- The PTZ block uses ContinuousMove for maximum compatibility, and has no knowledge of the camera’s current position. Most PTZ cameras, in fact, have no encoder and no coordinate system at all.
- Cameras that do have coordinate systems usually have arbitrary coordinates (ex. 0-1 rather than degrees) and those arbitrary coordinates can be anywhere along the range of movement
- The range of movement from camera to camera changes
As a result, motion control on the camera uses an XY coordinate system where X is pan and Y is tilt. This works very well in most cases where the camera is pointed at a mostly horizontal angle. It also tends to work when the camera is pointed straight up or down and pan rotates the image, but at that angle it will tend to hunt more or might fail to track the object.
Zooming Limitations
The current V1 version of the block cannot zoom while moving. Part of the reason is that the apparent movement speed within the image increases as the zoom increases, requiring an entirely different set of PID constants. To keep motion stable enough for zooming, the camera only zooms once an object is centered, and will not pan or tilt while the zoom is being executed.
Conclusion
Using PTZ cameras in workflows is a powerful combination that opens up a wide range of applications from general security to manufacturing inspection. Roboflow workflows provide a simple and easily maintainable way to build powerful PTZ camera applications.