The client — a top educational institute — is highly ranked both nationally and internationally for academic excellence and research contributions. Their research projects span multiple disciplines, including artificial intelligence, automation, materials science, and real-world industrial applications.
Within this broader research landscape, the institution is actively exploring waste management and recycling technologies. One of their research teams is developing AI-driven systems capable of automatically identifying and classifying waste items as they move along conveyor belts, aiming to improve sorting accuracy, reduce manual dependency, and enable more efficient recycling operations at scale.
To train an AI system that could automatically recognize different types of waste on a conveyor belt (like in recycling plants), the client needed video annotation services. They provided a large set of CCTV footage and kept adding to that dataset as the project progressed.
Our video labeling team had to look at each image/frame and:
However, the client was very specific about avoiding bad labels, so they shared mandatory guidelines covering the conditions under which annotation had to be skipped:
The client did not just want “a lot of labeled data.” Their priority was to keep bad labels, which would degrade the AI’s performance, out of the dataset. We therefore had to ensure that only clean, accurate examples were included in the AI training data.
However, our team faced certain hurdles:
Since the images were sourced from regular CCTV footage, the waste items rarely looked like clean textbook examples. They often appeared crumpled, torn, wet, or overlapping with other materials. Several categories also shared similar visual characteristics (e.g., cardboard vs. paper, plastic film vs. foil). This made it difficult to assign the correct label confidently.
Since the conveyor belt moved slowly, many consecutive video frames were almost identical. If we annotated every frame, we would spend time labeling the same visual information repeatedly, which would reduce throughput without improving the training dataset. The challenge was to quickly identify which frames introduced new, useful visibility of waste items and which frames could be skipped without compromising accuracy or speed.
The client’s guidelines required annotators to skip any object that could not be clearly identified. This forced annotators to use their judgment rather than simply follow a mechanical tagging workflow, and different annotators can interpret the same frame differently. For instance, a flattened plastic bottle can resemble a piece of transparent plastic film in low-resolution footage: one annotator might label it as a plastic bottle, while another might skip it because the shape is unclear.
The goal was not just to label data, but to create a repeatable, auditable process that could deliver precise annotations at scale — without compromising on the client’s strict research standards. So, we customized our video annotation services to meet the client’s needs. Our approach combined structured human judgment with intelligent automation, ensuring every annotated frame contributed meaningful information to the AI model. Here’s how we engineered this end-to-end system.
Before automation could be effective, our annotation team needed a shared understanding of how each waste category appeared in real-world CCTV footage. We created a visual reference guide for all 16 waste types using actual client images — showing variations such as torn paper, reflective foil, wet cardboard, or crushed bottles.
This reference set guided our annotation decisions, ensuring that all annotators labeled objects consistently. It was also embedded into the automated video annotation workflows we built.
To prevent the team from wasting time labeling near-identical images, we asked annotators to compare each frame with the one immediately before it. If nothing had changed, they skipped the frame; if something had changed (an item shifting, unfolding, separating, or becoming clearer), they annotated it.
At this scale, however, manually screening frames was highly inefficient, and CVAT has no built-in feature to automatically detect and skip near-duplicate video frames during annotation.
To address this, we developed a custom preprocessing pipeline that ran outside CVAT, but was integrated with its upload workflow. This Python-based system, powered by OpenCV and scikit-image, automatically analyzed video sequences before they entered the annotation stage. It compared consecutive frames for visual similarity and motion to filter out near-duplicate frames. Only the most informative frames — those showing new object angles or improved clarity — were passed for video labeling.
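To give a sense of how this kind of similarity filter works, here is a minimal sketch using OpenCV and scikit-image’s SSIM metric; the threshold value, downscale size, and function name are illustrative assumptions, not our exact production settings.

```python
# Minimal sketch of near-duplicate frame filtering (illustrative values only).
import cv2
from skimage.metrics import structural_similarity as ssim

def extract_informative_frames(video_path, similarity_threshold=0.92):
    """Yield (frame_index, frame) pairs that differ enough from the last kept frame."""
    cap = cv2.VideoCapture(video_path)
    last_kept = None
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Compare downscaled grayscale versions to keep the check fast.
        gray = cv2.cvtColor(cv2.resize(frame, (320, 180)), cv2.COLOR_BGR2GRAY)
        if last_kept is None or ssim(last_kept, gray) < similarity_threshold:
            yield index, frame          # informative frame: pass on for annotation
            last_kept = gray
        index += 1                      # near-duplicates are dropped silently
    cap.release()
```

Frames that passed this filter were the ones packaged into CVAT annotation tasks, so annotators only ever saw footage likely to add new information.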
There was no way in CVAT to natively enforce the client’s detailed “skip” conditions or record why a frame was skipped. So, we customized the interface to enable annotators to follow the rules more consistently.
We added a “Skip Reason” dropdown that required annotators to select why a frame or object was not labeled (for example, less than half visible, too blurry, or unclear category). We also included small on-screen tips reminding them of key video labeling rules. Each skip reason and annotator ID was automatically recorded in the task data, making every decision traceable.
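The dropdown itself can be expressed as a label attribute in CVAT’s label configuration. The snippet below shows the general shape of such a specification as a Python structure; the category and value names are placeholders, not the client’s actual taxonomy.

```python
# Illustrative label specification with a "skip_reason" attribute
# (category and value names are placeholders, not the client's taxonomy).
SKIP_REASONS = [
    "none",
    "less_than_half_visible",
    "too_blurry",
    "unclear_category",
]

label_spec = [
    {
        "name": "plastic_bottle",          # one of the 16 waste categories
        "attributes": [
            {
                "name": "skip_reason",
                "input_type": "select",    # rendered as a dropdown in the UI
                "mutable": True,
                "default_value": "none",
                "values": SKIP_REASONS,
            }
        ],
    },
    # ...the remaining categories follow the same pattern
]
```

Because the attribute value is stored with each annotation, every skipped object carries its reason along with the annotator who made the call.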
With these modifications, we transformed CVAT from a generic data labeling tool into a domain-specific annotation control environment aligned precisely with our client’s research needs.
To ensure uniform interpretation of skip conditions and prevent human bias, we established a two-tier validation process. It helped resolve ambiguous cases where annotators were at risk of making subjective, and therefore inconsistent, decisions based on their own perception:
Every annotation event — including skips, flags, and corrections — was automatically logged within CVAT’s metadata. We extended this logging into a QA documentation report that summarized:
Weekly reports were reviewed jointly with the client’s research team. This collaborative audit process enabled them to refine category definitions, analyze which item types were most frequently skipped, and supply additional footage featuring those items, ensuring the model had more relevant examples to learn from.
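As a rough illustration of how such a summary can be produced, the sketch below counts skip reasons overall and per annotator from an exported event log; the CSV format and column names are assumptions, not CVAT’s exact metadata schema.

```python
# Turn logged skip events into a simple QA summary
# (the CSV columns shown here are assumptions, not CVAT's exact schema).
import csv
from collections import Counter

def summarize_skips(log_csv_path):
    """Count skip reasons overall and per annotator from an exported event log."""
    by_reason, by_annotator = Counter(), Counter()
    with open(log_csv_path, newline="") as f:
        for row in csv.DictReader(f):  # expected columns: annotator_id, skip_reason, frame_id
            if row["skip_reason"] != "none":
                by_reason[row["skip_reason"]] += 1
                by_annotator[row["annotator_id"]] += 1
    return by_reason, by_annotator
```

Counts like these fed directly into the weekly discussions about which categories needed clearer definitions or additional footage.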
Although CVAT provides standard COCO export options, the client’s AI training workflow required additional metadata and a unified dataset structure that the default format couldn’t support. We developed a custom export converter that merged multiple CVAT task outputs, appended client-specific fields (annotator ID, skip reason, timestamp, and frame sequence), and validated schema consistency prior to integration.
This ensured full traceability and a ready-to-train dataset that aligned precisely with the client’s TensorFlow pipeline.
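The merging step at the heart of that converter can be sketched as follows; the file handling, ID re-numbering, and validation shown here are simplified assumptions rather than the full production logic.

```python
# Simplified sketch of merging per-task COCO exports into one dataset
# (file names, validation, and field handling here are assumptions).
import json
from pathlib import Path

REQUIRED_KEYS = ("images", "annotations", "categories")

def merge_coco_exports(task_files, output_path):
    """Merge COCO exports from several CVAT tasks, re-numbering IDs to avoid collisions."""
    merged = {"images": [], "annotations": [], "categories": None}
    next_img_id = next_ann_id = 1
    for task_file in task_files:
        data = json.loads(Path(task_file).read_text())
        for key in REQUIRED_KEYS:                      # basic schema consistency check
            if key not in data:
                raise ValueError(f"{task_file} is missing required COCO key: {key}")
        if merged["categories"] is None:
            merged["categories"] = data["categories"]  # assume identical label sets across tasks
        id_map = {}
        for img in data["images"]:
            id_map[img["id"]] = next_img_id
            img["id"] = next_img_id
            merged["images"].append(img)
            next_img_id += 1
        for ann in data["annotations"]:
            ann["id"] = next_ann_id
            ann["image_id"] = id_map[ann["image_id"]]
            merged["annotations"].append(ann)
            next_ann_id += 1
    Path(output_path).write_text(json.dumps(merged, indent=2))
```

In the actual converter, the client-specific fields (annotator ID, skip reason, timestamp, and frame sequence) were attached to each record before the merged file was written out for the TensorFlow pipeline.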
Where feasible, we leveraged CVAT’s AI-assisted pre-annotation feature. Early batches of annotated data were used to train a lightweight object detector, which then auto-generated preliminary bounding boxes for new frames. This automated pre-annotation approach handled most object detection tasks, while human annotators performed targeted verification and corrections as needed, achieving high accuracy with considerably less manual work.
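The case study does not name the detector, so the sketch below assumes an Ultralytics YOLO model purely for illustration; the weights file and confidence threshold are hypothetical.

```python
# Illustrative pre-annotation pass; the detector choice, weights file, and
# confidence threshold are hypothetical, not taken from the project.
from ultralytics import YOLO

model = YOLO("waste_detector.pt")  # lightweight detector trained on early annotated batches

def preannotate(frame_paths, confidence=0.5):
    """Return preliminary bounding boxes for human verification."""
    proposals = []
    for path in frame_paths:
        result = model.predict(source=path, conf=confidence, verbose=False)[0]
        for box, cls_id, score in zip(result.boxes.xyxy.tolist(),
                                      result.boxes.cls.tolist(),
                                      result.boxes.conf.tolist()):
            proposals.append({
                "frame": path,
                "bbox_xyxy": box,                    # [x1, y1, x2, y2]
                "label": model.names[int(cls_id)],
                "score": score,                      # low-confidence boxes get extra scrutiny
            })
    return proposals
```

Boxes produced this way were treated strictly as suggestions; annotators corrected or discarded them during verification rather than accepting them blindly.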
Raw image
Annotated image
- Through two-tier video labeling QC and a standardized reference guide.
- Through automated frame filtering and AI-assisted pre-annotation.
- As recorded during initial training, compared to their previous training data.
Not every training dataset fits a standard labeling playbook. Your model may require nuanced judgment, conditional rules, evolving edge cases, or annotation logic that shifts as research advances. That’s the point where most vendors break. Our data annotation services don’t.
Our text annotation, image annotation, and video annotation services are designed to adapt to your specific rules, constraints, and quality thresholds. Where tools fall short, we extend or customize. Where automation helps, we use it — always with human verification built in.
If your data labeling needs aren’t “regular,” you’ll notice the difference working with us. See it for yourself - start with a free sample.