When Less Beats More: How Advex Composer's VLM Outshines CNNs


On the production line, non-quality costs are estimated at roughly 5% of turnover (revenue) for manufacturing firms, and even a single batch of scratched rims can trigger cascading losses. In automotive manufacturing, rim defects are more than cosmetic; they can force rework, halt production, or reach customers and come back as warranty claims. A small scratch can snowball into millions in losses. 


Traditional AI inspection systems require lots of labeled data and expert tuning. Teams have to collect hundreds of images and painstakingly annotate them, a process prone to annotation fatigue. Vision-Language Models (VLMs), on the other hand, are rewriting the playbook. With 10B+ parameters, far beyond traditional CNNs, they arrive pre-trained on millions of images. Instead of starting from zero, they adapt with just a few examples, an approach called few-shot learning.
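To make the idea concrete, here is a minimal sketch of few-shot adaptation in general, not Composer's actual pipeline: embed images with a pre-trained vision-language model (CLIP is used purely as an illustrative stand-in) and fit a tiny classifier on a handful of labeled examples. The file paths and labels are placeholders.

```python
# Minimal sketch of few-shot adaptation (illustrative, not Composer's
# actual pipeline): embed images with a pre-trained vision-language
# model and fit a tiny classifier on a handful of labeled examples.
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    """Return one pre-trained embedding per image path."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs).numpy()

# Nine labeled rim photos: 1 = scratched, 0 = clean (placeholder paths).
train_paths = [f"rims/train_{i}.jpg" for i in range(9)]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]

clf = LogisticRegression(max_iter=1000).fit(embed(train_paths), train_labels)
print(clf.predict(embed(["rims/new_rim.jpg"])))  # [1] = scratch detected
```

Because the heavy lifting happens in pre-training, only the small classifier head is fit on the new examples, which is why a handful of images can suffice.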


If these models are truly this capable, why not put them to the test? We set up a fair head-to-head test: Advex Composer (a VLM-based, no-code platform) vs. ResNet-34 (a traditional CNN trained on Roboflow, a platform for vision training and annotation). We presented both AI systems with the same challenge: detect rim scratches with minimal training data. The CNN needed 98 images (68 training, 20 validation, 10 test) to reach 80% accuracy. Advex Composer achieved 94% with just 9 images.

Methodology: The Rules of the Experiment


We framed the task as a strictly controlled comparison. Both systems saw the same images and were judged on the same 16-image held-out test set to ensure a fair evaluation. We began with a minimal training set. Composer's approach is few-shot supervised fine-tuning of a custom pre-trained VLM. Roboflow's approach is to fine-tune a pre-trained CNN such as ResNet-34, which historically needs many more images.

  • Problem Statement: We needed to detect hairline scratches on aluminum wheel rims. Missing even minor scratches can lead to rework or recalls, so accuracy is paramount.

  • Tools Compared: 

    • Advex Composer is built on a vision-language foundation model. You upload images, label the defect (scratch), and hit “train.” All model decisions (architecture, loss, hyperparameters) are automated by Composer’s backend. 

    • Roboflow provides a custom training pipeline. We uploaded the same images to Roboflow, annotated them, then went through Roboflow’s interface: choosing an object-detection model, setting thresholds, choosing augmentations, etc.

  • Experiment Setup: We used a 16-image test set held out across all runs. We initially attempted 9 training images for both platforms, but the CNN was unable to initialize training with so few samples. We expanded to 25 total images (split into train/val/test per platform requirements), then to 98 images (68 train, 20 validation, 10 test) to achieve usable results; a minimal sketch of such a reproducible split follows just below.
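A split like the 98-image one above can be pinned with a fixed random seed so it is reproducible. This is an illustrative sketch with placeholder file names, not the exact tooling we used:

```python
# Illustrative, reproducible 68/20/10 split of 98 images using a fixed
# seed; file names are placeholders, not our actual dataset.
from sklearn.model_selection import train_test_split

paths = [f"rims/img_{i:03d}.jpg" for i in range(98)]
train, rest = train_test_split(paths, train_size=68, random_state=42)
val, test = train_test_split(rest, train_size=20, random_state=42)
print(len(train), len(val), len(test))  # 68 20 10
```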


By keeping conditions equal, we isolated data efficiency and workflow complexity. In short, Composer worked from 9 images with no user "tweaks," while Roboflow's CNN relied on human choices and more images. The results speak for themselves.

The Results: A New Baseline for Speed and Efficiency

Experiment 1: The Minimal-Data Reality


With very limited data, the VLM approach immediately excelled. Advex Composer, trained on just 9 images, achieved 94% accuracy on the 16-image test set after only 5 minutes (upload + labeling + training). The Roboflow CNN, given 25 images, failed to converge to a usable model even after ~30 minutes. On the performance side, the CNN achieved only 2.1% mAP, with massive false positives: it detected scratches everywhere while missing actual defects.
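For context, here is how a held-out evaluation of this kind is scored; the labels below are illustrative placeholders, not our actual 16-image test set:

```python
# How a held-out classification evaluation is scored; these labels are
# illustrative placeholders, not our actual test data.
y_true = [1] * 8 + [0] * 8   # ground truth: 1 = scratched, 0 = clean
y_pred = [1, 1, 1, 1, 1, 1, 1, 0,   # predictions on the 8 scratched rims
          0, 0, 0, 0, 0, 0, 1, 0]   # predictions on the 8 clean rims

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
false_positives = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
print(f"accuracy={accuracy:.0%}, false positives={false_positives}")
```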

With so few images, CNN models overfit, making them ill-suited for a production-level system.

Figure: Classification results from the Advex Composer VLM and the Roboflow ResNet-34 CNN

Experiment 2: Giving the CNN More Data


We next boosted the CNN's dataset. With 98 total images (68 for training), the Roboflow model finally reached ~80% accuracy. In short, the CNN needed roughly 7.5× more training data (68 vs. 9 images) to approach the VLM's performance.


Figure: Performance comparison. The VLM model reaches 94% accuracy from just 9 examples, whereas the CNN requires 98 images to reach 80% accuracy.



Technical Complexity Gap

Beyond the numbers, the user experience differed greatly. Training the CNN required many manual choices: selecting an architecture, setting confidence thresholds, choosing data augmentations, and so on. Such hyperparameters must be carefully tuned by experts to get good performance. Composer, by contrast, asked for no such decisions: you simply upload images and annotate them. Composer automated all the difficult configuration steps that normally slow down a conventional ML pipeline (an illustrative sketch of these choices follows the table). Here's a side-by-side comparison:

| Metric | Roboflow (CNN) | Advex Composer (VLM) | Winner |
| --- | --- | --- | --- |
| Minimum images for target accuracy | 68 training (+30 val/test) for ~80% | 9 for 94% | Composer (≈10× fewer) |
| Total setup time (including training) | ~90 minutes | ~5 minutes | Composer (≈18× faster) |
| Model/hyperparameter choices required | Many manual choices | None (automated) | Composer |
| Best accuracy | ~80% | 94% | Composer |
| Suitable for non-experts? | No (complex setup) | Yes (plug-and-play) | Composer |

Table 1: Comparison of the two models across dataset size, training time, decision-making, and accuracy.
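To make the complexity gap concrete, here is a hypothetical sketch of the decision surface each workflow exposes. Neither dict reflects either platform's real API; they only contrast the kinds of choices involved:

```python
# Hypothetical sketch of the decision surface each workflow exposes.
# Neither dict is a real platform API; they only contrast the kinds of
# choices a conventional CNN pipeline pushes onto the user.
cnn_config = {
    "architecture": "resnet34",       # which backbone?
    "learning_rate": 1e-4,            # too high diverges, too low stalls
    "epochs": 100,                    # when to stop?
    "batch_size": 8,                  # limited by GPU memory
    "confidence_threshold": 0.5,      # trades false positives vs. misses
    "augmentations": ["flip", "rotate", "brightness"],  # which? how strong?
}

composer_inputs = {
    "images": "rims/*.jpg",           # upload
    "label": "scratch",               # annotate
}                                     # ...then hit "train"
```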

The Analysis: What the Results Mean

Production environments rarely look like training datasets. How well a model holds up on such images, known as model adaptability, is therefore critical, so we tested both models on unseen edge cases: overexposed and underexposed images, rotated rims, and motion blur.
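Edge-case variants like these can be generated programmatically. Below is a minimal sketch with Pillow; the input path is a placeholder, and Gaussian blur stands in for true motion blur:

```python
# Sketch of generating edge-case test variants with Pillow; the input
# path is a placeholder, and Gaussian blur stands in for motion blur.
from PIL import Image, ImageEnhance, ImageFilter

rim = Image.open("rims/test_rim.jpg").convert("RGB")
variants = {
    "overexposed": ImageEnhance.Brightness(rim).enhance(1.8),
    "underexposed": ImageEnhance.Brightness(rim).enhance(0.4),
    "rotated": rim.rotate(30, expand=True),
    "blurred": rim.filter(ImageFilter.GaussianBlur(radius=3)),
}
for name, img in variants.items():
    img.save(f"rims/edge_{name}.jpg")
```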


Figure: Even with just 9 training images, the VLM model correctly finds defects under varying lighting and angles, demonstrating strong generalization.


The differences were stark. Advex Composer maintained high accuracy across all conditions, consistently identifying defects and clean rims correctly. By contrast, Roboflow’s CNN struggled once images deviated from the training distribution. Under changes in brightness, its predictions hovered near random. At different angles, accuracy dropped below 50%. In blurred cases (common in high-speed inspection lines), it confidently misclassified clean rims as defective (false positives at 70–79%).


This illustrates the core advantage of VLMs: they generalize from prior knowledge rather than memorizing a narrow feature space. While CNNs need carefully curated datasets covering every possible variation, Composer can handle real-world variability with only a handful of labeled examples. In practical terms, that means fewer costly false alarms and fewer missed defects slipping through the line.

The Pre-Training Advantage


Advex Composer's model was pre-trained on web-scale image-text data, giving it a broad understanding of visual concepts. In practice, this means Composer already knows what a "scratch" looks like in many contexts; it only needs a few rim-specific examples to align that concept. But wait, aren't CNNs also pre-trained? They are, but their limited capacity makes it difficult for them to generalize from a small number of new images. VLMs, on the other hand, have an advantage thanks to their built-in language understanding. Since they're trained on billions of image-text pairs, they can connect visual and linguistic concepts, making it much easier to simply annotate and describe what the model sees.
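A toy example of that language grounding: a pre-trained vision-language model can score an image against plain-English descriptions with no task-specific training at all. This uses CLIP as an illustrative stand-in, not Composer's actual model:

```python
# Toy illustration of language grounding: score one image against plain-
# English descriptions with no task-specific training. CLIP is used as
# an illustrative stand-in, not Composer's actual model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a scratched aluminum wheel rim", "a clean aluminum wheel rim"]
inputs = processor(text=texts, images=Image.open("rims/test_rim.jpg"),
                   return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))  # higher score = better match
```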

The Overfitting Disaster


With only 25 total images, the CNN nearly collapsed. Its training-loss curve was erratic, and it effectively memorized the tiny dataset, yielding poor test performance. This is typical for deep networks on tiny datasets: they quickly overfit and fail to generalize. Only after we supplied 98 images did the CNN begin to form a reliable detector. VLMs avoid this pitfall because their massive pre-training serves as a strong prior, requiring far fewer new examples to calibrate.
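The telltale signature is a training loss that keeps falling while validation loss climbs. The numbers below are illustrative, not measurements from this experiment:

```python
# The overfitting signature: training loss keeps falling while validation
# loss climbs. These numbers are illustrative, not measurements from our
# experiment.
train_loss = [2.10, 1.20, 0.60, 0.20, 0.05, 0.01]
val_loss   = [2.20, 1.80, 1.70, 1.90, 2.40, 3.10]

best = min(range(len(val_loss)), key=val_loss.__getitem__)  # epoch 2
for epoch, (t, v) in enumerate(zip(train_loss, val_loss)):
    flag = "  <- overfitting" if epoch > best else ""
    print(f"epoch {epoch}: train={t:.2f}  val={v:.2f}{flag}")
```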

Adaptability and Generalization


We also tested both models under real-world variations: different lighting, rotated rims, cluttered backgrounds, and partial occlusions. Composer’s model (trained on 9 images) correctly identified scratches in these edge cases. The CNN’s accuracy, however, dropped below 50% on these unseen conditions.

In a factory, parts move and lighting changes constantly, so this robustness is crucial. Composer's few-shot model generalized naturally from its concept-level knowledge, while the CNN would need extensive new data to cover each variation.

Conclusion


Our head-to-head test confirms that VLM-based vision is a fundamental leap for industrial AI. Advex Composer was not just a little better; it was an order of magnitude more data-efficient and faster, delivering equal or better accuracy with roughly 10× less data (9 vs. 98 images) and 18× less setup time (~5 vs. ~90 minutes) than the traditional CNN.


This translates directly to business value: the months-long bottleneck of data collection and parameter tuning is eliminated. You no longer need a specialized AI team to build a production-grade vision system. With Composer, a few labeled images and minutes of effort yield a reliable defect detector.


Ready to see for yourself? Book a Demo with our Team or Try Composer on Your Own Images Today.


Note: Results based on testing with the Rim Scratches dataset. Your mileage may vary, but it only takes 5 minutes to find out.

Automated Visual Inspection in Minutes