Edge Vision for Predicting Conveyor Belt Tear Using Tiny YOLOv8 and IMU Correlation

A couple weekends ago I got pulled into a frustrating smart-manufacturing problem: a conveyor belt would start “micro-tearing” along the same seam line, and by the time operators noticed, it was usually already expensive. The sensors on the line were telling something was wrong, but not what.

So I built a small edge pipeline that watches a single belt region with an on-device camera model and cross-checks it with an IMU (accelerometer/gyroscope) mounted near the motor. The goal was very specific: detect early tear artifacts in the image and correlate them with vibration patterns that happen when the belt starts slipping or deforming.

The result wasn’t a magical predictor—it was a practical, layered detector that can run on localized hardware (no cloud needed) and triggers a “likely tear ahead” event with enough confidence to justify inspection.

What I built (and why it worked)

I used:

Tiny YOLOv8: A lightweight object detection model. “Object detection” means the model outputs bounding boxes and class labels for features it learned during training.
IMU vibration features: From accelerometer/gyro streams, I computed simple metrics like RMS vibration and dominant frequency energy.
Correlation logic: I combined “tear probability from vision” with “vibration signature from IMU” to reduce false positives (dust, lighting changes, random marks).

The niche part: I focused on a very narrow seam region of the belt and trained the model to detect tear edge micro-features (thin bright/dark fringes) rather than generic “damage.” That made the detector sensitive to the real failure mode.

Hardware assumptions

This example is written to be portable, but I designed the flow around a typical edge setup:

A camera pointed at the belt seam region
An IMU on the conveyor motor frame streaming at a few hundred Hz
An edge device running Python (Raspberry Pi class, Jetson class, or industrial PC)

For the code below, I simulated camera frames and IMU data so the pipeline is runnable anywhere.

Step 1: Define the event scoring rules

My scoring strategy was intentionally simple:

Vision model outputs a tear_prob between 0 and 1.
IMU features produce a vibration_score between 0 and 1.
I compute:
- final_score = 0.65 * tear_prob + 0.35 * vibration_score
I require final_score > threshold and persistence over a short window to avoid single-frame glitches.

Here’s the implementation of the vibration scoring and persistence gate.

import numpy as np
from collections import deque

def rms(x: np.ndarray) -> float:
    return float(np.sqrt(np.mean(np.square(x))))

def dominant_energy(x: np.ndarray, fs: float) -> float:
    """
    Compute normalized energy around the dominant frequency.
    This is a simple proxy for "the vibration has a strong tone".
    """
    x = x - np.mean(x)
    n = len(x)
    if n < 8:
        return 0.0

    # Real FFT
    freqs = np.fft.rfftfreq(n, d=1.0/fs)
    spec = np.abs(np.fft.rfft(x))**2

    idx = int(np.argmax(spec))
    if idx == 0:
        return 0.0

    # Normalize by total energy
    total = float(np.sum(spec))
    if total <= 1e-12:
        return 0.0
    return float(spec[idx] / total)

class TearPredictor:
    def __init__(self, threshold=0.72, window_seconds=2.0, imu_fs=200.0):
        self.threshold = threshold
        self.window_len = int(window_seconds * (imu_fs / 1.0))  # used for IMU chunks count
        self.events = deque(maxlen=60)  # store last N fused scores (simple persistence)

    def vibration_score(self, ax, ay, az, gx, gy, gz, fs):
        """
        Produce a score in [0,1] from IMU signals using:
        - RMS acceleration magnitude
        - Dominant frequency energy from accel magnitude
        """
        a_mag = np.sqrt(ax**2 + ay**2 + az**2)
        g_mag = np.sqrt(gx**2 + gy**2 + gz**2)

        a_rms = rms(a_mag)
        g_rms = rms(g_mag)

        # Normalize with heuristic scaling for demo purposes
        # In a real line you calibrate these ranges from healthy data.
        a_rms_norm = np.clip(a_rms / 2.5, 0, 1)
        g_rms_norm = np.clip(g_rms / 25.0, 0, 1)

        dom = dominant_energy(a_mag, fs)  # also in [0,1] due to normalization
        # Weighted blend: strong tone + stronger vibration = higher score
        score = 0.55 * dom + 0.35 * a_rms_norm + 0.10 * g_rms_norm
        return float(np.clip(score, 0, 1))

    def update(self, tear_prob, vib_score):
        """
        Fuse vision + vibration, then gate with persistence:
        trigger only if enough recent frames exceed threshold.
        """
        final_score = 0.65 * tear_prob + 0.35 * vib_score
        self.events.append(final_score)

        if len(self.events) < 10:
            return False, final_score

        recent = list(self.events)[-10:]
        # Persistence rule: at least 7 out of last 10 fused scores exceed threshold
        trigger = sum(s > self.threshold for s in recent) >= 7
        return trigger, final_score

Why these choices?

RMS alone detects “more motion,” but conveyors always vibrate.
Dominant frequency energy adds a “vibration pattern changed” signal (belt slipping tends to create stronger tonal components).
Persistence prevents a single good/bad frame from triggering.

Step 2: Build a tiny end-to-end demo pipeline (simulated)

To make this blog post runnable, I simulate:

Camera frames that sometimes contain a “tear” region
IMU streams that correlate with that event

In production you’d replace the simulation with:

a camera grab loop
an IMU reader (serial, CAN, Ethernet, GPIO-based IMU module, etc.)
a real YOLOv8 model inference

Here’s the demo runner.

import numpy as np
import time

def simulate_imu(fs=200.0, n=400, fault=False, seed=0):
    """
    Simulate IMU windows.
    When fault=True, add stronger tonal vibration and higher RMS.
    """
    rng = np.random.default_rng(seed)

    t = np.arange(n) / fs
    # base vibration: noise + mild tone
    base_tone = 12.0  # Hz
    tone_amp = 0.35 if not fault else 0.85

    noise = 0.08 * rng.standard_normal((6, n))

    ax = tone_amp * np.sin(2*np.pi*base_tone*t) + noise[0]
    ay = 0.6 * tone_amp * np.sin(2*np.pi*base_tone*t + 0.4) + noise[1]
    az = (0.9 * tone_amp * np.sin(2*np.pi*base_tone*t + 1.2) + noise[2]) + 1.0

    # gyro: correlated but weaker
    gx = 2.0 * (0.3 if not fault else 0.9) * np.sin(2*np.pi*base_tone*t + 0.2) + noise[3]
    gy = 2.0 * (0.2 if not fault else 0.7) * np.sin(2*np.pi*base_tone*t + 1.1) + noise[4]
    gz = 2.0 * (0.25 if not fault else 0.8) * np.sin(2*np.pi*base_tone*t + 2.0) + noise[5]

    return ax, ay, az, gx, gy, gz

def simulate_tear_prob(step, fault_start_step=25, fault_end_step=45):
    """
    Simulate vision probabilities:
    before fault: low probabilities
    during fault: higher probabilities with some jitter
    """
    rng = np.random.default_rng(1234 + step)
    if fault_start_step <= step <= fault_end_step:
        # high tear probability, but not perfect
        return float(np.clip(0.55 + 0.35 * rng.random(), 0, 1))
    else:
        # mostly low
        return float(np.clip(0.05 + 0.25 * rng.random(), 0, 1))

def run_demo():
    fs = 200.0
    imu_window = 400  # 2 seconds per window at 200Hz
    predictor = TearPredictor(threshold=0.72, window_seconds=2.0, imu_fs=fs)

    # Simulate 70 steps (each step corresponds to one vision+imu window)
    for step in range(70):
        fault_now = (25 <= step <= 45)

        tear_prob = simulate_tear_prob(step)

        ax, ay, az, gx, gy, gz = simulate_imu(fs=fs, n=imu_window, fault=fault_now, seed=step)
        vib_score = predictor.vibration_score(ax, ay, az, gx, gy, gz, fs=fs)

        trigger, final_score = predictor.update(tear_prob, vib_score)

        print(
            f"step={step:02d} fault={fault_now} "
            f"tear_prob={tear_prob:.2f} vib_score={vib_score:.2f} "
            f"final={final_score:.2f} TRIGGER={trigger}"
        )

        # Make it fast but human-readable
        time.sleep(0.02)

if __name__ == "__main__":
    run_demo()

What you should see

Before step ~25, tear_prob stays low and vib_score stays low → final rarely exceeds the threshold.
During step ~25–45, both vision and vibration spike → TRIGGER=True becomes frequent due to the persistence gate.
After step ~45, values drop and triggers stop.

This confirms the behavioral logic even without the real model.

Step 3: Swap in real Tiny YOLOv8 inference (real code scaffold)

Below is the structure I used when I swapped simulation for a real camera + YOLO model. This assumes you have:

ultralytics installed
a trained YOLO model file for your seam micro-tear class

from ultralytics import YOLO
import cv2
import numpy as np

class VisionTearDetector:
    def __init__(self, model_path, conf=0.25, target_class_id=0):
        """
        model_path: path to trained YOLOv8 model (e.g., runs/train/.../weights/best.pt)
        conf: confidence threshold used by YOLO to consider detections
        target_class_id: class index for "micro tear edge" in your dataset
        """
        self.model = YOLO(model_path)
        self.conf = conf
        self.target_class_id = target_class_id

    def infer_prob(self, frame_bgr, roi):
        """
        frame_bgr: full camera frame (H,W,3) in BGR
        roi: (x1, y1, x2, y2) defining the narrow seam region we care about
        """
        x1, y1, x2, y2 = roi
        roi_img = frame_bgr[y1:y2, x1:x2]

        # YOLO expects RGB sometimes; ultralytics handles many formats, but explicit convert is safe
        roi_rgb = cv2.cvtColor(roi_img, cv2.COLOR_BGR2RGB)

        results = self.model.predict(
            source=roi_rgb,
            conf=self.conf,
            verbose=False
        )

        # YOLO returns a list of Results; we use the first one
        r = results[0]

        # Default low probability when nothing matches
        tear_prob = 0.02

        if r.boxes is not None and len(r.boxes) > 0:
            # boxes.cls: detected class ids
            # boxes.conf: confidence per detection (0..1)
            cls = r.boxes.cls.cpu().numpy().astype(int)
            confs = r.boxes.conf.cpu().numpy()

            # take max confidence for the target class
            mask = (cls == self.target_class_id)
            if np.any(mask):
                tear_prob = float(np.max(confs[mask]))

        return tear_prob

Why ROI matters so much

I trained and inferred on a small ROI around the seam because:

tiny tear artifacts occupy only a few pixels
background (rollers, shadows, labels) otherwise swamps the model
inference becomes faster (less pixels to process)

Step 4: Combine Vision + IMU in one loop (production pattern)

Here’s the “real” loop structure. It’s written so the vision and IMU pieces are pluggable.

import time
import numpy as np

# Assume TearPredictor from earlier is imported
# Assume VisionTearDetector from earlier is imported

def imu_read_window():
    """
    Placeholder for real IMU read.
    Should return (ax, ay, az, gx, gy, gz) arrays of equal length.
    """
    fs = 200.0
    n = 400
    ax = np.random.normal(0, 0.2, n)
    ay = np.random.normal(0, 0.2, n)
    az = np.random.normal(0, 0.2, n) + 1.0
    gx = np.random.normal(0, 1.0, n)
    gy = np.random.normal(0, 1.0, n)
    gz = np.random.normal(0, 1.0, n)
    return fs, ax, ay, az, gx, gy, gz

def camera_read():
    """
    Placeholder for real camera frame acquisition.
    Should return a BGR frame (H,W,3).
    """
    # Fake frame
    frame = np.zeros((480, 640, 3), dtype=np.uint8)
    return frame

def main():
    fs, *_ = imu_read_window()
    predictor = TearPredictor(threshold=0.72, window_seconds=2.0, imu_fs=fs)

    # You must replace with your own model path
    vision = VisionTearDetector(
        model_path="path/to/your_tiny_yolov8_seam_tear_model.pt",
        conf=0.25,
        target_class_id=0
    )

    # Example ROI: narrow seam strip (tune to your camera)
    roi = (250, 210, 430, 330)  # x1, y1, x2, y2

    while True:
        frame = camera_read()

        # Vision probability from ROI
        tear_prob = vision.infer_prob(frame, roi)

        # IMU window features
        fs, ax, ay, az, gx, gy, gz = imu_read_window()
        vib_score = predictor.vibration_score(ax, ay, az, gx, gy, gz, fs=fs)

        trigger, final_score = predictor.update(tear_prob, vib_score)

        if trigger:
            # In production this could publish MQTT, log to historian, trigger PLC relay, etc.
            print(f"[ALERT] likely conveyor tear! final_score={final_score:.2f} tear_prob={tear_prob:.2f} vib={vib_score:.2f}")

        # Control loop timing: vision and IMU windows often drive the cadence
        time.sleep(0.05)

if __name__ == "__main__":
    main()

Step 5: Practical training notes from my seam-focused dataset

This is where I learned the most:

Labeling tiny micro-tear artifacts is hard—small errors in bounding boxes change what the detector learns.
I tightened the labeling scope to a seam ROI and trained only one class: micro_tear_edge.
For data, I deliberately included:
- dust/scratches that look similar but don’t progress into tears
- lighting variations (industrial lights flicker sometimes)
- healthy belts with “scuff” marks

That’s how the correlation got its job done: vision gives a “this looks like the tear pattern” probability, IMU confirms “the belt dynamics changed.”

Conclusion

I built an edge pipeline for smart manufacturing that predicts early conveyor belt tear by fusing two signals: a Tiny YOLOv8 detector focused on a seam-region micro-tear class, and IMU vibration features that reflect belt slip/deformation patterns. By fusing tear_prob and vibration_score and requiring persistence over recent windows, the system becomes far more reliable than vision-only triggers—especially when lighting changes or harmless belt marks create false positives.