OMNOAI / ADLYTICAI

Edge AI Deployment

Senior CV Engineer
Dec 2021 – Feb 2023

Architected and deployed hardware-optimized AI pipelines across 200+ NVIDIA Jetson devices, achieving 40+ FPS real-time inference for 15+ commercial products.

40+ FPS · 200+ Devices Deployed · 15+ Products Shipped · 8x Speedup

The Challenge

The company had powerful AI models that performed well in the cloud but were too slow and expensive to deploy at scale. Customers needed real-time inference at the edge, in retail stores, warehouses, and public spaces, without depending on cloud connectivity. The challenge was making sophisticated AI run on constrained edge hardware while maintaining accuracy.

Key Challenges

  • Models optimized for cloud GPUs were 10x too slow for edge devices
  • Memory constraints on Jetson devices (4-8 GB) vs. training machines (32 GB+)
  • Need for consistent performance across varying environmental conditions
  • Deploying and managing updates across 200+ distributed devices
  • Power consumption limits in some deployment scenarios

The Solution

I developed a comprehensive edge optimization pipeline that transforms cloud-trained models into edge-ready deployments. The pipeline includes automated quantization, architecture optimization, and TensorRT compilation. I also built a device management platform for OTA updates and monitoring.

1. Built automated INT8 quantization pipeline with calibration dataset curation (a calibrator sketch follows this list)
2. Implemented model architecture optimization including layer fusion and channel pruning
3. Developed TensorRT-native inference engine with custom plugins for unsupported operations
4. Created dynamic batching system that balances latency and throughput based on load
5. Built Docker-based deployment system with OTA update capabilities
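As a concrete illustration of step 1, here is a minimal sketch of an INT8 entropy calibrator, assuming TensorRT's Python API, PyCUDA, and a curated calibration set supplied as preprocessed NCHW float32 batches; the class name and cache path are illustrative, not taken from the original pipeline.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (initializes a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

CACHE_PATH = "calibration.cache"  # placeholder path

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches):
        super().__init__()
        self.batches = iter(batches)   # curated NCHW float32 arrays, batch size 1
        self.device_input = None

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        batch = next(self.batches, None)
        if batch is None:
            return None                # no more data: calibration is done
        if self.device_input is None:
            self.device_input = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(CACHE_PATH, "rb") as f:
                return f.read()        # reuse scale factors across builds
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(CACHE_PATH, "wb") as f:
            f.write(cache)
```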

System Architecture

The architecture separates model optimization (done once in cloud) from deployment (repeated across devices), enabling rapid iteration while maintaining consistency.

Model Optimization Pipeline

Cloud-based pipeline that takes a trained model and produces optimized artifacts. Includes ONNX export, quantization, TensorRT compilation, and validation.
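A condensed sketch of those cloud-side steps, assuming a PyTorch source model and the TensorRT 8-era Python builder API; paths, input shape, and workspace size are placeholders, and the calibrator is the one sketched earlier.

```python
import torch
import tensorrt as trt

def export_onnx(model, onnx_path, input_shape=(1, 3, 640, 640)):
    model.eval()
    dummy = torch.randn(*input_shape)
    torch.onnx.export(model, dummy, onnx_path, opset_version=13)

def build_int8_engine(onnx_path, calibrator):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = calibrator        # e.g. the EntropyCalibrator above
    config.max_workspace_size = 2 << 30        # newer TRT: set_memory_pool_limit
    return builder.build_serialized_network(network, config)  # plan bytes
```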

Edge Inference Engine

C++ inference runtime built on TensorRT with custom memory management. Handles multi-model orchestration and resource sharing.
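The production engine is C++; the sketch below shows the equivalent deserialize-and-execute flow in Python (TensorRT plus PyCUDA, with a single input and output binding assumed) just to make the control flow concrete.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401
import pycuda.driver as cuda
import tensorrt as trt

def load_engine(plan_path):
    logger = trt.Logger(trt.Logger.WARNING)
    with open(plan_path, "rb") as f, trt.Runtime(logger) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

def infer(engine, input_array):
    context = engine.create_execution_context()
    output = np.empty(trt.volume(engine.get_binding_shape(1)), dtype=np.float32)
    d_in = cuda.mem_alloc(input_array.nbytes)
    d_out = cuda.mem_alloc(output.nbytes)
    stream = cuda.Stream()
    cuda.memcpy_htod_async(d_in, np.ascontiguousarray(input_array), stream)
    context.execute_async_v2([int(d_in), int(d_out)], stream.handle)  # enqueue
    cuda.memcpy_dtoh_async(output, d_out, stream)
    stream.synchronize()               # wait for copy-compute-copy to finish
    return output
```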

Preprocessing Accelerator

CUDA-accelerated preprocessing (resize, normalize, color conversion) eliminating CPU bottlenecks. Uses unified memory for zero-copy transfers.
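The real preprocessor uses hand-written CUDA kernels; this CuPy approximation shows the same three steps (BGR to RGB, resize, normalize) staying on the GPU so frames never round-trip through the CPU. The normalization constants are the common ImageNet values, used here as placeholders.

```python
import cupy as cp
from cupyx.scipy import ndimage

MEAN = cp.asarray([0.485, 0.456, 0.406], dtype=cp.float32)  # placeholder stats
STD = cp.asarray([0.229, 0.224, 0.225], dtype=cp.float32)

def preprocess(frame_bgr_u8, out_hw=(640, 640)):
    img = cp.asarray(frame_bgr_u8)                      # one host->device copy
    img = img[:, :, ::-1].astype(cp.float32) / 255.0    # BGR -> RGB, scale [0,1]
    zoom = (out_hw[0] / img.shape[0], out_hw[1] / img.shape[1], 1)
    img = ndimage.zoom(img, zoom, order=1)              # bilinear resize on GPU
    img = (img - MEAN) / STD                            # per-channel normalize
    return cp.ascontiguousarray(img.transpose(2, 0, 1))[None]  # NCHW, batch 1
```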

Device Management Platform

Central platform for device health monitoring, model deployment, and configuration management. Supports staged rollouts and automatic rollback.
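A hypothetical sketch of the staged-rollout logic: release to a small canary slice, watch fleet telemetry, then widen or roll back. The fleet and metrics interfaces, stage fractions, and error budget are all illustrative, not the platform's actual API.

```python
import time

STAGES = [0.05, 0.25, 1.0]   # canary, partial, full fleet (illustrative)
ERROR_BUDGET = 0.02          # max tolerated failure rate per stage

def staged_rollout(fleet, release, metrics, soak_seconds=3600):
    for fraction in STAGES:
        targets = fleet.select(fraction)      # deterministic slice of devices
        targets.deploy(release)
        time.sleep(soak_seconds)              # let telemetry accumulate
        if metrics.failure_rate(targets) > ERROR_BUDGET:
            targets.rollback()                # automatic rollback on regression
            return False
    return True                               # release is live fleet-wide
```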

Telemetry System

Lightweight telemetry collecting inference metrics, hardware stats, and anomaly flags. Enables proactive maintenance and optimization.
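A minimal sketch of what one per-device telemetry sample might look like, assuming the standard Linux sysfs thermal zone exposed on Jetson; the transport (e.g. MQTT to AWS IoT) is left abstract.

```python
import time

THERMAL_ZONE = "/sys/class/thermal/thermal_zone0/temp"  # millidegrees Celsius

def read_temp_c():
    with open(THERMAL_ZONE) as f:
        return int(f.read()) / 1000.0

def sample(inference_stats):
    # One lightweight telemetry record; publish over MQTT or similar.
    return {
        "ts": time.time(),
        "temp_c": read_temp_c(),
        "fps": inference_stats.get("fps"),
        "p99_latency_ms": inference_stats.get("p99_ms"),
    }
```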

Key Implementation Details

INT8 Quantization Without Accuracy Loss

Naive INT8 quantization often degrades accuracy. I developed a calibration dataset curation process that selects representative samples covering the full data distribution. Combined with per-tensor quantization for sensitive layers, we maintained 99%+ of original accuracy while achieving 3-4x speedup.
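One plausible way to implement that curation step, assuming per-sample feature embeddings are available: cluster the candidate pool and keep the sample nearest each centroid, so the calibration set spans the full data distribution. The scikit-learn usage is illustrative, not necessarily the original tooling.

```python
import numpy as np
from sklearn.cluster import KMeans

def curate_calibration_set(embeddings, n_samples=512):
    # Cluster the candidate pool; the sample nearest each centroid becomes
    # one calibration example, so the set covers the whole distribution.
    km = KMeans(n_clusters=n_samples).fit(embeddings)
    chosen = {
        int(np.argmin(np.linalg.norm(embeddings - c, axis=1)))
        for c in km.cluster_centers_
    }
    return sorted(chosen)   # indices into the candidate pool
```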

Custom TensorRT Plugins

Some model operations weren't natively supported by TensorRT. I wrote custom CUDA plugins for operations like deformable convolutions and custom attention mechanisms, enabling end-to-end TensorRT execution without CPU fallbacks.
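The plugins themselves are CUDA/C++; on the Python side of the toolchain, a compiled plugin library only needs to be loaded before engine build or deserialization so TensorRT can resolve the custom ops. The library name below is a placeholder.

```python
import ctypes
import tensorrt as trt

ctypes.CDLL("libcustom_plugins.so")        # placeholder: registers plugin creators
logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")    # also register TensorRT's built-ins
```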

Memory-Efficient Multi-Model Serving

Many products require multiple models (detection + classification + tracking). I implemented a shared memory pool across models and dynamic model loading/unloading based on usage patterns. This enabled running 3-4 models on devices that couldn't fit them all simultaneously.
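A sketch of the usage-based loading policy, assuming each engine's device-memory footprint is known up front: least-recently-used engines are evicted whenever a new load would exceed the shared pool budget. The load/size callables are illustrative.

```python
from collections import OrderedDict

class ModelPool:
    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.loaded = OrderedDict()           # name -> (engine, size_bytes)
        self.used = 0

    def get(self, name, load_fn, size_fn):
        if name in self.loaded:
            self.loaded.move_to_end(name)     # mark as most recently used
            return self.loaded[name][0]
        size = size_fn(name)
        while self.loaded and self.used + size > self.budget:
            _, (victim, vsize) = self.loaded.popitem(last=False)  # evict LRU
            del victim                        # drop the engine, freeing memory
            self.used -= vsize
        engine = load_fn(name)
        self.loaded[name] = (engine, size)
        self.used += size
        return engine
```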

Thermal-Aware Inference

Sustained high load causes thermal throttling on edge devices. I built a thermal management system that monitors device temperature and adjusts inference frequency and batch size to maintain consistent performance while preventing overheating.
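A sketch of that feedback loop under assumed setpoints: back off batch size and inject a small inter-frame delay as the SoC nears its throttle point, then ramp back up once it cools.

```python
import time

def read_temp_c(zone="/sys/class/thermal/thermal_zone0/temp"):
    with open(zone) as f:
        return int(f.read()) / 1000.0         # sysfs reports millidegrees

def regulate(state, soft_limit=70.0, hard_limit=85.0):
    t = read_temp_c()
    if t > hard_limit:
        state["batch"], state["sleep_ms"] = 1, 50      # shed load aggressively
    elif t > soft_limit:
        state["batch"] = max(1, state["batch"] // 2)   # gentle backoff
        state["sleep_ms"] = 10
    else:
        state["batch"] = min(state["batch"] * 2, state["max_batch"])  # recover
        state["sleep_ms"] = 0
    time.sleep(state["sleep_ms"] / 1000.0)
    return state
```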

Tech Stack

Optimization

TensorRT · CUDA · cuDNN · ONNX · INT8 Quantization

Edge Devices

NVIDIA Jetson (Nano, TX2, Xavier, Orin) · DeepStream

Development

C++ · Python · CMake · Docker

Infrastructure

Balena · AWS IoT · Prometheus · Grafana

Key Learnings

  • Edge optimization should be considered from model design onward, not as an afterthought
  • Quantization-aware training produces much better results than post-training quantization
  • Device management at scale requires treating deployments like software releases, with staged rollouts and rollback capability
  • Thermal and power management are as important as raw performance for edge devices

Interested in Similar Solutions?

Let's discuss how I can help bring your AI ideas to production.

Get in Touch