Part of AI, Machine Learning and Computer Vision Meetup Network - 52 groups

Mumbai AI, Machine Learning and Computer Vision Meetup

4.8•26 ratings

About us

🖖 This group is for data scientists, machine learning engineers, and open source enthusiasts.

Every month we’ll bring you diverse speakers working at the cutting edge of AI, machine learning, and computer vision.

Are you interested in speaking at a future Meetup?
Is your company interested in sponsoring a Meetup?

Send me a DM on LinkedIn

This Meetup is sponsored by Voxel51, the lead maintainers of the open source FiftyOne computer vision toolset. To learn more, visit the FiftyOne project page on GitHub.

Upcoming events


Network event
July 29 - MCP, Agents and Skills Meetup
Wed, Jul 29 · 9:30 PM IST
Online
610 attendees from 48 groups
Join our virtual meetup to hear talks from experts on cutting-edge topics across MCP, Agents and Skills.

Date, Time and Location

Jul 29, 2026
9:00 AM - 11:00 AM PDT
Online. Register for the Zoom!

The Agent Control Plane: Turning Coding Agents into Reliable Engineering Workflows

AI coding agents are powerful but often unreliable — they hallucinate, lose context, and produce inconsistent results across runs. In this talk, Alex introduces Atomic, an open-source control plane that adds persistent memory, deterministic workflow phases (Research → Specify → Implement → Ship), and human-in-the-loop gates around coding agents like Claude Code and GitHub Copilot. The result: repeatable, auditable engineering workflows that teams can actually trust in production.
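
As a rough illustration of fixed workflow phases with human approval gates, the sketch below walks a task through the four phases named above. It is not Atomic's API: `PHASES`, `run_workflow`, and the `approve` callback are hypothetical names invented for this example.

```python
from typing import Callable, List

# Phase names come from the talk abstract; everything else is hypothetical.
PHASES = ["Research", "Specify", "Implement", "Ship"]


def run_workflow(approve: Callable[[str], bool]) -> List[str]:
    """Run phases in a fixed order, stopping at the first rejected gate.

    `approve` stands in for a human-in-the-loop check: it receives the
    phase name and returns True to let that phase proceed.
    """
    completed: List[str] = []
    for phase in PHASES:
        if not approve(phase):
            break  # a reviewer rejected this gate, so later phases never run
        completed.append(phase)
    return completed


# Example: a reviewer who approves everything except shipping.
done = run_workflow(lambda phase: phase != "Ship")
print(done)  # ['Research', 'Specify', 'Implement']
```

Because the phase order is fixed and every transition passes through the same gate, two runs with the same approvals always complete the same phases.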

About the Speaker

Alex Lavaee is an Applied AI engineer at Microsoft Research and the creator of Atomic, an open-source SDK that wraps deterministic, research-to-execution workflows around AI coding agents. He previously conducted AI research at Harvard Medical School and Boston University, and has worked as an MLE and data scientist at companies including Boeing and Themis AI, an MIT CSAIL spinoff.

UISurf: Toward Universal UI Automation with Cross-Environment Agents

In this talk, we introduce UISurf, an open-source multimodal agentic UI automation platform in which agents can perceive, reason, and collaborate across browser and desktop environments to complete end-to-end tasks that require interaction with multiple user interfaces.

UISurf comprises three main components: uisurf-agent, the runtime for UI automation agents; uisurf-admin, the session orchestration and management service; and uisurf-app, the full-stack user application. Its multi-agent architecture includes a planning_agent that transforms natural-language requests into structured execution plans, specialized Browser and Desktop Agents for environment-specific interaction, an automation_agent that coordinates execution and inter-agent handoff through Agent-to-Agent (A2A) communication, and a summarization_agent that produces the final task summary for the user. UISurf supports both autonomous execution and human-in-the-loop supervision, offering a practical and extensible framework for studying and deploying cross-environment UI automation.

About the Speaker

Dr. Henry Ruiz is a Research Scientist at Texas A&M University @ AgriLife Research, specializing in Artificial Intelligence (AI) and Remote Sensing. His work focuses on the development of advanced software systems and computational algorithms for analyzing multi-source remote sensing data, including satellite imagery, UAVs (Unmanned Aerial Vehicles), LiDAR (Light Detection and Ranging), and Ground Penetrating Radar (GPR).

From Manual Workflows to AI-Assisted Skills: Building Reliable Internal Automation

In this session, I will discuss how teams can turn repetitive manual workflows into reliable AI-assisted and automation-driven “skills.” I will share practical lessons from building internal tools for CAD and engineering workflows, including how automation can reduce manual effort, improve consistency, and support better process control. The talk will also cover why many AI/agent experiments fail when they are not connected to real team workflows, standards, and validation steps. Attendees will walk away with a practical framework for identifying repeatable workflows, designing useful internal tools, and adopting AI assistance without losing accuracy or trust.

About the Speaker

Janvi Vijaykumar Saddi is a Computer Science graduate and CAD/Data Automation professional with experience in data center design workflows, AutoCAD automation, process improvement, and data analytics. She currently works as a CAD Tech 2 at Astreya, supporting Google data center design workflows by building internal tools that reduce manual effort, improve accuracy, and streamline engineering processes. Her background also includes SQL, Power BI, market research analytics, and AI-assisted development.

Building Safe Agent Sandboxes: Let Agents Act Without Breaking Production

AI agents become truly useful when they can take action, not just generate text. But giving agents access to code, data, and systems raises an important question: how do you let them explore, execute, fail, and improve without putting production at risk?

In this talk, we'll explore the sandbox pattern for agent systems and how to equip agents with tools to read, write, execute, and iterate within controlled environments while using permissions, human approval, and safety guardrails to keep them reliable. We'll cover practical architectures and lessons learned for building agents that can safely evolve from experimentation to production.
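
One way to picture the sandbox pattern is a file-write tool that confines the agent to a single directory and asks for approval before acting. The sketch below is an assumption-laden illustration, not the speaker's implementation: `SandboxError`, `resolve_in_sandbox`, `guarded_write`, and the `approve` callback are hypothetical names.

```python
from pathlib import Path
from typing import Callable


class SandboxError(Exception):
    """Raised when an agent action falls outside the sandbox policy."""


def resolve_in_sandbox(root: Path, requested: str) -> Path:
    """Resolve `requested` against `root`, rejecting escapes such as '../'."""
    root = root.resolve()
    target = (root / requested).resolve()
    if target != root and root not in target.parents:
        raise SandboxError(f"path escapes sandbox: {requested}")
    return target


def guarded_write(root: Path, requested: str, text: str,
                  approve: Callable[[Path], bool]) -> Path:
    """Write a file inside the sandbox only if the approval callback agrees."""
    target = resolve_in_sandbox(root, requested)
    if not approve(target):  # stand-in for a human approval step
        raise SandboxError(f"write not approved: {requested}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text)
    return target
```

The path check runs before the approval step, so a reviewer is never asked to approve a write that would land outside the sandbox directory.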

About the Speaker

Adonai Vera is a Machine Learning Engineer & DevRel at Voxel51 with over 7 years of experience building computer vision and machine learning models using TensorFlow, Docker, and OpenCV.
11 attendees from this group
Network event
Aug 4 - Visual AI in Manufacturing
Tue, Aug 4 · 9:30 PM IST
Online
143 attendees from 52 groups
Join our virtual meetup to hear talks from experts on cutting-edge topics at the intersection of manufacturing, AI, ML, and computer vision.

Date, Time and Location

Aug 04, 2026
9:00 AM - 11:00 AM PDT
Online. Register for the Zoom!

Enabling Multimodal Agents on the Edge

The next generation of AI agents is moving beyond cloud-based, text-only models and will interact with the physical, multimodal world in real time. In the vision domain, for example, AI agents rely on Vision-Language Models (VLMs) as their backbone. However, deploying massive VLMs with billions of parameters on edge devices remains a significant engineering hurdle.

Drawing on our recent ICML and CVPR research papers, this session explores advancements in agentic model optimizations, specifically how distillation and pruning transform 'heavyweight' models into lean, edge-ready engines. Lastly, I present the UI agent that our lab's team is developing, running on an actual phone.

About the Speaker

Denis Gudovskiy is a Distinguished AI Engineer at Panasonic North America, where he conducts R&D on various core AI methods, including multimodal and hardware-efficient agents, supervised and RL training pipelines, and robustness to out-of-distribution scenarios.

When the Camera Can’t Be Trusted: Health-Aware Visual AI for Reliable Near-Miss Detection

Near-miss detection systems are often evaluated as though every camera frame is equally trustworthy, even though blur, poor exposure, occlusion, contamination, and changing lighting can silently degrade the visual evidence used to make safety decisions. This talk presents an online camera-health framework that estimates visual reliability before downstream perception performance significantly deteriorates.

I will discuss how camera-health signals can support condition-aware evaluation, prioritize human review, reduce unreliable alerts, and trigger appropriate fallback behavior. Drawing from research in safety-critical visual perception, the talk will demonstrate how these principles can be adapted to industrial video systems operating across different cameras, shifts, layouts, and environmental conditions.

The presentation will also connect camera-health monitoring with rare-event discovery and failure-driven dataset improvement for more trustworthy near-miss detection.

About the Speaker

Shiva Aher is a computer vision researcher with a graduate background in computer science from the Georgia Institute of Technology, specializing in artificial intelligence.

Agentic VLM applications in manufacturing

Vision Language Models (VLMs) introduce net-new functionality to vision workloads in manufacturing that traditional computer vision models simply do not offer (e.g., open-vocabulary detection, in-context-learning). Even so, fine-tuned models like YOLO offer a level of precision and recall that today's VLMs struggle to match out-of-the-box.

Through agentic harnesses that coordinate calls to VLMs, we can start to deliver similar reliability on manufacturing-relevant tasks (e.g., many-class, many-instance detection), while also supporting the net new functionalities (e.g., multimodal search) that make VLMs distinct. In this talk, we walk through the design of these harnesses, how you serve them efficiently, and how they deliver value in manufacturing.

About the Speaker

Subraiz Ahmed is a member of the Technical Staff at Perceptron AI. He builds the infrastructure to serve frontier vision models. He previously founded a series of startups.
2 attendees from this group
Network event
Aug 6 - Audio and AI Meetup
Thu, Aug 6 · 9:30 PM IST
Online
186 attendees from 51 groups
Join our virtual meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

Date, Time and Location

Aug 06, 2026
9:00 AM - 11:00 AM PDT
Online. Register for the Zoom!

Do Speech Models Actually Understand Speech? Evaluating Speech LLMs Under Realistic Spoken Instruction Conditions

Speech Large Language Models (SLLMs) are increasingly capable, but are we evaluating them the right way? Most benchmarks rely on text prompts, yet real users interact with these systems through speech, a modality that introduces noise, disfluencies, and stylistic variation that text simply doesn't capture.

In this talk, we present findings from a systematic study across 11 tasks, 12 languages, and five prompt styles, examining how prompt modality, language, and task type shape SLLM performance.

About the Speaker

Maike Züfle is a PhD student at the Karlsruhe Institute of Technology (KIT), working in Prof. Jan Niehues's group on interactive speech systems for more natural human–machine communication. Her research focuses on instruction-following speech models with speech as both input and output, with a recent emphasis on full-duplex systems. Beyond her research, she co-organises the instruction-following and speech translation metrics shared tasks at IWSLT. She is a 2026 Apple Scholar in AI/ML.

AI based Audio Forensics

In this presentation, attendees will discover several modules developed by Gradiant for the detection and analysis of synthetically generated or manipulated audio. The session will be delivered by one of the developers involved in the design and implementation of these technologies, providing first-hand insight into their capabilities and underlying methodology.

The presentation will cover the traceability module, which helps identify the origin of AI-generated content. It will also cover the segment detection tool, designed to locate manipulated regions within an audio recording, as well as the complete audio detection tool, which assesses whether an entire recording has been synthetically generated.

About the Speaker

Daniel Paniagua Ares is a research engineer at Gradiant. He graduated in computer engineering from the FIC and holds a master's degree in AI from the VIU.

Curating, Searching, and Evaluating Audio Datasets in FiftyOne

In this talk, we'll start with the ESC-50 environmental-sound dataset to show how FiftyOne represents audio: browsing clips in the tabular view, rendering spectrograms directly in the sample grid with a custom renderer, and turning sounds into searchable vectors with CLAP embeddings. Then we'll demo a similarity-search panel that lets you query an entire audio collection by example clip or a natural-language prompt to quickly find matching sounds.
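
To make the idea of "searchable vectors" concrete, here is a minimal sketch of similarity search by cosine similarity. It uses neither FiftyOne nor CLAP: the three-dimensional vectors are made-up toy values standing in for real embeddings, and `cosine` and `top_k` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]],
          k: int = 2) -> List[str]:
    """Return the names of the k clips most similar to the query vector."""
    ranked = sorted(index, key=lambda name: cosine(query, index[name]),
                    reverse=True)
    return ranked[:k]


# Toy "embeddings" (illustrative values only, not CLAP output).
index = {
    "dog_bark": [0.9, 0.1, 0.0],
    "rain": [0.0, 0.2, 0.9],
    "siren": [0.1, 0.9, 0.1],
}
print(top_k([1.0, 0.2, 0.0], index))  # ['dog_bark', 'siren']
```

In a model that embeds audio and text into a shared space, the query vector could come from either an example clip or a text prompt, and the ranking step would stay the same.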

We'll conclude with a live research problem: Audio Moment Retrieval from the DCASE 2026 Challenge, where the goal is to localize the exact moment in a long recording that matches a text query. We'll frame this as temporal detection, evaluate predictions, and visualize ground-truth vs. predicted moments on an interactive timeline to intuitively expose model failure modes.
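
The abstract does not say which metric the evaluation uses. A common choice for temporal detection is intersection-over-union between predicted and ground-truth time segments, sketched below as an assumption; `temporal_iou`, `is_hit`, and the 0.5 threshold are illustrative and not taken from the DCASE challenge rules or the FiftyOne API.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, truth: Segment) -> float:
    """Intersection-over-union of two time intervals in the same recording."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred: Segment, truth: Segment, threshold: float = 0.5) -> bool:
    """Count a prediction as correct when its IoU reaches the threshold."""
    return temporal_iou(pred, truth) >= threshold


# Prediction 10-20 s vs. ground truth 15-25 s: 5 s overlap over a 15 s union.
print(round(temporal_iou((10.0, 20.0), (15.0, 25.0)), 3))  # 0.333
```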

Attendees will leave with a concrete blueprint and open code for applying visual data-centric AI practices to their own audio and multimodal datasets.

About the Speaker

John Duncan is a Machine Learning Engineer, Customer Success at Voxel51. His research interests include vision, LiDAR, and audio perception for robots and intelligent systems.

Real-Time ASR at 4x on Consumer Hardware: The Meetily Architecture

This talk covers the engineering behind Meetily, an open-source meeting assistant that runs Whisper and NVIDIA Parakeet transcription entirely on-device. We'll walk through how we got Parakeet to roughly 4x real-time on consumer hardware, and the specific points where it still falls over.
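
For readers unfamiliar with the term, "4x real-time" means the system transcribes four seconds of audio for every second of compute. The arithmetic can be sketched as follows; `speed_factor` is a hypothetical helper, and the durations are illustrative rather than Meetily measurements.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# A 60-minute recording transcribed in 15 minutes runs at 4x real-time.
print(speed_factor(60 * 60, 15 * 60))  # 4.0
```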

We'll also get into the honest trade-offs between local and cloud inference: latency, accuracy, cost, and what you actually give up by choosing one over the other. Wrapping ML inference in a Rust/Tauri desktop app came with its own costs, which we'll unpack as well.

Finally, we'll look at what "fully local" really means at an architecture level, where that boundary sits, and how easily it leaks once you add model downloads, integrations, or a pluggable LLM backend.

About the Speaker

Sandeep Zachariah is the Founder and CEO of Zackriya Solutions and leads the team behind Meetily, an open-source, privacy-first meeting assistant that runs Whisper and NVIDIA Parakeet transcription entirely on-device. He brings a rare full-stack perspective on audio AI, from low-level embedded systems and hardware acceleration up through real-time ASR and local LLM summarization, with deep experience deploying speech and ML models across servers, GPUs, and consumer hardware.
6 attendees from this group
Network event
Aug 6 - Audio and AI Meetup
Thu, Aug 6 · 9:30 PM IST
Online
4 attendees from 52 groups
Join us on Aug 6 for a special edition of the AI, ML, and Computer Vision Meetup focused on audio use cases!

Date, Time and Location

Aug 06, 2026
9:00 AM - 11:00 AM PDT
Online. Register for the Zoom
