Salah-Eddine LAIDOUDI defends his doctoral thesis on Monday, November 17, 2025, at 2 p.m., University of Évry, Pelvoux Campus, Yasmina Bestaoui Lecture Hall Bx30. The thesis defense is also accessible via Zoom: https://univ-evry-fr.zoom.us/j/92115017016?pwd=Kes1az5N9Bpz9WjOZPOn8daYFbpdWJ.1
Title: Deep Learning for Immersion and Natural 3D Interaction within Intelligent Mixed Reality Experiences.
Keywords
mixed reality, 3D interaction, deep learning, object detection, AR
Abstract
Mixed Reality aims to merge real and virtual worlds so that digital objects feel present and directly manipulable. Two hurdles persist: real-time scene understanding at 30–60 fps, and natural 3D interaction with bare hands under tight latency and power constraints. Recent anchor-free transformers, multi-scale attention, and temporal-shift networks are accurate but often exceed the 30 ms budget on standalone devices. Because compression alone rarely closes this gap, this thesis asks how to co-design models, data, and deployment so that a mobile MR device achieves reliable detection and controller-free, bare-hand interaction within strict real-time (< 30 ms) and compute (≈ 10 GFLOPs) envelopes.
We introduce SSAFT (Single-Shot Anchor-Free Transformer), coupling a shallow CNN backbone with a compact transformer decoder (7.8 M parameters; 7.3 GFLOPs) trained on Indoor-Objects-28 (19,162 images, 28 classes). To better capture small and low-texture targets under the same budget, we add DFMA (Dual-Focus Multi-Scale Attention), a depthwise 3/5/7-pixel block with squeeze–excite gating inserted at three stages for ≤ 5% FLOPs overhead. To probe robustness, a synthetic astronomy corpus (40,000 renders) enables sim-to-real adaptation, yielding +3 AP on real astrophotographic images without violating runtime constraints.
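The abstract does not spell out how DFMA fuses its depthwise branches, so the following is only a minimal sketch under stated assumptions: the 3×3, 5×5, and 7×7 depthwise branches are summed, the sum passes through a squeeze-and-excitation gate, and the result is added back to the input as a residual. The function names (`depthwise_conv2d`, `squeeze_excite`, `dfma_block`), the sum fusion, and the residual connection are hypothetical choices for illustration, not the thesis implementation. Plain Python lists keep the sketch dependency-free.

```python
import math


def depthwise_conv2d(x, kernels):
    """'Same'-padded depthwise convolution (cross-correlation, as in DL frameworks).

    x: C feature maps, each H x W (nested lists of floats).
    kernels: C odd-sized k x k kernels, one per channel (no cross-channel mixing).
    """
    out = []
    for fmap, ker in zip(x, kernels):
        h, w, k = len(fmap), len(fmap[0]), len(ker)
        r = k // 2
        res = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - r, j + dj - r
                        if 0 <= ii < h and 0 <= jj < w:  # zero padding outside the map
                            acc += ker[di][dj] * fmap[ii][jj]
                res[i][j] = acc
        out.append(res)
    return out


def squeeze_excite(x, w1, w2):
    """Channel gating: global average pool -> FC + ReLU -> FC + sigmoid -> rescale.

    w1: (C // r) x C reduction weights; w2: C x (C // r) expansion weights.
    """
    s = [sum(map(sum, f)) / (len(f) * len(f[0])) for f in x]  # squeeze
    z = [max(0.0, sum(a * b for a, b in zip(row, s))) for row in w1]  # ReLU
    g = [1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(row, z)))) for row in w2]
    return [[[v * gc for v in row] for row in f] for f, gc in zip(x, g)]


def dfma_block(x, kernels, w1, w2):
    """Hypothetical DFMA-style block: summed 3/5/7 depthwise branches, SE gate, residual.

    kernels: dict mapping kernel size (3, 5, 7) to a list of C kernels of that size.
    """
    branches = [depthwise_conv2d(x, kernels[k]) for k in (3, 5, 7)]
    fused = [
        [[sum(b[c][i][j] for b in branches) for j in range(len(x[c][0]))]
         for i in range(len(x[c]))]
        for c in range(len(x))
    ]
    gated = squeeze_excite(fused, w1, w2)
    # Residual connection (assumed): the block refines rather than replaces features.
    return [
        [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(fa, fb)]
        for fa, fb in zip(x, gated)
    ]
```

Each depthwise kernel touches a single channel, so its cost scales with k² × H × W × C rather than the k² × H × W × C² of a full convolution with C output channels; this is why such a block can add multi-scale context for a small FLOPs overhead.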
For hand interaction, dense 3D networks (I3D, SlowFast) reach high accuracy but need tens of GFLOPs and large memory; lightweight temporal-shift variants improve efficiency yet plateau near 90% Top-1 accuracy. We therefore design G-DiTSM, a compact MobileNet-based temporal-shift model that isolates motion through forward and backward frame differences and is trained end-to-end on 20BN-Jester (148k clips, 27 gestures). It reaches about 95% Top-1 accuracy, fits in approximately 15 MB, and runs comfortably above 30 fps on mobile devices with sub-10 ms latency.
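The temporal-shift idea behind G-DiTSM can be sketched on a clip stored as T frames × C channel activations. The shift below follows the standard TSM scheme, in which a fraction of channels exchanges values with neighboring frames at no extra FLOPs. How G-DiTSM actually combines the forward and backward frame differences with the shift is not described here, so `motion_features` (concatenate both differences, then shift) is a hypothetical fusion chosen for illustration, as are all function names.

```python
def temporal_shift(x, fold_div=8):
    """TSM-style shift on a clip of T frames x C channels (zero-padded at clip borders).

    The first C // fold_div channels of frame t are read from frame t + 1,
    the next C // fold_div from frame t - 1; the remaining channels stay in place.
    """
    T, C = len(x), len(x[0])
    fold = C // fold_div
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            if c < fold:
                src = t + 1  # look ahead one frame
            elif c < 2 * fold:
                src = t - 1  # look back one frame
            else:
                src = t  # unshifted channels
            if 0 <= src < T:
                out[t][c] = x[src][c]
    return out


def frame_differences(x):
    """Forward x[t+1] - x[t] and backward x[t] - x[t-1] differences (0 at clip borders)."""
    T, C = len(x), len(x[0])
    fwd = [[x[t + 1][c] - x[t][c] if t + 1 < T else 0.0 for c in range(C)] for t in range(T)]
    bwd = [[x[t][c] - x[t - 1][c] if t > 0 else 0.0 for c in range(C)] for t in range(T)]
    return fwd, bwd


def motion_features(x, fold_div=8):
    """Hypothetical fusion: concatenate both differences per frame, then shift in time."""
    fwd, bwd = frame_differences(x)
    stacked = [f + b for f, b in zip(fwd, bwd)]  # list concatenation -> 2C channels
    return temporal_shift(stacked, fold_div)
```

Differencing removes static appearance (a constant background cancels to zero), so the shifted channels mostly carry motion cues; the shift itself is just memory movement, which suits a sub-10 ms budget.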
Under fair settings, Indoor-Objects-28 results show SSAFT at 42.0% mAP50:95 with 12 ms average inference; YOLOv8-n attains 40.4% at 13 ms, while RT-DETR reaches 45% but is about 15× slower (184 ms). On MS-COCO, adding DFMA lifts mAP50:95 to 41.3% at 7.6 GFLOPs, about +4 points at a runtime comparable to YOLOv8-n, whereas RT-DETR-R18 (46.5%) exceeds the target compute (~15 GFLOPs). For gestures, G-DiTSM achieves 95.3% Top-1 accuracy with just 0.084 GFLOPs and < 10 ms latency, outperforming MobileNet-TSM by 5.5 percentage points while using an order of magnitude fewer resources than 3D baselines.
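The headline figures can be cross-checked with simple arithmetic. The snippet below only recombines numbers quoted in this abstract and measures nothing; it assumes the 7.3 GFLOPs SSAFT figure is the baseline for the 7.6 GFLOPs DFMA variant, and it reads the 5.5-point margin over MobileNet-TSM as an absolute difference in Top-1 accuracy.

```python
# All inputs are figures quoted in this abstract; nothing here is measured.
def fps(latency_ms):
    """Frames per second sustainable if each frame costs latency_ms of inference."""
    return 1000.0 / latency_ms


def frame_budget_ms(target_fps):
    """Per-frame time available at a given display rate."""
    return 1000.0 / target_fps


ssaft_ms, rtdetr_ms = 12.0, 184.0
ssaft_gflops, ssaft_dfma_gflops = 7.3, 7.6  # assumed baseline / DFMA pair
g_ditsm_top1, margin_over_tsm = 95.3, 5.5

slowdown = rtdetr_ms / ssaft_ms                                    # ~15.3x
dfma_overhead = (ssaft_dfma_gflops - ssaft_gflops) / ssaft_gflops  # ~4.1%
implied_tsm_top1 = g_ditsm_top1 - margin_over_tsm                  # ~89.8%
within_envelope = ssaft_dfma_gflops <= 10.0                        # ~10 GFLOPs target

print(f"RT-DETR slowdown vs SSAFT: {slowdown:.1f}x")
print(f"SSAFT at 12 ms: {fps(ssaft_ms):.0f} fps (60 fps budget: {frame_budget_ms(60):.1f} ms)")
print(f"RT-DETR at 184 ms: {fps(rtdetr_ms):.1f} fps")
print(f"DFMA FLOPs overhead: {dfma_overhead:.1%}, within envelope: {within_envelope}")
print(f"Implied MobileNet-TSM Top-1: {implied_tsm_top1:.1f}%")
```

The numbers are mutually consistent: 184 / 12 ≈ 15.3, the DFMA overhead (≈ 4.1%) stays under the stated 5%, and 95.3 − 5.5 = 89.8 matches the "plateau near 90%" quoted for lightweight temporal-shift models.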
After INT8 quantization and Unity/Sentis integration, the full stack sustains continuous real-time throughput on consumer XR hardware; a user study indicates corresponding gains in perceived speed and usability. Looking ahead, RGB–depth–IMU fusion, CLIP-style open-set detection, continuous-stream gesture segmentation, LLM-based context, and adaptive INT8/FP16 switching offer principled paths to broader robustness while keeping compute and energy budgets in check.
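The exact quantization scheme used with Unity Sentis is not given here. As a generic illustration only, the following sketch shows symmetric per-tensor INT8 quantization, where every FP32 value is mapped to an 8-bit integer through a single scale factor. The helper names are hypothetical, and this is not the Sentis API.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8: scale = max|v| / 127, q = clamp(round(v / scale))."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid division by zero
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Map integers back to approximate real values."""
    return [qi * scale for qi in q]


weights = [0.5, -1.0, 0.25, 0.0, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
print(q, max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight then occupies 1 byte instead of 4, and the round-trip error per value is bounded by half the scale; an adaptive INT8/FP16 switch would trade this error against FP16's higher precision when accuracy matters more than throughput.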
Doctoral thesis jury composition
| Jury member | Title | Institution | Role |
|---|---|---|---|
| Mehdi AMMI | Professor | Université Paris 8 | Examiner |
| Hanane AZZAG | Professor | Université Sorbonne Paris Nord | Examiner |
| Elhadj BENKHELIFA | Full Professor | Staffordshire University | Reviewer (rapporteur) |
| Madjid MAIDI | Associate Professor (Maître de Conférences, HDR) | Université Paris 8 | Thesis co-supervisor |
| Samir OTMANE | Professor | Université Évry Paris-Saclay | Thesis supervisor |
| Titus ZAHARIA | Professor | Institut Polytechnique de Paris / TELECOM SudParis | Reviewer (rapporteur) |
- Date: Monday, November 17, 2025, 2 p.m.
- Location: University of Évry, Pelvoux Campus, Yasmina Bestaoui Lecture Hall Bx30, 36 rue du Pelvoux, 91080 ÉVRY-COURCOURONNES
- Zoom: https://univ-evry-fr.zoom.us/j/92115017016?pwd=Kes1az5N9Bpz9WjOZPOn8daYFbpdWJ.1
- Doctoral student: Salah-Eddine LAIDOUDI (University of Évry Paris-Saclay, IBISC IRA2 team)
- Thesis supervision: Samir OTMANE (Professor, Évry University Institute of Technology, IBISC IRA2 team), thesis supervisor; Madjid MAIDI (Associate Professor, HDR, Paris 8 University), thesis co-supervisor