| .gitignore | ||
| arm_main.py | ||
| config.py | ||
| main.ino | ||
| README.md | ||
| README_EN.md | ||
| requirements.txt | ||
| TRAINING.md | ||
| voice_main.py | ||
| whisper_main.py | ||
Voice-Controlled Robot Arm
A full-stack embodied AI system — voice in, physical action out — running entirely offline on consumer hardware.
Overview
| Layer | Implementation |
|---|---|
| Hear | Faster-Whisper, local Chinese speech recognition |
| Think | DeepSeek-R1-1.5B + QLoRA fine-tune, natural language → JSON |
| See | YOLOv8s object detection + homography hand-eye calibration |
| Move | D-H inverse kinematics + S-Curve trajectory, ESP32 PWM |
Total hardware cost ¥317 (~$45 USD). Requires an NVIDIA GPU for LLM inference (RTX 3060 6GB recommended, <4GB VRAM at runtime, <200ms latency).
Architecture
flowchart TD
MIC["🎤 Microphone"] --> STT["Faster-Whisper<br/>Chinese speech recognition"]
STT --> RULE{"Regex engine<br/>Simple command match"}
RULE -- "Hit" --> ACT["JSON action"]
RULE -- "Miss (has object name)" --> LLM["DeepSeek-R1-1.5B<br/>QLoRA FP16<br/>Natural language → JSON"]
LLM --> ACT
ACT --> VIS["YOLOv8s + Homography<br/>Object detection · hand-eye calibration<br/>Pixel coords → robot coords mm"]
VIS --> MOT["arm_main.py<br/>D-H IK + S-Curve"]
MOT --> ESP["ESP32 PWM → Servos"]
Bill of Materials
Total: ¥317 (~$45 USD)
| # | Item | Spec | Qty | Unit | Total |
|---|---|---|---|---|---|
| 1 | 3D-printed robot arm kit | Acrylic/PLA structural parts | 1 | ¥71 | ¥71 |
| 2 | ESP32 dev board | Dual-core MCU, WiFi + BT | 1 | ¥19 | ¥19 |
| 3 | ESP32 accessories | Connectors / expansion board | 1 | ¥5 | ¥5 |
| 4 | USB industrial camera | Plug-and-play, wide-angle, 1280×720 | 1 | ¥61 | ¥61 |
| 5 | Digital servo MG996R | Metal gear, high torque | 5 | ¥27 | ¥133 |
| 6 | Regulated power supply | 6V 6A, servo-dedicated | 1 | ¥29 | ¥29 |
Wiring
- ESP32 pins: X→14, Y→4, Z→5, B→18, Gripper→23
- Power: servos and ESP32 on separate supplies (external 6V/6A) to prevent inrush surge
- Camera: USB, mounted in front of the arm covering the full work surface
- Serial: USB to ESP32, default port
COM3, override withROBOT_PORTenv var
Installation
1. Flash Firmware
Arduino IDE 2.x, board: "ESP32 Dev Module". Open main.ino, select the correct port, click Upload.
2. Python Environment
Python 3.10+, CUDA 11.8 or 12.x.
# Install the correct CUDA build of PyTorch from pytorch.org first, then:
pip install -r requirements.txt
3. Configure
All tunables are in config.py and support environment variable overrides — no code changes needed:
ROBOT_PORT=COM5 python voice_main.py # change serial port
LLM_MODEL_PATH=D:\models\lora python voice_main.py # change LLM path
YOLO_MODEL_PATH=runs/best.pt python voice_main.py # change YOLO path
4. Models
Speech (Whisper): the base model is downloaded automatically on first run.
Vision (YOLO): train your own detector — 50 labelled images is enough for transfer learning:
yolo detect train model=yolov8s.pt data=data.yaml epochs=100 imgsz=640
# Output: runs/detect/train/weights/best.pt → copy to project root
LLM: fine-tune DeepSeek-R1-1.5B or Qwen1.5-1.8B with QLoRA. See TRAINING.md for the complete guide.
Training data format (Alpaca):
{
"instruction": "lift the pencil sharpener 5cm",
"input": "",
"system": "You are a robot arm JSON converter...",
"output": "[{\"action\": \"lift\", \"target\": \"part\", \"height\": 50}]"
}
Quick Start
python voice_main.py
On startup the system loads in order: serial port → YOLO → Whisper → LLM → camera window.
Keyboard Shortcuts
| Key | Function |
|---|---|
| SPACE (hold) | Record audio; release to transcribe and execute |
| C | Toggle hand-eye calibration mode |
| R | Manual reset to home position |
| O | Force open gripper |
| Q | Quit |
Voice Commands
Speak natural Chinese. No special syntax required.
Pick and transport (requires visual detection)
"把削笔刀抓起来" — pick up the pencil sharpener
"抓住那个盒子" — grab that box
"把削笔刀抬起5厘米" — lift the pencil sharpener 5cm
"将零件举高10公分" — raise the part 10cm
Precise directional movement
"向上三厘米" → Z +30mm
"向左移动四毫米" → Y +4mm
"往前伸10厘米" → X +100mm
Fuzzy movement (no explicit distance, defaults to 5cm per config.DEFAULT_MOVE_MM)
"向左" "抬起" "往下"
Gestures and state commands
"点头" — nod: oscillate Z ×3 (±3cm)
"摇头" — shake head: oscillate Y ×3 (±3cm)
"放下" — lower to table height (Z=-15mm) and release
"复位" — return to home position [120, 0, 60] mm
"松开" — open gripper without moving
Speech compatibility: built-in homophone correction for common Whisper mishearings, e.g. "零米"→"厘米", "小笔刀"→"削笔刀", "电头"→"点头".
Hand-Eye Calibration
Recalibrate whenever the camera is moved. Press C to enter calibration mode, then click 4 corner points in order:
P1 (top-left) ↔ robot coords (90, 90)
P2 (top-right) ↔ robot coords (200, 90)
P3 (bottom-right) ↔ robot coords (200, -90)
P4 (bottom-left) ↔ robot coords (90, -90)
The homography matrix updates instantly after the 4th click. No restart needed.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| SPACE does nothing | Camera window not focused | Click the camera window first |
| Garbled recognition | Mic noise / speaking too fast | Quiet environment, moderate pace; hold SPACE 0.5s before speaking |
| "Target not found" | YOLO didn't detect the object | Adjust lighting/angle; verify object is in training classes |
| Pick position offset | Camera was moved | Press C and redo 4-point calibration |
| Serial connection failed | ESP32 not plugged in / wrong port | Check device manager; set ROBOT_PORT env var |
| Violent shaking on startup | 5-servo simultaneous inrush | Firmware staggers power-on; if it persists, check PSU capacity |
Technical Notes
Key engineering problems solved during development.
D-H Inverse Kinematics
The 130mm L4 link causes ~40° path deviation with geometric IK during horizontal moves. Solved by Scipy SLSQP numerical optimization with a Pitch=-90° constraint (end-effector always perpendicular to the table), eliminating the nonlinear offset entirely.
S-Curve + Multi-Layer Damping MG996R servos vibrate badly under a long lever arm. Five-layer damping pipeline: tilt correction → moving-average filter (deque) → speed cap → EMA damping → dead-zone filter.
Dual-Channel Parse Architecture
Simple commands (release/reset/directional moves) bypass the LLM entirely via a regex engine (microseconds). Only complex commands containing object names reach the LLM (<200ms). This prevents the common failure mode where "move down 3cm" gets misclassified as a lift action.
Pre-filling to Skip Chain-of-Thought
DeepSeek-R1 outputs a <think>...</think> chain-of-thought by default. Appending <|Assistant|> as a pre-fill token forces the model to skip the thinking phase and emit JSON directly, achieving 100% format compliance.
Whisper Anti-Hallucination
Three defences, all encapsulated in RobotEar.get_text(): silence trimming + duration guards; condition_on_previous_text=False; repeated-phrase regex dedup (removes "向右向右向右..." loops). All thresholds are tunable via config.py.
Engineering Pitfall: System Prompt Alignment The system prompt at inference must exactly match the one used during fine-tuning. Any mismatch causes output drift (e.g., outputting 500mm instead of 50mm). A warning comment is included in the source.
LLM Training
~500 domain-specific samples, QLoRA fine-tune of DeepSeek-R1-1.5B, loss converged to 0.0519, format error rate 0%.
See TRAINING.md for the full guide: QLoRA hyperparameter config, GGUF vs Transformers comparison, pre-filling inference details, and experiment results.
Project Structure
robot_arm/
├── README.md Chinese documentation
├── README_EN.md This file
├── TRAINING.md LLM LoRA fine-tuning research notes
├── requirements.txt Python dependencies
├── config.py All tunables: hardware, motion, audio & gesture constants
│
├── main.ino ESP32 firmware, LEDC PWM servo control
├── arm_main.py Kinematics core: D-H IK + S-Curve trajectory
├── whisper_main.py Full ASR pipeline: silence trim → transcribe → post-process
└── voice_main.py Main app: voice → LLM → vision → motion
Key Specs
| Metric | Value |
|---|---|
| Hardware cost | ¥317 (~$45 USD) |
| GPU requirement | RTX 3060 6GB (<4GB VRAM at runtime) |
| Inference latency | <200ms (LLM), <50ms (rule engine) |
| Training samples | ~500 |
| Format error rate | 0% |
| Operation mode | Fully offline |