robot_arm/README_EN.md

8.9 KiB
Raw Permalink Blame History

Voice-Controlled Robot Arm

A full-stack embodied AI system — voice in, physical action out — running entirely offline on consumer hardware.

中文


Overview

Layer Implementation
Hear Faster-Whisper, local Chinese speech recognition
Think DeepSeek-R1-1.5B + QLoRA fine-tune, natural language → JSON
See YOLOv8s object detection + homography hand-eye calibration
Move D-H inverse kinematics + S-Curve trajectory, ESP32 PWM

Total hardware cost ¥317 (~$45 USD). Requires an NVIDIA GPU for LLM inference (RTX 3060 6GB recommended, <4GB VRAM at runtime, <200ms latency).


Architecture

flowchart TD
    MIC["🎤 Microphone"] --> STT["Faster-Whisper<br/>Chinese speech recognition"]
    STT --> RULE{"Regex engine<br/>Simple command match"}
    RULE -- "Hit" --> ACT["JSON action"]
    RULE -- "Miss (has object name)" --> LLM["DeepSeek-R1-1.5B<br/>QLoRA FP16<br/>Natural language → JSON"]
    LLM --> ACT
    ACT --> VIS["YOLOv8s + Homography<br/>Object detection · hand-eye calibration<br/>Pixel coords → robot coords mm"]
    VIS --> MOT["arm_main.py<br/>D-H IK + S-Curve"]
    MOT --> ESP["ESP32 PWM → Servos"]

Bill of Materials

Total: ¥317 (~$45 USD)

# Item Spec Qty Unit Total
1 3D-printed robot arm kit Acrylic/PLA structural parts 1 ¥71 ¥71
2 ESP32 dev board Dual-core MCU, WiFi + BT 1 ¥19 ¥19
3 ESP32 accessories Connectors / expansion board 1 ¥5 ¥5
4 USB industrial camera Plug-and-play, wide-angle, 1280×720 1 ¥61 ¥61
5 Digital servo MG996R Metal gear, high torque 5 ¥27 ¥133
6 Regulated power supply 6V 6A, servo-dedicated 1 ¥29 ¥29

Wiring

  • ESP32 pins: X→14, Y→4, Z→5, B→18, Gripper→23
  • Power: servos and ESP32 on separate supplies (external 6V/6A) to prevent inrush surge
  • Camera: USB, mounted in front of the arm covering the full work surface
  • Serial: USB to ESP32, default port COM3, override with ROBOT_PORT env var

Installation

1. Flash Firmware

Arduino IDE 2.x, board: "ESP32 Dev Module". Open main.ino, select the correct port, click Upload.

2. Python Environment

Python 3.10+, CUDA 11.8 or 12.x.

# Install the correct CUDA build of PyTorch from pytorch.org first, then:
pip install -r requirements.txt

3. Configure

All tunables are in config.py and support environment variable overrides — no code changes needed:

ROBOT_PORT=COM5               python voice_main.py  # change serial port
LLM_MODEL_PATH=D:\models\lora python voice_main.py  # change LLM path
YOLO_MODEL_PATH=runs/best.pt  python voice_main.py  # change YOLO path

4. Models

Speech (Whisper): the base model is downloaded automatically on first run.

Vision (YOLO): train your own detector — 50 labelled images is enough for transfer learning:

yolo detect train model=yolov8s.pt data=data.yaml epochs=100 imgsz=640
# Output: runs/detect/train/weights/best.pt → copy to project root

LLM: fine-tune DeepSeek-R1-1.5B or Qwen1.5-1.8B with QLoRA. See TRAINING.md for the complete guide.

Training data format (Alpaca):

{
  "instruction": "lift the pencil sharpener 5cm",
  "input": "",
  "system": "You are a robot arm JSON converter...",
  "output": "[{\"action\": \"lift\", \"target\": \"part\", \"height\": 50}]"
}

Quick Start

python voice_main.py

On startup the system loads in order: serial port → YOLO → Whisper → LLM → camera window.

Keyboard Shortcuts

Key Function
SPACE (hold) Record audio; release to transcribe and execute
C Toggle hand-eye calibration mode
R Manual reset to home position
O Force open gripper
Q Quit

Voice Commands

Speak natural Chinese. No special syntax required.

Pick and transport (requires visual detection)

"把削笔刀抓起来"   — pick up the pencil sharpener
"抓住那个盒子"     — grab that box
"把削笔刀抬起5厘米" — lift the pencil sharpener 5cm
"将零件举高10公分"  — raise the part 10cm

Precise directional movement

"向上三厘米"      → Z +30mm
"向左移动四毫米"   → Y +4mm
"往前伸10厘米"    → X +100mm

Fuzzy movement (no explicit distance, defaults to 5cm per config.DEFAULT_MOVE_MM)

"向左"  "抬起"  "往下"

Gestures and state commands

"点头"  — nod: oscillate Z ×3 (±3cm)
"摇头"  — shake head: oscillate Y ×3 (±3cm)
"放下"  — lower to table height (Z=-15mm) and release
"复位"  — return to home position [120, 0, 60] mm
"松开"  — open gripper without moving

Speech compatibility: built-in homophone correction for common Whisper mishearings, e.g. "零米"→"厘米", "小笔刀"→"削笔刀", "电头"→"点头".


Hand-Eye Calibration

Recalibrate whenever the camera is moved. Press C to enter calibration mode, then click 4 corner points in order:

P1 (top-left)     ↔  robot coords (90,  90)
P2 (top-right)    ↔  robot coords (200,  90)
P3 (bottom-right) ↔  robot coords (200, -90)
P4 (bottom-left)  ↔  robot coords (90, -90)

The homography matrix updates instantly after the 4th click. No restart needed.


Troubleshooting

Symptom Cause Fix
SPACE does nothing Camera window not focused Click the camera window first
Garbled recognition Mic noise / speaking too fast Quiet environment, moderate pace; hold SPACE 0.5s before speaking
"Target not found" YOLO didn't detect the object Adjust lighting/angle; verify object is in training classes
Pick position offset Camera was moved Press C and redo 4-point calibration
Serial connection failed ESP32 not plugged in / wrong port Check device manager; set ROBOT_PORT env var
Violent shaking on startup 5-servo simultaneous inrush Firmware staggers power-on; if it persists, check PSU capacity

Technical Notes

Key engineering problems solved during development.

D-H Inverse Kinematics The 130mm L4 link causes ~40° path deviation with geometric IK during horizontal moves. Solved by Scipy SLSQP numerical optimization with a Pitch=-90° constraint (end-effector always perpendicular to the table), eliminating the nonlinear offset entirely.

S-Curve + Multi-Layer Damping MG996R servos vibrate badly under a long lever arm. Five-layer damping pipeline: tilt correction → moving-average filter (deque) → speed cap → EMA damping → dead-zone filter.

Dual-Channel Parse Architecture Simple commands (release/reset/directional moves) bypass the LLM entirely via a regex engine (microseconds). Only complex commands containing object names reach the LLM (<200ms). This prevents the common failure mode where "move down 3cm" gets misclassified as a lift action.

Pre-filling to Skip Chain-of-Thought DeepSeek-R1 outputs a <think>...</think> chain-of-thought by default. Appending <Assistant> as a pre-fill token forces the model to skip the thinking phase and emit JSON directly, achieving 100% format compliance.

Whisper Anti-Hallucination Three defences, all encapsulated in RobotEar.get_text(): silence trimming + duration guards; condition_on_previous_text=False; repeated-phrase regex dedup (removes "向右向右向右..." loops). All thresholds are tunable via config.py.

Engineering Pitfall: System Prompt Alignment The system prompt at inference must exactly match the one used during fine-tuning. Any mismatch causes output drift (e.g., outputting 500mm instead of 50mm). A warning comment is included in the source.


LLM Training

~500 domain-specific samples, QLoRA fine-tune of DeepSeek-R1-1.5B, loss converged to 0.0519, format error rate 0%.

See TRAINING.md for the full guide: QLoRA hyperparameter config, GGUF vs Transformers comparison, pre-filling inference details, and experiment results.


Project Structure

robot_arm/
├── README.md          Chinese documentation
├── README_EN.md       This file
├── TRAINING.md        LLM LoRA fine-tuning research notes
├── requirements.txt   Python dependencies
├── config.py          All tunables: hardware, motion, audio & gesture constants
│
├── main.ino           ESP32 firmware, LEDC PWM servo control
├── arm_main.py        Kinematics core: D-H IK + S-Curve trajectory
├── whisper_main.py    Full ASR pipeline: silence trim → transcribe → post-process
└── voice_main.py      Main app: voice → LLM → vision → motion

Key Specs

Metric Value
Hardware cost ¥317 (~$45 USD)
GPU requirement RTX 3060 6GB (<4GB VRAM at runtime)
Inference latency <200ms (LLM), <50ms (rule engine)
Training samples ~500
Format error rate 0%
Operation mode Fully offline