m1ngsama/robot_arm

Fork 0

mirror of https://oauth2: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@github.com/m1ngsama/robot_arm.git synced 2026-06-26 02:54:38 +08:00

本项目构建了一套完整的具身智能（Embodied AI）机械臂控制系统，实现了"语音下达指令 → 大模型语义解析 → 视觉定位目标 → 机械臂精准执行"的全链路闭环。系统在消费级硬件（RTX 3060 Laptop, 6GB 显存）上完成全部推理，无需联网，满足边缘部署需求。

Find a file

m1ngsama 85a1c92268 docs: fix Mermaid line breaks \n -> <br/>		2026-02-20 21:58:29 +08:00
.gitignore	docs: rewrite README (bilingual) + restructure documentation	2026-02-20 21:23:47 +08:00
arm_main.py	fix(arm_main): remove dead plt import, path_history leak, fix OFFSET init	2026-02-20 21:45:04 +08:00
config.py	refactor(audio): move full pipeline into RobotEar.get_text(); add config constants	2026-02-20 21:45:16 +08:00
main.ino	chore(ino): translate main.ino comments and Serial output to English	2026-02-20 21:45:36 +08:00
README.md	docs: fix Mermaid line breaks \n -> <br/>	2026-02-20 21:58:29 +08:00
README_EN.md	docs: fix Mermaid line breaks \n -> <br/>	2026-02-20 21:58:29 +08:00
requirements.txt	fix(voice_main): pass actual current_pos to execute_pick/execute_lift; add requirements.txt	2026-02-20 20:26:30 +08:00
TRAINING.md	docs: rewrite README (bilingual) + restructure documentation	2026-02-20 21:23:47 +08:00
voice_main.py	refactor(audio): move full pipeline into RobotEar.get_text(); add config constants	2026-02-20 21:45:16 +08:00
whisper_main.py	refactor(audio): move full pipeline into RobotEar.get_text(); add config constants	2026-02-20 21:45:16 +08:00

README_EN.md

Voice-Controlled Robot Arm

A full-stack embodied AI system — voice in, physical action out — running entirely offline on consumer hardware.

中文

Overview

Layer	Implementation
Hear	Faster-Whisper, local Chinese speech recognition
Think	DeepSeek-R1-1.5B + QLoRA fine-tune, natural language → JSON
See	YOLOv8s object detection + homography hand-eye calibration
Move	D-H inverse kinematics + S-Curve trajectory, ESP32 PWM

Total hardware cost ¥317 (~$45 USD). Requires an NVIDIA GPU for LLM inference (RTX 3060 6GB recommended, <4GB VRAM at runtime, <200ms latency).

Architecture

flowchart TD
    MIC["🎤 Microphone"] --> STT["Faster-Whisper<br/>Chinese speech recognition"]
    STT --> RULE{"Regex engine<br/>Simple command match"}
    RULE -- "Hit" --> ACT["JSON action"]
    RULE -- "Miss (has object name)" --> LLM["DeepSeek-R1-1.5B<br/>QLoRA FP16<br/>Natural language → JSON"]
    LLM --> ACT
    ACT --> VIS["YOLOv8s + Homography<br/>Object detection · hand-eye calibration<br/>Pixel coords → robot coords mm"]
    VIS --> MOT["arm_main.py<br/>D-H IK + S-Curve"]
    MOT --> ESP["ESP32 PWM → Servos"]

Bill of Materials

Total: ¥317 (~$45 USD)

#	Item	Spec	Qty	Unit	Total
1	3D-printed robot arm kit	Acrylic/PLA structural parts	1	¥71	¥71
2	ESP32 dev board	Dual-core MCU, WiFi + BT	1	¥19	¥19
3	ESP32 accessories	Connectors / expansion board	1	¥5	¥5
4	USB industrial camera	Plug-and-play, wide-angle, 1280×720	1	¥61	¥61
5	Digital servo MG996R	Metal gear, high torque	5	¥27	¥133
6	Regulated power supply	6V 6A, servo-dedicated	1	¥29	¥29

Wiring

ESP32 pins: X→14, Y→4, Z→5, B→18, Gripper→23
Power: servos and ESP32 on separate supplies (external 6V/6A) to prevent inrush surge
Camera: USB, mounted in front of the arm covering the full work surface
Serial: USB to ESP32, default port COM3, override with ROBOT_PORT env var

Installation

1. Flash Firmware

Arduino IDE 2.x, board: "ESP32 Dev Module". Open main.ino, select the correct port, click Upload.

2. Python Environment

Python 3.10+, CUDA 11.8 or 12.x.

# Install the correct CUDA build of PyTorch from pytorch.org first, then:
pip install -r requirements.txt

3. Configure

All tunables are in config.py and support environment variable overrides — no code changes needed:

ROBOT_PORT=COM5               python voice_main.py  # change serial port
LLM_MODEL_PATH=D:\models\lora python voice_main.py  # change LLM path
YOLO_MODEL_PATH=runs/best.pt  python voice_main.py  # change YOLO path

4. Models

Speech (Whisper): the base model is downloaded automatically on first run.

Vision (YOLO): train your own detector — 50 labelled images is enough for transfer learning:

yolo detect train model=yolov8s.pt data=data.yaml epochs=100 imgsz=640
# Output: runs/detect/train/weights/best.pt → copy to project root

LLM: fine-tune DeepSeek-R1-1.5B or Qwen1.5-1.8B with QLoRA. See TRAINING.md for the complete guide.

Training data format (Alpaca):

{
  "instruction": "lift the pencil sharpener 5cm",
  "input": "",
  "system": "You are a robot arm JSON converter...",
  "output": "[{\"action\": \"lift\", \"target\": \"part\", \"height\": 50}]"
}

Quick Start

python voice_main.py

On startup the system loads in order: serial port → YOLO → Whisper → LLM → camera window.

Keyboard Shortcuts

Key	Function
SPACE (hold)	Record audio; release to transcribe and execute
C	Toggle hand-eye calibration mode
R	Manual reset to home position
O	Force open gripper
Q	Quit

Voice Commands

Speak natural Chinese. No special syntax required.

Pick and transport (requires visual detection)

"把削笔刀抓起来"   — pick up the pencil sharpener
"抓住那个盒子"     — grab that box
"把削笔刀抬起5厘米" — lift the pencil sharpener 5cm
"将零件举高10公分"  — raise the part 10cm

Precise directional movement

"向上三厘米"      → Z +30mm
"向左移动四毫米"   → Y +4mm
"往前伸10厘米"    → X +100mm

Fuzzy movement (no explicit distance, defaults to 5cm per config.DEFAULT_MOVE_MM)

"向左"  "抬起"  "往下"

Gestures and state commands

"点头"  — nod: oscillate Z ×3 (±3cm)
"摇头"  — shake head: oscillate Y ×3 (±3cm)
"放下"  — lower to table height (Z=-15mm) and release
"复位"  — return to home position [120, 0, 60] mm
"松开"  — open gripper without moving

Speech compatibility: built-in homophone correction for common Whisper mishearings, e.g. "零米"→"厘米", "小笔刀"→"削笔刀", "电头"→"点头".

Hand-Eye Calibration

Recalibrate whenever the camera is moved. Press C to enter calibration mode, then click 4 corner points in order:

P1 (top-left)     ↔  robot coords (90,  90)
P2 (top-right)    ↔  robot coords (200,  90)
P3 (bottom-right) ↔  robot coords (200, -90)
P4 (bottom-left)  ↔  robot coords (90, -90)

The homography matrix updates instantly after the 4th click. No restart needed.

Troubleshooting

Symptom	Cause	Fix
SPACE does nothing	Camera window not focused	Click the camera window first
Garbled recognition	Mic noise / speaking too fast	Quiet environment, moderate pace; hold SPACE 0.5s before speaking
"Target not found"	YOLO didn't detect the object	Adjust lighting/angle; verify object is in training classes
Pick position offset	Camera was moved	Press C and redo 4-point calibration
Serial connection failed	ESP32 not plugged in / wrong port	Check device manager; set `ROBOT_PORT` env var
Violent shaking on startup	5-servo simultaneous inrush	Firmware staggers power-on; if it persists, check PSU capacity

Technical Notes

Key engineering problems solved during development.

D-H Inverse Kinematics The 130mm L4 link causes ~40° path deviation with geometric IK during horizontal moves. Solved by Scipy SLSQP numerical optimization with a Pitch=-90° constraint (end-effector always perpendicular to the table), eliminating the nonlinear offset entirely.

S-Curve + Multi-Layer Damping MG996R servos vibrate badly under a long lever arm. Five-layer damping pipeline: tilt correction → moving-average filter (deque) → speed cap → EMA damping → dead-zone filter.

Dual-Channel Parse Architecture Simple commands (release/reset/directional moves) bypass the LLM entirely via a regex engine (microseconds). Only complex commands containing object names reach the LLM (<200ms). This prevents the common failure mode where "move down 3cm" gets misclassified as a lift action.

Pre-filling to Skip Chain-of-Thought DeepSeek-R1 outputs a <think>...</think> chain-of-thought by default. Appending <｜Assistant｜> as a pre-fill token forces the model to skip the thinking phase and emit JSON directly, achieving 100% format compliance.

Whisper Anti-Hallucination Three defences, all encapsulated in RobotEar.get_text(): silence trimming + duration guards; condition_on_previous_text=False; repeated-phrase regex dedup (removes "向右向右向右..." loops). All thresholds are tunable via config.py.

Engineering Pitfall: System Prompt Alignment The system prompt at inference must exactly match the one used during fine-tuning. Any mismatch causes output drift (e.g., outputting 500mm instead of 50mm). A warning comment is included in the source.

LLM Training

~500 domain-specific samples, QLoRA fine-tune of DeepSeek-R1-1.5B, loss converged to 0.0519, format error rate 0%.

See TRAINING.md for the full guide: QLoRA hyperparameter config, GGUF vs Transformers comparison, pre-filling inference details, and experiment results.

Project Structure

robot_arm/
├── README.md          Chinese documentation
├── README_EN.md       This file
├── TRAINING.md        LLM LoRA fine-tuning research notes
├── requirements.txt   Python dependencies
├── config.py          All tunables: hardware, motion, audio & gesture constants
│
├── main.ino           ESP32 firmware, LEDC PWM servo control
├── arm_main.py        Kinematics core: D-H IK + S-Curve trajectory
├── whisper_main.py    Full ASR pipeline: silence trim → transcribe → post-process
└── voice_main.py      Main app: voice → LLM → vision → motion

Key Specs

Metric	Value
Hardware cost	¥317 (~$45 USD)
GPU requirement	RTX 3060 6GB (<4GB VRAM at runtime)
Inference latency	<200ms (LLM), <50ms (rule engine)
Training samples	~500
Format error rate	0%
Operation mode	Fully offline

README_EN.md Unescape Escape