Initial commit: Split Macha autonomous system into separate flake

Macha is now a standalone NixOS flake that can be imported into other systems.

This provides:
- Independent versioning
- Easier reusability
- Cleaner separation of concerns
- Better development workflow

Includes:
- Complete autonomous system code
- NixOS module with full configuration options
- Queue-based architecture with priority system
- Chunked map-reduce for large outputs
- ChromaDB knowledge base
- Tool calling system
- Multi-host SSH management
- Gotify notification integration

All capabilities from DESIGN.md are preserved.
269
DESIGN.md
Normal file
@@ -0,0 +1,269 @@
# Macha Autonomous System - Design Document

> **⚠️ IMPORTANT - READ THIS FIRST**
> **FOR AI ASSISTANT**: This document is YOUR reference guide when modifying Macha's code.
> - **ALWAYS consult this BEFORE refactoring** to ensure you don't remove existing capabilities
> - **CHECK this when adding features** to avoid conflicts
> - **UPDATE this document** when new capabilities are added
> - **DO NOT DELETE ANYTHING FROM THIS DOCUMENT**
> - During major refactors, you MUST verify each capability listed here is preserved

## Overview

Macha is an AI-powered autonomous system administrator that monitors, maintains, and manages multiple NixOS hosts in the infrastructure.

## Core Capabilities

### 1. Local System Management
- Monitor system health (CPU, memory, disk, services)
- Read and analyze logs via `journalctl`
- Check service status and restart failed services
- Execute system commands (with safety restrictions)
- Monitor and repair Nix store corruption
- Hardware awareness (CPU, GPU, network, storage)

### 2. Multi-Host Management via SSH

**Macha CAN and SHOULD use SSH to manage other hosts.**

#### SSH Access
- Runs as `macha` user (UID 2501)
- Has `NOPASSWD` sudo access for administrative commands
- Shares SSH keys with other hosts in the infrastructure
- Can SSH to: `rhiannon`, `alexander`, `UCAR-Kinston`, and others in the flake

#### SSH Usage Patterns
1. **Direct diagnostic commands:**
   ```bash
   ssh rhiannon systemctl status ollama
   ssh alexander df -h
   ```
   - Commands are automatically prefixed with `sudo` by the tools layer
   - Effective command: `ssh macha@rhiannon sudo systemctl status ollama`

2. **Status checks:**
   - Check service health on remote hosts
   - Gather system metrics
   - Review logs
   - Monitor resource usage

3. **File operations:**
   - Use `scp` to copy files between hosts
   - Read configuration files on remote systems
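
The `sudo` prefixing described in pattern 1 can be sketched as follows. This is only an illustration: `build_ssh_command` and its parameters are hypothetical names, not the actual tools-layer code, and the sketch assumes the remote command is passed as a single string.

```python
import shlex


def build_ssh_command(host, command, user="macha", use_sudo=True):
    """Return the argv list for running `command` on `host` over SSH.

    Hypothetical helper: mirrors the documented behaviour of prefixing
    remote commands with `sudo`, but is not Macha's real implementation.
    """
    remote = shlex.split(command)
    if not remote:
        raise ValueError("empty command")
    # Avoid a double prefix if the caller already wrote `sudo ...`.
    if use_sudo and remote[0] != "sudo":
        remote = ["sudo", *remote]
    return ["ssh", f"{user}@{host}", *remote]


if __name__ == "__main__":
    argv = build_ssh_command("rhiannon", "systemctl status ollama")
    print(shlex.join(argv))
```

Returning an argv list rather than a shell string keeps local quoting out of the picture. Note that `ssh` still hands the remote part to a shell on the target host, so arguments containing spaces or metacharacters would need extra quoting.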

#### When to use SSH vs nh
- **SSH**: For diagnostics, status checks, log review, quick commands
- **nh remote deployment**: For applying NixOS configuration changes
  - `nh os switch -u --target-host=rhiannon --hostname=rhiannon`
  - Builds locally, deploys to remote host
  - Use for permanent configuration changes

### 3. NixOS Configuration Management

#### Local Changes
- Can propose changes to NixOS configuration
- Requires human approval before applying
- Uses `nh os switch` for local updates

#### Remote Deployment
- Can deploy to other hosts using `nh` with `--target-host`
- Builds configuration locally (on Macha)
- Pushes to remote system
- Can take up to 1 hour for complex builds
- **IMPORTANT**: Be patient with long-running builds; don't retry prematurely
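
One way to encode the "be patient, don't retry" rule is to run a deployment command exactly once with a generous timeout. The sketch below assumes a one-hour budget taken from the list above; `run_once` and `BUILD_TIMEOUT_S` are hypothetical names, not part of Macha's code.

```python
import subprocess

# Assumption: one hour, matching "can take up to 1 hour for complex builds".
BUILD_TIMEOUT_S = 60 * 60


def run_once(argv, timeout_s=BUILD_TIMEOUT_S):
    """Run argv a single time and return (exit_code, combined_output).

    Returns (None, message) on timeout. The function never retries;
    deciding what to do after a failure is left to the caller.
    """
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None, f"timed out after {timeout_s}s"
    return proc.returncode, proc.stdout
```

A caller would pass the full deployment argv, for example the `nh os switch` invocation shown earlier, and report the result instead of looping on failure.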

### 4. Hardware Awareness

#### Local Hardware Detection
- CPU: `lscpu` via `nix-shell -p util-linux`
- GPU: `lspci` via `nix-shell -p pciutils`
- Network: `ip addr`
- Storage: `df -h`, `lsblk`
- USB devices: `lsusb`

#### GPU Metrics
- AMD GPUs: Try `rocm-smi`, sysfs (`/sys/class/drm/card*/device/`)
- NVIDIA GPUs: Try `nvidia-smi`
- Fallback: `sensors` for temperature data
- Queries: temperature, utilization, clock speeds, power usage
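
The GPU metrics fallback order can be sketched as a simple probe list. This is a minimal illustration under stated assumptions: `GPU_PROBES` and `pick_gpu_probe` are hypothetical names, the tools are invoked without arguments, and the AMD sysfs path is left out for brevity.

```python
import shutil

# Probe order taken from the list above: vendor tools first, then `sensors`.
GPU_PROBES = [
    ("amd", ["rocm-smi"]),
    ("nvidia", ["nvidia-smi"]),
    ("generic", ["sensors"]),
]


def pick_gpu_probe(which=shutil.which):
    """Return (label, argv) for the first probe tool found on PATH, else None.

    `which` is injectable so the selection logic can be exercised
    without the real tools installed.
    """
    for label, argv in GPU_PROBES:
        if which(argv[0]) is not None:
            return label, argv
    return None
```

On a host with only `lm_sensors` installed this would select `("generic", ["sensors"])`; on a host with none of the tools it returns `None`, at which point a caller could fall back to reading sysfs.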

### 5. Ollama Queue System

#### Architecture
- **File-based queue**: `/var/lib/macha/queues/ollama/`
- **Queue worker**: `ollama-queue-worker.service` (runs as `macha` user)
- **Purpose**: Serialize all LLM requests to prevent resource contention

#### Request Flow
1. Any user (including regular users) → Write request to `pending/`
2. Queue worker → Process requests serially (by priority, FIFO within each priority level)
3. Queue worker → Write response to `completed/`
4. Original requester → Read response from `completed/`

#### Priority Levels
- `INTERACTIVE` (0): User requests via `macha-chat`, `macha-ask`
- `AUTONOMOUS` (1): Background maintenance checks
- `BATCH` (2): Low-priority bulk operations
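
The request flow and priority levels above can be sketched as a small file-based queue. The file naming scheme, JSON layout, and function names here are assumptions made for illustration; the real on-disk format used by `ollama-queue-worker.service` is not described in this document.

```python
import json
import time
import uuid
from enum import IntEnum
from pathlib import Path


class Priority(IntEnum):
    INTERACTIVE = 0
    AUTONOMOUS = 1
    BATCH = 2


def enqueue(queue_dir, payload, priority):
    """Write a request file into pending/ and return its request id."""
    pending = Path(queue_dir) / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    request_id = uuid.uuid4().hex
    # Hypothetical naming: priority digit, then a timestamp, so that a plain
    # lexicographic sort yields priority order and FIFO within a priority.
    name = f"{int(priority)}-{time.time_ns():020d}-{request_id}.json"
    tmp = pending / (name + ".tmp")
    tmp.write_text(json.dumps({"id": request_id, "payload": payload}))
    tmp.rename(pending / name)  # rename so the worker never reads a partial file
    return request_id


def take_next(queue_dir):
    """Remove and return the highest-priority, oldest request, or None."""
    pending = Path(queue_dir) / "pending"
    if not pending.is_dir():
        return None
    files = sorted(pending.glob("*.json"))
    if not files:
        return None
    request = json.loads(files[0].read_text())
    files[0].unlink()
    return request


def complete(queue_dir, request_id, response):
    """Write the worker's response where the requester can read it."""
    completed = Path(queue_dir) / "completed"
    completed.mkdir(parents=True, exist_ok=True)
    (completed / f"{request_id}.json").write_text(json.dumps(response))
```

Because a single worker calls `take_next` in a loop, requests are processed one at a time, which is what serializes access to the LLM. A requester would poll `completed/<id>.json` for its answer.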

#### Large Output Handling
- Outputs larger than 8 KB: Split into chunks for hierarchical processing
- Each chunk is ~8 KB (~2000 tokens)
- Process chunks serially with progress feedback
- Generate chunk summaries → meta-summary
- Full outputs cached in `/var/lib/macha/tool_cache/`
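
The chunk-then-summarize flow can be sketched as a two-level map-reduce. Everything below is illustrative: the function names are hypothetical, splitting on line boundaries is an assumption, and `summarize` stands in for whatever call actually sends text to the LLM.

```python
CHUNK_BYTES = 8 * 1024  # threshold and chunk size taken from the list above


def split_into_chunks(text, chunk_bytes=CHUNK_BYTES):
    """Split text on line boundaries into chunks of at most chunk_bytes.

    A single line longer than chunk_bytes becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        n = len(line.encode("utf-8"))
        if current and size + n > chunk_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks


def summarize_large_output(text, summarize, chunk_bytes=CHUNK_BYTES):
    """Summarize small outputs directly; otherwise summarize per chunk,
    then summarize the combined chunk summaries (the meta-summary)."""
    if len(text.encode("utf-8")) <= chunk_bytes:
        return summarize(text)
    chunks = split_into_chunks(text, chunk_bytes)
    partials = []
    for i, chunk in enumerate(chunks, 1):
        print(f"summarizing chunk {i}/{len(chunks)}")  # progress feedback
        partials.append(summarize(chunk))
    return summarize("\n".join(partials))
```

For an output that splits into three chunks, `summarize` is called four times: once per chunk and once for the meta-summary.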

### 6. Knowledge Base & Learning

#### ChromaDB Collections
1. **System Context**: Infrastructure topology, service relationships
2. **Issues**: Historical problems and resolutions
3. **Knowledge**: Operational wisdom learned from experience

#### Automatic Learning
- After successful operations, Macha reflects and extracts key learnings
- Stores: topic, knowledge content, category
- Retrieved automatically when relevant to current tasks
- Use `macha-knowledge` CLI to view/manage

### 7. Notifications

#### Gotify Integration
- Can send notifications via `macha-notify` command
- Tool: `send_notification(title, message, priority)`

#### Priority Levels
- `2` (Low/Info): Routine status updates, completed tasks
- `5` (Medium/Attention): Important events, configuration changes
- `8` (High/Critical): Service failures, critical errors, security issues
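
As an illustration of how the three priority levels might be enforced before a message is sent, the sketch below builds the `title`/`message`/`priority` payload that a Gotify message carries. `NotifyPriority` and `build_notification` are hypothetical names, and restricting priorities to exactly 2, 5, and 8 is an assumption drawn from the list above rather than a Gotify requirement.

```python
from enum import IntEnum


class NotifyPriority(IntEnum):
    LOW = 2      # routine status updates, completed tasks
    MEDIUM = 5   # important events, configuration changes
    HIGH = 8     # service failures, critical errors, security issues


def build_notification(title, message, priority=NotifyPriority.MEDIUM):
    """Validate the inputs and return the payload dict for a notification."""
    priority = NotifyPriority(priority)  # raises ValueError for other values
    if not title or not message:
        raise ValueError("title and message must be non-empty")
    return {"title": title, "message": message, "priority": int(priority)}
```

For example, `build_notification("ollama down", "ollama.service failed on rhiannon", 8)` yields a payload with priority 8, while a priority of 3 is rejected.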

#### When to Notify
- Critical service failures
- Successful completion of major operations
- Configuration changes that may affect users
- Security-related events
- When explicitly requested by user

### 8. Safety & Constraints

#### Command Restrictions
**Allowed Commands** (see `tools.py` for full list):
- System management: `systemctl`, `journalctl`, `nh`, `nixos-rebuild`
- Monitoring: `free`, `df`, `uptime`, `ps`, `top`, `ip`, `ss`
- Hardware: `lscpu`, `lspci`, `lsblk`, `lshw`, `dmidecode`
- Remote: `ssh`, `scp`
- Power: `reboot`, `shutdown`, `poweroff` (use cautiously!)
- File ops: `cat`, `ls`, `grep`
- Network: `ping`, `dig`, `nslookup`, `curl`, `wget`
- Logging: `logger`

**NOT Allowed**:
- Direct package modifications (`nix-env`, `nix profile`)
- Destructive file operations (`rm -rf`, `dd`)
- User management outside of NixOS config
- Direct editing of system files (use NixOS config instead)
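
A command allowlist like the one above might be checked along these lines. This is a simplified sketch, not the logic in `tools.py`: the set below contains only the commands named in this document, `is_allowed` is a hypothetical name, and rejecting shell metacharacters outright is an added assumption.

```python
import shlex

# Only the commands listed in this document; tools.py holds the full list.
ALLOWED_COMMANDS = {
    "systemctl", "journalctl", "nh", "nixos-rebuild",
    "free", "df", "uptime", "ps", "top", "ip", "ss",
    "lscpu", "lspci", "lsblk", "lshw", "dmidecode",
    "ssh", "scp",
    "reboot", "shutdown", "poweroff",
    "cat", "ls", "grep",
    "ping", "dig", "nslookup", "curl", "wget",
    "logger",
}

# Assumption: refuse anything that could chain or redirect commands.
SHELL_METACHARACTERS = set(";|&`$<>")


def is_allowed(command):
    """Return True if the command's program is on the allowlist."""
    if SHELL_METACHARACTERS & set(command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    if argv and argv[0] == "sudo":
        argv = argv[1:]
    return bool(argv) and argv[0] in ALLOWED_COMMANDS
```

With an allowlist, the entries under **NOT Allowed** (`nix-env`, `rm -rf`, `dd`) are rejected simply because they are absent from the set. The sketch does not inspect what an allowed `ssh` command would run on the remote host.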

#### Critical Services
**Never disable or stop:**
- SSH (network access)
- Networking (connectivity)
- systemd (system management)
- Boot-related services

#### Approval Required
- Reboots or system power changes
- Major configuration changes
- Disabling any service
- Changes to multiple hosts

### 9. Nix Store Maintenance

#### Verification & Repair
- Command: `nix-store --verify --check-contents --repair`
- **WARNING**: Can take 30+ minutes to several hours
- Only use when corruption is suspected
- Not for routine maintenance
- Verifies all store paths, repairs corrupted files

#### Garbage Collection
- Automatic via system configuration
- Can be triggered manually with approval
- Frees disk space by removing store paths that are no longer referenced

### 10. Conversational Behavior

#### Distinguish Requests from Acknowledgments
- "Thanks" / "Thank you" → Acknowledgment (don't re-execute)
- "Can you..." / "Please..." → Request (execute)
- "What is..." / "How do..." → Question (answer)

#### Tool Calling
- Don't repeat tool calls unnecessarily
- If a tool succeeds, don't run it again unless asked
- Use cached results when available (`retrieve_cached_output`)

#### Context Management
- Be aware of token limits
- Use hierarchical processing for large outputs
- Prune conversation history intelligently
- Cache and summarize when needed

## Infrastructure Topology

### Hosts in Flake
- **macha**: Main autonomous system (self), GPU server
- **rhiannon**: Production server
- **alexander**: Production server
- **UCAR-Kinston**: Work laptop
- **test-vm**: Testing environment

### Shared Configuration
- All hosts share root SSH keys (for `nh` remote deployment)
- `macha` user (UID 2501) exists on all hosts
- Common NixOS configuration via flake

## Service Ecosystem

### Core Services on Macha
- `ollama.service`: LLM inference engine
- `ollama-queue-worker.service`: Request serialization
- `macha-autonomous.service`: Autonomous monitoring loop
- Servarr stack: Sonarr, Radarr, Prowlarr, Lidarr, Readarr, Whisparr
- Media: Transmission, SABnzbd, Calibre

### State Directories
- `/var/lib/macha/`: Main state directory (0755, macha:macha)
- `/var/lib/macha/queues/`: Queue directories (0777 for multi-user)
- `/var/lib/macha/tool_cache/`: Cached tool outputs (0777)
- `/var/lib/macha/system_context.db`: ChromaDB database

## CLI Tools

- `macha-chat`: Interactive chat with tool calling
- `macha-ask`: Single-question interface
- `macha-check`: Trigger immediate health check
- `macha-approve`: Approve pending actions
- `macha-logs`: View autonomous service logs
- `macha-issues`: Query issue database
- `macha-knowledge`: Query knowledge base
- `macha-systems`: List managed systems
- `macha-notify`: Send Gotify notification

## Philosophy & Principles

1. **KISS (Keep It Simple, Stupid)**: Use existing NixOS options, avoid custom wrappers
2. **Verify first**: Check source code/documentation before acting
3. **Safety first**: Never break critical services, always require approval for risky changes
4. **Learn continuously**: Extract and store operational knowledge
5. **Multi-host awareness**: Macha manages the entire infrastructure, not just herself
6. **User-friendly**: Clear communication, appropriate notifications
7. **Patience**: Long-running operations (builds, repairs) can take an hour or longer; don't panic
8. **Tool reuse**: Use existing, verified tools instead of writing custom scripts

## Future Capabilities (Not Yet Implemented)

- [ ] Automatic security updates across all hosts
- [ ] Predictive failure detection
- [ ] Resource optimization recommendations
- [ ] Integration with other communication platforms
- [ ] Multi-agent coordination between hosts
- [ ] Automated testing before deployment