macha-autonomous/DESIGN.md
Lily Miller cc8334f2c5 Enhance DESIGN.md with new features and clarifications
- Added USB device detection and monitoring capabilities.
- Updated SSH usage patterns to reflect dynamic host configurations.
- Introduced automatic system discovery from journal logs, including OS detection and system profiling.
- Enhanced configuration file intelligence with semantic search and categorization.
- Expanded knowledge base structure and automatic learning processes.
- Clarified the architecture and key modules for better understanding of system components.
2025-10-09 16:18:34 -06:00


# Macha Autonomous System - Design Document
> **⚠️ IMPORTANT - READ THIS FIRST**
> **FOR AI ASSISTANT**: This document is YOUR reference guide when modifying Macha's code.
> - **ALWAYS consult this BEFORE refactoring** to ensure you don't remove existing capabilities
> - **CHECK this when adding features** to avoid conflicts
> - **UPDATE this document** when new capabilities are added
> - **DO NOT DELETE ANYTHING FROM THIS DOCUMENT**
> - During major refactors, you MUST verify each capability listed here is preserved
## Overview
Macha is an AI-powered autonomous system administrator capable of monitoring, maintaining, and managing multiple NixOS hosts in the infrastructure.
## Core Capabilities
### 1. Local System Management
- Monitor system health (CPU, memory, disk, services)
- Read and analyze logs via `journalctl`
- Check service status and restart failed services
- Execute system commands (with safety restrictions)
- Monitor and repair Nix store corruption
- Hardware awareness (CPU, GPU, network, storage)
- USB device detection and monitoring
### 2. Multi-Host Management via SSH
**Macha CAN and SHOULD use SSH to manage other hosts.**
#### SSH Access
- **CRITICAL**: All command patterns are defined in `command_patterns.py` (SINGLE SOURCE OF TRUTH)
- Always uses explicit SSH key path: `-i /var/lib/macha/.ssh/id_ed25519`
- All SSH commands automatically include the `-i` flag with absolute key path
- Remote commands always prefixed with `sudo`
- Runs as `macha` user (UID 2501) for standard operations
- **Note**: Some internal operations (like remote monitoring) may use `root` SSH for privileged access
- **DO NOT DUPLICATE these patterns elsewhere** - import from `command_patterns.py`
- Has `NOPASSWD` sudo access for administrative commands
- Shares SSH keys with other hosts in the infrastructure
- Can SSH to any host defined in your NixOS flake configuration
#### SSH Usage Patterns
1. **Direct diagnostic commands:**
   ```bash
   ssh hostname systemctl status service-name
   ssh hostname df -h
   ```
   - Commands automatically transformed by the tools layer
   - Full command: `ssh -i /var/lib/macha/.ssh/id_ed25519 -o StrictHostKeyChecking=no macha@hostname sudo systemctl status service-name`
   - SSH key path is always explicit, commands are automatically prefixed with `sudo`
2. **Status checks:**
   - Check service health on remote hosts
   - Gather system metrics
   - Review logs
   - Monitor resource usage
3. **File operations:**
   - Use `scp` to copy files between hosts
   - Read configuration files on remote systems
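
The transformation from a short diagnostic command to the full SSH invocation can be sketched as follows. This is a minimal illustration, not the contents of `command_patterns.py`: the function name `build_ssh_command()` comes from this document, but its signature and the constant names here are assumptions.

```python
import shlex

# Values taken from the "Full command" example above.
SSH_KEY_PATH = "/var/lib/macha/.ssh/id_ed25519"
SSH_USER = "macha"


def build_ssh_command(hostname: str, remote_command: str) -> list[str]:
    """Return an argv list that runs remote_command on hostname over SSH.

    Hypothetical signature: the real build_ssh_command() may differ.
    """
    remote_argv = shlex.split(remote_command)
    if not remote_argv:
        raise ValueError("remote_command must not be empty")
    # Prefix with sudo unless the caller already did so.
    if remote_argv[0] != "sudo":
        remote_argv.insert(0, "sudo")
    return [
        "ssh",
        "-i", SSH_KEY_PATH,
        "-o", "StrictHostKeyChecking=no",
        f"{SSH_USER}@{hostname}",
        *remote_argv,
    ]


print(" ".join(build_ssh_command("hostname", "systemctl status service-name")))
```

Returning an argv list rather than a single string keeps the local side free of shell interpolation. Remote arguments that contain spaces would still need quoting, which this sketch does not handle.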
#### When to use SSH vs nh
- **SSH**: For diagnostics, status checks, log review, quick commands
- **nh remote deployment**: For applying NixOS configuration changes
  - `nh os switch -u --target-host=hostname --hostname=hostname`
  - Builds locally, deploys to remote host
  - Use for permanent configuration changes
### 3. Automatic System Discovery
#### Discovery from Journal Logs
- Monitors `systemd-journal-remote` logs for new systems sending logs
- Parses `_HOSTNAME` field from remote journal entries
- Automatically converts short hostnames to FQDNs (`.coven.systems`)
- Discovers systems within configurable time windows (default: 10 minutes)
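
A rough sketch of the discovery step, assuming journal entries are available as dictionaries in the shape of `journalctl -o json` output (with `_HOSTNAME` and `__REALTIME_TIMESTAMP` in microseconds since the epoch). The function names `to_fqdn()` and `discover_hosts()` are hypothetical and are not taken from `system_discovery.py`.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_DOMAIN = "coven.systems"  # domain suffix named in the list above


def to_fqdn(hostname: str, domain: str = DEFAULT_DOMAIN) -> str:
    """Append the domain to a short hostname; leave dotted names unchanged."""
    name = hostname.strip().rstrip(".").lower()
    if not name:
        raise ValueError("hostname must not be empty")
    return name if "." in name else f"{name}.{domain}"


def discover_hosts(entries, now, window=timedelta(minutes=10)):
    """Return the FQDNs of hosts that logged within the time window."""
    cutoff = now - window
    found = set()
    for entry in entries:
        host = entry.get("_HOSTNAME")
        stamp = entry.get("__REALTIME_TIMESTAMP")
        if not host or stamp is None:
            continue  # skip entries without the fields we need
        seen_at = datetime.fromtimestamp(int(stamp) / 1_000_000, tz=timezone.utc)
        if seen_at >= cutoff:
            found.add(to_fqdn(host))
    return found
```

Passing `now` explicitly, instead of calling `datetime.now()` inside the function, makes the window logic straightforward to test.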
#### OS Detection
- Detects operating system via SSH (`/etc/os-release`, `uname`)
- Supports: NixOS, Ubuntu, Debian, Arch, Fedora, RHEL, Alpine, macOS, FreeBSD
- Falls back to generic "linux" if specific distro cannot be determined
- Records OS type in system registry for appropriate management strategies
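
The detection logic might look roughly like the sketch below, which takes the text of `/etc/os-release` and the output of `uname -s` as plain strings (however they were fetched over SSH). The function name `detect_os()` and the returned labels are assumptions for illustration; the `ID=` key is a standard `os-release` field.

```python
# Distro IDs as they appear in the ID= field of /etc/os-release.
KNOWN_DISTROS = {"nixos", "ubuntu", "debian", "arch", "fedora", "rhel", "alpine"}


def detect_os(os_release_text: str, uname_s: str = "Linux") -> str:
    """Map os-release content and `uname -s` output to an OS label."""
    kernel = uname_s.strip().lower()
    if kernel == "darwin":
        return "macos"
    if kernel == "freebsd":
        return "freebsd"

    fields = {}
    for line in os_release_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip("\"'")  # values may be quoted

    distro = fields.get("ID", "").lower()
    # Fall back to generic "linux" when the distro is not recognised.
    return distro if distro in KNOWN_DISTROS else "linux"
```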
#### System Profiling
- Automatically gathers system information upon discovery:
  - Running services (via `systemctl list-units`)
  - Hardware info (CPU cores, memory)
  - Capabilities (containers, web-server, database, remote-access)
  - System role determination (ai-workstation, server, workstation, minimal)
- Registers discovered systems in ChromaDB context database
- Sends notification when new systems are discovered
### 4. NixOS Configuration Management
#### Local Changes
- Can propose changes to NixOS configuration
- Requires human approval before applying
- Uses `nh os switch` for local updates
#### Remote Deployment
- Can deploy to other hosts using `nh` with `--target-host`
- Builds configuration locally (on Macha)
- Pushes to remote system
- Can take up to 1 hour for complex builds
- **IMPORTANT**: Be patient with long-running builds, don't retry prematurely
### 5. Configuration File Intelligence (RAG)
#### Semantic Configuration Search
- NixOS configuration files indexed in ChromaDB for semantic search
- Query configurations by natural language: "gotify configuration", "journald settings"
- CLI tools: `macha-configs <query>` and `macha-configs-read <path>`
- Relevance scoring helps find the right config files quickly
- Supports filtering by system hostname or category
#### Configuration Categories
- `apps/` - Application configurations (Gotify, Nextcloud, etc.)
- `systems/` - Per-host system configurations
- `osconfigs/` - Operating system level settings
- `users/` - User management configurations
#### Git Context Analysis
- Tracks recent changes to configuration files via `git_context.py`
- Correlates config changes with system behavior
- Provides context when investigating issues after deployments
- Helps understand "what changed" when debugging
### 6. Hardware Awareness
#### Local Hardware Detection
- CPU: `lscpu` via `nix-shell -p util-linux`
- GPU: `lspci` via `nix-shell -p pciutils`
- Network: `ip addr`
- Storage: `df -h`, `lsblk`
- USB devices: `lsusb`
#### GPU Metrics
- AMD GPUs: Try `rocm-smi`, sysfs (`/sys/class/drm/card*/device/`)
- NVIDIA GPUs: Try `nvidia-smi`
- Fallback: `sensors` for temperature data
- Queries: temperature, utilization, clock speeds, power usage
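
One way to express the "try the vendor tool, then fall back" order is sketched below. It only decides which command-line tool to use and leaves out the sysfs path for AMD GPUs. The name `pick_gpu_tool()` and the vendor keys are hypothetical.

```python
import shutil
from typing import Callable, Optional

VENDOR_TOOLS = {"amd": "rocm-smi", "nvidia": "nvidia-smi"}
FALLBACK_TOOL = "sensors"  # temperature data only


def pick_gpu_tool(
    vendor: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first available metrics tool for the vendor, or None."""
    candidates = []
    vendor_tool = VENDOR_TOOLS.get(vendor.lower())
    if vendor_tool:
        candidates.append(vendor_tool)
    candidates.append(FALLBACK_TOOL)
    for tool in candidates:
        if which(tool):  # truthy when the tool is found on PATH
            return tool
    return None
```

The `which` parameter defaults to `shutil.which` and exists only so the lookup can be replaced in tests.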
### 7. Ollama Queue System
#### Architecture
- **File-based queue**: `/var/lib/macha/queues/ollama/`
- **Queue worker**: `ollama-queue-worker.service` (runs as `macha` user)
- **Purpose**: Serialize all LLM requests to prevent resource contention
#### Request Flow
1. Any user (including regular users) → Write request to `pending/`
2. Queue worker → Process requests serially (FIFO with priority)
3. Queue worker → Write response to `completed/`
4. Original requester → Read response from `completed/`
#### Priority Levels
- `INTERACTIVE` (0): User requests via `macha-chat`, `macha-ask`
- `AUTONOMOUS` (1): Background maintenance checks
- `BATCH` (2): Low-priority bulk operations
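
The request flow and priority ordering above could be implemented along these lines. This is a sketch under assumptions, not the code in `ollama_queue.py`: the file naming scheme, the JSON fields, and the function names `enqueue()` and `next_request()` are all made up for illustration.

```python
import json
import time
import uuid
from enum import IntEnum
from pathlib import Path
from typing import Optional


class Priority(IntEnum):
    INTERACTIVE = 0
    AUTONOMOUS = 1
    BATCH = 2


def enqueue(queue_dir: Path, prompt: str, priority: Priority) -> Path:
    """Write a request file into pending/ and return its path."""
    pending = queue_dir / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    request_id = uuid.uuid4().hex
    # The name sorts by priority first, then by submission time.
    name = f"{int(priority)}-{time.time_ns():020d}-{request_id}.json"
    payload = {"id": request_id, "prompt": prompt, "priority": int(priority)}
    tmp = pending / f".{name}.tmp"
    tmp.write_text(json.dumps(payload))
    final = pending / name
    tmp.rename(final)  # the worker never sees a half-written file
    return final


def next_request(queue_dir: Path) -> Optional[Path]:
    """Return the pending request the worker should process next."""
    pending = queue_dir / "pending"
    if not pending.is_dir():
        return None
    candidates = sorted(pending.glob("*.json"))
    return candidates[0] if candidates else None
```

Encoding priority and timestamp in the file name lets the worker pick the next request with a plain sorted directory listing, without opening every file.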
#### Large Output Handling
- Outputs >8KB: Split into chunks for hierarchical processing
- Each chunk ~8KB (~2000 tokens)
- Process chunks serially with progress feedback
- Generate chunk summaries → meta-summary
- Full outputs cached in `/var/lib/macha/tool_cache/`
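
The chunk-then-summarize approach can be sketched as below. The `summarize` callable stands in for an LLM request and is an assumption, as are the function names. Splitting on line boundaries is one reasonable choice and may not match the actual implementation.

```python
from typing import Callable

CHUNK_BYTES = 8 * 1024  # ~8KB per chunk, as described above


def split_output(text: str, chunk_bytes: int = CHUNK_BYTES) -> list[str]:
    """Split text on line boundaries into chunks of at most chunk_bytes.

    A single line longer than chunk_bytes is kept whole as its own chunk.
    """
    if len(text.encode("utf-8")) <= chunk_bytes:
        return [text]
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        if current and size + line_size > chunk_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        chunks.append("".join(current))
    return chunks


def summarize_hierarchically(text: str, summarize: Callable[[str], str]) -> str:
    """Summarize each chunk in order, then summarize the summaries."""
    chunks = split_output(text)
    if len(chunks) == 1:
        return summarize(chunks[0])
    chunk_summaries = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(chunk_summaries))
```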
### 8. Knowledge Base & Learning
#### ChromaDB Architecture
- **Service**: ChromaDB runs as a standalone service on port 8000
- **Storage**: Data persisted at `/var/lib/chromadb`
- **Frontend**: `context_db.py` provides structured Python interface to ChromaDB
- **Connection**: HTTP client to `localhost:8000`
#### ChromaDB Collections
1. **systems**: Infrastructure topology, registered hosts, OS types
2. **relationships**: System dependencies and relationships
3. **issues**: Historical problems and resolutions
4. **decisions**: AI decisions and outcomes
5. **config_files**: NixOS configuration files for RAG
6. **knowledge**: Operational wisdom learned from experience
#### Automatic Learning & Reflection
- After successful operations, Macha automatically reflects via `reflect_and_learn()`
- Extracts 1-2 specific, actionable learnings from each successful operation
- Stores: topic, knowledge content, category, confidence level
- Categories: command, pattern, troubleshooting, performance, general
- Retrieved automatically when relevant to current tasks
- Use `macha-knowledge` CLI to view/manage/search
- Use `seed_knowledge.py` to populate initial operational knowledge
### 9. Notifications
#### Gotify Integration
- Can send notifications via `macha-notify` command
- Tool: `send_notification(title, message, priority)`
#### Priority Levels
- `2` (Low/Info): Routine status updates, completed tasks
- `5` (Medium/Attention): Important events, configuration changes
- `8` (High/Critical): Service failures, critical errors, security issues
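
As an illustration, the three levels map naturally onto an integer enum. The event names in `EVENT_PRIORITY` and the default of `LOW` for unknown events are assumptions made for this sketch; only the numeric values come from this document.

```python
from enum import IntEnum


class NotifyPriority(IntEnum):
    LOW = 2     # routine status updates, completed tasks
    MEDIUM = 5  # important events, configuration changes
    HIGH = 8    # service failures, critical errors, security issues


# Hypothetical event names, used only for illustration.
EVENT_PRIORITY = {
    "task_completed": NotifyPriority.LOW,
    "config_changed": NotifyPriority.MEDIUM,
    "service_failed": NotifyPriority.HIGH,
    "security_event": NotifyPriority.HIGH,
}


def priority_for(event: str) -> NotifyPriority:
    """Look up the priority for an event; unknown events default to LOW."""
    return EVENT_PRIORITY.get(event, NotifyPriority.LOW)
```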
#### When to Notify
- Critical service failures
- Successful completion of major operations
- Configuration changes that may affect users
- Security-related events
- New system discoveries
- When explicitly requested by user
### 10. Safety & Constraints
#### Command Restrictions
**Allowed Commands** (see `tools.py` for full list):
- System management: `systemctl`, `journalctl`, `nh`, `nixos-rebuild`
- Monitoring: `free`, `df`, `uptime`, `ps`, `top`, `ip`, `ss`
- Hardware: `lscpu`, `lspci`, `lsblk`, `lshw`, `dmidecode`
- Remote: `ssh`, `scp`
- Power: `reboot`, `shutdown`, `poweroff` (use cautiously!)
- File ops: `cat`, `ls`, `grep`
- Network: `ping`, `dig`, `nslookup`, `curl`, `wget`
- Logging: `logger`
**NOT Allowed**:
- Direct package modifications (`nix-env`, `nix profile`)
- Destructive file operations (`rm -rf`, `dd`)
- User management outside of NixOS config
- Direct editing of system files (use NixOS config instead)
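
A minimal allow-list check might look like the following. The command names are copied from the list above, but the function name `is_command_allowed()` and the blanket rejection of shell metacharacters are assumptions. The actual rules live in `tools.py` and may be more permissive.

```python
import shlex

# Command names copied from the "Allowed Commands" list above.
ALLOWED_COMMANDS = {
    "systemctl", "journalctl", "nh", "nixos-rebuild",
    "free", "df", "uptime", "ps", "top", "ip", "ss",
    "lscpu", "lspci", "lsblk", "lshw", "dmidecode",
    "ssh", "scp",
    "reboot", "shutdown", "poweroff",
    "cat", "ls", "grep",
    "ping", "dig", "nslookup", "curl", "wget",
    "logger",
}

# Conservative assumption: reject anything that could chain or redirect.
SHELL_METACHARACTERS = set(";|&`$<>")


def is_command_allowed(command: str) -> bool:
    """Return True only if the command's program is on the allow-list."""
    if any(char in SHELL_METACHARACTERS for char in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes
        return False
    # Require a bare program name so a path like /tmp/ls cannot slip through.
    return bool(argv) and argv[0] in ALLOWED_COMMANDS
```

An allow-list fails closed: a command that is not explicitly listed, such as `nix-env` or `rm`, is rejected without needing a separate block-list.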
#### Critical Services
**Never disable or stop:**
- SSH (network access)
- Networking (connectivity)
- systemd (system management)
- Boot-related services
#### Approval Required
- Reboots or system power changes
- Major configuration changes
- Disabling any service
- Changes to multiple hosts
#### Interactive Discussion
- `macha-approve discuss <N>` enables interactive Q&A about proposed actions
- Implemented via `conversation.py` module
- Users can ask follow-up questions before approving/rejecting
- Provides detailed explanations and reasoning
- Commands: `approve`, `reject`, `exit` to control flow
### 11. Nix Store Maintenance
#### Verification & Repair
- Command: `nix-store --verify --check-contents --repair`
- **WARNING**: Can take 30+ minutes to several hours
- Only use when corruption is suspected
- Not for routine maintenance
- Verifies all store paths, repairs corrupted files
#### Garbage Collection
- Automatic via system configuration
- Can be triggered manually with approval
- Frees disk space by removing unused derivations
### 12. Conversational Behavior
#### Distinguish Requests from Acknowledgments
- "Thanks" / "Thank you" → Acknowledgment (don't re-execute)
- "Can you..." / "Please..." → Request (execute)
- "What is..." / "How do..." → Question (answer)
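
The distinction can be illustrated with a small keyword heuristic. In practice the model makes this judgement itself. The function name `classify_message()` and the rule that a request phrase anywhere in the message outranks a leading "thanks" are assumptions made for this sketch.

```python
import re


def classify_message(text: str) -> str:
    """Classify a user message as request, question, acknowledgment, or unknown."""
    msg = text.strip().lower()
    # A request phrase anywhere wins, e.g. "Thanks, please also check disk".
    if re.search(r"\b(can you|please)\b", msg):
        return "request"
    if re.match(r"(what is|how do)\b", msg):
        return "question"
    if re.match(r"(thanks|thank you)\b", msg):
        return "acknowledgment"
    return "unknown"
```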
#### Tool Calling
- Don't repeat tool calls unnecessarily
- If a tool succeeds, don't run it again unless asked
- Use cached results when available (`retrieve_cached_output`)
#### Context Management
- Be aware of token limits
- Use hierarchical processing for large outputs
- Prune conversation history intelligently
- Cache and summarize when needed
## Infrastructure Topology
### Managed Hosts
- **Self**: Main autonomous system running Macha
- **Configured hosts**: Systems defined in your NixOS flake
- **Auto-discovered hosts**: Additional systems detected via journal logs
### Shared Configuration
- All hosts share root SSH keys (for `nh` remote deployment)
- `macha` user (UID 2501) exists on all managed hosts
- Common NixOS configuration via flake
- Multi-OS support: NixOS, Ubuntu, Debian, Arch, macOS, and others
## Service Ecosystem
### Core Services on Macha
- `ollama.service`: LLM inference engine
- `ollama-queue-worker.service`: Request serialization
- `macha-autonomous.service`: Autonomous monitoring loop
- `chromadb.service`: Vector database for context and knowledge (port 8000)
### State Directories
- `/var/lib/macha/`: Main state directory (0755, macha:macha)
- `/var/lib/macha/queues/`: Queue directories (0777 for multi-user)
- `/var/lib/macha/tool_cache/`: Cached tool outputs (0777)
- `/var/lib/macha/logs/`: Log files and closed issues archive
- `/var/lib/chromadb/`: ChromaDB vector database storage
## CLI Tools
- `macha-chat`: Interactive chat with tool calling
- `macha-ask`: Single-question interface
- `macha-check`: Trigger immediate health check
- `macha-approve`: Approve pending actions
  - `macha-approve list` - Show pending actions
  - `macha-approve discuss <N>` - Interactive Q&A about action N
  - `macha-approve approve <N>` - Approve action N
  - `macha-approve reject <N>` - Reject action N
- `macha-logs`: View autonomous service logs
- `macha-issues`: Query issue database
- `macha-knowledge`: Query knowledge base
- `macha-systems`: List managed systems
- `macha-configs`: Semantic search for configuration files
- `macha-configs-read`: Read full configuration file content
- `macha-notify`: Send Gotify notification
## Architecture & Key Modules
### Core Modules
#### `agent.py` - AI Agent
- Interfaces with Ollama LLM for reasoning
- Implements tool calling with `tools.py`
- Manages conversation history and context
- Automatic learning via `reflect_and_learn()`
- Supports queue-based and direct API modes
#### `orchestrator.py` - Main Control Loop
- Continuous monitoring and health checks
- Coordinates all other components
- Manages check intervals and autonomy levels
- Initializes system registry and configuration parsing
- Handles system discovery and registration
#### `executor.py` - Safe Action Execution
- Manages approval queue for actions
- Respects autonomy levels (observe, suggest, auto-safe, auto-full)
- Executes approved actions with safety checks
- Logs all actions and outcomes
#### `tools.py` - System Administration Tools
- Defines all available tools for the AI agent
- Command allow-list for safe mode
- Executes system commands, reads files, checks services
- Implements hardware queries and GPU metrics
- Integrates with `command_patterns.py` for SSH
#### `command_patterns.py` - SSH Command Patterns
- **SINGLE SOURCE OF TRUTH** for SSH commands
- Builds SSH commands with correct key paths and options
- Handles automatic sudo prefixing for remote commands
- Provides `build_ssh_command()` and `build_scp_command()`
#### `context_db.py` - ChromaDB Frontend
- Structured interface to ChromaDB vector database
- Manages 6 collections: systems, relationships, issues, decisions, config_files, knowledge
- Implements semantic search for configurations and knowledge
- Tracks system relationships and dependencies
#### `monitor.py` - Local System Monitoring
- Collects system health metrics (CPU, memory, disk)
- Checks systemd services and recent errors
- Monitors NixOS generations and Nix store size
- Generates human-readable summaries
#### `remote_monitor.py` - Remote System Monitoring
- SSH-based monitoring of remote hosts
- Collects resources, services, disk, network status
- Verifies connectivity before operations
- Uses `command_patterns.py` for SSH access
#### `system_discovery.py` - Auto-Discovery
- Discovers new systems from journal logs
- Detects OS types via SSH probing
- Profiles systems: services, hardware, capabilities
- Determines system roles automatically
#### `issue_tracker.py` - Issue Management
- Creates, updates, resolves, and closes issues
- Finds similar past issues
- Auto-resolves issues when problems disappear
- Archives closed issues to JSONL logs
#### `notifier.py` - Gotify Integration
- Sends notifications at appropriate priority levels
- Special methods for common events (failures, discoveries, actions)
- Fails gracefully if Gotify unavailable
#### `ollama_queue.py` - Request Serialization
- File-based queue at `/var/lib/macha/queues/ollama/`
- Three priority levels: INTERACTIVE, AUTONOMOUS, BATCH
- Prevents resource contention on LLM
- Tracks request status: pending → processing → completed/failed
#### `ollama_worker.py` - Queue Worker Daemon
- Processes queue requests serially
- Runs as systemd service `ollama-queue-worker.service`
- Handles timeouts and failures gracefully
#### `conversation.py` - Interactive Discussion
- Implements `macha-approve discuss` feature
- Enables Q&A about proposed actions
- Maintains context during discussion
- Helps users understand AI reasoning
#### `config_parser.py` - Configuration File Parsing
- Parses NixOS configuration files from git repository
- Indexes configurations in ChromaDB for RAG
- Categorizes by type: apps, systems, osconfigs, users
#### `git_context.py` - Git Analysis
- Tracks recent configuration changes
- Provides context when debugging after deployments
- Correlates config changes with system issues
#### `journal_monitor.py` - Journal Log Monitoring
- Monitors systemd journal for specific patterns
- Triggers on error conditions
- Feeds into auto-discovery system
#### `seed_knowledge.py` - Knowledge Seeding
- Populates initial operational knowledge
- Loads foundational patterns and commands
- Run via `macha-knowledge seed`
#### `chat.py` - Interactive Chat Interface
- Implements `macha-chat` and `macha-ask` commands
- Manages conversation state
- Integrates with queue system for LLM requests
#### `module.nix` - NixOS Module
- Defines all configuration options
- Creates systemd services (macha-autonomous, ollama-queue-worker, chromadb)
- Sets up users, permissions, state directories
- Provides all CLI tool wrappers
## Philosophy & Principles
1. **KISS (Keep It Simple, Stupid)**: Use existing NixOS options, avoid custom wrappers
2. **Verify first**: Check source code/documentation before acting
3. **Safety first**: Never break critical services, always require approval for risky changes
4. **Learn continuously**: Extract and store operational knowledge
5. **Multi-host awareness**: Macha manages the entire infrastructure, not just herself
6. **User-friendly**: Clear communication, appropriate notifications
7. **Patience**: Long-running operations (builds, repairs) can take an hour or more - don't panic
8. **Tool reuse**: Use existing, verified tools instead of writing custom scripts
## Future Capabilities (Not Yet Implemented)
- [ ] Automatic security updates across all hosts
- [ ] Predictive failure detection
- [ ] Resource optimization recommendations
- [ ] Integration with other communication platforms
- [ ] Multi-agent coordination between hosts
- [ ] Automated testing before deployment