# Macha Autonomous System - Design Document

> **⚠️ IMPORTANT - READ THIS FIRST**
> **FOR AI ASSISTANT**: This document is YOUR reference guide when modifying Macha's code.
>
> - **ALWAYS consult this BEFORE refactoring** to ensure you don't remove existing capabilities
> - **CHECK this when adding features** to avoid conflicts
> - **UPDATE this document** when new capabilities are added
> - **DO NOT DELETE ANYTHING FROM THIS DOCUMENT**
> - During major refactors, you MUST verify each capability listed here is preserved

## Overview

Macha is an AI-powered autonomous system administrator capable of monitoring, maintaining, and managing multiple NixOS hosts in the infrastructure.

## Core Capabilities

### 1. Local System Management

- Monitor system health (CPU, memory, disk, services)
- Read and analyze logs via `journalctl`
- Check service status and restart failed services
- Execute system commands (with safety restrictions)
- Monitor and repair Nix store corruption
- Hardware awareness (CPU, GPU, network, storage)
- USB device detection and monitoring

### 2. Multi-Host Management via SSH

**Macha CAN and SHOULD use SSH to manage other hosts.**

#### SSH Access

- **CRITICAL**: All command patterns defined in `command_patterns.py` (SINGLE SOURCE OF TRUTH)
- Always uses explicit SSH key path: `-i /var/lib/macha/.ssh/id_ed25519`
- All SSH commands automatically include the `-i` flag with absolute key path
- Remote commands always prefixed with `sudo`
- Runs as `macha` user (UID 2501) for standard operations
- **Note**: Some internal operations (like remote monitoring) may use `root` SSH for privileged access
- **DO NOT DUPLICATE these patterns elsewhere** - import from `command_patterns.py`
- Has `NOPASSWD` sudo access for administrative commands
- Shares SSH keys with other hosts in the infrastructure
- Can SSH to any hosts defined in your NixOS flake configuration

#### SSH Usage Patterns

1. 
   **Direct diagnostic commands:**

   ```bash
   ssh hostname systemctl status service-name
   ssh hostname df -h
   ```

   - Commands automatically transformed by the tools layer
   - Full command: `ssh -i /var/lib/macha/.ssh/id_ed25519 -o StrictHostKeyChecking=no macha@hostname sudo systemctl status service-name`
   - SSH key path is always explicit, commands are automatically prefixed with `sudo`

2. **Status checks:**
   - Check service health on remote hosts
   - Gather system metrics
   - Review logs
   - Monitor resource usage

3. **File operations:**
   - Use `scp` to copy files between hosts
   - Read configuration files on remote systems

#### When to use SSH vs nh

- **SSH**: For diagnostics, status checks, log review, quick commands
- **nh remote deployment**: For applying NixOS configuration changes
  - `nh os switch -u --target-host=hostname --hostname=hostname`
  - Builds locally, deploys to remote host
  - Use for permanent configuration changes

### 3. Automatic System Discovery

#### Discovery from Journal Logs

- Monitors `systemd-journal-remote` logs for new systems sending logs
- Parses `_HOSTNAME` field from remote journal entries
- Automatically converts short hostnames to FQDNs (`.coven.systems`)
- Discovers systems within configurable time windows (default: 10 minutes)

#### OS Detection

- Detects operating system via SSH (`/etc/os-release`, `uname`)
- Supports: NixOS, Ubuntu, Debian, Arch, Fedora, RHEL, Alpine, macOS, FreeBSD
- Falls back to generic "linux" if specific distro cannot be determined
- Records OS type in system registry for appropriate management strategies

#### System Profiling

- Automatically gathers system information upon discovery:
  - Running services (via `systemctl list-units`)
  - Hardware info (CPU cores, memory)
  - Capabilities (containers, web-server, database, remote-access)
  - System role determination (ai-workstation, server, workstation, minimal)
- Registers discovered systems in ChromaDB context database
- Sends notification when new systems are discovered

### 4. NixOS Configuration Management

#### Local Changes

- Can propose changes to NixOS configuration
- Requires human approval before applying
- Uses `nh os switch` for local updates

#### Remote Deployment

- Can deploy to other hosts using `nh` with `--target-host`
- Builds configuration locally (on Macha)
- Pushes to remote system
- Can take up to 1 hour for complex builds
- **IMPORTANT**: Be patient with long-running builds, don't retry prematurely

### 5. Configuration File Intelligence (RAG)

#### Semantic Configuration Search

- NixOS configuration files indexed in ChromaDB for semantic search
- Query configurations by natural language: "gotify configuration", "journald settings"
- CLI tools: `macha-configs ` and `macha-configs-read `
- Relevance scoring helps find the right config files quickly
- Supports filtering by system hostname or category

#### Configuration Categories

- `apps/` - Application configurations (Gotify, Nextcloud, etc.)
- `systems/` - Per-host system configurations
- `osconfigs/` - Operating system level settings
- `users/` - User management configurations

#### Git Context Analysis

- Tracks recent changes to configuration files via `git_context.py`
- Correlates config changes with system behavior
- Provides context when investigating issues after deployments
- Helps understand "what changed" when debugging

### 6. Hardware Awareness

#### Local Hardware Detection

- CPU: `lscpu` via `nix-shell -p util-linux`
- GPU: `lspci` via `nix-shell -p pciutils`
- Network: `lsblk`, `ip addr`
- Storage: `df -h`, `lsblk`
- USB devices: `lsusb`

#### GPU Metrics

- AMD GPUs: Try `rocm-smi`, sysfs (`/sys/class/drm/card*/device/`)
- NVIDIA GPUs: Try `nvidia-smi`
- Fallback: `sensors` for temperature data
- Queries: temperature, utilization, clock speeds, power usage

### 7. Ollama Queue System

#### Architecture

- **File-based queue**: `/var/lib/macha/queues/ollama/`
- **Queue worker**: `ollama-queue-worker.service` (runs as `macha` user)
- **Purpose**: Serialize all LLM requests to prevent resource contention

#### Request Flow

1. Any user (including regular users) → Write request to `pending/`
2. Queue worker → Process requests serially (FIFO with priority)
3. Queue worker → Write response to `completed/`
4. Original requester → Read response from `completed/`

#### Priority Levels

- `INTERACTIVE` (0): User requests via `macha-chat`, `macha-ask`
- `AUTONOMOUS` (1): Background maintenance checks
- `BATCH` (2): Low-priority bulk operations

#### Large Output Handling

- Outputs >8KB: Split into chunks for hierarchical processing
  - Each chunk ~8KB (~2000 tokens)
  - Process chunks serially with progress feedback
  - Generate chunk summaries → meta-summary
- Full outputs cached in `/var/lib/macha/tool_cache/`

### 8. Knowledge Base & Learning

#### ChromaDB Architecture

- **Service**: ChromaDB runs as a standalone service on port 8000
- **Storage**: Data persisted at `/var/lib/chromadb`
- **Frontend**: `context_db.py` provides structured Python interface to ChromaDB
- **Connection**: HTTP client to `localhost:8000`

#### ChromaDB Collections

1. **systems**: Infrastructure topology, registered hosts, OS types
2. **relationships**: System dependencies and relationships
3. **issues**: Historical problems and resolutions
4. **decisions**: AI decisions and outcomes
5. **config_files**: NixOS configuration files for RAG
6. 
   **knowledge**: Operational wisdom learned from experience

#### Automatic Learning & Reflection

- After successful operations, Macha automatically reflects via `reflect_and_learn()`
- Extracts 1-2 specific, actionable learnings from each successful operation
- Stores: topic, knowledge content, category, confidence level
- Categories: command, pattern, troubleshooting, performance, general
- Retrieved automatically when relevant to current tasks
- Use `macha-knowledge` CLI to view/manage/search
- Use `seed_knowledge.py` to populate initial operational knowledge

### 9. Notifications

#### Gotify Integration

- Can send notifications via `macha-notify` command
- Tool: `send_notification(title, message, priority)`

#### Priority Levels

- `2` (Low/Info): Routine status updates, completed tasks
- `5` (Medium/Attention): Important events, configuration changes
- `8` (High/Critical): Service failures, critical errors, security issues

#### When to Notify

- Critical service failures
- Successful completion of major operations
- Configuration changes that may affect users
- Security-related events
- New system discoveries
- When explicitly requested by user

### 10. Safety & Constraints

#### Command Restrictions

**Allowed Commands** (see `tools.py` for full list):

- System management: `systemctl`, `journalctl`, `nh`, `nixos-rebuild`
- Monitoring: `free`, `df`, `uptime`, `ps`, `top`, `ip`, `ss`
- Hardware: `lscpu`, `lspci`, `lsblk`, `lshw`, `dmidecode`
- Remote: `ssh`, `scp`
- Power: `reboot`, `shutdown`, `poweroff` (use cautiously!)
- File ops: `cat`, `ls`, `grep`
- Network: `ping`, `dig`, `nslookup`, `curl`, `wget`
- Logging: `logger`

**NOT Allowed**:

- Direct package modifications (`nix-env`, `nix profile`)
- Destructive file operations (`rm -rf`, `dd`)
- User management outside of NixOS config
- Direct editing of system files (use NixOS config instead)

#### Critical Services

**Never disable or stop:**

- SSH (network access)
- Networking (connectivity)
- systemd (system management)
- Boot-related services

#### Approval Required

- Reboots or system power changes
- Major configuration changes
- Disabling any service
- Changes to multiple hosts

#### Interactive Discussion

- `macha-approve discuss ` enables interactive Q&A about proposed actions
- Implemented via `conversation.py` module
- Users can ask follow-up questions before approving/rejecting
- Provides detailed explanations and reasoning
- Commands: `approve`, `reject`, `exit` to control flow

### 11. Nix Store Maintenance

#### Verification & Repair

- Command: `nix-store --verify --check-contents --repair`
- **WARNING**: Can take 30+ minutes to several hours
- Only use when corruption is suspected
- Not for routine maintenance
- Verifies all store paths, repairs corrupted files

#### Garbage Collection

- Automatic via system configuration
- Can be triggered manually with approval
- Frees disk space by removing unused derivations

### 12. Conversational Behavior

#### Distinguish Requests from Acknowledgments

- "Thanks" / "Thank you" → Acknowledgment (don't re-execute)
- "Can you..." / "Please..." → Request (execute)
- "What is..." / "How do..." 
  → Question (answer)

#### Tool Calling

- Don't repeat tool calls unnecessarily
- If a tool succeeds, don't run it again unless asked
- Use cached results when available (`retrieve_cached_output`)

#### Context Management

- Be aware of token limits
- Use hierarchical processing for large outputs
- Prune conversation history intelligently
- Cache and summarize when needed

## Infrastructure Topology

### Managed Hosts

- **Self**: Main autonomous system running Macha
- **Configured hosts**: Systems defined in your NixOS flake
- **Auto-discovered hosts**: Additional systems detected via journal logs

### Shared Configuration

- All hosts share root SSH keys (for `nh` remote deployment)
- `macha` user (UID 2501) exists on all managed hosts
- Common NixOS configuration via flake
- Multi-OS support: NixOS, Ubuntu, Debian, Arch, macOS, and others

## Service Ecosystem

### Core Services on Macha

- `ollama.service`: LLM inference engine
- `ollama-queue-worker.service`: Request serialization
- `macha-autonomous.service`: Autonomous monitoring loop
- `chromadb.service`: Vector database for context and knowledge (port 8000)

### State Directories

- `/var/lib/macha/`: Main state directory (0755, macha:macha)
- `/var/lib/macha/queues/`: Queue directories (0777 for multi-user)
- `/var/lib/macha/tool_cache/`: Cached tool outputs (0777)
- `/var/lib/macha/logs/`: Log files and closed issues archive
- `/var/lib/chromadb/`: ChromaDB vector database storage

## CLI Tools

- `macha-chat`: Interactive chat with tool calling
- `macha-ask`: Single-question interface
- `macha-check`: Trigger immediate health check
- `macha-approve`: Approve pending actions
  - `macha-approve list` - Show pending actions
  - `macha-approve discuss ` - Interactive Q&A about action N
  - `macha-approve approve ` - Approve action N
  - `macha-approve reject ` - Reject action N
- `macha-logs`: View autonomous service logs
- `macha-issues`: Query issue database
- `macha-knowledge`: Query knowledge base
- `macha-systems`: List 
  managed systems
- `macha-configs`: Semantic search for configuration files
- `macha-configs-read`: Read full configuration file content
- `macha-notify`: Send Gotify notification

## Architecture & Key Modules

### Core Modules

#### `agent.py` - AI Agent

- Interfaces with Ollama LLM for reasoning
- Implements tool calling with `tools.py`
- Manages conversation history and context
- Automatic learning via `reflect_and_learn()`
- Supports queue-based and direct API modes

#### `orchestrator.py` - Main Control Loop

- Continuous monitoring and health checks
- Coordinates all other components
- Manages check intervals and autonomy levels
- Initializes system registry and configuration parsing
- Handles system discovery and registration

#### `executor.py` - Safe Action Execution

- Manages approval queue for actions
- Respects autonomy levels (observe, suggest, auto-safe, auto-full)
- Executes approved actions with safety checks
- Logs all actions and outcomes

#### `tools.py` - System Administration Tools

- Defines all available tools for the AI agent
- Command allow-list for safe mode
- Executes system commands, reads files, checks services
- Implements hardware queries and GPU metrics
- Integrates with `command_patterns.py` for SSH

#### `command_patterns.py` - SSH Command Patterns

- **SINGLE SOURCE OF TRUTH** for SSH commands
- Builds SSH commands with correct key paths and options
- Handles automatic sudo prefixing for remote commands
- Provides `build_ssh_command()` and `build_scp_command()`

#### `context_db.py` - ChromaDB Frontend

- Structured interface to ChromaDB vector database
- Manages 6 collections: systems, relationships, issues, decisions, config_files, knowledge
- Implements semantic search for configurations and knowledge
- Tracks system relationships and dependencies

#### `monitor.py` - Local System Monitoring

- Collects system health metrics (CPU, memory, disk)
- Checks systemd services and recent errors
- Monitors NixOS generations and Nix store size
- 
  Generates human-readable summaries

#### `remote_monitor.py` - Remote System Monitoring

- SSH-based monitoring of remote hosts
- Collects resources, services, disk, network status
- Verifies connectivity before operations
- Uses `command_patterns.py` for SSH access

#### `system_discovery.py` - Auto-Discovery

- Discovers new systems from journal logs
- Detects OS types via SSH probing
- Profiles systems: services, hardware, capabilities
- Determines system roles automatically

#### `issue_tracker.py` - Issue Management

- Creates, updates, resolves, and closes issues
- Finds similar past issues
- Auto-resolves issues when problems disappear
- Archives closed issues to JSONL logs

#### `notifier.py` - Gotify Integration

- Sends notifications at appropriate priority levels
- Special methods for common events (failures, discoveries, actions)
- Fails gracefully if Gotify unavailable

#### `ollama_queue.py` - Request Serialization

- File-based queue at `/var/lib/macha/queues/ollama/`
- Three priority levels: INTERACTIVE, AUTONOMOUS, BATCH
- Prevents resource contention on LLM
- Tracks request status: pending → processing → completed/failed

#### `ollama_worker.py` - Queue Worker Daemon

- Processes queue requests serially
- Runs as systemd service `ollama-queue-worker.service`
- Handles timeouts and failures gracefully

#### `conversation.py` - Interactive Discussion

- Implements `macha-approve discuss` feature
- Enables Q&A about proposed actions
- Maintains context during discussion
- Helps users understand AI reasoning

#### `config_parser.py` - Configuration File Parsing

- Parses NixOS configuration files from git repository
- Indexes configurations in ChromaDB for RAG
- Categorizes by type: apps, systems, osconfigs, users

#### `git_context.py` - Git Analysis

- Tracks recent configuration changes
- Provides context when debugging after deployments
- Correlates config changes with system issues

#### `journal_monitor.py` - Journal Log Monitoring

- Monitors systemd journal for 
  specific patterns
- Triggers on error conditions
- Feeds into auto-discovery system

#### `seed_knowledge.py` - Knowledge Seeding

- Populates initial operational knowledge
- Loads foundational patterns and commands
- Run via `macha-knowledge seed`

#### `chat.py` - Interactive Chat Interface

- Implements `macha-chat` and `macha-ask` commands
- Manages conversation state
- Integrates with queue system for LLM requests

#### `module.nix` - NixOS Module

- Defines all configuration options
- Creates systemd services (macha-autonomous, ollama-queue-worker, chromadb)
- Sets up users, permissions, state directories
- Provides all CLI tool wrappers

## Philosophy & Principles

1. **KISS (Keep It Simple, Stupid)**: Use existing NixOS options, avoid custom wrappers
2. **Verify first**: Check source code/documentation before acting
3. **Safety first**: Never break critical services, always require approval for risky changes
4. **Learn continuously**: Extract and store operational knowledge
5. **Multi-host awareness**: Macha manages the entire infrastructure, not just herself
6. **User-friendly**: Clear communication, appropriate notifications
7. **Patience**: Long-running operations (builds, repairs) can take an hour - don't panic
8. **Tool reuse**: Use existing, verified tools instead of writing custom scripts

## Future Capabilities (Not Yet Implemented)

- [ ] Automatic security updates across all hosts
- [ ] Predictive failure detection
- [ ] Resource optimization recommendations
- [ ] Integration with other communication platforms
- [ ] Multi-agent coordination between hosts
- [ ] Automated testing before deployment
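
## Appendix: SSH Command Sketch (Illustrative)

The transformation described under "SSH Usage Patterns" (a short `ssh hostname ...` request expanded into the full command with explicit key path, user, and `sudo` prefix) can be sketched roughly as below. This is a minimal illustration, not the code in `command_patterns.py`: the function name `sketch_build_ssh_command` and its signature are hypothetical, and the real `build_ssh_command()` may differ. Only the key path, user name, and SSH option are taken from this document.

```python
import shlex

# Values taken from the "SSH Access" and "SSH Usage Patterns" sections.
SSH_KEY_PATH = "/var/lib/macha/.ssh/id_ed25519"
SSH_USER = "macha"


def sketch_build_ssh_command(host: str, remote_command: str) -> list[str]:
    """Hypothetical stand-in for build_ssh_command(); the real signature may differ."""
    remote = remote_command.strip()
    if not remote:
        raise ValueError("remote_command must not be empty")
    # Remote commands are always prefixed with sudo; avoid adding it twice.
    if shlex.split(remote)[0] != "sudo":
        remote = f"sudo {remote}"
    return [
        "ssh",
        "-i", SSH_KEY_PATH,                # explicit absolute key path
        "-o", "StrictHostKeyChecking=no",
        f"{SSH_USER}@{host}",
        remote,                            # passed to the remote shell as one string
    ]


if __name__ == "__main__":
    argv = sketch_build_ssh_command("hostname", "systemctl status service-name")
    print(" ".join(argv))
```

Returning an argument list rather than a single string lets a caller hand the result to `subprocess.run()` without involving a local shell. Any real usage should still import the patterns from `command_patterns.py` instead of duplicating them.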