Lily Miller 2f367f7cdc Refactor: Centralize command patterns in single source of truth

CRITICAL: Prevents inconsistent sudo/SSH patterns across codebase.

Created command_patterns.py with:
- Single source of truth for ALL command execution patterns
- SSH key path constant: /var/lib/macha/.ssh/id_ed25519
- Remote user constant: macha
- sudo prefix for all remote commands
- Helper functions: build_ssh_command(), transform_ssh_command()
- Self-validation tests

Updated files to use centralized patterns:
- tools.py: Uses transform_ssh_command()
- remote_monitor.py: Uses build_ssh_command()
- system_discovery.py: Uses build_ssh_command()
- DESIGN.md: Documents centralized approach

Benefits:
- Impossible to have inconsistent patterns
- Single place to update if needed
- Self-documenting with validation tests
- Prevents future refactoring errors

DO NOT duplicate these patterns in other files - always import.

2025-10-06 16:06:31 -06:00

Macha Autonomous System - Design Document

⚠️ IMPORTANT - READ THIS FIRST
FOR AI ASSISTANT: This document is YOUR reference guide when modifying Macha's code.

ALWAYS consult this BEFORE refactoring to ensure you don't remove existing capabilities

CHECK this when adding features to avoid conflicts

UPDATE this document when new capabilities are added

DO NOT DELETE ANYTHING FROM THIS DOCUMENT

During major refactors, you MUST verify each capability listed here is preserved

Overview

Macha is an AI-powered autonomous system administrator capable of monitoring, maintaining, and managing multiple NixOS hosts in the infrastructure.

Core Capabilities

1. Local System Management

Monitor system health (CPU, memory, disk, services)
Read and analyze logs via journalctl
Check service status and restart failed services
Execute system commands (with safety restrictions)
Monitor and repair Nix store corruption
Hardware awareness (CPU, GPU, network, storage)

2. Multi-Host Management via SSH

Macha CAN and SHOULD use SSH to manage other hosts.

SSH Access

CRITICAL: All command patterns are defined in command_patterns.py (SINGLE SOURCE OF TRUTH)
Always uses explicit SSH key path: -i /var/lib/macha/.ssh/id_ed25519
All SSH commands automatically include the -i flag with absolute key path
Remote commands always prefixed with sudo
Runs as macha user (UID 2501)
DO NOT DUPLICATE these patterns elsewhere - import from command_patterns.py
Has NOPASSWD sudo access for administrative commands
Shares SSH keys with other hosts in the infrastructure
Can SSH to: rhiannon, alexander, UCAR-Kinston, and others in the flake

SSH Usage Patterns

Direct diagnostic commands:
```
ssh rhiannon systemctl status ollama
ssh alexander df -h
```
- Commands automatically transformed by the tools layer
- Full command: ssh -i /var/lib/macha/.ssh/id_ed25519 -o StrictHostKeyChecking=no macha@rhiannon sudo systemctl status ollama
- SSH key path is always explicit, commands are automatically prefixed with sudo
Status checks:
- Check service health on remote hosts
- Gather system metrics
- Review logs
- Monitor resource usage
File operations:
- Use scp to copy files between hosts
- Read configuration files on remote systems
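
The transformation described under "Direct diagnostic commands" can be sketched as follows. This is a minimal illustration, not the contents of command_patterns.py: the key path, user, and StrictHostKeyChecking option come from this document, but the function signatures, the SSH_OPTIONS name, and the parsing rules are assumptions.

```python
import shlex
from typing import List

# Values taken from this document.
SSH_KEY_PATH = "/var/lib/macha/.ssh/id_ed25519"
REMOTE_USER = "macha"
# Hypothetical constant name; the option itself appears in the full command above.
SSH_OPTIONS = ["-o", "StrictHostKeyChecking=no"]


def build_ssh_command(host: str, remote_args: List[str]) -> List[str]:
    """Return the argv that runs remote_args on host as macha, prefixed with sudo."""
    return ["ssh", "-i", SSH_KEY_PATH, *SSH_OPTIONS,
            f"{REMOTE_USER}@{host}", "sudo", *remote_args]


def transform_ssh_command(command: str) -> str:
    """Rewrite a short 'ssh HOST CMD...' string into the full explicit form.

    Assumed behavior: anything that is not a plain 'ssh HOST CMD...' call
    (for example, one that already passes options) is returned unchanged.
    """
    parts = shlex.split(command)
    if len(parts) < 3 or parts[0] != "ssh" or parts[1].startswith("-"):
        return command
    host = parts[1].split("@")[-1]  # drop any user@ prefix; the user is always macha
    remote_args = parts[2:]
    if remote_args[0] == "sudo" and len(remote_args) > 1:
        remote_args = remote_args[1:]  # avoid a doubled sudo prefix
    return shlex.join(build_ssh_command(host, remote_args))


print(transform_ssh_command("ssh rhiannon systemctl status ollama"))
```

Keeping the argv construction in one function and the string rewriting in another mirrors the split between build_ssh_command() and transform_ssh_command() named in the commit message; callers that already know the host and command can skip string parsing entirely.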

When to use SSH vs nh

SSH: For diagnostics, status checks, log review, quick commands
nh remote deployment: For applying NixOS configuration changes
- nh os switch -u --target-host=rhiannon --hostname=rhiannon
- Builds locally, deploys to remote host
- Use for permanent configuration changes

3. NixOS Configuration Management

Local Changes

Can propose changes to NixOS configuration
Requires human approval before applying
Uses nh os switch for local updates

Remote Deployment

Can deploy to other hosts using nh with --target-host
Builds configuration locally (on Macha)
Pushes to remote system
Can take up to 1 hour for complex builds
IMPORTANT: Be patient with long-running builds, don't retry prematurely
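
A remote deployment that respects the "be patient" rule could look like the sketch below. The nh flags are the ones shown under "When to use SSH vs nh"; the function names, the timeout constant, and the choice to return a boolean are assumptions made for illustration.

```python
import subprocess
from typing import List

# "Can take up to 1 hour" per this document; the constant name is hypothetical.
DEPLOY_TIMEOUT_SECONDS = 60 * 60


def build_nh_deploy_command(host: str) -> List[str]:
    """Return the nh invocation from this document, parameterized by host."""
    return ["nh", "os", "switch", "-u",
            f"--target-host={host}", f"--hostname={host}"]


def deploy(host: str) -> bool:
    """Run one deployment and wait for it; a slow build is not retried."""
    try:
        result = subprocess.run(
            build_nh_deploy_command(host),
            timeout=DEPLOY_TIMEOUT_SECONDS,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

The sketch deliberately has no retry loop: a single long wait matches the guidance above, and any approval step would have to happen before deploy() is called.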

4. Hardware Awareness

Local Hardware Detection

CPU: lscpu via nix-shell -p util-linux
GPU: lspci via nix-shell -p pciutils
Network: ip addr
Storage: df -h, lsblk
USB devices: lsusb

GPU Metrics

AMD GPUs: Try rocm-smi, sysfs (/sys/class/drm/card*/device/)
NVIDIA GPUs: Try nvidia-smi
Fallback: sensors for temperature data
Queries: temperature, utilization, clock speeds, power usage
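
The fallback order above can be expressed as a small probe selector. This is a sketch under assumptions: the tool names come from this document, but the preference order, the function name, and the decision to leave the sysfs path out are illustrative only.

```python
import shutil
from typing import Callable, Optional, Sequence

# Tools named in this document, in an assumed order of preference.
GPU_PROBES: Sequence[str] = ("rocm-smi", "nvidia-smi", "sensors")


def select_gpu_probe(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first GPU metrics tool found on PATH, or None if none exist."""
    for tool in GPU_PROBES:
        if which(tool) is not None:
            return tool
    return None
```

Passing the lookup function as a parameter keeps the selector testable on machines without any GPU tooling installed.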

5. Ollama Queue System

Architecture

File-based queue: /var/lib/macha/queues/ollama/
Queue worker: ollama-queue-worker.service (runs as macha user)
Purpose: Serialize all LLM requests to prevent resource contention

Request Flow

Any user (including regular users) → Write request to pending/
Queue worker → Process requests serially (FIFO with priority)
Queue worker → Write response to completed/
Original requester → Read response from completed/

Priority Levels

INTERACTIVE (0): User requests via macha-chat, macha-ask
AUTONOMOUS (1): Background maintenance checks
BATCH (2): Low-priority bulk operations
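
The request flow and priority levels can be illustrated with a minimal file-based queue. The pending/ directory and the three priority values come from this document; the file naming scheme, the JSON fields, and the function names are assumptions, not the worker's actual format.

```python
import json
import time
import uuid
from enum import IntEnum
from pathlib import Path
from typing import Optional


class Priority(IntEnum):
    INTERACTIVE = 0
    AUTONOMOUS = 1
    BATCH = 2


def enqueue(queue_dir: Path, prompt: str, priority: Priority) -> Path:
    """Write one request into pending/ and return its path."""
    pending = queue_dir / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming: priority first, then a zero-padded timestamp, so a
    # plain lexicographic sort yields "priority, then FIFO" order.
    name = f"{int(priority)}-{time.time_ns():020d}-{uuid.uuid4().hex}.json"
    tmp = pending / (name + ".tmp")
    tmp.write_text(json.dumps({"prompt": prompt, "priority": int(priority)}))
    final = pending / name
    tmp.rename(final)  # rename so the worker never sees a half-written file
    return final


def next_request(queue_dir: Path) -> Optional[Path]:
    """Return the request the worker should process next, or None if idle."""
    pending = queue_dir / "pending"
    if not pending.is_dir():
        return None
    candidates = sorted(pending.glob("*.json"))
    return candidates[0] if candidates else None
```

Encoding the ordering in the file name means the worker needs no index or lock to pick the next job; it only lists the directory and sorts.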

Large Output Handling

Outputs >8KB: Split into chunks for hierarchical processing
Each chunk ~8KB (~2000 tokens)
Process chunks serially with progress feedback
Generate chunk summaries → meta-summary
Full outputs cached in /var/lib/macha/tool_cache/
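
The chunking step can be sketched as below. The 8 KB threshold comes from this document; splitting on line boundaries, measuring size in characters rather than bytes, and the function name are assumptions. Summarization and caching are out of scope here.

```python
from typing import List

CHUNK_SIZE = 8 * 1024  # ~8KB per chunk, as described above


def split_output(text: str, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Lines are kept whole where possible; a single line longer than
    chunk_size is cut hard.
    """
    if len(text) <= chunk_size:
        return [text]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for line in text.splitlines(keepends=True):
        while len(line) > chunk_size:
            if current:
                chunks.append("".join(current))
                current, current_len = [], 0
            chunks.append(line[:chunk_size])
            line = line[chunk_size:]
        if current and current_len + len(line) > chunk_size:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Joining the chunks always reproduces the original text, so a per-chunk summary followed by a meta-summary sees every byte of the output exactly once.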

6. Knowledge Base & Learning

ChromaDB Collections

System Context: Infrastructure topology, service relationships
Issues: Historical problems and resolutions
Knowledge: Operational wisdom learned from experience

Automatic Learning

After successful operations, Macha reflects and extracts key learnings
Stores: topic, knowledge content, category
Retrieved automatically when relevant to current tasks
Use macha-knowledge CLI to view/manage

7. Notifications

Gotify Integration

Can send notifications via macha-notify command
Tool: send_notification(title, message, priority)

Priority Levels

2 (Low/Info): Routine status updates, completed tasks
5 (Medium/Attention): Important events, configuration changes
8 (High/Critical): Service failures, critical errors, security issues
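
The three priority levels map naturally onto an enum. The sketch below only builds the message body; it assumes a payload with title, message, and priority fields (the fields Gotify messages carry) and assumes that values other than 2, 5, and 8 should be rejected. The class and function names are hypothetical.

```python
from enum import IntEnum
from typing import Dict, Union


class NotifyPriority(IntEnum):
    LOW = 2      # routine status updates, completed tasks
    MEDIUM = 5   # important events, configuration changes
    HIGH = 8     # service failures, critical errors, security issues


def build_notification(
    title: str, message: str, priority: int
) -> Dict[str, Union[str, int]]:
    """Validate priority and return the notification payload."""
    level = NotifyPriority(priority)  # raises ValueError for any other value
    return {"title": title, "message": message, "priority": int(level)}
```

Rejecting unlisted values early keeps a typo such as priority 9 from silently producing a notification at an undocumented level.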

When to Notify

Critical service failures
Successful completion of major operations
Configuration changes that may affect users
Security-related events
When explicitly requested by user

8. Safety & Constraints

Command Restrictions

Allowed Commands (see tools.py for full list):

System management: systemctl, journalctl, nh, nixos-rebuild
Monitoring: free, df, uptime, ps, top, ip, ss
Hardware: lscpu, lspci, lsblk, lshw, dmidecode
Remote: ssh, scp
Power: reboot, shutdown, poweroff (use cautiously!)
File ops: cat, ls, grep
Network: ping, dig, nslookup, curl, wget
Logging: logger

NOT Allowed:

Direct package modifications (nix-env, nix profile)
Destructive file operations (rm -rf, dd)
User management outside of NixOS config
Direct editing of system files (use NixOS config instead)
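
An allowlist check along these lines can enforce the restrictions above. It is a sketch, not the logic in tools.py: the command names are copied from the list in this section (which the document says is partial), while the metacharacter rule and the function name are assumptions.

```python
import shlex
from typing import FrozenSet

# Commands listed in this section; tools.py holds the full list.
ALLOWED_COMMANDS: FrozenSet[str] = frozenset({
    "systemctl", "journalctl", "nh", "nixos-rebuild",
    "free", "df", "uptime", "ps", "top", "ip", "ss",
    "lscpu", "lspci", "lsblk", "lshw", "dmidecode",
    "ssh", "scp",
    "reboot", "shutdown", "poweroff",
    "cat", "ls", "grep",
    "ping", "dig", "nslookup", "curl", "wget",
    "logger",
})

# Assumed rule: refuse shell chaining and substitution outright.
SHELL_METACHARACTERS = set(";|&`$<>")


def is_command_allowed(command: str) -> bool:
    """Return True only if the first word of command is on the allowlist."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        parts = shlex.split(command)
    except ValueError:  # unbalanced quotes
        return False
    return bool(parts) and parts[0] in ALLOWED_COMMANDS
```

Because the check is an allowlist, the "NOT Allowed" entries such as rm, dd, and nix-env are rejected without being named. Power commands pass this check but still fall under "Approval Required".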

Critical Services

Never disable or stop:

SSH (network access)
Networking (connectivity)
systemd (system management)
Boot-related services

Approval Required

Reboots or system power changes
Major configuration changes
Disabling any service
Changes to multiple hosts

9. Nix Store Maintenance

Verification & Repair

Command: nix-store --verify --check-contents --repair
WARNING: Can take 30+ minutes to several hours
Only use when corruption is suspected
Not for routine maintenance
Verifies all store paths, repairs corrupted files

Garbage Collection

Automatic via system configuration
Can be triggered manually with approval
Frees disk space by removing unused derivations

10. Conversational Behavior

Distinguish Requests from Acknowledgments

"Thanks" / "Thank you" → Acknowledgment (don't re-execute)
"Can you..." / "Please..." → Request (execute)
"What is..." / "How do..." → Question (answer)

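The three message categories above can be illustrated with a deliberately naive prefix check. This is only a sketch of the distinction, assuming nothing about how Macha actually classifies messages; the function name, the return labels, and the "unknown" fallback are hypothetical.

```python
def classify_message(text: str) -> str:
    """Classify a user message by its opening words."""
    normalized = text.strip().lower()
    if normalized.startswith(("thanks", "thank you")):
        return "acknowledgment"  # do not re-execute anything
    if normalized.startswith(("can you", "please")):
        return "request"         # execute
    if normalized.startswith(("what is", "how do")):
        return "question"        # answer
    return "unknown"
```

A prefix check misreads mixed messages such as "Thanks, can you also restart it?", so it is best treated as a way to pin down the intended behavior rather than as a working classifier.
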
Tool Calling

Don't repeat tool calls unnecessarily
If a tool succeeds, don't run it again unless asked
Use cached results when available (retrieve_cached_output)

Context Management

Be aware of token limits
Use hierarchical processing for large outputs
Prune conversation history intelligently
Cache and summarize when needed

Infrastructure Topology

Hosts in Flake

macha: Main autonomous system (self), GPU server
rhiannon: Production server
alexander: Production server
UCAR-Kinston: Work laptop
test-vm: Testing environment

Shared Configuration

All hosts share root SSH keys (for nh remote deployment)
macha user (UID 2501) exists on all hosts
Common NixOS configuration via flake

Service Ecosystem

Core Services on Macha

ollama.service: LLM inference engine
ollama-queue-worker.service: Request serialization
macha-autonomous.service: Autonomous monitoring loop
Servarr stack: Sonarr, Radarr, Prowlarr, Lidarr, Readarr, Whisparr
Media: Transmission, SABnzbd, Calibre

State Directories

/var/lib/macha/: Main state directory (0755, macha:macha)
/var/lib/macha/queues/: Queue directories (0777 for multi-user)
/var/lib/macha/tool_cache/: Cached tool outputs (0777)
/var/lib/macha/system_context.db: ChromaDB database

CLI Tools

macha-chat: Interactive chat with tool calling
macha-ask: Single-question interface
macha-check: Trigger immediate health check
macha-approve: Approve pending actions
macha-logs: View autonomous service logs
macha-issues: Query issue database
macha-knowledge: Query knowledge base
macha-systems: List managed systems
macha-notify: Send Gotify notification

Philosophy & Principles

KISS (Keep It Simple, Stupid): Use existing NixOS options, avoid custom wrappers
Verify first: Check source code/documentation before acting
Safety first: Never break critical services, always require approval for risky changes
Learn continuously: Extract and store operational knowledge
Multi-host awareness: Macha manages the entire infrastructure, not just herself
User-friendly: Clear communication, appropriate notifications
Patience: Long-running operations (builds, repairs) can take an hour or more - don't panic
Tool reuse: Use existing, verified tools instead of writing custom scripts

Future Capabilities (Not Yet Implemented)

Automatic security updates across all hosts
Predictive failure detection
Resource optimization recommendations
Integration with other communication platforms
Multi-agent coordination between hosts
Automated testing before deployment
