Orbiit Recovery

Architecture Decision Records

5 ADRs that show how we think. Every significant decision - documented before implementation, with trade-offs, alternatives rejected, and consequences accepted.

Selected from 48 ADRs in the codebase
ADR-0006: Micro-Courses
ADR-0009: User Hierarchy
ADR-0011: Passwordless Auth
ADR-0014: AI Collaboration
ADR-0046: Fail-Open

Why ADRs Matter

Architecture Decision Records are the backbone of how this platform was built. When AI is your primary development tool, tribal knowledge doesn't work - you can't say "ask Steve, he built that part." Every decision needs to be written down: the context, the constraints, the options considered, what we chose, and why.

48 ADRs in 6 months. That's more architectural documentation than most Series B companies have. These aren't templates filled in for compliance - they're working documents that the AI reads before touching the codebase. When a new session starts, the first thing that happens is reading the relevant ADRs. That's how context survives across sessions.

What you're reading below are 5 curated selections - chosen to show range: product architecture (0006), data modeling (0009), security design (0011), methodology (0014), and a production incident turned into principled design (0046). The other 43 cover everything from HIPAA compliance to UUID migration to AI kill switches.


ADR-0006 Accepted August 2025

Micro-Courses Architecture & JIT Token Delivery

This is the core product architecture. Orbiit delivers 8 daily touchpoints to patients via SMS - meditations, affirmations, reflections (text-only), and micro-courses (magic link to mobile web page with quiz). The question was: how do you deliver personalized content to someone without making them log in?

The Problem

Patients in early recovery have high cognitive load and low tolerance for friction. Passwords are a barrier. App downloads are a barrier. Even email links are a barrier - SMS has a 98% open rate vs 20% for email. We needed a way to deliver one course to one user, collect quiz responses, and attribute the interaction - without standing up full login flows.

The Decision

Scope-bound, short-TTL magic links. Each token binds to (user_id, course_id, run_id) with a 14-day TTL. No PII in the URL. The token can only read one course and submit one quiz response. Just-in-time generation reduces exposure by 96.7% compared to pre-generated tokens.

    Patient's Phone
         │  SMS with magic link
    Astro /course/{token} ── GET ──► Django API ── validates token ──► PostgreSQL
         │                               │
         │  quiz submit                  │  marks opened
         └──── POST answers + token ─────┘  SOBER Score recalculation
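The scope-bound, short-TTL token described in the decision can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the ADR's implementation: the in-memory `_TOKENS` dict stands in for the `MagicLinkToken` table, and `TokenBinding`, `issue_token`, and `resolve_token` are hypothetical names.

```python
import secrets
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

TOKEN_TTL = timedelta(days=14)  # TTL stated in the ADR


@dataclass(frozen=True)
class TokenBinding:
    user_id: int
    course_id: str
    run_id: uuid.UUID
    expires_at: datetime


# Hypothetical in-memory store; the real system persists tokens in PostgreSQL.
_TOKENS: Dict[str, TokenBinding] = {}


def issue_token(user_id: int, course_id: str, now: datetime) -> str:
    """Create a token at send time, bound to one user, one course, one run."""
    token = secrets.token_urlsafe(32)  # 32 random bytes; no PII in the URL
    _TOKENS[token] = TokenBinding(user_id, course_id, uuid.uuid4(), now + TOKEN_TTL)
    return token


def resolve_token(token: str, now: datetime) -> Optional[TokenBinding]:
    """Return the binding for a live token, or None if unknown or expired."""
    binding = _TOKENS.get(token)
    if binding is None or now >= binding.expires_at:
        return None
    return binding


start = datetime(2025, 8, 1, 9, 0, tzinfo=timezone.utc)
link_token = issue_token(user_id=42, course_id="C00101", now=start)
```

Because the lookup returns only the single (user, course, run) binding, a leaked token cannot be widened to other courses or other users in this sketch.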

Daily Delivery Cadence

| Time | Type | Delivery |
| --- | --- | --- |
| 7:00 AM | Meditation | Text-only SMS |
| 9:00 AM | Course 1 | Magic link → mobile web + quiz |
| 11:30 AM | Affirmation | Text-only SMS |
| 1:00 PM | Course 2 | Magic link → mobile web + quiz |
| 3:30 PM | Affirmation | Text-only SMS |
| 5:00 PM | Course 3 | Magic link → mobile web + quiz |
| 7:30 PM | Affirmation | Text-only SMS |
| 9:00 PM | Reflection | Text-only SMS |
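As a rough sketch, the cadence above can be expanded into one planned delivery per slot. The `DAILY_SLOTS` constant and `build_day_plan` function are hypothetical names, the dict keys only loosely mirror the `DeliveryPlan` fields, and time zones are ignored here.

```python
from datetime import date, datetime, time

# (send time, content type) for the eight daily slots in the cadence table
DAILY_SLOTS = [
    (time(7, 0), "meditation"),
    (time(9, 0), "course"),
    (time(11, 30), "affirmation"),
    (time(13, 0), "course"),
    (time(15, 30), "affirmation"),
    (time(17, 0), "course"),
    (time(19, 30), "affirmation"),
    (time(21, 0), "reflection"),
]


def build_day_plan(day: date) -> list:
    """Expand the slot table into one planned delivery per slot."""
    plan = []
    for order, (send_time, content_type) in enumerate(DAILY_SLOTS, start=1):
        plan.append({
            "order": order,  # slot 1-8
            "type": content_type,
            "send_at": datetime.combine(day, send_time),  # naive local time in this sketch
            "link_required": content_type == "course",  # only courses get a magic link
            "status": "planned",
        })
    return plan


plan = build_day_plan(date(2025, 8, 4))
```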

Content Types & IDs

| Type | ID Pattern | Delivery | Example |
| --- | --- | --- | --- |
| Course | C##### | Magic link → mobile web page + quiz | C00101 |
| Meditation | M### | Text-only SMS body | M001 |
| Affirmation | A### | Text-only SMS body | A001 |
| Reflection | R### | Text-only SMS body | R001 |

Text-only items never require a web page. Courses always render a page + quiz.
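The ID patterns lend themselves to a small classifier. The following is a sketch only: `classify_content_id` and `link_required` are hypothetical helpers, and the digit counts are read directly off the table above.

```python
import re

# One letter prefix, then 5 digits for courses or 3 digits for text-only items
_ID_PATTERNS = {
    "course": re.compile(r"C\d{5}"),
    "meditation": re.compile(r"M\d{3}"),
    "affirmation": re.compile(r"A\d{3}"),
    "reflection": re.compile(r"R\d{3}"),
}


def classify_content_id(content_id: str) -> str:
    """Map a content ID such as 'C00101' to its content type."""
    for content_type, pattern in _ID_PATTERNS.items():
        if pattern.fullmatch(content_id):
            return content_type
    raise ValueError(f"Unrecognized content ID: {content_id!r}")


def link_required(content_id: str) -> bool:
    """Only courses render a web page, so only they need a magic link."""
    return classify_content_id(content_id) == "course"
```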

Data Model

    # Core models (apps.courses, apps.messaging)

    class MagicLinkToken:
        token           # secrets.token_urlsafe(32) - 256-bit, unique, indexed
        user            # FK to User - who this token authenticates
        course          # FK to Course - what content it unlocks
        run_id          # UUID - enables re-assignment without invalidating history
        expires_at      # Default 14 days TTL
        used_at         # Tracks first use for audit

    class UserCourseAssignment:
        user, course, run_id
        status          # draft → sent → opened → completed
        sent_at, opened_at, completed_at

    class ContentItem:
        id              # C/M/A/R prefix ID
        type            # course | meditation | affirmation | reflection
        day, order      # Day 1-180, slot 1-8
        title, body
        link_required   # True for courses, False for text-only

    class DeliveryPlan:
        user, date, day, order
        content_item, send_at
        status          # planned → queued → sent → delivered → opened → completed → failed
        run_id, retries, last_error

    class OutboundMessage:
        user, content_item
        twilio_sid, body, send_at
        status, status_payload  # JSONB from Twilio webhooks

API Contract

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/v1/courses/magic/{token} | GET | Read course via magic token. Marks assignment opened (idempotent). |
| /api/v1/submissions/token | POST | Submit quiz answers. Resolves (user, course) by token. Creates Submission once per (user, quiz, run_id). Returns 201 or 200 already_completed. |
| /api/v1/sms/status | POST | Twilio status callback. Updates OutboundMessage/DeliveryPlan. |
| /api/v1/sms/inbound | POST | Twilio inbound. Handles STOP/START/UNSUBSCRIBE/CANCEL/END/QUIT. |

First-Party Scheduler (No GHL/Make)

Edge Cases

Security Properties

Why Not Alternatives?

Why This ADR Matters

This is the product's DNA. The magic link architecture is what makes Orbiit work for a population that won't download apps, won't remember passwords, and won't tolerate friction. Every design choice - TTLs, scope binding, idempotent submission, JIT generation - flows from one principle: meet the patient where they are.


ADR-0009 Accepted October 2025

User Hierarchy & Profile-Based RBAC

The platform serves multiple user types with very different access needs: patients viewing their own recovery data, clinicians managing caseloads, administrators overseeing clinics or entire organizations, billers processing claims, and platform superadmins. The question: how do you model this without a God Object User model?

The Design Principle

"Clinician and Admin are separate profiles, not roles." A StaffMember can have a ClinicianProfile (clinical work), an AdministratorProfile (management), a BillingProfile (financial), or any combination. In small treatment centers, the same person often wears multiple hats - this architecture handles that without role flags or permission spaghetti.

    User (Django AbstractUser)
    ├── user_type: patient | staff | superadmin
    ├── Patient (if patient)
    │   ├── Organization → Region → Clinic
    │   └── 1 primary clinician + N additional
    ├── StaffMember (if staff)
    │   ├── ClinicianProfile (optional)
    │   ├── AdministratorProfile (optional)
    │   ├── BillingProfile (optional)
    │   └── ResearcherProfile (optional)
    └── SuperAdminPermission (if superuser)
        └── Granular toggles (AWS IAM-style)

Multi-Tenant Hierarchy

Every query is scoped by organization. The hierarchy is Organization → Region → Clinic → Patient. Administrator scope is explicit: each AdministratorProfile is scoped to a clinic, a region, or the whole organization.

Permission Matrix (Abbreviated)

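| Action | Patient | Clinician | Region Admin | Org Admin | Super Admin |
| --- | --- | --- | --- | --- | --- |
| View own data | Yes | Yes | Yes | Yes | Yes |
| View assigned patients | - | Yes | Yes | Yes | Yes |
| View all org patients | - | - | - | Yes | Yes |
| Create staff | - | - | Region | Org-wide | Yes* |
| Impersonate users | - | - | - | - | Yes* |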

* Requires specific toggle enabled (IAM-style granular permissions)

The Models (Actual Code)

    class StaffMember(models.Model):
        """Base model for all organizational employees.
        Can have 0, 1, or multiple profiles simultaneously."""
        user = OneToOneField(User, related_name='staff_profile')
        organization = ForeignKey(Organization, on_delete=PROTECT)
        region = ForeignKey(Region, null=True, blank=True)
        clinic = ForeignKey(Clinic, null=True, blank=True)
        job_title = CharField(max_length=100, blank=True)
        active = BooleanField(default=True)

        @property
        def is_clinician(self):
            return hasattr(self, 'clinician')

        @property
        def is_administrator(self):
            return hasattr(self, 'administrator')

        @property
        def profile_types(self):
            # Returns ['clinician', 'administrator', 'billing', 'researcher'] as applicable
            ...

    class ClinicianProfile(models.Model):
        staff = OneToOneField(StaffMember, related_name='clinician')
        license_number = CharField(max_length=50)
        license_state = CharField(max_length=2)
        credentials = CharField(help_text='LCSW, LMFT, MD, PhD, etc.')
        clinical_role = CharField(choices=[
            ('counselor', 'Counselor'),
            ('therapist', 'Therapist'),
            ('case_manager', 'Case Manager'),
            ('psychiatrist', 'Psychiatrist'),
            ('nurse', 'Nurse'),
            ('peer_specialist', 'Peer Specialist'),
        ])
        can_activate_bonus = BooleanField(default=True)
        can_edit_treatment_plans = BooleanField(default=True)

    class AdministratorProfile(models.Model):
        staff = OneToOneField(StaffMember, related_name='administrator')
        scope = CharField(choices=[
            ('clinic', 'Clinic Admin'),
            ('region', 'Region Admin'),
            ('org', 'Organization Admin'),
        ])
        can_create_staff = BooleanField(default=True)
        can_manage_schedules = BooleanField(default=True)
        can_view_billing = BooleanField(default=False)
        can_generate_reports = BooleanField(default=True)

        def save(self, *args, **kwargs):
            # Validates scope matches staff assignment
            if self.scope == 'clinic' and not self.staff.clinic:
                raise ValueError("Clinic admin must be assigned to a clinic")
            super().save(*args, **kwargs)

    class SuperAdminPermission(models.Model):
        """AWS IAM-style granular toggles for platform admins"""
        user = OneToOneField(User, limit_choices_to={'is_superuser': True})
        can_create_organizations = BooleanField(default=False)
        can_impersonate_users = BooleanField(default=False)
        can_modify_system_settings = BooleanField(default=False)
        can_view_all_patients = BooleanField(default=False)
        can_export_phi = BooleanField(default=False)
        can_manage_subscriptions = BooleanField(default=False)
        # ... 14 toggles total

Access Control Logic

    from django.db.models import Q

    class StaffMember:
        def get_accessible_patients(self):
            """Queryset of patients this staff can access"""
            patients = Patient.objects.none()

            # Admin scope: org/region/clinic level
            if self.is_administrator:
                admin = self.administrator
                if admin.scope == 'org':
                    patients |= Patient.objects.filter(organization=self.organization)
                elif admin.scope == 'region':
                    patients |= Patient.objects.filter(region=self.region)
                elif admin.scope == 'clinic':
                    patients |= Patient.objects.filter(clinic=self.clinic)

            # Clinician scope: assigned patients only
            if self.is_clinician:
                patients |= Patient.objects.filter(
                    Q(primary_clinician=self) | Q(additional_clinicians=self)
                )

            return patients.distinct()

    # Usage in views:
    @staff_required
    def staff_patient_list(request):
        staff = request.user.staff_profile
        patients = staff.get_accessible_patients()  # Automatically scoped
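To make the union semantics concrete without a database, here is a framework-free sketch of the same rule: admin scope and clinician caseload each contribute patients, and the result is their union. `PatientRow`, `Staff`, and `accessible_patient_ids` are hypothetical stand-ins for the Django models, and the sample data is invented.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class PatientRow:
    id: int
    organization: str
    region: str
    clinic: str
    clinician_ids: FrozenSet[int]  # primary + additional clinicians


@dataclass
class Staff:
    id: int
    organization: str
    region: Optional[str] = None
    clinic: Optional[str] = None
    admin_scope: Optional[str] = None  # 'org' | 'region' | 'clinic' | None
    is_clinician: bool = False


def accessible_patient_ids(staff: Staff, patients: List[PatientRow]) -> Set[int]:
    """Union of what the admin profile and the clinician profile each allow."""
    ids: Set[int] = set()
    for p in patients:
        by_admin = (
            (staff.admin_scope == "org" and p.organization == staff.organization)
            or (staff.admin_scope == "region" and p.region == staff.region)
            or (staff.admin_scope == "clinic" and p.clinic == staff.clinic)
        )
        by_caseload = staff.is_clinician and staff.id in p.clinician_ids
        if by_admin or by_caseload:
            ids.add(p.id)
    return ids


PATIENTS = [
    PatientRow(1, "org-a", "north", "clinic-1", frozenset({10})),
    PatientRow(2, "org-a", "north", "clinic-2", frozenset({11})),
    PatientRow(3, "org-a", "south", "clinic-3", frozenset({10})),
]
multi_hat = Staff(10, "org-a", "north", "clinic-2", admin_scope="clinic", is_clinician=True)
region_admin = Staff(12, "org-a", "north", admin_scope="region")
clinician_only = Staff(11, "org-a", is_clinician=True)
no_profiles = Staff(13, "org-a")
```

The `multi_hat` case is the "multiple hats" scenario: one person sees their clinic as an administrator plus their own caseload elsewhere, with no role flags involved.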

Audit Logging

    class ImpersonationLog(models.Model):
        """Logged when Super Admin impersonates users"""
        admin = ForeignKey(User, related_name='impersonations_made')
        target_user = ForeignKey(User, related_name='impersonations_received')
        started_at, ended_at
        reason = TextField()  # Required - no silent impersonation
        ip_address = GenericIPAddressField()

    class PHIAccessLog(models.Model):
        """HIPAA-compliant PHI access tracking"""
        user = ForeignKey(User)
        patient = ForeignKey(Patient)
        action = CharField()  # 'view', 'edit', 'export'
        ip_address = GenericIPAddressField()
        created_at = DateTimeField(auto_now_add=True)

Why Profiles, Not Roles?

Consequences

| Type | Detail |
| --- | --- |
| Positive | Clean separation of concerns. Flexible multi-hat capability (common in small orgs). Easy to add new profile types. Granular Super Admin permissions reduce platform risk. |
| Negative | More models than simple Staff approach. Slightly more complex queries for cross-profile operations. Multiple table joins for staff with multiple profiles. |
| Mitigations | select_related() and prefetch_related() for query performance. Admin interface makes adding/removing profiles intuitive. |

Why This ADR Matters

Data modeling decisions made at the foundation define everything built on top. This ADR shows deliberate domain modeling for healthcare - not forcing clinical workflows into a generic SaaS user/role pattern. The profile-based approach has scaled cleanly from 1 organization to the current multi-tenant deployment.


ADR-0011 Active October 2025

Passwordless Authentication

Two user populations with completely different needs. Patients in early recovery - high stress, low tolerance for friction, privacy concerns, shared devices. Staff at treatment organizations - enterprise security requirements, centralized access control, compliance obligations. One auth system can't serve both. So we didn't try.

Patients: SMS Magic Links

No passwords. No app downloads. No accounts to create. The patient's daily touchpoints arrive via SMS with embedded magic links. Click the link, view the content. The link is the authentication.

| Property | Value |
| --- | --- |
| Token entropy | 256-bit (secrets.token_urlsafe(32)) |
| Expiration | 3 days (org-configurable: 24-168 hours) |
| Reusability | Reusable within expiration window |
| Session timeout | 30 minutes inactivity |
| Rate limiting | 5 attempts per identifier per 15 minutes |
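The rate-limiting row (5 attempts per identifier per 15 minutes) could be enforced in several ways. The sketch below uses an in-memory sliding window purely for illustration; whether the platform uses a sliding or fixed window is not stated, and `allow_attempt` is a hypothetical name.

```python
from collections import defaultdict, deque
from typing import Deque, Dict

MAX_ATTEMPTS = 5
WINDOW_SECONDS = 15 * 60

# identifier (phone or email) -> timestamps of recent allowed attempts
_attempts: Dict[str, Deque[float]] = defaultdict(deque)


def allow_attempt(identifier: str, now: float) -> bool:
    """Sliding-window check: at most MAX_ATTEMPTS per identifier per window."""
    window = _attempts[identifier]
    # Drop attempts that have aged out of the 15-minute window.
    while window and now - window[0] >= WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_ATTEMPTS:
        return False
    window.append(now)
    return True


# Six attempts one second apart: the sixth is rejected.
results = [allow_attempt("+15550100", now=float(t)) for t in range(6)]
```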

Why reusable? Rehabilitative, not punitive. A patient who clicks a link twice shouldn't get an error. They're studying, reviewing, or retrying from a different device. The population we serve needs friction removed, not added.

Staff: Enterprise SSO

Microsoft Entra ID (primary), Google Workspace (optional). No password option exists for staff - all authentication delegated to the corporate identity provider. MFA enforced by the IdP, not by us. Immediate revocation on termination.

For smaller organizations without IT resources, staff can also use email magic links with tighter security: 15-minute expiration, single-use only. Superadmins are excluded from this fallback - they must use SSO.

HIPAA Compliance

Staff Magic Links vs Patient Magic Links

Added in December 2025 for organizations without IT resources for SSO setup. Deliberately tighter security than patient links:

| Property | Patient Magic Links | Staff Magic Links |
| --- | --- | --- |
| Expiration | 3 days (org-configurable) | 15 minutes (fixed) |
| Reusability | Reusable within window | Single-use only |
| Token entropy | 256-bit | 256-bit |
| Delivery | SMS (primary), email (fallback) | Email only |
| Superadmins | N/A | Excluded - must use SSO |
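The two policies in the table differ only in TTL and reusability, which a small sketch can make explicit. `LinkPolicy`, `IssuedLink`, and `redeem` are hypothetical names, and the patient TTL is shown at its 3-day default rather than an org-specific value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LinkPolicy:
    ttl: timedelta
    single_use: bool


PATIENT_POLICY = LinkPolicy(ttl=timedelta(days=3), single_use=False)   # default TTL
STAFF_POLICY = LinkPolicy(ttl=timedelta(minutes=15), single_use=True)  # fixed TTL


@dataclass
class IssuedLink:
    policy: LinkPolicy
    created_at: datetime
    use_count: int = 0


def redeem(link: IssuedLink, now: datetime) -> bool:
    """Return True and record the use if the link may be used at `now`."""
    if now >= link.created_at + link.policy.ttl:
        return False  # expired
    if link.policy.single_use and link.use_count > 0:
        return False  # staff links cannot be replayed
    link.use_count += 1
    return True


t0 = datetime(2025, 12, 1, 8, 0)
patient_link = IssuedLink(PATIENT_POLICY, created_at=t0)
staff_link = IssuedLink(STAFF_POLICY, created_at=t0)
```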

Implementation Models

    class MagicLinkToken:
        token            # CharField(255, unique, indexed) - secrets.token_urlsafe(32)
        patient          # FK to Patient
        delivery_method  # 'sms' | 'email'
        content_type     # Optional - for touchpoint-specific links
        content_id       # Optional
        created_at, expires_at
        used_for_login   # BooleanField - was this token used to create a session?
        use_count        # IntegerField - tracks reuse within window
        ip_address, user_agent  # Security audit trail

    class LoginAttempt:
        user             # FK to User (nullable - may not match)
        email_or_phone   # What was attempted
        auth_method      # 'magic_link' | 'sso_microsoft' | 'sso_google'
        success          # BooleanField
        failure_reason   # 'expired' | 'not_found' | 'inactive' | 'rate_limited'
        ip_address, user_agent, attempted_at

    # Middleware
    class SessionTimeoutMiddleware:
        # Patients: 30-minute inactivity timeout
        # Staff: Standard Django session (2 weeks)
        # Tracks last_activity in session, auto-logout if exceeded
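The timeout rule described in the middleware comments reduces to a single comparison. The sketch below is an assumption-laden simplification: `session_expired` is a hypothetical helper, and it treats the staff limit as an idle limit, whereas a standard Django session measures cookie age.

```python
from datetime import datetime, timedelta

PATIENT_IDLE_LIMIT = timedelta(minutes=30)
STAFF_IDLE_LIMIT = timedelta(weeks=2)  # matches Django's default session age


def session_expired(user_type: str, last_activity: datetime, now: datetime) -> bool:
    """True when the time since last activity exceeds the limit for this user type."""
    limit = PATIENT_IDLE_LIMIT if user_type == "patient" else STAFF_IDLE_LIMIT
    return now - last_activity > limit


seen = datetime(2025, 10, 20, 9, 0)
```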

Threat Model

| Threat | Impact | Likelihood | Mitigation |
| --- | --- | --- | --- |
| SMS not delivered | High | Low | Email fallback, 3-day validity |
| Phone stolen | Medium | Low | Remote deactivation, session timeout |
| SSO provider outage | High | Very Low | Provider SLA (99.9%+), staff magic link fallback |
| Token enumeration | Medium | Low | Rate limiting (5/15min), audit logging, 256-bit entropy |
| Session hijacking | High | Very Low | HTTPS only, secure cookies, 30-min timeout |
| Email interception (staff) | Medium | Low | 15-minute expiration, single-use |

Why Not Alternatives?

| Option | Rejected Because |
| --- | --- |
| Password + MFA for all | High cognitive burden for patients in recovery. Password resets = top support burden. Patients will choose weak passwords. |
| Password only (no MFA) | Fails HIPAA strong authentication requirements (§164.312(d)). Vulnerable to credential stuffing, password reuse across sites. |
| Biometric (Face ID / fingerprint) | Not universally available on older phones. Can't revoke biometric data. Privacy concerns with biometric storage. Complexity exceeds benefit for MVP. |
| Email magic links only (no SMS) | 20% open rate vs 98% for SMS. Doesn't align with daily touchpoint flow. Patients less likely to check email daily. |
| SAML for staff SSO | Deferred. OAuth2 covers 70%+ of market (Microsoft/Google). SAML more complex to implement. Enterprise customers can wait. |

Success Metrics (From the ADR)

| Metric | Target |
| --- | --- |
| Magic link click rate | > 95% |
| Session timeout complaints | < 1% |
| "Can't login" support tickets | < 2% |
| Average time to login | < 5 seconds (patients), < 10 seconds (staff) |
| Password reset tickets | 0 (no passwords exist) |
| Rate limit triggers per day | < 10 (indicates attack if higher) |

Post-production security audit (Oct 2025): The audit flagged magic link token entropy and cross-organization access as potential vulnerabilities. Investigation confirmed both were intentional design: 256-bit entropy exceeds NIST standards (audit had incorrectly calculated 188-bit), and cross-org access enables authorized family viewing of educational content (non-PHI). Course content is educational (CBT exercises, reflections, meditations), not PHI-containing - even if a token is accessed by the wrong party, no patient identity, treatment history, or sensitive health information is disclosed. Full security assessment in the repo: docs/audit/20251029/MAGIC_LINK_SECURITY_INVESTIGATION.md
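The 256-bit figure can be checked directly from the standard library: `secrets.token_urlsafe(32)` draws 32 random bytes and base64url-encodes them without padding. The snippet below is only a verification sketch of that arithmetic; it says nothing about how the audit arrived at its own number.

```python
import base64
import secrets

token = secrets.token_urlsafe(32)

# Restore the stripped '=' padding, then decode back to the raw random bytes.
raw = base64.urlsafe_b64decode(token + "=" * (-len(token) % 4))

entropy_bits = len(raw) * 8   # 32 bytes of randomness, 8 bits each
token_length = len(token)     # URL-safe characters actually sent in the link
```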


ADR-0014 Active January 2026

AI Collaboration Contract

This is the ADR that defines how the platform gets built. When your primary development tool is an AI agent, you need operating rules - not guidelines, not best practices, but a contract. What the AI must do before writing code. What it must never do. How git, documentation, and context management work.

Core Principle: Repository-Driven Development

All project management, documentation, and planning live in the repository. No Jira. No Asana. No Confluence. The repo IS the source of truth. ADRs are immutable architectural decisions. The backlog is a JSON file that AI edits programmatically. Technical docs are in markdown next to the code.

Why: AI needs context accessible in the repo. External tools break the context chain. Git-based tracking means every change is versioned, diff-able, and auditable.

Workflow Phases

  1. Context Gathering - Read relevant ADRs, check backlog, scan related files. Never start coding without understanding existing architecture.
  2. Planning - For non-trivial tasks, create a todo list with acceptance criteria. Mark items completed immediately after finishing each step.
  3. Implementation - Read before edit. Use targeted edits. Follow Django patterns. No secrets in code. HIPAA compliance in every query.
  4. Git Workflow - Meaningful commits. Batch related changes. Include context. Co-authoring footer on every commit.

Operating Rules

| Do | Don't |
| --- | --- |
| Read ADRs before architectural changes | Start coding without reading existing code |
| Commit every 30-60 minutes | Work 3+ hours without committing |
| Create management commands for admin tasks | Make breaking changes without confirmation |
| Update backlog when completing features | Leave todos in "in progress" state |
| Be concise, direct, and helpful | Redesign working systems mid-task |

The Data Loss Incident (Oct 2025)

What happened: During Staff Dashboard development, approximately 3 hours of work was lost - content type badges, filters, clickable rows, clinician notes, program warnings. All fully implemented and working. Never committed to git.

Timeline

  1. Session Start: Began implementing Discovery 10-17 features
  2. Commit Pause: CTO paused commits to avoid affecting production (no staging environment yet)
  3. 3 hours of development: Sophisticated dashboard with all P0 features built
  4. Claude Code session crash: Context lost
  5. Token limit hit: Ran out of tokens on $100/month plan (~12 hours of coding in 24 hours)
  6. Plan upgrade: $200/month Max plan to continue
  7. Data loss confirmed: Features existed in screenshots but were never committed
  8. Recovery: Rebuilt from discovery log specifications over 3-4 additional hours

Financial Impact

| Item | Cost |
| --- | --- |
| Initial token usage (12 hrs development) | ~$50 |
| Token usage during recovery attempt | ~$50 (hit limit) |
| Max plan upgrade | $100/month ongoing |
| Token usage for rebuild | ~$50 |
| Total one-time cost | ~$150 |
| Ongoing increased cost | $100/month |

What Changed Because of This

Story Points for Everything

AI development broke the time-effort correlation. Architecture research, UX iterations, documentation, planning - all count as story points. Sessions 8 and 9 originally showed 0 SP despite 4-6 hours of work each, because the planning/research wasn't captured.

    // Retroactive estimation tracking (backlog-data.json)
    {
      "storyPoints": 5,
      "estimatedStoryPoints": 0,
      "retroactiveEstimation": {
        "reason": "PM/Dev gap: architecture work not captured",
        "breakdown": {
          "ADR-0019 architecture research": 3,
          "R3 PWA release planning": 1,
          "Backlog reordering": 1
        }
      }
    }
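Since the backlog is a JSON file edited programmatically, entries like this one can be checked mechanically. The sketch below assumes only the field names shown in the entry; the consistency rule itself (breakdown must sum to `storyPoints`) is an illustrative check, not something the ADR states.

```python
import json

entry = json.loads("""
{
  "storyPoints": 5,
  "estimatedStoryPoints": 0,
  "retroactiveEstimation": {
    "reason": "PM/Dev gap: architecture work not captured",
    "breakdown": {
      "ADR-0019 architecture research": 3,
      "R3 PWA release planning": 1,
      "Backlog reordering": 1
    }
  }
}
""")

# The retroactive breakdown should account for every recorded story point.
breakdown_total = sum(entry["retroactiveEstimation"]["breakdown"].values())
is_consistent = breakdown_total == entry["storyPoints"]
```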

The consulting work principle: "If I need to document or plan, that is still development time for me." All development work is tracked - not just code. Architecture research, UX design, planning sessions, documentation. Once the gap was found, this corrected the recorded total for the first 11 sessions from 89 SP to 97 SP.

Definition of Done

| For Features | For Bug Fixes | For ADRs |
| --- | --- | --- |
| Code implemented and tested | Root cause identified | Context clearly stated |
| Tests pass | Fix implemented and tested | Decision with rationale |
| Documentation updated | No regressions | Consequences documented |
| Committed with clear message | Committed with reference | Alternatives considered |
| Pushed to GitHub | Deployed and verified | Implementation notes |
| Verified in production | | Related ADRs linked |
| Backlog updated | | |

Why This ADR Matters

This isn't a style guide. It's an operational contract between human and AI that evolved from real failures - data loss, context drift, untracked work. The methodology has produced 662 story points across 199 sessions. The rules exist because we learned what happens when they don't.


ADR-0046 Accepted February 2026

Fail-Open Rate Limiting for Patient Access

This ADR exists because of a production incident. On February 7, 2026, patients reported "Server Error" when clicking magic links. Redis Cloud Azure had a regional outage. Our rate limiting depended on Redis - when Redis went down, the throttle middleware raised exceptions that blocked all patient requests. For 30 minutes, no one could access their recovery content.

Root Cause Chain: Patient clicks magic link → Django view with throttle decorator → PatientRateLimitThrottle.allow_request() → cache.get() to Redis → Redis unavailable → Exception raised → 500 Server Error. The rate limiter designed to protect patients was the thing blocking them.

The Decision

Fail open. When Redis is unavailable, rate limiting allows requests through rather than blocking them. Patient access to recovery content is more important than rate limiting during infrastructure failures.

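    import logging

    logger = logging.getLogger(__name__)

    # BaseThrottle is assumed to be the project's cache-backed base throttle.
    # DRF's own rest_framework.throttling.BaseThrottle.allow_request() raises
    # NotImplementedError, which the except clause below would swallow.
    class PatientRateLimitThrottle(BaseThrottle):
        def allow_request(self, request, view):
            try:
                return super().allow_request(request, view)
            except Exception as e:
                logger.warning(f"Cache unavailable, allowing request: {e}")
                return True  # Fail open - patient access is critical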
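The fail-open behavior can be demonstrated without Redis or Django. In this sketch, `fail_open`, `check_when_healthy`, and `check_when_down` are hypothetical, and the builtin `ConnectionError` stands in for whatever exception the cache client raises during an outage. Catching a narrow exception type is a choice made for this example, so that ordinary bugs still surface instead of silently disabling the limit.

```python
import logging
from typing import Callable

logger = logging.getLogger("throttle_sketch")


def fail_open(check: Callable[[str], bool]) -> Callable[[str], bool]:
    """Wrap a rate-limit check so backend outages allow the request."""
    def wrapped(identifier: str) -> bool:
        try:
            return check(identifier)
        except ConnectionError as exc:
            logger.warning("Cache unavailable, allowing request: %s", exc)
            return True  # fail open
    return wrapped


def check_when_healthy(identifier: str) -> bool:
    """Stand-in for a working limiter that rejects one known identifier."""
    return identifier != "over-limit"


def check_when_down(identifier: str) -> bool:
    """Stand-in for a limiter whose cache backend is unreachable."""
    raise ConnectionError("cache backend unreachable")


healthy = fail_open(check_when_healthy)
degraded = fail_open(check_when_down)
```

While the backend is healthy the limit still applies; during an outage even an identifier that would normally be rejected is let through, which is the trade-off this ADR accepts.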

Additional Hardening

Why Not Alternatives?

| Option | Rejected Because |
| --- | --- |
| Fail closed (block all requests) | Patients can't access recovery content during outage. Unacceptable for this population. |
| In-memory rate limiting fallback | Per-instance limits (not distributed), complex to implement correctly. Over-engineering for a rare edge case. |
| Multiple Redis providers | Significant complexity, cost, and maintenance. YAGNI. |

Affected Throttle Classes

| Class | Endpoint | Fail-Open Behavior |
| --- | --- | --- |
| PatientRateLimitThrottle | Magic link views | Allow all requests |
| PatientMagicLinkThrottle | Magic link generation | Allow all requests |
| DemoSMSThrottle | Demo SMS sending | Allow all requests |

Cache Configuration

    # Django cache settings - hardened after incident
    CACHES = {
        "default": {
            "BACKEND": "django_redis.cache.RedisCache",
            # LOCATION (the Redis URL) is not shown in this excerpt
            "OPTIONS": {
                "SOCKET_CONNECT_TIMEOUT": 2,  # 2 sec connect timeout
                "SOCKET_TIMEOUT": 2,          # 2 sec read/write timeout
                "IGNORE_EXCEPTIONS": True,    # Don't raise on cache errors
            }
        }
    }

    # Health check - won't hang container orchestration
    from concurrent.futures import ThreadPoolExecutor
    from django.core.cache import cache

    executor = ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(cache.get, "health_check_probe")
        future.result(timeout=3)  # 3-second hard timeout
    finally:
        # Don't wait for a stuck worker: a `with` block would block on exit
        executor.shutdown(wait=False)
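A bounded health probe can be exercised without a cache at all. In this sketch `probe_with_timeout` and `slow_probe` are hypothetical; the point it illustrates is that the executor is shut down with `wait=False`, because leaving a `with ThreadPoolExecutor(...)` block waits for the worker thread and would stall on a hung probe even after the timeout fires.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def probe_with_timeout(probe: Callable[[], object], timeout_s: float) -> bool:
    """Return True if probe() finishes within timeout_s without raising."""
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        executor.submit(probe).result(timeout=timeout_s)
        return True
    except FutureTimeout:
        return False  # probe is hanging: report unhealthy
    except Exception:
        return False  # probe raised: report unhealthy
    finally:
        executor.shutdown(wait=False)  # never block on a stuck worker


def slow_probe() -> None:
    time.sleep(0.5)  # simulates a cache call that is not answering


started = time.monotonic()
slow_ok = probe_with_timeout(slow_probe, timeout_s=0.05)
elapsed = time.monotonic() - started
fast_ok = probe_with_timeout(lambda: None, timeout_s=1.0)
```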

Accepted Trade-offs

Monitoring Plan

Why This ADR Matters

This is a 137-line document that came from a real production incident. It shows the team's values: patient access is sacred. It shows engineering maturity: the incident was diagnosed, the fix was designed with trade-off analysis, alternatives were rejected with reasoning, and the monitoring plan was documented. Not a hotfix thrown at the wall - a principled architectural decision born from operational pain.


What You'd Own

These 5 ADRs represent different aspects of the engineering discipline at Orbiit:

| ADR | Shows |
| --- | --- |
| 0006 - Micro-Courses | Product thinking embedded in architecture. Domain-specific design for a specific population. |
| 0009 - User Hierarchy | Deliberate data modeling. Healthcare domain expertise baked into the schema. |
| 0011 - Passwordless Auth | Security design that serves the user, not the other way around. HIPAA mapped to implementation. |
| 0014 - AI Collaboration | Operational methodology. How one engineer + AI ships a platform. |
| 0046 - Fail-Open | Production incident response. Values-driven engineering under pressure. |

The full set of 48 ADRs is in the repo at docs/architecture-decisions/. When you onboard, you read them. That's day one. By day three, you understand not just what the system does, but why it was built that way.

Want to See the Rest?

48 ADRs, 199 session logs, 11 runbooks, root cause analyses - all in the repo. Let's walk through the codebase together.
