Architecture Decision Records

Why ADRs Matter

Architecture Decision Records are the backbone of how this platform was built. When AI is your primary development tool, tribal knowledge doesn't work - you can't say "ask Steve, he built that part." Every decision needs to be written down: the context, the constraints, the options considered, what we chose, and why.

48 ADRs in 6 months. That's more architectural documentation than most Series B companies have. These aren't templates filled in for compliance - they're working documents that the AI reads before touching the codebase. When a new session starts, the first thing that happens is reading the relevant ADRs. That's how context survives across sessions.

What you're reading below are 5 curated selections - chosen to show range: product architecture (0006), data modeling (0009), security design (0011), methodology (0014), and a production incident turned into principled design (0046). The other 43 cover everything from HIPAA compliance to UUID migration to AI kill switches.

ADR-0006 Accepted August 2025

Micro-Courses Architecture & JIT Token Delivery

This is the core product architecture. Orbiit delivers 8 daily touchpoints to patients via SMS - meditations, affirmations, reflections (text-only), and micro-courses (magic link to mobile web page with quiz). The question was: how do you deliver personalized content to someone without making them log in?

The Problem

Patients in early recovery have high cognitive load and low tolerance for friction. Passwords are a barrier. App downloads are a barrier. Even email links are a barrier - SMS has a 98% open rate vs 20% for email. We needed a way to deliver one course to one user, collect quiz responses, and attribute the interaction - without standing up full login flows.

The Decision

Scope-bound, short-TTL magic links. Each token binds to (user_id, course_id, run_id) with a 14-day TTL. No PII in the URL. The token can only read one course and submit one quiz response. Just-in-time generation reduces exposure by 96.7% compared to pre-generated tokens.
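
A minimal sketch of just-in-time issuance under the properties above. Only `secrets.token_urlsafe(32)`, the (user_id, course_id, run_id) binding, and the 14-day TTL come from the ADR. The `IssuedToken` container, the `issue_token` and `is_valid` helpers, and the integer user ID are hypothetical names chosen for illustration, not the project's code.

```python
import secrets
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta

TOKEN_TTL = timedelta(days=14)  # TTL stated in the ADR


@dataclass(frozen=True)
class IssuedToken:  # hypothetical container, not the Django model
    token: str
    user_id: int
    course_id: str
    run_id: uuid.UUID
    expires_at: datetime


def issue_token(user_id: int, course_id: str, run_id: uuid.UUID, now: datetime) -> IssuedToken:
    """Create the token at send time, bound to exactly one (user, course, run)."""
    return IssuedToken(
        token=secrets.token_urlsafe(32),  # 32 random bytes = 256 bits, no PII
        user_id=user_id,
        course_id=course_id,
        run_id=run_id,
        expires_at=now + TOKEN_TTL,
    )


def is_valid(issued: IssuedToken, course_id: str, now: datetime) -> bool:
    """A token only unlocks its own course, and only before expiry."""
    return issued.course_id == course_id and now < issued.expires_at
```

Because the token is created when the SMS is sent rather than in a nightly batch, it only exists for the window in which the patient can actually use it.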

Patient's Phone
      │
      │ SMS with magic link
      ▼
Astro /course/{token} ── GET ──► Django API ── validates token ──► PostgreSQL
      │                              │
      │ quiz submit                  │ marks opened
      └──── POST answers + token ────┘
                                     │
                          SOBER Score recalculation

Daily Delivery Cadence

Time	Type	Delivery
7:00 AM	Meditation	Text-only SMS
9:00 AM	Course 1	Magic link → mobile web + quiz
11:30 AM	Affirmation	Text-only SMS
1:00 PM	Course 2	Magic link → mobile web + quiz
3:30 PM	Affirmation	Text-only SMS
5:00 PM	Course 3	Magic link → mobile web + quiz
7:30 PM	Affirmation	Text-only SMS
9:00 PM	Reflection	Text-only SMS
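
The cadence above can be sketched as a small planner that turns local slot times into UTC send times. The slot times and types come from the table. The `plan_day` function and its return shape are assumptions for illustration, not the scheduler's actual interface.

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo

# Slot times and types from the cadence table above.
DAILY_SLOTS = [
    (time(7, 0), "meditation"),
    (time(9, 0), "course"),
    (time(11, 30), "affirmation"),
    (time(13, 0), "course"),
    (time(15, 30), "affirmation"),
    (time(17, 0), "course"),
    (time(19, 30), "affirmation"),
    (time(21, 0), "reflection"),
]


def plan_day(day: date, tz: tzinfo) -> list[tuple[int, str, datetime]]:
    """Return (slot, type, send_at in UTC) for each touchpoint on a local calendar day."""
    plan = []
    for slot, (local_time, kind) in enumerate(DAILY_SLOTS, start=1):
        local_dt = datetime.combine(day, local_time, tzinfo=tz)
        plan.append((slot, kind, local_dt.astimezone(timezone.utc)))
    return plan


# Example with a fixed UTC-5 offset; a zoneinfo.ZoneInfo time zone works the same way.
plan = plan_day(date(2025, 8, 4), timezone(timedelta(hours=-5)))
```

Storing `send_at` in UTC keeps the worker logic simple: it only compares against the current UTC time, and the time zone math happens once, at planning time.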

Content Types & IDs

Type	ID Pattern	Delivery	Example
Course	`C#####`	Magic link → mobile web page + quiz	`C00101`
Meditation	`M###`	Text-only SMS body	`M001`
Affirmation	`A###`	Text-only SMS body	`A001`
Reflection	`R###`	Text-only SMS body	`R001`

Text-only items never require a web page. Courses always render a page + quiz.
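
As an illustration of the ID scheme, here is a hedged sketch that classifies a content ID and derives whether it needs a link. The patterns mirror the table above. The `classify_content_id` helper is a hypothetical name, not part of the codebase.

```python
import re

# Patterns from the table above: C + 5 digits, M/A/R + 3 digits.
ID_PATTERNS = {
    "course": re.compile(r"C\d{5}"),
    "meditation": re.compile(r"M\d{3}"),
    "affirmation": re.compile(r"A\d{3}"),
    "reflection": re.compile(r"R\d{3}"),
}


def classify_content_id(content_id: str) -> tuple[str, bool]:
    """Return (type, link_required) for a content ID, or raise ValueError."""
    for kind, pattern in ID_PATTERNS.items():
        if pattern.fullmatch(content_id):
            return kind, kind == "course"  # only courses need a magic link
    raise ValueError(f"Unrecognized content ID: {content_id!r}")
```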

Data Model

# Core models (apps.courses, apps.messaging)

class MagicLinkToken:
    token       # secrets.token_urlsafe(32) - 256-bit, unique, indexed
    user        # FK to User - who this token authenticates
    course      # FK to Course - what content it unlocks
    run_id      # UUID - enables re-assignment without invalidating history
    expires_at  # Default 14 days TTL
    used_at     # Tracks first use for audit

class UserCourseAssignment:
    user, course, run_id
    status      # draft → sent → opened → completed
    sent_at, opened_at, completed_at

class ContentItem:
    id          # C/M/A/R prefix ID
    type        # course | meditation | affirmation | reflection
    day, order  # Day 1-180, slot 1-8
    title, body
    link_required  # True for courses, False for text-only

class DeliveryPlan:
    user, date, day, order
    content_item, send_at
    status      # planned → queued → sent → delivered → opened → completed (failed = terminal error state)
    run_id, retries, last_error

class OutboundMessage:
    user, content_item
    twilio_sid, body, send_at
    status, status_payload  # JSONB from Twilio webhooks
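
One way to keep the `UserCourseAssignment.status` lifecycle honest is an explicit transition table. This is a sketch under the assumption that statuses only move forward one step at a time; the ADR lists the order but not the enforcement, and `advance` is a hypothetical helper.

```python
# Forward transitions for UserCourseAssignment.status, in the order listed above.
# Assumption: no step may be skipped and nothing moves backward.
ASSIGNMENT_TRANSITIONS = {
    "draft": {"sent"},
    "sent": {"opened"},
    "opened": {"completed"},
    "completed": set(),
}


def advance(current: str, new: str) -> str:
    """Return the new status; repeating the current status is a no-op (idempotent)."""
    if new == current:
        return current
    if new not in ASSIGNMENT_TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current} -> {new}")
    return new
```

Treating a repeated status as a no-op is what lets a second page open leave the assignment untouched instead of raising an error.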

API Contract

Endpoint	Method	Purpose
`/api/v1/courses/magic/{token}`	GET	Read course via magic token. Marks assignment `opened` (idempotent).
`/api/v1/submissions/token`	POST	Submit quiz answers. Resolves `(user, course)` by token. Creates Submission once per `(user, quiz, run_id)`. Returns `201` or `200 already_completed`.
`/api/v1/sms/status`	POST	Twilio status callback. Updates OutboundMessage/DeliveryPlan.
`/api/v1/sms/inbound`	POST	Twilio inbound. Handles STOP/STOPALL/UNSUBSCRIBE/CANCEL/END/QUIT (opt-out) and START (opt back in).
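
The submission contract (create once per `(user, quiz, run_id)`, then answer `200 already_completed`) can be sketched with an in-memory store. `SubmissionStore` and the `created` response body are assumptions for illustration. In Django the same guarantee is usually backed by a unique constraint on the three columns.

```python
from dataclasses import dataclass, field


@dataclass
class SubmissionStore:
    """In-memory stand-in for the Submission table (hypothetical)."""
    rows: dict = field(default_factory=dict)

    def submit(self, user_id: int, quiz_id: str, run_id: str, answers: dict) -> tuple[int, dict]:
        key = (user_id, quiz_id, run_id)
        if key in self.rows:
            # Replay or double-tap: keep the first submission, still report success.
            return 200, {"status": "already_completed"}
        self.rows[key] = answers
        return 201, {"status": "created"}
```

A new `run_id` produces a new key, which is how a re-assigned course accepts a fresh submission while the earlier one stays immutable.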

First-Party Scheduler (No GHL/Make)

Celery Beat plans the day per user (timezone-aware) into DeliveryPlan rows
Celery Workers send SMS via Twilio at send_at, create OutboundMessage, attach status callback URLs
Twilio callbacks update delivery states; inbound STOP/START toggles opted_out
Retry policy: Exponential backoff on transient failures, capped retries, audit in OutboundMessage
Inbound STOP/DND: Any inbound SMS containing STOP, STOPALL, UNSUBSCRIBE, CANCEL, END, or QUIT immediately marks opted_out=true, cancels all future DeliveryPlan rows, and raises an AlertEvent for guide/admin follow-up
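
The opt-out rule above can be sketched on plain dictionaries. The keyword list comes from the ADR. The helper names, the dict shapes, and the `cancelled` status are hypothetical; the ADR says future rows are cancelled but does not name the status.

```python
OPT_OUT_KEYWORDS = {"STOP", "STOPALL", "UNSUBSCRIBE", "CANCEL", "END", "QUIT"}


def is_opt_out(body: str) -> bool:
    """True if any word of the inbound SMS is an opt-out keyword (case-insensitive)."""
    words = (word.strip(".,!?") for word in body.upper().split())
    return any(word in OPT_OUT_KEYWORDS for word in words)


def handle_inbound(body: str, user: dict, plans: list[dict]) -> bool:
    """Apply the opt-out rule; returns True if the user was opted out."""
    if not is_opt_out(body):
        return False
    user["opted_out"] = True
    for plan in plans:
        if plan["status"] in {"planned", "queued"}:  # not yet sent
            plan["status"] = "cancelled"  # hypothetical status name
    return True
```

Matching any word follows the "containing" wording above, so a message like "please don't stop" also opts the patient out. That errs on the side of consent, and the AlertEvent gives a guide the chance to follow up.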

Edge Cases

Expired/invalid token: Friendly page explaining how to request a fresh link
Multiple opens: Allowed (study/review). Only the first valid submission closes the assignment
Re-assigning a course: New run_id in UserCourseAssignment and new token. Prior submissions remain immutable
Late signup (e.g., 6 PM): Send remaining items for the day. Carry remainder to next day
Anonymous device changes: No session. All attribution is via token → user mapping

Security Properties

Least privilege: Token can only read one course and post one submission
256-bit entropy: secrets.token_urlsafe(32) - exceeds NIST SP 800-63B
No PII in URLs: Token is opaque, no patient data exposed
Idempotent submission: Same (user, quiz, run_id) can't create duplicate records
Rate limited: DRF throttle by token/IP, 429 on abuse
Token never logged in plaintext: Issue/open/submit events use token hash in persistent logs
CORS/CSRF: Submission endpoint is token-based, CSRF exempt but origin-checked, no third-party embeds
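
A minimal sketch of the "token never logged in plaintext" property, assuming SHA-256 as the hash (the ADR says "token hash" without naming the algorithm). The logger name and both helpers are hypothetical.

```python
import hashlib
import logging

logger = logging.getLogger("magic_links")  # hypothetical logger name


def token_fingerprint(token: str) -> str:
    """SHA-256 hex digest of the token; safe to store in persistent logs."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def log_token_event(event: str, token: str) -> str:
    """Log an issue/open/submit event by fingerprint, never by raw token."""
    fingerprint = token_fingerprint(token)
    logger.info("magic_link %s token_sha256=%s", event, fingerprint)
    return fingerprint
```

The fingerprint still lets an investigator correlate issue, open, and submit events for one link, but a leaked log file cannot be replayed as working URLs.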

Why Not Alternatives?

Full login (OIDC): Too heavy. Patients in crisis don't want to create accounts.
Shared session links: Higher blast radius. One leaked link exposes more than one course.
Static pre-generated links: No per-user attribution. Can't track who opened what.
Pre-generated tokens in bulk: Tokens sit unused for longer. JIT generation cuts that exposure window by 96.7%, so JIT is safer.

Why This ADR Matters

This is the product's DNA. The magic link architecture is what makes Orbiit work for a population that won't download apps, won't remember passwords, and won't tolerate friction. Every design choice - TTLs, scope binding, idempotent submission, JIT generation - flows from one principle: meet the patient where they are.

ADR-0009 Accepted October 2025

User Hierarchy & Profile-Based RBAC

The platform serves multiple user types with very different access needs: patients viewing their own recovery data, clinicians managing caseloads, administrators overseeing clinics or entire organizations, billers processing claims, and platform superadmins. The question: how do you model this without a God Object User model?

The Design Principle

"Clinician and Admin are separate profiles, not roles." A StaffMember can have a ClinicianProfile (clinical work), an AdministratorProfile (management), a BillingProfile (financial), or any combination. In small treatment centers, the same person often wears multiple hats - this architecture handles that without role flags or permission spaghetti.

User (Django AbstractUser)
├── user_type: patient | staff | superadmin
│
├── Patient (if patient)
│   ├── Organization → Region → Clinic
│   └── 1 primary clinician + N additional
│
├── StaffMember (if staff)
│   ├── ClinicianProfile (optional)
│   ├── AdministratorProfile (optional)
│   ├── BillingProfile (optional)
│   └── ResearcherProfile (optional)
│
└── SuperAdminPermission (if superuser)
    └── Granular toggles (AWS IAM-style)

Multi-Tenant Hierarchy

Every query is scoped by organization. The hierarchy is Organization → Region → Clinic → Patient. Administrator scope is explicit:

Clinic admin: Sees patients at their clinic
Region admin: Sees patients across all clinics in their region
Org admin: Sees all patients in the organization
Clinician: Sees only their assigned patients (primary + additional)

Permission Matrix (Abbreviated)

Action	Patient	Clinician	Region Admin	Org Admin	Super Admin
View own data	Yes	Yes	Yes	Yes	Yes
View assigned patients	-	Yes	Yes	Yes	Yes
View all org patients	-	-	-	Yes	Yes
Create staff	-	-	Region	Org-wide	Yes*
Impersonate users	-	-	-	-	Yes*

* Requires specific toggle enabled (IAM-style granular permissions)

The Models (Actual Code)

class StaffMember(models.Model):
    """Base model for all organizational employees.
    Can have 0, 1, or multiple profiles simultaneously."""
    user         = OneToOneField(User, related_name='staff_profile')
    organization = ForeignKey(Organization, on_delete=PROTECT)
    region       = ForeignKey(Region, null=True, blank=True)
    clinic       = ForeignKey(Clinic, null=True, blank=True)
    job_title    = CharField(max_length=100, blank=True)
    active       = BooleanField(default=True)

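    @property
    def is_clinician(self):  return hasattr(self, 'clinician')
    @property
    def is_administrator(self): return hasattr(self, 'administrator')
    @property
    def profile_types(self):
        # Returns ['clinician', 'administrator', 'billing', 'researcher'] as applicable.
        # Assumes BillingProfile/ResearcherProfile use related_name 'billing'/'researcher'.
        names = ('clinician', 'administrator', 'billing', 'researcher')
        return [name for name in names if hasattr(self, name)]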

class ClinicianProfile(models.Model):
    staff           = OneToOneField(StaffMember, related_name='clinician')
    license_number  = CharField(max_length=50)
    license_state   = CharField(max_length=2)
    credentials     = CharField(help_text='LCSW, LMFT, MD, PhD, etc.')
    clinical_role   = CharField(choices=[
        ('counselor', 'Counselor'), ('therapist', 'Therapist'),
        ('case_manager', 'Case Manager'), ('psychiatrist', 'Psychiatrist'),
        ('nurse', 'Nurse'), ('peer_specialist', 'Peer Specialist'),
    ])
    can_activate_bonus      = BooleanField(default=True)
    can_edit_treatment_plans = BooleanField(default=True)

class AdministratorProfile(models.Model):
    staff = OneToOneField(StaffMember, related_name='administrator')
    scope = CharField(choices=[
        ('clinic', 'Clinic Admin'),
        ('region', 'Region Admin'),
        ('org',    'Organization Admin'),
    ])
    can_create_staff       = BooleanField(default=True)
    can_manage_schedules   = BooleanField(default=True)
    can_view_billing       = BooleanField(default=False)
    can_generate_reports   = BooleanField(default=True)

    def save(self, *args, **kwargs):
        # Validates scope matches staff assignment
        if self.scope == 'clinic' and not self.staff.clinic:
            raise ValueError("Clinic admin must be assigned to a clinic")
        super().save(*args, **kwargs)

class SuperAdminPermission(models.Model):
    """AWS IAM-style granular toggles for platform admins"""
    user = OneToOneField(User, limit_choices_to={'is_superuser': True})
    can_create_organizations = BooleanField(default=False)
    can_impersonate_users    = BooleanField(default=False)
    can_modify_system_settings = BooleanField(default=False)
    can_view_all_patients    = BooleanField(default=False)
    can_export_phi           = BooleanField(default=False)
    can_manage_subscriptions = BooleanField(default=False)
    # ... 14 toggles total

Access Control Logic

from django.db.models import Q  # needed for the clinician filter below

class StaffMember:
    def get_accessible_patients(self):
        """Queryset of patients this staff can access"""
        patients = Patient.objects.none()

        # Admin scope: org/region/clinic level
        if self.is_administrator:
            admin = self.administrator
            if admin.scope == 'org':
                patients |= Patient.objects.filter(organization=self.organization)
            elif admin.scope == 'region':
                patients |= Patient.objects.filter(region=self.region)
            elif admin.scope == 'clinic':
                patients |= Patient.objects.filter(clinic=self.clinic)

        # Clinician scope: assigned patients only
        if self.is_clinician:
            patients |= Patient.objects.filter(
                Q(primary_clinician=self) | Q(additional_clinicians=self)
            )

        return patients.distinct()

# Usage in views:
@staff_required
def staff_patient_list(request):
    staff = request.user.staff_profile
    patients = staff.get_accessible_patients()  # Automatically scoped

Audit Logging

class ImpersonationLog(models.Model):
    """Logged when Super Admin impersonates users"""
    admin       = ForeignKey(User, related_name='impersonations_made')
    target_user = ForeignKey(User, related_name='impersonations_received')
    started_at, ended_at
    reason      = TextField()       # Required - no silent impersonation
    ip_address  = GenericIPAddressField()

class PHIAccessLog(models.Model):
    """HIPAA-compliant PHI access tracking"""
    user    = ForeignKey(User)
    patient = ForeignKey(Patient)
    action  = CharField()  # 'view', 'edit', 'export'
    ip_address = GenericIPAddressField()
    created_at = DateTimeField(auto_now_add=True)

Why Profiles, Not Roles?

No User model bloat: Roles become flags, flags become spaghetti. Profiles are clean OneToOne relations.
Real-world flexibility: A clinical director is both a clinician and an org admin. That's two profiles, not a "super-clinician" role.
Easy to extend: Adding a CaseManagerProfile or PharmacistProfile is adding a model, not rewriting permission logic.
Simple access checks: if staff.is_clinician: reads better than if staff.role in ['clinician', 'supervisor', 'clinical_director']:

Consequences

Type	Detail
Positive	Clean separation of concerns. Flexible multi-hat capability (common in small orgs). Easy to add new profile types. Granular Super Admin permissions reduce platform risk.
Negative	More models than simple Staff approach. Slightly more complex queries for cross-profile operations. Multiple table joins for staff with multiple profiles.
Mitigations	`select_related()` and `prefetch_related()` for query performance. Admin interface makes adding/removing profiles intuitive.

Why This ADR Matters

Data modeling decisions made at the foundation define everything built on top. This ADR shows deliberate domain modeling for healthcare - not forcing clinical workflows into a generic SaaS user/role pattern. The profile-based approach has scaled cleanly from 1 organization to the current multi-tenant deployment.

ADR-0011 Active October 2025

Passwordless Authentication

Two user populations with completely different needs. Patients in early recovery - high stress, low tolerance for friction, privacy concerns, shared devices. Staff at treatment organizations - enterprise security requirements, centralized access control, compliance obligations. One auth system can't serve both. So we didn't try.

Patients: SMS Magic Links

No passwords. No app downloads. No accounts to create. The patient's daily touchpoints arrive via SMS with embedded magic links. Click the link, view the content. The link is the authentication.

Property	Value
Token entropy	256-bit (`secrets.token_urlsafe(32)`)
Expiration	3 days (org-configurable: 24-168 hours)
Reusability	Reusable within expiration window
Session timeout	30 minutes inactivity
Rate limiting	5 attempts per identifier per 15 minutes
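
The rate limit in the table (5 attempts per identifier per 15 minutes) can be illustrated with a sliding window. This is a sketch, not the production throttle: `AttemptLimiter` is a hypothetical class, and it keeps state in process memory where the platform uses a shared cache.

```python
from collections import defaultdict, deque

MAX_ATTEMPTS = 5
WINDOW_SECONDS = 15 * 60


class AttemptLimiter:
    """In-memory sliding window keyed by phone number or email (illustrative only)."""

    def __init__(self) -> None:
        self._attempts: dict[str, deque] = defaultdict(deque)

    def allow(self, identifier: str, now: float) -> bool:
        window = self._attempts[identifier]
        while window and now - window[0] >= WINDOW_SECONDS:
            window.popleft()  # drop attempts older than 15 minutes
        if len(window) >= MAX_ATTEMPTS:
            return False
        window.append(now)
        return True
```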

Why reusable? Rehabilitative, not punitive. A patient who clicks a link twice shouldn't get an error. They're studying, reviewing, or retrying from a different device. The population we serve needs friction removed, not added.

Staff: Enterprise SSO

Microsoft Entra ID (primary), Google Workspace (optional). No password option exists for staff - all authentication delegated to the corporate identity provider. MFA enforced by the IdP, not by us. Immediate revocation on termination.

For smaller organizations without IT resources, staff can also use email magic links with tighter security: 15-minute expiration, single-use only. Superadmins are excluded from this fallback - they must use SSO.

HIPAA Compliance

Person or entity authentication (§164.312(d)): Magic link = 2FA by design (something you know: your phone number; something you have: the phone)
Unique user identification (§164.312(a)(2)(i)): Phone/email verified at enrollment
Automatic logoff (§164.312(a)(2)(iii)): 30-minute session timeout
Transmission security (§164.312(e)): HTTPS + token expiration
Audit controls (§164.312(b)): All login attempts logged (LoginAttempt model)

Staff Magic Links vs Patient Magic Links

Added in December 2025 for organizations without IT resources for SSO setup. Deliberately tighter security than patient links:

Property	Patient Magic Links	Staff Magic Links
Expiration	3 days (org-configurable)	15 minutes (fixed)
Reusability	Reusable within window	Single-use only
Token entropy	256-bit	256-bit
Delivery	SMS (primary), email (fallback)	Email only
Superadmins	N/A	Excluded - must use SSO

Implementation Models

class MagicLinkToken:
    token           # CharField(255, unique, indexed) - secrets.token_urlsafe(32)
    patient         # FK to Patient
    delivery_method # 'sms' | 'email'
    content_type    # Optional - for touchpoint-specific links
    content_id      # Optional
    created_at, expires_at
    used_for_login  # BooleanField - was this token used to create a session?
    use_count       # IntegerField - tracks reuse within window
    ip_address, user_agent  # Security audit trail

class LoginAttempt:
    user            # FK to User (nullable - may not match)
    email_or_phone  # What was attempted
    auth_method     # 'magic_link' | 'sso_microsoft' | 'sso_google'
    success         # BooleanField
    failure_reason  # 'expired' | 'not_found' | 'inactive' | 'rate_limited'
    ip_address, user_agent, attempted_at

# Middleware
class SessionTimeoutMiddleware:
    # Patients: 30-minute inactivity timeout
    # Staff: Standard Django session (2 weeks)
    # Tracks last_activity in session, auto-logout if exceeded
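
The patient branch of the middleware can be sketched without Django. The `touch_or_expire` helper and the `last_activity` session key are assumptions for illustration. `session` stands in for any dict-like store such as `request.session`.

```python
from datetime import datetime, timedelta

PATIENT_IDLE_LIMIT = timedelta(minutes=30)
LAST_ACTIVITY_KEY = "last_activity"  # hypothetical session key


def touch_or_expire(session: dict, now: datetime) -> bool:
    """Return True if the session is still live (and refresh it), False if it timed out."""
    stamp = session.get(LAST_ACTIVITY_KEY)
    if stamp is not None and now - datetime.fromisoformat(stamp) > PATIENT_IDLE_LIMIT:
        session.clear()  # the caller would then log the user out and redirect
        return False
    # An ISO string keeps the session JSON-serializable.
    session[LAST_ACTIVITY_KEY] = now.isoformat()
    return True
```

Because the link itself is reusable within its window, a timed-out patient recovers by tapping the same SMS link again rather than hitting a login form.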

Threat Model

Threat	Impact	Likelihood	Mitigation
SMS not delivered	High	Low	Email fallback, 3-day validity
Phone stolen	Medium	Low	Remote deactivation, session timeout
SSO provider outage	High	Very Low	Provider SLA (99.9%+), staff magic link fallback
Token enumeration	Medium	Low	Rate limiting (5/15min), audit logging, 256-bit entropy
Session hijacking	High	Very Low	HTTPS only, secure cookies, 30-min timeout
Email interception (staff)	Medium	Low	15-minute expiration, single-use

Why Not Alternatives?

Option	Rejected Because
Password + MFA for all	High cognitive burden for patients in recovery. Password resets = top support burden. Patients will choose weak passwords.
Password only (no MFA)	Fails HIPAA strong authentication requirements (§164.312(d)). Vulnerable to credential stuffing, password reuse across sites.
Biometric (Face ID / fingerprint)	Not universally available on older phones. Can't revoke biometric data. Privacy concerns with biometric storage. Complexity exceeds benefit for MVP.
Email magic links only (no SMS)	20% open rate vs 98% for SMS. Doesn't align with daily touchpoint flow. Patients less likely to check email daily.
SAML for staff SSO	Deferred. OAuth2 covers 70%+ of market (Microsoft/Google). SAML more complex to implement. Enterprise customers can wait.

Success Metrics (From the ADR)

Metric	Target
Magic link click rate	> 95%
Session timeout complaints	< 1%
"Can't login" support tickets	< 2%
Average time to login	< 5 seconds (patients), < 10 seconds (staff)
Password reset tickets	0 (no passwords exist)
Rate limit triggers per day	< 10 (indicates attack if higher)

Post-production security audit (Oct 2025): The audit flagged magic link token entropy and cross-organization access as potential vulnerabilities. Investigation confirmed both were intentional design: 256-bit entropy exceeds NIST standards (audit had incorrectly calculated 188-bit), and cross-org access enables authorized family viewing of educational content (non-PHI). Course content is educational (CBT exercises, reflections, meditations), not PHI-containing - even if a token is accessed by the wrong party, no patient identity, treatment history, or sensitive health information is disclosed. Full security assessment in the repo: docs/audit/20251029/MAGIC_LINK_SECURITY_INVESTIGATION.md

ADR-0014 Active January 2026

AI Collaboration Contract

This is the ADR that defines how the platform gets built. When your primary development tool is an AI agent, you need operating rules - not guidelines, not best practices, but a contract. What the AI must do before writing code. What it must never do. How git, documentation, and context management work.

Core Principle: Repository-Driven Development

All project management, documentation, and planning live in the repository. No Jira. No Asana. No Confluence. The repo IS the source of truth. ADRs are immutable architectural decisions. The backlog is a JSON file that AI edits programmatically. Technical docs are in markdown next to the code.

Why: AI needs context accessible in the repo. External tools break the context chain. Git-based tracking means every change is versioned, diff-able, and auditable.

Workflow Phases

Context Gathering - Read relevant ADRs, check backlog, scan related files. Never start coding without understanding existing architecture.
Planning - For non-trivial tasks, create a todo list with acceptance criteria. Mark items completed immediately after finishing each step.
Implementation - Read before edit. Use targeted edits. Follow Django patterns. No secrets in code. HIPAA compliance in every query.
Git Workflow - Meaningful commits. Batch related changes. Include context. Co-authoring footer on every commit.

Operating Rules

Do	Don't
Read ADRs before architectural changes	Start coding without reading existing code
Commit every 30-60 minutes	Work 3+ hours without committing
Create management commands for admin tasks	Make breaking changes without confirmation
Update backlog when completing features	Leave todos in "in progress" state
Be concise, direct, and helpful	Redesign working systems mid-task

The Data Loss Incident (Oct 2025)

What happened: During Staff Dashboard development, approximately 3 hours of work was lost - content type badges, filters, clickable rows, clinician notes, program warnings. All fully implemented and working. Never committed to git.

Timeline

Session Start: Began implementing Discovery 10-17 features
Commit Pause: CTO paused commits to avoid affecting production (no staging environment yet)
3 hours of development: Sophisticated dashboard with all P0 features built
Claude Code session crash: Context lost
Token limit hit: Ran out of tokens on $100/month plan (~12 hours of coding in 24 hours)
Plan upgrade: $200/month Max plan to continue
Data loss confirmed: Features existed in screenshots but were never committed
Recovery: Rebuilt from discovery log specifications over 3-4 additional hours

Financial Impact

Item	Cost
Initial token usage (12 hrs development)	~$50
Token usage during recovery attempt	~$50 (hit limit)
Max plan upgrade	$100/month ongoing
Token usage for rebuild	~$50
Total one-time cost	~$150
Ongoing increased cost	$100/month

What Changed Because of This

Commit every 30-60 minutes during active development (now a hard rule)
Staging environment elevated from P2 to P1 - fear of breaking production was causing "commit paralysis"
Session end checklist: Run git status, commit or stash all changes, document WIP in session notes
Discovery logs as recovery tool: Detailed feature specifications enabled complete reconstruction
Never rely on session continuity across AI restarts
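
The session end checklist lends itself to automation. Below is a sketch of a check that lists uncommitted paths using `git status --porcelain`. The function names are hypothetical and this is not a script from the repo.

```python
import subprocess


def parse_porcelain(output: str) -> list[str]:
    """Extract paths from `git status --porcelain` output (one 'XY path' entry per line)."""
    return [line[3:] for line in output.splitlines() if line.strip()]


def uncommitted_paths(repo_dir: str = ".") -> list[str]:
    """Run git and return every modified, staged, or untracked path."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_porcelain(result.stdout)
```

A session wrapper could refuse to close while `uncommitted_paths()` is non-empty, turning the "commit or stash all changes" rule from a habit into a gate.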

Story Points for Everything

AI development broke the time-effort correlation. Architecture research, UX iterations, documentation, planning - all count as story points. Sessions 8 and 9 originally showed 0 SP despite 4-6 hours of work each, because the planning/research wasn't captured.

// Retroactive estimation tracking (backlog-data.json)
{
  "storyPoints": 5,
  "estimatedStoryPoints": 0,
  "retroactiveEstimation": {
    "reason": "PM/Dev gap: architecture work not captured",
    "breakdown": {
      "ADR-0019 architecture research": 3,
      "R3 PWA release planning": 1,
      "Backlog reordering": 1
    }
  }
}

The consulting work principle: "If I need to document or plan, that is still development time for me." All development work is tracked - not just code. Architecture research, UX design, planning sessions, documentation. This corrected velocity from 89 SP to 97 SP across the first 11 sessions when the gap was found.
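
A small consistency check for entries like the one above, assuming the field names shown in the JSON. `breakdown_matches` and `velocity_gap` are hypothetical helpers, not part of the backlog tooling.

```python
import json

entry = json.loads("""
{
  "storyPoints": 5,
  "estimatedStoryPoints": 0,
  "retroactiveEstimation": {
    "reason": "PM/Dev gap: architecture work not captured",
    "breakdown": {
      "ADR-0019 architecture research": 3,
      "R3 PWA release planning": 1,
      "Backlog reordering": 1
    }
  }
}
""")


def breakdown_matches(item: dict) -> bool:
    """True when the retroactive breakdown sums to the recorded story points."""
    breakdown = item.get("retroactiveEstimation", {}).get("breakdown", {})
    return sum(breakdown.values()) == item["storyPoints"]


def velocity_gap(item: dict) -> int:
    """Points that were worked but never estimated up front."""
    return item["storyPoints"] - item["estimatedStoryPoints"]
```

For this entry the breakdown (3 + 1 + 1) accounts for all 5 recorded points, and the gap against the original estimate of 0 is the full 5.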

Definition of Done

For Features	For Bug Fixes	For ADRs
Code implemented and tested	Root cause identified	Context clearly stated
Tests pass	Fix implemented and tested	Decision with rationale
Documentation updated	No regressions	Consequences documented
Committed with clear message	Committed with reference	Alternatives considered
Pushed to GitHub	Deployed and verified	Implementation notes
Verified in production	-	Related ADRs linked
Backlog updated	-	-

Why This ADR Matters

This isn't a style guide. It's an operational contract between human and AI that evolved from real failures - data loss, context drift, untracked work. The methodology has produced 662 story points across 199 sessions. The rules exist because we learned what happens when they don't.

ADR-0046 Accepted February 2026

Fail-Open Rate Limiting for Patient Access

This ADR exists because of a production incident. On February 7, 2026, patients reported "Server Error" when clicking magic links. Redis Cloud (hosted on Azure) had a regional outage. Our rate limiting depended on Redis - when Redis went down, the throttle middleware raised exceptions that blocked all patient requests. For 30 minutes, no one could access their recovery content.

Root Cause Chain: Patient clicks magic link → Django view with throttle decorator → PatientRateLimitThrottle.allow_request() → cache.get() to Redis → Redis unavailable → Exception raised → 500 Server Error. The rate limiter designed to protect patients was the thing blocking them.

The Decision

Fail open. When Redis is unavailable, rate limiting allows requests through rather than blocking them. Patient access to recovery content is more important than rate limiting during infrastructure failures.

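import logging
from rest_framework.throttling import SimpleRateThrottle

logger = logging.getLogger(__name__)

class PatientRateLimitThrottle(SimpleRateThrottle):
    # Parent must be a concrete throttle: BaseThrottle.allow_request() only raises
    # NotImplementedError, which the except below would swallow. SimpleRateThrottle
    # is an assumption; rate and cache-key configuration are omitted here.
    def allow_request(self, request, view):
        try:
            return super().allow_request(request, view)
        except Exception as e:
            logger.warning(f"Cache unavailable for rate limiting, allowing request: {e}")
            return True  # Fail open - patient access is critical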
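
The fail-open behavior is easy to exercise outside Django. A minimal sketch, assuming a cache object with a `get` method. `DownCache` and this standalone `allow_request` function are hypothetical test doubles, not the production throttle.

```python
import logging

logger = logging.getLogger(__name__)


class DownCache:
    """Test double that behaves like an unreachable cache."""

    def get(self, key):
        raise ConnectionError("cache unreachable")


def allow_request(cache, key: str, limit: int) -> bool:
    """Enforce the limit when the cache answers; fail open when it does not."""
    try:
        count = cache.get(key) or 0
    except Exception as exc:
        logger.warning("Cache unavailable for rate limiting, allowing request: %s", exc)
        return True  # fail open
    return count < limit
```

Calling `allow_request(DownCache(), "patient:1", 5)` returns `True` and emits one WARNING, which is exactly the signal the monitoring plan keys on.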

Additional Hardening

Redis socket timeouts: 2-second connect and read/write timeouts prevent hanging connections
IGNORE_EXCEPTIONS: Django cache backend won't raise on cache errors
Health check timeout: /health/ready/ uses 3-second ThreadPoolExecutor timeout for Redis ping - won't hang container orchestration

Why Not Alternatives?

Option	Rejected Because
Fail closed (block all requests)	Patients can't access recovery content during outage. Unacceptable for this population.
In-memory rate limiting fallback	Per-instance limits (not distributed), complex to implement correctly. Over-engineering for a rare edge case.
Multiple Redis providers	Significant complexity, cost, and maintenance. YAGNI.

Affected Throttle Classes

Class	Endpoint	Fail-Open Behavior
`PatientRateLimitThrottle`	Magic link views	Allow all requests
`PatientMagicLinkThrottle`	Magic link generation	Allow all requests
`DemoSMSThrottle`	Demo SMS sending	Allow all requests

Cache Configuration

# Django cache settings - hardened after incident
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "OPTIONS": {
            "SOCKET_CONNECT_TIMEOUT": 2,   # 2 sec connect timeout
            "SOCKET_TIMEOUT": 2,            # 2 sec read/write timeout
            "IGNORE_EXCEPTIONS": True,      # Don't raise on cache errors
        }
    }
}

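# Health check - won't hang container orchestration
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
try:
    future = executor.submit(cache.get, "health_check_probe")
    future.result(timeout=3)  # 3-second hard timeout (raises TimeoutError)
finally:
    # A `with` block calls shutdown(wait=True) and would block on a hung probe
    executor.shutdown(wait=False)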

Accepted Trade-offs

During cache outage, rate limits are not enforced - potential for abuse
Mitigated by: Cloudflare WAF (baseline DDoS), Azure App Service throttling, Redis outages are rare and short
Monitoring: all fail-open events logged at WARNING level for detection

Monitoring Plan

Log pattern: "Cache unavailable for rate limiting, allowing request" at WARNING level
Health check: /health/ready/ returns warning state when Redis unavailable (does not block container orchestration)
Recommended alerts: Sustained cache.get() failures > 5 minutes, health check warning state
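
A sketch of a readiness check that degrades to a warning state instead of failing, as described above. The `readiness` function and its response shape are assumptions. The real `/health/ready/` payload is not shown in the ADR.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def readiness(probe, timeout: float = 3.0) -> dict:
    """Run a cache probe with a hard timeout; report 'warning' rather than failing."""
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        executor.submit(probe).result(timeout=timeout)
        return {"status": "ok", "redis": "up"}
    except FutureTimeout:
        return {"status": "warning", "redis": "timeout"}
    except Exception as exc:
        return {"status": "warning", "redis": f"error: {exc}"}
    finally:
        executor.shutdown(wait=False)  # never block the response on a hung probe
```

Returning a warning with a successful HTTP status keeps the orchestrator from restarting healthy containers just because a dependency is down, while still giving alerting something to match on.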

Why This ADR Matters

This is a 137-line document that came from a real production incident. It shows the team's values: patient access is sacred. It shows engineering maturity: the incident was diagnosed, the fix was designed with trade-off analysis, alternatives were rejected with reasoning, and the monitoring plan was documented. Not a hotfix thrown at the wall - a principled architectural decision born from operational pain.

What You'd Own

These 5 ADRs represent different aspects of the engineering discipline at Orbiit:

ADR	Shows
0006 - Micro-Courses	Product thinking embedded in architecture. Domain-specific design for a specific population.
0009 - User Hierarchy	Deliberate data modeling. Healthcare domain expertise baked into the schema.
0011 - Passwordless Auth	Security design that serves the user, not the other way around. HIPAA mapped to implementation.
0014 - AI Collaboration	Operational methodology. How one engineer + AI ships a platform.
0046 - Fail-Open	Production incident response. Values-driven engineering under pressure.

The full set of 48 ADRs is in the repo at docs/architecture-decisions/. When you onboard, you read them. That's day one. By day three, you understand not just what the system does, but why it was built that way.

Want to See the Rest?

48 ADRs, 199 session logs, 11 runbooks, root cause analyses - all in the repo. Let's walk through the codebase together.
