Sample App — Main Spec

Design and architect a secure web app that solves (describe the problem) and results in (success criteria).

DO NOT start working yet. First, ask me clarifying questions so we can define the approach together. Only begin once we’ve aligned.

Capture the spec for this app in a markdown file saved to this folder. If the app is sufficiently complicated that it needs to be broken into phases, then create numbered prompt markdown files aligned to each phase.

Start by designing the user interfaces and asking me to approve them. Design them one at a time in this order:

Home (which manages the Workflow described below)
Admin Dashboard
Sample output report
Other Admin pages like User Roles, Sessions, API, etc.

Use design.md as the design system for the design.

Every time we iterate on this app, save a summary of the work we did in this session and the next steps to a file called context-log.md in this folder. always start each session for this project by reading the context log.

DEVELOPMENT RULES TO FOLLOW

I am not a developer — never assume technical familiarity
Always explain what a command does before running it
Present one clear recommendation, not a menu of options
Auto-check Socket.dev before any npm package installs
Prefer solutions that don’t require me to maintain running processes or servers
I hate copy-pasting into Terminal. I find navigating around folders in Terminal a pain. If you can get a CLI and post to the CLI with my approval, do that.
Client apps must be hosted on Canadian servers (fly.io YYZ).
All HTML files, web pages, and email templates must be fully responsive and optimised for mobile (iPhone 14 portrait / 390px as the baseline). Use CSS media queries, fluid widths, and simplified mobile layouts — never fixed-width-only designs.

NOTE

The prompts and spec below was written by the AI based on the initial prompt above and iterations where it asked me questions and I responded. I don’t understand most of it.

Build Scope — Phase 1

We are building Phase 1 only. Phase 1 is the core workflow. Do NOT build admin features, email drafting, report creation, or advanced session management yet. Phase 2 and 3 items are listed at the bottom of this document for context only.

Phase 1 includes:

Project setup (Next.js, Tailwind, shadcn/ui, Drizzle, PostgreSQL, Fly.io) **
Authentication (NextAuth.js with Google OAuth, domain restriction). Drive drive.file scope ships unverified-mode for ≤10 users.
Database schema
Prompt selection and display
File upload with validation, ClamAV scan, and Python-sidecar-driven metadata strip
Pre-processing pipeline (parse, calculate, summarise) — runs in the Python sidecar
Results display with formatting toggles and copy + quality trust chip
Document generation — .docx + .pdf (LibreOffice headless) + Google Doc fallback when LibreOffice fails
Basic session saving (auto-save, list past sessions). Retention: 120d for completed sessions, 50d for abandoned/errored/in-progress
Meeting app action items integration (pull outstanding action items from client meetings, with manual paste fallback)
Basic error handling and logging + structured logging with request_id propagated through Next.js + Python sidecar + OpenAI
Privacy retention alignment — privacy notice text aligned to actual deletion windows; backups + R2 lifecycle aligned
AI eval suite
OpenAI cost budget alerts — daily/weekly thresholds → email; alert content uses error-email allowlist
Auto-archive completed sessions to Drive — service-account write of output reports + encrypted JSON manifest into folder
App Health Phase 1 endpoints — /api/health/* for liveness (DB, OpenAI, ClamAV, Python sidecar, R2, Drive service-account, retention-job) and /api/metrics/* for analytics (restoration-rate, grounding-rate, cost-rate, eval-baseline, shadow-run state). Phase 2 dashboard wires up to these endpoints.

Phase 1 does NOT include: Admin dashboard UI, user role management UI, email template CRUD UI, Gmail draft creation, full Google Doc UX (drive picker / share controls — Phase 2; Phase 1 has Google Doc only as PDF fallback, API cost reporting page (basic alerts ship in Phase 1, admin page, advanced session management UI, the re-run workflow, the Help section UI, the full file upload to Google Drive UX (the auto-archive piece is Phase 1 ; the user-controlled upload UX with picker/change buttons is Phase 2), the App Health dashboard UI (Phase 1 ships the underlying endpoints; the dashboard UI is Phase 2), the PIPEDA right-to-deletion UI (Phase 1 handles deletion via manual SQL). Build the database tables for these features but do not build the UI or logic.

Tech Stack

Next.js (App Router) with Turbopack
React
shadcn/ui v4 with @base-ui/react
Tailwind CSS v4
React Context for shared state (clients, team members)
Drizzle ORM
PostgreSQL on Fly.io (Fly Postgres, yyz Toronto region)
GitHub at github.com/------
Fly.io hosting at _________ (auto-deploys on push, yyz Toronto region for Canadian data residency)
NextAuth.js with Google OAuth for authentication
docx npm package for Word document generation
LibreOffice (headless) on Fly.io for PDF conversion
OpenAI API (ChatGPT) for AI analysis — v1 only
Google Drive API for uploading files to folders (Phase 2) — requires Drive write scope added to Google OAuth
Cloudflare R2 for object storage (session files in Phase 1, help article images in Phase 2). All file paths prefixed with {organization_id}/ for tenant isolation.
Vitest for unit and integration testing, Playwright for end-to-end testing (Phase 1 Late)
GitHub Actions for CI/CD pipeline (Phase 2)
ClamAV for server-side virus/malware scanning of uploaded files
Google Workspace service account for the Drive auto-archive write on session completion. Credentials stored as Fly.io secret; rotation procedure documented; granted access to client root folders by admin. Used only for the auto-archive piece in Phase 1 (the user-controlled Drive upload UX in Phase 2 keeps the spec’s per-user OAuth model)

Dependency Management & Maintenance

This app uses open-source npm packages. Unlike WordPress (which has a core platform and plugins to update through a dashboard), this app has no “core” to update — but its dependencies need to stay current for security and stability.

Automated Security Monitoring

GitHub Dependabot is configured on the repo (.github/dependabot.yml) with:

Security vulnerability alerts — GitHub scans daily and notifies immediately if any dependency has a known vulnerability
Automated security fix PRs — For critical/high vulnerabilities, GitHub opens a PR with the fix automatically
Weekly version update PRs — Every Monday morning (Toronto time), Dependabot checks for routine updates. Minor and patch updates are grouped into a single PR. Max 5 open PRs at a time.

Socket.dev

The Claude Code security hook checks every npm install against Socket.dev for supply-chain attacks, typosquatting, and malware before any package is installed. This is in addition to Dependabot.

Update Workflow

Security PRs (urgent) — Merge within 1–2 days. These fix known vulnerabilities.
Routine update PRs (weekly) — Review and merge when convenient, or ask Claude Code to review them.
Quarterly review — Every 3 months, check for major version upgrades (Next.js, React, etc.), deprecated packages, and end-of-life runtimes.

Disaster Recovery (Phase 2)

The app has Fly.io’s built-in daily database backups from day one. Write the full plan in Phase 2.

Targets

RTO (Recovery Time Objective): 4 hours — the app should be back within half a business day
RPO (Recovery Point Objective): 1 hour — tolerate losing up to 1 hour of session data

Backups

Automated daily backups (Fly.io Postgres built-in)
Weekly offsite backup: encrypted database dump to Cloudflare R2 (separate bucket from session files)

Restore Testing

Once per quarter, restore a backup to the staging environment and verify it works
An untested backup is not a backup

Runbook

Document exact commands for: database is down, app won’t deploy, need to roll back a migration, need to restore from backup
See documentation/admin/disaster-recovery.md

Workflow Health (Phase 1)

The session state machine defines transitions but on its own can’t detect a worker that crashed mid-step. This section adds a heartbeat + detector that catches orphaned sessions before users notice they’re stuck.

Heartbeat

Every Sessions row carries a last_activity_at timestamp, updated on every state transition AND every step completion. Long-running steps (Step 7 stripping, Step 8 pre-processing, Step 9 AI call) write this field every 30 seconds while running so a worker crash mid-step is detectable.

Worker-Crash Recovery

On worker startup, the boot routine queries for sessions in mid-run states (processing, analyzing, stripping) with stale heartbeats (last_activity_at > 60 seconds ago) and marks them error with cleanup. This catches sessions that were running when the previous worker died.

Authentication & Authorisation

Authentication

Use NextAuth.js with Google OAuth. The domain restriction (@yourdomain.ca) is read from the Organizations table (organization.domain), not hard-coded. On login, the app looks up the user’s email domain against the Organizations table to find their organization. If no organization matches, deny login. No external users for v1, but this design means additional organizations can be onboarded later by adding a row to Organizations — no code changes required.

Security Requirements

Authentication: NextAuth.js with Google OAuth, restricted to @yourdomain.ca.
Authorisation: Role-based access control enforced on both frontend routes and every API endpoint. Every API route must check the user’s role before returning data.

Tenant-scoped reads use a helper. All reads of tenant-scoped entities (Sessions, Clients, Prompts, Outputs, Session Files, Audit Log) go through a single helper function pattern — getEntity(id, auth) — that filters by organization_id AND (user_id == auth.user_id OR auth.role == 'admin'). Never call findById(id) or raw queries that take only an ID. This is enforced by code review and by an explicit lint rule on the data-access layer.

All primary keys are UUID v4. Sessions, Clients, Prompts, Outputs, Session Files, Idempotency Keys, File Scan Log, Audit Log — all use UUID v4 as the primary key. Sequential or predictable IDs are forbidden because they enable URL-guessing IDOR attacks.

Admin session-view anomaly detection. Every admin session view writes session_viewed_by_admin to the audit log (already specified). If a single admin views more than 10 sessions in any 5-minute window, fire admin_session_view_anomaly to the audit log AND alert the platform admin (Slack to admin). This catches potential exfiltration attempts where an admin account is compromised.
Data at rest: Encrypt uploaded financial statements in storage. Use PostgreSQL pgcrypto or encrypt files before storing in the filesystem.
Data in transit: HTTPS only (enforced by Fly.io). All API calls to OpenAI over HTTPS.
API keys: Stored as Fly.io secrets (environment variables) only. Never exposed to the frontend. Never logged. Never included in error messages. The admin UI may show a masked placeholder but cannot read or edit the actual key.
Client name & confidential name stripping: All client identifying information and names of customers, suppliers, vendors and donors (for charities) removed from files and replaced with anonymized variables before any data leaves the server to a third-party API. These must be stored encrypted at rest in storage as variables so they can be restored after the data comes back from the AI.
Session isolation: users can only access their own sessions. Admins can access all sessions.
Input validation: Validate all file uploads by type and size. Maximum 25 MB per file, 100 MB per session. Accepted formats: .xlsx, .xls, .csv, .pdf, .zip. Sanitise all user inputs.
Rate limiting: Rate-limit all endpoints, not just AI analysis. Defaults: AI analysis 20 per user per day (configurable by admin), login attempts 5 failed per email per 15 minutes (log to audit log), file uploads 10 per user per hour, read API endpoints 100 per user per minute, write API endpoints 30 per user per minute. Implementation: RateLimiterPostgres from rate-limiter-flexible per ER-3 (NOT in-memory — in-memory limits are per-process and break across multiple Next.js machines, making the actual rate limit N × machine_count). Counters live in a Postgres table; one query per limited request, well within budget for Phase 1 traffic.
Error handling: Never expose stack traces, database queries, or internal file paths to users. Log errors server-side. Send error notification emails to admin@yourdomain.ca.
PIPEDA notice (updated post-CEO-review — Cherry-1 + Cherry-5): Display a brief privacy notice in the UI explaining that financial data is processed server-side, client, supplier, vendor and donor names are stripped before AI analysis, and:

“Successfully completed analyses are retained for 120 days. Abandoned, errored, and in-progress sessions are deleted within 50 days. Audit logs are retained for 90 days. Backups (Fly.io Postgres + R2 lifecycle) are aligned with these windows. Senior users may view your sessions during prompt comparison (shadow-run feature, Phase 1 Late). We use OpenAI’s API on a Business tier with a signed Data Processing Agreement; OpenAI retains request logs for 30 days and does not use submitted data for model training.”

PIPEDA right-to-deletion handling (Phase 1): manual SQL handled by admin per F4.3=C; UI deferred to Phase 2.
OpenAI Data Processing Agreement: Use the OpenAI API on a Business or Enterprise tier with a signed Data Processing Agreement (DPA) confirming that submitted data is not used for model training. Verify this is in place before launch. The PIPEDA notice should name the AI provider and link to their privacy policy.
Envelope encryption (Tension-1=A — confirmed Phase 1; ER-4 adds rotation mechanics): Use envelope encryption for data at rest. A master key (MEK, stored as a Fly.io secret) encrypts per-session data encryption keys (DEKs). Each session gets its own DEK, which encrypts that session’s R2 files plus the encrypted JSON fields (name_stripping_map, restoration_verification, content_grounding, shadow_outputs, fellow_match_log). The DEK itself is encrypted with the MEK and stored alongside the session. This limits blast radius — a leaked DEK exposes one session, not all data.

Rotation mechanics (ER-4): Sessions carry a mek_version integer column from Day 1. Schema supports old + new MEK coexisting during rollover. Rotation script (scripts/rotate-mek.ts, Phase 1 Late) walks Sessions in batches with --dry-run and --confirm flags, decrypts each wrapped DEK with the old MEK, re-encrypts with the new MEK + new mek_version, verifies a sample post-rotation by round-trip-decrypting an encrypted JSON field. Previous MEK kept for 30 days for dual-read / rollback. Keep an offline backup of every active MEK in a secure location separate from the server.

Outside-voice note (codex review, 2026-05-06): the original CEO-plan-draft attempted to defer envelope-per-session DEKs to Phase 2 to save ~3 days CC. Codex correctly flagged this as a Phase 1 security downgrade, not migration debt — the spec lists per-session DEKs specifically to limit blast radius. Restored to Phase 1 unconditionally.
Deep file metadata stripping: Before sending any data to the AI, strip identifying information using purpose-built libraries that go beyond shallow document-properties stripping.

For Excel (.xlsx / .xls) — use openpyxl:

Strip document properties: Author, Company, Manager, Title, Subject, Keywords.
Strip comments and cell notes.
For each named range, anonymise both the name AND inspect the formula definition for client-name string literals. Example: a named range [Smith_Revenue] defined as =Sheet1!$A$1:$B$100 is fine after the name is changed; a named range defined as =INDIRECT("Smith_Corp"&...) contains a literal name that must also be stripped.
Strip data-validation embedded formulas (see Security Requirement #15 for details).

For PDFs — use pikepdf:

Strip metadata: Author, Title, Subject, Creator, Producer, Keywords.
Remove all annotations OR scan their content with the same Step 7 fuzzy match. Annotations can carry “Prepared by Smith Corp” hidden in comment objects.
Remove the document outline (bookmarks).
Remove embedded files.
Reject PDFs with XFA forms (rare and dangerous — most legitimate financial PDFs do not use XFA).

Post-strip verification: After stripping, re-open the file and confirm:

All targeted document-property fields are empty.
For Excel: comments removed, no named ranges with client-name literals.
For PDFs: annotations and bookmarks removed.

If any residual content is found, log a warning and surface it to the user before they confirm the stripped version: “PDF stripped, but [N] suspicious entries remain. Manual review recommended before continuing.” The session’s existing Step 7 stripping audit log gets a metadata_residuals: [...] array so warnings persist with the session.

Tolerance: warnings, not hard rejects — PDFs are inherently messy and the user gets to decide whether to proceed. 15. Malicious file protection: Reject Excel files that contain VBA macros (check for macro content and reject or strip). Sanitize zip extraction to prevent zip-slip path traversal attacks (validate that extracted paths don’t escape the target directory). Strip Excel formulas to values during parsing — do not evaluate formulas. Scan PDFs for embedded JavaScript and reject if found.

ClamAV deployment:

Version: ClamAV 0.104+ deployed as a sidecar service in the same Fly.io app, OR as a separate Fly.io machine reachable over the internal network. The choice does not change behaviour, only ops topology.
Signature DB lives on a Fly.io Volume mounted at /var/lib/clamav, persisted across machine restarts.
A daily cron runs freshclam to refresh signatures. Last-update timestamp is exposed via GET /api/health as clamav.last_signature_update.

Fail-closed degradation: the health endpoint pings ClamAV. If ClamAV is unreachable, all uploads are rejected with: “File scanning is temporarily unavailable. Please try again in a few minutes.” Every blocked upload during an outage logs clamav_unavailable to the audit log so the admin can re-invite affected users once service resumes. After 5 minutes of continuous downtime, an admin alert fires (email + Slack webhook).

Magic-byte sniff (independent layer, runs before ClamAV):

Extension	Required magic bytes
`.xlsx`, `.xls` (post-2007 OOXML)	`50 4B 03 04` (ZIP signature, since OOXML is a ZIP)
`.xls` (legacy CFBF)	`D0 CF 11 E0 A1 B1 1A E1`
`.pdf`	`25 50 44 46 2D` (`%PDF-`)
`.csv`	No magic bytes; enforce UTF-8 / Latin-1 / UTF-16 BOM detection — reject binary content
`.zip`	`50 4B 03 04`

If extension and magic bytes don’t match, reject with: “This file does not match its file extension. Please re-save it in the correct format and try again.” Log to audit log as file_rejected_magic_mismatch.

Scan log: every scan recorded to the new FileScanLog entity (timestamp, user_id, filename, file_size, scan_engine, scan_result, scan_duration_ms, virus_name_if_found). Retention: 90 days. Used for forensics and ClamAV health metrics.

Step 5 ordering: the file-upload pipeline runs in this order — extension check → magic-byte sniff → ClamAV → content-attack mitigations (below) → metadata strip → store. Any rejection along the way short-circuits with the appropriate user-facing message and audit-log entry.

User Workflow

Multi-tenancy note: Every database query in the workflow must filter by organization_id. This is both a multi-tenancy requirement and a defense-in-depth measure against data leaks between organizations.

Idempotency note (Tension-3 — refined hash inputs; ER-10 clarifies timing + PK): The hash is computed at the FINAL “Run Analysis” click — the click the user presses after reviewing the strip preview, after acknowledging data-quality warnings, after confirming the prompt. By that point all hash inputs exist on the client. Earlier clicks (start upload, confirm strip, confirm acks) are NOT idempotency-protected because they’re step-completion submissions, not workflow-run commitments. The IdempotencyKeys.key column is the natural primary key (the hash itself is already 256-bit collision-resistant; the table only ever holds rows for 300s; no UUID id column is added — explicit carve-out from the universal-UUID-PK rule). The “Run Analysis” button computes a deterministic SHA-256 idempotency hash on the client of:

user_id
client_id (or temp_client_id for manual entries — Tension-4)
mode (preview/final)
prompt_version_id (NOT prompt_id — prompts version on save)
ordered_file_SHA256s (concatenated, ordered)
manual_strip_additions_hash (hash of user’s add-to-strip list, if any)
manual_fellow_text_hash (hash of any pasted action items)
data_quality_ack_state_hash (hash of which warnings were acknowledged)

The server stores every key in the IdempotencyKeys table with a state machine (pending → succeeded / failed) and a 300-second TTL. Server-side TTL handles time-window dedup; the hash itself does not include a timestamp window. On a second request with the same key:

Not found / TTL expired: treated as a fresh request; insert pending, run the workflow.
Found, status = pending: wait up to 30 seconds (poll once per second) for the original to complete. If it does, return its result_session_id. If it does not, return HTTP 409 “still processing — please wait.”
Found, status = succeeded: return the original result_session_id immediately. No new AI call. No double billing.
Found, status = failed: allow the retry — replace the failed row with a new pending insert.

Because the hash includes prompt-version, mode, and manual inputs, two clicks with the same files but different mode/prompt-version/manual-input do NOT collapse into one run. Same files + same prompt_version + same mode + same manual inputs + same acks within 300s = same idempotency key, regardless of which browser tab issued the click. This protects against the “two-tab double-click” failure mode while still allowing legitimate distinct runs.

Step 1 — Authentication

The user (user) must authenticate using their Google Workspace account via NextAuth.js. The permitted domain is read from the Organizations table, not hard-coded.

Step 2 — Client Selection

User selects a client name from a dropdown. The client list is populated from a Google Sheet that syncs. The app reads the Google Sheet using a Google Service Account. The Sheet ID and Service Account credentials are stored as Fly.io secrets.

The sync runs on every app load / user login to keep the client list fresh. The admin Client Sync page (Phase 2) will also have a manual “Sync Now” button.

Sync validation: Log every sync event (clients added, updated, unchanged). If a sync detects an unusually large number of new clients (configurable threshold, default 50), pause the sync and alert the admin rather than importing blindly. This prevents accidental or malicious bulk injection of fake client records.

Each client record includes: client name, sector, Google Drive client folder ID, Google Drive folder name. (Phase 2 — needed for the file upload step; column may be blank for some clients).

Field	Column header in spreadsheet
Client name	fc_client_name
Sector	client_sector
Google Drive client folder ID	drive_client_folder_id
Google Drive client folder	drive_client_folder_name
Google Drive client folder URL	drive_client_folder_url

If the client is not in the database, the user can enter the client name manually and select the sector from a dropdown.

Manual client tokenization:

Sample Client UX : A seeded Sample Client (Demo) row (is_sample = TRUE) appears in the dropdown with a DEMO chip + amber background to differentiate it from real clients. When selected, Step 4 (Display Required Reports) shows a prominent banner: “You are running the demo client. Output is for practice only and will not be sent to anyone.” The Step 10b output is also flagged. Sample sessions are filtered out of all admin metrics, cost-alert evaluation, App Health analytics, and audit-log search by default (per Security Req #25). Admins can toggle “include samples” when debugging the demo flow.

First-run experience : New users landing on Day 1 see an empty client list with a Try the Demo CTA pointing at the Sample Client. Drops time-to-first-value from “30 minutes of fumbling” to “5 minutes confidence” without requiring a 1:1 onboarding meeting.

Step 3 — Mode Selection

Two modes: Preview and Final.

Preview mode: Runs the full workflow but skips the final step (email draft creation). Allows the user to review the summary and cheat sheet, make changes, and re-run if needed.
Final mode: Runs the full workflow including email draft creation (Phase 2 feature — for now, just save the mode selection to the session).

Step 4 — Display Required Reports & Prompt

After the client is selected, display:

A list of the required financial statements for the client’s sector (see PROMPT-03-required-reports.md).
The prompt that will be used for the AI analysis.

Each client must have an assigned sector and a default prompt. There are two kinds of prompts:

System prompts: Global prompts created by admins. Editable by admins and senior users.
User prompts: Personal prompts saved to the authenticated user’s profile.

Prompts are named by sector and have full CRUD and clone functionality. Clone means duplicate all fields except date created and unique IDs, then open as a draft.

The humanizer rules (see the bottom section of PROMPT-02-ai-prompts.md) are stored as an editable section of the prompt entity, not hard-coded.