State-of-the-Art Module Analysis — Assembly Internal

Rating system

Build Complexity (1 = trivial configuration → 5 = novel research-grade engineering) · Established Precedence (1 = no existing solutions → 5 = commodity, mature products) · Innovation Risk (1 = well-understood path → 5 = unsolved problem, high probability of setback).

COMMODITY

Off-the-shelf solutions exist. Wire them in — no custom engineering beyond configuration.

ESTABLISHED

Proven technology with integration effort. Custom glue code, no novel research.

CUTTING EDGE

Technology exists but the application is novel. Significant custom engineering, limited precedent.

BLEEDING EDGE

Assembly's proprietary innovation territory. No existing solution. Research-grade engineering.

Module summary scorecard

#	Module	Complexity	Precedence	Risk	Tier
M1	Assembly Session Engine	2/5	5/5	1/5	COMMODITY
M2	Multi-modal Input Processor	3/5	4/5	2/5	ESTABLISHED
M3	Speaker Diarizer & Role Mapper	3/5	3/5	3/5	CUTTING EDGE
M4	Agentic Orchestration Engine	5/5	1/5	5/5	BLEEDING EDGE
M5	Phase Barometer & Transition Manager	4/5	1/5	4/5	BLEEDING EDGE
M6	AI Agent Persona Framework	4/5	2/5	4/5	CUTTING EDGE
M7	Environment Agent Personas (Dev/UAT/Prod)	5/5	1/5	5/5	BLEEDING EDGE
M8	Code Generation Engine	1/5	5/5	1/5	COMMODITY
M9	UI/UX Generation Engine	2/5	4/5	2/5	ESTABLISHED
M10	Voice Model Integration (STT/TTS)	1/5	5/5	1/5	COMMODITY
M11	Prototype Renderer & App Agent	4/5	3/5	3/5	CUTTING EDGE
M12	Product Development Data Model	3/5	2/5	3/5	CUTTING EDGE
M13	Timeline Manager	3/5	3/5	3/5	CUTTING EDGE
M14	IAM & Session Control	1/5	5/5	1/5	COMMODITY
M15	Analytics & Observability	1/5	5/5	1/5	COMMODITY

Module deep-dive analysis

M1

Assembly Session Engine

COMMODITY

Complexity 2/5 · Precedence 5/5 · Risk 1/5

Best available: LiveKit · Daily.co · Pipecat · Agora

WebRTC-based real-time multi-party session management is a solved problem as of 2026. The LiveKit Agents framework reached 1.0 in April 2025, and LiveKit is the infrastructure backbone for OpenAI ChatGPT Voice, Meta, and Character.ai. The LiveKit Agents SDK allows AI agents to join rooms as first-class participants: they receive audio, process it through an STT→LLM→TTS pipeline, and respond in under 500ms. Pipecat layers a higher-level pipeline abstraction on top with 40+ model plugins. Assembly builds no session infrastructure from scratch.

Gap to build

No meaningful gap at the infrastructure level. Custom work: session state persistence schema (M12) and evergreen link management (M14). Estimate: 2 weeks integration + testing.

Risk note

Low. Biggest risk is session state restoration after long absences — handled as a persistence concern in M12/M13, not in LiveKit.

M2

Multi-modal Input Processor

ESTABLISHED

Complexity 3/5 · Precedence 4/5 · Risk 2/5

Best available: Pipecat · Deepgram Nova-3 · AssemblyAI Universal-3 Pro

Voice, screen share, and chat are each individually solved: streaming STT is production-grade with sub-100ms latency, screen capture uses standard browser Media Capture APIs, and chat is trivial WebSocket messaging. The novel engineering is the context fusion layer: combining speaker-attributed voice turns with screen annotation metadata and chat into a single unified context object the Orchestration Engine can consume.

Gap to build

The multi-modal context fusion layer has no off-the-shelf equivalent: define the context object schema (utterance + speaker ID + role + timestamp + screen_region + confidence) and build the fusion processor. Estimate: 3–4 weeks.

Risk note

Medium-low. Primarily a data engineering problem. Edge case: simultaneous talkers while screen sharing — overlapping events need a well-defined merge strategy.
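
As a rough illustration of the context object and merge strategy discussed above, the sketch below models one fused event type and interleaves per-modality streams by timestamp. The class name `ContextEvent`, the `modality` field, and the tie-break rule (timestamp, then modality name) are assumptions made for this example, not a settled schema.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ContextEvent:
    """One entry in the unified context timeline (hypothetical schema)."""
    timestamp: float                 # seconds since session start
    modality: str                    # "voice", "screen", or "chat"
    speaker_id: str
    role: str
    utterance: str = ""
    screen_region: Optional[Tuple[int, int, int, int]] = None  # x, y, w, h
    confidence: float = 1.0


def fuse(*streams: Iterable[ContextEvent]) -> List[ContextEvent]:
    """Merge streams that are each already time-ordered into one timeline.

    Events sharing a timestamp are ordered by modality name, so overlapping
    input (two talkers, or talking while annotating) merges deterministically.
    """
    return list(heapq.merge(*streams, key=lambda e: (e.timestamp, e.modality)))


voice = [
    ContextEvent(1.0, "voice", "u1", "PM", "Let's start with onboarding."),
    ContextEvent(2.0, "voice", "u2", "Designer", "This button is too small."),
]
screen = [ContextEvent(2.0, "screen", "u2", "Designer", screen_region=(40, 80, 120, 32))]
fused = fuse(voice, screen)
```

A timestamp-plus-modality ordering is only one possible policy; the point is that whichever policy is chosen should be explicit and deterministic so downstream agents see a stable timeline.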

M3

Speaker Diarizer & Role Mapper

CUTTING EDGE

Complexity 3/5 · Precedence 3/5 · Risk 3/5

Best available: pyannote 3.1 · AssemblyAI · Deepgram

Diarization itself is well-solved (pyannote 3.1 leads offline accuracy; Deepgram bundles diarization into its streaming API). The cutting-edge component is the Role Mapper: linking voice fingerprints captured at session join to each participant's business role, so every utterance is attributed not just by speaker but by role — enabling agents to weight input accordingly.

Gap to build

Session-bootstrap voice enrollment, fingerprint storage in the participant profile, real-time lookup during transcription. Estimate: 3–4 weeks including testing with 5+ concurrent speakers.
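
The enrollment and lookup steps above can be sketched as a nearest-match search over stored voice embeddings. This is a minimal illustration: the `RoleMapper` class, the 0.75 similarity threshold, and the three-dimensional vectors are all hypothetical, and a real system would take embeddings from whichever speaker-embedding model is chosen.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class RoleMapper:
    """Maps a voice embedding to (participant_id, role) via enrolled fingerprints."""

    def __init__(self, threshold: float = 0.75):  # threshold is an assumed value
        self.threshold = threshold
        self._profiles: Dict[str, Tuple[Sequence[float], str]] = {}

    def enroll(self, participant_id: str, embedding: Sequence[float], role: str) -> None:
        """Store the fingerprint captured at session join."""
        self._profiles[participant_id] = (embedding, role)

    def lookup(self, embedding: Sequence[float]) -> Optional[Tuple[str, str]]:
        """Return the best match at or above the threshold, or None if unknown."""
        best_id, best_score = None, self.threshold
        for pid, (enrolled, _role) in self._profiles.items():
            score = cosine(embedding, enrolled)
            if score >= best_score:
                best_id, best_score = pid, score
        if best_id is None:
            return None
        return best_id, self._profiles[best_id][1]


mapper = RoleMapper()
mapper.enroll("alice", [1.0, 0.0, 0.0], "PM")
mapper.enroll("bob", [0.0, 1.0, 0.0], "Engineer")
```

Returning `None` for low-similarity audio lets the caller fall back to an "unknown speaker" attribution instead of guessing a role.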

Risk note

Medium. Diarization degrades with 3+ simultaneous speakers in noise. Mitigation: offer a push-to-talk / mic-indicator UX for critical sessions and keep automatic diarization as the default.

M4

Agentic Orchestration Engine

BLEEDING EDGE · CORE IP

Complexity 5/5 · Precedence 1/5 · Risk 5/5

Best available: LangGraph (state machine backbone) · OpenAI Agents SDK

Assembly's most critical and most novel module — no existing production system coordinates multiple AI agent personas with multiple human participants in a live voice session. The closest analog (OpenAI Realtime multi-agent pattern) targets 1:1 sessions with sequential handoffs. LangGraph is the strongest foundation: conditional-edge routing, built-in checkpointing for the Timeline, best-in-class latency/reliability in 2026. The real innovation: (1) real-time voice context routing, (2) a turn-taking protocol that prevents agent pile-ons, (3) a phase awareness model, and (4) an inter-agent hidden communication channel.
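
To make the turn-taking idea in point (2) concrete, here is one possible shape for a floor arbiter that grants the floor to at most one agent at a time. It is a sketch under stated assumptions: the bidding model, the `urgency` score, the 10-second cooldown, and the rule that agents never interrupt a human are illustrative choices, not the protocol itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Bid:
    agent: str
    urgency: float  # 0..1, how strongly the agent wants to speak (hypothetical)


class FloorArbiter:
    """Grants the floor to a single agent per decision to avoid pile-ons."""

    def __init__(self, cooldown_s: float = 10.0, min_urgency: float = 0.5):
        self.cooldown_s = cooldown_s
        self.min_urgency = min_urgency
        self._last_spoke: Dict[str, float] = {}

    def grant(self, bids: List[Bid], now: float, human_speaking: bool) -> Optional[str]:
        """Return the winning agent name, or None if nobody should speak."""
        if human_speaking:
            return None  # assumed rule: agents never talk over a human
        eligible = [
            b for b in bids
            if b.urgency >= self.min_urgency
            and now - self._last_spoke.get(b.agent, float("-inf")) >= self.cooldown_s
        ]
        if not eligible:
            return None
        winner = max(eligible, key=lambda b: b.urgency)
        self._last_spoke[winner.agent] = now
        return winner.agent


arbiter = FloorArbiter()
bids = [Bid("pm", 0.9), Bid("ux", 0.7)]
```

A deterministic arbiter like this is also easy to drive from a simulation harness, which helps when tuning cooldowns before any live session.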

Gap to build

CRITICAL GAP. The coordination layer between LangGraph and the voice pipeline is entirely original engineering. Estimate: 4–6 months dedicated, with parallel experimental prototyping before architecture commitment.

Risk note

CRITICAL. If M4 is poorly designed, the entire product experience breaks. Start with single-agent + multi-human; build a turn-taking simulation harness; treat M4 as a research workstream alongside Milestone 1.

M5

Phase Barometer & Transition Manager

BLEEDING EDGE

Complexity 4/5 · Precedence 1/5 · Risk 4/5

Best available: LangGraph state machines (structural backbone only)

A real-time readiness score computed from session data that triggers phase transitions — one of Assembly's most distinctive concepts with no direct technical precedent. LangGraph provides the state machine backbone; the readiness scoring model (weighted signals: product type completeness, CUJ coverage, UX/UI coverage, feasibility, pain-point breadth) is entirely novel and needs extensive calibration with real users.
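
One simple way to picture the readiness score is a weighted sum of the five signals with a transition threshold. The sketch below assumes each signal is normalized to 0..1; the weights, the 0.7 threshold, and the "sustain for three consecutive updates" rule are placeholder values that would come out of calibration, not recommendations.

```python
from typing import Dict

# Hypothetical weights; they sum to 1.0 so the score stays in 0..1.
WEIGHTS: Dict[str, float] = {
    "product_type_completeness": 0.20,
    "cuj_coverage": 0.30,
    "ux_ui_coverage": 0.20,
    "feasibility": 0.15,
    "pain_point_breadth": 0.15,
}


def readiness(signals: Dict[str, float]) -> float:
    """Weighted sum of signals, each clamped to 0..1; missing signals count as 0."""
    return sum(w * min(max(signals.get(k, 0.0), 0.0), 1.0) for k, w in WEIGHTS.items())


class PhaseBarometer:
    """Signals a transition only after the score holds above the threshold."""

    def __init__(self, threshold: float = 0.7, sustain: int = 3):
        self.threshold = threshold
        self.sustain = sustain
        self._streak = 0

    def update(self, signals: Dict[str, float]) -> bool:
        """Feed the latest signals; returns True when a transition should fire."""
        self._streak = self._streak + 1 if readiness(signals) >= self.threshold else 0
        return self._streak >= self.sustain


barometer = PhaseBarometer()
high = {name: 0.9 for name in WEIGHTS}
low = {name: 0.2 for name in WEIGHTS}
```

Requiring a sustained score is one guard against a momentary spike firing the transition too early; the right mechanism is a calibration question.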

Gap to build

No reference implementation. Signal selection, weights, and thresholds are product design questions that precede engineering; the visual barometer UX must avoid distorting natural conversation. Estimate: 2–3 months including calibration sessions.

Risk note

High. If the transition fires too early, the prototype is poor and users are disappointed; if it is too conservative, the conversation drags and users lose faith. Calibrate with real users across multiple beta rounds.

M6

AI Agent Persona Framework

CUTTING EDGE

Complexity 4/5 · Precedence 2/5 · Risk 4/5

Best available: CrewAI · LangGraph · AutoGen/AG2

Plugin-style infrastructure for distinct personas (PM, UX, UI, Tech Architect), each with domain scope, prompt strategy, probing question library, and output schema. CrewAI is the most mature for role-based definitions; LangGraph is the production routing choice. Cutting-edge parts: voice-native persona behavior (when to speak vs. stay silent — no framework addresses this), probing-question timing, and the hidden inter-agent channel.
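
The plugin-style persona definition described above might be represented as a small data structure plus a registry. This is a framework-agnostic sketch: the field names mirror the four elements listed in the text, while `PersonaRegistry`, `in_scope`, and the sample topics are hypothetical and do not correspond to CrewAI or LangGraph APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Persona:
    name: str
    domain_scope: List[str]                       # topics this persona owns
    prompt_strategy: str                          # system-prompt template or key
    probing_questions: List[str] = field(default_factory=list)
    output_schema: Dict[str, str] = field(default_factory=dict)

    def in_scope(self, topic: str) -> bool:
        """Case-insensitive check that a topic belongs to this persona."""
        return topic.lower() in (s.lower() for s in self.domain_scope)


class PersonaRegistry:
    """Holds registered personas and finds the ones relevant to a topic."""

    def __init__(self) -> None:
        self._personas: Dict[str, Persona] = {}

    def register(self, persona: Persona) -> None:
        if persona.name in self._personas:
            raise ValueError(f"persona already registered: {persona.name}")
        self._personas[persona.name] = persona

    def for_topic(self, topic: str) -> List[Persona]:
        return [p for p in self._personas.values() if p.in_scope(topic)]


registry = PersonaRegistry()
registry.register(Persona("PM", ["pricing", "scope"], "pm_v1",
                          ["Who is the primary user?"], {"feature": "str"}))
registry.register(Persona("UX", ["navigation", "onboarding"], "ux_v1"))
```

Keeping personas as plain data makes it cheaper to move the same definitions from a prototyping framework to the production one.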

Gap to build

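Prototype rapidly in CrewAI, then move to LangGraph for production. The persona prompt library and probing-question corpus are hand-crafted and tuned in real sessions. Estimate: 6–8 weeks per persona; the ~4 months total assumes the four personas (PM, UX, UI, Tech Architect) are built largely in parallel, since strictly sequential work would take 24–32 weeks.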

Risk note

High. Too chatty overwhelms participants; too passive misses critical product dimensions. Plan dedicated persona behavior testing sprints.

M7

Environment Agent Personas (Dev / UAT / Prod)

BLEEDING EDGE

Complexity 5/5 · Precedence 1/5 · Risk 5/5

Best available: No direct equivalent — built on M6 + CI/CD APIs

Assembly's most radical differentiator: non-technical users converse with Dev / UAT / Prod agents, with intent-based routing ("how many users signed up yesterday?" → Prod Agent). No precedent in any production product. Requires real environment integrations: CI/CD APIs (GitHub Actions, Vercel) for Dev, staging/test execution for UAT, observability ingestion (PostHog, Sentry, DataDog) for Prod.
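
The intent-based routing described above can be pictured with a deliberately naive keyword router. A production router would more likely use an LLM or trained classifier; the keyword sets, the `route` function, and the choice to return `None` for unmatched utterances are assumptions for illustration only.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical keyword sets per environment agent.
ROUTES: Dict[str, Set[str]] = {
    "prod": {"users", "signed", "signups", "errors", "traffic", "revenue"},
    "uat": {"test", "tests", "staging", "acceptance", "regression"},
    "dev": {"build", "deploy", "branch", "commit", "pipeline"},
}


def route(utterance: str) -> Optional[str]:
    """Pick the agent whose keywords overlap most with the utterance.

    Returns None when nothing matches, so the caller can ask a clarifying
    question instead of guessing.
    """
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    scores = {agent: len(words & keywords) for agent, keywords in ROUTES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For the example in the text, `route("How many users signed up yesterday?")` selects the Prod agent because two of its keywords match and none of the others do.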

Gap to build

Entirely novel: persona design, intent routing, all three environment integrations. Prod Agent is the most complex. Estimate: 3–4 months, scheduled as Milestone 4 work once M4 and M6 are production-proven.

Risk note

CRITICAL but deferred. Build last. The demo potential ("Prod Agent, how many users do we have in Europe?") is a key VC moment — do not rush it.

M8

Code Generation Engine

COMMODITY

Complexity 1/5 · Precedence 5/5 · Risk 1/5

Best available: Claude API · OpenAI Codex · Gemini

Fully commoditized as of 2026 — Claude leads SWE-bench Verified by a wide margin. Agents construct a structured product spec and invoke the model; iterative refinement loops incorporate session feedback. Assembly's role is prompt engineering mastery and loop design, not model development. Large context windows allow passing the full spec plus previous code versions.

Gap to build

None at the model level. Custom: spec-to-prompt translation layer and a diff-based (not full-rewrite) refinement loop. Estimate: 3–4 weeks.
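
A diff-based refinement step could look like the sketch below, where the model is asked for small (old, new) snippet edits rather than a full file. The edit format and the exactly-once matching rule are assumptions for this example; `difflib.unified_diff` is used only to display the resulting change.

```python
import difflib
from typing import List, Tuple


def apply_edits(source: str, edits: List[Tuple[str, str]]) -> str:
    """Apply (old, new) snippet replacements in order.

    Each `old` snippet must occur exactly once, so an ambiguous or stale
    edit fails loudly instead of silently changing the wrong place.
    """
    for old, new in edits:
        if source.count(old) != 1:
            raise ValueError(f"edit target must match exactly once: {old!r}")
        source = source.replace(old, new)
    return source


def show_diff(before: str, after: str) -> str:
    """Render a unified diff between two versions of the code."""
    return "".join(difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile="before", tofile="after",
    ))


before = "button { size: 12px; color: gray; }\n"
after = apply_edits(before, [("12px", "16px"), ("gray", "blue")])
```

Rejecting edits that do not match exactly once gives the loop a clear signal to re-prompt the model with fresh context.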

Risk note

Low. Token cost is a management challenge, not a risk: tiered models (cheap for simple iterations, frontier for complex generation) and output caching.
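
The tiering and caching idea above might be sketched as follows. The tier names, the 20-line threshold used as a complexity proxy, and the `generate` callable (a stand-in for a real model client) are hypothetical.

```python
import hashlib
from typing import Callable, Dict


def pick_tier(changed_lines: int, threshold: int = 20) -> str:
    """Route small iterations to a cheap model and larger ones to a frontier model."""
    return "cheap" if changed_lines <= threshold else "frontier"


class CachedGenerator:
    """Wraps a generation function with an in-memory cache keyed by tier and prompt."""

    def __init__(self, generate: Callable[[str, str], str]):
        self._generate = generate          # stand-in for a real model call
        self._cache: Dict[str, str] = {}
        self.calls = 0                     # number of uncached model invocations

    def run(self, tier: str, prompt: str) -> str:
        key = hashlib.sha256(f"{tier}\x00{prompt}".encode()).hexdigest()
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._generate(tier, prompt)
        return self._cache[key]


gen = CachedGenerator(lambda tier, prompt: f"[{tier}] {prompt}")
first = gen.run(pick_tier(3), "rename the submit button")
second = gen.run(pick_tier(3), "rename the submit button")
```

Exact-match caching only pays off for repeated identical prompts; whether that happens often enough is something the cost metrics would need to show.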

M9

UI/UX Generation Engine

ESTABLISHED

Complexity 2/5 · Precedence 4/5 · Risk 2/5

Best available: v0 by Vercel · Lovable · Claude Artifacts

The UI generation landscape matured dramatically in 2025–2026: v0 generates production-grade React + Tailwind + shadcn/ui from natural language; Lovable hit $20M ARR in 2 months; Bolt.new runs entirely in-browser. Assembly needs output captured as code and rendered inside its own session iframe (M11) — not redirected to external tools.

Gap to build

Keep rendered output inside the session, and make the feedback loop incremental ("make that button bigger and blue" → diff, not regeneration). Estimate: 3–4 weeks.

Risk note

Low. Primary risk is run-to-run output inconsistency. Mitigation: seed with design system constraints and lock the model version.

M10

Voice Model Integration (STT / TTS)

COMMODITY

Complexity 1/5 · Precedence 5/5 · Risk 1/5

Best available: Deepgram Nova-3 (STT, ~90ms TTFB) · Cartesia Sonic (TTS, ~40ms TTFA)

Fully commoditized; sub-100ms latency per stage is the industry baseline (the figures above add up to roughly 130ms for STT plus TTS, before any LLM time). Optimal stack: Deepgram for STT (lowest latency + native diarization), Cartesia Sonic for TTS. Pipecat has native plugins for 40+ providers. Multi-vendor fallback is recommended for resilience.

Gap to build

Zero gap — configuration only. One nuance: each agent persona should have a distinct voice for social presence (ElevenLabs / Cartesia voice cloning).

Risk note

Minimal. Heavy accents / multi-language reduce accuracy; v1 is English-first. Budget multi-vendor routing from day one.
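
Multi-vendor routing can start as a plain ordered fallback chain. In the sketch below each provider is an arbitrary callable standing in for a vendor client; no real SDK calls are shown, and the function and provider names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Tuple[str, Callable[[bytes], str]]  # (name, transcribe function)


def transcribe_with_fallback(audio: bytes, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, transcript) from the first success."""
    errors: List[str] = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio)
        except Exception as exc:  # any vendor failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_primary(audio: bytes) -> str:
    raise TimeoutError("upstream timeout")  # simulated outage


def backup(audio: bytes) -> str:
    return "hello world"


result = transcribe_with_fallback(b"\x00\x01", [("primary", flaky_primary), ("backup", backup)])
```

Returning the provider name alongside the transcript makes it straightforward to log how often the fallback path is taken.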

M11

Prototype Renderer & App Agent

CUTTING EDGE

Complexity 4/5 · Precedence 3/5 · Risk 3/5

Best available: Stagehand v3 · Browserbase · Playwright

AI browser automation matured in 2025–2026: Stagehand v3 uses natural-language instructions over a modular driver system, and DOM-driven control benchmarks 12–17 points more reliable than vision-driven approaches. The App Agent navigates the generated prototype and streams it to all participants via WebRTC screen-capture forwarding from a cloud browser into the LiveKit room.

Gap to build

Multi-participant browser streaming is not out-of-the-box: cloud browser → capture → forward into LiveKit → all participants. Achievable with LiveKit's screen-share participant feature, but custom plumbing. Estimate: 4–6 weeks including reliability testing on AI-generated UIs.

Risk note

Medium. Automation on dynamically generated UIs can be brittle even with NL selectors. Test against a corpus of AI-generated UIs; plan App Agent recovery flows.

M12

Product Development Data Model

CUTTING EDGE

Complexity 3/5 · Precedence 2/5 · Risk 3/5

Best available: Supabase (Postgres) · Neo4j · Prisma ORM

The database tech is commodity; the novel engineering is the domain-specific schema: Product, Session, Phase, Feature, UserJourney, Component, AgentDecision, TimelineEvent, Participant. Hardest problems: AgentDecision attribution (every state change traces to an agent action triggered by a human utterance), the FeatureGraph relationship model, and multi-tenant access. Supabase (Postgres + Realtime + RLS) is the strongest v1 platform.
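
The attribution chain described above (state change → agent action → human utterance) can be illustrated with three linked records. The field names and the `trace` helper are hypothetical; the actual schema is the output of the design sprint, not of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Utterance:
    id: str
    participant_id: str
    text: str


@dataclass(frozen=True)
class AgentDecision:
    id: str
    agent: str
    action: str
    triggered_by: str   # Utterance.id


@dataclass(frozen=True)
class TimelineEvent:
    id: str
    entity: str         # e.g. "Feature:checkout"
    change: str
    decision_id: str    # AgentDecision.id


def trace(event: TimelineEvent,
          decisions: Dict[str, AgentDecision],
          utterances: Dict[str, Utterance]) -> List[str]:
    """Walk from a state change back to the human utterance that caused it."""
    decision = decisions[event.decision_id]
    utterance = utterances[decision.triggered_by]
    return [
        f"event:{event.id}",
        f"decision:{decision.id} by {decision.agent}",
        f"utterance:{utterance.id} from {utterance.participant_id}",
    ]


utterances = {"u1": Utterance("u1", "alice", "We need guest checkout.")}
decisions = {"d1": AgentDecision("d1", "PM", "add_feature", "u1")}
event = TimelineEvent("e1", "Feature:checkout", "added guest checkout", "d1")
```

Making `decision_id` and `triggered_by` required fields is one way to ensure no state change can be written without its causal chain.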

Gap to build

Schema design IS the gap and precedes all code: a 2–3 week design sprint with domain expert sessions. Neo4j may help for graph traversal at scale.

Risk note

Medium. Schema mistakes are expensive post-launch. Under-designing AgentDecision attribution breaks Timeline causality — invest disproportionately upfront.

M13

Timeline Manager

CUTTING EDGE

Complexity 3/5 · Precedence 3/5 · Risk 3/5

Best available: EventStoreDB v24 · LangGraph checkpointing · Axon Framework

Event sourcing is a mature discipline; applying it to product-lifecycle state is novel. Key decisions: what is a meaningful timeline event vs. noise, snapshot frequency for efficient point-in-time restore, and the conversational UX for temporal navigation ("show me how the app looked in February 2025"). Closest real-world analog: Cursor's timeline.

Gap to build

Domain-specific event taxonomy + conversational timeline navigation (no existing system supports NL temporal queries in a product context). Estimate: 4–6 weeks.

Risk note

Medium. Event replay at scale needs a snapshotting strategy from day one (e.g., every 100 events or every phase transition) — do not defer.
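
The snapshotting rule mentioned above (every N events or at each phase transition) can be sketched with an in-memory event log. The event shape `(kind, key, value)` and the `Timeline` class are assumptions for illustration; a real deployment would sit on an event store rather than Python lists.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, str, str]  # (kind, key, value); kind is "set" or "phase"


class Timeline:
    """Append-only event log with periodic and phase-transition snapshots."""

    def __init__(self, snapshot_every: int = 100):
        self.snapshot_every = snapshot_every
        self.events: List[Event] = []
        # Each snapshot is (number of events applied, copy of state at that point).
        self.snapshots: List[Tuple[int, Dict[str, str]]] = [(0, {})]
        self._state: Dict[str, str] = {}

    @staticmethod
    def _apply(state: Dict[str, str], event: Event) -> None:
        _kind, key, value = event
        state[key] = value

    def append(self, event: Event) -> None:
        self.events.append(event)
        self._apply(self._state, event)
        count = len(self.events)
        if count % self.snapshot_every == 0 or event[0] == "phase":
            self.snapshots.append((count, dict(self._state)))

    def state_at(self, count: int) -> Dict[str, str]:
        """State after the first `count` events, replayed from the nearest snapshot."""
        base_count, base = max(
            (s for s in self.snapshots if s[0] <= count), key=lambda s: s[0]
        )
        state = dict(base)
        for event in self.events[base_count:count]:
            self._apply(state, event)
        return state


timeline = Timeline(snapshot_every=3)
for ev in [("set", "title", "v1"), ("set", "title", "v2"), ("set", "color", "red"),
           ("phase", "phase", "prototype"), ("set", "title", "v3")]:
    timeline.append(ev)
```

With snapshots in place, a point-in-time restore replays at most `snapshot_every - 1` events instead of the whole history.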

M14

IAM & Session Control

COMMODITY

Complexity 1/5 · Precedence 5/5 · Risk 1/5

Best available: Clerk · WorkOS · Supabase Auth

Fully commoditized. Clerk's organization/membership model maps cleanly to role-based session participants and evergreen link access control; WorkOS adds enterprise SSO/SAML; Supabase Auth is cost-optimal if the data layer is Supabase (native RLS integration).

Gap to build

Two custom pieces: agent authentication tokens for external APIs (secrets via Doppler/Vault) and the evergreen-link permission model (always-valid URL, role-gated). Estimate: 1–2 weeks.
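
The evergreen-link permission model can be reduced to one rule: the URL never expires, and access depends on membership and role rather than on the invite. The sketch below assumes hypothetical role names and permissions, and keeps invites separate from memberships so an expired invite cannot revoke an existing participant.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Hypothetical role-to-permission mapping.
PERMISSIONS: Dict[str, Set[str]] = {
    "owner": {"view", "speak", "invite"},
    "participant": {"view", "speak"},
    "observer": {"view"},
}


@dataclass
class EvergreenLink:
    session_id: str
    members: Dict[str, str] = field(default_factory=dict)    # user_id -> role
    invites: Dict[str, float] = field(default_factory=dict)  # user_id -> expiry (epoch s)

    def invite(self, user_id: str, expires_at: float) -> None:
        self.invites[user_id] = expires_at

    def accept(self, user_id: str, role: str, now: float) -> bool:
        """Turn a still-valid invite into a permanent membership."""
        expiry = self.invites.get(user_id)
        if expiry is None or expiry < now:
            return False
        self.members[user_id] = role
        return True

    def can(self, user_id: str, action: str) -> bool:
        """Role-gated check; deliberately independent of invite expiry."""
        role = self.members.get(user_id)
        return role is not None and action in PERMISSIONS.get(role, set())


link = EvergreenLink("session-1")
link.invite("u1", expires_at=100.0)
accepted = link.accept("u1", "participant", now=50.0)
link.invite("u2", expires_at=100.0)
late = link.accept("u2", "observer", now=200.0)
```

Because `can` never consults `invites`, the link stays valid for `u1` long after the invite window has closed.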

Risk note

Minimal. Ensure expired invites never break access for existing participants.

M15

Analytics & Observability

COMMODITY

Complexity 1/5 · Precedence 5/5 · Risk 1/5

Best available: PostHog · Sentry · Langfuse · DataDog

Fully commoditized. PostHog: product analytics + session replay + error tracking + LLM observability in one platform. Langfuse tracks LLM traces (token cost per session, agent decision attribution, latency) natively with LangGraph. This stack is also the Prod Agent's (M7) data feed.

Gap to build

Plug in and configure. Define LLM observability metrics from day one — cost per session/phase, agent accuracy, latency distribution — instrumented before launch, not retrofitted. Estimate: 1 week.
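
As a small example of the cost-per-session/phase metric, the sketch below aggregates LLM call records into totals keyed by session and phase. The record shape, the phase names, and the per-1K-token prices are invented for illustration; real figures would come from the chosen observability tool and vendor pricing.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K: Dict[str, float] = {"cheap": 0.001, "frontier": 0.015}

Call = Tuple[str, str, str, int]  # (session_id, phase, tier, tokens)


def cost_by_session_phase(calls: Iterable[Call]) -> Dict[Tuple[str, str], float]:
    """Sum LLM spend per (session_id, phase)."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for session_id, phase, tier, tokens in calls:
        totals[(session_id, phase)] += tokens / 1000 * PRICE_PER_1K[tier]
    return dict(totals)


calls = [
    ("s1", "discovery", "cheap", 2000),
    ("s1", "discovery", "frontier", 1000),
    ("s1", "prototype", "frontier", 4000),
]
totals = cost_by_session_phase(calls)
```

Tagging every call with session and phase at write time is what makes this aggregation trivial later.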

Risk note

Minimal. Build LLM cost attribution into the data model (M12) from the start so dashboards are meaningful from launch day.

Strategic build roadmap by module risk

Low-risk commodity modules form the infrastructure backbone; established modules form the product experience layer; maximum time and R&D goes to the bleeding-edge modules that constitute Assembly's IP.

Workstream A — Wire It In · Weeks 1–4

M8 · M10 · M14 · M15 — Voice pipeline working in 3 days; auth in 1 week; first generated prototype in 1 week; analytics instrumented before any user touches the product.

Workstream B — Integrate & Customize · Months 1–3

M1 · M2 · M9 — LiveKit sessions with evergreen links; context-fusion processor; UI generation embedded in-session with an incremental feedback loop.

Workstream C — Build With Care · Months 2–6

M3 · M6 · M11 · M12 · M13 — M12 schema sprint FIRST. Event taxonomy before timeline code. Diarizer with voice enrollment. Renderer with LiveKit forwarding. Personas: CrewAI prototype → LangGraph production.

Workstream D — Research & Invent · Months 1–9

M4 · M5 · M7 — M4 starts Day 1 as a parallel research workstream. M5 designed alongside it, calibrated with real users. M7 is Milestone 4 only, after M4 and M6 are production-proven.

Critical path

The single most important engineering decision for Assembly's success is M4. Everything else can be iterated on, but a poorly designed M4 cannot be recovered from. Invest in it early, prototype constantly, and do not commit to a final architecture until you have observed real user sessions.