Each of Assembly's 15 technical modules is evaluated across three dimensions and classified into one of four readiness tiers. Ratings reflect the state of the industry as of May 2026, not future projections.
Build Complexity (1 = trivial configuration → 5 = novel research-grade engineering) · Established Precedence (1 = no existing solutions → 5 = commodity, mature products) · Innovation Risk (1 = well-understood path → 5 = unsolved problem, high probability of setback).
COMMODITY: Off-the-shelf solutions exist. Wire them in; no custom engineering beyond configuration.
ESTABLISHED: Proven technology with integration effort. Custom glue code, no novel research.
CUTTING EDGE: Technology exists but the application is novel. Significant custom engineering, limited precedent.
BLEEDING EDGE: Assembly's proprietary innovation territory. No existing solution. Research-grade engineering.
| # | Module | Complexity | Precedence | Risk | Tier |
|---|---|---|---|---|---|
| M1 | Assembly Session Engine | 2/5 | 5/5 | 1/5 | COMMODITY |
| M2 | Multi-modal Input Processor | 3/5 | 4/5 | 2/5 | ESTABLISHED |
| M3 | Speaker Diarizer & Role Mapper | 3/5 | 3/5 | 3/5 | CUTTING EDGE |
| M4 | Agentic Orchestration Engine | 5/5 | 1/5 | 5/5 | BLEEDING EDGE |
| M5 | Phase Barometer & Transition Manager | 4/5 | 1/5 | 4/5 | BLEEDING EDGE |
| M6 | AI Agent Persona Framework | 4/5 | 2/5 | 4/5 | CUTTING EDGE |
| M7 | Environment Agent Personas (Dev/UAT/Prod) | 5/5 | 1/5 | 5/5 | BLEEDING EDGE |
| M8 | Code Generation Engine | 1/5 | 5/5 | 1/5 | COMMODITY |
| M9 | UI/UX Generation Engine | 2/5 | 4/5 | 2/5 | ESTABLISHED |
| M10 | Voice Model Integration (STT/TTS) | 1/5 | 5/5 | 1/5 | COMMODITY |
| M11 | Prototype Renderer & App Agent | 4/5 | 3/5 | 3/5 | CUTTING EDGE |
| M12 | Product Development Data Model | 3/5 | 2/5 | 3/5 | CUTTING EDGE |
| M13 | Timeline Manager | 3/5 | 3/5 | 3/5 | CUTTING EDGE |
| M14 | IAM & Session Control | 1/5 | 5/5 | 1/5 | COMMODITY |
| M15 | Analytics & Observability | 1/5 | 5/5 | 1/5 | COMMODITY |
WebRTC-based real-time multi-party session management is a solved problem as of 2026. LiveKit's Agents framework reached 1.0 in April 2025, and LiveKit is the infrastructure backbone for OpenAI ChatGPT Voice, Meta, and Character.ai. The LiveKit Agents SDK allows AI agents to join rooms as first-class participants: receiving audio, processing it through an STT→LLM→TTS pipeline, and responding in under 500ms. Pipecat layers a higher-level pipeline abstraction on top with 40+ model plugins. Assembly builds no session infrastructure from scratch.
No meaningful gap at the infrastructure level. Custom work: session state persistence schema (M12) and evergreen link management (M14). Estimate: 2 weeks integration + testing.
Low. Biggest risk is session state restoration after long absences — handled as a persistence concern in M12/M13, not in LiveKit.
Voice, screen share, and chat are each individually solved — streaming STT is production-grade with sub-100ms latency, screen capture uses standard browser Media Capture APIs, chat is trivial WebSocket messaging. The novel engineering is the context fusion layer: combining speaker-attributed voice turns with screen annotation metadata and chat into a single unified context object the Orchestration Engine can consume.
The multi-modal context fusion layer has no off-the-shelf equivalent: define the context object schema (utterance + speaker ID + role + timestamp + screen_region + confidence) and build the fusion processor. Estimate: 3–4 weeks.
Medium-low. Primarily a data engineering problem. Edge case: simultaneous talkers while screen sharing — overlapping events need a well-defined merge strategy.
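A minimal sketch of the context object and one possible merge strategy for overlapping events. The field names follow the schema listed above; `ContextEvent`, `fuse`, and `MODALITY_PRIORITY` are hypothetical names, and the tie-break rule (timestamp, then modality, then speaker) is an assumption rather than a documented Assembly design.

```python
from dataclasses import dataclass
from typing import Optional
import heapq

# Assumed tie-break order when events share a timestamp (hypothetical).
MODALITY_PRIORITY = {"voice": 0, "screen": 1, "chat": 2}


@dataclass(frozen=True)
class ContextEvent:
    timestamp: float                     # seconds since session start
    modality: str                        # "voice", "screen", or "chat"
    speaker_id: str
    role: str
    utterance: str = ""
    screen_region: Optional[str] = None
    confidence: float = 1.0


def _sort_key(event):
    # Deterministic ordering so simultaneous talkers always merge the same way.
    return (event.timestamp, MODALITY_PRIORITY[event.modality], event.speaker_id)


def fuse(*streams):
    """Merge per-modality streams (each already sorted by _sort_key) into one timeline."""
    return list(heapq.merge(*streams, key=_sort_key))


# Two people talk at the same instant while one of them annotates the screen.
voice = [
    ContextEvent(1.0, "voice", "p1", "PM", "Move this up"),
    ContextEvent(1.0, "voice", "p2", "Designer", "Agreed"),
]
screen = [ContextEvent(1.0, "screen", "p1", "PM", screen_region="header")]
timeline = fuse(voice, screen)
```

A fixed, documented ordering matters more than the particular order chosen: the Orchestration Engine only needs the merged timeline to be reproducible for the same inputs.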
Diarization itself is well-solved (pyannote 3.1 leads offline accuracy; Deepgram bundles diarization into its streaming API). The cutting-edge component is the Role Mapper: linking voice fingerprints captured at session join to each participant's business role, so every utterance is attributed not just by speaker but by role — enabling agents to weight input accordingly.
Session-bootstrap voice enrollment, fingerprint storage in the participant profile, real-time lookup during transcription. Estimate: 3–4 weeks including testing with 5+ concurrent speakers.
Medium. Diarization degrades with 3+ simultaneous speakers in noisy conditions. Mitigation: keep diarization as the default and offer a push-to-talk / mic-indicator UX for critical sessions.
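The enrollment and lookup flow can be sketched as a nearest-neighbor match over voice embeddings. This is an illustration only: `RoleMapper`, the cosine-similarity matching, and the `0.75` threshold are assumptions, and a real system would take its embeddings from a speaker-embedding model rather than hand-written vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class RoleMapper:
    def __init__(self, threshold=0.75):
        self.threshold = threshold   # assumed value; needs tuning on real audio
        self.profiles = {}           # participant_id -> (role, enrolled embedding)

    def enroll(self, participant_id, role, embedding):
        """Store the fingerprint captured at session join."""
        self.profiles[participant_id] = (role, embedding)

    def attribute(self, embedding):
        """Return (participant_id, role, score) for the closest enrolled voice."""
        best_id, best_score = None, 0.0
        for pid, (_, enrolled) in self.profiles.items():
            score = cosine(embedding, enrolled)
            if score > best_score:
                best_id, best_score = pid, score
        if best_id is None or best_score < self.threshold:
            return None, "unknown", best_score
        return best_id, self.profiles[best_id][0], best_score
```

Returning an explicit `"unknown"` role below the threshold lets downstream agents treat unattributed speech conservatively instead of crediting it to the wrong participant.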
Assembly's most critical and most novel module — no existing production system coordinates multiple AI agent personas with multiple human participants in a live voice session. The closest analog (OpenAI Realtime multi-agent pattern) targets 1:1 sessions with sequential handoffs. LangGraph is the strongest foundation: conditional-edge routing, built-in checkpointing for the Timeline, best-in-class latency/reliability in 2026. The real innovation: (1) real-time voice context routing, (2) a turn-taking protocol that prevents agent pile-ons, (3) a phase awareness model, and (4) an inter-agent hidden communication channel.
CRITICAL GAP. The coordination layer between LangGraph and the voice pipeline is entirely original engineering. Estimate: 4–6 months dedicated, with parallel experimental prototyping before architecture commitment.
CRITICAL. If M4 is poorly designed, the entire product experience breaks. Start with single-agent + multi-human; build a turn-taking simulation harness; treat M4 as a research workstream alongside Milestone 1.
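One way to start a turn-taking simulation harness is a small arbiter that grants the floor to at most one agent at a time. The sketch below is a toy under stated assumptions: `Bid`, `TurnArbiter`, the urgency score, and the cooldown rule are all hypothetical and are not part of LangGraph or LiveKit.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    agent: str
    urgency: float   # 0..1, how strongly the agent wants to speak (assumed scale)


class TurnArbiter:
    def __init__(self, min_urgency=0.5, cooldown_s=5.0):
        self.min_urgency = min_urgency   # bids below this are ignored
        self.cooldown_s = cooldown_s     # an agent waits this long between turns
        self.last_spoke = {}             # agent -> time its last turn was granted

    def grant(self, now, human_speaking, bids):
        """Return the single agent allowed to speak at `now`, or None."""
        if human_speaking:
            return None                  # humans always keep the floor
        eligible = [
            b for b in bids
            if b.urgency >= self.min_urgency
            and now - self.last_spoke.get(b.agent, float("-inf")) >= self.cooldown_s
        ]
        if not eligible:
            return None
        winner = max(eligible, key=lambda b: b.urgency)
        self.last_spoke[winner.agent] = now
        return winner.agent
```

Replaying recorded or synthetic sessions through `grant` makes it possible to count pile-ons and dead air before any voice pipeline is attached.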
A real-time readiness score computed from session data that triggers phase transitions — one of Assembly's most distinctive concepts with no direct technical precedent. LangGraph provides the state machine backbone; the readiness scoring model (weighted signals: product type completeness, CUJ coverage, UX/UI coverage, feasibility, pain-point breadth) is entirely novel and needs extensive calibration with real users.
No reference implementation. Signal selection, weights, and thresholds are product design questions that precede engineering; the visual barometer UX must avoid distorting natural conversation. Estimate: 2–3 months including calibration sessions.
High. Fires too early → poor prototype, disappointment. Too conservative → conversation drags, lost faith. Calibrate with real users across multiple beta rounds.
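A readiness score of this kind can be sketched as a weighted sum with a hold period, so that a single high reading does not trigger a transition. The signal names come from the list above; the weights, the `0.8` threshold, and the three-reading hold are placeholder assumptions that would be replaced during calibration.

```python
# Placeholder weights (assumed); they sum to 1.0 so the score stays in [0, 1].
WEIGHTS = {
    "product_type_completeness": 0.25,
    "cuj_coverage": 0.25,
    "ux_ui_coverage": 0.20,
    "feasibility": 0.15,
    "pain_point_breadth": 0.15,
}


def readiness(signals):
    """Weighted sum of signals clamped to [0, 1]; missing signals count as 0."""
    return sum(
        weight * min(max(signals.get(name, 0.0), 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )


class PhaseBarometer:
    def __init__(self, threshold=0.8, hold=3):
        self.threshold = threshold   # assumed transition threshold
        self.hold = hold             # consecutive readings required above it
        self.streak = 0

    def update(self, signals):
        """Return (score, should_transition) for the latest signal reading."""
        score = readiness(signals)
        self.streak = self.streak + 1 if score >= self.threshold else 0
        return score, self.streak >= self.hold
```

The hold period addresses the "fires too early" failure mode, while the threshold controls how conservative the barometer is; both are exactly the parameters the calibration sessions would tune.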
Plugin-style infrastructure for distinct personas (PM, UX, UI, Tech Architect), each with domain scope, prompt strategy, probing question library, and output schema. CrewAI is the most mature for role-based definitions; LangGraph is the production routing choice. Cutting-edge parts: voice-native persona behavior (when to speak vs. stay silent — no framework addresses this), probing-question timing, and the hidden inter-agent channel.
CrewAI for rapid prototyping → LangGraph for production. Persona prompt library and probing-question corpus hand-crafted and tuned in real sessions. Estimate: 6–8 weeks per persona (~4 months total, which assumes the four personas are developed partly in parallel).
High. Too chatty overwhelms participants; too passive misses critical product dimensions. Plan dedicated persona behavior testing sprints.
Assembly's most radical differentiator: non-technical users converse with Dev / UAT / Prod agents, with intent-based routing ("how many users signed up yesterday?" → Prod Agent). No precedent in any production product. Requires real environment integrations: CI/CD APIs (GitHub Actions, Vercel) for Dev, staging/test execution for UAT, observability ingestion (PostHog, Sentry, DataDog) for Prod.
Entirely novel: persona design, intent routing, all three environment integrations. Prod Agent is the most complex. Estimate: 3–4 months, Milestone 4 work after M6 is production-proven.
CRITICAL but deferred. Build last. The demo potential ("Prod Agent, how many users do we have in Europe?") is a key VC moment — do not rush it.
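Intent-based routing can be illustrated with a deliberately simple keyword scorer. This is a stand-in, not the proposed design: a production router would more likely use an LLM or a trained classifier, and the `ROUTES` keyword sets and the `"clarify"` fallback are invented for the example.

```python
import re

# Hypothetical keyword sets per environment agent.
ROUTES = {
    "prod": {"users", "signed", "signups", "revenue", "errors", "traffic"},
    "uat": {"test", "tests", "staging", "acceptance", "regression"},
    "dev": {"build", "deploy", "branch", "commit", "pipeline"},
}


def route(utterance, fallback="clarify"):
    """Pick the agent whose keywords overlap the utterance most, else the fallback."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    scores = {agent: len(words & keywords) for agent, keywords in ROUTES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else fallback
```

For example, `route("How many users signed up yesterday?")` selects the Prod Agent, while an utterance with no recognizable intent returns `"clarify"` so the system can ask a follow-up question instead of guessing.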
Fully commoditized as of 2026 — Claude leads SWE-bench Verified by a wide margin. Agents construct a structured product spec and invoke the model; iterative refinement loops incorporate session feedback. Assembly's role is prompt engineering mastery and loop design, not model development. Large context windows allow passing the full spec plus previous code versions.
None at the model level. Custom: spec-to-prompt translation layer and a diff-based (not full-rewrite) refinement loop. Estimate: 3–4 weeks.
Low. Token cost is a management challenge, not a risk: tiered models (cheap for simple iterations, frontier for complex generation) and output caching.
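The tiered-model and caching ideas can be sketched as follows. The tier names, the `max_cheap_files` cutoff, and the `generate` callable (which stands in for a real model call) are assumptions made for illustration.

```python
import hashlib


def pick_tier(is_initial_generation, files_touched, max_cheap_files=2):
    """Use the frontier model for first drafts and wide changes, a cheap one otherwise."""
    if is_initial_generation or files_touched > max_cheap_files:
        return "frontier"
    return "cheap"


class GenerationCache:
    def __init__(self, generate):
        self._generate = generate   # callable(tier, prompt) -> str; placeholder for a model call
        self._store = {}
        self.misses = 0

    def get(self, tier, prompt):
        """Return cached output for an identical (tier, prompt) pair, generating once."""
        key = hashlib.sha256(f"{tier}\n{prompt}".encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._generate(tier, prompt)
        return self._store[key]
```

Keying the cache on the exact prompt only pays off when the spec-to-prompt translation is deterministic, which is one more reason to treat that layer as real engineering rather than ad hoc string assembly.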
The UI generation landscape matured dramatically in 2025–2026: v0 generates production-grade React + Tailwind + shadcn/ui from natural language; Lovable hit $20M ARR in 2 months; Bolt.new runs entirely in-browser. Assembly needs output captured as code and rendered inside its own session iframe (M11) — not redirected to external tools.
Keep rendered output inside the session, and make the feedback loop incremental ("make that button bigger and blue" → diff, not regeneration). Estimate: 3–4 weeks.
Low. Primary risk is run-to-run output inconsistency. Mitigation: seed with design system constraints and lock the model version.
Fully commoditized; sub-100ms round-trip is the industry baseline. Optimal stack: Deepgram for STT (lowest latency + native diarization), Cartesia Sonic for TTS. Pipecat has native plugins for 40+ providers. Multi-vendor fallback recommended for resilience.
Zero gap — configuration only. One nuance: each agent persona should have a distinct voice for social presence (ElevenLabs / Cartesia voice cloning).
Minimal. Heavy accents / multi-language reduce accuracy; v1 is English-first. Budget multi-vendor routing from day one.
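Multi-vendor routing can start as an ordered fallback chain. In this sketch the providers are plain callables supplied by the caller; no real vendor SDK is shown, and `transcribe_with_fallback` and `AllProvidersFailed` are hypothetical names.

```python
class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def transcribe_with_fallback(audio, providers):
    """Try each (name, callable) in order; return (name, transcript) from the first success."""
    errors = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio)
        except Exception as exc:   # a real system would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

The same shape works for TTS. Recording which provider answered also feeds the latency and cost dashboards in M15.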
AI browser automation matured in 2025–2026: Stagehand v3 uses natural-language instructions over a modular driver system, and DOM-driven control benchmarks 12–17 points more reliable than vision-driven approaches. The App Agent navigates the generated prototype and streams it to all participants via WebRTC screen-capture forwarding from a cloud browser into the LiveKit room.
Multi-participant browser streaming is not out-of-the-box: cloud browser → capture → forward into LiveKit → all participants. Achievable with LiveKit's screen-share participant feature, but custom plumbing. Estimate: 4–6 weeks including reliability testing on AI-generated UIs.
Medium. Automation on dynamically generated UIs can be brittle even with NL selectors. Test against a corpus of AI-generated UIs; plan App Agent recovery flows.
The database tech is commodity; the novel engineering is the domain-specific schema: Product, Session, Phase, Feature, UserJourney, Component, AgentDecision, TimelineEvent, Participant. Hardest problems: AgentDecision attribution (every state change traces to an agent action triggered by a human utterance), the FeatureGraph relationship model, and multi-tenant access. Supabase (Postgres + Realtime + RLS) is the strongest v1 platform.
Schema design IS the gap and precedes all code: a 2–3 week design sprint with domain expert sessions. Neo4j may help for graph traversal at scale.
Medium. Schema mistakes are expensive post-launch. Under-designing AgentDecision attribution breaks Timeline causality — invest disproportionately upfront.
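The attribution chain (state change → agent decision → human utterance) can be sketched with plain records. The entity names loosely follow the schema above, but every field name here, and the `StateChange` record itself, is an assumption for illustration and not the proposed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    id: str
    participant_id: str
    text: str


@dataclass(frozen=True)
class AgentDecision:
    id: str
    agent: str
    triggered_by: str   # Utterance.id that prompted the decision
    summary: str


@dataclass(frozen=True)
class StateChange:
    id: str
    entity: str         # e.g. "Feature:checkout"
    decision_id: str    # AgentDecision.id that caused the change


def trace(change, decisions, utterances):
    """Walk a state change back to the agent decision and the utterance behind it."""
    decision = decisions[change.decision_id]
    utterance = utterances[decision.triggered_by]
    return utterance, decision, change
```

Making `triggered_by` and `decision_id` mandatory is the point of the sketch: if either link can be missing, the Timeline loses causality for that change.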
Event sourcing is a mature discipline; applying it to product-lifecycle state is novel. Key decisions: what is a meaningful timeline event vs. noise, snapshot frequency for efficient point-in-time restore, and the conversational UX for temporal navigation ("show me how the app looked in February 2025"). Closest real-world analog: Cursor's timeline.
Domain-specific event taxonomy + conversational timeline navigation (no existing system supports NL temporal queries in a product context). Estimate: 4–6 weeks.
Medium. Event replay at scale needs a snapshotting strategy from day one (e.g., every 100 events or every phase transition) — do not defer.
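The snapshotting rule mentioned above (every N events or on every phase transition) can be sketched as a small in-memory event store. `Timeline`, the `(kind, key, value)` event shape, and the flat state dictionary are simplifying assumptions; a real implementation would persist both events and snapshots.

```python
class Timeline:
    def __init__(self, snapshot_every=100):
        self.snapshot_every = snapshot_every
        self.events = []           # ordered list of (kind, key, value) tuples
        self.snapshots = {0: {}}   # event count -> copy of state at that point
        self._state = {}

    @staticmethod
    def _apply(state, event):
        kind, key, value = event
        if kind == "set":
            state[key] = value
        elif kind == "phase":
            state["phase"] = value

    def append(self, kind, key, value):
        """Record an event and snapshot on phase transitions or every N events."""
        event = (kind, key, value)
        self.events.append(event)
        self._apply(self._state, event)
        count = len(self.events)
        if kind == "phase" or count % self.snapshot_every == 0:
            self.snapshots[count] = dict(self._state)

    def state_at(self, count):
        """Rebuild the state after the first `count` events from the nearest earlier snapshot."""
        base = max(k for k in self.snapshots if k <= count)
        state = dict(self.snapshots[base])
        for event in self.events[base:count]:
            self._apply(state, event)
        return state
```

With snapshots in place, a point-in-time restore replays at most `snapshot_every - 1` events, no matter how long the product's history grows.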
Fully commoditized. Clerk's organization/membership model maps cleanly to role-based session participants and evergreen link access control; WorkOS adds enterprise SSO/SAML; Supabase Auth is cost-optimal if the data layer is Supabase (native RLS integration).
Two custom pieces: agent authentication tokens for external APIs (secrets via Doppler/Vault) and the evergreen-link permission model (always-valid URL, role-gated). Estimate: 1–2 weeks.
Minimal. Ensure expired invites never break access for existing participants.
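The evergreen-link model separates the invite, which can expire, from membership, which does not. The sketch below assumes three roles and a simple permission table; `EvergreenLink`, `PERMISSIONS`, and the role names are hypothetical and independent of Clerk, WorkOS, or Supabase Auth.

```python
from dataclasses import dataclass, field

# Assumed role-to-action table.
PERMISSIONS = {
    "owner": {"view", "speak", "manage"},
    "participant": {"view", "speak"},
    "observer": {"view"},
}


@dataclass
class EvergreenLink:
    session_id: str
    members: dict = field(default_factory=dict)   # user_id -> role
    invites: dict = field(default_factory=dict)   # token -> (role, expires_at)

    def invite(self, token, role, expires_at):
        self.invites[token] = (role, expires_at)

    def redeem(self, token, user_id, now):
        """Turn a valid, unexpired invite into a permanent membership."""
        role, expires_at = self.invites.get(token, (None, 0))
        if role is None or now > expires_at:
            return False
        self.members[user_id] = role
        return True

    def can(self, user_id, action):
        """Access checks read membership only, so invite expiry never revokes access."""
        role = self.members.get(user_id)
        return role is not None and action in PERMISSIONS.get(role, set())
```

Because `can` never consults the invite table, an expired invite blocks new joiners without affecting anyone who has already redeemed it.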
Fully commoditized. PostHog: product analytics + session replay + error tracking + LLM observability in one platform. Langfuse tracks LLM traces (token cost per session, agent decision attribution, latency) natively with LangGraph. This stack is also the Prod Agent's (M7) data feed.
Plug in and configure. Define LLM observability metrics from day one — cost per session/phase, agent accuracy, latency distribution — instrumented before launch, not retrofitted. Estimate: 1 week.
Minimal. Build LLM cost attribution into the data model (M12) from the start so dashboards are meaningful from launch day.
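Cost attribution can be sketched as a ledger keyed by session, phase, and agent. The prices below are made-up placeholders, not vendor pricing, and `CostLedger` is a hypothetical name; in practice the figures would come from the observability stack.

```python
from collections import defaultdict

# Placeholder prices in USD per 1,000 tokens as (input, output); NOT real vendor pricing.
PRICE_PER_1K = {"cheap": (0.001, 0.002), "frontier": (0.010, 0.030)}


class CostLedger:
    def __init__(self):
        self.by_key = defaultdict(float)   # (session_id, phase, agent) -> USD

    def record(self, session_id, phase, agent, tier, input_tokens, output_tokens):
        """Add one model call to the ledger and return its cost."""
        price_in, price_out = PRICE_PER_1K[tier]
        cost = input_tokens / 1000 * price_in + output_tokens / 1000 * price_out
        self.by_key[(session_id, phase, agent)] += cost
        return cost

    def total(self, session_id=None, phase=None):
        """Sum costs, optionally filtered by session and/or phase."""
        return sum(
            cost
            for (s, p, _agent), cost in self.by_key.items()
            if (session_id is None or s == session_id) and (phase is None or p == phase)
        )
```

Carrying the session and phase on every record is what makes "cost per session/phase" a query rather than a reconstruction exercise after launch.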
Low-risk commodity modules form the infrastructure backbone; established modules form the product experience layer; the most time and R&D go to the bleeding-edge modules that constitute Assembly's IP.
M8 · M10 · M14 · M15 — Voice pipeline working in 3 days; auth in 1 week; first generated prototype in 1 week; analytics instrumented before any user touches the product.
M1 · M2 · M9 — LiveKit sessions with evergreen links; context-fusion processor; UI generation embedded in-session with an incremental feedback loop.
M3 · M6 · M11 · M12 · M13 — M12 schema sprint FIRST. Event taxonomy before timeline code. Diarizer with voice enrollment. Renderer with LiveKit forwarding. Personas: CrewAI prototype → LangGraph production.
M4 · M5 · M7 — M4 starts Day 1 as a parallel research workstream. M5 designed alongside it, calibrated with real users. M7 is Milestone 4 only, after M4 and M6 are production-proven.
The single most important engineering decision for Assembly's success is M4. Everything else can be iterated; a poorly designed M4 cannot be recovered from. Invest in it early, prototype constantly, and do not commit to a final architecture until you have observed real user sessions.