Why voice + RAG
Speech lowers friction for end users, but raw LLMs hallucinate and drift from organizational truth. Coupling automatic speech recognition with semantic retrieval constrains generation to ingested documents (per avatar / tenant), so answers stay traceable to source material while remaining conversational when delivered via TTS.
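A minimal sketch of what "constrained to ingested documents" can look like at the prompting step. The `Chunk` shape, function name, and prompt wording below are illustrative assumptions, not the project's actual code:

```ts
// Illustrative: build a retrieval-constrained, citation-oriented prompt.
// Chunk shape and wording are assumptions, not the project's real code.
interface Chunk {
  text: string;
  source: string; // e.g. document title or file name of the ingested material
}

function buildGroundedPrompt(question: string, chunks: Chunk[]): string {
  // Number each retrieved chunk so the model can cite it as [1], [2], ...
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.source}) ${c.text}`)
    .join("\n\n");

  return [
    "Answer using only the context below. If the context does not contain the answer, say so.",
    "Cite sources by their bracketed numbers.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```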
Architecture
A layered monolith: HTTP adapters delegate to stateless services that wrap provider SDKs. Cross-cutting concerns (auth, validation, rate limiting, structured logging) sit in middleware; orchestration for the primary user journey lives in `askService` (STT → language handling → RAG+LLM → TTS → persistence).
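A sketch of that pipeline shape. Every dependency name here (`transcribe`, `detectLanguage`, `retrieve`, `generateAnswer`, `synthesize`, `saveTurn`) is a hypothetical stand-in for the stateless services that wrap provider SDKs:

```ts
// Sketch of the askService pipeline: STT → language handling → RAG+LLM → TTS → persistence.
// All dependency names are hypothetical stand-ins, injected so the sketch stays self-contained.
interface AskDeps {
  transcribe(audio: Buffer): Promise<string>;                                  // STT
  detectLanguage(text: string): Promise<string>;                               // language handling
  retrieve(query: string, tenantId: string): Promise<string[]>;                // tenant-scoped RAG search
  generateAnswer(q: string, chunks: string[], lang: string): Promise<string>;  // grounded LLM call
  synthesize(text: string, lang: string): Promise<Buffer>;                     // TTS
  saveTurn(turn: { tenantId: string; question: string; answer: string }): Promise<void>; // persistence
}

export async function ask(deps: AskDeps, audio: Buffer, tenantId: string) {
  const question = await deps.transcribe(audio);
  const language = await deps.detectLanguage(question);
  const chunks = await deps.retrieve(question, tenantId);
  const answer = await deps.generateAnswer(question, chunks, language);
  const speech = await deps.synthesize(answer, language);
  await deps.saveTurn({ tenantId, question, answer });
  return { answer, speech, language };
}
```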
There is no message broker on the default path: each `/ask` request runs as an async pipeline of awaited I/O steps. That trade-off favours operational simplicity; horizontal scale comes from running more stateless app instances, with Pinecone and MongoDB as shared backends.
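A sketch of how that wiring can look, assuming Express; the middleware and `askService` bodies below are hypothetical stand-ins, not the project's real modules:

```ts
// Sketch of the default /ask path: middleware for cross-cutting concerns,
// then one handler that awaits the pipeline. All names below are hypothetical.
import express, { type Request, type Response, type NextFunction } from "express";

// Stand-ins for auth / rate-limit middleware (validation and logging would sit here too).
function requireAuth(_req: Request, _res: Response, next: NextFunction) { next(); }
function rateLimit(_req: Request, _res: Response, next: NextFunction) { next(); }

// Stand-in for the orchestration service described above.
async function askService(body: unknown, tenantId: string) {
  return { answer: "stub", tenantId, body };
}

const app = express();
app.use(express.json());
app.use(requireAuth);
app.use(rateLimit);

// No broker: the handler is one sequence of awaited I/O steps, so scaling out
// means running more stateless instances of this app behind a load balancer.
app.post("/ask", async (req, res, next) => {
  try {
    const tenantId = String(req.header("x-tenant-id") ?? "default");
    res.json(await askService(req.body, tenantId));
  } catch (err) {
    next(err); // defer to a central error handler
  }
});

app.listen(3000);
```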
What I learned
Voice is a transport layer: the trust boundary for factual answers is still retrieval plus citation-oriented prompting, not the modality.
Orchestration clarity beats clever abstractions for small teams: one explicit pipeline (`askService`) is easier to operate than scattered triggers.
Multitenancy belongs in retrieval metadata: namespaces answer "which knowledge base" a query runs against, independently of the embedding math (see the sketch below).
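A sketch of namespace-scoped retrieval with the Pinecone TypeScript client; the index name, top-k value, and function name are assumptions for illustration:

```ts
// Namespace-scoped query: each avatar/tenant gets its own namespace in the index,
// so "which knowledge base" is decided at query time, not inside the embeddings.
import { Pinecone } from "@pinecone-database/pinecone";

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pc.index("avatars"); // hypothetical index name

async function retrieveForTenant(queryEmbedding: number[], tenantId: string) {
  const result = await index.namespace(tenantId).query({
    vector: queryEmbedding,
    topK: 5,
    includeMetadata: true, // keep source metadata for citation-oriented prompting
  });
  return result.matches ?? [];
}
```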