System Architecture Overview
System Component Context
This section describes the primary components and the sequential hand-offs between them during the end-to-end processing of a single record.

Core Components:
Admin Users: Interact via the Frontend Dashboard (served through Nginx).
External Data Sources: Push bulk data via the API Service.
Dedup Engine: The core processing layer (Kafka -> Worker -> Postgres).
External Identity Providers (Tech5): T5-LDS (liveness) and T5-ABIS (biometrics).
High-Level Data Flow
This section describes the movement of data from the Producers (API/Ingestion) through the Message Bus (Kafka) to the Consumer (Dedup Service), including the feedback loops to the database.
Key Flows:
Path A (Auditable Batch): Client -> API -> DB (Records) -> StreamTask -> Kafka.
Path B (High-Speed Stream): Client -> API -> Kafka.
Processing: Kafka -> Dedup Worker -> External Checks -> Final Persistence.
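The split between Path A and Path B can be sketched as a single routing decision at the API layer. This is a minimal illustration only: the in-memory lists stand in for the Kafka raw_records topic and the Postgres Records table, and the ingest() helper is hypothetical, not the actual api_service code.

```python
RAW_RECORDS_TOPIC = []   # stand-in for the Kafka raw_records topic
RECORDS_TABLE = []       # stand-in for the Postgres Records table

def ingest(record: dict, auditable: bool) -> None:
    if auditable:
        # Path A (Auditable Batch): persist first for auditability,
        # then a StreamTask forwards the row to Kafka.
        RECORDS_TABLE.append(record)
        RAW_RECORDS_TOPIC.append(record)
    else:
        # Path B (High-Speed Stream): publish straight to Kafka,
        # skipping the synchronous database write for lowest latency.
        RAW_RECORDS_TOPIC.append(record)

ingest({"id": 1}, auditable=True)
ingest({"id": 2}, auditable=False)
```

Both paths converge on the same topic, so the Dedup Worker consumes records identically regardless of how they arrived.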
Detailed Architecture & Tech Stack Breakdown
Based on the architectural design and system context, the solution employs a containerized, event-driven microservices architecture. It decouples high-speed ingestion from resource-intensive processing to ensure resilience and scalability.
A. Core Architecture Pattern
Pattern: Event-Driven Microservices (Producer/Consumer).
Decoupling: The lightweight API Service (Producer) is completely decoupled from the heavy Dedup Service (Consumer) via Kafka. This prevents backpressure from slowing down ingestion.
Orchestration: Celery manages the complex, multi-step pipeline (Liveness -> Demo -> Bio -> Match) as a stateful background task.
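The four-step pipeline can be pictured as a sequence of task functions, each passing its enriched record to the next; in the real system these would be Celery tasks chained together. The step functions below are hypothetical stubs standing in for the external calls.

```python
def check_liveness(record: dict) -> dict:
    # In production: passive liveness check via T5-LDS.
    record["alive"] = True
    return record

def demographic_match(record: dict) -> dict:
    # In production: fuzzy search against Elasticsearch.
    record["demo_candidates"] = []
    return record

def biometric_match(record: dict) -> dict:
    # In production: 1:N identification via T5-ABIS.
    record["bio_candidates"] = []
    return record

def finalize(record: dict) -> dict:
    # In production: merge scores and persist the Golden Record.
    record["status"] = "matched"
    return record

PIPELINE = [check_liveness, demographic_match, biometric_match, finalize]

def run_pipeline(record: dict) -> dict:
    # Celery would express this as chain(check_liveness.s(record), ...);
    # a plain loop shows the same stateful hand-off between steps.
    for step in PIPELINE:
        record = step(record)
    return record
```

Because each step receives the accumulated state of the previous ones, a failure at any stage can be retried without re-running the stages before it.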
B. Comprehensive Tech Stack
Ingestion Layer: FastAPI / Python. Provides the high-performance api_service; handles the /ingest (real-time) and /records (batch) endpoints.
Message Bus: Apache Kafka. Acts as the central nervous system; buffers traffic bursts in the raw_records topic to protect downstream workers.
Processing Layer: Celery + Python. The dedup_service workers; manage distributed task execution, retries, and failure handling.
Containerization: Docker. Encapsulates all services (API, workers, databases) for consistent deployment.
Orchestration: Kubernetes (K8s) + KEDA. Manages container lifecycles; KEDA (Kubernetes Event-driven Autoscaling) automatically scales worker pods based on Kafka consumer lag.
Database (Relational): PostgreSQL. Primary persistent store; holds customers (Golden Records), adjudication (conflicts), and audit_logs.
Database (Search): Elasticsearch. Powers demographic matching; enables high-speed fuzzy search on text fields (names, DOB).
Biometrics (External): Tech5 T5-ABIS. External API for 1:N biometric identification (facial/fingerprint).
Liveness (External): Tech5 T5-LDS. External API for passive liveness detection to prevent spoofing.
Gateway/Security: Nginx / Certbot. Reverse proxy handling HTTPS termination, load balancing, and static content.
Authentication: JWT (JSON Web Tokens). Secures API endpoints and manages user sessions.
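As one concrete example from the stack above, the demographic match reduces to building a fuzzy Elasticsearch query. The field names (full_name, dob) and the customers index are illustrative assumptions; in production the body would be passed to the elasticsearch-py client's search call.

```python
def build_demographic_query(full_name: str, dob: str) -> dict:
    """Build an Elasticsearch bool query for demographic matching.

    Hypothetical schema: a `customers` index with `full_name` (text)
    and `dob` (keyword/date) fields.
    """
    return {
        "bool": {
            "must": [
                # Fuzzy match tolerates typos and transliteration
                # variants in names.
                {"match": {"full_name": {"query": full_name,
                                         "fuzziness": "AUTO"}}},
            ],
            "filter": [
                # Exact filter on date of birth narrows the candidate set
                # cheaply before scoring.
                {"term": {"dob": dob}},
            ],
        }
    }

query = build_demographic_query("Jane Doe", "1990-01-01")
# e.g. es.search(index="customers", query=query)
```

Placing the date-of-birth constraint in the filter context (rather than must) lets Elasticsearch cache it and skip relevance scoring for it.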
C. Scalability & Resilience Mechanisms
Horizontal Scaling: The Dedup Service workers scale from 1 to N pods automatically as Kafka queue depth (consumer lag) increases, driven by KEDA metrics.
Fail-Fast Logic: The architecture enforces a "Fail-Fast" rule at the Liveness check step. If an image fails liveness (T5-LDS), the expensive Biometric check (T5-ABIS) is skipped entirely, saving significant computational resources.
Resilience: If external services (Tech5 APIs) are down, Celery automatically retries the task with exponential backoff, ensuring no records are dropped ("At-least-once" processing).
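The fail-fast and retry rules combine into a small amount of control flow. The sketch below is a simplification under stated assumptions: the with_backoff() helper stands in for Celery's built-in retry machinery (autoretry_for with retry_backoff), and the liveness/biometric callables represent the hypothetical T5-LDS and T5-ABIS clients.

```python
import time

class UpstreamDown(Exception):
    """Raised when an external Tech5 API is unreachable."""

def with_backoff(fn, retries=3, base_delay=0.01):
    # At-least-once semantics: retry with exponential backoff
    # (base_delay, 2x, 4x, ...) instead of dropping the record.
    for attempt in range(retries):
        try:
            return fn()
        except UpstreamDown:
            if attempt == retries - 1:
                raise  # exhausted: surface the failure, record stays queued
            time.sleep(base_delay * (2 ** attempt))

def process(record, check_liveness, biometric_search):
    # Fail-Fast: if liveness (T5-LDS) rejects the image, the expensive
    # 1:N biometric search (T5-ABIS) is skipped entirely.
    if not with_backoff(lambda: check_liveness(record)):
        return "rejected"
    return with_backoff(lambda: biometric_search(record))
```

Ordering the cheap liveness gate before the costly 1:N search is what makes the fail-fast rule save real compute: spoofed or dead images never reach T5-ABIS at all.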