πŸ’» System Architecture Overview

System Component Context

This section describes the primary components and the sequential hand-offs between them during the end-to-end processing of a single record.

Core Components:

  • Admin Users: Interact via the Frontend Dashboard (Nginx).

  • External Data Sources: Push bulk data via the API Service.

  • Dedup Engine: The core processing layer (Kafka -> Worker -> Postgres).

  • External Identity Providers (Tech5): T5-LDS (Liveness) and T5-ABIS (Biometrics).

High-Level Data Flow

This section describes the movement of data from the Producers (API/Ingestion) through the Message Bus (Kafka) to the Consumer (Dedup Service), including the feedback loops to the database.

Key Flows:

  • Path A (Auditable Batch): Client -> API -> DB (Records) -> StreamTask -> Kafka.

  • Path B (High-Speed Stream): Client -> API -> Kafka.

  • Processing: Kafka -> Dedup Worker -> External Checks -> Final Persistence.
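As a rough illustration of the two ingestion paths, the sketch below uses an in-memory deque and dict as stand-ins for the Kafka topic and the Postgres records table (all names and record shapes are illustrative assumptions, not the real implementation):

```python
from collections import deque

# Hypothetical in-memory stand-ins, used only to illustrate the two paths.
raw_records_topic = deque()   # stands in for the Kafka raw_records topic
records_table = {}            # stands in for the records table in Postgres

def ingest_stream(record: dict) -> None:
    """Path B (high-speed stream): the API publishes straight to Kafka."""
    raw_records_topic.append(record)

def ingest_batch(batch: list) -> None:
    """Path A (auditable batch): the API persists to the DB first,
    then a StreamTask forwards the stored rows to Kafka."""
    for record in batch:
        records_table[record["id"]] = record   # durable, auditable copy
    for record in records_table.values():      # StreamTask step
        raw_records_topic.append(record)

ingest_batch([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}])
ingest_stream({"id": 3, "name": "Alan"})
print(len(raw_records_topic))  # 3 records buffered for the Dedup Worker
```

Note that Path A leaves a durable copy in the database before anything reaches Kafka, which is what makes it auditable; Path B trades that audit trail for lower latency.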

Detailed Architecture & Tech Stack Breakdown

Based on the architectural design and system context, the solution employs a containerized, event-driven microservices architecture. It decouples high-speed ingestion from resource-intensive processing to ensure resilience and scalability.

A. Core Architecture Pattern

  • Pattern: Event-Driven Microservices (Producer/Consumer).

  • Decoupling: The lightweight API Service (Producer) is completely decoupled from the heavy Dedup Service (Consumer) via Kafka. This prevents backpressure from slowing down ingestion.

  • Orchestration: Celery manages the complex, multi-step pipeline (Liveness -> Demo -> Bio -> Match) as a stateful background task.
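The chained pipeline can be sketched with plain functions standing in for Celery tasks (a broker-free sketch; all stage logic, candidate IDs, and field names below are assumptions for illustration only):

```python
# Broker-free sketch of the multi-step pipeline that Celery orchestrates
# (Liveness -> Demo -> Bio -> Match). In production each stage would be a
# Celery task; the logic here is illustrative, not the real implementation.

def check_liveness(record: dict) -> dict:
    record["liveness_ok"] = True              # would call T5-LDS
    return record

def demographic_match(record: dict) -> dict:
    record["demo_candidates"] = ["cust-17"]   # would query Elasticsearch
    return record

def biometric_match(record: dict) -> dict:
    record["bio_candidates"] = ["cust-17"]    # would call T5-ABIS (1:N)
    return record

def final_match(record: dict) -> dict:
    overlap = set(record["demo_candidates"]) & set(record["bio_candidates"])
    record["status"] = "matched" if overlap else "unique"
    return record

PIPELINE = (check_liveness, demographic_match, biometric_match, final_match)

def run_pipeline(record: dict) -> dict:
    # Celery would express this as a chain of task signatures;
    # here we simply fold the record through each stage in order.
    for stage in PIPELINE:
        record = stage(record)
    return record

print(run_pipeline({"id": 1})["status"])  # matched
```

Because each stage receives the record enriched by the previous one, the pipeline stays stateful across steps even though every stage runs as an independent background task.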

B. Comprehensive Tech Stack

| Component Layer | Technology | Role & Justification |
| --- | --- | --- |
| Ingestion Layer | FastAPI / Python | Provides the high-performance api_service. Handles the /ingest (real-time) and /records (batch) endpoints. |
| Message Bus | Apache Kafka | Acts as the central nervous system. Buffers traffic bursts in the raw_records topic to protect downstream workers. |
| Processing Layer | Celery + Python | The dedup_service workers. Manage distributed task execution, retries, and failure handling. |
| Containerization | Docker | Encapsulates all services (API, workers, databases) for consistent deployment. |
| Orchestration | Kubernetes (K8s) + KEDA | Manages container lifecycles. KEDA (Kubernetes Event-driven Autoscaling) automatically scales worker pods based on Kafka lag. |
| Database (Relational) | PostgreSQL | Primary persistent store. Holds customers (Golden Records), adjudication (conflicts), and audit_logs. |
| Database (Search) | Elasticsearch | Powers demographic matching. Enables high-speed fuzzy search on text fields (names, DOB). |
| Biometrics (External) | Tech5 T5-ABIS | External API for 1:N biometric identification (facial/fingerprint). |
| Liveness (External) | Tech5 T5-LDS | External API for passive liveness detection to prevent spoofing. |
| Gateway/Security | Nginx / Certbot | Reverse proxy handling HTTPS termination, load balancing, and static content. |
| Authentication | JWT (JSON Web Tokens) | Secures API endpoints and manages user sessions. |
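As a sketch of the demographic-matching step, the query body below uses the standard Elasticsearch query DSL (bool/match with fuzziness); the index and field names (full_name, dob) are assumptions for illustration:

```python
# Illustrative Elasticsearch query body for demographic matching.
# The field names are hypothetical; the bool/match/fuzziness structure
# is standard Elasticsearch query DSL.

def demographic_query(name: str, dob: str) -> dict:
    return {
        "query": {
            "bool": {
                "must": [
                    # Fuzzy text match tolerates typos and transliteration
                    # variants in names.
                    {"match": {"full_name": {"query": name,
                                             "fuzziness": "AUTO"}}},
                ],
                "filter": [
                    # Exact filter on date of birth narrows the candidate
                    # set cheaply (filters are cached and unscored).
                    {"term": {"dob": dob}},
                ],
            }
        }
    }

body = demographic_query("Jon Smyth", "1990-04-01")
print(body["query"]["bool"]["must"][0]["match"]["full_name"]["fuzziness"])
```

Combining a scored fuzzy match with an exact filter is a common pattern: the filter keeps the candidate pool small while the fuzzy clause ranks the survivors.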

C. Scalability & Resilience Mechanisms

  • Horizontal Scaling: The Dedup Service workers scale out from 1 to N pods as Kafka queue depth (consumer lag) increases, driven by KEDA metrics.

  • Fail-Fast Logic: The architecture enforces a "Fail-Fast" rule at the liveness check. If an image fails liveness (T5-LDS), the expensive biometric check (T5-ABIS) is skipped entirely, saving significant computational resources.

  • Resilience: If external services (the Tech5 APIs) are down, Celery automatically retries the task with exponential backoff; "at-least-once" processing guarantees that no records are dropped.
