
Data Ingestion & Normalization

The data_ingestion module is the entry point for all research data. Its primary goal is to transform disparate, “noisy” external files into a unified, high-fidelity internal schema that our Intelligence Engine can process.


🌪️ The Normalization Pipeline

Raw data is rarely ready for analysis. Our ingestion engine follows a strict “Sanitize and Standardize” workflow:

Signature Detection

When a file (CSV, Excel, PDF) is uploaded, the system performs a header-signature scan. We don’t just look at file extensions; we inspect the byte-stream to identify the true encoding and structure.
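A header-signature scan of this kind can be sketched with a few magic-byte checks. This is illustrative only: the production scanner covers many more signatures, and the `detect_signature` name and the signature table here are assumptions, not the module's actual API.

```python
import io

# Illustrative magic-byte table (hypothetical subset; the real scanner
# recognizes far more formats). The leading bytes, not the extension,
# decide how the file is parsed.
SIGNATURES = {
    b"PK\x03\x04": "xlsx",       # XLSX is a ZIP container
    b"%PDF": "pdf",              # PDF document
    b"\xd0\xcf\x11\xe0": "xls",  # legacy OLE2 container (old Excel)
}

def detect_signature(stream: io.BufferedIOBase) -> str:
    """Identify the true file type from its leading bytes."""
    head = stream.read(8)
    stream.seek(0)
    for magic, kind in SIGNATURES.items():
        if head.startswith(magic):
            return kind
    # No binary signature: treat as text; a BOM reveals the encoding.
    if head.startswith(b"\xef\xbb\xbf"):
        return "csv-utf8-bom"
    return "csv"
```

A renamed `.csv` that is really an Excel workbook is caught here, because the ZIP magic bytes win over the extension.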

Semantic Type Inference

This is the “Brain” of ingestion. The engine scans the first 100 rows of any table to guess the intent of each column.

  • Is this a Date?
  • Is this a Currency?
  • Is this “Unstructured Text” meant for NLP?

By pre-tagging columns, we save the analyst from manual configuration inside the dashboard.
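The row-sampling approach can be sketched as a simple vote over sampled values. The patterns and thresholds below are simplified stand-ins for the engine's real heuristics, and `infer_column_type` is a hypothetical name:

```python
import re
from collections import Counter

# Simplified stand-in patterns; the production engine uses richer checks.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
CURRENCY_RE = re.compile(r"^[$€£]\s?\d[\d,]*(\.\d+)?$")

def infer_column_type(values, sample_size=100):
    """Sample the first `sample_size` values; each one votes on a type."""
    votes = Counter()
    for v in values[:sample_size]:
        s = str(v).strip()
        if DATE_RE.match(s):
            votes["date"] += 1
        elif CURRENCY_RE.match(s):
            votes["currency"] += 1
        elif len(s.split()) > 5:  # long free text -> NLP candidate
            votes["unstructured_text"] += 1
        else:
            votes["categorical"] += 1
    return votes.most_common(1)[0][0]
```

Voting over a sample (rather than trusting the first value) keeps a stray blank or typo in row 1 from mislabeling the whole column.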

Schema Flattening

If the data is multi-sheet or hierarchical, we flatten it into a single canonical DataFrame. This ensures that when the Intelligence Engine runs a “Cross-Tab,” it has a clean 2D surface to work on.
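The flattening step looks roughly like this, shown with the standard library only (the production path emits a pandas DataFrame, but the shape is the same: every sheet's rows land in one 2D table, tagged with their origin; `flatten_sheets` is an illustrative name):

```python
# Sketch: collapse a multi-sheet workbook into one flat row list.
def flatten_sheets(workbook: dict) -> list:
    """`workbook` maps sheet name -> list of row dicts."""
    flat = []
    for sheet_name, rows in workbook.items():
        for row in rows:
            # Tag each row with its source sheet so provenance survives.
            flat.append({"_sheet": sheet_name, **row})
    return flat

wb = {
    "Q1": [{"region": "EU", "sales": 10}],
    "Q2": [{"region": "EU", "sales": 12}, {"region": "US", "sales": 7}],
}
# flatten_sheets(wb) yields 3 rows, each carrying its sheet in "_sheet"
```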


📋 The Survey Synthesis Engine

The Survey module is more than a form builder; it is a Generative Scaffolding Tool.

AI-First Design

We leverage Gemini 2.0 Flash to bridge the gap between “Intent” and “Implementation.” A researcher can provide a one-line intent (e.g., “Customer satisfaction for a luxury resort”), and the engine will synthesize a 10-question survey with pre-configured logic branches.
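The intent-to-survey flow can be sketched as prompt construction plus response validation. The real module calls Gemini 2.0 Flash; here the model reply is canned so the flow runs offline, and `build_prompt` / `parse_survey` are hypothetical names, not the module's API:

```python
import json

def build_prompt(intent: str, n_questions: int = 10) -> str:
    """Turn a one-line research intent into a generation prompt."""
    return (
        f"Generate a {n_questions}-question survey for: {intent}. "
        "Return JSON: a list of objects with 'text', 'type', and "
        "optional 'branch_if' logic."
    )

def parse_survey(model_reply: str) -> list:
    """Validate the model's JSON; drop question types we can't render."""
    questions = json.loads(model_reply)
    allowed = {"likert", "open_text", "multiple_choice"}
    return [q for q in questions if q.get("type") in allowed]

# Canned stand-in for a model response:
canned = '[{"text": "How satisfied were you?", "type": "likert"}]'
```

Validating the reply before persisting it is the important part: generative output is treated as untrusted input, the same as any upload.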

Sticky Reasoning in Surveys

The choice of question type is dictated by the desired analytical output:

  • Likert Scales are chosen when the engine detects a need for “Health Score” tracking.
  • Open-Ended Text blocks are added when the “Sentiment Analysis” flag is detected in the prompt.
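The two rules above amount to a small flag-to-question-type mapping, which might be sketched like this (rule names and the default are assumptions for illustration):

```python
# Hypothetical intent -> question-type rules, mirroring the bullets above.
INTENT_RULES = [
    ("health score", "likert"),
    ("sentiment analysis", "open_text"),
]

def question_type_for(prompt: str) -> str:
    """Pick a question type from flags detected in the researcher prompt."""
    p = prompt.lower()
    for flag, qtype in INTENT_RULES:
        if flag in p:
            return qtype
    return "multiple_choice"  # neutral default when no flag matches
```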

💡 Technical Guardrails

  • Provenance Tracking: Every response and file row is tagged with a RawDataFile ID. This allows for “Deep Debugging”: the ability to trace a specific outlier in a 5-star rating back to the exact source row in the original client upload.

  • Rate Limiting: Public survey endpoints are heavily throttled to prevent “Bot Stuffing” of research data.
  • Memory Buffer: During Google Sheets imports, we use a streaming parser to ensure that massive sheets (1M+ cells) don’t trigger Out-Of-Memory (OOM) errors in the Celery worker.
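The streaming behavior behind that last guardrail can be sketched with the standard `csv` module: rows are consumed in bounded chunks rather than materialized all at once, which is how the worker's memory stays flat on 1M+ cell sheets. The `stream_rows` name and chunk size are illustrative, and the Sheets export transport is elided:

```python
import csv
import io

def stream_rows(text_stream, chunk_size: int = 500):
    """Yield lists of up to chunk_size row-dicts from a CSV stream."""
    reader = csv.DictReader(text_stream)
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            yield chunk      # hand off a bounded batch, then forget it
            chunk = []
    if chunk:
        yield chunk          # trailing partial batch

# Simulated 1,200-row export:
data = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1200)))
```

Each yielded batch can be written to the database and discarded, so peak memory is proportional to `chunk_size`, not to the sheet.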
