Vijay

Project case study

Browsing-log analytics and safe-browsing pipelines

Browsing-log analytics pipelines for safe-browsing classification, audience management, and cohort creation.

Turned raw browsing activity into governed analytical signals: threat classification, spam URL marking, audience segments, browsing-pattern cohorts, and reusable data products.

NiFi, Spark, Airflow, Python, SQL, Apache SeaTunnel, Apache Kyuubi, Trino

Context

The problem

Browsing-log data can support several product paths, but only when ingestion, validation, enrichment, threat classification, and analytics preparation are separated cleanly enough to operate and evolve.

This work is separate from mobile tower network-event systems. It focuses on browsing logs and domain/URL signals: NiFi-heavy ingestion, AI threat-analysis outputs, safe-browsing classification, audience management, cohort creation, and queryable analytical datasets.

System trace

How the work moved through the system

A high-level operating path: where the data enters, how the system shapes it, and how other teams consume the result.

  1. Browsing-log data moves through NiFi-heavy ingestion before validation and enrichment.
  2. Domain and URL signals are prepared for AI threat-analysis and safe-browsing classification.
  3. Spam or unsafe URL decisions are produced as downstream classification signals rather than mixed into ingestion logic.

Data shape

Browsing patterns

The work spans browsing-log ingestion, domain/URL enrichment, safe-browsing classification, audience management, and cohort analytics.

Safety path

URL classification

AI threat-analysis outputs helped mark suspicious or spam URLs for safe-browsing use cases.

Analytics path

Audience cohorts

Browsing patterns were shaped into audience-management signals and cohort datasets for downstream use.

Architecture

System shape

  1. Browsing-log data moves through NiFi-heavy ingestion before validation and enrichment.
  2. Domain and URL signals are prepared for AI threat-analysis and safe-browsing classification.
  3. Spam or unsafe URL decisions are produced as downstream classification signals rather than mixed into ingestion logic.
  4. Analytical pipelines transform browsing patterns into audience-management and cohort datasets.
  5. Spark and Airflow jobs handle transformation, cleanup, retention, and refresh workflows.
  6. Kyuubi/Trino/query-engine paths provide controlled execution and access for downstream users.

Ownership

What I handled

  1. Implemented browsing-log ingestion, cleanup, and retention workflow changes.
  2. Built safe-browsing and URL spam-classification support around NiFi ingestion and AI threat-analysis outputs.
  3. Built analytical pipelines for audience management and cohort creation from browsing patterns.
  4. Built validation and metadata behavior for file and record-level ingestion.
  5. Improved query-engine and Kyuubi/Trino behavior for platform-controlled data access.

Lessons

What carried forward

  1. Browsing-log analytics needs clear boundaries between ingestion, enrichment, safety classification, and audience products.
  2. The same raw data can serve multiple products only when the intermediate signals are reusable and governed.

Engineering decisions

Keep browsing logs separate from network-event analytics

Browsing logs serve audience, cohort, and safe-browsing products; mobile tower network events serve mobile network analytics. Mixing them makes the portfolio story and the system boundaries unclear.

Separate ingestion from interpretation

NiFi ingestion handled data movement while AI threat analysis and URL classification stayed as distinct enrichment and interpretation layers.
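The split between movement and interpretation can be illustrated with a minimal sketch. Everything named here is hypothetical (the `IngestedRecord` and `UrlSignal` types, the `threat_scores` lookup, and both thresholds); it is not the internal implementation, only one way the boundary could look: ingestion normalizes and extracts the domain, and a separate step turns a threat score into a label.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class IngestedRecord:
    """Output of the ingestion layer: parsed, but not interpreted."""
    url: str
    domain: str


@dataclass(frozen=True)
class UrlSignal:
    """Downstream classification signal derived from a threat score."""
    url: str
    domain: str
    label: str  # "unsafe", "spam", or "ok"


def ingest(raw_url: str) -> IngestedRecord:
    # Ingestion only trims the URL and extracts a lowercase domain.
    # It makes no safety decision.
    url = raw_url.strip()
    domain = (urlsplit(url).hostname or "").lower()
    return IngestedRecord(url=url, domain=domain)


def classify(record: IngestedRecord, threat_scores: dict,
             unsafe_at: float = 0.9, spam_at: float = 0.6) -> UrlSignal:
    # Interpretation layer: thresholds are assumptions for illustration.
    # Domains with no score default to 0.0 and are labelled "ok".
    score = threat_scores.get(record.domain, 0.0)
    if score >= unsafe_at:
        label = "unsafe"
    elif score >= spam_at:
        label = "spam"
    else:
        label = "ok"
    return UrlSignal(url=record.url, domain=record.domain, label=label)
```

Because `classify` takes the scores as an argument, the thresholds or the scoring source can change without touching ingestion.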

Build reusable audience signals

Audience management and cohort creation work better when browsing-pattern features are prepared as reusable datasets instead of one-off analysis outputs.
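As a rough sketch of what "reusable" means here, the feature table below is built once and then queried for several cohorts. The event shape `(user_id, category)`, the category names, and the visit thresholds are all assumptions made for the example, not the real schema.

```python
from collections import Counter, defaultdict


def build_category_features(events):
    """Aggregate (user_id, category) events into per-user visit counts.

    The result is the reusable dataset: {user_id: Counter({category: visits})}.
    """
    features = defaultdict(Counter)
    for user_id, category in events:
        features[user_id][category] += 1
    return dict(features)


def cohort(features, category, min_visits):
    """Select users with at least `min_visits` visits to `category`.

    A Counter returns 0 for categories a user never visited.
    """
    return sorted(
        user_id
        for user_id, counts in features.items()
        if counts[category] >= min_visits
    )
```

Different cohort definitions then become cheap calls against the same `features` value instead of separate passes over raw logs.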

Put validation before product use

File-level and record-level validation reduces downstream ambiguity and makes ingestion failures easier to isolate.
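A small sketch of the two validation levels, assuming a hypothetical CSV layout with `timestamp`, `user_id`, and `url` columns (the real file formats and rules are not published). A file-level failure rejects the whole file; record-level failures are collected with their line numbers so they can be isolated.

```python
import csv
import io

EXPECTED_HEADER = ["timestamp", "user_id", "url"]  # hypothetical schema


def check_record(row):
    """Return a rejection reason for one row, or None if the row is valid."""
    if len(row) != len(EXPECTED_HEADER):
        return "wrong column count"
    timestamp, user_id, url = row
    if not timestamp.isdigit():
        return "timestamp is not an integer"
    if not user_id:
        return "missing user_id"
    if not url.startswith(("http://", "https://")):
        return "url has no http(s) scheme"
    return None


def validate_file(text):
    """Validate CSV text; return (valid_rows, rejects).

    Raises ValueError when the header does not match (file-level failure).
    Rejects are (line_number, reason) pairs (record-level failures).
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header!r}")
    valid, rejects = [], []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        reason = check_record(row)
        if reason is None:
            valid.append(dict(zip(EXPECTED_HEADER, row)))
        else:
            rejects.append((line_no, reason))
    return valid, rejects
```

Keeping the reasons as data rather than log text makes it possible to store them as ingestion metadata next to the accepted rows.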

What can be shown

Public evidence without internal names

The internal systems stay private. This section keeps the public parts: my role, system boundaries, technology context, scale, decisions, constraints, and what I learned.

Internal enterprise system, Scale signal, High-level architecture

Domain

Browsing logs

The work focuses on browsing activity, domain/URL signals, safe-browsing classification, and audience analytics.

Product paths

Safety + audience

The same data foundation supported spam/unsafe URL classification and analytics for audience management and cohort creation.

Shape

Ingestion + analytics

Contributions include NiFi ingestion, validation, enrichment, orchestration, query paths, and downstream analytical datasets.

Architecture shape

  • Browsing-log and domain/URL data moves through NiFi-heavy ingestion before validation, enrichment, and analytics preparation.
  • AI threat-analysis outputs feed safe-browsing classification so suspicious URLs can be marked as spam or unsafe.
  • Analytical pipelines transform browsing patterns into audience-management signals and cohort datasets.
  • Workflow orchestration coordinates backfills, retention windows, cleanup, and downstream data-product refreshes.
  • Query and validation services provide controlled access patterns for analytics and operational consumers.
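The backfill and retention-window coordination mentioned above can be reduced to date arithmetic over daily partitions. This is only an illustrative sketch: the daily partitioning, the function names, and the retention length are assumptions, and the actual jobs ran under Spark and Airflow rather than as plain functions.

```python
from datetime import date, timedelta


def backfill_range(start: date, end: date) -> list:
    """Inclusive list of daily partition dates to (re)process."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def expired_partitions(partitions, today: date, retention_days: int) -> list:
    """Partition dates that fall outside the retention window, oldest first.

    A partition is expired when it is strictly older than
    `today - retention_days`.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if p < cutoff)
```

An orchestrator task could call `expired_partitions` to decide what a cleanup step drops, and `backfill_range` to enumerate the partitions a refresh should rebuild.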

Responsibilities

  • Built ingestion and cleanup workflows for browsing-log and domain/URL datasets.
  • Implemented NiFi-heavy ingestion and safe-browsing classification paths.
  • Built analytical pipelines for audience management and cohort creation based on browsing patterns.
  • Implemented validation and metadata behavior around file and record-level ingestion paths.
  • Worked on query-engine and Kyuubi/Trino execution paths for controlled data access.

Constraints

  • Internal dataset names, user groups, exact data volumes, and operational dashboards are not published.
  • This case study focuses on the data-product shape, technology context, and implementation responsibilities.

Supporting context

High-level architecture

Browsing-log data product pipeline

The pipeline can be represented as a sequence of stages: browsing logs, NiFi ingestion, validation, enrichment, AI threat-analysis outputs, safe-browsing classification, audience/cohort generation, query engines, and downstream data products.

Related case studies

Continue through related work or return to the full project index.

Apache Kyuubi + Trino + Platform engineering

Self-service data platform and governance architecture

Built data-mesh platform capabilities around Kyuubi, custom engine routing, RBAC, secrets management, Trino query access, dbt transformations, DataHub metadata, and Metabase BI.

Apache Kyuubi + Spark + Backend engineering

Automatic engine selection for Kyuubi

Changed Kyuubi engine selection so shared compute could route interactive or batch sessions using user group context.

Spark + Python + Data platform

Retail adjacency and store-flow analytics

Built reusable analytics workflows for cross-shopping, category adjacency, aisle-flow, and store-flow analysis across departments, categories, and products.