November 9, 2025
28 min read

MongoDB Aggregation Framework: Advanced Analytics and Real-Time Data Transformations for Enterprise Applications

Enterprise applications increasingly need to transform data, compute analytics in near real time, and run multi-stage aggregations without sacrificing performance or scalability. In a normalized relational schema, the same analysis often requires several joins, subqueries, and multiple round trips to the database, which can create performance bottlenecks and add operational complexity in production.

MongoDB's Aggregation Framework models data processing as a pipeline: an ordered list of stages, each of which takes the documents produced by the previous stage and filters, reshapes, joins, or groups them. Because the pipeline runs inside the database, analytics and transformations happen close to the data rather than in application code. The server's optimizer can reorder and combine stages, sharded clusters can run parts of a pipeline on each shard, and expression operators work directly on nested documents and arrays.
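As a rough illustration of the stage-by-stage model, the sketch below builds a three-stage pipeline as a plain JavaScript array. The function name `buildMonthlyRevenuePipeline` is hypothetical, and the field names (`orderDate`, `status`, `financial.total`) are assumed to match the order documents used later in this post; adjust them to your own schema.

```javascript
// Sketch only: field names mirror the examples in this post and are assumptions
// about the schema. The function name is hypothetical.
function buildMonthlyRevenuePipeline(startDate) {
  return [
    // Stage 1: filter early so later stages see fewer documents
    { $match: { orderDate: { $gte: startDate }, status: 'completed' } },

    // Stage 2: bucket orders by calendar month and aggregate each bucket
    {
      $group: {
        _id: { $dateTrunc: { date: '$orderDate', unit: 'month' } },
        revenue: { $sum: '$financial.total' },
        orders: { $sum: 1 }
      }
    },

    // Stage 3: newest month first
    { $sort: { _id: -1 } }
  ];
}

const pipeline = buildMonthlyRevenuePipeline(new Date('2025-01-01T00:00:00Z'));

// Print the stage operators in execution order
console.log(pipeline.map(stage => Object.keys(stage)[0]).join(' -> '));
// $match -> $group -> $sort
```

With a connected driver, an array like this would be passed to `collection.aggregate(pipeline)`. Note that `$dateTrunc` requires MongoDB 5.0 or later.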

The Traditional Analytics Challenge

Conventional relational database approaches to complex analytics face significant performance and scalability limitations:

-- Traditional PostgreSQL analytics - complex joins and expensive operations

-- Multi-table sales analytics with complex aggregations
WITH customer_segments AS (
    SELECT 
        c.customer_id,
        c.customer_name,
        c.email,
        c.registration_date,
        c.country,
        c.state,

        -- Customer segmentation logic
        CASE 
            WHEN c.registration_date >= CURRENT_DATE - INTERVAL '90 days' THEN 'new_customer'
            WHEN c.last_order_date >= CURRENT_DATE - INTERVAL '30 days' THEN 'active_customer'
            WHEN c.last_order_date >= CURRENT_DATE - INTERVAL '180 days' THEN 'returning_customer'
            ELSE 'dormant_customer'
        END as customer_segment,

        -- Calculate customer lifetime metrics
        c.total_orders,
        c.total_spent,
        c.average_order_value,
        c.last_order_date,

        -- Geographic classification
        CASE 
            WHEN c.country = 'US' THEN 'domestic'
            WHEN c.country IN ('CA', 'MX') THEN 'north_america'
            WHEN c.country IN ('GB', 'DE', 'FR', 'IT', 'ES') THEN 'europe'
            ELSE 'international'
        END as geographic_segment

    FROM customers c
    WHERE c.is_active = true
),

order_analytics AS (
    SELECT 
        o.order_id,
        o.customer_id,
        o.order_date,
        o.order_status,
        o.total_amount,
        o.discount_amount,
        o.tax_amount,
        o.shipping_amount,

        -- Time-based analytics
        DATE_TRUNC('month', o.order_date) as order_month,
        DATE_TRUNC('quarter', o.order_date) as order_quarter,
        DATE_TRUNC('year', o.order_date) as order_year,
        EXTRACT(dow FROM o.order_date) as day_of_week,
        EXTRACT(hour FROM o.order_date) as hour_of_day,

        -- Order categorization
        CASE 
            WHEN o.total_amount >= 1000 THEN 'high_value'
            WHEN o.total_amount >= 500 THEN 'medium_value'
            WHEN o.total_amount >= 100 THEN 'low_value'
            ELSE 'micro_transaction'
        END as order_value_segment,

        -- Seasonal analysis
        CASE 
            WHEN EXTRACT(month FROM o.order_date) IN (12, 1, 2) THEN 'winter'
            WHEN EXTRACT(month FROM o.order_date) IN (3, 4, 5) THEN 'spring'
            WHEN EXTRACT(month FROM o.order_date) IN (6, 7, 8) THEN 'summer'
            ELSE 'fall'
        END as season,

        -- Payment method analysis
        o.payment_method,
        o.payment_processor,

        -- Fulfillment metrics
        o.shipping_method,
        o.warehouse_id,
        EXTRACT(EPOCH FROM (o.shipped_date - o.order_date)) / 86400 as fulfillment_days

    FROM orders o
    WHERE o.order_date >= CURRENT_DATE - INTERVAL '2 years'
      AND o.order_status IN ('completed', 'shipped', 'delivered')
),

product_analytics AS (
    SELECT 
        oi.order_id,
        oi.product_id,
        p.product_name,
        p.category,
        p.subcategory,
        p.brand,
        p.supplier_id,
        oi.quantity,
        oi.unit_price,
        oi.total_price,
        oi.discount_amount as item_discount,

        -- Product performance metrics
        p.cost_per_unit,
        (oi.unit_price - p.cost_per_unit) as unit_margin,
        (oi.unit_price - p.cost_per_unit) * oi.quantity as total_margin,

        -- Product categorization
        CASE 
            WHEN p.category = 'Electronics' THEN 'tech'
            WHEN p.category IN ('Clothing', 'Shoes', 'Accessories') THEN 'fashion'
            WHEN p.category IN ('Home', 'Garden', 'Furniture') THEN 'home'
            ELSE 'other'
        END as product_group,

        -- Inventory and supply chain
        p.current_stock,
        p.reorder_level,
        CASE 
            WHEN p.current_stock <= p.reorder_level THEN 'low_stock'
            WHEN p.current_stock <= p.reorder_level * 2 THEN 'medium_stock'
            ELSE 'high_stock'
        END as stock_status,

        -- Supplier performance
        s.supplier_name,
        s.supplier_rating,
        s.average_lead_time

    FROM order_items oi
    JOIN products p ON oi.product_id = p.product_id
    JOIN suppliers s ON p.supplier_id = s.supplier_id
    WHERE p.is_active = true
),

comprehensive_sales_analytics AS (
    SELECT 
        cs.customer_id,
        cs.customer_name,
        cs.customer_segment,
        cs.geographic_segment,

        oa.order_id,
        oa.order_date,
        oa.order_month,
        oa.order_quarter,
        oa.order_value_segment,
        oa.season,
        oa.payment_method,
        oa.shipping_method,
        oa.fulfillment_days,

        pa.product_id,
        pa.product_name,
        pa.category,
        pa.brand,
        pa.product_group,
        pa.quantity,
        pa.unit_price,
        pa.total_price,
        pa.total_margin,
        pa.stock_status,
        pa.supplier_name,

        -- Advanced calculations requiring window functions
        SUM(pa.total_price) OVER (
            PARTITION BY cs.customer_id, oa.order_month
        ) as customer_monthly_spend,

        AVG(pa.unit_price) OVER (
            PARTITION BY pa.category, oa.order_quarter
        ) as category_avg_price_quarterly,

        ROW_NUMBER() OVER (
            PARTITION BY cs.customer_id 
            ORDER BY oa.order_date DESC
        ) as customer_order_recency,

        RANK() OVER (
            PARTITION BY oa.order_month 
            ORDER BY pa.total_margin DESC
        ) as product_margin_rank_monthly,

        -- Complex aggregations with multiple groupings
        COUNT(*) OVER (
            PARTITION BY cs.geographic_segment, oa.season
        ) as segment_seasonal_order_count,

        SUM(pa.total_price) OVER (
            PARTITION BY pa.brand, oa.order_quarter
        ) as brand_quarterly_revenue

    FROM customer_segments cs
    JOIN order_analytics oa ON cs.customer_id = oa.customer_id
    JOIN product_analytics pa ON oa.order_id = pa.order_id
),

performance_metrics AS (
    SELECT 
        csa.*,

        -- Customer behavior analysis
        CASE 
            WHEN customer_order_recency <= 3 THEN 'frequent_buyer'
            WHEN customer_order_recency <= 10 THEN 'regular_buyer'
            ELSE 'occasional_buyer'
        END as buying_frequency,

        -- Product performance analysis
        CASE 
            WHEN product_margin_rank_monthly <= 10 THEN 'top_margin_product'
            WHEN product_margin_rank_monthly <= 50 THEN 'good_margin_product'
            ELSE 'low_margin_product'
        END as margin_performance,

        -- Market analysis
        ROUND(
            (customer_monthly_spend / NULLIF(segment_seasonal_order_count::DECIMAL, 0)) * 100, 
            2
        ) as customer_segment_contribution_pct,

        ROUND(
            (brand_quarterly_revenue / SUM(brand_quarterly_revenue) OVER ()) * 100,
            2
        ) as brand_market_share_pct

    FROM comprehensive_sales_analytics csa
)

SELECT 
    -- Dimensional attributes
    customer_segment,
    geographic_segment,
    order_quarter,
    season,
    product_group,
    category,
    brand,
    payment_method,
    shipping_method,

    -- Aggregated metrics
    COUNT(DISTINCT customer_id) as unique_customers,
    COUNT(DISTINCT order_id) as total_orders,
    COUNT(DISTINCT product_id) as unique_products,

    -- Revenue metrics
    SUM(total_price) as total_revenue,
    AVG(total_price) as avg_order_value,
    SUM(total_margin) as total_margin,
    ROUND(AVG(total_margin), 2) as avg_margin_per_item,
    ROUND((SUM(total_margin) / SUM(total_price)) * 100, 1) as margin_percentage,

    -- Customer metrics
    AVG(customer_monthly_spend) as avg_customer_monthly_spend,
    COUNT(DISTINCT CASE WHEN buying_frequency = 'frequent_buyer' THEN customer_id END) as frequent_buyers,
    COUNT(DISTINCT CASE WHEN buying_frequency = 'regular_buyer' THEN customer_id END) as regular_buyers,

    -- Product performance
    COUNT(CASE WHEN margin_performance = 'top_margin_product' THEN 1 END) as top_margin_products,
    AVG(category_avg_price_quarterly) as avg_category_price,

    -- Operational metrics
    AVG(fulfillment_days) as avg_fulfillment_days,
    COUNT(CASE WHEN stock_status = 'low_stock' THEN 1 END) as low_stock_items,
    COUNT(DISTINCT supplier_name) as unique_suppliers,

    -- Time-based trends
    AVG(brand_market_share_pct) as avg_brand_market_share,
    ROUND(AVG(customer_segment_contribution_pct), 1) as avg_segment_contribution,

    -- Growth indicators (comparing to previous period)
    LAG(SUM(total_price)) OVER (
        PARTITION BY customer_segment, geographic_segment, product_group
        ORDER BY order_quarter
    ) as prev_quarter_revenue,

    ROUND(
        ((SUM(total_price) - LAG(SUM(total_price)) OVER (
            PARTITION BY customer_segment, geographic_segment, product_group
            ORDER BY order_quarter
        )) / NULLIF(LAG(SUM(total_price)) OVER (
            PARTITION BY customer_segment, geographic_segment, product_group
            ORDER BY order_quarter
        ), 0)) * 100,
        1
    ) as revenue_growth_pct

FROM performance_metrics
GROUP BY 
    customer_segment, geographic_segment, order_quarter, season,
    product_group, category, brand, payment_method, shipping_method
HAVING 
    COUNT(DISTINCT customer_id) >= 10  -- Filter for statistical significance
    AND SUM(total_price) >= 1000       -- Minimum revenue threshold
ORDER BY 
    order_quarter DESC,
    total_revenue DESC,
    unique_customers DESC
LIMIT 1000;

-- Problems with traditional SQL analytics approach:
-- 1. Extremely complex query structure with multiple CTEs and window functions
-- 2. Expensive JOIN operations across multiple large tables
-- 3. Poor performance due to multiple aggregation passes
-- 4. Limited support for nested data structures and arrays
-- 5. Difficult to maintain and modify complex analytical logic
-- 6. Memory-intensive operations with large intermediate result sets
-- 7. No native support for document-based data transformations
-- 8. Complex indexing requirements for optimal performance
-- 9. Difficult real-time processing due to query complexity
-- 10. Limited horizontal scaling for large analytical workloads

-- MySQL analytical limitations (even more restrictive)
SELECT 
    c.customer_segment,
    DATE_FORMAT(o.order_date, '%Y-%m') as order_month,
    COUNT(DISTINCT o.customer_id) as customers,
    SUM(o.total_amount) as revenue,
    AVG(o.total_amount) as avg_order_value
FROM (
    SELECT 
        customer_id,
        CASE 
            WHEN registration_date >= DATE_SUB(NOW(), INTERVAL 90 DAY) THEN 'new'
            WHEN last_order_date >= DATE_SUB(NOW(), INTERVAL 30 DAY) THEN 'active'  
            ELSE 'dormant'
        END as customer_segment
    FROM customers 
) c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= DATE_SUB(NOW(), INTERVAL 1 YEAR)
GROUP BY c.customer_segment, DATE_FORMAT(o.order_date, '%Y-%m')
ORDER BY order_month DESC, revenue DESC;

-- MySQL limitations for analytics:
-- - No window functions before version 8.0
-- - No CTEs before version 8.0
-- - JSON handling is awkward for complex nested data
-- - Fewer built-in aggregate and analytical functions than PostgreSQL
-- - Limited support for complex data transformations
-- - Large analytical queries often perform poorly
-- - No native support for real-time streaming analytics
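In contrast to the normalized tables above, the MongoDB examples in this post read from an `orders` collection whose documents embed their line items and nested sub-documents. The sketch below shows what such a document might look like. The field names are taken from the pipeline in this post, but every value is invented. The two helper functions are hypothetical plain-JavaScript mirrors of the pipeline's `orderQuarter` and `fulfillmentDays` expressions, included only to make the arithmetic easy to check.

```javascript
// Hypothetical order document: field names follow the pipeline in this post,
// all values are invented for illustration.
const sampleOrder = {
  customerId: 'C-1001',
  orderDate: new Date('2025-08-14T15:30:00Z'),
  status: 'delivered',
  items: [
    { productId: 'P-1', quantity: 2, unitPrice: 40 },
    { productId: 'P-2', quantity: 1, unitPrice: 120 }
  ],
  financial: { total: 200, discount: 10, tax: 16, shipping: 5 },
  payment: { method: 'card' },
  shipping: { method: 'ground' },
  fulfillment: { shippedAt: new Date('2025-08-16T15:30:00Z') }
};

// Mirror of the orderQuarter expression: year + '-Q' + ceil(month / 3).
// MongoDB date operators default to UTC, so the UTC getters are used here.
function quarterLabel(date) {
  const month = date.getUTCMonth() + 1; // getUTCMonth() is 0-based, $month is 1-based
  return `${date.getUTCFullYear()}-Q${Math.ceil(month / 3)}`;
}

// Mirror of the fulfillmentDays expression: millisecond difference / 86400000,
// or null when either date is missing.
function fulfillmentDays(order) {
  const shippedAt = order.fulfillment && order.fulfillment.shippedAt;
  if (!shippedAt || !order.orderDate) return null;
  return (shippedAt - order.orderDate) / 86400000;
}

console.log(quarterLabel(sampleOrder.orderDate)); // 2025-Q3
console.log(fulfillmentDays(sampleOrder));        // 2
```

Because the line items live inside the order document, per-order metrics such as item counts need no join. Only customer and product attributes are fetched with `$lookup`.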

MongoDB's Aggregation Framework expresses the same kind of analysis as a single pipeline of stages:

// MongoDB Aggregation Framework - Comprehensive analytics and data transformation
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('enterprise_analytics');

// Comprehensive Enterprise Analytics with MongoDB Aggregation Framework
class AdvancedAnalyticsProcessor {
  constructor(db) {
    this.db = db;
    this.collections = {
      customers: db.collection('customers'),
      orders: db.collection('orders'),
      products: db.collection('products'),
      analytics: db.collection('analytics_results'),
      realTimeMetrics: db.collection('real_time_metrics')
    };

    // Performance optimization settings
    this.aggregationOptions = {
      allowDiskUse: true,
      maxTimeMS: 300000, // 5 minutes timeout
      hint: null, // Will be set dynamically based on query
      explain: false,
      comment: 'enterprise_analytics_query'
    };

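    // Not awaited: a constructor cannot be async, so index creation runs in the
    // background. setupAnalyticsIndexes() catches and logs its own errors.
    this.setupAnalyticsIndexes();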
  }

  async setupAnalyticsIndexes() {
    console.log('Setting up optimized indexes for analytics...');

    try {
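      // Customer collection indexes
      // Note: the `background` option is deprecated and ignored on MongoDB 4.2+;
      // it only has an effect on older servers.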
      await this.collections.customers.createIndexes([
        { key: { customerId: 1 }, background: true, name: 'customer_id_idx' },
        { key: { registrationDate: -1, customerSegment: 1 }, background: true, name: 'registration_segment_idx' },
        { key: { 'address.country': 1, 'address.state': 1 }, background: true, name: 'geographic_idx' },
        { key: { loyaltyTier: 1, totalSpent: -1 }, background: true, name: 'loyalty_spending_idx' },
        { key: { lastOrderDate: -1, isActive: 1 }, background: true, name: 'activity_idx' }
      ]);

      // Orders collection indexes
      await this.collections.orders.createIndexes([
        { key: { customerId: 1, orderDate: -1 }, background: true, name: 'customer_date_idx' },
        { key: { orderDate: -1, status: 1 }, background: true, name: 'date_status_idx' },
        { key: { 'financial.total': -1, orderDate: -1 }, background: true, name: 'value_date_idx' },
        { key: { 'items.productId': 1, orderDate: -1 }, background: true, name: 'product_date_idx' },
        { key: { 'shipping.region': 1, orderDate: -1 }, background: true, name: 'region_date_idx' }
      ]);

      // Products collection indexes  
      await this.collections.products.createIndexes([
        { key: { productId: 1 }, background: true, name: 'product_id_idx' },
        { key: { category: 1, subcategory: 1 }, background: true, name: 'category_idx' },
        { key: { brand: 1, 'pricing.currentPrice': -1 }, background: true, name: 'brand_price_idx' },
        { key: { 'inventory.currentStock': 1, 'inventory.reorderLevel': 1 }, background: true, name: 'inventory_idx' },
        { key: { supplierId: 1, isActive: 1 }, background: true, name: 'supplier_active_idx' }
      ]);

      console.log('Analytics indexes created successfully');

    } catch (error) {
      console.error('Error creating analytics indexes:', error);
    }
  }

  async performComprehensiveCustomerAnalytics(timeRange = 'last_12_months', customerSegments = null) {
    console.log(`Performing comprehensive customer analytics for ${timeRange}...`);

    const startTime = Date.now();

    // Calculate date range (months are approximated as 30 days, years as 365)
    const dateRanges = {
      'last_30_days': new Date(Date.now() - 30 * 24 * 60 * 60 * 1000),
      'last_90_days': new Date(Date.now() - 90 * 24 * 60 * 60 * 1000),
      'last_6_months': new Date(Date.now() - 6 * 30 * 24 * 60 * 60 * 1000),
      'last_12_months': new Date(Date.now() - 12 * 30 * 24 * 60 * 60 * 1000),
      'last_2_years': new Date(Date.now() - 2 * 365 * 24 * 60 * 60 * 1000)
    };

    const startDate = dateRanges[timeRange] || dateRanges['last_12_months'];

    const pipeline = [
      // Stage 1: Match orders within time range
      {
        $match: {
          orderDate: { $gte: startDate },
          status: { $in: ['completed', 'shipped', 'delivered'] }
        }
      },

      // Stage 2: Lookup customer information
      {
        $lookup: {
          from: 'customers',
          localField: 'customerId',
          foreignField: 'customerId',
          as: 'customer'
        }
      },

      // Stage 3: Unwind customer array (should be single document)
      {
        $unwind: '$customer'
      },

      // Stage 4: Filter by customer segments if specified
      ...(customerSegments ? [{
        $match: {
          'customer.segment': { $in: customerSegments }
        }
      }] : []),

      // Stage 5: Lookup product information for each order item
      {
        $lookup: {
          from: 'products',
          localField: 'items.productId',
          foreignField: 'productId',
          as: 'productDetails'
        }
      },

      // Stage 6: Add comprehensive calculated fields
      {
        $addFields: {
          // Time-based dimensions
          orderMonth: {
            $dateTrunc: {
              date: '$orderDate',
              unit: 'month'
            }
          },
          orderQuarter: {
            $concat: [
              { $toString: { $year: '$orderDate' } },
              '-Q',
              { $toString: {
                $ceil: { $divide: [{ $month: '$orderDate' }, 3] }
              }}
            ]
          },
          orderYear: { $year: '$orderDate' },
          dayOfWeek: { $dayOfWeek: '$orderDate' },
          hourOfDay: { $hour: '$orderDate' },

          // Seasonal classification
          season: {
            $switch: {
              branches: [
                {
                  case: { $in: [{ $month: '$orderDate' }, [12, 1, 2]] },
                  then: 'winter'
                },
                {
                  case: { $in: [{ $month: '$orderDate' }, [3, 4, 5]] },
                  then: 'spring'
                },
                {
                  case: { $in: [{ $month: '$orderDate' }, [6, 7, 8]] },
                  then: 'summer'
                }
              ],
              default: 'fall'
            }
          },

          // Customer segmentation
          customerSegment: {
            $switch: {
              branches: [
                {
                  case: {
                    $gte: [
                      '$customer.registrationDate',
                      new Date(Date.now() - 90 * 24 * 60 * 60 * 1000)
                    ]
                  },
                  then: 'new_customer'
                },
                {
                  case: {
                    $gte: [
                      '$customer.lastOrderDate',
                      new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)
                    ]
                  },
                  then: 'active_customer'
                },
                {
                  case: {
                    $gte: [
                      '$customer.lastOrderDate',
                      new Date(Date.now() - 180 * 24 * 60 * 60 * 1000)
                    ]
                  },
                  then: 'returning_customer'
                }
              ],
              default: 'dormant_customer'
            }
          },

          // Geographic classification
          geographicSegment: {
            $switch: {
              branches: [
                {
                  case: { $eq: ['$customer.address.country', 'US'] },
                  then: 'domestic'
                },
                {
                  case: { $in: ['$customer.address.country', ['CA', 'MX']] },
                  then: 'north_america'
                },
                {
                  case: { $in: ['$customer.address.country', ['GB', 'DE', 'FR', 'IT', 'ES']] },
                  then: 'europe'
                }
              ],
              default: 'international'
            }
          },

          // Order value classification
          orderValueSegment: {
            $switch: {
              branches: [
                {
                  case: { $gte: ['$financial.total', 1000] },
                  then: 'high_value'
                },
                {
                  case: { $gte: ['$financial.total', 500] },
                  then: 'medium_value'
                },
                {
                  case: { $gte: ['$financial.total', 100] },
                  then: 'low_value'
                }
              ],
              default: 'micro_transaction'
            }
          },

          // Enhanced item analysis with product details
          enrichedItems: {
            $map: {
              input: '$items',
              as: 'item',
              in: {
                $mergeObjects: [
                  '$$item',
                  {
                    productDetails: {
                      $arrayElemAt: [
                        {
                          $filter: {
                            input: '$productDetails',
                            cond: { $eq: ['$$this.productId', '$$item.productId'] }
                          }
                        },
                        0
                      ]
                    }
                  },
                  {
                    // Calculate margins and performance metrics
                    unitMargin: {
                      $subtract: [
                        '$$item.unitPrice',
                        {
                          $arrayElemAt: [
                            {
                              $map: {
                                input: {
                                  $filter: {
                                    input: '$productDetails',
                                    cond: { $eq: ['$$this.productId', '$$item.productId'] }
                                  }
                                },
                                in: '$$this.costPerUnit'
                              }
                            },
                            0
                          ]
                        }
                      ]
                    },

                    categoryGroup: {
                      $let: {
                        vars: {
                          category: {
                            $arrayElemAt: [
                              {
                                $map: {
                                  input: {
                                    $filter: {
                                      input: '$productDetails',
                                      cond: { $eq: ['$$this.productId', '$$item.productId'] }
                                    }
                                  },
                                  in: '$$this.category'
                                }
                              },
                              0
                            ]
                          }
                        },
                        in: {
                          $switch: {
                            branches: [
                              { case: { $eq: ['$$category', 'Electronics'] }, then: 'tech' },
                              { case: { $in: ['$$category', ['Clothing', 'Shoes', 'Accessories']] }, then: 'fashion' },
                              { case: { $in: ['$$category', ['Home', 'Garden', 'Furniture']] }, then: 'home' }
                            ],
                            default: 'other'
                          }
                        }
                      }
                    }
                  }
                ]
              }
            }
          },

          // Customer lifetime metrics (approximation)
          estimatedCustomerValue: {
            $multiply: [
              '$financial.total',
              { $add: ['$customer.averageOrdersPerYear', 1] }
            ]
          },

          // Fulfillment performance
          fulfillmentDays: {
            $cond: {
              if: { $and: ['$fulfillment.shippedAt', '$orderDate'] },
              then: {
                $divide: [
                  { $subtract: ['$fulfillment.shippedAt', '$orderDate'] },
                  86400000 // Convert milliseconds to days
                ]
              },
              else: null
            }
          }
        }
      },

      // Stage 7: Group by multiple dimensions for comprehensive analytics
      {
        $group: {
          _id: {
            customerSegment: '$customerSegment',
            geographicSegment: '$geographicSegment',
            orderMonth: '$orderMonth',
            orderQuarter: '$orderQuarter',
            season: '$season',
            orderValueSegment: '$orderValueSegment'
          },

          // Customer metrics
          uniqueCustomers: { $addToSet: '$customerId' },
          totalOrders: { $sum: 1 },

          // Financial metrics
          totalRevenue: { $sum: '$financial.total' },
          totalDiscount: { $sum: '$financial.discount' },
          totalTax: { $sum: '$financial.tax' },
          totalShipping: { $sum: '$financial.shipping' },

          // Order value statistics
          avgOrderValue: { $avg: '$financial.total' },
          maxOrderValue: { $max: '$financial.total' },
          minOrderValue: { $min: '$financial.total' },

          // Product and item metrics
          totalItems: { $sum: { $size: '$items' } },
          avgItemsPerOrder: { $avg: { $size: '$items' } },
          uniqueProducts: { 
            $addToSet: {
              $reduce: {
                input: '$items',
                initialValue: [],
                in: { $concatArrays: ['$$value', ['$$this.productId']] }
              }
            }
          },

          // Category distribution
          categoryBreakdown: {
            $push: {
              $map: {
                input: '$enrichedItems',
                in: '$$this.categoryGroup'
              }
            }
          },

          // Customer behavior metrics
          avgCustomerValue: { $avg: '$estimatedCustomerValue' },
          loyaltyTierDistribution: { $push: '$customer.loyaltyTier' },

          // Operational metrics
          avgFulfillmentDays: { $avg: '$fulfillmentDays' },
          paymentMethodDistribution: { $push: '$payment.method' },
          shippingMethodDistribution: { $push: '$shipping.method' },

          // Geographic insights
          stateDistribution: { $push: '$customer.address.state' },
          countryDistribution: { $push: '$customer.address.country' },

          // Time-based patterns
          dayOfWeekDistribution: { $push: '$dayOfWeek' },
          hourOfDayDistribution: { $push: '$hourOfDay' },

          // Customer acquisition and retention
          // (these sums count orders placed by each segment, not distinct customers)
          newCustomersCount: {
            $sum: {
              $cond: [{ $eq: ['$customerSegment', 'new_customer'] }, 1, 0]
            }
          },
          returningCustomersCount: {
            $sum: {
              $cond: [{ $eq: ['$customerSegment', 'returning_customer'] }, 1, 0]
            }
          },

          // First and last order dates for trend analysis
          firstOrderDate: { $min: '$orderDate' },
          lastOrderDate: { $max: '$orderDate' }
        }
      },

      // Stage 8: Calculate derived metrics and insights
      {
        $addFields: {
          // Calculate actual unique counts
          uniqueCustomerCount: { $size: '$uniqueCustomers' },
          uniqueProductCount: {
            $size: {
              $reduce: {
                input: '$uniqueProducts',
                initialValue: [],
                in: { $setUnion: ['$$value', '$$this'] }
              }
            }
          },

          // Revenue per customer
          revenuePerCustomer: {
            $cond: {
              if: { $gt: [{ $size: '$uniqueCustomers' }, 0] },
              then: { $divide: ['$totalRevenue', { $size: '$uniqueCustomers' }] },
              else: 0
            }
          },

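          // Margin analysis: revenue net of discounts, not a cost-based margin.
          // The percentage is null when revenue is zero, because $divide by zero errors.
          grossMargin: { $subtract: ['$totalRevenue', '$totalDiscount'] },
          marginPercentage: {
            $cond: {
              if: { $gt: ['$totalRevenue', 0] },
              then: {
                $multiply: [
                  { $divide: [{ $subtract: ['$totalRevenue', '$totalDiscount'] }, '$totalRevenue'] },
                  100
                ]
              },
              else: null
            }
          },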

          // Category insights: count occurrences of each category group,
          // then keep the five most frequent ($sortArray needs MongoDB 5.2+)
          topCategories: {
            $let: {
              vars: {
                // Flatten the per-order arrays into one array of category groups
                allCategories: {
                  $reduce: {
                    input: '$categoryBreakdown',
                    initialValue: [],
                    in: { $concatArrays: ['$$value', '$$this'] }
                  }
                }
              },
              in: {
                $slice: [
                  {
                    $sortArray: {
                      input: {
                        $map: {
                          // $setUnion removes duplicates, leaving distinct groups
                          input: { $setUnion: ['$$allCategories', []] },
                          as: 'cat',
                          in: {
                            category: '$$cat',
                            count: {
                              $size: {
                                $filter: {
                                  input: '$$allCategories',
                                  as: 'c',
                                  cond: { $eq: ['$$c', '$$cat'] }
                                }
                              }
                            }
                          }
                        }
                      },
                      sortBy: { count: -1 }
                    }
                  },
                  5 // Top 5 categories
                ]
              }
            }
          },

          // Customer distribution insights
          customerSegmentMetrics: {
            newCustomerPercentage: {
              $multiply: [
                { $divide: ['$newCustomersCount', '$totalOrders'] },
                100
              ]
            },
            returningCustomerPercentage: {
              $multiply: [
                { $divide: ['$returningCustomersCount', '$totalOrders'] },
                100
              ]
            }
          },

          // Time range analysis
          analysisPeriodDays: {
            $divide: [
              { $subtract: ['$lastOrderDate', '$firstOrderDate'] },
              86400000 // Convert to days
            ]
          },

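          // Performance indicators (null when the group spans zero time,
          // e.g. a single order, because $divide by zero raises an error)
          performanceMetrics: {
            $let: {
              vars: {
                periodDays: {
                  $divide: [{ $subtract: ['$lastOrderDate', '$firstOrderDate'] }, 86400000]
                }
              },
              in: {
                ordersPerDay: {
                  $cond: [{ $gt: ['$$periodDays', 0] }, { $divide: ['$totalOrders', '$$periodDays'] }, null]
                },
                avgRevenuePerDay: {
                  $cond: [{ $gt: ['$$periodDays', 0] }, { $divide: ['$totalRevenue', '$$periodDays'] }, null]
                }
              }
            }
          }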
        }
      },

      // Stage 9: Project final results with clean structure
      {
        $project: {
          _id: 0,

          // Dimensions
          dimensions: '$_id',

          // Core metrics
          metrics: {
            customers: {
              total: '$uniqueCustomerCount',
              new: '$newCustomersCount',
              returning: '$returningCustomersCount',
              newPercentage: '$customerSegmentMetrics.newCustomerPercentage',
              returningPercentage: '$customerSegmentMetrics.returningCustomerPercentage'
            },

            orders: {
              total: '$totalOrders',
              averageValue: '$avgOrderValue',
              maxValue: '$maxOrderValue',
              minValue: '$minOrderValue',
              itemsPerOrder: '$avgItemsPerOrder'
            },

            revenue: {
              total: '$totalRevenue',
              gross: '$grossMargin',
              marginPercentage: '$marginPercentage',
              revenuePerCustomer: '$revenuePerCustomer',
              totalDiscount: '$totalDiscount',
              totalTax: '$totalTax',
              totalShipping: '$totalShipping'
            },

            products: {
              uniqueCount: '$uniqueProductCount',
              totalItems: '$totalItems',
              topCategories: '$topCategories'
            },

            operations: {
              avgFulfillmentDays: '$avgFulfillmentDays',
              analysisPeriodDays: '$analysisPeriodDays'
            },

            performance: '$performanceMetrics'
          },

          // Insights and distributions
          insights: {
            loyaltyTiers: '$loyaltyTierDistribution',
            paymentMethods: '$paymentMethodDistribution',
            shippingMethods: '$shippingMethodDistribution',
            geographic: {
              states: '$stateDistribution',
              countries: '$countryDistribution'
            },
            temporal: {
              daysOfWeek: '$dayOfWeekDistribution',
              hoursOfDay: '$hourOfDayDistribution'
            }
          },

          // Time range
          timeRange: {
            startDate: '$firstOrderDate',
            endDate: '$lastOrderDate'
          }
        }
      },

      // Stage 10: Sort results by significance
      {
        $sort: {
          'metrics.revenue.total': -1,
          'metrics.customers.total': -1
        }
      }
    ];

    try {
      const results = await this.collections.orders.aggregate(pipeline, this.aggregationOptions).toArray();

      const processingTime = Date.now() - startTime;
      console.log(`Customer analytics completed in ${processingTime}ms, found ${results.length} result groups`);

      return {
        success: true,
        processingTimeMs: processingTime,
        timeRange: timeRange,
        resultCount: results.length,
        analytics: results,
        metadata: {
          queryComplexity: 'high',
          stagesCount: pipeline.length,
          indexesUsed: 'multiple_compound_indexes',
          aggregationFeatures: [
            'lookup_joins',
            'complex_expressions',
            'grouping_aggregations', 
            'conditional_logic',
            'array_operations',
            'date_functions',
            'mathematical_operations'
          ]
        }
      };

    } catch (error) {
      console.error('Error performing customer analytics:', error);
      return {
        success: false,
        error: error.message,
        processingTimeMs: Date.now() - startTime
      };
    }
  }

  async performRealTimeProductAnalytics(refreshInterval = 60000) {
    console.log('Starting real-time product performance analytics...');

    const pipeline = [
      // Stage 1: Match recent orders (last 24 hours)
      {
        $match: {
          orderDate: { 
            $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) 
          },
          status: { $in: ['completed', 'processing', 'shipped'] }
        }
      },

      // Stage 2: Unwind order items for item-level analysis
      {
        $unwind: '$items'
      },

      // Stage 3: Lookup product details
      {
        $lookup: {
          from: 'products',
          localField: 'items.productId',
          foreignField: 'productId',
          as: 'product'
        }
      },

      // Stage 4: Unwind product (should be single document)
      {
        $unwind: '$product'
      },

      // Stage 5: Lookup current inventory levels
      {
        $lookup: {
          from: 'inventory',
          localField: 'items.productId',
          foreignField: 'productId',
          as: 'inventory'
        }
      },

      // Stage 6: Add calculated fields for real-time metrics
      {
        $addFields: {
          // Time buckets for real-time analysis
          hourBucket: {
            $dateTrunc: {
              date: '$orderDate',
              unit: 'hour'
            }
          },

          // Product performance metrics
          itemRevenue: { $multiply: ['$items.quantity', '$items.unitPrice'] },
          itemMargin: {
            $multiply: [
              '$items.quantity',
              { $subtract: ['$items.unitPrice', '$product.costPerUnit'] }
            ]
          },

          // Inventory status
          currentStock: { $arrayElemAt: ['$inventory.currentStock', 0] },
          reorderLevel: { $arrayElemAt: ['$inventory.reorderLevel', 0] },

          // Product categorization
          categoryGroup: {
            $switch: {
              branches: [
                { case: { $eq: ['$product.category', 'Electronics'] }, then: 'tech' },
                { case: { $in: ['$product.category', ['Clothing', 'Shoes']] }, then: 'fashion' },
                { case: { $in: ['$product.category', ['Home', 'Garden']] }, then: 'home' }
              ],
              default: 'other'
            }
          },

          // Price performance
          pricePoint: {
            $switch: {
              branches: [
                { case: { $gte: ['$items.unitPrice', 500] }, then: 'premium' },
                { case: { $gte: ['$items.unitPrice', 100] }, then: 'mid_range' },
                { case: { $gte: ['$items.unitPrice', 25] }, then: 'budget' }
              ],
              default: 'economy'
            }
          },

          // Velocity indicators
          orderRecency: {
            $divide: [
              { $subtract: [new Date(), '$orderDate'] },
              3600000 // Convert to hours
            ]
          }
        }
      },

      // Stage 7: Group by product and time buckets for real-time aggregation
      {
        $group: {
          _id: {
            productId: '$items.productId',
            productName: '$product.name',
            category: '$product.category',
            categoryGroup: '$categoryGroup',
            brand: '$product.brand',
            pricePoint: '$pricePoint',
            hourBucket: '$hourBucket'
          },

          // Sales metrics
          totalQuantitySold: { $sum: '$items.quantity' },
          totalRevenue: { $sum: '$itemRevenue' },
          totalMargin: { $sum: '$itemMargin' },
          uniqueOrders: { $addToSet: '$_id' },
          avgOrderQuantity: { $avg: '$items.quantity' },

          // Pricing metrics
          avgSellingPrice: { $avg: '$items.unitPrice' },
          maxSellingPrice: { $max: '$items.unitPrice' },
          minSellingPrice: { $min: '$items.unitPrice' },

          // Inventory insights
          currentStockLevel: { $first: '$currentStock' },
          reorderThreshold: { $first: '$reorderLevel' },

          // Time-based insights
          avgOrderRecency: { $avg: '$orderRecency' },
          latestOrderTime: { $max: '$orderDate' },
          earliestOrderTime: { $min: '$orderDate' },

          // Customer insights
          uniqueCustomers: { $addToSet: '$customerId' },

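          // Geographic distribution
          // Note: this pipeline has no customers $lookup, so this path only resolves
          // if orders embed a customer.address snapshot; otherwise the set stays empty
          regions: { $addToSet: '$customer.address.state' },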

          // Order characteristics
          avgOrderValue: { $avg: '$financial.total' },
          shippingMethodsUsed: { $addToSet: '$shipping.method' }
        }
      },

      // Stage 8: Calculate performance indicators and rankings
      {
        $addFields: {
          // Performance calculations
          marginPercentage: {
            $cond: {
              if: { $gt: ['$totalRevenue', 0] },
              then: { $multiply: [{ $divide: ['$totalMargin', '$totalRevenue'] }, 100] },
              else: 0
            }
          },

          uniqueOrderCount: { $size: '$uniqueOrders' },
          uniqueCustomerCount: { $size: '$uniqueCustomers' },

          // Inventory health
          stockStatus: {
            $switch: {
              branches: [
                {
                  case: { $lte: ['$currentStockLevel', 0] },
                  then: 'out_of_stock'
                },
                {
                  case: { $lte: ['$currentStockLevel', '$reorderThreshold'] },
                  then: 'low_stock'
                },
                {
                  case: { $lte: ['$currentStockLevel', { $multiply: ['$reorderThreshold', 2] }] },
                  then: 'medium_stock'
                }
              ],
              default: 'high_stock'
            }
          },

          // Velocity metrics
          salesVelocity: {
            $divide: ['$totalQuantitySold', { $max: ['$avgOrderRecency', 1] }]
          },

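          // Customer engagement: distinct customers per distinct order.
          // Computed from the sets directly, because uniqueCustomerCount and
          // uniqueOrderCount are added in this same stage and are not yet readable here.
          customerRetention: {
            $divide: [{ $size: '$uniqueCustomers' }, { $size: '$uniqueOrders' }]
          },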

          // Regional penetration
          regionalReach: { $size: '$regions' }
        }
      },

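      // Stage 9: Add ranking and performance classification
      {
        $setWindowFields: {
          sortBy: { totalRevenue: -1 },
          output: {
            revenueRank: { $rank: {} },
            // Row count across the whole result set, used below for the percentile
            rankedRowCount: {
              $count: {},
              window: { documents: ['unbounded', 'unbounded'] }
            }
          }
        }
      },

      // $percentRank is not a MongoDB window operator, so derive the percent rank
      // as (rank - 1) / (rows - 1), with 0 when there is only one row
      {
        $addFields: {
          revenuePercentile: {
            $cond: [
              { $gt: ['$rankedRowCount', 1] },
              { $divide: [{ $subtract: ['$revenueRank', 1] }, { $subtract: ['$rankedRowCount', 1] }] },
              0
            ]
          }
        }
      },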

      {
        $setWindowFields: {
          partitionBy: '$_id.categoryGroup',
          sortBy: { totalQuantitySold: -1 },
          output: {
            categoryRank: { $rank: {} }
          }
        }
      },

      // Stage 10: Add performance classification
      {
        $addFields: {
          performanceClassification: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $lte: ['$revenueRank', 10] },
                      { $gt: ['$marginPercentage', 20] },
                      { $gt: ['$uniqueCustomerCount', 5] }
                    ]
                  },
                  then: 'star_performer'
                },
                {
                  case: {
                    $and: [
                      { $lte: ['$revenueRank', 50] },
                      { $gt: ['$marginPercentage', 15] }
                    ]
                  },
                  then: 'strong_performer'
                },
                {
                  case: {
                    $and: [
                      { $gt: ['$totalRevenue', 100] },
                      { $gt: ['$marginPercentage', 10] }
                    ]
                  },
                  then: 'solid_performer'
                },
                {
                  case: { $lte: ['$totalRevenue', 50] },
                  then: 'low_performer'
                }
              ],
              default: 'average_performer'
            }
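          }
        }
      },

      // Recommendations and alerts read performanceClassification, which is only
      // visible to stages after the one that defines it, so they get their own stage
      {
        $addFields: {
          // Action recommendations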
          recommendations: {
            $switch: {
              branches: [
                {
                  case: { $eq: ['$stockStatus', 'out_of_stock'] },
                  then: ['urgent_restock', 'review_demand_forecast']
                },
                {
                  case: { $eq: ['$stockStatus', 'low_stock'] },
                  then: ['schedule_restock', 'monitor_sales_velocity']
                },
                {
                  case: {
                    $and: [
                      { $eq: ['$performanceClassification', 'star_performer'] },
                      { $gt: ['$currentStockLevel', '$reorderThreshold'] }
                    ]
                  },
                  then: ['increase_marketing', 'optimize_pricing', 'expand_availability']
                },
                {
                  case: { $eq: ['$performanceClassification', 'low_performer'] },
                  then: ['review_pricing', 'improve_marketing', 'consider_discontinuation']
                }
              ],
              default: ['monitor_performance', 'optimize_inventory_levels']
            }
          },

          // Real-time alerts
          alerts: {
            $filter: {
              input: [
                {
                  $cond: {
                    if: { $eq: ['$stockStatus', 'out_of_stock'] },
                    then: {
                      type: 'critical',
                      message: 'Product is out of stock with active sales',
                      priority: 'high'
                    },
                    else: null
                  }
                },
                {
                  $cond: {
                    if: {
                      $and: [
                        { $eq: ['$performanceClassification', 'star_performer'] },
                        { $eq: ['$stockStatus', 'low_stock'] }
                      ]
                    },
                    then: {
                      type: 'opportunity',
                      message: 'High-performing product running low on stock',
                      priority: 'medium'
                    },
                    else: null
                  }
                },
                {
                  $cond: {
                    if: { $lt: ['$marginPercentage', 5] },
                    then: {
                      type: 'margin_concern',
                      message: 'Product margin below threshold',
                      priority: 'low'
                    },
                    else: null
                  }
                }
              ],
              cond: { $ne: ['$$this', null] }
            }
          }
        }
      },

      // Stage 11: Final projection with structured output
      {
        $project: {
          _id: 0,

          // Product identification
          product: {
            id: '$_id.productId',
            name: '$_id.productName',
            category: '$_id.category',
            categoryGroup: '$_id.categoryGroup',
            brand: '$_id.brand',
            pricePoint: '$_id.pricePoint'
          },

          // Time context
          timeContext: {
            hourBucket: '$_id.hourBucket',
            latestOrder: '$latestOrderTime',
            earliestOrder: '$earliestOrderTime',
            avgOrderRecencyHours: '$avgOrderRecency'
          },

          // Performance metrics
          performance: {
            totalQuantitySold: '$totalQuantitySold',
            totalRevenue: { $round: ['$totalRevenue', 2] },
            totalMargin: { $round: ['$totalMargin', 2] },
            marginPercentage: { $round: ['$marginPercentage', 1] },
            uniqueOrders: '$uniqueOrderCount',
            uniqueCustomers: '$uniqueCustomerCount',
            avgOrderQuantity: { $round: ['$avgOrderQuantity', 2] },
            salesVelocity: { $round: ['$salesVelocity', 3] },
            customerRetention: { $round: ['$customerRetention', 3] }
          },

          // Pricing insights
          pricing: {
            avgSellingPrice: { $round: ['$avgSellingPrice', 2] },
            maxSellingPrice: '$maxSellingPrice',
            minSellingPrice: '$minSellingPrice',
            priceVariation: { $subtract: ['$maxSellingPrice', '$minSellingPrice'] }
          },

          // Inventory status
          inventory: {
            currentStock: '$currentStockLevel',
            reorderLevel: '$reorderThreshold',
            stockStatus: '$stockStatus',
            stockTurnover: {
              $cond: {
                if: { $gt: ['$currentStockLevel', 0] },
                then: { $divide: ['$totalQuantitySold', '$currentStockLevel'] },
                else: null
              }
            }
          },

          // Market position
          marketPosition: {
            revenueRank: '$revenueRank',
            revenuePercentile: { $round: ['$revenuePercentile', 3] },
            categoryRank: '$categoryRank',
            performanceClass: '$performanceClassification'
          },

          // Geographic and market reach
          marketReach: {
            regionalReach: '$regionalReach',
            regions: '$regions',
            avgOrderValue: { $round: ['$avgOrderValue', 2] },
            shippingMethods: '$shippingMethodsUsed'
          },

          // Actionable insights
          insights: {
            recommendations: '$recommendations',
            alerts: '$alerts',

            // Key insights derived from data
            keyInsights: {
              $filter: {
                input: [
                  {
                    $cond: {
                      if: { $gt: ['$uniqueCustomerCount', 10] },
                      then: 'High customer engagement - good repeat purchase potential',
                      else: null
                    }
                  },
                  {
                    $cond: {
                      if: { $gt: ['$regionalReach', 5] },
                      then: 'Strong geographic distribution - consider expanding marketing',
                      else: null
                    }
                  },
                  {
                    $cond: {
                      if: { $gt: ['$salesVelocity', 1] },
                      then: 'Fast-moving product - ensure adequate inventory levels',
                      else: null
                    }
                  }
                ],
                cond: { $ne: ['$$this', null] }
              }
            }
          }
        }
      },

      // Stage 12: Sort by performance and significance
      {
        $sort: {
          'performance.totalRevenue': -1,
          'performance.uniqueCustomers': -1,
          'inventory.stockTurnover': -1
        }
      },

      // Stage 13: Limit to top performers for real-time display
      {
        $limit: 100
      }
    ];

    try {
      const results = await this.collections.orders.aggregate(pipeline, {
        ...this.aggregationOptions,
        maxTimeMS: 30000 // Shorter timeout for real-time queries
      }).toArray();

      // Store results for real-time dashboard
      await this.collections.realTimeMetrics.replaceOne(
        { type: 'product_performance' },
        {
          type: 'product_performance',
          timestamp: new Date(),
          refreshInterval: refreshInterval,
          dataCount: results.length,
          data: results
        },
        { upsert: true }
      );

      console.log(`Real-time product analytics completed: ${results.length} products analyzed`);

      return {
        success: true,
        timestamp: new Date(),
        productCount: results.length,
        analytics: results,
        summary: {
          totalRevenue: results.reduce((sum, product) => sum + product.performance.totalRevenue, 0),
          totalQuantitySold: results.reduce((sum, product) => sum + product.performance.totalQuantitySold, 0),
          // Guard against an empty result set, which would otherwise yield NaN
          avgMarginPercentage: results.length > 0
            ? results.reduce((sum, product) => sum + product.performance.marginPercentage, 0) / results.length
            : 0,
          outOfStockProducts: results.filter(product => product.inventory.stockStatus === 'out_of_stock').length,
          starPerformers: results.filter(product => product.marketPosition.performanceClass === 'star_performer').length,
          criticalAlerts: results.reduce((sum, product) => 
            sum + product.insights.alerts.filter(alert => alert.priority === 'high').length, 0
          )
        }
      };

    } catch (error) {
      console.error('Error performing real-time product analytics:', error);
      return {
        success: false,
        error: error.message,
        timestamp: new Date()
      };
    }
  }

  async performAdvancedCohortAnalysis(cohortType = 'monthly', lookbackPeriods = 12) {
    console.log(`Performing ${cohortType} cohort analysis for ${lookbackPeriods} periods...`);

    const startTime = Date.now();

    // Calculate cohort periods based on type
    const cohortConfig = {
      'weekly': { unit: 'week', periodMs: 7 * 24 * 60 * 60 * 1000 },
      'monthly': { unit: 'month', periodMs: 30 * 24 * 60 * 60 * 1000 },
      'quarterly': { unit: 'quarter', periodMs: 90 * 24 * 60 * 60 * 1000 }
    };

    const config = cohortConfig[cohortType] || cohortConfig['monthly'];
    const startDate = new Date(Date.now() - lookbackPeriods * config.periodMs);

    const pipeline = [
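      // Stage 1: Restrict to qualifying orders inside the lookback window.
      // Cohorts are derived from each customer's first order within this window,
      // so customers who first ordered earlier land in a later cohort than their true one.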
      {
        $match: {
          orderDate: { $gte: startDate },
          status: { $in: ['completed', 'shipped', 'delivered'] }
        }
      },

      // Stage 2: Get customer first order dates
      {
        $group: {
          _id: '$customerId',
          firstOrderDate: { $min: '$orderDate' },
          allOrders: {
            $push: {
              orderId: '$_id',
              orderDate: '$orderDate',
              total: '$financial.total',
              items: '$items'
            }
          },
          totalOrders: { $sum: 1 },
          totalSpent: { $sum: '$financial.total' },
          lastOrderDate: { $max: '$orderDate' }
        }
      },

      // Stage 3: Calculate cohort membership and period analysis
      {
        $addFields: {
          // Determine which cohort this customer belongs to
          cohortPeriod: {
            $dateTrunc: {
              date: '$firstOrderDate',
              unit: config.unit
            }
          },

          // Calculate customer lifetime span
          lifetimeSpanDays: {
            $divide: [
              { $subtract: ['$lastOrderDate', '$firstOrderDate'] },
              86400000
            ]
          },

          // Analyze orders by period
          ordersByPeriod: {
            $map: {
              input: '$allOrders',
              as: 'order',
              in: {
                $mergeObjects: [
                  '$$order',
                  {
                    orderPeriod: {
                      $dateTrunc: {
                        date: '$$order.orderDate',
                        unit: config.unit
                      }
                    },
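                    // Whole periods between the first order and this order.
                    // $dateDiff counts calendar boundaries crossed, so months and
                    // quarters of unequal length still yield exact integers
                    // (dividing by a fixed config.periodMs would put Feb 1 and Mar 1 in period 0).
                    periodsFromFirstOrder: {
                      $dateDiff: {
                        startDate: '$firstOrderDate',
                        endDate: '$$order.orderDate',
                        unit: config.unit
                      }
                    }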
                  }
                ]
              }
            }
          }
        }
      },

      // Stage 4: Unwind orders to analyze period-by-period behavior
      {
        $unwind: '$ordersByPeriod'
      },

      // Stage 5: Group by cohort and period for retention analysis
      {
        $group: {
          _id: {
            cohortPeriod: '$cohortPeriod',
            orderPeriod: '$ordersByPeriod.orderPeriod',
            periodsFromFirst: { 
              $floor: '$ordersByPeriod.periodsFromFirstOrder' 
            }
          },

          // Customer retention metrics
          activeCustomers: { $addToSet: '$_id' },
          totalOrders: { $sum: 1 },
          totalRevenue: { $sum: '$ordersByPeriod.total' },
          avgOrderValue: { $avg: '$ordersByPeriod.total' },

          // Customer behavior metrics
          avgLifetimeSpan: { $avg: '$lifetimeSpanDays' },
          totalCustomerLifetimeValue: { $avg: '$totalSpent' },
          avgOrdersPerCustomer: { $avg: '$totalOrders' },

          // Period-specific insights
          newCustomersInPeriod: {
            $sum: {
              $cond: [
                { $eq: ['$ordersByPeriod.periodsFromFirstOrder', 0] },
                1,
                0
              ]
            }
          },

          // Revenue distribution
          revenueDistribution: {
            $push: '$ordersByPeriod.total'
          },

          // Order frequency analysis
          orderFrequencyDistribution: {
            $push: '$totalOrders'
          }
        }
      },

      // Stage 6: Calculate cohort size (initial customers in each cohort)
      {
        $lookup: {
          from: 'orders',
          pipeline: [
            {
              $match: {
                orderDate: { $gte: startDate },
                status: { $in: ['completed', 'shipped', 'delivered'] }
              }
            },
            {
              $group: {
                _id: '$customerId',
                firstOrderDate: { $min: '$orderDate' }
              }
            },
            {
              $addFields: {
                cohortPeriod: {
                  $dateTrunc: {
                    date: '$firstOrderDate',
                    unit: config.unit
                  }
                }
              }
            },
            {
              $group: {
                _id: '$cohortPeriod',
                cohortSize: { $sum: 1 }
              }
            }
          ],
          as: 'cohortSizes'
        }
      },

      // Stage 7: Add cohort size information
      {
        $addFields: {
          cohortSize: {
            $let: {
              vars: {
                matchingCohort: {
                  $arrayElemAt: [
                    {
                      $filter: {
                        input: '$cohortSizes',
                        cond: { $eq: ['$$this._id', '$_id.cohortPeriod'] }
                      }
                    },
                    0
                  ]
                }
              },
              in: '$$matchingCohort.cohortSize'
            }
          },

          activeCustomerCount: { $size: '$activeCustomers' },

          // Calculate retention rate
          retentionRate: {
            $let: {
              vars: {
                cohortSize: {
                  $arrayElemAt: [
                    {
                      $map: {
                        input: {
                          $filter: {
                            input: '$cohortSizes',
                            cond: { $eq: ['$$this._id', '$_id.cohortPeriod'] }
                          }
                        },
                        in: '$$this.cohortSize'
                      }
                    },
                    0
                  ]
                }
              },
              in: {
                $multiply: [
                  { $divide: [{ $size: '$activeCustomers' }, '$$cohortSize'] },
                  100
                ]
              }
            }
          }
        }
      },

      // Stage 8: Calculate advanced cohort metrics
      {
        $addFields: {
          // Revenue per customer in this period
          revenuePerCustomer: {
            $divide: ['$totalRevenue', '$activeCustomerCount']
          },

          // Customer engagement score
          engagementScore: {
            $multiply: [
              { $divide: ['$totalOrders', '$activeCustomerCount'] },
              { $divide: ['$retentionRate', 100] }
            ]
          },

          // Revenue distribution analysis
          revenueMetrics: {
            median: {
              $arrayElemAt: [
                {
                  $sortArray: {
                    input: '$revenueDistribution',
                    sortBy: 1
                  }
                },
                { $floor: { $divide: [{ $size: '$revenueDistribution' }, 2] } }
              ]
            },
            total: '$totalRevenue',
            average: '$avgOrderValue',
            max: { $max: '$revenueDistribution' },
            min: { $min: '$revenueDistribution' }
          },

          // Period classification
          periodClassification: {
            $switch: {
              branches: [
                { case: { $eq: ['$_id.periodsFromFirst', 0] }, then: 'acquisition' },
                { case: { $lte: ['$_id.periodsFromFirst', 3] }, then: 'early_engagement' },
                { case: { $lte: ['$_id.periodsFromFirst', 12] }, then: 'mature_relationship' }
              ],
              default: 'long_term_loyalty'
            }
          }
        }
      },

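      // Order the rows so that $last and $push in the next stage see each cohort's
      // periods chronologically ($group gives no ordering guarantee otherwise)
      {
        $sort: { '_id.cohortPeriod': 1, '_id.periodsFromFirst': 1 }
      },

      // Stage 9: Group by cohort for final analysis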
      {
        $group: {
          _id: '$_id.cohortPeriod',
          cohortSize: { $first: '$cohortSize' },

          // Retention analysis by period
          retentionByPeriod: {
            $push: {
              period: '$_id.periodsFromFirst',
              orderPeriod: '$_id.orderPeriod',
              activeCustomers: '$activeCustomerCount',
              retentionRate: '$retentionRate',
              totalRevenue: '$totalRevenue',
              revenuePerCustomer: '$revenuePerCustomer',
              avgOrderValue: '$avgOrderValue',
              totalOrders: '$totalOrders',
              engagementScore: '$engagementScore',
              periodClassification: '$periodClassification',
              revenueMetrics: '$revenueMetrics'
            }
          },

          // Aggregate cohort metrics
          totalLifetimeRevenue: { $sum: '$totalRevenue' },
          avgLifetimeValue: { $avg: '$totalCustomerLifetimeValue' },
          peakRetentionRate: { $max: '$retentionRate' },
          finalRetentionRate: { $last: '$retentionRate' },
          avgEngagementScore: { $avg: '$engagementScore' },

          // Cohort performance classification
          cohortHealth: {
            $avg: {
              $cond: [
                { $gte: ['$retentionRate', 30] }, // 30% retention considered healthy
                1,
                0
              ]
            }
          }
        }
      },

      // Stage 10: Calculate cohort performance indicators
      {
        $addFields: {
          // Lifetime value per customer in cohort
          lifetimeValuePerCustomer: {
            $divide: ['$totalLifetimeRevenue', '$cohortSize']
          },

          // Retention curve analysis
          retentionTrend: {
            $let: {
              vars: {
                firstPeriodRetention: {
                  $arrayElemAt: [
                    {
                      $map: {
                        input: {
                          $filter: {
                            input: '$retentionByPeriod',
                            cond: { $eq: ['$$this.period', 0] }
                          }
                        },
                        in: '$$this.retentionRate'
                      }
                    },
                    0
                  ]
                },
                lastPeriodRetention: '$finalRetentionRate'
              },
              in: {
                $subtract: ['$$lastPeriodRetention', '$$firstPeriodRetention']
              }
            }
          },

          // Cohort quality classification
          cohortQuality: {
            $switch: {
              branches: [
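                {
                  // lifetimeValuePerCustomer is added in this same stage, so it is not
                  // readable here; the ratio is recomputed inline
                  case: {
                    $and: [
                      { $gte: ['$peakRetentionRate', 50] },
                      { $gte: ['$avgEngagementScore', 1.5] },
                      { $gte: [{ $divide: ['$totalLifetimeRevenue', '$cohortSize'] }, 500] }
                    ]
                  },
                  then: 'excellent'
                },
                {
                  case: {
                    $and: [
                      { $gte: ['$peakRetentionRate', 35] },
                      { $gte: ['$avgEngagementScore', 1.0] },
                      { $gte: [{ $divide: ['$totalLifetimeRevenue', '$cohortSize'] }, 250] }
                    ]
                  },
                  then: 'good'
                },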
                {
                  case: {
                    $and: [
                      { $gte: ['$peakRetentionRate', 20] },
                      { $gte: ['$avgEngagementScore', 0.5] }
                    ]
                  },
                  then: 'fair'
                }
              ],
              default: 'poor'
            }
          },

          // Strategic recommendations
          recommendations: {
            $switch: {
              branches: [
                {
                  case: { $lt: ['$peakRetentionRate', 20] },
                  then: ['improve_onboarding', 'enhance_early_engagement', 'review_product_fit']
                },
                {
                  case: { $lt: ['$finalRetentionRate', 10] },
                  then: ['develop_loyalty_program', 'improve_long_term_value', 'increase_engagement']
                },
                {
                  case: { $lt: ['$avgEngagementScore', 0.5] },
                  then: ['enhance_customer_experience', 'increase_purchase_frequency', 'improve_product_recommendations']
                }
              ],
              default: ['maintain_excellence', 'scale_successful_strategies', 'explore_expansion_opportunities']
            }
          }
        }
      },

      // Stage 11: Final projection and formatting
      {
        $project: {
          _id: 0,

          // Cohort identification
          cohortPeriod: '$_id',
          cohortSize: 1,
          cohortQuality: 1,

          // Key performance metrics
          performance: {
            lifetimeValuePerCustomer: { $round: ['$lifetimeValuePerCustomer', 2] },
            avgLifetimeValue: { $round: ['$avgLifetimeValue', 2] },
            totalLifetimeRevenue: { $round: ['$totalLifetimeRevenue', 2] },
            peakRetentionRate: { $round: ['$peakRetentionRate', 1] },
            finalRetentionRate: { $round: ['$finalRetentionRate', 1] },
            retentionTrend: { $round: ['$retentionTrend', 1] },
            avgEngagementScore: { $round: ['$avgEngagementScore', 2] },
            cohortHealth: { $round: ['$cohortHealth', 2] }
          },

          // Detailed retention analysis
          retentionAnalysis: {
            $map: {
              input: { $sortArray: { input: '$retentionByPeriod', sortBy: { period: 1 } } },
              in: {
                period: '$$this.period',
                orderPeriod: '$$this.orderPeriod',
                activeCustomers: '$$this.activeCustomers',
                retentionRate: { $round: ['$$this.retentionRate', 1] },
                revenuePerCustomer: { $round: ['$$this.revenuePerCustomer', 2] },
                avgOrderValue: { $round: ['$$this.avgOrderValue', 2] },
                totalRevenue: { $round: ['$$this.totalRevenue', 2] },
                totalOrders: '$$this.totalOrders',
                engagementScore: { $round: ['$$this.engagementScore', 2] },
                periodClassification: '$$this.periodClassification',
                revenueMetrics: {
                  median: { $round: ['$$this.revenueMetrics.median', 2] },
                  average: { $round: ['$$this.revenueMetrics.average', 2] },
                  max: '$$this.revenueMetrics.max',
                  min: '$$this.revenueMetrics.min'
                }
              }
            }
          },

          // Strategic insights
          insights: {
            recommendations: '$recommendations',
            keyInsights: {
              $filter: {
                input: [
                  {
                    $cond: {
                      if: { $gt: ['$peakRetentionRate', 40] },
                      then: 'High-quality cohort with strong initial engagement',
                      else: null
                    }
                  },
                  {
                    $cond: {
                      if: { $gt: ['$retentionTrend', 0] },
                      then: 'Retention improving over time - successful loyalty building',
                      else: null
                    }
                  },
                  {
                    $cond: {
                      if: { $gt: ['$avgEngagementScore', 2] },
                      then: 'Highly engaged cohort with frequent repeat purchases',
                      else: null
                    }
                  },
                  {
                    $cond: {
                      if: { $gt: ['$lifetimeValuePerCustomer', 1000] },
                      then: 'High-value cohort - focus on retention and expansion',
                      else: null
                    }
                  }
                ],
                cond: { $ne: ['$$this', null] }
              }
            }
          }
        }
      },

      // Stage 12: Sort by cohort period (most recent first)
      {
        $sort: { cohortPeriod: -1 }
      }
    ];

    try {
      const results = await this.collections.orders.aggregate(pipeline, this.aggregationOptions).toArray();

      const processingTime = Date.now() - startTime;
      console.log(`Cohort analysis completed in ${processingTime}ms, analyzed ${results.length} cohorts`);

      // Calculate cross-cohort insights; averages are null when no cohorts
      // were returned, instead of NaN from dividing by zero
      const average = (selector) => results.length === 0
        ? null
        : results.reduce((sum, cohort) => sum + selector(cohort), 0) / results.length;

      const crossCohortInsights = {
        totalCohorts: results.length,
        avgCohortSize: average(cohort => cohort.cohortSize),
        avgLifetimeValue: average(cohort => cohort.performance.lifetimeValuePerCustomer),
        bestPerformingCohort: results.reduce((best, current) => 
          current.performance.lifetimeValuePerCustomer > best.performance.lifetimeValuePerCustomer ? current : best, results[0]
        ) ?? null,
        retentionTrendAvg: average(cohort => cohort.performance.retentionTrend),
        excellentCohorts: results.filter(cohort => cohort.cohortQuality === 'excellent').length,
        improvingCohorts: results.filter(cohort => cohort.performance.retentionTrend > 0).length
      };

      return {
        success: true,
        processingTimeMs: processingTime,
        cohortType: cohortType,
        lookbackPeriods: lookbackPeriods,
        analysisDate: new Date(),
        cohortCount: results.length,
        cohorts: results,
        crossCohortInsights: crossCohortInsights,
        metadata: {
          aggregationComplexity: 'very_high',
          stagesCount: pipeline.length,
          analyticsFeatures: [
            'customer_lifetime_value',
            'retention_analysis',
            'cohort_segmentation',
            'behavioral_analysis',
            'revenue_attribution',
            'trend_analysis',
            'performance_classification',
            'strategic_recommendations'
          ]
        }
      };

    } catch (error) {
      console.error('Error performing cohort analysis:', error);
      return {
        success: false,
        error: error.message,
        processingTimeMs: Date.now() - startTime
      };
    }
  }
}

// Benefits of MongoDB Aggregation Framework:
// - Single-pass processing for complex multi-stage analytics
// - Native document transformation without expensive JOINs
// - Automatic query optimization and index utilization
// - Horizontal scaling across sharded clusters
// - Real-time processing capabilities with streaming aggregation
// - Rich expression language for complex calculations
// - Built-in statistical and analytical functions
// - Memory-efficient processing with spill-to-disk support
// - Integration with MongoDB's native features (GeoSpatial, Text Search, etc.)
// - SQL-compatible operations through QueryLeaf integration

module.exports = {
  AdvancedAnalyticsProcessor
};

Understanding MongoDB Aggregation Framework Architecture

Advanced Pipeline Optimization and Performance Patterns

Implement sophisticated aggregation strategies for enterprise MongoDB deployments:

// Production-optimized MongoDB Aggregation with advanced performance tuning
class EnterpriseAggregationOptimizer {
  constructor(db, optimizationConfig) {
    this.db = db;
    // Defaults come first so that values passed in optimizationConfig override them
    this.config = {
      enableQueryPlanCache: true,
      enableParallelProcessing: true,
      enableIncrementalProcessing: true,
      maxMemoryUsage: '2GB',
      enableIndexHints: true,
      enableResultCaching: true,
      ...optimizationConfig
    };

    this.queryPlanCache = new Map();
    this.resultCache = new Map();
    this.performanceMetrics = new Map();
  }

  async optimizeAggregationPipeline(pipeline, collectionName, options = {}) {
    console.log(`Optimizing aggregation pipeline for ${collectionName}...`);

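    // Each strategy takes (pipeline, collectionName, options) and returns a
    // pipeline. The strategy methods and calculatePerformanceGain are not
    // shown in this excerpt and must be defined on the class.
    const optimizationStrategies = [
      this.moveMatchToBeginning,
      this.optimizeIndexUsage,
      this.enableEarlyFiltering,
      this.minimizeDataMovement,
      this.optimizeGroupingOperations,
      this.enableParallelExecution
    ];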

    let optimizedPipeline = [...pipeline];

    for (const strategy of optimizationStrategies) {
      optimizedPipeline = await strategy.call(this, optimizedPipeline, collectionName, options);
    }

    return {
      originalStages: pipeline.length,
      optimizedStages: optimizedPipeline.length,
      optimizedPipeline: optimizedPipeline,
      estimatedPerformanceGain: this.calculatePerformanceGain(pipeline, optimizedPipeline)
    };
  }

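  async enableRealTimeAggregation(pipeline, collectionName, refreshInterval = 5000) {
    console.log(`Setting up real-time aggregation for ${collectionName}...`);

    // Capture the collection here: inside realTimeProcessor, `this` refers to
    // the processor object, not to the optimizer instance
    const collection = this.db.collection(collectionName);

    // Watch for data changes with Change Streams (requires a replica set or sharded cluster)
    const changeStream = collection.watch([
      {
        $match: {
          operationType: { $in: ['insert', 'update', 'replace', 'delete'] }
        }
      }
    ]);

    const realTimeProcessor = {
      pipeline: pipeline,
      lastResults: null,
      lastRunAt: 0,
      isProcessing: false,
      hasPendingChanges: false,
      subscribers: new Set(),

      // Register a callback for fresh results; returns an unsubscribe function
      subscribe(callback) {
        this.subscribers.add(callback);
        return () => this.subscribers.delete(callback);
      },

      emitResults(results) {
        for (const callback of this.subscribers) callback(results);
      },

      async processChanges() {
        // Remember changes that arrive mid-run instead of dropping them
        if (this.isProcessing) {
          this.hasPendingChanges = true;
          return;
        }

        this.isProcessing = true;
        try {
          // Re-run the aggregation at most once per refreshInterval
          const waitMs = this.lastRunAt + refreshInterval - Date.now();
          if (waitMs > 0) await new Promise(resolve => setTimeout(resolve, waitMs));

          this.hasPendingChanges = false;
          this.lastRunAt = Date.now();
          const results = await collection
            .aggregate(pipeline, { allowDiskUse: true })
            .toArray();

          this.lastResults = results;

          // Emit real-time results to subscribers
          this.emitResults(results);

        } catch (error) {
          console.error('Real-time aggregation error:', error);
        } finally {
          this.isProcessing = false;
          if (this.hasPendingChanges) this.processChanges();
        }
      },

      // Stop watching and release the change stream cursor
      async close() {
        await changeStream.close();
      }
    };

    // Process changes as they occur
    changeStream.on('change', () => {
      realTimeProcessor.processChanges();
    });
    changeStream.on('error', (error) => {
      console.error('Change stream error:', error);
    });

    return realTimeProcessor;
  }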

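  async implementIncrementalAggregation(pipeline, collectionName, incrementField = 'updatedAt') {
    console.log(`Setting up incremental aggregation for ${collectionName}...`);

    // getLastProcessedTime and updateLastProcessedTime are assumed to persist a
    // per-collection watermark; they are not shown in this excerpt
    const lastProcessedTime = await this.getLastProcessedTime(collectionName);

    // Fix the upper bound before querying, so documents written while the
    // aggregation runs are picked up by the next run instead of being skipped
    const newProcessedTime = new Date();

    const incrementalPipeline = [
      // Only process documents changed in (lastProcessedTime, newProcessedTime]
      {
        $match: {
          [incrementField]: { $gt: lastProcessedTime, $lte: newProcessedTime }
        }
      },
      ...pipeline
    ];

    const results = await this.db.collection(collectionName)
      .aggregate(incrementalPipeline, { allowDiskUse: true })
      .toArray();

    // Advance the watermark only after the aggregation has succeeded
    await this.updateLastProcessedTime(collectionName, newProcessedTime);

    return {
      incrementalResults: results,
      lastProcessedTime: lastProcessedTime,
      newProcessedTime: newProcessedTime,
      // Number of result documents, not the number of input documents scanned
      documentsProcessed: results.length
    };
  }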
}
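
The incremental method above returns its results to the caller. One way to persist them is to end the pipeline with a `$merge` stage, so each run folds new data into a summary collection. The following is a minimal sketch under stated assumptions: the source collection is append-only, and the function name, the `createdAt` field, the `financial.total` field, and the `daily_order_totals` collection are all hypothetical. It relies on `$merge` (MongoDB 4.2+) and `$dateTrunc` (MongoDB 5.0+).

```javascript
// Sketch: build an incremental pipeline that folds new orders into a
// per-day summary collection. Collection and field names are hypothetical.
function buildIncrementalDailyTotals(since, until) {
  return [
    // Half-open window (since, until] so consecutive runs never overlap
    { $match: { createdAt: { $gt: since, $lte: until } } },
    {
      $group: {
        _id: { $dateTrunc: { date: '$createdAt', unit: 'day' } },
        revenue: { $sum: '$financial.total' },
        orders: { $sum: 1 }
      }
    },
    {
      $merge: {
        into: 'daily_order_totals',
        on: '_id',
        // Add the new partial totals to the stored ones; $$new is the
        // incoming document produced by the $group stage
        whenMatched: [
          {
            $set: {
              revenue: { $add: ['$revenue', '$$new.revenue'] },
              orders: { $add: ['$orders', '$$new.orders'] }
            }
          }
        ],
        whenNotMatched: 'insert'
      }
    }
  ];
}
```

A pipeline ending in `$merge` returns no documents, so calling `toArray()` on the cursor only drives execution. The additive merge is only correct for append-only data; if documents can be updated in place, the affected days would need to be recomputed in full instead.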

SQL-Style Aggregation Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Aggregation Framework operations:

-- QueryLeaf aggregation operations with SQL-familiar syntax

-- Complex customer analytics with CTEs and window functions
WITH customer_segments AS (
  SELECT 
    customer_id,
    customer_name,
    registration_date,

    -- Customer segmentation using CASE expressions
    CASE 
      WHEN registration_date >= CURRENT_TIMESTAMP - INTERVAL '90 days' THEN 'new_customer'
      WHEN last_order_date >= CURRENT_TIMESTAMP - INTERVAL '30 days' THEN 'active_customer'
      WHEN last_order_date >= CURRENT_TIMESTAMP - INTERVAL '180 days' THEN 'returning_customer'
      ELSE 'dormant_customer'
    END as customer_segment,

    -- Geographic classification using a CASE expression
    CASE 
      WHEN JSON_EXTRACT(address, '$.country') = 'US' THEN 'domestic'
      WHEN JSON_EXTRACT(address, '$.country') IN ('CA', 'MX') THEN 'north_america'
      WHEN JSON_EXTRACT(address, '$.country') IN ('GB', 'DE', 'FR', 'IT', 'ES') THEN 'europe'
      ELSE 'international'
    END as geographic_segment,

    -- Customer value classification
    total_spent,
    total_orders,
    average_order_value,
    loyalty_tier

  FROM customers
  WHERE is_active = true
),

order_analytics AS (
  SELECT 
    o._id as order_id,
    o.customer_id,
    o.order_date,

    -- Time-based dimensions using date functions
    DATE_TRUNC('month', o.order_date) as order_month,
    DATE_TRUNC('quarter', o.order_date) as order_quarter,
    EXTRACT(year FROM o.order_date) as order_year,
    EXTRACT(dow FROM o.order_date) as day_of_week,
    EXTRACT(hour FROM o.order_date) as hour_of_day,

    -- Seasonal analysis
    CASE 
      WHEN EXTRACT(month FROM o.order_date) IN (12, 1, 2) THEN 'winter'
      WHEN EXTRACT(month FROM o.order_date) IN (3, 4, 5) THEN 'spring'
      WHEN EXTRACT(month FROM o.order_date) IN (6, 7, 8) THEN 'summer'
      ELSE 'fall'
    END as season,

    -- Financial metrics
    JSON_EXTRACT(financial, '$.total') as order_total,
    JSON_EXTRACT(financial, '$.discount') as discount_amount,
    JSON_EXTRACT(financial, '$.tax') as tax_amount,
    JSON_EXTRACT(financial, '$.shipping') as shipping_amount,

    -- Order classification
    CASE 
      WHEN JSON_EXTRACT(financial, '$.total') >= 1000 THEN 'high_value'
      WHEN JSON_EXTRACT(financial, '$.total') >= 500 THEN 'medium_value'
      WHEN JSON_EXTRACT(financial, '$.total') >= 100 THEN 'low_value'
      ELSE 'micro_transaction'
    END as order_value_segment,

    -- Item analysis using JSON functions
    o.items,  -- exposed so that product_performance can unnest the array
    JSON_ARRAY_LENGTH(items) as item_count,

    -- Payment and shipping insights
    JSON_EXTRACT(payment, '$.method') as payment_method,
    JSON_EXTRACT(shipping, '$.method') as shipping_method,
    JSON_EXTRACT(shipping, '$.region') as shipping_region

  FROM orders o
  WHERE o.order_date >= CURRENT_TIMESTAMP - INTERVAL '12 months'
    AND o.status IN ('completed', 'shipped', 'delivered')
),

product_performance AS (
  SELECT 
    oa.order_id,

    -- Unnest items array for item-level analysis
    JSON_EXTRACT(item, '$.product_id') as product_id,
    JSON_EXTRACT(item, '$.quantity') as quantity,
    JSON_EXTRACT(item, '$.unit_price') as unit_price,
    JSON_EXTRACT(item, '$.total_price') as item_total,

    -- Product details from JOIN
    p.product_name,
    p.category,
    p.brand,
    p.cost_per_unit,

    -- Calculate margins
    (JSON_EXTRACT(item, '$.unit_price') - p.cost_per_unit) as unit_margin,
    (JSON_EXTRACT(item, '$.unit_price') - p.cost_per_unit) * JSON_EXTRACT(item, '$.quantity') as total_margin,

    -- Product categorization
    CASE 
      WHEN p.category = 'Electronics' THEN 'tech'
      WHEN p.category IN ('Clothing', 'Shoes', 'Accessories') THEN 'fashion'
      WHEN p.category IN ('Home', 'Garden', 'Furniture') THEN 'home'
      ELSE 'other'
    END as product_group

  FROM order_analytics oa
  CROSS JOIN JSON_TABLE(
    oa.items, '$[*]' COLUMNS (
      item JSON PATH '$'
    )
  ) AS items_table
  JOIN products p ON JSON_EXTRACT(items_table.item, '$.product_id') = p.product_id
  WHERE p.is_active = true
),

customer_lifetime AS (
  -- Per-customer metrics computed once at customer grain (12-month order window)
  SELECT 
    customer_id,
    MIN(order_date) as customer_first_order,
    MAX(order_date) as customer_last_order,
    COUNT(*) as customer_total_orders,
    SUM(order_total) as customer_lifetime_value,
    -- Spend per month in which the customer placed an order
    SUM(order_total) / COUNT(DISTINCT order_month) as customer_monthly_spend
  FROM order_analytics
  GROUP BY customer_id
),

comprehensive_analytics AS (
  SELECT 
    -- Dimensional attributes
    cs.customer_segment,
    cs.geographic_segment,
    oa.order_month,
    oa.order_quarter,
    oa.season,
    oa.order_value_segment,
    pp.product_group,
    pp.category,
    pp.brand,

    -- Aggregated metrics per dimension combination
    COUNT(DISTINCT cs.customer_id) as unique_customers,
    COUNT(DISTINCT oa.order_id) as total_orders,
    COUNT(DISTINCT pp.product_id) as unique_products,

    -- Revenue metrics (item-level revenue, because order_total repeats
    -- once per item after the join to product_performance)
    SUM(pp.item_total) as total_revenue,
    AVG(oa.order_total) as avg_order_value,
    SUM(pp.total_margin) as total_margin,

    -- Customer metrics taken from the customer-grain CTE
    AVG(cl.customer_monthly_spend) as avg_customer_monthly_spend,
    AVG(EXTRACT(days FROM (cl.customer_last_order - cl.customer_first_order))) as avg_customer_lifespan_days,
    AVG(cl.customer_total_orders) as avg_orders_per_customer,
    AVG(cl.customer_lifetime_value) as avg_customer_lifetime_value,

    -- Product performance with rankings (window function over the grouped rows)
    RANK() OVER (
      PARTITION BY oa.order_month
      ORDER BY SUM(pp.total_margin) DESC
    ) as product_margin_rank,

    -- Number of grouped rows per geographic segment and season
    COUNT(*) OVER (
      PARTITION BY cs.geographic_segment, oa.season
    ) as segment_seasonal_orders,

    -- Advanced statistical functions
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY oa.order_total) as median_order_value,
    STDDEV_POP(oa.order_total) as order_value_stddev

  FROM customer_segments cs
  JOIN order_analytics oa ON cs.customer_id = oa.customer_id
  JOIN product_performance pp ON oa.order_id = pp.order_id
  JOIN customer_lifetime cl ON cs.customer_id = cl.customer_id
  GROUP BY 
    cs.customer_segment,
    cs.geographic_segment,
    oa.order_month,
    oa.order_quarter,
    oa.season,
    oa.order_value_segment,
    pp.product_group,
    pp.category,
    pp.brand
),

final_analytics AS (
  SELECT 
    customer_segment,
    geographic_segment,
    order_quarter,
    season,
    product_group,
    category,
    brand,

    -- Core metrics
    unique_customers,
    total_orders,
    unique_products,
    ROUND(total_revenue, 2) as total_revenue,
    ROUND(avg_order_value, 2) as avg_order_value,
    ROUND(total_margin, 2) as total_margin,
    ROUND((total_margin / total_revenue) * 100, 1) as margin_percentage,

    -- Customer insights
    ROUND(avg_customer_monthly_spend, 2) as avg_customer_monthly_spend,
    ROUND(median_order_value, 2) as median_order_value,
    ROUND(order_value_stddev, 2) as order_value_stddev,

    -- Performance indicators
    CASE 
      WHEN product_margin_rank <= 10 THEN 'top_performer'
      WHEN product_margin_rank <= 50 THEN 'good_performer'
      ELSE 'average_performer'
    END as performance_tier,

    -- Customer behavior analysis (already aggregated in comprehensive_analytics)
    avg_customer_lifespan_days,
    avg_orders_per_customer,
    ROUND(avg_customer_lifetime_value, 2) as avg_customer_lifetime_value,

    -- Growth analysis using LAG function
    LAG(total_revenue) OVER (
      PARTITION BY customer_segment, geographic_segment, product_group
      ORDER BY order_quarter
    ) as prev_quarter_revenue,

    -- Calculate growth rate
    ROUND(
      ((total_revenue - LAG(total_revenue) OVER (
        PARTITION BY customer_segment, geographic_segment, product_group
        ORDER BY order_quarter
      )) / NULLIF(LAG(total_revenue) OVER (
        PARTITION BY customer_segment, geographic_segment, product_group
        ORDER BY order_quarter
      ), 0)) * 100,
      1
    ) as revenue_growth_pct,

    -- Market share analysis
    ROUND(
      (total_revenue / SUM(total_revenue) OVER (PARTITION BY order_quarter)) * 100,
      2
    ) as market_share_pct,

    -- Seasonal performance indexing
    ROUND(
      total_revenue / AVG(total_revenue) OVER (
        PARTITION BY customer_segment, geographic_segment, product_group
      ) * 100,
      1
    ) as seasonal_index

  FROM comprehensive_analytics
)

SELECT 
  -- Dimensional attributes
  customer_segment,
  geographic_segment,
  order_quarter,
  season,
  product_group,
  category,
  brand,

  -- Core metrics
  unique_customers,
  total_orders,
  unique_products,
  total_revenue,
  avg_order_value,
  total_margin,
  margin_percentage,

  -- Customer insights
  avg_customer_monthly_spend,
  median_order_value,
  order_value_stddev,
  avg_customer_lifespan_days,
  avg_orders_per_customer,
  avg_customer_lifetime_value,

  -- Performance classification
  performance_tier,

  -- Growth metrics
  prev_quarter_revenue,
  revenue_growth_pct,
  market_share_pct,
  seasonal_index,

  -- Business insights and recommendations
  CASE 
    WHEN revenue_growth_pct > 25 THEN 'high_growth_opportunity'
    WHEN revenue_growth_pct > 10 THEN 'steady_growth'
    WHEN revenue_growth_pct > 0 THEN 'slow_growth'
    WHEN revenue_growth_pct IS NULL THEN 'new_segment'
    ELSE 'declining_segment'
  END as growth_classification,

  CASE 
    WHEN margin_percentage > 30 AND revenue_growth_pct > 15 THEN 'invest_and_expand'
    WHEN margin_percentage > 30 AND revenue_growth_pct < 0 THEN 'optimize_and_retain'  
    WHEN margin_percentage < 15 AND revenue_growth_pct > 15 THEN 'improve_margins'
    WHEN margin_percentage < 15 AND revenue_growth_pct < 0 THEN 'consider_exit'
    ELSE 'monitor_and_optimize'
  END as strategic_recommendation,

  -- Key performance indicators
  CASE 
    WHEN avg_customer_lifetime_value > 1000 AND avg_orders_per_customer > 5 THEN 'high_value_loyal'
    WHEN avg_customer_lifetime_value > 500 THEN 'high_value'
    WHEN avg_orders_per_customer > 3 THEN 'loyal_customers'
    ELSE 'acquisition_focus'
  END as customer_strategy

FROM final_analytics
WHERE total_revenue > 1000  -- Filter for statistical significance
ORDER BY 
  total_revenue DESC,
  revenue_growth_pct DESC NULLS LAST,
  margin_percentage DESC
LIMIT 500;

-- Real-time product performance dashboard
CREATE VIEW real_time_product_performance AS
WITH hourly_product_metrics AS (
  SELECT 
    JSON_EXTRACT(item, '$.product_id') as product_id,
    DATE_TRUNC('hour', order_date) as hour_bucket,

    -- Sales metrics
    SUM(JSON_EXTRACT(item, '$.quantity')) as total_quantity_sold,
    SUM(JSON_EXTRACT(item, '$.total_price')) as total_revenue,
    COUNT(DISTINCT o._id) as unique_orders,
    COUNT(DISTINCT customer_id) as unique_customers,

    -- Pricing analysis
    AVG(JSON_EXTRACT(item, '$.unit_price')) as avg_selling_price,
    MAX(JSON_EXTRACT(item, '$.unit_price')) as max_selling_price,
    MIN(JSON_EXTRACT(item, '$.unit_price')) as min_selling_price,

    -- Performance indicators
    AVG(JSON_EXTRACT(financial, '$.total')) as avg_order_value,
    SUM(JSON_EXTRACT(item, '$.quantity')) / COUNT(DISTINCT o._id) as avg_quantity_per_order

  FROM orders o
  CROSS JOIN JSON_TABLE(
    o.items, '$[*]' COLUMNS (
      item JSON PATH '$'
    )
  ) AS items_unnested
  WHERE o.order_date >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND o.status IN ('completed', 'processing', 'shipped')
  GROUP BY 
    JSON_EXTRACT(item, '$.product_id'),
    DATE_TRUNC('hour', order_date)
),

product_rankings AS (
  SELECT 
    hpm.*,
    p.product_name,
    p.category,
    p.brand,
    p.cost_per_unit,

    -- Calculate margins
    (hpm.avg_selling_price - p.cost_per_unit) as unit_margin,
    ((hpm.avg_selling_price - p.cost_per_unit) * hpm.total_quantity_sold) as total_margin,

    -- Performance rankings using window functions
    RANK() OVER (ORDER BY total_revenue DESC) as revenue_rank,
    RANK() OVER (ORDER BY total_quantity_sold DESC) as quantity_rank,
    RANK() OVER (PARTITION BY p.category ORDER BY total_revenue DESC) as category_rank,

    -- Percentile rankings
    PERCENT_RANK() OVER (ORDER BY total_revenue) as revenue_percentile,
    PERCENT_RANK() OVER (ORDER BY total_quantity_sold) as quantity_percentile,

    -- Growth analysis (comparing to previous hour)
    LAG(total_revenue) OVER (
      PARTITION BY product_id 
      ORDER BY hour_bucket
    ) as prev_hour_revenue,

    LAG(total_quantity_sold) OVER (
      PARTITION BY product_id 
      ORDER BY hour_bucket
    ) as prev_hour_quantity

  FROM hourly_product_metrics hpm
  JOIN products p ON hpm.product_id = p.product_id
  WHERE p.is_active = true
),

product_scores AS (
  -- Derived metrics get their own CTE so later expressions can reference them by name
  SELECT 
    pr.*,
    ROUND((pr.total_margin / pr.total_revenue) * 100, 1) as margin_percentage,

    -- Growth metrics
    ROUND(
      CASE 
        WHEN pr.prev_hour_revenue > 0 THEN
          ((pr.total_revenue - pr.prev_hour_revenue) / pr.prev_hour_revenue) * 100
        ELSE NULL
      END,
      1
    ) as hourly_revenue_growth_pct,

    ROUND(
      CASE 
        WHEN pr.prev_hour_quantity > 0 THEN
          ((pr.total_quantity_sold - pr.prev_hour_quantity) / pr.prev_hour_quantity::DECIMAL) * 100
        ELSE NULL
      END,
      1
    ) as hourly_quantity_growth_pct

  FROM product_rankings pr
),

product_status AS (
  SELECT 
    ps.*,

    -- Performance classification
    CASE 
      WHEN revenue_rank <= 10 AND margin_percentage > 20 THEN 'star_performer'
      WHEN revenue_rank <= 50 AND margin_percentage > 15 THEN 'strong_performer'
      WHEN revenue_rank <= 100 THEN 'solid_performer'
      ELSE 'monitor_performance'
    END as performance_classification,

    -- Alert indicators
    CASE 
      WHEN hourly_revenue_growth_pct > 50 THEN 'trending_up'
      WHEN hourly_revenue_growth_pct < -30 THEN 'trending_down'
      WHEN revenue_rank <= 20 AND margin_percentage < 10 THEN 'margin_concern'
      ELSE 'normal'
    END as alert_status

  FROM product_scores ps
)

SELECT 
  product_id,
  product_name,
  category,
  brand,
  hour_bucket,

  -- Sales performance
  total_quantity_sold,
  ROUND(total_revenue, 2) as total_revenue,
  unique_orders,
  unique_customers,

  -- Pricing metrics
  ROUND(avg_selling_price, 2) as avg_selling_price,
  max_selling_price,
  min_selling_price,
  ROUND(unit_margin, 2) as unit_margin,
  ROUND(total_margin, 2) as total_margin,
  margin_percentage,

  -- Performance rankings
  revenue_rank,
  quantity_rank,
  category_rank,
  ROUND(revenue_percentile * 100, 1) as revenue_percentile_score,

  -- Growth metrics
  hourly_revenue_growth_pct,
  hourly_quantity_growth_pct,

  -- Customer metrics
  ROUND(avg_order_value, 2) as avg_order_value,
  ROUND(avg_quantity_per_order, 2) as avg_quantity_per_order,
  ROUND(total_revenue / unique_customers, 2) as revenue_per_customer,

  -- Performance classification and alert indicators
  performance_classification,
  alert_status,

  -- Recommendations
  CASE 
    WHEN performance_classification = 'star_performer' THEN 'increase_inventory_and_marketing'
    WHEN alert_status = 'trending_down' THEN 'investigate_declining_performance'
    WHEN margin_percentage < 10 THEN 'review_pricing_strategy'
    WHEN revenue_rank > 100 THEN 'consider_promotion_or_discontinuation'
    ELSE 'maintain_current_strategy'
  END as recommendation

FROM product_status
WHERE hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY 
  hour_bucket DESC,
  revenue_rank ASC,
  margin_percentage DESC;

-- Advanced cohort analysis with SQL window functions
WITH customer_cohorts AS (
  SELECT 
    customer_id,
    DATE_TRUNC('month', MIN(order_date)) as cohort_month,
    MIN(order_date) as first_order_date,
    COUNT(*) as total_orders,
    SUM(JSON_EXTRACT(financial, '$.total')) as total_spent

  FROM orders
  WHERE status IN ('completed', 'delivered')
    AND order_date >= CURRENT_TIMESTAMP - INTERVAL '24 months'
  GROUP BY customer_id
),

cohort_periods AS (
  SELECT 
    cc.customer_id,
    cc.cohort_month,
    cc.first_order_date,
    cc.total_orders,
    cc.total_spent,

    o.order_date,
    o._id as order_id,
    JSON_EXTRACT(o.financial, '$.total') as order_value,

    -- Calculate periods since first order
    FLOOR(
      MONTHS_BETWEEN(DATE_TRUNC('month', o.order_date), cc.cohort_month)
    ) as periods_since_first_order,

    DATE_TRUNC('month', o.order_date) as order_month

  FROM customer_cohorts cc
  JOIN orders o ON cc.customer_id = o.customer_id
  WHERE o.status IN ('completed', 'delivered')
    AND o.order_date >= cc.first_order_date
),

cohort_analysis AS (
  SELECT 
    cohort_month,
    periods_since_first_order,
    order_month,

    -- Cohort metrics
    COUNT(DISTINCT customer_id) as active_customers,
    COUNT(DISTINCT order_id) as total_orders,
    SUM(order_value) as total_revenue,
    AVG(order_value) as avg_order_value,

    -- Customer behavior
    AVG(total_orders) as avg_lifetime_orders,
    AVG(total_spent) as avg_lifetime_value,

    -- Period-specific insights (cast to DECIMAL so the ratio is not truncated
    -- by integer division)
    COUNT(DISTINCT customer_id)::DECIMAL / 
    FIRST_VALUE(COUNT(DISTINCT customer_id)) OVER (
      PARTITION BY cohort_month 
      ORDER BY periods_since_first_order
      ROWS UNBOUNDED PRECEDING
    ) as retention_rate

  FROM cohort_periods
  GROUP BY cohort_month, periods_since_first_order, order_month
),

cohort_summary AS (
  SELECT 
    cohort_month,

    -- Cohort size (customers who made first purchase in this month)
    MAX(CASE WHEN periods_since_first_order = 0 THEN active_customers END) as cohort_size,

    -- Retention rates by period
    MAX(CASE WHEN periods_since_first_order = 1 THEN retention_rate END) as month_1_retention,
    MAX(CASE WHEN periods_since_first_order = 3 THEN retention_rate END) as month_3_retention,
    MAX(CASE WHEN periods_since_first_order = 6 THEN retention_rate END) as month_6_retention,
    MAX(CASE WHEN periods_since_first_order = 12 THEN retention_rate END) as month_12_retention,

    -- Revenue metrics
    SUM(total_revenue) as cohort_total_revenue,
    AVG(avg_lifetime_value) as avg_customer_ltv,

    -- Performance indicators
    MAX(periods_since_first_order) as max_observed_periods,
    AVG(avg_order_value) as cohort_avg_order_value

  FROM cohort_analysis
  GROUP BY cohort_month
)

SELECT 
  cohort_month,
  cohort_size,

  -- Retention analysis
  ROUND(month_1_retention * 100, 1) as month_1_retention_pct,
  ROUND(month_3_retention * 100, 1) as month_3_retention_pct,
  ROUND(month_6_retention * 100, 1) as month_6_retention_pct,
  ROUND(month_12_retention * 100, 1) as month_12_retention_pct,

  -- Financial metrics
  ROUND(cohort_total_revenue, 2) as cohort_total_revenue,
  ROUND(avg_customer_ltv, 2) as avg_customer_ltv,
  ROUND(cohort_avg_order_value, 2) as avg_order_value,
  ROUND(cohort_total_revenue / cohort_size, 2) as revenue_per_customer,

  -- Cohort performance classification
  CASE 
    WHEN month_3_retention >= 0.4 AND avg_customer_ltv >= 500 THEN 'excellent'
    WHEN month_3_retention >= 0.3 AND avg_customer_ltv >= 300 THEN 'good'
    WHEN month_3_retention >= 0.2 OR avg_customer_ltv >= 200 THEN 'fair'
    ELSE 'poor'
  END as cohort_quality,

  -- Growth trend analysis
  ROUND(
    (month_6_retention - month_1_retention) * 100,
    1
  ) as retention_trend,

  -- Business insights
  CASE 
    WHEN month_1_retention < 0.2 THEN 'improve_onboarding'
    WHEN month_12_retention < 0.1 THEN 'enhance_loyalty_program'
    WHEN avg_customer_ltv < 100 THEN 'increase_customer_value'
    ELSE 'maintain_performance'
  END as recommendation,

  max_observed_periods

FROM cohort_summary
WHERE cohort_size >= 10  -- Filter for statistical significance
ORDER BY cohort_month DESC;

-- QueryLeaf provides comprehensive aggregation capabilities:
-- 1. SQL-familiar syntax for complex MongoDB aggregation pipelines
-- 2. Advanced analytics with CTEs, window functions, and statistical operations
-- 3. Real-time processing with familiar SQL patterns and aggregation functions  
-- 4. Complex customer segmentation and behavioral analysis using SQL constructs
-- 5. Product performance analytics with rankings and growth calculations
-- 6. Cohort analysis with retention rates and lifetime value calculations
-- 7. Integration with MongoDB's native aggregation optimizations
-- 8. Familiar SQL data types, functions, and expression syntax
-- 9. Advanced time-series analysis and trend detection capabilities
-- 10. Enterprise-ready analytics with performance optimization and scalability
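
The retention rate in the cohort query above is the number of customers active in a period divided by the cohort size, which is the active count in period 0. A small JavaScript sketch of that arithmetic follows; the row shape `{ period, activeCustomers }` and the function name are assumptions made for illustration, not part of QueryLeaf or the pipeline output.

```javascript
// Sketch: compute retention per period for a single cohort.
// Each row: { period: Number, activeCustomers: Number } (hypothetical shape).
function retentionByPeriod(rows) {
  // Period 0 holds the customers who placed their first order, i.e. the cohort size
  const base = rows.find((row) => row.period === 0);
  if (!base || base.activeCustomers === 0) return [];

  return rows
    .slice()
    .sort((a, b) => a.period - b.period)
    .map((row) => ({
      period: row.period,
      // Fraction of the original cohort that ordered in this period
      retentionRate: row.activeCustomers / base.activeCustomers
    }));
}

const sample = [
  { period: 0, activeCustomers: 200 },
  { period: 1, activeCustomers: 80 },
  { period: 3, activeCustomers: 50 }
];

console.log(retentionByPeriod(sample));
```

For the sample rows, the rates are 1 for period 0, 0.4 for period 1, and 0.25 for period 3. A cohort without a period 0 row yields an empty array rather than a division by zero.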

Best Practices for Aggregation Framework Implementation

Performance Optimization and Pipeline Design

Essential strategies for effective MongoDB Aggregation Framework usage:

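Early Stage Filtering: Place $match stages as early as possible so later stages process fewer documents; a $match at the start of the pipeline can also use an index
Index Utilization: Design compound indexes that cover the fields used by leading $match and $sort stages, and confirm index use with explain()
Memory Management: Each pipeline stage is limited to 100 MB of RAM; set allowDiskUse: true for large aggregations so stages such as $group and $sort can spill to disk, and monitor memory usage
Pipeline Ordering: Arrange stages to minimize data movement and intermediate result sizes, for example by filtering and grouping before a $lookup rather than after it
Expression Optimization: Prefer built-in operators over deeply nested expressions, and compute a value once in $addFields when several later stages need it
Result Set Limiting: Apply $limit as early as the logic allows; a $limit directly after $sort lets the server keep only the top N documents in memory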
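
To make early stage filtering concrete, here is a minimal sketch of the simplest safe reordering: a `$match` that directly follows a `$sort` can move in front of it, because sorting does not change which documents match. The function name is hypothetical, and MongoDB's own optimizer already applies reorderings of this kind. The sketch deliberately leaves a `$match` alone when it follows `$limit`, `$group`, `$project`, or `$addFields`, where moving it could change the results.

```javascript
// Sketch: move any $match that directly follows a $sort in front of it.
function hoistMatchBeforeSort(pipeline) {
  const result = pipeline.slice();
  let moved = true;

  // Repeat until no adjacent ($sort, $match) pair is left
  while (moved) {
    moved = false;
    for (let i = 0; i + 1 < result.length; i++) {
      if ('$sort' in result[i] && '$match' in result[i + 1]) {
        [result[i], result[i + 1]] = [result[i + 1], result[i]];
        moved = true;
      }
    }
  }
  return result;
}

const reordered = hoistMatchBeforeSort([
  { $sort: { order_date: -1 } },
  { $match: { status: 'completed' } },
  { $group: { _id: '$customer_id', orders: { $sum: 1 } } }
]);

console.log(reordered.map((stage) => Object.keys(stage)[0]));
```

The input array is copied, so the caller's pipeline is not modified, and the relative order of multiple `$match` stages is preserved.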

Enterprise Analytics Architecture

Design scalable aggregation systems for production deployments:

Distributed Processing: Leverage MongoDB's sharding to distribute aggregation workloads
Caching Strategies: Implement result caching for frequently accessed aggregations
Real-time Processing: Combine aggregation pipelines with Change Streams for live analytics
Incremental Updates: Design incremental aggregation patterns for large, frequently updated datasets
Performance Monitoring: Track aggregation performance and optimize based on usage patterns
Resource Planning: Size clusters appropriately for expected aggregation workloads and data volumes
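
One way to approach the incremental-update pattern above is a rollup pipeline that reprocesses only a recent time window and upserts the results with `$merge`. The sketch below is a minimal illustration; the `daily_sales_summary` collection and the `orderDate`, `productId`, and `total` fields are assumed names, not part of any schema shown in this article.

```javascript
// Hypothetical rollup: collection and field names are assumptions for illustration.
function buildDailyRollupPipeline(windowStart, windowEnd) {
  return [
    // Reprocess only the window that may have changed.
    { $match: { orderDate: { $gte: windowStart, $lt: windowEnd } } },

    // Aggregate per day and product.
    {
      $group: {
        _id: {
          day: { $dateTrunc: { date: '$orderDate', unit: 'day' } },
          productId: '$productId'
        },
        revenue: { $sum: '$total' },
        orderCount: { $sum: 1 }
      }
    },

    // Upsert into the summary collection; $merge must be the final stage.
    {
      $merge: {
        into: 'daily_sales_summary',
        on: '_id',
        whenMatched: 'replace',
        whenNotMatched: 'insert'
      }
    }
  ];
}

const rollupPipeline = buildDailyRollupPipeline(
  new Date('2025-11-08T00:00:00Z'),
  new Date('2025-11-09T00:00:00Z')
);
console.log(JSON.stringify(rollupPipeline, null, 2));
```

Because `whenMatched: 'replace'` overwrites an existing summary document, the window boundaries should align with whole days; a partial-day window would replace a full day's totals with partial ones.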

Conclusion

MongoDB's Aggregation Framework processes data inside the database, which avoids much of the complexity and many of the performance limitations of traditional SQL analytics approaches: pipelines run as optimized single-pass operations, transform documents natively, and can execute across distributed clusters. The rich expression language and extensive operator library enable sophisticated analytics while maintaining high performance and operational simplicity.

Key MongoDB Aggregation Framework benefits include:

Unified Processing: Single-pass analytics without expensive JOINs or multiple query rounds
Rich Expressions: Comprehensive mathematical, statistical, and analytical operations
Document-Native: Native handling of nested documents, arrays, and complex data structures
Performance Optimization: Automatic query optimization with index utilization and parallel processing
Horizontal Scaling: Distributed aggregation processing across sharded MongoDB clusters
Real-time Capabilities: Integration with Change Streams for live analytical processing

Whether you're building business intelligence platforms, real-time analytics systems, customer segmentation tools, or complex reporting solutions, MongoDB's Aggregation Framework with QueryLeaf's familiar SQL interface provides the foundation for powerful, scalable, and maintainable analytical processing.

QueryLeaf Integration: QueryLeaf seamlessly translates SQL analytics queries into optimized MongoDB aggregation pipelines while providing familiar SQL syntax for complex analytics, statistical functions, and reporting operations. Advanced aggregation patterns including cohort analysis, customer segmentation, and real-time analytics are elegantly handled through familiar SQL constructs, making sophisticated data processing both powerful and accessible to SQL-oriented analytics teams.

The combination of MongoDB's robust aggregation capabilities with SQL-style analytical operations makes it an ideal platform for applications requiring both advanced analytics functionality and familiar database interaction patterns, ensuring your analytical infrastructure can deliver insights efficiently while maintaining developer productivity and operational excellence.

November 8, 2025
30 min read

MongoDB Time Series Collections for IoT Data Management: Real-Time Analytics and High-Performance Data Processing

Modern IoT applications generate massive volumes of time-stamped sensor data. Handling millions of data points per second while supporting real-time analytics and efficient historical queries requires specialized storage and processing capabilities. Traditional database approaches struggle with the scale, write-heavy workloads, and time-based query patterns characteristic of IoT systems; they often require complex partitioning schemes, multiple storage tiers, and custom optimization strategies that increase operational complexity and development overhead.

MongoDB Time Series Collections provide purpose-built storage optimization for time-stamped data with automatic bucketing, compression, and query optimization specifically designed for IoT workloads. Unlike traditional approaches that require manual time-based partitioning and complex indexing strategies, Time Series Collections automatically organize data by time ranges, apply intelligent compression, and optimize queries for time-based access patterns while maintaining MongoDB's flexible document model and powerful aggregation capabilities.
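
As a minimal sketch of how such a collection is configured, the helper below builds the options object that would be passed to `db.createCollection()`. The helper name `buildTimeSeriesOptions` and its parameter names are hypothetical. It assumes the documented MongoDB 6.3+ rule that custom bucketing (`bucketMaxSpanSeconds` with an equal `bucketRoundingSeconds`) is an alternative to `granularity` rather than something to combine with it.

```javascript
// Hypothetical helper: builds createCollection() options for a time series collection.
function buildTimeSeriesOptions({ timeField, metaField, granularity, bucketSpanSeconds, retentionDays }) {
  if (granularity && bucketSpanSeconds) {
    throw new Error('Specify either granularity or bucketSpanSeconds, not both');
  }

  const timeseries = { timeField, metaField };
  if (bucketSpanSeconds) {
    // Custom bucketing: both parameters are set to the same value.
    timeseries.bucketMaxSpanSeconds = bucketSpanSeconds;
    timeseries.bucketRoundingSeconds = bucketSpanSeconds;
  } else {
    // 'seconds' is the server default granularity.
    timeseries.granularity = granularity || 'seconds';
  }

  const options = { timeseries };
  if (retentionDays) {
    // Documents older than this are removed automatically.
    options.expireAfterSeconds = retentionDays * 24 * 60 * 60;
  }
  return options;
}

const tsOptions = buildTimeSeriesOptions({
  timeField: 'timestamp',
  metaField: 'device',
  granularity: 'minutes',
  retentionDays: 30
});
console.log(tsOptions);
```

With a connected driver, this would be used as `await db.createCollection('sensor_readings', tsOptions)`. Fields that identify a series and rarely change (device ID, location) belong under the `metaField`, since MongoDB groups measurements into buckets by that value.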

The Traditional IoT Data Storage Challenge

Conventional approaches to storing and processing IoT time series data face significant scalability and performance limitations:

-- Traditional PostgreSQL time series approach - complex partitioning and limited scalability

-- IoT sensor data with traditional table design
CREATE TABLE sensor_readings (
    reading_id BIGSERIAL,
    device_id VARCHAR(100) NOT NULL,
    sensor_type VARCHAR(50) NOT NULL,
    location VARCHAR(200),

    -- Time series data
    timestamp TIMESTAMPTZ NOT NULL,
    value DECIMAL(15,6) NOT NULL,
    unit VARCHAR(20),
    quality_score DECIMAL(3,2) DEFAULT 1.0,

    -- Device and context metadata
    device_metadata JSONB,
    environmental_conditions JSONB,

    -- Data processing flags
    processed BOOLEAN DEFAULT FALSE,
    anomaly_detected BOOLEAN DEFAULT FALSE,
    data_source VARCHAR(100),

    -- Partitioning helper columns (computed in UTC because generated columns require immutable expressions)
    year_month INTEGER GENERATED ALWAYS AS (
        (EXTRACT(YEAR FROM (timestamp AT TIME ZONE 'UTC')) * 100
         + EXTRACT(MONTH FROM (timestamp AT TIME ZONE 'UTC')))::INTEGER
    ) STORED,
    date_partition DATE GENERATED ALWAYS AS ((timestamp AT TIME ZONE 'UTC')::DATE) STORED,

    -- On a partitioned table the primary key must include the partition key
    PRIMARY KEY (reading_id, timestamp)
) PARTITION BY RANGE (timestamp);

-- Complex time-based partitioning (manual maintenance required)
CREATE TABLE sensor_readings_2024_01 PARTITION OF sensor_readings
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE sensor_readings_2024_02 PARTITION OF sensor_readings
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

CREATE TABLE sensor_readings_2024_03 PARTITION OF sensor_readings
FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');

-- Additional partitions must be created manually each month
-- Without automation (or a DEFAULT partition), inserts for a month with no partition fail

-- Indexing strategy for time series queries (expensive maintenance)
CREATE INDEX idx_sensor_readings_device_time ON sensor_readings (device_id, timestamp DESC);
CREATE INDEX idx_sensor_readings_sensor_type_time ON sensor_readings (sensor_type, timestamp DESC);
CREATE INDEX idx_sensor_readings_location_time ON sensor_readings (location, timestamp DESC);
CREATE INDEX idx_sensor_readings_timestamp_only ON sensor_readings (timestamp DESC);
CREATE INDEX idx_sensor_readings_processed_flag ON sensor_readings (processed, timestamp DESC);

-- Additional indexes for different query patterns
CREATE INDEX idx_sensor_readings_anomaly_time ON sensor_readings (anomaly_detected, timestamp DESC) WHERE anomaly_detected = TRUE;
CREATE INDEX idx_sensor_readings_device_type_time ON sensor_readings (device_id, sensor_type, timestamp DESC);

-- Materialized view for real-time aggregations (complex maintenance)
CREATE MATERIALIZED VIEW sensor_readings_hourly_summary AS
WITH hourly_aggregations AS (
    SELECT 
        device_id,
        sensor_type,
        location,
        DATE_TRUNC('hour', timestamp) as hour_bucket,

        -- Statistical aggregations
        COUNT(*) as reading_count,
        AVG(value) as avg_value,
        MIN(value) as min_value,
        MAX(value) as max_value,
        STDDEV(value) as stddev_value,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) as median_value,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) as p95_value,

        -- Data quality metrics
        AVG(quality_score) as avg_quality,
        COUNT(*) FILTER (WHERE quality_score < 0.8) as low_quality_readings,
        COUNT(*) FILTER (WHERE anomaly_detected = true) as anomaly_count,

        -- Value change analysis
        (MAX(value) - MIN(value)) as value_range,
        -- Last reading minus first reading in the hour. Ordered aggregates are used
        -- because a window function cannot reference ungrouped columns in a GROUP BY query.
        (ARRAY_AGG(value ORDER BY timestamp DESC))[1] - 
        (ARRAY_AGG(value ORDER BY timestamp ASC))[1] as value_change_in_hour,

        -- Processing statistics
        COUNT(*) FILTER (WHERE processed = true) as processed_readings,
        (COUNT(*) FILTER (WHERE processed = true)::DECIMAL / COUNT(*)) * 100 as processing_rate_percent,

        -- Time coverage analysis
        (EXTRACT(EPOCH FROM MAX(timestamp) - MIN(timestamp)) / 3600) as time_coverage_hours,
        -- NULLIF avoids division by zero when a bucket holds a single reading
        COUNT(*)::DECIMAL / NULLIF(EXTRACT(EPOCH FROM MAX(timestamp) - MIN(timestamp)) / 60, 0) as readings_per_minute

    FROM sensor_readings
    WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    GROUP BY device_id, sensor_type, location, DATE_TRUNC('hour', timestamp)
)
SELECT 
    ha.*,

    -- Additional calculated metrics
    CASE 
        WHEN ha.reading_count < 50 THEN 'sparse'
        WHEN ha.reading_count < 200 THEN 'normal'
        WHEN ha.reading_count < 500 THEN 'dense'
        ELSE 'very_dense'
    END as data_density_category,

    CASE 
        WHEN ha.avg_quality >= 0.95 THEN 'excellent'
        WHEN ha.avg_quality >= 0.8 THEN 'good'
        WHEN ha.avg_quality >= 0.6 THEN 'fair'
        ELSE 'poor'
    END as quality_category,

    -- Anomaly rate analysis
    CASE 
        WHEN ha.anomaly_count = 0 THEN 'normal'
        WHEN (ha.anomaly_count::DECIMAL / ha.reading_count) < 0.01 THEN 'low_anomalies'
        WHEN (ha.anomaly_count::DECIMAL / ha.reading_count) < 0.05 THEN 'moderate_anomalies'
        ELSE 'high_anomalies'
    END as anomaly_level,

    -- Performance indicators
    CASE 
        WHEN ha.readings_per_minute >= 10 THEN 'high_frequency'
        WHEN ha.readings_per_minute >= 1 THEN 'medium_frequency'
        WHEN ha.readings_per_minute >= 0.1 THEN 'low_frequency'
        ELSE 'very_low_frequency'
    END as sampling_frequency_category

FROM hourly_aggregations ha;

-- The view must be refreshed periodically (expensive operation);
-- a unique index is required for REFRESH MATERIALIZED VIEW CONCURRENTLY
CREATE UNIQUE INDEX idx_sensor_hourly_unique ON sensor_readings_hourly_summary (device_id, sensor_type, location, hour_bucket);

-- Complex query for real-time analytics (resource-intensive)
WITH device_performance AS (
    SELECT 
        sr.device_id,
        sr.sensor_type,
        sr.location,
        DATE_TRUNC('minute', sr.timestamp) as minute_bucket,

        -- Real-time aggregations (expensive on large datasets)
        COUNT(*) as readings_per_minute,
        AVG(sr.value) as avg_value,
        STDDEV(sr.value) as value_stability,

        -- Change detection (requires window functions)
        LAG(AVG(sr.value)) OVER (
            PARTITION BY sr.device_id, sr.sensor_type 
            ORDER BY DATE_TRUNC('minute', sr.timestamp)
        ) as prev_minute_avg,

        -- Quality assessment
        AVG(sr.quality_score) as avg_quality,
        COUNT(*) FILTER (WHERE sr.anomaly_detected) as anomaly_count,

        -- Processing lag calculation
        AVG(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - sr.timestamp)) as avg_processing_lag_seconds

    FROM sensor_readings sr
    WHERE 
        sr.timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
        AND sr.processed = true
    GROUP BY sr.device_id, sr.sensor_type, sr.location, DATE_TRUNC('minute', sr.timestamp)
),

real_time_alerts AS (
    SELECT 
        dp.*,

        -- Alert conditions
        CASE 
            WHEN ABS(dp.avg_value - dp.prev_minute_avg) > (dp.value_stability * 3) THEN 'value_spike'
            WHEN dp.avg_quality < 0.7 THEN 'quality_degradation'
            WHEN dp.anomaly_count > 0 THEN 'anomalies_detected'
            WHEN dp.avg_processing_lag_seconds > 300 THEN 'processing_delay'
            WHEN dp.readings_per_minute < 0.5 THEN 'data_gap'
            ELSE 'normal'
        END as alert_type,

        -- Severity calculation
        CASE 
            WHEN dp.anomaly_count > 10 OR dp.avg_quality < 0.5 THEN 'critical'
            WHEN dp.anomaly_count > 5 OR dp.avg_quality < 0.7 OR dp.avg_processing_lag_seconds > 600 THEN 'high'
            WHEN dp.anomaly_count > 2 OR dp.avg_quality < 0.8 OR dp.avg_processing_lag_seconds > 300 THEN 'medium'
            ELSE 'low'
        END as alert_severity,

        -- Performance assessment
        CASE 
            WHEN dp.readings_per_minute >= 30 AND dp.avg_quality >= 0.9 THEN 'optimal'
            WHEN dp.readings_per_minute >= 10 AND dp.avg_quality >= 0.8 THEN 'good'
            WHEN dp.readings_per_minute >= 1 AND dp.avg_quality >= 0.6 THEN 'acceptable'
            ELSE 'poor'
        END as performance_status

    FROM device_performance dp
    WHERE dp.minute_bucket >= CURRENT_TIMESTAMP - INTERVAL '15 minutes'
),

device_health_summary AS (
    SELECT 
        rta.device_id,
        COUNT(*) as total_minutes_analyzed,

        -- Health metrics
        COUNT(*) FILTER (WHERE rta.alert_type != 'normal') as minutes_with_alerts,
        COUNT(*) FILTER (WHERE rta.alert_severity IN ('critical', 'high')) as critical_minutes,
        COUNT(*) FILTER (WHERE rta.performance_status IN ('optimal', 'good')) as good_performance_minutes,

        -- Overall device status
        AVG(rta.avg_quality) as overall_quality,
        AVG(rta.readings_per_minute) as avg_data_rate,
        SUM(rta.anomaly_count) as total_anomalies,

        -- Most recent status. Ordered aggregates are used because a window function
        -- cannot reference ungrouped columns in a GROUP BY query.
        (ARRAY_AGG(rta.performance_status ORDER BY rta.minute_bucket DESC))[1] as current_status,

        (ARRAY_AGG(rta.alert_type ORDER BY rta.minute_bucket DESC))[1] as current_alert_type

    FROM real_time_alerts rta
    GROUP BY rta.device_id
)

-- Final real-time dashboard query
SELECT 
    dhs.device_id,
    dhs.current_status,
    dhs.current_alert_type,

    -- Health indicators
    ROUND(dhs.overall_quality, 3) as quality_score,
    ROUND(dhs.avg_data_rate, 1) as data_rate_per_minute,
    dhs.total_anomalies,

    -- Alert summary
    dhs.minutes_with_alerts,
    dhs.critical_minutes,
    dhs.good_performance_minutes,

    -- Performance assessment
    ROUND((dhs.good_performance_minutes::DECIMAL / dhs.total_minutes_analyzed) * 100, 1) as uptime_percentage,
    ROUND((dhs.minutes_with_alerts::DECIMAL / dhs.total_minutes_analyzed) * 100, 1) as alert_percentage,

    -- Device health classification
    CASE 
        WHEN dhs.critical_minutes > 2 OR dhs.overall_quality < 0.6 THEN 'unhealthy'
        WHEN dhs.minutes_with_alerts > 5 OR dhs.overall_quality < 0.8 THEN 'degraded'
        WHEN dhs.good_performance_minutes >= (dhs.total_minutes_analyzed * 0.8) THEN 'healthy'
        ELSE 'monitoring'
    END as device_health_status,

    -- Recommendations
    CASE 
        WHEN dhs.total_anomalies > 20 THEN 'investigate_sensor_calibration'
        WHEN dhs.avg_data_rate < 1 THEN 'check_connectivity'
        WHEN dhs.overall_quality < 0.7 THEN 'review_sensor_maintenance'
        WHEN dhs.critical_minutes > 0 THEN 'immediate_attention_required'
        ELSE 'operating_normally'
    END as recommended_action

FROM device_health_summary dhs
ORDER BY 
    CASE dhs.current_status
        WHEN 'poor' THEN 1
        WHEN 'acceptable' THEN 2
        WHEN 'good' THEN 3
        WHEN 'optimal' THEN 4
    END,
    dhs.critical_minutes DESC,
    dhs.total_anomalies DESC;

-- Traditional time series problems:
-- 1. Complex manual partitioning and maintenance overhead
-- 2. Expensive materialized view refreshes for real-time analytics
-- 3. Limited compression and storage optimization for time series data
-- 4. Complex indexing strategies with high maintenance costs
-- 5. Poor write performance under high-volume IoT workloads
-- 6. Difficult horizontal scaling for time series data
-- 7. Limited time-based query optimization
-- 8. Complex time window and rollup aggregations
-- 9. Expensive historical data archiving and cleanup operations
-- 10. No built-in time series specific features and optimizations

MongoDB Time Series Collections provide comprehensive IoT data management with automatic optimization and intelligent compression:

// MongoDB Time Series Collections - Optimized IoT data storage and analytics
const { MongoClient, ObjectId } = require('mongodb');

// Comprehensive IoT Time Series Data Manager
class IoTTimeSeriesManager {
  constructor(connectionString, iotConfig = {}) {
    this.connectionString = connectionString;
    this.client = null;
    this.db = null;

    this.config = {
      // Time series configuration
      defaultGranularity: iotConfig.defaultGranularity || 'seconds',
      enableAutomaticIndexing: iotConfig.enableAutomaticIndexing !== false,
      enableCompression: iotConfig.enableCompression !== false,

      // IoT-specific features
      enableRealTimeAlerts: iotConfig.enableRealTimeAlerts !== false,
      enableAnomalyDetection: iotConfig.enableAnomalyDetection !== false,
      enablePredictiveAnalytics: iotConfig.enablePredictiveAnalytics !== false,
      enableDataQualityMonitoring: iotConfig.enableDataQualityMonitoring !== false,

      // Performance optimization
      batchWriteSize: iotConfig.batchWriteSize || 1000,
      writeConcern: iotConfig.writeConcern || { w: 1, j: true },
      readPreference: iotConfig.readPreference || 'primaryPreferred',
      maxConnectionPoolSize: iotConfig.maxConnectionPoolSize || 100,

      // Data retention and archival
      enableDataLifecycleManagement: iotConfig.enableDataLifecycleManagement !== false,
      defaultRetentionDays: iotConfig.defaultRetentionDays || 365,
      enableAutomaticArchiving: iotConfig.enableAutomaticArchiving !== false,

      // Analytics and processing
      enableStreamProcessing: iotConfig.enableStreamProcessing !== false,
      enableRealTimeAggregation: iotConfig.enableRealTimeAggregation !== false,
      aggregationWindowSize: iotConfig.aggregationWindowSize || '1 minute',

      ...iotConfig
    };

    // Time series collections for different data types
    this.timeSeriesCollections = new Map();
    this.aggregationCollections = new Map();
    this.alertCollections = new Map();

    // Real-time processing components
    this.changeStreams = new Map();
    this.processingPipelines = new Map();
    this.alertRules = new Map();

    // Performance metrics
    this.performanceMetrics = {
      totalDataPoints: 0,
      writeOperationsPerSecond: 0,
      queryOperationsPerSecond: 0,
      averageLatency: 0,
      compressionRatio: 0,
      alertsTriggered: 0
    };
  }

  async initializeIoTTimeSeriesSystem() {
    console.log('Initializing MongoDB IoT Time Series system...');

    try {
      // Connect to MongoDB
      this.client = new MongoClient(this.connectionString, {
        maxPoolSize: this.config.maxConnectionPoolSize,
        writeConcern: this.config.writeConcern,
        readPreference: this.config.readPreference
      });

      await this.client.connect();
      this.db = this.client.db();

      // Create time series collections for different sensor types
      await this.createTimeSeriesCollections();

      // Setup real-time processing pipelines
      if (this.config.enableStreamProcessing) {
        await this.setupStreamProcessing();
      }

      // Initialize real-time aggregations
      if (this.config.enableRealTimeAggregation) {
        await this.setupRealTimeAggregations();
      }

      // Setup anomaly detection
      if (this.config.enableAnomalyDetection) {
        await this.setupAnomalyDetection();
      }

      // Initialize data lifecycle management
      if (this.config.enableDataLifecycleManagement) {
        await this.setupDataLifecycleManagement();
      }

      console.log('IoT Time Series system initialized successfully');

    } catch (error) {
      console.error('Error initializing IoT Time Series system:', error);
      throw error;
    }
  }

  async createTimeSeriesCollections() {
    console.log('Creating optimized time series collections...');

    // Sensor readings time series collection with automatic bucketing
    await this.createOptimizedTimeSeriesCollection('sensor_readings', {
      timeField: 'timestamp',
      metaField: 'device',
      granularity: this.config.defaultGranularity,
      bucketMaxSpanSeconds: 3600, // 1 hour buckets
      bucketRoundingSeconds: 3600, // Custom bucketing requires this to equal bucketMaxSpanSeconds

      // Optimize for IoT data patterns
      expireAfterSeconds: this.config.defaultRetentionDays * 24 * 60 * 60,

      // Index optimization for common IoT queries
      additionalIndexes: [
        { 'device.id': 1, 'timestamp': -1 },
        { 'device.type': 1, 'timestamp': -1 },
        { 'device.location': 1, 'timestamp': -1 },
        { 'sensor.type': 1, 'timestamp': -1 }
      ]
    });

    // Environmental monitoring time series
    await this.createOptimizedTimeSeriesCollection('environmental_data', {
      timeField: 'timestamp',
      metaField: 'location',
      granularity: 'minutes',
      bucketMaxSpanSeconds: 7200, // 2 hour buckets for slower changing data

      additionalIndexes: [
        { 'location.facility': 1, 'location.zone': 1, 'timestamp': -1 },
        { 'sensor_type': 1, 'timestamp': -1 }
      ]
    });

    // Equipment performance monitoring
    await this.createOptimizedTimeSeriesCollection('equipment_metrics', {
      timeField: 'timestamp',
      metaField: 'equipment',
      granularity: 'seconds',
      bucketMaxSpanSeconds: 1800, // 30 minute buckets for high-frequency data

      additionalIndexes: [
        { 'equipment.id': 1, 'equipment.type': 1, 'timestamp': -1 },
        { 'metric_type': 1, 'timestamp': -1 }
      ]
    });

    // Energy consumption tracking
    await this.createOptimizedTimeSeriesCollection('energy_consumption', {
      timeField: 'timestamp',
      metaField: 'meter',
      granularity: 'minutes',
      bucketMaxSpanSeconds: 3600, // 1 hour buckets

      additionalIndexes: [
        { 'meter.id': 1, 'timestamp': -1 },
        { 'meter.building': 1, 'meter.floor': 1, 'timestamp': -1 }
      ]
    });

    // Vehicle telemetry data
    await this.createOptimizedTimeSeriesCollection('vehicle_telemetry', {
      timeField: 'timestamp',
      metaField: 'vehicle',
      granularity: 'seconds',
      bucketMaxSpanSeconds: 900, // 15 minute buckets for mobile data

      additionalIndexes: [
        { 'vehicle.id': 1, 'timestamp': -1 },
        { 'vehicle.route': 1, 'timestamp': -1 },
        { 'telemetry_type': 1, 'timestamp': -1 }
      ]
    });

    console.log('Time series collections created successfully');
  }

  async createOptimizedTimeSeriesCollection(collectionName, config) {
    console.log(`Creating time series collection: ${collectionName}`);

    try {
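      // Custom bucketing (MongoDB 6.3+) is an alternative to granularity, and
      // bucketRoundingSeconds must equal bucketMaxSpanSeconds.
      const bucketing = config.bucketMaxSpanSeconds
        ? {
            bucketMaxSpanSeconds: config.bucketMaxSpanSeconds,
            bucketRoundingSeconds: config.bucketMaxSpanSeconds
          }
        : { granularity: config.granularity || this.config.defaultGranularity };

      // Create time series collection with MongoDB's native optimization
      const collection = await this.db.createCollection(collectionName, {
        timeseries: {
          timeField: config.timeField,
          metaField: config.metaField,
          ...bucketing
        },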

        // Set TTL for automatic data expiration
        ...(config.expireAfterSeconds && {
          expireAfterSeconds: config.expireAfterSeconds
        }),

        // Enable compression for storage optimization
        storageEngine: {
          wiredTiger: {
            configString: 'block_compressor=zstd'
          }
        }
      });

      // Create additional indexes for query optimization
      if (config.additionalIndexes) {
        await collection.createIndexes(
          config.additionalIndexes.map(indexSpec => ({
            key: indexSpec,
            // The deprecated `background` option is omitted; MongoDB 4.2+ ignores it
            name: `idx_${Object.keys(indexSpec).join('_')}`
          }))
        );
      }

      // Store collection reference and configuration
      this.timeSeriesCollections.set(collectionName, {
        collection: collection,
        config: config,
        stats: {
          documentsInserted: 0,
          bytesStored: 0,
          compressionRatio: 0,
          lastInsertTime: null
        }
      });

      console.log(`Time series collection ${collectionName} created successfully`);

    } catch (error) {
      console.error(`Error creating time series collection ${collectionName}:`, error);
      throw error;
    }
  }

  async insertSensorData(collectionName, sensorDataPoints) {
    const startTime = Date.now();

    try {
      const collectionInfo = this.timeSeriesCollections.get(collectionName);
      if (!collectionInfo) {
        throw new Error(`Time series collection ${collectionName} not found`);
      }

      const collection = collectionInfo.collection;

      // Prepare data points with enhanced metadata
      const enhancedDataPoints = sensorDataPoints.map(dataPoint => ({
        // Time series fields
        timestamp: dataPoint.timestamp || new Date(),

        // Device/sensor metadata (automatically indexed as metaField)
        device: {
          id: dataPoint.deviceId,
          type: dataPoint.deviceType || 'generic_sensor',
          location: {
            facility: dataPoint.facility || 'unknown',
            zone: dataPoint.zone || 'default',
            coordinates: dataPoint.coordinates || null,
            floor: dataPoint.floor ?? null, // ?? keeps a legitimate floor 0
            room: dataPoint.room || null
          },
          firmware: dataPoint.firmwareVersion || null,
          manufacturer: dataPoint.manufacturer || null,
          model: dataPoint.model || null,
          installDate: dataPoint.installDate || null
        },

        // Sensor information
        sensor: {
          type: dataPoint.sensorType,
          unit: dataPoint.unit || null,
          precision: dataPoint.precision || null,
          calibrationDate: dataPoint.calibrationDate || null,
          maintenanceSchedule: dataPoint.maintenanceSchedule || null
        },

        // Measurement data
        value: dataPoint.value,
        rawValue: dataPoint.rawValue ?? dataPoint.value, // ?? keeps a legitimate raw value of 0

        // Data quality indicators
        quality: {
          // ?? (not ||) so that a score or confidence of 0 is not replaced by 1.0
          score: dataPoint.qualityScore ?? 1.0,
          flags: dataPoint.qualityFlags || [],
          confidence: dataPoint.confidence ?? 1.0,
          calibrationStatus: dataPoint.calibrationStatus || 'valid',
          sensorHealth: dataPoint.sensorHealth || 'healthy'
        },

        // Environmental context
        environmentalConditions: {
          // ?? (not ||) so that valid zero readings (e.g. 0 degrees) are not stored as null
          temperature: dataPoint.ambientTemperature ?? null,
          humidity: dataPoint.ambientHumidity ?? null,
          pressure: dataPoint.atmosphericPressure ?? null,
          vibration: dataPoint.vibrationLevel ?? null,
          electricalNoise: dataPoint.electricalNoise ?? null
        },

        // Processing metadata
        processing: {
          receivedAt: new Date(),
          source: dataPoint.dataSource || 'direct',
          protocol: dataPoint.protocol || 'unknown',
          gateway: dataPoint.gatewayId || null,
          processingLatency: dataPoint.processingLatency ?? null, // ?? keeps a latency of 0
          networkLatency: dataPoint.networkLatency ?? null
        },

        // Alert and anomaly flags
        alerts: {
          anomalyDetected: dataPoint.anomalyDetected || false,
          thresholdViolation: dataPoint.thresholdViolation || null,
          alertLevel: dataPoint.alertLevel || 'normal',
          alertReason: dataPoint.alertReason || null
        },

        // Business context
        businessContext: {
          assetId: dataPoint.assetId || null,
          processId: dataPoint.processId || null,
          operationalMode: dataPoint.operationalMode || 'normal',
          shiftId: dataPoint.shiftId || null,
          operatorId: dataPoint.operatorId || null
        },

        // Additional custom metadata
        customMetadata: dataPoint.customMetadata || {}
      }));

      // Perform batch insert with write concern
      const result = await collection.insertMany(enhancedDataPoints, {
        writeConcern: this.config.writeConcern,
        ordered: false // Continue past individual failures; insertMany still throws a MongoBulkWriteError if any document fails
      });

      const insertTime = Date.now() - startTime;

      // Update collection statistics
      collectionInfo.stats.documentsInserted += result.insertedCount;
      collectionInfo.stats.lastInsertTime = new Date();

      // Update performance metrics
      this.updatePerformanceMetrics('insert', result.insertedCount, insertTime);

      // Trigger real-time processing if enabled
      if (this.config.enableStreamProcessing) {
        await this.processRealTimeData(collectionName, enhancedDataPoints);
      }

      // Check for alerts if enabled
      if (this.config.enableRealTimeAlerts) {
        await this.checkAlertConditions(collectionName, enhancedDataPoints);
      }

      console.log(`Inserted ${result.insertedCount} sensor data points into ${collectionName} in ${insertTime}ms`);

      return {
        success: true,
        collection: collectionName,
        insertedCount: result.insertedCount,
        insertTime: insertTime,
        insertedIds: result.insertedIds
      };

    } catch (error) {
      console.error(`Error inserting sensor data into ${collectionName}:`, error);
      return {
        success: false,
        collection: collectionName,
        error: error.message,
        insertTime: Date.now() - startTime
      };
    }
  }

  async queryTimeSeriesData(collectionName, query) {
    const startTime = Date.now();

    try {
      const collectionInfo = this.timeSeriesCollections.get(collectionName);
      if (!collectionInfo) {
        throw new Error(`Time series collection ${collectionName} not found`);
      }

      const collection = collectionInfo.collection;

      // Build comprehensive aggregation pipeline for time series analysis
      const pipeline = [
        // Time range filtering (optimized for time series collections)
        {
          $match: {
            timestamp: {
              $gte: query.startTime,
              $lte: query.endTime || new Date()
            },
            ...(query.deviceIds && { 'device.id': { $in: query.deviceIds } }),
            ...(query.deviceTypes && { 'device.type': { $in: query.deviceTypes } }),
            ...(query.sensorTypes && { 'sensor.type': { $in: query.sensorTypes } }),
            ...(query.locations && { 'device.location.facility': { $in: query.locations } }),
            ...(query.minQualityScore && { 'quality.score': { $gte: query.minQualityScore } }),
            ...(query.alertLevel && { 'alerts.alertLevel': query.alertLevel })
          }
        },

        // Time-based grouping and aggregation
        {
          $group: {
            _id: {
              deviceId: '$device.id',
              deviceType: '$device.type',
              sensorType: '$sensor.type',
              location: '$device.location',

              // Time bucketing based on query granularity
              timeBucket: query.granularity === 'hour' 
                ? { $dateTrunc: { date: '$timestamp', unit: 'hour' } }
                : query.granularity === 'minute'
                ? { $dateTrunc: { date: '$timestamp', unit: 'minute' } }
                : query.granularity === 'day'
                ? { $dateTrunc: { date: '$timestamp', unit: 'day' } }
                : '$timestamp' // Raw timestamp for second-level granularity
            },

            // Statistical aggregations
            count: { $sum: 1 },
            avgValue: { $avg: '$value' },
            minValue: { $min: '$value' },
            maxValue: { $max: '$value' },
            sumValue: { $sum: '$value' },

            // Advanced statistical measures
            stdDevValue: { $stdDevPop: '$value' },
            varianceValue: { $pow: [{ $stdDevPop: '$value' }, 2] },

            // Percentile calculations using $percentile (MongoDB 7.0+)
            percentiles: {
              $percentile: {
                input: '$value',
                p: [0.25, 0.5, 0.75, 0.9, 0.95, 0.99],
                method: 'approximate'
              }
            },

            // Data quality metrics
            avgQualityScore: { $avg: '$quality.score' },
            minQualityScore: { $min: '$quality.score' },
            lowQualityCount: {
              $sum: { $cond: [{ $lt: ['$quality.score', 0.8] }, 1, 0] }
            },

            // Alert and anomaly statistics
            anomalyCount: {
              $sum: { $cond: ['$alerts.anomalyDetected', 1, 0] }
            },
            alertCounts: {
              $push: {
                $cond: [
                  { $ne: ['$alerts.alertLevel', 'normal'] },
                  '$alerts.alertLevel',
                  '$$REMOVE'
                ]
              }
            },

            // Time-based metrics
            firstReading: { $min: '$timestamp' },
            lastReading: { $max: '$timestamp' },

            // Value change analysis: the range (max - min) has to be derived from
            // maxValue and minValue after grouping, because $subtract is not a
            // $group accumulator

            // Environmental conditions (if available)
            avgAmbientTemp: { $avg: '$environmentalConditions.temperature' },
            avgAmbientHumidity: { $avg: '$environmentalConditions.humidity' },

            // Processing performance
            avgProcessingLatency: { $avg: '$processing.processingLatency' },
            maxProcessingLatency: { $max: '$processing.processingLatency' },

            // Raw data points (if requested for detailed analysis)
            ...(query.includeRawData && {
              rawDataPoints: {
                $push: {
                  timestamp: '$timestamp',
                  value: '$value',
                  quality: '$quality.score',
                  anomaly: '$alerts.anomalyDetected'
                }
              }
            })
          }
        },

        // Calculate additional derived metrics
        {
          $addFields: {
            // Time coverage and sampling rate analysis
            timeCoverageSeconds: {
              $divide: [
                { $subtract: ['$lastReading', '$firstReading'] },
                1000
              ]
            },

            // Data completeness analysis
            expectedReadings: {
              $cond: [
                { $eq: [query.granularity, 'minute'] },
                { $divide: [{ $subtract: ['$lastReading', '$firstReading'] }, 60000] },
                { $cond: [
                  { $eq: [query.granularity, 'hour'] },
                  { $divide: [{ $subtract: ['$lastReading', '$firstReading'] }, 3600000] },
                  '$count'
                ]}
              ]
            },

            // Statistical analysis
            coefficientOfVariation: {
              $cond: [
                { $ne: ['$avgValue', 0] },
                { $divide: ['$stdDevValue', '$avgValue'] },
                0
              ]
            },

            // Data quality percentage
            qualityPercentage: {
              $multiply: [
                { $divide: [
                  { $subtract: ['$count', '$lowQualityCount'] },
                  '$count'
                ]},
                100
              ]
            },

            // Anomaly rate
            anomalyRate: {
              $multiply: [
                { $divide: ['$anomalyCount', '$count'] },
                100
              ]
            },

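            // Alert distribution: number of readings per non-normal alert level
            alertDistribution: {
              $arrayToObject: {
                $map: {
                  // $setUnion de-duplicates the collected alert levels
                  input: { $setUnion: ['$alertCounts', []] },
                  as: 'level',
                  in: {
                    k: '$$level',
                    v: {
                      $size: {
                        $filter: {
                          input: '$alertCounts',
                          as: 'alert',
                          cond: { $eq: ['$$alert', '$$level'] }
                        }
                      }
                    }
                  }
                }
              }
            },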

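          }
        },

        // Classification stage: kept separate because $addFields cannot read
        // fields that are added within the same stage
        {
          $addFields: {
            // Performance classification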
            performanceCategory: {
              $switch: {
                branches: [
                  { 
                    case: { 
                      $and: [
                        { $gte: ['$qualityPercentage', 95] },
                        { $lt: ['$anomalyRate', 1] },
                        { $lte: ['$avgProcessingLatency', 100] }
                      ]
                    }, 
                    then: 'excellent' 
                  },
                  { 
                    case: { 
                      $and: [
                        { $gte: ['$qualityPercentage', 85] },
                        { $lt: ['$anomalyRate', 5] },
                        { $lte: ['$avgProcessingLatency', 300] }
                      ]
                    }, 
                    then: 'good' 
                  },
                  { 
                    case: { 
                      $and: [
                        { $gte: ['$qualityPercentage', 70] },
                        { $lt: ['$anomalyRate', 10] }
                      ]
                    }, 
                    then: 'fair' 
                  }
                ],
                default: 'poor'
              }
            },

            // Trending analysis (basic)
            valueTrend: {
              $cond: [
                { $and: [
                  { $ne: ['$minValue', '$maxValue'] },
                  { $gt: ['$count', 1] }
                ]},
                {
                  $switch: {
                    branches: [
                      { case: { $gt: [{ $subtract: ['$maxValue', '$minValue'] }, { $multiply: ['$avgValue', 0.2] }] }, then: 'volatile' },
                      { case: { $gt: ['$coefficientOfVariation', 0.3] }, then: 'variable' },
                      { case: { $lt: ['$coefficientOfVariation', 0.1] }, then: 'stable' }
                    ],
                    default: 'moderate'
                  }
                },
                'insufficient_data'
              ]
            }
          }
        },

        // Data completeness analysis
        {
          $addFields: {
            dataCompleteness: {
              $multiply: [
                { $divide: ['$count', { $max: ['$expectedReadings', 1] }] },
                100
              ]
            },

            // Sampling rate (readings per minute)
            samplingRate: {
              $cond: [
                { $gt: ['$timeCoverageSeconds', 0] },
                { $divide: ['$count', { $divide: ['$timeCoverageSeconds', 60] }] },
                0
              ]
            }
          }
        },

        // Final projection and organization
        {
          $project: {
            // Identity fields
            deviceId: '$_id.deviceId',
            deviceType: '$_id.deviceType',
            sensorType: '$_id.sensorType',
            location: '$_id.location',
            timeBucket: '$_id.timeBucket',

            // Basic statistics
            dataPoints: '$count',
            statistics: {
              avg: { $round: ['$avgValue', 4] },
              min: '$minValue',
              max: '$maxValue',
              sum: { $round: ['$sumValue', 2] },
              stdDev: { $round: ['$stdDevValue', 4] },
              variance: { $round: [{ $pow: ['$stdDevValue', 2] }, 4] },
              coefficientOfVariation: { $round: ['$coefficientOfVariation', 4] },
              valueRange: { $round: [{ $subtract: ['$maxValue', '$minValue'] }, 4] },
              percentiles: '$percentiles'
            },

            // Data quality metrics
            dataQuality: {
              avgScore: { $round: ['$avgQualityScore', 3] },
              minScore: { $round: ['$minQualityScore', 3] },
              qualityPercentage: { $round: ['$qualityPercentage', 1] },
              lowQualityCount: '$lowQualityCount'
            },

            // Alert and anomaly information
            alerts: {
              anomalyCount: '$anomalyCount',
              anomalyRate: { $round: ['$anomalyRate', 2] },
              alertDistribution: '$alertDistribution'
            },

            // Time-based analysis
            temporal: {
              firstReading: '$firstReading',
              lastReading: '$lastReading',
              timeCoverageSeconds: { $round: ['$timeCoverageSeconds', 0] },
              dataCompleteness: { $round: ['$dataCompleteness', 1] },
              samplingRate: { $round: ['$samplingRate', 2] }
            },

            // Environmental context
            environment: {
              avgTemperature: { $round: ['$avgAmbientTemp', 1] },
              avgHumidity: { $round: ['$avgAmbientHumidity', 1] }
            },

            // Performance metrics
            performance: {
              avgProcessingLatency: { $round: ['$avgProcessingLatency', 0] },
              maxProcessingLatency: { $round: ['$maxProcessingLatency', 0] },
              performanceCategory: '$performanceCategory'
            },

            // Analysis results
            analysis: {
              valueTrend: '$valueTrend',
              overallAssessment: {
                $switch: {
                  branches: [
                    { 
                      case: { 
                        $and: [
                          { $eq: ['$performanceCategory', 'excellent'] },
                          { $gte: ['$dataCompleteness', 95] }
                        ]
                      }, 
                      then: 'optimal_performance' 
                    },
                    { 
                      case: { 
                        $and: [
                          { $in: ['$performanceCategory', ['good', 'excellent']] },
                          { $gte: ['$dataCompleteness', 80] }
                        ]
                      }, 
                      then: 'good_performance' 
                    },
                    { 
                      case: { $lt: ['$dataCompleteness', 50] }, 
                      then: 'data_gaps_detected' 
                    },
                    { 
                      case: { $gt: ['$anomalyRate', 15] }, 
                      then: 'high_anomaly_rate' 
                    },
                    { 
                      case: { $lt: ['$qualityPercentage', 70] }, 
                      then: 'quality_issues' 
                    }
                  ],
                  default: 'acceptable_performance'
                }
              }
            },

            // Include raw data if requested
            ...(query.includeRawData && { rawDataPoints: 1 })
          }
        },

        // Sort results
        { $sort: { deviceId: 1, timeBucket: 1 } },

        // Apply result limits
        // Spreading undefined into an array throws, so fall back to an empty array
        ...(query.limit ? [{ $limit: query.limit }] : [])
      ];

      // Execute aggregation pipeline
      const results = await collection.aggregate(pipeline, {
        allowDiskUse: true,
        maxTimeMS: 30000
      }).toArray();

      const queryTime = Date.now() - startTime;

      // Update performance metrics
      this.updatePerformanceMetrics('query', results.length, queryTime);

      console.log(`Time series query completed: ${results.length} results in ${queryTime}ms`);

      return {
        success: true,
        collection: collectionName,
        results: results,
        resultCount: results.length,
        queryTime: queryTime,
        queryMetadata: {
          timeRange: {
            start: query.startTime,
            end: query.endTime || new Date()
          },
          granularity: query.granularity || 'raw',
          filters: {
            deviceIds: query.deviceIds?.length || 0,
            deviceTypes: query.deviceTypes?.length || 0,
            sensorTypes: query.sensorTypes?.length || 0,
            locations: query.locations?.length || 0
          },
          optimizationsApplied: ['time_series_bucketing', 'statistical_aggregation', 'index_optimization']
        }
      };

    } catch (error) {
      console.error(`Error querying time series data from ${collectionName}:`, error);
      return {
        success: false,
        collection: collectionName,
        error: error.message,
        queryTime: Date.now() - startTime
      };
    }
  }

  async setupRealTimeAggregations() {
    console.log('Setting up real-time aggregation pipelines...');

    // Create aggregation collections for different time windows
    const aggregationConfigs = [
      {
        name: 'sensor_readings_1min',
        sourceCollection: 'sensor_readings',
        windowSize: '1 minute',
        retentionDays: 7
      },
      {
        name: 'sensor_readings_5min',
        sourceCollection: 'sensor_readings',
        windowSize: '5 minutes',
        retentionDays: 30
      },
      {
        name: 'sensor_readings_1hour',
        sourceCollection: 'sensor_readings', 
        windowSize: '1 hour',
        retentionDays: 365
      },
      {
        name: 'sensor_readings_1day',
        sourceCollection: 'sensor_readings',
        windowSize: '1 day',
        retentionDays: 1825 // 5 years
      }
    ];

    for (const config of aggregationConfigs) {
      await this.createAggregationPipeline(config);
    }

    console.log('Real-time aggregation pipelines setup completed');
  }

  async createAggregationPipeline(config) {
    console.log(`Creating aggregation pipeline: ${config.name}`);

    // Create collection for storing aggregated data
    const aggregationCollection = await this.db.createCollection(config.name, {
      timeseries: {
        timeField: 'timestamp',
        metaField: 'device',
        // Valid granularity values are 'seconds', 'minutes', and 'hours';
        // day-sized windows therefore use 'hours'
        granularity: config.windowSize.includes('minute') ? 'minutes' : 'hours'
      },
      expireAfterSeconds: config.retentionDays * 24 * 60 * 60
    });

    this.aggregationCollections.set(config.name, {
      collection: aggregationCollection,
      config: config
    });
  }

  async processRealTimeData(collectionName, dataPoints) {
    console.log(`Processing real-time data for ${collectionName}: ${dataPoints.length} points`);

    // Update real-time aggregations
    for (const [aggName, aggInfo] of this.aggregationCollections.entries()) {
      if (aggInfo.config.sourceCollection === collectionName) {
        await this.updateRealTimeAggregation(aggName, dataPoints);
      }
    }

    // Process data through ML pipelines if enabled
    if (this.config.enablePredictiveAnalytics) {
      await this.processPredictiveAnalytics(dataPoints);
    }
  }

  async checkAlertConditions(collectionName, dataPoints) {
    console.log(`Checking alert conditions for ${dataPoints.length} data points`);

    const alertsTriggered = [];

    for (const dataPoint of dataPoints) {
      // Check various alert conditions
      const alerts = [];

      // Value threshold alerts
      if (dataPoint.sensor.type === 'temperature' && dataPoint.value > 80) {
        alerts.push({
          type: 'threshold_violation',
          severity: 'high',
          message: `Temperature ${dataPoint.value}°C exceeds threshold`,
          deviceId: dataPoint.device.id
        });
      }

      // Quality score alerts
      if (dataPoint.quality.score < 0.7) {
        alerts.push({
          type: 'quality_degradation',
          severity: 'medium',
          message: `Quality score ${dataPoint.quality.score} below acceptable level`,
          deviceId: dataPoint.device.id
        });
      }

      // Anomaly alerts
      if (dataPoint.alerts.anomalyDetected) {
        alerts.push({
          type: 'anomaly_detected',
          severity: 'high',
          message: `Anomaly detected in sensor reading`,
          deviceId: dataPoint.device.id
        });
      }

      // Processing latency alerts
      if (dataPoint.processing.processingLatency > 5000) { // 5 seconds
        alerts.push({
          type: 'processing_delay',
          severity: 'medium',
          message: `Processing latency ${dataPoint.processing.processingLatency}ms exceeds threshold`,
          deviceId: dataPoint.device.id
        });
      }

      if (alerts.length > 0) {
        alertsTriggered.push(...alerts);
        this.performanceMetrics.alertsTriggered += alerts.length;
      }
    }

    // Store alerts if any were triggered
    if (alertsTriggered.length > 0) {
      await this.storeAlerts(alertsTriggered);
    }

    return alertsTriggered;
  }

  async storeAlerts(alerts) {
    try {
      // Get the alerts collection (MongoDB creates it on first write) and ensure its indexes exist.
      // createCollection() would throw after a restart, when the collection already exists.
      if (!this.alertCollections.has('iot_alerts')) {
        const alertsCollection = this.db.collection('iot_alerts');
        await alertsCollection.createIndexes([
          { key: { deviceId: 1, timestamp: -1 } },
          { key: { severity: 1, timestamp: -1 } },
          { key: { type: 1, timestamp: -1 } }
        ]);

        this.alertCollections.set('iot_alerts', alertsCollection);
      }

      const alertsCollection = this.alertCollections.get('iot_alerts');

      const alertDocuments = alerts.map(alert => ({
        ...alert,
        timestamp: new Date(),
        acknowledged: false,
        resolvedAt: null
      }));

      await alertsCollection.insertMany(alertDocuments);

      console.log(`Stored ${alertDocuments.length} alerts`);

    } catch (error) {
      console.error('Error storing alerts:', error);
    }
  }

  updatePerformanceMetrics(operation, count, duration) {
    // Guard against a 0 ms duration, which would otherwise produce Infinity
    const elapsedMs = Math.max(duration, 1);

    if (operation === 'insert') {
      this.performanceMetrics.totalDataPoints += count;
      this.performanceMetrics.writeOperationsPerSecond = 
        (count / elapsedMs) * 1000;
    } else if (operation === 'query') {
      this.performanceMetrics.queryOperationsPerSecond = 
        (count / elapsedMs) * 1000;
    }

    // Smooth latency with an exponential moving average (weight 0.5 on the newest sample)
    this.performanceMetrics.averageLatency = 
      (this.performanceMetrics.averageLatency + duration) / 2;
  }

  async getSystemStatistics() {
    console.log('Gathering IoT Time Series system statistics...');

    const stats = {
      collections: {},
      performance: this.performanceMetrics,
      aggregations: {},
      systemHealth: 'healthy'
    };

    // Gather statistics from each time series collection
    for (const [collectionName, collectionInfo] of this.timeSeriesCollections.entries()) {
      try {
        const collection = collectionInfo.collection;

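        // collection.stats() is deprecated in recent Node.js drivers; the $collStats
        // aggregation stage is the documented replacement
        const [collStatsDocs, sampleData] = await Promise.all([
          collection.aggregate([{ $collStats: { storageStats: {} } }]).toArray(),
          collection.find().sort({ timestamp: -1 }).limit(1).toArray()
        ]);
        const collectionStats = collStatsDocs[0]?.storageStats ?? {};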

        stats.collections[collectionName] = {
          documentCount: collectionStats.count || 0,
          storageSize: collectionStats.size || 0,
          indexSize: collectionStats.totalIndexSize || 0,
          avgDocumentSize: collectionStats.avgObjSize || 0,
          // Uncompressed data size divided by on-disk size (higher means better compression)
          compressionRatio: collectionStats.storageSize > 0 ? 
            (collectionStats.size / collectionStats.storageSize) : 1,
          lastDataPoint: sampleData[0]?.timestamp || null,
          configuration: collectionInfo.config,
          performance: collectionInfo.stats
        };

      } catch (error) {
        stats.collections[collectionName] = {
          error: error.message,
          available: false
        };
      }
    }

    return stats;
  }

  async shutdown() {
    console.log('Shutting down IoT Time Series Manager...');

    // Close change streams
    for (const [streamName, changeStream] of this.changeStreams.entries()) {
      try {
        await changeStream.close();
        console.log(`Closed change stream: ${streamName}`);
      } catch (error) {
        console.error(`Error closing change stream ${streamName}:`, error);
      }
    }

    // Close MongoDB connection
    if (this.client) {
      await this.client.close();
    }

    console.log('IoT Time Series Manager shutdown complete');
  }
}

// Benefits of MongoDB Time Series Collections:
// - Native time series optimization with automatic bucketing and compression
// - Specialized indexing and query optimization for time-based data patterns
// - Efficient storage with automatic data lifecycle management
// - Real-time aggregation pipelines for IoT analytics
// - Built-in support for high-volume write workloads
// - Columnar compression that can substantially reduce storage size (actual savings depend on the data)
// - Seamless integration with MongoDB's distributed architecture
// - SQL-compatible time series operations through QueryLeaf integration
// - Native support for IoT-specific query patterns and analytics
// - Automatic data archiving and retention management

module.exports = {
  IoTTimeSeriesManager
};
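
The query method above chooses its time bucket with a chain of ternaries and appends an optional `$limit` stage at the end of the pipeline. As a minimal sketch, those two decisions could be factored into small helpers. `buildTimeBucket` and `optionalLimit` are hypothetical names used only for illustration and are not part of the class above:

```javascript
// Map a requested granularity to the $group time-bucket expression.
// Unknown or missing values fall back to the raw timestamp.
function buildTimeBucket(granularity) {
  const truncatable = ['minute', 'hour', 'day'];
  return truncatable.includes(granularity)
    ? { $dateTrunc: { date: '$timestamp', unit: granularity } }
    : '$timestamp';
}

// Return zero or one $limit stage as an array, so the result can always be
// spread into a pipeline (spreading undefined into an array throws).
function optionalLimit(limit) {
  return Number.isInteger(limit) && limit > 0 ? [{ $limit: limit }] : [];
}

// Example: the tail of a pipeline built without a limit
const pipelineTail = [
  { $sort: { deviceId: 1, timeBucket: 1 } },
  ...optionalLimit(undefined)
];
```

Keeping these choices in plain functions makes them easy to unit test without a database connection.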

Understanding MongoDB Time Series Collections Architecture

IoT Data Patterns and Optimization Strategies

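MongoDB Time Series Collections are designed for time-stamped IoT data: measurements that share a metaField value are grouped into time-ordered buckets and stored in a compressed columnar layout, which suits append-heavy sensor workloads. The following class outlines how enterprise features can be layered on top of the manager shown earlier: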

// Advanced IoT Time Series Processing with Enterprise Features
class EnterpriseIoTProcessor extends IoTTimeSeriesManager {
  constructor(connectionString, enterpriseConfig) {
    super(connectionString, enterpriseConfig);

    // Defaults enable every enterprise feature; values passed by the caller take precedence
    this.enterpriseConfig = {
      enableEdgeComputing: true,
      enablePredictiveAnalytics: true,
      enableDigitalTwins: true,
      enableMLPipelines: true,
      enableAdvancedVisualization: true,
      enableMultiTenancy: true,
      ...enterpriseConfig
    };

    this.setupEnterpriseFeatures();
    this.initializeMLPipelines();
    this.setupDigitalTwins();
  }

  async implementAdvancedIoTStrategies() {
    console.log('Implementing enterprise IoT data strategies...');

    const strategies = {
      // Edge computing integration
      edgeComputing: {
        edgeDataAggregation: true,
        intelligentFiltering: true,
        localAnomalyDetection: true,
        bandwidthOptimization: true
      },

      // Predictive analytics
      predictiveAnalytics: {
        equipmentFailurePrediction: true,
        energyOptimization: true,
        maintenanceScheduling: true,
        capacityPlanning: true
      },

      // Digital twin implementation
      digitalTwins: {
        realTimeSimulation: true,
        processOptimization: true,
        scenarioModeling: true,
        performanceAnalytics: true
      }
    };

    return await this.deployEnterpriseIoTStrategies(strategies);
  }

  async setupAdvancedAnalytics() {
    console.log('Setting up advanced IoT analytics capabilities...');

    const analyticsConfig = {
      // Real-time processing
      realTimeProcessing: {
        streamProcessing: true,
        complexEventProcessing: true,
        patternRecognition: true,
        correlationAnalysis: true
      },

      // Machine learning integration
      machineLearning: {
        anomalyDetection: true,
        predictiveModeling: true,
        classificationModels: true,
        reinforcementLearning: true
      },

      // Advanced visualization
      visualization: {
        realTimeDashboards: true,
        historicalAnalytics: true,
        geospatialVisualization: true,
        threeDimensionalModeling: true
      }
    };

    return await this.deployAdvancedAnalytics(analyticsConfig);
  }
}
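
Time series collections accept only `'seconds'`, `'minutes'`, or `'hours'` as a granularity, so a roll-up window such as `'1 day'` has to be mapped onto one of those values before the collection is created. The sketch below shows one way to derive the creation options from a roll-up config. `toGranularity` and `buildRollupOptions` are hypothetical helper names, and the `timestamp` and `device` field names are assumed to match the earlier examples:

```javascript
// Map a human-readable window size onto a valid time series granularity.
// Hour-sized and day-sized windows both use 'hours', the coarsest valid value.
function toGranularity(windowSize) {
  if (windowSize.includes('second')) return 'seconds';
  if (windowSize.includes('minute')) return 'minutes';
  return 'hours';
}

// Build the options object for db.createCollection(name, options).
function buildRollupOptions(config) {
  return {
    timeseries: {
      timeField: 'timestamp',
      metaField: 'device',
      granularity: toGranularity(config.windowSize)
    },
    // Retention is expressed in seconds
    expireAfterSeconds: config.retentionDays * 24 * 60 * 60
  };
}

// Example: options for a daily roll-up kept for one week
const dailyOptions = buildRollupOptions({ windowSize: '1 day', retentionDays: 7 });
```

The resulting object can be passed as the second argument to `db.createCollection()`; existing collections would still need to be handled separately, since creating a collection that already exists raises an error.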

SQL-Style Time Series Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Time Series operations and IoT data management:

-- QueryLeaf Time Series operations with SQL-familiar syntax for IoT data

-- Create optimized time series collections for different IoT data types
CREATE TIME_SERIES_COLLECTION sensor_readings (
  timestamp TIMESTAMPTZ,
  device_id STRING,
  sensor_type STRING,
  value DECIMAL,
  quality_score DECIMAL,
  location OBJECT,
  metadata OBJECT
)
WITH (
  time_field = 'timestamp',
  meta_field = 'device_metadata',
  -- Custom bucketing replaces granularity; MongoDB requires both bucket values to be equal
  bucket_max_span_seconds = 3600,
  bucket_rounding_seconds = 3600,
  expire_after_seconds = 31536000, -- 1 year retention
  enable_compression = true,
  compression_algorithm = 'zstd'
);

-- Create specialized collections for different IoT use cases
CREATE TIME_SERIES_COLLECTION equipment_telemetry (
  timestamp TIMESTAMPTZ,
  equipment_id STRING,
  metric_type STRING,
  value DECIMAL,
  operational_status STRING,
  maintenance_flags ARRAY,
  performance_indicators OBJECT
)
WITH (
  bucket_max_span_seconds = 1800, -- 30 minute buckets for high-frequency data
  bucket_rounding_seconds = 1800, -- must equal bucket_max_span_seconds; replaces granularity
  enable_automatic_indexing = true
);

CREATE TIME_SERIES_COLLECTION environmental_monitoring (
  timestamp TIMESTAMPTZ,
  location_id STRING,
  sensor_network STRING,
  measurements OBJECT,
  weather_conditions OBJECT,
  air_quality_index DECIMAL
)
WITH (
  bucket_max_span_seconds = 7200, -- 2 hour buckets for environmental data
  bucket_rounding_seconds = 7200, -- must equal bucket_max_span_seconds; replaces granularity
  enable_predictive_analytics = true
);

-- Advanced IoT data insertion with comprehensive metadata
INSERT INTO sensor_readings (
  timestamp, device_metadata, sensor_info, measurements, quality_metrics, context, processing_metadata
)
WITH iot_data_enrichment AS (
  SELECT 
    reading_timestamp as timestamp,

    -- Device and location metadata (optimized as metaField)
    JSON_OBJECT(
      'device_id', device_identifier,
      'device_type', equipment_type,
      'firmware_version', firmware_ver,
      'location', JSON_OBJECT(
        'facility', facility_name,
        'zone', zone_identifier,
        'coordinates', JSON_OBJECT('lat', latitude, 'lng', longitude),
        'floor', floor_number,
        'room', room_identifier
      ),
      'network_info', JSON_OBJECT(
        'gateway_id', gateway_identifier,
        'signal_strength', rssi_value,
        'protocol', communication_protocol,
        'network_latency', network_delay_ms
      )
    ) as device_metadata,

    -- Sensor information and calibration data
    JSON_OBJECT(
      'sensor_type', sensor_category,
      'model_number', sensor_model,
      'serial_number', sensor_serial,
      'calibration_date', last_calibration,
      'maintenance_schedule', maintenance_interval,
      'measurement_unit', measurement_units,
      'precision', sensor_precision,
      'accuracy', sensor_accuracy_percent,
      'operating_range', JSON_OBJECT(
        'min_value', minimum_measurable,
        'max_value', maximum_measurable,
        'optimal_range', optimal_operating_range
      )
    ) as sensor_info,

    -- Measurement data with statistical context
    JSON_OBJECT(
      'primary_value', sensor_reading,
      'raw_value', unprocessed_reading,
      'calibrated_value', calibration_adjusted_value,
      'statistical_context', JSON_OBJECT(
        'recent_average', rolling_average_10min,
        'recent_min', rolling_min_10min,
        'recent_max', rolling_max_10min,
        'trend_indicator', trend_direction,
        'volatility_index', measurement_volatility
      ),
      'related_measurements', JSON_OBJECT(
        'secondary_sensors', related_sensor_readings,
        'environmental_factors', ambient_conditions,
        'operational_context', equipment_operating_mode
      )
    ) as measurements,

    -- Comprehensive quality assessment
    JSON_OBJECT(
      'overall_score', data_quality_score,
      'confidence_level', measurement_confidence,
      'quality_factors', JSON_OBJECT(
        'sensor_health', sensor_status_indicator,
        'calibration_validity', calibration_status,
        'environmental_conditions', environmental_suitability,
        'signal_integrity', signal_quality_assessment,
        'power_status', power_supply_stability
      ),
      'quality_flags', quality_warning_flags,
      'anomaly_indicators', JSON_OBJECT(
        'statistical_anomaly', statistical_outlier_flag,
        'temporal_anomaly', temporal_pattern_anomaly,
        'contextual_anomaly', contextual_deviation_flag,
        'severity_level', anomaly_severity_rating
      )
    ) as quality_metrics,

    -- Business and operational context
    JSON_OBJECT(
      'business_context', JSON_OBJECT(
        'asset_id', primary_asset_identifier,
        'process_id', manufacturing_process_id,
        'production_batch', current_batch_identifier,
        'shift_information', JSON_OBJECT(
          'shift_id', current_shift,
          'operator_id', responsible_operator,
          'supervisor_id', shift_supervisor
        )
      ),
      'operational_context', JSON_OBJECT(
        'equipment_mode', current_operational_mode,
        'production_rate', current_production_speed,
        'efficiency_metrics', operational_efficiency_data,
        'maintenance_status', equipment_maintenance_state,
        'compliance_flags', regulatory_compliance_status
      ),
      'alert_configuration', JSON_OBJECT(
        'threshold_settings', alert_threshold_values,
        'notification_rules', alert_notification_config,
        'escalation_procedures', alert_escalation_rules,
        'suppression_conditions', alert_suppression_rules
      )
    ) as context

  FROM raw_iot_data_stream
  WHERE 
    data_timestamp >= CURRENT_TIMESTAMP - INTERVAL '5 minutes'
    AND data_quality_preliminary >= 0.5
    AND device_status != 'maintenance_mode'
)
SELECT 
  timestamp,
  device_metadata,
  sensor_info,
  measurements,
  quality_metrics,
  context,

  -- Processing metadata
  JSON_OBJECT(
    'ingestion_timestamp', CURRENT_TIMESTAMP,
    'processing_latency_ms', 
      EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - timestamp)) * 1000,
    'data_pipeline_version', '2.1.0',
    'enrichment_applied', JSON_ARRAY(
      'metadata_enhancement',
      'quality_assessment',
      'anomaly_detection',
      'contextual_enrichment'
    )
  ) as processing_metadata

FROM iot_data_enrichment
WHERE 
  -- Final data quality validation
  JSON_EXTRACT(quality_metrics, '$.overall_score') >= 0.6
  AND JSON_EXTRACT(measurements, '$.primary_value') IS NOT NULL

ORDER BY timestamp;

-- Real-time IoT analytics with time-based aggregations
WITH real_time_sensor_analytics AS (
  SELECT 
    DATE_TRUNC('minute', timestamp) as time_bucket,
    JSON_EXTRACT(device_metadata, '$.device_id') as device_id,
    JSON_EXTRACT(device_metadata, '$.device_type') as device_type,
    JSON_EXTRACT(device_metadata, '$.location.facility') as facility,
    JSON_EXTRACT(device_metadata, '$.location.zone') as zone,
    JSON_EXTRACT(sensor_info, '$.sensor_type') as sensor_type,

    -- Statistical aggregations optimized for time series
    COUNT(*) as reading_count,
    AVG(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as avg_value,
    MIN(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as min_value,
    MAX(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as max_value,
    STDDEV(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as stddev_value,

    -- Percentile calculations for distribution analysis
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as p25_value,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as median_value,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as p75_value,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as p95_value,

    -- Data quality aggregations
    AVG(JSON_EXTRACT(quality_metrics, '$.overall_score')::DECIMAL) as avg_quality_score,
    MIN(JSON_EXTRACT(quality_metrics, '$.overall_score')::DECIMAL) as min_quality_score,
    COUNT(*) FILTER (WHERE JSON_EXTRACT(quality_metrics, '$.overall_score')::DECIMAL < 0.8) as low_quality_readings,

    -- Anomaly detection aggregations
    COUNT(*) FILTER (WHERE JSON_EXTRACT(quality_metrics, '$.anomaly_indicators.statistical_anomaly')::BOOLEAN = true) as statistical_anomalies,
    COUNT(*) FILTER (WHERE JSON_EXTRACT(quality_metrics, '$.anomaly_indicators.temporal_anomaly')::BOOLEAN = true) as temporal_anomalies,
    COUNT(*) FILTER (WHERE JSON_EXTRACT(quality_metrics, '$.anomaly_indicators.contextual_anomaly')::BOOLEAN = true) as contextual_anomalies,

    -- Value change and trend analysis
    (MAX(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) - 
     MIN(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL)) as value_range,

    -- Time coverage and sampling analysis
    (EXTRACT(EPOCH FROM MAX(timestamp) - MIN(timestamp))) as time_span_seconds,
    COUNT(*)::DECIMAL / GREATEST(1, EXTRACT(EPOCH FROM MAX(timestamp) - MIN(timestamp)) / 60) as readings_per_minute,

    -- Processing performance metrics
    AVG(JSON_EXTRACT(processing_metadata, '$.processing_latency_ms')::DECIMAL) as avg_processing_latency,
    MAX(JSON_EXTRACT(processing_metadata, '$.processing_latency_ms')::DECIMAL) as max_processing_latency,

    -- Network performance indicators
    AVG(JSON_EXTRACT(device_metadata, '$.network_info.network_latency')::DECIMAL) as avg_network_latency,
    AVG(JSON_EXTRACT(device_metadata, '$.network_info.signal_strength')::DECIMAL) as avg_signal_strength,

    -- Environmental context aggregations
    AVG(JSON_EXTRACT(measurements, '$.related_measurements.environmental_factors.temperature')::DECIMAL) as avg_ambient_temp,
    AVG(JSON_EXTRACT(measurements, '$.related_measurements.environmental_factors.humidity')::DECIMAL) as avg_ambient_humidity,

    -- Operational context
    MODE() WITHIN GROUP (ORDER BY JSON_EXTRACT(context, '$.operational_context.equipment_mode')::STRING) as primary_equipment_mode,
    AVG(JSON_EXTRACT(context, '$.operational_context.production_rate')::DECIMAL) as avg_production_rate

  FROM sensor_readings
  WHERE 
    timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
    AND JSON_EXTRACT(quality_metrics, '$.overall_score')::DECIMAL >= 0.5
  GROUP BY 
    time_bucket, device_id, device_type, facility, zone, sensor_type
),

performance_analysis AS (
  SELECT 
    rtsa.*,

    -- Data quality assessment
    ROUND((rtsa.reading_count - rtsa.low_quality_readings)::DECIMAL / rtsa.reading_count * 100, 2) as quality_percentage,

    -- Anomaly rate calculations
    ROUND((rtsa.statistical_anomalies + rtsa.temporal_anomalies + rtsa.contextual_anomalies)::DECIMAL / rtsa.reading_count * 100, 2) as total_anomaly_rate,

    -- Statistical analysis
    CASE 
      WHEN rtsa.avg_value != 0 THEN ROUND(rtsa.stddev_value / ABS(rtsa.avg_value), 4)
      ELSE 0
    END as coefficient_of_variation,

    -- Data completeness analysis (actual readings vs. an assumed baseline of one reading per minute)
    ROUND(rtsa.reading_count / GREATEST(1, rtsa.time_span_seconds / 60) * 100, 1) as data_completeness_percent,

    -- Performance classification
    CASE 
      WHEN rtsa.avg_quality_score >= 0.95 AND rtsa.total_anomaly_rate <= 1 THEN 'excellent'
      WHEN rtsa.avg_quality_score >= 0.85 AND rtsa.total_anomaly_rate <= 5 THEN 'good'
      WHEN rtsa.avg_quality_score >= 0.70 AND rtsa.total_anomaly_rate <= 10 THEN 'acceptable'
      ELSE 'poor'
    END as performance_category,

    -- Trend analysis
    CASE 
      WHEN rtsa.coefficient_of_variation > 0.5 THEN 'highly_variable'
      WHEN rtsa.coefficient_of_variation > 0.3 THEN 'variable'
      WHEN rtsa.coefficient_of_variation > 0.1 THEN 'moderate'
      ELSE 'stable'
    END as stability_classification,

    -- Alert conditions
    CASE 
      WHEN rtsa.avg_quality_score < 0.7 THEN 'quality_alert'
      WHEN rtsa.total_anomaly_rate > 15 THEN 'anomaly_alert'
      WHEN rtsa.avg_processing_latency > 5000 THEN 'latency_alert'
      WHEN rtsa.data_completeness_percent < 80 THEN 'data_gap_alert'
      WHEN rtsa.avg_signal_strength < -80 THEN 'connectivity_alert'
      ELSE 'normal'
    END as alert_status,

    -- Operational efficiency indicators
    CASE 
      WHEN rtsa.primary_equipment_mode = 'production' AND rtsa.avg_production_rate >= 95 THEN 'optimal_efficiency'
      WHEN rtsa.primary_equipment_mode = 'production' AND rtsa.avg_production_rate >= 80 THEN 'good_efficiency'
      WHEN rtsa.primary_equipment_mode = 'production' AND rtsa.avg_production_rate >= 60 THEN 'reduced_efficiency'
      WHEN rtsa.primary_equipment_mode = 'maintenance' THEN 'maintenance_mode'
      ELSE 'unknown_efficiency'
    END as operational_efficiency,

    -- Time-based patterns
    EXTRACT(HOUR FROM rtsa.time_bucket) as hour_of_day,
    EXTRACT(DOW FROM rtsa.time_bucket) as day_of_week

  FROM real_time_sensor_analytics rtsa
),

device_health_assessment AS (
  SELECT 
    pa.device_id,
    pa.device_type,
    pa.facility,
    pa.zone,
    pa.sensor_type,

    -- Current status indicators
    -- Ordered aggregates pick the latest bucket per group and stay valid alongside GROUP BY
    (ARRAY_AGG(pa.performance_category ORDER BY pa.time_bucket DESC))[1] as current_performance_status,

    (ARRAY_AGG(pa.alert_status ORDER BY pa.time_bucket DESC))[1] as current_alert_status,

    -- Performance trends over the analysis window
    COUNT(*) as analysis_periods,
    COUNT(*) FILTER (WHERE pa.performance_category IN ('excellent', 'good')) as good_periods,
    COUNT(*) FILTER (WHERE pa.alert_status != 'normal') as alert_periods,

    -- Average performance metrics
    ROUND(AVG(pa.avg_quality_score), 3) as overall_avg_quality,
    ROUND(AVG(pa.total_anomaly_rate), 2) as overall_anomaly_rate,
    ROUND(AVG(pa.readings_per_minute), 2) as overall_data_rate,
    ROUND(AVG(pa.avg_processing_latency), 0) as overall_processing_latency,

    -- Stability and consistency
    ROUND(AVG(pa.coefficient_of_variation), 4) as average_stability_index,
    ROUND(AVG(pa.data_completeness_percent), 1) as average_data_completeness,

    -- Network and connectivity
    ROUND(AVG(pa.avg_network_latency), 0) as average_network_latency,
    ROUND(AVG(pa.avg_signal_strength), 1) as average_signal_strength,

    -- Environmental context
    ROUND(AVG(pa.avg_ambient_temp), 1) as average_ambient_temperature,
    ROUND(AVG(pa.avg_ambient_humidity), 1) as average_ambient_humidity,

    -- Operational efficiency
    MODE() WITHIN GROUP (ORDER BY pa.operational_efficiency) as predominant_efficiency_level,

    -- Value statistics across all time periods
    ROUND(AVG(pa.avg_value), 4) as overall_average_value,
    ROUND(AVG(pa.stddev_value), 4) as overall_value_variability,
    MIN(pa.min_value) as absolute_minimum_value,
    MAX(pa.max_value) as absolute_maximum_value

  FROM performance_analysis pa
  GROUP BY 
    pa.device_id, pa.device_type, pa.facility, pa.zone, pa.sensor_type
)

-- Comprehensive IoT device health and performance report
SELECT 
  dha.device_id,
  dha.device_type,
  dha.sensor_type,
  dha.facility,
  dha.zone,

  -- Current status
  dha.current_performance_status,
  dha.current_alert_status,

  -- Performance summary
  dha.overall_avg_quality as quality_score,
  dha.overall_anomaly_rate as anomaly_rate_percent,
  dha.overall_data_rate as readings_per_minute,
  dha.overall_processing_latency as avg_latency_ms,

  -- Reliability indicators
  ROUND((dha.good_periods::DECIMAL / dha.analysis_periods) * 100, 1) as uptime_percentage,
  ROUND((dha.alert_periods::DECIMAL / dha.analysis_periods) * 100, 1) as alert_percentage,
  dha.average_data_completeness as data_completeness_percent,

  -- Performance classification
  CASE 
    WHEN dha.overall_avg_quality >= 0.9 AND dha.overall_anomaly_rate <= 2 AND dha.uptime_percentage >= 95 THEN 'optimal'
    WHEN dha.overall_avg_quality >= 0.8 AND dha.overall_anomaly_rate <= 5 AND dha.uptime_percentage >= 90 THEN 'good'
    WHEN dha.overall_avg_quality >= 0.6 AND dha.overall_anomaly_rate <= 10 AND dha.uptime_percentage >= 80 THEN 'acceptable'
    WHEN dha.overall_avg_quality < 0.5 OR dha.overall_anomaly_rate > 20 THEN 'critical'
    ELSE 'needs_attention'
  END as device_health_classification,

  -- Operational context
  dha.predominant_efficiency_level,
  dha.overall_average_value as typical_reading_value,
  dha.overall_value_variability as measurement_stability,

  -- Environmental factors
  dha.average_ambient_temperature,
  dha.average_ambient_humidity,

  -- Connectivity and infrastructure
  dha.average_network_latency as network_latency_ms,
  dha.average_signal_strength as signal_strength_dbm,

  -- Recommendations and next actions
  CASE 
    WHEN dha.current_alert_status = 'quality_alert' THEN 'calibrate_sensor_immediate'
    WHEN dha.current_alert_status = 'anomaly_alert' THEN 'investigate_anomaly_patterns'
    WHEN dha.current_alert_status = 'latency_alert' THEN 'optimize_data_pipeline'
    WHEN dha.current_alert_status = 'connectivity_alert' THEN 'check_network_infrastructure'
    WHEN dha.current_alert_status = 'data_gap_alert' THEN 'verify_sensor_connectivity'
    WHEN dha.overall_avg_quality < 0.8 THEN 'schedule_maintenance'
    WHEN dha.overall_anomaly_rate > 10 THEN 'review_operating_conditions'
    WHEN dha.uptime_percentage < 90 THEN 'improve_system_reliability'
    ELSE 'continue_monitoring'
  END as recommended_action,

  -- Priority level for action
  CASE 
    WHEN dha.device_health_classification = 'critical' THEN 'immediate'
    WHEN dha.device_health_classification = 'needs_attention' THEN 'high'
    WHEN dha.current_alert_status != 'normal' THEN 'medium'
    WHEN dha.device_health_classification = 'acceptable' THEN 'low'
    ELSE 'routine'
  END as action_priority

FROM device_health_assessment dha
ORDER BY 
  CASE action_priority
    WHEN 'immediate' THEN 1
    WHEN 'high' THEN 2  
    WHEN 'medium' THEN 3
    WHEN 'low' THEN 4
    ELSE 5
  END,
  dha.overall_anomaly_rate DESC,
  dha.overall_avg_quality ASC;

-- Time series forecasting and predictive analytics
WITH historical_patterns AS (
  SELECT 
    JSON_EXTRACT(device_metadata, '$.device_id') as device_id,
    JSON_EXTRACT(sensor_info, '$.sensor_type') as sensor_type,
    DATE_TRUNC('hour', timestamp) as hour_bucket,

    AVG(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as hourly_avg_value,
    COUNT(*) as readings_count,
    MIN(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as hourly_min,
    MAX(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as hourly_max,

    -- Time-based features for forecasting
    EXTRACT(HOUR FROM timestamp) as hour_of_day,
    EXTRACT(DOW FROM timestamp) as day_of_week,
    EXTRACT(DAY FROM timestamp) as day_of_month,

    -- Seasonal indicators
    CASE 
      WHEN EXTRACT(MONTH FROM timestamp) IN (12, 1, 2) THEN 'winter'
      WHEN EXTRACT(MONTH FROM timestamp) IN (3, 4, 5) THEN 'spring'
      WHEN EXTRACT(MONTH FROM timestamp) IN (6, 7, 8) THEN 'summer'
      ELSE 'autumn'
    END as season,

    -- Operational context features
    MODE() WITHIN GROUP (ORDER BY JSON_EXTRACT(context, '$.operational_context.equipment_mode')) as predominant_mode

  FROM sensor_readings
  WHERE 
    timestamp >= CURRENT_TIMESTAMP - INTERVAL '30 days'
    AND JSON_EXTRACT(quality_metrics, '$.overall_score')::DECIMAL >= 0.8
  GROUP BY 
    device_id, sensor_type, hour_bucket, hour_of_day, day_of_week, day_of_month, season
),

trend_analysis AS (
  SELECT 
    hp.*,

    -- Moving averages for trend analysis
    AVG(hp.hourly_avg_value) OVER (
      PARTITION BY hp.device_id, hp.sensor_type 
      ORDER BY hp.hour_bucket 
      ROWS BETWEEN 23 PRECEDING AND CURRENT ROW
    ) as moving_avg_24h,

    AVG(hp.hourly_avg_value) OVER (
      PARTITION BY hp.device_id, hp.sensor_type 
      ORDER BY hp.hour_bucket 
      ROWS BETWEEN 167 PRECEDING AND CURRENT ROW  -- 7 days * 24 hours
    ) as moving_avg_7d,

    -- Lag values for change detection
    LAG(hp.hourly_avg_value, 1) OVER (
      PARTITION BY hp.device_id, hp.sensor_type 
      ORDER BY hp.hour_bucket
    ) as prev_hour_value,

    LAG(hp.hourly_avg_value, 24) OVER (
      PARTITION BY hp.device_id, hp.sensor_type 
      ORDER BY hp.hour_bucket
    ) as prev_day_same_hour_value,

    LAG(hp.hourly_avg_value, 168) OVER (
      PARTITION BY hp.device_id, hp.sensor_type 
      ORDER BY hp.hour_bucket
    ) as prev_week_same_hour_value,

    -- Seasonal comparison
    AVG(hp.hourly_avg_value) OVER (
      PARTITION BY hp.device_id, hp.sensor_type, hp.hour_of_day, hp.day_of_week
      ORDER BY hp.hour_bucket
      ROWS BETWEEN 4 PRECEDING AND 4 PRECEDING  -- 4 weeks ago: this partition holds one row per week for the same hour/day
    ) as seasonal_baseline

  FROM historical_patterns hp
),

predictive_indicators AS (
  SELECT 
    ta.*,

    -- Change calculations
    COALESCE(ta.hourly_avg_value - ta.prev_hour_value, 0) as hourly_change,
    COALESCE(ta.hourly_avg_value - ta.prev_day_same_hour_value, 0) as daily_change,
    COALESCE(ta.hourly_avg_value - ta.prev_week_same_hour_value, 0) as weekly_change,
    COALESCE(ta.hourly_avg_value - ta.seasonal_baseline, 0) as seasonal_deviation,

    -- Trend direction indicators
    CASE 
      WHEN ta.hourly_avg_value > ta.moving_avg_24h * 1.05 THEN 'upward'
      WHEN ta.hourly_avg_value < ta.moving_avg_24h * 0.95 THEN 'downward'  
      ELSE 'stable'
    END as short_term_trend,

    CASE 
      WHEN ta.moving_avg_24h > ta.moving_avg_7d * 1.02 THEN 'increasing'
      WHEN ta.moving_avg_24h < ta.moving_avg_7d * 0.98 THEN 'decreasing'
      ELSE 'steady'
    END as long_term_trend,

    -- Volatility measures
    ABS(ta.hourly_avg_value - ta.moving_avg_24h) / NULLIF(ABS(ta.moving_avg_24h), 0) as relative_volatility,

    -- Anomaly scoring
    CASE 
      WHEN ABS(ta.hourly_avg_value - ta.seasonal_baseline) > (ta.moving_avg_7d * 0.3) THEN 'high_anomaly'
      WHEN ABS(ta.hourly_avg_value - ta.seasonal_baseline) > (ta.moving_avg_7d * 0.15) THEN 'moderate_anomaly'
      WHEN ABS(ta.hourly_avg_value - ta.seasonal_baseline) > (ta.moving_avg_7d * 0.05) THEN 'low_anomaly'
      ELSE 'normal'
    END as anomaly_level,

    -- Predictive risk assessment
    CASE 
      WHEN ta.short_term_trend = 'upward' AND ta.long_term_trend = 'increasing' AND ta.relative_volatility > 0.2 THEN 'high_risk'
      WHEN ta.short_term_trend = 'downward' AND ta.long_term_trend = 'decreasing' AND ta.relative_volatility > 0.15 THEN 'high_risk'
      WHEN ta.relative_volatility > 0.25 THEN 'moderate_risk'
      WHEN ta.anomaly_level IN ('high_anomaly', 'moderate_anomaly') THEN 'moderate_risk'
      ELSE 'low_risk'
    END as predictive_risk_level

  FROM trend_analysis ta
  WHERE ta.hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '7 days'
)

-- Predictive analytics and forecasting results
SELECT 
  pi.device_id,
  pi.sensor_type,
  pi.hour_bucket,

  -- Current values and trends
  ROUND(pi.hourly_avg_value, 4) as current_value,
  ROUND(pi.moving_avg_24h, 4) as trend_24h,
  ROUND(pi.moving_avg_7d, 4) as trend_7d,

  -- Change analysis
  ROUND(pi.hourly_change, 4) as hour_to_hour_change,
  ROUND(pi.daily_change, 4) as day_to_day_change,
  ROUND(pi.weekly_change, 4) as week_to_week_change,
  ROUND(pi.seasonal_deviation, 4) as seasonal_variance,

  -- Trend classification
  pi.short_term_trend,
  pi.long_term_trend,
  pi.anomaly_level,
  pi.predictive_risk_level,

  -- Risk indicators
  ROUND(pi.relative_volatility * 100, 2) as volatility_percent,

  -- Simple linear forecast (next hour prediction)
  ROUND(
    pi.hourly_avg_value + 
    (COALESCE(pi.hourly_change, 0) * 0.7) + 
    (COALESCE(pi.daily_change, 0) * 0.2) + 
    (COALESCE(pi.weekly_change, 0) * 0.1), 
    4
  ) as predicted_next_hour_value,

  -- Confidence level for prediction
  CASE 
    WHEN pi.relative_volatility < 0.05 AND pi.anomaly_level = 'normal' THEN 'high'
    WHEN pi.relative_volatility < 0.15 AND pi.anomaly_level IN ('normal', 'low_anomaly') THEN 'medium'
    WHEN pi.relative_volatility < 0.30 THEN 'low'
    ELSE 'very_low'
  END as prediction_confidence,

  -- Maintenance and operational recommendations
  CASE 
    WHEN pi.predictive_risk_level = 'high_risk' THEN 'schedule_immediate_inspection'
    WHEN pi.anomaly_level = 'high_anomaly' THEN 'investigate_root_cause'
    WHEN pi.long_term_trend = 'decreasing' AND pi.sensor_type = 'efficiency' THEN 'schedule_maintenance'
    WHEN pi.relative_volatility > 0.2 THEN 'check_sensor_calibration'
    WHEN pi.short_term_trend != pi.long_term_trend THEN 'monitor_closely'
    ELSE 'continue_routine_monitoring'
  END as maintenance_recommendation

FROM predictive_indicators pi
WHERE pi.hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY 
  CASE pi.predictive_risk_level
    WHEN 'high_risk' THEN 1
    WHEN 'moderate_risk' THEN 2
    ELSE 3
  END,
  pi.relative_volatility DESC,
  pi.device_id,
  pi.hour_bucket DESC;

-- Real-time alerting and notification system
WITH real_time_monitoring AS (
  SELECT 
    JSON_EXTRACT(device_metadata, '$.device_id') as device_id,
    JSON_EXTRACT(device_metadata, '$.device_type') as device_type,
    JSON_EXTRACT(device_metadata, '$.location.facility') as facility,
    JSON_EXTRACT(sensor_info, '$.sensor_type') as sensor_type,
    timestamp,
    JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL as current_value,
    JSON_EXTRACT(quality_metrics, '$.overall_score')::DECIMAL as quality_score,

    -- Alert thresholds from configuration
    JSON_EXTRACT(context, '$.alert_configuration.threshold_settings.critical_high')::DECIMAL as critical_high_threshold,
    JSON_EXTRACT(context, '$.alert_configuration.threshold_settings.critical_low')::DECIMAL as critical_low_threshold,
    JSON_EXTRACT(context, '$.alert_configuration.threshold_settings.warning_high')::DECIMAL as warning_high_threshold,
    JSON_EXTRACT(context, '$.alert_configuration.threshold_settings.warning_low')::DECIMAL as warning_low_threshold,

    -- Quality thresholds
    JSON_EXTRACT(context, '$.alert_configuration.threshold_settings.min_quality_score')::DECIMAL as min_quality_threshold,

    -- Anomaly flags
    JSON_EXTRACT(quality_metrics, '$.anomaly_indicators.statistical_anomaly')::BOOLEAN as statistical_anomaly,
    JSON_EXTRACT(quality_metrics, '$.anomaly_indicators.temporal_anomaly')::BOOLEAN as temporal_anomaly,
    JSON_EXTRACT(quality_metrics, '$.anomaly_indicators.contextual_anomaly')::BOOLEAN as contextual_anomaly,

    -- Processing performance
    JSON_EXTRACT(processing_metadata, '$.processing_latency_ms')::DECIMAL as processing_latency,
    JSON_EXTRACT(device_metadata, '$.network_info.network_latency')::DECIMAL as network_latency

  FROM sensor_readings
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '5 minutes'
),

alert_evaluation AS (
  SELECT 
    rtm.*,

    -- Value-based alerts
    CASE 
      WHEN rtm.current_value >= rtm.critical_high_threshold THEN 'critical_high_value'
      WHEN rtm.current_value <= rtm.critical_low_threshold THEN 'critical_low_value'
      WHEN rtm.current_value >= rtm.warning_high_threshold THEN 'warning_high_value'
      WHEN rtm.current_value <= rtm.warning_low_threshold THEN 'warning_low_value'
      ELSE null
    END as value_alert_type,

    -- Quality-based alerts
    CASE 
      WHEN rtm.quality_score < rtm.min_quality_threshold THEN 'quality_degradation'
      ELSE null
    END as quality_alert_type,

    -- Anomaly-based alerts
    CASE 
      WHEN rtm.statistical_anomaly = true THEN 'statistical_anomaly_detected'
      WHEN rtm.temporal_anomaly = true THEN 'temporal_pattern_anomaly'
      WHEN rtm.contextual_anomaly = true THEN 'contextual_anomaly_detected'
      ELSE null
    END as anomaly_alert_type,

    -- Performance-based alerts
    CASE 
      WHEN rtm.processing_latency > 5000 THEN 'high_processing_latency'
      WHEN rtm.network_latency > 2000 THEN 'high_network_latency'
      ELSE null
    END as performance_alert_type,

    -- Severity calculation
    CASE 
      WHEN rtm.current_value >= rtm.critical_high_threshold OR rtm.current_value <= rtm.critical_low_threshold THEN 'critical'
      WHEN rtm.quality_score < (rtm.min_quality_threshold * 0.7) THEN 'critical'
      WHEN rtm.statistical_anomaly = true OR rtm.temporal_anomaly = true THEN 'high'
      WHEN rtm.current_value >= rtm.warning_high_threshold OR rtm.current_value <= rtm.warning_low_threshold THEN 'medium'
      WHEN rtm.quality_score < rtm.min_quality_threshold THEN 'medium'
      WHEN rtm.contextual_anomaly = true OR rtm.processing_latency > 5000 THEN 'low'
      ELSE null
    END as alert_severity

  FROM real_time_monitoring rtm
),

active_alerts AS (
  SELECT 
    ae.device_id,
    ae.device_type,
    ae.facility,
    ae.sensor_type,
    ae.timestamp as alert_timestamp,
    ae.current_value,
    ae.quality_score,

    -- Consolidate all alert types
    COALESCE(ae.value_alert_type, ae.quality_alert_type, ae.anomaly_alert_type, ae.performance_alert_type) as alert_type,
    ae.alert_severity,

    -- Alert context
    JSON_OBJECT(
      'current_reading', ae.current_value,
      'quality_score', ae.quality_score,
      'thresholds', JSON_OBJECT(
        'critical_high', ae.critical_high_threshold,
        'critical_low', ae.critical_low_threshold,
        'warning_high', ae.warning_high_threshold,
        'warning_low', ae.warning_low_threshold
      ),
      'anomaly_indicators', JSON_OBJECT(
        'statistical_anomaly', ae.statistical_anomaly,
        'temporal_anomaly', ae.temporal_anomaly,
        'contextual_anomaly', ae.contextual_anomaly
      ),
      'performance_metrics', JSON_OBJECT(
        'processing_latency_ms', ae.processing_latency,
        'network_latency_ms', ae.network_latency
      )
    ) as alert_context,

    -- Notification urgency
    CASE 
      WHEN ae.alert_severity = 'critical' THEN 'immediate'
      WHEN ae.alert_severity = 'high' THEN 'within_15_minutes'
      WHEN ae.alert_severity = 'medium' THEN 'within_1_hour'
      ELSE 'next_business_day'
    END as notification_urgency,

    -- Recommended actions
    CASE 
      WHEN ae.value_alert_type IN ('critical_high_value', 'critical_low_value') THEN 'emergency_shutdown_consider'
      WHEN ae.quality_alert_type = 'quality_degradation' THEN 'sensor_maintenance_required'
      WHEN ae.anomaly_alert_type IN ('statistical_anomaly_detected', 'temporal_pattern_anomaly') THEN 'investigate_anomaly_cause'
      WHEN ae.performance_alert_type = 'high_processing_latency' THEN 'check_system_resources'
      WHEN ae.performance_alert_type = 'high_network_latency' THEN 'check_network_connectivity'
      ELSE 'standard_investigation'
    END as recommended_action

  FROM alert_evaluation ae
  WHERE ae.alert_severity IS NOT NULL
)

-- Active alerts requiring immediate attention
SELECT 
  aa.alert_timestamp,
  aa.device_id,
  aa.device_type,
  aa.sensor_type,
  aa.facility,

  -- Alert details
  aa.alert_type,
  aa.alert_severity,
  aa.notification_urgency,
  aa.recommended_action,

  -- Current status
  aa.current_value as current_reading,
  aa.quality_score as current_quality,

  -- Alert context for operators
  aa.alert_context,

  -- Time since alert
  EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - aa.alert_timestamp)) / 60 as minutes_since_alert,

  -- Business impact assessment
  CASE 
    WHEN aa.alert_severity = 'critical' AND aa.device_type = 'safety_system' THEN 'safety_risk'
    WHEN aa.alert_severity = 'critical' AND aa.device_type = 'production_equipment' THEN 'production_impact'
    WHEN aa.alert_severity IN ('critical', 'high') AND aa.sensor_type = 'environmental' THEN 'compliance_risk'
    WHEN aa.alert_severity IN ('critical', 'high') THEN 'operational_impact'
    ELSE 'monitoring_required'
  END as business_impact_level,

  -- Next steps for operators
  JSON_OBJECT(
    'immediate_action', aa.recommended_action,
    'escalation_required', 
      CASE aa.alert_severity 
        WHEN 'critical' THEN true 
        ELSE false 
      END,
    'estimated_resolution_time', 
      CASE aa.alert_type
        WHEN 'quality_degradation' THEN '30-60 minutes'
        WHEN 'statistical_anomaly_detected' THEN '1-4 hours'
        WHEN 'critical_high_value' THEN '15-30 minutes'
        WHEN 'critical_low_value' THEN '15-30 minutes'
        ELSE '1-2 hours'
      END,
    'required_expertise', 
      CASE aa.alert_type
        WHEN 'quality_degradation' THEN 'maintenance_technician'
        WHEN 'statistical_anomaly_detected' THEN 'process_engineer'
        WHEN 'high_processing_latency' THEN 'it_support'
        WHEN 'high_network_latency' THEN 'network_administrator'
        ELSE 'operations_supervisor'
      END
  ) as operational_guidance

FROM active_alerts aa
ORDER BY 
  CASE aa.alert_severity
    WHEN 'critical' THEN 1
    WHEN 'high' THEN 2
    WHEN 'medium' THEN 3
    ELSE 4
  END,
  aa.alert_timestamp DESC;

-- QueryLeaf provides comprehensive IoT time series capabilities:
-- 1. SQL-familiar time series collection creation and optimization
-- 2. High-performance IoT data ingestion with automatic bucketing
-- 3. Real-time analytics and aggregation for sensor data
-- 4. Predictive analytics and trend analysis
-- 5. Comprehensive anomaly detection and alerting
-- 6. Performance monitoring and health assessment
-- 7. Integration with MongoDB's native time series optimizations
-- 8. Enterprise-ready IoT data management with familiar SQL syntax
-- 9. Automatic data lifecycle management and archiving
-- 10. Production-ready scalability for high-volume IoT workloads
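
The next-hour forecast in the predictive query above is a weighted sum of the most recent hourly, daily, and weekly changes. As a rough illustration of that arithmetic outside SQL, the sketch below reuses the 0.7/0.2/0.1 weights and the confidence thresholds from the query. The function names `forecastNextHour` and `predictionConfidence` are hypothetical and are not part of QueryLeaf or MongoDB.

```javascript
// Hypothetical helpers mirroring the SQL forecast above; not a QueryLeaf or MongoDB API.

// Weighted sum of recent changes; missing changes count as 0, like COALESCE(..., 0).
function forecastNextHour({ current, hourlyChange, dailyChange, weeklyChange }) {
  const predicted =
    current +
    (hourlyChange ?? 0) * 0.7 +
    (dailyChange ?? 0) * 0.2 +
    (weeklyChange ?? 0) * 0.1;
  // Round to 4 decimal places, similar to ROUND(..., 4) in the query.
  return Math.round(predicted * 1e4) / 1e4;
}

// Same thresholds as the prediction_confidence CASE expression.
function predictionConfidence(relativeVolatility, anomalyLevel) {
  // A NULL volatility in SQL falls through to the ELSE branch; do the same here.
  if (relativeVolatility == null) return 'very_low';
  if (relativeVolatility < 0.05 && anomalyLevel === 'normal') return 'high';
  if (relativeVolatility < 0.15 && ['normal', 'low_anomaly'].includes(anomalyLevel)) return 'medium';
  if (relativeVolatility < 0.30) return 'low';
  return 'very_low';
}
```

For a current value of 10 with changes of +1 (hourly), +2 (daily), and -1 (weekly), the forecast is 10 + 0.7 + 0.4 - 0.1 = 11. The heavy weight on the hourly change means the estimate mostly extrapolates the latest movement, which is why the query pairs it with a volatility-based confidence level.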

Best Practices for Time Series Implementation

IoT Data Architecture and Performance Optimization

Essential principles for effective MongoDB Time Series deployment in IoT environments:

Collection Design: Create purpose-built time series collections with optimal bucketing strategies for different sensor types and data frequencies
Metadata Strategy: Design comprehensive metadata schemas that enable efficient filtering and provide rich context for analytics
Ingestion Optimization: Implement batch ingestion patterns and write concern configurations optimized for IoT write workloads
Query Patterns: Design aggregation pipelines that leverage time series optimizations for common IoT analytics patterns
Real-Time Processing: Implement change streams and real-time processing pipelines for immediate anomaly detection and alerting
Data Lifecycle: Establish automated data retention and archiving strategies to manage long-term storage costs
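
To make the collection design and data lifecycle points more concrete, here is a minimal sketch of how the options for a time series collection might be assembled. The helper name `buildTimeSeriesOptions`, the field names `timestamp` and `device_metadata`, and the interval cut-offs used to choose a granularity are assumptions for illustration. The `timeseries` and `expireAfterSeconds` options themselves are standard MongoDB collection options.

```javascript
// Hypothetical helper: builds options for db.createCollection(name, options).
// Field names and the interval cut-offs below are assumptions for illustration.
function buildTimeSeriesOptions({ expectedIntervalSeconds, retentionDays }) {
  // Pick the granularity closest to the expected gap between readings from one device.
  let granularity = 'hours';
  if (expectedIntervalSeconds < 60) granularity = 'seconds';
  else if (expectedIntervalSeconds < 3600) granularity = 'minutes';

  const options = {
    timeseries: {
      timeField: 'timestamp',        // assumed name of the reading time field
      metaField: 'device_metadata',  // assumed name of the per-device metadata field
      granularity,
    },
  };

  // Optional retention: documents older than this are removed automatically.
  if (retentionDays != null) {
    options.expireAfterSeconds = retentionDays * 24 * 60 * 60;
  }
  return options;
}

const sensorOptions = buildTimeSeriesOptions({ expectedIntervalSeconds: 30, retentionDays: 90 });
// With the Node.js driver, these options would be passed as:
//   await db.createCollection('sensor_readings', sensorOptions);
```

Keeping device-level attributes in the meta field and only measurements in the document body lets MongoDB group readings from the same source into the same buckets, which is what the bucketing and filtering advice above relies on.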

Production IoT Systems and Operational Excellence

Design time series systems for enterprise IoT requirements:

Scalable Architecture: Implement horizontally scalable time series infrastructure with proper sharding and distribution strategies
Performance Monitoring: Establish comprehensive monitoring for write performance, query latency, and storage utilization
Alert Management: Create intelligent alerting systems that reduce noise while ensuring critical issues are detected promptly
Edge Integration: Design systems that work efficiently with edge computing environments and intermittent connectivity
Security Implementation: Implement device authentication, data encryption, and access controls appropriate for IoT environments
Compliance Features: Build in data governance, audit trails, and regulatory compliance capabilities for industrial applications
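
One simple way to reduce alert noise, as suggested under Alert Management, is to suppress repeats of the same alert from the same device within a cooldown window. The sketch below is only one possible approach. The name `createAlertSuppressor`, the alert shape (`deviceId`, `alertType`, `severity`), and the rule that critical alerts are never suppressed are all assumptions.

```javascript
// Hypothetical noise-reduction helper: suppress repeats of the same alert within a cooldown window.
function createAlertSuppressor(cooldownMs) {
  const lastSent = new Map(); // key -> timestamp (ms) of the last notification sent

  return function shouldNotify(alert, nowMs = Date.now()) {
    // Assumption: critical alerts are always delivered.
    if (alert.severity === 'critical') return true;

    const key = `${alert.deviceId}:${alert.alertType}`;
    const previous = lastSent.get(key);
    if (previous !== undefined && nowMs - previous < cooldownMs) {
      return false; // same device and alert type fired recently
    }
    lastSent.set(key, nowMs);
    return true;
  };
}
```

The state lives in process memory here, so a production system running several alerting workers would need a shared store for the last-sent timestamps.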

Conclusion

MongoDB Time Series Collections handle IoT data without the manual time-based partitioning and tuning that traditional setups typically need: bucketing, compression, and time-aware query optimization are built in. Native support for high-volume writes, real-time aggregations, and time-based analytics makes Time Series Collections a strong fit for IoT applications that need both scale and performance.

Key Time Series Collections benefits include:

Automatic Optimization: Native bucketing and compression eliminate manual partitioning and maintenance overhead
High-Performance Writes: Optimized storage engine designed for high-volume, time-stamped data ingestion
Intelligent Compression: Automatic compression can reduce storage costs by up to 90% compared to traditional approaches, depending on the shape and regularity of the data
Real-Time Analytics: Built-in aggregation optimization for time-based queries and real-time processing
Flexible Data Models: Rich document structure accommodates complex IoT metadata alongside time series measurements
SQL Accessibility: Familiar SQL-style time series operations through QueryLeaf for accessible IoT data management

Whether you're building industrial monitoring systems, smart city infrastructure, environmental sensors, or enterprise IoT platforms, MongoDB Time Series Collections with QueryLeaf's familiar SQL interface provides the foundation for scalable, efficient IoT data management.

QueryLeaf Integration: QueryLeaf manages MongoDB Time Series Collections through SQL-familiar syntax for time series operations, real-time analytics, and IoT-specific query patterns. Capabilities such as automatic bucketing, predictive analytics, and enterprise alerting are expressed through familiar SQL constructs, which keeps sophisticated IoT data management accessible to SQL-oriented development teams.

Combining MongoDB's time series capabilities with SQL-style data operations suits applications that need both high-performance IoT data storage and familiar database interaction patterns, so your time series infrastructure can scale while remaining simple to operate and productive for developers.

November 7, 2025
24 min read

MongoDB Change Streams for Event-Driven Microservices: Real-Time Data Processing and Distributed System Architecture

Modern distributed applications require sophisticated event-driven architectures that can react to data changes in real-time, maintain consistency across microservices, and process streaming data with minimal latency. Traditional database approaches struggle to provide efficient change detection, often requiring complex polling mechanisms, external message brokers, or custom trigger implementations that introduce significant overhead and operational complexity.

MongoDB Change Streams provide native, real-time change detection that lets applications react to database modifications shortly after they are committed. Unlike approaches that rely on periodic polling or complex event sourcing implementations, Change Streams deliver ordered, resumable streams of database changes that integrate with microservices architectures and event-driven patterns.
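
As a first taste of what "ordered and resumable" means in practice, the sketch below consumes a change stream and keeps track of the last resume token it has processed. It assumes a Node.js driver version whose change streams support `for await` iteration. The helper names `buildOrderChangePipeline` and `consumeChanges` and the `maxEvents` parameter are hypothetical, while `collection.watch()`, `fullDocument: 'updateLookup'`, and `resumeAfter` come from the driver's change stream API.

```javascript
// Hypothetical helpers; `collection` is expected to be a MongoDB driver Collection.

// Stages passed to watch() filter change events on the server side.
function buildOrderChangePipeline() {
  return [{ $match: { operationType: { $in: ['insert', 'update'] } } }];
}

// Processes change events in order and returns the last resume token seen,
// so a restarted consumer can continue from where it stopped.
async function consumeChanges(collection, handleChange, { resumeToken, maxEvents = Infinity } = {}) {
  const options = { fullDocument: 'updateLookup' }; // include the current document on updates
  if (resumeToken) options.resumeAfter = resumeToken;

  const stream = collection.watch(buildOrderChangePipeline(), options);
  let lastToken = resumeToken ?? null;
  let seen = 0;
  try {
    for await (const change of stream) {
      await handleChange(change);
      lastToken = change._id; // each event's _id is its resume token
      seen += 1;
      if (seen >= maxEvents) break;
    }
  } finally {
    await stream.close();
  }
  return lastToken;
}
```

Persisting the returned token only after `handleChange` succeeds gives at-least-once processing: after a crash, the consumer resumes from the last event it fully handled and may see the in-flight event again.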

The Traditional Change Detection Challenge

Conventional approaches to detecting and reacting to database changes have significant limitations for modern applications:

-- Traditional PostgreSQL change detection - complex and resource-intensive

-- Polling-based approach with timestamps
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    order_status VARCHAR(50) DEFAULT 'pending',
    total_amount DECIMAL(10,2) NOT NULL,
    items JSONB NOT NULL,
    shipping_address JSONB NOT NULL,
    payment_info JSONB,

    -- Tracking fields for change detection
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    version INTEGER DEFAULT 1,

    -- Change tracking
    last_processed_at TIMESTAMP,
    change_events TEXT[] DEFAULT ARRAY[]::TEXT[]
);

-- Indexes for polling queries (PostgreSQL creates these with separate statements)
CREATE INDEX idx_orders_updated_at ON orders (updated_at);
CREATE INDEX idx_orders_status_updated ON orders (order_status, updated_at);
CREATE INDEX idx_orders_processing ON orders (last_processed_at, updated_at);

-- Trigger-based change tracking (complex maintenance)
CREATE TABLE order_change_log (
    log_id SERIAL PRIMARY KEY,
    order_id INTEGER REFERENCES orders(order_id),
    change_type VARCHAR(20) NOT NULL, -- INSERT, UPDATE, DELETE
    old_values JSONB,
    new_values JSONB,
    changed_fields TEXT[],
    changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    processed BOOLEAN DEFAULT FALSE,
    processing_attempts INTEGER DEFAULT 0
);

CREATE INDEX idx_change_log_processing ON order_change_log (processed, changed_at);
CREATE INDEX idx_change_log_order ON order_change_log (order_id, changed_at);

-- Complex trigger function for change tracking
CREATE OR REPLACE FUNCTION track_order_changes()
RETURNS TRIGGER AS $$
DECLARE
    old_json JSONB;
    new_json JSONB;
    changed_fields TEXT[] := ARRAY[]::TEXT[];
    field_name TEXT;
BEGIN
    -- Handle different operation types
    IF TG_OP = 'DELETE' THEN
        INSERT INTO order_change_log (order_id, change_type, old_values)
        VALUES (OLD.order_id, 'DELETE', to_jsonb(OLD));

        RETURN OLD;
    END IF;

    IF TG_OP = 'INSERT' THEN
        INSERT INTO order_change_log (order_id, change_type, new_values)
        VALUES (NEW.order_id, 'INSERT', to_jsonb(NEW));

        RETURN NEW;
    END IF;

    -- UPDATE operation - detect changed fields
    old_json := to_jsonb(OLD);
    new_json := to_jsonb(NEW);

    -- Compare each field
    FOR field_name IN 
        SELECT DISTINCT key 
        FROM jsonb_each(old_json) 
        UNION 
        SELECT DISTINCT key 
        FROM jsonb_each(new_json)
    LOOP
        IF (old_json->field_name) IS DISTINCT FROM (new_json->field_name) THEN
            changed_fields := array_append(changed_fields, field_name);
        END IF;
    END LOOP;

    -- Only log if fields actually changed
    IF array_length(changed_fields, 1) > 0 THEN
        INSERT INTO order_change_log (
            order_id, change_type, old_values, new_values, changed_fields
        ) VALUES (
            NEW.order_id, 'UPDATE', old_json, new_json, changed_fields
        );

        -- Update version and tracking fields
        NEW.updated_at := CURRENT_TIMESTAMP;
        NEW.version := OLD.version + 1;
        NEW.change_events := array_append(OLD.change_events, 
            'updated_' || array_to_string(changed_fields, ',') || '_at_' || 
            extract(epoch from CURRENT_TIMESTAMP)::text
        );
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Create triggers (high overhead on write operations)
CREATE TRIGGER orders_change_trigger
    BEFORE INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE FUNCTION track_order_changes();

-- Application polling logic (inefficient and high-latency)
WITH pending_changes AS (
    SELECT 
        ocl.*,
        o.order_status,
        o.customer_id,
        o.total_amount,

        -- Determine change significance
        CASE 
            WHEN 'order_status' = ANY(ocl.changed_fields) THEN 'high'
            WHEN 'total_amount' = ANY(ocl.changed_fields) THEN 'medium'
            WHEN 'items' = ANY(ocl.changed_fields) THEN 'medium'
            ELSE 'low'
        END as change_priority,

        -- Calculate processing delay
        EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - ocl.changed_at)) as delay_seconds

    FROM order_change_log ocl
    -- LEFT JOIN so that DELETE events, whose order row no longer exists, are still returned
    LEFT JOIN orders o ON ocl.order_id = o.order_id
    WHERE 
        ocl.processed = FALSE
        AND ocl.processing_attempts < 3
        AND ocl.changed_at >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
),
prioritized_changes AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id 
            ORDER BY changed_at DESC
        ) as change_sequence,

        -- Batch processing grouping
        CASE change_priority
            WHEN 'high' THEN 1
            WHEN 'medium' THEN 2  
            ELSE 3
        END as processing_batch

    FROM pending_changes
)
SELECT 
    pc.log_id,
    pc.order_id,
    pc.change_type,
    pc.changed_fields,
    pc.old_values,
    pc.new_values,
    pc.change_priority,
    pc.delay_seconds,
    pc.processing_batch,

    -- Processing metadata
    CASE 
        WHEN pc.delay_seconds > 300 THEN 'DELAYED'
        WHEN pc.processing_attempts > 0 THEN 'RETRY'
        ELSE 'READY'
    END as processing_status,

    -- Related order context
    pc.order_status,
    pc.customer_id,
    pc.total_amount

FROM prioritized_changes pc
WHERE 
    pc.change_sequence = 1 -- Only latest change per order
    AND (
        pc.change_priority = 'high' 
        OR (pc.change_priority = 'medium' AND pc.delay_seconds < 60)
        OR (pc.change_priority = 'low' AND pc.delay_seconds < 300)
    )
ORDER BY 
    pc.processing_batch,
    pc.changed_at ASC
LIMIT 100;

-- Problems with traditional change detection:
-- 1. High overhead from triggers on every write operation
-- 2. Complex polling logic with high latency and resource usage  
-- 3. Risk of missing changes during application downtime
-- 4. Difficult to scale across multiple application instances
-- 5. No guaranteed delivery or ordering of change events
-- 6. Complex state management for processed vs unprocessed changes
-- 7. Performance degradation with high-volume write workloads
-- 8. Backup and restore complications with change log tables
-- 9. Cross-database change coordination challenges
-- 10. Limited filtering and transformation capabilities

-- MySQL change detection (even more limited)
CREATE TABLE mysql_orders (
    id INT AUTO_INCREMENT PRIMARY KEY,
    status VARCHAR(50),
    amount DECIMAL(10,2),
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,

    INDEX(updated_at)
);

-- Basic polling approach (MySQL does support row-level triggers; this example
-- relies on the updated_at timestamp instead)
SELECT 
    id, status, amount, updated_at,
    UNIX_TIMESTAMP() - UNIX_TIMESTAMP(updated_at) as age_seconds
FROM mysql_orders
WHERE updated_at > DATE_SUB(NOW(), INTERVAL 5 MINUTE)
ORDER BY updated_at DESC
LIMIT 1000;

-- Limitations of timestamp polling in MySQL:
-- - Only row-level triggers are available, and change logging must be
--   hand-written per table
-- - Deletes are invisible to an updated_at poll without a separate log table
-- - Polling adds latency and repeated query load as change volume grows
-- - No application-level change stream API; streaming changes means reading
--   the binary log through external CDC tooling
-- - Custom implementation required for real-time processing
-- - Coordinating pollers across distributed application instances is difficult
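The latency cost of polling can be made concrete with a little arithmetic. The sketch below assumes changes arrive at uniformly random moments between polls, so a change waits half the polling interval on average before it is noticed. The helper name `estimatePollingDelay` and its parameters are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper: estimates how long a change waits before a poller sees it.
// Assumes changes arrive uniformly at random between two consecutive polls.
function estimatePollingDelay(pollIntervalMs, queryTimeMs) {
  const MS_PER_DAY = 24 * 60 * 60 * 1000;
  return {
    // On average a change lands halfway through the interval
    averageDelayMs: pollIntervalMs / 2 + queryTimeMs,
    // Worst case: the change lands just after a poll has started
    worstCaseDelayMs: pollIntervalMs + queryTimeMs,
    // Queries issued per day, whether or not anything changed
    queriesPerDay: Math.floor(MS_PER_DAY / pollIntervalMs)
  };
}

console.log(estimatePollingDelay(5000, 40));
// { averageDelayMs: 2540, worstCaseDelayMs: 5040, queriesPerDay: 17280 }
```

Shortening the interval reduces the delay but raises the number of queries in proportion, which is the trade-off that push-based change notification avoids.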

MongoDB Change Streams provide real-time change detection without triggers or polling queries. They are built on the oplog, so they require a replica set or sharded cluster:

// MongoDB Change Streams - native real-time change processing
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('ecommerce');

// Comprehensive order management with change stream support
const setupOrderManagement = async () => {
  const orders = db.collection('orders');

  // Create sample order document structure
  const orderDocument = {
    _id: new ObjectId(),
    customerId: new ObjectId("64a1b2c3d4e5f6789012347a"),

    // Order details
    orderNumber: "ORD-2024-001234",
    status: "pending", // pending, confirmed, processing, shipped, delivered, cancelled

    // Financial information
    financial: {
      subtotal: 1459.97, // sum of items[].totalPrice
      tax: 24.00,
      shipping: 15.99,
      discount: 25.00,
      total: 1474.96, // subtotal + tax + shipping - discount
      currency: "USD"
    },

    // Items with detailed tracking
    items: [
      {
        productId: new ObjectId("64b2c3d4e5f6789012347b1a"),
        sku: "LAPTOP-PRO-2024",
        name: "Professional Laptop 2024",
        quantity: 1,
        unitPrice: 1299.99,
        totalPrice: 1299.99,

        // Item-level tracking
        status: "pending", // pending, reserved, picked, shipped
        warehouse: "WEST-01",
        trackingNumber: null
      },
      {
        productId: new ObjectId("64b2c3d4e5f6789012347b1b"), 
        sku: "MOUSE-WIRELESS-PREMIUM",
        name: "Premium Wireless Mouse",
        quantity: 2,
        unitPrice: 79.99,
        totalPrice: 159.98,
        status: "pending",
        warehouse: "WEST-01",
        trackingNumber: null
      }
    ],

    // Customer information
    customer: {
      customerId: new ObjectId("64a1b2c3d4e5f6789012347a"),
      email: "customer@example.com",
      name: "John Smith",
      phone: "+1-555-0123",
      loyaltyTier: "gold"
    },

    // Shipping details
    shipping: {
      method: "standard", // standard, express, overnight
      carrier: "FedEx",
      trackingNumber: null,

      address: {
        street: "123 Main Street",
        unit: "Apt 4B", 
        city: "San Francisco",
        state: "CA",
        country: "USA",
        postalCode: "94105",

        // Geospatial data for logistics
        coordinates: {
          type: "Point",
          coordinates: [-122.3937, 37.7955]
        }
      },

      // Delivery preferences
      preferences: {
        signatureRequired: true,
        leaveAtDoor: false,
        deliveryInstructions: "Ring doorbell, apartment entrance on left side",
        preferredTimeWindow: "9AM-12PM"
      }
    },

    // Payment information
    payment: {
      method: "credit_card", // credit_card, paypal, apple_pay, etc.
      status: "pending", // pending, authorized, captured, failed, refunded
      transactionId: null,

      // Payment processor details (sensitive data encrypted/redacted)
      processor: {
        name: "stripe",
        paymentIntentId: "pi_1234567890abcdef",
        chargeId: null,
        receiptUrl: null
      },

      // Billing address
      billingAddress: {
        street: "123 Main Street", 
        city: "San Francisco",
        state: "CA",
        country: "USA",
        postalCode: "94105"
      }
    },

    // Fulfillment tracking
    fulfillment: {
      warehouseId: "WEST-01",
      assignedAt: null,
      pickedAt: null,
      packedAt: null,
      shippedAt: null,
      deliveredAt: null,

      // Fulfillment team
      assignedTo: {
        pickerId: null,
        packerId: null
      },

      // Special handling
      specialInstructions: [],
      requiresSignature: true,
      isFragile: false,
      isGift: false
    },

    // Analytics and tracking
    analytics: {
      source: "web", // web, mobile, api, phone
      campaign: "summer_sale_2024",
      referrer: "google_ads",
      sessionId: "sess_abc123def456",

      // Customer journey
      customerJourney: [
        {
          event: "product_view",
          productId: new ObjectId("64b2c3d4e5f6789012347b1a"),
          timestamp: new Date("2024-11-07T14:15:00Z")
        },
        {
          event: "add_to_cart", 
          productId: new ObjectId("64b2c3d4e5f6789012347b1a"),
          timestamp: new Date("2024-11-07T14:18:00Z")
        },
        {
          event: "checkout_initiated",
          timestamp: new Date("2024-11-07T14:25:00Z")
        }
      ]
    },

    // Communication history
    communications: [
      {
        type: "email",
        subject: "Order Confirmation",
        sentAt: new Date("2024-11-07T14:30:00Z"),
        status: "sent",
        templateId: "order_confirmation_v2"
      }
    ],

    // Audit trail
    audit: {
      createdAt: new Date("2024-11-07T14:30:00Z"),
      createdBy: "system",
      updatedAt: new Date("2024-11-07T14:30:00Z"),
      updatedBy: "system",
      version: 1,

      // Change history for compliance
      changes: []
    },

    // System metadata
    metadata: {
      environment: "production",
      region: "us-west-1",
      tenantId: "tenant_123",

      // Feature flags
      features: {
        realTimeTracking: true,
        smsNotifications: true,
        expressDelivery: false
      }
    }
  };

  // Insert sample order
  await orders.insertOne(orderDocument);

  // Create indexes for the lookups that event handlers perform
  // (change streams read the oplog and do not use collection indexes)
  await orders.createIndex({ status: 1, "audit.updatedAt": 1 });
  await orders.createIndex({ customerId: 1, "audit.createdAt": -1 });
  await orders.createIndex({ "items.status": 1 });
  await orders.createIndex({ "payment.status": 1 });
  await orders.createIndex({ "shipping.trackingNumber": 1 });

  console.log('Order management setup completed');
  return orders;
};

// Advanced Change Stream processing for event-driven architecture
class OrderEventProcessor {
  constructor(db) {
    this.db = db;
    this.orders = db.collection('orders');
    this.eventHandlers = new Map();
    this.changeStream = null;
    this.resumeToken = null;
    this.processedEvents = new Set();

    // Event processing statistics
    this.stats = {
      eventsProcessed: 0,
      eventsSkipped: 0,
      processingErrors: 0,
      lastProcessedAt: null
    };
  }

  async startChangeStreamProcessing() {
    console.log('Starting MongoDB Change Stream processing...');

    // Configure change stream with comprehensive options
    const changeStreamOptions = {
      // Pipeline to filter relevant changes
      pipeline: [
        {
          // Only process specific operation types
          $match: {
            operationType: { $in: ['insert', 'update', 'delete', 'replace'] }
          }
        },
        {
          // Add additional metadata for processing
          $addFields: {
            // Extract key fields for quick processing decisions
            documentKey: '$documentKey',
            changeType: '$operationType',
            changedFields: { $objectToArray: '$updateDescription.updatedFields' },
            removedFields: '$updateDescription.removedFields',

            // Processing metadata
            processingPriority: {
              $switch: {
                branches: [
                  // High priority changes
                  {
                    case: {
                      $or: [
                        { $eq: ['$operationType', 'insert'] },
                        { $eq: ['$operationType', 'delete'] },
                        {
                          $anyElementTrue: {
                            $map: {
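                              // updatedFields is an object; convert it to [{ k, v }] pairs
                              // (empty when the event has no updateDescription)
                              input: {
                                $objectToArray: {
                                  $ifNull: ['$updateDescription.updatedFields', { $literal: {} }]
                                }
                              },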
                              in: {
                                $regexMatch: {
                                  input: '$$this.k',
                                  regex: '^(status|payment\\.status|fulfillment\\.).*'
                                }
                              }
                            }
                          }
                        }
                      ]
                    },
                    then: 'high'
                  },
                  // Medium priority changes
                  {
                    case: {
                      $anyElementTrue: {
                        $map: {
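                          // Convert the updatedFields object to [{ k, v }] pairs
                          input: {
                            $objectToArray: {
                              $ifNull: ['$updateDescription.updatedFields', { $literal: {} }]
                            }
                          },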
                          in: {
                            $regexMatch: {
                              input: '$$this.k',
                              regex: '^(items\\.|shipping\\.|customer\\.).*'
                            }
                          }
                        }
                      }
                    },
                    then: 'medium'
                  }
                ],
                default: 'low'
              }
            }
          }
        }
      ],

      // Change stream configuration
      fullDocument: 'updateLookup', // Always include full document
      // Include the pre-change document when available (requires MongoDB 6.0+ and
      // changeStreamPreAndPostImages enabled on the collection)
      fullDocumentBeforeChange: 'whenAvailable',
      maxAwaitTimeMS: 1000, // Maximum wait time for new changes
      batchSize: 100, // Process changes in batches

      // Resume from stored token if available
      startAfter: this.resumeToken
    };

    try {
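      // Create change stream on orders collection.
      // watch() takes the pipeline and the options as separate arguments;
      // startAfter is only passed once a resume token has been stored
      const { pipeline, startAfter, ...options } = changeStreamOptions;
      if (startAfter) {
        options.startAfter = startAfter;
      }
      this.changeStream = this.orders.watch(pipeline, options);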

      console.log('Change stream established - listening for events...');

      // Process change events asynchronously
      for await (const change of this.changeStream) {
        try {
          await this.processChangeEvent(change);

          // Store resume token for recovery
          this.resumeToken = change._id;

          // Update statistics
          this.stats.eventsProcessed++;
          this.stats.lastProcessedAt = new Date();

        } catch (error) {
          console.error('Error processing change event:', error);
          this.stats.processingErrors++;

          // Implement retry logic or dead letter queue here
          await this.handleProcessingError(change, error);
        }
      }

    } catch (error) {
      console.error('Change stream error:', error);

      // Implement reconnection logic
      await this.handleChangeStreamError(error);
    }
  }

  async processChangeEvent(change) {
    const { operationType, fullDocument, documentKey, updateDescription } = change;
    const orderId = documentKey._id;

    console.log(`Processing ${operationType} event for order ${orderId}`);

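    // Prevent duplicate processing. The resume token is an object, so use its
    // _data string; calling toString() on it would yield "[object Object]"
    const eventId = `${orderId}_${change._id._data}`;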
    if (this.processedEvents.has(eventId)) {
      console.log(`Skipping duplicate event: ${eventId}`);
      this.stats.eventsSkipped++;
      return;
    }

    // Add to processed events (with TTL cleanup)
    this.processedEvents.add(eventId);
    setTimeout(() => this.processedEvents.delete(eventId), 300000); // 5 minute TTL

    // Route event based on operation type and changed fields
    switch (operationType) {
      case 'insert':
        await this.handleOrderCreated(fullDocument);
        break;

      case 'update':
        await this.handleOrderUpdated(fullDocument, updateDescription, change.fullDocumentBeforeChange);
        break;

      case 'delete':
        await this.handleOrderDeleted(documentKey);
        break;

      case 'replace':
        await this.handleOrderReplaced(fullDocument, change.fullDocumentBeforeChange);
        break;

      default:
        console.warn(`Unhandled operation type: ${operationType}`);
    }
  }

  async handleOrderCreated(order) {
    console.log(`New order created: ${order.orderNumber}`);

    // Parallel processing of order creation events
    const creationTasks = [
      // Send order confirmation email
      this.sendOrderConfirmation(order),

      // Reserve inventory for ordered items
      this.reserveInventory(order),

      // Create payment authorization
      this.authorizePayment(order),

      // Notify fulfillment center
      this.notifyFulfillmentCenter(order),

      // Update customer metrics
      this.updateCustomerMetrics(order),

      // Log analytics event
      this.logAnalyticsEvent('order_created', order),

      // Check for fraud indicators
      this.performFraudCheck(order)
    ];

    try {
      const results = await Promise.allSettled(creationTasks);

      // Handle any failed tasks
      const failedTasks = results.filter(result => result.status === 'rejected');
      if (failedTasks.length > 0) {
        console.error(`${failedTasks.length} tasks failed for order creation:`, failedTasks);
        await this.handlePartialFailure(order, failedTasks);
      }

      console.log(`Order creation processing completed: ${order.orderNumber}`);

    } catch (error) {
      console.error(`Error processing order creation: ${error}`);
      throw error;
    }
  }

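  async handleOrderUpdated(order, updateDescription, previousOrder) {
    // With fullDocument: 'updateLookup', the document is null if it was deleted
    // before the lookup ran
    if (!order) {
      console.warn('Skipping update event: document no longer exists');
      return;
    }

    console.log(`Order updated: ${order.orderNumber}`);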

    const updatedFields = Object.keys(updateDescription.updatedFields || {});
    const removedFields = updateDescription.removedFields || [];

    // Process specific field changes
    for (const fieldPath of updatedFields) {
      await this.processFieldChange(order, previousOrder, fieldPath, updateDescription.updatedFields[fieldPath]);
    }

    // Handle removed fields
    for (const fieldPath of removedFields) {
      await this.processFieldRemoval(order, previousOrder, fieldPath);
    }

    // Log comprehensive change event
    await this.logAnalyticsEvent('order_updated', {
      order,
      changedFields: updatedFields,
      removedFields: removedFields
    });
  }

  async processFieldChange(order, previousOrder, fieldPath, newValue) {
    console.log(`Field changed: ${fieldPath} = ${JSON.stringify(newValue)}`);

    // Route processing based on changed field
    if (fieldPath === 'status') {
      await this.handleStatusChange(order, previousOrder);
    } else if (fieldPath.startsWith('payment.')) {
      await this.handlePaymentChange(order, previousOrder, fieldPath, newValue);
    } else if (fieldPath.startsWith('fulfillment.')) {
      await this.handleFulfillmentChange(order, previousOrder, fieldPath, newValue);
    } else if (fieldPath.startsWith('shipping.')) {
      await this.handleShippingChange(order, previousOrder, fieldPath, newValue);
    } else if (fieldPath.startsWith('items.')) {
      await this.handleItemChange(order, previousOrder, fieldPath, newValue);
    }
  }

  async handleStatusChange(order, previousOrder) {
    const newStatus = order.status;
    const previousStatus = previousOrder?.status;

    console.log(`Order status changed: ${previousStatus} → ${newStatus}`);

    // Status-specific processing
    switch (newStatus) {
      case 'confirmed':
        await Promise.all([
          this.processPayment(order),
          this.sendStatusUpdateEmail(order, 'order_confirmed'),
          this.createFulfillmentTasks(order)
        ]);
        break;

      case 'processing':
        await Promise.all([
          this.notifyWarehouse(order),
          this.updateInventoryReservations(order),
          this.sendStatusUpdateEmail(order, 'order_processing')
        ]);
        break;

      case 'shipped':
        await Promise.all([
          this.generateTrackingInfo(order),
          this.sendShippingNotification(order),
          this.releaseInventoryReservations(order),
          this.updateDeliveryEstimate(order)
        ]);
        break;

      case 'delivered':
        await Promise.all([
          this.sendDeliveryConfirmation(order),
          this.triggerReviewRequest(order),
          this.updateCustomerLoyaltyPoints(order),
          this.closeOrderInSystems(order)
        ]);
        break;

      case 'cancelled':
        await Promise.all([
          this.processRefund(order),
          this.releaseInventoryReservations(order),
          this.sendCancellationNotification(order),
          this.updateAnalytics(order, 'cancelled')
        ]);
        break;
    }
  }

  async handlePaymentChange(order, previousOrder, fieldPath, newValue) {
    console.log(`Payment change: ${fieldPath} = ${newValue}`);

    if (fieldPath === 'payment.status') {
      switch (newValue) {
        case 'authorized':
          await this.handlePaymentAuthorized(order);
          break;
        case 'captured':
          await this.handlePaymentCaptured(order);
          break;
        case 'failed':
          await this.handlePaymentFailed(order);
          break;
        case 'refunded':
          await this.handlePaymentRefunded(order);
          break;
      }
    }
  }

  async handleShippingChange(order, previousOrder, fieldPath, newValue) {
    if (fieldPath === 'shipping.trackingNumber' && newValue) {
      console.log(`Tracking number assigned: ${newValue}`);

      // Send tracking information to customer
      await Promise.all([
        this.sendTrackingInfo(order),
        this.setupDeliveryNotifications(order),
        this.updateShippingPartnerSystems(order)
      ]);
    }
  }

  // Implementation of helper methods for event processing
  async sendOrderConfirmation(order) {
    console.log(`Sending order confirmation for ${order.orderNumber}`);

    // Simulate email service call
    const emailData = {
      to: order.customer.email,
      subject: `Order Confirmation - ${order.orderNumber}`,
      template: 'order_confirmation',
      data: {
        orderNumber: order.orderNumber,
        customerName: order.customer.name,
        items: order.items,
        total: order.financial.total,
        estimatedDelivery: this.calculateDeliveryDate(order)
      }
    };

    // Would integrate with actual email service
    await this.sendEmail(emailData);
  }

  async reserveInventory(order) {
    console.log(`Reserving inventory for order ${order.orderNumber}`);

    const inventoryUpdates = order.items.map(item => ({
      productId: item.productId,
      sku: item.sku,
      quantity: item.quantity,
      reservedFor: order._id,
      reservedAt: new Date()
    }));

    // Update inventory collection
    const inventory = this.db.collection('inventory');

    for (const update of inventoryUpdates) {
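      const result = await inventory.updateOne(
        { 
          productId: update.productId,
          availableQuantity: { $gte: update.quantity }
        },
        {
          $inc: { 
            availableQuantity: -update.quantity,
            reservedQuantity: update.quantity
          },
          $push: {
            reservations: {
              orderId: update.reservedFor,
              quantity: update.quantity,
              reservedAt: update.reservedAt
            }
          }
        }
      );

      // Nothing was updated: the product is missing or has insufficient stock
      if (result.modifiedCount === 0) {
        throw new Error(`Insufficient inventory for SKU ${update.sku}`);
      }
    }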
  }

  async authorizePayment(order) {
    console.log(`Authorizing payment for order ${order.orderNumber}`);

    // Simulate payment processor call
    const paymentResult = await this.callPaymentProcessor({
      action: 'authorize',
      amount: order.financial.total,
      currency: order.financial.currency,
      paymentMethod: order.payment.method,
      paymentIntentId: order.payment.processor.paymentIntentId
    });

    if (paymentResult.success) {
      // Update order with payment authorization
      await this.orders.updateOne(
        { _id: order._id },
        {
          $set: {
            'payment.status': 'authorized',
            'payment.processor.chargeId': paymentResult.chargeId,
            'audit.updatedAt': new Date(),
            'audit.updatedBy': 'payment_processor'
          },
          $inc: { 'audit.version': 1 }
        }
      );
    } else {
      throw new Error(`Payment authorization failed: ${paymentResult.error}`);
    }
  }

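  // Helper methods (simplified implementations). Other handlers referenced above,
  // such as performFraudCheck or handlePaymentAuthorized, are omitted here and
  // must be supplied by the application before this class is run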
  async sendEmail(emailData) {
    console.log(`Email sent: ${emailData.subject} to ${emailData.to}`);
  }

  async callPaymentProcessor(request) {
    // Simulate payment processor response
    await new Promise(resolve => setTimeout(resolve, 100));
    return {
      success: true,
      chargeId: `ch_${Math.random().toString(36).slice(2, 11)}`
    };
  }

  calculateDeliveryDate(order) {
    const baseDate = new Date();
    const daysToAdd = order.shipping.method === 'express' ? 2 : 
                      order.shipping.method === 'overnight' ? 1 : 5;
    baseDate.setDate(baseDate.getDate() + daysToAdd);
    return baseDate;
  }

  async logAnalyticsEvent(eventType, data) {
    const analytics = this.db.collection('analytics_events');
    await analytics.insertOne({
      eventType,
      data,
      timestamp: new Date(),
      source: 'change_stream_processor'
    });
  }

  async handleProcessingError(change, error) {
    console.error(`Processing error for change ${change._id?._data}:`, error);

    // Log error for monitoring
    const errorLog = this.db.collection('processing_errors');
    await errorLog.insertOne({
      changeId: change._id,
      operationType: change.operationType,
      documentKey: change.documentKey,
      error: {
        message: error.message,
        stack: error.stack
      },
      timestamp: new Date(),
      retryCount: 0
    });
  }

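  async handleChangeStreamError(error) {
    console.error('Change stream error:', error);

    // Release the failed cursor before reconnecting
    if (this.changeStream) {
      await this.changeStream.close().catch(() => {});
      this.changeStream = null;
    }

    // Wait before attempting reconnection
    await new Promise(resolve => setTimeout(resolve, 5000));

    // Restart change stream processing (resumes from this.resumeToken when set)
    await this.startChangeStreamProcessing();
  }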

  getProcessingStatistics() {
    return {
      ...this.stats,
      resumeToken: this.resumeToken,
      processedEventsInMemory: this.processedEvents.size
    };
  }
}

// Multi-service change stream coordination
class DistributedOrderEventSystem {
  constructor(db) {
    this.db = db;
    this.serviceProcessors = new Map();
    this.eventBus = new Map(); // Simple in-memory event bus
    this.globalStats = {
      totalEventsProcessed: 0,
      servicesActive: 0,
      lastProcessingTime: null
    };
  }

  async setupDistributedProcessing() {
    console.log('Setting up distributed order event processing...');

    // Create specialized processors for different services
    const services = [
      'inventory-service',
      'payment-service', 
      'fulfillment-service',
      'notification-service',
      'analytics-service',
      'customer-service'
    ];

    for (const serviceName of services) {
      const processor = new ServiceSpecificProcessor(this.db, serviceName, this);
      await processor.initialize();
      this.serviceProcessors.set(serviceName, processor);
      this.globalStats.servicesActive = this.serviceProcessors.size;
    }

    console.log(`Distributed processing setup completed with ${services.length} services`);
  }

  async publishEvent(eventType, data, source) {
    console.log(`Publishing event: ${eventType} from ${source}`);

    // Add to event bus
    if (!this.eventBus.has(eventType)) {
      this.eventBus.set(eventType, []);
    }

    const event = {
      id: new ObjectId(),
      type: eventType,
      data,
      source,
      timestamp: new Date(),
      processed: new Set()
    };

    this.eventBus.get(eventType).push(event);

    // Notify interested services
    for (const [serviceName, processor] of this.serviceProcessors.entries()) {
      if (processor.isInterestedInEvent(eventType)) {
        await processor.processEvent(event);
      }
    }

    this.globalStats.totalEventsProcessed++;
    this.globalStats.lastProcessingTime = new Date();
  }

  getGlobalStatistics() {
    const serviceStats = {};
    for (const [serviceName, processor] of this.serviceProcessors.entries()) {
      serviceStats[serviceName] = processor.getStatistics();
    }

    return {
      global: this.globalStats,
      services: serviceStats,
      eventBusSize: Array.from(this.eventBus.values()).reduce((total, events) => total + events.length, 0)
    };
  }
}

// Service-specific processor for handling events relevant to each microservice
class ServiceSpecificProcessor {
  constructor(db, serviceName, eventSystem) {
    this.db = db;
    this.serviceName = serviceName;
    this.eventSystem = eventSystem;
    this.eventFilters = new Map();
    this.stats = {
      eventsProcessed: 0,
      eventsFiltered: 0,
      lastProcessedAt: null
    };

    this.setupEventFilters();
  }

  setupEventFilters() {
    // Define which events each service cares about
    const filterConfigs = {
      'inventory-service': [
        'order_created',
        'order_cancelled', 
        'item_status_changed'
      ],
      'payment-service': [
        'order_created',
        'order_confirmed',
        'order_cancelled',
        'payment_status_changed'
      ],
      'fulfillment-service': [
        'order_confirmed',
        'payment_authorized',
        'inventory_reserved'
      ],
      'notification-service': [
        'order_created',
        'status_changed',
        'payment_status_changed',
        'shipping_updated'
      ],
      'analytics-service': [
        '*' // Analytics service processes all events
      ],
      'customer-service': [
        'order_created',
        'order_delivered',
        'order_cancelled'
      ]
    };

    const filters = filterConfigs[this.serviceName] || [];
    filters.forEach(filter => this.eventFilters.set(filter, true));
  }

  async initialize() {
    console.log(`Initializing ${this.serviceName} processor...`);

    // Service-specific initialization
    switch (this.serviceName) {
      case 'inventory-service':
        await this.initializeInventoryTracking();
        break;
      case 'payment-service':
        await this.initializePaymentProcessing();
        break;
      // ... other services
    }

    console.log(`${this.serviceName} processor initialized`);
  }

  isInterestedInEvent(eventType) {
    return this.eventFilters.has('*') || this.eventFilters.has(eventType);
  }

  async processEvent(event) {
    if (!this.isInterestedInEvent(event.type)) {
      this.stats.eventsFiltered++;
      return;
    }

    console.log(`${this.serviceName} processing event: ${event.type}`);

    try {
      // Service-specific event processing
      await this.handleServiceEvent(event);

      event.processed.add(this.serviceName);
      this.stats.eventsProcessed++;
      this.stats.lastProcessedAt = new Date();

    } catch (error) {
      console.error(`${this.serviceName} error processing event ${event.id}:`, error);
      throw error;
    }
  }

  async handleServiceEvent(event) {
    // Dispatch to service-specific handlers
    const handlerMethod = `handle${event.type.split('_').map(word => 
      word.charAt(0).toUpperCase() + word.slice(1)
    ).join('')}`;

    if (typeof this[handlerMethod] === 'function') {
      await this[handlerMethod](event);
    } else {
      console.warn(`No handler found: ${handlerMethod} in ${this.serviceName}`);
    }
  }

  // Service-specific event handlers
  async handleOrderCreated(event) {
    if (this.serviceName === 'inventory-service') {
      await this.reserveInventoryForOrder(event.data);
    } else if (this.serviceName === 'notification-service') {
      await this.sendOrderConfirmationEmail(event.data);
    }
  }

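  async handleStatusChanged(event) {
    // Only services subscribed to 'status_changed' in setupEventFilters reach this handler
    if (this.serviceName === 'notification-service') {
      console.log(`Sending status update notification for order: ${event.data.orderNumber}`);
      // Implementation would use a notification service
    }
  }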

  // Helper methods for specific services
  async reserveInventoryForOrder(order) {
    console.log(`Reserving inventory for order: ${order.orderNumber}`);
    // Implementation would interact with inventory management system
  }

  async sendOrderConfirmationEmail(order) {
    console.log(`Sending confirmation email for order: ${order.orderNumber}`);
    // Implementation would use email service
  }

  async initializeInventoryTracking() {
    // Setup inventory-specific collections and indexes
    const inventory = this.db.collection('inventory');
    await inventory.createIndex({ productId: 1, warehouse: 1 });
  }

  async initializePaymentProcessing() {
    // Setup payment-specific configurations
    console.log('Payment service initialized with fraud detection enabled');
  }

  getStatistics() {
    return this.stats;
  }
}

// Benefits of MongoDB Change Streams:
// - Real-time change detection with minimal latency
// - Native event sourcing capabilities without complex triggers  
// - Resumable streams with automatic recovery from failures
// - Ordered events per stream; persisted resume tokens allow at-least-once processing after a restart
// - Fine-grained filtering and transformation pipelines
// - Horizontal scaling across multiple application instances
// - Integration with MongoDB's replica set and sharding architecture
// - No polling overhead or resource waste
// - Built-in clustering and high availability support
// - Simple integration with existing MongoDB applications

module.exports = {
  setupOrderManagement,
  OrderEventProcessor,
  DistributedOrderEventSystem,
  ServiceSpecificProcessor
};
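
The dispatch in handleServiceEvent depends on a naming convention: an event type such as order_created maps to a method named handleOrderCreated. A minimal sketch of that mapping as a standalone function follows. The name toHandlerName is hypothetical (it does not appear in the classes above), and the sketch assumes event types are snake_case strings.

```javascript
// Hypothetical helper: derive a handler method name from a snake_case event type.
function toHandlerName(eventType) {
  const pascal = eventType
    .split('_')
    .filter(Boolean) // ignore empty segments from leading, trailing, or doubled underscores
    .map(word => word.charAt(0).toUpperCase() + word.slice(1))
    .join('');
  return `handle${pascal}`;
}

console.log(toHandlerName('order_created'));  // handleOrderCreated
console.log(toHandlerName('status_changed')); // handleStatusChanged
```

Event types that use another separator (for example order-created) produce a name that matches no method, so they would fall through to the "No handler found" warning branch.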

Understanding MongoDB Change Streams Architecture

Change Stream Processing Patterns

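MongoDB Change Streams are built on the oplog, so they require a replica set or sharded cluster. A stream can be opened on a single collection, a database, or an entire deployment, and the patterns below show the key capabilities for event-driven architectures: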

// Advanced change stream patterns and configurations
class AdvancedChangeStreamManager {
  constructor(client) {
    this.client = client;
    this.db = client.db('ecommerce');
    this.changeStreams = new Map();
    this.resumeTokens = new Map();
    this.errorHandlers = new Map();
  }

  async setupMultiCollectionStreams() {
    console.log('Setting up multi-collection change streams...');

    // 1. Collection-specific streams with targeted processing
    const collectionConfigs = [
      {
        name: 'orders',
        pipeline: [
          {
            $match: {
              $or: [
                { operationType: 'insert' },
                { operationType: 'update', 'updateDescription.updatedFields.status': { $exists: true } },
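                // Caution: this path matches when the whole `payment` subdocument is set. An update made with
                // $set: { 'payment.status': ... } is reported under the literal key 'payment.status',
                // which a dotted match path does not traverse.
                { operationType: 'update', 'updateDescription.updatedFields.payment.status': { $exists: true } }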
              ]
            }
          }
        ],
        handler: this.handleOrderChanges.bind(this)
      },
      {
        name: 'inventory', 
        pipeline: [
          {
            $match: {
              $or: [
                { operationType: 'update', 'updateDescription.updatedFields.availableQuantity': { $exists: true } },
                { operationType: 'update', 'updateDescription.updatedFields.reservedQuantity': { $exists: true } }
              ]
            }
          }
        ],
        handler: this.handleInventoryChanges.bind(this)
      },
      {
        name: 'customers',
        pipeline: [
          {
            $match: {
              operationType: { $in: ['insert', 'update'] },
              $or: [
                { 'fullDocument.loyaltyTier': { $exists: true } },
                { 'updateDescription.updatedFields.loyaltyTier': { $exists: true } },
                { 'updateDescription.updatedFields.preferences': { $exists: true } }
              ]
            }
          }
        ],
        handler: this.handleCustomerChanges.bind(this)
      }
    ];

    // Start streams for each collection
    for (const config of collectionConfigs) {
      await this.startCollectionStream(config);
    }

    // 2. Database-level change stream for cross-collection events
    await this.startDatabaseStream();

    console.log(`Started ${collectionConfigs.length + 1} change streams`);
  }

  async startCollectionStream(config) {
    const collection = this.db.collection(config.name);
    const resumeToken = this.resumeTokens.get(config.name);

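    // Options are passed separately from the pipeline: watch(pipeline, options)
    const options = {
      fullDocument: 'updateLookup',
      // Pre-images require MongoDB 6.0+ with changeStreamPreAndPostImages enabled on the collection
      fullDocumentBeforeChange: 'whenAvailable',
      maxAwaitTimeMS: 1000
    };

    // Only resume when a token has been loaded or recorded
    if (resumeToken) {
      options.startAfter = resumeToken;
    }

    try {
      const changeStream = collection.watch(config.pipeline, options);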
      this.changeStreams.set(config.name, changeStream);

      // Process changes asynchronously
      this.processChangeStream(config.name, changeStream, config.handler);

    } catch (error) {
      console.error(`Error starting stream for ${config.name}:`, error);
      this.scheduleStreamRestart(config);
    }
  }

  async startDatabaseStream() {
    // Database-level stream for cross-collection coordination
    const pipeline = [
      {
        $match: {
          // Monitor for significant cross-collection events
          $or: [
            { 
              operationType: 'insert',
              'fullDocument.metadata.requiresCrossCollectionSync': true
            },
            {
              operationType: 'update',
              'updateDescription.updatedFields.syncRequired': { $exists: true }
            }
          ]
        }
      },
      {
        $addFields: {
          // Add processing metadata
          collectionName: '$ns.coll',
          databaseName: '$ns.db',
          changeSignature: {
            $concat: [
              '$ns.coll', '_',
              '$operationType', '_',
              // clusterTime is a BSON Timestamp; convert it to a date before $toString
              { $toString: { $toDate: '$clusterTime' } }
            ]
          }
        }
      }
    ];

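    const options = {
      fullDocument: 'updateLookup',
      maxAwaitTimeMS: 2000
    };

    // Resume from the last recorded position when one is available
    const resumeToken = this.resumeTokens.get('_database');
    if (resumeToken) {
      options.startAfter = resumeToken;
    }

    // The pipeline is the first argument to watch(); options come second
    const dbStream = this.db.watch(pipeline, options);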
    this.changeStreams.set('_database', dbStream);

    this.processChangeStream('_database', dbStream, this.handleDatabaseChanges.bind(this));
  }

  async processChangeStream(streamName, changeStream, handler) {
    console.log(`Processing change stream: ${streamName}`);

    try {
      for await (const change of changeStream) {
        try {
          // Process the change first so a failed event is not marked as consumed
          await handler(change);

          // Record and persist the resume token only after successful processing
          this.resumeTokens.set(streamName, change._id);
          await this.persistResumeToken(streamName, change._id);

        } catch (processingError) {
          console.error(`Error processing change in ${streamName}:`, processingError);
          await this.handleProcessingError(streamName, change, processingError);
        }
      }
    } catch (streamError) {
      console.error(`Stream error in ${streamName}:`, streamError);
      await this.handleStreamError(streamName, streamError);
    }
  }

  async handleOrderChanges(change) {
    console.log(`Order change detected: ${change.operationType}`);

    const { operationType, fullDocument, documentKey, updateDescription } = change;

    // Route based on change type and affected fields
    if (operationType === 'insert') {
      await this.processNewOrder(fullDocument);
    } else if (operationType === 'update') {
      const updatedFields = Object.keys(updateDescription.updatedFields || {});

      // Process specific field updates
      if (updatedFields.includes('status')) {
        await this.processOrderStatusChange(fullDocument, updateDescription);
      }

      if (updatedFields.some(field => field.startsWith('payment.'))) {
        await this.processPaymentChange(fullDocument, updateDescription);
      }

      if (updatedFields.some(field => field.startsWith('fulfillment.'))) {
        await this.processFulfillmentChange(fullDocument, updateDescription);
      }
    }
  }

  async handleInventoryChanges(change) {
    console.log(`Inventory change detected: ${change.operationType}`);

    const { fullDocument, updateDescription } = change;
    const updatedFields = Object.keys(updateDescription.updatedFields || {});

    // Check for low stock conditions
    if (updatedFields.includes('availableQuantity')) {
      const newQuantity = updateDescription.updatedFields.availableQuantity;
      if (newQuantity <= fullDocument.reorderLevel) {
        await this.triggerReorderAlert(fullDocument);
      }
    }

    // Propagate inventory changes to dependent systems
    await this.syncInventoryWithExternalSystems(fullDocument, updatedFields);
  }

  async handleCustomerChanges(change) {
    console.log(`Customer change detected: ${change.operationType}`);

    const { fullDocument, updateDescription } = change;

    // Handle loyalty tier changes
    if (updateDescription?.updatedFields?.loyaltyTier) {
      await this.processLoyaltyTierChange(fullDocument, updateDescription);
    }

    // Handle preference updates
    if (updateDescription?.updatedFields?.preferences) {
      await this.updatePersonalizationEngine(fullDocument);
    }
  }

  async handleDatabaseChanges(change) {
    console.log(`Database-level change: ${change.collectionName}.${change.operationType}`);

    // Handle cross-collection synchronization events
    await this.coordinateCrossCollectionSync(change);
  }

  // Resilience and error handling
  async handleStreamError(streamName, error) {
    console.error(`Stream ${streamName} encountered error:`, error);

    // Implement exponential backoff for reconnection
    const baseDelay = 1000; // 1 second
    const maxRetries = 5;
    let retryCount = 0;

    while (retryCount < maxRetries) {
      const delay = baseDelay * Math.pow(2, retryCount);
      console.log(`Attempting to restart ${streamName} in ${delay}ms (retry ${retryCount + 1})`);

      await new Promise(resolve => setTimeout(resolve, delay));

      try {
        // Restart the specific stream
        await this.restartStream(streamName);
        console.log(`Successfully restarted ${streamName}`);
        break;
      } catch (restartError) {
        console.error(`Failed to restart ${streamName}:`, restartError);
        retryCount++;
      }
    }

    if (retryCount >= maxRetries) {
      console.error(`Failed to restart ${streamName} after ${maxRetries} attempts`);
      // Implement alerting for operations team
      await this.sendOperationalAlert(`Critical: Change stream ${streamName} failed to restart`);
    }
  }

  async restartStream(streamName) {
    // Close existing stream if it exists
    const existingStream = this.changeStreams.get(streamName);
    if (existingStream) {
      try {
        await existingStream.close();
      } catch (closeError) {
        console.warn(`Error closing ${streamName}:`, closeError);
      }
      this.changeStreams.delete(streamName);
    }

    // Restart based on stream type
    if (streamName === '_database') {
      await this.startDatabaseStream();
    } else {
      // Find and restart collection stream
      const config = this.getCollectionConfig(streamName);
      if (config) {
        await this.startCollectionStream(config);
      }
    }
  }

  async persistResumeToken(streamName, resumeToken) {
    // Store resume tokens in MongoDB for crash recovery
    const tokenCollection = this.db.collection('change_stream_tokens');

    await tokenCollection.updateOne(
      { streamName },
      {
        $set: {
          resumeToken,
          lastUpdated: new Date(),
          streamName
        }
      },
      { upsert: true }
    );
  }

  async loadPersistedResumeTokens() {
    console.log('Loading persisted resume tokens...');

    const tokenCollection = this.db.collection('change_stream_tokens');
    const tokens = await tokenCollection.find({}).toArray();

    for (const token of tokens) {
      this.resumeTokens.set(token.streamName, token.resumeToken);
      console.log(`Loaded resume token for ${token.streamName}`);
    }
  }

  // Performance monitoring and optimization
  async getChangeStreamMetrics() {
    const metrics = {
      activeStreams: this.changeStreams.size,
      resumeTokens: this.resumeTokens.size,
      streamStatus: {},
      systemHealth: await this.checkSystemHealth()
    };

    // Check status of each stream
    for (const [streamName, stream] of this.changeStreams.entries()) {
      metrics.streamStatus[streamName] = {
        isActive: !stream.closed,
        hasResumeToken: this.resumeTokens.has(streamName)
      };
    }

    return metrics;
  }

  async checkSystemHealth() {
    try {
      // Check MongoDB replica set status
      const replicaSetStatus = await this.client.db('admin').admin().replSetGetStatus();

      const healthMetrics = {
        replicaSetHealthy: replicaSetStatus.ok === 1,
        primaryNode: replicaSetStatus.members.find(member => member.state === 1)?.name,
        secondaryNodes: replicaSetStatus.members.filter(member => member.state === 2).length,
        oplogSize: await this.getOplogSize(),
        changeStreamSupported: true
      };

      return healthMetrics;
    } catch (error) {
      console.error('Error checking system health:', error);
      return {
        replicaSetHealthy: false,
        error: error.message
      };
    }
  }

  async getOplogSize() {
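    // Check oplog size to ensure sufficient retention for change streams.
    // Runs the collStats command directly because the Collection.stats() helper
    // is deprecated in recent Node.js drivers.
    const stats = await this.client.db('local').command({ collStats: 'oplog.rs' });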

    return {
      sizeBytes: stats.size,
      sizeMB: Math.round(stats.size / 1024 / 1024),
      maxSizeBytes: stats.maxSize,
      maxSizeMB: Math.round(stats.maxSize / 1024 / 1024),
      utilizationPercent: Math.round((stats.size / stats.maxSize) * 100)
    };
  }

  // Cleanup and shutdown
  async shutdown() {
    console.log('Shutting down change stream manager...');

    const shutdownPromises = [];

    // Close all active streams
    for (const [streamName, stream] of this.changeStreams.entries()) {
      console.log(`Closing stream: ${streamName}`);
      shutdownPromises.push(
        stream.close().catch(error => 
          console.warn(`Error closing ${streamName}:`, error)
        )
      );
    }

    await Promise.allSettled(shutdownPromises);

    // Clear internal state
    this.changeStreams.clear();
    this.resumeTokens.clear();

    console.log('Change stream manager shutdown complete');
  }
}

// Helper methods for event processing
async function processNewOrder(order) {
  console.log(`Processing new order: ${order.orderNumber}`);

  // Comprehensive order processing workflow
  const processingTasks = [
    validateOrderData(order),
    checkInventoryAvailability(order), 
    validatePaymentMethod(order),
    calculateShippingOptions(order),
    applyPromotionsAndDiscounts(order),
    createFulfillmentWorkflow(order),
    sendCustomerNotifications(order),
    updateAnalyticsAndReporting(order)
  ];

  const results = await Promise.allSettled(processingTasks);

  // Handle any failed tasks
  const failures = results.filter(result => result.status === 'rejected');
  if (failures.length > 0) {
    console.error(`${failures.length} tasks failed for order ${order.orderNumber}`);
    await handleOrderProcessingFailures(order, failures);
  }
}

async function triggerReorderAlert(inventoryItem) {
  console.log(`Low stock alert: ${inventoryItem.sku} - ${inventoryItem.availableQuantity} remaining`);

  // Create automatic reorder if conditions are met
  if (inventoryItem.autoReorder && inventoryItem.availableQuantity <= inventoryItem.criticalLevel) {
    const reorderQuantity = inventoryItem.maxStock - inventoryItem.availableQuantity;

    await createPurchaseOrder({
      productId: inventoryItem.productId,
      sku: inventoryItem.sku,
      quantity: reorderQuantity,
      supplier: inventoryItem.preferredSupplier,
      urgency: 'high',
      reason: 'automated_reorder_low_stock'
    });
  }
}

// Example helper implementations
async function validateOrderData(order) {
  // Comprehensive order validation
  const validationResults = {
    customerValid: await validateCustomer(order.customerId),
    itemsValid: await validateOrderItems(order.items),
    addressValid: await validateShippingAddress(order.shipping.address),
    paymentValid: await validatePaymentInfo(order.payment)
  };

  const isValid = Object.values(validationResults).every(result => result === true);
  if (!isValid) {
    throw new Error(`Order validation failed: ${JSON.stringify(validationResults)}`);
  }
}

async function createPurchaseOrder(orderData) {
  console.log(`Creating purchase order: ${orderData.sku} x ${orderData.quantity}`);
  // Implementation would create purchase order in procurement system
}

async function sendOperationalAlert(message) {
  console.error(`OPERATIONAL ALERT: ${message}`);
  // Implementation would integrate with alerting system (PagerDuty, Slack, etc.)
}
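
The field-based routing in handleOrderChanges can be isolated into a small pure function, which makes it easy to test without a running stream. The sketch below is illustrative: classifyOrderUpdate is a hypothetical name, and it assumes the standard update event shape with updateDescription.updatedFields (an object) and updateDescription.removedFields (an array of field names). Unlike the prefix check above, it also treats replacement of a whole subdocument (the key payment rather than payment.status) as a payment change.

```javascript
// Hypothetical helper: decide which order handlers an update event should trigger.
function classifyOrderUpdate(updateDescription) {
  const { updatedFields = {}, removedFields = [] } = updateDescription || {};
  const touched = [...Object.keys(updatedFields), ...removedFields];

  // A field counts if it was set directly ("payment") or through a dotted path ("payment.status").
  const touches = name => touched.some(f => f === name || f.startsWith(`${name}.`));

  return ['status', 'payment', 'fulfillment'].filter(touches);
}

console.log(classifyOrderUpdate({
  updatedFields: { status: 'shipped', 'payment.status': 'captured' }
}));
// [ 'status', 'payment' ]
```

A handler can then loop over the returned names and call the matching processing method for each one.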

SQL-Style Change Stream Operations with QueryLeaf

QueryLeaf provides SQL-familiar syntax for MongoDB Change Stream operations:

-- QueryLeaf change stream operations with SQL-familiar syntax

-- Create change stream listeners with SQL-style syntax
CREATE CHANGE STREAM order_status_changes 
ON orders 
WHERE 
  operation_type IN ('update', 'insert')
  AND (
    changed_fields CONTAINS 'status' 
    OR changed_fields CONTAINS 'payment.status'
  )
WITH (
  full_document = 'update_lookup',
  full_document_before_change = 'when_available',
  max_await_time = '1 second',
  batch_size = 50
);

-- Multi-collection change stream with filtering
CREATE CHANGE STREAM inventory_and_orders
ON DATABASE ecommerce
WHERE 
  collection_name IN ('orders', 'inventory', 'products')
  AND (
    (collection_name = 'orders' AND operation_type = 'insert')
    OR (collection_name = 'inventory' AND changed_fields CONTAINS 'availableQuantity')
    OR (collection_name = 'products' AND changed_fields CONTAINS 'price')
  )
WITH (
  resume_after = '8264BEB9F3000000012B0229296E04'
);

-- Real-time order processing with change stream triggers
CREATE TRIGGER process_order_changes
ON CHANGE STREAM order_status_changes
FOR EACH CHANGE AS
BEGIN
  -- Route processing based on change type
  CASE change.operation_type
    WHEN 'insert' THEN
      -- New order created
      CALL process_new_order(change.full_document);

      -- Send notifications
      INSERT INTO notification_queue (
        recipient, 
        type, 
        message, 
        data
      )
      VALUES (
        change.full_document.customer.email,
        'order_confirmation',
        'Your order has been received',
        change.full_document
      );

    WHEN 'update' THEN
      -- Order updated - check what changed
      IF change.changed_fields CONTAINS 'status' THEN
        CALL process_status_change(
          change.full_document,
          change.update_description.updated_fields.status
        );
      END IF;

      IF change.changed_fields CONTAINS 'payment.status' THEN
        CALL process_payment_status_change(
          change.full_document,
          change.update_description.updated_fields['payment.status']
        );
      END IF;
  END CASE;

  -- Update processing metrics
  UPDATE change_stream_metrics 
  SET 
    events_processed = events_processed + 1,
    last_processed_at = CURRENT_TIMESTAMP
  WHERE stream_name = 'order_status_changes';
END;

-- Change stream analytics and monitoring
WITH change_stream_analytics AS (
  SELECT 
    stream_name,
    operation_type,
    collection_name,
    DATE_TRUNC('minute', change_timestamp) as minute_bucket,

    COUNT(*) as change_count,
    COUNT(DISTINCT document_key._id) as unique_documents,

    -- Processing latency analysis
    AVG(processing_time_ms) as avg_processing_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY processing_time_ms) as p95_processing_time,

    -- Change characteristics
    COUNT(*) FILTER (WHERE operation_type = 'insert') as insert_count,
    COUNT(*) FILTER (WHERE operation_type = 'update') as update_count,
    COUNT(*) FILTER (WHERE operation_type = 'delete') as delete_count,

    -- Field change patterns
    STRING_AGG(DISTINCT changed_fields, ',') as common_changed_fields,

    -- Error tracking
    COUNT(*) FILTER (WHERE processing_status = 'error') as error_count,
    COUNT(*) FILTER (WHERE processing_status = 'retry') as retry_count

  FROM change_stream_events
  WHERE change_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY stream_name, operation_type, collection_name, minute_bucket
),
stream_performance AS (
  SELECT 
    stream_name,
    SUM(change_count) as total_changes,
    AVG(avg_processing_time) as overall_avg_processing_time,
    MAX(p95_processing_time) as max_p95_processing_time,

    -- Throughput analysis (the query window is one hour, i.e. 3600 seconds)
    SUM(change_count) / 3600.0 as changes_per_second,

    -- Error rates
    SUM(error_count) as total_errors,
    (SUM(error_count)::numeric / SUM(change_count)) * 100 as error_rate_percent,

    -- Change type distribution
    SUM(insert_count) as total_inserts,
    SUM(update_count) as total_updates, 
    SUM(delete_count) as total_deletes,

    -- Field change frequency
    COUNT(DISTINCT common_changed_fields) as unique_field_patterns,

    -- Performance assessment
    CASE 
      WHEN AVG(avg_processing_time) > 1000 THEN 'SLOW'
      WHEN AVG(avg_processing_time) > 500 THEN 'MODERATE'
      ELSE 'FAST'
    END as performance_rating,

    -- Health indicators
    CASE
      WHEN (SUM(error_count)::numeric / SUM(change_count)) > 0.05 THEN 'UNHEALTHY'
      WHEN (SUM(error_count)::numeric / SUM(change_count)) > 0.01 THEN 'WARNING' 
      ELSE 'HEALTHY'
    END as health_status

  FROM change_stream_analytics
  GROUP BY stream_name
)
SELECT 
  sp.stream_name,
  sp.total_changes,
  ROUND(sp.changes_per_second, 2) as changes_per_sec,
  ROUND(sp.overall_avg_processing_time, 1) as avg_processing_ms,
  ROUND(sp.max_p95_processing_time, 1) as max_p95_ms,
  sp.performance_rating,
  sp.health_status,

  -- Change breakdown
  sp.total_inserts,
  sp.total_updates,
  sp.total_deletes,

  -- Error analysis  
  sp.total_errors,
  ROUND(sp.error_rate_percent, 2) as error_rate_pct,

  -- Field change patterns
  sp.unique_field_patterns,

  -- Recommendations
  CASE 
    WHEN sp.performance_rating = 'SLOW' THEN 'Optimize change processing logic or increase resources'
    WHEN sp.error_rate_percent > 5 THEN 'Investigate error patterns and improve error handling'
    WHEN sp.changes_per_second > 1000 THEN 'Consider stream partitioning for better throughput'
    ELSE 'Performance within acceptable parameters'
  END as recommendation

FROM stream_performance sp
ORDER BY sp.total_changes DESC;

-- Advanced change stream query patterns
CREATE VIEW real_time_order_insights AS
WITH order_changes AS (
  SELECT 
    full_document.*,
    change_timestamp,
    operation_type,
    changed_fields,

    -- Calculate order lifecycle timing
    CASE 
      WHEN operation_type = 'insert' THEN 'order_created'
      WHEN changed_fields CONTAINS 'status' THEN 
        CONCAT('status_changed_to_', full_document.status)
      WHEN changed_fields CONTAINS 'payment.status' THEN
        CONCAT('payment_', full_document.payment.status) 
      ELSE 'other_update'
    END as change_event_type,

    -- Time-based analytics
    DATE_TRUNC('hour', change_timestamp) as hour_bucket,
    EXTRACT(DOW FROM change_timestamp) as day_of_week,
    EXTRACT(HOUR FROM change_timestamp) as hour_of_day,

    -- Order value categories
    CASE 
      WHEN full_document.financial.total >= 500 THEN 'high_value'
      WHEN full_document.financial.total >= 100 THEN 'medium_value'
      ELSE 'low_value'
    END as order_value_category,

    -- Customer segment analysis
    full_document.customer.loyaltyTier as customer_segment,

    -- Geographic analysis
    full_document.shipping.address.state as shipping_state,
    full_document.shipping.address.country as shipping_country

  FROM CHANGE_STREAM(orders)
  WHERE change_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
),
order_metrics AS (
  SELECT 
    hour_bucket,
    day_of_week,
    hour_of_day,
    change_event_type,
    order_value_category,
    customer_segment,
    shipping_state,

    COUNT(*) as event_count,
    COUNT(DISTINCT full_document._id) as unique_orders,
    AVG(full_document.financial.total) as avg_order_value,
    SUM(full_document.financial.total) as total_order_value,

    -- Conversion funnel analysis
    COUNT(*) FILTER (WHERE change_event_type = 'order_created') as orders_created,
    COUNT(*) FILTER (WHERE change_event_type = 'status_changed_to_confirmed') as orders_confirmed,
    COUNT(*) FILTER (WHERE change_event_type = 'status_changed_to_shipped') as orders_shipped,
    COUNT(*) FILTER (WHERE change_event_type = 'status_changed_to_delivered') as orders_delivered,
    COUNT(*) FILTER (WHERE change_event_type = 'status_changed_to_cancelled') as orders_cancelled,

    -- Payment analysis
    COUNT(*) FILTER (WHERE change_event_type = 'payment_authorized') as payments_authorized,
    COUNT(*) FILTER (WHERE change_event_type = 'payment_captured') as payments_captured,
    COUNT(*) FILTER (WHERE change_event_type = 'payment_failed') as payments_failed,

    -- Customer behavior
    COUNT(DISTINCT full_document.customer.customerId) as unique_customers,
    AVG(ARRAY_LENGTH(full_document.items, 1)) as avg_items_per_order

  FROM order_changes
  GROUP BY 
    hour_bucket, day_of_week, hour_of_day, change_event_type,
    order_value_category, customer_segment, shipping_state
)
SELECT 
  hour_bucket,
  change_event_type,
  order_value_category,
  customer_segment,

  event_count,
  unique_orders,
  ROUND(avg_order_value, 2) as avg_order_value,
  ROUND(total_order_value, 2) as total_order_value,

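  -- Conversion rates
  -- Caution: order_metrics groups by change_event_type, so each row carries a single event type;
  -- these ratios are only meaningful after rolling up across event types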
  CASE 
    WHEN orders_created > 0 THEN 
      ROUND((orders_confirmed::numeric / orders_created) * 100, 1)
    ELSE 0
  END as confirmation_rate_pct,

  CASE 
    WHEN orders_confirmed > 0 THEN
      ROUND((orders_shipped::numeric / orders_confirmed) * 100, 1) 
    ELSE 0
  END as fulfillment_rate_pct,

  CASE
    WHEN orders_shipped > 0 THEN
      ROUND((orders_delivered::numeric / orders_shipped) * 100, 1)
    ELSE 0  
  END as delivery_rate_pct,

  -- Payment success rates
  CASE
    WHEN payments_authorized > 0 THEN
      ROUND((payments_captured::numeric / payments_authorized) * 100, 1)
    ELSE 0
  END as payment_success_rate_pct,

  -- Business insights
  unique_customers,
  ROUND(avg_items_per_order, 1) as avg_items_per_order,

  -- Time-based patterns
  day_of_week,
  hour_of_day,

  -- Geographic insights
  shipping_state,

  -- Performance indicators
  CASE 
    WHEN change_event_type = 'order_created' AND event_count > 100 THEN 'HIGH_VOLUME'
    WHEN change_event_type = 'payment_failed' AND event_count > 10 THEN 'PAYMENT_ISSUES'
    WHEN change_event_type = 'status_changed_to_cancelled' AND event_count > 20 THEN 'HIGH_CANCELLATION'
    ELSE 'NORMAL'
  END as alert_status

FROM order_metrics
WHERE event_count > 0
ORDER BY hour_bucket DESC, event_count DESC;

-- Resume token management for change stream reliability
CREATE TABLE change_stream_resume_tokens (
  stream_name VARCHAR(255) PRIMARY KEY,
  resume_token TEXT NOT NULL,
  last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

  -- Stream configuration
  collection_name VARCHAR(255),
  database_name VARCHAR(255),
  filter_pipeline JSONB,

  -- Monitoring
  events_processed BIGINT DEFAULT 0,
  last_event_timestamp TIMESTAMP,
  stream_status VARCHAR(50) DEFAULT 'active',

  -- Performance tracking
  avg_processing_latency_ms INTEGER,
  last_error_message TEXT,
  last_error_timestamp TIMESTAMP,
  consecutive_errors INTEGER DEFAULT 0
);

-- Automatic resume token persistence
CREATE TRIGGER update_resume_tokens
AFTER INSERT OR UPDATE ON change_stream_events
FOR EACH ROW
EXECUTE FUNCTION update_stream_resume_token();

-- Change stream health monitoring
SELECT 
  cst.stream_name,
  cst.collection_name,
  cst.events_processed,
  cst.stream_status,

  -- Time since last activity
  EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - cst.last_event_timestamp)) / 60 as minutes_since_last_event,

  -- Performance metrics
  cst.avg_processing_latency_ms,
  cst.consecutive_errors,

  -- Health assessment
  CASE 
    WHEN cst.stream_status != 'active' THEN 'INACTIVE'
    WHEN cst.consecutive_errors >= 5 THEN 'FAILING'
    WHEN EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - cst.last_event_timestamp)) > 300 THEN 'STALE'
    WHEN cst.avg_processing_latency_ms > 1000 THEN 'SLOW'
    ELSE 'HEALTHY'
  END as health_status,

  -- Recovery information
  cst.resume_token,
  cst.last_updated,

  -- Error details
  cst.last_error_message,
  cst.last_error_timestamp

FROM change_stream_resume_tokens cst
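ORDER BY 
  -- PostgreSQL-style dialects cannot reference the health_status alias inside an
  -- ORDER BY expression, so the same conditions are repeated here as a severity rank
  CASE 
    WHEN cst.stream_status != 'active' THEN 2
    WHEN cst.consecutive_errors >= 5 THEN 1
    WHEN EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - cst.last_event_timestamp)) > 300 THEN 3
    WHEN cst.avg_processing_latency_ms > 1000 THEN 4
    ELSE 5
  END,
  cst.events_processed DESC;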

-- QueryLeaf change stream features provide:
-- 1. SQL-familiar syntax for MongoDB Change Stream operations
-- 2. Real-time event processing with familiar trigger patterns
-- 3. Advanced filtering and transformation using SQL expressions
-- 4. Built-in analytics and monitoring with SQL aggregation functions
-- 5. Resume token management for reliable stream processing
-- 6. Performance monitoring and health assessment queries
-- 7. Integration with existing SQL-based reporting and analytics
-- 8. Event-driven architecture patterns using familiar SQL constructs
-- 9. Multi-collection change coordination with SQL joins and unions
-- 10. Seamless scaling from simple change detection to complex event processing
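
The health assessment in the monitoring query above can also run in application code once a monitoring row has been loaded. The following is a sketch under stated assumptions: assessStreamHealth and sortBySeverity are hypothetical names, the camelCase fields are an assumed mapping of the table columns, and the thresholds and precedence are copied from the SQL CASE expression.

```javascript
// Hypothetical mirror of the SQL health CASE expression, with the same thresholds and precedence.
function assessStreamHealth(stream, now = new Date()) {
  const secondsSinceLastEvent = (now - stream.lastEventTimestamp) / 1000;

  if (stream.streamStatus !== 'active') return 'INACTIVE';
  if (stream.consecutiveErrors >= 5) return 'FAILING';
  if (secondsSinceLastEvent > 300) return 'STALE';
  if (stream.avgProcessingLatencyMs > 1000) return 'SLOW';
  return 'HEALTHY';
}

// Severity order used by the ORDER BY clause: most urgent first.
const SEVERITY = { FAILING: 1, INACTIVE: 2, STALE: 3, SLOW: 4, HEALTHY: 5 };

function sortBySeverity(streams, now = new Date()) {
  return streams
    .map(s => ({ ...s, healthStatus: assessStreamHealth(s, now) }))
    .sort((a, b) => SEVERITY[a.healthStatus] - SEVERITY[b.healthStatus]);
}

const now = new Date('2025-11-09T12:00:00Z');
const report = sortBySeverity([
  { streamName: 'orders', streamStatus: 'active', consecutiveErrors: 0,
    lastEventTimestamp: new Date('2025-11-09T11:59:30Z'), avgProcessingLatencyMs: 120 },
  { streamName: 'inventory', streamStatus: 'active', consecutiveErrors: 6,
    lastEventTimestamp: new Date('2025-11-09T11:59:50Z'), avgProcessingLatencyMs: 90 },
  { streamName: 'customers', streamStatus: 'active', consecutiveErrors: 0,
    lastEventTimestamp: new Date('2025-11-09T11:50:00Z'), avgProcessingLatencyMs: 80 }
], now);

console.log(report.map(r => `${r.streamName}:${r.healthStatus}`));
// [ 'inventory:FAILING', 'customers:STALE', 'orders:HEALTHY' ]
```

Passing the current time as a parameter keeps the function deterministic, which is convenient when the same logic backs both a dashboard and automated alerts.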

Best Practices for Change Stream Implementation

Performance and Scalability Considerations

Optimize Change Streams for high-throughput production environments:

Pipeline Filtering: Use aggregation pipelines to filter changes at the database level
Resume Token Management: Implement robust resume token persistence for crash recovery
Batch Processing: Process changes in batches to improve throughput
Resource Management: Monitor memory and connection usage for long-running streams
Error Handling: Implement comprehensive error handling and retry logic
Oplog Sizing: Size the oplog so that its retention window exceeds the longest expected consumer outage; a stream cannot resume from a token that has already rolled off the oplog
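
The batch processing and resume token practices above interact: a token should only be persisted once every event in its batch has been handled. A minimal sketch of that idea follows, assuming change events carry their resume token in _id as in the earlier examples. ChangeBatcher is a hypothetical helper, not part of the MongoDB driver.

```javascript
// Hypothetical batching helper: groups change events and exposes the resume token
// that is safe to persist once the whole batch has been handled.
class ChangeBatcher {
  constructor(maxSize = 50) {
    this.maxSize = maxSize;
    this.pending = [];
  }

  // Returns a full batch when maxSize is reached, otherwise null.
  add(change) {
    this.pending.push(change);
    return this.pending.length >= this.maxSize ? this.drain() : null;
  }

  // Returns whatever is pending (for timer-based or shutdown flushes), or null if empty.
  drain() {
    if (this.pending.length === 0) return null;
    const changes = this.pending;
    this.pending = [];
    return { changes, resumeToken: changes[changes.length - 1]._id };
  }
}

const batcher = new ChangeBatcher(2);
console.log(batcher.add({ _id: 't1', operationType: 'insert' }));             // null
console.log(batcher.add({ _id: 't2', operationType: 'update' }).resumeToken); // t2
```

If the process crashes mid-batch, the stream resumes from the previous batch's token and replays the unfinished events, so batch handlers should be idempotent.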

Event-Driven Architecture Patterns

Design scalable event-driven systems with Change Streams:

Event Sourcing: Use Change Streams as the foundation for event sourcing patterns
CQRS Integration: Implement Command Query Responsibility Segregation with change-driven read model updates
Microservice Communication: Coordinate microservices through change-driven events
Data Synchronization: Maintain consistency across distributed systems
Real-time Analytics: Power real-time dashboards and analytics with streaming changes
Audit and Compliance: Implement comprehensive audit trails with change event logging
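
As an illustration of the CQRS pattern listed above, a read model can be kept current by applying each change event to it. The sketch below uses an in-memory Map in place of a real read store. applyChange is a hypothetical function, and the event fields (operationType, documentKey, fullDocument, updateDescription) follow the change event shape used earlier in this article. When fullDocument is absent on an update, only top-level fields are merged; dotted paths are not expanded in this sketch.

```javascript
// Hypothetical read-model projector: applies change events to a Map keyed by document _id.
function applyChange(readModel, change) {
  const key = String(change.documentKey._id);

  switch (change.operationType) {
    case 'insert':
    case 'replace':
      readModel.set(key, { ...change.fullDocument });
      break;
    case 'update': {
      // With fullDocument: 'updateLookup' the looked-up document can replace the entry.
      if (change.fullDocument) {
        readModel.set(key, { ...change.fullDocument });
        break;
      }
      // Fallback: shallow merge of top-level fields only.
      const { updatedFields = {}, removedFields = [] } = change.updateDescription || {};
      const next = { ...(readModel.get(key) || {}), ...updatedFields };
      for (const field of removedFields) delete next[field];
      readModel.set(key, next);
      break;
    }
    case 'delete':
      readModel.delete(key);
      break;
    default:
      break; // other event types (invalidate, drop, ...) are ignored in this sketch
  }
  return readModel;
}

const orders = new Map();
applyChange(orders, { operationType: 'insert', documentKey: { _id: 'o1' },
  fullDocument: { _id: 'o1', status: 'pending' } });
applyChange(orders, { operationType: 'update', documentKey: { _id: 'o1' },
  updateDescription: { updatedFields: { status: 'shipped' }, removedFields: [] } });
console.log(orders.get('o1')); // { _id: 'o1', status: 'shipped' }
```

Because the projector is a pure function of the current model and one event, replaying the same events after a resume yields the same read model.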

Conclusion

MongoDB Change Streams provide real-time change detection that removes the need for polling and supports event-driven architectures. Because they are built on MongoDB's replica set architecture and offer resumable streams with fine-grained filtering, they make reactive applications easier to build reliably.

Key Change Stream benefits include:

Real-time Processing: Low-latency change detection without polling overhead
Reliable Delivery: Ordered, resumable streams that recover after a crash, provided the resume token is still within the oplog window
Rich Filtering: Aggregation pipeline-based change filtering and transformation
Horizontal Scaling: Native support for distributed processing across multiple application instances
Operational Simplicity: No external message brokers or complex trigger maintenance required
Event Sourcing Support: Built-in capabilities for implementing event sourcing patterns

Whether you're building microservices architectures, real-time analytics platforms, data synchronization systems, or event-driven applications, MongoDB Change Streams with QueryLeaf's familiar SQL interface provides the foundation for sophisticated reactive data processing. This combination enables you to implement complex event-driven functionality while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB Change Stream operations while providing SQL-familiar event processing syntax, resume token handling, and stream analytics functions. Advanced change detection, event routing, and stream monitoring are seamlessly handled through familiar SQL patterns, making event-driven architecture development both powerful and accessible.

The integration of native change detection capabilities with SQL-style stream processing makes MongoDB an ideal platform for applications requiring both real-time reactivity and familiar database interaction patterns, ensuring your event-driven solutions remain both effective and maintainable as they scale and evolve.

November 6, 2025
25 min read

MongoDB GridFS: Advanced Binary File Management and Distributed Storage for Large-Scale Applications

Modern applications need file storage that handles large binary files, supports efficient streaming, and integrates with existing data workflows while staying highly available. Traditional approaches struggle with large files, distributed deployments, metadata management, and coordinating file operations with database transactions, which leads to data inconsistency, performance bottlenecks, and operational complexity in production.

MongoDB GridFS stores large files inside the database by splitting each file into chunks (255 KB by default) and keeping the file's metadata in a companion collection. Unlike traditional file storage that requires separate file servers and custom synchronization logic, GridFS keeps files and their metadata in one system and inherits MongoDB's replication and sharding. One caveat: GridFS operations do not participate in multi-document transactions, so an upload is not atomic with other writes.
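
The chunking arithmetic is simple enough to sketch directly. GridFS keeps one document per file in the `files` collection and one document per chunk in the `chunks` collection, where each chunk carries a zero-based sequence number `n`. The helper below, `planChunks`, is a hypothetical name used only to show how a file length maps to chunk numbers and sizes; it does not talk to a database.

```javascript
const DEFAULT_CHUNK_SIZE = 255 * 1024; // GridFS default chunk size in bytes

// Describe how a file of `length` bytes is laid out as chunks.
// Every chunk is full-sized except possibly the last one.
function planChunks(length, chunkSizeBytes = DEFAULT_CHUNK_SIZE) {
  const count = Math.ceil(length / chunkSizeBytes);
  const chunks = [];
  for (let n = 0; n < count; n++) {
    const start = n * chunkSizeBytes;
    chunks.push({ n, size: Math.min(chunkSizeBytes, length - start) });
  }
  return chunks;
}

// A 600 KB file with the default chunk size needs three chunks
console.log(planChunks(600 * 1024));
```

For the 600 KB example this yields two full 261,120-byte chunks followed by a final 92,160-byte chunk.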

The Traditional File Storage Challenge

Conventional file storage architectures face significant limitations when handling large files and distributed systems:

-- Traditional PostgreSQL file storage - complex management and limited scalability

-- Basic file metadata table with limited binary storage capabilities
CREATE TABLE file_metadata (
    file_id BIGSERIAL PRIMARY KEY,
    original_filename VARCHAR(500) NOT NULL,
    content_type VARCHAR(200),
    file_size_bytes BIGINT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- File organization
    directory_path VARCHAR(1000),
    file_category VARCHAR(100),

    -- User and access control
    uploaded_by BIGINT NOT NULL,
    access_level VARCHAR(20) DEFAULT 'private',

    -- File processing status
    processing_status VARCHAR(50) DEFAULT 'pending',
    thumbnail_generated BOOLEAN DEFAULT FALSE,
    virus_scan_status VARCHAR(50) DEFAULT 'pending',

    -- Storage location (external file system required)
    storage_path VARCHAR(1500) NOT NULL,
    storage_server VARCHAR(200),
    backup_locations TEXT[],

    -- File versioning (complex to implement)
    version_number INTEGER DEFAULT 1,
    parent_file_id BIGINT REFERENCES file_metadata(file_id),
    is_current_version BOOLEAN DEFAULT TRUE,

    -- Performance optimization fields
    download_count BIGINT DEFAULT 0,
    last_accessed TIMESTAMP,

    -- File integrity
    md5_hash VARCHAR(32),
    sha256_hash VARCHAR(64),

    -- Metadata for different file types
    image_metadata JSONB,
    document_metadata JSONB,
    video_metadata JSONB,

    CONSTRAINT valid_access_level CHECK (access_level IN ('public', 'private', 'shared', 'restricted')),
    CONSTRAINT valid_processing_status CHECK (processing_status IN ('pending', 'processing', 'completed', 'failed'))
);

-- Complex indexing strategy for file management
CREATE INDEX idx_files_user_category ON file_metadata(uploaded_by, file_category, created_at DESC);
CREATE INDEX idx_files_directory ON file_metadata(directory_path, original_filename);
CREATE INDEX idx_files_size ON file_metadata(file_size_bytes DESC);
CREATE INDEX idx_files_type ON file_metadata(content_type, created_at DESC);
CREATE INDEX idx_files_processing ON file_metadata(processing_status, created_at);

-- File chunks table for large file handling (manual implementation required)
CREATE TABLE file_chunks (
    chunk_id BIGSERIAL PRIMARY KEY,
    file_id BIGINT NOT NULL REFERENCES file_metadata(file_id) ON DELETE CASCADE,
    chunk_number INTEGER NOT NULL,
    chunk_size INTEGER NOT NULL,
    chunk_data BYTEA NOT NULL, -- Limited to 1GB per field in PostgreSQL
    chunk_hash VARCHAR(64),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    UNIQUE(file_id, chunk_number)
);

CREATE INDEX idx_chunks_file_order ON file_chunks(file_id, chunk_number);

-- Complex file upload procedure with chunking logic
CREATE OR REPLACE FUNCTION upload_large_file(
    p_filename VARCHAR(500),
    p_content_type VARCHAR(200),
    p_file_data BYTEA,
    p_uploaded_by BIGINT,
    p_directory_path VARCHAR(1000) DEFAULT '/',
    p_chunk_size INTEGER DEFAULT 1048576 -- 1MB chunks
) RETURNS TABLE (
    file_id BIGINT,
    total_chunks INTEGER,
    upload_status VARCHAR(50),
    processing_time_ms INTEGER
) AS $$
DECLARE
    new_file_id BIGINT;
    file_size BIGINT;
    chunk_count INTEGER;
    chunk_data BYTEA;
    chunk_start INTEGER;
    chunk_end INTEGER;
    current_chunk INTEGER := 1;
    upload_start_time TIMESTAMP := clock_timestamp();
    file_hash VARCHAR(64);
BEGIN

    -- Calculate file size and hash (digest() requires the pgcrypto extension)
    file_size := LENGTH(p_file_data);
    file_hash := encode(digest(p_file_data, 'sha256'), 'hex');

    -- Insert file metadata
    INSERT INTO file_metadata (
        original_filename, content_type, file_size_bytes,
        uploaded_by, directory_path, storage_path,
        sha256_hash, processing_status
    ) VALUES (
        p_filename, p_content_type, file_size,
        p_uploaded_by, p_directory_path, 
        p_directory_path || '/' || p_filename,
        file_hash, 'processing'
    ) RETURNING file_metadata.file_id INTO new_file_id;

    -- Calculate number of chunks needed
    chunk_count := CEILING(file_size::DECIMAL / p_chunk_size);

    -- Process file in chunks (inefficient for large files)
    FOR current_chunk IN 1..chunk_count LOOP
        chunk_start := ((current_chunk - 1) * p_chunk_size) + 1;
        chunk_end := LEAST(current_chunk * p_chunk_size, file_size);

        -- Extract chunk data (memory intensive)
        chunk_data := SUBSTRING(p_file_data FROM chunk_start FOR (chunk_end - chunk_start + 1));

        -- Store chunk
        INSERT INTO file_chunks (
            file_id, chunk_number, chunk_size, chunk_data,
            chunk_hash
        ) VALUES (
            new_file_id, current_chunk, LENGTH(chunk_data), chunk_data,
            encode(digest(chunk_data, 'sha256'), 'hex')
        );

        -- A PL/pgSQL function cannot COMMIT mid-execution, so all chunks are
        -- written in one long transaction (a key limitation for large files)
    END LOOP;

    -- Update file status
    UPDATE file_metadata 
    SET processing_status = 'completed', updated_at = CURRENT_TIMESTAMP
    WHERE file_metadata.file_id = new_file_id;

    RETURN QUERY SELECT 
        new_file_id,
        chunk_count,
        'completed'::VARCHAR(50),
        -- EPOCH gives total elapsed seconds; MILLISECONDS would return only the seconds field
        (EXTRACT(EPOCH FROM clock_timestamp() - upload_start_time) * 1000)::INTEGER;

EXCEPTION WHEN OTHERS THEN
    -- Changes made in this block are rolled back automatically when the
    -- exception is caught, so no manual cleanup of file_metadata is needed
    RAISE EXCEPTION 'File upload failed: %', SQLERRM;
END;
$$ LANGUAGE plpgsql;

-- Complex file download procedure with chunked retrieval
CREATE OR REPLACE FUNCTION download_file_chunks(
    p_file_id BIGINT,
    p_start_chunk INTEGER DEFAULT 1,
    p_end_chunk INTEGER DEFAULT NULL
) RETURNS TABLE (
    chunk_number INTEGER,
    chunk_data BYTEA,
    chunk_size INTEGER,
    is_final_chunk BOOLEAN
) AS $$
DECLARE
    total_chunks INTEGER;
    effective_end_chunk INTEGER;
BEGIN

    -- Get total number of chunks
    SELECT COUNT(*) INTO total_chunks
    FROM file_chunks 
    WHERE file_id = p_file_id;

    IF total_chunks = 0 THEN
        RAISE EXCEPTION 'File not found or has no chunks: %', p_file_id;
    END IF;

    -- Set effective end chunk
    effective_end_chunk := COALESCE(p_end_chunk, total_chunks);

    -- Return requested chunks (memory intensive for large ranges)
    RETURN QUERY
    SELECT 
        fc.chunk_number,
        fc.chunk_data,
        fc.chunk_size,
        fc.chunk_number = total_chunks as is_final_chunk
    FROM file_chunks fc
    WHERE fc.file_id = p_file_id
      AND fc.chunk_number BETWEEN p_start_chunk AND effective_end_chunk
    ORDER BY fc.chunk_number;

END;
$$ LANGUAGE plpgsql;

-- File streaming simulation with complex logic
CREATE OR REPLACE FUNCTION stream_file(
    p_file_id BIGINT,
    p_range_start BIGINT DEFAULT 0,
    p_range_end BIGINT DEFAULT NULL
) RETURNS TABLE (
    file_info JSONB,
    chunk_data BYTEA,
    content_range VARCHAR(100),
    total_size BIGINT
) AS $$
DECLARE
    file_record RECORD;
    chunk_size_bytes INTEGER := 1048576; -- 1MB chunks
    start_chunk INTEGER;
    end_chunk INTEGER;
    effective_range_end BIGINT;
    current_position BIGINT := p_range_start; -- Content-Range offsets begin at the requested byte
    chunk_record RECORD;
BEGIN

    -- Get file metadata
    SELECT * INTO file_record
    FROM file_metadata fm
    WHERE fm.file_id = p_file_id;

    IF NOT FOUND THEN
        RAISE EXCEPTION 'File not found: %', p_file_id;
    END IF;

    -- Calculate effective range
    effective_range_end := COALESCE(p_range_end, file_record.file_size_bytes - 1);

    -- Calculate chunk range
    start_chunk := (p_range_start / chunk_size_bytes) + 1;
    end_chunk := (effective_range_end / chunk_size_bytes) + 1;

    -- Return file info
    file_info := jsonb_build_object(
        'file_id', file_record.file_id,
        'filename', file_record.original_filename,
        'content_type', file_record.content_type,
        'total_size', file_record.file_size_bytes,
        'range_start', p_range_start,
        'range_end', effective_range_end
    );

    -- Stream chunks (inefficient for large files)
    FOR chunk_record IN
        SELECT fc.chunk_number, fc.chunk_data, fc.chunk_size
        FROM file_chunks fc
        WHERE fc.file_id = p_file_id
          AND fc.chunk_number BETWEEN start_chunk AND end_chunk
        ORDER BY fc.chunk_number
    LOOP

        -- Trim the chunk to the requested byte range. The first and last
        -- chunk can be the same chunk, so both bounds are applied independently.
        chunk_data := chunk_record.chunk_data;

        IF chunk_record.chunk_number = end_chunk THEN
            -- Drop bytes after the end of the range
            chunk_data := SUBSTRING(
                chunk_data 
                FOR (effective_range_end % chunk_size_bytes) + 1
            );
        END IF;

        IF chunk_record.chunk_number = start_chunk THEN
            -- Drop bytes before the start of the range
            chunk_data := SUBSTRING(
                chunk_data 
                FROM (p_range_start % chunk_size_bytes) + 1
            );
        END IF;

        content_range := format('bytes %s-%s/%s', 
            current_position, 
            current_position + LENGTH(chunk_data) - 1,
            file_record.file_size_bytes
        );

        total_size := file_record.file_size_bytes;

        current_position := current_position + LENGTH(chunk_data);

        RETURN NEXT;
    END LOOP;

END;
$$ LANGUAGE plpgsql;

-- Complex analytics query for file storage management
WITH file_storage_analysis AS (
    SELECT 
        file_category,
        content_type,
        DATE_TRUNC('month', created_at) as month_bucket,

        -- Storage utilization
        COUNT(*) as total_files,
        SUM(file_size_bytes) as total_storage_bytes,
        AVG(file_size_bytes) as avg_file_size,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY file_size_bytes) as median_file_size,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY file_size_bytes) as p95_file_size,

        -- Performance metrics
        AVG(download_count) as avg_downloads,
        SUM(download_count) as total_downloads,
        COUNT(*) FILTER (WHERE download_count = 0) as unused_files,

        -- Processing status
        COUNT(*) FILTER (WHERE processing_status = 'completed') as processed_files,
        COUNT(*) FILTER (WHERE processing_status = 'failed') as failed_files,
        COUNT(*) FILTER (WHERE thumbnail_generated = true) as files_with_thumbnails,

        -- Storage efficiency
        COUNT(DISTINCT uploaded_by) as unique_uploaders,
        AVG(version_number) as avg_version_number

    FROM file_metadata
    WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '12 months'
    GROUP BY file_category, content_type, DATE_TRUNC('month', created_at)
),

storage_growth_projection AS (
    SELECT 
        month_bucket,
        total_storage_bytes,

        -- Growth calculations (complex and expensive)
        LAG(total_storage_bytes) OVER (ORDER BY month_bucket) as prev_month_storage,
        (total_storage_bytes - LAG(total_storage_bytes) OVER (ORDER BY month_bucket))::DECIMAL / 
        NULLIF(LAG(total_storage_bytes) OVER (ORDER BY month_bucket), 0) * 100 as growth_percent

    FROM (
        SELECT 
            month_bucket,
            SUM(total_storage_bytes) as total_storage_bytes
        FROM file_storage_analysis
        GROUP BY month_bucket
    ) monthly_totals
)

SELECT 
    fsa.month_bucket,
    fsa.file_category,
    fsa.content_type,

    -- File statistics
    fsa.total_files,
    ROUND(fsa.total_storage_bytes / 1024.0 / 1024.0 / 1024.0, 2) as storage_gb,
    ROUND(fsa.avg_file_size / 1024.0 / 1024.0, 2) as avg_file_size_mb,
    -- PERCENTILE_CONT returns double precision; ROUND(x, n) requires NUMERIC
    ROUND((fsa.median_file_size / 1024.0 / 1024.0)::NUMERIC, 2) as median_file_size_mb,

    -- Usage patterns
    fsa.avg_downloads,
    fsa.total_downloads,
    ROUND((fsa.unused_files::DECIMAL / fsa.total_files) * 100, 1) as unused_files_percent,

    -- Processing efficiency
    ROUND((fsa.processed_files::DECIMAL / fsa.total_files) * 100, 1) as processing_success_rate,
    ROUND((fsa.files_with_thumbnails::DECIMAL / fsa.total_files) * 100, 1) as thumbnail_generation_rate,

    -- Growth metrics
    sgp.growth_percent as monthly_growth_percent,

    -- Storage recommendations
    CASE 
        WHEN fsa.unused_files::DECIMAL / fsa.total_files > 0.5 THEN 'implement_cleanup_policy'
        WHEN fsa.avg_file_size > 100 * 1024 * 1024 THEN 'consider_compression'
        WHEN sgp.growth_percent > 50 THEN 'monitor_storage_capacity'
        ELSE 'storage_optimized'
    END as storage_recommendation

FROM file_storage_analysis fsa
JOIN storage_growth_projection sgp ON DATE_TRUNC('month', fsa.month_bucket) = sgp.month_bucket
WHERE fsa.total_files > 0
ORDER BY fsa.month_bucket DESC, fsa.total_storage_bytes DESC;

-- Traditional file storage approach problems:
-- 1. Complex manual chunking implementation with performance limitations
-- 2. Separate metadata and binary data management requiring coordination
-- 3. Limited streaming capabilities and memory-intensive operations
-- 4. No built-in distributed storage or replication support
-- 5. Complex versioning and concurrent access management
-- 6. Expensive maintenance operations for large file collections
-- 7. No native integration with database transactions and consistency
-- 8. Limited file processing and metadata extraction capabilities
-- 9. Difficult backup and disaster recovery for large binary datasets
-- 10. Complex sharding and distribution strategies for file data
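
The byte-range arithmetic in `stream_file` above is the part that is easiest to get wrong. As a point of comparison, the sketch below maps an inclusive, HTTP-style byte range onto zero-based chunk numbers and onto the `start`/`end` options of the Node.js driver's `openDownloadStream`, whose `end` is exclusive. `rangeToChunks` is a hypothetical helper name, and the function only computes values; it does not open a stream.

```javascript
// Map an inclusive byte range to chunk numbers and download options.
// rangeEnd may be null or undefined, meaning "to the end of the file".
function rangeToChunks(rangeStart, rangeEnd, fileLength, chunkSizeBytes) {
  if (fileLength <= 0) {
    throw new RangeError('Cannot compute a range for an empty file');
  }
  const start = Math.max(0, rangeStart);
  const end = Math.min(rangeEnd ?? fileLength - 1, fileLength - 1);
  if (start > end) {
    throw new RangeError('Invalid range: start is greater than end');
  }
  return {
    firstChunk: Math.floor(start / chunkSizeBytes), // zero-based chunk number
    lastChunk: Math.floor(end / chunkSizeBytes),
    // openDownloadStream stops before `end`, so add 1 to the inclusive end
    streamOptions: { start, end: end + 1 },
    contentRange: `bytes ${start}-${end}/${fileLength}`
  };
}

console.log(rangeToChunks(300000, 599999, 1000000, 255 * 1024));
```

With the driver handling chunk lookup internally, the application only needs `streamOptions` and the `Content-Range` header value; the chunk numbers are shown to mirror the manual SQL logic.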

MongoDB GridFS provides comprehensive distributed file storage with automatic chunking and metadata management:

// MongoDB GridFS - Advanced distributed file storage with automatic chunking and metadata management
const { MongoClient, GridFSBucket, ObjectId } = require('mongodb');
const fs = require('fs');
const crypto = require('crypto');
const { Transform, Readable } = require('stream');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('advanced_file_storage');

// Comprehensive MongoDB GridFS Manager
class AdvancedGridFSManager {
  constructor(db, config = {}) {
    this.db = db;
    this.config = {
      // Default GridFS configuration
      defaultBucketName: config.defaultBucketName || 'fs',
      defaultChunkSizeBytes: config.defaultChunkSizeBytes || 255 * 1024, // 255KB

      // Performance optimization
      enableConcurrentUploads: config.enableConcurrentUploads !== false,
      maxConcurrentUploads: config.maxConcurrentUploads || 10,
      enableStreamOptimization: config.enableStreamOptimization !== false,
      bufferSize: config.bufferSize || 64 * 1024,

      // File processing features
      enableHashGeneration: config.enableHashGeneration !== false,
      enableMetadataExtraction: config.enableMetadataExtraction !== false,
      enableThumbnailGeneration: config.enableThumbnailGeneration !== false,
      enableContentAnalysis: config.enableContentAnalysis !== false,

      // Storage optimization
      enableCompression: config.enableCompression !== false,
      compressionLevel: config.compressionLevel || 6,
      enableDeduplication: config.enableDeduplication !== false,

      // Access control and security
      enableEncryption: config.enableEncryption === true, // Opt-in: nothing is encrypted unless explicitly enabled with a key
      encryptionKey: config.encryptionKey,
      enableAccessLogging: config.enableAccessLogging !== false,

      // Performance monitoring
      enablePerformanceMetrics: config.enablePerformanceMetrics !== false,
      enableUsageAnalytics: config.enableUsageAnalytics !== false,

      // Advanced features
      enableVersioning: config.enableVersioning !== false,
      enableDistributedStorage: config.enableDistributedStorage !== false,
      enableAutoCleanup: config.enableAutoCleanup !== false
    };

    // GridFS buckets for different file types
    this.buckets = new Map();
    this.uploadStreams = new Map();
    this.downloadStreams = new Map();

    // Performance tracking
    this.performanceMetrics = {
      totalUploads: 0,
      totalDownloads: 0,
      totalStorageBytes: 0,
      averageUploadTime: 0,
      averageDownloadTime: 0,
      errorCount: 0
    };

    // File processing queues
    this.processingQueue = new Map();
    this.thumbnailQueue = new Map();

    // Constructors cannot await; expose the promise so callers can `await manager.ready`
    this.ready = this.initializeGridFS();
  }

  async initializeGridFS() {
    console.log('Initializing advanced GridFS file storage system...');

    try {
      // Create specialized GridFS buckets for different file types
      await this.createOptimizedBucket('documents', {
        chunkSizeBytes: 512 * 1024, // 512KB chunks for documents
        metadata: {
          purpose: 'document_storage',
          contentTypes: ['application/pdf', 'application/msword', 'text/plain'],
          enableFullTextIndex: true,
          enableContentExtraction: true
        }
      });

      await this.createOptimizedBucket('images', {
        chunkSizeBytes: 256 * 1024, // 256KB chunks for images
        metadata: {
          purpose: 'image_storage',
          contentTypes: ['image/jpeg', 'image/png', 'image/gif', 'image/webp'],
          enableThumbnailGeneration: true,
          enableImageAnalysis: true
        }
      });

      await this.createOptimizedBucket('videos', {
        chunkSizeBytes: 1024 * 1024, // 1MB chunks for videos
        metadata: {
          purpose: 'video_storage',
          contentTypes: ['video/mp4', 'video/webm', 'video/avi'],
          enableVideoProcessing: true,
          enableStreamingOptimization: true
        }
      });

      await this.createOptimizedBucket('archives', {
        chunkSizeBytes: 2 * 1024 * 1024, // 2MB chunks for archives
        metadata: {
          purpose: 'archive_storage',
          contentTypes: ['application/zip', 'application/tar', 'application/gzip'],
          enableCompression: false, // Already compressed
          enableIntegrityCheck: true
        }
      });

      // Create general-purpose bucket
      await this.createOptimizedBucket('general', {
        chunkSizeBytes: this.config.defaultChunkSizeBytes,
        metadata: {
          purpose: 'general_storage',
          enableGenericProcessing: true
        }
      });

      // Setup performance monitoring
      if (this.config.enablePerformanceMetrics) {
        await this.setupPerformanceMonitoring();
      }

      // Setup automatic cleanup
      if (this.config.enableAutoCleanup) {
        await this.setupAutomaticCleanup();
      }

      console.log('Advanced GridFS system initialized successfully');

    } catch (error) {
      console.error('Error initializing GridFS:', error);
      throw error;
    }
  }

  async createOptimizedBucket(bucketName, options) {
    console.log(`Creating optimized GridFS bucket: ${bucketName}...`);

    try {
      const bucket = new GridFSBucket(this.db, {
        bucketName: bucketName,
        chunkSizeBytes: options.chunkSizeBytes || this.config.defaultChunkSizeBytes,
        writeConcern: { w: 1, j: true },
        readConcern: { level: 'majority' }
      });

      this.buckets.set(bucketName, {
        bucket: bucket,
        options: options,
        created: new Date(),
        stats: {
          fileCount: 0,
          totalSize: 0,
          uploadsInProgress: 0
        }
      });

      // Create optimized indexes for GridFS collections
      await this.createGridFSIndexes(bucketName);

      console.log(`GridFS bucket ${bucketName} created with ${options.chunkSizeBytes} byte chunks`);

    } catch (error) {
      console.error(`Error creating GridFS bucket ${bucketName}:`, error);
      throw error;
    }
  }

  async createGridFSIndexes(bucketName) {
    console.log(`Creating optimized indexes for GridFS bucket: ${bucketName}...`);

    try {
      // Files collection indexes
      const filesCollection = this.db.collection(`${bucketName}.files`);
      await filesCollection.createIndexes([
        { key: { filename: 1, uploadDate: -1 }, background: true, name: 'filename_upload_date' },
        { key: { 'metadata.contentType': 1, uploadDate: -1 }, background: true, name: 'content_type_date' },
        { key: { 'metadata.userId': 1, uploadDate: -1 }, background: true, sparse: true, name: 'user_files' },
        { key: { 'metadata.tags': 1 }, background: true, sparse: true, name: 'file_tags' },
        { key: { length: -1, uploadDate: -1 }, background: true, name: 'size_date' },
        { key: { 'metadata.hash': 1 }, background: true, sparse: true, name: 'file_hash' }
      ]);

      // Chunks collection: the driver creates the unique { files_id: 1, n: 1 } index
      // on first upload, and its files_id prefix already serves lookups by file,
      // so no additional chunk index is needed

      console.log(`GridFS indexes created for bucket: ${bucketName}`);

    } catch (error) {
      console.error(`Error creating GridFS indexes for ${bucketName}:`, error);
      // Don't fail initialization for index creation issues
    }
  }

  async uploadFile(bucketName, filename, fileStream, metadata = {}) {
    console.log(`Starting file upload: ${filename} to bucket: ${bucketName}`);
    const uploadStartTime = Date.now();

    try {
      const bucketInfo = this.buckets.get(bucketName);
      if (!bucketInfo) {
        throw new Error(`GridFS bucket ${bucketName} not found`);
      }

      const bucket = bucketInfo.bucket;

      // Generate file hash for deduplication and integrity
      const hashStream = crypto.createHash('sha256');
      let fileSize = 0;

      // Enhanced metadata with automatic enrichment
      const enhancedMetadata = {
        ...metadata,

        // Upload context
        uploadedAt: new Date(),
        uploadedBy: metadata.userId || 'system',
        bucketName: bucketName,

        // File identification
        originalFilename: filename,
        contentType: metadata.contentType || this.detectContentType(filename),

        // Processing flags
        processingStatus: 'pending',
        processingQueue: [],

        // Access and security
        accessLevel: metadata.accessLevel || 'private',
        encryptionStatus: this.config.enableEncryption ? 'encrypted' : 'unencrypted',

        // File categorization
        category: this.categorizeFile(filename, metadata.contentType),
        tags: metadata.tags || [],

        // Version control
        version: metadata.version || 1,
        parentFileId: metadata.parentFileId,

        // System metadata
        source: metadata.source || 'api_upload',
        clientInfo: metadata.clientInfo || {},

        // Performance tracking
        uploadMetrics: {
          startTime: uploadStartTime,
          chunkSizeBytes: bucketInfo.options.chunkSizeBytes
        }
      };

      // Create upload stream with optimization
      const uploadOptions = {
        metadata: enhancedMetadata,
        chunkSizeBytes: bucketInfo.options.chunkSizeBytes
        // Integrity relies on the SHA-256 hash computed during upload; recent
        // driver versions no longer compute MD5 for GridFS files
      };

      const uploadStream = bucket.openUploadStream(filename, uploadOptions);
      const uploadId = uploadStream.id.toString();

      // Track upload progress
      this.uploadStreams.set(uploadId, {
        stream: uploadStream,
        filename: filename,
        startTime: uploadStartTime,
        bucketName: bucketName
      });

      // Setup progress tracking and error handling
      let uploadedBytes = 0;

      return new Promise((resolve, reject) => {
        uploadStream.on('error', (error) => {
          console.error(`Upload error for ${filename}:`, error);
          this.uploadStreams.delete(uploadId);
          this.performanceMetrics.errorCount++;
          reject(error);
        });

        uploadStream.on('finish', async () => {
          const uploadTime = Date.now() - uploadStartTime;

          try {
            // Update file metadata with hash and final processing info
            const finalMetadata = {
              ...enhancedMetadata,

              // File integrity
              hash: hashStream.digest('hex'),
              fileSize: fileSize,

              // Upload completion
              processingStatus: 'uploaded',
              uploadMetrics: {
                ...enhancedMetadata.uploadMetrics,
                completedAt: new Date(),
                uploadTimeMs: uploadTime,
                throughputBytesPerSecond: fileSize > 0 ? Math.round(fileSize / (uploadTime / 1000)) : 0
              }
            };

            // Update the file document with enhanced metadata. Setting `metadata`
            // together with `metadata.hash` in one $set raises a path conflict,
            // and finalMetadata already carries hash and fileSize.
            await this.db.collection(`${bucketName}.files`).updateOne(
              { _id: uploadStream.id },
              { $set: { metadata: finalMetadata } }
            );

            // Update performance metrics
            this.updatePerformanceMetrics('upload', uploadTime, fileSize);
            bucketInfo.stats.fileCount++;
            bucketInfo.stats.totalSize += fileSize;

            // Queue for post-processing
            if (this.config.enableMetadataExtraction || this.config.enableThumbnailGeneration) {
              await this.queueFileProcessing(uploadStream.id, bucketName, finalMetadata);
            }

            this.uploadStreams.delete(uploadId);

            console.log(`File upload completed: ${filename} (${fileSize} bytes) in ${uploadTime}ms`);

            resolve({
              success: true,
              fileId: uploadStream.id,
              filename: filename,
              size: fileSize,
              hash: finalMetadata.hash,
              uploadTime: uploadTime,
              bucketName: bucketName,
              metadata: finalMetadata
            });

          } catch (error) {
            console.error('Error updating file metadata after upload:', error);
            reject(error);
          }
        });

        // Pipe the file stream through hash calculation and to GridFS
        fileStream.on('data', (chunk) => {
          hashStream.update(chunk);
          fileSize += chunk.length;
          uploadedBytes += chunk.length;

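          // Report progress each time another full MB boundary is crossed
          const megabyte = 1024 * 1024;
          if (Math.floor(uploadedBytes / megabyte) > Math.floor((uploadedBytes - chunk.length) / megabyte)) {
            console.log(`Upload progress: ${filename} - ${Math.floor(uploadedBytes / megabyte)}MB`);
          }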
        });

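        // pipe() does not forward source errors, so fail the upload explicitly;
        // destroying the upload stream triggers its 'error' handler above
        fileStream.on('error', (error) => uploadStream.destroy(error));

        fileStream.pipe(uploadStream);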
      });

    } catch (error) {
      console.error(`Error uploading file ${filename}:`, error);
      this.performanceMetrics.errorCount++;

      return {
        success: false,
        error: error.message,
        filename: filename,
        bucketName: bucketName
      };
    }
  }

  async downloadFile(bucketName, fileId, options = {}) {
    console.log(`Starting file download: ${fileId} from bucket: ${bucketName}`);
    const downloadStartTime = Date.now();

    try {
      const bucketInfo = this.buckets.get(bucketName);
      if (!bucketInfo) {
        throw new Error(`GridFS bucket ${bucketName} not found`);
      }

      const bucket = bucketInfo.bucket;
      const objectId = new ObjectId(fileId);

      // Get file metadata first
      const fileInfo = await this.db.collection(`${bucketName}.files`).findOne({ _id: objectId });
      if (!fileInfo) {
        throw new Error(`File not found: ${fileId}`);
      }

      // Log access if enabled
      if (this.config.enableAccessLogging) {
        await this.logFileAccess(fileId, bucketName, 'download', options.userId);
      }

      // Create download stream with range support
      const downloadOptions = {};

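      if (options.range) {
        // options.range.end is treated as inclusive (HTTP style), while the
        // driver's `end` option is exclusive, hence the + 1
        downloadOptions.start = options.range.start ?? 0;
        downloadOptions.end = (options.range.end ?? fileInfo.length - 1) + 1;
      }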

      const downloadStream = bucket.openDownloadStream(objectId, downloadOptions);
      const downloadId = new ObjectId().toString();

      // Track download
      this.downloadStreams.set(downloadId, {
        stream: downloadStream,
        fileId: fileId,
        filename: fileInfo.filename,
        startTime: downloadStartTime,
        bucketName: bucketName
      });

      // Setup progress tracking
      let downloadedBytes = 0;

      downloadStream.on('data', (chunk) => {
        downloadedBytes += chunk.length;

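        // Report progress each time another full MB boundary is crossed
        const megabyte = 1024 * 1024;
        if (Math.floor(downloadedBytes / megabyte) > Math.floor((downloadedBytes - chunk.length) / megabyte)) {
          console.log(`Download progress: ${fileInfo.filename} - ${Math.floor(downloadedBytes / megabyte)}MB`);
        }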
      });

      downloadStream.on('end', () => {
        const downloadTime = Date.now() - downloadStartTime;

        // Update metrics
        this.updatePerformanceMetrics('download', downloadTime, downloadedBytes);
        this.downloadStreams.delete(downloadId);

        console.log(`File download completed: ${fileInfo.filename} (${downloadedBytes} bytes) in ${downloadTime}ms`);
      });

      downloadStream.on('error', (error) => {
        console.error(`Download error for ${fileId}:`, error);
        this.downloadStreams.delete(downloadId);
        this.performanceMetrics.errorCount++;
      });

      return {
        success: true,
        fileId: fileId,
        filename: fileInfo.filename,
        contentType: fileInfo.metadata?.contentType || 'application/octet-stream',
        fileSize: fileInfo.length,
        downloadStream: downloadStream,
        metadata: fileInfo.metadata,
        bucketName: bucketName
      };

    } catch (error) {
      console.error(`Error downloading file ${fileId}:`, error);
      this.performanceMetrics.errorCount++;

      return {
        success: false,
        error: error.message,
        fileId: fileId,
        bucketName: bucketName
      };
    }
  }

  async streamFileRange(bucketName, fileId, rangeStart, rangeEnd, options = {}) {
    console.log(`Streaming file range: ${fileId} bytes ${rangeStart}-${rangeEnd}`);

    try {
      const bucketInfo = this.buckets.get(bucketName);
      if (!bucketInfo) {
        throw new Error(`GridFS bucket ${bucketName} not found`);
      }

      const bucket = bucketInfo.bucket;
      const objectId = new ObjectId(fileId);

      // Get file info for validation
      const fileInfo = await this.db.collection(`${bucketName}.files`).findOne({ _id: objectId });
      if (!fileInfo) {
        throw new Error(`File not found: ${fileId}`);
      }

      // Validate range (rangeStart and rangeEnd are inclusive byte offsets)
      const fileSize = fileInfo.length;
      const validatedRangeStart = Math.max(0, rangeStart ?? 0);
      const validatedRangeEnd = Math.min(rangeEnd ?? fileSize - 1, fileSize - 1);

      if (validatedRangeStart > validatedRangeEnd) {
        throw new Error('Invalid range: start position greater than end position');
      }

      // Create range download stream; the driver's `end` option is exclusive,
      // so add 1 to include the last requested byte
      const downloadStream = bucket.openDownloadStream(objectId, {
        start: validatedRangeStart,
        end: validatedRangeEnd + 1
      });

      // Log access
      if (this.config.enableAccessLogging) {
        await this.logFileAccess(fileId, bucketName, 'stream', options.userId, {
          rangeStart: validatedRangeStart,
          rangeEnd: validatedRangeEnd,
          rangeSize: validatedRangeEnd - validatedRangeStart + 1
        });
      }

      return {
        success: true,
        fileId: fileId,
        filename: fileInfo.filename,
        contentType: fileInfo.metadata?.contentType || 'application/octet-stream',
        totalSize: fileSize,
        rangeStart: validatedRangeStart,
        rangeEnd: validatedRangeEnd,
        rangeSize: validatedRangeEnd - validatedRangeStart + 1,
        downloadStream: downloadStream,
        contentRange: `bytes ${validatedRangeStart}-${validatedRangeEnd}/${fileSize}`
      };

    } catch (error) {
      console.error(`Error streaming file range for ${fileId}:`, error);
      return {
        success: false,
        error: error.message,
        fileId: fileId
      };
    }
  }

  async deleteFile(bucketName, fileId, options = {}) {
    console.log(`Deleting file: ${fileId} from bucket: ${bucketName}`);

    try {
      const bucketInfo = this.buckets.get(bucketName);
      if (!bucketInfo) {
        throw new Error(`GridFS bucket ${bucketName} not found`);
      }

      const bucket = bucketInfo.bucket;
      const objectId = new ObjectId(fileId);

      // Get file info before deletion (for logging and stats)
      const fileInfo = await this.db.collection(`${bucketName}.files`).findOne({ _id: objectId });
      if (!fileInfo) {
        throw new Error(`File not found: ${fileId}`);
      }

      // Check permissions if needed
      if (options.userId && fileInfo.metadata?.uploadedBy !== options.userId) {
        if (!options.bypassPermissions) {
          throw new Error('Insufficient permissions to delete file');
        }
      }

      // Delete file and all associated chunks
      await bucket.delete(objectId);

      // Update bucket stats
      bucketInfo.stats.fileCount = Math.max(0, bucketInfo.stats.fileCount - 1);
      bucketInfo.stats.totalSize = Math.max(0, bucketInfo.stats.totalSize - fileInfo.length);

      // Log deletion
      if (this.config.enableAccessLogging) {
        await this.logFileAccess(fileId, bucketName, 'delete', options.userId, {
          filename: fileInfo.filename,
          fileSize: fileInfo.length,
          deletedBy: options.userId || 'system'
        });
      }

      console.log(`File deleted successfully: ${fileInfo.filename} (${fileInfo.length} bytes)`);

      return {
        success: true,
        fileId: fileId,
        filename: fileInfo.filename,
        fileSize: fileInfo.length,
        bucketName: bucketName
      };

    } catch (error) {
      console.error(`Error deleting file ${fileId}:`, error);
      this.performanceMetrics.errorCount++;

      return {
        success: false,
        error: error.message,
        fileId: fileId,
        bucketName: bucketName
      };
    }
  }

  async findFiles(bucketName, query = {}, options = {}) {
    console.log(`Searching files in bucket: ${bucketName}`);

    try {
      const bucketInfo = this.buckets.get(bucketName);
      if (!bucketInfo) {
        throw new Error(`GridFS bucket ${bucketName} not found`);
      }

      const filesCollection = this.db.collection(`${bucketName}.files`);

      // Build MongoDB query from search parameters
      const mongoQuery = {};

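      if (query.filename) {
        // Escape regex metacharacters so user input is matched literally
        const escapedFilename = query.filename.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
        mongoQuery.filename = new RegExp(escapedFilename, 'i');
      }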

      if (query.contentType) {
        mongoQuery['metadata.contentType'] = query.contentType;
      }

      if (query.userId) {
        mongoQuery['metadata.uploadedBy'] = query.userId;
      }

      if (query.tags && query.tags.length > 0) {
        mongoQuery['metadata.tags'] = { $in: query.tags };
      }

      if (query.dateRange) {
        mongoQuery.uploadDate = {
          $gte: query.dateRange.start,
          $lte: query.dateRange.end || new Date()
        };
      }

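      if (query.sizeRange) {
        const lengthFilter = {};
        if (query.sizeRange.min != null) lengthFilter.$gte = query.sizeRange.min;
        if (query.sizeRange.max != null) lengthFilter.$lte = query.sizeRange.max;
        // An empty filter object would be treated as an equality match and return nothing
        if (Object.keys(lengthFilter).length > 0) mongoQuery.length = lengthFilter;
      }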

      // Configure query options
      const queryOptions = {
        sort: options.sort || { uploadDate: -1 },
        limit: options.limit || 100,
        skip: options.skip || 0,
        projection: options.includeMetadata ? {} : { 
          filename: 1, 
          length: 1, 
          uploadDate: 1, 
          'metadata.contentType': 1,
          'metadata.category': 1,
          'metadata.tags': 1,
          'metadata.hash': 1 // returned as `hash` in the result mapping below
        }
      };

      // Execute query
      const files = await filesCollection.find(mongoQuery, queryOptions).toArray();
      const totalCount = await filesCollection.countDocuments(mongoQuery);

      return {
        success: true,
        files: files.map(file => ({
          fileId: file._id.toString(),
          filename: file.filename,
          contentType: file.metadata?.contentType,
          fileSize: file.length,
          uploadDate: file.uploadDate,
          category: file.metadata?.category,
          tags: file.metadata?.tags || [],
          hash: file.metadata?.hash,
          metadata: options.includeMetadata ? file.metadata : undefined
        })),
        totalCount: totalCount,
        currentPage: Math.floor((options.skip || 0) / (options.limit || 100)) + 1,
        totalPages: Math.ceil(totalCount / (options.limit || 100)),
        query: query,
        bucketName: bucketName
      };

    } catch (error) {
      console.error(`Error finding files in ${bucketName}:`, error);
      return {
        success: false,
        error: error.message,
        bucketName: bucketName
      };
    }
  }

  categorizeFile(filename, contentType) {
    // Intelligent file categorization
    const extension = filename.toLowerCase().split('.').pop();

    if (contentType) {
      if (contentType.startsWith('image/')) return 'image';
      if (contentType.startsWith('video/')) return 'video';
      if (contentType.startsWith('audio/')) return 'audio';
      if (contentType.includes('pdf')) return 'document';
      if (contentType.includes('text/')) return 'text';
    }

    // Extension-based categorization
    const imageExts = ['jpg', 'jpeg', 'png', 'gif', 'bmp', 'webp', 'svg'];
    const videoExts = ['mp4', 'avi', 'mov', 'wmv', 'flv', 'webm'];
    const audioExts = ['mp3', 'wav', 'flac', 'aac', 'ogg'];
    const documentExts = ['pdf', 'doc', 'docx', 'xls', 'xlsx', 'ppt', 'pptx'];
    const archiveExts = ['zip', 'tar', 'gz', 'rar', '7z'];

    if (imageExts.includes(extension)) return 'image';
    if (videoExts.includes(extension)) return 'video';
    if (audioExts.includes(extension)) return 'audio';
    if (documentExts.includes(extension)) return 'document';
    if (archiveExts.includes(extension)) return 'archive';

    return 'other';
  }

  detectContentType(filename) {
    // Simple content type detection based on extension
    const extension = filename.toLowerCase().split('.').pop();
    const contentTypes = {
      'jpg': 'image/jpeg', 'jpeg': 'image/jpeg',
      'png': 'image/png', 'gif': 'image/gif',
      'pdf': 'application/pdf',
      'txt': 'text/plain', 'html': 'text/html',
      'mp4': 'video/mp4', 'webm': 'video/webm',
      'mp3': 'audio/mpeg', 'wav': 'audio/wav',
      'zip': 'application/zip',
      'json': 'application/json'
    };

    return contentTypes[extension] || 'application/octet-stream';
  }

  async logFileAccess(fileId, bucketName, action, userId, additionalInfo = {}) {
    if (!this.config.enableAccessLogging) return;

    try {
      const accessLog = {
        fileId: new ObjectId(fileId),
        bucketName: bucketName,
        action: action, // upload, download, delete, stream
        userId: userId,
        timestamp: new Date(),
        ...additionalInfo,

        // System context
        userAgent: additionalInfo.userAgent,
        ipAddress: additionalInfo.ipAddress,
        sessionId: additionalInfo.sessionId,

        // Performance context
        responseTime: additionalInfo.responseTime,
        bytesTransferred: additionalInfo.bytesTransferred
      };

      await this.db.collection('file_access_logs').insertOne(accessLog);

    } catch (error) {
      console.error('Error logging file access:', error);
      // Don't fail the operation for logging errors
    }
  }

  updatePerformanceMetrics(operation, duration, bytes = 0) {
    if (!this.config.enablePerformanceMetrics) return;

    if (operation === 'upload') {
      this.performanceMetrics.totalUploads++;
      // Incremental mean: avg += (x - avg) / n
      this.performanceMetrics.averageUploadTime +=
        (duration - this.performanceMetrics.averageUploadTime) / this.performanceMetrics.totalUploads;
    } else if (operation === 'download') {
      this.performanceMetrics.totalDownloads++;
      this.performanceMetrics.averageDownloadTime +=
        (duration - this.performanceMetrics.averageDownloadTime) / this.performanceMetrics.totalDownloads;
    }

    this.performanceMetrics.totalStorageBytes += bytes;
  }

  async getStorageStats() {
    console.log('Gathering GridFS storage statistics...');

    const stats = {
      buckets: {},
      systemStats: this.performanceMetrics,
      summary: {
        totalBuckets: this.buckets.size,
        activeUploads: this.uploadStreams.size,
        activeDownloads: this.downloadStreams.size
      }
    };

    for (const [bucketName, bucketInfo] of this.buckets.entries()) {
      try {
        // Get collection statistics
        const filesCollection = this.db.collection(`${bucketName}.files`);
        const chunksCollection = this.db.collection(`${bucketName}.chunks`);

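        // Collection.stats() is deprecated in recent Node.js drivers; the $collStats
        // aggregation stage returns the same storage figures (one document per shard
        // on a sharded cluster, of which only the first is read here)
        const readStorageStats = (collection) => collection
          .aggregate([{ $collStats: { storageStats: {} } }])
          .next()
          .then((result) => result?.storageStats || {})
          .catch(() => ({}));

        const [filesStats, chunksStats, fileCount, totalSize] = await Promise.all([
          readStorageStats(filesCollection),
          readStorageStats(chunksCollection),
          filesCollection.countDocuments({}),
          filesCollection.aggregate([
            { $group: { _id: null, totalSize: { $sum: '$length' } } }
          ]).toArray()
        ]);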

        stats.buckets[bucketName] = {
          configuration: bucketInfo.options,
          fileCount: fileCount,
          totalSizeBytes: totalSize[0]?.totalSize || 0,
          totalSizeMB: Math.round((totalSize[0]?.totalSize || 0) / 1024 / 1024),
          filesCollectionStats: {
            size: filesStats.size || 0,
            storageSize: filesStats.storageSize || 0,
            indexSize: filesStats.totalIndexSize || 0
          },
          chunksCollectionStats: {
            size: chunksStats.size || 0,
            storageSize: chunksStats.storageSize || 0,
            indexSize: chunksStats.totalIndexSize || 0
          },
          chunkSizeBytes: bucketInfo.options.chunkSizeBytes,
          averageFileSize: fileCount > 0 ? Math.round((totalSize[0]?.totalSize || 0) / fileCount) : 0,
          created: bucketInfo.created
        };

      } catch (error) {
        stats.buckets[bucketName] = {
          error: error.message,
          available: false
        };
      }
    }

    return stats;
  }

  async shutdown() {
    console.log('Shutting down GridFS manager...');

    // Close all active upload streams
    for (const [uploadId, uploadInfo] of this.uploadStreams.entries()) {
      try {
        uploadInfo.stream.destroy();
        console.log(`Closed upload stream: ${uploadId}`);
      } catch (error) {
        console.error(`Error closing upload stream ${uploadId}:`, error);
      }
    }

    // Close all active download streams
    for (const [downloadId, downloadInfo] of this.downloadStreams.entries()) {
      try {
        downloadInfo.stream.destroy();
        console.log(`Closed download stream: ${downloadId}`);
      } catch (error) {
        console.error(`Error closing download stream ${downloadId}:`, error);
      }
    }

    // Clear collections and metrics
    this.buckets.clear();
    this.uploadStreams.clear();
    this.downloadStreams.clear();

    console.log('GridFS manager shutdown complete');
  }
}

// Benefits of MongoDB GridFS:
// - Automatic file chunking for large files without manual implementation
// - File metadata stored in the same database as the file data
// - Native support for file streaming and byte-range reads
// - Distributed storage with MongoDB's replication and sharding
// - Standard indexing and querying of file metadata in the files collection
// - Backup and disaster recovery with standard MongoDB tooling
// - Integration with existing MongoDB security and access control
// - SQL-style file operations through QueryLeaf integration
//
// Caveats:
// - GridFS does not support multi-document transactions, so an interrupted
//   upload can leave orphaned chunks that need cleanup
// - GridFS does not deduplicate content; store a hash in metadata and check it
//   in the application if duplicate detection is required

module.exports = {
  AdvancedGridFSManager
};
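
The streamFileRange method above expects numeric rangeStart and rangeEnd values. When files are served over HTTP, those usually come from a Range request header. The following is a minimal sketch, not part of the manager class, of turning a single-range header into inclusive byte offsets. The name parseRangeHeader is hypothetical, and multi-range requests are deliberately not handled.

```javascript
// Hypothetical helper: parse a single-range HTTP Range header such as
// "bytes=0-499", "bytes=500-" or "bytes=-200" into inclusive byte offsets.
// Returns null when the header is absent, malformed, or cannot be satisfied.
function parseRangeHeader(header, fileSize) {
  if (typeof header !== 'string' || fileSize <= 0) return null;

  const match = /^bytes=(\d*)-(\d*)$/.exec(header.trim());
  if (!match) return null;

  const [, startText, endText] = match;
  if (startText === '' && endText === '') return null;

  let start;
  let end;

  if (startText === '') {
    // Suffix form "bytes=-N": the last N bytes of the file
    const suffixLength = Number(endText);
    if (suffixLength === 0) return null;
    start = Math.max(0, fileSize - suffixLength);
    end = fileSize - 1;
  } else {
    start = Number(startText);
    // Open-ended form "bytes=N-": read through the last byte
    end = endText === '' ? fileSize - 1 : Math.min(Number(endText), fileSize - 1);
  }

  if (start > end || start >= fileSize) return null;
  return { start, end };
}

console.log(parseRangeHeader('bytes=0-499', 1000));
console.log(parseRangeHeader('bytes=-200', 1000));
```

The resulting start and end can be passed to streamFileRange as rangeStart and rangeEnd. A null result leaves it to the caller to either serve the whole file or reply with 416 Range Not Satisfiable.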

Understanding MongoDB GridFS Architecture
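
At the storage level, a GridFS bucket is two collections: the files collection holds one metadata document per file, and the chunks collection holds the binary pieces, each tagged with a files_id and a sequence number n. The driver's default chunk size is 255 KiB. The arithmetic behind chunking and range reads can be sketched as follows, assuming that default; chunkCount and chunksForRange are illustrative names, not driver APIs.

```javascript
const DEFAULT_CHUNK_SIZE_BYTES = 255 * 1024; // driver default: 261,120 bytes

// Number of chunk documents GridFS stores for a file of the given length.
function chunkCount(fileLength, chunkSizeBytes = DEFAULT_CHUNK_SIZE_BYTES) {
  return Math.ceil(fileLength / chunkSizeBytes);
}

// Chunk sequence numbers (the `n` field) that overlap an inclusive byte range.
function chunksForRange(rangeStart, rangeEnd, chunkSizeBytes = DEFAULT_CHUNK_SIZE_BYTES) {
  return {
    firstChunk: Math.floor(rangeStart / chunkSizeBytes),
    lastChunk: Math.floor(rangeEnd / chunkSizeBytes),
    // Bytes to skip inside the first chunk before the range begins
    offsetInFirstChunk: rangeStart % chunkSizeBytes
  };
}

const oneMiB = 1024 * 1024;
console.log(chunkCount(10 * oneMiB));               // chunks needed for a 10 MiB file
console.log(chunksForRange(oneMiB, 2 * oneMiB - 1)); // chunks touched by the second MiB
```

This is why a range read only has to fetch the chunks between firstChunk and lastChunk rather than the whole file.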

Advanced File Storage and Distribution Patterns

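The following sketch extends the manager with sharding and content-delivery configuration for production MongoDB deployments. The setup and deploy helpers it calls are not shown: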

// Production-ready MongoDB GridFS with advanced optimization and enterprise features
class EnterpriseGridFSManager extends AdvancedGridFSManager {
  constructor(db, enterpriseConfig) {
    super(db, enterpriseConfig);

    this.enterpriseConfig = {
      // Defaults come first so values passed in enterpriseConfig can override them
      enableShardedStorage: true,
      enableAdvancedSecurity: true,
      enableContentDeliveryNetwork: true,
      enableAutoTiering: true,
      enableAdvancedAnalytics: true,
      enableComplianceFeatures: true,
      ...enterpriseConfig
    };

    this.setupEnterpriseFeatures();
    this.initializeAdvancedSecurity();
    this.setupContentDeliveryNetwork();
  }

  async implementShardedFileStorage() {
    console.log('Implementing sharded GridFS storage...');

    const shardingStrategy = {
      // Shard key design for GridFS collections
      // Uses metadata.uploadedBy, the owner field the manager writes and queries
      filesShardKey: { 'metadata.uploadedBy': 1, uploadDate: 1 },
      chunksShardKey: { files_id: 1 },

      // Distribution optimization
      enableZoneSharding: true,
      geographicDistribution: true,
      loadBalancing: true,

      // Performance optimization
      enableLocalReads: true,
      enableWriteDistribution: true,
      chunkDistributionStrategy: 'round_robin'
    };

    return await this.deployShardedGridFS(shardingStrategy);
  }

  async setupAdvancedContentDelivery() {
    console.log('Setting up advanced content delivery network...');

    const cdnConfig = {
      // Edge caching strategy
      edgeCaching: {
        enableEdgeNodes: true,
        cacheSize: '10GB',
        cacheTTL: 3600000, // 1 hour
        enableIntelligentCaching: true
      },

      // Content optimization
      contentOptimization: {
        enableImageOptimization: true,
        enableVideoTranscoding: true,
        enableCompressionOptimization: true,
        enableAdaptiveStreaming: true
      },

      // Global distribution
      globalDistribution: {
        enableMultiRegion: true,
        regions: ['us-east-1', 'eu-west-1', 'ap-southeast-1'],
        enableGeoRouting: true,
        enableFailover: true
      }
    };

    return await this.deployContentDeliveryNetwork(cdnConfig);
  }
}
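
The shardingStrategy object above only describes shard keys, and deployShardedGridFS itself is not shown. As a rough sketch of what such a step might issue, the helper below builds the enableSharding and shardCollection admin commands for a bucket's chunks collection without executing them. The name buildGridFSShardCommands is hypothetical, and the { files_id: 1, n: 1 } default is one of the chunk shard keys suggested in MongoDB's GridFS documentation.

```javascript
// Hypothetical helper: build (but do not run) the admin commands that shard
// the chunks collection of a GridFS bucket.
function buildGridFSShardCommands(dbName, bucketName, chunksShardKey = { files_id: 1, n: 1 }) {
  return [
    { enableSharding: dbName },
    { shardCollection: `${dbName}.${bucketName}.chunks`, key: chunksShardKey }
  ];
}

const commands = buildGridFSShardCommands('media', 'videos');
console.log(JSON.stringify(commands, null, 2));

// Against a sharded cluster, each command would be run through the admin database:
//   for (const command of commands) await client.db('admin').command(command);
```

The files collection is usually small because it holds only metadata, so it is often left unsharded. Whether to shard it on a key like filesShardKey depends on the query patterns of the deployment.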

SQL-Style GridFS Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB GridFS operations and file management:

-- QueryLeaf GridFS operations with SQL-familiar syntax for MongoDB

-- Create GridFS buckets with SQL-style DDL
CREATE GRIDFS_BUCKET documents 
WITH (
  chunk_size = '512KB',
  write_concern = 'majority',
  read_concern = 'majority',
  enable_sharding = true
);

CREATE GRIDFS_BUCKET images
WITH (
  chunk_size = '256KB',
  enable_compression = true,
  enable_thumbnail_generation = true,
  content_types = ['image/jpeg', 'image/png', 'image/gif']
);

-- File upload with enhanced metadata
INSERT INTO GRIDFS('documents') (
  filename, content_type, file_data, metadata
) VALUES (
  'enterprise-report.pdf',
  'application/pdf', 
  FILE_STREAM('/path/to/enterprise-report.pdf'),
  JSON_OBJECT(
    'category', 'reports',
    'department', 'finance',
    'classification', 'internal',
    'tags', JSON_ARRAY('quarterly', 'financial', 'analysis'),
    'access_level', 'restricted',
    'retention_years', 7,
    'compliance_flags', JSON_OBJECT(
      'gdpr_applicable', true,
      'sox_applicable', true,
      'data_classification', 'sensitive'
    ),
    'business_context', JSON_OBJECT(
      'project_id', 'PROJ-2025-Q1',
      'cost_center', 'CC-FINANCE-001',
      'stakeholders', JSON_ARRAY('john.doe@company.com', 'jane.smith@company.com')
    )
  )
);

-- Bulk file upload with batch processing
INSERT INTO GRIDFS('images') (filename, content_type, file_data, metadata)
WITH file_batch AS (
  SELECT 
    original_filename as filename,
    detected_content_type as content_type,
    file_binary_data as file_data,

    -- Enhanced metadata generation
    JSON_OBJECT(
      'upload_batch_id', batch_id,
      'uploaded_by', uploader_user_id,
      'upload_source', upload_source,
      'original_path', original_file_path,

      -- Image-specific metadata
      'image_metadata', JSON_OBJECT(
        'width', image_width,
        'height', image_height,
        'format', image_format,
        'color_space', color_space,
        'has_transparency', has_alpha_channel,
        'camera_info', camera_metadata
      ),

      -- Processing instructions
      'processing_queue', JSON_ARRAY(
        'thumbnail_generation',
        'format_optimization',
        'metadata_extraction',
        'duplicate_detection'
      ),

      -- Organization
      'album_id', album_id,
      'event_date', event_date,
      'location', geo_location,
      'tags', detected_tags,

      -- Access control
      'visibility', photo_visibility,
      'sharing_permissions', sharing_rules,
      'privacy_level', privacy_setting
    ) as metadata

  FROM staging_images 
  WHERE processing_status = 'ready_for_upload'
    AND upload_date >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
)
SELECT filename, content_type, file_data, metadata
FROM file_batch
WHERE content_type LIKE 'image/%'

-- GridFS bulk upload configuration  
WITH UPLOAD_OPTIONS (
  concurrent_uploads = 10,
  chunk_size = '256KB',
  enable_deduplication = true,
  enable_virus_scanning = true,
  processing_priority = 'normal'
);

-- Query files with advanced filtering and metadata search
WITH file_search AS (
  SELECT 
    file_id,
    filename,
    upload_date,
    length as file_size_bytes,

    -- Extract metadata fields
    JSON_EXTRACT(metadata, '$.category') as category,
    JSON_EXTRACT(metadata, '$.department') as department,
    JSON_EXTRACT(metadata, '$.uploaded_by') as uploaded_by,
    JSON_EXTRACT(metadata, '$.tags') as tags,
    JSON_EXTRACT(metadata, '$.access_level') as access_level,
    JSON_EXTRACT(metadata, '$.content_type') as content_type,

    -- Calculate file age and size categories
    EXTRACT(DAYS FROM CURRENT_TIMESTAMP - upload_date) as age_days,
    CASE 
      WHEN length < 1024 * 1024 THEN 'small'
      WHEN length < 10 * 1024 * 1024 THEN 'medium'
      WHEN length < 100 * 1024 * 1024 THEN 'large'
      ELSE 'very_large'
    END as size_category,

    -- Access patterns
    JSON_EXTRACT(metadata, '$.download_count') as download_count,
    JSON_EXTRACT(metadata, '$.last_accessed') as last_accessed,

    -- Processing status
    JSON_EXTRACT(metadata, '$.processing_status') as processing_status,
    JSON_EXTRACT(metadata, '$.hash') as file_hash,

    -- Business context
    JSON_EXTRACT(metadata, '$.business_context.project_id') as project_id,
    JSON_EXTRACT(metadata, '$.business_context.cost_center') as cost_center

  FROM GRIDFS_FILES('documents')
  WHERE 
    -- Time-based filtering
    upload_date >= CURRENT_TIMESTAMP - INTERVAL '90 days'

    -- Access level filtering (security)
    AND (
      JSON_EXTRACT(metadata, '$.access_level') = 'public'
      OR (
        JSON_EXTRACT(metadata, '$.access_level') = 'restricted' 
        AND CURRENT_USER_HAS_PERMISSION('restricted_files')
      )
      OR (
        JSON_EXTRACT(metadata, '$.uploaded_by') = CURRENT_USER_ID()
      )
    )

    -- Content filtering
    AND JSON_EXTRACT(metadata, '$.processing_status') = 'completed'

  UNION ALL

  -- Include image files with different criteria
  SELECT 
    file_id,
    filename,
    upload_date,
    length as file_size_bytes,
    JSON_EXTRACT(metadata, '$.category') as category,
    'media' as department,
    JSON_EXTRACT(metadata, '$.uploaded_by') as uploaded_by,
    JSON_EXTRACT(metadata, '$.tags') as tags,
    JSON_EXTRACT(metadata, '$.visibility') as access_level,
    'image' as content_type,
    EXTRACT(DAYS FROM CURRENT_TIMESTAMP - upload_date) as age_days,
    CASE 
      WHEN length < 1024 * 1024 THEN 'small'
      WHEN length < 5 * 1024 * 1024 THEN 'medium' 
      WHEN length < 20 * 1024 * 1024 THEN 'large'
      ELSE 'very_large'
    END as size_category,
    COALESCE(JSON_EXTRACT(metadata, '$.view_count'), 0) as download_count,
    JSON_EXTRACT(metadata, '$.last_viewed') as last_accessed,
    JSON_EXTRACT(metadata, '$.processing_status') as processing_status,
    JSON_EXTRACT(metadata, '$.hash') as file_hash,
    JSON_EXTRACT(metadata, '$.album_id') as project_id,
    'MEDIA-STORAGE' as cost_center

  FROM GRIDFS_FILES('images')
  WHERE upload_date >= CURRENT_TIMESTAMP - INTERVAL '30 days'
),

usage_analytics AS (
  SELECT 
    fs.*,

    -- Usage classification
    CASE 
      WHEN download_count >= 100 THEN 'frequently_accessed'
      WHEN download_count >= 10 THEN 'moderately_accessed'
      WHEN download_count >= 1 THEN 'rarely_accessed'
      ELSE 'never_accessed'
    END as usage_pattern,

    -- Age-based classification
    CASE 
      WHEN age_days <= 7 THEN 'very_recent'
      WHEN age_days <= 30 THEN 'recent'
      WHEN age_days <= 90 THEN 'moderate_age'
      ELSE 'old'
    END as age_category,

    -- Storage optimization recommendations
    -- (conditions repeat the download_count thresholds because usage_pattern
    -- is an alias defined in this same SELECT list)
    CASE 
      WHEN age_days > 365 AND download_count = 0 THEN 'candidate_for_archival'
      WHEN size_category = 'very_large' AND COALESCE(download_count, 0) < 1 THEN 'candidate_for_compression'
      WHEN age_days <= 30 AND download_count >= 100 THEN 'hot_storage_candidate'
      ELSE 'standard_storage'
    END as storage_recommendation,

    -- Content insights
    ARRAY_LENGTH(
      STRING_TO_ARRAY(
        REPLACE(REPLACE(JSON_EXTRACT_TEXT(tags), '[', ''), ']', ''), 
        ','
      ), 
      1
    ) as tag_count,

    -- Cost analysis (estimated, using illustrative per-GB monthly rates)
    CASE 
      WHEN size_category = 'small' THEN file_size_bytes / 1073741824.0 * 0.001  -- $0.001/GB/month
      WHEN size_category = 'medium' THEN file_size_bytes / 1073741824.0 * 0.0008
      WHEN size_category = 'large' THEN file_size_bytes / 1073741824.0 * 0.0005
      ELSE file_size_bytes / 1073741824.0 * 0.0003
    END as estimated_monthly_storage_cost

  FROM file_search fs
),

aggregated_insights AS (
  SELECT 
    department,
    category,
    content_type,
    age_category,
    usage_pattern,
    size_category,
    storage_recommendation,

    -- Volume metrics
    COUNT(*) as file_count,
    SUM(file_size_bytes) as total_size_bytes,
    AVG(file_size_bytes) as avg_file_size,

    -- Usage metrics
    SUM(download_count) as total_downloads,
    AVG(download_count) as avg_downloads_per_file,
    COUNT(*) FILTER (WHERE download_count = 0) as unused_files,

    -- Age distribution
    AVG(age_days) as avg_age_days,
    MIN(upload_date) as oldest_file_date,
    MAX(upload_date) as newest_file_date,

    -- Storage cost analysis
    SUM(estimated_monthly_storage_cost) as estimated_monthly_cost,

    -- Content analysis
    AVG(tag_count) as avg_tags_per_file,
    COUNT(DISTINCT uploaded_by) as unique_uploaders,
    COUNT(DISTINCT project_id) as unique_projects

  FROM usage_analytics
  GROUP BY 
    department, category, content_type, age_category, 
    usage_pattern, size_category, storage_recommendation
)

SELECT 
  -- Classification dimensions
  department,
  category,
  content_type,
  age_category,
  usage_pattern,
  size_category,

  -- Volume and size metrics
  file_count,
  ROUND(total_size_bytes / 1024.0 / 1024.0 / 1024.0, 2) as total_size_gb,
  ROUND(avg_file_size / 1024.0 / 1024.0, 2) as avg_file_size_mb,

  -- Usage analytics
  total_downloads,
  ROUND(avg_downloads_per_file, 1) as avg_downloads_per_file,
  unused_files,
  ROUND((unused_files::DECIMAL / file_count) * 100, 1) as unused_files_percent,

  -- Age and lifecycle
  ROUND(avg_age_days, 1) as avg_age_days,
  oldest_file_date,
  newest_file_date,

  -- Content insights
  ROUND(avg_tags_per_file, 1) as avg_tags_per_file,
  unique_uploaders,
  unique_projects,

  -- Cost optimization
  ROUND(estimated_monthly_cost, 4) as estimated_monthly_cost_usd,
  storage_recommendation,

  -- Actionable insights
  CASE storage_recommendation
    WHEN 'candidate_for_archival' THEN 'Move to cold storage or delete if no business value'
    WHEN 'candidate_for_compression' THEN 'Enable compression to reduce storage costs'
    WHEN 'hot_storage_candidate' THEN 'Ensure high-performance storage tier'
    ELSE 'Current storage tier appropriate'
  END as recommended_action,

  -- Priority scoring for action (expressions are repeated because aliases
  -- from this SELECT list cannot be referenced here)
  CASE 
    WHEN storage_recommendation = 'candidate_for_archival' AND (unused_files::DECIMAL / file_count) * 100 > 80 THEN 'high_priority'
    WHEN storage_recommendation = 'candidate_for_compression' AND total_size_bytes / 1024.0 / 1024.0 / 1024.0 > 10 THEN 'high_priority'
    WHEN storage_recommendation = 'hot_storage_candidate' AND avg_downloads_per_file > 50 THEN 'high_priority'
    WHEN (unused_files::DECIMAL / file_count) * 100 > 50 THEN 'medium_priority'
    ELSE 'low_priority'
  END as action_priority

FROM aggregated_insights
WHERE file_count > 0
ORDER BY 
  CASE action_priority
    WHEN 'high_priority' THEN 1
    WHEN 'medium_priority' THEN 2
    ELSE 3
  END,
  total_size_gb DESC,
  file_count DESC;

-- File streaming with range support and performance optimization
WITH file_stream_request AS (
  SELECT 
    file_id,
    filename,
    length as total_size,
    content_type,
    upload_date,

    -- Extract streaming metadata
    JSON_EXTRACT(metadata, '$.streaming_optimized') as streaming_optimized,
    JSON_EXTRACT(metadata, '$.cdn_enabled') as cdn_enabled,
    JSON_EXTRACT(metadata, '$.cache_headers') as cache_headers,

    -- Range request parameters (would be provided by application)
    $range_start as range_start,
    $range_end as range_end,

    -- Calculate effective range
    COALESCE($range_start, 0) as effective_start,
    COALESCE($range_end, length - 1) as effective_end,

    -- Streaming metadata
    JSON_EXTRACT(metadata, '$.video_metadata.duration') as video_duration,
    JSON_EXTRACT(metadata, '$.video_metadata.bitrate') as video_bitrate,
    JSON_EXTRACT(metadata, '$.image_metadata.width') as image_width,
    JSON_EXTRACT(metadata, '$.image_metadata.height') as image_height

  FROM GRIDFS_FILES('videos')
  WHERE file_id = $requested_file_id
)

SELECT 
  fsr.file_id,
  fsr.filename,
  fsr.content_type,
  fsr.total_size,

  -- Range information
  fsr.effective_start,
  fsr.effective_end,
  (fsr.effective_end - fsr.effective_start + 1) as range_size,

  -- Content headers for HTTP response
  'bytes ' || fsr.effective_start || '-' || fsr.effective_end || '/' || fsr.total_size as content_range_header,

  CASE 
    WHEN fsr.effective_start = 0 AND fsr.effective_end = fsr.total_size - 1 THEN '200'
    ELSE '206' -- Partial content
  END as http_status_code,

  -- Caching and performance headers
  CASE fsr.content_type
    WHEN 'image/jpeg' THEN 'public, max-age=2592000' -- 30 days
    WHEN 'image/png' THEN 'public, max-age=2592000'
    WHEN 'video/mp4' THEN 'public, max-age=3600' -- 1 hour
    WHEN 'application/pdf' THEN 'private, max-age=1800' -- 30 minutes
    ELSE 'private, max-age=300' -- 5 minutes
  END as cache_control_header,

  -- Streaming optimization flags
  fsr.streaming_optimized::BOOLEAN as is_streaming_optimized,
  fsr.cdn_enabled::BOOLEAN as use_cdn,

  -- Performance estimates
  CASE 
    WHEN fsr.video_bitrate IS NOT NULL THEN
      ROUND((fsr.effective_end - fsr.effective_start + 1) / (fsr.video_bitrate::DECIMAL * 1024 / 8), 2)
    ELSE NULL
  END as estimated_streaming_seconds,

  -- Content metadata for client
  JSON_OBJECT(
    'total_duration', fsr.video_duration,
    'bitrate_kbps', fsr.video_bitrate,
    'width', fsr.image_width,
    'height', fsr.image_height,
    'supports_range_requests', true,
    'chunk_size_optimized', true,
    'streaming_ready', fsr.streaming_optimized::BOOLEAN
  ) as content_metadata,

  -- GridFS streaming query (this would trigger the actual data retrieval)
  GRIDFS_STREAM(fsr.file_id, fsr.effective_start, fsr.effective_end) as file_stream

FROM file_stream_request fsr;

-- Advanced file analytics and storage optimization
WITH storage_utilization AS (
  SELECT 
    bucket_name,
    DATE_TRUNC('day', upload_date) as upload_day,

    -- Daily storage metrics
    COUNT(*) as daily_files,
    SUM(length) as daily_storage_bytes,
    AVG(length) as avg_file_size_daily,

    -- Content type distribution
    COUNT(*) FILTER (WHERE JSON_EXTRACT(metadata, '$.content_type') LIKE 'image/%') as image_files,
    COUNT(*) FILTER (WHERE JSON_EXTRACT(metadata, '$.content_type') LIKE 'video/%') as video_files,
    COUNT(*) FILTER (WHERE JSON_EXTRACT(metadata, '$.content_type') LIKE 'application/%') as document_files,

    -- Processing status
    COUNT(*) FILTER (WHERE JSON_EXTRACT(metadata, '$.processing_status') = 'completed') as processed_files,
    COUNT(*) FILTER (WHERE JSON_EXTRACT(metadata, '$.processing_status') = 'failed') as failed_files,

    -- Access patterns
    SUM(COALESCE(JSON_EXTRACT(metadata, '$.download_count')::INTEGER, 0)) as total_downloads,
    AVG(COALESCE(JSON_EXTRACT(metadata, '$.download_count')::INTEGER, 0)) as avg_downloads_per_file

  FROM (
    SELECT 'documents' as bucket_name, file_id, filename, length, upload_date, metadata 
    FROM GRIDFS_FILES('documents')
    UNION ALL
    SELECT 'images' as bucket_name, file_id, filename, length, upload_date, metadata 
    FROM GRIDFS_FILES('images') 
    UNION ALL
    SELECT 'videos' as bucket_name, file_id, filename, length, upload_date, metadata 
    FROM GRIDFS_FILES('videos')
  ) all_files
  WHERE upload_date >= CURRENT_TIMESTAMP - INTERVAL '30 days'
  GROUP BY bucket_name, DATE_TRUNC('day', upload_date)
),

performance_analysis AS (
  SELECT 
    su.*,

    -- Growth analysis
    LAG(daily_storage_bytes) OVER (
      PARTITION BY bucket_name 
      ORDER BY upload_day
    ) as prev_day_storage,

    -- Calculate growth rate
    CASE 
      WHEN LAG(daily_storage_bytes) OVER (PARTITION BY bucket_name ORDER BY upload_day) > 0 THEN
        ROUND(
          ((daily_storage_bytes - LAG(daily_storage_bytes) OVER (PARTITION BY bucket_name ORDER BY upload_day))::DECIMAL / 
           LAG(daily_storage_bytes) OVER (PARTITION BY bucket_name ORDER BY upload_day)) * 100, 
          1
        )
      ELSE NULL
    END as storage_growth_percent,

    -- Performance indicators
    ROUND(daily_storage_bytes / NULLIF(daily_files, 0) / 1024.0 / 1024.0, 2) as avg_file_size_mb,
    ROUND(total_downloads::DECIMAL / NULLIF(daily_files, 0), 2) as download_ratio,

    -- Processing efficiency
    ROUND((processed_files::DECIMAL / NULLIF(daily_files, 0)) * 100, 1) as processing_success_rate,
    ROUND((failed_files::DECIMAL / NULLIF(daily_files, 0)) * 100, 1) as processing_failure_rate,

    -- Storage efficiency indicators
    CASE 
      WHEN avg_downloads_per_file = 0 THEN 'unused_storage'
      WHEN avg_downloads_per_file < 0.1 THEN 'low_utilization'
      WHEN avg_downloads_per_file < 1.0 THEN 'moderate_utilization'
      ELSE 'high_utilization'
    END as utilization_category

  FROM storage_utilization su
)

SELECT 
  bucket_name,
  upload_day,

  -- Volume metrics
  daily_files,
  ROUND(daily_storage_bytes / 1024.0 / 1024.0 / 1024.0, 3) as daily_storage_gb,
  avg_file_size_mb,

  -- Content distribution
  image_files,
  video_files,
  document_files,

  -- Performance metrics
  processing_success_rate,
  processing_failure_rate,
  download_ratio,
  utilization_category,

  -- Growth analysis
  storage_growth_percent,

  -- Optimization recommendations
  CASE 
    WHEN utilization_category = 'unused_storage' THEN 'implement_retention_policy'
    WHEN processing_failure_rate > 10 THEN 'investigate_processing_issues'
    WHEN storage_growth_percent > 100 THEN 'monitor_storage_capacity'
    WHEN avg_file_size_mb > 100 THEN 'consider_compression_optimization'
    ELSE 'storage_operating_normally'
  END as optimization_recommendation,

  -- Projected storage (simple linear projection)
  CASE 
    WHEN storage_growth_percent IS NOT NULL THEN
      ROUND(
        daily_storage_bytes * (1 + storage_growth_percent / 100) * 30 / 1024.0 / 1024.0 / 1024.0, 
        2
      )
    ELSE NULL
  END as projected_monthly_storage_gb,

  -- Alert conditions
  CASE 
    WHEN processing_failure_rate > 20 THEN 'critical_processing_failure'
    WHEN storage_growth_percent > 200 THEN 'critical_storage_growth'
    WHEN utilization_category = 'unused_storage' AND daily_storage_gb > 1 THEN 'storage_waste_alert'
    ELSE 'normal_operations'
  END as alert_status

FROM performance_analysis
WHERE daily_files > 0
ORDER BY 
  CASE alert_status
    WHEN 'critical_processing_failure' THEN 1
    WHEN 'critical_storage_growth' THEN 2
    WHEN 'storage_waste_alert' THEN 3
    ELSE 4
  END,
  bucket_name,
  upload_day DESC;

-- QueryLeaf provides comprehensive GridFS capabilities:
-- 1. SQL-familiar GridFS bucket creation and management
-- 2. Advanced file upload with metadata enrichment and batch processing
-- 3. Efficient file querying with metadata search and filtering
-- 4. High-performance file streaming with range request support
-- 5. Comprehensive storage analytics and optimization recommendations
-- 6. Integration with MongoDB's native GridFS optimizations
-- 7. Advanced access control and security features
-- 8. SQL-style operations for complex file management workflows
-- 9. Built-in performance monitoring and capacity planning
-- 10. Enterprise-ready file storage with distributed capabilities
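
The growth and projection arithmetic used in the query above can be checked outside the database. The following is a minimal sketch in JavaScript, assuming byte counts arrive as plain numbers; the function names are hypothetical and only mirror the storage_growth_percent and projected_monthly_storage_gb expressions. Like the SQL, the projection simply assumes every one of the next 30 days uploads at the grown daily volume.

```javascript
const BYTES_PER_GB = 1024 ** 3;

// Day-over-day growth in percent, rounded to one decimal place.
// Returns null when there is no usable previous day (mirrors the SQL CASE).
function storageGrowthPercent(prevBytes, currentBytes) {
  if (prevBytes == null || prevBytes <= 0) return null;
  return Math.round(((currentBytes - prevBytes) / prevBytes) * 1000) / 10;
}

// Simple linear projection: today's volume, scaled by the growth rate,
// repeated for 30 days and converted to GB (rounded to two decimals).
function projectedMonthlyStorageGb(currentBytes, growthPercent) {
  if (growthPercent == null) return null;
  const gb = (currentBytes * (1 + growthPercent / 100) * 30) / BYTES_PER_GB;
  return Math.round(gb * 100) / 100;
}

// Example: 2 GB uploaded yesterday, 3 GB uploaded today
const growth = storageGrowthPercent(2 * BYTES_PER_GB, 3 * BYTES_PER_GB); // 50
const projected = projectedMonthlyStorageGb(3 * BYTES_PER_GB, growth);   // 135
```

A single day's growth rate is a noisy input, so a projection like this is better treated as an alerting signal than as a capacity forecast.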

Best Practices for GridFS Implementation

File Storage Strategy and Performance Optimization

Essential principles for effective MongoDB GridFS deployment:

Chunk Size Optimization: Choose chunk sizes based on file types and access patterns - smaller chunks for random access, larger chunks for sequential streaming
Bucket Organization: Create separate buckets for different file types to optimize chunk sizes and indexing strategies
Metadata Design: Implement comprehensive metadata schemas that support efficient querying and business requirements
Index Strategy: Create strategic indexes on frequently queried metadata fields while avoiding over-indexing
Security Integration: Implement access control and encryption that integrates with application security frameworks
Performance Monitoring: Track upload/download performance, storage utilization, and access patterns for optimization
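
As a rough illustration of the chunk size and bucket organization principles above, the sketch below maps an access pattern to GridFSBucket options. The helper names and the specific sizes are assumptions chosen for illustration, not recommendations; only bucketName and chunkSizeBytes are actual GridFSBucket options, and 255 KiB is the GridFS default chunk size.

```javascript
const KIB = 1024;
const DEFAULT_CHUNK_SIZE_BYTES = 255 * KIB; // GridFS default chunk size

// Illustrative sizes only; benchmark against real workloads before adopting them.
const CHUNK_SIZE_BY_PATTERN = {
  randomAccess: 64 * KIB,           // smaller chunks: less data read per seek
  general: DEFAULT_CHUNK_SIZE_BYTES,
  sequentialStreaming: 1024 * KIB   // larger chunks: fewer round trips per file
};

// Hypothetical helper: builds the options object for one bucket.
// Unknown access patterns fall back to the GridFS default.
function bucketOptionsFor(bucketName, accessPattern) {
  const chunkSizeBytes = CHUNK_SIZE_BY_PATTERN[accessPattern] ?? DEFAULT_CHUNK_SIZE_BYTES;
  return { bucketName, chunkSizeBytes };
}

// Number of chunk documents a file of the given length occupies.
function chunkCount(lengthBytes, chunkSizeBytes) {
  return Math.ceil(lengthBytes / chunkSizeBytes);
}
```

The result can be passed directly to the driver, for example new GridFSBucket(db, bucketOptionsFor('videos', 'sequentialStreaming')). Keeping one bucket per file type lets each bucket carry its own chunk size and its own metadata indexes.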

Production Deployment and Operational Excellence

Design GridFS systems for enterprise-scale requirements:

Distributed Architecture: Implement GridFS across sharded clusters with proper shard key design for balanced distribution
Backup and Recovery: Design backup strategies that account for GridFS's dual-collection structure (files and chunks)
Content Delivery: Integrate with CDN and caching layers for optimal global content delivery performance
Storage Tiering: Implement automated data lifecycle management with hot, warm, and cold storage tiers
Compliance Features: Build in data governance, audit trails, and regulatory compliance capabilities
Monitoring and Alerting: Establish comprehensive monitoring for storage utilization, performance, and system health
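
To make the sharding and backup points above more concrete, here is a minimal sketch that derives a bucket's two collection namespaces and builds a shardCollection admin command for the chunks collection. The function names are hypothetical. The shard key { files_id: 1, n: 1 } is a commonly documented choice for GridFS chunks, but whether it balances well depends on the workload, so treat it as an assumption to validate.

```javascript
// GridFS stores each bucket in two collections: <bucket>.files and <bucket>.chunks.
// Both namespaces must be included in any backup for files to be restorable.
function gridfsNamespaces(dbName, bucketName = 'fs') {
  return {
    files: `${dbName}.${bucketName}.files`,
    chunks: `${dbName}.${bucketName}.chunks`
  };
}

// Builds (but does not run) the admin command that shards the chunks collection.
// Keying on files_id first keeps the chunks of one file on the same shard range.
function shardChunksCommand(dbName, bucketName = 'fs') {
  return {
    shardCollection: gridfsNamespaces(dbName, bucketName).chunks,
    key: { files_id: 1, n: 1 }
  };
}
```

Against a sharded cluster, the command could be issued with client.db('admin').command(shardChunksCommand('media', 'videos')). The files collection is usually much smaller than the chunks collection and is often left unsharded.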

Conclusion

MongoDB GridFS stores large files inside the database by splitting them into chunks, keeping metadata alongside the file data, and inheriting MongoDB's replication and sharding. Handling files and database records in one system simplifies file management workflows. Note, however, that GridFS does not support multi-document transactions, so applications that need a file write and related document writes to succeed or fail together must coordinate that themselves.

Key MongoDB GridFS benefits include:

Automatic Chunking: Files larger than the 16 MB BSON document limit are split into chunks and reassembled without manual chunk management
Integrated Metadata: Rich metadata storage with file data for complex querying and business logic integration
Distributed Storage: Native support for MongoDB's replication and sharding for global file distribution
Streaming Capabilities: Efficient file streaming and range requests for multimedia and large file applications
Write Durability: File and chunk writes use MongoDB's write concerns and replication, although GridFS itself does not support multi-document transactions
SQL Accessibility: Familiar SQL-style file operations through QueryLeaf for accessible enterprise file management

Whether you're building content management systems, media platforms, document repositories, or enterprise file storage solutions, MongoDB GridFS with QueryLeaf's familiar SQL interface provides the foundation for scalable, reliable, and feature-rich file storage architectures.

QueryLeaf Integration: QueryLeaf seamlessly manages MongoDB GridFS operations while providing SQL-familiar syntax for file uploads, downloads, streaming, and metadata management. Advanced file storage patterns including distributed storage, content delivery, and enterprise security features are elegantly handled through familiar SQL constructs, making sophisticated file management both powerful and accessible to SQL-oriented development teams.

The combination of MongoDB's robust GridFS capabilities with SQL-style file operations makes it an ideal platform for applications requiring both advanced file storage functionality and familiar database interaction patterns, ensuring your file storage infrastructure can scale efficiently while maintaining operational simplicity and developer productivity.

November 5, 2025
23 min read

MongoDB Atlas Search and Advanced Text Indexing: Full-Text Search with Vector Similarity and Multi-Language Support

Modern applications require sophisticated search capabilities that go beyond simple text matching to provide relevant, contextual results across multiple data types and languages. Traditional full-text search implementations struggle with semantic understanding, multi-language support, and the complexity of integrating machine learning-based relevance scoring, often requiring separate search engines and complex data synchronization processes that increase operational overhead and system complexity.

MongoDB Atlas Search provides native full-text search built on Apache Lucene, with configurable text analyzers, vector similarity search, and relevance scoring, which removes the need for a separate search engine in many applications. Instead of maintaining separate search infrastructure and data pipelines, Atlas Search indexes MongoDB collections directly and keeps those indexes synchronized in near real time (search indexes are eventually consistent with the underlying data), with multi-language analyzers available within the same platform.

The Traditional Search Challenge

Conventional search implementations involve significant complexity and operational burden:

-- Traditional PostgreSQL full-text search approach - limited and complex
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Basic document storage with limited search capabilities
CREATE TABLE documents (
    document_id BIGSERIAL PRIMARY KEY,
    title VARCHAR(500) NOT NULL,
    content TEXT NOT NULL,
    author VARCHAR(200),
    category VARCHAR(100),
    tags VARCHAR(255)[],

    -- Language and localization
    language VARCHAR(10) DEFAULT 'en',
    content_locale VARCHAR(10),

    -- Metadata for search
    publish_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    modified_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) DEFAULT 'published',

    -- Basic search vectors (very limited functionality)
    title_vector TSVECTOR,
    content_vector TSVECTOR,
    combined_vector TSVECTOR
);

-- Manual maintenance of search vectors required
CREATE OR REPLACE FUNCTION update_document_search_vectors()
RETURNS TRIGGER AS $$
BEGIN
    -- Basic text search vector creation (limited language support)
    NEW.title_vector := to_tsvector('english', COALESCE(NEW.title, ''));
    NEW.content_vector := to_tsvector('english', COALESCE(NEW.content, ''));
    NEW.combined_vector := to_tsvector('english', 
        COALESCE(NEW.title, '') || ' ' || COALESCE(NEW.content, '')
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_update_search_vectors
    BEFORE INSERT OR UPDATE ON documents
    FOR EACH ROW EXECUTE FUNCTION update_document_search_vectors();

-- Basic GIN indexes for text search (limited optimization)
CREATE INDEX idx_documents_title_search ON documents USING GIN(title_vector);
CREATE INDEX idx_documents_content_search ON documents USING GIN(content_vector);
CREATE INDEX idx_documents_combined_search ON documents USING GIN(combined_vector);
CREATE INDEX idx_documents_category_status ON documents(category, status);

-- User search behavior and analytics tracking
CREATE TABLE search_queries (
    query_id BIGSERIAL PRIMARY KEY,
    user_id BIGINT,
    session_id VARCHAR(100),
    query_text TEXT NOT NULL,
    query_language VARCHAR(10) DEFAULT 'en',

    -- Search parameters
    filters_applied JSONB,
    sort_criteria VARCHAR(100),
    page_number INTEGER DEFAULT 1,
    results_per_page INTEGER DEFAULT 10,

    -- Search results and performance
    total_results_found INTEGER,
    execution_time_ms INTEGER,
    results_clicked BIGINT[] DEFAULT '{}',  -- BIGINT to match documents.document_id

    -- User context
    user_agent TEXT,
    referrer TEXT,
    search_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Search quality metrics
    user_satisfaction INTEGER CHECK (user_satisfaction BETWEEN 1 AND 5),
    bounce_rate DECIMAL(4,2),
    conversion_achieved BOOLEAN DEFAULT FALSE
);

-- Complex search query with limited capabilities
WITH search_base AS (
    SELECT 
        d.document_id,
        d.title,
        d.content,
        d.author,
        d.category,
        d.tags,
        d.publish_date,
        d.language,

        -- Basic relevance scoring (very primitive)
        ts_rank_cd(d.title_vector, plainto_tsquery('english', $search_query)) * 2.0 as title_relevance,
        ts_rank_cd(d.content_vector, plainto_tsquery('english', $search_query)) as content_relevance,

        -- Combine relevance scores
        (ts_rank_cd(d.title_vector, plainto_tsquery('english', $search_query)) * 2.0 +
         ts_rank_cd(d.content_vector, plainto_tsquery('english', $search_query))) as combined_relevance,

        -- Simple popularity boost (no ML)
        LOG(GREATEST(1, (SELECT COUNT(*) FROM search_queries sq WHERE sq.results_clicked @> ARRAY[d.document_id]))) as popularity_score,

        -- Basic category boosting
        CASE 
            WHEN d.category = $preferred_category THEN 1.2
            ELSE 1.0
        END as category_boost,

        -- Recency boost (basic time decay)
        CASE 
            WHEN d.publish_date >= CURRENT_TIMESTAMP - INTERVAL '30 days' THEN 1.3
            WHEN d.publish_date >= CURRENT_TIMESTAMP - INTERVAL '90 days' THEN 1.1
            ELSE 1.0
        END as recency_boost

    FROM documents d
    WHERE 
        d.status = 'published'
        AND ($language IS NULL OR d.language = $language)
        AND ($category_filter IS NULL OR d.category = $category_filter)

        -- Basic text search (limited semantic understanding)
        AND (
            d.combined_vector @@ plainto_tsquery('english', $search_query)
            OR SIMILARITY(d.title, $search_query) > 0.3
            OR d.title ILIKE '%' || $search_query || '%'
            OR d.content ILIKE '%' || $search_query || '%'
        )
),

search_with_scoring AS (
    SELECT 
        sb.*,

        -- Final relevance calculation (very basic)
        GREATEST(0.1, 
            sb.combined_relevance * sb.category_boost * sb.recency_boost + 
            (sb.popularity_score * 0.1)
        ) as final_relevance_score,

        -- Extract matching snippets (primitive)
        ts_headline('english', 
            LEFT(sb.content, 1000), 
            plainto_tsquery('english', $search_query),
            'MaxWords=35, MinWords=15, MaxFragments=3'
        ) as content_snippet,

        -- Count matching terms (basic)
        (SELECT COUNT(*) 
         FROM unnest(string_to_array(lower($search_query), ' ')) as query_word
         WHERE lower(sb.title || ' ' || sb.content) LIKE '%' || query_word || '%'
        ) as matching_terms_count,

        -- Simple spell correction suggestions (very limited)
        CASE 
            WHEN SIMILARITY(sb.title, $search_query) < 0.1 THEN
                (SELECT string_agg(suggestion, ' ') 
                 FROM (
                     SELECT word as suggestion 
                     FROM unnest(string_to_array($search_query, ' ')) as word
                     ORDER BY SIMILARITY(word, sb.title) DESC 
                     LIMIT 3
                 ) suggestions)
            ELSE NULL
        END as spelling_suggestions

    FROM search_base sb
),

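search_analytics AS (
    -- Track search performance (basic analytics)
    SELECT 
        CURRENT_TIMESTAMP as search_executed_at,
        $search_query as query_executed,
        COUNT(*) as total_results_found,
        AVG(sws.final_relevance_score) as avg_relevance_score,
        MAX(sws.final_relevance_score) as max_relevance_score,

        -- Category distribution (aggregates cannot be nested, so counts are
        -- computed per category first and then collected into JSON)
        (SELECT json_object_agg(category, category_count)
         FROM (SELECT category, COUNT(*) as category_count
               FROM search_with_scoring
               WHERE final_relevance_score > 0.1 AND category IS NOT NULL
               GROUP BY category) by_category) as results_by_category,

        -- Language distribution
        (SELECT json_object_agg(language, language_count)
         FROM (SELECT language, COUNT(*) as language_count
               FROM search_with_scoring
               WHERE final_relevance_score > 0.1 AND language IS NOT NULL
               GROUP BY language) by_language) as results_by_language

    FROM search_with_scoring sws
    WHERE sws.final_relevance_score > 0.1
)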

-- Final search results with basic ranking
SELECT 
    sws.document_id,
    sws.title,
    sws.author,
    sws.category,
    sws.tags,
    sws.publish_date,
    sws.language,

    -- Relevance and ranking
    ROUND(sws.final_relevance_score, 4) as relevance_score,
    ROW_NUMBER() OVER (ORDER BY sws.final_relevance_score DESC, sws.publish_date DESC) as search_rank,

    -- Content preview
    sws.content_snippet,
    LENGTH(sws.content) as content_length,
    sws.matching_terms_count,

    -- Search enhancements (very basic)
    sws.spelling_suggestions,

    -- Quality indicators
    CASE 
        WHEN sws.final_relevance_score > 0.8 THEN 'high'
        WHEN sws.final_relevance_score > 0.4 THEN 'medium'
        ELSE 'low'
    END as match_quality,

    -- Search metadata
    EXTRACT(DAYS FROM CURRENT_TIMESTAMP - sws.publish_date) as days_old

FROM search_with_scoring sws
WHERE sws.final_relevance_score > 0.1
ORDER BY sws.final_relevance_score DESC, sws.publish_date DESC
LIMIT $results_limit OFFSET $results_offset;

-- Insert search analytics
-- The CTEs above are out of scope in this separate statement, so the result
-- count is passed in by the application ($total_results_found is a
-- placeholder parameter like the others)
INSERT INTO search_queries (
    user_id, session_id, query_text, query_language, 
    total_results_found, execution_time_ms, search_timestamp
) VALUES (
    $user_id, $session_id, $search_query, $language,
    $total_results_found,
    $execution_time_ms, CURRENT_TIMESTAMP
);

-- Traditional search approach problems:
-- 1. Very limited semantic understanding and context awareness
-- 2. Poor multi-language support requiring separate configurations
-- 3. No vector similarity or machine learning capabilities
-- 4. Manual maintenance of search indexes and vectors
-- 5. Primitive relevance scoring without ML-based optimization
-- 6. No real-time search suggestions or autocomplete
-- 7. Limited spell correction and fuzzy matching capabilities
-- 8. Complex integration with external search engines required for advanced features
-- 9. No built-in search analytics or performance optimization
-- 10. Difficulty in handling multimedia and structured data search
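
To see how little signal the hand-tuned scoring above carries, the final_relevance_score expression can be reproduced in a few lines. This is a minimal sketch with a hypothetical function name and made-up input values; it assumes the single-argument PostgreSQL LOG() is base 10, which matches Math.log10 here.

```javascript
// Mirrors the SQL: GREATEST(0.1, combined * category_boost * recency_boost + popularity * 0.1)
function finalRelevanceScore({ combinedRelevance, categoryBoost, recencyBoost, clickCount }) {
  // popularity_score = LOG(GREATEST(1, clicks)), base 10
  const popularityScore = Math.log10(Math.max(1, clickCount));
  return Math.max(0.1, combinedRelevance * categoryBoost * recencyBoost + popularityScore * 0.1);
}

// A document in the preferred category, published within 30 days, clicked 100 times:
// 0.5 * 1.2 * 1.3 + log10(100) * 0.1, roughly 0.98
const boosted = finalRelevanceScore({
  combinedRelevance: 0.5,
  categoryBoost: 1.2,
  recencyBoost: 1.3,
  clickCount: 100
});

// A document with no text relevance and no clicks still gets the 0.1 floor
const floor = finalRelevanceScore({
  combinedRelevance: 0,
  categoryBoost: 1.0,
  recencyBoost: 1.0,
  clickCount: 0
});
```

Every weight in this formula (the 2.0 title multiplier, the boosts, the 0.1 popularity factor) is a constant someone has to pick and retune by hand.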

MongoDB Atlas Search provides comprehensive search capabilities with advanced indexing and ML integration:

// MongoDB Atlas Search - Advanced full-text and vector search capabilities
const { MongoClient, ObjectId } = require('mongodb');

// Comprehensive Atlas Search Manager
class AtlasSearchManager {
  constructor(connectionString, searchConfig = {}) {
    this.connectionString = connectionString;
    this.client = null;
    this.db = null;

    this.config = {
      // Search configuration
      enableFullTextSearch: searchConfig.enableFullTextSearch !== false,
      enableVectorSearch: searchConfig.enableVectorSearch !== false,
      enableFacetedSearch: searchConfig.enableFacetedSearch !== false,
      enableAutocomplete: searchConfig.enableAutocomplete !== false,

      // Advanced features
      enableSemanticSearch: searchConfig.enableSemanticSearch !== false,
      enableMultiLanguageSearch: searchConfig.enableMultiLanguageSearch !== false,
      enableSpellCorrection: searchConfig.enableSpellCorrection !== false,
      enableSearchAnalytics: searchConfig.enableSearchAnalytics !== false,

      // Performance optimization
      searchResultLimit: searchConfig.searchResultLimit || 50,
      facetLimit: searchConfig.facetLimit || 20,
      highlightMaxChars: searchConfig.highlightMaxChars || 500,
      cacheSearchResults: searchConfig.cacheSearchResults !== false,

      // ML and AI features
      enableRelevanceScoring: searchConfig.enableRelevanceScoring !== false,
      enablePersonalization: searchConfig.enablePersonalization !== false,
      enableSearchSuggestions: searchConfig.enableSearchSuggestions !== false,

      ...searchConfig
    };

    // Collections
    this.collections = {
      documents: null,
      searchQueries: null,
      searchAnalytics: null,
      userProfiles: null,
      searchSuggestions: null,
      vectorEmbeddings: null
    };

    // Search indexes configuration
    this.searchIndexes = new Map();
    this.vectorIndexes = new Map();

    // Performance metrics
    this.searchMetrics = {
      totalSearches: 0,
      averageLatency: 0,
      searchesWithResults: 0,
      popularQueries: new Map()
    };
  }

  async initializeAtlasSearch() {
    console.log('Initializing MongoDB Atlas Search capabilities...');

    try {
      // Connect to MongoDB Atlas
      this.client = new MongoClient(this.connectionString);
      await this.client.connect();
      this.db = this.client.db();

      // Initialize collections
      await this.setupSearchCollections();

      // Create Atlas Search indexes
      await this.createAtlasSearchIndexes();

      // Setup vector search if enabled
      if (this.config.enableVectorSearch) {
        await this.setupVectorSearch();
      }

      // Initialize search analytics
      if (this.config.enableSearchAnalytics) {
        await this.setupSearchAnalytics();
      }

      console.log('Atlas Search initialization completed successfully');

    } catch (error) {
      console.error('Error initializing Atlas Search:', error);
      throw error;
    }
  }

  async setupSearchCollections() {
    console.log('Setting up search-optimized collections...');

    // Documents collection with search-optimized schema
    this.collections.documents = this.db.collection('documents');
    await this.collections.documents.createIndexes([
      { key: { title: 'text', content: 'text' }, background: true, name: 'text_search_fallback' },
      { key: { category: 1, status: 1, publishDate: -1 }, background: true },
      { key: { author: 1, publishDate: -1 }, background: true },
      { key: { tags: 1, language: 1 }, background: true },
      { key: { popularity: -1, relevanceScore: -1 }, background: true }
    ]);

    // Search queries and analytics
    this.collections.searchQueries = this.db.collection('search_queries');
    await this.collections.searchQueries.createIndexes([
      { key: { userId: 1, searchTimestamp: -1 }, background: true },
      { key: { queryText: 1, totalResults: -1 }, background: true },
      { key: { searchTimestamp: -1 }, background: true },
      { key: { sessionId: 1, searchTimestamp: -1 }, background: true }
    ]);

    // Search analytics aggregation collection
    this.collections.searchAnalytics = this.db.collection('search_analytics');
    await this.collections.searchAnalytics.createIndexes([
      { key: { analysisDate: -1 }, background: true },
      { key: { queryPattern: 1, frequency: -1 }, background: true }
    ]);

    // User profiles for personalization
    this.collections.userProfiles = this.db.collection('user_profiles');
    await this.collections.userProfiles.createIndexes([
      { key: { userId: 1 }, unique: true, background: true },
      { key: { 'searchPreferences.categories': 1 }, background: true },
      { key: { lastActivity: -1 }, background: true }
    ]);

    console.log('Search collections setup completed');
  }

  async createAtlasSearchIndexes() {
    console.log('Creating Atlas Search indexes...');

    // Main document search index with comprehensive text analysis
    const mainSearchIndex = {
      name: 'documents_search_index',
      definition: {
        mappings: {
          dynamic: false,
          fields: {
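            // Highlighting is requested per query through the $search `highlight`
            // option; string field mappings have no `highlight` property.
            // title is indexed as two types so the `autocomplete` operator can target it too.
            title: [
              {
                type: 'string',
                analyzer: 'lucene.standard',
                searchAnalyzer: 'lucene.standard'
              },
              {
                type: 'autocomplete',
                tokenization: 'edgeGram',
                minGrams: 2,
                maxGrams: 15
              }
            ],
            content: {
              type: 'string',
              analyzer: 'lucene.standard',
              searchAnalyzer: 'lucene.standard'
            },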
            author: {
              type: 'string',
              analyzer: 'lucene.keyword'
            },
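            // `token` mappings allow exact string matching with the `equals` operator
            category: {
              type: 'token'
            },
            tags: {
              type: 'string',
              analyzer: 'lucene.standard'
            },
            language: {
              type: 'token'
            },
            // status must be indexed because searches filter on it
            status: {
              type: 'token'
            },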
            publishDate: {
              type: 'date'
            },
            popularity: {
              type: 'number'
            },
            relevanceScore: {
              type: 'number'
            },
            // Nested content analysis
            sections: {
              type: 'document',
              fields: {
                heading: {
                  type: 'string',
                  analyzer: 'lucene.standard'
                },
                content: {
                  type: 'string',
                  analyzer: 'lucene.standard'
                },
                importance: {
                  type: 'number'
                }
              }
            },
            // Metadata for advanced search
            metadata: {
              type: 'document',
              fields: {
                readingLevel: { type: 'string' },
                contentType: { type: 'string' },
                sourceQuality: { type: 'number' },
                lastUpdated: { type: 'date' }
              }
            }
          }
        },
        analyzers: [{
          name: 'multilingual_analyzer',
          charFilters: [{
            type: 'mapping',
            mappings: {
              '&': 'and',
              '@': 'at'
            }
          }],
          tokenizer: {
            type: 'standard'
          },
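          tokenFilters: [
            { type: 'lowercase' },
            // Atlas Search names these filters `stopword` (with an explicit token
            // list; the list here is illustrative) and `snowballStemming`
            { type: 'stopword', tokens: ['a', 'an', 'and', 'the', 'of', 'to', 'in'] },
            { type: 'snowballStemming', stemmerName: 'english' }
          ]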
        }]
      }
    };

    // Autocomplete search index
    const autocompleteIndex = {
      name: 'autocomplete_search_index',
      definition: {
        mappings: {
          dynamic: false,
          fields: {
            title: {
              type: 'autocomplete',
              analyzer: 'lucene.standard',
              tokenization: 'edgeGram',
              minGrams: 2,
              maxGrams: 15,
              foldDiacritics: true
            },
            content: {
              type: 'autocomplete',
              analyzer: 'lucene.standard',
              tokenization: 'nGram',
              minGrams: 3,
              maxGrams: 10
            },
            tags: {
              type: 'autocomplete',
              analyzer: 'lucene.keyword',
              tokenization: 'edgeGram' // valid values: edgeGram, rightEdgeGram, nGram
            },
            category: {
              type: 'string',
              analyzer: 'lucene.keyword'
            },
            popularity: {
              type: 'number'
            }
          }
        }
      }
    };

    // Faceted search index for advanced filtering
    const facetedSearchIndex = {
      name: 'faceted_search_index',
      definition: {
        mappings: {
          dynamic: false,
          fields: {
            title: {
              type: 'string',
              analyzer: 'lucene.standard'
            },
            content: {
              type: 'string',
              analyzer: 'lucene.standard'
            },
            category: {
              type: 'stringFacet'
            },
            author: {
              type: 'stringFacet'
            },
            language: {
              type: 'stringFacet'
            },
            tags: {
              type: 'stringFacet'
            },
            // Bucket boundaries are not part of the index definition; they are
            // supplied in the `facet` collector of the $searchMeta query, e.g.
            // yearly dates from 2020-01-01 to 2025-01-01 for publishDate,
            // [0, 10, 50, 100, 500, 1000] for popularity, and
            // [0, 1000, 5000, 10000, 50000] for contentLength
            publishDate: {
              type: 'dateFacet'
            },
            popularity: {
              type: 'numberFacet'
            },
            contentLength: {
              type: 'numberFacet'
            }
          }
        }
      }
    };

    // Store index configurations for reference
    this.searchIndexes.set('main', mainSearchIndex);
    this.searchIndexes.set('autocomplete', autocompleteIndex);
    this.searchIndexes.set('faceted', facetedSearchIndex);

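    console.log('Atlas Search index definitions stored');
    // Note: this method only stores the definitions. In production, create the
    // indexes through the Atlas UI, the Atlas Admin API, or a recent driver's
    // collection.createSearchIndexes() helper.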
  }

  async performAdvancedTextSearch(query, options = {}) {
    console.log(`Performing advanced text search for: "${query}"`);

    const startTime = Date.now();

    try {
      // Build comprehensive search aggregation pipeline
      const searchPipeline = [
        {
          $search: {
            index: 'documents_search_index',
            compound: {
              // Relevance clauses live in a nested compound under `must`.
              // A second `should` key further down in this object literal would
              // silently overwrite a top-level `should` defined here.
              must: [{
                compound: {
                  minimumShouldMatch: 1,
                  should: [
                    // Primary text search with boosting
                    {
                      text: {
                        query: query,
                        path: ['title', 'content'],
                        score: {
                          boost: { value: 2.0 }
                        },
                        fuzzy: {
                          maxEdits: 2,
                          prefixLength: 0,
                          maxExpansions: 50
                        }
                      }
                    },
                    // Exact phrase matching with highest boost
                    {
                      phrase: {
                        query: query,
                        path: ['title', 'content'],
                        score: {
                          boost: { value: 3.0 }
                        }
                      }
                    },
                    // Autocomplete matching for partial queries
                    // (requires title to be indexed as `autocomplete` in this index)
                    {
                      autocomplete: {
                        query: query,
                        path: 'title',
                        tokenOrder: 'sequential',
                        score: {
                          boost: { value: 1.5 }
                        }
                      }
                    },
                    // Semantic search using embeddings (if available)
                    ...(options.enableSemanticSearch && this.config.enableVectorSearch ? [{
                      knnBeta: {
                        vector: await this.getQueryEmbedding(query),
                        path: 'contentEmbedding',
                        k: 20,
                        score: {
                          boost: { value: 1.2 }
                        }
                      }
                    }] : [])
                  ]
                }
              }],

              // Apply filters
              filter: [
                ...(options.category ? [{
                  equals: {
                    path: 'category',
                    value: options.category
                  }
                }] : []),
                ...(options.language ? [{
                  equals: {
                    path: 'language',
                    value: options.language
                  }
                }] : []),
                ...(options.author ? [{
                  text: {
                    query: options.author,
                    path: 'author'
                  }
                }] : []),
                ...(options.dateRange ? [{
                  range: {
                    path: 'publishDate',
                    gte: options.dateRange.start,
                    lte: options.dateRange.end
                  }
                }] : []),
                {
                  equals: {
                    path: 'status',
                    value: 'published'
                  }
                }
              ],

              // Boost recent and popular content
              should: [
                {
                  range: {
                    path: 'publishDate',
                    gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000), // Last 30 days
                    score: {
                      boost: { value: 1.3 }
                    }
                  }
                },
                {
                  range: {
                    path: 'popularity',
                    gte: 100,
                    score: {
                      boost: { value: 1.2 }
                    }
                  }
                }
              ]
            },

            // Add search highlighting
            highlight: {
              path: ['title', 'content'],
              maxCharsToExamine: this.config.highlightMaxChars,
              maxNumPassages: 3
            }
          }
        },

        // Add computed fields for search results
        {
          $addFields: {
            searchScore: { $meta: 'searchScore' },
            searchHighlights: { $meta: 'searchHighlights' },

            // Calculate content preview
            contentPreview: {
              // $substrCP counts code points, so multi-byte UTF-8 text is not cut mid-character
              $substrCP: ['$content', 0, 300]
            },

            // Add relevance indicators
            relevanceIndicators: {
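              hasExactMatch: {
                // Escape regex metacharacters so the user's query is matched literally
                $or: [
                  { $regexMatch: { input: '$title', regex: query.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), options: 'i' } },
                  { $regexMatch: { input: '$content', regex: query.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), options: 'i' } }
                ]
              },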
              isRecent: {
                $gte: ['$publishDate', new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)]
              },
              isPopular: {
                $gte: ['$popularity', 50]
              }
            }
          }
        },

        // Add user personalization (if available)
        ...(options.userId ? [{
          $lookup: {
            from: 'user_profiles',
            localField: 'category',
            foreignField: 'searchPreferences.categories',
            as: 'personalizationMatch',
            pipeline: [
              { $match: { userId: options.userId } },
              { $limit: 1 }
            ]
          }
        }, {
          $addFields: {
            personalizationBoost: {
              $cond: [
                { $gt: [{ $size: '$personalizationMatch' }, 0] },
                1.4,
                1.0
              ]
            },
            finalScore: {
              $multiply: ['$searchScore', '$personalizationBoost']
            }
          }
        }] : [{
          $addFields: {
            finalScore: '$searchScore'
          }
        }]),

        // Sort by relevance and apply limits
        { $sort: { finalScore: -1, publishDate: -1 } },
        { $limit: options.limit || this.config.searchResultLimit },

        // Project final result structure
        {
          $project: {
            documentId: '$_id',
            title: 1,
            content: { $substrCP: ['$content', 0, 500] },
            author: 1,
            category: 1,
            tags: 1,
            publishDate: 1,
            language: 1,
            contentPreview: 1,

            // Search-specific fields
            searchScore: { $round: ['$finalScore', 4] },
            searchHighlights: 1,
            relevanceIndicators: 1,

            // Computed fields
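            contentLength: { $strLenCP: '$content' },
            estimatedReadingTime: {
              // Approximate the word count by splitting on spaces, then assume 200 words per minute
              $round: [{ $divide: [{ $size: { $split: ['$content', ' '] } }, 200] }, 0]
            },

            // Search result metadata
            // Rank is the document's 1-based position in the sorted results; assign it in
            // application code after toArray(), since $project cannot see that position.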
            matchQuality: {
              $switch: {
                branches: [
                  { case: { $gte: ['$finalScore', 5.0] }, then: 'excellent' },
                  { case: { $gte: ['$finalScore', 3.0] }, then: 'good' },
                  { case: { $gte: ['$finalScore', 1.0] }, then: 'fair' }
                ],
                default: 'poor'
              }
            }
          }
        }
      ];

      // Execute search pipeline
      const searchResults = await this.collections.documents.aggregate(
        searchPipeline,
        { maxTimeMS: 10000 }
      ).toArray();

      const executionTime = Date.now() - startTime;

      // Log search query for analytics
      await this.logSearchQuery(query, searchResults.length, executionTime, options);

      // Update search metrics
      this.updateSearchMetrics(query, searchResults.length, executionTime);

      console.log(`Search completed: ${searchResults.length} results in ${executionTime}ms`);

      return {
        success: true,
        query: query,
        totalResults: searchResults.length,
        executionTime: executionTime,
        results: searchResults,
        searchMetadata: {
          hasSpellingSuggestions: false, // Would implement spell checking
          appliedFilters: options,
          searchComplexity: 'advanced',
          optimizationsApplied: ['boosting', 'fuzzy_matching', 'highlighting']
        }
      };

    } catch (error) {
      console.error('Error performing advanced text search:', error);
      return {
        success: false,
        error: error.message,
        query: query,
        executionTime: Date.now() - startTime
      };
    }
  }

  async setupVectorSearch() {
    console.log('Setting up vector search capabilities...');

    // Vector embeddings collection
    this.collections.vectorEmbeddings = this.db.collection('vector_embeddings');

    // Vector search index configuration
    const vectorSearchIndex = {
      name: 'vector_search_index',
      definition: {
        fields: [{
          type: 'vector',
          path: 'contentEmbedding',
          numDimensions: 1536, // Must match the embedding model's output size (e.g. OpenAI text-embedding-3-small)
          similarity: 'cosine'
        }, {
          type: 'filter',
          path: 'documentId'
        }, {
          type: 'filter',
          path: 'embeddingType'
        }, {
          type: 'filter',
          path: 'language'
        }]
      }
    };

    this.vectorIndexes.set('content_vectors', vectorSearchIndex);

    // Create indexes for vector collection
    await this.collections.vectorEmbeddings.createIndexes([
      { key: { documentId: 1 }, unique: true },
      { key: { embeddingType: 1, language: 1 } },
      { key: { createdAt: -1 } }
    ]);

    console.log('Vector search setup completed');
  }

  async performVectorSearch(queryEmbedding, options = {}) {
    console.log('Performing vector similarity search...');

    const startTime = Date.now();

    try {
      const vectorSearchPipeline = [
        {
          $vectorSearch: {
            index: 'vector_search_index',
            path: 'contentEmbedding',
            queryVector: queryEmbedding,
            numCandidates: options.numCandidates || 100,
            limit: options.limit || 20,
            filter: {
              ...(options.language && { language: { $eq: options.language } }),
              ...(options.embeddingType && { embeddingType: { $eq: options.embeddingType } })
            }
          }
        },

        // Join with original documents
        {
          $lookup: {
            from: 'documents',
            localField: 'documentId',
            foreignField: '_id',
            as: 'document'
          }
        },

        // Unwind and add computed fields
        { $unwind: '$document' },
        {
          $addFields: {
            similarityScore: { $meta: 'vectorSearchScore' },
            semanticRelevance: {
              $switch: {
                branches: [
                  { case: { $gte: [{ $meta: 'vectorSearchScore' }, 0.8] }, then: 'very_high' },
                  { case: { $gte: [{ $meta: 'vectorSearchScore' }, 0.6] }, then: 'high' },
                  { case: { $gte: [{ $meta: 'vectorSearchScore' }, 0.4] }, then: 'medium' }
                ],
                default: 'low'
              }
            }
          }
        },

        // Project results
        {
          $project: {
            documentId: '$document._id',
            title: '$document.title',
            content: { $substrCP: ['$document.content', 0, 400] },
            author: '$document.author',
            category: '$document.category',
            similarityScore: { $round: ['$similarityScore', 4] },
            semanticRelevance: 1,
            embeddingType: 1,
            language: 1
          }
        }
      ];

      const vectorResults = await this.collections.vectorEmbeddings.aggregate(
        vectorSearchPipeline,
        { maxTimeMS: 15000 }
      ).toArray();

      const executionTime = Date.now() - startTime;

      console.log(`Vector search completed: ${vectorResults.length} results in ${executionTime}ms`);

      return {
        success: true,
        totalResults: vectorResults.length,
        executionTime: executionTime,
        results: vectorResults,
        searchType: 'vector_similarity'
      };

    } catch (error) {
      console.error('Error performing vector search:', error);
      return {
        success: false,
        error: error.message,
        executionTime: Date.now() - startTime
      };
    }
  }

  async performFacetedSearch(query, options = {}) {
    console.log(`Performing faceted search for: "${query}"`);

    const startTime = Date.now();

    try {
      const facetedSearchPipeline = [
        {
          $searchMeta: {
            index: 'faceted_search_index',
            facet: {
              operator: {
                text: {
                  query: query,
                  path: ['title', 'content']
                }
              },
              facets: {
                // Category facets
                categoriesFacet: {
                  type: 'string',
                  path: 'category',
                  numBuckets: this.config.facetLimit
                },

                // Author facets
                authorsFacet: {
                  type: 'string',
                  path: 'author',
                  numBuckets: 10
                },

                // Language facets
                languagesFacet: {
                  type: 'string',
                  path: 'language',
                  numBuckets: 10
                },

                // Date range facets
                publishDateFacet: {
                  type: 'date',
                  path: 'publishDate',
                  boundaries: [
                    new Date('2020-01-01'),
                    new Date('2021-01-01'),
                    new Date('2022-01-01'),
                    new Date('2023-01-01'),
                    new Date('2024-01-01'),
                    new Date('2025-01-01')
                  ]
                },

                // Popularity range facets
                popularityFacet: {
                  type: 'number',
                  path: 'popularity',
                  boundaries: [0, 10, 50, 100, 500, 1000]
                },

                // Content length facets
                contentLengthFacet: {
                  type: 'number',
                  path: 'contentLength',
                  boundaries: [0, 1000, 5000, 10000, 50000]
                }
              }
            }
          }
        }
      ];

      const facetResults = await this.collections.documents.aggregate(
        facetedSearchPipeline
      ).toArray();

      const executionTime = Date.now() - startTime;

      console.log(`Faceted search completed in ${executionTime}ms`);

      return {
        success: true,
        query: query,
        executionTime: executionTime,
        facets: facetResults[0]?.facet || {},
        searchType: 'faceted'
      };

    } catch (error) {
      console.error('Error performing faceted search:', error);
      return {
        success: false,
        error: error.message,
        executionTime: Date.now() - startTime
      };
    }
  }

  async generateAutocompleteResults(partialQuery, options = {}) {
    console.log(`Generating autocomplete for: "${partialQuery}"`);

    try {
      const autocompletePipeline = [
        {
          $search: {
            index: 'autocomplete_search_index',
            compound: {
              should: [
                {
                  autocomplete: {
                    query: partialQuery,
                    path: 'title',
                    tokenOrder: 'sequential',
                    score: { boost: { value: 2.0 } }
                  }
                },
                {
                  autocomplete: {
                    query: partialQuery,
                    path: 'tags',
                    tokenOrder: 'any',
                    score: { boost: { value: 1.5 } }
                  }
                }
              ],
              filter: [
                { equals: { path: 'status', value: 'published' } },
                ...(options.category ? [{ equals: { path: 'category', value: options.category } }] : [])
              ]
            }
          }
        },

        { $limit: 10 },

        {
          $project: {
            suggestion: '$title',
            category: 1,
            popularity: 1,
            autocompleteScore: { $meta: 'searchScore' }
          }
        },

        { $sort: { autocompleteScore: -1, popularity: -1 } }
      ];

      const suggestions = await this.collections.documents.aggregate(
        autocompletePipeline
      ).toArray();

      return {
        success: true,
        partialQuery: partialQuery,
        suggestions: suggestions.map(s => ({
          text: s.suggestion,
          category: s.category,
          score: s.autocompleteScore
        }))
      };

    } catch (error) {
      console.error('Error generating autocomplete results:', error);
      return {
        success: false,
        error: error.message,
        suggestions: []
      };
    }
  }

  async logSearchQuery(query, resultCount, executionTime, options) {
    try {
      const searchLog = {
        queryId: new ObjectId(),
        queryText: query,
        queryLanguage: options.language || 'en',
        userId: options.userId,
        sessionId: options.sessionId,

        // Search parameters
        filtersApplied: {
          category: options.category,
          author: options.author,
          language: options.language,
          dateRange: options.dateRange
        },

        // Search results metrics
        totalResultsFound: resultCount,
        executionTimeMs: executionTime,
        searchType: options.searchType || 'text',

        // Context information
        userAgent: options.userAgent,
        referrer: options.referrer,
        searchTimestamp: new Date(),

        // Performance data
        indexesUsed: ['documents_search_index'],
        optimizationsApplied: ['boosting', 'highlighting', 'fuzzy_matching'],

        // Quality metrics (to be updated by user interaction)
        userInteraction: {
          resultsClicked: [],
          timeOnResultsPage: null,
          refinedQuery: null,
          conversionAchieved: false
        }
      };

      await this.collections.searchQueries.insertOne(searchLog);

    } catch (error) {
      console.error('Error logging search query:', error);
    }
  }

  updateSearchMetrics(query, resultCount, executionTime) {
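    this.searchMetrics.totalSearches++;
    // Incremental mean: every search is weighted equally instead of halving older history
    this.searchMetrics.averageLatency +=
      (executionTime - this.searchMetrics.averageLatency) / this.searchMetrics.totalSearches;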

    if (resultCount > 0) {
      this.searchMetrics.searchesWithResults++;
    }

    // Track popular queries
    const queryLower = query.toLowerCase();
    this.searchMetrics.popularQueries.set(
      queryLower,
      (this.searchMetrics.popularQueries.get(queryLower) || 0) + 1
    );
  }

  async getQueryEmbedding(query) {
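    // Placeholder for actual embedding generation: returns a random 1536-dimension
    // vector, so similarity scores are meaningless until this is replaced.
    // In production, this would call the OpenAI API or a similar embedding service.
    return Array(1536).fill(0).map(() => Math.random() - 0.5);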
  }

  async getSearchAnalytics(timeRange = '7d') {
    console.log(`Retrieving search analytics for ${timeRange}...`);

    try {
      const endDate = new Date();
      const startDate = new Date();

      switch (timeRange) {
        case '1d':
          startDate.setDate(endDate.getDate() - 1);
          break;
        case '7d':
          startDate.setDate(endDate.getDate() - 7);
          break;
        case '30d':
          startDate.setDate(endDate.getDate() - 30);
          break;
        default:
          startDate.setDate(endDate.getDate() - 7);
      }

      const analyticsAggregation = [
        {
          $match: {
            searchTimestamp: { $gte: startDate, $lte: endDate }
          }
        },

        {
          $group: {
            _id: null,
            totalSearches: { $sum: 1 },
            uniqueUsers: { $addToSet: '$userId' },
            averageExecutionTime: { $avg: '$executionTimeMs' },
            searchesWithResults: {
              $sum: { $cond: [{ $gt: ['$totalResultsFound', 0] }, 1, 0] }
            },

            // Query analysis
            popularQueries: {
              $push: {
                query: '$queryText',
                results: '$totalResultsFound',
                executionTime: '$executionTimeMs'
              }
            },

            // Performance metrics
            maxExecutionTime: { $max: '$executionTimeMs' },
            minExecutionTime: { $min: '$executionTimeMs' },

            // Filter usage analysis
            categoryFilters: { $push: '$filtersApplied.category' },
            languageFilters: { $push: '$filtersApplied.language' }
          }
        },

        {
          $addFields: {
            uniqueUserCount: { $size: '$uniqueUsers' },
            successRate: {
              $round: [
                { $multiply: [
                  { $divide: ['$searchesWithResults', '$totalSearches'] },
                  100
                ]},
                2
              ]
            },
            averageExecutionTimeRounded: {
              $round: ['$averageExecutionTime', 2]
            }
          }
        }
      ];

      const analytics = await this.collections.searchQueries.aggregate(
        analyticsAggregation
      ).toArray();

      return {
        success: true,
        timeRange: timeRange,
        analytics: analytics[0] || {
          totalSearches: 0,
          uniqueUserCount: 0,
          successRate: 0,
          averageExecutionTimeRounded: 0
        },
        systemMetrics: this.searchMetrics
      };

    } catch (error) {
      console.error('Error retrieving search analytics:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  async shutdown() {
    console.log('Shutting down Atlas Search Manager...');

    if (this.client) {
      await this.client.close();
    }

    console.log('Atlas Search Manager shutdown complete');
  }
}

// Benefits of MongoDB Atlas Search:
// - Native full-text search with no external dependencies
// - Advanced relevance scoring with machine learning integration
// - Vector similarity search for semantic understanding
// - Multi-language support with sophisticated text analysis
// - Real-time search index synchronization
// - Faceted search and advanced filtering capabilities
// - Autocomplete and search suggestions out-of-the-box
// - Comprehensive search analytics and performance monitoring
// - SQL-compatible search operations through QueryLeaf integration

module.exports = {
  AtlasSearchManager
};
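
Two small pieces of application-side logic complement the search pipeline: escaping user input before it is used as a regex pattern, and deriving a result rank from the sorted results array. The following is a minimal sketch; `escapeRegex` and `withSearchRank` are hypothetical helper names that are not part of `AtlasSearchManager`, and the sample data is made up for illustration.

```javascript
// Hypothetical helpers (assumptions, not part of AtlasSearchManager).

// Escape regex metacharacters so a user-supplied query is matched literally
// when it is passed as the `regex` argument of an expression such as $regexMatch.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Attach a 1-based searchRank to each document, based on its position
// in an array that the pipeline has already sorted by relevance.
function withSearchRank(results) {
  return results.map((doc, index) => ({ ...doc, searchRank: index + 1 }));
}

// Usage with made-up sample data
const safePattern = escapeRegex('c++ (advanced)');
const rankedResults = withSearchRank([{ title: 'First hit' }, { title: 'Second hit' }]);

console.log(safePattern);
console.log(rankedResults);
```

Ranking in application code keeps the pipeline simple; if the rank must be produced inside the database, a window stage placed after the sort would be the place to look, which this sketch does not attempt.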

Understanding MongoDB Atlas Search Architecture

Advanced Search Patterns and Performance Optimization

Implement sophisticated search strategies for production MongoDB Atlas deployments:

// Production-ready Atlas Search with advanced features and optimization
class EnterpriseAtlasSearchProcessor extends AtlasSearchManager {
  constructor(connectionString, enterpriseConfig) {
    super(connectionString, enterpriseConfig);

    this.enterpriseConfig = {
      ...enterpriseConfig,
      enableAdvancedAnalytics: true,
      enablePersonalization: true,
      enableA_B_Testing: true,
      enableSearchOptimization: true,
      enableContentIntelligence: true,
      enableMultiModalSearch: true
    };

    this.setupEnterpriseFeatures();
    this.initializeAdvancedAnalytics();
    this.setupPersonalizationEngine();
  }

  async implementAdvancedSearchStrategies() {
    console.log('Implementing enterprise search strategies...');

    const searchStrategies = {
      // Multi-modal search capabilities
      multiModalSearch: {
        textSearch: true,
        vectorSearch: true,
        imageSearch: true,
        documentSearch: true,
        semanticSearch: true
      },

      // Personalization engine
      personalizationEngine: {
        userBehaviorAnalysis: true,
        contentRecommendations: true,
        adaptiveScoringWeights: true,
        searchIntentPrediction: true
      },

      // Search optimization
      searchOptimization: {
        realTimeIndexOptimization: true,
        queryPerformanceAnalysis: true,
        automaticRelevanceTuning: true,
        resourceUtilizationOptimization: true
      }
    };

    return await this.deployEnterpriseSearchStrategies(searchStrategies);
  }

  async setupAdvancedPersonalization() {
    console.log('Setting up advanced personalization capabilities...');

    const personalizationConfig = {
      // User modeling
      userModeling: {
        behavioralTracking: true,
        preferenceAnalysis: true,
        contextualUnderstanding: true,
        intentPrediction: true
      },

      // Content intelligence
      contentIntelligence: {
        topicModeling: true,
        contentCategorization: true,
        qualityScoring: true,
        freshnessScoring: true
      },

      // Adaptive algorithms
      adaptiveAlgorithms: {
        learningFromInteraction: true,
        realTimeAdaptation: true,
        contextualAdjustment: true,
        performanceOptimization: true
      }
    };

    return await this.deployPersonalizationEngine(personalizationConfig);
  }
}
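
The `multiModalSearch` flags above enable both text and vector search, but the class does not show how the two result lists are combined. The sketch below is one possible approach, not the author's implementation: `mergeHybridResults` is a hypothetical helper that merges the `results` arrays returned by `performAdvancedTextSearch()` and `performVectorSearch()`. The 0.6 / 0.4 weights and the ×10 scaling of the similarity score mirror the hybrid SQL example in this post and should be treated as illustrative assumptions to tune.

```javascript
// Hypothetical helper (an assumption, not part of the classes above).
// textResults:   documents with { documentId, title, searchScore }
// vectorResults: documents with { documentId, title, similarityScore }
function mergeHybridResults(textResults, vectorResults, weights = { text: 0.6, semantic: 0.4 }) {
  const merged = new Map();

  // Seed the map with text search hits
  for (const doc of textResults) {
    const id = String(doc.documentId);
    merged.set(id, { documentId: id, title: doc.title, textScore: doc.searchScore, semanticScore: 0 });
  }

  // Add vector hits, merging with any text hit for the same document
  for (const doc of vectorResults) {
    const id = String(doc.documentId);
    const entry = merged.get(id) || { documentId: id, title: doc.title, textScore: 0, semanticScore: 0 };
    entry.semanticScore = doc.similarityScore;
    merged.set(id, entry);
  }

  // Weighted sum; similarity (0..1) is scaled by 10 to be comparable to text scores
  return [...merged.values()]
    .map(entry => ({
      ...entry,
      hybridScore: entry.textScore * weights.text + entry.semanticScore * 10 * weights.semantic,
      searchMethod: entry.textScore > 0 && entry.semanticScore > 0
        ? 'hybrid_match'
        : entry.textScore > 0 ? 'text_match' : 'semantic_match'
    }))
    .sort((a, b) => b.hybridScore - a.hybridScore);
}

// Usage with made-up sample data
const hybrid = mergeHybridResults(
  [{ documentId: 'a', title: 'A', searchScore: 5 }, { documentId: 'b', title: 'B', searchScore: 2 }],
  [{ documentId: 'b', title: 'B', similarityScore: 0.9 }, { documentId: 'c', title: 'C', similarityScore: 0.5 }]
);
console.log(hybrid.map(r => `${r.documentId}: ${r.searchMethod}`));
```

A weighted sum is simple to reason about but sensitive to the scale of each score; rank-based fusion is a common alternative when the two scores are not directly comparable.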

SQL-Style Search Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Atlas Search operations:

-- QueryLeaf Atlas Search operations with SQL-familiar syntax

-- Configure comprehensive search indexes
CREATE SEARCH INDEX documents_main_index ON documents (
  title WITH (
    analyzer = 'standard',
    search_analyzer = 'standard',
    highlight = true,
    boost = 2.0
  ),
  content WITH (
    analyzer = 'standard', 
    search_analyzer = 'standard',
    highlight = true,
    max_highlight_chars = 500
  ),
  author WITH (
    analyzer = 'keyword',
    facet = true
  ),
  category WITH (
    analyzer = 'keyword',
    facet = true
  ),
  tags WITH (
    analyzer = 'standard',
    facet = true
  ),
  language WITH (
    analyzer = 'keyword',
    facet = true
  ),
  publish_date WITH (
    type = 'date',
    facet = true
  ),
  popularity WITH (
    type = 'number',
    facet = true,
    facet_boundaries = [0, 10, 50, 100, 500, 1000]
  )
)
WITH SEARCH_OPTIONS (
  enable_highlighting = true,
  enable_faceting = true,
  enable_autocomplete = true,
  enable_fuzzy_matching = true,
  default_language = 'english'
);

-- Create autocomplete search index
CREATE AUTOCOMPLETE INDEX documents_autocomplete ON documents (
  title WITH (
    tokenization = 'edgeGram',
    min_grams = 2,
    max_grams = 15,
    fold_diacritics = true
  ),
  tags WITH (
    tokenization = 'keyword',
    max_suggestions = 20
  )
);

-- Create vector search index for semantic search
CREATE VECTOR INDEX documents_semantic ON documents (
  content_embedding WITH (
    dimensions = 1536,
    similarity = 'cosine'
  )
)
WITH VECTOR_OPTIONS (
  num_candidates = 100,
  enable_filtering = true
);

-- Advanced text search with comprehensive features
WITH advanced_search AS (
  SELECT 
    document_id,
    title,
    content,
    author,
    category,
    tags,
    publish_date,
    language,
    popularity,

    -- Search scoring and ranking
    SEARCH_SCORE() as relevance_score,
    SEARCH_HIGHLIGHTS(title, content) as search_highlights,

    -- Advanced scoring components
    CASE 
      WHEN SEARCH_EXACT_MATCH(title, 'machine learning') THEN 3.0
      WHEN SEARCH_PHRASE_MATCH(content, 'machine learning') THEN 2.5
      WHEN SEARCH_FUZZY_MATCH(title, 'machine learning', max_edits = 2) THEN 1.8
      ELSE 1.0
    END as match_type_boost,

    -- Temporal and popularity boosts
    CASE 
      WHEN publish_date >= CURRENT_DATE - INTERVAL '30 days' THEN 1.3
      WHEN publish_date >= CURRENT_DATE - INTERVAL '90 days' THEN 1.1
      ELSE 1.0
    END as recency_boost,

    CASE 
      WHEN popularity >= 1000 THEN 1.4
      WHEN popularity >= 100 THEN 1.2
      WHEN popularity >= 10 THEN 1.1
      ELSE 1.0
    END as popularity_boost,

    -- Content quality indicators
    LENGTH(content) as content_length,
    ARRAY_LENGTH(tags, 1) as tag_count,
    EXTRACT(DAYS FROM CURRENT_DATE - publish_date) as days_old

  FROM documents
  WHERE SEARCH(
    -- Primary search query
    query = 'machine learning artificial intelligence',
    paths = ['title', 'content'],

    -- Search options
    WITH (
      fuzzy_matching = true,
      max_edits = 2,
      prefix_length = 2,
      enable_highlighting = true,
      highlight_max_chars = 500,

      -- Boost strategies
      title_boost = 2.0,
      exact_phrase_boost = 3.0,
      proximity_boost = 1.5
    )
  )
  -- Filters
  AND category IN ('technology', 'science', 'research')
  AND language = 'en'
  AND status = 'published'
  AND publish_date >= '2020-01-01'
),

search_with_personalization AS (
  SELECT 
    scored.*,

    -- Search result enrichment
    CASE 
      WHEN scored.final_relevance_score >= 8.0 THEN 'excellent'
      WHEN scored.final_relevance_score >= 5.0 THEN 'very_good'
      WHEN scored.final_relevance_score >= 3.0 THEN 'good'
      WHEN scored.final_relevance_score >= 1.0 THEN 'fair'
      ELSE 'poor'
    END as match_quality,

    -- Estimated reading time (assumes ~5 characters per word at 200 words per minute)
    ROUND(scored.content_length / 5.0 / 200.0, 0) as estimated_reading_minutes,

    -- Search result categories
    CASE 
      WHEN SEARCH_EXACT_MATCH(title, 'machine learning') OR 
           SEARCH_EXACT_MATCH(content, 'machine learning') THEN 'exact_match'
      WHEN SEARCH_SEMANTIC_SIMILARITY(content_embedding, 
                                      QUERY_EMBEDDING('machine learning artificial intelligence')) > 0.8 
           THEN 'semantic_match'
      WHEN SEARCH_FUZZY_MATCH(title, 'machine learning', max_edits = 2) THEN 'fuzzy_match'
      ELSE 'keyword_match'
    END as match_type

  FROM (
    -- Final relevance calculation, computed in a subquery because a SELECT list
    -- cannot reference its own column aliases
    SELECT 
      boosted.*,
      (relevance_score * match_type_boost * recency_boost * 
       popularity_boost * personalization_boost) as final_relevance_score
    FROM (
      -- User personalization (if user context available)
      SELECT 
        ads.*,
        CASE 
          WHEN USER_PREFERENCE_MATCH(category, user_id = 'user123') THEN 1.4
          WHEN USER_INTERACTION_HISTORY(document_id, user_id = 'user123', 
                                       interaction_type = 'positive') THEN 1.3
          ELSE 1.0
        END as personalization_boost
      FROM advanced_search ads
    ) boosted
  ) scored
),

faceted_analysis AS (
  -- Generate search facets for filtering UI
  SELECT 
    'categories' as facet_type,
    category as facet_value,
    COUNT(*) as result_count,
    AVG(final_relevance_score) as avg_relevance
  FROM search_with_personalization
  GROUP BY category

  UNION ALL

  SELECT 
    'authors' as facet_type,
    author as facet_value,
    COUNT(*) as result_count,
    AVG(final_relevance_score) as avg_relevance
  FROM search_with_personalization
  GROUP BY author

  UNION ALL

  SELECT 
    'languages' as facet_type,
    language as facet_value,
    COUNT(*) as result_count,
    AVG(final_relevance_score) as avg_relevance
  FROM search_with_personalization
  GROUP BY language

  UNION ALL

  SELECT 
    'time_periods' as facet_type,
    CASE 
      WHEN publish_date >= CURRENT_DATE - INTERVAL '30 days' THEN 'last_month'
      WHEN publish_date >= CURRENT_DATE - INTERVAL '90 days' THEN 'last_3_months'
      WHEN publish_date >= CURRENT_DATE - INTERVAL '365 days' THEN 'last_year'
      ELSE 'older'
    END as facet_value,
    COUNT(*) as result_count,
    AVG(final_relevance_score) as avg_relevance
  FROM search_with_personalization
  GROUP BY facet_value

  UNION ALL

  SELECT 
    'popularity_ranges' as facet_type,
    CASE 
      WHEN popularity >= 1000 THEN 'very_popular'
      WHEN popularity >= 100 THEN 'popular'
      WHEN popularity >= 10 THEN 'moderate'
      ELSE 'emerging'
    END as facet_value,
    COUNT(*) as result_count,
    AVG(final_relevance_score) as avg_relevance
  FROM search_with_personalization
  GROUP BY facet_value
),

search_analytics AS (
  -- Real-time search analytics
  SELECT 
    'search_performance' as metric_type,
    COUNT(*) as total_results,
    AVG(final_relevance_score) as avg_relevance,
    MAX(final_relevance_score) as max_relevance,
    COUNT(*) FILTER (WHERE match_quality IN ('excellent', 'very_good')) as high_quality_results,
    COUNT(DISTINCT category) as categories_represented,
    COUNT(DISTINCT author) as authors_represented,
    COUNT(DISTINCT language) as languages_represented,

    -- Match type distribution
    COUNT(*) FILTER (WHERE match_type = 'exact_match') as exact_matches,
    COUNT(*) FILTER (WHERE match_type = 'semantic_match') as semantic_matches,
    COUNT(*) FILTER (WHERE match_type = 'fuzzy_match') as fuzzy_matches,
    COUNT(*) FILTER (WHERE match_type = 'keyword_match') as keyword_matches,

    -- Content characteristics
    AVG(content_length) as avg_content_length,
    AVG(estimated_reading_minutes) as avg_reading_time,
    AVG(days_old) as avg_content_age_days,

    -- Search quality indicators
    -- NULLIF avoids a division-by-zero error when the search returns no rows
    ROUND((COUNT(*) FILTER (WHERE match_quality IN ('excellent', 'very_good'))::DECIMAL / NULLIF(COUNT(*), 0)) * 100, 2) as high_quality_percentage,
    ROUND((COUNT(*) FILTER (WHERE final_relevance_score >= 3.0)::DECIMAL / NULLIF(COUNT(*), 0)) * 100, 2) as relevant_results_percentage

  FROM search_with_personalization
)

-- Main search results output
SELECT 
  swp.document_id,
  swp.title,
  LEFT(swp.content, 300) || '...' as content_preview,
  swp.author,
  swp.category,
  swp.tags,
  swp.publish_date,
  swp.language,

  -- Relevance and ranking
  ROUND(swp.final_relevance_score, 4) as relevance_score,
  ROW_NUMBER() OVER (ORDER BY swp.final_relevance_score DESC, swp.publish_date DESC) as search_rank,
  swp.match_quality,
  swp.match_type,

  -- Search highlights
  swp.search_highlights,

  -- Content metadata
  swp.content_length,
  swp.estimated_reading_minutes,
  swp.tag_count,
  swp.days_old,

  -- User personalization indicators
  ROUND(swp.personalization_boost, 2) as personalization_factor,

  -- Additional context
  CASE 
    WHEN swp.days_old <= 7 THEN 'Very Recent'
    WHEN swp.days_old <= 30 THEN 'Recent'
    WHEN swp.days_old <= 90 THEN 'Moderate'
    ELSE 'Archive'
  END as content_freshness,

  -- Search result recommendations
  CASE 
    WHEN swp.match_quality = 'excellent' AND swp.match_type = 'exact_match' THEN 'Must Read'
    WHEN swp.match_quality IN ('very_good', 'excellent') AND swp.days_old <= 30 THEN 'Trending'
    WHEN swp.match_quality = 'good' AND swp.popularity >= 100 THEN 'Popular Choice'
    WHEN swp.match_type = 'semantic_match' THEN 'Related Content'
    ELSE 'Standard Result'
  END as result_recommendation

FROM search_with_personalization swp
WHERE swp.final_relevance_score >= 0.5  -- Filter low-relevance results
ORDER BY swp.final_relevance_score DESC, swp.publish_date DESC
LIMIT 50;

-- Vector similarity search with SQL syntax
WITH semantic_search AS (
  SELECT 
    document_id,
    title,
    content,
    author,
    category,

    -- Vector similarity scoring
    VECTOR_SIMILARITY(
      content_embedding, 
      QUERY_EMBEDDING('artificial intelligence machine learning deep learning neural networks'),
      similarity_method = 'cosine'
    ) as semantic_similarity_score,

    -- Semantic relevance classification
    CASE 
      WHEN VECTOR_SIMILARITY(content_embedding, QUERY_EMBEDDING(...)) >= 0.9 THEN 'extremely_relevant'
      WHEN VECTOR_SIMILARITY(content_embedding, QUERY_EMBEDDING(...)) >= 0.8 THEN 'highly_relevant'
      WHEN VECTOR_SIMILARITY(content_embedding, QUERY_EMBEDDING(...)) >= 0.7 THEN 'relevant'
      WHEN VECTOR_SIMILARITY(content_embedding, QUERY_EMBEDDING(...)) >= 0.6 THEN 'somewhat_relevant'
      ELSE 'marginally_relevant'
    END as semantic_relevance_level

  FROM documents
  WHERE VECTOR_SEARCH(
    embedding_field = content_embedding,
    query_vector = QUERY_EMBEDDING('artificial intelligence machine learning deep learning neural networks'),
    similarity_threshold = 0.6,
    max_results = 20
  )
  -- Additional filters
  AND status = 'published'
  AND language IN ('en', 'es', 'fr')
  AND publish_date >= '2021-01-01'
),

hybrid_search_results AS (
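  -- Combine text search and vector search for optimal results.
  -- Assumes the search_with_personalization CTE from the previous query is in
  -- scope, for example by placing both queries under a single WITH clause.
  SELECT 
    d.document_id,
    d.title,
    d.content,
    d.author,
    d.category,
    d.publish_date,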

    -- Combined scoring from multiple search methods
    COALESCE(text_search.final_relevance_score, 0) as text_relevance,
    COALESCE(semantic_search.semantic_similarity_score, 0) as semantic_relevance,

    -- Hybrid relevance calculation
    (
      COALESCE(text_search.final_relevance_score, 0) * 0.6 +
      COALESCE(semantic_search.semantic_similarity_score * 10, 0) * 0.4
    ) as hybrid_relevance_score,

    -- Search method indicators
    CASE 
      WHEN text_search.document_id IS NOT NULL AND semantic_search.document_id IS NOT NULL THEN 'hybrid_match'
      WHEN text_search.document_id IS NOT NULL THEN 'text_match'
      WHEN semantic_search.document_id IS NOT NULL THEN 'semantic_match'
      ELSE 'no_match'
    END as search_method,

    -- Quality indicators
    text_search.match_quality as text_match_quality,
    semantic_search.semantic_relevance_level as semantic_match_quality

  FROM (
    SELECT DISTINCT document_id FROM search_with_personalization 
    UNION 
    SELECT DISTINCT document_id FROM semantic_search
  ) all_results
  LEFT JOIN search_with_personalization text_search ON all_results.document_id = text_search.document_id
  LEFT JOIN semantic_search ON all_results.document_id = semantic_search.document_id
  JOIN documents d ON all_results.document_id = d.document_id
)

SELECT 
  hrs.document_id,
  hrs.title,
  LEFT(hrs.content, 400) as content_preview,
  hrs.author,
  hrs.category,
  hrs.publish_date,

  -- Hybrid scoring results
  ROUND(hrs.text_relevance, 4) as text_relevance_score,
  ROUND(hrs.semantic_relevance, 4) as semantic_relevance_score,
  ROUND(hrs.hybrid_relevance_score, 4) as combined_relevance_score,

  -- Search method and quality
  hrs.search_method,
  COALESCE(hrs.text_match_quality, 'n/a') as text_quality,
  COALESCE(hrs.semantic_match_quality, 'n/a') as semantic_quality,

  -- Final recommendation
  CASE 
    WHEN hrs.hybrid_relevance_score >= 8.0 THEN 'Highly Recommended'
    WHEN hrs.hybrid_relevance_score >= 6.0 THEN 'Recommended'
    WHEN hrs.hybrid_relevance_score >= 4.0 THEN 'Relevant'
    WHEN hrs.hybrid_relevance_score >= 2.0 THEN 'Potentially Interesting'
    ELSE 'Marginally Relevant'
  END as recommendation_level

FROM hybrid_search_results hrs
WHERE hrs.hybrid_relevance_score >= 1.0
ORDER BY hrs.hybrid_relevance_score DESC, hrs.publish_date DESC
LIMIT 25;

-- Autocomplete and search suggestions
SELECT 
  suggestion_text,
  suggestion_category,
  popularity_score,
  completion_frequency,

  -- Suggestion quality metrics
  AUTOCOMPLETE_SCORE('machine lear', suggestion_text) as completion_relevance,

  -- Suggestion type classification
  CASE 
    WHEN STARTS_WITH(suggestion_text, 'machine lear') THEN 'prefix_completion'
    WHEN CONTAINS(suggestion_text, 'machine learning') THEN 'phrase_completion'
    WHEN FUZZY_MATCH(suggestion_text, 'machine learning', max_distance = 2) THEN 'corrected_completion'
    ELSE 'related_suggestion'
  END as suggestion_type,

  -- User context enhancement
  CASE 
    WHEN USER_SEARCH_HISTORY_CONTAINS('user123', suggestion_text) THEN true
    ELSE false
  END as user_has_searched_before,

  -- Trending indicator
  CASE 
    WHEN TRENDING_SEARCH_TERM(suggestion_text, time_window = '7d') THEN 'trending'
    WHEN POPULAR_SEARCH_TERM(suggestion_text, time_window = '30d') THEN 'popular'
    ELSE 'standard'
  END as trend_status

FROM AUTOCOMPLETE_SUGGESTIONS(
  partial_query = 'machine lear',
  max_suggestions = 10,

  -- Personalization options
  user_id = 'user123',
  include_user_history = true,
  include_trending = true,

  -- Filtering options
  category_filter = 'technology',
  language_filter = 'en',
  min_popularity = 10
)
ORDER BY completion_relevance DESC, popularity_score DESC;

-- Search analytics and performance monitoring
WITH search_performance_analysis AS (
  SELECT 
    DATE_TRUNC('hour', search_timestamp) as hour_bucket,
    COUNT(*) as total_searches,
    COUNT(DISTINCT user_id) as unique_users,
    AVG(execution_time_ms) as avg_execution_time,
    AVG(total_results_found) as avg_results_count,

    -- Search success metrics
    COUNT(*) FILTER (WHERE total_results_found > 0) as successful_searches,
    COUNT(*) FILTER (WHERE total_results_found >= 10) as highly_successful_searches,

    -- Query complexity analysis
    AVG(LENGTH(query_text)) as avg_query_length,
    COUNT(*) FILTER (WHERE filters_applied IS NOT NULL) as searches_with_filters,

    -- Performance categories
    COUNT(*) FILTER (WHERE execution_time_ms <= 100) as fast_searches,
    COUNT(*) FILTER (WHERE execution_time_ms > 100 AND execution_time_ms <= 500) as moderate_searches,
    COUNT(*) FILTER (WHERE execution_time_ms > 500) as slow_searches,

    -- Search types
    COUNT(*) FILTER (WHERE search_type = 'text') as text_searches,
    COUNT(*) FILTER (WHERE search_type = 'vector') as vector_searches,
    COUNT(*) FILTER (WHERE search_type = 'hybrid') as hybrid_searches,
    COUNT(*) FILTER (WHERE search_type = 'autocomplete') as autocomplete_requests

  FROM search_queries
  WHERE search_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY DATE_TRUNC('hour', search_timestamp)
),

query_pattern_analysis AS (
  SELECT 
    query_text,
    COUNT(*) as query_frequency,
    AVG(total_results_found) as avg_results,
    AVG(execution_time_ms) as avg_execution_time,
    COUNT(DISTINCT user_id) as unique_users,

    -- Query success metrics
    ROUND((COUNT(*) FILTER (WHERE total_results_found > 0)::DECIMAL / COUNT(*)) * 100, 2) as success_rate,

    -- User engagement indicators
    AVG(ARRAY_LENGTH(user_interaction.results_clicked, 1)) as avg_clicks_per_search,
    COUNT(*) FILTER (WHERE user_interaction.conversion_achieved = true) as conversions,

    -- Query characteristics
    LENGTH(query_text) as query_length,
    ARRAY_LENGTH(STRING_TO_ARRAY(query_text, ' '), 1) as word_count,

    -- Classification
    CASE 
      WHEN LENGTH(query_text) <= 10 THEN 'short_query'
      WHEN LENGTH(query_text) <= 30 THEN 'medium_query'
      ELSE 'long_query'
    END as query_length_category,

    CASE 
      WHEN ARRAY_LENGTH(STRING_TO_ARRAY(query_text, ' '), 1) = 1 THEN 'single_word'
      WHEN ARRAY_LENGTH(STRING_TO_ARRAY(query_text, ' '), 1) <= 3 THEN 'short_phrase'
      WHEN ARRAY_LENGTH(STRING_TO_ARRAY(query_text, ' '), 1) <= 6 THEN 'medium_phrase'
      ELSE 'long_phrase'
    END as query_complexity

  FROM search_queries
  WHERE search_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
  GROUP BY query_text
  HAVING COUNT(*) >= 3  -- Focus on repeated queries
)

-- Comprehensive search analytics report
SELECT 
  -- Time-based performance
  spa.hour_bucket,
  spa.total_searches,
  spa.unique_users,
  spa.avg_execution_time,
  spa.avg_results_count,

  -- Success metrics
  ROUND((spa.successful_searches::DECIMAL / spa.total_searches) * 100, 2) as success_rate_percent,
  ROUND((spa.highly_successful_searches::DECIMAL / spa.total_searches) * 100, 2) as high_success_rate_percent,

  -- Performance distribution
  ROUND((spa.fast_searches::DECIMAL / spa.total_searches) * 100, 2) as fast_search_percent,
  ROUND((spa.moderate_searches::DECIMAL / spa.total_searches) * 100, 2) as moderate_search_percent,
  ROUND((spa.slow_searches::DECIMAL / spa.total_searches) * 100, 2) as slow_search_percent,

  -- Search type distribution
  ROUND((spa.text_searches::DECIMAL / spa.total_searches) * 100, 2) as text_search_percent,
  ROUND((spa.vector_searches::DECIMAL / spa.total_searches) * 100, 2) as vector_search_percent,
  ROUND((spa.hybrid_searches::DECIMAL / spa.total_searches) * 100, 2) as hybrid_search_percent,

  -- User engagement
  ROUND(spa.searches_with_filters::DECIMAL / spa.total_searches * 100, 2) as filter_usage_percent,
  spa.avg_query_length,

  -- Performance assessment
  CASE 
    WHEN spa.avg_execution_time <= 100 THEN 'excellent'
    WHEN spa.avg_execution_time <= 300 THEN 'good'
    WHEN spa.avg_execution_time <= 800 THEN 'fair'
    ELSE 'needs_improvement'
  END as performance_rating,

  -- System health indicators
  CASE 
    WHEN (spa.successful_searches::DECIMAL / spa.total_searches) >= 0.9 THEN 'healthy'
    WHEN (spa.successful_searches::DECIMAL / spa.total_searches) >= 0.7 THEN 'moderate'
    ELSE 'concerning'
  END as system_health_status

FROM search_performance_analysis spa
ORDER BY spa.hour_bucket DESC;

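-- Popular and problematic queries analysis
-- (a CTE is scoped to a single statement, so run this with the query_pattern_analysis CTE from above redeclared)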
SELECT 
  'popular_queries' as analysis_type,
  qpa.query_text,
  qpa.query_frequency,
  qpa.success_rate,
  qpa.avg_results,
  qpa.avg_execution_time,
  qpa.unique_users,
  qpa.query_length_category,
  qpa.query_complexity,

  -- Recommendations
  CASE 
    WHEN qpa.success_rate < 50 THEN 'Investigate low success rate'
    WHEN qpa.avg_execution_time > 1000 THEN 'Optimize query performance'
    WHEN qpa.avg_results < 5 THEN 'Improve result relevance'
    WHEN qpa.conversions = 0 THEN 'Enhance result quality'
    ELSE 'Query performing well'
  END as recommendation

FROM query_pattern_analysis qpa
WHERE qpa.query_frequency >= 10
ORDER BY qpa.query_frequency DESC
LIMIT 20;

-- QueryLeaf provides comprehensive search capabilities:
-- 1. SQL-familiar syntax for Atlas Search index creation and management
-- 2. Advanced full-text search with fuzzy matching, highlighting, and boosting
-- 3. Vector similarity search for semantic understanding
-- 4. Faceted search and filtering with automatic facet generation
-- 5. Autocomplete and search suggestions with personalization
-- 6. Hybrid search combining multiple search methodologies
-- 7. Real-time search analytics and performance monitoring
-- 8. Integration with MongoDB's native Atlas Search optimizations
-- 9. Multi-language support and advanced text analysis
-- 10. Production-ready search capabilities with familiar SQL syntax
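
The hybrid scoring in the listing above can also be sketched outside SQL. The following is a minimal JavaScript illustration, assuming text scores fall roughly in the 0 to 10 range and similarity scores in 0 to 1; the function names and input shapes are hypothetical, while the weights (0.6 and 0.4), the scaling by 10, and the thresholds are copied from the SQL.

```javascript
// Hypothetical helpers; weights and thresholds mirror the SQL listing above.
const TEXT_WEIGHT = 0.6;
const SEMANTIC_WEIGHT = 0.4;

// Similarity is assumed to be in [0, 1], so it is scaled by 10 before weighting.
function hybridRelevance(textScore, semanticSimilarity) {
  const text = textScore ?? 0;
  const semantic = (semanticSimilarity ?? 0) * 10;
  return text * TEXT_WEIGHT + semantic * SEMANTIC_WEIGHT;
}

// Which search method(s) produced the document.
function searchMethod(textScore, semanticSimilarity) {
  const hasText = textScore != null;
  const hasSemantic = semanticSimilarity != null;
  if (hasText && hasSemantic) return 'hybrid_match';
  if (hasText) return 'text_match';
  if (hasSemantic) return 'semantic_match';
  return 'no_match';
}

function recommendationLevel(score) {
  if (score >= 8.0) return 'Highly Recommended';
  if (score >= 6.0) return 'Recommended';
  if (score >= 4.0) return 'Relevant';
  if (score >= 2.0) return 'Potentially Interesting';
  return 'Marginally Relevant';
}

// Merge two result lists keyed by documentId, similar to the UNION plus LEFT JOINs.
// textResults: [{ documentId, score }], semanticResults: [{ documentId, similarity }]
function mergeHybridResults(textResults, semanticResults, minScore = 1.0) {
  const merged = new Map();
  for (const { documentId, score } of textResults) {
    merged.set(documentId, { documentId, textScore: score, similarity: null });
  }
  for (const { documentId, similarity } of semanticResults) {
    const entry = merged.get(documentId) ?? { documentId, textScore: null, similarity: null };
    entry.similarity = similarity;
    merged.set(documentId, entry);
  }
  return [...merged.values()]
    .map((e) => {
      const hybridScore = hybridRelevance(e.textScore, e.similarity);
      return {
        documentId: e.documentId,
        hybridScore,
        searchMethod: searchMethod(e.textScore, e.similarity),
        recommendation: recommendationLevel(hybridScore)
      };
    })
    .filter((r) => r.hybridScore >= minScore)
    .sort((a, b) => b.hybridScore - a.hybridScore);
}
```

Keeping the weights in named constants makes it easier to tune the balance between lexical and semantic relevance without touching the merge logic.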

Best Practices for Atlas Search Implementation

Search Index Strategy and Performance Optimization

Essential principles for effective Atlas Search deployment:

Index Design: Create search indexes that balance functionality with performance, optimizing for your most common query patterns
Query Optimization: Structure search queries to leverage Atlas Search's advanced capabilities while maintaining fast response times
Relevance Tuning: Implement sophisticated relevance scoring that combines multiple factors for optimal search results
Multi-Language Support: Design search indexes and queries to handle multiple languages and character sets effectively
Performance Monitoring: Establish comprehensive search analytics to track performance and user behavior
Vector Integration: Leverage vector search for semantic understanding and enhanced search relevance
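
As a concrete illustration of the query optimization point, here is a minimal sketch of a helper that builds a fuzzy `$search` aggregation pipeline. The helper name, the `default` index name, and the `title`/`content` paths are assumptions for illustration; the `text` operator, its `fuzzy.maxEdits` option, and the `searchScore` metadata are standard Atlas Search features.

```javascript
// Hypothetical helper: builds an Atlas Search pipeline as plain objects.
function buildFuzzySearchPipeline(queryText, options = {}) {
  const {
    indexName = 'default',          // assumed index name
    paths = ['title', 'content'],   // assumed indexed fields
    maxEdits = 1,
    limit = 25
  } = options;

  // Atlas Search accepts 1 or 2 edits for fuzzy matching.
  if (maxEdits !== 1 && maxEdits !== 2) {
    throw new RangeError('maxEdits must be 1 or 2');
  }

  return [
    {
      $search: {
        index: indexName,
        text: {
          query: queryText,
          path: paths,
          fuzzy: { maxEdits }
        }
      }
    },
    // Cap the result set early so later stages handle fewer documents.
    { $limit: limit },
    // Expose the relevance score alongside the fields the UI needs.
    {
      $project: {
        title: 1,
        author: 1,
        category: 1,
        score: { $meta: 'searchScore' }
      }
    }
  ];
}
```

The returned array can be passed to `collection.aggregate(...)` on a collection that has a matching search index; `$search` must remain the first stage of the pipeline.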

Production Search Architecture

Design search systems for enterprise-scale requirements:

Scalable Architecture: Implement search infrastructure that can handle high query volumes and large datasets
Advanced Analytics: Deploy comprehensive search analytics with user behavior tracking and performance optimization
Personalization Engine: Integrate machine learning-based personalization for improved search relevance
Multi-Modal Search: Support various search types including text, semantic, and multimedia search capabilities
Real-Time Optimization: Implement automated search optimization based on usage patterns and performance metrics
Security Integration: Ensure search implementations respect data access controls and privacy requirements

Conclusion

MongoDB Atlas Search provides native search capabilities that remove the need to run and synchronize an external search engine: text indexing, vector similarity search, and relevance scoring are integrated directly within MongoDB. The combination of full-text search with semantic understanding, multi-language support, and automatic index synchronization makes Atlas Search well suited to modern applications requiring sophisticated search experiences.

Key Atlas Search benefits include:

Native Integration: Seamless search capabilities without external dependencies or complex data synchronization
Advanced Text Analysis: Comprehensive full-text search with fuzzy matching, highlighting, and multi-language support
Vector Similarity: Semantic search capabilities using machine learning embeddings for contextual understanding
Automatic Synchronization: Search indexes follow collection changes automatically (updates are eventually consistent, typically near real time) without manual refresh or batch reindexing
Faceted Search: Advanced filtering and categorization capabilities for enhanced user search experiences
SQL Accessibility: Familiar SQL-style search operations through QueryLeaf for accessible search implementation

Whether you're building content management systems, e-commerce platforms, knowledge bases, or enterprise search applications, MongoDB Atlas Search with QueryLeaf's familiar SQL interface provides the foundation for powerful, scalable search experiences.

QueryLeaf Integration: QueryLeaf seamlessly manages MongoDB Atlas Search operations while providing SQL-familiar search syntax, index management, and advanced search query construction. Sophisticated search patterns including full-text search, vector similarity, faceted filtering, and search analytics are elegantly handled through familiar SQL constructs, making advanced search capabilities both powerful and accessible to SQL-oriented development teams.

The combination of MongoDB's robust Atlas Search capabilities with SQL-style search operations makes it an ideal platform for applications requiring both advanced search functionality and familiar database interaction patterns, ensuring your search implementations remain both sophisticated and maintainable as your search requirements evolve and scale.

November 4, 2025
26 min read

MongoDB Geospatial Queries and Location-Based Services: Advanced Spatial Indexing and Geographic Data Management

Modern applications increasingly rely on location-aware functionality, from ride-sharing and delivery services to social media check-ins and targeted marketing. Traditional database systems struggle with complex spatial operations, often requiring specialized GIS software or complex geometric calculations that are difficult to integrate, maintain, and scale within application architectures.

MongoDB provides comprehensive native geospatial capabilities with advanced spatial indexing, sophisticated geometric operations, and high-performance location-based queries that eliminate the complexity of external GIS systems. Unlike traditional approaches that require separate spatial databases or complex geometric libraries, MongoDB's integrated geospatial features deliver superior performance through optimized spatial indexes, native coordinate system support, and seamless integration with application data models.
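
To make the coordinate handling concrete, here is a minimal sketch of building a GeoJSON point and a proximity filter. The helper names and the `coordinates` field are assumptions for illustration; `$near` with `$geometry` and `$maxDistance` (in meters) is the standard query operator and requires a `2dsphere` index on the queried field.

```javascript
// Hypothetical helpers for illustration.

// GeoJSON stores positions as [longitude, latitude], in that order.
function toGeoJSONPoint(longitude, latitude) {
  if (!Number.isFinite(longitude) || longitude < -180 || longitude > 180) {
    throw new RangeError('longitude must be between -180 and 180');
  }
  if (!Number.isFinite(latitude) || latitude < -90 || latitude > 90) {
    throw new RangeError('latitude must be between -90 and 90');
  }
  return { type: 'Point', coordinates: [longitude, latitude] };
}

// Builds a filter that returns documents sorted from nearest to farthest.
function buildNearFilter(field, longitude, latitude, maxDistanceMeters) {
  return {
    [field]: {
      $near: {
        $geometry: toGeoJSONPoint(longitude, latitude),
        $maxDistance: maxDistanceMeters
      }
    }
  };
}
```

A filter built this way can be passed to `collection.find(...)`. Validating the ranges up front catches the common mistake of swapping latitude and longitude whenever the swapped value falls outside the valid latitude range.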

The Traditional Geospatial Challenge

Conventional approaches to location-based services involve significant complexity and performance limitations:

-- Traditional PostgreSQL geospatial approach - complex setup and limited optimization

-- PostGIS extension required for spatial capabilities
CREATE EXTENSION IF NOT EXISTS postgis;

-- Location-based entities with complex geometric types
CREATE TABLE locations (
    location_id BIGSERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    category VARCHAR(100) NOT NULL,

    -- PostGIS geometry types (complex to work with)
    coordinates GEOMETRY(POINT, 4326) NOT NULL, -- WGS84 coordinate system
    coverage_area GEOMETRY(POLYGON, 4326),
    search_radius GEOMETRY(POLYGON, 4326),

    -- Additional location metadata
    address TEXT,
    city VARCHAR(100),
    state VARCHAR(50),
    country VARCHAR(50),
    postal_code VARCHAR(20),

    -- Business information
    phone_number VARCHAR(20),
    operating_hours JSONB,
    rating DECIMAL(3,2),
    price_range INTEGER,

    -- Spatial analysis metadata
    population_density INTEGER,
    traffic_level VARCHAR(20),
    accessibility_score DECIMAL(4,2),

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Complex spatial indexing (manual configuration required)
CREATE INDEX idx_locations_coordinates ON locations USING GIST (coordinates);
CREATE INDEX idx_locations_coverage ON locations USING GIST (coverage_area);
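-- Note: mixing a geometry column with a scalar column in one GiST index requires the btree_gist extension
CREATE INDEX idx_locations_category_coords ON locations USING GIST (coordinates, category);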

-- User location tracking with spatial relationships
CREATE TABLE user_locations (
    user_location_id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL,
    location_coordinates GEOMETRY(POINT, 4326) NOT NULL,
    accuracy_meters DECIMAL(8,2),
    altitude_meters DECIMAL(8,2),

    -- Movement tracking
    speed_kmh DECIMAL(6,2),
    heading_degrees DECIMAL(5,2),

    -- Context information
    location_method VARCHAR(50), -- GPS, WIFI, CELL, MANUAL
    device_type VARCHAR(50),
    battery_level INTEGER,

    -- Privacy and permissions
    location_sharing_level VARCHAR(20) DEFAULT 'private',
    geofence_notifications BOOLEAN DEFAULT false,

    -- Temporal tracking
    recorded_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    session_id VARCHAR(100),

    FOREIGN KEY (user_id) REFERENCES users(user_id)
);

-- Spatial indexes for user locations
CREATE INDEX idx_user_locations_coords ON user_locations USING GIST (location_coordinates);
CREATE INDEX idx_user_locations_user_time ON user_locations (user_id, recorded_at);
CREATE INDEX idx_user_locations_session ON user_locations (session_id, recorded_at);

-- Complex proximity search with performance issues
WITH nearby_locations AS (
    SELECT 
        l.location_id,
        l.name,
        l.category,
        l.address,
        l.rating,
        l.price_range,

        -- Complex distance calculations
        ST_Distance(
            l.coordinates::geography, 
            ST_SetSRID(ST_MakePoint($longitude, $latitude), 4326)::geography
        ) as distance_meters,

        -- Geometric relationships (expensive operations)
        ST_Contains(l.coverage_area, ST_SetSRID(ST_MakePoint($longitude, $latitude), 4326)) as within_coverage,
        ST_Intersects(l.search_radius, ST_SetSRID(ST_MakePoint($longitude, $latitude), 4326)) as in_search_area,

        -- Bearing calculation (complex trigonometry)
        ST_Azimuth(
            l.coordinates, 
            ST_SetSRID(ST_MakePoint($longitude, $latitude), 4326)
        ) * 180 / PI() as bearing_degrees,

        -- Additional spatial analysis
        l.coordinates,
        l.operating_hours,
        l.phone_number

    FROM locations l
    WHERE 
        -- Basic distance filter (still expensive without proper optimization)
        ST_DWithin(
            l.coordinates::geography, 
            ST_SetSRID(ST_MakePoint($longitude, $latitude), 4326)::geography, 
            $search_radius_meters
        )

        -- Category filtering
        AND ($category IS NULL OR l.category = $category)

        -- Rating filtering
        AND ($min_rating IS NULL OR l.rating >= $min_rating)

        -- Price filtering
        AND ($max_price IS NULL OR l.price_range <= $max_price)

    ORDER BY distance_meters
    LIMIT $limit_count
),

location_analytics AS (
    -- Complex spatial aggregations with performance impact
    SELECT 
        nl.category,
        COUNT(*) as location_count,
        AVG(nl.rating) as avg_rating,
        AVG(nl.distance_meters) as avg_distance,
        MIN(nl.distance_meters) as closest_distance,
        MAX(nl.distance_meters) as furthest_distance,

        -- Expensive geometric calculations
        ST_ConvexHull(ST_Collect(nl.coordinates)) as coverage_polygon,
        ST_Centroid(ST_Collect(nl.coordinates)) as category_center,

        -- Statistical analysis (resource intensive)
        STDDEV_POP(nl.distance_meters) as distance_stddev,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY nl.distance_meters) as median_distance

    FROM nearby_locations nl
    GROUP BY nl.category
),

user_movement_analysis AS (
    -- Track user movement patterns (very expensive queries)
    SELECT 
        ul.user_id,
        COUNT(*) as location_updates,

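        -- Complex movement calculations (per-step distances come from the subquery
        -- below, because window functions cannot be nested inside aggregates)
        SUM(ul.step_distance_meters) as total_distance_traveled,

        -- Speed analysis
        AVG(ul.speed_kmh) as avg_speed,
        MAX(ul.speed_kmh) as max_speed,

        -- Time-based analysis (EPOCH yields the full interval length in seconds)
        EXTRACT(EPOCH FROM (MAX(ul.recorded_at) - MIN(ul.recorded_at))) as session_duration_seconds,

        -- Geofencing analysis (complex polygon operations)
        COUNT(*) FILTER (
            WHERE EXISTS (
                SELECT 1 FROM locations l 
                WHERE ST_Contains(l.coverage_area, ul.location_coordinates)
            )
        ) as geofence_entries,

        -- Movement patterns (distinct modes observed; DISTINCT cannot be combined
        -- with an ORDER BY on a column that is not part of the aggregated expression)
        STRING_AGG(
            DISTINCT CASE 
                WHEN ul.speed_kmh > 50 THEN 'highway'
                WHEN ul.speed_kmh > 20 THEN 'city'
                WHEN ul.speed_kmh > 5 THEN 'walking'
                ELSE 'stationary'
            END, 
            ','
        ) as movement_pattern

    FROM (
        SELECT 
            pts.user_id,
            pts.location_coordinates,
            pts.speed_kmh,
            pts.recorded_at,
            -- Distance from the previous recorded point of the same user
            ST_Distance(
                pts.location_coordinates::geography,
                LAG(pts.location_coordinates::geography) OVER (
                    PARTITION BY pts.user_id 
                    ORDER BY pts.recorded_at
                )
            ) as step_distance_meters
        FROM user_locations pts
        WHERE pts.recorded_at >= CURRENT_TIMESTAMP - INTERVAL '1 day'
    ) ul
    GROUP BY ul.user_id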
)

-- Final complex spatial query with multiple joins and calculations
SELECT 
    nl.location_id,
    nl.name,
    nl.category,
    nl.address,
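    -- ROUND(value, digits) is only defined for numeric in PostgreSQL, hence the casts
    ROUND(nl.distance_meters::numeric, 2) as distance_meters,
    ROUND(nl.bearing_degrees::numeric, 1) as bearing_degrees,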
    nl.rating,
    nl.price_range,

    -- Spatial relationship indicators
    nl.within_coverage,
    nl.in_search_area,

    -- Analytics context
    la.location_count as similar_nearby_count,
    ROUND(la.avg_rating, 2) as category_avg_rating,
    ROUND(la.avg_distance::numeric, 2) as category_avg_distance,

    -- User movement context (if available)
    uma.total_distance_traveled,
    uma.avg_speed,
    uma.movement_pattern,

    -- Additional computed fields
    CASE 
        WHEN nl.distance_meters <= 100 THEN 'immediate_vicinity'
        WHEN nl.distance_meters <= 500 THEN 'very_close'
        WHEN nl.distance_meters <= 1000 THEN 'walking_distance'
        WHEN nl.distance_meters <= 5000 THEN 'short_drive'
        ELSE 'distant'
    END as proximity_category,

    -- Operating status (complex JSON processing)
    CASE 
        WHEN nl.operating_hours IS NULL THEN 'unknown'
        WHEN nl.operating_hours->>(EXTRACT(DOW FROM CURRENT_TIMESTAMP)::TEXT) IS NULL THEN 'closed'
        ELSE 'check_hours'
    END as operating_status,

    -- Recommendations based on multiple factors
    CASE 
        WHEN nl.rating >= 4.5 AND nl.distance_meters <= 1000 THEN 'highly_recommended'
        WHEN nl.rating >= 4.0 AND nl.distance_meters <= 2000 THEN 'recommended'
        WHEN nl.distance_meters <= 500 THEN 'convenient'
        ELSE 'standard'
    END as recommendation_level

FROM nearby_locations nl
LEFT JOIN location_analytics la ON nl.category = la.category
LEFT JOIN user_movement_analysis uma ON uma.user_id = $user_id
ORDER BY 
    -- Complex sorting logic
    CASE $sort_preference
        WHEN 'distance' THEN nl.distance_meters
        WHEN 'rating' THEN -nl.rating * 100 + nl.distance_meters
        WHEN 'price' THEN nl.price_range * 1000 + nl.distance_meters
        ELSE nl.distance_meters
    END
LIMIT $result_limit;

-- Traditional geospatial approach problems:
-- 1. Requires PostGIS extension and complex geometric type management
-- 2. Expensive spatial calculations with limited built-in optimization
-- 3. Complex coordinate system transformations and projections
-- 4. Poor performance with large datasets and concurrent spatial queries
-- 5. Limited integration with application data models and business logic
-- 6. Complex indexing strategies requiring deep GIS expertise
-- 7. Difficult to maintain and scale spatial operations
-- 8. Limited support for modern location-based service patterns
-- 9. Complex query syntax requiring specialized GIS knowledge
-- 10. Poor integration with real-time and streaming location data
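
For readers unfamiliar with what the distance and bearing expressions in the listing above compute, here is a minimal plain JavaScript sketch of great-circle distance (haversine) and initial bearing on a sphere. The function names are hypothetical, and the spherical model with a mean Earth radius is an assumption: PostGIS geography calculations use a spheroid by default, so its results differ slightly.

```javascript
// Hypothetical helpers; a spherical Earth model is assumed.
const EARTH_RADIUS_METERS = 6371008.8; // mean Earth radius

function toRadians(degrees) {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance between two lng/lat points using the haversine formula.
function haversineDistanceMeters(lng1, lat1, lng2, lat2) {
  const phi1 = toRadians(lat1);
  const phi2 = toRadians(lat2);
  const dPhi = toRadians(lat2 - lat1);
  const dLambda = toRadians(lng2 - lng1);

  const a =
    Math.sin(dPhi / 2) ** 2 +
    Math.cos(phi1) * Math.cos(phi2) * Math.sin(dLambda / 2) ** 2;

  return 2 * EARTH_RADIUS_METERS * Math.asin(Math.min(1, Math.sqrt(a)));
}

// Initial bearing from point 1 to point 2, in degrees clockwise from north (0 to 360).
function initialBearingDegrees(lng1, lat1, lng2, lat2) {
  const phi1 = toRadians(lat1);
  const phi2 = toRadians(lat2);
  const dLambda = toRadians(lng2 - lng1);

  const y = Math.sin(dLambda) * Math.cos(phi2);
  const x =
    Math.cos(phi1) * Math.sin(phi2) -
    Math.sin(phi1) * Math.cos(phi2) * Math.cos(dLambda);

  const degrees = (Math.atan2(y, x) * 180) / Math.PI;
  return (degrees + 360) % 360;
}
```

One degree of latitude comes out at roughly 111 km with this radius, which is a quick sanity check when validating search radii expressed in meters.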

MongoDB provides comprehensive geospatial capabilities with native optimization and seamless integration:

// MongoDB Advanced Geospatial Operations - native spatial capabilities with optimal performance
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('location_services');

// Comprehensive MongoDB Geospatial Manager
class MongoDBGeospatialManager {
  constructor(db, config = {}) {
    this.db = db;
    this.config = {
      // Default search parameters
      defaultSearchRadius: config.defaultSearchRadius || 5000, // 5km
      defaultMaxResults: config.defaultMaxResults || 100,

      // Performance optimization
      enableSpatialIndexing: config.enableSpatialIndexing !== false,
      enableQueryOptimization: config.enableQueryOptimization !== false,
      enableBulkOperations: config.enableBulkOperations !== false,

      // Coordinate system configuration
      defaultCoordinateSystem: config.defaultCoordinateSystem || 'WGS84',
      enableEarthDistance: config.enableEarthDistance !== false,

      // Advanced features
      enableGeofencing: config.enableGeofencing !== false,
      enableLocationAnalytics: config.enableLocationAnalytics !== false,
      enableRealTimeTracking: config.enableRealTimeTracking !== false,
      enableSpatialAggregation: config.enableSpatialAggregation !== false,

      // Performance monitoring
      enablePerformanceMetrics: config.enablePerformanceMetrics !== false,
      logSlowQueries: config.logSlowQueries !== false,
      queryTimeoutMs: config.queryTimeoutMs || 30000,

      ...config
    };

    // Collection references
    this.collections = {
      locations: db.collection('locations'),
      userLocations: db.collection('user_locations'),
      geofences: db.collection('geofences'),
      locationAnalytics: db.collection('location_analytics'),
      spatialEvents: db.collection('spatial_events')
    };

    // Performance tracking
    this.queryMetrics = {
      totalQueries: 0,
      averageQueryTime: 0,
      spatialQueries: 0,
      indexHits: 0
    };

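    // Index setup is asynchronous; keep the promise so callers can await
    // manager.ready (and handle failures) before issuing queries
    this.ready = this.initializeGeospatialCollections();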
  }

  async initializeGeospatialCollections() {
    console.log('Initializing geospatial collections and spatial indexes...');

    try {
      // Setup locations collection with advanced spatial indexing
      await this.setupLocationsCollection();

      // Setup user location tracking
      await this.setupUserLocationTracking();

      // Setup geofencing capabilities
      await this.setupGeofencingSystem();

      // Setup location analytics
      await this.setupLocationAnalytics();

      // Setup spatial event tracking
      await this.setupSpatialEventTracking();

      console.log('All geospatial collections initialized successfully');

    } catch (error) {
      console.error('Error initializing geospatial collections:', error);
      throw error;
    }
  }

  async setupLocationsCollection() {
    console.log('Setting up locations collection with spatial indexing...');

    const locationsCollection = this.collections.locations;

    // Create 2dsphere index for geospatial queries (primary spatial index)
    await locationsCollection.createIndex(
      { coordinates: '2dsphere' },
      { 
        background: true,
        name: 'coordinates_2dsphere',
        // Pin the 2dsphere index version explicitly (3 is the current default)
        '2dsphereIndexVersion': 3
      }
    );

    // Compound indexes for optimized spatial queries with filters
    await locationsCollection.createIndex(
      { coordinates: '2dsphere', category: 1 },
      { background: true, name: 'spatial_category_index' }
    );

    await locationsCollection.createIndex(
      { coordinates: '2dsphere', rating: -1, priceRange: 1 },
      { background: true, name: 'spatial_rating_price_index' }
    );

    // Coverage area indexing for geofencing
    await locationsCollection.createIndex(
      { coverageArea: '2dsphere' },
      { 
        background: true, 
        sparse: true, 
        name: 'coverage_area_index' 
      }
    );

    // Text index for location search
    await locationsCollection.createIndex(
      { name: 'text', address: 'text', category: 'text' },
      { 
        background: true,
        name: 'location_text_search',
        weights: { name: 3, category: 2, address: 1 }
      }
    );

    // Additional performance indexes
    await locationsCollection.createIndex(
      { category: 1, rating: -1, createdAt: -1 },
      { background: true }
    );

    console.log('Locations collection spatial indexing complete');
  }

  async createLocation(locationData) {
    console.log('Creating location with geospatial data...');

    const startTime = Date.now();

    try {
      const locationDocument = {
        locationId: locationData.locationId || new ObjectId(),
        name: locationData.name,
        category: locationData.category,

        // GeoJSON Point for precise coordinates
        coordinates: {
          type: 'Point',
          coordinates: [locationData.longitude, locationData.latitude] // [lng, lat] order in GeoJSON
        },

        // Optional coverage area as GeoJSON Polygon
        coverageArea: locationData.coverageArea ? {
          type: 'Polygon',
          coordinates: locationData.coverageArea // Array of coordinate arrays
        } : null,

        // Location details
        address: locationData.address,
        city: locationData.city,
        state: locationData.state,
        country: locationData.country,
        postalCode: locationData.postalCode,

        // Business information
        phoneNumber: locationData.phoneNumber,
        website: locationData.website,
        operatingHours: locationData.operatingHours || {},
        rating: locationData.rating || 0,
        priceRange: locationData.priceRange || 1,

        // Spatial metadata
        accuracy: locationData.accuracy,
        altitude: locationData.altitude,
        floor: locationData.floor,

        // Business analytics data
        popularityScore: locationData.popularityScore || 0,
        trafficLevel: locationData.trafficLevel,
        accessibilityFeatures: locationData.accessibilityFeatures || [],

        // Temporal information
        createdAt: new Date(),
        updatedAt: new Date(),

        // Custom attributes
        customAttributes: locationData.customAttributes || {},
        tags: locationData.tags || [],

        // Verification status
        verified: locationData.verified || false,
        verificationSource: locationData.verificationSource
      };

      // Validate GeoJSON format
      if (!this.validateGeoJSONPoint(locationDocument.coordinates)) {
        throw new Error('Invalid coordinates format - must be valid GeoJSON Point');
      }

      const result = await this.collections.locations.insertOne(locationDocument);

      const processingTime = Date.now() - startTime;
      this.updateQueryMetrics('create_location', processingTime);

      console.log(`Location created: ${result.insertedId} (${processingTime}ms)`);

      return {
        success: true,
        locationId: result.insertedId,
        coordinates: locationDocument.coordinates,
        processingTime: processingTime
      };

    } catch (error) {
      console.error('Error creating location:', error);
      return {
        success: false,
        error: error.message,
        processingTime: Date.now() - startTime
      };
    }
  }

  async findNearbyLocations(longitude, latitude, options = {}) {
    console.log(`Finding locations near [${longitude}, ${latitude}]...`);

    const startTime = Date.now();

    try {
      // Build aggregation pipeline for advanced spatial query
      const pipeline = [
        // Stage 1: Geospatial proximity matching
        {
          $geoNear: {
            near: {
              type: 'Point',
              coordinates: [longitude, latitude]
            },
            distanceField: 'distanceMeters',
            maxDistance: options.maxDistance || this.config.defaultSearchRadius,
            spherical: true,

            // Advanced filtering options
            query: {
              ...(options.category && { category: options.category }),
              ...(options.minRating && { rating: { $gte: options.minRating } }),
              ...(options.maxPriceRange && { priceRange: { $lte: options.maxPriceRange } }),
              ...(options.verified !== undefined && { verified: options.verified }),
              ...(options.tags && { tags: { $in: options.tags } })
            },

            // Result count is capped by the $limit stage below;
            // $geoNear no longer accepts a `limit` option (removed in MongoDB 4.2)
          }
        },

        // Stage 2: Add computed fields and spatial analysis
        {
          $addFields: {
            // Distance calculations
            distanceKm: { $divide: ['$distanceMeters', 1000] },

            // Bearing calculation (direction from search point to location)
            bearing: {
              $let: {
                vars: {
                  lat1: { $degreesToRadians: latitude },
                  lat2: { $degreesToRadians: { $arrayElemAt: ['$coordinates.coordinates', 1] } },
                  lng1: { $degreesToRadians: longitude },
                  lng2: { $degreesToRadians: { $arrayElemAt: ['$coordinates.coordinates', 0] } }
                },
                in: {
                  $mod: [
                    {
                      $add: [
                        {
                          $radiansToDegrees: {
                            $atan2: [
                              {
                                // y = sin(lng2 - lng1) * cos(lat2)
                                $multiply: [
                                  { $sin: { $subtract: ['$$lng2', '$$lng1'] } },
                                  { $cos: '$$lat2' }
                                ]
                              },
                              {
                                $subtract: [
                                  {
                                    $multiply: [
                                      { $cos: '$$lat1' },
                                      { $sin: '$$lat2' }
                                    ]
                                  },
                                  {
                                    $multiply: [
                                      { $sin: '$$lat1' },
                                      { $cos: '$$lat2' },
                                      { $cos: { $subtract: ['$$lng2', '$$lng1'] } }
                                    ]
                                  }
                                ]
                              }
                            ]
                          }
                        },
                        360
                      ]
                    },
                    360
                  ]
                }
              }
            },

            // Proximity categorization
            proximityCategory: {
              $switch: {
                branches: [
                  { case: { $lte: ['$distanceMeters', 100] }, then: 'immediate_vicinity' },
                  { case: { $lte: ['$distanceMeters', 500] }, then: 'very_close' },
                  { case: { $lte: ['$distanceMeters', 1000] }, then: 'walking_distance' },
                  { case: { $lte: ['$distanceMeters', 5000] }, then: 'short_drive' }
                ],
                default: 'distant'
              }
            },

            // Recommendation scoring
            recommendationScore: {
              $add: [
                // Base rating score (0-5 scale)
                { $multiply: ['$rating', 2] },

                // Distance penalty (closer is better)
                {
                  $subtract: [
                    10,
                    { $divide: ['$distanceMeters', 500] }
                  ]
                },

                // Popularity bonus
                { $multiply: ['$popularityScore', 0.5] },

                // Verification bonus
                { $cond: [{ $eq: ['$verified', true] }, 2, 0] }
              ]
            }
          }
        },

        // Stage 3: Operating hours analysis (if requested)
        ...(options.checkOperatingHours ? [{
          $addFields: {
            currentlyOpen: {
              $let: {
                vars: {
                  now: new Date(),
                  dayOfWeek: { $dayOfWeek: new Date() }, // 1 = Sunday, 7 = Saturday
                  currentTime: { 
                    $dateToString: { 
                      format: '%H:%M', 
                      date: new Date() 
                    } 
                  }
                },
                in: {
                  // Simplified operating hours check
                  $cond: [
                    { $ne: ['$operatingHours', null] },
                    true, // Would implement complex time checking logic
                    null
                  ]
                }
              }
            }
          }
        }] : []),

        // Stage 4: Coverage area intersection (if requested)
        ...(options.checkCoverageArea ? [{
          $addFields: {
            withinCoverageArea: {
              $cond: [
                { $ne: ['$coverageArea', null] },
                {
                  $function: {
                    body: `function(coverageArea, searchPoint) {
                      // Simplified point-in-polygon check
                      // In production, use MongoDB's native $geoIntersects
                      return true; // Placeholder for complex geometric calculation
                    }`,
                    args: ['$coverageArea', { type: 'Point', coordinates: [longitude, latitude] }],
                    lang: 'js'
                  }
                },
                null
              ]
            }
          }
        }] : []),

        // Stage 5: Final sorting and formatting
        {
          // Default sort by recommendation score; sort by distance only when requested
          $sort: options.sortBy === 'distance'
            ? { distanceMeters: 1, rating: -1 }
            : { recommendationScore: -1, distanceMeters: 1, rating: -1 }
        },

        // Stage 6: Limit results
        { $limit: options.limit || this.config.defaultMaxResults },

        // Stage 7: Project final result structure
        {
          $project: {
            locationId: 1,
            name: 1,
            category: 1,
            coordinates: 1,
            address: 1,
            city: 1,
            state: 1,
            country: 1,
            phoneNumber: 1,
            website: 1,
            rating: 1,
            priceRange: 1,

            // Spatial analysis results
            distanceMeters: { $round: ['$distanceMeters', 2] },
            distanceKm: { $round: ['$distanceKm', 3] },
            bearing: { $round: ['$bearing', 1] },
            proximityCategory: 1,
            recommendationScore: { $round: ['$recommendationScore', 2] },

            // Conditional fields
            ...(options.checkOperatingHours && { currentlyOpen: 1 }),
            ...(options.checkCoverageArea && { withinCoverageArea: 1 }),

            // Metadata
            verified: 1,
            tags: 1,
            createdAt: 1,

            // Custom attributes if requested
            ...(options.includeCustomAttributes && { customAttributes: 1 })
          }
        }
      ];

      // Execute aggregation pipeline
      const locations = await this.collections.locations.aggregate(
        pipeline,
        {
          allowDiskUse: true,
          maxTimeMS: this.config.queryTimeoutMs,
          hint: 'coordinates_2dsphere' // Use spatial index
        }
      ).toArray();

      const processingTime = Date.now() - startTime;
      this.updateQueryMetrics('nearby_search', processingTime);

      console.log(`Found ${locations.length} nearby locations (${processingTime}ms)`);

      return {
        success: true,
        locations: locations,
        searchParams: {
          coordinates: [longitude, latitude],
          maxDistance: options.maxDistance || this.config.defaultSearchRadius,
          filters: options
        },
        resultsCount: locations.length,
        processingTime: processingTime
      };

    } catch (error) {
      console.error('Error finding nearby locations:', error);
      return {
        success: false,
        error: error.message,
        processingTime: Date.now() - startTime
      };
    }
  }

  async setupUserLocationTracking() {
    console.log('Setting up user location tracking...');

    const userLocationsCollection = this.collections.userLocations;

    // Spatial index for user locations
    await userLocationsCollection.createIndex(
      { coordinates: '2dsphere' },
      { background: true, name: 'user_coordinates_spatial' }
    );

    // Compound indexes for user tracking queries
    await userLocationsCollection.createIndex(
      { userId: 1, recordedAt: -1 },
      { background: true, name: 'user_timeline' }
    );

    await userLocationsCollection.createIndex(
      { sessionId: 1, recordedAt: 1 },
      { background: true, name: 'session_tracking' }
    );

    // Geofencing compound index
    await userLocationsCollection.createIndex(
      { coordinates: '2dsphere', userId: 1, recordedAt: -1 },
      { background: true, name: 'spatial_user_timeline' }
    );

    console.log('User location tracking setup complete');
  }

  async trackUserLocation(userId, longitude, latitude, metadata = {}) {
    console.log(`Tracking location for user ${userId}: [${longitude}, ${latitude}]`);

    const startTime = Date.now();

    try {
      const locationDocument = {
        userId: userId,
        coordinates: {
          type: 'Point',
          coordinates: [longitude, latitude]
        },

        // Accuracy and technical metadata
        accuracy: metadata.accuracy,
        altitude: metadata.altitude,
        speed: metadata.speed,
        heading: metadata.heading,

        // Device and method information
        locationMethod: metadata.locationMethod || 'GPS',
        deviceType: metadata.deviceType,
        batteryLevel: metadata.batteryLevel,

        // Session context
        sessionId: metadata.sessionId,
        applicationContext: metadata.applicationContext,

        // Privacy and sharing
        locationSharingLevel: metadata.locationSharingLevel || 'private',
        allowGeofenceNotifications: metadata.allowGeofenceNotifications || false,

        // Temporal information
        recordedAt: metadata.recordedAt || new Date(),
        serverProcessedAt: new Date(),

        // Movement analysis
        isStationary: metadata.isStationary || false,
        movementType: metadata.movementType, // walking, driving, cycling, stationary

        // Custom context
        customData: metadata.customData || {}
      };

      // Validate coordinates
      if (!this.validateGeoJSONPoint(locationDocument.coordinates)) {
        throw new Error('Invalid coordinates for user location tracking');
      }

      const result = await this.collections.userLocations.insertOne(locationDocument);

      // Check for geofence triggers (if enabled)
      if (this.config.enableGeofencing) {
        await this.checkGeofenceEvents(userId, longitude, latitude);
      }

      const processingTime = Date.now() - startTime;
      this.updateQueryMetrics('track_user_location', processingTime);

      return {
        success: true,
        locationId: result.insertedId,
        coordinates: locationDocument.coordinates,
        processingTime: processingTime,
        geofenceChecked: this.config.enableGeofencing
      };

    } catch (error) {
      console.error('Error tracking user location:', error);
      return {
        success: false,
        error: error.message,
        processingTime: Date.now() - startTime
      };
    }
  }

  async getUserLocationHistory(userId, options = {}) {
    console.log(`Retrieving location history for user ${userId}...`);

    const startTime = Date.now();

    try {
      const pipeline = [
        // Stage 1: Filter by user and time range
        {
          $match: {
            userId: userId,
            recordedAt: {
              $gte: options.startDate || new Date(Date.now() - (7 * 24 * 60 * 60 * 1000)), // 7 days default
              $lte: options.endDate || new Date()
            },
            ...(options.sessionId && { sessionId: options.sessionId }),
            ...(options.locationSharingLevel && { locationSharingLevel: options.locationSharingLevel })
          }
        },

        // Stage 2: Sort chronologically
        { $sort: { recordedAt: 1 } },

        // Stage 3: Look up the previous point with $shift (MongoDB 5.0+).
        // Window operators such as $shift are only valid inside $setWindowFields.
        {
          $setWindowFields: {
            partitionBy: '$userId',
            sortBy: { recordedAt: 1 },
            output: {
              previousCoordinates: {
                $shift: { output: '$coordinates', by: -1, default: null }
              },
              previousRecordedAt: {
                $shift: { output: '$recordedAt', by: -1, default: null }
              }
            }
          }
        },

        // Stage 4: Per-point movement metrics
        {
          $addFields: {
            // Milliseconds since the previous location update (0 for the first point)
            timeSincePrevious: {
              $subtract: [
                '$recordedAt',
                { $ifNull: ['$previousRecordedAt', '$recordedAt'] }
              ]
            },

            // Distance from previous location
            distanceFromPrevious: {
              $function: {
                body: `function(currentCoords, previousCoords) {
                  if (!previousCoords) return 0;

                  // Haversine formula for distance calculation
                  const R = 6371000; // Earth's radius in meters
                  const lat1 = currentCoords.coordinates[1] * Math.PI / 180;
                  const lat2 = previousCoords.coordinates[1] * Math.PI / 180;
                  const deltaLat = (previousCoords.coordinates[1] - currentCoords.coordinates[1]) * Math.PI / 180;
                  const deltaLng = (previousCoords.coordinates[0] - currentCoords.coordinates[0]) * Math.PI / 180;

                  const a = Math.sin(deltaLat/2) * Math.sin(deltaLat/2) +
                           Math.cos(lat1) * Math.cos(lat2) *
                           Math.sin(deltaLng/2) * Math.sin(deltaLng/2);
                  const c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1-a));

                  return R * c;
                }`,
                args: ['$coordinates', '$previousCoordinates'],
                lang: 'js'
              }
            }
          }
        },

        // Stage 5: Running total distance (needs distanceFromPrevious from Stage 4)
        {
          $setWindowFields: {
            partitionBy: '$userId',
            sortBy: { recordedAt: 1 },
            output: {
              totalDistanceTraveled: {
                $sum: '$distanceFromPrevious',
                window: { documents: ['unbounded', 'current'] }
              }
            }
          }
        },

        // Stage 6: Limit results
        { $limit: options.limit || 1000 },

        // Stage 7: Project final format
        {
          $project: {
            coordinates: 1,
            accuracy: 1,
            altitude: 1,
            speed: 1,
            heading: 1,
            locationMethod: 1,
            recordedAt: 1,
            sessionId: 1,
            movementType: 1,

            // Calculated fields
            distanceFromPrevious: { $round: ['$distanceFromPrevious', 2] },
            totalDistanceTraveled: { $round: ['$totalDistanceTraveled', 2] },
            timeSincePrevious: { $divide: ['$timeSincePrevious', 1000] }, // Convert to seconds

            // Privacy filtered custom data
            ...(options.includeCustomData && { customData: 1 })
          }
        }
      ];

      const locationHistory = await this.collections.userLocations.aggregate(
        pipeline,
        { allowDiskUse: true, maxTimeMS: this.config.queryTimeoutMs }
      ).toArray();

      // Calculate summary statistics
      const totalDistance = locationHistory.reduce((sum, loc) => sum + (loc.distanceFromPrevious || 0), 0);
      const timespan = locationHistory.length > 0 ? 
        new Date(locationHistory[locationHistory.length - 1].recordedAt) - new Date(locationHistory[0].recordedAt) : 0;

      const processingTime = Date.now() - startTime;
      this.updateQueryMetrics('user_location_history', processingTime);

      return {
        success: true,
        locationHistory: locationHistory,
        summary: {
          totalPoints: locationHistory.length,
          totalDistanceMeters: Math.round(totalDistance),
          totalDistanceKm: Math.round(totalDistance / 1000 * 100) / 100,
          timespanHours: Math.round(timespan / (1000 * 60 * 60) * 100) / 100,
          averageSpeed: timespan > 0 ? Math.round((totalDistance / (timespan / 1000)) * 3.6 * 100) / 100 : 0 // km/h
        },
        processingTime: processingTime
      };

    } catch (error) {
      console.error('Error retrieving user location history:', error);
      return {
        success: false,
        error: error.message,
        processingTime: Date.now() - startTime
      };
    }
  }

  async setupGeofencingSystem() {
    console.log('Setting up geofencing system...');

    const geofencesCollection = this.collections.geofences;

    // Spatial index for geofence areas
    await geofencesCollection.createIndex(
      { area: '2dsphere' },
      { background: true, name: 'geofence_spatial' }
    );

    // Compound indexes for geofence queries
    await geofencesCollection.createIndex(
      { ownerId: 1, isActive: 1 },
      { background: true }
    );

    await geofencesCollection.createIndex(
      { category: 1, isActive: 1 },
      { background: true }
    );

    console.log('Geofencing system setup complete');
  }

  async createGeofence(ownerId, geofenceData) {
    console.log(`Creating geofence for owner ${ownerId}...`);

    const startTime = Date.now();

    try {
      const geofenceDocument = {
        geofenceId: new ObjectId(),
        ownerId: ownerId,
        name: geofenceData.name,
        description: geofenceData.description,
        category: geofenceData.category || 'custom',

        // GeoJSON area (Polygon; GeoJSON has no Circle type, so approximate circles as polygons)
        area: geofenceData.area,

        // Geofence behavior
        triggerOnEntry: geofenceData.triggerOnEntry !== false,
        triggerOnExit: geofenceData.triggerOnExit !== false,
        triggerOnDwell: geofenceData.triggerOnDwell || false,
        dwellTimeSeconds: geofenceData.dwellTimeSeconds || 300, // 5 minutes

        // Notification settings
        notificationSettings: {
          enabled: geofenceData.notifications?.enabled !== false,
          methods: geofenceData.notifications?.methods || ['push'],
          message: geofenceData.notifications?.message
        },

        // Targeting
        targetUsers: geofenceData.targetUsers || [], // Specific user IDs
        targetUserGroups: geofenceData.targetUserGroups || [],

        // Scheduling
        schedule: geofenceData.schedule || {
          enabled: true,
          startTime: null,
          endTime: null,
          daysOfWeek: [1, 2, 3, 4, 5, 6, 7] // All days
        },

        // State management
        isActive: geofenceData.isActive !== false,
        createdAt: new Date(),
        updatedAt: new Date(),

        // Analytics
        entryCount: 0,
        exitCount: 0,
        dwellCount: 0,
        lastTriggered: null,

        // Custom data
        customData: geofenceData.customData || {}
      };

      // Validate GeoJSON area
      if (!this.validateGeoJSONGeometry(geofenceDocument.area)) {
        throw new Error('Invalid geofence area geometry');
      }

      const result = await this.collections.geofences.insertOne(geofenceDocument);

      const processingTime = Date.now() - startTime;

      return {
        success: true,
        geofenceId: geofenceDocument.geofenceId, // The ID that checkGeofenceEvents() filters on
        area: geofenceDocument.area,
        processingTime: processingTime
      };

    } catch (error) {
      console.error('Error creating geofence:', error);
      return {
        success: false,
        error: error.message,
        processingTime: Date.now() - startTime
      };
    }
  }

  async checkGeofenceEvents(userId, longitude, latitude) {
    console.log(`Checking geofence events for user ${userId} at [${longitude}, ${latitude}]...`);

    try {
      const userPoint = {
        type: 'Point',
        coordinates: [longitude, latitude]
      };

      // Find all active geofences that intersect with user location
      const intersectingGeofences = await this.collections.geofences.find({
        isActive: true,

        // Spatial intersection query
        area: {
          $geoIntersects: {
            $geometry: userPoint
          }
        },

        // Check if user is targeted (empty array means all users)
        $or: [
          { targetUsers: { $size: 0 } },
          { targetUsers: userId }
        ]
      }).toArray();

      // Process each intersecting geofence
      const geofenceEvents = [];

      for (const geofence of intersectingGeofences) {
        // Treat this as an entry unless an event for the same user and geofence
        // was recorded recently. The location history cannot answer this question:
        // trackUserLocation() inserts the current point before this check runs,
        // so a "recent location" lookup would always find one.
        const recentEvent = await this.collections.spatialEvents.findOne({
          userId: userId,
          geofenceId: geofence.geofenceId,
          eventTime: { $gte: new Date(Date.now() - (5 * 60 * 1000)) } // Last 5 minutes
        }, { sort: { eventTime: -1 } });

        const eventType = recentEvent ? 'dwelling' : 'entry';

        // Create geofence event
        const geofenceEvent = {
          eventId: new ObjectId(),
          userId: userId,
          geofenceId: geofence.geofenceId,
          geofenceName: geofence.name,
          eventType: eventType,
          coordinates: userPoint,
          eventTime: new Date(),

          // Context information
          geofenceCategory: geofence.category,
          // Seconds since the most recent event for this geofence (0 on entry)
          dwellTimeSeconds: recentEvent ?
            (Date.now() - recentEvent.eventTime.getTime()) / 1000 : 0,

          // Notification triggered
          notificationTriggered: geofence.notificationSettings.enabled &&
            ((eventType === 'entry' && geofence.triggerOnEntry) ||
             (eventType === 'dwelling' && geofence.triggerOnDwell)),

          customData: geofence.customData
        };

        // Store the event
        await this.collections.spatialEvents.insertOne(geofenceEvent);

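        // Update geofence statistics. Counters go in $inc and dates go in $set,
        // because $inc rejects non-numeric values. The counter name is mapped
        // explicitly so that 'dwelling' events increment dwellCount.
        const counterField = eventType === 'entry' ? 'entryCount' : 'dwellCount';
        const triggeredAt = new Date();

        await this.collections.geofences.updateOne(
          { geofenceId: geofence.geofenceId },
          {
            $inc: { [counterField]: 1 },
            $set: { lastTriggered: triggeredAt, updatedAt: triggeredAt }
          }
        );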

        geofenceEvents.push(geofenceEvent);

        // Trigger notifications if configured
        if (geofenceEvent.notificationTriggered) {
          await this.triggerGeofenceNotification(userId, geofenceEvent);
        }
      }

      return {
        success: true,
        eventsTriggered: geofenceEvents.length,
        events: geofenceEvents
      };

    } catch (error) {
      console.error('Error checking geofence events:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  async triggerGeofenceNotification(userId, geofenceEvent) {
    // Placeholder for notification system integration
    console.log(`Geofence notification triggered for user ${userId}:`, {
      geofence: geofenceEvent.geofenceName,
      eventType: geofenceEvent.eventType,
      location: geofenceEvent.coordinates
    });

    // In a real implementation, this would integrate with:
    // - Push notification services
    // - SMS/Email services  
    // - Webhook endpoints
    // - Real-time messaging systems
  }

  validateGeoJSONPoint(coordinates) {
    return coordinates &&
           coordinates.type === 'Point' &&
           Array.isArray(coordinates.coordinates) &&
           coordinates.coordinates.length === 2 &&
           typeof coordinates.coordinates[0] === 'number' &&
           typeof coordinates.coordinates[1] === 'number' &&
           coordinates.coordinates[0] >= -180 && coordinates.coordinates[0] <= 180 &&
           coordinates.coordinates[1] >= -90 && coordinates.coordinates[1] <= 90;
  }

  validateGeoJSONGeometry(geometry) {
    if (!geometry || !geometry.type) return false;

    switch (geometry.type) {
      case 'Point':
        return this.validateGeoJSONPoint(geometry);
      case 'Polygon':
        return geometry.coordinates &&
               Array.isArray(geometry.coordinates) &&
               geometry.coordinates.length > 0 &&
               Array.isArray(geometry.coordinates[0]) &&
               geometry.coordinates[0].length >= 4; // Minimum for polygon
      case 'Circle':
        // 'Circle' is not a GeoJSON type, and a 2dsphere index rejects it.
        // Approximate circular geofences as a Polygon before storing them.
        return false;
      default:
        return false;
    }
  }

  updateQueryMetrics(queryType, duration) {
    this.queryMetrics.totalQueries++;
    // Incremental running mean over all recorded queries
    this.queryMetrics.averageQueryTime +=
      (duration - this.queryMetrics.averageQueryTime) / this.queryMetrics.totalQueries;

    if (queryType.includes('spatial') || queryType.includes('nearby') || queryType.includes('geofence')) {
      this.queryMetrics.spatialQueries++;
    }

    if (this.config.logSlowQueries && duration > 1000) {
      console.log(`Slow query detected: ${queryType} took ${duration}ms`);
    }
  }

  async getPerformanceMetrics() {
    return {
      queryMetrics: this.queryMetrics,
      indexMetrics: await this.analyzeIndexPerformance(),
      collectionStats: await this.getCollectionStatistics()
    };
  }

  async analyzeIndexPerformance() {
    const metrics = {};

    for (const [collectionName, collection] of Object.entries(this.collections)) {
      try {
        const indexStats = await collection.aggregate([{ $indexStats: {} }]).toArray();
        metrics[collectionName] = indexStats;
      } catch (error) {
        console.error(`Error analyzing indexes for ${collectionName}:`, error);
      }
    }

    return metrics;
  }

  async getCollectionStatistics() {
    const stats = {};

    for (const [collectionName, collection] of Object.entries(this.collections)) {
      try {
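        // Collection.stats() is not available in recent Node.js drivers;
        // the $collStats aggregation stage returns the same storage statistics
        const [collStats] = await collection
          .aggregate([{ $collStats: { storageStats: {} } }])
          .toArray();
        stats[collectionName] = collStats;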
      } catch (error) {
        console.error(`Error getting stats for ${collectionName}:`, error);
      }
    }

    return stats;
  }

  async shutdown() {
    console.log('Shutting down geospatial manager...');

    // Log final performance metrics
    if (this.config.enablePerformanceMetrics) {
      const metrics = await this.getPerformanceMetrics();
      console.log('Final Performance Metrics:', metrics.queryMetrics);
    }

    console.log('Geospatial manager shutdown complete');
  }
}

// Benefits of MongoDB Geospatial Operations:
// - Native 2dsphere indexing with optimized spatial queries
// - Comprehensive GeoJSON support for points, polygons, and complex geometries  
// - High-performance proximity searches with built-in distance calculations
// - Advanced geofencing capabilities with real-time event triggering
// - Seamless integration with application data without external GIS systems
// - Sophisticated spatial aggregation and analytics capabilities
// - Spherical (WGS84) geometry through 2dsphere indexes, plus legacy 2d indexes for flat coordinates
// - Optimized query performance with spatial index utilization
// - SQL-compatible geospatial operations through QueryLeaf integration
// - Scalable location-based services with MongoDB's distributed architecture

module.exports = {
  MongoDBGeospatialManager
};
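
The distance and bearing math inside the pipelines above is easier to verify outside the database. The following is a minimal sketch in plain JavaScript of the same haversine distance and initial-bearing formulas, taking `[longitude, latitude]` pairs in GeoJSON order. The function names are illustrative assumptions and are not part of `MongoDBGeospatialManager`.

```javascript
// Plain JavaScript equivalents of the pipeline's distance and bearing math.
// Function names are illustrative; they do not exist in the class above.
const EARTH_RADIUS_METERS = 6371000;
const toRadians = (degrees) => (degrees * Math.PI) / 180;
const toDegrees = (radians) => (radians * 180) / Math.PI;

// Great-circle distance in meters between two [longitude, latitude] pairs.
function haversineMeters([lng1, lat1], [lng2, lat2]) {
  const phi1 = toRadians(lat1);
  const phi2 = toRadians(lat2);
  const deltaPhi = toRadians(lat2 - lat1);
  const deltaLambda = toRadians(lng2 - lng1);

  const a =
    Math.sin(deltaPhi / 2) ** 2 +
    Math.cos(phi1) * Math.cos(phi2) * Math.sin(deltaLambda / 2) ** 2;

  return 2 * EARTH_RADIUS_METERS * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
}

// Initial bearing from the first point to the second, in degrees in [0, 360).
function initialBearingDegrees([lng1, lat1], [lng2, lat2]) {
  const phi1 = toRadians(lat1);
  const phi2 = toRadians(lat2);
  const deltaLambda = toRadians(lng2 - lng1);

  // y carries the cos(lat2) factor; omitting it skews bearings away from the equator.
  const y = Math.sin(deltaLambda) * Math.cos(phi2);
  const x =
    Math.cos(phi1) * Math.sin(phi2) -
    Math.sin(phi1) * Math.cos(phi2) * Math.cos(deltaLambda);

  return (toDegrees(Math.atan2(y, x)) + 360) % 360;
}
```

A quick sanity check is that one degree of latitude comes out at roughly 111 km, and that a point due east of the origin on the equator has a bearing of 90 degrees. Helpers like these are handy in unit tests that compare application-side results with the `distanceMeters` and `bearing` fields returned by the aggregation.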

Understanding MongoDB Geospatial Architecture

Advanced Spatial Indexing and Query Optimization Patterns

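Production deployments typically layer analytics, caching, and predictive geofencing on top of the base manager. The class below outlines that configuration; the setup and deploy helper methods it calls are not shown: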

// Production-ready MongoDB geospatial operations with advanced optimization and analytics
class ProductionGeospatialProcessor extends MongoDBGeospatialManager {
  constructor(db, productionConfig) {
    super(db, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      enableAdvancedAnalytics: true,
      enableSpatialCaching: true,
      enableLocationIntelligence: true,
      enablePredictiveGeofencing: true,
      enableSpatialDataMining: true,
      enableRealtimeLocationStreams: true
    };

    this.setupProductionOptimizations();
    this.initializeAdvancedGeospatial();
    this.setupLocationIntelligence();
  }

  async implementAdvancedSpatialAnalytics() {
    console.log('Implementing advanced spatial analytics capabilities...');

    const analyticsStrategy = {
      // Location intelligence
      locationIntelligence: {
        enableHeatmapGeneration: true,
        enableClusterAnalysis: true,
        enablePatternDetection: true,
        enablePredictiveModeling: true
      },

      // Spatial data mining
      spatialDataMining: {
        enableLocationCorrelation: true,
        enableMovementPatternAnalysis: true,
        enableSpatialAnomalyDetection: true,
        enableLocationRecommendations: true
      },

      // Real-time processing
      realtimeProcessing: {
        enableStreamingGeoprocessing: true,
        enableDynamicGeofencing: true,
        enableLocationEventCorrelation: true,
        enableSpatialAlertSystems: true
      }
    };

    return await this.deployAdvancedSpatialAnalytics(analyticsStrategy);
  }

  async setupSpatialCachingSystem() {
    console.log('Setting up advanced spatial caching system...');

    const cachingConfig = {
      // Spatial query caching
      spatialQueryCache: {
        enableProximityCache: true,
        cacheRadius: 1000, // Cache results within 1km
        cacheExpiration: 300, // 5 minutes
        maxCacheEntries: 10000
      },

      // Geofence optimization
      geofenceOptimization: {
        enableGeofenceIndex: true,
        spatialPartitioning: true,
        dynamicGeofenceLoading: true,
        geofenceHierarchy: true
      },

      // Location intelligence cache
      locationIntelligenceCache: {
        enableHeatmapCache: true,
        enablePatternCache: true,
        enablePredictionCache: true
      }
    };

    return await this.deploySpatialCaching(cachingConfig);
  }

  async implementPredictiveGeofencing() {
    console.log('Implementing predictive geofencing capabilities...');

    const predictiveConfig = {
      // Movement prediction
      movementPrediction: {
        enableTrajectoryPrediction: true,
        predictionAccuracy: 0.85,
        predictionTimeHorizon: 1800, // 30 minutes
        learningModelUpdates: true
      },

      // Dynamic geofence creation
      dynamicGeofencing: {
        enablePredictiveGeofences: true,
        contextAwareGeofences: true,
        temporalGeofences: true,
        adaptiveGeofenceSizes: true
      },

      // Behavioral analysis
      behavioralAnalysis: {
        enableLocationPatterns: true,
        enableRoutePrediction: true,
        enableDestinationPrediction: true,
        enableActivityRecognition: true
      }
    };

    return await this.deployPredictiveGeofencing(predictiveConfig);
  }
}
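
The caching configuration above names a cache radius, an expiration time, and a maximum entry count, but the deployment method behind it is not shown. One way such a proximity cache might work is sketched below, assuming results are keyed by a coarse grid cell roughly `cacheRadius` meters wide. The `ProximityCache` class and its methods are hypothetical and only illustrate the idea.

```javascript
// Hypothetical proximity cache keyed by a coarse grid cell.
// Option names mirror spatialQueryCache above; everything else is an assumption.
class ProximityCache {
  constructor({ cacheRadius = 1000, cacheExpiration = 300, maxCacheEntries = 10000 } = {}) {
    // One degree of latitude is roughly 111,320 meters. Longitude cells get
    // narrower toward the poles, which this simple sketch ignores.
    this.cellSizeDegrees = cacheRadius / 111320;
    this.ttlMs = cacheExpiration * 1000;
    this.maxEntries = maxCacheEntries;
    this.entries = new Map(); // Map insertion order doubles as eviction order
  }

  // Nearby coordinates in the same category collapse to the same key.
  keyFor(longitude, latitude, category = 'any') {
    const cellX = Math.floor(longitude / this.cellSizeDegrees);
    const cellY = Math.floor(latitude / this.cellSizeDegrees);
    return `${category}:${cellX}:${cellY}`;
  }

  get(longitude, latitude, category, now = Date.now()) {
    const key = this.keyFor(longitude, latitude, category);
    const entry = this.entries.get(key);
    if (!entry) return undefined;

    // Drop entries older than the configured expiration.
    if (now - entry.storedAt > this.ttlMs) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(longitude, latitude, category, value, now = Date.now()) {
    const key = this.keyFor(longitude, latitude, category);
    this.entries.delete(key); // Re-inserting moves the key to the newest position

    // Evict the oldest entry once the cache is full.
    if (this.entries.size >= this.maxEntries) {
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
    this.entries.set(key, { value, storedAt: now });
  }
}
```

A caller would check `cache.get(longitude, latitude, category)` before running `findNearbyLocations` and store the result with `cache.set(...)` afterwards. The trade-off is precision: two users in the same cell receive the same result list, so distances in a cached response are only approximate for the second caller.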

SQL-Style Geospatial Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB geospatial operations and location-based services:

-- QueryLeaf geospatial operations with SQL-familiar syntax for MongoDB

-- Create location-enabled table with spatial indexing
CREATE TABLE locations (
  location_id UUID PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  category VARCHAR(100) NOT NULL,

  -- Geospatial coordinates (automatically creates 2dsphere index)
  coordinates POINT NOT NULL,
  coverage_area POLYGON,

  -- Location details
  address TEXT,
  city VARCHAR(100),
  state VARCHAR(50),
  country VARCHAR(50),
  postal_code VARCHAR(20),

  -- Business information
  phone_number VARCHAR(20),
  website VARCHAR(255),
  operating_hours DOCUMENT,
  rating DECIMAL(3,2) DEFAULT 0,
  price_range INTEGER DEFAULT 1,

  -- Analytics and metadata
  popularity_score DECIMAL(6,2) DEFAULT 0,
  verified BOOLEAN DEFAULT false,
  tags TEXT[],

  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
WITH SPATIAL_INDEXING (
  coordinates USING '2dsphere',
  coverage_area USING '2dsphere',

  -- Compound spatial indexes for optimized queries
  COMPOUND INDEX (coordinates, category),
  COMPOUND INDEX (coordinates, rating DESC, price_range ASC)
);

-- User location tracking table
CREATE TABLE user_locations (
  user_location_id UUID PRIMARY KEY,
  user_id VARCHAR(50) NOT NULL,
  coordinates POINT NOT NULL,

  -- Accuracy and technical details
  accuracy_meters DECIMAL(8,2),
  altitude_meters DECIMAL(8,2),
  speed_kmh DECIMAL(6,2),
  heading_degrees DECIMAL(5,2),

  -- Context and metadata
  location_method VARCHAR(50) DEFAULT 'GPS',
  device_type VARCHAR(50),
  session_id VARCHAR(100),

  -- Temporal tracking
  recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

  -- Privacy settings
  location_sharing_level VARCHAR(20) DEFAULT 'private'
)
WITH SPATIAL_INDEXING (
  coordinates USING '2dsphere',
  COMPOUND INDEX (user_id, recorded_at DESC),
  COMPOUND INDEX (coordinates, user_id, recorded_at DESC)
);

-- Insert locations with spatial data
INSERT INTO locations (
  name, category, coordinates, address, city, state, country,
  phone_number, rating, price_range, tags
) VALUES 
  ('Central Park Cafe', 'restaurant', POINT(-73.965355, 40.782865), 
   '123 Central Park West', 'New York', 'NY', 'USA',
   '+1-212-555-0123', 4.5, 2, ARRAY['cafe', 'outdoor_seating', 'wifi']),

  ('Brooklyn Bridge Pizza', 'restaurant', POINT(-73.997638, 40.706877),
   '456 Brooklyn Bridge Blvd', 'New York', 'NY', 'USA', 
   '+1-718-555-0456', 4.2, 1, ARRAY['pizza', 'takeout', 'delivery']),

  ('Times Square Hotel', 'hotel', POINT(-73.985130, 40.758896),
   '789 Times Square', 'New York', 'NY', 'USA',
   '+1-212-555-0789', 4.0, 3, ARRAY['hotel', 'tourist_area', 'business_center']);

-- Advanced proximity search with spatial functions
WITH nearby_search AS (
  SELECT 
    location_id,
    name,
    category,
    coordinates,
    address,
    rating,
    price_range,
    tags,

    -- Distance calculation using spatial functions
    ST_DISTANCE(coordinates, POINT(-73.985130, 40.758896)) as distance_meters,

    -- Bearing (direction) from search point to location
    ST_AZIMUTH(POINT(-73.985130, 40.758896), coordinates) as bearing_radians,
    ST_AZIMUTH(POINT(-73.985130, 40.758896), coordinates) * 180 / PI() as bearing_degrees,

    -- Proximity categorization
    CASE 
      WHEN ST_DISTANCE(coordinates, POINT(-73.985130, 40.758896)) <= 100 THEN 'immediate_vicinity'
      WHEN ST_DISTANCE(coordinates, POINT(-73.985130, 40.758896)) <= 500 THEN 'very_close'
      WHEN ST_DISTANCE(coordinates, POINT(-73.985130, 40.758896)) <= 1000 THEN 'walking_distance'
      WHEN ST_DISTANCE(coordinates, POINT(-73.985130, 40.758896)) <= 5000 THEN 'short_drive'
      ELSE 'distant'
    END as proximity_category

  FROM locations
  WHERE 
    -- Spatial proximity filter (uses spatial index automatically)
    ST_DWITHIN(coordinates, POINT(-73.985130, 40.758896), 2000) -- Within 2km

    -- Additional filters
    AND category = 'restaurant'
    AND rating >= 4.0
    AND price_range <= 2

  ORDER BY distance_meters ASC
  LIMIT 20
),

enhanced_results AS (
  SELECT 
    ns.*,

    -- Enhanced distance information
    ROUND(distance_meters, 2) as distance_meters_rounded,
    ROUND(distance_meters / 1000, 3) as distance_km,

    -- Cardinal direction
    CASE 
      WHEN bearing_degrees >= 337.5 OR bearing_degrees < 22.5 THEN 'North'
      WHEN bearing_degrees >= 22.5 AND bearing_degrees < 67.5 THEN 'Northeast'
      WHEN bearing_degrees >= 67.5 AND bearing_degrees < 112.5 THEN 'East'
      WHEN bearing_degrees >= 112.5 AND bearing_degrees < 157.5 THEN 'Southeast'
      WHEN bearing_degrees >= 157.5 AND bearing_degrees < 202.5 THEN 'South'
      WHEN bearing_degrees >= 202.5 AND bearing_degrees < 247.5 THEN 'Southwest'
      WHEN bearing_degrees >= 247.5 AND bearing_degrees < 292.5 THEN 'West'
      WHEN bearing_degrees >= 292.5 AND bearing_degrees < 337.5 THEN 'Northwest'
    END as direction,

    -- Recommendation scoring
    (
      rating * 2 +  -- Rating component
      CASE proximity_category
        WHEN 'immediate_vicinity' THEN 10
        WHEN 'very_close' THEN 8
        WHEN 'walking_distance' THEN 6
        WHEN 'short_drive' THEN 4
        ELSE 2
      END +
      (3 - price_range) * 1.5  -- Price component (lower price = higher score)
    ) as recommendation_score,

    -- Walking time estimation (average 5 km/h walking speed)
    ROUND(distance_meters / 1000 / 5 * 60, 0) as estimated_walking_minutes

  FROM nearby_search ns
)
SELECT 
  location_id,
  name,
  category,
  address,
  rating,
  price_range,
  tags,

  -- Distance and direction
  distance_meters_rounded as distance_meters,
  distance_km,
  direction,
  proximity_category,

  -- Practical information
  estimated_walking_minutes,
  recommendation_score,

  -- Helpful descriptions
  CONCAT(
    name, ' is ', distance_meters_rounded, 'm ', direction, 
    ' (', estimated_walking_minutes, ' min walk)'
  ) as location_description

FROM enhanced_results
ORDER BY recommendation_score DESC, distance_meters ASC;

-- Geofencing operations with spatial containment
CREATE TABLE geofences (
  geofence_id UUID PRIMARY KEY,
  owner_id VARCHAR(50) NOT NULL,
  name VARCHAR(255) NOT NULL,
  description TEXT,
  category VARCHAR(100) DEFAULT 'custom',

  -- Geofence area (polygon or circle)
  area POLYGON NOT NULL,

  -- Behavior configuration
  trigger_on_entry BOOLEAN DEFAULT true,
  trigger_on_exit BOOLEAN DEFAULT true,
  trigger_on_dwell BOOLEAN DEFAULT false,
  dwell_time_seconds INTEGER DEFAULT 300,

  -- Targeting
  target_users VARCHAR(50)[],
  target_user_groups VARCHAR(50)[],

  -- Status and analytics
  is_active BOOLEAN DEFAULT true,
  entry_count INTEGER DEFAULT 0,
  exit_count INTEGER DEFAULT 0,
  dwell_count INTEGER DEFAULT 0,
  last_triggered TIMESTAMP,

  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
WITH SPATIAL_INDEXING (
  area USING '2dsphere',
  COMPOUND INDEX (owner_id, is_active),
  COMPOUND INDEX (category, is_active)
);

-- Create geofences with various geometric shapes
INSERT INTO geofences (
  owner_id, name, description, category, area, target_users
) VALUES 
  -- Circular geofence around Central Park
  ('business_123', 'Central Park Zone', 'Marketing zone around Central Park', 'marketing',
   ST_BUFFER(POINT(-73.965355, 40.782865), 500), -- 500m radius circle
   ARRAY[]); -- Empty array means all users

-- Polygon geofence for Times Square area
INSERT INTO geofences (
  owner_id, name, description, category, area, trigger_on_entry, trigger_on_exit
) VALUES 
  ('business_456', 'Times Square District', 'High-traffic commercial zone', 'commercial',
   POLYGON((
     (-73.987140, 40.755751),  -- Southwest corner
     (-73.982915, 40.755751),  -- Southeast corner  
     (-73.982915, 40.762077),  -- Northeast corner
     (-73.987140, 40.762077),  -- Northwest corner
     (-73.987140, 40.755751)   -- Close the polygon
   )),
   true, true);

-- Advanced geofence event detection query
WITH user_location_check AS (
  SELECT 
    ul.user_id,
    ul.coordinates,
    ul.recorded_at,

    -- Find intersecting geofences
    g.geofence_id,
    g.name as geofence_name,
    g.category,
    g.trigger_on_entry,
    g.trigger_on_exit,
    g.trigger_on_dwell,
    g.dwell_time_seconds,

    -- Check spatial containment
    ST_CONTAINS(g.area, ul.coordinates) as is_inside_geofence,

    -- Previous location analysis for entry/exit detection
    -- (partitioned per user and geofence so rows from the CROSS JOIN do not interleave)
    LAG(ul.coordinates) OVER (
      PARTITION BY ul.user_id, g.geofence_id 
      ORDER BY ul.recorded_at
    ) as previous_coordinates,

    LAG(ul.recorded_at) OVER (
      PARTITION BY ul.user_id, g.geofence_id 
      ORDER BY ul.recorded_at  
    ) as previous_timestamp

  FROM user_locations ul
  CROSS JOIN geofences g
  WHERE 
    ul.recorded_at >= CURRENT_TIMESTAMP - INTERVAL '5 minutes'
    AND g.is_active = true
    AND (
      ARRAY_LENGTH(g.target_users, 1) IS NULL OR  -- No specific targeting
      ul.user_id = ANY(g.target_users)           -- User is specifically targeted
    )
    AND ST_DWITHIN(ul.coordinates, g.area, 100) -- Pre-filter for performance
),

geofence_events AS (
  SELECT 
    ulc.*,

    -- Event type detection
    CASE 
      WHEN is_inside_geofence AND previous_coordinates IS NULL THEN 'entry'
      WHEN is_inside_geofence AND NOT ST_CONTAINS(
        (SELECT area FROM geofences WHERE geofence_id = ulc.geofence_id), 
        previous_coordinates
      ) THEN 'entry'
      WHEN NOT is_inside_geofence AND ST_CONTAINS(
        (SELECT area FROM geofences WHERE geofence_id = ulc.geofence_id), 
        previous_coordinates  
      ) THEN 'exit'
      WHEN is_inside_geofence AND ST_CONTAINS(
        (SELECT area FROM geofences WHERE geofence_id = ulc.geofence_id), 
        previous_coordinates
      ) THEN 'dwelling'
      ELSE 'none'
    END as event_type,

    -- Dwell time calculation
    CASE 
      WHEN previous_timestamp IS NOT NULL THEN
        EXTRACT(EPOCH FROM (recorded_at - previous_timestamp))
      ELSE 0
    END as dwell_time_seconds_calculated

  FROM user_location_check ulc
  WHERE is_inside_geofence = true OR previous_coordinates IS NOT NULL
),

actionable_events AS (
  SELECT 
    ge.*,

    -- Determine if event should trigger notifications
    CASE 
      WHEN event_type = 'entry' AND trigger_on_entry THEN true
      WHEN event_type = 'exit' AND trigger_on_exit THEN true  
      WHEN event_type = 'dwelling' AND trigger_on_dwell AND 
           dwell_time_seconds_calculated >= dwell_time_seconds THEN true
      ELSE false
    END as should_trigger_notification,

    -- Event metadata
    CURRENT_TIMESTAMP as event_processed_at,
    GENERATE_UUID() as event_id

  FROM geofence_events ge
  WHERE event_type != 'none'
)

SELECT 
  event_id,
  user_id,
  geofence_id,
  geofence_name,
  category,
  event_type,
  coordinates,
  recorded_at,
  should_trigger_notification,
  dwell_time_seconds_calculated,

  -- Event context
  CASE event_type
    WHEN 'entry' THEN CONCAT('User entered ', geofence_name)
    WHEN 'exit' THEN CONCAT('User exited ', geofence_name)
    WHEN 'dwelling' THEN CONCAT('User dwelling in ', geofence_name, ' for ', 
                                ROUND(dwell_time_seconds_calculated), ' seconds')
  END as event_description,

  -- Notification priority
  CASE 
    WHEN category = 'security' THEN 'high'
    WHEN category = 'marketing' AND event_type = 'entry' THEN 'medium'
    WHEN event_type = 'dwelling' THEN 'low'
    ELSE 'normal'
  END as notification_priority

FROM actionable_events
WHERE should_trigger_notification = true
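ORDER BY 
  recorded_at DESC,
  -- Rank priorities explicitly; sorting the text labels would order them alphabetically
  CASE 
    WHEN category = 'security' THEN 1                            -- high
    WHEN category = 'marketing' AND event_type = 'entry' THEN 2  -- medium
    WHEN event_type = 'dwelling' THEN 4                          -- low
    ELSE 3                                                       -- normal
  END ASC;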

-- Location analytics and heatmap generation
WITH location_density_analysis AS (
  SELECT 
    -- Create spatial grid cells (approximately 100m x 100m)
    FLOOR(ST_X(coordinates) * 1000) / 1000 as grid_lng,
    FLOOR(ST_Y(coordinates) * 1000) / 1000 as grid_lat,

    -- Calculate grid center point
    ST_POINT(
      (FLOOR(ST_X(coordinates) * 1000) + 0.5) / 1000,
      (FLOOR(ST_Y(coordinates) * 1000) + 0.5) / 1000
    ) as grid_center,

    COUNT(*) as location_count,
    COUNT(DISTINCT user_id) as unique_users,

    -- Temporal analysis
    DATE_TRUNC('hour', recorded_at) as hour_bucket,

    -- Movement analysis
    AVG(speed_kmh) as avg_speed,
    AVG(accuracy_meters) as avg_accuracy,

    -- Activity classification
    COUNT(*) FILTER (WHERE speed_kmh < 5) as stationary_count,
    COUNT(*) FILTER (WHERE speed_kmh >= 5 AND speed_kmh < 25) as walking_count,
    COUNT(*) FILTER (WHERE speed_kmh >= 25 AND speed_kmh < 60) as driving_count,
    COUNT(*) FILTER (WHERE speed_kmh >= 60) as highway_count

  FROM user_locations
  WHERE recorded_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY grid_lng, grid_lat, grid_center, hour_bucket
),

heatmap_data AS (
  SELECT 
    grid_center,
    grid_lng,
    grid_lat,

    -- Density metrics
    SUM(location_count) as total_locations,
    COUNT(DISTINCT hour_bucket) as active_hours,
    AVG(location_count) as avg_locations_per_hour,
    MAX(location_count) as peak_hour_locations,

    -- User engagement (hourly distinct counts are summed, so a user active
    -- in several hours is counted once per hour)
    SUM(unique_users) as total_unique_users,
    AVG(unique_users) as avg_unique_users,

    -- Activity distribution
    SUM(stationary_count) as total_stationary,
    SUM(walking_count) as total_walking,
    SUM(driving_count) as total_driving,
    SUM(highway_count) as total_highway,

    -- Movement characteristics
    AVG(avg_speed) as overall_avg_speed,
    AVG(avg_accuracy) as overall_avg_accuracy,

    -- Heat intensity calculation (log-damped product of activity and user counts)
    LN(SUM(location_count) + 1) * LN(SUM(unique_users) + 1) as heat_intensity

  FROM location_density_analysis
  GROUP BY grid_center, grid_lng, grid_lat
),

hotspot_analysis AS (
  SELECT 
    hd.*,

    -- Percentile rankings for intensity
    PERCENT_RANK() OVER (ORDER BY heat_intensity) as intensity_percentile,
    PERCENT_RANK() OVER (ORDER BY total_unique_users) as user_percentile,
    PERCENT_RANK() OVER (ORDER BY total_locations) as activity_percentile,

    -- Classification
    CASE 
      WHEN heat_intensity > (SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY heat_intensity) FROM heatmap_data) THEN 'extreme_hotspot'
      WHEN heat_intensity > (SELECT PERCENTILE_CONT(0.85) WITHIN GROUP (ORDER BY heat_intensity) FROM heatmap_data) THEN 'major_hotspot'
      WHEN heat_intensity > (SELECT PERCENTILE_CONT(0.70) WITHIN GROUP (ORDER BY heat_intensity) FROM heatmap_data) THEN 'moderate_hotspot'
      WHEN heat_intensity > (SELECT PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY heat_intensity) FROM heatmap_data) THEN 'minor_activity'
      ELSE 'low_activity'
    END as hotspot_classification,

    -- Activity type classification
    CASE 
      WHEN total_stationary > (total_walking + total_driving + total_highway) * 0.7 THEN 'destination_area'
      WHEN total_walking > total_locations * 0.6 THEN 'pedestrian_area'
      WHEN total_driving > total_locations * 0.6 THEN 'transit_area'
      WHEN total_highway > total_locations * 0.4 THEN 'highway_corridor'
      ELSE 'mixed_use_area'
    END as area_type

  FROM heatmap_data hd
  WHERE total_locations >= 10  -- Filter out low-activity areas
)

SELECT 
  grid_center,
  ST_X(grid_center) as longitude,
  ST_Y(grid_center) as latitude,

  -- Density and activity metrics
  total_locations,
  total_unique_users,
  active_hours,
  avg_locations_per_hour,
  peak_hour_locations,

  -- Classification results
  hotspot_classification,
  area_type,

  -- Intensity and ranking
  ROUND(heat_intensity, 3) as heat_intensity,
  ROUND(intensity_percentile * 100, 1) as intensity_percentile_rank,

  -- Activity breakdown
  ROUND((total_stationary::NUMERIC / total_locations) * 100, 1) as stationary_pct,
  ROUND((total_walking::NUMERIC / total_locations) * 100, 1) as walking_pct,
  ROUND((total_driving::NUMERIC / total_locations) * 100, 1) as driving_pct,

  -- Movement characteristics
  ROUND(overall_avg_speed, 2) as avg_speed_kmh,
  ROUND(overall_avg_accuracy, 1) as avg_accuracy_meters,

  -- Insights and recommendations
  CASE hotspot_classification
    WHEN 'extreme_hotspot' THEN 'High-priority area for business development'
    WHEN 'major_hotspot' THEN 'Significant commercial opportunity'
    WHEN 'moderate_hotspot' THEN 'Growing activity area with potential'
    ELSE 'Monitor for emerging trends'
  END as business_recommendation

FROM hotspot_analysis
ORDER BY heat_intensity DESC, total_unique_users DESC
LIMIT 100;

-- Advanced user movement pattern analysis
WITH user_journeys AS (
  SELECT 
    user_id,
    coordinates,
    recorded_at,
    speed_kmh,

    -- Movement analysis using window functions
    LAG(coordinates) OVER (
      PARTITION BY user_id 
      ORDER BY recorded_at
    ) as prev_coordinates,

    LAG(recorded_at) OVER (
      PARTITION BY user_id 
      ORDER BY recorded_at
    ) as prev_timestamp,

    LEAD(coordinates) OVER (
      PARTITION BY user_id 
      ORDER BY recorded_at
    ) as next_coordinates,

    -- Session detection (gap > 30 minutes = new session)
    SUM(CASE 
      WHEN recorded_at - LAG(recorded_at) OVER (
        PARTITION BY user_id ORDER BY recorded_at
      ) > INTERVAL '30 minutes' THEN 1 
      ELSE 0 
    END) OVER (
      PARTITION BY user_id 
      ORDER BY recorded_at 
      ROWS UNBOUNDED PRECEDING
    ) as session_number

  FROM user_locations
  WHERE recorded_at >= CURRENT_TIMESTAMP - INTERVAL '7 days'
),

journey_segments AS (
  SELECT 
    uj.*,

    -- Distance calculations
    CASE 
      WHEN prev_coordinates IS NOT NULL THEN
        ST_DISTANCE(coordinates, prev_coordinates)
      ELSE 0
    END as distance_from_previous,

    -- Time calculations
    CASE 
      WHEN prev_timestamp IS NOT NULL THEN
        EXTRACT(EPOCH FROM (recorded_at - prev_timestamp))
      ELSE 0
    END as time_since_previous,

    -- Direction calculations
    CASE 
      WHEN prev_coordinates IS NOT NULL THEN
        ST_AZIMUTH(prev_coordinates, coordinates) * 180 / PI()
      ELSE NULL
    END as bearing_from_previous,

    -- Stop detection
    CASE 
      WHEN speed_kmh < 2 AND 
           LAG(speed_kmh) OVER (PARTITION BY user_id ORDER BY recorded_at) < 2 
      THEN true 
      ELSE false 
    END as is_stopped

  FROM user_journeys uj
),

movement_patterns AS (
  SELECT 
    user_id,
    session_number,

    -- Session boundaries
    MIN(recorded_at) as session_start,
    MAX(recorded_at) as session_end,
    EXTRACT(EPOCH FROM (MAX(recorded_at) - MIN(recorded_at))) as session_duration_seconds,

    -- Movement statistics
    COUNT(*) as total_location_points,
    SUM(distance_from_previous) as total_distance_meters,
    AVG(speed_kmh) as avg_speed_kmh,
    MAX(speed_kmh) as max_speed_kmh,

    -- Stop analysis
    COUNT(*) FILTER (WHERE is_stopped) as stop_count,
    AVG(time_since_previous) FILTER (WHERE is_stopped) as avg_stop_duration,

    -- Geographic analysis
    ST_EXTENT(coordinates) as bounding_box,
    ST_CENTROID(ST_COLLECT(coordinates)) as activity_center,

    -- Movement characteristics
    CASE 
      WHEN AVG(speed_kmh) < 5 THEN 'pedestrian'
      WHEN AVG(speed_kmh) < 25 THEN 'urban_transit'  
      WHEN AVG(speed_kmh) < 80 THEN 'highway_driving'
      ELSE 'high_speed_transit'
    END as primary_movement_mode,

    -- Journey classification
    CASE 
      WHEN SUM(distance_from_previous) < 500 THEN 'local_area'
      WHEN SUM(distance_from_previous) < 5000 THEN 'neighborhood'
      WHEN SUM(distance_from_previous) < 50000 THEN 'city_wide'
      ELSE 'long_distance'
    END as journey_scope

  FROM journey_segments
  WHERE distance_from_previous IS NOT NULL
  GROUP BY user_id, session_number
)

SELECT 
  user_id,
  session_number,
  session_start,
  session_end,

  -- Duration and distance
  ROUND(session_duration_seconds / 60, 1) as duration_minutes,
  ROUND(total_distance_meters, 2) as distance_meters,
  ROUND(total_distance_meters / 1000, 3) as distance_km,

  -- Movement characteristics
  primary_movement_mode,
  journey_scope,
  ROUND(avg_speed_kmh, 2) as avg_speed_kmh,
  ROUND(max_speed_kmh, 2) as max_speed_kmh,

  -- Activity analysis
  total_location_points,
  stop_count,
  ROUND(avg_stop_duration / 60, 1) as avg_stop_duration_minutes,

  -- Geographic insights
  ST_X(activity_center) as center_longitude,
  ST_Y(activity_center) as center_latitude,

  -- Journey insights
  CASE 
    WHEN stop_count > total_location_points * 0.3 THEN 'multi_destination_trip'
    WHEN stop_count > 0 THEN 'trip_with_stops'
    ELSE 'direct_trip'
  END as trip_pattern,

  -- Efficiency metrics
  CASE 
    WHEN session_duration_seconds > 0 THEN
      ROUND((total_distance_meters / session_duration_seconds) * 3.6, 2) -- km/h
    ELSE 0
  END as overall_journey_speed,

  -- Movement efficiency (bounding-box diagonal vs actual distance traveled,
  -- a rough proxy for straight-line efficiency)
  CASE 
    WHEN bounding_box IS NOT NULL THEN
      ROUND(
        (ST_DISTANCE(
          ST_POINT(ST_XMIN(bounding_box), ST_YMIN(bounding_box)),
          ST_POINT(ST_XMAX(bounding_box), ST_YMAX(bounding_box))
        ) / NULLIF(total_distance_meters, 0)) * 100, 
        2
      )
    ELSE NULL
  END as route_efficiency_pct

FROM movement_patterns
WHERE session_duration_seconds > 60  -- Filter very short sessions
ORDER BY user_id, session_start DESC;

-- QueryLeaf provides comprehensive geospatial capabilities:
-- 1. SQL-familiar spatial data types and indexing (POINT, POLYGON, etc.)
-- 2. Advanced spatial functions (ST_DISTANCE, ST_CONTAINS, ST_BUFFER, etc.)
-- 3. Optimized proximity searches with automatic spatial index utilization
-- 4. Sophisticated geofencing with entry/exit/dwell event detection
-- 5. Location analytics and heatmap generation with spatial aggregation
-- 6. Movement pattern analysis with trajectory and behavioral insights
-- 7. Real-time spatial event processing and notification triggers
-- 8. Integration with MongoDB's native 2dsphere indexing optimization
-- 9. Complex spatial queries with business logic and filtering
-- 10. Production-ready geospatial operations with familiar SQL syntax
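
The proximity search in the listing above derives a distance, a bearing, a compass direction, and a walking-time estimate for each result. The following sketch shows what those calculations might look like in application code, assuming a spherical Earth and the haversine formula; the function names are illustrative and are not QueryLeaf or MongoDB APIs, and the server-side functions may use a different geodesic model.

```javascript
// Points are [longitude, latitude] pairs in degrees (GeoJSON order).
const EARTH_RADIUS_M = 6371008.8; // mean Earth radius, spherical approximation
const toRad = (deg) => (deg * Math.PI) / 180;

// Great-circle distance in meters (haversine formula)
function haversineMeters([lng1, lat1], [lng2, lat2]) {
  const dLat = toRad(lat2 - lat1);
  const dLng = toRad(lng2 - lng1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// Initial bearing from origin to target, degrees clockwise from north in [0, 360)
function bearingDegrees([lng1, lat1], [lng2, lat2]) {
  const dLng = toRad(lng2 - lng1);
  const y = Math.sin(dLng) * Math.cos(toRad(lat2));
  const x =
    Math.cos(toRad(lat1)) * Math.sin(toRad(lat2)) -
    Math.sin(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.cos(dLng);
  return ((Math.atan2(y, x) * 180) / Math.PI + 360) % 360;
}

// Eight 45-degree sectors centered on the compass points, matching the
// 22.5-degree boundaries used in the SQL CASE expression
const DIRECTIONS = [
  'North', 'Northeast', 'East', 'Southeast',
  'South', 'Southwest', 'West', 'Northwest'
];
function cardinalDirection(bearing) {
  return DIRECTIONS[Math.round(bearing / 45) % 8];
}

// Walking time at an assumed average of 5 km/h, rounded to whole minutes
function walkingMinutes(distanceMeters) {
  return Math.round((distanceMeters / 1000 / 5) * 60);
}

// Example: from the Times Square Hotel to the Central Park Cafe sample rows
const origin = [-73.98513, 40.758896];
const target = [-73.965355, 40.782865];
const distance = haversineMeters(origin, target);
console.log({
  distanceMeters: Math.round(distance),
  direction: cardinalDirection(bearingDegrees(origin, target)),
  walkingMinutes: walkingMinutes(distance)
});
```

Computing these values client-side can be useful for display formatting, while the filtering and sorting by distance are better left to the database so the spatial index is used.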

Best Practices for Geospatial Implementation

Spatial Index Strategy and Performance Optimization

Essential principles for effective MongoDB geospatial deployment:

Index Design: Create compound spatial indexes that combine location data with frequently queried attributes
Query Optimization: Structure queries to leverage spatial indexes effectively and minimize computational overhead
Coordinate System: Standardize on WGS84 (EPSG:4326), the datum that MongoDB's 2dsphere indexes assume, and store GeoJSON coordinates in longitude, latitude order
Data Validation: Implement comprehensive GeoJSON validation to prevent spatial query errors
Scaling Strategy: Design geospatial collections for horizontal scaling with appropriate shard key selection
Caching Strategy: Implement spatial query result caching for frequently accessed location data
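
The index design and data validation principles above can be illustrated with a small sketch. The validator below is hypothetical application code, not a MongoDB API; it assumes points are stored as GeoJSON `Point` objects and checks only type, shape, and coordinate range. The field names `coordinates` and `category` follow the `locations` table used earlier.

```javascript
// Hypothetical pre-insert check for a GeoJSON Point.
// Returns a list of problems; an empty list means the value looks valid.
function validateGeoJsonPoint(value) {
  const errors = [];
  if (!value || value.type !== 'Point') {
    errors.push('type must be "Point"');
    return errors;
  }

  const c = value.coordinates;
  const hasNumericPair =
    Array.isArray(c) &&
    c.length >= 2 &&
    c.slice(0, 2).every((n) => Number.isFinite(n));
  if (!hasNumericPair) {
    errors.push('coordinates must be [longitude, latitude] numbers');
    return errors;
  }

  // GeoJSON stores longitude first, then latitude
  const [lng, lat] = c;
  if (lng < -180 || lng > 180) errors.push('longitude out of range');
  if (lat < -90 || lat > 90) errors.push('latitude out of range');
  return errors;
}

// Compound index specification: the spatial field plus a frequently
// filtered attribute. With the official Node.js driver this object would
// be passed to collection.createIndex(); no connection is made here.
const locationIndexSpec = { coordinates: '2dsphere', category: 1 };

console.log(
  validateGeoJsonPoint({ type: 'Point', coordinates: [-73.965355, 40.782865] })
);
console.log(locationIndexSpec);
```

A range check catches out-of-range values and some swapped latitude/longitude pairs, but not all of them: a swapped pair whose values both fall within valid ranges will still pass, so it is worth validating at the point where coordinates enter the system.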

Production Deployment and Location Intelligence

Optimize geospatial operations for enterprise-scale location-based services:

Real-Time Processing: Leverage change streams and geofencing for responsive location-aware applications
Analytics Integration: Combine spatial data with business intelligence for location-driven insights
Privacy Compliance: Implement location data privacy controls and user consent management
Performance Monitoring: Track spatial query performance and optimize based on usage patterns
Fault Tolerance: Design location services with redundancy and failover capabilities
Mobile Optimization: Optimize for mobile device constraints including battery usage and network efficiency
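
To make the real-time geofencing principle concrete, here is a minimal sketch of entry, exit, and dwell classification for one user and one geofence. It assumes a simple polygon given as a closed ring of `[longitude, latitude]` pairs and uses a planar ray-casting test, which is a reasonable approximation only for small fences away from the poles and the antimeridian. The function names are illustrative; in MongoDB itself the containment check would normally be delegated to a `$geoIntersects` query against a 2dsphere-indexed field.

```javascript
// Ray-casting point-in-polygon test for a single closed ring.
function pointInRing([x, y], ring) {
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    // Does a horizontal ray from the point cross the edge (j -> i)?
    const crosses =
      (yi > y) !== (yj > y) &&
      x < ((xj - xi) * (y - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// Classify the transition between two consecutive fixes for one geofence.
// `previous` may be null when no earlier fix is known.
function classifyGeofenceEvent(previous, current, ring) {
  const isInside = pointInRing(current, ring);
  const wasInside = previous ? pointInRing(previous, ring) : false;

  if (isInside && !wasInside) return 'entry';
  if (!isInside && wasInside) return 'exit';
  if (isInside && wasInside) return 'dwelling';
  return 'none';
}

// Example using the Times Square District polygon from the earlier listing
const timesSquare = [
  [-73.98714, 40.755751],
  [-73.982915, 40.755751],
  [-73.982915, 40.762077],
  [-73.98714, 40.762077],
  [-73.98714, 40.755751]
];
const outside = [-73.965355, 40.782865]; // Central Park Cafe
const insideFence = [-73.98513, 40.758896]; // Times Square Hotel

console.log(classifyGeofenceEvent(outside, insideFence, timesSquare));
console.log(classifyGeofenceEvent(insideFence, outside, timesSquare));
```

Keeping the classifier a pure function of two fixes and a ring makes it straightforward to call from a change stream handler, with the previous fix looked up per user and per geofence.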

Conclusion

MongoDB's geospatial capabilities provide native location-based services that can reduce the need for external GIS systems in many applications, through 2dsphere spatial indexing, geometric query operators, and direct integration with application data models. The combination of indexed spatial queries with real-time geofencing and location analytics makes MongoDB well suited to modern location-aware applications.

Key MongoDB Geospatial benefits include:

Native Spatial Indexing: Advanced 2dsphere indexes with optimized geometric operations and coordinate system support
Comprehensive GeoJSON Support: Full support for points, polygons, lines, and complex geometries with native validation
High-Performance Proximity: Optimized distance calculations and bearing analysis for location-based queries
Real-Time Geofencing: Advanced geofence event detection with entry, exit, and dwell time triggers
Location Analytics: Sophisticated spatial aggregation for heatmaps, movement patterns, and location intelligence
SQL Accessibility: Familiar SQL-style spatial operations through QueryLeaf for accessible geospatial development

Whether you're building ride-sharing platforms, delivery services, social media applications, or location-based marketing systems, MongoDB geospatial capabilities with QueryLeaf's familiar SQL interface provide the foundation for scalable, high-performance location services.

QueryLeaf Integration: QueryLeaf seamlessly manages MongoDB geospatial operations while providing SQL-familiar spatial data types, indexing strategies, and location-based query capabilities. Advanced geospatial patterns including proximity searches, geofencing, movement analysis, and location analytics are elegantly handled through familiar SQL constructs, making sophisticated location-based services both powerful and accessible to SQL-oriented development teams.

The combination of MongoDB's robust geospatial capabilities with SQL-style location operations makes it an ideal platform for applications requiring both advanced spatial functionality and familiar database interaction patterns, ensuring your location services can scale efficiently while delivering precise, real-time geographic experiences.

November 3, 2025
23 min read

MongoDB Data Pipeline Optimization and Stream Processing: Advanced Real-Time Analytics for High-Performance Data Workflows

Modern applications need data processing that can handle high-velocity streams, complex analytical workloads, and real-time insights while holding performance steady under varying load. Traditional data pipelines often struggle here: transformation logic grows complex, aggregations become bottlenecks, and batch and stream processing run on separate systems. The result is higher latency, wasted resources, and difficulty keeping data consistent across processing workflows.

MongoDB provides comprehensive data pipeline capabilities through the Aggregation Framework, Change Streams, and advanced stream processing features that enable real-time analytics, complex data transformations, and high-performance data processing within a single unified platform. Unlike traditional approaches that require multiple specialized systems and complex integration logic, MongoDB's integrated data pipeline capabilities deliver superior performance through native optimization, intelligent query planning, and seamless integration with storage and indexing systems.

The Traditional Data Pipeline Challenge

Conventional data processing architectures face significant limitations when handling complex analytical workloads:

-- Traditional PostgreSQL data pipeline - complex ETL processes with performance limitations

-- Basic data transformation pipeline with limited optimization capabilities
CREATE TABLE raw_events (
    event_id BIGSERIAL,
    event_timestamp TIMESTAMP NOT NULL,
    user_id BIGINT NOT NULL,
    session_id VARCHAR(100),
    event_type VARCHAR(100) NOT NULL,
    event_category VARCHAR(100),

    -- Basic event data (limited nested structure support)
    event_data JSONB,
    device_info JSONB,
    location_data JSONB,

    -- Processing metadata
    ingested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    processed_at TIMESTAMP,
    processing_status VARCHAR(50) DEFAULT 'pending',

    -- On a partitioned table the primary key must include the partition key,
    -- and a generated column cannot serve as the partition key
    PRIMARY KEY (event_id, event_timestamp)

) PARTITION BY RANGE (event_timestamp);

-- Create monthly partitions (manual maintenance required)
CREATE TABLE raw_events_2024_11 PARTITION OF raw_events
    FOR VALUES FROM ('2024-11-01') TO ('2024-12-01');
CREATE TABLE raw_events_2024_12 PARTITION OF raw_events  
    FOR VALUES FROM ('2024-12-01') TO ('2025-01-01');

-- Complex data transformation pipeline with performance issues
CREATE OR REPLACE FUNCTION process_event_batch(
    batch_size INTEGER DEFAULT 1000
) RETURNS TABLE (
    processed_events INTEGER,
    failed_events INTEGER,
    processing_time_ms INTEGER,
    transformation_errors TEXT[]
) AS $$
DECLARE
    batch_start_time TIMESTAMP;
    processing_errors TEXT[] := '{}';
    events_processed INTEGER := 0;
    events_failed INTEGER := 0;
    event_record RECORD;
BEGIN
    batch_start_time := clock_timestamp();

    -- Process events in batches (inefficient row-by-row processing)
    FOR event_record IN 
        SELECT * FROM raw_events 
        WHERE processing_status = 'pending'
        ORDER BY ingested_at
        LIMIT batch_size
    LOOP
        BEGIN
            -- Complex transformation logic (limited JSON processing capabilities)
            WITH transformed_event AS (
                SELECT 
                    event_record.event_id,
                    event_record.event_timestamp,
                    event_record.user_id,
                    event_record.session_id,
                    event_record.event_type,
                    event_record.event_category,

                    -- Basic data extraction and transformation
                    COALESCE(event_record.event_data->>'revenue', '0')::DECIMAL(10,2) as revenue,
                    COALESCE(event_record.event_data->>'quantity', '1')::INTEGER as quantity,
                    event_record.event_data->>'product_id' as product_id,
                    event_record.event_data->>'product_name' as product_name,

                    -- Device information extraction (limited nested processing)
                    event_record.device_info->>'device_type' as device_type,
                    event_record.device_info->>'browser' as browser,
                    event_record.device_info->>'os' as operating_system,

                    -- Location processing (basic only)
                    event_record.location_data->>'country' as country,
                    event_record.location_data->>'region' as region,
                    event_record.location_data->>'city' as city,

                    -- Time-based calculations
                    EXTRACT(HOUR FROM event_record.event_timestamp) as event_hour,
                    EXTRACT(DOW FROM event_record.event_timestamp) as day_of_week,
                    TO_CHAR(event_record.event_timestamp, 'YYYY-MM') as year_month,

                    -- User segmentation (basic logic only)
                    CASE 
                        WHEN user_segments.segment_type IS NOT NULL THEN user_segments.segment_type
                        ELSE 'unknown'
                    END as user_segment,

                    -- Processing metadata
                    CURRENT_TIMESTAMP as processed_at

                FROM raw_events re
                LEFT JOIN user_segments ON re.user_id = user_segments.user_id
                WHERE re.event_id = event_record.event_id
            )

            -- Insert into processed events table (separate table required)
            INSERT INTO processed_events (
                event_id, event_timestamp, user_id, session_id, event_type, event_category,
                revenue, quantity, product_id, product_name,
                device_type, browser, operating_system,
                country, region, city,
                event_hour, day_of_week, year_month, user_segment,
                processed_at
            )
            SELECT * FROM transformed_event;

            -- Update processing status
            UPDATE raw_events 
            SET 
                processed_at = CURRENT_TIMESTAMP,
                processing_status = 'completed'
            WHERE event_id = event_record.event_id;

            events_processed := events_processed + 1;

        EXCEPTION WHEN OTHERS THEN
            events_failed := events_failed + 1;
            processing_errors := array_append(processing_errors, 
                'Event ID ' || event_record.event_id || ': ' || SQLERRM);

            -- Mark event as failed
            UPDATE raw_events 
            SET 
                processed_at = CURRENT_TIMESTAMP,
                processing_status = 'failed'
            WHERE event_id = event_record.event_id;
        END;
    END LOOP;

    -- Return processing results
    RETURN QUERY SELECT 
        events_processed,
        events_failed,
        -- EPOCH gives the total elapsed seconds; MILLISECONDS would return only the seconds field
        (EXTRACT(EPOCH FROM clock_timestamp() - batch_start_time) * 1000)::INTEGER,
        processing_errors;

END;
$$ LANGUAGE plpgsql;

-- Execute batch processing (requires manual scheduling)
SELECT * FROM process_event_batch(1000);

-- Complex analytical query with performance limitations
WITH hourly_metrics AS (
    -- Time-based aggregation with limited optimization
    SELECT 
        DATE_TRUNC('hour', event_timestamp) as hour_bucket,
        event_type,
        event_category,
        user_segment,
        device_type,
        country,

        -- Basic aggregations (limited analytical functions)
        COUNT(*) as event_count,
        COUNT(DISTINCT user_id) as unique_users,
        COUNT(DISTINCT session_id) as unique_sessions,
        SUM(revenue) as total_revenue,
        AVG(revenue) FILTER (WHERE revenue > 0) as avg_revenue_per_transaction,

        -- Limited statistical functions
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) as median_revenue,
        STDDEV_POP(revenue) as revenue_stddev,

        -- Time-based calculations
        MIN(event_timestamp) as first_event_time,
        MAX(event_timestamp) as last_event_time

    FROM processed_events
    WHERE event_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    GROUP BY 
        DATE_TRUNC('hour', event_timestamp),
        event_type, event_category, user_segment, device_type, country
),

user_behavior_analysis AS (
    -- User journey analysis (complex and slow)
    SELECT 
        user_id,
        session_id,

        -- Session-level aggregations
        COUNT(*) as events_per_session,
        SUM(revenue) as session_revenue,
        -- EPOCH returns the full duration in seconds; SECONDS would return only the seconds field
        EXTRACT(EPOCH FROM (MAX(event_timestamp) - MIN(event_timestamp))) as session_duration_seconds,

        -- Event sequence analysis (limited capabilities)
        string_agg(event_type, ' -> ' ORDER BY event_timestamp) as event_sequence,
        array_agg(event_timestamp ORDER BY event_timestamp) as event_timestamps,

        -- Conversion analysis
        CASE 
            WHEN 'purchase' = ANY(array_agg(event_type)) THEN 'converted'
            WHEN 'add_to_cart' = ANY(array_agg(event_type)) THEN 'engaged'
            ELSE 'browsing'
        END as conversion_status,

        -- Time-based metrics: earliest and latest event per user across all sessions.
        -- Window functions run after GROUP BY, so they must wrap the grouped aggregates.
        MIN(MIN(event_timestamp)) OVER (PARTITION BY user_id) as first_session_event,

        MAX(MAX(event_timestamp)) OVER (PARTITION BY user_id) as last_session_event

    FROM processed_events
    WHERE event_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
    GROUP BY user_id, session_id
),

funnel_analysis AS (
    -- Conversion funnel analysis (very limited and slow)
    SELECT 
        event_category,
        user_segment,

        -- Funnel step counts
        COUNT(*) FILTER (WHERE event_type = 'view') as step_1_views,
        COUNT(*) FILTER (WHERE event_type = 'click') as step_2_clicks,
        COUNT(*) FILTER (WHERE event_type = 'add_to_cart') as step_3_cart_adds,
        COUNT(*) FILTER (WHERE event_type = 'purchase') as step_4_purchases,

        -- Conversion rates (basic calculations)
        CASE 
            WHEN COUNT(*) FILTER (WHERE event_type = 'view') > 0 THEN
                (COUNT(*) FILTER (WHERE event_type = 'click') * 100.0) / 
                COUNT(*) FILTER (WHERE event_type = 'view')
            ELSE 0
        END as click_through_rate,

        CASE 
            WHEN COUNT(*) FILTER (WHERE event_type = 'add_to_cart') > 0 THEN
                (COUNT(*) FILTER (WHERE event_type = 'purchase') * 100.0) / 
                COUNT(*) FILTER (WHERE event_type = 'add_to_cart')
            ELSE 0
        END as cart_to_purchase_rate

    FROM processed_events
    WHERE event_timestamp >= CURRENT_TIMESTAMP - INTERVAL '30 days'
    GROUP BY event_category, user_segment
)

-- Final analytical output (limited insights)
SELECT 
    hm.hour_bucket,
    hm.event_type,
    hm.event_category,
    hm.user_segment,
    hm.device_type,
    hm.country,

    -- Volume metrics
    hm.event_count,
    hm.unique_users,
    hm.unique_sessions,
    ROUND(hm.event_count::DECIMAL / NULLIF(hm.unique_users, 0), 2) as events_per_user,

    -- Revenue metrics  
    ROUND(hm.total_revenue, 2) as total_revenue,
    ROUND(hm.avg_revenue_per_transaction, 2) as avg_revenue_per_transaction,
    ROUND(hm.median_revenue, 2) as median_revenue,

    -- User behavior insights (very limited)
    (SELECT AVG(events_per_session) 
     FROM user_behavior_analysis uba 
     WHERE uba.session_revenue > 0) as avg_events_per_converting_session,

    -- Conversion insights
    fa.click_through_rate,
    fa.cart_to_purchase_rate,

    -- Performance indicators
    EXTRACT(MINUTES FROM (hm.last_event_time - hm.first_event_time)) as processing_window_minutes,

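    -- Trend indicator: event count of the previous bucket for the same dimension combination
    LAG(hm.event_count, 1) OVER (
        PARTITION BY hm.event_type, hm.event_category, hm.user_segment, hm.device_type, hm.country
        ORDER BY hm.hour_bucket
    ) as prev_hour_event_count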

FROM hourly_metrics hm
LEFT JOIN funnel_analysis fa ON (
    hm.event_category = fa.event_category AND 
    hm.user_segment = fa.user_segment
)
WHERE hm.event_count > 10  -- Filter low-volume segments
ORDER BY hm.hour_bucket DESC, hm.total_revenue DESC
LIMIT 1000;

-- Problems with traditional data pipeline approaches:
-- 1. Complex ETL processes requiring separate batch processing jobs
-- 2. Limited support for nested and complex data structures  
-- 3. Poor performance with large-scale analytical workloads
-- 4. Manual partitioning and maintenance overhead
-- 5. No real-time stream processing capabilities
-- 6. Limited statistical and analytical functions
-- 7. Complex joins and data movement between processing stages
-- 8. No native support for time-series and event stream processing
-- 9. Difficulty in maintaining data consistency across pipeline stages
-- 10. Limited optimization for analytical query patterns
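
The funnel CTE in the query above boils down to counting events per step and dividing adjacent steps, with a guard for empty denominators. Here is a minimal JavaScript sketch of that arithmetic, assuming events are plain in-memory objects with an `eventType` field. The function name `funnelRates` and the sample data are illustrative and do not come from the original query.

```javascript
// Hypothetical helper: mirrors the funnel_analysis CTE for an in-memory array.
function funnelRates(events) {
  const count = (type) => events.filter((e) => e.eventType === type).length;

  const views = count('view');
  const clicks = count('click');
  const cartAdds = count('add_to_cart');
  const purchases = count('purchase');

  // Same guard as the SQL CASE: a zero denominator yields 0 instead of an error.
  const pct = (numerator, denominator) =>
    denominator > 0 ? (numerator * 100) / denominator : 0;

  return {
    views,
    clicks,
    cartAdds,
    purchases,
    clickThroughRate: pct(clicks, views),
    cartToPurchaseRate: pct(purchases, cartAdds)
  };
}

// Illustrative sample: 4 views, 2 clicks, 2 cart adds, 1 purchase.
const sampleEvents = [
  { eventType: 'view' },
  { eventType: 'view' },
  { eventType: 'view' },
  { eventType: 'view' },
  { eventType: 'click' },
  { eventType: 'click' },
  { eventType: 'add_to_cart' },
  { eventType: 'add_to_cart' },
  { eventType: 'purchase' }
];

const rates = funnelRates(sampleEvents);
console.log(rates);
```

Like the SQL version, this counts events rather than distinct users, so a user who views a product twice contributes two views.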

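MongoDB can run much of this pipeline inside the database: change streams deliver new and updated events as they arrive, and aggregation pipelines handle the transformation and analytics. The class below sketches one way to combine them: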

// MongoDB Advanced Data Pipeline and Stream Processing - real-time analytics with optimized performance
const { MongoClient } = require('mongodb');
const { EventEmitter } = require('events');

// Comprehensive MongoDB Data Pipeline Manager
class AdvancedDataPipelineManager extends EventEmitter {
  constructor(connectionString, pipelineConfig = {}) {
    super();
    this.connectionString = connectionString;
    this.client = null;
    this.db = null;

    // Advanced pipeline configuration
    this.config = {
      // Pipeline processing configuration
      enableStreamProcessing: pipelineConfig.enableStreamProcessing !== false,
      enableRealTimeAnalytics: pipelineConfig.enableRealTimeAnalytics !== false,
      enableBatchProcessing: pipelineConfig.enableBatchProcessing !== false,

      // Performance optimization settings
      aggregationOptimization: pipelineConfig.aggregationOptimization !== false,
      indexOptimization: pipelineConfig.indexOptimization !== false,
      memoryOptimization: pipelineConfig.memoryOptimization !== false,
      parallelProcessing: pipelineConfig.parallelProcessing !== false,

      // Stream processing configuration
      changeStreamOptions: {
        fullDocument: 'updateLookup',
        maxAwaitTimeMS: 1000,
        batchSize: 1000,
        ...pipelineConfig.changeStreamOptions
      },

      // Batch processing configuration
      batchSize: pipelineConfig.batchSize || 5000,
      maxBatchProcessingTime: pipelineConfig.maxBatchProcessingTime || 300000, // 5 minutes

      // Analytics configuration
      analyticsWindowSize: pipelineConfig.analyticsWindowSize || 3600000, // 1 hour
      retentionPeriod: pipelineConfig.retentionPeriod || 90 * 24 * 60 * 60 * 1000, // 90 days

      // Performance monitoring
      enablePerformanceMetrics: pipelineConfig.enablePerformanceMetrics !== false,
      enablePipelineOptimization: pipelineConfig.enablePipelineOptimization !== false
    };

    // Pipeline state management
    this.activePipelines = new Map();
    this.streamProcessors = new Map();
    this.batchProcessors = new Map();
    this.performanceMetrics = {
      pipelinesExecuted: 0,
      totalProcessingTime: 0,
      documentsProcessed: 0,
      averageThroughput: 0
    };

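    // Start async initialization. A constructor cannot await, so expose the promise;
    // callers should `await manager.ready` (and handle rejection) before using the manager.
    this.ready = this.initializeDataPipeline();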
  }

  async initializeDataPipeline() {
    console.log('Initializing advanced data pipeline system...');

    try {
      // Connect to MongoDB
      this.client = new MongoClient(this.connectionString);
      await this.client.connect();
      this.db = this.client.db();

      // Setup collections and indexes
      await this.setupPipelineInfrastructure();

      // Initialize stream processing
      if (this.config.enableStreamProcessing) {
        await this.initializeStreamProcessing();
      }

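      // Initialize batch processing
      // (initializeBatchProcessing() is not defined in this listing; implement it
      // or set enableBatchProcessing: false)
      if (this.config.enableBatchProcessing) {
        await this.initializeBatchProcessing();
      }

      // Setup real-time analytics
      // (setupRealTimeAnalytics() is likewise not defined in this listing)
      if (this.config.enableRealTimeAnalytics) {
        await this.setupRealTimeAnalytics();
      }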

      console.log('Advanced data pipeline system initialized successfully');

    } catch (error) {
      console.error('Error initializing data pipeline:', error);
      throw error;
    }
  }

  async setupPipelineInfrastructure() {
    console.log('Setting up data pipeline infrastructure...');

    try {
      // Create collections with optimized configuration
      const collections = {
        rawEvents: this.db.collection('raw_events'),
        processedEvents: this.db.collection('processed_events'),
        analyticsResults: this.db.collection('analytics_results'),
        pipelineMetrics: this.db.collection('pipeline_metrics'),
        userSessions: this.db.collection('user_sessions'),
        conversionFunnels: this.db.collection('conversion_funnels')
      };

      // Create optimized indexes for high-performance data processing
      await collections.rawEvents.createIndexes([
        { key: { eventTimestamp: -1 }, background: true },
        { key: { userId: 1, sessionId: 1, eventTimestamp: -1 }, background: true },
        { key: { eventType: 1, eventCategory: 1, eventTimestamp: -1 }, background: true },
        { key: { 'processingStatus': 1, 'ingestedAt': 1 }, background: true },
        { key: { 'locationData.country': 1, 'deviceInfo.deviceType': 1 }, background: true, sparse: true }
      ]);

      await collections.processedEvents.createIndexes([
        { key: { eventTimestamp: -1 }, background: true },
        { key: { userId: 1, sessionId: 1, eventTimestamp: -1 }, background: true },
        { key: { eventType: 1, userSegment: 1, eventTimestamp: -1 }, background: true },
        { key: { 'metrics.revenue': -1, eventTimestamp: -1 }, background: true, sparse: true }
      ]);

      await collections.analyticsResults.createIndexes([
        { key: { analysisType: 1, timeWindow: -1 }, background: true },
        { key: { 'dimensions.eventType': 1, 'dimensions.userSegment': 1, timeWindow: -1 }, background: true },
        { key: { createdAt: -1 }, background: true }
      ]);

      this.collections = collections;

      console.log('Pipeline infrastructure setup completed');

    } catch (error) {
      console.error('Error setting up pipeline infrastructure:', error);
      throw error;
    }
  }

  async createAdvancedAnalyticsPipeline(pipelineConfig) {
    console.log('Creating advanced analytics pipeline...');

    const pipelineId = this.generatePipelineId();
    const startTime = Date.now();

    try {
      // Build comprehensive aggregation pipeline
      const analyticsStages = [
        // Stage 1: Data filtering and initial processing
        {
          $match: {
            eventTimestamp: {
              $gte: new Date(Date.now() - this.config.analyticsWindowSize),
              $lte: new Date()
            },
            processingStatus: 'completed',
            ...pipelineConfig.matchCriteria
          }
        },

        // Stage 2: Advanced data transformation and enrichment
        {
          $addFields: {
            // Time-based dimensions
            hourBucket: {
              $dateFromParts: {
                year: { $year: '$eventTimestamp' },
                month: { $month: '$eventTimestamp' },
                day: { $dayOfMonth: '$eventTimestamp' },
                hour: { $hour: '$eventTimestamp' }
              }
            },
            dayOfWeek: { $dayOfWeek: '$eventTimestamp' },
            yearMonth: {
              $dateToString: {
                format: '%Y-%m',
                date: '$eventTimestamp'
              }
            },

            // User segmentation and classification
            userSegment: {
              $switch: {
                branches: [
                  {
                    case: { $gte: ['$userMetrics.totalRevenue', 1000] },
                    then: 'high_value'
                  },
                  {
                    case: { $gte: ['$userMetrics.totalRevenue', 100] },
                    then: 'medium_value'
                  },
                  {
                    case: { $gt: ['$userMetrics.totalRevenue', 0] },
                    then: 'low_value'
                  }
                ],
                default: 'non_revenue'
              }
            },

            // Device and technology classification
            deviceCategory: {
              $switch: {
                branches: [
                  {
                    case: { $in: ['$deviceInfo.deviceType', ['smartphone', 'tablet']] },
                    then: 'mobile'
                  },
                  {
                    case: { $eq: ['$deviceInfo.deviceType', 'desktop'] },
                    then: 'desktop'
                  }
                ],
                default: 'other'
              }
            },

            // Geographic clustering
            geoRegion: {
              $switch: {
                branches: [
                  {
                    case: { $in: ['$locationData.country', ['US', 'CA', 'MX']] },
                    then: 'North America'
                  },
                  {
                    case: { $in: ['$locationData.country', ['GB', 'DE', 'FR', 'IT', 'ES']] },
                    then: 'Europe'
                  },
                  {
                    case: { $in: ['$locationData.country', ['JP', 'KR', 'CN', 'IN']] },
                    then: 'Asia'
                  }
                ],
                default: 'Other'
              }
            },

            // Revenue and value metrics
            revenueMetrics: {
              revenue: { $toDouble: '$eventData.revenue' },
              quantity: { $toInt: '$eventData.quantity' },
              averageOrderValue: {
                $cond: [
                  { $gt: [{ $toInt: '$eventData.quantity' }, 0] },
                  { $divide: [{ $toDouble: '$eventData.revenue' }, { $toInt: '$eventData.quantity' }] },
                  0
                ]
              }
            }
          }
        },

        // Stage 3: Multi-dimensional aggregation and analytics
        {
          $group: {
            _id: {
              hourBucket: '$hourBucket',
              eventType: '$eventType',
              eventCategory: '$eventCategory',
              userSegment: '$userSegment',
              deviceCategory: '$deviceCategory',
              geoRegion: '$geoRegion'
            },

            // Volume metrics
            eventCount: { $sum: 1 },
            uniqueUsers: { $addToSet: '$userId' },
            uniqueSessions: { $addToSet: '$sessionId' },

            // Revenue analytics
            totalRevenue: { $sum: '$revenueMetrics.revenue' },
            totalQuantity: { $sum: '$revenueMetrics.quantity' },
            revenueTransactions: {
              $sum: {
                $cond: [{ $gt: ['$revenueMetrics.revenue', 0] }, 1, 0]
              }
            },

            // Statistical aggregations
            revenueValues: { $push: '$revenueMetrics.revenue' },
            quantityValues: { $push: '$revenueMetrics.quantity' },
            avgOrderValues: { $push: '$revenueMetrics.averageOrderValue' },

            // Time-based analytics
            firstEventTime: { $min: '$eventTimestamp' },
            lastEventTime: { $max: '$eventTimestamp' },
            eventTimestamps: { $push: '$eventTimestamp' },

            // User behavior patterns
            userSessions: {
              $push: {
                userId: '$userId',
                sessionId: '$sessionId',
                eventTimestamp: '$eventTimestamp',
                revenue: '$revenueMetrics.revenue'
              }
            }
          }
        },

        // Stage 4: Advanced statistical calculations
        {
          $addFields: {
            // User metrics
            uniqueUserCount: { $size: '$uniqueUsers' },
            uniqueSessionCount: { $size: '$uniqueSessions' },
            eventsPerUser: {
              $divide: ['$eventCount', { $size: '$uniqueUsers' }]
            },
            eventsPerSession: {
              $divide: ['$eventCount', { $size: '$uniqueSessions' }]
            },

            // Revenue analytics
            averageRevenue: {
              $cond: [
                { $gt: ['$revenueTransactions', 0] },
                { $divide: ['$totalRevenue', '$revenueTransactions'] },
                0
              ]
            },
            revenuePerUser: {
              $divide: ['$totalRevenue', { $size: '$uniqueUsers' }]
            },
            conversionRate: {
              $divide: ['$revenueTransactions', '$eventCount']
            },

            // Statistical measures
            revenueStats: {
              $let: {
                vars: {
                  sortedRevenues: {
                    $sortArray: {
                      input: '$revenueValues',
                      sortBy: 1
                    }
                  }
                },
                in: {
                  median: {
                    $arrayElemAt: [
                      '$$sortedRevenues',
                      { $floor: { $multiply: [{ $size: '$$sortedRevenues' }, 0.5] } }
                    ]
                  },
                  percentile75: {
                    $arrayElemAt: [
                      '$$sortedRevenues',
                      { $floor: { $multiply: [{ $size: '$$sortedRevenues' }, 0.75] } }
                    ]
                  },
                  percentile95: {
                    $arrayElemAt: [
                      '$$sortedRevenues',
                      { $floor: { $multiply: [{ $size: '$$sortedRevenues' }, 0.95] } }
                    ]
                  }
                }
              }
            },

            // Temporal analysis
            processingWindowMinutes: {
              $divide: [
                { $subtract: ['$lastEventTime', '$firstEventTime'] },
                60000 // Convert to minutes
              ]
            },

            // Session analysis
            sessionMetrics: {
              $reduce: {
                input: '$userSessions',
                initialValue: {
                  totalSessions: 0,
                  convertingSessions: 0,
                  totalSessionRevenue: 0
                },
                in: {
                  totalSessions: { $add: ['$$value.totalSessions', 1] },
                  convertingSessions: {
                    $cond: [
                      { $gt: ['$$this.revenue', 0] },
                      { $add: ['$$value.convertingSessions', 1] },
                      '$$value.convertingSessions'
                    ]
                  },
                  totalSessionRevenue: {
                    // $add returns null if any operand is null, so default missing revenue to 0
                    $add: ['$$value.totalSessionRevenue', { $ifNull: ['$$this.revenue', 0] }]
                  }
                }
              }
            }
          }
        },

        // Stage 5: Performance optimization and data enrichment
        {
          $addFields: {
            // Performance indicators
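            performanceMetrics: {
              // A group whose events share one timestamp has a zero-length window;
              // guard the division so $divide does not fail on a zero divisor
              throughputEventsPerMinute: {
                $cond: [
                  { $gt: ['$processingWindowMinutes', 0] },
                  { $divide: ['$eventCount', '$processingWindowMinutes'] },
                  null
                ]
              },
              revenueVelocity: {
                $cond: [
                  { $gt: ['$processingWindowMinutes', 0] },
                  { $divide: ['$totalRevenue', '$processingWindowMinutes'] },
                  null
                ]
              },
              userEngagementRate: {
                $divide: [{ $size: '$uniqueUsers' }, '$eventCount']
              }
            },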

            // Business metrics
            businessMetrics: {
              customerLifetimeValue: {
                $multiply: [
                  '$revenuePerUser',
                  { $literal: 12 } // Assuming 12-month projection
                ]
              },
              sessionConversionRate: {
                $divide: [
                  '$sessionMetrics.convertingSessions',
                  '$sessionMetrics.totalSessions'
                ]
              },
              averageSessionValue: {
                $divide: [
                  '$sessionMetrics.totalSessionRevenue',
                  '$sessionMetrics.totalSessions'
                ]
              }
            },

            // Data quality metrics
            dataQuality: {
              completenessScore: {
                $divide: [
                  { $add: [
                    { $cond: [{ $gt: [{ $size: '$uniqueUsers' }, 0] }, 1, 0] },
                    { $cond: [{ $gt: ['$eventCount', 0] }, 1, 0] },
                    { $cond: [{ $ne: ['$_id.eventType', null] }, 1, 0] },
                    { $cond: [{ $ne: ['$_id.eventCategory', null] }, 1, 0] }
                  ] },
                  4
                ]
              },
              consistencyScore: {
                $cond: [
                  { $eq: ['$eventsPerSession', { $divide: ['$eventCount', { $size: '$uniqueSessions' }] }] },
                  1.0,
                  0.8
                ]
              }
            }
          }
        },

        // Stage 6: Final result formatting and metadata
        {
          $project: {
            // Dimension information
            dimensions: '$_id',
            timeWindow: '$_id.hourBucket',
            analysisType: { $literal: pipelineConfig.analysisType || 'comprehensive_analytics' },

            // Core metrics
            metrics: {
              volume: {
                eventCount: '$eventCount',
                uniqueUserCount: '$uniqueUserCount',
                uniqueSessionCount: '$uniqueSessionCount',
                eventsPerUser: { $round: ['$eventsPerUser', 2] },
                eventsPerSession: { $round: ['$eventsPerSession', 2] }
              },

              revenue: {
                totalRevenue: { $round: ['$totalRevenue', 2] },
                totalQuantity: '$totalQuantity',
                revenueTransactions: '$revenueTransactions',
                averageRevenue: { $round: ['$averageRevenue', 2] },
                revenuePerUser: { $round: ['$revenuePerUser', 2] },
                conversionRate: { $round: ['$conversionRate', 4] }
              },

              statistical: {
                medianRevenue: { $round: ['$revenueStats.median', 2] },
                percentile75Revenue: { $round: ['$revenueStats.percentile75', 2] },
                percentile95Revenue: { $round: ['$revenueStats.percentile95', 2] }
              },

              performance: '$performanceMetrics',
              business: '$businessMetrics',
              dataQuality: '$dataQuality'
            },

            // Temporal information
            temporal: {
              firstEventTime: '$firstEventTime',
              lastEventTime: '$lastEventTime',
              processingWindowMinutes: { $round: ['$processingWindowMinutes', 1] }
            },

            // Pipeline metadata
            pipelineMetadata: {
              pipelineId: { $literal: pipelineId },
              executionTime: { $literal: new Date() },
              configurationUsed: { $literal: pipelineConfig }
            }
          }
        },

        // Stage 7: Results persistence and optimization
        {
          $merge: {
            into: 'analytics_results',
            whenMatched: 'replace',
            whenNotMatched: 'insert'
          }
        }
      ];

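      // Execute the comprehensive analytics pipeline
      console.log('Executing comprehensive analytics pipeline...');
      // A pipeline ending in $merge writes to the target collection and returns
      // no documents, so toArray() here only drives execution to completion
      await this.collections.processedEvents.aggregate(
        analyticsStages,
        {
          allowDiskUse: true,
          maxTimeMS: this.config.maxBatchProcessingTime,
          hint: { eventTimestamp: -1 }, // Optimize with time-based index
          comment: `Advanced analytics pipeline: ${pipelineId}`
        }
      ).toArray();

      // Read back the result documents written by this run
      const pipelineResult = await this.collections.analyticsResults
        .find({ 'pipelineMetadata.pipelineId': pipelineId })
        .toArray();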

      const executionTime = Date.now() - startTime;

      // Update performance metrics
      this.updatePipelineMetrics(pipelineId, {
        executionTime: executionTime,
        documentsProcessed: pipelineResult.length,
        pipelineType: 'analytics',
        success: true
      });

      this.emit('pipelineCompleted', {
        pipelineId: pipelineId,
        pipelineType: 'analytics',
        executionTime: executionTime,
        documentsProcessed: pipelineResult.length,
        resultsGenerated: pipelineResult.length
      });

      console.log(`Analytics pipeline completed: ${pipelineId} (${executionTime}ms, ${pipelineResult.length} results)`);

      return {
        success: true,
        pipelineId: pipelineId,
        executionTime: executionTime,
        resultsGenerated: pipelineResult.length,
        analyticsData: pipelineResult
      };

    } catch (error) {
      console.error(`Analytics pipeline failed: ${pipelineId}`, error);

      this.updatePipelineMetrics(pipelineId, {
        executionTime: Date.now() - startTime,
        pipelineType: 'analytics',
        success: false,
        error: error.message
      });

      return {
        success: false,
        pipelineId: pipelineId,
        error: error.message
      };
    }
  }

  async initializeStreamProcessing() {
    console.log('Initializing real-time stream processing...');

    try {
      // Setup change streams for real-time processing
      const changeStream = this.collections.rawEvents.watch(
        [
          {
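            $match: {
              'operationType': { $in: ['insert', 'update'] },
              // Skip events already handled. Marking an event 'failed' is itself an
              // update, so excluding that status prevents an endless retry loop.
              'fullDocument.processingStatus': { $nin: ['completed', 'failed'] }
            }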
          }
        ],
        this.config.changeStreamOptions
      );

      // Process streaming data in real-time
      changeStream.on('change', async (change) => {
        try {
          await this.processStreamingEvent(change);
        } catch (error) {
          console.error('Error processing streaming event:', error);
          this.emit('streamProcessingError', { change, error: error.message });
        }
      });

      changeStream.on('error', (error) => {
        console.error('Change stream error:', error);
        this.emit('changeStreamError', { error: error.message });
      });

      this.streamProcessors.set('main', changeStream);

      console.log('Stream processing initialized successfully');

    } catch (error) {
      console.error('Error initializing stream processing:', error);
      throw error;
    }
  }

  async processStreamingEvent(change) {
    console.log('Processing streaming event:', change.documentKey);

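    const document = change.fullDocument;
    // With fullDocument: 'updateLookup', an update event can carry a null
    // fullDocument if the document was deleted before the lookup ran
    if (!document) {
      return;
    }
    const processingStartTime = Date.now();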

    try {
      // Real-time event transformation and enrichment
      const transformedEvent = await this.transformEventData(document);

      // Apply real-time analytics calculations
      const analyticsData = await this.calculateRealTimeMetrics(transformedEvent);

      // Update processed events collection
      await this.collections.processedEvents.replaceOne(
        { _id: transformedEvent._id },
        {
          ...transformedEvent,
          ...analyticsData,
          processedAt: new Date(),
          processingLatency: Date.now() - processingStartTime
        },
        { upsert: true }
      );

      // Update real-time analytics aggregations
      if (this.config.enableRealTimeAnalytics) {
        await this.updateRealTimeAnalytics(transformedEvent);
      }

      this.emit('eventProcessed', {
        eventId: document._id,
        processingLatency: Date.now() - processingStartTime,
        analyticsGenerated: Object.keys(analyticsData).length
      });

    } catch (error) {
      console.error('Error processing streaming event:', error);

      // Mark event as failed for retry processing
      await this.collections.rawEvents.updateOne(
        { _id: document._id },
        {
          $set: {
            processingStatus: 'failed',
            processingError: error.message,
            lastProcessingAttempt: new Date()
          }
        }
      );

      throw error;
    }
  }

  async transformEventData(rawEvent) {
    // Normalize a raw event into the processed-event shape (plain JavaScript, runs in the application)
    const transformed = {
      _id: rawEvent._id,
      eventId: rawEvent.eventId || rawEvent._id,
      eventTimestamp: rawEvent.eventTimestamp,
      userId: rawEvent.userId,
      sessionId: rawEvent.sessionId,
      eventType: rawEvent.eventType,
      eventCategory: rawEvent.eventCategory,

      // Event payload with numeric fields coerced to numbers
      eventData: {
        ...rawEvent.eventData,
        revenue: parseFloat(rawEvent.eventData?.revenue || 0),
        quantity: parseInt(rawEvent.eventData?.quantity || 1, 10),
        productId: rawEvent.eventData?.productId,
        productName: rawEvent.eventData?.productName
      },

      // Device and technology information
      deviceInfo: {
        deviceType: rawEvent.deviceInfo?.deviceType || 'unknown',
        browser: rawEvent.deviceInfo?.browser || 'unknown',
        operatingSystem: rawEvent.deviceInfo?.os || 'unknown',
        screenResolution: rawEvent.deviceInfo?.screenResolution,
        userAgent: rawEvent.deviceInfo?.userAgent
      },

      // Geographic information
      locationData: {
        country: rawEvent.locationData?.country || 'unknown',
        region: rawEvent.locationData?.region || 'unknown',
        city: rawEvent.locationData?.city || 'unknown',
        coordinates: rawEvent.locationData?.coordinates
      },

      // Time-based dimensions for efficient aggregation.
      // UTC getters keep these values consistent with MongoDB date operators such as
      // $hour, which default to UTC. getUTCDay() is 0-6 (Sunday = 0); $dayOfWeek is 1-7.
      timeDimensions: {
        hour: rawEvent.eventTimestamp.getUTCHours(),
        dayOfWeek: rawEvent.eventTimestamp.getUTCDay(),
        yearMonth: `${rawEvent.eventTimestamp.getUTCFullYear()}-${String(rawEvent.eventTimestamp.getUTCMonth() + 1).padStart(2, '0')}`,
        quarterYear: `Q${Math.floor(rawEvent.eventTimestamp.getUTCMonth() / 3) + 1}-${rawEvent.eventTimestamp.getUTCFullYear()}`
      },

      // Processing metadata
      processingMetadata: {
        transformedAt: new Date(),
        version: '2.0',
        source: 'stream_processor'
      }
    };

    return transformed;
  }

  async calculateRealTimeMetrics(event) {
    // Real-time metrics calculation using MongoDB aggregation
    const metricsCalculation = [
      {
        $match: {
          userId: event.userId,
          eventTimestamp: {
            $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) // Last 24 hours
          }
        }
      },
      {
        $group: {
          _id: null,
          totalEvents: { $sum: 1 },
          totalRevenue: { $sum: '$eventData.revenue' },
          uniqueSessions: { $addToSet: '$sessionId' },
          eventTypes: { $addToSet: '$eventType' },
          averageOrderValue: { $avg: '$eventData.revenue' }
        }
      }
    ];

    const userMetrics = await this.collections.processedEvents
      .aggregate(metricsCalculation)
      .toArray();

    return {
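      // Fall back to metrics derived from the current event when the user has no
      // history in the window; events without a revenue field contribute 0
      userMetrics: userMetrics[0] || {
        totalEvents: 1,
        totalRevenue: event.eventData?.revenue ?? 0,
        uniqueSessions: [event.sessionId],
        eventTypes: [event.eventType],
        // $avg yields null when no numeric values exist, so mirror that here
        averageOrderValue: event.eventData?.revenue ?? null
      }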
    };
  }

  updatePipelineMetrics(pipelineId, metrics) {
    // Update system-wide pipeline performance metrics
    this.performanceMetrics.pipelinesExecuted++;
    this.performanceMetrics.totalProcessingTime += metrics.executionTime;
    this.performanceMetrics.documentsProcessed += metrics.documentsProcessed || 0;

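    // Throughput in documents per second (executionTime is assumed to be in
    // milliseconds); guard on elapsed time to avoid dividing by zero
    if (this.performanceMetrics.totalProcessingTime > 0) {
      this.performanceMetrics.averageThroughput =
        this.performanceMetrics.documentsProcessed /
        (this.performanceMetrics.totalProcessingTime / 1000);
    }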

    // Store detailed pipeline metrics
    this.collections.pipelineMetrics.insertOne({
      pipelineId: pipelineId,
      metrics: metrics,
      timestamp: new Date(),
      systemMetrics: {
        memoryUsage: process.memoryUsage(),
        systemPerformance: this.performanceMetrics
      }
    }).catch(error => {
      console.error('Error storing pipeline metrics:', error);
    });
  }

  generatePipelineId() {
    const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
    // slice(2, 11) keeps the same 9 random characters as the deprecated substr(2, 9)
    return `pipeline_${timestamp}_${Math.random().toString(36).slice(2, 11)}`;
  }

  async shutdown() {
    console.log('Shutting down data pipeline manager...');

    try {
      // Close all active stream processors
      for (const [processorId, stream] of this.streamProcessors.entries()) {
        await stream.close();
        console.log(`Closed stream processor: ${processorId}`);
      }

      // Close MongoDB connection
      if (this.client) {
        await this.client.close();
      }

      console.log('Data pipeline manager shutdown complete');

    } catch (error) {
      console.error('Error during shutdown:', error);
    }
  }
}

// Benefits of MongoDB Advanced Data Pipeline:
// - Real-time stream processing with Change Streams for immediate insights
// - Comprehensive aggregation framework for complex analytical workloads
// - Native support for nested and complex data structures without ETL overhead
// - Optimized indexing and query planning for high-performance analytics
// - Integrated batch and stream processing within a single platform
// - Advanced statistical and mathematical functions for sophisticated analytics
// - Automatic scaling and optimization for large-scale data processing
// - SQL-compatible pipeline management through QueryLeaf integration
// - Built-in performance monitoring and optimization capabilities
// - Production-ready stream processing with minimal configuration overhead

module.exports = {
  AdvancedDataPipelineManager
};
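
One caveat about the time dimensions computed in the transformation step above: `getHours()`, `getDay()`, and `getMonth()` read the host's local time zone, so processors running in different zones would bucket the same event differently. The following is a minimal sketch of a UTC-based variant, assuming UTC bucketing is what the analytics need. The helper name `buildTimeDimensionsUtc` is hypothetical and not part of the class above.

```javascript
// Hypothetical helper: derive the same time dimensions with UTC getters so the
// result does not depend on the time zone of the machine running the processor.
function buildTimeDimensionsUtc(timestamp) {
  const year = timestamp.getUTCFullYear();
  const month = timestamp.getUTCMonth(); // 0-based: 0 = January

  return {
    hour: timestamp.getUTCHours(),
    dayOfWeek: timestamp.getUTCDay(), // 0 = Sunday
    yearMonth: `${year}-${String(month + 1).padStart(2, '0')}`,
    quarterYear: `Q${Math.floor(month / 3) + 1}-${year}`
  };
}

const utcDimensions = buildTimeDimensionsUtc(new Date('2025-11-09T23:30:00Z'));
console.log(utcDimensions);
```

If reports are expected in a business time zone instead, the same idea applies with an explicit zone conversion rather than the host default.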

Understanding MongoDB Data Pipeline Architecture

Advanced Stream Processing and Real-Time Analytics Patterns

Implement sophisticated data pipeline workflows for production MongoDB deployments:

// Enterprise-grade MongoDB data pipeline with advanced stream processing and analytics optimization
class EnterpriseDataPipelineProcessor extends AdvancedDataPipelineManager {
  constructor(connectionString, enterpriseConfig) {
    super(connectionString, enterpriseConfig);

    this.enterpriseConfig = {
      ...enterpriseConfig,
      enableAdvancedAnalytics: true,
      enableMachineLearningPipelines: true,
      enablePredictiveAnalytics: true,
      enableDataGovernance: true,
      enableComplianceReporting: true
    };

    this.setupEnterpriseFeatures();
    this.initializePredictiveAnalytics();
    this.setupComplianceFramework();
  }

  async implementAdvancedDataPipeline() {
    console.log('Implementing enterprise data pipeline with advanced capabilities...');

    const pipelineStrategy = {
      // Multi-tier processing architecture
      processingTiers: {
        realTimeProcessing: {
          latencyTarget: 100, // milliseconds
          throughputTarget: 100000, // events per second
          consistencyLevel: 'eventual'
        },
        nearRealTimeProcessing: {
          latencyTarget: 5000, // 5 seconds
          throughputTarget: 50000,
          consistencyLevel: 'strong'
        },
        batchProcessing: {
          latencyTarget: 300000, // 5 minutes
          throughputTarget: 1000000,
          consistencyLevel: 'strong'
        }
      },

      // Advanced analytics capabilities
      analyticsCapabilities: {
        descriptiveAnalytics: true,
        diagnosticAnalytics: true,
        predictiveAnalytics: true,
        prescriptiveAnalytics: true
      },

      // Data governance and compliance
      dataGovernance: {
        dataLineageTracking: true,
        dataQualityMonitoring: true,
        privacyCompliance: true,
        auditTrailMaintenance: true
      }
    };

    return await this.deployEnterpriseStrategy(pipelineStrategy);
  }

  async setupPredictiveAnalytics() {
    console.log('Setting up predictive analytics capabilities...');

    const predictiveConfig = {
      // Machine learning models
      models: {
        churnPrediction: true,
        revenueForecasting: true,
        behaviorPrediction: true,
        anomalyDetection: true
      },

      // Feature engineering
      featureEngineering: {
        temporalFeatures: true,
        behavioralFeatures: true,
        demographicFeatures: true,
        interactionFeatures: true
      },

      // Model deployment
      modelDeployment: {
        realTimeScoring: true,
        batchScoring: true,
        modelVersioning: true,
        performanceMonitoring: true
      }
    };

    return await this.deployPredictiveAnalytics(predictiveConfig);
  }
}
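
The three processing tiers in `pipelineStrategy` are declared only as configuration; how work is routed to a tier is not shown. One way that routing might look, purely as a sketch: pick the slowest tier whose latency target still fits the caller's latency budget, on the assumption that slower tiers are cheaper to run. The function name `selectProcessingTier` and the routing rule are assumptions for illustration.

```javascript
// Tier settings copied from the pipelineStrategy object above.
const processingTiers = {
  realTimeProcessing: { latencyTarget: 100, throughputTarget: 100000, consistencyLevel: 'eventual' },
  nearRealTimeProcessing: { latencyTarget: 5000, throughputTarget: 50000, consistencyLevel: 'strong' },
  batchProcessing: { latencyTarget: 300000, throughputTarget: 1000000, consistencyLevel: 'strong' }
};

// Hypothetical routing rule: return the name of the slowest tier whose latency
// target (ms) fits within the budget, or null when no tier is fast enough.
function selectProcessingTier(tiers, latencyBudgetMs) {
  const candidates = Object.entries(tiers)
    .filter(([, tier]) => tier.latencyTarget <= latencyBudgetMs)
    .sort(([, a], [, b]) => b.latencyTarget - a.latencyTarget);

  return candidates.length > 0 ? candidates[0][0] : null;
}

console.log(selectProcessingTier(processingTiers, 10000));
console.log(selectProcessingTier(processingTiers, 50));
```

A real deployment would likely also weigh the consistency level and current load; the sketch considers latency alone.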

SQL-Style Data Pipeline Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB data pipeline operations and stream processing:

-- QueryLeaf advanced data pipeline operations with SQL-familiar syntax for MongoDB

-- Configure comprehensive data pipeline strategy
CONFIGURE DATA_PIPELINE 
SET pipeline_name = 'enterprise_analytics_pipeline',
    processing_modes = ['real_time', 'batch', 'stream'],

    -- Real-time processing configuration
    stream_processing_enabled = true,
    stream_latency_target_ms = 100,
    stream_throughput_target = 100000,
    change_stream_batch_size = 1000,

    -- Batch processing configuration
    batch_processing_enabled = true,
    batch_size = 10000,
    batch_processing_interval_minutes = 5,
    max_batch_processing_time_minutes = 30,

    -- Analytics configuration
    enable_real_time_analytics = true,
    analytics_window_size_hours = 24,
    enable_predictive_analytics = true,
    enable_statistical_functions = true,

    -- Performance optimization
    enable_aggregation_optimization = true,
    enable_index_optimization = true,
    enable_parallel_processing = true,
    max_memory_usage_gb = 8,

    -- Data governance
    enable_data_lineage_tracking = true,
    enable_data_quality_monitoring = true,
    enable_audit_trail = true,
    data_retention_days = 90;

-- Advanced multi-dimensional analytics pipeline with comprehensive transformations
WITH event_enrichment AS (
  SELECT 
    event_id,
    event_timestamp,
    user_id,
    session_id,
    event_type,
    event_category,

    -- Advanced data extraction and type conversion
    CAST(event_data->>'revenue' AS DECIMAL(10,2)) as revenue,
    CAST(event_data->>'quantity' AS INTEGER) as quantity,
    event_data->>'product_id' as product_id,
    event_data->>'product_name' as product_name,
    event_data->>'campaign_id' as campaign_id,

    -- Device and technology classification
    CASE 
      WHEN device_info->>'device_type' IN ('smartphone', 'tablet') THEN 'mobile'
      WHEN device_info->>'device_type' = 'desktop' THEN 'desktop'
      ELSE 'other'
    END as device_category,

    device_info->>'browser' as browser,
    device_info->>'operating_system' as operating_system,

    -- Geographic dimensions
    location_data->>'country' as country,
    location_data->>'region' as region,
    location_data->>'city' as city,

    -- Advanced geographic clustering
    CASE 
      WHEN location_data->>'country' IN ('US', 'CA', 'MX') THEN 'North America'
      WHEN location_data->>'country' IN ('GB', 'DE', 'FR', 'IT', 'ES', 'NL') THEN 'Europe'
      WHEN location_data->>'country' IN ('JP', 'KR', 'CN', 'IN', 'SG') THEN 'Asia Pacific'
      WHEN location_data->>'country' IN ('BR', 'AR', 'CL', 'CO') THEN 'Latin America'
      ELSE 'Other'
    END as geo_region,

    -- Time-based dimensions for efficient aggregation
    DATE_TRUNC('hour', event_timestamp) as hour_bucket,
    EXTRACT(HOUR FROM event_timestamp) as event_hour,
    EXTRACT(DOW FROM event_timestamp) as day_of_week,
    EXTRACT(WEEK FROM event_timestamp) as week_of_year,
    EXTRACT(MONTH FROM event_timestamp) as month_of_year,
    TO_CHAR(event_timestamp, 'YYYY-MM') as year_month,
    TO_CHAR(event_timestamp, 'YYYY-"Q"Q') as year_quarter,

    -- Advanced user segmentation
    CASE 
      WHEN user_metrics.total_revenue >= 1000 THEN 'high_value'
      WHEN user_metrics.total_revenue >= 100 THEN 'medium_value'  
      WHEN user_metrics.total_revenue > 0 THEN 'low_value'
      ELSE 'non_revenue'
    END as user_segment,

    -- Customer lifecycle classification
    CASE 
      WHEN user_metrics.days_since_first_event <= 30 THEN 'new'
      WHEN user_metrics.days_since_last_event <= 30 THEN 'active'
      WHEN user_metrics.days_since_last_event <= 90 THEN 'dormant'
      ELSE 'inactive'  
    END as customer_lifecycle_stage,

    -- Behavioral indicators
    user_metrics.total_events as user_total_events,
    user_metrics.total_revenue as user_total_revenue,
    user_metrics.avg_session_duration as user_avg_session_duration,
    user_metrics.days_since_first_event,
    user_metrics.days_since_last_event,

    -- Revenue and value calculations
    CASE 
      WHEN CAST(event_data->>'quantity' AS INTEGER) > 0 THEN
        CAST(event_data->>'revenue' AS DECIMAL(10,2)) / CAST(event_data->>'quantity' AS INTEGER)
      ELSE 0
    END as average_order_value,

    -- Processing metadata
    CURRENT_TIMESTAMP as processed_at,
    'advanced_pipeline_v2' as processing_version

  FROM raw_events re
  LEFT JOIN user_behavioral_metrics user_metrics ON re.user_id = user_metrics.user_id
  WHERE 
    re.event_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND re.processing_status = 'pending'
),

comprehensive_aggregation AS (
  SELECT 
    hour_bucket,
    event_type,
    event_category,
    user_segment,
    customer_lifecycle_stage,
    device_category,
    geo_region,
    browser,
    operating_system,

    -- Volume metrics with advanced calculations
    COUNT(*) as event_count,
    COUNT(DISTINCT user_id) as unique_users,
    COUNT(DISTINCT session_id) as unique_sessions,
    COUNT(DISTINCT product_id) as unique_products,
    COUNT(DISTINCT campaign_id) as unique_campaigns,

    -- User engagement metrics
    ROUND(COUNT(*)::DECIMAL / COUNT(DISTINCT user_id), 2) as events_per_user,
    ROUND(COUNT(*)::DECIMAL / COUNT(DISTINCT session_id), 2) as events_per_session,
    ROUND(COUNT(DISTINCT session_id)::DECIMAL / COUNT(DISTINCT user_id), 2) as sessions_per_user,

    -- Revenue analytics with statistical functions
    SUM(revenue) as total_revenue,
    SUM(quantity) as total_quantity,
    COUNT(*) FILTER (WHERE revenue > 0) as revenue_transactions,

    -- Advanced revenue statistics
    AVG(revenue) as avg_revenue,
    AVG(revenue) FILTER (WHERE revenue > 0) as avg_revenue_per_transaction,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) as median_revenue,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY revenue) as percentile_75_revenue,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY revenue) as percentile_95_revenue,
    STDDEV_POP(revenue) as revenue_standard_deviation,

    -- Advanced order value analytics
    AVG(average_order_value) as avg_order_value,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY average_order_value) as median_order_value,
    MAX(average_order_value) as max_order_value,
    MIN(average_order_value) FILTER (WHERE average_order_value > 0) as min_order_value,

    -- Conversion and engagement metrics
    ROUND((COUNT(*) FILTER (WHERE revenue > 0)::DECIMAL / COUNT(*)) * 100, 2) as conversion_rate_percent,
    SUM(revenue) / NULLIF(COUNT(DISTINCT user_id), 0) as revenue_per_user,
    SUM(revenue) / NULLIF(COUNT(DISTINCT session_id), 0) as revenue_per_session,

    -- Time-based analytics
    MIN(event_timestamp) as window_start_time,
    MAX(event_timestamp) as window_end_time,
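    -- Total elapsed minutes; EXTRACT(MINUTES ...) would return only the 0-59 minutes field of the interval
    (EXTRACT(EPOCH FROM (MAX(event_timestamp) - MIN(event_timestamp))) / 60.0)::DECIMAL as processing_window_minutes,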

    -- User behavior pattern analysis
    AVG(user_total_events) as avg_user_lifetime_events,
    AVG(user_total_revenue) as avg_user_lifetime_revenue,
    AVG(user_avg_session_duration) as avg_session_duration_seconds,
    AVG(days_since_first_event) as avg_days_since_first_event,

    -- Product performance metrics
    MODE() WITHIN GROUP (ORDER BY product_id) as top_product_id,
    MODE() WITHIN GROUP (ORDER BY product_name) as top_product_name,
    COUNT(DISTINCT product_id) FILTER (WHERE revenue > 0) as converting_products,

    -- Campaign effectiveness
    MODE() WITHIN GROUP (ORDER BY campaign_id) as top_campaign_id,
    COUNT(DISTINCT campaign_id) FILTER (WHERE revenue > 0) as converting_campaigns,

    -- Seasonal and temporal patterns
    MODE() WITHIN GROUP (ORDER BY day_of_week) as most_active_day_of_week,
    MODE() WITHIN GROUP (ORDER BY event_hour) as most_active_hour,

    -- Data quality indicators
    COUNT(*) FILTER (WHERE revenue IS NOT NULL) / COUNT(*)::DECIMAL as revenue_data_completeness,
    COUNT(*) FILTER (WHERE product_id IS NOT NULL) / COUNT(*)::DECIMAL as product_data_completeness,
    COUNT(*) FILTER (WHERE geo_region != 'Other') / COUNT(*)::DECIMAL as location_data_completeness

  FROM event_enrichment
  GROUP BY 
    hour_bucket, event_type, event_category, user_segment, customer_lifecycle_stage,
    device_category, geo_region, browser, operating_system
),

performance_analysis AS (
  SELECT 
    ca.*,

    -- Performance indicators and rankings
    ROW_NUMBER() OVER (ORDER BY total_revenue DESC) as revenue_rank,
    ROW_NUMBER() OVER (ORDER BY unique_users DESC) as user_engagement_rank,
    ROW_NUMBER() OVER (ORDER BY conversion_rate_percent DESC) as conversion_rank,
    ROW_NUMBER() OVER (ORDER BY avg_order_value DESC) as aov_rank,

    -- Efficiency metrics
    ROUND(total_revenue / NULLIF(processing_window_minutes, 0), 2) as revenue_velocity_per_minute,
    ROUND(event_count::DECIMAL / NULLIF(processing_window_minutes, 0), 1) as event_velocity_per_minute,
    ROUND(unique_users::DECIMAL / NULLIF(processing_window_minutes, 0), 1) as user_acquisition_rate_per_minute,

    -- Business health indicators
    CASE 
      WHEN conversion_rate_percent >= 5.0 THEN 'excellent'
      WHEN conversion_rate_percent >= 2.0 THEN 'good'
      WHEN conversion_rate_percent >= 1.0 THEN 'fair'
      ELSE 'poor'
    END as conversion_performance_rating,

    CASE 
      WHEN revenue_per_user >= 100 THEN 'high_value'
      WHEN revenue_per_user >= 25 THEN 'medium_value'
      WHEN revenue_per_user >= 5 THEN 'low_value'
      ELSE 'minimal_value'
    END as user_value_rating,

    CASE 
      WHEN events_per_user >= 10 THEN 'highly_engaged'
      WHEN events_per_user >= 5 THEN 'moderately_engaged'
      WHEN events_per_user >= 2 THEN 'lightly_engaged'
      ELSE 'minimally_engaged'
    END as user_engagement_rating,

    -- Trend and growth indicators
    LAG(total_revenue) OVER (
      PARTITION BY event_type, user_segment, device_category, geo_region 
      ORDER BY hour_bucket
    ) as prev_hour_revenue,

    LAG(unique_users) OVER (
      PARTITION BY event_type, user_segment, device_category, geo_region 
      ORDER BY hour_bucket
    ) as prev_hour_users,

    LAG(conversion_rate_percent) OVER (
      PARTITION BY event_type, user_segment, device_category, geo_region 
      ORDER BY hour_bucket
    ) as prev_hour_conversion_rate

  FROM comprehensive_aggregation ca
),

trend_analysis AS (
  SELECT 
    pa.*,

    -- Revenue trends
    CASE 
      WHEN prev_hour_revenue IS NOT NULL AND prev_hour_revenue > 0 THEN
        ROUND(((total_revenue - prev_hour_revenue) / prev_hour_revenue * 100), 1)
      ELSE NULL
    END as revenue_change_percent,

    -- User acquisition trends
    CASE 
      WHEN prev_hour_users IS NOT NULL AND prev_hour_users > 0 THEN
        ROUND(((unique_users - prev_hour_users)::DECIMAL / prev_hour_users * 100), 1)
      ELSE NULL
    END as user_growth_percent,

    -- Conversion optimization trends
    CASE 
      WHEN prev_hour_conversion_rate IS NOT NULL THEN
        ROUND((conversion_rate_percent - prev_hour_conversion_rate), 2)
      ELSE NULL
    END as conversion_rate_change,

    -- Growth classification
    CASE 
      WHEN prev_hour_revenue IS NOT NULL AND total_revenue > prev_hour_revenue * 1.1 THEN 'high_growth'
      WHEN prev_hour_revenue IS NOT NULL AND total_revenue > prev_hour_revenue * 1.05 THEN 'moderate_growth'
      WHEN prev_hour_revenue IS NOT NULL AND total_revenue >= prev_hour_revenue * 0.95 THEN 'stable'
      WHEN prev_hour_revenue IS NOT NULL AND total_revenue >= prev_hour_revenue * 0.9 THEN 'moderate_decline'
      WHEN prev_hour_revenue IS NOT NULL THEN 'significant_decline'
      ELSE 'insufficient_data'
    END as revenue_trend_classification,

    -- Anomaly detection indicators
    CASE 
      WHEN conversion_rate_percent > (AVG(conversion_rate_percent) OVER () + 2 * STDDEV_POP(conversion_rate_percent) OVER ()) THEN 'conversion_anomaly_high'
      WHEN conversion_rate_percent < (AVG(conversion_rate_percent) OVER () - 2 * STDDEV_POP(conversion_rate_percent) OVER ()) THEN 'conversion_anomaly_low'
      WHEN revenue_per_user > (AVG(revenue_per_user) OVER () + 2 * STDDEV_POP(revenue_per_user) OVER ()) THEN 'revenue_anomaly_high'
      WHEN revenue_per_user < (AVG(revenue_per_user) OVER () - 2 * STDDEV_POP(revenue_per_user) OVER ()) THEN 'revenue_anomaly_low'
      ELSE 'normal'
    END as anomaly_detection_status

  FROM performance_analysis pa
),

insights_and_recommendations AS (
  SELECT 
    ta.*,

    -- Strategic insights
    ARRAY[
      CASE WHEN conversion_performance_rating = 'excellent' THEN 'Maintain current conversion optimization strategies' END,
      CASE WHEN conversion_performance_rating = 'poor' THEN 'Implement conversion rate optimization initiatives' END,
      CASE WHEN user_value_rating = 'high_value' THEN 'Focus on retention and upselling strategies' END,
      CASE WHEN user_value_rating = 'minimal_value' THEN 'Develop user value enhancement programs' END,
      CASE WHEN revenue_trend_classification = 'high_growth' THEN 'Scale successful channels and campaigns' END,
      CASE WHEN revenue_trend_classification = 'significant_decline' THEN 'Investigate and address performance issues urgently' END,
      CASE WHEN anomaly_detection_status LIKE '%anomaly%' THEN 'Investigate anomalous behavior for opportunities or issues' END
    ]::TEXT[] as strategic_recommendations,

    -- Operational recommendations
    ARRAY[
      CASE WHEN event_velocity_per_minute > 1000 THEN 'Consider increasing processing capacity' END,
      CASE WHEN revenue_data_completeness < 0.9 THEN 'Improve data collection completeness' END,
      CASE WHEN location_data_completeness < 0.8 THEN 'Enhance geographic data capture' END,
      CASE WHEN processing_window_minutes > 60 THEN 'Optimize data pipeline performance' END
    ]::TEXT[] as operational_recommendations,

    -- Priority scoring for resource allocation
    CASE 
      WHEN total_revenue >= 10000 AND conversion_rate_percent >= 3.0 THEN 10  -- Highest priority
      WHEN total_revenue >= 5000 AND conversion_rate_percent >= 2.0 THEN 8
      WHEN total_revenue >= 1000 AND conversion_rate_percent >= 1.0 THEN 6
      WHEN unique_users >= 1000 THEN 4
      ELSE 2
    END as business_priority_score,

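    -- Investment recommendations
    -- (conditions restated from business_priority_score, because standard SQL does not
    -- allow a SELECT-list alias to be referenced within the same SELECT)
    CASE 
      WHEN total_revenue >= 5000 AND conversion_rate_percent >= 2.0 THEN 'High investment recommended'      -- score >= 8
      WHEN total_revenue >= 1000 AND conversion_rate_percent >= 1.0 THEN 'Moderate investment recommended'  -- score >= 6
      WHEN unique_users >= 1000 THEN 'Selective investment recommended'                                     -- score >= 4
      ELSE 'Monitor performance'
    END as investment_recommendation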

  FROM trend_analysis ta
)

-- Final comprehensive analytics output with actionable insights
SELECT 
  -- Core dimensions
  hour_bucket,
  event_type,
  event_category,
  user_segment,
  customer_lifecycle_stage,
  device_category,
  geo_region,

  -- Volume and engagement metrics
  event_count,
  unique_users,
  unique_sessions,
  events_per_user,
  events_per_session,
  sessions_per_user,

  -- Revenue analytics
  ROUND(total_revenue, 2) as total_revenue,
  revenue_transactions,
  ROUND(avg_revenue_per_transaction, 2) as avg_revenue_per_transaction,
  ROUND(median_revenue, 2) as median_revenue,
  ROUND(percentile_95_revenue, 2) as percentile_95_revenue,
  ROUND(revenue_per_user, 2) as revenue_per_user,
  ROUND(revenue_per_session, 2) as revenue_per_session,

  -- Performance indicators
  conversion_rate_percent,
  ROUND(avg_order_value, 2) as avg_order_value,
  conversion_performance_rating,
  user_value_rating,
  user_engagement_rating,

  -- Trend analysis
  revenue_change_percent,
  user_growth_percent,
  conversion_rate_change,
  revenue_trend_classification,
  anomaly_detection_status,

  -- Business metrics
  business_priority_score,
  investment_recommendation,

  -- Performance rankings
  revenue_rank,
  user_engagement_rank,
  conversion_rank,

  -- Operational metrics
  ROUND(revenue_velocity_per_minute, 2) as revenue_velocity_per_minute,
  ROUND(event_velocity_per_minute, 1) as event_velocity_per_minute,
  processing_window_minutes,

  -- Data quality
  ROUND(revenue_data_completeness * 100, 1) as revenue_data_completeness_percent,
  ROUND(product_data_completeness * 100, 1) as product_data_completeness_percent,
  ROUND(location_data_completeness * 100, 1) as location_data_completeness_percent,

  -- Top performing entities
  top_product_name,
  top_campaign_id,
  most_active_hour,

  -- Strategic insights
  strategic_recommendations,
  operational_recommendations,

  -- Metadata
  window_start_time,
  window_end_time,
  CURRENT_TIMESTAMP as analysis_generated_at

FROM insights_and_recommendations
WHERE 
  -- Filter for significant segments to focus analysis
  (event_count >= 10 OR total_revenue >= 100 OR unique_users >= 5)
  AND business_priority_score >= 2
ORDER BY 
  business_priority_score DESC,
  total_revenue DESC,
  hour_bucket DESC;

-- Real-time streaming analytics with change stream processing
CREATE STREAMING_ANALYTICS_VIEW real_time_conversion_funnel AS
WITH funnel_events AS (
  SELECT 
    user_id,
    session_id,
    event_type,
    event_timestamp,
    revenue,

    -- Create event sequence within sessions
    ROW_NUMBER() OVER (
      PARTITION BY user_id, session_id 
      ORDER BY event_timestamp
    ) as event_sequence,

    -- Identify funnel steps
    CASE event_type
      WHEN 'page_view' THEN 1
      WHEN 'product_view' THEN 2  
      WHEN 'add_to_cart' THEN 3
      WHEN 'checkout_start' THEN 4
      WHEN 'purchase' THEN 5
      ELSE 0
    END as funnel_step,

    -- Calculate time between events
    LAG(event_timestamp) OVER (
      PARTITION BY user_id, session_id 
      ORDER BY event_timestamp
    ) as prev_event_timestamp

  FROM CHANGE_STREAM('raw_events')
  WHERE event_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  AND event_type IN ('page_view', 'product_view', 'add_to_cart', 'checkout_start', 'purchase')
),

real_time_funnel_analysis AS (
  SELECT 
    DATE_TRUNC('minute', event_timestamp) as minute_bucket,

    -- Funnel step counts
    COUNT(*) FILTER (WHERE funnel_step = 1) as step_1_page_views,
    COUNT(*) FILTER (WHERE funnel_step = 2) as step_2_product_views,  
    COUNT(*) FILTER (WHERE funnel_step = 3) as step_3_add_to_cart,
    COUNT(*) FILTER (WHERE funnel_step = 4) as step_4_checkout_start,
    COUNT(*) FILTER (WHERE funnel_step = 5) as step_5_purchase,

    -- Unique user counts at each step
    COUNT(DISTINCT user_id) FILTER (WHERE funnel_step = 1) as unique_users_step_1,
    COUNT(DISTINCT user_id) FILTER (WHERE funnel_step = 2) as unique_users_step_2,
    COUNT(DISTINCT user_id) FILTER (WHERE funnel_step = 3) as unique_users_step_3,
    COUNT(DISTINCT user_id) FILTER (WHERE funnel_step = 4) as unique_users_step_4,
    COUNT(DISTINCT user_id) FILTER (WHERE funnel_step = 5) as unique_users_step_5,

    -- Revenue metrics
    SUM(revenue) FILTER (WHERE funnel_step = 5) as total_revenue,
    AVG(revenue) FILTER (WHERE funnel_step = 5 AND revenue > 0) as avg_purchase_value,

    -- Timing analysis
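    -- EXTRACT(EPOCH ...) yields the full interval in seconds; EXTRACT(SECONDS ...) would
    -- return only the 0-59 seconds field, making the one-hour cutoff meaningless
    AVG(EXTRACT(EPOCH FROM (event_timestamp - prev_event_timestamp))) FILTER (
      WHERE prev_event_timestamp IS NOT NULL 
      AND EXTRACT(EPOCH FROM (event_timestamp - prev_event_timestamp)) <= 3600
    ) as avg_time_between_steps_seconds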

  FROM funnel_events
  WHERE funnel_step > 0
  GROUP BY DATE_TRUNC('minute', event_timestamp)
)

SELECT 
  minute_bucket,

  -- Funnel volumes
  step_1_page_views,
  step_2_product_views,
  step_3_add_to_cart, 
  step_4_checkout_start,
  step_5_purchase,

  -- Conversion rates between steps
  ROUND((step_2_product_views::DECIMAL / NULLIF(step_1_page_views, 0)) * 100, 2) as page_to_product_rate,
  ROUND((step_3_add_to_cart::DECIMAL / NULLIF(step_2_product_views, 0)) * 100, 2) as product_to_cart_rate,
  ROUND((step_4_checkout_start::DECIMAL / NULLIF(step_3_add_to_cart, 0)) * 100, 2) as cart_to_checkout_rate,
  ROUND((step_5_purchase::DECIMAL / NULLIF(step_4_checkout_start, 0)) * 100, 2) as checkout_to_purchase_rate,

  -- Overall funnel performance
  ROUND((step_5_purchase::DECIMAL / NULLIF(step_1_page_views, 0)) * 100, 2) as overall_conversion_rate,

  -- User journey efficiency
  ROUND((unique_users_step_5::DECIMAL / NULLIF(unique_users_step_1, 0)) * 100, 2) as user_conversion_rate,

  -- Revenue performance
  ROUND(total_revenue, 2) as total_revenue,
  ROUND(avg_purchase_value, 2) as avg_purchase_value,
  ROUND(total_revenue / NULLIF(unique_users_step_5, 0), 2) as revenue_per_converting_user,

  -- Efficiency metrics
  ROUND(avg_time_between_steps_seconds / 60.0, 1) as avg_minutes_between_steps,

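  -- Performance indicators
  -- (the rate expression is repeated because standard SQL does not allow a SELECT-list
  -- alias such as overall_conversion_rate to be referenced within the same SELECT)
  CASE 
    WHEN (step_5_purchase::DECIMAL / NULLIF(step_1_page_views, 0)) * 100 >= 5.0 THEN 'excellent'
    WHEN (step_5_purchase::DECIMAL / NULLIF(step_1_page_views, 0)) * 100 >= 2.0 THEN 'good'
    WHEN (step_5_purchase::DECIMAL / NULLIF(step_1_page_views, 0)) * 100 >= 1.0 THEN 'fair'
    ELSE 'needs_improvement'
  END as funnel_performance_rating,

  -- Real-time alerts
  CASE 
    WHEN (step_5_purchase::DECIMAL / NULLIF(step_1_page_views, 0)) * 100 < 0.5 THEN 'LOW_CONVERSION_ALERT'
    WHEN avg_time_between_steps_seconds > 300 THEN 'SLOW_FUNNEL_ALERT'
    WHEN step_5_purchase = 0 AND step_4_checkout_start > 5 THEN 'CHECKOUT_ISSUE_ALERT'
    ELSE 'normal'
  END as real_time_alert_status,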

  CURRENT_TIMESTAMP as analysis_timestamp

FROM real_time_funnel_analysis
WHERE minute_bucket >= CURRENT_TIMESTAMP - INTERVAL '30 minutes'
ORDER BY minute_bucket DESC;

-- QueryLeaf provides comprehensive data pipeline capabilities:
-- 1. SQL-familiar syntax for MongoDB aggregation pipeline construction
-- 2. Advanced real-time stream processing with Change Streams integration
-- 3. Comprehensive multi-dimensional analytics with statistical functions
-- 4. Built-in performance optimization and index utilization
-- 5. Real-time anomaly detection and business intelligence
-- 6. Advanced funnel analysis and conversion optimization
-- 7. Sophisticated trend analysis and predictive indicators
-- 8. Enterprise-ready data governance and compliance features
-- 9. Automated insights generation and recommendation systems
-- 10. Production-ready stream processing with minimal configuration
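
For comparison, the per-minute funnel counts in the streaming view above could be written by hand as a MongoDB aggregation pipeline. The sketch below is an assumption about how such a query might look, not the pipeline QueryLeaf generates. It assumes documents with `eventType`, `eventTimestamp`, and `eventData.revenue` fields (the camelCase names used in the JavaScript examples earlier) and MongoDB 5.0 or later for `$dateTrunc`. The helper name `buildFunnelPipeline` is hypothetical.

```javascript
// Funnel steps in order, matching the event types used in the SQL view.
const FUNNEL_STEPS = ['page_view', 'product_view', 'add_to_cart', 'checkout_start', 'purchase'];

// Hypothetical helper: build a per-minute funnel pipeline for events since `since`.
function buildFunnelPipeline(since) {
  // One conditional counter per step, e.g. page_view: { $sum: { $cond: [...] } }
  const stepCounters = Object.fromEntries(
    FUNNEL_STEPS.map((step) => [
      step,
      { $sum: { $cond: [{ $eq: ['$eventType', step] }, 1, 0] } }
    ])
  );

  return [
    // Filter first so later stages only see recent funnel events
    { $match: { eventTimestamp: { $gte: since }, eventType: { $in: FUNNEL_STEPS } } },
    {
      $group: {
        _id: { $dateTrunc: { date: '$eventTimestamp', unit: 'minute' } },
        ...stepCounters,
        // Revenue is summed for purchase events only; missing values count as 0
        totalRevenue: {
          $sum: {
            $cond: [
              { $eq: ['$eventType', 'purchase'] },
              { $ifNull: ['$eventData.revenue', 0] },
              0
            ]
          }
        }
      }
    },
    {
      $addFields: {
        // Counterpart of NULLIF(step_1_page_views, 0): null when there are no page views
        overallConversionRate: {
          $cond: [
            { $gt: ['$page_view', 0] },
            { $round: [{ $multiply: [{ $divide: ['$purchase', '$page_view'] }, 100] }, 2] },
            null
          ]
        }
      }
    },
    // _id holds the minute bucket; newest first
    { $sort: { _id: -1 } }
  ];
}

const funnelPipeline = buildFunnelPipeline(new Date(Date.now() - 30 * 60 * 1000));
console.log(funnelPipeline.map((stage) => Object.keys(stage)[0]));
```

With the Node.js driver, the result would be passed to something like `collection.aggregate(funnelPipeline).toArray()`. Unlike the streaming view, this runs as a one-off query and would need to be re-executed or combined with a change stream to stay current.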

Best Practices for Production Data Pipeline Implementation

Pipeline Architecture Design Principles

Essential principles for effective MongoDB data pipeline deployment:

Multi-Tier Processing Strategy: Implement real-time, near-real-time, and batch processing tiers based on latency and consistency requirements
Performance Optimization: Design aggregation pipelines with proper indexing, stage ordering, and memory optimization for maximum throughput
Stream Processing Integration: Leverage Change Streams for real-time processing while maintaining batch processing for historical analysis
Data Quality Management: Implement comprehensive data validation, cleansing, and quality monitoring throughout the pipeline
Scalability Planning: Design pipelines that can scale horizontally and handle increasing data volumes and processing complexity
Monitoring and Alerting: Establish comprehensive pipeline monitoring with performance metrics and business-critical alerting
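
The indexing and stage-ordering advice above can be sketched as follows: a `$match` placed first can use an index and reduces the input to later stages, and a compound index on the filtered fields supports it. The field names follow the `calculateRealTimeMetrics` example earlier; the names `userTimelineIndex` and `buildUserActivityPipeline` are hypothetical.

```javascript
// Compound index: equality field first (userId), then the range field (eventTimestamp).
const userTimelineIndex = { userId: 1, eventTimestamp: -1 };

// Hypothetical helper: per-event-type activity for one user over a recent window.
function buildUserActivityPipeline(userId, windowMs, now = new Date()) {
  return [
    // $match comes first so it can use the index and shrink the input to later stages
    { $match: { userId, eventTimestamp: { $gte: new Date(now.getTime() - windowMs) } } },
    {
      $group: {
        _id: '$eventType',
        events: { $sum: 1 },
        revenue: { $sum: '$eventData.revenue' },
        sessions: { $addToSet: '$sessionId' }
      }
    },
    // Return the session count rather than the full set to keep results small
    {
      $project: {
        _id: 0,
        eventType: '$_id',
        events: 1,
        revenue: 1,
        uniqueSessions: { $size: '$sessions' }
      }
    }
  ];
}

const activityPipeline = buildUserActivityPipeline('user-123', 24 * 60 * 60 * 1000);
console.log(activityPipeline.map((stage) => Object.keys(stage)[0]));
```

With the Node.js driver this would be used roughly as `await collection.createIndex(userTimelineIndex)` followed by `collection.aggregate(activityPipeline, { allowDiskUse: true }).toArray()`, where `allowDiskUse` permits memory-heavy stages to spill to disk. Running the same pipeline with `explain` is a reasonable way to confirm that the index is actually used.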

Enterprise Data Pipeline Architecture

Design pipeline systems for enterprise-scale requirements:

Advanced Analytics Integration: Implement sophisticated analytical capabilities including predictive analytics and machine learning integration
Data Governance Framework: Establish data lineage tracking, compliance monitoring, and audit trail maintenance
Performance Monitoring: Implement comprehensive performance tracking with optimization recommendations and capacity planning
Security and Compliance: Design secure pipelines with encryption, access controls, and regulatory compliance features
Operational Excellence: Integrate with existing monitoring systems and establish operational procedures for pipeline management
Disaster Recovery: Implement pipeline resilience with failover capabilities and data recovery procedures
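
For the resilience item above, one common building block is to persist the change stream resume token after each processed event and pass it back through `resumeAfter` on restart. The following is a minimal sketch, assuming the official Node.js driver's `collection.watch()` and a caller-supplied token store. The function names and the `handleChange`, `loadToken`, and `saveToken` callbacks are hypothetical.

```javascript
// Build watch() options, resuming from a stored token when one exists.
function buildWatchOptions(resumeToken) {
  const options = { fullDocument: 'updateLookup' };
  if (resumeToken) {
    options.resumeAfter = resumeToken;
  }
  return options;
}

// Hypothetical consumer: `collection` is a driver Collection; `handleChange`,
// `loadToken`, and `saveToken` are caller-supplied async functions.
async function processChanges(collection, handleChange, loadToken, saveToken) {
  const pipeline = [{ $match: { operationType: { $in: ['insert', 'update'] } } }];
  const stream = collection.watch(pipeline, buildWatchOptions(await loadToken()));

  try {
    while (await stream.hasNext()) {
      const change = await stream.next();
      await handleChange(change);
      // The change event's _id is its resume token; persist it only after the
      // event was handled so that a crash replays the event instead of skipping it
      await saveToken(change._id);
    }
  } finally {
    await stream.close();
  }
}

console.log(buildWatchOptions(null));
console.log(buildWatchOptions({ _data: 'example-token' }));
```

Saving the token after handling gives at-least-once delivery, so `handleChange` should be idempotent. Resuming only works while the corresponding entries are still in the oplog, which is one reason the batch tier remains useful for backfills.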

Conclusion

MongoDB's aggregation framework, combined with Change Streams, supports real-time analytics inside the database: multi-stage transformations, statistical aggregations, and stream processing all run on the same platform. This unified approach removes the need to operate separate batch and stream processing systems, which simplifies operations.

Key MongoDB Data Pipeline benefits include:

Real-Time Processing: Advanced Change Streams integration for immediate data processing and real-time analytics generation
Comprehensive Analytics: Sophisticated aggregation framework with advanced statistical functions and multi-dimensional analysis capabilities
Performance Optimization: Native query optimization, intelligent indexing, and memory management for maximum throughput
Stream and Batch Integration: Unified platform supporting both real-time stream processing and comprehensive batch analytics
Business Intelligence: Advanced analytics with anomaly detection, trend analysis, and automated insights generation
SQL Accessibility: Familiar SQL-style data pipeline operations through QueryLeaf for accessible advanced analytics

Whether you're building real-time dashboards, implementing complex analytical workloads, processing high-velocity data streams, or developing sophisticated business intelligence systems, MongoDB data pipeline optimization with QueryLeaf's familiar SQL interface provides the foundation for scalable, high-performance data processing workflows.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB aggregation pipelines while providing SQL-familiar syntax for complex analytics, stream processing, and data transformation operations. Advanced pipeline construction, performance optimization, and business intelligence features are seamlessly handled through familiar SQL constructs, making sophisticated data processing accessible to SQL-oriented analytics teams.

Combining MongoDB's aggregation framework with SQL-style pipeline operations suits applications that need both advanced analytics and familiar database interaction patterns, so data processing workflows can scale while delivering actionable business insights in real time.

November 2, 2025
24 min read

MongoDB Capped Collections: Fixed-Size High-Performance Logging and Data Streaming for Real-Time Applications

Real-time applications require efficient data structures for continuous data capture, event streaming, and high-frequency logging without the overhead of traditional database management. Conventional database approaches struggle with scenarios requiring sustained high-throughput writes, automatic old data removal, and guaranteed insertion order preservation, often leading to performance degradation, storage bloat, and complex maintenance procedures in production environments.

MongoDB capped collections provide native fixed-size, high-performance data structures that maintain insertion order and automatically remove old documents when storage limits are reached. Unlike traditional database logging solutions that require complex archival processes and performance-degrading maintenance operations, MongoDB capped collections deliver consistent high-throughput writes, predictable storage usage, and automatic data lifecycle management through optimized storage allocation and write-optimized data structures.
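
To make the fixed-size behavior concrete, here is a minimal in-memory sketch of the eviction rule a capped collection follows: documents stay in insertion order, and the oldest ones are removed once a byte budget or a document-count limit is exceeded. The class name `CappedBufferModel` and the JSON-based size estimate are assumptions made for illustration; this is a toy model, not how MongoDB stores data internally.

```javascript
// Toy in-memory model of capped-collection eviction (illustration only;
// this is not MongoDB's storage implementation).
class CappedBufferModel {
  constructor({ size, max = Infinity }) {
    this.size = size;   // byte budget, analogous to the `size` option
    this.max = max;     // optional document-count limit, analogous to `max`
    this.docs = [];     // entries kept in insertion order
    this.bytesUsed = 0;
  }

  insert(doc) {
    // Approximate the document size by its JSON length (real BSON sizes differ).
    const bytes = Buffer.byteLength(JSON.stringify(doc));
    if (bytes > this.size) {
      throw new Error('document is larger than the whole buffer');
    }
    this.docs.push({ doc, bytes });
    this.bytesUsed += bytes;

    // Evict the oldest entries until both limits are satisfied again.
    while (this.bytesUsed > this.size || this.docs.length > this.max) {
      const evicted = this.docs.shift();
      this.bytesUsed -= evicted.bytes;
    }
  }

  // Documents in insertion order, oldest first.
  contents() {
    return this.docs.map((entry) => entry.doc);
  }
}

const buffer = new CappedBufferModel({ size: 1024, max: 3 });
for (let i = 1; i <= 5; i++) {
  buffer.insert({ seq: i, message: `event ${i}` });
}
console.log(buffer.contents().map((d) => d.seq)); // [ 3, 4, 5 ]
```

Only the three most recent documents survive, with no cleanup job or delete statement involved. That is the property the rest of this article relies on.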

The Traditional High-Performance Logging Challenge

Conventional database logging approaches often encounter significant performance and maintenance challenges:

-- Traditional PostgreSQL high-performance logging - complex maintenance and performance issues

-- Basic application logging table with growing maintenance complexity
CREATE TABLE application_logs (
    log_id BIGSERIAL PRIMARY KEY,
    application_name VARCHAR(100) NOT NULL,
    log_level VARCHAR(20) NOT NULL,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    message TEXT NOT NULL,

    -- Additional context fields
    user_id BIGINT,
    session_id VARCHAR(100),
    request_id VARCHAR(100),

    -- Performance metadata
    duration_ms INTEGER,
    memory_usage_mb DECIMAL(8,2),
    cpu_usage_percent DECIMAL(5,2),

    -- Log metadata
    thread_id INTEGER,
    process_id INTEGER,
    hostname VARCHAR(100),

    -- Restrict log_level to known values
    CONSTRAINT valid_log_level CHECK (log_level IN ('DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL'))
);

-- Multiple indexes required for different query patterns - increasing maintenance overhead
CREATE INDEX idx_logs_timestamp ON application_logs(timestamp DESC);
CREATE INDEX idx_logs_level_timestamp ON application_logs(log_level, timestamp DESC);
CREATE INDEX idx_logs_app_timestamp ON application_logs(application_name, timestamp DESC);
CREATE INDEX idx_logs_user_timestamp ON application_logs(user_id, timestamp DESC) WHERE user_id IS NOT NULL;
CREATE INDEX idx_logs_session_timestamp ON application_logs(session_id, timestamp DESC) WHERE session_id IS NOT NULL;

-- Complex partitioning strategy for log table management
CREATE TABLE application_logs_2024_01 (
    CHECK (timestamp >= '2024-01-01' AND timestamp < '2024-02-01')
) INHERITS (application_logs);

CREATE TABLE application_logs_2024_02 (
    CHECK (timestamp >= '2024-02-01' AND timestamp < '2024-03-01')
) INHERITS (application_logs);

-- Monthly partition maintenance (complex and error-prone)
CREATE OR REPLACE FUNCTION create_monthly_log_partition()
RETURNS VOID AS $$
DECLARE
    partition_name TEXT;
    start_date DATE;
    end_date DATE;
BEGIN
    start_date := DATE_TRUNC('month', CURRENT_DATE);
    end_date := start_date + INTERVAL '1 month';
    partition_name := 'application_logs_' || TO_CHAR(start_date, 'YYYY_MM');

    EXECUTE format('
        CREATE TABLE IF NOT EXISTS %I (
            CHECK (timestamp >= %L AND timestamp < %L)
        ) INHERITS (application_logs)', 
        partition_name, start_date, end_date);

    EXECUTE format('
        CREATE INDEX IF NOT EXISTS %I ON %I(timestamp DESC)',
        'idx_' || partition_name || '_timestamp', partition_name);
END;
$$ LANGUAGE plpgsql;

-- Automated cleanup process with significant performance impact
CREATE OR REPLACE FUNCTION cleanup_old_logs(retention_days INTEGER DEFAULT 90)
RETURNS TABLE(
    deleted_count BIGINT,
    cleanup_duration_ms BIGINT,
    affected_partitions TEXT[]
) AS $$
DECLARE
    cutoff_date TIMESTAMP;
    partition_record RECORD;
    total_deleted BIGINT := 0;
    partition_deleted BIGINT;
    has_recent_rows BOOLEAN;
    start_time TIMESTAMP := clock_timestamp();
    affected_partitions TEXT[] := '{}';
BEGIN
    cutoff_date := CURRENT_TIMESTAMP - (retention_days || ' days')::INTERVAL;

    -- Delete from main table (expensive operation)
    DELETE FROM ONLY application_logs 
    WHERE timestamp < cutoff_date;

    GET DIAGNOSTICS total_deleted = ROW_COUNT;

    -- Handle partitioned tables
    FOR partition_record IN 
        SELECT schemaname, tablename 
        FROM pg_tables 
        WHERE tablename ~ '^application_logs_\d{4}_\d{2}$'
    LOOP
        -- Check whether the partition still holds rows inside the retention window
        EXECUTE format('
            SELECT EXISTS (
                SELECT 1 FROM %I.%I WHERE timestamp >= %L
            )',
            partition_record.schemaname,
            partition_record.tablename,
            cutoff_date
        ) INTO has_recent_rows;

        -- Complex logic to determine drop vs cleanup
        IF NOT has_recent_rows THEN
            -- Every row is older than the cutoff, so the whole partition can go
            EXECUTE format('DROP TABLE IF EXISTS %I.%I CASCADE',
                partition_record.schemaname, partition_record.tablename);
            affected_partitions := affected_partitions || partition_record.tablename::TEXT;
        ELSE
            -- Partial cleanup within partition (expensive)
            EXECUTE format('
                DELETE FROM %I.%I WHERE timestamp < %L',
                partition_record.schemaname, partition_record.tablename, cutoff_date);
            GET DIAGNOSTICS partition_deleted = ROW_COUNT;
            total_deleted := total_deleted + partition_deleted;
        END IF;
    END LOOP;

    -- VACUUM cannot run inside a function, so VACUUM ANALYZE application_logs
    -- has to be issued as a separate statement after this call
    -- Reindex (significant performance impact)
    REINDEX TABLE application_logs;

    RETURN QUERY SELECT 
        total_deleted,
        (EXTRACT(EPOCH FROM clock_timestamp() - start_time) * 1000)::BIGINT,
        affected_partitions;
END;
$$ LANGUAGE plpgsql;

-- High-frequency insert procedure with limited performance optimization
CREATE OR REPLACE FUNCTION batch_insert_logs(log_entries JSONB[])
RETURNS TABLE(
    inserted_count INTEGER,
    failed_count INTEGER,
    processing_time_ms INTEGER
) AS $$
DECLARE
    log_entry JSONB;
    success_count INTEGER := 0;
    error_count INTEGER := 0;
    start_time TIMESTAMP := clock_timestamp();
    temp_table_name TEXT := 'temp_log_batch_' || extract(epoch from now())::INTEGER;
BEGIN

    -- Create temporary table for batch processing
    EXECUTE format('
        CREATE TEMP TABLE %I (
            application_name VARCHAR(100),
            log_level VARCHAR(20),
            timestamp TIMESTAMP,
            message TEXT,
            user_id BIGINT,
            session_id VARCHAR(100),
            request_id VARCHAR(100),
            duration_ms INTEGER,
            memory_usage_mb DECIMAL(8,2),
            thread_id INTEGER,
            hostname VARCHAR(100)
        )', temp_table_name);

    -- Process each log entry individually (inefficient for high volume)
    FOREACH log_entry IN ARRAY log_entries
    LOOP
        BEGIN
            EXECUTE format('
                INSERT INTO %I (
                    application_name, log_level, timestamp, message,
                    user_id, session_id, request_id, duration_ms,
                    memory_usage_mb, thread_id, hostname
                ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)',
                temp_table_name
            ) USING 
                log_entry->>'application_name',
                log_entry->>'log_level',
                (log_entry->>'timestamp')::TIMESTAMP,
                log_entry->>'message',
                (log_entry->>'user_id')::BIGINT,
                log_entry->>'session_id',
                log_entry->>'request_id',
                (log_entry->>'duration_ms')::INTEGER,
                (log_entry->>'memory_usage_mb')::DECIMAL(8,2),
                (log_entry->>'thread_id')::INTEGER,
                log_entry->>'hostname';

            success_count := success_count + 1;

        EXCEPTION WHEN OTHERS THEN
            error_count := error_count + 1;
            -- Limited error handling for high-frequency operations
            CONTINUE;
        END;
    END LOOP;

    -- Batch insert into main table (still limited by indexing overhead)
    EXECUTE format('
        INSERT INTO application_logs (
            application_name, log_level, timestamp, message,
            user_id, session_id, request_id, duration_ms,
            memory_usage_mb, thread_id, hostname
        )
        SELECT * FROM %I', temp_table_name);

    -- Cleanup
    EXECUTE format('DROP TABLE %I', temp_table_name);

    RETURN QUERY SELECT 
        success_count,
        error_count,
        (EXTRACT(EPOCH FROM clock_timestamp() - start_time) * 1000)::INTEGER;
END;
$$ LANGUAGE plpgsql;

-- Real-time event streaming table with performance limitations
CREATE TABLE event_stream (
    event_id BIGSERIAL PRIMARY KEY,
    event_type VARCHAR(100) NOT NULL,
    event_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    user_id BIGINT,
    session_id VARCHAR(100),

    -- Event payload (limited JSON support)
    event_data JSONB,

    -- Stream metadata
    stream_partition VARCHAR(50),
    sequence_number BIGINT,

    -- Processing metadata
    processing_status VARCHAR(20) DEFAULT 'pending',
    processed_at TIMESTAMP,
    processor_id VARCHAR(100)
);

-- Complex trigger for sequence number management (needs its own sequence)
CREATE SEQUENCE IF NOT EXISTS event_stream_sequence;

CREATE OR REPLACE FUNCTION update_sequence_number()
RETURNS TRIGGER AS $$
BEGIN
    NEW.sequence_number := nextval('event_stream_sequence');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER event_stream_sequence_trigger
    BEFORE INSERT ON event_stream
    FOR EACH ROW
    EXECUTE FUNCTION update_sequence_number();

-- Performance monitoring with complex aggregations
WITH log_performance_analysis AS (
    SELECT 
        application_name,
        log_level,
        DATE_TRUNC('hour', timestamp) as hour_bucket,
        COUNT(*) as log_count,

        -- Complex aggregations causing performance issues
        AVG(CASE WHEN duration_ms IS NOT NULL THEN duration_ms ELSE NULL END) as avg_duration,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95_duration,
        AVG(CASE WHEN memory_usage_mb IS NOT NULL THEN memory_usage_mb ELSE NULL END) as avg_memory_usage,

        -- Storage analysis
        SUM(LENGTH(message)) as total_message_bytes,
        AVG(LENGTH(message)) as avg_message_length,

        -- Performance degradation over time
        COUNT(*) / EXTRACT(EPOCH FROM INTERVAL '1 hour') as logs_per_second

    FROM application_logs
    WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    GROUP BY application_name, log_level, DATE_TRUNC('hour', timestamp)
),
storage_growth_analysis AS (
    -- Complex storage growth calculations
    SELECT 
        DATE_TRUNC('day', timestamp) as day_bucket,
        COUNT(*) as daily_logs,
        SUM(LENGTH(message) + COALESCE(LENGTH(session_id), 0) + COALESCE(LENGTH(request_id), 0)) as daily_storage_bytes,

        -- Growth projections (expensive calculations)
        LAG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('day', timestamp)) as prev_day_logs,
        LAG(SUM(LENGTH(message) + COALESCE(LENGTH(session_id), 0) + COALESCE(LENGTH(request_id), 0))) OVER (ORDER BY DATE_TRUNC('day', timestamp)) as prev_day_bytes

    FROM application_logs
    WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '30 days'
    GROUP BY DATE_TRUNC('day', timestamp)
)
SELECT 
    lpa.application_name,
    lpa.log_level,
    lpa.hour_bucket,
    lpa.log_count,

    -- Performance metrics
    ROUND(lpa.avg_duration, 2) as avg_duration_ms,
    ROUND(lpa.p95_duration::NUMERIC, 2) as p95_duration_ms,
    ROUND(lpa.logs_per_second::NUMERIC, 2) as throughput_logs_per_second,

    -- Storage efficiency
    ROUND(lpa.total_message_bytes / 1024.0 / 1024.0, 2) as message_storage_mb,
    ROUND(lpa.avg_message_length, 0) as avg_message_length,

    -- Growth indicators
    sga.daily_logs,
    ROUND(sga.daily_storage_bytes / 1024.0 / 1024.0, 2) as daily_storage_mb,

    -- Growth rate calculations (complex and expensive)
    CASE 
        WHEN sga.prev_day_logs IS NOT NULL THEN
            ROUND(((sga.daily_logs - sga.prev_day_logs) / sga.prev_day_logs::DECIMAL * 100), 1)
        ELSE NULL
    END as daily_log_growth_percent,

    CASE 
        WHEN sga.prev_day_bytes IS NOT NULL THEN
            ROUND(((sga.daily_storage_bytes - sga.prev_day_bytes) / sga.prev_day_bytes::DECIMAL * 100), 1)
        ELSE NULL
    END as daily_storage_growth_percent

FROM log_performance_analysis lpa
JOIN storage_growth_analysis sga ON DATE_TRUNC('day', lpa.hour_bucket) = sga.day_bucket
WHERE lpa.log_count > 0
ORDER BY lpa.application_name, lpa.log_level, lpa.hour_bucket DESC;

-- Traditional logging approach problems:
-- 1. Unbounded storage growth requiring complex partitioning and archival
-- 2. Performance degradation as table size increases due to indexing overhead
-- 3. Complex maintenance procedures for partition management and cleanup
-- 4. High-frequency writes causing lock contention and performance bottlenecks
-- 5. Expensive aggregation queries for real-time monitoring and analytics
-- 6. Limited support for truly high-throughput event streaming scenarios
-- 7. Complex error handling and recovery mechanisms for batch processing
-- 8. Storage bloat and fragmentation issues requiring regular maintenance
-- 9. No guarantee of insertion order preservation under concurrent access
-- 10. Resource-intensive cleanup and archival processes impacting performance
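
In contrast to the unbounded growth described in the first problem above, a capped collection's storage is fixed when it is created, so capacity planning comes down to one up-front estimate. The following is a rough sketch of that arithmetic. The helper name `estimateCappedSize`, its parameters, and the 20% headroom default are all hypothetical choices for illustration, not part of any MongoDB API.

```javascript
// Rough capacity estimate for a capped collection (illustrative helper;
// the function name and parameters are hypothetical, not a MongoDB API).
function estimateCappedSize({
  avgDocBytes,          // expected average document size in bytes
  writesPerSecond,      // sustained insert rate
  retentionSeconds,     // how much history should stay available
  headroomPercent = 20  // safety margin for larger-than-average documents
}) {
  const docs = Math.ceil(writesPerSecond * retentionSeconds);
  const bytes = Math.ceil((docs * avgDocBytes * (100 + headroomPercent)) / 100);
  // `size` and `max` mirror the option names accepted by createCollection.
  return { size: bytes, max: docs };
}

// Example: ~512-byte log entries at 200 writes/s, keeping about one hour.
const estimate = estimateCappedSize({
  avgDocBytes: 512,
  writesPerSecond: 200,
  retentionSeconds: 3600
});
console.log(estimate); // { size: 442368000, max: 720000 }
```

The result could be passed as the `size` and `max` options when creating the collection. Actual document sizes should be measured on real data, since BSON overhead and field names affect the average.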

MongoDB capped collections provide elegant fixed-size, high-performance data structures for logging and streaming:

// MongoDB Capped Collections - high-performance logging and streaming with automatic size management
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('high_performance_logging');

// Comprehensive MongoDB Capped Collections Manager
class CappedCollectionsManager {
  constructor(db, config = {}) {
    this.db = db;
    this.config = {
      // Default capped collection configurations
      defaultLogSize: config.defaultLogSize || 100 * 1024 * 1024, // 100MB
      defaultMaxDocuments: config.defaultMaxDocuments || 50000,

      // Performance optimization settings
      enableBulkOperations: config.enableBulkOperations !== false,
      enableAsyncOperations: config.enableAsyncOperations !== false,
      batchSize: config.batchSize || 1000,
      writeBufferSize: config.writeBufferSize || 16384,

      // Collection management
      enablePerformanceMonitoring: config.enablePerformanceMonitoring !== false,
      enableAutoOptimization: config.enableAutoOptimization !== false,
      enableMetricsCollection: config.enableMetricsCollection !== false,

      // Write concern and consistency
      writeConcern: config.writeConcern || {
        w: 1, // Acknowledge from the primary only, for high-throughput logging
        j: false, // Do not wait for the journal flush before acknowledging (faster, less durable)
        wtimeout: 1000
      },

      // Advanced features
      enableTailableCursors: config.enableTailableCursors !== false,
      enableChangeStreams: config.enableChangeStreams !== false,
      enableRealTimeProcessing: config.enableRealTimeProcessing !== false,

      // Resource management
      maxConcurrentTails: config.maxConcurrentTails || 10,
      tailCursorTimeout: config.tailCursorTimeout || 30000,
      processingThreads: config.processingThreads || 4
    };

    // Collection references
    this.cappedCollections = new Map();
    this.tailableCursors = new Map();
    this.performanceMetrics = new Map();
    this.processingStats = {
      totalWrites: 0,
      totalReads: 0,
      averageWriteTime: 0,
      averageReadTime: 0,
      errorCount: 0
    };

    // Initialization is asynchronous; callers should await `ready` before writing
    this.ready = this.initializeCappedCollections();
  }

  async initializeCappedCollections() {
    console.log('Initializing capped collections for high-performance logging...');

    try {
      // Application logging with different retention strategies
      await this.createOptimizedCappedCollection('application_logs', {
        size: 200 * 1024 * 1024, // 200MB
        max: 100000, // Maximum 100k documents
        description: 'High-frequency application logs with automatic rotation'
      });

      // Real-time event streaming
      await this.createOptimizedCappedCollection('event_stream', {
        size: 500 * 1024 * 1024, // 500MB
        max: 250000, // Maximum 250k events
        description: 'Real-time event streaming with insertion order preservation'
      });

      // Performance metrics collection
      await this.createOptimizedCappedCollection('performance_metrics', {
        size: 100 * 1024 * 1024, // 100MB
        max: 50000, // Maximum 50k metric entries
        description: 'System performance metrics with circular buffer behavior'
      });

      // Audit trail with longer retention
      await this.createOptimizedCappedCollection('audit_trail', {
        size: 1024 * 1024 * 1024, // 1GB
        max: 1000000, // Maximum 1M audit entries
        description: 'Security audit trail with extended retention'
      });

      // User activity stream
      await this.createOptimizedCappedCollection('user_activity_stream', {
        size: 300 * 1024 * 1024, // 300MB
        max: 150000, // Maximum 150k activities
        description: 'User activity tracking with real-time processing'
      });

      // System health monitoring
      await this.createOptimizedCappedCollection('system_health_logs', {
        size: 150 * 1024 * 1024, // 150MB
        max: 75000, // Maximum 75k health checks
        description: 'System health monitoring with high-frequency updates'
      });

      // Initialize performance monitoring
      if (this.config.enablePerformanceMonitoring) {
        await this.setupPerformanceMonitoring();
      }

      // Setup real-time processing
      if (this.config.enableRealTimeProcessing) {
        await this.initializeRealTimeProcessing();
      }

      console.log('All capped collections initialized successfully');

    } catch (error) {
      console.error('Error initializing capped collections:', error);
      throw error;
    }
  }

  async createOptimizedCappedCollection(collectionName, options) {
    console.log(`Creating optimized capped collection: ${collectionName}...`);

    try {
      // Check if collection already exists
      const collections = await this.db.listCollections({ name: collectionName }).toArray();

      if (collections.length > 0) {
        // Collection exists - verify it's capped and get reference
        const collectionInfo = collections[0];
        if (!collectionInfo.options.capped) {
          throw new Error(`Collection ${collectionName} exists but is not capped`);
        }

        console.log(`Existing capped collection ${collectionName} found`);
        const collection = this.db.collection(collectionName);
        this.cappedCollections.set(collectionName, {
          collection: collection,
          options: collectionInfo.options,
          description: options.description
        });

      } else {
        // Create new capped collection
        const collection = await this.db.createCollection(collectionName, {
          capped: true,
          size: options.size,
          max: options.max,

          // Storage engine options for performance
          storageEngine: {
            wiredTiger: {
              configString: 'block_compressor=snappy' // Enable compression
            }
          }
        });

        // Create optimized indexes for capped collections
        await this.createCappedCollectionIndexes(collection, collectionName);

        this.cappedCollections.set(collectionName, {
          collection: collection,
          options: { capped: true, size: options.size, max: options.max },
          description: options.description,
          created: new Date()
        });

        console.log(`Created capped collection ${collectionName}: ${options.size} bytes, max ${options.max} documents`);
      }

    } catch (error) {
      console.error(`Error creating capped collection ${collectionName}:`, error);
      throw error;
    }
  }

  async createCappedCollectionIndexes(collection, collectionName) {
    console.log(`Creating optimized indexes for ${collectionName}...`);

    try {
      // Most capped collections benefit from a timestamp index for range queries
      // Note: capped collections preserve insertion order, so natural-order scans need no index,
      // and every secondary index adds overhead to each write
      await collection.createIndex(
        { timestamp: -1 }, 
        { background: true, name: 'timestamp_desc' }
      );

      // Collection-specific indexes based on common query patterns
      switch (collectionName) {
        case 'application_logs':
          await collection.createIndexes([
            { key: { level: 1, timestamp: -1 }, background: true, name: 'level_timestamp' },
            { key: { application: 1, timestamp: -1 }, background: true, name: 'app_timestamp' },
            { key: { userId: 1 }, background: true, sparse: true, name: 'user_sparse' }
          ]);
          break;

        case 'event_stream':
          await collection.createIndexes([
            { key: { eventType: 1, timestamp: -1 }, background: true, name: 'event_type_timestamp' },
            { key: { userId: 1, timestamp: -1 }, background: true, sparse: true, name: 'user_timeline' },
            { key: { sessionId: 1 }, background: true, sparse: true, name: 'session_events' }
          ]);
          break;

        case 'performance_metrics':
          await collection.createIndexes([
            { key: { metricName: 1, timestamp: -1 }, background: true, name: 'metric_timeline' },
            { key: { hostname: 1, timestamp: -1 }, background: true, name: 'host_metrics' }
          ]);
          break;

        case 'audit_trail':
          await collection.createIndexes([
            { key: { action: 1, timestamp: -1 }, background: true, name: 'action_timeline' },
            { key: { userId: 1, timestamp: -1 }, background: true, name: 'user_audit' },
            { key: { resourceId: 1 }, background: true, sparse: true, name: 'resource_audit' }
          ]);
          break;
      }

    } catch (error) {
      console.error(`Error creating indexes for ${collectionName}:`, error);
      // Don't fail initialization for index creation issues
    }
  }

  async logApplicationEvent(application, level, message, metadata = {}) {
    const startTime = Date.now();

    try {
      const logCollection = this.cappedCollections.get('application_logs').collection;

      const logDocument = {
        timestamp: new Date(),
        application: application,
        level: level.toUpperCase(),
        message: message,

        // Enhanced metadata
        ...metadata,

        // System context
        hostname: metadata.hostname || require('os').hostname(),
        processId: process.pid,
        threadId: metadata.threadId,

        // Performance context
        memoryUsage: metadata.includeMemoryUsage ? process.memoryUsage() : undefined,

        // Request context
        requestId: metadata.requestId,
        sessionId: metadata.sessionId,
        userId: metadata.userId,

        // Application context
        version: metadata.version,
        environment: metadata.environment || process.env.NODE_ENV,

        // Timing information
        duration: metadata.duration,

        // Additional structured data
        tags: metadata.tags || [],
        customData: metadata.customData
      };

      // High-performance insert with minimal write concern
      const result = await logCollection.insertOne(logDocument, {
        writeConcern: this.config.writeConcern
      });

      // Update performance metrics
      this.updatePerformanceMetrics('application_logs', 'write', Date.now() - startTime);

      return {
        insertedId: result.insertedId,
        collection: 'application_logs',
        processingTime: Date.now() - startTime,
        logLevel: level,
        success: true
      };

    } catch (error) {
      console.error('Error logging application event:', error);
      this.processingStats.errorCount++;

      return {
        success: false,
        error: error.message,
        collection: 'application_logs',
        processingTime: Date.now() - startTime
      };
    }
  }

  async streamEvent(eventType, eventData, options = {}) {
    const startTime = Date.now();

    try {
      const streamCollection = this.cappedCollections.get('event_stream').collection;

      const eventDocument = {
        timestamp: new Date(),
        eventType: eventType,
        eventData: eventData,

        // Event metadata
        eventId: options.eventId || new ObjectId(),
        correlationId: options.correlationId,
        causationId: options.causationId,

        // User and session context
        userId: options.userId,
        sessionId: options.sessionId,

        // System context
        source: options.source || 'application',
        hostname: options.hostname || require('os').hostname(),

        // Event processing metadata
        priority: options.priority || 'normal',
        tags: options.tags || [],

        // Real-time processing flags
        requiresProcessing: options.requiresProcessing || false,
        processingStatus: options.processingStatus || 'pending',

        // Event relationships
        parentEventId: options.parentEventId,
        childEventIds: options.childEventIds || [],

        // Timing and sequence
        occurredAt: options.occurredAt || new Date(),
        sequenceNumber: options.sequenceNumber,

        // Custom event payload
        payload: eventData
      };

      // Insert event into capped collection
      const result = await streamCollection.insertOne(eventDocument, {
        writeConcern: this.config.writeConcern
      });

      // Trigger real-time processing if enabled
      if (this.config.enableRealTimeProcessing && eventDocument.requiresProcessing) {
        await this.triggerRealTimeProcessing(eventDocument);
      }

      // Update metrics
      this.updatePerformanceMetrics('event_stream', 'write', Date.now() - startTime);

      return {
        insertedId: result.insertedId,
        eventId: eventDocument.eventId,
        collection: 'event_stream',
        processingTime: Date.now() - startTime,
        success: true,
        sequenceOrder: result.insertedId // ObjectId is roughly time-ordered, not a strict sequence
      };

    } catch (error) {
      console.error('Error streaming event:', error);
      this.processingStats.errorCount++;

      return {
        success: false,
        error: error.message,
        collection: 'event_stream',
        processingTime: Date.now() - startTime
      };
    }
  }

  async recordPerformanceMetric(metricName, value, metadata = {}) {
    const startTime = Date.now();

    try {
      const metricsCollection = this.cappedCollections.get('performance_metrics').collection;

      const metricDocument = {
        timestamp: new Date(),
        metricName: metricName,
        value: value,

        // Metric metadata
        unit: metadata.unit || 'count',
        type: metadata.type || 'gauge', // gauge, counter, histogram, timer

        // System context
        hostname: metadata.hostname || require('os').hostname(),
        service: metadata.service || 'unknown',
        environment: metadata.environment || process.env.NODE_ENV,

        // Metric dimensions
        tags: metadata.tags || {},
        dimensions: metadata.dimensions || {},

        // Statistical data
        min: metadata.min,
        max: metadata.max,
        avg: metadata.avg,
        count: metadata.count,
        sum: metadata.sum,

        // Performance context
        duration: metadata.duration,
        sampleRate: metadata.sampleRate || 1.0,

        // Additional metadata
        source: metadata.source || 'system',
        category: metadata.category || 'performance',
        priority: metadata.priority || 'normal',

        // Custom data
        customMetadata: metadata.customMetadata
      };

      const result = await metricsCollection.insertOne(metricDocument, {
        writeConcern: this.config.writeConcern
      });

      // Update internal metrics
      this.updatePerformanceMetrics('performance_metrics', 'write', Date.now() - startTime);

      return {
        insertedId: result.insertedId,
        collection: 'performance_metrics',
        metricName: metricName,
        processingTime: Date.now() - startTime,
        success: true
      };

    } catch (error) {
      console.error('Error recording performance metric:', error);
      this.processingStats.errorCount++;

      return {
        success: false,
        error: error.message,
        collection: 'performance_metrics',
        processingTime: Date.now() - startTime
      };
    }
  }

  async createTailableCursor(collectionName, filter = {}, options = {}) {
    console.log(`Creating tailable cursor for ${collectionName}...`);

    try {
      const cappedCollection = this.cappedCollections.get(collectionName);
      if (!cappedCollection) {
        throw new Error(`Capped collection ${collectionName} not found`);
      }

      const collection = cappedCollection.collection;

      // Configure tailable cursor options
      const cursorOptions = {
        tailable: true,
        awaitData: true,
        noCursorTimeout: true,
        // maxAwaitTimeMS bounds how long the server waits for new data on each fetch
        maxAwaitTimeMS: options.maxAwaitTimeMS || this.config.tailCursorTimeout,
        batchSize: options.batchSize || 100
      };

      // Resume after a known _id when one is supplied; otherwise the cursor starts
      // from the oldest document. Copy the filter so the caller's object is not mutated.
      // Note: startFromEnd has no effect unless startAfter is also provided.
      if (options.startAfter) {
        filter = { ...filter, _id: { $gt: options.startAfter } };
      }
      const cursor = collection.find(filter, cursorOptions);

      // Store cursor for management
      const cursorId = options.cursorId || new ObjectId().toString();
      this.tailableCursors.set(cursorId, {
        cursor: cursor,
        collection: collectionName,
        filter: filter,
        options: cursorOptions,
        created: new Date(),
        active: true
      });

      console.log(`Tailable cursor ${cursorId} created for ${collectionName}`);

      return {
        cursorId: cursorId,
        cursor: cursor,
        collection: collectionName,
        success: true
      };

    } catch (error) {
      console.error(`Error creating tailable cursor for ${collectionName}:`, error);
      return {
        success: false,
        error: error.message,
        collection: collectionName
      };
    }
  }

  async processTailableCursor(cursorId, processingFunction, options = {}) {
    console.log(`Starting tailable cursor processing for ${cursorId}...`);

    try {
      const cursorInfo = this.tailableCursors.get(cursorId);
      if (!cursorInfo) {
        throw new Error(`Tailable cursor ${cursorId} not found`);
      }

      const cursor = cursorInfo.cursor;
      const processingStats = {
        documentsProcessed: 0,
        errors: 0,
        startTime: new Date(),
        lastProcessedAt: null
      };

      // Process documents as they arrive
      while (cursorInfo.active && await cursor.hasNext()) {
        try {
          const document = await cursor.next();

          if (document) {
            // Process the document
            const processingStartTime = Date.now();
            await processingFunction(document, cursorInfo.collection);

            // Update statistics
            processingStats.documentsProcessed++;
            processingStats.lastProcessedAt = new Date();

            // Update performance metrics
            this.updatePerformanceMetrics(
              cursorInfo.collection, 
              'tail_process', 
              Date.now() - processingStartTime
            );

            // Batch processing optimization
            if (options.batchProcessing && processingStats.documentsProcessed % options.batchSize === 0) {
              await this.flushBatchProcessing(cursorId, options);
            }
          }

        } catch (processingError) {
          console.error(`Error processing document from cursor ${cursorId}:`, processingError);
          processingStats.errors++;

          // Handle processing errors based on configuration
          if (options.stopOnError) {
            break;
          }
        }
      }

      console.log(`Tailable cursor processing completed for ${cursorId}:`, processingStats);

      return {
        success: true,
        cursorId: cursorId,
        processingStats: processingStats
      };

    } catch (error) {
      console.error(`Error in tailable cursor processing for ${cursorId}:`, error);
      return {
        success: false,
        error: error.message,
        cursorId: cursorId
      };
    }
  }

  async bulkInsertLogs(collectionName, documents, options = {}) {
    console.log(`Performing bulk insert to ${collectionName} with ${documents.length} documents...`);
    const startTime = Date.now();

    try {
      const cappedCollection = this.cappedCollections.get(collectionName);
      if (!cappedCollection) {
        throw new Error(`Capped collection ${collectionName} not found`);
      }

      const collection = cappedCollection.collection;

      // Prepare documents with consistent structure; generate the batch ID once
      // so every document in the batch shares the same value
      const batchId = options.batchId || new ObjectId();
      const preparedDocuments = documents.map((doc, index) => ({
        ...doc,
        timestamp: doc.timestamp || new Date(),
        batchId: batchId,
        batchIndex: index,
        bulkInsertMetadata: {
          batchSize: documents.length,
          insertedAt: new Date(),
          source: options.source || 'bulk_operation'
        }
      }));

      // Configure bulk insert options for maximum performance
      const insertOptions = {
        ordered: options.ordered || false, // Unordered for better performance
        writeConcern: options.writeConcern || this.config.writeConcern,
        bypassDocumentValidation: options.bypassValidation || false
      };

      // Execute bulk insert
      const result = await collection.insertMany(preparedDocuments, insertOptions);

      // Update performance metrics
      const processingTime = Date.now() - startTime;
      this.updatePerformanceMetrics(collectionName, 'bulk_write', processingTime);
      this.processingStats.totalWrites += result.insertedCount;

      console.log(`Bulk insert completed: ${result.insertedCount} documents in ${processingTime}ms`);

      return {
        success: true,
        collection: collectionName,
        insertedCount: result.insertedCount,
        insertedIds: Object.values(result.insertedIds),
        processingTime: processingTime,
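        throughput: Math.round((result.insertedCount / Math.max(processingTime, 1)) * 1000), // docs/second (guards against a 0 ms duration)
        batchId: preparedDocuments[0].batchId // ID actually stored on the inserted documents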
      };

    } catch (error) {
      console.error(`Error in bulk insert to ${collectionName}:`, error);
      this.processingStats.errorCount++;

      return {
        success: false,
        error: error.message,
        collection: collectionName,
        processingTime: Date.now() - startTime
      };
    }
  }

  async queryRecentDocuments(collectionName, filter = {}, options = {}) {
    const startTime = Date.now();

    try {
      const cappedCollection = this.cappedCollections.get(collectionName);
      if (!cappedCollection) {
        throw new Error(`Capped collection ${collectionName} not found`);
      }

      const collection = cappedCollection.collection;

      // Configure query options for optimal performance
      const queryOptions = {
        sort: { $natural: options.reverse ? 1 : -1 }, // Natural order: newest first by default, oldest first when options.reverse is set
        limit: options.limit || 1000,
        projection: options.projection || {},
        maxTimeMS: options.maxTimeMS || 5000,
        batchSize: options.batchSize || 100
      };

      // Add time range filter if specified (copy the filter so the caller's object is not mutated)
      if (options.timeRange) {
        filter = {
          ...filter,
          timestamp: {
            $gte: options.timeRange.start,
            $lte: options.timeRange.end || new Date()
          }
        };
      }

      // Execute query
      const documents = await collection.find(filter, queryOptions).toArray();

      // Update performance metrics
      const processingTime = Date.now() - startTime;
      this.updatePerformanceMetrics(collectionName, 'read', processingTime);
      this.processingStats.totalReads += documents.length;

      return {
        success: true,
        collection: collectionName,
        documents: documents,
        count: documents.length,
        processingTime: processingTime,
        query: filter,
        options: queryOptions
      };

    } catch (error) {
      console.error(`Error querying ${collectionName}:`, error);
      this.processingStats.errorCount++;

      return {
        success: false,
        error: error.message,
        collection: collectionName,
        processingTime: Date.now() - startTime
      };
    }
  }

  updatePerformanceMetrics(collectionName, operationType, duration) {
    if (!this.config.enablePerformanceMonitoring) return;

    const key = `${collectionName}_${operationType}`;

    if (!this.performanceMetrics.has(key)) {
      this.performanceMetrics.set(key, {
        totalOperations: 0,
        totalTime: 0,
        averageTime: 0,
        minTime: Infinity,
        maxTime: 0,
        lastOperation: null
      });
    }

    const metrics = this.performanceMetrics.get(key);

    metrics.totalOperations++;
    metrics.totalTime += duration;
    metrics.averageTime = metrics.totalTime / metrics.totalOperations;
    metrics.minTime = Math.min(metrics.minTime, duration);
    metrics.maxTime = Math.max(metrics.maxTime, duration);
    metrics.lastOperation = new Date();

    // Update global stats as an exponentially weighted moving average
    // (the latest operation counts for half); seed with the first sample
    if (operationType === 'write' || operationType === 'bulk_write') {
      this.processingStats.averageWriteTime = 
        ((this.processingStats.averageWriteTime || duration) + duration) / 2;
    } else if (operationType === 'read') {
      this.processingStats.averageReadTime = 
        ((this.processingStats.averageReadTime || duration) + duration) / 2;
    }
  }

  async getCollectionStats() {
    console.log('Gathering capped collection statistics...');

    const stats = {};

    for (const [collectionName, cappedInfo] of this.cappedCollections.entries()) {
      try {
        const collection = cappedInfo.collection;

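        // Get MongoDB collection stats
        // (collection.stats() is deprecated in recent Node.js driver releases;
        // the $collStats aggregation stage is the documented replacement)
        const mongoStats = await collection.stats();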

        // Get performance metrics
        const performanceKey = `${collectionName}_write`;
        const performanceMetrics = this.performanceMetrics.get(performanceKey) || {};

        stats[collectionName] = {
          // Collection configuration
          configuration: cappedInfo.options,
          description: cappedInfo.description,
          created: cappedInfo.created,

          // MongoDB stats
          size: mongoStats.size,
          storageSize: mongoStats.storageSize,
          totalIndexSize: mongoStats.totalIndexSize,
          count: mongoStats.count,
          avgObjSize: mongoStats.avgObjSize,
          maxSize: mongoStats.maxSize,
          max: mongoStats.max,

          // Utilization metrics
          sizeUtilization: (mongoStats.size / mongoStats.maxSize * 100).toFixed(2) + '%',
          countUtilization: mongoStats.max ? (mongoStats.count / mongoStats.max * 100).toFixed(2) + '%' : 'N/A',

          // Performance metrics
          averageWriteTime: performanceMetrics.averageTime || 0,
          totalOperations: performanceMetrics.totalOperations || 0,
          minWriteTime: performanceMetrics.minTime === Infinity ? 0 : performanceMetrics.minTime || 0,
          maxWriteTime: performanceMetrics.maxTime || 0,
          lastOperation: performanceMetrics.lastOperation,

          // Health indicators
          isNearCapacity: mongoStats.size / mongoStats.maxSize > 0.8,
          hasRecentActivity: performanceMetrics.lastOperation && 
            (new Date() - performanceMetrics.lastOperation) < 300000, // 5 minutes

          // Estimated metrics
          estimatedDocumentsPerHour: this.estimateDocumentsPerHour(performanceMetrics),
          estimatedTimeToCapacity: this.estimateTimeToCapacity(mongoStats, performanceMetrics)
        };

      } catch (error) {
        stats[collectionName] = {
          error: error.message,
          available: false
        };
      }
    }

    return {
      collections: stats,
      globalStats: this.processingStats,
      summary: {
        totalCollections: this.cappedCollections.size,
        totalActiveCursors: this.tailableCursors.size,
        totalMemoryUsage: this.estimateMemoryUsage(),
        uptime: new Date() - (this.startTime || new Date()) // milliseconds; 0 if startTime was never set
      }
    };
  }

  estimateDocumentsPerHour(performanceMetrics) {
    if (!performanceMetrics || !performanceMetrics.lastOperation) return 0;

    const hoursActive = (new Date() - (this.startTime || new Date())) / (1000 * 60 * 60);
    if (hoursActive === 0) return 0;

    return Math.round((performanceMetrics.totalOperations || 0) / hoursActive);
  }

  estimateTimeToCapacity(mongoStats, performanceMetrics) {
    if (!performanceMetrics || !performanceMetrics.totalOperations) return 'Unknown';

    const remainingSpace = mongoStats.maxSize - mongoStats.size;
    const averageDocSize = mongoStats.avgObjSize || 1000;
    const remainingDocuments = Math.floor(remainingSpace / averageDocSize);

    const documentsPerHour = this.estimateDocumentsPerHour(performanceMetrics);
    if (documentsPerHour === 0) return 'Unknown';

    const hoursToCapacity = remainingDocuments / documentsPerHour;

    if (hoursToCapacity < 24) {
      return `${Math.round(hoursToCapacity)} hours`;
    } else {
      return `${Math.round(hoursToCapacity / 24)} days`;
    }
  }

  estimateMemoryUsage() {
    // Rough estimate based on active cursors and performance metrics
    const baseMem = 50 * 1024 * 1024; // 50MB base
    const cursorMem = this.tailableCursors.size * 1024 * 1024; // 1MB per cursor
    const metricsMem = this.performanceMetrics.size * 10 * 1024; // 10KB per metric set

    return baseMem + cursorMem + metricsMem;
  }

  async shutdown() {
    console.log('Shutting down capped collections manager...');

    // Close all tailable cursors
    for (const [cursorId, cursorInfo] of this.tailableCursors.entries()) {
      try {
        cursorInfo.active = false;
        await cursorInfo.cursor.close();
        console.log(`Closed tailable cursor: ${cursorId}`);
      } catch (error) {
        console.error(`Error closing cursor ${cursorId}:`, error);
      }
    }

    // Clear collections and metrics
    this.cappedCollections.clear();
    this.tailableCursors.clear();
    this.performanceMetrics.clear();

    console.log('Capped collections manager shutdown complete');
  }
}

// Benefits of MongoDB Capped Collections:
// - Fixed-size storage with automatic old document removal (circular buffer behavior)
// - Guaranteed insertion order preservation for event sequencing
// - High-performance sequential writes with minimal index maintenance (only the default _id index unless more are added)
// - Optimal read performance for recent document queries
// - Built-in document rotation without external management
// - Tailable cursors for real-time data streaming
// - Memory-efficient operations with predictable resource usage
// - No fragmentation or storage bloat issues
// - Ideal for logging, event streaming, and real-time analytics
// - SQL-compatible operations through QueryLeaf integration

module.exports = {
  CappedCollectionsManager
};

Understanding MongoDB Capped Collections Architecture
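
The circular-buffer behavior of a capped collection can be illustrated with a small in-memory model. This is only a sketch of the observable semantics (fixed capacity, oldest-first eviction, insertion-order reads); the class name `CappedBufferModel` and its methods are hypothetical and say nothing about how the storage engine is implemented.

```javascript
// Illustrative in-memory model of capped-collection semantics.
// Not MongoDB code: it only mimics "fixed size, oldest document evicted first".
class CappedBufferModel {
  constructor(maxDocuments) {
    if (!Number.isInteger(maxDocuments) || maxDocuments <= 0) {
      throw new RangeError('maxDocuments must be a positive integer');
    }
    this.maxDocuments = maxDocuments;
    this.documents = [];
  }

  insert(doc) {
    this.documents.push(doc);
    // Once the cap is exceeded, the oldest documents are dropped first.
    while (this.documents.length > this.maxDocuments) {
      this.documents.shift();
    }
  }

  // Oldest to newest, analogous to sorting by { $natural: 1 }.
  naturalOrder() {
    return [...this.documents];
  }

  // Newest first, analogous to { $natural: -1 } combined with a limit.
  recent(limit) {
    if (limit <= 0) return [];
    return this.documents.slice(-limit).reverse();
  }
}

const model = new CappedBufferModel(3);
for (const n of [1, 2, 3, 4, 5]) model.insert({ n });
console.log(model.naturalOrder().map((d) => d.n)); // [ 3, 4, 5 ]
console.log(model.recent(2).map((d) => d.n));      // [ 5, 4 ]
```

A real capped collection is bounded primarily by byte size, with the document count as an optional second limit; the model above only tracks the count to keep the idea visible.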

Advanced High-Performance Logging and Streaming Patterns

Implement sophisticated capped collection strategies for production MongoDB deployments:

// Production-ready MongoDB capped collections with advanced optimization and real-time processing
class ProductionCappedCollectionsManager extends CappedCollectionsManager {
  constructor(db, productionConfig) {
    super(db, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      enableShardedDeployment: true,
      enableReplicationOptimization: true,
      enableAdvancedMonitoring: true,
      enableAutomaticSizing: true,
      enableCompression: true,
      enableRealTimeAlerts: true
    };

    this.setupProductionOptimizations();
    this.initializeAdvancedMonitoring();
    this.setupAutomaticManagement();
  }

  async implementShardedCappedCollections(collectionName, shardingStrategy) {
    console.log(`Implementing sharded capped collections for ${collectionName}...`);

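    // Caveat: MongoDB does not allow a capped collection to be sharded, so this
    // configuration can only apply to an uncapped sharded collection (for example
    // one bounded by a TTL index) or to separate capped collections per partition.
    const shardingConfig = {
      // Shard or partition key design
      shardKey: shardingStrategy.shardKey || { timestamp: 1, hostname: 1 },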

      // Chunk size optimization for high-throughput writes
      chunkSizeMB: shardingStrategy.chunkSize || 16,

      // Balancing strategy
      enableAutoSplit: true,
      enableBalancer: true,
      balancerWindowStart: "01:00",
      balancerWindowEnd: "06:00",

      // Write distribution
      enableEvenWriteDistribution: true,
      monitorHotShards: true,
      automaticRebalancing: true
    };

    return await this.deployShardedCappedCollection(collectionName, shardingConfig);
  }

  async setupAdvancedRealTimeProcessing() {
    console.log('Setting up advanced real-time processing for capped collections...');

    const processingPipeline = {
      // Stream processing configuration
      streamProcessing: {
        enableChangeStreams: true,
        enableAggregationPipelines: true,
        enableParallelProcessing: true,
        maxConcurrentProcessors: 8
      },

      // Real-time analytics
      realTimeAnalytics: {
        enableWindowedAggregations: true,
        windowSizes: ['1m', '5m', '15m', '1h'],
        enableTrendDetection: true,
        enableAnomalyDetection: true
      },

      // Event correlation
      eventCorrelation: {
        enableEventMatching: true,
        correlationTimeWindow: 300000, // 5 minutes
        enableComplexEventProcessing: true
      }
    };

    return await this.deployRealTimeProcessing(processingPipeline);
  }

  async implementAutomaticCapacityManagement() {
    console.log('Implementing automatic capacity management for capped collections...');

    const capacityManagement = {
      // Automatic sizing
      automaticSizing: {
        enableDynamicResize: true,
        growthThreshold: 0.8,  // 80% capacity
        shrinkThreshold: 0.3,  // 30% capacity
        maxSize: 10 * 1024 * 1024 * 1024, // 10GB max
        minSize: 100 * 1024 * 1024 // 100MB min
      },

      // Performance-based optimization
      performanceOptimization: {
        monitorWriteLatency: true,
        latencyThreshold: 100, // 100ms
        enableAutomaticIndexing: true,
        optimizeForWorkload: true
      },

      // Resource management
      resourceManagement: {
        monitorMemoryUsage: true,
        memoryThreshold: 0.7, // 70% memory usage
        enableBackpressure: true,
        enableLoadShedding: true
      }
    };

    return await this.deployCapacityManagement(capacityManagement);
  }
}
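
MongoDB does not allow capped collections to be sharded, so one way to approximate the write distribution described above is to route documents to several independent capped collections at the application level. The sketch below assumes a hypothetical naming scheme (`<base>_<index>`) and a `hostname` routing field; the hash choice (FNV-1a) is arbitrary, and any stable hash would serve.

```javascript
// FNV-1a 32-bit hash over UTF-16 code units (identical to the byte-wise
// algorithm for ASCII input). Chosen only because it is small and stable.
function fnv1a32(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Hypothetical naming scheme: application_logs_0, application_logs_1, ...
function partitionCollectionName(baseName, routingKey, partitionCount) {
  if (!Number.isInteger(partitionCount) || partitionCount <= 0) {
    throw new RangeError('partitionCount must be a positive integer');
  }
  return `${baseName}_${fnv1a32(String(routingKey)) % partitionCount}`;
}

// Groups documents by target collection so each group can be passed to a
// single insertMany call on the corresponding capped collection.
function groupByPartition(baseName, documents, partitionCount, keyField = 'hostname') {
  const groups = new Map();
  for (const doc of documents) {
    const name = partitionCollectionName(baseName, doc[keyField] ?? '', partitionCount);
    if (!groups.has(name)) groups.set(name, []);
    groups.get(name).push(doc);
  }
  return groups;
}

const grouped = groupByPartition(
  'application_logs',
  [{ hostname: 'web-1', message: 'a' }, { hostname: 'web-2', message: 'b' }, { hostname: 'web-1', message: 'c' }],
  4
);
for (const [name, docs] of grouped) {
  console.log(name, docs.length);
}
```

Insertion order is then guaranteed only within each partition, so readers that need a global sequence would have to merge by timestamp.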

SQL-Style Capped Collections Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB capped collections and high-performance logging:

-- QueryLeaf capped collections operations with SQL-familiar syntax for MongoDB

-- Create capped collections with SQL-style DDL
CREATE CAPPED COLLECTION application_logs 
WITH (
  size = '200MB',
  max_documents = 100000,
  write_concern = 'fast',
  compression = 'snappy'
);

-- Alternative syntax for collection creation
CREATE TABLE event_stream (
  event_id UUID DEFAULT GENERATE_UUID(),
  timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  event_type VARCHAR(100) NOT NULL,
  event_data DOCUMENT,
  user_id VARCHAR(50),
  session_id VARCHAR(100),
  source VARCHAR(50) DEFAULT 'application',

  -- Capped collection metadata
  insertion_order BIGINT -- Natural insertion order in capped collections
)
WITH CAPPED (
  size = '500MB',
  max_documents = 250000,
  auto_rotation = true
);

-- High-performance log insertion with SQL syntax
INSERT INTO application_logs (
  application, level, message, timestamp, user_id, session_id, metadata
) VALUES 
  ('web-server', 'INFO', 'User login successful', CURRENT_TIMESTAMP, 'user123', 'sess456', 
   JSON_OBJECT('ip_address', '192.168.1.100', 'user_agent', 'Mozilla/5.0...')),
  ('web-server', 'WARN', 'Slow query detected', CURRENT_TIMESTAMP, 'user123', 'sess456',
   JSON_OBJECT('query_time', 2500, 'table', 'users')),
  ('payment-service', 'ERROR', 'Payment processing failed', CURRENT_TIMESTAMP, 'user789', 'sess789',
   JSON_OBJECT('amount', 99.99, 'error_code', 'CARD_DECLINED'));

-- Bulk insertion for high-throughput logging
INSERT INTO application_logs (application, level, message, timestamp, metadata)
WITH log_batch AS (
  SELECT 
    app_name as application,
    log_level as level,
    log_message as message,
    log_timestamp as timestamp,

    -- Enhanced metadata generation
    JSON_OBJECT(
      'hostname', hostname,
      'process_id', process_id,
      'thread_id', thread_id,
      'memory_usage_mb', memory_usage / 1024 / 1024,
      'request_duration_ms', request_duration,
      'tags', log_tags,
      'custom_data', custom_metadata
    ) as metadata

  FROM staging_logs
  WHERE processed = false
    AND log_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
)
SELECT application, level, message, timestamp, metadata
FROM log_batch
WHERE level IN ('INFO', 'WARN', 'ERROR', 'CRITICAL')

-- Capped collection bulk insert configuration
WITH BULK_OPTIONS (
  batch_size = 1000,
  ordered = false,
  write_concern = 'fast',
  bypass_validation = false
);

-- Event streaming with guaranteed insertion order
INSERT INTO event_stream (
  event_type, event_data, user_id, session_id, 
  correlation_id, source, priority, tags
) 
WITH event_preparation AS (
  SELECT 
    event_type,
    event_payload as event_data,
    user_id,
    session_id,

    -- Generate correlation context
    COALESCE(correlation_id, GENERATE_UUID()) as correlation_id,
    COALESCE(event_source, 'application') as source,
    COALESCE(event_priority, 'normal') as priority,

    -- Generate event tags for filtering
    ARRAY[
      event_category,
      'realtime',
      CASE WHEN event_priority = 'high' THEN 'urgent' ELSE 'standard' END
    ] as tags,

    -- Add timing metadata
    CURRENT_TIMESTAMP as insertion_timestamp,
    event_occurred_at

  FROM incoming_events
  WHERE processing_status = 'pending'
    AND event_occurred_at >= CURRENT_TIMESTAMP - INTERVAL '5 minutes'
)
SELECT 
  event_type,
  JSON_SET(
    event_data,
    '$.insertion_timestamp', insertion_timestamp,
    '$.occurred_at', event_occurred_at,
    '$.processing_context', JSON_OBJECT(
      'inserted_by', 'queryleaf',
      'capped_collection', true,
      'guaranteed_order', true
    )
  ) as event_data,
  user_id,
  session_id,
  correlation_id,
  source,
  priority,
  tags
FROM event_preparation
ORDER BY event_occurred_at, correlation_id;

-- Query recent logs with natural insertion order (most efficient for capped collections)
WITH recent_application_logs AS (
  SELECT 
    timestamp,
    application,
    level,
    message,
    user_id,
    session_id,
    metadata,

    -- _id (ObjectId) used as an approximate insertion-order marker
    _id as insertion_order,

    -- Extract metadata fields
    JSON_EXTRACT(metadata, '$.hostname') as hostname,
    JSON_EXTRACT(metadata, '$.request_duration_ms') as request_duration,
    JSON_EXTRACT(metadata, '$.memory_usage_mb') as memory_usage,

    -- Calculate log age in total seconds (EPOCH yields the whole interval, not just its seconds field)
    EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - timestamp) as age_seconds,

    -- Categorize log importance
    CASE level
      WHEN 'CRITICAL' THEN 1
      WHEN 'ERROR' THEN 2  
      WHEN 'WARN' THEN 3
      WHEN 'INFO' THEN 4
      WHEN 'DEBUG' THEN 5
    END as log_priority_numeric

  FROM application_logs
  WHERE 
    -- Time-based filtering (efficient with capped collections)
    timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'

    -- Application filtering
    AND (application = $1 OR $1 IS NULL)

    -- Level filtering (CRITICAL included so the critical-level checks below can match)
    AND level IN ('CRITICAL', 'ERROR', 'WARN', 'INFO')

  -- Natural order query (most efficient for capped collections)
  ORDER BY $natural DESC
  LIMIT 1000
),

log_analysis AS (
  SELECT 
    ral.*,

    -- Session context analysis
    COUNT(*) OVER (
      PARTITION BY session_id 
      ORDER BY timestamp 
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as session_log_sequence,

    -- Error rate analysis
    COUNT(*) FILTER (WHERE level IN ('ERROR', 'CRITICAL')) OVER (
      PARTITION BY application, DATE_TRUNC('minute', timestamp)
    ) as errors_this_minute,

    -- Performance analysis
    AVG(request_duration) OVER (
      PARTITION BY application 
      ORDER BY timestamp 
      ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
    ) as rolling_avg_duration,

    -- Anomaly detection
    CASE 
      WHEN request_duration > 
        AVG(request_duration) OVER (
          PARTITION BY application 
          ORDER BY timestamp 
          ROWS BETWEEN 100 PRECEDING AND CURRENT ROW
        ) * 3 
      THEN 'performance_anomaly'

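      -- Window expression repeated: a SELECT-list alias cannot be referenced in the same SELECT
      WHEN COUNT(*) FILTER (WHERE level IN ('ERROR', 'CRITICAL')) OVER (
          PARTITION BY application, DATE_TRUNC('minute', timestamp)
        ) > 10 
      THEN 'error_spike'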

      WHEN memory_usage > 
        AVG(memory_usage) OVER (
          PARTITION BY hostname 
          ORDER BY timestamp 
          ROWS BETWEEN 50 PRECEDING AND CURRENT ROW
        ) * 2
      THEN 'memory_anomaly'

      ELSE 'normal'
    END as anomaly_status

  FROM recent_application_logs ral
)

SELECT 
  timestamp,
  application,
  level,
  message,
  user_id,
  session_id,
  hostname,

  -- Performance metrics
  request_duration,
  memory_usage,
  rolling_avg_duration,

  -- Context information
  session_log_sequence,
  errors_this_minute,

  -- Analysis results
  log_priority_numeric,
  anomaly_status,
  age_seconds,

  -- Helpful indicators
  CASE 
    WHEN age_seconds < 60 THEN 'very_recent'
    WHEN age_seconds < 300 THEN 'recent' 
    WHEN age_seconds < 1800 THEN 'moderate'
    ELSE 'older'
  END as recency_category,

  -- Alert conditions
  CASE 
    WHEN level = 'CRITICAL' OR anomaly_status != 'normal' THEN 'immediate_attention'
    WHEN level = 'ERROR' AND errors_this_minute > 5 THEN 'monitor_closely'
    WHEN level = 'WARN' AND session_log_sequence > 20 THEN 'session_issues'
    ELSE 'normal_monitoring'
  END as attention_level

FROM log_analysis
WHERE 
  -- Focus on actionable logs
  (level IN ('CRITICAL', 'ERROR') OR anomaly_status != 'normal')

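ORDER BY 
  -- Prioritize by importance and recency (conditions repeated from attention_level,
  -- since an output alias cannot be used inside an ORDER BY expression)
  CASE 
    WHEN level = 'CRITICAL' OR anomaly_status != 'normal' THEN 1
    WHEN level = 'ERROR' AND errors_this_minute > 5 THEN 2
    WHEN level = 'WARN' AND session_log_sequence > 20 THEN 3
    ELSE 4
  END,
  timestamp DESC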

LIMIT 500;

-- Real-time event stream processing with tailable cursor behavior
WITH LIVE_EVENT_STREAM AS (
  SELECT 
    event_id,
    timestamp,
    event_type,
    event_data,
    user_id,
    session_id,
    correlation_id,
    source,
    tags,

    -- Event sequence tracking
    _id as natural_order,

    -- Extract event payload details
    JSON_EXTRACT(event_data, '$.action') as action,
    JSON_EXTRACT(event_data, '$.resource') as resource,
    JSON_EXTRACT(event_data, '$.metadata') as event_metadata,

    -- Real-time processing flags
    JSON_EXTRACT(event_data, '$.requires_processing') as requires_processing,
    JSON_EXTRACT(event_data, '$.priority') as event_priority

  FROM event_stream
  WHERE 
    -- Process events from the last insertion point
    _id > $last_processed_id

    -- Focus on events requiring real-time processing
    AND (
      JSON_EXTRACT(event_data, '$.requires_processing') = true
      OR event_type IN ('user_action', 'system_alert', 'security_event')
      OR JSON_EXTRACT(event_data, '$.priority') = 'high'
    )

  -- Use natural insertion order for optimal capped collection performance
  ORDER BY $natural ASC
),

event_correlation AS (
  SELECT 
    les.*,

    -- Correlation analysis
    COUNT(*) OVER (
      PARTITION BY correlation_id
      ORDER BY natural_order
    ) as correlation_sequence,

    -- User behavior patterns
    COUNT(*) OVER (
      PARTITION BY user_id, event_type
      ORDER BY timestamp
      RANGE BETWEEN INTERVAL '5 minutes' PRECEDING AND CURRENT ROW  
    ) as recent_similar_events,

    -- Session context
    STRING_AGG(event_type, ' -> ') OVER (
      PARTITION BY session_id
      ORDER BY natural_order
      ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
    ) as session_event_sequence,

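    -- Anomaly detection (window expressions repeated because a SELECT-list
    -- alias cannot be referenced within the same SELECT)
    CASE 
      WHEN COUNT(*) OVER (
          PARTITION BY user_id, event_type
          ORDER BY timestamp
          RANGE BETWEEN INTERVAL '5 minutes' PRECEDING AND CURRENT ROW
        ) > 10 THEN 'potential_abuse'
      WHEN COUNT(*) OVER (
          PARTITION BY correlation_id
          ORDER BY natural_order
        ) > 50 THEN 'long_running_process'
      WHEN event_type = 'security_event' THEN 'security_concern'
      ELSE 'normal_event'
    END as event_classification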

  FROM LIVE_EVENT_STREAM les
),

processed_events AS (
  SELECT 
    ec.*,

    -- Generate processing instructions
    JSON_OBJECT(
      'processing_priority', 
      CASE event_classification
        WHEN 'security_concern' THEN 'critical'
        WHEN 'potential_abuse' THEN 'high'
        WHEN 'long_running_process' THEN 'monitor'
        ELSE 'standard'
      END,

      'correlation_context', JSON_OBJECT(
        'correlation_id', correlation_id,
        'sequence', correlation_sequence,
        'related_events', recent_similar_events
      ),

      'session_context', JSON_OBJECT(
        'session_id', session_id,
        'event_sequence', session_event_sequence,
        'user_id', user_id
      ),

      'processing_metadata', JSON_OBJECT(
        'inserted_at', CURRENT_TIMESTAMP,
        'natural_order', natural_order,
        'capped_collection_source', true
      )
    ) as processing_instructions,

    -- Determine next processing steps
    CASE event_classification
      WHEN 'security_concern' THEN 'immediate_alert'
      WHEN 'potential_abuse' THEN 'rate_limit_check'  
      WHEN 'long_running_process' THEN 'status_update'
      ELSE 'standard_processing'
    END as next_action

  FROM event_correlation ec
)

SELECT 
  event_id,
  timestamp,
  event_type,
  action,
  resource,
  user_id,
  session_id,

  -- Analysis results
  event_classification,
  correlation_sequence,
  recent_similar_events,
  next_action,

  -- Processing context
  processing_instructions,

  -- Natural ordering for downstream systems
  natural_order,

  -- Real-time indicators
  EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - timestamp) as processing_latency_seconds,

  CASE 
    WHEN EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - timestamp) < 5 THEN 'real_time'
    WHEN EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - timestamp) < 30 THEN 'near_real_time'  
    ELSE 'delayed_processing'
  END as processing_timeliness

FROM processed_events
WHERE event_classification != 'normal_event' OR requires_processing = true
ORDER BY 
  -- Process highest priority events first
  CASE next_action
    WHEN 'immediate_alert' THEN 1
    WHEN 'rate_limit_check' THEN 2
    WHEN 'status_update' THEN 3
    ELSE 4
  END,
  natural_order ASC;

-- Performance metrics and capacity monitoring for capped collections
WITH capped_collection_stats AS (
  SELECT 
    collection_name,

    -- Storage utilization
    current_size_mb,
    max_size_mb,
    (current_size_mb * 100.0 / max_size_mb) as size_utilization_percent, -- 100.0 avoids integer division

    -- Document utilization  
    document_count,
    max_documents,
    (document_count * 100.0 / NULLIF(max_documents, 0)) as document_utilization_percent,

    -- Performance metrics
    avg_document_size,
    total_index_size_mb,

    -- Operation statistics
    total_inserts_today,
    avg_inserts_per_hour,
    peak_inserts_per_hour,

    -- Capacity projections
    estimated_hours_to_capacity,
    estimated_rotation_frequency

  FROM (
    -- This would be populated by MongoDB collection stats
    VALUES 
      ('application_logs', 150, 200, 75000, 100000, 2048, 5, 180000, 7500, 15000, 8, 'every_3_hours'),
      ('event_stream', 400, 500, 200000, 250000, 2048, 8, 480000, 20000, 35000, 4, 'every_hour'),
      ('performance_metrics', 80, 100, 40000, 50000, 2048, 3, 96000, 4000, 8000, 20, 'every_5_hours')
  ) AS stats(collection_name, current_size_mb, max_size_mb, document_count, max_documents, 
             avg_document_size, total_index_size_mb, total_inserts_today, avg_inserts_per_hour,
             peak_inserts_per_hour, estimated_hours_to_capacity, estimated_rotation_frequency)
),

performance_analysis AS (
  SELECT 
    ccs.*,

    -- Utilization status
    CASE 
      WHEN size_utilization_percent > 90 THEN 'critical'
      WHEN size_utilization_percent > 80 THEN 'warning'  
      WHEN size_utilization_percent > 60 THEN 'moderate'
      ELSE 'healthy'
    END as size_status,

    CASE 
      WHEN document_utilization_percent > 90 THEN 'critical'
      WHEN document_utilization_percent > 80 THEN 'warning'
      WHEN document_utilization_percent > 60 THEN 'moderate'  
      ELSE 'healthy'
    END as document_status,

    -- Performance indicators
    CASE 
      WHEN peak_inserts_per_hour * 1.0 / NULLIF(avg_inserts_per_hour, 0) > 3 THEN 'high_variance'
      WHEN peak_inserts_per_hour * 1.0 / NULLIF(avg_inserts_per_hour, 0) > 2 THEN 'moderate_variance'
      ELSE 'stable_load'
    END as load_pattern,

    -- Capacity recommendations
    CASE 
      WHEN estimated_hours_to_capacity < 24 THEN 'monitor_closely'
      WHEN estimated_hours_to_capacity < 72 THEN 'plan_expansion'
      WHEN estimated_hours_to_capacity > 168 THEN 'over_provisioned'
      ELSE 'adequate_capacity'
    END as capacity_recommendation,

    -- Optimization suggestions
    CASE 
      WHEN total_index_size_mb * 1.0 / current_size_mb > 0.3 THEN 'review_indexes'
      WHEN avg_document_size > 4096 THEN 'consider_compression'
      WHEN avg_inserts_per_hour < 100 THEN 'potentially_over_sized'
      ELSE 'well_optimized'
    END as optimization_suggestion

  FROM capped_collection_stats ccs
)

SELECT 
  collection_name,

  -- Current utilization
  ROUND(size_utilization_percent, 1) as size_used_percent,
  ROUND(document_utilization_percent, 1) as documents_used_percent,
  size_status,
  document_status,

  -- Capacity information  
  current_size_mb,
  max_size_mb,
  (max_size_mb - current_size_mb) as remaining_capacity_mb,
  document_count,
  max_documents,

  -- Performance metrics
  avg_document_size,
  total_index_size_mb,
  load_pattern,
  avg_inserts_per_hour,
  peak_inserts_per_hour,

  -- Projections and recommendations
  estimated_hours_to_capacity,
  estimated_rotation_frequency,
  capacity_recommendation,
  optimization_suggestion,

  -- Action items
  CASE 
    WHEN size_status = 'critical' OR document_status = 'critical' THEN 'immediate_action_required'
    WHEN capacity_recommendation = 'monitor_closely' THEN 'increase_monitoring_frequency'
    WHEN optimization_suggestion != 'well_optimized' THEN 'schedule_optimization_review'
    ELSE 'continue_normal_operations'
  END as recommended_action,

  -- Detailed recommendations (conditions repeated from recommended_action,
  -- since a SELECT-list alias cannot be referenced in the same SELECT)
  CASE 
    WHEN size_status = 'critical' OR document_status = 'critical' THEN 'Increase capped collection size or reduce retention period'
    WHEN capacity_recommendation = 'monitor_closely' THEN 'Monitor every 15 minutes instead of hourly'
    WHEN optimization_suggestion != 'well_optimized' THEN 'Review indexes, compression, and document structure'
    ELSE 'Collection is operating within normal parameters'
  END as action_details

FROM performance_analysis
ORDER BY 
  CASE size_status 
    WHEN 'critical' THEN 1
    WHEN 'warning' THEN 2
    WHEN 'moderate' THEN 3  
    ELSE 4
  END,
  collection_name;

-- QueryLeaf provides comprehensive capped collection capabilities:
-- 1. SQL-familiar capped collection creation and management
-- 2. High-performance bulk insertion with optimized batching
-- 3. Natural insertion order queries for optimal performance
-- 4. Real-time event streaming with tailable cursor behavior  
-- 5. Advanced analytics and anomaly detection on streaming data
-- 6. Automatic capacity monitoring and optimization recommendations
-- 7. Integration with MongoDB's native capped collection optimizations
-- 8. SQL-style operations for complex streaming data workflows
-- 9. Built-in performance monitoring and alerting capabilities
-- 10. Production-ready capped collections with enterprise features

Best Practices for Capped Collections Implementation

Performance Optimization and Design Strategy

Essential principles for effective MongoDB capped collections deployment:

Size Planning: Calculate optimal collection sizes based on throughput, retention requirements, and query patterns
Write Optimization: Design write patterns that leverage capped collections' sequential write performance advantages
Query Strategy: Utilize natural insertion order and time-based queries for optimal read performance
Index Design: Implement minimal, strategic indexing that complements capped collection characteristics
Monitoring Strategy: Track utilization, rotation frequency, and performance metrics for capacity planning
Integration Patterns: Design applications that benefit from guaranteed insertion order and automatic data lifecycle
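
The size-planning principle above comes down to simple arithmetic. The sketch below is only an illustration: the helper name `estimateCappedSizeBytes`, its parameters, and the 20 percent headroom default are assumptions introduced here, not part of MongoDB or QueryLeaf.

```javascript
// Hypothetical sizing helper: the names and the 20% headroom default are illustrative.
function estimateCappedSizeBytes({ docsPerSecond, avgDocBytes, retentionSeconds, headroomPercent = 20 }) {
  if (docsPerSecond <= 0 || avgDocBytes <= 0 || retentionSeconds <= 0) {
    throw new RangeError('throughput, document size, and retention must be positive');
  }

  // Bytes written during the retention window
  const rawBytes = docsPerSecond * avgDocBytes * retentionSeconds;

  // Add headroom so bursts do not shorten the effective retention window
  return Math.ceil((rawBytes * (100 + headroomPercent)) / 100);
}

// Example: 500 documents per second, 400 bytes each, kept for 6 hours
const sizeBytes = estimateCappedSizeBytes({
  docsPerSecond: 500,
  avgDocBytes: 400,
  retentionSeconds: 6 * 3600
});
console.log(`Suggested capped collection size: ${sizeBytes} bytes`);
```

The result could then be supplied as the `size` option when the collection is created, for example `db.createCollection('app_events', { capped: true, size: sizeBytes })`, where the collection name is hypothetical.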

Production Deployment and Operational Excellence

Optimize capped collections for enterprise-scale requirements:

Capacity Management: Implement automated monitoring and alerting for collection utilization and performance
Write Distribution: Capped collections cannot be sharded, so spread heavy write loads across several capped collections (for example, one per service or event type) instead of relying on shard keys
Real-Time Processing: Leverage tailable cursors and change streams for efficient real-time data processing
Backup Strategy: Account for capped collection characteristics in backup and disaster recovery planning
Performance Monitoring: Track write throughput, query performance, and resource utilization continuously
Operational Integration: Integrate capped collections with existing logging, monitoring, and alerting infrastructure
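
As a rough sketch of the capacity-management item above, the helper below turns raw collection figures into a status and a time-to-capacity estimate, similar in spirit to the `size_status` and `estimated_hours_to_capacity` columns in the earlier query. The function name and the 90/75 percent thresholds are assumptions chosen for illustration.

```javascript
// Hypothetical monitoring helper; the threshold defaults are illustrative assumptions.
function assessCappedCapacity(
  { currentBytes, maxBytes, insertsPerHour, avgDocBytes },
  thresholds = { critical: 90, warning: 75 }
) {
  if (maxBytes <= 0) {
    throw new RangeError('maxBytes must be positive');
  }

  const usedPercent = (currentBytes * 100) / maxBytes;

  let status = 'normal';
  if (usedPercent >= thresholds.critical) status = 'critical';
  else if (usedPercent >= thresholds.warning) status = 'warning';

  // Hours until the collection is full and begins overwriting its oldest documents
  const bytesPerHour = insertsPerHour * avgDocBytes;
  const remainingBytes = Math.max(maxBytes - currentBytes, 0);
  const hoursToCapacity = bytesPerHour > 0 ? remainingBytes / bytesPerHour : Infinity;

  return { usedPercent, status, hoursToCapacity };
}

const report = assessCappedCapacity({
  currentBytes: 800,
  maxBytes: 1000,
  insertsPerHour: 10,
  avgDocBytes: 5
});
console.log(report);
```

For a capped collection, reaching capacity is expected behavior rather than a failure, so the estimate is mainly useful for checking that the rotation window still covers the required retention period.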

Conclusion

MongoDB capped collections provide native high-performance data structures that eliminate the complexity of traditional logging and streaming solutions through fixed-size storage, guaranteed insertion order, and automatic data lifecycle management. The combination of predictable performance characteristics with real-time processing capabilities makes capped collections ideal for modern streaming data applications.

Key MongoDB Capped Collections benefits include:

High-Performance Writes: Sequential write optimization with minimal index maintenance overhead
Predictable Storage: Fixed-size collections with automatic old document removal and no storage bloat
Insertion Order Guarantee: Natural document ordering ideal for event sequencing and temporal data analysis
Real-Time Processing: Tailable cursors and change streams for efficient streaming data consumption
Resource Efficiency: Predictable memory usage and optimal performance characteristics for high-throughput scenarios
SQL Accessibility: Familiar SQL-style capped collection operations through QueryLeaf for accessible streaming data management

Whether you're implementing application logging, event streaming, performance monitoring, or real-time analytics, MongoDB capped collections with QueryLeaf's familiar SQL interface provide the foundation for efficient, predictable, and scalable streaming data solutions.

QueryLeaf Integration: QueryLeaf seamlessly manages MongoDB capped collections while providing SQL-familiar syntax for high-performance logging, real-time streaming, and circular buffer operations. Advanced capped collection patterns including capacity planning, real-time processing, and performance optimization are elegantly handled through familiar SQL constructs, making sophisticated streaming data management both powerful and accessible to SQL-oriented development teams.

The combination of MongoDB's robust capped collection capabilities with SQL-style streaming operations makes it an ideal platform for applications requiring both high-throughput data capture and familiar database interaction patterns, ensuring your streaming data infrastructure can scale efficiently while maintaining predictable performance and operational simplicity.

November 1, 2025
19 min read

MongoDB TTL Collections: Automatic Data Lifecycle Management and Expiration for Efficient Storage

Modern applications generate vast amounts of transient data that needs careful lifecycle management to maintain performance and control storage costs. Traditional approaches to data cleanup involve complex batch jobs, scheduled maintenance scripts, and manual processes that are error-prone and resource-intensive.

MongoDB TTL (Time To Live) indexes provide native automatic data expiration, which removes most of the complexity of manual data lifecycle management. Unlike database systems that require custom deletion processes or external job schedulers, MongoDB runs a background task that deletes documents once the date in the indexed field is older than the configured lifetime, keeping storage bounded without a separate cleanup job.
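
To make the expiration rule concrete, the sketch below computes when a document becomes eligible for deletion: the date in the indexed field plus `expireAfterSeconds`. The helper `ttlEligibleAt` is hypothetical and runs entirely in application code; the server applies the equivalent rule in its own background task, which runs periodically (about once a minute), so actual removal can lag the computed time.

```javascript
// Hypothetical helper mirroring the TTL eligibility rule in plain JavaScript.
function ttlEligibleAt(doc, field, expireAfterSeconds) {
  const value = doc[field];

  // The indexed field may hold one date or an array of dates
  const dates = (Array.isArray(value) ? value : [value])
    .filter(v => v instanceof Date && !Number.isNaN(v.getTime()));

  // Documents whose field is missing or not a date are never expired
  if (dates.length === 0) return null;

  // With several dates, the earliest one decides expiration
  const earliest = Math.min(...dates.map(d => d.getTime()));
  return new Date(earliest + expireAfterSeconds * 1000);
}

const session = { createdAt: new Date('2025-11-01T00:00:00Z') };
console.log(ttlEligibleAt(session, 'createdAt', 3600));
console.log(ttlEligibleAt({ createdAt: '2025-11-01' }, 'createdAt', 3600));
```

The second call illustrates a common pitfall: a date stored as a string is not a date value, so a TTL index would never remove that document.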

The Traditional Data Lifecycle Challenge

Conventional approaches to managing data expiration and cleanup involve significant complexity and operational burden:

-- Traditional PostgreSQL data cleanup approach - complex and resource-intensive

-- Session cleanup with manual batch processing
CREATE TABLE user_sessions (
    session_id UUID PRIMARY KEY,
    user_id BIGINT NOT NULL,
    session_data JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMP NOT NULL,
    last_accessed TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    is_active BOOLEAN DEFAULT true
);

-- Scheduled cleanup job (requires external cron/scheduler)
-- This query must run regularly and can be resource-intensive
DELETE FROM user_sessions 
WHERE expires_at < CURRENT_TIMESTAMP 
   OR (last_accessed < CURRENT_TIMESTAMP - INTERVAL '30 days' AND is_active = false);

-- Complex log cleanup with multiple conditions
CREATE TABLE application_logs (
    log_id BIGSERIAL PRIMARY KEY,
    application_name VARCHAR(100) NOT NULL,
    log_level VARCHAR(20) NOT NULL,
    message TEXT,
    metadata JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Manual retention policy implementation
    retention_days INTEGER DEFAULT 30,
    should_archive BOOLEAN DEFAULT false
);

-- Multi-stage cleanup process
WITH logs_to_cleanup AS (
    SELECT log_id, application_name, created_at, retention_days
    FROM application_logs
    WHERE 
        -- Different retention periods by log level
        (log_level = 'DEBUG' AND created_at < CURRENT_TIMESTAMP - INTERVAL '7 days')
        OR (log_level = 'INFO' AND created_at < CURRENT_TIMESTAMP - INTERVAL '30 days')
        OR (log_level = 'WARN' AND created_at < CURRENT_TIMESTAMP - INTERVAL '90 days')
        OR (log_level = 'ERROR' AND created_at < CURRENT_TIMESTAMP - INTERVAL '365 days')
        OR (should_archive = false AND created_at < CURRENT_TIMESTAMP - retention_days * INTERVAL '1 day')
),
archival_candidates AS (
    -- Identify logs that should be archived before deletion
    SELECT ltc.log_id, ltc.application_name, ltc.created_at
    FROM logs_to_cleanup ltc
    JOIN application_logs al ON ltc.log_id = al.log_id
    WHERE al.log_level IN ('ERROR', 'CRITICAL') 
       OR al.metadata ? 'trace_id' -- Contains important debugging info
),
archive_process AS (
    -- Archive important logs (complex external process)
    INSERT INTO archived_application_logs 
    SELECT al.* FROM application_logs al
    JOIN archival_candidates ac ON al.log_id = ac.log_id
    RETURNING log_id
)
-- Finally delete the logs
DELETE FROM application_logs
WHERE log_id IN (
    SELECT log_id FROM logs_to_cleanup
    WHERE log_id NOT IN (SELECT log_id FROM archival_candidates)
       OR log_id IN (SELECT log_id FROM archive_process)
);

-- Traditional approach problems:
-- 1. Complex scheduling and orchestration required
-- 2. Resource-intensive batch operations during cleanup
-- 3. Risk of data loss if cleanup jobs fail
-- 4. Manual management of different retention policies
-- 5. No automatic optimization of storage and indexes
-- 6. Difficulty in handling timezone and date calculations
-- 7. Complex error handling and retry logic required
-- 8. Performance impact during large cleanup operations
-- 9. Manual coordination between cleanup and application logic
-- 10. Inconsistent cleanup behavior across different environments

-- Attempting MySQL-style events (limited functionality)
SET GLOBAL event_scheduler = ON;

CREATE EVENT cleanup_expired_sessions
ON SCHEDULE EVERY 1 HOUR
STARTS CURRENT_TIMESTAMP
DO
BEGIN
    DELETE FROM user_sessions 
    WHERE expires_at < NOW() 
    LIMIT 1000; -- Prevent long-running operations
END;

-- MySQL event limitations:
-- - Basic scheduling only
-- - No complex retention logic
-- - Limited error handling
-- - Manual management of batch sizes
-- - No integration with application lifecycle
-- - Poor visibility into cleanup operations

MongoDB TTL collections provide elegant automatic data expiration:

// MongoDB TTL Collections - automatic data lifecycle management
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('data_lifecycle_management');

// Comprehensive MongoDB TTL Data Lifecycle Manager
class MongoDBTTLManager {
  constructor(db, config = {}) {
    this.db = db;
    this.config = {
      defaultTTL: config.defaultTTL || 3600, // 1 hour default
      enableMetrics: config.enableMetrics !== false,
      enableIndexOptimization: config.enableIndexOptimization !== false,
      cleanupLogLevel: config.cleanupLogLevel || 'info',
      ...config
    };

    this.collections = {
      userSessions: db.collection('user_sessions'),
      applicationLogs: db.collection('application_logs'),
      temporaryData: db.collection('temporary_data'),
      eventStream: db.collection('event_stream'),
      apiRequests: db.collection('api_requests'),
      cacheEntries: db.collection('cache_entries'),
      ttlMetrics: db.collection('ttl_metrics')
    };

    this.ttlIndexes = new Map();
    this.expirationStrategies = new Map();
  }

  async initializeTTLCollections() {
    console.log('Initializing TTL collections and indexes...');

    try {
      // User sessions with 24-hour expiration
      await this.setupSessionTTL();

      // Application logs with variable retention based on log level
      await this.setupLogsTTL();

      // Temporary data with flexible expiration
      await this.setupTemporaryDataTTL();

      // Event stream with fixed 30-day retention
      await this.setupEventStreamTTL();

      // API request tracking with automatic cleanup
      await this.setupAPIRequestsTTL();

      // Cache entries with intelligent expiration
      await this.setupCacheTTL();

      // Metrics collection for monitoring TTL performance
      await this.setupTTLMetrics();

      console.log('All TTL collections initialized successfully');

    } catch (error) {
      console.error('Error initializing TTL collections:', error);
      throw error;
    }
  }

  async setupSessionTTL() {
    console.log('Setting up user session TTL...');

    const sessionCollection = this.collections.userSessions;

    // Create TTL index for automatic session expiration
    await sessionCollection.createIndex(
      { expiresAt: 1 },
      { 
        expireAfterSeconds: 0, // Expire based on document field value
        background: true,
        name: 'session_ttl_index'
      }
    );

    // Secondary TTL index for inactive sessions
    await sessionCollection.createIndex(
      { lastAccessedAt: 1 },
      { 
        expireAfterSeconds: 7 * 24 * 3600, // 7 days for inactive sessions
        background: true,
        name: 'session_inactivity_ttl_index'
      }
    );

    // Compound index for efficient session queries
    await sessionCollection.createIndex(
      { userId: 1, isActive: 1, expiresAt: 1 },
      { background: true }
    );

    this.ttlIndexes.set('userSessions', [
      { field: 'expiresAt', expireAfterSeconds: 0 },
      { field: 'lastAccessedAt', expireAfterSeconds: 7 * 24 * 3600 }
    ]);

    console.log('User session TTL configured');
  }

  async createUserSession(userId, sessionData, customTTL = null) {
    const expirationTime = new Date(Date.now() + ((customTTL || 24 * 3600) * 1000));

    const sessionDocument = {
      sessionId: new ObjectId(),
      userId: userId,
      sessionData: sessionData,
      createdAt: new Date(),
      expiresAt: expirationTime, // TTL field for automatic expiration
      lastAccessedAt: new Date(),
      isActive: true,

      // Session metadata
      userAgent: sessionData.userAgent,
      ipAddress: sessionData.ipAddress,
      deviceType: sessionData.deviceType,

      // Expiration strategy metadata
      ttlStrategy: 'fixed_expiration',
      customTTL: customTTL,
      renewalCount: 0
    };

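    await this.collections.userSessions.insertOne(sessionDocument);

    // Return the sessionId field rather than the driver-generated _id,
    // because renewUserSession() looks sessions up by sessionId
    console.log(`Created session ${sessionDocument.sessionId} for user ${userId}, expires at ${expirationTime}`);
    return sessionDocument.sessionId;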
  }

  async renewUserSession(sessionId, additionalTTL = 3600) {
    const newExpirationTime = new Date(Date.now() + (additionalTTL * 1000));

    const result = await this.collections.userSessions.updateOne(
      { sessionId: new ObjectId(sessionId), isActive: true },
      {
        $set: {
          expiresAt: newExpirationTime,
          lastAccessedAt: new Date()
        },
        $inc: { renewalCount: 1 }
      }
    );

    if (result.modifiedCount > 0) {
      console.log(`Renewed session ${sessionId} until ${newExpirationTime}`);
    }

    return result.modifiedCount > 0;
  }

  async setupLogsTTL() {
    console.log('Setting up application logs TTL with level-based retention...');

    const logsCollection = this.collections.applicationLogs;

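    // Create partial TTL indexes for different log levels.
    // Note: several partial indexes on the same key pattern require MongoDB 5.0+;
    // earlier servers reject a second index on { createdAt: 1 }.
    // Debug logs expire quickly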
    await logsCollection.createIndex(
      { createdAt: 1 },
      {
        expireAfterSeconds: 7 * 24 * 3600, // 7 days
        partialFilterExpression: { logLevel: 'DEBUG' },
        background: true,
        name: 'debug_logs_ttl'
      }
    );

    // Info logs have moderate retention
    await logsCollection.createIndex(
      { createdAt: 1 },
      {
        expireAfterSeconds: 30 * 24 * 3600, // 30 days
        partialFilterExpression: { logLevel: 'INFO' },
        background: true,
        name: 'info_logs_ttl'
      }
    );

    // Warning logs kept longer
    await logsCollection.createIndex(
      { createdAt: 1 },
      {
        expireAfterSeconds: 90 * 24 * 3600, // 90 days
        partialFilterExpression: { logLevel: 'WARN' },
        background: true,
        name: 'warn_logs_ttl'
      }
    );

    // Error logs kept for a full year
    await logsCollection.createIndex(
      { createdAt: 1 },
      {
        expireAfterSeconds: 365 * 24 * 3600, // 365 days
        partialFilterExpression: { logLevel: { $in: ['ERROR', 'CRITICAL'] } },
        background: true,
        name: 'error_logs_ttl'
      }
    );

    // Compound index for efficient log queries
    await logsCollection.createIndex(
      { applicationName: 1, logLevel: 1, createdAt: -1 },
      { background: true }
    );

    this.expirationStrategies.set('applicationLogs', {
      DEBUG: 7 * 24 * 3600,
      INFO: 30 * 24 * 3600,
      WARN: 90 * 24 * 3600,
      ERROR: 365 * 24 * 3600,
      CRITICAL: 365 * 24 * 3600
    });

    console.log('Application logs TTL configured with level-based retention');
  }

  async createLogEntry(applicationName, logLevel, message, metadata = {}) {
    const logDocument = {
      logId: new ObjectId(),
      applicationName: applicationName,
      logLevel: logLevel.toUpperCase(),
      message: message,
      metadata: metadata,
      createdAt: new Date(), // TTL field used by level-specific indexes

      // Additional context
      hostname: metadata.hostname || 'unknown',
      processId: metadata.processId,
      threadId: metadata.threadId,
      traceId: metadata.traceId,

      // Automatic expiration via TTL indexes
      // No manual expiration field needed - handled by partial TTL indexes
    };

    const result = await this.collections.applicationLogs.insertOne(logDocument);

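    // Log retention info based on level (the strategy map stores seconds, not days)
    const retentionSeconds = this.expirationStrategies.get('applicationLogs')?.[logLevel.toUpperCase()];
    const expirationDate = retentionSeconds
      ? new Date(Date.now() + (retentionSeconds * 1000))
      : null; // unknown levels match no TTL index and never expire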

    if (this.config.cleanupLogLevel === 'debug') {
      console.log(`Created ${logLevel} log entry ${result.insertedId}, will expire around ${expirationDate}`);
    }

    return result.insertedId;
  }

  async setupTemporaryDataTTL() {
    console.log('Setting up temporary data TTL with flexible expiration...');

    const tempCollection = this.collections.temporaryData;

    // Primary TTL index using document field
    await tempCollection.createIndex(
      { expiresAt: 1 },
      {
        expireAfterSeconds: 0, // Use document field value
        background: true,
        name: 'temp_data_ttl'
      }
    );

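    // Backup TTL index with default expiration.
    // Partial filters support { $exists: true } but not { $exists: false }, so
    // documents without expiresAt must carry an explicit marker instead
    // (useDefaultTTL is a hypothetical flag the writer sets on such documents).
    await tempCollection.createIndex(
      { createdAt: 1 },
      {
        expireAfterSeconds: 24 * 3600, // 24 hours default
        partialFilterExpression: { useDefaultTTL: true },
        background: true,
        name: 'temp_data_default_ttl'
      }
    );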

    // Index for data type queries
    await tempCollection.createIndex(
      { dataType: 1, createdAt: -1 },
      { background: true }
    );

    console.log('Temporary data TTL configured');
  }

  async storeTemporaryData(dataType, data, ttlSeconds = 3600) {
    const expirationTime = new Date(Date.now() + (ttlSeconds * 1000));

    const tempDocument = {
      tempId: new ObjectId(),
      dataType: dataType,
      data: data,
      createdAt: new Date(),
      expiresAt: expirationTime, // TTL field

      // Metadata
      sizeBytes: JSON.stringify(data).length,
      compressionType: data.compressionType || 'none',
      accessCount: 0,

      // TTL configuration
      ttlSeconds: ttlSeconds,
      autoExpire: true
    };

    const result = await this.collections.temporaryData.insertOne(tempDocument);

    console.log(`Stored temporary ${dataType} data ${result.insertedId}, expires at ${expirationTime}`);
    return result.insertedId;
  }

  async setupEventStreamTTL() {
    console.log('Setting up event stream TTL with sliding window retention...');

    const eventCollection = this.collections.eventStream;

    // TTL index for event stream with 30-day retention
    await eventCollection.createIndex(
      { timestamp: 1 },
      {
        expireAfterSeconds: 30 * 24 * 3600, // 30 days
        background: true,
        name: 'event_stream_ttl'
      }
    );

    // Compound index for event queries
    await eventCollection.createIndex(
      { eventType: 1, timestamp: -1 },
      { background: true }
    );

    // Index for user-specific events
    await eventCollection.createIndex(
      { userId: 1, timestamp: -1 },
      { background: true }
    );

    console.log('Event stream TTL configured');
  }

  async createEvent(eventType, userId, eventData) {
    const eventDocument = {
      eventId: new ObjectId(),
      eventType: eventType,
      userId: userId,
      eventData: eventData,
      timestamp: new Date(), // TTL field

      // Event metadata
      source: eventData.source || 'application',
      sessionId: eventData.sessionId,
      correlationId: eventData.correlationId,

      // Automatic expiration after 30 days via TTL index
    };

    const result = await this.collections.eventStream.insertOne(eventDocument);
    return result.insertedId;
  }

  async setupAPIRequestsTTL() {
    console.log('Setting up API requests TTL for monitoring and analytics...');

    const apiCollection = this.collections.apiRequests;

    // TTL index with 7-day retention for API requests
    await apiCollection.createIndex(
      { requestTime: 1 },
      {
        expireAfterSeconds: 7 * 24 * 3600, // 7 days
        background: true,
        name: 'api_requests_ttl'
      }
    );

    // Compound indexes for API analytics
    await apiCollection.createIndex(
      { endpoint: 1, requestTime: -1 },
      { background: true }
    );

    await apiCollection.createIndex(
      { statusCode: 1, requestTime: -1 },
      { background: true }
    );

    console.log('API requests TTL configured');
  }

  async logAPIRequest(endpoint, method, statusCode, responseTime, metadata = {}) {
    const requestDocument = {
      requestId: new ObjectId(),
      endpoint: endpoint,
      method: method.toUpperCase(),
      statusCode: statusCode,
      responseTime: responseTime,
      requestTime: new Date(), // TTL field

      // Request details
      userAgent: metadata.userAgent,
      ipAddress: metadata.ipAddress,
      userId: metadata.userId,
      sessionId: metadata.sessionId,

      // Performance metrics
      requestSize: metadata.requestSize || 0,
      responseSize: metadata.responseSize || 0,

      // Automatic expiration after 7 days
    };

    const result = await this.collections.apiRequests.insertOne(requestDocument);
    return result.insertedId;
  }

  async setupCacheTTL() {
    console.log('Setting up cache entries TTL with intelligent expiration...');

    const cacheCollection = this.collections.cacheEntries;

    // Primary TTL index using document field for custom expiration
    await cacheCollection.createIndex(
      { expiresAt: 1 },
      {
        expireAfterSeconds: 0, // Use document field
        background: true,
        name: 'cache_ttl'
      }
    );

    // Idle-entry TTL: removes any entry not accessed for an hour, even if expiresAt is later
    await cacheCollection.createIndex(
      { lastAccessedAt: 1 },
      {
        expireAfterSeconds: 3600, // 1 hour default
        background: true,
        name: 'cache_access_ttl'
      }
    );

    // Index for cache key lookups
    await cacheCollection.createIndex(
      { cacheKey: 1 },
      { unique: true, background: true }
    );

    console.log('Cache TTL configured');
  }

  async setCacheEntry(cacheKey, value, ttlSeconds = 300) {
    const expirationTime = new Date(Date.now() + (ttlSeconds * 1000));

    const now = new Date();

    // Fields overwritten on every write. createdAt is deliberately left out:
    // naming the same path in both $set and $setOnInsert makes the update fail.
    const cacheDocument = {
      cacheKey: cacheKey,
      value: value,
      lastAccessedAt: now,
      expiresAt: expirationTime, // TTL field

      // Cache metadata
      accessCount: 0,
      ttlSeconds: ttlSeconds,
      valueType: typeof value,
      sizeBytes: JSON.stringify(value).length,

      // Hit ratio tracking
      hitCount: 0,
      missCount: 0
    };

    // cacheCollection is local to setupCacheTTL(), so use the shared handle here
    const result = await this.collections.cacheEntries.updateOne(
      { cacheKey: cacheKey },
      {
        $set: cacheDocument,
        $setOnInsert: { createdAt: now }
      },
      { upsert: true }
    );

    return result.upsertedId || result.modifiedCount > 0;
  }

  async getCacheEntry(cacheKey) {
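    const result = await this.collections.cacheEntries.findOneAndUpdate(
      // The TTL monitor deletes in periodic passes, so an expired entry can
      // linger briefly; filter on expiresAt to avoid serving it
      { cacheKey: cacheKey, expiresAt: { $gt: new Date() } },
      {
        $set: { lastAccessedAt: new Date() },
        $inc: { accessCount: 1, hitCount: 1 }
      },
      // Node.js driver 6.x returns the bare document by default;
      // includeResultMetadata: true (available from driver 5.7) keeps the
      // { value: document } shape this method expects
      { returnDocument: 'after', includeResultMetadata: true }
    );

    // Use ?? so falsy cached values (0, '', false) are returned rather than null
    return result.value?.value ?? null;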
  }

  async setupTTLMetrics() {
    console.log('Setting up TTL metrics collection...');

    const metricsCollection = this.collections.ttlMetrics;

    // TTL index for metrics with 90-day retention
    await metricsCollection.createIndex(
      { timestamp: 1 },
      {
        expireAfterSeconds: 90 * 24 * 3600, // 90 days
        background: true,
        name: 'metrics_ttl'
      }
    );

    // Index for metrics queries
    await metricsCollection.createIndex(
      { collectionName: 1, timestamp: -1 },
      { background: true }
    );

    console.log('TTL metrics collection configured');
  }

  async collectTTLMetrics() {
    console.log('Collecting TTL performance metrics...');

    try {
      const metrics = {
        timestamp: new Date(),
        collections: {}
      };

      // Collect metrics for each TTL collection
      for (const [collectionName, collection] of Object.entries(this.collections)) {
        if (collectionName === 'ttlMetrics') continue;

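        // Collection.stats() was removed in Node.js driver 6.x; the $collStats
        // aggregation stage reports the same storage figures
        const [collStats] = await collection
          .aggregate([{ $collStats: { storageStats: {} } }])
          .toArray();
        const collectionStats = collStats.storageStats;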
        const indexStats = await this.getTTLIndexStats(collection);

        metrics.collections[collectionName] = {
          documentCount: collectionStats.count,
          storageSize: collectionStats.storageSize,
          avgObjSize: collectionStats.avgObjSize,
          totalIndexSize: collectionStats.totalIndexSize,
          ttlIndexes: indexStats,

          // Calculate expiration rates
          estimatedExpirationRate: await this.estimateExpirationRate(collection)
        };
      }

      // Store metrics
      await this.collections.ttlMetrics.insertOne(metrics);

      if (this.config.enableMetrics) {
        console.log('TTL Metrics:', {
          totalCollections: Object.keys(metrics.collections).length,
          totalDocuments: Object.values(metrics.collections).reduce((sum, c) => sum + c.documentCount, 0),
          totalStorageSize: Object.values(metrics.collections).reduce((sum, c) => sum + c.storageSize, 0)
        });
      }

      return metrics;

    } catch (error) {
      console.error('Error collecting TTL metrics:', error);
      throw error;
    }
  }

  async getTTLIndexStats(collection) {
    const indexes = await collection.listIndexes().toArray();
    // An index is a TTL index when it defines expireAfterSeconds (0 is a valid value)
    const ttlIndexes = indexes.filter(index => index.expireAfterSeconds !== undefined);

    return ttlIndexes.map(index => ({
      name: index.name,
      key: index.key,
      expireAfterSeconds: index.expireAfterSeconds,
      partialFilterExpression: index.partialFilterExpression
    }));
  }

  async estimateExpirationRate(collection) {
    // Rough proxy, not a true deletion rate: the share of documents created in
    // the last 24 hours. Collections without a createdAt field always report 0.
    const now = new Date();
    const oneDayAgo = new Date(now.getTime() - (24 * 60 * 60 * 1000));

    const recentDocuments = await collection.countDocuments({
      createdAt: { $gte: oneDayAgo }
    });

    const totalDocuments = await collection.countDocuments();

    return recentDocuments > 0 ? (recentDocuments / totalDocuments) : 0;
  }

  async optimizeTTLIndexes() {
    console.log('Optimizing TTL indexes for better performance...');

    try {
      for (const [collectionName, collection] of Object.entries(this.collections)) {
        if (collectionName === 'ttlMetrics') continue;

        // Analyze index usage
        const indexStats = await collection.aggregate([
          { $indexStats: {} }
        ]).toArray();

        // Report TTL indexes. $indexStats counts only user-driven access, not
        // deletions by the TTL monitor, so a low count does not mean the
        // index is idle or safe to drop.
        for (const indexStat of indexStats) {
          if (indexStat.spec?.expireAfterSeconds !== undefined) {
            const usage = indexStat.accesses;
            console.log(`TTL index ${indexStat.name} usage:`, usage);

            if (usage.ops < 100 && usage.since) {
              console.log(`TTL index ${indexStat.name} is rarely used by queries; it still drives expiration`);
            }
          }
        }
      }

    } catch (error) {
      console.error('Error optimizing TTL indexes:', error);
    }
  }

  async getTTLStatus() {
    const status = {
      collectionsWithTTL: 0,
      totalTTLIndexes: 0,
      activeExpirations: {},
      systemHealth: 'healthy'
    };

    for (const [collectionName, collection] of Object.entries(this.collections)) {
      if (collectionName === 'ttlMetrics') continue;

      const indexes = await collection.listIndexes().toArray();
      // An index is a TTL index when it defines expireAfterSeconds (0 is a valid value)
      const ttlIndexes = indexes.filter(index => index.expireAfterSeconds !== undefined);

      if (ttlIndexes.length > 0) {
        status.collectionsWithTTL++;
        status.totalTTLIndexes += ttlIndexes.length;

        // Estimate documents that will expire soon
        const soonToExpire = await this.estimateSoonToExpire(collection, ttlIndexes);
        status.activeExpirations[collectionName] = soonToExpire;
      }
    }

    return status;
  }

  async estimateSoonToExpire(collection, ttlIndexes) {
    let totalSoonToExpire = 0;
    const oneHourMs = 60 * 60 * 1000;

    for (const index of ttlIndexes) {
      const fieldName = Object.keys(index.key)[0];

      // A document expires once field + expireAfterSeconds has passed, so it
      // expires within the hour when field < now - expireAfterSeconds + 1 hour.
      // With expireAfterSeconds = 0 this reduces to field < now + 1 hour.
      const cutoffTime = new Date(Date.now() - (index.expireAfterSeconds * 1000) + oneHourMs);

      // Apply the partial filter so a partial TTL index counts only the
      // documents it actually covers
      const count = await collection.countDocuments({
        ...(index.partialFilterExpression || {}),
        [fieldName]: { $lt: cutoffTime }
      });

      totalSoonToExpire += count;
    }

    return totalSoonToExpire;
  }

  async shutdown() {
    console.log('Shutting down TTL Manager...');

    // Final metrics collection
    if (this.config.enableMetrics) {
      await this.collectTTLMetrics();
    }

    // Display final status
    const status = await this.getTTLStatus();
    console.log('Final TTL Status:', status);

    console.log('TTL Manager shutdown complete');
  }
}

// Benefits of MongoDB TTL Collections:
// - Automatic data expiration without manual intervention
// - Multiple TTL strategies (fixed time, document field, partial indexes)
// - Space freed by expired documents is reused by the storage engine for new data
// - Integration with MongoDB's index and query optimization
// - Flexible retention policies based on data characteristics
// - No external job scheduling required
// - Consistent behavior across replica sets and sharded clusters
// - Real-time metrics and monitoring capabilities
// - SQL-compatible TTL operations through QueryLeaf integration

module.exports = {
  MongoDBTTLManager
};
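
The level-based retention in `setupLogsTTL()` above repeats one `createIndex` call per log level. As a sketch, the same index definitions could be generated from a retention map. The helper `buildLevelTtlIndexSpecs` and its parameter names are hypothetical, and the code only builds plain objects without contacting a server.

```javascript
const DAY_SECONDS = 24 * 3600;

// Hypothetical helper: builds one partial TTL index definition per log level.
function buildLevelTtlIndexSpecs(retentionDaysByLevel, dateField = 'createdAt') {
  return Object.entries(retentionDaysByLevel).map(([level, days]) => {
    if (!Number.isInteger(days) || days <= 0) {
      throw new RangeError(`retention for ${level} must be a positive whole number of days`);
    }

    return {
      key: { [dateField]: 1 },
      options: {
        name: `${level.toLowerCase()}_logs_ttl`,
        expireAfterSeconds: days * DAY_SECONDS,
        // Each index covers only the documents of one log level
        partialFilterExpression: { logLevel: level }
      }
    };
  });
}

const specs = buildLevelTtlIndexSpecs({ DEBUG: 7, INFO: 30, WARN: 90, ERROR: 365 });
console.log(specs.map(spec => `${spec.options.name}: ${spec.options.expireAfterSeconds}s`));
```

Each entry can be passed to `collection.createIndex(spec.key, spec.options)`. Keeping the retention map in one place also makes it easier to report the expected expiry for a new log entry from the same source of truth.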

Understanding MongoDB TTL Architecture

Advanced TTL Patterns and Configuration Strategies

Implement sophisticated TTL patterns for different data lifecycle requirements:

// Advanced TTL patterns for production MongoDB deployments
class AdvancedTTLStrategies extends MongoDBTTLManager {
  constructor(db, advancedConfig) {
    super(db, advancedConfig);

    this.advancedConfig = {
      ...advancedConfig,
      enableTimezoneSupport: true,
      enableConditionalExpiration: true,
      enableGradualExpiration: true,
      enableExpirationNotifications: true,
      enableComplianceMode: true
    };
  }

  async setupConditionalTTL() {
    // TTL that expires documents based on multiple conditions
    console.log('Setting up conditional TTL with complex business logic...');

    const conditionalTTLCollection = this.db.collection('conditional_expiration');

    // Different TTL for different user tiers
    await conditionalTTLCollection.createIndex(
      { lastActivityAt: 1 },
      {
        expireAfterSeconds: 30 * 24 * 3600, // 30 days for free tier
        partialFilterExpression: { 
          userTier: 'free',
          isPremium: false 
        },
        background: true,
        name: 'free_user_data_ttl'
      }
    );

    await conditionalTTLCollection.createIndex(
      { lastActivityAt: 1 },
      {
        expireAfterSeconds: 365 * 24 * 3600, // 1 year for premium users
        partialFilterExpression: { 
          userTier: 'premium',
          isPremium: true 
        },
        background: true,
        name: 'premium_user_data_ttl'
      }
    );

    // Business-critical data is retained for 7 years before automatic expiration
    await conditionalTTLCollection.createIndex(
      { reviewDate: 1 },
      {
        expireAfterSeconds: 7 * 365 * 24 * 3600, // 7 years for compliance
        partialFilterExpression: { 
          dataClassification: 'business_critical',
          complianceRetentionRequired: true
        },
        background: true,
        name: 'compliance_data_ttl'
      }
    );
  }

  async setupGradualExpiration() {
    // Implement gradual expiration to reduce system load
    console.log('Setting up gradual expiration strategy...');

    const gradualCollection = this.db.collection('gradual_expiration');

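    // Stagger expiration across buckets with different retention periods.
    // The hour values are bucket labels only: the TTL monitor runs continuously
    // (about every 60 seconds) and cannot be scheduled for a time of day.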
    const timeBuckets = [
      { hour: 2, expireSeconds: 7 * 24 * 3600 },   // 2 AM
      { hour: 14, expireSeconds: 14 * 24 * 3600 }, // 2 PM
      { hour: 20, expireSeconds: 21 * 24 * 3600 }  // 8 PM
    ];

    for (const bucket of timeBuckets) {
      await gradualCollection.createIndex(
        { createdAt: 1 },
        {
          expireAfterSeconds: bucket.expireSeconds,
          partialFilterExpression: {
            expirationBucket: bucket.hour
          },
          background: true,
          name: `gradual_ttl_${bucket.hour}h`
        }
      );
    }
  }

  async createDocumentWithGradualExpiration(data) {
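    // Assign an expiration bucket from the document's integer hash when one is
    // provided; otherwise fall back to a random bucket
    const buckets = [2, 14, 20];
    const bucketIndex = Number.isInteger(data.hashCode)
      ? Math.abs(data.hashCode) % buckets.length
      : Math.floor(Math.random() * buckets.length);
    const selectedBucket = buckets[bucketIndex];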

    const document = {
      ...data,
      createdAt: new Date(),
      expirationBucket: selectedBucket,

      // Informational only: the TTL indexes above key on createdAt, so this
      // field does not shift the actual deletion time
      expirationJitter: Math.floor(Math.random() * 3600) // 0-1 hour, in seconds
    };

    return await this.db.collection('gradual_expiration').insertOne(document);
  }

  async setupTimezoneTTL() {
    // TTL that respects business hours and timezones
    console.log('Setting up timezone-aware TTL...');

    const timezoneCollection = this.db.collection('timezone_expiration');

    // Create TTL based on business date rather than UTC
    await timezoneCollection.createIndex(
      { businessDateExpiry: 1 },
      {
        expireAfterSeconds: 0, // Use document field
        background: true,
        name: 'business_timezone_ttl'
      }
    );
  }

  async createBusinessHoursTTLDocument(data, businessTimezone = 'America/New_York', retentionDays = 30) {
    const moment = require('moment-timezone');

    // Calculate expiration at end of business day in specified timezone
    const businessExpiry = moment()
      .tz(businessTimezone)
      .add(retentionDays, 'days')
      .endOf('day') // Expire at end of business day
      .toDate();

    const document = {
      ...data,
      createdAt: new Date(),
      businessDateExpiry: businessExpiry,
      timezone: businessTimezone,
      retentionPolicy: 'business_hours_aligned'
    };

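    // timezoneCollection is local to setupTimezoneTTL(), so look the collection up here
    return await this.db.collection('timezone_expiration').insertOne(document);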
  }

  async setupComplianceTTL() {
    // TTL with compliance and audit requirements
    console.log('Setting up compliance-aware TTL...');

    const complianceCollection = this.db.collection('compliance_data');

    // Legal hold prevents automatic expiration
    await complianceCollection.createIndex(
      { scheduledDestructionDate: 1 },
      {
        expireAfterSeconds: 0,
        partialFilterExpression: {
          legalHold: false,
          complianceStatus: 'approved_for_destruction'
        },
        background: true,
        name: 'compliance_ttl'
      }
    );

    // Audit trail for expired documents
    await complianceCollection.createIndex(
      { auditExpirationDate: 1 },
      {
        expireAfterSeconds: 10 * 365 * 24 * 3600, // 10 years for audit trail
        background: true,
        name: 'audit_trail_ttl'
      }
    );
  }

  async createComplianceDocument(data, retentionYears = 7) {
    const scheduledDestruction = new Date();
    scheduledDestruction.setFullYear(scheduledDestruction.getFullYear() + retentionYears);

    const document = {
      ...data,
      createdAt: new Date(),
      retentionPeriodYears: retentionYears,
      scheduledDestructionDate: scheduledDestruction,

      // Compliance metadata
      legalHold: false,
      complianceStatus: 'under_retention',
      dataClassification: data.dataClassification || 'standard',

      // Audit requirements
      auditExpirationDate: new Date(scheduledDestruction.getTime() + (3 * 365 * 24 * 60 * 60 * 1000)) // +3 years
    };

    return await this.db.collection('compliance_data').insertOne(document);
  }

  async implementExpirationNotifications() {
    // Poll periodically for documents that are about to expire. Change streams
    // only report TTL deletions after the fact, so advance warnings need a query.
    console.log('Setting up expiration notifications...');

    // Keep the timer handle so callers can stop polling with clearInterval()
    this.expirationCheckTimer = setInterval(async () => {
      try {
        await this.checkUpcomingExpirations();
      } catch (error) {
        console.error('Expiration check failed:', error);
      }
    }, 60 * 60 * 1000); // Check every hour
  }

  async checkUpcomingExpirations() {
    const collections = [
      'user_sessions', 
      'application_logs', 
      'temporary_data',
      'compliance_data'
    ];

    for (const collectionName of collections) {
      const collection = this.db.collection(collectionName);

      // Find documents expiring in the next 24 hours
      const tomorrow = new Date(Date.now() + (24 * 60 * 60 * 1000));

      const soonToExpire = await collection.find({
        $or: [
          { expiresAt: { $lt: tomorrow, $gte: new Date() } },
          { businessDateExpiry: { $lt: tomorrow, $gte: new Date() } },
          { scheduledDestructionDate: { $lt: tomorrow, $gte: new Date() } }
        ]
      }).toArray();

      if (soonToExpire.length > 0) {
        console.log(`${collectionName}: ${soonToExpire.length} documents expiring within 24 hours`);

        // Send notifications or trigger workflows
        await this.sendExpirationNotifications(collectionName, soonToExpire);
      }
    }
  }

  async sendExpirationNotifications(collectionName, documents) {
    // Implementation would integrate with notification systems
    const notification = {
      timestamp: new Date(),
      collection: collectionName,
      documentsCount: documents.length,
      urgency: 'medium',
      action: 'documents_expiring_soon'
    };

    console.log('Expiration notification:', notification);

    // Store notification for processing
    await this.db.collection('expiration_notifications').insertOne(notification);
  }
}
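
The gradual expiration pattern above assigns each document to a bucket "based on hash of document ID". A minimal sketch of one way to do that deterministically is shown below. The helper names (`fnv1a32`, `pickExpirationBucket`) and the choice of the FNV-1a hash are assumptions made for illustration and are not part of the class above.

```javascript
// Hypothetical helper: map a document ID to one of the expiration buckets.
// The bucket labels match the timeBuckets used in setupGradualExpiration().
const EXPIRATION_BUCKETS = [2, 14, 20];

// FNV-1a 32-bit hash over the UTF-16 code units of a string.
function fnv1a32(text) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime and keep the result as an unsigned 32-bit value
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// The same ID always lands in the same bucket, so retries and re-inserts
// do not move a document between retention periods.
function pickExpirationBucket(documentId, buckets = EXPIRATION_BUCKETS) {
  return buckets[fnv1a32(String(documentId)) % buckets.length];
}
```

A caller could set `expirationBucket: pickExpirationBucket(data._id)` when building the document, instead of relying on a caller-supplied `hashCode`.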

SQL-Style TTL Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB TTL operations:

-- QueryLeaf TTL operations with SQL-familiar syntax

-- Create TTL-enabled collections with automatic expiration
CREATE TABLE user_sessions (
  session_id UUID PRIMARY KEY,
  user_id VARCHAR(50) NOT NULL,
  session_data DOCUMENT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  expires_at TIMESTAMP NOT NULL,
  last_accessed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  is_active BOOLEAN DEFAULT true
)
WITH TTL (
  -- Multiple TTL strategies
  expires_at EXPIRE_AFTER 0,  -- Use document field value
  last_accessed_at EXPIRE_AFTER '7 days' -- Inactive session cleanup
);

-- Create application logs with level-based retention
CREATE TABLE application_logs (
  log_id UUID PRIMARY KEY,
  application_name VARCHAR(100) NOT NULL,
  log_level VARCHAR(20) NOT NULL,
  message TEXT,
  metadata DOCUMENT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
WITH TTL (
  -- Different retention by log level using partial indexes
  created_at EXPIRE_AFTER '7 days' WHERE log_level = 'DEBUG',
  created_at EXPIRE_AFTER '30 days' WHERE log_level = 'INFO',
  created_at EXPIRE_AFTER '90 days' WHERE log_level = 'WARN',
  created_at EXPIRE_AFTER '365 days' WHERE log_level IN ('ERROR', 'CRITICAL')
);

-- Temporary data with flexible TTL
CREATE TABLE temporary_data (
  temp_id UUID PRIMARY KEY,
  data_type VARCHAR(100),
  data DOCUMENT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  expires_at TIMESTAMP,
  ttl_seconds INTEGER DEFAULT 3600
)
WITH TTL (
  expires_at EXPIRE_AFTER 0,  -- Use document field
  created_at EXPIRE_AFTER '24 hours' WHERE expires_at IS NULL  -- Default fallback
);

-- Insert session with custom TTL
INSERT INTO user_sessions (user_id, session_data, expires_at, is_active)
VALUES 
  ('user123', '{"preferences": {"theme": "dark"}}', CURRENT_TIMESTAMP + INTERVAL '2 hours', true),
  ('user456', '{"preferences": {"lang": "en"}}', CURRENT_TIMESTAMP + INTERVAL '1 day', true);

-- Insert log entries (automatic TTL based on level)
INSERT INTO application_logs (application_name, log_level, message, metadata)
VALUES 
  ('web-server', 'DEBUG', 'Request processed', '{"endpoint": "/api/users", "duration": 45}'),
  ('web-server', 'ERROR', 'Database connection failed', '{"error": "timeout", "retry_count": 3}'),
  ('payment-service', 'INFO', 'Payment processed', '{"amount": 99.99, "currency": "USD"}');

-- Query active sessions with TTL information
SELECT 
  session_id,
  user_id,
  created_at,
  expires_at,

  -- Calculate remaining TTL
  EXTRACT(EPOCH FROM (expires_at - CURRENT_TIMESTAMP)) as seconds_until_expiry,

  -- Expiration status
  CASE 
    WHEN expires_at <= CURRENT_TIMESTAMP THEN 'expired'
    WHEN expires_at <= CURRENT_TIMESTAMP + INTERVAL '1 hour' THEN 'expiring_soon'
    ELSE 'active'
  END as expiration_status,

  -- Session age
  EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - created_at)) as session_age_seconds

FROM user_sessions
WHERE is_active = true
ORDER BY expires_at ASC;

-- Extend session TTL (renew expiration)
UPDATE user_sessions 
SET 
  expires_at = CURRENT_TIMESTAMP + INTERVAL '2 hours',
  last_accessed_at = CURRENT_TIMESTAMP
WHERE session_id = 'session-uuid-here'
  AND is_active = true
  AND expires_at > CURRENT_TIMESTAMP;

-- Store temporary data with custom expiration
INSERT INTO temporary_data (data_type, data, expires_at, ttl_seconds)
VALUES 
  ('cache_entry', '{"result": [1,2,3], "computed_at": "2025-11-01T10:00:00Z"}', CURRENT_TIMESTAMP + INTERVAL '5 minutes', 300),
  ('user_upload', '{"filename": "document.pdf", "size": 1024000}', CURRENT_TIMESTAMP + INTERVAL '24 hours', 86400),
  ('temp_report', '{"report_data": {...}, "generated_for": "user123"}', CURRENT_TIMESTAMP + INTERVAL '1 hour', 3600);

-- Advanced TTL queries with business logic
WITH session_analytics AS (
  SELECT 
    user_id,
    COUNT(*) as total_sessions,
    AVG(EXTRACT(EPOCH FROM (expires_at - created_at))) as avg_session_duration,
    MAX(last_accessed_at) as last_activity,

    -- TTL health metrics
    COUNT(*) FILTER (WHERE expires_at <= CURRENT_TIMESTAMP) as expired_sessions,
    COUNT(*) FILTER (WHERE expires_at > CURRENT_TIMESTAMP AND expires_at <= CURRENT_TIMESTAMP + INTERVAL '1 hour') as soon_to_expire,
    COUNT(*) FILTER (WHERE last_accessed_at < CURRENT_TIMESTAMP - INTERVAL '1 day') as inactive_sessions

  FROM user_sessions
  WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '7 days'
  GROUP BY user_id
),
user_engagement AS (
  SELECT 
    sa.*,

    -- Engagement scoring
    CASE 
      WHEN avg_session_duration > 7200 AND inactive_sessions = 0 THEN 'highly_engaged'
      WHEN avg_session_duration > 1800 AND inactive_sessions < 2 THEN 'engaged'
      WHEN inactive_sessions > total_sessions * 0.5 THEN 'low_engagement'
      ELSE 'moderate_engagement'
    END as engagement_level,

    -- TTL optimization recommendations
    CASE 
      WHEN inactive_sessions > 5 THEN 'reduce_session_ttl'
      WHEN expired_sessions = 0 AND soon_to_expire = 0 THEN 'extend_session_ttl'
      ELSE 'current_ttl_optimal'
    END as ttl_recommendation

  FROM session_analytics sa
)
SELECT 
  user_id,
  total_sessions,
  ROUND(avg_session_duration / 60, 2) as avg_session_minutes,
  last_activity,
  engagement_level,
  ttl_recommendation,

  -- Session health indicators
  ROUND((total_sessions - expired_sessions)::numeric / total_sessions * 100, 1) as session_health_pct,

  -- TTL efficiency metrics
  expired_sessions,
  soon_to_expire,
  inactive_sessions

FROM user_engagement
WHERE total_sessions > 0
ORDER BY 
  CASE engagement_level 
    WHEN 'highly_engaged' THEN 1
    WHEN 'engaged' THEN 2
    WHEN 'moderate_engagement' THEN 3
    ELSE 4
  END,
  total_sessions DESC;

-- Log retention analysis with TTL monitoring
WITH log_retention_analysis AS (
  SELECT 
    application_name,
    log_level,
    DATE_TRUNC('day', created_at) as log_date,
    COUNT(*) as daily_log_count,
    AVG(LENGTH(message)) as avg_message_length,

    -- TTL calculation based on level-specific retention
    CASE log_level
      WHEN 'DEBUG' THEN created_at + INTERVAL '7 days'
      WHEN 'INFO' THEN created_at + INTERVAL '30 days'
      WHEN 'WARN' THEN created_at + INTERVAL '90 days'
      WHEN 'ERROR' THEN created_at + INTERVAL '365 days'
      WHEN 'CRITICAL' THEN created_at + INTERVAL '365 days'
      ELSE created_at + INTERVAL '30 days'
    END as estimated_expiry,

    -- Storage impact analysis
    SUM(LENGTH(message) + COALESCE(LENGTH(metadata::TEXT), 0)) as daily_storage_bytes

  FROM application_logs
  WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '30 days'
  GROUP BY application_name, log_level, DATE_TRUNC('day', created_at)
),
storage_projections AS (
  SELECT 
    application_name,
    log_level,

    -- Current metrics
    SUM(daily_log_count) as total_logs,
    AVG(daily_log_count) as avg_daily_logs,
    SUM(daily_storage_bytes) as total_storage_bytes,
    AVG(daily_storage_bytes) as avg_daily_storage,

    -- TTL impact
    MIN(estimated_expiry) as earliest_expiry,
    MAX(estimated_expiry) as latest_expiry,

    -- Storage efficiency
    CASE log_level
      WHEN 'DEBUG' THEN SUM(daily_storage_bytes) * 7 / 30 -- 7-day retention
      WHEN 'INFO' THEN SUM(daily_storage_bytes) -- 30-day retention
      WHEN 'WARN' THEN SUM(daily_storage_bytes) * 3 -- 90-day retention
      ELSE SUM(daily_storage_bytes) * 12 -- 365-day retention
    END as projected_steady_state_storage

  FROM log_retention_analysis
  GROUP BY application_name, log_level
)
SELECT 
  application_name,
  log_level,
  total_logs,
  avg_daily_logs,

  -- Storage analysis
  ROUND(total_storage_bytes / 1024.0 / 1024.0, 2) as storage_mb,
  ROUND(avg_daily_storage / 1024.0 / 1024.0, 2) as avg_daily_mb,
  ROUND(projected_steady_state_storage / 1024.0 / 1024.0, 2) as steady_state_mb,

  -- TTL effectiveness
  earliest_expiry,
  latest_expiry,
  EXTRACT(DAYS FROM (latest_expiry - earliest_expiry)) as retention_range_days,

  -- Storage optimization
  ROUND((total_storage_bytes - projected_steady_state_storage) / 1024.0 / 1024.0, 2) as storage_savings_mb,
  ROUND(((total_storage_bytes - projected_steady_state_storage)::numeric / NULLIF(total_storage_bytes, 0) * 100), 1) as storage_reduction_pct,

  -- Recommendations
  CASE 
    WHEN log_level = 'DEBUG' AND avg_daily_logs > 10000 THEN 'Consider shorter DEBUG retention or sampling'
    WHEN projected_steady_state_storage > total_storage_bytes * 2 THEN 'TTL may be too long for this log volume'
    WHEN projected_steady_state_storage < total_storage_bytes * 0.1 THEN 'TTL may be too aggressive'
    ELSE 'TTL appears well-configured'
  END as ttl_recommendation

FROM storage_projections
WHERE total_logs > 0
ORDER BY application_name, 
  CASE log_level 
    WHEN 'CRITICAL' THEN 1
    WHEN 'ERROR' THEN 2
    WHEN 'WARN' THEN 3
    WHEN 'INFO' THEN 4
    WHEN 'DEBUG' THEN 5
  END;

-- TTL index health monitoring
WITH ttl_index_health AS (
  SELECT 
    'user_sessions' as collection_name,
    'session_ttl' as index_name,
    'expires_at' as ttl_field,
    0 as expire_after_seconds,

    -- Health metrics
    COUNT(*) as total_documents,
    COUNT(*) FILTER (WHERE expires_at <= CURRENT_TIMESTAMP) as expired_documents,
    COUNT(*) FILTER (WHERE expires_at > CURRENT_TIMESTAMP AND expires_at <= CURRENT_TIMESTAMP + INTERVAL '1 hour') as expiring_soon,

    -- Performance metrics
    AVG(EXTRACT(EPOCH FROM (expires_at - created_at))) as avg_document_lifetime,
    MIN(expires_at) as earliest_expiry,
    MAX(expires_at) as latest_expiry

  FROM user_sessions

  UNION ALL

  SELECT 
    'application_logs' as collection_name,
    'logs_level_ttl' as index_name,
    'created_at' as ttl_field,
    CASE log_level
      WHEN 'DEBUG' THEN 7 * 24 * 3600
      WHEN 'INFO' THEN 30 * 24 * 3600
      WHEN 'WARN' THEN 90 * 24 * 3600
      ELSE 365 * 24 * 3600
    END as expire_after_seconds,

    COUNT(*) as total_documents,
    COUNT(*) FILTER (WHERE 
      created_at <= CURRENT_TIMESTAMP - 
      CASE log_level
        WHEN 'DEBUG' THEN INTERVAL '7 days'
        WHEN 'INFO' THEN INTERVAL '30 days'
        WHEN 'WARN' THEN INTERVAL '90 days'
        ELSE INTERVAL '365 days'
      END
    ) as expired_documents,
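    COUNT(*) FILTER (WHERE 
      -- Not yet past retention...
      created_at > CURRENT_TIMESTAMP - 
      CASE log_level
        WHEN 'DEBUG' THEN INTERVAL '7 days'
        WHEN 'INFO' THEN INTERVAL '30 days'
        WHEN 'WARN' THEN INTERVAL '90 days'
        ELSE INTERVAL '365 days'
      END
      -- ...but due to pass it within the next day
      AND created_at <= CURRENT_TIMESTAMP + INTERVAL '1 day' - 
      CASE log_level
        WHEN 'DEBUG' THEN INTERVAL '7 days'
        WHEN 'INFO' THEN INTERVAL '30 days'
        WHEN 'WARN' THEN INTERVAL '90 days'
        ELSE INTERVAL '365 days'
      END
    ) as expiring_soon,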

    AVG(EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - created_at))) as avg_document_lifetime,
    MIN(created_at) as earliest_expiry,
    MAX(created_at) as latest_expiry

  FROM application_logs
  GROUP BY log_level
)
SELECT 
  collection_name,
  index_name,
  ttl_field,
  expire_after_seconds,
  total_documents,
  expired_documents,
  expiring_soon,

  -- TTL efficiency metrics
  ROUND(avg_document_lifetime / 3600, 2) as avg_lifetime_hours,
  CASE 
    WHEN total_documents > 0 
    THEN ROUND((expired_documents::numeric / total_documents) * 100, 2)
    ELSE 0
  END as expiration_rate_pct,

  -- TTL health indicators
  CASE 
    WHEN expired_documents > total_documents * 0.9 THEN 'unhealthy_high_expiration'
    WHEN expired_documents = 0 AND total_documents > 1000 THEN 'no_expiration_detected'
    WHEN expiring_soon > total_documents * 0.5 THEN 'high_upcoming_expiration'
    ELSE 'healthy'
  END as ttl_health_status,

  -- Performance impact assessment
  CASE 
    WHEN expired_documents > 10000 THEN 'high_cleanup_load'
    WHEN expiring_soon > 5000 THEN 'moderate_cleanup_load'
    ELSE 'low_cleanup_load'
  END as cleanup_load_assessment

FROM ttl_index_health
ORDER BY collection_name, expire_after_seconds;

-- TTL collection management commands
-- Monitor TTL operations
SHOW TTL STATUS;

-- Optimize TTL indexes
OPTIMIZE TTL INDEXES;

-- Modify TTL expiration times
ALTER TABLE user_sessions 
MODIFY TTL expires_at EXPIRE_AFTER 0,
MODIFY TTL last_accessed_at EXPIRE_AFTER '14 days';

-- Remove TTL from a collection
ALTER TABLE temporary_data DROP TTL created_at;

-- QueryLeaf provides comprehensive TTL capabilities:
-- 1. SQL-familiar TTL creation and management syntax
-- 2. Multiple TTL strategies (field-based, time-based, conditional)
-- 3. Advanced TTL monitoring and health assessment
-- 4. Automatic storage optimization and cleanup
-- 5. Business logic integration with TTL policies
-- 6. Compliance and audit-friendly TTL management
-- 7. Performance monitoring and optimization recommendations
-- 8. Integration with MongoDB's native TTL optimizations
-- 9. Flexible retention policies with partial index support
-- 10. Familiar SQL syntax for complex TTL operations
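
For readers who want to see what the level-based `WITH TTL` declaration for `application_logs` could correspond to natively, here is a minimal sketch that builds one partial TTL index specification per log level. How QueryLeaf actually translates this syntax is not shown in this article, so the mapping, the function name `buildLogTtlIndexSpecs`, and the index names are all assumptions.

```javascript
// Hypothetical retention map mirroring the application_logs example above.
const LOG_RETENTION_DAYS = { DEBUG: 7, INFO: 30, WARN: 90, ERROR: 365, CRITICAL: 365 };

// Build one { key, options } pair per log level. Each pair can be passed to
// collection.createIndex(key, options) in the Node.js driver.
function buildLogTtlIndexSpecs(retentionDays = LOG_RETENTION_DAYS) {
  return Object.entries(retentionDays).map(([level, days]) => ({
    key: { created_at: 1 },
    options: {
      name: `logs_ttl_${level.toLowerCase()}`, // assumed naming scheme
      expireAfterSeconds: days * 24 * 3600,
      // Equality filter only, so no $in support is needed on the server
      partialFilterExpression: { log_level: level }
    }
  }));
}

// Usage with a connected driver (not executed in this sketch):
//   for (const { key, options } of buildLogTtlIndexSpecs()) {
//     await db.collection('application_logs').createIndex(key, options);
//   }
```

All five specifications share the key `{ created_at: 1 }`, so this approach depends on a server version that accepts several partial indexes on the same key pattern. On servers that do not, a single `expires_at` field computed at insert time with `expireAfterSeconds: 0` gives the same per-level retention with one index.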

Best Practices for TTL Implementation

Data Lifecycle Strategy Design

Essential principles for effective TTL implementation:

Business Alignment: Design TTL policies that align with business requirements and compliance needs
Performance Optimization: Consider the impact of TTL operations on database performance
Storage Management: Balance data retention needs with storage costs and performance
Monitoring Strategy: Implement comprehensive monitoring for TTL effectiveness
Gradual Implementation: Roll out TTL policies gradually to assess impact
Backup Considerations: Ensure TTL policies don't conflict with backup and recovery strategies
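
One way to act on the "Gradual Implementation" principle is a dry run: before creating a TTL index, count the documents it would delete on its first pass. The sketch below builds the equivalent query filter. The function name `buildTtlDryRunFilter` and its parameters are hypothetical; the only behavior it relies on is that a document becomes eligible for deletion once its indexed date plus `expireAfterSeconds` has passed.

```javascript
// Build a filter matching the documents a TTL index would already consider expired.
// field:              the date field the TTL index would be created on
// expireAfterSeconds: the retention period planned for the index
// partialFilter:      optional partialFilterExpression planned for the index
// now:                injectable clock, which makes the function easy to test
function buildTtlDryRunFilter(field, expireAfterSeconds, partialFilter = {}, now = new Date()) {
  const cutoff = new Date(now.getTime() - expireAfterSeconds * 1000);
  return { ...partialFilter, [field]: { $lte: cutoff } };
}

// Usage with a connected driver (not executed in this sketch):
//   const filter = buildTtlDryRunFilter('created_at', 7 * 24 * 3600, { log_level: 'DEBUG' });
//   const wouldDelete = await db.collection('application_logs').countDocuments(filter);
```

If the count is large, the first TTL pass after index creation will delete that many documents, which is worth knowing before enabling the index on a busy cluster.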

Advanced TTL Configuration

Optimize TTL for production environments:

Index Strategy: Design TTL indexes to minimize performance impact during cleanup
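Batch Operations: Spread expiration timestamps so large numbers of documents do not expire at once during peak hours; the TTL monitor runs about every 60 seconds and cannot be scheduled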
Partial Indexes: Use partial indexes for complex retention policies
Compound TTL: TTL indexes must be single-field (compound indexes ignore expireAfterSeconds), so pair the TTL index with separate indexes for your query patterns
Timezone Handling: Account for business timezone requirements in TTL calculations
Compliance Integration: Ensure TTL policies meet regulatory and audit requirements
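
To make the point about avoiding large simultaneous deletions concrete, here is a minimal sketch that adds random jitter to an explicit expiry date. It assumes the collection has a TTL index on that expiry field with `expireAfterSeconds: 0`, so the stored date directly controls when the document becomes eligible. The function name `computeJitteredExpiry` and the one-hour default are assumptions for illustration.

```javascript
// Compute an expiry date of createdAt + ttlSeconds plus a random offset,
// so documents created in the same burst do not all expire in the same minute.
// random is injectable so the behavior can be tested deterministically.
function computeJitteredExpiry(createdAt, ttlSeconds, maxJitterSeconds = 3600, random = Math.random) {
  const jitterSeconds = Math.floor(random() * maxJitterSeconds);
  return new Date(createdAt.getTime() + (ttlSeconds + jitterSeconds) * 1000);
}

// Usage (not executed in this sketch):
//   const createdAt = new Date();
//   const doc = { ...data, createdAt, expiresAt: computeJitteredExpiry(createdAt, 24 * 3600) };
```

Jitter only lengthens retention here, never shortens it, which keeps the configured TTL as a lower bound.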

Conclusion

MongoDB TTL collections eliminate the complexity of manual data lifecycle management by providing native, automatic data expiration capabilities. The ability to configure flexible retention policies, monitor TTL effectiveness, and integrate with business logic makes TTL collections essential for modern data management strategies.

Key TTL benefits include:

Automatic Data Management: Hands-off data expiration without manual intervention
Flexible Retention Policies: Multiple TTL strategies for different data types and business requirements
Storage Optimization: Automatic cleanup reduces storage costs and improves performance
Compliance Support: Built-in capabilities for audit trails and regulatory compliance
Performance Benefits: Optimized cleanup operations with minimal impact on application performance
SQL Accessibility: Familiar SQL-style TTL operations through QueryLeaf integration

Whether you're managing user sessions, application logs, temporary data, or compliance-sensitive information, MongoDB TTL collections with QueryLeaf's familiar SQL interface provide the foundation for efficient, automated data lifecycle management.

QueryLeaf Integration: QueryLeaf seamlessly manages MongoDB TTL collections while providing SQL-familiar data lifecycle management syntax, retention policy configuration, and TTL monitoring capabilities. Advanced TTL patterns including conditional expiration, gradual cleanup, and compliance-aware retention are elegantly handled through familiar SQL constructs, making sophisticated data lifecycle management both powerful and accessible to SQL-oriented development teams.

The combination of MongoDB's robust TTL capabilities with SQL-style data lifecycle operations makes it an ideal platform for applications requiring both automated data management and familiar database interaction patterns, ensuring your TTL strategies remain both effective and maintainable as your data needs evolve and scale.

October 31, 2025
17 min read

MongoDB Bulk Operations and Performance Optimization: Advanced Batch Processing for High-Throughput Applications

High-throughput applications require efficient data processing capabilities that can handle large volumes of documents with minimal latency and optimal resource utilization. Traditional single-document operations become performance bottlenecks when applications need to process thousands or millions of documents, leading to increased response times, inefficient network utilization, and poor system scalability under heavy data processing loads.

MongoDB's bulk operations provide sophisticated batch processing capabilities that enable applications to perform multiple document operations in a single request, dramatically improving throughput while reducing network overhead and server-side processing costs. Unlike traditional databases that require complex batching logic or application-level transaction management, MongoDB offers native bulk operation support with automatic optimization, error handling, and performance monitoring.
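
As a small illustration of "multiple document operations in a single request", the sketch below builds a mixed operations array in the shape accepted by the Node.js driver's `collection.bulkWrite()`. The function name `buildUserBulkOps` and the `users` fields are hypothetical and chosen only to match the registration examples later in this article.

```javascript
// Build a mixed list of insert, update, and delete operations for bulkWrite().
// now is injectable so the generated update documents are predictable in tests.
function buildUserBulkOps({ inserts = [], activations = [], removals = [] }, now = new Date()) {
  return [
    // One insertOne per new user document
    ...inserts.map(document => ({ insertOne: { document } })),
    // One updateOne per email address to activate
    ...activations.map(email => ({
      updateOne: {
        filter: { email },
        update: { $set: { status: 'active', activated_at: now } }
      }
    })),
    // One deleteOne per email address to remove
    ...removals.map(email => ({ deleteOne: { filter: { email } } }))
  ];
}

// Usage with a connected driver (not executed in this sketch):
//   const ops = buildUserBulkOps({ inserts: newUsers, activations: ['john@example.com'] });
//   const result = await db.collection('users').bulkWrite(ops, { ordered: false });
```

With `ordered: false` the server may continue past individual failures, which is usually what a batch job wants; ordered execution stops at the first error.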

The Single-Document Operation Challenge

Traditional document-by-document processing approaches face significant performance limitations in high-volume scenarios:

-- Traditional approach - processing documents one at a time (inefficient pattern)

-- Example: Processing user registration batch - individual operations
INSERT INTO users (name, email, registration_date, status) 
VALUES ('John Doe', 'john@example.com', CURRENT_TIMESTAMP, 'pending');

INSERT INTO users (name, email, registration_date, status) 
VALUES ('Jane Smith', 'jane@example.com', CURRENT_TIMESTAMP, 'pending');

INSERT INTO users (name, email, registration_date, status) 
VALUES ('Bob Johnson', 'bob@example.com', CURRENT_TIMESTAMP, 'pending');

-- Problems with single-document operations:
-- 1. High network round-trip overhead for each operation
-- 2. Individual index updates and lock acquisitions
-- 3. Inefficient resource utilization and memory allocation
-- 4. Poor scaling characteristics under high load
-- 5. Complex error handling for partial failures
-- 6. Limited transaction scope and atomicity guarantees

-- Example: Updating user statuses individually (performance bottleneck)
UPDATE users SET status = 'active', activated_at = CURRENT_TIMESTAMP 
WHERE email = 'john@example.com';

UPDATE users SET status = 'active', activated_at = CURRENT_TIMESTAMP 
WHERE email = 'jane@example.com';

UPDATE users SET status = 'active', activated_at = CURRENT_TIMESTAMP 
WHERE email = 'bob@example.com';

-- Individual updates result in:
-- - Multiple database connections and query parsing overhead
-- - Repeated index lookups and document retrieval operations  
-- - Inefficient write operations with individual lock acquisitions
-- - High latency due to network round trips
-- - Difficult error recovery and consistency management
-- - Poor resource utilization with context switching overhead

-- Example: Data cleanup operations (large unbatched deletes)
DELETE FROM users WHERE last_login < CURRENT_DATE - INTERVAL '2 years';
-- Runs as one long statement with no progress reporting or throttling

DELETE FROM user_sessions WHERE created_at < CURRENT_DATE - INTERVAL '30 days';
-- Same issue: every matching row is removed in a single uninterruptible pass

DELETE FROM audit_logs WHERE log_date < CURRENT_DATE - INTERVAL '1 year';
-- Each cleanup is also a separate round trip issued by the application

-- Single-document limitations:
-- 1. Long-running operations that block other requests
-- 2. Inefficient resource allocation and memory usage
-- 3. Poor progress tracking and monitoring capabilities
-- 4. Difficult to implement proper error handling
-- 5. No batch-level optimization opportunities
-- 6. Complex application logic for managing large datasets
-- 7. Limited ability to prioritize or throttle operations
-- 8. Inefficient use of database connection pooling

-- Traditional PostgreSQL bulk insert attempt (limited capabilities)
BEGIN;
INSERT INTO users (name, email, registration_date, status) VALUES
  ('User 1', 'user1@example.com', CURRENT_TIMESTAMP, 'pending'),
  ('User 2', 'user2@example.com', CURRENT_TIMESTAMP, 'pending'),
  ('User 3', 'user3@example.com', CURRENT_TIMESTAMP, 'pending');
  -- Limited to relatively small batches due to query size restrictions
  -- No advanced error handling or partial success reporting
  -- Limited optimization compared to native bulk operations
COMMIT;

-- PostgreSQL bulk update limitations
UPDATE users SET 
  status = CASE 
    WHEN email = 'user1@example.com' THEN 'active'
    WHEN email = 'user2@example.com' THEN 'suspended'
    WHEN email = 'user3@example.com' THEN 'active'
    ELSE status
  END,
  last_updated = CURRENT_TIMESTAMP
WHERE email IN ('user1@example.com', 'user2@example.com', 'user3@example.com');

-- Issues with traditional bulk approaches:
-- 1. Complex SQL syntax for conditional updates
-- 2. Limited flexibility for different operations per document
-- 3. No built-in error reporting for individual items
-- 4. Query size limitations for large batches
-- 5. Poor performance characteristics compared to native bulk operations
-- 6. Limited monitoring and progress reporting capabilities
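
The query size limits noted above are usually handled by splitting the input into fixed-size batches before sending it. A minimal sketch of that chunking step follows; the generator name `chunkDocuments` is hypothetical, and the default of 1000 is only an assumed starting point to tune against your own workload.

```javascript
// Yield successive slices of at most batchSize documents.
// A generator avoids materializing every batch up front for large inputs.
function* chunkDocuments(documents, batchSize = 1000) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  for (let i = 0; i < documents.length; i += batchSize) {
    yield documents.slice(i, i + batchSize);
  }
}

// Usage with a connected driver (not executed in this sketch):
//   for (const batch of chunkDocuments(allUsers, 500)) {
//     await db.collection('users').insertMany(batch, { ordered: false });
//   }
```

Because each batch is awaited separately, a failure in one batch can be logged and retried without resending the documents that already succeeded.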

MongoDB bulk operations provide comprehensive high-performance batch processing:

// MongoDB Advanced Bulk Operations - comprehensive batch processing with optimization

const { MongoClient } = require('mongodb');

// Advanced MongoDB Bulk Operations Manager
class MongoDBBulkOperationsManager {
  constructor(db) {
    this.db = db;
    this.performanceMetrics = {
      bulkInserts: { operations: 0, documentsProcessed: 0, totalTime: 0 },
      bulkUpdates: { operations: 0, documentsProcessed: 0, totalTime: 0 },
      bulkDeletes: { operations: 0, documentsProcessed: 0, totalTime: 0 },
      bulkWrites: { operations: 0, documentsProcessed: 0, totalTime: 0 }
    };
    this.errorTracking = new Map();
    this.optimizationSettings = {
      defaultBatchSize: 1000,
      maxBatchSize: 10000,
      enableOrdered: false, // Unordered operations for better performance
      enableBypassValidation: false,
      retryAttempts: 3,
      retryDelayMs: 1000
    };
  }

  // High-performance bulk insert operations
  async performBulkInsert(collectionName, documents, options = {}) {
    console.log(`Starting bulk insert of ${documents.length} documents into ${collectionName}`);

    const startTime = Date.now();
    const collection = this.db.collection(collectionName);

    // Configure bulk insert options for optimal performance
    const bulkOptions = {
      ordered: options.ordered !== undefined ? options.ordered : this.optimizationSettings.enableOrdered,
      bypassDocumentValidation: options.bypassValidation || this.optimizationSettings.enableBypassValidation,
      writeConcern: options.writeConcern || { w: 'majority', j: true }
    };

    try {
      // Process documents in optimal batch sizes
      const batchSize = Math.min(
        options.batchSize || this.optimizationSettings.defaultBatchSize,
        this.optimizationSettings.maxBatchSize
      );

      const results = [];
      let totalInserted = 0;
      let totalErrors = 0;

      for (let i = 0; i < documents.length; i += batchSize) {
        const batch = documents.slice(i, i + batchSize);

        try {
          console.log(`Processing batch ${Math.floor(i / batchSize) + 1} of ${Math.ceil(documents.length / batchSize)}`);

          // Add metadata to documents for tracking
          const enrichedBatch = batch.map(doc => ({
            ...doc,
            _bulk_operation_id: `bulk_insert_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`,
            _inserted_at: new Date(),
            _batch_number: Math.floor(i / batchSize) + 1
          }));

          const batchResult = await collection.insertMany(enrichedBatch, bulkOptions);

          results.push({
            batchIndex: Math.floor(i / batchSize),
            insertedCount: batchResult.insertedCount,
            insertedIds: batchResult.insertedIds,
            success: true
          });

          totalInserted += batchResult.insertedCount;

        } catch (error) {
          console.error(`Batch ${Math.floor(i / batchSize) + 1} failed:`, error.message);

          // Handle partial failures in unordered operations
          if (error.result && error.result.insertedCount) {
            totalInserted += error.result.insertedCount;
          }

          totalErrors += batch.length - (error.result?.insertedCount || 0);

          results.push({
            batchIndex: Math.floor(i / batchSize),
            insertedCount: error.result?.insertedCount || 0,
            error: error.message,
            success: false
          });

          // Track errors for analysis
          this.trackBulkOperationError('bulkInsert', error);
        }
      }

      const totalTime = Date.now() - startTime;

      // Update performance metrics
      this.updatePerformanceMetrics('bulkInserts', {
        operations: 1,
        documentsProcessed: totalInserted,
        totalTime: totalTime
      });

      const summary = {
        success: totalErrors === 0,
        totalDocuments: documents.length,
        insertedDocuments: totalInserted,
        failedDocuments: totalErrors,
        executionTimeMs: totalTime,
        // Guard against a zero elapsed time on very small inputs
        throughputDocsPerSecond: totalTime > 0 ? Math.round((totalInserted / totalTime) * 1000) : 0,
        batchResults: results
      };

      console.log(`Bulk insert completed: ${totalInserted}/${documents.length} documents processed in ${totalTime}ms`);
      return summary;

    } catch (error) {
      console.error('Bulk insert operation failed:', error);
      this.trackBulkOperationError('bulkInsert', error);
      throw error;
    }
  }

  // Advanced bulk update operations with flexible patterns
  async performBulkUpdate(collectionName, updateOperations, options = {}) {
    console.log(`Starting bulk update of ${updateOperations.length} operations on ${collectionName}`);

    const startTime = Date.now();
    const collection = this.db.collection(collectionName);

    try {
      // Initialize ordered or unordered bulk operation
      const bulkOp = options.ordered ? collection.initializeOrderedBulkOp() : 
                                       collection.initializeUnorderedBulkOp();

      let operationCount = 0;

      // Process different types of update operations
      for (const operation of updateOperations) {
        const { filter, update, upsert = false, arrayFilters = null, hint = null } = operation;

        // Add operation metadata for tracking
        const enhancedUpdate = {
          ...update,
          $set: {
            ...update.$set,
            _last_bulk_update: new Date(),
            _bulk_operation_id: `bulk_update_${Date.now()}_${operationCount}`
          }
        };

        // Configure the operation by chaining modifiers on the find() builder;
        // the bulk API's update methods accept only the update document
        let findOp = bulkOp.find(filter);
        if (upsert) findOp = findOp.upsert();
        if (arrayFilters) findOp = findOp.arrayFilters(arrayFilters);
        if (hint) findOp = findOp.hint(hint);

        // Add to bulk operation (update() is the multi-document form)
        if (operation.type === 'updateMany') {
          findOp.update(enhancedUpdate);
        } else {
          findOp.updateOne(enhancedUpdate);
        }

        operationCount++;

        // Log progress while queuing; nothing is sent until execute(), and the
        // driver splits oversized bulks into server-sized batches itself
        if (operationCount % this.optimizationSettings.defaultBatchSize === 0) {
          console.log(`Queued ${operationCount} operations so far`);
        }
      }

      // Execute all bulk update operations
      console.log(`Executing ${operationCount} bulk update operations`);
      const result = await bulkOp.execute({
        writeConcern: options.writeConcern || { w: 'majority', j: true }
      });

      const totalTime = Date.now() - startTime;

      // Update performance metrics
      this.updatePerformanceMetrics('bulkUpdates', {
        operations: 1,
        documentsProcessed: result.modifiedCount + result.upsertedCount,
        totalTime: totalTime
      });

      const summary = {
        success: true,
        totalOperations: operationCount,
        matchedDocuments: result.matchedCount,
        modifiedDocuments: result.modifiedCount,
        upsertedDocuments: result.upsertedCount,
        upsertedIds: result.upsertedIds,
        executionTimeMs: totalTime,
        throughputOpsPerSecond: Math.round((operationCount / Math.max(totalTime, 1)) * 1000),
        writeErrors: result.writeErrors || [],
        writeConcernErrors: result.writeConcernErrors || []
      };

      console.log(`Bulk update completed: ${result.modifiedCount} documents modified, ${result.upsertedCount} upserted in ${totalTime}ms`);
      return summary;

    } catch (error) {
      console.error('Bulk update operation failed:', error);
      this.trackBulkOperationError('bulkUpdate', error);

      // Return partial results if available
      if (error.result) {
        const totalTime = Date.now() - startTime;
        return {
          success: false,
          error: error.message,
          partialResult: {
            matchedDocuments: error.result.matchedCount,
            modifiedDocuments: error.result.modifiedCount,
            upsertedDocuments: error.result.upsertedCount,
            executionTimeMs: totalTime
          }
        };
      }
      throw error;
    }
  }

  // Optimized bulk delete operations
  async performBulkDelete(collectionName, deleteOperations, options = {}) {
    console.log(`Starting bulk delete of ${deleteOperations.length} operations on ${collectionName}`);

    const startTime = Date.now();
    const collection = this.db.collection(collectionName);

    try {
      // Initialize bulk operation
      const bulkOp = options.ordered ? collection.initializeOrderedBulkOp() : 
                                       collection.initializeUnorderedBulkOp();

      let operationCount = 0;

      // Process delete operations
      for (const operation of deleteOperations) {
        const { filter, deleteType = 'deleteMany', hint = null } = operation;

        // Apply the optional index hint on the find() builder
        let findOp = bulkOp.find(filter);
        if (hint) findOp = findOp.hint(hint);

        // Add to bulk operation based on type
        if (deleteType === 'deleteOne') {
          findOp.deleteOne();
        } else {
          findOp.delete(); // delete() removes all matching documents
        }

        operationCount++;
      }

      // Execute bulk delete operations
      console.log(`Executing ${operationCount} bulk delete operations`);
      const result = await bulkOp.execute({
        writeConcern: options.writeConcern || { w: 'majority', j: true }
      });

      const totalTime = Date.now() - startTime;

      // Update performance metrics
      this.updatePerformanceMetrics('bulkDeletes', {
        operations: 1,
        documentsProcessed: result.deletedCount,
        totalTime: totalTime
      });

      const summary = {
        success: true,
        totalOperations: operationCount,
        deletedDocuments: result.deletedCount,
        executionTimeMs: totalTime,
        throughputOpsPerSecond: Math.round((operationCount / Math.max(totalTime, 1)) * 1000),
        writeErrors: result.writeErrors || [],
        writeConcernErrors: result.writeConcernErrors || []
      };

      console.log(`Bulk delete completed: ${result.deletedCount} documents deleted in ${totalTime}ms`);
      return summary;

    } catch (error) {
      console.error('Bulk delete operation failed:', error);
      this.trackBulkOperationError('bulkDelete', error);

      if (error.result) {
        const totalTime = Date.now() - startTime;
        return {
          success: false,
          error: error.message,
          partialResult: {
            deletedDocuments: error.result.deletedCount,
            executionTimeMs: totalTime
          }
        };
      }
      throw error;
    }
  }

  // Mixed bulk operations (insert, update, delete in single batch)
  async performMixedBulkOperations(collectionName, operations, options = {}) {
    console.log(`Starting mixed bulk operations: ${operations.length} operations on ${collectionName}`);

    const startTime = Date.now();
    const collection = this.db.collection(collectionName);

    try {
      const bulkOp = options.ordered ? collection.initializeOrderedBulkOp() : 
                                       collection.initializeUnorderedBulkOp();

      let insertCount = 0;
      let updateCount = 0;
      let deleteCount = 0;

      // Process mixed operations
      for (const operation of operations) {
        const { type, ...opData } = operation;

        switch (type) {
          case 'insert':
            const enrichedDoc = {
              ...opData.document,
              _bulk_operation_id: `bulk_mixed_${Date.now()}_${insertCount}`,
              _inserted_at: new Date()
            };
            bulkOp.insert(enrichedDoc);
            insertCount++;
            break;

          case 'updateOne':
          case 'updateMany': {
            const updateData = {
              ...opData.update,
              $set: {
                ...opData.update.$set,
                _last_bulk_update: new Date(),
                _bulk_operation_id: `bulk_mixed_update_${Date.now()}_${updateCount}`
              }
            };

            // upsert is a modifier on the find() builder, not an argument
            // to the update call
            let findOp = bulkOp.find(opData.filter);
            if (opData.upsert) findOp = findOp.upsert();

            if (type === 'updateMany') {
              findOp.update(updateData); // multi-document update
            } else {
              findOp.updateOne(updateData);
            }
            updateCount++;
            break;
          }

          case 'deleteOne':
            bulkOp.find(opData.filter).deleteOne();
            deleteCount++;
            break;

          case 'deleteMany':
            bulkOp.find(opData.filter).delete();
            deleteCount++;
            break;

          default:
            console.warn(`Unknown operation type: ${type}`);
        }
      }

      // Execute mixed bulk operations
      console.log(`Executing mixed bulk operations: ${insertCount} inserts, ${updateCount} updates, ${deleteCount} deletes`);
      const result = await bulkOp.execute({
        writeConcern: options.writeConcern || { w: 'majority', j: true }
      });

      const totalTime = Date.now() - startTime;
      const totalDocumentsProcessed = result.insertedCount + result.modifiedCount + result.deletedCount + result.upsertedCount;

      // Update performance metrics
      this.updatePerformanceMetrics('bulkWrites', {
        operations: 1,
        documentsProcessed: totalDocumentsProcessed,
        totalTime: totalTime
      });

      const summary = {
        success: true,
        totalOperations: operations.length,
        operationBreakdown: {
          inserts: insertCount,
          updates: updateCount,
          deletes: deleteCount
        },
        results: {
          insertedDocuments: result.insertedCount,
          insertedIds: result.insertedIds,
          matchedDocuments: result.matchedCount,
          modifiedDocuments: result.modifiedCount,
          deletedDocuments: result.deletedCount,
          upsertedDocuments: result.upsertedCount,
          upsertedIds: result.upsertedIds
        },
        executionTimeMs: totalTime,
        throughputOpsPerSecond: Math.round((operations.length / Math.max(totalTime, 1)) * 1000),
        throughputDocsPerSecond: Math.round((totalDocumentsProcessed / Math.max(totalTime, 1)) * 1000),
        writeErrors: result.writeErrors || [],
        writeConcernErrors: result.writeConcernErrors || []
      };

      console.log(`Mixed bulk operations completed: ${totalDocumentsProcessed} documents processed in ${totalTime}ms`);
      return summary;

    } catch (error) {
      console.error('Mixed bulk operations failed:', error);
      this.trackBulkOperationError('bulkWrite', error);

      if (error.result) {
        const totalTime = Date.now() - startTime;
        const totalDocumentsProcessed = error.result.insertedCount + error.result.modifiedCount + error.result.deletedCount + error.result.upsertedCount;

        return {
          success: false,
          error: error.message,
          partialResult: {
            insertedDocuments: error.result.insertedCount,
            modifiedDocuments: error.result.modifiedCount,
            deletedDocuments: error.result.deletedCount,
            upsertedDocuments: error.result.upsertedCount,
            totalDocumentsProcessed: totalDocumentsProcessed,
            executionTimeMs: totalTime
          }
        };
      }
      throw error;
    }
  }

  // Performance monitoring and optimization
  updatePerformanceMetrics(operationType, metrics) {
    const current = this.performanceMetrics[operationType];
    current.operations += metrics.operations;
    current.documentsProcessed += metrics.documentsProcessed;
    current.totalTime += metrics.totalTime;
  }

  trackBulkOperationError(operationType, error) {
    if (!this.errorTracking.has(operationType)) {
      this.errorTracking.set(operationType, []);
    }

    this.errorTracking.get(operationType).push({
      timestamp: new Date(),
      error: error.message,
      code: error.code,
      details: error.writeErrors || error.result
    });
  }

  getBulkOperationStatistics() {
    const stats = {};

    for (const [operationType, metrics] of Object.entries(this.performanceMetrics)) {
      if (metrics.operations > 0) {
        stats[operationType] = {
          totalOperations: metrics.operations,
          documentsProcessed: metrics.documentsProcessed,
          averageExecutionTimeMs: Math.round(metrics.totalTime / metrics.operations),
          averageThroughputDocsPerSecond: Math.round((metrics.documentsProcessed / Math.max(metrics.totalTime, 1)) * 1000),
          totalExecutionTimeMs: metrics.totalTime
        };
      }
    }

    return stats;
  }

  getErrorStatistics() {
    const errorStats = {};

    for (const [operationType, errors] of this.errorTracking.entries()) {
      errorStats[operationType] = {
        totalErrors: errors.length,
        recentErrors: errors.filter(e => Date.now() - e.timestamp.getTime() < 3600000), // Last hour
        errorBreakdown: this.groupErrorsByCode(errors)
      };
    }

    return errorStats;
  }

  groupErrorsByCode(errors) {
    const breakdown = {};
    errors.forEach(error => {
      const code = error.code || 'Unknown';
      breakdown[code] = (breakdown[code] || 0) + 1;
    });
    return breakdown;
  }

  // Optimized data import functionality
  async performOptimizedDataImport(collectionName, dataSource, options = {}) {
    console.log(`Starting optimized data import for ${collectionName}`);

    const importOptions = {
      batchSize: options.batchSize || 5000,
      enableValidation: options.enableValidation !== false,
      createIndexes: options.createIndexes || false,
      dropExistingCollection: options.dropExisting || false,
      parallelBatches: options.parallelBatches || 1
    };

    try {
      const collection = this.db.collection(collectionName);

      // Drop existing collection if requested
      if (importOptions.dropExistingCollection) {
        try {
          await collection.drop();
          console.log(`Existing collection ${collectionName} dropped`);
        } catch (error) {
          console.log(`Collection ${collectionName} did not exist or could not be dropped`);
        }
      }

      // Create indexes before import if specified
      if (importOptions.createIndexes && options.indexes) {
        console.log('Creating indexes before data import...');
        for (const indexSpec of options.indexes) {
          await collection.createIndex(indexSpec.fields, indexSpec.options);
        }
      }

      // Process data in optimized batches

      // Assuming dataSource is an array or iterable
      const documents = Array.isArray(dataSource) ? dataSource : await this.convertDataSource(dataSource);

      const result = await this.performBulkInsert(collectionName, documents, {
        batchSize: importOptions.batchSize,
        bypassValidation: !importOptions.enableValidation,
        ordered: false // Unordered for better performance
      });

      console.log(`Data import completed: ${result.insertedDocuments} documents imported in ${result.executionTimeMs}ms`);
      return result;

    } catch (error) {
      console.error(`Data import failed for ${collectionName}:`, error);
      throw error;
    }
  }

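  async convertDataSource(dataSource) {
    // Convert various data sources (cursors, streams, iterables) to arrays
    // This is a placeholder - implement based on your specific data source types
    if (dataSource == null) {
      throw new Error('Unsupported data source type');
    }

    if (typeof dataSource.toArray === 'function') {
      return await dataSource.toArray();
    }

    // Readable streams and async generators are async iterables
    if (typeof dataSource[Symbol.asyncIterator] === 'function') {
      const documents = [];
      for await (const doc of dataSource) documents.push(doc);
      return documents;
    }

    if (typeof dataSource[Symbol.iterator] === 'function') {
      return Array.from(dataSource);
    }

    throw new Error('Unsupported data source type');
  }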
}

// Example usage: High-performance bulk operations
async function demonstrateBulkOperations() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('bulk_operations_demo');

  const bulkManager = new MongoDBBulkOperationsManager(db);

  // Demonstrate bulk insert
  const usersToInsert = [];
  for (let i = 0; i < 10000; i++) {
    usersToInsert.push({
      name: `User ${i}`,
      email: `user${i}@example.com`,
      age: Math.floor(Math.random() * 50) + 18,
      department: ['Engineering', 'Sales', 'Marketing', 'HR'][Math.floor(Math.random() * 4)],
      salary: Math.floor(Math.random() * 100000) + 40000,
      join_date: new Date(Date.now() - Math.random() * 365 * 24 * 60 * 60 * 1000)
    });
  }

  const insertResult = await bulkManager.performBulkInsert('users', usersToInsert);
  console.log('Bulk Insert Result:', insertResult);

  // Demonstrate bulk update
  const updateOperations = [
    {
      type: 'updateMany',
      filter: { department: 'Engineering' },
      update: { 
        $set: { department: 'Software Engineering' },
        $inc: { salary: 5000 }
      }
    },
    {
      type: 'updateMany', 
      filter: { age: { $lt: 25 } },
      update: { $set: { employee_type: 'junior' } },
      upsert: false
    }
  ];

  const updateResult = await bulkManager.performBulkUpdate('users', updateOperations);
  console.log('Bulk Update Result:', updateResult);

  // Display performance statistics
  const stats = bulkManager.getBulkOperationStatistics();
  console.log('Performance Statistics:', stats);

  await client.close();
}
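
The manager above works through its input in fixed-size batches. As a minimal sketch of that splitting step on its own, the helper below cuts a document array into batch objects. The name `splitIntoInsertBatches` is hypothetical and not part of the class above; the `{ type: 'insert', documents }` shape is an assumption chosen to match what `processSingleBatch` reads in the next section.

```javascript
// Hypothetical helper (not part of the manager class above): split an array
// of documents into insert batches of at most `batchSize` documents each.
function splitIntoInsertBatches(documents, batchSize = 1000) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }

  const batches = [];
  for (let i = 0; i < documents.length; i += batchSize) {
    batches.push({
      type: 'insert',
      // slice() copies the window, so the caller's array is left untouched
      documents: documents.slice(i, i + batchSize)
    });
  }
  return batches;
}

// Usage: 2,500 documents in batches of 1,000 gives sizes 1000, 1000, 500
const sampleDocs = Array.from({ length: 2500 }, (_, i) => ({ name: `User ${i}` }));
const sampleBatches = splitIntoInsertBatches(sampleDocs, 1000);
console.log(sampleBatches.map(b => b.documents.length));
```

A batch size of 1,000 is only a starting point here; the right value depends on document size and available memory.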

Understanding MongoDB Bulk Operations Architecture

Advanced Bulk Processing Patterns and Performance Optimization

Implement sophisticated bulk operation patterns for production-scale data processing:

// Production-ready MongoDB bulk operations with advanced optimization strategies
class EnterpriseMongoDBBulkManager extends MongoDBBulkOperationsManager {
  constructor(db, enterpriseConfig = {}) {
    super(db);

    this.enterpriseConfig = {
      enableShardingOptimization: enterpriseConfig.enableShardingOptimization || false,
      enableReplicationOptimization: enterpriseConfig.enableReplicationOptimization || false,
      enableCompressionOptimization: enterpriseConfig.enableCompressionOptimization || false,
      maxConcurrentOperations: enterpriseConfig.maxConcurrentOperations || 10,
      // Use ?? so an explicit `false` is respected (`|| true` is always true)
      enableProgressTracking: enterpriseConfig.enableProgressTracking ?? true,
      enableResourceMonitoring: enterpriseConfig.enableResourceMonitoring ?? true
    };

    this.setupEnterpriseOptimizations();
  }

  async performParallelBulkOperations(collectionName, operationBatches, options = {}) {
    console.log(`Starting parallel bulk operations on ${collectionName} with ${operationBatches.length} batches`);

    const concurrency = Math.min(
      options.maxConcurrency || this.enterpriseConfig.maxConcurrentOperations,
      operationBatches.length
    );

    const results = [];
    const startTime = Date.now();

    // Process batches in parallel with controlled concurrency
    for (let i = 0; i < operationBatches.length; i += concurrency) {
      const batchPromises = [];

      for (let j = i; j < Math.min(i + concurrency, operationBatches.length); j++) {
        const batch = operationBatches[j];

        const promise = this.processSingleBatch(collectionName, batch, {
          ...options,
          batchIndex: j
        });

        batchPromises.push(promise);
      }

      // Wait for current set of concurrent batches to complete
      const batchResults = await Promise.allSettled(batchPromises);
      results.push(...batchResults);

      console.log(`Completed ${Math.min(i + concurrency, operationBatches.length)} of ${operationBatches.length} batches`);
    }

    const totalTime = Date.now() - startTime;

    return this.consolidateParallelResults(results, totalTime);
  }

  async processSingleBatch(collectionName, batch, options) {
    // Determine batch type and process accordingly
    if (batch.type === 'insert') {
      return await this.performBulkInsert(collectionName, batch.documents, options);
    } else if (batch.type === 'update') {
      return await this.performBulkUpdate(collectionName, batch.operations, options);
    } else if (batch.type === 'delete') {
      return await this.performBulkDelete(collectionName, batch.operations, options);
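    } else if (batch.type === 'mixed') {
      return await this.performMixedBulkOperations(collectionName, batch.operations, options);
    }

    // Fail loudly instead of resolving to undefined for an unrecognized batch
    throw new Error(`Unsupported batch type: ${batch.type}`);
  }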

  async performShardOptimizedBulkOperations(collectionName, operations, shardKey) {
    console.log(`Performing shard-optimized bulk operations on ${collectionName}`);

    // Group operations by shard key for optimal routing
    const shardGroupedOps = this.groupOperationsByShardKey(operations, shardKey);

    const results = [];

    for (const [shardValue, shardOps] of shardGroupedOps.entries()) {
      console.log(`Processing ${shardOps.length} operations for shard key value: ${shardValue}`);

      const shardResult = await this.performMixedBulkOperations(collectionName, shardOps, {
        ordered: false // Better performance for sharded clusters
      });

      results.push({
        shardKey: shardValue,
        result: shardResult
      });
    }

    return this.consolidateShardResults(results);
  }

  groupOperationsByShardKey(operations, shardKey) {
    const grouped = new Map();

    for (const operation of operations) {
      let keyValue;

      if (operation.type === 'insert') {
        keyValue = operation.document[shardKey];
      } else {
        keyValue = operation.filter[shardKey];
      }

      if (!grouped.has(keyValue)) {
        grouped.set(keyValue, []);
      }

      grouped.get(keyValue).push(operation);
    }

    return grouped;
  }

  async performStreamingBulkOperations(collectionName, dataStream, options = {}) {
    console.log(`Starting streaming bulk operations on ${collectionName}`);

    const batchSize = options.batchSize || 1000;
    const processingOptions = {
      ordered: false,
      ...options
    };

    let batch = [];
    let totalProcessed = 0;
    const results = [];

    return new Promise((resolve, reject) => {
      dataStream.on('data', (data) => {
        batch.push(data);

        if (batch.length >= batchSize) {
          // Detach the full batch and pause the stream so documents arriving
          // during the insert are neither re-inserted nor dropped
          const currentBatch = batch;
          batch = [];
          dataStream.pause();

          this.performBulkInsert(collectionName, currentBatch, processingOptions)
            .then((batchResult) => {
              results.push(batchResult);
              totalProcessed += batchResult.insertedDocuments;
              console.log(`Processed ${totalProcessed} documents so far`);
              dataStream.resume();
            })
            .catch(reject);
        }
      });

      dataStream.on('end', async () => {
        try {
          // Process remaining documents
          if (batch.length > 0) {
            const finalResult = await this.performBulkInsert(
              collectionName, 
              batch, 
              processingOptions
            );
            results.push(finalResult);
            totalProcessed += finalResult.insertedDocuments;
          }

          resolve({
            success: true,
            totalProcessed: totalProcessed,
            batchResults: results
          });

        } catch (error) {
          reject(error);
        }
      });

      dataStream.on('error', reject);
    });
  }
}
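
The class above calls `consolidateParallelResults` without showing it. One possible shape for that helper is sketched below as a standalone function. It is an assumption about what the consolidation could look like, not the author's implementation: it takes the `Promise.allSettled` outcomes collected by `performParallelBulkOperations` and sums the counters that the insert, update, delete, and mixed summaries expose.

```javascript
// Hypothetical sketch: merge Promise.allSettled outcomes from parallel
// batches into a single summary. Field names follow the summaries returned
// by the bulk methods earlier in this article.
function consolidateParallelResults(settledResults, totalTimeMs) {
  const summary = {
    success: true,
    totalBatches: settledResults.length,
    succeededBatches: 0,
    failedBatches: 0,
    insertedDocuments: 0,
    modifiedDocuments: 0,
    deletedDocuments: 0,
    errors: [],
    executionTimeMs: totalTimeMs
  };

  settledResults.forEach((outcome, batchIndex) => {
    if (outcome.status === 'rejected') {
      summary.failedBatches++;
      summary.errors.push({
        batchIndex,
        error: outcome.reason?.message ?? String(outcome.reason)
      });
      return;
    }

    const value = outcome.value ?? {};
    // Mixed summaries nest counters under `results`, failed summaries under
    // `partialResult`; the other summaries keep them at the top level.
    const counters = value.results ?? value.partialResult ?? value;
    summary.insertedDocuments += counters.insertedDocuments ?? 0;
    summary.modifiedDocuments += counters.modifiedDocuments ?? 0;
    summary.deletedDocuments += counters.deletedDocuments ?? 0;

    if (value.success === false) {
      summary.failedBatches++;
      summary.errors.push({ batchIndex, error: value.error ?? 'batch reported failure' });
    } else {
      summary.succeededBatches++;
    }
  });

  summary.success = summary.failedBatches === 0;
  return summary;
}

// Usage with hand-built outcomes in the shape Promise.allSettled produces
const consolidated = consolidateParallelResults([
  { status: 'fulfilled', value: { success: true, insertedDocuments: 3 } },
  { status: 'fulfilled', value: { success: true, results: { insertedDocuments: 1, modifiedDocuments: 2, deletedDocuments: 4 } } },
  { status: 'rejected', reason: new Error('boom') }
], 50);
console.log(consolidated);
```

Counting a batch as failed when its summary reports `success: false` keeps partial failures visible even though the promise itself was fulfilled.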

QueryLeaf Bulk Operations Integration

QueryLeaf provides familiar SQL syntax for MongoDB bulk operations and batch processing:

-- QueryLeaf bulk operations with SQL-familiar syntax for MongoDB batch processing

-- Bulk insert with SQL VALUES syntax (automatically optimized for MongoDB bulk operations)
INSERT INTO users (name, email, age, department, salary, join_date)
VALUES 
  ('John Doe', 'john@example.com', 32, 'Engineering', 85000, CURRENT_DATE),
  ('Jane Smith', 'jane@example.com', 28, 'Sales', 75000, CURRENT_DATE - INTERVAL '1 month'),
  ('Bob Johnson', 'bob@example.com', 35, 'Marketing', 70000, CURRENT_DATE - INTERVAL '2 months'),
  ('Alice Brown', 'alice@example.com', 29, 'HR', 68000, CURRENT_DATE - INTERVAL '3 months'),
  ('Charlie Wilson', 'charlie@example.com', 31, 'Engineering', 90000, CURRENT_DATE - INTERVAL '4 months');

-- QueryLeaf automatically converts this to optimized MongoDB bulk insert:
-- db.users.insertMany([documents...], { ordered: false })

-- Bulk update operations using SQL UPDATE syntax
-- Update all engineers' salaries (automatically uses MongoDB bulk operations)
UPDATE users 
SET salary = salary * 1.1, 
    last_updated = CURRENT_TIMESTAMP,
    promotion_eligible = true
WHERE department = 'Engineering';

-- Update employees based on multiple conditions
UPDATE users 
SET employee_level = CASE 
  WHEN age > 35 AND salary > 80000 THEN 'Senior'
  WHEN age > 30 OR salary > 70000 THEN 'Mid-level'
  ELSE 'Junior'
END,
last_evaluation = CURRENT_DATE
WHERE join_date < CURRENT_DATE - INTERVAL '6 months';

-- QueryLeaf optimizes these as MongoDB bulk update operations:
-- Uses bulkWrite() with updateMany operations for optimal performance

-- Bulk delete operations
-- Clean up old inactive users
DELETE FROM users 
WHERE last_login < CURRENT_DATE - INTERVAL '2 years' 
  AND status = 'inactive';

-- Remove test data
DELETE FROM users 
WHERE email LIKE '%test%' OR email LIKE '%example%';

-- QueryLeaf converts to optimized bulk delete operations

-- Advanced bulk processing with data transformation and aggregation
WITH user_statistics AS (
  SELECT 
    department,
    COUNT(*) as employee_count,
    AVG(salary) as avg_salary,
    MAX(salary) as max_salary,
    MIN(join_date) as earliest_hire
  FROM users 
  GROUP BY department
),

salary_adjustments AS (
  SELECT 
    u._id,
    u.name,
    u.department,
    u.salary,
    us.avg_salary,

    -- Calculate adjustment based on department average
    CASE 
      WHEN u.salary < us.avg_salary * 0.8 THEN u.salary * 1.15  -- 15% increase
      WHEN u.salary < us.avg_salary * 0.9 THEN u.salary * 1.10  -- 10% increase  
      WHEN u.salary > us.avg_salary * 1.2 THEN u.salary * 1.02  -- 2% increase
      ELSE u.salary * 1.05  -- 5% standard increase
    END as new_salary,

    CURRENT_DATE as adjustment_date

  FROM users u
  JOIN user_statistics us ON u.department = us.department
  WHERE u.status = 'active'
)

-- Bulk update with calculated values (QueryLeaf optimizes this as bulk operation)
UPDATE users 
SET salary = sa.new_salary,
    last_salary_review = sa.adjustment_date,
    salary_review_reason = CONCAT('Department average adjustment - Previous: $', 
                                 CAST(sa.salary AS VARCHAR), 
                                 ', New: $', 
                                 CAST(sa.new_salary AS VARCHAR))
FROM salary_adjustments sa
WHERE users._id = sa._id;

-- Bulk data processing with conditional operations
-- Process employee performance reviews in batches
WITH performance_data AS (
  SELECT 
    _id,
    name,
    department,
    performance_score,

    -- Calculate performance category
    CASE 
      WHEN performance_score >= 90 THEN 'exceptional'
      WHEN performance_score >= 80 THEN 'exceeds_expectations'  
      WHEN performance_score >= 70 THEN 'meets_expectations'
      WHEN performance_score >= 60 THEN 'needs_improvement'
      ELSE 'unsatisfactory'
    END as performance_category,

    -- Calculate bonus eligibility
    CASE 
      WHEN performance_score >= 85 AND department IN ('Sales', 'Engineering') THEN true
      WHEN performance_score >= 90 THEN true
      ELSE false
    END as bonus_eligible,

    -- Calculate development plan requirement
    CASE 
      WHEN performance_score < 70 THEN true
      ELSE false  
    END as requires_development_plan

  FROM employees 
  WHERE review_period = '2025-Q3'
),

bonus_calculations AS (
  SELECT 
    pd._id,
    pd.bonus_eligible,

    -- Calculate bonus amount
    CASE 
      WHEN pd.performance_score >= 95 THEN u.salary * 0.15  -- 15% bonus
      WHEN pd.performance_score >= 90 THEN u.salary * 0.12  -- 12% bonus  
      WHEN pd.performance_score >= 85 THEN u.salary * 0.10  -- 10% bonus
      ELSE 0
    END as bonus_amount

  FROM performance_data pd
  JOIN users u ON pd._id = u._id
  WHERE pd.bonus_eligible = true
)

-- Execute bulk updates for performance review results
UPDATE users 
SET performance_category = pd.performance_category,
    bonus_eligible = pd.bonus_eligible,
    bonus_amount = COALESCE(bc.bonus_amount, 0),
    requires_development_plan = pd.requires_development_plan,
    last_performance_review = CURRENT_DATE,
    review_status = 'completed'
FROM performance_data pd
LEFT JOIN bonus_calculations bc ON pd._id = bc._id  
WHERE users._id = pd._id;

-- Advanced batch processing with data validation and error handling
-- Bulk data import with validation
INSERT INTO products (sku, name, category, price, stock_quantity, supplier_id, created_at)
SELECT 
  import_sku,
  import_name,
  import_category,
  CAST(import_price AS DECIMAL(10,2)),
  CAST(import_stock AS INTEGER),
  supplier_lookup.supplier_id,
  CURRENT_TIMESTAMP

FROM product_import_staging pis
JOIN suppliers supplier_lookup ON pis.supplier_name = supplier_lookup.name

-- Validation conditions
WHERE import_sku IS NOT NULL
  AND import_name IS NOT NULL  
  AND import_category IN ('Electronics', 'Clothing', 'Books', 'Home', 'Sports')
  AND import_price::DECIMAL(10,2) > 0
  AND import_stock::INTEGER >= 0
  AND supplier_lookup.supplier_id IS NOT NULL

  -- Duplicate check
  AND NOT EXISTS (
    SELECT 1 FROM products p 
    WHERE p.sku = pis.import_sku
  );

-- Bulk inventory adjustments with audit trail
WITH inventory_adjustments AS (
  SELECT 
    product_id,
    adjustment_quantity,
    adjustment_reason,
    adjustment_type, -- 'increase', 'decrease', 'recount'
    CURRENT_TIMESTAMP as adjustment_timestamp,
    'system' as adjusted_by
  FROM inventory_adjustment_queue
  WHERE processed = false
),

stock_calculations AS (
  SELECT 
    ia.product_id,
    p.stock_quantity as current_stock,

    CASE ia.adjustment_type
      WHEN 'increase' THEN p.stock_quantity + ia.adjustment_quantity
      WHEN 'decrease' THEN GREATEST(p.stock_quantity - ia.adjustment_quantity, 0)
      WHEN 'recount' THEN ia.adjustment_quantity
      ELSE p.stock_quantity
    END as new_stock_quantity,

    ia.adjustment_reason,
    ia.adjustment_timestamp,
    ia.adjusted_by

  FROM inventory_adjustments ia
  JOIN products p ON ia.product_id = p._id
),

-- Bulk update product stock levels
-- (written as a PostgreSQL-style data-modifying CTE so the audit insert below
--  can still read stock_calculations; a CTE is visible only to its own statement)
stock_update AS (
  UPDATE products 
  SET stock_quantity = sc.new_stock_quantity,
      last_stock_update = sc.adjustment_timestamp,
      stock_updated_by = sc.adjusted_by
  FROM stock_calculations sc
  WHERE products._id = sc.product_id
)

-- Insert audit records for inventory changes
INSERT INTO inventory_audit_log (
  product_id,
  previous_stock,
  new_stock,
  adjustment_reason,
  adjustment_timestamp,
  adjusted_by
)
SELECT 
  sc.product_id,
  sc.current_stock,
  sc.new_stock_quantity,
  sc.adjustment_reason,
  sc.adjustment_timestamp,
  sc.adjusted_by
FROM stock_calculations sc;

-- Mark adjustment queue items as processed
UPDATE inventory_adjustment_queue 
SET processed = true,
    processed_at = CURRENT_TIMESTAMP
WHERE processed = false;

-- High-performance bulk operations with monitoring
-- Query for bulk operation performance analysis
WITH operation_metrics AS (
  SELECT 
    DATE_TRUNC('hour', operation_timestamp) as hour_bucket,
    operation_type, -- 'bulk_insert', 'bulk_update', 'bulk_delete'
    collection_name,

    -- Performance metrics
    COUNT(*) as operations_count,
    SUM(documents_processed) as total_documents,
    AVG(execution_time_ms) as avg_execution_time_ms,
    MAX(execution_time_ms) as max_execution_time_ms,
    MIN(execution_time_ms) as min_execution_time_ms,

    -- Throughput calculations
    AVG(throughput_docs_per_second) as avg_throughput_docs_per_sec,
    MAX(throughput_docs_per_second) as max_throughput_docs_per_sec,

    -- Error tracking
    COUNT(*) FILTER (WHERE success = false) as failed_operations,
    COUNT(*) FILTER (WHERE success = true) as successful_operations,

    -- Resource utilization
    AVG(memory_usage_mb) as avg_memory_usage_mb,
    AVG(cpu_utilization_percent) as avg_cpu_utilization

  FROM bulk_operation_log
  WHERE operation_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY DATE_TRUNC('hour', operation_timestamp), operation_type, collection_name
)

SELECT 
  hour_bucket,
  operation_type,
  collection_name,
  operations_count,
  total_documents,

  -- Performance summary
  ROUND(avg_execution_time_ms, 2) as avg_execution_time_ms,
  ROUND(avg_throughput_docs_per_sec, 0) as avg_throughput_docs_per_sec,
  max_throughput_docs_per_sec,

  -- Success rate
  successful_operations,
  failed_operations,
  ROUND((successful_operations::DECIMAL / (successful_operations + failed_operations)) * 100, 2) as success_rate_percent,

  -- Resource efficiency  
  ROUND(avg_memory_usage_mb, 1) as avg_memory_usage_mb,
  ROUND(avg_cpu_utilization, 1) as avg_cpu_utilization_percent,

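  -- Performance assessment
  -- The success rate expression is repeated here because a SELECT-list alias
  -- (success_rate_percent) cannot be referenced within the same SELECT
  CASE 
    WHEN avg_execution_time_ms < 100 
      AND (successful_operations::DECIMAL / (successful_operations + failed_operations)) * 100 > 99 THEN 'excellent'
    WHEN avg_execution_time_ms < 500 
      AND (successful_operations::DECIMAL / (successful_operations + failed_operations)) * 100 > 95 THEN 'good'
    WHEN avg_execution_time_ms < 1000 
      AND (successful_operations::DECIMAL / (successful_operations + failed_operations)) * 100 > 90 THEN 'acceptable'
    ELSE 'needs_optimization'
  END as performance_rating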

FROM operation_metrics
ORDER BY hour_bucket DESC, total_documents DESC;

-- QueryLeaf provides comprehensive bulk operation support:
-- 1. Automatic conversion of SQL batch operations to MongoDB bulk operations
-- 2. Optimal batching strategies based on operation types and data characteristics
-- 3. Advanced error handling with partial success reporting
-- 4. Performance monitoring and optimization recommendations
-- 5. Support for complex data transformations during bulk processing
-- 6. Intelligent resource utilization and concurrency management
-- 7. Integration with MongoDB's native bulk operation optimizations
-- 8. Familiar SQL syntax for complex batch processing workflows
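
The rating thresholds in the monitoring query above can also be applied on the client, for example when operation logs are collected by the application rather than stored in a table. The following JavaScript is a minimal sketch of that idea: the thresholds are copied from the CASE expression, while the function names and the log entry shape ({ executionTimeMs, success }) are assumptions made for illustration and are not part of QueryLeaf or MongoDB.

```javascript
// Thresholds mirror the CASE expression in the SQL monitoring query.
function ratePerformance(avgExecutionTimeMs, successRatePercent) {
  if (avgExecutionTimeMs < 100 && successRatePercent > 99) return 'excellent';
  if (avgExecutionTimeMs < 500 && successRatePercent > 95) return 'good';
  if (avgExecutionTimeMs < 1000 && successRatePercent > 90) return 'acceptable';
  return 'needs_optimization';
}

// Aggregate raw log entries into the two metrics the rating needs.
// The entry shape { executionTimeMs, success } is hypothetical.
function summarizeOperations(entries) {
  if (entries.length === 0) return null; // nothing to rate

  const successful = entries.filter((entry) => entry.success).length;
  const totalMs = entries.reduce((sum, entry) => sum + entry.executionTimeMs, 0);

  const avgExecutionTimeMs = totalMs / entries.length;
  const successRatePercent = (successful / entries.length) * 100;

  return {
    avgExecutionTimeMs,
    successRatePercent,
    rating: ratePerformance(avgExecutionTimeMs, successRatePercent),
  };
}
```

Returning null for an empty input avoids the division by zero that an unguarded average or success rate would hit when a time bucket has no operations.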

Best Practices for MongoDB Bulk Operations

Performance Optimization Strategies

Essential principles for maximizing bulk operation performance:

Batch Size Optimization: Choose optimal batch sizes based on document size, available memory, and network capacity
Unordered Operations: Use unordered bulk operations when possible for better parallelization and performance
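Index Considerations: Account for index maintenance cost during bulk operations; for large initial loads, building secondary indexes after the inserts is usually faster than maintaining them on every write, while bulk updates and deletes benefit from indexes on their filter fields being in place beforehand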
Write Concern Configuration: Balance consistency requirements with performance using appropriate write concern settings
Error Handling Strategy: Implement comprehensive error handling with partial success reporting and retry logic
Resource Monitoring: Monitor system resources during bulk operations and adjust batch sizes dynamically
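
Several of these principles (fixed-size batches, unordered execution, and partial success reporting) can be combined in a small helper. The sketch below is one possible shape, not a definitive implementation. It assumes `collection` exposes `bulkWrite(operations, options)` in the way the MongoDB Node.js driver's Collection does. The helper names, the summary object, and the default batch size of 1000 are illustrative assumptions that should be tuned against measured throughput.

```javascript
// Split an array of write operations into fixed-size batches.
function chunkOperations(operations, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < operations.length; i += batchSize) {
    batches.push(operations.slice(i, i + batchSize));
  }
  return batches;
}

// Run each batch as an unordered bulk write and record which batches failed.
// Assumption: `collection.bulkWrite(ops, { ordered: false })` behaves like the
// MongoDB Node.js driver method of the same name.
async function runBulkInBatches(collection, operations, batchSize = 1000) {
  const batches = chunkOperations(operations, batchSize);
  const summary = { totalBatches: batches.length, succeededBatches: 0, failures: [] };

  for (const [batchIndex, batch] of batches.entries()) {
    try {
      await collection.bulkWrite(batch, { ordered: false });
      summary.succeededBatches += 1;
    } catch (err) {
      // With ordered: false the server keeps going after an individual
      // operation fails, so a failed batch may still be partly applied.
      summary.failures.push({
        batchIndex,
        message: err.message,
        writeErrors: err.writeErrors ?? [], // read per-operation details if present
      });
    }
  }
  return summary;
}
```

The `failures` list gives a caller enough information to retry only the affected batches or to log the individual write errors, instead of restarting the whole job.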

Production Deployment Considerations

Optimize bulk operations for enterprise production environments:

Sharding Awareness: Design bulk operations to work efficiently with MongoDB sharded clusters
Replication Optimization: Configure operations to work optimally with replica sets and read preferences
Concurrency Management: Implement appropriate concurrency controls to prevent resource contention
Progress Tracking: Provide comprehensive progress reporting for long-running bulk operations
Memory Management: Monitor and control memory usage during large-scale bulk processing
Performance Monitoring: Implement detailed performance monitoring and alerting for bulk operations
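
Concurrency management and progress tracking can be illustrated with a small worker-pool helper that keeps at most a fixed number of batch jobs in flight and reports progress as each one finishes. This is a minimal sketch in plain JavaScript. The function name, the `onProgress` callback, and the result shape are assumptions made for this example, and each task is assumed to be a function that returns a promise (for instance one that submits a single bulk batch).

```javascript
// Run async tasks with at most `limit` in flight and report progress.
// Each task is a function returning a promise; failures are captured
// per task so one failed batch does not abort the rest.
async function runWithConcurrency(tasks, limit, onProgress = () => {}) {
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new RangeError('limit must be a positive integer');
  }

  const results = new Array(tasks.length);
  let next = 0;      // index of the next task to start
  let completed = 0; // number of tasks finished (success or failure)

  async function worker() {
    while (next < tasks.length) {
      // No await between the check and the increment, so two workers
      // can never claim the same index.
      const index = next++;
      try {
        results[index] = { ok: true, value: await tasks[index]() };
      } catch (error) {
        results[index] = { ok: false, error };
      }
      completed += 1;
      onProgress({ completed, total: tasks.length });
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A caller might pass `limit` based on how many concurrent writers the cluster tolerates without contention, and use `onProgress` to update a progress record for long-running jobs.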

Conclusion

MongoDB bulk operations support high-throughput data processing by grouping many writes into a single request, which reduces network round trips compared to issuing single-document operations one at a time. Native bulk operation support lets applications process large volumes of data efficiently while maintaining consistency, reporting errors for individual operations, and providing detailed operational visibility.

Key MongoDB Bulk Operations benefits include:

High-Performance Processing: Optimal throughput through intelligent batching and reduced network overhead
Flexible Operation Types: Support for mixed bulk operations including inserts, updates, and deletes in single batches
Advanced Error Handling: Comprehensive error reporting with partial success tracking and recovery capabilities
Resource Optimization: Efficient memory and CPU utilization through optimized batch processing algorithms
Production Scalability: Enterprise-ready bulk processing with monitoring, progress tracking, and performance optimization
SQL Accessibility: SQL-style bulk operations through QueryLeaf, making high-performance data processing available to teams that work primarily in SQL

Whether you're building data import systems, batch processing pipelines, ETL workflows, or high-throughput applications, MongoDB bulk operations with QueryLeaf's familiar SQL interface provide the foundation for efficient, scalable, and reliable batch data processing.

QueryLeaf Integration: QueryLeaf automatically translates SQL batch operations into MongoDB bulk operations, so complex data processing workflows can be written in familiar SQL syntax. Advanced bulk operation patterns, performance monitoring, and error handling are expressed through the same SQL constructs, which makes high-performance batch processing accessible to SQL-oriented development teams.

Combining MongoDB's bulk operation capabilities with SQL-style batch processing suits applications that need both high-throughput data processing and familiar database operation patterns. It helps batch processing workflows scale while maintaining performance and reliability.