MongoDB Capped Collections: Fixed-Size High-Performance Logging and Data Streaming for Real-Time Applications

Real-time applications require efficient data structures for continuous data capture, event streaming, and high-frequency logging without the overhead of traditional database management. Conventional database approaches struggle with scenarios requiring sustained high-throughput writes, automatic old data removal, and guaranteed insertion order preservation, often leading to performance degradation, storage bloat, and complex maintenance procedures in production environments.

MongoDB capped collections provide native fixed-size, high-performance data structures that maintain insertion order and automatically remove old documents when storage limits are reached. Unlike traditional database logging solutions that require complex archival processes and performance-degrading maintenance operations, MongoDB capped collections deliver consistent high-throughput writes, predictable storage usage, and automatic data lifecycle management through optimized storage allocation and write-optimized data structures.
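
The core mechanics are easy to see in isolation before walking through the full comparison. The short sketch below uses the Node.js driver against a hypothetical capped_demo database; the collection name, size limit, and document cap are illustrative only. It creates a capped collection, appends a few documents, and reads them back in reverse natural (insertion) order.

// Minimal capped collection sketch (Node.js driver) - names and sizes are illustrative
const { MongoClient } = require('mongodb');

async function cappedCollectionDemo() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('capped_demo');

  // Fixed-size collection: at most 10MB and 10,000 documents, oldest entries overwritten first
  const events = await db.createCollection('events', {
    capped: true,
    size: 10 * 1024 * 1024,
    max: 10000
  });

  // Inserts are appended in arrival order; no manual rotation or cleanup is required
  await events.insertMany([
    { ts: new Date(), level: 'INFO', msg: 'service started' },
    { ts: new Date(), level: 'WARN', msg: 'cache miss rate elevated' }
  ]);

  // Reverse natural order returns the most recently inserted documents first
  const latest = await events.find().sort({ $natural: -1 }).limit(5).toArray();
  console.log(latest);

  await client.close();
}

cappedCollectionDemo().catch(console.error);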

The Traditional High-Performance Logging Challenge

Conventional database logging approaches often encounter significant performance and maintenance challenges:

-- Traditional PostgreSQL high-performance logging - complex maintenance and performance issues

-- Basic application logging table with growing maintenance complexity
CREATE TABLE application_logs (
    log_id BIGSERIAL PRIMARY KEY,
    application_name VARCHAR(100) NOT NULL,
    log_level VARCHAR(20) NOT NULL,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    message TEXT NOT NULL,

    -- Additional context fields
    user_id BIGINT,
    session_id VARCHAR(100),
    request_id VARCHAR(100),

    -- Performance metadata
    duration_ms INTEGER,
    memory_usage_mb DECIMAL(8,2),
    cpu_usage_percent DECIMAL(5,2),

    -- Log metadata
    thread_id INTEGER,
    process_id INTEGER,
    hostname VARCHAR(100),

    -- Complex indexing for performance
    CONSTRAINT valid_log_level CHECK (log_level IN ('DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL'))
);

-- Multiple indexes required for different query patterns - increasing maintenance overhead
CREATE INDEX idx_logs_timestamp ON application_logs(timestamp DESC);
CREATE INDEX idx_logs_level_timestamp ON application_logs(log_level, timestamp DESC);
CREATE INDEX idx_logs_app_timestamp ON application_logs(application_name, timestamp DESC);
CREATE INDEX idx_logs_user_timestamp ON application_logs(user_id, timestamp DESC) WHERE user_id IS NOT NULL;
CREATE INDEX idx_logs_session_timestamp ON application_logs(session_id, timestamp DESC) WHERE session_id IS NOT NULL;

-- Complex partitioning strategy for log table management
CREATE TABLE application_logs_2024_01 (
    CHECK (timestamp >= '2024-01-01' AND timestamp < '2024-02-01')
) INHERITS (application_logs);

CREATE TABLE application_logs_2024_02 (
    CHECK (timestamp >= '2024-02-01' AND timestamp < '2024-03-01')
) INHERITS (application_logs);

-- Monthly partition maintenance (complex and error-prone)
CREATE OR REPLACE FUNCTION create_monthly_log_partition()
RETURNS VOID AS $$
DECLARE
    partition_name TEXT;
    start_date DATE;
    end_date DATE;
BEGIN
    start_date := DATE_TRUNC('month', CURRENT_DATE);
    end_date := start_date + INTERVAL '1 month';
    partition_name := 'application_logs_' || TO_CHAR(start_date, 'YYYY_MM');

    EXECUTE format('
        CREATE TABLE IF NOT EXISTS %I (
            CHECK (timestamp >= %L AND timestamp < %L)
        ) INHERITS (application_logs)', 
        partition_name, start_date, end_date);

    EXECUTE format('
        CREATE INDEX IF NOT EXISTS %I ON %I(timestamp DESC)',
        'idx_' || partition_name || '_timestamp', partition_name);
END;
$$ LANGUAGE plpgsql;

-- Automated cleanup process with significant performance impact
CREATE OR REPLACE FUNCTION cleanup_old_logs(retention_days INTEGER DEFAULT 90)
RETURNS TABLE(
    deleted_count BIGINT,
    cleanup_duration_ms BIGINT,
    affected_partitions TEXT[]
) AS $$
DECLARE
    cutoff_date TIMESTAMP;
    partition_record RECORD;
    total_deleted BIGINT := 0;
    start_time TIMESTAMP := clock_timestamp();
    dropped_partitions TEXT[] := '{}';
    rows_remaining BIGINT;
BEGIN
    cutoff_date := CURRENT_TIMESTAMP - (retention_days || ' days')::INTERVAL;

    -- Delete from main table (expensive operation)
    DELETE FROM ONLY application_logs 
    WHERE timestamp < cutoff_date;

    GET DIAGNOSTICS total_deleted = ROW_COUNT;

    -- Handle partitioned tables
    FOR partition_record IN 
        SELECT schemaname, tablename 
        FROM pg_tables 
        WHERE tablename LIKE 'application_logs_%'
        AND tablename ~ '^application_logs_\d{4}_\d{2}$'
    LOOP
        -- Count rows newer than the cutoff to decide between drop and partial cleanup
        EXECUTE format('
            SELECT COUNT(*) 
            FROM %I.%I 
            WHERE timestamp >= %L',
            partition_record.schemaname,
            partition_record.tablename,
            cutoff_date
        ) INTO rows_remaining;

        -- Drop the partition outright when it holds no data newer than the cutoff
        IF rows_remaining = 0 THEN
            EXECUTE format('DROP TABLE IF EXISTS %I.%I CASCADE',
                partition_record.schemaname, partition_record.tablename);
            dropped_partitions := dropped_partitions || partition_record.tablename;
        ELSE
            -- Partial cleanup within partition (expensive)
            EXECUTE format('
                DELETE FROM %I.%I WHERE timestamp < %L',
                partition_record.schemaname, partition_record.tablename, cutoff_date);
        END IF;
    END LOOP;

    -- Reindex (significant performance impact); VACUUM cannot run inside a
    -- function and must be scheduled separately, adding yet another maintenance step
    REINDEX TABLE application_logs;

    RETURN QUERY SELECT 
        total_deleted,
        (EXTRACT(EPOCH FROM clock_timestamp() - start_time) * 1000)::BIGINT,
        dropped_partitions;
END;
$$ LANGUAGE plpgsql;

-- High-frequency insert procedure with limited performance optimization
CREATE OR REPLACE FUNCTION batch_insert_logs(log_entries JSONB[])
RETURNS TABLE(
    inserted_count INTEGER,
    failed_count INTEGER,
    processing_time_ms INTEGER
) AS $$
DECLARE
    log_entry JSONB;
    success_count INTEGER := 0;
    error_count INTEGER := 0;
    start_time TIMESTAMP := clock_timestamp();
    temp_table_name TEXT := 'temp_log_batch_' || extract(epoch from now())::INTEGER;
BEGIN

    -- Create temporary table for batch processing
    EXECUTE format('
        CREATE TEMP TABLE %I (
            application_name VARCHAR(100),
            log_level VARCHAR(20),
            timestamp TIMESTAMP,
            message TEXT,
            user_id BIGINT,
            session_id VARCHAR(100),
            request_id VARCHAR(100),
            duration_ms INTEGER,
            memory_usage_mb DECIMAL(8,2),
            thread_id INTEGER,
            hostname VARCHAR(100)
        )', temp_table_name);

    -- Process each log entry individually (inefficient for high volume)
    FOREACH log_entry IN ARRAY log_entries
    LOOP
        BEGIN
            EXECUTE format('
                INSERT INTO %I (
                    application_name, log_level, timestamp, message,
                    user_id, session_id, request_id, duration_ms,
                    memory_usage_mb, thread_id, hostname
                ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)',
                temp_table_name
            ) USING 
                log_entry->>'application_name',
                log_entry->>'log_level',
                (log_entry->>'timestamp')::TIMESTAMP,
                log_entry->>'message',
                (log_entry->>'user_id')::BIGINT,
                log_entry->>'session_id',
                log_entry->>'request_id',
                (log_entry->>'duration_ms')::INTEGER,
                (log_entry->>'memory_usage_mb')::DECIMAL(8,2),
                (log_entry->>'thread_id')::INTEGER,
                log_entry->>'hostname';

            success_count := success_count + 1;

        EXCEPTION WHEN OTHERS THEN
            error_count := error_count + 1;
            -- Limited error handling for high-frequency operations
            CONTINUE;
        END;
    END LOOP;

    -- Batch insert into main table (still limited by indexing overhead)
    EXECUTE format('
        INSERT INTO application_logs (
            application_name, log_level, timestamp, message,
            user_id, session_id, request_id, duration_ms,
            memory_usage_mb, thread_id, hostname
        )
        SELECT * FROM %I', temp_table_name);

    -- Cleanup
    EXECUTE format('DROP TABLE %I', temp_table_name);

    RETURN QUERY SELECT 
        success_count,
        error_count,
        (EXTRACT(EPOCH FROM clock_timestamp() - start_time) * 1000)::INTEGER;
END;
$$ LANGUAGE plpgsql;

-- Real-time event streaming table with performance limitations
CREATE TABLE event_stream (
    event_id BIGSERIAL PRIMARY KEY,
    event_type VARCHAR(100) NOT NULL,
    event_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    user_id BIGINT,
    session_id VARCHAR(100),

    -- Event payload (limited JSON support)
    event_data JSONB,

    -- Stream metadata
    stream_partition VARCHAR(50),
    sequence_number BIGINT,

    -- Processing metadata
    processing_status VARCHAR(20) DEFAULT 'pending',
    processed_at TIMESTAMP,
    processor_id VARCHAR(100)
);

-- Complex trigger and sequence for sequence number management
CREATE SEQUENCE IF NOT EXISTS event_stream_sequence;

CREATE OR REPLACE FUNCTION update_sequence_number()
RETURNS TRIGGER AS $$
BEGIN
    NEW.sequence_number := nextval('event_stream_sequence');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER event_stream_sequence_trigger
    BEFORE INSERT ON event_stream
    FOR EACH ROW
    EXECUTE FUNCTION update_sequence_number();

-- Performance monitoring with complex aggregations
WITH log_performance_analysis AS (
    SELECT 
        application_name,
        log_level,
        DATE_TRUNC('hour', timestamp) as hour_bucket,
        COUNT(*) as log_count,

        -- Complex aggregations causing performance issues
        AVG(CASE WHEN duration_ms IS NOT NULL THEN duration_ms ELSE NULL END) as avg_duration,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95_duration,
        AVG(CASE WHEN memory_usage_mb IS NOT NULL THEN memory_usage_mb ELSE NULL END) as avg_memory_usage,

        -- Storage analysis
        SUM(LENGTH(message)) as total_message_bytes,
        AVG(LENGTH(message)) as avg_message_length,

        -- Performance degradation over time
        COUNT(*) / EXTRACT(EPOCH FROM INTERVAL '1 hour') as logs_per_second

    FROM application_logs
    WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    GROUP BY application_name, log_level, DATE_TRUNC('hour', timestamp)
),
storage_growth_analysis AS (
    -- Complex storage growth calculations
    SELECT 
        DATE_TRUNC('day', timestamp) as day_bucket,
        COUNT(*) as daily_logs,
        SUM(LENGTH(message) + COALESCE(LENGTH(session_id), 0) + COALESCE(LENGTH(request_id), 0)) as daily_storage_bytes,

        -- Growth projections (expensive calculations)
        LAG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('day', timestamp)) as prev_day_logs,
        LAG(SUM(LENGTH(message) + COALESCE(LENGTH(session_id), 0) + COALESCE(LENGTH(request_id), 0))) OVER (ORDER BY DATE_TRUNC('day', timestamp)) as prev_day_bytes

    FROM application_logs
    WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '30 days'
    GROUP BY DATE_TRUNC('day', timestamp)
)
SELECT 
    lpa.application_name,
    lpa.log_level,
    lpa.hour_bucket,
    lpa.log_count,

    -- Performance metrics
    ROUND(lpa.avg_duration, 2) as avg_duration_ms,
    ROUND(lpa.p95_duration, 2) as p95_duration_ms,
    ROUND(lpa.logs_per_second, 2) as throughput_logs_per_second,

    -- Storage efficiency
    ROUND(lpa.total_message_bytes / 1024.0 / 1024.0, 2) as message_storage_mb,
    ROUND(lpa.avg_message_length, 0) as avg_message_length,

    -- Growth indicators
    sga.daily_logs,
    ROUND(sga.daily_storage_bytes / 1024.0 / 1024.0, 2) as daily_storage_mb,

    -- Growth rate calculations (complex and expensive)
    CASE 
        WHEN sga.prev_day_logs IS NOT NULL THEN
            ROUND(((sga.daily_logs - sga.prev_day_logs) / sga.prev_day_logs::DECIMAL * 100), 1)
        ELSE NULL
    END as daily_log_growth_percent,

    CASE 
        WHEN sga.prev_day_bytes IS NOT NULL THEN
            ROUND(((sga.daily_storage_bytes - sga.prev_day_bytes) / sga.prev_day_bytes::DECIMAL * 100), 1)
        ELSE NULL
    END as daily_storage_growth_percent

FROM log_performance_analysis lpa
JOIN storage_growth_analysis sga ON DATE_TRUNC('day', lpa.hour_bucket) = sga.day_bucket
WHERE lpa.log_count > 0
ORDER BY lpa.application_name, lpa.log_level, lpa.hour_bucket DESC;

-- Traditional logging approach problems:
-- 1. Unbounded storage growth requiring complex partitioning and archival
-- 2. Performance degradation as table size increases due to indexing overhead
-- 3. Complex maintenance procedures for partition management and cleanup
-- 4. High-frequency writes causing lock contention and performance bottlenecks
-- 5. Expensive aggregation queries for real-time monitoring and analytics
-- 6. Limited support for truly high-throughput event streaming scenarios
-- 7. Complex error handling and recovery mechanisms for batch processing
-- 8. Storage bloat and fragmentation issues requiring regular maintenance
-- 9. No guarantee of insertion order preservation under concurrent access
-- 10. Resource-intensive cleanup and archival processes impacting performance

MongoDB capped collections provide elegant fixed-size, high-performance data structures for logging and streaming:

// MongoDB Capped Collections - high-performance logging and streaming with automatic size management
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('high_performance_logging');
// Recent drivers connect lazily on first operation; on older drivers call `await client.connect()` first

// Comprehensive MongoDB Capped Collections Manager
class CappedCollectionsManager {
  constructor(db, config = {}) {
    this.db = db;
    this.config = {
      // Default capped collection configurations
      defaultLogSize: config.defaultLogSize || 100 * 1024 * 1024, // 100MB
      defaultMaxDocuments: config.defaultMaxDocuments || 50000,

      // Performance optimization settings
      enableBulkOperations: config.enableBulkOperations !== false,
      enableAsyncOperations: config.enableAsyncOperations !== false,
      batchSize: config.batchSize || 1000,
      writeBufferSize: config.writeBufferSize || 16384,

      // Collection management
      enablePerformanceMonitoring: config.enablePerformanceMonitoring !== false,
      enableAutoOptimization: config.enableAutoOptimization !== false,
      enableMetricsCollection: config.enableMetricsCollection !== false,

      // Write concern and consistency
      writeConcern: config.writeConcern || {
        w: 1, // Fast writes for high-throughput logging
        j: false, // Disable journaling for maximum speed (trade-off with durability)
        wtimeout: 1000
      },

      // Advanced features
      enableTailableCursors: config.enableTailableCursors !== false,
      enableChangeStreams: config.enableChangeStreams !== false,
      enableRealTimeProcessing: config.enableRealTimeProcessing !== false,

      // Resource management
      maxConcurrentTails: config.maxConcurrentTails || 10,
      tailCursorTimeout: config.tailCursorTimeout || 30000,
      processingThreads: config.processingThreads || 4
    };

    // Collection references
    this.cappedCollections = new Map();
    this.tailableCursors = new Map();
    this.performanceMetrics = new Map();
    this.processingStats = {
      totalWrites: 0,
      totalReads: 0,
      averageWriteTime: 0,
      averageReadTime: 0,
      errorCount: 0
    };

    // Record start time for uptime/throughput estimates, then create collections (async, fire-and-forget)
    this.startTime = new Date();
    this.initializeCappedCollections();
  }

  async initializeCappedCollections() {
    console.log('Initializing capped collections for high-performance logging...');

    try {
      // Application logging with different retention strategies
      await this.createOptimizedCappedCollection('application_logs', {
        size: 200 * 1024 * 1024, // 200MB
        max: 100000, // Maximum 100k documents
        description: 'High-frequency application logs with automatic rotation'
      });

      // Real-time event streaming
      await this.createOptimizedCappedCollection('event_stream', {
        size: 500 * 1024 * 1024, // 500MB
        max: 250000, // Maximum 250k events
        description: 'Real-time event streaming with insertion order preservation'
      });

      // Performance metrics collection
      await this.createOptimizedCappedCollection('performance_metrics', {
        size: 100 * 1024 * 1024, // 100MB
        max: 50000, // Maximum 50k metric entries
        description: 'System performance metrics with circular buffer behavior'
      });

      // Audit trail with longer retention
      await this.createOptimizedCappedCollection('audit_trail', {
        size: 1024 * 1024 * 1024, // 1GB
        max: 1000000, // Maximum 1M audit entries
        description: 'Security audit trail with extended retention'
      });

      // User activity stream
      await this.createOptimizedCappedCollection('user_activity_stream', {
        size: 300 * 1024 * 1024, // 300MB
        max: 150000, // Maximum 150k activities
        description: 'User activity tracking with real-time processing'
      });

      // System health monitoring
      await this.createOptimizedCappedCollection('system_health_logs', {
        size: 150 * 1024 * 1024, // 150MB
        max: 75000, // Maximum 75k health checks
        description: 'System health monitoring with high-frequency updates'
      });

      // Initialize performance monitoring
      if (this.config.enablePerformanceMonitoring) {
        await this.setupPerformanceMonitoring();
      }

      // Setup real-time processing
      if (this.config.enableRealTimeProcessing) {
        await this.initializeRealTimeProcessing();
      }

      console.log('All capped collections initialized successfully');

    } catch (error) {
      console.error('Error initializing capped collections:', error);
      throw error;
    }
  }

  async createOptimizedCappedCollection(collectionName, options) {
    console.log(`Creating optimized capped collection: ${collectionName}...`);

    try {
      // Check if collection already exists
      const collections = await this.db.listCollections({ name: collectionName }).toArray();

      if (collections.length > 0) {
        // Collection exists - verify it's capped and get reference
        const collectionInfo = collections[0];
        if (!collectionInfo.options.capped) {
          throw new Error(`Collection ${collectionName} exists but is not capped`);
        }

        console.log(`Existing capped collection ${collectionName} found`);
        const collection = this.db.collection(collectionName);
        this.cappedCollections.set(collectionName, {
          collection: collection,
          options: collectionInfo.options,
          description: options.description
        });

      } else {
        // Create new capped collection
        const collection = await this.db.createCollection(collectionName, {
          capped: true,
          size: options.size,
          max: options.max,

          // Storage engine options for performance
          storageEngine: {
            wiredTiger: {
              configString: 'block_compressor=snappy' // Enable compression
            }
          }
        });

        // Create optimized indexes for capped collections
        await this.createCappedCollectionIndexes(collection, collectionName);

        this.cappedCollections.set(collectionName, {
          collection: collection,
          options: { capped: true, size: options.size, max: options.max },
          description: options.description,
          created: new Date()
        });

        console.log(`Created capped collection ${collectionName}: ${options.size} bytes, max ${options.max} documents`);
      }

    } catch (error) {
      console.error(`Error creating capped collection ${collectionName}:`, error);
      throw error;
    }
  }

  async createCappedCollectionIndexes(collection, collectionName) {
    console.log(`Creating optimized indexes for ${collectionName}...`);

    try {
      // Most capped collections benefit from a timestamp index for range queries
      // Note: Capped collections maintain insertion order, so _id is naturally ordered
      await collection.createIndex(
        { timestamp: -1 }, 
        { background: true, name: 'timestamp_desc' }
      );

      // Collection-specific indexes based on common query patterns
      switch (collectionName) {
        case 'application_logs':
          await collection.createIndexes([
            { key: { level: 1, timestamp: -1 }, background: true, name: 'level_timestamp' },
            { key: { application: 1, timestamp: -1 }, background: true, name: 'app_timestamp' },
            { key: { userId: 1 }, background: true, sparse: true, name: 'user_sparse' }
          ]);
          break;

        case 'event_stream':
          await collection.createIndexes([
            { key: { eventType: 1, timestamp: -1 }, background: true, name: 'event_type_timestamp' },
            { key: { userId: 1, timestamp: -1 }, background: true, sparse: true, name: 'user_timeline' },
            { key: { sessionId: 1 }, background: true, sparse: true, name: 'session_events' }
          ]);
          break;

        case 'performance_metrics':
          await collection.createIndexes([
            { key: { metricName: 1, timestamp: -1 }, background: true, name: 'metric_timeline' },
            { key: { hostname: 1, timestamp: -1 }, background: true, name: 'host_metrics' }
          ]);
          break;

        case 'audit_trail':
          await collection.createIndexes([
            { key: { action: 1, timestamp: -1 }, background: true, name: 'action_timeline' },
            { key: { userId: 1, timestamp: -1 }, background: true, name: 'user_audit' },
            { key: { resourceId: 1 }, background: true, sparse: true, name: 'resource_audit' }
          ]);
          break;
      }

    } catch (error) {
      console.error(`Error creating indexes for ${collectionName}:`, error);
      // Don't fail initialization for index creation issues
    }
  }

  async logApplicationEvent(application, level, message, metadata = {}) {
    const startTime = Date.now();

    try {
      const logCollection = this.cappedCollections.get('application_logs').collection;

      const logDocument = {
        timestamp: new Date(),
        application: application,
        level: level.toUpperCase(),
        message: message,

        // Enhanced metadata
        ...metadata,

        // System context
        hostname: metadata.hostname || require('os').hostname(),
        processId: process.pid,
        threadId: metadata.threadId,

        // Performance context
        memoryUsage: metadata.includeMemoryUsage ? process.memoryUsage() : undefined,

        // Request context
        requestId: metadata.requestId,
        sessionId: metadata.sessionId,
        userId: metadata.userId,

        // Application context
        version: metadata.version,
        environment: metadata.environment || process.env.NODE_ENV,

        // Timing information
        duration: metadata.duration,

        // Additional structured data
        tags: metadata.tags || [],
        customData: metadata.customData
      };

      // High-performance insert with minimal write concern
      const result = await logCollection.insertOne(logDocument, {
        writeConcern: this.config.writeConcern
      });

      // Update performance metrics
      this.updatePerformanceMetrics('application_logs', 'write', Date.now() - startTime);

      return {
        insertedId: result.insertedId,
        collection: 'application_logs',
        processingTime: Date.now() - startTime,
        logLevel: level,
        success: true
      };

    } catch (error) {
      console.error('Error logging application event:', error);
      this.processingStats.errorCount++;

      return {
        success: false,
        error: error.message,
        collection: 'application_logs',
        processingTime: Date.now() - startTime
      };
    }
  }

  async streamEvent(eventType, eventData, options = {}) {
    const startTime = Date.now();

    try {
      const streamCollection = this.cappedCollections.get('event_stream').collection;

      const eventDocument = {
        timestamp: new Date(),
        eventType: eventType,
        eventData: eventData,

        // Event metadata
        eventId: options.eventId || new ObjectId(),
        correlationId: options.correlationId,
        causationId: options.causationId,

        // User and session context
        userId: options.userId,
        sessionId: options.sessionId,

        // System context
        source: options.source || 'application',
        hostname: options.hostname || require('os').hostname(),

        // Event processing metadata
        priority: options.priority || 'normal',
        tags: options.tags || [],

        // Real-time processing flags
        requiresProcessing: options.requiresProcessing || false,
        processingStatus: options.processingStatus || 'pending',

        // Event relationships
        parentEventId: options.parentEventId,
        childEventIds: options.childEventIds || [],

        // Timing and sequence
        occurredAt: options.occurredAt || new Date(),
        sequenceNumber: options.sequenceNumber,

        // Custom event payload
        payload: eventData
      };

      // Insert event into capped collection
      const result = await streamCollection.insertOne(eventDocument, {
        writeConcern: this.config.writeConcern
      });

      // Trigger real-time processing if enabled
      if (this.config.enableRealTimeProcessing && eventDocument.requiresProcessing) {
        await this.triggerRealTimeProcessing(eventDocument);
      }

      // Update metrics
      this.updatePerformanceMetrics('event_stream', 'write', Date.now() - startTime);

      return {
        insertedId: result.insertedId,
        eventId: eventDocument.eventId,
        collection: 'event_stream',
        processingTime: Date.now() - startTime,
        success: true,
        sequenceOrder: result.insertedId // ObjectId provides natural ordering
      };

    } catch (error) {
      console.error('Error streaming event:', error);
      this.processingStats.errorCount++;

      return {
        success: false,
        error: error.message,
        collection: 'event_stream',
        processingTime: Date.now() - startTime
      };
    }
  }

  async recordPerformanceMetric(metricName, value, metadata = {}) {
    const startTime = Date.now();

    try {
      const metricsCollection = this.cappedCollections.get('performance_metrics').collection;

      const metricDocument = {
        timestamp: new Date(),
        metricName: metricName,
        value: value,

        // Metric metadata
        unit: metadata.unit || 'count',
        type: metadata.type || 'gauge', // gauge, counter, histogram, timer

        // System context
        hostname: metadata.hostname || require('os').hostname(),
        service: metadata.service || 'unknown',
        environment: metadata.environment || process.env.NODE_ENV,

        // Metric dimensions
        tags: metadata.tags || {},
        dimensions: metadata.dimensions || {},

        // Statistical data
        min: metadata.min,
        max: metadata.max,
        avg: metadata.avg,
        count: metadata.count,
        sum: metadata.sum,

        // Performance context
        duration: metadata.duration,
        sampleRate: metadata.sampleRate || 1.0,

        // Additional metadata
        source: metadata.source || 'system',
        category: metadata.category || 'performance',
        priority: metadata.priority || 'normal',

        // Custom data
        customMetadata: metadata.customMetadata
      };

      const result = await metricsCollection.insertOne(metricDocument, {
        writeConcern: this.config.writeConcern
      });

      // Update internal metrics
      this.updatePerformanceMetrics('performance_metrics', 'write', Date.now() - startTime);

      return {
        insertedId: result.insertedId,
        collection: 'performance_metrics',
        metricName: metricName,
        processingTime: Date.now() - startTime,
        success: true
      };

    } catch (error) {
      console.error('Error recording performance metric:', error);
      this.processingStats.errorCount++;

      return {
        success: false,
        error: error.message,
        collection: 'performance_metrics',
        processingTime: Date.now() - startTime
      };
    }
  }

  async createTailableCursor(collectionName, filter = {}, options = {}) {
    console.log(`Creating tailable cursor for ${collectionName}...`);

    try {
      const cappedCollection = this.cappedCollections.get(collectionName);
      if (!cappedCollection) {
        throw new Error(`Capped collection ${collectionName} not found`);
      }

      const collection = cappedCollection.collection;

      // Configure tailable cursor options
      const cursorOptions = {
        tailable: true,
        awaitData: true,
        noCursorTimeout: true,
        maxTimeMS: options.maxTimeMS || this.config.tailCursorTimeout,
        batchSize: options.batchSize || 100,
        ...options
      };

      // Create cursor starting from a specified position, from the current end, or from the beginning
      if (options.startAfter) {
        // Resume after a known _id (capped collections preserve insertion order)
        filter._id = { $gt: options.startAfter };
      } else if (options.startFromEnd) {
        // Skip existing documents: find the newest _id and only tail documents inserted after it
        const lastDocument = await collection.find({}).sort({ $natural: -1 }).limit(1).next();
        if (lastDocument) {
          filter._id = { $gt: lastDocument._id };
        }
      }
      const cursor = collection.find(filter, cursorOptions);

      // Store cursor for management
      const cursorId = options.cursorId || new ObjectId().toString();
      this.tailableCursors.set(cursorId, {
        cursor: cursor,
        collection: collectionName,
        filter: filter,
        options: cursorOptions,
        created: new Date(),
        active: true
      });

      console.log(`Tailable cursor ${cursorId} created for ${collectionName}`);

      return {
        cursorId: cursorId,
        cursor: cursor,
        collection: collectionName,
        success: true
      };

    } catch (error) {
      console.error(`Error creating tailable cursor for ${collectionName}:`, error);
      return {
        success: false,
        error: error.message,
        collection: collectionName
      };
    }
  }

  async processTailableCursor(cursorId, processingFunction, options = {}) {
    console.log(`Starting tailable cursor processing for ${cursorId}...`);

    try {
      const cursorInfo = this.tailableCursors.get(cursorId);
      if (!cursorInfo) {
        throw new Error(`Tailable cursor ${cursorId} not found`);
      }

      const cursor = cursorInfo.cursor;
      const processingStats = {
        documentsProcessed: 0,
        errors: 0,
        startTime: new Date(),
        lastProcessedAt: null
      };

      // Process documents as they arrive
      while (await cursor.hasNext() && cursorInfo.active) {
        try {
          const document = await cursor.next();

          if (document) {
            // Process the document
            const processingStartTime = Date.now();
            await processingFunction(document, cursorInfo.collection);

            // Update statistics
            processingStats.documentsProcessed++;
            processingStats.lastProcessedAt = new Date();

            // Update performance metrics
            this.updatePerformanceMetrics(
              cursorInfo.collection, 
              'tail_process', 
              Date.now() - processingStartTime
            );

            // Batch processing optimization
            if (options.batchProcessing && processingStats.documentsProcessed % options.batchSize === 0) {
              await this.flushBatchProcessing(cursorId, options);
            }
          }

        } catch (processingError) {
          console.error(`Error processing document from cursor ${cursorId}:`, processingError);
          processingStats.errors++;

          // Handle processing errors based on configuration
          if (options.stopOnError) {
            break;
          }
        }
      }

      console.log(`Tailable cursor processing completed for ${cursorId}:`, processingStats);

      return {
        success: true,
        cursorId: cursorId,
        processingStats: processingStats
      };

    } catch (error) {
      console.error(`Error in tailable cursor processing for ${cursorId}:`, error);
      return {
        success: false,
        error: error.message,
        cursorId: cursorId
      };
    }
  }

  async bulkInsertLogs(collectionName, documents, options = {}) {
    console.log(`Performing bulk insert to ${collectionName} with ${documents.length} documents...`);
    const startTime = Date.now();

    try {
      const cappedCollection = this.cappedCollections.get(collectionName);
      if (!cappedCollection) {
        throw new Error(`Capped collection ${collectionName} not found`);
      }

      const collection = cappedCollection.collection;

      // Prepare documents with consistent structure
      const preparedDocuments = documents.map((doc, index) => ({
        ...doc,
        timestamp: doc.timestamp || new Date(),
        batchId: options.batchId || new ObjectId(),
        batchIndex: index,
        bulkInsertMetadata: {
          batchSize: documents.length,
          insertedAt: new Date(),
          source: options.source || 'bulk_operation'
        }
      }));

      // Configure bulk insert options for maximum performance
      const insertOptions = {
        ordered: options.ordered || false, // Unordered for better performance
        writeConcern: options.writeConcern || this.config.writeConcern,
        bypassDocumentValidation: options.bypassValidation || false
      };

      // Execute bulk insert
      const result = await collection.insertMany(preparedDocuments, insertOptions);

      // Update performance metrics
      const processingTime = Date.now() - startTime;
      this.updatePerformanceMetrics(collectionName, 'bulk_write', processingTime);
      this.processingStats.totalWrites += result.insertedCount;

      console.log(`Bulk insert completed: ${result.insertedCount} documents in ${processingTime}ms`);

      return {
        success: true,
        collection: collectionName,
        insertedCount: result.insertedCount,
        insertedIds: Object.values(result.insertedIds),
        processingTime: processingTime,
        throughput: Math.round((result.insertedCount / processingTime) * 1000), // docs/second
        batchId: options.batchId
      };

    } catch (error) {
      console.error(`Error in bulk insert to ${collectionName}:`, error);
      this.processingStats.errorCount++;

      return {
        success: false,
        error: error.message,
        collection: collectionName,
        processingTime: Date.now() - startTime
      };
    }
  }

  async queryRecentDocuments(collectionName, filter = {}, options = {}) {
    const startTime = Date.now();

    try {
      const cappedCollection = this.cappedCollections.get(collectionName);
      if (!cappedCollection) {
        throw new Error(`Capped collection ${collectionName} not found`);
      }

      const collection = cappedCollection.collection;

      // Configure query options for optimal performance
      const queryOptions = {
        sort: { $natural: options.reverse ? 1 : -1 }, // Natural order (insertion order)
        limit: options.limit || 1000,
        projection: options.projection || {},
        maxTimeMS: options.maxTimeMS || 5000,
        batchSize: options.batchSize || 100
      };

      // Add time range filter if specified
      if (options.timeRange) {
        filter.timestamp = {
          $gte: options.timeRange.start,
          $lte: options.timeRange.end || new Date()
        };
      }

      // Execute query
      const documents = await collection.find(filter, queryOptions).toArray();

      // Update performance metrics
      const processingTime = Date.now() - startTime;
      this.updatePerformanceMetrics(collectionName, 'read', processingTime);
      this.processingStats.totalReads += documents.length;

      return {
        success: true,
        collection: collectionName,
        documents: documents,
        count: documents.length,
        processingTime: processingTime,
        query: filter,
        options: queryOptions
      };

    } catch (error) {
      console.error(`Error querying ${collectionName}:`, error);
      this.processingStats.errorCount++;

      return {
        success: false,
        error: error.message,
        collection: collectionName,
        processingTime: Date.now() - startTime
      };
    }
  }

  updatePerformanceMetrics(collectionName, operationType, duration) {
    if (!this.config.enablePerformanceMonitoring) return;

    const key = `${collectionName}_${operationType}`;

    if (!this.performanceMetrics.has(key)) {
      this.performanceMetrics.set(key, {
        totalOperations: 0,
        totalTime: 0,
        averageTime: 0,
        minTime: Infinity,
        maxTime: 0,
        lastOperation: null
      });
    }

    const metrics = this.performanceMetrics.get(key);

    metrics.totalOperations++;
    metrics.totalTime += duration;
    metrics.averageTime = metrics.totalTime / metrics.totalOperations;
    metrics.minTime = Math.min(metrics.minTime, duration);
    metrics.maxTime = Math.max(metrics.maxTime, duration);
    metrics.lastOperation = new Date();

    // Update global stats
    if (operationType === 'write' || operationType === 'bulk_write') {
      this.processingStats.averageWriteTime = 
        (this.processingStats.averageWriteTime + duration) / 2;
    } else if (operationType === 'read') {
      this.processingStats.averageReadTime = 
        (this.processingStats.averageReadTime + duration) / 2;
    }
  }

  async getCollectionStats() {
    console.log('Gathering capped collection statistics...');

    const stats = {};

    for (const [collectionName, cappedInfo] of this.cappedCollections.entries()) {
      try {
        const collection = cappedInfo.collection;

        // Get MongoDB collection stats (collection.stats() was removed from newer drivers)
        const mongoStats = await this.db.command({ collStats: collectionName });

        // Get performance metrics
        const performanceKey = `${collectionName}_write`;
        const performanceMetrics = this.performanceMetrics.get(performanceKey) || {};

        stats[collectionName] = {
          // Collection configuration
          configuration: cappedInfo.options,
          description: cappedInfo.description,
          created: cappedInfo.created,

          // MongoDB stats
          size: mongoStats.size,
          storageSize: mongoStats.storageSize,
          totalIndexSize: mongoStats.totalIndexSize,
          count: mongoStats.count,
          avgObjSize: mongoStats.avgObjSize,
          maxSize: mongoStats.maxSize,
          max: mongoStats.max,

          // Utilization metrics
          sizeUtilization: (mongoStats.size / mongoStats.maxSize * 100).toFixed(2) + '%',
          countUtilization: mongoStats.max ? (mongoStats.count / mongoStats.max * 100).toFixed(2) + '%' : 'N/A',

          // Performance metrics
          averageWriteTime: performanceMetrics.averageTime || 0,
          totalOperations: performanceMetrics.totalOperations || 0,
          minWriteTime: performanceMetrics.minTime === Infinity ? 0 : performanceMetrics.minTime || 0,
          maxWriteTime: performanceMetrics.maxTime || 0,
          lastOperation: performanceMetrics.lastOperation,

          // Health indicators
          isNearCapacity: mongoStats.size / mongoStats.maxSize > 0.8,
          hasRecentActivity: performanceMetrics.lastOperation && 
            (new Date() - performanceMetrics.lastOperation) < 300000, // 5 minutes

          // Estimated metrics
          estimatedDocumentsPerHour: this.estimateDocumentsPerHour(performanceMetrics),
          estimatedTimeToCapacity: this.estimateTimeToCapacity(mongoStats, performanceMetrics)
        };

      } catch (error) {
        stats[collectionName] = {
          error: error.message,
          available: false
        };
      }
    }

    return {
      collections: stats,
      globalStats: this.processingStats,
      summary: {
        totalCollections: this.cappedCollections.size,
        totalActiveCursors: this.tailableCursors.size,
        totalMemoryUsage: this.estimateMemoryUsage(),
        uptime: this.startTime ? Date.now() - this.startTime.getTime() : 0
      }
    };
  }

  estimateDocumentsPerHour(performanceMetrics) {
    if (!performanceMetrics || !performanceMetrics.lastOperation) return 0;

    const hoursActive = (new Date() - (this.startTime || new Date())) / (1000 * 60 * 60);
    if (hoursActive === 0) return 0;

    return Math.round((performanceMetrics.totalOperations || 0) / hoursActive);
  }

  estimateTimeToCapacity(mongoStats, performanceMetrics) {
    if (!performanceMetrics || !performanceMetrics.totalOperations) return 'Unknown';

    const remainingSpace = mongoStats.maxSize - mongoStats.size;
    const averageDocSize = mongoStats.avgObjSize || 1000;
    const remainingDocuments = Math.floor(remainingSpace / averageDocSize);

    const documentsPerHour = this.estimateDocumentsPerHour(performanceMetrics);
    if (documentsPerHour === 0) return 'Unknown';

    const hoursToCapacity = remainingDocuments / documentsPerHour;

    if (hoursToCapacity < 24) {
      return `${Math.round(hoursToCapacity)} hours`;
    } else {
      return `${Math.round(hoursToCapacity / 24)} days`;
    }
  }

  estimateMemoryUsage() {
    // Rough estimate based on active cursors and performance metrics
    const baseMem = 50 * 1024 * 1024; // 50MB base
    const cursorMem = this.tailableCursors.size * 1024 * 1024; // 1MB per cursor
    const metricsMem = this.performanceMetrics.size * 10 * 1024; // 10KB per metric set

    return baseMem + cursorMem + metricsMem;
  }

  async shutdown() {
    console.log('Shutting down capped collections manager...');

    // Close all tailable cursors
    for (const [cursorId, cursorInfo] of this.tailableCursors.entries()) {
      try {
        cursorInfo.active = false;
        await cursorInfo.cursor.close();
        console.log(`Closed tailable cursor: ${cursorId}`);
      } catch (error) {
        console.error(`Error closing cursor ${cursorId}:`, error);
      }
    }

    // Clear collections and metrics
    this.cappedCollections.clear();
    this.tailableCursors.clear();
    this.performanceMetrics.clear();

    console.log('Capped collections manager shutdown complete');
  }
}

// Benefits of MongoDB Capped Collections:
// - Fixed-size storage with automatic old document removal (circular buffer behavior)
// - Guaranteed insertion order preservation for event sequencing
// - High-performance writes without index maintenance overhead
// - Optimal read performance for recent document queries
// - Built-in document rotation without external management
// - Tailable cursors for real-time data streaming
// - Memory-efficient operations with predictable resource usage
// - No fragmentation or storage bloat issues
// - Ideal for logging, event streaming, and real-time analytics
// - SQL-compatible operations through QueryLeaf integration

module.exports = {
  CappedCollectionsManager
};
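
For orientation, here is a brief usage sketch of the manager above. The module path, connection string, and database name are assumptions; the monitoring and real-time hooks whose helper methods are not shown in the snippet are disabled via config, and a short delay stands in for the asynchronous collection setup the constructor kicks off.

// Usage sketch for CappedCollectionsManager (illustrative; paths and URIs are assumptions)
const { MongoClient } = require('mongodb');
const { CappedCollectionsManager } = require('./capped-collections-manager');

async function run() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  // Disable the hooks whose helper methods are not included in the snippet above
  const manager = new CappedCollectionsManager(client.db('high_performance_logging'), {
    enablePerformanceMonitoring: false,
    enableRealTimeProcessing: false
  });

  // The constructor starts collection creation asynchronously; give it a moment to finish
  await new Promise(resolve => setTimeout(resolve, 1000));

  // High-throughput logging and event streaming
  await manager.logApplicationEvent('web-server', 'info', 'User login successful', {
    userId: 'user123',
    sessionId: 'sess456'
  });
  await manager.streamEvent('user_action', { action: 'login', resource: '/session' }, {
    userId: 'user123',
    sessionId: 'sess456'
  });

  // Tail the event stream and print new events as they arrive (runs until shutdown)
  const tail = await manager.createTailableCursor('event_stream');
  manager.processTailableCursor(tail.cursorId, async (doc) => {
    console.log('new event:', doc.eventType, doc.timestamp);
  });

  // Inspect capacity utilization and basic statistics
  console.log(JSON.stringify(await manager.getCollectionStats(), null, 2));
}

run().catch(console.error);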

Understanding MongoDB Capped Collections Architecture

Advanced High-Performance Logging and Streaming Patterns

Implement sophisticated capped collection strategies for production MongoDB deployments:

// Production-ready MongoDB capped collections with advanced optimization and real-time processing
class ProductionCappedCollectionsManager extends CappedCollectionsManager {
  constructor(db, productionConfig) {
    super(db, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      enableShardedDeployment: true,
      enableReplicationOptimization: true,
      enableAdvancedMonitoring: true,
      enableAutomaticSizing: true,
      enableCompression: true,
      enableRealTimeAlerts: true
    };

    this.setupProductionOptimizations();
    this.initializeAdvancedMonitoring();
    this.setupAutomaticManagement();
  }

  async implementShardedCappedCollections(collectionName, shardingStrategy) {
    console.log(`Implementing sharded capped collections for ${collectionName}...`);

    const shardingConfig = {
      // Shard key design for capped collections
      shardKey: shardingStrategy.shardKey || { timestamp: 1, hostname: 1 },

      // Chunk size optimization for high-throughput writes
      chunkSizeMB: shardingStrategy.chunkSize || 16,

      // Balancing strategy
      enableAutoSplit: true,
      enableBalancer: true,
      balancerWindowStart: "01:00",
      balancerWindowEnd: "06:00",

      // Write distribution
      enableEvenWriteDistribution: true,
      monitorHotShards: true,
      automaticRebalancing: true
    };

    return await this.deployShardedCappedCollection(collectionName, shardingConfig);
  }

  async setupAdvancedRealTimeProcessing() {
    console.log('Setting up advanced real-time processing for capped collections...');

    const processingPipeline = {
      // Stream processing configuration
      streamProcessing: {
        enableChangeStreams: true,
        enableAggregationPipelines: true,
        enableParallelProcessing: true,
        maxConcurrentProcessors: 8
      },

      // Real-time analytics
      realTimeAnalytics: {
        enableWindowedAggregations: true,
        windowSizes: ['1m', '5m', '15m', '1h'],
        enableTrendDetection: true,
        enableAnomalyDetection: true
      },

      // Event correlation
      eventCorrelation: {
        enableEventMatching: true,
        correlationTimeWindow: 300000, // 5 minutes
        enableComplexEventProcessing: true
      }
    };

    return await this.deployRealTimeProcessing(processingPipeline);
  }

  async implementAutomaticCapacityManagement() {
    console.log('Implementing automatic capacity management for capped collections...');

    const capacityManagement = {
      // Automatic sizing
      automaticSizing: {
        enableDynamicResize: true,
        growthThreshold: 0.8,  // 80% capacity
        shrinkThreshold: 0.3,  // 30% capacity
        maxSize: 10 * 1024 * 1024 * 1024, // 10GB max
        minSize: 100 * 1024 * 1024 // 100MB min
      },

      // Performance-based optimization
      performanceOptimization: {
        monitorWriteLatency: true,
        latencyThreshold: 100, // 100ms
        enableAutomaticIndexing: true,
        optimizeForWorkload: true
      },

      // Resource management
      resourceManagement: {
        monitorMemoryUsage: true,
        memoryThreshold: 0.7, // 70% memory usage
        enableBackpressure: true,
        enableLoadShedding: true
      }
    };

    return await this.deployCapacityManagement(capacityManagement);
  }
}

SQL-Style Capped Collections Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB capped collections and high-performance logging:

-- QueryLeaf capped collections operations with SQL-familiar syntax for MongoDB

-- Create capped collections with SQL-style DDL
CREATE CAPPED COLLECTION application_logs 
WITH (
  size = '200MB',
  max_documents = 100000,
  write_concern = 'fast',
  compression = 'snappy'
);

-- Alternative syntax for collection creation
CREATE TABLE event_stream (
  event_id UUID DEFAULT GENERATE_UUID(),
  timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  event_type VARCHAR(100) NOT NULL,
  event_data DOCUMENT,
  user_id VARCHAR(50),
  session_id VARCHAR(100),
  source VARCHAR(50) DEFAULT 'application',

  -- Capped collection metadata
  insertion_order BIGINT -- Natural insertion order in capped collections
)
WITH CAPPED (
  size = '500MB',
  max_documents = 250000,
  auto_rotation = true
);

-- High-performance log insertion with SQL syntax
INSERT INTO application_logs (
  application, level, message, timestamp, user_id, session_id, metadata
) VALUES 
  ('web-server', 'INFO', 'User login successful', CURRENT_TIMESTAMP, 'user123', 'sess456', 
   JSON_OBJECT('ip_address', '192.168.1.100', 'user_agent', 'Mozilla/5.0...')),
  ('web-server', 'WARN', 'Slow query detected', CURRENT_TIMESTAMP, 'user123', 'sess456',
   JSON_OBJECT('query_time', 2500, 'table', 'users')),
  ('payment-service', 'ERROR', 'Payment processing failed', CURRENT_TIMESTAMP, 'user789', 'sess789',
   JSON_OBJECT('amount', 99.99, 'error_code', 'CARD_DECLINED'));

-- Bulk insertion for high-throughput logging
INSERT INTO application_logs (application, level, message, timestamp, metadata)
WITH log_batch AS (
  SELECT 
    app_name as application,
    log_level as level,
    log_message as message,
    log_timestamp as timestamp,

    -- Enhanced metadata generation
    JSON_OBJECT(
      'hostname', hostname,
      'process_id', process_id,
      'thread_id', thread_id,
      'memory_usage_mb', memory_usage / 1024 / 1024,
      'request_duration_ms', request_duration,
      'tags', log_tags,
      'custom_data', custom_metadata
    ) as metadata

  FROM staging_logs
  WHERE processed = false
    AND log_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
)
SELECT application, level, message, timestamp, metadata
FROM log_batch
WHERE level IN ('INFO', 'WARN', 'ERROR', 'CRITICAL')

-- Capped collection bulk insert configuration
WITH BULK_OPTIONS (
  batch_size = 1000,
  ordered = false,
  write_concern = 'fast',
  bypass_validation = false
);

-- Event streaming with guaranteed insertion order
INSERT INTO event_stream (
  event_type, event_data, user_id, session_id, 
  correlation_id, source, priority, tags
) 
WITH event_preparation AS (
  SELECT 
    event_type,
    event_payload as event_data,
    user_id,
    session_id,

    -- Generate correlation context
    COALESCE(correlation_id, GENERATE_UUID()) as correlation_id,
    COALESCE(event_source, 'application') as source,
    COALESCE(event_priority, 'normal') as priority,

    -- Generate event tags for filtering
    ARRAY[
      event_category,
      'realtime',
      CASE WHEN event_priority = 'high' THEN 'urgent' ELSE 'standard' END
    ] as tags,

    -- Add timing metadata
    CURRENT_TIMESTAMP as insertion_timestamp,
    event_occurred_at

  FROM incoming_events
  WHERE processing_status = 'pending'
    AND event_occurred_at >= CURRENT_TIMESTAMP - INTERVAL '5 minutes'
)
SELECT 
  event_type,
  JSON_SET(
    event_data,
    '$.insertion_timestamp', insertion_timestamp,
    '$.occurred_at', event_occurred_at,
    '$.processing_context', JSON_OBJECT(
      'inserted_by', 'queryleaf',
      'capped_collection', true,
      'guaranteed_order', true
    )
  ) as event_data,
  user_id,
  session_id,
  correlation_id,
  source,
  priority,
  tags
FROM event_preparation
ORDER BY event_occurred_at, correlation_id;

-- Query recent logs with natural insertion order (most efficient for capped collections)
WITH recent_application_logs AS (
  SELECT 
    timestamp,
    application,
    level,
    message,
    user_id,
    session_id,
    metadata,

    -- Natural insertion order in capped collections
    _id as insertion_order,

    -- Extract metadata fields
    JSON_EXTRACT(metadata, '$.hostname') as hostname,
    JSON_EXTRACT(metadata, '$.request_duration_ms') as request_duration,
    JSON_EXTRACT(metadata, '$.memory_usage_mb') as memory_usage,

    -- Calculate log age
    EXTRACT(SECONDS FROM CURRENT_TIMESTAMP - timestamp) as age_seconds,

    -- Categorize log importance
    CASE level
      WHEN 'CRITICAL' THEN 1
      WHEN 'ERROR' THEN 2  
      WHEN 'WARN' THEN 3
      WHEN 'INFO' THEN 4
      WHEN 'DEBUG' THEN 5
    END as log_priority_numeric

  FROM application_logs
  WHERE 
    -- Time-based filtering (efficient with capped collections)
    timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'

    -- Application filtering
    AND (application = $1 OR $1 IS NULL)

    -- Level filtering
    AND level IN ('ERROR', 'WARN', 'INFO')

  -- Natural order query (most efficient for capped collections)
  ORDER BY $natural DESC
  LIMIT 1000
),

log_analysis AS (
  SELECT 
    ral.*,

    -- Session context analysis
    COUNT(*) OVER (
      PARTITION BY session_id 
      ORDER BY timestamp 
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as session_log_sequence,

    -- Error rate analysis
    COUNT(*) FILTER (WHERE level IN ('ERROR', 'CRITICAL')) OVER (
      PARTITION BY application, DATE_TRUNC('minute', timestamp)
    ) as errors_this_minute,

    -- Performance analysis
    AVG(request_duration) OVER (
      PARTITION BY application 
      ORDER BY timestamp 
      ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
    ) as rolling_avg_duration,

    -- Anomaly detection
    CASE 
      WHEN request_duration > 
        AVG(request_duration) OVER (
          PARTITION BY application 
          ORDER BY timestamp 
          ROWS BETWEEN 100 PRECEDING AND CURRENT ROW
        ) * 3 
      THEN 'performance_anomaly'

      WHEN errors_this_minute > 10 THEN 'error_spike'

      WHEN memory_usage > 
        AVG(memory_usage) OVER (
          PARTITION BY hostname 
          ORDER BY timestamp 
          ROWS BETWEEN 50 PRECEDING AND CURRENT ROW
        ) * 2
      THEN 'memory_anomaly'

      ELSE 'normal'
    END as anomaly_status

  FROM recent_application_logs ral
)

SELECT 
  timestamp,
  application,
  level,
  message,
  user_id,
  session_id,
  hostname,

  -- Performance metrics
  request_duration,
  memory_usage,
  rolling_avg_duration,

  -- Context information
  session_log_sequence,
  errors_this_minute,

  -- Analysis results
  log_priority_numeric,
  anomaly_status,
  age_seconds,

  -- Helpful indicators
  CASE 
    WHEN age_seconds < 60 THEN 'very_recent'
    WHEN age_seconds < 300 THEN 'recent' 
    WHEN age_seconds < 1800 THEN 'moderate'
    ELSE 'older'
  END as recency_category,

  -- Alert conditions
  CASE 
    WHEN level = 'CRITICAL' OR anomaly_status != 'normal' THEN 'immediate_attention'
    WHEN level = 'ERROR' AND errors_this_minute > 5 THEN 'monitor_closely'
    WHEN level = 'WARN' AND session_log_sequence > 20 THEN 'session_issues'
    ELSE 'normal_monitoring'
  END as attention_level

FROM log_analysis
WHERE 
  -- Focus on actionable logs
  (level IN ('CRITICAL', 'ERROR') OR anomaly_status != 'normal')

ORDER BY 
  -- Prioritize by importance and recency
  CASE attention_level
    WHEN 'immediate_attention' THEN 1
    WHEN 'monitor_closely' THEN 2  
    WHEN 'session_issues' THEN 3
    ELSE 4
  END,
  timestamp DESC

LIMIT 500;

-- Real-time event stream processing with tailable cursor behavior
WITH LIVE_EVENT_STREAM AS (
  SELECT 
    event_id,
    timestamp,
    event_type,
    event_data,
    user_id,
    session_id,
    correlation_id,
    source,
    tags,

    -- Event sequence tracking
    _id as natural_order,

    -- Extract event payload details
    JSON_EXTRACT(event_data, '$.action') as action,
    JSON_EXTRACT(event_data, '$.resource') as resource,
    JSON_EXTRACT(event_data, '$.metadata') as event_metadata,

    -- Real-time processing flags
    JSON_EXTRACT(event_data, '$.requires_processing') as requires_processing,
    JSON_EXTRACT(event_data, '$.priority') as event_priority

  FROM event_stream
  WHERE 
    -- Process events from the last insertion point
    _id > $last_processed_id

    -- Focus on events requiring real-time processing
    AND (
      JSON_EXTRACT(event_data, '$.requires_processing') = true
      OR event_type IN ('user_action', 'system_alert', 'security_event')
      OR JSON_EXTRACT(event_data, '$.priority') = 'high'
    )

  -- Use natural insertion order for optimal capped collection performance
  ORDER BY $natural ASC
),

event_correlation AS (
  SELECT 
    les.*,

    -- Correlation analysis
    COUNT(*) OVER (
      PARTITION BY correlation_id
      ORDER BY natural_order
    ) as correlation_sequence,

    -- User behavior patterns
    COUNT(*) OVER (
      PARTITION BY user_id, event_type
      ORDER BY timestamp
      RANGE BETWEEN INTERVAL '5 minutes' PRECEDING AND CURRENT ROW  
    ) as recent_similar_events,

    -- Session context
    STRING_AGG(event_type, ' -> ') OVER (
      PARTITION BY session_id
      ORDER BY natural_order
      ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
    ) as session_event_sequence,

    -- Anomaly detection
    CASE 
      WHEN recent_similar_events > 10 THEN 'potential_abuse'
      WHEN correlation_sequence > 50 THEN 'long_running_process'
      WHEN event_type = 'security_event' THEN 'security_concern'
      ELSE 'normal_event'
    END as event_classification

  FROM live_event_stream les
),

processed_events AS (
  SELECT 
    ec.*,

    -- Generate processing instructions
    JSON_OBJECT(
      'processing_priority', 
      CASE event_classification
        WHEN 'security_concern' THEN 'critical'
        WHEN 'potential_abuse' THEN 'high'
        WHEN 'long_running_process' THEN 'monitor'
        ELSE 'standard'
      END,

      'correlation_context', JSON_OBJECT(
        'correlation_id', correlation_id,
        'sequence', correlation_sequence,
        'related_events', recent_similar_events
      ),

      'session_context', JSON_OBJECT(
        'session_id', session_id,
        'event_sequence', session_event_sequence,
        'user_id', user_id
      ),

      'processing_metadata', JSON_OBJECT(
        'inserted_at', CURRENT_TIMESTAMP,
        'natural_order', natural_order,
        'capped_collection_source', true
      )
    ) as processing_instructions,

    -- Determine next processing steps
    CASE event_classification
      WHEN 'security_concern' THEN 'immediate_alert'
      WHEN 'potential_abuse' THEN 'rate_limit_check'  
      WHEN 'long_running_process' THEN 'status_update'
      ELSE 'standard_processing'
    END as next_action

  FROM event_correlation ec
)

SELECT 
  event_id,
  timestamp,
  event_type,
  action,
  resource,
  user_id,
  session_id,

  -- Analysis results
  event_classification,
  correlation_sequence,
  recent_similar_events,
  next_action,

  -- Processing context
  processing_instructions,

  -- Natural ordering for downstream systems
  natural_order,

  -- Real-time indicators
  EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - timestamp)) as processing_latency_seconds,

  CASE 
    WHEN EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - timestamp)) < 5 THEN 'real_time'
    WHEN EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - timestamp)) < 30 THEN 'near_real_time'
    ELSE 'delayed_processing'
  END as processing_timeliness

FROM processed_events
WHERE event_classification != 'normal_event' OR requires_processing = true
ORDER BY 
  -- Process highest priority events first
  CASE next_action
    WHEN 'immediate_alert' THEN 1
    WHEN 'rate_limit_check' THEN 2
    WHEN 'status_update' THEN 3
    ELSE 4
  END,
  natural_order ASC;

-- Performance metrics and capacity monitoring for capped collections
WITH capped_collection_stats AS (
  SELECT 
    collection_name,

    -- Storage utilization
    current_size_mb,
    max_size_mb,
    (current_size_mb / max_size_mb * 100) as size_utilization_percent,

    -- Document utilization  
    document_count,
    max_documents,
    (document_count / NULLIF(max_documents, 0) * 100) as document_utilization_percent,

    -- Performance metrics
    avg_document_size,
    total_index_size_mb,

    -- Operation statistics
    total_inserts_today,
    avg_inserts_per_hour,
    peak_inserts_per_hour,

    -- Capacity projections
    estimated_hours_to_capacity,
    estimated_rotation_frequency

  FROM (
    -- This would be populated by MongoDB collection stats
    VALUES 
      ('application_logs', 150, 200, 75000, 100000, 2048, 5, 180000, 7500, 15000, 8, 'every_3_hours'),
      ('event_stream', 400, 500, 200000, 250000, 2048, 8, 480000, 20000, 35000, 4, 'every_hour'),
      ('performance_metrics', 80, 100, 40000, 50000, 2048, 3, 96000, 4000, 8000, 20, 'every_5_hours')
  ) AS stats(collection_name, current_size_mb, max_size_mb, document_count, max_documents, 
             avg_document_size, total_index_size_mb, total_inserts_today, avg_inserts_per_hour,
             peak_inserts_per_hour, estimated_hours_to_capacity, estimated_rotation_frequency)
),

performance_analysis AS (
  SELECT 
    ccs.*,

    -- Utilization status
    CASE 
      WHEN size_utilization_percent > 90 THEN 'critical'
      WHEN size_utilization_percent > 80 THEN 'warning'  
      WHEN size_utilization_percent > 60 THEN 'moderate'
      ELSE 'healthy'
    END as size_status,

    CASE 
      WHEN document_utilization_percent > 90 THEN 'critical'
      WHEN document_utilization_percent > 80 THEN 'warning'
      WHEN document_utilization_percent > 60 THEN 'moderate'  
      ELSE 'healthy'
    END as document_status,

    -- Performance indicators
    CASE 
      WHEN peak_inserts_per_hour / NULLIF(avg_inserts_per_hour, 0) > 3 THEN 'high_variance'
      WHEN peak_inserts_per_hour / NULLIF(avg_inserts_per_hour, 0) > 2 THEN 'moderate_variance'
      ELSE 'stable_load'
    END as load_pattern,

    -- Capacity recommendations
    CASE 
      WHEN estimated_hours_to_capacity < 24 THEN 'monitor_closely'
      WHEN estimated_hours_to_capacity < 72 THEN 'plan_expansion'
      WHEN estimated_hours_to_capacity > 168 THEN 'over_provisioned'
      ELSE 'adequate_capacity'
    END as capacity_recommendation,

    -- Optimization suggestions
    CASE 
      WHEN total_index_size_mb / current_size_mb > 0.3 THEN 'review_indexes'
      WHEN avg_document_size > 4096 THEN 'consider_compression'
      WHEN avg_inserts_per_hour < 100 THEN 'potentially_over_sized'
      ELSE 'well_optimized'
    END as optimization_suggestion

  FROM capped_collection_stats ccs
)

SELECT 
  collection_name,

  -- Current utilization
  ROUND(size_utilization_percent, 1) as size_used_percent,
  ROUND(document_utilization_percent, 1) as documents_used_percent,
  size_status,
  document_status,

  -- Capacity information  
  current_size_mb,
  max_size_mb,
  (max_size_mb - current_size_mb) as remaining_capacity_mb,
  document_count,
  max_documents,

  -- Performance metrics
  avg_document_size,
  total_index_size_mb,
  load_pattern,
  avg_inserts_per_hour,
  peak_inserts_per_hour,

  -- Projections and recommendations
  estimated_hours_to_capacity,
  estimated_rotation_frequency,
  capacity_recommendation,
  optimization_suggestion,

  -- Action items
  CASE 
    WHEN size_status = 'critical' OR document_status = 'critical' THEN 'immediate_action_required'
    WHEN capacity_recommendation = 'monitor_closely' THEN 'increase_monitoring_frequency'
    WHEN optimization_suggestion != 'well_optimized' THEN 'schedule_optimization_review'
    ELSE 'continue_normal_operations'
  END as recommended_action,

  -- Detailed recommendations
  CASE recommended_action
    WHEN 'immediate_action_required' THEN 'Increase capped collection size or reduce retention period'
    WHEN 'increase_monitoring_frequency' THEN 'Monitor every 15 minutes instead of hourly'
    WHEN 'schedule_optimization_review' THEN 'Review indexes, compression, and document structure'
    ELSE 'Collection is operating within normal parameters'
  END as action_details

FROM performance_analysis
ORDER BY 
  CASE size_status 
    WHEN 'critical' THEN 1
    WHEN 'warning' THEN 2
    WHEN 'moderate' THEN 3  
    ELSE 4
  END,
  collection_name;

-- QueryLeaf provides comprehensive capped collection capabilities:
-- 1. SQL-familiar capped collection creation and management
-- 2. High-performance bulk insertion with optimized batching
-- 3. Natural insertion order queries for optimal performance
-- 4. Real-time event streaming with tailable cursor behavior  
-- 5. Advanced analytics and anomaly detection on streaming data
-- 6. Automatic capacity monitoring and optimization recommendations
-- 7. Integration with MongoDB's native capped collection optimizations
-- 8. SQL-style operations for complex streaming data workflows
-- 9. Built-in performance monitoring and alerting capabilities
-- 10. Production-ready capped collections with enterprise features

Best Practices for Capped Collections Implementation

Performance Optimization and Design Strategy

Essential principles for effective MongoDB capped collections deployment:

  1. Size Planning: Calculate optimal collection sizes based on throughput, retention requirements, and query patterns (a sizing sketch follows this list)
  2. Write Optimization: Design write patterns that leverage capped collections' sequential write performance advantages
  3. Query Strategy: Utilize natural insertion order and time-based queries for optimal read performance
  4. Index Design: Implement minimal, strategic indexing that complements capped collection characteristics
  5. Monitoring Strategy: Track utilization, rotation frequency, and performance metrics for capacity planning
  6. Integration Patterns: Design applications that benefit from guaranteed insertion order and automatic data lifecycle
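
A minimal sizing sketch for the first principle above, using illustrative throughput and retention figures rather than measured values, shows how those numbers might translate into createCollection options:

// Capped-collection sizing sketch (workload figures are illustrative assumptions)
const { MongoClient } = require('mongodb');

async function createSizedCappedCollection() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('logging');

  // Assumed workload characteristics -- replace with measurements from your system
  const avgDocumentBytes = 2048;   // average log entry size
  const insertsPerHour = 15000;    // sustained peak write rate
  const retentionHours = 6;        // how much history must remain queryable

  // Roughly 20% headroom so rotation happens after, not before, the target window
  const sizeBytes = Math.ceil(avgDocumentBytes * insertsPerHour * retentionHours * 1.2);

  await db.createCollection('application_logs', {
    capped: true,                          // fixed-size, insertion-ordered collection
    size: sizeBytes,                       // hard byte limit; oldest documents are evicted first
    max: insertsPerHour * retentionHours   // optional document-count cap
  });

  await client.close();
}

createSizedCappedCollection().catch(console.error);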

Production Deployment and Operational Excellence

Optimize capped collections for enterprise-scale requirements:

  1. Capacity Management: Implement automated monitoring and alerting for collection utilization and performance
  2. Write Distribution: Design shard keys and distribution strategies for balanced writes across replica sets
  3. Real-Time Processing: Leverage tailable cursors and change streams for efficient real-time data processing (a cursor sketch follows this list)
  4. Backup Strategy: Account for capped collection characteristics in backup and disaster recovery planning
  5. Performance Monitoring: Track write throughput, query performance, and resource utilization continuously
  6. Operational Integration: Integrate capped collections with existing logging, monitoring, and alerting infrastructure
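
As a sketch of the real-time processing pattern referenced above, a tailable cursor on a capped collection streams newly inserted documents without polling; the database and collection names here are assumptions:

// Tailable-cursor sketch for consuming a capped collection in insertion order
const { MongoClient } = require('mongodb');

async function tailEventStream() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const eventStream = client.db('streaming').collection('event_stream');

  // tailable + awaitData keeps the cursor open and waits briefly for new inserts;
  // in production you would re-open the cursor if it closes (for example, when
  // the collection is still empty when the query first runs)
  const cursor = eventStream.find(
    {},
    { tailable: true, awaitData: true, maxAwaitTimeMS: 1000 }
  );

  for await (const event of cursor) {
    // Documents arrive in natural insertion order, which capped collections preserve
    console.log('new event:', event._id, event.eventType);
  }
}

tailEventStream().catch(console.error);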

Conclusion

MongoDB capped collections provide native high-performance data structures that eliminate the complexity of traditional logging and streaming solutions through fixed-size storage, guaranteed insertion order, and automatic data lifecycle management. The combination of predictable performance characteristics with real-time processing capabilities makes capped collections ideal for modern streaming data applications.

Key MongoDB Capped Collections benefits include:

  • High-Performance Writes: Sequential write optimization with minimal index maintenance overhead
  • Predictable Storage: Fixed-size collections with automatic old document removal and no storage bloat
  • Insertion Order Guarantee: Natural document ordering ideal for event sequencing and temporal data analysis
  • Real-Time Processing: Tailable cursors and change streams for efficient streaming data consumption
  • Resource Efficiency: Predictable memory usage and optimal performance characteristics for high-throughput scenarios
  • SQL Accessibility: Familiar SQL-style capped collection operations through QueryLeaf for accessible streaming data management

Whether you're implementing application logging, event streaming, performance monitoring, or real-time analytics, MongoDB capped collections with QueryLeaf's familiar SQL interface provide the foundation for efficient, predictable, and scalable streaming data solutions.

QueryLeaf Integration: QueryLeaf seamlessly manages MongoDB capped collections while providing SQL-familiar syntax for high-performance logging, real-time streaming, and circular buffer operations. Advanced capped collection patterns including capacity planning, real-time processing, and performance optimization are elegantly handled through familiar SQL constructs, making sophisticated streaming data management both powerful and accessible to SQL-oriented development teams.

The combination of MongoDB's robust capped collection capabilities with SQL-style streaming operations makes it an ideal platform for applications requiring both high-throughput data capture and familiar database interaction patterns, ensuring your streaming data infrastructure can scale efficiently while maintaining predictable performance and operational simplicity.

MongoDB TTL Collections: Automatic Data Lifecycle Management and Expiration for Efficient Storage

Modern applications generate vast amounts of transient data that needs careful lifecycle management to maintain performance and control storage costs. Traditional approaches to data cleanup involve complex batch jobs, scheduled maintenance scripts, and manual processes that are error-prone and resource-intensive.

MongoDB TTL (Time To Live) collections provide native automatic data expiration capabilities that eliminate the complexity of manual data lifecycle management. Unlike traditional database systems that require custom deletion processes or external job schedulers, MongoDB TTL indexes automatically remove expired documents, ensuring optimal storage utilization and performance without operational overhead.
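
In its simplest form the mechanism is a single index option; the sketch below (database, collection, and field names are assumptions) creates a TTL index so documents are removed automatically once their timestamp is older than the configured threshold. MongoDB's background TTL monitor runs roughly once a minute, so expiration is timely but not instantaneous.

// Minimal TTL sketch: sessions disappear about an hour after createdAt
const { MongoClient } = require('mongodb');

async function enableSessionTTL() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const sessions = client.db('app').collection('user_sessions');

  // Any document whose createdAt is older than 3600 seconds becomes eligible for deletion
  await sessions.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 });

  await sessions.insertOne({ userId: 'u1', createdAt: new Date() });

  await client.close();
}

enableSessionTTL().catch(console.error);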

The Traditional Data Lifecycle Challenge

Conventional approaches to managing data expiration and cleanup involve significant complexity and operational burden:

-- Traditional PostgreSQL data cleanup approach - complex and resource-intensive

-- Session cleanup with manual batch processing
CREATE TABLE user_sessions (
    session_id UUID PRIMARY KEY,
    user_id BIGINT NOT NULL,
    session_data JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMP NOT NULL,
    last_accessed TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    is_active BOOLEAN DEFAULT true
);

-- Scheduled cleanup job (requires external cron/scheduler)
-- This query must run regularly and can be resource-intensive
DELETE FROM user_sessions 
WHERE expires_at < CURRENT_TIMESTAMP 
   OR (last_accessed < CURRENT_TIMESTAMP - INTERVAL '30 days' AND is_active = false);

-- Complex log cleanup with multiple conditions
CREATE TABLE application_logs (
    log_id BIGSERIAL PRIMARY KEY,
    application_name VARCHAR(100) NOT NULL,
    log_level VARCHAR(20) NOT NULL,
    message TEXT,
    metadata JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Manual retention policy implementation
    retention_days INTEGER DEFAULT 30,
    should_archive BOOLEAN DEFAULT false
);

-- Multi-stage cleanup process
WITH logs_to_cleanup AS (
    SELECT log_id, application_name, created_at, retention_days
    FROM application_logs
    WHERE 
        -- Different retention periods by log level
        (log_level = 'DEBUG' AND created_at < CURRENT_TIMESTAMP - INTERVAL '7 days')
        OR (log_level = 'INFO' AND created_at < CURRENT_TIMESTAMP - INTERVAL '30 days')
        OR (log_level = 'WARN' AND created_at < CURRENT_TIMESTAMP - INTERVAL '90 days')
        OR (log_level = 'ERROR' AND created_at < CURRENT_TIMESTAMP - INTERVAL '365 days')
        OR (should_archive = false AND created_at < CURRENT_TIMESTAMP - retention_days * INTERVAL '1 day')
),
archival_candidates AS (
    -- Identify logs that should be archived before deletion
    SELECT ltc.log_id, ltc.application_name, ltc.created_at
    FROM logs_to_cleanup ltc
    JOIN application_logs al ON ltc.log_id = al.log_id
    WHERE al.log_level IN ('ERROR', 'CRITICAL') 
       OR al.metadata ? 'trace_id' -- Contains important debugging info
),
archive_process AS (
    -- Archive important logs (complex external process)
    INSERT INTO archived_application_logs 
    SELECT al.* FROM application_logs al
    JOIN archival_candidates ac ON al.log_id = ac.log_id
    RETURNING log_id
)
-- Finally delete the logs
DELETE FROM application_logs
WHERE log_id IN (
    SELECT log_id FROM logs_to_cleanup
    WHERE log_id NOT IN (SELECT log_id FROM archival_candidates)
       OR log_id IN (SELECT log_id FROM archive_process)
);

-- Traditional approach problems:
-- 1. Complex scheduling and orchestration required
-- 2. Resource-intensive batch operations during cleanup
-- 3. Risk of data loss if cleanup jobs fail
-- 4. Manual management of different retention policies
-- 5. No automatic optimization of storage and indexes
-- 6. Difficulty in handling timezone and date calculations
-- 7. Complex error handling and retry logic required
-- 8. Performance impact during large cleanup operations
-- 9. Manual coordination between cleanup and application logic
-- 10. Inconsistent cleanup behavior across different environments

-- Attempting MySQL-style events (limited functionality)
SET GLOBAL event_scheduler = ON;

CREATE EVENT cleanup_expired_sessions
ON SCHEDULE EVERY 1 HOUR
STARTS CURRENT_TIMESTAMP
DO
BEGIN
    DELETE FROM user_sessions 
    WHERE expires_at < NOW() 
    LIMIT 1000; -- Prevent long-running operations
END;

-- MySQL event limitations:
-- - Basic scheduling only
-- - No complex retention logic
-- - Limited error handling
-- - Manual management of batch sizes
-- - No integration with application lifecycle
-- - Poor visibility into cleanup operations

MongoDB TTL collections provide elegant automatic data expiration:

// MongoDB TTL Collections - automatic data lifecycle management
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('data_lifecycle_management');

// Comprehensive MongoDB TTL Data Lifecycle Manager
class MongoDBTTLManager {
  constructor(db, config = {}) {
    this.db = db;
    this.config = {
      defaultTTL: config.defaultTTL || 3600, // 1 hour default
      enableMetrics: config.enableMetrics !== false,
      enableIndexOptimization: config.enableIndexOptimization !== false,
      cleanupLogLevel: config.cleanupLogLevel || 'info',
      ...config
    };

    this.collections = {
      userSessions: db.collection('user_sessions'),
      applicationLogs: db.collection('application_logs'),
      temporaryData: db.collection('temporary_data'),
      eventStream: db.collection('event_stream'),
      apiRequests: db.collection('api_requests'),
      cacheEntries: db.collection('cache_entries'),
      ttlMetrics: db.collection('ttl_metrics')
    };

    this.ttlIndexes = new Map();
    this.expirationStrategies = new Map();
  }

  async initializeTTLCollections() {
    console.log('Initializing TTL collections and indexes...');

    try {
      // User sessions with 24-hour expiration
      await this.setupSessionTTL();

      // Application logs with variable retention based on log level
      await this.setupLogsTTL();

      // Temporary data with flexible expiration
      await this.setupTemporaryDataTTL();

      // Event stream with time-based partitioning
      await this.setupEventStreamTTL();

      // API request tracking with automatic cleanup
      await this.setupAPIRequestsTTL();

      // Cache entries with intelligent expiration
      await this.setupCacheTTL();

      // Metrics collection for monitoring TTL performance
      await this.setupTTLMetrics();

      console.log('All TTL collections initialized successfully');

    } catch (error) {
      console.error('Error initializing TTL collections:', error);
      throw error;
    }
  }

  async setupSessionTTL() {
    console.log('Setting up user session TTL...');

    const sessionCollection = this.collections.userSessions;

    // Create TTL index for automatic session expiration
    await sessionCollection.createIndex(
      { expiresAt: 1 },
      { 
        expireAfterSeconds: 0, // Expire based on document field value
        background: true,
        name: 'session_ttl_index'
      }
    );

    // Secondary TTL index for inactive sessions
    await sessionCollection.createIndex(
      { lastAccessedAt: 1 },
      { 
        expireAfterSeconds: 7 * 24 * 3600, // 7 days for inactive sessions
        background: true,
        name: 'session_inactivity_ttl_index'
      }
    );

    // Compound index for efficient session queries
    await sessionCollection.createIndex(
      { userId: 1, isActive: 1, expiresAt: 1 },
      { background: true }
    );

    this.ttlIndexes.set('userSessions', [
      { field: 'expiresAt', expireAfterSeconds: 0 },
      { field: 'lastAccessedAt', expireAfterSeconds: 7 * 24 * 3600 }
    ]);

    console.log('User session TTL configured');
  }

  async createUserSession(userId, sessionData, customTTL = null) {
    const expirationTime = new Date(Date.now() + ((customTTL || 24 * 3600) * 1000));

    const sessionDocument = {
      sessionId: new ObjectId(),
      userId: userId,
      sessionData: sessionData,
      createdAt: new Date(),
      expiresAt: expirationTime, // TTL field for automatic expiration
      lastAccessedAt: new Date(),
      isActive: true,

      // Session metadata
      userAgent: sessionData.userAgent,
      ipAddress: sessionData.ipAddress,
      deviceType: sessionData.deviceType,

      // Expiration strategy metadata
      ttlStrategy: 'fixed_expiration',
      customTTL: customTTL,
      renewalCount: 0
    };

    const result = await this.collections.userSessions.insertOne(sessionDocument);

    console.log(`Created session ${result.insertedId} for user ${userId}, expires at ${expirationTime}`);
    return result.insertedId;
  }

  async renewUserSession(sessionId, additionalTTL = 3600) {
    const newExpirationTime = new Date(Date.now() + (additionalTTL * 1000));

    const result = await this.collections.userSessions.updateOne(
      { sessionId: new ObjectId(sessionId), isActive: true },
      {
        $set: {
          expiresAt: newExpirationTime,
          lastAccessedAt: new Date()
        },
        $inc: { renewalCount: 1 }
      }
    );

    if (result.modifiedCount > 0) {
      console.log(`Renewed session ${sessionId} until ${newExpirationTime}`);
    }

    return result.modifiedCount > 0;
  }

  async setupLogsTTL() {
    console.log('Setting up application logs TTL with level-based retention...');

    const logsCollection = this.collections.applicationLogs;

    // Create partial TTL indexes for different log levels
    // Debug logs expire quickly
    await logsCollection.createIndex(
      { createdAt: 1 },
      {
        expireAfterSeconds: 7 * 24 * 3600, // 7 days
        partialFilterExpression: { logLevel: 'DEBUG' },
        background: true,
        name: 'debug_logs_ttl'
      }
    );

    // Info logs have moderate retention
    await logsCollection.createIndex(
      { createdAt: 1 },
      {
        expireAfterSeconds: 30 * 24 * 3600, // 30 days
        partialFilterExpression: { logLevel: 'INFO' },
        background: true,
        name: 'info_logs_ttl'
      }
    );

    // Warning logs kept longer
    await logsCollection.createIndex(
      { createdAt: 1 },
      {
        expireAfterSeconds: 90 * 24 * 3600, // 90 days
        partialFilterExpression: { logLevel: 'WARN' },
        background: true,
        name: 'warn_logs_ttl'
      }
    );

    // Error logs kept for a full year
    await logsCollection.createIndex(
      { createdAt: 1 },
      {
        expireAfterSeconds: 365 * 24 * 3600, // 365 days
        partialFilterExpression: { logLevel: { $in: ['ERROR', 'CRITICAL'] } },
        background: true,
        name: 'error_logs_ttl'
      }
    );

    // Compound index for efficient log queries
    await logsCollection.createIndex(
      { applicationName: 1, logLevel: 1, createdAt: -1 },
      { background: true }
    );

    this.expirationStrategies.set('applicationLogs', {
      DEBUG: 7 * 24 * 3600,
      INFO: 30 * 24 * 3600,
      WARN: 90 * 24 * 3600,
      ERROR: 365 * 24 * 3600,
      CRITICAL: 365 * 24 * 3600
    });

    console.log('Application logs TTL configured with level-based retention');
  }

  async createLogEntry(applicationName, logLevel, message, metadata = {}) {
    const logDocument = {
      logId: new ObjectId(),
      applicationName: applicationName,
      logLevel: logLevel.toUpperCase(),
      message: message,
      metadata: metadata,
      createdAt: new Date(), // TTL field used by level-specific indexes

      // Additional context
      hostname: metadata.hostname || 'unknown',
      processId: metadata.processId,
      threadId: metadata.threadId,
      traceId: metadata.traceId,

      // Automatic expiration via TTL indexes
      // No manual expiration field needed - handled by partial TTL indexes
    };

    const result = await this.collections.applicationLogs.insertOne(logDocument);

    // Log retention info based on level
    const retentionSeconds = this.expirationStrategies.get('applicationLogs')[logLevel.toUpperCase()];
    const expirationDate = new Date(Date.now() + (retentionSeconds * 1000));

    if (this.config.cleanupLogLevel === 'debug') {
      console.log(`Created ${logLevel} log entry ${result.insertedId}, will expire around ${expirationDate}`);
    }

    return result.insertedId;
  }

  async setupTemporaryDataTTL() {
    console.log('Setting up temporary data TTL with flexible expiration...');

    const tempCollection = this.collections.temporaryData;

    // Primary TTL index using document field
    await tempCollection.createIndex(
      { expiresAt: 1 },
      {
        expireAfterSeconds: 0, // Use document field value
        background: true,
        name: 'temp_data_ttl'
      }
    );

    // Backup TTL index with default expiration
    await tempCollection.createIndex(
      { createdAt: 1 },
      {
        expireAfterSeconds: 24 * 3600, // 24 hours default
        partialFilterExpression: { expiresAt: { $exists: false } },
        background: true,
        name: 'temp_data_default_ttl'
      }
    );

    // Index for data type queries
    await tempCollection.createIndex(
      { dataType: 1, createdAt: -1 },
      { background: true }
    );

    console.log('Temporary data TTL configured');
  }

  async storeTemporaryData(dataType, data, ttlSeconds = 3600) {
    const expirationTime = new Date(Date.now() + (ttlSeconds * 1000));

    const tempDocument = {
      tempId: new ObjectId(),
      dataType: dataType,
      data: data,
      createdAt: new Date(),
      expiresAt: expirationTime, // TTL field

      // Metadata
      sizeBytes: JSON.stringify(data).length,
      compressionType: data.compressionType || 'none',
      accessCount: 0,

      // TTL configuration
      ttlSeconds: ttlSeconds,
      autoExpire: true
    };

    const result = await this.collections.temporaryData.insertOne(tempDocument);

    console.log(`Stored temporary ${dataType} data ${result.insertedId}, expires at ${expirationTime}`);
    return result.insertedId;
  }

  async setupEventStreamTTL() {
    console.log('Setting up event stream TTL with sliding window retention...');

    const eventCollection = this.collections.eventStream;

    // TTL index for event stream with 30-day retention
    await eventCollection.createIndex(
      { timestamp: 1 },
      {
        expireAfterSeconds: 30 * 24 * 3600, // 30 days
        background: true,
        name: 'event_stream_ttl'
      }
    );

    // Compound index for event queries
    await eventCollection.createIndex(
      { eventType: 1, timestamp: -1 },
      { background: true }
    );

    // Index for user-specific events
    await eventCollection.createIndex(
      { userId: 1, timestamp: -1 },
      { background: true }
    );

    console.log('Event stream TTL configured');
  }

  async createEvent(eventType, userId, eventData) {
    const eventDocument = {
      eventId: new ObjectId(),
      eventType: eventType,
      userId: userId,
      eventData: eventData,
      timestamp: new Date(), // TTL field

      // Event metadata
      source: eventData.source || 'application',
      sessionId: eventData.sessionId,
      correlationId: eventData.correlationId,

      // Automatic expiration after 30 days via TTL index
    };

    const result = await this.collections.eventStream.insertOne(eventDocument);
    return result.insertedId;
  }

  async setupAPIRequestsTTL() {
    console.log('Setting up API requests TTL for monitoring and analytics...');

    const apiCollection = this.collections.apiRequests;

    // TTL index with 7-day retention for API requests
    await apiCollection.createIndex(
      { requestTime: 1 },
      {
        expireAfterSeconds: 7 * 24 * 3600, // 7 days
        background: true,
        name: 'api_requests_ttl'
      }
    );

    // Compound indexes for API analytics
    await apiCollection.createIndex(
      { endpoint: 1, requestTime: -1 },
      { background: true }
    );

    await apiCollection.createIndex(
      { statusCode: 1, requestTime: -1 },
      { background: true }
    );

    console.log('API requests TTL configured');
  }

  async logAPIRequest(endpoint, method, statusCode, responseTime, metadata = {}) {
    const requestDocument = {
      requestId: new ObjectId(),
      endpoint: endpoint,
      method: method.toUpperCase(),
      statusCode: statusCode,
      responseTime: responseTime,
      requestTime: new Date(), // TTL field

      // Request details
      userAgent: metadata.userAgent,
      ipAddress: metadata.ipAddress,
      userId: metadata.userId,
      sessionId: metadata.sessionId,

      // Performance metrics
      requestSize: metadata.requestSize || 0,
      responseSize: metadata.responseSize || 0,

      // Automatic expiration after 7 days
    };

    const result = await this.collections.apiRequests.insertOne(requestDocument);
    return result.insertedId;
  }

  async setupCacheTTL() {
    console.log('Setting up cache entries TTL with intelligent expiration...');

    const cacheCollection = this.collections.cacheEntries;

    // Primary TTL index using document field for custom expiration
    await cacheCollection.createIndex(
      { expiresAt: 1 },
      {
        expireAfterSeconds: 0, // Use document field
        background: true,
        name: 'cache_ttl'
      }
    );

    // Backup TTL for entries without explicit expiration
    await cacheCollection.createIndex(
      { lastAccessedAt: 1 },
      {
        expireAfterSeconds: 3600, // 1 hour default
        background: true,
        name: 'cache_access_ttl'
      }
    );

    // Index for cache key lookups
    await cacheCollection.createIndex(
      { cacheKey: 1 },
      { unique: true, background: true }
    );

    console.log('Cache TTL configured');
  }

  async setCacheEntry(cacheKey, value, ttlSeconds = 300) {
    const expirationTime = new Date(Date.now() + (ttlSeconds * 1000));

    const cacheDocument = {
      cacheKey: cacheKey,
      value: value,
      createdAt: new Date(),
      lastAccessedAt: new Date(),
      expiresAt: expirationTime, // TTL field

      // Cache metadata
      accessCount: 0,
      ttlSeconds: ttlSeconds,
      valueType: typeof value,
      sizeBytes: JSON.stringify(value).length,

      // Hit ratio tracking
      hitCount: 0,
      missCount: 0
    };

    // createdAt is applied only on first insert via $setOnInsert; including it
    // in $set as well would cause an update conflict on upsert
    delete cacheDocument.createdAt;

    const result = await this.collections.cacheEntries.updateOne(
      { cacheKey: cacheKey },
      {
        $set: cacheDocument,
        $setOnInsert: { createdAt: new Date() }
      },
      { upsert: true }
    );

    return result.upsertedId || result.modifiedCount > 0;
  }

  async getCacheEntry(cacheKey) {
    const result = await this.collections.cacheEntries.findOneAndUpdate(
      { cacheKey: cacheKey },
      {
        $set: { lastAccessedAt: new Date() },
        $inc: { accessCount: 1, hitCount: 1 }
      },
      { returnDocument: 'after' }
    );

    return result.value?.value || null;
  }

  async setupTTLMetrics() {
    console.log('Setting up TTL metrics collection...');

    const metricsCollection = this.collections.ttlMetrics;

    // TTL index for metrics with 90-day retention
    await metricsCollection.createIndex(
      { timestamp: 1 },
      {
        expireAfterSeconds: 90 * 24 * 3600, // 90 days
        background: true,
        name: 'metrics_ttl'
      }
    );

    // Index for metrics queries
    await metricsCollection.createIndex(
      { collectionName: 1, timestamp: -1 },
      { background: true }
    );

    console.log('TTL metrics collection configured');
  }

  async collectTTLMetrics() {
    console.log('Collecting TTL performance metrics...');

    try {
      const metrics = {
        timestamp: new Date(),
        collections: {}
      };

      // Collect metrics for each TTL collection
      for (const [collectionName, collection] of Object.entries(this.collections)) {
        if (collectionName === 'ttlMetrics') continue;

        const collectionStats = await collection.stats();
        const indexStats = await this.getTTLIndexStats(collection);

        metrics.collections[collectionName] = {
          documentCount: collectionStats.count,
          storageSize: collectionStats.storageSize,
          avgObjSize: collectionStats.avgObjSize,
          totalIndexSize: collectionStats.totalIndexSize,
          ttlIndexes: indexStats,

          // Calculate expiration rates
          estimatedExpirationRate: await this.estimateExpirationRate(collection)
        };
      }

      // Store metrics
      await this.collections.ttlMetrics.insertOne(metrics);

      if (this.config.enableMetrics) {
        console.log('TTL Metrics:', {
          totalCollections: Object.keys(metrics.collections).length,
          totalDocuments: Object.values(metrics.collections).reduce((sum, c) => sum + c.documentCount, 0),
          totalStorageSize: Object.values(metrics.collections).reduce((sum, c) => sum + c.storageSize, 0)
        });
      }

      return metrics;

    } catch (error) {
      console.error('Error collecting TTL metrics:', error);
      throw error;
    }
  }

  async getTTLIndexStats(collection) {
    const indexes = await collection.listIndexes().toArray();
    const ttlIndexes = indexes.filter(index => index.expireAfterSeconds !== undefined);

    return ttlIndexes.map(index => ({
      name: index.name,
      key: index.key,
      expireAfterSeconds: index.expireAfterSeconds,
      partialFilterExpression: index.partialFilterExpression
    }));
  }

  async estimateExpirationRate(collection) {
    // Simple estimation based on documents created vs documents existing
    const now = new Date();
    const oneDayAgo = new Date(now.getTime() - (24 * 60 * 60 * 1000));

    const recentDocuments = await collection.countDocuments({
      createdAt: { $gte: oneDayAgo }
    });

    const totalDocuments = await collection.countDocuments();

    return totalDocuments > 0 ? (recentDocuments / totalDocuments) : 0;
  }

  async optimizeTTLIndexes() {
    console.log('Optimizing TTL indexes for better performance...');

    try {
      for (const [collectionName, collection] of Object.entries(this.collections)) {
        if (collectionName === 'ttlMetrics') continue;

        // Analyze index usage
        const indexStats = await collection.aggregate([
          { $indexStats: {} }
        ]).toArray();

        // Identify underutilized TTL indexes
        for (const indexStat of indexStats) {
          if (indexStat.key && indexStat.key.expiresAt) {
            const usage = indexStat.accesses;
            console.log(`TTL index ${indexStat.name} usage:`, usage);

            // Suggest optimizations based on usage patterns
            if (usage.ops < 100 && usage.since) {
              console.log(`Consider reviewing TTL index ${indexStat.name} - low usage detected`);
            }
          }
        }
      }

    } catch (error) {
      console.error('Error optimizing TTL indexes:', error);
    }
  }

  async getTTLStatus() {
    const status = {
      collectionsWithTTL: 0,
      totalTTLIndexes: 0,
      activeExpirations: {},
      systemHealth: 'healthy'
    };

    for (const [collectionName, collection] of Object.entries(this.collections)) {
      if (collectionName === 'ttlMetrics') continue;

      const indexes = await collection.listIndexes().toArray();
      const ttlIndexes = indexes.filter(index => index.expireAfterSeconds !== undefined);

      if (ttlIndexes.length > 0) {
        status.collectionsWithTTL++;
        status.totalTTLIndexes += ttlIndexes.length;

        // Estimate documents that will expire soon
        const soonToExpire = await this.estimateSoonToExpire(collection, ttlIndexes);
        status.activeExpirations[collectionName] = soonToExpire;
      }
    }

    return status;
  }

  async estimateSoonToExpire(collection, ttlIndexes) {
    let totalSoonToExpire = 0;

    for (const index of ttlIndexes) {
      if (index.expireAfterSeconds === 0) {
        // Documents expire based on field value
        const fieldName = Object.keys(index.key)[0];
        const nextHour = new Date(Date.now() + (60 * 60 * 1000));

        const count = await collection.countDocuments({
          [fieldName]: { $lt: nextHour }
        });

        totalSoonToExpire += count;
      } else {
        // Documents expire based on index TTL
        const fieldName = Object.keys(index.key)[0];
        const cutoffTime = new Date(Date.now() - (index.expireAfterSeconds * 1000) + (60 * 60 * 1000));

        const count = await collection.countDocuments({
          [fieldName]: { $lt: cutoffTime }
        });

        totalSoonToExpire += count;
      }
    }

    return totalSoonToExpire;
  }

  async shutdown() {
    console.log('Shutting down TTL Manager...');

    // Final metrics collection
    if (this.config.enableMetrics) {
      await this.collectTTLMetrics();
    }

    // Display final status
    const status = await this.getTTLStatus();
    console.log('Final TTL Status:', status);

    console.log('TTL Manager shutdown complete');
  }
}

// Benefits of MongoDB TTL Collections:
// - Automatic data expiration without manual intervention
// - Multiple TTL strategies (fixed time, document field, partial indexes)
// - Built-in optimization and storage reclamation
// - Integration with MongoDB's index and query optimization
// - Flexible retention policies based on data characteristics
// - No external job scheduling required
// - Consistent behavior across replica sets and sharded clusters
// - Real-time metrics and monitoring capabilities
// - SQL-compatible TTL operations through QueryLeaf integration

module.exports = {
  MongoDBTTLManager
};
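
For completeness, a brief usage sketch for the manager above; the connection string and module path are assumptions:

// Usage sketch for MongoDBTTLManager (module path is hypothetical)
const { MongoClient } = require('mongodb');
const { MongoDBTTLManager } = require('./mongodb-ttl-manager');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const ttlManager = new MongoDBTTLManager(client.db('data_lifecycle_management'), {
    enableMetrics: true
  });

  await ttlManager.initializeTTLCollections();

  // A session that expires in two hours and a log entry with level-based retention
  await ttlManager.createUserSession('user123', { deviceType: 'web' }, 2 * 3600);
  await ttlManager.createLogEntry('web-server', 'INFO', 'Request processed', { hostname: 'web-01' });

  await ttlManager.collectTTLMetrics();
  await ttlManager.shutdown();
  await client.close();
}

main().catch(console.error);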

Understanding MongoDB TTL Architecture

Advanced TTL Patterns and Configuration Strategies

Implement sophisticated TTL patterns for different data lifecycle requirements:

// Advanced TTL patterns for production MongoDB deployments
class AdvancedTTLStrategies extends MongoDBTTLManager {
  constructor(db, advancedConfig) {
    super(db, advancedConfig);

    this.advancedConfig = {
      ...advancedConfig,
      enableTimezoneSupport: true,
      enableConditionalExpiration: true,
      enableGradualExpiration: true,
      enableExpirationNotifications: true,
      enableComplianceMode: true
    };
  }

  async setupConditionalTTL() {
    // TTL that expires documents based on multiple conditions
    console.log('Setting up conditional TTL with complex business logic...');

    const conditionalTTLCollection = this.db.collection('conditional_expiration');

    // Different TTL for different user tiers
    await conditionalTTLCollection.createIndex(
      { lastActivityAt: 1 },
      {
        expireAfterSeconds: 30 * 24 * 3600, // 30 days for free tier
        partialFilterExpression: { 
          userTier: 'free',
          isPremium: false 
        },
        background: true,
        name: 'free_user_data_ttl'
      }
    );

    await conditionalTTLCollection.createIndex(
      { lastActivityAt: 1 },
      {
        expireAfterSeconds: 365 * 24 * 3600, // 1 year for premium users
        partialFilterExpression: { 
          userTier: 'premium',
          isPremium: true 
        },
        background: true,
        name: 'premium_user_data_ttl'
      }
    );

    // Business-critical data is retained for seven years before compliance-driven expiration
    await conditionalTTLCollection.createIndex(
      { reviewDate: 1 },
      {
        expireAfterSeconds: 7 * 365 * 24 * 3600, // 7 years for compliance
        partialFilterExpression: { 
          dataClassification: 'business_critical',
          complianceRetentionRequired: true
        },
        background: true,
        name: 'compliance_data_ttl'
      }
    );
  }

  async setupGradualExpiration() {
    // Implement gradual expiration to reduce system load
    console.log('Setting up gradual expiration strategy...');

    const gradualCollection = this.db.collection('gradual_expiration');

    // Stagger expiration across time buckets
    const timeBuckets = [
      { hour: 2, expireSeconds: 7 * 24 * 3600 },   // 2 AM
      { hour: 14, expireSeconds: 14 * 24 * 3600 }, // 2 PM
      { hour: 20, expireSeconds: 21 * 24 * 3600 }  // 8 PM
    ];

    for (const bucket of timeBuckets) {
      await gradualCollection.createIndex(
        { createdAt: 1 },
        {
          expireAfterSeconds: bucket.expireSeconds,
          partialFilterExpression: {
            expirationBucket: bucket.hour
          },
          background: true,
          name: `gradual_ttl_${bucket.hour}h`
        }
      );
    }
  }

  async createDocumentWithGradualExpiration(data) {
    // Assign an expiration bucket: use the document's hash when available,
    // otherwise pick a random bucket to spread deletions across the day
    const buckets = [2, 14, 20];
    const bucketIndex = Number.isInteger(data.hashCode)
      ? Math.abs(data.hashCode) % buckets.length
      : Math.floor(Math.random() * buckets.length);
    const selectedBucket = buckets[bucketIndex];

    const document = {
      ...data,
      createdAt: new Date(),
      expirationBucket: selectedBucket,

      // Add jitter to prevent thundering herd
      expirationJitter: Math.floor(Math.random() * 3600) // 0-1 hour jitter
    };

    return await this.db.collection('gradual_expiration').insertOne(document);
  }

  async setupTimezoneTTL() {
    // TTL that respects business hours and timezones
    console.log('Setting up timezone-aware TTL...');

    const timezoneCollection = this.db.collection('timezone_expiration');

    // Create TTL based on business date rather than UTC
    await timezoneCollection.createIndex(
      { businessDateExpiry: 1 },
      {
        expireAfterSeconds: 0, // Use document field
        background: true,
        name: 'business_timezone_ttl'
      }
    );
  }

  async createBusinessHoursTTLDocument(data, businessTimezone = 'America/New_York', retentionDays = 30) {
    const moment = require('moment-timezone');

    // Calculate expiration at end of business day in specified timezone
    const businessExpiry = moment()
      .tz(businessTimezone)
      .add(retentionDays, 'days')
      .endOf('day') // Expire at end of business day
      .toDate();

    const document = {
      ...data,
      createdAt: new Date(),
      businessDateExpiry: businessExpiry,
      timezone: businessTimezone,
      retentionPolicy: 'business_hours_aligned'
    };

    return await this.db.collection('timezone_expiration').insertOne(document);
  }

  async setupComplianceTTL() {
    // TTL with compliance and audit requirements
    console.log('Setting up compliance-aware TTL...');

    const complianceCollection = this.db.collection('compliance_data');

    // Legal hold prevents automatic expiration
    await complianceCollection.createIndex(
      { scheduledDestructionDate: 1 },
      {
        expireAfterSeconds: 0,
        partialFilterExpression: {
          legalHold: false,
          complianceStatus: 'approved_for_destruction'
        },
        background: true,
        name: 'compliance_ttl'
      }
    );

    // Audit trail for expired documents
    await complianceCollection.createIndex(
      { auditExpirationDate: 1 },
      {
        expireAfterSeconds: 10 * 365 * 24 * 3600, // 10 years for audit trail
        background: true,
        name: 'audit_trail_ttl'
      }
    );
  }

  async createComplianceDocument(data, retentionYears = 7) {
    const scheduledDestruction = new Date();
    scheduledDestruction.setFullYear(scheduledDestruction.getFullYear() + retentionYears);

    const document = {
      ...data,
      createdAt: new Date(),
      retentionPeriodYears: retentionYears,
      scheduledDestructionDate: scheduledDestruction,

      // Compliance metadata
      legalHold: false,
      complianceStatus: 'under_retention',
      dataClassification: data.dataClassification || 'standard',

      // Audit requirements
      auditExpirationDate: new Date(scheduledDestruction.getTime() + (3 * 365 * 24 * 60 * 60 * 1000)) // +3 years
    };

    return await this.db.collection('compliance_data').insertOne(document);
  }

  async implementExpirationNotifications() {
    // Set up change streams to monitor expiring documents
    console.log('Setting up expiration notifications...');

    const expirationNotifier = this.db.collection('expiration_notifications');

    // Monitor documents that will expire soon
    setInterval(async () => {
      await this.checkUpcomingExpirations();
    }, 60 * 60 * 1000); // Check every hour
  }

  async checkUpcomingExpirations() {
    const collections = [
      'user_sessions', 
      'application_logs', 
      'temporary_data',
      'compliance_data'
    ];

    for (const collectionName of collections) {
      const collection = this.db.collection(collectionName);

      // Find documents expiring in the next 24 hours
      const tomorrow = new Date(Date.now() + (24 * 60 * 60 * 1000));

      const soonToExpire = await collection.find({
        $or: [
          { expiresAt: { $lt: tomorrow, $gte: new Date() } },
          { businessDateExpiry: { $lt: tomorrow, $gte: new Date() } },
          { scheduledDestructionDate: { $lt: tomorrow, $gte: new Date() } }
        ]
      }).toArray();

      if (soonToExpire.length > 0) {
        console.log(`${collectionName}: ${soonToExpire.length} documents expiring within 24 hours`);

        // Send notifications or trigger workflows
        await this.sendExpirationNotifications(collectionName, soonToExpire);
      }
    }
  }

  async sendExpirationNotifications(collectionName, documents) {
    // Implementation would integrate with notification systems
    const notification = {
      timestamp: new Date(),
      collection: collectionName,
      documentsCount: documents.length,
      urgency: 'medium',
      action: 'documents_expiring_soon'
    };

    console.log('Expiration notification:', notification);

    // Store notification for processing
    await this.db.collection('expiration_notifications').insertOne(notification);
  }
}
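
The notification helper above polls on a timer; as an alternative sketch, a change stream can observe the delete events that TTL removals produce. This requires MongoDB running as a replica set, and the pipeline shown is an assumption rather than part of the class above:

// Change-stream sketch: react to documents as the TTL monitor deletes them.
// Delete events carry only the _id of the removed document unless pre-images
// are enabled on the collection (MongoDB 6.0+).
const { MongoClient } = require('mongodb');

async function watchTTLDeletes() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const sessions = client.db('data_lifecycle_management').collection('user_sessions');

  const changeStream = sessions.watch([
    { $match: { operationType: 'delete' } }
  ]);

  for await (const change of changeStream) {
    // TTL-driven removals and application-issued deletes both appear here
    console.log('document removed:', change.documentKey._id);
  }
}

watchTTLDeletes().catch(console.error);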

SQL-Style TTL Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB TTL operations:

-- QueryLeaf TTL operations with SQL-familiar syntax

-- Create TTL-enabled collections with automatic expiration
CREATE TABLE user_sessions (
  session_id UUID PRIMARY KEY,
  user_id VARCHAR(50) NOT NULL,
  session_data DOCUMENT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  expires_at TIMESTAMP NOT NULL,
  last_accessed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  is_active BOOLEAN DEFAULT true
)
WITH TTL (
  -- Multiple TTL strategies
  expires_at EXPIRE_AFTER 0,  -- Use document field value
  last_accessed_at EXPIRE_AFTER '7 days' -- Inactive session cleanup
);

-- Create application logs with level-based retention
CREATE TABLE application_logs (
  log_id UUID PRIMARY KEY,
  application_name VARCHAR(100) NOT NULL,
  log_level VARCHAR(20) NOT NULL,
  message TEXT,
  metadata DOCUMENT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
WITH TTL (
  -- Different retention by log level using partial indexes
  created_at EXPIRE_AFTER '7 days' WHERE log_level = 'DEBUG',
  created_at EXPIRE_AFTER '30 days' WHERE log_level = 'INFO',
  created_at EXPIRE_AFTER '90 days' WHERE log_level = 'WARN',
  created_at EXPIRE_AFTER '365 days' WHERE log_level IN ('ERROR', 'CRITICAL')
);

-- Temporary data with flexible TTL
CREATE TABLE temporary_data (
  temp_id UUID PRIMARY KEY,
  data_type VARCHAR(100),
  data DOCUMENT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  expires_at TIMESTAMP,
  ttl_seconds INTEGER DEFAULT 3600
)
WITH TTL (
  expires_at EXPIRE_AFTER 0,  -- Use document field
  created_at EXPIRE_AFTER '24 hours' WHERE expires_at IS NULL  -- Default fallback
);

-- Insert session with custom TTL
INSERT INTO user_sessions (user_id, session_data, expires_at, is_active)
VALUES 
  ('user123', '{"preferences": {"theme": "dark"}}', CURRENT_TIMESTAMP + INTERVAL '2 hours', true),
  ('user456', '{"preferences": {"lang": "en"}}', CURRENT_TIMESTAMP + INTERVAL '1 day', true);

-- Insert log entries (automatic TTL based on level)
INSERT INTO application_logs (application_name, log_level, message, metadata)
VALUES 
  ('web-server', 'DEBUG', 'Request processed', '{"endpoint": "/api/users", "duration": 45}'),
  ('web-server', 'ERROR', 'Database connection failed', '{"error": "timeout", "retry_count": 3}'),
  ('payment-service', 'INFO', 'Payment processed', '{"amount": 99.99, "currency": "USD"}');

-- Query active sessions with TTL information
SELECT 
  session_id,
  user_id,
  created_at,
  expires_at,

  -- Calculate remaining TTL
  EXTRACT(EPOCH FROM (expires_at - CURRENT_TIMESTAMP)) as seconds_until_expiry,

  -- Expiration status
  CASE 
    WHEN expires_at <= CURRENT_TIMESTAMP THEN 'expired'
    WHEN expires_at <= CURRENT_TIMESTAMP + INTERVAL '1 hour' THEN 'expiring_soon'
    ELSE 'active'
  END as expiration_status,

  -- Session age
  EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - created_at)) as session_age_seconds

FROM user_sessions
WHERE is_active = true
ORDER BY expires_at ASC;

-- Extend session TTL (renew expiration)
UPDATE user_sessions 
SET 
  expires_at = CURRENT_TIMESTAMP + INTERVAL '2 hours',
  last_accessed_at = CURRENT_TIMESTAMP
WHERE session_id = 'session-uuid-here'
  AND is_active = true
  AND expires_at > CURRENT_TIMESTAMP;

-- Store temporary data with custom expiration
INSERT INTO temporary_data (data_type, data, expires_at, ttl_seconds)
VALUES 
  ('cache_entry', '{"result": [1,2,3], "computed_at": "2025-11-01T10:00:00Z"}', CURRENT_TIMESTAMP + INTERVAL '5 minutes', 300),
  ('user_upload', '{"filename": "document.pdf", "size": 1024000}', CURRENT_TIMESTAMP + INTERVAL '24 hours', 86400),
  ('temp_report', '{"report_data": {...}, "generated_for": "user123"}', CURRENT_TIMESTAMP + INTERVAL '1 hour', 3600);

-- Advanced TTL queries with business logic
WITH session_analytics AS (
  SELECT 
    user_id,
    COUNT(*) as total_sessions,
    AVG(EXTRACT(EPOCH FROM (expires_at - created_at))) as avg_session_duration,
    MAX(last_accessed_at) as last_activity,

    -- TTL health metrics
    COUNT(*) FILTER (WHERE expires_at <= CURRENT_TIMESTAMP) as expired_sessions,
    COUNT(*) FILTER (WHERE expires_at <= CURRENT_TIMESTAMP + INTERVAL '1 hour') as soon_to_expire,
    COUNT(*) FILTER (WHERE last_accessed_at < CURRENT_TIMESTAMP - INTERVAL '1 day') as inactive_sessions

  FROM user_sessions
  WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '7 days'
  GROUP BY user_id
),
user_engagement AS (
  SELECT 
    sa.*,

    -- Engagement scoring
    CASE 
      WHEN avg_session_duration > 7200 AND inactive_sessions = 0 THEN 'highly_engaged'
      WHEN avg_session_duration > 1800 AND inactive_sessions < 2 THEN 'engaged'
      WHEN inactive_sessions > total_sessions * 0.5 THEN 'low_engagement'
      ELSE 'moderate_engagement'
    END as engagement_level,

    -- TTL optimization recommendations
    CASE 
      WHEN inactive_sessions > 5 THEN 'reduce_session_ttl'
      WHEN expired_sessions = 0 AND soon_to_expire = 0 THEN 'extend_session_ttl'
      ELSE 'current_ttl_optimal'
    END as ttl_recommendation

  FROM session_analytics sa
)
SELECT 
  user_id,
  total_sessions,
  ROUND(avg_session_duration / 60, 2) as avg_session_minutes,
  last_activity,
  engagement_level,
  ttl_recommendation,

  -- Session health indicators
  ROUND((total_sessions - expired_sessions)::numeric / total_sessions * 100, 1) as session_health_pct,

  -- TTL efficiency metrics
  expired_sessions,
  soon_to_expire,
  inactive_sessions

FROM user_engagement
WHERE total_sessions > 0
ORDER BY 
  CASE engagement_level 
    WHEN 'highly_engaged' THEN 1
    WHEN 'engaged' THEN 2
    WHEN 'moderate_engagement' THEN 3
    ELSE 4
  END,
  total_sessions DESC;

-- Log retention analysis with TTL monitoring
WITH log_retention_analysis AS (
  SELECT 
    application_name,
    log_level,
    DATE_TRUNC('day', created_at) as log_date,
    COUNT(*) as daily_log_count,
    AVG(LENGTH(message)) as avg_message_length,

    -- TTL calculation based on level-specific retention
    CASE log_level
      WHEN 'DEBUG' THEN created_at + INTERVAL '7 days'
      WHEN 'INFO' THEN created_at + INTERVAL '30 days'
      WHEN 'WARN' THEN created_at + INTERVAL '90 days'
      WHEN 'ERROR' THEN created_at + INTERVAL '365 days'
      WHEN 'CRITICAL' THEN created_at + INTERVAL '365 days'
      ELSE created_at + INTERVAL '30 days'
    END as estimated_expiry,

    -- Storage impact analysis
    SUM(LENGTH(message) + COALESCE(LENGTH(metadata::TEXT), 0)) as daily_storage_bytes

  FROM application_logs
  WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '30 days'
  GROUP BY application_name, log_level, DATE_TRUNC('day', created_at)
),
storage_projections AS (
  SELECT 
    application_name,
    log_level,

    -- Current metrics
    SUM(daily_log_count) as total_logs,
    AVG(daily_log_count) as avg_daily_logs,
    SUM(daily_storage_bytes) as total_storage_bytes,
    AVG(daily_storage_bytes) as avg_daily_storage,

    -- TTL impact
    MIN(estimated_expiry) as earliest_expiry,
    MAX(estimated_expiry) as latest_expiry,

    -- Storage efficiency
    CASE log_level
      WHEN 'DEBUG' THEN SUM(daily_storage_bytes) * 7 / 30 -- 7-day retention
      WHEN 'INFO' THEN SUM(daily_storage_bytes) -- 30-day retention
      WHEN 'WARN' THEN SUM(daily_storage_bytes) * 3 -- 90-day retention
      ELSE SUM(daily_storage_bytes) * 12 -- 365-day retention
    END as projected_steady_state_storage

  FROM log_retention_analysis
  GROUP BY application_name, log_level
)
SELECT 
  application_name,
  log_level,
  total_logs,
  avg_daily_logs,

  -- Storage analysis
  ROUND(total_storage_bytes / 1024.0 / 1024.0, 2) as storage_mb,
  ROUND(avg_daily_storage / 1024.0 / 1024.0, 2) as avg_daily_mb,
  ROUND(projected_steady_state_storage / 1024.0 / 1024.0, 2) as steady_state_mb,

  -- TTL effectiveness
  earliest_expiry,
  latest_expiry,
  EXTRACT(DAY FROM (latest_expiry - earliest_expiry)) as retention_range_days,

  -- Storage optimization
  ROUND((total_storage_bytes - projected_steady_state_storage) / 1024.0 / 1024.0, 2) as storage_savings_mb,
  ROUND(((total_storage_bytes - projected_steady_state_storage) / total_storage_bytes * 100), 1) as storage_reduction_pct,

  -- Recommendations
  CASE 
    WHEN log_level = 'DEBUG' AND avg_daily_logs > 10000 THEN 'Consider shorter DEBUG retention or sampling'
    WHEN projected_steady_state_storage > total_storage_bytes * 2 THEN 'TTL may be too long for this log volume'
    WHEN projected_steady_state_storage < total_storage_bytes * 0.1 THEN 'TTL may be too aggressive'
    ELSE 'TTL appears well-configured'
  END as ttl_recommendation

FROM storage_projections
WHERE total_logs > 0
ORDER BY application_name, 
  CASE log_level 
    WHEN 'CRITICAL' THEN 1
    WHEN 'ERROR' THEN 2
    WHEN 'WARN' THEN 3
    WHEN 'INFO' THEN 4
    WHEN 'DEBUG' THEN 5
  END;

-- TTL index health monitoring
WITH ttl_index_health AS (
  SELECT 
    'user_sessions' as collection_name,
    'session_ttl' as index_name,
    'expires_at' as ttl_field,
    0 as expire_after_seconds,

    -- Health metrics
    COUNT(*) as total_documents,
    COUNT(*) FILTER (WHERE expires_at <= CURRENT_TIMESTAMP) as expired_documents,
    COUNT(*) FILTER (WHERE expires_at <= CURRENT_TIMESTAMP + INTERVAL '1 hour') as expiring_soon,

    -- Performance metrics
    AVG(EXTRACT(EPOCH FROM (expires_at - created_at))) as avg_document_lifetime,
    MIN(expires_at) as earliest_expiry,
    MAX(expires_at) as latest_expiry

  FROM user_sessions

  UNION ALL

  SELECT 
    'application_logs' as collection_name,
    'logs_level_ttl' as index_name,
    'created_at' as ttl_field,
    CASE log_level
      WHEN 'DEBUG' THEN 7 * 24 * 3600
      WHEN 'INFO' THEN 30 * 24 * 3600
      WHEN 'WARN' THEN 90 * 24 * 3600
      ELSE 365 * 24 * 3600
    END as expire_after_seconds,

    COUNT(*) as total_documents,
    COUNT(*) FILTER (WHERE 
      created_at <= CURRENT_TIMESTAMP - 
      CASE log_level
        WHEN 'DEBUG' THEN INTERVAL '7 days'
        WHEN 'INFO' THEN INTERVAL '30 days'
        WHEN 'WARN' THEN INTERVAL '90 days'
        ELSE INTERVAL '365 days'
      END
    ) as expired_documents,
    COUNT(*) FILTER (WHERE 
      created_at <= CURRENT_TIMESTAMP + INTERVAL '1 day' - 
      CASE log_level
        WHEN 'DEBUG' THEN INTERVAL '7 days'
        WHEN 'INFO' THEN INTERVAL '30 days'
        WHEN 'WARN' THEN INTERVAL '90 days'
        ELSE INTERVAL '365 days'
      END
    ) as expiring_soon,

    AVG(EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - created_at))) as avg_document_lifetime,
    MIN(created_at + CASE log_level
      WHEN 'DEBUG' THEN INTERVAL '7 days'
      WHEN 'INFO' THEN INTERVAL '30 days'
      WHEN 'WARN' THEN INTERVAL '90 days'
      ELSE INTERVAL '365 days'
    END) as earliest_expiry,
    MAX(created_at + CASE log_level
      WHEN 'DEBUG' THEN INTERVAL '7 days'
      WHEN 'INFO' THEN INTERVAL '30 days'
      WHEN 'WARN' THEN INTERVAL '90 days'
      ELSE INTERVAL '365 days'
    END) as latest_expiry

  FROM application_logs
  GROUP BY log_level
)
SELECT 
  collection_name,
  index_name,
  ttl_field,
  expire_after_seconds,
  total_documents,
  expired_documents,
  expiring_soon,

  -- TTL efficiency metrics
  ROUND(avg_document_lifetime / 3600, 2) as avg_lifetime_hours,
  CASE 
    WHEN total_documents > 0 
    THEN ROUND((expired_documents::numeric / total_documents) * 100, 2)
    ELSE 0
  END as expiration_rate_pct,

  -- TTL health indicators
  CASE 
    WHEN expired_documents > total_documents * 0.9 THEN 'unhealthy_high_expiration'
    WHEN expired_documents = 0 AND total_documents > 1000 THEN 'no_expiration_detected'
    WHEN expiring_soon > total_documents * 0.5 THEN 'high_upcoming_expiration'
    ELSE 'healthy'
  END as ttl_health_status,

  -- Performance impact assessment
  CASE 
    WHEN expired_documents > 10000 THEN 'high_cleanup_load'
    WHEN expiring_soon > 5000 THEN 'moderate_cleanup_load'
    ELSE 'low_cleanup_load'
  END as cleanup_load_assessment

FROM ttl_index_health
ORDER BY collection_name, expire_after_seconds;

-- TTL collection management commands
-- Monitor TTL operations
SHOW TTL STATUS;

-- Optimize TTL indexes
OPTIMIZE TTL INDEXES;

-- Modify TTL expiration times
ALTER TABLE user_sessions 
MODIFY TTL expires_at EXPIRE_AFTER 0,
MODIFY TTL last_accessed_at EXPIRE_AFTER '14 days';

-- Remove TTL from a collection
ALTER TABLE temporary_data DROP TTL created_at;

-- QueryLeaf provides comprehensive TTL capabilities:
-- 1. SQL-familiar TTL creation and management syntax
-- 2. Multiple TTL strategies (field-based, time-based, conditional)
-- 3. Advanced TTL monitoring and health assessment
-- 4. Automatic storage optimization and cleanup
-- 5. Business logic integration with TTL policies
-- 6. Compliance and audit-friendly TTL management
-- 7. Performance monitoring and optimization recommendations
-- 8. Integration with MongoDB's native TTL optimizations
-- 9. Flexible retention policies with partial index support
-- 10. Familiar SQL syntax for complex TTL operations

Best Practices for TTL Implementation

Data Lifecycle Strategy Design

Essential principles for effective TTL implementation:

  1. Business Alignment: Design TTL policies that align with business requirements and compliance needs
  2. Performance Optimization: Consider the impact of TTL operations on database performance
  3. Storage Management: Balance data retention needs with storage costs and performance
  4. Monitoring Strategy: Implement comprehensive monitoring for TTL effectiveness
  5. Gradual Implementation: Roll out TTL policies gradually to assess impact
  6. Backup Considerations: Ensure TTL policies don't conflict with backup and recovery strategies
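
To make the first and third principles concrete, here is a minimal sketch of the two common TTL index styles using the Node.js driver; the connection string, database, collection, and field names are illustrative assumptions rather than part of any existing system:

// Minimal sketch: declaring field-based and duration-based TTL indexes
const { MongoClient } = require('mongodb');

async function createTtlIndexes(uri = 'mongodb://localhost:27017') {
  const client = new MongoClient(uri);
  await client.connect();
  const db = client.db('ttl_demo');

  // Expire each session exactly at the timestamp stored in expires_at
  await db.collection('user_sessions').createIndex(
    { expires_at: 1 },
    { name: 'session_ttl', expireAfterSeconds: 0 }
  );

  // Expire log documents 30 days after their created_at timestamp
  await db.collection('application_logs').createIndex(
    { created_at: 1 },
    { name: 'logs_ttl_30d', expireAfterSeconds: 30 * 24 * 3600 }
  );

  await client.close();
}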

Advanced TTL Configuration

Optimize TTL for production environments:

  1. Index Strategy: Design TTL indexes to minimize performance impact during cleanup
  2. Batch Operations: Configure TTL to avoid large batch deletions during peak hours
  3. Partial Indexes: Use partial indexes for complex retention policies
  4. Compound TTL: Combine TTL with other indexing strategies for optimal performance
  5. Timezone Handling: Account for business timezone requirements in TTL calculations
  6. Compliance Integration: Ensure TTL policies meet regulatory and audit requirements
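
For item 3 above, a partial TTL index keeps aggressive retention limited to one slice of the data. This hedged sketch assumes a connected db handle (as in the earlier sketch) and that only DEBUG-level log documents should expire after seven days:

// Minimal sketch: TTL combined with a partial filter so only DEBUG entries expire
async function createDebugLogTtl(db) {
  return db.collection('application_logs').createIndex(
    { created_at: 1 },
    {
      name: 'debug_logs_ttl_7d',
      expireAfterSeconds: 7 * 24 * 3600,
      partialFilterExpression: { log_level: 'DEBUG' }
    }
  );
}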

Conclusion

MongoDB TTL collections eliminate the complexity of manual data lifecycle management by providing native, automatic data expiration capabilities. The ability to configure flexible retention policies, monitor TTL effectiveness, and integrate with business logic makes TTL collections essential for modern data management strategies.

Key TTL benefits include:

  • Automatic Data Management: Hands-off data expiration without manual intervention
  • Flexible Retention Policies: Multiple TTL strategies for different data types and business requirements
  • Storage Optimization: Automatic cleanup reduces storage costs and improves performance
  • Compliance Support: Built-in capabilities for audit trails and regulatory compliance
  • Performance Benefits: Optimized cleanup operations with minimal impact on application performance
  • SQL Accessibility: Familiar SQL-style TTL operations through QueryLeaf integration

Whether you're managing user sessions, application logs, temporary data, or compliance-sensitive information, MongoDB TTL collections with QueryLeaf's familiar SQL interface provide the foundation for efficient, automated data lifecycle management.

QueryLeaf Integration: QueryLeaf seamlessly manages MongoDB TTL collections while providing SQL-familiar data lifecycle management syntax, retention policy configuration, and TTL monitoring capabilities. Advanced TTL patterns including conditional expiration, gradual cleanup, and compliance-aware retention are elegantly handled through familiar SQL constructs, making sophisticated data lifecycle management both powerful and accessible to SQL-oriented development teams.

The combination of MongoDB's robust TTL capabilities with SQL-style data lifecycle operations makes it an ideal platform for applications requiring both automated data management and familiar database interaction patterns, ensuring your TTL strategies remain both effective and maintainable as your data needs evolve and scale.

MongoDB Bulk Operations and Performance Optimization: Advanced Batch Processing for High-Throughput Applications

High-throughput applications require efficient data processing capabilities that can handle large volumes of documents with minimal latency and optimal resource utilization. Traditional single-document operations become performance bottlenecks when applications need to process thousands or millions of documents, leading to increased response times, inefficient network utilization, and poor system scalability under heavy data processing loads.

MongoDB's bulk operations provide sophisticated batch processing capabilities that enable applications to perform multiple document operations in a single request, dramatically improving throughput while reducing network overhead and server-side processing costs. Unlike traditional databases that require complex batching logic or application-level transaction management, MongoDB offers native bulk operation support with automatic optimization, error handling, and performance monitoring.
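
Before the detailed examples that follow, here is a minimal sketch of the core driver call this article builds on: a single bulkWrite() request carrying mixed inserts, updates, and deletes. The database, collection, and field names are illustrative assumptions:

// Minimal sketch: one unordered bulkWrite() batching mixed operations
const { MongoClient } = require('mongodb');

async function bulkWriteSketch(uri = 'mongodb://localhost:27017') {
  const client = new MongoClient(uri);
  await client.connect();
  const users = client.db('bulk_demo').collection('users');

  const result = await users.bulkWrite(
    [
      { insertOne: { document: { name: 'John Doe', status: 'pending' } } },
      { updateOne: { filter: { email: 'jane@example.com' },
                     update: { $set: { status: 'active' } } } },
      { updateMany: { filter: { status: 'inactive' },
                      update: { $set: { archived: true } } } },
      { deleteMany: { filter: { last_login: { $lt: new Date('2023-01-01') } } } }
    ],
    { ordered: false } // continue past individual failures for better throughput
  );

  console.log(result.insertedCount, result.modifiedCount, result.deletedCount);
  await client.close();
}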

The Single-Document Operation Challenge

Traditional document-by-document processing approaches face significant performance limitations in high-volume scenarios:

-- Traditional approach - processing documents one at a time (inefficient pattern)

-- Example: Processing user registration batch - individual operations
INSERT INTO users (name, email, registration_date, status) 
VALUES ('John Doe', 'john@example.com', CURRENT_TIMESTAMP, 'pending');

INSERT INTO users (name, email, registration_date, status) 
VALUES ('Jane Smith', 'jane@example.com', CURRENT_TIMESTAMP, 'pending');

INSERT INTO users (name, email, registration_date, status) 
VALUES ('Bob Johnson', 'bob@example.com', CURRENT_TIMESTAMP, 'pending');

-- Problems with single-document operations:
-- 1. High network round-trip overhead for each operation
-- 2. Individual index updates and lock acquisitions
-- 3. Inefficient resource utilization and memory allocation
-- 4. Poor scaling characteristics under high load
-- 5. Complex error handling for partial failures
-- 6. Limited transaction scope and atomicity guarantees

-- Example: Updating user statuses individually (performance bottleneck)
UPDATE users SET status = 'active', activated_at = CURRENT_TIMESTAMP 
WHERE email = 'john@example.com';

UPDATE users SET status = 'active', activated_at = CURRENT_TIMESTAMP 
WHERE email = 'jane@example.com';

UPDATE users SET status = 'active', activated_at = CURRENT_TIMESTAMP 
WHERE email = 'bob@example.com';

-- Individual updates result in:
-- - Multiple database connections and query parsing overhead
-- - Repeated index lookups and document retrieval operations  
-- - Inefficient write operations with individual lock acquisitions
-- - High latency due to network round trips
-- - Difficult error recovery and consistency management
-- - Poor resource utilization with context switching overhead

-- Example: Data cleanup operations (time-consuming individual deletes)
DELETE FROM users WHERE last_login < CURRENT_DATE - INTERVAL '2 years';
-- This approach processes each matching document individually

DELETE FROM user_sessions WHERE created_at < CURRENT_DATE - INTERVAL '30 days';
-- Again, individual document processing

DELETE FROM audit_logs WHERE log_date < CURRENT_DATE - INTERVAL '1 year';
-- More individual processing overhead

-- Single-document limitations:
-- 1. Long-running operations that block other requests
-- 2. Inefficient resource allocation and memory usage
-- 3. Poor progress tracking and monitoring capabilities
-- 4. Difficult to implement proper error handling
-- 5. No batch-level optimization opportunities
-- 6. Complex application logic for managing large datasets
-- 7. Limited ability to prioritize or throttle operations
-- 8. Inefficient use of database connection pooling

-- Traditional PostgreSQL bulk insert attempt (limited capabilities)
BEGIN;
INSERT INTO users (name, email, registration_date, status) VALUES
  ('User 1', 'user1@example.com', CURRENT_TIMESTAMP, 'pending'),
  ('User 2', 'user2@example.com', CURRENT_TIMESTAMP, 'pending'),
  ('User 3', 'user3@example.com', CURRENT_TIMESTAMP, 'pending');
  -- Limited to relatively small batches due to query size restrictions
  -- No advanced error handling or partial success reporting
  -- Limited optimization compared to native bulk operations
COMMIT;

-- PostgreSQL bulk update limitations
UPDATE users SET 
  status = CASE 
    WHEN email = 'user1@example.com' THEN 'active'
    WHEN email = 'user2@example.com' THEN 'suspended'
    WHEN email = 'user3@example.com' THEN 'active'
    ELSE status
  END,
  last_updated = CURRENT_TIMESTAMP
WHERE email IN ('user1@example.com', 'user2@example.com', 'user3@example.com');

-- Issues with traditional bulk approaches:
-- 1. Complex SQL syntax for conditional updates
-- 2. Limited flexibility for different operations per document
-- 3. No built-in error reporting for individual items
-- 4. Query size limitations for large batches
-- 5. Poor performance characteristics compared to native bulk operations
-- 6. Limited monitoring and progress reporting capabilities

MongoDB bulk operations provide comprehensive high-performance batch processing:

// MongoDB Advanced Bulk Operations - comprehensive batch processing with optimization

const { MongoClient } = require('mongodb');

// Advanced MongoDB Bulk Operations Manager
class MongoDBBulkOperationsManager {
  constructor(db) {
    this.db = db;
    this.performanceMetrics = {
      bulkInserts: { operations: 0, documentsProcessed: 0, totalTime: 0 },
      bulkUpdates: { operations: 0, documentsProcessed: 0, totalTime: 0 },
      bulkDeletes: { operations: 0, documentsProcessed: 0, totalTime: 0 },
      bulkWrites: { operations: 0, documentsProcessed: 0, totalTime: 0 }
    };
    this.errorTracking = new Map();
    this.optimizationSettings = {
      defaultBatchSize: 1000,
      maxBatchSize: 10000,
      enableOrdered: false, // Unordered operations for better performance
      enableBypassValidation: false,
      retryAttempts: 3,
      retryDelayMs: 1000
    };
  }

  // High-performance bulk insert operations
  async performBulkInsert(collectionName, documents, options = {}) {
    console.log(`Starting bulk insert of ${documents.length} documents into ${collectionName}`);

    const startTime = Date.now();
    const collection = this.db.collection(collectionName);

    // Configure bulk insert options for optimal performance
    const bulkOptions = {
      ordered: options.ordered !== undefined ? options.ordered : this.optimizationSettings.enableOrdered,
      bypassDocumentValidation: options.bypassValidation || this.optimizationSettings.enableBypassValidation,
      writeConcern: options.writeConcern || { w: 'majority', j: true }
    };

    try {
      // Process documents in optimal batch sizes
      const batchSize = Math.min(
        options.batchSize || this.optimizationSettings.defaultBatchSize,
        this.optimizationSettings.maxBatchSize
      );

      const results = [];
      let totalInserted = 0;
      let totalErrors = 0;

      for (let i = 0; i < documents.length; i += batchSize) {
        const batch = documents.slice(i, i + batchSize);

        try {
          console.log(`Processing batch ${Math.floor(i / batchSize) + 1} of ${Math.ceil(documents.length / batchSize)}`);

          // Add metadata to documents for tracking
          const enrichedBatch = batch.map(doc => ({
            ...doc,
            _bulk_operation_id: `bulk_insert_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`,
            _inserted_at: new Date(),
            _batch_number: Math.floor(i / batchSize) + 1
          }));

          const batchResult = await collection.insertMany(enrichedBatch, bulkOptions);

          results.push({
            batchIndex: Math.floor(i / batchSize),
            insertedCount: batchResult.insertedCount,
            insertedIds: batchResult.insertedIds,
            success: true
          });

          totalInserted += batchResult.insertedCount;

        } catch (error) {
          console.error(`Batch ${Math.floor(i / batchSize) + 1} failed:`, error.message);

          // Handle partial failures in unordered operations
          if (error.result && error.result.insertedCount) {
            totalInserted += error.result.insertedCount;
          }

          totalErrors += batch.length - (error.result?.insertedCount || 0);

          results.push({
            batchIndex: Math.floor(i / batchSize),
            insertedCount: error.result?.insertedCount || 0,
            error: error.message,
            success: false
          });

          // Track errors for analysis
          this.trackBulkOperationError('bulkInsert', error);
        }
      }

      const totalTime = Date.now() - startTime;

      // Update performance metrics
      this.updatePerformanceMetrics('bulkInserts', {
        operations: 1,
        documentsProcessed: totalInserted,
        totalTime: totalTime
      });

      const summary = {
        success: totalErrors === 0,
        totalDocuments: documents.length,
        insertedDocuments: totalInserted,
        failedDocuments: totalErrors,
        executionTimeMs: totalTime,
        throughputDocsPerSecond: Math.round((totalInserted / totalTime) * 1000),
        batchResults: results
      };

      console.log(`Bulk insert completed: ${totalInserted}/${documents.length} documents processed in ${totalTime}ms`);
      return summary;

    } catch (error) {
      console.error('Bulk insert operation failed:', error);
      this.trackBulkOperationError('bulkInsert', error);
      throw error;
    }
  }

  // Advanced bulk update operations with flexible patterns
  async performBulkUpdate(collectionName, updateOperations, options = {}) {
    console.log(`Starting bulk update of ${updateOperations.length} operations on ${collectionName}`);

    const startTime = Date.now();
    const collection = this.db.collection(collectionName);

    try {
      // Initialize ordered or unordered bulk operation
      const bulkOp = options.ordered ? collection.initializeOrderedBulkOp() : 
                                       collection.initializeUnorderedBulkOp();

      let operationCount = 0;

      // Process different types of update operations
      for (const operation of updateOperations) {
        const { filter, update, upsert = false, arrayFilters = null, hint = null } = operation;

        // Add operation metadata for tracking
        const enhancedUpdate = {
          ...update,
          $set: {
            ...update.$set,
            _last_bulk_update: new Date(),
            _bulk_operation_id: `bulk_update_${Date.now()}_${operationCount}`
          }
        };

        // Configure the operation via the fluent find() operators; upsert,
        // arrayFilters, and hint are chained rather than passed as options
        let findOp = bulkOp.find(filter);
        if (upsert) findOp = findOp.upsert();
        if (arrayFilters) findOp = findOp.arrayFilters(arrayFilters);
        if (hint) findOp = findOp.hint(hint);

        // Add to bulk operation (update() applies to all matching documents)
        if (operation.type === 'updateMany') {
          findOp.update(enhancedUpdate);
        } else {
          findOp.updateOne(enhancedUpdate);
        }

        operationCount++;

        // Log progress at batch-size intervals; the driver splits the queued
        // operations into appropriately sized commands when execute() runs
        if (operationCount % this.optimizationSettings.defaultBatchSize === 0) {
          console.log(`Queued ${operationCount} update operations so far`);
        }
      }

      // Execute all bulk update operations
      console.log(`Executing ${operationCount} bulk update operations`);
      const result = await bulkOp.execute({
        writeConcern: options.writeConcern || { w: 'majority', j: true }
      });

      const totalTime = Date.now() - startTime;

      // Update performance metrics
      this.updatePerformanceMetrics('bulkUpdates', {
        operations: 1,
        documentsProcessed: result.modifiedCount + result.upsertedCount,
        totalTime: totalTime
      });

      const summary = {
        success: true,
        totalOperations: operationCount,
        matchedDocuments: result.matchedCount,
        modifiedDocuments: result.modifiedCount,
        upsertedDocuments: result.upsertedCount,
        upsertedIds: result.upsertedIds,
        executionTimeMs: totalTime,
        throughputOpsPerSecond: Math.round((operationCount / totalTime) * 1000),
        writeErrors: result.writeErrors || [],
        writeConcernErrors: result.writeConcernErrors || []
      };

      console.log(`Bulk update completed: ${result.modifiedCount} documents modified, ${result.upsertedCount} upserted in ${totalTime}ms`);
      return summary;

    } catch (error) {
      console.error('Bulk update operation failed:', error);
      this.trackBulkOperationError('bulkUpdate', error);

      // Return partial results if available
      if (error.result) {
        const totalTime = Date.now() - startTime;
        return {
          success: false,
          error: error.message,
          partialResult: {
            matchedDocuments: error.result.matchedCount,
            modifiedDocuments: error.result.modifiedCount,
            upsertedDocuments: error.result.upsertedCount,
            executionTimeMs: totalTime
          }
        };
      }
      throw error;
    }
  }

  // Optimized bulk delete operations
  async performBulkDelete(collectionName, deleteOperations, options = {}) {
    console.log(`Starting bulk delete of ${deleteOperations.length} operations on ${collectionName}`);

    const startTime = Date.now();
    const collection = this.db.collection(collectionName);

    try {
      // Initialize bulk operation
      const bulkOp = options.ordered ? collection.initializeOrderedBulkOp() : 
                                       collection.initializeUnorderedBulkOp();

      let operationCount = 0;

      // Process delete operations
      for (const operation of deleteOperations) {
        const { filter, deleteType = 'deleteMany', hint = null } = operation;

        // Configure delete operation (hint is applied via the fluent find() operators)
        let deleteOp = bulkOp.find(filter);
        if (hint) deleteOp = deleteOp.hint(hint);

        // Add to bulk operation based on type
        if (deleteType === 'deleteOne') {
          deleteOp.deleteOne();
        } else {
          deleteOp.delete(); // removes all matching documents (deleteMany is the default)
        }

        operationCount++;
      }

      // Execute bulk delete operations
      console.log(`Executing ${operationCount} bulk delete operations`);
      const result = await bulkOp.execute({
        writeConcern: options.writeConcern || { w: 'majority', j: true }
      });

      const totalTime = Date.now() - startTime;

      // Update performance metrics
      this.updatePerformanceMetrics('bulkDeletes', {
        operations: 1,
        documentsProcessed: result.deletedCount,
        totalTime: totalTime
      });

      const summary = {
        success: true,
        totalOperations: operationCount,
        deletedDocuments: result.deletedCount,
        executionTimeMs: totalTime,
        throughputOpsPerSecond: Math.round((operationCount / totalTime) * 1000),
        writeErrors: result.writeErrors || [],
        writeConcernErrors: result.writeConcernErrors || []
      };

      console.log(`Bulk delete completed: ${result.deletedCount} documents deleted in ${totalTime}ms`);
      return summary;

    } catch (error) {
      console.error('Bulk delete operation failed:', error);
      this.trackBulkOperationError('bulkDelete', error);

      if (error.result) {
        const totalTime = Date.now() - startTime;
        return {
          success: false,
          error: error.message,
          partialResult: {
            deletedDocuments: error.result.deletedCount,
            executionTimeMs: totalTime
          }
        };
      }
      throw error;
    }
  }

  // Mixed bulk operations (insert, update, delete in single batch)
  async performMixedBulkOperations(collectionName, operations, options = {}) {
    console.log(`Starting mixed bulk operations: ${operations.length} operations on ${collectionName}`);

    const startTime = Date.now();
    const collection = this.db.collection(collectionName);

    try {
      const bulkOp = options.ordered ? collection.initializeOrderedBulkOp() : 
                                       collection.initializeUnorderedBulkOp();

      let insertCount = 0;
      let updateCount = 0;
      let deleteCount = 0;

      // Process mixed operations
      for (const operation of operations) {
        const { type, ...opData } = operation;

        switch (type) {
          case 'insert':
            const enrichedDoc = {
              ...opData.document,
              _bulk_operation_id: `bulk_mixed_${Date.now()}_${insertCount}`,
              _inserted_at: new Date()
            };
            bulkOp.insert(enrichedDoc);
            insertCount++;
            break;

          case 'updateOne':
            const updateOneData = {
              ...opData.update,
              $set: {
                ...opData.update.$set,
                _last_bulk_update: new Date(),
                _bulk_operation_id: `bulk_mixed_update_${Date.now()}_${updateCount}`
              }
            };
            // upsert is requested through the fluent API rather than an options object
            (opData.upsert ? bulkOp.find(opData.filter).upsert() : bulkOp.find(opData.filter))
              .updateOne(updateOneData);
            updateCount++;
            break;

          case 'updateMany':
            const updateManyData = {
              ...opData.update,
              $set: {
                ...opData.update.$set,
                _last_bulk_update: new Date(),
                _bulk_operation_id: `bulk_mixed_update_${Date.now()}_${updateCount}`
              }
            };
            // update() applies the change to every document matching the filter
            (opData.upsert ? bulkOp.find(opData.filter).upsert() : bulkOp.find(opData.filter))
              .update(updateManyData);
            updateCount++;
            break;

          case 'deleteOne':
            bulkOp.find(opData.filter).deleteOne();
            deleteCount++;
            break;

          case 'deleteMany':
            bulkOp.find(opData.filter).delete();
            deleteCount++;
            break;

          default:
            console.warn(`Unknown operation type: ${type}`);
        }
      }

      // Execute mixed bulk operations
      console.log(`Executing mixed bulk operations: ${insertCount} inserts, ${updateCount} updates, ${deleteCount} deletes`);
      const result = await bulkOp.execute({
        writeConcern: options.writeConcern || { w: 'majority', j: true }
      });

      const totalTime = Date.now() - startTime;
      const totalDocumentsProcessed = result.insertedCount + result.modifiedCount + result.deletedCount + result.upsertedCount;

      // Update performance metrics
      this.updatePerformanceMetrics('bulkWrites', {
        operations: 1,
        documentsProcessed: totalDocumentsProcessed,
        totalTime: totalTime
      });

      const summary = {
        success: true,
        totalOperations: operations.length,
        operationBreakdown: {
          inserts: insertCount,
          updates: updateCount,
          deletes: deleteCount
        },
        results: {
          insertedDocuments: result.insertedCount,
          insertedIds: result.insertedIds,
          matchedDocuments: result.matchedCount,
          modifiedDocuments: result.modifiedCount,
          deletedDocuments: result.deletedCount,
          upsertedDocuments: result.upsertedCount,
          upsertedIds: result.upsertedIds
        },
        executionTimeMs: totalTime,
        throughputOpsPerSecond: Math.round((operations.length / totalTime) * 1000),
        throughputDocsPerSecond: Math.round((totalDocumentsProcessed / totalTime) * 1000),
        writeErrors: result.writeErrors || [],
        writeConcernErrors: result.writeConcernErrors || []
      };

      console.log(`Mixed bulk operations completed: ${totalDocumentsProcessed} documents processed in ${totalTime}ms`);
      return summary;

    } catch (error) {
      console.error('Mixed bulk operations failed:', error);
      this.trackBulkOperationError('bulkWrite', error);

      if (error.result) {
        const totalTime = Date.now() - startTime;
        const totalDocumentsProcessed = error.result.insertedCount + error.result.modifiedCount + error.result.deletedCount + error.result.upsertedCount;

        return {
          success: false,
          error: error.message,
          partialResult: {
            insertedDocuments: error.result.insertedCount,
            modifiedDocuments: error.result.modifiedCount,
            deletedDocuments: error.result.deletedCount,
            upsertedDocuments: error.result.upsertedCount,
            totalDocumentsProcessed: totalDocumentsProcessed,
            executionTimeMs: totalTime
          }
        };
      }
      throw error;
    }
  }

  // Performance monitoring and optimization
  updatePerformanceMetrics(operationType, metrics) {
    const current = this.performanceMetrics[operationType];
    current.operations += metrics.operations;
    current.documentsProcessed += metrics.documentsProcessed;
    current.totalTime += metrics.totalTime;
  }

  trackBulkOperationError(operationType, error) {
    if (!this.errorTracking.has(operationType)) {
      this.errorTracking.set(operationType, []);
    }

    this.errorTracking.get(operationType).push({
      timestamp: new Date(),
      error: error.message,
      code: error.code,
      details: error.writeErrors || error.result
    });
  }

  getBulkOperationStatistics() {
    const stats = {};

    for (const [operationType, metrics] of Object.entries(this.performanceMetrics)) {
      if (metrics.operations > 0) {
        stats[operationType] = {
          totalOperations: metrics.operations,
          documentsProcessed: metrics.documentsProcessed,
          averageExecutionTimeMs: Math.round(metrics.totalTime / metrics.operations),
          averageThroughputDocsPerSecond: Math.round((metrics.documentsProcessed / metrics.totalTime) * 1000),
          totalExecutionTimeMs: metrics.totalTime
        };
      }
    }

    return stats;
  }

  getErrorStatistics() {
    const errorStats = {};

    for (const [operationType, errors] of this.errorTracking.entries()) {
      errorStats[operationType] = {
        totalErrors: errors.length,
        recentErrors: errors.filter(e => Date.now() - e.timestamp.getTime() < 3600000), // Last hour
        errorBreakdown: this.groupErrorsByCode(errors)
      };
    }

    return errorStats;
  }

  groupErrorsByCode(errors) {
    const breakdown = {};
    errors.forEach(error => {
      const code = error.code || 'Unknown';
      breakdown[code] = (breakdown[code] || 0) + 1;
    });
    return breakdown;
  }

  // Optimized data import functionality
  async performOptimizedDataImport(collectionName, dataSource, options = {}) {
    console.log(`Starting optimized data import for ${collectionName}`);

    const importOptions = {
      batchSize: options.batchSize || 5000,
      enableValidation: options.enableValidation !== false,
      createIndexes: options.createIndexes || false,
      dropExistingCollection: options.dropExisting || false,
      parallelBatches: options.parallelBatches || 1
    };

    try {
      const collection = this.db.collection(collectionName);

      // Drop existing collection if requested
      if (importOptions.dropExistingCollection) {
        try {
          await collection.drop();
          console.log(`Existing collection ${collectionName} dropped`);
        } catch (error) {
          console.log(`Collection ${collectionName} did not exist or could not be dropped`);
        }
      }

      // Create indexes before import if specified
      if (importOptions.createIndexes && options.indexes) {
        console.log('Creating indexes before data import...');
        for (const indexSpec of options.indexes) {
          await collection.createIndex(indexSpec.fields, indexSpec.options);
        }
      }

      // Process data in optimized batches
      let totalImported = 0;
      const startTime = Date.now();

      // Assuming dataSource is an array or iterable
      const documents = Array.isArray(dataSource) ? dataSource : await this.convertDataSource(dataSource);

      const result = await this.performBulkInsert(collectionName, documents, {
        batchSize: importOptions.batchSize,
        bypassValidation: !importOptions.enableValidation,
        ordered: false // Unordered for better performance
      });

      console.log(`Data import completed: ${result.insertedDocuments} documents imported in ${result.executionTimeMs}ms`);
      return result;

    } catch (error) {
      console.error(`Data import failed for ${collectionName}:`, error);
      throw error;
    }
  }

  async convertDataSource(dataSource) {
    // Convert various data sources (streams, iterators, etc.) to arrays
    // This is a placeholder - implement based on your specific data source types
    if (typeof dataSource.toArray === 'function') {
      return await dataSource.toArray();
    }

    if (Symbol.iterator in dataSource) {
      return Array.from(dataSource);
    }

    throw new Error('Unsupported data source type');
  }
}

// Example usage: High-performance bulk operations
async function demonstrateBulkOperations() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('bulk_operations_demo');

  const bulkManager = new MongoDBBulkOperationsManager(db);

  // Demonstrate bulk insert
  const usersToInsert = [];
  for (let i = 0; i < 10000; i++) {
    usersToInsert.push({
      name: `User ${i}`,
      email: `user${i}@example.com`,
      age: Math.floor(Math.random() * 50) + 18,
      department: ['Engineering', 'Sales', 'Marketing', 'HR'][Math.floor(Math.random() * 4)],
      salary: Math.floor(Math.random() * 100000) + 40000,
      join_date: new Date(Date.now() - Math.random() * 365 * 24 * 60 * 60 * 1000)
    });
  }

  const insertResult = await bulkManager.performBulkInsert('users', usersToInsert);
  console.log('Bulk Insert Result:', insertResult);

  // Demonstrate bulk update
  const updateOperations = [
    {
      type: 'updateMany',
      filter: { department: 'Engineering' },
      update: { 
        $set: { department: 'Software Engineering' },
        $inc: { salary: 5000 }
      }
    },
    {
      type: 'updateMany', 
      filter: { age: { $lt: 25 } },
      update: { $set: { employee_type: 'junior' } },
      upsert: false
    }
  ];

  const updateResult = await bulkManager.performBulkUpdate('users', updateOperations);
  console.log('Bulk Update Result:', updateResult);

  // Display performance statistics
  const stats = bulkManager.getBulkOperationStatistics();
  console.log('Performance Statistics:', stats);

  await client.close();
}

Understanding MongoDB Bulk Operations Architecture

Advanced Bulk Processing Patterns and Performance Optimization

Implement sophisticated bulk operation patterns for production-scale data processing:

// Production-ready MongoDB bulk operations with advanced optimization strategies
class EnterpriseMongoDBBulkManager extends MongoDBBulkOperationsManager {
  constructor(db, enterpriseConfig = {}) {
    super(db);

    this.enterpriseConfig = {
      enableShardingOptimization: enterpriseConfig.enableShardingOptimization || false,
      enableReplicationOptimization: enterpriseConfig.enableReplicationOptimization || false,
      enableCompressionOptimization: enterpriseConfig.enableCompressionOptimization || false,
      maxConcurrentOperations: enterpriseConfig.maxConcurrentOperations || 10,
      enableProgressTracking: enterpriseConfig.enableProgressTracking !== false,
      enableResourceMonitoring: enterpriseConfig.enableResourceMonitoring !== false
    };

    this.setupEnterpriseOptimizations();
  }

  async performParallelBulkOperations(collectionName, operationBatches, options = {}) {
    console.log(`Starting parallel bulk operations on ${collectionName} with ${operationBatches.length} batches`);

    const concurrency = Math.min(
      options.maxConcurrency || this.enterpriseConfig.maxConcurrentOperations,
      operationBatches.length
    );

    const results = [];
    const startTime = Date.now();

    // Process batches in parallel with controlled concurrency
    for (let i = 0; i < operationBatches.length; i += concurrency) {
      const batchPromises = [];

      for (let j = i; j < Math.min(i + concurrency, operationBatches.length); j++) {
        const batch = operationBatches[j];

        const promise = this.processSingleBatch(collectionName, batch, {
          ...options,
          batchIndex: j
        });

        batchPromises.push(promise);
      }

      // Wait for current set of concurrent batches to complete
      const batchResults = await Promise.allSettled(batchPromises);
      results.push(...batchResults);

      console.log(`Completed ${Math.min(i + concurrency, operationBatches.length)} of ${operationBatches.length} batches`);
    }

    const totalTime = Date.now() - startTime;

    return this.consolidateParallelResults(results, totalTime);
  }

  async processSingleBatch(collectionName, batch, options) {
    // Determine batch type and process accordingly
    if (batch.type === 'insert') {
      return await this.performBulkInsert(collectionName, batch.documents, options);
    } else if (batch.type === 'update') {
      return await this.performBulkUpdate(collectionName, batch.operations, options);
    } else if (batch.type === 'delete') {
      return await this.performBulkDelete(collectionName, batch.operations, options);
    } else if (batch.type === 'mixed') {
      return await this.performMixedBulkOperations(collectionName, batch.operations, options);
    }
  }

  async performShardOptimizedBulkOperations(collectionName, operations, shardKey) {
    console.log(`Performing shard-optimized bulk operations on ${collectionName}`);

    // Group operations by shard key for optimal routing
    const shardGroupedOps = this.groupOperationsByShardKey(operations, shardKey);

    const results = [];

    for (const [shardValue, shardOps] of shardGroupedOps.entries()) {
      console.log(`Processing ${shardOps.length} operations for shard key value: ${shardValue}`);

      const shardResult = await this.performMixedBulkOperations(collectionName, shardOps, {
        ordered: false // Better performance for sharded clusters
      });

      results.push({
        shardKey: shardValue,
        result: shardResult
      });
    }

    return this.consolidateShardResults(results);
  }

  groupOperationsByShardKey(operations, shardKey) {
    const grouped = new Map();

    for (const operation of operations) {
      let keyValue;

      if (operation.type === 'insert') {
        keyValue = operation.document[shardKey];
      } else {
        keyValue = operation.filter[shardKey];
      }

      if (!grouped.has(keyValue)) {
        grouped.set(keyValue, []);
      }

      grouped.get(keyValue).push(operation);
    }

    return grouped;
  }

  async performStreamingBulkOperations(collectionName, dataStream, options = {}) {
    console.log(`Starting streaming bulk operations on ${collectionName}`);

    const batchSize = options.batchSize || 1000;
    const processingOptions = {
      ordered: false,
      ...options
    };

    let batch = [];
    let totalProcessed = 0;
    const results = [];

    return new Promise((resolve, reject) => {
      dataStream.on('data', async (data) => {
        batch.push(data);

        if (batch.length >= batchSize) {
          // Pause the stream while this batch is written so new documents
          // are not appended to (or lost from) the batch mid-flight
          dataStream.pause();
          const currentBatch = batch;
          batch = [];

          try {
            const batchResult = await this.performBulkInsert(
              collectionName,
              currentBatch,
              processingOptions
            );

            results.push(batchResult);
            totalProcessed += batchResult.insertedDocuments;

            console.log(`Processed ${totalProcessed} documents so far`);
            dataStream.resume();

          } catch (error) {
            reject(error);
          }
        }
      });

      dataStream.on('end', async () => {
        try {
          // Process remaining documents
          if (batch.length > 0) {
            const finalResult = await this.performBulkInsert(
              collectionName, 
              batch, 
              processingOptions
            );
            results.push(finalResult);
            totalProcessed += finalResult.insertedDocuments;
          }

          resolve({
            success: true,
            totalProcessed: totalProcessed,
            batchResults: results
          });

        } catch (error) {
          reject(error);
        }
      });

      dataStream.on('error', reject);
    });
  }
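
  // The constructor and the parallel/shard helpers above call the methods
  // below; minimal sketches are provided so the class is self-contained.
  // Concrete tuning and reporting logic would be deployment-specific.
  setupEnterpriseOptimizations() {
    // Hook for sharding/replication/compression tuning driven by
    // this.enterpriseConfig; intentionally a no-op in this sketch
  }

  consolidateParallelResults(settledResults, totalTimeMs) {
    const fulfilled = settledResults.filter(r => r.status === 'fulfilled').map(r => r.value);
    const rejected = settledResults.filter(r => r.status === 'rejected');

    return {
      success: rejected.length === 0,
      completedBatches: fulfilled.length,
      failedBatches: rejected.length,
      totalDocumentsProcessed: fulfilled.reduce((sum, r) =>
        sum + (r.insertedDocuments || r.modifiedDocuments || r.deletedDocuments || 0), 0),
      executionTimeMs: totalTimeMs,
      batchResults: fulfilled,
      batchErrors: rejected.map(r => (r.reason && r.reason.message) || String(r.reason))
    };
  }

  consolidateShardResults(shardResults) {
    return {
      success: shardResults.every(r => r.result && r.result.success),
      shardsProcessed: shardResults.length,
      shardResults: shardResults
    };
  }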
}

QueryLeaf Bulk Operations Integration

QueryLeaf provides familiar SQL syntax for MongoDB bulk operations and batch processing:

-- QueryLeaf bulk operations with SQL-familiar syntax for MongoDB batch processing

-- Bulk insert with SQL VALUES syntax (automatically optimized for MongoDB bulk operations)
INSERT INTO users (name, email, age, department, salary, join_date)
VALUES 
  ('John Doe', 'john@example.com', 32, 'Engineering', 85000, CURRENT_DATE),
  ('Jane Smith', 'jane@example.com', 28, 'Sales', 75000, CURRENT_DATE - INTERVAL '1 month'),
  ('Bob Johnson', 'bob@example.com', 35, 'Marketing', 70000, CURRENT_DATE - INTERVAL '2 months'),
  ('Alice Brown', 'alice@example.com', 29, 'HR', 68000, CURRENT_DATE - INTERVAL '3 months'),
  ('Charlie Wilson', 'charlie@example.com', 31, 'Engineering', 90000, CURRENT_DATE - INTERVAL '4 months');

-- QueryLeaf automatically converts this to optimized MongoDB bulk insert:
-- db.users.insertMany([documents...], { ordered: false })

-- Bulk update operations using SQL UPDATE syntax
-- Update all engineers' salaries (automatically uses MongoDB bulk operations)
UPDATE users 
SET salary = salary * 1.1, 
    last_updated = CURRENT_TIMESTAMP,
    promotion_eligible = true
WHERE department = 'Engineering';

-- Update employees based on multiple conditions
UPDATE users 
SET employee_level = CASE 
  WHEN age > 35 AND salary > 80000 THEN 'Senior'
  WHEN age > 30 OR salary > 70000 THEN 'Mid-level'
  ELSE 'Junior'
END,
last_evaluation = CURRENT_DATE
WHERE join_date < CURRENT_DATE - INTERVAL '6 months';

-- QueryLeaf optimizes these as MongoDB bulk update operations:
-- Uses bulkWrite() with updateMany operations for optimal performance

-- Bulk delete operations
-- Clean up old inactive users
DELETE FROM users 
WHERE last_login < CURRENT_DATE - INTERVAL '2 years' 
  AND status = 'inactive';

-- Remove test data
DELETE FROM users 
WHERE email LIKE '%test%' OR email LIKE '%example%';

-- QueryLeaf converts to optimized bulk delete operations

-- Advanced bulk processing with data transformation and aggregation
WITH user_statistics AS (
  SELECT 
    department,
    COUNT(*) as employee_count,
    AVG(salary) as avg_salary,
    MAX(salary) as max_salary,
    MIN(join_date) as earliest_hire
  FROM users 
  GROUP BY department
),

salary_adjustments AS (
  SELECT 
    u._id,
    u.name,
    u.department,
    u.salary,
    us.avg_salary,

    -- Calculate adjustment based on department average
    CASE 
      WHEN u.salary < us.avg_salary * 0.8 THEN u.salary * 1.15  -- 15% increase
      WHEN u.salary < us.avg_salary * 0.9 THEN u.salary * 1.10  -- 10% increase  
      WHEN u.salary > us.avg_salary * 1.2 THEN u.salary * 1.02  -- 2% increase
      ELSE u.salary * 1.05  -- 5% standard increase
    END as new_salary,

    CURRENT_DATE as adjustment_date

  FROM users u
  JOIN user_statistics us ON u.department = us.department
  WHERE u.status = 'active'
)

-- Bulk update with calculated values (QueryLeaf optimizes this as bulk operation)
UPDATE users 
SET salary = sa.new_salary,
    last_salary_review = sa.adjustment_date,
    salary_review_reason = CONCAT('Department average adjustment - Previous: $', 
                                 CAST(sa.salary AS VARCHAR), 
                                 ', New: $', 
                                 CAST(sa.new_salary AS VARCHAR))
FROM salary_adjustments sa
WHERE users._id = sa._id;

-- Bulk data processing with conditional operations
-- Process employee performance reviews in batches
WITH performance_data AS (
  SELECT 
    _id,
    name,
    department,
    performance_score,

    -- Calculate performance category
    CASE 
      WHEN performance_score >= 90 THEN 'exceptional'
      WHEN performance_score >= 80 THEN 'exceeds_expectations'  
      WHEN performance_score >= 70 THEN 'meets_expectations'
      WHEN performance_score >= 60 THEN 'needs_improvement'
      ELSE 'unsatisfactory'
    END as performance_category,

    -- Calculate bonus eligibility
    CASE 
      WHEN performance_score >= 85 AND department IN ('Sales', 'Engineering') THEN true
      WHEN performance_score >= 90 THEN true
      ELSE false
    END as bonus_eligible,

    -- Calculate development plan requirement
    CASE 
      WHEN performance_score < 70 THEN true
      ELSE false  
    END as requires_development_plan

  FROM employees 
  WHERE review_period = '2025-Q3'
),

bonus_calculations AS (
  SELECT 
    pd._id,
    pd.bonus_eligible,

    -- Calculate bonus amount
    CASE 
      WHEN pd.performance_score >= 95 THEN u.salary * 0.15  -- 15% bonus
      WHEN pd.performance_score >= 90 THEN u.salary * 0.12  -- 12% bonus  
      WHEN pd.performance_score >= 85 THEN u.salary * 0.10  -- 10% bonus
      ELSE 0
    END as bonus_amount

  FROM performance_data pd
  JOIN users u ON pd._id = u._id
  WHERE pd.bonus_eligible = true
)

-- Execute bulk updates for performance review results
UPDATE users 
SET performance_category = pd.performance_category,
    bonus_eligible = pd.bonus_eligible,
    bonus_amount = COALESCE(bc.bonus_amount, 0),
    requires_development_plan = pd.requires_development_plan,
    last_performance_review = CURRENT_DATE,
    review_status = 'completed'
FROM performance_data pd
LEFT JOIN bonus_calculations bc ON pd._id = bc._id  
WHERE users._id = pd._id;

-- Advanced batch processing with data validation and error handling
-- Bulk data import with validation
INSERT INTO products (sku, name, category, price, stock_quantity, supplier_id, created_at)
SELECT 
  import_sku,
  import_name,
  import_category,
  CAST(import_price AS DECIMAL(10,2)),
  CAST(import_stock AS INTEGER),
  supplier_lookup.supplier_id,
  CURRENT_TIMESTAMP

FROM product_import_staging pis
JOIN suppliers supplier_lookup ON pis.supplier_name = supplier_lookup.name

-- Validation conditions
WHERE import_sku IS NOT NULL
  AND import_name IS NOT NULL  
  AND import_category IN ('Electronics', 'Clothing', 'Books', 'Home', 'Sports')
  AND import_price::DECIMAL(10,2) > 0
  AND import_stock::INTEGER >= 0
  AND supplier_lookup.supplier_id IS NOT NULL

  -- Duplicate check
  AND NOT EXISTS (
    SELECT 1 FROM products p 
    WHERE p.sku = pis.import_sku
  );

-- Bulk inventory adjustments with audit trail
WITH inventory_adjustments AS (
  SELECT 
    product_id,
    adjustment_quantity,
    adjustment_reason,
    adjustment_type, -- 'increase', 'decrease', 'recount'
    CURRENT_TIMESTAMP as adjustment_timestamp,
    'system' as adjusted_by
  FROM inventory_adjustment_queue
  WHERE processed = false
),

stock_calculations AS (
  SELECT 
    ia.product_id,
    p.stock_quantity as current_stock,

    CASE ia.adjustment_type
      WHEN 'increase' THEN p.stock_quantity + ia.adjustment_quantity
      WHEN 'decrease' THEN GREATEST(p.stock_quantity - ia.adjustment_quantity, 0)
      WHEN 'recount' THEN ia.adjustment_quantity
      ELSE p.stock_quantity
    END as new_stock_quantity,

    ia.adjustment_reason,
    ia.adjustment_timestamp,
    ia.adjusted_by

  FROM inventory_adjustments ia
  JOIN products p ON ia.product_id = p._id
)

-- Bulk update product stock levels
UPDATE products 
SET stock_quantity = sc.new_stock_quantity,
    last_stock_update = sc.adjustment_timestamp,
    stock_updated_by = sc.adjusted_by
FROM stock_calculations sc
WHERE products._id = sc.product_id;

-- Insert audit records for inventory changes
INSERT INTO inventory_audit_log (
  product_id,
  previous_stock,
  new_stock,
  adjustment_reason,
  adjustment_timestamp,
  adjusted_by
)
SELECT 
  sc.product_id,
  sc.current_stock,
  sc.new_stock_quantity,
  sc.adjustment_reason,
  sc.adjustment_timestamp,
  sc.adjusted_by
FROM stock_calculations sc;

-- Mark adjustment queue items as processed
UPDATE inventory_adjustment_queue 
SET processed = true,
    processed_at = CURRENT_TIMESTAMP
WHERE processed = false;

-- High-performance bulk operations with monitoring
-- Query for bulk operation performance analysis
WITH operation_metrics AS (
  SELECT 
    DATE_TRUNC('hour', operation_timestamp) as hour_bucket,
    operation_type, -- 'bulk_insert', 'bulk_update', 'bulk_delete'
    collection_name,

    -- Performance metrics
    COUNT(*) as operations_count,
    SUM(documents_processed) as total_documents,
    AVG(execution_time_ms) as avg_execution_time_ms,
    MAX(execution_time_ms) as max_execution_time_ms,
    MIN(execution_time_ms) as min_execution_time_ms,

    -- Throughput calculations
    AVG(throughput_docs_per_second) as avg_throughput_docs_per_sec,
    MAX(throughput_docs_per_second) as max_throughput_docs_per_sec,

    -- Error tracking
    COUNT(*) FILTER (WHERE success = false) as failed_operations,
    COUNT(*) FILTER (WHERE success = true) as successful_operations,

    -- Resource utilization
    AVG(memory_usage_mb) as avg_memory_usage_mb,
    AVG(cpu_utilization_percent) as avg_cpu_utilization

  FROM bulk_operation_log
  WHERE operation_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY DATE_TRUNC('hour', operation_timestamp), operation_type, collection_name
)

SELECT 
  hour_bucket,
  operation_type,
  collection_name,
  operations_count,
  total_documents,

  -- Performance summary
  ROUND(avg_execution_time_ms, 2) as avg_execution_time_ms,
  ROUND(avg_throughput_docs_per_sec, 0) as avg_throughput_docs_per_sec,
  max_throughput_docs_per_sec,

  -- Success rate
  successful_operations,
  failed_operations,
  ROUND((successful_operations::DECIMAL / (successful_operations + failed_operations)) * 100, 2) as success_rate_percent,

  -- Resource efficiency  
  ROUND(avg_memory_usage_mb, 1) as avg_memory_usage_mb,
  ROUND(avg_cpu_utilization, 1) as avg_cpu_utilization_percent,

  -- Performance assessment (the success ratio is recomputed here because the
  -- success_rate_percent alias cannot be referenced within the same SELECT)
  CASE 
    WHEN avg_execution_time_ms < 100 
         AND successful_operations::DECIMAL / NULLIF(successful_operations + failed_operations, 0) > 0.99 THEN 'excellent'
    WHEN avg_execution_time_ms < 500 
         AND successful_operations::DECIMAL / NULLIF(successful_operations + failed_operations, 0) > 0.95 THEN 'good'
    WHEN avg_execution_time_ms < 1000 
         AND successful_operations::DECIMAL / NULLIF(successful_operations + failed_operations, 0) > 0.90 THEN 'acceptable'
    ELSE 'needs_optimization'
  END as performance_rating

FROM operation_metrics
ORDER BY hour_bucket DESC, total_documents DESC;

-- QueryLeaf provides comprehensive bulk operation support:
-- 1. Automatic conversion of SQL batch operations to MongoDB bulk operations
-- 2. Optimal batching strategies based on operation types and data characteristics
-- 3. Advanced error handling with partial success reporting
-- 4. Performance monitoring and optimization recommendations
-- 5. Support for complex data transformations during bulk processing
-- 6. Intelligent resource utilization and concurrency management
-- 7. Integration with MongoDB's native bulk operation optimizations
-- 8. Familiar SQL syntax for complex batch processing workflows

Best Practices for MongoDB Bulk Operations

Performance Optimization Strategies

Essential principles for maximizing bulk operation performance:

  1. Batch Size Optimization: Choose optimal batch sizes based on document size, available memory, and network capacity
  2. Unordered Operations: Use unordered bulk operations when possible for better parallelization and performance
  3. Index Considerations: Account for index maintenance costs during bulk operations; for large initial loads it is usually faster to build secondary indexes after the bulk insert completes rather than before
  4. Write Concern Configuration: Balance consistency requirements with performance using appropriate write concern settings
  5. Error Handling Strategy: Implement comprehensive error handling with partial success reporting and retry logic
  6. Resource Monitoring: Monitor system resources during bulk operations and adjust batch sizes dynamically

Production Deployment Considerations

Optimize bulk operations for enterprise production environments:

  1. Sharding Awareness: Design bulk operations to work efficiently with MongoDB sharded clusters
  2. Replication Optimization: Configure operations to work optimally with replica sets and read preferences
  3. Concurrency Management: Implement appropriate concurrency controls to prevent resource contention
  4. Progress Tracking: Provide comprehensive progress reporting for long-running bulk operations
  5. Memory Management: Monitor and control memory usage during large-scale bulk processing
  6. Performance Monitoring: Implement detailed performance monitoring and alerting for bulk operations
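
For item 4 above, progress tracking can be as simple as slicing the workload and reporting after each batch; this hedged sketch assumes a caller-supplied onProgress callback and an already-connected collection handle:

// Minimal sketch: batched insert with progress reporting
async function importWithProgress(collection, documents, { batchSize = 1000, onProgress } = {}) {
  let processed = 0;
  for (let i = 0; i < documents.length; i += batchSize) {
    const batch = documents.slice(i, i + batchSize);
    const result = await collection.insertMany(batch, { ordered: false });
    processed += result.insertedCount;
    if (onProgress) {
      onProgress({
        processed,
        total: documents.length,
        percent: Math.round((processed / documents.length) * 100)
      });
    }
  }
  return processed;
}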

Conclusion

MongoDB bulk operations provide powerful capabilities for high-throughput data processing that dramatically improve performance compared to single-document operations through intelligent batching, automatic optimization, and comprehensive error handling. The native bulk operation support enables applications to efficiently process large volumes of data while maintaining consistency and providing detailed operational visibility.

Key MongoDB Bulk Operations benefits include:

  • High-Performance Processing: Optimal throughput through intelligent batching and reduced network overhead
  • Flexible Operation Types: Support for mixed bulk operations including inserts, updates, and deletes in single batches
  • Advanced Error Handling: Comprehensive error reporting with partial success tracking and recovery capabilities
  • Resource Optimization: Efficient memory and CPU utilization through optimized batch processing algorithms
  • Production Scalability: Enterprise-ready bulk processing with monitoring, progress tracking, and performance optimization
  • SQL Accessibility: Familiar SQL-style bulk operations through QueryLeaf for accessible high-performance data processing

Whether you're building data import systems, batch processing pipelines, ETL workflows, or high-throughput applications, MongoDB bulk operations with QueryLeaf's familiar SQL interface provide the foundation for efficient, scalable, and reliable batch data processing.

QueryLeaf Integration: QueryLeaf automatically optimizes SQL batch operations into MongoDB bulk operations while providing familiar SQL syntax for complex data processing workflows. Advanced bulk operation patterns, performance monitoring, and error handling are seamlessly handled through familiar SQL constructs, making high-performance batch processing accessible to SQL-oriented development teams.

The combination of MongoDB's robust bulk operation capabilities with SQL-style batch processing operations makes it an ideal platform for applications requiring both high-throughput data processing and familiar database operation patterns, ensuring your batch processing workflows can scale efficiently while maintaining performance and reliability.

MongoDB Backup and Recovery for Enterprise Data Protection: Advanced Disaster Recovery Strategies, Point-in-Time Recovery, and Operational Resilience

Enterprise applications require comprehensive data protection strategies that ensure business continuity during system failures, natural disasters, or data corruption events. Traditional database backup approaches often struggle with the complexity of distributed systems, large data volumes, and the stringent Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) demanded by modern business operations.

MongoDB's distributed architecture and flexible backup mechanisms provide sophisticated data protection capabilities that support everything from simple scheduled backups to complex multi-region disaster recovery scenarios. Unlike traditional relational systems that often require expensive specialized backup software and complex coordination across multiple database instances, MongoDB's replica sets, sharding, and oplog-based recovery enable native, high-performance backup strategies that integrate seamlessly with cloud storage systems and enterprise infrastructure.
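
As a small illustration of why the oplog matters for recovery objectives, the following sketch estimates the replica set's oplog window, the span of time for which oplog-based point-in-time recovery is possible without falling back to an older backup. It assumes a replica set reachable at the given URI and that the driver's Timestamp type exposes Long's getHighBits():

// Minimal sketch: measure the oplog window from the oldest and newest entries
const { MongoClient } = require('mongodb');

async function reportOplogWindow(uri = 'mongodb://localhost:27017') {
  const client = new MongoClient(uri);
  await client.connect();

  const oplog = client.db('local').collection('oplog.rs');
  const [first] = await oplog.find({}, { projection: { ts: 1 } })
    .sort({ $natural: 1 }).limit(1).toArray();
  const [last] = await oplog.find({}, { projection: { ts: 1 } })
    .sort({ $natural: -1 }).limit(1).toArray();

  // The high 32 bits of a BSON Timestamp encode seconds since the Unix epoch
  const windowSeconds = last.ts.getHighBits() - first.ts.getHighBits();
  console.log(`Oplog window: ~${(windowSeconds / 3600).toFixed(1)} hours`);

  await client.close();
  return windowSeconds;
}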

The Traditional Backup Challenge

Conventional database backup approaches face significant limitations when dealing with large-scale distributed applications:

-- Traditional PostgreSQL backup approach - complex and time-consuming

-- Full database backup (blocks database during backup)
pg_dump --host=localhost --port=5432 --username=postgres \
  --format=custom --blobs --verbose --file=full_backup_20240130.dump \
  --schema=public ecommerce_db;

-- Problems with traditional full backups:
-- 1. Database blocking during backup operations
-- 2. Exponentially growing backup sizes
-- 3. Long recovery times for large databases
-- 4. No granular recovery options
-- 5. Complex coordination across multiple database instances
-- 6. Limited point-in-time recovery capabilities
-- 7. Expensive storage requirements for frequent backups
-- 8. Manual intervention required for disaster recovery scenarios

-- Incremental backup simulation (requires complex custom scripting)
BEGIN;

-- Create backup tracking table
CREATE TABLE IF NOT EXISTS backup_tracking (
    backup_id SERIAL PRIMARY KEY,
    backup_type VARCHAR(20) NOT NULL, -- full, incremental, differential
    backup_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_lsn BIGINT,
    backup_size_bytes BIGINT,
    backup_location TEXT NOT NULL,
    backup_status VARCHAR(20) DEFAULT 'in_progress',
    completion_time TIMESTAMP,
    verification_status VARCHAR(20),
    retention_until TIMESTAMP
);

-- Track WAL position for incremental backups
CREATE TABLE IF NOT EXISTS wal_tracking (
    tracking_id SERIAL PRIMARY KEY,
    checkpoint_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    wal_position BIGINT NOT NULL,
    transaction_count BIGINT,
    database_size_bytes BIGINT,
    active_connections INTEGER
);

COMMIT;

-- Complex stored procedure for incremental backup coordination
CREATE OR REPLACE FUNCTION perform_incremental_backup(
    backup_location TEXT,
    compression_level INTEGER DEFAULT 6
)
RETURNS TABLE (
    backup_id INTEGER,
    backup_size_bytes BIGINT,
    duration_seconds INTEGER,
    success BOOLEAN
) AS $$
DECLARE
    current_lsn BIGINT;
    last_backup_lsn BIGINT;
    backup_start_time TIMESTAMP := clock_timestamp();
    new_backup_id INTEGER;
    backup_command TEXT;
    backup_result INTEGER;
BEGIN
    -- Get current WAL position
    SELECT pg_current_wal_lsn() INTO current_lsn;

    -- Get last backup LSN
    SELECT COALESCE(MAX(last_lsn), 0) 
    INTO last_backup_lsn 
    FROM backup_tracking 
    WHERE backup_status = 'completed';

    -- Check if incremental backup is needed
    IF current_lsn <= last_backup_lsn THEN
        RAISE NOTICE 'No changes since last backup, skipping incremental backup';
        RETURN;
    END IF;

    -- Create backup record
    INSERT INTO backup_tracking (
        backup_type, 
        last_lsn, 
        backup_location
    ) 
    VALUES (
        'incremental', 
        current_lsn, 
        backup_location
    ) 
    RETURNING backup_id INTO new_backup_id;

    -- Perform incremental backup (simplified - actual implementation much more complex)
    -- This would require complex WAL shipping and parsing logic
    backup_command := format(
        'pg_basebackup --host=localhost --username=postgres --wal-method=stream --compress=%s --format=tar --pgdata=%s/incremental_%s',
        compression_level,
        backup_location,
        new_backup_id
    );

    -- Execute backup command (in real implementation)
    -- SELECT * FROM system_command(backup_command) INTO backup_result;
    backup_result := 0; -- Simulate success

    IF backup_result = 0 THEN
        -- Update backup record with completion
        UPDATE backup_tracking 
        SET 
            backup_status = 'completed',
            completion_time = clock_timestamp(),
            backup_size_bytes = pg_database_size(current_database())
        WHERE backup_tracking.backup_id = new_backup_id;

        -- Record WAL tracking information
        INSERT INTO wal_tracking (
            wal_position,
            transaction_count,
            database_size_bytes,
            active_connections
        ) VALUES (
            current_lsn,
            (SELECT sum(xact_commit + xact_rollback) FROM pg_stat_database),
            pg_database_size(current_database()),
            (SELECT count(*) FROM pg_stat_activity WHERE state = 'active')
        );

        RETURN QUERY SELECT 
            new_backup_id,
            pg_database_size(current_database()),
            EXTRACT(EPOCH FROM clock_timestamp() - backup_start_time)::INTEGER,
            TRUE;
    ELSE
        -- Mark backup as failed
        UPDATE backup_tracking 
        SET backup_status = 'failed' 
        WHERE backup_tracking.backup_id = new_backup_id;

        RETURN QUERY SELECT 
            new_backup_id,
            0::BIGINT,
            EXTRACT(EPOCH FROM clock_timestamp() - backup_start_time)::INTEGER,
            FALSE;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- Point-in-time recovery simulation (extremely complex in traditional systems)
CREATE OR REPLACE FUNCTION simulate_point_in_time_recovery(
    target_timestamp TIMESTAMP,
    recovery_location TEXT
)
RETURNS TABLE (
    recovery_success BOOLEAN,
    recovered_to_timestamp TIMESTAMP,
    recovery_duration_minutes INTEGER,
    data_loss_minutes INTEGER
) AS $$
DECLARE
    base_backup_id INTEGER;
    target_lsn PG_LSN;
    recovery_start_time TIMESTAMP := clock_timestamp();
    actual_recovery_timestamp TIMESTAMP;
BEGIN
    -- Find appropriate base backup
    SELECT backup_id 
    INTO base_backup_id
    FROM backup_tracking 
    WHERE backup_timestamp <= target_timestamp 
      AND backup_status = 'completed'
      AND backup_type IN ('full', 'differential')
    ORDER BY backup_timestamp DESC 
    LIMIT 1;

    IF base_backup_id IS NULL THEN
        RAISE EXCEPTION 'No suitable base backup found for timestamp %', target_timestamp;
    END IF;

    -- Find target LSN from WAL tracking
    SELECT wal_position 
    INTO target_lsn
    FROM wal_tracking 
    WHERE checkpoint_timestamp <= target_timestamp
    ORDER BY checkpoint_timestamp DESC 
    LIMIT 1;

    -- Simulate complex recovery process
    -- In reality, this involves:
    -- 1. Restoring base backup
    -- 2. Applying WAL files up to target point
    -- 3. Complex validation and consistency checks
    -- 4. Service coordination and failover

    -- Simulate recovery time based on data size and complexity
    PERFORM pg_sleep(
        CASE 
            WHEN pg_database_size(current_database()) > 1073741824 THEN 5 -- Large DB: 5+ minutes
            WHEN pg_database_size(current_database()) > 104857600 THEN 2  -- Medium DB: 2+ minutes
            ELSE 0.5 -- Small DB: 30+ seconds
        END
    );

    actual_recovery_timestamp := target_timestamp - INTERVAL '2 minutes'; -- Simulate slight data loss

    RETURN QUERY SELECT 
        TRUE as recovery_success,
        actual_recovery_timestamp,
        (EXTRACT(EPOCH FROM clock_timestamp() - recovery_start_time) / 60)::INTEGER,
        (EXTRACT(EPOCH FROM target_timestamp - actual_recovery_timestamp) / 60)::INTEGER;

END;
$$ LANGUAGE plpgsql;

-- Disaster recovery coordination (manual and error-prone)
CREATE OR REPLACE FUNCTION coordinate_disaster_recovery(
    disaster_scenario VARCHAR(100),
    recovery_site_location TEXT,
    maximum_data_loss_minutes INTEGER DEFAULT 15
)
RETURNS TABLE (
    step_number INTEGER,
    step_description TEXT,
    step_status VARCHAR(20),
    step_duration_minutes INTEGER,
    success BOOLEAN
) AS $$
DECLARE
    step_counter INTEGER := 0;
    total_start_time TIMESTAMP := clock_timestamp();
    step_start_time TIMESTAMP;
BEGIN
    -- Step 1: Assess disaster scope
    step_counter := step_counter + 1;
    step_start_time := clock_timestamp();

    -- Simulate disaster assessment
    PERFORM pg_sleep(0.5);

    RETURN QUERY SELECT 
        step_counter,
        'Assess disaster scope and determine recovery requirements',
        'completed',
        (EXTRACT(EPOCH FROM clock_timestamp() - step_start_time) / 60)::INTEGER,
        TRUE;

    -- Step 2: Activate disaster recovery site
    step_counter := step_counter + 1;
    step_start_time := clock_timestamp();

    PERFORM pg_sleep(2);

    RETURN QUERY SELECT 
        step_counter,
        'Activate disaster recovery site and initialize infrastructure',
        'completed',
        (EXTRACT(EPOCH FROM clock_timestamp() - step_start_time) / 60)::INTEGER,
        TRUE;

    -- Step 3: Restore latest backup
    step_counter := step_counter + 1;
    step_start_time := clock_timestamp();

    PERFORM pg_sleep(3);

    RETURN QUERY SELECT 
        step_counter,
        'Restore latest full backup to recovery site',
        'completed', 
        (EXTRACT(EPOCH FROM clock_timestamp() - step_start_time) / 60)::INTEGER,
        TRUE;

    -- Step 4: Apply incremental backups and WAL files
    step_counter := step_counter + 1;
    step_start_time := clock_timestamp();

    PERFORM pg_sleep(1.5);

    RETURN QUERY SELECT 
        step_counter,
        'Apply incremental backups and WAL files for point-in-time recovery',
        'completed',
        (EXTRACT(EPOCH FROM clock_timestamp() - step_start_time) / 60)::INTEGER,
        TRUE;

    -- Step 5: Validate data consistency and application connectivity
    step_counter := step_counter + 1;
    step_start_time := clock_timestamp();

    PERFORM pg_sleep(1);

    RETURN QUERY SELECT 
        step_counter,
        'Validate data consistency and test application connectivity',
        'completed',
        (EXTRACT(EPOCH FROM clock_timestamp() - step_start_time) / 60)::INTEGER,
        TRUE;

    -- Step 6: Switch application traffic to recovery site
    step_counter := step_counter + 1;
    step_start_time := clock_timestamp();

    PERFORM pg_sleep(0.5);

    RETURN QUERY SELECT 
        step_counter,
        'Switch application traffic to disaster recovery site',
        'completed',
        (EXTRACT(EPOCH FROM clock_timestamp() - step_start_time) / 60)::INTEGER,
        TRUE;

END;
$$ LANGUAGE plpgsql;

-- Problems with traditional disaster recovery approaches:
-- 1. Complex manual coordination across multiple systems and teams
-- 2. Long recovery times due to sequential restoration process
-- 3. High risk of human error during crisis situations
-- 4. Limited automation and orchestration capabilities
-- 5. Expensive infrastructure duplication requirements
-- 6. Difficult testing and validation of recovery procedures
-- 7. Poor integration with cloud storage and modern infrastructure
-- 8. Limited granular recovery options for specific collections or datasets
-- 9. Complex dependency management across related database systems
-- 10. High operational overhead for maintaining backup infrastructure

MongoDB provides comprehensive backup and recovery capabilities that address these traditional limitations, from native snapshot tooling to oplog-based point-in-time recovery.
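
As a baseline, MongoDB's native utilities already cover consistent snapshots and oplog-aware restores before any custom framework is involved. The sketch below is a minimal illustration of driving mongodump and mongorestore from Node.js; the connection URI and archive path are placeholders, it assumes both tools are installed on the host, and --oplog requires the source deployment to be a replica set:

// Minimal native backup/restore sketch (illustrative assumptions noted above)
const { execFile } = require('child_process');
const { promisify } = require('util');

const run = promisify(execFile);
const MONGO_URI = 'mongodb://localhost:27017'; // placeholder URI

async function snapshotWithMongodump(archivePath) {
  // --oplog captures the oplog tail for a consistent point-in-time snapshot;
  // --gzip compresses the archive stream as it is written
  await run('mongodump', [`--uri=${MONGO_URI}`, `--archive=${archivePath}`, '--gzip', '--oplog']);
}

async function restoreFromArchive(archivePath) {
  // --oplogReplay replays the captured oplog so the restore lands at the snapshot's point in time
  await run('mongorestore', [`--uri=${MONGO_URI}`, `--archive=${archivePath}`, '--gzip', '--oplogReplay']);
}

The manager class below layers scheduling, cataloguing, incremental oplog capture, and disaster recovery orchestration on top of the same primitives: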

// MongoDB Enterprise Backup and Recovery Management System
const { MongoClient, GridFSBucket, Timestamp } = require('mongodb');
const fs = require('fs');
const path = require('path');
const zlib = require('zlib');
const crypto = require('crypto');

// Advanced MongoDB backup and recovery management system
class MongoEnterpriseBackupManager {
  constructor(connectionUri, options = {}) {
    this.client = new MongoClient(connectionUri);
    this.db = null;
    this.gridFS = null;

    // Backup configuration
    this.config = {
      // Backup strategy settings
      backupStrategy: {
        enableFullBackups: options.backupStrategy?.enableFullBackups !== false,
        enableIncrementalBackups: options.backupStrategy?.enableIncrementalBackups !== false,
        fullBackupInterval: options.backupStrategy?.fullBackupInterval || '7d',
        incrementalBackupInterval: options.backupStrategy?.incrementalBackupInterval || '1h',
        retentionPeriod: options.backupStrategy?.retentionPeriod || '90d',
        compressionEnabled: options.backupStrategy?.compressionEnabled !== false,
        encryptionEnabled: options.backupStrategy?.encryptionEnabled || false
      },

      // Storage configuration
      storageSettings: {
        localBackupPath: options.storageSettings?.localBackupPath || './backups',
        cloudStorageEnabled: options.storageSettings?.cloudStorageEnabled || false,
        cloudProvider: options.storageSettings?.cloudProvider || 'aws', // aws, azure, gcp
        cloudBucket: options.storageSettings?.cloudBucket || 'mongodb-backups',
        storageClass: options.storageSettings?.storageClass || 'standard' // standard, infrequent, archive
      },

      // Recovery configuration
      recoverySettings: {
        enablePointInTimeRecovery: options.recoverySettings?.enablePointInTimeRecovery !== false,
        oplogRetentionHours: options.recoverySettings?.oplogRetentionHours || 72,
        parallelRecoveryThreads: options.recoverySettings?.parallelRecoveryThreads || 4,
        recoveryValidationEnabled: options.recoverySettings?.recoveryValidationEnabled !== false
      },

      // Disaster recovery configuration
      disasterRecovery: {
        enableCrossRegionReplication: options.disasterRecovery?.enableCrossRegionReplication || false,
        replicationRegions: options.disasterRecovery?.replicationRegions || [],
        automaticFailover: options.disasterRecovery?.automaticFailover || false,
        rpoMinutes: options.disasterRecovery?.rpoMinutes || 15, // Recovery Point Objective
        rtoMinutes: options.disasterRecovery?.rtoMinutes || 30   // Recovery Time Objective
      }
    };

    // Backup state tracking
    this.backupState = {
      lastFullBackup: null,
      lastIncrementalBackup: null,
      activeBackupOperations: new Map(),
      backupHistory: new Map(),
      recoveryOperations: new Map()
    };

    // Performance metrics
    this.metrics = {
      totalBackupsCreated: 0,
      totalDataBackedUp: 0,
      totalRecoveryOperations: 0,
      averageBackupTime: 0,
      averageRecoveryTime: 0,
      backupSuccessRate: 100,
      lastBackupTimestamp: null
    };
  }

  async initialize(databaseName) {
    console.log('Initializing MongoDB Enterprise Backup Manager...');

    try {
      await this.client.connect();
      this.db = this.client.db(databaseName);
      this.gridFS = new GridFSBucket(this.db, { bucketName: 'backups' });

      // Setup backup management collections
      await this.setupBackupCollections();

      // Initialize backup storage directories
      await this.initializeBackupStorage();

      // Load existing backup history
      await this.loadBackupHistory();

      // Setup automated backup scheduling if enabled
      if (this.config.backupStrategy.enableFullBackups || 
          this.config.backupStrategy.enableIncrementalBackups) {
        this.setupAutomatedBackups();
      }

      console.log('MongoDB Enterprise Backup Manager initialized successfully');

    } catch (error) {
      console.error('Error initializing backup manager:', error);
      throw error;
    }
  }

  // Create comprehensive full backup
  async createFullBackup(options = {}) {
    console.log('Starting full backup operation...');

    const backupId = this.generateBackupId();
    const startTime = Date.now();

    try {
      // Initialize backup operation tracking
      const backupOperation = {
        backupId: backupId,
        backupType: 'full',
        startTime: new Date(startTime),
        status: 'in_progress',
        collections: [],
        totalDocuments: 0,
        totalSize: 0,
        compressionRatio: 0,
        encryptionEnabled: this.config.backupStrategy.encryptionEnabled
      };

      this.backupState.activeBackupOperations.set(backupId, backupOperation);

      // Get list of collections to backup
      const collections = options.collections || await this.getBackupCollections();
      backupOperation.collections = collections.map(c => c.name);

      console.log(`Backing up ${collections.length} collections...`);

      // Create backup metadata
      const backupMetadata = {
        backupId: backupId,
        backupType: 'full',
        timestamp: new Date(),
        databaseName: this.db.databaseName,
        collections: collections.map(c => ({
          name: c.name,
          documentCount: 0,
          avgDocSize: 0,
          totalSize: 0,
          indexes: []
        })),
        backupSize: 0,
        compressionEnabled: this.config.backupStrategy.compressionEnabled,
        encryptionEnabled: this.config.backupStrategy.encryptionEnabled,
        version: '1.0'
      };

      // Backup each collection with metadata
      for (const collectionInfo of collections) {
        const collectionBackup = await this.backupCollection(
          collectionInfo.name, 
          backupId, 
          'full',
          options
        );

        // Update metadata
        const collectionMeta = backupMetadata.collections.find(c => c.name === collectionInfo.name);
        collectionMeta.documentCount = collectionBackup.documentCount;
        collectionMeta.avgDocSize = collectionBackup.avgDocSize;
        collectionMeta.totalSize = collectionBackup.totalSize;
        collectionMeta.indexes = collectionBackup.indexes;

        backupOperation.totalDocuments += collectionBackup.documentCount;
        backupOperation.totalSize += collectionBackup.totalSize;
      }

      // Backup database metadata and indexes
      await this.backupDatabaseMetadata(backupId, backupMetadata);

      // Create backup manifest
      const backupManifest = await this.createBackupManifest(backupId, backupMetadata);

      // Store backup in GridFS
      await this.storeBackupInGridFS(backupId, backupManifest);

      // Upload to cloud storage if enabled
      if (this.config.storageSettings.cloudStorageEnabled) {
        await this.uploadToCloudStorage(backupId, backupManifest);
      }

      // Calculate final metrics
      const endTime = Date.now();
      const duration = endTime - startTime;

      backupOperation.status = 'completed';
      backupOperation.endTime = new Date(endTime);
      backupOperation.duration = duration;
      backupOperation.compressionRatio = this.calculateCompressionRatio(backupOperation.totalSize, backupManifest.compressedSize);

      // Update backup history
      this.backupState.backupHistory.set(backupId, backupOperation);
      this.backupState.lastFullBackup = backupOperation;
      this.backupState.activeBackupOperations.delete(backupId);

      // Update metrics
      this.updateBackupMetrics(backupOperation);

      // Log backup completion
      await this.logBackupOperation(backupOperation);

      console.log(`Full backup completed successfully: ${backupId}`);
      console.log(`Duration: ${Math.round(duration / 1000)}s, Size: ${Math.round(backupOperation.totalSize / 1024 / 1024)}MB`);

      return {
        backupId: backupId,
        backupType: 'full',
        duration: duration,
        totalSize: backupOperation.totalSize,
        collections: backupOperation.collections.length,
        totalDocuments: backupOperation.totalDocuments,
        compressionRatio: backupOperation.compressionRatio,
        success: true
      };

    } catch (error) {
      console.error(`Full backup failed: ${backupId}`, error);

      // Update backup operation status
      const backupOperation = this.backupState.activeBackupOperations.get(backupId);
      if (backupOperation) {
        backupOperation.status = 'failed';
        backupOperation.error = error.message;
        backupOperation.endTime = new Date();

        // Move to history
        this.backupState.backupHistory.set(backupId, backupOperation);
        this.backupState.activeBackupOperations.delete(backupId);
      }

      throw error;
    }
  }

  // Create incremental backup based on oplog
  async createIncrementalBackup(options = {}) {
    console.log('Starting incremental backup operation...');

    if (!this.backupState.lastFullBackup) {
      throw new Error('No full backup found. Full backup required before incremental backup.');
    }

    const backupId = this.generateBackupId();
    const startTime = Date.now();

    try {
      // Get oplog entries since last backup
      const lastBackupTime = this.backupState.lastIncrementalBackup?.endTime || 
                             this.backupState.lastFullBackup.endTime;

      const oplogEntries = await this.getOplogEntries(lastBackupTime, options);

      console.log(`Processing ${oplogEntries.length} oplog entries for incremental backup...`);

      const backupOperation = {
        backupId: backupId,
        backupType: 'incremental',
        startTime: new Date(startTime),
        status: 'in_progress',
        baseBackupId: this.backupState.lastFullBackup.backupId,
        oplogEntries: oplogEntries.length,
        affectedCollections: new Set(),
        totalSize: 0
      };

      this.backupState.activeBackupOperations.set(backupId, backupOperation);

      // Process oplog entries and create incremental backup data
      const incrementalData = await this.processOplogForBackup(oplogEntries, backupId);

      // Update operation with processed data
      backupOperation.affectedCollections = Array.from(incrementalData.affectedCollections);
      backupOperation.totalSize = incrementalData.totalSize;

      // Create incremental backup manifest
      const incrementalManifest = {
        backupId: backupId,
        backupType: 'incremental',
        timestamp: new Date(),
        baseBackupId: this.backupState.lastFullBackup.backupId,
        oplogStartTime: lastBackupTime,
        oplogEndTime: new Date(),
        oplogEntries: oplogEntries.length,
        affectedCollections: backupOperation.affectedCollections,
        incrementalSize: incrementalData.totalSize
      };

      // Store incremental backup
      await this.storeIncrementalBackup(backupId, incrementalData, incrementalManifest);

      // Upload to cloud storage if enabled
      if (this.config.storageSettings.cloudStorageEnabled) {
        await this.uploadIncrementalToCloud(backupId, incrementalManifest);
      }

      // Complete backup operation
      const endTime = Date.now();
      const duration = endTime - startTime;

      backupOperation.status = 'completed';
      backupOperation.endTime = new Date(endTime);
      backupOperation.duration = duration;

      // Update backup state
      this.backupState.backupHistory.set(backupId, backupOperation);
      this.backupState.lastIncrementalBackup = backupOperation;
      this.backupState.activeBackupOperations.delete(backupId);

      // Update metrics
      this.updateBackupMetrics(backupOperation);

      // Log backup completion
      await this.logBackupOperation(backupOperation);

      console.log(`Incremental backup completed successfully: ${backupId}`);
      console.log(`Duration: ${Math.round(duration / 1000)}s, Oplog entries: ${oplogEntries.length}`);

      return {
        backupId: backupId,
        backupType: 'incremental',
        duration: duration,
        oplogEntries: oplogEntries.length,
        affectedCollections: backupOperation.affectedCollections.length,
        totalSize: backupOperation.totalSize,
        success: true
      };

    } catch (error) {
      console.error(`Incremental backup failed: ${backupId}`, error);

      const backupOperation = this.backupState.activeBackupOperations.get(backupId);
      if (backupOperation) {
        backupOperation.status = 'failed';
        backupOperation.error = error.message;
        backupOperation.endTime = new Date();

        this.backupState.backupHistory.set(backupId, backupOperation);
        this.backupState.activeBackupOperations.delete(backupId);
      }

      throw error;
    }
  }

  // Advanced point-in-time recovery
  async performPointInTimeRecovery(targetTimestamp, options = {}) {
    console.log(`Starting point-in-time recovery to ${targetTimestamp}...`);

    const recoveryId = this.generateRecoveryId();
    const startTime = Date.now();

    try {
      // Find appropriate backup chain for target timestamp
      const backupChain = await this.findBackupChain(targetTimestamp);

      if (!backupChain || backupChain.length === 0) {
        throw new Error(`No suitable backup found for timestamp: ${targetTimestamp}`);
      }

      console.log(`Using backup chain: ${backupChain.map(b => b.backupId).join(' -> ')}`);

      const recoveryOperation = {
        recoveryId: recoveryId,
        recoveryType: 'point_in_time',
        targetTimestamp: targetTimestamp,
        startTime: new Date(startTime),
        status: 'in_progress',
        backupChain: backupChain,
        recoveryDatabase: options.recoveryDatabase || `${this.db.databaseName}_recovery_${recoveryId}`,
        totalSteps: 0,
        completedSteps: 0
      };

      this.backupState.recoveryOperations.set(recoveryId, recoveryOperation);

      // Create recovery database
      const recoveryDb = this.client.db(recoveryOperation.recoveryDatabase);

      // Step 1: Restore base full backup
      console.log('Restoring base full backup...');
      await this.restoreFullBackup(backupChain[0], recoveryDb, recoveryOperation);
      recoveryOperation.completedSteps++;

      // Step 2: Apply incremental backups in sequence
      for (let i = 1; i < backupChain.length; i++) {
        console.log(`Applying incremental backup ${i}/${backupChain.length - 1}...`);
        await this.applyIncrementalBackup(backupChain[i], recoveryDb, recoveryOperation);
        recoveryOperation.completedSteps++;
      }

      // Step 3: Apply oplog entries up to target timestamp
      console.log('Applying oplog entries for point-in-time recovery...');
      await this.applyOplogToTimestamp(targetTimestamp, recoveryDb, recoveryOperation);
      recoveryOperation.completedSteps++;

      // Step 4: Validate recovered database
      if (this.config.recoverySettings.recoveryValidationEnabled) {
        console.log('Validating recovered database...');
        await this.validateRecoveredDatabase(recoveryDb, recoveryOperation);
        recoveryOperation.completedSteps++;
      }

      // Complete recovery operation
      const endTime = Date.now();
      const duration = endTime - startTime;

      recoveryOperation.status = 'completed';
      recoveryOperation.endTime = new Date(endTime);
      recoveryOperation.duration = duration;
      recoveryOperation.actualRecoveryTimestamp = await this.getLatestTimestampFromDb(recoveryDb);

      // Calculate data loss
      const dataLoss = targetTimestamp - recoveryOperation.actualRecoveryTimestamp;
      recoveryOperation.dataLossMs = Math.max(0, dataLoss);

      // Update metrics
      this.updateRecoveryMetrics(recoveryOperation);

      // Log recovery completion
      await this.logRecoveryOperation(recoveryOperation);

      console.log(`Point-in-time recovery completed successfully: ${recoveryId}`);
      console.log(`Recovery database: ${recoveryOperation.recoveryDatabase}`);
      console.log(`Duration: ${Math.round(duration / 1000)}s, Data loss: ${Math.round(dataLoss / 1000)}s`);

      return {
        recoveryId: recoveryId,
        recoveryType: 'point_in_time',
        duration: duration,
        recoveryDatabase: recoveryOperation.recoveryDatabase,
        actualRecoveryTimestamp: recoveryOperation.actualRecoveryTimestamp,
        dataLossMs: recoveryOperation.dataLossMs,
        backupChainLength: backupChain.length,
        success: true
      };

    } catch (error) {
      console.error(`Point-in-time recovery failed: ${recoveryId}`, error);

      const recoveryOperation = this.backupState.recoveryOperations.get(recoveryId);
      if (recoveryOperation) {
        recoveryOperation.status = 'failed';
        recoveryOperation.error = error.message;
        recoveryOperation.endTime = new Date();
      }

      throw error;
    }
  }

  // Disaster recovery orchestration
  async orchestrateDisasterRecovery(disasterScenario, options = {}) {
    console.log(`Orchestrating disaster recovery for scenario: ${disasterScenario}`);

    const recoveryId = this.generateRecoveryId();
    const startTime = Date.now();

    try {
      const disasterRecoveryOperation = {
        recoveryId: recoveryId,
        recoveryType: 'disaster_recovery',
        disasterScenario: disasterScenario,
        startTime: new Date(startTime),
        status: 'in_progress',
        steps: [],
        currentStep: 0,
        recoveryRegion: options.recoveryRegion || 'primary',
        targetRPO: this.config.disasterRecovery.rpoMinutes,
        targetRTO: this.config.disasterRecovery.rtoMinutes
      };

      this.backupState.recoveryOperations.set(recoveryId, disasterRecoveryOperation);

      // Define disaster recovery steps
      const recoverySteps = [
        {
          step: 1,
          description: 'Assess disaster scope and activate recovery procedures',
          action: this.assessDisasterScope.bind(this),
          estimatedDuration: 2
        },
        {
          step: 2, 
          description: 'Initialize disaster recovery infrastructure',
          action: this.initializeRecoveryInfrastructure.bind(this),
          estimatedDuration: 5
        },
        {
          step: 3,
          description: 'Locate and prepare latest backup chain',
          action: this.prepareDisasterRecoveryBackups.bind(this),
          estimatedDuration: 3
        },
        {
          step: 4,
          description: 'Restore database from backup chain',
          action: this.restoreDisasterRecoveryDatabase.bind(this),
          estimatedDuration: 15
        },
        {
          step: 5,
          description: 'Validate data consistency and integrity',
          action: this.validateDisasterRecoveryDatabase.bind(this),
          estimatedDuration: 3
        },
        {
          step: 6,
          description: 'Switch application traffic to recovery site',
          action: this.switchToRecoverySite.bind(this),
          estimatedDuration: 2
        }
      ];

      disasterRecoveryOperation.steps = recoverySteps;
      disasterRecoveryOperation.totalSteps = recoverySteps.length;

      // Execute recovery steps sequentially
      for (const step of recoverySteps) {
        console.log(`Executing step ${step.step}: ${step.description}`);
        disasterRecoveryOperation.currentStep = step.step;

        const stepStartTime = Date.now();

        try {
          await step.action(disasterRecoveryOperation, options);

          step.status = 'completed';
          step.actualDuration = Math.round((Date.now() - stepStartTime) / 1000 / 60);

          console.log(`Step ${step.step} completed in ${step.actualDuration} minutes`);

        } catch (stepError) {
          step.status = 'failed';
          step.error = stepError.message;
          step.actualDuration = Math.round((Date.now() - stepStartTime) / 1000 / 60);

          console.error(`Step ${step.step} failed:`, stepError);
          throw stepError;
        }
      }

      // Complete disaster recovery
      const endTime = Date.now();
      const totalDuration = Math.round((endTime - startTime) / 1000 / 60);

      disasterRecoveryOperation.status = 'completed';
      disasterRecoveryOperation.endTime = new Date(endTime);
      disasterRecoveryOperation.totalDuration = totalDuration;
      disasterRecoveryOperation.rtoAchieved = totalDuration <= this.config.disasterRecovery.rtoMinutes;

      // Update metrics
      this.updateRecoveryMetrics(disasterRecoveryOperation);

      // Log disaster recovery completion
      await this.logRecoveryOperation(disasterRecoveryOperation);

      console.log(`Disaster recovery completed successfully: ${recoveryId}`);
      console.log(`Total duration: ${totalDuration} minutes (RTO target: ${this.config.disasterRecovery.rtoMinutes} minutes)`);

      return {
        recoveryId: recoveryId,
        recoveryType: 'disaster_recovery',
        totalDuration: totalDuration,
        rtoAchieved: disasterRecoveryOperation.rtoAchieved,
        stepsCompleted: recoverySteps.filter(s => s.status === 'completed').length,
        totalSteps: recoverySteps.length,
        success: true
      };

    } catch (error) {
      console.error(`Disaster recovery failed: ${recoveryId}`, error);

      const recoveryOperation = this.backupState.recoveryOperations.get(recoveryId);
      if (recoveryOperation) {
        recoveryOperation.status = 'failed';
        recoveryOperation.error = error.message;
        recoveryOperation.endTime = new Date();
      }

      throw error;
    }
  }

  // Backup individual collection with compression and encryption
  async backupCollection(collectionName, backupId, backupType, options) {
    console.log(`Backing up collection: ${collectionName}`);

    const collection = this.db.collection(collectionName);
    const backupData = {
      collectionName: collectionName,
      backupId: backupId,
      backupType: backupType,
      timestamp: new Date(),
      documents: [],
      indexes: [],
      documentCount: 0,
      totalSize: 0,
      avgDocSize: 0
    };

    try {
      // Get collection stats via the collStats command (collection.stats() is deprecated in newer drivers)
      const stats = await this.db.command({ collStats: collectionName });
      backupData.documentCount = stats.count || 0;
      backupData.totalSize = stats.size || 0;
      backupData.avgDocSize = backupData.documentCount > 0 ? backupData.totalSize / backupData.documentCount : 0;

      // Backup collection indexes
      const indexes = await collection.listIndexes().toArray();
      backupData.indexes = indexes.filter(idx => idx.name !== '_id_'); // Exclude default _id index

      // Stream collection documents for memory-efficient backup
      const cursor = collection.find({});
      const documents = [];

      while (await cursor.hasNext()) {
        const doc = await cursor.next();
        documents.push(doc);

        // Process in batches to manage memory usage
        if (documents.length >= 1000) {
          await this.processBatch(documents, backupData, backupId, collectionName);
          documents.length = 0; // Clear batch
        }
      }

      // Process remaining documents
      if (documents.length > 0) {
        await this.processBatch(documents, backupData, backupId, collectionName);
      }

      console.log(`Collection backup completed: ${collectionName} (${backupData.documentCount} documents)`);

      return backupData;

    } catch (error) {
      console.error(`Error backing up collection ${collectionName}:`, error);
      throw error;
    }
  }

  // Process document batch with compression and encryption
  async processBatch(documents, backupData, backupId, collectionName) {
    // Serialize documents to JSON
    const batchData = JSON.stringify(documents);

    // Apply compression if enabled
    let processedData = Buffer.from(batchData, 'utf8');
    if (this.config.backupStrategy.compressionEnabled) {
      processedData = zlib.gzipSync(processedData);
    }

    // Apply encryption if enabled  
    if (this.config.backupStrategy.encryptionEnabled) {
      processedData = this.encryptData(processedData);
    }

    // Store batch data (implementation would store to GridFS or file system)
    const batchId = `${backupId}_${collectionName}_${Date.now()}`;
    await this.storeBatch(batchId, processedData);

    backupData.documents.push({
      batchId: batchId,
      documentCount: documents.length,
      compressedSize: processedData.length,
      originalSize: Buffer.byteLength(batchData, 'utf8')
    });
  }

  // Get oplog entries for incremental backup
  async getOplogEntries(fromTimestamp, options = {}) {
    console.log(`Retrieving oplog entries from ${fromTimestamp}...`);

    try {
      const oplogDb = this.client.db('local');
      const oplogCollection = oplogDb.collection('oplog.rs');

      // Oplog `ts` values are BSON Timestamps, so convert the Date boundary before querying
      const fromTs = new Timestamp({ t: Math.floor(new Date(fromTimestamp).getTime() / 1000), i: 0 });

      // Query oplog for entries since last backup
      const query = {
        ts: { $gt: fromTs },
        ns: { $regex: `^${this.db.databaseName}\\.` }, // Only our database
        op: { $in: ['i', 'u', 'd'] } // Insert, update, delete operations
      };

      // Exclude certain collections from oplog backup
      const excludeCollections = options.excludeCollections || ['backups.files', 'backups.chunks'];
      if (excludeCollections.length > 0) {
        query.ns = {
          $regex: `^${this.db.databaseName}\\.`,
          $nin: excludeCollections.map(col => `${this.db.databaseName}.${col}`)
        };
      }

      const oplogEntries = await oplogCollection
        .find(query)
        .sort({ ts: 1 })
        .limit(options.maxEntries || 100000)
        .toArray();

      console.log(`Retrieved ${oplogEntries.length} oplog entries`);

      return oplogEntries;

    } catch (error) {
      console.error('Error retrieving oplog entries:', error);
      throw error;
    }
  }

  // Process oplog entries for incremental backup
  async processOplogForBackup(oplogEntries, backupId) {
    console.log('Processing oplog entries for incremental backup...');

    const incrementalData = {
      backupId: backupId,
      oplogEntries: oplogEntries,
      affectedCollections: new Set(),
      totalSize: 0,
      operationCounts: {
        inserts: 0,
        updates: 0,
        deletes: 0
      }
    };

    // Group oplog entries by collection
    const collectionOps = new Map();

    for (const entry of oplogEntries) {
      const collectionName = entry.ns.slice(entry.ns.indexOf('.') + 1); // full collection name (may contain dots)
      incrementalData.affectedCollections.add(collectionName);

      if (!collectionOps.has(collectionName)) {
        collectionOps.set(collectionName, []);
      }
      collectionOps.get(collectionName).push(entry);

      // Count operation types
      switch (entry.op) {
        case 'i': incrementalData.operationCounts.inserts++; break;
        case 'u': incrementalData.operationCounts.updates++; break;  
        case 'd': incrementalData.operationCounts.deletes++; break;
      }
    }

    // Process and store oplog data per collection
    for (const [collectionName, ops] of collectionOps) {
      const collectionOplogData = JSON.stringify(ops);
      let processedData = Buffer.from(collectionOplogData, 'utf8');

      // Apply compression
      if (this.config.backupStrategy.compressionEnabled) {
        processedData = zlib.gzipSync(processedData);
      }

      // Apply encryption
      if (this.config.backupStrategy.encryptionEnabled) {
        processedData = this.encryptData(processedData);
      }

      // Store incremental data
      const incrementalId = `${backupId}_oplog_${collectionName}`;
      await this.storeIncrementalData(incrementalId, processedData);

      incrementalData.totalSize += processedData.length;
    }

    console.log(`Processed oplog for ${incrementalData.affectedCollections.size} collections`);

    return incrementalData;
  }

  // Comprehensive backup analytics and monitoring
  async getBackupAnalytics(timeRange = '30d') {
    console.log('Generating backup and recovery analytics...');

    const timeRanges = {
      '1d': 1,
      '7d': 7,
      '30d': 30,
      '90d': 90
    };

    const days = timeRanges[timeRange] || 30;
    const startDate = new Date(Date.now() - (days * 24 * 60 * 60 * 1000));

    try {
      // Get backup history from database
      const backupHistory = await this.db.collection('backup_operations')
        .find({
          startTime: { $gte: startDate }
        })
        .sort({ startTime: -1 })
        .toArray();

      // Get recovery history
      const recoveryHistory = await this.db.collection('recovery_operations')
        .find({
          startTime: { $gte: startDate }
        })
        .sort({ startTime: -1 })
        .toArray();

      // Calculate analytics
      const analytics = {
        reportGeneratedAt: new Date(),
        timeRange: timeRange,

        // Backup statistics
        backupStatistics: {
          totalBackups: backupHistory.length,
          fullBackups: backupHistory.filter(b => b.backupType === 'full').length,
          incrementalBackups: backupHistory.filter(b => b.backupType === 'incremental').length,
          successfulBackups: backupHistory.filter(b => b.status === 'completed').length,
          failedBackups: backupHistory.filter(b => b.status === 'failed').length,
          successRate: backupHistory.length > 0 
            ? (backupHistory.filter(b => b.status === 'completed').length / backupHistory.length) * 100 
            : 0,

          // Size and performance metrics
          totalDataBackedUp: backupHistory
            .filter(b => b.status === 'completed')
            .reduce((sum, b) => sum + (b.totalSize || 0), 0),
          averageBackupSize: 0,
          averageBackupDuration: 0,
          averageCompressionRatio: 0
        },

        // Recovery statistics  
        recoveryStatistics: {
          totalRecoveryOperations: recoveryHistory.length,
          pointInTimeRecoveries: recoveryHistory.filter(r => r.recoveryType === 'point_in_time').length,
          disasterRecoveries: recoveryHistory.filter(r => r.recoveryType === 'disaster_recovery').length,
          successfulRecoveries: recoveryHistory.filter(r => r.status === 'completed').length,
          failedRecoveries: recoveryHistory.filter(r => r.status === 'failed').length,
          recoverySuccessRate: recoveryHistory.length > 0 
            ? (recoveryHistory.filter(r => r.status === 'completed').length / recoveryHistory.length) * 100 
            : 0,

          // Performance metrics
          averageRecoveryDuration: 0,
          averageDataLoss: 0,
          rtoCompliance: 0,
          rpoCompliance: 0
        },

        // System health indicators
        systemHealth: {
          backupFrequency: this.calculateBackupFrequency(backupHistory),
          storageUtilization: await this.calculateStorageUtilization(),
          lastSuccessfulBackup: backupHistory.find(b => b.status === 'completed'),
          nextScheduledBackup: this.getNextScheduledBackup(),
          alertsAndWarnings: []
        },

        // Detailed backup history
        recentBackups: backupHistory.slice(0, 10),
        recentRecoveries: recoveryHistory.slice(0, 5)
      };

      // Calculate averages
      const completedBackups = backupHistory.filter(b => b.status === 'completed');
      if (completedBackups.length > 0) {
        analytics.backupStatistics.averageBackupSize = 
          analytics.backupStatistics.totalDataBackedUp / completedBackups.length;
        analytics.backupStatistics.averageBackupDuration = 
          completedBackups.reduce((sum, b) => sum + (b.duration || 0), 0) / completedBackups.length;
        analytics.backupStatistics.averageCompressionRatio = 
          completedBackups.reduce((sum, b) => sum + (b.compressionRatio || 1), 0) / completedBackups.length;
      }

      const completedRecoveries = recoveryHistory.filter(r => r.status === 'completed');
      if (completedRecoveries.length > 0) {
        analytics.recoveryStatistics.averageRecoveryDuration = 
          completedRecoveries.reduce((sum, r) => sum + (r.duration || 0), 0) / completedRecoveries.length;
        analytics.recoveryStatistics.averageDataLoss = 
          completedRecoveries.reduce((sum, r) => sum + (r.dataLossMs || 0), 0) / completedRecoveries.length;
      }

      // Generate alerts and warnings
      analytics.systemHealth.alertsAndWarnings = this.generateHealthAlerts(analytics);

      return analytics;

    } catch (error) {
      console.error('Error generating backup analytics:', error);
      throw error;
    }
  }

  // Utility methods
  async setupBackupCollections() {
    // Create indexes for backup management collections
    await this.db.collection('backup_operations').createIndexes([
      { key: { backupId: 1 }, unique: true },
      { key: { backupType: 1, startTime: -1 } },
      { key: { status: 1, startTime: -1 } },
      { key: { startTime: -1 } }
    ]);

    await this.db.collection('recovery_operations').createIndexes([
      { key: { recoveryId: 1 }, unique: true },
      { key: { recoveryType: 1, startTime: -1 } },
      { key: { status: 1, startTime: -1 } }
    ]);
  }

  async initializeBackupStorage() {
    // Create backup storage directories
    if (!fs.existsSync(this.config.storageSettings.localBackupPath)) {
      fs.mkdirSync(this.config.storageSettings.localBackupPath, { recursive: true });
    }
  }

  generateBackupId() {
    return `backup_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
  }

  generateRecoveryId() {
    return `recovery_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
  }

  calculateCompressionRatio(originalSize, compressedSize) {
    // Guard against missing or zero compressed sizes to avoid NaN/Infinity ratios
    return originalSize > 0 && compressedSize > 0 ? originalSize / compressedSize : 1;
  }

  encryptData(data) {
    // Simplified encryption - in production, derive keys from a KMS/secret store
    // and persist the IV (and an auth tag if using GCM) alongside the ciphertext
    const key = crypto.scryptSync('backup-encryption-key', 'backup-salt', 32);
    const iv = crypto.randomBytes(16);
    const cipher = crypto.createCipheriv('aes-256-cbc', key, iv);
    return Buffer.concat([iv, cipher.update(data), cipher.final()]);
  }

  async storeBatch(batchId, data) {
    // Store batch data in GridFS
    const uploadStream = this.gridFS.openUploadStream(batchId);
    uploadStream.end(data);
    return new Promise((resolve, reject) => {
      uploadStream.on('finish', resolve);
      uploadStream.on('error', reject);
    });
  }

  async logBackupOperation(backupOperation) {
    await this.db.collection('backup_operations').insertOne({
      ...backupOperation,
      loggedAt: new Date()
    });
  }

  async logRecoveryOperation(recoveryOperation) {
    await this.db.collection('recovery_operations').insertOne({
      ...recoveryOperation,
      loggedAt: new Date()
    });
  }

  // Placeholder methods for complex operations (stubs for every helper referenced above)
  async getBackupCollections() { /* Implementation */ return []; }
  async backupDatabaseMetadata(backupId, metadata) { /* Implementation */ }
  async createBackupManifest(backupId, metadata) { /* Implementation */ return {}; }
  async storeBackupInGridFS(backupId, manifest) { /* Implementation */ }
  async uploadToCloudStorage(backupId, manifest) { /* Implementation */ }
  async storeIncrementalBackup(backupId, data, manifest) { /* Implementation */ }
  async storeIncrementalData(incrementalId, data) { /* Implementation */ }
  async uploadIncrementalToCloud(backupId, manifest) { /* Implementation */ }
  async loadBackupHistory() { /* Implementation */ }
  setupAutomatedBackups() { /* Implementation */ }
  async findBackupChain(timestamp) { /* Implementation */ return []; }
  async restoreFullBackup(backup, db, operation) { /* Implementation */ }
  async applyIncrementalBackup(backup, db, operation) { /* Implementation */ }
  async applyOplogToTimestamp(timestamp, db, operation) { /* Implementation */ }
  async validateRecoveredDatabase(db, operation) { /* Implementation */ }
  async getLatestTimestampFromDb(db) { /* Implementation */ return new Date(); }
  async assessDisasterScope(operation, options) { /* Implementation */ }
  async initializeRecoveryInfrastructure(operation, options) { /* Implementation */ }
  async prepareDisasterRecoveryBackups(operation, options) { /* Implementation */ }
  async restoreDisasterRecoveryDatabase(operation, options) { /* Implementation */ }
  async validateDisasterRecoveryDatabase(operation, options) { /* Implementation */ }
  async switchToRecoverySite(operation, options) { /* Implementation */ }
  async calculateStorageUtilization() { /* Implementation */ return 0; }
  calculateBackupFrequency(backupHistory) { /* Implementation */ return null; }
  getNextScheduledBackup() { /* Implementation */ return null; }
  generateHealthAlerts(analytics) { /* Implementation */ return []; }

  updateBackupMetrics(operation) {
    this.metrics.totalBackupsCreated++;
    this.metrics.totalDataBackedUp += operation.totalSize || 0;
    this.metrics.lastBackupTimestamp = operation.endTime;
  }

  updateRecoveryMetrics(operation) {
    this.metrics.totalRecoveryOperations++;
    // Update other recovery metrics
  }
}

// Example usage demonstrating comprehensive backup and recovery
async function demonstrateEnterpriseBackupRecovery() {
  const backupManager = new MongoEnterpriseBackupManager('mongodb://localhost:27017');

  try {
    await backupManager.initialize('production_ecommerce');

    console.log('Performing full backup...');
    const fullBackupResult = await backupManager.createFullBackup();
    console.log('Full backup result:', fullBackupResult);

    // Simulate some data changes
    console.log('Simulating data changes...');
    await new Promise(resolve => setTimeout(resolve, 2000));

    console.log('Performing incremental backup...');
    const incrementalBackupResult = await backupManager.createIncrementalBackup();
    console.log('Incremental backup result:', incrementalBackupResult);

    // Demonstrate point-in-time recovery
    const recoveryTimestamp = new Date(Date.now() - 60000); // 1 minute ago
    console.log('Performing point-in-time recovery...');
    const recoveryResult = await backupManager.performPointInTimeRecovery(recoveryTimestamp);
    console.log('Recovery result:', recoveryResult);

    // Generate analytics report
    const analytics = await backupManager.getBackupAnalytics('30d');
    console.log('Backup Analytics:', JSON.stringify(analytics, null, 2));

  } catch (error) {
    console.error('Backup and recovery demonstration error:', error);
  }
}

module.exports = {
  MongoEnterpriseBackupManager,
  demonstrateEnterpriseBackupRecovery
};
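
The setupAutomatedBackups() hook referenced during initialization is left abstract above. A minimal scheduling sketch, assuming the '7d'/'1h' interval strings from the default configuration and deliberately omitting overlap protection and persistence, might look like this:

// Hypothetical scheduler for the setupAutomatedBackups() stub above
function parseIntervalMs(interval) {
  // Accepts simple duration strings such as '30s', '15m', '4h', '7d'
  const match = /^(\d+)([smhd])$/.exec(interval);
  if (!match) throw new Error(`Unsupported interval: ${interval}`);
  const unitMs = { s: 1000, m: 60 * 1000, h: 60 * 60 * 1000, d: 24 * 60 * 60 * 1000 };
  return Number(match[1]) * unitMs[match[2]];
}

function scheduleBackups(manager) {
  const { fullBackupInterval, incrementalBackupInterval } = manager.config.backupStrategy;

  // Full backups on the long interval, incremental (oplog-based) backups on the short one
  setInterval(() => {
    manager.createFullBackup().catch(err => console.error('Scheduled full backup failed:', err));
  }, parseIntervalMs(fullBackupInterval));

  setInterval(() => {
    manager.createIncrementalBackup().catch(err => console.error('Scheduled incremental backup failed:', err));
  }, parseIntervalMs(incrementalBackupInterval));
}

A production scheduler would also skip runs while a previous backup is still active and persist its schedule state across process restarts.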

QueryLeaf Backup and Recovery Integration

QueryLeaf provides SQL-familiar syntax for MongoDB backup and recovery operations, so backup strategies, restores, and monitoring queries can be expressed as declarative statements rather than bespoke scripts.
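
Conceptually, each QueryLeaf statement translates into driver-level operations like those implemented by the manager class above. The sketch below shows a hypothetical equivalent of a scheduled full backup; the option names and collection list are illustrative, not QueryLeaf's actual translation output:

// Hypothetical driver-level equivalent of a QueryLeaf full-backup statement
async function runWeeklyFullBackup() {
  const backupManager = new MongoEnterpriseBackupManager('mongodb://localhost:27017', {
    backupStrategy: { compressionEnabled: true, encryptionEnabled: true }
  });

  await backupManager.initialize('ecommerce');

  // Collections are normally discovered automatically; listed here only for illustration
  const result = await backupManager.createFullBackup({
    collections: [{ name: 'orders' }, { name: 'customers' }, { name: 'products' }]
  });

  console.log(`Backup ${result.backupId} finished in ${Math.round(result.duration / 1000)}s`);
}

The SQL-style statements below show the declarative surface QueryLeaf exposes for the same workflows: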

-- QueryLeaf backup and recovery with SQL-style commands

-- Create comprehensive backup strategy configuration
CREATE BACKUP_STRATEGY enterprise_production AS (
  -- Strategy identification
  strategy_name = 'enterprise_production_backups',
  strategy_description = 'Production environment backup strategy with disaster recovery',

  -- Backup scheduling configuration
  full_backup_schedule = JSON_OBJECT(
    'frequency', 'weekly',
    'day_of_week', 'sunday', 
    'time', '02:00:00',
    'timezone', 'UTC'
  ),

  incremental_backup_schedule = JSON_OBJECT(
    'frequency', 'hourly',
    'interval_hours', 4,
    'business_hours_only', false
  ),

  -- Data retention policy
  retention_policy = JSON_OBJECT(
    'full_backups_retention_days', 90,
    'incremental_backups_retention_days', 30,
    'archive_after_days', 365,
    'permanent_retention_monthly', true
  ),

  -- Storage configuration
  storage_configuration = JSON_OBJECT(
    'primary_storage', JSON_OBJECT(
      'type', 'cloud',
      'provider', 'aws',
      'bucket', 'enterprise-mongodb-backups',
      'region', 'us-east-1',
      'storage_class', 'standard'
    ),
    'secondary_storage', JSON_OBJECT(
      'type', 'cloud',
      'provider', 'azure',
      'container', 'backup-replica',
      'region', 'east-us-2',
      'storage_class', 'cool'
    ),
    'local_cache', JSON_OBJECT(
      'enabled', true,
      'path', '/backup/cache',
      'max_size_gb', 500
    )
  ),

  -- Compression and encryption settings
  data_protection = JSON_OBJECT(
    'compression_enabled', true,
    'compression_algorithm', 'gzip',
    'compression_level', 6,
    'encryption_enabled', true,
    'encryption_algorithm', 'AES-256',
    'key_rotation_days', 90
  ),

  -- Performance and resource limits
  performance_settings = JSON_OBJECT(
    'max_concurrent_backups', 3,
    'backup_bandwidth_limit_mbps', 100,
    'memory_limit_gb', 8,
    'backup_timeout_hours', 6,
    'parallel_collection_backups', true
  )
);

-- Execute full backup with comprehensive options
EXECUTE BACKUP full_backup_production WITH OPTIONS (
  -- Backup scope
  backup_type = 'full',
  databases = JSON_ARRAY('ecommerce', 'analytics', 'user_management'),
  include_system_collections = true,
  include_indexes = true,

  -- Quality and validation
  verify_backup_integrity = true,
  test_restore_sample = true,
  backup_checksum_validation = true,

  -- Performance optimization
  batch_size = 1000,
  parallel_collections = 4,
  compression_level = 6,

  -- Metadata and tracking
  backup_tags = JSON_OBJECT(
    'environment', 'production',
    'application', 'ecommerce_platform',
    'backup_tier', 'critical',
    'retention_class', 'long_term'
  ),

  backup_description = 'Weekly full backup for production ecommerce platform'
);

-- Monitor backup progress with real-time analytics
WITH backup_progress AS (
  SELECT 
    backup_id,
    backup_type,
    database_name,

    -- Progress tracking
    total_collections,
    completed_collections,
    ROUND((completed_collections::numeric / total_collections) * 100, 2) as progress_percentage,

    -- Performance metrics
    EXTRACT(MINUTES FROM CURRENT_TIMESTAMP - backup_start_time) as elapsed_minutes,
    CASE 
      WHEN completed_collections > 0 THEN
        ROUND(
          (total_collections - completed_collections) * 
          (EXTRACT(MINUTES FROM CURRENT_TIMESTAMP - backup_start_time) / completed_collections),
          0
        )
      ELSE NULL
    END as estimated_remaining_minutes,

    -- Size and throughput
    total_documents_processed,
    total_size_backed_up_mb,
    ROUND(
      total_size_backed_up_mb / 
      (EXTRACT(MINUTES FROM CURRENT_TIMESTAMP - backup_start_time) + 0.1), 
      2
    ) as throughput_mb_per_minute,

    -- Compression and efficiency
    original_size_mb,
    compressed_size_mb,
    ROUND(
      CASE 
        WHEN original_size_mb > 0 THEN 
          (1 - (compressed_size_mb / original_size_mb)) * 100 
        ELSE 0 
      END, 
      1
    ) as compression_ratio_percent,

    backup_status,
    error_count,
    warning_count

  FROM ACTIVE_BACKUP_OPERATIONS()
  WHERE backup_status IN ('running', 'finalizing')
),

-- Resource utilization analysis
resource_utilization AS (
  SELECT 
    backup_id,

    -- System resource usage
    cpu_usage_percent,
    memory_usage_mb,
    disk_io_mb_per_sec,
    network_io_mb_per_sec,

    -- Database performance impact
    active_connections_during_backup,
    query_response_time_impact_percent,
    replication_lag_seconds,

    -- Storage utilization
    backup_storage_used_gb,
    available_storage_gb,
    ROUND(
      (backup_storage_used_gb / (backup_storage_used_gb + available_storage_gb)) * 100, 
      1
    ) as storage_utilization_percent

  FROM BACKUP_RESOURCE_MONITORING()
  WHERE monitoring_timestamp >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
)

SELECT 
  -- Current backup status
  bp.backup_id,
  bp.backup_type,
  bp.database_name,
  bp.progress_percentage || '%' as progress,
  bp.backup_status,

  -- Time estimates
  bp.elapsed_minutes || ' min elapsed' as duration,
  COALESCE(bp.estimated_remaining_minutes || ' min remaining', 'Calculating...') as eta,

  -- Performance indicators
  bp.throughput_mb_per_minute || ' MB/min' as throughput,
  bp.compression_ratio_percent || '% compression' as compression,

  -- Quality indicators
  bp.error_count as errors,
  bp.warning_count as warnings,
  bp.total_documents_processed as documents,

  -- Resource impact
  ru.cpu_usage_percent || '%' as cpu_usage,
  ru.memory_usage_mb || 'MB' as memory_usage,
  ru.query_response_time_impact_percent || '% slower' as db_impact,
  ru.storage_utilization_percent || '%' as storage_used,

  -- Health assessment
  CASE 
    WHEN bp.error_count > 0 THEN 'Errors Detected'
    WHEN ru.cpu_usage_percent > 80 THEN 'High CPU Usage'
    WHEN ru.query_response_time_impact_percent > 20 THEN 'High DB Impact'
    WHEN bp.throughput_mb_per_minute < 10 THEN 'Low Throughput'
    WHEN ru.storage_utilization_percent > 90 THEN 'Storage Critical'
    ELSE 'Healthy'
  END as health_status,

  -- Recommendations
  CASE 
    WHEN bp.throughput_mb_per_minute < 10 THEN 'Consider increasing batch size or parallel operations'
    WHEN ru.cpu_usage_percent > 80 THEN 'Reduce concurrent operations or backup during off-peak hours'
    WHEN ru.query_response_time_impact_percent > 20 THEN 'Schedule backup during maintenance window'
    WHEN ru.storage_utilization_percent > 90 THEN 'Archive old backups or increase storage capacity'
    WHEN bp.progress_percentage > 95 THEN 'Backup nearing completion, prepare for verification'
    ELSE 'Backup proceeding normally'
  END as recommendation

FROM backup_progress bp
LEFT JOIN resource_utilization ru ON bp.backup_id = ru.backup_id
ORDER BY bp.backup_start_time DESC;

-- Advanced point-in-time recovery with SQL-style syntax
WITH recovery_analysis AS (
  SELECT 
    target_timestamp,

    -- Find optimal backup chain
    (SELECT backup_id FROM BACKUP_OPERATIONS 
     WHERE backup_type = 'full' 
       AND backup_timestamp <= target_timestamp 
       AND backup_status = 'completed'
     ORDER BY backup_timestamp DESC 
     LIMIT 1) as base_backup_id,

    -- Count incremental backups needed
    (SELECT COUNT(*) FROM BACKUP_OPERATIONS
     WHERE backup_type = 'incremental'
       AND backup_timestamp <= target_timestamp
       AND backup_timestamp > (
         SELECT backup_timestamp FROM BACKUP_OPERATIONS 
         WHERE backup_type = 'full' 
           AND backup_timestamp <= target_timestamp 
           AND backup_status = 'completed'
         ORDER BY backup_timestamp DESC 
         LIMIT 1
       )) as incremental_backups_needed,

    -- Estimate recovery time
    (SELECT 
       (backup_duration_minutes * 0.8) + -- Full restore (slightly faster than backup)
       (COUNT(*) * 5) + -- Incremental backups (5 min each)
       10 -- Oplog application and validation
     FROM BACKUP_OPERATIONS
     WHERE backup_type = 'incremental'
       AND backup_timestamp <= target_timestamp
     GROUP BY target_timestamp) as estimated_recovery_minutes,

    -- Calculate potential data loss
    TIMESTAMPDIFF(SECOND, target_timestamp, 
      (SELECT MAX(oplog_timestamp) FROM OPLOG_BACKUP_COVERAGE 
       WHERE oplog_timestamp <= target_timestamp)) as potential_data_loss_seconds

  FROM (SELECT TIMESTAMP('2024-01-30 14:30:00') as target_timestamp) t
)

-- Execute point-in-time recovery
EXECUTE RECOVERY point_in_time_recovery WITH OPTIONS (
  -- Recovery target
  recovery_target_timestamp = '2024-01-30 14:30:00',
  recovery_target_name = 'pre_deployment_state',

  -- Recovery destination  
  recovery_database = 'ecommerce_recovery_20240130',
  recovery_mode = 'new_database', -- new_database, replace_existing, parallel_validation

  -- Recovery scope
  include_databases = JSON_ARRAY('ecommerce', 'user_management'),
  exclude_collections = JSON_ARRAY('temp_data', 'cache_collection'),
  include_system_data = true,

  -- Performance and safety options
  parallel_recovery_threads = 4,
  recovery_batch_size = 500,
  validate_recovery = true,
  create_recovery_report = true,

  -- Backup chain configuration (auto-detected if not specified)
  base_backup_id = (SELECT base_backup_id FROM recovery_analysis),

  -- Safety and rollback
  enable_recovery_rollback = true,
  recovery_timeout_minutes = 120,

  -- Notification and logging
  notify_on_completion = JSON_ARRAY('dba@company.com', 'ops-team@company.com'),
  recovery_priority = 'high',

  recovery_metadata = JSON_OBJECT(
    'requested_by', 'database_admin',
    'business_justification', 'Rollback deployment due to data corruption',
    'ticket_number', 'INC-2024-0130-001',
    'approval_code', 'RECOVERY-AUTH-789'
  )
) RETURNING recovery_operation_id, estimated_completion_time, recovery_database_name;

-- Monitor point-in-time recovery progress
WITH recovery_progress AS (
  SELECT 
    recovery_operation_id,
    recovery_type,
    target_timestamp,
    recovery_database,

    -- Progress tracking
    total_recovery_steps,
    completed_recovery_steps,
    current_step_description,
    ROUND((completed_recovery_steps::numeric / total_recovery_steps) * 100, 2) as progress_percentage,

    -- Time analysis
    EXTRACT(MINUTES FROM CURRENT_TIMESTAMP - recovery_start_time) as elapsed_minutes,
    estimated_total_duration_minutes,
    estimated_remaining_minutes,

    -- Data recovery metrics
    total_collections_to_restore,
    collections_restored,
    documents_recovered,
    oplog_entries_applied,

    -- Quality and validation
    validation_errors,
    consistency_warnings,
    recovery_status,

    -- Performance metrics
    recovery_throughput_mb_per_minute,
    current_memory_usage_mb,
    current_cpu_usage_percent

  FROM ACTIVE_RECOVERY_OPERATIONS()
  WHERE recovery_status IN ('initializing', 'restoring', 'applying_oplog', 'validating')
),

-- Recovery validation and integrity checks
recovery_validation AS (
  SELECT 
    recovery_operation_id,

    -- Data integrity checks
    total_document_count_original,
    total_document_count_recovered,
    document_count_variance,

    -- Index validation
    total_indexes_original,
    total_indexes_recovered,  
    index_recreation_success_rate,

    -- Consistency validation
    referential_integrity_check_status,
    data_type_consistency_status,
    duplicate_detection_status,

    -- Business rule validation
    constraint_validation_errors,
    business_rule_violations,

    -- Performance baseline comparison
    query_performance_comparison_percent,
    storage_size_comparison_percent,

    -- Final validation score
    CASE 
      WHEN document_count_variance = 0 
        AND index_recreation_success_rate = 100
        AND referential_integrity_check_status = 'PASSED'
        AND constraint_validation_errors = 0
      THEN 'EXCELLENT'
      WHEN ABS(document_count_variance) < 0.1
        AND index_recreation_success_rate >= 95
        AND constraint_validation_errors < 10
      THEN 'GOOD'
      WHEN ABS(document_count_variance) < 1.0
        AND index_recreation_success_rate >= 90
      THEN 'ACCEPTABLE'
      ELSE 'NEEDS_REVIEW'
    END as overall_validation_status

  FROM RECOVERY_VALIDATION_RESULTS()
  WHERE validation_completed_at >= DATE_SUB(NOW(), INTERVAL 2 HOUR)
)

SELECT 
  -- Recovery operation overview
  rp.recovery_operation_id,
  rp.recovery_type,
  rp.target_timestamp,
  rp.recovery_database,
  rp.progress_percentage || '%' as progress,
  rp.recovery_status,

  -- Timing information
  rp.elapsed_minutes || ' min elapsed' as duration,
  rp.estimated_remaining_minutes || ' min remaining' as eta,
  rp.current_step_description as current_activity,

  -- Recovery metrics
  rp.collections_restored || '/' || rp.total_collections_to_restore as collections_progress,
  FORMAT_NUMBER(rp.documents_recovered) as documents_recovered,
  FORMAT_NUMBER(rp.oplog_entries_applied) as oplog_entries,

  -- Performance indicators
  rp.recovery_throughput_mb_per_minute || ' MB/min' as throughput,
  rp.current_memory_usage_mb || ' MB' as memory_usage,
  rp.current_cpu_usage_percent || '%' as cpu_usage,

  -- Quality metrics
  rp.validation_errors as errors,
  rp.consistency_warnings as warnings,

  -- Validation results (when available)
  COALESCE(rv.overall_validation_status, 'IN_PROGRESS') as validation_status,
  COALESCE(rv.document_count_variance || '%', 'Calculating...') as data_accuracy,
  COALESCE(rv.index_recreation_success_rate || '%', 'Pending...') as index_success,

  -- Health and status indicators
  CASE 
    WHEN rp.recovery_status = 'failed' THEN 'Recovery Failed'
    WHEN rp.validation_errors > 0 THEN 'Validation Errors Detected'
    WHEN rp.current_cpu_usage_percent > 90 THEN 'High Resource Usage'
    WHEN rp.progress_percentage > 95 AND rp.recovery_status = 'validating' THEN 'Final Validation'
    WHEN rp.recovery_status = 'completed' THEN 'Recovery Completed Successfully'
    ELSE 'Recovery In Progress'
  END as status_indicator,

  -- Recommendations and next steps
  CASE 
    WHEN rp.recovery_status = 'completed' AND rv.overall_validation_status = 'EXCELLENT' 
      THEN 'Recovery completed successfully. Database ready for use.'
    WHEN rp.recovery_status = 'completed' AND rv.overall_validation_status = 'GOOD'
      THEN 'Recovery completed. Minor inconsistencies detected, review validation report.'
    WHEN rp.recovery_status = 'completed' AND rv.overall_validation_status = 'NEEDS_REVIEW'
      THEN 'Recovery completed with issues. Manual review required before production use.'
    WHEN rp.validation_errors > 0 
      THEN 'Validation errors detected. Check recovery logs and consider retry.'
    WHEN rp.estimated_remaining_minutes < 10 
      THEN 'Recovery nearly complete. Prepare for validation phase.'
    WHEN rp.recovery_throughput_mb_per_minute < 5 
      THEN 'Low recovery throughput. Consider resource optimization.'
    ELSE 'Recovery progressing normally. Continue monitoring.'
  END as recommendations

FROM recovery_progress rp
LEFT JOIN recovery_validation rv ON rp.recovery_operation_id = rv.recovery_operation_id
ORDER BY rp.recovery_start_time DESC;

-- Disaster recovery orchestration dashboard
CREATE VIEW disaster_recovery_dashboard AS
SELECT 
  -- Current disaster recovery readiness
  (SELECT COUNT(*) FROM BACKUP_OPERATIONS 
   WHERE backup_status = 'completed' 
     AND backup_timestamp >= DATE_SUB(NOW(), INTERVAL 24 HOUR)) as backups_last_24h,

  (SELECT MIN(TIMESTAMPDIFF(HOUR, backup_timestamp, NOW())) 
   FROM BACKUP_OPERATIONS 
   WHERE backup_type = 'full' AND backup_status = 'completed') as hours_since_last_full_backup,

  (SELECT COUNT(*) FROM BACKUP_OPERATIONS 
   WHERE backup_type = 'incremental' 
     AND backup_timestamp >= DATE_SUB(NOW(), INTERVAL 4 HOUR)
     AND backup_status = 'completed') as recent_incremental_backups,

  -- Recovery capabilities
  (SELECT COUNT(*) FROM RECOVERY_TEST_OPERATIONS 
   WHERE test_timestamp >= DATE_SUB(NOW(), INTERVAL 30 DAY)
     AND test_status = 'successful') as successful_recovery_tests_30d,

  (SELECT AVG(recovery_duration_minutes) FROM RECOVERY_TEST_OPERATIONS
   WHERE test_timestamp >= DATE_SUB(NOW(), INTERVAL 90 DAY)
     AND test_status = 'successful') as avg_recovery_time_minutes,

  -- RPO/RTO compliance
  (SELECT 
     CASE 
       WHEN MIN(TIMESTAMPDIFF(MINUTE, backup_timestamp, NOW())) <= 15 THEN 'COMPLIANT'
       WHEN MIN(TIMESTAMPDIFF(MINUTE, backup_timestamp, NOW())) <= 30 THEN 'WARNING'  
       ELSE 'NON_COMPLIANT'
     END
   FROM BACKUP_OPERATIONS 
   WHERE backup_status = 'completed') as rpo_compliance_status,

  (SELECT 
     CASE 
       WHEN AVG(recovery_duration_minutes) <= 30 THEN 'COMPLIANT'
       WHEN AVG(recovery_duration_minutes) <= 60 THEN 'WARNING'
       ELSE 'NON_COMPLIANT'  
     END
   FROM RECOVERY_TEST_OPERATIONS
   WHERE test_timestamp >= DATE_SUB(NOW(), INTERVAL 90 DAY)
     AND test_status = 'successful') as rto_compliance_status,

  -- Storage and capacity
  (SELECT SUM(backup_size_mb) FROM BACKUP_OPERATIONS 
   WHERE backup_status = 'completed') as total_backup_storage_mb,

  (SELECT available_storage_gb FROM STORAGE_CAPACITY_MONITORING 
   ORDER BY monitoring_timestamp DESC LIMIT 1) as available_storage_gb,

  -- System health indicators
  (SELECT COUNT(*) FROM ACTIVE_BACKUP_OPERATIONS()) as active_backup_operations,
  (SELECT COUNT(*) FROM ACTIVE_RECOVERY_OPERATIONS()) as active_recovery_operations,

  -- Alert conditions
  JSON_ARRAYAGG(
    CASE 
      WHEN hours_since_last_full_backup > 168 THEN 'Full backup overdue'
      WHEN recent_incremental_backups = 0 THEN 'No recent incremental backups'
      WHEN successful_recovery_tests_30d = 0 THEN 'No recent recovery testing'
      WHEN available_storage_gb < 100 THEN 'Low storage capacity'
      WHEN rpo_compliance_status = 'NON_COMPLIANT' THEN 'RPO compliance violation'
      WHEN rto_compliance_status = 'NON_COMPLIANT' THEN 'RTO compliance violation'
    END
  ) as active_alerts,

  -- Overall disaster recovery readiness score
  CASE 
    WHEN hours_since_last_full_backup <= 24
      AND recent_incremental_backups >= 6  
      AND successful_recovery_tests_30d >= 2
      AND rpo_compliance_status = 'COMPLIANT'
      AND rto_compliance_status = 'COMPLIANT'
      AND available_storage_gb >= 500
    THEN 'EXCELLENT'
    WHEN hours_since_last_full_backup <= 48
      AND recent_incremental_backups >= 3
      AND successful_recovery_tests_30d >= 1  
      AND rpo_compliance_status != 'NON_COMPLIANT'
      AND available_storage_gb >= 200
    THEN 'GOOD'
    WHEN hours_since_last_full_backup <= 168
      AND recent_incremental_backups >= 1
      AND available_storage_gb >= 100
    THEN 'FAIR'
    ELSE 'CRITICAL'
  END as disaster_recovery_readiness,

  NOW() as dashboard_timestamp;

-- QueryLeaf backup and recovery capabilities provide:
-- 1. SQL-familiar backup strategy configuration and execution
-- 2. Real-time backup and recovery progress monitoring  
-- 3. Advanced point-in-time recovery with comprehensive validation
-- 4. Disaster recovery orchestration and readiness assessment
-- 5. Performance optimization and resource utilization tracking
-- 6. Comprehensive analytics and compliance reporting
-- 7. Integration with MongoDB's native backup capabilities
-- 8. Enterprise-grade automation and scheduling features
-- 9. Multi-storage tier management and lifecycle policies
-- 10. Complete audit trail and regulatory compliance support

Best Practices for MongoDB Backup and Recovery

Backup Strategy Design

Essential principles for comprehensive data protection (a minimal scheduling sketch follows the list):

  1. 3-2-1 Rule: Maintain 3 copies of data, on 2 different storage types, with 1 offsite copy
  2. Tiered Storage: Use different storage classes based on access patterns and retention requirements
  3. Incremental Backups: Implement frequent incremental backups to minimize data loss
  4. Testing and Validation: Regularly test backup restoration and validate data integrity
  5. Automation: Automate backup processes to reduce human error and ensure consistency
  6. Monitoring: Implement comprehensive monitoring for backup success and storage utilization
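
The sketch below wires a few of these principles together in Node.js: a full dump with mongodump, an offsite copy, and local retention pruning. It assumes mongodump and the AWS CLI are installed; the backup directory, bucket name, and retention window are placeholders, not a definitive implementation.

// Minimal backup scheduling sketch: full dump, offsite copy, local retention pruning.
// Assumes mongodump and the AWS CLI are on the PATH; paths and bucket are placeholders.
const { execFile } = require('child_process');
const { promisify } = require('util');
const fs = require('fs/promises');
const path = require('path');

const run = promisify(execFile);

const BACKUP_DIR = '/var/backups/mongodb';          // local copy (storage type 1)
const OFFSITE_BUCKET = 's3://example-dr-backups';   // offsite copy (storage type 2, placeholder)
const RETENTION_DAYS = 14;

async function runFullBackup(uri) {
  await fs.mkdir(BACKUP_DIR, { recursive: true });
  const stamp = new Date().toISOString().replace(/[:.]/g, '-');
  const archivePath = path.join(BACKUP_DIR, `full-${stamp}.archive.gz`);

  // Full dump with oplog capture for a consistent snapshot of a replica set
  await run('mongodump', ['--uri', uri, '--oplog', '--gzip', `--archive=${archivePath}`]);

  // Ship a second copy offsite (any object store or rsync target works here)
  await run('aws', ['s3', 'cp', archivePath, `${OFFSITE_BUCKET}/full/`]);

  return archivePath;
}

async function pruneOldBackups() {
  const cutoff = Date.now() - RETENTION_DAYS * 24 * 60 * 60 * 1000;
  for (const name of await fs.readdir(BACKUP_DIR)) {
    const filePath = path.join(BACKUP_DIR, name);
    const { mtimeMs } = await fs.stat(filePath);
    if (mtimeMs < cutoff) {
      await fs.unlink(filePath); // local retention only; offsite lifecycle rules handle the rest
    }
  }
}

module.exports = { runFullBackup, pruneOldBackups };

Restore testing (principle 4) still needs to be exercised separately, for example by replaying an archive with mongorestore into a scratch cluster on a regular schedule.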

Recovery Planning

Optimize recovery strategies for business continuity (an RPO compliance check sketch follows the list):

  1. RTO/RPO Definition: Clearly define Recovery Time and Point Objectives for different scenarios
  2. Recovery Testing: Conduct regular disaster recovery drills and document procedures
  3. Priority Classification: Classify data and applications by recovery priority
  4. Documentation: Maintain detailed recovery procedures and contact information
  5. Cross-Region Strategy: Implement geographic distribution for disaster resilience
  6. Validation Procedures: Establish data validation protocols for recovered systems
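
As one concrete example of tracking these objectives, the sketch below measures the current RPO gap from a hypothetical backup metadata collection and flags a breach against a 15-minute target. The database name, collection name, and field names are assumptions for illustration only.

// RPO gap check sketch: compare the newest completed backup against a 15-minute target.
// The ops_metadata database, backup_operations collection, and field names are assumptions.
const { MongoClient } = require('mongodb');

const RPO_TARGET_MINUTES = 15;

async function checkRpoCompliance(uri) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const backups = client.db('ops_metadata').collection('backup_operations');

    // Most recent successful backup of any type
    const latest = await backups
      .find({ backup_status: 'completed' })
      .sort({ backup_timestamp: -1 })
      .limit(1)
      .next();

    if (!latest) {
      return { status: 'NON_COMPLIANT', reason: 'no completed backups recorded' };
    }

    const gapMinutes = (Date.now() - latest.backup_timestamp.getTime()) / 60000;
    return {
      status: gapMinutes <= RPO_TARGET_MINUTES ? 'COMPLIANT' : 'NON_COMPLIANT',
      gapMinutes: Math.round(gapMinutes),
      lastBackupAt: latest.backup_timestamp
    };
  } finally {
    await client.close();
  }
}

module.exports = { checkRpoCompliance };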

Conclusion

MongoDB's comprehensive backup and recovery capabilities provide enterprise-grade data protection that supports complex disaster recovery scenarios, automated backup workflows, and granular point-in-time recovery operations. By implementing advanced backup strategies with QueryLeaf's familiar SQL interface, organizations can ensure business continuity while maintaining operational simplicity and regulatory compliance.

Key MongoDB backup and recovery benefits include:

  • Native Integration: Seamless integration with MongoDB's replica sets and sharding for optimal performance
  • Flexible Recovery Options: Point-in-time recovery, selective collection restore, and cross-region disaster recovery
  • Automated Workflows: Sophisticated scheduling, retention management, and cloud storage integration
  • Performance Optimization: Parallel processing, compression, and incremental backup strategies
  • Enterprise Features: Encryption, compliance reporting, and comprehensive audit trails
  • Operational Simplicity: Familiar SQL-style backup and recovery commands reduce learning curve

Whether you're protecting financial transaction data, healthcare records, or e-commerce platforms, MongoDB's backup and recovery capabilities with QueryLeaf's enterprise management interface provide the foundation for robust data protection strategies that scale with your organization's growth and compliance requirements.

QueryLeaf Integration: QueryLeaf automatically translates SQL-familiar backup and recovery commands into optimized MongoDB operations, providing familiar scheduling, monitoring, and validation capabilities. Advanced disaster recovery orchestration, compliance reporting, and performance optimization are seamlessly handled through SQL-style interfaces, making enterprise-grade data protection both comprehensive and accessible for database-oriented teams.

The combination of MongoDB's native backup capabilities with SQL-style operational commands makes it an ideal platform for mission-critical applications requiring both sophisticated data protection and familiar administrative workflows, ensuring your backup and recovery strategies remain both effective and maintainable as they evolve to meet changing business requirements.

MongoDB Schema Evolution and Migration Strategies: Advanced Patterns for Database Versioning, Backward Compatibility, and SQL-Style Schema Management

Production MongoDB applications face inevitable schema evolution challenges as business requirements change, data models mature, and application functionality expands. Traditional relational databases handle schema changes through DDL operations with strict versioning, but often require complex migration scripts, application downtime, and careful coordination between database and application deployments.

MongoDB's flexible document model provides powerful schema evolution capabilities that enable incremental data model changes, backward compatibility maintenance, and zero-downtime migrations. Unlike rigid relational schemas, MongoDB supports mixed document structures within collections, enabling gradual transitions and sophisticated migration strategies that adapt to real-world deployment constraints.
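
For example, documents written under different schema versions can live side by side in one collection while application code normalizes both shapes during a gradual transition. The short sketch below illustrates the idea; the database, collection, and _schema_version marker field are illustrative choices, not a prescribed convention.

// Sketch: two schema versions coexisting in one collection, read through a
// version-aware normalizer (database, collection, and field names are illustrative).
const { MongoClient } = require('mongodb');

async function demoMixedVersions(uri) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const users = client.db('demo').collection('users');

    await users.insertMany([
      // v1 shape: flat last_login field
      { email: 'a@example.com', last_login: new Date('2024-01-01'), _schema_version: '1.0' },
      // v2 shape: nested activity sub-document
      { email: 'b@example.com', activity: { last_login_at: new Date('2024-01-02') }, _schema_version: '2.0' }
    ]);

    // Readers branch on the version marker instead of requiring a big-bang migration
    const docs = await users.find({}).toArray();
    return docs.map(doc => ({
      email: doc.email,
      lastLogin: doc._schema_version === '2.0' ? doc.activity?.last_login_at : doc.last_login
    }));
  } finally {
    await client.close();
  }
}

module.exports = { demoMixedVersions };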

The Traditional Schema Migration Challenge

Conventional relational databases face significant limitations when implementing schema evolution and data migration:

-- Traditional PostgreSQL schema migration - rigid and disruptive approach

-- Step 1: Create backup table (downtime and storage overhead)
CREATE TABLE users_backup AS SELECT * FROM users;

-- Step 2: Add new columns with application downtime
ALTER TABLE users 
ADD COLUMN user_preferences JSONB DEFAULT '{}',
ADD COLUMN subscription_tier VARCHAR(50) DEFAULT 'basic',
ADD COLUMN last_login_timestamp TIMESTAMP,
ADD COLUMN account_status VARCHAR(20) DEFAULT 'active',
ADD COLUMN profile_completion_percentage INTEGER DEFAULT 0;

-- Step 3: Update existing data (potentially long-running operation)
BEGIN TRANSACTION;

-- Complex data transformation requiring application logic
UPDATE users 
SET user_preferences = jsonb_build_object(
  'email_notifications', true,
  'privacy_level', 'standard',
  'theme', 'light',
  'language', 'en'
)
WHERE user_preferences = '{}';

-- Derive subscription tier from existing data
UPDATE users 
SET subscription_tier = CASE 
  WHEN annual_subscription_fee > 120 THEN 'premium'
  WHEN annual_subscription_fee > 60 THEN 'plus' 
  ELSE 'basic'
END
WHERE subscription_tier = 'basic';

-- Calculate profile completion
UPDATE users 
SET profile_completion_percentage = (
  CASE WHEN email IS NOT NULL THEN 20 ELSE 0 END +
  CASE WHEN phone IS NOT NULL THEN 20 ELSE 0 END +
  CASE WHEN address IS NOT NULL THEN 20 ELSE 0 END +
  CASE WHEN birth_date IS NOT NULL THEN 20 ELSE 0 END +
  CASE WHEN bio IS NOT NULL AND LENGTH(bio) > 50 THEN 20 ELSE 0 END
)
WHERE profile_completion_percentage = 0;

COMMIT TRANSACTION;

-- Step 4: Create new indexes (long-running and resource-intensive, even with CONCURRENTLY)
CREATE INDEX CONCURRENTLY users_subscription_tier_idx ON users(subscription_tier);
CREATE INDEX CONCURRENTLY users_last_login_idx ON users(last_login_timestamp);
CREATE INDEX CONCURRENTLY users_account_status_idx ON users(account_status);

-- Step 5: Drop old columns (breaking change requiring application updates)
ALTER TABLE users 
DROP COLUMN IF EXISTS old_preferences_text,
DROP COLUMN IF EXISTS legacy_status_code,
DROP COLUMN IF EXISTS deprecated_login_count;

-- Step 6: Rename columns (coordinated deployment required; PostgreSQL allows only one rename per statement)
ALTER TABLE users RENAME COLUMN user_email TO email_address;
ALTER TABLE users RENAME COLUMN user_phone TO phone_number;

-- Step 7: Create migration log table (manual tracking)
CREATE TABLE schema_migrations (
    migration_id SERIAL PRIMARY KEY,
    migration_name VARCHAR(200) NOT NULL,
    applied_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    application_version VARCHAR(50),
    database_version VARCHAR(50),
    rollback_script TEXT,
    migration_notes TEXT
);

INSERT INTO schema_migrations (
    migration_name, 
    application_version, 
    database_version,
    rollback_script,
    migration_notes
) VALUES (
    'users_table_v2_migration',
    '2.1.0',
    '2.1.0',
    'ALTER TABLE users DROP COLUMN user_preferences, DROP COLUMN subscription_tier, DROP COLUMN last_login_timestamp, DROP COLUMN account_status, DROP COLUMN profile_completion_percentage;',
    'Added user preferences, subscription tiers, and profile completion tracking'
);

-- Problems with traditional schema migration approaches:
-- 1. Application downtime required for structural changes
-- 2. All-or-nothing migration approach with limited rollback capabilities
-- 3. Complex coordination between database and application deployments
-- 4. Risk of data loss during migration failures
-- 5. Performance impact during large table modifications
-- 6. Limited support for gradual migration and A/B testing scenarios
-- 7. Difficulty in maintaining multiple application versions simultaneously
-- 8. Complex rollback procedures requiring manual intervention
-- 9. Poor support for distributed systems and microservices architectures
-- 10. High operational overhead for migration planning and execution

MongoDB provides sophisticated schema evolution capabilities with flexible document structures:

// MongoDB Schema Evolution - flexible and non-disruptive approach
const { MongoClient } = require('mongodb');

// Advanced MongoDB Schema Migration and Evolution Management System
class MongoSchemaEvolutionManager {
  constructor(connectionUri, options = {}) {
    this.client = new MongoClient(connectionUri);
    this.db = null;
    this.collections = new Map();

    // Schema evolution configuration
    this.config = {
      // Migration strategy settings
      migrationStrategy: {
        approachType: options.migrationStrategy?.approachType || 'gradual', // gradual, immediate, hybrid
        batchSize: options.migrationStrategy?.batchSize || 1000,
        concurrentOperations: options.migrationStrategy?.concurrentOperations || 3,
        maxExecutionTimeMs: options.migrationStrategy?.maxExecutionTimeMs || 300000, // 5 minutes
        enableRollback: options.migrationStrategy?.enableRollback !== false
      },

      // Version management
      versionManagement: {
        trackDocumentVersions: options.versionManagement?.trackDocumentVersions !== false,
        versionField: options.versionManagement?.versionField || '_schema_version',
        migrationLogCollection: options.versionManagement?.migrationLogCollection || 'schema_migrations',
        enableVersionValidation: options.versionManagement?.enableVersionValidation !== false
      },

      // Backward compatibility
      backwardCompatibility: {
        maintainOldFields: options.backwardCompatibility?.maintainOldFields !== false,
        gracefulDegradation: options.backwardCompatibility?.gracefulDegradation !== false,
        compatibilityPeriodDays: options.backwardCompatibility?.compatibilityPeriodDays || 90,
        enableFieldAliasing: options.backwardCompatibility?.enableFieldAliasing !== false
      },

      // Performance optimization
      performanceSettings: {
        useIndexedMigration: options.performanceSettings?.useIndexedMigration !== false,
        enableProgressTracking: options.performanceSettings?.enableProgressTracking !== false,
        optimizeConcurrency: options.performanceSettings?.optimizeConcurrency !== false,
        memoryLimitMB: options.performanceSettings?.memoryLimitMB || 512
      }
    };

    // Schema version registry
    this.schemaVersions = new Map();
    this.migrationPlans = new Map();
    this.activeMigrations = new Map();

    // Migration execution state
    this.migrationProgress = new Map();
    this.rollbackStrategies = new Map();
  }

  async initialize(databaseName) {
    console.log('Initializing MongoDB Schema Evolution Manager...');

    try {
      await this.client.connect();
      this.db = this.client.db(databaseName);

      // Setup system collections for schema management
      await this.setupSchemaManagementCollections();

      // Load existing schema versions and migration history
      await this.loadSchemaVersionRegistry();

      console.log('Schema evolution manager initialized successfully');

    } catch (error) {
      console.error('Error initializing schema evolution manager:', error);
      throw error;
    }
  }

  async setupSchemaManagementCollections() {
    console.log('Setting up schema management collections...');

    // Schema version registry
    const schemaVersions = this.db.collection('schema_versions');
    await schemaVersions.createIndexes([
      { key: { collection_name: 1, version: 1 }, unique: true },
      { key: { is_active: 1 } },
      { key: { created_at: -1 } }
    ]);

    // Migration execution log
    const migrationLog = this.db.collection(this.config.versionManagement.migrationLogCollection);
    await migrationLog.createIndexes([
      { key: { migration_id: 1 }, unique: true },
      { key: { collection_name: 1, execution_timestamp: -1 } },
      { key: { migration_status: 1 } },
      { key: { schema_version_from: 1, schema_version_to: 1 } }
    ]);

    // Migration progress tracking
    const migrationProgress = this.db.collection('migration_progress');
    await migrationProgress.createIndexes([
      { key: { migration_id: 1 }, unique: true },
      { key: { collection_name: 1 } },
      { key: { status: 1 } }
    ]);
  }

  async defineSchemaVersion(collectionName, versionConfig) {
    console.log(`Defining schema version for collection: ${collectionName}`);

    const schemaVersion = {
      collection_name: collectionName,
      version: versionConfig.version,
      version_name: versionConfig.versionName || `v${versionConfig.version}`,

      // Schema definition
      schema_definition: {
        fields: versionConfig.fields || {},
        required_fields: versionConfig.requiredFields || [],
        optional_fields: versionConfig.optionalFields || [],
        deprecated_fields: versionConfig.deprecatedFields || [],

        // Field transformations and mappings
        field_mappings: versionConfig.fieldMappings || {},
        data_transformations: versionConfig.dataTransformations || {},
        validation_rules: versionConfig.validationRules || {}
      },

      // Migration configuration
      migration_config: {
        migration_type: versionConfig.migrationType || 'additive', // additive, transformative, breaking
        backward_compatible: versionConfig.backwardCompatible !== false,
        requires_reindex: versionConfig.requiresReindex || false,
        data_transformation_required: versionConfig.dataTransformationRequired || false,

        // Performance settings
        batch_processing: versionConfig.batchProcessing !== false,
        parallel_execution: versionConfig.parallelExecution || false,
        estimated_duration_minutes: versionConfig.estimatedDuration || 0
      },

      // Compatibility and rollback
      compatibility_info: {
        compatible_with_versions: versionConfig.compatibleVersions || [],
        breaking_changes: versionConfig.breakingChanges || [],
        rollback_strategy: versionConfig.rollbackStrategy || 'automatic',
        rollback_script: versionConfig.rollbackScript || null
      },

      // Metadata
      version_metadata: {
        created_by: versionConfig.createdBy || 'system',
        created_at: new Date(),
        is_active: versionConfig.isActive !== false,
        deployment_notes: versionConfig.deploymentNotes || '',
        business_justification: versionConfig.businessJustification || ''
      }
    };

    // Store schema version definition
    const schemaVersions = this.db.collection('schema_versions');
    await schemaVersions.replaceOne(
      { collection_name: collectionName, version: versionConfig.version },
      schemaVersion,
      { upsert: true }
    );

    // Cache schema version
    this.schemaVersions.set(`${collectionName}:${versionConfig.version}`, schemaVersion);

    console.log(`Schema version ${versionConfig.version} defined for ${collectionName}`);
    return schemaVersion;
  }

  async createMigrationPlan(collectionName, fromVersion, toVersion, options = {}) {
    console.log(`Creating migration plan: ${collectionName} v${fromVersion} → v${toVersion}`);

    const sourceSchema = this.schemaVersions.get(`${collectionName}:${fromVersion}`);
    const targetSchema = this.schemaVersions.get(`${collectionName}:${toVersion}`);

    if (!sourceSchema || !targetSchema) {
      throw new Error(`Schema version not found for migration: ${fromVersion} → ${toVersion}`);
    }

    const migrationPlan = {
      migration_id: this.generateMigrationId(),
      collection_name: collectionName,
      schema_version_from: fromVersion,
      schema_version_to: toVersion,

      // Migration analysis
      migration_analysis: {
        migration_type: this.analyzeMigrationType(sourceSchema, targetSchema),
        impact_assessment: await this.assessMigrationImpact(collectionName, sourceSchema, targetSchema),
        field_changes: this.analyzeFieldChanges(sourceSchema, targetSchema),
        data_transformation_required: this.requiresDataTransformation(sourceSchema, targetSchema)
      },

      // Execution plan
      execution_plan: {
        migration_steps: await this.generateMigrationSteps(sourceSchema, targetSchema),
        execution_order: options.executionOrder || 'sequential',
        batch_configuration: {
          batch_size: options.batchSize || this.config.migrationStrategy.batchSize,
          concurrent_batches: options.concurrentBatches || this.config.migrationStrategy.concurrentOperations,
          throttle_delay_ms: options.throttleDelay || 10
        },

        // Performance predictions
        estimated_execution_time: await this.estimateExecutionTime(collectionName, sourceSchema, targetSchema),
        resource_requirements: await this.calculateResourceRequirements(collectionName, sourceSchema, targetSchema)
      },

      // Safety and rollback
      safety_measures: {
        backup_required: options.backupRequired !== false,
        validation_checks: await this.generateValidationChecks(sourceSchema, targetSchema),
        rollback_plan: await this.generateRollbackPlan(sourceSchema, targetSchema),
        progress_checkpoints: options.progressCheckpoints || []
      },

      // Metadata
      plan_metadata: {
        created_at: new Date(),
        created_by: options.createdBy || 'system',
        plan_version: '1.0',
        approval_required: options.approvalRequired || false,
        deployment_window: options.deploymentWindow || null
      }
    };

    // Store migration plan
    await this.db.collection('migration_plans').replaceOne(
      { migration_id: migrationPlan.migration_id },
      migrationPlan,
      { upsert: true }
    );

    // Cache migration plan
    this.migrationPlans.set(migrationPlan.migration_id, migrationPlan);

    console.log(`Migration plan created: ${migrationPlan.migration_id}`);
    return migrationPlan;
  }

  async executeMigration(migrationId, options = {}) {
    console.log(`Executing migration: ${migrationId}`);

    const migrationPlan = this.migrationPlans.get(migrationId);
    if (!migrationPlan) {
      throw new Error(`Migration plan not found: ${migrationId}`);
    }

    const executionId = this.generateExecutionId();
    const startTime = Date.now();

    try {
      // Initialize migration execution tracking
      await this.initializeMigrationExecution(executionId, migrationPlan, options);

      // Pre-migration validation and preparation
      await this.performPreMigrationChecks(migrationPlan);

      // Execute migration based on strategy
      const migrationResult = await this.executeByStrategy(migrationPlan, executionId, options);

      // Post-migration validation
      await this.performPostMigrationValidation(migrationPlan, migrationResult);

      // Update migration log
      await this.logMigrationCompletion(executionId, migrationPlan, migrationResult, {
        start_time: startTime,
        end_time: Date.now(),
        status: 'success'
      });

      console.log(`Migration completed successfully: ${migrationId}`);
      return migrationResult;

    } catch (error) {
      console.error(`Migration failed: ${migrationId}`, error);

      // Attempt automatic rollback if enabled
      if (this.config.migrationStrategy.enableRollback && options.autoRollback !== false) {
        try {
          await this.executeRollback(executionId, migrationPlan);
        } catch (rollbackError) {
          console.error('Rollback failed:', rollbackError);
        }
      }

      // Log migration failure
      await this.logMigrationCompletion(executionId, migrationPlan, null, {
        start_time: startTime,
        end_time: Date.now(),
        status: 'failed',
        error: error.message
      });

      throw error;
    }
  }

  async executeByStrategy(migrationPlan, executionId, options) {
    const strategy = options.strategy || this.config.migrationStrategy.approachType;

    switch (strategy) {
      case 'gradual':
        return await this.executeGradualMigration(migrationPlan, executionId, options);
      case 'immediate':
        return await this.executeImmediateMigration(migrationPlan, executionId, options);
      case 'hybrid':
        return await this.executeHybridMigration(migrationPlan, executionId, options);
      default:
        throw new Error(`Unknown migration strategy: ${strategy}`);
    }
  }

  async executeGradualMigration(migrationPlan, executionId, options) {
    console.log('Executing gradual migration strategy...');

    const collection = this.db.collection(migrationPlan.collection_name);
    const batchConfig = migrationPlan.execution_plan.batch_configuration;

    let processedCount = 0;
    const totalCount = await collection.countDocuments({
      [this.config.versionManagement.versionField]: migrationPlan.schema_version_from
    });
    let lastId = null;

    console.log(`Processing ${totalCount} documents in batches of ${batchConfig.batch_size}`);

    while (processedCount < totalCount) {
      // Build batch query
      const batchQuery = lastId 
        ? { _id: { $gt: lastId }, [this.config.versionManagement.versionField]: migrationPlan.schema_version_from }
        : { [this.config.versionManagement.versionField]: migrationPlan.schema_version_from };

      // Get batch of documents
      const batch = await collection
        .find(batchQuery)
        .sort({ _id: 1 })
        .limit(batchConfig.batch_size)
        .toArray();

      if (batch.length === 0) {
        break; // No more documents to process
      }

      // Process batch
      const batchResult = await this.processMigrationBatch(
        collection, 
        batch, 
        migrationPlan.execution_plan.migration_steps,
        migrationPlan.schema_version_to
      );

      processedCount += batch.length;
      lastId = batch[batch.length - 1]._id;

      // Update progress
      await this.updateMigrationProgress(executionId, {
        processed_count: processedCount,
        total_count: totalCount,
        progress_percentage: (processedCount / totalCount) * 100,
        last_processed_id: lastId
      });

      // Throttle to avoid overwhelming the system
      if (batchConfig.throttle_delay_ms > 0) {
        await new Promise(resolve => setTimeout(resolve, batchConfig.throttle_delay_ms));
      }

      console.log(`Processed ${processedCount}/${totalCount} documents (${((processedCount / totalCount) * 100).toFixed(1)}%)`);
    }

    return {
      strategy: 'gradual',
      processed_count: processedCount,
      total_count: totalCount,
      batches_processed: Math.ceil(processedCount / batchConfig.batch_size),
      success: true
    };
  }

  async processMigrationBatch(collection, documents, migrationSteps, targetVersion) {
    const bulkOperations = [];

    for (const doc of documents) {
      let transformedDoc = { ...doc };

      // Apply each migration step
      for (const step of migrationSteps) {
        transformedDoc = await this.applyMigrationStep(transformedDoc, step);
      }

      // Update schema version
      transformedDoc[this.config.versionManagement.versionField] = targetVersion;
      transformedDoc._migration_timestamp = new Date();

      // Add to bulk operations
      bulkOperations.push({
        replaceOne: {
          filter: { _id: doc._id },
          replacement: transformedDoc
        }
      });
    }

    // Execute bulk operation
    if (bulkOperations.length > 0) {
      const result = await collection.bulkWrite(bulkOperations, { ordered: false });
      return {
        modified_count: result.modifiedCount,
        matched_count: result.matchedCount,
        errors: result.getWriteErrors()
      };
    }

    return { modified_count: 0, matched_count: 0, errors: [] };
  }

  async applyMigrationStep(document, migrationStep) {
    let transformedDoc = { ...document };

    switch (migrationStep.type) {
      case 'add_field':
        transformedDoc[migrationStep.field_name] = migrationStep.default_value;
        break;

      case 'rename_field':
        if (transformedDoc[migrationStep.old_field_name] !== undefined) {
          transformedDoc[migrationStep.new_field_name] = transformedDoc[migrationStep.old_field_name];
          delete transformedDoc[migrationStep.old_field_name];
        }
        break;

      case 'transform_field':
        if (transformedDoc[migrationStep.field_name] !== undefined) {
          transformedDoc[migrationStep.field_name] = await this.applyFieldTransformation(
            transformedDoc[migrationStep.field_name],
            migrationStep.transformation
          );
        }
        break;

      case 'nested_restructure':
        transformedDoc = await this.applyNestedRestructure(transformedDoc, migrationStep.restructure_config);
        break;

      case 'data_type_conversion':
        if (transformedDoc[migrationStep.field_name] !== undefined) {
          transformedDoc[migrationStep.field_name] = this.convertDataType(
            transformedDoc[migrationStep.field_name],
            migrationStep.target_type
          );
        }
        break;

      case 'conditional_transformation':
        if (this.evaluateCondition(transformedDoc, migrationStep.condition)) {
          transformedDoc = await this.applyConditionalTransformation(transformedDoc, migrationStep.transformation);
        }
        break;

      default:
        console.warn(`Unknown migration step type: ${migrationStep.type}`);
    }

    return transformedDoc;
  }

  async generateBackwardCompatibilityLayer(collectionName, fromVersion, toVersion) {
    console.log(`Generating backward compatibility layer: ${collectionName} v${fromVersion} ↔ v${toVersion}`);

    const sourceSchema = this.schemaVersions.get(`${collectionName}:${fromVersion}`);
    const targetSchema = this.schemaVersions.get(`${collectionName}:${toVersion}`);

    const compatibilityLayer = {
      collection_name: collectionName,
      source_version: fromVersion,
      target_version: toVersion,

      // Field mapping for backward compatibility
      field_mappings: {
        // Map old field names to new field names
        old_to_new: this.generateFieldMappings(sourceSchema, targetSchema, 'forward'),
        new_to_old: this.generateFieldMappings(targetSchema, sourceSchema, 'backward')
      },

      // Data transformation functions
      transformation_functions: {
        forward_transform: await this.generateTransformationFunction(sourceSchema, targetSchema, 'forward'),
        backward_transform: await this.generateTransformationFunction(targetSchema, sourceSchema, 'backward')
      },

      // API compatibility
      api_compatibility: {
        deprecated_fields: this.identifyDeprecatedFields(sourceSchema, targetSchema),
        field_aliases: this.generateFieldAliases(sourceSchema, targetSchema),
        default_values: this.generateDefaultValues(targetSchema)
      },

      // Migration instructions
      migration_instructions: {
        application_changes_required: this.identifyRequiredApplicationChanges(sourceSchema, targetSchema),
        breaking_changes: this.identifyBreakingChanges(sourceSchema, targetSchema),
        migration_timeline: this.generateMigrationTimeline(sourceSchema, targetSchema)
      }
    };

    // Store compatibility layer configuration
    await this.db.collection('compatibility_layers').replaceOne(
      { collection_name: collectionName, source_version: fromVersion, target_version: toVersion },
      compatibilityLayer,
      { upsert: true }
    );

    return compatibilityLayer;
  }

  async validateMigrationIntegrity(collectionName, migrationId, options = {}) {
    console.log(`Validating migration integrity: ${collectionName} (${migrationId})`);

    const collection = this.db.collection(collectionName);
    const migrationPlan = this.migrationPlans.get(migrationId);

    if (!migrationPlan) {
      throw new Error(`Migration plan not found: ${migrationId}`);
    }

    const validationResults = {
      migration_id: migrationId,
      collection_name: collectionName,
      validation_timestamp: new Date(),

      // Document count validation
      document_counts: {
        total_documents: await collection.countDocuments(),
        migrated_documents: await collection.countDocuments({
          [this.config.versionManagement.versionField]: migrationPlan.schema_version_to
        }),
        unmigrated_documents: await collection.countDocuments({
          [this.config.versionManagement.versionField]: { $ne: migrationPlan.schema_version_to }
        })
      },

      // Schema validation
      schema_validation: await this.validateSchemaCompliance(collection, migrationPlan.schema_version_to),

      // Data integrity checks
      data_integrity: await this.performDataIntegrityChecks(collection, migrationPlan),

      // Performance impact assessment
      performance_impact: await this.assessPerformanceImpact(collection, migrationPlan),

      // Compatibility verification
      compatibility_status: await this.verifyBackwardCompatibility(collection, migrationPlan)
    };

    // Calculate overall validation status
    validationResults.overall_status = this.calculateOverallValidationStatus(validationResults);

    // Store validation results
    await this.db.collection('migration_validations').insertOne(validationResults);

    console.log(`Migration validation completed: ${validationResults.overall_status}`);
    return validationResults;
  }

  // Utility methods for migration management
  generateMigrationId() {
    return `migration_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
  }

  generateExecutionId() {
    return `exec_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
  }

  async loadSchemaVersionRegistry() {
    const schemaVersions = await this.db.collection('schema_versions')
      .find({ 'version_metadata.is_active': true })
      .toArray();

    schemaVersions.forEach(schema => {
      this.schemaVersions.set(`${schema.collection_name}:${schema.version}`, schema);
    });

    console.log(`Loaded ${schemaVersions.length} active schema versions`);
  }

  analyzeMigrationType(sourceSchema, targetSchema) {
    const sourceFields = new Set(Object.keys(sourceSchema.schema_definition.fields));
    const targetFields = new Set(Object.keys(targetSchema.schema_definition.fields));

    const addedFields = [...targetFields].filter(field => !sourceFields.has(field));
    const removedFields = [...sourceFields].filter(field => !targetFields.has(field));
    const modifiedFields = [...sourceFields].filter(field => 
      targetFields.has(field) && 
      JSON.stringify(sourceSchema.schema_definition.fields[field]) !== 
      JSON.stringify(targetSchema.schema_definition.fields[field])
    );

    if (removedFields.length > 0 || modifiedFields.length > 0) {
      return 'breaking';
    } else if (addedFields.length > 0) {
      return 'additive';
    } else {
      return 'maintenance';
    }
  }
}

// Example usage demonstrating comprehensive MongoDB schema evolution
async function demonstrateSchemaEvolution() {
  const schemaManager = new MongoSchemaEvolutionManager('mongodb://localhost:27017');

  try {
    await schemaManager.initialize('ecommerce_platform');

    console.log('Defining initial user schema version...');

    // Define initial schema version
    await schemaManager.defineSchemaVersion('users', {
      version: '1.0',
      versionName: 'initial_user_schema',
      fields: {
        _id: { type: 'ObjectId', required: true },
        email: { type: 'String', required: true, unique: true },
        password_hash: { type: 'String', required: true },
        created_at: { type: 'Date', required: true },
        last_login: { type: 'Date', required: false }
      },
      requiredFields: ['_id', 'email', 'password_hash', 'created_at'],
      migrationType: 'initial',
      backwardCompatible: true
    });

    // Define enhanced schema version
    await schemaManager.defineSchemaVersion('users', {
      version: '2.0',
      versionName: 'enhanced_user_profile',
      fields: {
        _id: { type: 'ObjectId', required: true },
        email: { type: 'String', required: true, unique: true },
        password_hash: { type: 'String', required: true },

        // New profile fields
        profile: {
          type: 'Object',
          required: false,
          fields: {
            first_name: { type: 'String', required: false },
            last_name: { type: 'String', required: false },
            avatar_url: { type: 'String', required: false },
            bio: { type: 'String', required: false, max_length: 500 }
          }
        },

        // Enhanced user preferences
        preferences: {
          type: 'Object',
          required: false,
          fields: {
            email_notifications: { type: 'Boolean', default: true },
            privacy_level: { type: 'String', enum: ['public', 'friends', 'private'], default: 'public' },
            theme: { type: 'String', enum: ['light', 'dark'], default: 'light' },
            language: { type: 'String', default: 'en' }
          }
        },

        // Subscription and status
        subscription: {
          type: 'Object',
          required: false,
          fields: {
            tier: { type: 'String', enum: ['basic', 'plus', 'premium'], default: 'basic' },
            expires_at: { type: 'Date', required: false },
            auto_renewal: { type: 'Boolean', default: false }
          }
        },

        // Tracking and analytics
        activity: {
          type: 'Object',
          required: false,
          fields: {
            last_login: { type: 'Date', required: false },
            login_count: { type: 'Number', default: 0 },
            profile_completion: { type: 'Number', min: 0, max: 100, default: 0 }
          }
        },

        created_at: { type: 'Date', required: true },
        updated_at: { type: 'Date', required: true }
      },
      requiredFields: ['_id', 'email', 'password_hash', 'created_at', 'updated_at'],

      // Migration configuration
      migrationType: 'additive',
      backwardCompatible: true,

      // Field mappings and transformations
      fieldMappings: {
        last_login: 'activity.last_login'
      },

      dataTransformations: {
        // Transform old last_login field to new nested structure
        'activity.last_login': 'document.last_login',
        'activity.login_count': '1',
        'activity.profile_completion': 'calculateProfileCompletion(document)',
        'preferences': 'generateDefaultPreferences()',
        'subscription.tier': 'deriveTierFromHistory(document)'
      }
    });

    // Create migration plan
    const migrationPlan = await schemaManager.createMigrationPlan('users', '1.0', '2.0', {
      batchSize: 500,
      concurrentBatches: 2,
      backupRequired: true,
      deploymentWindow: {
        start: '2024-01-15T02:00:00Z',
        end: '2024-01-15T06:00:00Z'
      }
    });

    console.log('Migration plan created:', migrationPlan.migration_id);

    // Generate backward compatibility layer
    const compatibilityLayer = await schemaManager.generateBackwardCompatibilityLayer('users', '1.0', '2.0');
    console.log('Backward compatibility layer generated');

    // Execute migration (if approved and in deployment window)
    if (process.env.EXECUTE_MIGRATION === 'true') {
      const migrationResult = await schemaManager.executeMigration(migrationPlan.migration_id, {
        strategy: 'gradual',
        autoRollback: true
      });

      console.log('Migration executed:', migrationResult);

      // Validate migration integrity
      const validationResults = await schemaManager.validateMigrationIntegrity('users', migrationPlan.migration_id);
      console.log('Migration validation:', validationResults.overall_status);
    }

  } catch (error) {
    console.error('Schema evolution demonstration error:', error);
  }
}

module.exports = {
  MongoSchemaEvolutionManager,
  demonstrateSchemaEvolution
};

Understanding MongoDB Schema Evolution Patterns

Advanced Migration Strategies and Version Management

Implement sophisticated schema evolution with enterprise-grade version control and migration orchestration:

// Production-ready schema evolution with advanced migration patterns
// (assumes the MongoSchemaEvolutionManager class above is exported from a local module; the path is illustrative)
const { MongoSchemaEvolutionManager } = require('./mongo-schema-evolution-manager');

class EnterpriseSchemaEvolutionManager extends MongoSchemaEvolutionManager {
  constructor(connectionUri, enterpriseConfig) {
    super(connectionUri, enterpriseConfig);

    this.enterpriseFeatures = {
      // Advanced migration orchestration
      migrationOrchestration: {
        distributedMigration: true,
        crossCollectionDependencies: true,
        transactionalMigration: true,
        rollbackOrchestration: true
      },

      // Enterprise integration
      enterpriseIntegration: {
        cicdIntegration: true,
        approvalWorkflows: true,
        auditCompliance: true,
        performanceMonitoring: true
      },

      // Advanced compatibility management
      compatibilityManagement: {
        multiVersionSupport: true,
        apiVersioning: true,
        clientCompatibilityTracking: true,
        automaticDeprecation: true
      }
    };
  }

  async orchestrateDistributedMigration(migrationConfig) {
    console.log('Orchestrating distributed migration across collections...');

    const distributedPlan = {
      // Cross-collection dependency management
      dependencyGraph: await this.analyzeCrossCollectionDependencies(migrationConfig.collections),

      // Coordinated execution strategy
      executionStrategy: {
        coordinationMethod: 'transaction', // transaction, phased, eventually_consistent
        consistencyLevel: 'strong', // strong, eventual, causal
        isolationLevel: 'snapshot', // snapshot, read_committed, read_uncommitted
        rollbackStrategy: 'coordinated' // coordinated, independent, manual
      },

      // Performance optimization
      performanceOptimization: {
        parallelCollections: true,
        resourceBalancing: true,
        priorityQueueing: true,
        adaptiveThrottling: true
      }
    };

    return await this.executeDistributedMigration(distributedPlan);
  }

  async implementSmartRollback(migrationId, rollbackConfig) {
    console.log('Implementing smart rollback with data recovery...');

    const rollbackStrategy = {
      // Intelligent rollback analysis
      rollbackAnalysis: {
        dataImpactAssessment: true,
        dependencyReversal: true,
        performanceImpactMinimization: true,
        dataConsistencyVerification: true
      },

      // Recovery mechanisms
      recoveryMechanisms: {
        pointInTimeRecovery: rollbackConfig.pointInTimeRecovery || false,
        incrementalRollback: rollbackConfig.incrementalRollback || false,
        dataReconciliation: rollbackConfig.dataReconciliation !== false,
        consistencyRepair: rollbackConfig.consistencyRepair !== false
      }
    };

    return await this.executeSmartRollback(migrationId, rollbackStrategy);
  }
}

SQL-Style Schema Management with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB schema evolution and migration management:

-- QueryLeaf schema evolution with SQL-familiar migration patterns

-- Define comprehensive schema version with validation and constraints
CREATE SCHEMA_VERSION users_v2 FOR COLLECTION users AS (
  -- Schema version metadata
  version_number = '2.0',
  version_name = 'enhanced_user_profiles',
  migration_type = 'additive',
  backward_compatible = true,

  -- Field definitions with validation rules
  field_definitions = JSON_OBJECT(
    '_id', JSON_OBJECT('type', 'ObjectId', 'required', true, 'primary_key', true),
    'email', JSON_OBJECT('type', 'String', 'required', true, 'unique', true, 'format', 'email'),
    'password_hash', JSON_OBJECT('type', 'String', 'required', true, 'min_length', 60),

    -- New nested profile structure
    'profile', JSON_OBJECT(
      'type', 'Object',
      'required', false,
      'fields', JSON_OBJECT(
        'first_name', JSON_OBJECT('type', 'String', 'max_length', 50),
        'last_name', JSON_OBJECT('type', 'String', 'max_length', 50),
        'display_name', JSON_OBJECT('type', 'String', 'max_length', 100),
        'avatar_url', JSON_OBJECT('type', 'String', 'format', 'url'),
        'bio', JSON_OBJECT('type', 'String', 'max_length', 500),
        'date_of_birth', JSON_OBJECT('type', 'Date', 'format', 'YYYY-MM-DD'),
        'location', JSON_OBJECT(
          'type', 'Object',
          'fields', JSON_OBJECT(
            'city', JSON_OBJECT('type', 'String'),
            'country', JSON_OBJECT('type', 'String', 'length', 2),
            'timezone', JSON_OBJECT('type', 'String')
          )
        )
      )
    ),

    -- Enhanced user preferences with defaults
    'preferences', JSON_OBJECT(
      'type', 'Object',
      'required', false,
      'default', JSON_OBJECT(
        'email_notifications', true,
        'privacy_level', 'public',
        'theme', 'light',
        'language', 'en'
      ),
      'fields', JSON_OBJECT(
        'email_notifications', JSON_OBJECT('type', 'Boolean', 'default', true),
        'privacy_level', JSON_OBJECT('type', 'String', 'enum', JSON_ARRAY('public', 'friends', 'private'), 'default', 'public'),
        'theme', JSON_OBJECT('type', 'String', 'enum', JSON_ARRAY('light', 'dark', 'auto'), 'default', 'light'),
        'language', JSON_OBJECT('type', 'String', 'pattern', '^[a-z]{2}$', 'default', 'en'),
        'notification_settings', JSON_OBJECT(
          'type', 'Object',
          'fields', JSON_OBJECT(
            'push_notifications', JSON_OBJECT('type', 'Boolean', 'default', true),
            'email_frequency', JSON_OBJECT('type', 'String', 'enum', JSON_ARRAY('immediate', 'daily', 'weekly'), 'default', 'daily')
          )
        )
      )
    ),

    -- Subscription and billing information
    'subscription', JSON_OBJECT(
      'type', 'Object',
      'required', false,
      'fields', JSON_OBJECT(
        'tier', JSON_OBJECT('type', 'String', 'enum', JSON_ARRAY('free', 'basic', 'plus', 'premium'), 'default', 'free'),
        'status', JSON_OBJECT('type', 'String', 'enum', JSON_ARRAY('active', 'cancelled', 'expired', 'trial'), 'default', 'active'),
        'starts_at', JSON_OBJECT('type', 'Date'),
        'expires_at', JSON_OBJECT('type', 'Date'),
        'auto_renewal', JSON_OBJECT('type', 'Boolean', 'default', false),
        'billing_cycle', JSON_OBJECT('type', 'String', 'enum', JSON_ARRAY('monthly', 'yearly'), 'default', 'monthly')
      )
    ),

    -- Activity tracking and analytics
    'activity_metrics', JSON_OBJECT(
      'type', 'Object',
      'required', false,
      'fields', JSON_OBJECT(
        'last_login_at', JSON_OBJECT('type', 'Date'),
        'login_count', JSON_OBJECT('type', 'Integer', 'min', 0, 'default', 0),
        'profile_completion_score', JSON_OBJECT('type', 'Integer', 'min', 0, 'max', 100, 'default', 0),
        'account_verification_status', JSON_OBJECT('type', 'String', 'enum', JSON_ARRAY('pending', 'verified', 'rejected'), 'default', 'pending'),
        'last_profile_update', JSON_OBJECT('type', 'Date'),
        'feature_usage_stats', JSON_OBJECT(
          'type', 'Object',
          'fields', JSON_OBJECT(
            'dashboard_visits', JSON_OBJECT('type', 'Integer', 'default', 0),
            'api_calls_count', JSON_OBJECT('type', 'Integer', 'default', 0),
            'storage_usage_bytes', JSON_OBJECT('type', 'Long', 'default', 0)
          )
        )
      )
    ),

    -- Timestamps and audit trail
    'created_at', JSON_OBJECT('type', 'Date', 'required', true, 'immutable', true),
    'updated_at', JSON_OBJECT('type', 'Date', 'required', true, 'auto_update', true),
    '_schema_version', JSON_OBJECT('type', 'String', 'required', true, 'default', '2.0')
  ),

  -- Migration mapping from previous version
  migration_mappings = JSON_OBJECT(
    -- Direct field mappings
    'last_login', 'activity_metrics.last_login_at',

    -- Computed field mappings
    'activity_metrics.login_count', 'COALESCE(login_count, 1)',
    'activity_metrics.profile_completion_score', 'CALCULATE_PROFILE_COMPLETION(profile)',
    'subscription.tier', 'DERIVE_TIER_FROM_USAGE(usage_history)',
    'preferences', 'GENERATE_DEFAULT_PREFERENCES()',
    'updated_at', 'CURRENT_TIMESTAMP'
  ),

  -- Validation rules for data integrity
  validation_rules = JSON_ARRAY(
    JSON_OBJECT('rule', 'email_domain_validation', 'expression', 'email REGEXP ''^[^@]+@[^@]+\\.[^@]+$'''),
    JSON_OBJECT('rule', 'subscription_dates_consistency', 'expression', 'subscription.expires_at > subscription.starts_at'),
    JSON_OBJECT('rule', 'profile_completion_accuracy', 'expression', 'activity_metrics.profile_completion_score <= 100'),
    JSON_OBJECT('rule', 'timezone_validation', 'expression', 'profile.location.timezone IN (SELECT timezone FROM valid_timezones)')
  ),

  -- Index optimization for new schema
  index_definitions = JSON_ARRAY(
    JSON_OBJECT('fields', JSON_OBJECT('email', 1), 'unique', true, 'sparse', false),
    JSON_OBJECT('fields', JSON_OBJECT('subscription.tier', 1, 'subscription.status', 1), 'background', true),
    JSON_OBJECT('fields', JSON_OBJECT('activity_metrics.last_login_at', -1), 'background', true),
    JSON_OBJECT('fields', JSON_OBJECT('profile.location.country', 1), 'sparse', true),
    JSON_OBJECT('fields', JSON_OBJECT('_schema_version', 1), 'background', true)
  ),

  -- Compatibility and deprecation settings
  compatibility_settings = JSON_OBJECT(
    'maintain_old_fields_days', 90,
    'deprecated_fields', JSON_ARRAY('last_login', 'login_count'),
    'breaking_changes', JSON_ARRAY(),
    'migration_required_for', JSON_ARRAY('v1.0', 'v1.5')
  )
);

-- Create comprehensive migration plan with performance optimization
WITH migration_analysis AS (
  SELECT 
    collection_name,
    current_schema_version,
    target_schema_version,

    -- Document analysis for migration planning
    COUNT(*) as total_documents,
    AVG(BSON_SIZE(document)) as avg_document_size,
    SUM(BSON_SIZE(document)) / 1024 / 1024 as total_size_mb,

    -- Performance projections
    CASE 
      WHEN COUNT(*) > 10000000 THEN 'large_collection_parallel_required'
      WHEN COUNT(*) > 1000000 THEN 'medium_collection_batch_optimize'
      ELSE 'small_collection_standard_processing'
    END as processing_category,

    -- Migration complexity assessment
    CASE 
      WHEN target_schema_version LIKE '%.0' THEN 'major_version_comprehensive_testing'
      WHEN COUNT_SCHEMA_CHANGES(current_schema_version, target_schema_version) > 10 THEN 'complex_migration'
      ELSE 'standard_migration'
    END as migration_complexity,

    -- Resource requirements estimation
    CEIL(COUNT(*) / 1000.0) as estimated_batches,
    CEIL((SUM(BSON_SIZE(document)) / 1024 / 1024) / 100.0) * 2 as estimated_duration_minutes,
    CEIL(COUNT(*) / 10000.0) * 512 as estimated_memory_mb

  FROM users u
  JOIN schema_version_registry svr ON u._schema_version = svr.version
  WHERE svr.collection_name = 'users'
  GROUP BY collection_name, current_schema_version, target_schema_version
),

-- Generate optimized migration execution plan
migration_execution_plan AS (
  SELECT 
    ma.*,

    -- Batch processing configuration
    CASE ma.processing_category
      WHEN 'large_collection_parallel_required' THEN 
        JSON_OBJECT(
          'batch_size', 500,
          'concurrent_batches', 5,
          'parallel_collections', true,
          'memory_limit_per_batch_mb', 256,
          'throttle_delay_ms', 50
        )
      WHEN 'medium_collection_batch_optimize' THEN
        JSON_OBJECT(
          'batch_size', 1000,
          'concurrent_batches', 3,
          'parallel_collections', false,
          'memory_limit_per_batch_mb', 128,
          'throttle_delay_ms', 10
        )
      ELSE
        JSON_OBJECT(
          'batch_size', 2000,
          'concurrent_batches', 1,
          'parallel_collections', false,
          'memory_limit_per_batch_mb', 64,
          'throttle_delay_ms', 0
        )
    END as batch_configuration,

    -- Safety and rollback configuration
    JSON_OBJECT(
      'backup_required', CASE WHEN ma.total_documents > 100000 THEN true ELSE false END,
      'rollback_enabled', true,
      'validation_sample_size', LEAST(ma.total_documents * 0.1, 10000),
      'progress_checkpoint_interval', GREATEST(ma.estimated_batches / 10, 1),
      'failure_threshold_percent', 5.0
    ) as safety_configuration,

    -- Performance monitoring setup
    JSON_OBJECT(
      'monitor_memory_usage', true,
      'monitor_throughput', true,
      'monitor_lock_contention', true,
      'alert_on_slowdown_percent', 50,
      'performance_baseline_samples', 100
    ) as monitoring_configuration

  FROM migration_analysis ma
)

-- Create and execute migration plan
CREATE MIGRATION_PLAN users_v1_to_v2 AS (
  SELECT 
    mep.*,

    -- Migration steps with detailed transformations
    JSON_ARRAY(
      -- Step 1: Add new schema version field
      JSON_OBJECT(
        'step_number', 1,
        'step_type', 'add_field',
        'field_name', '_schema_version',
        'default_value', '2.0',
        'description', 'Add schema version tracking'
      ),

      -- Step 2: Restructure activity data
      JSON_OBJECT(
        'step_number', 2,
        'step_type', 'nested_restructure',
        'restructure_config', JSON_OBJECT(
          'create_nested_object', 'activity_metrics',
          'field_mappings', JSON_OBJECT(
            'last_login', 'activity_metrics.last_login_at',
            'login_count', 'activity_metrics.login_count'
          ),
          'computed_fields', JSON_OBJECT(
            'activity_metrics.profile_completion_score', 'CALCULATE_PROFILE_COMPLETION(profile)',
            'activity_metrics.account_verification_status', '''pending'''
          )
        )
      ),

      -- Step 3: Generate default preferences
      JSON_OBJECT(
        'step_number', 3,
        'step_type', 'add_field',
        'field_name', 'preferences',
        'transformation', 'GENERATE_DEFAULT_PREFERENCES()',
        'description', 'Add user preferences with smart defaults'
      ),

      -- Step 4: Initialize subscription data
      JSON_OBJECT(
        'step_number', 4,
        'step_type', 'add_field',
        'field_name', 'subscription',
        'transformation', 'DERIVE_SUBSCRIPTION_INFO(user_history)',
        'description', 'Initialize subscription information from usage history'
      ),

      -- Step 5: Update timestamps
      JSON_OBJECT(
        'step_number', 5,
        'step_type', 'add_field',
        'field_name', 'updated_at',
        'default_value', 'CURRENT_TIMESTAMP',
        'description', 'Add updated timestamp for audit trail'
      )
    ) as migration_steps,

    -- Validation and verification tests
    JSON_ARRAY(
      JSON_OBJECT(
        'test_name', 'schema_version_consistency',
        'test_query', 'SELECT COUNT(*) FROM users WHERE _schema_version != ''2.0''',
        'expected_result', 0,
        'severity', 'critical'
      ),
      JSON_OBJECT(
        'test_name', 'data_completeness_check',
        'test_query', 'SELECT COUNT(*) FROM users WHERE activity_metrics IS NULL',
        'expected_result', 0,
        'severity', 'critical'
      ),
      JSON_OBJECT(
        'test_name', 'preferences_initialization',
        'test_query', 'SELECT COUNT(*) FROM users WHERE preferences IS NULL',
        'expected_result', 0,
        'severity', 'high'
      ),
      JSON_OBJECT(
        'test_name', 'profile_completion_accuracy',
        'test_query', 'SELECT COUNT(*) FROM users WHERE activity_metrics.profile_completion_score < 0 OR activity_metrics.profile_completion_score > 100',
        'expected_result', 0,
        'severity', 'medium'
      )
    ) as validation_tests

  FROM migration_execution_plan mep
);

-- Execute migration with comprehensive monitoring and safety checks
EXECUTE MIGRATION users_v1_to_v2 WITH OPTIONS (
  -- Execution settings
  execution_mode = 'gradual',  -- gradual, immediate, test_mode
  safety_checks_enabled = true,
  automatic_rollback = true,

  -- Performance settings
  resource_limits = JSON_OBJECT(
    'max_memory_usage_mb', 1024,
    'max_execution_time_minutes', 120,
    'max_cpu_usage_percent', 80,
    'io_throttling_enabled', true
  ),

  -- Monitoring and alerting
  monitoring = JSON_OBJECT(
    'progress_reporting_interval_seconds', 30,
    'performance_metrics_collection', true,
    'alert_on_errors', true,
    'alert_email', 'dba@company.com'
  ),

  -- Backup and recovery
  backup_settings = JSON_OBJECT(
    'create_backup_before_migration', true,
    'backup_location', 'migrations/backup_users_v1_to_v2',
    'verify_backup_integrity', true
  )
);

-- Monitor migration progress with real-time analytics
WITH migration_progress AS (
  SELECT 
    migration_id,
    execution_id,
    collection_name,
    schema_version_from,
    schema_version_to,

    -- Progress tracking
    total_documents,
    processed_documents,
    ROUND((processed_documents::numeric / total_documents) * 100, 2) as progress_percentage,

    -- Performance metrics
    EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - migration_started_at) as elapsed_seconds,
    ROUND(processed_documents::numeric / NULLIF(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - migration_started_at), 0), 2) as documents_per_second,

    -- Resource utilization
    current_memory_usage_mb,
    peak_memory_usage_mb,
    cpu_usage_percent,

    -- Quality indicators
    error_count,
    warning_count,
    validation_failures,

    -- ETA calculation
    CASE 
      WHEN processed_documents > 0 AND migration_status = 'running' THEN
        CURRENT_TIMESTAMP + 
        (INTERVAL '1 second' * 
         ((total_documents - processed_documents) / 
          (processed_documents::numeric / EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - migration_started_at))))
      ELSE NULL
    END as estimated_completion_time,

    migration_status

  FROM migration_execution_status
  WHERE migration_status IN ('running', 'validating', 'finalizing')
),

-- Performance trend analysis
performance_trends AS (
  SELECT 
    migration_id,

    -- Throughput trends (last 5 minutes)
    AVG(documents_per_second) OVER (
      ORDER BY checkpoint_timestamp 
      ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
    ) as avg_throughput_5min,

    -- Memory usage trends
    AVG(memory_usage_mb) OVER (
      ORDER BY checkpoint_timestamp
      ROWS BETWEEN 9 PRECEDING AND CURRENT ROW  
    ) as avg_memory_usage_10min,

    -- Error rate trends
    SUM(errors_since_last_checkpoint) OVER (
      ORDER BY checkpoint_timestamp
      ROWS BETWEEN 19 PRECEDING AND CURRENT ROW
    ) as error_count_20min,

    -- Performance indicators
    CASE 
      WHEN documents_per_second < avg_documents_per_second * 0.7 THEN 'degraded_performance'
      WHEN memory_usage_mb > peak_memory_usage_mb * 0.9 THEN 'high_memory_usage'
      WHEN error_count > 0 THEN 'errors_detected'
      ELSE 'healthy'
    END as health_status

  FROM migration_performance_checkpoints
  WHERE checkpoint_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
)

-- Migration monitoring dashboard
SELECT 
  -- Current status overview
  mp.migration_id,
  mp.collection_name,
  mp.progress_percentage || '%' as progress,
  mp.documents_per_second || ' docs/sec' as throughput,
  mp.estimated_completion_time,
  mp.migration_status,

  -- Resource utilization
  mp.current_memory_usage_mb || 'MB (' || 
    ROUND((mp.current_memory_usage_mb::numeric / mp.peak_memory_usage_mb) * 100, 1) || '% of peak)' as memory_usage,
  mp.cpu_usage_percent || '%' as cpu_usage,

  -- Quality indicators
  mp.error_count as errors,
  mp.warning_count as warnings,
  mp.validation_failures as validation_issues,

  -- Performance health
  pt.health_status,
  pt.avg_throughput_5min || ' docs/sec (5min avg)' as recent_throughput,

  -- Recommendations
  CASE 
    WHEN pt.health_status = 'degraded_performance' THEN 'Consider reducing batch size or increasing resources'
    WHEN pt.health_status = 'high_memory_usage' THEN 'Monitor for potential memory issues'
    WHEN pt.health_status = 'errors_detected' THEN 'Review error logs and consider pausing migration'
    WHEN mp.progress_percentage > 95 THEN 'Migration nearing completion, prepare for validation'
    ELSE 'Migration proceeding normally'
  END as recommendation,

  -- Next actions
  CASE 
    WHEN mp.migration_status = 'running' AND mp.progress_percentage > 99 THEN 'Begin final validation phase'
    WHEN mp.migration_status = 'validating' THEN 'Performing post-migration validation tests'
    WHEN mp.migration_status = 'finalizing' THEN 'Completing migration and cleanup'
    ELSE 'Continue monitoring progress'
  END as next_action

FROM migration_progress mp
LEFT JOIN performance_trends pt ON mp.migration_id = pt.migration_id
WHERE mp.migration_id = (SELECT MAX(migration_id) FROM migration_progress)

UNION ALL

-- Historical migration performance summary
SELECT 
  'HISTORICAL_SUMMARY' as migration_id,
  collection_name,
  NULL as progress,
  AVG(final_throughput) || ' docs/sec avg' as throughput,
  NULL as estimated_completion_time,
  'completed' as migration_status,
  AVG(peak_memory_usage_mb) || 'MB avg peak' as memory_usage,
  AVG(avg_cpu_usage_percent) || '% avg' as cpu_usage,
  SUM(total_errors) as errors,
  SUM(total_warnings) as warnings,
  SUM(validation_failures) as validation_issues,

  CASE 
    WHEN AVG(success_rate) > 99 THEN 'excellent_historical_performance'
    WHEN AVG(success_rate) > 95 THEN 'good_historical_performance'
    ELSE 'performance_issues_detected'
  END as health_status,

  COUNT(*) || ' previous migrations' as recent_throughput,
  'Historical performance baseline' as recommendation,
  'Use for future migration planning' as next_action

FROM migration_history
WHERE migration_completed_at >= CURRENT_DATE - INTERVAL '6 months'
  AND collection_name = 'users'
GROUP BY collection_name;

-- QueryLeaf schema evolution capabilities:
-- 1. SQL-familiar schema version definition with comprehensive validation rules
-- 2. Automated migration plan generation with performance optimization
-- 3. Advanced batch processing configuration based on collection size and complexity
-- 4. Real-time migration monitoring with progress tracking and performance analytics
-- 5. Comprehensive safety checks including automatic rollback and validation testing
-- 6. Backward compatibility management with deprecated field handling
-- 7. Resource utilization monitoring and optimization recommendations
-- 8. Historical performance analysis for migration planning and optimization
-- 9. Enterprise-grade error handling and recovery mechanisms
-- 10. Integration with MongoDB's native document flexibility while maintaining SQL familiarity
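
For orientation, the nested restructure step defined above corresponds roughly to an updateMany with an aggregation-pipeline update in the native Node.js driver. The sketch below is an illustrative assumption about that mapping (the database name, connection handling, and exact pipeline are placeholders), not QueryLeaf's actual generated output:

// Minimal sketch: one migration step (restructure activity fields) expressed
// directly against the Node.js driver. Field names follow the example above;
// the pipeline QueryLeaf generates may differ.
const { MongoClient } = require('mongodb');

async function migrateActivityFields(uri) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const users = client.db('app').collection('users');

    // Only touch documents still on the old schema version
    const result = await users.updateMany(
      { _schema_version: { $ne: '2.0' } },
      [
        {
          $set: {
            activity_metrics: {
              last_login_at: '$last_login',
              login_count: { $ifNull: ['$login_count', 0] }
            },
            _schema_version: '2.0',
            updated_at: '$$NOW'
          }
        },
        { $unset: ['last_login', 'login_count'] }
      ]
    );

    console.log(`Migrated ${result.modifiedCount} documents`);
  } finally {
    await client.close();
  }
}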

Best Practices for MongoDB Schema Evolution

Migration Strategy Design

Essential principles for effective MongoDB schema evolution and migration management:

  1. Gradual Evolution: Implement incremental schema changes that support both old and new document structures during transition periods
  2. Version Tracking: Maintain explicit schema version fields in documents to enable targeted migration and compatibility management (see the sketch after this list)
  3. Backward Compatibility: Design migrations that preserve application functionality across deployment cycles and rollback scenarios
  4. Performance Optimization: Utilize batch processing, indexing strategies, and resource throttling to minimize production impact
  5. Validation and Testing: Implement comprehensive validation frameworks that verify data integrity and schema compliance
  6. Rollback Planning: Design robust rollback strategies with automated recovery mechanisms for migration failures
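
To make the first two principles concrete, one common pattern is to upgrade documents lazily as the application reads them, using the explicit version field to decide whether any work is needed. A minimal sketch, reusing the users collection and field names from the earlier example:

// Lazy, version-aware upgrade on read: old documents keep working, and each
// one is migrated the first time it is touched. Collection and field names
// are assumptions based on the earlier example.
async function getUser(db, userId) {
  const users = db.collection('users');
  const user = await users.findOne({ _id: userId });
  if (!user) return null;

  if (user._schema_version !== '2.0') {
    // Upgrade in place; the filter on _schema_version keeps this idempotent
    // even if two readers race to migrate the same document.
    await users.updateOne(
      { _id: userId, _schema_version: { $ne: '2.0' } },
      {
        $set: {
          _schema_version: '2.0',
          'activity_metrics.last_login_at': user.last_login || null,
          'activity_metrics.login_count': user.login_count || 0,
          updated_at: new Date()
        },
        $unset: { last_login: '', login_count: '' }
      }
    );
    return users.findOne({ _id: userId });
  }

  return user;
}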

Production Deployment Strategies

Optimize MongoDB schema evolution for enterprise-scale applications:

  1. Zero-Downtime Migrations: Implement rolling migration strategies that maintain application availability during schema transitions
  2. Resource Management: Configure memory limits, CPU throttling, and I/O optimization to prevent system impact during migrations (a batching sketch follows this list)
  3. Monitoring and Alerting: Deploy real-time monitoring systems that track migration progress, performance, and error conditions
  4. Documentation and Compliance: Maintain comprehensive migration documentation and audit trails for regulatory compliance
  5. Testing and Validation: Establish staging environments that replicate production conditions for migration testing and validation
  6. Team Coordination: Implement approval workflows and deployment coordination processes for enterprise migration management
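
The resource-management guidance above largely comes down to pacing: process documents in bounded batches and pause between them so the migration never saturates the primary. A minimal sketch of that pattern (batch size and delay values are illustrative, not tuned recommendations):

// Batched migration with throttling: bounded memory (one batch of _ids at a
// time) and a pause between batches to leave headroom for production traffic.
async function migrateInBatches(db, { batchSize = 1000, delayMs = 50 } = {}) {
  const users = db.collection('users');
  let migrated = 0;

  for (;;) {
    // Grab the next batch of unmigrated document ids
    const batch = await users
      .find({ _schema_version: { $ne: '2.0' } }, { projection: { _id: 1 } })
      .limit(batchSize)
      .toArray();

    if (batch.length === 0) break;

    const ids = batch.map(doc => doc._id);
    const result = await users.updateMany(
      { _id: { $in: ids }, _schema_version: { $ne: '2.0' } },
      { $set: { _schema_version: '2.0', updated_at: new Date() } }
    );
    migrated += result.modifiedCount;

    // Throttle so the migration shares resources with live traffic
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }

  return migrated;
}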

Conclusion

MongoDB schema evolution provides comprehensive capabilities for managing database structure changes through flexible document models, automated migration frameworks, and sophisticated compatibility management systems. The document-based architecture enables gradual schema transitions that maintain application stability while supporting continuous evolution of data models and business requirements.

Key MongoDB Schema Evolution benefits include:

  • Flexible Migration Strategies: Support for gradual, immediate, and hybrid migration approaches that adapt to different application requirements and constraints
  • Zero-Downtime Evolution: Advanced migration patterns that maintain application availability during schema transitions and data transformations
  • Comprehensive Version Management: Sophisticated version tracking and compatibility management that supports multiple application versions simultaneously
  • Performance Optimization: Intelligent batch processing and resource management that minimizes production system impact during migrations
  • Automated Validation: Built-in validation frameworks that ensure data integrity and schema compliance throughout migration processes
  • Enterprise Integration: Advanced orchestration capabilities that integrate with CI/CD pipelines, approval workflows, and enterprise monitoring systems

Whether you're evolving simple document structures, implementing complex data transformations, or managing enterprise-scale schema migrations, MongoDB's schema evolution capabilities with QueryLeaf's familiar SQL interface provide the foundation for robust, maintainable database evolution strategies.

QueryLeaf Integration: QueryLeaf automatically translates SQL-style schema definition and migration commands into optimized MongoDB operations, providing familiar DDL syntax for schema versions, migration plan creation, and execution monitoring. Advanced schema evolution patterns, backward compatibility management, and performance optimization are seamlessly accessible through SQL constructs, making sophisticated database evolution both powerful and approachable for SQL-oriented development teams.

The combination of MongoDB's flexible schema capabilities with SQL-style migration management makes it an ideal platform for modern applications requiring both database evolution flexibility and operational simplicity, ensuring your schema management processes can scale efficiently while maintaining data integrity and application stability throughout continuous development cycles.

MongoDB Concurrent Operations and Race Condition Management: Advanced Multi-User Data Integrity with Optimistic Locking and Conflict Resolution

Modern applications face increasing concurrency challenges as user bases grow and systems become more distributed. Multiple users modifying the same data simultaneously, background processes running automated updates, and microservices accessing shared resources create complex race condition scenarios that can lead to data corruption, inconsistent states, and lost updates.

Traditional approaches to concurrency control often rely on pessimistic locking mechanisms that can create bottlenecks, deadlocks, and reduced system throughput. MongoDB's flexible document model and powerful atomic operations provide sophisticated tools for managing concurrent operations while maintaining high performance and data integrity.

The Concurrency Challenge

Traditional relational databases handle concurrency through locking mechanisms that can limit scalability:

-- Traditional pessimistic locking approach - blocks other users
BEGIN TRANSACTION;

-- Exclusive lock prevents other transactions from reading/writing
SELECT account_balance 
FROM accounts 
WHERE account_id = 12345 
FOR UPDATE;  -- Blocks all other operations

-- Update after acquiring lock
UPDATE accounts 
SET account_balance = account_balance - 500.00,
    last_transaction = CURRENT_TIMESTAMP
WHERE account_id = 12345;

-- Transaction processing during exclusive lock
INSERT INTO transactions (
    account_id, 
    transaction_type, 
    amount, 
    timestamp
) VALUES (12345, 'withdrawal', -500.00, CURRENT_TIMESTAMP);

COMMIT TRANSACTION;

-- Problems with pessimistic locking:
-- - Reduced concurrency due to blocking
-- - Potential for deadlocks with multiple locks
-- - Performance bottlenecks under high load
-- - Lock timeouts and failed operations
-- - Complex lock hierarchy management
-- - Reduced system scalability

MongoDB provides optimistic concurrency control and atomic operations that maintain data integrity without blocking:

// MongoDB optimistic concurrency with atomic operations
async function transferFunds(fromAccount, toAccount, amount) {
  const session = client.startSession();

  try {
    return await session.withTransaction(async () => {
      // Read current state without locking
      const fromAccountDoc = await db.collection('accounts').findOne(
        { accountId: fromAccount }, 
        { session }
      );

      const toAccountDoc = await db.collection('accounts').findOne(
        { accountId: toAccount }, 
        { session }
      );

      // Verify sufficient balance
      if (fromAccountDoc.balance < amount) {
        throw new Error('Insufficient funds');
      }

      // Atomic update with optimistic concurrency control
      const fromResult = await db.collection('accounts').updateOne(
        { 
          accountId: fromAccount, 
          version: fromAccountDoc.version,  // Optimistic lock
          balance: { $gte: amount }         // Additional safety check
        },
        { 
          $inc: { 
            balance: -amount,
            version: 1                      // Increment version
          },
          $set: { 
            lastModified: new Date(),
            lastTransaction: ObjectId()
          }
        },
        { session }
      );

      // Check if update succeeded (no race condition)
      if (fromResult.modifiedCount === 0) {
        throw new Error('Account modified by another operation - retry');
      }

      // Atomic credit to destination account
      const toResult = await db.collection('accounts').updateOne(
        { 
          accountId: toAccount,
          version: toAccountDoc.version
        },
        { 
          $inc: { 
            balance: amount,
            version: 1
          },
          $set: { 
            lastModified: new Date(),
            lastTransaction: ObjectId()
          }
        },
        { session }
      );

      if (toResult.modifiedCount === 0) {
        throw new Error('Destination account modified - retry');
      }

      // Record transaction atomically (reuse one id so the caller receives the stored id)
      const transactionId = new ObjectId();
      await db.collection('transactions').insertOne({
        transactionId: transactionId,
        fromAccount: fromAccount,
        toAccount: toAccount,
        amount: amount,
        timestamp: new Date(),
        status: 'completed',
        sessionId: session.id
      }, { session });

      return { success: true, transactionId: transactionId };
    });

  } catch (error) {
    console.error('Transaction failed:', error.message);
    throw error;
  } finally {
    await session.endSession();
  }
}

// Benefits of optimistic concurrency:
// - High concurrency without blocking
// - No deadlock scenarios
// - Automatic conflict detection and retry
// - Maintains ACID properties through transactions
// - Scalable under high load
// - Flexible conflict resolution strategies
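
Because the version checks above fail fast instead of blocking, callers are expected to retry when a conflict is reported. A minimal sketch of such a caller, assuming the transferFunds function above is in scope; the retry limit and backoff schedule are illustrative:

// Caller-side retry loop for the optimistic transfer above: conflicts surface
// as errors, so we back off and try again a bounded number of times.
async function transferWithRetry(fromAccount, toAccount, amount, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await transferFunds(fromAccount, toAccount, amount);
    } catch (error) {
      const retryable = /retry|modified/i.test(error.message);
      if (!retryable || attempt === maxRetries) {
        throw error;
      }
      // Exponential backoff: 100ms, 200ms, 400ms, ...
      const delay = 100 * Math.pow(2, attempt);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}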

Understanding Concurrent Operations in MongoDB

Optimistic Locking and Version Control

Implement sophisticated version-based concurrency control:

// Advanced optimistic locking system
class OptimisticLockManager {
  constructor(db) {
    this.db = db;
    this.retryConfig = {
      maxRetries: 3,
      baseDelay: 100,
      maxDelay: 1000,
      backoffFactor: 2
    };
  }

  async updateWithOptimisticLock(collection, filter, update, options = {}) {
    const maxRetries = options.maxRetries || this.retryConfig.maxRetries;
    let attempt = 0;

    while (attempt <= maxRetries) {
      try {
        // Get current document with version
        const currentDoc = await this.db.collection(collection).findOne(filter);

        if (!currentDoc) {
          throw new Error('Document not found');
        }

        // Ensure document has version field
        const currentVersion = currentDoc.version || 0;

        // Prepare update with version increment
        const versionedUpdate = {
          ...update,
          $inc: {
            ...(update.$inc || {}),
            version: 1
          },
          $set: {
            ...(update.$set || {}),
            lastModified: new Date(),
            modifiedBy: options.userId || 'system'
          }
        };

        // Atomic update with version check
        const result = await this.db.collection(collection).updateOne(
          { 
            ...filter,
            version: currentVersion  // Optimistic lock condition
          },
          versionedUpdate,
          options.mongoOptions || {}
        );

        if (result.modifiedCount === 0) {
          // Document was modified by another operation
          throw new OptimisticLockError(
            `Document modified by another operation. Expected version: ${currentVersion}`
          );
        }

        // Success - return updated document info
        return {
          success: true,
          previousVersion: currentVersion,
          newVersion: currentVersion + 1,
          modifiedCount: result.modifiedCount,
          attempt: attempt + 1
        };

      } catch (error) {
        if (error instanceof OptimisticLockError && attempt < maxRetries) {
          // Retry with exponential backoff
          const delay = Math.min(
            this.retryConfig.baseDelay * Math.pow(this.retryConfig.backoffFactor, attempt),
            this.retryConfig.maxDelay
          );

          console.log(`Optimistic lock retry ${attempt + 1}/${maxRetries} after ${delay}ms`);
          await this.sleep(delay);
          attempt++;
          continue;
        }

        // Max retries exceeded or non-retryable error
        throw error;
      }
    }
  }

  async updateManyWithOptimisticLock(collection, documents, updateFunction, options = {}) {
    // Batch optimistic locking for multiple documents
    const session = this.db.client.startSession();
    const results = [];

    try {
      await session.withTransaction(async () => {
        for (const docFilter of documents) {
          const currentDoc = await this.db.collection(collection).findOne(
            docFilter, 
            { session }
          );

          if (!currentDoc) {
            throw new Error(`Document not found: ${JSON.stringify(docFilter)}`);
          }

          // Apply update function to get changes
          const update = await updateFunction(currentDoc, docFilter);
          const currentVersion = currentDoc.version || 0;

          // Atomic update with version check
          const result = await this.db.collection(collection).updateOne(
            { 
              ...docFilter,
              version: currentVersion
            },
            {
              ...update,
              $inc: {
                ...(update.$inc || {}),
                version: 1
              },
              $set: {
                ...(update.$set || {}),
                lastModified: new Date(),
                batchId: options.batchId || ObjectId()
              }
            },
            { session }
          );

          if (result.modifiedCount === 0) {
            throw new OptimisticLockError(
              `Batch update failed - document modified: ${JSON.stringify(docFilter)}`
            );
          }

          results.push({
            filter: docFilter,
            previousVersion: currentVersion,
            newVersion: currentVersion + 1,
            success: true
          });
        }
      });

      return {
        success: true,
        totalUpdated: results.length,
        results: results
      };

    } catch (error) {
      return {
        success: false,
        error: error.message,
        partialResults: results
      };
    } finally {
      await session.endSession();
    }
  }

  async compareAndSwap(collection, filter, expectedValue, newValue, options = {}) {
    // Compare-and-swap operation for atomic value updates
    const valueField = options.valueField || 'value';
    const versionField = options.versionField || 'version';

    const result = await this.db.collection(collection).updateOne(
      {
        ...filter,
        [valueField]: expectedValue,  // Current value must match
        ...(options.expectedVersion && { [versionField]: options.expectedVersion })
      },
      {
        $set: {
          [valueField]: newValue,
          lastModified: new Date(),
          modifiedBy: options.userId || 'system'
        },
        $inc: {
          [versionField]: 1
        }
      }
    );

    return {
      success: result.modifiedCount > 0,
      matched: result.matchedCount > 0,
      modified: result.modifiedCount,
      wasExpectedValue: result.matchedCount > 0
    };
  }

  async createVersionedDocument(collection, document, options = {}) {
    // Create new document with initial version
    const versionedDoc = {
      ...document,
      version: 1,
      createdAt: new Date(),
      lastModified: new Date(),
      createdBy: options.userId || 'system'
    };

    try {
      const result = await this.db.collection(collection).insertOne(
        versionedDoc,
        options.mongoOptions || {}
      );

      return {
        success: true,
        documentId: result.insertedId,
        version: 1
      };
    } catch (error) {
      if (error.code === 11000) { // Duplicate key error
        throw new Error('Document already exists with the same unique identifier');
      }
      throw error;
    }
  }

  async getDocumentVersion(collection, filter) {
    // Get current document version
    const doc = await this.db.collection(collection).findOne(
      filter, 
      { projection: { version: 1, lastModified: 1 } }
    );

    return doc ? {
      exists: true,
      version: doc.version || 0,
      lastModified: doc.lastModified
    } : {
      exists: false,
      version: null,
      lastModified: null
    };
  }

  async getVersionHistory(collection, filter, options = {}) {
    // Get version history if audit trail is maintained
    const limit = options.limit || 10;
    const auditCollection = `${collection}_audit`;

    const history = await this.db.collection(auditCollection).find(
      filter,
      { 
        sort: { version: -1, timestamp: -1 },
        limit: limit
      }
    ).toArray();

    return history;
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Custom error class for optimistic locking
class OptimisticLockError extends Error {
  constructor(message) {
    super(message);
    this.name = 'OptimisticLockError';
  }
}
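
A short usage sketch for the lock manager above, assuming a connected MongoClient, the OptimisticLockManager class in scope, and a hypothetical profiles collection; database and field names are placeholders:

// Example usage of OptimisticLockManager: version-checked profile update
// with automatic retry on conflict.
const { MongoClient } = require('mongodb');

async function updateDisplayName(uri, userId, newName) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const db = client.db('app');
    const lockManager = new OptimisticLockManager(db);

    const result = await lockManager.updateWithOptimisticLock(
      'profiles',
      { userId: userId },
      { $set: { displayName: newName } },
      { userId: userId, maxRetries: 5 }
    );

    console.log(`Updated to version ${result.newVersion} on attempt ${result.attempt}`);
    return result;
  } finally {
    await client.close();
  }
}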

Atomic Operations and Race Condition Prevention

Implement atomic operations to prevent race conditions:

// Advanced atomic operations for race condition prevention
class AtomicOperationManager {
  constructor(db) {
    this.db = db;
    this.operationLog = db.collection('atomic_operations_log');
  }

  async atomicIncrement(collection, filter, field, incrementValue = 1, options = {}) {
    // Thread-safe atomic increment with bounds checking
    const session = this.db.client.startSession();

    try {
      return await session.withTransaction(async () => {
        // Get current value
        const doc = await this.db.collection(collection).findOne(filter, { session });

        if (!doc) {
          throw new Error('Document not found for atomic increment');
        }

        const currentValue = doc[field] || 0;
        const newValue = currentValue + incrementValue;

        // Validate bounds if specified
        if (options.min !== undefined && newValue < options.min) {
          throw new Error(`Increment would violate minimum bound: ${options.min}`);
        }

        if (options.max !== undefined && newValue > options.max) {
          throw new Error(`Increment would violate maximum bound: ${options.max}`);
        }

        // Atomic increment with bounds re-checked in the filter so the update
        // still respects [min, max] even if another writer changed the value
        // between the read above and this update
        const updateFilter = {
          ...filter,
          [field]: {
            $gte: (options.min ?? Number.MIN_SAFE_INTEGER) - incrementValue,
            $lte: (options.max ?? Number.MAX_SAFE_INTEGER) - incrementValue
          }
        };

        const result = await this.db.collection(collection).updateOne(
          updateFilter,
          {
            $inc: { [field]: incrementValue },
            $set: { 
              lastModified: new Date(),
              lastIncrementBy: incrementValue
            }
          },
          { session }
        );

        if (result.modifiedCount === 0) {
          throw new Error('Atomic increment failed - bounds violated or document modified');
        }

        // Log successful operation
        await this.logAtomicOperation({
          operation: 'increment',
          collection: collection,
          filter: filter,
          field: field,
          incrementValue: incrementValue,
          previousValue: currentValue,
          newValue: newValue,
          timestamp: new Date()
        }, session);

        return {
          success: true,
          previousValue: currentValue,
          newValue: newValue,
          incrementValue: incrementValue
        };
      });
    } finally {
      await session.endSession();
    }
  }

  async atomicArrayOperation(collection, filter, arrayField, operation, value, options = {}) {
    // Thread-safe atomic array operations
    const session = this.db.client.startSession();

    try {
      return await session.withTransaction(async () => {
        const doc = await this.db.collection(collection).findOne(filter, { session });

        if (!doc) {
          throw new Error('Document not found for atomic array operation');
        }

        const currentArray = doc[arrayField] || [];
        let updateOperation = {};
        let operationResult = {};

        switch (operation) {
          case 'push':
            // Add element if not exists (optional uniqueness)
            if (options.unique && currentArray.includes(value)) {
              operationResult = {
                success: false,
                reason: 'duplicate_value',
                currentArray: currentArray
              };
            } else {
              updateOperation = { $push: { [arrayField]: value } };
              operationResult = {
                success: true,
                operation: 'push',
                value: value,
                newLength: currentArray.length + 1
              };
            }
            break;

          case 'pull':
            // Remove specific value
            if (!currentArray.includes(value)) {
              operationResult = {
                success: false,
                reason: 'value_not_found',
                currentArray: currentArray
              };
            } else {
              updateOperation = { $pull: { [arrayField]: value } };
              operationResult = {
                success: true,
                operation: 'pull',
                value: value,
                newLength: currentArray.length - 1
              };
            }
            break;

          case 'addToSet':
            // Add unique value to set
            updateOperation = { $addToSet: { [arrayField]: value } };
            operationResult = {
              success: true,
              operation: 'addToSet',
              value: value,
              wasAlreadyPresent: currentArray.includes(value)
            };
            break;

          case 'pop':
            // Remove last element
            if (currentArray.length === 0) {
              operationResult = {
                success: false,
                reason: 'array_empty',
                currentArray: currentArray
              };
            } else {
              updateOperation = { $pop: { [arrayField]: 1 } }; // Remove last
              operationResult = {
                success: true,
                operation: 'pop',
                removedValue: currentArray[currentArray.length - 1],
                newLength: currentArray.length - 1
              };
            }
            break;

          default:
            throw new Error(`Unsupported atomic array operation: ${operation}`);
        }

        if (operationResult.success && Object.keys(updateOperation).length > 0) {
          // Apply atomic update
          const result = await this.db.collection(collection).updateOne(
            filter,
            {
              ...updateOperation,
              $set: {
                lastModified: new Date(),
                lastArrayOperation: {
                  operation: operation,
                  value: value,
                  timestamp: new Date()
                }
              }
            },
            { session }
          );

          if (result.modifiedCount === 0) {
            throw new Error('Atomic array operation failed - document may have been modified');
          }
        }

        // Log operation
        await this.logAtomicOperation({
          operation: `array_${operation}`,
          collection: collection,
          filter: filter,
          arrayField: arrayField,
          value: value,
          result: operationResult,
          timestamp: new Date()
        }, session);

        return operationResult;
      });
    } finally {
      await session.endSession();
    }
  }

  async atomicUpsert(collection, filter, update, options = {}) {
    // Atomic upsert with race condition handling
    const session = this.db.client.startSession();

    try {
      return await session.withTransaction(async () => {
        // Try to find existing document
        const existingDoc = await this.db.collection(collection).findOne(filter, { session });

        if (existingDoc) {
          // Document exists - perform update with optimistic locking
          const currentVersion = existingDoc.version || 0;

          const result = await this.db.collection(collection).updateOne(
            {
              ...filter,
              version: currentVersion
            },
            {
              ...update,
              $inc: {
                ...(update.$inc || {}),
                version: 1
              },
              $set: {
                ...(update.$set || {}),
                lastModified: new Date(),
                operation: 'update'
              }
            },
            { session }
          );

          if (result.modifiedCount === 0) {
            throw new Error('Atomic upsert update failed - document modified concurrently');
          }

          return {
            operation: 'update',
            documentId: existingDoc._id,
            previousVersion: currentVersion,
            newVersion: currentVersion + 1,
            success: true
          };

        } else {
          // Document doesn't exist - try to insert
          const insertDoc = {
            ...filter,
            ...(update.$set || {}),
            version: 1,
            createdAt: new Date(),
            lastModified: new Date(),
            operation: 'insert'
          };

          // Apply increment operations to initial values
          if (update.$inc) {
            Object.keys(update.$inc).forEach(field => {
              if (field !== 'version') {
                insertDoc[field] = (insertDoc[field] || 0) + update.$inc[field];
              }
            });
          }

          try {
            const insertResult = await this.db.collection(collection).insertOne(
              insertDoc,
              { session }
            );

            return {
              operation: 'insert',
              documentId: insertResult.insertedId,
              version: 1,
              success: true
            };
          } catch (error) {
            if (error.code === 11000) {
              // Duplicate key - another process inserted concurrently;
              // surface a retryable error so the caller can re-run this upsert as an update
              throw new Error('Concurrent insert detected - retry the upsert');
            }
            throw error;
          }
        }
      });
    } finally {
      await session.endSession();
    }
  }

  async atomicSwapFields(collection, filter, field1, field2, options = {}) {
    // Atomically swap values between two fields
    const session = this.db.client.startSession();

    try {
      return await session.withTransaction(async () => {
        const doc = await this.db.collection(collection).findOne(filter, { session });

        if (!doc) {
          throw new Error('Document not found for atomic field swap');
        }

        const value1 = doc[field1];
        const value2 = doc[field2];

        // Perform atomic swap
        const result = await this.db.collection(collection).updateOne(
          filter,
          {
            $set: {
              [field1]: value2,
              [field2]: value1,
              lastModified: new Date(),
              lastSwapOperation: {
                field1: field1,
                field2: field2,
                timestamp: new Date()
              }
            },
            $inc: {
              version: 1
            }
          },
          { session }
        );

        if (result.modifiedCount === 0) {
          throw new Error('Atomic field swap failed');
        }

        return {
          success: true,
          swappedValues: {
            [field1]: { from: value1, to: value2 },
            [field2]: { from: value2, to: value1 }
          }
        };
      });
    } finally {
      await session.endSession();
    }
  }

  async bulkAtomicOperations(operations, options = {}) {
    // Execute multiple atomic operations in a single transaction
    const session = this.db.client.startSession();
    const results = [];

    try {
      await session.withTransaction(async () => {
        for (const [index, op] of operations.entries()) {
          try {
            let result;

            switch (op.type) {
              case 'increment':
                result = await this.atomicIncrement(
                  op.collection, op.filter, op.field, op.value, { ...op.options, session }
                );
                break;

              case 'arrayOperation':
                result = await this.atomicArrayOperation(
                  op.collection, op.filter, op.arrayField, op.operation, op.value, 
                  { ...op.options, session }
                );
                break;

              case 'upsert':
                result = await this.atomicUpsert(
                  op.collection, op.filter, op.update, { ...op.options, session }
                );
                break;

              default:
                throw new Error(`Unsupported bulk operation type: ${op.type}`);
            }

            results.push({
              index: index,
              operation: op.type,
              success: true,
              result: result
            });

          } catch (error) {
            results.push({
              index: index,
              operation: op.type,
              success: false,
              error: error.message
            });

            if (!options.continueOnError) {
              throw error;
            }
          }
        }
      });

      return {
        success: true,
        totalOperations: operations.length,
        successfulOperations: results.filter(r => r.success).length,
        results: results
      };

    } catch (error) {
      return {
        success: false,
        error: error.message,
        partialResults: results
      };
    } finally {
      await session.endSession();
    }
  }

  async logAtomicOperation(operationDetails, session) {
    // Log atomic operation for audit trail
    await this.operationLog.insertOne({
      ...operationDetails,
      operationId: ObjectId(),
      sessionId: session.id
    }, { session });
  }
}
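
A brief usage sketch for the atomic operation manager, reserving inventory with a lower bound of zero. The inventory collection, field names, and quantities are illustrative assumptions, and the class above is assumed to be in scope with a db handle whose client exposes startSession(), as the class expects:

// Example usage of AtomicOperationManager: decrement stock atomically while
// refusing to go below zero, then tag the order in the product's reservation list.
async function reserveStock(db, productId, quantity, orderId) {
  const atomicOps = new AtomicOperationManager(db);

  // Decrement is expressed as a negative increment with a min bound of 0
  const decrement = await atomicOps.atomicIncrement(
    'inventory',
    { productId: productId },
    'quantity',
    -quantity,
    { min: 0 }
  );

  // Record the reservation only once per order
  const reservation = await atomicOps.atomicArrayOperation(
    'inventory',
    { productId: productId },
    'reservedOrders',
    'addToSet',
    orderId
  );

  return { remaining: decrement.newValue, reservation: reservation };
}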

Transaction Isolation and Conflict Resolution

Implement sophisticated conflict resolution strategies:

// Advanced conflict resolution and transaction isolation
class ConflictResolutionManager {
  constructor(db) {
    this.db = db;
    this.conflictLog = db.collection('conflict_resolution_log');
  }

  async resolveWithStrategy(collection, conflictData, strategy = 'merge', options = {}) {
    // Resolve conflicts using various strategies
    const session = this.db.client.startSession();

    try {
      return await session.withTransaction(async () => {
        const { 
          documentId, 
          baseVersion, 
          localChanges, 
          remoteChanges 
        } = conflictData;

        // Get current document state
        const currentDoc = await this.db.collection(collection).findOne(
          { _id: ObjectId(documentId) }, 
          { session }
        );

        if (!currentDoc) {
          throw new Error('Document not found for conflict resolution');
        }

        if (currentDoc.version <= baseVersion) {
          // No conflict - apply changes directly
          return await this.applyChanges(
            collection, documentId, localChanges, session
          );
        }

        // Conflict detected - apply resolution strategy
        let resolvedChanges;

        switch (strategy) {
          case 'merge':
            resolvedChanges = await this.mergeChanges(
              currentDoc, localChanges, remoteChanges, options
            );
            break;

          case 'last_write_wins':
            resolvedChanges = await this.lastWriteWins(
              localChanges, remoteChanges, options
            );
            break;

          case 'first_write_wins':
            resolvedChanges = await this.firstWriteWins(
              currentDoc, localChanges, baseVersion, options
            );
            break;

          case 'user_resolution':
            resolvedChanges = await this.userResolution(
              currentDoc, localChanges, remoteChanges, options
            );
            break;

          case 'field_level_merge':
            resolvedChanges = await this.fieldLevelMerge(
              currentDoc, localChanges, remoteChanges, options
            );
            break;

          default:
            throw new Error(`Unknown conflict resolution strategy: ${strategy}`);
        }

        // Apply resolved changes
        const result = await this.applyResolvedChanges(
          collection, documentId, currentDoc.version, resolvedChanges, session
        );

        // Log conflict resolution
        await this.logConflictResolution({
          documentId: documentId,
          collection: collection,
          strategy: strategy,
          baseVersion: baseVersion,
          conflictVersion: currentDoc.version,
          localChanges: localChanges,
          remoteChanges: remoteChanges,
          resolvedChanges: resolvedChanges,
          resolvedAt: new Date(),
          resolvedBy: options.userId || 'system'
        }, session);

        return {
          success: true,
          strategy: strategy,
          conflictResolved: true,
          finalVersion: result.newVersion,
          resolvedChanges: resolvedChanges
        };
      });
    } finally {
      await session.endSession();
    }
  }

  async mergeChanges(currentDoc, localChanges, remoteChanges, options) {
    // Intelligent three-way merge
    const merged = { ...currentDoc };
    const conflicts = [];

    // Process local changes
    Object.keys(localChanges).forEach(field => {
      if (field === '_id' || field === 'version') return;

      const localValue = localChanges[field];
      const remoteValue = remoteChanges[field];
      const currentValue = currentDoc[field];

      if (remoteValue !== undefined && localValue !== remoteValue) {
        // Conflict detected - apply merge rules
        const mergeResult = this.mergeFieldValues(
          field, currentValue, localValue, remoteValue, options.mergeRules || {}
        );

        merged[field] = mergeResult.value;

        if (mergeResult.hadConflict) {
          conflicts.push({
            field: field,
            localValue: localValue,
            remoteValue: remoteValue,
            resolvedValue: mergeResult.value,
            mergeRule: mergeResult.rule
          });
        }
      } else {
        // No conflict - use local value
        merged[field] = localValue;
      }
    });

    // Process remote changes not in local changes
    Object.keys(remoteChanges).forEach(field => {
      if (field === '_id' || field === 'version') return;

      if (localChanges[field] === undefined) {
        merged[field] = remoteChanges[field];
      }
    });

    return {
      ...merged,
      conflicts: conflicts,
      mergeStrategy: 'three_way_merge',
      mergedAt: new Date()
    };
  }

  mergeFieldValues(fieldName, currentValue, localValue, remoteValue, mergeRules) {
    // Apply field-specific merge rules
    const fieldRule = mergeRules[fieldName];

    if (fieldRule) {
      switch (fieldRule.strategy) {
        case 'local_wins':
          return { value: localValue, hadConflict: true, rule: 'local_wins' };

        case 'remote_wins':  
          return { value: remoteValue, hadConflict: true, rule: 'remote_wins' };

        case 'max_value':
          return { 
            value: Math.max(localValue, remoteValue), 
            hadConflict: true, 
            rule: 'max_value' 
          };

        case 'min_value':
          return { 
            value: Math.min(localValue, remoteValue), 
            hadConflict: true, 
            rule: 'min_value' 
          };

        case 'concatenate':
          return { 
            value: `${localValue}${fieldRule.separator || ' '}${remoteValue}`, 
            hadConflict: true, 
            rule: 'concatenate' 
          };

        case 'array_merge':
          const localArray = Array.isArray(localValue) ? localValue : [];
          const remoteArray = Array.isArray(remoteValue) ? remoteValue : [];
          return { 
            value: [...new Set([...localArray, ...remoteArray])], 
            hadConflict: true, 
            rule: 'array_merge' 
          };
      }
    }

    // Default conflict resolution - prefer local changes
    return { value: localValue, hadConflict: true, rule: 'default_local' };
  }

  async lastWriteWins(localChanges, remoteChanges, options) {
    // Simple last write wins strategy
    const localTimestamp = localChanges.lastModified || new Date(0);
    const remoteTimestamp = remoteChanges.lastModified || new Date(0);

    return localTimestamp > remoteTimestamp ? localChanges : remoteChanges;
  }

  async firstWriteWins(currentDoc, localChanges, baseVersion, options) {
    // Keep current state, reject local changes
    return {
      ...currentDoc,
      rejectedChanges: localChanges,
      rejectionReason: 'first_write_wins',
      rejectedAt: new Date()
    };
  }

  async fieldLevelMerge(currentDoc, localChanges, remoteChanges, options) {
    // Merge at field level with timestamp tracking
    const merged = { ...currentDoc };
    const fieldMergeLog = [];

    // Get field timestamps if available
    const getFieldTimestamp = (changes, field) => {
      return changes.fieldTimestamps?.[field] || changes.lastModified || new Date(0);
    };

    // Merge each field independently
    const allFields = new Set([
      ...Object.keys(localChanges),
      ...Object.keys(remoteChanges)
    ]);

    allFields.forEach(field => {
      if (field === '_id' || field === 'version' || field === 'fieldTimestamps') return;

      const localValue = localChanges[field];
      const remoteValue = remoteChanges[field];
      const localTimestamp = getFieldTimestamp(localChanges, field);
      const remoteTimestamp = getFieldTimestamp(remoteChanges, field);

      if (localValue !== undefined && remoteValue !== undefined) {
        // Both have changes - use timestamp
        if (localTimestamp > remoteTimestamp) {
          merged[field] = localValue;
          fieldMergeLog.push({
            field: field,
            winner: 'local',
            localValue: localValue,
            remoteValue: remoteValue,
            reason: 'newer_timestamp'
          });
        } else {
          merged[field] = remoteValue;
          fieldMergeLog.push({
            field: field,
            winner: 'remote',
            localValue: localValue,
            remoteValue: remoteValue,
            reason: 'newer_timestamp'
          });
        }
      } else if (localValue !== undefined) {
        merged[field] = localValue;
      } else if (remoteValue !== undefined) {
        merged[field] = remoteValue;
      }
    });

    return {
      ...merged,
      fieldMergeLog: fieldMergeLog,
      mergeStrategy: 'field_level_timestamp',
      mergedAt: new Date()
    };
  }

  async applyResolvedChanges(collection, documentId, currentVersion, resolvedChanges, session) {
    // Apply conflict-resolved changes; strip _id (immutable) and version (managed by $inc below)
    const { _id, version, ...changes } = resolvedChanges;

    const result = await this.db.collection(collection).updateOne(
      { 
        _id: ObjectId(documentId),
        version: currentVersion
      },
      {
        $set: {
          ...changes,
          lastModified: new Date(),
          conflictResolved: true
        },
        $inc: { version: 1 }
      },
      { session }
    );

    if (result.modifiedCount === 0) {
      throw new Error('Failed to apply resolved changes - document modified during resolution');
    }

    return {
      success: true,
      previousVersion: currentVersion,
      newVersion: currentVersion + 1
    };
  }

  async detectConflicts(collection, documentId, baseVersion, proposedChanges) {
    // Detect potential conflicts before attempting resolution
    const currentDoc = await this.db.collection(collection).findOne({
      _id: ObjectId(documentId)
    });

    if (!currentDoc) {
      return { hasConflicts: false, reason: 'document_not_found' };
    }

    if (currentDoc.version <= baseVersion) {
      return { hasConflicts: false, reason: 'no_intervening_changes' };
    }

    // Analyze conflicts
    const conflicts = [];
    const changedFields = Object.keys(proposedChanges);

    changedFields.forEach(field => {
      if (field === '_id' || field === 'version') return;

      const proposedValue = proposedChanges[field];
      const currentValue = currentDoc[field];

      // Simple value comparison - in practice, this could be more sophisticated
      if (JSON.stringify(currentValue) !== JSON.stringify(proposedValue)) {
        conflicts.push({
          field: field,
          baseValue: 'unknown', // Would need to track base state
          currentValue: currentValue,
          proposedValue: proposedValue,
          conflictType: 'value_mismatch'
        });
      }
    });

    return {
      hasConflicts: conflicts.length > 0,
      conflictCount: conflicts.length,
      conflicts: conflicts,
      currentVersion: currentDoc.version,
      baseVersion: baseVersion
    };
  }

  async logConflictResolution(resolutionDetails, session) {
    // Log detailed conflict resolution information
    await this.conflictLog.insertOne({
      ...resolutionDetails,
      resolutionId: ObjectId()
    }, { session });
  }
}
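
A short usage sketch for the conflict resolution manager, assuming a documents collection keyed by its _id string and per-field merge rules; the rule choices shown are illustrative:

// Example usage of ConflictResolutionManager: merge concurrent edits to a
// shared document, preferring set-union for tags and the larger view count.
async function saveWithConflictResolution(db, documentId, baseVersion, localChanges, remoteChanges) {
  const resolver = new ConflictResolutionManager(db);

  return resolver.resolveWithStrategy(
    'documents',
    { documentId, baseVersion, localChanges, remoteChanges },
    'merge',
    {
      userId: 'editor-service',
      mergeRules: {
        tags: { strategy: 'array_merge' },
        viewCount: { strategy: 'max_value' },
        title: { strategy: 'local_wins' }
      }
    }
  );
}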

QueryLeaf Concurrency Control Integration

QueryLeaf provides SQL-familiar syntax for MongoDB concurrency operations:

-- QueryLeaf concurrency control with SQL-style syntax

-- Optimistic locking with version-based updates
BEGIN TRANSACTION ISOLATION LEVEL OPTIMISTIC;

-- Update with automatic version checking
UPDATE accounts 
SET balance = balance - @transfer_amount,
    version = version + 1,
    last_transaction_date = CURRENT_TIMESTAMP
WHERE account_id = @from_account 
  AND version = @expected_version  -- Optimistic lock condition
  AND balance >= @transfer_amount; -- Safety check

-- Check if update succeeded (no race condition)
IF @@ROWCOUNT = 0
BEGIN
    ROLLBACK TRANSACTION;
    RAISERROR('Account modified by another transaction or insufficient funds', 16, 1);
    RETURN;
END

-- Atomic credit to destination account  
UPDATE accounts
SET balance = balance + @transfer_amount,
    version = version + 1,
    last_transaction_date = CURRENT_TIMESTAMP
WHERE account_id = @to_account;

-- Log transaction with conflict detection
INSERT INTO transactions (
    from_account,
    to_account, 
    amount,
    transaction_date,
    transaction_type,
    session_id
)
VALUES (
    @from_account,
    @to_account,
    @transfer_amount,
    CURRENT_TIMESTAMP,
    'transfer',
    CONNECTION_ID()
);

COMMIT TRANSACTION;

-- Atomic increment operations with bounds checking
UPDATE inventory
SET quantity = quantity + @increment_amount,
    version = version + 1,
    last_modified = CURRENT_TIMESTAMP
WHERE product_id = @product_id
  AND quantity + @increment_amount >= 0      -- Prevent negative inventory
  AND quantity + @increment_amount <= @max_stock; -- Prevent overstocking

-- Atomic array operations
-- Add item to array if not already present
UPDATE user_preferences
SET favorite_categories = ARRAY_APPEND_UNIQUE(favorite_categories, @new_category),
    version = version + 1,
    last_modified = CURRENT_TIMESTAMP
WHERE user_id = @user_id
  AND NOT ARRAY_CONTAINS(favorite_categories, @new_category);

-- Remove item from array
UPDATE user_preferences  
SET favorite_categories = ARRAY_REMOVE(favorite_categories, @remove_category),
    version = version + 1,
    last_modified = CURRENT_TIMESTAMP
WHERE user_id = @user_id
  AND ARRAY_CONTAINS(favorite_categories, @remove_category);

-- Compare-and-swap operations
UPDATE configuration
SET setting_value = @new_value,
    version = version + 1,
    last_modified = CURRENT_TIMESTAMP,
    modified_by = @user_id
WHERE setting_key = @setting_key
  AND setting_value = @expected_current_value  -- Compare condition
  AND version = @expected_version;            -- Additional version check

-- Bulk atomic operations with conflict handling
WITH batch_updates AS (
    SELECT 
        order_id,
        new_status,
        expected_version,
        ROW_NUMBER() OVER (ORDER BY order_id) as batch_order
    FROM (VALUES 
        ('order_1', 'shipped', 5),
        ('order_2', 'shipped', 3), 
        ('order_3', 'shipped', 7)
    ) AS v(order_id, new_status, expected_version)
),
update_results AS (
    UPDATE orders o
    SET status = b.new_status,
        version = version + 1,
        status_changed_at = CURRENT_TIMESTAMP,
        batch_id = @batch_id
    FROM batch_updates b
    WHERE o.order_id = b.order_id
      AND o.version = b.expected_version  -- Optimistic lock per order
    RETURNING o.order_id, o.version as new_version, 'success' as result
)
SELECT 
    b.order_id,
    COALESCE(r.result, 'failed') as update_result,
    r.new_version,
    CASE 
        WHEN r.result IS NULL THEN 'Version conflict or order not found'
        ELSE 'Successfully updated'
    END as message
FROM batch_updates b
LEFT JOIN update_results r ON b.order_id = r.order_id
ORDER BY b.batch_order;

-- Conflict detection and resolution
WITH conflict_detection AS (
    SELECT 
        document_id,
        current_version,
        proposed_changes,
        base_version,
        CASE 
            WHEN current_version > base_version THEN 'conflict_detected'
            ELSE 'no_conflict'
        END as conflict_status,

        -- Analyze field-level conflicts
        JSON_EXTRACT_PATH(proposed_changes, 'field1') as proposed_field1,
        JSON_EXTRACT_PATH(current_data, 'field1') as current_field1,

        CASE 
            WHEN JSON_EXTRACT_PATH(proposed_changes, 'field1') != 
                 JSON_EXTRACT_PATH(current_data, 'field1') THEN 'field_conflict'
            ELSE 'no_field_conflict'
        END as field1_status
    FROM documents d
    JOIN proposed_updates p ON d.id = p.document_id
),
conflict_resolution AS (
    SELECT 
        document_id,
        conflict_status,

        -- Apply merge strategy based on conflict type
        CASE conflict_status
            WHEN 'no_conflict' THEN proposed_changes
            WHEN 'conflict_detected' THEN 
                CASE @resolution_strategy
                    WHEN 'merge' THEN MERGE_JSON(current_data, proposed_changes)
                    WHEN 'last_write_wins' THEN proposed_changes
                    WHEN 'first_write_wins' THEN current_data
                    ELSE proposed_changes
                END
        END as resolved_changes
    FROM conflict_detection
)
UPDATE documents d
SET data = r.resolved_changes,
    version = version + 1,
    last_modified = CURRENT_TIMESTAMP,
    conflict_resolved = CASE r.conflict_status 
        WHEN 'conflict_detected' THEN TRUE 
        ELSE FALSE 
    END,
    resolution_strategy = @resolution_strategy
FROM conflict_resolution r
WHERE d.id = r.document_id;

-- High-concurrency counter with atomic operations
-- Safe increment even under heavy concurrent load
UPDATE page_views
SET view_count = view_count + 1,
    last_view_timestamp = CURRENT_TIMESTAMP,
    version = version + 1
WHERE page_id = @page_id;

-- If page doesn't exist, create it atomically
INSERT INTO page_views (page_id, view_count, first_view_timestamp, last_view_timestamp, version)
SELECT @page_id, 1, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 1
WHERE NOT EXISTS (SELECT 1 FROM page_views WHERE page_id = @page_id);

-- Distributed lock implementation for critical sections
WITH lock_acquisition AS (
    INSERT INTO distributed_locks (
        lock_key,
        acquired_by,
        acquired_at,
        expires_at,
        lock_version
    )
    SELECT 
        @lock_key,
        @process_id,
        CURRENT_TIMESTAMP,
        CURRENT_TIMESTAMP + (@timeout_seconds * INTERVAL '1 second'),
        1
    WHERE NOT EXISTS (
        SELECT 1 FROM distributed_locks 
        WHERE lock_key = @lock_key 
          AND expires_at > CURRENT_TIMESTAMP
    )
    RETURNING lock_key, acquired_by, acquired_at
)
SELECT 
    CASE 
        WHEN l.lock_key IS NOT NULL THEN 'acquired'
        ELSE 'failed'
    END as lock_status,
    l.acquired_by,
    l.acquired_at
FROM lock_acquisition l;

-- Release distributed lock
DELETE FROM distributed_locks
WHERE lock_key = @lock_key
  AND acquired_by = @process_id
  AND lock_version = @expected_version;

-- QueryLeaf automatically handles:
-- 1. Version-based optimistic locking
-- 2. Atomic increment and decrement operations  
-- 3. Array manipulation with uniqueness constraints
-- 4. Compare-and-swap semantics
-- 5. Bulk operations with per-document conflict detection
-- 6. Conflict resolution strategies (merge, last-wins, first-wins)
-- 7. Distributed locking mechanisms
-- 8. Transaction isolation levels
-- 9. Deadlock prevention and detection
-- 10. Performance optimization for high-concurrency scenarios

Best Practices for Concurrency Management

Design Guidelines

Essential practices for effective concurrency control:

  1. Version-Based Optimistic Locking: Implement version fields in documents that change frequently
  2. Atomic Operations: Use MongoDB's atomic update operations instead of read-modify-write patterns
  3. Transaction Boundaries: Keep transactions short and focused to minimize lock contention
  4. Conflict Resolution: Design clear conflict resolution strategies appropriate for your use case
  5. Retry Logic: Implement exponential backoff retry for optimistic locking failures (see the sketch after this list)
  6. Performance Monitoring: Monitor contention points and optimize high-conflict operations
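
The retry guidance above translates directly to driver code. Below is a minimal sketch of version-based optimistic locking with exponential backoff using the Node.js MongoDB driver; the 'app' database, 'orders' collection, numeric 'version' field, and retry limits are illustrative assumptions rather than anything prescribed by MongoDB or QueryLeaf.

// Minimal optimistic-locking retry sketch (Node.js MongoDB driver).
// Database, collection, field names, and retry limits are illustrative assumptions.
const { MongoClient } = require('mongodb');

async function updateOrderStatusWithRetry(uri, orderId, newStatus, maxAttempts = 5) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const orders = client.db('app').collection('orders');

    for (let attempt = 0; attempt < maxAttempts; attempt++) {
      // Read the current document to capture the version we expect to replace
      const current = await orders.findOne({ _id: orderId });
      if (!current) return { ok: false, reason: 'not_found' };

      // Compare-and-swap: the filter only matches if nobody else bumped the version
      const result = await orders.updateOne(
        { _id: orderId, version: current.version },
        { $set: { status: newStatus, statusChangedAt: new Date() }, $inc: { version: 1 } }
      );
      if (result.modifiedCount === 1) {
        return { ok: true, newVersion: current.version + 1 };
      }

      // Version conflict: back off exponentially before retrying
      await new Promise(resolve => setTimeout(resolve, 100 * 2 ** attempt));
    }
    return { ok: false, reason: 'version_conflict' };
  } finally {
    await client.close();
  }
}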

Concurrency Patterns

Choose appropriate concurrency patterns:

  1. Document-Level Locking: Use optimistic locking for individual document updates
  2. Field-Level Granularity: Implement field-specific version control for large documents
  3. Event Sourcing: Consider event-driven architectures for high-conflict scenarios
  4. CQRS: Separate read and write operations to reduce contention
  5. Distributed Locking: Use distributed locks for cross-document consistency requirements (see the lock sketch after this list)
  6. Queue-Based Processing: Use message queues to serialize high-conflict operations
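
Pattern 5 can be built from nothing more than a unique index and a TTL index. The sketch below is a minimal lease-style lock over a dedicated 'locks' collection; the collection name, field names, and lease duration are illustrative assumptions. Note that the TTL monitor only removes expired documents roughly once per minute, so the expiry acts as a coarse safety net for crashed owners rather than a precise timeout.

// Minimal distributed-lock sketch over a dedicated 'locks' collection.
// Assumes the indexes below are created once at startup; all names are illustrative.
async function ensureLockIndexes(db) {
  await db.collection('locks').createIndex({ lockKey: 1 }, { unique: true });
  await db.collection('locks').createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 });
}

async function acquireLock(db, lockKey, ownerId, ttlMs = 30000) {
  try {
    await db.collection('locks').insertOne({
      lockKey,                                  // the unique index makes this insert the mutex
      ownerId,
      acquiredAt: new Date(),
      expiresAt: new Date(Date.now() + ttlMs)   // TTL index eventually reaps abandoned locks
    });
    return true;
  } catch (err) {
    if (err.code === 11000) return false;       // duplicate key: another process holds the lock
    throw err;
  }
}

async function releaseLock(db, lockKey, ownerId) {
  // Only the current owner matches; a stale or expired owner simply deletes nothing
  await db.collection('locks').deleteOne({ lockKey, ownerId });
}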

Conclusion

MongoDB's sophisticated concurrency control mechanisms provide powerful tools for managing race conditions and maintaining data integrity in high-throughput applications. Combined with SQL-familiar concurrency patterns, MongoDB enables robust multi-user applications that scale effectively under load.

Key concurrency management benefits include:

  • High Performance: Optimistic locking avoids blocking operations under normal conditions
  • Scalability: Non-blocking concurrency control scales with user load
  • Data Integrity: Automatic conflict detection prevents lost updates and inconsistent states
  • Flexible Resolution: Multiple conflict resolution strategies accommodate different business requirements
  • ACID Compliance: Multi-document transactions provide full ACID guarantees when needed (a brief transaction sketch follows this list)
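
As a concrete illustration of the ACID point above, here is a minimal multi-document transaction sketch. It assumes a replica set or sharded cluster (transactions are unavailable on standalone servers); the 'bank' database, 'accounts' collection, and balance fields are illustrative assumptions.

// Minimal multi-document transaction sketch (Node.js driver).
// Database, collection, and field names are illustrative assumptions.
const { MongoClient } = require('mongodb');

async function transferFunds(uri, fromAccountId, toAccountId, amount) {
  const client = new MongoClient(uri);
  await client.connect();
  const session = client.startSession();
  try {
    await session.withTransaction(async () => {
      const accounts = client.db('bank').collection('accounts');

      // Debit only succeeds when the balance covers the amount
      const debit = await accounts.updateOne(
        { _id: fromAccountId, balance: { $gte: amount } },
        { $inc: { balance: -amount } },
        { session }
      );
      if (debit.modifiedCount !== 1) {
        throw new Error('Insufficient funds'); // throwing aborts the whole transaction
      }

      await accounts.updateOne(
        { _id: toAccountId },
        { $inc: { balance: amount } },
        { session }
      );
    });
  } finally {
    await session.endSession();
    await client.close();
  }
}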

Whether you're building financial systems requiring strict consistency, collaborative platforms with concurrent editing, or high-throughput applications with frequent updates, MongoDB's concurrency control with QueryLeaf's familiar SQL interface provides the foundation for robust, scalable applications. This combination enables you to implement sophisticated concurrency patterns while preserving familiar database interaction models.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB concurrency control including optimistic locking, atomic operations, and conflict resolution while providing SQL-familiar transaction syntax. Complex concurrency patterns, version management, and conflict resolution strategies are seamlessly handled through familiar SQL constructs, making advanced concurrency control both powerful and accessible.

The integration of sophisticated concurrency control with SQL-style operations makes MongoDB an ideal platform for applications requiring both high-performance concurrent operations and familiar database development patterns, ensuring your concurrency solutions remain both effective and maintainable as they scale and evolve.

MongoDB Change Streams for Event-Driven Microservices: Advanced Real-Time Data Synchronization and Distributed System Architecture

Modern distributed systems require sophisticated event-driven architectures that can handle real-time data synchronization across multiple microservices while maintaining data consistency, service decoupling, and system resilience. Traditional approaches to inter-service communication often rely on polling mechanisms, message queues with complex configuration, or tightly coupled API calls that create bottlenecks, increase latency, and reduce system reliability under high load conditions.

MongoDB Change Streams provide comprehensive real-time event processing capabilities that enable microservices to react immediately to data changes through native database-level event streaming, advanced filtering mechanisms, and automatic resume token management. Unlike traditional message queue systems that require separate infrastructure and complex message routing logic, MongoDB Change Streams integrate event processing directly with the database layer, providing guaranteed event delivery, ordering semantics, and fault tolerance without additional middleware dependencies.
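
Before examining the traditional alternative and the full platform implementation later in this article, a minimal sketch of the core primitive helps frame what follows: a filtered collection.watch() that hands back a resume token after every event. The 'shop' database, 'orders' collection, and the way the token is persisted are illustrative assumptions.

// Minimal change stream sketch: filtered watch with resume-token handoff.
// Database and collection names are illustrative; persist lastToken wherever suits your service.
const { MongoClient } = require('mongodb');

async function watchOrders(uri, previousToken) {
  const client = new MongoClient(uri);
  await client.connect();
  const orders = client.db('shop').collection('orders');

  // Only surface inserts and updates; fullDocument returns the post-image for updates
  const pipeline = [{ $match: { operationType: { $in: ['insert', 'update'] } } }];
  const stream = orders.watch(pipeline, {
    fullDocument: 'updateLookup',
    ...(previousToken ? { resumeAfter: previousToken } : {})
  });

  let lastToken = previousToken;
  for await (const change of stream) {
    lastToken = change._id;  // store this token so a restart can resume without missing events
    console.log(`${change.operationType} on order ${change.documentKey._id}`);
  }
  return lastToken;
}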

The Traditional Microservices Communication Challenge

Conventional approaches to microservices event processing face significant limitations in reliability and performance:

-- Traditional PostgreSQL event processing - complex and unreliable approaches

-- Basic event log table (limited capabilities)
CREATE TABLE service_events (
    event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    service_name VARCHAR(100) NOT NULL,
    event_type VARCHAR(100) NOT NULL,
    entity_id UUID NOT NULL,
    entity_type VARCHAR(100) NOT NULL,

    -- Event data (limited structure)
    event_data JSONB NOT NULL,
    event_metadata JSONB,

    -- Processing tracking (manual management)
    event_timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    processing_status VARCHAR(50) DEFAULT 'pending', -- pending, processing, completed, failed
    processed_by VARCHAR(100),
    processed_at TIMESTAMP,

    -- Retry management (basic implementation)
    retry_count INTEGER DEFAULT 0,
    max_retries INTEGER DEFAULT 3,
    next_retry_at TIMESTAMP,

    -- Ordering and partitioning
    sequence_number BIGINT,
    partition_key VARCHAR(100),

    -- Audit fields
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Event subscriptions table (manual subscription management)
CREATE TABLE event_subscriptions (
    subscription_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    service_name VARCHAR(100) NOT NULL,
    event_type VARCHAR(100) NOT NULL,
    entity_type VARCHAR(100),

    -- Subscription configuration
    filter_conditions JSONB, -- Basic filtering capabilities
    delivery_endpoint VARCHAR(500) NOT NULL,
    delivery_method VARCHAR(50) DEFAULT 'webhook', -- webhook, queue, database

    -- Processing configuration
    batch_size INTEGER DEFAULT 1,
    max_delivery_attempts INTEGER DEFAULT 3,
    delivery_timeout_seconds INTEGER DEFAULT 30,

    -- Subscription status
    subscription_status VARCHAR(50) DEFAULT 'active', -- active, paused, disabled
    last_processed_event_id UUID,
    last_processing_error TEXT,

    -- Subscription metadata
    created_by VARCHAR(100) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Event processing queue (complex state management)
CREATE TABLE event_processing_queue (
    queue_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    subscription_id UUID NOT NULL REFERENCES event_subscriptions(subscription_id),
    event_id UUID NOT NULL REFERENCES service_events(event_id),

    -- Processing state
    queue_status VARCHAR(50) DEFAULT 'queued', -- queued, processing, completed, failed, dead_letter
    processing_attempts INTEGER DEFAULT 0,

    -- Timing information
    queued_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    processing_started_at TIMESTAMP,
    processing_completed_at TIMESTAMP,
    next_attempt_at TIMESTAMP,

    -- Error tracking
    last_error_message TEXT,
    last_error_details JSONB,

    -- Processing metadata
    processing_node VARCHAR(100),
    processing_duration_ms INTEGER,

    UNIQUE (subscription_id, event_id)
);

-- Complex stored procedure for event processing (error-prone and limited)
CREATE OR REPLACE FUNCTION process_pending_events()
RETURNS TABLE (
    events_processed INTEGER,
    events_failed INTEGER,
    processing_duration_seconds INTEGER
) AS $$
DECLARE
    event_record RECORD;
    subscription_record RECORD;
    processing_start TIMESTAMP := clock_timestamp();
    processed_count INTEGER := 0;
    failed_count INTEGER := 0;
    current_batch_size INTEGER;
    delivery_result BOOLEAN;
BEGIN

    -- Process events in batches for each active subscription
    FOR subscription_record IN 
        SELECT * FROM event_subscriptions 
        WHERE subscription_status = 'active'
        ORDER BY created_at
    LOOP
        current_batch_size := subscription_record.batch_size;

        -- Get pending events for this subscription
        FOR event_record IN
            WITH filtered_events AS (
                SELECT se.*, epq.queue_id, epq.processing_attempts
                FROM service_events se
                JOIN event_processing_queue epq ON se.event_id = epq.event_id
                WHERE epq.subscription_id = subscription_record.subscription_id
                  AND epq.queue_status = 'queued'
                  AND (epq.next_attempt_at IS NULL OR epq.next_attempt_at <= CURRENT_TIMESTAMP)
                ORDER BY se.event_timestamp, se.sequence_number
                LIMIT current_batch_size
            )
            SELECT * FROM filtered_events
        LOOP

            -- Update processing status
            UPDATE event_processing_queue 
            SET 
                queue_status = 'processing',
                processing_started_at = CURRENT_TIMESTAMP,
                processing_attempts = processing_attempts + 1,
                processing_node = 'sql_processor'
            WHERE queue_id = event_record.queue_id;

            BEGIN
                -- Apply subscription filters (limited filtering capability)
                IF subscription_record.filter_conditions IS NOT NULL THEN
                    IF NOT jsonb_path_exists(
                        event_record.event_data, 
                        subscription_record.filter_conditions::jsonpath
                    ) THEN
                        -- Skip this event
                        UPDATE event_processing_queue 
                        SET queue_status = 'completed',
                            processing_completed_at = CURRENT_TIMESTAMP
                        WHERE queue_id = event_record.queue_id;
                        CONTINUE;
                    END IF;
                END IF;

                -- Simulate event delivery (in real implementation, would make HTTP call)
                delivery_result := deliver_event_to_service(
                    subscription_record.delivery_endpoint,
                    event_record.event_data,
                    subscription_record.delivery_timeout_seconds
                );

                IF delivery_result THEN
                    -- Mark as completed
                    UPDATE event_processing_queue 
                    SET 
                        queue_status = 'completed',
                        processing_completed_at = CURRENT_TIMESTAMP,
                        processing_duration_ms = (EXTRACT(
                            EPOCH FROM CURRENT_TIMESTAMP - processing_started_at
                        ) * 1000)::INTEGER
                    WHERE queue_id = event_record.queue_id;

                    processed_count := processed_count + 1;

                ELSE
                    RAISE EXCEPTION 'Event delivery failed';
                END IF;

            EXCEPTION WHEN OTHERS THEN
                failed_count := failed_count + 1;

                -- Handle retry logic
                IF event_record.processing_attempts < subscription_record.max_delivery_attempts THEN
                    -- Schedule retry with exponential backoff
                    UPDATE event_processing_queue 
                    SET 
                        queue_status = 'queued',
                        next_attempt_at = CURRENT_TIMESTAMP + 
                            (INTERVAL '1 minute' * POWER(2, event_record.processing_attempts)),
                        last_error_message = SQLERRM,
                        last_error_details = jsonb_build_object(
                            'error_code', SQLSTATE,
                            'error_message', SQLERRM,
                            'processing_attempt', event_record.processing_attempts + 1,
                            'timestamp', CURRENT_TIMESTAMP
                        )
                    WHERE queue_id = event_record.queue_id;
                ELSE
                    -- Move to dead letter queue
                    UPDATE event_processing_queue 
                    SET 
                        queue_status = 'dead_letter',
                        last_error_message = SQLERRM,
                        processing_completed_at = CURRENT_TIMESTAMP
                    WHERE queue_id = event_record.queue_id;
                END IF;
            END;
        END LOOP;

        -- Update subscription's last processed event
        UPDATE event_subscriptions 
        SET 
            last_processed_event_id = (
                SELECT event_id FROM event_processing_queue 
                WHERE subscription_id = subscription_record.subscription_id 
                  AND queue_status = 'completed'
                ORDER BY processing_completed_at DESC 
                LIMIT 1
            ),
            updated_at = CURRENT_TIMESTAMP
        WHERE subscription_id = subscription_record.subscription_id;

    END LOOP;

    RETURN QUERY SELECT 
        processed_count,
        failed_count,
        EXTRACT(EPOCH FROM clock_timestamp() - processing_start)::INTEGER;

END;
$$ LANGUAGE plpgsql;

-- Manual trigger-based event creation (limited and unreliable)
CREATE OR REPLACE FUNCTION create_user_change_event()
RETURNS TRIGGER AS $$
DECLARE
    new_event_id UUID;
BEGIN
    -- Only create events for significant changes
    IF TG_OP = 'INSERT' OR 
       (TG_OP = 'UPDATE' AND (
           OLD.email != NEW.email OR 
           OLD.status != NEW.status OR
           OLD.user_type != NEW.user_type
       )) THEN

        INSERT INTO service_events (
            service_name,
            event_type,
            entity_id,
            entity_type,
            event_data,
            event_metadata,
            sequence_number,
            partition_key
        ) VALUES (
            'user_service',
            CASE TG_OP 
                WHEN 'INSERT' THEN 'user_created'
                WHEN 'UPDATE' THEN 'user_updated'
                WHEN 'DELETE' THEN 'user_deleted'
            END,
            COALESCE(NEW.user_id, OLD.user_id),
            'user',
            jsonb_build_object(
                'user_id', COALESCE(NEW.user_id, OLD.user_id),
                'email', COALESCE(NEW.email, OLD.email),
                'status', COALESCE(NEW.status, OLD.status),
                'user_type', COALESCE(NEW.user_type, OLD.user_type),
                'operation', TG_OP,
                'changed_fields', CASE 
                    WHEN TG_OP = 'INSERT' THEN jsonb_build_array('all')
                    WHEN TG_OP = 'UPDATE' THEN jsonb_build_array(
                        CASE WHEN OLD.email != NEW.email THEN 'email' END,
                        CASE WHEN OLD.status != NEW.status THEN 'status' END,
                        CASE WHEN OLD.user_type != NEW.user_type THEN 'user_type' END
                    )
                    ELSE jsonb_build_array('all')
                END
            ),
            jsonb_build_object(
                'source_table', TG_TABLE_NAME,
                'source_operation', TG_OP,
                'timestamp', CURRENT_TIMESTAMP,
                'transaction_id', txid_current()
            ),
            nextval('event_sequence'),
            COALESCE(NEW.user_id, OLD.user_id)::TEXT
        ) RETURNING event_id INTO new_event_id;

        -- Queue event for all matching subscriptions
        INSERT INTO event_processing_queue (subscription_id, event_id)
        SELECT 
            s.subscription_id,
            new_event_id
        FROM event_subscriptions s
        WHERE s.subscription_status = 'active'
          AND s.event_type IN ('user_created', 'user_updated', 'user_deleted', '*')
          AND (s.entity_type IS NULL OR s.entity_type = 'user');

    END IF;

    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

-- Problems with traditional event processing approaches:
-- 1. Complex manual event creation and subscription management
-- 2. Limited filtering and routing capabilities
-- 3. No guaranteed event ordering or delivery semantics
-- 4. Manual retry logic and error handling implementation
-- 5. Expensive polling mechanisms for event consumption
-- 6. No built-in support for resume tokens or fault tolerance
-- 7. Complex state management across multiple tables
-- 8. Limited scalability and performance under high event volumes
-- 9. No native integration with database transactions
-- 10. Manual implementation of event sourcing and CQRS patterns

MongoDB Change Streams eliminate these limitations with native event processing:

// MongoDB Change Streams - comprehensive event-driven microservices architecture
const { MongoClient, ObjectId } = require('mongodb');
const EventEmitter = require('events');

// Advanced microservices event processing system using MongoDB Change Streams
class MongoEventDrivenMicroservicesManager {
  constructor(connectionUri, options = {}) {
    this.client = new MongoClient(connectionUri);
    this.db = null;
    this.eventEmitter = new EventEmitter();
    this.activeStreams = new Map();
    this.subscriptions = new Map();

    // Configuration for event processing
    this.config = {
      // Change stream configuration
      changeStreamOptions: {
        fullDocument: 'updateLookup', // Include full document in updates
        fullDocumentBeforeChange: 'whenAvailable', // Include previous version
        maxAwaitTimeMS: 1000, // Reduce latency
        batchSize: 100 // Optimize batch processing
      },

      // Event processing configuration
      eventProcessing: {
        enableRetries: true,
        maxRetryAttempts: 3,
        retryDelayMs: 1000,
        exponentialBackoff: true,
        deadLetterQueueEnabled: true,
        preserveEventOrder: true
      },

      // Subscription management
      subscriptionManagement: {
        autoReconnect: true,
        resumeTokenPersistence: true,
        subscriptionHealthCheck: true,
        metricsCollection: true
      },

      // Performance optimization
      performanceSettings: {
        concurrentStreamLimit: 10,
        eventBatchSize: 50,
        processingTimeout: 30000,
        memoryBufferSize: 1000
      }
    };

    // Event processing metrics
    this.metrics = {
      totalEventsProcessed: 0,
      totalEventsReceived: 0,
      totalSubscriptions: 0,
      activeStreams: 0,
      eventProcessingErrors: 0,
      averageProcessingTime: 0,
      lastEventTimestamp: null
    };

    // Resume token storage for fault tolerance
    this.resumeTokens = new Map();
    this.subscriptionHealthStatus = new Map();
  }

  async initialize(databaseName) {
    console.log('Initializing MongoDB Event-Driven Microservices Manager...');

    try {
      await this.client.connect();
      this.db = this.client.db(databaseName);

      // Setup system collections for event management
      await this.setupEventManagementCollections();

      // Load existing subscriptions and resume tokens
      await this.loadExistingSubscriptions();

      // Setup health monitoring
      if (this.config.subscriptionManagement.subscriptionHealthCheck) {
        this.startHealthMonitoring();
      }

      console.log('Event-driven microservices manager initialized successfully');

    } catch (error) {
      console.error('Error initializing event manager:', error);
      throw error;
    }
  }

  // Create comprehensive event subscription for microservices
  async createEventSubscription(subscriptionConfig) {
    console.log(`Creating event subscription: ${subscriptionConfig.subscriptionId}`);

    const subscription = {
      subscriptionId: subscriptionConfig.subscriptionId,
      serviceName: subscriptionConfig.serviceName,

      // Event filtering configuration
      collections: subscriptionConfig.collections || [], // Collections to watch
      eventTypes: subscriptionConfig.eventTypes || ['insert', 'update', 'delete'], // Operation types
      pipeline: subscriptionConfig.pipeline || [], // Advanced filtering pipeline

      // Event processing configuration
      eventHandler: subscriptionConfig.eventHandler, // Function to process events
      batchProcessing: subscriptionConfig.batchProcessing || false,
      batchSize: subscriptionConfig.batchSize || 1,
      preserveOrder: subscriptionConfig.preserveOrder !== false,

      // Error handling configuration
      errorHandler: subscriptionConfig.errorHandler,
      retryPolicy: {
        maxRetries: subscriptionConfig.maxRetries || this.config.eventProcessing.maxRetryAttempts,
        retryDelay: subscriptionConfig.retryDelay || this.config.eventProcessing.retryDelayMs,
        exponentialBackoff: subscriptionConfig.exponentialBackoff !== false
      },

      // Subscription metadata
      createdAt: new Date(),
      lastEventProcessed: null,
      resumeToken: null,
      isActive: false,

      // Performance tracking
      metrics: {
        eventsReceived: 0,
        eventsProcessed: 0,
        eventsSkipped: 0,
        processingErrors: 0,
        averageProcessingTime: 0,
        lastProcessingTime: null
      }
    };

    // Store subscription configuration
    await this.db.collection('event_subscriptions').replaceOne(
      { subscriptionId: subscription.subscriptionId },
      subscription,
      { upsert: true }
    );

    // Cache subscription
    this.subscriptions.set(subscription.subscriptionId, subscription);

    console.log(`Event subscription created: ${subscription.subscriptionId}`);
    return subscription.subscriptionId;
  }

  // Start change streams for active subscriptions
  async startEventStreaming(subscriptionId) {
    console.log(`Starting event streaming for subscription: ${subscriptionId}`);

    const subscription = this.subscriptions.get(subscriptionId);
    if (!subscription) {
      throw new Error(`Subscription not found: ${subscriptionId}`);
    }

    // Build change stream pipeline based on subscription configuration
    const pipeline = this.buildChangeStreamPipeline(subscription);

    // Configure change stream options
    const changeStreamOptions = {
      ...this.config.changeStreamOptions,
      // Resume from the persisted token when one exists; otherwise start at the current point in the oplog
      ...(subscription.resumeToken ? { resumeAfter: subscription.resumeToken } : {})
    };

    try {
      let changeStream;

      // Create change stream based on collection scope
      if (subscription.collections.length === 1) {
        // Single collection stream
        const collection = this.db.collection(subscription.collections[0]);
        changeStream = collection.watch(pipeline, changeStreamOptions);
      } else if (subscription.collections.length > 1) {
        // Multiple collections stream (requires database-level watch)
        changeStream = this.db.watch(pipeline, changeStreamOptions);
      } else {
        // Database-level stream for all collections
        changeStream = this.db.watch(pipeline, changeStreamOptions);
      }

      // Store active stream
      this.activeStreams.set(subscriptionId, changeStream);
      subscription.isActive = true;
      this.metrics.activeStreams++;

      // Setup event processing
      changeStream.on('change', async (changeEvent) => {
        await this.processChangeEvent(subscriptionId, changeEvent);
      });

      // Handle stream errors
      changeStream.on('error', async (error) => {
        console.error(`Change stream error for ${subscriptionId}:`, error);
        await this.handleStreamError(subscriptionId, error);
      });

      // Handle stream close
      changeStream.on('close', () => {
        console.log(`Change stream closed for ${subscriptionId}`);
        subscription.isActive = false;
        this.activeStreams.delete(subscriptionId);
        this.metrics.activeStreams--;
      });

      console.log(`Event streaming started for subscription: ${subscriptionId}`);
      return true;

    } catch (error) {
      console.error(`Error starting event streaming for ${subscriptionId}:`, error);
      subscription.isActive = false;
      throw error;
    }
  }

  // Process individual change events with comprehensive handling
  async processChangeEvent(subscriptionId, changeEvent) {
    const startTime = Date.now();
    const subscription = this.subscriptions.get(subscriptionId);

    if (!subscription || !subscription.isActive) {
      return; // Skip if subscription is inactive
    }

    try {
      // Update resume token for fault tolerance
      subscription.resumeToken = changeEvent._id;
      this.resumeTokens.set(subscriptionId, changeEvent._id);

      // Apply subscription filtering
      if (!this.matchesSubscriptionCriteria(changeEvent, subscription)) {
        subscription.metrics.eventsSkipped++;
        return;
      }

      // Prepare enriched event data
      const enrichedEvent = await this.enrichChangeEvent(changeEvent, subscription);

      // Update metrics
      subscription.metrics.eventsReceived++;
      this.metrics.totalEventsReceived++;

      // Process event with retry logic
      await this.processEventWithRetries(subscription, enrichedEvent, 0);

      // Update processing metrics
      const processingTime = Date.now() - startTime;
      subscription.metrics.averageProcessingTime = 
        (subscription.metrics.averageProcessingTime + processingTime) / 2;
      subscription.metrics.lastProcessingTime = new Date();
      subscription.lastEventProcessed = new Date();

      this.metrics.averageProcessingTime = 
        (this.metrics.averageProcessingTime + processingTime) / 2;
      this.metrics.lastEventTimestamp = new Date();

      // Persist resume token periodically
      if (this.config.subscriptionManagement.resumeTokenPersistence) {
        await this.persistResumeToken(subscriptionId, changeEvent._id);
      }

    } catch (error) {
      console.error(`Error processing change event for ${subscriptionId}:`, error);
      subscription.metrics.processingErrors++;
      this.metrics.eventProcessingErrors++;

      // Handle error based on subscription configuration
      if (subscription.errorHandler) {
        try {
          await subscription.errorHandler(error, changeEvent, subscription);
        } catch (handlerError) {
          console.error('Error handler failed:', handlerError);
        }
      }
    }
  }

  // Advanced event processing with retry mechanisms
  async processEventWithRetries(subscription, enrichedEvent, attemptNumber) {
    try {
      // Execute event handler
      if (subscription.batchProcessing) {
        // Add to batch processing queue
        await this.addToBatchQueue(subscription.subscriptionId, enrichedEvent);
      } else {
        // Process event immediately
        await subscription.eventHandler(enrichedEvent, subscription);
      }

      // Mark as successfully processed
      subscription.metrics.eventsProcessed++;
      this.metrics.totalEventsProcessed++;

    } catch (error) {
      console.error(`Event processing error (attempt ${attemptNumber + 1}):`, error);

      if (attemptNumber < subscription.retryPolicy.maxRetries) {
        // Calculate retry delay with exponential backoff
        const delay = subscription.retryPolicy.exponentialBackoff
          ? subscription.retryPolicy.retryDelay * Math.pow(2, attemptNumber)
          : subscription.retryPolicy.retryDelay;

        console.log(`Retrying event processing in ${delay}ms...`);

        await new Promise(resolve => setTimeout(resolve, delay));
        return this.processEventWithRetries(subscription, enrichedEvent, attemptNumber + 1);
      } else {
        // Max retries reached, send to dead letter queue
        if (this.config.eventProcessing.deadLetterQueueEnabled) {
          await this.sendToDeadLetterQueue(subscription.subscriptionId, enrichedEvent, error);
        }
        throw error;
      }
    }
  }

  // Enrich change events with additional context and metadata
  async enrichChangeEvent(changeEvent, subscription) {
    const enrichedEvent = {
      // Original change event data
      ...changeEvent,

      // Event metadata
      eventMetadata: {
        subscriptionId: subscription.subscriptionId,
        serviceName: subscription.serviceName,
        processedAt: new Date(),
        eventId: this.generateEventId(),

        // Change event details
        operationType: changeEvent.operationType,
        collectionName: changeEvent.ns?.coll,
        databaseName: changeEvent.ns?.db,

        // Document information
        documentKey: changeEvent.documentKey,
        hasFullDocument: !!changeEvent.fullDocument,
        hasFullDocumentBeforeChange: !!changeEvent.fullDocumentBeforeChange,

        // Event context
        clusterTime: changeEvent.clusterTime,
        resumeToken: changeEvent._id,

        // Processing context
        processingTimestamp: Date.now(),
        correlationId: this.generateCorrelationId(changeEvent)
      },

      // Service-specific enrichment
      serviceContext: {
        serviceName: subscription.serviceName,
        subscriptionConfig: {
          preserveOrder: subscription.preserveOrder,
          batchProcessing: subscription.batchProcessing
        }
      }
    };

    // Add business context if available
    if (changeEvent.fullDocument) {
      enrichedEvent.businessContext = await this.extractBusinessContext(
        changeEvent.fullDocument, 
        changeEvent.ns?.coll
      );
    }

    return enrichedEvent;
  }

  // Build change stream pipeline based on subscription configuration
  buildChangeStreamPipeline(subscription) {
    const pipeline = [...subscription.pipeline];

    // Add operation type filtering
    if (subscription.eventTypes.length > 0 && 
        !subscription.eventTypes.includes('*')) {
      pipeline.push({
        $match: {
          operationType: { $in: subscription.eventTypes }
        }
      });
    }

    // Add collection filtering for database-level streams
    if (subscription.collections.length > 1) {
      pipeline.push({
        $match: {
          'ns.coll': { $in: subscription.collections }
        }
      });
    }

    // Add service-specific filtering
    pipeline.push({
      $addFields: {
        processedBy: subscription.serviceName,
        subscriptionId: subscription.subscriptionId
      }
    });

    return pipeline;
  }

  // Check if change event matches subscription criteria
  matchesSubscriptionCriteria(changeEvent, subscription) {
    // Check operation type
    if (subscription.eventTypes.length > 0 && 
        !subscription.eventTypes.includes('*') &&
        !subscription.eventTypes.includes(changeEvent.operationType)) {
      return false;
    }

    // Check collection name
    if (subscription.collections.length > 0 &&
        !subscription.collections.includes(changeEvent.ns?.coll)) {
      return false;
    }

    return true;
  }

  // Batch processing queue management
  async addToBatchQueue(subscriptionId, enrichedEvent) {
    if (!this.batchQueues) {
      this.batchQueues = new Map();
    }

    if (!this.batchQueues.has(subscriptionId)) {
      this.batchQueues.set(subscriptionId, []);
    }

    const queue = this.batchQueues.get(subscriptionId);
    queue.push(enrichedEvent);

    const subscription = this.subscriptions.get(subscriptionId);
    if (queue.length >= subscription.batchSize) {
      await this.processBatch(subscriptionId);
    }
  }

  // Process batched events
  async processBatch(subscriptionId) {
    const queue = this.batchQueues.get(subscriptionId);
    if (!queue || queue.length === 0) {
      return;
    }

    const subscription = this.subscriptions.get(subscriptionId);
    const batch = queue.splice(0, subscription.batchSize);

    try {
      await subscription.eventHandler(batch, subscription);
      subscription.metrics.eventsProcessed += batch.length;
      this.metrics.totalEventsProcessed += batch.length;
    } catch (error) {
      console.error(`Batch processing error for ${subscriptionId}:`, error);
      // Handle batch processing errors
      for (const event of batch) {
        await this.sendToDeadLetterQueue(subscriptionId, event, error);
      }
    }
  }

  // Dead letter queue management
  async sendToDeadLetterQueue(subscriptionId, enrichedEvent, error) {
    try {
      await this.db.collection('dead_letter_events').insertOne({
        subscriptionId: subscriptionId,
        originalEvent: enrichedEvent,
        error: {
          message: error.message,
          stack: error.stack,
          timestamp: new Date()
        },
        createdAt: new Date(),
        status: 'failed',
        retryAttempts: 0
      });

      console.log(`Event sent to dead letter queue for subscription: ${subscriptionId}`);
    } catch (dlqError) {
      console.error('Error sending event to dead letter queue:', dlqError);
    }
  }

  // Comprehensive event analytics and monitoring
  async getEventAnalytics(timeRange = '24h') {
    console.log('Generating event processing analytics...');

    const timeRanges = {
      '1h': 1,
      '6h': 6,
      '24h': 24,
      '7d': 168,
      '30d': 720
    };

    const hours = timeRanges[timeRange] || 24;
    const startTime = new Date(Date.now() - (hours * 60 * 60 * 1000));

    try {
      // Get subscription performance metrics
      const subscriptionMetrics = await this.db.collection('event_subscriptions')
        .aggregate([
          {
            $project: {
              subscriptionId: 1,
              serviceName: 1,
              isActive: 1,
              'metrics.eventsReceived': 1,
              'metrics.eventsProcessed': 1,
              'metrics.eventsSkipped': 1,
              'metrics.processingErrors': 1,
              'metrics.averageProcessingTime': 1,
              lastEventProcessed: 1,
              createdAt: 1
            }
          }
        ]).toArray();

      // Get event volume trends
      const eventTrends = await this.db.collection('event_processing_log')
        .aggregate([
          {
            $match: {
              timestamp: { $gte: startTime }
            }
          },
          {
            $group: {
              _id: {
                hour: { $hour: '$timestamp' },
                serviceName: '$serviceName'
              },
              eventCount: { $sum: 1 },
              avgProcessingTime: { $avg: '$processingTime' }
            }
          },
          {
            $sort: { '_id.hour': 1 }
          }
        ]).toArray();

      // Get error analysis
      const errorAnalysis = await this.db.collection('dead_letter_events')
        .aggregate([
          {
            $match: {
              createdAt: { $gte: startTime }
            }
          },
          {
            $group: {
              _id: {
                subscriptionId: '$subscriptionId',
                errorType: '$error.message'
              },
              errorCount: { $sum: 1 },
              latestError: { $max: '$createdAt' }
            }
          }
        ]).toArray();

      return {
        reportGeneratedAt: new Date(),
        timeRange: timeRange,

        // Overall system metrics
        systemMetrics: {
          ...this.metrics,
          activeSubscriptions: this.subscriptions.size,
          totalSubscriptions: subscriptionMetrics.length
        },

        // Subscription performance
        subscriptionPerformance: subscriptionMetrics,

        // Event volume trends
        eventTrends: eventTrends,

        // Error analysis
        errorAnalysis: errorAnalysis,

        // Health indicators
        healthIndicators: {
          subscriptionsWithErrors: errorAnalysis.length,
          averageProcessingTime: this.metrics.averageProcessingTime,
          eventProcessingRate: this.metrics.totalEventsProcessed / hours,
          systemHealth: this.calculateSystemHealth()
        }
      };

    } catch (error) {
      console.error('Error generating event analytics:', error);
      throw error;
    }
  }

  // System health monitoring
  calculateSystemHealth() {
    const totalReceived = this.metrics.totalEventsReceived || 1; // avoid division by zero before any events arrive
    const errorRate = this.metrics.eventProcessingErrors / totalReceived;
    const processingEfficiency = this.metrics.totalEventsProcessed / totalReceived;

    if (errorRate > 0.05) return 'Critical';
    if (errorRate > 0.01 || processingEfficiency < 0.95) return 'Warning';
    if (this.metrics.averageProcessingTime > 5000) return 'Degraded';
    return 'Healthy';
  }

  // Utility methods
  async setupEventManagementCollections() {
    // Create indexes for optimal performance
    await this.db.collection('event_subscriptions').createIndexes([
      { key: { subscriptionId: 1 }, unique: true },
      { key: { serviceName: 1 } },
      { key: { isActive: 1 } }
    ]);

    await this.db.collection('dead_letter_events').createIndexes([
      { key: { subscriptionId: 1, createdAt: -1 } },
      { key: { createdAt: 1 }, expireAfterSeconds: 30 * 24 * 60 * 60 } // 30 days TTL
    ]);
  }

  async loadExistingSubscriptions() {
    const subscriptions = await this.db.collection('event_subscriptions')
      .find({ isActive: true })
      .toArray();

    subscriptions.forEach(sub => {
      this.subscriptions.set(sub.subscriptionId, sub);
    });
  }

  generateEventId() {
    return `evt_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`;
  }

  generateCorrelationId(changeEvent) {
    return `corr_${changeEvent.ns?.coll}_${changeEvent.documentKey?._id}_${Date.now()}`;
  }

  async extractBusinessContext(document, collectionName) {
    // Extract relevant business context based on collection
    const context = {
      collectionName: collectionName,
      entityId: document._id,
      entityType: collectionName.replace(/s$/, '') // Simple singularization
    };

    // Add collection-specific context
    if (collectionName === 'users') {
      context.userEmail = document.email;
      context.userType = document.userType;
    } else if (collectionName === 'orders') {
      context.customerId = document.customerId;
      context.orderTotal = document.total;
    } else if (collectionName === 'products') {
      context.productCategory = document.category;
      context.productBrand = document.brand;
    }

    return context;
  }

  async persistResumeToken(subscriptionId, resumeToken) {
    await this.db.collection('event_subscriptions').updateOne(
      { subscriptionId: subscriptionId },
      { $set: { resumeToken: resumeToken, updatedAt: new Date() } }
    );
  }

  async handleStreamError(subscriptionId, error) {
    const subscription = this.subscriptions.get(subscriptionId);
    subscription.isActive = false;

    console.error(`Handling stream error for ${subscriptionId}:`, error);

    // Implement automatic reconnection logic
    if (this.config.subscriptionManagement.autoReconnect) {
      setTimeout(async () => {
        try {
          console.log(`Attempting to reconnect stream for ${subscriptionId}`);
          await this.startEventStreaming(subscriptionId);
        } catch (reconnectError) {
          console.error(`Failed to reconnect stream for ${subscriptionId}:`, reconnectError);
        }
      }, 5000); // Retry after 5 seconds
    }
  }

  startHealthMonitoring() {
    setInterval(async () => {
      try {
        for (const [subscriptionId, subscription] of this.subscriptions) {
          const isHealthy = subscription.isActive && 
            (Date.now() - (subscription.lastEventProcessed?.getTime() || Date.now())) < 300000; // 5 minutes

          this.subscriptionHealthStatus.set(subscriptionId, {
            isHealthy: isHealthy,
            lastCheck: new Date(),
            subscription: subscription
          });
        }
      } catch (error) {
        console.error('Health monitoring error:', error);
      }
    }, 60000); // Check every minute
  }

  // Graceful shutdown
  async shutdown() {
    console.log('Shutting down event-driven microservices manager...');

    // Close all active streams
    for (const [subscriptionId, stream] of this.activeStreams) {
      try {
        await stream.close();
        console.log(`Closed stream for subscription: ${subscriptionId}`);
      } catch (error) {
        console.error(`Error closing stream for ${subscriptionId}:`, error);
      }
    }

    // Close MongoDB connection
    await this.client.close();
    console.log('Event-driven microservices manager shutdown complete');
  }
}

// Example usage demonstrating comprehensive microservices event processing
async function demonstrateMicroservicesEventProcessing() {
  const eventManager = new MongoEventDrivenMicroservicesManager('mongodb://localhost:27017');

  try {
    await eventManager.initialize('microservices_platform');

    console.log('Setting up microservices event subscriptions...');

    // User service subscription for authentication events
    await eventManager.createEventSubscription({
      subscriptionId: 'user_auth_events',
      serviceName: 'authentication_service',
      collections: ['users'],
      eventTypes: ['insert', 'update'],
      pipeline: [
        {
          $match: {
            $or: [
              { operationType: 'insert' },
              { 
                operationType: 'update',
                'updateDescription.updatedFields.lastLogin': { $exists: true }
              }
            ]
          }
        }
      ],
      eventHandler: async (event, subscription) => {
        console.log(`Auth Service processing: ${event.operationType} for user ${event.documentKey._id}`);

        if (event.operationType === 'insert') {
          // Send welcome email
          console.log('Triggering welcome email workflow');
        } else if (event.operationType === 'update' && event.fullDocument.lastLogin) {
          // Log user activity
          console.log('Recording user login activity');
        }
      }
    });

    // Order service subscription for inventory management
    await eventManager.createEventSubscription({
      subscriptionId: 'inventory_management',
      serviceName: 'inventory_service',
      collections: ['orders'],
      eventTypes: ['insert', 'update'],
      batchProcessing: true,
      batchSize: 10,
      eventHandler: async (events, subscription) => {
        console.log(`Inventory Service processing batch of ${events.length} order events`);

        for (const event of events) {
          if (event.operationType === 'insert' && event.fullDocument.status === 'confirmed') {
            console.log(`Reducing inventory for order: ${event.documentKey._id}`);
            // Update inventory levels
          }
        }
      }
    });

    // Analytics service subscription for real-time metrics
    await eventManager.createEventSubscription({
      subscriptionId: 'realtime_analytics',
      serviceName: 'analytics_service',
      collections: ['orders', 'products', 'users'],
      eventTypes: ['insert', 'update', 'delete'],
      eventHandler: async (event, subscription) => {
        console.log(`Analytics Service processing: ${event.operationType} on ${event.ns.coll}`);

        // Update real-time dashboards
        if (event.ns.coll === 'orders' && event.operationType === 'insert') {
          console.log('Updating real-time sales metrics');
        }
      }
    });

    // Start event streaming for all subscriptions
    await eventManager.startEventStreaming('user_auth_events');
    await eventManager.startEventStreaming('inventory_management');
    await eventManager.startEventStreaming('realtime_analytics');

    console.log('All event streams started successfully');

    // Simulate some database changes to trigger events
    console.log('Simulating database changes...');

    // Insert a new user
    await eventManager.db.collection('users').insertOne({
      email: 'john.doe@example.com',
      name: 'John Doe',
      userType: 'premium',
      createdAt: new Date()
    });

    // Insert a new order
    await eventManager.db.collection('orders').insertOne({
      customerId: new ObjectId(),
      total: 299.99,
      status: 'confirmed',
      items: [
        { productId: new ObjectId(), quantity: 2, price: 149.99 }
      ],
      createdAt: new Date()
    });

    // Wait a bit for events to process
    await new Promise(resolve => setTimeout(resolve, 2000));

    // Get analytics report
    const analytics = await eventManager.getEventAnalytics('1h');
    console.log('Event Processing Analytics:', JSON.stringify(analytics, null, 2));

  } catch (error) {
    console.error('Microservices event processing demonstration error:', error);
  } finally {
    await eventManager.shutdown();
  }
}

// Export the event-driven microservices manager
module.exports = {
  MongoEventDrivenMicroservicesManager,
  demonstrateMicroservicesEventProcessing
};

SQL-Style Event Processing with QueryLeaf

QueryLeaf provides familiar SQL approaches to MongoDB Change Streams and event-driven architectures:

-- QueryLeaf event-driven microservices with SQL-familiar syntax

-- Create event subscription with comprehensive configuration
CREATE EVENT_SUBSCRIPTION user_service_events AS (
  -- Subscription identification
  subscription_id = 'user_lifecycle_events',
  service_name = 'user_service',

  -- Event source configuration
  watch_collections = JSON_ARRAY('users', 'user_profiles', 'user_preferences'),
  event_types = JSON_ARRAY('insert', 'update', 'delete'),

  -- Advanced event filtering with SQL-style conditions
  event_filter = JSON_OBJECT(
    'operationType', JSON_OBJECT('$in', JSON_ARRAY('insert', 'update')),
    '$or', JSON_ARRAY(
      JSON_OBJECT('operationType', 'insert'),
      JSON_OBJECT(
        'operationType', 'update',
        'updateDescription.updatedFields', JSON_OBJECT(
          '$or', JSON_ARRAY(
            JSON_OBJECT('email', JSON_OBJECT('$exists', true)),
            JSON_OBJECT('status', JSON_OBJECT('$exists', true)),
            JSON_OBJECT('subscription_tier', JSON_OBJECT('$exists', true))
          )
        )
      )
    )
  ),

  -- Event processing configuration
  batch_processing = false,
  preserve_order = true,
  full_document = 'updateLookup',
  full_document_before_change = 'whenAvailable',

  -- Error handling and retry policy
  max_retry_attempts = 3,
  retry_delay_ms = 1000,
  exponential_backoff = true,
  dead_letter_queue_enabled = true,

  -- Performance settings
  batch_size = 100,
  processing_timeout_ms = 30000,

  -- Subscription metadata
  created_by = 'user_service_admin',
  description = 'User lifecycle events for authentication and personalization services'
);

-- Monitor event processing with real-time analytics
WITH event_stream_metrics AS (
  SELECT 
    subscription_id,
    service_name,

    -- Event volume metrics
    COUNT(*) as total_events_received,
    COUNT(CASE WHEN processing_status = 'completed' THEN 1 END) as events_processed,
    COUNT(CASE WHEN processing_status = 'failed' THEN 1 END) as events_failed,
    COUNT(CASE WHEN processing_status = 'retrying' THEN 1 END) as events_retrying,

    -- Processing performance
    AVG(processing_duration_ms) as avg_processing_time_ms,
    MAX(processing_duration_ms) as max_processing_time_ms,
    MIN(processing_duration_ms) as min_processing_time_ms,

    -- Event type distribution
    COUNT(CASE WHEN event_type = 'insert' THEN 1 END) as insert_events,
    COUNT(CASE WHEN event_type = 'update' THEN 1 END) as update_events,
    COUNT(CASE WHEN event_type = 'delete' THEN 1 END) as delete_events,

    -- Collection distribution
    COUNT(CASE WHEN collection_name = 'users' THEN 1 END) as user_events,
    COUNT(CASE WHEN collection_name = 'user_profiles' THEN 1 END) as profile_events,
    COUNT(CASE WHEN collection_name = 'user_preferences' THEN 1 END) as preference_events,

    -- Time-based analysis
    DATE_FORMAT(event_timestamp, '%Y-%m-%d %H:00:00') as hour_bucket,
    COUNT(*) as hourly_event_count,

    -- Success rate calculation
    ROUND(
      (COUNT(CASE WHEN processing_status = 'completed' THEN 1 END) * 100.0) / 
      COUNT(*), 2
    ) as success_rate_percent

  FROM CHANGE_STREAM_EVENTS()
  WHERE event_timestamp >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
    AND subscription_id IN (
      'user_lifecycle_events', 
      'inventory_management', 
      'realtime_analytics',
      'notification_service',
      'audit_logging'
    )
  GROUP BY 
    subscription_id, 
    service_name, 
    DATE_FORMAT(event_timestamp, '%Y-%m-%d %H:00:00')
),

-- Event processing lag and performance analysis
processing_performance AS (
  SELECT 
    subscription_id,

    -- Latency metrics
    AVG(TIMESTAMPDIFF(MICROSECOND, event_timestamp, processing_completed_at) / 1000) as avg_processing_lag_ms,
    MAX(TIMESTAMPDIFF(MICROSECOND, event_timestamp, processing_completed_at) / 1000) as max_processing_lag_ms,

    -- Throughput calculations
    COUNT(*) / 
      (TIMESTAMPDIFF(SECOND, MIN(event_timestamp), MAX(event_timestamp)) / 3600.0) as events_per_hour,

    -- Error analysis
    COUNT(CASE WHEN retry_count > 0 THEN 1 END) as events_requiring_retry,
    AVG(retry_count) as avg_retry_count,

    -- Resume token health
    MAX(resume_token_timestamp) as latest_resume_token,
    TIMESTAMPDIFF(SECOND, MAX(resume_token_timestamp), NOW()) as resume_token_lag_seconds,

    -- Queue depth analysis
    COUNT(CASE WHEN processing_status = 'queued' THEN 1 END) as current_queue_depth,

    -- Service health indicators
    CASE 
      WHEN success_rate_percent >= 99 AND avg_processing_lag_ms < 1000 THEN 'Excellent'
      WHEN success_rate_percent >= 95 AND avg_processing_lag_ms < 5000 THEN 'Good'
      WHEN success_rate_percent >= 90 AND avg_processing_lag_ms < 15000 THEN 'Fair'
      ELSE 'Needs Attention'
    END as service_health_status

  FROM event_stream_metrics
  GROUP BY subscription_id
)

SELECT 
  esm.subscription_id,
  esm.service_name,
  esm.total_events_received,
  esm.events_processed,
  esm.success_rate_percent,
  esm.avg_processing_time_ms,

  -- Performance indicators
  pp.avg_processing_lag_ms,
  pp.events_per_hour,
  pp.service_health_status,

  -- Event distribution
  esm.insert_events,
  esm.update_events,
  esm.delete_events,

  -- Collection breakdown
  esm.user_events,
  esm.profile_events,
  esm.preference_events,

  -- Error and retry analysis
  esm.events_failed,
  pp.events_requiring_retry,
  pp.avg_retry_count,
  pp.current_queue_depth,

  -- Real-time status
  pp.resume_token_lag_seconds,
  CASE 
    WHEN pp.resume_token_lag_seconds > 300 THEN 'Stream Lagging'
    WHEN pp.current_queue_depth > 1000 THEN 'Queue Backlog'
    WHEN esm.success_rate_percent < 95 THEN 'High Error Rate'
    ELSE 'Healthy'
  END as real_time_status,

  -- Performance recommendations
  CASE 
    WHEN pp.avg_processing_lag_ms > 10000 THEN 'Increase processing capacity'
    WHEN pp.current_queue_depth > 500 THEN 'Enable batch processing'
    WHEN esm.success_rate_percent < 90 THEN 'Review error handling'
    WHEN pp.events_per_hour > 10000 THEN 'Consider partitioning'
    ELSE 'Performance optimal'
  END as optimization_recommendation

FROM event_stream_metrics esm
JOIN processing_performance pp ON esm.subscription_id = pp.subscription_id
ORDER BY esm.total_events_received DESC, esm.success_rate_percent ASC;

-- Advanced event correlation and business process tracking
WITH event_correlation AS (
  SELECT 
    correlation_id,
    business_process_id,

    -- Process timeline tracking
    MIN(event_timestamp) as process_start_time,
    MAX(event_timestamp) as process_end_time,
    TIMESTAMPDIFF(SECOND, MIN(event_timestamp), MAX(event_timestamp)) as process_duration_seconds,

    -- Event sequence analysis
    GROUP_CONCAT(
      CONCAT(service_name, ':', event_type, ':', collection_name) 
      ORDER BY event_timestamp 
      SEPARATOR ' -> '
    ) as event_sequence,

    COUNT(*) as total_events_in_process,
    COUNT(DISTINCT service_name) as services_involved,
    COUNT(DISTINCT collection_name) as collections_affected,

    -- Process completion analysis
    COUNT(CASE WHEN processing_status = 'completed' THEN 1 END) as completed_events,
    COUNT(CASE WHEN processing_status = 'failed' THEN 1 END) as failed_events,

    -- Business metrics
    SUM(CAST(JSON_EXTRACT(event_data, '$.order_total') AS DECIMAL(10,2))) as total_order_value,
    COUNT(CASE WHEN event_type = 'insert' AND collection_name = 'orders' THEN 1 END) as orders_created,
    COUNT(CASE WHEN event_type = 'update' AND collection_name = 'inventory' THEN 1 END) as inventory_updates,

    -- Process success indicators
    CASE 
      WHEN COUNT(CASE WHEN processing_status = 'failed' THEN 1 END) = 0 
        AND COUNT(CASE WHEN processing_status = 'completed' THEN 1 END) = COUNT(*) 
      THEN 'Success'
      WHEN COUNT(CASE WHEN processing_status = 'failed' THEN 1 END) > 0 THEN 'Failed'
      ELSE 'In Progress'
    END as process_status

  FROM CHANGE_STREAM_EVENTS()
  WHERE correlation_id IS NOT NULL
    AND event_timestamp >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
  GROUP BY correlation_id, business_process_id
),

-- Service dependency and interaction analysis
service_interactions AS (
  SELECT 
    source_service,
    target_service,
    interaction_type,

    -- Interaction volume and frequency
    COUNT(*) as interaction_count,
    COUNT(*) / (TIMESTAMPDIFF(SECOND, MIN(event_timestamp), MAX(event_timestamp)) / 60.0) as interactions_per_minute,

    -- Success and failure rates
    COUNT(CASE WHEN processing_status = 'completed' THEN 1 END) as successful_interactions,
    COUNT(CASE WHEN processing_status = 'failed' THEN 1 END) as failed_interactions,
    ROUND(
      (COUNT(CASE WHEN processing_status = 'completed' THEN 1 END) * 100.0) / COUNT(*), 2
    ) as interaction_success_rate,

    -- Performance metrics
    AVG(processing_duration_ms) as avg_interaction_time_ms,
    MAX(processing_duration_ms) as max_interaction_time_ms,

    -- Data volume analysis
    AVG(LENGTH(event_data)) as avg_event_size_bytes,
    SUM(LENGTH(event_data)) as total_data_transferred_bytes

  FROM CHANGE_STREAM_EVENTS()
  WHERE event_timestamp >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
    AND source_service IS NOT NULL
    AND target_service IS NOT NULL
  GROUP BY source_service, target_service, interaction_type
)

SELECT 
  -- Process correlation summary
  'BUSINESS_PROCESSES' as section,
  JSON_OBJECT(
    'total_processes', COUNT(*),
    'successful_processes', COUNT(CASE WHEN process_status = 'Success' THEN 1 END),
    'failed_processes', COUNT(CASE WHEN process_status = 'Failed' THEN 1 END),
    'in_progress_processes', COUNT(CASE WHEN process_status = 'In Progress' THEN 1 END),
    'avg_process_duration_seconds', AVG(process_duration_seconds),
    'total_business_value', SUM(total_order_value),
    'top_processes', JSON_ARRAYAGG(
      JSON_OBJECT(
        'correlation_id', correlation_id,
        'duration_seconds', process_duration_seconds,
        'services_involved', services_involved,
        'event_sequence', event_sequence,
        'status', process_status
      ) LIMIT 10
    )
  ) as process_analytics
FROM event_correlation

UNION ALL

SELECT 
  -- Service interaction summary
  'SERVICE_INTERACTIONS' as section,
  JSON_OBJECT(
    'total_interactions', SUM(interaction_count),
    'service_pairs', COUNT(*),
    'avg_success_rate', AVG(interaction_success_rate),
    'total_data_transferred_mb', SUM(total_data_transferred_bytes) / 1024 / 1024,
    'interaction_details', JSON_ARRAYAGG(
      JSON_OBJECT(
        'source_service', source_service,
        'target_service', target_service,
        'interaction_count', interaction_count,
        'success_rate', interaction_success_rate,
        'avg_time_ms', avg_interaction_time_ms
      )
    )
  ) as interaction_analytics
FROM service_interactions;

-- Real-time event stream monitoring dashboard
CREATE VIEW microservices_event_dashboard AS
SELECT 
  -- Current system status
  (SELECT COUNT(*) FROM ACTIVE_CHANGE_STREAMS()) as active_streams,
  (SELECT COUNT(*) FROM EVENT_SUBSCRIPTIONS() WHERE status = 'active') as active_subscriptions,
  (SELECT COUNT(*) FROM CHANGE_STREAM_EVENTS() WHERE event_timestamp >= DATE_SUB(NOW(), INTERVAL 1 MINUTE)) as events_per_minute,

  -- Processing queue status
  (SELECT COUNT(*) FROM CHANGE_STREAM_EVENTS() WHERE processing_status = 'queued') as queued_events,
  (SELECT COUNT(*) FROM CHANGE_STREAM_EVENTS() WHERE processing_status = 'processing') as processing_events,
  (SELECT COUNT(*) FROM CHANGE_STREAM_EVENTS() WHERE processing_status = 'retrying') as retrying_events,

  -- Error indicators
  (SELECT COUNT(*) FROM CHANGE_STREAM_EVENTS() 
   WHERE processing_status = 'failed' AND event_timestamp >= DATE_SUB(NOW(), INTERVAL 1 HOUR)) as errors_last_hour,
  (SELECT COUNT(*) FROM DEAD_LETTER_EVENTS() 
   WHERE created_at >= DATE_SUB(NOW(), INTERVAL 1 HOUR)) as dead_letter_events_hour,

  -- Performance indicators
  (SELECT AVG(processing_duration_ms) FROM CHANGE_STREAM_EVENTS() 
   WHERE processing_status = 'completed' AND event_timestamp >= DATE_SUB(NOW(), INTERVAL 5 MINUTE)) as avg_processing_time_5min,
  (SELECT MAX(resume_token_lag_seconds) FROM EVENT_SUBSCRIPTIONS()) as max_resume_token_lag,

  -- Service health summary
  (SELECT 
     JSON_ARRAYAGG(
       JSON_OBJECT(
         'service_name', service_name,
         'subscription_count', subscription_count,
         'success_rate', success_rate,
         'health_status', health_status
       )
     )
   FROM (
     SELECT 
       service_name,
       COUNT(*) as subscription_count,
       AVG(success_rate_percent) as success_rate,
       CASE 
         WHEN AVG(success_rate_percent) >= 99 THEN 'Excellent'
         WHEN AVG(success_rate_percent) >= 95 THEN 'Good'
         WHEN AVG(success_rate_percent) >= 90 THEN 'Warning'
         ELSE 'Critical'
       END as health_status
     FROM event_stream_metrics
     GROUP BY service_name
   ) service_health
  ) as service_health_summary,

  -- System health assessment
  CASE 
    WHEN (SELECT COUNT(*) FROM CHANGE_STREAM_EVENTS() WHERE processing_status = 'failed' 
          AND event_timestamp >= DATE_SUB(NOW(), INTERVAL 5 MINUTE)) > 100 THEN 'Critical'
    WHEN (SELECT MAX(resume_token_lag_seconds) FROM EVENT_SUBSCRIPTIONS()) > 300 THEN 'Warning'
    WHEN (SELECT AVG(processing_duration_ms) FROM CHANGE_STREAM_EVENTS() 
          WHERE event_timestamp >= DATE_SUB(NOW(), INTERVAL 5 MINUTE)) > 5000 THEN 'Degraded'
    ELSE 'Healthy'
  END as overall_system_health,

  NOW() as dashboard_timestamp;

-- QueryLeaf Change Streams provide:
-- 1. SQL-familiar event subscription creation and management
-- 2. Real-time event processing monitoring and analytics
-- 3. Advanced event correlation and business process tracking
-- 4. Service interaction analysis and dependency mapping
-- 5. Comprehensive error handling and dead letter queue management
-- 6. Performance optimization recommendations and health monitoring
-- 7. Integration with MongoDB's native Change Streams capabilities
-- 8. Familiar SQL syntax for complex event processing workflows
-- 9. Real-time dashboard views for operational monitoring
-- 10. Enterprise-grade event-driven architecture patterns

Best Practices for MongoDB Change Streams

Event-Driven Architecture Design

Essential principles for building robust microservices with Change Streams:

  1. Event Filtering: Use precise filtering to reduce network traffic and processing overhead
  2. Resume Token Management: Implement robust resume token persistence for fault tolerance (a minimal sketch follows this list)
  3. Batch Processing: Configure appropriate batch sizes for high-volume event scenarios
  4. Error Handling: Design comprehensive error handling with retry policies and dead letter queues
  5. Service Boundaries: Align Change Stream subscriptions with clear service boundaries and responsibilities
  6. Performance Monitoring: Implement real-time monitoring for event processing lag and system health
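
Resume token persistence (point 2) is the piece most often implemented incorrectly, so a concrete illustration helps. The following is a minimal Node.js sketch, assuming a hypothetical resume_tokens collection for persistence and an orders collection to watch; the database and collection names and the handleEvent function are illustrative assumptions, while watch(), resumeAfter, and fullDocument are standard MongoDB driver features.

// Minimal sketch: persist and reuse Change Stream resume tokens (assumed namespaces)
const { MongoClient } = require('mongodb');

async function watchOrdersWithResume(uri) {
  const client = new MongoClient(uri);
  await client.connect();
  const db = client.db('ecommerce');                  // assumed database name
  const tokenStore = db.collection('resume_tokens');  // hypothetical token persistence collection

  // Load the last persisted token so processing resumes where it left off after a restart
  const saved = await tokenStore.findOne({ _id: 'orders_stream' });

  const changeStream = db.collection('orders').watch(
    [{ $match: { operationType: { $in: ['insert', 'update'] } } }],  // precise filtering
    {
      fullDocument: 'updateLookup',
      ...(saved?.token ? { resumeAfter: saved.token } : {})
    }
  );

  for await (const event of changeStream) {
    await handleEvent(event);  // application-specific processing

    // Persist the token only after the event is handled successfully
    await tokenStore.updateOne(
      { _id: 'orders_stream' },
      { $set: { token: event._id, updated_at: new Date() } },
      { upsert: true }
    );
  }
}

async function handleEvent(event) {
  console.log(`Processing ${event.operationType} on ${event.ns.coll}`);
}

Persisting the token after processing rather than before trades occasional duplicate delivery for zero lost events, so event handlers should be idempotent.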

Production Deployment Strategies

Optimize Change Streams for enterprise-scale microservices architectures:

  1. Connection Management: Use dedicated connections for Change Streams to avoid resource contention (see the configuration sketch after this list)
  2. Replica Set Configuration: Ensure proper read preferences for Change Stream operations
  3. Network Optimization: Configure appropriate network timeouts and connection pooling
  4. Scaling Patterns: Implement horizontal scaling strategies for high-volume event processing
  5. Security Integration: Secure Change Stream connections with proper authentication and encryption
  6. Operational Monitoring: Deploy comprehensive monitoring and alerting for Change Stream health
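
As a minimal sketch of the connection-management and read-preference points above, the configuration below keeps Change Stream traffic on its own client with a small dedicated pool; the specific option values, namespace, and appName are illustrative assumptions rather than recommended defaults.

// Minimal sketch: a dedicated client for Change Streams, separate from request-serving traffic
const { MongoClient } = require('mongodb');

function createChangeStreamClient(uri) {
  return new MongoClient(uri, {
    // Small, dedicated pool so long-lived change stream cursors don't starve application queries
    maxPoolSize: 5,
    minPoolSize: 1,

    // Offload the primary where the workload tolerates it (assumes a replica set deployment)
    readPreference: 'secondaryPreferred',

    // Long-running streams benefit from retryable reads and an identifiable appName in server logs
    retryReads: true,
    appName: 'change-stream-workers'
  });
}

// Usage: open streams with bounded batches and await times
async function openInventoryStream(client) {
  const coll = client.db('ecommerce').collection('inventory');  // assumed namespace
  return coll.watch([], {
    batchSize: 500,        // bounds memory per getMore round trip
    maxAwaitTimeMS: 1000   // how long the server waits for new events before returning an empty batch
  });
}

Keeping these connections on a separate client also makes it straightforward to apply different timeout and monitoring settings than those used for interactive request traffic.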

Conclusion

MongoDB Change Streams provide sophisticated event-driven capabilities that enable resilient microservices architectures through native database-level event processing, automatic fault tolerance, and comprehensive filtering mechanisms. By implementing advanced Change Stream patterns with QueryLeaf's familiar SQL interface, organizations can build robust distributed systems that maintain data consistency, service decoupling, and operational resilience at scale.

Key Change Streams benefits include:

  • Native Event Processing: Database-level event streaming without additional middleware dependencies
  • Guaranteed Delivery: Ordered event delivery with automatic resume token management for fault tolerance
  • Advanced Filtering: Sophisticated event filtering and routing capabilities with minimal network overhead
  • High Performance: Optimized event processing with configurable batching and concurrency controls
  • Service Decoupling: Clean separation of concerns enabling independent service evolution and scaling
  • Operational Simplicity: Reduced infrastructure complexity compared to traditional message queue systems

Whether you're building e-commerce platforms, financial services applications, or distributed data processing systems, MongoDB Change Streams with QueryLeaf's event processing interface provide the foundation for scalable, reliable event-driven microservices architectures that can evolve and scale with growing business requirements.

QueryLeaf Integration: QueryLeaf automatically translates SQL-familiar event processing commands into optimized MongoDB Change Stream operations, providing familiar subscription management, event correlation, and monitoring capabilities. Advanced event-driven patterns, service interaction analysis, and performance optimization are seamlessly handled through SQL-style interfaces, making sophisticated microservices architecture both powerful and accessible for database-oriented development teams.

The combination of MongoDB's native Change Streams with SQL-style event processing operations makes MongoDB an ideal platform for modern distributed systems that require both real-time event processing and familiar database administration patterns, keeping your microservices architecture scalable and maintainable as it grows to meet demanding production requirements.

MongoDB Data Archiving and Lifecycle Management: Advanced Strategies for Automated Data Retention, Performance Optimization, and Compliance

Production database systems accumulate vast amounts of data over time, creating significant challenges for performance optimization, storage cost management, and regulatory compliance. Traditional database systems often struggle with efficient data archiving strategies that balance query performance, storage costs, and data accessibility requirements while maintaining operational efficiency and compliance with data retention policies.

MongoDB provides comprehensive data lifecycle management capabilities that enable sophisticated archiving strategies through automated retention policies, performance-optimized data movement, and flexible storage tiering. Unlike traditional databases that require complex partitioning schemes and manual maintenance processes, MongoDB's document-based architecture and built-in features support seamless data archiving workflows that scale with growing data volumes while maintaining operational simplicity.

The Traditional Data Archiving Challenge

Conventional database systems face significant limitations when implementing data archiving and lifecycle management:

-- Traditional PostgreSQL data archiving - complex and maintenance-intensive approach

-- Create archive tables with identical structures (manual maintenance required)
CREATE TABLE orders_2023_archive (
    order_id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    order_date TIMESTAMP NOT NULL,
    status VARCHAR(50) NOT NULL,
    total_amount DECIMAL(10,2) NOT NULL,
    items JSONB,
    shipping_address TEXT,
    billing_address TEXT,

    -- Archive-specific metadata
    archived_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    archived_by VARCHAR(100) DEFAULT current_user,
    archive_reason VARCHAR(200),
    original_table VARCHAR(100) DEFAULT 'orders',

    -- Compliance tracking
    retention_policy VARCHAR(100),
    scheduled_deletion_date DATE,
    legal_hold BOOLEAN DEFAULT false,

    -- Performance considerations
    CONSTRAINT orders_2023_archive_date_check 
        CHECK (order_date >= '2023-01-01' AND order_date < '2024-01-01')
);

-- Create indexes for archive table (must mirror production indexes)
CREATE INDEX orders_2023_archive_customer_id_idx ON orders_2023_archive(customer_id);
CREATE INDEX orders_2023_archive_date_idx ON orders_2023_archive(order_date);
CREATE INDEX orders_2023_archive_status_idx ON orders_2023_archive(status);
CREATE INDEX orders_2023_archive_archived_date_idx ON orders_2023_archive(archived_date);

-- Similar structure needed for each year and potentially each table
CREATE TABLE customer_interactions_2023_archive (
    interaction_id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    interaction_date TIMESTAMP NOT NULL,
    interaction_type VARCHAR(100),
    details JSONB,
    outcome VARCHAR(100),

    -- Archive metadata
    archived_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    archived_by VARCHAR(100) DEFAULT current_user,
    archive_reason VARCHAR(200),
    original_table VARCHAR(100) DEFAULT 'customer_interactions',

    CONSTRAINT customer_interactions_2023_archive_date_check 
        CHECK (interaction_date >= '2023-01-01' AND interaction_date < '2024-01-01')
);

-- Complex archiving procedure with limited automation
CREATE OR REPLACE FUNCTION archive_old_data(
    source_table VARCHAR(100),
    archive_table VARCHAR(100), 
    cutoff_date DATE,
    batch_size INTEGER DEFAULT 1000,
    archive_reason VARCHAR(200) DEFAULT 'automated_archiving'
) RETURNS TABLE (
    records_archived INTEGER,
    batches_processed INTEGER,
    total_processing_time_seconds INTEGER,
    errors_encountered INTEGER,
    last_archived_id BIGINT
) AS $$
DECLARE
    current_batch INTEGER := 0;
    total_archived INTEGER := 0;
    start_time TIMESTAMP := clock_timestamp();
    last_id BIGINT := 0;
    batch_result INTEGER;
    error_count INTEGER := 0;
    sql_command TEXT;
    archive_command TEXT;
BEGIN

    LOOP
        -- Dynamic SQL for flexible table handling (security risk)
        sql_command := FORMAT('
            WITH batch_data AS (
                SELECT * FROM %I 
                WHERE created_date < %L 
                AND id > %L
                ORDER BY id 
                LIMIT %L
            ),
            archived_batch AS (
                INSERT INTO %I 
                SELECT *, CURRENT_TIMESTAMP, %L, %L, %L
                FROM batch_data
                RETURNING id
            ),
            deleted_batch AS (
                DELETE FROM %I 
                WHERE id IN (SELECT id FROM archived_batch)
                RETURNING id
            )
            SELECT COUNT(*), MAX(id) FROM deleted_batch',
            source_table,
            cutoff_date,
            last_id,
            batch_size,
            archive_table,
            current_user,
            archive_reason,
            source_table,
            source_table
        );

        BEGIN
            EXECUTE sql_command INTO batch_result, last_id;

            -- Exit if no more records to process
            IF batch_result = 0 OR last_id IS NULL THEN
                EXIT;
            END IF;

            total_archived := total_archived + batch_result;
            current_batch := current_batch + 1;

            -- Commit every batch to avoid long-running transactions
            COMMIT;

            -- Brief pause to avoid overwhelming the system
            PERFORM pg_sleep(0.1);

        EXCEPTION WHEN OTHERS THEN
            error_count := error_count + 1;

            -- Log error details (basic error handling)
            INSERT INTO archive_error_log (
                source_table,
                archive_table,
                batch_number,
                last_processed_id,
                error_message,
                error_timestamp
            ) VALUES (
                source_table,
                archive_table,
                current_batch,
                last_id,
                SQLERRM,
                CURRENT_TIMESTAMP
            );

            -- Stop after too many errors
            IF error_count > 10 THEN
                EXIT;
            END IF;
        END;
    END LOOP;

    RETURN QUERY SELECT 
        total_archived,
        current_batch,
        EXTRACT(EPOCH FROM clock_timestamp() - start_time)::INTEGER,
        error_count,
        COALESCE(last_id, 0);

EXCEPTION WHEN OTHERS THEN
    -- Global error handling
    INSERT INTO archive_error_log (
        source_table,
        archive_table,
        batch_number,
        last_processed_id,
        error_message,
        error_timestamp
    ) VALUES (
        source_table,
        archive_table,
        -1,
        -1,
        'Global archiving error: ' || SQLERRM,
        CURRENT_TIMESTAMP
    );

    RETURN QUERY SELECT 0, 0, 0, 1, 0::BIGINT;
END;
$$ LANGUAGE plpgsql;

-- Manual data archiving execution (error-prone and inflexible)
DO $$
DECLARE
    archive_result RECORD;
    tables_to_archive VARCHAR(100)[] := ARRAY['orders', 'customer_interactions', 'payment_transactions', 'audit_logs'];
    current_table VARCHAR(100);
    archive_table_name VARCHAR(100);
    cutoff_date DATE := CURRENT_DATE - INTERVAL '2 years';
BEGIN

    FOREACH current_table IN ARRAY tables_to_archive
    LOOP
        -- Generate archive table name
        archive_table_name := current_table || '_' || EXTRACT(YEAR FROM cutoff_date) || '_archive';

        -- Check if archive table exists (manual verification)
        IF NOT EXISTS (SELECT 1 FROM information_schema.tables 
                      WHERE table_name = archive_table_name) THEN
            RAISE NOTICE 'Archive table % does not exist, skipping %', archive_table_name, current_table;
            CONTINUE;
        END IF;

        RAISE NOTICE 'Starting archival of % to %', current_table, archive_table_name;

        -- Execute archiving function
        FOR archive_result IN 
            SELECT * FROM archive_old_data(
                current_table, 
                archive_table_name, 
                cutoff_date,
                1000,  -- batch size
                'automated_yearly_archival'
            )
        LOOP
            RAISE NOTICE 'Archived % records from % in % batches, % errors, processing time: % seconds',
                archive_result.records_archived,
                current_table,
                archive_result.batches_processed,
                archive_result.errors_encountered,
                archive_result.total_processing_time_seconds;
        END LOOP;

        -- Basic statistics update (manual maintenance)
        EXECUTE FORMAT('ANALYZE %I', archive_table_name);

    END LOOP;
END;
$$;

-- Attempt at automated retention policy management (very limited)
CREATE TABLE data_retention_policies (
    policy_id SERIAL PRIMARY KEY,
    table_name VARCHAR(100) NOT NULL,
    retention_period_months INTEGER NOT NULL,
    archive_after_months INTEGER,
    delete_after_months INTEGER,

    -- Policy configuration
    policy_enabled BOOLEAN DEFAULT true,
    date_field VARCHAR(100) NOT NULL DEFAULT 'created_date',
    archive_storage_location VARCHAR(200),

    -- Compliance settings
    legal_hold_exemption BOOLEAN DEFAULT false,
    gdpr_applicable BOOLEAN DEFAULT false,
    custom_retention_rules JSONB,

    -- Execution tracking
    last_executed TIMESTAMP,
    last_execution_status VARCHAR(50),
    last_execution_error TEXT,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Insert basic retention policies (manual configuration)
INSERT INTO data_retention_policies (
    table_name, retention_period_months, archive_after_months, delete_after_months,
    date_field, archive_storage_location
) VALUES 
('orders', 84, 24, 96, 'order_date', '/archives/orders/'),
('customer_interactions', 60, 12, 72, 'interaction_date', '/archives/interactions/'),
('payment_transactions', 120, 36, 144, 'transaction_date', '/archives/payments/'),
('audit_logs', 36, 6, 48, 'log_timestamp', '/archives/audit/');

-- Rudimentary retention policy execution function
CREATE OR REPLACE FUNCTION execute_retention_policies()
RETURNS TABLE (
    policy_name VARCHAR(100),
    execution_status VARCHAR(50),
    records_processed INTEGER,
    execution_time_seconds INTEGER,
    error_message TEXT
) AS $$
DECLARE
    policy_record RECORD;
    archive_cutoff DATE;
    delete_cutoff DATE;
    execution_start TIMESTAMP;
    archive_result RECORD;
    records_affected INTEGER;
BEGIN

    FOR policy_record IN 
        SELECT * FROM data_retention_policies 
        WHERE policy_enabled = true
    LOOP
        execution_start := clock_timestamp();

        BEGIN
            -- Calculate cutoff dates based on policy
            archive_cutoff := CURRENT_DATE - (policy_record.archive_after_months || ' months')::INTERVAL;
            delete_cutoff := CURRENT_DATE - (policy_record.delete_after_months || ' months')::INTERVAL;

            -- Archival phase (if configured)
            IF policy_record.archive_after_months IS NOT NULL THEN
                SELECT * INTO archive_result FROM archive_old_data(
                    policy_record.table_name,
                    policy_record.table_name || '_archive',
                    archive_cutoff,
                    500,
                    'retention_policy_execution'
                );

                records_affected := archive_result.records_archived;
            END IF;

            -- Update execution status
            UPDATE data_retention_policies 
            SET 
                last_executed = CURRENT_TIMESTAMP,
                last_execution_status = 'success',
                last_execution_error = NULL
            WHERE policy_id = policy_record.policy_id;

            RETURN QUERY SELECT 
                policy_record.table_name,
                'success'::VARCHAR(50),
                COALESCE(records_affected, 0),
                EXTRACT(EPOCH FROM clock_timestamp() - execution_start)::INTEGER,
                NULL::TEXT;

        EXCEPTION WHEN OTHERS THEN
            -- Update error status
            UPDATE data_retention_policies 
            SET 
                last_executed = CURRENT_TIMESTAMP,
                last_execution_status = 'error',
                last_execution_error = SQLERRM
            WHERE policy_id = policy_record.policy_id;

            RETURN QUERY SELECT 
                policy_record.table_name,
                'error'::VARCHAR(50),
                0,
                EXTRACT(EPOCH FROM clock_timestamp() - execution_start)::INTEGER,
                SQLERRM;
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Problems with traditional archiving approaches:
-- 1. Manual archive table creation and maintenance for each table and time period
-- 2. Complex partitioning schemes that require ongoing schema management
-- 3. Limited automation capabilities requiring extensive custom development
-- 4. Poor performance during archiving operations that impact production systems
-- 5. Inflexible retention policies that don't adapt to changing business requirements
-- 6. Minimal integration with cloud storage and tiered storage strategies
-- 7. Limited compliance tracking and audit trail capabilities
-- 8. No built-in data lifecycle automation or policy-driven management
-- 9. Complex disaster recovery for archived data across multiple table structures
-- 10. High maintenance overhead for managing archive table schemas and indexes

MongoDB provides sophisticated data lifecycle management with automated archiving capabilities:

// MongoDB Data Archiving and Lifecycle Management - comprehensive automation system
const { MongoClient, GridFSBucket } = require('mongodb');
const { createReadStream, createWriteStream } = require('fs');
const { S3Client, PutObjectCommand, GetObjectCommand } = require('@aws-sdk/client-s3');
const { promisify } = require('util');
const zlib = require('zlib');

// Advanced data lifecycle management and archiving system
class MongoDataLifecycleManager {
  constructor(connectionUri, options = {}) {
    this.client = new MongoClient(connectionUri);
    this.db = null;
    this.collections = new Map();

    // Lifecycle management configuration
    this.config = {
      // Archive storage configuration
      archiveStorage: {
        type: options.archiveStorage?.type || 'mongodb', // mongodb, gridfs, s3, filesystem
        location: options.archiveStorage?.location || 'archives',
        compression: options.archiveStorage?.compression || 'gzip',
        encryption: options.archiveStorage?.encryption || false,
        checksumVerification: options.archiveStorage?.checksumVerification !== false
      },

      // Performance optimization settings
      performance: {
        batchSize: options.performance?.batchSize || 1000,
        maxConcurrentOperations: options.performance?.maxConcurrentOperations || 3,
        throttleDelayMs: options.performance?.throttleDelayMs || 10,
        memoryLimitMB: options.performance?.memoryLimitMB || 512,
        indexOptimization: options.performance?.indexOptimization !== false
      },

      // Compliance and audit settings
      compliance: {
        auditLogging: options.compliance?.auditLogging !== false,
        legalHoldSupport: options.compliance?.legalHoldSupport !== false,
        gdprCompliance: options.compliance?.gdprCompliance || false,
        dataClassification: options.compliance?.dataClassification || {},
        retentionPolicyEnforcement: options.compliance?.retentionPolicyEnforcement !== false
      },

      // Automation settings
      automation: {
        scheduledExecution: options.automation?.scheduledExecution || false,
        executionInterval: options.automation?.executionInterval || 86400000, // 24 hours
        failureRetryAttempts: options.automation?.failureRetryAttempts || 3,
        alerting: options.automation?.alerting || false,
        monitoringEnabled: options.automation?.monitoringEnabled !== false
      }
    };

    // External storage clients
    this.s3Client = options.s3Config ? new S3Client(options.s3Config) : null;
    this.gridFSBucket = null;

    // Operational state management
    this.retentionPolicies = new Map();
    this.executionHistory = [];
    this.activeOperations = new Map();
    this.performanceMetrics = {
      totalRecordsArchived: 0,
      totalStorageSaved: 0,
      averageOperationTime: 0,
      lastExecutionTime: null
    };
  }

  async initialize(dbName) {
    console.log('Initializing MongoDB Data Lifecycle Management system...');

    try {
      await this.client.connect();
      this.db = this.client.db(dbName);

      // Initialize GridFS bucket if needed
      if (this.config.archiveStorage.type === 'gridfs') {
        this.gridFSBucket = new GridFSBucket(this.db, { 
          bucketName: this.config.archiveStorage.location || 'archives' 
        });
      }

      // Setup system collections
      await this.setupSystemCollections();

      // Load existing retention policies
      await this.loadRetentionPolicies();

      // Setup automation if enabled
      if (this.config.automation.scheduledExecution) {
        await this.setupAutomatedExecution();
      }

      console.log('Data lifecycle management system initialized successfully');

    } catch (error) {
      console.error('Error initializing data lifecycle management:', error);
      throw error;
    }
  }

  async setupSystemCollections() {
    console.log('Setting up system collections for data lifecycle management...');

    // Retention policies collection
    const retentionPolicies = this.db.collection('data_retention_policies');
    await retentionPolicies.createIndexes([
      { key: { collection_name: 1 }, unique: true },
      { key: { policy_enabled: 1 } },
      { key: { next_execution: 1 } }
    ]);

    // Archive metadata collection
    const archiveMetadata = this.db.collection('archive_metadata');
    await archiveMetadata.createIndexes([
      { key: { source_collection: 1, archive_date: -1 } },
      { key: { archive_id: 1 }, unique: true },
      { key: { retention_policy_id: 1 } },
      { key: { compliance_status: 1 } }
    ]);

    // Execution audit log
    const executionAudit = this.db.collection('lifecycle_execution_audit');
    await executionAudit.createIndexes([
      { key: { execution_timestamp: -1 } },
      { key: { policy_id: 1, execution_timestamp: -1 } },
      { key: { operation_type: 1 } }
    ]);

    // Legal hold registry (compliance feature)
    if (this.config.compliance.legalHoldSupport) {
      const legalHolds = this.db.collection('legal_hold_registry');
      await legalHolds.createIndexes([
        { key: { hold_id: 1 }, unique: true },
        { key: { affected_collections: 1 } },
        { key: { hold_status: 1 } }
      ]);
    }
  }

  async defineRetentionPolicy(policyConfig) {
    console.log(`Defining retention policy for collection: ${policyConfig.collectionName}`);

    const policy = {
      policy_id: policyConfig.policyId || this.generatePolicyId(),
      collection_name: policyConfig.collectionName,

      // Retention timeline configuration
      retention_phases: {
        active_period_days: policyConfig.activePeriod || 365,
        archive_after_days: policyConfig.archiveAfter || 730,
        delete_after_days: policyConfig.deleteAfter || 2555, // 7 years default

        // Advanced retention phases
        cold_storage_after_days: policyConfig.coldStorageAfter,
        compliance_review_after_days: policyConfig.complianceReviewAfter
      },

      // Data identification and filtering
      date_field: policyConfig.dateField || 'created_at',
      additional_filters: policyConfig.filters || {},
      exclusion_criteria: policyConfig.exclusions || {},

      // Archive configuration
      archive_settings: {
        storage_type: policyConfig.archiveStorage || this.config.archiveStorage.type,
        compression_enabled: policyConfig.compression !== false,
        encryption_required: policyConfig.encryption || false,
        batch_size: policyConfig.batchSize || this.config.performance.batchSize,

        // Performance optimization
        index_hints: policyConfig.indexHints || [],
        sort_optimization: policyConfig.sortField || policyConfig.dateField,
        memory_limit: policyConfig.memoryLimit || '200M'
      },

      // Compliance configuration
      compliance_settings: {
        legal_hold_exempt: policyConfig.legalHoldExempt || false,
        data_classification: policyConfig.dataClassification || 'standard',
        gdpr_applicable: policyConfig.gdprApplicable || false,
        audit_level: policyConfig.auditLevel || 'standard',

        // Data sensitivity handling
        pii_fields: policyConfig.piiFields || [],
        anonymization_rules: policyConfig.anonymizationRules || {}
      },

      // Execution configuration
      execution_settings: {
        policy_enabled: policyConfig.enabled !== false,
        execution_schedule: policyConfig.schedule || '0 2 * * *', // Daily at 2 AM
        max_execution_time_minutes: policyConfig.maxExecutionTime || 120,
        failure_retry_attempts: policyConfig.retryAttempts || 3,
        notification_settings: policyConfig.notifications || {}
      },

      // Metadata and tracking
      policy_metadata: {
        created_by: policyConfig.createdBy || 'system',
        created_at: new Date(),
        last_modified: new Date(),
        policy_version: policyConfig.version || '1.0',
        description: policyConfig.description || '',
        business_justification: policyConfig.businessJustification || ''
      }
    };

    // Store retention policy
    const retentionPolicies = this.db.collection('data_retention_policies');
    await retentionPolicies.replaceOne(
      { collection_name: policy.collection_name },
      policy,
      { upsert: true }
    );

    // Cache policy for operational use
    this.retentionPolicies.set(policy.collection_name, policy);

    console.log(`Retention policy defined successfully for ${policy.collection_name}`);
    return policy.policy_id;
  }

  async executeDataArchiving(collectionName, options = {}) {
    console.log(`Starting data archiving for collection: ${collectionName}`);

    const policy = this.retentionPolicies.get(collectionName);
    if (!policy || !policy.execution_settings.policy_enabled) {
      throw new Error(`No enabled retention policy found for collection: ${collectionName}`);
    }

    const operationId = this.generateOperationId();
    const startTime = Date.now();

    try {
      // Check for legal holds
      if (this.config.compliance.legalHoldSupport) {
        await this.checkLegalHolds(collectionName, policy);
      }

      // Calculate archive cutoff date
      const cutoffDate = new Date();
      cutoffDate.setDate(cutoffDate.getDate() - policy.retention_phases.archive_after_days);

      // Build archive query with optimization; apply $nor only when exclusion
      // criteria are actually defined (an empty {} inside $nor excludes everything)
      const archiveQuery = {
        [policy.date_field]: { $lt: cutoffDate },
        ...policy.additional_filters,
        ...(policy.exclusion_criteria && Object.keys(policy.exclusion_criteria).length > 0
          ? { $nor: [policy.exclusion_criteria] }
          : {})
      };

      // Count records to archive
      const sourceCollection = this.db.collection(collectionName);
      const recordCount = await sourceCollection.countDocuments(archiveQuery);

      if (recordCount === 0) {
        console.log(`No records found for archiving in ${collectionName}`);
        return { success: true, recordsProcessed: 0, operationId };
      }

      console.log(`Found ${recordCount} records to archive from ${collectionName}`);

      // Execute archiving in batches
      const archiveResult = await this.executeBatchArchiving(
        sourceCollection,
        archiveQuery,
        policy,
        operationId,
        options
      );

      // Create archive metadata record
      await this.createArchiveMetadata({
        archive_id: operationId,
        source_collection: collectionName,
        archive_date: new Date(),
        record_count: archiveResult.recordsArchived,
        archive_size: archiveResult.archiveSize,
        policy_id: policy.policy_id,
        archive_location: archiveResult.archiveLocation,
        checksum: archiveResult.checksum,

        compliance_info: {
          legal_hold_checked: this.config.compliance.legalHoldSupport,
          gdpr_compliant: policy.compliance_settings.gdpr_applicable,
          audit_trail: archiveResult.auditTrail
        }
      });

      // Log execution in audit trail
      await this.logExecutionAudit({
        operation_id: operationId,
        operation_type: 'archive',
        collection_name: collectionName,
        policy_id: policy.policy_id,
        execution_timestamp: new Date(),
        records_processed: archiveResult.recordsArchived,
        execution_duration_ms: Date.now() - startTime,
        status: 'success',
        performance_metrics: archiveResult.performanceMetrics
      });

      console.log(`Data archiving completed successfully for ${collectionName}`);
      return {
        success: true,
        operationId,
        recordsArchived: archiveResult.recordsArchived,
        archiveSize: archiveResult.archiveSize,
        executionTime: Date.now() - startTime
      };

    } catch (error) {
      console.error(`Error during data archiving for ${collectionName}:`, error);

      // Log error in audit trail
      await this.logExecutionAudit({
        operation_id: operationId,
        operation_type: 'archive',
        collection_name: collectionName,
        policy_id: policy?.policy_id,
        execution_timestamp: new Date(),
        execution_duration_ms: Date.now() - startTime,
        status: 'error',
        error_message: error.message
      });

      throw error;
    }
  }

  async executeBatchArchiving(sourceCollection, archiveQuery, policy, operationId, options) {
    console.log('Executing batch archiving with performance optimization...');

    const batchSize = policy.archive_settings.batch_size;
    const archiveLocation = await this.prepareArchiveLocation(operationId, policy);

    const operationStartTime = Date.now();  // overall start, used for throughput calculation
    let totalArchived = 0;
    let totalSize = 0;
    let batchNumber = 0;
    const auditTrail = [];
    const performanceMetrics = {
      avgBatchTime: 0,
      maxBatchTime: 0,
      totalBatches: 0,
      throughputRecordsPerSecond: 0
    };

    // Create cursor with optimization hints
    const cursor = sourceCollection.find(archiveQuery)
      .sort({ [policy.archive_settings.sort_optimization]: 1 })
      .batchSize(batchSize);

    // Add index hint if specified
    if (policy.archive_settings.index_hints.length > 0) {
      cursor.hint(policy.archive_settings.index_hints[0]);
    }

    let batch = [];
    let batchStartTime = Date.now();

    for await (const document of cursor) {
      batch.push(document);

      // Process batch when full
      if (batch.length >= batchSize) {
        const batchResult = await this.processBatch(
          batch,
          archiveLocation,
          policy,
          batchNumber,
          operationId
        );

        const batchTime = Date.now() - batchStartTime;
        performanceMetrics.avgBatchTime = (performanceMetrics.avgBatchTime * batchNumber + batchTime) / (batchNumber + 1);
        performanceMetrics.maxBatchTime = Math.max(performanceMetrics.maxBatchTime, batchTime);

        totalArchived += batchResult.recordsProcessed;
        totalSize += batchResult.batchSize;
        batchNumber++;

        auditTrail.push({
          batch_number: batchNumber,
          records_processed: batchResult.recordsProcessed,
          batch_size: batchResult.batchSize,
          processing_time_ms: batchTime
        });

        // Reset batch
        batch = [];
        batchStartTime = Date.now();

        // Throttle to avoid overwhelming the system
        if (this.config.performance.throttleDelayMs > 0) {
          await new Promise(resolve => setTimeout(resolve, this.config.performance.throttleDelayMs));
        }
      }
    }

    // Process final partial batch
    if (batch.length > 0) {
      const batchResult = await this.processBatch(
        batch,
        archiveLocation,
        policy,
        batchNumber,
        operationId
      );

      totalArchived += batchResult.recordsProcessed;
      totalSize += batchResult.batchSize;
      batchNumber++;
    }

    // Calculate final performance metrics
    performanceMetrics.totalBatches = batchNumber;
    performanceMetrics.throughputRecordsPerSecond = totalArchived / ((Date.now() - operationStartTime) / 1000);

    // Generate archive checksum for integrity verification
    const checksum = await this.generateArchiveChecksum(archiveLocation, totalArchived);

    console.log(`Batch archiving completed: ${totalArchived} records in ${batchNumber} batches`);

    return {
      recordsArchived: totalArchived,
      archiveSize: totalSize,
      archiveLocation,
      checksum,
      auditTrail,
      performanceMetrics
    };
  }

  async processBatch(batch, archiveLocation, policy, batchNumber, operationId) {
    const batchStartTime = Date.now();

    // Apply data transformations if needed (PII anonymization, etc.)
    const processedBatch = await this.applyDataTransformations(batch, policy);

    // Store batch based on configured storage type
    let batchSize;
    switch (this.config.archiveStorage.type) {
      case 'mongodb':
        batchSize = await this.storeBatchToMongoDB(processedBatch, archiveLocation);
        break;
      case 'gridfs':
        batchSize = await this.storeBatchToGridFS(processedBatch, archiveLocation, batchNumber);
        break;
      case 's3':
        batchSize = await this.storeBatchToS3(processedBatch, archiveLocation, batchNumber);
        break;
      case 'filesystem':
        batchSize = await this.storeBatchToFileSystem(processedBatch, archiveLocation, batchNumber);
        break;
      default:
        throw new Error(`Unsupported archive storage type: ${this.config.archiveStorage.type}`);
    }

    // Remove archived documents from source collection
    const documentIds = batch.map(doc => doc._id);
    const deleteResult = await this.db.collection(policy.collection_name).deleteMany({
      _id: { $in: documentIds }
    });

    console.log(`Batch ${batchNumber}: archived ${batch.length} records, size: ${batchSize} bytes`);

    return {
      recordsProcessed: batch.length,
      batchSize,
      deletedRecords: deleteResult.deletedCount,
      processingTime: Date.now() - batchStartTime
    };
  }

  async applyDataTransformations(batch, policy) {
    if (!policy.compliance_settings.pii_fields.length && 
        !Object.keys(policy.compliance_settings.anonymization_rules).length) {
      return batch; // No transformations needed
    }

    console.log('Applying data transformations for compliance...');

    return batch.map(document => {
      let processedDoc = { ...document };

      // Apply PII field anonymization
      policy.compliance_settings.pii_fields.forEach(field => {
        if (processedDoc[field]) {
          processedDoc[field] = this.anonymizeField(processedDoc[field], field);
        }
      });

      // Apply custom anonymization rules
      Object.entries(policy.compliance_settings.anonymization_rules).forEach(([field, rule]) => {
        if (processedDoc[field]) {
          processedDoc[field] = this.applyAnonymizationRule(processedDoc[field], rule);
        }
      });

      // Add transformation metadata
      processedDoc._archive_metadata = {
        original_id: document._id,
        archived_at: new Date(),
        transformations_applied: [
          ...policy.compliance_settings.pii_fields.map(field => `pii_anonymization:${field}`),
          ...Object.keys(policy.compliance_settings.anonymization_rules).map(field => `custom_rule:${field}`)
        ]
      };

      return processedDoc;
    });
  }

  async storeBatchToMongoDB(batch, archiveLocation) {
    const archiveCollection = this.db.collection(archiveLocation);
    const insertResult = await archiveCollection.insertMany(batch, { 
      ordered: false,
      writeConcern: { w: 'majority', j: true }
    });

    return JSON.stringify(batch).length; // Approximate size
  }

  async storeBatchToGridFS(batch, archiveLocation, batchNumber) {
    const fileName = `${archiveLocation}_batch_${batchNumber.toString().padStart(6, '0')}.json`;
    const batchData = JSON.stringify(batch);

    // Wait for the GridFS upload to complete so source documents are only
    // removed after the archive copy has been durably written
    const finishUpload = (stream, payload) =>
      new Promise((resolve, reject) => {
        stream.once('error', reject);
        stream.once('finish', resolve);
        stream.end(payload);
      });

    if (this.config.archiveStorage.compression === 'gzip') {
      const compressedData = await promisify(zlib.gzip)(batchData);
      const uploadStream = this.gridFSBucket.openUploadStream(`${fileName}.gz`, {
        metadata: {
          batch_number: batchNumber,
          record_count: batch.length,
          compression: 'gzip',
          archived_at: new Date()
        }
      });

      await finishUpload(uploadStream, compressedData);
      return compressedData.length;
    } else {
      const uploadStream = this.gridFSBucket.openUploadStream(fileName, {
        metadata: {
          batch_number: batchNumber,
          record_count: batch.length,
          archived_at: new Date()
        }
      });

      await finishUpload(uploadStream, Buffer.from(batchData));
      return batchData.length;
    }
  }

  async storeBatchToS3(batch, archiveLocation, batchNumber) {
    if (!this.s3Client) {
      throw new Error('S3 client not configured for archive storage');
    }

    const key = `${archiveLocation}/batch_${batchNumber.toString().padStart(6, '0')}.json`;
    let data = JSON.stringify(batch);

    if (this.config.archiveStorage.compression === 'gzip') {
      data = await promisify(zlib.gzip)(data);
    }

    const putCommand = new PutObjectCommand({
      Bucket: this.config.archiveStorage.location,
      Key: key,
      Body: data,
      ContentType: 'application/json',
      ContentEncoding: this.config.archiveStorage.compression === 'gzip' ? 'gzip' : undefined,
      Metadata: {
        batch_number: batchNumber.toString(),
        record_count: batch.length.toString(),
        archived_at: new Date().toISOString()
      }
    });

    await this.s3Client.send(putCommand);
    return data.length;
  }

  async setupAutomaticDataDeletion(collectionName, options = {}) {
    console.log(`Setting up automatic data deletion for: ${collectionName}`);

    const policy = this.retentionPolicies.get(collectionName);
    if (!policy) {
      throw new Error(`No retention policy found for collection: ${collectionName}`);
    }

    // Use MongoDB TTL index for automatic deletion where possible
    const collection = this.db.collection(collectionName);

    // Create TTL index based on retention policy
    const ttlSeconds = policy.retention_phases.delete_after_days * 24 * 60 * 60;

    try {
      await collection.createIndex(
        { [policy.date_field]: 1 },
        { 
          expireAfterSeconds: ttlSeconds,
          background: true,
          name: `ttl_${policy.date_field}_${ttlSeconds}s`
        }
      );

      console.log(`TTL index created for automatic deletion: ${ttlSeconds} seconds`);

      // Update policy to track TTL index usage
      await this.db.collection('data_retention_policies').updateOne(
        { collection_name: collectionName },
        { 
          $set: { 
            'deletion_settings.ttl_enabled': true,
            'deletion_settings.ttl_seconds': ttlSeconds,
            'deletion_settings.ttl_field': policy.date_field
          }
        }
      );

      return { success: true, ttlSeconds, indexName: `ttl_${policy.date_field}_${ttlSeconds}s` };

    } catch (error) {
      console.error('Error setting up TTL index:', error);
      throw error;
    }
  }

  async retrieveArchivedData(archiveId, query = {}, options = {}) {
    console.log(`Retrieving archived data for archive ID: ${archiveId}`);

    // Get archive metadata
    const archiveMetadata = await this.db.collection('archive_metadata')
      .findOne({ archive_id: archiveId });

    if (!archiveMetadata) {
      throw new Error(`Archive not found: ${archiveId}`);
    }

    const { limit = 100, skip = 0, projection = {} } = options;
    let retrievedData = [];

    // Retrieve data based on storage type
    switch (this.config.archiveStorage.type) {
      case 'mongodb':
        const archiveCollection = this.db.collection(archiveMetadata.archive_location);
        retrievedData = await archiveCollection
          .find(query, { projection })
          .skip(skip)
          .limit(limit)
          .toArray();
        break;

      case 'gridfs':
        retrievedData = await this.retrieveFromGridFS(archiveMetadata, query, options);
        break;

      case 's3':
        retrievedData = await this.retrieveFromS3(archiveMetadata, query, options);
        break;

      default:
        throw new Error(`Archive retrieval not supported for storage type: ${this.config.archiveStorage.type}`);
    }

    // Log retrieval for audit purposes
    await this.logExecutionAudit({
      operation_id: this.generateOperationId(),
      operation_type: 'retrieve',
      archive_id: archiveId,
      execution_timestamp: new Date(),
      records_retrieved: retrievedData.length,
      retrieval_query: query,
      status: 'success'
    });

    return {
      archiveMetadata,
      data: retrievedData,
      totalRecords: archiveMetadata.record_count,
      retrievedCount: retrievedData.length
    };
  }

  async generateComplianceReport(collectionName, options = {}) {
    console.log(`Generating compliance report for: ${collectionName}`);

    const {
      startDate = new Date(Date.now() - 365 * 24 * 60 * 60 * 1000), // 1 year ago
      endDate = new Date(),
      includeMetrics = true,
      includeAuditTrail = true
    } = options;

    const policy = this.retentionPolicies.get(collectionName);
    if (!policy) {
      throw new Error(`No retention policy found for collection: ${collectionName}`);
    }

    // Collect compliance data
    const complianceData = {
      collection_name: collectionName,
      policy_id: policy.policy_id,
      report_generated_at: new Date(),
      reporting_period: { start: startDate, end: endDate },

      // Policy compliance status
      policy_compliance: {
        policy_enabled: policy.execution_settings.policy_enabled,
        gdpr_compliant: policy.compliance_settings.gdpr_applicable,
        legal_hold_support: this.config.compliance.legalHoldSupport,
        audit_level: policy.compliance_settings.audit_level
      },

      // Archive operations summary
      archive_summary: await this.getArchiveSummary(collectionName, startDate, endDate),

      // Current data status
      data_status: await this.getCurrentDataStatus(collectionName, policy)
    };

    if (includeMetrics) {
      complianceData.performance_metrics = await this.getPerformanceMetrics(collectionName, startDate, endDate);
    }

    if (includeAuditTrail) {
      complianceData.audit_trail = await this.getAuditTrail(collectionName, startDate, endDate);
    }

    // Check for any compliance issues
    complianceData.compliance_issues = await this.identifyComplianceIssues(collectionName, policy);

    return complianceData;
  }

  async loadRetentionPolicies() {
    const policies = await this.db.collection('data_retention_policies')
      .find({ 'execution_settings.policy_enabled': true })
      .toArray();

    policies.forEach(policy => {
      this.retentionPolicies.set(policy.collection_name, policy);
    });

    console.log(`Loaded ${policies.length} retention policies`);
  }

  generatePolicyId() {
    return `policy_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`;
  }

  generateOperationId() {
    return `op_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`;
  }

  anonymizeField(value, fieldType) {
    // Simple anonymization - in production, use proper anonymization libraries
    if (typeof value === 'string') {
      if (fieldType.includes('email')) {
        return 'anonymized@example.com';
      } else if (fieldType.includes('name')) {
        return 'ANONYMIZED';
      } else {
        return '***REDACTED***';
      }
    }
    return null;
  }

  async createArchiveMetadata(metadata) {
    return await this.db.collection('archive_metadata').insertOne(metadata);
  }

  async logExecutionAudit(auditRecord) {
    if (this.config.compliance.auditLogging) {
      return await this.db.collection('lifecycle_execution_audit').insertOne(auditRecord);
    }
  }
}

// Benefits of MongoDB Data Lifecycle Management:
// - Automated retention policy enforcement with minimal manual intervention
// - Flexible storage tiering supporting MongoDB, GridFS, S3, and filesystem storage
// - Built-in compliance features including legal hold support and audit trails  
// - Performance-optimized batch processing with throttling and memory management
// - Comprehensive data transformation capabilities for PII protection and anonymization
// - TTL index integration for automatic deletion without application logic
// - Real-time monitoring and alerting for policy execution and compliance status
// - Scalable architecture supporting large-scale data archiving operations
// - Integrated backup and recovery capabilities for archived data
// - SQL-compatible lifecycle management operations through QueryLeaf integration

module.exports = {
  MongoDataLifecycleManager
};
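
To tie the class together, here is a brief usage sketch. The connection string, database name, module path, and policy values are placeholders, and the calls exercise only the public methods defined above (initialize, defineRetentionPolicy, executeDataArchiving, setupAutomaticDataDeletion); helpers they rely on, such as prepareArchiveLocation and generateArchiveChecksum, are not shown in this excerpt.

// Example wiring of the lifecycle manager defined above (all values are illustrative)
const { MongoDataLifecycleManager } = require('./mongo-data-lifecycle-manager'); // assumed module path

async function runOrderArchiving() {
  const manager = new MongoDataLifecycleManager('mongodb://localhost:27017', {
    archiveStorage: { type: 'gridfs', location: 'order_archives', compression: 'gzip' },
    performance: { batchSize: 1000, throttleDelayMs: 10 },
    compliance: { auditLogging: true, legalHoldSupport: true }
  });

  await manager.initialize('ecommerce');  // assumed database name

  // Register a policy: archive orders older than 2 years, delete after 7
  await manager.defineRetentionPolicy({
    collectionName: 'orders',
    dateField: 'order_date',
    archiveAfter: 730,
    deleteAfter: 2555,
    filters: { status: { $in: ['completed', 'shipped', 'delivered'] } }
  });

  // Run one archiving pass and let a TTL index handle eventual deletion
  const result = await manager.executeDataArchiving('orders');
  console.log(`Archived ${result.recordsArchived} documents in ${result.executionTime} ms`);

  await manager.setupAutomaticDataDeletion('orders');
}

runOrderArchiving().catch(console.error);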

Understanding MongoDB Data Lifecycle Architecture

Advanced Archiving Strategies and Compliance Management

Implement sophisticated data lifecycle policies with enterprise-grade compliance and automation:

// Production-ready data lifecycle automation with enterprise compliance features
class EnterpriseDataLifecycleManager extends MongoDataLifecycleManager {
  constructor(connectionUri, enterpriseConfig) {
    super(connectionUri, enterpriseConfig);

    this.enterpriseFeatures = {
      // Advanced compliance management
      complianceIntegration: {
        gdprAutomation: true,
        legalHoldWorkflows: true,
        auditTrailEncryption: true,
        regulatoryReporting: true,
        dataSubjectRequests: true
      },

      // Enterprise storage integration
      storageIntegration: {
        multiTierStorage: true,
        cloudStorageIntegration: true,
        compressionOptimization: true,
        encryptionAtRest: true,
        geographicReplication: true
      },

      // Advanced automation
      automationCapabilities: {
        mlPredictiveArchiving: true,
        workloadOptimization: true,
        costOptimization: true,
        capacityPlanning: true,
        performanceTuning: true
      }
    };

    this.initializeEnterpriseFeatures();
  }

  async implementIntelligentArchiving(collectionName, options = {}) {
    console.log('Implementing intelligent archiving with machine learning optimization...');

    const archivingStrategy = {
      // Predictive analysis for optimal archiving timing
      predictiveModeling: {
        accessPatternAnalysis: true,
        queryFrequencyPrediction: true,
        storageOptimization: true,
        performanceImpactMinimization: true
      },

      // Cost-optimized storage tiering
      costOptimization: {
        automaticTierSelection: true,
        compressionOptimization: true,
        geographicOptimization: true,
        providerOptimization: true
      },

      // Performance-aware archiving
      performanceOptimization: {
        nonBlockingArchiving: true,
        priorityBasedProcessing: true,
        resourceThrottling: true,
        systemImpactMinimization: true
      }
    };

    return await this.deployIntelligentArchiving(collectionName, archivingStrategy, options);
  }

  async setupAdvancedComplianceWorkflows(complianceConfig) {
    console.log('Setting up advanced compliance workflows...');

    const complianceWorkflows = {
      // GDPR compliance automation
      gdprCompliance: {
        dataSubjectRequestHandling: true,
        rightToErasureAutomation: true,
        dataPortabilitySupport: true,
        consentManagement: true,
        breachNotificationIntegration: true
      },

      // Industry-specific compliance
      industryCompliance: {
        soxCompliance: complianceConfig.sox || false,
        hipaaCompliance: complianceConfig.hipaa || false,
        pciDssCompliance: complianceConfig.pciDss || false,
        iso27001Compliance: complianceConfig.iso27001 || false
      },

      // Legal hold management
      legalHoldManagement: {
        automaticHoldEnforcement: true,
        holdNotificationWorkflows: true,
        custodyChainTracking: true,
        releaseAutomation: true
      }
    };

    return await this.deployComplianceWorkflows(complianceWorkflows);
  }
}

SQL-Style Data Lifecycle Management with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB data archiving and lifecycle management:

-- QueryLeaf data lifecycle management with SQL-familiar patterns for MongoDB

-- Define comprehensive data retention policy with advanced features
CREATE RETENTION_POLICY order_data_lifecycle AS (
  -- Target collection and identification
  collection_name = 'orders',
  policy_enabled = true,

  -- Retention phases with flexible timing
  active_retention_days = 365,     -- Keep in active storage for 1 year
  archive_after_days = 730,        -- Archive after 2 years
  cold_storage_after_days = 1825,  -- Move to cold storage after 5 years  
  delete_after_days = 2555,        -- Delete after 7 years (regulatory requirement)

  -- Data identification and filtering
  date_field = 'order_date',
  additional_filters = JSON_BUILD_OBJECT(
    'status', JSON_BUILD_ARRAY('completed', 'shipped', 'delivered'),
    'total_amount', JSON_BUILD_OBJECT('$gt', 0)
  ),

  -- Exclude from archiving (VIP customers, ongoing disputes, etc.)
  exclusion_criteria = JSON_BUILD_OBJECT(
    '$or', JSON_BUILD_ARRAY(
      JSON_BUILD_OBJECT('customer_tier', 'vip'),
      JSON_BUILD_OBJECT('dispute_status', 'active'),
      JSON_BUILD_OBJECT('legal_hold', true)
    )
  ),

  -- Archive storage configuration
  archive_storage_type = 'gridfs',
  compression_enabled = true,
  encryption_required = false,
  batch_size = 1000,

  -- Performance optimization
  index_hints = JSON_BUILD_ARRAY('order_date_status_idx', 'customer_id_idx'),
  sort_field = 'order_date',
  memory_limit = '512M',
  max_execution_time_minutes = 180,

  -- Compliance settings
  gdpr_applicable = true,
  legal_hold_exempt = false,
  audit_level = 'detailed',
  pii_fields = JSON_BUILD_ARRAY('customer_email', 'billing_address', 'shipping_address'),

  -- Automation configuration
  execution_schedule = '0 2 * * 0',  -- Weekly on Sunday at 2 AM
  failure_retry_attempts = 3,
  notification_enabled = true,

  -- Business metadata
  business_justification = 'Regulatory compliance and performance optimization',
  data_owner = 'sales_operations_team',
  policy_version = '2.1'
);

-- Advanced customer data retention with PII protection
CREATE RETENTION_POLICY customer_data_lifecycle AS (
  collection_name = 'customers',
  policy_enabled = true,

  -- GDPR-compliant retention periods
  active_retention_days = 1095,    -- 3 years active retention
  archive_after_days = 1825,       -- Archive after 5 years  
  delete_after_days = 2555,        -- Delete after 7 years

  date_field = 'last_activity_date',

  -- PII anonymization before archiving
  pii_protection = JSON_BUILD_OBJECT(
    'anonymize_before_archive', true,
    'pii_fields', JSON_BUILD_ARRAY(
      'email', 'phone', 'address', 'birth_date', 'social_security_number'
    ),
    'anonymization_method', 'hash_with_salt'
  ),

  -- Data subject request handling
  gdpr_compliance = JSON_BUILD_OBJECT(
    'right_to_erasure_enabled', true,
    'data_portability_enabled', true,
    'consent_tracking_required', true,
    'processing_lawfulness_basis', 'legitimate_interest'
  ),

  archive_storage_type = 's3',
  s3_configuration = JSON_BUILD_OBJECT(
    'bucket', 'customer-data-archives',
    'storage_class', 'STANDARD_IA',
    'encryption', 'AES256'
  )
);

-- Execute data archiving with comprehensive monitoring
WITH archiving_execution AS (
  SELECT 
    collection_name,
    policy_id,

    -- Calculate records eligible for archiving
    (SELECT COUNT(*) 
     FROM orders 
     WHERE order_date < CURRENT_DATE - INTERVAL '2 years'
       AND status IN ('completed', 'shipped', 'delivered')
       AND total_amount > 0
       AND NOT (customer_tier = 'vip' OR dispute_status = 'active' OR legal_hold = true)
    ) as eligible_records,

    -- Estimate archive size and processing time
    (SELECT 
       ROUND(AVG(LENGTH(to_jsonb(o)::text))::numeric, 0) * COUNT(*) / 1024 / 1024
     FROM orders o 
     WHERE order_date < CURRENT_DATE - INTERVAL '2 years'
    ) as estimated_archive_size_mb,

    -- Performance projections
    CASE 
      WHEN eligible_records > 100000 THEN 'large_dataset_optimization_required'
      WHEN eligible_records > 10000 THEN 'standard_optimization_recommended'
      ELSE 'minimal_optimization_needed'
    END as performance_category,

    -- Compliance checks
    CASE 
      WHEN EXISTS (
        SELECT 1 FROM legal_hold_registry 
        WHERE collection_name = 'orders' 
        AND hold_status = 'active'
      ) THEN 'legal_hold_active_check_required'
      ELSE 'cleared_for_archiving'
    END as compliance_status

  FROM data_retention_policies 
  WHERE collection_name = 'orders' 
    AND policy_enabled = true
),

-- Execute archiving with batch processing and monitoring
archiving_results AS (
  EXECUTE_ARCHIVING(
    collection_name => 'orders',

    -- Batch processing configuration
    batch_processing => JSON_BUILD_OBJECT(
      'batch_size', 1000,
      'max_concurrent_batches', 3,
      'throttle_delay_ms', 10,
      'memory_limit_per_batch', '100M'
    ),

    -- Performance optimization
    performance_options => JSON_BUILD_OBJECT(
      'use_index_hints', true,
      'parallel_processing', true,
      'compression_level', 'standard',
      'checksum_validation', true
    ),

    -- Archive destination
    archive_destination => JSON_BUILD_OBJECT(
      'storage_type', 'gridfs',
      'bucket_name', 'order_archives',
      'naming_pattern', 'orders_archive_{year}_{month}_{batch}',
      'metadata_tags', JSON_BUILD_OBJECT(
        'department', 'sales',
        'retention_policy', 'order_data_lifecycle',
        'compliance_level', 'standard'
      )
    ),

    -- Compliance and audit settings
    compliance_settings => JSON_BUILD_OBJECT(
      'audit_logging', 'detailed',
      'pii_anonymization', false,  -- Orders don't contain direct PII
      'legal_hold_check', true,
      'gdpr_processing_log', true
    )
  )
)

SELECT 
  ae.collection_name,
  ae.eligible_records,
  ae.estimated_archive_size_mb,
  ae.performance_category,
  ae.compliance_status,

  -- Archiving execution results
  ar.operation_id,
  ar.records_archived,
  ar.archive_size_actual_mb,
  ar.execution_time_seconds,
  ar.batches_processed,

  -- Performance metrics
  ROUND(ar.records_archived::numeric / ar.execution_time_seconds, 2) as records_per_second,
  ROUND(ar.archive_size_actual_mb::numeric / ar.execution_time_seconds, 3) as mb_per_second,

  -- Compliance verification
  ar.compliance_checks_passed,
  ar.audit_trail_id,
  ar.archive_location,
  ar.checksum_verified,

  -- Success indicators
  CASE 
    WHEN ar.records_archived = ae.eligible_records THEN 'complete_success'
    WHEN ar.records_archived > ae.eligible_records * 0.95 THEN 'successful_with_minor_issues'
    WHEN ar.records_archived > 0 THEN 'partial_success_requires_review'
    ELSE 'failed_requires_investigation'
  END as execution_status,

  -- Recommendations for optimization
  CASE 
    WHEN ar.records_archived::numeric / NULLIF(ar.execution_time_seconds, 0) < 10 THEN 'consider_batch_size_increase'
    WHEN ar.execution_time_seconds > 3600 THEN 'consider_parallel_processing_increase'
    WHEN ar.archive_size_actual_mb > ae.estimated_archive_size_mb * 1.5 THEN 'investigate_compression_efficiency'
    ELSE 'performance_within_expected_parameters'
  END as optimization_recommendation

FROM archiving_execution ae
CROSS JOIN archiving_results ar;

-- Monitor archiving operations with real-time dashboard
WITH current_archiving_operations AS (
  SELECT 
    operation_id,
    collection_name,
    policy_id,
    operation_type,
    started_at,

    -- Progress tracking
    records_processed,
    estimated_total_records,
    ROUND((records_processed::numeric / estimated_total_records) * 100, 1) as progress_percentage,

    -- Performance monitoring
    EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - started_at) as elapsed_seconds,
    ROUND(records_processed::numeric / NULLIF(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - started_at), 0), 2) as current_throughput,

    -- Resource utilization
    memory_usage_mb,
    cpu_utilization_percent,
    io_operations_per_second,

    -- Status indicators
    operation_status,
    error_count,
    last_error_message,

    -- ETA calculation
    CASE 
      WHEN records_processed > 0 AND operation_status = 'running' THEN
        CURRENT_TIMESTAMP + 
        (INTERVAL '1 second' * 
         ((estimated_total_records - records_processed) / 
          (records_processed::numeric / EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - started_at))))
      ELSE NULL
    END as estimated_completion_time

  FROM lifecycle_operation_status
  WHERE operation_status IN ('running', 'paused', 'starting')
),

-- Historical performance analysis
archiving_performance_trends AS (
  SELECT 
    DATE_TRUNC('day', execution_timestamp) as execution_date,
    collection_name,

    -- Daily aggregated metrics
    COUNT(*) as operations_executed,
    SUM(records_processed) as total_records_archived,
    AVG(execution_duration_seconds) as avg_execution_time,
    AVG(records_processed::numeric / execution_duration_seconds) as avg_throughput,

    -- Success rate tracking
    COUNT(*) FILTER (WHERE status = 'success') as successful_operations,
    ROUND(
      (COUNT(*) FILTER (WHERE status = 'success')::numeric / COUNT(*)) * 100, 1
    ) as success_rate_percent,

    -- Resource efficiency metrics
    AVG(archive_size_mb::numeric / execution_duration_seconds) as avg_mb_per_second,
    AVG(memory_peak_usage_mb) as avg_peak_memory_usage,

    -- Trend indicators
    LAG(SUM(records_processed)) OVER (
      PARTITION BY collection_name 
      ORDER BY DATE_TRUNC('day', execution_timestamp)
    ) as previous_day_records,

    LAG(AVG(records_processed::numeric / execution_duration_seconds)) OVER (
      PARTITION BY collection_name
      ORDER BY DATE_TRUNC('day', execution_timestamp)  
    ) as previous_day_throughput

  FROM lifecycle_execution_audit
  WHERE execution_timestamp >= CURRENT_DATE - INTERVAL '30 days'
    AND operation_type = 'archive'
  GROUP BY DATE_TRUNC('day', execution_timestamp), collection_name
),

-- Data retention compliance dashboard
retention_compliance_status AS (
  SELECT 
    drp.collection_name,
    drp.policy_id,
    drp.policy_enabled,

    -- Current data status
    (SELECT COUNT(*) 
     FROM dynamic_collection_query(drp.collection_name)) as active_record_count,

    -- Retention phase analysis
    CASE 
      WHEN drp.active_retention_days IS NOT NULL THEN
        (SELECT COUNT(*) 
         FROM dynamic_collection_query(drp.collection_name)
         WHERE date_field_value < CURRENT_DATE - (drp.active_retention_days || ' days')::INTERVAL)
      ELSE 0
    END as records_past_active_retention,

    CASE 
      WHEN drp.archive_after_days IS NOT NULL THEN
        (SELECT COUNT(*) 
         FROM dynamic_collection_query(drp.collection_name)
         WHERE date_field_value < CURRENT_DATE - (drp.archive_after_days || ' days')::INTERVAL)
      ELSE 0
    END as records_ready_for_archive,

    CASE 
      WHEN drp.delete_after_days IS NOT NULL THEN
        (SELECT COUNT(*) 
         FROM dynamic_collection_query(drp.collection_name)
         WHERE date_field_value < CURRENT_DATE - (drp.delete_after_days || ' days')::INTERVAL)
      ELSE 0
    END as records_past_deletion_date,

    -- Compliance indicators
    CASE 
      WHEN records_past_deletion_date > 0 THEN 'non_compliant_immediate_attention'
      WHEN records_ready_for_archive > 10000 THEN 'compliance_risk_action_needed'
      WHEN records_past_active_retention > active_record_count * 0.3 THEN 'optimization_opportunity'
      ELSE 'compliant'
    END as compliance_status,

    -- Archive statistics
    (SELECT COUNT(*) FROM archive_metadata WHERE source_collection = drp.collection_name) as total_archives_created,
    (SELECT SUM(record_count) FROM archive_metadata WHERE source_collection = drp.collection_name) as total_records_archived,
    (SELECT MAX(archive_date) FROM archive_metadata WHERE source_collection = drp.collection_name) as last_archive_date,

    -- Next scheduled execution
    drp.next_execution_scheduled,
    ROUND(EXTRACT(EPOCH FROM drp.next_execution_scheduled - CURRENT_TIMESTAMP) / 3600, 1) as hours_until_next_execution

  FROM data_retention_policies drp
  WHERE drp.policy_enabled = true
)

SELECT 
  -- Current operations status
  'ACTIVE_OPERATIONS' as section,
  JSON_AGG(
    JSON_BUILD_OBJECT(
      'operation_id', cao.operation_id,
      'collection', cao.collection_name,
      'progress', cao.progress_percentage || '%',
      'throughput', cao.current_throughput || ' rec/sec',
      'eta', cao.estimated_completion_time,
      'status', cao.operation_status
    )
  ) as current_operations

FROM current_archiving_operations cao
WHERE cao.operation_status = 'running'

UNION ALL

SELECT 
  -- Performance trends
  'PERFORMANCE_TRENDS' as section,
  JSON_AGG(
    JSON_BUILD_OBJECT(
      'date', apt.execution_date,
      'collection', apt.collection_name,
      'records_archived', apt.total_records_archived,
      'avg_throughput', apt.avg_throughput || ' rec/sec',
      'success_rate', apt.success_rate_percent || '%',
      'trend', CASE 
        WHEN apt.avg_throughput > apt.previous_day_throughput * 1.1 THEN 'improving'
        WHEN apt.avg_throughput < apt.previous_day_throughput * 0.9 THEN 'declining'
        ELSE 'stable'
      END
    )
  ) as performance_data

FROM archiving_performance_trends apt
WHERE apt.execution_date >= CURRENT_DATE - INTERVAL '7 days'

UNION ALL

SELECT 
  -- Compliance status
  'COMPLIANCE_STATUS' as section,
  JSON_AGG(
    JSON_BUILD_OBJECT(
      'collection', rcs.collection_name,
      'compliance_status', rcs.compliance_status,
      'active_records', rcs.active_record_count,
      'ready_for_archive', rcs.records_ready_for_archive,
      'past_deletion_date', rcs.records_past_deletion_date,
      'last_archive', rcs.last_archive_date,
      'next_execution', rcs.hours_until_next_execution || ' hours',
      'total_archived', rcs.total_records_archived
    )
  ) as compliance_data

FROM retention_compliance_status rcs;

-- Advanced archive data retrieval with query optimization
WITH archive_query_optimization AS (
  SELECT 
    archive_id,
    source_collection,
    archive_date,
    record_count,
    archive_size_mb,
    storage_type,
    archive_location,

    -- Query complexity assessment
    CASE 
      WHEN record_count > 1000000 THEN 'complex_query_optimization_required'
      WHEN record_count > 100000 THEN 'standard_optimization_recommended'  
      ELSE 'direct_query_suitable'
    END as query_complexity,

    -- Storage access strategy
    CASE storage_type
      WHEN 'mongodb' THEN 'direct_collection_access'
      WHEN 'gridfs' THEN 'streaming_batch_retrieval'
      WHEN 's3' THEN 'cloud_storage_download_and_parse'
      ELSE 'custom_retrieval_strategy'
    END as retrieval_strategy

  FROM archive_metadata
  WHERE source_collection = 'orders'
    AND archive_date >= CURRENT_DATE - INTERVAL '1 year'
)

-- Execute optimized archive data retrieval
SELECT 
  RETRIEVE_ARCHIVED_DATA(
    archive_id => aqo.archive_id,

    -- Query parameters
    query_filter => JSON_BUILD_OBJECT(
      'customer_id', '507f1f77bcf86cd799439011',
      'total_amount', JSON_BUILD_OBJECT('$gte', 100),
      'order_date', JSON_BUILD_OBJECT(
        '$gte', '2023-01-01',
        '$lte', '2023-12-31'
      )
    ),

    -- Retrieval optimization
    retrieval_options => JSON_BUILD_OBJECT(
      'batch_size', CASE 
        WHEN aqo.query_complexity = 'complex_query_optimization_required' THEN 100
        WHEN aqo.query_complexity = 'standard_optimization_recommended' THEN 500
        ELSE 1000
      END,
      'parallel_processing', aqo.query_complexity != 'direct_query_suitable',
      'result_streaming', aqo.record_count > 10000,
      'compression_handling', 'automatic'
    ),

    -- Performance settings
    performance_limits => JSON_BUILD_OBJECT(
      'max_execution_time_seconds', 300,
      'memory_limit_mb', 256,
      'max_results', 10000
    )
  ) as retrieval_results

FROM archive_query_optimization aqo
WHERE aqo.archive_id IN (
  SELECT archive_id 
  FROM archive_metadata 
  WHERE source_collection = 'orders'
  ORDER BY archive_date DESC 
  LIMIT 5
);

-- QueryLeaf data lifecycle management features:
-- 1. SQL-familiar syntax for MongoDB data retention policy definition
-- 2. Automated archiving execution with batch processing and performance optimization
-- 3. Comprehensive compliance management including GDPR, legal holds, and audit trails
-- 4. Real-time monitoring dashboard for archiving operations and performance metrics
-- 5. Advanced archive data retrieval with query optimization and result streaming
-- 6. Intelligent data lifecycle automation with predictive analysis capabilities
-- 7. Multi-tier storage integration supporting MongoDB, GridFS, S3, and custom storage
-- 8. Performance-aware processing with resource throttling and system impact minimization
-- 9. Enterprise compliance workflows with automated reporting and alert generation
-- 10. Cost optimization strategies with intelligent storage tiering and compression

Best Practices for MongoDB Data Lifecycle Management

Archiving Strategy Design

Essential principles for effective MongoDB data archiving and lifecycle management:

  1. Policy-Driven Approach: Define comprehensive retention policies based on business requirements, regulatory compliance, and performance optimization goals (a minimal policy and TTL-index sketch follows this list)
  2. Performance Optimization: Implement batch processing, indexing strategies, and resource throttling to minimize impact on production systems
  3. Compliance Integration: Build automated compliance workflows that address regulatory requirements like GDPR, HIPAA, and industry-specific standards
  4. Storage Optimization: Utilize multi-tier storage strategies with compression, encryption, and geographic distribution for cost and performance optimization
  5. Monitoring and Alerting: Deploy comprehensive monitoring systems that track archiving performance, compliance status, and operational health
  6. Recovery Planning: Design archive retrieval processes that support both routine access and emergency recovery scenarios
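
The sketch below shows what a policy-driven setup can look like in practice: a policy document that archiving jobs and auditors can read, plus a native TTL index for the simplest "delete after N days" tier. The data_retention_policies and events collection names, field names, and retention periods are illustrative assumptions, not MongoDB or QueryLeaf built-ins; only the TTL index itself is a native MongoDB feature.

// Minimal sketch: record a retention policy document and enforce a simple
// expiry tier with a native TTL index. Collection and field names below are
// illustrative assumptions.
const { MongoClient } = require('mongodb');

async function definePolicyAndTtl() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('lifecycle_demo');

  // 1. Store the policy so archiving jobs and auditors can read it
  await db.collection('data_retention_policies').updateOne(
    { collectionName: 'events' },
    {
      $set: {
        policyEnabled: true,
        activeRetentionDays: 90,    // keep hot for 90 days
        archiveAfterDays: 365,      // move to cheaper storage after 1 year
        deleteAfterDays: 2555,      // hard delete after ~7 years
        dateField: 'createdAt',
        policyVersion: '1.0'
      }
    },
    { upsert: true }
  );

  // 2. For the simplest "delete after N days" tier, a native TTL index suffices:
  //    MongoDB removes documents once createdAt is older than expireAfterSeconds
  await db.collection('events').createIndex(
    { createdAt: 1 },
    { expireAfterSeconds: 90 * 24 * 60 * 60 }
  );

  await client.close();
}

definePolicyAndTtl().catch(console.error);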

Production Deployment Strategies

Optimize MongoDB data lifecycle management for enterprise-scale requirements:

  1. Automated Execution: Implement scheduled archiving processes with intelligent failure recovery and retry mechanisms (see the batch-archiving sketch after this list)
  2. Resource Management: Configure memory limits, CPU throttling, and I/O optimization to prevent system impact during archiving operations
  3. Compliance Automation: Deploy automated compliance reporting, audit trail generation, and regulatory requirement enforcement
  4. Cost Optimization: Implement intelligent storage tiering that automatically moves data to appropriate storage classes based on access patterns
  5. Performance Monitoring: Monitor archiving throughput, resource utilization, and system performance to optimize operations
  6. Security Integration: Ensure data encryption, access controls, and audit logging meet enterprise security requirements
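
As a rough illustration of what such an automated pass can look like, here is a sketch of a scheduled archiving job that copies old documents to an archive collection in small batches and throttles between batches. The orders and orders_archive collection names, cutoff, batch size, and delay are illustrative assumptions rather than a prescribed implementation.

// Sketch of one archiving pass: copy documents older than the cutoff into an
// archive collection in small batches, delete each copied batch, and pause
// between batches to limit impact on the production workload.
const { MongoClient } = require('mongodb');

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runArchivingPass({ batchSize = 1000, throttleMs = 50 } = {}) {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('lifecycle_demo');
  const cutoff = new Date(Date.now() - 2 * 365 * 24 * 60 * 60 * 1000); // ~2 years ago

  let archivedTotal = 0;
  try {
    while (true) {
      const batch = await db.collection('orders')
        .find({ orderDate: { $lt: cutoff }, legalHold: { $ne: true } })
        .limit(batchSize)
        .toArray();
      if (batch.length === 0) break;

      // A production job would also make this idempotent (e.g. tolerate
      // duplicate _id errors when a failed pass is retried)
      await db.collection('orders_archive').insertMany(batch);
      await db.collection('orders').deleteMany({
        _id: { $in: batch.map((doc) => doc._id) }
      });

      archivedTotal += batch.length;
      await sleep(throttleMs); // throttle between batches
    }
  } finally {
    await client.close();
  }
  return archivedTotal;
}

// Typically triggered by cron or a scheduler (e.g. weekly, off-peak)
runArchivingPass().then((n) => console.log(`Archived ${n} documents`)).catch(console.error);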

Conclusion

MongoDB data lifecycle management provides comprehensive capabilities for automated data archiving, compliance enforcement, and performance optimization that scale from simple retention policies to enterprise-wide governance programs. The flexible document-based architecture and built-in lifecycle features enable sophisticated archiving strategies that adapt to changing business requirements while maintaining operational efficiency.

Key MongoDB Data Lifecycle Management benefits include:

  • Automated Governance: Policy-driven data lifecycle management with minimal manual intervention and maximum compliance assurance
  • Performance Optimization: Intelligent archiving processes that maintain production system performance while managing large-scale data movement
  • Compliance Excellence: Built-in support for regulatory requirements including GDPR, industry standards, and legal hold management
  • Cost Efficiency: Multi-tier storage strategies with automated optimization that reduce storage costs while maintaining data accessibility
  • Operational Simplicity: Streamlined management processes that reduce administrative overhead while ensuring data governance
  • Scalable Architecture: Enterprise-ready capabilities that support growing data volumes and evolving compliance requirements

Whether you're building regulatory compliance systems, optimizing database performance, managing storage costs, or implementing enterprise data governance, MongoDB's data lifecycle management capabilities with QueryLeaf's familiar SQL interface provide the foundation for comprehensive, automated data archiving at scale.

QueryLeaf Integration: QueryLeaf automatically translates SQL-style data lifecycle management commands into optimized MongoDB operations, providing familiar retention policy syntax, archiving execution commands, and compliance reporting queries. Advanced lifecycle management patterns, performance optimization, and regulatory compliance workflows are seamlessly accessible through familiar SQL constructs, making sophisticated data governance both powerful and approachable for SQL-oriented operations teams.

The combination of MongoDB's flexible data lifecycle capabilities with SQL-style governance operations makes it an ideal platform for modern data management applications that require both comprehensive archiving functionality and operational simplicity. With that foundation in place, data governance programs can scale efficiently while meeting evolving regulatory and business requirements.

MongoDB GridFS File Storage Management: Advanced Strategies for Large File Handling, Streaming, and Content Distribution with SQL-Style File Operations

Modern applications require sophisticated file storage solutions that can handle large media files, document repositories, streaming content, and complex file management workflows while maintaining high performance, scalability, and reliability across distributed systems. Traditional file storage approaches often struggle with large file limitations, metadata management complexity, and the challenges of integrating file operations with database transactions, leading to performance bottlenecks, storage inefficiencies, and operational complexity in production environments.

MongoDB GridFS provides comprehensive large file storage through intelligent file chunking, rich metadata management, and tight integration with MongoDB's document database features. Applications can store, retrieve, and stream files far beyond the 16 MB BSON document limit while inheriting MongoDB's replication, consistency guarantees, and operational tooling. Unlike traditional file systems that impose size limitations and separate file metadata from database operations, GridFS integrates file storage directly with MongoDB's query engine, indexing capabilities, and replication features. (A minimal sketch of the underlying collections follows.)
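
To make the chunking model concrete, the short sketch below uploads a small payload and then reads the two collections GridFS maintains per bucket (the default bucket is named fs, with 255 KB chunks unless overridden); the database name and payload are illustrative.

// Minimal sketch: upload a small buffer, then inspect the fs.files metadata
// document and one fs.chunks document that GridFS created for it.
const { MongoClient, GridFSBucket } = require('mongodb');
const { Readable } = require('stream');

async function inspectGridFsCollections() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('gridfs_demo');
  const bucket = new GridFSBucket(db); // default bucket "fs", 255 KB chunks

  // Upload a small payload; metadata is free-form application data
  await new Promise((resolve, reject) => {
    Readable.from([Buffer.from('hello gridfs')])
      .pipe(bucket.openUploadStream('hello.txt', { metadata: { category: 'demo' } }))
      .on('finish', resolve)
      .on('error', reject);
  });

  // fs.files: one metadata document per file (length, chunkSize, uploadDate, filename, metadata)
  console.log(await db.collection('fs.files').findOne({ filename: 'hello.txt' }));

  // fs.chunks: the file body as ordered binary chunks ({ files_id, n, data })
  console.log(await db.collection('fs.chunks').findOne({}, { projection: { data: 0 } }));

  await client.close();
}

inspectGridFsCollections().catch(console.error);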

The Traditional File Storage Challenge

Conventional approaches to large file storage in enterprise applications face significant limitations in scalability and integration:

-- Traditional PostgreSQL file storage - limited and fragmented approach

-- Basic file metadata table (limited capabilities)
CREATE TABLE file_metadata (
    file_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    filename VARCHAR(255) NOT NULL,
    file_path VARCHAR(500) NOT NULL,
    file_size BIGINT NOT NULL,

    -- Basic file information
    mime_type VARCHAR(100),
    file_extension VARCHAR(10),
    original_filename VARCHAR(255),

    -- Simple metadata (limited structure)
    file_description TEXT,
    file_category VARCHAR(50),
    tags TEXT[], -- Basic array support

    -- Upload information
    uploaded_by UUID,
    uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Storage information (filesystem dependent)
    storage_location VARCHAR(100) DEFAULT 'local', -- local, s3, azure, gcs
    storage_path VARCHAR(500),
    storage_bucket VARCHAR(100),

    -- Basic versioning (very limited)
    version_number INTEGER DEFAULT 1,
    is_current_version BOOLEAN DEFAULT TRUE,
    parent_file_id UUID REFERENCES file_metadata(file_id),

    -- Simple access control
    is_public BOOLEAN DEFAULT FALSE,
    access_permissions JSONB,

    -- Basic status tracking
    processing_status VARCHAR(20) DEFAULT 'uploaded', -- uploaded, processing, ready, error

    -- Audit fields
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- File chunks table for large file handling (manual implementation)
CREATE TABLE file_chunks (
    chunk_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    file_id UUID NOT NULL REFERENCES file_metadata(file_id) ON DELETE CASCADE,
    chunk_number INTEGER NOT NULL,
    chunk_size INTEGER NOT NULL,

    -- Chunk data (limited by database constraints)
    chunk_data BYTEA, -- Limited to ~1GB in PostgreSQL

    -- Chunk integrity
    chunk_checksum VARCHAR(64), -- MD5 or SHA256 hash

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    UNIQUE (file_id, chunk_number)
);

-- File access log (basic tracking)
CREATE TABLE file_access_log (
    access_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    file_id UUID NOT NULL REFERENCES file_metadata(file_id),

    -- Access information
    accessed_by UUID,
    access_type VARCHAR(20), -- read, write, delete, stream
    access_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Request details
    client_ip INET,
    user_agent TEXT,
    request_method VARCHAR(10),

    -- Response information
    bytes_transferred BIGINT,
    response_status INTEGER,
    response_time_ms INTEGER,

    -- Streaming information (limited)
    stream_start_position BIGINT DEFAULT 0,
    stream_end_position BIGINT,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Complex query to manage file operations (expensive and limited)
WITH file_statistics AS (
    SELECT 
        fm.file_id,
        fm.filename,
        fm.file_size,
        fm.mime_type,
        fm.storage_location,
        fm.processing_status,

        -- Calculate chunk information (expensive operation)
        COUNT(fc.chunk_id) as total_chunks,
        SUM(fc.chunk_size) as total_chunk_size,

        -- Basic integrity check
        CASE 
            WHEN fm.file_size = SUM(fc.chunk_size) THEN 'intact'
            WHEN SUM(fc.chunk_size) IS NULL THEN 'no_chunks'
            ELSE 'corrupted'
        END as file_integrity,

        -- Recent access statistics (limited analysis)
        COUNT(CASE WHEN fal.access_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours' 
                   THEN 1 END) as daily_access_count,
        COUNT(CASE WHEN fal.access_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days' 
                   THEN 1 END) as weekly_access_count,

        -- Data transfer statistics
        SUM(CASE WHEN fal.access_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours' 
                 THEN fal.bytes_transferred ELSE 0 END) as daily_bytes_transferred,

        -- Performance metrics
        AVG(CASE WHEN fal.access_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours' 
                 THEN fal.response_time_ms END) as avg_response_time_ms

    FROM file_metadata fm
    LEFT JOIN file_chunks fc ON fm.file_id = fc.file_id
    LEFT JOIN file_access_log fal ON fm.file_id = fal.file_id
    WHERE fm.is_current_version = TRUE
    GROUP BY fm.file_id, fm.filename, fm.file_size, fm.mime_type, 
             fm.storage_location, fm.processing_status
),

storage_analysis AS (
    SELECT 
        storage_location,
        COUNT(*) as file_count,
        SUM(file_size) as total_storage_bytes,
        AVG(file_size) as avg_file_size,

        -- Storage health indicators
        COUNT(CASE WHEN file_integrity = 'corrupted' THEN 1 END) as corrupted_files,
        COUNT(CASE WHEN processing_status = 'error' THEN 1 END) as error_files,

        -- Access patterns
        AVG(daily_access_count) as avg_daily_access,
        SUM(daily_bytes_transferred) as total_daily_transfer,

        -- Performance indicators
        AVG(avg_response_time_ms) as avg_response_time

    FROM file_statistics
    GROUP BY storage_location
)

SELECT 
    fs.filename,
    fs.file_size,
    fs.mime_type,
    fs.storage_location,
    fs.total_chunks,
    fs.file_integrity,
    fs.processing_status,

    -- Access metrics
    fs.daily_access_count,
    fs.weekly_access_count,
    fs.avg_response_time_ms,

    -- Data transfer
    ROUND(fs.daily_bytes_transferred / 1024.0 / 1024.0, 2) as daily_mb_transferred,

    -- Storage efficiency (limited calculation)
    ROUND((fs.total_chunk_size::DECIMAL / fs.file_size) * 100, 2) as storage_efficiency_percent,

    -- Health indicators
    CASE 
        WHEN fs.file_integrity = 'corrupted' THEN 'Critical - File Corrupted'
        WHEN fs.processing_status = 'error' THEN 'Error - Processing Failed'
        WHEN fs.avg_response_time_ms > 5000 THEN 'Warning - Slow Response'
        WHEN fs.daily_access_count > 1000 THEN 'High Usage'
        ELSE 'Normal'
    END as file_status

FROM file_statistics fs
ORDER BY fs.daily_access_count DESC, fs.file_size DESC
LIMIT 100;

-- Problems with traditional file storage approach:
-- 1. Database size limitations prevent storing large files
-- 2. Manual chunking implementation is complex and error-prone
-- 3. Limited integration between file operations and database transactions
-- 4. Poor performance for streaming and partial file access
-- 5. Complex metadata management across multiple tables
-- 6. Limited support for file versioning and content management
-- 7. Expensive joins required for file operations
-- 8. No built-in support for distributed file storage
-- 9. Manual implementation of file integrity and consistency checks
-- 10. Limited indexing and query capabilities for file metadata

MongoDB GridFS eliminates these limitations with intelligent file management:

// MongoDB GridFS - comprehensive large file storage and management
const { MongoClient, GridFSBucket } = require('mongodb');
const crypto = require('crypto');
const fs = require('fs');
const path = require('path'); // needed for path.basename() in uploadFile()

// Advanced GridFS file management system
class MongoGridFSManager {
  constructor(client, databaseName, bucketName = 'files') {
    this.client = client;
    this.db = client.db(databaseName);
    this.bucketName = bucketName;
    this.chunkSizeBytes = 1024 * 1024; // 1MB chunks balance throughput and per-chunk overhead
    this.bucket = new GridFSBucket(this.db, { 
      bucketName: this.bucketName,
      chunkSizeBytes: this.chunkSizeBytes
    });

    this.fileMetrics = {
      totalUploads: 0,
      totalDownloads: 0,
      totalStreams: 0,
      bytesUploaded: 0,
      bytesDownloaded: 0,
      averageUploadTime: 0,
      averageDownloadTime: 0,
      errorCount: 0
    };
  }

  // Upload large files with comprehensive metadata and progress tracking
  async uploadFile(filePath, options = {}) {
    const startTime = Date.now();

    try {
      // Generate comprehensive file metadata
      const fileStats = fs.statSync(filePath);
      const filename = options.filename || path.basename(filePath);

      // Create file hash for integrity checking
      const fileHash = await this.generateFileHash(filePath);

      // Comprehensive metadata for advanced file management
      const metadata = {
        // Basic file information
        originalName: filename,
        uploadedAt: new Date(),
        fileSize: fileStats.size,
        mimeType: options.mimeType || this.detectMimeType(filename),

        // File integrity and versioning
        md5Hash: fileHash.md5,
        sha256Hash: fileHash.sha256,
        version: options.version || 1,
        parentFileId: options.parentFileId || null,

        // Content management
        description: options.description || '',
        category: options.category || 'general',
        tags: options.tags || [],

        // Access control and permissions
        uploadedBy: options.uploadedBy || 'system',
        isPublic: options.isPublic || false,
        accessPermissions: options.accessPermissions || { read: ['authenticated'] },

        // Processing and workflow
        processingStatus: 'uploaded',
        processingMetadata: {},

        // Content-specific metadata
        contentMetadata: options.contentMetadata || {},

        // Storage and performance optimization
        compressionType: options.compression || 'none',
        encryptionStatus: options.encrypted || false,
        storageClass: options.storageClass || 'standard', // standard, archival, frequent_access

        // Business context
        projectId: options.projectId,
        customFields: options.customFields || {},

        // Audit and compliance
        retentionPolicy: options.retentionPolicy || 'standard',
        complianceFlags: options.complianceFlags || [],

        // Performance tracking
        uploadDuration: null, // Will be set after upload completes
        lastAccessedAt: new Date(),
        accessCount: 0,
        totalBytesTransferred: 0,

        // Allow callers (e.g. createFileVersion) to pass a pre-built metadata object to merge
        ...(options.metadata || {})
      };

      return new Promise((resolve, reject) => {
        // Create upload stream with progress tracking
        const uploadStream = this.bucket.openUploadStream(filename, {
          metadata: metadata,
          chunkSizeBytes: options.chunkSize || (1024 * 1024) // 1MB default chunks
        });

        // Progress tracking variables
        let bytesUploaded = 0;
        const totalBytes = fileStats.size;

        // Create read stream from file
        const readStream = fs.createReadStream(filePath);

        // Progress tracking
        readStream.on('data', (chunk) => {
          bytesUploaded += chunk.length;

          if (options.onProgress) {
            const progress = {
              bytesUploaded: bytesUploaded,
              totalBytes: totalBytes,
              percentage: (bytesUploaded / totalBytes) * 100,
              remainingBytes: totalBytes - bytesUploaded,
              elapsedTime: Date.now() - startTime
            };
            options.onProgress(progress);
          }
        });

        // Handle upload completion
        uploadStream.on('finish', async () => {
          const uploadDuration = Date.now() - startTime;

          // Update file metadata with final upload information
          await this.db.collection(`${this.bucketName}.files`).updateOne(
            { _id: uploadStream.id },
            { 
              $set: { 
                'metadata.uploadDuration': uploadDuration,
                'metadata.uploadCompletedAt': new Date()
              }
            }
          );

          // Update metrics
          this.fileMetrics.totalUploads++;
          this.fileMetrics.bytesUploaded += totalBytes;
          // Incremental running average across all uploads
          this.fileMetrics.averageUploadTime +=
            (uploadDuration - this.fileMetrics.averageUploadTime) / this.fileMetrics.totalUploads;

          console.log(`File uploaded successfully: ${filename} (${totalBytes} bytes, ${uploadDuration}ms)`);

          resolve({
            fileId: uploadStream.id,
            filename: filename,
            size: totalBytes,
            uploadDuration: uploadDuration,
            metadata: metadata,
            chunksCount: Math.ceil(totalBytes / (options.chunkSize || (1024 * 1024)))
          });
        });

        // Handle upload errors
        uploadStream.on('error', (error) => {
          this.fileMetrics.errorCount++;
          console.error('Upload error:', error);
          reject(error);
        });

        // Start the upload
        readStream.pipe(uploadStream);
      });

    } catch (error) {
      this.fileMetrics.errorCount++;
      console.error('File upload error:', error);
      throw error;
    }
  }

  // Advanced file streaming with range support and performance optimization
  async streamFile(fileId, options = {}) {
    const startTime = Date.now();

    try {
      // Get file information for streaming optimization
      const fileInfo = await this.getFileInfo(fileId);
      if (!fileInfo) {
        throw new Error('File not found');
      }

      // Update access metrics
      await this.updateAccessMetrics(fileId);

      // Create download stream with optional range support
      const downloadOptions = {};

      // Support for HTTP range requests (partial content)
      if (options.start !== undefined || options.end !== undefined) {
        downloadOptions.start = options.start || 0;
        // Note: the driver treats `end` as exclusive (streaming stops before this byte offset)
        downloadOptions.end = options.end !== undefined ? options.end : fileInfo.length;

        console.log(`Streaming file range: ${downloadOptions.start}-${downloadOptions.end}/${fileInfo.length}`);
      }

      const downloadStream = this.bucket.openDownloadStream(fileId, downloadOptions);

      // Track streaming metrics
      let bytesStreamed = 0;

      downloadStream.on('data', (chunk) => {
        bytesStreamed += chunk.length;

        if (options.onProgress) {
          const progress = {
            bytesStreamed: bytesStreamed,
            totalBytes: fileInfo.length,
            percentage: (bytesStreamed / fileInfo.length) * 100,
            elapsedTime: Date.now() - startTime
          };
          options.onProgress(progress);
        }
      });

      downloadStream.on('end', () => {
        const streamDuration = Date.now() - startTime;

        // Update metrics
        this.fileMetrics.totalStreams++;
        this.fileMetrics.bytesDownloaded += bytesStreamed;
        // Incremental running average across all streams
        this.fileMetrics.averageDownloadTime +=
          (streamDuration - this.fileMetrics.averageDownloadTime) / this.fileMetrics.totalStreams;

        console.log(`File streamed: ${fileInfo.filename} (${bytesStreamed} bytes, ${streamDuration}ms)`);
      });

      downloadStream.on('error', (error) => {
        this.fileMetrics.errorCount++;
        console.error('Streaming error:', error);
      });

      return downloadStream;

    } catch (error) {
      this.fileMetrics.errorCount++;
      console.error('File streaming error:', error);
      throw error;
    }
  }

  // Comprehensive file search and metadata querying
  async searchFiles(query = {}, options = {}) {
    try {
      const searchCriteria = {};

      // Build comprehensive search query
      if (query.filename) {
        searchCriteria.filename = new RegExp(query.filename, 'i');
      }

      if (query.mimeType) {
        searchCriteria['metadata.mimeType'] = query.mimeType;
      }

      if (query.category) {
        searchCriteria['metadata.category'] = query.category;
      }

      if (query.tags && query.tags.length > 0) {
        searchCriteria['metadata.tags'] = { $in: query.tags };
      }

      if (query.uploadedBy) {
        searchCriteria['metadata.uploadedBy'] = query.uploadedBy;
      }

      if (query.dateRange) {
        searchCriteria.uploadDate = {};
        if (query.dateRange.from) {
          searchCriteria.uploadDate.$gte = new Date(query.dateRange.from);
        }
        if (query.dateRange.to) {
          searchCriteria.uploadDate.$lte = new Date(query.dateRange.to);
        }
      }

      if (query.sizeRange) {
        searchCriteria.length = {};
        if (query.sizeRange.min) {
          searchCriteria.length.$gte = query.sizeRange.min;
        }
        if (query.sizeRange.max) {
          searchCriteria.length.$lte = query.sizeRange.max;
        }
      }

      if (query.isPublic !== undefined) {
        searchCriteria['metadata.isPublic'] = query.isPublic;
      }

      // Full-text search in description and custom fields
      if (query.textSearch) {
        searchCriteria.$or = [
          { 'metadata.description': new RegExp(query.textSearch, 'i') },
          { 'metadata.customFields': new RegExp(query.textSearch, 'i') }
        ];
      }

      // Execute search with aggregation pipeline for advanced features
      const pipeline = [
        { $match: searchCriteria },

        // Add computed fields for enhanced results
        {
          $addFields: {
            fileSizeMB: { $divide: ['$length', 1024 * 1024] },
            uploadAge: { 
              $divide: [
                { $subtract: [new Date(), '$uploadDate'] },
                1000 * 60 * 60 * 24 // Convert to days
              ]
            }
          }
        },

        // Sort by relevance and recency
        {
          $sort: options.sortBy === 'size' ? { length: -1 } :
                 options.sortBy === 'name' ? { filename: 1 } :
                 { uploadDate: -1 } // Default: newest first
        },

        // Pagination
        { $skip: options.skip || 0 },
        { $limit: options.limit || 50 },

        // Project only needed fields for performance
        {
          $project: {
            _id: 1,
            filename: 1,
            length: 1,
            fileSizeMB: 1,
            uploadDate: 1,
            uploadAge: 1,
            md5: 1,
            'metadata.mimeType': 1,
            'metadata.category': 1,
            'metadata.tags': 1,
            'metadata.description': 1,
            'metadata.uploadedBy': 1,
            'metadata.isPublic': 1,
            'metadata.accessCount': 1,
            'metadata.lastAccessedAt': 1,
            'metadata.processingStatus': 1
          }
        }
      ];

      const files = await this.db.collection(`${this.bucketName}.files`)
        .aggregate(pipeline)
        .toArray();

      // Get total count for pagination
      const totalCount = await this.db.collection(`${this.bucketName}.files`)
        .countDocuments(searchCriteria);

      return {
        files: files,
        totalCount: totalCount,
        hasMore: (options.skip || 0) + files.length < totalCount,
        searchCriteria: searchCriteria,
        executedAt: new Date()
      };

    } catch (error) {
      console.error('File search error:', error);
      throw error;
    }
  }

  // File versioning and content management
  async createFileVersion(originalFileId, newFilePath, versionOptions = {}) {
    try {
      // Get original file information
      const originalFile = await this.getFileInfo(originalFileId);
      if (!originalFile) {
        throw new Error('Original file not found');
      }

      // Create new version with inherited metadata
      const versionMetadata = {
        ...originalFile.metadata,
        version: (originalFile.metadata.version || 1) + 1,
        parentFileId: originalFileId,
        versionDescription: versionOptions.description || '',
        versionCreatedAt: new Date(),
        versionCreatedBy: versionOptions.createdBy || 'system',
        changeLog: versionOptions.changeLog || []
      };

      // Upload new version
      const uploadResult = await this.uploadFile(newFilePath, {
        filename: originalFile.filename,
        metadata: versionMetadata,
        ...versionOptions
      });

      // Update version tracking
      await this.updateVersionHistory(originalFileId, uploadResult.fileId);

      return {
        newVersionId: uploadResult.fileId,
        versionNumber: versionMetadata.version,
        originalFileId: originalFileId,
        uploadResult: uploadResult
      };

    } catch (error) {
      console.error('File versioning error:', error);
      throw error;
    }
  }

  // Advanced file analytics and reporting
  async getFileAnalytics(timeRange = '30d') {
    try {
      const now = new Date();
      const timeRanges = {
        '1d': 1,
        '7d': 7,
        '30d': 30,
        '90d': 90,
        '365d': 365
      };

      const days = timeRanges[timeRange] || 30;
      const startDate = new Date(now.getTime() - (days * 24 * 60 * 60 * 1000));

      // Comprehensive analytics aggregation
      const analyticsResults = await Promise.all([

        // Storage analytics
        this.db.collection(`${this.bucketName}.files`).aggregate([
          {
            $group: {
              _id: null,
              totalFiles: { $sum: 1 },
              totalStorageBytes: { $sum: '$length' },
              averageFileSize: { $avg: '$length' },
              largestFile: { $max: '$length' },
              smallestFile: { $min: '$length' }
            }
          }
        ]).toArray(),

        // Upload trends
        this.db.collection(`${this.bucketName}.files`).aggregate([
          {
            $match: {
              uploadDate: { $gte: startDate }
            }
          },
          {
            $group: {
              _id: {
                year: { $year: '$uploadDate' },
                month: { $month: '$uploadDate' },
                day: { $dayOfMonth: '$uploadDate' }
              },
              dailyUploads: { $sum: 1 },
              dailyStorageAdded: { $sum: '$length' }
            }
          },
          {
            $sort: { '_id.year': 1, '_id.month': 1, '_id.day': 1 }
          }
        ]).toArray(),

        // File type distribution
        this.db.collection(`${this.bucketName}.files`).aggregate([
          {
            $group: {
              _id: '$metadata.mimeType',
              fileCount: { $sum: 1 },
              totalSize: { $sum: '$length' },
              averageSize: { $avg: '$length' }
            }
          },
          {
            $sort: { fileCount: -1 }
          },
          {
            $limit: 20
          }
        ]).toArray(),

        // Category analysis
        this.db.collection(`${this.bucketName}.files`).aggregate([
          {
            $group: {
              _id: '$metadata.category',
              fileCount: { $sum: 1 },
              totalSize: { $sum: '$length' }
            }
          },
          {
            $sort: { fileCount: -1 }
          }
        ]).toArray(),

        // Access patterns
        this.db.collection(`${this.bucketName}.files`).aggregate([
          {
            $match: {
              'metadata.lastAccessedAt': { $gte: startDate }
            }
          },
          {
            $group: {
              _id: null,
              averageAccessCount: { $avg: '$metadata.accessCount' },
              totalBytesTransferred: { $sum: '$metadata.totalBytesTransferred' },
              mostAccessedFiles: { $push: {
                filename: '$filename',
                accessCount: '$metadata.accessCount'
              }}
            }
          }
        ]).toArray()
      ]);

      // Compile comprehensive analytics report
      const [storageStats, uploadTrends, fileTypeStats, categoryStats, accessStats] = analyticsResults;

      const analytics = {
        reportGeneratedAt: new Date(),
        timeRange: timeRange,

        // Storage overview
        storage: storageStats[0] || {
          totalFiles: 0,
          totalStorageBytes: 0,
          averageFileSize: 0,
          largestFile: 0,
          smallestFile: 0
        },

        // Upload trends
        uploadTrends: uploadTrends,

        // File type distribution
        fileTypes: fileTypeStats,

        // Category distribution
        categories: categoryStats,

        // Access patterns
        accessPatterns: accessStats[0] || {
          averageAccessCount: 0,
          totalBytesTransferred: 0,
          mostAccessedFiles: []
        },

        // Performance metrics
        performanceMetrics: {
          ...this.fileMetrics,
          reportedAt: new Date()
        },

        // Storage efficiency calculations
        efficiency: {
          storageUtilizationMB: Math.round((storageStats[0]?.totalStorageBytes || 0) / (1024 * 1024)),
          averageFileSizeMB: Math.round((storageStats[0]?.averageFileSize || 0) / (1024 * 1024)),
          chunksPerFile: Math.ceil((storageStats[0]?.averageFileSize || 0) / (1024 * 1024)), // Assumes 1MB chunks
          compressionRatio: 1.0 // Would be calculated from actual compression data
        }
      };

      return analytics;

    } catch (error) {
      console.error('Analytics generation error:', error);
      throw error;
    }
  }

  // Utility methods for file operations
  async getFileInfo(fileId) {
    try {
      const fileInfo = await this.db.collection(`${this.bucketName}.files`)
        .findOne({ _id: fileId });
      return fileInfo;
    } catch (error) {
      console.error('Get file info error:', error);
      return null;
    }
  }

  async updateAccessMetrics(fileId) {
    try {
      await this.db.collection(`${this.bucketName}.files`).updateOne(
        { _id: fileId },
        {
          $inc: { 'metadata.accessCount': 1 },
          $set: { 'metadata.lastAccessedAt': new Date() }
        }
      );
    } catch (error) {
      console.error('Access metrics update error:', error);
    }
  }

  async generateFileHash(filePath) {
    return new Promise((resolve, reject) => {
      const md5Hash = crypto.createHash('md5');
      const sha256Hash = crypto.createHash('sha256');
      const stream = fs.createReadStream(filePath);

      stream.on('data', (data) => {
        md5Hash.update(data);
        sha256Hash.update(data);
      });

      stream.on('end', () => {
        resolve({
          md5: md5Hash.digest('hex'),
          sha256: sha256Hash.digest('hex')
        });
      });

      stream.on('error', reject);
    });
  }

  detectMimeType(filename) {
    const extension = filename.toLowerCase().split('.').pop();
    const mimeTypes = {
      'jpg': 'image/jpeg',
      'jpeg': 'image/jpeg',
      'png': 'image/png',
      'gif': 'image/gif',
      'pdf': 'application/pdf',
      'mp4': 'video/mp4',
      'mp3': 'audio/mpeg',
      'txt': 'text/plain',
      'json': 'application/json',
      'zip': 'application/zip'
    };
    return mimeTypes[extension] || 'application/octet-stream';
  }

  async updateVersionHistory(originalFileId, newVersionId) {
    // Implementation for version history tracking
    await this.db.collection('file_versions').insertOne({
      originalFileId: originalFileId,
      versionId: newVersionId,
      createdAt: new Date()
    });
  }

  // File cleanup and maintenance
  async deleteFile(fileId) {
    try {
      await this.bucket.delete(fileId);
      console.log(`File deleted: ${fileId}`);
      return { success: true, deletedAt: new Date() };
    } catch (error) {
      console.error('File deletion error:', error);
      throw error;
    }
  }

  // Get comprehensive system metrics
  getSystemMetrics() {
    return {
      ...this.fileMetrics,
      timestamp: new Date(),
      bucketName: this.bucketName,
      chunkSize: this.chunkSizeBytes
    };
  }
}

// Example usage demonstrating comprehensive GridFS functionality
async function demonstrateGridFSOperations() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const gridFSManager = new MongoGridFSManager(client, 'mediaStorage', 'uploads');

  try {
    console.log('Demonstrating MongoDB GridFS advanced file management...');

    // Upload a large file with comprehensive metadata
    console.log('Uploading large file...');
    const uploadResult = await gridFSManager.uploadFile('/path/to/large-video.mp4', {
      description: 'Corporate training video',
      category: 'training',
      tags: ['corporate', 'training', 'hr'],
      uploadedBy: 'admin',
      isPublic: false,
      contentMetadata: {
        duration: 3600, // seconds
        resolution: '1920x1080',
        codec: 'h264'
      },
      onProgress: (progress) => {
        console.log(`Upload progress: ${progress.percentage.toFixed(1)}%`);
      }
    });

    console.log('Upload completed:', uploadResult);

    // Search for files with comprehensive criteria
    console.log('Searching for video files...');
    const searchResults = await gridFSManager.searchFiles({
      mimeType: 'video/mp4',
      category: 'training',
      tags: ['corporate'],
      sizeRange: { min: 100 * 1024 * 1024 } // Files larger than 100MB
    }, {
      sortBy: 'size',
      limit: 10
    });

    console.log(`Found ${searchResults.totalCount} matching files`);
    searchResults.files.forEach(file => {
      console.log(`- ${file.filename} (${file.fileSizeMB.toFixed(1)} MB)`);
    });

    // Stream file with range support
    if (searchResults.files.length > 0) {
      const fileToStream = searchResults.files[0];
      console.log(`Streaming file: ${fileToStream.filename}`);

      const streamOptions = {
        start: 0,
        end: 1024 * 1024, // First 1MB
        onProgress: (progress) => {
          console.log(`Streaming progress: ${progress.percentage.toFixed(1)}%`);
        }
      };

      const stream = await gridFSManager.streamFile(fileToStream._id, streamOptions);

      // In a real application, you would pipe this to a response or file
      stream.on('end', () => {
        console.log('Streaming completed');
      });
    }

    // Generate comprehensive analytics
    console.log('Generating file analytics...');
    const analytics = await gridFSManager.getFileAnalytics('30d');

    console.log('Storage Analytics:');
    console.log(`Total Files: ${analytics.storage.totalFiles}`);
    console.log(`Total Storage: ${(analytics.storage.totalStorageBytes / (1024 * 1024 * 1024)).toFixed(2)} GB`);
    console.log(`Average File Size: ${(analytics.storage.averageFileSize / (1024 * 1024)).toFixed(2)} MB`);

    console.log('File Type Distribution:');
    analytics.fileTypes.slice(0, 5).forEach(type => {
      console.log(`- ${type._id}: ${type.fileCount} files (${(type.totalSize / (1024 * 1024)).toFixed(1)} MB)`);
    });

    // Get system metrics
    const metrics = gridFSManager.getSystemMetrics();
    console.log('System Performance Metrics:', metrics);

  } catch (error) {
    console.error('GridFS demonstration error:', error);
  } finally {
    await client.close();
  }
}

// Benefits of MongoDB GridFS:
// - Seamless large file storage without size limitations
// - Automatic file chunking with optimal performance
// - Comprehensive metadata management with flexible schemas
// - Built-in streaming support with range request capabilities (see the Express range-streaming sketch below)
// - Integration with MongoDB's query engine and indexing
// - File metadata and chunks stored and replicated alongside application data
// - Advanced search and analytics capabilities
// - Automatic replication and distributed storage
// - File versioning and content management features
// - High-performance concurrent access and streaming
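
Since range-request streaming is one of the headline benefits above, here is a sketch of how it can be exposed over HTTP with Express, reusing the mediaStorage database and uploads bucket from the example; the route path and error handling are illustrative assumptions rather than a complete implementation. Note that the driver's end option is exclusive, while HTTP Range headers use an inclusive end byte.

// Sketch: serving GridFS files with HTTP Range support via Express.
// Route, database, and bucket names are illustrative assumptions.
const express = require('express');
const { MongoClient, GridFSBucket, ObjectId } = require('mongodb');

async function startStreamingServer() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('mediaStorage');
  const bucket = new GridFSBucket(db, { bucketName: 'uploads' });

  const app = express();

  app.get('/api/files/stream/:id', async (req, res) => {
    const fileId = new ObjectId(req.params.id);
    const file = await db.collection('uploads.files').findOne({ _id: fileId });
    if (!file) return res.status(404).end();

    const range = req.headers.range; // e.g. "bytes=0-1048575"
    if (range) {
      const [startStr, endStr] = range.replace('bytes=', '').split('-');
      const start = parseInt(startStr, 10);
      const end = endStr ? parseInt(endStr, 10) : file.length - 1; // inclusive HTTP end byte

      res.status(206).set({
        'Content-Range': `bytes ${start}-${end}/${file.length}`,
        'Accept-Ranges': 'bytes',
        'Content-Length': end - start + 1,
        'Content-Type': file.metadata?.mimeType || 'application/octet-stream'
      });
      // GridFS treats `end` as exclusive, so add 1 to the inclusive HTTP range end
      bucket.openDownloadStream(fileId, { start, end: end + 1 }).pipe(res);
    } else {
      res.set('Content-Length', file.length);
      bucket.openDownloadStream(fileId).pipe(res);
    }
  });

  app.listen(3000, () => console.log('Streaming server listening on :3000'));
}

startStreamingServer().catch(console.error);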

SQL-Style File Operations with QueryLeaf

QueryLeaf provides familiar approaches to MongoDB GridFS file management and operations:

-- QueryLeaf GridFS file management with SQL-familiar syntax

-- Upload file with comprehensive metadata
INSERT INTO FILES (filename, file_data, metadata) VALUES (
  'corporate-training.mp4',
  UPLOAD_FILE('/path/to/video.mp4'),
  JSON_OBJECT(
    'description', 'Corporate training video on data security',
    'category', 'training',
    'tags', JSON_ARRAY('corporate', 'security', 'training'),
    'uploadedBy', 'admin@company.com',
    'isPublic', false,
    'contentMetadata', JSON_OBJECT(
      'duration', 3600,
      'resolution', '1920x1080',
      'codec', 'h264',
      'bitrate', '2000kbps'
    ),
    'processingStatus', 'uploaded',
    'retentionPolicy', 'business-7years',
    'complianceFlags', JSON_ARRAY('gdpr', 'sox')
  )
);

-- Search and query files with comprehensive criteria
SELECT 
  file_id,
  filename,
  file_size,
  ROUND(file_size / 1024.0 / 1024.0, 2) as file_size_mb,
  upload_date,

  -- Extract metadata fields
  JSON_EXTRACT(metadata, '$.description') as description,
  JSON_EXTRACT(metadata, '$.category') as category,
  JSON_EXTRACT(metadata, '$.tags') as tags,
  JSON_EXTRACT(metadata, '$.uploadedBy') as uploaded_by,
  JSON_EXTRACT(metadata, '$.contentMetadata.duration') as duration_seconds,
  JSON_EXTRACT(metadata, '$.processingStatus') as processing_status,

  -- Access metrics
  JSON_EXTRACT(metadata, '$.accessCount') as access_count,
  JSON_EXTRACT(metadata, '$.lastAccessedAt') as last_accessed,
  JSON_EXTRACT(metadata, '$.totalBytesTransferred') as total_bytes_transferred,

  -- File integrity
  md5_hash,
  chunk_count,

  -- Computed fields
  CASE 
    WHEN file_size > 1024*1024*1024 THEN 'Large (>1GB)'
    WHEN file_size > 100*1024*1024 THEN 'Medium (>100MB)'
    ELSE 'Small (<100MB)'
  END as size_category,

  DATEDIFF(CURRENT_DATE(), upload_date) as days_since_upload

FROM GRIDFS_FILES()
WHERE 
  -- File type filtering
  JSON_EXTRACT(metadata, '$.mimeType') LIKE 'video/%'

  -- Category and tag filtering
  AND JSON_EXTRACT(metadata, '$.category') = 'training'
  AND JSON_CONTAINS(JSON_EXTRACT(metadata, '$.tags'), '"corporate"')

  -- Size filtering
  AND file_size > 100 * 1024 * 1024  -- Files larger than 100MB

  -- Date range filtering
  AND upload_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)

  -- Access pattern filtering
  AND CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) > 5

  -- Processing status filtering
  AND JSON_EXTRACT(metadata, '$.processingStatus') = 'ready'

ORDER BY 
  CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) DESC,
  upload_date DESC
LIMIT 50;

-- File analytics and usage patterns
WITH file_analytics AS (
  SELECT 
    DATE_FORMAT(upload_date, '%Y-%m') as upload_month,
    JSON_EXTRACT(metadata, '$.category') as category,
    JSON_EXTRACT(metadata, '$.mimeType') as mime_type,

    -- File metrics
    COUNT(*) as file_count,
    SUM(file_size) as total_size_bytes,
    AVG(file_size) as avg_file_size,
    MIN(file_size) as min_file_size,
    MAX(file_size) as max_file_size,

    -- Access metrics
    SUM(CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED)) as total_access_count,
    AVG(CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED)) as avg_access_count,
    SUM(CAST(JSON_EXTRACT(metadata, '$.totalBytesTransferred') AS UNSIGNED)) as total_bytes_transferred,

    -- Performance metrics
    AVG(CAST(JSON_EXTRACT(metadata, '$.uploadDuration') AS UNSIGNED)) as avg_upload_time_ms

  FROM GRIDFS_FILES()
  WHERE upload_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH)
  GROUP BY 
    DATE_FORMAT(upload_date, '%Y-%m'),
    JSON_EXTRACT(metadata, '$.category'),
    JSON_EXTRACT(metadata, '$.mimeType')
),

category_summary AS (
  SELECT 
    category,

    -- Volume metrics
    SUM(file_count) as total_files,
    SUM(total_size_bytes) as category_total_size,
    ROUND(SUM(total_size_bytes) / 1024.0 / 1024.0 / 1024.0, 2) as category_total_gb,

    -- Access patterns
    SUM(total_access_count) as category_total_accesses,
    ROUND(AVG(avg_access_count), 2) as category_avg_access_per_file,

    -- Performance indicators
    ROUND(AVG(avg_upload_time_ms), 2) as category_avg_upload_time,

    -- Growth trends
    COUNT(DISTINCT upload_month) as active_months,

    -- Storage efficiency
    ROUND(AVG(avg_file_size) / 1024.0 / 1024.0, 2) as avg_file_size_mb,
    ROUND(SUM(total_bytes_transferred) / 1024.0 / 1024.0 / 1024.0, 2) as total_transfer_gb

  FROM file_analytics
  GROUP BY category
)

SELECT 
  category,
  total_files,
  category_total_gb,
  category_avg_access_per_file,
  avg_file_size_mb,
  total_transfer_gb,

  -- Storage cost estimation (example rates)
  ROUND(category_total_gb * 0.023, 2) as estimated_monthly_storage_cost_usd,
  ROUND(total_transfer_gb * 0.09, 2) as estimated_transfer_cost_usd,

  -- Performance assessment
  CASE 
    WHEN category_avg_upload_time < 1000 THEN 'Excellent'
    WHEN category_avg_upload_time < 5000 THEN 'Good'
    WHEN category_avg_upload_time < 15000 THEN 'Fair'
    ELSE 'Needs Optimization'
  END as upload_performance,

  -- Usage classification
  CASE 
    WHEN category_avg_access_per_file > 100 THEN 'High Usage'
    WHEN category_avg_access_per_file > 20 THEN 'Medium Usage'
    WHEN category_avg_access_per_file > 5 THEN 'Low Usage'
    ELSE 'Archived/Inactive'
  END as usage_pattern

FROM category_summary
ORDER BY category_total_gb DESC, category_avg_access_per_file DESC;

-- File streaming and download operations
SELECT 
  file_id,
  filename,
  file_size,

  -- Create streaming URLs with range support
  CONCAT('/api/files/stream/', file_id) as stream_url,
  CONCAT('/api/files/stream/', file_id, '?range=0-1048576') as preview_stream_url,
  CONCAT('/api/files/download/', file_id) as download_url,

  -- Content delivery optimization
  CASE 
    WHEN JSON_EXTRACT(metadata, '$.mimeType') LIKE 'video/%' THEN 'streaming'
    WHEN JSON_EXTRACT(metadata, '$.mimeType') LIKE 'audio/%' THEN 'streaming'
    WHEN JSON_EXTRACT(metadata, '$.mimeType') LIKE 'image/%' THEN 'direct'
    ELSE 'download'
  END as recommended_delivery_method,

  -- CDN configuration suggestions
  CASE 
    WHEN CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) > 1000 THEN 'edge-cache'
    WHEN CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) > 100 THEN 'regional-cache'
    ELSE 'origin-only'
  END as cdn_strategy,

  -- Access control
  JSON_EXTRACT(metadata, '$.isPublic') as is_public,
  JSON_EXTRACT(metadata, '$.accessPermissions') as access_permissions

FROM GRIDFS_FILES()
WHERE JSON_EXTRACT(metadata, '$.processingStatus') = 'ready'
ORDER BY CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) DESC;

-- File maintenance and cleanup operations
WITH file_maintenance AS (
  SELECT 
    file_id,
    filename,
    file_size,
    upload_date,

    -- Metadata analysis
    JSON_EXTRACT(metadata, '$.category') as category,
    JSON_EXTRACT(metadata, '$.retentionPolicy') as retention_policy,
    JSON_EXTRACT(metadata, '$.lastAccessedAt') as last_accessed,
    CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) as access_count,

    -- Age calculations
    DATEDIFF(CURRENT_DATE(), upload_date) as days_since_upload,
    DATEDIFF(CURRENT_DATE(), STR_TO_DATE(JSON_EXTRACT(metadata, '$.lastAccessedAt'), '%Y-%m-%d')) as days_since_access,

    -- Maintenance flags
    CASE 
      WHEN JSON_EXTRACT(metadata, '$.retentionPolicy') = 'business-7years' AND 
           DATEDIFF(CURRENT_DATE(), upload_date) > 2555 THEN 'DELETE'
      WHEN JSON_EXTRACT(metadata, '$.retentionPolicy') = 'business-3years' AND 
           DATEDIFF(CURRENT_DATE(), upload_date) > 1095 THEN 'DELETE'
      WHEN DATEDIFF(CURRENT_DATE(), STR_TO_DATE(JSON_EXTRACT(metadata, '$.lastAccessedAt'), '%Y-%m-%d')) > 365
           AND CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) = 0 THEN 'ARCHIVE'
      WHEN DATEDIFF(CURRENT_DATE(), STR_TO_DATE(JSON_EXTRACT(metadata, '$.lastAccessedAt'), '%Y-%m-%d')) > 180
           AND CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) < 5 THEN 'COLD_STORAGE'
      ELSE 'ACTIVE'
    END as maintenance_action

  FROM GRIDFS_FILES()
)

SELECT 
  maintenance_action,
  COUNT(*) as file_count,
  ROUND(SUM(file_size) / 1024.0 / 1024.0 / 1024.0, 2) as total_size_gb,

  -- Cost impact analysis
  ROUND((SUM(file_size) / 1024.0 / 1024.0 / 1024.0) * 0.023, 2) as current_monthly_cost_usd,

  -- Storage class optimization
  CASE maintenance_action
    WHEN 'COLD_STORAGE' THEN ROUND((SUM(file_size) / 1024.0 / 1024.0 / 1024.0) * 0.004, 2)
    WHEN 'ARCHIVE' THEN ROUND((SUM(file_size) / 1024.0 / 1024.0 / 1024.0) * 0.001, 2)
    WHEN 'DELETE' THEN 0
    ELSE ROUND((SUM(file_size) / 1024.0 / 1024.0 / 1024.0) * 0.023, 2)
  END as optimized_monthly_cost_usd,

  -- Sample files for review
  GROUP_CONCAT(
    CONCAT(filename, ' (', ROUND(file_size/1024/1024, 1), 'MB)')
    ORDER BY file_size DESC
    SEPARATOR '; '
  ) as sample_files

FROM file_maintenance
GROUP BY maintenance_action
ORDER BY total_size_gb DESC;

-- Real-time file system monitoring
CREATE VIEW file_system_health AS
SELECT 
  -- Current system status
  COUNT(*) as total_files,
  ROUND(SUM(file_size) / 1024.0 / 1024.0 / 1024.0, 2) as total_storage_gb,
  COUNT(CASE WHEN upload_date >= DATE_SUB(NOW(), INTERVAL 24 HOUR) THEN 1 END) as files_uploaded_24h,
  COUNT(CASE WHEN STR_TO_DATE(JSON_EXTRACT(metadata, '$.lastAccessedAt'), '%Y-%m-%d %H:%i:%s') >= DATE_SUB(NOW(), INTERVAL 1 HOUR) THEN 1 END) as files_accessed_1h,

  -- Performance indicators
  AVG(CAST(JSON_EXTRACT(metadata, '$.uploadDuration') AS UNSIGNED)) as avg_upload_time_ms,
  COUNT(CASE WHEN JSON_EXTRACT(metadata, '$.processingStatus') = 'error' THEN 1 END) as files_with_errors,
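  -- Integrity check below assumes a 1 MiB chunk size (1048576 bytes); adjust the divisor to the bucket's configured chunkSizeBytes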
  COUNT(CASE WHEN chunk_count != CEIL(file_size / 1048576.0) THEN 1 END) as files_with_integrity_issues,

  -- Storage distribution
  COUNT(CASE WHEN file_size > 1024*1024*1024 THEN 1 END) as large_files_1gb_plus,
  COUNT(CASE WHEN file_size BETWEEN 100*1024*1024 AND 1024*1024*1024 THEN 1 END) as medium_files_100mb_1gb,
  COUNT(CASE WHEN file_size < 100*1024*1024 THEN 1 END) as small_files_under_100mb,

  -- Health assessment
  CASE 
    WHEN COUNT(CASE WHEN JSON_EXTRACT(metadata, '$.processingStatus') = 'error' THEN 1 END) > 
         COUNT(*) * 0.05 THEN 'Critical - High Error Rate'
    WHEN AVG(CAST(JSON_EXTRACT(metadata, '$.uploadDuration') AS UNSIGNED)) > 30000 THEN 'Warning - Slow Uploads'
    WHEN COUNT(CASE WHEN chunk_count != CEIL(file_size / 1048576.0) THEN 1 END) > 0 THEN 'Warning - Integrity Issues'
    ELSE 'Healthy'
  END as system_health_status,

  NOW() as report_timestamp

FROM GRIDFS_FILES()
WHERE upload_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

-- QueryLeaf GridFS provides:
-- 1. SQL-familiar file upload and management operations
-- 2. Comprehensive file search and filtering capabilities
-- 3. Advanced analytics and usage pattern analysis
-- 4. Intelligent file lifecycle management and cleanup
-- 5. Real-time system health monitoring and alerting
-- 6. Cost optimization and storage class recommendations
-- 7. Integration with MongoDB's GridFS streaming capabilities
-- 8. Metadata-driven content management and organization
-- 9. Performance monitoring and optimization insights
-- 10. Enterprise-grade file operations with ACID guarantees

Best Practices for MongoDB GridFS

File Storage Strategy

Optimal GridFS configuration for different application types:

  1. Media Streaming Applications: Large chunk sizes for optimal streaming performance (see the chunk-size sketch after this list)
  2. Document Management Systems: Metadata-rich storage with comprehensive indexing
  3. Content Distribution Networks: Integration with CDN and caching strategies
  4. Backup and Archival Systems: Compression and long-term storage optimization
  5. Real-time Applications: Fast upload/download with minimal latency
  6. Multi-tenant Systems: Secure isolation and access control patterns
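
As a brief sketch of the chunk-size guidance in point 1, the Node.js driver's GridFSBucket accepts a chunkSizeBytes option when the bucket is created. The bucket name, file path, and sizes below are illustrative assumptions rather than recommendations from this article:

// Sketch: per-workload GridFS chunk sizing (values are illustrative assumptions)
const { MongoClient, GridFSBucket } = require('mongodb');
const { pipeline } = require('stream/promises');
const fs = require('fs');

async function uploadVideo(db, filePath) {
  // Larger chunks (here 4 MiB) reduce per-chunk round trips for large media read sequentially;
  // the driver's 255 kB default suits smaller files and random-access reads.
  const mediaBucket = new GridFSBucket(db, { bucketName: 'media', chunkSizeBytes: 4 * 1024 * 1024 });

  await pipeline(
    fs.createReadStream(filePath),
    mediaBucket.openUploadStream(filePath, { metadata: { category: 'video' } })
  );
}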

Performance Optimization Guidelines

Essential considerations for production GridFS deployments:

  1. Chunk Size Optimization: Balance between storage efficiency and streaming performance
  2. Index Strategy: Create appropriate indexes on metadata fields for fast queries (see the index sketch after this list)
  3. Replication Configuration: Optimize replica set configuration for file operations
  4. Connection Pooling: Configure connection pools for concurrent file operations
  5. Monitoring Integration: Implement comprehensive file operation monitoring
  6. Storage Management: Plan for growth and implement lifecycle management
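
As a sketch of the index strategy in point 2, metadata indexes live on the bucket's files collection (fs.files for the default bucket, where the native field names are length and uploadDate). The metadata fields below mirror the earlier examples in this article but are otherwise assumptions about your schema:

// Sketch: metadata indexes on the GridFS files collection (field names mirror the earlier examples)
async function createGridfsMetadataIndexes(db) {
  const files = db.collection('fs.files'); // use '<bucketName>.files' for a custom bucket

  await files.createIndexes([
    { key: { 'metadata.category': 1, uploadDate: -1 } },   // category browsing, newest first
    { key: { 'metadata.mimeType': 1 } },                   // delivery-method and CDN decisions
    { key: { 'metadata.processingStatus': 1 } },           // locate files still processing or errored
    { key: { 'metadata.lastAccessedAt': 1, length: 1 } }   // lifecycle and cleanup scans
  ]);
}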

Conclusion

MongoDB GridFS provides large file storage and management capabilities that integrate directly with MongoDB's document database features, supporting files well beyond the 16 MB BSON document limit along with efficient streaming and comprehensive metadata management. By applying sound file management patterns, streaming optimization, and automated analytics, applications can meet complex file storage requirements while maintaining high performance and operational efficiency.

Key GridFS benefits include:

  • Large File Storage: Files far beyond the 16 MB BSON document limit, stored through automatic chunking and distribution
  • Seamless Integration: Native integration with MongoDB queries, indexes, and transactions
  • Intelligent Streaming: High-performance streaming with range request support
  • Comprehensive Metadata: Flexible, searchable metadata with rich query capabilities
  • High Availability: Automatic replication and distributed storage across replica sets
  • Advanced Analytics: Built-in analytics and reporting for file usage and performance

Whether you're building media streaming platforms, document management systems, content distribution networks, or file-intensive applications, MongoDB GridFS with QueryLeaf's familiar file operation interface provides the foundation for scalable, efficient large file management. This combination enables you to leverage advanced file storage capabilities while maintaining familiar database administration patterns and SQL-style file operations.

QueryLeaf Integration: QueryLeaf automatically translates SQL-familiar file operations into optimal MongoDB GridFS commands while providing comprehensive file management and analytics through SQL-style queries. Advanced file storage patterns, streaming optimization, and lifecycle management are seamlessly handled through familiar database administration interfaces, making sophisticated file storage both powerful and accessible.

The integration of intelligent file storage with SQL-style file operations makes MongoDB an ideal platform for applications requiring both scalable file management and familiar database administration patterns, ensuring your files remain both accessible and efficiently managed as they scale to meet demanding production requirements.

MongoDB Aggregation Framework for Real-Time Analytics Dashboards: Advanced Data Processing and Visualization Pipelines

Modern data-driven applications require sophisticated analytics capabilities that can process large volumes of data in real-time, generate insights across multiple dimensions, and power interactive dashboards that provide immediate business intelligence. Traditional analytics approaches often involve complex ETL processes, separate analytics databases, and batch processing systems that introduce significant latency between data creation and insight availability, limiting the ability to make real-time business decisions.

MongoDB's Aggregation Framework provides comprehensive real-time analytics capabilities through powerful data processing pipelines that enable complex calculations, multi-stage transformations, and advanced statistical operations directly within the database. Unlike traditional analytics systems that require data movement and separate processing infrastructure, MongoDB aggregation pipelines can process operational data immediately, providing real-time insights with minimal latency and infrastructure complexity.
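
As a minimal illustration, a single pipeline can answer a dashboard question such as "revenue per region over the last hour" directly against the operational collection, with no ETL step. The collection and field names below match the dashboard code later in this article:

// Minimal sketch: last-hour revenue per region computed directly on operational data
const { MongoClient } = require('mongodb');

async function lastHourRevenueByRegion(uri = 'mongodb://localhost:27017') {
  const client = new MongoClient(uri);
  try {
    const sales = client.db('realtime_analytics_system').collection('sales_transactions');

    return await sales.aggregate([
      { $match: { transaction_date: { $gte: new Date(Date.now() - 60 * 60 * 1000) } } },
      { $group: { _id: '$region', revenue: { $sum: '$total_amount' }, transactions: { $sum: 1 } } },
      { $sort: { revenue: -1 } }
    ]).toArray();
  } finally {
    await client.close();
  }
}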

The Traditional Analytics Challenge

Conventional approaches to real-time analytics and dashboard creation have significant limitations for modern data-driven applications:

-- Traditional PostgreSQL analytics - complex and resource-intensive approaches

-- Basic analytics table structure with limited real-time capabilities
CREATE TABLE sales_transactions (
    transaction_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id UUID NOT NULL,
    product_id UUID NOT NULL,
    transaction_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    total_amount DECIMAL(10,2) NOT NULL,
    discount_amount DECIMAL(10,2) DEFAULT 0,
    tax_amount DECIMAL(10,2) NOT NULL,
    payment_method VARCHAR(50) NOT NULL,
    sales_channel VARCHAR(50) NOT NULL,
    region VARCHAR(100) NOT NULL,

    -- Manual aggregation tracking (limited granularity)
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Basic customer demographics table
CREATE TABLE customers (
    customer_id UUID PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    email VARCHAR(200) UNIQUE NOT NULL,
    age INTEGER,
    gender VARCHAR(20),
    city VARCHAR(100),
    state VARCHAR(50),
    country VARCHAR(50),
    customer_segment VARCHAR(50),
    registration_date TIMESTAMP NOT NULL,
    lifetime_value DECIMAL(15,2) DEFAULT 0
);

-- Product catalog with basic attributes
CREATE TABLE products (
    product_id UUID PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL,
    category VARCHAR(100) NOT NULL,
    subcategory VARCHAR(100),
    brand VARCHAR(100),
    unit_cost DECIMAL(10,2) NOT NULL,
    list_price DECIMAL(10,2) NOT NULL,
    margin_percent DECIMAL(5,2),
    stock_quantity INTEGER DEFAULT 0,
    supplier_id UUID
);

-- Pre-aggregated summary tables (manual maintenance required)
CREATE TABLE daily_sales_summary (
    summary_date DATE NOT NULL,
    region VARCHAR(100) NOT NULL,
    category VARCHAR(100) NOT NULL,
    total_transactions INTEGER DEFAULT 0,
    total_revenue DECIMAL(15,2) DEFAULT 0,
    total_units_sold INTEGER DEFAULT 0,
    unique_customers INTEGER DEFAULT 0,
    avg_transaction_value DECIMAL(10,2) DEFAULT 0,

    -- Manual timestamp tracking
    calculated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (summary_date, region, category)
);

-- Complex materialized view for real-time dashboard (limited refresh capabilities)
CREATE MATERIALIZED VIEW current_sales_dashboard AS
WITH hourly_metrics AS (
    SELECT 
        DATE_TRUNC('hour', st.transaction_date) as hour_bucket,
        st.region,
        p.category,
        p.brand,
        c.customer_segment,

        -- Basic aggregations (limited computational capability)
        COUNT(*) as transaction_count,
        COUNT(DISTINCT st.customer_id) as unique_customers,
        SUM(st.total_amount) as total_revenue,
        SUM(st.quantity) as total_units,
        AVG(st.total_amount) as avg_transaction_value,
        SUM(st.discount_amount) as total_discounts,

        -- Limited statistical calculations
        STDDEV(st.total_amount) as revenue_stddev,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY st.total_amount) as median_transaction_value,

        -- Payment method breakdown (basic pivot)
        COUNT(*) FILTER (WHERE st.payment_method = 'credit_card') as credit_card_transactions,
        COUNT(*) FILTER (WHERE st.payment_method = 'debit_card') as debit_card_transactions,
        COUNT(*) FILTER (WHERE st.payment_method = 'cash') as cash_transactions,
        COUNT(*) FILTER (WHERE st.payment_method = 'digital_wallet') as digital_wallet_transactions

    FROM sales_transactions st
    JOIN customers c ON st.customer_id = c.customer_id
    JOIN products p ON st.product_id = p.product_id
    WHERE st.transaction_date >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    GROUP BY 
        DATE_TRUNC('hour', st.transaction_date),
        st.region, p.category, p.brand, c.customer_segment
),

regional_performance AS (
    SELECT 
        hm.region,

        -- Regional aggregations (limited granularity)
        SUM(hm.transaction_count) as total_transactions,
        SUM(hm.total_revenue) as total_revenue,
        SUM(hm.unique_customers) as unique_customers,
        AVG(hm.avg_transaction_value) as avg_transaction_value,

        -- Simple ranking (no advanced analytics)
        RANK() OVER (ORDER BY SUM(hm.total_revenue) DESC) as revenue_rank,

        -- Basic percentage calculations
        SUM(hm.total_revenue) / SUM(SUM(hm.total_revenue)) OVER () * 100 as revenue_percentage,

        -- Limited trend analysis
        SUM(hm.total_revenue) FILTER (WHERE hm.hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '12 hours') as revenue_last_12h,
        SUM(hm.total_revenue) FILTER (WHERE hm.hour_bucket < CURRENT_TIMESTAMP - INTERVAL '12 hours') as revenue_prev_12h

    FROM hourly_metrics hm
    GROUP BY hm.region
),

category_analysis AS (
    SELECT 
        hm.category,
        hm.brand,

        -- Category-level aggregations
        SUM(hm.transaction_count) as category_transactions,
        SUM(hm.total_revenue) as category_revenue,
        SUM(hm.total_units) as category_units,

        -- Limited cross-category analysis
        SUM(hm.total_revenue) / SUM(SUM(hm.total_revenue)) OVER () * 100 as category_revenue_share,
        DENSE_RANK() OVER (ORDER BY SUM(hm.total_revenue) DESC) as category_rank,

        -- Basic growth calculations (limited time series analysis)
        SUM(hm.total_revenue) FILTER (WHERE hm.hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '6 hours') as recent_revenue,
        SUM(hm.total_revenue) FILTER (WHERE hm.hour_bucket < CURRENT_TIMESTAMP - INTERVAL '6 hours') as earlier_revenue

    FROM hourly_metrics hm
    GROUP BY hm.category, hm.brand
)

SELECT 
    CURRENT_TIMESTAMP as dashboard_last_updated,

    -- Overall metrics (basic calculations only)
    (SELECT SUM(total_transactions) FROM regional_performance) as total_transactions_24h,
    (SELECT SUM(total_revenue) FROM regional_performance) as total_revenue_24h,
    (SELECT SUM(unique_customers) FROM regional_performance) as unique_customers_24h,
    (SELECT AVG(avg_transaction_value) FROM regional_performance) as avg_transaction_value_24h,

    -- Regional performance (limited analysis depth)
    (SELECT JSON_AGG(
        JSON_BUILD_OBJECT(
            'region', region,
            'revenue', total_revenue,
            'transactions', total_transactions,
            'rank', revenue_rank,
            'percentage', ROUND(revenue_percentage, 2),
            'trend', CASE 
                WHEN revenue_last_12h > revenue_prev_12h THEN 'up'
                WHEN revenue_last_12h < revenue_prev_12h THEN 'down' 
                ELSE 'flat'
            END
        ) ORDER BY revenue_rank
    ) FROM regional_performance) as regional_data,

    -- Category analysis (basic breakdown only)
    (SELECT JSON_AGG(
        JSON_BUILD_OBJECT(
            'category', category,
            'brand', brand,
            'revenue', category_revenue,
            'units', category_units,
            'share', ROUND(category_revenue_share, 2),
            'rank', category_rank,
            'growth', CASE 
                WHEN recent_revenue > earlier_revenue THEN 'positive'
                WHEN recent_revenue < earlier_revenue THEN 'negative'
                ELSE 'neutral'
            END
        ) ORDER BY category_rank
    ) FROM category_analysis) as category_data,

    -- Payment method distribution (static breakdown)
    (SELECT JSON_BUILD_OBJECT(
        'credit_card', SUM(credit_card_transactions),
        'debit_card', SUM(debit_card_transactions), 
        'cash', SUM(cash_transactions),
        'digital_wallet', SUM(digital_wallet_transactions)
    ) FROM hourly_metrics) as payment_methods,

    -- Customer segment analysis (limited segmentation)
    (SELECT JSON_AGG(segment_row) FROM (
        SELECT JSON_BUILD_OBJECT(
            'segment', customer_segment,
            'transactions', SUM(transaction_count),
            'revenue', SUM(total_revenue),
            'avg_value', AVG(avg_transaction_value)
        ) as segment_row
        FROM hourly_metrics
        GROUP BY customer_segment
    ) segment_summary) as customer_segments;

-- Problems with traditional analytics approaches:
-- 1. Materialized views require manual refresh and don't support real-time updates
-- 2. Limited aggregation and statistical calculation capabilities
-- 3. Complex join operations impact performance with large datasets
-- 4. No support for advanced analytics like time series analysis or forecasting
-- 5. Difficult to handle nested data structures or dynamic schema requirements
-- 6. Pre-aggregation tables require significant maintenance and storage overhead
-- 7. Limited flexibility for ad-hoc analytics queries and dashboard customization
-- 8. No built-in support for complex data transformations or calculated metrics
-- 9. Poor scalability for high-volume real-time analytics workloads
-- 10. Complex query optimization and index management requirements

-- Manual refresh process (resource-intensive and not real-time)
REFRESH MATERIALIZED VIEW CONCURRENTLY current_sales_dashboard;

-- Attempt at real-time hourly summary calculation (performance bottleneck)
WITH real_time_hourly AS (
    SELECT 
        DATE_TRUNC('hour', CURRENT_TIMESTAMP) as current_hour,

        -- Current hour calculations (heavy resource usage)
        COUNT(*) as current_hour_transactions,
        SUM(total_amount) as current_hour_revenue,
        COUNT(DISTINCT customer_id) as current_hour_customers,
        AVG(total_amount) as current_hour_avg_value,

        -- Limited real-time comparisons
        COUNT(*) FILTER (WHERE transaction_date >= DATE_TRUNC('hour', CURRENT_TIMESTAMP)) as this_hour_so_far,
        COUNT(*) FILTER (WHERE transaction_date >= DATE_TRUNC('hour', CURRENT_TIMESTAMP - INTERVAL '1 hour')
                        AND transaction_date < DATE_TRUNC('hour', CURRENT_TIMESTAMP)) as previous_hour_full,

        -- Basic percentage calculations
        SUM(total_amount) FILTER (WHERE transaction_date >= DATE_TRUNC('hour', CURRENT_TIMESTAMP)) as revenue_this_hour,
        SUM(total_amount) FILTER (WHERE transaction_date >= DATE_TRUNC('hour', CURRENT_TIMESTAMP - INTERVAL '1 hour')
                                  AND transaction_date < DATE_TRUNC('hour', CURRENT_TIMESTAMP)) as revenue_previous_hour

    FROM sales_transactions
    WHERE transaction_date >= CURRENT_TIMESTAMP - INTERVAL '2 hours'
),

performance_indicators AS (
    SELECT 
        rth.*,

        -- Limited performance metrics
        CASE 
            WHEN revenue_previous_hour > 0 THEN
                ROUND(((revenue_this_hour - revenue_previous_hour) / revenue_previous_hour) * 100, 2)
            ELSE NULL
        END as revenue_change_percent,

        CASE 
            WHEN previous_hour_full > 0 THEN
                ROUND(((this_hour_so_far - previous_hour_full) / previous_hour_full::FLOAT) * 100, 2)
            ELSE NULL
        END as transaction_change_percent,

        -- Simple trend classification
        CASE 
            WHEN revenue_this_hour > revenue_previous_hour THEN 'increasing'
            WHEN revenue_this_hour < revenue_previous_hour THEN 'decreasing'
            ELSE 'stable'
        END as revenue_trend

    FROM real_time_hourly rth
)

SELECT 
    current_hour,
    current_hour_transactions,
    ROUND(current_hour_revenue::NUMERIC, 2) as current_hour_revenue,
    current_hour_customers,
    ROUND(current_hour_avg_value::NUMERIC, 2) as current_hour_avg_value,

    -- Trend indicators
    revenue_change_percent,
    transaction_change_percent,
    revenue_trend,

    -- Performance assessment (basic classification)
    CASE 
        WHEN revenue_change_percent > 20 THEN 'excellent'
        WHEN revenue_change_percent > 10 THEN 'good'
        WHEN revenue_change_percent > 0 THEN 'positive'
        WHEN revenue_change_percent > -10 THEN 'neutral'
        ELSE 'concerning'
    END as performance_status,

    CURRENT_TIMESTAMP as calculated_at

FROM performance_indicators;

-- Traditional limitations:
-- 1. No real-time dashboard updates - requires manual refresh or polling
-- 2. Limited analytical capabilities compared to specialized analytics databases
-- 3. Performance degrades significantly with large datasets and complex calculations
-- 4. Difficult to implement advanced analytics like cohort analysis or forecasting
-- 5. No support for nested document analysis or flexible schema structures
-- 6. Complex index management and query optimization requirements
-- 7. Limited ability to handle streaming data or event-driven analytics
-- 8. Poor integration with modern visualization tools and BI platforms
-- 9. Significant infrastructure and maintenance overhead for analytics workloads
-- 10. Inflexible aggregation patterns that don't adapt to changing business requirements

MongoDB provides sophisticated real-time analytics capabilities through its powerful Aggregation Framework:

// MongoDB Advanced Real-Time Analytics Dashboard System
const { MongoClient } = require('mongodb');
const { EventEmitter } = require('events');

const client = new MongoClient('mongodb://localhost:27017/?replicaSet=rs0');
const db = client.db('realtime_analytics_system');

// Comprehensive MongoDB Analytics Dashboard Manager
// Extends EventEmitter so real-time refreshes can be pushed to subscribers via emit()
class RealtimeAnalyticsDashboard extends EventEmitter {
  constructor(db, config = {}) {
    super();
    this.db = db;
    this.collections = {
      salesTransactions: db.collection('sales_transactions'),
      customers: db.collection('customers'),
      products: db.collection('products'),
      analyticsCache: db.collection('analytics_cache'),
      dashboardMetrics: db.collection('dashboard_metrics'),
      userSessions: db.collection('user_sessions')
    };

    // Advanced analytics configuration
    this.config = {
      // Real-time processing settings
      enableRealTimeUpdates: config.enableRealTimeUpdates !== false,
      updateInterval: config.updateInterval || 30000, // 30 seconds
      cacheExpiration: config.cacheExpiration || 300000, // 5 minutes

      // Performance optimization
      enableAggregationOptimization: config.enableAggregationOptimization !== false,
      useIndexes: config.useIndexes !== false,
      enableParallelProcessing: config.enableParallelProcessing !== false,
      maxConcurrentPipelines: config.maxConcurrentPipelines || 5,

      // Analytics features
      enableAdvancedMetrics: config.enableAdvancedMetrics !== false,
      enablePredictiveAnalytics: config.enablePredictiveAnalytics || false,
      enableCohortAnalysis: config.enableCohortAnalysis || false,
      enableAnomalyDetection: config.enableAnomalyDetection || false,

      // Dashboard customization
      timeWindows: config.timeWindows || ['1h', '6h', '24h', '7d', '30d'],
      metrics: config.metrics || ['revenue', 'transactions', 'customers', 'conversion'],
      dimensions: config.dimensions || ['region', 'category', 'channel', 'segment'],

      // Data retention
      rawDataRetention: config.rawDataRetention || 90, // days
      aggregatedDataRetention: config.aggregatedDataRetention || 365 // days
    };

    // Analytics state management
    this.dashboardState = {
      lastUpdate: null,
      activeConnections: 0,
      processingStats: {
        totalQueries: 0,
        avgResponseTime: 0,
        cacheHitRate: 0
      }
    };

    // Initialize analytics system
    this.initializeAnalyticsSystem();
  }

  async initializeAnalyticsSystem() {
    console.log('Initializing comprehensive MongoDB real-time analytics system...');

    try {
      // Setup analytics indexes for optimal performance
      await this.setupAnalyticsIndexes();

      // Initialize real-time data processing
      await this.setupRealTimeProcessing();

      // Setup analytics caching layer
      await this.setupAnalyticsCache();

      // Initialize dashboard metrics collection (time-ordered index supports trend queries and cleanup)
      await this.collections.dashboardMetrics.createIndex({ timestamp: 1 }, { background: true });

      // Setup performance monitoring
      await this.setupPerformanceMonitoring();

      console.log('Real-time analytics system initialized successfully');

    } catch (error) {
      console.error('Error initializing analytics system:', error);
      throw error;
    }
  }

  async setupAnalyticsIndexes() {
    console.log('Setting up analytics-optimized indexes...');

    try {
      // Sales transactions indexes for time-series analytics
      await this.collections.salesTransactions.createIndexes([
        { key: { transaction_date: 1, region: 1 }, background: true },
        { key: { transaction_date: 1, product_category: 1 }, background: true },
        { key: { customer_id: 1, transaction_date: 1 }, background: true },
        { key: { region: 1, sales_channel: 1, transaction_date: 1 }, background: true },
        { key: { product_id: 1, transaction_date: 1 }, background: true },
        { key: { payment_method: 1, transaction_date: 1 }, background: true }
      ]);

      // Customer analytics indexes
      await this.collections.customers.createIndexes([
        { key: { customer_segment: 1, registration_date: 1 }, background: true },
        { key: { region: 1, customer_segment: 1 }, background: true },
        { key: { lifetime_value: 1 }, background: true }
      ]);

      // Product catalog indexes
      await this.collections.products.createIndexes([
        { key: { category: 1, subcategory: 1 }, background: true },
        { key: { brand: 1, category: 1 }, background: true },
        { key: { margin_percent: 1 }, background: true }
      ]);

      console.log('Analytics indexes created successfully');

    } catch (error) {
      console.error('Error setting up analytics indexes:', error);
      throw error;
    }
  }

  async generateRealtimeSalesDashboard(timeWindow = '24h', filters = {}) {
    console.log(`Generating real-time sales dashboard for ${timeWindow} window...`);

    try {
      // Record query start time so enrichDashboardResults can report execution time
      this.queryStartTime = Date.now();

      // Calculate time range based on window
      const timeRange = this.calculateTimeRange(timeWindow);

      // Build comprehensive aggregation pipeline for dashboard metrics
      const dashboardPipeline = [
        // Stage 1: Time-based filtering with optional additional filters
        {
          $match: {
            transaction_date: {
              $gte: timeRange.startDate,
              $lte: timeRange.endDate
            },
            ...this.buildDynamicFilters(filters)
          }
        },

        // Stage 2: Join with customer data for segmentation
        {
          $lookup: {
            from: 'customers',
            localField: 'customer_id',
            foreignField: '_id',
            as: 'customer_info'
          }
        },

        // Stage 3: Join with product data for category analysis
        {
          $lookup: {
            from: 'products',
            localField: 'product_id',
            foreignField: '_id',
            as: 'product_info'
          }
        },

        // Stage 4: Flatten joined data and add computed fields
        {
          $addFields: {
            customer: { $arrayElemAt: ['$customer_info', 0] },
            product: { $arrayElemAt: ['$product_info', 0] },
            transaction_hour: { $dateToString: { format: '%Y-%m-%d %H:00:00', date: '$transaction_date' } },
            transaction_day: { $dateToString: { format: '%Y-%m-%d', date: '$transaction_date' } },
            profit_margin: {
              $multiply: [
                { $subtract: ['$unit_price', '$product.unit_cost'] },
                '$quantity'
              ]
            },
            is_weekend: {
              $in: [{ $dayOfWeek: '$transaction_date' }, [1, 7]]
            },
            time_of_day: {
              $switch: {
                branches: [
                  { case: { $lt: [{ $hour: '$transaction_date' }, 6] }, then: 'night' },
                  { case: { $lt: [{ $hour: '$transaction_date' }, 12] }, then: 'morning' },
                  { case: { $lt: [{ $hour: '$transaction_date' }, 18] }, then: 'afternoon' },
                  { case: { $lt: [{ $hour: '$transaction_date' }, 22] }, then: 'evening' }
                ],
                default: 'night'
              }
            }
          }
        },

        // Stage 5: Advanced multi-dimensional aggregations
        {
          $facet: {
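            // Each $facet sub-pipeline below runs over the same filtered documents,
            // returning one result document containing every dashboard section in a single round trip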
            // Overall metrics for the time period
            overallMetrics: [
              {
                $group: {
                  _id: null,
                  totalRevenue: { $sum: '$total_amount' },
                  totalTransactions: { $sum: 1 },
                  totalUnits: { $sum: '$quantity' },
                  uniqueCustomers: { $addToSet: '$customer_id' },
                  totalProfit: { $sum: '$profit_margin' },
                  avgTransactionValue: { $avg: '$total_amount' },
                  avgOrderSize: { $avg: '$quantity' },
                  totalDiscounts: { $sum: '$discount_amount' },
                  totalTax: { $sum: '$tax_amount' },

                  // Advanced statistical metrics
                  revenueStdDev: { $stdDevSamp: '$total_amount' },
                  transactionValuePercentiles: {
                    $push: '$total_amount'
                  }
                }
              },
              {
                $addFields: {
                  uniqueCustomerCount: { $size: '$uniqueCustomers' },
                  avgRevenuePerCustomer: {
                    $divide: ['$totalRevenue', { $size: '$uniqueCustomers' }]
                  },
                  profitMargin: {
                    $multiply: [
                      { $divide: ['$totalProfit', '$totalRevenue'] },
                      100
                    ]
                  },
                  discountRate: {
                    $multiply: [
                      { $divide: ['$totalDiscounts', '$totalRevenue'] },
                      100
                    ]
                  }
                }
              }
            ],

            // Time-based trend analysis (hourly breakdown)
            hourlyTrends: [
              {
                $group: {
                  _id: '$transaction_hour',
                  revenue: { $sum: '$total_amount' },
                  transactions: { $sum: 1 },
                  uniqueCustomers: { $addToSet: '$customer_id' },
                  avgTransactionValue: { $avg: '$total_amount' },
                  profit: { $sum: '$profit_margin' }
                }
              },
              {
                $addFields: {
                  uniqueCustomerCount: { $size: '$uniqueCustomers' },
                  hour: '$_id'
                }
              },
              {
                $sort: { _id: 1 }
              }
            ],

            // Regional performance analysis
            regionalPerformance: [
              {
                $group: {
                  _id: '$region',
                  revenue: { $sum: '$total_amount' },
                  transactions: { $sum: 1 },
                  uniqueCustomers: { $addToSet: '$customer_id' },
                  profit: { $sum: '$profit_margin' },
                  avgTransactionValue: { $avg: '$total_amount' },
                  topPaymentMethods: {
                    $push: '$payment_method'
                  }
                }
              },
              {
                $addFields: {
                  uniqueCustomerCount: { $size: '$uniqueCustomers' },
                  region: '$_id',
                  profitMargin: {
                    $multiply: [{ $divide: ['$profit', '$revenue'] }, 100]
                  }
                }
              },
              {
                $sort: { revenue: -1 }
              }
            ],

            // Product category analysis with advanced metrics
            categoryAnalysis: [
              {
                $group: {
                  _id: {
                    category: '$product.category',
                    subcategory: '$product.subcategory'
                  },
                  revenue: { $sum: '$total_amount' },
                  transactions: { $sum: 1 },
                  totalUnits: { $sum: '$quantity' },
                  profit: { $sum: '$profit_margin' },
                  avgUnitPrice: { $avg: '$unit_price' },
                  uniqueProducts: { $addToSet: '$product_id' },
                  brands: { $addToSet: '$product.brand' }
                }
              },
              {
                $addFields: {
                  category: '$_id.category',
                  subcategory: '$_id.subcategory',
                  uniqueProductCount: { $size: '$uniqueProducts' },
                  uniqueBrandCount: { $size: '$brands' },
                  profitMargin: {
                    $multiply: [{ $divide: ['$profit', '$revenue'] }, 100]
                  },
                  revenuePerProduct: {
                    $divide: ['$revenue', { $size: '$uniqueProducts' }]
                  }
                }
              },
              {
                $sort: { revenue: -1 }
              }
            ],

            // Customer segment performance
            customerSegmentAnalysis: [
              {
                $group: {
                  _id: '$customer.customer_segment',
                  revenue: { $sum: '$total_amount' },
                  transactions: { $sum: 1 },
                  uniqueCustomers: { $addToSet: '$customer_id' },
                  profit: { $sum: '$profit_margin' },
                  avgTransactionValue: { $avg: '$total_amount' },
                  avgAge: { $avg: '$customer.age' },
                  genderDistribution: { $push: '$customer.gender' }
                }
              },
              {
                $addFields: {
                  segment: '$_id',
                  uniqueCustomerCount: { $size: '$uniqueCustomers' },
                  revenuePerCustomer: {
                    $divide: ['$revenue', { $size: '$uniqueCustomers' }]
                  },
                  transactionsPerCustomer: {
                    $divide: ['$transactions', { $size: '$uniqueCustomers' }]
                  }
                }
              },
              {
                $sort: { revenuePerCustomer: -1 }
              }
            ],

            // Payment method and channel analysis
            paymentChannelAnalysis: [
              {
                $group: {
                  _id: {
                    paymentMethod: '$payment_method',
                    salesChannel: '$sales_channel'
                  },
                  revenue: { $sum: '$total_amount' },
                  transactions: { $sum: 1 },
                  avgTransactionValue: { $avg: '$total_amount' },
                  profit: { $sum: '$profit_margin' }
                }
              },
              {
                $addFields: {
                  paymentMethod: '$_id.paymentMethod',
                  salesChannel: '$_id.salesChannel'
                }
              },
              {
                $sort: { revenue: -1 }
              }
            ],

            // Time-of-day and weekend analysis
            temporalAnalysis: [
              {
                $group: {
                  _id: {
                    timeOfDay: '$time_of_day',
                    isWeekend: '$is_weekend'
                  },
                  revenue: { $sum: '$total_amount' },
                  transactions: { $sum: 1 },
                  avgTransactionValue: { $avg: '$total_amount' },
                  uniqueCustomers: { $addToSet: '$customer_id' }
                }
              },
              {
                $addFields: {
                  timeOfDay: '$_id.timeOfDay',
                  isWeekend: '$_id.isWeekend',
                  uniqueCustomerCount: { $size: '$uniqueCustomers' }
                }
              },
              {
                $sort: { revenue: -1 }
              }
            ]
          }
        }
      ];

      // Execute the aggregation pipeline
      const dashboardResults = await this.collections.salesTransactions
        .aggregate(dashboardPipeline, {
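          // allowDiskUse lets large group/sort stages spill to disk; the hint steers the planner to the time/region index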
          allowDiskUse: true,
          hint: { transaction_date: 1, region: 1 }
        })
        .toArray();

      // Process and enrich the results
      const enrichedResults = await this.enrichDashboardResults(dashboardResults[0], timeWindow);

      // Cache the results for performance
      await this.cacheDashboardResults(enrichedResults, timeWindow, filters);

      // Update dashboard metrics
      await this.updateDashboardMetrics(enrichedResults);

      return enrichedResults;

    } catch (error) {
      console.error('Error generating real-time sales dashboard:', error);
      throw error;
    }
  }

  async enrichDashboardResults(results, timeWindow) {
    console.log('Enriching dashboard results with advanced analytics...');

    try {
      const overallMetrics = results.overallMetrics[0] || {};

      // Calculate percentiles for transaction values
      if (overallMetrics.transactionValuePercentiles) {
        const sortedValues = overallMetrics.transactionValuePercentiles.sort((a, b) => a - b);

        overallMetrics.percentiles = {
          p25: this.calculatePercentile(sortedValues, 25),
          p50: this.calculatePercentile(sortedValues, 50),
          p75: this.calculatePercentile(sortedValues, 75),
          p90: this.calculatePercentile(sortedValues, 90),
          p95: this.calculatePercentile(sortedValues, 95)
        };

        delete overallMetrics.transactionValuePercentiles; // Remove raw data
      }

      // Add growth calculations (comparing with previous period)
      const previousPeriodMetrics = await this.getPreviousPeriodMetrics(timeWindow);
      if (previousPeriodMetrics) {
        overallMetrics.growth = {
          revenueGrowth: this.calculateGrowthRate(overallMetrics.totalRevenue, previousPeriodMetrics.totalRevenue),
          transactionGrowth: this.calculateGrowthRate(overallMetrics.totalTransactions, previousPeriodMetrics.totalTransactions),
          customerGrowth: this.calculateGrowthRate(overallMetrics.uniqueCustomerCount, previousPeriodMetrics.uniqueCustomerCount),
          avgValueGrowth: this.calculateGrowthRate(overallMetrics.avgTransactionValue, previousPeriodMetrics.avgTransactionValue)
        };
      }

      // Add revenue distribution analysis
      if (results.regionalPerformance) {
        const totalRevenue = overallMetrics.totalRevenue || 0;
        results.regionalPerformance = results.regionalPerformance.map(region => ({
          ...region,
          revenueShare: (region.revenue / totalRevenue * 100).toFixed(2),
          customerDensity: (region.uniqueCustomerCount / region.transactions * 100).toFixed(2)
        }));
      }

      // Add category performance rankings
      if (results.categoryAnalysis) {
        results.categoryAnalysis = results.categoryAnalysis.map((category, index) => ({
          ...category,
          rank: index + 1,
          performanceScore: this.calculateCategoryPerformanceScore(category)
        }));
      }

      // Add temporal insights
      if (results.temporalAnalysis) {
        results.temporalAnalysis = results.temporalAnalysis.map(period => ({
          ...period,
          efficiency: (period.revenue / period.transactions).toFixed(2),
          customerEngagement: (period.uniqueCustomerCount / period.transactions * 100).toFixed(2)
        }));
      }

      // Add dashboard metadata
      const enrichedResults = {
        ...results,
        metadata: {
          timeWindow: timeWindow,
          generatedAt: new Date(),
          dataFreshness: this.calculateDataFreshness(),
          performanceMetrics: {
            queryExecutionTime: Date.now() - this.queryStartTime,
            dataPoints: overallMetrics.totalTransactions,
            cacheStatus: 'fresh'
          }
        },
        overallMetrics: overallMetrics
      };

      return enrichedResults;

    } catch (error) {
      console.error('Error enriching dashboard results:', error);
      throw error;
    }
  }

  calculatePercentile(sortedArray, percentile) {
    const index = (percentile / 100) * (sortedArray.length - 1);
    const lower = Math.floor(index);
    const upper = Math.ceil(index);
    const weight = index % 1;

    return (sortedArray[lower] * (1 - weight) + sortedArray[upper] * weight).toFixed(2);
  }

  calculateGrowthRate(current, previous) {
    if (!previous || previous === 0) return null;
    return (((current - previous) / previous) * 100).toFixed(2);
  }

  calculateCategoryPerformanceScore(category) {
    // Weighted scoring based on revenue, profit margin, and transaction volume
    const revenueScore = Math.min(category.revenue / 10000, 100); // Scale revenue
    const profitScore = Math.max(0, Math.min(category.profitMargin || 0, 100));
    const volumeScore = Math.min(category.transactions / 100, 100);

    return ((revenueScore * 0.5) + (profitScore * 0.3) + (volumeScore * 0.2)).toFixed(2);
  }

  buildDynamicFilters(filters) {
    const mongoFilters = {};

    if (filters.regions && filters.regions.length > 0) {
      mongoFilters.region = { $in: filters.regions };
    }

    if (filters.categories && filters.categories.length > 0) {
      mongoFilters.product_category = { $in: filters.categories };
    }

    if (filters.paymentMethods && filters.paymentMethods.length > 0) {
      mongoFilters.payment_method = { $in: filters.paymentMethods };
    }

    if (filters.minAmount || filters.maxAmount) {
      mongoFilters.total_amount = {};
      if (filters.minAmount) mongoFilters.total_amount.$gte = filters.minAmount;
      if (filters.maxAmount) mongoFilters.total_amount.$lte = filters.maxAmount;
    }

    return mongoFilters;
  }

  calculateTimeRange(timeWindow) {
    const endDate = new Date();
    let startDate = new Date();

    switch (timeWindow) {
      case '1h':
        startDate.setHours(endDate.getHours() - 1);
        break;
      case '6h':
        startDate.setHours(endDate.getHours() - 6);
        break;
      case '24h':
        startDate.setDate(endDate.getDate() - 1);
        break;
      case '7d':
        startDate.setDate(endDate.getDate() - 7);
        break;
      case '30d':
        startDate.setDate(endDate.getDate() - 30);
        break;
      default:
        startDate.setDate(endDate.getDate() - 1);
    }

    return { startDate, endDate };
  }

  async generateCustomerLifetimeValueAnalysis() {
    console.log('Generating advanced customer lifetime value analysis...');

    try {
      const clvAnalysisPipeline = [
        // Stage 1: Join transactions with customer data
        {
          $lookup: {
            from: 'customers',
            localField: 'customer_id',
            foreignField: '_id',
            as: 'customer'
          }
        },

        // Stage 2: Flatten customer data
        {
          $addFields: {
            customer: { $arrayElemAt: ['$customer', 0] }
          }
        },

        // Stage 3: Calculate customer metrics
        {
          $group: {
            _id: '$customer_id',
            customerInfo: { $first: '$customer' },
            firstPurchase: { $min: '$transaction_date' },
            lastPurchase: { $max: '$transaction_date' },
            totalRevenue: { $sum: '$total_amount' },
            totalProfit: { 
              $sum: { 
                $multiply: [
                  { $subtract: ['$unit_price', { $ifNull: ['$unit_cost', 0] }] },
                  '$quantity'
                ]
              }
            },
            totalTransactions: { $sum: 1 },
            totalUnits: { $sum: '$quantity' },
            avgOrderValue: { $avg: '$total_amount' },
            purchaseFrequency: { $sum: 1 },
            categories: { $addToSet: '$product_category' },
            paymentMethods: { $push: '$payment_method' },
            channels: { $addToSet: '$sales_channel' }
          }
        },

        // Stage 4: Calculate advanced CLV metrics
        {
          $addFields: {
            customerLifespanDays: {
              $divide: [
                { $subtract: ['$lastPurchase', '$firstPurchase'] },
                1000 * 60 * 60 * 24
              ]
            },
            avgDaysBetweenPurchases: {
              $cond: {
                if: { $gt: ['$totalTransactions', 1] },
                then: {
                  $divide: [
                    { $divide: [
                      { $subtract: ['$lastPurchase', '$firstPurchase'] },
                      1000 * 60 * 60 * 24
                    ]},
                    { $subtract: ['$totalTransactions', 1] }
                  ]
                },
                else: null
              }
            },
            categoryDiversity: { $size: '$categories' },
            channelDiversity: { $size: '$channels' },
            profitMargin: {
              $multiply: [
                { $divide: ['$totalProfit', '$totalRevenue'] },
                100
              ]
            }
          }
        },

        // Stage 5: Calculate predicted CLV (simplified model)
        {
          $addFields: {
            predictedMonthlyValue: {
              $cond: {
                if: { $and: [
                  { $gt: ['$avgDaysBetweenPurchases', 0] },
                  { $lte: ['$avgDaysBetweenPurchases', 365] }
                ]},
                then: {
                  $multiply: [
                    '$avgOrderValue',
                    { $divide: [30, '$avgDaysBetweenPurchases'] }
                  ]
                },
                else: 0
              }
            },
            predictedAnnualValue: {
              $cond: {
                if: { $and: [
                  { $gt: ['$avgDaysBetweenPurchases', 0] },
                  { $lte: ['$avgDaysBetweenPurchases', 365] }
                ]},
                then: {
                  $multiply: [
                    '$avgOrderValue',
                    { $divide: [365, '$avgDaysBetweenPurchases'] }
                  ]
                },
                else: '$totalRevenue'
              }
            }
          }
        },

        // Stage 6: Customer segmentation
        {
          $addFields: {
            valueSegment: {
              $switch: {
                branches: [
                  { case: { $gte: ['$totalRevenue', 5000] }, then: 'high_value' },
                  { case: { $gte: ['$totalRevenue', 1000] }, then: 'medium_value' },
                  { case: { $gte: ['$totalRevenue', 100] }, then: 'low_value' }
                ],
                default: 'minimal_value'
              }
            },
            frequencySegment: {
              $switch: {
                branches: [
                  { case: { $gte: ['$totalTransactions', 20] }, then: 'very_frequent' },
                  { case: { $gte: ['$totalTransactions', 10] }, then: 'frequent' },
                  { case: { $gte: ['$totalTransactions', 5] }, then: 'occasional' }
                ],
                default: 'rare'
              }
            },
            recencySegment: {
              $switch: {
                branches: [
                  { 
                    case: { 
                      $gte: [
                        '$lastPurchase',
                        { $subtract: [new Date(), 30 * 24 * 60 * 60 * 1000] }
                      ]
                    },
                    then: 'recent'
                  },
                  {
                    case: {
                      $gte: [
                        '$lastPurchase',
                        { $subtract: [new Date(), 90 * 24 * 60 * 60 * 1000] }
                      ]
                    },
                    then: 'moderate'
                  }
                ],
                default: 'dormant'
              }
            }
          }
        },

        // Stage 7: Final CLV calculation and risk assessment
        {
          $addFields: {
            rfmScore: {
              $add: [
                {
                  $switch: {
                    branches: [
                      { case: { $eq: ['$recencySegment', 'recent'] }, then: 4 },
                      { case: { $eq: ['$recencySegment', 'moderate'] }, then: 2 }
                    ],
                    default: 1
                  }
                },
                {
                  $switch: {
                    branches: [
                      { case: { $eq: ['$frequencySegment', 'very_frequent'] }, then: 4 },
                      { case: { $eq: ['$frequencySegment', 'frequent'] }, then: 3 },
                      { case: { $eq: ['$frequencySegment', 'occasional'] }, then: 2 }
                    ],
                    default: 1
                  }
                },
                {
                  $switch: {
                    branches: [
                      { case: { $eq: ['$valueSegment', 'high_value'] }, then: 4 },
                      { case: { $eq: ['$valueSegment', 'medium_value'] }, then: 3 },
                      { case: { $eq: ['$valueSegment', 'low_value'] }, then: 2 }
                    ],
                    default: 1
                  }
                }
              ]
            },
            churnRisk: {
              $switch: {
                branches: [
                  {
                    case: {
                      $and: [
                        { $eq: ['$recencySegment', 'dormant'] },
                        { $lt: ['$avgDaysBetweenPurchases', 60] }
                      ]
                    },
                    then: 'high'
                  },
                  {
                    case: {
                      $and: [
                        { $eq: ['$recencySegment', 'moderate'] },
                        { $gt: ['$avgDaysBetweenPurchases', 30] }
                      ]
                    },
                    then: 'medium'
                  }
                ],
                default: 'low'
              }
            }
          }
        },

        // Stage 8: Sort by predicted annual value
        {
          $sort: { predictedAnnualValue: -1, totalRevenue: -1 }
        }
      ];

      const clvResults = await this.collections.salesTransactions
        .aggregate(clvAnalysisPipeline, { allowDiskUse: true })
        .toArray();

      return {
        customerAnalysis: clvResults,
        summary: await this.generateCLVSummary(clvResults),
        generatedAt: new Date()
      };

    } catch (error) {
      console.error('Error generating CLV analysis:', error);
      throw error;
    }
  }

  async generateCLVSummary(clvResults) {
    const totalCustomers = clvResults.length;
    const totalValue = clvResults.reduce((sum, customer) => sum + customer.totalRevenue, 0);
    const totalPredictedValue = clvResults.reduce((sum, customer) => sum + (customer.predictedAnnualValue || 0), 0);

    return {
      totalCustomers,
      totalHistoricalValue: totalValue,
      totalPredictedAnnualValue: totalPredictedValue,
      averageCustomerValue: totalValue / totalCustomers,
      averagePredictedValue: totalPredictedValue / totalCustomers,
      segmentBreakdown: {
        highValue: clvResults.filter(c => c.valueSegment === 'high_value').length,
        mediumValue: clvResults.filter(c => c.valueSegment === 'medium_value').length,
        lowValue: clvResults.filter(c => c.valueSegment === 'low_value').length,
        minimalValue: clvResults.filter(c => c.valueSegment === 'minimal_value').length
      },
      churnRiskDistribution: {
        high: clvResults.filter(c => c.churnRisk === 'high').length,
        medium: clvResults.filter(c => c.churnRisk === 'medium').length,
        low: clvResults.filter(c => c.churnRisk === 'low').length
      },
      topPerformers: clvResults.slice(0, 10).map(customer => ({
        customerId: customer._id,
        totalRevenue: customer.totalRevenue,
        predictedAnnualValue: customer.predictedAnnualValue,
        rfmScore: customer.rfmScore,
        segment: customer.valueSegment
      }))
    };
  }

  async cacheDashboardResults(results, timeWindow, filters) {
    try {
      const cacheKey = `dashboard_${timeWindow}_${JSON.stringify(filters)}`;

      await this.collections.analyticsCache.replaceOne(
        { cacheKey },
        {
          cacheKey,
          results,
          createdAt: new Date(),
          expiresAt: new Date(Date.now() + this.config.cacheExpiration)
        },
        { upsert: true }
      );
    } catch (error) {
      console.warn('Error caching dashboard results:', error.message);
    }
  }

  async getPreviousPeriodMetrics(timeWindow) {
    try {
      // Calculate previous period time range
      const previousTimeRange = this.calculatePreviousPeriodRange(timeWindow);

      const previousMetrics = await this.collections.salesTransactions.aggregate([
        {
          $match: {
            transaction_date: {
              $gte: previousTimeRange.startDate,
              $lte: previousTimeRange.endDate
            }
          }
        },
        {
          $group: {
            _id: null,
            totalRevenue: { $sum: '$total_amount' },
            totalTransactions: { $sum: 1 },
            uniqueCustomers: { $addToSet: '$customer_id' },
            avgTransactionValue: { $avg: '$total_amount' }
          }
        },
        {
          $addFields: {
            uniqueCustomerCount: { $size: '$uniqueCustomers' }
          }
        }
      ]).toArray();

      return previousMetrics[0] || null;

    } catch (error) {
      console.warn('Error getting previous period metrics:', error.message);
      return null;
    }
  }

  calculatePreviousPeriodRange(timeWindow) {
    const currentEndDate = new Date();
    let currentStartDate = new Date();

    // Calculate current period duration
    switch (timeWindow) {
      case '1h':
        currentStartDate.setHours(currentEndDate.getHours() - 1);
        break;
      case '6h':
        currentStartDate.setHours(currentEndDate.getHours() - 6);
        break;
      case '24h':
        currentStartDate.setDate(currentEndDate.getDate() - 1);
        break;
      case '7d':
        currentStartDate.setDate(currentEndDate.getDate() - 7);
        break;
      case '30d':
        currentStartDate.setDate(currentEndDate.getDate() - 30);
        break;
      default:
        currentStartDate.setDate(currentEndDate.getDate() - 1);
    }

    // Calculate previous period (same duration, preceding the current period)
    const periodDuration = currentEndDate.getTime() - currentStartDate.getTime();
    const previousEndDate = new Date(currentStartDate.getTime());
    const previousStartDate = new Date(currentStartDate.getTime() - periodDuration);

    return { startDate: previousStartDate, endDate: previousEndDate };
  }

  calculateDataFreshness() {
    // Calculate how fresh the data is based on the latest transaction
    const now = new Date();
    // This would typically query for the latest transaction timestamp
    // For demo purposes, assuming data is fresh within the last 5 minutes
    return 'fresh'; // 'fresh', 'stale', 'outdated'
  }

  async updateDashboardMetrics(results) {
    try {
      await this.collections.dashboardMetrics.insertOne({
        timestamp: new Date(),
        metrics: {
          totalRevenue: results.overallMetrics?.totalRevenue || 0,
          totalTransactions: results.overallMetrics?.totalTransactions || 0,
          uniqueCustomers: results.overallMetrics?.uniqueCustomerCount || 0,
          avgTransactionValue: results.overallMetrics?.avgTransactionValue || 0
        },
        performance: results.metadata?.performanceMetrics || {}
      });

      // Cleanup old metrics (keep last 1000 entries)
      const totalCount = await this.collections.dashboardMetrics.countDocuments();
      if (totalCount > 1000) {
        const oldestRecords = await this.collections.dashboardMetrics
          .find()
          .sort({ timestamp: 1 })
          .limit(totalCount - 1000)
          .toArray();

        const idsToDelete = oldestRecords.map(record => record._id);
        await this.collections.dashboardMetrics.deleteMany({
          _id: { $in: idsToDelete }
        });
      }
    } catch (error) {
      console.warn('Error updating dashboard metrics:', error.message);
    }
  }
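  // Hedged alternative (not part of the original design): if dashboardMetrics were
  // created as a capped collection, MongoDB would discard the oldest documents
  // automatically and the manual countDocuments/deleteMany cleanup above would
  // not be needed. A minimal provisioning sketch, with an assumed collection name:
  //
  //   await this.db.createCollection('dashboard_metrics', {
  //     capped: true,
  //     size: 10 * 1024 * 1024, // 10 MB storage ceiling for the collection
  //     max: 1000               // retain at most the 1000 most recent documents
  //   });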

  async setupRealTimeProcessing() {
    if (!this.config.enableRealTimeUpdates) return;

    console.log('Setting up real-time dashboard processing...');

    // Setup interval for dashboard updates
    setInterval(async () => {
      try {
        // Generate fresh dashboard data
        const dashboardData = await this.generateRealtimeSalesDashboard('1h');

        // Emit real-time updates (in a real implementation, this would push to connected clients)
        this.emit('dashboardUpdate', {
          timestamp: new Date(),
          data: dashboardData
        });

        this.dashboardState.lastUpdate = new Date();

      } catch (error) {
        console.error('Error in real-time processing:', error);
      }
    }, this.config.updateInterval);
  }

  async setupAnalyticsCache() {
    console.log('Setting up analytics caching layer...');

    try {
      // Create TTL index for cache expiration
      await this.collections.analyticsCache.createIndex(
        { expiresAt: 1 },
        { expireAfterSeconds: 0, background: true }
      );

      console.log('Analytics cache configured successfully');

    } catch (error) {
      console.error('Error setting up analytics cache:', error);
      throw error;
    }
  }

  async setupPerformanceMonitoring() {
    console.log('Setting up performance monitoring...');

    setInterval(async () => {
      try {
        // Monitor query performance
        const queryStats = await this.db.stats();

        // Update processing statistics
        this.dashboardState.processingStats.totalQueries++;

        // Log performance metrics
        console.log('Dashboard performance metrics:', {
          activeConnections: this.dashboardState.activeConnections,
          avgResponseTime: this.dashboardState.processingStats.avgResponseTime,
          cacheHitRate: this.dashboardState.processingStats.cacheHitRate,
          dbStats: {
            collections: queryStats.collections,
            dataSize: queryStats.dataSize,
            indexSize: queryStats.indexSize
          }
        });

      } catch (error) {
        console.warn('Error in performance monitoring:', error);
      }
    }, 60000); // Every minute
  }
}

// Benefits of MongoDB Advanced Real-Time Analytics:
// - Real-time dashboard updates with minimal latency through change streams integration
// - Complex multi-dimensional aggregations with advanced statistical calculations
// - Flexible data transformation and enrichment during query execution
// - Sophisticated customer segmentation and lifetime value analysis
// - Built-in performance optimization with intelligent caching strategies
// - Scalable architecture supporting high-volume analytics workloads
// - Native aggregation framework execution without separate ETL or analytics infrastructure
// - Advanced temporal analysis with time-series data processing capabilities
// - Comprehensive business intelligence with predictive analytics support
// - SQL-familiar analytics operations through QueryLeaf integration

module.exports = {
  RealtimeAnalyticsDashboard
};

Understanding MongoDB Analytics Architecture

Advanced Aggregation Pipeline Design and Performance Optimization

Implement sophisticated analytics patterns for production MongoDB deployments:

// Production-ready MongoDB analytics with enterprise-grade optimization
class EnterpriseAnalyticsPlatform extends RealtimeAnalyticsDashboard {
  constructor(db, enterpriseConfig) {
    super(db, enterpriseConfig);

    this.enterpriseConfig = {
      ...enterpriseConfig,
      enableDistributedAnalytics: true,
      enableMachineLearning: true,
      enablePredictiveModeling: true,
      enableDataGovernance: true,
      enableComplianceReporting: true,
      enableAdvancedVisualization: true
    };

    this.setupEnterpriseFeatures();
    this.initializeMLPipelines();
    this.setupDataGovernance();
  }

  async implementAdvancedTimeSeriesAnalytics() {
    console.log('Implementing advanced time series analytics with forecasting...');

    const timeSeriesConfig = {
      // Time series aggregation strategies
      temporalAggregation: {
        enableSeasonalityDetection: true,
        enableTrendAnalysis: true,
        enableAnomalyDetection: true,
        forecastHorizon: 30 // days
      },

      // Statistical modeling
      statisticalModeling: {
        enableMovingAverages: true,
        enableExponentialSmoothing: true,
        enableRegressionAnalysis: true,
        confidenceIntervals: true
      },

      // Performance optimization
      performanceOptimization: {
        enableTimeSeriesCollections: true,
        optimizedIndexes: true,
        compressionStrategies: true,
        partitioningSchemes: true
      }
    };

    return await this.deployTimeSeriesAnalytics(timeSeriesConfig);
  }

  async setupAdvancedMLPipelines() {
    console.log('Setting up machine learning pipelines for predictive analytics...');

    const mlPipelineConfig = {
      // Customer behavior prediction
      customerBehaviorML: {
        churnPredictionModel: true,
        clvPredictionModel: true,
        recommendationEngine: true,
        segmentationOptimization: true
      },

      // Sales forecasting
      salesForecastingML: {
        demandForecasting: true,
        inventoryOptimization: true,
        priceOptimization: true,
        seasonalityModeling: true
      },

      // Real-time decision making
      realTimeDecisionEngine: {
        dynamicPricing: true,
        inventoryAlerts: true,
        customerTargeting: true,
        fraudDetection: true
      }
    };

    return await this.implementMLPipelines(mlPipelineConfig);
  }
}
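
The timeSeriesConfig above only flags capabilities such as enableTimeSeriesCollections; the MongoDB-side setup is not shown in the class itself. The following minimal sketch shows what that option could translate to on MongoDB 5.0 or newer. The collection name sales_metrics_ts, its field names, and the 90-day retention period are illustrative assumptions rather than part of the platform class above.

// Hypothetical provisioning helper for a time series collection (assumed names)
async function createSalesTimeSeriesCollection(db) {
  await db.createCollection('sales_metrics_ts', {
    timeseries: {
      timeField: 'timestamp',   // when each measurement was recorded
      metaField: 'dimensions',  // e.g. { region, category, channel } tags
      granularity: 'minutes'    // bucketing hint matching the ingest rate
    },
    expireAfterSeconds: 60 * 60 * 24 * 90 // retain roughly 90 days of measurements
  });

  // Secondary index on a metadata subfield to support dimensional queries
  await db.collection('sales_metrics_ts').createIndex({ 'dimensions.region': 1 });
}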

SQL-Style Analytics with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB analytics and dashboard operations:

-- QueryLeaf advanced real-time analytics with SQL-familiar syntax for MongoDB

-- Configure comprehensive analytics dashboard with real-time updates
CONFIGURE ANALYTICS_DASHBOARD
SET real_time_updates = true,
    update_interval_seconds = 30,
    cache_expiration_minutes = 5,
    enable_predictive_analytics = true,
    enable_advanced_metrics = true,
    enable_cohort_analysis = true,
    time_windows = ['1h', '6h', '24h', '7d', '30d'],
    dimensions = ['region', 'category', 'channel', 'segment'];

-- Advanced real-time sales dashboard with comprehensive metrics and analytics
WITH sales_analytics AS (
  -- Primary transaction data with enriched customer and product information
  SELECT 
    st.transaction_id,
    st.transaction_date,
    st.customer_id,
    st.product_id,
    st.total_amount,
    st.quantity,
    st.unit_price,
    st.discount_amount,
    st.tax_amount,
    st.payment_method,
    st.sales_channel,
    st.region,

    -- Customer enrichment
    c.customer_segment,
    c.age,
    c.gender,
    c.city,
    c.state,
    c.registration_date,
    c.lifetime_value,

    -- Product enrichment
    p.category,
    p.subcategory,
    p.brand,
    p.unit_cost,
    p.list_price,
    p.margin_percent,

    -- Calculated fields for analytics
    (st.unit_price - p.unit_cost) * st.quantity as profit_margin,
    st.total_amount - st.discount_amount - st.tax_amount as net_revenue,

    -- Time-based dimensions
    DATE_TRUNC('hour', st.transaction_date) as transaction_hour,
    DATE_TRUNC('day', st.transaction_date) as transaction_day,
    EXTRACT(hour FROM st.transaction_date) as hour_of_day,
    EXTRACT(dow FROM st.transaction_date) as day_of_week,
    EXTRACT(dow FROM st.transaction_date) IN (0, 6) as is_weekend,

    -- Time categorization
    CASE 
      WHEN EXTRACT(hour FROM st.transaction_date) BETWEEN 6 AND 11 THEN 'morning'
      WHEN EXTRACT(hour FROM st.transaction_date) BETWEEN 12 AND 17 THEN 'afternoon'  
      WHEN EXTRACT(hour FROM st.transaction_date) BETWEEN 18 AND 21 THEN 'evening'
      ELSE 'night'
    END as time_of_day_category,

    -- Customer lifecycle stage
    CASE 
      WHEN st.transaction_date - c.registration_date <= INTERVAL '30 days' THEN 'new_customer'
      WHEN st.transaction_date - c.registration_date <= INTERVAL '365 days' THEN 'established_customer'
      ELSE 'loyal_customer'
    END as customer_lifecycle_stage

  FROM sales_transactions st
  JOIN customers c ON st.customer_id = c.customer_id
  JOIN products p ON st.product_id = p.product_id
  WHERE st.transaction_date >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
),

-- Overall metrics with advanced statistical calculations
overall_metrics AS (
  SELECT 
    -- Basic volume metrics
    COUNT(*) as total_transactions,
    COUNT(DISTINCT customer_id) as unique_customers,
    SUM(total_amount) as total_revenue,
    SUM(net_revenue) as total_net_revenue,
    SUM(quantity) as total_units_sold,
    SUM(profit_margin) as total_profit,
    SUM(discount_amount) as total_discounts,
    SUM(tax_amount) as total_tax,

    -- Advanced statistical metrics
    AVG(total_amount) as avg_transaction_value,
    STDDEV(total_amount) as transaction_value_stddev,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_amount) as q1_transaction_value,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_amount) as median_transaction_value,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_amount) as q3_transaction_value,
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY total_amount) as p90_transaction_value,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_amount) as p95_transaction_value,

    -- Derived metrics
    AVG(quantity) as avg_order_size,
    AVG(profit_margin) as avg_profit_per_transaction,
    AVG(total_amount / NULLIF(quantity, 0)) as avg_price_per_unit,

    -- Efficiency metrics
    SUM(total_amount) / COUNT(DISTINCT customer_id) as revenue_per_customer,
    COUNT(*) / COUNT(DISTINCT customer_id) as transactions_per_customer,
    SUM(profit_margin) / SUM(total_amount) * 100 as overall_profit_margin_percent,
    SUM(discount_amount) / SUM(total_amount) * 100 as overall_discount_rate_percent,

    -- Time-based metrics
    MIN(transaction_date) as earliest_transaction,
    MAX(transaction_date) as latest_transaction,
    COUNT(DISTINCT transaction_hour) as active_hours,
    COUNT(DISTINCT transaction_day) as active_days

  FROM sales_analytics
),

-- Temporal trend analysis with pattern detection
temporal_trends AS (
  SELECT 
    transaction_hour,

    -- Hourly volume metrics
    COUNT(*) as hourly_transactions,
    COUNT(DISTINCT customer_id) as hourly_unique_customers,
    SUM(total_amount) as hourly_revenue,
    SUM(profit_margin) as hourly_profit,
    AVG(total_amount) as hourly_avg_transaction_value,
    SUM(quantity) as hourly_units_sold,

    -- Hour-over-hour growth calculations
    LAG(SUM(total_amount)) OVER (ORDER BY transaction_hour) as prev_hour_revenue,
    LAG(COUNT(*)) OVER (ORDER BY transaction_hour) as prev_hour_transactions,

    -- Moving averages for trend smoothing
    AVG(SUM(total_amount)) OVER (
      ORDER BY transaction_hour 
      ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) as revenue_3h_moving_avg,

    AVG(COUNT(*)) OVER (
      ORDER BY transaction_hour 
      ROWS BETWEEN 2 PRECEDING AND CURRENT ROW  
    ) as transactions_3h_moving_avg,

    -- Peak detection
    RANK() OVER (ORDER BY SUM(total_amount) DESC) as revenue_rank,
    RANK() OVER (ORDER BY COUNT(*) DESC) as transaction_rank,

    -- Performance classification
    CASE 
      WHEN SUM(total_amount) > AVG(SUM(total_amount)) OVER () * 1.5 THEN 'peak'
      WHEN SUM(total_amount) > AVG(SUM(total_amount)) OVER () THEN 'above_average'
      WHEN SUM(total_amount) > AVG(SUM(total_amount)) OVER () * 0.5 THEN 'below_average'
      ELSE 'low'
    END as performance_tier

  FROM sales_analytics
  GROUP BY transaction_hour
  ORDER BY transaction_hour
),

-- Regional performance analysis with competitive ranking
regional_performance AS (
  SELECT 
    region,

    -- Regional volume metrics
    COUNT(*) as region_transactions,
    COUNT(DISTINCT customer_id) as region_unique_customers,
    COUNT(DISTINCT product_id) as region_unique_products,
    SUM(total_amount) as region_revenue,
    SUM(profit_margin) as region_profit,
    SUM(quantity) as region_units_sold,

    -- Regional efficiency metrics
    AVG(total_amount) as region_avg_transaction_value,
    SUM(total_amount) / COUNT(DISTINCT customer_id) as region_revenue_per_customer,
    COUNT(*) / COUNT(DISTINCT customer_id) as region_frequency_per_customer,
    SUM(profit_margin) / SUM(total_amount) * 100 as region_profit_margin_percent,

    -- Market share calculations
    SUM(total_amount) / SUM(SUM(total_amount)) OVER () * 100 as region_revenue_share,
    COUNT(*) / SUM(COUNT(*)) OVER () * 100 as region_transaction_share,
    COUNT(DISTINCT customer_id) / SUM(COUNT(DISTINCT customer_id)) OVER () * 100 as region_customer_share,

    -- Regional ranking
    RANK() OVER (ORDER BY SUM(total_amount) DESC) as revenue_rank,
    RANK() OVER (ORDER BY SUM(profit_margin) DESC) as profit_rank,
    RANK() OVER (ORDER BY COUNT(DISTINCT customer_id) DESC) as customer_base_rank,

    -- Customer density and engagement
    COUNT(DISTINCT customer_id) / COUNT(*) * 100 as customer_density_percent,
    AVG(
      CASE WHEN customer_lifecycle_stage = 'new_customer' THEN 1 ELSE 0 END
    ) * 100 as new_customer_percent,

    -- Channel and payment preferences
    MODE() WITHIN GROUP (ORDER BY sales_channel) as dominant_sales_channel,
    MODE() WITHIN GROUP (ORDER BY payment_method) as dominant_payment_method,

    -- Performance indicators
    CASE 
      WHEN SUM(total_amount) / SUM(SUM(total_amount)) OVER () > 0.2 THEN 'market_leader'
      WHEN SUM(total_amount) / SUM(SUM(total_amount)) OVER () > 0.1 THEN 'major_market'
      WHEN SUM(total_amount) / SUM(SUM(total_amount)) OVER () > 0.05 THEN 'secondary_market'
      ELSE 'emerging_market'
    END as market_position

  FROM sales_analytics
  GROUP BY region
),

-- Advanced product category analysis with profitability insights
category_analysis AS (
  SELECT 
    category,
    subcategory,
    brand,

    -- Category performance metrics
    COUNT(*) as category_transactions,
    COUNT(DISTINCT customer_id) as category_customers,
    COUNT(DISTINCT product_id) as category_products,
    SUM(total_amount) as category_revenue,
    SUM(profit_margin) as category_profit,
    SUM(quantity) as category_units,

    -- Category efficiency and profitability
    AVG(total_amount) as category_avg_transaction_value,
    AVG(profit_margin) as category_avg_profit_per_transaction,
    SUM(profit_margin) / SUM(total_amount) * 100 as category_profit_margin_percent,
    AVG(unit_price) as category_avg_unit_price,
    AVG(margin_percent) as category_avg_product_margin,

    -- Market positioning
    SUM(total_amount) / SUM(SUM(total_amount)) OVER () * 100 as category_revenue_share,
    COUNT(*) / SUM(COUNT(*)) OVER () * 100 as category_transaction_share,

    -- Category rankings
    RANK() OVER (ORDER BY SUM(total_amount) DESC) as revenue_rank,
    RANK() OVER (ORDER BY SUM(profit_margin) DESC) as profit_rank,
    RANK() OVER (ORDER BY COUNT(*) DESC) as volume_rank,
    RANK() OVER (ORDER BY SUM(profit_margin) / SUM(total_amount) DESC) as margin_rank,

    -- Customer engagement
    COUNT(DISTINCT customer_id) / COUNT(*) * 100 as customer_diversity_percent,
    SUM(total_amount) / COUNT(DISTINCT customer_id) as revenue_per_customer,
    COUNT(*) / COUNT(DISTINCT customer_id) as repeat_purchase_rate,

    -- Product performance distribution
    AVG(list_price) as category_avg_list_price,
    AVG(unit_cost) as category_avg_unit_cost,
    STDDEV(unit_price) as category_price_variance,

    -- Growth and trend indicators
    SUM(total_amount) FILTER (WHERE transaction_date >= CURRENT_TIMESTAMP - INTERVAL '6 hours') as recent_6h_revenue,
    SUM(total_amount) FILTER (WHERE transaction_date < CURRENT_TIMESTAMP - INTERVAL '6 hours') as earlier_18h_revenue,

    -- Performance classification
    CASE 
      WHEN SUM(profit_margin) / SUM(total_amount) > 0.3 THEN 'high_margin'
      WHEN SUM(profit_margin) / SUM(total_amount) > 0.15 THEN 'medium_margin'
      ELSE 'low_margin'
    END as profitability_tier,

    CASE 
      WHEN SUM(total_amount) / SUM(SUM(total_amount)) OVER () > 0.15 THEN 'star_category'
      WHEN SUM(total_amount) / SUM(SUM(total_amount)) OVER () > 0.05 THEN 'growth_category'
      ELSE 'niche_category'
    END as strategic_category

  FROM sales_analytics
  GROUP BY category, subcategory, brand
),

-- Customer segmentation analysis with behavioral insights
customer_segment_analysis AS (
  SELECT 
    customer_segment,
    customer_lifecycle_stage,

    -- Segment volume metrics
    COUNT(*) as segment_transactions,
    COUNT(DISTINCT customer_id) as segment_customers,
    SUM(total_amount) as segment_revenue,
    SUM(profit_margin) as segment_profit,
    SUM(quantity) as segment_units,

    -- Segment behavior analysis
    AVG(total_amount) as segment_avg_transaction_value,
    SUM(total_amount) / COUNT(DISTINCT customer_id) as segment_revenue_per_customer,
    COUNT(*) / COUNT(DISTINCT customer_id) as segment_transactions_per_customer,
    AVG(age) as segment_avg_age,

    -- Demographic breakdown
    AVG(CASE WHEN gender = 'male' THEN 1 ELSE 0 END) * 100 as male_percentage,
    AVG(CASE WHEN gender = 'female' THEN 1 ELSE 0 END) * 100 as female_percentage,
    COUNT(DISTINCT city) as cities_represented,
    COUNT(DISTINCT state) as states_represented,

    -- Channel preferences  
    AVG(CASE WHEN sales_channel = 'online' THEN 1 ELSE 0 END) * 100 as online_preference_percent,
    AVG(CASE WHEN sales_channel = 'retail' THEN 1 ELSE 0 END) * 100 as retail_preference_percent,
    AVG(CASE WHEN sales_channel = 'mobile' THEN 1 ELSE 0 END) * 100 as mobile_preference_percent,

    -- Payment behavior
    AVG(CASE WHEN payment_method = 'credit_card' THEN 1 ELSE 0 END) * 100 as credit_card_usage_percent,
    AVG(CASE WHEN payment_method = 'digital_wallet' THEN 1 ELSE 0 END) * 100 as digital_wallet_usage_percent,

    -- Temporal behavior
    AVG(CASE WHEN is_weekend THEN 1 ELSE 0 END) * 100 as weekend_shopping_percent,
    MODE() WITHIN GROUP (ORDER BY time_of_day_category) as preferred_shopping_time,

    -- Value and profitability
    SUM(profit_margin) / SUM(total_amount) * 100 as segment_profit_margin_percent,
    AVG(lifetime_value) as segment_avg_lifetime_value,

    -- Segment rankings
    RANK() OVER (ORDER BY SUM(total_amount) DESC) as revenue_rank,
    RANK() OVER (ORDER BY SUM(total_amount) / COUNT(DISTINCT customer_id) DESC) as value_per_customer_rank,
    RANK() OVER (ORDER BY COUNT(*) / COUNT(DISTINCT customer_id) DESC) as engagement_rank,

    -- Segment classification
    CASE 
      WHEN SUM(total_amount) / COUNT(DISTINCT customer_id) > 1000 THEN 'high_value_segment'
      WHEN SUM(total_amount) / COUNT(DISTINCT customer_id) > 500 THEN 'medium_value_segment'
      ELSE 'opportunity_segment'
    END as value_classification

  FROM sales_analytics
  GROUP BY customer_segment, customer_lifecycle_stage
),

-- Payment method and channel effectiveness analysis  
channel_payment_analysis AS (
  SELECT 
    sales_channel,
    payment_method,

    -- Channel-payment combination metrics
    COUNT(*) as combination_transactions,
    SUM(total_amount) as combination_revenue,
    AVG(total_amount) as combination_avg_value,
    SUM(profit_margin) as combination_profit,
    COUNT(DISTINCT customer_id) as combination_customers,

    -- Effectiveness metrics
    SUM(total_amount) / COUNT(DISTINCT customer_id) as revenue_per_customer,
    COUNT(*) / COUNT(DISTINCT customer_id) as transactions_per_customer,
    SUM(profit_margin) / SUM(total_amount) * 100 as combination_profit_margin,

    -- Market share within channel
    SUM(total_amount) / SUM(SUM(total_amount)) OVER (PARTITION BY sales_channel) * 100 as payment_share_in_channel,

    -- Market share within payment method
    SUM(total_amount) / SUM(SUM(total_amount)) OVER (PARTITION BY payment_method) * 100 as channel_share_in_payment,

    -- Overall market share
    SUM(total_amount) / SUM(SUM(total_amount)) OVER () * 100 as overall_market_share,

    -- Customer behavior insights
    AVG(age) as combination_avg_customer_age,
    AVG(CASE WHEN customer_segment = 'premium' THEN 1 ELSE 0 END) * 100 as premium_customer_percent,
    AVG(CASE WHEN is_weekend THEN 1 ELSE 0 END) * 100 as weekend_usage_percent,

    -- Performance ranking
    RANK() OVER (ORDER BY SUM(total_amount) DESC) as revenue_rank,
    RANK() OVER (ORDER BY COUNT(*) DESC) as volume_rank,
    RANK() OVER (ORDER BY SUM(profit_margin) / SUM(total_amount) DESC) as profitability_rank

  FROM sales_analytics  
  GROUP BY sales_channel, payment_method
),

-- Advanced growth and trend analysis
growth_trend_analysis AS (
  SELECT 
    -- Current period metrics (last 6 hours)
    SUM(total_amount) FILTER (WHERE transaction_date >= CURRENT_TIMESTAMP - INTERVAL '6 hours') as current_6h_revenue,
    COUNT(*) FILTER (WHERE transaction_date >= CURRENT_TIMESTAMP - INTERVAL '6 hours') as current_6h_transactions,
    COUNT(DISTINCT customer_id) FILTER (WHERE transaction_date >= CURRENT_TIMESTAMP - INTERVAL '6 hours') as current_6h_customers,

    -- Previous period metrics (6-12 hours ago)
    SUM(total_amount) FILTER (
      WHERE transaction_date >= CURRENT_TIMESTAMP - INTERVAL '12 hours' 
      AND transaction_date < CURRENT_TIMESTAMP - INTERVAL '6 hours'
    ) as previous_6h_revenue,
    COUNT(*) FILTER (
      WHERE transaction_date >= CURRENT_TIMESTAMP - INTERVAL '12 hours' 
      AND transaction_date < CURRENT_TIMESTAMP - INTERVAL '6 hours'
    ) as previous_6h_transactions,
    COUNT(DISTINCT customer_id) FILTER (
      WHERE transaction_date >= CURRENT_TIMESTAMP - INTERVAL '12 hours' 
      AND transaction_date < CURRENT_TIMESTAMP - INTERVAL '6 hours'
    ) as previous_6h_customers,

    -- Earlier period metrics (12-18 hours ago) for trend detection
    SUM(total_amount) FILTER (
      WHERE transaction_date >= CURRENT_TIMESTAMP - INTERVAL '18 hours' 
      AND transaction_date < CURRENT_TIMESTAMP - INTERVAL '12 hours'
    ) as earlier_6h_revenue,
    COUNT(*) FILTER (
      WHERE transaction_date >= CURRENT_TIMESTAMP - INTERVAL '18 hours' 
      AND transaction_date < CURRENT_TIMESTAMP - INTERVAL '12 hours'
    ) as earlier_6h_transactions,

    -- Peak analysis
    MAX(total_amount) as peak_transaction_value,
    MIN(total_amount) as min_transaction_value,
    MODE() WITHIN GROUP (ORDER BY EXTRACT(hour FROM transaction_date)) as peak_hour,
    MODE() WITHIN GROUP (ORDER BY region) as dominant_region,
    MODE() WITHIN GROUP (ORDER BY category) as dominant_category

  FROM sales_analytics
)

-- Final dashboard results with comprehensive analytics
SELECT 
  CURRENT_TIMESTAMP as dashboard_generated_at,

  -- Overall performance summary
  JSON_OBJECT(
    'total_transactions', om.total_transactions,
    'total_revenue', ROUND(om.total_revenue::NUMERIC, 2),
    'total_net_revenue', ROUND(om.total_net_revenue::NUMERIC, 2), 
    'total_profit', ROUND(om.total_profit::NUMERIC, 2),
    'unique_customers', om.unique_customers,
    'avg_transaction_value', ROUND(om.avg_transaction_value::NUMERIC, 2),
    'median_transaction_value', ROUND(om.median_transaction_value::NUMERIC, 2),
    'profit_margin_percent', ROUND((om.total_profit / om.total_revenue * 100)::NUMERIC, 2),
    'discount_rate_percent', ROUND((om.total_discounts / om.total_revenue * 100)::NUMERIC, 2),
    'revenue_per_customer', ROUND(om.revenue_per_customer::NUMERIC, 2),
    'transactions_per_customer', ROUND(om.transactions_per_customer::NUMERIC, 2)
  ) as overall_metrics,

  -- Temporal trends with growth indicators
  (SELECT JSON_AGG(
    JSON_OBJECT(
      'hour', transaction_hour,
      'revenue', ROUND(hourly_revenue::NUMERIC, 2),
      'transactions', hourly_transactions,
      'customers', hourly_unique_customers,
      'avg_value', ROUND(hourly_avg_transaction_value::NUMERIC, 2),
      'units_sold', hourly_units_sold,
      'growth_rate_revenue', 
        CASE 
          WHEN prev_hour_revenue > 0 THEN
            ROUND(((hourly_revenue - prev_hour_revenue) / prev_hour_revenue * 100)::NUMERIC, 2)
          ELSE NULL
        END,
      'growth_rate_transactions',
        CASE 
          WHEN prev_hour_transactions > 0 THEN  
            ROUND(((hourly_transactions - prev_hour_transactions) / prev_hour_transactions::FLOAT * 100)::NUMERIC, 2)
          ELSE NULL
        END,
      'revenue_3h_moving_avg', ROUND(revenue_3h_moving_avg::NUMERIC, 2),
      'performance_tier', performance_tier,
      'revenue_rank', revenue_rank
    ) ORDER BY transaction_hour
  ) FROM temporal_trends) as hourly_trends,

  -- Regional performance with competitive analysis  
  (SELECT JSON_AGG(
    JSON_OBJECT(
      'region', region,
      'revenue', ROUND(region_revenue::NUMERIC, 2),
      'revenue_share', ROUND(region_revenue_share::NUMERIC, 2),
      'transactions', region_transactions,
      'customers', region_unique_customers,
      'products', region_unique_products,
      'avg_transaction_value', ROUND(region_avg_transaction_value::NUMERIC, 2),
      'revenue_per_customer', ROUND(region_revenue_per_customer::NUMERIC, 2),
      'profit_margin_percent', ROUND(region_profit_margin_percent::NUMERIC, 2),
      'revenue_rank', revenue_rank,
      'profit_rank', profit_rank,
      'customer_base_rank', customer_base_rank,
      'market_position', market_position,
      'dominant_channel', dominant_sales_channel,
      'dominant_payment', dominant_payment_method,
      'customer_density_percent', ROUND(customer_density_percent::NUMERIC, 2),
      'new_customer_percent', ROUND(new_customer_percent::NUMERIC, 2)
    ) ORDER BY revenue_rank
  ) FROM regional_performance) as regional_analysis,

  -- Category analysis with profitability insights
  (SELECT JSON_AGG(
    JSON_OBJECT(
      'category', category,
      'subcategory', subcategory,
      'brand', brand,
      'revenue', ROUND(category_revenue::NUMERIC, 2),
      'revenue_share', ROUND(category_revenue_share::NUMERIC, 2),
      'transactions', category_transactions,
      'customers', category_customers,
      'products', category_products,
      'profit_margin_percent', ROUND(category_profit_margin_percent::NUMERIC, 2),
      'avg_transaction_value', ROUND(category_avg_transaction_value::NUMERIC, 2),
      'revenue_per_customer', ROUND(revenue_per_customer::NUMERIC, 2),
      'revenue_rank', revenue_rank,
      'profit_rank', profit_rank,
      'margin_rank', margin_rank,
      'profitability_tier', profitability_tier,
      'strategic_category', strategic_category,
      'growth_indicator',
        CASE 
          WHEN earlier_18h_revenue > 0 THEN
            CASE 
              WHEN recent_6h_revenue > earlier_18h_revenue THEN 'growing'
              WHEN recent_6h_revenue < earlier_18h_revenue THEN 'declining'
              ELSE 'stable'
            END
          ELSE 'insufficient_data'
        END
    ) ORDER BY revenue_rank
  ) FROM category_analysis) as category_performance,

  -- Customer segment insights
  (SELECT JSON_AGG(
    JSON_OBJECT(
      'segment', customer_segment,
      'lifecycle_stage', customer_lifecycle_stage,
      'revenue', ROUND(segment_revenue::NUMERIC, 2),
      'customers', segment_customers,
      'transactions', segment_transactions,
      'revenue_per_customer', ROUND(segment_revenue_per_customer::NUMERIC, 2),
      'transactions_per_customer', ROUND(segment_transactions_per_customer::NUMERIC, 2),
      'avg_age', ROUND(segment_avg_age::NUMERIC, 1),
      'avg_lifetime_value', ROUND(segment_avg_lifetime_value::NUMERIC, 2),
      'profit_margin_percent', ROUND(segment_profit_margin_percent::NUMERIC, 2),
      'male_percentage', ROUND(male_percentage::NUMERIC, 1),
      'female_percentage', ROUND(female_percentage::NUMERIC, 1),
      'online_preference_percent', ROUND(online_preference_percent::NUMERIC, 1),
      'weekend_shopping_percent', ROUND(weekend_shopping_percent::NUMERIC, 1),
      'preferred_shopping_time', preferred_shopping_time,
      'value_classification', value_classification,
      'revenue_rank', revenue_rank
    ) ORDER BY revenue_rank
  ) FROM customer_segment_analysis) as segment_analysis,

  -- Channel and payment method effectiveness
  (SELECT JSON_AGG(
    JSON_OBJECT(
      'channel', sales_channel,
      'payment_method', payment_method,
      'revenue', ROUND(combination_revenue::NUMERIC, 2),
      'transactions', combination_transactions,
      'customers', combination_customers,
      'avg_transaction_value', ROUND(combination_avg_value::NUMERIC, 2),
      'revenue_per_customer', ROUND(revenue_per_customer::NUMERIC, 2),
      'profit_margin_percent', ROUND(combination_profit_margin::NUMERIC, 2),
      'overall_market_share', ROUND(overall_market_share::NUMERIC, 2),
      'payment_share_in_channel', ROUND(payment_share_in_channel::NUMERIC, 2),
      'channel_share_in_payment', ROUND(channel_share_in_payment::NUMERIC, 2),
      'premium_customer_percent', ROUND(premium_customer_percent::NUMERIC, 1),
      'weekend_usage_percent', ROUND(weekend_usage_percent::NUMERIC, 1),
      'revenue_rank', revenue_rank,
      'profitability_rank', profitability_rank
    ) ORDER BY revenue_rank
  ) FROM channel_payment_analysis) as channel_payment_effectiveness,

  -- Growth trends and momentum indicators
  (SELECT JSON_OBJECT(
    'current_6h_revenue', ROUND(current_6h_revenue::NUMERIC, 2),
    'current_6h_transactions', current_6h_transactions,
    'current_6h_customers', current_6h_customers,
    'previous_6h_revenue', ROUND(previous_6h_revenue::NUMERIC, 2),
    'previous_6h_transactions', previous_6h_transactions,
    'previous_6h_customers', previous_6h_customers,
    'revenue_growth_rate',
      CASE 
        WHEN previous_6h_revenue > 0 THEN
          ROUND(((current_6h_revenue - previous_6h_revenue) / previous_6h_revenue * 100)::NUMERIC, 2)
        ELSE NULL
      END,
    'transaction_growth_rate',
      CASE 
        WHEN previous_6h_transactions > 0 THEN
          ROUND(((current_6h_transactions - previous_6h_transactions) / previous_6h_transactions::FLOAT * 100)::NUMERIC, 2)
        ELSE NULL
      END,
    'customer_growth_rate',
      CASE 
        WHEN previous_6h_customers > 0 THEN
          ROUND(((current_6h_customers - previous_6h_customers) / previous_6h_customers::FLOAT * 100)::NUMERIC, 2)
        ELSE NULL
      END,
    'momentum_indicator',
      CASE 
        WHEN previous_6h_revenue > 0 AND earlier_6h_revenue > 0 THEN
          CASE 
            WHEN current_6h_revenue > previous_6h_revenue AND previous_6h_revenue > earlier_6h_revenue THEN 'accelerating'
            WHEN current_6h_revenue > previous_6h_revenue AND previous_6h_revenue <= earlier_6h_revenue THEN 'recovering'
            WHEN current_6h_revenue <= previous_6h_revenue AND previous_6h_revenue > earlier_6h_revenue THEN 'slowing'
            WHEN current_6h_revenue <= previous_6h_revenue AND previous_6h_revenue <= earlier_6h_revenue THEN 'declining'
            ELSE 'stable'
          END
        ELSE 'insufficient_data'
      END,
    'peak_transaction_value', ROUND(peak_transaction_value::NUMERIC, 2),
    'min_transaction_value', ROUND(min_transaction_value::NUMERIC, 2),
    'peak_hour', peak_hour,
    'dominant_region', dominant_region,
    'dominant_category', dominant_category
  ) FROM growth_trend_analysis) as growth_trends,

  -- Dashboard metadata and performance indicators
  JSON_OBJECT(
    'data_freshness_minutes',
      ROUND(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - (SELECT MAX(transaction_date) FROM sales_analytics)) / 60, 1),
    'analysis_time_window', '24 hours',
    'total_data_points', (SELECT total_transactions FROM overall_metrics),
    'analysis_depth', 'comprehensive',
    'last_updated', CURRENT_TIMESTAMP,
    'performance_indicators', JSON_OBJECT(
      'query_complexity', 'high',
      'data_completeness', 'complete',
      'analytical_accuracy', 'high',
      'real_time_capability', true
    )
  ) as dashboard_metadata

FROM overall_metrics om
CROSS JOIN growth_trend_analysis gta;

-- Advanced customer lifetime value analysis with SQL aggregations
WITH customer_transaction_history AS (
  SELECT 
    st.customer_id,
    c.customer_segment,
    c.registration_date,
    c.age,
    c.gender,
    c.city,
    c.state,

    -- Transaction aggregations
    MIN(st.transaction_date) as first_purchase_date,
    MAX(st.transaction_date) as last_purchase_date,
    COUNT(*) as total_transactions,
    SUM(st.total_amount) as lifetime_revenue,
    AVG(st.total_amount) as avg_transaction_value,
    SUM(st.quantity) as total_units_purchased,
    SUM(st.profit_margin) as lifetime_profit,

    -- Temporal behavior
    COUNT(DISTINCT DATE_TRUNC('month', st.transaction_date)) as active_months,
    COUNT(DISTINCT p.category) as categories_purchased,
    COUNT(DISTINCT st.sales_channel) as channels_used,
    COUNT(DISTINCT st.payment_method) as payment_methods_used,

    -- Calculated metrics
    EXTRACT(DAYS FROM MAX(st.transaction_date) - MIN(st.transaction_date)) as customer_lifespan_days,
    AVG(EXTRACT(DAYS FROM st.transaction_date - LAG(st.transaction_date) OVER (
      PARTITION BY st.customer_id ORDER BY st.transaction_date
    ))) as avg_days_between_purchases

  FROM sales_transactions st
  JOIN customers c ON st.customer_id = c.customer_id
  JOIN products p ON st.product_id = p.product_id
  WHERE st.transaction_date >= CURRENT_TIMESTAMP - INTERVAL '365 days'
  GROUP BY st.customer_id, c.customer_segment, c.registration_date, c.age, c.gender, c.city, c.state
),

customer_segmentation_and_prediction AS (
  SELECT 
    cth.*,

    -- CLV calculations
    CASE 
      WHEN avg_days_between_purchases > 0 AND avg_days_between_purchases <= 365 THEN
        avg_transaction_value * (365 / avg_days_between_purchases)
      ELSE lifetime_revenue
    END as predicted_annual_value,

    CASE 
      WHEN avg_days_between_purchases > 0 AND avg_days_between_purchases <= 365 THEN
        avg_transaction_value * (30 / avg_days_between_purchases)
      ELSE lifetime_revenue / GREATEST(active_months, 1)
    END as predicted_monthly_value,

    -- RFM scoring
    NTILE(5) OVER (ORDER BY last_purchase_date DESC) as recency_score,
    NTILE(5) OVER (ORDER BY total_transactions DESC) as frequency_score, 
    NTILE(5) OVER (ORDER BY lifetime_revenue DESC) as monetary_score,

    -- Customer lifecycle classification
    CASE 
      WHEN last_purchase_date >= CURRENT_TIMESTAMP - INTERVAL '30 days' THEN 'active'
      WHEN last_purchase_date >= CURRENT_TIMESTAMP - INTERVAL '90 days' THEN 'at_risk'
      WHEN last_purchase_date >= CURRENT_TIMESTAMP - INTERVAL '180 days' THEN 'dormant'
      ELSE 'churned'
    END as lifecycle_status,

    -- Value segmentation
    CASE 
      WHEN lifetime_revenue >= 5000 THEN 'vip'
      WHEN lifetime_revenue >= 1000 THEN 'high_value'
      WHEN lifetime_revenue >= 500 THEN 'medium_value'
      WHEN lifetime_revenue >= 100 THEN 'low_value'
      ELSE 'minimal_value'
    END as value_segment,

    -- Engagement classification
    CASE 
      WHEN total_transactions >= 20 THEN 'highly_engaged'
      WHEN total_transactions >= 10 THEN 'engaged'
      WHEN total_transactions >= 5 THEN 'moderately_engaged'
      ELSE 'low_engagement'
    END as engagement_level,

    -- Churn risk assessment
    CASE 
      WHEN last_purchase_date < CURRENT_TIMESTAMP - INTERVAL '90 days' AND avg_days_between_purchases < 60 THEN 'high_risk'
      WHEN last_purchase_date < CURRENT_TIMESTAMP - INTERVAL '60 days' AND avg_days_between_purchases < 45 THEN 'medium_risk'
      WHEN last_purchase_date < CURRENT_TIMESTAMP - INTERVAL '30 days' AND total_transactions > 5 THEN 'low_risk'
      ELSE 'minimal_risk'
    END as churn_risk

  FROM customer_transaction_history cth
)

SELECT 
  -- Customer lifetime value summary
  JSON_OBJECT(
    'total_customers_analyzed', COUNT(*),
    'total_historical_revenue', SUM(lifetime_revenue),
    'total_predicted_annual_revenue', SUM(predicted_annual_value),
    'avg_customer_lifetime_value', AVG(lifetime_revenue),
    'avg_predicted_annual_value', AVG(predicted_annual_value),
    'avg_customer_lifespan_days', AVG(customer_lifespan_days),
    'avg_purchase_frequency_days', AVG(avg_days_between_purchases)
  ) as clv_summary,

  -- Value segment distribution
  (SELECT JSON_OBJECT_AGG(
    value_segment,
    JSON_OBJECT(
      'customer_count', COUNT(*),
      'total_revenue', SUM(lifetime_revenue),
      'avg_revenue_per_customer', AVG(lifetime_revenue),
      'avg_predicted_annual_value', AVG(predicted_annual_value),
      'avg_transactions', AVG(total_transactions),
      'revenue_share_percent', ROUND(SUM(lifetime_revenue) / SUM(SUM(lifetime_revenue)) OVER () * 100, 2)
    )
  ) FROM customer_segmentation_and_prediction GROUP BY value_segment) as value_segments,

  -- Lifecycle status analysis
  (SELECT JSON_OBJECT_AGG(
    lifecycle_status,
    JSON_OBJECT(
      'customer_count', COUNT(*),
      'total_revenue', SUM(lifetime_revenue),
      'avg_revenue_per_customer', AVG(lifetime_revenue),
      'avg_recency_score', AVG(recency_score),
      'avg_frequency_score', AVG(frequency_score),
      'avg_monetary_score', AVG(monetary_score)
    )
  ) FROM customer_segmentation_and_prediction GROUP BY lifecycle_status) as lifecycle_analysis,

  -- Churn risk assessment
  (SELECT JSON_OBJECT_AGG(
    churn_risk,
    JSON_OBJECT(
      'customer_count', COUNT(*),
      'at_risk_revenue', SUM(lifetime_revenue),
      'avg_predicted_annual_loss', AVG(predicted_annual_value),
      'high_value_customers_at_risk', COUNT(*) FILTER (WHERE value_segment IN ('vip', 'high_value'))
    )
  ) FROM customer_segmentation_and_prediction GROUP BY churn_risk) as churn_risk_analysis,

  -- Top performers
  (SELECT JSON_AGG(
    JSON_OBJECT(
      'customer_id', customer_id,
      'lifetime_revenue', ROUND(lifetime_revenue::NUMERIC, 2),
      'predicted_annual_value', ROUND(predicted_annual_value::NUMERIC, 2),
      'total_transactions', total_transactions,
      'customer_lifespan_days', customer_lifespan_days,
      'avg_transaction_value', ROUND(avg_transaction_value::NUMERIC, 2),
      'value_segment', value_segment,
      'engagement_level', engagement_level,
      'rfm_combined_score', recency_score + frequency_score + monetary_score,
      'churn_risk', churn_risk
    ) ORDER BY lifetime_revenue DESC LIMIT 20
  )) as top_customers,

  CURRENT_TIMESTAMP as analysis_generated_at

FROM customer_segmentation_and_prediction;

-- Real-time analytics performance monitoring
CREATE VIEW analytics_performance_dashboard AS
WITH performance_metrics AS (
  SELECT 
    -- Query performance indicators
    COUNT(*) as total_dashboard_queries_24h,
    AVG(query_duration_ms) as avg_query_duration_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY query_duration_ms) as p95_query_duration_ms,
    MAX(query_duration_ms) as max_query_duration_ms,

    -- Data freshness metrics
    AVG(EXTRACT(EPOCH FROM query_timestamp - data_timestamp) / 60) as avg_data_age_minutes,
    MAX(EXTRACT(EPOCH FROM query_timestamp - data_timestamp) / 60) as max_data_age_minutes,

    -- Cache performance
    COUNT(*) FILTER (WHERE cache_hit = true) as cache_hits,
    COUNT(*) FILTER (WHERE cache_hit = false) as cache_misses,

    -- Resource utilization
    AVG(memory_usage_mb) as avg_memory_usage_mb,
    MAX(memory_usage_mb) as peak_memory_usage_mb,
    AVG(cpu_utilization_percent) as avg_cpu_utilization,

    -- Error rates
    COUNT(*) FILTER (WHERE query_status = 'error') as query_errors,
    COUNT(*) FILTER (WHERE query_status = 'timeout') as query_timeouts

  FROM analytics_query_log
  WHERE query_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
)

SELECT 
  CURRENT_TIMESTAMP as dashboard_time,

  -- Performance indicators
  total_dashboard_queries_24h,
  ROUND(avg_query_duration_ms::NUMERIC, 2) as avg_response_time_ms,
  ROUND(p95_query_duration_ms::NUMERIC, 2) as p95_response_time_ms,
  ROUND(max_query_duration_ms::NUMERIC, 2) as max_response_time_ms,

  -- Data quality indicators
  ROUND(avg_data_age_minutes::NUMERIC, 2) as avg_data_freshness_minutes,
  ROUND(max_data_age_minutes::NUMERIC, 2) as max_data_age_minutes,

  -- Cache effectiveness
  cache_hits,
  cache_misses,
  CASE 
    WHEN (cache_hits + cache_misses) > 0 THEN
      ROUND((cache_hits::FLOAT / (cache_hits + cache_misses) * 100)::NUMERIC, 2)
    ELSE 0
  END as cache_hit_rate_percent,

  -- System resource utilization
  ROUND(avg_memory_usage_mb::NUMERIC, 2) as avg_memory_mb,
  ROUND(peak_memory_usage_mb::NUMERIC, 2) as peak_memory_mb,
  ROUND(avg_cpu_utilization::NUMERIC, 2) as avg_cpu_percent,

  -- Reliability indicators
  query_errors,
  query_timeouts,
  CASE 
    WHEN total_dashboard_queries_24h > 0 THEN
      ROUND(((total_dashboard_queries_24h - query_errors - query_timeouts)::FLOAT / total_dashboard_queries_24h * 100)::NUMERIC, 2)
    ELSE 100
  END as success_rate_percent,

  -- Health status
  CASE 
    WHEN avg_query_duration_ms > 5000 OR (query_errors + query_timeouts) > total_dashboard_queries_24h * 0.05 THEN 'critical'
    WHEN avg_query_duration_ms > 2000 OR (query_errors + query_timeouts) > total_dashboard_queries_24h * 0.02 THEN 'warning'
    ELSE 'healthy'
  END as system_health,

  -- Performance recommendations
  ARRAY[
    CASE WHEN avg_query_duration_ms > 3000 THEN 'Consider query optimization or caching improvements' END,
    CASE WHEN (cache_hits + cache_misses) > 0 AND cache_hits::FLOAT / (cache_hits + cache_misses) * 100 < 70 THEN 'Cache hit rate is low - review caching strategy' END,
    CASE WHEN avg_data_age_minutes > 10 THEN 'Data freshness may impact real-time insights' END,
    CASE WHEN peak_memory_usage_mb > 1000 THEN 'High memory usage detected - consider resource scaling' END
  ]::TEXT[] as recommendations

FROM performance_metrics;

-- QueryLeaf provides comprehensive MongoDB analytics capabilities:
-- 1. SQL-familiar syntax for complex aggregation pipelines and dashboard queries
-- 2. Advanced real-time analytics with multi-dimensional data processing
-- 3. Customer lifetime value analysis with predictive modeling capabilities
-- 4. Sophisticated segmentation and behavioral analysis through SQL constructs
-- 5. Real-time performance monitoring with comprehensive health indicators
-- 6. Advanced temporal trend analysis with growth rate calculations
-- 7. Production-ready analytics operations with caching and optimization
-- 8. Integration with MongoDB's native aggregation framework optimizations
-- 9. Comprehensive business intelligence with statistical analysis support
-- 10. Enterprise-grade analytics dashboards accessible through familiar SQL patterns

Best Practices for Production Analytics Implementation

Analytics Pipeline Design and Optimization

Essential principles for effective MongoDB analytics dashboard deployment:

  1. Data Modeling Strategy: Design analytics-optimized schemas with appropriate indexing strategies for time-series and dimensional queries
  2. Aggregation Optimization: Implement efficient aggregation pipelines with proper stage ordering and memory-conscious operations (see the pipeline sketch after this list)
  3. Caching Architecture: Deploy intelligent caching layers that balance data freshness with query performance requirements
  4. Real-Time Processing: Configure change stream integration for live dashboard updates without performance degradation
  5. Scalability Design: Architect analytics systems that can handle growing data volumes and increasing concurrent user loads
  6. Performance Monitoring: Implement comprehensive monitoring that tracks query performance, resource utilization, and user experience metrics
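
As a hedged illustration of points 1 and 2, the sketch below shows the general shape of an analytics-friendly pipeline: a selective $match stage runs first so an index on transaction_date can limit the scanned documents, a $project stage trims fields before grouping, and allowDiskUse handles large groupings. The collection and field names mirror the earlier examples, while the index definition and the $dateTrunc operator (MongoDB 5.0+) are assumptions to adapt to the actual workload.

// Sketch: index-backed, memory-conscious hourly revenue aggregation
async function hourlyRevenueByRegion(db, hoursBack = 24) {
  const since = new Date(Date.now() - hoursBack * 60 * 60 * 1000);

  // Supporting index for the $match/$sort shape below (assumed, tune per workload)
  await db.collection('sales_transactions').createIndex({ transaction_date: -1, region: 1 });

  return db.collection('sales_transactions').aggregate([
    // Filter early so the index on transaction_date limits the scanned documents
    { $match: { transaction_date: { $gte: since } } },

    // Keep only the fields the grouping needs to reduce per-document memory
    { $project: { region: 1, total_amount: 1, transaction_date: 1 } },

    // Group by region and hour bucket
    {
      $group: {
        _id: {
          region: '$region',
          hour: { $dateTrunc: { date: '$transaction_date', unit: 'hour' } }
        },
        revenue: { $sum: '$total_amount' },
        transactions: { $sum: 1 }
      }
    },

    { $sort: { '_id.hour': 1, revenue: -1 } }
  ], { allowDiskUse: true }).toArray();
}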

Enterprise Analytics Deployment

Optimize analytics platforms for production enterprise environments:

  1. Distributed Processing: Implement distributed analytics processing that leverages MongoDB's sharding capabilities for massive datasets (a minimal sharding sketch follows this list)
  2. Security Integration: Ensure analytics operations meet enterprise security requirements with proper access controls and data governance
  3. Compliance Framework: Design analytics systems that support regulatory requirements for data retention, audit trails, and reporting
  4. Operational Integration: Integrate analytics platforms with existing monitoring, alerting, and business intelligence infrastructure
  5. Multi-Tenant Architecture: Support multiple business units and use cases with scalable, isolated analytics environments
  6. Cost Optimization: Monitor and optimize analytics resource usage and processing costs for efficient operations
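
For point 1, a minimal sharding sketch is shown below. It assumes a sharded cluster is already deployed; the analytics database name, collection name, and shard key are illustrative and should be derived from real query and write patterns rather than copied as-is.

// Hypothetical sharding setup for a high-volume analytics collection
async function shardAnalyticsCollection(client) {
  const admin = client.db('admin');

  // Enable sharding on the (assumed) analytics database
  await admin.command({ enableSharding: 'analytics' });

  // Compound shard key: distributes writes by region while keeping
  // time-range queries reasonably targeted; evaluate hashed keys as well
  await admin.command({
    shardCollection: 'analytics.sales_transactions',
    key: { region: 1, transaction_date: 1 }
  });
}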

Conclusion

MongoDB's Aggregation Framework provides sophisticated real-time analytics capabilities that enable powerful dashboard creation, complex data processing, and comprehensive business intelligence without the complexity and infrastructure overhead of traditional analytics platforms. Native aggregation operations offer scalable, efficient, and flexible data processing directly within the operational database.

Key MongoDB Analytics benefits include:

  • Real-Time Processing: Immediate insight generation from operational data without ETL delays or separate analytics infrastructure
  • Advanced Aggregations: Sophisticated multi-stage data processing with statistical calculations, temporal analysis, and predictive modeling
  • Flexible Analytics: Dynamic dashboard creation with customizable metrics, dimensions, and filtering capabilities
  • Scalable Architecture: Native MongoDB integration that scales efficiently with data growth and analytical complexity
  • Performance Optimization: Built-in optimization features with intelligent caching, indexing, and query planning
  • SQL Accessibility: Familiar SQL-style analytics operations through QueryLeaf for accessible business intelligence development

Whether you're building executive dashboards, operational analytics, customer insights platforms, or real-time monitoring systems, MongoDB aggregation with QueryLeaf's familiar SQL interface provides the foundation for powerful, scalable, and efficient analytics solutions.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB aggregation pipelines while providing SQL-familiar syntax for complex analytics operations. Advanced dashboard creation, customer segmentation, and predictive analytics are seamlessly handled through familiar SQL constructs, making sophisticated business intelligence accessible to SQL-oriented development teams without requiring deep MongoDB aggregation expertise.

The combination of MongoDB's robust aggregation capabilities with SQL-style analytics operations makes it an ideal platform for applications requiring both real-time operational data processing and familiar business intelligence patterns, ensuring your analytics solutions can deliver immediate insights while maintaining development team productivity and system performance.